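Writing Your Own PHP Proxy IP Crawler, Step by Step

Knowledge 09-29

Now we move on to writing the crawler's core code. First, let's look at the overall directory layout.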

config.php: loads our configuration files
ProxyPool.php: the crawler's core logic
Queue.php: queue operations
Requests.php: sends the HTTP requests
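
The article calls config("spider.page_num") and config("database.redis_host") but never shows config.php itself. The following is only a minimal sketch of how such a dot-notation helper could work. The global array, the Redis host and port, and the examine_round value are assumptions made for illustration; only the key names and the page count of 3 come from the article.

```php
<?php
// Hypothetical stand-in for config.php; the real file is not shown in the article.
// Key names follow the config() calls used later; most values are assumed.
$GLOBALS['proxy_pool_config'] = [
    'database' => ['redis_host' => '127.0.0.1', 'redis_port' => 6379], // assumed values
    'spider'   => ['page_num' => 3, 'examine_round' => 2],             // examine_round is assumed
];

// Resolve a dot-separated key such as "spider.page_num" by walking the nested array.
function config(string $key, $default = null)
{
    $value = $GLOBALS['proxy_pool_config'];
    foreach (explode('.', $key) as $segment) {
        if (!is_array($value) || !array_key_exists($segment, $value)) {
            return $default; // unknown key: fall back to the default
        }
        $value = $value[$segment];
    }
    return $value;
}

echo config('spider.page_num'), PHP_EOL;
```

Returning a default for unknown keys keeps a typo in a key name from raising a notice, though a stricter helper might prefer to throw.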

Then let's recall the code of the entry file:

<?php
require_once __DIR__ . "/autoloader.php";
require_once __DIR__ . "/vendor/autoload.php";

use ProxyPool\core\ProxyPool;

$proxy = new ProxyPool();
$proxy->run();

As you can see, the entry file calls the run method of ProxyPool in core, so let's look at ProxyPool first:

<?php
// Inferred: the entry file imports this class as ProxyPool\core\ProxyPool,
// so the file is assumed to declare that namespace.
namespace ProxyPool\core;

use ProxyPool\core\Requests; // HTTP request file
use ProxyPool\core\Queue;    // queue operation file

class ProxyPool
{
    private $redis;
    private $httpClient;
    private $queueObj;

    function __construct()
    {
        // Leading backslash: Redis is the global class from the phpredis extension
        $redis = new \Redis();
        $redis->connect(config("database.redis_host"), config("database.redis_port"));
        $this->redis = $redis;
        $this->httpClient = new Requests(["timeout" => 10]);
        $this->queueObj = new Queue();
    }

    public function run()
    {
        echo "start to spider ip...." . PHP_EOL;
        $ip_arr = $this->get_ip(); // crawl the candidate IPs
        echo "select IP num: " . count($ip_arr) . PHP_EOL;

        echo "start to check ip...." . PHP_EOL;
        $this->check_ip($ip_arr); // verify that each IP is usable
        $ip_pool = $this->redis->smembers("ip_pool"); // read the IPs stored in Redis
        echo "end check ip...." . PHP_EOL;

        print_r($ip_pool); // print the IP array
        die;
    }
}

The get_ip method crawls IPs from two websites:

// Collect proxy IPs from the supported sites
private function get_ip()
{
    $ip_arr = [];
    $ip_arr = $this->get_xici_ip($ip_arr);      // Xici proxy (西刺代理)
    $ip_arr = $this->get_kuaidaili_ip($ip_arr); // Kuaidaili (快代理)
    return $ip_arr;
}

Let's start with the crawl of the Xici proxy site:

private function get_xici_ip($ip_arr)
{
    for ($i = 1; $i <= config("spider.page_num"); $i++)
    {
        list($infoRes, $msg) = $this->httpClient->request("GET", "http://www.xicidaili.com/nn/" . $i, []);
        if (!$infoRes)
        {
            print_r($msg); // print the error message
            exit();
        }
        // Cast to string in case getBody() returns a stream object
        $infoContent = (string) $infoRes->getBody();
        $this->convert_encoding($infoContent);
        // Single-quoted so the backslashes and the quotes around "country" reach the regex engine intact
        preg_match_all('/<tr.*>[\s\S]*?<td class="country">[\s\S]*?<\/td>[\s\S]*?<td>(.*?)<\/td>[\s\S]*?<td>(.*?)<\/td>/', $infoContent, $match);

        $host_arr = $match[1]; // first capture group: IP address
        $port_arr = $match[2]; // second capture group: port number
        foreach ($host_arr as $key => $value)
        {
            $ip_arr[] = $host_arr[$key] . ":" . $port_arr[$key];
        }
    }
    return $ip_arr;
}

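In this method we first call config("spider.page_num") to read the number of pages to crawl from the configuration file; I set it to 3. If we open the Xici proxy site, we find that the URL has the form

http://www.xicidaili.com/nn/XX where XX is the page number: 1 for the first page, 2 for the second, and so on.

So the code requests the site three times in a loop, takes the returned page content, and matches the HTML with a regular expression to extract the addresses and port numbers (you can find the specific HTML elements by right-clicking the page and choosing Inspect Element).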

preg_match_all('/<tr.*>[\s\S]*?<td class="country">[\s\S]*?<\/td>[\s\S]*?<td>(.*?)<\/td>[\s\S]*?<td>(.*?)<\/td>/', $infoContent, $match);

After a little processing, the collected IPs are returned. That is all get_xici_ip does: it is responsible for crawling the IPs.
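
To see what the regular expression captures, it can help to run it against a small piece of HTML. The sample rows below are hypothetical and only modelled on the structure the pattern expects (a "country" cell followed by the address and port cells); the real page markup may differ. The function name extract_proxies is likewise made up for this sketch.

```php
<?php
// Hypothetical sample rows using documentation-only IP addresses.
$html = <<<'HTML'
<tr class="odd">
  <td class="country"><img src="flag.png" alt="Cn" /></td>
  <td>192.0.2.10</td>
  <td>8080</td>
</tr>
<tr class="">
  <td class="country"><img src="flag.png" alt="Cn" /></td>
  <td>198.51.100.7</td>
  <td>3128</td>
</tr>
HTML;

// Apply the article's pattern and join each captured host with its port.
function extract_proxies(string $html): array
{
    $pattern = '/<tr.*>[\s\S]*?<td class="country">[\s\S]*?<\/td>'
             . '[\s\S]*?<td>(.*?)<\/td>[\s\S]*?<td>(.*?)<\/td>/';
    preg_match_all($pattern, $html, $match);

    $result = [];
    foreach ($match[1] as $key => $host) {
        $result[] = $host . ':' . $match[2][$key];
    }
    return $result;
}

print_r(extract_proxies($html));
```

For this sample the result contains 192.0.2.10:8080 and 198.51.100.7:3128. Regex scraping like this breaks as soon as the site changes its markup, so the pattern needs rechecking against the live page.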

Next, let's look at the check_ip method:

// Check whether each IP is usable
private function check_ip($ip_arr)
{
    $this->queueObj = $this->queueObj->arr2queue($ip_arr);
    $queue = $this->queueObj->getQueue();
    foreach ($queue as $key => $value)
    {
        // Defined up front so the check below is safe even if examine_round is 0
        $response = null;
        // Test the IP against Baidu and Tencent (qq.com)
        for ($i = 0; $i < config("spider.examine_round"); $i++)
        {
            $response = $this->httpClient->test_request("GET", "https://www.baidu.com", ["proxy" => "https://" . $value]);
            if (!$response)
            {
                // No response over HTTPS: fall back to the HTTP target
                $response = $this->httpClient->test_request("GET", "http://www.qq.com", ["proxy" => "http://" . $value]);
                if ($response && $response->getStatusCode() == 200)
                {
                    break;
                }
            }
            else if ($response->getStatusCode() == 200)
            {
                break;
            }
        }
        // Store the result in Redis
        if ($response && $response->getStatusCode() == 200)
        {
            $this->set_ip2redis($value);
        }
        else
        {
            echo $value . " error... " . PHP_EOL;
        }
    }
}

Here we test against Baidu over HTTPS and qq.com over HTTP; if a request succeeds, the IP is inserted into Redis.
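
The retry and fallback flow of check_ip can be isolated from the network for experimentation. The sketch below mirrors that flow under one assumption: $probe is a hypothetical callable standing in for the HTTP client, taking a target URL and a proxy URL and returning a status code, or null when there is no response. The function name is_proxy_usable is also made up here.

```php
<?php
// Returns true as soon as one round yields HTTP 200 through the proxy.
function is_proxy_usable(string $proxy, int $rounds, callable $probe): bool
{
    for ($i = 0; $i < $rounds; $i++) {
        // Try the HTTPS target first
        $status = $probe('https://www.baidu.com', 'https://' . $proxy);
        if ($status === null) {
            // No response at all: fall back to the HTTP target
            $status = $probe('http://www.qq.com', 'http://' . $proxy);
        }
        if ($status === 200) {
            return true;
        }
    }
    return false;
}

// Stub probe: HTTPS never answers, HTTP answers only from the fourth call on.
$calls = 0;
$flaky = function (string $url, string $proxyUrl) use (&$calls) {
    $calls++;
    if (strpos($url, 'https://') === 0) {
        return null;
    }
    return $calls >= 4 ? 200 : null;
};

var_dump(is_proxy_usable('192.0.2.10:8080', 2, $flaky));
```

With two rounds the stub succeeds on the second round's HTTP fallback, so the call returns true; with a single round it would return false.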

With that, we can crawl IPs and verify that they are usable.
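
The set_ip2redis method is called in check_ip but not shown in this article. Since run() reads the pool back with smembers("ip_pool"), it presumably adds each IP to a Redis set. The following is a guess at its shape rather than the author's code: set_ip2redis_sketch and FakeRedisSet are hypothetical names, and the in-memory class exists only so the example runs without a Redis server. With phpredis, a connected Redis instance offers the same sAdd and sMembers calls.

```php
<?php
// Hypothetical sketch: add a verified proxy to the "ip_pool" set.
// $redis may be a connected phpredis instance or any object exposing sAdd().
function set_ip2redis_sketch($redis, string $ip): void
{
    // A set keeps each member once, so re-adding a known IP changes nothing.
    $redis->sAdd('ip_pool', $ip);
}

// In-memory stand-in for illustration only; not part of the article's project.
class FakeRedisSet
{
    private $sets = [];

    public function sAdd(string $key, string $member): int
    {
        $isNew = !isset($this->sets[$key][$member]);
        $this->sets[$key][$member] = true;
        return $isNew ? 1 : 0; // number of members actually added
    }

    public function sMembers(string $key): array
    {
        return array_keys($this->sets[$key] ?? []);
    }
}

$store = new FakeRedisSet();
set_ip2redis_sketch($store, '192.0.2.10:8080');
set_ip2redis_sketch($store, '192.0.2.10:8080'); // duplicate is ignored
set_ip2redis_sketch($store, '198.51.100.7:3128');
print_r($store->sMembers('ip_pool'));
```

Using a set means repeated crawler runs do not pile up duplicate entries for the same proxy.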

TAG: 程序員小新人學習