Scraping and Using Proxy IPs (with Automatic Retry) - PySuper

Brief Analysis

Decide which websites to scrape proxy IPs from
Store each IP as {'http': '1x.1x.1x.1x:1x'}
Verify that the IP works: timeout, status_code
Pick one at random: random.choice
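
The last two points can be sketched in a few lines. This is a minimal sketch rather than the author's code: `is_usable` and `pick_proxy` are hypothetical helper names, and the HTTP call is passed in as `fetch` (for example `requests.get`) so the logic can be read and exercised without network access.

```python
import random


def is_usable(proxy, fetch, timeout=3):
    """Return True if a request through `proxy` answers 200 within `timeout` seconds.

    `proxy` is a dict such as {'http': '1x.1x.1x.1x:1x'}; `fetch` is any callable
    that accepts the same arguments as requests.get (an assumption of this sketch).
    """
    scheme = next(iter(proxy))  # the dict key, 'http' or 'https'
    try:
        # The test URL uses the proxy's own scheme so the proxy is actually exercised
        response = fetch(f"{scheme}://www.baidu.com/", proxies=proxy, timeout=timeout)
    except OSError:  # requests.exceptions.RequestException is a subclass of OSError
        return False
    return response.status_code == 200


def pick_proxy(proxies):
    """Pick one proxy dict at random from a non-empty list."""
    return random.choice(proxies)
```

With the requests library installed, a call would look like `is_usable({'http': '1x.1x.1x.1x:1x'}, requests.get)`.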

Code


import requests
from lxml import etree


class DaiLiIP:
    """Ideally this would be rewritten to run asynchronously."""

    def __init__(self):
        """Initialize: URL templates of the sites to scrape."""
        self.yundaili = "http://www.ip3366.net/free/?stype=1&page={}"
        self.kuaidaili = "https://www.kuaidaili.com/free/inha/{}/"
        self.bajiu_url = "https://www.89ip.cn/index_{}.html"

    def get_url_list(self):
        """Build the list of page URLs to scrape."""
        return [self.yundaili.format(page) for page in range(1, 8)] + \
               [self.kuaidaili.format(page) for page in range(1, 1000)] + \
               [self.bajiu_url.format(page) for page in range(1, 9)]

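    def get_html(self, url, proxy=None):
        """Fetch the URL and return the page text, or None on failure."""
        try:
            # Without a timeout a dead site can hang the crawl (5 s is an arbitrary choice)
            response = requests.get(url, proxies=proxy, timeout=5)
        except requests.exceptions.RequestException:
            return None  # network error or timeout
        if response.status_code == 200:
            response.encoding = response.apparent_encoding
            return response.text
        return None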

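    def parse_ip(self, html):
        """Parse proxies from a page; different sites need different parsing."""
        if not html:
            return
        x_html = etree.HTML(html)
        # First layout: a table inside div#list, with the protocol in td[4]
        items = x_html.xpath("//div[@id='list']//tbody/tr")
        for item in items:
            cells = [item.xpath(f"td[{n}]/text()") for n in (1, 2, 4)]
            if all(cells):  # skip rows with missing cells
                ip, port, scheme = (cell[0].strip() for cell in cells)
                yield {scheme.lower(): f"{ip}:{port}"}
        if items:
            return
        # An empty XPath result raises no exception, so fall back explicitly
        # to the layui-table layout, which has no protocol column
        for item in x_html.xpath("//table[@class='layui-table']/tbody/tr"):
            cells = [item.xpath(f"td[{n}]/text()") for n in (1, 2)]
            if all(cells):
                ip, port = (cell[0].strip() for cell in cells)
                yield {"http": f"{ip}:{port}"}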

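    def ip_list(self):
        """Return the list of working proxies directly; callers can then pick from it at random."""
        ip_list = []
        for url in self.get_url_list():
            html = self.get_html(url)
            if not html:
                continue  # skip pages that failed to download
            for ip in self.parse_ip(html):
                scheme = next(iter(ip))  # 'http' or 'https'
                try:
                    # The dict must go in proxies= (the second positional argument is params),
                    # and the test URL must use the proxy's scheme or the proxy is bypassed
                    ok = requests.get(f"{scheme}://www.baidu.com/", proxies=ip, timeout=3).status_code == 200
                except requests.exceptions.RequestException:
                    ok = False  # drop a dead proxy instead of restarting the whole crawl
                if ok:
                    ip_list.append(ip)
        return ip_list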
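
The class docstring suggests making this asynchronous, since checking proxies one at a time is slow. One possible approach, sketched here with a thread pool rather than asyncio, validates many proxies at once. `filter_usable` and its `check` parameter are hypothetical names that are not part of the class above; `check` is any function that takes a proxy dict and returns True or False.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_usable(proxies, check, max_workers=20):
    """Run `check` on every proxy in parallel and keep those that pass.

    The order of the input is preserved in the result.
    """
    proxies = list(proxies)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))  # one True/False per proxy
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

Because each check spends most of its time waiting on the network, threads are usually enough here; the worker count of 20 is an arbitrary starting point.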

Using the Proxies

Disguise the IP address your requests come from
A random request header is used here as well: from fake_useragent import UserAgent

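import requests
from random import choice
from fake_useragent import UserAgent

def get_html(self, url):
    """Retry automatically, up to three attempts."""
    for _ in range(3):
        try:
            response = requests.get(
                url,
                # self.proxy is assumed to be a list of proxy dicts, e.g. from DaiLiIP().ip_list()
                proxies=choice(self.proxy),
                headers={"User-Agent": UserAgent().random},
                timeout=2,
            )
            return response.text
        except requests.exceptions.RequestException:
            continue  # try again with another randomly chosen proxy
    return None  # all three attempts failed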

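Use Cases

Building Your Own Proxy Pool

How a proxy pool works

Weight each IP by its response speed and its number of failures
Then simply take the IP with the highest weight
Store the IPs in Redis and fetch them by reading from Redis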
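
The weighting idea can be sketched with an in-memory dictionary. The scoring rule below (subtract the response time in seconds on success, subtract a fixed penalty on failure) is an assumption chosen for illustration; the text above does not give a formula. `ProxyPool`, `add`, `report`, and `best` are hypothetical names, and the Redis storage is left out so the sketch runs on its own.

```python
class ProxyPool:
    """Toy proxy pool that ranks proxies by a score (higher is better)."""

    START_SCORE = 100.0      # assumed initial weight
    FAILURE_PENALTY = 10.0   # assumed cost of one failed request

    def __init__(self):
        self.scores = {}  # "ip:port" -> score

    def add(self, address):
        """Register a proxy; an already known proxy keeps its score."""
        self.scores.setdefault(address, self.START_SCORE)

    def report(self, address, elapsed=None):
        """Record one request: `elapsed` seconds on success, None on failure."""
        penalty = self.FAILURE_PENALTY if elapsed is None else elapsed
        self.scores[address] = self.scores.get(address, self.START_SCORE) - penalty

    def best(self):
        """Return the address with the highest score, or None if the pool is empty."""
        if not self.scores:
            return None
        return max(self.scores, key=self.scores.get)
```

A caller would `report` after every request and ask for `best()` before the next one, so slow or failing proxies sink in the ranking over time.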

Returning Proxies over the Web

Write a simple web application
Have the application return a random IP
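
A web application of this kind could look like the following sketch, which uses only Python's standard library. The text does not name a framework, so the choice of `http.server` is an assumption; `PROXIES`, `random_proxy_body`, `ProxyHandler`, and `make_server` are hypothetical names, and the addresses are placeholders.

```python
import json
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080"]  # placeholder addresses


def random_proxy_body(proxies):
    """Return a JSON document (as bytes) holding one randomly chosen proxy."""
    return json.dumps({"proxy": random.choice(proxies)}).encode("utf-8")


class ProxyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every GET request, whatever the path, gets one random proxy
        body = random_proxy_body(PROXIES)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8000):
    """Build (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), ProxyHandler)
```

Calling `make_server().serve_forever()` starts the service; a GET request to http://127.0.0.1:8000/ then returns one proxy per call.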
