A crawler proxy is a technique that gives crawler programs better access to the web. A proxy IP pool supplies a crawler with multiple IP addresses, which speeds up data collection and also helps avoid being blocked for requesting too frequently. This article is a detailed tutorial on obtaining and using proxy IPs for crawling.
Step 1: Obtain the proxy IPs
First we need to find a usable source of proxy IPs. Here we take a proxy provider as an example: it offers both paid proxies and ordinary free proxy IPs (1,000 free IPs can be collected per day after registering), which is very convenient to use. After registration is completed, claim the free package and obtain the API link address.
By requesting this API interface, we can get a page of proxy IP information, including IP addresses and port numbers. We can fetch the data returned by the API with the get method of the requests library, as shown in the following example:
import requests

url = ''  # the API link address obtained from the proxy provider
response = requests.get(url)
print(response.text)

After the above code is executed, we can see the obtained proxy IP information. But we still need to parse the return value to extract only the useful IP addresses and ports.
import requests
from bs4 import BeautifulSoup

url = ''  # the API link address obtained from the proxy provider
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
proxies = []
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    proxy = tds[0].text + ':' + tds[1].text
    proxies.append(proxy)
print(proxies)

In the code above, we use the BeautifulSoup library to parse the returned HTML text, find all the <tr> tags, and then loop through each of them to extract the IP address and port, saving them into a list.
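Note that when a proxy is actually used later on, requests expects the proxies argument to be a dict that maps a scheme to a proxy address. Below is a minimal sketch of converting one entry of the list above into that format; the address shown is a hypothetical example, assuming a plain HTTP proxy:

proxy = '1.2.3.4:8080'  # hypothetical entry taken from the proxies list
proxies_dict = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
print(proxies_dict)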
Step 2: Verify the proxy IPs
Once we have the proxy IPs, we need to check whether they are available. Here we test them with the get method of the requests library: if a status code of 200 is returned, the proxy IP is available. The proxy IP is applied by passing the proxies parameter to requests.get, as shown in the following example:
import requests

url = ''  # a page used to test the proxies
proxies = []  # the list of 'ip:port' strings obtained in the previous step

for proxy in proxies:
    # Build the proxies dict that requests expects.
    proxies_dict = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    try:
        response = requests.get(url, proxies=proxies_dict, timeout=10)
        if response.status_code == 200:
            print('Proxy IP available:', proxies_dict)
    except requests.RequestException:
        print('Proxy IP not available:', proxies_dict)

In the loop above, we iterate through all the proxy IPs and validate each of them. If a proxy IP is available, it is printed; otherwise the unavailable message is output.
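Beyond checking the status code, one way to confirm that traffic really goes out through the proxy is to request an IP echo service such as httpbin.org/ip and look at the returned address. This is an optional extra check rather than part of the original steps; the proxy address below is a hypothetical example:

import requests

proxies_dict = {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'}  # hypothetical proxy
try:
    # httpbin.org/ip returns the IP address the request appears to come from;
    # if the proxy is working, this should be the proxy's IP rather than our own.
    response = requests.get('http://httpbin.org/ip', proxies=proxies_dict, timeout=10)
    print(response.json())
except requests.RequestException:
    print('Proxy IP not available:', proxies_dict)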
Step 3: Test the proxy IPs
Once we have working proxy IPs, we need to test them further before crawling to make sure they are actually usable. We can test them against a common search engine such as 360 Search. Here we take it as an example and put the whole workflow together to check whether a proxy IP really works.
import requests
from bs4 import BeautifulSoup

# 1.Get the list of proxy IPs.
def get_proxy_list():
    # Construct a request header to simulate a browser request.
    headers = {"User-Agent": "Mozilla/5.0"}
    # Request the proxy IP web page.
    url = ""  # the proxy provider's API or list page
    response = requests.get(url, headers=headers)
    # Parse the web page to get the list of proxy IPs.
    soup = BeautifulSoup(response.text, "html.parser")
    proxy_list = []
    table = soup.find("table")
    for tr in table.find_all("tr"):
        td_list = tr.find_all("td")
        if len(td_list) > 0:
            ip = td_list[1].text.strip()
            port = td_list[2].text.strip()
            proxy_type = td_list[5].text.strip().lower()
            # Build a proxies dict that requests can use directly.
            proxy_list.append({proxy_type: proxy_type + "://" + ip + ":" + port})
    return proxy_list

# 2.Verify IP availability.
def verify_proxy(proxy):
    # Construct a request header to simulate a browser request.
    headers = {"User-Agent": "Mozilla/5.0"}
    # Request the landing page and check the response code.
    url = ""  # the test page, e.g. a search engine homepage
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
        if response.status_code == 200:
            return True
        else:
            return False
    except requests.RequestException:
        return False

# 3.Test the availability of the IP list.
def test_proxy_list(proxy_list):
    valid_proxy_list = []
    for proxy in proxy_list:
        if verify_proxy(proxy):
            valid_proxy_list.append(proxy)
    return valid_proxy_list

# 4.Use the proxy IP to send the request.
def send_request(url, headers, proxy):
    # Send a request through the proxy and return the response text.
    response = requests.get(url, headers=headers, proxies=proxy)
    return response.text

# Program entrance.
if __name__ == "__main__":
    # Get the list of proxy IPs.
    proxy_list = get_proxy_list()
    # Verify IP availability.
    valid_proxy_list = test_proxy_list(proxy_list)
    # Output the available proxy IPs.
    print("Valid proxy IP list:")
    for proxy in valid_proxy_list:
        print(proxy)
    # Use a proxy IP to send the request.
    url = ""  # the target page to crawl
    headers = {"User-Agent": "Mozilla/5.0"}
    proxy = valid_proxy_list[0]  # select one available proxy IP
    response = send_request(url, headers, proxy)
    print(response)

In the code above, we first get the proxy IP addresses through the free proxy API. We then verify each IP to determine whether it is available and store the available IPs in a list. Finally, we select an available IP and use it to send the request.
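Since the point of having multiple proxy IPs is to spread requests across addresses, a natural extension is to rotate through the validated list instead of always using the first entry. The following is only a sketch of that idea, not part of the original script; it assumes valid_proxy_list holds requests-style proxies dicts like those built above:

import random
import requests

def send_request_rotating(url, headers, valid_proxy_list):
    # Pick a random proxy from the validated list for each request,
    # so that no single IP address is used for every request.
    proxy = random.choice(valid_proxy_list)
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    return response.text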
Summary
This article has introduced the basic concept of proxy IPs, how to obtain proxy IPs for free, how to verify and use them in Python with examples, and the precautions for using them. I hope it is helpful to crawler developers.