A crawler proxy is a technique that gives crawler programs better access to the web. A proxy IP pool supplies a crawler with multiple IP addresses, which speeds up data collection and also helps avoid being blocked for requesting too frequently. This article is a detailed tutorial on obtaining and using proxy IPs for crawling.
Step 1: Obtain the proxy IPs
First we need to find a usable source of proxy IPs. Here we take a proxy provider as an example: it offers both paid proxies and ordinary free proxy IPs (1,000 free IPs can be collected per day after registering), which is very convenient to use. After registration is completed, claim the free package and obtain the API link address.
By requesting this API interface, we can get a page of proxy IP information, including IP addresses and port numbers. We can fetch the data returned by the API with the get method of the requests library, as shown in the following example:
import requests

url = ''  # the API link address obtained from the proxy provider
response = requests.get(url)
print(response.text)

After the above code is executed, we can see the obtained proxy IP information. But we still need to parse the return value to extract only the useful IP addresses and ports.
import requests
from bs4 import BeautifulSoup

url = ''  # the API link address obtained from the proxy provider
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
proxies = []
for tr in soup.find_all('tr')[1:]:
    tds = tr.find_all('td')
    proxy = tds[0].text + ':' + tds[1].text
    proxies.append(proxy)
print(proxies)

In the code above, we use the BeautifulSoup library to parse the returned HTML text, find all the <tr> tags, and then loop through each of them to extract the IP address and port, saving them into a list.
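Note that when a proxy is actually used later on, requests expects the proxies argument to be a dict that maps a scheme to a proxy address. Below is a minimal sketch of converting one entry of the list above into that format; the address shown is a hypothetical example, assuming a plain HTTP proxy:

proxy = '1.2.3.4:8080'  # hypothetical entry taken from the proxies list
proxies_dict = {
    'http': 'http://' + proxy,
    'https': 'http://' + proxy,
}
print(proxies_dict)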
Step 2: Verify the proxy IPs
Once we have the proxy IPs, we need to check whether they are available. Here we test them with the get method of the requests library: if a status code of 200 is returned, the proxy IP is available. The proxy IP is applied by passing the proxies parameter to requests.get, as shown in the following example:
import requests

url = ''  # a page used to test the proxies
proxies = []  # the list of 'ip:port' strings obtained in the previous step

for proxy in proxies:
    # Build the proxies dict that requests expects.
    proxies_dict = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    try:
        response = requests.get(url, proxies=proxies_dict, timeout=10)
        if response.status_code == 200:
            print('Proxy IP available:', proxies_dict)
    except requests.RequestException:
        print('Proxy IP not available:', proxies_dict)

In the loop above, we iterate through all the proxy IPs and validate each of them. If a proxy IP is available, it is printed; otherwise the unavailable message is output.
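Beyond checking the status code, one way to confirm that traffic really goes out through the proxy is to request an IP echo service such as httpbin.org/ip and look at the returned address. This is an optional extra check rather than part of the original steps; the proxy address below is a hypothetical example:

import requests

proxies_dict = {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'}  # hypothetical proxy
try:
    # httpbin.org/ip returns the IP address the request appears to come from;
    # if the proxy is working, this should be the proxy's IP rather than our own.
    response = requests.get('http://httpbin.org/ip', proxies=proxies_dict, timeout=10)
    print(response.json())
except requests.RequestException:
    print('Proxy IP not available:', proxies_dict)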
Step 3: Test the proxy IPs
Once we have working proxy IPs, we need to test them further before crawling to make sure they are actually usable. We can test them against a common search engine such as 360 Search. Here we take it as an example and put the whole workflow together to check whether a proxy IP really works.
import requests
from bs4 import BeautifulSoup

# 1.Get the list of proxy IPs.
def get_proxy_list():
    # Construct a request header to simulate a browser request.
    headers = {"User-Agent": "Mozilla/5.0"}
    # Request the proxy IP web page.
    url = ""  # the proxy provider's API or list page
    response = requests.get(url, headers=headers)
    # Parse the web page to get the list of proxy IPs.
    soup = BeautifulSoup(response.text, "html.parser")
    proxy_list = []
    table = soup.find("table")
    for tr in table.find_all("tr"):
        td_list = tr.find_all("td")
        if len(td_list) > 0:
            ip = td_list[1].text.strip()
            port = td_list[2].text.strip()
            proxy_type = td_list[5].text.strip().lower()
            # Build a proxies dict that requests can use directly.
            proxy_list.append({proxy_type: proxy_type + "://" + ip + ":" + port})
    return proxy_list

# 2.Verify IP availability.
def verify_proxy(proxy):
    # Construct a request header to simulate a browser request.
    headers = {"User-Agent": "Mozilla/5.0"}
    # Request the landing page and check the response code.
    url = ""  # the test page, e.g. a search engine homepage
    try:
        response = requests.get(url, headers=headers, proxies=proxy, timeout=5)
        if response.status_code == 200:
            return True
        else:
            return False
    except requests.RequestException:
        return False

# 3.Test the availability of the IP list.
def test_proxy_list(proxy_list):
    valid_proxy_list = []
    for proxy in proxy_list:
        if verify_proxy(proxy):
            valid_proxy_list.append(proxy)
    return valid_proxy_list

# 4.Use the proxy IP to send the request.
def send_request(url, headers, proxy):
    # Send a request through the proxy and return the response text.
    response = requests.get(url, headers=headers, proxies=proxy)
    return response.text

# Program entrance.
if __name__ == "__main__":
    # Get the list of proxy IPs.
    proxy_list = get_proxy_list()
    # Verify IP availability.
    valid_proxy_list = test_proxy_list(proxy_list)
    # Output the available proxy IPs.
    print("Valid proxy IP list:")
    for proxy in valid_proxy_list:
        print(proxy)
    # Use a proxy IP to send the request.
    url = ""  # the target page to crawl
    headers = {"User-Agent": "Mozilla/5.0"}
    proxy = valid_proxy_list[0]  # select one available proxy IP
    response = send_request(url, headers, proxy)
    print(response)

In the code above, we first get the proxy IP addresses through the free proxy API. We then verify each IP to determine whether it is available and store the available IPs in a list. Finally, we select an available IP and use it to send the request.
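Since the point of having multiple proxy IPs is to spread requests across addresses, a natural extension is to rotate through the validated list instead of always using the first entry. The following is only a sketch of that idea, not part of the original script; it assumes valid_proxy_list holds requests-style proxies dicts like those built above:

import random
import requests

def send_request_rotating(url, headers, valid_proxy_list):
    # Pick a random proxy from the validated list for each request,
    # so that no single IP address is used for every request.
    proxy = random.choice(valid_proxy_list)
    response = requests.get(url, headers=headers, proxies=proxy, timeout=10)
    return response.text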
Summary
This article has introduced the basic concept of proxy IPs, how to obtain proxy IPs for free, how to verify and use them in Python with examples, and the precautions for using them. I hope it is helpful to crawler developers.