How to use User Agents while scraping data
User Agents
As I mentioned in the previous post, a browser's User Agent is a string that identifies which browser is being used, its version, and the operating system it runs on. If you scrape data intensively with the same User Agent, a website's defensive bots may detect the repeated pattern and block you from scraping.
Let us consider the code below:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
from pandas import ExcelWriter
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
url1 = "https://www.espn.in/football/team/stats/_/id/364/season/2018"
url2 = "https://www.espn.in/football/team/stats/_/id/364/league/ENG.1/season/2018/view/discipline"
req1 = requests.get(url1, headers=headers)
req2 = requests.get(url2, headers=headers)
print(req1.status_code)  # Tells you the status code, i.e. whether the page has been downloaded or not
print(req2.status_code)
page_html1 = req1.text
page_html2 = req2.text
page_soup1 = soup(page_html1, "html.parser")
page_soup2 = soup(page_html2, "html.parser")
table1 = page_soup1.find_all("div", {"class": "InnerLayout__child flex"})
print(table1)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
As you can see in the code above, I have set a different User Agent in the headers and used it to fetch the pages; the website's defense code sees the request as coming from a different browser and lets me scrape the information. It is therefore a good idea to change the User Agent frequently when you scrape the same website again and again. A status code of 200 means the page was downloaded successfully; anything else (for example 403 Forbidden or 404 Not Found) means the request failed.
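As a small sketch of that check (the helper function here is my own illustration, not part of the original code), you can branch on the status code before parsing the HTML:

```python
def page_downloaded(status_code):
    """Return True when the server delivered the page (any 2xx code)."""
    return 200 <= status_code < 300

# 200 OK: the HTML arrived and is safe to hand to BeautifulSoup
print(page_downloaded(200))  # True
# 403/404: the request was blocked or the page is missing
print(page_downloaded(404))  # False
```

In the scraper above this would be `if page_downloaded(req1.status_code): ...` before building the soup.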
To help you out, I have collected these useful User Agents below:
1. Chrome OS-based laptop using Chrome browser
Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36
2. Linux-based PC using a Firefox browser
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1
3. Windows 10-based PC using Edge browser
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
4. Apple iPhone User Agent
Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1
5. Samsung Galaxy S9 (Android-based)
Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36
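One way to put the list above to work (a minimal sketch of my own, not from the post) is to pick a User Agent at random for each request:

```python
import random

# The five User Agent strings listed above
user_agents = [
    "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36",
]

def random_headers():
    """Build a headers dict with a randomly chosen User Agent."""
    return {"User-Agent": random.choice(user_agents)}

# Each request then presents a (potentially) different identity, e.g.:
# requests.get(url1, headers=random_headers())
print(random_headers()["User-Agent"] in user_agents)  # True
```

This keeps successive requests from all carrying the same User Agent string, which is exactly the pattern the defensive bots look for.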
So have a go at any of these and you are good to go. Cheers!!!