How to use User Agents while scraping data
User Agents
As I mentioned in the previous post, a browser's User Agent is a string that identifies which browser is being used, its version, and the operating system it runs on. If you scrape data intensively with the same User Agent, a website's defensive bots may detect the repeated pattern and block you from scraping.
Let us consider the code below:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
from pandas import ExcelWriter
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
url1 = "https://www.espn.in/football/team/stats/_/id/364/season/2018"
url2 = "https://www.espn.in/football/team/stats/_/id/364/league/ENG.1/season/2018/view/discipline"
req1 = requests.get(url1, headers=headers)
req2 = requests.get(url2, headers=headers)
print(req1.status_code)  # Tells you the status code, i.e. whether the page has been downloaded or not
print(req2.status_code)
page_html1 = req1.text
page_html2 = req2.text
page_soup1 = soup(page_html1, "html.parser")
page_soup2 = soup(page_html2, "html.parser")
table1 = page_soup1.find_all("div", {"class": "InnerLayout__child flex"})
print(table1)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
As you can see in the code above, I have set a different User Agent in the headers and used it to fetch the pages; the website's defense code sees the request as coming from a different browser and lets me scrape the information. It is therefore a good idea to change the User Agent frequently when you scrape the same website again and again. A status code of 200 means the page was downloaded successfully; anything else (for example 403 Forbidden or 404 Not Found) means the request failed.
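As a small sketch of that check (the helper function here is my own illustration, not part of the original code), you can branch on the status code before parsing the HTML:

```python
def page_downloaded(status_code):
    """Return True when the server delivered the page (any 2xx code)."""
    return 200 <= status_code < 300

# 200 OK: the HTML arrived and is safe to hand to BeautifulSoup
print(page_downloaded(200))  # True
# 403/404: the request was blocked or the page is missing
print(page_downloaded(404))  # False
```

In the scraper above this would be `if page_downloaded(req1.status_code): ...` before building the soup.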
To help you out, I have collected these useful User Agents below:
1. Chrome OS-based laptop using Chrome browser
Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36
2. Linux-based PC using a Firefox browser
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1
3. Windows 10-based PC using Edge browser
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246
4. Apple iPhone User Agent
Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1
5. Samsung Galaxy S9 (Android-based)
Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36
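One way to put the list above to work (a minimal sketch of my own, not from the post) is to pick a User Agent at random for each request:

```python
import random

# The five User Agent strings listed above
user_agents = [
    "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36",
]

def random_headers():
    """Build a headers dict with a randomly chosen User Agent."""
    return {"User-Agent": random.choice(user_agents)}

# Each request then presents a (potentially) different identity, e.g.:
# requests.get(url1, headers=random_headers())
print(random_headers()["User-Agent"] in user_agents)  # True
```

This keeps successive requests from all carrying the same User Agent string, which is exactly the pattern the defensive bots look for.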
So have a go at any of these and you are good to go. Cheers!!!