How to use User Agents while scraping data


                                     User Agents


       As I mentioned in the previous post, a browser's user agent is a string that identifies
       which browser is being used, its version, and the operating system it is running on.
       If you scrape a site intensively with the same User Agent every time, the site's
       anti-bot defenses may notice the pattern and block you from scraping data.



    Let us consider the code below:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
from bs4 import BeautifulSoup as soup
import requests
import pandas as pd
from pandas import ExcelWriter

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
url1 = "https://www.espn.in/football/team/stats/_/id/364/season/2018"
url2 = "https://www.espn.in/football/team/stats/_/id/364/league/ENG.1/season/2018/view/discipline"
req1 = requests.get(url1, headers=headers)
req2 = requests.get(url2, headers=headers)

# The status code tells you whether the page was downloaded (200 means success)
print(req1.status_code)
print(req2.status_code)


page_html1 = req1.text
page_html2 = req2.text

page_soup1 = soup(page_html1, "html.parser")
page_soup2 = soup(page_html2, "html.parser")

# find_all is the current name for findAll; both return a list of matching tags
table1 = page_soup1.find_all("div", {"class": "InnerLayout__child flex"})
print(table1)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

  As you can see in the code above, I have set a different User Agent in headers and used
  it to fetch the page. The website's defensive code is then more likely to treat me as a
  different user and allow me to scrape the information. So it is a good idea to change
  the User Agent frequently if you scrape the same website again and again. The status
  code should be 200 when the page downloads successfully. Any other code signals a
  problem, for example 404 if the page was not found or 403 if the request was refused.
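
If you want something a little friendlier than a bare number, here is a small sketch for reading the status code. The function name describe_status and the labels it returns are my own choices for illustration, not part of requests, and the grouping of codes is a rough guide rather than a complete list.

```python
def describe_status(status_code):
    """Return a rough, human-readable reading of an HTTP status code."""
    if 200 <= status_code < 300:
        return "success"
    if status_code in (403, 429):
        # 403 Forbidden and 429 Too Many Requests are common
        # responses when a site refuses or throttles a scraper.
        return "blocked or rate limited"
    if status_code == 404:
        return "page not found"
    # Anything not covered above (redirects, server errors, ...)
    return "other"


for code in (200, 403, 404, 500):
    print(code, describe_status(code))
```

With the code from this post you could call it as describe_status(req1.status_code) right after the request.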


To help you out, here are some useful User Agents:

1. Chrome OS-based laptop using Chrome browser 

Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36

2. Linux-based PC using a Firefox browser

Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1

3. Windows 10-based PC using Edge browser

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246

4. Apple iPhone User Agent

Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1

5. Samsung Galaxy S9 (Android-based)

Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36
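
Changing the User Agent by hand gets tedious, so here is a minimal sketch of how you might rotate through the five strings above. The names USER_AGENTS, random_headers and rotating_headers are hypothetical helpers I made up for this example. They only build the headers dictionary, so you still pass the result to requests.get yourself.

```python
import itertools
import random

# The five User Agent strings listed above.
USER_AGENTS = [
    "Mozilla/5.0 (X11; CrOS x86_64 8172.45.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.64 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:15.0) Gecko/20100101 Firefox/15.0.1",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 12_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Linux; Android 8.0.0; SM-G960F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.84 Mobile Safari/537.36",
]


def random_headers():
    """Return a headers dict with a randomly chosen User Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def rotating_headers():
    """Yield headers dicts forever, cycling through USER_AGENTS in order."""
    for user_agent in itertools.cycle(USER_AGENTS):
        yield {"User-Agent": user_agent}


# Show which User Agent each of the next three requests would use.
headers_source = rotating_headers()
for _ in range(3):
    print(next(headers_source)["User-Agent"])
```

In the scraping code from this post you would then write requests.get(url1, headers=random_headers()) instead of passing the fixed headers dictionary.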

So have a go at any of these and you are good to go. Cheers!!!
