Cheat Sheets - Web Scraping by Tacuma Solomon

Again, laying down another cheat sheet for work I’ve done. This time, for web scraping. A very small compendium, of sorts. Assuming, of course, that all of this is written in Python.

Modules you’ll need:

import requests

A library which allows one to easily send HTTP GET and POST requests.

https://requests.readthedocs.io/en/master/
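
For example, a minimal sketch of each request type (httpbin.org is just a public test endpoint, not part of the original workflow):

import requests

# Simple GET with query parameters.
response = requests.get('https://httpbin.org/get', params={'q': 'scraping'})
print(response.status_code)

# Simple POST with form-encoded data.
response = requests.post('https://httpbin.org/post', data={'field': 'value'})
print(response.json())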

It’s sessions class is also very useful. It can persist values across requests, and is good for setting default values and adapters.

https://requests.readthedocs.io/en/master/user/advanced/
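
A small sketch of a Session persisting cookies and default headers across requests (the User-Agent string is a made-up example):

import requests

session = requests.Session()
# Default headers set here are sent with every request on this session.
session.headers.update({'User-Agent': 'my-scraper/0.1'})

# Cookies set by the first response are sent automatically on later requests.
session.get('https://httpbin.org/cookies/set/token/abc123')
response = session.get('https://httpbin.org/cookies')
print(response.json())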

from urllib.parse import unquote

This was really helpful for removing those pesky %xx escape characters I would get in strings when trying to build POST requests.

https://docs.python.org/3/library/urllib.parse.html#urllib.parse.unquote
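
For instance (the encoded string is an invented example):

from urllib.parse import unquote

# Percent-encoded form data as copied from the browser's network tab.
encoded = 'name%3DJohn%20Doe%26city%3DNew%20York'
print(unquote(encoded))  # name=John Doe&city=New York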

from requests.adapters import HTTPAdapter

Just an adapter: a class that allows me to add extra features to a requests Session, such as retries.

https://kite.com/python/docs/requests.adapters.HTTPAdapter

from requests.packages.urllib3.util.retry import Retry

Retries are added to an adapter, and serve as a good way to make your scraper less fragile: the script can hit a webpage multiple times before it fails.

https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html#module-urllib3.util.retry
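
Here is a sketch of wiring Retry into an HTTPAdapter and mounting it on a Session (the retry count, backoff, and status codes are arbitrary choices, not required values):

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

session = requests.Session()

# Retry up to 5 times, backing off between attempts, on common server errors.
retries = Retry(total=5, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)

# Mount the adapter for every http:// and https:// URL on this session.
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('https://example.com')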

from time import sleep

Pauses execution of a script. Nothing to see here. Good for creating small delays between requests.

https://www.pythoncentral.io/pythons-time-sleep-pause-wait-sleep-stop-your-code/
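
For example (the URLs are placeholders):

from time import sleep
import requests

for url in ['https://example.com/page/1', 'https://example.com/page/2']:
    response = requests.get(url)
    # ... process the response here ...
    sleep(1)  # a one-second pause keeps the scraper polite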

from bs4 import BeautifulSoup

And of course, BeautifulSoup. We need the requests library to ping websites and record their responses as text, but BeautifulSoup allows us to efficiently parse the HTML. Really useful for scraping non-JavaScript websites.

Example:

soup = BeautifulSoup(response.text, 'html.parser')

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
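
And a slightly fuller sketch of pulling data out once you have the soup (example.com is a placeholder):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.title.string)             # text of the <title> tag
for link in soup.find_all('a'):      # every anchor tag on the page
    print(link.get('href'))
rows = soup.select('table tr')       # CSS selector for table rows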

The ‘Copy as cURL’ Method

This was my general method for scraping data, using Google Chrome.

If the website is JavaScript-based, I was probably out of luck; it would be time to use Selenium.

If not, and it seems like the table data may be JSON:

  • Use the network tab in Chrome’s developer tools and search for JSON files

  • See what you get and check whether it matches the page data. This makes for an easy scrape: all we need here is to parse the JSON, without using more advanced tools, e.g. BeautifulSoup

  • Right-click on the asset and ‘Copy as cURL’, using your data to build a request (probably a GET; see the sketch after this list)
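
Here is a sketch of turning a copied cURL command into a requests call, assuming the asset turned out to be a simple JSON GET; the URL, headers, and the 'rows' key are all invented for illustration:

import requests

# Hypothetical values reconstructed from a 'Copy as cURL' command.
url = 'https://example.com/api/table-data'
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Referer': 'https://example.com/table',
}

response = requests.get(url, headers=headers)
data = response.json()  # already JSON, so no HTML parsing needed

for row in data['rows']:  # 'rows' is an assumed key; inspect the real payload
    print(row)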

If the table data for a website isn’t JSON and is mostly HTML, you will have to:

  • Use the network tab in Chrome’s developer tools and search for HTML files

  • Find the HTML that has the data you want

  • Right-click on the asset and ‘Copy as cURL’, using your data to build a request (again, a GET request)

  • There may be times when you need to make POST requests; to build these, you may need to collect more data (form data). A sketch of this case follows below.
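
A sketch of the HTML case, including the POST variant; the URL and form fields are hypothetical stand-ins for whatever the copied cURL command contains:

import requests
from bs4 import BeautifulSoup

url = 'https://example.com/search'
form_data = {'query': 'widgets', 'page': '1'}  # collected from the DevTools payload

# POST the form data, then parse the HTML that comes back.
response = requests.post(url, data=form_data)
soup = BeautifulSoup(response.text, 'html.parser')

for cell in soup.select('table td'):
    print(cell.get_text(strip=True))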

[Figure: web-scraping-process1.png, a diagram of the web scraping process]