Web scraping is the technique of collecting data from websites into a well-structured format like CSV, XLS, XML, or SQL. The collected data can later be used for analysis or to derive meaningful insights.
I will explain how we can perform web scraping using Python 3, Requests, and Beautifulsoup4. Requests and Beautifulsoup4 are very powerful Python libraries. Requests is used to send requests to a remote server, and Beautifulsoup is used to parse HTML.
Requests
Some of the basic features of the Requests library are:
- Session and cookies support
- Browser-style SSL verification
- Multipart file upload
- Streaming downloads
Basically, it supports every feature that a modern web client requires. You can learn more in its documentation: http://docs.python-requests.org/
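To get a feel for some of these features, here is a minimal sketch using a session with custom headers. The URL and User-Agent string are placeholders; the request is only prepared, not sent, so we can inspect the final URL Requests would use.

```python
import requests

# Sessions persist headers and cookies across requests.
session = requests.Session()
session.headers.update({'User-Agent': 'my-scraper/1.0'})

# Build a GET request with query parameters, then prepare it
# without sending, so we can inspect what would go over the wire.
req = requests.Request('GET', 'https://example.com/search', params={'q': 'themes'})
prepared = session.prepare_request(req)
print(prepared.url)                     # https://example.com/search?q=themes
print(prepared.headers['User-Agent'])   # my-scraper/1.0
```

Calling `session.send(prepared)` (or simply `session.get(url)`) would perform the actual request.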
Beautifulsoup (bs4)
The main purpose of Beautifulsoup4 is to parse the HTML content that we get from the Requests library. The raw HTML needs to be parsed so that we can extract only the elements we are looking for. For example, if we need the text located in "<span id='my-text'>Hello, world</span>", we can pass the HTML to Beautifulsoup and simply get "Hello, world" with BeautifulSoup(html_content, 'html.parser').find('span', id='my-text').get_text().
Documentation of Beautifulsoup4: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
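The span-extraction example above can be run as a self-contained snippet:

```python
from bs4 import BeautifulSoup

# A tiny HTML fragment standing in for content fetched with Requests.
html_content = "<div><span id='my-text'>Hello, world</span></div>"
soup = BeautifulSoup(html_content, 'html.parser')

# find() returns the first matching tag; get_text() strips the markup.
text = soup.find('span', id='my-text').get_text()
print(text)  # Hello, world
```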
This might look confusing, but I will explain everything with an example. I will show you how to get themeforest's top selling themes into a CSV file. Themeforest updates its weekly top selling themes here: https://themeforest.net/top-sellers
I had an idea about listing themeforest's top selling themes on this blog and updating them weekly. This could be tedious if done manually, so I thought of using basic crawling techniques to automate the task and provide it as an example for this article. It turned out to be very simple, because all the data was already in JSON format. This makes it easy to collect the data, and we might not even need Beautifulsoup to parse it, but I will use it anyway for the sake of the example.
Let's import the required libraries first.
import requests
import csv
from bs4 import BeautifulSoup
import re
import json
import datetime
Create and initialize the CSV file with headers.
def __init__(self):
    """Initialize the CSV file and its columns."""
    self.fieldnames = ['date', 'title', 'author', 'category', 'price',
                       'description', 'avg_star', 'total_review',
                       'total_sales', 'tags', 'link']
    # Write the header row once; crawl() reopens the file in append mode.
    with open('output.csv', 'w') as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
        writer.writeheader()
    self.user_agent = ('Mozilla/5.0 (X11; Linux x86_64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/75.0.3770.80 Safari/537.36')
def crawl() is the main function, where we send a request to the URL and collect the data.
def crawl(self):
    """Fetch the top-sellers page and append each theme to output.csv."""
    response = requests.get('https://themeforest.net/top-sellers',
                            headers={'User-Agent': self.user_agent})
    content = response.text
    soup = BeautifulSoup(content, 'html.parser')
    # The data lives in the <script> tag containing window.INITIAL_STATE=
    script = soup.find('script',
                       text=re.compile(r'window\.INITIAL_STATE=', re.I | re.M))
    json_text = script.get_text()
    match = re.search(r'INITIAL_STATE=(.*}}});', json_text, re.I | re.M)
    if match:
        data = json.loads(match.group(1))
        top_sellers = data['topSellersPage']['topSellers']['matches']
        with open('output.csv', 'a') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
            for top in top_sellers:
                row = dict()
                row['date'] = datetime.datetime.today().strftime("%Y-%m-%d")
                row['title'] = top['name']
                row['author'] = top['author_username']
                row['category'] = top['classification']
                price = top['price_cents']
                if price:
                    price = price / 100  # convert cents to dollars
                row['price'] = price
                row['description'] = top['description'].strip()
                row['avg_star'] = top['rating']['rating']
                row['total_review'] = top['rating']['count']
                row['total_sales'] = top['number_of_sales']
                row['tags'] = ", ".join(top['tags'])
                row['link'] = top['url']
                writer.writerow(row)
Since the required data is in a <script> tag, we can use Beautifulsoup to select only the script tag that contains our data.
response = requests.get('https://themeforest.net/top-sellers',
                        headers={'User-Agent': self.user_agent})
content = response.text
soup = BeautifulSoup(content, 'html.parser')
script = soup.find('script',
                   text=re.compile(r'window\.INITIAL_STATE=', re.I | re.M))
There are several script tags, but our data is in the one containing "window.INITIAL_STATE=", so we use Beautifulsoup with a regular expression to find that tag.
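The extraction step that follows can be sketched in isolation. Here it runs against a hand-made HTML snippet instead of the live page; the JSON payload is a stand-in for themeforest's real window.INITIAL_STATE data.

```python
import re
import json

# A stand-in for the page source; the real payload is much larger.
html = '<script>window.INITIAL_STATE={"topSellersPage":{"total":2}};</script>'

# Capture everything between "INITIAL_STATE=" and the closing "};",
# then parse the captured text as JSON.
match = re.search(r'INITIAL_STATE=(.*});', html, re.I | re.M)
if match:
    data = json.loads(match.group(1))
    print(data['topSellersPage']['total'])  # 2
```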
Wrapping everything in a class:
import requests
import csv
from bs4 import BeautifulSoup
import re
import json
import datetime


class Webscraping(object):
    """Scrape themeforest's weekly top selling themes into a CSV file."""

    def __init__(self):
        """Initialize the CSV file and its columns."""
        self.fieldnames = ['date', 'title', 'author', 'category', 'price',
                           'description', 'avg_star', 'total_review',
                           'total_sales', 'tags', 'link']
        # Write the header row once; crawl() reopens the file in append mode.
        with open('output.csv', 'w') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
            writer.writeheader()
        self.user_agent = ('Mozilla/5.0 (X11; Linux x86_64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/75.0.3770.80 Safari/537.36')

    def crawl(self):
        """Fetch the top-sellers page and append each theme to output.csv."""
        response = requests.get('https://themeforest.net/top-sellers',
                                headers={'User-Agent': self.user_agent})
        content = response.text
        soup = BeautifulSoup(content, 'html.parser')
        # The data lives in the <script> tag containing window.INITIAL_STATE=
        script = soup.find('script',
                           text=re.compile(r'window\.INITIAL_STATE=', re.I | re.M))
        json_text = script.get_text()
        match = re.search(r'INITIAL_STATE=(.*}}});', json_text, re.I | re.M)
        if match:
            data = json.loads(match.group(1))
            top_sellers = data['topSellersPage']['topSellers']['matches']
            with open('output.csv', 'a') as csv_file:
                writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
                for top in top_sellers:
                    row = dict()
                    row['date'] = datetime.datetime.today().strftime("%Y-%m-%d")
                    row['title'] = top['name']
                    row['author'] = top['author_username']
                    row['category'] = top['classification']
                    price = top['price_cents']
                    if price:
                        price = price / 100  # convert cents to dollars
                    row['price'] = price
                    row['description'] = top['description'].strip()
                    row['avg_star'] = top['rating']['rating']
                    row['total_review'] = top['rating']['count']
                    row['total_sales'] = top['number_of_sales']
                    row['tags'] = ", ".join(top['tags'])
                    row['link'] = top['url']
                    writer.writerow(row)


scrape = Webscraping()
scrape.crawl()
Finally, all the information about the top selling themes is written to the CSV file. There is more information in the JSON data than we see on the themeforest page, such as the description and tags; I have included most of it in the final data.
Please download the working example with the output CSV file here.
[download id=”158″]