Web scraping with Python 3, Requests and Beautifulsoup (bs4)

Thetechnotreat
July 10, 2019

Web scraping is the process of extracting data from websites and structuring it into formats like CSV, XLS, XML, or SQL for further analysis and insights.

In this guide, I will explain how to perform web scraping using Python 3, along with the requests and BeautifulSoup4 libraries. These powerful Python libraries simplify the process: requests handles making HTTP requests to fetch web content, while BeautifulSoup is used to parse and extract data from HTML.

Requests Library

The requests library provides several key features:

  • Session and cookies management
  • Browser-style SSL verification
  • Multipart file uploads
  • Streaming downloads

Essentially, it supports all the functionality required for modern web interactions. You can find more details in the official documentation: https://requests.readthedocs.io/
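As a quick illustration of the request side, here is a minimal sketch of a session-based GET (https://example.com is just a stand-in URL):

import requests

# A Session reuses cookies and connections across requests
session = requests.Session()
# SSL certificates are verified by default, much like a browser
response = session.get('https://example.com', timeout=10)
print(response.status_code)
print(response.headers.get('Content-Type'))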

BeautifulSoup (bs4)

The primary purpose of BeautifulSoup4 is to parse HTML content retrieved using the requests library. Since raw HTML needs to be processed to extract specific elements, BeautifulSoup simplifies this task.

For example, if we need to extract the text from:

<span id="my-text">Hello, world</span>

we can parse the HTML with BeautifulSoup and get "Hello, world" out directly:

BeautifulSoup(html_content, 'html.parser').find('span', id='my-text').get_text()

BeautifulSoup4 documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
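Here is that example as a complete, runnable snippet:

from bs4 import BeautifulSoup

html_content = '<span id="my-text">Hello, world</span>'
soup = BeautifulSoup(html_content, 'html.parser')
# Find the span by its id attribute and extract the inner text
print(soup.find('span', id='my-text').get_text())  # Hello, world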

This might look confusing, but I will explain everything with an example: scraping ThemeForest's top-selling themes into a CSV file. ThemeForest updates its weekly top-selling themes here: https://themeforest.net/top-sellers

(Screenshot: ThemeForest top-seller listing)

I had the idea of listing ThemeForest's top-selling themes on this blog and updating the list weekly. Doing this manually would be tedious, so I decided to automate the task with basic crawling techniques and use it as the example for this article. It turned out to be very simple because all the data is already embedded in the page as JSON. That makes collecting it easy, and we might not even need BeautifulSoup to parse it, but I will use it anyway for the sake of the example.

Let's import the required libraries first.

import requests
import csv
from bs4 import BeautifulSoup
import re
import json
import datetime

Create and initialize the CSV file with headers:

    def __init__(self):
        """Initialize the CSV file and write the header row."""
        self.fieldnames = ['date', 'title', 'author', 'category', 'price', 'description',
                           'avg_star', 'total_review', 'total_sales', 'tags', 'link']
        # newline='' avoids blank lines between rows on some platforms
        with open('output.csv', 'w', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
            writer.writeheader()
        # A browser-like User-Agent so the request is less likely to be blocked
        self.user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'

crawl() is the main function, where we send the request to the URL and collect the data:

    def crawl(self):
        """Fetch the top-sellers page, extract the embedded JSON and append one CSV row per theme."""
        response = requests.get('https://themeforest.net/top-sellers', headers={'User-Agent': self.user_agent})
        content = response.text
        soup = BeautifulSoup(content, 'html.parser')
        # The data we need is embedded as JSON in a <script> tag starting with window.INITIAL_STATE=
        script = soup.find('script', text=re.compile(r'window\.INITIAL_STATE=', re.I | re.M))
        json_text = script.get_text()
        match = re.search(r'INITIAL_STATE=(.*}}});', json_text, re.I | re.M)
        if not match:
            return
        data = json.loads(match.group(1))
        top_sellers = data['topSellersPage']['topSellers']['matches']
        with open('output.csv', 'a', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
            for top in top_sellers:
                row = dict()
                row['date'] = datetime.datetime.today().strftime("%Y-%m-%d")
                row['title'] = top['name']
                row['author'] = top['author_username']
                row['category'] = top['classification']
                price = top['price_cents']
                if price:
                    # Prices are stored in cents; convert to dollars
                    row['price'] = price / 100
                row['description'] = top['description'].strip()
                row['avg_star'] = top['rating']['rating']
                row['total_review'] = top['rating']['count']
                row['total_sales'] = top['number_of_sales']
                row['tags'] = ", ".join(top['tags'])
                row['link'] = top['url']
                writer.writerow(row)

Since the required data is inside a <script> tag, we can use BeautifulSoup to select only the script tag that contains our data:

response = requests.get('https://themeforest.net/top-sellers', headers={'User-Agent': self.user_agent})
content = response.text
soup = BeautifulSoup(content, 'html.parser')
script = soup.find('script', text=re.compile(r'window\.INITIAL_STATE=', re.I | re.M))

(Screenshot: ThemeForest HTML source showing the window.INITIAL_STATE script tag)

There are several script tags, but our data is in the one containing window.INITIAL_STATE=, so we use BeautifulSoup together with a regular expression to find that specific tag. A second regular expression then strips the JavaScript assignment and trailing semicolon, leaving just the JSON payload for json.loads().
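To see the idea in isolation, here is a toy example; the HTML below is made up for illustration, with a far smaller payload than the real page:

import re
import json
from bs4 import BeautifulSoup

# Toy HTML standing in for the real page source
html = '''
<script>window.analytics={};</script>
<script>window.INITIAL_STATE={"topSellersPage": {"topSellers": {"matches": []}}};</script>
'''
soup = BeautifulSoup(html, 'html.parser')
# First regex picks the right script tag, second one strips the assignment
script = soup.find('script', text=re.compile(r'window\.INITIAL_STATE='))
match = re.search(r'INITIAL_STATE=(.*}}});', script.get_text())
if match:
    data = json.loads(match.group(1))
    print(data['topSellersPage']['topSellers']['matches'])  # -> []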

Wrapping everything up in a class:

import requests
import csv
from bs4 import BeautifulSoup
import re
import json
import datetime


class Webscraping(object):
    """Scrape ThemeForest's weekly top-selling themes into a CSV file."""

    def __init__(self):
        """Initialize the CSV file and write the header row."""
        self.fieldnames = ['date', 'title', 'author', 'category', 'price', 'description',
                           'avg_star', 'total_review', 'total_sales', 'tags', 'link']
        with open('output.csv', 'w', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
            writer.writeheader()
        self.user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.80 Safari/537.36'

    def crawl(self):
        """Fetch the top-sellers page, extract the embedded JSON and append one CSV row per theme."""
        response = requests.get('https://themeforest.net/top-sellers', headers={'User-Agent': self.user_agent})
        content = response.text
        soup = BeautifulSoup(content, 'html.parser')
        script = soup.find('script', text=re.compile(r'window\.INITIAL_STATE=', re.I | re.M))
        json_text = script.get_text()
        match = re.search(r'INITIAL_STATE=(.*}}});', json_text, re.I | re.M)
        if not match:
            return
        data = json.loads(match.group(1))
        top_sellers = data['topSellersPage']['topSellers']['matches']
        with open('output.csv', 'a', newline='') as csv_file:
            writer = csv.DictWriter(csv_file, fieldnames=self.fieldnames)
            for top in top_sellers:
                row = dict()
                row['date'] = datetime.datetime.today().strftime("%Y-%m-%d")
                row['title'] = top['name']
                row['author'] = top['author_username']
                row['category'] = top['classification']
                price = top['price_cents']
                if price:
                    # Prices are stored in cents; convert to dollars
                    row['price'] = price / 100
                row['description'] = top['description'].strip()
                row['avg_star'] = top['rating']['rating']
                row['total_review'] = top['rating']['count']
                row['total_sales'] = top['number_of_sales']
                row['tags'] = ", ".join(top['tags'])
                row['link'] = top['url']
                writer.writerow(row)


scrape = Webscraping()
scrape.crawl()
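Run the script and output.csv appears in the working directory. A few lines are enough to sanity-check the result:

import csv

with open('output.csv', newline='') as f:
    for row in csv.DictReader(f):
        print(row['title'], row['price'], row['total_sales'])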

Finally, all the details about the top-selling themes are saved in a CSV file. The JSON data contains more information than what is visible on the ThemeForest page, such as descriptions, tags, and more. I have included most of these details in the final dataset.

