Web Scraping Using Beautiful Soup
Data is a buzzword, but where do we get it? Data is essential for any analysis, yet finding a good source is often the hard part.
One of the ways to get data is web scraping: “the process of using bots to extract content and data from a website”. In this article, I will explain how to extract data from a website using Selenium and Beautiful Soup.
Prerequisites
- Basic knowledge of Python (Learn Python)
- Basic knowledge of Beautiful Soup (Learn Beautiful Soup)
- Python packages: selenium, webdriver_manager, bs4, pandas
Is web scraping always allowed?
“No”… Scraping increases traffic on a website and, with it, the risk of overloading the server. As a result, not all websites permit scraping. How do you determine whether a website permits it? Check the website’s “robots.txt” file: append /robots.txt to the root URL of the site you wish to collect data from, and it will tell you which parts of the site the host allows bots to access.
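You can also check these rules programmatically. Here is a minimal sketch using Python’s built-in urllib.robotparser; the two paths queried are just illustrative examples:
from urllib import robotparser

# Point the parser at the site's robots.txt file and download it
rp = robotparser.RobotFileParser()
rp.set_url('https://www.youtube.com/robots.txt')
rp.read()

# can_fetch(user_agent, url) is True when the rules allow that path
print(rp.can_fetch('*', 'https://www.youtube.com/'))         # allowed
print(rp.can_fetch('*', 'https://www.youtube.com/results'))  # disallowed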
How do we get data from websites using web scraping?
Scraping a web page entails retrieving it and extracting information from it. Downloading a page is referred to as fetching, so web crawling (fetching pages for later processing) is an essential component of web scraping. Once a page is fetched, extraction can begin: its content can be parsed, searched, and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something from a page and use it for another purpose elsewhere. The steps for web scraping are below (a minimal fetch-and-parse sketch follows the list):
- Locate the URL of the web page you wish to scrape and check whether it allows scraping or not.
- Explore the page for the data you want to extract.
- Write the logic for extracting data from the web page.
- Store the extracted data in a file or database.
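Here is that minimal fetch-and-parse sketch, using the requests package and Beautiful Soup on a static page. Note that example.com and the h1 tag are illustrative choices; a JavaScript-heavy site like YouTube needs a browser driver instead, as we will see below:
import requests
from bs4 import BeautifulSoup

# Fetch: download the raw HTML of the page
response = requests.get('https://example.com')

# Extract: parse the HTML and pull out the piece we want
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('h1').text)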
Explanation with an example:
1. Locate the URL of the web page you wish to scrape and check whether it allows scraping or not.
Let us extract the data from YouTube:
weburl = 'https://www.youtube.com'
robots.txt for YouTube:
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml
As you can see, several of YouTube’s sub-pages do not permit web scraping. However, it does permit some paths, so we can legally continue scraping those parts of the site.
2. Explore the page for the data you want to extract
Once we begin exploring the chosen URL, we can identify elements to extract. For example, from YouTube we can extract video titles, video views, channels, channel links, video links, video length, and so on. Once we’ve decided which elements to extract, we inspect the page (right-click → Inspect in the browser) and note the HTML attributes of those elements.
After inspecting the page, we find the following attributes for the elements mentioned:
- Video title: {'id':'video-title','class':'style-scope ytd-rich-grid-media'}
- Video channel: {'id':'text','class':'style-scope ytd-channel-name complex-string'}
- Video views: {'id':'metadata-line','class':'style-scope ytd-video-meta-block'}
- Video length: {'id':'text','class':'style-scope ytd-thumbnail-overlay-time-status-renderer'}
- Video link: {'id':'video-title-link','class':'yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media'}
3. Write the logic for extracting data
It's time for coding!!!
Import the selenium, webdriver_manager, bs4, and pandas packages:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
Install the Chrome driver, which Selenium uses to automate the browser:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Get the HTML page of the selected URL and pass the page source to Beautiful Soup:
driver.get(weburl)
soup = BeautifulSoup(driver.page_source, 'html.parser')
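One caveat: YouTube renders its feed with JavaScript, so page_source may be read before the videos have appeared. A common remedy (an addition on my part, not part of the original code) is to wait for the video elements before creating the soup. Here is a sketch using Selenium’s WebDriverWait, assuming the ytd-rich-grid-media tag used in the next step; it would go between driver.get(weburl) and the BeautifulSoup call:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one video card to appear before parsing
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'ytd-rich-grid-media'))
)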
Now, we can write the logic to retrieve the required data using the attributes we identified in the previous section:
titles = []
channels = []
channel_links = []
views = []
lengths = []
links = []

# Each 'dismissible' block is one video card on the homepage
for a in soup.find_all(attrs={'id':'dismissible','class':'style-scope ytd-rich-grid-media'}):
    title = a.find('yt-formatted-string', attrs={'id':'video-title','class':'style-scope ytd-rich-grid-media'})
    titles.append(title.text)
    channel = a.find('yt-formatted-string', attrs={'id':'text','class':'style-scope ytd-channel-name complex-string'})
    channels.append(channel.text)
    channel_links.append(weburl + channel.a.get('href'))
    view = a.find('div', attrs={'id':'metadata-line','class':'style-scope ytd-video-meta-block'})
    views.append(view.span.text)
    # The length overlay may be absent (e.g. for live streams), so guard against None
    length = a.find('span', attrs={'id':'text','class':'style-scope ytd-thumbnail-overlay-time-status-renderer'})
    if length:
        lengths.append(length.text.strip())
    else:
        lengths.append(None)
    link = a.find('a', attrs={'id':'video-title-link','class':'yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media'})
    links.append(weburl + link.get('href'))
In the above code snippet, data is retrieved using the HTML attributes we inspected in the last step. The soup.find_all(…) call fetches all the video cards, and the loop then extracts each video’s individual elements.
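As a side note, Beautiful Soup also accepts CSS selectors through select(), which can make these lookups more compact. A hypothetical equivalent for the titles, assuming the same markup:
# Every element with id 'video-title' inside a ytd-rich-grid-media tag
titles = [t.text for t in soup.select('ytd-rich-grid-media #video-title')]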
4. Save the extracted data into a file
Now that we have all the data in separate lists, let's combine them into a data frame and save it to a CSV file to store the data permanently.
youtube_data = pd.DataFrame(data={"Title": titles, "Channel": channels, "Channel link": channel_links, "Views": views, "Video length": lengths, "Link": links})
youtube_data.to_csv("youtube.csv", index=False)
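As a quick sanity check, you can read the CSV back with pandas and preview the first few rows:
print(pd.read_csv('youtube.csv').head())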
We have successfully extracted the data from the YouTube website.