Web Scraping Using Beautiful Soup
Data is a buzzword, but where do we get it? Data is essential for any analysis, yet finding a good source is often the hard part.
One of the ways to get data is web scraping: “the process of using bots to extract content and data from a website”. In this article, I will explain how to extract data from a website using Selenium and Beautiful Soup.
Prerequisites
- Basic knowledge of Python (Learn Python)
- Basic knowledge of Beautiful Soup (Learn Beautiful Soup)
- Python packages: selenium, webdriver_manager, bs4, pandas
Is web scraping always allowed?
“No”… Scraping increases traffic on a website and, with it, the risk of overloading the server. As a result, not all websites permit scraping. How do you determine whether a website permits it? Check the website’s “robots.txt” file: append /robots.txt to the root URL of the site you wish to collect data from, and it will tell you which parts of the site the host allows bots to access.
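You can also check these rules programmatically. Here is a minimal sketch using Python’s built-in urllib.robotparser; the two paths queried are just illustrative examples:
from urllib import robotparser

# Point the parser at the site's robots.txt file and download it
rp = robotparser.RobotFileParser()
rp.set_url('https://www.youtube.com/robots.txt')
rp.read()

# can_fetch(user_agent, url) is True when the rules allow that path
print(rp.can_fetch('*', 'https://www.youtube.com/'))         # allowed
print(rp.can_fetch('*', 'https://www.youtube.com/results'))  # disallowed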
How do we get data from websites using web scraping?
Scraping a web page entails retrieving it and extracting information from it. Downloading a page is referred to as fetching, so web crawling (fetching pages for later processing) is an essential component of web scraping. Once a page is fetched, extraction can begin: its content can be parsed, searched, and reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something from a page and use it for another purpose elsewhere. The steps for web scraping are below (a minimal fetch-and-parse sketch follows the list):
- Locate the URL of the web page you wish to scrape and check whether it allows scraping or not.
- Explore the page for the data you want to extract.
- Write the logic for extracting data from the web page.
- Store the extracted data in a file or database.
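Here is that minimal fetch-and-parse sketch, using the requests package and Beautiful Soup on a static page. Note that example.com and the h1 tag are illustrative choices; a JavaScript-heavy site like YouTube needs a browser driver instead, as we will see below:
import requests
from bs4 import BeautifulSoup

# Fetch: download the raw HTML of the page
response = requests.get('https://example.com')

# Extract: parse the HTML and pull out the piece we want
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.find('h1').text)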
Explanation with an example:
1. Locate the URL of the web page you wish to scrape and check whether it allows scraping or not.
Let us extract the data from YouTube:
weburl = 'https://www.youtube.com'
robots.txt for YouTube:
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid 90's which wiped out all humans.

User-agent: Mediapartners-Google*
Disallow:

User-agent: *
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax

Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml
As you can see, several of YouTube’s sub-pages do not permit web scraping. However, it does permit some paths, so we can legally continue scraping those parts of the site.
2. Explore the page for the data you want to extract
Once we begin exploring the chosen URL, we can identify elements to extract. For example, from YouTube we can extract video titles, video views, channels, channel links, video links, video length, and so on. Once we’ve decided which elements to extract, we inspect the page (right-click → Inspect in the browser) and note the HTML attributes of those elements.
After inspecting the page, we find the following attributes for the elements mentioned:
- Video title: {'id':'video-title','class':'style-scope ytd-rich-grid-media'}
- Video channel: {'id':'text','class':'style-scope ytd-channel-name complex-string'}
- Video views: {'id':'metadata-line','class':'style-scope ytd-video-meta-block'}
- Video length: {'id':'text','class':'style-scope ytd-thumbnail-overlay-time-status-renderer'}
- Video link: {'id':'video-title-link','class':'yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media'}
3. Write the logic for extracting data
It's time for coding!!!
Import the selenium, webdriver_manager, bs4, and pandas packages:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd
Install the Chrome driver, which Selenium uses to automate the browser:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Get the HTML page of the selected URL and pass the page source to Beautiful Soup:
driver.get(weburl)
soup = BeautifulSoup(driver.page_source, 'html.parser')
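One caveat: YouTube renders its feed with JavaScript, so page_source may be read before the videos have appeared. A common remedy (an addition on my part, not part of the original code) is to wait for the video elements before creating the soup. Here is a sketch using Selenium’s WebDriverWait, assuming the ytd-rich-grid-media tag used in the next step; it would go between driver.get(weburl) and the BeautifulSoup call:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one video card to appear before parsing
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, 'ytd-rich-grid-media'))
)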
Now, we can write the logic to retrieve the required data using the attributes we identified in the previous section:
titles = []
channels = []
channel_links = []
views = []
lengths = []
links = []

# Each 'dismissible' block is one video card on the homepage
for a in soup.find_all(attrs={'id':'dismissible','class':'style-scope ytd-rich-grid-media'}):
    title = a.find('yt-formatted-string', attrs={'id':'video-title','class':'style-scope ytd-rich-grid-media'})
    titles.append(title.text)
    channel = a.find('yt-formatted-string', attrs={'id':'text','class':'style-scope ytd-channel-name complex-string'})
    channels.append(channel.text)
    channel_links.append(weburl + channel.a.get('href'))
    view = a.find('div', attrs={'id':'metadata-line','class':'style-scope ytd-video-meta-block'})
    views.append(view.span.text)
    # The length overlay may be absent (e.g. for live streams), so guard against None
    length = a.find('span', attrs={'id':'text','class':'style-scope ytd-thumbnail-overlay-time-status-renderer'})
    if length:
        lengths.append(length.text.strip())
    else:
        lengths.append(None)
    link = a.find('a', attrs={'id':'video-title-link','class':'yt-simple-endpoint focus-on-expand style-scope ytd-rich-grid-media'})
    links.append(weburl + link.get('href'))
In the above code snippet, data is retrieved using the HTML attributes we inspected in the last step. The soup.find_all(…) call fetches all the video cards, and the loop then extracts each video’s individual elements.
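As a side note, Beautiful Soup also accepts CSS selectors through select(), which can make these lookups more compact. A hypothetical equivalent for the titles, assuming the same markup:
# Every element with id 'video-title' inside a ytd-rich-grid-media tag
titles = [t.text for t in soup.select('ytd-rich-grid-media #video-title')]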
4. Save the extracted data into a file
Now that we have all the data in separate lists, let's combine them into a data frame and save it to a CSV file to store the data permanently.
youtube_data = pd.DataFrame(data={"Title": titles, "Channel": channels, "Channel link": channel_links, "Views": views, "Video length": lengths, "Link": links})
youtube_data.to_csv("youtube.csv", index=False)
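As a quick sanity check, you can read the CSV back with pandas and preview the first few rows:
print(pd.read_csv('youtube.csv').head())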
We have successfully extracted the data from the YouTube website.