
How to Scrape Dynamic Websites with Python

Chris Prosser

Scraping dynamic websites is possible and even easy, depending on your target website. I will show you how to scrape dynamic content from websites using Python.

Dynamic websites present a unique challenge that makes traditional scraping tools useless. Traditional scraping libraries, frameworks, and tools are designed for pages whose HTML contains all of the publicly available data on load. Dynamic websites, on the other hand, don't send the full content with the initial page load, so the initial HTML is either empty or sparse. As you interact with the page, content is loaded without a page refresh; this is done by JavaScript using AJAX.

To scrape content on dynamic pages, you need web scraping tools built for the modern web. Python is arguably the most popular programming language for web scraping, and it has tools for scraping both static and dynamic web pages. In this guide, I will show you how to scrape dynamic websites with Python. Note that this page will not be filled with code; instead, I will describe the tools and how to use them to scrape dynamic content effectively without getting detected and banned.

What are Dynamic Websites in the Context of Web Scraping?

In the context of web scraping, dynamic websites or web pages are pages that don't send their full content in the initial response when you request them. Some of them present only the static sections of the page, such as the header and footer, in the first response. The main content is then loaded using JavaScript (AJAX), either automatically or in response to your interaction with the page.

Because JavaScript is involved, you can't access these pages without it; they will either return a blank page or ask you to turn on JavaScript execution. This means that if you are in the habit of turning off JavaScript to avoid tracking, you can't do that with dynamic websites. At first this looks like a bad thing, but from a user's point of view it is great for user experience: it saves bandwidth, and users don't have to wait for a page to load completely before they can start interacting with it.

Unfortunately, while this is great for users, it is a pain for web scrapers. Web scrapers have traditionally been developed to scrape static pages whose HTML contains the full page content on load. These traditional scrapers can't handle dynamic pages with the traditional approach; you need a dynamic scraping method for dynamic pages.
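You can see the problem for yourself with a quick check. Below is a minimal sketch that fetches the demo page used later in this guide with plain Requests; since no JavaScript runs, the film rows a browser would show are simply not in the HTML that comes back. The exact row count may vary with the page's markup, so treat this as an illustration rather than a guarantee.

import requests
from bs4 import BeautifulSoup

# Fetch a dynamic demo page with a plain HTTP request; no JavaScript runs here,
# so anything the page normally loads via AJAX will be missing from the HTML.
url = "https://www.scrapethissite.com/pages/ajax-javascript/"
html = requests.get(url).text

soup = BeautifulSoup(html, "html.parser")
# The film rows are injected by JavaScript after load, so this count should be
# zero or close to it, even though a browser shows a full table after a click.
print("table rows in raw HTML:", len(soup.find_all("tr")))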

Ways to Scrape Dynamic Pages with Python

Most times, when scraping dynamic pages with Python is mentioned, most scraper developers' minds race to Selenium and its likes, such as Puppeteer and Playwright. Sure, these tools work, but they are not an easy option in terms of development and they have their performance issues. There is another method you should try first, known as the reverse engineering approach; only when this does not work should you switch to Selenium. In essence, there are two ways you can scrape dynamic content with Python: either you reverse engineer the page's requests, or you use Selenium or an alternative. Let's take a look at how to do each of these.

Method 1: Reverse Engineering

There should be a standard name for this procedure, but since I didn't find one, I settled on this name as the best term to describe it. Reverse engineering means decoding and discovering the web requests sent by the page's JavaScript code, which return the content as responses. Something interesting happens behind the scenes that is responsible for the dynamic nature of these pages: requests are sent to the server to retrieve the content that wasn't sent initially.

If you can find the request URL, payload, and other details, you should be able to replicate the same request using traditional scraping tools such as Python's Requests and BeautifulSoup. In most instances, this is even more efficient than the traditional method of downloading the full web page content before scraping. The responses are usually JSON and contain just the data you are after, which saves bandwidth compared to the traditional method. It is also faster and offers the best performance for the same reason. However, you need to know how to identify the specific requests and how to replicate them. Let me show you how to do this with an example.

Step By Step Guide on How to Reverse Engineer Websites for Scraping

I will use a test web page that is dynamic and requires your action to display content. The web page is https://www.scrapethissite.com/pages/ajax-javascript/. The site as a whole is dedicated to web scraping practice, with different pages to help you build your scraping skills. This particular page presents a list of great films for each of the years listed, but you have to click on a year before the films are displayed.

If JavaScript is enabled and you click on the link for a year, you will see a table with details of the films for that year. If you disable JavaScript and click on the same link, nothing happens. This is typical behavior for AJAX-driven dynamic web pages, which makes this page a perfect example for showing how to scrape them.

So how do we scrape the film data behind those year links? Well, that is what I will show you in this section.

Step 1: Install the Necessary Libraries

As I mentioned earlier, I won't be using modern scraping libraries for this reverse engineering method; I will use the traditional tools, just with a different technique to make them useful. I will be using the Requests library. You can install it with "pip install requests" if you already have Python installed on your device. If you don't, first install the latest version of Python from its official website.

Step 2: Check Network Activities in Your Browser Developer Tool

All popular browsers come with developer tools on desktop. I am using Chrome in this guide. To access the tool, visit https://www.scrapethissite.com/pages/ajax-javascript/ and right-click anywhere on the page. Choose "Inspect" and a panel will appear on the right side of the page. That is the developer tool. Open the Network tab; that is where you will see all of the network activity, including the requests sent behind the scenes by the website's JavaScript and the responses returned. By now, the network activity for loading the page will already have completed.

Reload the page and you will see a list of requests being sent. There are filter tabs such as All, Fetch/XHR, CSS, JS, and a few others. The one we are interested in is "Fetch/XHR"; click it. Now go back to the page itself on the left of the developer tool and click on a year, and you will notice a new entry appear in the Network tab.

Click on this new entry and you will see a new panel with some tabs. By default, the Headers tab is in view, showing the request URL. You can scroll down to see the headers sent along with the request. The Payload tab, shown for POST requests, displays the data you send. Go to the Preview tab; there, you will see a preview of the response data.

The preview shows a list of movies with information such as the movie name, year, number of awards, nominations, and whether it won Best Picture. With this, we can see that we don't even need BeautifulSoup to parse anything, as the data is available as JSON. The request URL is https://www.scrapethissite.com/pages/ajax-javascript/?ajax=true&year=2015. Now all we have to do is write a scraper that sends a request to this URL, and we will get a JSON response we can use. Let me do that below.

Step 3: Use Requests to Scrape Content

With the URL of the request that generates the data identified above, all you have to do is replicate the same request and you will get the required data.

import requests

# Headers to make the request look like it comes from a regular browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}

# The AJAX endpoint discovered in the Network tab (data for 2015)
url = "https://www.scrapethissite.com/pages/ajax-javascript/?ajax=true&year=2015"

# The endpoint returns JSON, so we can parse it directly
movies = requests.get(url, headers=headers).json()

for movie in movies:
    print(movie)

Notice that I use only Requests; BeautifulSoup isn't even involved, which makes this method simpler than the traditional combination of Requests and BeautifulSoup. Since the response is JSON, there is no need to take the response as text and parse it yourself: Requests has a json() method that converts the JSON response into a Python list or dictionary. If you run the code above, each movie is printed as a Python dictionary.
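If you need more than one year, the same pattern extends naturally: keep the endpoint and change the year query parameter. Below is a small sketch of that idea; the exact keys inside each movie object are whatever you saw in the Preview tab, so confirm them there before relying on them.

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36',
}

base_url = "https://www.scrapethissite.com/pages/ajax-javascript/"
all_movies = {}

# Reuse the same AJAX endpoint for several years by changing the "year" parameter.
for year in (2010, 2011, 2012):
    params = {"ajax": "true", "year": year}
    movies = requests.get(base_url, headers=headers, params=params).json()
    all_movies[year] = movies
    print(year, "->", len(movies), "films")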

That’s it, this is how easy it is to scrape dynamic websites using the reverse engineering method.

Method 2: Scraping Using Browser Automators

The de facto standard for scraping dynamic websites is using a browser automator such as Selenium or Puppeteer. Using either of these, you can access a website and automate events such as scrolling or clicking a button, which triggers the JavaScript requests that send, receive, and render data on the page. The goal is to automate a browser so it renders the content on a page, after which you can scrape the content you need.

Selenium is the popular option here. It supports the popular scraping languages, including Python, and it supports more browsers than the alternatives. The advantage Selenium has over the reverse engineering method is that you don't have to dig deep into the developer tools or worry about malformed requests. However, it is slower and consumes more memory. You can reduce some of these side effects by using headless mode, which still automates your browser but doesn't render a visible browser window.
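Here is a minimal sketch of headless mode with Selenium 4 and Chrome; the rest of your scraping code stays exactly the same, it just runs without a visible window.

from selenium import webdriver

# Configure Chrome to run headless (no visible browser window is opened).
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # plain "--headless" on older Chrome versions

driver = webdriver.Chrome(options=options)
driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
print(driver.title)  # the page still renders and executes JavaScript
driver.quit()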

Step by Step Guide on How to Scrape Dynamic Websites Using Selenium

Below are the steps to scrape the same content using Selenium.

Step 1: Install Selenium

Selenium is not part of the Python standard library, so you will need to install it. Installing and using Selenium is not quite like installing other libraries: first you install the package, and then you download a web driver that corresponds to the version of the browser you want to automate. To install Selenium, run this command in the command prompt/terminal: "pip install selenium".

For the web driver, you need to download the executable and extract it somewhere on your system path; for this guide, just place it in the same folder as your script. Google Chrome users can get ChromeDriver from the official ChromeDriver downloads page, and Firefox users can get geckodriver from the Mozilla GeckoDriver releases page. Safari users don't need to install a separate driver in order to use it.
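If you keep the driver executable next to your script instead of on your system path, you can point Selenium at it explicitly. Below is a minimal sketch, assuming a ChromeDriver binary named chromedriver sitting in the script's folder; note that recent Selenium releases (4.6 and later) can also fetch a matching driver for you automatically.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Point Selenium at a driver executable sitting next to this script.
# "chromedriver" is an assumed filename; on Windows it would be "chromedriver.exe".
service = Service(executable_path="./chromedriver")
driver = webdriver.Chrome(service=service)

driver.get("https://www.scrapethissite.com/pages/ajax-javascript/")
driver.quit()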

Step 2: Inspect Page for Relevant Element IDs and Classes

Now visit the same page (https://www.scrapethissite.com/pages/ajax-javascript/), right-click the 2015 link, and choose Inspect. You will see that the link has the ID "2015". This is what you will use to locate the link and click it. If the web page is a complex one, you will have to do a lot of traversing; however, since the target is a simplified sample site, we don't have to do much to reach our element.

Step 3: Code Web Scraper

Below is the source code for scraping the data using Selenium.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Set up Chrome WebDriver
driver = webdriver.Chrome()

# Open the URL
url = "https://www.scrapethissite.com/pages/ajax-javascript/"
driver.get(url)

# Wait for the page to load
driver.implicitly_wait(10)

# Find the link with the ID "2015" and click it
link = driver.find_element(By.ID, "2015")
link.click()

# Wait until the AJAX-loaded film rows are present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "tr.film"))
)

# Get the page source after clicking the link
page_source = driver.page_source

# Parse the page source using BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')

# Find the table element
table = soup.find('table', class_='table')

# Find all rows in the table body
rows = table.find('tbody', id='table-body').find_all('tr', class_='film')

# Initialize a list to store the data
data = []

# Loop through each row and extract the data
for row in rows:
    title = row.find('td', class_='film-title').get_text(strip=True)
    nominations = row.find('td', class_='film-nominations').get_text(strip=True)
    awards = row.find('td', class_='film-awards').get_text(strip=True)
    # The best picture cell contains an icon only when the film won Best Picture
    best_picture = 'Yes' if row.find('td', class_='film-best-picture').find('i') else 'No'

    data.append({
        'Title': title,
        'Nominations': nominations,
        'Awards': awards,
        'Best Picture': best_picture
    })

# Print the scraped data
for item in data:
    print(item)

# After scraping, you can perform further actions or data processing as needed

# Close the browser window
driver.quit()

What the code above does is simple. It visits the page, waits up to 10 seconds for the initial elements to load, finds the 2015 link by its ID, and clicks it. It then waits until the browser has fetched and rendered the film rows, grabs the HTML of the page, and feeds it to BeautifulSoup for parsing and data extraction.
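From here, saving the results is straightforward. Below is a small sketch, assuming the data list of dictionaries built by the script above, that writes the rows to a CSV file with Python's standard csv module; the filename is just an example.

import csv

# "data" is the list of dictionaries built by the Selenium script above.
fieldnames = ['Title', 'Nominations', 'Awards', 'Best Picture']

with open('films_2015.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data)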

Conclusion

While dynamic content poses a new set of challenges for web scraping, Python has support for scraping it through browser automation tools such as Selenium or Puppeteer; this is the easier option to use. Those with more skill can still use Requests and BeautifulSoup, but this requires careful planning and patience to get things to work.

FAQs

Can Scrapy scrape dynamic pages?

No, Scrapy is not meant for scraping dynamic pages. It is designed for pages that return all of the content you need at once when you request them. If the required data is not in the initial response to your request, then it is not available for Scrapy to access and scrape. You would need to do some hacky things to get it to work, which, for web crawling, is not an efficient route to take.

What is the best Python tool for scraping dynamic pages?

Selenium is the easiest-to-use Python module for scraping dynamic pages. However, it consumes a lot of memory and is not the fastest option out there. You can use Puppeteer as an alternative, but it requires you to have Chrome installed, unlike Selenium, which works with other browsers as well. Those skilled enough to reverse engineer sites can use Requests and BeautifulSoup, but this requires patience and skill to perfect.

Why can't I scrape dynamic pages with Requests and BeautifulSoup?

Because you are not doing it the right way. Traditionally, you request a page and all of the content for the page is returned in the response; this is the kind of request BeautifulSoup was built to handle. To use these tools for scraping dynamic pages, you need to find the web requests the page sends behind the scenes to fetch the data you need and replicate them. With this, you can use Requests alone, or together with BeautifulSoup, to scrape dynamic content.
