How to web scrape in Python with BeautifulSoup?


Do you want to dive into the world of web scraping without getting lost in complicated code?

With Python and the BeautifulSoup library, you can easily extract and organize data from a website in just a few lines.

Web scraping in Python with BeautifulSoup. ©Christina for Alucare.fr

Prerequisites for scraping in Python with BeautifulSoup

✅ Before you get started, it's important to have a few programming basics. This gives you a better understanding of how the code works. You don't need to be an expert, but knowing how to read and execute a Python script will help you a lot.

Next, here's what you need to do before you can start scraping in Python with BeautifulSoup:

  • ✔ Install Python as well as a development environment.
  • ✔ Install pip, the tool that makes it easy to add Python libraries.
  • ✔ Install BeautifulSoup with the command:
pip install beautifulsoup4
  • ✔ Install Requests to retrieve web pages with the command:
pip install requests
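
✅ To check that everything is in place, a quick sanity-check script like this one (a minimal sketch) should run without errors and print the installed versions:

# Both imports should succeed once the installations are done
import requests
import bs4

print("requests version:", requests.__version__)
print("beautifulsoup4 version:", bs4.__version__)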

How to web scrape with Python and BeautifulSoup?

Follow our tutorial for a simple web scraping project.

How web scraping works in Python with BeautifulSoup. ©Christina for Alucare.fr

Project: retrieve the title of a page and all the links it contains.

Step 1: Retrieve page content with Requests

To perform an HTTP GET request to a URL, use the Requests library.

📌 When you send an HTTP request with Requests, the server always returns a status code. These codes indicate whether the request was successful or not.

200: success.
301 / 302: redirection.
404: page not found.
500: internal server error.

With Requests, you can check the result of a request using the .status_code attribute. Here is an example that sends a request to bonjour.com, checks the status code, and displays a snippet of the HTML content if all is well:

import requests

# Target URL
url = "https://bonjour.com"

# Send a GET request
response = requests.get(url)

# Check status code
if response.status_code == 200:
    print("Success: the page has been retrieved!")
    html = response.text # HTML content of the page
    print("Extract HTML content:")
    print(html[:500]) # displays only the first 500 characters
else:
    print(f "Error: status code {response.status_code}")
  

Step 2: Analyze HTML code with BeautifulSoup

When you retrieve the content of a page with Requests (response.text), you get a character string containing all of the page's HTML code. To manipulate this HTML easily, we use BeautifulSoup to create a BeautifulSoup object.

📌 When passing raw HTML to BeautifulSoup, you should specify a parser (example: "html.parser"). This allows BeautifulSoup to interpret the HTML correctly and avoids warnings.

from bs4 import BeautifulSoup
import requests

url = "https://bonjour.com"
response = requests.get(url)
html = response.text

# Specifying the parser is recommended
soup = BeautifulSoup(html, "html.parser")

Step 3: Find and extract elements

Once you have transformed the HTML into a BeautifulSoup object, you can start searching for and retrieving the data you're interested in (HTML tags).

  • Use find() and find_all()
# Retrieve the <h1> title
h1 = soup.find("h1")
print(h1.get_text())

# Retrieve all the <a> links
liens = soup.find_all("a")
for lien in liens:
    print(lien.get_text(), lien.get("href"))
  • Target elements by attribute

You can refine the search by attributes such as class, id or any other HTML attribute.

⚠️ Note: in Python, we write class_ instead of class to avoid a conflict with the reserved keyword class.

# Retrieve a div with a specific id
container = soup.find("div", id="main")

# Retrieve all links with a specific class
liens_nav = soup.find_all("a", class_="nav-link")
  • Using CSS selectors with select()

For more precise searches, use select() with CSS selectors.

# All links in article titles
links_articles = soup.select("article h2 a")

# All <a> whose href attribute begins with "http".
links_http = soup.select('a[href^="http"]')

CSS selectors are very powerful when you want to target specific parts of a page without manually traversing all of the HTML.
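
Putting the three steps together, the whole project (page title plus every link) might look like the following sketch; bonjour.com stands in for whatever site you are targeting:

import requests
from bs4 import BeautifulSoup

url = "https://bonjour.com"  # replace with your target page
response = requests.get(url, timeout=20)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, "html.parser")

    # Page title
    title = soup.find("title")
    if title:
        print("Title:", title.get_text(strip=True))

    # All the links on the page
    for lien in soup.find_all("a"):
        print(lien.get_text(strip=True), "->", lien.get("href"))
else:
    print(f"Error: status code {response.status_code}")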

How to extract data from an HTML table with BeautifulSoup?

Extract data from an HTML table with BeautifulSoup. ©Christina for Alucare.fr

So far, we have seen how to retrieve titles, links, and text from a web page.

⚠ But real-world use cases are often more complex: extracting structured data such as tables or lists, handling pagination, and resolving common scraping errors. That's exactly what we're going to look at together.

Extract tables and lists

Websites often present their data in HTML tables (<table>, <tr>, <th>, <td>) or lists (<ul> / <ol>, <li>). To transform these structures into usable data, you need to learn how to traverse them row by row or element by element.

Whenever you want to extract an HTML table, the principle is simple:

  • ✅ Retrieve the headers (<th>) to identify the column headings.
  • ✅ Loop through each row (<tr>) and look for the cells (<td>) that contain the actual data.
  • ✅ Store the information in a list or dictionary.

For an HTML list (<ul> or <ol> with <li>):

  • ✅ Locate all the <li> tags with find_all.
  • ✅ Retrieve their content (text or link) and add it to a Python list.

In summary:

The <table>, <tr>, <th>, <td> tags are used to reconstruct a table.
The <ul> / <ol>, <li> tags turn an HTML list into a Python list.

Here's an example with a table:

              html = """
              <table>
                <tr>
                  <th>Last name</th>
                  <th>Age</th>
                  <th>Town</th>
                </tr>
                <tr>
                  <td>Alice</td>
                  <td>25</td>
                  <td>Paris</td>
                </tr>
                <tr>
                  <td>Bob</td>
                  <td>30</td>
                  <td>Lyon</td>
                </tr>
              </table>
              """
              
              # Create BeautifulSoup object
              soup = BeautifulSoup(html, "html.parser")
              
              # Extract headers from array
              headers = [th.get_text(strip=True) for th in soup.find_all("th")]
              print("Headers:", headers)
              
              # Extract data rows (skip 1st row as these are the headers)
              rows = []
              for tr in soup.find_all("tr")[1:]:
                  cells = [td.get_text(strip=True) for td in tr.find_all("td")]
                  if cells:
                      rows.append(cells)
              
              print("Lines :", rows)
              

Here, find_all("th") retrieves the headers and find_all("td") retrieves the cells in each row. Looping over the <tr> tags rebuilds the table row by row.
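
Since the principle above mentions storing data in a dictionary, here's an optional variant (a small sketch reusing the headers and rows variables from the example) that pairs each cell with its column heading:

# Combine headers and rows into a list of dictionaries
records = [dict(zip(headers, row)) for row in rows]
print(records)
# [{'Last name': 'Alice', 'Age': '25', 'Town': 'Paris'},
#  {'Last name': 'Bob', 'Age': '30', 'Town': 'Lyon'}]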

Here's an example with a list:

from bs4 import BeautifulSoup

html_list = """
<ul>
  <li>Apple</li>
  <li>Banana</li>
  <li>Orange</li>
</ul>
"""

soup = BeautifulSoup(html_list, "html.parser")

# Retrieve the list items
items = [li.get_text(strip=True) for li in soup.find_all("li")]
print("Extracted list:", items)  # ["Apple", "Banana", "Orange"]

Here, each <li> is directly transformed into an element of a Python list, giving the result ["Apple", "Banana", "Orange"].

Manage pagination and links

In many cases, the data doesn't fit on a single page. It is spread across several pages via a “next page” link or numbered pagination (?page=1, ?page=2, ...).

📌 In both cases, you must loop over the pages to fetch them all and merge the data.

Example with a page parameter:

import time
import requests
from bs4 import BeautifulSoup

# Example of a URL with pagination
BASE_URL = "https://bonjour.com/articles?page={}"
HEADERS = {"User-Agent": "Mozilla/5.0"}

all_articles = []

# Assume 5 pages to browse
for page in range(1, 6):
    url = BASE_URL.format(page)
    r = requests.get(url, headers=HEADERS, timeout=20)
    if r.status_code == 200:
        soup = BeautifulSoup(r.text, "html.parser")
        # Extract the article titles
        articles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="title")]
        all_articles.extend(articles)
    else:
        print(f"Error on page {page} (code: {r.status_code})")
    time.sleep(1.0)  # politeness

print("Articles retrieved:", all_articles)

Brief explanation:

  • Prepare the URL with a {} placeholder to insert the page number.
BASE_URL = "https://bonjour.com/articles?page={}"
  • Some websites block requests without a “browser identity.” Adding a User-Agent prevents you from being mistaken for a bot.
HEADERS = {"User-Agent": "Mozilla/5.0"}
requests.get(url, headers=HEADERS)
  • Loop from page 1 to 5.
for page in range(1, 6):
  • Retrieve the page's HTML.
requests.get(url)
  • Limit the waiting time if the site doesn't respond.
requests.get(url, timeout=20)
  • Parse the page.
BeautifulSoup(r.text, "html.parser")
  • Retrieve all the article titles.
find_all("h2", class_="title")
  • Add the items found to a global list.
all_articles.extend(articles)
  • Introduce a pause between requests to avoid overloading the server and being banned.
time.sleep(1.0)
  • After the loop, all_articles contains the titles from all 5 pages.

Common mistakes and challenges

❗ Scraping isn't always as simple as pressing a button and watching everything work. You may encounter frequent obstacles such as:

  • HTTP errors

404: page not found
403: access forbidden
500: server-side error

Example:

response = requests.get(url)
if response.status_code == 200:
    # Page OK
    print("Page retrieved successfully")
elif response.status_code == 404:
    print("Error: page not found")
else:
    print("Code returned:", response.status_code)
              
  • Sites that block scraping

Some sites detect automated requests and block access.

  • Dynamic pages (JavaScript)

BeautifulSoup only reads static HTML. If the page loads its content with JavaScript, you'll see nothing.

✅ In this case, use tools such as Selenium or Playwright.

On the other hand, if you want to scrape efficiently without getting blocked or damaging the site, here are the best practices (a short sketch follows the list):

  • ✔ Respect the website's robots.txt file.
  • ✔ Space out your requests to avoid overloading the server (using time.sleep()).
  • ✔ Use proxies and rotate them.
  • ✔ Regularly change your User-Agent.
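
As an illustration of the last points, here's a minimal sketch (the URL and the User-Agent strings are placeholders, not values from a real project) that spaces out requests and rotates the User-Agent:

import random
import time

import requests

# Hypothetical pool of User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

urls = [f"https://bonjour.com/articles?page={p}" for p in range(1, 4)]

for url in urls:
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # new identity each time
    r = requests.get(url, headers=headers, timeout=20)
    print(url, "->", r.status_code)
    time.sleep(2.0)  # pause between requests so the server isn't overloaded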

Web scraping with Selenium and BeautifulSoup?

Web scraping with Selenium and BeautifulSoup on Chrome. ©Christina for Alucare.fr

A reminder: BeautifulSoup is an excellent HTML parser, but it cannot execute the JavaScript on a web page. That's where Selenium comes in handy!

Basically, Selenium controls a real browser: it executes JavaScript and displays the page as if a human were browsing. BeautifulSoup then analyzes the HTML once the page is fully rendered, so you can extract whatever you want.

Step 1: Install Selenium and BeautifulSoup

Here, instead of using the Requests library, we will use Selenium. To install it, go through pip:

pip install selenium beautifulsoup4

Next, you need to download and install a WebDriver that matches your browser version (e.g. ChromeDriver for Google Chrome).

✅ You can either place it in the same folder as your Python script or add it to your system's PATH environment variable.
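
If you'd rather not touch the PATH, Selenium 4 also lets you point to the driver explicitly; a small sketch, assuming ChromeDriver sits at a path of your choosing:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical location of the ChromeDriver executable
service = Service("/path/to/chromedriver")
driver = webdriver.Chrome(service=service)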

Step 2: Configure Selenium

First and foremost, import webdriver from Selenium to control a browser.

from selenium import webdriver
from selenium.webdriver.common.by import By

Next, launch a browser. This opens the web page and executes its JavaScript (example: Chrome).

driver = webdriver.Chrome()

Then tell the browser which page to visit.

driver.get("https://www.exemple.com")

If the page takes a while to display certain elements, you can tell Selenium to wait a little.

driver.implicitly_wait(10)
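
Implicit waits apply globally. If you prefer to wait for one specific element, Selenium's explicit waits are an option; a small sketch (waiting up to 10 seconds for an <h2> to appear, reusing the driver and the By import from above):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until at least one <h2> is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "h2"))
)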

Step 3: Retrieve the page content

Once the page is loaded, retrieve the full DOM (the HTML source code after JavaScript execution).

html_content = driver.page_source

Step 4: HTML analysis with BeautifulSoup

Now pass this source code to BeautifulSoup so you can work with it:

from bs4 import BeautifulSoup

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Example: retrieve all the page's titles
titles = soup.find_all('h2')
for title in titles:
    print(title.get_text())

👉 BeautifulSoup offers powerful methods like find(), find_all(), and CSS selectors to target and extract HTML elements.

Step 5: Close the browser

Very important: always close your browser at the end of the program to free up resources!

driver.quit()

✅ And there you have it! You can now combine the power of Selenium for simulating human navigation (clicks, scrolls, etc.) with the efficiency of BeautifulSoup for HTML analysis.
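
To recap, here's what the five steps could look like combined into a single script (www.exemple.com is the placeholder URL used above; a try/finally guarantees the browser is closed even if something fails):

from bs4 import BeautifulSoup
from selenium import webdriver

# Launch Chrome (assumes a matching ChromeDriver is installed)
driver = webdriver.Chrome()

try:
    driver.get("https://www.exemple.com")  # the page to scrape
    driver.implicitly_wait(10)  # give slow elements time to appear

    # Full DOM after JavaScript execution
    html_content = driver.page_source

    # Hand the rendered HTML to BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")
    for title in soup.find_all("h2"):
        print(title.get_text(strip=True))
finally:
    driver.quit()  # always free the browser resources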

FAQs

What's the best tool for web scraping in Python?

There is no single best universal tool, only solutions suited to your project.

🔥 BeautifulSoup: a simple and effective HTML parser for extracting content quickly. Ideal for beginners and small projects.

🔥 Scrapy: a comprehensive framework designed to manage large volumes of data, with advanced features.

🔥 Playwright: perfect for complex JavaScript-generated sites, as it simulates a real browser and lets you interact with the page like a human.

How to use BeautifulSoup to extract content from a <div> tag?

With BeautifulSoup, you can target a specific tag with a CSS selector. To extract content from a <div>, here are the steps:

1. Retrieve the page with Requests, then parse it with BeautifulSoup.

from bs4 import BeautifulSoup
import requests

url = "URL_OF_YOUR_SITE"  # Replace with the real URL
response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, "html.parser")

2. Use the select() method, passing it your CSS selector, to target the <div>.

To retrieve the first matching element, use soup.select_one.
To retrieve all matching elements, use soup.select.

HTML example:

<div class="article">
  <h2>Article title</h2>
  <p>Here's what the paragraph says.</p>
</div>

Example with a CSS selector:

# Retrieve the first div with the "article" class
div_article = soup.select_one("div.article")

# Display its text content
if div_article:
    print(div_article.get_text(strip=True))

Here, the CSS selector is div.article.

3. Extract elements inside the <div>.

# Retrieve the title inside the div
title = soup.select_one("div.article h2").get_text()

# Retrieve the paragraph inside the div
paragraph = soup.select_one("div.article p").get_text()

print("Title:", title)
print("Paragraph:", paragraph)

How do I use Requests and BeautifulSoup together?

These two libraries are complementary.

1. Requests retrieves the content of a web page with an HTTP request.

It sends an HTTP request to the target site and downloads the page's raw HTML code.

import requests

url = "https://sitecible.com"
response = requests.get(url)  # HTTP request
print(response.text)  # displays the raw HTML

At this stage, all you have is a huge block of text full of tags (<html>, <div>, <p>, etc.).

2. BeautifulSoup parses this HTML content to extract what interests you.

It takes the raw HTML and transforms it into an organized structure, which lets you easily navigate within the HTML: locate, extract, and retrieve data.

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")  # parses the HTML
title = soup.find("h1").get_text()  # extracts the content of an <h1>
print(title)

Why doesn't my web scraping code work on some sites?

Sometimes your script retrieves nothing, because some sites don't deliver all of their content directly in the HTML.

These sites use JavaScript to load data dynamically, and BeautifulSoup cannot analyze data rendered by JavaScript.

In this case, turn to tools such as Playwright or Selenium.

What role does BeautifulSoup play in web scraping?

BeautifulSoup acts as an HTML parser.

It takes the source code of a page as plain text and transforms it into a structured object that you can easily browse.

Without this library, you'd be facing one huge block of unreadable text. Simply put, BeautifulSoup is the translator between raw HTML and your Python code.

Web scraping: BeautifulSoup vs Scrapy?

BeautifulSoup and Scrapy are very different, although both are used for web scraping.

BeautifulSoup: a simple library for parsing HTML and extracting data.
Scrapy: a complete framework that manages the entire scraping process (requests, link following, pagination, data export, error handling).

In summary, BeautifulSoup makes HTML data extraction in Python easier. This library is perfect for beginners, as it makes scraping quick and easy.

Otherwise, if you'd rather not code at all, the comprehensive tool Bright Data is also an excellent solution for web scraping.

👉 Now, tell us in the comments what you managed to scrape!
