How to Scrape Data from a Website

Written by Mantas Kemėšius

Web scraping means automatically collecting information from web pages using code — no manual copying, no spreadsheets. Whether you’re tracking prices, analyzing markets, or gathering research data, Python makes it surprisingly easy to turn websites into structured datasets.

In this guide, you’ll learn how to scrape data from a website using Python, step by step.

We’ll explore:

  • BeautifulSoup — for static pages
  • Selenium — for JavaScript-rendered content
  • FoxScrape API — for scaling, proxies, and protected pages

    By the end, you’ll know how to extract data, clean it, and save it — all using Python.

    🌍 Why Web Scraping Matters

    Web scraping powers countless real-world applications:

  • 💰 Price tracking — monitor competitor prices or market shifts
  • 📈 Trend analysis — gather public data for research or forecasting
  • 🧭 Lead generation — collect listings or company info from directories
  • 🧾 Academic research — automate the collection of structured data

    Used responsibly, web scraping helps developers and analysts make data-driven decisions faster and at scale.

    ⚖️ Important: Always scrape only public, non-sensitive data, and follow the website’s terms of service. Avoid private or restricted information.

    🔍 Understanding How Websites Work

    Before you can scrape data, you need to know what you’re looking at.

    Every website is built from HTML — a structured document containing elements like <div>, <p>, <span>, <table>, and so on. These tags define where data lives.

    Here’s a simple example:

    HTML
    <div class="product">
      <h2>Blue T-shirt</h2>
      <span class="price">$15.99</span>
    </div>

    When you scrape data, your goal is to read this structure and extract the parts you need — such as product titles, prices, or links.

    To find the right elements:

  • Right-click the item in your browser.
  • Choose Inspect or Inspect Element.
  • Note its tag (<div>, <h2>, etc.) and class (e.g., "product").

    This inspection process is the secret to writing accurate scrapers.

    ⚙️ Setting Up Your Python Environment

    Before you start coding, make sure your environment is ready.

    🧰 You’ll need:

  • Python 3.10+
  • pip (Python package manager)
  • A code editor like VSCode or PyCharm

    📦 Install required packages:

    BASH
    pip install requests beautifulsoup4 pandas lxml

    Optional (for advanced scraping):

    BASH
    pip install selenium

    🧩 What these tools do:

    Package         Purpose
    requests        Downloads web pages (HTML).
    BeautifulSoup   Parses and extracts content from HTML.
    pandas          Cleans and structures scraped data.
    selenium        Automates browsers to load dynamic content.
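
    To confirm everything installed correctly, a quick sanity check like the following can help (selenium is wrapped in a try/except since it is optional):

    PYTHON
    # Quick sanity check: print the installed version of each package
    import requests
    import bs4
    import pandas

    print("requests:", requests.__version__)
    print("beautifulsoup4:", bs4.__version__)
    print("pandas:", pandas.__version__)

    try:
        import selenium
        print("selenium:", selenium.__version__)
    except ImportError:
        print("selenium: not installed (optional)")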

    🧾 Scraping Data from a Static Website

    Let’s start with the simplest and most common scenario — scraping a static webpage.

    Imagine a product listing page like this:

    HTML
    <div class="product">
      <h2>Blue T-shirt</h2>
      <span class="price">$15.99</span>
    </div>
    <div class="product">
      <h2>Red Hoodie</h2>
      <span class="price">$29.99</span>
    </div>

    We can extract both titles and prices using requests and BeautifulSoup.

    🧑‍💻 Example Code

    PYTHON
    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/products"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")

    items = soup.find_all("div", class_="product")

    for item in items:
        title = item.find("h2").text.strip()
        price = item.find("span", class_="price").text.strip()
        print(title, price)

    🧩 How It Works:

  • requests.get(url) → Fetches the raw HTML from the page.
  • BeautifulSoup(html, "lxml") → Parses the HTML.
  • find_all("div", class_="product") → Finds all product containers.
  • item.find("h2") and .find("span") → Extract title and price text.

    Output:

    PLAIN TEXT
    Blue T-shirt $15.99
    Red Hoodie $29.99

    Tip: If your script doesn’t find any results, double-check the class name in your browser’s “Inspect” view — even a small mismatch breaks the selector.
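
    If a selector might not match on every item, a guarded version avoids AttributeError crashes. Here is a minimal sketch, reusing the product markup from the example above:

    PYTHON
    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/products").text
    soup = BeautifulSoup(html, "lxml")

    # Defensive extraction: skip items whose title or price is missing
    for item in soup.find_all("div", class_="product"):
        title_tag = item.find("h2")
        price_tag = item.find("span", class_="price")
        if title_tag is None or price_tag is None:
            continue  # selector mismatch; recheck the class name in Inspect
        print(title_tag.text.strip(), price_tag.text.strip())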

    💾 Saving and Structuring the Data

    Extracting data is only half the job — you’ll usually want to save it for later use.

    PYTHON
    import pandas as pd

    data = []
    for item in items:
        title = item.find("h2").text.strip()
        price = item.find("span", class_="price").text.strip()
        data.append({"Title": title, "Price": price})

    df = pd.DataFrame(data)
    df.to_csv("products.csv", index=False)

    print("Data saved to products.csv")

    Output file:

    PLAIN TEXT
    Title,Price
    Blue T-shirt,$15.99
    Red Hoodie,$29.99

    You can also export data to:

    PYTHON
    df.to_excel("products.xlsx", index=False)   # requires the openpyxl package
    df.to_json("products.json", orient="records")

    🧩 Handling Common Issues

    When scraping, not everything goes smoothly. Here’s how to fix the usual culprits:

    Problem          Cause                   Solution
    Empty data       Page uses JavaScript    Use Selenium or FoxScrape
    HTTP 403         Site blocks bots        Add headers or rotate proxies
    Missing values   Wrong selector          Recheck HTML structure
    Slow scraping    Too many requests       Add delays or batching
    Encoding error   Non-UTF-8 content       Set response.encoding = 'utf-8'
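
    The last two rows are easy to handle in code. A minimal sketch that paces requests and forces UTF-8 decoding (the page URLs are placeholders):

    PYTHON
    import time
    import requests

    urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders

    for url in urls:
        response = requests.get(url)
        response.encoding = "utf-8"  # override a misdetected charset
        print(url, len(response.text))
        time.sleep(1)  # polite delay between requests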

    Adding a Custom User-Agent

    Many sites block requests without a browser signature.

    PYTHON
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
    html = requests.get(url, headers=headers).text

    This simple trick avoids basic bot blocks.

    ⚡ Scraping Dynamic Websites (JavaScript-Rendered Data)

    Some sites load data dynamically with JavaScript — meaning the data isn’t present in the initial HTML.

    If you inspect the page source and don’t see the data, but it appears in the browser, you’re dealing with a dynamic page.

    Option 1: Selenium (Browser Automation)

    Selenium opens a real browser window, loads the page, runs scripts, and lets you access the fully rendered HTML.

    PYTHON
    from selenium import webdriver
    from bs4 import BeautifulSoup
    import time

    driver = webdriver.Chrome()
    driver.get("https://example.com/products")
    time.sleep(3)  # wait for JS to load

    html = driver.page_source
    soup = BeautifulSoup(html, "lxml")

    items = soup.find_all("div", class_="product")
    for item in items:
        print(item.text)

    driver.quit()

    ✅ Pros: Works for most dynamic pages.

    ⚠️ Cons: Slow, requires browser setup, not ideal for large-scale scraping.
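
    A fixed time.sleep(3) is fragile: too short on slow pages, wasteful on fast ones. Selenium’s explicit waits pause only until the target elements actually appear. A minimal sketch, assuming the same div.product markup as above:

    PYTHON
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://example.com/products")

    # Wait up to 10 seconds for at least one product to be rendered
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.product"))
    )

    for item in driver.find_elements(By.CSS_SELECTOR, "div.product"):
        print(item.text)

    driver.quit()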

    Option 2: Using FoxScrape API (Simple & Scalable)

    If you don’t want to deal with browser automation or proxy headaches, the FoxScrape API is a modern alternative.

    It acts like a cloud browser, executes JavaScript, rotates IPs, and returns rendered HTML in one API call.

    PYTHON
    import requests
    from bs4 import BeautifulSoup

    response = requests.get(
        "https://www.foxscrape.com/api/v1",
        params={
            "url": "https://example.com/products",
            "render_js": "true"
        }
    )

    html = response.text
    soup = BeautifulSoup(html, "lxml")

    products = soup.find_all("div", class_="product")
    for p in products:
        print(p.text)

    Why it’s useful:

  • Handles JavaScript rendering automatically.
  • No setup, proxies, or browser drivers.
  • Built for speed and scalability.

    If you’re scraping hundreds of pages or facing anti-bot systems, this approach saves hours of maintenance time (see the multi-page sketch below).
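
    A minimal sketch of scaling up, looping several target pages through the same endpoint (the page URLs are placeholders):

    PYTHON
    import requests
    from bs4 import BeautifulSoup

    # Placeholder list of pages to scrape through the API
    targets = [f"https://example.com/products?page={n}" for n in range(1, 4)]

    for target in targets:
        response = requests.get(
            "https://www.foxscrape.com/api/v1",
            params={"url": target, "render_js": "true"},
        )
        soup = BeautifulSoup(response.text, "lxml")
        for p in soup.find_all("div", class_="product"):
            print(target, p.text.strip())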

    🧹 Cleaning, Transforming, and Exporting Data

    Once your data is loaded into pandas, you can easily clean it:

    PYTHON
    # Example cleaning operations
    df["Price"] = df["Price"].str.replace("$", "", regex=False).astype(float)
    df = df.drop_duplicates()
    df = df.fillna("N/A")

    # Export
    df.to_csv("cleaned_products.csv", index=False)
    df.to_excel("cleaned_products.xlsx", index=False)
    df.to_json("cleaned_products.json", orient="records")

    This turns raw HTML text into a dataset ready for analysis or visualization.

    🧭 Best Practices for Ethical Scraping

    Responsible scraping keeps your scripts efficient and compliant.

    ✅ Do:

  • Check each site’s robots.txt (see the sketch after this list)
  • Identify your scraper with a User-Agent
  • Add time.sleep(1) between requests
  • Use caching for repeated scrapes
  • Cite your data sources

    ❌ Don’t:

  • Scrape private or personal data
  • Send excessive requests to one domain
  • Ignore copyright or data-use restrictions

    🦊 Pro Tip: FoxScrape automatically respects rate limits and rotates proxies — a simple way to stay safe while scraping at scale.
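
    Checking robots.txt is easy to automate with Python’s standard library. A minimal sketch using urllib.robotparser (the bot name is hypothetical):

    PYTHON
    import time
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    url = "https://example.com/products"
    if rp.can_fetch("MyScraperBot/1.0", url):
        # ...fetch and parse the page here...
        time.sleep(1)  # polite delay before the next request
    else:
        print("Disallowed by robots.txt:", url)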

    🧩 Advanced Example: Scraping and Analyzing Data Together

    Here’s a practical mini-project: Scrape a site’s product prices and analyze them with pandas.

    PYTHON
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    url = "https://example.com/products"
    html = requests.get(url).text
    soup = BeautifulSoup(html, "lxml")

    data = []
    for item in soup.find_all("div", class_="product"):
        title = item.find("h2").text.strip()
        price = float(item.find("span", class_="price").text.strip().replace("$", ""))
        data.append({"Title": title, "Price": price})

    df = pd.DataFrame(data)
    print(df.describe())

    Output:

    PLAIN TEXT
              Price
    count   12.0000
    mean    28.4900
    min     10.9900
    max     49.9900

    You’ve just gone from raw HTML to usable statistics — all in under 30 lines of Python.

    🏁 Conclusion

    Let’s recap the three main approaches:

    Type                  Best Tool                  Description
    Static pages          BeautifulSoup + requests   Simple, fast, and lightweight
    JavaScript-rendered   Selenium                   Reliable but slower
    Protected or dynamic  FoxScrape API              Cloud-powered, scalable, effortless

    With these methods, you can extract almost any data — product listings, articles, prices, tables, reviews — from any public website.

    The key is to start small, understand your targets, and scale responsibly.

    ⚡ Next step: Try scraping your favorite site.

    For complex pages, skip browser setup — just send the URL to

    https://www.foxscrape.com/api/v1?url=<your-site>&render_js=true

    and get clean, rendered HTML instantly.

    Happy scraping — ethically, efficiently, and with a little help from 🦊 FoxScrape.