Python Web Scraping: Full Tutorial With Examples

Written by
Mantas Kemėšius

Hey there, data enthusiast! 👋 Welcome to your ultimate guide to web scraping with Python. Whether you're building a price comparison tool, gathering research data, or just curious about how to extract information from websites, you're in the right place.

In this tutorial, we'll walk through everything from the absolute basics to advanced techniques, and yes, we'll show you real code examples you can actually use. Let's dive in!

🎯 The 6-Step Web Scraping Process

Before we jump into code, let's understand the roadmap for any successful web scraping project:

  • Understand the Website's Structure: Open up your browser's DevTools (F12) and inspect the HTML to identify the elements you want to scrape.
  • Set Up Your Python Environment: Install Python 3 and create a virtual environment to keep your dependencies organized.
  • Choose Your Tools: For beginners, we recommend starting with Requests (for fetching web pages) and BeautifulSoup (for parsing HTML).
  • Handle Pagination & Dynamic Content: Some sites load content with JavaScript. For these, you'll need tools like Selenium or Playwright.
  • Respect the Rules: Always check robots.txt and follow legal guidelines. Be a good web scraping citizen! 🤝
  • Optimize & Scale: When you're ready to go big, explore frameworks like Scrapy or use Asyncio for concurrent requests.

    🔧 Method 1: The Low-Level Approach (Manual Socket & Regex)

    Let's start with the absolute basics: sending HTTP requests over a raw TCP socket and parsing the response with regular expressions.

    Is this practical? Not really. It's educational, but cumbersome for real projects. Think of it as learning to build a car engine before you drive. 🚗

    PYTHON
    import socket
    import re

    # Create a TCP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("example.com", 80))

    # Send HTTP GET request (Connection: close so the server ends the response)
    request = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
    sock.send(request.encode())

    # Receive the response (a single recv() is enough for this tiny page)
    response = sock.recv(4096).decode()
    sock.close()

    # Parse with regex (messy!)
    titles = re.findall(r'<title>(.*?)</title>', response)
    print(titles)

    βš™οΈ Method 2: Urllib3 & LXML (Intermediate Level)

    Now we're getting somewhere! Urllib3 gives you more control over HTTP requests, while LXML lets you parse HTML using XPath expressions.

    PYTHON
    import urllib3
    from lxml import html

    # Create a PoolManager
    http = urllib3.PoolManager()

    # Fetch the page
    response = http.request('GET', 'https://news.ycombinator.com/')
    tree = html.fromstring(response.data)

    # Extract data with XPath
    titles = tree.xpath('//span[@class="titleline"]/a/text()')
    for title in titles:
        print(title)

    When to use this: When you need advanced connection pooling or prefer XPath over CSS selectors.
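
    For example, here's a minimal sketch of what that extra control can look like: one reusable PoolManager configured with a retry policy and timeouts (the specific numbers are arbitrary placeholders, not recommendations):

    PYTHON
    import urllib3

    # A single PoolManager reuses connections across requests to the same host
    http = urllib3.PoolManager(
        num_pools=10,
        retries=urllib3.Retry(total=3, backoff_factor=0.5),
        timeout=urllib3.Timeout(connect=2.0, read=5.0)
    )

    response = http.request('GET', 'https://news.ycombinator.com/')
    print(response.status)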

    🌟 Method 3: Requests & BeautifulSoup (The Beginner's Power Duo)

    This is the gold standard for beginners. It's clean, readable, and gets the job done beautifully.

    Step 1: Install the libraries

    BASH
    pip install requests beautifulsoup4

    Step 2: Scrape Hacker News

    PYTHON
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page
    url = 'https://news.ycombinator.com/'
    response = requests.get(url)

    # Parse with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract story titles
    titles = soup.select('span.titleline > a')
    for title in titles:
        print(title.get_text())
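
    Handling Pagination

    Listings like Hacker News spread results across multiple pages. Here's a minimal sketch that walks the first few pages, assuming the site's ?p= query parameter selects the page; check the site's own pagination links and adjust the URL pattern and selector for other sites:

    PYTHON
    import time
    import requests
    from bs4 import BeautifulSoup

    all_titles = []
    for page in range(1, 4):  # scrape the first three pages
        response = requests.get('https://news.ycombinator.com/news', params={'p': page})
        soup = BeautifulSoup(response.text, 'html.parser')
        all_titles += [a.get_text() for a in soup.select('span.titleline > a')]
        time.sleep(1)  # small pause between pages to stay polite

    print(f"Collected {len(all_titles)} titles")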

    Handling Authentication with Sessions

    Need to log in? Use Session objects to maintain cookies:

    PYTHON
    import requests

    session = requests.Session()

    # Login
    login_data = {'username': 'user', 'password': 'pass'}
    session.post('https://example.com/login', data=login_data)

    # Now make authenticated requests
    response = session.get('https://example.com/protected')
    print(response.text)

    Storing Data in CSV

    PYTHON
    import csv

    data = [
        {'title': 'Example 1', 'link': 'https://example.com/1'},
        {'title': 'Example 2', 'link': 'https://example.com/2'}
    ]

    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link'])
        writer.writeheader()
        writer.writerows(data)

    Storing Data in PostgreSQL

    PYTHON
    import psycopg2

    conn = psycopg2.connect(
        host="localhost",
        database="scraping_db",
        user="user",
        password="password"
    )

    cur = conn.cursor()
    cur.execute("INSERT INTO articles (title, link) VALUES (%s, %s)",
                ("Example Title", "https://example.com"))
    conn.commit()
    cur.close()
    conn.close()

    ⚑ Scaling with Asyncio

    Requests is synchronous, meaning it waits for each request to complete. For scraping many pages, use aiohttp to send requests concurrently:

    PYTHON
    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.text()

    async def main():
        urls = ['https://example.com/page1', 'https://example.com/page2']
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, url) for url in urls]
            results = await asyncio.gather(*tasks)
            print(results)

    asyncio.run(main())
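
    When you scale this up to many URLs, you'll usually want to cap how many requests are in flight at once so you don't hammer the target site. One common pattern is an asyncio.Semaphore; here's a minimal sketch with an arbitrary limit of 5 and placeholder URLs:

    PYTHON
    import asyncio
    import aiohttp

    CONCURRENCY_LIMIT = 5  # arbitrary cap on simultaneous requests

    async def fetch(session, semaphore, url):
        async with semaphore:  # wait for a free slot before sending the request
            async with session.get(url) as response:
                return await response.text()

    async def main():
        urls = [f'https://example.com/page{i}' for i in range(1, 21)]
        semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)
        async with aiohttp.ClientSession() as session:
            pages = await asyncio.gather(*(fetch(session, semaphore, url) for url in urls))
            print(f"Fetched {len(pages)} pages")

    asyncio.run(main())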

    🦊 Method 4: Web Scraping APIs (FoxScrape to the Rescue!)

    Here's the truth: as your scraping needs grow, you'll face challenges like:

  • 🚫 Anti-bot protection (Cloudflare, Captchas)
  • 🌐 JavaScript-heavy websites
  • 🔄 IP rotation and proxy management
  • ⚑ Scaling to thousands of requests

    This is where FoxScrape comes in. Instead of building and maintaining complex infrastructure, FoxScrape handles all the heavy lifting for you through a simple API.

    Why Choose FoxScrape?

  • Bypass Anti-Bot Protection: FoxScrape automatically rotates IPs and handles Cloudflare challenges
  • JavaScript Rendering: Get fully rendered pages, even from dynamic sites
  • Simple Integration: Easy-to-use Python client with CSS selector support
  • AI-Powered Extraction: Use natural language prompts to extract exactly what you need
  • Production-Ready: Scale to millions of requests without managing infrastructure

    Getting Started with FoxScrape

    PYTHON
    import requests

    api_key = 'YOUR_FOXSCRAPE_API_KEY'
    url = 'https://www.foxscrape.com/api/v1'

    params = {
        'api_key': api_key,
        'url': 'https://news.ycombinator.com/',
        'render_js': 'true',
        'extract_rules': {
            'titles': {
                'selector': 'span.titleline > a',
                'type': 'list',
                'output': 'text'
            }
        }
    }

    response = requests.post(url, json=params)
    data = response.json()

    # Get your extracted data
    titles = data['titles']
    for title in titles:
        print(title)

    AI-Powered Extraction Example

    Want to extract data using natural language? FoxScrape's AI extraction makes it incredibly easy:

    PYTHON
    params = {
        'api_key': api_key,
        'url': 'https://example.com/product',
        'render_js': 'true',
        'ai_extract_rules': {
            'product_name': 'Extract the product title',
            'price': 'Extract the current price',
            'rating': 'Extract the average rating'
        }
    }

    response = requests.post('https://foxscrape.com/api/v1', json=params)
    data = response.json()

    print(f"Product: {data['product_name']}")
    print(f"Price: {data['price']}")
    print(f"Rating: {data['rating']}")

    No CSS selectors. No XPath. Just tell FoxScrape what you want in plain English! 🎯

    Ready to try FoxScrape? Visit https://www.foxscrape.com and get started with a free trial today!

    🚀 Method 5: Web Crawling Frameworks (For Large-Scale Projects)

    When you need to scrape hundreds or thousands of pages, frameworks provide structure, performance, and built-in features:

    Scrapy: The Industry Standard

    Scrapy is a complete framework for web crawling. It handles parallelism, throttling, and error handling automatically.

    BASH
    pip install scrapy

    PYTHON
    import scrapy

    class HackerNewsSpider(scrapy.Spider):
        name = 'hackernews'
        start_urls = ['https://news.ycombinator.com/']

        def parse(self, response):
            for article in response.css('span.titleline'):
                yield {
                    'title': article.css('a::text').get(),
                    'link': article.css('a::attr(href)').get()
                }

    Save the spider as hackernews_spider.py and run it with: scrapy runspider hackernews_spider.py -o output.json (inside a full Scrapy project you'd use scrapy crawl hackernews -o output.json instead).
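
    Scrapy's behaviour is driven by settings. As a rough sketch, you can tune politeness and concurrency per spider through custom_settings (these are standard Scrapy settings, but the values below are only illustrative):

    PYTHON
    import scrapy

    class PoliteHackerNewsSpider(scrapy.Spider):
        name = 'hackernews_polite'
        start_urls = ['https://news.ycombinator.com/']

        # Per-spider overrides of Scrapy's built-in throttling and retry settings
        custom_settings = {
            'DOWNLOAD_DELAY': 1.0,               # pause between requests
            'CONCURRENT_REQUESTS_PER_DOMAIN': 2,
            'AUTOTHROTTLE_ENABLED': True,        # adapt the delay to server response times
            'RETRY_TIMES': 3,
        }

        def parse(self, response):
            for article in response.css('span.titleline'):
                yield {'title': article.css('a::text').get()}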

    PySpider: Web UI for Visual Debugging

    PySpider provides a user-friendly web interface for creating and debugging spiders:

    BASH
    pip install pyspider
    pyspider all

    Then open http://localhost:5000 in your browser and start building your spider visually!
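
    Inside the web UI you edit a handler script. For orientation, here's a minimal sketch based on PySpider's standard handler template (verify the details against your installed version):

    PYTHON
    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {}

        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('https://news.ycombinator.com/', callback=self.index_page)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            # response.doc is a PyQuery object
            return {
                'titles': [a.text() for a in response.doc('span.titleline > a').items()]
            }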

    🌐 Method 6: Headless Browsing (For JavaScript-Heavy Sites)

    Some websites load content dynamically with JavaScript. For these, you need a headless browser:

    Selenium with Chrome

    PYTHON
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Set up Chrome in headless mode
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    # Navigate and extract
    driver.get('https://example.com')
    titles = driver.find_elements(By.CSS_SELECTOR, 'h1')
    for title in titles:
        print(title.text)

    driver.quit()

    Playwright: Modern & Fast

    PYTHON
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://example.com')

        titles = page.query_selector_all('h1')
        for title in titles:
            print(title.inner_text())

        browser.close()
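
    If the content you need only appears after JavaScript runs, wait for it explicitly before reading the page. A small sketch using Playwright's wait_for_selector (the div.results selector is just a placeholder):

    PYTHON
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://example.com')

        # Block until the JavaScript-rendered element shows up (placeholder selector)
        page.wait_for_selector('div.results')
        print(page.inner_text('div.results'))

        browser.close()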

    🎯 Method 7: Using Website APIs (The Smartest Approach)

    Before scraping HTML, always check if the website has a public API. It's faster, more reliable, and typically explicitly permitted under the site's terms of service.

    PYTHON
    import requests

    # Example: Reddit API
    response = requests.get('https://www.reddit.com/r/python/top.json?limit=10',
                            headers={'User-Agent': 'Python Tutorial'})
    data = response.json()

    for post in data['data']['children']:
        print(post['data']['title'])

    πŸ›‘οΈ Avoiding Anti-Bot Technology: The Balancing Act

    Websites use various techniques to detect and block scrapers. Here's how to stay under the radar:

    1. Respect robots.txt

    PYTHON
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://example.com/robots.txt')
    rp.read()

    can_scrape = rp.can_fetch('*', 'https://example.com/page')
    print(f"Can scrape: {can_scrape}")

    2. Rotate User Agents

    PYTHON
    import requests
    import random

    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    ]

    headers = {'User-Agent': random.choice(user_agents)}
    response = requests.get('https://example.com', headers=headers)

    3. Use Proxies

    PYTHON
    import requests

    proxies = {
        'http': 'http://proxy.example.com:8080',
        'https': 'http://proxy.example.com:8080'
    }

    response = requests.get('https://example.com', proxies=proxies)
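
    In practice you'll usually rotate through a pool of proxies rather than rely on a single endpoint. A minimal sketch (the proxy hostnames are placeholders for whatever provider you use):

    PYTHON
    import random
    import requests

    # Hypothetical proxy pool; swap in your own proxy endpoints
    proxy_pool = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080',
        'http://proxy3.example.com:8080',
    ]

    # Pick a different proxy for each request
    proxy = random.choice(proxy_pool)
    response = requests.get('https://example.com', proxies={'http': proxy, 'https': proxy})
    print(response.status_code)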

    4. Undetected ChromeDriver

    For sites with advanced bot detection:

    PYTHON
    import undetected_chromedriver as uc

    driver = uc.Chrome()
    driver.get('https://example.com')
    # Your scraping code here
    driver.quit()

    5. Or... Just Use FoxScrape! 🦊

    All of these anti-bot techniques require constant maintenance and updates. FoxScrape handles all of this automatically, so you can focus on extracting data instead of fighting with websites.

    🎓 Final Thoughts

    Web scraping in Python is an incredibly powerful skill. You've now learned:

  • ✅ The fundamentals of HTTP requests and HTML parsing
  • ✅ Multiple approaches from low-level to high-level
  • ✅ How to handle authentication, pagination, and dynamic content
  • ✅ Strategies for avoiding detection
  • ✅ When to use APIs like FoxScrape to save time and effort

    Start with Requests + BeautifulSoup for simple projects. As your needs grow, consider frameworks like Scrapy or APIs like FoxScrape for production-grade scraping.

    Happy scraping! 🚀

    Want to skip the complexity and start scraping right away? Try FoxScrape today and get your first 1,000 requests free!