Python Web Scraping: Full Tutorial With Examples

Written by Mantas KemΔ—Ε‘ius

Hey there, data enthusiast! πŸ‘‹ Welcome to your ultimate guide to web scraping with Python. Whether you're building a price comparison tool, gathering research data, or just curious about how to extract information from websites, you're in the right place.

In this tutorial, we'll walk through everything from the absolute basics to advanced techniquesβ€”and yes, we'll show you real code examples you can actually use. Let's dive in!

🎯 The 6-Step Web Scraping Process

Before we jump into code, let's understand the roadmap for any successful web scraping project:

  • Understand the Website's Structure: Open up your browser's DevTools (F12) and inspect the HTML to identify the elements you want to scrape.
  • Set Up Your Python Environment: Install Python 3 and create a virtual environment to keep your dependencies organized.
  • Choose Your Tools: For beginners, we recommend starting with Requests (for fetching web pages) and BeautifulSoup (for parsing HTML).
  • Handle Pagination & Dynamic Content: Some sites load content with JavaScript. For these, you'll need tools like Selenium or Playwright.
  • Respect the Rules: Always check robots.txt and follow legal guidelines. Be a good web scraping citizen! 🀝
  • Optimize & Scale: When you're ready to go big, explore frameworks like Scrapy or use Asyncio for concurrent requests.
  • πŸ”§ Method 1: The Low-Level Approach (Manual Socket & Regex)

    Let's start with the absolute basicsβ€”sending HTTP requests using a raw TCP socket and parsing with Regular Expressions.

    Is this practical? Not really. It's educational, but cumbersome for real projects. Think of it as learning to build a car engine before you drive. πŸš—

    PYTHON
    import socket
    import re

    # Create a TCP socket and connect to the web server
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect(("example.com", 80))

    # Send an HTTP GET request (Connection: close so the server ends the response)
    request = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode())

    # Receive the response in chunks until the server closes the connection
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    sock.close()
    response = b"".join(chunks).decode(errors="replace")

    # Parse with regex (messy!)
    titles = re.findall(r'<title>(.*?)</title>', response)
    print(titles)

    βš™οΈ Method 2: Urllib3 & LXML (Intermediate Level)

    Now we're getting somewhere! urllib3 gives you more control over HTTP requests, while lxml lets you parse HTML using XPath expressions.

    PYTHON
    import urllib3
    from lxml import html

    # Create a PoolManager
    http = urllib3.PoolManager()

    # Fetch the page
    response = http.request('GET', 'https://news.ycombinator.com/')
    tree = html.fromstring(response.data)

    # Extract data with XPath
    titles = tree.xpath('//span[@class="titleline"]/a/text()')
    for title in titles:
        print(title)

    When to use this: when you need finer control over connection pooling, or when you prefer XPath to CSS selectors.
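
    If you do lean on that connection pooling, urllib3 lets you tune it. Here's a minimal sketch; the pool sizes, timeout, and retry values below are illustrative placeholders, not recommendations:

    PYTHON
    import urllib3

    # Reuse connections across requests and cap how many are kept open
    http = urllib3.PoolManager(
        num_pools=10,
        maxsize=10,
        timeout=urllib3.Timeout(connect=2.0, read=10.0),
        retries=urllib3.Retry(total=3, backoff_factor=0.5),
    )

    response = http.request('GET', 'https://news.ycombinator.com/')
    print(response.status, len(response.data), 'bytes')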

    🌟 Method 3: Requests & BeautifulSoup (The Beginner's Power Duo)

    This is the gold standard for beginners. It's clean, readable, and gets the job done beautifully.

    Step 1: Install the libraries

    BASH
    pip install requests beautifulsoup4

    Step 2: Scrape Hacker News

    PYTHON
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page
    url = 'https://news.ycombinator.com/'
    response = requests.get(url)

    # Parse with BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract story titles
    titles = soup.select('span.titleline > a')
    for title in titles:
        print(title.get_text())
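
    Handling Pagination

    Step 4 of the roadmap mentions pagination, and Hacker News is a good example: stories continue behind a "More" link. Here's a minimal sketch that follows that link for a few pages; the a.morelink selector reflects the site's current markup and could change:

    PYTHON
    import time
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    url = 'https://news.ycombinator.com/'
    for _ in range(3):  # scrape the first three pages
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')

        for title in soup.select('span.titleline > a'):
            print(title.get_text())

        # Follow the "More" link to the next page, if there is one
        more = soup.select_one('a.morelink')
        if more is None:
            break
        url = urljoin(url, more['href'])
        time.sleep(1)  # be polite between requests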

    Handling Authentication with Sessions

    Need to log in? Use Session objects to maintain cookies:

    PYTHON
    import requests

    session = requests.Session()

    # Log in (the session keeps the cookies it receives)
    login_data = {'username': 'user', 'password': 'pass'}
    session.post('https://example.com/login', data=login_data)

    # Now make authenticated requests
    response = session.get('https://example.com/protected')
    print(response.text)

    Storing Data in CSV

    PYTHON
    import csv

    data = [
        {'title': 'Example 1', 'link': 'https://example.com/1'},
        {'title': 'Example 2', 'link': 'https://example.com/2'}
    ]

    with open('output.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['title', 'link'])
        writer.writeheader()
        writer.writerows(data)

    Storing Data in PostgreSQL

    PYTHON
    import psycopg2

    # Assumes an existing articles(title, link) table in scraping_db
    conn = psycopg2.connect(
        host="localhost",
        database="scraping_db",
        user="user",
        password="password"
    )

    cur = conn.cursor()
    cur.execute("INSERT INTO articles (title, link) VALUES (%s, %s)",
                ("Example Title", "https://example.com"))
    conn.commit()
    cur.close()
    conn.close()

    ⚑ Scaling with Asyncio

    Requests is synchronous, meaning it waits for each request to complete. For scraping many pages, use aiohttp to send requests concurrently:

    PYTHON
    import asyncio
    import aiohttp

    async def fetch(session, url):
        async with session.get(url) as response:
            return await response.text()

    async def main():
        urls = ['https://example.com/page1', 'https://example.com/page2']
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, url) for url in urls]
            results = await asyncio.gather(*tasks)
            print(results)

    asyncio.run(main())
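
    When you fan out to many more URLs than this, it's worth capping how many requests are in flight at once so you don't hammer the server. Here's a minimal sketch using asyncio.Semaphore; the limit of 5 and the page URLs are placeholders:

    PYTHON
    import asyncio
    import aiohttp

    async def fetch(session, semaphore, url):
        # The semaphore lets at most N requests run concurrently
        async with semaphore:
            async with session.get(url) as response:
                return await response.text()

    async def main():
        urls = [f'https://example.com/page{i}' for i in range(1, 21)]
        semaphore = asyncio.Semaphore(5)
        async with aiohttp.ClientSession() as session:
            tasks = [fetch(session, semaphore, url) for url in urls]
            results = await asyncio.gather(*tasks)
            print(f"Fetched {len(results)} pages")

    asyncio.run(main())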

    🦊 Method 4: Web Scraping APIs (FoxScrape to the Rescue!)

    Here's the truth: as your scraping needs grow, you'll face challenges like:

  • 🚫 Anti-bot protection (Cloudflare, Captchas)
  • 🌐 JavaScript-heavy websites
  • πŸ”„ IP rotation and proxy management
  • ⚑ Scaling to thousands of requests
  • This is where FoxScrape comes in. Instead of building and maintaining complex infrastructure, FoxScrape handles all the heavy lifting for you through a simple API.

    Why Choose FoxScrape?

  • Bypass Anti-Bot Protection: FoxScrape automatically rotates IPs and handles Cloudflare challenges
  • JavaScript Rendering: Get fully rendered pages, even from dynamic sites
  • Simple Integration: Easy-to-use Python client with CSS selector support
  • AI-Powered Extraction: Use natural language prompts to extract exactly what you need
  • Production-Ready: Scale to millions of requests without managing infrastructure
  • Getting Started with FoxScrape

    PYTHON
    import requests

    api_key = 'YOUR_FOXSCRAPE_API_KEY'
    url = 'https://www.foxscrape.com/api/v1'

    params = {
        'api_key': api_key,
        'url': 'https://news.ycombinator.com/',
        'render_js': 'true',
        'extract_rules': {
            'titles': {
                'selector': 'span.titleline > a',
                'type': 'list',
                'output': 'text'
            }
        }
    }

    response = requests.post(url, json=params)
    data = response.json()

    # Get your extracted data
    titles = data['titles']
    for title in titles:
        print(title)

    AI-Powered Extraction Example

    Want to extract data using natural language? FoxScrape's AI extraction makes it incredibly easy:

    PYTHON
    params = {
        'api_key': api_key,
        'url': 'https://example.com/product',
        'render_js': 'true',
        'ai_extract_rules': {
            'product_name': 'Extract the product title',
            'price': 'Extract the current price',
            'rating': 'Extract the average rating'
        }
    }

    response = requests.post('https://foxscrape.com/api/v1', json=params)
    data = response.json()

    print(f"Product: {data['product_name']}")
    print(f"Price: {data['price']}")
    print(f"Rating: {data['rating']}")

    No CSS selectors. No XPath. Just tell FoxScrape what you want in plain English! 🎯

    Ready to try FoxScrape? Visit https://www.foxscrape.com and get started with a free trial today!

    πŸš€ Method 5: Web Crawling Frameworks (For Large-Scale Projects)

    When you need to scrape hundreds or thousands of pages, frameworks provide structure, performance, and built-in features:

    Scrapy: The Industry Standard

    Scrapy is a complete framework for web crawling. It handles parallelism, throttling, and error handling automatically.

    BASH
    pip install scrapy

    PYTHON
    import scrapy

    class HackerNewsSpider(scrapy.Spider):
        name = 'hackernews'
        start_urls = ['https://news.ycombinator.com/']

        def parse(self, response):
            for article in response.css('span.titleline'):
                yield {
                    'title': article.css('a::text').get(),
                    'link': article.css('a::attr(href)').get()
                }

    Run it with: scrapy crawl hackernews -o output.json (from inside a Scrapy project; for a standalone spider file, use scrapy runspider instead).
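
    The parallelism and throttling mentioned above are controlled through settings. Here's a minimal sketch using custom_settings on the spider; the values are illustrative, not recommendations:

    PYTHON
    import scrapy

    class PoliteHackerNewsSpider(scrapy.Spider):
        name = 'hackernews_polite'
        start_urls = ['https://news.ycombinator.com/']

        # Illustrative settings: cap concurrency, pace requests, retry failures
        custom_settings = {
            'CONCURRENT_REQUESTS': 4,
            'DOWNLOAD_DELAY': 1.0,
            'AUTOTHROTTLE_ENABLED': True,
            'RETRY_TIMES': 2,
        }

        def parse(self, response):
            for article in response.css('span.titleline'):
                yield {
                    'title': article.css('a::text').get(),
                    'link': article.css('a::attr(href)').get()
                }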

    PySpider: Web UI for Visual Debugging

    PySpider provides a user-friendly web interface for creating and debugging spiders:

    BASH
    pip install pyspider
    pyspider all

    Then open http://localhost:5000 in your browser and start building your spider visually!

    🌐 Method 6: Headless Browsing (For JavaScript-Heavy Sites)

    Some websites load content dynamically with JavaScript. For these, you need a headless browser:

    Selenium with Chrome

    PYTHON
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Set up Chrome in headless mode
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)

    # Navigate and extract
    driver.get('https://example.com')
    titles = driver.find_elements(By.CSS_SELECTOR, 'h1')
    for title in titles:
        print(title.text)

    driver.quit()
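
    On heavily dynamic pages, the elements you want may not exist yet when find_elements runs. Here's a minimal sketch using Selenium's explicit waits (WebDriverWait with expected_conditions) to pause until the content appears; the 10-second timeout and the h1 selector are placeholders:

    PYTHON
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    driver.get('https://example.com')

    # Wait up to 10 seconds for the JavaScript-rendered elements to show up
    wait = WebDriverWait(driver, 10)
    titles = wait.until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, 'h1')))
    for title in titles:
        print(title.text)

    driver.quit()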

    Playwright: Modern & Fast

    PYTHON
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://example.com')

        titles = page.query_selector_all('h1')
        for title in titles:
            print(title.inner_text())

        browser.close()

    🎯 Method 7: Using Website APIs (The Smartest Approach)

    Before scraping HTML, always check if the website has a public API. It's faster, more reliable, and far less likely to run afoul of the site's terms of use.

    PYTHON
    import requests

    # Example: Reddit API
    response = requests.get('https://www.reddit.com/r/python/top.json?limit=10',
                            headers={'User-Agent': 'Python Tutorial'})
    data = response.json()

    for post in data['data']['children']:
        print(post['data']['title'])

    πŸ›‘οΈ Avoiding Anti-Bot Technology: The Balancing Act

    Websites use various techniques to detect and block scrapers. Here's how to stay under the radar:

    1. Respect robots.txt

    PYTHON
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url('https://example.com/robots.txt')
    rp.read()

    can_scrape = rp.can_fetch('*', 'https://example.com/page')
    print(f"Can scrape: {can_scrape}")

    2. Rotate User Agents

    PYTHON
    import requests
    import random

    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    ]

    headers = {'User-Agent': random.choice(user_agents)}
    response = requests.get('https://example.com', headers=headers)

    3. Use Proxies

    PYTHON
    import requests

    proxies = {
        'http': 'http://proxy.example.com:8080',
        'https': 'http://proxy.example.com:8080'
    }

    response = requests.get('https://example.com', proxies=proxies)

    4. Undetected ChromeDriver

    For sites with advanced bot detection:

    PYTHON
    import undetected_chromedriver as uc

    driver = uc.Chrome()
    driver.get('https://example.com')
    # Your scraping code here
    driver.quit()

    5. Or... Just Use FoxScrape! 🦊

    All of these anti-bot techniques require constant maintenance and updates. FoxScrape handles all of this automatically, so you can focus on extracting data instead of fighting with websites.

    πŸŽ“ Final Thoughts

    Web scraping in Python is an incredibly powerful skill. You've now learned:

  • βœ… The fundamentals of HTTP requests and HTML parsing
  • βœ… Multiple approaches from low-level to high-level
  • βœ… How to handle authentication, pagination, and dynamic content
  • βœ… Strategies for avoiding detection
  • βœ… When to use APIs like FoxScrape to save time and effort
  • Start with Requests + BeautifulSoup for simple projects. As your needs grow, consider frameworks like Scrapy or APIs like FoxScrape for production-grade scraping.

    Happy scraping! πŸš€

    Want to skip the complexity and start scraping right away? Try FoxScrape today and get your first 1,000 requests free!