Python Web Scraping: Full Tutorial With Examples

Hey there, data enthusiast! Welcome to your ultimate guide to web scraping with Python. Whether you're building a price comparison tool, gathering research data, or just curious about how to extract information from websites, you're in the right place.
In this tutorial, we'll walk through everything from the absolute basics to advanced techniques, and yes, we'll show you real code examples you can actually use. Let's dive in!
The 6-Step Web Scraping Process
Before we jump into code, let's understand the roadmap for any successful web scraping project:
1. Identify your target pages and check whether the site offers an official API.
2. Inspect the page structure in your browser's dev tools.
3. Fetch the HTML with an HTTP client.
4. Parse the response and extract the data you need.
5. Store the results (CSV, database, etc.).
6. Respect robots.txt and follow legal guidelines. Be a good web scraping citizen!

Method 1: The Low-Level Approach (Manual Socket & Regex)
Let's start with the absolute basics: sending HTTP requests over a raw TCP socket and parsing the response with regular expressions.
Is this practical? Not really. It's educational, but cumbersome for real projects. Think of it as learning to build a car engine before you drive.
import socket
import re

# Create a TCP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("example.com", 80))

# Send HTTP GET request
request = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
sock.send(request.encode())

# Receive response
response = sock.recv(4096).decode()
sock.close()

# Parse with regex (messy!)
titles = re.findall(r'<title>(.*?)</title>', response)
print(titles)

Method 2: Urllib3 & LXML (Intermediate Level)
Now we're getting somewhere! urllib3 gives you finer control over HTTP connections, while lxml lets you parse HTML using XPath expressions.
import urllib3
from lxml import html

# Create a PoolManager
http = urllib3.PoolManager()

# Fetch the page
response = http.request('GET', 'https://news.ycombinator.com/')
tree = html.fromstring(response.data)

# Extract data with XPath
titles = tree.xpath('//span[@class="titleline"]/a/text()')
for title in titles:
    print(title)

When to use this: When you need advanced connection pooling or prefer XPath over CSS selectors.
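If connection pooling is the reason you reach for urllib3, you can tune it explicitly. Here's a minimal sketch; the pool sizes, retry counts, and timeouts below are illustrative values, not recommendations:

import urllib3
from urllib3.util.retry import Retry

# Retry transient failures with exponential backoff
retries = Retry(total=3, backoff_factor=0.5,
                status_forcelist=[429, 500, 502, 503, 504])

# Reuse up to 10 connections per host and fail fast on slow servers
http = urllib3.PoolManager(
    num_pools=10,
    maxsize=10,
    retries=retries,
    timeout=urllib3.Timeout(connect=2.0, read=10.0)
)

response = http.request('GET', 'https://news.ycombinator.com/')
print(response.status)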
Method 3: Requests & BeautifulSoup (The Beginner's Power Duo)
This is the gold standard for beginners. It's clean, readable, and gets the job done beautifully.
Step 1: Install the libraries
pip install requests beautifulsoup4

Step 2: Scrape Hacker News
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = 'https://news.ycombinator.com/'
response = requests.get(url)

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract story titles
titles = soup.select('span.titleline > a')
for title in titles:
    print(title.get_text())

Handling Authentication with Sessions
Need to log in? Use Session objects to maintain cookies:
session = requests.Session()

# Login
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)

# Now make authenticated requests
response = session.get('https://example.com/protected')
print(response.text)

Storing Data in CSV
import csv

data = [
    {'title': 'Example 1', 'link': 'https://example.com/1'},
    {'title': 'Example 2', 'link': 'https://example.com/2'}
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(data)

Storing Data in PostgreSQL
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="scraping_db",
    user="user",
    password="password"
)

cur = conn.cursor()
cur.execute("INSERT INTO articles (title, link) VALUES (%s, %s)",
            ("Example Title", "https://example.com"))
conn.commit()
cur.close()
conn.close()
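One assumption worth calling out: the INSERT above expects an articles table to already exist. Here's a minimal one-time setup sketch reusing the same connection settings; the schema is just an example that matches the columns in the INSERT:

import psycopg2

conn = psycopg2.connect(host="localhost", database="scraping_db",
                        user="user", password="password")
cur = conn.cursor()

# Create the table the scraper will insert into (run once)
cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        link TEXT
    )
""")
conn.commit()
cur.close()
conn.close()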
Scaling with Asyncio
Requests is synchronous, meaning it waits for each request to complete. For scraping many pages, use aiohttp to send requests concurrently:
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['https://example.com/page1', 'https://example.com/page2']
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(results)

asyncio.run(main())
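One caveat: firing off hundreds of concurrent requests is a fast way to get blocked (and to hammer someone's server). A common fix is to cap concurrency with a semaphore; here's a minimal sketch of that idea, where the limit of 5 is an arbitrary example:

import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    # Hold a semaphore slot for the duration of the request
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    urls = [f'https://example.com/page{i}' for i in range(1, 21)]
    semaphore = asyncio.Semaphore(5)  # at most 5 requests in flight
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Fetched {len(results)} pages")

asyncio.run(main())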
Method 4: Web Scraping APIs (FoxScrape to the Rescue!)
Here's the truth: as your scraping needs grow, you'll run into JavaScript-heavy pages that need a real browser, IP blocks and CAPTCHAs, proxy and user-agent rotation, and the infrastructure needed to run all of it at scale.
This is where FoxScrape comes in. Instead of building and maintaining complex infrastructure, FoxScrape handles all the heavy lifting for you through a simple API.
Why Choose FoxScrape?
FoxScrape renders JavaScript, rotates proxies, and deals with anti-bot systems for you, and its extraction rules (including AI-powered ones) return clean, structured data from a single API call.
Getting Started with FoxScrape
import requests

api_key = 'YOUR_FOXSCRAPE_API_KEY'
url = 'https://www.foxscrape.com/api/v1'

params = {
    'api_key': api_key,
    'url': 'https://news.ycombinator.com/',
    'render_js': 'true',
    'extract_rules': {
        'titles': {
            'selector': 'span.titleline > a',
            'type': 'list',
            'output': 'text'
        }
    }
}

response = requests.post(url, json=params)
data = response.json()

# Get your extracted data
titles = data['titles']
for title in titles:
    print(title)

AI-Powered Extraction Example
Want to extract data using natural language? FoxScrape's AI extraction makes it incredibly easy:
params = {
    'api_key': api_key,
    'url': 'https://example.com/product',
    'render_js': 'true',
    'ai_extract_rules': {
        'product_name': 'Extract the product title',
        'price': 'Extract the current price',
        'rating': 'Extract the average rating'
    }
}

response = requests.post('https://foxscrape.com/api/v1', json=params)
data = response.json()

print(f"Product: {data['product_name']}")
print(f"Price: {data['price']}")
print(f"Rating: {data['rating']}")

No CSS selectors. No XPath. Just tell FoxScrape what you want in plain English!
Ready to try FoxScrape? Visit https://www.foxscrape.com and get started with a free trial today!
Method 5: Web Crawling Frameworks (For Large-Scale Projects)
When you need to scrape hundreds or thousands of pages, frameworks provide structure, performance, and built-in features:
Scrapy: The Industry Standard
Scrapy is a complete framework for web crawling. It handles parallelism, throttling, and error handling automatically.
pip install scrapy

import scrapy

class HackerNewsSpider(scrapy.Spider):
    name = 'hackernews'
    start_urls = ['https://news.ycombinator.com/']

    def parse(self, response):
        for article in response.css('span.titleline'):
            yield {
                'title': article.css('a::text').get(),
                'link': article.css('a::attr(href)').get()
            }

Run it with: scrapy crawl hackernews -o output.json
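Since Scrapy handles throttling and retries for you, it's worth knowing where those knobs live. Here's a sketch using Scrapy's per-spider custom_settings; the numbers are illustrative, not recommendations:

import scrapy

class PoliteHackerNewsSpider(scrapy.Spider):
    name = 'hackernews_polite'
    start_urls = ['https://news.ycombinator.com/']

    # Per-spider overrides of Scrapy's global settings
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,               # pause between requests
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
        'AUTOTHROTTLE_ENABLED': True,        # adapt delay to server load
        'RETRY_TIMES': 2,
    }

    def parse(self, response):
        for article in response.css('span.titleline'):
            yield {'title': article.css('a::text').get()}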
PySpider: Web UI for Visual Debugging
PySpider provides a user-friendly web interface for creating and debugging spiders:
pip install pyspider
pyspider all

Then open http://localhost:5000 in your browser and start building your spider visually!
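Under the hood, the web UI has you edit a Python handler class. Here's a sketch roughly modeled on the script PySpider generates for a new project (note that PySpider hasn't been actively maintained for a while, so check compatibility with your Python version):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    crawl_config = {}

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://news.ycombinator.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        # Queue every outgoing link for a detail crawl
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {
            'url': response.url,
            'title': response.doc('title').text(),
        }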
Method 6: Headless Browsing (For JavaScript-Heavy Sites)
Some websites load content dynamically with JavaScript. For these, you need a headless browser:
Selenium with Chrome
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up Chrome in headless mode
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Navigate and extract
driver.get('https://example.com')
titles = driver.find_elements(By.CSS_SELECTOR, 'h1')
for title in titles:
    print(title.text)

driver.quit()
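If the content you're after is injected by JavaScript after the initial page load, wait for it to appear before extracting. Here's a sketch using Selenium's explicit waits; the 10-second timeout and the h1 selector are just examples:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://example.com')

# Block until at least one matching element exists (up to 10 seconds)
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1'))
)

for title in driver.find_elements(By.CSS_SELECTOR, 'h1'):
    print(title.text)

driver.quit()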
Playwright: Modern & Fast

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')

    titles = page.query_selector_all('h1')
    for title in titles:
        print(title.inner_text())

    browser.close()

Method 7: Using Website APIs (The Smartest Approach)
Before scraping HTML, always check whether the website offers a public API. It's faster, more reliable, and far less likely to run afoul of the site's terms of service!
import requests

# Example: Reddit API
response = requests.get('https://www.reddit.com/r/python/top.json?limit=10',
                        headers={'User-Agent': 'Python Tutorial'})
data = response.json()

for post in data['data']['children']:
    print(post['data']['title'])

Avoiding Anti-Bot Technology: The Balancing Act
Websites use various techniques to detect and block scrapers. Here's how to stay under the radar:
1. Respect robots.txt
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

can_scrape = rp.can_fetch('*', 'https://example.com/page')
print(f"Can scrape: {can_scrape}")

2. Rotate User Agents
import requests
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]

headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('https://example.com', headers=headers)

3. Use Proxies
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080'
}

response = requests.get('https://example.com', proxies=proxies)
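A single proxy gets blocked almost as easily as your own IP, so scrapers usually rotate through a pool, just like user agents. A minimal sketch of that idea (the proxy addresses are placeholders):

import random
import requests

proxy_pool = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

# Pick a different proxy for each request
proxy = random.choice(proxy_pool)
response = requests.get('https://example.com',
                        proxies={'http': proxy, 'https': proxy},
                        timeout=10)
print(response.status_code)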
4. Undetected ChromeDriver
For sites with advanced bot detection:
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://example.com')
# Your scraping code here
driver.quit()

5. Or... Just Use FoxScrape!
All of these anti-bot techniques require constant maintenance and updates. FoxScrape handles all of this automatically, so you can focus on extracting data instead of fighting with websites.
Final Thoughts
Web scraping in Python is an incredibly powerful skill. You've now learned seven approaches, from raw sockets and regex, through urllib3 + lxml and Requests + BeautifulSoup, to scraping APIs like FoxScrape, frameworks like Scrapy, headless browsers, and official website APIs, plus how to store your data and stay on the right side of anti-bot systems.
Start with Requests + BeautifulSoup for simple projects. As your needs grow, consider frameworks like Scrapy or APIs like FoxScrape for production-grade scraping.
Happy scraping!
Want to skip the complexity and start scraping right away? Try FoxScrape today and get your first 1,000 requests free!
Further Reading

Web Scraping with Rust
Web Scraping with Golang
Web Scraping with Ruby