Python Web Scraping: Full Tutorial With Examples

Hey there, data enthusiast! Welcome to your ultimate guide to web scraping with Python. Whether you're building a price comparison tool, gathering research data, or just curious about how to extract information from websites, you're in the right place.
In this tutorial, we'll walk through everything from the absolute basics to advanced techniques, and yes, we'll show you real code examples you can actually use. Let's dive in!
The 6-Step Web Scraping Process
Before we jump into code, let's understand the roadmap for any successful web scraping project:
1. Decide what data you need and which site you'll get it from.
2. Fetch the pages with HTTP requests.
3. Parse the HTML you get back.
4. Extract the fields you care about.
5. Store the results in a file or a database.
6. Check robots.txt and follow legal guidelines. Be a good web scraping citizen!
Method 1: The Low-Level Approach (Manual Socket & Regex)
Let's start with the absolute basicsβsending HTTP requests using a raw TCP socket and parsing with Regular Expressions.
Is this practical? Not really. It's educational, but cumbersome for real projects. Think of it as learning to build a car engine before you drive.
import socket
import re

# Create a TCP socket
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("example.com", 80))

# Send HTTP GET request (Connection: close asks the server to end the connection after responding)
request = "GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
sock.send(request.encode())

# Receive response
response = sock.recv(4096).decode()
sock.close()

# Parse with regex (messy!)
titles = re.findall(r'<title>(.*?)</title>', response)
print(titles)
Method 2: Urllib3 & LXML (Intermediate Level)
Now we're getting somewhere! Urllib3 gives you more control over HTTP requests, while LXML lets you parse HTML using XPath expressions.
import urllib3
from lxml import html

# Create a PoolManager
http = urllib3.PoolManager()

# Fetch the page
response = http.request('GET', 'https://news.ycombinator.com/')
tree = html.fromstring(response.data)

# Extract data with XPath
titles = tree.xpath('//span[@class="titleline"]/a/text()')
for title in titles:
    print(title)
When to use this: When you need advanced connection pooling or prefer XPath over CSS selectors.
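If you do go this route, here's a minimal sketch of what that extra control looks like; the retry and timeout values below are only illustrative, not recommendations:

import urllib3
from lxml import html

# One PoolManager reuses connections across requests and applies
# shared retry/timeout policies (values here are just examples)
http = urllib3.PoolManager(
    retries=urllib3.Retry(total=3, backoff_factor=0.5),
    timeout=urllib3.Timeout(connect=2.0, read=10.0),
)

response = http.request('GET', 'https://news.ycombinator.com/')
tree = html.fromstring(response.data)
print(tree.xpath('//span[@class="titleline"]/a/text()')[:5])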
Method 3: Requests & BeautifulSoup (The Beginner's Power Duo)
This is the gold standard for beginners. It's clean, readable, and gets the job done beautifully.
Step 1: Install the libraries
pip install requests beautifulsoup4
Step 2: Scrape Hacker News
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = 'https://news.ycombinator.com/'
response = requests.get(url)

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')

# Extract story titles
titles = soup.select('span.titleline > a')
for title in titles:
    print(title.get_text())
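The same select() call also gives you each link's URL, so you can build records that are ready for the CSV and database examples further down:

# Collect title + link pairs instead of just the text
articles = []
for a in soup.select('span.titleline > a'):
    articles.append({'title': a.get_text(), 'link': a.get('href')})

print(articles[:3])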
Handling Authentication with Sessions
Need to log in? Use Session objects to maintain cookies:
import requests

session = requests.Session()

# Login
login_data = {'username': 'user', 'password': 'pass'}
session.post('https://example.com/login', data=login_data)

# Now make authenticated requests
response = session.get('https://example.com/protected')
print(response.text)
Storing Data in CSV
import csv

data = [
    {'title': 'Example 1', 'link': 'https://example.com/1'},
    {'title': 'Example 2', 'link': 'https://example.com/2'}
]

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'link'])
    writer.writeheader()
    writer.writerows(data)
Storing Data in PostgreSQL
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    database="scraping_db",
    user="user",
    password="password"
)

cur = conn.cursor()
cur.execute("INSERT INTO articles (title, link) VALUES (%s, %s)",
            ("Example Title", "https://example.com"))
conn.commit()
cur.close()
conn.close()
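The insert assumes an articles table already exists. A minimal schema you could create once, right after connecting and before inserting rows, might look like this (the column types are just a suggestion):

# Run once right after psycopg2.connect(), before the insert above
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id SERIAL PRIMARY KEY,
        title TEXT NOT NULL,
        link TEXT
    )
""")
conn.commit()
cur.close()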
Scaling with Asyncio
Requests is synchronous, meaning it waits for each request to complete. For scraping many pages, use aiohttp to send requests concurrently:
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = ['https://example.com/page1', 'https://example.com/page2']
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(results)

asyncio.run(main())
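When you scale this up to hundreds of URLs, it's polite (and safer) to cap how many requests are in flight at once. One common pattern is an asyncio.Semaphore; the limit of 5 below is arbitrary:

import asyncio
import aiohttp

async def fetch(session, semaphore, url):
    # The semaphore caps how many requests run at the same time
    async with semaphore:
        async with session.get(url) as response:
            return await response.text()

async def main():
    urls = [f'https://example.com/page{i}' for i in range(1, 101)]
    semaphore = asyncio.Semaphore(5)  # at most 5 concurrent requests
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, semaphore, url) for url in urls]
        results = await asyncio.gather(*tasks)
        print(f"Fetched {len(results)} pages")

asyncio.run(main())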
Method 4: Web Scraping APIs (FoxScrape to the Rescue!)
Here's the truth: as your scraping needs grow, you'll run into challenges like JavaScript-heavy pages, IP blocks, rate limits, and anti-bot defenses that change constantly.
This is where FoxScrape comes in. Instead of building and maintaining complex infrastructure, FoxScrape handles all the heavy lifting for you through a simple API.
Why Choose FoxScrape?
It renders JavaScript, takes care of proxies and anti-bot countermeasures for you, and can even extract data from plain-English instructions, as you'll see below.
Getting Started with FoxScrape
import requests

api_key = 'YOUR_FOXSCRAPE_API_KEY'
url = 'https://www.foxscrape.com/api/v1'

params = {
    'api_key': api_key,
    'url': 'https://news.ycombinator.com/',
    'render_js': 'true',
    'extract_rules': {
        'titles': {
            'selector': 'span.titleline > a',
            'type': 'list',
            'output': 'text'
        }
    }
}

response = requests.post(url, json=params)
data = response.json()

# Get your extracted data
titles = data['titles']
for title in titles:
    print(title)
AI-Powered Extraction Example
Want to extract data using natural language? FoxScrape's AI extraction makes it incredibly easy:
params = {
    'api_key': api_key,
    'url': 'https://example.com/product',
    'render_js': 'true',
    'ai_extract_rules': {
        'product_name': 'Extract the product title',
        'price': 'Extract the current price',
        'rating': 'Extract the average rating'
    }
}

response = requests.post('https://foxscrape.com/api/v1', json=params)
data = response.json()

print(f"Product: {data['product_name']}")
print(f"Price: {data['price']}")
print(f"Rating: {data['rating']}")
No CSS selectors. No XPath. Just tell FoxScrape what you want in plain English!
Ready to try FoxScrape? Visit https://www.foxscrape.com and get started with a free trial today!
Method 5: Web Crawling Frameworks (For Large-Scale Projects)
When you need to scrape hundreds or thousands of pages, frameworks provide structure, performance, and built-in features:
Scrapy: The Industry Standard
Scrapy is a complete framework for web crawling. It handles parallelism, throttling, and error handling automatically.
pip install scrapy
import scrapy

class HackerNewsSpider(scrapy.Spider):
    name = 'hackernews'
    start_urls = ['https://news.ycombinator.com/']

    def parse(self, response):
        for article in response.css('span.titleline'):
            yield {
                'title': article.css('a::text').get(),
                'link': article.css('a::attr(href)').get()
            }
Inside a Scrapy project, run it with scrapy crawl hackernews -o output.json (for a standalone file, use scrapy runspider your_spider.py -o output.json).
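Parallelism and throttling are controlled through settings, which you can tune per spider via custom_settings. The values below are only a starting point, not tuned recommendations:

import scrapy

class HackerNewsSpider(scrapy.Spider):
    name = 'hackernews'
    start_urls = ['https://news.ycombinator.com/']

    custom_settings = {
        'CONCURRENT_REQUESTS': 8,      # overall number of parallel requests
        'DOWNLOAD_DELAY': 1.0,         # seconds between requests to the same site
        'AUTOTHROTTLE_ENABLED': True,  # back off automatically when the server slows down
        'RETRY_TIMES': 2,              # retry failed requests a couple of times
    }

    def parse(self, response):
        for article in response.css('span.titleline'):
            yield {
                'title': article.css('a::text').get(),
                'link': article.css('a::attr(href)').get()
            }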
PySpider: Web UI for Visual Debugging
PySpider provides a user-friendly web interface for creating and debugging spiders:
pip install pyspider
pyspider all
Then open http://localhost:5000 in your browser and start building your spider visually! Note that PySpider is no longer actively maintained, so it may not run on recent Python versions.
Method 6: Headless Browsing (For JavaScript-Heavy Sites)
Some websites load content dynamically with JavaScript. For these, you need a headless browser:
Selenium with Chrome
from selenium import webdriver
from selenium.webdriver.common.by import By

# Set up Chrome in headless mode
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

# Navigate and extract
driver.get('https://example.com')
titles = driver.find_elements(By.CSS_SELECTOR, 'h1')
for title in titles:
    print(title.text)

driver.quit()
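Since JavaScript-rendered content can appear a moment after the page loads, it's usually worth adding an explicit wait before reading elements. Here's a small sketch using WebDriverWait (the 10-second timeout is arbitrary):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://example.com')

# Wait up to 10 seconds for at least one <h1> to be present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'h1'))
)

for title in driver.find_elements(By.CSS_SELECTOR, 'h1'):
    print(title.text)

driver.quit()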
Playwright: Modern & Fast
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')

    titles = page.query_selector_all('h1')
    for title in titles:
        print(title.inner_text())

    browser.close()
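(After pip install playwright, remember to run playwright install once to download the browser binaries.) Playwright waits automatically for most actions, but for content that loads late you can wait for a selector explicitly, roughly like this:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com')

    # Block until at least one <h1> is attached to the DOM (30s default timeout)
    page.wait_for_selector('h1')

    for title in page.query_selector_all('h1'):
        print(title.inner_text())

    browser.close()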
Method 7: Using Website APIs (The Smartest Approach)
Before scraping HTML, always check if the website has a public API. It's faster, more reliable, and usually puts you on much firmer legal ground.
import requests

# Example: Reddit API
response = requests.get('https://www.reddit.com/r/python/top.json?limit=10',
                        headers={'User-Agent': 'Python Tutorial'})
data = response.json()

for post in data['data']['children']:
    print(post['data']['title'])
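Listing endpoints like this one are paginated: Reddit's JSON includes an after cursor you pass to the next request. A rough sketch, assuming the response keeps its current shape:

import requests

headers = {'User-Agent': 'Python Tutorial'}
url = 'https://www.reddit.com/r/python/top.json'
after = None

# Fetch the first 3 pages of results, 10 posts at a time
for _ in range(3):
    params = {'limit': 10, 'after': after}
    data = requests.get(url, headers=headers, params=params).json()
    for post in data['data']['children']:
        print(post['data']['title'])
    after = data['data']['after']  # cursor for the next page
    if after is None:
        break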
Avoiding Anti-Bot Technology: The Balancing Act
Websites use various techniques to detect and block scrapers. Here's how to stay under the radar:
1. Respect robots.txt
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

can_scrape = rp.can_fetch('*', 'https://example.com/page')
print(f"Can scrape: {can_scrape}")
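robots.txt can also specify a Crawl-delay, and urllib.robotparser exposes it. Honoring it between requests is an easy win; the 1-second fallback below is just a polite default of our choosing:

import time
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()

# crawl_delay() returns the Crawl-delay for this user agent, or None if unset
delay = rp.crawl_delay('*') or 1  # fall back to a 1-second pause
time.sleep(delay)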
2. Rotate User Agents
import requests
import random

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
]

headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('https://example.com', headers=headers)
3. Use Proxies
proxies = {
    'http': 'http://proxy.example.com:8080',
    'https': 'http://proxy.example.com:8080'
}

response = requests.get('https://example.com', proxies=proxies)
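A single proxy tends to get blocked quickly, so in practice you rotate through a pool. A minimal sketch follows; the proxy URLs are placeholders for whatever provider you use:

import random
import requests

# Placeholder proxy endpoints -- substitute your own provider's list
proxy_pool = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
    'http://proxy3.example.com:8080',
]

proxy = random.choice(proxy_pool)
proxies = {'http': proxy, 'https': proxy}

response = requests.get('https://example.com', proxies=proxies, timeout=10)
print(response.status_code)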
4. Undetected ChromeDriver
For sites with advanced bot detection:
import undetected_chromedriver as uc

driver = uc.Chrome()
driver.get('https://example.com')
# Your scraping code here
driver.quit()
5. Or... Just Use FoxScrape!
All of these anti-bot techniques require constant maintenance and updates. FoxScrape handles all of this automatically, so you can focus on extracting data instead of fighting with websites.
Final Thoughts
Web scraping in Python is an incredibly powerful skill. You've now seen the full toolbox: raw sockets, urllib3 + lxml, Requests + BeautifulSoup, web scraping APIs like FoxScrape, crawling frameworks such as Scrapy, headless browsers, website APIs, and the basics of dealing with anti-bot defenses.
Start with Requests + BeautifulSoup for simple projects. As your needs grow, consider frameworks like Scrapy or APIs like FoxScrape for production-grade scraping.
Happy scraping!
Want to skip the complexity and start scraping right away? Try FoxScrape today and get your first 1,000 requests free!
Further Reading

Web Scraping with Rust
Web Scraping with Golang
Web Scraping with Ruby