Web Scraping with PHP

Written by Mantas Kemėšius

Web scraping is one of the most powerful ways to collect structured data from the internet — and PHP remains a surprisingly capable tool for the job.

In this 2025 guide, we’ll explore how to perform web scraping with PHP, starting with native techniques and gradually moving to modern APIs. Along the way, we’ll build a fun mini-project: scraping famous birthdays from popular websites like Wikipedia and IMDb.

We’ll begin by learning core PHP scraping methods (like cURL, regex, DOM, and XPath) and end with a truly scalable solution using FoxScrape — a modern web scraping API that handles JavaScript rendering, proxy rotation, and anti-bot protection automatically.

1. PHP Web Scraping Libraries

Before diving into code, let’s explore some of the most popular PHP scraping tools and frameworks available in 2025.

| Library / Framework | Purpose | Notes |
| --- | --- | --- |
| Guzzle | Modern HTTP client | Great for sending requests and managing concurrency. |
| Goutte / Symfony HttpBrowser | Crawling and DOM parsing | Built on top of BrowserKit + DomCrawler. Goutte itself is deprecated in favor of HttpBrowser. |
| Simple HTML DOM Parser / DiDOM / phpQuery / hQuery | HTML parsers | Parse HTML easily using CSS selectors. |
| php-webdriver / Panther / PuPHPeteer | Browser automation | Ideal for scraping JavaScript-heavy pages. |
| Roach PHP / PHP-Spider | Full scraping frameworks | Similar to Python’s Scrapy. |
| Embed / Httpful / Chrome PHP | Specialized tools | Handle media embedding, simplified HTTP, or Chrome control. |
| Crawler Detect | Detect bot user-agents | Identifies crawlers by their User-Agent; useful for checking whether your scraper’s User-Agent would be flagged. |

Each tool fits a different use case — from lightweight scraping to full browser automation.

2. Birthday Scraping Mini-Project 🎂

Let’s make this practical.

We’ll build a small PHP project that scrapes lists of famous birthdays from Wikipedia and IMDb.

Our Goal

| Site | Target | Output Example |
| --- | --- | --- |
| Wikipedia | Births section | "November 10 – Miranda Lambert (1983), American singer" |
| IMDb | Actor birthdays | "Josh Peck (1986), Actor, USA" |
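
Both targets encode the date in the URL, so a small helper can build the two addresses for any day. The sketch below assumes the URL patterns used in this guide (`/wiki/November_10` on Wikipedia and `birth_monthday=11-10` on IMDb); the function name `birthdayUrls` is made up for illustration.

```php
<?php
// Hypothetical helper: builds the two URLs scraped in this guide for a given date.
function birthdayUrls(DateTimeInterface $date): array
{
    return [
        // Wikipedia day pages are named "<Month>_<day>", e.g. November_10
        'wikipedia' => 'https://en.wikipedia.org/wiki/' . $date->format('F_j'),
        // IMDb's name search takes the month and day as MM-DD
        'imdb' => 'https://www.imdb.com/search/name/?birth_monthday=' . $date->format('m-d'),
    ];
}

$urls = birthdayUrls(new DateTimeImmutable('2025-11-10'));
echo $urls['wikipedia'], PHP_EOL; // https://en.wikipedia.org/wiki/November_10
echo $urls['imdb'], PHP_EOL;      // https://www.imdb.com/search/name/?birth_monthday=11-10
```

Keeping the URL construction in one place makes it easy to loop over a whole month later without touching the scraping code.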

3. Raw HTTP Requests (Low-Level Approach)

At its core, scraping is just sending an HTTP request and reading the response.

Here’s how to do that in PHP with the cURL extension. (fsockopen() works at an even lower level, on a raw socket, and isn’t shown here.)

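PHP
<?php
// Using cURL to fetch raw HTML
$url = "https://en.wikipedia.org/wiki/November_10";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
// Wikimedia asks clients to identify themselves; this value is a placeholder
curl_setopt($ch, CURLOPT_USERAGENT, "BirthdayScraper/1.0 (you@example.com)");
$response = curl_exec($ch);
if ($response === false) {
    exit("Request failed: " . curl_error($ch) . "\n");
}
curl_close($ch);

echo substr($response, 0, 500); // Print first 500 bytes
?>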

This gives us raw HTML — the same as if you “view source” in your browser.

It’s not elegant yet, but understanding this low-level approach helps you appreciate what libraries like Guzzle and FoxScrape do for you behind the scenes.
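
To see what "reading the response" involves at the lowest level, here is a rough sketch that splits a raw HTTP/1.1 response (the kind of text you would read from an fsockopen() socket) into status code, headers, and body. The function name `parseRawResponse` and the sample response are illustrative assumptions, and the parser deliberately ignores chunked transfer encoding and repeated headers.

```php
<?php
// Hypothetical helper: splits a raw HTTP response into status, headers, and body.
function parseRawResponse(string $raw): array
{
    // Headers and body are separated by the first blank line (CRLF CRLF)
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerLines = explode("\r\n", $parts[0]);
    $body = $parts[1] ?? '';

    // The status line looks like "HTTP/1.1 200 OK"
    $statusLine = (string) array_shift($headerLines);
    if (!preg_match('#^HTTP/\d(?:\.\d)?\s+(\d{3})#', $statusLine, $m)) {
        throw new InvalidArgumentException('Not an HTTP response');
    }

    $headers = [];
    foreach ($headerLines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed header lines
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value); // header names are case-insensitive
    }

    return ['status' => (int) $m[1], 'headers' => $headers, 'body' => $body];
}

// Made-up sample response for demonstration
$raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=UTF-8\r\n\r\n<html><body>Hello</body></html>";
$response = parseRawResponse($raw);
echo $response['status'], ' ', $response['headers']['content-type'], PHP_EOL;
```

cURL and Guzzle do this parsing (and much more) for you, which is why the rest of this guide works with the decoded body only.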

4. Scraping with Strings & Regex (Wikipedia Example)

Now let’s try to extract some real data — like the “Births” section on Wikipedia.

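PHP
<?php
// Wikimedia asks clients to send a descriptive User-Agent; this value is a placeholder
$context = stream_context_create(['http' => ['user_agent' => 'BirthdayScraper/1.0 (you@example.com)']]);
$html = file_get_contents("https://en.wikipedia.org/wiki/November_10", false, $context);
if ($html === false) {
    exit("Request failed\n");
}

// Match items like: <li><a ...>1983</a> – <a ...>Miranda Lambert</a>, American singer-songwriter</li>
// Note: every "year – text" item matches, so Events and Deaths are included too
preg_match_all('/<li>(.+?)<\/li>/su', $html, $matches);

foreach ($matches[1] as $item) {
    // Drop the inner links, then split the plain text into year and description
    $text = html_entity_decode(trim(strip_tags($item)), ENT_QUOTES | ENT_HTML5, 'UTF-8');
    if (preg_match('/^(\d{1,4}) – (.+)$/su', $text, $m)) {
        echo "{$m[1]} - {$m[2]}\n";
    }
}
?>
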
⚠️ Regex scraping works, but it’s fragile.

If the HTML structure changes, your code breaks. For production scraping, always use DOM parsing.
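
As a point of comparison, here is a minimal DOM-based sketch that keeps only the items under the "Births" heading. It runs on simplified sample markup rather than a live page, and it assumes the heading (or an element inside it) carries `id="Births"`; the function name `extractBirths` is hypothetical, and real Wikipedia markup may need a different query.

```php
<?php
// Hypothetical helper: returns the "year – description" items listed under the Births heading.
function extractBirths(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Keep only <li> items whose nearest preceding <h2> is the Births heading.
    // Assumption: the heading (or an element inside it) has id="Births".
    $query = '//li[preceding::h2[1][@id="Births" or .//*[@id="Births"]]]';

    $births = [];
    foreach ($xpath->query($query) as $li) {
        $text = trim(preg_replace('/\s+/u', ' ', $li->textContent));
        if (preg_match('/^(\d{1,4})\s+–\s+(.+)$/u', $text, $m)) {
            $births[] = ['year' => (int) $m[1], 'description' => $m[2]];
        }
    }
    return $births;
}

// Simplified sample markup standing in for a downloaded page
$sample = '<h2 id="Events">Events</h2>'
    . '<ul><li><a href="/wiki/1775">1775</a> &ndash; The Continental Marines are established.</li></ul>'
    . '<h2 id="Births">Births</h2>'
    . '<ul><li><a href="/wiki/1983">1983</a> &ndash; <a href="/wiki/Miranda_Lambert">Miranda Lambert</a>, American singer-songwriter</li>'
    . '<li><a href="/wiki/1986">1986</a> &ndash; <a href="/wiki/Josh_Peck">Josh Peck</a>, American actor</li></ul>'
    . '<h2 id="Deaths">Deaths</h2>'
    . '<ul><li><a href="/wiki/2007">2007</a> &ndash; Norman Mailer, American novelist</li></ul>';

foreach (extractBirths($sample) as $birth) {
    echo $birth['year'], ': ', $birth['description'], PHP_EOL;
}
```

Unlike the regex version, this ignores the Events and Deaths lists and does not care how the inner links are written, because it works on the parsed tree rather than on the raw text.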

5. Scraping with Guzzle, DOM, and XPath (IMDb Example)

Now, let’s upgrade our scraper using Guzzle and DOMDocument.

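PHP
<?php
require 'vendor/autoload.php';
use GuzzleHttp\Client;

// Send a browser-like User-Agent; some sites reject Guzzle's default one
$client = new Client(['headers' => ['User-Agent' => 'Mozilla/5.0 (compatible; BirthdayScraper/1.0)']]);
$response = $client->get('https://www.imdb.com/search/name/?birth_monthday=11-10');
$html = (string) $response->getBody();

// Collect HTML parser warnings quietly instead of silencing them with @
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Class name taken from IMDb's older list layout; adjust it if the markup has changed
$nodes = $xpath->query('//h3[@class="lister-item-header"]/a');

foreach ($nodes as $node) {
    echo trim($node->textContent) . "\n";
}
?>
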

This prints the names returned by IMDb’s name search for people born on November 10, provided the page still uses that markup.
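
One detail in the query above is worth a closer look: `@class="lister-item-header"` only matches when the attribute is exactly that string, so an element with extra classes is skipped. A common XPath 1.0 idiom matches a single class token instead. The sketch below uses made-up sample markup and hypothetical helpers named `hasClass` and `extractNames`; IMDb's real class names may differ.

```php
<?php
// Hypothetical helper: XPath 1.0 predicate that matches one class token among several.
function hasClass(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

// Hypothetical helper: returns the link text inside every matching <h3>.
function extractNames(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $names = [];
    foreach ($xpath->query('//h3[' . hasClass('lister-item-header') . ']/a') as $a) {
        $names[] = trim($a->textContent);
    }
    return $names;
}

// Made-up sample markup; the first heading has an extra class
$sample = '<div><h3 class="lister-item-header featured"><a href="/people/1">Josh Peck</a></h3>'
    . '<h3 class="lister-item-header"><a href="/people/2"> Ellen Pompeo </a></h3>'
    . '<h3 class="other-header"><a href="/people/3">Ignored</a></h3></div>';

print_r(extractNames($sample));
```

With the exact-match query, the first heading in this sample would be missed; the token match picks up both listing headers and still skips the unrelated one.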

🧩 Challenge: IMDb uses JavaScript and may block bots.

To scrape reliably, we need an API that handles rendering, proxy rotation, and anti-bot detection — that’s where FoxScrape shines.

6. Scraping IMDb with FoxScrape API 🚀

Now let’s replace all that complexity with a single API call using FoxScrape.

FoxScrape handles:

  • Anti-bot and CAPTCHA bypass
  • JavaScript rendering
  • IP rotation
  • Structured extraction (XPath or AI)

Example 1: Simple XPath Scraper

PHP
<?php
$endpoint = "https://foxscrape.com/api/v1";
$url = "https://www.imdb.com/search/name/?birth_monthday=11-10";

$payload = [
  "url" => $url,
  "render_js" => true,
  "extract" => [
    "actors" => [
      "selector" => "//h3[@class='lister-item-header']/a",
      "type" => "text"
    ]
  ]
];

// Setting CURLOPT_POSTFIELDS makes cURL send a POST request
$ch = curl_init("$endpoint/scrape");
curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($payload));
curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
if ($response === false) {
    exit("Request failed: " . curl_error($ch) . "\n");
}
curl_close($ch);

echo $response;
?>

Output Example:

JSON
{
  "actors": [
    "Josh Peck",
    "Ellen Pompeo",
    "Brittany Murphy"
  ]
}
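
The response arrives as a JSON string, so the PHP side still has to decode and validate it. A minimal sketch, assuming the response has the shape shown in the output example above (an `actors` array of strings); the function name `parseActors` is illustrative, and a real response may carry extra fields or error details that are not handled here.

```php
<?php
// Hypothetical helper: decodes the JSON shown above and returns the list of names.
function parseActors(string $json): array
{
    // JSON_THROW_ON_ERROR turns malformed JSON into a JsonException
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($data) || !isset($data['actors']) || !is_array($data['actors'])) {
        throw new UnexpectedValueException('Response has no "actors" list');
    }

    // Trim each entry and keep only non-empty strings
    $names = array_map(
        fn ($name) => is_string($name) ? trim($name) : '',
        $data['actors']
    );
    return array_values(array_filter($names, fn ($name) => $name !== ''));
}

// Sample taken from the output example above
$response = '{"actors": ["Josh Peck", "Ellen Pompeo", "Brittany Murphy"]}';
foreach (parseActors($response) as $name) {
    echo $name, PHP_EOL;
}
```

Failing loudly on an unexpected shape is a deliberate choice: a scraper that silently returns an empty list is much harder to debug than one that throws.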

Example 2: AI-Powered Extraction

No need to write selectors: FoxScrape can auto-detect structured data.

PHP
$payload = [
  "url" => $url,
  "ai_extract" => true
];

🦊 With FoxScrape, you get clean, structured data from a single API call, with no proxies, CAPTCHAs, or HTML parsing to manage.

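7. Goutte / Symfony Example

For more control, you can use Symfony’s scraping components like DomCrawler and CssSelector. Goutte is now deprecated and, since version 4, is only a thin wrapper around Symfony’s HttpBrowser, so this example uses HttpBrowser directly.

PHP
<?php
// composer require symfony/browser-kit symfony/http-client symfony/css-selector
require 'vendor/autoload.php';
use Symfony\Component\BrowserKit\HttpBrowser;
use Symfony\Component\HttpClient\HttpClient;

// HttpBrowser is the class that Goutte\Client extends in Goutte 4
$client = new HttpBrowser(HttpClient::create());
$crawler = $client->request('GET', 'https://www.imdb.com/search/name/?birth_monthday=11-10');

// filter() takes a CSS selector (this is what symfony/css-selector is for)
$crawler->filter('.lister-item-header a')->each(function ($node) {
    echo $node->text() . PHP_EOL;
});
?>

Goutte and HttpBrowser make your code cleaner, but they only fetch static HTML: they can’t run JavaScript or get past anti-bot systems. FoxScrape handles both.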

8. Headless Browsers (Dynamic Content)

Sites that load data via JavaScript need headless browsers.

You can use Symfony Panther or PuPHPeteer:

PHP
<?php
// composer require symfony/panther (also needs a local Chrome and ChromeDriver)
require 'vendor/autoload.php';
use Symfony\Component\Panther\Client;

$client = Client::createChromeClient();
$client->request('GET', 'https://www.imdb.com/search/name/?birth_monthday=11-10');

// Wait until JavaScript has rendered the list (the selector is an assumption about IMDb's markup)
$crawler = $client->waitFor('.lister-item-header');

$crawler->filter('.lister-item-header a')->each(function ($node) {
    echo $node->text() . PHP_EOL;
});

$client->quit(); // shut down the browser and the driver
?>

While this works, headless browsers are slow and resource-heavy, and they can still be blocked.

That’s why developers increasingly use APIs like FoxScrape to scale safely and efficiently.

9. Summary & Optimization Ideas

We’ve covered a full spectrum, from manual HTTP to full browser scraping.

Here’s how they compare:

| Method | Speed | Handles JS | Avoids Blocks | Best For |
| --- | --- | --- | --- | --- |
| cURL + Regex | 🚀 Fast | ❌ No | ❌ No | Basic HTML pages |
| Guzzle + DOM | ⚡ Fast | ❌ No | ⚠️ Partial | Static pages |
| Headless Browser | 🐢 Slow | ✅ Yes | ⚠️ Limited | Dynamic pages |
| FoxScrape API | ⚡⚡ Fast | ✅ Yes | ✅ Yes | Scalable, production scraping |

Optimization Tips

  • Add concurrency with Guzzle for faster results.
  • Handle pagination automatically.
  • Extract images, descriptions, and links.
  • Use FoxScrape’s AI extraction for instant structured data.
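
As an illustration of the tip about extracting images and links, the DOM approach from section 5 extends naturally to attributes. A minimal sketch on made-up sample markup; the function name `extractLinksAndImages` is hypothetical, and relative URLs are returned as-is rather than resolved against the page address.

```php
<?php
// Hypothetical helper: collects every link (text + href) and image (alt + src) in a page.
function extractLinksAndImages(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = ['links' => [], 'images' => []];

    // Only anchors that actually have an href attribute
    foreach ($xpath->query('//a[@href]') as $a) {
        $result['links'][] = [
            'text' => trim($a->textContent),
            'href' => $a->getAttribute('href'),
        ];
    }
    // Only images that have a src attribute; alt is empty when missing
    foreach ($xpath->query('//img[@src]') as $img) {
        $result['images'][] = [
            'alt' => $img->getAttribute('alt'),
            'src' => $img->getAttribute('src'),
        ];
    }
    return $result;
}

// Made-up sample markup for demonstration
$sample = '<div><a href="/people/josh-peck">Josh Peck</a>'
    . '<img src="/images/josh-peck.jpg" alt="Josh Peck portrait">'
    . '<a>No href, skipped</a></div>';

print_r(extractLinksAndImages($sample));
```
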

10. Conclusion

PHP offers multiple levels of scraping sophistication, from file_get_contents() to full browser automation.

But as websites grow smarter and anti-bot systems evolve, manual scraping gets harder.

That’s the problem FoxScrape is built to solve.

🦊 FoxScrape lets you focus on data, not infrastructure.

With built-in IP rotation, JS rendering, and AI-powered extraction, you can scrape at scale with one simple API call.

💡 Try FoxScrape for Free

👉 Scrape smarter, not harder.

Sign up for FoxScrape and start extracting structured data instantly.

BASH
curl -X POST https://foxscrape.com/api/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url":"https://www.imdb.com/search/name/?birth_monthday=11-10"}'

⚡ Start scraping in seconds: no proxies, no headless browsers, no headaches.