Web Scraping with Perl

Written by Mantas Kemėšius

Perl has long been a favorite language for text processing and automation. Its rich ecosystem of libraries makes it easy to scrape websites, parse HTML, and extract structured data. In this tutorial, we’ll explore how to scrape song lyrics and other web content using Perl, and how FoxScrape can simplify scraping pages that block bots or require JavaScript rendering.

By the end, you’ll understand:

  • How to fetch web pages with Perl’s HTTP clients
  • How to parse HTML using HTML::TreeBuilder
  • How to automate browser interactions with WWW::Mechanize::Chrome
  • How to integrate FoxScrape to handle dynamic or protected content

1. Perl’s Web Scraping Toolbox

    Perl has multiple options for scraping and HTTP requests:

  • LWP::UserAgent – Standard HTTP client for web requests
  • HTTP::Request – Simplifies crafting HTTP requests
  • HTTP::Tiny – Lightweight HTTP client
  • HTML::TreeBuilder – Parses HTML into a DOM tree
  • WWW::Mechanize – Automates browser-like actions
  • Selenium::Chrome – Controls a real browser for JavaScript-heavy sites

    This toolbox allows you to handle anything from static HTML pages to interactive web apps.
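
    Most of these modules come from CPAN, but HTTP::Tiny has shipped with core Perl since 5.14, so a quick fetch needs no installation at all. A minimal sketch:

    PERL
    use strict;
    use warnings;
    use HTTP::Tiny;

    # HTTP::Tiny is core Perl; responses come back as a plain hashref
    my $http = HTTP::Tiny->new(agent => "LyricsScraper/1.0");
    my $response = $http->get("https://example.com/");

    if ($response->{success}) {
        print "Fetched ", length($response->{content}), " bytes\n";
    } else {
        warn "Request failed: $response->{status} $response->{reason}\n";
    }

    For anything beyond simple GETs, the heavier modules below are usually worth the install.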

    2. Common Use Cases

    Web scraping in Perl is useful for:

  • Lead generation and industry research
  • Price monitoring and market analysis
  • Academic research or data aggregation
  • Extracting content not available via an API (e.g., song lyrics on Genius.com)

    3. Making HTTP Requests

    A typical approach is using LWP::UserAgent:

    PERL
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->agent("LyricsScraper");

    my $url = "https://genius.com/DJ-Shadow-Six-Days-lyrics";
    my $request = $ua->get($url);
    die "Cannot contact Genius: ", $request->status_line, "\n"
        unless $request->is_success;
  • $ua->agent sets a custom User-Agent
  • get($url) fetches the page
  • $request->content contains the HTML
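
    HTTP::Request, listed in the toolbox above, lets you build the request object explicitly before handing it to the user agent — useful when you need to attach extra headers. A short sketch:

    PERL
    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Request;

    # Build the request by hand so additional headers can be attached
    my $req = HTTP::Request->new(GET => "https://genius.com/DJ-Shadow-Six-Days-lyrics");
    $req->header("User-Agent"      => "LyricsScraper/1.0");
    $req->header("Accept-Language" => "en");

    my $response = LWP::UserAgent->new->request($req);
    print $response->status_line, "\n";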

    4. Parsing HTML with TreeBuilder

    HTML::TreeBuilder lets you transform raw HTML into a DOM tree and query it:

    PERL
    use HTML::TreeBuilder;
    use Encode qw(decode_utf8);

    my $root = HTML::TreeBuilder->new();
    $root->parse(decode_utf8 $request->content);
    $root->eof;    # signal end of input so the tree is finalized

    my $data = $root->look_down(_tag => "div", id => "lyrics-root");
  • look_down searches for elements with specific attributes
  • You can target <div> tags, classes, or other HTML structures
  • The returned $data is a subtree you can manipulate

    To format the extracted text:

    PERL
    use HTML::FormatText;

    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
    my $lyrics_text = $formatter->format($data);

    print $lyrics_text;

    This ensures your output is readable in the terminal.
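
    Note that look_down in list context returns every match rather than only the first — handy when a page repeats a structure, such as links. A self-contained sketch (the HTML snippet here is made up for illustration):

    PERL
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new();
    $root->parse('<p><a href="/one">one</a> <a href="/two">two</a></p>');
    $root->eof;

    # In list context, look_down returns all matching elements in document order
    my @links = $root->look_down(_tag => "a");
    for my $link (@links) {
        print $link->attr("href"), "\n";
    }
    $root->delete;   # free the tree when done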

    5. Making Your Scraper Reusable

    Command-line arguments allow you to scrape different songs:

    PERL
    my $song_slug = $ARGV[0];
    my $url = "https://genius.com/$song_slug";

    Run it with:

    BASH
    perl scraper.pl DJ-Shadow-Six-Days-lyrics

    The same logic now works for any song or page, making your scraper flexible.
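
    Since the slug comes from the command line, it is worth validating it before interpolating it into a URL. A minimal sketch (the character whitelist is an assumption about what Genius slugs look like):

    PERL
    use strict;
    use warnings;

    sub song_url {
        my ($slug) = @_;
        die "Usage: perl scraper.pl <song-slug>\n"
            unless defined $slug && length $slug;
        # Reject anything outside letters, digits, underscores, and hyphens
        die "Suspicious slug: $slug\n" unless $slug =~ /\A[\w-]+\z/;
        return "https://genius.com/$slug";
    }

    my $url = song_url($ARGV[0] // "DJ-Shadow-Six-Days-lyrics");
    print "$url\n";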

    6. Handling JavaScript-heavy Sites

    Some pages load content dynamically. WWW::Mechanize::Chrome allows full browser control:

    PERL
    use WWW::Mechanize::Chrome;

    my $mech = WWW::Mechanize::Chrome->new();
    $mech->get('https://www.example.com/');
    print "Page title: " . $mech->title . "\n";

    # Capture a screenshot of the rendered page
    my $png = $mech->content_as_png();
  • Supports clicks, scrolling, and waiting for JS elements
  • Returns page content after scripts are executed
  • Useful for pages where LWP cannot access data

    7. Simplifying Dynamic & Protected Pages with FoxScrape

    Instead of configuring browser automation manually, FoxScrape provides an API that fetches pages fully rendered, bypasses anti-bot protections, and retries automatically if a request fails.

    Example: fetching lyrics page through FoxScrape:

    PERL
    use strict;
    use warnings;
    use LWP::UserAgent;
    use URI::Escape qw(uri_escape);
    use HTML::TreeBuilder;
    use HTML::FormatText;
    use Encode qw(decode_utf8);

    my $ua = LWP::UserAgent->new;
    my $api_key = "YOUR_API_KEY";
    my $target_url = "https://genius.com/DJ-Shadow-Six-Days-lyrics";

    # Escape the target URL so it survives being embedded in a query string
    my $fox_url = "https://www.foxscrape.com/api/v1?api_key=$api_key&url="
        . uri_escape($target_url);

    my $response = $ua->get($fox_url);
    die "Failed to fetch page: ", $response->status_line, "\n"
        unless $response->is_success;

    my $root = HTML::TreeBuilder->new();
    $root->parse(decode_utf8 $response->content);
    $root->eof;

    my $data = $root->look_down(_tag => "div", id => "lyrics-root");
    print HTML::FormatText->new->format($data);

    Benefits of using FoxScrape:

  • No need for proxy rotation or custom headers
  • JavaScript-rendered content is handled automatically
  • You can continue using your existing parsing code unchanged

    You can also enable JS rendering explicitly:

    PERL
    my $fox_url = "https://www.foxscrape.com/api/v1?api_key=$api_key&url=$target_url&render_js=true";
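
    When the target URL itself contains query parameters, interpolating it raw would corrupt the FoxScrape request. The URI module (a dependency of LWP, so already installed) builds the query and percent-encodes each value:

    PERL
    use strict;
    use warnings;
    use URI;

    my $api_key    = "YOUR_API_KEY";
    my $target_url = "https://genius.com/DJ-Shadow-Six-Days-lyrics";

    my $fox = URI->new("https://www.foxscrape.com/api/v1");
    # query_form escapes each value, so ':' and '/' in the target
    # URL cannot break the query string
    $fox->query_form(
        api_key   => $api_key,
        url       => $target_url,
        render_js => "true",
    );
    print $fox->as_string, "\n";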

    8. Best Practices

  • Respect robots.txt and site rate limits
  • Use meaningful User-Agent headers
  • Validate parsed data to avoid empty or malformed results
  • Use FoxScrape for pages that block direct scraping
  • Modularize scrapers to handle different sites via command-line arguments
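
    Rate limiting is easy to add with a small helper that enforces a minimum gap between requests (the two-second figure below is an assumed polite default, not a documented limit of any site):

    PERL
    use strict;
    use warnings;
    use Time::HiRes qw(time sleep);   # core module; allows fractional seconds

    my $min_gap = 2.0;    # assumed polite delay between requests
    my $last_request = 0;

    # Block until at least $min_gap seconds have passed since the last call
    sub throttle {
        my $wait = $min_gap - (time() - $last_request);
        sleep($wait) if $wait > 0;
        $last_request = time();
    }

    for my $slug ("DJ-Shadow-Six-Days-lyrics", "another-song-lyrics") {
        throttle();
        # A real scraper would call $ua->get here
        print "Would fetch https://genius.com/$slug\n";
    }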

    9. Conclusion

    By combining Perl’s text-processing strengths with HTML::TreeBuilder, FormatText, and optionally WWW::Mechanize::Chrome, you can build versatile web scrapers. Adding FoxScrape simplifies handling anti-bot measures, JavaScript-rendered pages, and retries, letting you focus on parsing and data extraction.

    Whether scraping song lyrics, e-commerce sites, or research data, Perl provides the flexibility, and FoxScrape provides reliability.

    🦊 Try FoxScrape in Perl:

    PERL
    my $api_key = "YOUR_API_KEY";
    my $url = "https://genius.com/DJ-Shadow-Six-Days-lyrics";
    my $fox_url = "https://www.foxscrape.com/api/v1?api_key=$api_key&url=$url";

    Fetch fully-rendered pages without configuring headless browsers or proxies — making your Perl scrapers faster, simpler, and more reliable.