Common Use Cases
Below you will find common extraction rules often used by our users.
Extract all links from a page
For SEO purposes, lead generation, or simply data harvesting, it can be useful to quickly extract all links from a single page.
The following extract_rules will allow you to do that with one simple API call:
{
  "extract_rules": {
    "all_links": {
      "selector": "a",
      "type": "list",
      "output": "@href"
    }
  }
}
The JSON response will be as follows:
{
  "all_links": [
    "https://www.foxscrape.com/",
    "...",
    "https://www.foxscrape.com/api-store/"
  ]
}
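Note that the @href attribute is returned exactly as it appears in the HTML, so depending on the page it may contain relative URLs. A minimal post-processing sketch (the absolutize helper is our own, not part of the API) that resolves relative hrefs against the page URL and removes duplicates:

```python
from urllib.parse import urljoin

def absolutize(base_url, hrefs):
    """Resolve relative hrefs against the page URL and drop duplicates."""
    seen = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # leaves absolute URLs untouched
        if absolute not in seen:
            seen.append(absolute)
    return seen

links = absolutize(
    "https://www.foxscrape.com/",
    ["/blog/", "https://www.foxscrape.com/api-store/", "/blog/"],
)
```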
If you wish to extract both the href and the anchor text of links, you can use these rules instead:
{
  "extract_rules": {
    "all_links": {
      "selector": "a",
      "type": "list",
      "output": {
        "anchor": "a",
        "href": {
          "selector": "a",
          "output": "@href"
        }
      }
    }
  }
}
The JSON response will be as follows:
{
  "all_links": [
    {
      "anchor": "Blog",
      "href": "https://www.foxscrape.com/blog/"
    },
    "...",
    {
      "anchor": " Linkedin ",
      "href": "https://www.linkedin.com/company/26175275/admin/"
    }
  ]
}
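Once you have these anchor/href pairs, a common next step is to separate internal links from external ones. A small sketch of that filtering (the split_links helper and site_domain parameter are our own naming, not part of the API):

```python
from urllib.parse import urlparse

def split_links(records, site_domain):
    """Separate extracted link records into internal and external by hostname."""
    internal, external = [], []
    for record in records:
        host = urlparse(record["href"]).netloc
        (internal if host.endswith(site_domain) else external).append(record)
    return internal, external

records = [
    {"anchor": "Blog", "href": "https://www.foxscrape.com/blog/"},
    {"anchor": " Linkedin ", "href": "https://www.linkedin.com/company/26175275/admin/"},
]
internal, external = split_links(records, "foxscrape.com")
```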
Extract all text from a page
If you need to get all the text of a web page, and only the text (no HTML tags or attributes), you can use these rules:
{
  "extract_rules": {
    "text": "body"
  }
}
For example, using these rules on the FoxScrape landing page returns this result:
{
  "text": "Login Sign Up Pricing FAQ Blog Other Features Screenshots Google search API Data extraction JavaScript scenario No code scraping with Integromat Documentation Tired of getting blocked while scraping the web? FoxScrape API handles headless browsers and rotates proxies for you. Try FoxScrape for Free based on 25+ reviews. Render your web page as if it were a real browser. We manage thousands of headless instances using the latest Chrome version. Focus on extracting the data you need, and not dealing with concurrent headless browsers that will eat up all your RAM and CPU. Latest Chrome version Fast, no matter what! FoxScrape simplified our day-to-day marketing and engineering operations a lot . We no longer have to worry about managing our own fleet of headless browsers, and we no longer have to spend days sourcing the right proxy provider Mike Ritchie CEO @ SeekWell Javascript Rendering We render Javascript with a simple parameter so you can scrape every website, even Single Page Applications using React, AngularJS, Vue.js or any other libraries. Execute custom JS snippet Custom wait for all JS to be executed FoxScrape is helping us scrape many job boards and company websites without having to deal with proxies or chrome browsers. It drastically simplified our data pipeline Russel Taylor CEO @ HelloOutbound Rotating Proxies Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots! Large proxy pool Geotargeting Automatic proxy rotation FoxScrape clear documentation, easy-to-use API, and great success rate made it a no-brainer. Dominic Phillips Co-Founder @ CodeSubmit Three specific ways to use FoxScrape How our customers use our API: 1. ..."
}
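As you can see, the extracted text concatenates the content of every element on the page, so stray runs of whitespace from stripped tags are common. A trivial normalization sketch (our own helper, applied on your side after the API call):

```python
def normalize(text):
    """Collapse runs of whitespace left over from stripped HTML tags."""
    return " ".join(text.split())

page_text = "Login  Sign Up\n Pricing   FAQ Blog"
clean = normalize(page_text)
```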
Extract all email addresses from a page
If you need to get all the email addresses from a web page, you can use these rules:
{
  "extract_rules": {
    "email_addresses": {
      "selector": "a[href^='mailto']",
      "output": "@href",
      "type": "list"
    }
  }
}
Using these rules on the FoxScrape landing page returns this result:
{
  "email_addresses": [
    "mailto:contact@foxscrape.com"
  ]
}
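The returned hrefs still carry the mailto: scheme, and sometimes a query string such as ?subject=.... A small cleanup sketch (the emails_from_hrefs helper is our own, run on your side after the API call):

```python
def emails_from_hrefs(hrefs):
    """Strip the mailto: scheme and any ?subject=... query, dropping duplicates."""
    emails = []
    for href in hrefs:
        address = href.removeprefix("mailto:").split("?")[0]
        if address and address not in emails:
            emails.append(address)
    return emails

emails = emails_from_hrefs(
    ["mailto:contact@foxscrape.com", "mailto:contact@foxscrape.com?subject=Hello"]
)
```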
How does this work?
First, we target all anchors (a tags) that have an href attribute starting with the string mailto, then we extract only that href attribute. And since we want all email addresses on the page, not just one, we use the type list (on the FoxScrape landing page there is only one email address anyway).
Limitation
Those rules will only work for links whose href attributes contain mailto. If the email addresses on the page are plain text or simple anchors, you should either extract all the text on the page and run a regular expression over it, or extract all links on the page and filter for email addresses on your side.
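The regular-expression fallback mentioned above can be sketched like this, applied to the text extracted with the "text": "body" rule. The pattern is deliberately simplified; fully validating email syntax is considerably more involved:

```python
import re

# Simplified pattern; real-world email validation is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_emails(text):
    """Return unique email-looking strings found in page text, in order."""
    found = []
    for match in EMAIL_RE.findall(text):
        if match not in found:
            found.append(match)
    return found

emails = find_emails("Contact us at contact@foxscrape.com or sales@foxscrape.com.")
```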