Output Format

For a given selector, you can extract different kinds of data using the output option.

Output Options

text: Text content of selector (default)

text_relevant: Text content of selector, with scripts, CSS, header, and footer stripped so that only the main content remains. Very useful for AI training (beta)

markdown_relevant: Markdown content of selector, with scripts, CSS, header, and footer stripped so that only the main content remains. Very useful for AI training (beta)

html: HTML content of selector

@...: Attribute of selector (prefixed by @)

table_json: JSON representation of a <table> (more details below)

table_array: Array representation of a <table> (more details below)

Examples

Below is an example of different output options using the same selector:

JSON

{
  "extract_rules": {
    "title_text": {
      "selector": "h1",
      "output": "text"
    },
    "title_text_relevant": {
      "selector": "h1",
      "output": "text_relevant"
    },
    "title_markdown_relevant": {
      "selector": "h1",
      "output": "markdown_relevant"
    },
    "title_html": {
      "selector": "h1",
      "output": "html"
    },
    "title_id": {
      "selector": "h1",
      "output": "@id"
    },
    "table_array": {
      "selector": "table",
      "output": "table_array"
    },
    "table_json": {
      "selector": "table",
      "output": "table_json"
    }
  }
}

The information extracted by the above rules on a documentation page will be:

JSON

{
  "title_text": "Documentation - HTML API",
  "title_text_relevant": "Documentation - HTML API",
  "title_markdown_relevant": "# Documentation - HTML API",
  "title_html": "<h1 id=\"the-documentation\">Documentation - HTML API</h1>",
  "title_id": "the-documentation",
  "table_array": [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"]
  ],
  "table_json": [
    {
      "Feature used": "Basic scraping without JavaScript rendering",
      "API credit cost": "1"
    },
    {
      "Feature used": "Scraping with JavaScript rendering (default)",
      "API credit cost": "5"
    },
    {
      "Feature used": "Premium scraping without JavaScript rendering",
      "API credit cost": "10"
    },
    {
      "Feature used": "Premium scraping with JavaScript rendering",
      "API credit cost": "25"
    }
  ]
}

Note: text_relevant may have no visible effect on simple selectors like h1. Use it on body to see how it differs from text.

Shortcuts

To make extract rules easier to write and maintain, you can use a shorter syntax when extracting text or an @attribute.

For example, using:

JSON

{
  "extract_rules": {
    "title": "h1",
    "link": "a@href"
  }
}

Is the same as using:

JSON

{
  "extract_rules": {
    "title": {
      "selector": "h1",
      "output": "text",
      "type": "item"
    },
    "link": {
      "selector": "a",
      "output": "@href",
      "type": "item"
    }
  }
}
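
The equivalence between the two forms can be sketched in Python. This is only an illustration of the expansion shown above, not FoxScrape's actual parser: the helper name expand_rule is hypothetical, and splitting on the last @ is an assumption about how the shorthand is read.

```python
def expand_rule(rule):
    """Expand a shorthand rule string into its long form (illustrative only)."""
    if isinstance(rule, dict):
        # Already in long form: leave it untouched.
        return rule
    # Assumption: the attribute name is whatever follows the last "@".
    selector, sep, attribute = rule.rpartition("@")
    if sep and selector and attribute:
        return {"selector": selector, "output": "@" + attribute, "type": "item"}
    # No attribute part: the whole string is the selector, output is text.
    return {"selector": rule, "output": "text", "type": "item"}


shorthand = {"title": "h1", "link": "a@href"}
expanded = {name: expand_rule(rule) for name, rule in shorthand.items()}
print(expanded)
```

A selector that itself contains an @ character (for instance inside an attribute value) would not be handled correctly by this sketch; writing such rules in the long form avoids the ambiguity.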

Extracting Information from Tables

FoxScrape allows you to easily get formatted information from HTML tables.

We offer two modes to do it: table_array and table_json.

Let's say you want to extract this table from the HTML page:

Feature used	API credit cost
Basic scraping without JavaScript rendering	1
Scraping with JavaScript rendering (default)	5
Premium scraping without JavaScript rendering	10
Premium scraping with JavaScript rendering	25

And let's say that this table has its id set to pricing_table.

JSON Representation

If you use these extract rules:

JSON

{
  "extract_rules": {
    "table_json": {
      "selector": "#pricing_table",
      "output": "table_json"
    }
  }
}

You will get this result:

JSON

{
  "table_json": [
    {
      "Feature used": "Basic scraping without JavaScript rendering",
      "API credit cost": "1"
    },
    {
      "Feature used": "Scraping with JavaScript rendering (default)",
      "API credit cost": "5"
    },
    {
      "Feature used": "Premium scraping without JavaScript rendering",
      "API credit cost": "10"
    },
    {
      "Feature used": "Premium scraping with JavaScript rendering",
      "API credit cost": "25"
    }
  ]
}

Each row of the table is turned into a JSON object whose keys are the column names and whose values are the contents of the corresponding cells.

We advise using this mode if the table is correctly formatted and has a header row (a first row containing the column names).
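
As a rough sketch of how such a result might be consumed, the Python snippet below turns the table_json rows from the example above into a lookup from feature to credit cost. The variable names are illustrative, and the result is pasted in as a string rather than fetched from the API. Note that the cell values arrive as strings, so the sketch converts them to integers explicitly.

```python
import json

# Result copied from the example above, as it would arrive in a JSON response.
raw = """
{
  "table_json": [
    {"Feature used": "Basic scraping without JavaScript rendering", "API credit cost": "1"},
    {"Feature used": "Scraping with JavaScript rendering (default)", "API credit cost": "5"},
    {"Feature used": "Premium scraping without JavaScript rendering", "API credit cost": "10"},
    {"Feature used": "Premium scraping with JavaScript rendering", "API credit cost": "25"}
  ]
}
"""

result = json.loads(raw)

# Each row is a dict keyed by column name; cell values are strings.
costs = {
    row["Feature used"]: int(row["API credit cost"])
    for row in result["table_json"]
}

print(costs["Premium scraping with JavaScript rendering"])  # prints 25
```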

Array Representation

If you use these extract rules:

JSON

{
  "extract_rules": {
    "table_array": {
      "selector": "#pricing_table",
      "output": "table_array"
    }
  }
}

You will get this result:

JSON

{
  "table_array": [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"]
  ]
}

Each row of the table is turned into an array of N elements, where N is the number of columns in the table. The header row, if there is one, appears as the first array.

We advise using this mode if the table is not correctly formatted or doesn't have a header row (a first row containing the column names).
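
The two representations carry the same data when the first row is a header. The Python sketch below illustrates that relationship by rebuilding table_json-style objects from a table_array result. The function name rows_to_objects is hypothetical, and treating the first row as the header is an assumption that only holds for tables like the pricing example.

```python
def rows_to_objects(table_array):
    """Pair each data row with the header row to build one dict per row."""
    if not table_array:
        return []
    # Assumption: the first row holds the column names.
    header, *rows = table_array
    # zip stops at the shorter sequence, so extra cells in a ragged row are dropped.
    return [dict(zip(header, row)) for row in rows]


table_array = [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"],
]

objects = rows_to_objects(table_array)
print(objects[0])
```

For tables without a header row, keeping the array form and addressing cells by index is the safer choice, since there are no reliable column names to use as keys.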