Output Format

For a given selector, you can extract different kinds of data using the output option.

output: text | text_relevant | markdown_relevant | html | @... | table_array | table_json (default: text)

Output Options

text: Text content of selector (default)

text_relevant: Text content of the selector with scripts, CSS, headers, and footers stripped so that only the main content remains. Useful for preparing AI training data (beta)

markdown_relevant: Markdown content of the selector with scripts, CSS, headers, and footers stripped so that only the main content remains. Useful for preparing AI training data (beta)

html: HTML content of selector

@...: Attribute of selector (prefixed by @)

table_json: JSON representation of a <table> (more details below)

table_array: Array representation of a <table> (more details below)
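
The option list above can be captured in a small validation helper. What follows is a minimal Python sketch, not part of FoxScrape itself: the names NAMED_OUTPUTS and make_rule are hypothetical, and the only rules it encodes are the ones listed above (six named outputs, or an attribute name prefixed by @, with text as the default).

```python
# Hypothetical helper (not part of FoxScrape): builds one extract rule and
# rejects output values that are not in the list documented above.
NAMED_OUTPUTS = {
    "text",
    "text_relevant",
    "markdown_relevant",
    "html",
    "table_array",
    "table_json",
}


def make_rule(selector: str, output: str = "text") -> dict:
    """Return a rule dict with the documented "selector" and "output" keys."""
    # An attribute output is "@" followed by at least one character.
    is_attribute = output.startswith("@") and len(output) > 1
    if output not in NAMED_OUTPUTS and not is_attribute:
        raise ValueError(f"unsupported output option: {output!r}")
    return {"selector": selector, "output": output}


extract_rules = {
    "title_text": make_rule("h1"),                   # output defaults to "text"
    "title_id": make_rule("h1", "@id"),              # attribute output
    "table_json": make_rule("table", "table_json"),  # table output
}
```

Catching a misspelled output value locally, before any request is sent, is the only purpose of this sketch; it does not check that the selector itself is valid.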

Examples

Below is an example of different output options using the same selector:

JSON
{
  "extract_rules": {
    "title_text": {
      "selector": "h1",
      "output": "text"
    },
    "title_text_relevant": {
      "selector": "h1",
      "output": "text_relevant"
    },
    "title_markdown_relevant": {
      "selector": "h1",
      "output": "markdown_relevant"
    },
    "title_html": {
      "selector": "h1",
      "output": "html"
    },
    "title_id": {
      "selector": "h1",
      "output": "@id"
    },
    "table_array": {
      "selector": "table",
      "output": "table_array"
    },
    "table_json": {
      "selector": "table",
      "output": "table_json"
    }
  }
}

The information extracted by the above rules on a documentation page will be:

JSON
{
  "title_text": "Documentation - HTML API",
  "title_text_relevant": "Documentation - HTML API",
  "title_markdown_relevant": "# Documentation - HTML API",
  "title_html": "<h1 id=\"the-documentation\">Documentation - HTML API</h1>",
  "title_id": "the-documentation",
  "table_array": [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"]
  ],
  "table_json": [
    {
      "Feature used": "Basic scraping without JavaScript rendering",
      "API credit cost": "1"
    },
    {
      "Feature used": "Scraping with JavaScript rendering (default)",
      "API credit cost": "5"
    },
    {
      "Feature used": "Premium scraping without JavaScript rendering",
      "API credit cost": "10"
    },
    {
      "Feature used": "Premium scraping with JavaScript rendering",
      "API credit cost": "25"
    }
  ]
}

Note: text_relevant usually has no visible effect on simple selectors such as h1. Apply it to body to see how it differs from text.

Shortcuts

To make extract rules easier to write and maintain, you can use a shorter syntax when extracting text or an @attribute.

That is, writing:

JSON
{
  "extract_rules": {
    "title": "h1",
    "link": "a@href"
  }
}

is the same as writing:

JSON
{
  "extract_rules": {
    "title": {
      "selector": "h1",
      "output": "text",
      "type": "item"
    },
    "link": {
      "selector": "a",
      "output": "@href",
      "type": "item"
    }
  }
}
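
The equivalence between the short and the full form can be sketched in Python. This is only an illustration of the mapping shown above, not FoxScrape's own parser: expand_shortcut is a hypothetical name, and splitting on the last "@" is an assumption that may not match how the service handles unusual selectors.

```python
# Hypothetical sketch of the shortcut expansion described above.
# Assumption: the last "@" in the string separates the selector from the
# attribute name; the real parser may treat edge cases differently.


def expand_shortcut(rule: str) -> dict:
    """Expand "h1" or "a@href" into the full rule form."""
    selector, at, attribute = rule.rpartition("@")
    if at and selector and attribute:
        return {"selector": selector, "output": "@" + attribute, "type": "item"}
    # No attribute part: the whole string is the selector and output is text.
    return {"selector": rule, "output": "text", "type": "item"}


shortcuts = {"title": "h1", "link": "a@href"}
expanded = {name: expand_shortcut(rule) for name, rule in shortcuts.items()}
```

Here expanded holds the same two rules as the full form above, each with "type" set to "item".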

Extracting Information from Tables

FoxScrape allows you to easily get formatted information from HTML tables.

Two modes are available for this: table_array and table_json.

Let's say you want to extract this table from the HTML page:

Feature used                                   | API credit cost
Basic scraping without JavaScript rendering    | 1
Scraping with JavaScript rendering (default)   | 5
Premium scraping without JavaScript rendering  | 10
Premium scraping with JavaScript rendering     | 25

Assume also that this table has its id set to pricing_table.

JSON Representation

If you use these extract rules:

JSON
{
  "extract_rules": {
    "table_json": {
      "selector": "#pricing_table",
      "output": "table_json"
    }
  }
}

You will get this result:

JSON
{
  "table_json": [
    {
      "Feature used": "Basic scraping without JavaScript rendering",
      "API credit cost": "1"
    },
    {
      "Feature used": "Scraping with JavaScript rendering (default)",
      "API credit cost": "5"
    },
    {
      "Feature used": "Premium scraping without JavaScript rendering",
      "API credit cost": "10"
    },
    {
      "Feature used": "Premium scraping with JavaScript rendering",
      "API credit cost": "25"
    }
  ]
}

Each row of the table becomes a JSON object whose keys are the column names and whose values are the corresponding cell contents.

We recommend this mode when the table is well formed and has a header row (a first row containing the column names).
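
Because every row is keyed by column name, a table_json result is easy to consume downstream. The sketch below parses the example result shown above with Python's standard json module and builds a lookup from feature to credit cost. The variable names are illustrative only, and the int conversion rests on the observation that cell values appear as strings in the example.

```python
import json

# Response body copied from the example result above.
response_body = """
{
  "table_json": [
    {"Feature used": "Basic scraping without JavaScript rendering", "API credit cost": "1"},
    {"Feature used": "Scraping with JavaScript rendering (default)", "API credit cost": "5"},
    {"Feature used": "Premium scraping without JavaScript rendering", "API credit cost": "10"},
    {"Feature used": "Premium scraping with JavaScript rendering", "API credit cost": "25"}
  ]
}
"""

rows = json.loads(response_body)["table_json"]

# Cell values are strings in the example, so convert costs explicitly.
cost_by_feature = {row["Feature used"]: int(row["API credit cost"]) for row in rows}
```

With this lookup, cost_by_feature["Premium scraping with JavaScript rendering"] gives the integer 25 instead of the string "25".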

Array Representation

If you use these extract rules:

JSON
{
  "extract_rules": {
    "table_array": {
      "selector": "#pricing_table",
      "output": "table_array"
    }
  }
}

You will get this result:

JSON
{
  "table_array": [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"]
  ]
}

Each row of the table becomes an array of N elements, where N is the number of columns in the table.

We recommend this mode when the table is not well formed or has no header row (a first row containing the column names).
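
When a table_array result does turn out to start with a header row, it can be converted locally into the same shape as table_json. The following Python sketch shows one way to do that; rows_to_records is a hypothetical helper name, and treating the first row as the header is an assumption the caller has to verify for each table.

```python
# table_array result copied from the example above.
table_array = [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"],
]


def rows_to_records(table: list) -> list:
    """Pair each data row with the first row, assumed to be the header."""
    if not table:
        return []
    header, *body = table
    # zip stops at the shorter sequence, so extra cells in a row are dropped
    # and missing cells simply produce no key.
    return [dict(zip(header, row)) for row in body]


records = rows_to_records(table_array)
```

For a table with no header row, skip the conversion and index the inner arrays by position instead.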