Output Format
For a given selector, you can extract different kinds of data using the output option.
output: text | text_relevant | markdown_relevant | html | @... | table_array | table_json (default: text)
Output Options
text: Text content of selector (default)
text_relevant: Text content of selector, with scripts, CSS, headers, and footers stripped so that only the main content remains. Very useful for AI training (beta)
markdown_relevant: Markdown content of selector, with scripts, CSS, headers, and footers stripped so that only the main content remains. Very useful for AI training (beta)
html: HTML content of selector
@...: Attribute of selector (prefixed by @)
table_json: JSON representation of a <table> (more details below)
table_array: Array representation of a <table> (more details below)
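As a quick illustration of the option list above, here is a minimal Python sketch that checks whether a string is an acceptable output value before a rule is sent. The helper name is_valid_output is hypothetical and not part of FoxScrape; the set of names is copied from this section, and the sketch assumes that any non-empty name after "@" is treated as an attribute.

```python
# Output names listed in this section; anything starting with "@" is an attribute.
NAMED_OUTPUTS = {
    "text",
    "text_relevant",
    "markdown_relevant",
    "html",
    "table_array",
    "table_json",
}


def is_valid_output(value: str) -> bool:
    """Return True if value is a named output or an "@attribute" form."""
    if value in NAMED_OUTPUTS:
        return True
    # "@id", "@href", ...: require at least one character after the "@".
    return value.startswith("@") and len(value) > 1


print(is_valid_output("markdown_relevant"))  # True
print(is_valid_output("@href"))              # True
print(is_valid_output("json"))               # False
```

A check like this only catches typos early on the client side; the service remains the authority on which values it accepts.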
Examples
Below is an example of different output options using the same selector:
    {
      "extract_rules": {
        "title_text": {
          "selector": "h1",
          "output": "text"
        },
        "title_text_relevant": {
          "selector": "h1",
          "output": "text_relevant"
        },
        "title_markdown_relevant": {
          "selector": "h1",
          "output": "markdown_relevant"
        },
        "title_html": {
          "selector": "h1",
          "output": "html"
        },
        "title_id": {
          "selector": "h1",
          "output": "@id"
        },
        "table_array": {
          "selector": "table",
          "output": "table_array"
        },
        "table_json": {
          "selector": "table",
          "output": "table_json"
        }
      }
    }
The information extracted by the above rules on a documentation page will be:
    {
      "title_text": "Documentation - HTML API",
      "title_text_relevant": "Documentation - HTML API",
      "title_markdown_relevant": "# Documentation - HTML API",
      "title_html": "<h1 id=\"the-documentation\">Documentation - HTML API</h1>",
      "title_id": "the-documentation",
      "table_array": [
        ["Feature used", "API credit cost"],
        ["Basic scraping without JavaScript rendering", "1"],
        ["Scraping with JavaScript rendering (default)", "5"],
        ["Premium scraping without JavaScript rendering", "10"],
        ["Premium scraping with JavaScript rendering", "25"]
      ],
      "table_json": [
        {
          "Feature used": "Basic scraping without JavaScript rendering",
          "API credit cost": "1"
        },
        {
          "Feature used": "Scraping with JavaScript rendering (default)",
          "API credit cost": "5"
        },
        {
          "Feature used": "Premium scraping without JavaScript rendering",
          "API credit cost": "10"
        },
        {
          "Feature used": "Premium scraping with JavaScript rendering",
          "API credit cost": "25"
        }
      ]
    }
Note: text_relevant may return the same result as text on simple selectors like h1. Use it on body to see the difference.
Shortcuts
To make extract rules easier to write and maintain, you can use a simpler syntax to extract text and @attribute.
For example, these rules:

    {
      "extract_rules": {
        "title": "h1",
        "link": "a@href"
      }
    }

are equivalent to:

    {
      "extract_rules": {
        "title": {
          "selector": "h1",
          "output": "text",
          "type": "item"
        },
        "link": {
          "selector": "a",
          "output": "@href",
          "type": "item"
        }
      }
    }
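The equivalence between the short and long forms can be sketched in Python as follows. This is only an illustration of the mapping shown in this section, not FoxScrape's actual parser: the function names expand_shortcut and expand_rules are hypothetical, and the sketch assumes that the attribute, when present, is whatever follows the last "@" in the shortcut string.

```python
def expand_shortcut(rule):
    """Expand a shortcut string into the full rule form; pass dicts through."""
    if isinstance(rule, dict):
        return rule
    # Assumption: the attribute, if any, follows the last "@" in the string.
    selector, sep, attribute = rule.rpartition("@")
    if not sep:
        # No "@": the whole string is the selector and the output is text.
        return {"selector": rule, "output": "text", "type": "item"}
    return {"selector": selector, "output": "@" + attribute, "type": "item"}


def expand_rules(extract_rules):
    """Apply expand_shortcut to every rule in an extract_rules mapping."""
    return {name: expand_shortcut(rule) for name, rule in extract_rules.items()}


expanded = expand_rules({"title": "h1", "link": "a@href"})
print(expanded["title"])
# {'selector': 'h1', 'output': 'text', 'type': 'item'}
print(expanded["link"])
# {'selector': 'a', 'output': '@href', 'type': 'item'}
```

Selectors that themselves contain "@" would be split incorrectly by this sketch; in that case, writing the rule in its full form avoids the ambiguity.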
Extracting Information from Tables
FoxScrape allows you to easily get formatted information from HTML tables.
We offer two modes to do it: table_array and table_json.
Let's say you want to extract this table from the HTML page:
| Feature used | API credit cost |
|---|---|
| Basic scraping without JavaScript rendering | 1 |
| Scraping with JavaScript rendering (default) | 5 |
| Premium scraping without JavaScript rendering | 10 |
| Premium scraping with JavaScript rendering | 25 |
And let's say that this table has its id set to pricing_table.
JSON Representation
If you use these extract rules:

    {
      "extract_rules": {
        "table_json": {
          "selector": "#pricing_table",
          "output": "table_json"
        }
      }
    }
You will get this result:
    {
      "table_json": [
        {
          "Feature used": "Basic scraping without JavaScript rendering",
          "API credit cost": "1"
        },
        {
          "Feature used": "Scraping with JavaScript rendering (default)",
          "API credit cost": "5"
        },
        {
          "Feature used": "Premium scraping without JavaScript rendering",
          "API credit cost": "10"
        },
        {
          "Feature used": "Premium scraping with JavaScript rendering",
          "API credit cost": "25"
        }
      ]
    }
Each row of the table is turned into a JSON object whose keys are the column names and whose values are the contents of the corresponding cells.
We recommend this mode if the table is correctly formatted and has a header row (a first row containing the column names).
Array Representation
If you use these extract rules:

    {
      "extract_rules": {
        "table_array": {
          "selector": "#pricing_table",
          "output": "table_array"
        }
      }
    }
You will get this result:
    {
      "table_array": [
        ["Feature used", "API credit cost"],
        ["Basic scraping without JavaScript rendering", "1"],
        ["Scraping with JavaScript rendering (default)", "5"],
        ["Premium scraping without JavaScript rendering", "10"],
        ["Premium scraping with JavaScript rendering", "25"]
      ]
    }
Each row of the table is turned into an array of N elements, where N is the number of columns in the table.
We recommend this mode if the table is not correctly formatted or doesn't have a header row (a first row containing the column names).
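To make the relationship between the two representations concrete, here is a small Python sketch that converts a table_array result into the table_json shape on the client side. The function name table_array_to_json is hypothetical, and the sketch assumes the first row holds the column names; it is not a description of how FoxScrape builds table_json internally.

```python
def table_array_to_json(rows):
    """Turn a table_array result into a list of dicts keyed by the first row."""
    if not rows:
        return []
    header, *body = rows
    # zip() stops at the shorter sequence, so extra cells in a row are dropped
    # and missing cells simply produce no key.
    return [dict(zip(header, row)) for row in body]


table_array = [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"],
]

for record in table_array_to_json(table_array):
    print(record)
```

Cell values stay strings, as in the results shown in this section, so numeric columns such as "API credit cost" need an explicit conversion (for example int(record["API credit cost"])) before any arithmetic.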