Output Format
For a given selector, you can extract different kinds of data using the output option.
output: text | text_relevant | markdown_relevant | html | @... | table_array | table_json (default: text)
Output Options
text: Text content of selector (default)
text_relevant: Text content of selector, stripped of scripts, CSS, headers, and footers so that only the main content remains. Useful for AI training data (beta)
markdown_relevant: Markdown content of selector, stripped of scripts, CSS, headers, and footers so that only the main content remains. Useful for AI training data (beta)
html: HTML content of selector
@...: Value of an attribute of selector, written as @ followed by the attribute name (for example @href)
table_json: JSON representation of a <table> (more details below)
table_array: Array representation of a <table> (more details below)
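The option list above can be checked on the client side before a rule is sent. The sketch below is only an illustration: the helper names `is_valid_output` and `make_rule` are hypothetical, and the service itself may validate rules differently.

```python
# Named output values listed in this section; any other value must be an "@attribute".
NAMED_OUTPUTS = {
    "text",
    "text_relevant",
    "markdown_relevant",
    "html",
    "table_array",
    "table_json",
}


def is_valid_output(value: str) -> bool:
    """Return True for a named output or an "@" followed by a non-empty attribute name."""
    if value in NAMED_OUTPUTS:
        return True
    return value.startswith("@") and len(value) > 1


def make_rule(selector: str, output: str = "text") -> dict:
    """Build one extract rule; output defaults to "text", as documented above."""
    if not is_valid_output(output):
        raise ValueError(f"unsupported output option: {output!r}")
    return {"selector": selector, "output": output}


title_rule = make_rule("h1")          # uses the default output
id_rule = make_rule("h1", "@id")      # attribute output
```

Rejecting an unknown value locally, such as a misspelled `text_relevent`, surfaces the mistake before any request is made.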
Examples
Below is an example of different output options using the same selector:
{
  "extract_rules": {
    "title_text": {
      "selector": "h1",
      "output": "text"
    },
    "title_text_relevant": {
      "selector": "h1",
      "output": "text_relevant"
    },
    "title_markdown_relevant": {
      "selector": "h1",
      "output": "markdown_relevant"
    },
    "title_html": {
      "selector": "h1",
      "output": "html"
    },
    "title_id": {
      "selector": "h1",
      "output": "@id"
    },
    "table_array": {
      "selector": "table",
      "output": "table_array"
    },
    "table_json": {
      "selector": "table",
      "output": "table_json"
    }
  }
}
The information extracted by the above rules on a documentation page will be:
{
  "title_text": "Documentation - HTML API",
  "title_text_relevant": "Documentation - HTML API",
  "title_markdown_relevant": "# Documentation - HTML API",
  "title_html": "<h1 id=\"the-documentation\">Documentation - HTML API</h1>",
  "title_id": "the-documentation",
  "table_array": [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"]
  ],
  "table_json": [
    {
      "Feature used": "Basic scraping without JavaScript rendering",
      "API credit cost": "1"
    },
    {
      "Feature used": "Scraping with JavaScript rendering (default)",
      "API credit cost": "5"
    },
    {
      "Feature used": "Premium scraping without JavaScript rendering",
      "API credit cost": "10"
    },
    {
      "Feature used": "Premium scraping with JavaScript rendering",
      "API credit cost": "25"
    }
  ]
}
Note: text_relevant has little visible effect on simple selectors like h1. Use it on body to see how it differs from text.
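To make the relationship between the text, html, and @id outputs concrete, the following local sketch reproduces those three values for the h1 shown above using only Python's standard `html.parser`. It is an illustration of what each output corresponds to, not a description of how FoxScrape performs extraction; the class name `H1Extractor` is made up for this example.

```python
from html.parser import HTMLParser


class H1Extractor(HTMLParser):
    """Collect the text and attributes of an <h1> element in the fed markup."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.text_parts = []
        self.attrs = {}

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.attrs = dict(attrs)  # e.g. {"id": "the-documentation"}

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.text_parts.append(data)


html = '<h1 id="the-documentation">Documentation - HTML API</h1>'
parser = H1Extractor()
parser.feed(html)
parser.close()

result = {
    "title_text": "".join(parser.text_parts),  # what "text" corresponds to
    "title_html": html,                        # what "html" corresponds to
    "title_id": parser.attrs.get("id"),        # what "@id" corresponds to
}
```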
Shortcuts
To make extract rules easier to write and maintain, you can use a shorter syntax when extracting text or an @attribute.
Using:
{
  "extract_rules": {
    "title": "h1",
    "link": "a@href"
  }
}
is the same as using:
{
  "extract_rules": {
    "title": {
      "selector": "h1",
      "output": "text",
      "type": "item"
    },
    "link": {
      "selector": "a",
      "output": "@href",
      "type": "item"
    }
  }
}
Extracting Information from Tables
FoxScrape can extract formatted data from HTML tables.
Two modes are available: table_array and table_json.
Suppose you want to extract this table from an HTML page:
| Feature used | API credit cost |
|---|---|
| Basic scraping without JavaScript rendering | 1 |
| Scraping with JavaScript rendering (default) | 5 |
| Premium scraping without JavaScript rendering | 10 |
| Premium scraping with JavaScript rendering | 25 |
Assume that this table's id is pricing_table.
JSON Representation
If you use these extract rules:
{
  "extract_rules": {
    "table_json": {
      "selector": "#pricing_table",
      "output": "table_json"
    }
  }
}
You will get this result:
{
  "table_json": [
    {
      "Feature used": "Basic scraping without JavaScript rendering",
      "API credit cost": "1"
    },
    {
      "Feature used": "Scraping with JavaScript rendering (default)",
      "API credit cost": "5"
    },
    {
      "Feature used": "Premium scraping without JavaScript rendering",
      "API credit cost": "10"
    },
    {
      "Feature used": "Premium scraping with JavaScript rendering",
      "API credit cost": "25"
    }
  ]
}
Each row of the table is turned into a JSON object whose keys are the column names and whose values are the corresponding cell contents.
We recommend this mode when the table is well formed and has a header row (a first row containing the column names).
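The mapping from rows to objects can be reproduced locally to see why a header row matters. The sketch below is an illustration under stated assumptions: the table markup is invented (this section shows only the rendered table and its id), only two data rows are kept for brevity, and the `TableRows` class is a hypothetical helper rather than anything FoxScrape exposes.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Assumed markup: the real HTML of the page is not shown in this section.
PRICING_HTML = """
<table id="pricing_table">
  <tr><th>Feature used</th><th>API credit cost</th></tr>
  <tr><td>Basic scraping without JavaScript rendering</td><td>1</td></tr>
  <tr><td>Premium scraping with JavaScript rendering</td><td>25</td></tr>
</table>
"""

parser = TableRows()
parser.feed(PRICING_HTML)
parser.close()

# The first row supplies the keys; every later row becomes one object.
header, *body = parser.rows
table_json = [dict(zip(header, row)) for row in body]
```

If the first row were data instead of column names, its cell values would end up as keys, which is the situation the array mode is meant for.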
Array Representation
If you use these extract rules:
{
  "extract_rules": {
    "table_array": {
      "selector": "#pricing_table",
      "output": "table_array"
    }
  }
}
You will get this result:
{
  "table_array": [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
    ["Premium scraping without JavaScript rendering", "10"],
    ["Premium scraping with JavaScript rendering", "25"]
  ]
}
Each row of the table is turned into an array of N elements, where N is the number of columns in the table.
We recommend this mode when the table is not well formed or has no header row (a first row containing the column names).
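When a table has no header row, one way to work with a table_array result is to attach column names yourself after extraction. The helper below is a minimal sketch of that idea; `rows_to_records` and the column names `feature` and `credits` are hypothetical and not part of the API.

```python
def rows_to_records(table_array, header=None):
    """Turn table_array rows into dicts.

    If header is None, the first row is treated as the column names;
    otherwise every row is treated as data and keyed by the given header.
    """
    if not table_array:
        return []
    if header is None:
        header, rows = table_array[0], table_array[1:]
    else:
        rows = table_array
    return [dict(zip(header, row)) for row in rows]


# A table_array result shaped like the one above (shortened to two data rows).
table_array = [
    ["Feature used", "API credit cost"],
    ["Basic scraping without JavaScript rendering", "1"],
    ["Scraping with JavaScript rendering (default)", "5"],
]

# Header row present: same shape as the table_json output.
records = rows_to_records(table_array)

# Headerless table: supply column names yourself (made-up names).
headerless = rows_to_records(table_array[1:], header=["feature", "credits"])
```

Cell values stay strings, as in the results shown above, so numeric columns such as the credit cost still need an explicit conversion.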