For example, suppose you want the content of the h1 and p tags from a series of articles on example.com, where the h1 tag contains the title of each article and the p tags contain the body text.
Using ScrapeOwl's API, you specify example.com as the URL you would like to scrape, and h1 and p as the elements you would like to parse and retrieve content from.
Before you can start using the ScrapeOwl API, sign up for an account on the registration page, or log in to the dashboard if you already have one.
Once you're in, your API key is visible on the dashboard; simply copy and paste it into your requests. The key is an 80-character string of random numbers and letters that looks something like this:
9ijf24fk93rg038jg30rigj394f34f0kh12d12ep3fp24gk3pgk34g23gf74fl430913fj2133f32ffj
The base URL for our scraping API is:
https://api.scrapeowl.com/v1/scrape

APIs are consumed programmatically, meaning you write a program to get the data you want from the websites you want to scrape and parse.
To demonstrate the power of ScrapeOwl, the example below makes a test request to the API using JavaScript Object Notation (JSON):
{
"api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here
"url": "https://httpbin.org/ip" // The site you want to scrape and parse
}

The API returns a JSON response like this:

{
"status": 200,
"is_billed": true,
"credits": {
"available": 0,
"used": 0,
"request_cost": 0
},
"html": "{\n \"origin\": \"98.118.113.251\"\n}\n"
}
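Here is how you might send that request from code: a minimal sketch, assuming Python with the requests library and that the endpoint accepts the payload as a JSON POST body.

import requests

API_URL = "https://api.scrapeowl.com/v1/scrape"

payload = {
    "api_key": "YOUR_API_KEY",        # your key from the dashboard
    "url": "https://httpbin.org/ip",  # the site you want to scrape
}

# POST the JSON payload and inspect the parsed response.
response = requests.post(API_URL, json=payload)
result = response.json()
print(result["status"])  # 200 on success
print(result["html"])    # the scraped content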
Name | Type | Default | Description |
---|---|---|---|
api_key | string | required | Your ScrapeOwl API key |
url | string | required | Target URL to scrape |
elements | array | optional | List of elements to extract from the webpage |
html | boolean | false | Return the page's HTML if set to true |
return_headers | boolean | false | Return the headers sent by the target website's server |
return_cookies | boolean | false | Return the cookies sent by the target website's server |
cookies | string | optional | Cookie headers to send to the target URL |
headers | array | optional | HTTP headers to send to the target URL |
request_method | string | GET | GET, POST, or PUT. Use POST or PUT if you want to send data with the request |
post_data | string or array | optional | Data to POST to the page |
premium_proxies | boolean | false | Use residential proxies to scrape |
country | string | optional | Source IP country for residential proxies |
render_js | boolean | false | Use a headless Chrome instance to execute JavaScript |
custom_js | string | optional | Custom JavaScript to run on the page before extracting its HTML |
wait_for | string or integer | optional | An element we should wait for before scraping the page, or an amount of time in milliseconds |
reject_requests | array | optional | Block matching requests on the page |
json_response | boolean | true | If set to false, returns HTML instead of JSON; also useful for downloading files |
screenshot | boolean | false | Take a screenshot of the full page as seen by our API; only works with render_js and is useful for debugging |
block_resources | boolean | true | When running render_js, CSS, images, and fonts are blocked by default. This can break some sites (sites using React, for example). When set to false, the API loads the page without blocking any resources, i.e., exactly as it loads in a normal browser |
ScrapeOwl's killer feature is that it's able to extract data from pages, so you do not need to parse the page's HTML yourself. Below is a table detailing the selection fields.
Name | Type | Default | Description |
---|---|---|---|
type | string | css | Selector type. Set to xpath to extract using XPath; the default is css |
selector | string | required | XPath or CSS selector for the element you are trying to extract |
timeout | integer | optional | Time in milliseconds to wait for dynamic elements to load on the page |
html | boolean | false | Return the inner HTML of the element |
attributes | boolean or array | false | Return the attributes of the element |
The following example request uses both selector types:

{
"api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here
"url": "https://example.com", // The site you want to scrape and parse
"elements": [
{
"type": "xpath",
"selector": "//p"
},
{
"type": "css",
"selector": "p"
}
]
}
The above JSON code shows a request whose elements attribute uses both selector types: the first element uses an XPath selector and the second a CSS selector. A CSS selector is essentially your HTML element name, class, or ID. It's what you would write in your stylesheet (CSS).

The API responds with the extracted data:

{
"status": 200,
"is_billed": true,
"credits": {
"available": 0,
"used": 0,
"request_cost": 0
},
"data": [
{
"type": "xpath",
"selector": "//p",
"results": [
{
"text": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
},
{
"text": "More information..."
}
]
},
{
"type": "css",
"selector": "p",
"results": [
{
"text": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
},
{
"text": "More information..."
}
]
}
],
"html": ""
}
The above is an example of a response when extracting all p tags on https://example.com; the API extracts every element that matches the given selectors. The response contains your data as objects, where each object holds the text, HTML, or attributes of an element.
Attributes are key-value pairs of HTML attributes, like class or any custom data-* attributes, set on the particular elements on the page.
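For example, to return each link's attributes instead of its text, you can set the attributes field on an element. A minimal sketch, assuming Python with the requests library (the selector and attribute names are illustrative):

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "elements": [
        # attributes may be a boolean or an array (see the table above);
        # an array presumably limits which attributes are returned.
        {"type": "css", "selector": "a", "attributes": ["href", "class"]}
    ],
}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json()["data"])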
wait_for can be used to make ScrapeOwl wait for a given amount of time (in milliseconds) before capturing the elements or the page's HTML, as shown in the following example:
{
"wait_for": 6000
}
The above code waits 6000 ms (6 seconds) before capturing the elements or the page's HTML.
wait_for can also be used to wait for an element to load on the page:
{
"wait_for": "p"
}
The above waits for the p tag elements on the page to load before extracting data.
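In a full request, wait_for sits alongside the other parameters. A minimal sketch, assuming Python with the requests library; because wait_for targets dynamically loaded content, it is paired here with render_js:

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "render_js": True,  # execute JavaScript so dynamic content can load
    "wait_for": "p",    # wait for p elements before extracting
    "elements": [{"type": "css", "selector": "p"}],
}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json()["data"])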
You can pass custom JavaScript to ScrapeOwl using the custom_js attribute. The custom JavaScript is executed in the page's console before the content is captured.
When testing, make sure that the JavaScript snippet runs in your browser's console, as ScrapeOwl executes the code in the same environment.
For example, the following JavaScript scrolls to the bottom of the page:
{
"custom_js": "window.scrollTo(0,document.body.scrollHeight);"
}
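Putting it together, a request that scrolls the page before capture could look like the following sketch (assuming Python with the requests library; render_js is enabled because the snippet has to run in a browser environment):

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "render_js": True,  # custom_js needs a headless browser to run in
    "custom_js": "window.scrollTo(0,document.body.scrollHeight);",
    "html": True,       # return the page's HTML after the script has run
}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json()["html"])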
The headers attribute is an object containing all of the headers that you want to send to the target URL. You can easily set custom HTTP headers to pass along with your requests as follows:
{
"headers": {
"Accept-Language": "en-US"
}
}
The above code example sends the Accept-Language HTTP header with all outgoing requests.
By default, images, CSS, and font resources are blocked. To make sure that all resources are allowed, set block_resources to false; loading the page the way a real browser would reduces your chances of being blocked. If you want to allow or disallow specific resources instead, reject_requests lets you block any resource of your choice on the page by specifying its file format. You can block tracking scripts (.js files), images (.png, .jpg, or .jpeg files), or CSS (.css files).
For example, to block all CSS on a page:
{
"reject_requests": ["css"]
}
To block PNG and JPG images on a page:
{
"reject_requests": ["png", "jpg"]
}
To block a particular URL on a page:
{
"reject_requests": ["someurl.com/style.css"]
}
If you want to send an HTTP POST/PUT request, set request_method to POST or PUT.
If you also want to send data with the request, set it in the post_data attribute.
When posting data, we try to detect the data type and set an appropriate Content-Type header:
We set application/json if post_data is an array or object.
We set application/x-www-form-urlencoded if post_data is a string.
The Content-Type: application/x-www-form-urlencoded header makes the posted data behave like a normal form submission.
The following code example uses a POST request to send form data:
{
"api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here
"url": "https://httpbin.org/anything", // The site you want to scrape and parse
"request_method": "POST",
"post_data": "name=value"
}

httpbin.org echoes the request back, so the response shows the submitted form data:

{
"args":{
},
"data":"",
"files":{
},
"form":{
"name":"value"
},
"headers":{
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding":"gzip, deflate, br",
"Accept-Language":"en-US",
"Cache-Control":"no-cache",
"Content-Length":"27",
"Content-Type":"application/x-www-form-urlencoded",
"Host":"httpbin.org",
"Pragma":"no-cache",
"Upgrade-Insecure-Requests":"1",
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
"X-Amzn-Trace-Id":"Root=1-5f760c71-78192ebb56c4138b2952949a"
},
"json":null,
"origin":"98.118.113.251",
"url":"https://httpbin.org/post"
}
You can re-use the same IP address for multiple requests using the sticky sessions feature. To make sure that the IP address does not change, set a session value in the request, for example "session": "123456". To use the same IP again, send the same session value as the prior request.
There is no limit on the number of sessions you can have at a given time. However, the premium_proxies parameter needs to be set to true:
{
"session": 1234
}
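A complete sticky-session request therefore combines premium_proxies with a session value. A minimal sketch, assuming Python with the requests library:

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://httpbin.org/ip",
    "premium_proxies": True,  # sessions require premium proxies
    "session": 1234,          # send the same value later to reuse this IP
}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json())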
By default, we use data center proxies to scrape data from the web. But some websites can be difficult to scrape, and for those you might need to switch to our premium proxies.
Premium proxies use residential IP addresses, so they evade the detection that data center addresses attract.
If you use premium proxies, then each request costs 10 credits.
If you use premium proxies with render_js, then each request costs 25 credits.
{
"premium_proxies": true
}
Proxies are located in the USA by default. You can set the country parameter to another country to rotate IPs or view localized versions of sites as needed.
{
"premium_proxies": true,
"country": "gb"
}
The above example uses IPs from the UK.
ISO code | Name |
---|---|
br | Brazil |
ca | Canada |
fr | France |
de | Germany |
gr | Greece |
il | Israel |
in | India |
it | Italy |
mx | Mexico |
nl | Netherlands |
ru | Russia |
es | Spain |
se | Sweden |
gb | United Kingdom |
us | United States |
ScrapeOwl can be used just like any other proxy.
When used in proxy mode, the data parsing (elements) feature is not available because you no longer receive JSON responses.
To use ScrapeOwl as a proxy, copy the following connection string and set your API key:
http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000
curl -x "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000" -k "https://httpbin.org/ip"
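The same proxy works from code. A minimal sketch, assuming Python with the requests library; certificate verification is disabled to mirror curl's -k flag above:

import requests

proxy = "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000"

# Route both HTTP and HTTPS traffic through the ScrapeOwl proxy.
# verify=False mirrors curl's -k flag.
response = requests.get(
    "https://httpbin.org/ip",
    proxies={"http": proxy, "https": proxy},
    verify=False,
)
print(response.text)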
Because ScrapeOwl has multiple features, the credit consumption per request depends on the specifics of your request and which attributes you send along.
The request_cost attribute in the response shows how many credits the request cost.
Description | Value |
---|---|
A request using rotating proxies without render_js | 1 credit |
A request using rotating proxies with render_js | 5 credits |
A request using premium proxies without render_js | 10 credits |
A request using premium proxies with render_js | 25 credits |
The response status code is set in the HTTP header as well as in the JSON object that you get in response.
The is_billed attribute shows whether the request was billed against the credits in your account.
Response Code | Description |
---|---|
200 (billed) | Successful call |
400 | Bad Request |
401 | Unauthorized - Your API key is wrong or you haven't provided one |
403 | Forbidden - You do not have enough credits left for this API call |
404 (billed) | Page not found |
429 | Too Many Concurrent Requests - You have used all of your allowed concurrent requests. Please upgrade or contact support to increase the limit |
500 | Internal Error - Please try again |
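In code, you can branch on the status and is_billed fields of the JSON response. A minimal sketch, assuming Python with the requests library:

import requests

payload = {"api_key": "YOUR_API_KEY", "url": "https://example.com"}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
result = response.json()

if result["status"] == 200:
    print("Success; billed:", result["is_billed"])
elif result["status"] == 429:
    print("Concurrency limit reached; retry later or raise the limit")
else:
    print("Request failed with status", result["status"])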
If you would like to check the stats about credit usage on your account, send a GET request to the API's usage endpoint. Requests to the usage endpoint are limited to 20 per minute. The response looks like this:
{
"credits": 1000,
"credits_used": 600,
"requests": 510,
"failed_requests": 0,
"successful_requests": 510,
"concurrency_limit": 1
}
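From code, checking usage is a simple GET. A minimal sketch, assuming Python with the requests library; the exact endpoint URL is not shown in this guide, so the path below, modeled on the scrape endpoint, and the api_key query parameter are assumptions (check your dashboard for the actual URL):

import requests

# Assumed endpoint URL, following the same base as /v1/scrape.
USAGE_URL = "https://api.scrapeowl.com/v1/usage"

# api_key passed as a query parameter (an assumption).
response = requests.get(USAGE_URL, params={"api_key": "YOUR_API_KEY"})
stats = response.json()
print(stats["credits_used"], "of", stats["credits"], "credits used")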