Introduction
What is the ScrapeOwl API?
ScrapeOwl is a simple web scraping and data extraction API.
How does it work?
ScrapeOwl's API allows you to send requests specifying the website and the elements you want to scrape from it.
Example
For example, suppose you want the content of the h1 and p tags from a series of articles on example.com, where the h1 tag contains the title of the article and the p tags contain the body text.
Using ScrapeOwl’s API, you specify example.com as the URL you would like to scrape, and h1 and p as the elements you would like to parse and retrieve content from.
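Expressed as a request payload, that example might look like the following minimal sketch (the api_key and elements parameters are explained in the sections below):
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "elements": [
    { "selector": "h1" },
    { "selector": "p" }
  ]
}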
Getting Started
Before you can start using the ScrapeOwl API, sign up for an account on the registration page, or log in to the dashboard if you already have one.
Once you're in, your API key is visible on the dashboard, where you can copy it and paste it into your requests. It is an 80-character string of random letters and numbers that looks something like this:
9ijf24fk93rg038jg30rigj394f34f0kh12d12ep3fp24gk3pgk34g23gf74fl430913fj2133f32ffj
API Endpoint
The base URL for our scraping API is:
https://api.scrapeowl.com/v1/scrape
Making Your First Request
APIs are consumed programmatically, meaning you write a program that gets the data you want from the websites you want to scrape and parse.
To see the ScrapeOwl API in action, use the JavaScript Object Notation (JSON) payload below to make a test request to the API.
Code example
{
"api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here
"url": "https://httpbin.org/ip" // The site you want to scrape and parse
}
Result
{
"status": 200,
"is_billed": true,
"credits": {
"available": 0,
"used": 0,
"request_cost": 0
},
"html": "{\n \"origin\": \"98.118.113.251\"\n}\n"
}
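Here is a minimal sketch of sending that request in Python with the third-party requests library; it assumes the scrape endpoint accepts the parameters as a JSON POST body, as the examples in this documentation suggest:

import requests  # third-party HTTP client: pip install requests

payload = {
    "api_key": "YOUR_API_KEY",        # the API key from your dashboard
    "url": "https://httpbin.org/ip",  # the site you want to scrape
}

# Assumption: the endpoint accepts the parameters as a JSON POST body
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json())  # e.g. {'status': 200, 'is_billed': True, ...}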
The following table lists all available ScrapeOwl API parameters:
Name | Type | Default | Description |
---|---|---|---|
api_key | string | required | Your ScrapeOwl API Key |
url | string | required | Target URL to scrape |
elements | array | optional | List of elements to extract from the webpage |
html | boolean | false | Return the page's HTML if set to true |
return_headers | boolean | false | Return headers returned by the target website server |
return_cookies | boolean | false | Return cookies returned by the target website server |
cookies | string | optional | Cookie header to send to the target URL |
headers | array | optional | HTTP headers to send to the target URL |
request_method | string | GET | GET, POST, or PUT. Defaults to GET; use POST or PUT if you want to send data. |
post_data | string or array | optional | Data to send to the target URL with a POST/PUT request |
premium_proxies | boolean | optional | Use residential proxies to scrape |
country | string | optional | Set source IP country for residential proxy |
render_js | boolean | optional | Use headless chrome instance to execute JavaScript |
custom_js | string | optional | Custom JavaScript to run on the page before its HTML is extracted |
wait_for | string | optional | An element to wait for before scraping the page, or an amount of time in ms |
reject_requests | array | optional | Block requests on page |
json_response | boolean | true | If set to false, returns the raw result instead of JSON; also useful for downloading files |
screenshot | boolean | false | Takes a screenshot of the full page as seen by our API; only works with render_js and is useful for debugging |
block_resources | boolean | true | When running render_js we block CSS, images, and fonts by default, which can break some sites (e.g., sites using React). When set to false, the API loads the page without blocking any resources, i.e., exactly as a page loads in a normal browser |
Extracting Custom Elements
ScrapeOwl's killer feature is that it's able to extract data from pages, so you do not need to parse the page's HTML yourself. Below is a table detailing the selection fields.
ScrapeOwl API Element Selection
Name | Type | Default | Description |
---|---|---|---|
type | string | css | Set to xpath to extract using an XPath selector; the default is css |
selector | string | required | The XPath or CSS selector of the element you are trying to extract |
timeout | integer | optional | Time in milliseconds to wait for dynamic elements to load on the page |
html | boolean | false | Return the inner HTML of the element |
attributes | boolean or array | false | Return the attributes of the element |
Code example
{
"api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here
"url": "https://example.com", // The site you want to scrape and parse
//After you've provided your API key and the URL of the site you wish to parse,
"elements" : [
{
"type": "xpath",
"selector": "//p"
},
{
"type": "css",
"selector": "p"
}
]
}
The above JSON shows a request whose elements array demonstrates both selector types: the first entry uses an XPath selector and the second a CSS selector. A CSS selector is essentially an HTML element name, class, or ID; it's what you would write in your stylesheet (CSS).
Element extraction result
{
"status": 200,
"is_billed": true,
"credits": {
"available": 0,
"used": 0,
"request_cost": 0
},
"data": [
{
"type": "xpath",
"selector": "//p",
"results": [
{
"text": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
},
{
"text": "More information..."
}
]
},
{
"type": "css",
"selector": "p",
"results": [
{
"text": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission."
},
{
"text": "More information..."
}
]
}
],
"html": ""
}
This is an example response from extracting all p tags on https://example.com; the API extracts every element that matches the given conditions. The response contains your data as objects, where each object contains the text, HTML, or attributes of a matched element.
Attributes are key-value pairs of HTML attributes, like class or any custom data-* attributes, set on the matched elements on the page.
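For example, to get specific attributes back, attributes can be set to an array of attribute names (see the element table above). A sketch, using hypothetical attribute names:
{
  "elements": [
    {
      "selector": "a",
      "attributes": ["href", "class"]
    }
  ]
}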
Waiting for Elements to Load
wait_for can be used to make ScrapeOwl wait for a given amount of time (in milliseconds) before capturing the elements or page's HTML as shown in the following example.
{
"wait_for": 6000
}
The above code waits for the duration of 6000ms (6 seconds) before capturing the elements or page's HTML.
wait_for can also be used to wait for an element to load on the page:
{
"wait_for": "p"
}
The above will wait for all of the p tag elements on the page to load before extracting data from the page.
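For instance, a complete request that waits for a hypothetical .comments element before extracting it might look like the sketch below; render_js is assumed here, since waiting is mainly useful when the page renders content with JavaScript:
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "render_js": true,
  "wait_for": ".comments",
  "elements": [
    { "selector": ".comments" }
  ]
}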
Executing Custom JavaScript on the Webpage
You can pass custom JavaScript to ScrapeOwl using the custom_js attribute; the custom JavaScript is executed in the page's console before the content is captured.
When testing, make sure that the JavaScript snippet executes in your browser's console, as ScrapeOwl executes the code in the same environment.
For example, the following JavaScript scrolls to the bottom of the page.
{
"custom_js": "window.scrollTo(0,document.body.scrollHeight);"
}
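Putting it together, a sketch of a full request that scrolls to the bottom of the page before capturing its HTML; render_js is assumed, since custom JavaScript needs a browser environment to run in:
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "render_js": true,
  "custom_js": "window.scrollTo(0, document.body.scrollHeight);",
  "html": true
}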
Setting Custom HTTP Headers
The headers attribute is an object containing all of the headers that you want to send to the target URL. You can easily set custom HTTP headers to pass along with your requests as follows:
{
"headers": {
"Accept-Language": "en-US"
}
}
The above code example sends the Accept-Language HTTP header with every outgoing request.
Blocking Requests on Page
By default, images, CSS, and font resources are blocked. To bypass this and make sure that all resources are allowed, set block_resources to false; this reduces your chances of being blocked and makes you look like a real user. If you instead want to block specific resources, reject_requests lets you block any resource of your choice on the page by specifying its file format. For example, you can block tracking scripts (.js files), images (.png, .jpg, or .jpeg files), or CSS (.css files).
For example, to block all CSS on a page:
{
"reject_requests": ["css"]
}
And if you need to block any PNG and JPG image(s) on a page:
{
"reject_requests": ["png", "jpg"]
}
And if you need to block a particular URL on a page:
{
"reject_requests": ["someurl.com/style.css"]
}
Posting Data
If you want to send an HTTP POST/PUT request, then set request_method to POST or PUT.
If you also want to send data with the request, then you can set your data in post_data attribute.
When posting data, we try to detect the data type and set an appropriate Content-Type header.
We set application/json if you supply post_data as an array or object.
We set application/x-www-form-urlencoded if you supply post_data as a string.
The Content-Type: application/x-www-form-urlencoded header makes the posted data behave like a normal form submission.
The following code example uses a POST request to send form data:
Code example
{
"api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here
"url": "https://httpbin.org/anything", // The site you want to scrape and parse
"request_method": "POST",
"post_data": "name=value"
}
Result
{
"args":{
},
"data":"",
"files":{
},
"form":{
"name":"value"
},
"headers":{
"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
"Accept-Encoding":"gzip, deflate, br",
"Accept-Language":"en-US",
"Cache-Control":"no-cache",
"Content-Length":"27",
"Content-Type":"application/x-www-form-urlencoded",
"Host":"httpbin.org",
"Pragma":"no-cache",
"Upgrade-Insecure-Requests":"1",
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36",
"X-Amzn-Trace-Id":"Root=1-5f760c71-78192ebb56c4138b2952949a"
},
"json":null,
"origin":"98.118.113.251",
"url":"https://httpbin.org/post"
}
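To post JSON instead, you would supply post_data as an object; per the Content-Type rules above, the header is then set to application/json. A sketch:
{
  "api_key": "YOUR_API_KEY",
  "url": "https://httpbin.org/anything",
  "request_method": "POST",
  "post_data": { "name": "value" }
}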
Sessions
You can re-use the same IP address across multiple requests using the sticky sessions feature. To make sure the IP address is sticky and does not change, set a session value on the request, for example "session": "123456", and use the same session value on subsequent requests to keep the same IP.
There is no limit on the number of sessions you can have at a given time. However, the premium_proxies parameter needs to be set to true.
{
"premium_proxies": true,
"session": 1234
}
Using Premium Proxies
By default, we use data center proxies to scrape data from the web. But some websites can be difficult to scrape, and for those you might need to switch to our premium proxies.
Premium proxies use residential IP addresses, so they evade the detection that data center addresses attract.
If you use premium proxies, then each request costs 10 credits.
If you use premium proxies with render_js, then each request costs 25 credits.
{
"premium_proxies": true
}
Proxies are located in the USA by default. You can set the country parameter to another country to rotate IPs or view localized versions of sites as needed.
{
"premium_proxies": true,
"country": "gb"
}
The above example uses IPs from the UK.
List of countries with their ISO codes
ISO code | Name |
---|---|
br | Brazil |
ca | Canada |
fr | France |
de | Germany |
gr | Greece |
il | Israel |
in | India |
it | Italy |
mx | Mexico |
nl | Netherlands |
ru | Russia |
es | Spain |
se | Sweden |
gb | United Kingdom |
us | United States |
Proxy Mode (Beta)
ScrapeOwl can be used just like any other proxy.
When used in proxy mode, the data parsing (elements) feature is not usable because you are no longer receiving JSON responses.
To use ScrapeOwl as a proxy, copy the following code, and set your API key.
Proxy URL example
http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000
Code example
curl -x "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000" -k "https://httpbin.org/ip"
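The same request in Python, as a sketch using the requests library; verify=False mirrors curl's -k flag, since the proxy presents its own certificate for HTTPS traffic:

import requests  # third-party HTTP client: pip install requests

proxy = "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000"

# Route both HTTP and HTTPS traffic through the ScrapeOwl proxy.
# verify=False corresponds to curl's -k flag above.
response = requests.get(
    "https://httpbin.org/ip",
    proxies={"http": proxy, "https": proxy},
    verify=False,
)
print(response.text)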
API Request Cost Calculation
Because ScrapeOwl has multiple features, the credit consumption per request depends on the specifics of your request and which attributes you're sending along.
The request_cost attribute in the response shows how many credits the request cost.
Credit consumption per request type is as follows:
Description | Value |
---|---|
A request using rotating proxies without render_js | 1 credit |
A request using rotating proxies with render_js | 5 credits |
A request using premium proxies without render_js | 10 credits |
A request using premium proxies with render_js | 25 credits |
API Response Status Codes
The response status code is set in the header as well as in the JSON object that you get in response.
The is_billed attribute shows if the request was billed against credits in your account.
The ScrapeOwl API uses the following response status codes:
Response Code | Description |
---|---|
200 (billed) | Successful call |
400 | Bad Request |
401 | Unauthorized - Your API key is wrong or you haven't provided one |
403 | Forbidden - You do not have enough credits left for this API call |
404 (billed) | Page not found |
429 | Too Many Concurrent Requests - You have consumed all of your allowed concurrent requests. Please upgrade or contact support to increase the limit. |
500 | Internal Error - Please try again. |
If you would like to check stats about the credit usage in your account, send a GET request to the following endpoint:
https://api.scrapeowl.com/v1/usage
Requests to the usage endpoint are limited to 20 requests per minute.
The following JSON shows an example response from the /usage endpoint:
{
"credits": 1000,
"credits_used": 600,
"requests": 510,
"failed_requests": 0,
"successful_requests": 510,
"concurrency_limit": 1
}
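A sketch of querying the usage endpoint in Python; passing the key as an api_key query parameter is an assumption, so adjust if your dashboard documents a different scheme:

import requests  # third-party HTTP client: pip install requests

# Assumption: the API key is passed as an api_key query parameter
response = requests.get(
    "https://api.scrapeowl.com/v1/usage",
    params={"api_key": "YOUR_API_KEY"},
)
print(response.json())  # e.g. {'credits': 1000, 'credits_used': 600, ...}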