ScrapeOwl is a simple web scraping and data extraction API.
ScrapeOwl's API lets you send requests that specify the website you want to scrape and the elements you want to extract from it.
For example, suppose you want the content of the h1 and p tags from a series of articles on example.com, where the h1 tag contains the article's title and the p tags contain the body text.
Using ScrapeOwl’s API, you specify example.com as the URL you would like to scrape, and h1 and p as the elements you would like to parse and retrieve content from.
Once you've signed in, your API key should be visible on the dashboard, where you can simply copy and paste it into your requests. It is a long string (80 characters) of random numbers and letters.
The base URL for our scraping API is: https://api.scrapeowl.com/v1/scrape
APIs are consumed programmatically, meaning you write a program that fetches the data you need from the websites you want to scrape and parse.
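For example, the body of a minimal scraping request needs only your API key and the target URL (the values below are placeholders):

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com"
}
```

The full list of parameters you can send with a request is below.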
| Parameter | Type | Required / default | Description |
| --- | --- | --- | --- |
| api_key | string | required | Your ScrapeOwl API key |
| url | string | required | Target URL to scrape |
| elements | array | optional | List of elements to extract from the webpage |
| html | boolean | false | Return the page's HTML if set to true |
| return_headers | boolean | false | Return the headers returned by the target website's server |
| return_cookies | boolean | false | Return the cookies returned by the target website's server |
| cookies | string | optional | Cookie headers to send to the target URL |
| headers | object | optional | HTTP headers to send to the target URL |
| request_method | string | GET | GET, POST, or PUT; use the appropriate value if you want to POST or PUT data |
| post_data | string or array | optional | Data to send with a POST or PUT request |
| premium_proxies | boolean | optional | Use residential proxies to scrape |
| country | string | optional | Source IP country for the residential proxies |
| wait_for | string or integer | optional | An element to wait for before scraping the page, or an amount of time in ms |
| reject_requests | array | optional | Block requests on the page |
| goto_options | array | optional | Goto options |
ScrapeOwl's killer feature is that it's able to extract data from pages, so you do not need to parse the page's HTML yourself. Below is a table detailing the selection fields.
| Field | Type | Required / default | Description |
| --- | --- | --- | --- |
| type | string | css | Set to xpath if you are extracting using XPath; the default is CSS |
| selector | string | required | XPath or CSS selector for the element you are trying to extract |
| timeout | integer | optional | Time in milliseconds to wait for dynamic elements to load on the page |
| html | boolean | false | Return the inner HTML of the element |
| attributes | boolean or array | false | Return the attributes of the element |
If you're making a POST request using JSON, it looks as follows:
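(The selectors and option values in this example are illustrative.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "elements": [
    {
      "type": "xpath",
      "selector": "//h1",
      "timeout": 5000,
      "html": true,
      "attributes": true
    },
    {
      "type": "css",
      "selector": "p",
      "html": false,
      "attributes": ["class", "id"]
    }
  ]
}
```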
The above JSON code shows a request whose elements use the fields described above. The first element uses an XPath selector and the second uses a CSS selector. A CSS selector is essentially your HTML element name, class, or ID; it's what you would write in your stylesheet (CSS).
Below is an example of a response when extracting all p tags on https://example.com. The API extracts every element that matches the given conditions, and the response contains your data as objects, where each object holds the text, HTML, or attributes of an element.
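(The exact key names may vary; this is only a sketch of the general shape. The status, is_billed, and request_cost attributes are described later in this document.)

```json
{
  "status": 200,
  "is_billed": true,
  "request_cost": 1,
  "data": [
    {
      "text": "This domain is for use in illustrative examples in documents.",
      "html": "<p>This domain is for use in illustrative examples in documents.</p>",
      "attributes": {}
    }
  ]
}
```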
Attributes are key and value pairs of HTML attributes like class or any custom data-* attributes set to the particular elements on the page.
wait_for can be used to make ScrapeOwl wait for a given amount of time (in milliseconds) before capturing the elements or page's HTML as shown in the following example.
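```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "html": true,
  "wait_for": 6000
}
```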
The above code waits for the duration of 6000ms (6 seconds) before capturing the elements or page's HTML.
wait_for can also be used to wait for an element to load on the page:
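```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "wait_for": "p",
  "elements": [
    { "selector": "p" }
  ]
}
```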
The above will wait for all of the p tag elements on the page to load before extracting data from the page.
The headers attribute is an object containing all of the headers that you want to send to the target URL. You can easily set custom HTTP headers to pass along with your requests as follows:
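```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "headers": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}
```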
The above code example sends the Accept-Language HTTP header with all outgoing requests.
By using reject_requests you can easily block any resource of your choice on the page by specifying its file format. For example, you can block tracking scripts (.js files), images (.png, .jpg, or .jpeg files), or CSS (.css files).
For example, to block all CSS on a page:
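(The exact patterns accepted by reject_requests aren't spelled out here, so the values in this and the following examples are illustrative.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "reject_requests": [".css"]
}
```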
And if you need to block PNG and JPG images on a page:
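```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "reject_requests": [".png", ".jpg", ".jpeg"]
}
```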
If you need to block a particular URL on a page:
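(The URL below is just a placeholder.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "reject_requests": ["https://example.com/assets/tracker.js"]
}
```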
If you want to send an HTTP POST/PUT request, then set request_method to POST or PUT.
If you also want to send data with the request, then you can set it in the post_data attribute.
When posting data, we try to detect the data type and set an appropriate Content-Type header.
We set application/json if you set post_data to an array or an object.
We set application/x-www-form-urlencoded if you set post_data to a string.
The Content-Type: application/x-www-form-urlencoded header makes the posted data behave like a normal form submission.
The following code example uses a POST request to send form data:
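(The login URL and form fields below are placeholders.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com/login",
  "request_method": "POST",
  "post_data": "username=demo&password=demo"
}
```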
Documentation coming soon.
By default, we use data center proxies to scrape data from the web. But some websites can be difficult to scrape with data center IPs, and for those you might need to switch to our premium proxies.
Premium proxies use residential IP addresses, so they avoid the detection that data center addresses attract.
If you use premium proxies, then each request costs 10 credits.
If you use premium proxies with render_js, then each request costs 25 credits.
The proxy location defaults to the USA. You can set the country parameter to other countries to rotate through locations or to view localized versions of a site as needed.
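For example, to route requests through the UK (the country code shown here, gb, follows the ISO two-letter convention and is an assumption; the exact codes expected by the API aren't listed in this document):

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "premium_proxies": true,
  "country": "gb"
}
```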
The above example uses IPs from the UK.
ScrapeOwl can be used just like any other proxy.
When used in proxy mode, the data parsing (elements) feature is not usable.
To use ScrapeOwl as a proxy, copy the following code, and set your API key.
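The exact proxy host, port, and credential format are not reproduced here; take them from your dashboard and substitute them into a standard proxy URL, using your API key as the credential. Every value below is a placeholder:

```
http://USERNAME:YOUR_API_KEY@PROXY_HOST:PROXY_PORT
```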
Because ScrapeOwl has multiple features, the credit consumption per request depends on the specifics of your request and which attributes you're sending along.
The request_cost attribute in the response shows how many credits the request cost.
| Request type | Cost |
| --- | --- |
| Rotating proxies without render_js | 1 credit |
| Rotating proxies with render_js | 5 credits |
| Premium proxies without render_js | 10 credits |
| Premium proxies with render_js | 25 credits |
The response status code is returned in the HTTP headers as well as in the JSON object that you get in response.
The is_billed attribute shows if the request was billed against credits in your account.
| Status code | Description |
| --- | --- |
| 200 (billed) | Successful call |
| 401 | Unauthorized: your API key is wrong or you haven't provided one |
| 403 | Forbidden: you do not have enough credits left for this API call |
| 404 (billed) | Page not found |
| 429 | Too many concurrent requests: you have consumed all of your allowed concurrent requests. Please upgrade or contact support to increase the limit. |
| 500 | Internal error. Please try again. |
If you would like to check the stats about credit usage in your account, send a GET request to the following endpoint:
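(The path below assumes the usage endpoint sits alongside the /v1/scrape endpoint shown earlier and accepts your API key as a query parameter.)

```
GET https://api.scrapeowl.com/v1/usage?api_key=YOUR_API_KEY
```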
Requests to the usage endpoint are limited to 20 requests per minute.