Introduction

What is the ScrapeOwl API?

ScrapeOwl is a simple web scraping and data extraction API.

How does it work?

ScrapeOwl's API allows you to send requests that specify the website and the elements you want to scrape.

Data Extraction

Example

For example, suppose you want the content of the h1 and p tags from a series of articles on example.com, where the h1 tag contains the title of each article and the p tags contain the body text.

Using ScrapeOwl’s API, you specify example.com as the URL you would like to scrape, and h1 and p as the elements you would like to parse and retrieve content from.

Getting Started

Before you can start using the ScrapeOwl API, create an account on the registration page, or log in to the dashboard if you already have one.

Once you're in, your API key is visible on the dashboard; simply copy and paste it into your requests. It is an 80-character string of random letters and numbers that looks something like this:

9ijf24fk93rg038jg30rigj394f34f0kh12d12ep3fp24gk3pgk34g23gf74fl430913fj2133f32ffj

API Endpoint

The base URL for our scraping API is:

https://api.scrapeowl.com/v1/scrape
Note: All responses are JSON-formatted by default. If you want to disable this behavior and return HTML instead, set json_response to false.

Making Your First Request

APIs are consumed programmatically, meaning you write a program that requests and parses the data you want from the websites you want to scrape.

To demonstrate the power of ScrapeOwl, use the example below, written in JavaScript Object Notation (JSON), to make a test request to the API.

Code example

{ "api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here "url": "https://httpbin.org/ip" // The site you want to scrape and parse }

Result

{ "status": 200, "is_billed": true, "credits": { "available": 0, "used": 0, "request_cost": 0 }, "html": "{\n \"origin\": \"98.118.113.251\"\n}\n" }

The following table lists all available ScrapeOwl API parameters:

Name | Type | Default | Description
api_key | string | required | Your ScrapeOwl API key
url | string | required | Target URL to scrape
elements | array | optional | List of elements to extract from the webpage
html | boolean | false | Return the page's HTML if set to true
return_headers | boolean | false | Return the headers returned by the target website's server
return_cookies | boolean | false | Return the cookies returned by the target website's server
cookies | string | optional | Cookie headers to send to the target URL
headers | array | optional | HTTP headers to send to the target URL
request_method | string | GET | One of GET, POST, or PUT; use POST or PUT if you want to send data
post_data | string or array | optional | Data to POST to the page
premium_proxies | boolean | optional | Use residential proxies to scrape
country | string | optional | Set the source IP country for residential proxies
render_js | boolean | optional | Use a headless Chrome instance to execute JavaScript
custom_js | string | optional | Custom JavaScript to run on the page prior to extracting its HTML
wait_for | string | optional | An element to wait for before scraping the page, or an amount of time in ms
reject_requests | array | optional | Block requests on the page
json_response | boolean | true | If set to false, returns HTML instead of JSON; also useful for downloading files
screenshot | boolean | false | Only works with render_js; useful for debugging, and shows the full page as seen by our API
block_resources | boolean | true | When running render_js, CSS, images, and fonts are blocked by default, which can break some sites (sites using React, for example). When set to false, the API loads the page without blocking any resources, i.e., exactly as a normal browser would

Extracting Custom Elements

ScrapeOwl's killer feature is that it's able to extract data from pages, so you do not need to parse the page's HTML yourself. Below is a table detailing the selection fields.

ScrapeOwl API Element Selection

Name | Type | Default | Description
type | string | css | Set to xpath to extract using XPath; the default is css
selector | string | required | The XPath or CSS selector for the element you are trying to extract
timeout | integer | optional | Time in milliseconds to wait for dynamic elements to load on the page
html | boolean | false | Return the inner HTML of the element
attributes | boolean or array | false | Return the attributes of the element

Note: The timeout parameter only works when render_js is set to true.

Code example

{ "api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here "url": "https://example.com", // The site you want to scrape and parse //After you've provided your API key and the URL of the site you wish to parse, "elements" : [ { "type": "xpath", "selector": "//p" }, { "type": "css", "selector": "p" } ] }

The above JSON shows a request whose elements array uses both selector types: the first uses an XPath selector and the second uses a CSS selector. A CSS selector is essentially an HTML element name, class, or ID; it's what you would write in your stylesheet (CSS).

Element extraction result

{ "status": 200, "is_billed": true, "credits": { "available": 0, "used": 0, "request_cost": 0 }, "data": [ { "type": "xpath", "selector": "//p", "results": [ { "text": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission." }, { "text": "More information..." } ] }, { "type": "css", "selector": "p", "results": [ { "text": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission." }, { "text": "More information..." } ] } ], "html": "" }

This is an example of a response when extracting all p tags on https://example.com; the API extracts every element that matches the given conditions. The response contains your data as objects, where each object contains the text, HTML, or attributes of each matched element.

Attributes are key and value pairs of HTML attributes like class or any custom data-* attributes set to the particular elements on the page.
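
As a sketch of how you might consume this response programmatically, in Python with the requests library and the JSON POST pattern from the first example, you can loop over each selector's results:

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "elements": [{"type": "css", "selector": "p"}],
}

response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
body = response.json()

# "data" holds one entry per selector; each match is an object in "results".
for element in body["data"]:
    for match in element["results"]:
        print(match["text"])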

Waiting for Elements to Load

wait_for can be used to make ScrapeOwl wait for a given amount of time (in milliseconds) before capturing the elements or the page's HTML, as shown in the following example.

{ "wait_for": 6000 }

The above code waits for the duration of 6000ms (6 seconds) before capturing the elements or page's HTML.

Note: A timeout set for an individual element overrides the value above.

wait_for can also be used to wait for an element to load on the page:

{ "wait_for": "p" }

The above will wait for all of the p tag elements on the page to load before extracting data from the page.

Note: wait_for only works when render_js is set to true.
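
Putting these notes together, a complete request that waits for the p elements might look like the following Python sketch (the payload values are illustrative):

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "render_js": True,  # required for wait_for to take effect
    "wait_for": "p",    # wait for <p> elements before capturing
}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json()["status"])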

Executing Custom JavaScript on the webpage

You can pass custom JavaScript to ScrapeOwl using the custom_js attribute. The custom JavaScript is executed in the page's console before the content is captured.

When testing, make sure that the JavaScript snippet executes in your browser's console, as ScrapeOwl executes the code in the same environment.

For example, the following JavaScript scrolls to the bottom of the page.

{ "custom_js": "window.scrollTo(0,document.body.scrollHeight);" }
Note: Because you cannot pass a multi-line string in a JSON object, remove all line breaks and convert your snippet to a single line before sending.
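
One way to satisfy that requirement is to write the snippet naturally and collapse its whitespace before building the payload. Here is a Python sketch; the whitespace-collapsing approach is illustrative and will not preserve snippets that depend on line breaks (such as // comments):

import json

# A readable, multi-line JavaScript snippet.
snippet = """
window.scrollTo(
    0,
    document.body.scrollHeight
);
"""

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://example.com",
    "render_js": True,  # assumed here, since the custom JS runs in the headless browser
    "custom_js": " ".join(snippet.split()),  # collapse to a single line
}
print(json.dumps(payload))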

Setting Custom HTTP Headers

The headers attribute is an object containing all of the headers that you want to send to the target URL. You can easily set custom HTTP headers to pass along with your requests as follows:

{ "headers": { 'Accept-Language': 'en-US' } }

The above code example sends the Accept-Language HTTP header with the outgoing request.
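
For instance, here is a sketch of a full request carrying that header, pointed at https://httpbin.org/headers, which echoes the headers it receives, so you can confirm the header arrived:

import requests

payload = {
    "api_key": "YOUR_API_KEY",
    "url": "https://httpbin.org/headers",  # echoes received headers back
    "headers": {"Accept-Language": "en-US"},
}
response = requests.post("https://api.scrapeowl.com/v1/scrape", json=payload)
print(response.json()["html"])  # should show Accept-Language: en-US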

Blocking Requests on Page

By default, image, CSS, and font resources are blocked. To bypass this and allow all resources, set block_resources to false; this makes your request look like a real user's and reduces your chances of being blocked. If instead you want to block specific resources, reject_requests lets you block any resource of your choice on the page by specifying its file format. For example, you can block tracking scripts (.js files), images (.png, .jpg, or .jpeg files), or CSS (.css files).

For example, to block all CSS on a page:

{ "reject_requests": ["css"] }

And if you need to block any PNG and JPG image(s) on a page:

{ "reject_requests": ["png", "jpg"] }

If you need to block a particular URL on a page:

{ "reject_requests": ["someurl.com/style.css"] }
Note: Blocking requests only works when render_js is set to true.

Posting data

If you want to send an HTTP POST/PUT request, then set request_method to POST or PUT.

If you also want to send data with the request, set it in the post_data attribute.

When posting data, we try to detect the data type and set an appropriate Content-Type header.

We set application/json if you set post_data to an array or object.

We set application/x-www-form-urlencoded if you set post_data to a string.

The Content-Type: application/x-www-form-urlencoded header makes the posted data behave like a normal form submission.

The following code example uses a POST request to send form data:

Code example

{ "api_key": "YOUR_API_KEY", // Your API key you generated from the dashboard goes here "url": "https://httpbin.org/anything", // The site you want to scrape and parse "request_method": "POST", "post_data": "name=value" }

Result

{ "args":{ }, "data":"", "files":{ }, "form":{ "name":"value" }, "headers":{ "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3", "Accept-Encoding":"gzip, deflate, br", "Accept-Language":"en-US", "Cache-Control":"no-cache", "Content-Length":"27", "Content-Type":"application/x-www-form-urlencoded", "Host":"httpbin.org", "Pragma":"no-cache", "Upgrade-Insecure-Requests":"1", "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36", "X-Amzn-Trace-Id":"Root=1-5f760c71-78192ebb56c4138b2952949a" }, "json":null, "origin":"98.118.113.251", "url":"https://httpbin.org/post" }
Note: The above example is the target site's response, parsed out of ScrapeOwl's JSON response.
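
To illustrate the Content-Type detection described above, the following Python sketch sends the same form data twice: once as a string (form-encoded) and once as an object (JSON). The payload values are illustrative.

import requests

ENDPOINT = "https://api.scrapeowl.com/v1/scrape"
base = {
    "api_key": "YOUR_API_KEY",
    "url": "https://httpbin.org/anything",
    "request_method": "POST",
}

# A string post_data is sent as application/x-www-form-urlencoded.
form = requests.post(ENDPOINT, json={**base, "post_data": "name=value"})

# An object post_data is sent as application/json.
as_json = requests.post(ENDPOINT, json={**base, "post_data": {"name": "value"}})

print(form.json()["html"])
print(as_json.json()["html"])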

Sessions

You can reuse the same IP address across multiple requests using the sticky sessions feature. To make sure that the IP address is sticky and does not change, set a session value on the request, for example "session": 1234, and send the same value on subsequent requests to keep the same IP.

There is no limit on the number of sessions you can have at a given time. However, the premium_proxies parameter must be set to true.

{ "session": 1234 }

Using Premium Proxies

By default, we use data center proxies to scrape data from the web. But some websites are difficult to scrape, and for those you might need to switch to our premium proxies.

Premium proxies use residential IP addresses, so they evade detection as data center addresses.

If you use premium proxies, then each request costs 10 credits.

If you use premium proxies with render_js, then each request costs 25 credits.

{ "premium_proxies": true }

Proxies are located in the USA by default. You can set the country parameter to another country to rotate IPs or view localized versions of a site as needed.

{ "premium_proxies": true, "country": "gb" }

The above example uses IPs from the UK.

Note: Using premium_proxies costs 10 or 25 credits per request, depending on whether render_js is used.

List of countries with their ISO codes

ISO code | Name
br | Brazil
ca | Canada
fr | France
de | Germany
ge | Greece
il | Israel
in | India
it | Italy
mx | Mexico
nl | Netherlands
ru | Russia
es | Spain
se | Sweden
gb | United Kingdom
us | United States

Proxy Mode (Beta)

ScrapeOwl can be used just like any other proxy.

When used in proxy mode, the data parsing (elements) feature is not usable because you are no longer receiving JSON responses.

To use ScrapeOwl as a proxy, copy the following code, and set your API key.

API parameters example

http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000

Code example

curl -x "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000" -k "https://httpbin.org/ip"
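
The same request in Python, for reference (a sketch using the requests library; verify=False mirrors curl's -k flag, since the proxy intercepts HTTPS traffic):

import requests

proxy = "http://scrapeowl:YOUR_API_KEY@proxy.scrapeowl.com:9000"

response = requests.get(
    "https://httpbin.org/ip",
    proxies={"http": proxy, "https": proxy},
    verify=False,  # equivalent to curl -k
)
print(response.text)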

API Request Cost Calculation

Because ScrapeOwl has multiple features, the credit consumption per request depends on the specifics of your request and which attributes you're sending along.

The request_cost attribute in the response shows how many credits the request cost.

Credit consumption per request type is as follows:

Description | Value
A request using rotating proxies without render_js | 1 credit
A request using rotating proxies with render_js | 5 credits
A request using premium proxies without render_js | 10 credits
A request using premium proxies with render_js | 25 credits
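
If you want to estimate credit budgets in code, the table above reduces to a small helper (an illustrative sketch; the numbers simply restate the table):

def request_cost(premium_proxies: bool = False, render_js: bool = False) -> int:
    """Credits consumed by a single request, per the table above."""
    if premium_proxies:
        return 25 if render_js else 10
    return 5 if render_js else 1

assert request_cost() == 1
assert request_cost(render_js=True) == 5
assert request_cost(premium_proxies=True) == 10
assert request_cost(premium_proxies=True, render_js=True) == 25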

API Response Status Codes

The response status code is set in the header as well as in the JSON object that you get in response.

The is_billed attribute shows if the request was billed against credits in your account.

The ScrapeOwl API uses the following response status codes:

Response Code | Description
200 (billed) | Successful call
400 | Bad Request
401 | Unauthorized: your API key is wrong or you haven't provided one
403 | Forbidden: you do not have enough credits left for this API call
404 (billed) | Page not found
429 | Too Many Concurrent Requests: you have consumed all of your allowed concurrent requests. Please upgrade or contact support to increase the limit.
500 | Internal error; please try again.

If you would like to check the stats about credit usage in your account, send a GET request to the following endpoint:

https://api.scrapeowl.com/v1/usage

Requests to the usage endpoint are limited to 20 requests per minute.

The following JSON shows an example response from the /usage endpoint:

{ "credits": 1000, "credits_used": 600, "requests": 510, "failed_requests": 0, "successful_requests": 510, "concurrency_limit": 1 }
If you have any questions, reach out to support@scrapeowl.com, and we will be more than happy to help you out.