ScrapeOwl is a simple web scraping and data extraction API.
ScrapeOwl's API lets you send requests that specify the website you want to scrape and the elements you want to extract from it.
For example, suppose you want the content of the h1 and p tags from a series of articles on example.com, where the h1 tag contains the article's title and the p tags contain the body text.
Using ScrapeOwl’s API, you specify example.com as the URL you would like to scrape, and h1 and p as the elements you would like to parse and retrieve content from.
Once you've signed in, your API key should be visible on the dashboard, where you can simply copy and paste it into your requests. It is a long string (80 characters) of random numbers and letters.
The base URL for our scraping API is: https://api.scrapeowl.com/v1/scrape
APIs are consumed programmatically, meaning you write a program that fetches the data you need from the websites you want to scrape and parse.
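For example, the body of a minimal scraping request needs only your API key and the target URL (the values below are placeholders):

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com"
}
```

The full list of parameters you can send with a request is below.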
| Parameter | Type | Required / default | Description |
| --- | --- | --- | --- |
| api_key | string | required | Your ScrapeOwl API key |
| url | string | required | Target URL to scrape |
| elements | array | optional | List of elements to extract from the webpage |
| html | boolean | false | Return the page's HTML if set to true |
| return_headers | boolean | false | Return the headers returned by the target website's server |
| return_cookies | boolean | false | Return the cookies returned by the target website's server |
| cookies | string | optional | Cookie headers to send to the target URL |
| headers | object | optional | HTTP headers to send to the target URL |
| request_method | string | GET | GET, POST, or PUT; use the appropriate value if you want to POST or PUT data |
| post_data | string or array | optional | Data to send with a POST or PUT request |
| premium_proxies | boolean | optional | Use residential proxies to scrape |
| country | string | optional | Source IP country for the residential proxies |
| wait_for | string or integer | optional | An element to wait for before scraping the page, or an amount of time in ms |
| reject_requests | array | optional | Block requests on the page |
| goto_options | array | optional | Goto options |
ScrapeOwl's killer feature is that it's able to extract data from pages, so you do not need to parse the page's HTML yourself. Below is a table detailing the selection fields.
| Field | Type | Required / default | Description |
| --- | --- | --- | --- |
| type | string | css | Set to xpath if you are extracting using XPath; the default is CSS |
| selector | string | required | XPath or CSS selector for the element you are trying to extract |
| timeout | integer | optional | Time in milliseconds to wait for dynamic elements to load on the page |
| html | boolean | false | Return the inner HTML of the element |
| attributes | boolean or array | false | Return the attributes of the element |
If you're making a POST request using JSON, it looks as follows:
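(The selectors and option values in this example are illustrative.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "elements": [
    {
      "type": "xpath",
      "selector": "//h1",
      "timeout": 5000,
      "html": true,
      "attributes": true
    },
    {
      "type": "css",
      "selector": "p",
      "html": false,
      "attributes": ["class", "id"]
    }
  ]
}
```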
The above JSON code shows a request whose elements use the fields described above. The first element uses an XPath selector and the second uses a CSS selector. A CSS selector is essentially your HTML element name, class, or ID; it's what you would write in your stylesheet (CSS).
Below is an example of a response when extracting all p tags on https://example.com. The API extracts every element that matches the given conditions, and the response contains your data as objects, where each object holds the text, HTML, or attributes of an element.
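(The exact key names may vary; this is only a sketch of the general shape. The status, is_billed, and request_cost attributes are described later in this document.)

```json
{
  "status": 200,
  "is_billed": true,
  "request_cost": 1,
  "data": [
    {
      "text": "This domain is for use in illustrative examples in documents.",
      "html": "<p>This domain is for use in illustrative examples in documents.</p>",
      "attributes": {}
    }
  ]
}
```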
Attributes are key and value pairs of HTML attributes like class or any custom data-* attributes set to the particular elements on the page.
wait_for can be used to make ScrapeOwl wait for a given amount of time (in milliseconds) before capturing the elements or page's HTML as shown in the following example.
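```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "html": true,
  "wait_for": 6000
}
```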
The above code waits for the duration of 6000ms (6 seconds) before capturing the elements or page's HTML.
wait_for can also be used to wait for an element to load on the page:
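```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "wait_for": "p",
  "elements": [
    { "selector": "p" }
  ]
}
```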
The above will wait for all of the p tag elements on the page to load before extracting data from the page.
The headers attribute is an object containing all of the headers that you want to send to the target URL. You can easily set custom HTTP headers to pass along with your requests as follows:
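```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "headers": {
    "Accept-Language": "en-US,en;q=0.9"
  }
}
```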
The above code example sends the Accept-Language HTTP header with all outgoing requests.
By using reject_requests you can easily block any resource of your choice on the page by specifying its file format. For example, you can block tracking scripts (.js files), images (.png, .jpg, or .jpeg files), or CSS (.css files).
For example, to block all CSS on a page:
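(The exact patterns accepted by reject_requests aren't spelled out here, so the values in this and the following examples are illustrative.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "reject_requests": [".css"]
}
```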
And if you need to block PNG and JPG images on a page:
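```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "reject_requests": [".png", ".jpg", ".jpeg"]
}
```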
If you need to block a particular URL on a page:
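(The URL below is just a placeholder.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "reject_requests": ["https://example.com/assets/tracker.js"]
}
```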
If you want to send an HTTP POST/PUT request, then set request_method to POST or PUT.
If you also want to send data with the request, then you can set it in the post_data attribute.
When posting data, we try to detect the data type and set an appropriate Content-Type header.
We set application/json if you set post_data to an array or an object.
We set application/x-www-form-urlencoded if you set post_data to a string.
The Content-Type: application/x-www-form-urlencoded header makes the posted data behave like a normal form submission.
The following code example uses a POST request to send form data:
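(The login URL and form fields below are placeholders.)

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com/login",
  "request_method": "POST",
  "post_data": "username=demo&password=demo"
}
```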
Documentation coming soon.
By default, we use data center proxies to scrape data from the web. But some websites can be difficult to scrape with data center IPs, and for those you might need to switch to our premium proxies.
Premium proxies use residential IP addresses, so they avoid the detection that data center addresses attract.
If you use premium proxies, then each request costs 10 credits.
If you use premium proxies with render_js, then each request costs 25 credits.
The proxy location defaults to the USA. You can set the country parameter to other countries to rotate through locations or to view localized versions of a site as needed.
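For example, to route requests through the UK (the country code shown here, gb, follows the ISO two-letter convention and is an assumption; the exact codes expected by the API aren't listed in this document):

```json
{
  "api_key": "YOUR_API_KEY",
  "url": "https://example.com",
  "premium_proxies": true,
  "country": "gb"
}
```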
The above example uses IPs from the UK.
ScrapeOwl can be used just like any other proxy.
When used in proxy mode, the data parsing (elements) feature is not usable.
To use ScrapeOwl as a proxy, copy the following code, and set your API key.
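The exact proxy host, port, and credential format are not reproduced here; take them from your dashboard and substitute them into a standard proxy URL, using your API key as the credential. Every value below is a placeholder:

```
http://USERNAME:YOUR_API_KEY@PROXY_HOST:PROXY_PORT
```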
Because ScrapeOwl has multiple features, the credit consumption per request depends on the specifics of your request and which attributes you're sending along.
The request_cost attribute in the response shows how many credits the request cost.
| Request type | Cost |
| --- | --- |
| Rotating proxies without render_js | 1 credit |
| Rotating proxies with render_js | 5 credits |
| Premium proxies without render_js | 10 credits |
| Premium proxies with render_js | 25 credits |
The response status code is returned in the HTTP headers as well as in the JSON object that you get in response.
The is_billed attribute shows if the request was billed against credits in your account.
| Status code | Description |
| --- | --- |
| 200 (billed) | Successful call |
| 401 | Unauthorized: your API key is wrong or you haven't provided one |
| 403 | Forbidden: you do not have enough credits left for this API call |
| 404 (billed) | Page not found |
| 429 | Too many concurrent requests: you have consumed all of your allowed concurrent requests. Please upgrade or contact support to increase the limit. |
| 500 | Internal error. Please try again. |
If you would like to check the stats about credit usage in your account, send a GET request to the following endpoint:
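(The path below assumes the usage endpoint sits alongside the /v1/scrape endpoint shown earlier and accepts your API key as a query parameter.)

```
GET https://api.scrapeowl.com/v1/usage?api_key=YOUR_API_KEY
```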
Requests to the usage endpoint are limited to 20 requests per minute.