API Scraping in Python: Scraping APIs with Scrapy

michel john
4 min read · May 24, 2021

When we talk about web scraping, we mean fetching and analyzing data from the web. Various tools exist for this across different languages and platforms, and Python is especially rich in this regard. It offers some really strong tools for web scraping through an API.

What Is an API?

An application programming interface, commonly referred to as an API, is an interface that a website exposes so that two applications can communicate. It makes the extraction of data (commonly referred to as data parsing) easy, whereas parsing data from websites that do not offer an API is much more difficult. When we scrape an API, we do not scrape the HTML markup, which means that we do not need XPath or CSS selectors.

The Need for API Scraping in Python

The internet is a hub of information. It offers a wide variety of both reliable and misleading information on topics such as science, technology, environment and nature. These niches are of great interest for data extraction and analysis, and the data is available on websites that cover them. The easiest way to parse data from such websites is to scrape their API with Python code.

Advantages of API Scraping with Python

Every problem in programming or web development requires tools for its solution, and the more powerful the tools, the easier the problem is to solve. Python provides powerful tools for scraping data from a website through its API, which makes the job easy and time saving for moderators and developers. For this reason, API scraping with Python is often preferred over other techniques.

How Is API Scraping Done Using Python?

To scrape data from a website, we must understand that some websites do not use pagination; instead, new quotes are inserted as we scroll down. This means that we are dealing with dynamic pages, or in short, pages that rely on JavaScript. In this situation we just have to follow the steps below, which serve as a Python API scraping tutorial.

Open the browser’s developer tools with Ctrl+Shift+I.

In the figure above, the Network tab is open and the XHR filter is selected. XHR stands for XMLHttpRequest. Applying the XHR filter shows whether the website loads its data through an API.

We refresh the page to check whether a new request appears.

After refreshing the page, we can see a new request, “quote?page=1” (circled in red). Clicking it opens a new panel (marked with the red arrow). We click the Headers tab to view the Request URL, which is different from the website’s URL. Next, we click the Preview tab.
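To see how the Request URL is built, it can help to split it into its parts. This is a minimal sketch using only the standard library. The URL below is a hypothetical placeholder, since the real one is not given here; substitute the Request URL copied from the Headers tab.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical placeholder: replace with the Request URL copied from the Headers tab.
REQUEST_URL = "https://example.com/api/quote?page=1"

parts = urlparse(REQUEST_URL)
query = parse_qs(parts.query)

print(parts.netloc)  # host that serves the API
print(parts.path)    # path of the API endpoint
print(query)         # query parameters, here the page number
```

The query string is where the page number lives, which is why the request shows up as “quote?page=1” in the Network tab.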

In the Preview tab, we can see a JSON object with some key-value pairs. We are concerned with the “quotes” key (arrow). When we expand it, we find a few objects inside, and each object is a quote. A quote here is one of the text items shown in the various sections of the website.

Next, we want to scrape one page of the API.

As the first step, we enter the following to create the spider file we will work with.

We named the project demo_api. Before running the code, we have to copy the Request URL from Chrome so that it can be placed in the Scrapy spider created by this code. After pressing Enter, Visual Studio opens the following layout.

After entering the code, a new layout comes into view. We minimize the editor and copy the Request URL from Chrome, where our desired website is already open. We then paste the Request URL in place of the previous URL in the spider file (circled in red). At the end of the spider code there is a pass statement. Replace it with the following to see the response.

print(response.body)

Press Ctrl+S to save the file. Now click the Terminal menu and select New Terminal. The following interface appears.

Now we launch the spider to view the response. After pressing Enter, we get the response as a block of text containing the JSON object. We can find the “quotes” key by scrolling down through the output.

We have to access the “quotes” key because it contains all the quotes. As the next step, we convert the JSON object to a Python dict so that we can extract whatever we want from it. To do so, we need to import a module called json. In the spider file, after line 2 at the top, we write:

import json

Next, in the parse method at line 12, we remove the print statement and define a new variable named resp. The statement is written as:

resp = json.loads(response.body)

Using json.loads (rather than json.load, which reads from a file object), we convert the JSON object in the response body to a Python dict.
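The conversion can be tried outside the spider with a small sketch like the one below. The sample body is invented for illustration and stands in for response.body; only the “quotes”, “author”, and “tags” key names come from this article.

```python
import json

# Invented sample standing in for response.body (Scrapy exposes the body as bytes).
sample_body = b'{"quotes": [{"author": "Jane Doe", "tags": ["life"]}]}'

# json.loads accepts str or bytes and returns plain Python objects.
resp = json.loads(sample_body)

print(type(resp).__name__)   # name of the resulting Python type
print(len(resp["quotes"]))   # number of quote objects in the list
```

Once the body is a dict, the list of quotes is reached with ordinary indexing, resp["quotes"].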

When we run the previous command again, we get all the quotes listed in the output.

Each quote in this output has an author key and a tags key. Now that we have the list of quotes, we can loop over them and extract all the data points we want. Finally, we modify the spider code to extract all the remaining data as follows.
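The extraction step can be sketched as a small generator that is independent of Scrapy. The helper name extract_quotes and the sample body are hypothetical, and the “text” key is an assumption; only “quotes”, “author”, and “tags” are named in this article.

```python
import json


def extract_quotes(body):
    """Yield one dict per quote found in a JSON response body."""
    resp = json.loads(body)
    # .get with defaults keeps the sketch from failing if a key is absent.
    for quote in resp.get("quotes", []):
        yield {
            "author": quote.get("author"),
            "tags": quote.get("tags", []),
            "text": quote.get("text"),  # assumed key, not confirmed by the article
        }


# Invented sample body for demonstration.
sample_body = (
    b'{"quotes": [{"author": "Jane Doe", "tags": ["life"], "text": "An example."}]}'
)

for item in extract_quotes(sample_body):
    print(item)
```

Inside the spider, the parse method could then be reduced to yield from extract_quotes(response.body), so that Scrapy collects each dict as a scraped item.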

When we run the spider, all the data inside the quotes is extracted as well. This is a practical result of API scraping with Python, and a fundamental exercise for learning API scraping with Scrapy. The final output takes this form.

Conclusion

Python makes scraping an API easy and comfortable. Scraping an API with Scrapy avoids parsing HTML markup altogether, so there is little downside to adopting this technique where a website offers an API. Ultimately, it is a time-efficient method.

#web scraping #scrape data from website #JAVA api scraping
