Web Scraping With NodeJS

Hey developers! In this article, we are going to see how to make a Web Scraper using JavaScript and NodeJS. For this article's example, we will use NodeJS with Express as its framework, along with a few common NPM libraries such as Axios and Cheerio. For demonstration purposes, we will extract data from this website, Programatically.

What is a Web Scraper

A web scraper is an easy and convenient way of extracting data and content from a website. Instead of tediously copy-pasting or jotting things down manually, a web scraper extracts the data you are looking for and saves it in a format you want. You just need to target the fields and values using CSS classes and IDs, and it will start scraping data.
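To make that concrete, here is a toy sketch in plain NodeJS that pulls titles and links out of a small, made-up HTML string with a regular expression. Real pages need a proper HTML parser — we use Cheerio later in this article — so treat this only as an illustration of the idea:

```javascript
// A made-up HTML snippet shaped like a blog listing.
const html = `
  <h2 class="heading-title-text"><a href="/post-1">First Post</a></h2>
  <h2 class="heading-title-text"><a href="/post-2">Second Post</a></h2>`

// Pull out every link's href and text. A regex is enough for this toy
// input, but a real page needs a parser like Cheerio.
const list = [...html.matchAll(/<a href="([^"]+)">([^<]+)<\/a>/g)]
  .map(m => ({ blog_title: m[2], blog_link: m[1] }))

console.log(list)
// [ { blog_title: 'First Post', blog_link: '/post-1' },
//   { blog_title: 'Second Post', blog_link: '/post-2' } ]
```

The end result — a list of `{ blog_title, blog_link }` objects — is exactly the shape the Cheerio-based scraper below produces.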

TL;DR

Just initialize an NPM project and install the dependencies using these commands:

npm init -y
npm i express
npm i cheerio
npm i axios


Then create an “index.js” file in the project directory and copy in the following code.

const axios = require('axios')
const express = require('express')
const cheerio = require('cheerio')

const PORT = 8000
const app = express()

app.listen(PORT, () => console.log(`server is running on PORT ${PORT}`))

axios('https://programatically.com').then(
    response => {
        const html = response.data
        const $ = cheerio.load(html)
        const list = []

        $('.heading-title-text').each(function () {
            const blog_title = $(this).text()
            const blog_link = $(this).find('a').attr('href')
            list.push({ blog_title, blog_link })
        })
        console.log(list)
    }
).catch(err => console.log(err))


After copying all the above code into the “index.js” file, run this command to start the Web Scraper:

node index.js


You can find the link to the complete project's GitHub repository at the bottom of this article.

Prerequisites

– NodeJS should be installed (Download NodeJS)
– Basic familiarity with NPM

Table of Contents

  1. Configure a Web Scraper Project
  2. Installing NPM Libraries Used in Web Scraping Project
  3. Create a Basic Web Server Using NodeJS
  4. Scraping the HTML
  5. Explanation of the Web Scraping Code

STEP 1: Configure a Web Scraper Project

To begin with, create a folder called “Web-Scraper”. Open it in VSCode or any other IDE you like. Open a terminal or CMD and type in this command:

npm init


After you execute the above command, it will ask you a list of questions. Simply keep pressing Enter through all of them and you’ll be done.

Next, create a new file called “index.js” in the same project folder. After that, execute the following command in the terminal or CMD:

npm i

The “package.json” file generated by npm init holds the list of all the dependencies that we’ll be using in this Web Scraper tool, and running npm i installs them, creating a “node_modules” folder and a “package-lock.json” file that pins their exact versions. See the image of the package.json file below:

Initial NPM Configuration for Creating a Web Scraper
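For reference, the freshly initialized package.json looks roughly like this (the exact fields depend on your answers to npm init’s questions, and a dependencies section listing express, cheerio, and axios is added once the installs in Step 2 have run):

```json
{
  "name": "web-scraper",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}
```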

STEP 2: Installing NPM Libraries Used in Web Scraping Project

This is a very simple step. We need three NPM libraries in our Web Scraping project: Express, Cheerio, and Axios.

Express is a very popular NodeJS framework. (Learn More)

Cheerio is used for traversing and targeting elements in your HTML. It has a syntax very similar to jQuery. (Learn More)

Axios is a popular library for creating and handling HTTP calls and requests. (Learn More)

Execute the following commands to install these libraries.

npm i express
npm i cheerio
npm i axios


STEP 3: Create a Basic Web Server Using NodeJS

Moving on, it’s time to create a basic web server using NodeJS and Express for our Web Scraper. Open the “index.js” file that we created earlier and write in the following code.

const axios = require('axios')
const express = require('express')
const cheerio = require('cheerio')

const PORT = 8000
const app = express()

app.listen(PORT, () => console.log(`server is running on PORT ${PORT}`))


This will create a basic web server on port 8000. It is where we will send and receive our HTTP requests and responses. To see if this basic web server is working, run the following command in the terminal:

node index.js


Note that you need to stop the server and rerun the above command whenever you make changes to your code. Now check the console; it should print a statement as shown in the image below.

Running A Web Server on JavaScript and NodeJS

STEP 4: Scraping the HTML

Finally, since we have our basic web server up and running, it’s time to write the actual scraping code for our Web Scraper using JavaScript and NodeJS. Write the following code in your “index.js” file after the web server code from the previous step. Afterward, restart your node project using the “node index.js” command.

axios('https://programatically.com').then(
    response => {
        const html = response.data
        const $ = cheerio.load(html)
        const list = []

        // Here I am targeting a CSS class and its attributes to fetch the data.
        // You would make your changes here.
        $('.heading-title-text').each(function () {
            const blog_title = $(this).text()
            const blog_link = $(this).find('a').attr('href')
            list.push({ blog_title, blog_link })
        })
        console.log(list)
    }
).catch(err => console.log(err))


STEP 5: Explanation of the Web Scraping Code

const html = response.data
const $ = cheerio.load(html)
const list = []


The first line fetches all the raw HTML content from the website URL that we gave Axios. The next line uses Cheerio to parse the raw HTML so that we can target specific elements, jQuery-style. The last line creates an empty list in which all the targeted, fetched data will be stored.

$('.heading-title-text').each(function () {
    const blog_title = $(this).text()
    const blog_link = $(this).find('a').attr('href')
    list.push({ blog_title, blog_link })
})
console.log(list)


The first line uses Cheerio’s selector syntax to target the CSS class shared by all the article titles in the list of blogs on my website. The “.each” call runs a function for each of the traversed headings; this is where I read the heading text and fetch the “href” attribute of the link inside it. I store the title and the link in separate constants and push them into the list we created earlier. Wrapping them in “{ }” braces turns each entry into an object, so we end up with a list of objects. Lastly, I simply print the list values to the console.

The Big Picture

Big Picture of the Web Scraper Using JavaScript and NodeJS

And we’re done.  

Hope this article helps you learn how to make a web scraper using JavaScript and NodeJS. Feel free to download and use the Web Scraper project that I have uploaded to my GitHub account. If there is any particular topic that you want me to cover, just drop a message in the comments and hit the like button. Have a great one!