Research Topic 3: Using a Node.js crawler to get Medium articles

Recently, I watched a YouTube video, Intro To Web Scraping With Node.js & Cheerio, about building a web crawler with Node.js.

In Project 2, there is an article section that needs to display a set of articles about programming languages. We looked at Medium, but unfortunately Medium doesn’t provide an API for article content. So I believe we can use a crawler to fetch the article content and send it to our Project 2 front-end.

Step 1: Create a Node/Express server

This application runs on an Express server, so we first need to initialize Express and create the endpoint /articles that will receive the POST request.

const express = require('express');

const app = express();
app.post('/articles', (req, res) => {
  // do something…
});
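
The form in Step 4 posts a text field named query, so the server also needs Express's built-in urlencoded body parser before the handler can read it. A minimal sketch (the middleware call and variable name are my additions, not part of the original setup):

app.use(express.urlencoded({ extended: false })); // populate req.body from the posted form

// inside the /articles handler:
const query = req.body.query; // "query" matches the form field name in Step 4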

Step 2: Using the Request package

Use npm install request to install the Request package. Request is designed to be the simplest possible way to make HTTP calls; it supports HTTPS and follows redirects by default. For example, if we want to search for articles about JavaScript on Medium, we can call request like this.

const request = require('request');

request('https://medium.com/tag/javascript', (error, response, html) => {
  // do something…
});
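
In our app, the tag is not hard-coded; it comes from the user's input. Here is a rough sketch of how, inside the /articles handler, the URL could be built from the query and the response checked before scraping (the query variable and the error handling are my assumptions, not from the original code):

const url = `https://medium.com/tag/${encodeURIComponent(query)}`;

request(url, (error, response, html) => {
  if (error || response.statusCode !== 200) {
    return res.status(500).send('Could not fetch articles from Medium');
  }
  // hand the html off to Cheerio in Step 3
});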

Step 3: Using Cheerio

Cheerio is a library that lets developers use jQuery-style syntax to select HTML elements on the back-end server. Once we fetch the article search-result page from Medium, Cheerio gives us the ability to parse the resulting HTML and pull out the information we need. After we get the HTML page from the request call, we load it into Cheerio.

const cheerio = require('cheerio');
const $ = cheerio.load(html);
If we look at the structure of the result HTML page, each article sits inside a div with the class name .postArticle. With Cheerio, we can select each article by this class name. We only need a few fields from each article, such as title, description, link, author, date, reading time, and thumbnail URL, which we pick out by their specific class or id names and push into a JSON object (the data array below).
const data = [];

$('.postArticle').each((i, el) => {
  const title = $(el)
    .find('.graf--title')
    .text();

  const description = $(el).find('.graf--trailing').text();
  const link = $(el).find('.postArticle-content').parent().attr('href');

  const author = $(el)
    .find('.postMetaInline')
    .find('[data-action=show-user-card]')
    .text();

  const date = $(el).find('time').text();

  const readingTime = $(el).find('.readingTime').attr('title');
  const imgURL = $(el).find('.graf-image').attr('src');

  data.push({
    title,
    description,
    link,
    author,
    date,
    readingTime,
    imgURL
  });
});
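
Putting the pieces together, the whole scrape lives inside the /articles handler and finishes by sending the collected array back to the front-end. Roughly like this (returning JSON is my choice here; rendering an HTML page would work just as well):

app.post('/articles', (req, res) => {
  request(`https://medium.com/tag/${req.body.query}`, (error, response, html) => {
    const $ = cheerio.load(html);
    const data = [];
    // …the $('.postArticle').each(…) loop above fills data…
    res.json(data); // return the scraped articles to the front-end
  });
});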

Step 4: Create a POST form

Because we need to fetch articles based on user input instead of always fetching the same category, we need a form to gather that input. In index.html, we create a simple POST form with a text field and point it at the /articles endpoint we created in Step 1.

<form action="/articles" method="POST">
  <label for="query">Type the <strong>programming language</strong> you want to know about:
    <input type="text" name="query" id="query">
  </label>
  <button type="submit">Fetch</button>
</form>
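
For this form to reach the browser, the Express app also has to serve index.html. One common way is to serve it as a static file (the public/ folder name here is an assumption):

// serve index.html and any other static assets from the public/ folder
app.use(express.static('public'));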

Step 5: Deploy on Heroku

To make this crawler live, we decided to deploy it to the cloud platform Heroku. Heroku is a platform-as-a-service that supports several programming languages. Its biggest advantage is the free plan, which is convenient for students testing their projects. After creating an account, follow the deployment instructions to bring your Node server application online.
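
Two small details matter for a Node app on Heroku: a Procfile tells Heroku how to start the server (for example, a single line like web: node index.js, assuming the entry point is index.js), and the app must listen on the port Heroku assigns through the PORT environment variable:

// bind to Heroku's assigned port, falling back to 3000 for local development
app.listen(process.env.PORT || 3000);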

Done!

Here is the live demo:

Live Demo

GitHub Repo