The Genesis of Smart News Scraper
We built a news scraper that identifies news relevant to a particular college. It came out of a hackathon I took part in way back in college.
I spent the last week coding away at Hackathon 5.0 with my friends Himanshu Garg and Priyanka Agrawal. It was organised by The Lakshya Foundation and Innovation Garage (the only place where I get to be the participant as well as the planner!). Here I discuss our problem statement, how we approached it, and the end result.
The Problem Statement
As always, Hackathon offered a big and diverse list of interesting hardware and software problems. After much deliberation, we decided to work on Almabase’s problem, which was to build a machine learning system that finds news relevant to a particular college. The problem statement consisted of two parts:
Write crawlers to fetch news articles from a predefined list of news websites. Use RSS feeds if available.
From the fetched articles, identify the ones that are relevant to a particular college. To begin with, consider only NIT Warangal. A list of synonyms of NITW and an alumni database of names and batches will be provided to the teams.
We selected this problem because:
We found it to be challenging. None of us had ever written a crawler or applied ML and Data Classification concepts to practical problems. There was a lot of scope to learn.
We use Almabase’s product NITWAA very often and we love it!
We had heard a lot about the awesome team at Almabase from mutual friends and were eager to work with them 🙂
Figuring It Out
We still had a week to go when the problem statement was released. Part 1 was relatively easy. We quickly learnt how basic crawlers are written and which libraries are available, including Beautiful Soup, Scrapy, Feedparser and Newspaper, all of which are simply amazing! By now, it was clear that we were going to build the application in Python on Django.
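To give a feel for Part 1, here is a minimal sketch of pulling articles out of an RSS feed. It uses only the standard library and a made-up sample feed so that it runs offline; a library like Feedparser does this far more robustly. The function name parse_rss and the example.com URLs are assumptions for illustration, not our actual code.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed so the sketch runs without network access.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>NIT Warangal alumnus wins award</title>
      <link>http://example.com/a1</link>
      <description>An alumnus was honoured.</description>
    </item>
    <item>
      <title>Monsoon arrives early</title>
      <link>http://example.com/a2</link>
      <description>Rain expected this week.</description>
    </item>
  </channel>
</rss>"""


def parse_rss(xml_text):
    """Return one dict per <item> in an RSS 2.0 document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        articles.append({
            # findtext returns None for a missing tag, hence the `or ""`.
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        })
    return articles


for article in parse_rss(SAMPLE_RSS):
    print(article["title"], article["link"])
```

In a real crawler, the XML would come from the news site's feed URL, and the full article text would then be fetched from each link.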
Now that we had all the content posted on various news websites, how would we classify it? Heck, how could we even make our bot understand whether an article about “Armstrong” refers to Neil Armstrong, the physical unit Armstrong, or some college alumnus named Armstrong? We realized that we needed to take the context of the article into account. For example, if it says Armstrong ‘graduated from’ or ‘studied in’ somewhere, we can say that the article is relevant to at least some college. This is where Natural Language Processing came into the picture and we were introduced to the mighty Natural Language Toolkit (NLTK) for Python.
By going through a few blogs and the NLTK book, we formulated a solution. We would identify the names of all the persons and organisations mentioned in the article and check whether they exist in Almabase’s database. Identifying such names is known as Named Entity Recognition (NER). Cool, we had the keywords. But how would we use these keywords to classify an article as relevant or not? And what exactly might be a helpful keyword apart from the names of persons and organisations?
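The idea of extracting names and checking them against a database can be sketched as follows. Note that this is not real NER: it uses a crude regular expression (two or more consecutive capitalised words) as a stand-in for what NLTK does properly, and the alumni names and helper functions are made up for illustration.

```python
import re

# Hypothetical stand-in for the alumni database (names are invented).
ALUMNI = {"ravi kumar", "anita rao"}

# Crude heuristic: two or more consecutive capitalised words, e.g. "Ravi Kumar".
# Real NER would also handle single names, initials, sentence starts, etc.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def candidate_names(text):
    """Return phrases that look like person or organisation names."""
    return NAME_PATTERN.findall(text)


def matched_alumni(text, alumni=ALUMNI):
    """Keep only the candidate names that appear in the alumni set."""
    return [name for name in candidate_names(text) if name.lower() in alumni]


text = "Ravi Kumar met Neil Armstrong in Hyderabad last week."
print(candidate_names(text))
print(matched_alumni(text))
```

The lookup step is the part that carries over: whatever extracts the names, only those found in the alumni data count towards relevance.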
Asking For Help
We were clueless at this point, given the wide variety of ML approaches and algorithms and our lack of experience with them. So I headed to the Systers mailing list and posted a question explaining the problem and my approach towards it. Surprisingly, I got a reply from Yoly Ceron in just a few hours (Thanks, Yoly!).
She shared a list of useful resources and libraries and suggested using the ‘bag of words’ approach with the Naive Bayes classifier from scikit-learn (another simply amazing library). We went through the links and thought it a reasonable approach to start with. Now we had quite a clear picture of what we would do in the Hackathon.
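To show what ‘bag of words’ plus Naive Bayes means in practice, here is a small from-scratch sketch in plain Python. It is an illustration only: in the project we relied on scikit-learn, and the class name, the tiny training set and the labels below are all made up.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Bag of words: lowercase the text and keep alphabetic tokens, ignoring order."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, docs):
        for text, label in docs:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of log likelihoods of each word.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training examples, far too few for real use.
TRAINING = [
    ("NIT Warangal alumnus graduated from the institute", "relevant"),
    ("alumni of NIT Warangal win award", "relevant"),
    ("monsoon rain expected across the state", "irrelevant"),
    ("cricket team wins the final match", "irrelevant"),
]

model = NaiveBayes()
model.train(TRAINING)
print(model.predict("award for NIT Warangal alumnus"))
print(model.predict("heavy rain and monsoon"))
```

Working in log space avoids multiplying many tiny probabilities, and the smoothing keeps an unseen word from zeroing out a whole class.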
The Solution
Summarizing our final solution:
Write a crawler to fetch articles from various sites. Use RSS feeds and feedparser wherever available. We will convert it into a cron job that can be run daily.
Maintain a dictionary of keywords and synonyms of that college (e.g. NITW, NIT Warangal, RECW, etc.). Apply NER and extract keywords from the article. If a keyword doesn’t already exist, add it to the dictionary. This is the learning part of the algorithm.
Perform the following four tests on the article and give it a score using a Naive Bayes classifier:
Check if it contains synonyms of NIT Warangal.
Check if it contains other keywords and phrases like ‘graduated from’, ‘Lakshya Foundation’, etc.
Query the Almabase DB and check if the names identified in the article exist in there.
Perform sentiment analysis on the article.
Store relevant and positive articles in the database.
Write a Django view to display articles relevant to a college according to various filters (e.g. published on a given date, published in a particular website, etc.)
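The four tests in the scoring step could be turned into features roughly like this. It is a simplified sketch under several assumptions: the alumni names are passed in as a list instead of being queried from the Almabase DB, sentiment is approximated by counting words from two tiny made-up word lists, and the function name extract_features is hypothetical.

```python
import re

# Synonyms and phrases taken from the examples mentioned in this post.
SYNONYMS = ("nitw", "nit warangal", "recw")
CONTEXT_PHRASES = ("graduated from", "studied in", "lakshya foundation")

# Invented word lists; a crude stand-in for real sentiment analysis.
POSITIVE = {"award", "wins", "honoured", "success"}
NEGATIVE = {"scam", "arrested", "fraud"}


def extract_features(text, alumni_names):
    """Run the four checks on one article and return them as a dict."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    return {
        # Test 1: does the article mention the college by any known synonym?
        "has_synonym": any(
            re.search(r"\b" + re.escape(s) + r"\b", lowered) for s in SYNONYMS
        ),
        # Test 2: does it contain a context phrase?
        "has_context_phrase": any(p in lowered for p in CONTEXT_PHRASES),
        # Test 3: does it mention a known alumnus?
        "mentions_alumnus": any(n.lower() in lowered for n in alumni_names),
        # Test 4: positive minus negative word count.
        "sentiment": len(words & POSITIVE) - len(words & NEGATIVE),
    }


article = "Ravi Kumar, who graduated from NIT Warangal, wins national award."
print(extract_features(article, ["Ravi Kumar"]))
```

A dict like this can then be fed to the classifier, or used directly to filter out articles that fail every check.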
The Outcome
We were ready with an approach and had already got our hands dirty with NLTK and scikit-learn. After coding for 24 hours, we came up with the first version of Smart News Scraper. Here is the code on GitHub (it’s still a work in progress).
What Next?
The project has just started and a lot of improvements can be made to it. We would love to get and implement your suggestions:
Improve the accuracy of the algorithm. We currently use Naive Bayes, which has achieved an accuracy of around 75%.
Improve the view and add more filters to the admin panel.
Increase the speed by adding multithreading.
Identify relevance to more colleges than just NIT Warangal.
Improve the training set data.
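As a rough idea of what the multithreading improvement might look like, here is a sketch that fetches several pages concurrently with a thread pool. The fetch function is a placeholder that does no network access, and both function names and the example.com URLs are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Placeholder: a real crawler would download and return the page here."""
    return "content of " + url


def fetch_all(urls, max_workers=4):
    """Fetch all URLs using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))


pages = fetch_all(["http://example.com/a1", "http://example.com/a2"])
print(len(pages))
```

Threads suit this job because the crawler spends most of its time waiting on the network rather than computing.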
Resources
Here are a few resources to get you started if you are looking at working on something similar.
In addition to working on our problem, it was great to meet so many startup founders and alumni and gain their insights. As always, Hackathon was amazing 🙂