Case Study: Making legislation changes easy to understand with automated web crawling

The project:

Our client is a consulting firm. They created a service that continuously monitors legislation changes, gathers and shares new web articles about them, and turns the information in those articles into easily usable data.

The problem:

The text of the legislation itself is one source of information; it can be stored and made searchable. However, this data alone is not enough to give a thorough explanation of the meaning and likely effect of a change.

So the client needed a tool that can collect and process the articles written to interpret these legislation changes.

Through this, the company can provide value for its clients: professional commentary can be attached to the legislation texts.

However, running web searches, reviewing the articles, and writing useful summaries every day would take an enormous effort from the law experts. Luckily, automation can help.

The solution:

The first element was the intelligent news scanner.

This system gathers online articles, filters and saves the data, and ranks the articles by quality and relevance using an NLP (Natural Language Processing) solution.

1. Running search, gathering data

We gather the pieces of legislation affected by changes and identify links to relevant articles.
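The case study names the Google Search API but does not show the client's code, so here is a minimal sketch of this step assuming the Google Custom Search JSON API; the function names, parameters, and result shape are illustrative assumptions, not the actual implementation.

```python
from urllib.parse import urlencode

# Endpoint of the Google Custom Search JSON API (an assumption here;
# the case study only says "Google Search API").
SEARCH_ENDPOINT = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query: str, api_key: str, engine_id: str, num: int = 10) -> str:
    """Build a search request URL for articles about a legislation change."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"

def collect_links(search_results: list[dict]) -> list[str]:
    """Pull the article links out of a list of search result items."""
    return [item["link"] for item in search_results if "link" in item]
```

The collected links are what the next step, the web crawler, opens.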

2. Web-crawling

Our scraping tool opens the links and reads the DOM information from the web pages. A DOM element typically contains text, images, styles (background color, font type, size and color), and its position on the page.
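The crawler itself is not shown in the case study; as a rough sketch of collecting per-element DOM information, the standard-library `HTMLParser` can record tag names, inline styles, and text. Capturing computed styles and element positions requires a rendering engine (e.g. a headless browser), which is out of scope for this illustration.

```python
from html.parser import HTMLParser

class DomInfoCollector(HTMLParser):
    """Collect tag name, inline style, and text for each element (a sketch)."""

    def __init__(self):
        super().__init__()
        self.elements = []   # one info dict per opened element
        self._stack = []     # currently open elements

    def handle_starttag(self, tag, attrs):
        info = {"tag": tag, "style": dict(attrs).get("style", ""), "text": ""}
        self.elements.append(info)
        self._stack.append(info)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost open element.
        if self._stack:
            self._stack[-1]["text"] += data.strip()
```

Feeding a page's HTML to `DomInfoCollector` yields a flat list of element records that the later classification and extraction steps can work on.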

3. Classification

A machine learning model classifies the data into 'article' and 'not article' types.
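The trained model itself is not described, so as a toy stand-in, here is a feature-based rule that captures the intuition: long, paragraph-heavy, low-link-density pages tend to be articles. The features and thresholds are illustrative assumptions, not the client's learned parameters (the real stack uses TensorFlow and LightGBM).

```python
def page_features(text: str, n_links: int, n_paragraphs: int) -> dict:
    """Hand-picked signals a trained classifier might also consume."""
    words = text.split()
    return {
        "n_words": len(words),
        "link_density": n_links / max(len(words), 1),
        "n_paragraphs": n_paragraphs,
    }

def classify(features: dict) -> str:
    """Toy stand-in for the trained model; thresholds are illustrative."""
    is_article = (
        features["n_words"] >= 200
        and features["n_paragraphs"] >= 3
        and features["link_density"] < 0.1
    )
    return "article" if is_article else "not article"
```

A navigation or landing page (few words, many links) falls into 'not article', while a full news story passes all three checks.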

4. Data extraction

Another machine learning model extracts the title, the time of release, the content, and the author.
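The extraction model is likewise not shown; a common heuristic baseline is to read the metadata many article pages already expose (Open Graph tags, author meta tags, `<time datetime>` attributes). The patterns below are that baseline, labeled as an assumption rather than the client's model.

```python
import re

# Common places articles expose their metadata; a heuristic baseline only.
META_PATTERNS = {
    "title": r'<meta\s+property="og:title"\s+content="([^"]*)"',
    "released": r'<time[^>]*\sdatetime="([^"]*)"',
    "author": r'<meta\s+name="author"\s+content="([^"]*)"',
}

def extract_metadata(html: str) -> dict:
    """Return title, release time, and author where the page exposes them."""
    result = {}
    for field, pattern in META_PATTERNS.items():
        match = re.search(pattern, html, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result
```

A learned extractor generalizes beyond these fixed patterns, which is why the case study uses a model rather than rules alone.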

5. Ranking

After data extraction, the system compares the article content with the changed text of the legislation: the more similar they are, the higher the article ranks. The algorithm also takes the publication date and article length into account, and checks each article against whitelists and blacklists.

Methods, tools and technologies:

Python, Google Search API, PostgreSQL database; we used the TensorFlow and LightGBM frameworks for the models.

Results:

We developed software that automatically gathers articles about legislation changes, classifies them by importance and relevance, and offers them to users as attachments to the legislation text itself. The articles are listed according to their ranking.

This service ensures that users can instantly read the best content on the internet about the very piece of legislation they are currently viewing.