NLP and Me – 2: Collecting Text Data

One of the most important things in NLP is text data. Collecting text data is not a simple task, especially when it comes to a minor language like Mizo. This time I'd like to share some simple tactics that I used for collecting data for Natural Language Processing research last year, i.e., the 2015–2016 academic session.
Clean Mizo text data is not simply available. Since I was responsible for collecting a huge amount of clean Mizo text data, I went to the offices of local newspapers like Vanglaini. We got a big file (maybe larger than 3GB), but when we tried to work on it, it was just a collection of stuff that was useless for us. So, I had to make clean data by myself. My plan was to download every page of their website and extract clean text data from it.

I am a web developer! I know how websites work and how files like web pages are stored on the server. I know the patterns they use to display pages.

If you look at some websites, you may have seen the URL of a page ending with ?id=1234, ?page=23, ?userid=1256, etc. These are query strings with which you can request a particular page.

For example:
If you visit www.angelvestgroup.com/info.php?id=1, you will be shown a page. Now, if you change the id to 2, i.e. www.angelvestgroup.com/info.php?id=2, you will get a different page. You can go on like that.
When data is entered into the database, every entry is given an ID or name so that the particular record can be looked up and displayed in the web browser. But I am not saying this is the only way! If you are a Facebook user, you may have seen something like profile.php?id=123456789! This is the profile ID of the user. By going to www.facebook.com/profile.php?id=XXXX, you can see whose profile it is.
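Just to illustrate the idea, a tiny loop like the one below (with a made-up URL, purely for illustration) would fetch a few pages simply by changing the id value in the query string:

# Illustration only: the URL here is hypothetical, not a real target
for id in 1 2 3
do
   curl -s "http://www.example.com/info.php?id=$id" -o "page_$id.html"
done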
Most news websites and blogs are implemented in this way.
Apparently, the Vanglaini website uses the Laravel PHP framework. If you look at their website, you will see a URL pattern similar to the technique mentioned above.
They have six (6) directories viz., tualchhung, hmarchhak, ramchhung, khawvel, thalai and infiamna.
All the pages on the website have an ID and can be retrieved and displayed simply by the format:
www.vanglaini.org/any_of_the_above_mentioned_directory/PAGEID
e.g. www.vanglaini.org/tualchhung/23456
The website uses a nice MVC (model-view-controller) routing scheme. The URL “www.vanglaini.org/tualchhung/12345” will display the same webpage as “www.vanglaini.org/thalai/12345” or “www.vanglaini.org/any_directory_name/12345”.
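A quick way to confirm this behaviour (a rough check, assuming the site really works as described) is to fetch the same ID under two different directory names and compare the responses:

curl -s "http://www.vanglaini.org/tualchhung/12345" -o a.html
curl -s "http://www.vanglaini.org/thalai/12345" -o b.html
diff -q a.html b.html && echo "Same article under both directory names"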
Since I recognized all these patterns, I could simply use the “wget” command on my Linux system to download all the pages that I required.
I simply used the shell script below, which gave me all the web pages that I needed.
#!/bin/bash
# Download every article page by looping over the IDs from 1 to 61234
for i in $(seq 1 1 61234)
do
   wget "http://vanglaini.org/tualchhung/$i"
done
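One variation worth considering (not what I actually ran) is to pause briefly between requests so the newspaper's server is not hammered, and to note the IDs that do not exist:

#!/bin/bash
# Variant sketch: be gentler on the server and log missing IDs
for i in $(seq 1 1 61234)
do
   wget -q "http://vanglaini.org/tualchhung/$i" || echo "ID $i not found" >> missing_ids.txt
   sleep 1
done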

Now, after I downloaded all the required pages, I needed to turn them into text files. For this, a very simple but powerful program called html2text fulfilled my requirement. The following line of bash code did everything for me.
# Convert every downloaded page (the files with no extension) to plain text
for file in $(find . -type f -not -name "*.*"); do html2text "$file" > "$file.txt"; done

This line of code converts all the downloaded files to text files (.txt).
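If html2text is not already installed, on Debian/Ubuntu systems it can usually be installed from the package repositories:

sudo apt-get install html2text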

Now, I needed only the TXT files, so I could delete all the files which are not .txt files. I did this with
shopt -s extglob   # extended globbing has to be enabled for !(*.txt) to work
rm !(*.txt)

This bash command worked fine for me.
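If extended globbing is not available in your shell, a rough find-based equivalent is:

find . -maxdepth 1 -type f ! -name "*.txt" -delete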

Now the only thing left to do was to merge all the text files into one file, which can be done with the cat command

cat *.txt > final.txt

which merges the contents of all the text files into a single file called final.txt.
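To see how much data that gives, a quick check with standard tools like du and wc is enough:

du -h final.txt      # size of the merged corpus on disk
wc -w final.txt      # rough word count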

In this way, I collected ~1GB of clean Mizo text data.
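Roughly, the whole pipeline described above can be put into a single script along these lines (an illustrative sketch, not exactly what I ran; the directory and file names are my own choices):

#!/bin/bash
# Sketch of the full pipeline: download, convert, and merge
mkdir -p vanglaini && cd vanglaini

# 1. Download the article pages by ID
for i in $(seq 1 1 61234)
do
   wget -q "http://vanglaini.org/tualchhung/$i"
done

# 2. Convert the downloaded HTML files (no extension) to plain text
for file in $(find . -type f -not -name "*.*")
do
   html2text "$file" > "$file.txt"
done

# 3. Merge everything into one corpus file (written one level up
#    so it is not picked up by the *.txt glob on a re-run)
cat ./*.txt > ../final.txt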

I tell you this: collecting 1GB of text data is a big task and takes a lot of time.