Let’s start talking about Data Mining! In today’s post, we are going to dive into Topic Modeling, a technique that extracts the topics from a text. It is a really impressive technique that has many applications in the world of Data Science. The post is organized as follows. First, I am going to give some basic definitions and explain what Topic Modeling is. Then, I will briefly touch on preprocessing, since I am going to dedicate a whole post to it. After that, I will present a Python implementation, and I will conclude with a visualization process. For the sake of this post, I am going to use a well-known dataset called reuters that ships with the lda Python library, rather than my previous blog posts, since there are not that many of them yet. Let’s begin!

The code of this project will be uploaded soon and it will contain the preprocessing step too!

### 1. Topic Modeling, Definitions

Wikipedia’s page gives this definition:

```A topic model is a type of statistical model for discovering the abstract "topics"
that occur in a collection of documents.```

As we can see, topic modeling is a method for extracting topics from documents. For a human, finding a text’s topic is really easy. Even if the text is barely readable, a few specific words are enough for him/her to understand the topic. For a computer, this task is not that trivial, because a computer cannot understand the meaning of words. If we pick two random words from a physical book, for example ‘the’ and ‘Juliette’, and give them to a computer, the computer cannot comprehend the difference between them. The computer must have some previous knowledge about the book, or be able to scrape/search/crawl/etc. the internet or any other source of information, and even then it will just conduct an analysis.

With topic modeling, a computer performs a statistical analysis on a document and outputs a series of words that are relevant to that document (a very rough explanation). Let’s take a closer look.

There are several methods for performing Topic Modeling. Some of them are:

• LSA
• NMF
• pLSA
• LDA

In this post, we are going to see a well-known and very flexible algorithm: LDA, or Latent Dirichlet Allocation. A very good explanation is given by Christine Doig.

Check her out! She is amazing!

### 2. Preprocessing the Data

To perform LDA or any other Topic Modeling algorithm, you will need a good text corpus. Which corpus you need depends on the application. If you want to perform topic modeling on articles from CNN, BBC, or any other news website, you will need a good general-purpose corpus like Wikipedia, because you will have to deal with many different categories (sports, politics, food, movies, …). The bottom line is that a good corpus will give you better results. I am not going to go into details here because, as I said before, I will cover preprocessing in a separate post. Here, for example, we can do the following:

• Get the title and content of all wiki pages
• Get rid of short articles
• Tokenize the remaining articles
• Sort the words according to tf-idf
• Perform stemming
• Remove a percentage of the words from the top and the bottom of the sorted list
• Remove stopwords
• Keep the top % of the remaining list

These are some basic steps for preprocessing a text corpus. We will discuss more of them, in depth, in a later post.

```#This is the preprocessing step
```
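Until the dedicated post is ready, here is a minimal sketch of a few of the steps listed above (dropping short documents, tokenizing, removing stopwords, ranking by tf-idf and keeping the top part of the vocabulary). It is only an illustration, not the preprocessing code of this project: the function names, the tiny stopword list and the thresholds are all assumptions, and stemming is left out because the standard library has no stemmer.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer)
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "was"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(docs, min_tokens=3, keep_fraction=0.5):
    """Return tokenized documents restricted to the top tf-idf vocabulary."""
    # Tokenize, drop stopwords, and get rid of short documents
    tokenized = [[t for t in tokenize(d) if t not in STOPWORDS] for d in docs]
    tokenized = [toks for toks in tokenized if len(toks) >= min_tokens]

    # Document frequency: in how many documents each word appears
    df = Counter(word for toks in tokenized for word in set(toks))
    n_docs = len(tokenized)

    # Score each word by its tf-idf summed over the whole corpus
    scores = Counter()
    for toks in tokenized:
        for word, count in Counter(toks).items():
            tf = count / float(len(toks))
            scores[word] += tf * math.log(n_docs / float(df[word]))

    # Rank the words (ties broken alphabetically) and keep the top fraction
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    vocab = set(ranked[:max(1, int(len(ranked) * keep_fraction))])
    return [[t for t in toks if t in vocab] for toks in tokenized]


# Hypothetical three-document corpus, just to exercise the function
sample_docs = ["The pope visited the church in Rome",
               "Elvis fans love music and film",
               "ok"]
cleaned = preprocess(sample_docs)
```

The token lists that come out of a step like this would still have to be turned into a document-term count matrix before they can be passed to an LDA implementation.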

### 3. Perform LDA

There are lots of implementations of LDA. Assuming now that we have a very good corpus, we will perform topic modeling using the lda package for Python 2.7.

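```#LDA
import numpy as np
import lda
import lda.datasets

#Load the reuters dataset that ships with the lda package
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()

#First create the model
model = lda.LDA(n_topics=10, n_iter=500, random_state=1)
#Perform LDA
model.fit(X)
#Print the topics
topic_word = model.topic_word_
#Controls how many words are printed per topic
n_top_words = 5
for i, topic_dist in enumerate(topic_word):
    #This slice keeps the n_top_words - 1 most probable words of the topic
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

#Get the document titles and see the assigned topics
doc_topic = model.doc_topic_
for n in range(10):
    #The most probable topic for document n
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n{}...".format(n,
                                            topic_most_pr,
                                            titles[n]))
```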

The results are the following:

Topic 0: police church catholic women
Topic 1: elvis film music fans
Topic 2: yeltsin president political russian
Topic 3: city million century art
Topic 4: charles prince king diana
Topic 5: germany against french german
Topic 6: church people years first
Topic 7: pope mother teresa vatican
Topic 8: harriman u.s clinton churchill
Topic 9: died former life funeral

Taken on their own, the word lists are not polished topic labels, but we can clearly get a sense of what the documents are talking about! With this kind of information we can manipulate and analyze the documents. For example, we can cluster the documents for a recommendation system.

Furthermore, we can see the first ten documents and the assigned topics:

doc: 0 topic: 4
0 UK: Prince Charles spearheads British royal revolution. LONDON 1996-08-20…
doc: 1 topic: 6
1 GERMANY: Historic Dresden church rising from WW2 ashes. DRESDEN, Germany 1996-08-21…
doc: 2 topic: 7
2 INDIA: Mother Teresa’s condition said still unstable. CALCUTTA 1996-08-23…
doc: 3 topic: 4
3 UK: Palace warns British weekly over Charles pictures. LONDON 1996-08-25…
doc: 4 topic: 7
4 INDIA: Mother Teresa, slightly stronger, blesses nuns. CALCUTTA 1996-08-25…
doc: 5 topic: 7
5 INDIA: Mother Teresa’s condition unchanged, thousands pray. CALCUTTA 1996-08-25…
doc: 6 topic: 7
6 INDIA: Mother Teresa shows signs of strength, blesses nuns. CALCUTTA 1996-08-26…
doc: 7 topic: 7
7 INDIA: Mother Teresa’s condition improves, many pray. CALCUTTA, India 1996-08-25…
doc: 8 topic: 7
8 INDIA: Mother Teresa improves, nuns pray for “miracle”. CALCUTTA 1996-08-26…
doc: 9 topic: 4
9 UK: Charles under fire over prospect of Queen Camilla. LONDON 1996-08-26…
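As a rough illustration of the recommendation idea, the sketch below groups the ten documents by their assigned topic and suggests the other documents in the same group. The topic numbers are copied from the listing above; the function names are hypothetical, and a real recommender would likely use the full topic distributions rather than only the most probable topic.

```python
from collections import defaultdict

# Most probable topic per document, copied from the listing above
assigned_topics = [4, 6, 7, 4, 7, 7, 7, 7, 7, 4]


def group_by_topic(assigned):
    """Map each topic id to the list of document ids assigned to it."""
    groups = defaultdict(list)
    for doc_id, topic in enumerate(assigned):
        groups[topic].append(doc_id)
    return dict(groups)


def recommend(doc_id, assigned):
    """Return the other documents that share doc_id's assigned topic."""
    same_topic = group_by_topic(assigned)[assigned[doc_id]]
    return [other for other in same_topic if other != doc_id]


groups = group_by_topic(assigned_topics)
similar_to_first = recommend(0, assigned_topics)
```

For document 0 (the Prince Charles article) this returns the two other articles about Charles, which is the behaviour we would hope for.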

### 4. Conclusion

We can clearly see that the topics were to the point. For the evaluation process, we can use several methods according to the needs of our application. For example, we can compute the distance between documents, which translates to the similarity between them. We can use cosine similarity or the Jensen-Shannon distance to cluster the documents, or use perplexity to see if the model is representative of the documents we are scoring on.
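To make the two similarity measures concrete, here is a small sketch in plain Python. It assumes each document is represented by its topic distribution (a list of probabilities that sums to 1, like a row of the document-topic matrix). The example vectors are made up for illustration and are not taken from the model above.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b, 2) for a, b in zip(p, q) if a > 0)


def jensen_shannon_distance(p, q):
    """Square root of the Jensen-Shannon divergence (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))


# Hypothetical topic distributions for three documents
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

closer = jensen_shannon_distance(doc_a, doc_b) < jensen_shannon_distance(doc_a, doc_c)
```

Cosine similarity grows as documents become more alike, while the Jensen-Shannon distance shrinks, so a clustering step has to treat them accordingly.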

That’s all for today’s post! Please let me know if you have any questions in the comments section below! Till next time, take care and bye bye!

Yours,

Siaterlis Konstantinos

Similar to the previous post, in this post, we are going to learn how to extract information from the Internet. We have to create a dataset first, to implement data mining techniques. So, let’s start.

Github Code of this project.

## 1. What is scraping?

Scraping is a technique that allows us to extract information from the Internet. For example, scraping a web page means that we are going to extract the HTML from that page and then take the ‘useful’ information from the HTML. Useful information is the information that we need, for example, the infobox of a Wikipedia page or the meta tags of a web page, etc. For more information, you can check the definition of Web Scraping.

## 2. Scraping a Webpage

For this project we are going to need urllib2 and BeautifulSoup4 to fetch and parse the pages, plus Python’s built-in re module to filter the links.

Like before, I am going to build the project as a Python class callable from any file. In this example, I am going to scrape my previous blog posts: ‘First blog post’ and ‘Mining the social media using python 2.7’.

To begin with, I am going to create a text file that holds all the links that we want to scrape (in our case there will be only a few links). So, we need to open the file and put the links into a list for further processing.

```https://mydataminingsite.com/2017/02/24/first-blog-post/
https://mydataminingsite.com/2017/03/01/mining-the-social-media-using-python-2-7-13/
testkdkljfhaslkd
jfhaslk
https://mydataminingsite```

As you can see, my file does not contain only URLs, so we are going to sanitize the input to ensure the robustness of our code. I am going to use regular expressions (regex) to filter the links, but you can use other libraries or packages like:

• urlparse
• urllib (For Python <2.7.9, urllib does not attempt to validate the server certificates of HTTPS URIs!)
• etc

In any case, when you are building a project, always but ALWAYS perform data validation! This way it becomes a part of your life.

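```import re

# Read every line of the links file into the class-level list
with open(linksPath, 'r') as file:
    WebScrap.urls = file.readlines()

# Accept only http:// or https:// URLs with a domain name or an IPv4 address
urlFilter = re.compile(
    r'^(?:http)s?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+(?:[A-Z]{2,6}\.?|[A-Z0-9-]{2,}\.?)|'  # domain
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ip
    r'(?:/?|[/?]\S+)$', re.IGNORECASE)

# Keep only the lines that match the filter
WebScrap.urls = [url.strip() for url in WebScrap.urls if urlFilter.match(url.strip())]
```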
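If you prefer the urlparse route mentioned above, a minimal sketch could look like the following. The helper name is hypothetical, and the rule that the host must contain a dot is my own assumption, chosen so that the result matches what the regex does on the example file. It is a lighter check than the regex, not a replacement for it.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7


def is_valid_url(line):
    """Accept http(s) URLs whose host name contains at least one dot."""
    try:
        parts = urlparse(line.strip())
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 host
        return False
    host = parts.hostname or ''
    return parts.scheme in ('http', 'https') and '.' in host


# The same lines as in the links file shown earlier
lines = [
    'https://mydataminingsite.com/2017/02/24/first-blog-post/',
    'https://mydataminingsite.com/2017/03/01/mining-the-social-media-using-python-2-7-13/',
    'testkdkljfhaslkd',
    'jfhaslk',
    'https://mydataminingsite',
]
urls = [line.strip() for line in lines if is_valid_url(line)]
```

Only the first two lines survive, the same result the regex gives.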

The next step is the scraping itself, where we are going to extract the needed information from a web page. If you have different websites in your URL list, then you have to scrape each website separately, because we are aiming for specific content. In that case, you can extract the HTML and then perform the analysis on each website’s pages separately. In our case, I am going to extract information only from my blog posts (only two at that time). By checking the source of the page we can target the exact div/span/a/p where the needed information is.

Here our desired element is a div with the class name ‘entry-content’. So, using urllib2 and BeautifulSoup4, I am going to fetch the HTML content and filter it to get the text of that div.

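```import httplib
import traceback
import urllib2

from bs4 import BeautifulSoup

content = []

for url in WebScrap.urls:
    try:
        # Send a browser-like User-Agent with the request
        hdr = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
        req = urllib2.Request(url, headers=hdr)
        page = urllib2.urlopen(req)

        # Parse the HTML and keep only the div that holds the post body
        soup = BeautifulSoup(page, 'html.parser')
        pageContent = soup.find('div', {'class': 'entry-content'})

        content.append(pageContent.text)

    except urllib2.HTTPError, e:
        print 'HTTPError = ' + str(e.code)
    except urllib2.URLError, e:
        print 'URLError = ' + str(e.reason)
    except httplib.HTTPException, e:
        print 'HTTPException'
    except Exception:
        print 'generic exception: ' + traceback.format_exc()
```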
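If you want to try the filtering step without hitting the network or installing BeautifulSoup4, the sketch below does a similar job on an inline HTML string. Note that it uses the standard library HTMLParser instead of BeautifulSoup, so it is a stand-in for the idea and not the code of this project; the class name, the function name and the sample HTML are all made up for the example.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class EntryContentParser(HTMLParser):
    """Collect the text found inside <div class="entry-content"> elements."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0    # how many <div> levels deep we are inside the target
        self.chunks = []  # text pieces found inside the target div

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        if self.depth > 0:
            # A nested div inside the target: remember to wait for its end tag
            self.depth += 1
        elif 'entry-content' in (dict(attrs).get('class') or '').split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_entry_content(html):
    """Return the text of the entry-content div(s) in the given HTML string."""
    parser = EntryContentParser()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return ''.join(parser.chunks).strip()


# Hypothetical page, only to exercise the parser
sample_html = ('<html><body><div class="header">Menu</div>'
               '<div class="entry-content"><p>Greetings everyone!</p>'
               '<div>nested</div> tail</div><div>footer</div></body></html>')
post_text = extract_entry_content(sample_html)
```

The depth counter is there so that a div nested inside the post body does not end the capture too early.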

The output of the list ‘content’ is like this:

```Greetings everyone!
This is my first blog post ever! In this post, I will try to explain the purpose
of this blog along side with the fist topics that I am going to cover. First of
all, I would like to thank each and every one of you for the time that you will
dedicate to read my posts! Now, let the fun begins!
As the title of the web page states, in this blog, I will analyze Data Mining
algorithms implemented in Python. In the future, I would like to make some
tutorials too, about Python language, C/C++ language, BioInformatics algorithms
etc…
Considering Data Mining posts, I will start presenting methods and Python
libraries that are used to collect data, something like urllib2, beautifulsoup4,
lxml, scrapy, tweety and many more! So my first goal is to show you a method to
collect data from the Internet, after that we will be able to process them with
many more algorithms and methods and in the final process, extract information
from them!
For each project I do and each line of code I post, I will upload everything to
my personal GitHub with an appropriate link provided.

Yours,
Siaterlis Konstantinos

Greetings!
In this post, I will show you how to mine the Social Media, to be more precice
Twitter! It is a very simple process and I will show you how to do it in Python
2.7 in a couple of steps.
Step 1 – Install Python Packages
....```

And that’s it! The only problem I encountered was with the User-Agent header of urllib2, which I had to specify explicitly.

Other ways of scraping a web page are:

• Python scrapy – A very powerful tool, which I am going to make a tutorial about.
• Python requests – Simple but effective way of getting the content of a web page
• Python webbrowser – This is a built-in Python library, which opens a browser with the page you have selected (same as selenium). This kind of scraping is useful when you have to deal with javascript generated web pages.
• Dryscrape – An awesome tool for scraping javascript generated web pages.

The bottom line is that urllib2 is a nice library for simple scraping. I prefer using scrapy on more complex projects, but always use an API when possible.

Using a crawler to scrape a website, or using multiple ‘scrapers’ on the same website, could cause damage to the actual website. An API is the most convenient method of extracting information from a web page. For example, Wikipedia is full of information, but instead of scraping it we can use DBPedia to access everything in a reasonable amount of time. That’s all for today! Until next time, take care and have fun!
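One small habit that helps when no API is available is to check the site’s robots.txt before scraping. The sketch below is an assumption-laden illustration of that idea using the standard library robot parser: the function name, the example.com URLs and the robots.txt rules are invented, and the rules are passed in as text so that the example runs without any network access.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser  # Python 2.7


def allowed_urls(urls, robots_lines, user_agent='*'):
    """Keep only the URLs that the given robots.txt rules allow us to fetch."""
    rules = RobotFileParser()
    rules.parse(robots_lines)
    return [url for url in urls if rules.can_fetch(user_agent, url)]


# Hypothetical robots.txt content and URL list
robots_lines = ['User-agent: *', 'Disallow: /private/']
candidates = ['https://example.com/2017/02/24/first-blog-post/',
              'https://example.com/private/draft/']
polite_urls = allowed_urls(candidates, robots_lines)
```

In a real scraper you would download the robots.txt of each site first, and also pause between requests so that you do not flood the server.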

Yours,

Siaterlis Konstantinos

P.S. In the next posts we are going to see and implement methods for finding the topics a web page is about, the emotions a tweet has and much more! Also, in the future, we are going to start a secret project called ‘Siakon’!

Greetings!

In this post, I will show you how to mine the Social Media, to be more precise Twitter! It is a very simple process and I will show you how to do it in Python 2.7 in a couple of steps.

## Step 1 – Install Python Packages

First of all, let’s see the packages that we are going to use for this project: python-twitter and json.

Json is already included in Python >=2.7, and installing python-twitter pulls in all the packages it depends on. After that, you are ready to start!

## Step 2 – Make a Twitter app

This is an easy step and I am going to walk you through it. First go here and log in to your twitter account. This is the development site of twitter, where you can build your own apps!

Click on the button “Create new app” at the top right corner. Fill in the blanks with your information and then click on “Create your Twitter application”. Here is an example.

After you have created your app, you will be redirected to the App’s homepage. Go to Keys and Access Tokens and click on “Create My Access Token” at the bottom of the page. At the top of your page, you can find your secret keys and at the bottom your access tokens. Here is an example.

Write down those keys and remember, those keys are secret! DO NOT SHARE! After that, you need to adjust your app’s access level, just to avoid further validation (if you are going to use it for your own account you do not need to change this). Go to Permissions->Select “Read Only”->Update Settings. That’s it! Now we can write code.
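One way to keep the keys out of your source files (and out of GitHub) is to read them from environment variables. This is only a sketch of that habit: the variable names such as TWITTER_CONSUMER_KEY and the helper name are my own assumptions, not something Twitter or python-twitter requires.

```python
import os

# Names of the four values shown on the Keys and Access Tokens page
KEY_NAMES = ('consumer_key', 'consumer_secret',
             'access_token_key', 'access_token_secret')


def load_twitter_keys(environ=None):
    """Read the four keys from TWITTER_* environment variables."""
    environ = os.environ if environ is None else environ
    keys = {}
    for name in KEY_NAMES:
        variable = 'TWITTER_' + name.upper()
        value = environ.get(variable)
        if not value:
            raise KeyError('missing environment variable ' + variable)
        keys[name] = value
    return keys


# Fake values, passed in directly so the example does not depend on your shell
fake_environ = {'TWITTER_CONSUMER_KEY': 'ck', 'TWITTER_CONSUMER_SECRET': 'cs',
                'TWITTER_ACCESS_TOKEN_KEY': 'atk',
                'TWITTER_ACCESS_TOKEN_SECRET': 'ats'}
keys = load_twitter_keys(fake_environ)
```

The returned dictionary uses the same four names as the class shown later in this post, so it can be unpacked straight into the constructor.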

## Step 3 – Get the Tweets

First of all, we want to import the appropriate packages.

```import twitter
import json
```

Json is needed because the Twitter API returns the tweets in JSON format. For example:

```{"created_at": "Wed Mar 01 09:44:29 +0000 2017",
"hashtags": [],
"id": 836874776106926080,
"id_str": "836874776106926080",
"lang": "en",
"media": [
{... "text": "First blog post https://t.co/Uqp7sA86Tw
https://t.co/4zkWvT1EtN",
"urls": [
{"expanded_url": "https://mydatam...",
"url": "htt..."}],
"user": {"id": },
"user_mentions": []}```

We need to access the text field, so let’s see how we can accomplish that.
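As a quick warm-up, this is how the text field can be read from a JSON string with the json module. The string below is a trimmed copy of the example tweet, with the elided fields simply left out, so it is an illustration and not a complete API response.

```python
import json

# Trimmed version of the example tweet (elided fields are omitted)
raw_tweet = ('{"created_at": "Wed Mar 01 09:44:29 +0000 2017", '
             '"hashtags": [], '
             '"id": 836874776106926080, '
             '"id_str": "836874776106926080", '
             '"lang": "en", '
             '"text": "First blog post https://t.co/Uqp7sA86Tw https://t.co/4zkWvT1EtN", '
             '"user_mentions": []}')

# json.loads turns the JSON string into a regular Python dictionary
tweet = json.loads(raw_tweet)
text = tweet['text']
```

Once the tweet is a dictionary, every other field (id, lang, hashtags, ...) can be read in the same way.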

First, we need to connect to Twitter’s API. This is where we are going to use the API keys we generated earlier.

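```#create a class to be able to use it properly
class SampleTwitter:
    #declare class variables
    consumer_key = ''
    consumer_secret = ''
    access_token_key = ''
    access_token_secret = ''

    def __init__(self, consumer_key, consumer_secret, access_token_key,
                 access_token_secret):
        #store the API keys passed when the object is created
        self.consumer_key = consumer_key
        self.consumer_secret = consumer_secret
        self.access_token_key = access_token_key
        self.access_token_secret = access_token_secret
```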

As you can see, I created a class because I use this sampling a lot in my research; I suggest you do the same. When I create my class object, I will pass the API keys. Next, in the SampleTwitter class, I created a method called getTweets() that takes as input the account I want to sample. BE CAREFUL, there is a limit on how many tweets per day you can retrieve!

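```#use the python-twitter package to get the tweets
#where screen_name is name of the account you want to sample
def getTweets(self, screen_name):
    #connect to the API with the stored keys
    api = twitter.Api(consumer_key=self.consumer_key,
                      consumer_secret=self.consumer_secret,
                      access_token_key=self.access_token_key,
                      access_token_secret=self.access_token_secret)

    statuses = api.GetUserTimeline(screen_name=screen_name,
                                   count=200, include_rts=True,
                                   trim_user=False, exclude_replies=True)
    #Gather all tweets to a list
    tweets = []

    for i in statuses:
        #the tweets come in a JSON format (assumed reconstruction of this step)
        tweet = json.loads(i.AsJsonString())
        tweets.append(tweet['text'])

    return tweets
```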

As you can see, inside the for loop I extract the tweet’s text from the JSON format. Also, I want to talk about the parameters of the GetUserTimeline call. Here I sampled the last 200 tweets, without replies (exclude_replies=True), with retweets included (include_rts=True), and with the full user information (trim_user=False). You can find all the parameters here.

## Step 4 – Calling the class, iterate through tweets

Concluding, I created a main.py file to retrieve the tweets.

```#import your class

sampling = SampleTwitter(consumer_key, consumer_secret, access_token_key, access_token_secret)
#call the getTweets() method with the account you want to sample
tweets = sampling.getTweets('siaterliskonsta')

#iterate through tweets
for tweet in tweets:
    print tweet
```

## Conclusion

This is it! You can now sample a Twitter account, harvest tweets and process the results. Be careful though, as I said before, there is a limit on how many tweets you can retrieve! Anyway, until next time, take care and have fun!

Yours,

Siaterlis Konstantinos

P.S. The whole code of this post is here.

Greetings everyone!

This is my first blog post ever! In this post, I will try to explain the purpose of this blog alongside the first topics that I am going to cover. First of all, I would like to thank each and every one of you for the time that you will dedicate to reading my posts! Now, let the fun begin!

As the title of the web page states, in this blog, I will analyze Data Mining algorithms implemented in Python. In the future, I would like to make some tutorials too, about Python language, C/C++ language, BioInformatics algorithms etc…

Considering Data Mining posts, I will start presenting methods and Python libraries that are used to collect data, something like urllib2, beautifulsoup4, lxml, scrapy, tweety and many more! So my first goal is to show you a method to collect data from the Internet, after that we will be able to process them with many more algorithms and methods and in the final process, extract information from them!

For each project I do and each line of code I post, I will upload everything to my personal GitHub with an appropriate link provided.

Yours,

Siaterlis Konstantinos