Monday, June 16, 2014

Developing a Naive Bayes Classifier for Spam Detection in Python

The Naive Bayes Classifier

Classifiers based on Bayesian methods use training data to calculate an observed
probability of each class based on feature values. When the classifier is later applied to
unlabeled data, it uses those observed probabilities to predict the most likely class for the
new features. It's a simple idea, but it yields a method that often performs on par
with more sophisticated algorithms.
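
To make the idea concrete, here is a toy calculation with made-up counts (the numbers are hypothetical, not taken from the data set we use below). Suppose the word 'free' appeared in 40 of 100 spam messages and in 10 of 400 ham messages:

p_word_given_spam = 40 / 100.0 #P('free' | spam)
p_word_given_ham = 10 / 400.0 #P('free' | ham)
p_spam = 100 / 500.0 #prior probability of spam
p_ham = 400 / 500.0 #prior probability of ham

#Bayes' rule; the shared denominator P('free') is dropped because it
#does not change which class scores higher
score_spam = p_word_given_spam * p_spam #proportional to P(spam | 'free')
score_ham = p_word_given_ham * p_ham #proportional to P(ham | 'free')
print "spam" if score_spam > score_ham else "ham" #prints: spam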

Begin Exercise.

Getting the required packages:

First, download the required packages for the classifier. The NLTK package is widely used for Natural Language Processing in Python. You can read more about NLTK here. TextBlob provides a simpler interface on top of it. We are going to use both of them to achieve our goal. You can install them from your Linux shell using pip.

sudo pip install nltk
sudo pip install textblob 

Once these packages are installed, we are ready to code.
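
Depending on your setup, NLTK may also need its data files downloaded before the stop word list and tokenizers used below will work; if you hit a LookupError later, this one-time step in a Python shell should fix it (you can also run python -m textblob.download_corpora to fetch everything TextBlob needs):

import nltk
nltk.download('stopwords') #stop word lists used for filtering
nltk.download('punkt') #tokenizer models used for splitting sentences into words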

Start coding!

1. Collect training data set

All machine learning algorithms require training, and Naive Bayes classifies unseen text by learning from past examples. Luckily, there is a well-shaped data set that contains messages tagged as spam or ham. You can get it from here. This is what the first few lines of the data set look like:
ham    Even my brother is not like to speak with me. They treat me like aids patent.
ham    As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam    WINNER!! As a valued network customer you have been selected to receivea 900
spam    Had your mobile 11 months or more? U R entitled to Update to the latest colour     Free! Call The Mobile Update Co FREE on 08002986030
ham    I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
As you can see, each line has the type on the left and, separated by a tab, the message. This is going to be our training data set.

2.  Import the libraries

These are the libraries we need to import. They include a few from the two packages we have just downloaded, plus some standard modules for shuffling the data and timing our Python code.
import random #to shuffle the training set
import time #to time learning and classification
from textblob import TextBlob #to tokenize our sentences into words
from nltk.corpus import stopwords #to remove unwanted stop words
from textblob.classifiers import NaiveBayesClassifier #the classifier itself

3. Creating training tuples for the classifier

The NaiveBayesClassifier function accepts tuples in the order (body, class), where body is the text to be classified and class refers to the class of the text. For this experiment, we are going to classify each word of a message with respect to its class, making our list of tuples look like:
list_of_tuples = [
    ('call', 'ham'),
    ('free', 'spam'),
    ('word', 'ham')
]
To achieve this, I have written a function that reads all lines of the tab-separated file and generates a list of tuples. The algorithm I have followed is:

initialize an empty list of tuples
for each line in file:
    split the line using the tab delimiter
    extract the second part of the split as sentence
    split sentence into words
    for each word in sentence:
        if word is not a stop word or a number:
             create tuple (word,type)
             append tuple to the list
return the list 
To work with sentences and words, we are going to use the NLTK corpus and TextBlob APIs. Here is what the function looks like:
def get_list_tuples(read_file):
    list_tuples = []
    with open(read_file,"r") as r:
        c = 0
        for line in r:
            tabsep = line.strip().split('\t')
            msg = TextBlob(tabsep[1])
            try:
                words = msg.words
            except:
                continue
            for word in words:
                if word not in stopwords.words() and not word.isdigit():
                    list_tuples.append((word.lower(),tabsep[0]))
            c += 1 #limiting factor begins
            if c == 500:
                break #limiting factor ends
    return list_tuples

For the sake of simplicity, we create a TextBlob for every sentence; this makes it easy to extract words and saves a few lines of code for removing punctuation. Also, if we encounter a line that TextBlob cannot parse, we simply skip it and continue scanning the data set. You might also have observed that I introduced a variable c that limits the number of lines we import from the data set. These modifications are made purely for simplicity and so that the code executes faster. You can increase this counter, or remove that part of the code, to let the classifier train over all the 5,700+ lines present in the data set.
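
One optional tweak, my own suggestion rather than part of the original function: stopwords.words() rebuilds the entire stop word list on every call, once per word, which is a large part of what makes the import slow. Caching the list in a set up front makes the membership test much faster on larger imports:

stop = set(stopwords.words('english')) #built once, outside the loops
for word in words:
    if word.lower() not in stop and not word.isdigit(): #note: lowercasing before the test is a slight behavior change
        list_tuples.append((word.lower(), tabsep[0]))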

Now, we have to call the function, passing the path to the file where our data set resides. I called it like this:
a = time.time()
entire_data = get_list_tuples("~/Documents/DataSci/DataSets/sms/SMSSpamCollection") #note: open() does not expand '~'; pass a full path or use os.path.expanduser
print "It took "+str(time.time()-a)+" seconds to import data" #10.031548
As you might have noticed, I have also measured the time required to import the data from the file and create our entire_data structure. This is helpful in measuring the performance of our program. For me, it took around 10 seconds to import 500 lines from the data set. You can omit those lines if you are not bothered about timing.

4. Some shuffling required

Before we begin training, note that the tuples we generated are largely sequential; for example, many 'ham'-labelled tuples are contiguous. We will simply shuffle the list using Python's random.shuffle() function. This lets us create both the training set and the test set for our classifier from the same list.
random.seed(1)
random.shuffle(entire_data) #shuffling the data
train = entire_data[:250] #list of training tuples for the classifier
test = entire_data[250:500] #list of tuples for testing the classifier's accuracy (note: [250:500], not [251:500], so no tuple is skipped)
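
If you want to verify that the shuffle distributed the classes reasonably between the two splits, a quick count (my own addition, not required for the tutorial) looks like this:

print sum(1 for (w, label) in train if label == 'spam'), "spam tuples in train"
print sum(1 for (w, label) in test if label == 'spam'), "spam tuples in test"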

5. Train the classifier!

Now that everything is ready, we can train the classifier using the NaiveBayesClassifier from the textblob.classifiers package. Yet again, I have added timers to check how much time the training required.
a = time.time()
cl = NaiveBayesClassifier(train)
print "It took "+str(time.time()-a)+" seconds to train data" #7.003

6. Testing the classifier!

Now, we will test the classifier using the testing set that we have created earlier.
accuracy = cl.accuracy(test)
print "accuracy: "+str(accuracy)
Well, when I tried it, I got an accuracy of around 78%.
You can also play around with the classifier by typing in your favorite messages and seeing how they are classified. All you have to do is create the (message, class) tuple as described before.
To classify individual sentences and get the result, you can use a single line of code:

print cl.classify("Hey bud, what's up") #ham
print cl.classify("Get a brand new mobile phone by being an agent of The Mob! Plus loads more goodies! For more info just text MAT to 87021") #spam

The entire Python file:
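
Putting all the snippets above together (remember to adjust the data set path to where your copy resides):

import random #to shuffle the training set
import time #to time learning and classification
from textblob import TextBlob #to tokenize our sentences into words
from nltk.corpus import stopwords #to remove unwanted stop words
from textblob.classifiers import NaiveBayesClassifier #the classifier itself

def get_list_tuples(read_file):
    list_tuples = []
    with open(read_file,"r") as r:
        c = 0
        for line in r:
            tabsep = line.strip().split('\t')
            msg = TextBlob(tabsep[1])
            try:
                words = msg.words
            except:
                continue
            for word in words:
                if word not in stopwords.words() and not word.isdigit():
                    list_tuples.append((word.lower(),tabsep[0]))
            c += 1
            if c == 500: #limit the import to the first 500 lines
                break
    return list_tuples

a = time.time()
entire_data = get_list_tuples("~/Documents/DataSci/DataSets/sms/SMSSpamCollection") #open() does not expand '~'; use a full path
print "It took "+str(time.time()-a)+" seconds to import data"

random.seed(1)
random.shuffle(entire_data) #shuffling the data
train = entire_data[:250]
test = entire_data[250:500]

a = time.time()
cl = NaiveBayesClassifier(train)
print "It took "+str(time.time()-a)+" seconds to train data"

accuracy = cl.accuracy(test)
print "accuracy: "+str(accuracy)

print cl.classify("Hey bud, what's up") #ham
print cl.classify("Get a brand new mobile phone by being an agent of The Mob! Plus loads more goodies! For more info just text MAT to 87021") #spam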

Conclusion:

We have designed a simple spam vs. ham classifier using the Naive Bayes classification algorithm. You can use this tutorial to develop various other classification systems. There are many data sets on this website which can be used for classification purposes. Using an algorithm similar to the one here, you can also use sentiment analysis data sets to classify a given sentence as positive or negative; where our classes here were spam and ham, for that experiment they would be positive and negative. Do you have something more to add? Please let me know your suggestions and comments below. Thank you.
More coding tutorials at Let's Code
You can also connect with me on various social networks; links are at the top of the page. Also, if you like it, share it :)

Saturday, June 7, 2014

Installing and Running Apache Pig on Hadoop 2.x versions

Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for RDBMS systems. Pig Latin can be extended using UDF (User Defined Functions) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.
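To give a feel for the language, here is a minimal Pig Latin sketch of the classic word count job. The file name input.txt is a placeholder, not part of any real setup:

-- load a text file, one line per record
lines = LOAD 'input.txt' AS (line:chararray);
-- split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- group identical words together and count each group
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group, COUNT(words);
DUMP counts;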
Installing Pig is simple. Here is what you have to do:
  1. Download the desired Pig distribution from any one of the Apache Mirrors. It is best to choose the latest version of Pig. Download the file whose name looks like pig-0.12.1.tar.gz, where "0.12.1" is the version number.
  2. Extract Pig to a desired directory.
    1. The simplest way is to copy the tar.gz to the directory where you want your installation to reside.
    2. Now, execute the following command in your Linux shell in that directory:
      tar -xzf pig-0.12.1.tar.gz
    3. You will have the installation ready.
  3. Editing Path
    1. To access the Pig installation easily and run your scripts from anywhere, make sure to add Pig's bin directory to your PATH.
    2. To do this, open the file /home/user/.bashrc in your favorite editor and add the following line at the end of the file:
      export PATH=/<my path to pig>/pig-n.n.n/bin:$PATH
      
    3. After doing all this, your Pig installation is ready for further configuration.
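To verify the setup, reopen your shell (or run source ~/.bashrc, assuming bash) and try the commands below. The first should print the installed Pig version; the second starts Pig's Grunt shell in local mode, which needs no Hadoop cluster.

pig -version
pig -x local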
You might get the following error when running Pig scripts with Apache Hadoop 2.x. Here is a highlight of the error:
Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

Now, that's a problem. Don't worry, solving it is unbelievably simple. This is what you will have to do:
  • cd to your Pig installation directory. Yes, inside the Pig directory itself.
  • And run this command:
    ant clean jar-withouthadoop -Dhadoopversion=23
    

After that, try running your Pig script again. (The -Dhadoopversion=23 flag rebuilds Pig against the Hadoop 0.23/2.x APIs, which is why the class/interface mismatch goes away.) You will find that everything works now.
Any problems with this workaround, or any suggestions? Just comment below :)