Techie Anish's Blog: Developing a Naive Bayes Classifier for Spam Detection in Python

The Naive Bayes Classifier

Classifiers based on Bayesian methods utilize training data to calculate an observed
probability of each class based on feature values. When the classifier is used later on
unlabeled data, it uses the observed probabilities to predict the most likely class for the
new features. It's a simple idea, but it results in a method that often has results on par
with more sophisticated algorithms.
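To make the idea concrete, here is a minimal from-scratch sketch of how such a classifier could count words per class and pick the most likely class. It is not the code used in the rest of this post (we use TextBlob for that). The function names, the toy data and the add-one smoothing are my own assumptions for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_words):
    """Count how often each word appears under each class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for word, label in labeled_words:
        class_counts[label] += 1
        word_counts[label][word] += 1
    return class_counts, word_counts


def predict(words, class_counts, word_counts):
    """Return the class with the highest log posterior for the given words."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        # log prior: how common the class is in the training data
        score = math.log(class_counts[label] / float(total))
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            # add-one smoothing keeps unseen words from zeroing the product
            score += math.log((word_counts[label][word] + 1.0) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, in the same (word, class) shape used later
toy_data = [("free", "spam"), ("winner", "spam"), ("call", "spam"),
            ("home", "ham"), ("tonight", "ham"), ("call", "ham")]
class_counts, word_counts = train_naive_bayes(toy_data)
print(predict(["free", "winner"], class_counts, word_counts))
```

Working with log probabilities instead of multiplying raw probabilities avoids underflow when a message has many words.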

Begin Exercise.

Getting the required packages:

First, download the required packages for the classifier. The NLTK package is widely used for Natural Language Processing in Python. You can read more about NLTK here. TextBlob provides a simpler interface to the same functionality. We are going to use both of them to achieve our goal. You can install them from your Linux shell using pip.

sudo pip install nltk

sudo pip install textblob

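Once these packages are installed, we are ready to code. Note that the stop word list used later is a separate NLTK corpus; if it is missing on your machine, fetch it once by running nltk.download('stopwords') from a Python prompt.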

Start coding!

1. Collect training data set

All machine learning algorithms require training. Naive Bayes classifies unlabeled text by learning from past examples. Luckily, there is a well-shaped data set that contains messages tagged as spam or ham. You can get it from here. This is what the first few lines of the data set look like:

ham Even my brother is not like to speak with me. They treat me like aids patent.
ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam WINNER!! As a valued network customer you have been selected to receivea 900
spam Had your mobile 11 months or more? U R entitled to Update to the latest colour Free! Call The Mobile Update Co FREE on 08002986030
ham I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

As you can see, the type is on the left and, separated by a tab, the message is on the right. This is going to be our training data set.
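The tab-separated layout of each line can be handled with a small helper like the one below. This is only a sketch: parse_line is a hypothetical name, and returning None for lines without a tab is my own choice so that malformed lines can be skipped.

```python
def parse_line(line):
    """Split one 'label<TAB>message' line into (label, message).

    Returns None for lines that have no tab, so callers can skip them.
    """
    parts = line.rstrip("\n").split("\t", 1)
    if len(parts) != 2:
        return None
    label, message = parts
    return label, message.strip()


sample = "spam\tHad your mobile 11 months or more? U R entitled to Update to the latest colour Free!\n"
print(parse_line(sample))
```

Splitting with a maximum of one split keeps the message intact even if it happens to contain a tab itself.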

2. Import the libraries

These are the libraries that we need to import. They include a few from the two packages we have downloaded and some more to perform debugging tasks on our Python code.

import random #to shuffle the training set
import time #to time learning and classification
from textblob import TextBlob #to tokenize our sentences into words
from nltk.corpus import stopwords #to remove unwanted stop words

from textblob.classifiers import NaiveBayesClassifier #to train and run the classifier

3. Creating training tuples for the classifier

The NaiveBayesClassifier constructor accepts tuples in the order (body, class), where body is the text to be classified and class refers to the class of that text. For this experiment, we are going to label each word of a message with the class of that message, making our list of tuples look like this:

list_of_tuples = [
    ('call', 'ham'),
    ('free', 'spam'),
    ('word', 'ham'),
]

To achieve this, I have written a function that reads all lines of the tab-separated file and generates the list of tuples. The algorithm I followed is:

initialize an empty list of tuples
for each line in file:
    split the line using the tab delimiter
    extract the second part of the split as sentence
    split sentence into words
    for each word in sentence:
        if word is not a stop word or a number:
            create tuple (word,type)
            append tuple to the list
return the list

To work with sentences and words, we are going to use the NLTK corpus and TextBlob APIs. Here is what the function looks like:

def get_list_tuples(read_file):
    list_tuples = []
    with open(read_file, "r") as r:
        c = 0
        for line in r:
            tabsep = line.strip().split('\t') #[type, message]
            msg = TextBlob(tabsep[1])
            try:
                words = msg.words #tokenize and drop punctuation
            except Exception:
                continue #skip lines that TextBlob cannot process
            for word in words:
                #keep only words that are neither stop words nor numbers
                if word not in stopwords.words() and not word.isdigit():
                    list_tuples.append((word.lower(), tabsep[0]))
            c += 1 #limiting factor begins
            if c == 500:
                break #limiting factor ends
    return list_tuples

For the sake of simplicity, we create a TextBlob for every sentence, which makes it easy to extract words and saves a few lines of code for removing punctuation. Also, if we encounter a line that TextBlob cannot process, we just skip it and continue scanning our data set. You might have also observed that I have introduced a variable c that limits the number of lines we import from our data set. These modifications are made just for simplicity and so that the code executes faster. You can increase this counter or remove that part of the code to let the classifier train over all the 5,500+ lines present in the data set.

Now we have to call the function, passing the path to the file where our data set resides. I called the function like this:

import os #to expand ~ in the path, which open() does not do by itself

a = time.time()
entire_data = get_list_tuples(os.path.expanduser("~/Documents/DataSci/DataSets/sms/SMSSpamCollection"))
print("It took "+str(time.time()-a)+" seconds to import data") #10.031548

As you might have seen, I have also measured the time required to import data from the file and create our entire_data structure. This is helpful in measuring the performance of our program. For me, it took around 10 seconds to import 500 lines from the data set. You can omit those lines if you are not bothered about it.
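If the import time bothers you, one thing worth trying is the stop word check. In get_list_tuples, stopwords.words() is evaluated again for every single word, which may account for a good part of those seconds (I have not measured it). A possible alternative, sketched below, builds the stop word collection once as a set and reuses it. The helper name filter_words and the tiny demo list are hypothetical; in the real function the list would come from stopwords.words(). Note that this sketch lowercases each word before the check, which differs slightly from the original function.

```python
def filter_words(words, stop_words):
    """Lowercase words and drop stop words and pure numbers.

    stop_words is any iterable of lowercase stop words; it is converted
    to a set once so each membership test is a fast hash lookup.
    """
    stop_set = set(stop_words)
    kept = []
    for word in words:
        lowered = word.lower()
        if lowered not in stop_set and not lowered.isdigit():
            kept.append(lowered)
    return kept


# Hypothetical, tiny stop word list for illustration only
demo_stop_words = ["to", "the", "a", "you", "have", "been"]
print(filter_words(["You", "have", "been", "selected", "to", "receive", "900"],
                   demo_stop_words))
```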

4. Some shuffling required

Before beginning our training, notice that the tuples we generated are partly sequentially arranged. For example, many tuples labeled 'ham' are contiguous. We will now simply shuffle our list using the random.shuffle() function in Python. This lets us draw both the training and test sets for our classifier from the same list.

random.seed(1)
random.shuffle(entire_data) #shuffling the data
train = entire_data[:250] #list of training tuples for classifier
test = entire_data[250:500] #list of tuples for testing the accuracy of the classifier
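Hard-coded slice boundaries are easy to get wrong by one, and they ignore any tuples beyond index 500. As an alternative, here is a small sketch of a helper that shuffles a copy of the data and splits it by fraction, so no tuple is dropped or used twice. The name shuffle_and_split and its parameters are my own, not part of any library.

```python
import random


def shuffle_and_split(items, train_fraction=0.5, seed=1):
    """Return (train, test) lists that together cover every item exactly once."""
    shuffled = list(items)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical demo data in the same (word, class) shape as entire_data
demo = [("word%d" % i, "ham" if i % 3 else "spam") for i in range(10)]
train_demo, test_demo = shuffle_and_split(demo)
print(len(train_demo))
print(len(test_demo))
```

With the real data you could call shuffle_and_split(entire_data) instead of slicing by hand.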

5. Train the classifier!

Now that everything is ready, we can train the classifier using NaiveBayesClassifier from the textblob.classifiers module. Yet again, I have introduced timers to check how much time the training takes.

a = time.time()
cl = NaiveBayesClassifier(train)
print("It took "+str(time.time()-a)+" seconds to train data") #7.003

6. Test the classifier!

Now, we will test the classifier using the testing set that we have created earlier.

accuracy = cl.accuracy(test)
print("accuracy: "+str(accuracy))

Well, when I tried it, I got an accuracy of around 78%.
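A single accuracy number does not tell you which kind of mistake the classifier makes, for example ham flagged as spam versus spam let through. Below is a sketch of a small helper that computes accuracy by hand and also counts every (actual, predicted) pair. The names evaluate and toy_classify are hypothetical; with the classifier from this post you could pass cl.classify and test instead of the toy stand-ins.

```python
from collections import Counter


def evaluate(classify, labeled_examples):
    """Return (accuracy, confusion) for a classify(text) -> label callable.

    confusion maps (actual_label, predicted_label) to a count.
    """
    confusion = Counter()
    for text, actual in labeled_examples:
        confusion[(actual, classify(text))] += 1
    total = sum(confusion.values())
    correct = sum(n for (actual, predicted), n in confusion.items()
                  if actual == predicted)
    accuracy = correct / float(total) if total else 0.0
    return accuracy, confusion


# Hypothetical stand-in classifier: flags any text containing "free"
def toy_classify(text):
    return "spam" if "free" in text.lower() else "ham"


toy_test = [("FREE phone", "spam"), ("call me", "ham"),
            ("you won", "spam"), ("free tonight?", "ham")]
accuracy, confusion = evaluate(toy_classify, toy_test)
print(accuracy)
print(confusion[("ham", "spam")])  # ham messages wrongly flagged as spam
```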

You can also play around with the classifier by typing in your favorite messages and seeing how they are classified. To score it against labeled messages of your own, create (message, class) tuples as described earlier.
To classify individual sentences and get the result, you can use one line of code per message:

print(cl.classify("Hey bud, what's up")) #ham
print(cl.classify("Get a brand new mobile phone by being an agent of The Mob! Plus loads more goodies! For more info just text MAT to 87021")) #spam


Conclusion:

We have designed a simple spam vs ham classifier using the Naive Bayes classification algorithm. You can use this tutorial to develop various other classification systems. There are many data sets present on this website which can be used for classification purposes. Using an algorithm similar to the one used here, you can also use various sentiment analysis data sets to classify a given sentence as positive or negative. The only difference is that where our classes here were spam and ham, for that experiment they would be positive and negative. Do you have something more to add to this? Please let me know your suggestions and comments below. Thank you.

More coding tutorials at Let's Code
You can also connect with me on various social networks; the links are at the top of the page. Also, if you like it, share it :)


Monday, June 16, 2014
