Heroku is a cloud platform that helps you develop and deploy apps written in several programming languages. What I like about Heroku is that it creates a virtual environment for every app you publish, so you can have your own packages and frameworks in the language of your choice.
What is webapp2?
webapp2 is a lightweight web development framework for Python by Google. It is one of the easiest frameworks to work with in Python. In fact, I started my journey of web development by learning this framework via the Google App Engine.
Let's begin!
1) Set up Heroku on your system
I am not going to write much about this, as Heroku has an awesome Getting Started tutorial for Python here. Follow that link and you will learn the basics of the platform and be up and running with a Hello World application on Heroku in about 15 minutes. Yes, just 15 minutes. The tutorial also talks about something called a Procfile; it's a very important part of this project, so make sure you have an idea of what it does. I highly recommend following that tutorial now if you are new to Heroku. This tutorial uses a Unix shell command-line interface.
2) Now comes the fun part
By this time, you must have installed the Heroku Toolbelt.
Create a new folder called hellowebapp. This will be the project folder. All files must reside here.
Now log in to Heroku.
$ heroku login
cd to the project directory
$ cd hellowebapp
Create a virtual environment. If you don't know what this is, follow this link.
$ virtualenv venv
Activate the virtual environment
$ source venv/bin/activate
Install WebOb, Paste, and webapp2 using Python's setup tools. We are going to use pip for this tutorial.
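Assuming the standard package names on PyPI (and adding gunicorn, which the Procfile below uses as the production server), the installs look like this:
$ pip install WebOb Paste webapp2 gunicorn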
Now, create a Python file that defines your app. Save it as hello.py, since the Procfile we will write later refers to the hello module. For the sake of the tutorial, we are going to run a simple Hello World example:
import webapp2

class HelloWebapp2(webapp2.RequestHandler):
    def get(self):
        self.response.write('Hello, webapp2!')

app = webapp2.WSGIApplication([
    ('/', HelloWebapp2),
], debug=True)

def main():
    from paste import httpserver
    httpserver.serve(app, host='127.0.0.1', port='8080')

if __name__ == '__main__':
    main()
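Before wiring up Heroku, you can give the Paste server a quick local spin. Assuming you saved the file as hello.py, this should serve the app at http://127.0.0.1:8080:
$ python hello.py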
Create a file called Procfile. Note that the name has no extension; Heroku looks for exactly this file name.
$ touch Procfile
$ nano Procfile
Add the following line to the file and save it:
web: gunicorn hello:app --log-file=-
Test your app locally. After executing the command below, the app will usually be running at http://localhost:5000
$ foreman start
Copy the list of installed packages to a file called requirements.txt, so that when you deploy, the web server installs the packages automatically. The pip freeze command comes to help here.
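Inside the activated virtualenv:
$ pip freeze > requirements.txt
With the Procfile and requirements.txt in place, deployment is the usual Heroku git flow covered in the Getting Started tutorial:
$ git init
$ git add .
$ git commit -m "webapp2 hello world"
$ heroku create
$ git push heroku master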
Ta-da! Your webapp2-powered app is now running on Heroku! Now start exploring the framework and continue with your work. Any problems? Just comment below; I will reply ASAP. If you have problems setting up Heroku, you can ask about that too.
Classifiers based on Bayesian methods use training data to calculate an observed probability of each class based on feature values. When the classifier is later given unlabeled data, it uses those observed probabilities to predict the most likely class for the new features. It's a simple idea, but it often produces results on par with more sophisticated algorithms.
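To make the idea concrete, here is a toy sketch of Bayes' rule applied to a single word. The counts are purely illustrative, not taken from any real data set:

# Hypothetical counts: 'winner' appears in 40 of 100 spam messages
# and in 2 of 400 ham messages (500 messages in total).
p_spam = 100.0 / 500       # prior P(spam)
p_ham = 400.0 / 500        # prior P(ham)
p_word_spam = 40.0 / 100   # likelihood P('winner' | spam)
p_word_ham = 2.0 / 400     # likelihood P('winner' | ham)

# Bayes' rule: P(spam | 'winner') is the spam term divided by the total
posterior = (p_word_spam * p_spam) / (p_word_spam * p_spam + p_word_ham * p_ham)
print "P(spam | 'winner') =", posterior  # ~0.95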
Let's begin the exercise!
Getting the required packages:
First, download the required packages for the classifier. The NLTK package is widely used for Natural Language Processing in Python; you can read more about NLTK here. TextBlob provides a simpler interface for the same tasks. We are going to use both of them to achieve our goal. You can install them from your Linux shell using pip.
sudo pip install nltk
sudo pip install textblob
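Depending on your setup, TextBlob and the stopwords list also rely on NLTK corpora that are downloaded separately. If the code below complains about missing corpora, fetching them once should fix it:
$ python -m textblob.download_corpora
$ python -c "import nltk; nltk.download('stopwords')"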
Once these packages are installed, we are ready to code.
Start coding!
1. Collect training data set
All machine learning algorithms require training; Naive Bayes classifies unseen text by learning from the past. Luckily, there is a well-shaped data set that contains messages tagged as spam or ham. You can get it from here. This is what the first few lines of the data set look like:
ham Even my brother is not like to speak with me. They treat me like aids patent.
ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
spam WINNER!! As a valued network customer you have been selected to receivea 900
spam Had your mobile 11 months or more? U R entitled to Update to the latest colour Free! Call The Mobile Update Co FREE on 08002986030
ham I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.
As you can see, the type is on the left side and, separated by a tab, there is the message. This is going to be our training data set.
2. Import the libraries
These are the libraries we need to import. They include a few from the two packages we downloaded, plus some standard modules for shuffling and timing our Python code.
import os        # to expand '~' in the data set path
import random    # to shuffle the training set
import time      # to time learning and classification
from textblob import TextBlob                          # to tokenize our sentences into words
from nltk.corpus import stopwords                      # to remove unwanted stop words
from textblob.classifiers import NaiveBayesClassifier  # the classifier itself
3. Create training tuples for the classifier
The NaiveBayesClassifier accepts tuples in the order (body, class), where body is the text to be classified and class refers to the label of that text. For this experiment, we are going to tag each word of a message with the message's class, making our list of tuples look like this:
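Based on the sample lines above, the list would look something like this (an illustrative excerpt, not exact output):

[('even', 'ham'), ('brother', 'ham'), ('like', 'ham'), ('speak', 'ham'),
 ..., ('winner', 'spam'), ('valued', 'spam'), ('network', 'spam'), ...]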
To achieve this, I have written a function that reads all lines of the tab-separated file and generates a list of tuples. The algorithm I followed to achieve this is:
initialize an empty list of tuples
for each line in the file:
    split the line using the tab delimiter
    extract the second part of the split as the sentence
    split the sentence into words
    for each word in the sentence:
        if the word is not a stop word or a number:
            create the tuple (word, type)
            append the tuple to the list
return the list
To work with sentences and words, we are going to use the NLTK corpus and TextBlob APIs. Here is what the function looks like:
def get_list_tuples(read_file):
    list_tuples = []
    with open(read_file, "r") as r:
        c = 0
        for line in r:
            tabsep = line.strip().split('\t')
            msg = TextBlob(tabsep[1])
            try:
                words = msg.words
            except:
                continue
            for word in words:
                if word not in stopwords.words() and not word.isdigit():
                    list_tuples.append((word.lower(), tabsep[0]))
            c += 1  # limiting factor begins
            if c == 500:
                break  # limiting factor ends
    return list_tuples
For the sake of simplicity, we create a TextBlob for every sentence, which makes it easier to extract words and saves a few lines of code for removing punctuation. Also, if we encounter a line that TextBlob cannot parse, we just skip it and continue scanning the data set. You might also have observed that I introduced a variable c that limits the number of lines we import from the data set. These modifications are made purely for simplicity and so that the code executes faster. You can increase this counter, or remove that part of the code, to let the classifier train over all of the 5,700+ lines present in the data set.
Now we have to call the function, with the parameter being the path to the file where our data set resides. I called it like this:
a = time.time()
# open() does not expand '~', so resolve it with os.path.expanduser
entire_data = get_list_tuples(os.path.expanduser("~/Documents/DataSci/DataSets/sms/SMSSpamCollection"))
print "It took " + str(time.time() - a) + " seconds to import data" #10.031548
As you might have seen, I also measured the time required to import the data from the file and create our entire_data structure. This is helpful in measuring the performance of our program. For me, it took around 10 seconds to import 500 lines from the data set. You can omit those lines if you are not bothered about timing.
4. Some shuffling required
Before beginning our training, if we look at the tuples we generated, most of them are arranged sequentially: for example, many 'ham'-classified tuples are contiguous. We will now simply shuffle the list using Python's random.shuffle() function. This lets us create the train and test data sets for our classifier from the same list.
random.seed(1)
random.shuffle(entire_data)  # shuffling the data
train = entire_data[:250]    # list of training tuples for the classifier
test = entire_data[250:500]  # list of tuples for testing the classifier's accuracy
5. Train the classifier!
Now that everything is ready, we can train the classifier using the NaiveBayesClassifier present in the textblob.classifiers package. Yet again, I introduced timers to check how much time the training requires.
a = time.time()
cl = NaiveBayesClassifier(train)
print "It took "+str(time.time()-a)+" seconds to train data" #7.003
6. Test the classifier!
Now we will test the classifier using the testing set we created earlier.
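TextBlob classifiers come with a built-in accuracy method that takes a list of labeled tuples, so the test itself is a one-liner (the timing lines are optional, as before):

a = time.time()
print "Accuracy: " + str(cl.accuracy(test))
print "It took " + str(time.time() - a) + " seconds to test data"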
Well, when I tried it, I got an accuracy of around 78%.
You can also play around with the classifier by typing in your favorite messages and seeing how they are classified. To extend the test set, just create (message, class) tuples as shown before.
To classify an individual sentence and get the result, you can use a single line of code:
print cl.classify("Hey bud, what's up") #ham
print cl.classify("Get a brand new mobile phone by being an agent of The Mob! Plus loads more goodies! For more info just text MAT to 87021") #spam
The entire Python file:
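Here is the complete script, assembled from the snippets above:

import os        # to expand '~' in the data set path
import random    # to shuffle the training set
import time      # to time learning and classification
from textblob import TextBlob
from nltk.corpus import stopwords
from textblob.classifiers import NaiveBayesClassifier

def get_list_tuples(read_file):
    list_tuples = []
    with open(read_file, "r") as r:
        c = 0
        for line in r:
            tabsep = line.strip().split('\t')
            msg = TextBlob(tabsep[1])
            try:
                words = msg.words
            except:
                continue
            for word in words:
                if word not in stopwords.words() and not word.isdigit():
                    list_tuples.append((word.lower(), tabsep[0]))
            c += 1
            if c == 500:  # import only the first 500 lines for speed
                break
    return list_tuples

a = time.time()
entire_data = get_list_tuples(os.path.expanduser("~/Documents/DataSci/DataSets/sms/SMSSpamCollection"))
print "It took " + str(time.time() - a) + " seconds to import data"

random.seed(1)
random.shuffle(entire_data)
train = entire_data[:250]
test = entire_data[250:500]

a = time.time()
cl = NaiveBayesClassifier(train)
print "It took " + str(time.time() - a) + " seconds to train data"

print "Accuracy: " + str(cl.accuracy(test))

print cl.classify("Hey bud, what's up")  # ham
print cl.classify("Get a brand new mobile phone by being an agent of The Mob! Plus loads more goodies! For more info just text MAT to 87021")  # spam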
We have designed a simple spam vs. ham classifier using the Naive Bayes classification algorithm. You can use this tutorial to develop various other classification systems. There are many data sets on this website that can be used for classification purposes. Using an algorithm similar to the one we used here, you can also use various sentiment analysis data sets to classify a given sentence as positive or negative; where our classes here were spam and ham, for that experiment they would be positive and negative. Do you have something more to add to this? Please let me know your suggestions and comments below. Thank you.
More coding tutorials at Let's Code
You can also connect with me on various social networks; links are at the top of the page. Also, if you liked it, share it :)