While we get started, let’s go ahead and install some packages that we’ll need. R packages extend the language’s base functions, and in RStudio you can manage them using the packages pane to the right. To make sure we’re all on the same page to start, we’ll just run the code below by clicking the little green arrow to the right.
install.packages("tm")
install.packages("SnowballC")
install.packages("topicmodels")
install.packages("RTextTools")
install.packages("dplyr")
install.packages("ggplot2")
library("tm")
library("SnowballC")
library("topicmodels")
library("RTextTools")
library("dplyr")
library("ggplot2")
This notebook covers the basics of supervised and unsupervised machine learning.
Machine learning refers to a series of techniques through which a computer iteratively “learns” features of a dataset. For example, machine learning might be used to identify the author of a text or to classify texts by genre.
Unsupervised machine learning refers to techniques in which little or no input is given about what the algorithm should “look for” or “learn.” Topic modeling, demonstrated below, is one example: it assigns a topic distribution to each text, but the user does not provide examples of the topics that she is interested in finding.
Supervised techniques, on the other hand, allow a user to “supervise” the learning by providing examples of the phenomena to be identified. Classification algorithms, for example, generally require the user to provide examples of the different classes into which the data will be sorted.
Unsupervised Techniques: Topic Modeling
We’ll start with topic modeling, an unsupervised technique that assigns a topic distribution to each document in a collection. This technique assumes that each document is composed of a mixture of topics and that each topic is a mixture of words. So, if we topic modeled the entries in an encyclopedia, we might see topics like:
topic 1: animal, mammal, bird, vertebrate, invertebrate, skeleton
topic 2: planets, stars, orbits, rockets, spacemen
And we might see that the model assigns an article about the use of animals in testing spaceflight a mixture of primarily topics 1 and 2.
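To make the “mixture” idea concrete, here is a minimal base-R sketch. The topic names, the tiny vocabulary, and every probability are invented for illustration; a real topic model estimates these quantities from the corpus.

```r
# Two hypothetical topics, each a probability distribution over a tiny vocabulary.
# All numbers are made up for illustration.
topic_words = rbind(
  animals = c(animal = 0.40, mammal = 0.30, skeleton = 0.20, planets = 0.05, rockets = 0.05),
  space   = c(animal = 0.05, mammal = 0.05, skeleton = 0.00, planets = 0.50, rockets = 0.40)
)

# A hypothetical article on animals in spaceflight: 60% "animals", 40% "space"
doc_topics = c(animals = 0.6, space = 0.4)

# Expected word probabilities for the article: a weighted blend of the two topics
doc_words = drop(doc_topics %*% topic_words)
round(doc_words, 2)
```

The blended probabilities still sum to 1, so the article is itself described by a single distribution over words.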
In this example, we’ll be working with the text of articles published in Big Data and Society. First we’ll read in the data:
#setwd("~/Desktop/PhDigital_Computational_Data_Analysis")
articles = read.csv('big_data_articles.csv')
articles_text = as.character(articles[,3])
articles_text = gsub("‘","",articles_text)
articles_text = gsub("–","",articles_text)
articles_text = gsub("’","",articles_text)
articles_text = gsub("“","",articles_text)
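As an aside, the same cleanup can be sketched as a single call with a regular-expression character class. The toy string is made up, and the characters are written as Unicode escapes so the code does not depend on file encoding. Note one assumption: unlike the calls above, this version also strips the closing double quote (U+201D).

```r
# Toy string containing curly quotes and an en dash (written as Unicode escapes)
x = "\u201CBig data\u201D isn\u2019t new \u2013 it\u2019s old"

# One character class covers five characters:
# left/right single quotes, left/right double quotes, and the en dash
cleaned = gsub("[\u2018\u2019\u201C\u201D\u2013]", "", x)
cleaned
```

Because the characters are deleted rather than replaced with a space, a dash between two words leaves a double space behind; stripping whitespace later takes care of that.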
The tm package works with an object type called a corpus, so the first thing we want to do is create that from our list of texts. To do this, we need to pass the Corpus() function a vector (or list) of our article texts.
articles_corpus = Corpus(VectorSource(articles_text))
One common task in lots of machine learning workflows is processing text. For example, we usually don’t care if a word is capitalized, and a lot of times we don’t care about words like “the” or “a”. One of the more interesting choices we make is how to process text in relation to a specific problem or domain.
The following block of code performs various text processing operations, beginning with simpler operations and ending with more extreme operations. Think about which would be beneficial to our analysis of the articles – you can comment out lines (preventing them from executing) by prefacing the line with a #.
#Remove whitespace
articles_corpus = tm_map(articles_corpus, stripWhitespace)
#Convert to lowercase
articles_corpus = tm_map(articles_corpus, content_transformer(tolower))
#Remove punctuation
articles_corpus = tm_map(articles_corpus, removePunctuation)
#Strip digits
articles_corpus = tm_map(articles_corpus, removeNumbers)
#Remove stopwords
articles_corpus = tm_map(articles_corpus, removeWords, stopwords('english'))
#Remove common words for this corpus
articles_corpus = tm_map(articles_corpus, removeWords, c("also","can","big","data","may","might","two","use","one","new","different"))
#Stem document
#articles_corpus = tm_map(articles_corpus,stemDocument)
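To see what these transformations actually do, it can help to run them on a single sentence first. This is a small sketch using a made-up sentence; the object name toy_corpus is hypothetical.

```r
library(tm)

# A single made-up sentence to watch the transformations at work
toy_corpus = Corpus(VectorSource("The 3 BIG datasets were, surprisingly, not that big!"))

toy_corpus = tm_map(toy_corpus, content_transformer(tolower))
toy_corpus = tm_map(toy_corpus, removePunctuation)
toy_corpus = tm_map(toy_corpus, removeNumbers)
toy_corpus = tm_map(toy_corpus, removeWords, stopwords("english"))
toy_corpus = tm_map(toy_corpus, stripWhitespace)

# Print the processed text
processed = as.character(toy_corpus[[1]])
processed
```

In this sketch stripWhitespace runs last rather than first, so it also collapses the gaps left behind by removed words. That ordering is a choice, not a requirement.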
Many textual analysis techniques work on document-term matrices (DTMs). A document-term matrix is a data format in which each row is a document, each column is a term and the values represent the number of occurrences (sometimes weighted in various ways) of a term in a document. This is the step where the documents get turned into “bags of words” – we lose the order of words in exchange for being able to compare our documents.
articles_dtm = DocumentTermMatrix(articles_corpus)
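The “bag of words” trade-off is easy to see on a toy example. The sketch below builds a tiny document-term matrix by hand in base R (it does not use tm, and the three documents are made up).

```r
docs = c(d1 = "dog bites man", d2 = "man bites dog", d3 = "dog sleeps")

# Split each document into words and collect the shared vocabulary
tokens = strsplit(docs, " ")
vocab = sort(unique(unlist(tokens)))

# Count each vocabulary term per document: rows are documents, columns are terms
toy_dtm = t(sapply(tokens, function(words) table(factor(words, levels = vocab))))
toy_dtm

# d1 and d2 mean different things but get identical rows, because word order is discarded
identical(toy_dtm["d1", ], toy_dtm["d2", ])
```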
DTMs are often very “sparse” matrices – meaning that many of the values are zero. If you inspect the articles_dtm object at this point, you’ll see that the matrix probably has tens of thousands of columns – that’s a lot to work with, so we’re also going to go ahead and remove terms that don’t appear in many articles. The second argument to removeSparseTerms() is the maximum sparsity allowed: with .99, a term is dropped if it is absent from roughly 99% or more of the documents. This should get us down to a few hundred terms.
articles_dtm = removeSparseTerms(articles_dtm, .99)
articles_dtm_matrix = as.matrix(articles_dtm)
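The idea behind a sparsity threshold can be sketched on a plain matrix. This is a base-R illustration of the concept, not tm’s implementation; the toy counts and the 0.7 threshold are invented (with only five documents, a threshold of .99 would keep every term).

```r
# Toy document-term matrix: 5 documents (rows) by 3 terms (columns)
toy_dtm = matrix(
  c(1, 2, 1, 3, 1,   # "data": appears in every document
    0, 1, 0, 2, 0,   # "privacy": appears in 2 of 5 documents
    0, 0, 0, 0, 4),  # "llama": appears in 1 of 5 documents
  nrow = 5,
  dimnames = list(paste0("doc", 1:5), c("data", "privacy", "llama"))
)

# Share of documents in which each term is absent
term_sparsity = colMeans(toy_dtm == 0)
term_sparsity

# Keep only terms whose sparsity is below the threshold
threshold = 0.7
reduced_dtm = toy_dtm[, term_sparsity < threshold, drop = FALSE]
colnames(reduced_dtm)
```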
Go ahead and view the articles_dtm_matrix object by clicking on the little table icon in the Environment pane at the top right. It can be easy to just run code and forget what the data structures look like, but it’s important to keep in mind what we’re working with.
The last thing we need to do is get rid of rows that have zero in every column. We need to do this to both our DTM and our original dataset:
row_totals = apply(articles_dtm , 1, sum)
articles = articles[row_totals > 0,]
articles_text = articles_text[row_totals > 0]
articles_dtm = articles_dtm[row_totals> 0, ]
Now we can go ahead and create our topic model. At this point, we’re just going to tell it how many topics we want. Let’s start with 5:
articles_topic_model = LDA(articles_dtm, k = 5 )
We usually examine our topics informally, looking at the words most likely to appear in that topic and asking whether they seem comprehensible. We might, at this point, need to revise the number of topics we’re asking for.
terms(articles_topic_model, 10)
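If you want a rough numeric check alongside reading the topic words, one option is to compare perplexity (lower is better) across a few values of k. The sketch below uses the AssociatedPress document-term matrix that ships with topicmodels rather than our articles; the candidate values of k and the seed are arbitrary choices.

```r
library(topicmodels)

# AssociatedPress is a document-term matrix bundled with topicmodels;
# a small slice keeps this quick
data("AssociatedPress", package = "topicmodels")
small_dtm = AssociatedPress[1:50, ]

# Fit one model per candidate number of topics, with a fixed seed for repeatability
candidate_k = c(2, 4, 6)
fit_perplexity = sapply(candidate_k, function(k) {
  model = LDA(small_dtm, k = k, control = list(seed = 1234))
  perplexity(model, newdata = small_dtm)
})

data.frame(k = candidate_k, perplexity = fit_perplexity)
```

Perplexity measured on the same documents used for fitting tends to favor more topics, so treat it as a supplement to reading the topics, not a replacement. Scoring held-out documents is the more careful version of this check.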
Each document is now described as a mixture of topics, but we’ll stop here by just looking at the top topic for each document.
articles$top_topic = topics(articles_topic_model)
articles_summary = articles[,c(2,4)]
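The top topic is only a summary of the full mixture. As a sketch of how to look at the whole distribution, posterior() from topicmodels returns a document-topic matrix. The example again uses the bundled AssociatedPress data, with an arbitrary k and seed, so that it runs on its own.

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")
small_dtm = AssociatedPress[1:50, ]
model = LDA(small_dtm, k = 3, control = list(seed = 1234))

# Document-topic matrix: one row per document, one column per topic
doc_topics = posterior(model)$topics
round(head(doc_topics, 3), 3)

# Each row is a probability distribution over topics, so rows sum to 1
range(rowSums(doc_topics))

# topics() reports the most likely topic for each document
head(topics(model))
```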
At this point, there are lots of analysis and visualization options, depending on your methodology.
Supervised Techniques: Support Vector Machine (SVM) Classifiers
Where topic modeling doesn’t give the user any control over the model’s results (e.g., the topics identified might not be – and often aren’t – meaningful to humans), supervised techniques allow the user to provide examples of what they’re looking for. A basic example of this is a binary classifier, where observations are sorted into one of two classes. The supervised part of these techniques is that the user provides examples for the model to “learn” from. So, if we created a classifier to sort images into those that contain stop signs and those that don’t, we would provide the model with examples of pictures that contain and don’t contain stop signs. (When you “prove you’re human” to log in to websites, this is roughly what you’re doing.)
Supervised machine learning typically requires a training data set – a subset of your data that you’ve classified (also referred to as labeled data). We use this labeled data in order to “train” a model, which we’ll then use to classify a larger collection of unlabeled data. Some workflows also include a test set of data that is used to measure the model’s performance, but for this example we’ll just use a training set.
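For reference, holding out a test set usually amounts to a random split of the labeled rows. Here is a minimal base-R sketch with made-up data; the 70/30 split, the seed, and the object names are arbitrary choices for illustration.

```r
# Hypothetical labeled data: 10 made-up tweets with -1/1 labels
labeled = data.frame(
  text = paste("tweet", 1:10),
  complaint = c(1, -1, 1, 1, -1, -1, 1, -1, 1, -1)
)

# Randomly hold out 30% of the rows as a test set
set.seed(42)
test_rows = sample(nrow(labeled), size = 3)
test_set = labeled[test_rows, ]
training_set = labeled[-test_rows, ]

nrow(training_set)
nrow(test_set)
```

The model would be trained on training_set only, and its predictions for test_set compared against the labels it never saw.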
In this example, we’ll be working with a series of airline tweets, and we’ll attempt to classify whether or not they’re complaints. I’ve created a training set for us to classify together. It’s available here: https://docs.google.com/spreadsheets/d/1En4YzB7yuW8GET_PLufKDhhmXb6ynON6sHhl7WyDLqc/edit?usp=sharing
The spreadsheet has a column called complaint – working in groups, we’ll fill in that column with -1 if the tweet isn’t a complaint and 1 if it is.
Once we’re done, download the spreadsheet as a CSV, rename it airline_tweets_training.csv, and move it to the folder you’re working in.
Read in the training and full data:
tweets_all = read.csv('airline_tweets.csv')
tweets_training = read.csv('airline_tweets_training.csv')
We start again by creating a DTM. This gives us a feature vector for each tweet – a list of features that describes the tweet. Our features could be anything. But for this exercise – and conventionally when using SVM classifiers on text – we’re going to use a DTM to produce our feature vectors. So, the features are just word counts.
In the following steps, we’ll repeat some of the same text processing steps from the topic models example.
tweet_training_corpus = Corpus(VectorSource(tweets_training$text))
#Remove whitespace
tweet_training_corpus = tm_map(tweet_training_corpus, stripWhitespace)
#Convert to lowercase
tweet_training_corpus = tm_map(tweet_training_corpus, content_transformer(tolower))
#Remove punctuation
tweet_training_corpus = tm_map(tweet_training_corpus, removePunctuation)
#Strip digits
tweet_training_corpus = tm_map(tweet_training_corpus, removeNumbers)
#Remove stopwords
tweet_training_corpus = tm_map(tweet_training_corpus, removeWords, stopwords('english'))
# Create DTM
tweet_training_dtm = DocumentTermMatrix(tweet_training_corpus)
Now we train the SVM model:
training_container = create_container(tweet_training_dtm, tweets_training$complaint, trainSize=1:nrow(tweets_training), virgin=FALSE)
complaints_model = train_model(training_container, "SVM", kernel="linear", cost=1)
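To get a feel for what a linear SVM does with word counts, here is a toy sketch that calls the e1071 package directly (RTextTools relies on e1071 for its SVM). The two features, the six tweets, and their labels are all made up, and scale = FALSE simply keeps the raw counts as they are for this small example.

```r
library(e1071)

# Toy feature vectors: counts of two words per tweet (made-up data)
features = rbind(
  c(delayed = 2, thanks = 0),
  c(delayed = 1, thanks = 0),
  c(delayed = 3, thanks = 0),
  c(delayed = 0, thanks = 2),
  c(delayed = 0, thanks = 1),
  c(delayed = 0, thanks = 3)
)
labels = factor(c(1, 1, 1, -1, -1, -1))

# Linear-kernel SVM with the same kernel and cost settings as above
toy_model = svm(x = features, y = labels, kernel = "linear", cost = 1, scale = FALSE)

# Classify two unseen tweets: one mostly "delayed", one mostly "thanks"
new_tweets = rbind(c(delayed = 2, thanks = 1), c(delayed = 1, thanks = 3))
toy_predictions = predict(toy_model, new_tweets)
toy_predictions
```

The model learns a boundary in the space of word counts: tweets dominated by “delayed” fall on the complaint side, tweets dominated by “thanks” on the other.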
Now that we have our model, we can use it to predict whether the rest of the tweets are complaints or not. We have our full set in the tweets_all object, and we’re going to create another DTM from that. We’re going to do it a little differently this time, because we want to constrain the vocabulary of the new DTM to the vocabulary of our training set.
tweet_corpus = Corpus(VectorSource(tweets_all$text))
#Remove whitespace
tweet_corpus = tm_map(tweet_corpus, stripWhitespace)
#Convert to lowercase
tweet_corpus = tm_map(tweet_corpus, content_transformer(tolower))
#Remove punctuation
tweet_corpus = tm_map(tweet_corpus, removePunctuation)
#Strip digits
tweet_corpus = tm_map(tweet_corpus, removeNumbers)
#Remove stopwords
tweet_corpus = tm_map(tweet_corpus, removeWords, stopwords('english'))
# Create DTM
tweet_dtm = DocumentTermMatrix(tweet_corpus, control = list(dictionary=Terms(tweet_training_dtm)))
#Formatting and getting things ready....
prediction_container = create_container(tweet_dtm, labels=rep(0,nrow(tweets_all)), testSize=1:nrow(tweets_all), virgin=FALSE)
tweets_all = cbind(tweets_all, classify_model(prediction_container, complaints_model))
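Why constrain the vocabulary? The model only knows the columns it was trained on, so the new DTM needs exactly those columns. The base-R sketch below illustrates the effect with made-up tweets and a made-up training vocabulary; it mimics the idea of the dictionary option and is not tm’s implementation.

```r
# Vocabulary learned from the (made-up) training tweets
training_vocab = c("cancelled", "delayed", "thanks")

# New tweets may contain unseen words and may lack some training words
new_tweets = c(t1 = "flight delayed again", t2 = "thanks for the upgrade")
tokens = strsplit(new_tweets, " ")

# Counting against fixed factor levels drops unseen words ("flight", "upgrade", ...)
# and keeps an all-zero column for training words that never occur ("cancelled")
new_dtm = t(sapply(tokens, function(words) table(factor(words, levels = training_vocab))))
new_dtm
```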
Again, there are lots of things we could do with these predictions.
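For now, we’ll just quickly compare the different airlines based on how many complaints they get, by averaging the label column (SVM_LABEL) in our tweets_all dataframe. Because complaints are coded 1 and non-complaints -1, the average runs from -1 (no complaints) to 1 (all complaints), so a higher value means a larger share of an airline’s tweets were classified as complaints.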
Note: I’m using dplyr and ggplot2 here. We’re not going to go over these packages, but if you were to learn two R packages, these would be very good candidates.
airlines_summary = tweets_all %>%
  group_by(airline) %>%
  summarise(
    avg_label = mean(as.numeric(as.character(SVM_LABEL)))
  )
ggplot(data=airlines_summary, aes(x=airline, y=avg_label, fill=airline)) +
  geom_bar(stat="identity")
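An average of -1/1 labels can be hard to read at a glance. The base-R sketch below, using made-up predictions for two hypothetical airlines, shows how the average label relates to the share of tweets classified as complaints.

```r
# Made-up predictions for two hypothetical airlines, using the -1/1 coding
toy_predictions = data.frame(
  airline = c("A", "A", "A", "A", "B", "B", "B", "B"),
  SVM_LABEL = factor(c(1, 1, 1, -1, -1, -1, -1, 1))
)

# The labels are stored as a factor here, so convert via character before doing arithmetic
label_values = as.numeric(as.character(toy_predictions$SVM_LABEL))

# Mean label per airline: ranges from -1 (no complaints) to 1 (all complaints)
avg_label = tapply(label_values, toy_predictions$airline, mean)

# Share of tweets classified as complaints: ranges from 0 to 1
complaint_share = tapply(label_values == 1, toy_predictions$airline, mean)

avg_label
complaint_share
```

The two measures carry the same information (the share equals the average label plus one, divided by two), but the share is often easier to explain on a chart axis.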

