Document Clustering with Daily Kos

In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.

The file dailykos.csv contains data on 3,430 news articles or blog posts that have been posted on Daily Kos. These articles were posted in 2004, leading up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democratic). Foreign policy was a dominant topic of the election, in particular the 2003 invasion of Iraq.

Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.

dailykos <- read.csv("dailykos.csv")

# Compute the distance matrix (Euclidean distance)
kosdist <- dist(dailykos, method = "euclidean")

# Hierarchical clustering on the Daily Kos data (Ward's method)
koshier <- hclust(kosdist, method = "ward.D")
plot(koshier)

Hierarchical Clustering

Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.

Now, we don’t really want to run tapply on every single variable when we have over 1,000 different variables. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.

How many observations are in cluster 3?

newsCluster <- cutree(koshier, k = 7)

cluster1 <- subset(dailykos, newsCluster == 1)
cluster2 <- subset(dailykos, newsCluster == 2)
cluster3 <- subset(dailykos, newsCluster == 3)
cluster4 <- subset(dailykos, newsCluster == 4)
cluster5 <- subset(dailykos, newsCluster == 5)
cluster6 <- subset(dailykos, newsCluster == 6)
cluster7 <- subset(dailykos, newsCluster == 7)
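
To answer the cluster-size question, a quick way is to count the rows of the relevant subset, or to tabulate the cutree assignments for all 7 clusters at once:

table(newsCluster)
nrow(cluster3)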

Hierarchical Clustering 2

Instead of looking at the average value in each variable individually, we’ll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where “HierCluster1” should be replaced with the name of your first cluster subset):

tail(sort(colMeans(HierCluster1)))

This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.

What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:

tail(sort(colMeans(cluster1)))
##      state republican       poll   democrat      kerry       bush 
##  0.7575039  0.7590837  0.9036335  0.9194313  1.0624013  1.7053712
tail(sort(colMeans(cluster2)))
##      bush  democrat challenge      vote      poll  november 
##  2.847352  2.850467  4.096573  4.398754  4.847352 10.339564
tail(sort(colMeans(cluster3)))
##      elect    parties      state republican   democrat       bush 
##   1.647059   1.665775   2.320856   2.524064   3.823529   4.406417
tail(sort(colMeans(cluster4)))
## campaign    voter presided     poll     bush    kerry 
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849
tail(sort(colMeans(cluster5)))
##       american       presided administration            war           iraq 
##       1.090909       1.120393       1.230958       1.776413       2.427518 
##           bush 
##       3.941032
tail(sort(colMeans(cluster6)))
##      race      bush     kerry     elect  democrat      poll 
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325
tail(sort(colMeans(cluster7)))
## democrat    clark   edward     poll    kerry     dean 
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828
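
Rather than repeating the call for each subset, a compact alternative is to split the data frame on the cutree assignments and apply the same pipeline to every piece (a sketch using the objects defined above; it reproduces the six most frequent words per cluster):

lapply(split(dailykos, newsCluster), function(d) tail(sort(colMeans(d))))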

K-means Clustering

Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don’t need to add the iter.max argument.

Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.

How many observations are in Cluster 3?

set.seed(1000)

kmc_news <- kmeans(dailykos, centers = 7)

# Create 7 new datasets, one per k-means cluster

kmeans_cluster1 <- subset(dailykos, kmc_news$cluster == 1)
kmeans_cluster2 <- subset(dailykos, kmc_news$cluster == 2)
kmeans_cluster3 <- subset(dailykos, kmc_news$cluster == 3)
kmeans_cluster4 <- subset(dailykos, kmc_news$cluster == 4)
kmeans_cluster5 <- subset(dailykos, kmc_news$cluster == 5)
kmeans_cluster6 <- subset(dailykos, kmc_news$cluster == 6)
kmeans_cluster7 <- subset(dailykos, kmc_news$cluster == 7)

nrow(kmeans_cluster3)
## [1] 300
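# Cluster sizes for all 7 k-means clusters at once (nrow() above already
# answers the question for cluster 3):
table(kmc_news$cluster)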

K-means 2

Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.

Which k-means cluster best corresponds to the Iraq War?

tail(sort(colMeans(kmeans_cluster1)))
##           time           iraq          kerry administration       presided 
##       1.586667       1.640000       1.653333       2.620000       2.726667 
##           bush 
##      11.333333
tail(sort(colMeans(kmeans_cluster2)))
##  democrat      bush challenge      vote      poll  november 
##  2.899696  2.960486  4.121581  4.446809  4.872340 10.370821
tail(sort(colMeans(kmeans_cluster3)))
##    voter presided campaign     poll     bush    kerry 
## 1.326667 1.336667 1.403333 2.816667 5.963333 6.613333
tail(sort(colMeans(kmeans_cluster4)))
## republican      elect      kerry       poll   democrat       bush 
##  0.5772077  0.5786877  0.6581154  0.7380365  0.7409965  1.1588555
tail(sort(colMeans(kmeans_cluster5)))
## primaries  democrat    edward     clark     kerry      dean 
##  2.333333  2.708333  2.826389  3.083333  5.041667  8.236111
tail(sort(colMeans(kmeans_cluster6)))
## administration          iraqi       american           bush            war 
##       1.396364       1.621818       1.694545       2.607273       3.036364 
##           iraq 
##       4.094545
tail(sort(colMeans(kmeans_cluster7)))
##       race     senate      state    parties republican   democrat 
##   2.341463   2.409756   2.995122   3.243902   4.200000   6.185366

Market Segmentation for Airlines: Task 2

In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven different variables in the dataset, described below:

Balance = number of miles eligible for award travel
QualMiles = number of miles qualifying for TopFlight status
BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
BonusTrans = number of non-flight bonus transactions in the past 12 months
FlightMiles = number of flight miles in the past 12 months
FlightTrans = number of flight transactions in the past 12 months
DaysSinceEnroll = number of days since enrolled in the frequent flyer program

airlines <- read.csv("AirlinesCluster.csv")
summary(airlines)
##     Balance          QualMiles         BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23801   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  
##   FlightMiles       FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2   
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
##  Median :    0.0   Median : 0.000   Median :4096   
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119   
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
##  Max.   :30817.0   Max.   :53.000   Max.   :8296

Normalizing the Data

Let’s go ahead and normalize our data. You can normalize the variables in a data frame with the preProcess function in the caret package. You should already have this package installed from Week 4; if not, install it with install.packages("caret") and then load it with library(caret).

Now, create a normalized data frame called “airlinesNorm” by running the following commands:

preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines)

The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.

In the normalized data, which variable has the largest maximum value?

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(airlines)
airlinesNorm <- predict(preproc, airlines)

summary(airlinesNorm)
##     Balance          QualMiles         BonusMiles        BonusTrans      
##  Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
##  1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
##  Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
##  Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  
##   FlightMiles       FlightTrans       DaysSinceEnroll   
##  Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
##  1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
##  Median :-0.3286   Median :-0.36212   Median :-0.01092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
##  Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284
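
As noted above, each normalized variable should have standard deviation (approximately) 1; a quick check over all columns:

sapply(airlinesNorm, sd)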

Hierarchical Clustering

Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = "ward.D") on the normalized data. The commands may take a few minutes to finish, since the dataset has a large number of observations for hierarchical clustering.

Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters. According to the dendrogram, which of the following is NOT a good choice for the number of clusters?

# Compute the distance matrix (Euclidean distance)
airdist <- dist(airlinesNorm, method = "euclidean")

# Hierarchical clustering on the normalized airline data (Ward's method)
airHclust <- hclust(airdist, method = "ward.D")
plot(airHclust)

# cutree, k = 5
airCluster <- cutree(airHclust, k = 5)

aircluster1 <- subset(airlinesNorm, airCluster == 1)
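
Tabulating the cutree assignments shows how many frequent flyers fall into each of the 5 clusters (cluster 1 in particular):

table(airCluster)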

Hierarchical Clustering 2

Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable “Balance” with the following command:

tapply(airlines$Balance, clusterGroups, mean)

Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.

tapply(airlines$Balance, airCluster, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91
tapply(airlines$DaysSinceEnroll, airCluster, mean)
##        1        2        3        4        5 
## 6235.365 4402.414 5615.709 2840.823 3060.081
# FlightMiles is largest, on average, in cluster 2
tapply(airlines$FlightMiles, airCluster, mean)
##          1          2          3          4          5 
##   83.18428 2613.41811  327.67611  111.57373  119.32191
# BonusMiles is largest, on average, in cluster 3
tapply(airlines$BonusMiles, airCluster, mean)
##         1         2         3         4         5 
## 10360.124 22881.763 55795.860 20788.766  2264.788
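
Rather than calling tapply once per variable, a compact sketch that computes all seven unnormalized cluster centroids at once (clusters as rows, variables as columns), using the objects defined above:

sapply(airlines, function(x) tapply(x, airCluster, mean))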

K-means Clustering

Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

How many clusters have more than 1,000 observations?

set.seed(88)

kmc_air <- kmeans(airlinesNorm, centers = 5, iter.max = 1000)
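# One way to see the size of each k-means cluster (the same counts are stored in
# kmc_air$size), which answers how many clusters have more than 1,000 observations:
table(kmc_air$cluster)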

# Create the 5 cluster subsets
KmeansCluster1 <-  subset(airlinesNorm, kmc_air$cluster == 1)
KmeansCluster2 <-  subset(airlinesNorm, kmc_air$cluster == 2)
KmeansCluster3 <-  subset(airlinesNorm, kmc_air$cluster == 3)
KmeansCluster4 <-  subset(airlinesNorm, kmc_air$cluster == 4)
KmeansCluster5 <-  subset(airlinesNorm, kmc_air$cluster == 5)

# Compare the k-means clusters with the hierarchical clustering results (Balance shown here)

tapply(airlines$Balance, kmc_air$cluster, mean)
##         1         2         3         4         5 
## 152879.30 114012.18 191736.34  57416.14  38150.31

Predicting Stock Returns with Cluster-Then-Predict: Task 3

For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

ReturnJan = the return for the company's stock during January (in the year of the observation). 
ReturnFeb = the return for the company's stock during February (in the year of the observation). 
ReturnMar = the return for the company's stock during March (in the year of the observation). 
ReturnApr = the return for the company's stock during April (in the year of the observation). 
ReturnMay = the return for the company's stock during May (in the year of the observation). 
ReturnJune = the return for the company's stock during June (in the year of the observation). 
ReturnJuly = the return for the company's stock during July (in the year of the observation). 
ReturnAug = the return for the company's stock during August (in the year of the observation). 
ReturnSep = the return for the company's stock during September (in the year of the observation). 
ReturnOct = the return for the company's stock during October (in the year of the observation). 
ReturnNov = the return for the company's stock during November (in the year of the observation). 
PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.

stocks <- read.csv("StocksCluster.csv")
nrow(stocks)
## [1] 11580
table(stocks$PositiveDec)
## 
##    0    1 
## 5256 6324
6324/(6324+5256)
## [1] 0.546114

Exploring the Dataset

What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

cor(stocks)
##                ReturnJan   ReturnFeb    ReturnMar    ReturnApr    ReturnMay
## ReturnJan    1.000000000  0.06677458 -0.090496798 -0.037678006 -0.044411417
## ReturnFeb    0.066774583  1.00000000 -0.155983263 -0.191351924 -0.095520920
## ReturnMar   -0.090496798 -0.15598326  1.000000000  0.009726288 -0.003892789
## ReturnApr   -0.037678006 -0.19135192  0.009726288  1.000000000  0.063822504
## ReturnMay   -0.044411417 -0.09552092 -0.003892789  0.063822504  1.000000000
## ReturnJune   0.092238307  0.16999448 -0.085905486 -0.011027752 -0.021074539
## ReturnJuly  -0.081429765 -0.06177851  0.003374160  0.080631932  0.090850264
## ReturnAug   -0.022792019  0.13155979 -0.022005400 -0.051756051 -0.033125658
## ReturnSep   -0.026437153  0.04350177  0.076518327 -0.028920972  0.021962862
## ReturnOct    0.142977229 -0.08732427 -0.011923758  0.048540025  0.017166728
## ReturnNov    0.067632333 -0.15465828  0.037323535  0.031761837  0.048046590
## PositiveDec  0.004728518 -0.03817318  0.022408661  0.094353528  0.058201934
##              ReturnJune    ReturnJuly     ReturnAug     ReturnSep   ReturnOct
## ReturnJan    0.09223831 -0.0814297650 -0.0227920187 -0.0264371526  0.14297723
## ReturnFeb    0.16999448 -0.0617785094  0.1315597863  0.0435017706 -0.08732427
## ReturnMar   -0.08590549  0.0033741597 -0.0220053995  0.0765183267 -0.01192376
## ReturnApr   -0.01102775  0.0806319317 -0.0517560510 -0.0289209718  0.04854003
## ReturnMay   -0.02107454  0.0908502642 -0.0331256580  0.0219628623  0.01716673
## ReturnJune   1.00000000 -0.0291525996  0.0107105260  0.0447472692 -0.02263599
## ReturnJuly  -0.02915260  1.0000000000  0.0007137558  0.0689478037 -0.05470891
## ReturnAug    0.01071053  0.0007137558  1.0000000000  0.0007407139 -0.07559456
## ReturnSep    0.04474727  0.0689478037  0.0007407139  1.0000000000 -0.05807924
## ReturnOct   -0.02263599 -0.0547089088 -0.0755945614 -0.0580792362  1.00000000
## ReturnNov   -0.06527054 -0.0483738369 -0.1164890345 -0.0197197998  0.19167279
## PositiveDec  0.02340975  0.0743642097  0.0041669657  0.0416302863 -0.05257496
##               ReturnNov  PositiveDec
## ReturnJan    0.06763233  0.004728518
## ReturnFeb   -0.15465828 -0.038173184
## ReturnMar    0.03732353  0.022408661
## ReturnApr    0.03176184  0.094353528
## ReturnMay    0.04804659  0.058201934
## ReturnJune  -0.06527054  0.023409745
## ReturnJuly  -0.04837384  0.074364210
## ReturnAug   -0.11648903  0.004166966
## ReturnSep   -0.01971980  0.041630286
## ReturnOct    0.19167279 -0.052574956
## ReturnNov    1.00000000 -0.062346556
## PositiveDec -0.06234656  1.000000000
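# The full correlation matrix above is hard to scan by eye; a compact way to pull
# out the largest pairwise correlation among the 11 return variables (assuming they
# are the first 11 columns, as in the variable list above):
returnCors <- cor(stocks[, 1:11])
max(returnCors[upper.tri(returnCors)])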
summary(stocks)
##    ReturnJan            ReturnFeb           ReturnMar        
##  Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
##  1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
##  Median : 0.0009965   Median :-0.010626   Median : 0.009878  
##  Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
##  3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
##  Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
##    ReturnApr           ReturnMay          ReturnJune       
##  Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
##  1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
##  Median : 0.009059   Median : 0.01293   Median :-0.000880  
##  Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
##  3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
##  Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
##    ReturnJuly           ReturnAug           ReturnSep        
##  Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
##  1st Qu.:-0.0731917   1st Qu.:-0.046272   1st Qu.:-0.074648  
##  Median :-0.0008047   Median : 0.007205   Median :-0.007616  
##  Mean   : 0.0030509   Mean   : 0.016198   Mean   :-0.014721  
##  3rd Qu.: 0.0718205   3rd Qu.: 0.070783   3rd Qu.: 0.049476  
##  Max.   : 2.5500000   Max.   : 3.626609   Max.   : 5.863980  
##    ReturnOct           ReturnNov          PositiveDec    
##  Min.   :-0.685504   Min.   :-0.747171   Min.   :0.0000  
##  1st Qu.:-0.070915   1st Qu.:-0.054890   1st Qu.:0.0000  
##  Median : 0.002115   Median : 0.008522   Median :1.0000  
##  Mean   : 0.005651   Mean   : 0.011387   Mean   :0.5461  
##  3rd Qu.: 0.074542   3rd Qu.: 0.076576   3rd Qu.:1.0000  
##  Max.   : 5.665138   Max.   : 3.271676   Max.   :1.0000

Initial Logistic Regression Model

Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

set.seed(144)

spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)

Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don’t forget to add the argument family=binomial to your glm command.

What is the overall accuracy on the training set, using a threshold of 0.5?

library(caTools)
set.seed(144)

spl <-  sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain <-  subset(stocks, spl == TRUE)
stocksTest <-  subset(stocks, spl == FALSE)

# Building logistic regression model
stocksLog <- glm(PositiveDec ~ ., data = stocksTrain, family = binomial)
predictTrain <- predict(stocksLog, type = "response")

table(stocksTrain$PositiveDec, predictTrain > 0.5)
##    
##     FALSE TRUE
##   0   990 2689
##   1   787 3640
accuracy <- (990+3640)/nrow(stocksTrain)
accuracy
## [1] 0.5711818
# Accuracy on the test set
predictTest <- predict(stocksLog, type = "response", newdata = stocksTest)
table(stocksTest$PositiveDec, predictTest > 0.5)
##    
##     FALSE TRUE
##   0   417 1160
##   1   344 1553
(accuracy2 <- (417+1553)/nrow(stocksTest))
## [1] 0.5670697
# Baseline: accuracy of always predicting the most common outcome (PositiveDec = 1)
table(stocksTest$PositiveDec)
## 
##    0    1 
## 1577 1897
(1897)/nrow(stocksTest)
## [1] 0.5460564

Clustering Stocks

Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

limitedTrain = stocksTrain

limitedTrain$PositiveDec = NULL

limitedTest = stocksTest

limitedTest$PositiveDec = NULL

Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?

limitedTrain <-  stocksTrain

limitedTrain$PositiveDec <-  NULL

limitedTest <-  stocksTest

limitedTest$PositiveDec <-  NULL

Clustering Stocks 2

In the market segmentation assignment in this week’s homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

library(caret)

preproc = preProcess(limitedTrain)

normTrain = predict(preproc, limitedTrain)

normTest = predict(preproc, limitedTest)

What is the mean of the ReturnJan variable in normTrain?

library(caret)
preproc <-  preProcess(limitedTrain)

normTrain <-  predict(preproc, limitedTrain)

normTest <-  predict(preproc, limitedTest)

#exploration
mean(normTrain$ReturnJan)
## [1] 2.100586e-17
mean(normTest$ReturnJan)
## [1] -0.0004185886

Clustering Stocks 3

Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.

Which cluster has the largest number of observations?

set.seed(144)

km <- kmeans(normTrain, centers = 3)

Km1 <-  subset(normTrain, km$cluster == 1)
Km2 <-  subset(normTrain, km$cluster == 2)
Km3 <-  subset(normTrain, km$cluster == 3)
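
The cluster sizes answer which cluster has the most observations; one way to see them (the same counts are stored in km$size):

table(km$cluster)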

Clustering Stocks 4

Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

library(flexclust)

km.kcca = as.kcca(km, normTrain)

clusterTrain = predict(km.kcca)

clusterTest = predict(km.kcca, newdata=normTest)

How many test-set observations were assigned to Cluster 2?

library(flexclust)
## Warning: package 'flexclust' was built under R version 4.0.3
## Loading required package: grid
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 4.0.3
## Loading required package: stats4
km.kcca <- as.kcca(km, normTrain)

clusterTrain <-  predict(km.kcca)

clusterTest <-  predict(km.kcca, newdata=normTest)
table(clusterTest)
## clusterTest
##    1    2    3 
## 1058 2029  387

Cluster-Specific Predictions

Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

Which training set data frame has the highest average value of the dependent variable?

# Subsetting stocksTrain
stocksTrain1 <-  subset(stocksTrain, clusterTrain == 1)
stocksTrain2 <-  subset(stocksTrain, clusterTrain == 2)
stocksTrain3 <-  subset(stocksTrain, clusterTrain == 3)

# Subsetting stocksTest 
stocksTest1 <-  subset(stocksTest, clusterTest == 1)
stocksTest2 <-  subset(stocksTest, clusterTest == 2)
stocksTest3 <-  subset(stocksTest, clusterTest == 3)

# exploration
mean(stocksTrain1$PositiveDec)
## [1] 0.6103267
mean(stocksTrain2$PositiveDec)
## [1] 0.5250476
mean(stocksTrain3$PositiveDec)
## [1] 0.4799107

Cluster-Specific Predictions 2

Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.

# Cluster-specific logistic regression models 1, 2, and 3
Stocksmodel1 <- glm(PositiveDec ~ ., data = stocksTrain1, family = binomial)
Stocksmodel2 <- glm(PositiveDec ~ ., data = stocksTrain2, family = binomial)
Stocksmodel3 <- glm(PositiveDec ~ ., data = stocksTrain3, family = binomial)

summary(Stocksmodel1)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7220  -1.2879   0.8679   1.0096   1.7170  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.20739    0.08020   2.586  0.00971 ** 
## ReturnJan    0.12448    0.31474   0.396  0.69246    
## ReturnFeb   -0.46307    0.32713  -1.416  0.15692    
## ReturnMar    0.55465    0.24804   2.236  0.02534 *  
## ReturnApr    1.08354    0.25005   4.333 1.47e-05 ***
## ReturnMay    0.30487    0.24993   1.220  0.22253    
## ReturnJune   0.00172    0.33525   0.005  0.99591    
## ReturnJuly  -0.02763    0.30216  -0.091  0.92714    
## ReturnAug    0.40299    0.34570   1.166  0.24373    
## ReturnSep    0.70779    0.32611   2.170  0.02998 *  
## ReturnOct   -1.33254    0.29055  -4.586 4.51e-06 ***
## ReturnNov   -0.78944    0.30583  -2.581  0.00984 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3314.9  on 2478  degrees of freedom
## Residual deviance: 3244.0  on 2467  degrees of freedom
## AIC: 3268
## 
## Number of Fisher Scoring iterations: 4
summary(Stocksmodel2)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2268  -1.2086   0.9698   1.1294   1.6769  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.1165255  0.0335061   3.478 0.000506 ***
## ReturnJan    0.4489316  0.2414090   1.860 0.062938 .  
## ReturnFeb   -0.1385458  0.1514734  -0.915 0.360373    
## ReturnMar    0.4657754  0.2442129   1.907 0.056488 .  
## ReturnApr    0.7839726  0.2558668   3.064 0.002184 ** 
## ReturnMay    0.7709831  0.2625217   2.937 0.003316 ** 
## ReturnJune   0.5764584  0.2191809   2.630 0.008537 ** 
## ReturnJuly   0.8654737  0.2869778   3.016 0.002563 ** 
## ReturnAug    0.0177313  0.2317020   0.077 0.939001    
## ReturnSep    1.0464947  0.2684133   3.899 9.67e-05 ***
## ReturnOct   -0.0001062  0.2426989   0.000 0.999651    
## ReturnNov   -0.3716212  0.2603210  -1.428 0.153421    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6546.7  on 4730  degrees of freedom
## Residual deviance: 6481.4  on 4719  degrees of freedom
## AIC: 6505.4
## 
## Number of Fisher Scoring iterations: 4
summary(Stocksmodel3)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8938  -1.0863  -0.5241   1.0874   2.1892  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   0.4504     0.1700   2.649 0.008071 ** 
## ReturnJan     0.3816     0.2742   1.392 0.163927    
## ReturnFeb     0.3943     0.4612   0.855 0.392490    
## ReturnMar    -1.7220     0.4372  -3.938 8.20e-05 ***
## ReturnApr     0.5214     0.3275   1.592 0.111316    
## ReturnMay     1.1301     0.4274   2.644 0.008194 ** 
## ReturnJune    1.5889     0.4423   3.592 0.000328 ***
## ReturnJuly    1.3800     0.4602   2.999 0.002709 ** 
## ReturnAug     0.4146     0.4824   0.860 0.390054    
## ReturnSep    -0.0148     0.4754  -0.031 0.975167    
## ReturnOct    -0.6820     0.3044  -2.241 0.025039 *  
## ReturnNov    -1.7175     0.4133  -4.156 3.24e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1240.7  on 895  degrees of freedom
## Residual deviance: 1131.5  on 884  degrees of freedom
## AIC: 1155.5
## 
## Number of Fisher Scoring iterations: 4
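
The question asks which return variables have a positive coefficient in at least one of the three models and a negative coefficient in another. Rather than scanning the summaries, a short sketch that lines the coefficient signs up side by side, using the model objects fit above:

signs <- sign(cbind(model1 = coef(Stocksmodel1),
                    model2 = coef(Stocksmodel2),
                    model3 = coef(Stocksmodel3)))
signs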

Cluster-Specific Predictions 3

Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.

What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?

# Test-set predictions for cluster 1
predictTest1 <- predict(Stocksmodel1, type = "response", newdata = stocksTest1)
table(stocksTest1$PositiveDec, predictTest1 > 0.5)
##    
##     FALSE TRUE
##   0    43  350
##   1    26  639
(43+639)/nrow(stocksTest1)
## [1] 0.6446125
# Test-set predictions for cluster 2
predictTest2 <- predict(Stocksmodel2, type = "response", newdata = stocksTest2)
table(stocksTest2$PositiveDec, predictTest2 > 0.5)
##    
##     FALSE TRUE
##   0   190  806
##   1   159  874
(190+874)/nrow(stocksTest2)
## [1] 0.5243963
# Test-set predictions for cluster 3
predictTest3 <- predict(Stocksmodel3, type = "response", newdata = stocksTest3)
table(stocksTest3$PositiveDec, predictTest3 > 0.5)
##    
##     FALSE TRUE
##   0   120   68
##   1    98  101
(120+101)/nrow(stocksTest3)
## [1] 0.5710594

Cluster-Specific Predictions: Overall Accuracy

To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

AllPredictions = c(PredictTest1, PredictTest2, PredictTest3)

AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?

AllPredictions <- c(predictTest1, predictTest2, predictTest3)
AllOutcomes <-  c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

# comparing the results
table(AllOutcomes, AllPredictions > 0.5)
##            
## AllOutcomes FALSE TRUE
##           0   353 1224
##           1   283 1614
(accuracy_finale <- (353+1614)/(353+1614+283+1224))
## [1] 0.5662061