In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.
The file dailykos.csv contains data on 3,430 news articles or blog posts published on Daily Kos in 2004, in the run-up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy, particularly the 2003 invasion of Iraq, was a dominant topic of the election.
Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.
dailykos <- read.csv("dailykos.csv")
# Compute the Euclidean distance between every pair of articles
kosdist <- dist(dailykos, method = "euclidean")
# Hierarchical clustering with Ward's method
koshier <- hclust(kosdist, method = "ward.D")
plot(koshier)
Let’s pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.
Now, we don’t really want to run tapply on every single variable when we have over 1,000 different variables. Let’s instead use the subset function to subset our data by cluster. Create 7 new datasets, each containing the observations from one of the clusters.
How many observations are in cluster 3?
newsCluster <- cutree(koshier, k = 7)
cluster1 <- subset(dailykos, newsCluster == 1)
cluster2 <- subset(dailykos, newsCluster == 2)
cluster3 <- subset(dailykos, newsCluster == 3)
cluster4 <- subset(dailykos, newsCluster == 4)
cluster5 <- subset(dailykos, newsCluster == 5)
cluster6 <- subset(dailykos, newsCluster == 6)
cluster7 <- subset(dailykos, newsCluster == 7)
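The number of articles in each cluster can be read off directly from the vector of assignments, rather than by opening each subset (a minimal sketch using the objects created above):
# Number of articles assigned to each of the 7 hierarchical clusters
table(newsCluster)
# Equivalently, for a single cluster:
nrow(cluster3)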
Instead of looking at the average value in each variable individually, we’ll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where “HierCluster1” should be replaced with the name of your first cluster subset):
tail(sort(colMeans(HierCluster1)))
This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.
What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:
tail(sort(colMeans(cluster1)))
## state republican poll democrat kerry bush
## 0.7575039 0.7590837 0.9036335 0.9194313 1.0624013 1.7053712
tail(sort(colMeans(cluster2)))
## bush democrat challenge vote poll november
## 2.847352 2.850467 4.096573 4.398754 4.847352 10.339564
tail(sort(colMeans(cluster3)))
## elect parties state republican democrat bush
## 1.647059 1.665775 2.320856 2.524064 3.823529 4.406417
tail(sort(colMeans(cluster4)))
## campaign voter presided poll bush kerry
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849
tail(sort(colMeans(cluster5)))
## american presided administration war iraq
## 1.090909 1.120393 1.230958 1.776413 2.427518
## bush
## 3.941032
tail(sort(colMeans(cluster6)))
## race bush kerry elect democrat poll
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325
tail(sort(colMeans(cluster7)))
## democrat clark edward poll kerry dean
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828
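Rather than repeating the tail(sort(colMeans(...))) command for each of the seven subsets, a more compact alternative (a sketch using base R's split and lapply) produces the same top-6 lists in one call:
# Top 6 words, by mean frequency, for every hierarchical cluster at once
lapply(split(dailykos, newsCluster), function(cl) tail(sort(colMeans(cl))))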
Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don't need to add the iter.max argument.
Subset your data into the 7 clusters (7 new datasets) by using the “cluster” variable of your kmeans output.
How many observations are in Cluster 3?
set.seed(1000)
kmc_news <- kmeans(dailykos, centers = 7)
# 7 new datasets
kmeans_cluster1 <- subset(dailykos, kmc_news$cluster == 1)
kmeans_cluster2 <- subset(dailykos, kmc_news$cluster == 2)
kmeans_cluster3 <- subset(dailykos, kmc_news$cluster == 3)
kmeans_cluster4 <- subset(dailykos, kmc_news$cluster == 4)
kmeans_cluster5 <- subset(dailykos, kmc_news$cluster == 5)
kmeans_cluster6 <- subset(dailykos, kmc_news$cluster == 6)
kmeans_cluster7 <- subset(dailykos, kmc_news$cluster == 7)
nrow(kmeans_cluster3)
## [1] 300
Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.
Which k-means cluster best corresponds to the Iraq War?
tail(sort(colMeans(kmeans_cluster1)))
## time iraq kerry administration presided
## 1.586667 1.640000 1.653333 2.620000 2.726667
## bush
## 11.333333
tail(sort(colMeans(kmeans_cluster2)))
## democrat bush challenge vote poll november
## 2.899696 2.960486 4.121581 4.446809 4.872340 10.370821
tail(sort(colMeans(kmeans_cluster3)))
## voter presided campaign poll bush kerry
## 1.326667 1.336667 1.403333 2.816667 5.963333 6.613333
tail(sort(colMeans(kmeans_cluster4)))
## republican elect kerry poll democrat bush
## 0.5772077 0.5786877 0.6581154 0.7380365 0.7409965 1.1588555
tail(sort(colMeans(kmeans_cluster5)))
## primaries democrat edward clark kerry dean
## 2.333333 2.708333 2.826389 3.083333 5.041667 8.236111
tail(sort(colMeans(kmeans_cluster6)))
## administration iraqi american bush war
## 1.396364 1.621818 1.694545 2.607273 3.036364
## iraq
## 4.094545
tail(sort(colMeans(kmeans_cluster7)))
## race senate state parties republican democrat
## 2.341463 2.409756 2.995122 3.243902 4.200000 6.185366
In this problem, we'll see how clustering can be used to find similar groups of customers who belong to an airline's frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.
The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.
There are seven different variables in the dataset, described below:
Balance = number of miles eligible for award travel
QualMiles = number of miles qualifying for TopFlight status
BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
BonusTrans = number of non-flight bonus transactions in the past 12 months
FlightMiles = number of flight miles in the past 12 months
FlightTrans = number of flight transactions in the past 12 months
DaysSinceEnroll = number of days since enrolled in the frequent flyer program
airlines <- read.csv("AirlinesCluster.csv")
summary(airlines)
## Balance QualMiles BonusMiles BonusTrans
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 18528 1st Qu.: 0.0 1st Qu.: 1250 1st Qu.: 3.0
## Median : 43097 Median : 0.0 Median : 7171 Median :12.0
## Mean : 73601 Mean : 144.1 Mean : 17145 Mean :11.6
## 3rd Qu.: 92404 3rd Qu.: 0.0 3rd Qu.: 23801 3rd Qu.:17.0
## Max. :1704838 Max. :11148.0 Max. :263685 Max. :86.0
## FlightMiles FlightTrans DaysSinceEnroll
## Min. : 0.0 Min. : 0.000 Min. : 2
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:2330
## Median : 0.0 Median : 0.000 Median :4096
## Mean : 460.1 Mean : 1.374 Mean :4119
## 3rd Qu.: 311.0 3rd Qu.: 1.000 3rd Qu.:5790
## Max. :30817.0 Max. :53.000 Max. :8296
Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package. You should already have this package installed from Week 4, but if not, go ahead and install it with install.packages("caret"). Then load the package with library(caret).
Now, create a normalized data frame called “airlinesNorm” by running the following commands:
preproc = preProcess(airlines)
airlinesNorm = predict(preproc, airlines)
The first command pre-processes the data, and the second command performs the normalization. If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
In the normalized data, which variable has the largest maximum value?
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
preproc <- preProcess(airlines)
airlinesNorm <- predict(preproc, airlines)
summary(airlinesNorm)
## Balance QualMiles BonusMiles BonusTrans
## Min. :-0.7303 Min. :-0.1863 Min. :-0.7099 Min. :-1.20805
## 1st Qu.:-0.5465 1st Qu.:-0.1863 1st Qu.:-0.6581 1st Qu.:-0.89568
## Median :-0.3027 Median :-0.1863 Median :-0.4130 Median : 0.04145
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.1866 3rd Qu.:-0.1863 3rd Qu.: 0.2756 3rd Qu.: 0.56208
## Max. :16.1868 Max. :14.2231 Max. :10.2083 Max. : 7.74673
## FlightMiles FlightTrans DaysSinceEnroll
## Min. :-0.3286 Min. :-0.36212 Min. :-1.99336
## 1st Qu.:-0.3286 1st Qu.:-0.36212 1st Qu.:-0.86607
## Median :-0.3286 Median :-0.36212 Median :-0.01092
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.1065 3rd Qu.:-0.09849 3rd Qu.: 0.80960
## Max. :21.6803 Max. :13.61035 Max. : 2.02284
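As a quick sanity check on the normalization described above (a minimal sketch), every column of airlinesNorm should have a standard deviation of 1:
# Standard deviation of each normalized variable (all should equal 1)
sapply(airlinesNorm, sd)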
Compute the distances between data points (using Euclidean distance) and then run the hierarchical clustering algorithm (using method = "ward.D") on the normalized data. The commands may take a few minutes to finish, since the dataset has a large number of observations for hierarchical clustering.
Then, plot the dendrogram of the hierarchical clustering process. Suppose the airline is looking for somewhere between 2 and 10 clusters. According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
# Compute Euclidean distances between the normalized observations
airdist <- dist(airlinesNorm, method = "euclidean")
# Hierarchical clustering of the frequent flyers with Ward's method
airHclust <- hclust(airdist, method = "ward.D")
plot(airHclust)
# Cut the dendrogram into 5 clusters
airCluster <- cutree(airHclust, k = 5)
aircluster1 <- subset(airlinesNorm, airCluster == 1)
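If you want to see where a particular cut falls on the dendrogram before committing to a number of clusters, one option (a sketch using base R's rect.hclust, drawn on top of the dendrogram plot) is:
# Redraw the dendrogram and outline the 5-cluster solution
plot(airHclust)
rect.hclust(airHclust, k = 5, border = "red")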
Now, use tapply to compare the average values of each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the averages on the unnormalized data so that they are easier to interpret. You can do this for the variable "Balance" with the following command (where clusterGroups is your vector of cluster assignments; airCluster in the code below):
tapply(airlines$Balance, clusterGroups, mean)
Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.
tapply(airlines$Balance, airCluster, mean)
## 1 2 3 4 5
## 57866.90 110669.27 198191.57 52335.91 36255.91
tapply(airlines$DaysSinceEnroll, airCluster, mean)
## 1 2 3 4 5
## 6235.365 4402.414 5615.709 2840.823 3060.081
# Cluster 2 has the largest average FlightMiles
tapply(airlines$FlightMiles, airCluster, mean)
## 1 2 3 4 5
## 83.18428 2613.41811 327.67611 111.57373 119.32191
# Cluster 3 has the largest average BonusMiles
tapply(airlines$BonusMiles, airCluster, mean)
## 1 2 3 4 5
## 10360.124 22881.763 55795.860 20788.766 2264.788
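Instead of calling tapply once per variable, the cluster averages of all seven variables (in the original, unnormalized units) can be computed in a single sweep; this is just a compact sketch of the same computation:
# Average of every original variable within each hierarchical cluster
sapply(airlines, function(x) tapply(x, airCluster, mean))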
Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.
How many clusters have more than 1,000 observations?
set.seed(88)
kmc_air <- kmeans(airlinesNorm, centers = 5, iter.max = 1000)
# Create the 5 k-means cluster subsets
KmeansCluster1 <- subset(airlinesNorm, kmc_air$cluster == 1)
KmeansCluster2 <- subset(airlinesNorm, kmc_air$cluster == 2)
KmeansCluster3 <- subset(airlinesNorm, kmc_air$cluster == 3)
KmeansCluster4 <- subset(airlinesNorm, kmc_air$cluster == 4)
KmeansCluster5 <- subset(airlinesNorm, kmc_air$cluster == 5)
# Compare average Balance across the k-means clusters, as was done for the hierarchical clusters
tapply(airlines$Balance, kmc_air$cluster, mean)
## 1 2 3 4 5
## 152879.30 114012.18 191736.34 57416.14 38150.31
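To answer the cluster-size question, and to inspect all of the k-means centroids at once, the components of the kmeans output can be used directly (a minimal sketch):
# Observations per k-means cluster (kmc_air$size holds the same counts)
table(kmc_air$cluster)
# Cluster centroids, expressed in normalized units
kmc_air$centers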
For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.
Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.
This dataset contains the following variables:
ReturnJan = the return for the company's stock during January (in the year of the observation).
ReturnFeb = the return for the company's stock during February (in the year of the observation).
ReturnMar = the return for the company's stock during March (in the year of the observation).
ReturnApr = the return for the company's stock during April (in the year of the observation).
ReturnMay = the return for the company's stock during May (in the year of the observation).
ReturnJune = the return for the company's stock during June (in the year of the observation).
ReturnJuly = the return for the company's stock during July (in the year of the observation).
ReturnAug = the return for the company's stock during August (in the year of the observation).
ReturnSep = the return for the company's stock during September (in the year of the observation).
ReturnOct = the return for the company's stock during October (in the year of the observation).
ReturnNov = the return for the company's stock during November (in the year of the observation).
PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.
For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
stocks <- read.csv("StocksCluster.csv")
nrow(stocks)
## [1] 11580
table(stocks$PositiveDec)
##
## 0 1
## 5256 6324
6324/(6324+5256)
## [1] 0.546114
What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.
cor(stocks)
## ReturnJan ReturnFeb ReturnMar ReturnApr ReturnMay
## ReturnJan 1.000000000 0.06677458 -0.090496798 -0.037678006 -0.044411417
## ReturnFeb 0.066774583 1.00000000 -0.155983263 -0.191351924 -0.095520920
## ReturnMar -0.090496798 -0.15598326 1.000000000 0.009726288 -0.003892789
## ReturnApr -0.037678006 -0.19135192 0.009726288 1.000000000 0.063822504
## ReturnMay -0.044411417 -0.09552092 -0.003892789 0.063822504 1.000000000
## ReturnJune 0.092238307 0.16999448 -0.085905486 -0.011027752 -0.021074539
## ReturnJuly -0.081429765 -0.06177851 0.003374160 0.080631932 0.090850264
## ReturnAug -0.022792019 0.13155979 -0.022005400 -0.051756051 -0.033125658
## ReturnSep -0.026437153 0.04350177 0.076518327 -0.028920972 0.021962862
## ReturnOct 0.142977229 -0.08732427 -0.011923758 0.048540025 0.017166728
## ReturnNov 0.067632333 -0.15465828 0.037323535 0.031761837 0.048046590
## PositiveDec 0.004728518 -0.03817318 0.022408661 0.094353528 0.058201934
## ReturnJune ReturnJuly ReturnAug ReturnSep ReturnOct
## ReturnJan 0.09223831 -0.0814297650 -0.0227920187 -0.0264371526 0.14297723
## ReturnFeb 0.16999448 -0.0617785094 0.1315597863 0.0435017706 -0.08732427
## ReturnMar -0.08590549 0.0033741597 -0.0220053995 0.0765183267 -0.01192376
## ReturnApr -0.01102775 0.0806319317 -0.0517560510 -0.0289209718 0.04854003
## ReturnMay -0.02107454 0.0908502642 -0.0331256580 0.0219628623 0.01716673
## ReturnJune 1.00000000 -0.0291525996 0.0107105260 0.0447472692 -0.02263599
## ReturnJuly -0.02915260 1.0000000000 0.0007137558 0.0689478037 -0.05470891
## ReturnAug 0.01071053 0.0007137558 1.0000000000 0.0007407139 -0.07559456
## ReturnSep 0.04474727 0.0689478037 0.0007407139 1.0000000000 -0.05807924
## ReturnOct -0.02263599 -0.0547089088 -0.0755945614 -0.0580792362 1.00000000
## ReturnNov -0.06527054 -0.0483738369 -0.1164890345 -0.0197197998 0.19167279
## PositiveDec 0.02340975 0.0743642097 0.0041669657 0.0416302863 -0.05257496
## ReturnNov PositiveDec
## ReturnJan 0.06763233 0.004728518
## ReturnFeb -0.15465828 -0.038173184
## ReturnMar 0.03732353 0.022408661
## ReturnApr 0.03176184 0.094353528
## ReturnMay 0.04804659 0.058201934
## ReturnJune -0.06527054 0.023409745
## ReturnJuly -0.04837384 0.074364210
## ReturnAug -0.11648903 0.004166966
## ReturnSep -0.01971980 0.041630286
## ReturnOct 0.19167279 -0.052574956
## ReturnNov 1.00000000 -0.062346556
## PositiveDec -0.06234656 1.000000000
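The largest pairwise correlation can also be located programmatically instead of by scanning the matrix. This is a sketch that assumes the 11 return variables occupy the first 11 columns, as the variable list and output above indicate:
# Correlations among the monthly return variables only
returnCor <- cor(stocks[, 1:11])
# Ignore the 1s on the diagonal and report the largest remaining value
diag(returnCor) <- NA
max(returnCor, na.rm = TRUE)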
summary(stocks)
## ReturnJan ReturnFeb ReturnMar
## Min. :-0.7616205 Min. :-0.690000 Min. :-0.712994
## 1st Qu.:-0.0691663 1st Qu.:-0.077748 1st Qu.:-0.046389
## Median : 0.0009965 Median :-0.010626 Median : 0.009878
## Mean : 0.0126316 Mean :-0.007605 Mean : 0.019402
## 3rd Qu.: 0.0732606 3rd Qu.: 0.043600 3rd Qu.: 0.077066
## Max. : 3.0683060 Max. : 6.943694 Max. : 4.008621
## ReturnApr ReturnMay ReturnJune
## Min. :-0.826503 Min. :-0.92207 Min. :-0.717920
## 1st Qu.:-0.054468 1st Qu.:-0.04640 1st Qu.:-0.063966
## Median : 0.009059 Median : 0.01293 Median :-0.000880
## Mean : 0.026308 Mean : 0.02474 Mean : 0.005938
## 3rd Qu.: 0.085338 3rd Qu.: 0.08396 3rd Qu.: 0.061566
## Max. : 2.528827 Max. : 6.93013 Max. : 4.339713
## ReturnJuly ReturnAug ReturnSep
## Min. :-0.7613096 Min. :-0.726800 Min. :-0.839730
## 1st Qu.:-0.0731917 1st Qu.:-0.046272 1st Qu.:-0.074648
## Median :-0.0008047 Median : 0.007205 Median :-0.007616
## Mean : 0.0030509 Mean : 0.016198 Mean :-0.014721
## 3rd Qu.: 0.0718205 3rd Qu.: 0.070783 3rd Qu.: 0.049476
## Max. : 2.5500000 Max. : 3.626609 Max. : 5.863980
## ReturnOct ReturnNov PositiveDec
## Min. :-0.685504 Min. :-0.747171 Min. :0.0000
## 1st Qu.:-0.070915 1st Qu.:-0.054890 1st Qu.:0.0000
## Median : 0.002115 Median : 0.008522 Median :1.0000
## Mean : 0.005651 Mean : 0.011387 Mean :0.5461
## 3rd Qu.: 0.074542 3rd Qu.: 0.076576 3rd Qu.:1.0000
## Max. : 5.665138 Max. : 3.271676 Max. :1.0000
Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:
set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)
Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don’t forget to add the argument family=binomial to your glm command.
What is the overall accuracy on the training set, using a threshold of 0.5?
library(caTools)
# library(kableExtra)
set.seed(144)
spl <- sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain <- subset(stocks, spl == TRUE)
stocksTest <- subset(stocks, spl == FALSE)
# Building logistic regression model
stocksLog <- glm(PositiveDec ~ ., data = stocksTrain, family = binomial)
predictTrain <- predict(stocksLog, type = "response")
table(stocksTrain$PositiveDec, predictTrain > 0.5)
##
## FALSE TRUE
## 0 990 2689
## 1 787 3640
accuracy <- (990+3640)/nrow(stocksTrain)
accuracy
## [1] 0.5711818
# on test data
predictTest <- predict(stocksLog, type = "response", newdata = stocksTest)
table(stocksTest$PositiveDec, predictTest > 0.5)
##
## FALSE TRUE
## 0 417 1160
## 1 344 1553
(accuracy2 <- (417+1553)/nrow(stocksTest))
## [1] 0.5670697
#baseline
table(stocksTest$PositiveDec)
##
## 0 1
## 1577 1897
(1897)/nrow(stocksTest)
## [1] 0.5460564
Now, let’s cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL
Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology? Because needing to know the value of the dependent variable in order to assign an observation to a cluster would defeat the purpose of the methodology: the outcome is exactly what we will not know when making predictions for new observations.
limitedTrain <- stocksTrain
limitedTrain$PositiveDec <- NULL
limitedTest <- stocksTest
limitedTest$PositiveDec <- NULL
In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.
In cases where we have a training and testing set, we’ll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:
library(caret)
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)
What is the mean of the ReturnJan variable in normTrain?
library(caret)
preproc <- preProcess(limitedTrain)
normTrain <- predict(preproc, limitedTrain)
normTest <- predict(preproc, limitedTest)
#exploration
mean(normTrain$ReturnJan)
## [1] 2.100586e-17
mean(normTest$ReturnJan)
## [1] -0.0004185886
Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.
Which cluster has the largest number of observations?
set.seed(144)
km <- kmeans(normTrain, centers = 3)
Km1 <- subset(normTrain, km$cluster == 1)
Km2 <- subset(normTrain, km$cluster == 2)
Km3 <- subset(normTrain, km$cluster == 3)
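The cluster sizes can be tabulated directly to answer the question (a minimal sketch):
# Observations per k-means cluster (km$size gives the same counts)
table(km$cluster)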
Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):
library(flexclust)
km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)
How many test-set observations were assigned to Cluster 2?
library(flexclust)
## Warning: package 'flexclust' was built under R version 4.0.3
## Loading required package: grid
## Loading required package: modeltools
## Warning: package 'modeltools' was built under R version 4.0.3
## Loading required package: stats4
km.kcca <- as.kcca(km, normTrain)
clusterTrain <- predict(km.kcca)
clusterTest <- predict(km.kcca, newdata=normTest)
table(clusterTest)
## clusterTest
## 1 2 3
## 1058 2029 387
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.
Which training set data frame has the highest average value of the dependent variable?
# Subsetting stocksTrain
stocksTrain1 <- subset(stocksTrain, clusterTrain == 1)
stocksTrain2 <- subset(stocksTrain, clusterTrain == 2)
stocksTrain3 <- subset(stocksTrain, clusterTrain == 3)
# Subsetting stocksTest
stocksTest1 <- subset(stocksTest, clusterTest == 1)
stocksTest2 <- subset(stocksTest, clusterTest == 2)
stocksTest3 <- subset(stocksTest, clusterTest == 3)
# exploration
mean(stocksTrain1$PositiveDec)
## [1] 0.6103267
mean(stocksTrain2$PositiveDec)
## [1] 0.5250476
mean(stocksTrain3$PositiveDec)
## [1] 0.4799107
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.
Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.
# Logistic regression models trained on the three cluster-specific training sets
Stocksmodel1 <- glm(PositiveDec ~ ., data = stocksTrain1, family = binomial)
Stocksmodel2 <- glm(PositiveDec ~ ., data = stocksTrain2, family = binomial)
Stocksmodel3 <- glm(PositiveDec ~ ., data = stocksTrain3, family = binomial)
summary(Stocksmodel1)
##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7220 -1.2879 0.8679 1.0096 1.7170
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.20739 0.08020 2.586 0.00971 **
## ReturnJan 0.12448 0.31474 0.396 0.69246
## ReturnFeb -0.46307 0.32713 -1.416 0.15692
## ReturnMar 0.55465 0.24804 2.236 0.02534 *
## ReturnApr 1.08354 0.25005 4.333 1.47e-05 ***
## ReturnMay 0.30487 0.24993 1.220 0.22253
## ReturnJune 0.00172 0.33525 0.005 0.99591
## ReturnJuly -0.02763 0.30216 -0.091 0.92714
## ReturnAug 0.40299 0.34570 1.166 0.24373
## ReturnSep 0.70779 0.32611 2.170 0.02998 *
## ReturnOct -1.33254 0.29055 -4.586 4.51e-06 ***
## ReturnNov -0.78944 0.30583 -2.581 0.00984 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3314.9 on 2478 degrees of freedom
## Residual deviance: 3244.0 on 2467 degrees of freedom
## AIC: 3268
##
## Number of Fisher Scoring iterations: 4
summary(Stocksmodel2)
##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2268 -1.2086 0.9698 1.1294 1.6769
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1165255 0.0335061 3.478 0.000506 ***
## ReturnJan 0.4489316 0.2414090 1.860 0.062938 .
## ReturnFeb -0.1385458 0.1514734 -0.915 0.360373
## ReturnMar 0.4657754 0.2442129 1.907 0.056488 .
## ReturnApr 0.7839726 0.2558668 3.064 0.002184 **
## ReturnMay 0.7709831 0.2625217 2.937 0.003316 **
## ReturnJune 0.5764584 0.2191809 2.630 0.008537 **
## ReturnJuly 0.8654737 0.2869778 3.016 0.002563 **
## ReturnAug 0.0177313 0.2317020 0.077 0.939001
## ReturnSep 1.0464947 0.2684133 3.899 9.67e-05 ***
## ReturnOct -0.0001062 0.2426989 0.000 0.999651
## ReturnNov -0.3716212 0.2603210 -1.428 0.153421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6546.7 on 4730 degrees of freedom
## Residual deviance: 6481.4 on 4719 degrees of freedom
## AIC: 6505.4
##
## Number of Fisher Scoring iterations: 4
summary(Stocksmodel3)
##
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8938 -1.0863 -0.5241 1.0874 2.1892
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.4504 0.1700 2.649 0.008071 **
## ReturnJan 0.3816 0.2742 1.392 0.163927
## ReturnFeb 0.3943 0.4612 0.855 0.392490
## ReturnMar -1.7220 0.4372 -3.938 8.20e-05 ***
## ReturnApr 0.5214 0.3275 1.592 0.111316
## ReturnMay 1.1301 0.4274 2.644 0.008194 **
## ReturnJune 1.5889 0.4423 3.592 0.000328 ***
## ReturnJuly 1.3800 0.4602 2.999 0.002709 **
## ReturnAug 0.4146 0.4824 0.860 0.390054
## ReturnSep -0.0148 0.4754 -0.031 0.975167
## ReturnOct -0.6820 0.3044 -2.241 0.025039 *
## ReturnNov -1.7175 0.4133 -4.156 3.24e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1240.7 on 895 degrees of freedom
## Residual deviance: 1131.5 on 884 degrees of freedom
## AIC: 1155.5
##
## Number of Fisher Scoring iterations: 4
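Scanning three regression summaries for sign changes is error-prone; the comparison can be tabulated instead. This is a sketch using the model objects defined above (the column labels are just names chosen here):
# Coefficient signs for the three cluster-specific models, side by side
coefSigns <- sign(cbind(Model1 = coef(Stocksmodel1),
                        Model2 = coef(Stocksmodel2),
                        Model3 = coef(Stocksmodel3)))
# Keep the variables whose sign is positive in at least one model and negative in another
coefSigns[apply(coefSigns, 1, function(s) any(s > 0) && any(s < 0)), , drop = FALSE]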
Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.
What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?
# Test-set predictions for cluster 1 (using the model trained on stocksTrain1)
predictTest1 <- predict(Stocksmodel1, type = "response", newdata = stocksTest1)
table(stocksTest1$PositiveDec, predictTest1 > 0.5)
##
## FALSE TRUE
## 0 43 350
## 1 26 639
(43+639)/nrow(stocksTest1)
## [1] 0.6446125
# Test-set predictions for cluster 2 (using the model trained on stocksTrain2)
predictTest2 <- predict(Stocksmodel2, type = "response", newdata = stocksTest2)
table(stocksTest2$PositiveDec, predictTest2 > 0.5)
##
## FALSE TRUE
## 0 190 806
## 1 159 874
(190+874)/nrow(stocksTest2)
## [1] 0.5243963
# Test-set predictions for cluster 3 (using the model trained on stocksTrain3)
predictTest3 <- predict(Stocksmodel3, type = "response", newdata = stocksTest3)
table(stocksTest3$PositiveDec, predictTest3 > 0.5)
##
## FALSE TRUE
## 0 120 68
## 1 98 101
(120+101)/nrow(stocksTest3)
## [1] 0.5710594
To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:
AllPredictions = c(PredictTest1, PredictTest2, PredictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?
AllPredictions <- c(predictTest1, predictTest2, predictTest3)
AllOutcomes <- c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)
# comparing the results
table(AllOutcomes, AllPredictions > 0.5)
##
## AllOutcomes FALSE TRUE
## 0 353 1224
## 1 283 1614
(accuracy_finale <- (353+1614)/(353+1614+283+1224))
## [1] 0.5662061