Source file ⇒ The_Analytics_Edge_edX_MIT15.071x_June2015_5.rmd

Unit 6: Clustering

Recommendations Worth a Million: An Introduction to Clustering (Part 1)

An Introduction to Clustering

VIDEO 1: INTRODUCTION TO NETFLIX

Netflix

  • Online DVD rental and streaming video service
  • More than 40 million subscribers worldwide
  • $3.6 billion in revenue
  • Key aspect is being able to offer customers accurate movie recommendations based on a customer’s own preferences and viewing history
[figure: netflix]

The Netflix Prize

  • From 2006 - 2009 Netflix ran a contest asking the public to submit algorithms to predict user ratings for movies
  • Training data set of ~100,000,000 ratings and test data set of ~3,000,000 ratings were provided
  • Offered a grand prize of $1,000,000 USD to the team who could beat Netflix’s own algorithm, Cinematch, by more than 10%, measured in RMSE

Contest Rules

  • If the grand prize was not yet reached, progress prizes of $50,000 USD per year would be awarded for the best result so far, as long as it had >1% improvement over the previous year.
  • Teams must submit code and a description of the algorithm to be awarded any prizes
  • If any team met the 10% improvement goal, last call would be issued and 30 days would remain for all teams to submit their best algorithm.

Last Call Announced

  • On June 26, 2009, the team BellKor’s Pragmatic Chaos submitted a 10.05% improvement over Cinematch
[figure: lastcall]

QUICK QUESTION

About how many years did it take for a team to submit a 10% improvement over Cinematch?

Ans:2.5

EXPLANATION:The contest started in October 2006 and ended in July 2009, so it took about 2.5 years for a team to submit a solution with a 10% improvement.

VIDEO 2: RECOMMENDATION SYSTEMS

Predicting the Best User Ratings

  • Netflix was willing to pay over $1M for the best user rating algorithm, which shows how critical the recommendation system was to their business
  • What data could be used to predict user ratings?
  • Every movie in Netflix’s database has ratings from all users who have rated that movie
  • We also know facts about the movie itself: actors, director, genre classifications, year released, etc.

Recommender System: Collaborative Filtering

Using Other Users’ Rankings

[figure: userrating]

  • Consider suggesting to Carl that he watch “Men in Black”, since Amy rated it highly and Carl and Amy seem to have similar preferences
  • This technique is called Collaborative Filtering

Recommender System: Content Filtering

Using Movie Information

[figure: contentfiltering]

Strengths and Weaknesses

  • Collaborative Filtering Systems
    • Can accurately suggest complex items without understanding the nature of the items
    • Requires a lot of data about the user to make accurate recommendations
    • Millions of items - need lots of computing power
  • Content Filtering
    • Requires very little data to get started
    • Can be limited in scope

Hybrid Recommendation Systems

  • Netflix uses both collaborative and content filtering
  • For example, consider a collaborative filtering approach where we determine that Amy and Carl have similar preferences.
  • We could then do content filtering, where we would find that “Terminator”, which both Amy and Carl liked, is classified in almost the same set of genres as “Starship Troopers”
  • Recommend “Starship Troopers” to both Amy and Carl, even though neither of them has seen it before

QUICK QUESTION

Let’s consider a recommendation system on Amazon.com, an online retail site.

If Amazon.com constructs a recommendation system for books, and would like to use the same exact algorithm for shoes, what type would it have to be?

Ans: Collaborative Filtering

If Amazon.com would like to suggest books to users based on the previous books they have purchased, what type of recommendation system would it be?

Ans:Content Filtering

EXPLANATION:In the first case, the recommendation system would have to be collaborative filtering, since it can’t use information about the items. In the second case, the recommendation system would be content filtering since other users are not involved.

VIDEO 3: MOVIE DATA AND CLUSTERING

MovieLens Item Dataset

  • Movies in the dataset are categorized as belonging to different genres
[figure: genres]

  • Each movie may belong to many genres
  • Can we systematically find groups of movies with similar sets of genres?

Why Clustering?

  • “Unsupervised” learning
    • Goal is to segment the data into similar groups instead of prediction
  • Can also cluster data into “similar” groups and then build a predictive model for each group (a sketch of this follows below)
    • Be careful not to overfit your model! This works best with large datasets
[figure: cluster]
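A minimal cluster-then-predict sketch of the idea above (illustrative only; the data frame df, the predictors x1 and x2, and the outcome y are made up, not part of this lecture's data):

set.seed(1)
df = data.frame(x1 = rnorm(100), x2 = rnorm(100), y = rbinom(100, 1, 0.5)) # hypothetical data
# Step 1: cluster the observations on the predictors (hierarchical clustering, 2 clusters)
groups = cutree(hclust(dist(df[ , c("x1", "x2")])), k = 2)
# Step 2: build one predictive model per cluster
models = lapply(split(df, groups), function(d) glm(y ~ x1 + x2, data = d, family = binomial))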

Types of Clustering Methods

  • There are many different algorithms for clustering
    • Differ in what makes a cluster and how to find them
  • We will cover
    • Hierarchical
    • K-means in the next lecture

QUICK QUESTION

In the previous video, we discussed how clustering is used to split the data into similar groups. Which of the following tasks do you think are appropriate for clustering? Select all that apply.

Ans:Dividing search results on Google into categories based on the topic & Grouping players into different “types” of basketball players that make it to the NBA

EXPLANATION:The first two options are appropriate tasks for clustering. Clustering probably wouldn’t help us predict the winner of the World Series.

VIDEO 4: COMPUTING DISTANCES

How does Clustering work?

Distance Between Points

  • Need to define distance between two data points
    • Most popular is “Euclidean distance”
    • Distance between points $i$ and $j$ is $d_{ij} = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \cdots + (x_{ik}-x_{jk})^2}$, where $k$ is the number of independent variables

Distance Example

  • The movie “Toy Story” is categorized as Animation, Comedy, and Children’s
    • Toy Story:(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
    [figure: toy]

  • The movie “Batman Forever” is categorized as Action, Adventure, Comedy, and Crime
    • Batman Forever:(0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
    [figure: batman]

Distance Between Points

  • Toy Story: (0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
  • Batman Forever: (0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
[figure: calc1 — the two vectors differ in 5 genres, so the Euclidean distance is sqrt(5) ≈ 2.24]

  • Other popular distance metrics (illustrated in the sketch after this list):
    • Manhattan Distance
      • Sum of absolute values instead of squares
    • Maximum Coordinate Distance
      • Only consider the coordinate in which the two points differ the most
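In R, all three metrics are available through the method argument of the dist() function; a small sketch (not from the lecture), reusing the two movie vectors above:

toyStory      = c(0,0,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0)
batmanForever = c(0,1,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0)
dist(rbind(toyStory, batmanForever), method = "euclidean") # square root of the sum of squared differences: sqrt(5), about 2.24
dist(rbind(toyStory, batmanForever), method = "manhattan") # sum of absolute differences: 5
dist(rbind(toyStory, batmanForever), method = "maximum")   # largest single-coordinate difference: 1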

Distance Between Clusters

  • Minimum Distance
    • Distance between clusters is the distance between points that are the closest
[figure: min]

  • Maximum Distance
    • Distance between clusters is the distance between points that are the farthest
[figure: max]

  • Centroid Distance
    • Distance between centroids of clusters
      • The centroid is the point whose coordinates are the averages of the cluster’s data points in each component
[figure: centroid]
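In R, these three ways of measuring the distance between clusters correspond to the method argument of hclust(); a small sketch (not from the lecture), using four made-up two-dimensional points:

pts = matrix(c(0,0, 0,1, 4,4, 5,4), ncol = 2, byrow = TRUE) # four made-up points
d = dist(pts)
hclust(d, method = "single")   # minimum linkage: distance between the closest points
hclust(d, method = "complete") # maximum linkage: distance between the farthest points
hclust(d, method = "centroid") # centroid linkage (R's documentation recommends squared Euclidean distances for this method)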

Normalize Data

  • Distance is highly influenced by the scale of the variables, so it is customary to normalize first
  • In our movie dataset, all genre variables are on the same scale and so normalization is not necessary
  • However, if we included a variable such as “Box Office Revenue,” we would need to normalize.
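A minimal sketch of such a normalization in R using scale() (the revenue values are made up, not from the MovieLens data):

revenue = c(350, 12, 80, 1500, 45) # hypothetical box office revenue, in millions
revenueNorm = scale(revenue)       # subtract the mean and divide by the standard deviation
revenueNorm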

QUICK QUESTION

The movie “The Godfather” is in the genres action, crime, and drama, and is defined by the vector: (0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)

The movie “Titanic” is in the genres action, drama, and romance, and is defined by the vector: (0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)

What is the distance between “The Godfather” and “Titanic”, using Euclidean distance?

Ans:1.414214

EXPLANATION:The distance between these two movies is the square root of 2. They have a difference of 1 in two genres - crime and romance.
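A quick check of this computation in R (a sketch, not part of the course script):

godfather = c(0,1,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0)
titanic   = c(0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0)
sqrt(sum((godfather - titanic)^2)) # sqrt(2) = 1.414214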

VIDEO 5: HIERARCHICAL CLUSTERING

Hierarchical

  • Start with each data point in its own cluster
  • Combine the two nearest clusters (Euclidean, Centroid)
  • Repeat until all data points are combined into a single cluster
[figures: step1-step8]

Display Cluster Process: Dendrogram

[figure: dendo]

Select Clusters

[figure: select]

Meaningful Clusters?

  • Look at statistics (mean, min, max, . . .) for each cluster and each variable
  • See if the clusters have a feature in common that was not used in the clustering (like an outcome)
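A sketch of this kind of check in R, assuming a vector of cluster assignments such as the clusterGroups created in Video 7 below:

tapply(movies$Action, clusterGroups, mean)    # share of Action movies in each cluster
tapply(movies$Action, clusterGroups, summary) # min / quartiles / max of the variable within each cluster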

QUICK QUESTION

Suppose you are running the Hierarchical clustering algorithm with 212 observations.

How many clusters will there be at the start of the algorithm?

Ans:212

How many clusters will there be at the end of the algorithm?

Ans:1

EXPLANATION:The Hierarchical clustering algorithm always starts with each data point in its own cluster, and ends with all data points in the same cluster. So there will be 212 clusters at the beginning of the algorithm, and 1 cluster at the end of the algorithm.

VIDEO 6: GETTING THE DATA (R script reproduced here)

In this video, we’ll be downloading our dataset from the MovieLens website. Please open the following link in a new window or tab of your browser to access the data: movieLens.txt

#Unit 6 - Introduction to Clustering

#Video 6

#After following the steps in the video, load the data into R
movies = read.table("movieLens.txt", header=FALSE, sep="|",quote="\"")

str(movies)
## 'data.frame':    1682 obs. of  24 variables:
##  $ V1 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V2 : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ V3 : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ V4 : logi  NA NA NA NA NA NA ...
##  $ V5 : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
##  $ V6 : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V7 : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ V8 : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ V9 : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ V10: int  1 0 0 0 0 0 0 1 0 0 ...
##  $ V11: int  1 0 0 1 0 0 0 1 0 0 ...
##  $ V12: int  0 0 0 0 1 0 0 0 0 0 ...
##  $ V13: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V14: int  0 0 0 1 1 1 1 1 1 1 ...
##  $ V15: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V16: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V17: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V18: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V19: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V20: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ V21: int  0 0 0 0 0 0 1 0 0 0 ...
##  $ V22: int  0 1 1 0 1 0 0 0 0 0 ...
##  $ V23: int  0 0 0 0 0 0 0 0 0 1 ...
##  $ V24: int  0 0 0 0 0 0 0 0 0 0 ...
#Add column names
colnames(movies) = c("ID", "Title", "ReleaseDate", "VideoReleaseDate", "IMDB", "Unknown", "Action", "Adventure", "Animation", "Childrens", "Comedy", "Crime", "Documentary", "Drama", "Fantasy", "FilmNoir", "Horror", "Musical", "Mystery", "Romance", "SciFi", "Thriller", "War", "Western")

str(movies)
## 'data.frame':    1682 obs. of  24 variables:
##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Title           : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ ReleaseDate     : Factor w/ 241 levels "","01-Aug-1997",..: 71 71 71 71 71 71 71 71 71 182 ...
##  $ VideoReleaseDate: logi  NA NA NA NA NA NA ...
##  $ IMDB            : Factor w/ 1661 levels "","http://us.imdb.com/M/title-exact/Independence%20(1997)",..: 1431 565 505 543 310 1661 1453 103 357 1183 ...
##  $ Unknown         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action          : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure       : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation       : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens       : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy          : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama           : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi           : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller        : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War             : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western         : int  0 0 0 0 0 0 0 0 0 0 ...
# Remove unnecessary variables
movies$ID = NULL
movies$ReleaseDate = NULL
movies$VideoReleaseDate = NULL
movies$IMDB = NULL

#Remove duplicates
movies = unique(movies)

#Take a look at our data again:
str(movies)
## 'data.frame':    1664 obs. of  20 variables:
##  $ Title      : Factor w/ 1664 levels "'Til There Was You (1997)",..: 1525 618 555 594 344 1318 1545 111 391 1240 ...
##  $ Unknown    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Action     : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ Adventure  : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Childrens  : int  1 0 0 0 0 0 0 1 0 0 ...
##  $ Comedy     : int  1 0 0 1 0 0 0 1 0 0 ...
##  $ Crime      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Documentary: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Drama      : int  0 0 0 1 1 1 1 1 1 1 ...
##  $ Fantasy    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FilmNoir   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Horror     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Musical    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mystery    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SciFi      : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Thriller   : int  0 1 1 0 1 0 0 0 0 0 ...
##  $ War        : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ Western    : int  0 0 0 0 0 0 0 0 0 0 ...
#QUICK QUESTION  

#Using the table function in R, please answer the following questions about the dataset "movies".

#How many movies are classified as comedies?
table(movies$Comedy)
## 
##    0    1 
## 1162  502
#Ans:502 

#How many movies are classified as westerns?
table(movies$Western)
## 
##    0    1 
## 1637   27
#Ans:27 

#How many movies are classified as romance AND drama?
table(movies$Drama,movies$Romance)
##    
##       0   1
##   0 801 147
##   1 619  97
#Ans:97

VIDEO 7: HIERARCHICAL CLUSTERING IN R (R script reproduced here)

#Video 7

#There are 2 steps to hierarchical clustering:
#1. Compute the distances between all pairs of data points
#2. Cluster the points based on those distances

# Calculate distances between genre features:
distances = dist(movies[2:20], method = "euclidean")

# Hierarchical clustering
clusterMovies = hclust(distances, method = "ward.D") # Ward's method considers both the distance between clusters (centroid distance) and the variance within each cluster

# Plot the dendrogram
plot(clusterMovies)

#From the dendrogram, drawing a horizontal line as per the requirements of our problem, we select 10 clusters (i.e. the horizontal line cuts across 10 vertical lines)

#Assign points to clusters (Label each movie in cluster (with 10 clusters))
clusterGroups = cutree(clusterMovies, k = 10)

#Now let's figure out what the clusters are like.

# Let's use the tapply function to compute the percentage of movies in each genre and cluster

#Calculate average "action" genre value for each cluster
tapply(movies$Action, clusterGroups, mean)
##         1         2         3         4         5         6         7 
## 0.1784512 0.7839196 0.1238532 0.0000000 0.0000000 0.1015625 0.0000000 
##         8         9        10 
## 0.0000000 0.0000000 0.0000000
#Calculate average "Romance" genre value for each cluster
tapply(movies$Romance, clusterGroups, mean)
##          1          2          3          4          5          6 
## 0.10437710 0.04522613 0.03669725 0.00000000 0.00000000 1.00000000 
##          7          8          9         10 
## 1.00000000 0.00000000 0.00000000 0.00000000
#We can repeat this for each genre. If you do, you get the results in ClusterMeans.ods


#Find which cluster Men in Black is in.
subset(movies, Title=="Men in Black (1997)")
##                   Title Unknown Action Adventure Animation Childrens
## 257 Men in Black (1997)       0      1         1         0         0
##     Comedy Crime Documentary Drama Fantasy FilmNoir Horror Musical Mystery
## 257      1     0           0     0       0        0      0       0       0
##     Romance SciFi Thriller War Western
## 257       0     1        0   0       0
mibGrp <-clusterGroups[257] # to get which cluster Men in Black (1997) goes into
mibGrp
## 257 
##   2
# Create a new data set with just the movies from cluster 2
cluster2 = subset(movies, clusterGroups==mibGrp)

#Look at the first 10 titles in this cluster(Find other movies in same cluster as "Men in Black"):
cluster2$Title[1:10]
##  [1] GoldenEye (1995)                              
##  [2] Bad Boys (1995)                               
##  [3] Apollo 13 (1995)                              
##  [4] Net, The (1995)                               
##  [5] Natural Born Killers (1994)                   
##  [6] Outbreak (1995)                               
##  [7] Stargate (1994)                               
##  [8] Fugitive, The (1993)                          
##  [9] Jurassic Park (1993)                          
## [10] Robert A. Heinlein's The Puppet Masters (1994)
## 1664 Levels: 'Til There Was You (1997) ... Zeus and Roxanne (1997)
#AN ADVANCED APPROACH TO FINDING CLUSTER CENTROIDS
#Other ways to calculate average genre value:

#In this video, we explain how you can find the cluster centroids by using the function "tapply" for each variable in the dataset. While this approach works and is familiar to us, it can be a little tedious when there are a lot of variables. An alternative approach is to use the colMeans function. With this approach, you only have one command for each cluster instead of one command for each variable. 

#If you run the following command in your R console, you can get all of the column (variable) means for cluster 1:
colMeans(subset(movies[2:20], clusterGroups == 1))
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091
#You can repeat this for each cluster by changing the clusterGroups number. However, if you also have a lot of clusters, this approach is not that much more efficient than just using the tapply function.

#A more advanced approach uses the "split" and "lapply" functions. The following command will split the data into subsets based on the clusters:
spl = split(movies[2:20], clusterGroups)

#Then you can use spl to access the different clusters, because
head(spl[[1]])
##    Unknown Action Adventure Animation Childrens Comedy Crime Documentary
## 1        0      0         0         1         1      1     0           0
## 4        0      1         0         0         0      1     0           0
## 7        0      0         0         0         0      0     0           0
## 8        0      0         0         0         1      1     0           0
## 10       0      0         0         0         0      0     0           0
## 21       0      1         1         0         0      1     0           0
##    Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller
## 1      0       0        0      0       0       0       0     0        0
## 4      1       0        0      0       0       0       0     0        0
## 7      1       0        0      0       0       0       0     1        0
## 8      1       0        0      0       0       0       0     0        0
## 10     1       0        0      0       0       0       0     0        0
## 21     0       0        0      0       1       0       0     0        1
##    War Western
## 1    0       0
## 4    0       0
## 7    0       0
## 8    0       0
## 10   1       0
## 21   0       0
#is the same as
head(subset(movies[2:20], clusterGroups == 1))
##    Unknown Action Adventure Animation Childrens Comedy Crime Documentary
## 1        0      0         0         1         1      1     0           0
## 4        0      1         0         0         0      1     0           0
## 7        0      0         0         0         0      0     0           0
## 8        0      0         0         0         1      1     0           0
## 10       0      0         0         0         0      0     0           0
## 21       0      1         1         0         0      1     0           0
##    Drama Fantasy FilmNoir Horror Musical Mystery Romance SciFi Thriller
## 1      0       0        0      0       0       0       0     0        0
## 4      1       0        0      0       0       0       0     0        0
## 7      1       0        0      0       0       0       0     1        0
## 8      1       0        0      0       0       0       0     0        0
## 10     1       0        0      0       0       0       0     0        0
## 21     0       0        0      0       1       0       0     0        1
##    War Western
## 1    0       0
## 4    0       0
## 7    0       0
## 8    0       0
## 10   1       0
## 21   0       0
colMeans(spl[[1]])
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091
#so colMeans(spl[[1]]) will output the centroid of cluster 1. But an even easier approach uses the lapply function. The following command will output the cluster centroids for all clusters:

lapply(spl, colMeans)
## $`1`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.006734007 0.178451178 0.185185185 0.134680135 0.393939394 0.363636364 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.033670034 0.010101010 0.306397306 0.070707071 0.000000000 0.016835017 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.188552189 0.000000000 0.104377104 0.074074074 0.040404040 0.225589226 
##     Western 
## 0.090909091 
## 
## $`2`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.000000000 0.783919598 0.351758794 0.010050251 0.005025126 0.065326633 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.005025126 0.000000000 0.110552764 0.000000000 0.000000000 0.080402010 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.000000000 0.000000000 0.045226131 0.346733668 0.376884422 0.015075377 
##     Western 
## 0.000000000 
## 
## $`3`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.000000000 0.123853211 0.036697248 0.000000000 0.009174312 0.064220183 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.412844037 0.000000000 0.380733945 0.004587156 0.105504587 0.018348624 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.000000000 0.275229358 0.036697248 0.041284404 0.610091743 0.000000000 
##     Western 
## 0.000000000 
## 
## $`4`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           0 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`5`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           0           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`6`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##   0.0000000   0.1015625   0.0000000   0.0000000   0.0000000   0.1093750 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##   0.0468750   0.0000000   0.6640625   0.0000000   0.0078125   0.0156250 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##   0.0000000   0.0000000   1.0000000   0.0000000   0.1406250   0.0000000 
##     Western 
##   0.0000000 
## 
## $`7`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           0           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           1           0           0           0 
##     Western 
##           0 
## 
## $`8`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0212766 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##   0.0000000   1.0000000   0.0000000   0.0000000   0.0000000   0.0000000 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.0212766 
##     Western 
##   0.0000000 
## 
## $`9`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           1 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0 
## 
## $`10`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   0.1587302 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.0000000   1.0000000 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##   0.0000000   0.0000000   0.0000000   0.0000000   0.1587302   0.0000000 
##     Western 
##   0.0000000
#The lapply function runs the second argument (colMeans) on each element of the first argument (each cluster subset in spl). So instead of using 19 tapply commands, or 10 colMeans commands, we can output our centroids with just two commands: one to define spl, and then the lapply command.

#An even better way: sapply simplifies the result into a single matrix, with one column per cluster
sapply(spl, colMeans)
##                       1           2           3 4 5         6 7         8
## Unknown     0.006734007 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## Action      0.178451178 0.783919598 0.123853211 0 0 0.1015625 0 0.0000000
## Adventure   0.185185185 0.351758794 0.036697248 0 0 0.0000000 0 0.0000000
## Animation   0.134680135 0.010050251 0.000000000 0 0 0.0000000 0 0.0000000
## Childrens   0.393939394 0.005025126 0.009174312 0 0 0.0000000 0 0.0000000
## Comedy      0.363636364 0.065326633 0.064220183 0 1 0.1093750 1 0.0212766
## Crime       0.033670034 0.005025126 0.412844037 0 0 0.0468750 0 0.0000000
## Documentary 0.010101010 0.000000000 0.000000000 0 0 0.0000000 0 1.0000000
## Drama       0.306397306 0.110552764 0.380733945 1 0 0.6640625 0 0.0000000
## Fantasy     0.070707071 0.000000000 0.004587156 0 0 0.0000000 0 0.0000000
## FilmNoir    0.000000000 0.000000000 0.105504587 0 0 0.0078125 0 0.0000000
## Horror      0.016835017 0.080402010 0.018348624 0 0 0.0156250 0 0.0000000
## Musical     0.188552189 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
## Mystery     0.000000000 0.000000000 0.275229358 0 0 0.0000000 0 0.0000000
## Romance     0.104377104 0.045226131 0.036697248 0 0 1.0000000 1 0.0000000
## SciFi       0.074074074 0.346733668 0.041284404 0 0 0.0000000 0 0.0000000
## Thriller    0.040404040 0.376884422 0.610091743 0 0 0.1406250 0 0.0000000
## War         0.225589226 0.015075377 0.000000000 0 0 0.0000000 0 0.0212766
## Western     0.090909091 0.000000000 0.000000000 0 0 0.0000000 0 0.0000000
##             9        10
## Unknown     0 0.0000000
## Action      0 0.0000000
## Adventure   0 0.0000000
## Animation   0 0.0000000
## Childrens   0 0.0000000
## Comedy      1 0.1587302
## Crime       0 0.0000000
## Documentary 0 0.0000000
## Drama       1 0.0000000
## Fantasy     0 0.0000000
## FilmNoir    0 0.0000000
## Horror      0 1.0000000
## Musical     0 0.0000000
## Mystery     0 0.0000000
## Romance     0 0.0000000
## SciFi       0 0.0000000
## Thriller    0 0.1587302
## War         0 0.0000000
## Western     0 0.0000000
#QUICK QUESTION  

#Run the cutree function again to create the cluster groups, but this time pick k = 2 clusters. It turns out that the algorithm groups all of the movies that only belong to one specific genre in one cluster (cluster 2), and puts all of the other movies in the other cluster (cluster 1). What is the genre that all of the movies in cluster 2 belong to?

clusterGroups <- cutree(clusterMovies, k = 2)
spl <- split(movies[2:20], clusterGroups)
sapply(spl, colMeans)
##                       1 2
## Unknown     0.001545595 0
## Action      0.192426584 0
## Adventure   0.102782071 0
## Animation   0.032457496 0
## Childrens   0.092735703 0
## Comedy      0.387944359 0
## Crime       0.082689335 0
## Documentary 0.038639876 0
## Drama       0.267387944 1
## Fantasy     0.017001546 0
## FilmNoir    0.018547141 0
## Horror      0.069551777 0
## Musical     0.043276662 0
## Mystery     0.046367852 0
## Romance     0.188562597 0
## SciFi       0.077279753 0
## Thriller    0.191653787 0
## War         0.054868624 0
## Western     0.020865533 0
#or

lapply(spl, colMeans)
## $`1`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
## 0.001545595 0.192426584 0.102782071 0.032457496 0.092735703 0.387944359 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
## 0.082689335 0.038639876 0.267387944 0.017001546 0.018547141 0.069551777 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
## 0.043276662 0.046367852 0.188562597 0.077279753 0.191653787 0.054868624 
##     Western 
## 0.020865533 
## 
## $`2`
##     Unknown      Action   Adventure   Animation   Childrens      Comedy 
##           0           0           0           0           0           0 
##       Crime Documentary       Drama     Fantasy    FilmNoir      Horror 
##           0           0           1           0           0           0 
##     Musical     Mystery     Romance       SciFi    Thriller         War 
##           0           0           0           0           0           0 
##     Western 
##           0
#Ans:Drama

VIDEO 8: THE ANALYTICS EDGE OF RECOMMENDATION SYSTEMS

Beyond Movies: Mass Personalization

[figure: beyond]

Cornerstone of these Top Businesses

[figure: corner]

Recommendation Method Used

[figure: methodused]

Winners are Declared!

[figure: winners]

The Edge of Recommendation Systems

  • In today’s digital age, businesses often have hundreds of thousands of items to offer their customers
  • Excellent recommendation systems can make or break these businesses
  • Clustering algorithms, which are tailored to find similar customers or similar items, form the backbone of many of these recommendation systems

Predictive Diagnosis: Discovering Patterns for Disease Detection (Part 2)

VIDEO 1: HEART ATTACKS

Heart Attacks

  • Heart attack is a common complication of coronary heart disease resulting from the interruption of blood supply to part of the heart
  • 2012 report from the American Heart Association estimates about 715,000 Americans have a heart attack every year
    • Every 20 seconds, a person has a heart attack in the US
    • Nearly half occur without prior warning signs
    • 250,000 Americans die of Sudden Cardiac Death yearly

Analytics Helps Monitoring

  • Understanding the clinical characteristics of patients in whom heart attack was missed is key
  • Need for an increased understanding of the patterns in a patient’s diagnostic history that link to a heart attack
  • Predicting whether a patient is at risk of a heart attack helps monitoring and calls for action
  • Analytics helps understand patterns of heart attacks and provides good predictions

QUICK QUESTION

In this class, we’ve learned many different methods for predicting outcomes. Which of the following methods is designed to be used to predict an outcome like whether or not someone will experience a heart attack? Select all that apply.

Ans:Logistic Regression,CART, Random Forest

EXPLANATION:Logistic Regression, CART, and Random Forest are all designed to be used to predict whether or not someone has a heart attack, since this is a classification problem. Linear regression would be appropriate for a problem with a continuous outcome, such as the amount of time until someone has a heart attack. In this lecture, we’ll use random forest, but the other methods could be used too.

VIDEO 2: THE DATA

Claims Data

  • Claims data offers an expansive view of a patient’s health history
    • Demographics, medical history and medications
    • Offers insights regarding a patient’s risk
    • May reveal indicative signals and patterns
  • We will use health insurance claims filed for about 7,000 members from January 2000 - November 2007

  • Concentrated on members with the following attributes
    • At least 5 claims with coronary artery disease diagnosis
    • At least 5 claims with hypertension diagnostic codes
    • At least 100 total medical claims
    • At least 5 pharmacy claims
    • Data from at least 5 years
  • Yields patients with a high risk of heart attack and a reasonably rich history and continuous coverage

Data Aggregation

  • The resulting dataset includes about 20 million health insurance entries including individual medical and pharmaceutical records
  • Diagnosis, procedure and drug codes in the dataset comprise tens of thousands of attributes
  • Codes were aggregated into groups
    • 218 diagnosis groups, 180 procedure groups, 538 drug groups
    • 46 diagnosis groups were considered by clinicians as possible risk factors for heart attacks

Diagnostic History

  • We then compress medical records to obtain a chronological representation of a patient’s diagnostic profile
    • Cost and number of medical claims and hospital visits by diagnosis
  • Observations split into 21 periods, each 90 days in length
    • Examined 9 months of diagnostic history leading up to heart attack/no heart attack event
    • Align data to make observations date-independent while preserving the order of events
      • 3 months ~ 0-3 months before heart attack
      • 6 months ~ 3-6 months before heart attack
      • 9 months ~ 6-9 months before heart attack

Target Variable

  • Target prediction is the first occurrence of a heart attack
    • Diagnosis on medical claim
    • Visit to emergency room followed by hospitalization
    • Binary Yes/No
[figure: target]

Dataset Compilation

[figure: data1]

Cost Bucket Partitioning

  • Cost is a good summary of a person’s overall health
  • Divide population into similar smaller groups
    • Low risk, average risk, high risk bucket
  • Build models for each group

QUICK QUESTION

In the previous video, we discussed how we split the data into three groups, or buckets, according to cost.

Which bucket has the most data, in terms of number of patients?

Ans:Cost Bucket 1

Which bucket probably has the densest data, in terms of number of claims per person?

Ans:Cost Bucket 3

EXPLANATION:Cost Bucket 1 contains the most patients (see slide 7 of the previous video), and Cost Bucket 3 probably has the densest data, since these are the patients with the highest cost in terms of claims.

VIDEO 3: PREDICTING HEART ATTACKS USING CLUSTERING

Predicting Heart Attacks (Random Forest)

  • Predict whether a patient will have a heart attack for each of the cost buckets using the random forest algorithm
[figure: RF1]

Incorporating Clustering

  • Patients in each bucket may have different characteristics
[figure: cluster11]

Clustering Cost Buckets

  • Two clustering algorithms were used for the analysis as an alternative to hierarchical clustering
    • Spectral Clustering
    • k-means clustering
[figure: kmeans]

k-Means Clustering

[figures: algo1-algo7]

Practical Considerations

  • The number of clusters k can be selected from previous knowledge or by experimenting
  • Can strategically select the initial partition of points into clusters if you have some knowledge of the data
  • Can run the algorithm several times with different random starting points (see the sketch after this list)
  • In recitation, we will learn how to run the k-means clustering algorithm in R
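A small sketch (not from the lecture) of that last point: in R, kmeans() can rerun the algorithm from several random starting assignments via its nstart argument and keep the best result:

set.seed(1)
x = matrix(rnorm(200), ncol = 2)    # made-up two-dimensional data
kmeans(x, centers = 3, nstart = 20) # k = 3 clusters, best of 20 random starts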

Random Forest with Clustering

[figure: RF2]

QUICK QUESTION

K-means clustering differs from Hierarchical clustering in a couple important ways. Which of the following statements is true?

Ans:In k-means clustering, you have to pick the number of clusters you want before you run the algorithm

EXPLANATION:In k-means clustering, you have to pick the number of clusters before you run the algorithm, but the computational effort needed is much less than that for hierarchical clustering (we’ll see this in more detail during the recitation).

VIDEO 4: UNDERSTANDING CLUSTER PATTERNS

Understanding Cluster Patterns

  • Clusters are interpretable and reveal unique patterns of diagnostic history among the population
[figure: patterns]

QUICK QUESTION

As we saw in the previous video, the clusters can be used to find interesting patterns of health in addition to being used to improve predictive models. By changing the number of clusters, you can find more general or more specific patterns.

If you wanted to find more unusual patterns shared by a small number of people, would you increase or decrease the number of clusters?

Ans:Increase

EXPLANATION:If you wanted to find more unusual patterns, you would increase the number of clusters since the clusters would become smaller and more patterns would probably emerge.

VIDEO 5: THE ANALYTICS EDGE

Impact of Clustering

  • Clustering members within each cost bucket yielded better predictions of heart attacks within clusters
  • Grouping patients into clusters reveals temporal diagnostic patterns within the 9 months preceding a heart attack
  • These patterns can be incorporated in the diagnostic rules for heart attacks
  • Great research interest in using analytics for early heart failure detection through pattern recognition

Seeing the Big Picture: Segmenting Images to Create Data (Recitation, Part 3)

VIDEO 1: IMAGE SEGMENTATION

Segmenting Images to Create Data

Image Segmentation

  • Divide digital images into salient regions/clusters corresponding to individual surfaces, objects, or natural parts of objects
  • Clusters should be uniform and homogeneous with respect to certain characteristics (color, intensity, texture)
  • Goal: Useful and analyzable image representation

Wide Applications

  • Medical Imaging
    • Locate tissue classes, organs, pathologies and tumors
    • Measure tissue/tumor volume
  • Object Detection
    • Detect facial features in photos
    • Detect pedestrians in surveillance video footage
  • Recognition tasks
    • Fingerprint/Iris recognition

Various Methods

  • Clustering methods
    • Partition image to clusters based on differences in pixel colors, intensity or texture
  • Edge detection
    • Based on the detection of discontinuity, such as an abrupt change in the gray level in gray-scale images
  • Region-growing methods
    • Divides image into regions, then sequentially merges sufficiently similar regions

In this Recitation.

  • Review hierarchical and k-means clustering in R
  • Restrict ourselves to gray-scale images
    • Simple example of a flower image (flower.csv)
    • Medical imaging application with examples of transverse MRI images of the brain (healthy.csv and tumor.csv)
  • Compare the use, pros and cons of all analytics methods we have seen so far

VIDEO 2: CLUSTERING PIXELS (R script reproduced here)

Grayscale Images

  • Image is represented as a matrix of pixel intensity values ranging from 0 (black) to 1 (white)
  • For 8 bits/pixel (bpp), 256 color levels
[figure: gray]

Grayscale Image Segmentation

  • Cluster pixels according to their intensity values
[figure: gray2]

#Unit 6 - Recitation

#Video 2

flower = read.csv("flower.csv", header=FALSE) # The data has no header row, and we don't want R to treat the first row as one, so we use header=FALSE
str(flower)
## 'data.frame':    50 obs. of  50 variables:
##  $ V1 : num  0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  $ V2 : num  0.112 0.108 0.112 0.116 0.108 ...
##  $ V3 : num  0.134 0.116 0.121 0.116 0.112 ...
##  $ V4 : num  0.138 0.138 0.121 0.121 0.112 ...
##  $ V5 : num  0.138 0.134 0.125 0.116 0.112 ...
##  $ V6 : num  0.138 0.129 0.121 0.108 0.112 ...
##  $ V7 : num  0.129 0.116 0.103 0.108 0.112 ...
##  $ V8 : num  0.116 0.103 0.103 0.103 0.116 ...
##  $ V9 : num  0.1121 0.0991 0.1078 0.1121 0.1164 ...
##  $ V10: num  0.121 0.108 0.112 0.116 0.125 ...
##  $ V11: num  0.134 0.125 0.129 0.134 0.129 ...
##  $ V12: num  0.147 0.134 0.138 0.129 0.138 ...
##  $ V13: num  0.000862 0.146552 0.142241 0.142241 0.133621 ...
##  $ V14: num  0.000862 0.000862 0.142241 0.133621 0.12931 ...
##  $ V15: num  0.142 0.142 0.134 0.121 0.116 ...
##  $ V16: num  0.125 0.125 0.116 0.108 0.108 ...
##  $ V17: num  0.1121 0.1164 0.1078 0.0991 0.0991 ...
##  $ V18: num  0.108 0.112 0.108 0.108 0.108 ...
##  $ V19: num  0.121 0.129 0.125 0.116 0.116 ...
##  $ V20: num  0.138 0.129 0.125 0.116 0.116 ...
##  $ V21: num  0.138 0.134 0.121 0.125 0.125 ...
##  $ V22: num  0.134 0.129 0.125 0.121 0.103 ...
##  $ V23: num  0.125 0.1207 0.1164 0.1164 0.0819 ...
##  $ V24: num  0.1034 0.1034 0.0991 0.0991 0.1034 ...
##  $ V25: num  0.0948 0.0905 0.0905 0.1034 0.125 ...
##  $ V26: num  0.0862 0.0862 0.0991 0.125 0.1422 ...
##  $ V27: num  0.086207 0.086207 0.103448 0.12931 0.000862 ...
##  $ V28: num  0.0991 0.1078 0.1164 0.1293 0.1466 ...
##  $ V29: num  0.116 0.134 0.134 0.121 0.142 ...
##  $ V30: num  0.121 0.138 0.142 0.129 0.138 ...
##  $ V31: num  0.121 0.134 0.142 0.134 0.129 ...
##  $ V32: num  0.116 0.134 0.129 0.116 0.112 ...
##  $ V33: num  0.108 0.112 0.116 0.108 0.108 ...
##  $ V34: num  0.1078 0.1078 0.1034 0.0991 0.1034 ...
##  $ V35: num  0.1078 0.1034 0.0991 0.0991 0.0991 ...
##  $ V36: num  0.1078 0.1034 0.1034 0.0905 0.0862 ...
##  $ V37: num  0.1078 0.1078 0.1034 0.0819 0.0733 ...
##  $ V38: num  0.0948 0.0991 0.0776 0.069 0.0733 ...
##  $ V39: num  0.0733 0.056 0.0474 0.0474 0.056 ...
##  $ V40: num  0.0474 0.0388 0.0431 0.0474 0.0603 ...
##  $ V41: num  0.0345 0.0345 0.0388 0.0474 0.0647 ...
##  $ V42: num  0.0259 0.0259 0.0345 0.0431 0.056 ...
##  $ V43: num  0.0259 0.0259 0.0388 0.0517 0.0603 ...
##  $ V44: num  0.0302 0.0302 0.0345 0.0517 0.0603 ...
##  $ V45: num  0.0259 0.0259 0.0259 0.0388 0.0474 ...
##  $ V46: num  0.0259 0.0172 0.0172 0.0259 0.0345 ...
##  $ V47: num  0.01724 0.01724 0.00862 0.02155 0.02586 ...
##  $ V48: num  0.0216 0.0129 0.0129 0.0172 0.0302 ...
##  $ V49: num  0.0216 0.0216 0.0216 0.0345 0.0603 ...
##  $ V50: num  0.0302 0.0345 0.0388 0.0603 0.0776 ...
# Change the data type to matrix
#The data is a 50 x 50 data frame of pixel intensities; to apply clustering we first convert it to a matrix and then to a vector, using the as.matrix() and as.vector() functions

flowerMatrix = as.matrix(flower)
str(flowerMatrix)
##  num [1:50, 1:50] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:50] "V1" "V2" "V3" "V4" ...
# Turn matrix into a vector
flowerVector = as.vector(flowerMatrix)
str(flowerVector)
##  num [1:2500] 0.0991 0.0991 0.1034 0.1034 0.1034 ...
#Had we converted the data frame directly into a vector, as below, it would still remain a data frame:
flowerVector2 = as.vector(flower)
str(flowerVector2)
## 'data.frame':    50 obs. of  50 variables:
##  $ V1 : num  0.0991 0.0991 0.1034 0.1034 0.1034 ...
##  $ V2 : num  0.112 0.108 0.112 0.116 0.108 ...
##  $ V3 : num  0.134 0.116 0.121 0.116 0.112 ...
##  $ V4 : num  0.138 0.138 0.121 0.121 0.112 ...
##  $ V5 : num  0.138 0.134 0.125 0.116 0.112 ...
##  $ V6 : num  0.138 0.129 0.121 0.108 0.112 ...
##  $ V7 : num  0.129 0.116 0.103 0.108 0.112 ...
##  $ V8 : num  0.116 0.103 0.103 0.103 0.116 ...
##  $ V9 : num  0.1121 0.0991 0.1078 0.1121 0.1164 ...
##  $ V10: num  0.121 0.108 0.112 0.116 0.125 ...
##  $ V11: num  0.134 0.125 0.129 0.134 0.129 ...
##  $ V12: num  0.147 0.134 0.138 0.129 0.138 ...
##  $ V13: num  0.000862 0.146552 0.142241 0.142241 0.133621 ...
##  $ V14: num  0.000862 0.000862 0.142241 0.133621 0.12931 ...
##  $ V15: num  0.142 0.142 0.134 0.121 0.116 ...
##  $ V16: num  0.125 0.125 0.116 0.108 0.108 ...
##  $ V17: num  0.1121 0.1164 0.1078 0.0991 0.0991 ...
##  $ V18: num  0.108 0.112 0.108 0.108 0.108 ...
##  $ V19: num  0.121 0.129 0.125 0.116 0.116 ...
##  $ V20: num  0.138 0.129 0.125 0.116 0.116 ...
##  $ V21: num  0.138 0.134 0.121 0.125 0.125 ...
##  $ V22: num  0.134 0.129 0.125 0.121 0.103 ...
##  $ V23: num  0.125 0.1207 0.1164 0.1164 0.0819 ...
##  $ V24: num  0.1034 0.1034 0.0991 0.0991 0.1034 ...
##  $ V25: num  0.0948 0.0905 0.0905 0.1034 0.125 ...
##  $ V26: num  0.0862 0.0862 0.0991 0.125 0.1422 ...
##  $ V27: num  0.086207 0.086207 0.103448 0.12931 0.000862 ...
##  $ V28: num  0.0991 0.1078 0.1164 0.1293 0.1466 ...
##  $ V29: num  0.116 0.134 0.134 0.121 0.142 ...
##  $ V30: num  0.121 0.138 0.142 0.129 0.138 ...
##  $ V31: num  0.121 0.134 0.142 0.134 0.129 ...
##  $ V32: num  0.116 0.134 0.129 0.116 0.112 ...
##  $ V33: num  0.108 0.112 0.116 0.108 0.108 ...
##  $ V34: num  0.1078 0.1078 0.1034 0.0991 0.1034 ...
##  $ V35: num  0.1078 0.1034 0.0991 0.0991 0.0991 ...
##  $ V36: num  0.1078 0.1034 0.1034 0.0905 0.0862 ...
##  $ V37: num  0.1078 0.1078 0.1034 0.0819 0.0733 ...
##  $ V38: num  0.0948 0.0991 0.0776 0.069 0.0733 ...
##  $ V39: num  0.0733 0.056 0.0474 0.0474 0.056 ...
##  $ V40: num  0.0474 0.0388 0.0431 0.0474 0.0603 ...
##  $ V41: num  0.0345 0.0345 0.0388 0.0474 0.0647 ...
##  $ V42: num  0.0259 0.0259 0.0345 0.0431 0.056 ...
##  $ V43: num  0.0259 0.0259 0.0388 0.0517 0.0603 ...
##  $ V44: num  0.0302 0.0302 0.0345 0.0517 0.0603 ...
##  $ V45: num  0.0259 0.0259 0.0259 0.0388 0.0474 ...
##  $ V46: num  0.0259 0.0172 0.0172 0.0259 0.0345 ...
##  $ V47: num  0.01724 0.01724 0.00862 0.02155 0.02586 ...
##  $ V48: num  0.0216 0.0129 0.0129 0.0172 0.0302 ...
##  $ V49: num  0.0216 0.0216 0.0216 0.0345 0.0603 ...
##  $ V50: num  0.0302 0.0345 0.0388 0.0603 0.0776 ...
#Hence it's crucial that we first convert the data frame into a matrix and then into a vector

#Let's proceed with the familiar hierarchical clustering by computing the pairwise distances between the intensity values in the flower vector:

#Compute "euclidean" distances
distance = dist(flowerVector, method = "euclidean")

VIDEO 3: HIERARCHICAL CLUSTERING (R script reproduced here)

Dendrogram Example

[figure: dendo1]

  • The lowest level denotes the individual data points; all higher nodes denote clusters
  • The height of the vertical lines denotes the distance between clusters: the taller the line, the more dissimilar the clusters are

Choosing the number of clusters by drawing a horizontal line cutting through the vertical distance lines in the dendrogram

[figure: dendo2]

What number of clusters should we choose?

A smaller number of clusters means the clustering will be coarser; a higher number results in over-segmentation, so we should think about the trade-off.

Criteria for choosing the number of clusters: if the horizontal cut line has enough headroom to move up and down without crossing additional vertical lines, that is one reason to choose that cut. In the diagram above, a 3-cluster cut seems appropriate because the horizontal cut line has enough headroom up and down.

#Video 3

# Hierarchical clustering of the intensity values (this builds the dendrogram tree)
clusterIntensity = hclust(distance, method="ward.D2")

#Plot the dendrogram
plot(clusterIntensity)

# Select 3 clusters
#visualising the cuts by plotting a rectangle around the clusters
rect.hclust(clusterIntensity, k = 3, border = "red")

#Let's split the data into these 3 clusters
flowerClusters = cutree(clusterIntensity, k = 3)
flowerClusters
##    [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##   [35] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##   [69] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [103] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [137] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [171] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [205] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [239] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [273] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [307] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [341] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3
##  [375] 2 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [409] 1 1 1 1 1 1 1 1 1 2 2 1 1 1 3 3 3 3 3 2 1 2 3 2 1 1 1 1 1 1 1 1 1 1
##  [443] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 1 1 1 3 3 3 3
##  [477] 3 2 2 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [511] 1 1 1 1 1 1 2 3 3 2 1 1 3 3 3 3 3 2 3 3 3 1 2 3 3 1 1 1 1 1 1 1 1 1
##  [545] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 2 3 3 3 1 1 3 3 3 3 3 2
##  [579] 3 3 2 2 3 3 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [613] 2 3 3 2 1 3 3 3 2 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1
##  [647] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 3 3 3 1 2 3 3 3 3 3 3 3
##  [681] 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
##  [715] 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
##  [749] 1 1 1 1 1 1 1 1 1 1 1 2 3 2 1 1 2 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3
##  [783] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 2 1 1
##  [817] 3 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 2 2 3 3 1 1 1 1 1 1 1 1
##  [851] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 2 1 2 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3
##  [885] 2 1 2 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 2 2
##  [919] 3 3 3 3 3 2 2 3 3 3 3 3 3 3 3 2 1 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1
##  [953] 1 1 1 1 1 1 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 3 3 3 3 2 2 3 3
##  [987] 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3
## [1021] 3 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1055] 1 1 2 2 2 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 2
## [1089] 2 2 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2
## [1123] 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1157] 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 2
## [1191] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 3 3 3 3 3 2 2 2 2 2 2
## [1225] 2 2 2 2 2 2 2 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1259] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
## [1293] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2
## [1327] 2 2 2 2 3 3 3 3 3 3 3 3 3 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3
## [1361] 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1
## [1395] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2
## [1429] 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3
## [1463] 3 3 3 3 3 3 3 3 3 3 3 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 2 1 1 1 1 1
## [1497] 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 2 2 3 3 3 2 3
## [1531] 3 3 3 3 3 2 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 1
## [1565] 2 3 3 3 3 3 3 3 3 2 2 3 3 3 3 2 3 3 3 3 3 2 1 2 2 2 2 2 1 1 1 1 1 1
## [1599] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 2 3 3
## [1633] 3 3 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 3 3 3
## [1667] 2 3 3 3 3 3 2 2 3 3 3 3 3 2 1 3 3 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1
## [1701] 1 1 1 1 1 1 1 1 1 1 1 2 3 3 2 2 3 3 3 3 3 3 1 3 3 3 3 3 3 2 1 2 3 3
## [1735] 3 3 3 3 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 3 3 3
## [1769] 3 3 3 1 1 3 3 3 3 3 3 2 1 1 2 3 3 3 3 2 3 2 1 1 1 1 1 1 1 1 1 1 1 1
## [1803] 1 1 1 1 1 1 1 1 1 2 2 1 2 3 3 3 3 3 2 1 2 3 3 2 3 3 3 2 1 1 1 1 3 3
## [1837] 3 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3
## [1871] 1 1 2 3 3 2 3 3 3 3 1 1 1 1 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [1905] 1 1 1 1 1 1 1 1 1 1 2 3 3 3 3 3 1 1 2 3 3 1 3 3 3 3 1 1 1 1 1 1 1 1
## [1939] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 3 2 1 1
## [1973] 2 3 3 1 2 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2007] 1 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 3 3 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1
## [2041] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
## [2075] 3 1 1 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2109] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2143] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2177] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2211] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2245] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2279] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2313] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2347] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2381] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2415] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2449] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [2483] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#This assigns each intensity value (data point) a label from 1 to 3, denoting which of the 3 clusters it was assigned to

# Find mean intensity values of each of these 3 clusters
tapply(flowerVector, flowerClusters, mean)
##          1          2          3 
## 0.08574315 0.50826255 0.93147713
# Plot the image and the clusters
#Before plotting the image we want to convert the flowerClusters vector into a matrix of form n x m using dim()
dim(flowerClusters) = c(50,50)
#plotting the image
image(flowerClusters, axes = FALSE)

#Let's see how the original flower looks so that we can compare it with our clustered image
# Original image with gray colour scale
image(flowerMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))

#This produces a low-resolution image. Next we try a high-resolution brain image

VIDEO 4: MRI IMAGE (R script reproduced here)

k-Means Clustering

  • k-means clustering aims to partition the data into k clusters in which each data point belongs to the cluster with the nearest mean
[figure: kmeans1]

k-Means Clustering steps:

[figures: kmeans2-kmeans8]

# Video 4

# Let's try this with an MRI image of the brain
#healthy.csv dataframe consists of matrix of intensity values 
healthy = read.csv("healthy.csv", header=FALSE)

#Creating the healthy matrix
healthyMatrix = as.matrix(healthy)
str(healthyMatrix)
##  num [1:566, 1:646] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:646] "V1" "V2" "V3" "V4" ...
#we can see that the image is 566 by 646, considerably larger than the flower image

# Plot the image in gray scale
image(healthyMatrix,axes=FALSE,col=grey(seq(0,1,length=256)))

# Hierarchical clustering: first compute the pairwise distances
healthyVector = as.vector(healthyMatrix)
#distance = dist(healthyVector, method = "euclidean")
#this throws an error: the distance object would require more than 480GB of memory

# We have an error - why?
#let's see the size of the vector
str(healthyVector)
##  num [1:365636] 0.00427 0.00855 0.01282 0.01282 0.01282 ...
#Let's store the number of data points in n
n=365636

#therefore the total number of pairwise distances between the n data points is:
n*(n-1)/2
## [1] 66844659430
#wow...66 billion+ pairwise distances
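#A rough back-of-the-envelope check (assuming each pairwise distance is stored as an 8-byte double):
n*(n-1)/2 * 8 / 2^30
#roughly 498 gigabytes of memory would be needed, consistent with the 480GB+ error mentioned above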

#so the bad news is that we cannot use hierarchical clustering because of the high resolution of the image. Hence we turn to another clustering method, k-means

VIDEO 5: K-MEANS CLUSTERING (R script reproduced here)

#Video 5

# Specify number of clusters (Depends upon what you want to extract from the image)
k = 5

# Run k-means
set.seed(1) #since k-means starts by randomly assigning data points to clusters, we set the seed to obtain reproducible results
KMC = kmeans(healthyVector, centers = k, iter.max = 1000)
str(KMC)
## List of 9
##  $ cluster     : int [1:365636] 3 3 3 3 3 3 3 3 3 3 ...
##  $ centers     : num [1:5, 1] 0.4818 0.1062 0.0196 0.3094 0.1842
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "1" "2" "3" "4" ...
##   .. ..$ : NULL
##  $ totss       : num 5775
##  $ withinss    : num [1:5] 96.6 47.2 39.2 57.5 62.3
##  $ tot.withinss: num 303
##  $ betweenss   : num 5472
##  $ size        : int [1:5] 20556 101085 133162 31555 79278
##  $ iter        : int 2
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
# Extract clusters variable for plotting it
healthyClusters = KMC$cluster

#the mean intensity value of each cluster is already available in the centers component of the KMC object (no need to compute it with tapply as we did for hierarchical clustering)
KMC$centers[2] #mean intensity value of 2nd cluster
## [1] 0.1061945
# Plot the image with the clusters
#first convert into a matrix using dim()
dim(healthyClusters) = c(nrow(healthyMatrix), ncol(healthyMatrix))
image(healthyClusters, axes = FALSE, col=rainbow(k))

#SCREE PLOTS

#While dendrograms can be used to select the final number of clusters for Hierarchical Clustering, we can't use dendrograms for k-means clustering. However, there are several other ways that the number of clusters can be selected. One common way to select the number of clusters is by using a scree plot, which works for any clustering algorithm.

#A standard scree plot has the number of clusters on the x-axis, and the sum of the within-cluster sum of squares on the y-axis. The within-cluster sum of squares for a cluster is the sum, across all points in the cluster, of the squared distance between each point and the centroid of the cluster.  We ideally want very small within-cluster sum of squares, since this means that the points are all very close to their centroid. 

#To create the scree plot, the clustering algorithm is run with a range of values for the number of clusters. For each number of clusters, the within-cluster sum of squares can easily be extracted when using k-means clustering. For example, suppose that we want to cluster the MRI image from this video into two clusters. We can first run the k-means algorithm with two clusters:

KMC2 = kmeans(healthyVector, centers = 2, iter.max = 1000)

#Then, the within-cluster sum of squares is just an element of KMC2:
KMC2$withinss
## [1]  803.2878 1214.2523
#This gives a vector of the within-cluster sum of squares for each cluster (in this case, there should be two numbers). 
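#Sanity check (a sketch, not part of the lecture): the first entry can be recomputed by hand as the sum of squared distances from the points in cluster 1 to that cluster's centroid
sum((healthyVector[KMC2$cluster == 1] - KMC2$centers[1])^2) #should match KMC2$withinss[1]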

#Now suppose we want to determine the best number of clusters for this dataset. We would first repeat the kmeans function call above with centers = 3, centers = 4, etc. to create KMC3, KMC4, and so on. Then, we could generate the following plot:

KMC3 = kmeans(healthyVector, centers = 3, iter.max = 1000)
KMC4 = kmeans(healthyVector, centers = 4, iter.max = 1000)
KMC5 = kmeans(healthyVector, centers = 5, iter.max = 1000)
KMC6 = kmeans(healthyVector, centers = 6, iter.max = 1000)
KMC7 = kmeans(healthyVector, centers = 7, iter.max = 1000)
KMC8 = kmeans(healthyVector, centers = 8, iter.max = 1000)
KMC9 = kmeans(healthyVector, centers = 9, iter.max = 1000)
KMC10 = kmeans(healthyVector, centers =10, iter.max = 1000)

NumClusters = seq(2,10,1)

SumWithinss = c(sum(KMC2$withinss), sum(KMC3$withinss), sum(KMC4$withinss), sum(KMC5$withinss), sum(KMC6$withinss), sum(KMC7$withinss), sum(KMC8$withinss), sum(KMC9$withinss), sum(KMC10$withinss))

#SCREE plot
plot(NumClusters, SumWithinss, type="b")

#The plot looks like this (the type="b" argument just told the plot command to give us points and lines):

#To determine the best number of clusters using this plot, we want to look for a bend, or elbow, in the plot. This means that we want to find the number of clusters for which increasing the number of clusters further does not significantly help to reduce the within-cluster sum of squares. For this particular dataset, it looks like 4 or 5 clusters is a good choice. Beyond 5, increasing the number of clusters does not really reduce the within-cluster sum of squares too much.

#Note: You may have noticed it took a lot of typing to generate SumWithinss; this is because we limited ourselves to R functions we've learned so far in the course. In fact, R has powerful functions for repeating tasks with a different input (in this case running kmeans with different cluster sizes). For instance, we could generate SumWithinss with:

#All the above steps in one go using sapply()
SumWithinss = sapply(2:10, function(x) sum(kmeans(healthyVector, centers=x, iter.max=1000)$withinss))
#SCREE plot
plot(NumClusters, SumWithinss, type="b")

VIDEO 6: DETECTING TUMORS (R script reproduced here)

First Taste of a Fascinating Field

  • MRI image segmentation is a subject of ongoing research
  • k-means is a good starting point, but not enough
    • Advanced clustering techniques, such as the modified fuzzy c-means (MFCM) clustering technique
    • Packages in R specialized for medical image analysis (see the CRAN Medical Imaging task view)

#Video 6

#Apply to a test image

tumor = read.csv("tumor.csv", header=FALSE)
tumorMatrix = as.matrix(tumor)
tumorVector = as.vector(tumorMatrix)

#Apply the clusters from before to the new image, using the flexclust package
#install.packages("flexclust")
library(flexclust)
## Loading required package: grid
## Loading required package: modeltools
## Loading required package: stats4
#converting KMC into a kcca object
KMC.kcca = as.kcca(KMC, healthyVector) #healthyVector as the training set
tumorClusters = predict(KMC.kcca, newdata = tumorVector) #tumorVector as the testing set

#Visualize the clusters

#converting tumorClusters into a matrix
dim(tumorClusters) = c(nrow(tumorMatrix), ncol(tumorMatrix))

#visualising the tumorClusters
image(tumorClusters, axes = FALSE, col=rainbow(k))

Segmented MRI Images

mri1

mri1

  • Left side image is of a healthy brain

T2 Weighted MRI Images

mri2

mri2

VIDEO 7: COMPARING METHODS USED IN THE CLASS

Comparison of Methods

method1

method1

method2

method2

method3

method3

Assignment 6

DOCUMENT CLUSTERING WITH DAILY KOS

Document clustering, or text clustering, is a very popular application of clustering algorithms. A web search engine, like Google, often returns thousands of results for a simple query. For example, if you type the search term “jaguar” into Google, around 200 million results are returned. This makes it very difficult to browse or find relevant information, especially if the search term has multiple meanings. If we search for “jaguar”, we might be looking for information about the animal, the car, or the Jacksonville Jaguars football team.

Clustering methods can be used to automatically group search results into categories, making it easier to find relevant results. This method is used in the search engines PolyMeta and Helioid, as well as on FirstGov.gov, the official Web portal for the U.S. government. The two most common algorithms used for document clustering are Hierarchical and k-means.

In this problem, we’ll be clustering articles published on Daily Kos, an American political blog that publishes news and opinion articles written from a progressive point of view. Daily Kos was founded by Markos Moulitsas in 2002, and as of September 2014, the site had an average weekday traffic of hundreds of thousands of visits.

The file dailykos.csv contains data on 3,430 news articles or blogs that have been posted on Daily Kos. These articles were posted in 2004, leading up to the United States Presidential Election. The leading candidates were incumbent President George W. Bush (Republican) and John Kerry (Democrat). Foreign policy was a dominant topic of the election, specifically the 2003 invasion of Iraq.

Each of the variables in the dataset is a word that has appeared in at least 50 different articles (1,545 words in total). The set of words has been trimmed according to some of the techniques covered in the previous week on text analytics (punctuation has been removed, and stop words have been removed). For each document, the variable values are the number of times that word appeared in the document.

#PROBLEM 1.1 - HIERARCHICAL CLUSTERING  

#Let's start by building a hierarchical clustering model. First, read the data set into R.
dailykos<- read.csv("dailykos.csv")
dim(dailykos)
## [1] 3430 1545
#compute the distances (using method="euclidean")
kosDist<- dist(dailykos,method="euclidean")

#use hclust to build the model (using method="ward.D"), clustering on all of the variables
kosHierClust<- hclust(kosDist, method="ward.D")

#Running the dist function will probably take you a while. Why? Select all that apply.
#Ans:We have a lot of observations, so it takes a long time to compute the distance between each pair of observations & We have a lot of variables, so the distance computation is long.
#Explanation:The distance computation can take a long time if you have a lot of observations and/or if there are a lot of variables. As we saw in recitation, it might not even work if you have too many of either!

###########################################

#PROBLEM 1.2 - HIERARCHICAL CLUSTERING  

#Plot the dendrogram of your hierarchical clustering model. Just looking at the dendrogram, which of the following seem like good choices for the number of clusters? Select all that apply.

#plot the dendrogram
plot(kosHierClust)#where "kosHierClust" is the name of the clustering model

#Ans:2 & 3
#Explanation:The choices 2 and 3 are good cluster choices according to the dendrogram, because there is a lot of space between the horizontal lines in the dendrogram in those cut off spots (draw a horizontal line across the dendrogram where it crosses 2 or 3 vertical lines). The choices of 5 and 6 do not seem good according to the dendrogram because there is very little space.
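#Optional visual aid (not part of the assignment): rect.hclust from base R draws boxes around the clusters for a chosen cut of the dendrogram
plot(kosHierClust)
rect.hclust(kosHierClust, k = 3, border = "red")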

#######################################

#PROBLEM 1.3 - HIERARCHICAL CLUSTERING  

#In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.
#Ans:7 & 8
#Explanation:Thinking about the application, it is probably better to show the reader more categories than 2 or 3. These categories would probably be too broad to be useful. Seven or eight categories seems more reasonable.

######################################

#PROBLEM 1.4 - HIERARCHICAL CLUSTERING  

#Let's pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters.

#You can split your data into clusters by first using the cutree function to compute the cluster numbers
#cluster assignment of hierarchical clustering
hierGroups<-cutree(kosHierClust, k=7)

#Then, you can use the subset function 7 times to split the data into the 7 clusters (we don't really want to run tapply on every single variable when we have over 1,000 different variables):
HierCluster1 = subset(dailykos, hierGroups == 1)
nrow(HierCluster1)
## [1] 1266
HierCluster2 = subset(dailykos, hierGroups == 2)
nrow(HierCluster2)
## [1] 321
HierCluster3 = subset(dailykos, hierGroups == 3)
nrow(HierCluster3)
## [1] 374
HierCluster4 = subset(dailykos, hierGroups == 4)
nrow(HierCluster4)
## [1] 139
HierCluster5 = subset(dailykos, hierGroups == 5)
nrow(HierCluster5)
## [1] 407
HierCluster6 = subset(dailykos, hierGroups == 6)
nrow(HierCluster6)
## [1] 714
HierCluster7 = subset(dailykos, hierGroups == 7)
nrow(HierCluster7)
## [1] 209
#or
table(hierGroups)
## hierGroups
##    1    2    3    4    5    6    7 
## 1266  321  374  139  407  714  209
#More Advanced Approach:

#There is a very useful function in R called the "split" function. Given a vector assigning groups like hierGroups, you could split dailykos into the clusters by typing:
HierCluster = split(dailykos, hierGroups)

#Then cluster 1 can be accessed by typing HierCluster[[1]], cluster 2 can be accessed by typing HierCluster[[2]], etc. If you have a variable in your current R session called "split", you will need to remove it with rm(split) before using the split function.
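#For example, the size of cluster 3 taken from the split list (this agrees with nrow(HierCluster3) above):
nrow(HierCluster[[3]]) #374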

#How many observations are in cluster 3?
#Ans:374

#Which cluster has the most observations?
#Ans:1

#Which cluster has the fewest observations?
#Ans:4

#########################################

#PROBLEM 1.5 - HIERARCHICAL CLUSTERING  

#Instead of looking at the average value in each variable individually, we'll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (here we use the split list, so cluster 1 is HierCluster[[1]]; with the subsets above you would use HierCluster1 instead):
tail(sort(colMeans(HierCluster[[1]])))
##      state republican       poll   democrat      kerry       bush 
##  0.7575039  0.7590837  0.9036335  0.9194313  1.0624013  1.7053712
#This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.

#What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:
#Ans:bush 
#Explanation:After running the R command given above, we can see that the most frequent word on average is "bush". This corresponds to President George W. Bush.

##################################################

#PROBLEM 1.6 - HIERARCHICAL CLUSTERING  

#Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions.

#Get top 6 words of all clusters

#using lapply() rather than doing clusterwise
topWords <- lapply(HierCluster, function(c) tail(sort(colMeans(c))))
topWords
## $`1`
##      state republican       poll   democrat      kerry       bush 
##  0.7575039  0.7590837  0.9036335  0.9194313  1.0624013  1.7053712 
## 
## $`2`
##      bush  democrat challenge      vote      poll  november 
##  2.847352  2.850467  4.096573  4.398754  4.847352 10.339564 
## 
## $`3`
##      elect    parties      state republican   democrat       bush 
##   1.647059   1.665775   2.320856   2.524064   3.823529   4.406417 
## 
## $`4`
## campaign    voter presided     poll     bush    kerry 
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849 
## 
## $`5`
##       american       presided administration            war           iraq 
##       1.090909       1.120393       1.230958       1.776413       2.427518 
##           bush 
##       3.941032 
## 
## $`6`
##      race      bush     kerry     elect  democrat      poll 
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325 
## 
## $`7`
## democrat    clark   edward     poll    kerry     dean 
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828
#or doing each cluster wise
tail(sort(colMeans(HierCluster2)))
##      bush  democrat challenge      vote      poll  november 
##  2.847352  2.850467  4.096573  4.398754  4.847352 10.339564
tail(sort(colMeans(HierCluster3)))
##      elect    parties      state republican   democrat       bush 
##   1.647059   1.665775   2.320856   2.524064   3.823529   4.406417
tail(sort(colMeans(HierCluster4)))
## campaign    voter presided     poll     bush    kerry 
## 1.431655 1.539568 1.625899 3.589928 7.834532 8.438849
tail(sort(colMeans(HierCluster5)))
##       american       presided administration            war           iraq 
##       1.090909       1.120393       1.230958       1.776413       2.427518 
##           bush 
##       3.941032
tail(sort(colMeans(HierCluster6)))
##      race      bush     kerry     elect  democrat      poll 
## 0.4579832 0.4887955 0.5168067 0.5350140 0.5644258 0.5812325
tail(sort(colMeans(HierCluster7)))
## democrat    clark   edward     poll    kerry     dean 
## 2.148325 2.497608 2.607656 2.765550 3.952153 5.803828
#Which words best describe cluster 2?
#Ans:november, poll, vote, challenge

#Which cluster could best be described as the cluster related to the Iraq war?
#Ans:Cluster 5

#In 2004, one of the candidates for the Democratic nomination for President of the United States was Howard Dean, John Kerry was the candidate who won the Democratic nomination, and John Edwards was the running mate of John Kerry (the Vice Presidential nominee). Given this information, which cluster best corresponds to the Democratic party?
#Ans:Cluster 7
#Explanation:You can see that the words that best describe Cluster 2 are november, poll, vote, and challenge. The most common words in Cluster 5 are bush, iraq, war, and administration, so it is the cluster that can best be described as corresponding to the Iraq war. And the most common words in Cluster 7 are dean, kerry, poll, and edward, so it looks like the democratic cluster.

##############################################

#PROBLEM 2.1 - K-MEANS CLUSTERING  

#Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don't need to add the iter.max argument.

set.seed(1000)

#cluster assignment of k-means clustering
KmeansCluster<-kmeans(dailykos, centers=7)
str(KmeansCluster)
## List of 9
##  $ cluster     : int [1:3430] 4 4 6 4 1 4 7 4 4 4 ...
##  $ centers     : num [1:7, 1:1545] 0.0342 0.0556 0.0253 0.0136 0.0491 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:7] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:1545] "abandon" "abc" "ability" "abortion" ...
##  $ totss       : num 896461
##  $ withinss    : num [1:7] 76583 52693 99504 258927 88632 ...
##  $ tot.withinss: num 730632
##  $ betweenss   : num 165829
##  $ size        : int [1:7] 146 144 277 2063 163 329 308
##  $ iter        : int 7
##  $ ifault      : int 0
##  - attr(*, "class")= chr "kmeans"
#Then, you can subset your data into the 7 clusters by using the following commands:
KmeansCluster1 = subset(dailykos, KmeansCluster$cluster == 1)
nrow(KmeansCluster1)
## [1] 146
KmeansCluster2 = subset(dailykos, KmeansCluster$cluster == 2)
nrow(KmeansCluster2)
## [1] 144
KmeansCluster3 = subset(dailykos, KmeansCluster$cluster == 3)
nrow(KmeansCluster3)
## [1] 277
KmeansCluster4 = subset(dailykos, KmeansCluster$cluster == 4)
nrow(KmeansCluster4)
## [1] 2063
KmeansCluster5 = subset(dailykos, KmeansCluster$cluster == 5)
nrow(KmeansCluster5)
## [1] 163
KmeansCluster6 = subset(dailykos, KmeansCluster$cluster == 6)
nrow(KmeansCluster6)
## [1] 329
KmeansCluster7 = subset(dailykos, KmeansCluster$cluster == 7)
nrow(KmeansCluster7)
## [1] 308
#Alternatively, you could answer these questions by looking at
table(KmeansCluster$cluster)
## 
##    1    2    3    4    5    6    7 
##  146  144  277 2063  163  329  308
#More Advanced Approach:

#rm(split)
KmeansClusterspl<-split(dailykos,KmeansCluster$cluster)

#How many observations are in Cluster 3?
nrow(KmeansClusterspl[[3]])
## [1] 277
#Ans:277

#using sapply() to get the no of obs in all 7 clusters
sapply(KmeansClusterspl, nrow)
##    1    2    3    4    5    6    7 
##  146  144  277 2063  163  329  308
#Which cluster has the most observations?
#Ans:Cluster 4

#Which cluster has the fewest number of observations?
#Ans:Cluster 2


###################################################

#PROBLEM 2.2 - K-MEANS CLUSTERING  

#Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.

#most frequent words from each kmeans cluster:
kmTopWords <- lapply(KmeansClusterspl, function(c) tail(sort(colMeans(c))))
kmTopWords
## $`1`
##          state           iraq          kerry administration       presided 
##       1.609589       1.616438       1.636986       2.664384       2.767123 
##           bush 
##      11.431507 
## 
## $`2`
## primaries  democrat    edward     clark     kerry      dean 
##  2.319444  2.694444  2.798611  3.090278  4.979167  8.277778 
## 
## $`3`
## administration          iraqi       american           bush            war 
##       1.389892       1.610108       1.685921       2.610108       3.025271 
##           iraq 
##       4.093863 
## 
## $`4`
##      elect republican      kerry       poll   democrat       bush 
##  0.6010664  0.6175473  0.6495395  0.7474552  0.7891420  1.1473582 
## 
## $`5`
##       race     senate      state    parties republican   democrat 
##   2.484663   2.650307   3.521472   3.619632   4.638037   6.993865 
## 
## $`6`
##  democrat      bush challenge      vote      poll  november 
##  2.899696  2.960486  4.121581  4.446809  4.872340 10.370821 
## 
## $`7`
## presided    voter campaign     poll     bush    kerry 
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519
#or 

#we can output the most frequent words in each of the k-means clusters by using the following commands:
tail(sort(colMeans(KmeansCluster1)))
##          state           iraq          kerry administration       presided 
##       1.609589       1.616438       1.636986       2.664384       2.767123 
##           bush 
##      11.431507
tail(sort(colMeans(KmeansCluster2)))
## primaries  democrat    edward     clark     kerry      dean 
##  2.319444  2.694444  2.798611  3.090278  4.979167  8.277778
tail(sort(colMeans(KmeansCluster3)))
## administration          iraqi       american           bush            war 
##       1.389892       1.610108       1.685921       2.610108       3.025271 
##           iraq 
##       4.093863
tail(sort(colMeans(KmeansCluster4)))
##      elect republican      kerry       poll   democrat       bush 
##  0.6010664  0.6175473  0.6495395  0.7474552  0.7891420  1.1473582
tail(sort(colMeans(KmeansCluster5)))
##       race     senate      state    parties republican   democrat 
##   2.484663   2.650307   3.521472   3.619632   4.638037   6.993865
tail(sort(colMeans(KmeansCluster6)))
##  democrat      bush challenge      vote      poll  november 
##  2.899696  2.960486  4.121581  4.446809  4.872340 10.370821
tail(sort(colMeans(KmeansCluster7)))
## presided    voter campaign     poll     bush    kerry 
## 1.324675 1.334416 1.383117 2.788961 5.970779 6.480519
#Which k-means cluster best corresponds to the Iraq War?
#Ans:Cluster 3

#Which k-means cluster best corresponds to the democratic party? (Remember that we are looking for the names of the key democratic party leaders.)
#Ans:Cluster 2
#Explanation:By looking at the output, you can see that the cluster best corresponding to the Iraq War is cluster 3 (top words are iraq, war, and bush) and the cluster best corresponding to the democratic party is cluster 2 (top words dean, kerry, clark, and edward).

##################################################

#PROBLEM 2.3 - K-MEANS CLUSTERING  

#For the rest of this problem, we'll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering.

#topWords
#kmTopWords
table(hierGroups,KmeansCluster$cluster)  
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Which Hierarchical Cluster best corresponds to K-Means Cluster 2?
#Ans:Hierarchical Cluster 7
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 116 (80.6%) of the observations in K-Means Cluster 2 also fall in Hierarchical Cluster 7.

####################################################

#PROBLEM 2.4 - K-MEANS CLUSTERING

#Which Hierarchical Cluster best corresponds to K-Means Cluster 3?
table(hierGroups, KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Ans:Hierarchical Cluster 5 
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 171 (61.7%) of the observations in K-Means Cluster 3 also fall in Hierarchical Cluster 5.

###########################################

#PROBLEM 2.5 - K-MEANS CLUSTERING  

#Which Hierarchical Cluster best corresponds to K-Means Cluster 7?
table(hierGroups, KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Ans:No Hierarchical Cluster contains at least half of the points in K-Means Cluster 7.
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that no more than 123 (39.9%) of the observations in K-Means Cluster 7 fall in any hierarchical cluster.

##########################################

#PROBLEM 2.6 - K-MEANS CLUSTERING  

#Which Hierarchical Cluster best corresponds to K-Means Cluster 6?
table(hierGroups, KmeansCluster$cluster)
##           
## hierGroups    1    2    3    4    5    6    7
##          1    3   11   64 1045   32    0  111
##          2    0    0    0    0    0  320    1
##          3   85   10   42   79  126    8   24
##          4   10    5    0    0    1    0  123
##          5   48    0  171  145    3    1   39
##          6    0    2    0  712    0    0    0
##          7    0  116    0   82    1    0   10
#Ans:Hierarchical Cluster 2
#Explanation:From "table(hierGroups, KmeansCluster$cluster)", we read that 320 (97.3%) of observations in K-Means Cluster 6 fall in Hierarchical Cluster 2.

MARKET SEGMENTATION FOR AIRLINES

Market segmentation is a strategy that divides a broad target market of customers into smaller, more similar groups, and then designs a marketing strategy specifically for each group. Clustering is a common technique for market segmentation since it automatically finds similar groups given a data set.

In this problem, we’ll see how clustering can be used to find similar groups of customers who belong to an airline’s frequent flyer program. The airline is trying to learn more about its customers so that it can target different customer segments with different types of mileage offers.

The file AirlinesCluster.csv contains information on 3,999 members of the frequent flyer program. This data comes from the textbook “Data Mining for Business Intelligence,” by Galit Shmueli, Nitin R. Patel, and Peter C. Bruce. For more information, see the website for the book.

There are seven different variables in the dataset, described below:

  • Balance = number of miles eligible for award travel
  • QualMiles = number of miles qualifying for TopFlight status
  • BonusMiles = number of miles earned from non-flight bonus transactions in the past 12 months
  • BonusTrans = number of non-flight bonus transactions in the past 12 months
  • FlightMiles = number of flight miles in the past 12 months
  • FlightTrans = number of flight transactions in the past 12 months
  • DaysSinceEnroll = number of days since enrolled in the frequent flyer program
#Read in the data
airlines<- read.csv("AirlinesCluster.csv")
summary(airlines)
##     Balance          QualMiles         BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23801   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  
##   FlightMiles       FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2   
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
##  Median :    0.0   Median : 0.000   Median :4096   
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119   
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
##  Max.   :30817.0   Max.   :53.000   Max.   :8296
#or better way 
colMeans(airlines)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    73601.327582      144.114529    17144.846212       11.601900 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      460.055764        1.373593     4118.559390
#PROBLEM 1.1 - NORMALIZING THE DATA  

#Looking at the summary of airlines, which TWO variables have (on average) the smallest values?
#Ans:BonusTrans & FlightTrans

#Which TWO variables have (on average) the largest values?
#Ans:Balance & BonusMiles

#EXPLANATION: For the smallest values, BonusTrans and FlightTrans are on the scale of tens, whereas all other variables have values in the thousands.For the largest values, Balance and BonusMiles have average values in the tens of thousands.

#################################

#PROBLEM 1.2 - NORMALIZING THE DATA  

#In this problem, we will normalize our data before we run the clustering algorithms. Why is it important to normalize the data before clustering?
#Ans:If we don't normalize the data, the clustering will be dominated by the variables that are on a larger scale. 
#EXPLANATION:If we don't normalize the data, the variables that are on a larger scale will contribute much more to the distance calculation, and thus will dominate the clustering.
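#A quick illustration of the scale problem (optional check): the standard deviations of the variables differ by orders of magnitude
sapply(airlines, sd)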

###################################

#PROBLEM 1.3 - NORMALIZING THE DATA  

#Let's go ahead and normalize our data. You can normalize the variables in a data frame by using the preProcess function in the "caret" package.

library(caret)
## 
## Attaching package: 'caret'
## The following object is masked from 'package:mosaic':
## 
##     dotPlot
#Now, create a normalized data frame called "airlinesNorm" by running the following commands:

preproc = preProcess(airlines)

airlinesNorm = predict(preproc, airlines) #Normalising the data
##The first command pre-processes the data, and the second command performs the normalization.

summary(airlinesNorm)
##     Balance          QualMiles         BonusMiles        BonusTrans      
##  Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
##  1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
##  Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
##  Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  
##   FlightMiles       FlightTrans       DaysSinceEnroll   
##  Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
##  1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
##  Median :-0.3286   Median :-0.36212   Median :-0.01092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
##  Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284
#Mean of each variable
round(colMeans(airlinesNorm),6) #or
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               0               0               0               0 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               0               0               0
round(sapply(airlinesNorm,mean),6)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               0               0               0               0 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               0               0               0
#standard deviation of each variable
sapply(airlinesNorm,sd)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               1               1               1               1 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##               1               1               1
#If you look at the summary of airlinesNorm, you should see that all of the variables now have mean zero. You can also see that each of the variables has standard deviation 1 by using the sd() function.
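#As an aside (a sketch, not part of the assignment), base R's scale() performs the same center-and-scale standardization; airlinesNorm2 is just an illustrative name
airlinesNorm2 = as.data.frame(scale(airlines))
summary(airlinesNorm2$Balance) #should match the Balance column of airlinesNorm above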

#In the normalized data, which variable has the largest maximum value?
apply(airlinesNorm, 2, max)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##       16.186811       14.223084       10.208293        7.746727 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##       21.680292       13.610351        2.022842
#Ans:FlightMiles 

#In the normalized data, which variable has the smallest minimum value?
apply(airlinesNorm, 2, min)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##      -0.7303482      -0.1862754      -0.7099031      -1.2080518 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      -0.3285622      -0.3621226      -1.9933614
#Ans:DaysSinceEnroll
#EXPLANATION:You can see from the output that FlightMiles now has the largest maximum value, and DaysSinceEnroll now has the smallest minimum value. Note that these were not the variables with the largest and smallest values in the original dataset airlines.

#####################################

#PROBLEM 2.1 - HIERARCHICAL CLUSTERING  

#Compute the distances between data points (using euclidean distance) and then run the Hierarchical clustering algorithm (using method="ward.D") on the normalized data. It may take a few minutes for the commands to finish since the dataset has a large number of observations for hierarchical clustering.

distances <- dist(airlinesNorm,method="euclidean")
airlineClust <- hclust(distances,method="ward.D")

#Then, plot the dendrogram of the hierarchical clustering process: 
plot(airlineClust)

#Suppose the airline is looking for somewhere between 2 and 10 clusters. According to the dendrogram, which of the following is NOT a good choice for the number of clusters?
#Ans:6
#EXPLANATION:if you move a horizontal line down the dendrogram, you can see that there is a large range of heights over which the line crosses 2 clusters, 3 clusters, or 7 clusters. However, it is hard to find a height at which the line crosses 6 clusters. This means that 6 clusters is probably not a good choice.

##########################################

#PROBLEM 2.2 - HIERARCHICAL CLUSTERING  

#Suppose that after looking at the dendrogram and discussing with the marketing department, the airline decides to proceed with 5 clusters. Divide the data points into 5 clusters by using the cutree function. 

#How many data points are in Cluster 1?

#Divide into 5 clusters
clusterGroups<- cutree(airlineClust, k=5)
airlineClusters <-split(airlinesNorm, clusterGroups)
nrow(airlineClusters[[1]])
## [1] 776
#or
table(clusterGroups)
## clusterGroups
##    1    2    3    4    5 
##  776  519  494  868 1342
#Ans:776
#EXPLANATION:you can see that there are 776 data points in the first cluster.

#####################################

#PROBLEM 2.3 - HIERARCHICAL CLUSTERING

#Now, use tapply to compare the average values in each of the variables for the 5 clusters (the centroids of the clusters). You may want to compute the average values of the unnormalized data so that it is easier to interpret. You can do this for the variable "Balance" with the following command:

#The centroids of the clusters (Un-normalised data)
#average values in 'Balance'variables for the 5 cluster

tapply(airlines$Balance, clusterGroups, mean)
##         1         2         3         4         5 
##  57866.90 110669.27 198191.57  52335.91  36255.91
#then similarly compute for all other variables
tapply(airlines$QualMiles, clusterGroups, mean)
##            1            2            3            4            5 
##    0.6443299 1065.9826590   30.3461538    4.8479263    2.5111773
tapply(airlines$BonusMiles, clusterGroups, mean)
##         1         2         3         4         5 
## 10360.124 22881.763 55795.860 20788.766  2264.788
tapply(airlines$BonusTrans, clusterGroups, mean)
##         1         2         3         4         5 
## 10.823454 18.229287 19.663968 17.087558  2.973174
tapply(airlines$FlightMiles, clusterGroups, mean)
##          1          2          3          4          5 
##   83.18428 2613.41811  327.67611  111.57373  119.32191
tapply(airlines$FlightTrans, clusterGroups, mean)
##         1         2         3         4         5 
## 0.3028351 7.4026975 1.0688259 0.3444700 0.4388972
tapply(airlines$DaysSinceEnroll, clusterGroups, mean)
##        1        2        3        4        5 
## 6235.365 4402.414 5615.709 2840.823 3060.081
#or

#Advanced Explanation:

#Instead of using tapply, you could have alternatively used colMeans and subset, as follows:
colMeans(subset(airlines, clusterGroups == 1))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    5.786690e+04    6.443299e-01    1.036012e+04    1.082345e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    8.318428e+01    3.028351e-01    6.235365e+03
colMeans(subset(airlines, clusterGroups == 2))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.106693e+05    1.065983e+03    2.288176e+04    1.822929e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    2.613418e+03    7.402697e+00    4.402414e+03
colMeans(subset(airlines, clusterGroups == 3))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.981916e+05    3.034615e+01    5.579586e+04    1.966397e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    3.276761e+02    1.068826e+00    5.615709e+03
colMeans(subset(airlines, clusterGroups == 4))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    52335.913594        4.847926    20788.766129       17.087558 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      111.573733        0.344470     2840.822581
colMeans(subset(airlines, clusterGroups == 5))
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    3.625591e+04    2.511177e+00    2.264788e+03    2.973174e+00 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    1.193219e+02    4.388972e-01    3.060081e+03
#But an even more compact way of finding the centroids would be to use the function "split" to first split the data into clusters, and then to use the function "lapply" to apply the function "colMeans" to each of the clusters:
lapply(split(airlines, clusterGroups), colMeans)
## $`1`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    5.786690e+04    6.443299e-01    1.036012e+04    1.082345e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    8.318428e+01    3.028351e-01    6.235365e+03 
## 
## $`2`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.106693e+05    1.065983e+03    2.288176e+04    1.822929e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    2.613418e+03    7.402697e+00    4.402414e+03 
## 
## $`3`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    1.981916e+05    3.034615e+01    5.579586e+04    1.966397e+01 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    3.276761e+02    1.068826e+00    5.615709e+03 
## 
## $`4`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    52335.913594        4.847926    20788.766129       17.087558 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##      111.573733        0.344470     2840.822581 
## 
## $`5`
##         Balance       QualMiles      BonusMiles      BonusTrans 
##    3.625591e+04    2.511177e+00    2.264788e+03    2.973174e+00 
##     FlightMiles     FlightTrans DaysSinceEnroll 
##    1.193219e+02    4.388972e-01    3.060081e+03
#or better & faster ways is using sapply()
airlinesUnnormClusters<-split(airlines,clusterGroups)
round(sapply(airlinesUnnormClusters,colMeans),4)
##                          1           2           3          4          5
## Balance         57866.9046 110669.2659 198191.5749 52335.9136 36255.9098
## QualMiles           0.6443   1065.9827     30.3462     4.8479     2.5112
## BonusMiles      10360.1237  22881.7630  55795.8603 20788.7661  2264.7876
## BonusTrans         10.8235     18.2293     19.6640    17.0876     2.9732
## FlightMiles        83.1843   2613.4181    327.6761   111.5737   119.3219
## FlightTrans         0.3028      7.4027      1.0688     0.3445     0.4389
## DaysSinceEnroll  6235.3647   4402.4143   5615.7085  2840.8226  3060.0812
#Compared to the other clusters, Cluster 1 has the largest average values in which variables (if any)? Select all that apply.
#Ans:DaysSinceEnroll
#EXPLANATION:The only variable for which Cluster 1 has large values is DaysSinceEnroll.

#How would you describe the customers in Cluster 1?
#Ans: Infrequent but loyal customers.
#EXPLANATION:Cluster 1 mostly contains customers with few miles, but who have been with the airline the longest.

####################################

#PROBLEM 2.4 - HIERARCHICAL CLUSTERING  

#Compared to the other clusters, Cluster 2 has the largest average values in which variables (if any)? Select all that apply.
#Ans:QualMiles,FlightMiles,FlightTrans
#EXPLANATION:Cluster 2 has the largest average values in the variables QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance.

#How would you describe the customers in Cluster 2?
#Ans: Customers who have accumulated a large amount of miles, and the ones with the largest number of flight transactions. 
#EXPLANATION:Cluster 2 contains customers with a large amount of miles, mostly accumulated through flight transactions.

##################################

#PROBLEM 2.5 - HIERARCHICAL CLUSTERING

#Compared to the other clusters, Cluster 3 has the largest average values in which variables (if any)? Select all that apply.
#Ans:Balance,BonusMiles,BonusTrans
#EXPLANATION:Cluster 3 has the largest values in Balance, BonusMiles, and BonusTrans. While it also has relatively large values in other variables, these are the three for which it has the largest values.

#How would you describe the customers in Cluster 3?
#Ans:Customers who have accumulated a large amount of miles, mostly through non-flight transactions.
#EXPLANATION:Cluster 3 mostly contains customers with a lot of miles, and who have earned the miles mostly through bonus transactions.

########################################

#PROBLEM 2.6 - HIERARCHICAL CLUSTERING  

#Compared to the other clusters, Cluster 4 has the largest average values in which variables (if any)? Select all that apply.
#Ans:None 
#EXPLANATION:Cluster 4 does not have the largest values in any of the variables.

#How would you describe the customers in Cluster 4?
#Ans:Relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.
#EXPLANATION:Cluster 4 customers have the smallest value in DaysSinceEnroll, but they are already accumulating a reasonable number of miles.

################################

#PROBLEM 2.7 - HIERARCHICAL CLUSTERING  

#Compared to the other clusters, Cluster 5 has the largest average values in which variables (if any)? Select all that apply.
#Ans:None 
#EXPLANATION:Cluster 5 does not have the largest values in any of the variables.

#How would you describe the customers in Cluster 5?
#Ans:Relatively new customers who don't use the airline very often.
#EXPLANATION:Cluster 5 customers have lower than average values in all variables.

####################################

#PROBLEM 3.1 - K-MEANS CLUSTERING  

#Now run the k-means clustering algorithm on the normalized data, again creating 5 clusters. Set the seed to 88 right before running the clustering algorithm, and set the argument iter.max to 1000.

#Kmeans clustering

set.seed(88)
kmeansClust<- kmeans(airlinesNorm, centers=5, iter.max=1000)

#you can look at the number of observations in each cluster with the following command:
table(kmeansClust$cluster)
## 
##    1    2    3    4    5 
##  408  141  993 1182 1275
#or
sum(kmeansClust$size > 1000)
## [1] 2
#How many clusters have more than 1,000 observations?
#Ans:2
#Explanation:There are two clusters with more than 1000 observations.

###################################

#PROBLEM 3.2 - K-MEANS CLUSTERING  

#Now, compare the cluster centroids to each other either by dividing the data points into groups and then using tapply, or by looking at the output of kmeansClust$centers, where "kmeansClust" is the name of the output of the kmeans function. (Note that the output of kmeansClust$centers will be for the normalized data. If you want to look at the average values for the unnormalized data, you need to use tapply like we did for hierarchical clustering.)

#the centroids of the clusters (Normalised data)

#Hierarchical cluster centroids (normalized data)
airlinesNormClusters<-split(airlinesNorm,clusterGroups)
round(sapply(airlinesNormClusters,colMeans),4)
##                       1      2       3       4       5
## Balance         -0.1561 0.3678  1.2363 -0.2110 -0.3706
## QualMiles       -0.1854 1.1916 -0.1471 -0.1800 -0.1830
## BonusMiles      -0.2809 0.2375  1.6004  0.1509 -0.6161
## BonusTrans      -0.0811 0.6901  0.8395  0.5712 -0.8985
## FlightMiles     -0.2692 1.5379 -0.0945 -0.2489 -0.2433
## FlightTrans     -0.2823 1.5895 -0.0803 -0.2713 -0.2464
## DaysSinceEnroll  1.0250 0.1375  0.7250 -0.6187 -0.5125
#Kmeans centroid
kmeansClust$centers
##       Balance   QualMiles BonusMiles BonusTrans FlightMiles FlightTrans
## 1  1.44439706  0.51115730  1.8769284  1.0331951   0.1169945   0.1444636
## 2  1.00054098  0.68382234  0.6144780  1.7214887   3.8559798   4.1196141
## 3 -0.05580605 -0.14104391  0.3041358  0.7108744  -0.1218278  -0.1287569
## 4 -0.13331742 -0.11491607 -0.3492669 -0.3373455  -0.1833989  -0.1961819
## 5 -0.40579897 -0.02281076 -0.5816482 -0.7619054  -0.1989602  -0.2196582
##   DaysSinceEnroll
## 1       0.7198040
## 2       0.2742394
## 3      -0.3398209
## 4       0.9640923
## 5      -0.8897747
#Do you expect Cluster 1 of the K-Means clustering output to necessarily be similar to Cluster 1 of the Hierarchical clustering output?
#Ans:No, because cluster ordering is not meaningful in either k-means clustering or hierarchical clustering. 
#EXPLANATION:The clusters are not displayed in a meaningful order, so while there may be a cluster produced by the k-means algorithm that is similar to Cluster 1 produced by the Hierarchical method, it will not necessarily be shown first.
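#One optional way to check this yourself (not asked in the problem) is the same kind of cross-tabulation used in the Daily Kos assignment
table(clusterGroups, kmeansClust$cluster)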

PREDICTING STOCK RETURNS WITH CLUSTER-THEN-PREDICT

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we’ll use cluster-then-predict to predict future stock prices using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we’ll use logistic regression to predict whether or not the stocks will have positive future returns.

For this problem, we’ll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

This dataset contains the following variables:

  • ReturnJan = the return for the company’s stock during January (in the year of the observation).
  • ReturnFeb = the return for the company’s stock during February (in the year of the observation).
  • ReturnMar = the return for the company’s stock during March (in the year of the observation).
  • ReturnApr = the return for the company’s stock during April (in the year of the observation).
  • ReturnMay = the return for the company’s stock during May (in the year of the observation).
  • ReturnJune = the return for the company’s stock during June (in the year of the observation).
  • ReturnJuly = the return for the company’s stock during July (in the year of the observation).
  • ReturnAug = the return for the company’s stock during August (in the year of the observation).
  • ReturnSep = the return for the company’s stock during September (in the year of the observation).
  • ReturnOct = the return for the company’s stock during October (in the year of the observation).
  • ReturnNov = the return for the company’s stock during November (in the year of the observation).
  • PositiveDec = whether or not the company’s stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.

#PROBLEM 1.1 - EXPLORING THE DATASET

#Load StocksCluster.csv into a data frame called "stocks". 

#How many observations are in the dataset?
stocks <- read.csv("StocksCluster.csv")
str(stocks)
## 'data.frame':    11580 obs. of  12 variables:
##  $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
##  $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
##  $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
##  $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
##  $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
##  $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
##  $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
##  $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
##  $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
##  $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
##  $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
##  $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...
#or
nrow(stocks)
## [1] 11580
#Ans:11580

###################################

#PROBLEM 1.2 - EXPLORING THE DATASET  

#What proportion of the observations have positive returns in December?

#Proportion of stocks with positive return

prop.table(table(stocks$PositiveDec)) #6324/11580 = 0.546
## 
##        0        1 
## 0.453886 0.546114
#or
sum(stocks$PositiveDec) / nrow(stocks)
## [1] 0.546114
#or
mean(stocks$PositiveDec)
## [1] 0.546114
#Ans:0.546114 

######################################

#PROBLEM 1.3 - EXPLORING THE DATASET

#What is the maximum correlation between any two return variables in the dataset? You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

#Max correlation between months January to November
cm <- cor(stocks[1:11])
diag(cm) <- -1 # set the diagonal to -1 so the self-correlations of 1 are ignored when taking the max
max(cm)
## [1] 0.1916728
#Ans:0.1916728
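#Optional check (a sketch): find which pair of months attains this maximum (each pair appears twice, as row/column and column/row)
which(cm == max(cm), arr.ind = TRUE)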

####################################

#PROBLEM 1.4 - EXPLORING THE DATASET  

#Which month (from January through November) has the largest mean return across all observations in the dataset?

summary(stocks)
##    ReturnJan            ReturnFeb           ReturnMar        
##  Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
##  1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
##  Median : 0.0009965   Median :-0.010626   Median : 0.009878  
##  Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
##  3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
##  Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
##    ReturnApr           ReturnMay          ReturnJune       
##  Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
##  1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
##  Median : 0.009059   Median : 0.01293   Median :-0.000880  
##  Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
##  3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
##  Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
##    ReturnJuly           ReturnAug           ReturnSep        
##  Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
##  1st Qu.:-0.0731917   1st Qu.:-0.046272   1st Qu.:-0.074648  
##  Median :-0.0008047   Median : 0.007205   Median :-0.007616  
##  Mean   : 0.0030509   Mean   : 0.016198   Mean   :-0.014721  
##  3rd Qu.: 0.0718205   3rd Qu.: 0.070783   3rd Qu.: 0.049476  
##  Max.   : 2.5500000   Max.   : 3.626609   Max.   : 5.863980  
##    ReturnOct           ReturnNov          PositiveDec    
##  Min.   :-0.685504   Min.   :-0.747171   Min.   :0.0000  
##  1st Qu.:-0.070915   1st Qu.:-0.054890   1st Qu.:0.0000  
##  Median : 0.002115   Median : 0.008522   Median :1.0000  
##  Mean   : 0.005651   Mean   : 0.011387   Mean   :0.5461  
##  3rd Qu.: 0.074542   3rd Qu.: 0.076576   3rd Qu.:1.0000  
##  Max.   : 5.665138   Max.   : 3.271676   Max.   :1.0000
#If you look at the mean value for each variable, you can see that April has the largest mean value (0.026308), and September has the smallest mean value (-0.014721).

#or better 

cm <- colMeans(stocks[1:11])
which.max(cm)
## ReturnApr 
##         4
#Ans:ReturnApr

#Which month (from January through November) has the smallest mean return across all observations in the dataset?
which.min(cm)
## ReturnSep 
##         9
#Ans:ReturnSep

###########################################

#PROBLEM 2.1 - INITIAL LOGISTIC REGRESSION MODEL  

#Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

#Split the data into a training set and testing set

library(caTools)
## 
## Attaching package: 'caTools'
## The following objects are masked from 'package:base64enc':
## 
##     base64decode, base64encode
set.seed(144)
spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)
stocksTrain = subset(stocks, spl == TRUE)
stocksTest = subset(stocks, spl == FALSE)

#Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.

#Train Logistic regression model as:
StocksModel = glm(PositiveDec ~ ., data=stocksTrain, family=binomial)

#we can compute our predictions on the training set with:
PredictTrain = predict(StocksModel, type="response")

#construct a confusion/classification matrix with the table function using a threshold of 0.5:
cmat_LR <-table(stocksTrain$PositiveDec, PredictTrain > 0.5)
cmat_LR 
##    
##     FALSE TRUE
##   0   990 2689
##   1   787 3640
#Let's now compute the overall accuracy on the training set
accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
accu_LR
## [1] 0.5711818
#or
sum(diag(cmat_LR))/nrow(stocksTrain)
## [1] 0.5711818
#What is the overall accuracy on the training set using a threshold of 0.5?
#Ans: 0.5711818    ((990 + 3640)/(990 + 2689 + 787 + 3640) = 0.571)

#########################################

#PROBLEM 2.2 - INITIAL LOGISTIC REGRESSION MODEL  

#Now obtain test set predictions from StocksModel. What is the overall accuracy of the model on the test set, again using a threshold of 0.5?

#Out-of-Sample predictions of the Logistic Regression model
PredictTest = predict(StocksModel, newdata=stocksTest, type="response")

#Then we compute the confusion matrix for the testing set using a threshold of 0.5
cmat_LR<-table(stocksTest$PositiveDec, PredictTest > 0.5)
cmat_LR
##    
##     FALSE TRUE
##   0   417 1160
##   1   344 1553
#Let's now compute the overall accuracy on the test set
accu_LR <-(cmat_LR[1,1] + cmat_LR[2,2])/sum(cmat_LR)
accu_LR
## [1] 0.5670697
#or
sum(diag(cmat_LR))/nrow(stocksTest)
## [1] 0.5670697
#Ans:0.5670697  (417 + 1553)/(417 + 1160 + 344 + 1553) = 0.567

########################################

#PROBLEM 2.3 - INITIAL LOGISTIC REGRESSION MODEL  

#What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?

#baseline model computed by making a table of the outcome variable in the test set:
baseline<-table(stocksTest$PositiveDec)
baseline
## 
##    0    1 
## 1577 1897
#Baseline accuracy
accu_baseline <- max(baseline)/sum(baseline)
accu_baseline #1897/(1577 + 1897) = 0.5460564
## [1] 0.5460564
#or
baseline[2] / sum(baseline)
##         1 
## 0.5460564
#Ans:0.5460564
#EXPLANATION:The baseline model would get all of the PositiveDec = 1 cases correct, and all of the PositiveDec = 0 cases wrong, for an accuracy of 1897/(1577 + 1897) = 0.5460564.

##################################

#PROBLEM 3.1 - CLUSTERING STOCKS  

#Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

#Remove the dependent variable: we are using clustering to "discover" the outcome, so the outcome itself must not be used to form the clusters
limitedTrain = stocksTrain
limitedTrain$PositiveDec = NULL
limitedTest = stocksTest
limitedTest$PositiveDec = NULL

#Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?
#Ans:Needing to know the dependent variable value to assign an observation to a cluster defeats the purpose of the methodology 
#EXPLANATION:In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.
#This is an important point that is sometimes mistakenly overlooked. If you use the outcome value to cluster, you might conclude your method strongly outperforms a non-clustering alternative. However, this is because it is using the outcome to determine the clusters, which is not valid.
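
#Illustrative sketch of the leakage risk (not part of the assignment): if PositiveDec were
#left in the clustering data, the outcome itself would help shape the clusters, and any
#cross-tabulation of cluster vs. outcome would overstate what an outcome-blind clustering
#can achieve.
set.seed(144)
kmLeaky <- kmeans(stocksTrain, centers = 3, iter.max = 1000)  # wrong: PositiveDec still included
table(kmLeaky$cluster, stocksTrain$PositiveDec)               # any alignment here reflects leakage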

################################################

#PROBLEM 3.2 - CLUSTERING STOCKS  

#In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

#In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

library(caret)
preproc = preProcess(limitedTrain)
normTrain = predict(preproc, limitedTrain)
normTest = predict(preproc, limitedTest)

#What is the mean of the ReturnJan variable in normTrain?
mean(normTrain$ReturnJan)
## [1] 2.100586e-17
#Ans:2.100586e-17

#What is the mean of the ReturnJan variable in normTest?
mean(normTest$ReturnJan)
## [1] -0.0004185886
#Ans:-0.0004185886

###############################

#PROBLEM 3.3 - CLUSTERING STOCKS  

#Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?
#Ans:The distribution of the ReturnJan variable is different in the training and testing set
#EXPLANATION:From mean(stocksTrain$ReturnJan) and mean(stocksTest$ReturnJan), we see that the average return in January is slightly higher in the training set than in the testing set. Since normTest was constructed by subtracting the training-set mean of ReturnJan (and dividing by its standard deviation), this explains why the mean value of ReturnJan is slightly negative in normTest.
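
#Sanity check (illustrative, assuming preProcess's default center-and-scale method): manually
#normalizing the test-set ReturnJan with the TRAINING-set mean and sd should reproduce
#normTest$ReturnJan, which is why its mean is slightly negative rather than exactly zero.
manualJan <- (limitedTest$ReturnJan - mean(limitedTrain$ReturnJan)) / sd(limitedTrain$ReturnJan)
all.equal(manualJan, normTest$ReturnJan)
mean(stocksTrain$ReturnJan)  # training-set mean is slightly higher...
mean(stocksTest$ReturnJan)   # ...than the test-set mean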

#####################################

#PROBLEM 3.4 - CLUSTERING STOCKS  

#Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called km.

set.seed(144)
km <- kmeans(normTrain, centers=3, iter.max=1000)

#We can see the number of observations in each cluster by:
km$size
## [1] 3157 4696  253
#or
table(km$cluster)
## 
##    1    2    3 
## 3157 4696  253
#Which cluster has the largest number of observations?
#Ans:Cluster 2
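
#Illustrative: inspect the (normalized) cluster centers to see which months drive the
#separation; entries are in standard-deviation units relative to the training-set average.
round(km$centers, 2)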

##################################

#PROBLEM 3.5 - CLUSTERING STOCKS  

#Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

#Obtain training set and testing set cluster assignments:

library(flexclust)

km.kcca = as.kcca(km, normTrain)
clusterTrain = predict(km.kcca)
clusterTest = predict(km.kcca, newdata=normTest)

#How many test-set observations were assigned to Cluster 2?

#we can obtain the breakdown of the testing set clusters with:
table(clusterTest)
## clusterTest
##    1    2    3 
## 1298 2080   96
#or, counting directly:
sum(clusterTest == 2) #test-set observations assigned to Cluster 2
## [1] 2080
#Ans:2080
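
#Conceptual sketch (not the recitation's method): for a kmeans-family kcca object, predict()
#assigns each test observation to the cluster whose training-set centroid is nearest in
#Euclidean distance. A rough manual equivalent:
nearestCluster <- apply(as.matrix(normTest), 1, function(x) {
  which.min(colSums((t(km$centers) - x)^2))  # squared distance to each of the 3 centroids
})
table(nearestCluster)  # should match table(clusterTest) above (up to possible ties)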

#################################

#PROBLEM 4.1 - CLUSTER-SPECIFIC PREDICTIONS 

#Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

#We can obtain the necessary subsets with:
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)

#Which training set data frame has the highest average value of the dependent variable?
mean(stocksTrain1$PositiveDec)
## [1] 0.6024707
mean(stocksTrain2$PositiveDec)
## [1] 0.5140545
mean(stocksTrain3$PositiveDec)
## [1] 0.4387352
#or, more concisely, using split() and sapply():

#Build per-cluster training sets (stocksTrain split by clusterTrain) and per-cluster test sets (stocksTest split by clusterTest)
stocksTrain11 <- split(stocksTrain, clusterTrain)
stocksTest11 <- split(stocksTest, clusterTest)

sapply(stocksTrain11, function(s){ mean(s$PositiveDec) })
##         1         2         3 
## 0.6024707 0.5140545 0.4387352
#Ans:stocksTrain1 
#EXPLANATION: We see that stocksTrain1 contains the observations with the highest average value of the dependent variable.

####################################

#PROBLEM 4.2 - CLUSTER-SPECIFIC PREDICTIONS  

#Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

#We can build the Logistic models with:
StocksModel1 = glm(PositiveDec ~ ., data=stocksTrain1, family=binomial)
StocksModel2 = glm(PositiveDec ~ ., data=stocksTrain2, family=binomial)
StocksModel3 = glm(PositiveDec ~ ., data=stocksTrain3, family=binomial)

summary(StocksModel1)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7307  -1.2910   0.8878   1.0280   1.5023  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.17224    0.06302   2.733  0.00628 ** 
## ReturnJan    0.02498    0.29306   0.085  0.93206    
## ReturnFeb   -0.37207    0.29123  -1.278  0.20139    
## ReturnMar    0.59555    0.23325   2.553  0.01067 *  
## ReturnApr    1.19048    0.22439   5.305 1.12e-07 ***
## ReturnMay    0.30421    0.22845   1.332  0.18298    
## ReturnJune  -0.01165    0.29993  -0.039  0.96901    
## ReturnJuly   0.19769    0.27790   0.711  0.47685    
## ReturnAug    0.51273    0.30858   1.662  0.09660 .  
## ReturnSep    0.58833    0.28133   2.091  0.03651 *  
## ReturnOct   -1.02254    0.26007  -3.932 8.43e-05 ***
## ReturnNov   -0.74847    0.28280  -2.647  0.00813 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4243.0  on 3156  degrees of freedom
## Residual deviance: 4172.9  on 3145  degrees of freedom
## AIC: 4196.9
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel2)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2012  -1.1941   0.8583   1.1334   1.9424  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.10293    0.03785   2.719 0.006540 ** 
## ReturnJan    0.88451    0.20276   4.362 1.29e-05 ***
## ReturnFeb    0.31762    0.26624   1.193 0.232878    
## ReturnMar   -0.37978    0.24045  -1.579 0.114231    
## ReturnApr    0.49291    0.22460   2.195 0.028189 *  
## ReturnMay    0.89655    0.25492   3.517 0.000436 ***
## ReturnJune   1.50088    0.26014   5.770 7.95e-09 ***
## ReturnJuly   0.78315    0.26864   2.915 0.003554 ** 
## ReturnAug   -0.24486    0.27080  -0.904 0.365876    
## ReturnSep    0.73685    0.24820   2.969 0.002989 ** 
## ReturnOct   -0.27756    0.18400  -1.509 0.131419    
## ReturnNov   -0.78747    0.22458  -3.506 0.000454 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6506.3  on 4695  degrees of freedom
## Residual deviance: 6362.2  on 4684  degrees of freedom
## AIC: 6386.2
## 
## Number of Fisher Scoring iterations: 4
summary(StocksModel3)
## 
## Call:
## glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9146  -1.0393  -0.7689   1.1921   1.6939  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.181896   0.325182  -0.559   0.5759  
## ReturnJan   -0.009789   0.448943  -0.022   0.9826  
## ReturnFeb   -0.046883   0.213432  -0.220   0.8261  
## ReturnMar    0.674179   0.564790   1.194   0.2326  
## ReturnApr    1.281466   0.602672   2.126   0.0335 *
## ReturnMay    0.762512   0.647783   1.177   0.2392  
## ReturnJune   0.329434   0.408038   0.807   0.4195  
## ReturnJuly   0.774164   0.729360   1.061   0.2885  
## ReturnAug    0.982605   0.533158   1.843   0.0653 .
## ReturnSep    0.363807   0.627774   0.580   0.5622  
## ReturnOct    0.782242   0.733123   1.067   0.2860  
## ReturnNov   -0.873752   0.738480  -1.183   0.2367  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 346.92  on 252  degrees of freedom
## Residual deviance: 328.29  on 241  degrees of freedom
## AIC: 352.29
## 
## Number of Fisher Scoring iterations: 4
#or, more concisely, fitting all three models with lapply():

#Create logistic regression models 1, 2 and 3 from stocksTrain11[[1]], stocksTrain11[[2]] and stocksTrain11[[3]]
stocksModels <- lapply(stocksTrain11, function(s){
  glm(PositiveDec ~ ., family=binomial, data=s)
})

sapply(stocksModels, function(m){ m$coefficients })
##                       1          2            3
## (Intercept)  0.17223985  0.1029318 -0.181895809
## ReturnJan    0.02498357  0.8845148 -0.009789345
## ReturnFeb   -0.37207369  0.3176221 -0.046883260
## ReturnMar    0.59554957 -0.3797811  0.674179495
## ReturnApr    1.19047752  0.4929105  1.281466189
## ReturnMay    0.30420906  0.8965492  0.762511555
## ReturnJune  -0.01165375  1.5008787  0.329433917
## ReturnJuly   0.19769226  0.7831487  0.774164370
## ReturnAug    0.51272941 -0.2448602  0.982605385
## ReturnSep    0.58832685  0.7368522  0.363806823
## ReturnOct   -1.02253506 -0.2775631  0.782242086
## ReturnNov   -0.74847186 -0.7874737 -0.873752144
#Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3? Select all that apply.
#Ans: ReturnJan, ReturnFeb, ReturnMar, ReturnJune, ReturnAug, and ReturnOct  differ in sign between the models.
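
#Quick programmatic check (illustrative): find the return variables whose coefficient sign
#differs across the three cluster-specific models.
coefMat <- sapply(stocksModels, coef)        # 12 x 3 matrix of coefficients (as printed above)
flips <- apply(sign(coefMat[-1, ]), 1, function(s) any(s > 0) && any(s < 0))
names(which(flips))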

###########################################

#PROBLEM 4.3 - CLUSTER-SPECIFIC PREDICTIONS  

#Using StocksModel1, make test-set predictions called PredictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called PredictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called PredictTest3 on the data frame stocksTest3.

#The out-of-sample (test-set) predictions can be obtained with:
PredictTest1 = predict(StocksModel1, newdata = stocksTest1, type="response")
PredictTest2 = predict(StocksModel2, newdata = stocksTest2, type="response")
PredictTest3 = predict(StocksModel3, newdata = stocksTest3, type="response")

#The out-of-sample confusion/classification matrices using a threshold of 0.5 are:
cmat1<-table(stocksTest1$PositiveDec, PredictTest1 > 0.5)
cmat1
##    
##     FALSE TRUE
##   0    30  471
##   1    23  774
cmat2<-table(stocksTest2$PositiveDec, PredictTest2 > 0.5)
cmat2
##    
##     FALSE TRUE
##   0   388  626
##   1   309  757
cmat3<-table(stocksTest3$PositiveDec, PredictTest3 > 0.5)
cmat3
##    
##     FALSE TRUE
##   0    49   13
##   1    21   13
#Let's now compute the out-of-sample accuracy of each cluster-specific model:
accu_1<- (cmat1[1,1] + cmat1[2,2])/sum(cmat1)
accu_1
## [1] 0.6194145
accu_2<- (cmat2[1,1] + cmat2[2,2])/sum(cmat2)
accu_2
## [1] 0.5504808
accu_3<- (cmat3[1,1] + cmat3[2,2])/sum(cmat3)
accu_3
## [1] 0.6458333
#or better way using a custom function

#Make predictions with a threshold of 0.5 and compute accuracies, applying stocksModels[[i]] to stocksTest11[[i]]

predictions <- sapply(1:3, function(i) {
  p <- predict(stocksModels[[i]], newdata=stocksTest11[[i]], type="response")
  conf.mat <- table(stocksTest11[[i]]$PositiveDec, p > 0.5)  # confusion matrix at threshold 0.5
  accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
  list(predict=p, accuracy=accuracy)
})
predictions
##          [,1]         [,2]         [,3]      
## predict  Numeric,1298 Numeric,2080 Numeric,96
## accuracy 0.6194145    0.5504808    0.6458333
#What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?
#Ans:0.6194145  (30 + 774)/(30 + 471 + 23 + 774) = 0.6194145

#What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?
#Ans:0.5504808  (388 + 757)/(388 + 626 + 309 + 757) = 0.5504808

#What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?
#Ans:0.6458333   (49 + 13)/(49 + 13 + 21 + 13) = 0.6458333

#############################################

#PROBLEM 4.4 - CLUSTER-SPECIFIC PREDICTIONS  

#To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

AllPredictions = c(PredictTest1, PredictTest2, PredictTest3)
AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

#What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?

#Computing overall confusion matrix of the cluster-then-predict approach
cmatoverall<-table(AllOutcomes, AllPredictions > 0.5)
cmatoverall
##            
## AllOutcomes FALSE TRUE
##           0   467 1110
##           1   353 1544
#Let's now compute the overall out-of-sample accuracy:
accu_overall<- (cmatoverall[1,1] + cmatoverall[2,2])/sum(cmatoverall)
accu_overall   #(467 + 1544)/(467 + 1110 + 353 + 1544) = 0.5788716
## [1] 0.5788716
#Ans:0.5788716
#EXPLANATION:This tells us that the overall accuracy is (467 + 1544)/(467 + 1110 + 353 + 1544) = 0.5788716.

#We see a modest improvement over the original logistic regression model's test-set accuracy. Since predicting stock returns is a notoriously hard problem, even this small gain is meaningful. By investing in the stocks we are most confident will have positive returns (those with the highest predicted probabilities), the cluster-then-predict model can give us an edge over the single logistic regression model, as the check below illustrates.
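
#Illustrative check: the overall cluster-then-predict accuracy is the test-set-size-weighted
#average of the three cluster-specific accuracies, and it improves on the single logistic
#regression model's test-set accuracy of 0.5670697.
n <- c(nrow(stocksTest1), nrow(stocksTest2), nrow(stocksTest3))
acc <- c(accu_1, accu_2, accu_3)
sum(n * acc) / sum(n)       # equals accu_overall, about 0.5789
accu_overall - 0.5670697    # improvement of about 0.0118 in accuracy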
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_India.1252  LC_CTYPE=English_India.1252   
## [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.1252    
## 
## attached base packages:
## [1] stats4    grid      stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] caTools_1.17.1      caret_6.0-68        flexclust_1.3-4    
##  [4] modeltools_0.2-21   DataComputing_0.8.3 curl_0.9.7         
##  [7] base64enc_0.1-3     manipulate_1.0.1    mosaic_0.13.0      
## [10] mosaicData_0.13.0   car_2.1-2           lattice_0.20-33    
## [13] knitr_1.13          stringr_1.0.0       tidyr_0.4.1        
## [16] lubridate_1.5.6     dplyr_0.4.3         ggplot2_2.1.0      
## 
## loaded via a namespace (and not attached):
##  [1] reshape2_1.4.1     splines_3.3.0      colorspace_1.2-6  
##  [4] htmltools_0.3.5    yaml_2.1.13        mgcv_1.8-12       
##  [7] nloptr_1.0.4       DBI_0.4-1          foreach_1.4.3     
## [10] plyr_1.8.4         MatrixModels_0.4-1 munsell_0.4.3     
## [13] gtable_0.2.0       codetools_0.2-14   evaluate_0.9      
## [16] SparseM_1.7        quantreg_5.26      pbkrtest_0.4-6    
## [19] parallel_3.3.0     Rcpp_0.12.5        scales_0.4.0      
## [22] formatR_1.4        mime_0.4           lme4_1.1-12       
## [25] gridExtra_2.2.1    digest_0.6.9       stringi_1.1.1     
## [28] bitops_1.0-6       tools_3.3.0        magrittr_1.5      
## [31] lazyeval_0.1.10    ggdendro_0.1-20    MASS_7.3-45       
## [34] Matrix_1.2-6       iterators_1.0.8    assertthat_0.1    
## [37] minqa_1.2.4        rmarkdown_0.9.6    R6_2.1.2          
## [40] nnet_7.3-12        nlme_3.1-128
################# Finally clustering unit finished.......Yay! ###################