About the Notebook
This Rnotebook covers part of the material in chapter 4 of “An Introduction to Statistical Learning with Applications in R” by James, Witten, Hastie and Tibshirani (ISLR). We are going to work on sections 4.6.3 (Linear Discriminant Analysis) and section 4.6.4 (Quadratic Discriminant Analysis).
The first part of the notebook covers important concepts behind Discriminant Analysis and its relation to Bayes’ Theorem then we go over creating a model using the mlr package plotting and ROC curve, on this notebook we are also going to use the MASS package to follow a similar example on the book (ISLR).
Notebook Setup: Set the working directory to the source location by: Session -> Set Working Directory -> To Source File Location
#### load and install packages if necessary
### https://mlr-org.github.io/
if(!require("mlr",quietly = TRUE)) install.packages("mlr")
### http://www.stats.ox.ac.uk/pub/MASS4/
if(!require("MASS",quietly = TRUE)) install.packages("MASS")
### http://ggplot2.org/
if(!require("ggplot2",quietly = TRUE)) install.packages("ggplot2")
### https://github.com/taiyun/corrplot
if(!require("corrplot",quietly = TRUE)) install.packages("corrplot")
#### set seed and turn off scientific notation to see small/large numbers
set.seed(1234)
options(scipen = 9999)
Background
Logistic Regression (Review)
- Models the conditional distribution of the response Y, \(Pr(Y = k|X = x)\) given the predictor(s) X, using the logistic function for the case of two response classes:
\[ p(X) = \frac{e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}{1 + e^{\beta_0+\beta_1X_1+···+\beta_pX_p}} .\]
Linear discriminant model (LDA)
- Can be consider a less direct approach to classification compared to Logistic regression.
- Model’s the distribution of the predictors \(X\) separately in each of the response classes (i.e. given \(Y\) ), and then use Bayes’ theorem to flip these around into estimates for \(Pr(Y=k|X=x)\).
- Moreover Linear discriminant analysis (LDA) is a generalization of Fisher’s linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events.
- The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.
When to use Linear Discriminant Analysis
When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.
If \(n\) is small and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
Linear discriminant analysis is popular when there are more than two response classes.
Classification using Bayes’ Theorem
Bayes’ theorem
\[
Pr(Y=k|X=x) = \frac{\pi_kf_k(x)}{\sum^K_{l=1}\pi_lf_l(x)}. \qquad \qquad (4.10)
\]
Posterior Probability
We refer to \(pk(x)\) as the posterior probability that an observation \(X = x\) belongs to the \(kth\) class. That is, it is the probability that the observation belongs to the \(kth\) class, given the predictor value for that observation.
Suppose we assume that \(f_k(x)\) is normal or Gaussian. In the one normal dimensional setting, the normal density takes the form:
\[
f_k(x) = \frac{1}{\sqrt{2\pi\sigma_k}}exp\Bigg( -\frac{1}{2\sigma^2_k}(x-\mu_k)^2\Bigg). \qquad \qquad (4.11)
\]
Where \(\mu_k\) and \(\sigma^2_k\) are the mean and variance parameters for the \(kth\) class.
Assume that \(\sigma^2_1 = \dots = \sigma^2_K\), such as there is a shared variance term across all \(K\) classes, denote by \(\sigma^2\).
By plugging (4.11) into (4.10), we get:
\[
p_k(x) = \frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}\text{exp}\big(-\frac{1}{2\sigma^2}(x-\mu_k)^2 \big)}{\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}\text{exp}\big(-\frac{1}{2\sigma^2}(x- \mu_l)^2 \big)}. \qquad \qquad \qquad (4.12)
\]
- In equation (4.12) \(\pi_k\) denotes the prior probability that an observation belongs to the kth class (not \(\pi \approx 3.14159\), the mathematical constant).
Bayes classifier and Decision Boundary
Bayes classifier, which classifies an observation (\(X = x\)) to the class for which \(p_k(X)\) (4.12) is largest. By taking the log of (4.12) and rearranging the terms we get:
\[
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu^2_k}{2\sigma^2} + log(\pi_k) \qquad \qquad \qquad (4.13)
\] If \(K=2\) and \(\pi_1 = \pi_2\), then the Bayes classifier assigns:
\[
\text{BayesClassifier} = \Bigg\{
\begin{align}
1, & \text{ if } 2x(\mu1-\mu2)>\mu^2_1-\mu^2_2 \\
2, & \text{ otherwise }
\end{align}
\]
The Bayes decision boundary corresponds to the point where:
\[
x = \frac{\mu^2_1 - \mu^2_2}{2(\mu_1 - \mu_2)}=\frac{\mu1 + \mu2}{2}. \qquad \qquad \qquad (4.14)
\]
Plot of two one-dimensional normal density functions
The dashed vertical line represents the Bayes decision boundary calculated using the formula above.
mean_a <- 1.5
mean_b <- -1.5
bayes_decision_boundary <- (mean_a + mean_b)/2
p <- ggplot( data.frame(x=c(-4,4)), aes(x)) +
stat_function( fun = dnorm, args = list(mean = mean_a, sd = 1), col='red' ) +
stat_function( fun = dnorm, args = list(mean = mean_b, sd = 1), col='blue' )
p + geom_vline(xintercept = bayes_decision_boundary, linetype="dashed")

Linear Discriminant Analysis
First lets create a data set by of size \(n=500\) by randomly sampling from a normal distribution.
x <- seq(-4, 4, length = 500)
df <- data.frame(cond = rep(letters[1:2], each = 500),
values = c( rnorm(x,mean = mean_a, sd = 1), rnorm(x,mean = mean_b, sd = 1)))
head(df, n=5)
- From the generated data set lets take a sample of size \(100\) from each category cond = a | b also from the samples we need to calculate the \(\mu\),\(\sigma\) and probabilities.
The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for \(\pi_k\), \(\mu_k\), and \(\sigma^2\) into (4.13)
\[
\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i \qquad \qquad \qquad \qquad \qquad \qquad \qquad \\
\hat{\sigma}^2 = \frac{1}{n-K}\sum^K_{k=1}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2 \qquad \qquad \quad (4.15)
\]
- Where \(n\) is the total number of training observations, and \(n_k\) is the number of training observations in the \(kth\) class. *The estimate for \(\mu_k\) is simply the average of all the training observations from the \(kth\) class
- \(\hat{\sigma^2}\) is given by the weighted average of the sample variances for each of the \(K\) classes.
- \(\pi_k\) is estimating by using the proportion of the training observations that belong to the \(kth\) class given by: \(\hat{\pi_k} = n_k/n\).
\[
\hat{\delta_k(x)} = x \cdot \frac{\hat{\mu_k}}{\hat{\sigma}^2} - \frac{\hat{\mu}^2_k}{2\hat{\sigma}^2} + log(\hat{\pi}_k) \qquad \qquad \qquad (4.17)
\]
sample_a <- data.frame("values"= sample(df$values[df$cond=="a"], 100), "cond" = "a")
sample_b <- data.frame("values"= sample(df$values[df$cond=="b"], 100), "cond" = "b")
mean_sample_a <- mean( sample_a$values )
mean_sample_b <- mean( sample_b$values )
hist_a <- hist( sample_a$values, plot = FALSE )
hist_b <- hist( sample_b$values, plot = FALSE )
sigma_a <- weighted.mean( hist_a$density, hist_a$counts )
sigma_b <- weighted.mean( hist_b$density, hist_b$counts )
prob_a <- 30/100
prob_b <- 30/100
Now to were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.
LDA_decision_boundary <- (mean_sample_a + mean_sample_b)/2
ggplot(df, aes(x = values, fill = cond)) +
geom_histogram( binwidth = 0.4, alpha = 0.5, position = "identity" ) +
geom_vline( xintercept = LDA_decision_boundary, linetype = "solid" ) +
geom_vline( xintercept = bayes_decision_boundary, linetype = "dashed" )

Use case: IMDb Movies, TV and Celebrities – Ratings Dataset
For this use case we are going to explore the IMDb movie ratings data set then we are going to use the Linear Discriminant Analysis and Quadratic Discriminant Analysis to classify movies by gender given a number of features present in the data.
First lets load the data, explore a couple of rows, data types make some changes and deal with NA’s
imdb <- read.csv("data/imdb_rating.csv")
head(imdb)
Lets take a closer look to the numeric data by running the command summary()
and `str()
to try to find any irregularities in the data determine the data types.
summary(imdb[7:20])
length budget director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes
Min. : 7.0 Min. : 218 Min. : 0.0 Min. : 0 Min. : 0 Min. : 0.0
1st Qu.: 93.0 1st Qu.: 6000000 1st Qu.: 7.0 1st Qu.: 614 1st Qu.: 281 1st Qu.: 133.0
Median :103.0 Median : 20000000 Median : 49.0 Median : 988 Median : 595 Median : 371.5
Mean :107.2 Mean : 39752620 Mean : 686.5 Mean : 6560 Mean : 1652 Mean : 645.0
3rd Qu.:118.0 3rd Qu.: 45000000 3rd Qu.: 194.5 3rd Qu.: 11000 3rd Qu.: 918 3rd Qu.: 636.0
Max. :511.0 Max. :12215500000 Max. :23000.0 Max. :640000 Max. :137000 Max. :23000.0
NA's :15 NA's :492 NA's :104 NA's :7 NA's :13 NA's :23
total_cast_likes fb_likes critic_reviews users_reviews users_votes score aspect_ratio
Min. : 0 Min. : 0 Min. : 1.0 Min. : 1.0 Min. : 5 Min. :1.600 Min. : 1.18
1st Qu.: 1411 1st Qu.: 0 1st Qu.: 50.0 1st Qu.: 65.0 1st Qu.: 8594 1st Qu.:5.800 1st Qu.: 1.85
Median : 3090 Median : 166 Median :110.0 Median : 156.0 Median : 34359 Median :6.600 Median : 2.35
Mean : 9699 Mean : 7526 Mean :140.2 Mean : 272.8 Mean : 83668 Mean :6.442 Mean : 2.22
3rd Qu.: 13756 3rd Qu.: 3000 3rd Qu.:195.0 3rd Qu.: 326.0 3rd Qu.: 96309 3rd Qu.:7.200 3rd Qu.: 2.35
Max. :656730 Max. :349000 Max. :813.0 Max. :5060.0 Max. :1689764 Max. :9.500 Max. :16.00
NA's :50 NA's :21 NA's :329
gross
Min. : 162
1st Qu.: 5340988
Median : 25517500
Mean : 48468408
3rd Qu.: 62309438
Max. :760505847
NA's :884
Check the structure of the data set and data types
str(imdb)
'data.frame': 5043 obs. of 21 variables:
$ title : Factor w/ 4917 levels "[Rec] 2Â ","[Rec]Â ",..: 398 2731 3279 3707 3332 1961 3289 3459 399 1631 ...
$ genres : Factor w/ 914 levels "Action","Action|Adventure",..: 107 101 128 288 754 126 120 308 126 447 ...
$ director : Factor w/ 2399 levels "","A. Raven Cruz",..: 929 801 2027 380 606 109 2030 1652 1228 554 ...
$ actor1 : Factor w/ 2098 levels "","50 Cent","A.J. Buckley",..: 305 983 355 1968 528 443 787 223 338 34 ...
$ actor2 : Factor w/ 3033 levels "","50 Cent","A. Michael Baldwin",..: 1408 2218 2489 534 2433 2549 1228 801 2440 653 ...
$ actor3 : Factor w/ 3522 levels "","50 Cent","A.J. Buckley",..: 3442 1395 3134 1771 1 2714 1970 2163 3018 2941 ...
$ length : int 178 169 148 164 NA 132 156 100 141 153 ...
$ budget : num 237000000 300000000 245000000 250000000 NA ...
$ director_fb_likes: int 0 563 0 22000 131 475 0 15 0 282 ...
$ actor1_fb_likes : int 1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
$ actor2_fb_likes : int 936 5000 393 23000 12 632 11000 553 21000 11000 ...
$ actor3_fb_likes : int 855 1000 161 23000 NA 530 4000 284 19000 10000 ...
$ total_cast_likes : int 4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
$ fb_likes : int 33000 0 85000 164000 0 24000 0 29000 118000 10000 ...
$ critic_reviews : int 723 302 602 813 NA 462 392 324 635 375 ...
$ users_reviews : int 3054 1238 994 2701 NA 738 1902 387 1117 973 ...
$ users_votes : int 886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
$ score : num 7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
$ aspect_ratio : num 1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
$ gross : int 760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
$ year : int 2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
From the table above we can see that there is some work to do in order to start working with the imdb data. First we need to change the genres column to character type and extract the first entry in order to create our labels. Also on this step we want to find a way to remove or replace missing values for this case we are going to remove them.
### change data type and remove NA's from the data
imdb$genres <- as.character(imdb$genres)
imdb <- na.omit(imdb)
### Now we want to extract the first movie genre from the column and find the frequency of the genre in the data set
genre <- sapply( imdb$genres, strsplit,"[|]", USE.NAMES = FALSE )
get_one_genre <- function(entry) for(i in entry) return(i[[1]])
imdb$genres <- unlist(lapply(genre, get_one_genre))
table_genre <- table(imdb$genres)
table_genre
Action Adventure Animation Biography Comedy Crime Documentary Drama Family Fantasy
963 372 45 208 1001 259 30 685 3 37
Horror Musical Mystery Romance Sci-Fi Thriller Western
165 2 23 2 8 1 3
The frequency table above we can see that some of the genres don’t have enough observations and others are over represented so lets take the observations greater than 50 and less than 400 as this range would give us a set of classes which are proportional to each other. Also we have to check for classes that might represent similar/related genres such as Crime and Horror
### Here we are selecting observations greather than 50 and less than 400 alse
### creating a vector containing the name of the classes that are interested
table_genre_400 <- table_genre[ table_genre > 50 & table_genre < 400 ]
y_classes <- names(table_genre_400[ c(1,2,4) ])
y_classes
[1] "Adventure" "Biography" "Horror"
Now we are going to use those cases to select movies in the genres that we are interested We also want to consider newer movies as most metrics use FB likes, critics reviews and other digital forms of engagement with movie that if we use older data it might not be that accurate
### To create ROC plot for two classes
#y_classes <- y_classes[c(1,3)]
movies <- imdb[ imdb$genres %in% y_classes, ]
movies <- movies[ movies$year > 2006, ]
### Here we are only selecting genre and other numeric values that we are going to use to create the model
head(movies)
movies <- movies[ c(2,7:18,20) ]
Training, Testing Data Split with mlr package
### First lets selesct two classes so we can later create an ROC plot
y_classes_mlr <- y_classes[ c(1,3) ]
### Select only movies mad ein the last 10 years within the selected classes
movies_mlr <- imdb[ imdb$genres %in% y_classes_mlr,]
movies_mlr <- movies_mlr[ movies_mlr$year > 2006, ]
### Disregard non-numeric variables and some correlated variables
movies_mlr <- movies_mlr[c(2,7:18,20)]
### Create a mlr task object using the movies dataset and gender as a target
movies_mlr.task <- makeClassifTask(data = movies_mlr, target = "genres")
### Now we can create the Training (~65%) and Testing (~35%) sets based on the number of samples
n = getTaskSize(movies_mlr.task)
train_mlr.set = sample( n, size = round(2/3 * n) )
test_mlr.set = setdiff( seq_len(n), train_mlr.set )
Training, Testing Data Split
When building the model it is important to split our data in train and test sets using about \(75\%\) of the data for training and the remaining for testing. In this case we are taking a sample of \(300\) observations from the larger movies data set
### Here we are creating a random sample of size n=100 from the dataset
### the sample function return the index of the selected rows
idx_smp <- sample(seq_len(nrow(movies)), size = 100)
### Now we extract select that datsa using the index of the rows and take another sample
### But this time is to create the Training and Testing sets
movies <- movies[idx_smp,]
smp_size <- floor(0.75 * nrow(movies))
train_ind <- sample(seq_len(nrow(movies)), size = smp_size)
### TRAINING
train_data <- movies[ train_ind, ]
### TESTING
test_data <- movies[ -train_ind, ]
### Labels from the testing set
y_test <- as.factor(test_data['genres']$genres)
test_data['genres'] <- NULL
head(train_data)
Visual Analysis
Lets make some scatter plots to visually identify any variable interactions
selected_columns = c('director_fb_likes',
'total_cast_likes',
'fb_likes',
'users_votes',
'score',
'gross')
#plot(train)
plot( train_data[ selected_columns ] )

Creating Z-scores using the training data
### Uncomment to use all the columns in the dataset
#zscores <- apply(train, 1, function(x) (x - mean(x)) / sd(x))
zscores <- apply( train_data[selected_columns], 1, function(x) (x - mean(x)) / sd(x))
zscores <- data.frame( t(zscores) )
Correlation Plot
corr <- cor(zscores)
corrplot(corr)

Model Building: Linear Discriminat Analysis with mlr
### Here we are making a object of type makeLearner using the LDA classifier
lrn = makeLearner("classif.lda", predict.type = "prob")
### Now using the train function in mlr we can train our model using the training set
mod = train(lrn, task = movies_mlr.task, subset = train_mlr.set)
### After trining the model we test it on the testing data
pred = predict(mod, task = movies_mlr.task, subset = test_mlr.set)
### Measure performance of prediction
performance(pred, measures = list(mmce, acc))
mmce acc
0.1875 0.8125
Threshold vs. Performance(s) for 2-class classification Table
The following mlr command generates data on threshold vs. performance(s) for a 2-class classification that can be used for plotting.
df = generateThreshVsPerfData(pred, measures = list(fpr, tpr) )
head(df)
$measures
$measures$fpr
Name: False positive rate
Performance measure: fpr
Properties: classif,req.pred,req.truth
Minimize: TRUE
Best: 0; Worst: 1
Aggregated by: test.mean
Note: Percentage of misclassified observations in the positive class. Also called false alarm rate or fall-out.
$measures$tpr
Name: True positive rate
Performance measure: tpr
Properties: classif,req.pred,req.truth
Minimize: FALSE
Best: 1; Worst: 0
Aggregated by: test.mean
Note: Percentage of correctly classified observations in the positive class. Also called hit rate or recall.
$data
$aggregate
[1] TRUE
Plotting ROCCurves with mlr for 2-class classification
The ROC curve is a popular graphic for simultaneously displaying the ROC curve two types of errors for all possible thresholds.
The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC).
An ideal ROC curve will hug the top left corner, so the larger area under the (ROC) curve the AUC the better the classifier.
plotROCCurves(df)

Plotting threshold vs. performance with mlr
plotThreshVsPerf(df)

Model Building: Linear Discriminat Analysis with MASS
Lets fit a LDA model for using the training data ( movies after 2006 )
### Uncomment to create a model containing only selected columns
#lda.fit = lda( genres ~ ., data = train_data[ c('genres', selected_columns) ])
lda.fit = lda(genres ~ ., data = train_data)
lda.fit
Call:
lda(genres ~ ., data = train_data)
Prior probabilities of groups:
Adventure Biography Horror
0.5066667 0.2666667 0.2266667
Group means:
length budget director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes fb_likes
Adventure 110.28947 83521053 346.76316 12283.842 5506.4474 1616.6842 21143.421 37318.18
Biography 117.75000 18781500 1718.00000 7663.750 1830.0000 366.2500 10406.000 25596.15
Horror 97.11765 10923235 63.64706 3626.765 585.1765 387.6471 5455.882 8000.00
critic_reviews users_reviews users_votes score gross
Adventure 269.7632 393.0789 167158.97 6.539474 139676049
Biography 271.5500 253.2000 113415.45 7.175000 37172899
Horror 198.2941 196.7647 41990.29 5.541176 25836690
Coefficients of linear discriminants:
LD1 LD2
length 0.028589606074320 -0.030733759679642
budget -0.000000026963002 -0.000000004151626
director_fb_likes 0.000050624280169 -0.000057538957066
actor1_fb_likes 0.000197787882912 0.000033187876416
actor2_fb_likes 0.000106542979971 -0.000001814925458
actor3_fb_likes 0.000308201807052 0.000014853943635
total_cast_likes -0.000184779740657 -0.000028537049727
fb_likes -0.000003348054640 0.000007656551923
critic_reviews 0.005663886688470 0.004775206739475
users_reviews -0.001085486238006 0.000536940711429
users_votes -0.000004606327040 -0.000004104176206
score 0.169874351479755 -0.845638008525884
gross 0.000000004142433 0.000000001638829
Proportion of trace:
LD1 LD2
0.7115 0.2885
The LDA output shows Prior probabilities for each group such as Adventure (\(\hat{\pi_1} = 0.5066667\)), Biography (\(\hat{\pi_2} = 0.2666667\)) and Horror (\(\hat{\pi_3} = 0.2266667\)). From the prior probabilities we can determine that about \(50\%\) of the training data correspond to genre Adventure the other \(50\%\) is split between the other two genres (Biography and Horros).
plot(lda.fit)

lda.pred = predict( lda.fit, test_data )
names(lda.pred)
[1] "class" "posterior" "x"
The predict()
function returns a list with three elements. The first element, class, contains LDA’s predictions of genre.
The second element, posterior, is a matrix whose \(kth\) column contains the posterior probability that the corresponding observation belongs to the \(kth\) class
\(x\) contains the linear discriminant, described earlier.
lda.class = lda.pred$class
table(lda.class, y_test)
y_test
lda.class Adventure Biography Horror
Adventure 12 3 0
Biography 1 3 0
Horror 0 1 5
mean(lda.class == y_test)
[1] 0.8
Quadratic Discriminat Analysis
To perform a Quadratic Discriminant Analysis we need to resample the data and follow similar steps that when performing a Linear Discriminant Analysis. But for this case we cant plot the QLD model.
### Here we are only selecting genre and other numeric values that we are going to use to create the model
movies <- imdb[ imdb$genres %in% y_classes, ]
movies <- movies[ movies$year > 1996, ]
movies <- movies[ c(2,7:18,20) ]
### Now we extract select that datsa using the index of the rows and take another sample
### But this time is to create the Training and Testing sets
idx_smp <- sample(seq_len(nrow(movies)), size = 500)
movies <- movies[idx_smp,]
smp_size <- floor(0.75 * nrow(movies))
train_ind <- sample(seq_len(nrow(movies)), size = smp_size)
### TRAINING
train_data <- movies[ train_ind, ]
### TESTING
test_data <- movies[ -train_ind, ]
### Labels from the testing set
y_test <- as.factor(test_data['genres']$genres)
test_data['genres'] <- NULL
We are going to fit a QDA model that is larger in size as the training data starts from 1996.
qda.fit = qda(genres ~ ., data = train_data)
qda.fit
Call:
qda(genres ~ ., data = train_data)
Prior probabilities of groups:
Adventure Biography Horror
0.4906667 0.2826667 0.2266667
Group means:
length budget director_fb_likes actor1_fb_likes actor2_fb_likes actor3_fb_likes total_cast_likes fb_likes
Adventure 105.25000 89126685 830.52717 8298.935 2868.2609 1391.7120 13820.837 16321.016
Biography 118.02830 24977927 1567.05660 9294.925 2652.8113 1032.2075 14025.472 11439.962
Horror 98.23529 16383176 77.30588 3472.906 648.8941 409.2824 5318.847 7219.106
critic_reviews users_reviews users_votes score gross
Adventure 205.3370 371.6957 136349.41 6.520652 93081457
Biography 199.8868 250.4057 82349.67 7.074528 32278978
Horror 207.8235 437.2706 71073.22 5.650588 38360442
For the QDA model the output contains the group means and Prior probabilities of the groups just like in LDA, but it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.
The predict() function works in exactly the same fashion as for LDA.
qda.pred = predict(qda.fit, test_data)
qda.class = qda.pred$class
table(qda.class, y_test)
y_test
qda.class Adventure Biography Horror
Adventure 38 4 1
Biography 5 19 0
Horror 20 13 25
mean(qda.class == y_test)
[1] 0.656
---
title: "Linear & Nonlinear - Discriminant Analysis"
author: "Jose Luis Rodriguez"
date: "September 21, 2017"
output:
  html_notebook: default
  html_document: default
  pdf_document: default
  toc: yes
subtitle: STAT 488 - Predictive Analytics
---

------------

## About the Notebook

This Rnotebook covers part of the material in chapter 4 of "An Introduction to Statistical Learning with Applications in R" by James, Witten, Hastie and Tibshirani (ISLR). We are going to work on sections 4.6.3 (Linear Discriminant Analysis) and section 4.6.4 (Quadratic Discriminant Analysis). 

The first part of the notebook covers important concepts behind Discriminant Analysis and its relation to Bayes' Theorem then we go over creating a model using the *mlr* package plotting and ROC curve, on this notebook we are also going to use the *MASS* package to follow a similar example on the book (ISLR).

**Notebook Setup:** Set the working directory to the source location by: *Session -> Set Working Directory -> To Source File Location*

```{r}
#### load and install packages if necessary

### https://mlr-org.github.io/
if(!require("mlr",quietly = TRUE)) install.packages("mlr")

### http://www.stats.ox.ac.uk/pub/MASS4/
if(!require("MASS",quietly = TRUE)) install.packages("MASS")

### http://ggplot2.org/
if(!require("ggplot2",quietly = TRUE)) install.packages("ggplot2")

### https://github.com/taiyun/corrplot
if(!require("corrplot",quietly = TRUE)) install.packages("corrplot")

#### set seed and turn off scientific notation to see small/large numbers
set.seed(1234)
options(scipen = 9999)
```

-----------

## Background

------------


**Logistic Regression (Review)** 

* Models the conditional distribution of the response Y, $Pr(Y = k|X = x)$ given the predictor(s) X, using the logistic function for the case of two response classes:

$$ p(X) = \frac{e^{\beta_0+\beta_1X_1+\dots+\beta_pX_p}}{1 + e^{\beta_0+\beta_1X_1+···+\beta_pX_p}} .$$


**Linear discriminant model (LDA) ** 

* Can be consider a less direct approach to classification compared to Logistic regression.
* Model's the distribution of the predictors $X$ separately in each of the response classes (i.e. given $Y$ ), and then use Bayes' theorem to flip these around into estimates for $Pr(Y=k|X=x)$.
* Moreover Linear discriminant analysis (LDA) is a generalization of Fisher's linear discriminant, a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events. 
* The resulting combination may be used as a linear classifier, or, more commonly, for dimensionality reduction before later classification.

**When to use Linear Discriminant Analysis**

* When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

* If $n$ is small and the distribution of the predictors $X$ is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

* Linear discriminant analysis is popular when there are more than two response classes.

<!-- ISLR 138 -->

------------

## Classification using Bayes' Theorem 

------------

**Bayes' theorem**

$$
Pr(Y=k|X=x) = \frac{\pi_kf_k(x)}{\sum^K_{l=1}\pi_lf_l(x)}. \qquad \qquad (4.10)
$$

**Posterior Probability**

We refer to $pk(x)$ as the posterior probability that an observation $X = x$ belongs to the $kth$ class. That is, it is the probability that the observation belongs to the $kth$ class, given the predictor value for that observation.

Suppose we assume that $f_k(x)$ is normal or Gaussian. In the one normal dimensional setting, the normal density takes the form:

$$
f_k(x) = \frac{1}{\sqrt{2\pi\sigma_k}}exp\Bigg( -\frac{1}{2\sigma^2_k}(x-\mu_k)^2\Bigg). \qquad \qquad (4.11)
$$

* Where $\mu_k$ and $\sigma^2_k$ are the mean and variance parameters for the $kth$ class. 

* Assume that $\sigma^2_1 = \dots = \sigma^2_K$, such as there is a shared variance term across all $K$ classes, denote by $\sigma^2$.

By plugging (4.11) into (4.10), we get: 

$$
p_k(x) = \frac{\pi_k\frac{1}{\sqrt{2\pi}\sigma}\text{exp}\big(-\frac{1}{2\sigma^2}(x-\mu_k)^2 \big)}{\sum^K_{l=1} \pi_l\frac{1}{\sqrt{2\pi}\sigma}\text{exp}\big(-\frac{1}{2\sigma^2}(x- \mu_l)^2 \big)}. \qquad \qquad \qquad (4.12)
$$

* In equation (4.12) $\pi_k$ denotes the **prior probability** that an observation belongs to the *kth* class (not $\pi \approx 3.14159$, the mathematical constant). 

------------

## Bayes classifier and Decision Boundary

------------

Bayes classifier, which classifies an observation ($X = x$) to the class for which $p_k(X)$ (4.12) is largest. By taking the log of (4.12) and rearranging the terms we get:

$$
\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu^2_k}{2\sigma^2} + log(\pi_k) \qquad \qquad \qquad (4.13)
$$
If $K=2$ and $\pi_1 = \pi_2$, then the Bayes classifier assigns:

$$
\text{BayesClassifier} = \Bigg\{
\begin{align} 
  1, & \text{ if }  2x(\mu1-\mu2)>\mu^2_1-\mu^2_2 \\ 
  2, & \text{ otherwise } 
\end{align}
$$

The **Bayes decision boundary** corresponds to the point where:

$$
x = \frac{\mu^2_1 - \mu^2_2}{2(\mu_1 - \mu_2)}=\frac{\mu1 + \mu2}{2}. \qquad \qquad \qquad (4.14)
$$

**Plot of two one-dimensional normal density functions**

The dashed vertical line represents the Bayes decision boundary calculated using the formula above.

```{r}
mean_a <- 1.5
mean_b <- -1.5
bayes_decision_boundary <- (mean_a + mean_b)/2

p <- ggplot( data.frame(x=c(-4,4)), aes(x)) +
  stat_function( fun = dnorm, args = list(mean = mean_a, sd = 1), col='red' ) +
  stat_function( fun = dnorm, args = list(mean = mean_b, sd = 1), col='blue' )

p + geom_vline(xintercept = bayes_decision_boundary, linetype="dashed") 

```

<!-- ISLR 138 -->

------------

## Linear Discriminant Analysis

------------

First lets create a data set by of size $n=500$ by randomly sampling from a normal distribution.

```{r}
x <- seq(-4, 4, length = 500)
df <- data.frame(cond = rep(letters[1:2], each = 500),
                 values = c( rnorm(x,mean = mean_a, sd = 1), rnorm(x,mean = mean_b, sd = 1)))
head(df, n=5)
```

* From the generated data set lets take a sample of size $100$ from each category **cond = a | b**  also from the samples we need to calculate the $\mu$,$\sigma$ and probabilities.

The linear discriminant analysis (LDA) method approximates the Bayes classifier by plugging estimates for $\pi_k$, $\mu_k$, and $\sigma^2$ into (4.13)

$$
\hat{\mu_k} = \frac{1}{n_k}\sum_{i:y_i=k}x_i \qquad \qquad \qquad \qquad \qquad \qquad \qquad  \\ 
\hat{\sigma}^2 = \frac{1}{n-K}\sum^K_{k=1}\sum_{i:y_i=k}(x_i-\hat{\mu}_k)^2 \qquad \qquad \quad (4.15)
$$

* Where $n$ is the total number of training observations, and $n_k$ is the number of training observations in the $kth$ class.
*The estimate for $\mu_k$ is simply the average of all the training observations from the $kth$ class
* $\hat{\sigma^2}$ is given by the weighted average of the sample variances for each of the $K$ classes.
* $\pi_k$ is estimating by using the proportion of the training observations that belong to the $kth$ class given by: $\hat{\pi_k} = n_k/n$.

$$
\hat{\delta_k(x)} = x \cdot \frac{\hat{\mu_k}}{\hat{\sigma}^2} - \frac{\hat{\mu}^2_k}{2\hat{\sigma}^2} + log(\hat{\pi}_k) \qquad \qquad \qquad (4.17)
$$

```{r}
sample_a <- data.frame("values"= sample(df$values[df$cond=="a"], 100), "cond" = "a")
sample_b <- data.frame("values"= sample(df$values[df$cond=="b"], 100), "cond" = "b")

mean_sample_a <- mean( sample_a$values )
mean_sample_b <- mean( sample_b$values )

hist_a <- hist( sample_a$values, plot = FALSE )
hist_b <- hist( sample_b$values, plot = FALSE )

sigma_a <- weighted.mean( hist_a$density, hist_a$counts )
sigma_b <- weighted.mean( hist_b$density, hist_b$counts )
prob_a <- 30/100
prob_b <- 30/100
```


Now to were drawn from each of the two classes, and are shown as histograms. The Bayes decision boundary is again shown as a dashed vertical line. The solid vertical line represents the LDA decision boundary estimated from the training data.


```{r}
LDA_decision_boundary <- (mean_sample_a + mean_sample_b)/2

ggplot(df, aes(x = values, fill = cond)) +
  geom_histogram( binwidth = 0.4, alpha = 0.5, position = "identity" ) +
  geom_vline( xintercept = LDA_decision_boundary, linetype = "solid" ) +
  geom_vline( xintercept = bayes_decision_boundary, linetype = "dashed" )
```

------------

## Use case: IMDb Movies, TV and Celebrities -- Ratings Dataset

------------

For this use case we are going to explore the IMDb movie ratings data set then we are going to use the Linear Discriminant Analysis and Quadratic Discriminant Analysis to classify movies by gender given a number of features present in the data.


First lets load the data, explore a couple of rows, data types make some changes and deal with  NA's
```{r}
imdb <- read.csv("data/imdb_rating.csv")
head(imdb)
```

Lets take a closer look to the numeric data by running the command ```summary()``` and ```str()`` to try to find any irregularities in the data determine the data types. 
```{r}
summary(imdb[7:20])
```

Check the structure of the data set and data types
```{r}
str(imdb)
```

From the table above we can see that there is some work to do in order to start working with the imdb data. First we need to change the *genres* column to character type and extract the first entry in order to create our labels. Also on this step we want to find a way to remove or replace missing values for this case we are going to remove them.

```{r}
### change data type and remove NA's from the data
imdb$genres <- as.character(imdb$genres)
imdb <- na.omit(imdb)

### Now we want to extract the first movie genre from the column and find the frequency of the genre in the data set
genre <- sapply( imdb$genres, strsplit,"[|]", USE.NAMES = FALSE )
get_one_genre <- function(entry) for(i in entry) return(i[[1]])
imdb$genres <- unlist(lapply(genre, get_one_genre))
table_genre <- table(imdb$genres)
table_genre
```

The frequency table above we can see that some of the genres don't have enough observations and others are over represented so lets take the observations greater than 50 and less than 400 as this range would give us a set of classes which are proportional to each other. Also we have to check for classes that might represent similar/related genres such as *Crime* and *Horror*

```{r}
### Here we are selecting observations greather than 50 and less than 400 alse
### creating a vector containing the name of the classes that are interested

table_genre_400 <- table_genre[ table_genre > 50 & table_genre < 400 ]
y_classes <- names(table_genre_400[ c(1,2,4) ])
y_classes
```

Now we are going to use those cases to select movies in the genres that we are interested We also want to consider newer movies as most metrics use FB likes, critics reviews and other digital forms of engagement with movie that if we use older data it might not be that accurate

```{r}
### To create ROC plot for two classes 
#y_classes <- y_classes[c(1,3)] 

movies <- imdb[ imdb$genres %in% y_classes, ]
movies <- movies[ movies$year > 2006, ]

### Here we are only selecting genre and other numeric values that we are going to use to create the model

head(movies)
movies <- movies[ c(2,7:18,20) ]
```

------------

## Training, Testing Data Split with mlr package 

------------

```{r}
### First lets selesct two classes so we can later create an ROC plot
y_classes_mlr <- y_classes[ c(1,3) ] 

### Select only movies mad ein the last 10 years within the selected classes  
movies_mlr <- imdb[ imdb$genres %in% y_classes_mlr,]
movies_mlr <- movies_mlr[ movies_mlr$year > 2006, ]

### Disregard non-numeric variables and some correlated variables
movies_mlr <- movies_mlr[c(2,7:18,20)]

### Create a mlr task object using the movies dataset and gender as a target
movies_mlr.task <- makeClassifTask(data = movies_mlr, target = "genres")

### Now we can create the Training (~65%) and Testing (~35%) sets based on the number of samples 
n = getTaskSize(movies_mlr.task)
train_mlr.set = sample( n, size = round(2/3 * n) )
test_mlr.set = setdiff( seq_len(n), train_mlr.set )
```

------------

## Training, Testing Data Split

------------

When building the model it is important to split our data in train and test sets using about $75\%$ of the data for training and the remaining for testing. In this case we are taking a sample of $300$ observations from the larger movies data set

```{r}
### Here we are creating a random sample of size n=100 from the dataset
### the sample function return the index of the selected rows
idx_smp <- sample(seq_len(nrow(movies)), size = 100)

### Now we extract select that datsa using the index of the rows and take another sample
### But this time is to create the Training and Testing sets
movies <- movies[idx_smp,]
smp_size <- floor(0.75 * nrow(movies))
train_ind <- sample(seq_len(nrow(movies)), size = smp_size)

### TRAINING
train_data <- movies[ train_ind, ]

### TESTING
test_data <- movies[ -train_ind, ]

### Labels from the testing set
y_test <- as.factor(test_data['genres']$genres)
test_data['genres'] <- NULL

head(train_data)
```

------------

## Visual Analysis

------------

Lets make some scatter plots to visually identify any variable interactions

```{r}
selected_columns = c('director_fb_likes',
                     'total_cast_likes',
                     'fb_likes',
                     'users_votes',
                     'score',
                     'gross')

#plot(train)
plot( train_data[ selected_columns ] )
```


**Creating Z-scores using the training data**
```{r}
### Uncomment to use all the columns in the dataset
#zscores <- apply(train, 1, function(x) (x - mean(x)) / sd(x))
zscores <- apply( train_data[selected_columns], 1, function(x) (x - mean(x)) / sd(x))
zscores <- data.frame( t(zscores) )
```

**Correlation Plot**
```{r}
corr <- cor(zscores)
corrplot(corr)
```

------------

## Model Building: Linear Discriminat Analysis with mlr

------------

```{r}
### Here we are making a object of type makeLearner using the LDA classifier
lrn = makeLearner("classif.lda", predict.type = "prob")

### Now using the train function in mlr we can train our model using the training set
mod = train(lrn, task = movies_mlr.task, subset = train_mlr.set)

### After trining the model we test it on the testing data
pred = predict(mod, task = movies_mlr.task, subset = test_mlr.set)

### Measure performance of prediction
performance(pred, measures = list(mmce, acc))
```

**Threshold vs. Performance(s) for 2-class classification Table**

The following mlr command generates data on threshold vs. performance(s) for a 2-class classification that can be used for plotting.

```{r}
df = generateThreshVsPerfData(pred, measures = list(fpr, tpr) )
head(df)
```

**Plotting ROCCurves with mlr for 2-class classification**

* The ROC curve is a popular graphic for simultaneously displaying the ROC curve two types of errors for all possible thresholds. 

* The overall performance of a classifier, summarized over all possible thresholds, is given by the area under the (ROC) curve (AUC). 

* An ideal ROC curve will hug the top left corner, so the larger area under the (ROC) curve the AUC the better the classifier.

```{r}
plotROCCurves(df)
```

**Plotting threshold vs. performance with mlr**

```{r}
plotThreshVsPerf(df)
```

------------

## Model Building: Linear Discriminat Analysis with MASS

------------

Lets fit a LDA model for using the training data ( movies after 2006 )

```{r}
### Uncomment to create a model containing only selected columns
#lda.fit = lda( genres ~ ., data = train_data[ c('genres', selected_columns) ])
lda.fit = lda(genres ~ ., data = train_data)
lda.fit
```

The LDA output shows Prior probabilities for each group such as *Adventure* ($\hat{\pi_1} = 0.5066667$), *Biography* ($\hat{\pi_2} = 0.2666667$) and *Horror* ($\hat{\pi_3} = 0.2266667$). From the prior probabilities we can determine that about $50\%$ of the training data correspond to genre *Adventure* the other $50\%$ is split between the other two genres (*Biography*  and *Horros*).

```{r}
plot(lda.fit)
```


```{r}
lda.pred = predict( lda.fit, test_data )
names(lda.pred)
```

* The ```predict()``` function returns a list with three elements. The first element, **class**, contains LDA’s predictions of genre.

* The second element, posterior, is a matrix whose $kth$ column contains the posterior probability that the corresponding observation belongs to the $kth$ class

* $x$ contains the linear discriminant, described earlier.

```{r}
lda.class = lda.pred$class
table(lda.class, y_test)
```


```{r}
mean(lda.class == y_test)
```

------------

## Quadratic Discriminat Analysis

------------

To perform a Quadratic Discriminant Analysis we need to resample the data and follow similar steps that when performing a Linear Discriminant Analysis. But for this case we cant plot the QLD model.

```{r}
### Here we are only selecting genre and other numeric values that we are going to use to create the model

movies <- imdb[ imdb$genres %in% y_classes, ]
movies <- movies[ movies$year > 1996, ]
movies <- movies[ c(2,7:18,20) ]

### Now we extract select that datsa using the index of the rows and take another sample
### But this time is to create the Training and Testing sets
idx_smp <- sample(seq_len(nrow(movies)), size = 500)
movies <- movies[idx_smp,]
smp_size <- floor(0.75 * nrow(movies))
train_ind <- sample(seq_len(nrow(movies)), size = smp_size)

### TRAINING
train_data <- movies[ train_ind, ]

### TESTING
test_data <- movies[ -train_ind, ]

### Labels from the testing set
y_test <- as.factor(test_data['genres']$genres)
test_data['genres'] <- NULL
```

We are going to fit a QDA model that is larger in size as the training data starts from 1996.

```{r}
qda.fit = qda(genres ~ ., data = train_data)
qda.fit
```

For the QDA model the output contains the group means and Prior probabilities of the groups just like in LDA, but it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors. 

The predict() function works in exactly the same fashion as for LDA.

```{r}
qda.pred = predict(qda.fit, test_data)
qda.class = qda.pred$class
table(qda.class, y_test)
```


```{r}
mean(qda.class == y_test)
```
