```r
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

#1) Question 4.7.6 pg 170 Suppose we collect data for a group of students in a statistics class with variables \(X_{1}\) = hours studied, \(X_{2}\) = undergrad GPA, and \(Y\) = receive an A. We fit a logistic regression and produce the estimated coefficients \(\hat{\beta}_0 = -6, \hat{\beta}_1 = 0.05, \hat{\beta}_2 = 1\).

##Problem 6 a) Estimate the probability that a student who studies for 40 h and has an undergrad GPA of 3.5 gets an A in the class.

##Given:

\(\hat{\beta}_0 = -6 , \hat{\beta}_1 = 0.05 , \hat{\beta}_2 = 1\)

\(X_{1} = 40, X_{2} = 3.5\)

##So the expression would look like this:

\(\hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2} + \hat{\beta}_0\)

##So now just plug all the given values in

\(0.05(40) + 1(3.5) - 6 = -0.5\)

##Now plug -0.5 into the following expression and evaluate it in order to make the prediction

\(\hat{p}(k) = \frac{e^{k}}{1+e^{k}}\)

##when k = -0.5

##then

\(\hat{p}(-0.5) = \frac{e^{-0.5}}{1+e^{-0.5}}\)

\(\hat{p}(-0.5) = 0.3775407\)

##Therefore a student who studies 40 hours and has an undergrad GPA of 3.5 has a probability of roughly 37.75% of receiving an A in the class
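
As a quick check, the same calculation can be reproduced in R; `plogis()` is base R's logistic function, i.e. \(e^{k}/(1+e^{k})\):

```r
#estimated coefficients from the fitted logistic regression
b0 <- -6; b1 <- 0.05; b2 <- 1

#log-odds for 40 hours studied and a 3.5 GPA
k <- b0 + b1*40 + b2*3.5

#plogis(k) = exp(k)/(1 + exp(k))
plogis(k)
## [1] 0.3775407
```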

##Problem 6 b) How many hours would the student in part (a) need to study to have a 50% chance of getting an A in the class?

In order to do this we have to work backwards: starting from where part (a) ended, we set the predicted probability equal to 0.5 and solve for the log-odds \(k\).

\(\frac{1}{2} = \frac{e^{k}}{1+e^{k}}\)

\(1 + e^{k} = 2e^{k}\)

\(1 = e^{k}\)

\(\ln(1) = \ln(e^{k})\)

\(\ln(1) = k\)

\(0 = k\)

##So now we can set the original expression equal to 0:

\(\hat{\beta}_{1}X_{1} + \hat{\beta}_{2}X_{2} + \hat{\beta}_0 = 0\)

##Now we just need to solve for \(X_{1}\)

\(0.05X_{1} + 1(3.5) + (-6) = 0\)

\(0.05X_{1} - 2.5 = 0\)

\(0.05X_{1} = 2.5\)

\(X_{1} = \frac{2.5}{0.05}\)

\(X_{1} = 50\)

##So if the student wanted a probability of 50% of receiving an A in the class they would have to study for 50 hours
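
The same answer falls out of a one-line R check: set the log-odds to zero and solve for \(X_{1}\).

```r
#hours needed so that b0 + b1*x1 + b2*3.5 = 0 (i.e., p = 0.5)
b0 <- -6; b1 <- 0.05; b2 <- 1
(0 - b0 - b2*3.5) / b1
## [1] 50
```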

#2) Question 4.7.7 pg 170

##Companies that do not give out dividends

##Given:

\(\bar{X} = 0\)

\(\sigma^{2} = 36\)

\(X = 4\)

##Lets plug the given values into the given function:

\(f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}\)

\(f(4) = \frac{1}{\sqrt{2\pi(36)}}e^{\frac{-(4-0)^{2}}{2(36)}}\)

\(= \frac{1}{6\sqrt{2\pi}}e^{\frac{-16}{72}}\)

\(= \frac{1}{6\sqrt{2\pi}}e^{\frac{-2}{9}}\)

\(f(4|dividend=No) = 0.05324133\)

##Companies that give out dividends

##Given:


\(\bar{X}=10\)

\(\sigma^{2}=36\)

\(X = 4\)

##Lets plug the given values into the given function:

\(f(x) = \frac{1}{\sqrt{2\pi\sigma^{2}}}e^{\frac{-(x-\mu)^{2}}{2\sigma^{2}}}\)

\(f(4) = \frac{1}{\sqrt{2\pi(36)}}e^{\frac{-(4-10)^{2}}{2(36)}}\)

\(= \frac{1}{6\sqrt{2\pi}}e^{\frac{-36}{72}}\)

\(= \frac{1}{6\sqrt{2\pi}}e^{\frac{-1}{2}}\)

\(f(4|Dividend = Yes) = 0.04032845\)

##Now as the hint specifies I will need to use Bayes’ Theorem

\(P(dividend = Yes | X = 4) = \frac{f(4|dividend=Yes)*P(dividend = Yes)}{f(4|dividend=No)*P(dividend = No) + f(4|dividend=Yes)*P(dividend = Yes)}\)

##Now I will plug the values into the expression on the right-hand side:

\(\frac{0.04032845(0.80)}{0.05324133(0.20) + 0.04032845(0.80)}\)

\(P(dividend = Yes | X = 4) = 0.7518524\)

##Therefore the probability that the company will pay out a dividend this year, given that its percentage profit was \(X = 4\) last year, is roughly 75.19% (rounded)
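
As a sanity check, the whole calculation can be reproduced in R with `dnorm()`, since both class-conditional densities are normal with \(\sigma = \sqrt{36} = 6\):

```r
#class-conditional normal densities evaluated at X = 4
f_yes <- dnorm(4, mean = 10, sd = 6)
f_no  <- dnorm(4, mean = 0,  sd = 6)

#Bayes' theorem with priors P(Yes) = 0.80 and P(No) = 0.20
f_yes*0.80 / (f_yes*0.80 + f_no*0.20)
## [1] 0.7518524
```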

#3) Continue from Homework 3 & 4 using the Weekly dataset from 4.7.10), fit a model (using the predictors chosen for previous homework) for classification using the MclustDA function from the mclust-package.

#Summary of MclustDA Model

i) Do a summary of your model.

In Homework 4, the last model I looked at for the Weekly data set used all the lag variables interacting with each other (`Lag1*Lag2*Lag3*Lag4*Lag5`) to predict Direction (Up or Down). Now I will use the same covariates, Lag1 through Lag5, to predict Direction of the Weekly data set with an MclustDA model, and output the summary of the model.

```r
library(MASS)
#install.packages('mclust')
library(mclust)
library(ISLR)
data(Weekly)

mclust.mod1 <- MclustDA(Weekly[,2:6], class=Weekly[,9], G=1:15)

summary(mclust.mod1)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood    n df       BIC
##       -11781.72 1089 40 -23843.16
##        
## Classes   n     % Model G
##    Down 484 44.44   VII 3
##    Up   605 55.56   VII 3
## 
## Training confusion matrix:
##       Predicted
## Class  Down  Up
##   Down  161 323
##   Up    125 480
## Classification error = 0.4114 
## Brier score          = 0.239
```

#Model selection by BIC

- What is the best model selected by BIC? Report the model name and the BIC. (See https://www.rdocumentation.org/packages/mclust/versions/5.4/topics/mclustModelNames)

The model selected by MclustDA is VII with 3 groups for each class. The BIC is -23843.16. Below I vary these results by examining BIC values across numbers of groups and model types.
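
The three-letter code can be decoded with mclust's `mclustModelNames()` function (linked above); VII corresponds to spherical components whose volume varies across groups (\(\Sigma_k = \lambda_k I\)).

```r
#decode the covariance-structure code reported in the summary
mclustModelNames('VII')
```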

I will subset the data set to only the rows with the Up direction, fit Mclust (without the DA portion), and output a graph of BIC vs. the number of groups, categorized by model name.

```r
#Subsetting the data
up <- subset(Weekly, Weekly$Direction == 'Up')

#Fitting the data to Mclust (unsupervised, so no class labels are passed)
mclust.mod1.up <- Mclust(up[,2:6], G=1:9)

#Plot of BIC vs no. groups for each model type.
plot(mclust.mod1.up, what='BIC')
```

#Testing and training error rates

- What is the training error? What is the test error?

As in the previous homework, the data set is split into training and testing sets: the training set contains the years 1990 through 2008, and the testing set contains the years 2009 and 2010.

```r
#subsetting the data for years 1990 through 2008 - Training Data
weekly_training <- subset(Weekly, Year >= 1990 & Year <= 2008)

#subsetting the data for years 2009 and 2010 - Testing data
weekly_testing <- subset(Weekly, Year > 2008)

mclust.mod2 <- MclustDA(weekly_training[,2:6], weekly_training[,9], G=1:9)

summary(mclust.mod2, newdata=weekly_testing[,2:6], newclass = weekly_testing[,9])
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -10471.14 985 33 -21169.73
##        
## Classes   n     % Model G
##    Down 441 44.77   VII 3
##    Up   544 55.23   VII 2
## 
## Training confusion matrix:
##       Predicted
## Class  Down  Up
##   Down  185 256
##   Up    165 379
## Classification error = 0.4274 
## Brier score          = 0.243 
## 
## Test confusion matrix:
##       Predicted
## Class  Down Up
##   Down   20 23
##   Up     24 37
## Classification error = 0.4519 
## Brier score          = 0.2675
```

The training error is 0.4274 and the testing error is 0.4519, as reported in the summary above.
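
These error rates can also be computed directly from the `predict` method; `predict.MclustDA()` returns a list whose `classification` element holds the predicted labels:

```r
#training error: fitted classes vs. observed Direction
train_pred <- predict(mclust.mod2, newdata = weekly_training[,2:6])
mean(train_pred$classification != weekly_training$Direction)

#test error: predictions on the held-out 2009-2010 weeks
test_pred <- predict(mclust.mod2, newdata = weekly_testing[,2:6])
mean(test_pred$classification != weekly_testing$Direction)
```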

#True Positive and True Negative Rates

- Report the True Positive Rate and the True Negative Rate.

```r
#rates from the confusion matrices above, treating 'Up' as the positive class
cat('Testing True Positive Rate: ', 37/(24+37))
## Testing True Positive Rate:  0.6065574
cat('Testing True Negative Rate: ', 20/(20+23))
## Testing True Negative Rate:  0.4651163
cat('Training True Positive Rate: ', 379/(165+379))
## Training True Positive Rate:  0.6966912
cat('Training True Negative Rate: ', 185/(185+256))
## Training True Negative Rate:  0.4195011
```
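
Since the same two rates are read off a confusion matrix several times below, a small helper makes the convention explicit (an illustrative function, not part of mclust; true classes are in rows and the second level, here Up, is treated as the positive class):

```r
#TPR = TP/(TP + FN); TNR = TN/(TN + FP), with true classes in rows
rates <- function(tab) {
  c(TPR = tab[2,2] / sum(tab[2,]),
    TNR = tab[1,1] / sum(tab[1,]))
}

#example: the test confusion matrix above
test_tab <- matrix(c(20, 24, 23, 37), nrow = 2,
                   dimnames = list(Class = c('Down','Up'),
                                   Predicted = c('Down','Up')))
rates(test_tab)
##       TPR       TNR 
## 0.6065574 0.4651163
```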

#Running model with EDDA type

ii) Specify modelType="EDDA" and run MclustDA again. Do a summary of your model.

```r
mclust.mod2.edda <- MclustDA(weekly_training[,2:6], weekly_training[,9], modelType='EDDA')
summary(mclust.mod2.edda)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -11026.14 985 11 -22128.11
##        
## Classes   n     % Model G
##    Down 441 44.77   EII 1
##    Up   544 55.23   EII 1
## 
## Training confusion matrix:
##       Predicted
## Class  Down  Up
##   Down   65 376
##   Up     62 482
## Classification error = 0.4447 
## Brier score          = 0.2449
```

#BIC Selected Model

- What is the best model selected by BIC?

The best model based on BIC is EII with G = 1 for both classes. The BIC is -22128.11.

#Training and Testing Error Rates

- Find the training and test error rates.

Now I will refit the model using the previously split data and model type EDDA.

```r
mclust.mod2.edda <- MclustDA(weekly_training[,2:6], weekly_training[,9], modelType="EDDA")

summary(mclust.mod2.edda, newdata=weekly_testing[,2:6], newclass = weekly_testing[,9])
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -11026.14 985 11 -22128.11
##        
## Classes   n     % Model G
##    Down 441 44.77   EII 1
##    Up   544 55.23   EII 1
## 
## Training confusion matrix:
##       Predicted
## Class  Down  Up
##   Down   65 376
##   Up     62 482
## Classification error = 0.4447 
## Brier score          = 0.2449 
## 
## Test confusion matrix:
##       Predicted
## Class  Down Up
##   Down    9 34
##   Up     13 48
## Classification error = 0.4519 
## Brier score          = 0.2466
```

Using the EDDA model type, the training error rate is 0.4447 and the testing error rate is 0.4519.

#True Positive and True Negative Rates

- Report the True Positive and True Negative Rate.

```r
#rates from the EDDA confusion matrices above ('Up' is the positive class)
cat('Testing True Positive Rate: ', 48/(13+48))
## Testing True Positive Rate:  0.7868852
cat('Testing True Negative Rate: ', 9/(9+34))
## Testing True Negative Rate:  0.2093023
cat('Training True Positive Rate: ', 482/(62+482))
## Training True Positive Rate:  0.8860294
cat('Training True Negative Rate: ', 65/(65+376))
## Training True Negative Rate:  0.1473923
```

iii) Compare the results with Homework #3 & 4. Which method performed the best? Justify your answer. *Here you need to list the previous methods and their corresponding rates.*

The test error rates for the corresponding models from Homework 4, using all the lag variable interactions, are:

- GLM Error Rate: 0.2403846
- LDA Error Rate: 0.4807692
- QDA Error Rate: 0.5673077
- KNN Error Rate: 0.4519231
- MclustDA Error Rate: 0.4519231
- MclustDA (EDDA) Error Rate: 0.4519231

It appears that the lowest error rate is produced using the GLM method.

#4) Continue from Homework 3 & 4 using the Auto dataset from 4.7.11). Fit a classification model (using the predictors chosen for previous homework) using the MclustDA function from the mclust-package. Use the same training and test set from previous homework assignments.

#Summary of MclustDA Model for Auto Dataset

i)  Do a summary of your model.

In Homework 4, I modeled mpg01 as explained by weight, acceleration, and year, because some of the other covariates were strongly correlated among themselves.

```r
data(Auto)

#calculating and assigning the median value of mpg from the auto data set
med_value <- median(Auto$mpg)

#converting Auto data set into a data.frame
auto_dat <- as.data.frame(Auto)

#assigning the binary response
auto_dat$mpg01 <- ifelse(auto_dat$mpg > med_value, 1, 0)

auto_dat <- subset(auto_dat, select = c(weight, acceleration, year, mpg01))

mclust.mod.auto1 <- MclustDA(auto_dat[,1:3], class=auto_dat[,4], G=1:9)

summary(mclust.mod.auto1)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -4986.395 392 44 -10235.53
##        
## Classes   n  % Model G
##       0 196 50   VEI 5
##       1 196 50   EVI 3
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 180  16
##     1  11 185
## Classification error = 0.0689 
## Brier score          = 0.0533
```

#BIC and Model type

- What is the best model selected by BIC? Report the model name and BIC.

It appears the best model selected by BIC is VEI with 5 groups for class 0 (below the median value of mpg) and EVI with 3 groups for class 1 (above the median value of mpg). The BIC is -10235.53.

#Testing and training error

- What is the training error? What is the test error?

Now I will split the data set into 70% training and 30% testing, then fit the model and compare the training and testing errors.

```r
library(MASS)
set.seed(123)
#testing and training data split
index <- sample(x=nrow(auto_dat), size=0.70*nrow(auto_dat))
training <- auto_dat[index,]
testing <- auto_dat[-index,]

mclust.mod.auto2 <- MclustDA(training[,1:3], training[,4], G=1:9)

summary(mclust.mod.auto2, newdata=testing[,1:3], newclass = testing[,4])
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## MclustDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -3511.006 274 32 -7201.633
##        
## Classes   n     % Model G
##       0 136 49.64   EEI 3
##       1 138 50.36   EVI 3
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 126  10
##     1   6 132
## Classification error = 0.0584 
## Brier score          = 0.0442 
## 
## Test confusion matrix:
##      Predicted
## Class  0  1
##     0 53  7
##     1  6 52
## Classification error = 0.1102 
## Brier score          = 0.0832
```

The training error is 0.0584 and the testing error is 0.1102, as reported in the summary above.
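
As a quick check, the test error can also be computed from predicted labels via the `predict` method, using the `training`/`testing` split created above:

```r
#test error for the Auto model, computed from predicted labels
auto_pred <- predict(mclust.mod.auto2, newdata = testing[,1:3])
mean(auto_pred$classification != testing$mpg01)
```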

#True Positive and True Negative Rates

- Report the True Positive and True Negative Rate.

```r
#rates from the confusion matrices above, treating mpg01 = 1 as the positive class
cat('Testing True Positive Rate: ', 52/(6+52))
## Testing True Positive Rate:  0.8965517
cat('Testing True Negative Rate: ', 53/(53+7))
## Testing True Negative Rate:  0.8833333
cat('Training True Positive Rate: ', 132/(6+132))
## Training True Positive Rate:  0.9565217
cat('Training True Negative Rate: ', 126/(126+10))
## Training True Negative Rate:  0.9264706
```

#EDDA Model Type for Auto Data set

ii) Specify modelType="EDDA" and run MclustDA again. Do a summary of your model.

```r
mclust.mod.auto.edda <- MclustDA(auto_dat[,1:3], class=auto_dat[,4], modelType='EDDA')

summary(mclust.mod.auto.edda)
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -5101.022 392 18 -10309.53
##        
## Classes   n  % Model G
##       0 196 50   VVV 1
##       1 196 50   VVV 1
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 171  25
##     1   9 187
## Classification error = 0.0867 
## Brier score          = 0.0658
```

#BIC and Model Type

- What is the best model selected by BIC?

It appears the best model type for both classes is VVV with G = 1. The BIC is -10309.53.

#Training and testing errors for EDDA model type for the Auto data set

- Find the training and test error rates.

```r
mclust.mod.auto.edda2 <- MclustDA(training[,1:3], training[,4], modelType='EDDA')

summary(mclust.mod.auto.edda2, newdata=testing[,1:3], newclass = testing[,4])
## ------------------------------------------------ 
## Gaussian finite mixture model for classification 
## ------------------------------------------------ 
## 
## EDDA model summary: 
## 
##  log-likelihood   n df       BIC
##       -3569.424 274 18 -7239.884
##        
## Classes   n     % Model G
##       0 136 49.64   VVV 1
##       1 138 50.36   VVV 1
## 
## Training confusion matrix:
##      Predicted
## Class   0   1
##     0 119  17
##     1   5 133
## Classification error = 0.0803 
## Brier score          = 0.0606 
## 
## Test confusion matrix:
##      Predicted
## Class  0  1
##     0 55  5
##     1  6 52
## Classification error = 0.0932 
## Brier score          = 0.0758
```

The training error is 0.0803 and the testing error is 0.0932, as reported in the summary above.

#True Positive and True Negative Rates for EDDA Model Type for Auto Dataset

- Report the True Positive and True Negative Rate.

```r
#rates from the EDDA confusion matrices above (mpg01 = 1 is the positive class)
cat('Testing True Positive Rate: ', 52/(6+52))
## Testing True Positive Rate:  0.8965517
cat('Testing True Negative Rate: ', 55/(55+5))
## Testing True Negative Rate:  0.9166667
cat('Training True Positive Rate: ', 133/(5+133))
## Training True Positive Rate:  0.9637681
cat('Training True Negative Rate: ', 119/(119+17))
## Training True Negative Rate:  0.875
```

#Model Comparison by Error Rates

iii) Compare the results with Homework 3 & 4. Which method performed the best? Justify your answer. Here you need to list the previous methods and their corresponding rates.

- GLM Error Rate: 0.1101695
- LDA Error Rate: 0.1355932
- QDA Error Rate: 0.09322034
- KNN Error Rate: 0.1186441
- MclustDA Error Rate: 0.1101695
- MclustDA (EDDA) Error Rate: 0.09322034

It appears that QDA and MclustDA with the EDDA model type tie for the lowest testing error rate, at roughly 0.0932.

#5) Read the paper “Who Wrote Ronald Reagan’s Radio Addresses?” posted on D2L. Write a one page (no more, no less) summary. You may use 1.5 or double spacing.

##1 Page Summary (Also uploaded as a word document)

“Who Wrote Ronald Reagan’s Radio Addresses?” is a research study that set out to determine exactly what its title asks: did Reagan himself write his radio addresses, or did someone else write the speeches? During his campaign for president of the United States, Reagan delivered over 1000 radio broadcasts. However, the researchers had only 600 broadcasts that are directly linked to Reagan as the author, leaving 312 that cannot be directly attributed to him.

The researchers used these 600 broadcasts as the basis for developing a model to predict the authorship of the remaining 312. To perform the classification work, they pinpointed three types of features: words, n-grams, and semantics. These three main characteristics of a given author led to six different pools of features. Semantic features, for example, have been studied by other researchers, giving rise to the idea known as the representational theory of composition. This theory proposes that semantic features provide insight because sets of word patterns can be attributed to an author.

Many different models were fitted and tested using cross-validation, including linear discriminant analysis, quadratic discriminant analysis, logistic regression, support vector machines, and k-nearest neighbors. When all the different predictive models were compared on the 312 broadcasts of uncertain authorship, they agreed on 135 of them. When fully Bayesian models were implemented, they agreed on another 154 of the speeches. When the models were tested with cross-validation on the broadcasts of known authorship, the error rate was less than 10% in all cases. Ultimately, this led to the conclusion that 167 of the 312 broadcasts can be predicted to have been authored by Ronald Reagan.

#6) Last homework you chose a dataset from this website. Please do some initial exploration of the dataset. If you don’t like the dataset you chose you may change it with another. It has to be a new dataset that we haven’t used in class. Please report the analysis you did and discuss the challenges with analyzing the data. Any plots for this question need to be done using only GGplot2-based plots.

I will be choosing a different data set than the one I chose in Homework 4. This one is called Breast Cancer and can be found here: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer

I will load the data set and assign the column names. To make sure it loaded correctly and the names are right, I will output the first five observations.

```r
#install.packages('data.table')
library(data.table)

bc.dat <- fread('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data', col.names=c('class','age','menopause','tumorsize','invnodes','nodecaps','degmalig','breast','breastquad','irradiant'))

head(bc.dat, 5)

library(tidyverse)
```


```r
ggplot(bc.dat, aes(x=age, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Age Grouped by Class', x='Age', y='Frequency')

ggplot(bc.dat, aes(x=breastquad, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Breast Quadrant Grouped by Class', x='Breast Quadrant', y='Count')

ggplot(bc.dat, aes(x=irradiant, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Irradiated Grouped by Class', x='Irradiated', y='Count')

ggplot(bc.dat, aes(x=degmalig, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Degree of Malignancy Grouped by Class', x='Degree of Malignancy', y='Count')

ggplot(bc.dat, aes(x=tumorsize, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Tumor Size Grouped by Class', x='Tumor Size', y='Count')

ggplot(bc.dat, aes(x=invnodes, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Inv Nodes Grouped by Class', x='Inv Nodes', y='Count')

ggplot(bc.dat, aes(x=menopause, fill=class)) + geom_bar(position='dodge') + labs(title='Histogram of Menopause Status Grouped by Class', x='Menopause Group', y='Count')
```

It appears age has a roughly normal distribution in both classes. The breast quadrant where the tumor is located looks close to the same for each class. The degree-of-malignancy histogram is interesting: the most frequent degree for no-recurrence-events is 2, while for recurrence-events it is 3. The tumor size histogram is also somewhat interesting, because no-recurrence-events appear most frequently at 10-14, whereas the recurrence class is roughly the same between 0-4, 5-9, and 10-14. Inv-nodes has the greatest number of occurrences for both classes at 0-2, and both classes steadily decrease as inv-nodes increases. The menopause groups with the highest frequencies for both classes are ge40 and pre-menopause; lt40 has substantially low occurrences for both classes.
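
To back up the degree-of-malignancy observation numerically, here is a minimal dplyr sketch (dplyr is loaded with tidyverse above; the column names are the ones assigned when the data were read in) that computes the share of each degmalig level within each class:

```r
#proportion of each degree-of-malignancy level within each class
bc.dat %>%
  count(class, degmalig) %>%
  group_by(class) %>%
  mutate(prop = round(n / sum(n), 3))
```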