Carry out the logistic regression (Example 22, page 94) in R.
Formula:
\(y(x) = \frac{1}{1 + \exp(-(a+bx))}\)
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)   # predictor values
y <- c(0, 0, 1, 1, 1, 0)               # binary outcomes
lr <- glm(y ~ x, family = "binomial")  # fit the logistic regression
summary(lr)
##
## Call:
## glm(formula = y ~ x, family = "binomial")
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.8518 -0.9570 1.2583 1.1075 0.9653 -1.5650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8982 1.5811 -0.568 0.570
## x 0.7099 1.0557 0.672 0.501
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 7.8325 on 4 degrees of freedom
## AIC: 11.832
##
## Number of Fisher Scoring iterations: 4
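One way to read the slope estimate (an aside, not part of the original solution) is as a log odds ratio: exponentiating it gives the multiplicative change in the odds of \(y = 1\) for a one-unit increase in \(x\).
exp(coef(lr)["x"])  # odds ratio implied by the slope estimate (illustrative only)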
We can plug these values back into our original equation:
\(y(x) = \frac{1}{1 + \exp(-(-0.8982 + 0.7099x))}\)
lr.probs <- predict(lr,type = "response")
lr.probs
## 1 2 3 4 5 6
## 0.3042349 0.3674359 0.4530738 0.5415825 0.6275427 0.7061303
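These fitted probabilities can also be reproduced by hand as a quick check of the formula above (a minimal sketch using the stored coefficients):
# plug the fitted intercept and slope straight into the logistic formula
1 / (1 + exp(-(coef(lr)[1] + coef(lr)[2] * x)))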
Let's convert these probabilities into class predictions using a 0.5 threshold.
predictions <- as.numeric(lr.probs > 0.5)
outcomes <- data.frame(y, predictions)
outcomes
## y predictions
## 1 0 0
## 2 0 0
## 3 1 0
## 4 1 1
## 5 1 1
## 6 0 1
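A simple cross-tabulation (my addition, not in the original output) summarizes how the thresholded predictions line up with the observed classes:
# confusion matrix of observed classes versus predictions
table(observed = y, predicted = predictions)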
Use the motor car database (mtcars) from the built-in data sets in R to carry out a basic principal component analysis and explain your results.
Here is a great article that gave me some help in setting up the PCA.
First, let's inspect the data.
glimpse(mtcars)  # glimpse() is provided via the dplyr/tibble packages
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
All of our data is numeric, which is essential for PCA, but we also need to check for missingness.
colSums(is.na(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
We have a complete data set, which is rare, but missing values would be problematic for PCA. Next we need to standardize our data so that variables measured on different scales contribute equally; we will use the ‘scale’ function to normalize (it returns a matrix). I wasn't sure whether the row names needed to be removed, but they do not affect the calculation.
mtcars_norm <- scale(mtcars)
head(mtcars_norm)
## mpg cyl disp hp drat
## Mazda RX4 0.1508848 -0.1049878 -0.57061982 -0.5350928 0.5675137
## Mazda RX4 Wag 0.1508848 -0.1049878 -0.57061982 -0.5350928 0.5675137
## Datsun 710 0.4495434 -1.2248578 -0.99018209 -0.7830405 0.4739996
## Hornet 4 Drive 0.2172534 -0.1049878 0.22009369 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345 1.0148821 1.04308123 0.4129422 -0.8351978
## Valiant -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
## wt qsec vs am gear
## Mazda RX4 -0.610399567 -0.7771651 -0.8680278 1.1899014 0.4235542
## Mazda RX4 Wag -0.349785269 -0.4637808 -0.8680278 1.1899014 0.4235542
## Datsun 710 -0.917004624 0.4260068 1.1160357 1.1899014 0.4235542
## Hornet 4 Drive -0.002299538 0.8904872 1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout 0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
## Valiant 0.248094592 1.3269868 1.1160357 -0.8141431 -0.9318192
## carb
## Mazda RX4 0.7352031
## Mazda RX4 Wag 0.7352031
## Datsun 710 -1.1221521
## Hornet 4 Drive -1.1221521
## Hornet Sportabout -0.5030337
## Valiant -1.1221521
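As a quick check of the standardization (an aside, not in the original write-up), every column of the scaled matrix should now have mean 0 and standard deviation 1:
round(colMeans(mtcars_norm), 10)  # column means should be numerically zero
apply(mtcars_norm, 2, sd)         # column standard deviations should be one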
Next we need to calculate the correlation matrix and plot it.
corr_matrix <- cor(mtcars_norm)
ggcorrplot(corr_matrix)  # ggcorrplot() comes from the ggcorrplot package
Now we can do our PCA.
data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9924440 0.7625637 0.164481730 0.079889308 0.0655577780
## Proportion of Variance 0.8640097 0.1265606 0.005888188 0.001389069 0.0009353945
## Cumulative Proportion 0.8640097 0.9905703 0.996458532 0.997847601 0.9987829955
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.0525410926 0.0406708369 0.0248297195 0.0227226207
## Proportion of Variance 0.0006008203 0.0003600084 0.0001341807 0.0001123733
## Cumulative Proportion 0.9993838158 0.9997438241 0.9998780048 0.9999903781
## Comp.10 Comp.11
## Standard deviation 6.649006e-03 2.221126e-08
## Proportion of Variance 9.621877e-06 1.073724e-16
## Cumulative Proportion 1.000000e+00 1.000000e+00
About 99% of the variance is captured by the first two components.
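A scree plot, added here as an illustration using base R's screeplot(), makes the same point visually:
screeplot(data.pca, type = "lines", main = "Scree plot")  # variance of each component
The sharp drop after the second component matches the cumulative proportions shown above.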
data.pca$loadings[, 1:2]
## Comp.1 Comp.2
## mpg 0.3625374 0.01730837
## cyl -0.3739053 0.04854007
## disp -0.3681717 -0.05592734
## hp -0.3296681 0.23787287
## drat 0.2945686 0.26307056
## wt -0.3460884 -0.16151119
## qsec 0.2001137 -0.48313839
## vs 0.3064370 -0.25220365
## am 0.2354162 0.43087065
## gear 0.2075510 0.45009716
## carb -0.2133891 0.39632634
We can see that miles per gallon loads with the opposite sign to cylinders and weight on the first component, which makes sense: heavier cars with more cylinders tend to get fewer miles per gallon.
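For comparison (my aside, not from the article mentioned above), PCA is often applied to the standardized data directly rather than to the correlation matrix; here is a minimal sketch with prcomp(), where the object name pca_alt is my own:
# PCA on the raw data, with centering and scaling handled by prcomp
pca_alt <- prcomp(mtcars, scale. = TRUE)
summary(pca_alt)         # variance explained per component
pca_alt$rotation[, 1:2]  # loadings on the first two components
biplot(pca_alt)          # cars and variables in the plane of the first two components
The leading component directions should tell a broadly similar story, while the variance summaries here are computed from the 32 observations rather than from the correlation matrix.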
Generate a random 4 × 5 matrix and find its singular value decomposition using R.
Here we create our matrix. No seed is set here, so the entries will differ from run to run.
M2 <- matrix(runif(20), nrow = 4)  # 20 uniform draws arranged into a 4 x 5 matrix
M2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1245855 0.25424769 0.7932840 0.32171953 0.7898057
## [2,] 0.9375482 0.17375196 0.4586843 0.36550287 0.4963039
## [3,] 0.7635621 0.31394919 0.8487027 0.11847207 0.7256782
## [4,] 0.6004540 0.07490632 0.4068308 0.02153997 0.9462833
Here we attach the MASS package and then find the singular value decomposition. (Strictly speaking, the ‘svd’ function is part of base R, so MASS is not actually needed for this step.)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
M2.svd <- svd(M2)
M2.svd$d  # the singular values of M2
## [1] 2.3709466 0.6601780 0.4320911 0.2498734
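As a check (a small addition to the exercise), multiplying the three factors back together should reproduce the original matrix:
# U %*% diag(d) %*% t(V) should recover M2 up to floating-point error
M2.recon <- M2.svd$u %*% diag(M2.svd$d) %*% t(M2.svd$v)
all.equal(M2.recon, M2)
If this returns TRUE, the decomposition reconstructs M2 exactly (to numerical precision).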
Next we simulate a small data set in which the response is an exact linear combination of four predictors, and repeat the PCA.
set.seed(99)
x3 <- rnorm(100)                # two standard normal predictors
x4 <- rnorm(100)
x1 <- runif(100, 1, 2)          # two uniform predictors on [1, 2]
x2 <- runif(100, 1, 2)
y <- 5*x1 + 2*x2 + 2*x3 + x4    # response: exact linear combination, no noise
df <- data.frame(y, x1, x2, x3, x4)
head(df)
## y x1 x2 x3 x4
## 1 7.735173 1.209589 1.392185 0.2139625 -1.5250694
## 2 12.062348 1.700581 1.550559 0.4796581 -0.5009917
## 3 8.940530 1.413671 1.454849 0.0878287 -1.2131812
## 4 8.205732 1.030847 1.397029 0.4438585 -0.6302768
## 5 5.275449 1.017241 1.181204 -0.3628379 -1.4474855
## 6 8.610392 1.256236 1.125386 0.1226740 -0.1669084
df_norm <- scale(df)
head(df_norm)
## y x1 x2 x3 x4
## [1,] -0.8819809 -0.8488776 -0.42741399 0.3530242 -1.34817031
## [2,] 0.7700503 0.8266679 0.08343516 0.6479987 -0.38765148
## [3,] -0.4217990 -0.1524340 -0.22528460 0.2129908 -1.05563932
## [4,] -0.7023305 -1.4588499 -0.41178946 0.6082541 -0.50891251
## [5,] -1.8210557 -1.5052821 -1.10795451 -0.2873379 -1.27540162
## [6,] -0.5478393 -0.6896919 -1.28800062 0.2516760 -0.07430289
corr_matrix <- cor(df_norm)
ggcorrplot(corr_matrix)
Now we can do our PCA.
data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.5082216 0.4921451 0.3702481 0.23427070 1.825012e-08
## Proportion of Variance 0.3730011 0.3497760 0.1979655 0.07925738 4.809891e-16
## Cumulative Proportion 0.3730011 0.7227771 0.9207426 1.00000000 1.000000e+00
data.pca$loadings[, 1:3]
## Comp.1 Comp.2 Comp.3
## y 0.04908436 0.3889408 0.0307745
## x1 -0.76523885 0.1736418 0.2471880
## x2 0.04671314 -0.4608059 -0.6856569
## x3 0.36740532 0.7349768 -0.2875830
## x4 0.52424368 -0.2569839 0.6205817
To be honest, this isn’t what I expected. I thought that in the first component, \(y\) would have a positive association with \(x_1\) and \(x_2\).
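Part of the explanation may be that princomp() is applied here to the 5 × 5 correlation matrix itself rather than to the standardized observations, so the components summarize the columns of that matrix. A minimal sketch of running the PCA on the simulated data directly, for comparison (the object name pca_df is my own):
# PCA on the simulated data itself, with scaling handled by prcomp
pca_df <- prcomp(df, scale. = TRUE)
summary(pca_df)
pca_df$rotation[, 1:2]  # loadings on the first two components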