Carry out the logistic regression (Example 22, page 94) in R.
Formula:
\(y(x) = \frac{1}{1 + \exp(-(a+bx))}\)
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)   # predictor values
y <- c(0, 0, 1, 1, 1, 0)               # binary outcomes
lr <- glm(y ~ x, family = "binomial")  # fit the logistic regression
summary(lr)
##
## Call:
## glm(formula = y ~ x, family = "binomial")
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.8518 -0.9570 1.2583 1.1075 0.9653 -1.5650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8982 1.5811 -0.568 0.570
## x 0.7099 1.0557 0.672 0.501
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 7.8325 on 4 degrees of freedom
## AIC: 11.832
##
## Number of Fisher Scoring iterations: 4
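One way to read the slope estimate (an aside, not part of the original solution) is as a log odds ratio: exponentiating it gives the multiplicative change in the odds of \(y = 1\) for a one-unit increase in \(x\).
exp(coef(lr)["x"])  # odds ratio implied by the slope estimate (illustrative only)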
We can plug these values back into our original equation:
\(y(x) = \frac{1}{1 + \exp(-(-0.8982 + 0.7099x))}\)
lr.probs <- predict(lr,type = "response")
lr.probs
## 1 2 3 4 5 6
## 0.3042349 0.3674359 0.4530738 0.5415825 0.6275427 0.7061303
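These fitted probabilities can also be reproduced by hand as a quick check of the formula above (a minimal sketch using the stored coefficients):
# plug the fitted intercept and slope straight into the logistic formula
1 / (1 + exp(-(coef(lr)[1] + coef(lr)[2] * x)))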
Let's convert these probabilities into class predictions using a 0.5 threshold.
predictions <- as.numeric(lr.probs > 0.5)
outcomes <- data.frame(y, predictions)
outcomes
## y predictions
## 1 0 0
## 2 0 0
## 3 1 0
## 4 1 1
## 5 1 1
## 6 0 1
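A simple cross-tabulation (my addition, not in the original output) summarizes how the thresholded predictions line up with the observed classes:
# confusion matrix of observed classes versus predictions
table(observed = y, predicted = predictions)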
Use the motor car database (mtcars) from the built-in data sets in R to carry out a basic principal component analysis and explain your results.
Here is a great article that gave me some help in setting up the PCA.
First, let's inspect the data.
glimpse(mtcars)  # glimpse() is provided via the dplyr/tibble packages
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
All of our data is numeric, which is essential for PCA, but we also need to check for missingness.
colSums(is.na(mtcars))
## mpg cyl disp hp drat wt qsec vs am gear carb
## 0 0 0 0 0 0 0 0 0 0 0
We have a complete data set, which is rare, but missing values would be problematic for PCA. Next we need to standardize our data so that variables measured on different scales contribute equally; we will use the ‘scale’ function to normalize (it returns a matrix). I wasn't sure whether the row names needed to be removed, but they do not affect the calculation.
mtcars_norm <- scale(mtcars)
head(mtcars_norm)
## mpg cyl disp hp drat
## Mazda RX4 0.1508848 -0.1049878 -0.57061982 -0.5350928 0.5675137
## Mazda RX4 Wag 0.1508848 -0.1049878 -0.57061982 -0.5350928 0.5675137
## Datsun 710 0.4495434 -1.2248578 -0.99018209 -0.7830405 0.4739996
## Hornet 4 Drive 0.2172534 -0.1049878 0.22009369 -0.5350928 -0.9661175
## Hornet Sportabout -0.2307345 1.0148821 1.04308123 0.4129422 -0.8351978
## Valiant -0.3302874 -0.1049878 -0.04616698 -0.6080186 -1.5646078
## wt qsec vs am gear
## Mazda RX4 -0.610399567 -0.7771651 -0.8680278 1.1899014 0.4235542
## Mazda RX4 Wag -0.349785269 -0.4637808 -0.8680278 1.1899014 0.4235542
## Datsun 710 -0.917004624 0.4260068 1.1160357 1.1899014 0.4235542
## Hornet 4 Drive -0.002299538 0.8904872 1.1160357 -0.8141431 -0.9318192
## Hornet Sportabout 0.227654255 -0.4637808 -0.8680278 -0.8141431 -0.9318192
## Valiant 0.248094592 1.3269868 1.1160357 -0.8141431 -0.9318192
## carb
## Mazda RX4 0.7352031
## Mazda RX4 Wag 0.7352031
## Datsun 710 -1.1221521
## Hornet 4 Drive -1.1221521
## Hornet Sportabout -0.5030337
## Valiant -1.1221521
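As a quick check of the standardization (an aside, not in the original write-up), every column of the scaled matrix should now have mean 0 and standard deviation 1:
round(colMeans(mtcars_norm), 10)  # column means should be numerically zero
apply(mtcars_norm, 2, sd)         # column standard deviations should be one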
Next we need to calculate the correlation matrix and plot it.
corr_matrix <- cor(mtcars_norm)
ggcorrplot(corr_matrix)  # ggcorrplot() comes from the ggcorrplot package
Now we can do our PCA.
data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9924440 0.7625637 0.164481730 0.079889308 0.0655577780
## Proportion of Variance 0.8640097 0.1265606 0.005888188 0.001389069 0.0009353945
## Cumulative Proportion 0.8640097 0.9905703 0.996458532 0.997847601 0.9987829955
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.0525410926 0.0406708369 0.0248297195 0.0227226207
## Proportion of Variance 0.0006008203 0.0003600084 0.0001341807 0.0001123733
## Cumulative Proportion 0.9993838158 0.9997438241 0.9998780048 0.9999903781
## Comp.10 Comp.11
## Standard deviation 6.649006e-03 2.221126e-08
## Proportion of Variance 9.621877e-06 1.073724e-16
## Cumulative Proportion 1.000000e+00 1.000000e+00
About 99% of the variance is captured by the first two components.
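A scree plot, added here as an illustration using base R's screeplot(), makes the same point visually:
screeplot(data.pca, type = "lines", main = "Scree plot")  # variance of each component
The sharp drop after the second component matches the cumulative proportions shown above.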
data.pca$loadings[, 1:2]
## Comp.1 Comp.2
## mpg 0.3625374 0.01730837
## cyl -0.3739053 0.04854007
## disp -0.3681717 -0.05592734
## hp -0.3296681 0.23787287
## drat 0.2945686 0.26307056
## wt -0.3460884 -0.16151119
## qsec 0.2001137 -0.48313839
## vs 0.3064370 -0.25220365
## am 0.2354162 0.43087065
## gear 0.2075510 0.45009716
## carb -0.2133891 0.39632634
We can see that miles per gallon loads with the opposite sign to cylinders and weight on the first component, which makes sense: heavier cars with more cylinders tend to get fewer miles per gallon.
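For comparison (my aside, not from the article mentioned above), PCA is often applied to the standardized data directly rather than to the correlation matrix; here is a minimal sketch with prcomp(), where the object name pca_alt is my own:
# PCA on the raw data, with centering and scaling handled by prcomp
pca_alt <- prcomp(mtcars, scale. = TRUE)
summary(pca_alt)         # variance explained per component
pca_alt$rotation[, 1:2]  # loadings on the first two components
biplot(pca_alt)          # cars and variables in the plane of the first two components
The leading component directions should tell a broadly similar story, while the variance summaries here are computed from the 32 observations rather than from the correlation matrix.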
Generate a random 4 × 5 matrix and find its singular value decomposition using R.
Here we create our matrix. No seed is set here, so the entries will differ from run to run.
M2 <- matrix(runif(20), nrow = 4)  # 20 uniform draws arranged into a 4 x 5 matrix
M2
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.1245855 0.25424769 0.7932840 0.32171953 0.7898057
## [2,] 0.9375482 0.17375196 0.4586843 0.36550287 0.4963039
## [3,] 0.7635621 0.31394919 0.8487027 0.11847207 0.7256782
## [4,] 0.6004540 0.07490632 0.4068308 0.02153997 0.9462833
Here we attach the MASS package and then find the singular value decomposition. (Strictly speaking, the ‘svd’ function is part of base R, so MASS is not actually needed for this step.)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
M2.svd <- svd(M2)
M2.svd$d  # the singular values of M2
## [1] 2.3709466 0.6601780 0.4320911 0.2498734
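As a check (a small addition to the exercise), multiplying the three factors back together should reproduce the original matrix:
# U %*% diag(d) %*% t(V) should recover M2 up to floating-point error
M2.recon <- M2.svd$u %*% diag(M2.svd$d) %*% t(M2.svd$v)
all.equal(M2.recon, M2)
If this returns TRUE, the decomposition reconstructs M2 exactly (to numerical precision).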
Next we simulate a small data set in which the response is an exact linear combination of four predictors, and repeat the PCA.
set.seed(99)
x3 <- rnorm(100)                # two standard normal predictors
x4 <- rnorm(100)
x1 <- runif(100, 1, 2)          # two uniform predictors on [1, 2]
x2 <- runif(100, 1, 2)
y <- 5*x1 + 2*x2 + 2*x3 + x4    # response: exact linear combination, no noise
df <- data.frame(y, x1, x2, x3, x4)
head(df)
## y x1 x2 x3 x4
## 1 7.735173 1.209589 1.392185 0.2139625 -1.5250694
## 2 12.062348 1.700581 1.550559 0.4796581 -0.5009917
## 3 8.940530 1.413671 1.454849 0.0878287 -1.2131812
## 4 8.205732 1.030847 1.397029 0.4438585 -0.6302768
## 5 5.275449 1.017241 1.181204 -0.3628379 -1.4474855
## 6 8.610392 1.256236 1.125386 0.1226740 -0.1669084
df_norm <- scale(df)
head(df_norm)
## y x1 x2 x3 x4
## [1,] -0.8819809 -0.8488776 -0.42741399 0.3530242 -1.34817031
## [2,] 0.7700503 0.8266679 0.08343516 0.6479987 -0.38765148
## [3,] -0.4217990 -0.1524340 -0.22528460 0.2129908 -1.05563932
## [4,] -0.7023305 -1.4588499 -0.41178946 0.6082541 -0.50891251
## [5,] -1.8210557 -1.5052821 -1.10795451 -0.2873379 -1.27540162
## [6,] -0.5478393 -0.6896919 -1.28800062 0.2516760 -0.07430289
corr_matrix <- cor(df_norm)
ggcorrplot(corr_matrix)
Now we can do our PCA.
data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 0.5082216 0.4921451 0.3702481 0.23427070 1.825012e-08
## Proportion of Variance 0.3730011 0.3497760 0.1979655 0.07925738 4.809891e-16
## Cumulative Proportion 0.3730011 0.7227771 0.9207426 1.00000000 1.000000e+00
data.pca$loadings[, 1:3]
## Comp.1 Comp.2 Comp.3
## y 0.04908436 0.3889408 0.0307745
## x1 -0.76523885 0.1736418 0.2471880
## x2 0.04671314 -0.4608059 -0.6856569
## x3 0.36740532 0.7349768 -0.2875830
## x4 0.52424368 -0.2569839 0.6205817
To be honest, this isn’t what I expected. I thought that in the first component, \(y\) would have a positive association with \(x_1\) and \(x_2\).
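Part of the explanation may be that princomp() is applied here to the 5 × 5 correlation matrix itself rather than to the standardized observations, so the components summarize the columns of that matrix. A minimal sketch of running the PCA on the simulated data directly, for comparison (the object name pca_df is my own):
# PCA on the simulated data itself, with scaling handled by prcomp
pca_df <- prcomp(df, scale. = TRUE)
summary(pca_df)
pca_df$rotation[, 1:2]  # loadings on the first two components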