| x | 0.1 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 |
|---|-----|-----|-----|-----|-----|-----|
| y | 0 | 0 | 1 | 1 | 1 | 0 |
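First we put the data into R vectors (reconstructed here from the table so the code below is runnable):

# the toy data set from the table above
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)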
We can use the `glm` function with `family = "binomial"` to create our model.

log_reg <- glm(y ~ x, family = "binomial")
summary(log_reg)
##
## Call:
## glm(formula = y ~ x, family = "binomial")
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.852 -0.957 1.258 1.107 0.965 -1.565
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.898 1.581 -0.57 0.57
## x 0.710 1.056 0.67 0.50
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 7.8325 on 4 degrees of freedom
## AIC: 11.83
##
## Number of Fisher Scoring iterations: 4
The fitted coefficients give us the estimated model:

\[P = \frac{1}{1+\exp[-(-0.898+0.710x)]}\]
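As a quick sanity check (a minimal sketch using the fitted coefficients), we can evaluate this formula directly with `plogis`, R's built-in logistic function; the results should match the `predict` call below.

# evaluate P = 1 / (1 + exp(-(b0 + b1*x))) directly;
# plogis(z) is R's built-in 1 / (1 + exp(-z))
b <- coef(log_reg)
plogis(b[1] + b[2] * x)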
Next we use the `predict` function to see the probabilities of a positive outcome based on our fitted model.

log_reg.probs <- predict(log_reg, type = "response")
log_reg.probs
## 1 2 3 4 5 6
## 0.3042 0.3674 0.4531 0.5416 0.6275 0.7061
Using a 0.5 probability cutoff, we convert these probabilities into class predictions.

log_reg.pred <- ifelse(log_reg.probs > 0.5, 1, 0)
log_reg.pred
## 1 2 3 4 5 6
## 0 0 0 1 1 1
Finally, a confusion matrix compares the predictions to the observed values.

table(log_reg.pred, y)
## y
## log_reg.pred 0 1
## 0 2 1
## 1 1 2
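From the confusion matrix we can see that 4 of the 6 observations are classified correctly. A quick way to compute the accuracy directly:

# proportion of predictions that match the observed y
mean(log_reg.pred == y)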
Next, using one of the built-in data sets in R (`mtcars`), we carry out a basic principal component analysis and explain the results.

mtcars_pca <- prcomp(mtcars, scale = TRUE)
summary(mtcars_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.571 1.628 0.792 0.5192 0.4727 0.4600 0.3678 0.3506
## Proportion of Variance 0.601 0.241 0.057 0.0245 0.0203 0.0192 0.0123 0.0112
## Cumulative Proportion 0.601 0.842 0.899 0.9232 0.9436 0.9628 0.9751 0.9863
## PC9 PC10 PC11
## Standard deviation 0.278 0.22811 0.148
## Proportion of Variance 0.007 0.00473 0.002
## Cumulative Proportion 0.993 0.99800 1.000
We can see from the "Cumulative Proportion" row above that just over 60% of the variance is explained by the first principal component alone, 94.36% cumulatively by PC5, and 100% by all 11 principal components!
However, from the scree plot below we can see that the majority of the variance is explained by just the first 3 principal components; each subsequent component explains less than 3% of the total variance.
screeplot(mtcars_pca)
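The numbers behind the summary and the scree plot can also be computed directly from the component standard deviations (a small sketch):

# proportion of variance explained: squared sdev over total variance
pve <- mtcars_pca$sdev^2 / sum(mtcars_pca$sdev^2)
cumsum(pve)  # should reproduce the 'Cumulative Proportion' row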
If we want to see what each component is composed of, we can print the `prcomp` result without summarizing…
mtcars_pca
## Standard deviations (1, .., p=11):
## [1] 2.5707 1.6280 0.7920 0.5192 0.4727 0.4600 0.3678 0.3506 0.2776 0.2281
## [11] 0.1485
##
## Rotation (n x k) = (11 x 11):
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## mpg -0.3625 0.01612 -0.22574 -0.022540 0.10284 -0.10880 0.367724 -0.754091
## cyl 0.3739 0.04374 -0.17531 -0.002592 0.05848 0.16855 0.057278 -0.230825
## disp 0.3682 -0.04932 -0.06148 0.256608 0.39400 -0.33616 0.214303 0.001142
## hp 0.3301 0.24878 0.14001 -0.067676 0.54005 0.07144 -0.001496 -0.222358
## drat -0.2942 0.27469 0.16119 0.854829 0.07733 0.24450 0.021120 0.032194
## wt 0.3461 -0.14304 0.34182 0.245899 -0.07503 -0.46494 -0.020668 -0.008572
## qsec -0.2005 -0.46337 0.40317 0.068077 -0.16467 -0.33048 0.050011 -0.231840
## vs -0.3065 -0.23165 0.42882 -0.214849 0.59954 0.19402 -0.265781 0.025935
## am -0.2349 0.42942 -0.20577 -0.030463 0.08978 -0.57082 -0.587305 -0.059747
## gear -0.2069 0.46235 0.28978 -0.264691 0.04833 -0.24356 0.605098 0.336150
## carb 0.2140 0.41357 0.52854 -0.126789 -0.36132 0.18352 -0.174603 -0.395629
## PC9 PC10 PC11
## mpg 0.235702 0.13929 -0.124896
## cyl 0.054035 -0.84642 -0.140695
## disp 0.198428 0.04938 0.660606
## hp -0.575830 0.24782 -0.256492
## drat -0.046901 -0.10149 -0.039530
## wt 0.359498 0.09439 -0.567449
## qsec -0.528377 -0.27067 0.181362
## vs 0.358583 -0.15904 0.008415
## am -0.047404 -0.17779 0.029824
## gear -0.001735 -0.21383 -0.053507
## carb 0.170641 0.07226 0.319595
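To make the connection between these loadings and the data concrete, here is a small sketch: the component scores stored in `mtcars_pca$x` are just the centered-and-scaled data multiplied by this rotation matrix. (For example, PC1 loads positively on `cyl`, `disp`, `hp`, and `wt` and negatively on `mpg`, so it roughly separates heavy, powerful cars from light, economical ones.)

# scores = scaled data %*% loadings; should match mtcars_pca$x
scores <- scale(mtcars) %*% mtcars_pca$rotation
all.equal(scores, mtcars_pca$x, check.attributes = FALSE)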
# create and print a random 4x5 matrix
# (a seed was set beforehand so the printed values are reproducible; see note below)
A <- matrix(rnorm(20), nrow = 4)
A
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.3710 0.40427 2.01842 -1.3889 -0.2843
## [2,] -0.5647 -0.10612 -0.06271 -0.2788 -2.6565
## [3,] 0.3631 1.51152 1.30487 -0.1333 -2.4405
## [4,] 0.6329 -0.09466 2.28665 0.6360 1.3201
# perform singular value decomposition
A_svd <- svd(A)
A_svd
## $d
## [1] 4.0950 3.6835 1.4662 0.9936
##
## $u
## [,1] [,2] [,3] [,4]
## [1,] 0.3636 -0.5964 -0.70525 0.1213
## [2,] 0.5434 0.3796 0.08704 0.7437
## [3,] 0.7496 -0.0822 0.36171 -0.5481
## [4,] -0.1028 -0.7025 0.60350 0.3630
##
## $v
## [,1] [,2] [,3] [,4]
## [1,] 0.09737 -0.40897 -0.34290 -0.22445
## [2,] 0.30089 -0.09207 0.13318 -0.89855
## [3,] 0.35236 -0.79849 0.28851 0.31499
## [4,] -0.20068 0.07785 0.88039 -0.07231
## [5,] -0.85765 -0.42500 -0.07968 -0.19439
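As a side check (not part of the original output): the `u` and `v` matrices returned by `svd` should have orthonormal columns, so `crossprod` of each should give an identity matrix.

# t(u) %*% u and t(v) %*% v: 1s on the diagonal, ~0 elsewhere
round(crossprod(A_svd$u), 12)
round(crossprod(A_svd$v), 12)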
We can check that the math works by multiplying the factors back together, \(A = U D V^{T}\), and seeing whether the product equals our original random matrix.
A == A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v)
## [,1] [,2] [,3] [,4] [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE TRUE TRUE FALSE TRUE
## [4,] TRUE FALSE FALSE FALSE FALSE
The mismatches are probably just floating-point rounding errors, so let's try this again with both sides rounded…
round(A,12) == round(A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v), 12)
## [,1] [,2] [,3] [,4] [,5]
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE
After running this a few times I found that the reconstruction is accurate to about 12 digits. (I had to set a seed to make the results come out the same on rerunning the code.) Beyond roughly 12 digits, the rounding errors cause some values to not match (FALSE).
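A more idiomatic way to make this comparison (a small sketch): `all.equal` compares numeric objects up to a tolerance instead of requiring exact equality.

# all.equal() allows for floating-point error (default tolerance ~1.5e-8)
all.equal(A, A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v))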
Next we generate four random variables, rescale some of them by constants, and see what PCA makes of the result.

# two uniform samples on [1, 2] and two standard normal samples
x1 <- runif(100, 1, 2)
x2 <- runif(100, 1, 2)
x3 <- rnorm(100, 0, 1)
x4 <- rnorm(100, 0, 1)
par(mfrow=c(2,2))
hist(x1)
hist(x2)
hist(x3)
hist(x4)
# combine the rescaled variables into a data frame
df <- data.frame(5*x1, 2*x2, 2*x3, x4)
df_pca <- prcomp(df, scale=TRUE)
summary(df_pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.106 1.016 0.941 0.926
## Proportion of Variance 0.306 0.258 0.222 0.214
## Cumulative Proportion 0.306 0.564 0.786 1.000
df_pca
## Standard deviations (1, .., p=4):
## [1] 1.1061 1.0165 0.9414 0.9258
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## X5...x1 0.53612 -0.3635 -0.70122 0.2979
## X2...x2 -0.60469 -0.2254 -0.02107 0.7636
## X2...x3 -0.58194 -0.2370 -0.55406 -0.5461
## x4 0.09096 -0.8723 0.44818 -0.1731
screeplot(df_pca)
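One way to see why the scaled PCA cannot compress this data (a quick sketch): the four columns were generated independently, so their sample correlations are near zero and there is no shared variance for the components to concentrate.

# pairwise correlations should all be near 0
round(cor(df), 2)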
To be perfectly honest, I am not sure what I should expect here based on the formula. Let's look at histograms of the scaled data to get an idea…
x1_scaled <- scale(5*x1)
x2_scaled <- scale(2*x2)
x3_scaled <- scale(2*x3)
x4_scaled <- scale(x4)
par(mfrow=c(2,2))
hist(x1_scaled)
hist(x2_scaled)
hist(x3_scaled)
hist(x4_scaled)
After multiplying and scaling, the 4 variables all have the same variance, so I guess it makes sense that PCA was not really able to reduce the number of dimensions necessary to explain the variance of our data set: after scaling, the variables have somewhat similar distributions (and were generated independently), so all 4 components are really necessary to explain all the variance in the data.
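We can confirm this numerically: after scaling, every column has variance exactly 1.

# scaled columns have variance 1 by construction
apply(scale(df), 2, var)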
Just out of curiosity, let's see what it looks like without scaling…
df_pca2 <- prcomp(df, scale=FALSE)
summary(df_pca2)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.862 1.420 0.936 0.5639
## Proportion of Variance 0.519 0.302 0.131 0.0476
## Cumulative Proportion 0.519 0.821 0.952 1.0000
df_pca2
## Standard deviations (1, .., p=4):
## [1] 1.8620 1.4201 0.9360 0.5639
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## X5...x1 0.163970 -0.98259 -0.07530 0.04425
## X2...x2 -0.050594 0.03484 0.02167 0.99788
## X2...x3 -0.985166 -0.16524 -0.01478 -0.04386
## x4 -0.001123 -0.07743 0.99682 -0.01900
screeplot(df_pca2)
So in this case maybe we would not want to scale our data? About 95% of the variance can be explained with just the first three principal components here. Note, though, that without scaling the components mostly track whichever columns have the largest raw variances (PC1 is almost entirely the `2*x3` column).
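To see why (a short sketch), compare the raw column variances: unscaled PCA is dominated by whichever columns vary the most.

# raw variances of the unscaled columns;
# 2*x3 has variance near 4, which is why it dominates PC1 above
apply(df, 2, var)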