Carry out the logistic regression (Example 22 on page 94) in R using the following data.
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)
The formula is \[y(x)=\frac{1}{1+\exp[-(a+bx)]},\] where \(a\) is the intercept and \(b\) is the slope coefficient.
glm(y ~ x, family = "binomial")
##
## Call: glm(formula = y ~ x, family = "binomial")
##
## Coefficients:
## (Intercept) x
## -0.8982 0.7099
##
## Degrees of Freedom: 5 Total (i.e. Null); 4 Residual
## Null Deviance: 8.318
## Residual Deviance: 7.832 AIC: 11.83
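The fitted coefficients are thus a ≈ -0.90 (intercept) and b ≈ 0.71 (slope). As a quick sanity check (a minimal sketch; the object name fit is ours), the fitted probabilities can be recovered either with predict() or by plugging the reported coefficients into the logistic formula:
fit <- glm(y ~ x, family = "binomial")
# Fitted probabilities at the observed x values; type = "response"
# applies the inverse logit 1 / (1 + exp(-(a + b * x)))
predict(fit, type = "response")
# Equivalently, by hand with the reported coefficients:
plogis(-0.8982 + 0.7099 * x)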
Use the motor car database (mtcars) from the built-in data sets in R to carry out a basic principal component analysis, and explain your results.
dplyr::glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~
The mtcars data set has 11 features, including two categorical variables (vs and am). I will exclude those two categorical variables from this analysis, as PCA works best with numerical data. That leaves 9 features.
mtcars_pca <- prcomp(mtcars[,c(1:7,10,11)], center = TRUE, scale = TRUE)
summary(mtcars_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion 0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
## PC8 PC9
## Standard deviation 0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion 0.9975 1.00000
Based on the summary above, we have 9 principal components, one per variable. Each explains a percentage of the total variation in the data set. PC1 explains about 63% of the total variance, which means that nearly two-thirds of the information in the 9-variable data set can be encapsulated by that one principal component. PC2 explains a further 23%, so PC1 and PC2 together account for about 86% of the variance.
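To see how quickly the explained variance falls off, a scree plot and the leading loadings can be inspected (a minimal sketch using base R only; output not reproduced here):
# Scree plot: variance explained by each principal component
plot(mtcars_pca, type = "l", main = "Scree plot of mtcars PCA")
# Loadings of the first two components, to see which variables drive them
mtcars_pca$rotation[, 1:2]
# Biplot of observations and variable loadings on PC1-PC2
biplot(mtcars_pca)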
Generate a random 4 x 5 matrix, and find its singular value decomposition using R.
random_M <- matrix(rnorm(20), nrow = 4)  # 4 rows x 5 columns of N(0,1) entries
svd(random_M)
## $d
## [1] 3.9781996 2.1893973 1.0275484 0.6748422
##
## $u
## [,1] [,2] [,3] [,4]
## [1,] 0.58897271 0.1966029 -0.775228241 0.11610169
## [2,] 0.32138185 -0.9415140 -0.009710949 -0.10085053
## [3,] -0.02632203 -0.1158331 0.098595276 0.98801256
## [4,] 0.74103441 0.2479544 0.623863641 -0.01344432
##
## $v
## [,1] [,2] [,3] [,4]
## [1,] -0.1743665 0.1718656 0.5893422 -0.07987306
## [2,] -0.3076412 -0.3264880 -0.3323519 -0.81110126
## [3,] -0.2566979 0.2477065 -0.7071701 0.38864733
## [4,] -0.6535820 0.6210076 0.1184890 -0.13524263
## [5,] 0.6179725 0.6456461 -0.1675972 -0.40791877
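As a check (a sketch reusing random_M from above; your numbers will differ since no seed was set), the three factors should reconstruct the original matrix up to floating-point error:
s <- svd(random_M)
# Rebuild M as U %*% diag(d) %*% t(V); the largest absolute
# discrepancy should be on the order of machine precision
max(abs(s$u %*% diag(s$d) %*% t(s$v) - random_M))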
First, try to simulate 100 data points for y using \[y = 5x_{1}+2x_{2}+2x_{3}+x_{4},\]
where \(x_{1}, x_{2}\) are uniformly distributed in [1,2], while \(x_{3}, x_{4}\) are normally distributed with zero mean and unit variance. Then use principal component analysis (PCA) to analyze the data and find its principal components. Are the results expected from the formula?
x1 <- runif(100, min = 1, max = 2)
x2 <- runif(100, min = 1, max = 2)
x3 <- rnorm(100, mean = 0, sd = 1)
x4 <- rnorm(100, mean = 0, sd = 1)
y <- 5 * x1 + 2 * x2 + 2 * x3 + x4
data_frame <- as.data.frame(cbind(y, x1, x2, x3, x4))
data_pca <- prcomp(data_frame, scale = TRUE)
summary(data_pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.3923 1.0747 0.9957 0.9567 3.328e-16
## Proportion of Variance 0.3877 0.2310 0.1983 0.1831 0.000e+00
## Cumulative Proportion 0.3877 0.6187 0.8169 1.0000 1.000e+00
summary(data_frame)
## y x1 x2 x3
## Min. : 3.498 Min. :1.032 Min. :1.004 Min. :-2.357181
## 1st Qu.: 8.794 1st Qu.:1.232 1st Qu.:1.194 1st Qu.:-0.681240
## Median :10.574 Median :1.541 Median :1.428 Median : 0.125700
## Mean :10.641 Mean :1.526 Mean :1.458 Mean : 0.006642
## 3rd Qu.:12.441 3rd Qu.:1.807 3rd Qu.:1.731 3rd Qu.: 0.668423
## Max. :15.586 Max. :1.998 Max. :1.997 Max. : 2.522886
## x4
## Min. :-3.02570
## 1st Qu.:-0.61593
## Median : 0.01306
## Mean : 0.08339
## 3rd Qu.: 0.78708
## Max. : 2.31092
str(data_pca)
## List of 5
## $ sdev : num [1:5] 1.39 1.07 9.96e-01 9.57e-01 3.33e-16
## $ rotation: num [1:5, 1:5] 0.718 0.35 0.162 0.531 0.233 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "y" "x1" "x2" "x3" ...
## .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:5] 10.64133 1.52572 1.45802 0.00664 0.08339
## ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
## $ scale : Named num [1:5] 2.747 0.31 0.298 1.033 1.012
## ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
## $ x : num [1:100, 1:5] -0.513 -0.228 1.577 0.514 -3.618 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
Looking at the PCA output above, the results are indeed what the formula leads us to expect. Since y is an exact linear combination of x1, x2, x3, and x4, the five variables span only a four-dimensional subspace: the fifth principal component has an essentially zero standard deviation (3.3e-16, i.e. numerical round-off), and the first four components already account for 100% of the variance.
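The exact dependence can be confirmed directly (a minimal sketch; the vector v below is ours): the loading vector of the zero-variance component encodes the linear relation among the standardized variables, so it should match, up to an overall sign, the coefficients of y - 5*x1 - 2*x2 - 2*x3 - x4 = 0 rescaled by each variable's standard deviation and normalized:
# PC5 loadings: the null direction of the standardized data
data_pca$rotation[, 5]
# Expected direction: relation coefficients times each variable's sd,
# normalized to unit length; should agree with PC5 up to sign
v <- c(1, -5, -2, -2, -1) * data_pca$scale
v / sqrt(sum(v^2))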