Ex.1

Carry out the logistic regression (Example 22 on page 94) in R using the data below.

x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)

The formula is \[y(x)=\frac{1}{1+\exp[-(a+bx)]}\]

glm(y ~ x, family = "binomial")
## 
## Call:  glm(formula = y ~ x, family = "binomial")
## 
## Coefficients:
## (Intercept)            x  
##     -0.8982       0.7099  
## 
## Degrees of Freedom: 5 Total (i.e. Null);  4 Residual
## Null Deviance:       8.318 
## Residual Deviance: 7.832     AIC: 11.83
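
As a supplementary check (not part of the original output), the fitted coefficients can be plugged back into the logistic formula to obtain the predicted probabilities for each x; the same numbers come from predict() with type = "response" once the model object is stored. The name fit below is introduced only for this sketch.

fit <- glm(y ~ x, family = "binomial")
a <- coef(fit)[1]                      # intercept, about -0.8982 in the output above
b <- coef(fit)[2]                      # slope, about 0.7099 in the output above
p_hat <- 1 / (1 + exp(-(a + b * x)))   # y(x) = 1 / (1 + exp[-(a + b x)])
cbind(x, y, p_hat = round(p_hat, 3))
# predict(fit, type = "response") gives the same fitted probabilities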

Ex.2

Use the motor car database (mtcars) from the built-in data sets in R to carry out a basic principal component analysis and explain your results.

dplyr::glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,~
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,~
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16~
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180~
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,~
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.~
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18~
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,~
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,~
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,~
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,~

The mtcars data set has 11 features, including two categorical variables (vs and am). I will exclude those two variables for this analysis, since PCA works best with numerical data, which leaves 9 features.

mtcars_pca <- prcomp(mtcars[,c(1:7,10,11)], center = TRUE, scale = TRUE)
summary(mtcars_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.3782 1.4429 0.71008 0.51481 0.42797 0.35184 0.32413
## Proportion of Variance 0.6284 0.2313 0.05602 0.02945 0.02035 0.01375 0.01167
## Cumulative Proportion  0.6284 0.8598 0.91581 0.94525 0.96560 0.97936 0.99103
##                           PC8     PC9
## Standard deviation     0.2419 0.14896
## Proportion of Variance 0.0065 0.00247
## Cumulative Proportion  0.9975 1.00000

Based on the summary above we have 9 principal components, one for each of the numerical variables retained. Each of these explains a percentage of the total variation in the data set. PC1 explains about 63% of the total variance, which means that nearly two thirds of the information in the 9 variables can be encapsulated by that single principal component. PC2 explains a further 23%, so PC1 and PC2 together account for about 86% of the variance.
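
As a supplementary sketch (not required by the exercise), the loadings show which variables drive the leading components, and a scree plot visualises the variance explained by each one; both use only the mtcars_pca object created above.

mtcars_pca$rotation[, 1:2]              # loadings of PC1 and PC2 on the 9 variables
screeplot(mtcars_pca, type = "lines")   # variance explained by each component
biplot(mtcars_pca)                      # cars and variables in the PC1-PC2 plane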

Ex.3

Generate a random 4 x 5 matrix, and find its singular value decomposition using R.

random_M <- matrix(rnorm(20), nrow = 4)
svd(random_M)
## $d
## [1] 3.9781996 2.1893973 1.0275484 0.6748422
## 
## $u
##             [,1]       [,2]         [,3]        [,4]
## [1,]  0.58897271  0.1966029 -0.775228241  0.11610169
## [2,]  0.32138185 -0.9415140 -0.009710949 -0.10085053
## [3,] -0.02632203 -0.1158331  0.098595276  0.98801256
## [4,]  0.74103441  0.2479544  0.623863641 -0.01344432
## 
## $v
##            [,1]       [,2]       [,3]        [,4]
## [1,] -0.1743665  0.1718656  0.5893422 -0.07987306
## [2,] -0.3076412 -0.3264880 -0.3323519 -0.81110126
## [3,] -0.2566979  0.2477065 -0.7071701  0.38864733
## [4,] -0.6535820  0.6210076  0.1184890 -0.13524263
## [5,]  0.6179725  0.6456461 -0.1675972 -0.40791877
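
As a sanity check (supplementary; note that without set.seed() the matrix, and hence the decomposition, changes on every run), the three factors can be multiplied back together to recover the original matrix:

s <- svd(random_M)
M_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)   # reconstruct M as U diag(d) V^T
max(abs(M_rebuilt - random_M))              # should be ~0, up to floating-point error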

Ex.4

First try to simulate 100 data points for y using \[y = 5x_{1}+2x_{2}+2x_{3}+x_{4},\]
where \(x_{1}, x_{2}\) are uniformly distributed in [1, 2], while \(x_{3}, x_{4}\) are normally distributed with zero mean and unit variance. Then use principal component analysis (PCA) to analyze the data and find its principal components. Are the results expected from the formula?

x1 <- runif(100, min = 1, max = 2)
x2 <- runif(100, min = 1, max = 2)
x3 <- rnorm(100, mean = 0, sd = 1)
x4 <- rnorm(100, mean = 0, sd = 1)
y <- 5 * x1 + 2 * x2 + 2 * x3 + x4
data_frame <- as.data.frame(cbind(y, x1, x2, x3, x4))
data_pca <- prcomp(data_frame, scale = TRUE)
summary(data_pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4       PC5
## Standard deviation     1.3923 1.0747 0.9957 0.9567 3.328e-16
## Proportion of Variance 0.3877 0.2310 0.1983 0.1831 0.000e+00
## Cumulative Proportion  0.3877 0.6187 0.8169 1.0000 1.000e+00
summary(data_frame)
##        y                x1              x2              x3           
##  Min.   : 3.498   Min.   :1.032   Min.   :1.004   Min.   :-2.357181  
##  1st Qu.: 8.794   1st Qu.:1.232   1st Qu.:1.194   1st Qu.:-0.681240  
##  Median :10.574   Median :1.541   Median :1.428   Median : 0.125700  
##  Mean   :10.641   Mean   :1.526   Mean   :1.458   Mean   : 0.006642  
##  3rd Qu.:12.441   3rd Qu.:1.807   3rd Qu.:1.731   3rd Qu.: 0.668423  
##  Max.   :15.586   Max.   :1.998   Max.   :1.997   Max.   : 2.522886  
##        x4          
##  Min.   :-3.02570  
##  1st Qu.:-0.61593  
##  Median : 0.01306  
##  Mean   : 0.08339  
##  3rd Qu.: 0.78708  
##  Max.   : 2.31092
str(data_pca)
## List of 5
##  $ sdev    : num [1:5] 1.39 1.07 9.96e-01 9.57e-01 3.33e-16
##  $ rotation: num [1:5, 1:5] 0.718 0.35 0.162 0.531 0.233 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "y" "x1" "x2" "x3" ...
##   .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:5] 10.64133 1.52572 1.45802 0.00664 0.08339
##   ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
##  $ scale   : Named num [1:5] 2.747 0.31 0.298 1.033 1.012
##   ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
##  $ x       : num [1:100, 1:5] -0.513 -0.228 1.577 0.514 -3.618 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

Looking at the PCA output above, the results are what the formula leads us to expect. Because y is an exact linear combination of x1, x2, x3 and x4, the five scaled variables are linearly dependent (the data matrix has rank 4), so the fifth component has essentially zero standard deviation (3.3e-16, i.e. numerical noise). The first component is dominated by y and, to a lesser extent, x3 and x1, the predictors that contribute most strongly to it, and it captures the largest share (about 39%) of the variance.
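
A short supplementary check of this interpretation, using only the data_pca object created above:

data_pca$rotation[, c(1, 5)]   # PC1 loads mainly on y (and x3, x1); PC5 encodes the linear dependence
range(data_pca$x[, 5])         # PC5 scores are numerically zero: y is an exact combination of x1..x4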