Question 1

Fit a logistic regression in R to the data below. The model is \(y(x) = \frac{1}{1+e^{-a-bx}}\), where \(a\) is the intercept and \(b\) is the slope.

x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)
glm(y ~ x, family = "binomial")
## 
## Call:  glm(formula = y ~ x, family = "binomial")
## 
## Coefficients:
## (Intercept)            x  
##     -0.8982       0.7099  
## 
## Degrees of Freedom: 5 Total (i.e. Null);  4 Residual
## Null Deviance:       8.318 
## Residual Deviance: 7.832     AIC: 11.83
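
As a rough check on how the fitted coefficients relate to the formula, the sketch below refits the same model, reads off \(a\) and \(b\), and evaluates \(\frac{1}{1+e^{-a-bx}}\) by hand. The names `fit`, `a`, `b`, `p_manual` and `p_glm` are illustrative and not part of the original answer.

```r
# Same data as in the answer above
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)

# Refit the model and keep the result so the coefficients can be extracted
fit <- glm(y ~ x, family = "binomial")
a <- coef(fit)[["(Intercept)"]]  # intercept a in the formula
b <- coef(fit)[["x"]]            # slope b in the formula

# Evaluate the logistic formula by hand at each x
p_manual <- 1 / (1 + exp(-a - b * x))

# Fitted probabilities reported by glm itself, for comparison
p_glm <- unname(predict(fit, type = "response"))

print(cbind(x, y, p_manual, p_glm))
```

The two probability columns should agree, which confirms that the printed `(Intercept)` and `x` coefficients play the roles of \(a\) and \(b\).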

Question 2

Use the mtcars (motor car) data set built into R to carry out a basic principal component analysis and explain your results.

# prcomp centers the variables by default but does not scale them
pca <- prcomp(mtcars)

print(pca)
## Standard deviations (1, .., p=11):
##  [1] 136.5330479  38.1480776   3.0710166   1.3066508   0.9064862   0.6635411
##  [7]   0.3085791   0.2859604   0.2506973   0.2106519   0.1984238
## 
## Rotation (n x k) = (11 x 11):
##               PC1          PC2          PC3          PC4         PC5
## mpg  -0.038118199  0.009184847  0.982070847  0.047634784 -0.08832843
## cyl   0.012035150 -0.003372487 -0.063483942 -0.227991962  0.23872590
## disp  0.899568146  0.435372320  0.031442656 -0.005086826 -0.01073597
## hp    0.434784387 -0.899307303  0.025093049  0.035715638  0.01655194
## drat -0.002660077 -0.003900205  0.039724928 -0.057129357 -0.13332765
## wt    0.006239405  0.004861023 -0.084910258  0.127962867 -0.24354296
## qsec -0.006671270  0.025011743 -0.071670457  0.886472188 -0.21416101
## vs   -0.002729474  0.002198425  0.004203328  0.177123945 -0.01688851
## am   -0.001962644 -0.005793760  0.054806391 -0.135658793 -0.06270200
## gear -0.002604768 -0.011272462  0.048524372 -0.129913811 -0.27616440
## carb  0.005766010 -0.027779208 -0.102897231 -0.268931427 -0.85520810
##               PC6          PC7           PC8          PC9         PC10
## mpg  -0.143790084 -0.039239174  2.271040e-02 -0.002790139  0.030630361
## cyl  -0.793818050  0.425011021 -1.890403e-01  0.042677206  0.131718534
## disp  0.007424138  0.000582398 -5.841464e-04  0.003532713 -0.005399132
## hp    0.001653685 -0.002212538  4.748087e-06 -0.003734085  0.001862554
## drat  0.227229260  0.034847411 -9.385817e-01 -0.014131110  0.184102094
## wt   -0.127142296 -0.186558915  1.561907e-01 -0.390600261  0.829886844
## qsec -0.189564973  0.254844548 -1.028515e-01 -0.095914479 -0.204240658
## vs    0.102619063 -0.080788938 -2.132903e-03  0.684043835  0.303060724
## am    0.205217266  0.200858874 -2.273255e-02 -0.572372433 -0.162808201
## gear  0.334971103  0.801625551  2.174878e-01  0.156118559  0.203540645
## carb -0.283788381 -0.165474186  3.972219e-03  0.127583043 -0.239954748
##               PC11
## mpg  -0.0158569365
## cyl   0.1454453628
## disp  0.0009420262
## hp   -0.0021526102
## drat -0.0973818815
## wt   -0.0198581635
## qsec  0.0110677880
## vs    0.6256900918
## am    0.7331658036
## gear -0.1909325849
## carb  0.0557957968
summary(pca)
## Importance of components:
##                            PC1      PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     136.533 38.14808 3.07102 1.30665 0.90649 0.66354 0.3086
## Proportion of Variance   0.927  0.07237 0.00047 0.00008 0.00004 0.00002 0.0000
## Cumulative Proportion    0.927  0.99937 0.99984 0.99992 0.99996 0.99998 1.0000
##                          PC8    PC9   PC10   PC11
## Standard deviation     0.286 0.2507 0.2107 0.1984
## Proportion of Variance 0.000 0.0000 0.0000 0.0000
## Cumulative Proportion  1.000 1.0000 1.0000 1.0000

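Based on the results shown, the first two principal components together capture about 99.9% of the variance (92.7% from PC1 and 7.2% from PC2). The cumulative proportion rounds to 100% by the seventh component.

Because prcomp was run without scaling, PC1 and PC2 are dominated by disp and hp (loadings of about 0.90 and 0.43 on PC1, and 0.44 and -0.90 on PC2), the two variables with the largest variances. The remaining variables contribute almost nothing to these two components.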
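
The sketch below shows one way to see why so few components are needed: it lists the per-variable variances and recomputes the proportions of variance directly from the standard deviations. It also runs the PCA with `scale. = TRUE` as an assumed alternative that was not part of the original answer. All object names are illustrative.

```r
# Unscaled PCA, as in the analysis above
pca_raw <- prcomp(mtcars)

# Per-variable variances, largest first: disp and hp dwarf the others
vars <- sort(apply(mtcars, 2, var), decreasing = TRUE)
print(round(vars, 2))

# Proportion of variance per component, computed from the standard deviations
prop_raw <- pca_raw$sdev^2 / sum(pca_raw$sdev^2)
print(round(cumsum(prop_raw), 5))

# Assumption: standardising each variable first (scale. = TRUE) so that
# no variable dominates purely because of its units
pca_scaled <- prcomp(mtcars, scale. = TRUE)
prop_scaled <- pca_scaled$sdev^2 / sum(pca_scaled$sdev^2)
print(round(cumsum(prop_scaled), 5))
```

With scaling, the variance is spread over more components, so the choice between the two versions depends on whether the original measurement units are meant to carry weight.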

Question 3

Generate a random 4x5 matrix, and find its singular value decomposition using R.

# Note: 1:20 fills the matrix with fixed integers, so it is deterministic rather than random
a <- matrix(1:20, nrow = 4, ncol = 5)

print(svd(a))
## $d
## [1] 5.352022e+01 2.363426e+00 4.870683e-15 7.906968e-16
## 
## $u
##            [,1]       [,2]        [,3]       [,4]
## [1,] -0.4430188 -0.7097424 -0.52426094  0.1585890
## [2,] -0.4798725 -0.2640499  0.81721984  0.1793091
## [3,] -0.5167262  0.1816426 -0.06165685 -0.8343851
## [4,] -0.5535799  0.6273351 -0.23130204  0.4964870
## 
## $v
##             [,1]        [,2]       [,3]       [,4]
## [1,] -0.09654784  0.76855612 -0.6000256  0.1704800
## [2,] -0.24551564  0.48961420  0.5577664 -0.5560862
## [3,] -0.39448345  0.21067228  0.2312115  0.1606664
## [4,] -0.54345125 -0.06826963  0.2643802  0.6650059
## [5,] -0.69241905 -0.34721155 -0.4533325 -0.4400661
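
The matrix built from 1:20 has columns that form arithmetic progressions, so its rank is 2. That is why the last two singular values above are zero up to floating-point error. The sketch below repeats the decomposition on a genuinely random matrix and checks that the factors rebuild it. The seed and the names `a_rand`, `s`, `a_rebuilt`, `max_err`, `d_seq` and `rank_seq` are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only to make the example reproducible

# A random 4x5 matrix with standard normal entries
a_rand <- matrix(rnorm(20), nrow = 4, ncol = 5)
s <- svd(a_rand)

# Rebuild the matrix from its factors: A = U diag(d) V^T
a_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
max_err <- max(abs(a_rand - a_rebuilt))
print(max_err)

# Numerical rank of the 1:20 matrix: count singular values that are
# not negligible relative to the largest one
d_seq <- svd(matrix(1:20, nrow = 4, ncol = 5))$d
rank_seq <- sum(d_seq > 1e-8 * d_seq[1])
print(rank_seq)
```

For a 4x5 input, `svd` returns four singular values, a 4x4 `u` and a 5x4 `v`, matching the shapes in the output above.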

Question 4

Simulate 100 data points for \(y\) using \(y = 5x_1 + 2x_2 + 2x_3 + x_4\), where \(x_1\) and \(x_2\) are uniformly distributed in [1,2], while \(x_3\) and \(x_4\) are normally distributed with zero mean and unit variance. Then use principal component analysis to find the principal components of the data. Are the results what the formula would lead you to expect?

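# Linear combination used to generate the response.
# Renamed from y so that the function is not overwritten by the data vector y.
simulate_y <- function(x1, x2, x3, x4) {
  5 * x1 + 2 * x2 + 2 * x3 + x4
}

# Predictors: x1, x2 ~ Uniform(1, 2); x3, x4 ~ Normal(0, 1)
x1 <- runif(100, min = 1, max = 2)
x2 <- runif(100, min = 1, max = 2)
x3 <- rnorm(100, mean = 0, sd = 1)
x4 <- rnorm(100, mean = 0, sd = 1)

y <- simulate_y(x1, x2, x3, x4)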

df <- as.data.frame(cbind(y,x1,x2,x3,x4))

pc <- prcomp(df)

summary(pc)
## Importance of components:
##                           PC1    PC2    PC3     PC4       PC5
## Standard deviation     2.5106 1.0635 0.6312 0.29633 3.009e-16
## Proportion of Variance 0.7958 0.1428 0.0503 0.01109 0.000e+00
## Cumulative Proportion  0.7958 0.9386 0.9889 1.00000 1.000e+00

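The results are as expected. Because \(y\) is an exact linear combination of \(x_1, \dots, x_4\), the five columns span only four dimensions. The first four principal components therefore capture 100% of the variance, and the fifth has a standard deviation of about 3e-16, which is zero up to floating-point error. PC1 alone accounts for about 80% of the variance.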
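
One way to confirm this is to check that the direction with (numerically) zero variance is the linear relation itself. Since \(y - 5x_1 - 2x_2 - 2x_3 - x_4 = 0\) for every row, the fifth loading vector should be proportional to \((1, -5, -2, -2, -1)\). The sketch below assumes a fixed seed, so its simulated values differ from the run shown above. The names `pc_check`, `null_dir`, `smallest_sd` and `alignment` are illustrative.

```r
set.seed(42)  # assumed seed, only to make the check reproducible
n <- 100

# Same simulation design as above
x1 <- runif(n, min = 1, max = 2)
x2 <- runif(n, min = 1, max = 2)
x3 <- rnorm(n, mean = 0, sd = 1)
x4 <- rnorm(n, mean = 0, sd = 1)
y <- 5 * x1 + 2 * x2 + 2 * x3 + x4

pc_check <- prcomp(data.frame(y, x1, x2, x3, x4))

# The fifth standard deviation should be zero up to rounding error
smallest_sd <- pc_check$sdev[5]
print(smallest_sd)

# Unit vector encoding y - 5*x1 - 2*x2 - 2*x3 - x4 = 0
# (its length is sqrt(1 + 25 + 4 + 4 + 1) = sqrt(35))
null_dir <- c(1, -5, -2, -2, -1) / sqrt(35)

# Absolute cosine between PC5 and that direction; the sign of a
# principal component is arbitrary, hence abs()
alignment <- abs(sum(pc_check$rotation[, 5] * null_dir))
print(alignment)
```

An alignment close to 1 means PC5 recovers the generating formula, while the first four components describe the actual variation in the data.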