Ex. 1

Carry out the logistic regression (Example 22 on Page 94) in R using the data

x	0.1	0.5	1.0	1.5	2.0	2.5
y	0	0	1	1	1	0

The formula is \(y(x) = \frac{1}{1 + exp[-(a + bx)]}\)

Solution

We will use here glm function with family as binomial to perform logistic regression for given values of x and y.

x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)

# Logistics Regression
lr <- glm(y~x, family = binomial)
summary(lr)

## 
## Call:
## glm(formula = y ~ x, family = binomial)
## 
## Deviance Residuals: 
##       1        2        3        4        5        6  
## -0.8518  -0.9570   1.2583   1.1075   0.9653  -1.5650  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.8982     1.5811  -0.568    0.570
## x             0.7099     1.0557   0.672    0.501
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3178  on 5  degrees of freedom
## Residual deviance: 7.8325  on 4  degrees of freedom
## AIC: 11.832
## 
## Number of Fisher Scoring iterations: 4

The null deviance shows how well our model can predict by only using intercept. Residual deviance shows how well our model can predict using intercepts and inputs. AIC is Akaike information criterian. It is used to compare models. All these values if smaller than better.

Ex. 2

Using the motor car database (mtcars) of the built-in data sets in R to carry out the basic principal component analysis and explain your results

Solution

We will use prcomp() to perform principal component analysis on mtcars dataset. The scale argument True implies variables will be scaled to have unit variance.

pca <- prcomp(mtcars, scale=TRUE)
summary(pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
##                            PC8    PC9    PC10   PC11
## Standard deviation     0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion  0.98626 0.9933 0.99800 1.0000

We can see here the first PCA components shows 60% variation in the data which gets reduced in further components. Drwaing screeplot below confirms the variance decrease.

screeplot(pca)

pca biplot shows the scores of each case and the loading of each variable on the first two principal components. The left and bottom axes shows principal axes scores and top and right axes shows the loadings.

Seeing the biplot below hp, cycl, disp and wt are similar and gear, am, mpg, drat, qsec, and vs are similar based on their pca1 components.

biplot(pca)

Ex. 3

Generate a random 4 X 5 matrix, and find its singular value decomposition using R.

Solution

matrix <- matrix(rnorm(20), nrow = 4)
matrix

##            [,1]       [,2]       [,3]        [,4]       [,5]
## [1,] -0.9443479  1.3784804 -0.2580736  0.07821805 -2.4513584
## [2,]  0.1075014 -0.7626122  1.1431946 -0.81568273 -1.6517063
## [3,] -1.1579729 -0.6242446  1.1977297 -0.22989418  0.9040357
## [4,] -1.5009241 -1.4770055 -0.2316544 -1.90938005  1.6064364

svd(matrix)

## $d
## [1] 4.075891 2.678069 1.824568 1.342764
## 
## $u
##            [,1]      [,2]       [,3]       [,4]
## [1,]  0.5830753 0.5248648  0.6040281 -0.1403219
## [2,]  0.1431537 0.6818300 -0.6728919  0.2486593
## [3,] -0.3471664 0.2819460 -0.1159010 -0.8868759
## [4,] -0.7204205 0.4244188  0.4110153  0.3632209
## 
## $v
##             [,1]       [,2]        [,3]       [,4]
## [1,]  0.23260415 -0.5174861 -0.61682642  0.4774145
## [2,]  0.48464720 -0.2237911  0.44452991 -0.2725074
## [3,] -0.05783932  0.3298598 -0.63530733 -0.6150745
## [4,]  0.33960852 -0.5191418 -0.08880304 -0.5238760
## [5,] -0.76963178 -0.5511896  0.10206343 -0.2122556

Ex. 4

First try to simulate 100 data points for y using \(y = 5 x_{1} + 2 x_{2} + 2 x_{3} + x_{4}.\) where \(x_{1}, x_{2}\) are uniformly distributed in [1,2], while \(x_{3} , x_{4}\) are normally distributed with zero mean and unit variance. Then, use the principal component analysis (PCA) to analyze the data to find its principal components. Are the results expected from the formula?

Solution

In the first step, we will simulate x1, x2, x3 and x4 and create a dataframe following the given formula.

set.seed(609)
x1 <- runif(100, min=1, max=2)
x2 <- runif(100, min=1, max=2)
x3 <- rnorm(100, mean=0, sd=1)
x4 <- rnorm(100, mean=0, sd=1)

y = 5*x1 + 2*x2 + 2*x3 + x4

df <- as.data.frame(cbind(y,x1,x2,x3,x4))
summary(df)

##        y                x1              x2              x3         
##  Min.   : 1.548   Min.   :1.010   Min.   :1.009   Min.   :-2.0969  
##  1st Qu.: 8.461   1st Qu.:1.271   1st Qu.:1.198   1st Qu.:-0.7122  
##  Median : 9.880   Median :1.482   Median :1.450   Median :-0.1631  
##  Mean   :10.037   Mean   :1.479   Mean   :1.469   Mean   :-0.1081  
##  3rd Qu.:11.914   3rd Qu.:1.699   3rd Qu.:1.757   3rd Qu.: 0.4325  
##  Max.   :15.064   Max.   :1.989   Max.   :1.979   Max.   : 2.1450  
##        x4          
##  Min.   :-2.08701  
##  1st Qu.:-0.85024  
##  Median : 0.01832  
##  Mean   :-0.08108  
##  3rd Qu.: 0.61931  
##  Max.   : 2.14581

We will now use prcomp() to perform principal component analysis. The scale argument True implies variables will be scaled to have unit variance.

pca_df <- prcomp(df, scale=TRUE)
summary(pca_df)

## Importance of components:
##                           PC1    PC2    PC3    PC4       PC5
## Standard deviation     1.4339 1.0377 0.9873 0.9447 4.513e-16
## Proportion of Variance 0.4112 0.2154 0.1949 0.1785 0.000e+00
## Cumulative Proportion  0.4112 0.6266 0.8215 1.0000 1.000e+00

The scree plot below shows that principal components variance values do not differ much and they cover similar variation in the data points.

screeplot(pca_df)

biplot(pca_df)

str(pca_df)

## List of 5
##  $ sdev    : num [1:5] 1.43 1.04 9.87e-01 9.45e-01 4.51e-16
##  $ rotation: num [1:5, 1:5] 0.693 0.436 0.221 0.515 0.123 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "y" "x1" "x2" "x3" ...
##   .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:5] 10.0372 1.4795 1.4685 -0.1081 -0.0811
##   ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
##  $ scale   : Named num [1:5] 2.529 0.281 0.295 0.873 0.912
##   ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
##  $ x       : num [1:100, 1:5] -1.234 -0.603 -0.19 0.541 2.363 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"

The results above do show what we would expect from formula.

Data 609 - Module5

Amit Kapoor

10/22/2021

Ex. 1

Ex. 2

Ex. 3

Ex. 4