Ex. 1

Carry out the logistic regression (Example 22 on page 94 of the book Introduction to Algorithms for Data Mining and Machine Learning by Xin-She Yang) in R, using the data below.

x 0.1 0.5 1.0 1.5 2.0 2.5
y 0 0 1 1 1 0
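
To work with this in R, the data can be entered as two vectors (the names x and y simply mirror the table):
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)  # predictor values
y <- c(0, 0, 1, 1, 1, 0)              # binary outcomes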

The formula is \[y(x) = \frac{1}{1+\exp[-(a+bx)]}\]

We then use the glm function with family = "binomial" to fit our model.
log_reg <- glm(y ~ x, family = "binomial")
summary(log_reg)
## 
## Call:
## glm(formula = y ~ x, family = "binomial")
## 
## Deviance Residuals: 
##      1       2       3       4       5       6  
## -0.852  -0.957   1.258   1.107   0.965  -1.565  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -0.898      1.581   -0.57     0.57
## x              0.710      1.056    0.67     0.50
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8.3178  on 5  degrees of freedom
## Residual deviance: 7.8325  on 4  degrees of freedom
## AIC: 11.83
## 
## Number of Fisher Scoring iterations: 4
We get the same model as found in the book (rounded to 3 digits):

\[P = \frac{1}{1+\exp[-(-0.898+0.710x)]}\]
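As a quick sanity check (my own addition, not from the book), we can plug these rounded coefficients straight into the formula; the values should agree, up to rounding, with the predict output below.
# probabilities computed directly from the rounded coefficients
1 / (1 + exp(-(-0.898 + 0.710 * x)))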

Then we use the predict function to get the probability of a positive outcome for each observation from our fitted model.
log_reg.probs <- predict(log_reg,type = "response")
log_reg.probs
##      1      2      3      4      5      6 
## 0.3042 0.3674 0.4531 0.5416 0.6275 0.7061
We can then turn those probabilities into class predictions using a 0.5 threshold and build a confusion matrix of the results.
log_reg.pred <- ifelse(log_reg.probs > 0.5, 1, 0)
log_reg.pred
## 1 2 3 4 5 6 
## 0 0 0 1 1 1
table(log_reg.pred,y)
##             y
## log_reg.pred 0 1
##            0 2 1
##            1 1 2
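From the confusion matrix we see that 4 of the 6 observations are classified correctly; a one-liner for the accuracy:
# proportion of correct predictions
mean(log_reg.pred == y)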

Ex. 2

Using the motor car database (mtcars) of the built-in data sets in R, carry out a basic principal component analysis and explain your results.

mtcars_pca <- prcomp(mtcars, scale=TRUE)
summary(mtcars_pca)
## Importance of components:
##                          PC1   PC2   PC3    PC4    PC5    PC6    PC7    PC8
## Standard deviation     2.571 1.628 0.792 0.5192 0.4727 0.4600 0.3678 0.3506
## Proportion of Variance 0.601 0.241 0.057 0.0245 0.0203 0.0192 0.0123 0.0112
## Cumulative Proportion  0.601 0.842 0.899 0.9232 0.9436 0.9628 0.9751 0.9863
##                          PC9    PC10  PC11
## Standard deviation     0.278 0.22811 0.148
## Proportion of Variance 0.007 0.00473 0.002
## Cumulative Proportion  0.993 0.99800 1.000

We can see from the "Cumulative Proportion" row above that just over 60% of the variance is explained by the first principal component, 94.36% cumulatively by PC5, and 100% by all 11 principal components!
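
These numbers come directly from the component standard deviations; a minimal sketch of how the summary is computed:
# proportion and cumulative proportion of variance from the PCA standard deviations
prop_var <- mtcars_pca$sdev^2 / sum(mtcars_pca$sdev^2)
round(prop_var, 3)
round(cumsum(prop_var), 3)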

However, from the plot below we can see that the majority of the variance can be explained by just the first 3 principal components. Each of the remaining components explains less than 3% of the total variance.

screeplot(mtcars_pca)

If we want to see what each component is composed of, we can print the prcomp result itself rather than its summary…

mtcars_pca
## Standard deviations (1, .., p=11):
##  [1] 2.5707 1.6280 0.7920 0.5192 0.4727 0.4600 0.3678 0.3506 0.2776 0.2281
## [11] 0.1485
## 
## Rotation (n x k) = (11 x 11):
##          PC1      PC2      PC3       PC4      PC5      PC6       PC7       PC8
## mpg  -0.3625  0.01612 -0.22574 -0.022540  0.10284 -0.10880  0.367724 -0.754091
## cyl   0.3739  0.04374 -0.17531 -0.002592  0.05848  0.16855  0.057278 -0.230825
## disp  0.3682 -0.04932 -0.06148  0.256608  0.39400 -0.33616  0.214303  0.001142
## hp    0.3301  0.24878  0.14001 -0.067676  0.54005  0.07144 -0.001496 -0.222358
## drat -0.2942  0.27469  0.16119  0.854829  0.07733  0.24450  0.021120  0.032194
## wt    0.3461 -0.14304  0.34182  0.245899 -0.07503 -0.46494 -0.020668 -0.008572
## qsec -0.2005 -0.46337  0.40317  0.068077 -0.16467 -0.33048  0.050011 -0.231840
## vs   -0.3065 -0.23165  0.42882 -0.214849  0.59954  0.19402 -0.265781  0.025935
## am   -0.2349  0.42942 -0.20577 -0.030463  0.08978 -0.57082 -0.587305 -0.059747
## gear -0.2069  0.46235  0.28978 -0.264691  0.04833 -0.24356  0.605098  0.336150
## carb  0.2140  0.41357  0.52854 -0.126789 -0.36132  0.18352 -0.174603 -0.395629
##            PC9     PC10      PC11
## mpg   0.235702  0.13929 -0.124896
## cyl   0.054035 -0.84642 -0.140695
## disp  0.198428  0.04938  0.660606
## hp   -0.575830  0.24782 -0.256492
## drat -0.046901 -0.10149 -0.039530
## wt    0.359498  0.09439 -0.567449
## qsec -0.528377 -0.27067  0.181362
## vs    0.358583 -0.15904  0.008415
## am   -0.047404 -0.17779  0.029824
## gear -0.001735 -0.21383 -0.053507
## carb  0.170641  0.07226  0.319595
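
As an extra exploratory step (not required by the exercise), a biplot shows how the original variables load onto the first two components:
# observations and variable loadings on PC1 vs PC2
biplot(mtcars_pca)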

Ex. 3

Generate a random \(4 \times 5\) matrix, and find its singular value decomposition using R.

# create and print a random 4 x 5 matrix (a seed was set beforehand so these values are reproducible)
A <- matrix(rnorm(20), nrow = 4)
A
##         [,1]     [,2]     [,3]    [,4]    [,5]
## [1,]  1.3710  0.40427  2.01842 -1.3889 -0.2843
## [2,] -0.5647 -0.10612 -0.06271 -0.2788 -2.6565
## [3,]  0.3631  1.51152  1.30487 -0.1333 -2.4405
## [4,]  0.6329 -0.09466  2.28665  0.6360  1.3201
# perform singular value decomposition
A_svd <- svd(A) 
A_svd
## $d
## [1] 4.0950 3.6835 1.4662 0.9936
## 
## $u
##         [,1]    [,2]     [,3]    [,4]
## [1,]  0.3636 -0.5964 -0.70525  0.1213
## [2,]  0.5434  0.3796  0.08704  0.7437
## [3,]  0.7496 -0.0822  0.36171 -0.5481
## [4,] -0.1028 -0.7025  0.60350  0.3630
## 
## $v
##          [,1]     [,2]     [,3]     [,4]
## [1,]  0.09737 -0.40897 -0.34290 -0.22445
## [2,]  0.30089 -0.09207  0.13318 -0.89855
## [3,]  0.35236 -0.79849  0.28851  0.31499
## [4,] -0.20068  0.07785  0.88039 -0.07231
## [5,] -0.85765 -0.42500 -0.07968 -0.19439

We can check that the math works by computing \(U D V^T\) and seeing whether it equals our original random matrix.

A == A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE FALSE FALSE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE FALSE
## [3,] FALSE  TRUE  TRUE FALSE  TRUE
## [4,]  TRUE FALSE FALSE FALSE FALSE

Those are just floating-point rounding errors, so let's compare again after rounding both sides…

round(A,12) == round(A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v), 12)
##      [,1] [,2] [,3] [,4] [,5]
## [1,] TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE
## [4,] TRUE TRUE TRUE TRUE TRUE

After running this a few times I found that it's accurate to about 12 digits. (I set a seed so the results come out the same on rerunning the code.) Beyond about 12 digits the rounding errors cause some values to be FALSE (not match).
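
Instead of rounding to a fixed number of digits, another way to check the reconstruction (my own addition) is to look at the largest absolute difference, or to use all.equal, which compares with a numerical tolerance:
# largest elementwise reconstruction error -- on the order of machine precision
max(abs(A - A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v)))
# tolerance-based comparison; returns TRUE when the differences are negligible
all.equal(A, A_svd$u %*% diag(A_svd$d) %*% t(A_svd$v))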

Ex. 4

First try to simulate 100 data points for \(y\) using \[y=5x_1+2x_2+2x_3+x_4\] where \(x_1, x_2\) are uniformly distributed in [1,2] while \(x_3, x_4\) are normally distributed with zero mean and unit variance. Then use principal component analysis (PCA) to analyze the data to find its principal components. Are the results expected from the formula?

x1 <- runif(100, 1, 2)
x2 <- runif(100, 1, 2)
x3 <- rnorm(100, 0, 1)
x4 <- rnorm(100, 0, 1)
par(mfrow=c(2,2))
hist(x1)
hist(x2)
hist(x3)
hist(x4)
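
For completeness, the response y from the formula can be simulated as well (it is not used in the PCA below, which is run on the weighted predictor columns):
y <- 5*x1 + 2*x2 + 2*x3 + x4  # response from the given formula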

df <- data.frame(5*x1,2*x2,2*x3,x4)
df_pca <- prcomp(df, scale=TRUE)
summary(df_pca)
## Importance of components:
##                          PC1   PC2   PC3   PC4
## Standard deviation     1.106 1.016 0.941 0.926
## Proportion of Variance 0.306 0.258 0.222 0.214
## Cumulative Proportion  0.306 0.564 0.786 1.000
df_pca
## Standard deviations (1, .., p=4):
## [1] 1.1061 1.0165 0.9414 0.9258
## 
## Rotation (n x k) = (4 x 4):
##              PC1     PC2      PC3     PC4
## X5...x1  0.53612 -0.3635 -0.70122  0.2979
## X2...x2 -0.60469 -0.2254 -0.02107  0.7636
## X2...x3 -0.58194 -0.2370 -0.55406 -0.5461
## x4       0.09096 -0.8723  0.44818 -0.1731
screeplot(df_pca)

To be perfectly honest, I am not sure what I should expect based on the formula. Let's look at histograms of the scaled data to get an idea…

x1_scaled <- scale(5*x1)
x2_scaled <- scale(2*x2)
x3_scaled <- scale(2*x3)
x4_scaled <- scale(x4)
par(mfrow=c(2,2))
hist(x1_scaled)
hist(x2_scaled)
hist(x3_scaled)
hist(x4_scaled)

After multiplying by the coefficients and scaling, the four variables all have the same variance, so it makes sense that the PCA was not able to meaningfully reduce the number of dimensions needed to explain the variance of our data set: the columns all have fairly similar distributions after scaling, so all four components are needed to explain essentially all of the variance.
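
We can confirm that directly: after scaling, every column has unit variance.
# variance of each scaled column -- all equal to 1
apply(scale(df), 2, var)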

Just out of curiosity let’s see what it looks like without scaling…

df_pca2 <- prcomp(df, scale=FALSE)
summary(df_pca2)
## Importance of components:
##                          PC1   PC2   PC3    PC4
## Standard deviation     1.862 1.420 0.936 0.5639
## Proportion of Variance 0.519 0.302 0.131 0.0476
## Cumulative Proportion  0.519 0.821 0.952 1.0000
df_pca2
## Standard deviations (1, .., p=4):
## [1] 1.8620 1.4201 0.9360 0.5639
## 
## Rotation (n x k) = (4 x 4):
##               PC1      PC2      PC3      PC4
## X5...x1  0.163970 -0.98259 -0.07530  0.04425
## X2...x2 -0.050594  0.03484  0.02167  0.99788
## X2...x3 -0.985166 -0.16524 -0.01478 -0.04386
## x4      -0.001123 -0.07743  0.99682 -0.01900
screeplot(df_pca2)

So in this case maybe we would not want to scale our data? About 95% of the variance can be explained with just three principal components in that case…
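
That outcome is actually consistent with the formula. A uniform variable on [1, 2] has variance 1/12 and a standard normal has variance 1, so the weighted terms have variances 25/12 ≈ 2.08, 4/12 ≈ 0.33, 4 and 1. The 2*x3 column alone carries roughly half of the total variance, which lines up with PC1 explaining about 52% and loading almost entirely on that column. A quick sketch of this back-of-the-envelope check:
# theoretical variances of the weighted terms: Var(a*U) = a^2/12 for U ~ Unif(1,2), Var(a*Z) = a^2 for Z ~ N(0,1)
theo_var <- c(5^2/12, 2^2/12, 2^2, 1^2)
round(theo_var / sum(theo_var), 3)  # compare with the unscaled PC proportions above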