Carry out the logistic regression (Example 22 on Page 94) in R using the data
| x | 0.1 | 0.5 | 1.0 | 1.5 | 2.0 | 2.5 |
|---|-----|-----|-----|-----|-----|-----|
| y | 0   | 0   | 1   | 1   | 1   | 0   |
The formula is \(y(x) = \frac{1}{1 + \exp[-(a + bx)]}\).
Solution
We will use the glm() function with family = binomial to perform the logistic regression on the given values of x and y.
x <- c(0.1, 0.5, 1.0, 1.5, 2.0, 2.5)
y <- c(0, 0, 1, 1, 1, 0)

# Logistic regression
lr <- glm(y ~ x, family = binomial)
summary(lr)
##
## Call:
## glm(formula = y ~ x, family = binomial)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.8518 -0.9570 1.2583 1.1075 0.9653 -1.5650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8982 1.5811 -0.568 0.570
## x 0.7099 1.0557 0.672 0.501
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 7.8325 on 4 degrees of freedom
## AIC: 11.832
##
## Number of Fisher Scoring iterations: 4
The null deviance measures how well the model predicts the response using only the intercept, while the residual deviance measures how well it predicts using the intercept and the input x. AIC is the Akaike information criterion, which is used to compare models. For all three quantities, smaller values indicate a better fit.
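As a quick check (a minimal sketch using only the objects created above), the fitted probabilities returned by glm() can be reproduced directly from the estimated coefficients a and b via the logistic formula:

a <- coef(lr)[1]   # estimated intercept a
b <- coef(lr)[2]   # estimated slope b

# Apply y(x) = 1 / (1 + exp(-(a + b*x))) by hand and compare with glm()
p_manual <- 1 / (1 + exp(-(a + b * x)))
cbind(manual = p_manual, fitted = fitted(lr))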
Use the motor car database (mtcars) from the built-in data sets in R to carry out a basic principal component analysis and explain your results.
Solution
We will use prcomp() to perform a principal component analysis on the mtcars data set. Setting scale = TRUE means the variables are scaled to have unit variance before the analysis.
pca <- prcomp(mtcars, scale = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
## PC8 PC9 PC10 PC11
## Standard deviation 0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion 0.98626 0.9933 0.99800 1.0000
We can see that the first principal component explains about 60% of the variation in the data, and the proportion explained drops for each further component. Drawing the scree plot below confirms this decrease in variance.
screeplot(pca)
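The proportions of variance reported by summary() can also be recovered from the component standard deviations stored in the pca object (a small sketch):

# Squared standard deviations are the component variances;
# dividing by their total gives the proportion of variance explained
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 4)          # about 0.60 for PC1, matching the summary above
round(cumsum(prop_var), 4)  # cumulative proportions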
The PCA biplot shows the scores of each case and the loading of each variable on the first two principal components. The left and bottom axes show the principal component scores, while the top and right axes show the variable loadings.
In the biplot below, hp, cyl, disp and wt behave similarly, while gear, am, mpg, drat, qsec and vs form another group, based on their PC1 loadings.
biplot(pca)
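These groupings can also be checked numerically by inspecting the loadings (the rotation matrix) on the first two components; variables with loadings of the same sign and similar size on PC1 point in similar directions (a small sketch):

# Loadings of each mtcars variable on the first two principal components
round(pca$rotation[, 1:2], 3)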
Generate a random 4 × 5 matrix, and find its singular value decomposition using R.
Solution
matrix <- matrix(rnorm(20), nrow = 4)
matrix
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.9443479 1.3784804 -0.2580736 0.07821805 -2.4513584
## [2,] 0.1075014 -0.7626122 1.1431946 -0.81568273 -1.6517063
## [3,] -1.1579729 -0.6242446 1.1977297 -0.22989418 0.9040357
## [4,] -1.5009241 -1.4770055 -0.2316544 -1.90938005 1.6064364
svd(matrix)
## $d
## [1] 4.075891 2.678069 1.824568 1.342764
##
## $u
## [,1] [,2] [,3] [,4]
## [1,] 0.5830753 0.5248648 0.6040281 -0.1403219
## [2,] 0.1431537 0.6818300 -0.6728919 0.2486593
## [3,] -0.3471664 0.2819460 -0.1159010 -0.8868759
## [4,] -0.7204205 0.4244188 0.4110153 0.3632209
##
## $v
## [,1] [,2] [,3] [,4]
## [1,] 0.23260415 -0.5174861 -0.61682642 0.4774145
## [2,] 0.48464720 -0.2237911 0.44452991 -0.2725074
## [3,] -0.05783932 0.3298598 -0.63530733 -0.6150745
## [4,] 0.33960852 -0.5191418 -0.08880304 -0.5238760
## [5,] -0.76963178 -0.5511896 0.10206343 -0.2122556
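As a sanity check (a short sketch), the three factors returned by svd() can be multiplied back together to recover the original matrix, since \(A = U D V^{T}\):

s <- svd(matrix)
# Rebuild the matrix from its singular value decomposition
reconstructed <- s$u %*% diag(s$d) %*% t(s$v)
max(abs(reconstructed - matrix))   # should be numerically zero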
First try to simulate 100 data points for y using \(y = 5 x_{1} + 2 x_{2} + 2 x_{3} + x_{4}\), where \(x_{1}, x_{2}\) are uniformly distributed in [1, 2], while \(x_{3}, x_{4}\) are normally distributed with zero mean and unit variance. Then, use principal component analysis (PCA) to analyze the data and find its principal components. Are the results expected from the formula?
Solution
In the first step, we simulate x1, x2, x3 and x4, and build a data frame with y computed from the given formula.
set.seed(609)
x1 <- runif(100, min = 1, max = 2)
x2 <- runif(100, min = 1, max = 2)
x3 <- rnorm(100, mean = 0, sd = 1)
x4 <- rnorm(100, mean = 0, sd = 1)

y <- 5*x1 + 2*x2 + 2*x3 + x4

df <- as.data.frame(cbind(y, x1, x2, x3, x4))
summary(df)
## y x1 x2 x3
## Min. : 1.548 Min. :1.010 Min. :1.009 Min. :-2.0969
## 1st Qu.: 8.461 1st Qu.:1.271 1st Qu.:1.198 1st Qu.:-0.7122
## Median : 9.880 Median :1.482 Median :1.450 Median :-0.1631
## Mean :10.037 Mean :1.479 Mean :1.469 Mean :-0.1081
## 3rd Qu.:11.914 3rd Qu.:1.699 3rd Qu.:1.757 3rd Qu.: 0.4325
## Max. :15.064 Max. :1.989 Max. :1.979 Max. : 2.1450
## x4
## Min. :-2.08701
## 1st Qu.:-0.85024
## Median : 0.01832
## Mean :-0.08108
## 3rd Qu.: 0.61931
## Max. : 2.14581
We now use prcomp() to perform the principal component analysis. As before, scale = TRUE means the variables are scaled to have unit variance.
pca_df <- prcomp(df, scale = TRUE)
summary(pca_df)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.4339 1.0377 0.9873 0.9447 4.513e-16
## Proportion of Variance 0.4112 0.2154 0.1949 0.1785 0.000e+00
## Cumulative Proportion 0.4112 0.6266 0.8215 1.0000 1.000e+00
The scree plot below shows that the variances of the first four principal components do not differ much, so each of them captures a similar share of the variation in the data.
screeplot(pca_df)
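The component variances make this explicit (a small sketch using the pca_df object above): the first four are of comparable size, while the fifth is numerically zero because y is an exact linear combination of the inputs.

# Variances (squared standard deviations) of the principal components
round(pca_df$sdev^2, 4)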
biplot(pca_df)
str(pca_df)
## List of 5
## $ sdev : num [1:5] 1.43 1.04 9.87e-01 9.45e-01 4.51e-16
## $ rotation: num [1:5, 1:5] 0.693 0.436 0.221 0.515 0.123 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:5] "y" "x1" "x2" "x3" ...
## .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:5] 10.0372 1.4795 1.4685 -0.1081 -0.0811
## ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
## $ scale : Named num [1:5] 2.529 0.281 0.295 0.873 0.912
## ..- attr(*, "names")= chr [1:5] "y" "x1" "x2" "x3" ...
## $ x : num [1:100, 1:5] -1.234 -0.603 -0.19 0.541 2.363 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
The results are what we would expect from the formula: y is an exact linear combination of x1, x2, x3 and x4, so the five-column data set has only four independent dimensions. This is why the fifth principal component has essentially zero variance, with the total variance spread over the first four components.
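One way to see this concretely (a sketch, relying on prcomp() having scaled each column by its sample standard deviation) is to note that the exact relation \(y - 5x_{1} - 2x_{2} - 2x_{3} - x_{4} = 0\) defines a direction of zero variance; after scaling, its coefficients should match the loadings of PC5 up to sign:

# Direction of zero variance, written in terms of the scaled variables
v <- c(sd(df$y), -5 * sd(df$x1), -2 * sd(df$x2), -2 * sd(df$x3), -sd(df$x4))
v <- v / sqrt(sum(v^2))   # normalise to unit length

# Compare with the loadings of the (near) zero-variance component
cbind(expected = v, PC5 = pca_df$rotation[, "PC5"])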