Data609-Module5

Example 1

Carry out the logistic regression (example 22 on page 94) in R using the data.

We’ll start with initial numbers, a = 1 and b = 1.

x <- c(0.1,0.5,1,1.5,2,2.5)
y <- c(0,0,1,1,1,0)
data <- rbind(x,y)
X <- as.matrix(x)

a <- 1
b <- 1

P_val <- function(x,y,a,b){
  return (1/(1+exp(a+b*x)))
}
#hypothesis/prediction
P <- matrix(0,1,ncol(data))
for (col in 1:ncol(data)){
  P[1,col] <- P_val(data[1,col],data[2,col],a,b)
}
P

##           [,1]      [,2]      [,3]       [,4]       [,5]       [,6]
## [1,] 0.2497399 0.1824255 0.1192029 0.07585818 0.04742587 0.02931223

log_likelihood <- function(y,p){
  return((y*log(p)) + ((1-y)*(log(1-p))))
}
L <- matrix(0,1,ncol(P))
for (col in 1:ncol(P)){
  L[1,col] <- log_likelihood(data[2,col],P[1,col])
}
L

##            [,1]       [,2]      [,3]     [,4]      [,5]        [,6]
## [1,] -0.2873353 -0.2014133 -2.126928 -2.57889 -3.048587 -0.02975042

sum_L <- sum(L)
sum_L

## [1] -8.272904

W <- matrix(0,6,6)
for(i in 1:ncol(P)){
  W[i,i] <- P[i]*(1-P[i])
}
W

##           [,1]      [,2]      [,3]       [,4]       [,5]       [,6]
## [1,] 0.1873699 0.0000000 0.0000000 0.00000000 0.00000000 0.00000000
## [2,] 0.0000000 0.1491465 0.0000000 0.00000000 0.00000000 0.00000000
## [3,] 0.0000000 0.0000000 0.1049936 0.00000000 0.00000000 0.00000000
## [4,] 0.0000000 0.0000000 0.0000000 0.07010372 0.00000000 0.00000000
## [5,] 0.0000000 0.0000000 0.0000000 0.00000000 0.04517666 0.00000000
## [6,] 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000 0.02845302

#1x6 %*% 6x6 %*% 6x1 -> 1x1
J <- solve(t(X) %*% W %*% X)
YP <- (as.matrix(y) - t(P))

#1x1 %*% 1x6 %*% 6x1
B_1 <- b + (J %*%t(X)%*%YP)
B_1

##          [,1]
## [1,] 7.030494

Example 2

Using the motor car database (mtcars) of the built-in data sets in R to carry out the basic principal component analysis and explain your results.

The first component explains 60% of the variation in the data and the second component explains 24%. Using the screeplot we can also see the variation amount decreases as the component number increases, which is what we want to see. There are 11 princal components because there are 11 columns in the dataset mtcars. The biplot shows that carb, hp, cyl, disp and wt are similar

pca <- prcomp(mtcars, scale = TRUE)
summary(pca)

## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
##                            PC8    PC9    PC10   PC11
## Standard deviation     0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion  0.98626 0.9933 0.99800 1.0000

screeplot(pca)

The variables carb, hp, cycl, disp, and wt are similar because their PC1 component is around 0.2 to 0.4
The variables gear, am, mpg, drat, qsec, and vs are similar because their PC1 component is around -0.2 to -0.3

biplot(pca)

pca$rotation[,c("PC1","PC2")]

##             PC1         PC2
## mpg  -0.3625305  0.01612440
## cyl   0.3739160  0.04374371
## disp  0.3681852 -0.04932413
## hp    0.3300569  0.24878402
## drat -0.2941514  0.27469408
## wt    0.3461033 -0.14303825
## qsec -0.2004563 -0.46337482
## vs   -0.3065113 -0.23164699
## am   -0.2349429  0.42941765
## gear -0.2069162  0.46234863
## carb  0.2140177  0.41357106

Example 3

Generate a random 4 x 5 matrix, and find its singular value decomposition using R.

a <- matrix(sample(1:10, 20,replace=TRUE),nrow=5)
svd(a)

## $d
## [1] 26.536357  7.166976  5.626670  1.340438
## 
## $u
##            [,1]        [,2]       [,3]       [,4]
## [1,] -0.5421794 -0.04988251 -0.6994960 -0.4612315
## [2,] -0.4707918 -0.77763769  0.2980074  0.1652674
## [3,] -0.5001427  0.57225295  0.4346565 -0.1714005
## [4,] -0.3869430  0.25380127 -0.2530864  0.8275172
## [5,] -0.2907388  0.03004734  0.4109963 -0.2139844
## 
## $v
##            [,1]       [,2]         [,3]       [,4]
## [1,] -0.4299994 -0.6245212  0.001724634  0.6519745
## [2,] -0.4271137 -0.3240851 -0.603163595 -0.5905391
## [3,] -0.5860607  0.7103796 -0.255138814  0.2946148
## [4,] -0.5377799 -0.0174060  0.755708203 -0.3733563

Example 4

First try to simulate 100 data points for y using

\(y = 5x_{1} + 2x_{2} + 2x_{3} + x_{4}\)

where \(x_{1},x_{2}\) are uniformly distributed in [1,2], while \(x_{3},x_{4}\), are normally distributed with zero mean and unit variance. Then, use the principal component analysis (PCA) to analyze the data to find its principal components. Are the results expected from the formula?

The scree plot shows that the principal components values are very similar meaning that they all account for similar variation in the data points. The eigen values are also very similar which means no one direction accounts for a majority of the variation.

When we plot PC1 and PC2 we see there isn’t strong direction either vertically or horizontally. I would think this makes sense because the uniform data would have some variance horizontally but the normal data would have variance vertically. There seems to be slight more variation horizontally which would make sense.

x1 <- runif(100, min=1, max=2)
x2 <- runif(100, min=1, max=2)
x3 <- rnorm(100,mean=0,sd=1)
x4 <- rnorm(100,mean=0,sd=1)
y <- 5*x1 + 2*x2 + 2*x3 + x4

#Plots of original data
plot(x1)

plot(x2)

hist(x1)

hist(x2)

hist(x3)

hist(x4)

#pca information
data <- as.data.frame(cbind(y,x1,x2,x3,x4))
pca_data <- prcomp(data, scale = TRUE)
summary(pca_data)

## Importance of components:
##                           PC1    PC2    PC3    PC4       PC5
## Standard deviation     1.3944 1.0768 0.9858 0.9613 4.006e-16
## Proportion of Variance 0.3889 0.2319 0.1943 0.1848 0.000e+00
## Cumulative Proportion  0.3889 0.6208 0.8152 1.0000 1.000e+00

screeplot(pca_data)

#covariance matrix
cov <- cov(data)

#eigen information
ev <- eigen(cov)
ev

## eigen() decomposition
## $values
## [1]  6.681456e+00  1.073970e+00  3.653439e-01  8.286878e-02 -1.077700e-17
## 
## $vectors
##            [,1]        [,2]       [,3]          [,4]       [,5]
## [1,] 0.95881030 -0.07649916  0.2150790  0.0005117351  0.1690309
## [2,] 0.05760977 -0.07899689  0.3801522 -0.3628211472 -0.8451543
## [3,] 0.02850393 -0.02147710  0.1287587  0.9315909786 -0.3380617
## [4,] 0.22992742 -0.29390303 -0.8638150 -0.0170977035 -0.3380617
## [5,] 0.15389877  0.94924555 -0.2155694 -0.0143690790 -0.1690309

#Let's plot first 2 principal components
plot(pca_data$x[,1],pca_data$x[,2])

Here is just the example from class recreated.

x <- c(-2.5,-1.5,-0.5,0.5,1.5,2.5)
y <- c(-0.8,-1.2,0.4,-0.5,0.3,1.8)
X <- as.data.frame(cbind(x,y))
cov <- cov(X)
ev <- eigen(cov)
U = as.matrix(ev$vectors)

t(U) %*% t(X)

##            [,1]      [,2]       [,3]       [,4]       [,5]       [,6]
## [1,]  2.5874319 1.8841891  0.2594742 -0.2134006 -1.4695266 -3.0481679
## [2,] -0.4418104 0.3739406 -0.5853829  0.6741366  0.4248429 -0.4457269