library(ggplot2)
library(ggfortify)
library(cluster)
library(dplyr)

1) Logistic Regression
In R, we can carry out logistic regression on the set of points below with the glm() function. glm() fits the logistic model \[y(x)=\frac{1}{1+\exp(-(a+bx))}\] and estimates the probability that y = 1 at each value of x.
Looking at our given x and y values, we can see that the response is binary, so we specify family = binomial in the model. Our fitted model matches the a and b values seen in the textbook, converging in 4 Fisher scoring iterations compared to the textbook's 20. With the fitted coefficients, the model can be written as \[P=\frac{1}{1+\exp(0.8982-0.7099x)}\]. Note that the p-value for x is not below alpha, so there is no significant relationship between x and y, and the fitted curve will not be a close fit to the points.
data<-data.frame(x=c(0.1,0.5,1.0,1.5,2.0,2.5),y=c(0,0,1,1,1,0))
#Plotting the points to see the type of regression this model can fit
#The points align towards a binomial curve
ggplot(data,aes(x=x,y=y))+geom_jitter()+theme_light()+labs(title = "Plotting the x and y points")

logm<-glm(y~x,family = "binomial",data)
#see the summary stats from our logistic model, used to create the best fit curve
summary(logm)

##
## Call:
## glm(formula = y ~ x, family = "binomial", data = data)
##
## Deviance Residuals:
## 1 2 3 4 5 6
## -0.8518 -0.9570 1.2583 1.1075 0.9653 -1.5650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8982 1.5811 -0.568 0.570
## x 0.7099 1.0557 0.672 0.501
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.3178 on 5 degrees of freedom
## Residual deviance: 7.8325 on 4 degrees of freedom
## AIC: 11.832
##
## Number of Fisher Scoring iterations: 4
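Rather than reading the numbers off the printed summary, the coefficient table can also be pulled out programmatically; a minimal sketch using the logm object fitted above:

#full coefficient table: estimates, standard errors, z values and p-values
summary(logm)$coefficients
#just the intercept (a) and slope (b) used in the curve below
coef(logm)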
fx<-function(x){
y<-1/(1+exp((0.8982)-0.7099 *x))
return(y)
}
#plotting our best fit curve
ggplot(data,aes(x=x,y=y))+geom_point()+theme_light()+geom_function(fun=fx,colour="pink")+labs(title = "Plotting w/ the logistic model curve")
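As a quick sanity check (a small sketch), the hand-coded fx curve should agree with glm's own fitted probabilities up to the rounding of the coefficients:

#compare the manual curve against predict() on the response scale
cbind(manual=fx(data$x),glm_fit=predict(logm,type="response"))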
2) PCA with mtcars
PCA of the mtcars data set will reduce the number of variables in mtcars. We want to identify differences between the car models by collapsing features that are correlated with each other into a smaller set of uncorrelated principal components, so that we can describe mtcars with only its most prominent components.
Looking at the summary of our PCs, we notice that PC1 and PC2 together explain about 84% of the total variance in mtcars. These two components account for most of the information in the data set, so we can rely on these two PCs alone. This is confirmed in our scree plot, which shows that the variances of PC1 and PC2 are above our baseline of 1. Because the variables are scaled, each original variable contributes a variance of 1, so a component with variance above 1 explains more than any single original variable would.
In the first graph, we see a single cluster containing all of the models, with what appears to be a strong sub-cluster of models in the bottom right of the plot. Looking at the biplot, the features cyl, wt, and disp appear to be correlated with each other.
head(mtcars)

## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
#find the PC of the mtcars dataset
pca<-prcomp(mtcars,scale.=TRUE)
#See the PCA of mtcars
summary(pca)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion 0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
## PC8 PC9 PC10 PC11
## Standard deviation 0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion 0.98626 0.9933 0.99800 1.0000
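The baseline of 1 mentioned above can be checked by hand; a minimal sketch, using the pca object already computed:

#variance (eigenvalue) of each component; values above 1 explain more than one scaled variable
pca$sdev^2
#cumulative proportion of total variance; PC1 and PC2 together reach roughly 84%
cumsum(pca$sdev^2)/sum(pca$sdev^2)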
#See all PCs and their variances; we want components that each account for over 10% of the total variance
screeplot(pca,type="l",main="Viewing total Variance of MTcars")#plot the PCA plot with the PCs that carry the most variance
autoplot(pca,mtcars,label=TRUE,frame=TRUE)

#See if there are features causing sub clusters in our main cluster
pca%>%biplot(cex=0.5)
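The correlation suggested by the biplot can be confirmed directly; a small sketch, assuming the features in question are cyl, disp, and wt:

#pairwise correlations of the features that point in the same direction in the biplot
round(cor(mtcars[,c("cyl","disp","wt")]),2)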
3) SVD in R
The singular value decomposition factors any matrix as \[A=U\Sigma V^T\]. This method transforms a data set with correlated variables (as seen with cyl, wt, and disp above) into a set of uncorrelated variables. In R, this process is simplified with the svd() function!
#create matrix of random values
data<-matrix(rnorm(20),nrow=4)
#use svd() to find its three matrices
sdata<-svd(data)
sdata

## $d
## [1] 3.9367230 2.0093268 0.8216280 0.1545532
##
## $u
## [,1] [,2] [,3] [,4]
## [1,] 0.2521362 0.8524567 0.3949345 0.2318872
## [2,] 0.4234798 0.3001932 -0.5200524 -0.6783025
## [3,] -0.6432167 0.1766105 0.3989033 -0.6292512
## [4,] -0.5859731 0.3898851 -0.6437766 0.3002947
##
## $v
## [,1] [,2] [,3] [,4]
## [1,] 0.03402112 -0.14842106 0.02440098 -0.4038798
## [2,] 0.05041672 0.60237566 -0.71097117 0.2698056
## [3,] 0.76069241 -0.33515751 0.02397506 0.5334735
## [4,] 0.52883273 -0.06193846 -0.38220640 -0.6816931
## [5,] -0.37146160 -0.70636239 -0.58929454 0.1216004
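As a check (a short sketch), multiplying the three factors back together recovers the original matrix:

#reconstruct A = U %*% diag(d) %*% t(V) and compare with the input matrix
all.equal(data,sdata$u%*%diag(sdata$d)%*%t(sdata$v))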
4)
I’m confused by this question. What would be different about this matrix compared to the matrices given previously? At most, it is as random as the matrix in question 3, since the possible sequences and starting values are endless.
The column y is built as a weighted combination of the x columns, so it differs from each of them individually. The x and x2 columns are drawn independently with runif from a uniform distribution between 1 and 2, so the two sequences will never be the same, and likewise x3 and x4 are independent draws from a standard normal distribution. Because the predictor columns are generated independently of each other, no single PC can account for the majority of the variance in the matrix: the largest proportion of variance explained by any one component is below 40%.
#create the matrix
set.seed(2)
x<-runif(100,min=1,max=2)
x2<-runif(100,min=1,max=2)
x3<-rnorm(100,mean=0,sd=1)
x4<-rnorm(100,mean=0,sd=1)
y<-5*x+2*x2+2*x3+x4
data<-as.data.frame(cbind(y,x,x2,x3,x4))
pca<-prcomp(data,scale.=TRUE)
summary(pca)

## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.405 1.0255 0.9996 0.9870 2.475e-16
## Proportion of Variance 0.395 0.2103 0.1998 0.1948 0.000e+00
## Cumulative Proportion 0.395 0.6054 0.8052 1.0000 1.000e+00
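To back up the claim that no single component dominates, a minimal sketch of the pairwise correlations of the generated columns:

#x, x2, x3 and x4 are nearly uncorrelated with each other, while y correlates with all of them
round(cor(data),2)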
#see the PC variances; notice that PC1-PC4 create the "elbow"
#there's not much of a variance difference between PC2-PC4
screeplot(pca,type="l",main="Viewing the total variance of the prinicpal components")#See if there are features causing sub clusters in our main
pca%>%biplot(cex=0.5)
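One last observation (a small sketch): because y is constructed as an exact linear combination of the other four columns, the final principal component has essentially zero variance, which matches the 2.475e-16 standard deviation reported in the summary above.

#the smallest eigenvalue is numerically zero since y = 5*x + 2*x2 + 2*x3 + x4 exactly
tail(pca$sdev^2,1)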