R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

Part A

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.5.2
## First, we assign a numeric value to goal using an ifelse statement.
## Then we use ggplot to plot the fitted logistic regression curves.
df<-read.csv("https://s3.amazonaws.com/programmingforanalytics/footballkicks.csv")
df$goal<-ifelse(df$goal =="Y",1,0)
str(df)
## 'data.frame':    952 obs. of  3 variables:
##  $ yards          : int  18 19 19 19 19 19 20 20 20 20 ...
##  $ goal           : num  1 1 1 1 1 1 0 1 1 1 ...
##  $ practiceormatch: Factor w/ 2 levels "M","P": 2 2 2 2 2 2 2 2 2 2 ...
a <- ggplot(df) +
  aes(x = yards, y = goal, colour = practiceormatch) +
  geom_smooth(method = "glm", method.args = list(family = binomial), se = FALSE) +
  xlab("Distance to goal in yards") +
  ylab("Probability of scoring a goal") +
  ggtitle("Probability of scoring a goal")
a
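## As a complement to the plotted smooths, the same logistic regression can be
## fitted directly with glm(); a sketch, with one curve per setting expressed
## as a yards-by-practiceormatch interaction (not run here):
fit <- glm(goal ~ yards * practiceormatch, data = df, family = binomial)
summary(fit)
## Estimated probability of scoring from 40 yards in a match ("M")
predict(fit, newdata = data.frame(yards = 40, practiceormatch = "M"),
        type = "response")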

Part B

library(car)
## Warning: package 'car' was built under R version 3.5.2
## Loading required package: carData
##a) 
train <- read.csv("~/Desktop/train.csv")
boxplot(train$x)

boxplot(train$y)

outx <- boxplot(train$x, plot =FALSE)$out
print(outx)
## [1] 3530.157
##From the boxplot, the outlier for x is 3530.157. There are no outliers for y.
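## boxplot() flags points beyond 1.5 * IQR from the quartiles; a minimal sketch
## of applying that rule by hand, which should flag the same point (not run here):
q <- quantile(train$x, c(0.25, 0.75))
iqr <- q[2] - q[1]
train$x[train$x < q[1] - 1.5 * iqr | train$x > q[2] + 1.5 * iqr]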

##b) 
## remove outliers
train_new <- train[-which(train$x %in% outx),]
## There is a positive relationship between x and y.
scatterplot(y~x, regLine=TRUE, 
            smooth=FALSE, 
            boxplots='xy',
            pch = 19, 
            cex = 1,
            col = "red",
            data=train_new)
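## The plot suggests a strong linear association; a one-line numeric check
## (a sketch, not run here):
cor(train_new$x, train_new$y)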

##c)
regModel <- lm(y ~ x, data = train_new)
summary(regModel)
## 
## Call:
## lm(formula = y ~ x, data = train_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.1523 -2.0179  0.0325  1.8573  8.9132 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.107265   0.212170  -0.506    0.613    
## x            1.000656   0.003672 272.510   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.809 on 697 degrees of freedom
## Multiple R-squared:  0.9907, Adjusted R-squared:  0.9907 
## F-statistic: 7.426e+04 on 1 and 697 DF,  p-value: < 2.2e-16
plot(regModel)

## The fitted model is y = 1.0007x - 0.1073.
## Each one-unit increase in x is associated with an increase of about 1.0007 in y.
## We use the coefficient p-value to decide whether to keep each term. Here the p-value for x is smaller than 0.05, so we reject the null hypothesis and conclude the term is significant.
## Plot 1 (residuals vs fitted values) shows a random pattern, so the variance of the residuals appears constant.
## Plot 2 (normal Q-Q) follows a straight line, so the residuals appear normally distributed.
## Plot 3 (scale-location) also looks random, with no pattern, so we are good.
## Plot 4 (residuals vs leverage, with Cook's distance) identifies the points with the greatest influence on the regression; only three points (633, 280, and 662) stand out as influential.
##In conclusion, the residual diagnostics look good, and R-squared is 0.9907, which means 99.07% of the variation in the outcome is explained by the independent variable in the model.
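## To follow up on the influence diagnostics, a sketch (not run here) that pulls
## out the rows behind the points labelled in plot 4:
cd <- sort(cooks.distance(regModel), decreasing = TRUE)
head(cd, 3)                      # largest Cook's distances
train_new[names(head(cd, 3)), ]  # the underlying observations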

Part C

##a)
test <- read.csv("~/Desktop/test.csv")
regModel2 <- lm(y ~ x, data = test)
summary(regModel2)
## 
## Call:
## lm(formula = y ~ x, data = test)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6515  -2.0356  -0.1612   2.0239   8.3965 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.461811   0.359560  -1.284      0.2    
## x            1.014335   0.006162 164.598   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.037 on 298 degrees of freedom
## Multiple R-squared:  0.9891, Adjusted R-squared:  0.9891 
## F-statistic: 2.709e+04 on 1 and 298 DF,  p-value: < 2.2e-16
summary(regModel2$residuals)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10.6515  -2.0356  -0.1612   0.0000   2.0239   8.3965
##The residual mean is 0, and the mean and median are very close to each other, so the residuals are likely close to normally distributed.
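## A slightly more formal check of that impression (a sketch, not run here):
qqnorm(regModel2$residuals)
qqline(regModel2$residuals)
shapiro.test(regModel2$residuals)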

##b)
##Using scatterplot to show the relationship between x and y in the test set; they have a positive relationship.
scatterplot(y~x, regLine=TRUE, 
            smooth=FALSE, 
            boxplots='xy',
            pch = 20, 
            cex = 1,
            col = "red",
            data=test)
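## Since the Part B model is already in hand, a natural out-of-sample check is to
## score the test set with it; a sketch, assuming train and test share the same
## columns (not run here):
pred <- predict(regModel, newdata = test)
sqrt(mean((test$y - pred)^2))  # test-set RMSE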

Part D

## The histogram shows that mpg, our dependent variable, is not normally distributed, so we use log(mpg) as our dependent variable.
library(GGally)
hist(mtcars$mpg)

hist(log(mtcars$mpg))
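## Q-Q plots give a second visual check of the same comparison (a sketch, not run here):
qqnorm(mtcars$mpg, main = "mpg")
qqline(mtcars$mpg)
qqnorm(log(mtcars$mpg), main = "log(mpg)")
qqline(log(mtcars$mpg))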

## We include all the variables as our independent variables.
regmodel3 <- lm(log(mpg)~ cyl+ disp + hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(regmodel3)
## 
## Call:
## lm(formula = log(mpg) ~ cyl + disp + hp + drat + wt + qsec + 
##     vs + am + gear + carb, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14569 -0.07886 -0.01752  0.06524  0.25130 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.776e+00  8.492e-01   3.268  0.00367 **
## cyl          7.657e-03  4.741e-02   0.161  0.87326   
## disp         4.989e-05  8.102e-04   0.062  0.95149   
## hp          -8.964e-04  9.877e-04  -0.908  0.37439   
## drat         2.220e-02  7.420e-02   0.299  0.76772   
## wt          -1.723e-01  8.595e-02  -2.005  0.05804 . 
## qsec         3.077e-02  3.316e-02   0.928  0.36401   
## vs          -2.874e-03  9.548e-02  -0.030  0.97627   
## am           4.738e-02  9.331e-02   0.508  0.61693   
## gear         5.925e-02  6.775e-02   0.875  0.39170   
## carb        -2.012e-02  3.760e-02  -0.535  0.59826   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1202 on 21 degrees of freedom
## Multiple R-squared:  0.8895, Adjusted R-squared:  0.8369 
## F-statistic: 16.91 on 10 and 21 DF,  p-value: 6.89e-08
## The summary statistics and residual analysis indicate that this is not a good model.
plot(regmodel3)

x<-subset(mtcars,select=-c(mpg))
ggpairs(x)

round(cor(x),2)
##        cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
## cyl   1.00  0.90  0.83 -0.70  0.78 -0.59 -0.81 -0.52 -0.49  0.53
## disp  0.90  1.00  0.79 -0.71  0.89 -0.43 -0.71 -0.59 -0.56  0.39
## hp    0.83  0.79  1.00 -0.45  0.66 -0.71 -0.72 -0.24 -0.13  0.75
## drat -0.70 -0.71 -0.45  1.00 -0.71  0.09  0.44  0.71  0.70 -0.09
## wt    0.78  0.89  0.66 -0.71  1.00 -0.17 -0.55 -0.69 -0.58  0.43
## qsec -0.59 -0.43 -0.71  0.09 -0.17  1.00  0.74 -0.23 -0.21 -0.66
## vs   -0.81 -0.71 -0.72  0.44 -0.55  0.74  1.00  0.17  0.21 -0.57
## am   -0.52 -0.59 -0.24  0.71 -0.69 -0.23  0.17  1.00  0.79  0.06
## gear -0.49 -0.56 -0.13  0.70 -0.58 -0.21  0.21  0.79  1.00  0.27
## carb  0.53  0.39  0.75 -0.09  0.43 -0.66 -0.57  0.06  0.27  1.00
## Visualizing the correlations shows that we have multicollinearity issues, meaning some of the independent variables are highly correlated; for example, disp and cyl have a high correlation of 0.90.
## We use VIF to detect multicollinearity.
vif(regmodel3)
##       cyl      disp        hp      drat        wt      qsec        vs 
## 15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873 
##        am      gear      carb 
##  4.648487  5.357452  7.908747
regmodel4 <- lm(log(mpg)~ hp + drat + wt + qsec + vs + am + gear + carb+ cyl, data = mtcars)
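## Sketch of the intermediate check described in the write-up below (not run here):
## after dropping disp, the variable with the highest VIF, cyl still shows a VIF
## above 10, so it is dropped next.
vif(regmodel4)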
regmodel5 <- lm(log(mpg)~ hp + drat + wt + qsec + vs + am + gear + carb, data = mtcars)
summary(regmodel5)
## 
## Call:
## lm(formula = log(mpg) ~ hp + drat + wt + qsec + vs + am + gear + 
##     carb, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.14449 -0.08253 -0.02332  0.06706  0.24736 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.8867103  0.5774247   4.999 4.67e-05 ***
## hp          -0.0008033  0.0007389  -1.087  0.28821    
## drat         0.0192522  0.0681399   0.283  0.78006    
## wt          -0.1667382  0.0519261  -3.211  0.00387 ** 
## qsec         0.0279593  0.0281197   0.994  0.33043    
## vs          -0.0098131  0.0846891  -0.116  0.90876    
## am           0.0424353  0.0856907   0.495  0.62515    
## gear         0.0552780  0.0606266   0.912  0.37134    
## carb        -0.0213290  0.0264872  -0.805  0.42892    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.115 on 23 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8509 
## F-statistic: 23.11 on 8 and 23 DF,  p-value: 2.995e-09
## In the mtcars dataset, we investigate the relationship between miles per gallon (mpg), a numeric variable, and a set of independent variables, both numeric and categorical. First, we use log(mpg) as the dependent variable, because mpg itself is not normally distributed, which would affect the regression modelling. Second, we include all 10 variables (cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb) as independent variables and build a multivariable regression model.
##
## A good regression model rests on assumptions about the residuals, so residual analysis is essential. The summary statistics and residual plots show that this first model is not a good regression. One of the most significant reasons is multicollinearity, meaning some of the independent variables are highly correlated; for example, the correlation plot shows that displacement and number of cylinders are strongly correlated, which makes physical sense. We therefore use VIF scores to remove the variables with the highest VIF. The variance inflation factor (VIF) measures how much the variance of an estimated coefficient is inflated by multicollinearity; values above 10 indicate a severe problem. After removing disp, the variable with the highest VIF, we check the VIFs again; cyl still has a VIF above 10, so we remove it as well. After removing cyl, no variable has a VIF above 10, so no further variables need to be removed.
##
## The final model is log(mpg) = 2.8867103 - 0.0008033 hp + 0.0192522 drat - 0.1667382 wt + 0.0279593 qsec - 0.0098131 vs + 0.0424353 am + 0.0552780 gear - 0.0213290 carb. Judging by the p-values in the summary statistics, weight is the most important term in the model. Because the response is log(mpg), the weight coefficient means that each one-unit increase in weight leads to a decrease of about 0.167 in log(mpg), roughly a 15% decrease in mpg; the relationship is negative. If we intend to increase miles per gallon, we need to reduce weight as a priority.
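## Because the response is on the log scale, the wt coefficient translates into an
## approximate percentage change in mpg; a sketch of the exact conversion (not run here):
(exp(coef(regmodel5)["wt"]) - 1) * 100  # about -15% mpg per one-unit increase in wt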