CUNY Data 605 - Assignment# 10

Single Variable Regression Line

We will be investigating a data set given by the following table.

table

We will perform the analysis in R.

Analysis

my_data <- data.frame("age"=c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37), "heart_rate"=c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198,  183, 178))

The data set size is 15. For this analysis, age will be the explanatory (independent) variable and heart_rate the response (dependent) variable. We will now plot the distribution.

library(ggplot2)
sp<-ggplot(my_data, aes(x=my_data$age, y=my_data$heart_rate)) + geom_point(size = 1.5, shape=20)
sp

From the scatterplot, it appears that the data show a relatively strong downward linear trend. Let us investigate the level of correlation between these variables.

cor(my_data$heart_rate, my_data$age)

## [1] -0.9534656

It appeared that we have a strong corelation between the 2 variables. Let us show the plot and add the least square linear regression line and display the summary statistics.

sp + stat_smooth(method = lm, level=0.95)

m1 <- lm(my_data$heart_rate ~ my_data$age, data = my_data)
summary(m1)

## 
## Call:
## lm(formula = my_data$heart_rate ~ my_data$age, data = my_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## my_data$age  -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08

From the Summary Statistics for the regression line, we can write the following equation:
\(\hat { y } =\quad 210.04846\quad -\quad 0.79773\cdot x\)

The \(R^2\) value represents the proportion of variability in the response variable (heart_rate) that is explained by the explanatory variable (age). For this model, 90.91% of the variability in heart-rate is explained by age.

Model diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, (3) constant variability, and (4) Independence of Observations.

Linearity: We already confirmed that the relationship between heart_rate and age is linear using a scatterplot.

Nearly normal residuals: We will look at the distribution of the residuals by looking at the histogram.

my_residual <- data.frame(my_data$age, m1$residuals)
names(my_residual) <- c('age', 'residuals')

res_hist<-ggplot(my_residual, aes(residuals)) + geom_histogram(binwidth = 5)
res_hist

From the histogram, it is not clear whether the residuals distribution is near normal. There are limited data points in the sample. The sample size is 15.

Constant Variability: We will look at the scatterplot for the residuals, for constant variability to hold true we would expect the data points of the residual plot to be equally scattered.

res_plot<-ggplot(my_residual, aes(x=my_residual$age, y=my_residual$residuals)) + geom_point(size = 1.5, shape=21) + geom_hline(mapping = NULL, data=Null, yintercept = 0, color = 'Blue', na.rm=FALSE, show.legend=NA)
res_plot

From the residual plot, it appears that there is constant variability. It also shows the presense of outsiders.

Independent Observations: We have no information on how the sample was selected and we will assume independence.

Model Diagnostics Let us set-up the following hypothesis:
H0 = The true linear model has slope zero
H1 = \({ \beta }_{ 1 }\quad <\quad 0\)

From the Summary Statistics, we have t = -11.40 with 13 df of freedom. From the value of the p-value, we can reject the H0 hypothesis.

Conclusion:
From the analysis, it appears that there is strong negative relationship between Age and Heart-Rate, however, some caution should be exercised in using the model due the limited sample size and the distribution of the residuals does not show a very normal distribution.

Multiple Variables Regression Line

We will now investigate the data in the following Auto data set. We will consider mpg as the respondent (or dependent) variable and the other 4 (displacement, horsepower, weight, acceleration) as explanatory (or independent) variables.

We will first performed our analysis on a randomly selected sample of size 40. Then we will execute the analysis on the full population.

df_mpg <- read.table("auto-mpg.data")
names(df_mpg) <- c('displacement', 'horsepower', 'weight', 'acceleration', 'mpg')

We will extract a random sample of 40 data set and perform the analysis.

Analysis of Sample

# Take a random sample of 40 data elements
mpg_sample <- df_mpg[sample(1:nrow(df_mpg), 40, replace=FALSE),]
write.csv(mpg_sample, file = "mpg_sample.csv")

#load sample to preserve results
mpg_sample <- read.csv("mpg_sample.csv")
mpg_sample<-mpg_sample[,2:6]

Since we are intersted in analysis of the variable mpg, let us first have a look at the histogram of this variable.

mpg_sample_hist<-ggplot(mpg_sample, aes(mpg)) + geom_histogram(binwidth = 4)
mpg_sample_hist

The distribution of mpg is skewed to the right, this would suggest the presence of outsiders.

We will now look at this variables with respect with each of the other.

sp_d<-ggplot(mpg_sample, aes(x=mpg_sample$displacement, y=mpg_sample$mpg)) + geom_point(size = 1.5, shape=20)
sp_h<-ggplot(mpg_sample, aes(x=mpg_sample$horsepower, y=mpg_sample$mpg)) + geom_point(size = 1.5, shape=20)
sp_w<-ggplot(mpg_sample, aes(x=mpg_sample$weight, y=mpg_sample$mpg)) + geom_point(size = 1.5, shape=20)
sp_a<-ggplot(mpg_sample, aes(x=mpg_sample$acceleration, y=mpg_sample$mpg)) + geom_point(size = 1.5, shape=20)
sp_d

sp_h

sp_w

sp_a

While the mpg shows a negative relation with ’displacement, horsepower, and weight, it has a positive relation with acceleration.

We will now construct the multivariable linear model and observe the summary statistics, we will looking for the best fit model.

m2_full <- lm(mpg_sample$mpg ~ mpg_sample$displacement + mpg_sample$horsepower + mpg_sample$weight + mpg_sample$acceleration, data = mpg_sample)

summary(m2_full)

## 
## Call:
## lm(formula = mpg_sample$mpg ~ mpg_sample$displacement + mpg_sample$horsepower + 
##     mpg_sample$weight + mpg_sample$acceleration, data = mpg_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8467 -2.8193 -0.1959  2.2399 10.1579 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             60.500960   7.972661   7.589 6.76e-09 ***
## mpg_sample$displacement -0.024741   0.029596  -0.836   0.4088    
## mpg_sample$horsepower   -0.010290   0.051960  -0.198   0.8442    
## mpg_sample$weight       -0.006302   0.003368  -1.871   0.0697 .  
## mpg_sample$acceleration -0.830350   0.422516  -1.965   0.0574 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.394 on 35 degrees of freedom
## Multiple R-squared:  0.7524, Adjusted R-squared:  0.7241 
## F-statistic: 26.59 on 4 and 35 DF,  p-value: 3.485e-10

By observing the p-values for each independent variables;
We have horsepower to be the least significant variable with a p-value = 0.8442 and the adjusted \({ { R }^{ 2 } }\) = 0.7241
We will now remove this variable from the model and re-evaluate.

m2_p1 <- lm(mpg_sample$mpg ~  mpg_sample$displacement +  mpg_sample$weight + mpg_sample$acceleration, data = mpg_sample)

summary(m2_p1)

## 
## Call:
## lm(formula = mpg_sample$mpg ~ mpg_sample$displacement + mpg_sample$weight + 
##     mpg_sample$acceleration, data = mpg_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8484 -2.6439 -0.1932  2.2747 10.2236 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             59.829621   7.119093   8.404  5.2e-10 ***
## mpg_sample$displacement -0.026124   0.028374  -0.921   0.3633    
## mpg_sample$weight       -0.006513   0.003153  -2.066   0.0461 *  
## mpg_sample$acceleration -0.799083   0.386642  -2.067   0.0460 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.335 on 36 degrees of freedom
## Multiple R-squared:  0.7521, Adjusted R-squared:  0.7315 
## F-statistic: 36.41 on 3 and 36 DF,  p-value: 5.338e-11

The adjusted \({ { R }^{ 2 } }\) is now 0.7315, hence we have horsepower that was slightly collinear with the other independent variables. We will now consider displacement that has the highest p-value = 0.3633. We will drop this variable from the model and re-evaluate.

m2_p2 <- lm(mpg_sample$mpg ~  mpg_sample$weight + mpg_sample$acceleration, data = mpg_sample)

summary(m2_p2)

## 
## Call:
## lm(formula = mpg_sample$mpg ~ mpg_sample$weight + mpg_sample$acceleration, 
##     data = mpg_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5992 -2.7989 -0.3176  2.6307 10.2783 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             60.3733261  7.0799427   8.527 2.92e-10 ***
## mpg_sample$weight       -0.0092774  0.0009592  -9.672 1.13e-11 ***
## mpg_sample$acceleration -0.6299876  0.3395373  -1.855   0.0715 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.326 on 37 degrees of freedom
## Multiple R-squared:  0.7463, Adjusted R-squared:  0.7326 
## F-statistic: 54.41 on 2 and 37 DF,  p-value: 9.565e-12

We now have an adjusted \({ { R }^{ 2 } }\) = 0.7326, since this is better than the previous one, we will continue with finding the best fitted line. Looking at the summary, we observe that acceleration has a p-value of 0.0715. We will now drop this variable and re-evaluate.

m2_p3 <- lm(mpg_sample$mpg ~ mpg_sample$weight, data = mpg_sample)

summary(m2_p3)

## 
## Call:
## lm(formula = mpg_sample$mpg ~ mpg_sample$weight, data = mpg_sample)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8089 -3.1934 -0.8721  2.4876 11.9714 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       48.0348158  2.5066479  19.163  < 2e-16 ***
## mpg_sample$weight -0.0083277  0.0008369  -9.951 3.91e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.462 on 38 degrees of freedom
## Multiple R-squared:  0.7227, Adjusted R-squared:  0.7154 
## F-statistic: 99.02 on 1 and 38 DF,  p-value: 3.914e-12

Since the \({ { R }^{ 2 } }\) = 0.7227 (now that we are down to a single independent variable, the \({ { R }^{ 2 } }\) is a better indicator that the adjusted \({ { R }^{ 2 } }\) value). The best fit model for the selected sample is:
\(\hat { mpg } \quad =\quad 60.3733\quad -\quad 0.0093\cdot weight\quad -\quad 0.63\cdot acceleration\)

The standard error for these coefficients are as follows:
for weight = 0.0009592
for acceleration = 0.3395373

Analysis of population

We will now look at the entire population given to us in the data set auto-mpg.

mpg_pop_hist<-ggplot(df_mpg, aes(mpg)) + geom_histogram(binwidth = 4)
mpg_pop_hist

From the histogram, we observe that the population has a right-skeewed distribution.

We will now look at each independent variable with respect to mpg.

sp_p_d<-ggplot(df_mpg, aes(x=df_mpg$displacement, y=df_mpg$mpg)) + geom_point(size = 1.5, shape=20)
sp_p_h<-ggplot(df_mpg, aes(x=df_mpg$horsepower, y=df_mpg$mpg)) + geom_point(size = 1.5, shape=20)
sp_p_w<-ggplot(df_mpg, aes(x=df_mpg$weight, y=df_mpg$mpg)) + geom_point(size = 1.5, shape=20)
sp_p_a<-ggplot(df_mpg, aes(x=df_mpg$acceleration, y=df_mpg$mpg)) + geom_point(size = 1.5, shape=20)
sp_p_d

sp_p_h

sp_p_w

sp_p_a

While the mpg shows a negative relation with ’displacement, horsepower, and weight, it has a positive relation with acceleration.

We will now construct the multivariable linear model and observe the summary statistics, we will looking for the best fit model.

m3_full <- lm(df_mpg$mpg ~ df_mpg$displacement + df_mpg$horsepower + df_mpg$weight + df_mpg$acceleration, data = df_mpg)

summary(m3_full)

## 
## Call:
## lm(formula = df_mpg$mpg ~ df_mpg$displacement + df_mpg$horsepower + 
##     df_mpg$weight + df_mpg$acceleration, data = df_mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         45.2511397  2.4560447  18.424  < 2e-16 ***
## df_mpg$displacement -0.0060009  0.0067093  -0.894  0.37166    
## df_mpg$horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## df_mpg$weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## df_mpg$acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

By observing the p-values for each independent variables;
We have acceleration to be the least significant variable with a p-value = 0.85388 and the adjusted \({ { R }^{ 2 } }\) = 0.704
We will now remove this variable from the model and re-evaluate.

m3_p1 <- lm(df_mpg$mpg ~ df_mpg$displacement + df_mpg$horsepower + df_mpg$weight, data = df_mpg)

summary(m3_p1)

## 
## Call:
## lm(formula = df_mpg$mpg ~ df_mpg$displacement + df_mpg$horsepower + 
##     df_mpg$weight, data = df_mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3347  -2.8028  -0.3402   2.2037  16.2409 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         44.8559357  1.1959200  37.507  < 2e-16 ***
## df_mpg$displacement -0.0057688  0.0065819  -0.876  0.38132    
## df_mpg$horsepower   -0.0416741  0.0128139  -3.252  0.00125 ** 
## df_mpg$weight       -0.0053516  0.0007124  -7.513 4.04e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.241 on 388 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.7047 
## F-statistic:   312 on 3 and 388 DF,  p-value: < 2.2e-16

We now have a slightly better adjusted \({ { R }^{ 2 } }\) = 0.7047, we will proceed with removing the next variable with the highest p-value, displacement, with p-value = 0.38132 and re-evaluate.

m3_p2 <- lm(df_mpg$mpg ~ df_mpg$horsepower + df_mpg$weight, data = df_mpg)

summary(m3_p2)

## 
## Call:
## lm(formula = df_mpg$mpg ~ df_mpg$horsepower + df_mpg$weight, 
##     data = df_mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0762  -2.7340  -0.3312   2.1752  16.2601 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       45.6402108  0.7931958  57.540  < 2e-16 ***
## df_mpg$horsepower -0.0473029  0.0110851  -4.267 2.49e-05 ***
## df_mpg$weight     -0.0057942  0.0005023 -11.535  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 389 degrees of freedom
## Multiple R-squared:  0.7064, Adjusted R-squared:  0.7049 
## F-statistic: 467.9 on 2 and 389 DF,  p-value: < 2.2e-16

Again, we now have a slightly better adjusted \({ { R }^{ 2 } }\) = 0.7049, both remaing variables are significant. We will proceed with removing the next variable with the highest p-value, horsepower, with p-value = 2.49e-05 and re-evaluate.

m3_p3 <- lm(df_mpg$mpg ~ df_mpg$weight, data = df_mpg)

summary(m3_p3)

## 
## Call:
## lm(formula = df_mpg$mpg ~ df_mpg$weight, data = df_mpg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.9736  -2.7556  -0.3358   2.1379  16.5194 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   46.216524   0.798673   57.87   <2e-16 ***
## df_mpg$weight -0.007647   0.000258  -29.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.333 on 390 degrees of freedom
## Multiple R-squared:  0.6926, Adjusted R-squared:  0.6918 
## F-statistic: 878.8 on 1 and 390 DF,  p-value: < 2.2e-16

We are now done to a single independent variable (weight). Here also as in the sample, the best fit line is with the 2 variables. However for the sample, we had weight and acceleration. For the population, we have weight and horsepower. The best fit line can be written as follows;
\(\hat { mpg } \quad =\quad 45.6402\quad -\quad 0.0473\cdot horsepower\quad -\quad 0.0058\cdot weight\)

The standard error for these coefficients are as follows:
for weight = 0.0005023 for horsepower = 0.0110851