Homework 11

#First 5 observations
head(cars);

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

#Dimension of dataset
dim(cars)

## [1] 50  2

create scatter plot with line of best fit

ggplot(data=cars,mapping=aes(speed,dist)) + geom_point() + 
  geom_smooth(method=lm, se=FALSE)

The scatter plot of stopping distance and speed depicts a linear relationship. The relationship is positive because the trend shows that stopping distance increases as speed of the car increases. However, there seem to be outliers in the distribution, but that would be detected and confirmed by the model.

Correlation

#Correlation
cor(cars$dist,cars$speed)

## [1] 0.8068949

The correlation coefficient of 0.8 shows that there is a strong positive relationship between stopping distance and the speed of the car.

Building a linear regression model

lreg <- lm(dist ~ speed, data=cars)
summary(lreg)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The model has an intercept of -17.57 and a slope of 3.9 with a p-value of 1.49e-12. Having a Null hypothesis \(H_0\) that slop is equal to zero and a miniature p-value of 1.49e-12, the Null hypothesis \(H_0\) is rejected as the p-values is statistically significant. The R-squared \(R^2\) value of 0.6511 means that the model is reasonable to explain about 65% variation in stopping distance as a result of speed of car. Thus, it is correct to conclude that the model seam reliable and statistically significant.

Histogram

ggplot(data=cars,mapping=aes(lreg$residuals)) + 
  geom_histogram(aes(y=..density..),binwidth = 5,color='black',fill='grey') + 
  geom_density(alpha=0.2, fill='lightpink') +
  theme(panel.background = element_rect(fill='white'),
        axis.line.x=element_line(),
        axis.line.y=element_line())+ggtitle('Histogram for model Residuals')

The distribution of the residual depicted by the histogram shows that it is unimodal, centered at approximately zero (i.e zero mean) with presence of outliers in the far right. The histogram is right-skewed indicating that most of the data falls to the right of the histogram. However, the shape of the histogram confirms that the residual is normally distributed. But it is not a perfect normal distribution as the histogram did not maintain a perfect bell shape.

Box Plot

ggplot(data=cars,mapping=aes(lreg$residuals)) + 
  geom_boxplot(color='black',fill='grey') +
  theme(panel.background = element_rect(fill='white'),
        axis.line.x=element_line(),
        axis.line.y=element_line())+ggtitle('Box plot for model Residuals')

The box plot shows that the median of the residual is less than zero and there are outliers above the upper limit of the data distribution.

Evaluating Linear Regression Assumptions

plot(lreg)

Residual Analysis:

The Residual vs Fitted value plot shows that the error maintains a constant variability. However, there are few data points at the far left of the distribution where fitted values are less than zero.
The Q-Q plot shows that the distribution of the residual is fairly normal with outliers in the far end identifiable with index 23, 35 and 49 of the dataset

Detect and Removing Outliers from Dataset Using Box plot

#Outliers
boxplot.stats(lreg$residuals)$out

##       23       49 
## 42.52537 43.20128

#Row numbers

out <-boxplot.stats(lreg$residuals)$out
out_ind <-which(lreg$residuals %in%c(out))
out_ind

## [1] 23 49

#Observations
cars[out_ind,]

##    speed dist
## 23    14   80
## 49    24  120

#Displaying outliers on Box plot
boxplot(lreg$residuals,ylab='distance',
        main='Box plot of distance per mile')

mtext(paste("Outliers: ",paste(out,collapse = ". ")))

#Lower bound
lower_bound = quantile(lreg$residuals,0.025)
lower_bound

##     2.5% 
## -21.3462

#Upper bound
upper_bound = quantile(lreg$residuals,0.975)
upper_bound

##   97.5% 
## 39.8862

#Lower $ Upper bound index
outlier_ind <- which(lreg$residuals < lower_bound | lreg$residuals > upper_bound)

outlier_ind

## 23 24 39 49 
## 23 24 39 49

#Outliers in the cars dataset
cars[outlier_ind,]

##    speed dist
## 23    14   80
## 24    15   20
## 39    20   32
## 49    24  120

Remove outlier from original cars dataset

#create id column
cars$id <- as.integer(rownames(cars))
#cars

#remove outliers
cars2 <- cars[cars$id[-c(23,35,39,49)], ]
#cars2

The dataset consists of two features namely, speed and stopping distance. However, a third variable id was created as a unique identifier to make the data manipulation as easy as possible.

ggplot(data=cars2,mapping=aes(speed,dist)) + geom_point() + 
  geom_smooth(method=lm, se=FALSE)

Building a linear regression model after removing initial outliers

lreg2 <- lm(dist ~ speed, data=cars2)
summary(lreg2)

## 
## Call:
## lm(formula = dist ~ speed, data = cars2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.526  -8.188  -1.232   5.798  25.357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -16.0635     5.0969  -3.152  0.00292 ** 
## speed         3.7059     0.3192  11.612 5.47e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.34 on 44 degrees of freedom
## Multiple R-squared:  0.754,  Adjusted R-squared:  0.7484 
## F-statistic: 134.8 on 1 and 44 DF,  p-value: 5.471e-15

Upon removing the initial detected outliers, the intercept of the model shifted to -16.06 and the slope changed to 3.7. However, the model p-value remains statistically significant at 5.471e-15. The R-squared \(R^2\) increased to 0.754 (~75%) and these changes show that the model would do a better prediction when the initial outliers are eliminated from the training dataset.

Histogram after removing initial outliers

ggplot(data=cars2,mapping=aes(lreg2$residuals)) + 
  geom_histogram(aes(y=..density..),binwidth = 5,color='black',fill='grey') + 
  geom_density(alpha=0.2, fill='lightpink') +
  theme(panel.background = element_rect(fill='white'),
        axis.line.x=element_line(),
        axis.line.y=element_line())+ggtitle('Histogram for model Residuals')

Box Plot after removing initial outliers

ggplot(data=cars2,mapping=aes(lreg2$residuals)) + 
  geom_boxplot(color='black',fill='grey') +
  theme(panel.background = element_rect(fill='white'),
        axis.line.x=element_line(),
        axis.line.y=element_line())+ggtitle('Box plot for model Residuals')

Assumptions check after removing initial outliers

plot(lreg2)

Residual Analysis after removing initial outliers:

The Residual vs Fitted value plot shows that the error maintains a constant variability. However, there are few data points at the far left of the distribution where fitted values are less than zero.
The removal of the initial outliers impacted the distribution of the residuals. The new Q-Q plot shows that the distribution of the residual is fairly normal. Though, better without the initial outliers. But there are still new outliers in the far end identifiable with index 22, 34 and 48 of the dataset

Homework 11

Nnaemeka Okereafor

2023-11-30