Dataset
The dataset you are going to use for this assignment is called cars. This dataset is built into R; therefore, all you have to do to use it is call its name. Get familiar with this dataset by running the following two statements in R (run them as two separate statements):
?cars
str(cars)
Question
You have been hired as a data analyst by the police department in your city. They ask you to run a regression analysis that allows them to predict the speed at which a car was driving based on its stopping distance. The police department states that it is easy for them to measure the stopping distance of cars involved in accidents, but it is hard for them to figure out the speed at which they were traveling. Therefore, they want your help to find an equation that allows them to predict speed based on distance.
?cars
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
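Beyond ?cars and str(), a couple of optional commands (not required by the assignment) can give a quick feel for the data; head() and summary() are standard base R functions:
# Peek at the first few rows of the built-in cars dataset
head(cars)
# Numeric summaries of speed (mph) and dist (ft)
summary(cars)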
cor(cars$speed, cars$dist)
## [1] 0.8068949
Speed and distance have a fairly high positive correlation of about 0.81, and that is without removing any outliers. A correlation this strong can prove to have statistical and even practical significance.
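As an optional extra check (not something the prompt asks for), cor.test() in base R reports the same correlation coefficient along with a confidence interval and a p-value:
# Test whether the correlation between speed and stopping distance
# differs significantly from zero
cor.test(cars$speed, cars$dist)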
plot(cars$dist, cars$speed)
# If you want to include the least square line in the plot, add the following:
abline(lm(speed ~ dist, data = cars), col = "blue")
The scatter plot fairly well confirms the result I obtained for the correlation coefficient, since the least-squares line has a positive slope. The points also show a decent linear grouping, which leads me to expect that the RSE may be fairly low; whether it will be low enough to be practically significant has yet to be determined, regardless of how promising the graph looks.
spd_simple_out <- lm(speed ~ dist, data = cars)
# Then, we call the summary() function on the defined object
summary(spd_simple_out)
##
## Call:
## lm(formula = speed ~ dist, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5293 -2.1550 0.3615 2.4377 6.4179
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.28391 0.87438 9.474 1.44e-12 ***
## dist 0.16557 0.01749 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
The equation we got is:
Predicted speed of the car (mph) = 8.28 + 0.17 * (stopping distance in feet)
or, written with the variable name:
Predicted speed = 8.28 + 0.17 * dist
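For convenience, this rounded equation could be wrapped in a small helper function; predict_speed is a hypothetical name of my own, not part of the assignment:
# Hypothetical helper encoding the rounded fitted equation
predict_speed <- function(dist) 8.28 + 0.17 * dist
predict_speed(16)  # e.g., a 16 ft stopping distance gives about 11 mph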
Practical significance: R-squared = 65.11%, RSE = 3.16 mph
Since the residual standard error is 3.156 (about 3.16 mph), the observed speeds deviate from the regression equation by 3.16 mph on average; in other words, we expect our mph predictions to be off by roughly 3.16 mph, which I would say is practical since it is less than a 5 mph error. So if someone was actually going 70 mph in a 60 mph zone and was involved in a fatal accident with a measured braking distance of 363 ft, even a prediction that came in 3.16 mph below the true speed (around 66.84 mph) would still let the police department conclude that the driver was disobeying the speed limit. I would say there is practical significance to the equation, since the RSE is small and the scatter plot shows the values do not deviate far from the trend line, with a clear trend and linear grouping.
Correlation versus causation is not a concern here, since we know that braking distance is directly influenced by speed. Given the desire to predict speed, and our ability to do so within about 5 mph from a readily available predictor like the braking distance, I would continue to assume that this is a very practically significant equation for the police department to assess the speed at the scene of an accident, although the size of the RSE relative to the mean speed is examined further below.
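To make the 363 ft scenario above concrete, here is a quick sketch using the fitted model (the 363 ft braking distance and the 60/70 mph speeds are the hypothetical example from the paragraph above; this computes the model's prediction and subtracts one RSE, a slightly different but related way of making the same point):
# Predicted speed for a hypothetical 363 ft braking distance
pred_363 <- predict(spd_simple_out, data.frame(dist = 363))
pred_363           # roughly 68.4 mph
pred_363 - 3.156   # even 3.16 mph lower is still well above a 60 mph limit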
The slope of the equation is 0.16557. Interpretation of the slope: when the braking distance increases by one foot, we expect the speed to increase on average by 0.17 mph.
The RSE is 3.156 (about 3.16 mph).
On average, the values of speed deviate from the regression equation by 3.16 mph; that is, we expect our mph predictions to be off by about 3.16 mph. I would say there is likely practical significance to the equation since the RSE is low.
Again, correlation versus causation is not a factor, since we know that braking distance is directly influenced by speed. Given the desire to predict speed and our ability to do so within about 5 mph from a readily available predictor like the braking distance, I would say that this is a very practically significant equation for the police department to assess speed at the scene of an accident. But is this equation good?
Statistical significance. Test for B1:
Ho: B1 = 0 (the slope of the real equation is zero; thus, there is no linear relationship between X and Y)
Ha: B1 != 0 (the slope of the real equation is different from zero; thus, there is a linear relationship between X and Y)
The t test statistic for the b1 coefficient is 9.46 (large, i.e., far from zero). Therefore, naturally, the p-value is small (p-value = 1.49e-12). This p-value is less than alpha, so we can reject Ho and support Ha. The data give us evidence to conclude that there is a statistically significant relationship between speed (Y) and distance (X). R-squared is 0.6511 (65.1%), so using the equation is better than using the sample mean to make predictions.
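If we want to pull the t statistic and p-value out programmatically rather than reading them off the printed summary, the coefficient table stored in the summary object contains them (an optional sketch):
# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
summary(spd_simple_out)$coefficients
# p-value for the dist slope specifically
summary(spd_simple_out)$coefficients["dist", "Pr(>|t|)"]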
Practical significance: R-squared = 65.11%, RSE = 3.16 mph
3.156/ mean(cars$speed)
## [1] 0.2049351
(3.156/ mean(cars$speed))*100
## [1] 20.49351
Converted into a %, the coefficient of variation is 20.5%
Though I initially thought an RSE of 3.16 mph seemed as low as we would need in this problem, it is more desirable to get a coefficient of variation closer to 10%, so we should attempt to get a smaller RSE.
So, though I (and probably most people) would feel safe assuming the equation is practically significant, the statistics do not prove that beyond a doubt, since the coefficient of variation of the RSE is still high at 20.5%. Also, the equation is better than the sample mean for making predictions of speed (this statement is based on the R-squared value we got, 65.11%). Final thought: although correlation and statistical significance are present, making this a good equation, its practical significance is not statistically irrefutable, so it could be better.
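The same percentage can be computed without hard-coding 3.156, by extracting the RSE from the fitted model with sigma() (available for lm objects in recent versions of base R):
# RSE relative to the mean speed, expressed as a percentage
(sigma(spd_simple_out) / mean(cars$speed)) * 100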
If you look at the cars dataset, you can see that the fifth car's stopping distance was recorded as 16 feet. Let us plug that in.
PredictedSpeed1 = 8.28 + 0.17*(16)
PredictedSpeed1
## [1] 11
We can see that our equation finds the speed of the fifth car, which stopped in a distance of 16 feet, to be 11 mph. Let us see how these results compare to the built-in linear model predict() function.
predict(spd_simple_out, data.frame(dist=c(16)))
## 1
## 10.93299
We can see that the values from our equation and the built-in function differ slightly. Let us test whether this is due to our rounding of the slope and y-intercept.
PredictedSpeed = 8.28391 + 0.16557*(16)
PredictedSpeed
## [1] 10.93303
We can see that when we use the full coefficient values returned in our linear model summary, our prediction is closer to the built-in function's prediction.
So I would say the 5th car, which stopped in a distance of 16 ft, was travelling anywhere from 10.93 mph to 11 mph.
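To attach an uncertainty range to this prediction instead of adding and subtracting the RSE by hand, predict() can also return a prediction interval (an optional sketch):
# 95% prediction interval for the speed of a car that stopped in 16 ft
predict(spd_simple_out, data.frame(dist = 16), interval = "prediction")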
PredictedSpeed2 = 8.28 + 0.17*(12)
PredictedSpeed2
## [1] 10.32
predict(spd_simple_out, data.frame(dist=c(12)))
## 1
## 10.27072
Above I used both my equation and the built-in function. So I would say a car that stopped in a distance of 12 ft was travelling anywhere from 10.27 mph to 10.32 mph.
To check assumption 1 (linearity), we plot the residuals versus the predicted values of Y.
plot(predict(spd_simple_out), residuals(spd_simple_out))
abline(h=0)
Interpretation: in this case, we can see that the residuals show a vaguely nonlinear pattern. The data points seem to be scattered or to follow a parabolic shape; therefore, a quadratic model might be more appropriate than a linear model (assumption 1 is not satisfied). We had already observed this vaguely non-linear pattern in the original scatter plot, but this graph shows the non-linear pattern more clearly.
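Purely to illustrate the quadratic alternative suggested by the residual plot (not something the prompt asks for), a sketch of such a fit might look like this; spd_quad_out is a hypothetical name of my own:
# Quadratic model: speed as a function of dist and dist^2
spd_quad_out <- lm(speed ~ dist + I(dist^2), data = cars)
summary(spd_quad_out)
# Residuals vs fitted values for the quadratic fit, for comparison
plot(predict(spd_quad_out), residuals(spd_quad_out))
abline(h = 0)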
To check assumption 3 (whether the residuals follow a normal distribution), we are going to conduct a hypothesis test: the Shapiro-Wilk test for normality.
Ho: The residuals follow a Normal distribution
Ha: The residuals do not follow a Normal distribution
shapiro.test(residuals(spd_simple_out))
##
## Shapiro-Wilk normality test
##
## data: residuals(spd_simple_out)
## W = 0.98745, p-value = 0.8696
Since the p-value (0.8696) is well above 0.05, we fail to reject Ho. Assumption 3 is satisfied: the residuals appear to follow a normal distribution.
hist(residuals(spd_simple_out))
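As an additional (optional) visual check on assumption 3, a normal Q-Q plot of the residuals tells the same story as the histogram and the Shapiro-Wilk test:
# Q-Q plot: points close to the reference line suggest approximately normal residuals
qqnorm(residuals(spd_simple_out))
qqline(residuals(spd_simple_out))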