Dataset

The dataset you are going to use for this assignment is called cars. This dataset is built into R; therefore, all you have to do to use it is call its name. Get familiar with this dataset by running the following two statements in R (run them as two separate statements): ?cars and str(cars)

Question

You have been hired as a data analyst by the police department in your city. They ask you to run a regression analysis that allows them to predict the speed at which a car was driving based on its stopping distance. The police department states that it is easy for them to measure the stopping distance of cars involved in accidents, but it is hard for them to figure out the speed at which they were traveling. Therefore, they want your help to find an equation that allows them to predict speed based on distance.

?cars
## starting httpd help server ... done
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

  1. Compute the correlation coefficient between the dependent variable and the predictor and discuss its sign and value (i.e. What do the sign and value tell you?)
cor (cars$speed, cars$dist)
## [1] 0.8068949

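Speed and distance have a fairly high positive correlation of 0.81, computed without removing any outliers. The positive sign tells us that longer stopping distances go with higher speeds, and the value tells us that the linear association is fairly strong. This level of correlation can prove to have statistical and even practical significance.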
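As a quick check on the value above, here is a minimal sketch that recomputes the correlation and squares it. In simple linear regression the squared correlation equals R squared, so this number should reappear in the regression summary. The names r, r_squared and ct are my own, and cor.test() is an extra step that was not part of the original analysis.

```r
# cars is R's built-in dataset; no import is needed
r <- cor(cars$speed, cars$dist)   # Pearson correlation
r_squared <- r^2                  # equals R squared in simple linear regression

round(r, 2)
round(r_squared, 4)

# Optional: cor.test() adds a significance test and a confidence interval for r
ct <- cor.test(cars$speed, cars$dist)
ct$conf.int
```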

  2. Obtain the scatter plot between the dependent variable and the predictor and discuss it. Does it confirm the results you obtained for the correlation coefficient? Does it give you any extra information?
plot(cars$dist, cars$speed)

# If you want to include the least square line in the plot, add the following:

abline(lm(speed ~ dist, data = cars), col = "blue")

The scatter plot does confirm the results I obtained for the correlation coefficient, since the least-squares line has a positive slope. The points also show a decent linear grouping, which leads me to expect that the RSE may be fairly low. Whether it will be low enough to be practically significant has yet to be determined, regardless of how promising the graph looks.
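One way to look for extra information in the scatter plot is to overlay a smoother next to the straight line. This is only a sketch: the lowess() smoother, the axis labels and the name sm are my own additions, not something the assignment asked for.

```r
# Scatter plot with the least-squares line and a lowess smoother for comparison
plot(cars$dist, cars$speed,
     xlab = "Stopping distance (ft)", ylab = "Speed (mph)")
abline(lm(speed ~ dist, data = cars), col = "blue")

sm <- lowess(cars$dist, cars$speed)   # locally weighted smoother
lines(sm, col = "red", lty = 2)

legend("bottomright",
       legend = c("Least-squares line", "Lowess smoother"),
       col = c("blue", "red"), lty = c(1, 2))
```

If the dashed curve bends away from the blue line, that hints the relationship may not be perfectly linear.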

  3. Obtain the linear regression equation between the dependent variable and the predictor and discuss its practical significance.
spd_simple_out= lm(speed ~ dist, data=cars)

# Then, we call the summary() function on the defined object

summary (spd_simple_out)
## 
## Call:
## lm(formula = speed ~ dist, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5293 -2.1550  0.3615  2.4377  6.4179 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
## dist         0.16557    0.01749   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.156 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

The equation we got is:

Predicted speed of the car = 8.28 + 0.17*(stop distance in feet)

Predicted speed = 8.28 + 0.17*(dist)

Practical significance R squared= 65.11% RSE= 3.16 units

Since the residual standard error is 3.156 (about 3.16 mph), on average the values of speed deviate from the regression equation by 3.156 mph. This means we can expect our speed predictions to be off by about 3.16 mph on average, which I would say is practical since it is less than a 5 mph error. For example, suppose someone was going 70 mph in a 60 mph zone and was involved in a fatal accident with a measured braking distance of 363 ft. The equation predicts about 70 mph (8.28 + 0.17*363 = 69.99), and even after subtracting one RSE the estimate is about 66.84 mph, which is still above the speed limit, so the police department could still use this measure to support a charge of disobeying the speed limit. I would say there is practical significance to the equation since the RSE is small and the plot shows that the values do not deviate far from the trend line, with a clear trend and linear grouping.
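The arithmetic behind the 363 ft example can be sketched as follows, using the rounded coefficients and RSE from the summary output. The variable names are illustrative. One caution I am adding: the largest stopping distance in cars is 120 ft, so a prediction at 363 ft is an extrapolation beyond the data and should be read loosely.

```r
# Rounded values taken from the summary() output (names are illustrative)
b0  <- 8.28   # intercept, mph
b1  <- 0.17   # slope, mph per foot
rse <- 3.16   # residual standard error, mph

dist_ft <- 363                 # hypothetical braking distance from the example
pred    <- b0 + b1 * dist_ft   # point prediction, roughly 70 mph
lower   <- pred - rse          # one RSE below the prediction

c(prediction = pred, one_rse_below = lower)

# The observed distances only run from 2 to 120 ft
range(cars$dist)
```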

The correlation-versus-causation concern is not a factor here, since we know that braking distance is directly influenced by speed. Given the desire to predict the speed and our ability to do so within 5 mph from a readily available predictor like the braking distance, I would continue to assume that this is a very practically significant equation for the police department to assess the speed at the scene of an accident. But without the RSE

  4. Interpret the slope of the equation you obtained in 3)

The slope of the equation is 0.16557. Interpretation of the slope: when the braking distance increases by one foot, we expect the speed to increase on average by 0.17 mph.
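To make the slope interpretation concrete, this small sketch checks that two predictions one foot apart differ by exactly the slope. The distances 20 ft and 21 ft are arbitrary choices of mine, and the confint() call is an optional extra that shows the uncertainty around the slope.

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)
slope <- unname(coef(spd_simple_out)["dist"])

# Predictions at two distances that are one foot apart (arbitrary values)
p <- predict(spd_simple_out, data.frame(dist = c(20, 21)))
diff_1ft <- unname(p[2] - p[1])

all.equal(diff_1ft, slope)   # the one-foot difference is the slope

# 95% confidence interval for the slope
confint(spd_simple_out, "dist", level = 0.95)
```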

  5. Obtain the Residual Standard Error of the equation you obtained in part 3). Interpret its value (what does the value of the Residual Standard Error mean in this specific problem?). What does the value of the Residual Standard Error tell you about the quality of the equation? Is the equation good or not? JUSTIFY your answer.

The RSE is 3.156 (about 3.16 mph).

On average, the values of speed deviate from the regression equation by 3.156 mph. This means we can expect our speed predictions to be off by about 3.16 mph on average. I would say there is likely practical significance to the equation since the RSE is low.
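For reference, the RSE can be recomputed from the residuals rather than read off the summary. This is a sketch with names of my own (res, rse_manual, rse_builtin); sigma() is the base R extractor for the same quantity.

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)

res <- residuals(spd_simple_out)

# RSE = sqrt(sum of squared residuals / residual degrees of freedom)
rse_manual  <- sqrt(sum(res^2) / df.residual(spd_simple_out))
rse_builtin <- sigma(spd_simple_out)

c(manual = rse_manual, builtin = rse_builtin)
```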

The correlation-versus-causation concern is not a factor, since we know that braking distance is directly influenced by speed. Given the desire to predict the speed and our ability to do so within 5 mph from a readily available predictor like the braking distance, I would say that this is a very practically significant equation for the police department to assess speed at the scene of an accident.

Is this equation good?

Statistical significance Test for B1:

Ho: B1 = 0 (the slope of the real equation is zero; thus, there is no linear relationship between X and Y)

Ha: B1 != 0 (the slope of the real equation is different from zero; thus, there is a linear relationship between X and Y)

The t test statistic for the b1 coefficient is 9.46 (large = far from zero). Therefore, naturally, the PV is small (PV = 1.49e-12). This PV is less than any conventional alpha (for example, 0.05). Therefore, we can reject Ho and support Ha. The data give us evidence to conclude that there is a statistically significant linear relationship between speed (Y) and distance (X). R squared is 0.6511 (65.1%), so using the equation is better than using the sample mean to make predictions.
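The t statistic and p-value in the summary can be reproduced by hand, which may help show where they come from. This is a sketch; coefs, t_stat and p_val are names I chose.

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)
coefs <- summary(spd_simple_out)$coefficients

# t = estimate / standard error
t_stat <- coefs["dist", "Estimate"] / coefs["dist", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(-abs(t_stat), df = df.residual(spd_simple_out))

c(t = t_stat, p = p_val)
```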

Practical significance R squared= 65.11% RSE= 3.16 units

3.156/ mean(cars$speed)
## [1] 0.2049351
(3.156/ mean(cars$speed))*100
## [1] 20.49351

Converted into a %, the coefficient of variation is 20.5%
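The same percentage can be obtained without typing 3.156 by hand, which avoids rounding the RSE first. A minimal sketch, where cv_pct is my own name:

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)

# Coefficient of variation: RSE as a percentage of the mean speed
cv_pct <- sigma(spd_simple_out) / mean(cars$speed) * 100
round(cv_pct, 1)
```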

Though I initially thought it was low, an RSE of 3.16 mph does not seem to be as low as we would desire IN THIS PROBLEM. It is more desirable to get a coefficient of variation close to 10%. We should attempt to get a smaller RSE.

So though I, and probably most, would feel safe to assume that the equation is practically significant, the statistics do not prove that beyond a doubt, since the coefficient of variation based on the RSE is still high at 20.5%. The equation is still better than the sample mean for making predictions of speed (this statement is based on the R squared value of 65.11%). Final thought: correlation and statistical significance are both present, which makes this a good equation, but its practical significance is not statistically irrefutable, so it could be better.

  6. Use the equation to predict the speed at which the fifth car in the data set was traveling.

If you look at the cars dataset, you can see that the fifth car's stopping distance was recorded as 16 feet. Let us plug that in.

PredictedSpeed1 = 8.28 + 0.17*(16)
PredictedSpeed1
## [1] 11

We can see that our equation finds the speed of the fifth car, which stopped in a distance of 16 feet, to be 11 mph. Let us see how this result compares to the built-in predict() function for linear models.

predict(spd_simple_out, data.frame(dist=c(16)))
##        1 
## 10.93299

We can see that the values from our equation and the built-in function differ slightly. Let us test whether this is due to our rounding of the slope and y-intercept.

PredictedSpeed = 8.28391 + 0.16557*(16)
PredictedSpeed
## [1] 10.93303

We can see that when we use the full coefficient values returned in our linear model summary, our prediction is closer to the built-in function's prediction.

So I would say the 5th car, which stopped in a distance of 16 ft, was travelling anywhere from 10.93 mph to 11 mph.
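Because the fifth car is part of the dataset, its recorded speed is available for comparison. The sketch below pulls the fitted value and the residual for row 5; fitted_5 and resid_5 are names of my own.

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)

cars[5, ]   # recorded speed and stopping distance of the fifth car

fitted_5 <- unname(fitted(spd_simple_out)[5])      # model prediction for row 5
resid_5  <- unname(residuals(spd_simple_out)[5])   # actual speed minus prediction

c(fitted = fitted_5, residual = resid_5)
```

A negative residual here means the equation predicts a higher speed than the one that was recorded for this car.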

  7. Use the equation to predict the speed of a new car not included in the cars dataset. The stopping distance of this car was 12 feet.
PredictedSpeed2 = 8.28 + 0.17*(12)
PredictedSpeed2
## [1] 10.32
predict(spd_simple_out, data.frame(dist=c(12)))
##        1 
## 10.27072

Above I used my equation and the built-in predict() function. So I would say the new car, which stopped in a distance of 12 ft, was travelling anywhere from 10.27 mph to 10.32 mph.
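A point prediction does not show how uncertain the estimate is for a single new car. As a hedged sketch, predict() can also return a 95% prediction interval; the names new_car and pi_12 are mine, and the interval relies on the usual regression assumptions holding.

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)

new_car <- data.frame(dist = 12)   # stopping distance of the new car, in feet

# Columns: fit (point prediction), lwr and upr (95% prediction limits)
pi_12 <- predict(spd_simple_out, new_car,
                 interval = "prediction", level = 0.95)
pi_12
```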

  8. Check the validity of assumption number 1. Discuss your results.

To check assumption 1 (linearity), we plot the residuals versus the predicted values of Y.

plot(predict (spd_simple_out), residuals (spd_simple_out))

abline(h=0)

Interpretation: In this case, the residuals show a vaguely nonlinear pattern. Rather than scattering randomly around zero, the data points seem to follow a roughly parabolic shape; therefore, a quadratic model might be more appropriate than a linear model (assumption 1 is not satisfied). We had already observed this vaguely nonlinear pattern in the original scatter plot. However, this graph shows the nonlinear pattern more clearly.
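If one wanted to follow up on the quadratic idea, a possible sketch is shown here. This goes beyond what the assignment asked for; spd_quad_out is a name I made up, and I am not claiming the quadratic model is the right answer, only showing how the comparison could be set up.

```r
# Refit the linear model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)

# Add a squared distance term; I() keeps ^ as arithmetic inside the formula
spd_quad_out <- lm(speed ~ dist + I(dist^2), data = cars)
summary(spd_quad_out)

# F test comparing the nested linear and quadratic models
anova(spd_simple_out, spd_quad_out)

# Residuals versus predicted values for the quadratic fit
plot(predict(spd_quad_out), residuals(spd_quad_out))
abline(h = 0)
```

If the parabolic shape disappears in the new residual plot, that would support the quadratic term.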

  9. Check the validity of assumption number 3. Discuss your results.

To check assumption 3 (whether the residuals follow a normal distribution), we are going to conduct a hypothesis test: the Shapiro-Wilk test for normality.

Ho: The residuals follow a Normal distribution

Ha: The residuals do not follow a Normal distribution

shapiro.test(residuals (spd_simple_out))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(spd_simple_out)
## W = 0.98745, p-value = 0.8696

Since the p-value (0.8696) is well above 0.05, we fail to reject Ho. The W statistic (0.98745) is also close to 1. I would therefore say that assumption 3 is satisfied: the residuals are consistent with a normal distribution.

hist(residuals (spd_simple_out))
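Alongside the histogram, a normal Q-Q plot is another common visual check of the same assumption. This is an optional sketch that was not part of the original analysis; res is my own name.

```r
# Refit the model so this block runs on its own
spd_simple_out <- lm(speed ~ dist, data = cars)
res <- residuals(spd_simple_out)

# Points close to the reference line suggest approximately normal residuals
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "blue")
```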