library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Mecha_df <- read.csv("MechaCar_mpg.csv")
head(Mecha_df, 5)
## vehicle_length vehicle_weight spoiler_angle ground_clearance AWD mpg
## 1 14.69710 6407.946 48.78998 14.64098 1 49.04918
## 2 12.53421 5182.081 90.00000 14.36668 1 36.76606
## 3 20.00000 8337.981 78.63232 12.25371 0 80.00000
## 4 13.42849 9419.671 55.93903 12.98936 1 18.94149
## 5 15.44998 3772.667 26.12816 15.10396 1 63.82457
The information from the tables establishes that the measurements for the cars were length, weight, spoiler angle, ground clearance, all wheel drive and miles per gallon. In order to assess the effects that each of the physical parameters of the vehicle of the car have on the miles per gallon, a multiple regression was performed.
multi <- lm(mpg ~ vehicle_length + vehicle_weight + spoiler_angle + ground_clearance + AWD, Mecha_df)
multi
##
## Call:
## lm(formula = mpg ~ vehicle_length + vehicle_weight + spoiler_angle +
## ground_clearance + AWD, data = Mecha_df)
##
## Coefficients:
## (Intercept) vehicle_length vehicle_weight spoiler_angle
## -1.040e+02 6.267e+00 1.245e-03 6.877e-02
## ground_clearance AWD
## 3.546e+00 -3.411e+00
summary(multi)
##
## Call:
## lm(formula = mpg ~ vehicle_length + vehicle_weight + spoiler_angle +
## ground_clearance + AWD, data = Mecha_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4701 -4.4994 -0.0692 5.4433 18.5849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.040e+02 1.585e+01 -6.559 5.08e-08 ***
## vehicle_length 6.267e+00 6.553e-01 9.563 2.60e-12 ***
## vehicle_weight 1.245e-03 6.890e-04 1.807 0.0776 .
## spoiler_angle 6.877e-02 6.653e-02 1.034 0.3069
## ground_clearance 3.546e+00 5.412e-01 6.551 5.21e-08 ***
## AWD -3.411e+00 2.535e+00 -1.346 0.1852
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.774 on 44 degrees of freedom
## Multiple R-squared: 0.7149, Adjusted R-squared: 0.6825
## F-statistic: 22.07 on 5 and 44 DF, p-value: 5.35e-11
The standard p-value cutoff is 0.05, so in interpreting which parameters have a very low probability of affecting the mpg, any value less than 0.05 under Pr(>|t|) are likely not contributing to the mpg. The vehicle length, at 2.6e-12, and ground clearance, at 5.21e-08, are very unlikely to provide random amounts of variance to the mpg in this linear model.
On the other hand, the vehicle weight, 0.08, spoiler angle, 0.31, and AWD, 0.19, all had show that there is a potentially significant contribution to providing variance to the mpg.
The p-value of 5.35e-11 suggests that there is an extremely low probability that the variance is caused by random variation, and that the parameters investigated indeed have a significant correlation to the mpg data measured. The R-squared value of 0.71 shows a strong linear correlation, thus the linear model is not considered to have a zero slope. Combining the R-squared finding, demonstrating a strong linear correlation, and the p-value, demonstrating that the variance is very unlikely to be due to random variation, it can be concluded that there is a significant linear correlation between mpg and the parameters investigated.
At this point in the exploratory analysis, it is unlikely the linear model can effectively predict the mpg of prototype cars. While there is a strong and significant correlation, there are several features in the data that make this analysis problematic, thus further analyses will be necessary to determine whether linear regression is the best model.
The biggest issue is the AWD column. AWD is not a continuous variable - it is dichotomous, and thus must be treated as categorical data. This cannot be used in a regression analysis, and any findings including the AWD parameter cannot be used. To investigate whether AWD has a significant impact on the mpg, a two-sample t-test would be performed. Something very present in the data is that the coefficient associated with AWD is the only negative coefficient with a high order of magnitude comparatively, thus no further conclusions can be made about these data until AWD removal.
Further considerations, if the data were all continuous, it is still unclear whether there is indeed a linear relationship. The parameters should each be investigated individually, especially those that do seem to impact the correlation with the mpg. If these data are visualized, which is not meaningful in a multiple regression, it can be observed whether the data have a better correlation in a logistic regression.
multi_no_awd <- lm(mpg ~ vehicle_length + vehicle_weight + spoiler_angle + ground_clearance, Mecha_df)
multi_no_awd
##
## Call:
## lm(formula = mpg ~ vehicle_length + vehicle_weight + spoiler_angle +
## ground_clearance, data = Mecha_df)
##
## Coefficients:
## (Intercept) vehicle_length vehicle_weight spoiler_angle
## -1.076e+02 6.240e+00 1.276e-03 8.031e-02
## ground_clearance
## 3.659e+00
summary(multi_no_awd)
##
## Call:
## lm(formula = mpg ~ vehicle_length + vehicle_weight + spoiler_angle +
## ground_clearance, data = Mecha_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.3395 -4.1155 -0.2094 6.8789 17.2672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.076e+02 1.576e+01 -6.823 1.87e-08 ***
## vehicle_length 6.240e+00 6.609e-01 9.441 3.05e-12 ***
## vehicle_weight 1.277e-03 6.948e-04 1.837 0.0728 .
## spoiler_angle 8.031e-02 6.656e-02 1.207 0.2339
## ground_clearance 3.659e+00 5.394e-01 6.784 2.13e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.853 on 45 degrees of freedom
## Multiple R-squared: 0.7032, Adjusted R-squared: 0.6768
## F-statistic: 26.65 on 4 and 45 DF, p-value: 2.277e-11
After removing the categorical data, the two parameters that could not be removed due to random variation remain: spoiler angle and vehicle weight. The R-squared value dropped slightly from 0.71 to 0.70, still indicating a strong linear relationship, and the p-value remains far below 0.05, indicating that this is a significant relationship.
Finally, in testing whether there is a statistical difference in mpg for cars with or without AWD, a two-sample t-test can be used:
no_awd <- Mecha_df[Mecha_df$AWD == 0,]$mpg
w_awd <- Mecha_df[Mecha_df$AWD == 1,]$mpg
t.test(no_awd,w_awd)
##
## Welch Two Sample t-test
##
## data: no_awd and w_awd
## t = 0.99152, df = 45.947, p-value = 0.3266
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4.499691 13.235437
## sample estimates:
## mean of x mean of y
## 47.31253 42.94466
Interestingly, while the AWD seemed to show that there was a correlation with mpg from the regression analysis, it failed to disprove the null hypothesis in the Welch Two-Sample t-Test. The p-value of 0.33 is much greater than 0.05, which would be required to disopve that the true difference in means is not equal to 0. The confidence interval makes this very clear, in that the test states that we are 95% confident that the true mean of the population lies between -4.50 and 13.23, which clearly includes 0. Therefore, in just looking at the multiple regression blindly, false conclusions can be made regarding AWD and gas mileage.
Suspension <- read.csv("Suspension_Coil.csv")
head(Suspension,4)
## VehicleID Manufacturing_Lot PSI
## 1 V40858 Lot1 1499
## 2 V40607 Lot1 1500
## 3 V31443 Lot1 1500
## 4 V6004 Lot1 1500
total_summary <- Suspension %>% summarize(Mean=mean(PSI), Median=median(PSI), Variance=var(PSI), SD=sd(PSI))
total_summary
## Mean Median Variance SD
## 1 1498.78 1500 62.29356 7.892627
According to the specification requirement that the variance must be below 100 PSI, these data appear to support that this specification is met.*
lot_summary <- Suspension %>% group_by(Manufacturing_Lot) %>% summarize(Mean=mean(PSI), Median=median(PSI), Variance=var(PSI), SD=sd(PSI))
lot_summary
## # A tibble: 3 × 5
## Manufacturing_Lot Mean Median Variance SD
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Lot1 1500 1500 0.980 0.990
## 2 Lot2 1500. 1500 7.47 2.73
## 3 Lot3 1496. 1498. 170. 13.0
*Per the design specifications for the MechaCar Suspension coils, the
variance must not exceed 100 pounds per square inch. Unfortunately, this
is an unreliable specification, because the variance cannot be measured
in units of PSI, but rather PSI squared. When considering the different
lots, only Lot 3 had a variance exceeding 100-PSI squared, at 170.29
sq-PSI, whereas Lot 1 was far below, at 0.98 sq-PSI, and Lot 2 as well,
at 7.47 sq-PSI. So if the true specification is to limit the variance to
100 PSI-squared, or a variation in the coils of 10 PSI, then the coils
from Lot 3 would not meet this specification.
(If, however, the specification is in the actual pressure, PSI, then
there would be no issue with any of the coils, as the SD is the
indicator of variation in the units PSI. The pressure variation in Lot 1
is 0.99 PSI, in Lot 2 is 2.73 PSI, and in Lot 3 13.05 PSI. All three of
the lots were well below 100 PSI.)
To determine whether the suspension coils installed in the lots are different from the population mean of the suspension coils, a one-sample t-test was used, with a significance cutoff at 0.05. Initially, all of the grouped lots were investigated.
All_Lots_t <- t.test(Suspension$PSI, mu=1500)
Lot1 <- Suspension %>% subset(Manufacturing_Lot == "Lot1")
Lot2 <- Suspension %>% subset(Manufacturing_Lot == "Lot2")
Lot3 <- Suspension %>% subset(Manufacturing_Lot == "Lot3")
t1 <- t.test(Lot1$PSI, mu=1500)
t2 <- t.test(Lot2$PSI, mu=1500)
t3 <- t.test(Lot3$PSI, mu=1500)
All_Lots_t
##
## One Sample t-test
##
## data: Suspension$PSI
## t = -1.8931, df = 149, p-value = 0.06028
## alternative hypothesis: true mean is not equal to 1500
## 95 percent confidence interval:
## 1497.507 1500.053
## sample estimates:
## mean of x
## 1498.78
The results of the test yielded a p-value of 0.06, which is higher than the 0.05 cutoff. This indicates that the null hypothesis cannot be rejected, and therefore there is no significant difference between the mean of the sampled three lots to the mean of the population, which is 1500.
However, based on the previous findings, where the one lot produced coils that exceeded the variance parameters established by the manufacturer, it is also important to investigate whether the production at each lot also maintains that there is no difference from the population mean.
t1
##
## One Sample t-test
##
## data: Lot1$PSI
## t = 0, df = 49, p-value = 1
## alternative hypothesis: true mean is not equal to 1500
## 95 percent confidence interval:
## 1499.719 1500.281
## sample estimates:
## mean of x
## 1500
The p-value from Lot 1 is 1. In other words, the mean value of the the sample, Lot 1, is exactly the same as the mu value, the population mean. Clearly the null hypothesis is not rejected, because there is no difference between the means.
t2
##
## One Sample t-test
##
## data: Lot2$PSI
## t = 0.51745, df = 49, p-value = 0.6072
## alternative hypothesis: true mean is not equal to 1500
## 95 percent confidence interval:
## 1499.423 1500.977
## sample estimates:
## mean of x
## 1500.2
The p-value for second lot is 0.61, still far above the cutoff p-value of 0.05, again failing to reject the null hypothesis, supporting that there is no difference between the sample, Lot 2, and the population.
t3
##
## One Sample t-test
##
## data: Lot3$PSI
## t = -2.0916, df = 49, p-value = 0.04168
## alternative hypothesis: true mean is not equal to 1500
## 95 percent confidence interval:
## 1492.431 1499.849
## sample estimates:
## mean of x
## 1496.14
The p-value for Lot 3, which did not meet the variance specifications, was 0.04. This falls below the 0.05 cutoff, and thus the sample mean, 1496.14, is considered significantly different from the population mean, 1500. The confidence interval suggests that with 95% certainty, the true mean of the population should be between 1492.3 and 1499.8, whereas we know that the true mean of the population is actually 1500, not in the range observed from this sample, Lot 3.
When comparing a vehicle to the competition, there are several factors that consumers are interested in. Some of the top priorities that consumers have are related to the safety features and aesthetics of the vehicle, which will vary depending on the price models of the vehicles. However, for the real automobile connoisseurs, the specifications they will be interested are more related to the fuel economy, horsepower, quarter mile time, hauling capacity, and whether the vehicle has a moon roof in addition to the parameters already recorded. Since consumers usually have a price-point in mind when purchasing a new vehicle, it is important include both the style and price of the vehicle to make a fair comparison.
df <- Mecha_df %>% mutate(model=NA,hp=NA,qsec=NA,haul_cap=NA,moon_roof=NA, price=NA, safety=NA, mnt_cost=NA)
df <- df[,c(7,12,14,13,6,8,9,1,2,4,10,3,5,11)] # Change the order of the columns, while maintaining values
head(df)
## model price mnt_cost safety mpg hp qsec vehicle_length vehicle_weight
## 1 NA NA NA NA 49.04918 NA NA 14.69710 6407.946
## 2 NA NA NA NA 36.76606 NA NA 12.53421 5182.081
## 3 NA NA NA NA 80.00000 NA NA 20.00000 8337.981
## 4 NA NA NA NA 18.94149 NA NA 13.42849 9419.671
## 5 NA NA NA NA 63.82457 NA NA 15.44998 3772.667
## 6 NA NA NA NA 48.54268 NA NA 14.45357 7286.595
## ground_clearance haul_cap spoiler_angle AWD moon_roof
## 1 14.64098 NA 48.78998 1 NA
## 2 14.36668 NA 90.00000 1 NA
## 3 12.25371 NA 78.63232 0 NA
## 4 12.98936 NA 55.93903 1 NA
## 5 15.10396 NA 26.12816 1 NA
## 6 13.10695 NA 30.58568 0 NA
The data types required would be as follows: model: nominal price: continuous maintenance_cost: interval (1:3) safety_rating: interval (1:10) mpg: continuous hp: continuous mpg: continuous qsec: continuous vehicle_length: continuous vehicle_weight: continuous ground_clearance: continuous hauling_capacity: continuous spoiler_angle: continuous AWD: dichotomous moon_roof: dichotomous
Common assumptions that people make when purchasing a vehicle include how higher price is associated with safety, acceleration, and maintenance costs. Specifically that more expensive cars are 1. safer, 2. can accelerate faster and thus have a shorter quarter mile time, and 3. often have higher maintenance costs.
The Null hypothesis in this case is that the safety rating is no different in cars at or above $55,000 (the cost of an standard-level luxury vehicle) than vehicles under that price point. The alternative hypothesis specifies that there is a difference in safety rating between cars at or above $55,000 compared to those under that price. To make this comparison, a two-sample t-test could be used to see if there is a significant difference between the two populations. Alternatively, if the price points were grouped in $20,000 bins, an Chi-Squared test could be used to compare the price categories with the frequencies of ratings from the safety categories.
To determine whether price is related to the acceleration, consider the hypotheses: There is no correlation between the price of vehicles and the quarter mile time of vehicles. The alternative hypothesis is that there is a correlation between the price of vehicles and their quarter mile times. To determine this, a linear regression can be used since both variables are continuous. Furthermore, a multiple regression analysis could be used to determine what other factors could be associated with the quarter mile time, being careful only to select the continuous data. This would include the price, miles per gallon, horsepower, length, weight, ground clearance, hauling capacity and spoiler angle, for those vehicles with spoilers. Although miles per gallon, like the qsec, is typically the dependent variable as a result of the other factors, a correlation could give insight to further relationships as to how the two could be linked.
Expensive cars are typically associated with higher maintenance costs. The null hypothesis in this scenario states that there as vehicle price increases, there is no correlation with the maintenance cost of those vehicles. Alternatively, as the price of a vehicle increases, there is a significant change in the cost of maintenance. It is important to not specify an increase or decrease, as the prior information is simply hearsay and should have no bearing on the data. If there is a correlation, it is more important to note that either yes, or no, there is a difference, and only then specify that it increased or decreased. Otherwise, if the alternative states that it explicitly an increase, it is possible that we would have to reject both hypotheses if a decrease in costs was observed. This also avoids having to do two tests, one to state that there is no increase in maintenance cost with price, followed by another post hoc analysis stating that there is no decrease. To test this, since the maintenance costs are categorical but the price is continuous, an ANOVA test would provide the distribution means. This remains true even if we were to increase the categories from 3 to 5 in the maintenance cost measurements.
One final set of tests could be whether there is a correlation between vehicles with or without AWD and/or a moon roof and maintenance costs or safety ratings. Because all of these parameters can be considered categorical, the AWD and moon roof are dichotomous while the maintenance costs and safety ratings are scaled interval data but can also be considered ordinal, the only test equipped to compare sets of categorical data is the Chi-squared test.