library(mosaic)
library(Stat2Data)
library(readr)
library(car)
library(corrplot)
library(dplyr)
UsedCars <- read_csv("UsedCars.csv")
## Parsed with column specification:
## cols(
##   Id = col_double(),
##   Price = col_double(),
##   Year = col_double(),
##   Mileage = col_double(),
##   City = col_character(),
##   State = col_character(),
##   Vin = col_character(),
##   Make = col_character(),
##   Model = col_character()
## )
head(UsedCars)
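# Count how many listings each model has
Cars = as.data.frame(table(UsedCars$Model))
head(Cars)
# Give the two columns descriptive names
names(Cars) = c("Model", "Count")
head(Cars)
# Keep only the models with at least 1000 listings
Cars2 = subset(Cars, Count >= 1000)
Cars2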
set.seed(1938575)
germCar = sample_n(subset(UsedCars, Model == "E-ClassE350"), 100)
germCar$Country = "Germany"
germCar$Type = "Car"
germSUV = sample_n(subset(UsedCars, Model == "X5AWD"), 100)
germSUV$Country = "Germany"
germSUV$Type = "SUV"
japCar = sample_n(subset(UsedCars, Model == "Altima2.5"), 100)
japCar$Country = "Japanese"
japCar$Type = "Car"
japSUV = sample_n(subset(UsedCars, Model == "RAV44X2"), 100)
japSUV$Country = "Japanese"
japSUV$Type = "SUV"
amerCar = sample_n(subset(UsedCars, Model == "FocusSedan"), 100)
amerCar$Country = "America"
amerCar$Type = "Car"
amerSUV = sample_n(subset(UsedCars, Model == "EscapeSE"), 100)
amerSUV$Country = "America"
amerSUV$Type = "SUV"
Cars3 = rbind(germCar, germSUV, japCar, japSUV, amerCar, amerSUV)
Cars3$Age = 2017 - Cars3$Year
head(Cars3)

#1

boxplot(Price ~ Model, data = Cars3, las = 2)

According to the boxplots, the E-ClassE350 has the largest range and the largest median. It also has several visible outliers, whereas the other models have one outlier or none. The RAV44X2 appears to have the second-largest median, with a fairly small range. The medians for the Altima2.5, the EscapeSE, and the X5AWD appear similar; of these three, the X5AWD has the largest range. The FocusSedan has the smallest median and a range similar to that of the RAV44X2.

#2

tapply(Cars3$Price, Cars3$Model, mean)
##   Altima2.5 E-ClassE350    EscapeSE  FocusSedan     RAV44X2       X5AWD 
##    14993.97    30061.24    16073.30    11468.59    19191.87    15241.22

The above data shows the mean price for each car model.

tapply(Cars3$Price, Cars3$Model, sd)
##   Altima2.5 E-ClassE350    EscapeSE  FocusSedan     RAV44X2       X5AWD 
##    2689.565    8297.746    2846.080    2189.593    2348.477    5154.382

The above data shows the standard deviation for each car model.

summary(Cars3$Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3999   12989   16738   17838   19994   48951
sd(Cars3$Price)
## [1] 7417.989

The output above shows the five-number summary, mean, and standard deviation of price for the entire sample.

#3

Using the output from #2, the mean prices for the six models fall between about 11,500 and 30,100. The E-ClassE350 has the highest mean price, 30,061.24, while the other five models have mean prices below 20,000. The E-ClassE350 also has the highest standard deviation, 8,297.75; the X5AWD has the second highest, 5,154.38; and the other four models have standard deviations below 3,000. The largest standard deviation is about 3.8 times the smallest (8,297.75 versus 2,189.59 for the FocusSedan), well above the usual rule-of-thumb ratio of 2, so the constant variance condition is at risk.
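The constant-variance concern can be checked numerically. The sketch below applies the common rule of thumb that the largest group standard deviation should be no more than about twice the smallest. It uses only the standard deviations reported in #2; the names `group_sd` and `sd_ratio` are illustrative, and the cutoff of 2 is a guideline rather than a formal test.

```r
# Group standard deviations copied from the tapply(..., sd) output in #2
group_sd <- c("Altima2.5" = 2689.565, "E-ClassE350" = 8297.746,
              "EscapeSE" = 2846.080, "FocusSedan" = 2189.593,
              "RAV44X2" = 2348.477, "X5AWD" = 5154.382)

# Ratio of the largest to the smallest standard deviation
sd_ratio <- max(group_sd) / min(group_sd)
round(sd_ratio, 2)

# Rule of thumb (an assumption, not a formal test): a ratio above 2 is a warning sign
sd_ratio > 2
```

If the ratio is well above 2, as it is here, the F-test and Tukey intervals that follow should be read with some caution.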

#4

anovaCars3 = aov(Price ~ Model, data = Cars3)
summary(anovaCars3)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Model         5 2.098e+10 4.195e+09   207.9 <2e-16 ***
## Residuals   594 1.199e+10 2.018e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis is that the mean price is the same for all six car models. The alternative hypothesis is that at least one model’s mean price differs from another model’s mean price. The p-value is less than 2e-16, so we reject the null hypothesis and conclude that mean price differs significantly among the car models. The F-test alone does not identify which models differ; it only indicates that at least one model in my data has a mean price significantly different from another.

#5

plot(anovaCars3)

## hat values (leverages) are all = 0.01
##  and there are no factor predictors; no plot no. 5

hist(anovaCars3$residuals)

The output above shows the residuals versus fitted plot, the normal Q-Q plot, and a histogram of the residuals.

Residuals versus fitted plot: each vertical band of points corresponds to one model, positioned at that model’s mean price, and the spread differs from band to band. The rightmost band (near the 30,000 mark) shows the most spread. This is fitting because that model, the E-ClassE350, had the largest standard deviation of all the models. Two bands (near the 15,000 mark) overlap. This is also fitting because those two models, the Altima2.5 and the X5AWD, have very similar mean prices.

Normal Q-Q plot: the points follow the line closely except at the left end, where quite a few outliers disrupt the linearity.

Histogram: the residuals appear roughly normal, centered at 0.

#6

library(TukeyC)
## Loading required package: doBy
TukeyHSD(anovaCars3)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ Model, data = Cars3)
## 
## $Model
##                             diff         lwr         upr     p adj
## E-ClassE350-Altima2.5   15067.27  13251.0529  16883.4871 0.0000000
## EscapeSE-Altima2.5       1079.33   -736.8871   2895.5471 0.5329208
## FocusSedan-Altima2.5    -3525.38  -5341.5971  -1709.1629 0.0000006
## RAV44X2-Altima2.5        4197.90   2381.6829   6014.1171 0.0000000
## X5AWD-Altima2.5           247.25  -1568.9671   2063.4671 0.9988433
## EscapeSE-E-ClassE350   -13987.94 -15804.1571 -12171.7229 0.0000000
## FocusSedan-E-ClassE350 -18592.65 -20408.8671 -16776.4329 0.0000000
## RAV44X2-E-ClassE350    -10869.37 -12685.5871  -9053.1529 0.0000000
## X5AWD-E-ClassE350      -14820.02 -16636.2371 -13003.8029 0.0000000
## FocusSedan-EscapeSE     -4604.71  -6420.9271  -2788.4929 0.0000000
## RAV44X2-EscapeSE         3118.57   1302.3529   4934.7871 0.0000175
## X5AWD-EscapeSE           -832.08  -2648.2971    984.1371 0.7796181
## RAV44X2-FocusSedan       7723.28   5907.0629   9539.4971 0.0000000
## X5AWD-FocusSedan         3772.63   1956.4129   5588.8471 0.0000001
## X5AWD-RAV44X2           -3950.65  -5766.8671  -2134.4329 0.0000000

The largest difference in mean price is between the E-ClassE350 and the FocusSedan: the E-ClassE350 averages 18,592.65 more in price. The 95% confidence interval for that difference runs from 16,776.43 to 20,408.87. The adjusted p-value rounds to 0, which indicates a significant difference in means. All of the pairwise comparisons in the above output show a significant difference in mean price except EscapeSE-Altima2.5, X5AWD-Altima2.5, and X5AWD-EscapeSE.
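Every interval in the Tukey table has the same width because the design is balanced (100 cars per model). As a rough check, the half-width can be rebuilt from the ANOVA table in #4. This is a sketch using the rounded residual mean square printed there, so it only approximately matches the table; the variable names are illustrative.

```r
mse <- 2.018e7       # residual mean square from summary(anovaCars3), as printed
n_per_group <- 100   # cars sampled per model
k <- 6               # number of models
df_resid <- 594      # residual degrees of freedom

# Standard error of a difference between two group means
se_diff <- sqrt(mse * (2 / n_per_group))

# Tukey critical value from the studentized range distribution
q_crit <- qtukey(0.95, nmeans = k, df = df_resid)

# Half-width of each 95% family-wise interval
half_width <- q_crit / sqrt(2) * se_diff
c(se_diff = se_diff, half_width = half_width)
```

The half-width should land close to the roughly 1,816 seen in the table (for example, 15,067.27 minus 13,251.05). A pair is flagged as significant exactly when the absolute difference in means exceeds this half-width, which is why the three comparisons with differences of 1,079.33, 247.25, and 832.08 are not significant.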

#7

anova1Cars3 = aov(Price ~ Country + Type, data = Cars3)
summary(anova1Cars3)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Country       2 8.053e+09 4.026e+09   98.73  < 2e-16 ***
## Type          1 6.035e+08 6.035e+08   14.80 0.000133 ***
## Residuals   596 2.430e+10 4.078e+07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model tests two null hypotheses: that mean price is the same across the three countries, and that mean price is the same for both types of vehicle. The alternative hypothesis in each case is that at least one group mean differs. The p-value for Country is less than 2e-16, so the first null hypothesis can be rejected, and we can conclude that mean price differs by country. The p-value for Type is 0.000133, so the second null hypothesis can be rejected, and we can conclude that mean price differs by type of vehicle. Together, these conclusions suggest that both country and type of vehicle play a role in predicting the price of a car.

TukeyHSD(anova1Cars3)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ Country + Type, data = Cars3)
## 
## $Country
##                       diff       lwr       upr p adj
## Germany-America   8880.285  7379.865 10380.705 0e+00
## Japanese-America  3321.975  1821.555  4822.395 8e-07
## Japanese-Germany -5558.310 -7058.730 -4057.890 0e+00
## 
## $Type
##              diff       lwr       upr     p adj
## SUV-Car -2005.803 -3029.822 -981.7846 0.0001325

The largest difference in mean price is between German and American companies: German models average 8,880.29 more in price. The 95% confidence interval for that difference runs from 7,379.87 to 10,380.71. The adjusted p-value rounds to 0, which indicates a significant difference in means. All four comparisons in the above output (Germany-America, Japanese-America, Japanese-Germany, and SUV-Car) show a significant difference in mean price.

#8

plot(anova1Cars3)

## hat values (leverages) are all = 0.006666667
##  and there are no factor predictors; no plot no. 5

hist(anova1Cars3$residuals)

The output above shows the residuals versus fitted plot, the normal Q-Q plot, and a histogram of the residuals.

Residuals versus fitted plot: each vertical band of points corresponds to one combination of country and type, and the spread differs from band to band. The rightmost band, which contains the German cars, shows the most spread. This is fitting because that model, the E-ClassE350, had the largest standard deviation of all the models. None of the bands overlap, and the points within each band are fairly widely scattered.

Normal Q-Q plot: the points are quite linear. Both the right and the left end show one or two outliers that disrupt the linearity a bit, but overall the points fall on the line.

Histogram: the residuals appear roughly normal, centered at 0. The distribution is slightly skewed to the right, which I assume is due to outliers.

#9

anova2Cars3 = aov(Price ~ Country + Type + Country*Type, data = Cars3)
summary(anova2Cars3)
##               Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Country        2 8.053e+09 4.026e+09  199.55  < 2e-16 ***
## Type           1 6.035e+08 6.035e+08   29.91 6.68e-08 ***
## Country:Type   2 1.232e+10 6.160e+09  305.28  < 2e-16 ***
## Residuals    594 1.199e+10 2.018e+07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This model tests three null hypotheses: there is no main effect of country on mean price, there is no main effect of type of vehicle on mean price, and there is no interaction between country and type of vehicle. The alternative hypothesis in each case is that the corresponding effect exists. The p-value for Country is less than 2e-16, so the null hypothesis can be rejected, and we can conclude that there is a main effect of country on mean price. The p-value for Type is 6.68e-08, so the null hypothesis can be rejected, and we can conclude that there is a main effect of type of vehicle on mean price. The p-value for Country:Type is less than 2e-16, so we can reject the null hypothesis and conclude that there is a significant interaction between country and type of vehicle. Together, these conclusions suggest that country, type of vehicle, and their interaction all play a role in predicting the price of a car. Because the interaction is significant, the main effects should be interpreted with care: the effect of type depends on the country.
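The six Country-by-Type cells are exactly the six car models, so the two-way model with interaction fits the same six group means as the one-way model in #4. That is why both tables report the same residual line (594 df, 1.199e+10). The sketch below checks, using only the rounded sums of squares printed above, that the three two-way terms add up to the one-way Model sum of squares; the variable names are illustrative.

```r
# Sums of squares as printed in summary(anova2Cars3)
ss_country <- 8.053e9
ss_type    <- 6.035e8
ss_inter   <- 1.232e10

# Model sum of squares as printed in summary(anovaCars3)
ss_model_oneway <- 2.098e10

# Country + Type + Country:Type should reproduce the one-way Model SS
ss_twoway <- ss_country + ss_type + ss_inter
rel_diff <- abs(ss_twoway - ss_model_oneway) / ss_model_oneway
rel_diff < 0.001   # agreement up to rounding in the printed tables

# Share of the explained variation carried by the interaction term
round(ss_inter / ss_twoway, 2)
```

The interaction term carries more than half of the explained variation, which is consistent with the very large F value for Country:Type.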

TukeyHSD(anova2Cars3)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Price ~ Country + Type + Country * Type, data = Cars3)
## 
## $Country
##                       diff       lwr       upr p adj
## Germany-America   8880.285  7824.864  9935.706     0
## Japanese-America  3321.975  2266.554  4377.396     0
## Japanese-Germany -5558.310 -6613.731 -4502.889     0
## 
## $Type
##              diff       lwr       upr p adj
## SUV-Car -2005.803 -2726.114 -1285.493 1e-07
## 
## $`Country:Type`
##                                diff         lwr         upr     p adj
## Germany:Car-America:Car    18592.65  16776.4329  20408.8671 0.0000000
## Japanese:Car-America:Car    3525.38   1709.1629   5341.5971 0.0000006
## America:SUV-America:Car     4604.71   2788.4929   6420.9271 0.0000000
## Germany:SUV-America:Car     3772.63   1956.4129   5588.8471 0.0000001
## Japanese:SUV-America:Car    7723.28   5907.0629   9539.4971 0.0000000
## Japanese:Car-Germany:Car  -15067.27 -16883.4871 -13251.0529 0.0000000
## America:SUV-Germany:Car   -13987.94 -15804.1571 -12171.7229 0.0000000
## Germany:SUV-Germany:Car   -14820.02 -16636.2371 -13003.8029 0.0000000
## Japanese:SUV-Germany:Car  -10869.37 -12685.5871  -9053.1529 0.0000000
## America:SUV-Japanese:Car    1079.33   -736.8871   2895.5471 0.5329208
## Germany:SUV-Japanese:Car     247.25  -1568.9671   2063.4671 0.9988433
## Japanese:SUV-Japanese:Car   4197.90   2381.6829   6014.1171 0.0000000
## Germany:SUV-America:SUV     -832.08  -2648.2971    984.1371 0.7796181
## Japanese:SUV-America:SUV    3118.57   1302.3529   4934.7871 0.0000175
## Japanese:SUV-Germany:SUV    3950.65   2134.4329   5766.8671 0.0000000

The largest difference in mean price is between Germany:Car and America:Car at 18,592.65, with Germany:Car averaging 18,592.65 more in price. The 95% confidence interval for that difference runs from 16,776.43 to 20,408.87. The adjusted p-value rounds to 0, which indicates a significant difference in means. All of the comparisons in the above output show a significant difference in mean price except America:SUV-Japanese:Car, Germany:SUV-Japanese:Car, and Germany:SUV-America:SUV.

#10

interaction.plot(Cars3$Country, Cars3$Type, Cars3$Price)

interaction.plot(Cars3$Type, Cars3$Country, Cars3$Price)

The first interaction plot, which plots mean Price against Country with one line per Type, shows crossing lines, which indicates an interaction. The relationship between country and mean price depends on the type of vehicle: for Japanese and American models the SUV has the higher mean price, while for German models the car has the higher mean price. The second interaction plot, which plots mean Price against Type with one line per Country, also shows crossing lines. The American and Japanese lines both rise from Car to SUV and are nearly parallel, while the German line falls steeply from Car to SUV and crosses the other two.
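The direction of each line in the interaction plots can be confirmed from the group means in #2, since each model is one Country-by-Type cell. This is a small sketch using those reported means; the object names are illustrative.

```r
# Cell means from the tapply(..., mean) output in #2, arranged by Country and Type.
# matrix() fills column by column: first the Car column, then the SUV column.
cell_means <- matrix(
  c(11468.59, 30061.24, 14993.97,    # Car: FocusSedan, E-ClassE350, Altima2.5
    16073.30, 15241.22, 19191.87),   # SUV: EscapeSE, X5AWD, RAV44X2
  nrow = 3,
  dimnames = list(Country = c("America", "Germany", "Japanese"),
                  Type = c("Car", "SUV"))
)

# Positive values mean the SUV is priced higher than the car for that country
suv_minus_car <- cell_means[, "SUV"] - cell_means[, "Car"]
suv_minus_car
```

The sign of the difference flips for Germany only, which is what produces the crossing lines.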

#11

model = lm(Price ~ factor(Model), Cars3)
summary(model)
## 
## Call:
## lm(formula = Price ~ factor(Model), data = Cars3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23661.2  -2242.2     -2.5   2287.1  18889.8 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               14994.0      449.2  33.380  < 2e-16 ***
## factor(Model)E-ClassE350  15067.3      635.3  23.719  < 2e-16 ***
## factor(Model)EscapeSE      1079.3      635.3   1.699   0.0898 .  
## factor(Model)FocusSedan   -3525.4      635.3  -5.550 4.32e-08 ***
## factor(Model)RAV44X2       4197.9      635.3   6.608 8.66e-11 ***
## factor(Model)X5AWD          247.2      635.3   0.389   0.6973    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4492 on 594 degrees of freedom
## Multiple R-squared:  0.6364, Adjusted R-squared:  0.6333 
## F-statistic: 207.9 on 5 and 594 DF,  p-value: < 2.2e-16

Because Model is a categorical predictor, each coefficient compares one model with the reference model, the Altima2.5, rather than describing a one-unit increase. The intercept, 14,994.0, is the mean price of the Altima2.5. The E-ClassE350 averages 15,067.3 more than the Altima2.5, the largest positive difference. The EscapeSE averages 1,079.3 more, although this difference is not significant at the 0.05 level (p = 0.0898). The FocusSedan averages 3,525.4 less, and it is the only model with a negative coefficient. The RAV44X2 averages 4,197.9 more. Lastly, the X5AWD averages 247.2 more, the smallest difference, and it is not significant (p = 0.6973). Adding a coefficient to the intercept gives that model’s mean price; for example, 14,994.0 + 247.2 gives the X5AWD’s mean price of about 15,241.2.
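With dummy coding, the intercept plus a model’s coefficient should reproduce that model’s mean price. The sketch below checks this against the group means from #2, using only the rounded values printed in the two outputs; the object names are illustrative.

```r
# Intercept and coefficients as printed in summary(model)
intercept <- 14994.0
coefs <- c("E-ClassE350" = 15067.3, "EscapeSE" = 1079.3,
           "FocusSedan" = -3525.4, "RAV44X2" = 4197.9, "X5AWD" = 247.2)

# The reference level (Altima2.5) is the intercept alone
implied_means <- c("Altima2.5" = intercept, intercept + coefs)

# Group means as printed by tapply(..., mean) in #2
reported_means <- c("Altima2.5" = 14993.97, "E-ClassE350" = 30061.24,
                    "EscapeSE" = 16073.30, "FocusSedan" = 11468.59,
                    "RAV44X2" = 19191.87, "X5AWD" = 15241.22)

# Differences should be rounding error only
round(implied_means - reported_means, 2)
```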

#12

model1 = aov(Price ~ Age + Mileage, data = Cars3)
summary(model1)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Age           1 2.872e+09 2.872e+09   63.71 7.42e-15 ***
## Mileage       1 3.177e+09 3.177e+09   70.48 3.38e-16 ***
## Residuals   597 2.691e+10 4.508e+07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output above shows that both age and mileage are significant predictors of price.

model2 = aov(Price ~ Age + Mileage + Age*Mileage, data = Cars3)
summary(model2)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Age           1 2.872e+09 2.872e+09   64.17 6.03e-15 ***
## Mileage       1 3.177e+09 3.177e+09   70.99 2.69e-16 ***
## Age:Mileage   1 2.376e+08 2.376e+08    5.31   0.0215 *  
## Residuals   596 2.667e+10 4.476e+07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output above shows that the interaction between age and mileage is significant at the 0.05 level (p = 0.0215), so the relationship between mileage and price depends on the age of the car.

model3 = aov(Price ~ Model - Age - Mileage, data = Cars3)
summary(model3)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## Model         5 2.098e+10 4.195e+09   207.9 <2e-16 ***
## Residuals   594 1.199e+10 2.018e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

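The formula above subtracts Age and Mileage, but those terms were never part of the model, so the fit is identical to the one-way ANOVA in #4 (the same F value of 207.9 on 5 and 594 degrees of freedom). This output therefore shows a significant difference in mean price among the car models without any adjustment for age and mileage. The earlier models show that age and mileage account for variability in price, but this model does not test whether Model remains significant after accounting for them; that would require a model that includes Age and Mileage as terms alongside Model.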
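The sketch below illustrates why subtracting terms has no effect here, and what a model that adjusts for age and mileage might look like. The adjusted model is an assumption about the intent of this question, not something the original analysis ran, and the `toy` data frame is simulated purely so the example is self-contained. None of its numbers come from UsedCars.csv.

```r
# Removing terms that were never added leaves the formula's terms unchanged
labels_minus <- attr(terms(Price ~ Model - Age - Mileage), "term.labels")
labels_plain <- attr(terms(Price ~ Model), "term.labels")
identical(labels_minus, labels_plain)

# Simulated stand-in data (hypothetical values, for illustration only)
set.seed(1)
toy <- data.frame(
  Model   = factor(rep(c("A", "B", "C"), each = 30)),
  Age     = sample(1:10, 90, replace = TRUE),
  Mileage = runif(90, 5000, 120000)
)
model_shift <- unname(c(A = 0, B = 2000, C = 5000)[as.character(toy$Model)])
toy$Price <- 25000 - 800 * toy$Age - 0.05 * toy$Mileage +
  model_shift + rnorm(90, sd = 1500)

# Sequential sums of squares: Model is tested after Age and Mileage
ancova_fit <- aov(Price ~ Age + Mileage + Model, data = toy)
summary(ancova_fit)
```

With the real data, the analogous call would use `data = Cars3`. The Model row of that table would then show whether model still explains price once age and mileage are accounted for.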