library(mosaic)
library(readr)
library(leaps)
library(dplyr)

source('~/Downloads/STOR 455/ShowSubsets.R')
UsedCars <- read_csv("~/Downloads/STOR 455/UsedCars.csv")
## Parsed with column specification:
## cols(
##   Id = col_double(),
##   Price = col_double(),
##   Year = col_double(),
##   Mileage = col_double(),
##   City = col_character(),
##   State = col_character(),
##   Vin = col_character(),
##   Make = col_character(),
##   Model = col_character()
## )
cars = subset(as.data.frame(table(UsedCars$Model)), Freq > 2500)

cars
set.seed(19191) 
EscapeFWD = sample_n(subset(UsedCars, Model == "EscapeFWD"), 200)
EscapeSE = sample_n(subset(UsedCars, Model == "EscapeSE"), 200)
EscapeTitanium = sample_n(subset(UsedCars, Model == "EscapeTitanium"), 200)

Escape = rbind(EscapeFWD, EscapeSE, EscapeTitanium)
write.csv(Escape, file = "Escape.csv", row.names = FALSE)

Escape$Age = 2017 - Escape$Year

MODEL #3: Polynomial Models

Dataset used in Assignment #2:

cars1 = as.data.frame(table(UsedCars$Model))

names(cars1)[1] = "Model"
names(cars1)[2] = "Count"

cars2 = subset(cars1, Count >= 2500)

set.seed(1938575)
MyCars = sample_n(subset(UsedCars, Model == "Civic"), 200)
range(MyCars$Year)
## [1] 2005 2017
MyCars$Age = 2017 - MyCars$Year

A

quadmod = lm(Price~poly(Age, degree = 2, raw = TRUE), data = MyCars)
summary(quadmod)
## 
## Call:
## lm(formula = Price ~ poly(Age, degree = 2, raw = TRUE), data = MyCars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4596.6 -1348.7  -283.9  1181.0 19467.5 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        20432.48     443.36  46.085  < 2e-16 ***
## poly(Age, degree = 2, raw = TRUE)1 -2557.78     212.69 -12.026  < 2e-16 ***
## poly(Age, degree = 2, raw = TRUE)2   117.92      19.91   5.924 1.38e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2396 on 197 degrees of freedom
## Multiple R-squared:  0.7007, Adjusted R-squared:  0.6977 
## F-statistic: 230.6 on 2 and 197 DF,  p-value: < 2.2e-16

The prediction equation is Price = 20432.5 - 2557.8(Age) + 117.92(Age^2).
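As a quick check, the coefficients can be pulled straight from the fitted model object; a minimal sketch using base R's coef():

# Extract and round the fitted coefficients to verify the prediction equation
round(coef(quadmod), 2)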

plot(Price ~ Age, main = "Quadratic Model", data = MyCars)
curve(20432.48 - 2557.78*x + 117.92*x^2, add = TRUE)

The scatterplot of the data with the quadratic fit overlaid is shown above.

plot(quadmod$residuals ~ quadmod$fitted.values)
abline(a = 0, b = 0)

The residual plot shows that the data do not meet the conditions for a simple linear model. The residuals are not shapeless: there is a slight vertical pattern, there is an obvious outlier, and the points are clustered together rather than symmetrically distributed about zero. The model does not satisfy the second condition of zero mean.

hist(quadmod$residuals)

Using a histogram of residuals, we can see that the residuals are skewed to the right - the distribution of the errors is not completely centered at zero. There appear to be outliers that are skewing the distribution. This plot does not satisfy the fifth condition of normality because the values do not follow a normal distribution.

qqnorm(quadmod$residuals)
qqline(quadmod$residuals)

Using a normal q-q plot, we can see that the residuals follow the reference line fairly well over most of the range, but there is an overt outlier at the rightmost end of the plot. This conclusion fits with the histogram: the residuals are not completely normally distributed, and/or there may be relationships among the errors.
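As a supplementary check on the normality condition, one could run a Shapiro-Wilk test on the residuals; a small p-value would support the visual impression of non-normality. A minimal sketch using base R:

# Formal test of normality for the residuals (H0: residuals are normal)
shapiro.test(quadmod$residuals)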

B

summary(quadmod)
## 
## Call:
## lm(formula = Price ~ poly(Age, degree = 2, raw = TRUE), data = MyCars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4596.6 -1348.7  -283.9  1181.0 19467.5 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        20432.48     443.36  46.085  < 2e-16 ***
## poly(Age, degree = 2, raw = TRUE)1 -2557.78     212.69 -12.026  < 2e-16 ***
## poly(Age, degree = 2, raw = TRUE)2   117.92      19.91   5.924 1.38e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2396 on 197 degrees of freedom
## Multiple R-squared:  0.7007, Adjusted R-squared:  0.6977 
## F-statistic: 230.6 on 2 and 197 DF,  p-value: < 2.2e-16

H0: β_Age = 0 ;

Ha: β_Age != 0

I used the t-test on the Age coefficient, the r-squared value of 0.7007, and the p-value of approximately 0 to assess the effectiveness of my model. The r-squared value demonstrates that most of the variability in the response variable (Price) can be explained by the Age terms, so the model appears fairly effective. The correlation of about 0.837 (the square root of r-squared) indicates a strong association between price and age (price falls as age increases). The p-value being close to 0 allows us to reject the null hypothesis and conclude that the model captures a real relationship. Our model appears effective because of the conclusions we were able to draw from the p-value and the r-squared value.
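In addition to the t-tests, confidence intervals for the coefficients can be computed directly; intervals that exclude zero agree with the significant t-tests in the summary. A minimal sketch using base R's confint():

# 95% confidence intervals for the intercept, Age, and Age^2 coefficients
confint(quadmod, level = 0.95)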

C

newdata = data.frame(Age = 3)
predict.lm(quadmod, newdata, interval = "confidence", level = 0.9)
##        fit      lwr      upr
## 1 13820.42 13496.33 14144.51
predict.lm(quadmod, newdata, interval = "prediction", level = 0.9)
##        fit      lwr      upr
## 1 13820.42 9847.185 17793.65

The predicted price for a three-year-old car of this model is 13,820.42 dollars. We are 90% confident that the mean price of three-year-old cars of this model falls between 13,496.33 and 14,144.51 dollars. We are 90% confident that the price of an individual three-year-old car of this model falls between 9,847.19 and 17,793.65 dollars.

D

The prediction equation is Price = 20432.5 - 2557.8(Age) + 117.92(Age^2). The intercept is positive, so at Age = 0 the predicted price is 20432.5, not a negative value. If the parabola never crosses zero, the predicted price can never become negative for any Age. The following lines use the quadratic formula (with a = 117.92, b = -2557.8, c = 20432.5) to check whether there is some Age where a car has a zero predicted price.

20432.5 - 2557.8(x) + 117.92(x^2) = 0 ;

x = (2557.8 +/- sqrt((-2557.8)^2 - 4(117.92)(20432.5))) / (2(117.92)) ;

x = (2557.8 +/- sqrt(-3095260.76)) / 235.84 ;

x = 10.845 +/- 7.4598i

The discriminant is negative, so the roots are complex and there is no real Age at which the predicted price equals zero. Combined with the positive prediction at Age = 0, the predicted price is positive for every Age.
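The same conclusion can be reached numerically with base R's polyroot(), which takes the coefficients in increasing order of degree and returns the roots; a minimal sketch:

# Roots of 20432.5 - 2557.8x + 117.92x^2; complex roots mean no real zero-price Age
polyroot(c(20432.5, -2557.8, 117.92))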

E

20432.5 - (2557.8*50) + (117.92*50^2)
## [1] 187342.5

I set the Age value to 50 years old (an age I thought classified a car as rather old), and the predicted price came out to be $187,342.50. This appears to be a flaw because, typically, a car’s price decreases as its age increases. However, in the real world some older cars are in short supply and high demand; an old car can be deemed rare, valuable, and classic, so an increase in price can make sense in the current market (depending on the car model).
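The age at which the fitted parabola stops falling and starts rising is its vertex, -b/(2a); a minimal sketch computing it from the fitted coefficients shows predicted prices begin increasing for cars older than about 10.8 years:

# Vertex of the fitted parabola, -b/(2a): the Age where predicted price is lowest
b = coef(quadmod)
unname(-b[2] / (2 * b[3]))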

F

quadmod1 = lm(Price~poly(Age, degree = 3, raw = TRUE), data = MyCars)
summary(quadmod1)
## 
## Call:
## lm(formula = Price ~ poly(Age, degree = 3, raw = TRUE), data = MyCars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4647  -1333   -218   1263  19192 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        20708.462    552.712  37.467  < 2e-16 ***
## poly(Age, degree = 3, raw = TRUE)1 -2868.723    428.008  -6.702 2.13e-10 ***
## poly(Age, degree = 3, raw = TRUE)2   195.706     95.004   2.060   0.0407 *  
## poly(Age, degree = 3, raw = TRUE)3    -4.927      5.884  -0.837   0.4034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2398 on 196 degrees of freedom
## Multiple R-squared:  0.7018, Adjusted R-squared:  0.6972 
## F-statistic: 153.7 on 3 and 196 DF,  p-value: < 2.2e-16
plot(quadmod1$residuals ~ quadmod1$fitted.values)
abline(a = 0, b = 0)

Including a cubic term did not improve the fit significantly: the r-squared value (0.7018, versus 0.7007 for the original fit), the overall p-value of approximately 0, and the residual plot all look similar to the original fit, and the cubic coefficient itself is not significant (p-value 0.4034).
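A formal way to judge whether the cubic term helps is a nested F-test between the quadratic and cubic fits; a minimal sketch using base R's anova():

# Nested F-test: does the cubic term significantly reduce residual error?
anova(quadmod, quadmod1)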

quadmod2 = lm(Price~poly(Age, degree = 4, raw = TRUE), data = MyCars)
summary(quadmod2)
## 
## Call:
## lm(formula = Price ~ poly(Age, degree = 4, raw = TRUE), data = MyCars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4882.9 -1336.3  -217.3  1279.7 18867.3 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        21032.659    621.726  33.829  < 2e-16 ***
## poly(Age, degree = 4, raw = TRUE)1 -3556.935    741.759  -4.795 3.22e-06 ***
## poly(Age, degree = 4, raw = TRUE)2   504.746    288.225   1.751   0.0815 .  
## poly(Age, degree = 4, raw = TRUE)3   -50.160     40.264  -1.246   0.2143    
## poly(Age, degree = 4, raw = TRUE)4     2.050      1.805   1.136   0.2575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2396 on 195 degrees of freedom
## Multiple R-squared:  0.7037, Adjusted R-squared:  0.6977 
## F-statistic: 115.8 on 4 and 195 DF,  p-value: < 2.2e-16
plot(quadmod2$residuals ~ quadmod2$fitted.values)
abline(a = 0, b = 0)

Adding a quartic term did not make significant improvements, either. The r-squared value (0.7037), the overall p-value of approximately 0, and the residual plot all resemble the original model (r-squared value of 0.7007, p-value of approximately 0, and a similar residual plot), and the quartic coefficient itself is not significant (p-value 0.2575). The r-squared value improved only very subtly.
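The adjusted r-squared values, which penalize extra terms, can be compared side by side; a minimal sketch:

# Adjusted r-squared for the quadratic, cubic, and quartic fits
c(quadratic = summary(quadmod)$adj.r.squared,
  cubic     = summary(quadmod1)$adj.r.squared,
  quartic   = summary(quadmod2)$adj.r.squared)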

MODEL #4: Complete Second Order Model

A

mod2 = lm(Price ~ Age + Mileage + I(Age^2) + I(Mileage^2) + I(Age*Mileage), data = MyCars)
plot(Price ~ Age + Mileage, main = "Second Order Model", data = MyCars)

Scatterplots for the complete second order model predicting a used car's Price from Age and Mileage are shown above.

plot(mod2$residuals ~ mod2$fitted.values)
abline(a = 0, b = 0)

The residual plot shows that the data do fit the conditions for a linear model. The residuals are shapeless - there is no overt pattern in the plot and the points appear roughly symmetrically distributed about zero. There is an obvious outlier at the rightmost end of the plot, and there is some clustering near the zero line in a few sections. The model does satisfy the second condition of zero mean.

hist(mod2$residuals)

Using a histogram of residuals, we can see that the residuals are skewed to the right - the distribution of the errors is not completely centered at zero. There appear to be outliers that are skewing the distribution. This plot does not satisfy the fifth condition of normality because the values do not follow a normal distribution.

qqnorm(mod2$residuals)
qqline(mod2$residuals)

Using a normal q-q plot, we can see that the residuals follow the reference line rather well over the middle of the range. However, there is a subtle curvature at both ends of the plotted points, which indicates the residuals may be skewed, and there is an overt outlier at the rightmost end of the graph. This conclusion fits with the histogram: the residuals are not completely normally distributed, and/or there may be relationships among the errors.

B

summary(mod2)
## 
## Call:
## lm(formula = Price ~ Age + Mileage + I(Age^2) + I(Mileage^2) + 
##     I(Age * Mileage), data = MyCars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4257.1 -1334.8  -307.3  1004.6 19472.4 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.065e+04  4.739e+02  43.578  < 2e-16 ***
## Age              -1.960e+03  2.545e+02  -7.700  6.8e-13 ***
## Mileage          -4.615e-02  1.831e-02  -2.521  0.01251 *  
## I(Age^2)          8.664e+01  3.299e+01   2.626  0.00932 ** 
## I(Mileage^2)      5.391e-08  1.172e-07   0.460  0.64594    
## I(Age * Mileage)  1.209e-03  3.711e-03   0.326  0.74482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2289 on 194 degrees of freedom
## Multiple R-squared:  0.7311, Adjusted R-squared:  0.7241 
## F-statistic: 105.5 on 5 and 194 DF,  p-value: < 2.2e-16

H0: β_i = 0 for all predictors ;

Ha: β_i != 0 for at least one predictor

I used the t-tests on the coefficients, the r-squared value of 0.7311, and the overall p-value of approximately 0 to assess the effectiveness of my model. The r-squared value demonstrates that most of the variability in the response variable (Price) can be explained by the Age and Mileage terms, so the model appears fairly effective. The multiple correlation of about 0.855 (the square root of r-squared) indicates a strong linear association between price and the predictors. The p-value being close to 0 allows us to reject the null hypothesis and conclude that the model explains a significant amount of the variation in price.

C

anova(mod2)

H0: β_i = 0 for the term being tested ;

Ha: β_i != 0 for the term being tested

The test shows an F-value of 467.0217 (the variation explained by the model terms is much larger than the residual variation) and a p-value of approximately 0, so we can reject the null hypothesis and conclude that a relationship exists among age, mileage, and price.

D

anova(quadmod, mod2)

H0: β_i = 0 for every predictor in the added subset (Mileage, Mileage^2, Age*Mileage) ;

Ha: β_i != 0 for at least one predictor in the subset

The test shows an F-value of 7.2981 (the additional variation explained by the new terms is large relative to the residual variation) and a p-value of approximately 0, so we can reject the null hypothesis and conclude that the larger model, with the added Mileage terms, is justified. More variability in price is explained when the Mileage terms are added.
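The F-statistic and p-value for this nested test can also be extracted programmatically from the anova table; a minimal sketch:

# Pull the nested-test F-statistic and p-value from the second row of the table
nested = anova(quadmod, mod2)
nested$F[2]
nested$"Pr(>F)"[2]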

MODEL #5: Categorical Predictors

A

newmod = lm(Price ~ Age + Mileage + Model, data = Escape)
plot(newmod)

Diagnostic plots for a multiple regression model using Age, Mileage, and Model to predict the Price of the car are shown above.
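Behind the scenes, R encodes the character variable Model as indicator (dummy) variables, with EscapeFWD (first alphabetically) as the baseline level; a minimal sketch showing the design matrix columns:

# First rows of the design matrix: EscapeFWD is absorbed into the intercept
head(model.matrix(~ Model, data = Escape))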

B

summary(newmod)
## 
## Call:
## lm(formula = Price ~ Age + Mileage + Model, data = Escape)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4735.5 -1161.6  -122.3  1019.9 10664.8 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.134e+04  1.735e+02 123.024   <2e-16 ***
## Age                 -8.977e+02  5.705e+01 -15.734   <2e-16 ***
## Mileage             -6.621e-02  3.629e-03 -18.247   <2e-16 ***
## ModelEscapeSE        4.366e+02  1.820e+02   2.399   0.0167 *  
## ModelEscapeTitanium  3.022e+03  1.835e+02  16.463   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1817 on 595 degrees of freedom
## Multiple R-squared:  0.7645, Adjusted R-squared:  0.7629 
## F-statistic: 482.8 on 4 and 595 DF,  p-value: < 2.2e-16

H0: β_i = 0 for all predictors ;

Ha: β_i != 0 for at least one predictor

I used the t-tests on the coefficients, the r-squared value of 0.7645, and the overall p-value of approximately 0 to assess the effectiveness of my model. The r-squared value demonstrates that most of the variability in the response variable (Price) can be explained by the Age, Mileage, and Model variables, so the model appears fairly effective. The multiple correlation of about 0.874 (the square root of r-squared) indicates a strong linear association between price and the predictors. The p-value being close to 0 allows us to reject the null hypothesis and conclude that the model explains a significant amount of the variation in price.

C

newmod1 = lm(Price ~ Age + Mileage + Model + Age*Model + Mileage*Model, data = Escape)
plot(newmod1)

Diagnostic plots for a multiple regression model using Age, Mileage, Model, and the Age-by-Model and Mileage-by-Model interactions to predict the Price of the car are shown above.

D

summary(newmod1)
## 
## Call:
## lm(formula = Price ~ Age + Mileage + Model + Age * Model + Mileage * 
##     Model, data = Escape)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4722.6 -1177.3  -131.5  1041.6 10610.7 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  2.092e+04  2.073e+02 100.921  < 2e-16 ***
## Age                         -9.997e+02  8.994e+01 -11.115  < 2e-16 ***
## Mileage                     -5.071e-02  6.724e-03  -7.542 1.76e-13 ***
## ModelEscapeSE                9.244e+02  3.663e+02   2.524  0.01188 *  
## ModelEscapeTitanium          4.479e+03  3.972e+02  11.277  < 2e-16 ***
## Age:ModelEscapeSE            1.187e+02  1.468e+02   0.809  0.41896    
## Age:ModelEscapeTitanium     -8.872e+01  1.514e+02  -0.586  0.55814    
## Mileage:ModelEscapeSE       -1.795e-02  8.890e-03  -2.019  0.04395 *  
## Mileage:ModelEscapeTitanium -2.903e-02  9.191e-03  -3.158  0.00167 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1794 on 591 degrees of freedom
## Multiple R-squared:  0.7718, Adjusted R-squared:  0.7687 
## F-statistic: 249.8 on 8 and 591 DF,  p-value: < 2.2e-16

H0: β_i = 0 for all predictors and interaction terms ;

Ha: β_i != 0 for at least one term

I used the t-tests on the coefficients, the r-squared value of 0.7718, and the overall p-value of approximately 0 to assess the effectiveness of my model. The r-squared value demonstrates that most of the variability in the response variable (Price) can be explained by Age, Mileage, Model, and the Age-by-Model and Mileage-by-Model interactions, so the model appears fairly effective. The multiple correlation of about 0.879 (the square root of r-squared) indicates a strong linear association between price and the predictors. The p-value being close to 0 allows us to reject the null hypothesis and conclude that the model explains a significant amount of the variation in price. Individually, the Mileage interactions are significant (p-values 0.044 and 0.0017) while the Age interactions are not.
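Because of the interaction terms, each Model gets its own Age slope; the group-specific slopes can be assembled from the coefficients, a minimal sketch:

# Age slope for each Model: baseline (EscapeFWD) plus its interaction adjustment
b = coef(newmod1)
b["Age"]                                  # EscapeFWD
b["Age"] + b["Age:ModelEscapeSE"]         # EscapeSE
b["Age"] + b["Age:ModelEscapeTitanium"]   # EscapeTitanium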

E

The model in Part A used Age, Mileage, and Model to predict Price. The model in Part C added the interaction between Age and Model and the interaction between Mileage and Model. Both models demonstrated a strong linear relationship (both models’ overall p-values were approximately 0) among the variables and interactions tested; however, the model in Part C had higher r-squared and adjusted r-squared values (0.7718 and 0.7687, versus 0.7645 and 0.7629), which leads us to conclude that adding the Age-by-Model and Mileage-by-Model interactions strengthens the model and provides somewhat more accurate predictions for Price.
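A formal nested F-test between the Part A and Part C models would confirm whether the interaction terms add significant explanatory power; a minimal sketch:

# Nested F-test: do the Age and Mileage interactions with Model improve the fit?
anova(newmod, newmod1)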