library(mosaic)
library(Stat2Data)
library(readr)
library(car)
library(corrplot)
UsedCars <- read_csv("Downloads/UsedCars.csv")
## Parsed with column specification:
## cols(
## Id = col_double(),
## Price = col_double(),
## Year = col_double(),
## Mileage = col_double(),
## City = col_character(),
## State = col_character(),
## Vin = col_character(),
## Make = col_character(),
## Model = col_character()
## )
head(UsedCars)
Cars = as.data.frame(table(UsedCars$Model))
head(Cars)
names(Cars)[1] = "Model"
names(Cars)[2] = "Count"
head(Cars)
Cars2 = subset(Cars, Count >= 2500)
Cars2
set.seed(1938575)
MyCars = sample_n(subset(UsedCars, Model == "Civic"), 200)
range(MyCars$Year)
## [1] 2005 2017
MyCars$Age = 2017 - MyCars$Year
Calculate the least squares regression line that best fits your data. Interpret (in context) what the slope estimate tells you about prices and ages of your used car model. Explain why the sign (positive/negative) makes sense.
lm(Price ~ Age, data = MyCars)
##
## Call:
## lm(formula = Price ~ Age, data = MyCars)
##
## Coefficients:
## (Intercept) Age
## 18442 -1360
The equation for the regression line is y = -1360x + 18442, where x is age in years and y is predicted price. As a car's age increases by one year, its predicted price decreases by about $1,360. The negative sign of the slope makes sense because a car's value depreciates as it gets older.
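As a quick check of this interpretation, the coefficients can be pulled from the fitted model and used to predict a price at any age (a small sketch; the 5-year age is just an example):
fit = lm(Price ~ Age, data = MyCars)      # same model that is refit below as mod1
coef(fit)                                 # intercept ~ 18442, slope ~ -1360
coef(fit)[1] + coef(fit)[2] * 5           # predicted price of a 5-year-old car, ~ $11,640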
Produce a scatterplot of the relationship with the regression line drawn on it.
mod1 = lm(Price ~ Age, data = MyCars)
plot(Price ~ Age, data = MyCars)
abline(mod1)
The plot above shows the scatterplot of price against age with the fitted regression line drawn on it.
Produce appropriate residual plots and comment on how well your data appear to fit the conditions for a simple linear model. Don’t worry about doing transformations at this point if there are problems with the conditions.
plot(mod1$residuals ~ mod1$fitted.values)
abline(a = 0, b = 0)
The residual plot suggests the data do not fully satisfy the conditions for a simple linear model. The residuals are not shapeless: there is a slightly curved pattern, an obvious outlier, and the points are clustered rather than scattered symmetrically about zero. In particular, the zero-mean condition for the errors looks questionable.
hist(mod1$residuals)
Using a histogram of the residuals, we can see that the residuals are skewed to the right; the distribution of the errors is not centered at zero. There appear to be outliers skewing the distribution. This plot does not satisfy the normality condition because the residuals do not follow a normal distribution.
qqnorm(mod1$residuals)
qqline(mod1$residuals)
The normal q-q plot addresses the normality of the residuals: most points fall close to the reference line, so the middle of the distribution looks roughly normal. However, there is a very slight curvature at two or three of the points, which suggests some skewness, and there is a clear outlier at the rightmost part of the graph. This agrees with the histogram: the residuals are not completely normally distributed.
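A convenient alternative is R's built-in diagnostics for lm objects, which draw the residuals-vs-fitted, normal Q-Q, scale-location, and residuals-vs-leverage panels in one call (a sketch):
par(mfrow = c(2, 2))   # arrange the four diagnostic panels in a 2x2 grid
plot(mod1)             # standard lm diagnostic plots
par(mfrow = c(1, 1))   # reset the plotting layout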
Find the car in your sample with the largest residual (in magnitude - positive or negative). For that car, find its standardized and studentized residual. Based on these residuals, would this value be considered influential?
mod1 = lm(Price ~ Age, data = MyCars)
mod1$residuals
## 1 2 3 4 5 6
## 3257.251557 572.051841 2601.650988 3406.451273 -864.948159 -913.948159
## 7 8 9 10 11 12
## -1451.347591 957.650988 -2227.748443 1628.051841 -3472.748443 -2121.948159
## 13 14 15 16 17 18
## -762.948159 -372.948159 120.853546 537.051841 -3923.748443 -1922.748443
## 19 20 21 22 23 24
## -3928.747023 927.051841 1618.051841 -1778.748443 -83.548727 50.650988
## 25 26 27 28 29 30
## 1215.650988 -2490.748443 -1672.948159 -653.347591 1244.652409 -840.948159
## 31 32 33 34 35 36
## 1087.852125 -3755.347591 -128.147875 1150.653830 1605.451273 2636.051841
## 37 38 39 40 41 42
## -4374.948159 -2368.747023 -1976.948159 -2582.548727 1026.053262 -2594.548727
## 43 44 45 46 47 48
## 2276.251557 -2144.347591 3525.051841 -1731.748443 -811.948159 1510.650988
## 49 50 51 52 53 54
## 261.251557 -377.948159 -1222.748443 207.051841 73.252977 -655.347591
## 55 56 57 58 59 60
## -1567.948159 -1755.347591 2265.251557 -524.747023 14.051841 -2125.548727
## 61 62 63 64 65 66
## -2001.748443 -573.946738 1019.251557 -2734.748443 -83.548727 1513.650988
## 67 68 69 70 71 72
## 102.650988 -891.548727 3305.451273 -862.948159 -63.946738 -1033.547307
## 73 74 75 76 77 78
## 21457.650988 632.051841 4987.852125 267.251557 -1925.747023 -1712.948159
## 79 80 81 82 83 84
## -543.349012 -1229.147875 -2404.147875 -1091.948159 -3769.147875 -1575.946738
## 85 86 87 88 89 90
## -1834.748443 1161.251557 -1363.948159 -1866.948159 -783.547307 -438.748443
## 91 92 93 94 95 96
## -1374.948159 463.051841 -465.948159 -1808.548727 684.652409 -1283.547307
## 97 98 99 100 101 102
## -1748.347591 3763.254398 -924.747023 854.251557 -367.948159 2537.051841
## 103 104 105 106 107 108
## 1136.051841 -2004.147875 2043.653830 3623.051841 269.251557 2165.652409
## 109 110 111 112 113 114
## -5648.347591 -2004.147875 -983.948159 -1735.748443 5682.451273 1593.852125
## 115 116 117 118 119 120
## -2372.948159 1576.252977 -2362.948159 -1653.347591 -1772.748443 2458.454114
## 121 122 123 124 125 126
## -769.347591 902.451273 272.251557 -727.748443 -2781.748443 -362.948159
## 127 128 129 130 131 132
## 1425.053262 -732.748443 2657.251557 4144.653830 -675.946738 -1867.948159
## 133 134 135 136 137 138
## 272.251557 -564.946738 -1327.548727 -1507.748443 1137.051841 3387.451273
## 139 140 141 142 143 144
## 2326.451273 -1422.748443 -365.948159 2917.451273 764.051841 3277.251557
## 145 146 147 148 149 150
## -1309.147875 905.451273 4370.254398 -1845.948159 2106.653830 -2362.948159
## 151 152 153 154 155 156
## -533.948159 3244.652409 636.051841 -3222.748443 3777.251557 2416.451273
## 157 158 159 160 161 162
## -2332.548727 2306.652409 995.853546 -4648.347591 -3.147875 137.051841
## 163 164 165 166 167 168
## 3936.053262 806.451273 343.652409 -3723.748443 4357.650988 3145.653830
## 169 170 171 172 173 174
## -1960.948159 1786.653830 1635.051841 -1389.547307 -1012.948159 1254.251557
## 175 176 177 178 179 180
## -1422.948159 -1293.547307 856.652409 -3323.147875 2917.451273 -477.948159
## 181 182 183 184 185 186
## -3483.547307 2351.652409 -1783.547307 -454.349012 2277.251557 -3777.748443
## 187 188 189 190 191 192
## 155.653830 3799.650988 2275.251557 -1004.147875 -2821.748443 -3096.548727
## 193 194 195 196 197 198
## 2637.051841 227.251557 74.051841 -616.548727 -2.146454 1137.051841
## 199 200
## -2067.347591 1132.051841
The residual with the largest magnitude is number 73 out of 200 (ID 383124, a 2017 Honda Civic), with a residual of 21457.650988.
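Rather than scanning the full listing by eye, the car with the largest absolute residual can be located directly (a sketch; the lookup assumes MyCars keeps the same row order used to fit mod1):
idx = which.max(abs(mod1$residuals))               # index of the largest-magnitude residual
idx                                                # 73
mod1$residuals[idx]                                # its raw residual, ~ 21457.65
MyCars[idx, c("Id", "Year", "Price", "Mileage")]   # the car itself (ID 383124, a 2017 Civic)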
rstandard(mod1)
## 1 2 3 4 5
## 1.2598703689 0.2210853413 1.0102344048 1.3196493555 -0.3342832683
## 6 7 8 9 10
## -0.3532206809 -0.5612889908 0.3718607840 -0.8616694794 0.6292059063
## 11 12 13 14 15
## -1.3432222799 -0.8200858726 -0.2948625319 -0.1441361869 0.0472355050
## 16 17 18 19 20
## 0.2075586181 -1.5176642984 -0.7436987130 -1.5250614669 0.3582849631
## 21 22 23 24 25
## 0.6253411282 -0.6880010396 -0.0323665350 0.0196680383 0.4720435055
## 26 27 28 29 30
## -0.9633950910 -0.6465573369 -0.2526733170 0.4813524335 -0.3250078009
## 31 32 33 34 35
## 0.4204115422 -1.4523297333 -0.0495240525 0.4518221189 0.6219471725
## 36 37 38 39 40
## 1.0187755364 -1.6908203733 -0.9195004892 -0.7640465904 -1.0004718961
## 41 42 43 44 45
## 0.3995016141 -1.0051206613 0.8804299694 -0.8292973391 1.3623543074
## 46 47 48 49 50
## -0.6698219379 -0.3137999445 0.5865935166 0.1010493324 -0.1460685760
## 51 52 53 54 55
## -0.4729461343 0.0800209415 0.0284353491 -0.2534467899 -0.6059771671
## 56 57 58 59 60
## -0.6788568666 0.8761752860 -0.2036963590 0.0054307247 -0.8234314199
## 61 62 63 64 65
## -0.7742550754 -0.2234705127 0.3942356961 -1.0577717042 -0.0323665350
## 66 67 68 69 70
## 0.5877584319 0.0398599046 -0.3453833944 1.2805222481 -0.3335103127
## 71 72 73 74 75
## -0.0248981473 -0.4003015788 8.3321157877 0.2442740098 1.9276062949
## 76 77 78 79 80
## 0.1033700688 -0.7475366987 -0.6620164492 -0.2109852043 -0.4750167250
## 81 82 83 84 85
## -0.9291074518 -0.4220137306 -1.4566256152 -0.6136068070 -0.7096612459
## 86 87 88 89 90
## 0.4491597906 -0.5271356943 -0.7215340316 -0.3034744727 -0.1697032463
## 91 92 93 94 95
## -0.5313869502 0.1789592603 -0.1800786230 -0.7006265381 0.2647800308
## 96 97 98 99 100
## -0.4971286849 -0.6761497115 1.4952512229 -0.3589684046 0.3304154454
## 101 102 103 104 105
## -0.1422037979 0.9805142334 0.4390588249 -0.7745233746 0.8024724550
## 106 107 108 109 110
## 1.4002291326 0.1041436476 0.8375366886 -2.1844218016 -0.7745233746
## 111 112 113 114 115
## -0.3802741274 -0.6713690955 2.2013651626 0.6159603999 -0.9170918023
## 116 117 118 119 120
## 0.6118700604 -0.9132270242 -0.6394097504 -0.6856803033 0.9706472694
## 121 122 123 124 125
## -0.2975347433 0.3496070089 0.1053040157 -0.2814853823 -1.0759508059
## 126 127 128 129 130
## -0.1402714088 0.5548552882 -0.2834193293 1.0277967301 1.6274627755
## 131 132 133 134 135
## -0.2631849857 -0.7219205094 0.1053040157 -0.2199662945 -0.5142885315
## 136 137 138 139 140
## -0.5831811128 0.4394453027 1.3122888105 0.9012604839 -0.5503040139
## 141 142 143 144 145
## -0.1414308423 1.1302121720 0.2952890804 1.2676061569 -0.5059335404
## 146 147 148 149 150
## 0.3507692002 1.7364301059 -0.7134179976 0.8272103847 -0.9132270242
## 151 152 153 154 155
## -0.2063591140 1.2548253002 0.2458199210 -1.2465249304 1.4610008559
## 156 157 158 159 160
## 0.9361262232 -0.9036226203 0.8920665257 0.3892285067 -1.7976853682
## 161 162 163 164 165
## -0.0012165284 0.0529674950 1.5325321696 0.3124168869 0.1329029070
## 166 167 168 169 170
## -1.4403064188 1.6920981993 1.2351898911 -0.7578629455 0.7015574087
## 171 172 173 174 175
## 0.6319112509 -0.5381833778 -0.3914819838 0.4851312046 -0.5499378850
## 176 177 178 179 180
## -0.5010017691 0.3312986973 -1.2842643691 1.1302121720 -0.1847163567
## 181 182 183 184 185
## -1.3492072183 0.9094696652 -0.6907828970 -0.1764260484 0.8808167588
## 186 187 188 189 190
## -1.4611930463 0.0611198967 1.4754239411 0.8800431800 -0.3880631816
## 191 192 193 194 195
## -1.0914223818 -1.1995940072 1.0191620142 0.0878984928 0.0286193931
## 196 197 198 199 200
## -0.2388491910 -0.0008389398 0.4394453027 -0.7995186338 0.4375129136
The standardized residual for this car is 8.3321157877.
rstudent(mod1)
## 1 2 3 4 5
## 1.2617524670 0.2205535632 1.0102871605 1.3221398557 -0.3335321812
## 6 7 8 9 10
## -0.3524386382 -0.5603157475 0.3710501437 -0.8611068278 0.6282433903
## 11 12 13 14 15
## -1.3459725343 -0.8194051357 -0.2941815833 -0.1437792892 0.0471163379
## 16 17 18 19 20
## 0.2070563434 -1.5227095983 -0.7428565718 -1.5302193575 0.3574949628
## 21 22 23 24 25
## 0.6243768643 -0.6870832376 -0.0322847833 0.0196183278 0.4711151338
## 26 27 28 29 30
## -0.9632194053 -0.6456044404 -0.2520750887 0.4804165331 -0.3242725441
## 31 32 33 34 35
## 0.4195358454 -1.4564359495 -0.0493991394 0.4509122209 0.6209814906
## 36 37 38 39 40
## 1.0188735588 -1.6988544863 -0.9191400916 -0.7632407067 -1.0004742932
## 41 42 43 44 45
## 0.3986521975 -1.0051468556 0.8799279662 -0.8286408573 1.3653238562
## 46 47 48 49 50
## -0.6688865945 -0.3130843794 0.5856194219 0.1007964334 -0.1457070997
## 51 52 53 54 55
## -0.4720170052 0.0798199034 0.0283635097 -0.2528469803 -0.6050062636
## 56 57 58 59 60
## -0.6779298148 0.8756591174 -0.2032026155 0.0054169938 -0.8227593690
## 61 62 63 64 65
## -0.7734691881 -0.2229335945 0.3933933202 -1.0580910084 -0.0322847833
## 66 67 68 69 70
## 0.5867844335 0.0397592804 -0.3446139361 1.2826065079 -0.3327605283
## 71 72 73 74 75
## -0.0248352324 -0.3994511072 10.3135627072 0.2436930982 1.9410313845
## 76 77 78 79 80
## 0.1031114851 -0.7467010338 -0.6610746119 -0.2104754000 -0.4740858807
## 81 82 83 84 85
## -0.9287851218 -0.4211361333 -1.4607905725 -0.6126381016 -0.7087688652
## 86 87 88 89 90
## 0.4482525360 -0.5261722017 -0.7206577240 -0.3027775774 -0.1692864721
## 91 92 93 94 95
## -0.5304217233 0.1785212094 -0.1796380145 -0.6997229473 0.2641573190
## 96 97 98 99 100
## -0.4961814755 -0.6752200874 1.4999632406 -0.3581773411 0.3296709072
## 101 102 103 104 105
## -0.1418514867 0.9804182073 0.4381620361 -0.7737380293 0.8017482825
## 106 107 108 109 110
## 1.4036556755 0.1038831709 0.8369028106 -2.2056380606 -0.7737380293
## 111 112 113 114 115
## -0.3794512142 -0.6704351127 2.2231735094 0.6149924816 -0.9167220641
## 116 117 118 119 120
## 0.6109008072 -0.9128424686 -0.6384525377 -0.6847600713 0.9705047982
## 121 122 123 124 125
## -0.2968488107 0.3488307283 0.1050407013 -0.2808298566 -1.0763816364
## 126 127 128 129 130
## -0.1399236923 0.5538831407 -0.2827600802 1.0279437995 1.6343156970
## 131 132 133 134 135
## -0.2625654666 -0.7210447508 0.1050407013 -0.2194369344 -0.5133311531
## 136 137 138 139 140
## -0.5822068058 0.4385481009 1.3147005347 0.9008313662 -0.5493328523
## 141 142 143 144 145
## -0.1410803680 1.1310086928 0.2946073341 1.2695630026 -0.5049808312
## 146 147 148 149 150
## 0.3499910589 1.7453801387 -0.7125305369 0.8265483141 -0.9128424686
## 151 152 153 154 155
## -0.2058594839 1.2566592768 0.2452358024 -1.2482808059 1.4652260683
## 156 157 158 159 160
## 0.9358325326 -0.9032021419 0.8916045107 0.3883929793 -1.8079550531
## 161 162 163 164 165
## -0.0012134524 0.0528339437 1.5378050886 0.3117037936 0.1325727822
## 166 167 168 169 170
## -1.4442504384 1.7001572152 1.2368412476 -0.7570455398 0.7006549325
## 171 172 173 174 175
## 0.6309500434 -0.5372156814 -0.3906433559 0.4841924278 -0.5489668102
## 176 177 178 179 180
## -0.5000520679 0.3305526570 -1.2863861584 1.1310086928 -0.1842651882
## 181 182 183 184 185
## -1.3520252506 0.9090709037 -0.6898680819 -0.1759937980 0.8803160554
## 186 187 188 189 190
## -1.4654209155 0.0609659333 1.4798508689 0.8795398790 -0.3872292692
## 191 192 193 194 195
## -1.0919524200 -1.2009329275 1.0192621117 0.0876779566 0.0285470895
## 196 197 198 199 200
## -0.2382796028 -0.0008368186 0.4385481009 -0.7987875507 0.4366177868
The studentized residual for this car is 10.3135627072. Based on these residuals, this car would be flagged as influential: its raw, standardized, and studentized residuals are all far larger in magnitude than those of any other car in the sample, so it is an extreme outlier. (Whether it actually moves the fitted line much also depends on its leverage, which is examined next.)
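Both of these values can also be pulled out for just this one observation instead of printing the full vectors (a sketch):
rstandard(mod1)[73]   # standardized residual for car 73, ~ 8.33
rstudent(mod1)[73]    # studentized (deleted) residual for car 73, ~ 10.31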
Determine the leverage for the car with the largest absolute residual. What does this say about the potential for this car to be influential on your model?
hatvalues(mod1)[73]
## 73
## 0.01457741
2*(2/200)
## [1] 0.02
3*(2/200)
## [1] 0.03
The leverage (0.0146) is below both two and three times the average leverage of (k+1)/n = 2/200 = 0.01, so this car is not a high-leverage point; despite its very large residual, its potential to be influential on the fitted line is limited.
Compute and interpret in context a 90% confidence interval for the slope of your model.
confint(mod1, level = 0.9)
## 5 % 95 %
## (Intercept) 17924.718 18959.980
## Age -1477.657 -1241.944
We are 90% confident that the true slope lies between -1477.657 and -1241.944. In other words, we are 90% confident that each additional year of age is associated with a decrease in mean price of somewhere between about $1,242 and $1,478 for this model.
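The same interval can be reproduced by hand from the summary output, since it is the slope estimate plus or minus a t critical value times its standard error (a sketch):
b1 = coef(summary(mod1))["Age", "Estimate"]      # ~ -1359.80
se1 = coef(summary(mod1))["Age", "Std. Error"]   # ~ 71.32
b1 + c(-1, 1) * qt(0.95, df = 198) * se1         # ~ (-1477.7, -1241.9), matching confint()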
Test the strength of the linear relationship between your variables using each of the three methods (test for correlation, test for slope, ANOVA for regression). Include hypotheses for each test and your conclusions in the context of the problem.
cor(MyCars$Price, MyCars$Age)
## [1] -0.8046167
H0: ρAge = 0, Ha: ρAge ≠ 0. The sample correlation is r = -0.805, which indicates a strong negative linear relationship between age and price: as the age of a car goes up, its price goes down.
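cor() only reports the estimate; the formal test of H0: ρAge = 0, with its t statistic and p-value, comes from cor.test (the same function used later in this write-up). A sketch:
cor.test(MyCars$Price, MyCars$Age)   # t = -19.07 on 198 df, p-value < 2.2e-16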
summary(mod1)
##
## Call:
## lm(formula = Price ~ Age, data = MyCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5648.3 -1717.6 -370.4 1222.9 21457.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18442.35 313.22 58.88 <2e-16 ***
## Age -1359.80 71.32 -19.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2594 on 198 degrees of freedom
## Multiple R-squared: 0.6474, Adjusted R-squared: 0.6456
## F-statistic: 363.6 on 1 and 198 DF, p-value: < 2.2e-16
H0: βAge = 0, Ha: βAge ≠ 0. The t-test for the slope gives a p-value that is essentially 0 (< 2e-16), which is strong evidence against the null hypothesis; we conclude there is a linear relationship between age and price. Additionally, the R-squared value of 0.647 tells us that a large proportion of the total variability in price is explained by the model.
anova(mod1)
H0: βAge = 0, Ha: βAge ≠ 0. The ANOVA table shows an F-statistic of 363.56 with a p-value of essentially 0, so we reject the null hypothesis and conclude there is a linear relationship between age and price.
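Note that for a simple linear model the slope t-test and the ANOVA F-test are equivalent: the F statistic is the square of the slope's t statistic.
(-19.067)^2   # ~ 363.6, matching the F statistic for the regression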
Suppose that you are interested in purchasing a car of this model that is three years old (in 2017). Determine each of the following: 90% confidence interval for the mean price at this age and 90% prediction interval for the price of an individual car at this age. Write sentences that carefully interpret each of the intervals (in terms of car prices).
newdata=data.frame(Age = 3)
predict.lm(mod1, newdata, interval = "confidence", level = 0.9)
## fit lwr upr
## 1 14362.95 14052.69 14673.2
predict.lm(mod1, newdata, interval = "prediction", level = 0.9)
## fit lwr upr
## 1 14362.95 10064.48 18661.42
The predicted price for a three-year-old car of this model is $14,362.95. We are 90% confident that the mean price of three-year-old cars of this model is between $14,052.69 and $14,673.20. We are 90% confident that the price of an individual three-year-old car of this model is between $10,064.48 and $18,661.42.
According to your model, is there an age at which the car should be free? If so, find this age and comment on what the “free car” phenomenon says about the appropriateness of your model.
18442/1360
## [1] 13.56029
I set the price equal to 0 and solved for age (0 = -1360x + 18442). According to my model, the car should be free at approximately 13.6 years old. The “free car” phenomenon shows that the model is only appropriate over a limited range of ages: beyond a certain age it no longer tracks the actual rate of depreciation, since a real car's price never reaches zero.
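The same break-even age can be computed from the unrounded fitted coefficients (a sketch):
-coef(mod1)["(Intercept)"] / coef(mod1)["Age"]   # age at which the predicted price hits 0, ~ 13.56 years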
Experiment with some transformations to attempt to find one that seems to do a better job of satisfying the linearity condition. Include the summary output for fitting that model and a scatterplot of the transformed variable(s) with the least squares line. Explain why you think that this transformation does or does not improve satisfying the linear model conditions.
mod3 = lm(log(Price) ~ Age, data = MyCars)
plot(log(Price) ~ Age, data = MyCars)
abline(mod3)
summary(mod3)
##
## Call:
## lm(formula = log(Price) ~ Age, data = MyCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59982 -0.09889 -0.01009 0.10830 0.72006
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.874073 0.019935 495.31 <2e-16 ***
## Age -0.115115 0.004539 -25.36 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1651 on 198 degrees of freedom
## Multiple R-squared: 0.7646, Adjusted R-squared: 0.7634
## F-statistic: 643.2 on 1 and 198 DF, p-value: < 2.2e-16
plot(mod3$residuals ~ mod3$fitted.values)
abline(a = 0, b = 0)
I tried a number of different transformations (logarithmic, exponential, square root, etc.), and taking the logarithm of the response variable gave the biggest improvement. I think it satisfies the linear model conditions better in some respects, but not all. The scatterplot still shows a negative linear relationship. The R-squared value has increased to 0.765, so the transformed model fits more closely than the original. Lastly, the residual plot is more shapeless than before, with the residuals more nearly centered at 0 and more symmetrically spread than in the original residual plot. The transformation is not a dramatic improvement, but it is an improvement.
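One caveat with the log model: its predictions are on the log(price) scale and need to be exponentiated to return to dollars (a sketch; Age = 3 is only an example, and this simple back-transform ignores retransformation bias):
exp(predict(mod3, newdata = data.frame(Age = 3)))   # predicted price in dollars from the log-price model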
Run the model with two predictors (age and miles) for price as the response variable and provide the output (both the summary and the anova for the model).
mod2 = lm(Price ~ Age + Mileage, data = MyCars)
plot(Price ~ Age + Mileage, data = MyCars)
summary(mod2)
##
## Call:
## lm(formula = Price ~ Age + Mileage, data = MyCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4208.5 -1471.6 -217.5 979.8 21253.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.882e+04 3.017e+02 62.394 < 2e-16 ***
## Age -9.674e+02 9.891e+01 -9.781 < 2e-16 ***
## Mileage -3.664e-02 6.815e-03 -5.377 2.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2429 on 197 degrees of freedom
## Multiple R-squared: 0.6925, Adjusted R-squared: 0.6894
## F-statistic: 221.9 on 2 and 197 DF, p-value: < 2.2e-16
anova(mod2)
Find the largest residual for a car in your sample.
mod2$residuals
## 1 2 3 4 5 6
## 4313.66158 405.31296 2737.80635 3136.49105 -2182.37832 -1288.88434
## 7 8 9 10 11 12
## -1629.29669 683.09120 -2517.76887 176.25711 -3104.64997 -2433.82423
## 13 14 15 16 17 18
## -1004.21596 -72.36069 736.34478 466.75352 -4208.52913 -1140.16074
## 19 20 21 22 23 24
## -191.76411 97.28409 961.85511 -1793.07784 -350.06465 -215.55453
## 25 26 27 28 29 30
## 1110.74158 -1397.84344 -2202.03534 -1165.21115 492.85997 -298.37989
## 31 32 33 34 35 36
## 467.10715 -900.91518 -835.58676 975.36647 1878.04250 1913.60719
## 37 38 39 40 41 42
## -3737.36835 -2592.67918 -2305.23964 -1674.26322 178.97761 -1061.37821
## 43 44 45 46 47 48
## 2140.63863 -1186.22539 4107.55940 -2305.77754 -1194.39585 1189.66636
## 49 50 51 52 53 54
## -53.09886 -796.67099 -1604.33611 -306.75582 98.37345 380.01500
## 55 56 57 58 59 60
## -1258.16366 -2199.97391 3124.74988 -421.58005 -684.24603 -1451.65923
## 61 62 63 64 65 66
## -1544.94077 107.33336 136.62732 -2687.48339 -309.50245 1139.60939
## 67 68 69 70 71 72
## -225.73524 -834.15333 3796.16965 -918.22344 -209.37332 -1354.56230
## 73 74 75 76 77 78
## 21253.51626 350.11192 4417.04958 145.08608 -1578.87709 -2207.15259
## 79 80 81 82 83 84
## -812.08279 -667.03029 -1916.44928 -2424.73113 -2369.95405 566.23202
## 85 86 87 88 89 90
## -1317.92191 1361.45845 -1970.78872 -1942.04652 -1352.84545 -703.44956
## 91 92 93 94 95 96
## -1901.58036 -630.82822 -1253.35828 -1820.69900 204.70364 -794.67409
## 97 98 99 100 101 102
## -36.47204 1941.77415 -1245.52280 902.65250 174.03385 2160.54008
## 103 104 105 106 107 108
## 1546.71053 -2356.05379 1912.70275 3479.94676 331.28315 375.91474
## 109 110 111 112 113 114
## -788.04673 -2177.13309 -1622.48366 -1387.10663 5031.89525 1197.17023
## 115 116 117 118 119 120
## -2769.61278 -379.39554 -2595.56855 2372.55186 -2233.62843 1565.01172
## 121 122 123 124 125 126
## -219.52220 390.40032 246.70985 -762.63375 -2666.36675 -1028.37856
## 127 128 129 130 131 132
## 524.11453 -1129.39579 3091.85445 3067.54461 -143.76111 -2275.64188
## 133 134 135 136 137 138
## -65.29296 388.98314 -1469.99635 -1476.56903 810.33595 3398.05541
## 139 140 141 142 143 144
## 1990.13320 -1545.86661 -679.17996 3287.98537 464.52398 2851.84060
## 145 146 147 148 149 150
## -1994.23542 425.31511 3147.93680 -838.43524 1880.69121 -2064.92560
## 151 152 153 154 155 156
## -641.18116 3576.24072 1650.23352 -3100.22164 3786.66589 2894.52832
## 157 158 159 160 161 162
## -2298.27415 1655.66088 215.44824 -3128.65697 -632.50362 -205.16343
## 163 164 165 166 167 168
## 3639.81005 884.40258 1514.66202 -3614.99887 4225.04057 965.74383
## 169 170 171 172 173 174
## -2452.99074 -157.72427 886.81154 480.85897 -1171.22284 1034.98596
## 175 176 177 178 179 180
## -1310.99140 -1192.89133 992.90461 -2983.77428 2411.44617 -1302.51281
## 181 182 183 184 185 186
## -1739.29789 1040.63661 -1655.88650 -713.48271 3092.01055 -4112.10515
## 187 188 189 190 191 192
## 2379.91175 3563.41826 2620.63227 -1700.22788 -1769.22243 -2352.85707
## 193 194 195 196 197 198
## 1345.30740 166.68052 114.63086 -548.05094 333.91631 734.92763
## 199 200
## 1034.89163 798.74047
The largest residual for a car in my sample again belongs to number 73 out of 200 (ID 383124, a 2017 Honda Civic), with a value of 21253.516.
Assess the importance of each of the predictors in the model - be sure to indicate the specific value(s) from the output you are using to make the assessments. Include hypotheses and conclusions in context.
cor.test(MyCars$Price, MyCars$Age)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -19.067, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8486231 -0.7495505
## sample estimates:
## cor
## -0.8046167
cor.test(MyCars$Price, MyCars$Mileage)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = -15.345, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7945271 -0.6664383
## sample estimates:
## cor
## -0.7370317
cor.test(MyCars$Age, MyCars$Mileage)
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 15.38, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6673629 0.7951403
## sample estimates:
## cor
## 0.7377914
Both age and mileage are important predictors of price. I used a t-test for correlation to check whether there is a relationship between age and price and between mileage and price. The null hypothesis is that ρ = 0 (no correlation between the two variables being tested); the alternative is that ρ ≠ 0. The p-value for the correlation between age and price is less than 2.2e-16, far below any reasonable significance level, so we reject the null hypothesis and conclude that age and price are correlated. The p-value for the correlation between mileage and price is also less than 2.2e-16, so we likewise reject the null hypothesis and conclude that mileage and price are correlated.
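A complementary check is to read the individual coefficient t-tests from summary(mod2), which assess each predictor while adjusting for the other, something the pairwise correlations above cannot do (a sketch):
coef(summary(mod2))   # Age: t ~ -9.78, Mileage: t ~ -5.38; both p-values are far below 0.05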
Assess the overall effectiveness of this model (with a formal test). Again, be sure to include hypotheses and the specific value(s) you are using from the output to reach a conclusion.
summary(mod2)
##
## Call:
## lm(formula = Price ~ Age + Mileage, data = MyCars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4208.5 -1471.6 -217.5 979.8 21253.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.882e+04 3.017e+02 62.394 < 2e-16 ***
## Age -9.674e+02 9.891e+01 -9.781 < 2e-16 ***
## Mileage -3.664e-02 6.815e-03 -5.377 2.13e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2429 on 197 degrees of freedom
## Multiple R-squared: 0.6925, Adjusted R-squared: 0.6894
## F-statistic: 221.9 on 2 and 197 DF, p-value: < 2.2e-16
I used the overall F-test from the regression summary, along with the R-squared value of 0.693, to assess the effectiveness of my model. The hypotheses are H0: βAge = βMileage = 0 versus Ha: at least one of βAge, βMileage ≠ 0. The F-statistic is 221.9 on 2 and 197 degrees of freedom with a p-value of approximately 0, so we reject the null hypothesis and conclude that the model explains a significant amount of the variation in price. The R-squared shows that most of the variability in the response variable (Price) can be explained by the linear combination of age and mileage, and the multiple correlation R = 0.83 (the square root of R-squared) indicates a strong linear relationship. Overall, the model appears quite effective.
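A related formal check is a nested-model F-test asking whether adding Mileage significantly improves on the age-only model (a sketch):
anova(mod1, mod2)   # F-test for the extra Mileage term; a small p-value favors the two-predictor model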
vif(mod2)
## Age Mileage
## 2.1946 2.1946
The VIF values for age and mileage are both about 2.19, between 1 and 5, which indicates the predictors are moderately correlated. This does not necessarily mean multicollinearity is a problem, but it is something to be aware of because it can inflate the coefficient standard errors and affect the individual t-tests.
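With only two predictors the VIF can be reproduced from their pairwise correlation, VIF = 1 / (1 - r^2), which matches the value above (a sketch):
r = cor(MyCars$Age, MyCars$Mileage)   # ~ 0.738 (from the correlation test above)
1 / (1 - r^2)                         # ~ 2.19, agreeing with vif(mod2)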
Suppose that you are interested in purchasing a car of this model that is three years old (in 2017) and has 31K miles. Determine each of the following: 90% confidence interval for the mean price at this age and mileage and 90% prediction interval for the price of an individual car at this age and mileage. Write sentences that carefully interpret each of the intervals (in terms of car prices).
newdata=data.frame(Age = 3, Mileage = 31000)
predict.lm(mod2, newdata, interval = "confidence", level = 0.9)
## fit lwr upr
## 1 14785.55 14467.37 15103.74
predict.lm(mod2, newdata, interval = "prediction", level = 0.9)
## fit lwr upr
## 1 14785.55 10759.19 18811.92
The predicted price for a three-year-old car of this model with 31K miles is $14,785.55. We are 90% confident that the mean price of cars of this model that are three years old with 31K miles is between $14,467.37 and $15,103.74. We are 90% confident that the price of an individual such car is between $10,759.19 and $18,811.92.