Car Prices Revisited

Nalpak Ynnad, Sat Dec 8 12:14:16 2012

The original data collection and report were done with classmates B. Franklin, J. Adams, and T. Jefferson.

COMMENTARY: This document serves two purposes: it is an example template for your used-car revision, and it explains what's going on in that revision. The content marked in type like this paragraph is commentary, not part of the report.

You need to read your document into R. If it's a CSV file, the best thing is to get the file name as a character string and then paste it into a command. To get the string, at the R console, give this command (but don't put the command in your document, just give it at the console):


Once you have the file name as a string, paste that into fetchData():

mycars = fetchData("/Users/kaplan/Desktop/used-hondas.csv")
## Complete file name given.  No searching necessary.

If your data is in a Google Spreadsheet, make sure to publish the file to the web and grab the CSV link. Then paste that character string into fetchGoogle(), making sure to put the link between quotation marks.

mycars = fetchGoogle("")
## Loading required package: RCurl
## Loading required package: bitops

Either way, the data are now read into R.

Description of Data

We studied used Honda Accords in three locations: St. Paul, MN; Raleigh-Durham, NC; and Santa Cruz, CA.

Are prices different by location?

mod1 = lm(Price ~ Location, data = mycars)
## Call:
## lm(formula = Price ~ Location, data = mycars)
## Coefficients:
##        (Intercept)  LocationSanta Cruz     LocationSt.Paul  
##              14028                1312                -314

In our original report, we concluded that cars in Santa Cruz are $1300 more expensive than in Durham, but cars in St. Paul are $300 cheaper.

Looking at the regression report:

                        Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)           14028.0588    791.4330    17.72    0.0000
LocationSanta Cruz     1312.0791   1166.5018     1.12    0.2637
LocationSt.Paul        -314.4037   1166.5018    -0.27    0.7881

The p-values (0.26 and 0.79) suggest that there is not enough data to support such claims about differences between the locations. Indeed, the margin of error on each difference is about \( \pm 2300 \) dollars, larger than either estimated difference.
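The margin of error quoted above can be reproduced from the regression report. This is a minimal sketch in base R, assuming the usual 95% level and 89 residual degrees of freedom (92 cars minus 3 coefficients); the variable names are only illustrative.

```r
# Standard error of each location coefficient, copied from the regression report
se_location <- 1166.5018
# Residual degrees of freedom (assumed: 92 cars minus 3 coefficients)
df_resid <- 89

# 95% margin of error = t multiplier times the standard error
t_star <- qt(0.975, df = df_resid)
margin_location <- t_star * se_location

# Round to the nearest hundred dollars for reporting
round(margin_location, -2)
```

Both estimated differences (about 1312 and -314 dollars) are smaller in size than this margin, so neither interval excludes zero.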

An ANOVA of the model likewise gives no evidence that location is associated with price:

              Df          Sum Sq         Mean Sq  F value   Pr(>F)
Location       2     43692780.94     21846390.47     1.03   0.3627
Residuals     89   1895383953.88     21296448.92
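To see where the F value and p-value in this table come from, the calculation can be redone by hand from the two mean squares. This is a sketch in base R using the numbers printed above; the variable names are illustrative.

```r
# Mean squares copied from the ANOVA table
ms_location  <- 21846390.47
ms_residuals <- 21296448.92

# F is the ratio of the two mean squares
f_value <- ms_location / ms_residuals

# p-value: upper tail of the F distribution with 2 and 89 degrees of freedom
p_value <- pf(f_value, df1 = 2, df2 = 89, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 4)
```

An F near 1 means the variation explained per degree of freedom by location is about the same size as the residual variation.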

Price by mileage

A simple model of price by mileage

mod2 = lm(Price ~ Mileage, data = mycars)

This indicates that price goes down by 10 cents per mile.

Revisiting this model:

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    20766.5803    362.0150    57.36    0.0000
Mileage           -0.1013      0.0048   -21.19    0.0000

The 95% confidence interval is \( 10.1 \pm 0.9 \) cents per mile.

Price by age and mileage

We hypothesized that mileage and age are the primary determinants of used-car price. Model 3 tries to untangle their respective effects:

mod3 = lm(Price ~ Age + Mileage, data = mycars)

We concluded that the price of a used Honda goes down by $538 per year (on average) and 7.7 cents per mile driven.

Looking now at the regression report …

                 Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    21330.4922    350.2190    60.91    0.0000
Age             -538.2931    117.9343    -4.56    0.0000
Mileage           -0.0767      0.0069   -11.10    0.0000

We see that both age and mileage are statistically significant. Confidence intervals are:
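As a sketch, approximate 95% intervals can be rebuilt from the rounded estimates and standard errors in the report above; with the fitted model in hand, `confint(mod3)` would give them directly. The residual degrees of freedom (89 for model 3, 90 for model 2) are inferred from the ANOVA tables, and the variable names are illustrative.

```r
# Estimates and standard errors copied from the model 3 report
est <- c(Age = -538.2931, Mileage = -0.0767)
se  <- c(Age = 117.9343, Mileage = 0.0069)

# 95% margin of error with 89 residual degrees of freedom (assumed)
margin3 <- qt(0.975, df = 89) * se
ci3 <- cbind(lower = est - margin3, upper = est + margin3)
ci3

# For comparison: margin on mileage in model 2 (assumed 90 residual df)
margin2_mileage <- qt(0.975, df = 90) * 0.0048
c(model2 = margin2_mileage, model3 = unname(margin3["Mileage"]))
```

Neither interval includes zero, in line with the small p-values in the report.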

The margin of error on mileage has gone up (its standard error rose from 0.0048 to 0.0069), even though age was included as a covariate and is eating up variance. Perhaps this is due to the collinearity between age and mileage:

r.squared(lm(Age ~ Mileage, data = mycars))
## [1] 0.6079
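One common way to gauge how much this collinearity matters is the variance inflation factor, 1/(1 - R²). It is not part of the original report; the sketch below simply applies the formula to the R² printed above.

```r
# R-squared from regressing Age on Mileage (value printed above)
r2 <- 0.6079

# Variance inflation factor: how much collinearity multiplies the
# sampling variance of a coefficient
vif <- 1 / (1 - r2)

# The standard error is inflated by the square root of that factor
se_inflation <- sqrt(vif)

round(c(VIF = vif, SE_inflation = se_inflation), 2)
```

Collinearity alone would widen the standard error on mileage by roughly 60%. The smaller residual variance in model 3 offsets part of that, which fits the observed rise from 0.0048 to 0.0069.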

The root mean-square residual tells the typical size of a residual; it generalizes the standard deviation to the residuals of a model.

              Df          Sum Sq         Mean Sq  F value   Pr(>F)
Age            1   1313584180.27   1313584180.27   445.48   0.0000
Mileage        1    363058665.01    363058665.01   123.13   0.0000
Residuals     89    262433889.55      2948695.39
## [1] 1717

About $1700 is a typical deviation for an actual car from the model price.
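That figure can be recovered from the Residuals row of the ANOVA table above. This is a minimal sketch in base R, with illustrative variable names.

```r
# Residual sum of squares and degrees of freedom from the ANOVA table
ss_resid <- 262433889.55
df_resid <- 89

# Root mean-square residual: square root of the residual mean square
rms_resid <- sqrt(ss_resid / df_resid)

round(rms_resid)
```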

Prices by age and location

In model 4, we looked for an interaction between age and location in determining price:

mod4 = lm(Price ~ Age * Location, data = mycars)

In our original report, we concluded that the effect of age differs by location. ANOVA is an appropriate technique for checking this, since it tests together all the model vectors that are involved in the interaction.

                Df          Sum Sq         Mean Sq  F value   Pr(>F)
Age              1   1313584180.27   1313584180.27   186.09   0.0000
Location         2     17726153.76      8863076.88     1.26   0.2901
Age:Location     2       699935.45       349967.73     0.05   0.9517
Residuals       86    607066465.34      7058912.39

There's no evidence that price depends on location, nor of an interaction between location and age in determining the price, so our original conclusion is not supported.

COMPILING YOUR REPORT. When you press the “Knit HTML” button, your .Rmd file will be translated into an HTML file. You can download this file from RStudio to your computer, and then upload it to Moodle to hand it in.