Car Prices Revisited

Nalpak Ynnad, Sat Dec 8 12:14:16 2012

The original data collection and report were done with classmates B. Franklin, J. Adams, and T. Jefferson.

COMMENTARY: This document serves two purposes: it is an example template for your used-car revision, and it explains what's going on in that revision. The content marked in type like this paragraph is commentary, not part of the report.

You need to read your data into R. If the data are in a CSV file, the best approach is to get the file name as a character string and then paste it into a command. To get the string, give this command at the R console (don't put the command in your document, just give it at the console):

file.choose()
"/Users/kaplan/Desktop/used-hondas.csv"

Once you have the file name as a string, paste that into fetchData():

mycars = fetchData("/Users/kaplan/Desktop/used-hondas.csv")

## Complete file name given.  No searching necessary.

If your data is in a Google Spreadsheet, make sure to publish the file to the web and grab the CSV link. Then paste that character string into fetchGoogle(), making sure to put the link between quotation marks.

mycars = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dHpOa2szV2c4WmJ0NURERFh0dTdLY2c&single=true&gid=0&output=csv")

## Loading required package: RCurl

## Loading required package: bitops

Either way, the data are now read into R.
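If the helper functions above are not available in your installation, base R's read.csv() also reads a CSV file from a path given as a character string. The sketch below is only an illustration: it writes a tiny made-up file to a temporary location (the three rows are invented, not the class data), reads it back, and checks that the columns arrived with the expected names and types.

```r
# Made-up rows for illustration only; these are NOT the real used-car data.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Price,Age,Mileage,Location",
             "15000,3,42000,St.Paul",
             "12500,5,68000,Santa Cruz",
             "17200,2,21000,Durham"),
           demo_file)

# read.csv() takes the file name as a character string.
demo_cars <- read.csv(demo_file)

# Quick checks: variable names, number of cases, and column types.
names(demo_cars)
nrow(demo_cars)
str(demo_cars)
```

Running the same three checks on your own data frame is a cheap way to catch a misnamed column or a numeric variable that was read as text.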

Description of Data

We studied used Honda Accords in three locations: St. Paul, MN; Raleigh-Durham, NC; and Santa Cruz, CA.

Are prices different by location?

mod1 = lm(Price ~ Location, data = mycars)
mod1

## 
## Call:
## lm(formula = Price ~ Location, data = mycars)
## 
## Coefficients:
##        (Intercept)  LocationSanta Cruz     LocationSt.Paul  
##              14028                1312                -314

In our original report, we concluded that cars in Santa Cruz are about $1300 more expensive than in Raleigh-Durham, while cars in St. Paul are about $300 cheaper.

Looking at the regression report:

xtable(summary(mod1))

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	14028.0588	791.4330	17.72	0.0000
LocationSanta Cruz	1312.0791	1166.5018	1.12	0.2637
LocationSt.Paul	-314.4037	1166.5018	-0.27	0.7881

The p-values suggest that there is not enough data to support such a claim about the differences between the locations. Indeed, the margin of error is $ \pm 2300 $ dollars.
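The margin of error quoted here can be reproduced from the standard errors in the regression table. The following is a minimal sketch, assuming the usual 95% level; the residual degrees of freedom (89) are taken from the ANOVA report for this model, and the variable names are made up for illustration.

```r
# Standard errors copied from the regression table above.
se <- c(SantaCruz = 1166.5018, StPaul = 1166.5018)

# Rule of thumb: a 95% margin of error is about twice the standard error.
moe_rough <- 2 * se

# More exact version using the t distribution.
# ASSUMPTION: 89 residual degrees of freedom, as in the ANOVA report for mod1.
t_star <- qt(0.975, df = 89)
moe_exact <- t_star * se

round(moe_rough)
round(moe_exact)
```

Both versions come out near 2300 dollars, which is larger than either estimated difference. With the fitted model in hand, confint(mod1) reports the corresponding intervals directly.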

ANOVA on the model likewise gives no evidence that location is associated with price:

xtable(anova(mod1))

	Df	Sum Sq	Mean Sq	F value	Pr(> F)
Location	2	43692780.94	21846390.47	1.03	0.3627
Residuals	89	1895383953.88	21296448.92
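To see where the F value and p-value in this table come from, the arithmetic can be redone by hand from the sums of squares and degrees of freedom. This is a sketch using only numbers copied from the table; the variable names are illustrative.

```r
# Values copied from the ANOVA table above.
ms_location  <- 43692780.94 / 2      # Sum Sq / Df for Location
ms_residuals <- 1895383953.88 / 89   # Sum Sq / Df for Residuals

# F is the ratio of the two mean squares.
f_value <- ms_location / ms_residuals

# p-value: upper tail of the F distribution with (2, 89) degrees of freedom.
p_value <- pf(f_value, df1 = 2, df2 = 89, lower.tail = FALSE)

round(f_value, 2)
round(p_value, 4)
```

An F value near 1 means that location accounts for about as much variation per degree of freedom as the residuals do, which is what would be expected if location had no relationship to price.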

Price by mileage

A simple model of price by mileage:

mod2 = lm(Price ~ Mileage, data = mycars)

This indicates that price goes down by 10 cents per mile.

Revisiting this model:

xtable(summary(mod2))

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	20766.5803	362.0150	57.36	0.0000
Mileage	-0.1013	0.0048	-21.19	0.0000

The 95% confidence interval is $ 10.0 \pm 0.9 $ cents per mile.
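To make the meaning of the coefficients concrete, the model formula can be evaluated by hand. The sketch below uses the estimates from the regression table; the helper function model_price() and the 50,000-mile input are made up for illustration and are not part of the original report.

```r
# Coefficients copied from the regression table above.
intercept <- 20766.5803   # dollars: model price at zero miles
slope     <- -0.1013      # dollars per mile

# Hypothetical helper (not part of the original report).
model_price <- function(mileage) intercept + slope * mileage

# ASSUMPTION: 50,000 miles is an arbitrary illustrative input.
model_price(50000)

# The slope expressed in cents per mile.
slope * 100
```

With the fitted model available, predict(mod2, newdata = data.frame(Mileage = 50000)) gives the same kind of model value without copying coefficients by hand.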

Price by age and mileage

We hypothesized that mileage and age are the primary determinants of used-car price. Model 3 tries to untangle their respective effects:

mod3 = lm(Price ~ Age + Mileage, data = mycars)

We concluded that the price of a used Honda goes down by $538 per year (on average) and 7.7 cents per mile driven.

Looking now at the regression report …

xtable(summary(mod3))

	Estimate	Std. Error	t value
(Intercept)	21330.4922	350.2190	60.91
Age	-538.2931	117.9343	-4.56
Mileage	-0.0767	0.0069	-11.10

We see that both age and mileage are statistically significant. Confidence intervals are:

Age: $ 538 \pm 230 $ dollars per year decrease in price
Mileage: $ 7.7 \pm 1.4 $ cents per mile decrease in price

The margin of error on mileage has gone up (from 0.9 to 1.4 cents per mile), even though age was included as a covariate and is eating up variance. Perhaps this is due to the collinearity between age and mileage:

r.squared(lm(Age ~ Mileage, data = mycars))

## [1] 0.6079
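One common way to quantify the effect of this collinearity, which was not used in the original report, is the variance inflation factor, 1/(1 - R²). All else being equal, it tells how much the variance of a coefficient grows because its explanatory variable is partly redundant with the other one. A minimal sketch using the R² reported above:

```r
# R-squared of Age ~ Mileage, copied from the output above.
r2 <- 0.6079

# Variance inflation factor for a coefficient whose explanatory
# variable is collinear with the other one.
vif <- 1 / (1 - r2)

# Standard errors (and so margins of error) scale with the square root.
se_inflation <- sqrt(vif)

round(vif, 2)
round(se_inflation, 2)
```

Collinearity alone would therefore widen the margin of error by a factor of roughly 1.6. The observed standard error on mileage grew by less than that (from 0.0048 to 0.0069), which is consistent with age also reducing the residual variance.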

The root-mean-square residual tells the typical size of a residual; it's the generalization of the standard deviation. It is the square root of the residual mean square in the ANOVA report:

xtable(anova(mod3))

	Df	Sum Sq	Mean Sq	F value	Pr(> F)
Age	1	1313584180.27	1313584180.27	445.48	0.0000
Mileage	1	363058665.01	363058665.01	123.13	0.0000
Residuals	89	262433889.55	2948695.39

sqrt(2948695)

## [1] 1717

About $1700 is a typical deviation for an actual car from the model price.

Prices by age and location

In model 4, we looked for an interaction between age and location in determining price:

mod4 = lm(Price ~ Age * Location, data = mycars)
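The * in the model formula is shorthand for both main effects plus their interaction. The short sketch below, which needs no data, shows the terms R generates from the formula; the object names are illustrative.

```r
# Expand the formula without needing any data.
f <- Price ~ Age * Location
term_labels <- attr(terms(f), "term.labels")
term_labels

# The same model written out term by term.
f_long <- Price ~ Age + Location + Age:Location
identical(term_labels, attr(terms(f_long), "term.labels"))
```

The Age:Location term is the one that lets the effect of age differ from one location to another, so it is the term to examine when asking whether there is an interaction.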

In our original report, we concluded that the effect of age differs by location. ANOVA is an appropriate technique here, since it lets us test all the model vectors involved in the interaction together, rather than one coefficient at a time.

xtable(anova(mod4))

	Df	Sum Sq	Mean Sq	F value	Pr(> F)
Age	1	1313584180.27	1313584180.27	186.09	0.0000
Location	2	17726153.76	8863076.88	1.26	0.2901
Age:Location	2	699935.45	349967.73	0.05	0.9517
Residuals	86	607066465.34	7058912.39

There's no evidence that price depends on location, or that location and age interact in determining the price.

COMPILING YOUR REPORT. When you press the “Knit HTML” button, your .Rmd file will be translated into an HTML file. You can download this file from RStudio to your computer, and then upload it to Moodle to hand it in.