The original data collection and report were done with classmates B. Franklin, J. Adams, and T. Jefferson.

COMMENTARY: This document serves two purposes: it is an example template for your used-car revision, and it explains what's going on in that revision. The content marked in type like this paragraph is commentary, not part of the report.

You need to read your data into R. If it's a CSV file, the easiest approach is to get the file name as a character string and then paste it into a command. To get the string, give this command at the R console (but don't put the command in your document, just give it at the console):

```
file.choose()
"/Users/kaplan/Desktop/used-hondas.csv"
```

Once you have the file name as a string, paste that into `fetchData()`:

```
mycars = fetchData("/Users/kaplan/Desktop/used-hondas.csv")
```

```
## Complete file name given. No searching necessary.
```
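If `fetchData()` is not available in your R session, base R's `read.csv()` reads a CSV file in much the same way. The sketch below is only an illustration: the rows are invented, and the temporary file stands in for the path that `file.choose()` would give you.

```r
# Hypothetical rows, invented only to make this example self-contained.
# The column names mirror the variables used in the models in this report.
csv_lines <- c(
  "Price,Mileage,Age,Location",
  "15000,45000,4,St.Paul",
  "12500,80000,7,Santa Cruz",
  "17900,20000,2,Durham"
)

# A temporary file stands in for your real file path.
demo_path <- tempfile(fileext = ".csv")
writeLines(csv_lines, demo_path)

# read.csv() takes the file name as a character string, just like fetchData().
demo_cars <- read.csv(demo_path)
str(demo_cars)
```

With your own data, replace `demo_path` with the quoted string that `file.choose()` returned.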

If your data are in a Google Spreadsheet, make sure to publish the file to the web and grab the CSV link. Then paste that character string into `fetchGoogle()`, making sure to put the link between quotation marks.

```
mycars = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dHpOa2szV2c4WmJ0NURERFh0dTdLY2c&single=true&gid=0&output=csv")
```

```
## Loading required package: RCurl
```

```
## Loading required package: bitops
```

Either way, the data are now read into R.

We studied used Honda Accords in three locations: St. Paul, MN; Raleigh-Durham, NC; and Santa Cruz, CA.

```
mod1 = lm(Price ~ Location, data = mycars)
mod1
```

```
##
## Call:
## lm(formula = Price ~ Location, data = mycars)
##
## Coefficients:
##        (Intercept)  LocationSanta Cruz     LocationSt.Paul
##              14028                1312                -314
```

In our original report, we concluded that cars in Santa Cruz are $1300 more expensive than in Durham, but cars in St. Paul are $300 cheaper.
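To see where those numbers come from, it can help to turn the coefficients back into a model price for each location. This is a small sketch using the rounded coefficients printed above. It assumes that the reference level (the one absorbed into the intercept) is Raleigh-Durham, which is what the comparison to Durham implies; the label `"Raleigh-Durham"` is chosen here for illustration.

```r
# Coefficients copied from the printed output of mod1.
coefs <- c(
  "(Intercept)"        = 14028,
  "LocationSanta Cruz" = 1312,
  "LocationSt.Paul"    = -314
)

# Assumption: the intercept is the model price for the reference location,
# taken here to be Raleigh-Durham. The other coefficients are differences
# from that reference.
group_means <- c(
  "Raleigh-Durham" = unname(coefs["(Intercept)"]),
  "Santa Cruz"     = unname(coefs["(Intercept)"] + coefs["LocationSanta Cruz"]),
  "St. Paul"       = unname(coefs["(Intercept)"] + coefs["LocationSt.Paul"])
)
print(group_means)
```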

Looking at the regression report:

```
xtable(summary(mod1))
```

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 14028.0588 | 791.4330 | 17.72 | 0.0000 |
| LocationSanta Cruz | 1312.0791 | 1166.5018 | 1.12 | 0.2637 |
| LocationSt.Paul | -314.4037 | 1166.5018 | -0.27 | 0.7881 |

The p-values suggest that there is not enough data to support such a claim about the differences between the locations. Indeed, the margin of error on each difference is about \( \pm 2300 \) dollars, larger than either estimated difference.
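The margin of error can be worked out from the standard error in the regression report. Here is a minimal sketch for the Santa Cruz coefficient. It assumes a 95% confidence level and uses the 89 residual degrees of freedom reported by ANOVA for this model; the variable names are just for illustration.

```r
# Values copied from the regression report for mod1.
estimate  <- 1312.0791   # LocationSanta Cruz coefficient, in dollars
std_error <- 1166.5018   # its standard error
resid_df  <- 89          # residual degrees of freedom (from the ANOVA report)

# Multiplier for a 95% confidence interval; close to 2 for this many cases.
t_star <- qt(0.975, df = resid_df)

margin <- t_star * std_error
ci <- estimate + c(-1, 1) * margin

round(margin)   # roughly 2300 dollars
round(ci)       # the interval includes zero
```

Because the interval includes zero, the data are consistent with there being no price difference at all between Santa Cruz and the reference location.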

An ANOVA report on the model likewise gives no evidence that location is associated with price:

```
xtable(anova(mod1))
```

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Location | 2 | 43692780.94 | 21846390.47 | 1.03 | 0.3627 |
| Residuals | 89 | 1895383953.88 | 21296448.92 | | |
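The F value and p-value in the ANOVA report follow directly from the two mean squares. As a sketch, using the numbers from the report above (the variable names are illustrative):

```r
# Mean squares and degrees of freedom copied from the ANOVA report for mod1.
ms_location <- 21846390.47
ms_resid    <- 21296448.92
df_location <- 2
df_resid    <- 89

# F is the ratio of the mean square for the term to the residual mean square.
f_value <- ms_location / ms_resid

# The p-value is the upper tail of the F distribution beyond the observed F.
p_value <- pf(f_value, df1 = df_location, df2 = df_resid, lower.tail = FALSE)

round(f_value, 2)
round(p_value, 4)
```

An F value near 1 means that location accounts for about as much variance per degree of freedom as random junk would.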

We also fit a simple model of price by mileage:

```
mod2 = lm(Price ~ Mileage, data = mycars)
```

This model indicates that price goes down by about 10 cents per mile driven.

Revisiting this model:

```
xtable(summary(mod2))
```

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 20766.5803 | 362.0150 | 57.36 | 0.0000 |
| Mileage | -0.1013 | 0.0048 | -21.19 | 0.0000 |

The 95% confidence interval is \( 10.1 \pm 0.9 \) cents per mile.

We hypothesized that mileage and age are the primary determinants of used-car price. Model 3 tries to untangle their respective effects:

```
mod3 = lm(Price ~ Age + Mileage, data = mycars)
```

We concluded that the price of a used Honda goes down by $538 per year (on average) and 7.7 cents per mile driven.

Looking now at the regression report …

```
xtable(summary(mod3))
```

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 21330.4922 | 350.2190 | 60.91 | 0.0000 |
| Age | -538.2931 | 117.9343 | -4.56 | 0.0000 |
| Mileage | -0.0767 | 0.0069 | -11.10 | 0.0000 |

We see that both age and mileage are statistically significant. Confidence intervals are:

- Age: \( 538 \pm 230 \) dollars per year decrease in price
- Mileage: \( 7.7 \pm 1.4 \) cents per mile decrease in price

The margin of error on mileage has gone up (from 0.9 to 1.4 cents per mile), even though age was included as a covariate and is eating up variance. Perhaps this is due to the collinearity between age and mileage:

```
r.squared(lm(Age ~ Mileage, data = mycars))
```

```
## [1] 0.6079
```
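One common way to judge how much this collinearity matters is the variance inflation factor (VIF). The VIF is not part of the original report; it is brought in here only as an illustration, computed from the R² reported above.

```r
# R-squared of Age against Mileage, copied from the output above.
r2_age_mileage <- 0.6079

# Variance inflation factor: how much the variance of a coefficient is
# multiplied because of its overlap with the other explanatory variable.
vif <- 1 / (1 - r2_age_mileage)

# Standard errors (and so margins of error) scale with the square root.
se_inflation <- sqrt(vif)

round(vif, 2)
round(se_inflation, 2)
```

The collinearity by itself would widen the margin of error on mileage by a factor of about 1.6. The observed increase is somewhat smaller than that because including age also shrinks the residuals.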

The root mean-square residual tells the typical size of a residual; it's the generalization of the standard deviation. It is the square root of the residual mean square in the ANOVA report:

```
xtable(anova(mod3))
```

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Age | 1 | 1313584180.27 | 1313584180.27 | 445.48 | 0.0000 |
| Mileage | 1 | 363058665.01 | 363058665.01 | 123.13 | 0.0000 |
| Residuals | 89 | 262433889.55 | 2948695.39 | | |

```
sqrt(2948695)
```

```
## [1] 1717
```

About $1700 is a typical deviation for an actual car from the model price.
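To make "model price" concrete, here is a sketch that plugs one car into model 3 using the coefficients from the regression report. The car itself (5 years old, 60,000 miles) is hypothetical, as is the helper name `model_price()`.

```r
# Coefficients copied from the regression report for mod3.
coefs <- c(intercept = 21330.4922, age = -538.2931, mileage = -0.0767)

# Hypothetical helper: the model price for a car of a given age and mileage.
model_price <- function(age_years, miles) {
  unname(coefs["intercept"] + coefs["age"] * age_years + coefs["mileage"] * miles)
}

# Root mean-square residual, from the residual mean square in the ANOVA report.
rms_resid <- sqrt(2948695.39)

# A made-up car: 5 years old with 60,000 miles.
price <- model_price(age_years = 5, miles = 60000)

round(price)       # model price in dollars
round(rms_resid)   # typical deviation of an actual car from that price
```

An actual car with those characteristics would typically sell within a couple of thousand dollars of the model price.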

In model 4, we looked for an interaction between age and location in determining price:

```
mod4 = lm(Price ~ Age * Location, data = mycars)
```

In our original report, we concluded that the effect of age differs by location. ANOVA is an appropriate technique here, since it lets us look at all the vectors that are involved in the interaction.

```
xtable(anova(mod4))
```

| | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Age | 1 | 1313584180.27 | 1313584180.27 | 186.09 | 0.0000 |
| Location | 2 | 17726153.76 | 8863076.88 | 1.26 | 0.2901 |
| Age:Location | 2 | 699935.45 | 349967.73 | 0.05 | 0.9517 |
| Residuals | 86 | 607066465.34 | 7058912.39 | | |

There's no evidence either for a dependence of price on location or for an interaction between location and age in determining the price.

COMPILING YOUR REPORT. When you press the “Knit HTML” button, your .Rmd file will be translated into an HTML file. You can download this file from RStudio to your computer, and then upload it to Moodle to hand it in.