Porsche Prices in R Markdown

First we install two R packages, knitr and markdown; you can install them from the Packages tab in the bottom-right pane.
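If you prefer the console to the Packages tab, the same installation can be done with a one-line command (a small sketch; you only need to run it once):

install.packages(c("knitr", "markdown"))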

Now read the Porsche price data from the file PorschePrice.csv.

getwd()
## [1] "/Users/traves/Dropbox/SM339/day 08"
setwd("/Users/traves/Dropbox/sm339/stat2 data files")
PP <- read.csv("PorschePrice.csv")
names(PP)
## [1] "Price"   "Age"     "Mileage"
attach(PP)
summary(Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    16.0    40.6    51.9    50.5    58.6    83.0
hist(Price)

[Plot: histogram of Price]

Try a box-whisker plot and a density plot

require(lattice)
## Loading required package: lattice
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
bwplot(~Price, main = "Box-whisker plot of Price")

[Plot: box-and-whisker plot of Price]

densityplot(Price, main = "Density plot of price")

[Plot: density plot of Price]

Print the density plot to a PDF file that we might want to include in a Word document, etc. (The setwd() call below fails, most likely because the directory name doesn't match the path we saw from getwd() earlier, so the PDF ends up in the current working directory instead.)

setwd("/Users/traves/Dropbox/sm339/day08")
## Error: cannot change working directory
pdf(file = "density.pdf", height = 5, width = 5)
densityplot(Price, main = "Density plot of price")
dev.off()
## pdf 
##   2

Plot Price, Age and Mileage against one another

pairs(PP)

[Plot: scatterplot matrix of Price, Age, and Mileage]

Look at Price vs Mileage more carefully:

plot(Price ~ Mileage, col = "red", pch = 19, xlab = "Mileage (1000's of miles)", 
    ylab = "Price (1000's of $)", main = "Price of used Porsches vs Mileage")

[Plot: scatterplot of Price vs. Mileage]

CHOOSE:

Let's try to use a linear model to explain the relationship between the variables: \[ \text{Price} = \beta_0 + \beta_1*\text{Mileage} + \text{Error} \] where \( \beta_0 \) and \( \beta_1 \) are constants.

FIT:

Fitting the model means finding the “best” values for \( \beta_0 \) and \( \beta_1 \). What does “best” mean in this context?
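One common answer, and the criterion that lm() uses, is least squares: the “best” coefficients are the ones that minimize the sum of squared residuals. A minimal sketch (the candidate coefficient values below are only illustrative, and it assumes PP is attached as above):

sse <- function(b0, b1) sum((Price - (b0 + b1 * Mileage))^2)  # sum of squared residuals
sse(71.09, -0.589)  # near the least-squares fit, so the sum is relatively small
sse(60, -0.4)       # an arbitrary alternative line gives a larger sum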

After fitting the model we write \[ \widehat{\text{Price}} = \hat{\beta}_0 + \hat{\beta}_1*\text{Mileage} \] to represent the fitted value corresponding to a particular choice of Mileage, where \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) are the estimated coefficients. Why did the Error term disappear? On the graph below, the fitted line is the blue line.

PMline = lm(Price ~ Mileage)
summary(PMline)
## 
## Call:
## lm(formula = Price ~ Mileage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.308  -4.047  -0.395   3.837  12.676 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.0905     2.3699    30.0   <2e-16 ***
## Mileage      -0.5894     0.0566   -10.4    4e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 7.17 on 28 degrees of freedom
## Multiple R-squared: 0.795,   Adjusted R-squared: 0.787 
## F-statistic:  108 on 1 and 28 DF,  p-value: 3.98e-11
plot(Price ~ Mileage, col = "red", pch = 19, xlab = "Mileage (1000's of miles)", 
    ylab = "Price (1000's of $)", main = "Price of used Porsches vs Mileage")
lines(Mileage, PMline$fitted, col = "blue")

[Plot: Price vs. Mileage with the fitted regression line]

The standard error of a linear model is the estimated standard deviation of the error term in the linear model. It is given by the formula \[ \hat{\sigma}_\epsilon = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}, \] where \( y \) is the response (or dependent) variable (e.g. Price in our model above), \( \hat{y} \) is the corresponding fitted value, and \( n \) is the number of data points. Note that it is easy to fit a model with zero residuals if we only have 2 points. This goes some way toward explaining why the denominator in the expression for \( \hat{\sigma}_\epsilon \) is \( n-2 \) (we say that there are \( n-2 \) degrees of freedom here).
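For instance, with only two data points a line passes through both of them exactly, so every residual is zero. A toy sketch (the two points here are made up):

toy <- data.frame(x = c(1, 2), y = c(3, 5))
resid(lm(y ~ x, data = toy))  # both residuals are (numerically) zero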

Here's some complicated code to compute \( \hat{\sigma}_\epsilon \):

sqrt(sum((Price - PMline$fitted)^2)/(length(Price) - 2))
## [1] 7.17

Note that this is also reported on the line “Residual standard error” in the summary of the linear model:

summary(PMline)
## 
## Call:
## lm(formula = Price ~ Mileage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.308  -4.047  -0.395   3.837  12.676 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.0905     2.3699    30.0   <2e-16 ***
## Mileage      -0.5894     0.0566   -10.4    4e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 7.17 on 28 degrees of freedom
## Multiple R-squared: 0.795,   Adjusted R-squared: 0.787 
## F-statistic:  108 on 1 and 28 DF,  p-value: 3.98e-11
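Rather than reading this value off the printout, we can also pull it out of the summary object directly (a small sketch):

summary(PMline)$sigma  # residual standard error, about 7.17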

ASSESS:

There are assumptions (the book calls them conditions) that underlie the linear model.

  1. Linearity (overall relationship between variables follows a linear pattern)
  2. Zero Mean (the mean of the errors is zero for each fixed value of the explanatory variable; i.e. mean(Error)=0)
  3. Constant Variance (spread of errors does not depend on the explanatory variable)
  4. Independence (errors are independent from one another)
  5. Random (data are obtained using a random process – e.g. random sampling or randomization in a statistical experiment)
  6. Normality (errors follow a normal distribution)

Under these conditions we write, e.g. \[ \text{Price} = \beta_0 + \beta_1*\text{Mileage} + \epsilon, \] where \( \epsilon \sim N(0,\sigma) \).

Conditions 5 and 6 are particularly important if we wish to carry out statistical procedures (e.g. confidence intervals or hypothesis tests), though sometimes we can argue that our conclusions are valid despite these conditions being violated. We'll talk a lot about this in future classes.
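As a preview, here is a small sketch of one such procedure: confidence intervals for the coefficients of our fitted model.

confint(PMline)  # 95% confidence intervals for the intercept and the Mileage slope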

The graph of Price versus Mileage seems to exhibit a linear trend so the linearity condition seems valid.

The residual is the error in our model when compared against the actual observation: \[ \text{Residual} = \text{Price} - \widehat{\text{Price}}. \] If the Residual is positive, does the data point lie above or below the blue line?
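As a quick sanity check (a small sketch using the model object fitted above), the residuals stored in PMline agree with the hand computation of Price minus the fitted values:

head(Price - PMline$fitted)  # residuals computed by hand
head(PMline$residuals)       # residuals stored in the model object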

Check residuals:

plot(Mileage, PMline$residuals, col = "red")
abline(0, 0, col = "blue")

[Plot: residuals vs. Mileage]

It seems that the zero-mean condition for the errors is probably satisfied, and the constant-variance condition also looks okay. Independence and randomness seem reasonable given the context in which the data were collected.

Check the normality condition for the residuals. We might try a density plot of the residuals:

pdf(file = "dennorm.pdf")
densityplot(PMline$residuals)
dev.off()
## pdf 
##   2

It doesn't look so far from a normal distribution. A QQ plot can also be used to check normality:

qqnorm(PMline$residuals, col = "red")
qqline(PMline$residuals, col = "blue")

[Plot: normal Q-Q plot of the residuals]

USE:

We report that our data support the conclusion that \[ \text{Price} = 71,091 - 0.589*\text{Mileage} + \epsilon, \] where \( \epsilon \sim N(0, 7170) \), with Price measured in dollars and Mileage in miles. This means that we expect to pay about $71,091 for a Porsche with zero miles, and every mile driven ought to reduce the price by about 59 cents. There is some error in our model, and we estimate the spread by saying that the “typical error” is about $7,170.
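To use the fitted model for prediction, we can plug a mileage into the fitted equation with predict(). The value below (50, i.e. 50,000 miles, since Mileage is recorded in thousands of miles) is just an illustrative choice:

predict(PMline, newdata = data.frame(Mileage = 50))  # about 41.6 thousand dollars
predict(PMline, newdata = data.frame(Mileage = 50), interval = "prediction")  # with a 95% prediction interval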