First we install some R packages - you'll need to install from the packages tab on the bottom right window (knitr and markdown).
Now get the porsche prices data from the data file PorschePrice.csv
getwd()
## [1] "/Users/traves/Dropbox/SM339/day 08"
setwd("/Users/traves/Dropbox/sm339/stat2 data files")
PP <- read.csv("PorschePrice.csv")
names(PP)
## [1] "Price" "Age" "Mileage"
attach(PP)
summary(Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.0 40.6 51.9 50.5 58.6 83.0
hist(Price)
Try a box-whisker plot and a density plot
require(lattice)
## Loading required package: lattice
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
bwplot(~Price, main = "Box-whisker plot of Price")
densityplot(Price, main = "Density plot of price")
Print the density plot to a pdf file that we might want to include in a WORD document, etc.
setwd("/Users/traves/Dropbox/sm339/day08")
## Error: cannot change working directory
pdf(file = "density.pdf", height = 5, width = 5)
densityplot(Price, main = "Density plot of price")
dev.off()
## pdf
## 2
Plot Price, Age and Mileage against one another
pairs(PP)
Look at Price vs Mileage more carefully:
plot(Price ~ Mileage, col = "red", pch = 19, xlab = "Mileage (1000's of miles)",
ylab = "Price (1000's of $)", main = "Price of used Porsches vs Mileage")
CHOOSE:
Let's try to use a linear model to explain the relationship between the variables: \[ \text{Price} = \beta_0 + \beta_1*\text{Mileage} + \text{Error} \] where \( \beta_0 \) and \( \beta_1 \) are constants.
FIT:
Fitting the model means finding the “best” values for \( \beta_0 \) and \( \beta_1 \). What does “best” mean in this context?
After fitting the model we write \[ \widehat{\text{Price}} = \beta_0 + \beta_1*\text{Mileage} \] to represent the fitted value corresponding to a particular choice of Mileage. Why did the Error term disappear? On the graph below, the fitted line is the blue line.
PMline = lm(Price ~ Mileage)
summary(PMline)
##
## Call:
## lm(formula = Price ~ Mileage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.308 -4.047 -0.395 3.837 12.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.0905 2.3699 30.0 <2e-16 ***
## Mileage -0.5894 0.0566 -10.4 4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.17 on 28 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.787
## F-statistic: 108 on 1 and 28 DF, p-value: 3.98e-11
plot(Price ~ Mileage, col = "red", pch = 19, xlab = "Mileage (1000's of miles)",
ylab = "Price (1000's of $)", main = "Price of used Porsches vs Mileage")
lines(Mileage, PMline$fitted, col = "blue")
The standard error of a linear model is the estimated standard deviation of the error term in the linear model. It is given by the formula \[ \hat{\sigma_\epsilon} = \sqrt{\frac{\sum (y - \hat{y})^2}{n-2}}, \] where \( y \) is the response (or dependent) variable (e.g. Price in our model above) and \( n \) is the number of data points. Note that it is easy to fit a model with zero residuals if we only have 2 points. This goes some way to explaining why there is the term \( n-2 \) in the bottom of the expression for \( \hat{\sigma_\epsilon} \) (we say that there are \( n-2 \) degrees of freedom here).
Here's some complicated code to compute \( \hat{\sigma_\epsilon} \):
sqrt(sum((Price - PMline$fitted)^2)/(length(Price) - 2))
## [1] 7.17
Note that this is also reported on the line “Residual standard error” in the summary of the linear model:
summary(PMline)
##
## Call:
## lm(formula = Price ~ Mileage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.308 -4.047 -0.395 3.837 12.676
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.0905 2.3699 30.0 <2e-16 ***
## Mileage -0.5894 0.0566 -10.4 4e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.17 on 28 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.787
## F-statistic: 108 on 1 and 28 DF, p-value: 3.98e-11
ASSESS:
There are assumptions (the book calls them conditions) that underlie the linear model.
Under these conditions we write, e.g. \[ \text{Price} = \beta_0 + \beta_1*\text{Mileage} + \epsilon, \] where \( \epsilon ~ N(0,\sigma) \).
Conditions 5 and 6 are particularly important if we wish to carry out statistical procedures (e.g. confidence intervals or hypothesis tests), though sometimes we can argue that our conclusions are valid despite these conditions being violated. We'll talk a lot about this in future classes.
The graph of Price versus Mileage seems to exhibit a linear trend so the linearity condition seems valid.
The residual is the error in our model when compared against the actual observation: \[ \text{Residual} = \text{Price} - \widehat{\text{Price}}. \] If the Residual is positive, does the data point lie above or below the blue line?
Check residuals:
plot(Mileage, PMline$residual, col = "red")
abline(0, 0, col = "blue")
It seems that the zero mean assumption for the residuals is probably satisfied. Also the constant variance looks okay. Independence and Randomness seem reasonable given the context in which the data were collected.
Check normality assumption for residuals We might try a density plot of the residuals:
pdf(file = "dennorm.pdf")
densityplot(PMline$residuals)
dev.off()
## pdf
## 2
It doesn't look so far from a normal distribution. A QQ plot can also be used to check normality:
qqnorm(PMline$residual, col = "red")
qqline(PMline$residual, col = "blue")
USE:
We report that our data supports the conclusion that \[ \text{Price} = 71,091 - 0.589*\text{Mileage} + \epsilon, \] where \( \epsilon ~ N(0,7170) \). This means that we expect to pay $\\( 71,901 \) for a Porsche with zero miles and every mile driven ought to reduce the price by about 59 cents. There is some error in our model and we estimate the spread by saying that the “typical error” is about $\\( 7,170 \).