MLWeek2

Linear Regression (ISLR Chapter 3)

Simple Linear Regression

Data: Boston from the MASS package

# load MASS (it provides the Boston data set) and display the first rows
library(MASS)
head(Boston)

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

str(Boston)

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

summary(Boston)

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Fit a linear regression to the data, with medv as the response and lstat as the predictor.

lm.fit <- lm(medv~lstat, data = Boston)
summary(lm.fit)

## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
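
The estimates in the coefficient table can be reproduced from the closed-form least squares formulas for a single predictor. The following is a minimal sketch, assuming the MASS package is installed; the names x, y, b0 and b1 are introduced here for illustration only.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

x <- Boston$lstat
y <- Boston$medv

# Closed-form least squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Compare the hand-computed values with the estimates from lm()
rbind(manual = c(b0, b1), lm = unname(coef(lm.fit)))
```

The two rows should agree up to floating-point error, which confirms that lm() is solving the ordinary least squares problem.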

result explanation

Residuals: \(y_i - \hat{y_i}\)

boxplot(lm.fit$residuals, ylab = "Residuals")

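The box plot shows that the median residual is close to 0 (about -1.3), so the residuals are roughly centered. However, many points fall outside the whiskers, and the tail is longer on the positive side (the residuals range from -15.2 to 24.5), which suggests that a simple linear model may not be adequate.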
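
To make the box plot reading concrete, the residuals and the points drawn as outliers can be inspected numerically. This is a sketch under the assumption that MASS is installed; res and out are illustrative names, and boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses by default.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

# Residuals are observed values minus fitted values
res <- Boston$medv - fitted(lm.fit)
all.equal(unname(res), unname(residuals(lm.fit)))

# With an intercept in the model the residuals always average to zero,
# so the mean alone says nothing about the quality of the fit
mean(res)

# Points that boxplot() draws individually (beyond 1.5 * IQR from the box)
out <- boxplot.stats(res)$out
length(out)
range(out)
```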

Coefficients: \(\beta_0\), \(\beta_1\)

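Both \(\beta_0\) and \(\beta_1\) have small standard errors relative to their estimates, and both \(p\)-values are below \(2 \times 10^{-16}\). A small \(p\)-value means that an estimate this far from zero would be very unlikely if the true coefficient were 0, so the association between lstat and medv is unlikely to be due to chance.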
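
The t values and \(p\)-values in the table follow directly from the estimates and their standard errors. A minimal sketch, assuming MASS is installed; coefs, t.manual and p.manual are names made up for this example.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(lm.fit))

# t value = estimate / standard error
t.manual <- coefs[, "Estimate"] / coefs[, "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p.manual <- 2 * pt(abs(t.manual), df = df.residual(lm.fit), lower.tail = FALSE)

cbind(t.manual, p.manual)

# 95% confidence intervals for the coefficients
confint(lm.fit, level = 0.95)
```

An interval for lstat that excludes 0 tells the same story as the small \(p\)-value.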

Residual standard error: \(RSE\)

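RSE estimates the standard deviation of the error term; roughly, it is the typical size of the difference between \(y_i\) and \(\hat{y_i}\), so a smaller RSE indicates a closer fit. Here RSE = 6.216, in the units of medv (thousands of dollars), which is about 28% of the mean medv (22.53). The fit is usable but not tight.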
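
The RSE can be recomputed from the residual sum of squares. A short sketch, assuming MASS is installed; rss, n and rse are illustrative names.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

rss <- sum(residuals(lm.fit)^2)  # residual sum of squares
n   <- nrow(Boston)

# Two parameters (intercept and slope) are estimated, hence n - 2
rse <- sqrt(rss / (n - 2))

c(manual = rse, summary = summary(lm.fit)$sigma)

# RSE relative to the mean response, as a rough percentage error
rse / mean(Boston$medv)
```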

R-squared: \(R^2\)

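\(R^2\) is the proportion of the variability in Y that is explained by X. If the true relationship is linear and the noise is small, we expect \(R^2\) to be close to 1. Here \(R^2 = 0.5441\), so lstat explains only about 54% of the variance in medv. Adjusted \(R^2\) penalizes the number of predictors \(p\), which matters when comparing models with many predictors; with a single predictor it is almost identical (0.5432).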
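
Both versions of \(R^2\) can be computed by hand from the residual and total sums of squares. A minimal sketch, assuming MASS is installed; rss, tss, r2 and adj.r2 are names chosen for this example.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

y   <- Boston$medv
n   <- length(y)
p   <- 1                          # number of predictors
rss <- sum(residuals(lm.fit)^2)   # unexplained variation
tss <- sum((y - mean(y))^2)       # total variation

r2     <- 1 - rss / tss
adj.r2 <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))

# With one predictor, R^2 equals the squared correlation of X and Y
c(r2 = r2, cor.squared = cor(Boston$lstat, y)^2, adj.r2 = adj.r2)
```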

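F-statistic:

The F-statistic tests whether at least one predictor is related to the response. Its very small \(p\)-value (< 2.2e-16) is strong evidence that a relationship between lstat and medv exists. With a single predictor, the F-statistic equals the square of the t-statistic for lstat: \((-24.53)^2 \approx 601.7\), which matches 601.6 up to the rounding of the printed t value.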
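
The F-statistic can also be rebuilt from the sums of squares, which makes its link to the t-statistic explicit. A sketch, assuming MASS is installed; f.manual, p.value and t.lstat are illustrative names.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

y   <- Boston$medv
n   <- length(y)
p   <- 1                          # number of predictors
rss <- sum(residuals(lm.fit)^2)
tss <- sum((y - mean(y))^2)

# F = (explained variation per predictor) / (residual variance estimate)
f.manual <- ((tss - rss) / p) / (rss / (n - p - 1))
p.value  <- pf(f.manual, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# In simple regression F is the squared t-statistic of the slope
t.lstat <- coef(summary(lm.fit))["lstat", "t value"]
c(F = f.manual, t.squared = t.lstat^2, p.value = p.value)
```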

predict with lm.fit

Once the model is fitted, we can use it to predict the response for new values of lstat.

predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "confidence")

##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461

predict(lm.fit, data.frame(lstat = c(5, 10, 15)), interval = "prediction")

##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846

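predict() returns the fitted value together with a 95% interval. The confidence interval is for the mean response at the given lstat (a statement about the population), while the prediction interval is for a single new observation (a statement about an individual). The 95% level means that, over repeated samples, about 95% of intervals built this way contain the true value. The prediction interval also accounts for the random error term, so it is wider than the confidence interval, although both are centered at the same fitted value.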
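
The difference between the two intervals is easiest to see when they are built by hand. The sketch below uses the standard formulas for simple linear regression and assumes MASS is installed; x0, se.mean, se.pred, conf.int and pred.int are names introduced for illustration.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

x  <- Boston$lstat
n  <- length(x)
s  <- summary(lm.fit)$sigma       # residual standard error
x0 <- c(5, 10, 15)                # new lstat values

fit <- unname(coef(lm.fit)[1] + coef(lm.fit)[2] * x0)

# Standard error of the estimated mean response at x0
se.mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))

# A new observation adds the error variance s^2 on top of that
se.pred <- sqrt(s^2 + se.mean^2)

tq <- qt(0.975, df = n - 2)       # two-sided 95% quantile

conf.int <- cbind(fit, lwr = fit - tq * se.mean, upr = fit + tq * se.mean)
pred.int <- cbind(fit, lwr = fit - tq * se.pred, upr = fit + tq * se.pred)

conf.int
pred.int
```

The extra s^2 term in se.pred is the only difference between the two intervals, and it is what makes the prediction interval so much wider.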

some plots

least squares regression line

plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv", pch = 20, col = "blue")
abline(lm.fit, col = "red")
legend("topright","regression line",lty=1, col="red")

plot(lm.fit)

plot(lm.fit)

(1): Residuals vs Fitted plots the residuals \(y_i - \hat{y_i}\) against the fitted values \(\hat{y_i}\). If the linear model is adequate, the trend line (red) should be roughly flat around 0. A U-shaped trend line therefore indicates non-linearity in the data.

(2): The normal Q-Q plot checks whether the standardized residuals are approximately normally distributed, which is the assumption behind the t-tests and intervals above. Points lying close to the straight line indicate normality. The plot does not test linearity directly, but a misspecified mean (for example a missed curve) usually shows up as skewed or heavy-tailed residuals. In our case the points depart from the line, which is consistent with the non-linearity seen in (1).

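(3): The Scale-Location plot shows \(\sqrt{|\text{standardized residuals}|}\) against the fitted values. It mainly checks whether the residual spread is constant: a flat trend line suggests constant variance, while a trend suggests that the variance changes with the fitted value. Because it is built from the same residuals as the Residuals vs Fitted plot, a U-shaped trend line here also reflects the non-linear relationship seen in (1).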

(4): Leverage measures how far an observation's predictor value lies from the others, and therefore how much that point can influence the regression line. If the linear model is adequate, the standardized residuals should be scattered symmetrically around 0, mostly within 2 to 3 standard deviations. A point that lies far from 0 and has high leverage, with few other points nearby, has a large influence on the regression line. The solid red line is a smoothed trend and is expected to be close to horizontal; the dashed red contours, when they fall inside the plotting region, mark Cook’s distance. For details, see
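
The four panels can be viewed side by side, and the Cook’s distances behind panel (4) can be inspected directly. This is a sketch, assuming MASS is installed; the 2 x 2 layout and the 0.5 cut-off (the lower of the contour levels that plot() draws by default) are choices made for this example, and cd is an illustrative name.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

# Show all four diagnostic panels in one 2 x 2 figure,
# then restore the previous graphics settings
op <- par(mfrow = c(2, 2))
plot(lm.fit)
par(op)

# Cook's distance combines residual size and leverage per observation
cd <- cooks.distance(lm.fit)

# The three most influential observations
head(sort(cd, decreasing = TRUE), 3)

# Does any observation exceed the 0.5 contour?
any(cd > 0.5)
```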

fitted values vs. studentized residuals

plot(lm.fit$fitted.values, rstudent(lm.fit), ylab = "studentized residuals", xlab = "fitted.values")

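The plot shows the studentized residuals, each residual divided by its estimated standard error, versus the fitted values. It is similar to panel (1) of plot(lm.fit) above: the points should show no trend if the linear model fits well. As a common rule of thumb, observations whose studentized residuals exceed 3 in absolute value are possible outliers.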
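
The scaling behind rstandard() and rstudent() can be written out explicitly. A minimal sketch, assuming MASS is installed; r.int, r.ext and s.loo are names made up here. rstudent() uses a leave-one-out estimate of the error standard deviation, which influence() exposes as its sigma component.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

res <- residuals(lm.fit)
h   <- hatvalues(lm.fit)          # leverage of each observation
s   <- summary(lm.fit)$sigma      # residual standard error

# Internally studentized residuals: same as rstandard(lm.fit)
r.int <- res / (s * sqrt(1 - h))

# Externally studentized residuals: sigma is re-estimated
# with observation i left out; same as rstudent(lm.fit)
s.loo <- influence(lm.fit)$sigma
r.ext <- res / (s.loo * sqrt(1 - h))

# Observations flagged by the |studentized residual| > 3 rule of thumb
which(abs(r.ext) > 3)
```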

calculate leverage statistics by hatvalues()

plot(hatvalues(lm.fit))

The leverage statistic of a point measures how unusual its predictor value is, and therefore how much the point can pull the regression line. A large leverage statistic indicates potentially strong influence.
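
For a single predictor the leverage statistic has a simple closed form, \(h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_j (x_j - \bar{x})^2}\), so it can be checked against hatvalues(). A sketch, assuming MASS is installed; h.manual is an illustrative name.

```r
library(MASS)  # provides the Boston data set

lm.fit <- lm(medv ~ lstat, data = Boston)

x <- Boston$lstat
n <- length(x)
h <- hatvalues(lm.fit)

# Closed-form leverage for simple linear regression
h.manual <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(h.manual, unname(h))

# Leverages sum to the number of fitted parameters (here 2),
# so the average leverage is 2 / n
c(sum = sum(h), mean = mean(h), two.over.n = 2 / n)

# Observation with the largest leverage and its lstat value
which.max(h)
x[which.max(h)]
```

The point with the largest leverage is the one whose lstat lies farthest from the mean, which is why leverage depends only on the predictor and not on medv.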

MLWeek2_2

Ye Lin

September 20, 2016