Overview

I’m pursuing the Master’s in Data Science at CUNY to arm myself with modeling and predictive analytics tools that I can apply to horse racing, specifically so I can use them to win at the race track. This is a big undertaking, and to be successful I believe it’s important not only to understand the concepts taught in the classes but also to know how to apply them.

To that end, I am going to use the Boston data set to walk through the steps of creating a simple linear regression model. Think of it as a Style Guide to Simple Linear Regression. The steps will leverage content in the book An Introduction to Statistical Learning. This book was recommended in the additional reading sections of R for Data Science, so I logged in to my Amazon account and ordered it. My initial impressions are favorable. After this initial blog post, my plan is to write additional posts that expand on this example. Some ideas include comparing the traditional approach to the Tidy/Modelr approach, a post on the Broom package and its application to modeling, and a post to determine whether transformations could improve the results of our initial model. That’s a lot of good stuff, but let’s get started with Blog Post #1.

The Data Set

The Boston data set includes records on the median value of homes in 506 Boston neighborhoods. Our objective will be to develop a model to predict median home values using one of the 13 predictors in the data set.

First we’ll load the MASS library, which includes the Boston data set. Next we will call the fix() function on the Boston data set. fix() opens the data set in a spreadsheet-like editor window, which makes it a handy first step for eyeballing the data in any modeling project. Keep in mind that any edits made in that window are saved back to the data object.
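The code chunk for this step isn’t shown in the post, so here is a minimal sketch of what it might look like. The interactive() guard and the dim() and head() calls are my additions: fix() needs an interactive session, and the other two give a quick non-interactive look at the same data.

```r
# Load MASS, which ships with the Boston data set
library(MASS)

# Quick non-interactive look at the data
dim(Boston)   # number of rows and columns
head(Boston)  # first six rows

# fix() opens a spreadsheet-like editor, so only call it in an interactive session
if (interactive()) {
  fix(Boston)
}
```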

Let’s use the summary() function to find out more about our data set. This should give us a better understanding of the variables, their central tendencies and their spread. This would also be a great time to use ggplot to do some basic exploratory data analysis, but that’s beyond the scope of this post.
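The call itself isn’t shown in the post, but output in this shape is what summary() prints for a data frame, so a minimal sketch would be:

```r
library(MASS)

# Min, quartiles, median, mean and max for every column in Boston
summary(Boston)
```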

##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Next we will fit a basic model with medv (median home value) as the response variable and lstat (percent of households with low socioeconomic status) as the predictor. To do this we will use the lm() function, which fits the simple linear regression model. The basic form of the call follows:

model_1 <- lm(medv ~ lstat, data = Boston)

Simple Linear Regression Model Results

Great, we’ve run our first model and now it’s time to see some results. For a quick look at the results, you can type the model name: model_1. This will display the model’s coefficients (intercept and lstat). More detailed information can be obtained by using the summary function: summary(model_1). The summary function will display p-values, the F-statistic, standard errors and R-squared. An easy way to find out all the information that is available for our model is to apply the names function to it: names(model_1). You can see the results of these three statements (model_1, summary(model_1) and names(model_1)) below:
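For reference, here are the three statements as a self-contained sketch. The model is refit at the top so the block runs on its own.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

# Typing the model name prints the call and the coefficients
model_1

# Residual summary, coefficient table, R-squared and F-statistic
summary(model_1)

# Names of the components stored in the fitted model object
names(model_1)
```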

## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

There are a variety of ways to access the coefficients of our model. First, we can use the method from before and type the name of our model: model_1. Other options include the coefficients component listed by the names command, model_1$coefficients, and finally the coef() function: coef(model_1). Let’s see if they all yield the same results:
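A minimal sketch of the three approaches follows. The all.equal() check at the end is my addition, as a programmatic way to confirm what the printed output shows.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

# 1. Print the model (coefficients are shown rounded)
model_1

# 2. Pull the coefficients component straight out of the model object
model_1$coefficients

# 3. Use the extractor function
coef(model_1)

# Compare the two numeric versions
all.equal(model_1$coefficients, coef(model_1))
```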

## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Coefficients:
## (Intercept)        lstat  
##       34.55        -0.95
## (Intercept)       lstat 
##  34.5538409  -0.9500494
## (Intercept)       lstat 
##  34.5538409  -0.9500494

They do. Nice! Note that any of the names provided by the names function can be used in conjunction with the model name (model_1) and the dollar sign to display particular information: model_1$rank, model_1$residuals, model_1$terms, etc.
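Here is a short sketch of pulling a few of those components. Wrapping the residuals in head() is my choice to keep the printout short, since there is one residual per observation.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

# Rank of the fitted model (number of estimated coefficients here)
model_1$rank

# First few residuals
head(model_1$residuals)

# The terms object describing the model formula
model_1$terms
```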

Confidence Intervals and Predictions

Next we might want to know something about the confidence intervals of our coefficients, and we might also want to use our model to make some predictions. The confint() function computes confidence intervals for one or more parameters in a fitted model. By default it returns 95% intervals.
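A minimal sketch of the call is below. The second line, with a single parameter and a 99% level, is only there to illustrate the parm and level arguments and is not part of the original output.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

# 95% confidence intervals for the intercept and the lstat slope
confint(model_1)

# Optional: one parameter at a different confidence level
confint(model_1, parm = "lstat", level = 0.99)
```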

##                 2.5 %     97.5 %
## (Intercept) 33.448457 35.6592247
## lstat       -1.026148 -0.8739505

The predict() function can be utilized to produce both confidence and prediction intervals for the prediction of medv for a given value of lstat.
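The lstat values used for the predictions aren’t shown in the post. Given the coefficients printed earlier, the fitted values in the output are consistent with lstat values of 5, 10, 15 and 20, so this sketch assumes those. The new_data name is mine.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

# Assumed lstat values, inferred from the fitted values in the output
new_data <- data.frame(lstat = c(5, 10, 15, 20))

# Intervals for the average medv at each lstat value
predict(model_1, new_data, interval = "confidence")

# Intervals for an individual neighborhood's medv at each lstat value
predict(model_1, new_data, interval = "prediction")
```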

Confidence Intervals

##        fit      lwr      upr
## 1 29.80359 29.00741 30.59978
## 2 25.05335 24.47413 25.63256
## 3 20.30310 19.73159 20.87461
## 4 15.55285 14.77355 16.33216

Prediction Intervals

##        fit       lwr      upr
## 1 29.80359 17.565675 42.04151
## 2 25.05335 12.827626 37.27907
## 3 20.30310  8.077742 32.52846
## 4 15.55285  3.316021 27.78969

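Note that the confidence intervals and prediction intervals are centered on the same fitted values. However, the prediction intervals are substantially wider than the confidence intervals. A confidence interval only reflects uncertainty about the average medv at a given lstat, while a prediction interval also has to account for how much an individual neighborhood varies around that average.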
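To see where the extra width comes from, here is a rough sketch that rebuilds both half-widths by hand for lstat = 10. It relies on the se.fit, df and residual.scale values that predict() returns when se.fit = TRUE. The variable names are mine.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

new_data <- data.frame(lstat = 10)
pred <- predict(model_1, new_data, se.fit = TRUE)

# Critical t value for a 95% interval on the residual degrees of freedom
t_crit <- qt(0.975, df = pred$df)

# Confidence interval: uncertainty in the estimated mean response only
ci_half_width <- t_crit * pred$se.fit

# Prediction interval: adds the residual variance of a single new observation
pi_half_width <- t_crit * sqrt(pred$se.fit^2 + pred$residual.scale^2)

c(confidence = unname(ci_half_width), prediction = unname(pi_half_width))
```

The only difference between the two lines is the residual.scale term, which is the residual standard error reported by summary(model_1).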

To wrap this blog post up, we will use base R graphics to plot medv against lstat along with our least squares regression line.
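The plot itself isn’t reproduced here, so this is a minimal sketch of how it could be drawn with plot() and abline(). The axis labels, color and line width are my own choices.

```r
library(MASS)
model_1 <- lm(medv ~ lstat, data = Boston)

# Scatter plot of the response against the predictor
plot(Boston$lstat, Boston$medv,
     xlab = "lstat", ylab = "medv")

# Overlay the fitted least squares line from the model
abline(model_1, col = "red", lwd = 2)
```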

Note: there is evidence of non-linearity in our plot. We will address this in a later blog post.

Summary

In this post we’ve created a simple linear regression model. The following R functions and commands were used to complete this task. Test yourself to see if you can remember what each command does and how it was used.

  1. fix()
  2. names()
  3. summary()
  4. coef()
  5. confint()
  6. $coefficients, $rank, $residuals, $terms
  7. predict()
  8. plot()
  9. abline()

Thanks for reading