Regression Line

Using lm, we can find the equation of the regression line for the given points.

\[ (5.6, 8.8), (6.3,12.4), (7,14.8), (7.7,18.2), (8.4,20.8) \]
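A minimal sketch of the fit (the vector names x and y are placeholders of my choosing):

```r
# Fit a simple linear regression to the five points and report the coefficients
x <- c(5.6, 6.3, 7.0, 7.7, 8.4)
y <- c(8.8, 12.4, 14.8, 18.2, 20.8)
coef(lm(y ~ x))
```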

## (Intercept)           x 
##      -14.80        4.26

Equation of Regression Line:

\[ \hat{y} = -14.80 + 4.26x \]

Cars Dataset

Using the R built-in “cars” dataset, we can build a linear model for stopping distance as a function of speed.

First 6 Rows

##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
## 6     9   10

Last 6 Rows

##    speed dist
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

Summary Statistics

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Linear Model

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function. The “Coefficients” table shown is key; its first column displays the linear model’s y-intercept and the coefficient of speed.
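A sketch of the call that produces the summary below; the object name linear_model matches the one referenced later in this report:

```r
# Fit stopping distance as a function of speed, then print the model summary
linear_model <- lm(dist ~ speed, data = cars)
summary(linear_model)
```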

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = -17.5791 + 3.9324 \times \text{speed} \]

The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 65.11% of the variability in stopping distance is explained by speed.

The extremely small p-value for the speed coefficient (1.49e-12) indicates that speed is a statistically significant predictor of stopping distance.

Prediction & Prediction Errors

The following code will create a scatterplot with the least squares line laid on top.
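A minimal sketch of that plot, reusing the linear_model object fit above:

```r
# Scatterplot of the cars data with the least squares line overlaid
plot(dist ~ speed, data = cars,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")
abline(linear_model)
```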

The function abline plots a line based on its slope and intercept. Here, we used a shortcut by providing the model linear_model, which contains both parameter estimates. This line can be used to predict \(y\) at any value of \(x\). Making predictions for values of \(x\) beyond the range of the observed data is referred to as extrapolation and is not usually recommended. Predictions made within the range of the data are more reliable, and these fitted values are also what we use to compute the residuals.
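For instance, a prediction within the observed range of speeds can be obtained with predict(); the speed of 21 mph used here is just an illustrative value:

```r
# Predict stopping distance at a speed within the observed range (4-25 mph)
predict(linear_model, newdata = data.frame(speed = 21))
```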

Model Diagnostics

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

Linearity: We checked if the relationship between speed and stopping distance is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. speed.
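A sketch of that residual plot, assuming the linear_model object from above:

```r
# Residuals versus the explanatory variable, with a reference line at zero
plot(linear_model$residuals ~ cars$speed,
     xlab = "Speed (mph)", ylab = "Residuals")
abline(h = 0, lty = 3)
```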

There is no obvious pattern in the residual plot, which supports the linearity condition.

Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
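For example:

```r
# Histogram and normal probability plot of the residuals
hist(linear_model$residuals)
qqnorm(linear_model$residuals)
qqline(linear_model$residuals)
```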

The residuals appear nearly normal and show roughly constant spread, so we can consider the nearly normal residuals and constant variability conditions to be met.

2011 Major League Baseball


The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average.

## Observations: 30
## Variables: 13
## $ X            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ team         <fct> Texas Rangers, Boston Red Sox, Detroit Tigers, Kansas Ci…
## $ runs         <int> 855, 875, 787, 730, 762, 718, 867, 721, 735, 615, 708, 6…
## $ at_bats      <int> 5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 55…
## $ hits         <int> 1599, 1600, 1540, 1560, 1513, 1477, 1452, 1422, 1429, 14…
## $ homeruns     <int> 210, 203, 169, 129, 162, 108, 222, 185, 163, 95, 191, 11…
## $ bat_avg      <dbl> 0.283, 0.280, 0.277, 0.275, 0.273, 0.264, 0.263, 0.261, …
## $ strikeouts   <int> 930, 1108, 1143, 1006, 978, 1085, 1138, 1083, 1201, 1164…
## $ stolen_bases <int> 143, 102, 49, 153, 57, 130, 147, 94, 118, 118, 81, 126, …
## $ wins         <int> 96, 90, 95, 71, 90, 77, 97, 96, 73, 56, 69, 82, 71, 79, …
## $ new_onbase   <dbl> 0.340, 0.349, 0.340, 0.329, 0.341, 0.335, 0.343, 0.325, …
## $ new_slug     <dbl> 0.460, 0.461, 0.434, 0.415, 0.425, 0.391, 0.444, 0.425, …
## $ new_obs      <dbl> 0.800, 0.810, 0.773, 0.744, 0.766, 0.725, 0.788, 0.750, …

After loading the season data for all 30 teams, I am interested in creating a linear model that predicts the number of wins a team achieves in a season from the number of home runs the team hits, using:

- homeruns as the predictor variable

- wins as the response variable

Plot Data

The plot shows an approximately linear, positive relationship between home runs and wins, with a correlation of about 0.66 (computed below): the more home runs a team hits, the more games it tends to win. The relationship is only moderately strong, since there is considerable scatter in the data.
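A sketch of the plot and correlation calculation; df is assumed to be the data frame holding the season data:

```r
# Scatterplot of wins against home runs, plus their correlation
plot(wins ~ homeruns, data = df, xlab = "Home runs", ylab = "Wins")
cor(df$homeruns, df$wins)
```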

## [1] 0.660614

Sum of Squared Residuals

Similar to how we can use the mean and standard deviation to summarize a single variable, we can summarize the relationship between homeruns and wins by finding the line that best follows their association.

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     48.8140       0.2119  
## 
## Sum of Squares:  2129.785

There are 30 residuals shown in blue, one for each of the 30 MLB teams. Residuals are the difference between the observed values and the values predicted by the line:

\[ e_i = y_i - \hat{y}_i \]

The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, I will rerun the plot_ss command and add the argument showSquares = TRUE.
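A sketch of that call, assuming the plot_ss() helper from the DATA606/OpenIntro labs and the df data frame used above:

```r
# Interactive helper: click two points to define a line; showSquares = TRUE
# draws the squared residuals for the chosen line
plot_ss(x = df$homeruns, y = df$wins, showSquares = TRUE)
```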

## Click two points to make a line.
                                
## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##     48.8140       0.2119  
## 
## Sum of Squares:  2129.785

The output from the plot_ss function provides the slope and intercept of the line as well as the sum of squares.

Linear Model
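A sketch of the fit summarized below; the object name model_mlb is my own:

```r
# Regress wins on home runs and inspect the summary
model_mlb <- lm(wins ~ homeruns, data = df)
summary(model_mlb)
```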

## 
## Call:
## lm(formula = wins ~ homeruns, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2874  -6.7083   0.7708   5.6292  20.7649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.81397    7.08635   6.888 1.74e-07 ***
## homeruns     0.21190    0.04551   4.656 7.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.721 on 28 degrees of freedom
## Multiple R-squared:  0.4364, Adjusted R-squared:  0.4163 
## F-statistic: 21.68 on 1 and 28 DF,  p-value: 7.094e-05

The “Coefficients” table shown is key; its first column displays the linear model’s y-intercept and the coefficient of homeruns. With this table, we can write down the least squares regression line for the linear model:

\[ \hat{y} = 48.81397 + 0.21190 \times \text{homeruns} \]

The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 43.6% of the variability in wins is explained by home runs.

Prediction & Prediction Errors

The following code will create a scatterplot with the least squares line laid on top.
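A minimal sketch, reusing the assumed model_mlb object from above:

```r
# Scatterplot of wins against home runs with the least squares line overlaid
plot(wins ~ homeruns, data = df, xlab = "Home runs", ylab = "Wins")
abline(model_mlb)
```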

Model Diagnostics

Linearity: We checked if the relationship between homeruns and wins is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. homeruns.

There is no obvious pattern in the residual plot, which supports the linearity condition.

Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.

The residuals appear nearly normal and show roughly constant spread, so we can consider the nearly normal residuals and constant variability conditions to be met.

2008 Real-World Data

The dataset below contains real-world data from 2008. The variables include:

| variable | description |
|----------|-------------|
| Country | name of the country |
| LifeExp | average life expectancy for the country, in years |
| InfantSurvival | proportion of those surviving to one year or more |
| Under5Survival | proportion of those surviving to five years or more |
| TBFree | proportion of the population without TB |
| PropMD | proportion of the population who are MDs |
| PropRN | proportion of the population who are RNs |
| PersExp | mean personal expenditures on healthcare, in US dollars at average exchange rate |
| GovtExp | mean government expenditures per capita on healthcare, in US dollars at average exchange rate |
| TotExp | sum of personal and government expenditures |

## Observations: 190
## Variables: 10
## $ Country        <fct> Afghanistan, Albania, Algeria, Andorra, Angola, Antigu…
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976…
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, …
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.…
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, …
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 1876…
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190…

I will run simple linear regression analysis with TotExp (sum of personal and government expenditures) as the predictor variable and LifeExp (average life expectancy for the country in years) as the response variable.

Linear Model
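A sketch of the fit shown below; the object name model1 is my own:

```r
# Simple linear regression of life expectancy on total health expenditures
model1 <- lm(df$LifeExp ~ df$TotExp)
summary(model1)
```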

## 
## Call:
## lm(formula = df$LifeExp ~ df$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## df$TotExp   6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Although TotExp is statistically significant, with a very small p-value, the model fits poorly because the \(R^2\) is low (about 26%). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, only 26% of the variability in average life expectancy is explained by the sum of personal and government expenditures. The F-statistic is about 65 (we would like this to be higher), and this metric is useful for comparing different models.

Diagnostic Plots
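A sketch of how these plots can be produced, assuming the model1 object from above:

```r
# Standard lm diagnostic plots (Residuals vs Fitted, Normal Q-Q,
# Scale-Location, Residuals vs Leverage) in a 2x2 grid
par(mfrow = c(2, 2))
plot(model1)
```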

Based on the diagnostic plots, some residuals are quite large. The Normal Q-Q plot shows the residuals following a somewhat skewed distribution. In the Scale-Location plot, the spread of the residuals is not constant across the fitted values, with many points far from the line. Lastly, the Residuals vs Leverage plot shows a few observations with higher leverage than in our other models, and some with substantial influence.

Transform Data

Using the same predictor and response variable, I will transform the data by raising life expectancy to the 4.6 power and total expenditures to the 0.06 power, as sketched below. The goal is to see whether the transformed model is an improvement over the original, non-transformed model.
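A sketch of the transformation; the data frame name df2 matches the one in the model output below:

```r
# Power transformations: LifeExp^4.6 and TotExp^0.06
df2 <- data.frame(LifeExp = df$LifeExp^4.6,
                  TotExp  = df$TotExp^0.06)
```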

Linear Model
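A sketch of the fit; the object name model2 is my own:

```r
# Refit the simple linear regression on the transformed variables
model2 <- lm(df2$LifeExp ~ df2$TotExp)
summary(model2)
```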

## 
## Call:
## lm(formula = df2$LifeExp ~ df2$TotExp)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## df2$TotExp   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The model has improved significantly. The low p-values show that the variables are still statistically significant. The \(R^2\) is much higher, with an explained variance of 0.7298. The F-statistic (507.7) is also much larger than in the previous model, providing stronger evidence against the null hypothesis that there is no relationship between LifeExp and TotExp.

Diagnostic Plots

Overall the residuals are more normal than the original model. The Normal Q-Q plot shows that the residuals follow a less skewed distribution. In the Scale-Location plot, the residuals appear to be more evenly distributed. In the Residuals vs Leverage plot, no residuals have a very high leverage/influence.

We can now forecast life expectancy when TotExp^0.06 = 1.5 and when TotExp^0.06 = 2.5 using the fitted line:

\[ \widehat{LifeExp^{4.6}} = -736527910 + 620060216 \times TotExp^{0.06} \]
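A sketch of the forecast calculation, assuming the model2 object from the transformed fit above: the prediction is made on the transformed scale and then back-transformed to years by raising it to the 1/4.6 power.

```r
# Forecast LifeExp^4.6 at TotExp^0.06 = 1.5 and 2.5, then back-transform
b <- coef(model2)
(b[1] + b[2] * 1.5)^(1 / 4.6)
(b[1] + b[2] * 2.5)^(1 / 4.6)
```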

When TotExp^0.06 = 1.5, the forecast life expectancy (after back-transforming) is:

## [1] 63.31153

When TotExp^.06 =2.5, life expectancy is:

## [1] 86.50645

Multiple Regression Model

I will now build the following multiple regression model:

\[ LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times (PropMD \times TotExp) \]
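A sketch of the fit summarized below; it assumes df$md_exp holds the PropMD * TotExp interaction term, and the object name model3 is my own:

```r
# Create the interaction term and fit the multiple regression model
df$md_exp <- df$PropMD * df$TotExp
model3 <- lm(df$LifeExp ~ df$PropMD + df$TotExp + df$md_exp, data = df)
summary(model3)
```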

## 
## Call:
## lm(formula = df$LifeExp ~ df$PropMD + df$TotExp + df$md_exp, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.277e+01  7.956e-01  78.899  < 2e-16 ***
## df$PropMD    1.497e+03  2.788e+02   5.371 2.32e-07 ***
## df$TotExp    7.233e-05  8.982e-06   8.053 9.39e-14 ***
## df$md_exp   -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The p-values demonstrate that all predictors are statistically significant. The standard error is large only for PropMD (proportion of the population who are MDs). Similar to the first model, the \(R^2\) is fairly low, with an explained variance of 0.3574. The F-statistic is about 34 (we would like this to be higher); it is likely lower than in the previous two models partly because the F-statistic accounts for the additional predictors in the model.

Python Regression Analysis

  1. Property Sale Prices

  2. Boston Houses