Using lm, we can find the equation of the regression line for the given points.
\[ (5.6, 8.8), (6.3,12.4), (7,14.8), (7.7,18.2), (8.4,20.8) \]
x <- c(5.6,6.3,7,7.7,8.4)
y <- c(8.8, 12.4,14.8, 18.2,20.8)
df <- data.frame(x, y)
#round the coefficients of the regression line to the nearest hundredth
round(coef(lm(y ~ x, data = df)), 2)
## (Intercept)           x
##      -14.80        4.26
\[ \hat{y} = -14.80 + 4.26x \]
Using the R built-in “cars” dataset, we can build a linear model for stopping distance as a function of speed.
First 6 Rows
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
Last 6 Rows
## speed dist
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
Summary Statistics
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
ggplot(cars, aes(speed, dist)) +
geom_point(aes(color=speed)) +
#x axis
xlab("Speed") +
#y axis
ylab("Distance") +
# chart title
ggtitle("Speed vs. Stopping Distance") +
# add line
geom_smooth(method = lm, se = FALSE) +
  theme_light()

The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function. The “Coefficients” table shown is key; its first column displays the linear model’s y-intercept and the coefficient of speed.
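A sketch of the fit and the summary call that produce the output below; the object name linear_model is assumed here and matches the name used in the residual checks later on.

# fit stopping distance as a function of speed and inspect the model
linear_model <- lm(dist ~ speed, data = cars)
summary(linear_model)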
##
## Call:
## lm(formula = dist ~ speed, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.069 -9.525 -2.272 9.215 43.201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791 6.7584 -2.601 0.0123 *
## speed 3.9324 0.4155 9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = -17.5791 + 3.9324 * speed \]
The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 65.11% of the variability in stopping distance is explained by speed.
The extremely low p-value for speed indicates that it is a statistically significant predictor of stopping distance.
The following code will create a scatterplot with the least squares line laid on top.
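A minimal base-R sketch (the original chunk is not shown; linear_model is the model fitted above):

# scatterplot of the data with the least squares line overlaid
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Distance")
abline(linear_model)  # uses the model's intercept and slope to draw the line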
The function abline plots a line based on its slope and intercept. Here, we use a shortcut by providing the model linear_model, which contains both parameter estimates. This line can be used to predict \(y\) at any value of \(x\). Making predictions for values of \(x\) beyond the range of the observed data is referred to as extrapolation and is not usually recommended; predictions made within the range of the data are more reliable. The fitted values are also used to compute the residuals.
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Linearity: We checked if the relationship between speed and stopping distance is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. speed.
plot(linear_model$residuals ~ cars$speed, xlab = "Speed", ylab= "Residuals")
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

There is no obvious pattern in the residual plot, so the linearity condition appears to be met.
Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
qqnorm(linear_model$residuals)
qqline(linear_model$residuals)  # adds a diagonal line to the normal probability plot

The points fall close to the line, so the nearly normal residuals condition appears to be met; together with the residual plot above, the constant variability condition also looks reasonable.
The movie Moneyball focuses on the “quest for the secret of success in baseball”. It follows a low-budget team, the Oakland Athletics, who believed that underused statistics, such as a player’s ability to get on base, better predict the ability to score runs than typical statistics like home runs, RBIs (runs batted in), and batting average.
## Observations: 30
## Variables: 13
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ team <fct> Texas Rangers, Boston Red Sox, Detroit Tigers, Kansas Ci…
## $ runs <int> 855, 875, 787, 730, 762, 718, 867, 721, 735, 615, 708, 6…
## $ at_bats <int> 5659, 5710, 5563, 5672, 5532, 5600, 5518, 5447, 5544, 55…
## $ hits <int> 1599, 1600, 1540, 1560, 1513, 1477, 1452, 1422, 1429, 14…
## $ homeruns <int> 210, 203, 169, 129, 162, 108, 222, 185, 163, 95, 191, 11…
## $ bat_avg <dbl> 0.283, 0.280, 0.277, 0.275, 0.273, 0.264, 0.263, 0.261, …
## $ strikeouts <int> 930, 1108, 1143, 1006, 978, 1085, 1138, 1083, 1201, 1164…
## $ stolen_bases <int> 143, 102, 49, 153, 57, 130, 147, 94, 118, 118, 81, 126, …
## $ wins <int> 96, 90, 95, 71, 90, 77, 97, 96, 73, 56, 69, 82, 71, 79, …
## $ new_onbase <dbl> 0.340, 0.349, 0.340, 0.329, 0.341, 0.335, 0.343, 0.325, …
## $ new_slug <dbl> 0.460, 0.461, 0.434, 0.415, 0.425, 0.391, 0.444, 0.425, …
## $ new_obs <dbl> 0.800, 0.810, 0.773, 0.744, 0.766, 0.725, 0.788, 0.750, …
After loading the season data for all 30 MLB teams, I am interested in creating a linear model to predict the number of wins a team achieves in a season from how many home runs the team hits, with homeruns as the predictor and wins as the response.
#x axis is home runs, y axis is wins
homerun_wins <- ggplot(df, aes(homeruns, wins))
homerun_wins + geom_point()

The plot shows an approximately linear relationship, and the correlation of about 0.66 (shown below) indicates a moderately strong positive association: the more home runs a team hits, the more wins it tends to have. The strength is only moderate because there is noticeable variation in the data.
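A sketch of the call that produces the correlation value below:

cor(df$homeruns, df$wins)  # strength of the linear association between home runs and wins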
## [1] 0.660614
Similar to how we can use the mean and standard deviation to summarize a single variable, we can summarize the relationship between homeruns and wins by finding the line that best follows their association.
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 48.8140 0.2119
##
## Sum of Squares: 2129.785
There are 30 residuals shown in blue, one for each of the 30 MLB teams. Residuals are the difference between the observed values and the values predicted by the line:
\[ e_i = y_i - \hat{y}_i \]
The most common way to do linear regression is to select the line that minimizes the sum of squared residuals. To visualize the squared residuals, I will rerun the plot_ss command and add the argument showSquares = TRUE.
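A sketch of that call, assuming the plot_ss helper from the course lab materials (the exact chunk is not shown in the original):

# interactive: click two points to propose a line; showSquares = TRUE
# shades the squared residuals for the chosen line
plot_ss(x = df$homeruns, y = df$wins, showSquares = TRUE)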
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## 48.8140 0.2119
##
## Sum of Squares: 2129.785
The output from the plot_ss function provides the slope and intercept of the line as well as the sum of squares.
##
## Call:
## lm(formula = wins ~ homeruns, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2874 -6.7083 0.7708 5.6292 20.7649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.81397 7.08635 6.888 1.74e-07 ***
## homeruns 0.21190 0.04551 4.656 7.09e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.721 on 28 degrees of freedom
## Multiple R-squared: 0.4364, Adjusted R-squared: 0.4163
## F-statistic: 21.68 on 1 and 28 DF, p-value: 7.094e-05
The “Coefficients” table shown is key; its first column displays the linear model’s y-intercept and the coefficient of homeruns. With this table, we can write down the least squares regression line for the linear model:
\[ \hat{y} = 48.81397 + 0.21190 * homeruns \]
The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 43.6% of the variability in wins is explained by home runs.
The following code will create a scatterplot with the least squares line laid on top.
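A minimal base-R sketch (the original chunk is not shown; the model object is named linear_model to match the residual checks below):

# fit wins as a function of home runs and overlay the least squares line
linear_model <- lm(wins ~ homeruns, data = df)
plot(df$homeruns, df$wins, xlab = "Home Runs", ylab = "Wins")
abline(linear_model)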
Linearity: We checked if the relationship between homeruns and wins is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. homeruns.
plot(linear_model$residuals ~ df$homeruns, xlab = "Homeruns", ylab= "Residuals")
abline(h = 0, lty = 3)  # adds a horizontal dashed line at y = 0

There is no obvious pattern in the residual plot, so the linearity condition appears to be met.
Nearly normal residuals: To check this condition, we can look at a histogram or a normal probability plot of the residuals.
qqnorm(linear_model$residuals)
qqline(linear_model$residuals)  # adds a diagonal line to the normal probability plot

The points fall reasonably close to the line, so the nearly normal residuals condition appears to be met; together with the residual plot above, the constant variability condition also looks reasonable.
The dataset below contains real-world data from 2008. The variables include:
| variable | description |
|---|---|
| Country | name of the country |
| LifeExp | average life expectancy for the country in years |
| InfantSurvival | proportion of those surviving to one year or more |
| Under5Survival | proportion of those surviving to five years or more |
| TBFree | proportion of the population without TB |
| PropMD | proportion of the population who are MDs |
| PropRN | proportion of the population who are RNs |
| PersExp | mean personal expenditures on healthcare in US dollars at average exchange rate |
| GovtExp | mean government expenditures per capita on healthcare, US dollars at average exchange rate |
| TotExp | sum of personal and government expenditures |
df <- read.csv("https://raw.githubusercontent.com/aaronzalkisps/risk/master/who.csv?token=AOOG3OFNXVMKKIRCIWYHJTC7UCPD2")
glimpse(df)  # glimpse() comes from the dplyr package
## Observations: 190
## Variables: 10
## $ Country <fct> Afghanistan, Albania, Algeria, Andorra, Angola, Antigu…
## $ LifeExp <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976…
## $ TBFree <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, …
## $ PropMD <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.…
## $ PropRN <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.…
## $ PersExp <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, …
## $ GovtExp <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 1876…
## $ TotExp <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190…
I will run simple linear regression analysis with TotExp (sum of personal and government expenditures) as the predictor variable and LifeExp (average life expectancy for the country in years) as the response variable.
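A sketch of the fit that produces the summary below; the object name lm1 is assumed and matches the name used when the regression line is plotted afterwards.

lm1 <- lm(df$LifeExp ~ df$TotExp)  # simple linear regression of life expectancy on total expenditures
summary(lm1)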
##
## Call:
## lm(formula = df$LifeExp ~ df$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## df$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Although the variables are significant with low p-values, the model is not good because the \(R^2\) is low (~26%). The \(R^2\) value represents the proportion of variability in the response variable that is explained by the explanatory variable. For this model, 26% of the variability in average life expectancy for the country is explained by the sum of personal and government expenditures. The F-statistic is 65 (we would like this to be higher) and this metric is used to compare different models.
plot(df$TotExp, df$LifeExp, xlab="Personal and Government Expenditures"
, ylab="Life Expectancy(Years)")
abline(lm1, col = "green")

Based on the diagnostic plots, we can see that some residuals are very large. The Normal Q-Q plot shows the residuals following a slightly skewed distribution. In the Scale-Location plot, the residuals do not follow the model very well, since many of them stray from the line. Lastly, the Residuals vs Leverage plot shows that a few observations have higher leverage than in our other models, and some have very high influence.
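The diagnostic plots discussed above can be produced with base R's plot method for fitted lm objects, for example:

par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(lm1)             # Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
par(mfrow = c(1, 1))  # reset the plotting layout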
Using the same predictor and response variable, I will transform the data by raising life expectancy to the 4.6 power and total expenditures to the 0.06 power. The goal is to see if the transformed model is better than the original, non-transformed model.
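A sketch of the transformation and refit that produce the output below; the names df2 and lm2 are assumed and match the objects referenced when the transformed line is plotted.

# transform both variables and refit the simple linear regression
df2 <- data.frame(LifeExp = df$LifeExp^4.6, TotExp = df$TotExp^0.06)
lm2 <- lm(df2$LifeExp ~ df2$TotExp)
summary(lm2)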
##
## Call:
## lm(formula = df2$LifeExp ~ df2$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## df2$TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
The model has improved significantly. The low p-values demonstrate that the variables are still statistically significant. The \(R^2\) is much higher, with an explained variance of 0.7298. The F-statistic (507.7) is much higher than in the previous model, which provides stronger evidence against the null hypothesis that there is no relationship between LifeExp and TotExp.
plot(df2$TotExp,df2$LifeExp ,xlab="Personal and Government Expenditures"
, ylab="Life Expectancy(Years)")
abline(lm2)

Overall, the residuals are closer to normal than in the original model. The Normal Q-Q plot shows that the residuals follow a less skewed distribution. In the Scale-Location plot, the residuals appear to be more evenly distributed. In the Residuals vs Leverage plot, no residuals have very high leverage or influence.
We can now forecast life expectancy when TotExp^0.06 = 1.5 and when TotExp^0.06 = 2.5.
\[ \widehat{LifeExp^{4.6}} = -736527910 + 620060216 * TotExp^{0.06} \]
When TotExp^0.06 = 1.5, the predicted life expectancy is:
## [1] 63.31153
When TotExp^.06 =2.5, life expectancy is:
## [1] 86.50645
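The two values above come from evaluating the fitted line on the transformed scale and then undoing the LifeExp^4.6 transformation; a sketch of the arithmetic:

# evaluate the fitted line, then take the 4.6th root to return to years
b0 <- -736527910
b1 <- 620060216
(b0 + b1 * 1.5)^(1 / 4.6)  # ~63.31
(b0 + b1 * 2.5)^(1 / 4.6)  # ~86.51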
I will now build the following multiple regression model:
\[ LifeExp = b_0 + b_1 * PropMD + b_2 * TotExp + b_3 * PropMD * TotExp \]
df$md_exp <- df$PropMD * df$TotExp  # interaction term: PropMD * TotExp
lm4 <- lm(df$LifeExp ~ df$PropMD + df$TotExp + df$md_exp, data = df)
summary(lm4)
##
## Call:
## lm(formula = df$LifeExp ~ df$PropMD + df$TotExp + df$md_exp,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## df$PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## df$TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## df$md_exp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The p-values demonstrate that all variables are statistically significant. The standard error is high only for PropMD (the proportion of the population who are MDs). Similar to the first model, the \(R^2\) is low, with an explained variance of 0.3574. The F-statistic is about 34 (we would like this to be higher); it is likely lower than in the previous two models in part because the F-statistic accounts for the number of predictors, so adding terms that contribute little additional explained variance pulls it down.