Read in our CSV data.
# TODO: replace local file path with GitHub URL
df <- read.csv("~/CUNY/computationalMath605/data/real-world-data.csv")
head(df)
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp
## 1 0.000572294 20 92 112
## 2 0.004614439 169 3128 3297
## 3 0.002091362 108 5184 5292
## 4 0.003500000 2589 169725 172314
## 5 0.001146162 36 1620 1656
## 6 0.002773810 503 12543 13046
library(ggplot2) # provides ggplot(), aes() and geom_point()
ggplot(df, aes(x=LifeExp, y=TotExp)) + geom_point()
Let’s run a simple linear regression on these two variables using the
R functions lm and summary.
linModel <- lm( LifeExp~TotExp, df)
summary(linModel)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The adjusted \(R^2\) value states
that \(25.37\%\) of the variance in our output variable
(LifeExp) can be explained by our input variable
(TotExp). The p-value (\(7.71 \times 10^{-14}\)) is the probability of
seeing a slope estimate at least this far from zero if the true slope
were zero, so the relationship is statistically significant even though
it explains only about a quarter of the variance. The
F-statistic value compares our model to a model with only
an intercept parameter. Because our model adds just one parameter, the
F-test is equivalent to the t-test on the slope
(\(8.079^2 \approx 65.3\)), which is why the two p-values match. The
Std. Error column shows the standard error of each coefficient
estimate (Intercept and TotExp).
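As a minimal sketch of where these numbers live, the quantities discussed above can be pulled out of the summary object directly. The built-in cars dataset and the names fit, s, adj_r2 and so on are stand-ins chosen for illustration (assumptions, not part of our data), so the values will differ from those printed above.

```r
# Stand-in data: R's built-in `cars` dataset (an assumption for illustration)
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

adj_r2  <- s$adj.r.squared               # adjusted R-squared
coefs   <- coef(s)                       # columns: Estimate, Std. Error, t value, Pr(>|t|)
t_slope <- coefs["speed", "t value"]     # t statistic of the slope
p_slope <- coefs["speed", "Pr(>|t|)"]    # p-value of the slope
f_stat  <- unname(s$fstatistic["value"]) # overall F-statistic

# Overall model p-value, computed from the F distribution
p_model <- unname(pf(f_stat, s$fstatistic["numdf"], s$fstatistic["dendf"],
                     lower.tail = FALSE))

# With a single predictor, F equals the squared t value of the slope
all.equal(f_stat, t_slope^2)
```

With one predictor, p_model and p_slope describe the same test, which is why the summary output repeats the same p-value in two places.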
res <- resid(linModel) # alternatively, linModel$residuals
hist(res)
Finally, we can generate a Q-Q plot to check the regression assumption
that our residuals are normally distributed.
# Create Q-Q plot of our residuals
qqnorm(res)
qqline(res)
The points in this plot do not follow the reference line, so the
residuals are not approximately normally distributed. In this case a
linear regression model does not fit our data well.
df$LifeExpRaised <- df$LifeExp ^ 4.6
df$TotExpRaised <- df$TotExp ^ 0.06
head(df)
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp LifeExpRaised TotExpRaised
## 1 0.000572294 20 92 112 29305338 1.327251
## 2 0.004614439 169 3128 3297 327935478 1.625875
## 3 0.002091362 108 5184 5292 327935478 1.672697
## 4 0.003500000 2589 169725 172314 636126841 2.061481
## 5 0.001146162 36 1620 1656 26230450 1.560068
## 6 0.002773810 503 12543 13046 372636298 1.765748
# Plotting our variables raised to their respective powers
ggplot(df, aes(x=LifeExpRaised, y=TotExpRaised)) + geom_point()
The transformed data look to have a much more linear relationship. Let’s run
our linear model on this data and then interpret our summary
statistics.
linModelRaised <- lm( LifeExpRaised~TotExpRaised, df)
summary(linModelRaised)
##
## Call:
## lm(formula = LifeExpRaised ~ TotExpRaised, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpRaised 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
In this case, we see a much lower p-value (below \(2.2 \times 10^{-16}\) rather than on the order of \(10^{-14}\)), but both are
effectively zero, so the p-value alone does not tell the two models apart.
The more informative change is in \(R^2\), which rises from \(0.2577\) to
\(0.7298\): the transformed model explains about \(73\%\) of the variance
in LifeExpRaised. The F-statistic is also higher (\(507.7\) versus
\(65.26\)), meaning this model improves on an intercept-only model by a
much larger margin than our linModel does. The standard
errors of our coefficients are larger, but that is because the scale of
our data is much larger after raising the LifeExp
variable to a higher power, so they cannot be compared directly with those of linModel.
Overall, our linModelRaised is a “better” model, as it is
modelling data that visually looks more linear. This results in a better
fit from a linear model.
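Using R’s built-in predict function, we can pass our
linModelRaised object as well as a data frame of our new
TotExpRaised values. The predictions are returned on the LifeExpRaised
scale (LifeExp raised to the power 4.6).
# Predicting LifeExpRaised for new TotExpRaised values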
newExp <- data.frame(TotExpRaised=c(1.5, 2.5))
predict(linModelRaised, newdata = newExp)
## 1 2
## 193562414 813622630
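Because both variables were raised to powers, these predictions are easier to read once converted back to the original units. The sketch below copies the two predicted values from the output above and inverts the transformations; the names pred_raised, pred_life_exp and tot_exp are hypothetical, introduced only for this illustration.

```r
# Predictions copied from the output above (LifeExp^4.6 scale)
pred_raised <- c(193562414, 813622630)

# Undo the power transformation: (LifeExp^4.6)^(1/4.6) = LifeExp
pred_life_exp <- pred_raised^(1 / 4.6)

# The inputs were TotExp^0.06, so the implied TotExp values are
tot_exp <- c(1.5, 2.5)^(1 / 0.06)

round(pred_life_exp, 1)
round(tot_exp)
```

On the original scale the two predictions come out at about 63.3 and 86.5 years of life expectancy.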
\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)
We can again use lm for this model, as it accepts more than
one predictor variable (multiple regression). We store the fit as mlm.
# Storing our last term as a product column
df$mDExp <- df$PropMD * df$TotExp
# build the multiple regression model
mlm <- lm(LifeExp ~ PropMD + TotExp + mDExp, df)
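R’s formula syntax can also build the product term for us: in a formula, a * b expands to a + b + a:b, where a:b is the interaction. The sketch below uses the built-in mtcars dataset as a stand-in (an assumption for illustration, as are the names cars_df, manual_fit and formula_fit) and checks that both approaches give the same coefficients.

```r
# Stand-in data: R's built-in `mtcars` dataset (an assumption for illustration)
cars_df <- mtcars

# Approach 1: store the product as its own column, as done above
cars_df$wt_hp <- cars_df$wt * cars_df$hp
manual_fit <- lm(mpg ~ wt + hp + wt_hp, data = cars_df)

# Approach 2: let the formula expand wt * hp into wt + hp + wt:hp
formula_fit <- lm(mpg ~ wt * hp, data = cars_df)

# Same estimates; only the name of the last coefficient differs
all.equal(unname(coef(manual_fit)), unname(coef(formula_fit)))

# With the formula version, predict() only needs the raw predictors
predict(formula_fit, newdata = data.frame(wt = 3, hp = 150))
```

One practical advantage of the formula version is that new data passed to predict does not need a hand-built product column.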
Again, let’s look at and interpret our summary statistics for this model.
summary(mlm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + mDExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## mDExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
mresid <- resid(mlm)
hist(mresid)
Forecast LifeExp when \(PropMD=0.03\) and \(TotExp = 14\). Does this forecast seem
realistic? Why or why not?
Again, we can use the predict method here for these new
input values.
multiNewData <- data.frame(PropMD=0.03, TotExp = 14)
multiNewData$mDExp <- multiNewData$PropMD * multiNewData$TotExp
# Predicting with our given values
predict(mlm, multiNewData)
## 1
## 107.696
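While this predicted value is possible, it is too high to be realistic: a life expectancy of almost 108 years is well above the values in our data (the highest in the rows shown earlier is 82), and we know how rare it is for people to live to over 100 years. The inputs are also an unusual combination. A PropMD of 0.03 is roughly nine times the largest value in the rows shown earlier, while a TotExp of 14 is far below the smallest, so the model is extrapolating beyond the data it was fitted on. Nearly all of the increase over the intercept comes from the PropMD term (\(1497 \times 0.03 \approx 44.9\) years).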
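To see where the forecast comes from, the prediction can be rebuilt by hand from the coefficients printed in the summary above. This is a sketch using the rounded estimates, so it matches the predict output only to about two decimal places; the names b0 to b3 and terms are hypothetical, introduced for this illustration.

```r
# Rounded coefficient estimates copied from summary(mlm) above
b0 <- 62.77       # (Intercept)
b1 <- 1497        # PropMD
b2 <- 7.233e-05   # TotExp
b3 <- -6.026e-03  # mDExp (PropMD x TotExp)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast
terms <- c(intercept   = b0,
           prop_md     = b1 * prop_md,
           tot_exp     = b2 * tot_exp,
           interaction = b3 * prop_md * tot_exp)

terms
sum(terms)  # close to the 107.696 returned by predict()
```

The TotExp and interaction terms contribute almost nothing here, so the forecast is essentially the intercept plus the PropMD effect.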
hist(resid(mlm))
Looking at the above, the residuals from our multiple regression model appear to be left-skewed, which departs from the assumption of normally distributed residuals. We can also see a pattern when we plot our residuals against our fitted values below, which suggests that the assumption of homoscedasticity (constant residual variance) is violated as well. This could result in a model that does not produce an ideal fit. In part, the data we’re working with here were not raised to powers as in step 2, resulting in a poorer fit from our linear model.
plot(fitted(mlm), resid(mlm))
We can also call plot on our model object, which dispatches to the
plot.lm function and generates these diagnostic graphs with a single
function call.
plot(mlm)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
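By default plot.lm draws its diagnostic plots one after another. A minimal sketch of two common ways to control this is given below, again using the built-in mtcars dataset and the hypothetical name fit as stand-ins for our model: par(mfrow = ...) lays the plots out on one page, and the which argument selects individual plots.

```r
# Stand-in model on R's built-in `mtcars` dataset (an assumption for illustration)
fit <- lm(mpg ~ wt * hp, data = mtcars)

# Show the default diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(fit)
par(old_par)  # restore the previous graphics settings

# Draw only selected plots: 1 = residuals vs fitted, 2 = normal Q-Q
plot(fit, which = 1)
plot(fit, which = 2)
```

Restoring old_par afterwards keeps later plots in the document from inheriting the grid layout.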