library(tidyverse)
library(lmtest)
The data used to complete this assignment is imported from a
.csv file and summarized in the R cell below:
df <- read.csv('who.csv')
summary(df)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
## GovtExp TotExp
## Min. : 10.0 Min. : 13
## 1st Qu.: 559.5 1st Qu.: 584
## Median : 5385.0 Median : 5541
## Mean : 40953.5 Mean : 41696
## 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :476420.0 Max. :482750
A data dictionary of the columns included the dataset is provided below:
| Column Name | Description |
|---|---|
Country |
Name of the country. |
LifeExp |
Average life expectancy for the country in years. |
InfantSurvival |
Proportion of those surviving to one year or more. |
Under5Survival |
Proportion of those surviving to five years or more. |
TBFree |
Proportion of the population without TB. |
PropMD |
Proportion of the population who are MDs |
PropRN |
Proportion of the population who are RNs |
PersExp |
Mean personal expenditures on healthcare in US dollars at average exchange rate. |
GovtExp |
Mean government expenditures per capita on healthcare, US dollars at average exchange rate. |
TotExp |
Sum of personal and government expenditures. |
Provide a scatterplot of LifeExp~TotExp, and run simple
linear regression. Do not transform the variables. Provide and interpret
the \(F\) statistics, \(R^2\), standard error, and \(p\)-values only. Discuss whether the
assumptions of simple linear regression met.
A scatter plot of LifeExp and TotExp is
provided below:
ggplot(data = df, aes(x=TotExp, y=LifeExp)) +
geom_point()
Furthermore, the cell below creates a linear model in relating these two variables:
lin_reg <- lm(LifeExp ~ TotExp, data=df)
summary(lin_reg)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The \(p\)-value for
TotExp model are less than 0.01, which indicates that
TotExp is a statistically significant predictor of
LifeExp. Furthermore, the \(p\)-value of the intercept indicates that
the intercept term (the value of LifeExp when
TotExp is 0) is statistically different than 0. The \(F\) statistic value is unhelpful at face
value, but it used to determine the \(p\)-value of the overall model. Under the
hood lm uses the \(F\)
distribution at the provided degrees of freedom, and calculates the area
under that distribution where the density is less than the calculated
\(F\) statistic. This area is the
provided \(p\) value (can be done for
this case in R via the following command
pf(2.326, df = 1, df2 = 188, lower.tail = FALSE)).
While the model is presented as being statistically significant, it
can definitely improved upon. The \(R^2\) value is given as 0.2577, meaning
that the model only accounts for ~26% of the variance contained within
LifeExp. Furthermore, the standard error value of 9.371
tells us that the predicted values of LifeExp by the model
are on average $$9.371 years different from actual figures provided in
the data.
Its also important to check that the actual assumptions of linear regression are met, seeing as usage of the model might provide misleading predictions if this were the case. The assumptions of linear regression are given below:
By visual analysis of the previously created scatterplot, it is easy
to see that there does not exist a linear relationship between
LifeExp and TotExp. The structure of the
scatterplot looks like these variables looks much more exponential in
nature, which is something that could be fixed via a logarithmic
transformation. As such, we conclude that the assumptions of linear
regression are not met, without having to even check the rest.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
The cell below performs the desired transformation of the variables
by creating two new columns, LifeExpT and
TotExpT:
xT <- 0.06
yT <- 4.6
df <- df %>%
mutate(LifeExpT = LifeExp^yT,
TotExpT = TotExp^xT)
Note that after performing these transformations, the scatter plot of the newly created columns provides a structure that is much more linear in nature compared to the previous plot:
ggplot(data = df, aes(x=TotExpT, y=LifeExpT)) +
geom_point()
It is clear that our linear relationship assumption is now met.
The cell below creates a new linear regression model using these transformed fields:
lin_regT <- lm(LifeExpT ~ TotExpT, data=df)
summary(lin_regT)
##
## Call:
## lm(formula = LifeExpT ~ TotExpT, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExpT 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Once again, we see a \(F\)-statistic
value that results in \(p\)-value less
than 0.0, indicating that TotExpT is a statistically
significant predictor of LifeExpT. The difference in this
case however, is that is it clear that transforming the data had a
positive impact on the model. The new \(R^2\) value is 0.7263, meaning that ~73% of
the variance seen within LifeExpT can be explained by
TotExpT. It is worth noting that the standard error value
is quite high, but it is still not an unreasonable percentage of the max
predicted LifeExpT value.
Using the results from 3, forecast life expectancy when
TotExp^.06 =1.5. Then forecast life expectancy when
TotExp^.06=2.5.
The cell below uses the predict function to forecast
life expectancy for the two values of TotExpT:
pred_x <- c(1.5, 2.5)
pred_x <- as.data.frame(pred_x)
colnames(pred_x) <- 'TotExpT'
preds <-
cbind(pred_x^(1/xT), as.data.frame((predict(lin_regT, pred_x)^(1/yT))))
colnames(preds) <- c('x', 'pred_y')
preds
## x pred_y
## 1 860.705 63.31153
## 2 4288777.125 86.50645
Note that the given values and their predictions in this case each
had to be transformed once more to undo the previously applied
exponential transformations (raising to the 1/0.06 power for
x, and to the 1/4.6 power for pred_y). The
data above can be interpreted as follows:
The linear model predicts that if the sum of a country’s per capita personal and government expenditure on healthcare per year is $860.71 per year, then the average life expectancy in that country, is about 63.3 years old.
If that total expenditure is $4,288,777.125, then the average life expectancy is predicted to be 86.5 years old. It’s worth nothing that this is quite an unlikely scenario, as this represents an extremely high per capita healthcare cost.
Build the following multiple regression model and interpret the \(F\) Statistics, \(R^2\), standard error, and \(p\)-values. How good is the model?
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
The regression model is built and summarized in the cell below:
lin_regM <- lm(LifeExp ~ PropMD + TotExp + (PropMD*TotExp), data = df)
summary(lin_regM)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
This model does seem to perform better than the first model (higher \(R^2\) value, lower standard error), but it still does not come close to the \(R^2\) value achieved from the model that utilized the logarithmic transformations.
Forecast LifeExp when PropMD=.03 and
TotExp = 14. Does this forecast seem realistic? Why or why
not?
The forcasts are made in the cell below:
pred_x <- cbind(0.03, 14)
pred_x <- as.data.frame(pred_x)
colnames(pred_x) <- c('PropMD', 'TotExp')
preds <-
cbind(pred_x, as.data.frame((predict(lin_regM, pred_x))))
colnames(preds) <- c('PropMD', 'TotExp', 'pred_y')
preds
## PropMD TotExp pred_y
## 1 0.03 14 107.696
Based on the data above, the linear model predicts that if 3% of a
country’s residents are doctors, and the per capita combined personal
and government expenditure on healthcare is $14 a year, that the average
life expectancy in that country will be ~107.7 years old. While the
inflated estimate is likely due to the fact that the model places
emphasis on the relatively high 0.03 PropMd value (the
average value for this field in the dataset is ~0.002), it is further
made to seem unreasonable by the fact that we are using such a small
value of TotExp.