library(tidyverse)
library(janitor)
# load data
df <- read_csv("/Users/justinwilliams/Documents/CUNY SPS/605/assignments/data/who.csv") %>%
clean_names()
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Variables include:
Country:
name of the countryLifeExp:
average life expectancy for the country in
yearsInfantSurvival:
proportion of those surviving to one
year or moreUnder5Survival:
proportion of those surviving to five
years or moreTBFree:
proportion of the population without TB.PropMD:
proportion of the population who are MDsPropRN:
proportion of the population who are RNsPersExp:
mean personal expenditures on healthcare in US
dollars at average exchange rateGovtExp:
mean government expenditures per capita on
healthcare, US dollars at average exchange rateTotExp:
sum of personal and government
expenditures.Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \(R^2\), standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Scatterplot function
sct_plot <- function(df, col1, col2, xlab, ylab, title="Scatterplot",
title_size = 18) {
df %>%
ggplot(aes(!!sym(col1), !!sym(col2))) +
geom_point() +
labs(title = title,
y = ylab,
x = xlab) +
theme(plot.title.position = "plot",
plot.title = element_text(hjust = 0.5, size = title_size))
}
sct_plot(df, 'tot_exp', 'life_exp',"Total Expenditures",
"Life Expectancy (yrs)",
"Life Expectancy as a function of Total Expenditures")
Run simple regressions on life_exp
~
tot_exp
# slr model
lm1 <- lm(life_exp ~ tot_exp, data = df)
# summary
summary(lm1)
##
## Call:
## lm(formula = life_exp ~ tot_exp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## tot_exp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
The standard error is extremely small here, it shows the statistical
\(standard error\) for the
tot_exp
coefficient. For a good model we would like to see
a standard error that is at least five to ten times smaller than the
corresponding coefficient.
In this case:
# define std error
std_err <- 7.795e-06
# define coefficient estimate
est <- 6.297e-05
# divide by std error
est/std_err
## [1] 8.078255
Here we can see the standard error is about ~8 smaller than the coefficient estimate, which generally indicates a good model.
The F-statistic
is 65.26 which compares the current
model to to a model that has only a single additional parameter.
However, because this model already has a single parameter, it gives the
same information as the line for slope estimate, it is the \(t\) value squared. Therefore, since the
F-statistic
is large and the associated p-value is small,
we reject the null hypothesis and conclude that the coefficient is
significantly different then zero, which indicates there IS a linear
relationship between tot_exp
and life_exp
For the predictor tot_exp
, the \(t\)-value is 8.079 which
is the ratio of the coefficient estimate to the standard error. This
then gives us the p-value labeled as \(Pr(>|t|)\) which shows the probability
of observing a test statistic as extreme or more extreme as the one
observed (assuming there is no linear relationship between the predictor
and response variables). Because this p-value is so low,
tot_exp
is a highly significant predictor of
life_exp
.
The \(R^2\) is pretty low with a
score of 0.2577, tot_exp
only accounts for ~25% of the
variation in life_exp
.
Assumptions of linear regressions can be checked by doing a residual analysis.
par(mfrow=c(2,2))
plot(lm1)
Here in the residuals vs. fitted chart we can see the residuals are NOT fitted around zero. This is an additional sign (besides small \(R^2\)) that we have produced a poor model.
When looking at the Q-Q plot we see some obvious nonlinearities, leading us to be cautious about the residuals being normally distributed.
Let’s check out a histogram.
hist(lm1$residuals)
Here we can clearly see our residuals are NOT normal.
Overall we can conclude that, we have a poor model.
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better?”
# transform
df$life_exp_pwr <- df$life_exp^4.6
df$tot_exp_pwr <- df$tot_exp^0.06
Re-plot
sct_plot(df, 'tot_exp_pwr', 'life_exp_pwr',"Total Expenditures ^ 0.06",
"Life Expectancy ^ 4.6",
"Life Expectancy as a function of Total Expenditures")
Re-run model
# transformed model
lm2 <- lm(life_exp_pwr ~ tot_exp_pwr, data = df)
#summary
summary(lm2)
##
## Call:
## lm(formula = life_exp_pwr ~ tot_exp_pwr, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## tot_exp_pwr 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
After transforming the variables, we can have a much better model.
The \(R^2\) here is 0.7298, which
accounts for almost 73% of the variation in life_exp_pwr
.
We can see a much larger F-statistic
and a much smaller
p-value, this is also an indicator of a better model.
Let’s do the same coefficient/standard error check.
# assign variables
std_err2 <- 27518940
est2 <- 620060216
# compute
est2/std_err2
## [1] 22.53213
Here we can see the standard error is almost 23 times smaller then the coefficient estimate. Typically a ratio of 5 - 10 times smaller indicates a decent model. This indicates a better then decent model.
Let’s look at some diagnostic plots.
par(mfrow=c(2,2))
plot(lm2)
Much more normal looking, we can see the residuals are more gathered around zero and the Q-Q plot has a much straighter line.
hist(lm2$residuals)
In short, the transformation produced a model that predicts
life_exp
much better then the non-transformed version.
However, due to the power transformation, interpretability.
Using the results above, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
This can be gleaned from the following formula:
\[\widehat{LifeExp4.6}=\beta_0+\beta_1 \times TotExp0.06\]
# write function
forecast_resp <- function(predictor,beta_0, beta_1) {
(beta_0 + (beta_1 * predictor))^(1/4.6)
}
For 1.5.
# tot_exp^0.06 = 1.5
forecast_resp(1.5, -736527910, 620060216)
## [1] 63.31153
Let’s try this for 2.5
forecast_resp(2.5, -736527910, 620060216)
## [1] 86.50645
Build the following multiple regression model and interpret the F Statistics, \(R^2\), standard error, and p-values. How good is the model?
\[\widehat{LifeExp}=\beta_0+\beta_1 \times PropMd+\beta_2 \times TotExp+\beta_3 \times PropMD \times TotExp\]
# multiple regression
lm4 <- lm(life_exp ~ prop_md + tot_exp + prop_md*tot_exp, data = df)
# summary
summary(lm4)
##
## Call:
## lm(formula = life_exp ~ prop_md + tot_exp + prop_md * tot_exp,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## prop_md 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## tot_exp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## prop_md:tot_exp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F-statistic
is much lower then the other models
indicating that it is not as good of a fit. However, the \(R^2\) value is a touch higher than the
first model with a single predictor, so it explains more of the
variance, roughly ~36%. The p-values are all significant and very small,
therefore we can say that each of these coefficients are highly
predictive of life_exp
. However, I would not conclude that
this is a good model, due to its overall low \(R^2\).
Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
# define function
for_mult_resp <- function(pred_1, pred_2, beta_0, beta_1, beta_2, beta_3) {
(beta_0 + (beta_1 * pred_1) + (beta_2 * pred_2) - (beta_3 * pred_1 * pred_2))^(1/4.6)
}
# compute forecast
for_mult_resp(.03, 14, 6.277e+01, 1.497e+03,7.233e-05,-6.026e-03)
## [1] 2.765487
This does not seem realistic, because how could life_exp
be 2.765487 years?