The attached who.csv dataset contains real-world data from
2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or
more
Under5Survival: proportion of those surviving to five years or
more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at
average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US
dollars at average exchange rate
TotExp: sum of personal and government expenditures.
Loading the Data and Minor Inspection
The data is loaded into R below.
A per-column count of NA values shows that the data contains no missing values.
library(ggplot2)
library(kableExtra)
library(readr)
library(tidyverse)
url <- "https://raw.githubusercontent.com/greggmaloy/Data_605/main/who.csv"
data <- read_csv(url)
df = data
head(data) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 42 | 0.835 | 0.743 | 0.99769 | 0.0002288 | 0.0005723 | 20 | 92 | 112 |
| Albania | 71 | 0.985 | 0.983 | 0.99974 | 0.0011431 | 0.0046144 | 169 | 3128 | 3297 |
| Algeria | 71 | 0.967 | 0.962 | 0.99944 | 0.0010605 | 0.0020914 | 108 | 5184 | 5292 |
| Andorra | 82 | 0.997 | 0.996 | 0.99983 | 0.0032973 | 0.0035000 | 2589 | 169725 | 172314 |
| Angola | 41 | 0.846 | 0.740 | 0.99656 | 0.0000704 | 0.0011462 | 36 | 1620 | 1656 |
| Antigua and Barbuda | 73 | 0.990 | 0.989 | 0.99991 | 0.0001429 | 0.0027738 | 503 | 12543 | 13046 |
apply(df, 2, function(col) sum(is.na(col))) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| Variable | NA count |
|---|---|
| Country | 0 |
| LifeExp | 0 |
| InfantSurvival | 0 |
| Under5Survival | 0 |
| TBFree | 0 |
| PropMD | 0 |
| PropRN | 0 |
| PersExp | 0 |
| GovtExp | 0 |
| TotExp | 0 |
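A more compact way to get the same per-column NA count is `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`toy`, which is not the WHO data) purely for illustration.

```r
# Hypothetical toy data frame for illustration only, not the WHO dataset
toy <- data.frame(
  Country = c("A", "B", "C"),
  LifeExp = c(70, NA, 65),
  TotExp  = c(100, 200, NA)
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts
```

On the real data, `colSums(is.na(df))` should reproduce the table of zeros above.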
QUESTION 1
1. Provide a scatterplot of LifeExp~TotExp, and run simple
linear regression. Do not transform the variables. Provide and interpret
the F statistics, R^2, standard error, and p-values only. Discuss whether
the assumptions of simple linear regression are met.
Scatter Plot
The points in the scatter plot do not follow a straight line, so the relationship between TotExp and LifeExp does not appear linear.
ggplot(data, aes(x = TotExp, y = LifeExp)) +
geom_point() +
labs(x = "Total Expenditure", y = "Life Expectancy", title = "Scatterplot of Life Expectancy vs Total Expenditure") +
theme_minimal()
Linear Model
Below is the model.
model <- lm(LifeExp ~ TotExp, data = data)
#LM summary stats
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
F Statistics
Value: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Interpretation: The F-statistic tests whether the model with
TotExp explains significantly more of the variance in LifeExp than an
intercept-only model. A value of 65.26 on 1 and 188 degrees of freedom
is large, and the associated p-value (7.714e-14) is far below 0.05, so
the relationship between LifeExp and TotExp is statistically
significant. Significance alone does not show that the model fits well.
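As a check on the reported value: in a simple regression with one predictor, the F-statistic can be recovered from R^2 or from the slope's t-value. The figures below are the rounded values from the summary output, and n = 190 is inferred from the 188 residual degrees of freedom.

```latex
F = \frac{R^2 / 1}{(1 - R^2)/(n - 2)} = \frac{0.2577}{0.7423 / 188} \approx 65.27,
\qquad
F = t^2 = 8.079^2 \approx 65.27
```

Both agree with the reported 65.26 up to rounding of the inputs.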
R-Squared
Multiple R-squared: A multiple R^2 value of 0.2577 denotes that
approximately 25.77% of the variance in LifeExp can be explained by
TotExp. Although we know the model is statistically significant, ~75% of
the variance of LifeExp is not accounted for in the model, indicating
the model could be improved.
Adjusted R-squared: The adjusted R^2 of 0.2537 is nearly
identical to the multiple R^2 of 0.2577. Adjusted R^2 penalizes
additional predictors; with only one predictor in the model the penalty
is negligible, so it adds little information here.
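The small gap between the two values follows from the usual adjustment formula, shown here with the rounded R^2 from the output, n = 190 observations (inferred from the residual degrees of freedom), and p = 1 predictor:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - 0.7423 \times \frac{189}{188}
          \approx 0.2538
```

This matches the reported 0.2537 up to rounding of R^2.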
Coefficient Standard Error and P-Value
P-Value: The p-value for ‘TotExp’ (7.71e-14) is statistically
significant (p-value < 0.05), so the slope differs significantly from
zero. The coefficient estimate indicates that for each one-unit (US
dollar) increase in ‘TotExp’, ‘LifeExp’ increases by 0.00006297 years
on average, which is quite small.
SE: The standard error of the TotExp coefficient is small (SE = 7.795e-06) relative to the estimate (6.297e-05). Their ratio is the t-value (8.079), which indicates that the slope is estimated precisely and that TotExp is a statistically significant predictor of LifeExp.
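The link between the estimate, its standard error, the t-value, and the p-value can be reproduced by hand. This is a minimal sketch that uses the rounded numbers printed in the summary above, so the results agree with the output only up to rounding.

```r
# Rounded values copied from the summary(model) output
estimate  <- 6.297e-05
std_error <- 7.795e-06
df_resid  <- 188

# t-value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Effect of a 1,000-dollar increase in TotExp, in years of LifeExp
effect_per_1000 <- estimate * 1000

c(t_value = t_value, p_value = p_value, effect_per_1000 = effect_per_1000)
```

Rescaling the slope to a 1,000-dollar change (about 0.063 years) makes the small coefficient easier to read.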
Residual Standard Error
RSE: The residual standard error of 9.371 is large, since it is
measured in years of life expectancy: predictions typically miss by
about 9 years. The 1st and 3rd quartile residuals tell a similar story
(1st q = -4.778 vs 3rd q = 7.116). Generally speaking, when residuals
are normally distributed the RSE is approximately 1.5 times the
magnitude of the 1st and 3rd quartile residuals, and the quartiles are
roughly symmetric around a median near zero. Here 1.5 times the
quartiles gives about 7.2 and 10.7, which bracket the RSE unevenly, and
the median residual is 3.154 rather than zero, so the residuals appear
skewed.
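The 1.5 times rule of thumb comes from the normal distribution, whose quartiles sit about 0.674 standard deviations from the mean (and 1 / 0.674 is roughly 1.5). The sketch below compares the quartiles a normal distribution with this RSE would produce against the observed residual quartiles copied from the summary output.

```r
# Residual standard error and residual quartiles from summary(model)
rse <- 9.371
observed <- c(q1 = -4.778, median = 3.154, q3 = 7.116)

# Quartiles expected if residuals were normal with mean 0 and sd = rse
expected_q3 <- qnorm(0.75) * rse
expected <- c(q1 = -expected_q3, median = 0, q3 = expected_q3)

round(rbind(observed, expected), 2)
```

The observed quartiles are shifted upward relative to the expected ones, which is consistent with skewed residuals.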
LM Assumptions
Linearity: The residuals versus fitted plot shows a clear
pattern, which demonstrates that the relationship is not linear.
Homoscedasticity (Equal Variance): This assumption is violated
according to the residuals versus fitted plot. The residuals are not
spread evenly around zero across the range of fitted values.
Normality of Errors: The points in the Q-Q plot do not follow a
straight line, denoting that the residuals are not normally
distributed.
par(mfrow=c(2,2))
plot(model)
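The Q-Q plot can be complemented with a formal normality test. The sketch below is not run on the WHO data: it simulates a hypothetical data set with skewed errors to show how `shapiro.test()` from base R would be applied to model residuals. In the real analysis, `residuals(model)` would take the place of `residuals(toy_fit)`.

```r
# Simulated data for illustration only (assumption: not the WHO dataset)
set.seed(605)
x <- runif(190, min = 0, max = 10)
y <- 2 + 0.5 * x + rexp(190)   # exponential errors are right-skewed

toy_fit <- lm(y ~ x)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(toy_fit))
sw$p.value
```

A formal test is sensitive to sample size, so it is best read alongside the diagnostic plots rather than instead of them.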
LM Conclusion
Although the model is statistically significant, the high RSE (~9 years)
is the strongest indicator that it is a poor fit. The asymmetric 1st and
3rd quartile residuals and the non-zero median residual also suggest
that the residuals are not normally distributed, which violates the
normality of errors assumption of the LM. Furthermore, the multiple R^2
of 0.2577 indicates that only approximately 25.77% of the variance in
LifeExp can be explained by TotExp, so ~75% of the variance is not
explained.
The residual analysis using the above plots further confirms that the residuals are not normally distributed and that homoscedasticity is violated.
QUESTION 2
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6).
Raise total expenditures to the 0.06 power (nearly a log transform,
TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run
the simple regression model using the transformed variables. Provide and
interpret the F statistics, R^2, standard error, and p-values. Which
model is “better?”
Variable Transformation and Scatterplot
Power transformations were applied to the variables of interest. Raising
TotExp to the 0.06 power behaves almost like a log transformation, while
raising LifeExp to the 4.6 power stretches the upper end of the life
expectancy scale. The transformations were applied in order to:
1. Normalize the data, especially the LifeExp variable, by spreading out
values that are bunched together on the original ‘year’ scale.
2. Stabilize the variance of the residuals (fulfilling the
homoscedasticity LM assumption) and bring them more in line with a
normal distribution.
3. Establish a more linear relationship.
The resulting scatter plot appears to show a more linear relationship between the two variables than was present in the scatter plot of the untransformed variables.
#variable transformation
data <- data %>%
mutate(LifeExp_transformed = LifeExp^4.6, TotExp_transformed = TotExp^0.06)
#scatterplot
ggplot(data, aes(x = TotExp_transformed, y = LifeExp_transformed)) +
geom_point() +
labs(x = "Total Expenditure (Transformed)", y = "Life Expectancy (Transformed)",
title = "Scatterplot of Transformed Life Expectancy vs Transformed Total Expenditure") +
theme_minimal()
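The phrase "nearly a log transform" can be checked numerically. The sketch below takes the six TotExp values from the sample rows shown earlier (not the full dataset) and measures how closely `TotExp^0.06` tracks `log(TotExp)`.

```r
# TotExp values copied from the six sample rows of the head() output
tot_exp <- c(112, 3297, 5292, 172314, 1656, 13046)

power_scale <- tot_exp^0.06
log_scale   <- log(tot_exp)

# A correlation close to 1 means the two transformations are almost
# linearly related over this range
near_log_cor <- cor(power_scale, log_scale)
near_log_cor
```

The match is close because x^0.06 equals exp(0.06 * log(x)), which is nearly linear in log(x) when the exponent is small.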
Linear Model
model_transformed <- lm(LifeExp_transformed ~ TotExp_transformed, data = data)
# Display the model summary
summary_model_transformed <- summary(model_transformed)
print(summary_model_transformed)
##
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_transformed 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
F Statistics
Value: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Interpretation: The F-statistic increased dramatically after the
transformation, from ~65 to 507.7. Both p-values are statistically
significant. The transformed model's p-value is reported as < 2.2e-16,
which is the reporting floor R uses in summary output, so it is smaller
than the first model's 7.714e-14 but its exact size is not shown. The
larger F-statistic supports the notion that the transformed LM is a
better fit, although the two models have different response variables,
so the comparison is only indicative.
R-Squared
Multiple R-squared: A multiple R^2 value of 0.7298 denotes that
approximately 72.98% of the variance in LifeExp^4.6 can be explained by
TotExp^0.06. The transformed LM therefore explains about 47 percentage
points more variance than the first model, which explained only 25.77%
of the variance. This is a dramatic increase.
Adjusted R-squared: The adjusted R^2 of 0.7283 is nearly identical to the multiple R^2 of 0.7298. Since there is still only one predictor in the model, the adjustment is negligible and adds little information.
Coefficient Standard Error and P-Value
P-Value: The p-value for ‘TotExp_transformed’ is also
statistically significant (p-value < 0.05) in the transformed LM. It is
reported as < 2e-16, which is smaller than the 7.71e-14 of the first LM
and is consistent with this model being a better fit.
SE: The standard error of the slope is far larger than in the first LM (LM SE = 7.795e-06 vs transformed SE = 27,518,940), which is expected because the response is now on the scale of LifeExp^4.6. What matters is the size of the coefficient relative to its standard error, which is the t-value, and the t-value increased in the transformed LM (LM t-value = 8.079 vs transformed t-value = 22.53). The larger t-value denotes that in the transformed LM, TotExp is a stronger predictor of LifeExp than in the first LM.
Residual Standard Error
RSE: The residual standard error of 90,490,000 is far larger
than in the first model, which is expected because the response is now
LifeExp^4.6 and the RSE is in those transformed units rather than in
years. Compared to the 1st and 3rd quartile residuals (1st
q = -53,978,977 vs 3rd q = 59,139,231), 1.5 times the quartile
magnitudes gives roughly 81,000,000 and 89,000,000, so the RSE is close
to, but not exactly, the 1.5 times value we are seeking.
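Because the RSE of the transformed model is not in years, one way to judge its errors on the original scale is to back-transform the predictions with the 1/4.6 power. The sketch below is an illustration only: it applies the rounded coefficients from the summary output to the six sample rows shown earlier, not to the full dataset, and the variable names are chosen for this example.

```r
# Coefficients copied from summary(model_transformed)
intercept <- -736527910
slope     <- 620060216

# Six sample rows copied from the head() output
sample_rows <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Predict on the transformed scale, then undo the 4.6 power
pred_transformed <- intercept + slope * sample_rows$TotExp^0.06
pred_years       <- pred_transformed^(1 / 4.6)

# Residuals expressed in years of life expectancy
residual_years <- sample_rows$LifeExp - pred_years
round(cbind(sample_rows, pred_years, residual_years), 1)
```

On the full data, the same back-transformation applied to `fitted(model_transformed)` would allow a comparison in years with the first model's RSE.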
LM Assumptions
Linearity: The residuals versus fitted plot appears consistent
with a linear relationship and is a dramatic improvement when compared
to the untransformed LM. This assumption appears to be satisfied.
Homoscedasticity (Equal Variance): This assumption appears to
be fulfilled according to the residuals versus fitted plot. The
residuals are spread fairly evenly around zero across the range of
fitted values.
Normality of Errors: The Q-Q plot, although dramatically
improved, is not a straight line, denoting that the residuals are not
exactly normally distributed.
par(mfrow=c(2,2))
plot(model_transformed)
Log Transform LM Conclusion
The transformed model saw an increase in variance explained (~73% versus ~26%). The SE grew tremendously, but that was expected given the scale of LifeExp^4.6. The t-value for the slope increased (22.53 versus 8.079), denoting that the coefficient is estimated more precisely relative to its size. The p-value also became smaller in the transformed model, denoting increased statistical significance.
However, the residuals do not appear to be exactly normally distributed. The RSE of 90,490,000 is close to, but not exactly, 1.5 times the 1st and 3rd quartile residuals (1st q = -53,978,977 vs 3rd q = 59,139,231). Furthermore, although we were not asked to interpret the median, the median residual (13,697,187) is not centered at zero and sits closer to the 3rd quartile, indicating skewed residuals. The Q-Q plot provides further evidence that the residuals are not normally distributed, so the model could still be improved.
QUESTION 3
Using the results from Question 2, forecast life expectancy when
TotExp^.06 =1.5. Then forecast life expectancy when
TotExp^.06=2.5.
intercept <- coef(model_transformed)[1]
slope <- coef(model_transformed)[2]
predicted_lifeExp_1_5 <- intercept + slope * 1.5
predicted_lifeExp_2_5 <- intercept + slope * 2.5
actual_lifeExp_1_5 <- predicted_lifeExp_1_5^(1/4.6)
actual_lifeExp_2_5 <- predicted_lifeExp_2_5^(1/4.6)
actual_lifeExp_1_5
## (Intercept)
## 63.31153
actual_lifeExp_2_5
## (Intercept)
## 86.50645
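The two forecasts can be verified by hand from the rounded coefficients in the transformed model's summary. The prediction is made on the LifeExp^4.6 scale and then raised to the 1/4.6 power to return to years.

```latex
\begin{aligned}
\widehat{\text{LifeExp}^{4.6}}\big|_{1.5} &= -736527910 + 620060216 \times 1.5 = 193562414,
  &\quad 193562414^{1/4.6} &\approx 63.31 \\
\widehat{\text{LifeExp}^{4.6}}\big|_{2.5} &= -736527910 + 620060216 \times 2.5 = 813622630,
  &\quad 813622630^{1/4.6} &\approx 86.51
\end{aligned}
```

Both values agree with the R output above.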
QUESTION 4
Build the following multiple regression model and interpret the
F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
model_multiple <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = data)
summary(model_multiple)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
F Statistics
Value: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Interpretation: The F-statistic of 34.49 is lower than the first
model's 65.26 and far lower than the transformed model's 507.7, partly
because it is now spread over 3 numerator degrees of freedom. The
p-value is reported as < 2.2e-16, the same reporting floor as for the
transformed model, so the multiple LM is statistically significant
overall. The much larger F-statistic of the log transformed LM still
supports the notion that the log transformed LM is a better fit.
R-Squared
Multiple R-squared: A multiple R^2 value of 0.3574 denotes that
approximately 35.74% of the variance in LifeExp can be explained by
PropMD, TotExp, and their interaction. This is an improvement over the
first model (25.77%), but the log transformed LM (multiple R^2 = 0.7298)
accounts for about 37 percentage points more variance, so the multiple
LM explains dramatically less than the log transformed model.
Adjusted R-squared: The adjusted R^2 of 0.3471 is slightly below the multiple R^2 of 0.3574. Since this model has multiple predictors, adjusted R^2 is the more appropriate measure, and it indicates that the model accounts for about 34.71% of the variance in LifeExp after penalizing for the number of predictors.
Coefficient Standard Error and P-Value
P-Value: The p-values for all three terms (PropMD, TotExp, and
PropMD:TotExp) are statistically significant (p-value < 0.05) in the multiple LM.
SE: The standard errors for all terms are small relative to their estimates, giving t-values of 5.371, 8.053, and -4.093. The slope t-value in the log transformed LM is higher (22.53), which indicates a stronger relationship between the variables in the log transformed LM than in the multiple LM.
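The negative interaction coefficient means that the effect of each predictor depends on the level of the other. Using the rounded estimates from the summary output, with b1, b2, and b3 as in the model equation above, the marginal effects are:

```latex
\frac{\partial\,\text{LifeExp}}{\partial\,\text{TotExp}} = b_2 + b_3\,\text{PropMD}
  = 7.233 \times 10^{-5} - 6.026 \times 10^{-3}\,\text{PropMD},
\qquad
\frac{\partial\,\text{LifeExp}}{\partial\,\text{PropMD}} = b_1 + b_3\,\text{TotExp}
  = 1497 - 6.026 \times 10^{-3}\,\text{TotExp}
```

Under these estimates the effect of TotExp shrinks as PropMD grows and would reach zero at PropMD of about 0.012, while the effect of PropMD would reach zero at TotExp of about 2.5 x 10^5.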
Residual Standard Error
RSE: The residual standard error of 8.765 is rather high, as it
represents 8.765 years of life expectancy. The 1st and 3rd quartile
residual values reinforce this (1st q = -4.132 vs 3rd q = 6.54,
median = 2.098). The median is not centered at zero and is skewed
toward the 3rd quartile, providing evidence that the residuals are not
normally distributed.
LM Assumptions
Linearity: The residuals versus fitted plot shows a clear
pattern, which demonstrates that the relationship is not linear.
Homoscedasticity (Equal Variance): This assumption is violated
according to the residuals versus fitted plot. The residuals are not
spread evenly around zero across the range of fitted values.
Normality of Errors: The Q-Q plot shows some minor skew,
denoting that the residuals are not exactly normally distributed.
par(mfrow=c(2,2))
plot(model_multiple)
Multiple LM Conclusion
This model is a better fit than the initial simple linear model, but
overall the log transformed LM has the best fit of the three models.
The next step would be to add additional predictors to the log
transformed LM.
QUESTION 5
5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this
forecast seem realistic? Why or why not?
intercept <- 6.277e+01
coeff_PropMD <- 1.497e+03
coeff_TotExp <- 7.233e-05
coeff_Interaction <- -6.026e-03
PropMD_value <- 0.03
TotExp_value <- 14
LifeExp_Pred <- intercept +
(coeff_PropMD * PropMD_value) +
(coeff_TotExp * TotExp_value) +
(coeff_Interaction * PropMD_value * TotExp_value)
LifeExp_Pred
## [1] 107.6785
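The predicted life expectancy of 107.68 years seems unrealistic. It is about 25 years higher than the highest life expectancy in the sample rows shown earlier (82), and the inputs are an unusual combination: PropMD = 0.03 is roughly nine times the largest PropMD in those rows (0.0033), while TotExp = 14 is below the smallest TotExp shown (112). The forecast is therefore an extrapolation outside that data, where the linear model cannot be trusted.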
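One quick way to judge whether a forecast is an extrapolation is to compare the new inputs with the range of the observed predictors. The sketch below uses only the six sample rows shown earlier, so it is an assumption-laden illustration; on the full data, `range(data$PropMD)` and `range(data$TotExp)` would give the actual limits.

```r
# Predictor values copied from the six sample rows of the head() output
sample_PropMD <- c(0.0002288, 0.0011431, 0.0010605, 0.0032973, 0.0000704, 0.0001429)
sample_TotExp <- c(112, 3297, 5292, 172314, 1656, 13046)

new_PropMD <- 0.03
new_TotExp <- 14

# How far the new inputs sit from the sample rows
propmd_ratio  <- new_PropMD / max(sample_PropMD)
totexp_inside <- new_TotExp >= min(sample_TotExp) && new_TotExp <= max(sample_TotExp)

c(propmd_ratio = propmd_ratio, totexp_inside = totexp_inside)
```

A ratio well above 1 or an input outside the observed range signals that the model is being asked to predict where it has no supporting data.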