The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
urlfile<-"https://raw.githubusercontent.com/catcho1632/DATA605/main/Assignment12_dataset"
who<-read_csv(url(urlfile))
## Rows: 190 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (9): LifeExp, InfantSurvival, Under5Survival, TBFree, PropMD, PropRN, Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(who)
## # A tibble: 6 × 10
## Country LifeExp Infan…¹ Under…² TBFree PropMD PropRN PersExp GovtExp TotExp
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghani… 42 0.835 0.743 0.998 2.29e-4 5.72e-4 20 92 112
## 2 Albania 71 0.985 0.983 1.00 1.14e-3 4.61e-3 169 3128 3297
## 3 Algeria 71 0.967 0.962 0.999 1.06e-3 2.09e-3 108 5184 5292
## 4 Andorra 82 0.997 0.996 1.00 3.30e-3 3.5 e-3 2589 169725 172314
## 5 Angola 41 0.846 0.74 0.997 7.04e-5 1.15e-3 36 1620 1656
## 6 Antigua… 73 0.99 0.989 1.00 1.43e-4 2.77e-3 503 12543 13046
## # … with abbreviated variable names ¹InfantSurvival, ²Under5Survival
The scatter plot of Life Expectancy vs Total Expenditure does not look linear but it rather increases rapidly from the left side of the plot until the trend plateus. Visually, a line would not best fit the entirety of the plot. It seems like there should be 2 different lines to fit this plot.
plot(who$TotExp, who$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')
lm_who <- lm(LifeExp~TotExp, data = who)
lm_who
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
Provide and interpret the F statistics, R^2, standard error,and p-values only.
An F-statistic of 65.26 is reported with a p-value of 7.714e-14. Assuming a significance value of 0.05, a p-value of 7.714e-14 is statistically significant and so we can reject the null hypothesis that there is no relationship between Life Expectancy vs Total Expenditures. It also means that obtaining an F-value of 65.26 from chance alone if there was no relationship between the dependent and independent variables is significantly small. From this evidence alone, there is evidence to support that there is a relationship between these variables. This F-statistic indicates that there is a larger proportion of the variance in the dependent variable that is explained by the regression model, relative to the unexplained variance.
However, the F-statistic and p-value alone is not enough to say a linear regression is the best fit model. The R^2 of 0.2577 is reported, which means only 25% of the variance in the dependent variable is explained by this linear regression. The other 75% is unexplained, which seems significant.
The variance not captured by this linear regression is further seen in the plot below. It’s clear that a line may not be the best fit model to explain this data.
plot(who$TotExp, who$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')
abline(lm_who)
summary(lm_who)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Next, the assumptions are checked. The summary shows that the residuals are not generally centered around 0. The Q1 and Q3 values are not equal and max and min residuals also deviate from each other. The Q-Q plot confirms that the residuals do not seem to follow a normal distribution. The plot is bowed and does not follow a linear trend well. The Residual vs Fitted line shows that the spread of residuals are also not uniformly spread out and there appears to have more spread left of the plot. Additionally, the cook’s distance is high on the left of the residuals vs leverage plot which means there is disproportionately influential points in this regression model. Overall, the assumptions for a linear regression model are not met.
qqnorm(resid(lm_who))
qqline(resid(lm_who))
par(mfrow=c(2,2))
plot(lm_who)
The plot with the adjusted data visually shows a more linear plot. The equation of the line is:
\[LifeExp^{4.6}=620060216*TotExp^{0.06}-736527909\]
The F-statistic 507.7 for a p-value of < 2.2e-16. This means that with the assumption that there is no relationship between the dependent and independent variables, the chance of getting an F-statistic of 507.7 is very small. The p-value is significantly lower than the significance value of 0.05 and therefore this linear regression is statistically significant and the null hypothesis that there is no relationship can be rejected. The higher F-statistic of 507.7 is greater than the first model and therefore more variance of the dependent variable is captured by this model. Additionally, the R^2 increased to .73, which is a significant improvement in the amount of variation captured by this model than the 25% from the last. Overall, this model does a better job capturing the variance of the data.
who_1 = who
who_1$LifeExp = who$LifeExp^4.6
who_1$TotExp = who_1$TotExp^0.06
plot(who_1$TotExp, who_1$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')
lm_who_1 = lm(LifeExp~TotExp, data = who_1)
lm_who_1
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_1)
##
## Coefficients:
## (Intercept) TotExp
## -736527909 620060216
summary(lm_who_1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
This can be further seen when the regression line is plotted against the scatter plot. The line does a better job of explaining this relationship.
plot(who_1$TotExp, who_1$LifeExp, main = 'Life Expectancy vs Total Expenditures', xlab = 'Total Expenditures', ylab = 'Life Expectancy')
abline(lm_who_1)
The linear equation is: \[LifeExp^{4.6}=620060216*TotExp^{0.06}-736527909\]
# Forecast for TotExp^0.06 = 1.5
LifeExp_1.5 = (620060216*1.5-736527909)^(1/4.6)
TotExp_1.5 = 1.5^(1/0.06)
print(paste0('For a total expenditure of ',round(TotExp_1.5,3), ' the forecasted life expectancy is ',round(LifeExp_1.5,3), ' years'))
## [1] "For a total expenditure of 860.705 the forecasted life expectancy is 63.312 years"
# Forecast for TotExp^.06 = 2.5
LifeExp_1.5 = (620060216*2.5-736527909)^(1/4.6)
TotExp_1.5 = 2.5^(1/0.06)
print(paste0('For a total expenditure of ',round(TotExp_1.5,3), ' the forecasted life expectancy is ',round(LifeExp_1.5,3), ' years'))
## [1] "For a total expenditure of 4288777.125 the forecasted life expectancy is 86.506 years"
LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp
The equation is: \[LifeExp=62.77+ 1497 * PropMd + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\]
An F-statistic of 34.49 with 3 numerator degrees of freedom and 186 denominator degrees of freedom, and a p-value of < 2.2e-16 (which is basically 0), shows that the model is statistically significant and that the predictor variables are collectively contributing to the prediction of the outcome variable. The low p-value is less than 0.05, so the null hypothesis that there is no relationship between the variables and the outcome can be rejected. Although this F-statistic is smaller than when TotExp was the only predictor, the increase in DOF (3) means there is more variance from the residuals that is being divided in the denominator. There the R^2 is looked at next. The R^2 is 0.3574, which means 36% of the variance in the independent variable can be explained by this model. I would say adding more variables did not help the model to capture more variance. It seems having the TotExp and adjusting it through mathematical manipulation did a better job at capturing the variation in LifeExp.
lm_who_2 = lm(LifeExp~PropMD+TotExp+I(PropMD*TotExp), data = who)
lm_who_2
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp),
## data = who)
##
## Coefficients:
## (Intercept) PropMD TotExp I(PropMD * TotExp)
## 6.277e+01 1.497e+03 7.233e-05 -6.026e-03
summary(lm_who_2)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp),
## data = who)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## I(PropMD * TotExp) -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
\[LifeExp=62.77+ 1497 * PropMd + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\]
This model forecasts that the life expectancy is 108 years old if 3% of the population are MDs and total expenditure is 14. This is contrary to what I would expect since you would think it would require more than $14 from personal and goverment expenditure and more than a 3% population of doctors to push the average life expectancy to such a high value. This model does not make sense and I would not rely on its forecasting.
LifeExp_mult = 62.77+ 1497 * 0.03 + 0.00007233 * 14 - 0.006026 * 0.03 * 14
print(paste0('The life expectancy if 3% of the population are MDs and total expenditure is 14 is ',round(LifeExp_mult,3), ' years'))
## [1] "The life expectancy if 3% of the population are MDs and total expenditure is 14 is 107.678 years"