The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country, in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare, in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, in US dollars at average exchange rate
TotExp: sum of personal and government expenditures
Provide a scatterplot of LifeExp~TotExp, and run a simple linear regression. Do not transform the variables. Provide and interpret the F statistic, R^2, standard error, and p-values only. Discuss whether the assumptions of a simple linear regression are met.
The first step in this one-factor modeling process is to determine whether a linear relationship appears to exist between the predictor (TotExp) and the response (LifeExp). We do this using the plot function and observe that the plot indicates a logarithmic rather than a linear relationship. Although a linear model might not be the right fit without transforming the variables, we nevertheless proceed with the quality evaluation of a one-factor linear model as directed in the problem.
The overall F-test compares the fitted model to an intercept-only model with no predictors. If the fitted model explains significantly more variation than that reduced model, the p-value will be small. In our case, the p-value associated with the F-statistic is very small (7.714e-14), so this metric indicates that the model fits the data better than a model with no independent variables.
The multiple R-squared value lies between 0 and 1 and measures how well the model describes the data: it is the proportion of the total variation in the response that is explained by the model (the explained sum of squares divided by the total sum of squares). The higher the value, the better the fit, although a high R-squared alone does not guarantee a good model. Both the multiple R-squared and the adjusted R-squared for this model are about 0.25, which tells us that the linear model explains only about 25% of the variation in the data.
The Std. Error column shows the standard error of each coefficient estimate. As a rough rule of thumb, a well-estimated coefficient is at least five to ten times larger in magnitude than its standard error. Here, the TotExp coefficient is about 8.1 times its standard error (6.297e-05 / 7.795e-06), and the intercept is about 85.9 times its standard error (6.475e+01 / 7.535e-01). This tells us there is relatively little uncertainty in both the slope estimate and the y-intercept.
The column labeled Pr(>|t|) gives the p-value for the test that the corresponding coefficient is zero, i.e., not relevant in the model. From the output below, the p-value for the TotExp coefficient is a minuscule 7.71e-14, and the p-value for the intercept is smaller than 2e-16. Since both values are well below the standard threshold of 0.05, we can conclude that both the slope coefficient and the intercept are significant in the model.
No, the assumptions are not fully met. The visual analysis (the plot of LifeExp vs TotExp) shows that the relationship between the predictor and the response variable is not linear but perhaps logarithmic in nature, which violates the linearity assumption; residual diagnostic plots (a sketch follows the model output below) would make this check more formal. In addition, the quality evaluation of the model, specifically the R-squared value, shows that this model may not be a good fit for the data, since it explains only about 25% of the variation. Even though the p-values and the standard errors indicate that the terms are significant, the R-squared value indicates that the model explains little of the variation in the data. We would likely be better off with a single-factor model that uses transformed variables as a first step rather than the untransformed single-factor model used here.
library(readr)
## Warning: package 'readr' was built under R version 3.6.3
who.df <- read_csv('https://raw.githubusercontent.com/tponnada/DATA607/master/who.csv')
## Parsed with column specification:
## cols(
## Country = col_character(),
## LifeExp = col_double(),
## InfantSurvival = col_double(),
## Under5Survival = col_double(),
## TBFree = col_double(),
## PropMD = col_double(),
## PropRN = col_double(),
## PersExp = col_double(),
## GovtExp = col_double(),
## TotExp = col_double()
## )
plot(who.df$TotExp, who.df$LifeExp, main="Total Expenditure vs Life Expectancy", xlab="Total Expenditure", ylab="Life Expectancy")
who.lm <- lm(LifeExp ~ TotExp, data = who.df)
summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
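As a further, informal check on the assumptions discussed above, the standard residual diagnostics for the fitted model could also be plotted. This is a minimal sketch and not part of the original output:

par(mfrow = c(2, 2))
plot(who.lm)  # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(1, 1))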
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistic, R^2, standard error, and p-values. Which model is “better?”
The first step in this one-factor modeling process is to determine whether a linear relationship appears to exist between the transformed predictor (TotExp^0.06) and the transformed response (LifeExp^4.6). A plot of the transformed variables indicates a linear relationship, with transformed life expectancy now increasing linearly as the transformed total expenditure increases. Next, we proceed with a quality evaluation of the one-factor linear model that uses these transformed variables.
The overall F-test compares the fitted model to an intercept-only model with no predictors. If the fitted model is better than that reduced model, the p-value will be small. In our case, the p-value associated with the F-statistic is extremely small (< 2.2e-16), so this metric indicates that the model fits the data better than a model with no independent variables.
The multiple R-squared value lies between 0 and 1 and measures how well the model describes the data: it is the proportion of the total variation in the response explained by the model. Both the multiple R-squared and the adjusted R-squared for this model are about 0.73, which tells us that the linear model explains about 73% of the variation in the data.
The Std. Error column shows the standard error of each coefficient estimate. As a rough rule of thumb, a well-estimated coefficient is at least five to ten times larger in magnitude than its standard error. Here, the TotExptxform coefficient is about 22.5 times its standard error (620060216 / 27518940), and the intercept is about 15.7 times its standard error in absolute value (736527910 / 46817945). This tells us there is relatively little uncertainty in both the slope estimate and the y-intercept.
The column labeled Pr(>|t|) gives the p-value for the test that the corresponding coefficient is zero, i.e., not relevant in the model. From the output below, the p-value for the TotExptxform coefficient is smaller than 2e-16, and the p-value for the intercept is likewise smaller than 2e-16. Since both values are well below the standard threshold of 0.05, we can conclude that both the slope coefficient and the intercept are significant in this model.
The one-factor model with transformed variables is better because:
The plot of the transformed variables looks much more linear. The R-squared value has increased dramatically from about 25% to about 73%. The standard errors of both the slope coefficient and the intercept are much smaller relative to their coefficients (i.e., the t-values are higher) than in the previous untransformed one-factor model.
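The summary below corresponds to a regression of LifeExp^4.6 on TotExp^0.06. The chunk that created the transformed columns and fit the model is not shown, but it would look something like the following sketch (the object name who2.lm and the plot labels are assumptions; the column names LifeExptxform and TotExptxform match the Call in the output):

who.df$LifeExptxform <- who.df$LifeExp^4.6
who.df$TotExptxform <- who.df$TotExp^0.06
plot(who.df$TotExptxform, who.df$LifeExptxform, main="Transformed Total Expenditure vs Life Expectancy", xlab="TotExp^0.06", ylab="LifeExp^4.6")
who2.lm <- lm(LifeExptxform ~ TotExptxform, data = who.df)
summary(who2.lm)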
##
## Call:
## lm(formula = LifeExptxform ~ TotExptxform, data = who.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExptxform 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Using the results from problem 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
The fitted equation from problem 2 can be expressed as:

LifeExp^4.6 = -736527910 + 620060216 * TotExp^0.06

Substituting TotExp^0.06 = 1.5 gives LifeExp^4.6 = -736527910 + 620060216 * 1.5 (equation 1), and substituting TotExp^0.06 = 2.5 gives LifeExp^4.6 = -736527910 + 620060216 * 2.5 (equation 2). Raising each result to the 1/4.6 power recovers life expectancy in years.

Life expectancy when TotExp^0.06 = 1.5 is about 63.3 years, and life expectancy when TotExp^0.06 = 2.5 is about 86.5 years.
# forecast LifeExp^4.6 from the fitted equation, then back-transform by raising to the 1/4.6 power
LifeExp1 <- -736527910 + 620060216 * 1.5
LifeExp2 <- -736527910 + 620060216 * 2.5
LifeExp1 <- LifeExp1^(1/4.6)
LifeExp2 <- LifeExp2^(1/4.6)
sprintf("Life expectancy when total expenditure = 1.5 is: %f", LifeExp1)
## [1] "Life expectancy when total expenditure = 1.5 is: 63.311534"
sprintf("Life expectancy when total expenditure = 2.5 is: %f", LifeExp2)
## [1] "Life expectancy when total expenditure = 2.5 is: 86.506449"
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
LifeExp = b0 + (b1 x PropMD) + (b2 x TotExp) + (b3 x PropMD x TotExp)
The overall F-test compares the fitted model to an intercept-only model with no predictors. If the fitted model is better than that reduced model, the p-value will be small. In our case, the p-value associated with the F-statistic is extremely small (< 2.2e-16), so this metric indicates that the multi-factor model fits the data better than a model with no independent variables.
The multiple R-squared value lies between 0 and 1 and measures how well the model describes the data: it is the proportion of the total variation in the response explained by the model. Both the multiple R-squared and the adjusted R-squared for this model are about 0.35, which tells us that the multi-factor model explains only about 35% of the variation in the data.
The Std. Error column shows the standard error of each coefficient estimate. As a rough rule of thumb, a well-estimated coefficient is at least five to ten times larger in magnitude than its standard error. Here, the PropMD coefficient is about 5.37 times its standard error (1.497e+03 / 2.788e+02), the TotExp coefficient is about 8.05 times its standard error (7.233e-05 / 8.982e-06), the interaction term (PropMD * TotExp, twovar_term) is about 4.09 times its standard error in absolute value (6.026e-03 / 1.472e-03), and the intercept is about 78.9 times its standard error (6.277e+01 / 7.956e-01). Overall, there is relatively little uncertainty in the PropMD, TotExp, and intercept estimates, but somewhat more in the interaction term.
The column labeled Pr(>|t|) gives the p-value for the test that the corresponding coefficient is zero, i.e., not relevant in the model. From the output below, the p-value for the PropMD coefficient is a minuscule 2.32e-07, the p-value for the TotExp coefficient is 9.39e-14, the p-value for the twovar_term coefficient is 6.35e-05, and the p-value for the intercept is smaller than 2e-16.
Since all of these values are well below the standard threshold of 0.05, we can conclude that all of the coefficients and the intercept are significant in this model.
The multi-factor model used here is not as good as a one-factor model of the transformed variables because:
The R-squared value has decreased from about 73% to about 35%. The TotExp coefficient here is only about 8.05 times its standard error, versus 22.5 in the transformed one-factor model, so the slope is estimated less precisely in relative terms; the intercept, on the other hand, is about 78.9 times its standard error, versus 15.7 before. The p-values associated with all three terms are much smaller than the standard 0.05 threshold, so every term in the model is significant.
Overall, the model is not as good as the transformed one-factor model: the R-squared value has decreased, and although the standard error of the TotExp coefficient is still small in absolute terms, it is larger relative to its coefficient than in the transformed model. All in all, the R-squared and standard-error values of this multi-factor model are closer to those of the original untransformed linear model, which we determined to be a poor fit.
who.df$twovar_term <- who.df$PropMD * who.df$TotExp
who3.lm <- lm(LifeExp ~ PropMD + TotExp + twovar_term, data = who.df)
summary(who3.lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + twovar_term, data = who.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## twovar_term -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
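A small aside on the code: R's formula interface can fit the same model without creating the interaction column by hand, since PropMD * TotExp expands to both main effects plus their interaction (who3b.lm is just an illustrative name):

who3b.lm <- lm(LifeExp ~ PropMD * TotExp, data = who.df)
summary(who3b.lm)  # same coefficients as who3.lm, with PropMD:TotExp in place of twovar_term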
Forecast LifeExp when PropMD =.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
Using the equation fitted in problem 4, life expectancy is given by:
LifeExp = 6.277e+01 + (1.497e+03 * PropMD) + (7.233e-05 * TotExp) + (-6.026e-03 * PropMD * TotExp)
Given that PropMD=.03 and TotExp = 14, the above equation can be written as:
LifeExp = 6.277e+01 + (1.497e+03 * 0.03) + (7.233e-05 * 14) + (-6.026e-03 * 0.03 * 14).
For comparison, plugging the mean values of PropMD and TotExp (computed below) into the same model gives an average life expectancy of 68.02 years:
LifeExp = 6.277e+01 + (1.497e+03 * 0.00179538) + (7.233e-05 * 41695.49) + (-6.026e-03 * 0.00179538 * 41695.49).
Essentially, we arrive at an increased life expectancy of 107.7 years (up from about 68 years at the mean values) by assuming a much higher proportion of doctors in the population while at the same time drastically decreasing total health care expenditure.
Looking at the original dataset, the median proportion of MDs in the population is about 0.001 (the mean is about 0.0018), and we are proposing to increase that proportion to 0.03. Meanwhile, the mean and median total expenditures on health care in the original dataset are $41,695 and $5,541, respectively, and we are proposing to decrease this to just $14 per capita.
There is clearly an interaction between health care spending and the proportion of MDs in the model, which appears intuitive (more MDs mean more salaries to pay and hence higher health care spending), but the assumed values do not respect it. Hence, the forecast of increased life expectancy does not seem realistic given the values used for the predictor variables and the interaction between them.
Perhaps the problem statement assumes that, although overall health care spending is decreasing, a larger share of the health care dollars is going toward employing more MDs (at the expense of fewer RNs?), which in effect improves health outcomes through higher average life expectancy.
LifeExp = 6.277e+01 + (1.497e+03 * 0.03) + (7.233e-05 * 14) + (-6.026e-03 * 0.03 * 14)
sprintf("Life expectancy when PropMD =.03 and TotExp = 14 is: %f", LifeExp)
## [1] "Life expectancy when PropMD =.03 and TotExp = 14 is: 107.678482"
median(who.df$PropMD)
## [1] 0.001047359
mean(who.df$PropMD)
## [1] 0.00179538
median(who.df$TotExp)
## [1] 5541
mean(who.df$TotExp)
## [1] 41695.49
AvglifeExp = 6.277e+01 + (1.497e+03 * 0.00179538) + (7.233e-05 * 41695.49) + (-6.026e-03 * 0.00179538 * 41695.49); AvglifeExp
## [1] 68.02242