The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures
#Read file
myFile <- file.choose()
myData <- read.csv(file=myFile, header=TRUE)
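As a non-interactive alternative to file.choose(), the following is a minimal sketch of a reader that also confirms the ten documented columns are present. The helper name read_who and the assumption that who.csv sits in the working directory are illustrative, not part of the original assignment.

```r
# Hypothetical helper (not from the assignment): read the WHO file and
# stop early if any of the documented variables is missing.
read_who <- function(path) {
  expected <- c("Country", "LifeExp", "InfantSurvival", "Under5Survival",
                "TBFree", "PropMD", "PropRN", "PersExp", "GovtExp", "TotExp")
  dat <- read.csv(file = path, header = TRUE, stringsAsFactors = FALSE)
  missing_cols <- setdiff(expected, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  dat
}

# Usage, assuming who.csv is in the working directory:
# myData <- read_who("who.csv")
```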
Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistic, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
#plot LifeExp vs TotExp
plot(LifeExp~TotExp,data=myData)
lifeExp_TotExp <- lm(myData$LifeExp~myData$TotExp)
summary(lifeExp_TotExp)
##
## Call:
## lm(formula = myData$LifeExp ~ myData$TotExp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## myData$TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
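The F-statistic is 65.26 on 1 and 188 degrees of freedom. Because the model has only a single predictor, the F-test is equivalent to the t-test on the slope (8.079^2 ≈ 65.27), so it adds no information beyond the slope's p-value.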
This value compares the current model to a model that has one fewer parameters. Because the one-factor model already has only a single parameter, this test is not particularly useful in this case. [LinearRegression_fulltext.pdf, page 33.]
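The equivalence between the F-test and the slope's t-test in a one-predictor model can be checked numerically. This is a small sketch that uses only the rounded values printed in the summary above, so the results agree with the printed output only up to rounding; the variable names are illustrative.

```r
# Values copied from the summary output (rounded as printed).
t_slope    <- 8.079   # t value for the TotExp slope
f_reported <- 65.26   # F-statistic
df_resid   <- 188     # residual degrees of freedom

# With one predictor, F equals the square of the slope's t value.
f_from_t <- t_slope^2

# Both tests give the same p-value (up to rounding of the inputs).
p_from_f <- pf(f_reported, df1 = 1, df2 = df_resid, lower.tail = FALSE)
p_from_t <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)

print(c(F_from_t = f_from_t, p_from_F = p_from_f, p_from_t = p_from_t))
```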
R^2 is 0.2577. The model explains 25.77% of the variation in LifeExp.
Multiplying this value by 100 gives a value that we can interpret as a percentage between 0 and 100. Consequently, you should not ever expect an R2 value of exactly one. In general, values of R2 that are closer to one indicate a better-fitting model. However, a good model does not necessarily require a large R2 value. It may still accurately predict future observations, even with a small R2 value. [LinearRegression_fulltext.pdf, page 32.]
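R^2 and adjusted R^2 can also be recovered from the F-statistic and its degrees of freedom, which is a quick consistency check on the summary output. The sketch below relies on the standard identities for a linear model with an intercept and uses the rounded values from the output; the variable names are illustrative.

```r
# Values copied from the summary output.
f_stat   <- 65.26
df_model <- 1     # number of predictors
df_resid <- 188   # residual degrees of freedom
n        <- df_resid + df_model + 1   # 190 observations

# R^2 = (F * df_model) / (F * df_model + df_resid)
r2 <- (f_stat * df_model) / (f_stat * df_model + df_resid)

# Adjusted R^2 penalizes for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(round(c(R2 = r2, adjusted_R2 = adj_r2), 4))
```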
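The standard error of the TotExp coefficient is 7.795e-06. The ratio of the coefficient to its standard error is 6.297e-05/7.795e-06 = 8.08, which falls within the five-to-ten-times guideline, so the slope is estimated reasonably precisely. This alone does not show that the model fits well. The residual standard error is 9.371 years on 188 degrees of freedom.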
The Std. Error column shows the statistical standard error for each of the coefficients. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. [LinearRegression_fulltext.pdf, page 31.] [http://onlinestatbook.com/2/regression/accuracy.html]
The p-value is 7.714e-14. Since it is very small, we reject the null hypothesis that the slope of TotExp is zero.
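On the regression assumptions: the residual summary above (median 3.154, minimum -24.764, maximum 13.292) hints that the residuals are skewed rather than symmetric about zero, which would argue against the untransformed linear model. A residuals-versus-fitted plot and a normal Q-Q plot are the usual way to check this. The sketch below uses simulated stand-in data, not the WHO file, so that it runs on its own; in the actual analysis the fitted object lifeExp_TotExp would be passed to the hypothetical helper plot_residual_checks instead.

```r
# Hypothetical helper: two standard diagnostic plots for an lm fit.
plot_residual_checks <- function(fit) {
  op <- par(mfrow = c(1, 2))
  on.exit(par(op))
  plot(fitted(fit), resid(fit),
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)
  qqnorm(resid(fit))
  qqline(resid(fit))
}

# Simulated stand-in data (NOT the WHO data): a curved relationship
# fitted with a straight line, which produces skewed residuals.
set.seed(1)
sim <- data.frame(x = rlnorm(190, meanlog = 6, sdlog = 1.5))
sim$y <- 40 + 5 * log(sim$x) + rnorm(190, sd = 3)
fit <- lm(y ~ x, data = sim)
res <- resid(fit)

# Numeric summary: a median far from zero suggests skewed residuals.
print(summary(res))

# Draw the plots only in an interactive session.
if (interactive()) plot_residual_checks(fit)
```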
#transform data by power
myData.LifeExp_46 <- myData$LifeExp^4.6
myData.TotExp_006 <- myData$TotExp^0.06
#plot transformed data
plot(myData.LifeExp_46~myData.TotExp_006)
transform_lifeExp_TotExp_lm <- lm(myData.LifeExp_46~myData.TotExp_006)
summary(transform_lifeExp_TotExp_lm)
##
## Call:
## lm(formula = myData.LifeExp_46 ~ myData.TotExp_006)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## myData.TotExp_006 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
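The F-statistic is 507.7 on 1 and 188 degrees of freedom. Because the model has only a single predictor, the F-test is equivalent to the t-test on the slope (22.53^2 ≈ 507.6), so it adds no information beyond the slope's p-value.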
This value compares the current model to a model that has one fewer parameters. Because the one-factor model already has only a single parameter, this test is not particularly useful in this case. [LinearRegression_fulltext.pdf, page 33.]
R^2 is 0.7298. The model explains 72.98% of the variation in LifeExp^4.6.
Multiplying this value by 100 gives a value that we can interpret as a percentage between 0 and 100. Consequently, you should not ever expect an R2 value of exactly one. In general, values of R2 that are closer to one indicate a better-fitting model. However, a good model does not necessarily require a large R2 value. It may still accurately predict future observations, even with a small R2 value. [LinearRegression_fulltext.pdf, page 32.]
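The standard error of the TotExp^0.06 coefficient is 27518940. The ratio of the coefficient to its standard error is 620060216/27518940 = 22.53, well above the five-to-ten-times guideline, so the slope is estimated precisely.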
The Std. Error column shows the statistical standard error for each of the coefficients. For a good model, we typically would like to see a standard error that is at least five to ten times smaller than the corresponding coefficient. [LinearRegression_fulltext.pdf, page 31.]
The p-value is < 2.2e-16. It is very small, so we reject the null hypothesis that the slope is zero.
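Judging by the scatterplot, R^2 (0.7298 versus 0.2577), and the coefficient-to-standard-error ratio (22.53 versus 8.08), the transformed data give a better-fitting model. The residual standard errors of the two models are on different scales and cannot be compared directly.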
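TotExp = sum of personal and government expenditures.
The fitted model forecasts the transformed response: LifeExp^4.6 = (-736527910) + 620060216 * TotExp^0.06. Raising the result to the power 1/4.6 converts it back to life expectancy in years. The forecasts below use TotExp^0.06 = 1.5 and TotExp^0.06 = 2.5.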
# for 1.5
TotExp_006_value <- 1.5
forecast_life_expectancy <- (-736527910) + 620060216 * (TotExp_006_value)
print(paste("For transformed sum of personal and government expenditures=", TotExp_006_value, " the forecast life expectancy is", forecast_life_expectancy, ", or convert value =", forecast_life_expectancy^(1/4.6), sep=" "))
## [1] "For transformed sum of personal and government expenditures= 1.5 the forecast life expectancy is 193562414 , or convert value = 63.3115334478635"
# for 2.5
TotExp_006_value <- 2.5
forecast_life_expectancy <- (-736527910) + 620060216 * (TotExp_006_value)
print(paste("For transformed sum of personal and government expenditures=", TotExp_006_value, " the forecast life expectancy is", forecast_life_expectancy, ", or convert value =", forecast_life_expectancy^(1/4.6), sep=" "))
## [1] "For transformed sum of personal and government expenditures= 2.5 the forecast life expectancy is 813622630 , or convert value = 86.5064484928337"
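The two forecasts above repeat the same arithmetic, so they can be wrapped in one vectorized function. This is a sketch using the coefficients from the transformed-model summary; the function name forecast_life_exp is illustrative.

```r
# Coefficients copied from the transformed-model summary.
b0 <- -736527910   # intercept
b1 <-  620060216   # slope for TotExp^0.06

# Forecast life expectancy in years from a value of TotExp^0.06.
forecast_life_exp <- function(tot_exp_006) {
  life_exp_46 <- b0 + b1 * tot_exp_006   # prediction on the LifeExp^4.6 scale
  life_exp_46^(1 / 4.6)                  # back-transform to years
}

print(forecast_life_exp(c(1.5, 2.5)))
```

The printed values should match the converted values reported above, roughly 63.3 and 86.5 years.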
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
where PropMD is the proportion of the population who are MDs.
multiple_reg_model_lm <- lm(formula = LifeExp~TotExp+PropMD+(PropMD*TotExp),data=myData)
summary(multiple_reg_model_lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + (PropMD * TotExp), data = myData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
The F-statistic is 34.49 on 3 and 186 degrees of freedom.
http://www.statisticshowto.com/probability-and-statistics/f-statistic-value-test/ The F value in regression is the result of a test where the null hypothesis is that all of the regression coefficients are equal to zero. In other words, the model has no predictive capability. Basically, the f-test compares your model with zero predictor variables (the intercept only model), and decides whether your added coefficients improved the model. If you get a significant result, then whatever coefficients you included in your model improved the model’s fit.
Read your p-value first. If the p-value is small (less than your alpha level), you can reject the null hypothesis. Only then should you consider the f-value. If you fail to reject the null, discard the f-value result.
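The decision rule just described can be applied to the reported F-statistic directly. The sketch below uses the rounded values from the summary output and assumes a significance level of 0.05, which the source does not state; it also recovers R^2 from F as a consistency check.

```r
# Values copied from the summary output of the three-term model.
f_stat   <- 34.49
df_model <- 3
df_resid <- 186
alpha    <- 0.05   # assumed significance level, not given in the source

# Overall F-test: H0 is that all three slope coefficients are zero.
p_value   <- pf(f_stat, df1 = df_model, df2 = df_resid, lower.tail = FALSE)
reject_h0 <- p_value < alpha

# R^2 implied by F and its degrees of freedom.
r2 <- (f_stat * df_model) / (f_stat * df_model + df_resid)

print(list(p_value = p_value, reject_h0 = reject_h0, R2 = round(r2, 4)))
```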
R^2 is 0.3574. The model explains 35.74% of the variation in LifeExp.
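The standard error for TotExp is 8.982e-06. The ratio of the coefficient to its standard error is (7.233e-05/8.982e-06) = 8.053.
The standard error for PropMD is 2.788e+02. The ratio of the coefficient to its standard error is (1.497e+03/2.788e+02) = 5.371.
The standard error for TotExp:PropMD is 1.472e-03. The ratio of the coefficient to its standard error is (-6.026e-03/1.472e-03) = -4.093.
These ratios are the t values in the summary table; small differences from hand calculation come from rounding of the printed coefficients. The interaction ratio falls just below the five-to-ten-times guideline in magnitude, although its p-value is still small.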
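The three ratios can be computed in one step, together with their two-sided p-values on 186 residual degrees of freedom. This sketch uses the rounded estimates and standard errors as printed, so the ratios differ from the reported t values in the third decimal place.

```r
# Estimates and standard errors copied from the coefficient table.
est <- c(TotExp = 7.233e-05, PropMD = 1.497e+03, "TotExp:PropMD" = -6.026e-03)
se  <- c(TotExp = 8.982e-06, PropMD = 2.788e+02, "TotExp:PropMD" = 1.472e-03)

# Ratio of each coefficient to its standard error (the t value).
ratio <- est / se

# Two-sided p-values on the residual degrees of freedom.
p_val <- 2 * pt(abs(ratio), df = 186, lower.tail = FALSE)

print(data.frame(estimate = est, std_error = se,
                 ratio = round(ratio, 3), p_value = signif(p_val, 3)))
```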
The p-value is < 2.2e-16. It is very small, so we reject the null hypothesis that all of the regression coefficients are zero.
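This model fits better than the untransformed LifeExp~TotExp model (R^2 of 0.3574 versus 0.2577, residual standard error of 8.765 versus 9.371), but its R^2 is well below the 0.7298 of the transformed model.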
LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
LifeExp = (6.277e+01) + (1.497e+03) x PropMD + (7.233e-05) x TotExp + (-6.026e-03) x PropMD x TotExp
PropMD_value <- 0.03
TotExp_value <- 14
PropMD_TotExp_value <- PropMD_value * TotExp_value
# each coefficient multiplies its own predictor: 1.497e+03 is the PropMD slope, 7.233e-05 is the TotExp slope
forecast_LifeExp2 <- (6.277e+01) + (1.497e+03) * PropMD_value + (7.233e-05) * TotExp_value + (-6.026e-03) * PropMD_TotExp_value
print(paste("For PropMD=", PropMD_value, " and TotExp =", TotExp_value, ", the forecast life expectancy is", forecast_LifeExp2, sep=" "))
## [1] "For PropMD= 0.03 and TotExp = 14 , the forecast life expectancy is 107.6784817"
This forecast does not seem realistic, because an average life expectancy of about 107.7 years is not plausible for any country.
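Writing the forecast with named vectors makes it harder to attach a coefficient to the wrong predictor. The sketch below pairs each estimate from the coefficient table with its own predictor by name and evaluates the model at PropMD = 0.03 and TotExp = 14; the variable names are illustrative. With the fitted object available, predict(multiple_reg_model_lm, newdata = data.frame(PropMD = 0.03, TotExp = 14)) would be the more direct route.

```r
# Estimates copied from the coefficient table, keyed by term name.
coefs <- c("(Intercept)"   = 6.277e+01,
           "TotExp"        = 7.233e-05,
           "PropMD"        = 1.497e+03,
           "TotExp:PropMD" = -6.026e-03)

# Predictor values for the forecast, keyed by the same names.
new_obs <- c("(Intercept)"   = 1,
             "TotExp"        = 14,
             "PropMD"        = 0.03,
             "TotExp:PropMD" = 14 * 0.03)

# Match by name so each coefficient meets its own predictor.
forecast_life_exp2 <- sum(coefs * new_obs[names(coefs)])
print(forecast_life_exp2)
```

Matched this way, the forecast comes to roughly 107.7 years.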