In this assignment, a real-world WHO dataset from 2008 is provided. The dataset is a CSV file, which I stored in my GitHub account so that it can be read directly for the required analysis.
library(ggplot2)
who_data<-read.csv("https://raw.githubusercontent.com//Raji030//data605_hw12_dataset//main//who.csv")
head(who_data)
## Country LifeExp InfantSurvival Under5Survival TBFree PropMD
## 1 Afghanistan 42 0.835 0.743 0.99769 0.000228841
## 2 Albania 71 0.985 0.983 0.99974 0.001143127
## 3 Algeria 71 0.967 0.962 0.99944 0.001060478
## 4 Andorra 82 0.997 0.996 0.99983 0.003297297
## 5 Angola 41 0.846 0.740 0.99656 0.000070400
## 6 Antigua and Barbuda 73 0.990 0.989 0.99991 0.000142857
## PropRN PersExp GovtExp TotExp
## 1 0.000572294 20 92 112
## 2 0.004614439 169 3128 3297
## 3 0.002091362 108 5184 5292
## 4 0.003500000 2589 169725 172314
## 5 0.001146162 36 1620 1656
## 6 0.002773810 503 12543 13046
plot(who_data$TotExp, who_data$LifeExp,
     xlab = "TotExp", ylab = "LifeExp")
# Add the least-squares line after the scatterplot has been drawn
abline(lm(LifeExp ~ TotExp, data = who_data))
model <- lm(LifeExp ~ TotExp, data = who_data)
summary(model)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
F-statistic: The F-statistic is 65.26 with a p-value of 7.714e-14, well below the 0.05 significance level. The model is therefore statistically significant: there is evidence of a linear association between the predictor (TotExp) and the response variable (LifeExp).
R-squared: The R-squared value is 0.2577, so only about 25.77% of the variability in the response variable (LifeExp) is explained by the explanatory variable (TotExp). The adjusted R-squared value is 0.2537. Such a low value indicates that the model does not fit the data well.
Residual standard error: The residual standard error is 9.371, meaning that the observed life expectancies typically deviate from the fitted line by about 9.4 years.
p-values: The p-value for TotExp is 7.71e-14, which is less than the 0.05 significance level. This is strong evidence that the coefficient for TotExp differs from zero, so the null hypothesis can be rejected and the predictor can be said to have a significant effect on the response variable.
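As a quick consistency check, the statistics reported in the summary can be tied together by hand. The sketch below uses only the numbers printed by summary(model); the helper name r2_from_f is illustrative and is not part of the original analysis.

```r
# Values copied from the summary(model) output above.
f_stat  <- 65.26   # overall F-statistic
df1     <- 1       # numerator degrees of freedom
df2     <- 188     # residual degrees of freedom
t_slope <- 8.079   # t value for the TotExp coefficient

# Hypothetical helper: recover R-squared from the overall F-statistic.
r2_from_f <- function(f, df1, df2) {
  (f * df1) / (f * df1 + df2)
}

r2 <- r2_from_f(f_stat, df1, df2)
round(r2, 4)   # compare with the reported Multiple R-squared

# With a single predictor, the F-statistic equals the squared slope t value
# (up to rounding in the printed output).
t_slope^2
```

Because the F-statistic, the slope's t value, and R-squared all carry the same information in a one-predictor model, they cannot disagree: a highly significant F with a low R-squared simply means the association is real but weak.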
ggplot(data = model, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab("Residuals")
hist(model$residuals)
ggplot(data = model, aes(sample = .resid)) +
  stat_qq()
In the model diagnostics, the assumptions of linearity, nearly normal residuals, and constant variability (homoscedasticity) of the residuals were checked to see whether the linear model is reliable. The results are given below:
Residuals analysis: The residuals are not randomly scattered around zero and show clear curvature, which indicates that the assumptions of linearity and homoscedasticity are not satisfied.
Histogram of residuals: The histogram of residuals is not approximately normal; it is clearly left skewed. So, the assumption of nearly normal residuals is not satisfied.
Normality assumption: The points in the normal probability plot (q-q plot) of the residuals do not fall along a straight line, which indicates that the residuals are not approximately normally distributed. So, the assumption of normality is not met.
Based on the diagnostics above, it can be said that the linear model is not appropriate for these data.
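To illustrate why curvature in a residual plot points to a violated linearity assumption, here is a minimal sketch on synthetic data. It does not use the WHO dataset: the variables x and y are made up for illustration, with y a noise-free concave function of x.

```r
# Synthetic illustration (not the WHO data): a concave relationship
# fitted with a straight line.
x <- seq(1, 100, length.out = 50)
y <- log(x)                       # strictly concave in x, no noise added

fit <- lm(y ~ x)
res <- resid(fit)

# Signs of the residuals at the first, middle, and last observation.
# A straight line fitted to a concave curve overshoots at both ends
# and undershoots in the middle, giving an arch-shaped residual plot.
sign(res[c(1, 25, 50)])

# One informal check: adding a quadratic term absorbs part of the curvature.
fit_quad <- lm(y ~ x + I(x^2))
anova(fit, fit_quad)
```

The same arch pattern in the residuals-versus-fitted plot for the WHO model is what motivates transforming the variables rather than keeping the straight-line fit.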
who_data$LifeExp_trans <- who_data$LifeExp^4.6
who_data$TotExp_trans <- who_data$TotExp^0.06
plot(who_data$TotExp_trans, who_data$LifeExp_trans,
     xlab = "TotExp^.06", ylab = "LifeExp^4.6")
# Add the least-squares line after the scatterplot has been drawn
abline(lm(LifeExp_trans ~ TotExp_trans, data = who_data))
remodel <- lm(LifeExp_trans ~ TotExp_trans, data = who_data)
summary(remodel)
##
## Call:
## lm(formula = LifeExp_trans ~ TotExp_trans, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_trans 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
F-statistic: The F-statistic is 507.7 with a p-value of < 2.2e-16, which indicates that the model is highly significant.
R-squared: The R-squared value is 0.7298, which means that approximately 72.98% of the variability in the transformed response (LifeExp^4.6) can be explained by the transformed explanatory variable (TotExp^0.06). The adjusted R-squared value is 0.7283, which is very close to the R-squared value and indicates a good fit of the model.
Residual standard error: The residual standard error is 90490000, which represents the typical amount by which the observed values of LifeExp_trans deviate from the predicted values. It is measured on the scale of LifeExp^4.6, so it cannot be compared directly with the residual standard error of the original model.
p-values: Both the intercept and TotExp_trans have p-values less than 2e-16, which means that they are highly statistically significant.
Comparing the two models above, the transformed model has a much higher R-squared value (0.7298) than the original model (0.2577), so it explains far more of the variance in its response variable (LifeExp^4.6). Its F-statistic is also higher (507.7 versus 65.26) and the p-values for its coefficients are lower, indicating stronger statistical significance. Although the two R-squared values are measured on different response scales, the transformed model is considered the better of the two.
The equation for the fitted linear regression model is:
LifeExp^4.6 = -736527910 + 620060216 * TotExp^0.06
# LifeExp^4.6 <- -736527910 + 620060216 * TotExp^0.06 # Equation
LifeExp<- (-736527910 + 620060216 *1.5)^(1/4.6) # For, TotExp^.06 =1.5
LifeExp
## [1] 63.31153
# LifeExp^4.6 <- -736527910 + 620060216 * TotExp^0.06 # Equation
LifeExp<- (-736527910 + 620060216 *2.5)^(1/4.6) # For, TotExp^.06 =2.5
LifeExp
## [1] 86.50645
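The two forecasts above repeat the same back-transformation, so it can be wrapped in a small function. This is a minimal sketch: the names b0, b1, and predict_life_exp are illustrative, and the coefficients are copied from the summary(remodel) output rather than extracted from the fitted object.

```r
# Coefficients copied from the summary(remodel) output above.
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: forecast LifeExp from a value of TotExp^0.06.
predict_life_exp <- function(totexp_trans) {
  # Undo the 4.6 power applied to the response.
  (b0 + b1 * totexp_trans)^(1 / 4.6)
}

predict_life_exp(c(1.5, 2.5))

# Raw spending levels implied by TotExp^0.06 = 1.5 and 2.5.
c(1.5, 2.5)^(1 / 0.06)
```

The helper returns NaN whenever b0 + b1 * totexp_trans is negative, because a negative number has no real 4.6th root, so it should only be used for inputs where the fitted value of LifeExp^4.6 is positive.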
mlt_model<-lm(LifeExp~TotExp + PropMD + PropMD * TotExp, data=who_data)
summary(mlt_model)
##
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + PropMD * TotExp, data = who_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
F-statistic: The F-statistic has a very low p-value (< 2.2e-16), which indicates that the model is statistically significant.
R-squared: The R-squared value is 0.3574, which means that the model explains about 35.74% of the variance in life expectancy. The adjusted R-squared is 0.3471, slightly lower than the R-squared value because it penalizes the number of predictors; the small gap does not by itself indicate over-fitting.
Residual standard error: The residual standard error is 8.765, meaning that the observed life expectancies typically deviate from the model's predictions by about 8.8 years.
p-values: All the coefficients have p-values below 0.05, indicating that they are statistically significant.
Based on the statistics above, the model is statistically significant, but it leaves a large proportion of the variance in life expectancy unexplained. Although the model has some explanatory power, overall it does not fit the given dataset well.
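The gap between R-squared and adjusted R-squared follows directly from the sample size and the number of predictors. A minimal sketch using only the figures reported in the summary above (the variable names are illustrative):

```r
# Values copied from the summary(mlt_model) output above.
r2 <- 0.3574   # Multiple R-squared
n  <- 190      # observations: 186 residual df + 4 estimated coefficients
p  <- 3        # predictors: TotExp, PropMD, TotExp:PropMD

# Standard adjusted R-squared formula.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)   # compare with the reported Adjusted R-squared
```

With 190 observations and only three predictors the penalty factor (n - 1) / (n - p - 1) is close to 1, which is why the two values differ by only about 0.01.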
From the model above, the regression equation is: LifeExp = 62.8 + 1497 * PropMD + 0.000072 * TotExp - 0.006 * PropMD * TotExp
LifeExp<-62.8 + 1497 * 0.03 + 0.000072 * 14 - 0.006 * 14 * 0.03 # where, PropMD=.03 and TotExp = 14
LifeExp
## [1] 107.7085
The forecast gives a life expectancy of 107.71 years, which is not realistic. The proportion of doctors in the population is not independent of total healthcare expenditure: raising the proportion of doctors also raises total expenditure. It is therefore not practical to combine a drastically high proportion of doctors with a drastically low total expenditure, as this forecast does.
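To see where the unrealistic value comes from, the forecast can be split into the contribution of each term. This is a minimal sketch using the rounded coefficients from the regression equation; the function name forecast_life_exp is illustrative.

```r
# Hypothetical helper built from the rounded coefficients in the equation above.
forecast_life_exp <- function(prop_md, tot_exp) {
  62.8 + 1497 * prop_md + 0.000072 * tot_exp - 0.006 * prop_md * tot_exp
}

forecast_life_exp(prop_md = 0.03, tot_exp = 14)

# Contribution of each term at PropMD = 0.03 and TotExp = 14.
c(intercept   = 62.8,
  prop_md     = 1497 * 0.03,
  tot_exp     = 0.000072 * 14,
  interaction = -0.006 * 0.03 * 14)
```

Nearly all of the increase over the intercept comes from the PropMD term (1497 * 0.03 = 44.91 years), while the expenditure and interaction terms are negligible at TotExp = 14. The forecast is driven by a combination of inputs that the model has little basis to describe.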