This assignment uses a real-world WHO dataset from 2008. The dataset is a CSV file, which I stored on my GitHub account so that it can be loaded directly for the analysis.

Load library

library(ggplot2)

Get the dataset

who_data<-read.csv("https://raw.githubusercontent.com//Raji030//data605_hw12_dataset//main//who.csv")
head(who_data)

##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046

Ques-1: Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Ans-1:

Scatter plot of LifeExp vs TotExp

plot(who_data$TotExp, who_data$LifeExp,
     xlab = "TotExp", ylab = "LifeExp")
# Overlay the least-squares line on the scatter plot
abline(lm(LifeExp ~ TotExp, data = who_data))

Building the linear regression model to fit the data

model <- lm(LifeExp ~ TotExp, data = who_data)

Summary of the model

summary(model)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Interpret model statistics

F-statistic: The F-statistic is 65.26 on 1 and 188 degrees of freedom, with a p-value of 7.714e-14, which is far below the significance level of 0.05. This indicates that the model is statistically significant: TotExp explains more of the variation in LifeExp than an intercept-only model would. Statistical significance alone does not show that a straight line is the right form for the relationship.

R-squared: The R-squared value is 0.2577, which means that approximately 25.77% of the variability in the response variable (LifeExp) is explained by the explanatory variable (TotExp). The adjusted R-squared value is 0.2537. With roughly three quarters of the variability left unexplained, the model is not a good fit.

Residual standard error: The residual standard error is 9.371 on 188 degrees of freedom, which means that observed life expectancy typically deviates from the model's prediction by about 9.4 years.

p-values: The p-value for TotExp is 7.71e-14, far below the significance level of 0.05, so there is strong evidence that the coefficient for TotExp differs from zero. The null hypothesis of a zero slope can be rejected, and TotExp has a statistically significant association with LifeExp. With a single predictor this t-test is equivalent to the overall F-test (8.079^2 is about 65.3, matching the F-statistic up to rounding).
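The four statistics interpreted above can also be pulled out of a fitted model object instead of being read off the printed summary. The sketch below is a minimal illustration on simulated data: the `demo` data frame and its columns `x` and `y` are hypothetical stand-ins, not the WHO variables. The same calls should work on `model`.

```r
# Simulated stand-in data; x and y are hypothetical, not the WHO variables.
set.seed(1)
demo <- data.frame(x = runif(50, 0, 10))
demo$y <- 2 + 0.5 * demo$x + rnorm(50)

fit <- lm(y ~ x, data = demo)
s <- summary(fit)

f_stat <- s$fstatistic                 # named vector: value, numdf, dendf
# Overall p-value of the F-test (upper tail of the F distribution)
f_pval <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
             lower.tail = FALSE)
r_squared  <- s$r.squared              # proportion of variance explained
resid_se   <- s$sigma                  # residual standard error
coef_pvals <- coef(s)[, "Pr(>|t|)"]    # p-value for each coefficient

print(c(F = unname(f_stat["value"]), R2 = r_squared, sigma = resid_se))
print(coef_pvals)
```

Storing the values this way avoids copying numbers by hand when they are reused later in the report.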

Model diagnostic:

Residuals vs fitted value (predicted value) plot:

ggplot(data =model, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals")

Histogram of the residuals

hist(model$residuals)

Normal probability plot of the residuals

ggplot(data =model, aes(sample = .resid)) +
stat_qq()

In the model diagnostic part, the assumptions of linearity, nearly normal residuals, and constant variability (homoscedasticity) of the residuals were checked to see whether the linear model is reliable. The results are given below:

Residuals vs fitted values: The residuals are not randomly scattered around zero and show clear curvature, which indicates that the assumptions of linearity and homoscedasticity are not satisfied.
Histogram of residuals: The histogram of the residuals is clearly left-skewed rather than approximately normal. So, the assumption of nearly normal residuals is not satisfied.
Normal probability plot: The normal probability plot (Q-Q plot) of the residuals is not close to a straight line, which indicates that the residuals are not approximately normally distributed. So, the normality assumption is not met.

Based on the model diagnostics above, the simple linear model is not appropriate for these data.
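The same checks can be made with base R as a cross-check on the ggplot2 versions above. The sketch below is a hedged illustration on simulated data with a deliberately curved relationship. The `demo` data frame is hypothetical, and the Shapiro-Wilk test is an addition that is not part of the original analysis.

```r
# Simulated stand-in data with a deliberately curved relationship.
set.seed(42)
demo <- data.frame(x = runif(100, 1, 100))
demo$y <- 10 * log(demo$x) + rnorm(100)

fit <- lm(y ~ x, data = demo)

# Built-in diagnostic plots: 1 = residuals vs fitted, 2 = normal Q-Q
plot(fit, which = 1:2)

# Shapiro-Wilk test: a small p-value is evidence against normal residuals
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```

A formal test complements the plots but does not replace them, since curvature in the residuals-vs-fitted plot is a linearity problem that a normality test does not target.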

Ques-2: Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Ans-2:

Create new transformed variables

who_data$LifeExp_trans <- who_data$LifeExp^4.6
who_data$TotExp_trans <- who_data$TotExp^0.06

Scatter plot of the transformed variables

plot(who_data$TotExp_trans, who_data$LifeExp_trans,
     xlab = "TotExp^.06", ylab = "LifeExp^4.6")
# Overlay the least-squares line on the scatter plot
abline(lm(LifeExp_trans ~ TotExp_trans, data = who_data))

Re-run the linear regression model with transformed variables

remodel <- lm(LifeExp_trans ~ TotExp_trans, data = who_data)

Summary of the remodel statistics

summary(remodel)

## 
## Call:
## lm(formula = LifeExp_trans ~ TotExp_trans, data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -736527910   46817945  -15.73   <2e-16 ***
## TotExp_trans  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Interpret statistics for the transformed variables model

F-statistic: The F-statistic is 507.7 with a p-value of < 2.2e-16, which indicates that the model is highly significant.

R-squared: The R-squared value is 0.7298, which means that approximately 72.98% of the variability in the transformed response variable (LifeExp^4.6) is explained by the transformed explanatory variable (TotExp^0.06). The adjusted R-squared value is 0.7283, which is very close to the R-squared value and indicates a good fit of the model.

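Residual standard error: The residual standard error is 90490000, which is the typical amount by which the observed values of LifeExp_trans deviate from the predicted values. Because it is measured in units of LifeExp^4.6 rather than years, it cannot be compared directly with the 9.371 of the first model.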

p-values: Both the intercept and TotExp_trans have p-values less than 2e-16, which means that they are highly statistically significant.

Comparing the two models above, the transformed-variables model has a much higher R-squared value (0.7298) than the original model (0.2577). This indicates that the transformed-variables model explains more of the variance in its response variable (LifeExp^4.6), and it is therefore considered the better model. Moreover, the F-statistic of the transformed-variables model is higher (507.7 versus 65.26) and its p-values are lower, indicating higher statistical significance.
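Since the two models have responses on different scales, one additional way to compare them is to back-transform the second model's fitted values to years and compute the prediction error of both models on the same scale. The sketch below is only an illustration under stated assumptions: the data are simulated (the coefficients from the summary above are borrowed just to give realistic magnitudes), and `demo`, `spend`, `life`, and `rmse` are hypothetical names.

```r
# Simulated stand-in data, generated to be linear on the transformed scale.
set.seed(7)
t06    <- runif(200, 1.4, 2.2)          # plays the role of TotExp^0.06
life46 <- -736527910 + 620060216 * t06 + rnorm(200, sd = 2e7)
demo   <- data.frame(spend = t06^(1 / 0.06), life = life46^(1 / 4.6))

fit_raw   <- lm(life ~ spend, data = demo)
fit_trans <- lm(I(life^4.6) ~ I(spend^0.06), data = demo)

# Back-transform the second model's fitted values to years before comparing
pred_raw   <- fitted(fit_raw)
pred_trans <- fitted(fit_trans)^(1 / 4.6)

# Root mean squared error, in years for both models
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
print(c(raw = rmse(demo$life, pred_raw),
        transformed = rmse(demo$life, pred_trans)))
```

On the WHO data the same idea would use `fitted(model)` and `fitted(remodel)^(1/4.6)` against `who_data$LifeExp`.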

Ques-3: Using the results from Ques-2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

Ans-3:

The fitted equation of the transformed-variables regression model is:

LifeExp^4.6 = -736527910 + 620060216 * TotExp^0.06

# Forecast for TotExp^.06 = 1.5; the 1/4.6 power converts LifeExp^4.6 back to years
LifeExp <- (-736527910 + 620060216 * 1.5)^(1/4.6)
LifeExp

## [1] 63.31153

# Forecast for TotExp^.06 = 2.5; the 1/4.6 power converts LifeExp^4.6 back to years
LifeExp <- (-736527910 + 620060216 * 2.5)^(1/4.6)
LifeExp

## [1] 86.50645
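The two forecasts follow the same two steps: predict on the LifeExp^4.6 scale, then take the 1/4.6 power. A small vectorized helper makes that explicit. This is a sketch, and `forecast_life_exp` is a hypothetical name. The coefficients are the rounded values printed in the summary, so it should reproduce the two results above.

```r
# Coefficients copied from the summary of the transformed model above.
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: forecast LifeExp (years) from a value of TotExp^0.06
forecast_life_exp <- function(totexp_06) {
  life_46 <- b0 + b1 * totexp_06   # prediction on the LifeExp^4.6 scale
  life_46^(1 / 4.6)                # back-transform to years
}

forecast_life_exp(c(1.5, 2.5))
```

With the fitted object available, `predict(remodel, newdata = data.frame(TotExp_trans = c(1.5, 2.5)))^(1/4.6)` would do the same using the unrounded coefficients, so its results may differ slightly in the last digits.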

Ques-4: Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

Ans-4:

Build multiple regression model

mlt_model<-lm(LifeExp~TotExp + PropMD + PropMD * TotExp, data=who_data)

Multiple regression model summary

summary(mlt_model)

## 
## Call:
## lm(formula = LifeExp ~ TotExp + PropMD + PropMD * TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp:PropMD -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Interpret statistics for multiple regression model

F-statistic: The F-statistic is 34.49 on 3 and 186 degrees of freedom, with a very low p-value (< 2.2e-16), which indicates that the model is statistically significant.

R-squared: The R-squared value is 0.3574, which means that the model explains about 35.74% of the variance in life expectancy. The adjusted R-squared is 0.3471. Adjusted R-squared is never higher than R-squared because it penalizes the number of predictors, so the small gap here is not by itself a sign of overfitting.

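Residual standard error: The residual standard error is 8.765, which means that observed life expectancy typically deviates from the model’s predictions by about 8.8 years. This is only slightly lower than the 9.371 of the simple model in Ans-1.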

p-values: All the coefficients have p-values below 0.05, indicating that they are statistically significant.

Based on the statistics above, the model is statistically significant, but it explains only about 36% of the variance in life expectancy. It has some explanatory power, yet overall it is not a good fit for the given dataset.
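One way to read the negative interaction coefficient is that the effect of TotExp on LifeExp depends on PropMD: the slope with respect to TotExp is b2 + b3 x PropMD. The sketch below evaluates that slope using the rounded coefficients from the summary above. `totexp_slope` is a hypothetical helper name, and the PropMD values are chosen only for illustration.

```r
# Coefficients copied from the multiple regression summary above.
b_totexp <- 7.233e-05    # coefficient of TotExp
b_inter  <- -6.026e-03   # coefficient of TotExp:PropMD

# Hypothetical helper: change in LifeExp per unit of TotExp at a given PropMD
totexp_slope <- function(prop_md) b_totexp + b_inter * prop_md

# Slope at a low, a moderate, and a very high proportion of doctors
totexp_slope(c(0.0002, 0.003, 0.03))

# PropMD at which the TotExp slope changes sign
-b_totexp / b_inter
```

Under these coefficients the slope is positive for small PropMD and turns negative once PropMD exceeds roughly 0.012, which is one more reason to be careful when the model is applied to extreme inputs.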

Ques-5: Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Ans-5:

From the model above, the regression equation (with rounded coefficients) is: LifeExp = 62.8 + 1497 x PropMD + 0.000072 x TotExp - 0.006 x PropMD x TotExp

LifeExp <- 62.8 + 1497 * 0.03 + 0.000072 * 14 - 0.006 * 14 * 0.03 # where PropMD = 0.03 and TotExp = 14
LifeExp

## [1] 107.7085

The forecast life expectancy is 107.71 years. This value is not realistic, because no country has a life expectancy that high. The inputs are also implausible as a pair: the proportion of doctors in the population is not independent of total healthcare expenditure, so raising the proportion of doctors also raises total expenditure. It is therefore not practical to combine a drastic increase in the proportion of doctors with a drastic decrease in total expenditure. In the rows printed at the top, the largest PropMD is about 0.0033 and the smallest TotExp is 112, so PropMD = 0.03 and TotExp = 14 both lie far outside those values and the model is being extrapolated.
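How extreme the forecast inputs are can be checked mechanically by comparing them with the observed range of each predictor. The sketch below uses only the six rows printed by `head(who_data)` as a stand-in for the full dataset, which is an assumption made to keep the example self-contained. `head_rows`, `new_point`, and `outside` are hypothetical names.

```r
# Values copied from the head(who_data) output (first six countries only).
head_rows <- data.frame(
  PropMD = c(0.000228841, 0.001143127, 0.001060478,
             0.003297297, 0.000070400, 0.000142857),
  TotExp = c(112, 3297, 5292, 172314, 1656, 13046)
)
new_point <- c(PropMD = 0.03, TotExp = 14)

# TRUE where the forecast input falls outside the range seen in these rows
outside <- sapply(names(new_point), function(v) {
  new_point[[v]] < min(head_rows[[v]]) || new_point[[v]] > max(head_rows[[v]])
})
print(outside)
```

On the full data, `range(who_data$PropMD)` and `range(who_data$TotExp)` would give the actual limits to compare against.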

Data605_Assignment12

Mahmud Hasan Al Raji

2023-04-23