DATA 605 Assignment Week 12

The attached who.csv dataset contains real-world data from 2008. The variables included follow.
Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

Loading the Data and Minor Inspection
The data is loaded into R below.
The data contained no NA/missing values.

library(ggplot2)
library(kableExtra)
library(readr)
library(tidyverse)
url <- "https://raw.githubusercontent.com/greggmaloy/Data_605/main/who.csv"
data <- read_csv(url)

df = data
head(data) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

Country	LifeExp	InfantSurvival	Under5Survival	TBFree	PropMD	PropRN	PersExp	GovtExp	TotExp
Afghanistan	42	0.835	0.743	0.99769	0.0002288	0.0005723	20	92	112
Albania	71	0.985	0.983	0.99974	0.0011431	0.0046144	169	3128	3297
Algeria	71	0.967	0.962	0.99944	0.0010605	0.0020914	108	5184	5292
Andorra	82	0.997	0.996	0.99983	0.0032973	0.0035000	2589	169725	172314
Angola	41	0.846	0.740	0.99656	0.0000704	0.0011462	36	1620	1656
Antigua and Barbuda	73	0.990	0.989	0.99991	0.0001429	0.0027738	503	12543	13046

apply(df, 2, function(col) sum(is.na(col))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)

	x
Country	0
LifeExp	0
InfantSurvival	0
Under5Survival	0
TBFree	0
PropMD	0
PropRN	0
PersExp	0
GovtExp	0
TotExp	0


QUESTION 1
1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Scatter Plot
The relationship shown in the scatter plot does not appear linear.

ggplot(data, aes(x = TotExp, y = LifeExp)) +
  geom_point() +  
  labs(x = "Total Expenditure", y = "Life Expectancy", title = "Scatterplot of Life Expectancy vs Total Expenditure") +
  theme_minimal()

Linear Model
Below is the model.

model <- lm(LifeExp ~ TotExp, data = data)

#LM summary stats
summary(model)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

F Statistics
Value: 65.26 on 1 and 188 DF, p-value: 7.714e-14
Interpretation: The F-statistic tests whether the model with TotExp as a predictor explains significantly more of the variance in LifeExp than an intercept-only model. A value of 65.26 on 1 and 188 degrees of freedom is large, and the associated p-value is statistically significant (p-value < 0.05). This indicates that the relationship between LifeExp and TotExp is unlikely to be due to chance, although significance alone does not measure how strong the relationship is.
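
As a rough check, the F-statistic can be recovered from the R^2 and the degrees of freedom reported in the summary. The sketch below uses the rounded values printed above rather than the fitted model object, so it should match the summary only up to rounding. The variable names are illustrative.

```r
# Values copied from the printed summary(model) output (rounded)
r2  <- 0.2577   # Multiple R-squared
df1 <- 1        # numerator degrees of freedom (one predictor)
df2 <- 188      # residual degrees of freedom

# F = (R^2 / df1) / ((1 - R^2) / df2)
f_stat <- (r2 / df1) / ((1 - r2) / df2)

# The upper tail of the F distribution gives the p-value
p_val <- pf(f_stat, df1, df2, lower.tail = FALSE)

f_stat
p_val
```

Because F depends only on R^2 and the degrees of freedom, a low R^2 can still yield a large F when there are many observations.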

R-Squared
Multiple R-squared: A multiple R^2 value of 0.2577 denotes that approximately 25.77% of the variance in LifeExp can be explained by TotExp. Although we know the model is statistically significant, ~75% of the variance of LifeExp is not accounted for in the model, indicating the model could be improved.
Adjusted R-squared: The adjusted R^2 value of 0.2537 is close to the multiple R^2 value of 0.2577. Since there is only one predictor in the model, the penalty for model size is small and the adjusted R^2 adds little beyond the multiple R^2.

Coefficient Standard Error and P-Value
P-Value: The p-value for ‘TotExp’ is statistically significant (p-value = 7.71e-14, < 0.05). The coefficient estimate suggests that for each one-dollar increase in ‘TotExp’, ‘LifeExp’ increases by 0.00006297 years, which is quite small.

SE: The standard error of the TotExp coefficient is small (SE = 7.795e-06) relative to the coefficient estimate (6.297e-05). The t-value is the ratio of the estimate to its standard error (t-value = 8.079), and a ratio this large suggests that TotExp is a statistically significant predictor of LifeExp.
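
The link between the estimate, its standard error, the t-value, and the p-value can be sketched from the printed coefficient table. The numbers below are the rounded values from the summary above, and the variable names are illustrative. The 95% confidence interval is an addition that the summary does not print.

```r
# Rounded values copied from the TotExp row of summary(model)
estimate <- 6.297e-05
se       <- 7.795e-06
df_resid <- 188

# t-value is the estimate divided by its standard error
t_value <- estimate / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df_resid, lower.tail = FALSE)

# Approximate 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * qt(0.975, df_resid) * se

t_value
p_value
ci
```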

Residual Standard Error
RSE: The residual standard error of ~9 is large, since it means a typical prediction misses by about 9 years of life expectancy. The spread of the 1st and 3rd quartile residual values (1st q = -4.778 vs 3rd q = 7.116) adds further context. Generally speaking, when residuals are normally distributed, the RSE is approximately 1.5 times the magnitude of the 1st and 3rd quartile residuals. Here 1.5 times the quartiles gives about 7.2 and 10.7, which sit on either side of the RSE of 9.371, and the median residual of 3.154 is well above zero. This asymmetry suggests the residuals are skewed rather than normally distributed.
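
The "1.5 times" rule of thumb follows from the quartiles of a normal distribution, which fall about 0.674 standard deviations from the mean. A minimal sketch, using the rounded RSE and quartiles from the summary above (variable names are illustrative):

```r
rse <- 9.371                            # residual standard error from the summary
observed <- c(q1 = -4.778, q3 = 7.116)  # residual quartiles from the summary

# For normal residuals the quartiles fall at +/- qnorm(0.75) * sigma
q75 <- qnorm(0.75)
expected_quartile <- q75 * rse

# Equivalent statement: sigma is about 1.48 times the quartile magnitude
ratio <- 1 / q75

# Observed ratios of RSE to each quartile magnitude
observed_ratio <- rse / abs(observed)

expected_quartile
ratio
observed_ratio
```

If the two observed ratios were both near the normal-theory ratio, the quartiles would be consistent with normal residuals. One ratio well above and one well below points to skew.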

LM Assumptions
Linearity: The residuals versus fitted plot clearly demonstrates that the relationship is not linear.
Homoscedasticity (Equal Variance): This assumption is violated, as shown by the residuals versus fitted plot. The residuals are not uniformly distributed around zero across all fitted values.
Normality of Errors: The points on the Q-Q plot do not follow a straight line, denoting the residuals are not normally distributed. Together with the residuals versus fitted plot, this suggests a straight-line model is not appropriate for the untransformed data.

par(mfrow=c(2,2))
plot(model)
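
The visual read of the Q-Q plot can be supplemented with a formal test. The sketch below is not part of the original analysis: it wraps the Shapiro-Wilk test from base R in a hypothetical helper, `residual_normality_p`, and demonstrates it on simulated data with deliberately skewed errors rather than on the WHO data.

```r
# Hypothetical helper: Shapiro-Wilk p-value for the residuals of a fitted lm
residual_normality_p <- function(fit) {
  shapiro.test(residuals(fit))$p.value
}

# Simulated example (not the WHO data): a straight line with skewed errors
set.seed(605)
x <- runif(190, 0, 10)
y <- 2 + 0.5 * x + (rexp(190) - 1)   # centered exponential errors are right-skewed
fit_sim <- lm(y ~ x)

# A small p-value is evidence against normally distributed residuals
p_sim <- residual_normality_p(fit_sim)
p_sim
```

In this report the same helper could be applied as `residual_normality_p(model)`. A small p-value would agree with the conclusion drawn from the Q-Q plot.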

LM Conclusion
Although the model is statistically significant, the high RSE (~9 years) indicates a poor fit. The RSE compared to the 1st and 3rd quartile residuals also suggests that the residuals are not normally distributed, which violates the normality of errors assumption of the LM. Furthermore, the multiple R^2 of 0.2577 indicates that approximately 25.77% of the variance in LifeExp can be explained by TotExp. This means ~75% of the variance in LifeExp is not explained.

The residual analysis using the above plots further confirms that the residuals are not normally distributed and that homoscedasticity is violated.

QUESTION 2
Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Variable Transformation and Scatterplot
A power transformation was applied to each variable of interest (TotExp^0.06 behaves nearly like a log transform) in order to:
1. Normalize the data, especially the LifeExp variable. Raising LifeExp to the 4.6 power stretches the upper end of its scale, which helps when values bunch toward the high end.
2. Stabilize the variance so that the residuals are spread more evenly (fulfilling the homoscedasticity LM assumption).
3. Establish a more linear relationship.

The resulting scatter plot appears to show a more linear relationship between the two variables than the scatter plot of the untransformed variables.

#variable transformation
data <- data %>%
  mutate(LifeExp_transformed = LifeExp^4.6, TotExp_transformed = TotExp^0.06)

#scatterplot
ggplot(data, aes(x = TotExp_transformed, y = LifeExp_transformed)) +
  geom_point() +  
  labs(x = "Total Expenditure (Transformed)", y = "Life Expectancy (Transformed)",
       title = "Scatterplot of Transformed Life Expectancy vs Transformed Total Expenditure") +
  theme_minimal()
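
To illustrate why a 0.06 power is described as nearly a log transform, the sketch below compares x^0.06 with log(x) over an illustrative, log-spaced grid. The grid is an assumption chosen to span a wide range of expenditures and is not taken from the WHO data.

```r
# Illustrative grid of positive values, evenly spaced on the log scale
x <- exp(seq(log(10), log(5e5), length.out = 100))

power_scale <- x^0.06   # the transformation used for TotExp
log_scale   <- log(x)   # a true log transform

# A correlation near 1 means the two scales are almost linearly related
cor_val <- cor(power_scale, log_scale)
cor_val
```

The two scales differ in units, but since they are close to linearly related over this range, a regression on either should behave similarly.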

Linear Model

model_transformed <- lm(LifeExp_transformed ~ TotExp_transformed, data = data)

# Display the model summary
summary_model_transformed <- summary(model_transformed)
print(summary_model_transformed)

## 
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -736527910   46817945  -15.73   <2e-16 ***
## TotExp_transformed  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

F Statistics
Value: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Interpretation: The F-value increased dramatically after transformation, from ~65 to 507.7. Both models are statistically significant. The p-value of the first model was 7.714e-14, while the p-value of the transformed LM is reported only as < 2.2e-16, the reporting floor of R's summary output, so it is smaller but the exact size of the difference cannot be read from the output. The more favorable F-statistic for the transformed LM supports the notion that the transformed LM is a better fit.

R-Squared
Multiple R-squared: A multiple R^2 value of 0.7298 denotes that approximately 72.98% of the variance in LifeExp^4.6 can be explained by TotExp^0.06. The transformed LM therefore accounts for about 47 percentage points more variance than the first model, which explained only 25.77% of the variance in LifeExp. This is a dramatic increase, though the two R^2 values describe different response scales.

Adjusted R-squared: The adjusted R^2 value of 0.7283 is close to the multiple R^2 value of 0.7298. Since there is only one predictor in the model, the adjusted R^2 adds little beyond the multiple R^2.

Coefficient Standard Error and P-Value
P-Value: The p-value for ‘TotExp_transformed’ is also statistically significant (p-value < 0.05) in the transformed LM. The p-value is smaller than in the first model (7.71e-14 for the LM versus < 2e-16 for the transformed LM), which provides further evidence that this model is a better fit.

SE: The standard error of the slope increased by a great deal when compared to the LM (LM SE = 7.795e-06 vs transformed SE = 27518940), which is expected since the transformation changed the scale of both variables. However, the t-value, which is the coefficient estimate divided by its standard error, increased with the transformed LM (LM t-value = 8.079 vs transformed t-value = 22.53). The larger t-value denotes that in the transformed LM, TotExp is a stronger predictor of LifeExp than in the first LM.

Residual Standard Error
RSE: The residual standard error of 90,490,000 grew a lot, which is expected since LifeExp was raised to the 4.6 power, so the RSE is no longer in units of years. When compared to the spread of the 1st and 3rd quartile residual values (1st q = -53,978,977 vs 3rd q = 59,139,231), 1.5 times the quartiles gives roughly 81,000,000 and 89,000,000. An RSE of 90,490,000 is close to, but not exactly, the 1.5 times value we are seeking.

LM Assumptions
Linearity: The residuals versus fitted plot appears to indicate a linear relationship and is a dramatic improvement when compared to the untransformed LM. This assumption appears to be satisfied.
Homoscedasticity (Equal Variance): This assumption appears to be fulfilled based on the residuals versus fitted plot. The residuals are roughly uniformly distributed around zero across all fitted values.
Normality of Errors: The Q-Q plot, although dramatically improved, is not a straight line, denoting the residuals are not perfectly normally distributed.

par(mfrow=c(2,2))
plot(model_transformed)

Log Transform LM Conclusion
The transformed model saw an increase in variance explained (~73% versus ~26%). The SE grew tremendously, but that was expected given the change of scale. The larger t-value denoted that the coefficient is estimated more precisely relative to its size. Furthermore, the p-value also became smaller in the transformed model, denoting increased statistical significance. However, the residuals appear to not be perfectly normally distributed: comparing the RSE to the spread of the 1st and 3rd quartile residual values (1st q = -53,978,977 vs 3rd q = 59,139,231), an RSE of 90,490,000 is not exactly the 1.5 times value we are seeking. Furthermore, we were not asked to interpret the median, but the median residual (13,697,187) is not centered at zero and sits closer to the 3rd quartile, indicating skewed residuals, a non-normal distribution of residuals, and a possibly non-linear relationship. The model could be improved. The Q-Q plot provides further evidence that the residuals are not normally distributed.

QUESTION 3
Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

# Coefficients of the transformed model
intercept <- coef(model_transformed)[1]
slope <- coef(model_transformed)[2]

# Forecasts on the transformed scale (LifeExp^4.6)
predicted_lifeExp_1_5 <- intercept + slope * 1.5
predicted_lifeExp_2_5 <- intercept + slope * 2.5

# Back-transform to life expectancy in years
actual_lifeExp_1_5 <- predicted_lifeExp_1_5^(1/4.6)
actual_lifeExp_2_5 <- predicted_lifeExp_2_5^(1/4.6)

actual_lifeExp_1_5

## (Intercept) 
##    63.31153

actual_lifeExp_2_5

## (Intercept) 
##    86.50645
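
The same forecasts can be reproduced without the model object, which makes the back-transformation step explicit. The sketch below hard-codes the coefficients as printed in the summary above (rounded to whole numbers), and `forecast_life_exp` is a hypothetical helper name, so results should agree with the output above only up to rounding.

```r
# Coefficients copied from the printed summary of the transformed model
b0 <- -736527910   # intercept
b1 <- 620060216    # slope for TotExp^0.06

# Hypothetical helper: forecast LifeExp in years for a given TotExp^0.06
forecast_life_exp <- function(totexp_pow) {
  life_exp_pow <- b0 + b1 * totexp_pow   # forecast on the LifeExp^4.6 scale
  life_exp_pow^(1 / 4.6)                 # back-transform to years
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
forecasts
```

With the fitted object available, `predict(model_transformed, newdata = data.frame(TotExp_transformed = c(1.5, 2.5)))^(1/4.6)` should give the same forecasts while avoiding hand-copied coefficients.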

QUESTION 4
Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

model_multiple <- lm(LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = data)
summary(model_multiple)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD:TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

F Statistics
Value: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
Interpretation: The F-value of the multiple LM is much lower than the transformed LM's F-value of 507.7. The p-values of both models fall below the 2.2e-16 reporting floor, so the multiple LM remains statistically significant. The more favorable F-statistic for the transformed LM supports the notion that the transformed LM is a better fit.

R-Squared
Multiple R-squared: A multiple R^2 value of 0.3574 denotes that approximately 35.74% of the variance in LifeExp can be explained by PropMD, TotExp, and their interaction. The transformed LM (multiple R^2 = 0.7298) accounted for ~37 percentage points more variance. This is a dramatic decrease in variance explained when compared to the transformed model.

Adjusted R-squared: The adjusted R^2 value of 0.3471 is close to the multiple R^2 value of 0.3574. Since this model has multiple predictors, the adjusted R^2, which penalizes additional predictors, is the more appropriate figure and indicates the model accounts for roughly 34.71% of the variance in LifeExp.

Coefficient Standard Error and P-Value
P-Value: The p-values for all terms (PropMD, TotExp, and the PropMD:TotExp interaction) are statistically significant (p-value < 0.05) in the multiple LM.

SE: The standard error for each term is small relative to its coefficient estimate, as reflected in the t-values. The t-value for the transformed LM is higher, however, which indicates a stronger relationship between variables in the transformed LM than in the multiple LM (transformed LM t-value = 22.53 versus multiple LM t-values of 5.371, 8.053, and -4.093).
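
Because the model includes an interaction, the effect of TotExp on LifeExp is not a single number: it depends on PropMD. The sketch below works this out from the rounded coefficients printed in the summary. The helper `totexp_effect` is hypothetical, and the two PropMD values are taken from the first rows of the data shown earlier.

```r
# Rounded coefficients copied from summary(model_multiple)
b_totexp <- 7.233e-05    # TotExp
b_inter  <- -6.026e-03   # PropMD:TotExp

# Hypothetical helper: change in LifeExp per unit of TotExp at a given PropMD
totexp_effect <- function(propmd) b_totexp + b_inter * propmd

# PropMD values from the first rows of the data shown earlier
effects <- totexp_effect(c(Angola = 0.0000704, Andorra = 0.0032973))

# PropMD at which the estimated TotExp effect would change sign
propmd_flip <- -b_totexp / b_inter

effects
propmd_flip
```

The negative interaction coefficient means the estimated benefit of additional spending shrinks as the proportion of MDs rises.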

Residual Standard Error
RSE: The residual standard error of 8.765 is rather high, as it represents about 8.8 years. The large size of the RSE is further reinforced when taking into account the spread of the 1st and 3rd quartile residual values (1st q = -4.132 vs 3rd q = 6.54, median = 2.098). The median is not centered at zero and is skewed toward the 3rd quartile, providing evidence that the residuals are not normally distributed.

LM Assumptions
Linearity: The residuals versus fitted plot clearly demonstrates that the relationship is not linear.
Homoscedasticity (Equal Variance): This assumption is violated, as shown by the residuals versus fitted plot. The residuals are not uniformly distributed around zero across all fitted values.
Normality of Errors: The Q-Q plot shows some minor skew, denoting the residuals are not normally distributed.

par(mfrow=c(2,2))
plot(model_multiple)

Multiple LM Conclusion
This model is a better fit than the initial simple linear model, but overall the transformed LM has the best fit of the three models. The next step would be to add additional predictors to the transformed LM.

QUESTION 5
5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

intercept <- 6.277e+01
coeff_PropMD <- 1.497e+03
coeff_TotExp <- 7.233e-05
coeff_Interaction <- -6.026e-03
PropMD_value <- 0.03
TotExp_value <- 14

LifeExp_Pred <- intercept + 
                    (coeff_PropMD * PropMD_value) + 
                    (coeff_TotExp * TotExp_value) + 
                    (coeff_Interaction * PropMD_value * TotExp_value)

LifeExp_Pred

## [1] 107.6785

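The predicted life expectancy of 107.7 years seems unrealistic, as it is far higher than any life expectancy in the rows shown above (the highest is 82 for Andorra). The inputs are also unusual: PropMD = 0.03 is roughly nine times Andorra's 0.0033, and TotExp = 14 is below Afghanistan's 112, so the forecast extrapolates well outside the data shown.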
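
Breaking the forecast into its four terms shows where the high value comes from. This sketch reuses the rounded coefficients and inputs from the code above. The vector name `contrib` is illustrative.

```r
# Contribution of each term to the forecast (rounded coefficients from above)
contrib <- c(
  intercept   = 6.277e+01,
  PropMD      = 1.497e+03 * 0.03,
  TotExp      = 7.233e-05 * 14,
  interaction = -6.026e-03 * 0.03 * 14
)

contrib
total <- sum(contrib)
total
```

Nearly all of the increase over the intercept comes from the PropMD term, which reflects applying the PropMD coefficient to a value much larger than those in the rows shown earlier.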

Gregg Maloy