library(tidyverse)
data <- read.csv("/Users/mohamedhassan/Downloads/who.csv")
summary(data)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN PersExp
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883 Min. : 3.00
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455 1st Qu.: 36.25
## Median :0.9992 Median :0.0010474 Median :0.0027584 Median : 199.50
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336 Mean : 742.00
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164 3rd Qu.: 515.25
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387 Max. :6350.00
## GovtExp TotExp
## Min. : 10.0 Min. : 13
## 1st Qu.: 559.5 1st Qu.: 584
## Median : 5385.0 Median : 5541
## Mean : 40953.5 Mean : 41696
## 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :476420.0 Max. :482750
plot(data[,"TotExp"], data[,"LifeExp"], main="Relationship Between Total Expenditures and Life Expectancy",
xlab="Sum of Personal and Government Expenditures", ylab="Average Life Expectancy")
model1 <- lm(LifeExp~TotExp, data=data)
summary(model1)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
With a single predictor, the F-statistic (65.26) is the square of the
slope's t-value (8.079), up to rounding. Taken with the small p-value
of 7.71e-14, it indicates that the null hypothesis of a zero slope can
be rejected and the model is statistically significant. The same
p-value indicates that TotExp is a statistically significant predictor
of LifeExp. The Adjusted R-squared value indicates that only 25.37% of
the variation in LifeExp can be explained by TotExp. The standard error
of the TotExp coefficient is 7.795e-06, which the t-value shows to be
8.079 times smaller than the regression coefficient of TotExp,
6.297e-05. Typically, the standard error of a good model should be at
least five to ten times smaller than the corresponding coefficient.
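The link between the coefficient, its standard error, the t-value, and the F-statistic can be checked by hand. The sketch below is illustrative only: it re-enters the rounded values printed in the summary above instead of reading them from `model1`, so it agrees with the printed output only up to rounding.

```r
# Values copied from the printed summary(model1) output (rounded).
estimate  <- 6.297e-05   # slope for TotExp
std_error <- 7.795e-06   # standard error of the slope

# The t-value is the estimate divided by its standard error.
t_value <- estimate / std_error

# With a single predictor, the F-statistic equals the squared t-value.
f_stat <- t_value^2

cat("t-value:", round(t_value, 2), "\n")
cat("F-statistic:", round(f_stat, 1), "\n")
```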
plot(fitted(model1), residuals(model1), xlab="fitted", ylab="residuals")
abline(h=0)
plot(model1)
ggplot(model1, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
The residuals vs. fitted values plot shows the points bunched toward
the left, with an unequal distribution above and below the zero line.
This indicates that the variance of the residuals is not constant
(heteroscedasticity) and that a straight line does not do a good job of
capturing the relationship between LifeExp and TotExp. Additionally,
the right and left tails of the Q-Q plot deviate from the reference
line, which indicates that the residuals are not normally distributed.
data_new <- data %>%
mutate(LifeExp2 = LifeExp^4.6) %>%
mutate(TotExp2 = TotExp^.06)
plot(data_new[,"TotExp2"], data_new[,"LifeExp2"], main="Relationship Between Transformed Total Expenditures and Life Expectancy",
xlab="TotExp^0.06", ylab="LifeExp^4.6")
model2 <- lm(LifeExp2~TotExp2, data=data_new)
summary(model2)
##
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = data_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp2 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
Transforming each variable substantially improved the performance of
the model. The F-statistic increased from 65.26 to 507.7. Combined with
the small p-value (less than 2.2e-16), this indicates that the null
hypothesis can be rejected and the model is statistically significant.
The small p-value of the slope also indicates that the newly
transformed variable TotExp2 is a statistically significant predictor
of the transformed dependent variable, LifeExp2. The Adjusted R-squared
increased as well, from 25.37% to 72.83%, which indicates that this
model explains much more of the variation in LifeExp2 than the first
model explained in LifeExp. The t-value shows that the standard error,
27518940, is 22.53 times smaller than the corresponding coefficient,
620060216. As stated earlier, a good model has a t-value (ratio of
coefficient to standard error) that shows the standard error being at
least five to ten times smaller than the coefficient. Taken altogether,
this model does a better job than the initial model.
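The Adjusted R-squared and F-statistic quoted for both models can be reproduced from the Multiple R-squared values alone. The following is a minimal sketch under stated assumptions: the helper names `adj_r_squared` and `f_from_r2` are hypothetical, and the inputs are the rounded values from the printed summaries, so the results match the output only up to rounding.

```r
# Hypothetical helpers (not part of the original analysis).
# n = number of observations, p = number of predictors.
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}
f_from_r2 <- function(r2, n, p) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

n <- 190  # 188 residual degrees of freedom + 2 estimated coefficients

# Multiple R-squared values copied from the two printed summaries.
r2_model1 <- 0.2577
r2_model2 <- 0.7298

adj1 <- adj_r_squared(r2_model1, n, p = 1)
adj2 <- adj_r_squared(r2_model2, n, p = 1)
f1 <- f_from_r2(r2_model1, n, p = 1)
f2 <- f_from_r2(r2_model2, n, p = 1)

cat("Adjusted R-squared:", round(adj1, 3), "vs", round(adj2, 3), "\n")
cat("F-statistic:", round(f1, 1), "vs", round(f2, 1), "\n")
```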
plot(fitted(model2), residuals(model2), xlab="fitted", ylab="residuals")
abline(h=0)
plot(model2)
ggplot(model2, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
The plots of the new model support the argument that this model is better. The Residual vs. Fitted Values plot shows randomly scattered residual points with no discernible pattern and a more even distribution above and below the horizontal axis. In the Q-Q plot the points fall closer to the reference line, although the left tail still deviates from it.
# Forecast on the transformed scale, then undo the 4.6 power on LifeExp.
# Coefficients are taken from summary(model2).
x <- 1.5
forecast1 <- round((-736527910 + 620060216 * x)^(1/4.6), 1)
cat("The Forecast of Life Expectancy when TotExp^.06 = 1.5 is", forecast1)
## The Forecast of Life Expectancy when TotExp^.06 = 1.5 is 63.3
y <- 2.5
forecast2 <- round((-736527910 + 620060216 * y)^(1/4.6), 1)
cat("The Forecast of Life Expectancy when TotExp^.06 = 2.5 is", forecast2)
## The Forecast of Life Expectancy when TotExp^.06 = 2.5 is 86.5
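The two forecasts apply the same back-transformation, which can be wrapped in one function. This is a sketch only: `forecast_life_exp` is a hypothetical helper, and the coefficients are re-entered from the printed summary instead of being read from `model2`. It also converts the transformed inputs back to raw TotExp and compares them with the largest transformed value in the data.

```r
# Coefficients copied from the printed summary(model2) output.
b0 <- -736527910
b1 <- 620060216

# Hypothetical helper: predict LifeExp from a value of TotExp^0.06 by
# evaluating the fitted line and undoing the 4.6 power on the response.
forecast_life_exp <- function(tot_exp_06) {
  (b0 + b1 * tot_exp_06)^(1 / 4.6)
}

inputs <- c(1.5, 2.5)
forecasts <- round(forecast_life_exp(inputs), 1)
print(forecasts)

# Convert the transformed inputs back to raw TotExp for context.
raw_tot_exp <- inputs^(1 / 0.06)
print(raw_tot_exp)

# Largest transformed value in the data (max TotExp is 482750).
max_tot_exp_06 <- 482750^0.06
print(max_tot_exp_06)
```

The largest transformed value in the data is roughly 2.19, so an input of 2.5 lies beyond the observed range and the second forecast is an extrapolation.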
$LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp$
model3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data=data)
summary(model3)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
When analyzing the summary of the model, we can see that it explains
only a modest share of the variation in the dependent variable,
LifeExp. The F-statistic of 34.49 on 3 and 186 degrees of freedom, with
an overall p-value below 2.2e-16, means that the null hypothesis that
all slope coefficients are zero can be rejected, so the model as a
whole is statistically significant. However, the Adjusted R-squared
value is only 34.71%, which is the percentage of variation in LifeExp
that can be explained by the independent variables. This is higher than
the 25.37% of the first model but well below the 72.83% of the
transformed model.
When examining each independent variable, each has a p-value less than
.05, which indicates that each term is statistically significant to
LifeExp. The t-values of PropMD (5.371) and TotExp (8.053) are between
5 and 10, which indicates that their standard errors are five to ten
times smaller than their respective coefficients. The interaction term
PropMD:TotExp has a t-value of -4.093, a somewhat smaller ratio in
absolute value, but its p-value of 6.35e-05 shows that it also has a
statistically significant impact on LifeExp. Its negative sign means
that the estimated effect of each variable on LifeExp shrinks as the
other variable increases.
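The significance of the interaction term and of the overall model can be checked directly from the t- and F-distributions. The sketch below is illustrative only and uses the rounded values printed in the summary of `model3`, so the p-values match the output only approximately.

```r
# Values copied from the printed summary(model3) output.
t_interaction <- -4.093  # t-value of PropMD:TotExp
df_resid <- 186          # residual degrees of freedom

# Two-sided p-value for the interaction term.
p_interaction <- 2 * pt(-abs(t_interaction), df = df_resid)

# Critical |t| at the 5% level for comparison.
t_crit <- qt(0.975, df = df_resid)

# Overall F-test: F = 34.49 on 3 and 186 degrees of freedom.
p_overall <- pf(34.49, df1 = 3, df2 = df_resid, lower.tail = FALSE)

cat("Interaction p-value:", signif(p_interaction, 3), "\n")
cat("Critical |t| at 5%:", round(t_crit, 3), "\n")
cat("Overall F-test p-value:", signif(p_overall, 3), "\n")
```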
plot(model3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
ggplot(model3, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(title="Residual vs. Fitted Values Plot") +
xlab("Fitted values") +
ylab("Residuals")
The plots follow the pattern of the initial model. The residual
points are concentrated on the left side of the plot and not randomly
scattered, with an unequal distribution above and below the horizontal
axis. This indicates that the variance of the residuals is not constant
(heteroscedasticity) and that the model does not do a good job of
capturing the relationship between the independent variables and
LifeExp. Additionally, the right and left tails of the Q-Q plot deviate
from the reference line, which indicates that the residuals are not
normally distributed.
# Forecast from the coefficients in summary(model3).
x <- 0.03  # PropMD
z <- 14    # TotExp
forecast3 <- 62.77 + (1497 * x) + (0.00007233 * z) - (0.006026 * x * z)
cat("The Forecast of Life Expectancy when PropMD = 0.03 and TotExp = 14 is", forecast3)
## The Forecast of Life Expectancy when PropMD = 0.03 and TotExp = 14 is 107.6785
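This doesn’t seem realistic. The mean of LifeExp in the dataset is
67.38, with a Min of 40 and a Max of 83, so a forecast of 107.68 far
surpasses every observed value. The forecast is driven almost entirely
by the PropMD term, which adds 1497 * 0.03 = 44.91 years to the
intercept, while the TotExp and interaction terms together change it by
less than 0.01 years. The inputs are also an unusual combination:
PropMD = 0.03 is close to the dataset Max of 0.0351 and more than ten
times the 3rd quartile of 0.0025, while TotExp = 14 is barely above the
Min of 13. The model is extrapolating to a combination that the data
hardly covers, and a life expectancy of 107 doesn’t seem plausible.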
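To see why this forecast is out of line, it helps to compare the inputs with the observed ranges and to split the forecast into its terms. The sketch below is a rough check under stated assumptions: the ranges, quartiles, and coefficients are re-entered from the printed output, and the column names `position` and `above_q3` are illustrative.

```r
# Ranges and third quartiles copied from the printed summary(data) output.
ranges <- data.frame(
  variable = c("PropMD", "TotExp"),
  min      = c(0.0000196, 13),
  q3       = c(0.0024584, 26331),
  max      = c(0.0351290, 482750)
)

# Inputs used in the forecast above, in the same order as the rows.
ranges$input <- c(0.03, 14)

# Position of each input within its observed range (0 = min, 1 = max).
ranges$position <- (ranges$input - ranges$min) / (ranges$max - ranges$min)
ranges$above_q3 <- ranges$input > ranges$q3

print(ranges)

# Contribution of each term to the forecast, using the printed coefficients.
contributions <- c(
  intercept   = 62.77,
  PropMD      = 1497 * 0.03,
  TotExp      = 0.00007233 * 14,
  interaction = -0.006026 * 0.03 * 14
)
print(round(contributions, 4))
print(sum(contributions))
```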