knitr::opts_chunk$set(echo = TRUE)
AC <- read.csv("AllCountries.csv")
str(AC)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
summary(AC)
## Country Code LandArea Population
## Length:217 Length:217 Min. : 0.01 Min. : 0.0120
## Class :character Class :character 1st Qu.: 10.83 1st Qu.: 0.7728
## Mode :character Mode :character Median : 94.28 Median : 6.5725
## Mean : 608.38 Mean : 35.0335
## 3rd Qu.: 446.30 3rd Qu.: 25.0113
## Max. :16376.87 Max. :1392.7300
## NA's :8 NA's :1
## Density GDP Rural CO2
## Min. : 0.1 Min. : 275 Min. : 0.00 Min. : 0.0400
## 1st Qu.: 37.5 1st Qu.: 2032 1st Qu.:19.62 1st Qu.: 0.8575
## Median : 92.1 Median : 5950 Median :38.15 Median : 2.7550
## Mean : 361.4 Mean : 14733 Mean :39.10 Mean : 4.9780
## 3rd Qu.: 219.8 3rd Qu.: 17298 3rd Qu.:57.83 3rd Qu.: 6.2525
## Max. :20777.5 Max. :114340 Max. :87.00 Max. :43.8600
## NA's :8 NA's :30 NA's :3 NA's :13
## PumpPrice Military Health ArmedForces
## Min. :0.1100 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.7450 1st Qu.: 3.015 1st Qu.: 6.157 1st Qu.: 12.0
## Median :0.9800 Median : 4.650 Median : 9.605 Median : 31.5
## Mean :0.9851 Mean : 6.178 Mean :10.597 Mean : 162.1
## 3rd Qu.:1.1800 3rd Qu.: 8.445 3rd Qu.:13.713 3rd Qu.: 146.5
## Max. :2.0000 Max. :31.900 Max. :39.460 Max. :3031.0
## NA's :50 NA's :67 NA's :29 NA's :49
## Internet Cell HIV Hunger
## Min. : 1.30 Min. : 13.70 Min. : 0.100 Min. : 2.50
## 1st Qu.:29.18 1st Qu.: 83.83 1st Qu.: 0.175 1st Qu.: 2.50
## Median :58.35 Median :110.00 Median : 0.400 Median : 6.50
## Mean :54.47 Mean :107.05 Mean : 1.941 Mean :11.25
## 3rd Qu.:78.92 3rd Qu.:127.50 3rd Qu.: 1.400 3rd Qu.:14.80
## Max. :98.90 Max. :328.80 Max. :27.400 Max. :61.80
## NA's :13 NA's :15 NA's :81 NA's :52
## Diabetes BirthRate DeathRate ElderlyPop
## Min. : 1.000 Min. : 7.00 Min. : 1.600 Min. : 1.200
## 1st Qu.: 5.350 1st Qu.:11.40 1st Qu.: 5.800 1st Qu.: 3.600
## Median : 7.200 Median :17.85 Median : 7.250 Median : 6.600
## Mean : 8.542 Mean :20.11 Mean : 7.683 Mean : 8.953
## 3rd Qu.:10.750 3rd Qu.:27.65 3rd Qu.: 9.350 3rd Qu.:14.500
## Max. :30.500 Max. :47.80 Max. :15.500 Max. :27.500
## NA's :10 NA's :15 NA's :15 NA's :24
## LifeExpectancy FemaleLabor Unemployment Energy
## Min. :52.20 Min. : 6.20 Min. : 0.100 Min. : 66
## 1st Qu.:66.90 1st Qu.:50.15 1st Qu.: 3.400 1st Qu.: 738
## Median :74.30 Median :60.60 Median : 5.600 Median : 1574
## Mean :72.46 Mean :57.95 Mean : 7.255 Mean : 2664
## 3rd Qu.:77.70 3rd Qu.:69.25 3rd Qu.: 9.400 3rd Qu.: 3060
## Max. :84.70 Max. :85.80 Max. :30.200 Max. :17923
## NA's :18 NA's :30 NA's :30 NA's :82
## Electricity Developed
## Min. : 39 Min. :1.00
## 1st Qu.: 904 1st Qu.:1.00
## Median : 2620 Median :2.00
## Mean : 4270 Mean :1.81
## 3rd Qu.: 5600 3rd Qu.:3.00
## Max. :53832 Max. :3.00
## NA's :76 NA's :75
simple_model <- lm(LifeExpectancy ~ GDP, data = AC)
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation:
The intercept (around 68.4) represents the predicted life expectancy when GDP is 0. This is not practically meaningful, but mathematically it’s the y-intercept.
The coefficient for GDP (about 0.00025, i.e. 2.476e-04) means that each 1 dollar increase in GDP is associated with an increase of about 0.00025 years in predicted life expectancy, or roughly 0.25 years per 1,000 dollars.
The R² value (0.43) means that GDP explains about 43% of the variance in life expectancy across countries.
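A slope per single dollar is tiny and easy to misread, so it can help to express GDP in thousands of dollars. The sketch below uses a made-up data frame (`toy`, with invented values, since `AllCountries.csv` is not reproduced here) to show that rescaling the predictor multiplies the slope by 1,000 and leaves R² unchanged.

```r
# Synthetic stand-in for the country data (values are made up for illustration)
set.seed(1)
toy <- data.frame(GDP = runif(100, min = 500, max = 60000))
toy$LifeExpectancy <- 68 + 0.00025 * toy$GDP + rnorm(100, sd = 5)

# Same model, with GDP in dollars and in thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = toy)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = toy)

slope_dollars   <- unname(coef(fit_dollars)[2])
slope_thousands <- unname(coef(fit_thousands)[2])

# The slope per 1,000 dollars is 1,000 times the slope per dollar
c(per_dollar = slope_dollars, per_thousand = slope_thousands)
```

Applied to the real data, the same idea would be `lm(LifeExpectancy ~ I(GDP / 1000), data = AC)`, which should give a slope near 0.25.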
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AC)
# View the model summary
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpretation:
The Health coefficient is about 0.248, suggesting that each 1 percentage point increase in government health spending is associated with a 0.248-year increase in predicted life expectancy, holding GDP and Internet constant.
The R² for the multiple regression model is about 0.72, compared with 0.43 for the simple regression model in question 1. Adjusted R² also rises (from about 0.43 to about 0.72), which suggests that Health and Internet explain variation in life expectancy that GDP alone does not.
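One caveat when comparing the two summaries: the simple model dropped 38 rows for missingness and the multiple model dropped 44, so they were not fitted to exactly the same countries. A minimal sketch of a like-for-like comparison follows, assuming a synthetic data frame (`toy`, invented values) in place of `AC`. Both models are refitted on the rows that are complete for every variable, and `anova()` then tests whether the extra predictors improve the fit.

```r
# Synthetic stand-in for the country data (values are made up for illustration)
set.seed(2)
n <- 150
toy <- data.frame(
  GDP      = runif(n, 500, 60000),
  Health   = runif(n, 2, 25),
  Internet = runif(n, 5, 95)
)
toy$LifeExpectancy <- 59 + 0.00002 * toy$GDP + 0.25 * toy$Health +
  0.19 * toy$Internet + rnorm(n, sd = 4)
toy$Health[sample(n, 10)] <- NA   # introduce some missing values

# Keep only rows that are complete for every variable used by the larger model
vars   <- c("LifeExpectancy", "GDP", "Health", "Internet")
toy_cc <- toy[complete.cases(toy[, vars]), ]

fit_small <- lm(LifeExpectancy ~ GDP, data = toy_cc)
fit_large <- lm(LifeExpectancy ~ GDP + Health + Internet, data = toy_cc)

# Both models now use the same rows, so adjusted R-squared is directly comparable
c(small = summary(fit_small)$adj.r.squared,
  large = summary(fit_large)$adj.r.squared)

# F-test for whether Health and Internet jointly improve the fit
anova(fit_small, fit_large)
```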
Homoscedasticity:
To check this assumption, I would look at a Residuals vs. Fitted plot along with a Scale-Location plot.
Ideally, the Residuals vs. Fitted plot shows residuals scattered randomly around zero, with a consistent spread across the range of fitted values. The Scale-Location plot shows the square root of the absolute standardized residuals against the fitted values; ideally, its red trend line stays roughly horizontal with an even spread of points.
# Homoscedasticity: residuals vs. fitted values, then the Scale-Location plot
plot(fitted(simple_model), resid(simple_model),
     xlab = "Fitted Values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
plot(simple_model, which = 3)
Interpretation:
In the Residuals vs. Fitted plot, the residuals are not spread evenly: they are tightly clustered at the lower fitted values and more scattered for fitted values between 75 and 90. In the Scale-Location plot, the points are also unevenly spread, clustering around a fitted value of 70, and the red line does not stay flat but rises around a fitted value of 80. Together, these suggest that the variance is not constant.
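A visual check can be supplemented with a formal test. The sketch below is not part of the original analysis: it computes Koenker's studentized Breusch-Pagan statistic by hand in base R, on synthetic data (`x`, `y`, invented here) whose noise grows with the predictor. The `lmtest` package provides `bptest()` for the same purpose, if it is installed.

```r
# Synthetic data whose noise grows with x (heteroscedastic by construction)
set.seed(3)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the predictor
aux <- lm(resid(fit)^2 ~ x)

# Studentized Breusch-Pagan statistic: n * R^2 of the auxiliary fit,
# compared with a chi-squared distribution (df = number of predictors)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

A small p-value is evidence against constant variance. For the model in this report, `fit` would be replaced by `simple_model` and `x` by the GDP values actually used in the fit.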
# Normality of residuals
qqnorm(resid(simple_model), main = "Normal Q-Q Plot")
qqline(resid(simple_model), col = "blue")
Interpretation:
In the Q-Q plot, the points mostly follow the straight line. They curve away from it near the theoretical quantiles of -2 and 2, but otherwise the plot matches the ideal pattern, which suggests the residuals are approximately normal apart from the tails.
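As a numeric complement to the Q-Q plot, a Shapiro-Wilk test can be run on the residuals. This is an optional addition rather than something the analysis above relied on, and the example uses the built-in `mtcars` data purely as a stand-in for the country data.

```r
# Stand-in model on built-in data (not the country data used in this report)
fit <- lm(mpg ~ wt, data = mtcars)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal
sw <- shapiro.test(resid(fit))
c(W = unname(sw$statistic), p_value = sw$p.value)
```

A small p-value would indicate a departure from normality. With real data the test and the plot are best read together, since the test becomes sensitive to minor deviations as the sample grows.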
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
Interpretation:
For the multiple model, RMSE = 4.06, meaning predictions of life expectancy are typically off by about 4 years. Countries with large residuals are those whose life expectancy the model predicts poorly. Such residuals would make me less confident in the model's predictions for those countries, since a large residual indicates a substantial underestimate or overestimate of life expectancy. To investigate further, I would look at variables such as healthcare and poverty to see whether they help explain these residuals.
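The RMSE printed above (4.056) is slightly smaller than the residual standard error in the model summary (4.104) because the RMSE divides the sum of squared residuals by the number of observations, while the residual standard error divides by the residual degrees of freedom. With 173 observations and 169 degrees of freedom, 4.104 × sqrt(169 / 173) ≈ 4.056. The sketch below checks this relationship on the built-in `mtcars` data, used only as a stand-in.

```r
# Stand-in model on built-in data (not the country data used in this report)
fit <- lm(mpg ~ wt + hp, data = mtcars)

rmse <- sqrt(mean(resid(fit)^2))   # divides by n
rse  <- sigma(fit)                 # divides by residual degrees of freedom

# The two differ only by the factor sqrt(df / n)
rmse_from_rse <- rse * sqrt(df.residual(fit) / nobs(fit))
c(rmse = rmse, rse = rse, rmse_from_rse = rmse_from_rse)
```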
Because these variables are highly correlated, including them together can affect the model. The model can still make reasonable predictions, but it becomes hard to separate the contribution of each variable. Standard errors can also become inflated, which makes the predictors appear less significant. In terms of interpretation, it is very hard to tell how much each variable contributes to predicting CO2 emissions.
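One common way to quantify this problem is the variance inflation factor (VIF). For each predictor it is 1 / (1 - R²), where R² comes from regressing that predictor on the others. The sketch below computes it by hand on synthetic predictors (`x1`, `x2`, `x3`, all invented for illustration); it does not use the variables from the CO2 question.

```r
# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated
set.seed(4)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)

# VIF for a predictor: 1 / (1 - R^2) from regressing it on the other predictors
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)
vif_x3 <- 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
c(x1 = vif_x1, x3 = vif_x3)
```

A VIF near 1 means a predictor is nearly uncorrelated with the others. Values above roughly 5 to 10 are often treated as a warning sign, although that cutoff is a rule of thumb rather than a strict rule.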