setwd("C:/Users/leyla/Documents/DATA 101")
All_countries <- read.csv("AllCountries.csv")
head(All_countries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
### Problem 1
Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
First, inspect and prepare the data to make sure no errors occur.
# Checking the variables
str(All_countries)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
# Describing my data
summary(All_countries)
## Country Code LandArea Population
## Length:217 Length:217 Min. : 0.01 Min. : 0.0120
## Class :character Class :character 1st Qu.: 10.83 1st Qu.: 0.7728
## Mode :character Mode :character Median : 94.28 Median : 6.5725
## Mean : 608.38 Mean : 35.0335
## 3rd Qu.: 446.30 3rd Qu.: 25.0113
## Max. :16376.87 Max. :1392.7300
## NA's :8 NA's :1
## Density GDP Rural CO2
## Min. : 0.1 Min. : 275 Min. : 0.00 Min. : 0.0400
## 1st Qu.: 37.5 1st Qu.: 2032 1st Qu.:19.62 1st Qu.: 0.8575
## Median : 92.1 Median : 5950 Median :38.15 Median : 2.7550
## Mean : 361.4 Mean : 14733 Mean :39.10 Mean : 4.9780
## 3rd Qu.: 219.8 3rd Qu.: 17298 3rd Qu.:57.83 3rd Qu.: 6.2525
## Max. :20777.5 Max. :114340 Max. :87.00 Max. :43.8600
## NA's :8 NA's :30 NA's :3 NA's :13
## PumpPrice Military Health ArmedForces
## Min. :0.1100 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.7450 1st Qu.: 3.015 1st Qu.: 6.157 1st Qu.: 12.0
## Median :0.9800 Median : 4.650 Median : 9.605 Median : 31.5
## Mean :0.9851 Mean : 6.178 Mean :10.597 Mean : 162.1
## 3rd Qu.:1.1800 3rd Qu.: 8.445 3rd Qu.:13.713 3rd Qu.: 146.5
## Max. :2.0000 Max. :31.900 Max. :39.460 Max. :3031.0
## NA's :50 NA's :67 NA's :29 NA's :49
## Internet Cell HIV Hunger
## Min. : 1.30 Min. : 13.70 Min. : 0.100 Min. : 2.50
## 1st Qu.:29.18 1st Qu.: 83.83 1st Qu.: 0.175 1st Qu.: 2.50
## Median :58.35 Median :110.00 Median : 0.400 Median : 6.50
## Mean :54.47 Mean :107.05 Mean : 1.941 Mean :11.25
## 3rd Qu.:78.92 3rd Qu.:127.50 3rd Qu.: 1.400 3rd Qu.:14.80
## Max. :98.90 Max. :328.80 Max. :27.400 Max. :61.80
## NA's :13 NA's :15 NA's :81 NA's :52
## Diabetes BirthRate DeathRate ElderlyPop
## Min. : 1.000 Min. : 7.00 Min. : 1.600 Min. : 1.200
## 1st Qu.: 5.350 1st Qu.:11.40 1st Qu.: 5.800 1st Qu.: 3.600
## Median : 7.200 Median :17.85 Median : 7.250 Median : 6.600
## Mean : 8.542 Mean :20.11 Mean : 7.683 Mean : 8.953
## 3rd Qu.:10.750 3rd Qu.:27.65 3rd Qu.: 9.350 3rd Qu.:14.500
## Max. :30.500 Max. :47.80 Max. :15.500 Max. :27.500
## NA's :10 NA's :15 NA's :15 NA's :24
## LifeExpectancy FemaleLabor Unemployment Energy
## Min. :52.20 Min. : 6.20 Min. : 0.100 Min. : 66
## 1st Qu.:66.90 1st Qu.:50.15 1st Qu.: 3.400 1st Qu.: 738
## Median :74.30 Median :60.60 Median : 5.600 Median : 1574
## Mean :72.46 Mean :57.95 Mean : 7.255 Mean : 2664
## 3rd Qu.:77.70 3rd Qu.:69.25 3rd Qu.: 9.400 3rd Qu.: 3060
## Max. :84.70 Max. :85.80 Max. :30.200 Max. :17923
## NA's :18 NA's :30 NA's :30 NA's :82
## Electricity Developed
## Min. : 39 Min. :1.00
## 1st Qu.: 904 1st Qu.:1.00
## Median : 2620 Median :2.00
## Mean : 4270 Mean :1.81
## 3rd Qu.: 5600 3rd Qu.:3.00
## Max. :53832 Max. :3.00
## NA's :76 NA's :75
Next, remove rows with NA in the GDP and LifeExpectancy columns, since those are the only variables used in this model.
ACountries <- na.omit(All_countries[, c("GDP", "LifeExpectancy")])
summary(ACountries)
## GDP LifeExpectancy
## Min. : 275 Min. :52.20
## 1st Qu.: 2022 1st Qu.:66.65
## Median : 5878 Median :73.70
## Mean : 14667 Mean :72.05
## 3rd Qu.: 16854 3rd Qu.:77.40
## Max. :114340 Max. :84.70
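A small illustration of why the two model columns are selected before calling `na.omit()`. The `toy` data frame below is hypothetical and is not part of AllCountries; it only shows that dropping NAs across every column would discard rows the model could still use.

```r
# Hypothetical toy data: 'Energy' has gaps that are irrelevant to this model
toy <- data.frame(
  GDP            = c(500, 5000, NA, 40000, 3000),
  LifeExpectancy = c(64, 78, 76, NA, 62),
  Energy         = c(NA, 800, 1300, NA, NA)
)

# Dropping NAs across every column keeps only fully complete rows
rows_all_columns <- nrow(na.omit(toy))

# Selecting the two model variables first keeps every row complete in both
rows_model_columns <- nrow(na.omit(toy[, c("GDP", "LifeExpectancy")]))

rows_all_columns    # 1
rows_model_columns  # 3
```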
With the data clean, I fit a simple linear regression model to see whether a country’s life expectancy can be predicted from its GDP.
model <- lm(LifeExpectancy ~ GDP, data = ACountries)
summary(model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = ACountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Interpretation
Intercept (68.42): the predicted life expectancy for a country with a GDP per capita of $0. No country has a GDP of $0, so this value is an extrapolation that anchors the line rather than a realistic prediction.
Slope (0.0002476): each additional $1 of GDP per capita is associated with an increase of 0.0002476 years in predicted life expectancy, or about 0.25 years per additional $1,000. The relationship is positive.
R-squared (0.4304): about 43% of the variation in life expectancy across countries is explained by GDP.
The p-value for the slope is extremely small (less than 2e-16), which means the relationship between GDP and life expectancy is statistically significant.
Wealthier countries tend to have a longer average lifespan. GDP explains a substantial portion (43%) of the variation in life expectancy, but other factors also contribute.
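Because a per-dollar slope is hard to read, it can help to express GDP in thousands of dollars: 0.0002476 × 1,000 = 0.2476 years per $1,000. The sketch below uses simulated data (the values and the names `sim`, `fit_dollars`, and `fit_thousands` are hypothetical) to show that rescaling the predictor multiplies the slope by 1,000 and leaves the fit unchanged.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical values)
gdp  <- runif(100, min = 300, max = 100000)
life <- 68 + 0.00025 * gdp + rnorm(100, sd = 5)
sim  <- data.frame(GDP = gdp, LifeExpectancy = life)

# Same model, with GDP in dollars and in thousands of dollars
fit_dollars   <- lm(LifeExpectancy ~ GDP, data = sim)
fit_thousands <- lm(LifeExpectancy ~ I(GDP / 1000), data = sim)

slope_dollars   <- unname(coef(fit_dollars)[2])
slope_thousands <- unname(coef(fit_thousands)[2])

# The rescaled slope equals 1000 times the per-dollar slope
all.equal(slope_thousands, slope_dollars * 1000)

# R-squared does not depend on the units of the predictor
all.equal(summary(fit_dollars)$r.squared, summary(fit_thousands)$r.squared)
```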
### Problem 2
Multiple Linear Regression (Fitting and Interpretation): Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?
First, remove rows with missing values in the four variables used by the model.
Allcountries <- na.omit(All_countries[ ,c("LifeExpectancy", "GDP", "Health", "Internet")])
summary(Allcountries)
## LifeExpectancy GDP Health Internet
## Min. :52.20 Min. : 275 Min. : 0.00 Min. : 3.90
## 1st Qu.:66.60 1st Qu.: 2016 1st Qu.: 6.15 1st Qu.:27.10
## Median :73.70 Median : 5878 Median : 9.65 Median :55.50
## Mean :71.95 Mean : 14165 Mean :10.52 Mean :52.15
## 3rd Qu.:77.30 3rd Qu.: 16434 3rd Qu.:13.56 3rd Qu.:76.40
## Max. :84.10 Max. :114340 Max. :39.46 Max. :98.30
Then, fit the multiple linear regression model.
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = Allcountries)
summary(model2)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = Allcountries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
Interpretation
Intercept (59.08): the predicted life expectancy if GDP, Health, and Internet were all 0. This is only a baseline, because such a country is not realistic.
GDP (0.00002367): after controlling for Health and Internet, GDP is not a statistically significant predictor of life expectancy in this model (p = 0.302).
Health (0.2479): holding GDP and Internet constant, each 1 percentage point increase in the share of government expenditures going to healthcare is associated with an increase of approximately 0.25 years in average life expectancy. This effect is statistically significant (p = 0.000247).
Internet (0.1903): holding GDP and Health constant, each 1 percentage point increase in the population with internet access is associated with an increase of about 0.19 years in life expectancy. This effect is statistically significant (p < 2e-16).
Adjusted R-squared (0.7164) is much higher than in the simple regression (0.4272), meaning that including Health and Internet improves the model substantially, even after the penalty for the extra predictors.
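One caveat when comparing the two models: judging by the residual degrees of freedom, the simple model was fit on 179 countries and the multiple model on 173, so the two adjusted R² values come from slightly different samples. A minimal sketch of a like-for-like comparison, assuming simulated data (all values and the names `sim`, `complete`, `simple_fit`, and `multi_fit` are hypothetical), refits both models on the same complete rows:

```r
set.seed(2)
n <- 150

# Simulated stand-in for the real data (hypothetical values)
internet <- runif(n, 5, 95)
health   <- runif(n, 2, 20)
gdp      <- pmax(5000 + 200 * internet + rnorm(n, sd = 3000), 300)
life     <- 59 + 0.19 * internet + 0.25 * health + rnorm(n, sd = 4)
sim <- data.frame(LifeExpectancy = life, GDP = gdp,
                  Health = health, Internet = internet)
sim$Health[c(3, 10, 25)] <- NA  # a few missing values, as in real data

# Keep only rows complete in every variable used by the larger model
complete <- na.omit(sim[, c("LifeExpectancy", "GDP", "Health", "Internet")])

# Fit both models on exactly the same rows
simple_fit <- lm(LifeExpectancy ~ GDP, data = complete)
multi_fit  <- lm(LifeExpectancy ~ GDP + Health + Internet, data = complete)

adj_simple <- summary(simple_fit)$adj.r.squared
adj_multi  <- summary(multi_fit)$adj.r.squared
c(simple = adj_simple, multiple = adj_multi)

# Both models now use the same number of observations
nobs(simple_fit) == nobs(multi_fit)
```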
### Problem 3
Checking Assumptions (Homoscedasticity and Normality): For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect on whether it matched the ideal outcome.
plot(All_countries$GDP, All_countries$LifeExpectancy,
xlab= "GDP", ylab= "LifeExpectancy", main= "Life Expectancy vs GDP")
abline(model, col= "orange", lwd= 2)
The scatterplot shows a clear positive relationship between GDP and life
expectancy. Countries with higher GDP tend to have a longer life
expectancy. However, the increase is sharper at low GDP levels and
levels off as GDP gets high, suggesting a nonlinear pattern, so
the linear model is not a perfect fit.
par(mfrow = c(2,2))
plot(model)
par(mfrow = c(1,1))
Residuals vs Fitted plot: ideally the points scatter randomly around zero with an even spread. Here the plot shows a clear curved pattern, which suggests the relationship is not fully linear, and the spread of the residuals does not look constant. The model may therefore not fully capture the relationship between GDP and life expectancy, which reduces the reliability of predictions for some countries.
Q-Q plot: ideally the points fall along the diagonal line. Here they mostly follow it, indicating the residuals are approximately normally distributed, with no severe deviations.
Scale-Location plot: the spread of the residuals increases slightly with the fitted values, a mild sign of non-constant variance (heteroscedasticity).
Residuals vs Leverage plot: most points have low leverage and standardized residuals near zero. No points seem to greatly influence the model.
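The visual checks above can be complemented with simple numeric checks. The sketch below is only an illustration on simulated data (the values and the names `sim`, `fit`, `normality`, and `spread` are hypothetical): it runs a Shapiro-Wilk test on the residuals and compares the residual spread in the lower and upper halves of the fitted values.

```r
set.seed(3)

# Simulated stand-in with a curved GDP relationship (hypothetical values)
gdp  <- runif(150, 300, 100000)
life <- 50 + 3 * log(gdp) + rnorm(150, sd = 3)
sim  <- data.frame(GDP = gdp, LifeExpectancy = life)

fit <- lm(LifeExpectancy ~ GDP, data = sim)
res <- residuals(fit)

# Normality: a small p-value is evidence against normally distributed residuals
normality <- shapiro.test(res)
normality$p.value

# Homoscedasticity: similar standard deviations in both halves suggest constant variance
upper_half <- fitted(fit) > median(fitted(fit))
spread <- c(low = sd(res[!upper_half]), high = sd(res[upper_half]))
spread
```

To look at one diagnostic panel at a time instead of the 2-by-2 grid, `plot(model, which = 1)` draws only the residuals vs fitted plot, and `which = 2` draws only the Q-Q plot.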
### Problem 4
Diagnosing Model Fit (RMSE and Residuals): For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
I will calculate the RMSE for model2.
# Calculate predicted values
predicted <- predict(model2)
predicted
## 1 2 3 6 7 8 9 11
## 61.76039 75.22632 70.91914 63.22893 76.61030 77.14294 73.94435 81.21674
## 12 13 15 16 18 19 20 21
## 80.72885 75.18840 79.98512 63.38389 75.48786 80.79902 71.24152 62.70711
## 23 24 25 26 27 29 30 31
## 70.37117 70.30837 76.27501 69.42274 74.59145 79.29995 74.31526 64.85760
## 32 33 34 35 36 38 39 41
## 62.26443 72.50409 67.11357 64.26288 82.52862 61.16494 61.79704 80.01292
## 42 43 44 45 46 47 48 49
## 71.88877 74.40823 61.63191 61.65483 61.75850 80.22679 68.66617 75.16923
## 52 53 54 55 57 58 59 60
## 76.97150 78.27761 83.02193 70.49757 75.59763 72.85948 68.75067 70.32113
## 61 63 64 65 67 68 69 71
## 64.94927 79.46561 68.71987 64.13055 70.64139 80.16731 79.58834 71.12349
## 72 73 74 75 77 79 81 82
## 63.54943 73.24762 81.57809 67.96632 75.42224 72.81999 71.38054 62.28954
## 83 84 85 86 88 89 90 91
## 63.03355 68.20222 62.53726 68.72823 76.65638 83.22373 66.47207 67.37936
## 93 94 96 97 98 99 100 101
## 69.04219 81.86574 78.46022 74.89763 71.67947 83.10741 74.85824 76.17099
## 102 103 105 107 108 109 110 111
## 64.01045 63.27052 81.25526 80.08073 68.01648 64.92106 77.25631 77.71047
## 112 113 114 116 117 119 120 121
## 67.48475 61.57552 63.40033 77.47316 83.34414 65.37587 64.15240 76.63030
## 122 123 124 126 127 128 129 130
## 76.34469 62.86452 78.95756 64.43836 72.39649 74.05114 67.29099 76.64273
## 132 133 134 135 136 137 139 140
## 65.00898 75.84626 73.16641 65.12012 66.14155 69.64483 64.49345 82.85788
## 142 143 144 145 146 148 149 150
## 82.92623 69.40543 62.44162 65.64177 76.97985 83.74199 76.60835 63.02177
## 152 153 154 155 156 157 158 160
## 75.78247 63.06341 74.85792 72.40703 72.35137 76.64562 76.99471 80.52426
## 161 162 163 164 166 167 168 169
## 74.30026 75.85120 65.44850 68.67220 66.66463 77.74812 66.27396 75.68921
## 170 171 172 174 175 176 178 180
## 73.14581 63.56546 80.03254 78.47079 78.06815 63.38910 73.22757 79.65603
## 181 183 185 186 187 188 189 191
## 67.78882 71.08931 74.10906 67.64374 72.63714 83.28551 84.42584 64.53600
## 192 193 194 195 196 197 198 199
## 64.50992 73.10253 65.16020 62.50951 69.01734 76.58457 73.12502 74.00584
## 200 203 204 205 206 207 208 209
## 65.45783 64.87996 71.76290 80.09556 82.76949 84.65572 77.31866 71.35029
## 210 212 215 216 217
## 65.37913 70.79891 64.18393 66.19378 67.88522
# Calculate RMSE
results <- sqrt(mean(model2$residuals^2))
results
## [1] 4.056417
The RMSE is approximately 4.06 years. This means the model’s predictions of life expectancy are typically off by about 4 years. Countries with unusually high or low life expectancy compared to what GDP, Health, and Internet predict will have larger residuals. Large residuals lower my confidence in the predictions for those countries, because they suggest the model does not fully capture other factors affecting life expectancy. I would investigate additional predictors not included in the model, such as education, environment, or diet, that may explain why life expectancy in some countries deviates, and I would check whether those countries are outliers or have data errors.
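The RMSE is closely related to the residual standard error reported by `summary()`: the RMSE divides the sum of squared residuals by n, while the residual standard error divides by the residual degrees of freedom. For model2 this gives roughly 4.104 × sqrt(169 / 173) ≈ 4.056, which agrees with the value computed above. The sketch below checks that relationship and lists the largest residuals, using simulated data (the values and the names `sim`, `fit`, and `worst` are hypothetical).

```r
set.seed(4)
n <- 60

# Simulated stand-in for the real data (hypothetical values)
sim <- data.frame(
  GDP      = runif(n, 300, 60000),
  Health   = runif(n, 2, 20),
  Internet = runif(n, 5, 95)
)
sim$LifeExpectancy <- 59 + 0.25 * sim$Health + 0.19 * sim$Internet +
  rnorm(n, sd = 4)

fit <- lm(LifeExpectancy ~ GDP + Health + Internet, data = sim)
res <- residuals(fit)

# RMSE: square root of the mean squared residual
rmse <- sqrt(mean(res^2))

# Residual standard error divides by the residual degrees of freedom instead of n
rse <- sigma(fit)
all.equal(rmse, rse * sqrt(df.residual(fit) / n))

# Rows with the five largest absolute residuals, worth inspecting one by one
worst <- head(order(abs(res), decreasing = TRUE), 5)
data.frame(row = worst, residual = round(res[worst], 2))
```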
### Problem 5
Hypothetical Example (Multicollinearity in Multiple Regression): Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.
In this hypothetical model, multicollinearity arises because the two predictors are highly correlated: both measure related aspects of energy consumption (countries that consume a lot of energy usually consume more electricity too). This makes it difficult to interpret their individual regression coefficients, because the model cannot separate the effect of Energy from that of Electricity. The model may still provide accurate predictions of CO2, but the standard errors of the coefficients will probably increase, making statistical significance harder to assess. The coefficient estimates can also become unstable, changing noticeably with small changes in the data. To deal with multicollinearity, we could remove one of the correlated variables or combine them into a single measure.
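A minimal sketch of how this could be checked, assuming simulated data rather than the real AllCountries values (all numbers and the names `sim`, `vif_energy`, `se_alone`, and `se_both` are hypothetical). It computes the correlation between the predictors, a variance inflation factor by hand as 1 / (1 - R²), and the standard error of the Energy coefficient with and without Electricity in the model.

```r
set.seed(5)
n <- 200

# Simulated predictors: Electricity is nearly a rescaled copy of Energy
energy      <- rnorm(n, mean = 2500, sd = 1500)
electricity <- 1.6 * energy + rnorm(n, sd = 300)
co2         <- 0.002 * energy + rnorm(n, sd = 1.5)
sim <- data.frame(CO2 = co2, Energy = energy, Electricity = electricity)

# Correlation between the two predictors
predictor_cor <- cor(sim$Energy, sim$Electricity)

# VIF for Energy: 1 / (1 - R^2) from regressing Energy on the other predictor
r2_energy  <- summary(lm(Energy ~ Electricity, data = sim))$r.squared
vif_energy <- 1 / (1 - r2_energy)

# Standard error of the Energy slope without and with the collinear partner
se_alone <- summary(lm(CO2 ~ Energy, data = sim))$coefficients["Energy", "Std. Error"]
se_both  <- summary(lm(CO2 ~ Energy + Electricity, data = sim))$coefficients["Energy", "Std. Error"]

c(correlation = predictor_cor, VIF = vif_energy,
  se_alone = se_alone, se_both = se_both)
```

A common rule of thumb treats a VIF above about 5 to 10 as a sign of problematic multicollinearity.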