library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(readr)
All_Countries <- read.csv('AllCountries.csv')
#Data exploration, cleaning the data and summarization
dim(All_Countries)
## [1] 217 26
colnames(All_Countries)
## [1] "Country" "Code" "LandArea" "Population"
## [5] "Density" "GDP" "Rural" "CO2"
## [9] "PumpPrice" "Military" "Health" "ArmedForces"
## [13] "Internet" "Cell" "HIV" "Hunger"
## [17] "Diabetes" "BirthRate" "DeathRate" "ElderlyPop"
## [21] "LifeExpectancy" "FemaleLabor" "Unemployment" "Energy"
## [25] "Electricity" "Developed"
colSums(is.na(All_Countries))
## Country Code LandArea Population Density
## 0 0 8 1 8
## GDP Rural CO2 PumpPrice Military
## 30 3 13 50 67
## Health ArmedForces Internet Cell HIV
## 29 49 13 15 81
## Hunger Diabetes BirthRate DeathRate ElderlyPop
## 52 10 15 15 24
## LifeExpectancy FemaleLabor Unemployment Energy Electricity
## 18 30 30 82 76
## Developed
## 75
str(All_Countries)
## 'data.frame': 217 obs. of 26 variables:
## $ Country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ...
## $ Code : chr "AFG" "ALB" "DZA" "ASM" ...
## $ LandArea : num 652.86 27.4 2381.74 0.2 0.47 ...
## $ Population : num 37.172 2.866 42.228 0.055 0.077 ...
## $ Density : num 56.9 104.6 17.7 277.3 163.8 ...
## $ GDP : int 521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
## $ Rural : num 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
## $ CO2 : num 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
## $ PumpPrice : num 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
## $ Military : num 3.72 4.08 13.81 NA NA ...
## $ Health : num 2.01 9.51 10.73 NA 14.02 ...
## $ ArmedForces : int 323 9 317 NA NA 117 0 105 49 NA ...
## $ Internet : num 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
## $ Cell : num 67.4 123.7 111 NA 104.4 ...
## $ HIV : num NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
## $ Hunger : num 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
## $ Diabetes : num 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
## $ BirthRate : num 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
## $ DeathRate : num 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
## $ ElderlyPop : num 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
## $ LifeExpectancy: num 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
## $ FemaleLabor : num 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
## $ Unemployment : num 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
## $ Energy : int NA 808 1328 NA NA 545 NA 2030 1016 NA ...
## $ Electricity : int NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
## $ Developed : int NA 1 1 NA NA 1 NA 2 1 NA ...
head(All_Countries)
## Country Code LandArea Population Density GDP Rural CO2 PumpPrice
## 1 Afghanistan AFG 652.86 37.172 56.9 521 74.5 0.29 0.70
## 2 Albania ALB 27.40 2.866 104.6 5254 39.7 1.98 1.36
## 3 Algeria DZA 2381.74 42.228 17.7 4279 27.4 3.74 0.28
## 4 American Samoa ASM 0.20 0.055 277.3 NA 12.8 NA NA
## 5 Andorra AND 0.47 0.077 163.8 42030 11.9 5.83 NA
## 6 Angola AGO 1246.70 30.810 24.7 3432 34.5 1.29 0.97
## Military Health ArmedForces Internet Cell HIV Hunger Diabetes BirthRate
## 1 3.72 2.01 323 11.4 67.4 NA 30.3 9.6 32.5
## 2 4.08 9.51 9 71.8 123.7 0.1 5.5 10.1 11.7
## 3 13.81 10.73 317 47.7 111.0 0.1 4.7 6.7 22.3
## 4 NA NA NA NA NA NA NA NA NA
## 5 NA 14.02 NA 98.9 104.4 NA NA 8.0 NA
## 6 9.40 5.43 117 14.3 44.7 1.9 23.9 3.9 41.3
## DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1 6.6 2.6 64.0 50.3 1.5 NA
## 2 7.5 13.6 78.5 55.9 13.9 808
## 3 4.8 6.4 76.3 16.4 12.1 1328
## 4 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA
## 6 8.4 2.5 61.8 76.4 7.3 545
## Electricity Developed
## 1 NA NA
## 2 2309 1
## 3 1363 1
## 4 NA NA
## 5 NA NA
## 6 312 1
summary(All_Countries)
## Country Code LandArea Population
## Length:217 Length:217 Min. : 0.01 Min. : 0.0120
## Class :character Class :character 1st Qu.: 10.83 1st Qu.: 0.7728
## Mode :character Mode :character Median : 94.28 Median : 6.5725
## Mean : 608.38 Mean : 35.0335
## 3rd Qu.: 446.30 3rd Qu.: 25.0113
## Max. :16376.87 Max. :1392.7300
## NA's :8 NA's :1
## Density GDP Rural CO2
## Min. : 0.1 Min. : 275 Min. : 0.00 Min. : 0.0400
## 1st Qu.: 37.5 1st Qu.: 2032 1st Qu.:19.62 1st Qu.: 0.8575
## Median : 92.1 Median : 5950 Median :38.15 Median : 2.7550
## Mean : 361.4 Mean : 14733 Mean :39.10 Mean : 4.9780
## 3rd Qu.: 219.8 3rd Qu.: 17298 3rd Qu.:57.83 3rd Qu.: 6.2525
## Max. :20777.5 Max. :114340 Max. :87.00 Max. :43.8600
## NA's :8 NA's :30 NA's :3 NA's :13
## PumpPrice Military Health ArmedForces
## Min. :0.1100 Min. : 0.000 Min. : 0.000 Min. : 0.0
## 1st Qu.:0.7450 1st Qu.: 3.015 1st Qu.: 6.157 1st Qu.: 12.0
## Median :0.9800 Median : 4.650 Median : 9.605 Median : 31.5
## Mean :0.9851 Mean : 6.178 Mean :10.597 Mean : 162.1
## 3rd Qu.:1.1800 3rd Qu.: 8.445 3rd Qu.:13.713 3rd Qu.: 146.5
## Max. :2.0000 Max. :31.900 Max. :39.460 Max. :3031.0
## NA's :50 NA's :67 NA's :29 NA's :49
## Internet Cell HIV Hunger
## Min. : 1.30 Min. : 13.70 Min. : 0.100 Min. : 2.50
## 1st Qu.:29.18 1st Qu.: 83.83 1st Qu.: 0.175 1st Qu.: 2.50
## Median :58.35 Median :110.00 Median : 0.400 Median : 6.50
## Mean :54.47 Mean :107.05 Mean : 1.941 Mean :11.25
## 3rd Qu.:78.92 3rd Qu.:127.50 3rd Qu.: 1.400 3rd Qu.:14.80
## Max. :98.90 Max. :328.80 Max. :27.400 Max. :61.80
## NA's :13 NA's :15 NA's :81 NA's :52
## Diabetes BirthRate DeathRate ElderlyPop
## Min. : 1.000 Min. : 7.00 Min. : 1.600 Min. : 1.200
## 1st Qu.: 5.350 1st Qu.:11.40 1st Qu.: 5.800 1st Qu.: 3.600
## Median : 7.200 Median :17.85 Median : 7.250 Median : 6.600
## Mean : 8.542 Mean :20.11 Mean : 7.683 Mean : 8.953
## 3rd Qu.:10.750 3rd Qu.:27.65 3rd Qu.: 9.350 3rd Qu.:14.500
## Max. :30.500 Max. :47.80 Max. :15.500 Max. :27.500
## NA's :10 NA's :15 NA's :15 NA's :24
## LifeExpectancy FemaleLabor Unemployment Energy
## Min. :52.20 Min. : 6.20 Min. : 0.100 Min. : 66
## 1st Qu.:66.90 1st Qu.:50.15 1st Qu.: 3.400 1st Qu.: 738
## Median :74.30 Median :60.60 Median : 5.600 Median : 1574
## Mean :72.46 Mean :57.95 Mean : 7.255 Mean : 2664
## 3rd Qu.:77.70 3rd Qu.:69.25 3rd Qu.: 9.400 3rd Qu.: 3060
## Max. :84.70 Max. :85.80 Max. :30.200 Max. :17923
## NA's :18 NA's :30 NA's :30 NA's :82
## Electricity Developed
## Min. : 39 Min. :1.00
## 1st Qu.: 904 1st Qu.:1.00
## Median : 2620 Median :2.00
## Mean : 4270 Mean :1.81
## 3rd Qu.: 5600 3rd Qu.:3.00
## Max. :53832 Max. :3.00
## NA's :76 NA's :75
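The NA counts above matter for the regressions that follow, because lm() by default drops any row that has a missing value in a variable the model uses. A minimal sketch of that row-wise deletion, using only the six rows printed by head() above (the data frame name first_six is made up for this illustration):

```r
# The six rows shown by head(All_Countries), limited to two columns.
first_six <- data.frame(
  Country        = c("Afghanistan", "Albania", "Algeria",
                     "American Samoa", "Andorra", "Angola"),
  GDP            = c(521, 5254, 4279, NA, 42030, 3432),
  LifeExpectancy = c(64.0, 78.5, 76.3, NA, NA, 61.8)
)

# A row is usable only if every model variable is present.
usable <- complete.cases(first_six[, c("GDP", "LifeExpectancy")])

n_usable  <- sum(usable)     # rows a model on these two columns could use
n_dropped <- sum(!usable)    # rows that would be deleted due to missingness
dropped_countries <- first_six$Country[!usable]

print(n_usable)
print(dropped_countries)
```

The same idea applied to the full dataset would explain why each model below reports some observations as deleted due to missingness.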
#1 Simple Linear Regression (Fitting and Interpretation)
Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US).
simple_model <- lm(LifeExpectancy ~ GDP, data = All_Countries)
summary(simple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = All_Countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.352 -3.882 1.550 4.458 9.330
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.842e+01 5.415e-01 126.36 <2e-16 ***
## GDP 2.476e-04 2.141e-05 11.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.901 on 177 degrees of freedom
## (38 observations deleted due to missingness)
## Multiple R-squared: 0.4304, Adjusted R-squared: 0.4272
## F-statistic: 133.7 on 1 and 177 DF, p-value: < 2.2e-16
Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
-The intercept is 68.42, and the slope is 0.0002476.
-The intercept means that a country with a GDP per capita of $0 would have a predicted life expectancy of 68.42 years. No country in the data comes close to $0 (the minimum GDP is $275), so the intercept is statistically significant but is an extrapolation rather than a practically meaningful value. The slope means that for every additional $1 of GDP per capita, a country's predicted life expectancy goes up by 0.0002476 years, or about 0.25 years for every additional $1,000.
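What does the R² value tell you about how well GDP explains variation in life expectancy across countries?
-The R² value is 0.4304, so GDP explains about 43% of the variation in life expectancy across countries. The remaining 57% or so is left unexplained by this model, which suggests that factors other than GDP also matter.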
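As a rough illustration of how the intercept and slope are used, the sketch below plugs the coefficients printed by summary(simple_model) into the fitted line. The helper name predict_life_expectancy and the $10,000 input are assumptions made for this example only.

```r
# Coefficients copied from the summary(simple_model) output above.
intercept <- 68.42        # predicted life expectancy at GDP = 0
slope     <- 2.476e-04    # change in years per additional $1 of GDP

# Rescale the slope so it is easier to read.
slope_per_1000 <- slope * 1000

# Hypothetical helper: predicted life expectancy for a given GDP per capita.
predict_life_expectancy <- function(gdp) {
  intercept + slope * gdp
}

pred_10k <- predict_life_expectancy(10000)
print(slope_per_1000)
print(pred_10k)
```

With the fitted model object available, predict(simple_model, newdata = data.frame(GDP = 10000)) should give nearly the same value, differing only by the rounding of the printed coefficients.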
#2 Multiple Linear Regression (Fitting and Interpretation)
multiple_model <- lm(LifeExpectancy ~ GDP + Health + Internet, data = All_Countries)
summary(multiple_model)
##
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = All_Countries)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5662 -1.8227 0.4108 2.5422 9.4161
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.908e+01 8.149e-01 72.499 < 2e-16 ***
## GDP 2.367e-05 2.287e-05 1.035 0.302025
## Health 2.479e-01 6.619e-02 3.745 0.000247 ***
## Internet 1.903e-01 1.656e-02 11.490 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 169 degrees of freedom
## (44 observations deleted due to missingness)
## Multiple R-squared: 0.7213, Adjusted R-squared: 0.7164
## F-statistic: 145.8 on 3 and 169 DF, p-value: < 2.2e-16
-The coefficient for Health is 0.25. This means that for every 1 percentage point increase in the share of government spending that goes to healthcare, predicted life expectancy goes up by 0.25 years, holding GDP and Internet constant. The adjusted R² increased from 0.427 in the simple regression model from Question 1 to 0.716 here. The multiple linear regression explains more of the variation in life expectancy because it uses Health and Internet in addition to GDP, which shows that GDP is not the sole factor in determining life expectancy. In fact, GDP is no longer statistically significant once Health and Internet are included (p = 0.302).
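The adjusted R² values compared above can be reproduced from the reported R², the sample size, and the number of predictors. The sketch below uses the standard adjusted R² formula; the function name adjusted_r2 is made up for this example, and the sample sizes are inferred from the residual degrees of freedom in the two summary() outputs.

```r
# Adjusted R-squared penalizes R-squared for the number of predictors p.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n = residual degrees of freedom + number of coefficients (incl. intercept).
n_simple   <- 177 + 2
n_multiple <- 169 + 4

adj_simple   <- adjusted_r2(0.4304, n_simple, p = 1)
adj_multiple <- adjusted_r2(0.7213, n_multiple, p = 3)
print(round(c(adj_simple, adj_multiple), 4))
```

Note that the two models were fit on slightly different sets of countries (179 versus 173 rows) because of missing values, so the comparison is approximate. A stricter comparison would refit the simple model on the rows the multiple model used.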
#3 Checking Assumptions (Homoscedasticity and Normality)
-To check the assumption of homoscedasticity, look at the residuals versus fitted values plot. The ideal outcome is a random scatter of points around zero with roughly constant vertical spread and no distinct pattern or shape. A violation would be a funnel or fan shape, where the spread of the residuals grows or shrinks as the fitted values change. A curve would instead point to a non-linear relationship. With non-constant variance, the standard errors, and therefore the p-values and confidence intervals, become unreliable.
-To check the normality of residuals, look at a Q-Q plot. The ideal outcome is that the points follow the straight diagonal line. If the points stray from the line, especially at the ends, the residuals may not follow a normal distribution, which makes the p-values and confidence intervals less reliable.
Afterwards, code your answer:
par(mfrow = c(2, 2)) # show the four diagnostic plots in a 2 x 2 grid
plot(simple_model)
par(mfrow = c(1, 1)) # reset the plotting layout
Reflect if it matched the ideal outcome:
-For the residuals versus fitted values plot, there is a slight curve at the beginning, but the trend straightens out for the majority of the plot, with the points scattered after that. This does not exactly meet the ideal standard, but it is acceptable.
-For the Q-Q plot, there are slight deviations from the straight diagonal line, but the points stay close enough to the line. This satisfies the ideal outcome.
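The visual checks above can be backed up with numeric ones. The sketch below is not run on the AllCountries data: it uses simulated data so that the constant-spread and fan-shaped cases are known in advance. The helper bp_p_value is a hand-rolled version of the studentized Breusch-Pagan idea (regress the squared residuals on the predictor) and is an assumption of this example, not something the original analysis used. shapiro.test() is the base R Shapiro-Wilk normality test.

```r
set.seed(42)  # simulated data, not the AllCountries data

n <- 200
x <- runif(n, min = 1, max = 10)
y_const <- 2 + 3 * x + rnorm(n, sd = 2)   # constant error spread
y_fan   <- 2 + 3 * x + rnorm(n, sd = x)   # error spread grows with x

# Hypothetical helper: studentized Breusch-Pagan style check.
# Regress the squared residuals on the predictor; n * R^2 is compared
# with a chi-squared distribution (df = number of predictors, here 1).
bp_p_value <- function(model, predictor) {
  aux  <- lm(resid(model)^2 ~ predictor)
  stat <- length(predictor) * summary(aux)$r.squared
  pchisq(stat, df = 1, lower.tail = FALSE)
}

fit_const <- lm(y_const ~ x)
fit_fan   <- lm(y_fan ~ x)

p_const <- bp_p_value(fit_const, x)
p_fan   <- bp_p_value(fit_fan, x)   # a small p-value points to non-constant variance

# Shapiro-Wilk test of normality on the residuals.
p_normal <- shapiro.test(resid(fit_const))$p.value

print(c(p_const = p_const, p_fan = p_fan, p_normal = p_normal))
```

The same two checks could be applied to resid(simple_model), with GDP (restricted to the rows the model used) as the predictor.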
#4 Diagnosing Model Fit (RMSE and Residuals)
residuals_multiple <- resid(multiple_model)
rmse_multiple <- sqrt(mean(residuals_multiple^2))
rmse_multiple
## [1] 4.056417
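The RMSE printed above (4.056) is slightly smaller than the residual standard error of 4.104 reported by summary(multiple_model). Both are built from the same residuals, but they divide by different counts. A short sketch of the relationship, using only numbers copied from the output above:

```r
# Values copied from the summary(multiple_model) output above.
rse         <- 4.104             # residual standard error
df_residual <- 169               # residual degrees of freedom
n_obs       <- df_residual + 4   # 3 slopes + 1 intercept

# RSE divides the residual sum of squares by df; RMSE divides it by n.
rss           <- rse^2 * df_residual
rmse_from_rse <- sqrt(rss / n_obs)
print(rmse_from_rse)
```

The result agrees with the RMSE above up to the rounding of the printed residual standard error.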
For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?
-The RMSE rounds to 4.06, which means the model's life expectancy predictions are typically off by about 4 years. An RMSE of 0 would be a perfect fit, so a value of about 4 years, against life expectancies that range from 52.2 to 84.7 years, is a fair fit for this dataset. Large residuals for certain countries, especially those with unusually high or low life expectancy, would decrease our confidence in the model's predictions for those countries. They would suggest that the model is not accounting for essential factors that affect those countries. We could investigate further by looking at the outliers to find the cause, e.g. a data entry error or a factor missing from the model.
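One way to start that investigation is to list the countries whose residual is larger than the RMSE. The sketch below does this by hand for the four complete rows shown by head(All_Countries), using the rounded coefficients printed by summary(multiple_model). The names b, rows, and flagged, and the choice of the RMSE as the cutoff, are assumptions made for this illustration.

```r
# Coefficients copied from the summary(multiple_model) output above.
b <- c(intercept = 59.08, GDP = 2.367e-05, Health = 0.2479, Internet = 0.1903)

# The four complete rows shown by head(All_Countries).
rows <- data.frame(
  Country        = c("Afghanistan", "Albania", "Algeria", "Angola"),
  GDP            = c(521, 5254, 4279, 3432),
  Health         = c(2.01, 9.51, 10.73, 5.43),
  Internet       = c(11.4, 71.8, 47.7, 14.3),
  LifeExpectancy = c(64.0, 78.5, 76.3, 61.8)
)

# Predicted value and residual (actual minus predicted) for each row.
rows$predicted <- b[["intercept"]] + b[["GDP"]] * rows$GDP +
  b[["Health"]] * rows$Health + b[["Internet"]] * rows$Internet
rows$residual <- rows$LifeExpectancy - rows$predicted

# Flag rows whose residual is larger in size than the RMSE.
rmse    <- 4.056417
flagged <- rows$Country[abs(rows$residual) > rmse]

print(rows[, c("Country", "predicted", "residual")])
print(flagged)
```

On the full dataset, the same comparison could be made directly with resid(multiple_model), which avoids the rounding in the hand-copied coefficients.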
#5 Hypothetical Example (Multicollinearity in Multiple Regression)
-Multicollinearity might affect our interpretation of the regression because, with Energy and Electricity highly correlated, the model cannot clearly distinguish how each variable affects CO2 emissions. This could make the coefficients unstable and inflate their standard errors, so the p-values may not appear statistically significant even if the variables actually matter. The model may still give a good prediction of CO2 emissions, but the individual coefficient estimates would be unreliable.
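The effect described above can be seen in a small simulation. The sketch below does not use the AllCountries data: energy, electricity, and co2 are simulated stand-ins, with electricity built to be nearly a copy of energy. It computes the variance inflation factor (VIF) by hand as 1 / (1 - R²) and compares the standard error of the energy coefficient with and without the correlated predictor.

```r
set.seed(1)  # simulated data, not the AllCountries data

n <- 100
energy      <- rnorm(n, mean = 50, sd = 10)
electricity <- energy + rnorm(n, sd = 1)   # nearly a copy of energy
co2         <- 1 + 0.05 * energy + 0.05 * electricity + rnorm(n)

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictors.
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)

fit_both   <- lm(co2 ~ energy + electricity)
fit_single <- lm(co2 ~ energy)

# Standard error of the energy coefficient in each model.
se_both   <- summary(fit_both)$coefficients["energy", "Std. Error"]
se_single <- summary(fit_single)$coefficients["energy", "Std. Error"]

print(c(vif = vif_energy, se_both = se_both, se_single = se_single))
```

A VIF well above the common rule-of-thumb range of 5 to 10, together with a much larger standard error when both predictors are included, is the pattern that signals multicollinearity.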