setwd("C:/Users/leyla/Documents/DATA 101")
All_countries <- read.csv("AllCountries.csv")
head(All_countries)
##          Country Code LandArea Population Density   GDP Rural  CO2 PumpPrice
## 1    Afghanistan  AFG   652.86     37.172    56.9   521  74.5 0.29      0.70
## 2        Albania  ALB    27.40      2.866   104.6  5254  39.7 1.98      1.36
## 3        Algeria  DZA  2381.74     42.228    17.7  4279  27.4 3.74      0.28
## 4 American Samoa  ASM     0.20      0.055   277.3    NA  12.8   NA        NA
## 5        Andorra  AND     0.47      0.077   163.8 42030  11.9 5.83        NA
## 6         Angola  AGO  1246.70     30.810    24.7  3432  34.5 1.29      0.97
##   Military Health ArmedForces Internet  Cell HIV Hunger Diabetes BirthRate
## 1     3.72   2.01         323     11.4  67.4  NA   30.3      9.6      32.5
## 2     4.08   9.51           9     71.8 123.7 0.1    5.5     10.1      11.7
## 3    13.81  10.73         317     47.7 111.0 0.1    4.7      6.7      22.3
## 4       NA     NA          NA       NA    NA  NA     NA       NA        NA
## 5       NA  14.02          NA     98.9 104.4  NA     NA      8.0        NA
## 6     9.40   5.43         117     14.3  44.7 1.9   23.9      3.9      41.3
##   DeathRate ElderlyPop LifeExpectancy FemaleLabor Unemployment Energy
## 1       6.6        2.6           64.0        50.3          1.5     NA
## 2       7.5       13.6           78.5        55.9         13.9    808
## 3       4.8        6.4           76.3        16.4         12.1   1328
## 4        NA         NA             NA          NA           NA     NA
## 5        NA         NA             NA          NA           NA     NA
## 6       8.4        2.5           61.8        76.4          7.3    545
##   Electricity Developed
## 1          NA        NA
## 2        2309         1
## 3        1363         1
## 4          NA        NA
## 5          NA        NA
## 6         312         1

Problem 1

Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

First, prepare the data to make sure no errors occur.

# Checking the variables
str(All_countries)
## 'data.frame':    217 obs. of  26 variables:
##  $ Country       : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr  "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num  652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num  37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num  56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : int  521 5254 4279 NA 42030 3432 16864 11653 4212 NA ...
##  $ Rural         : num  74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num  0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num  0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num  3.72 4.08 13.81 NA NA ...
##  $ Health        : num  2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : int  323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num  11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num  67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num  NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num  30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num  9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num  32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num  6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num  2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num  64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num  50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num  1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : int  NA 808 1328 NA NA 545 NA 2030 1016 NA ...
##  $ Electricity   : int  NA 2309 1363 NA NA 312 NA 3075 1962 NA ...
##  $ Developed     : int  NA 1 1 NA NA 1 NA 2 1 NA ...
#Describing my data
summary(All_countries)
##    Country              Code              LandArea          Population       
##  Length:217         Length:217         Min.   :    0.01   Min.   :   0.0120  
##  Class :character   Class :character   1st Qu.:   10.83   1st Qu.:   0.7728  
##  Mode  :character   Mode  :character   Median :   94.28   Median :   6.5725  
##                                        Mean   :  608.38   Mean   :  35.0335  
##                                        3rd Qu.:  446.30   3rd Qu.:  25.0113  
##                                        Max.   :16376.87   Max.   :1392.7300  
##                                        NA's   :8          NA's   :1          
##     Density             GDP             Rural            CO2         
##  Min.   :    0.1   Min.   :   275   Min.   : 0.00   Min.   : 0.0400  
##  1st Qu.:   37.5   1st Qu.:  2032   1st Qu.:19.62   1st Qu.: 0.8575  
##  Median :   92.1   Median :  5950   Median :38.15   Median : 2.7550  
##  Mean   :  361.4   Mean   : 14733   Mean   :39.10   Mean   : 4.9780  
##  3rd Qu.:  219.8   3rd Qu.: 17298   3rd Qu.:57.83   3rd Qu.: 6.2525  
##  Max.   :20777.5   Max.   :114340   Max.   :87.00   Max.   :43.8600  
##  NA's   :8         NA's   :30       NA's   :3       NA's   :13       
##    PumpPrice         Military          Health        ArmedForces    
##  Min.   :0.1100   Min.   : 0.000   Min.   : 0.000   Min.   :   0.0  
##  1st Qu.:0.7450   1st Qu.: 3.015   1st Qu.: 6.157   1st Qu.:  12.0  
##  Median :0.9800   Median : 4.650   Median : 9.605   Median :  31.5  
##  Mean   :0.9851   Mean   : 6.178   Mean   :10.597   Mean   : 162.1  
##  3rd Qu.:1.1800   3rd Qu.: 8.445   3rd Qu.:13.713   3rd Qu.: 146.5  
##  Max.   :2.0000   Max.   :31.900   Max.   :39.460   Max.   :3031.0  
##  NA's   :50       NA's   :67       NA's   :29       NA's   :49      
##     Internet          Cell             HIV             Hunger     
##  Min.   : 1.30   Min.   : 13.70   Min.   : 0.100   Min.   : 2.50  
##  1st Qu.:29.18   1st Qu.: 83.83   1st Qu.: 0.175   1st Qu.: 2.50  
##  Median :58.35   Median :110.00   Median : 0.400   Median : 6.50  
##  Mean   :54.47   Mean   :107.05   Mean   : 1.941   Mean   :11.25  
##  3rd Qu.:78.92   3rd Qu.:127.50   3rd Qu.: 1.400   3rd Qu.:14.80  
##  Max.   :98.90   Max.   :328.80   Max.   :27.400   Max.   :61.80  
##  NA's   :13      NA's   :15       NA's   :81       NA's   :52     
##     Diabetes        BirthRate       DeathRate        ElderlyPop    
##  Min.   : 1.000   Min.   : 7.00   Min.   : 1.600   Min.   : 1.200  
##  1st Qu.: 5.350   1st Qu.:11.40   1st Qu.: 5.800   1st Qu.: 3.600  
##  Median : 7.200   Median :17.85   Median : 7.250   Median : 6.600  
##  Mean   : 8.542   Mean   :20.11   Mean   : 7.683   Mean   : 8.953  
##  3rd Qu.:10.750   3rd Qu.:27.65   3rd Qu.: 9.350   3rd Qu.:14.500  
##  Max.   :30.500   Max.   :47.80   Max.   :15.500   Max.   :27.500  
##  NA's   :10       NA's   :15      NA's   :15       NA's   :24      
##  LifeExpectancy   FemaleLabor     Unemployment        Energy     
##  Min.   :52.20   Min.   : 6.20   Min.   : 0.100   Min.   :   66  
##  1st Qu.:66.90   1st Qu.:50.15   1st Qu.: 3.400   1st Qu.:  738  
##  Median :74.30   Median :60.60   Median : 5.600   Median : 1574  
##  Mean   :72.46   Mean   :57.95   Mean   : 7.255   Mean   : 2664  
##  3rd Qu.:77.70   3rd Qu.:69.25   3rd Qu.: 9.400   3rd Qu.: 3060  
##  Max.   :84.70   Max.   :85.80   Max.   :30.200   Max.   :17923  
##  NA's   :18      NA's   :30      NA's   :30       NA's   :82     
##   Electricity      Developed   
##  Min.   :   39   Min.   :1.00  
##  1st Qu.:  904   1st Qu.:1.00  
##  Median : 2620   Median :2.00  
##  Mean   : 4270   Mean   :1.81  
##  3rd Qu.: 5600   3rd Qu.:3.00  
##  Max.   :53832   Max.   :3.00  
##  NA's   :76      NA's   :75

Next, remove rows with NA in the GDP and LifeExpectancy columns, since those are the variables I will use in this case.

ACountries <- na.omit(All_countries[, c("GDP", "LifeExpectancy")])
summary(ACountries)
##       GDP         LifeExpectancy 
##  Min.   :   275   Min.   :52.20  
##  1st Qu.:  2022   1st Qu.:66.65  
##  Median :  5878   Median :73.70  
##  Mean   : 14667   Mean   :72.05  
##  3rd Qu.: 16854   3rd Qu.:77.40  
##  Max.   :114340   Max.   :84.70
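
To see how many countries the na.omit() step drops, one option is to count complete cases first. The sketch below uses a small made-up data frame (`toy`, with invented values) instead of AllCountries.csv so that it runs on its own; in the report the same calls would be applied to All_countries.

```r
# Hypothetical toy data standing in for All_countries (values are invented).
toy <- data.frame(
  GDP            = c(500, NA, 4000, 3500),
  LifeExpectancy = c(64, 78, NA, 62)
)

# TRUE for rows where both modelling variables are present.
keep <- complete.cases(toy[, c("GDP", "LifeExpectancy")])

n_dropped <- sum(!keep)   # rows that na.omit() would remove
clean     <- toy[keep, ]  # the same rows that na.omit() keeps

n_dropped
nrow(clean)
```

For the real data, the model summary for Problem 1 reports 177 residual degrees of freedom, which implies that 179 of the 217 countries were kept.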

After the data is clean, I will fit a simple linear regression model to analyze whether we can predict a country's life expectancy from its GDP.

model <- lm(LifeExpectancy ~ GDP, data = ACountries)
summary(model)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = ACountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Interpretation

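The fitted model is LifeExpectancy = 68.42 + 0.0002476 × GDP. The intercept (68.42) is the predicted life expectancy, in years, for a country with a GDP per capita of $0. No country in the data has a GDP that low (the minimum is $275), so it serves mainly as a baseline. The slope means that each additional $1 of GDP per capita is associated with about 0.00025 more years of life expectancy, or roughly 0.25 years per additional $1,000. Wealthier countries therefore tend to have a longer average lifespan. The R² of 0.43 means GDP explains a substantial portion (43%) of the variation in life expectancy across countries, but other factors also contribute.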
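
The slope is easier to read on a per-$1,000 scale. As a small worked example, the sketch below plugs the coefficients from the summary(model) output into the regression equation; the two GDP levels are chosen only for illustration.

```r
# Coefficients copied from the summary(model) output above.
intercept <- 68.42       # (Intercept), in years
slope     <- 2.476e-04   # years per additional $1 of GDP per capita

# Change in predicted life expectancy per additional $1,000 of GDP.
slope_per_1000 <- slope * 1000

# Predicted life expectancy at two illustrative GDP levels.
gdp_levels <- c(1000, 10000)
predicted  <- intercept + slope * gdp_levels

slope_per_1000
predicted
```

With the fitted model object available, predict(model, newdata = data.frame(GDP = c(1000, 10000))) should give the same predictions up to rounding of the coefficients.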

Problem 2

Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

First, clean the data of missing values.

Allcountries <- na.omit(All_countries[ ,c("LifeExpectancy", "GDP", "Health", "Internet")])
summary(Allcountries)
##  LifeExpectancy       GDP             Health         Internet    
##  Min.   :52.20   Min.   :   275   Min.   : 0.00   Min.   : 3.90  
##  1st Qu.:66.60   1st Qu.:  2016   1st Qu.: 6.15   1st Qu.:27.10  
##  Median :73.70   Median :  5878   Median : 9.65   Median :55.50  
##  Mean   :71.95   Mean   : 14165   Mean   :10.52   Mean   :52.15  
##  3rd Qu.:77.30   3rd Qu.: 16434   3rd Qu.:13.56   3rd Qu.:76.40  
##  Max.   :84.10   Max.   :114340   Max.   :39.46   Max.   :98.30

Then, fit the multiple linear regression.

model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = Allcountries)
summary(model2)
## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = Allcountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

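Interpretation

The coefficient for Health is 0.248: holding GDP and Internet constant, each additional percentage point of government expenditure devoted to healthcare is associated with about 0.25 more years of life expectancy, and this effect is statistically significant (p = 0.000247). The adjusted R² rises from 0.427 in the simple model to 0.716 here, which suggests that Health and Internet explain a good deal of variation that GDP alone does not. Once they are included, GDP is no longer significant (p = 0.30). The two models were fit on slightly different sets of countries (179 versus 173 complete cases), so the comparison is approximate.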
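
The adjusted R² values in the two model summaries can be reproduced from the reported R², the sample size, and the number of predictors. This is a minimal sketch of that arithmetic; `adj_r2` is a helper name introduced here for illustration, and the inputs are the rounded values printed by summary().

```r
# Adjusted R-squared from R-squared, sample size n, and number of predictors p.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# n is recovered from the residual degrees of freedom: n = df + p + 1.
adj_simple   <- adj_r2(r2 = 0.4304, n = 177 + 1 + 1, p = 1)
adj_multiple <- adj_r2(r2 = 0.7213, n = 169 + 3 + 1, p = 3)

round(adj_simple, 4)
round(adj_multiple, 4)
```

Because adjusted R² penalizes each extra predictor, an increase this large is unlikely to come just from adding more variables.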

Problem 3

Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect on whether it matched the ideal outcome.

  1. Check whether life expectancy has a linear relationship with GDP
plot(All_countries$GDP, All_countries$LifeExpectancy,
     xlab= "GDP", ylab= "LifeExpectancy", main= "Life Expectancy vs GDP")
abline(model, col= "orange", lwd= 2)

The scatterplot shows a clear positive relationship between GDP and life expectancy. Countries with higher GDP tend to have a longer life expectancy. However, the increase is sharper at low GDP levels and levels off as GDP gets high, suggesting a nonlinear pattern, so a straight-line model does not fully capture the relationship.

  2. Four diagnostic plots
par(mfrow = c(2,2))
plot(model)

par(mfrow = c(1,1))
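
Two of the four panels speak directly to the assumptions in this problem. Homoscedasticity is usually judged from the Residuals vs Fitted panel: ideally the points form a band of roughly constant width around zero, while a funnel or curved shape suggests that standard errors and prediction intervals may be unreliable. Normality is judged from the Normal Q-Q panel: ideally the points lie close to the reference line, while heavy departures in the tails cast doubt on the p-values and intervals. The sketch below draws just those two panels and adds a Shapiro-Wilk test as a numeric complement. It uses R's built-in `cars` data as a stand-in so that it runs on its own; that substitution is an assumption for illustration, and in the report `fit` would be replaced with `model`.

```r
# Stand-in model on R's built-in `cars` data so the sketch runs on its own;
# in the report, replace `fit` with `model` (LifeExpectancy ~ GDP).
fit <- lm(dist ~ speed, data = cars)

par(mfrow = c(1, 2))
plot(fit, which = 1)  # Residuals vs Fitted: look for constant spread around 0
plot(fit, which = 2)  # Normal Q-Q: look for points close to the reference line
par(mfrow = c(1, 1))

# Shapiro-Wilk test as a numeric complement to the Q-Q plot;
# a small p-value is evidence against normally distributed residuals.
sw <- shapiro.test(resid(fit))
sw$p.value
```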

Problem 4

Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

I will calculate RMSE for Model 2

# Calculate predicted values
predicted <- predict(model2)
predicted
##        1        2        3        6        7        8        9       11 
## 61.76039 75.22632 70.91914 63.22893 76.61030 77.14294 73.94435 81.21674 
##       12       13       15       16       18       19       20       21 
## 80.72885 75.18840 79.98512 63.38389 75.48786 80.79902 71.24152 62.70711 
##       23       24       25       26       27       29       30       31 
## 70.37117 70.30837 76.27501 69.42274 74.59145 79.29995 74.31526 64.85760 
##       32       33       34       35       36       38       39       41 
## 62.26443 72.50409 67.11357 64.26288 82.52862 61.16494 61.79704 80.01292 
##       42       43       44       45       46       47       48       49 
## 71.88877 74.40823 61.63191 61.65483 61.75850 80.22679 68.66617 75.16923 
##       52       53       54       55       57       58       59       60 
## 76.97150 78.27761 83.02193 70.49757 75.59763 72.85948 68.75067 70.32113 
##       61       63       64       65       67       68       69       71 
## 64.94927 79.46561 68.71987 64.13055 70.64139 80.16731 79.58834 71.12349 
##       72       73       74       75       77       79       81       82 
## 63.54943 73.24762 81.57809 67.96632 75.42224 72.81999 71.38054 62.28954 
##       83       84       85       86       88       89       90       91 
## 63.03355 68.20222 62.53726 68.72823 76.65638 83.22373 66.47207 67.37936 
##       93       94       96       97       98       99      100      101 
## 69.04219 81.86574 78.46022 74.89763 71.67947 83.10741 74.85824 76.17099 
##      102      103      105      107      108      109      110      111 
## 64.01045 63.27052 81.25526 80.08073 68.01648 64.92106 77.25631 77.71047 
##      112      113      114      116      117      119      120      121 
## 67.48475 61.57552 63.40033 77.47316 83.34414 65.37587 64.15240 76.63030 
##      122      123      124      126      127      128      129      130 
## 76.34469 62.86452 78.95756 64.43836 72.39649 74.05114 67.29099 76.64273 
##      132      133      134      135      136      137      139      140 
## 65.00898 75.84626 73.16641 65.12012 66.14155 69.64483 64.49345 82.85788 
##      142      143      144      145      146      148      149      150 
## 82.92623 69.40543 62.44162 65.64177 76.97985 83.74199 76.60835 63.02177 
##      152      153      154      155      156      157      158      160 
## 75.78247 63.06341 74.85792 72.40703 72.35137 76.64562 76.99471 80.52426 
##      161      162      163      164      166      167      168      169 
## 74.30026 75.85120 65.44850 68.67220 66.66463 77.74812 66.27396 75.68921 
##      170      171      172      174      175      176      178      180 
## 73.14581 63.56546 80.03254 78.47079 78.06815 63.38910 73.22757 79.65603 
##      181      183      185      186      187      188      189      191 
## 67.78882 71.08931 74.10906 67.64374 72.63714 83.28551 84.42584 64.53600 
##      192      193      194      195      196      197      198      199 
## 64.50992 73.10253 65.16020 62.50951 69.01734 76.58457 73.12502 74.00584 
##      200      203      204      205      206      207      208      209 
## 65.45783 64.87996 71.76290 80.09556 82.76949 84.65572 77.31866 71.35029 
##      210      212      215      216      217 
## 65.37913 70.79891 64.18393 66.19378 67.88522
# Calculate RMSE
results <- sqrt(mean(model2$residuals^2))
results
## [1] 4.056417

The RMSE is approximately 4.06 years. This means the model's predictions of life expectancy are typically off by about 4 years. Countries with unusually high or low life expectancy relative to what GDP, Health, and Internet predict will have larger residuals. Large residuals suggest that the model does not fully capture other factors affecting life expectancy, which lowers my confidence in its predictions for those countries. I would investigate additional predictors not included in the model, such as education, environment, or diet, that may explain why life expectancy in some countries deviates, and check whether those countries are outliers or have data errors.
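
The RMSE is slightly smaller than the residual standard error (4.104) reported by summary(model2), because the two divide the same residual sum of squares by different quantities. The sketch below shows the RMSE calculation on two made-up residuals and then recovers the RMSE from the reported residual standard error; `rmse` is a helper name introduced here for illustration.

```r
# Root mean squared error of a vector of residuals.
rmse <- function(residuals) sqrt(mean(residuals^2))

# Tiny made-up residuals to show the calculation: sqrt((3^2 + 4^2) / 2).
rmse(c(3, -4))

# Values copied from the summary(model2) output above.
rse <- 4.104   # residual standard error
df  <- 169     # residual degrees of freedom
n   <- df + 4  # 3 predictors + intercept, so n = 173

# RSE divides the residual sum of squares by df; RMSE divides by n.
rmse_from_rse <- rse * sqrt(df / n)
rmse_from_rse
```

The result agrees with the 4.056 computed from the residuals, up to rounding of the printed standard error.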

Problem 5

Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

In this hypothetical model, multicollinearity arises because the two predictors are highly correlated: both measure related aspects of energy consumption (countries that consume a lot of energy usually consume more electricity too). This makes it difficult to interpret their individual regression coefficients, because the model cannot separate the effect of Energy from that of Electricity. The model may still predict CO2 accurately, but the standard errors of the coefficients will probably increase, making statistical significance harder to assess. To deal with multicollinearity, we could remove one of the correlated variables or combine them into a single measure.
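
One common way to quantify this is the variance inflation factor (VIF), which is not part of the original assignment and is shown here only as an illustration. The sketch below uses simulated predictors (hypothetical values, not taken from AllCountries) that are built to be strongly correlated, and computes the VIF by hand as 1 / (1 - R²) from regressing one predictor on the other.

```r
set.seed(1)  # reproducible simulated data (hypothetical, not from AllCountries)
n <- 100
energy      <- rnorm(n, mean = 2500, sd = 800)
electricity <- 1.5 * energy + rnorm(n, sd = 300)  # built to track energy closely

# Correlation between the two predictors.
r <- cor(energy, electricity)

# Variance inflation factor: 1 / (1 - R^2) from regressing one predictor on the other.
aux_r2 <- summary(lm(electricity ~ energy))$r.squared
vif    <- 1 / (1 - aux_r2)

r
vif
```

With only two predictors, the auxiliary R² equals the squared correlation, so the VIF grows quickly as the correlation approaches 1. A larger VIF means the coefficient's variance is inflated by that factor compared with uncorrelated predictors.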