Arnav Shah DATA 101 HW 9

# In this chunk I am loading the readr library so I can import the dataset
library(readr)
AllCountries <- read_csv("AllCountries.csv")

## Rows: 217 Columns: 26
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): Country, Code
## dbl (24): LandArea, Population, Density, GDP, Rural, CO2, PumpPrice, Militar...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# In this chunk I use the str function to view the structure of the dataset
# I then use the head function to preview the first rows
str(AllCountries)

## spc_tbl_ [217 × 26] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Code          : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
##  $ LandArea      : num [1:217] 652.86 27.4 2381.74 0.2 0.47 ...
##  $ Population    : num [1:217] 37.172 2.866 42.228 0.055 0.077 ...
##  $ Density       : num [1:217] 56.9 104.6 17.7 277.3 163.8 ...
##  $ GDP           : num [1:217] 521 5254 4279 NA 42030 ...
##  $ Rural         : num [1:217] 74.5 39.7 27.4 12.8 11.9 34.5 75.4 8.1 36.9 56.6 ...
##  $ CO2           : num [1:217] 0.29 1.98 3.74 NA 5.83 1.29 5.74 4.78 1.9 8.41 ...
##  $ PumpPrice     : num [1:217] 0.7 1.36 0.28 NA NA 0.97 NA 1.1 0.77 NA ...
##  $ Military      : num [1:217] 3.72 4.08 13.81 NA NA ...
##  $ Health        : num [1:217] 2.01 9.51 10.73 NA 14.02 ...
##  $ ArmedForces   : num [1:217] 323 9 317 NA NA 117 0 105 49 NA ...
##  $ Internet      : num [1:217] 11.4 71.8 47.7 NA 98.9 14.3 76 75.8 69.7 97.2 ...
##  $ Cell          : num [1:217] 67.4 123.7 111 NA 104.4 ...
##  $ HIV           : num [1:217] NA 0.1 0.1 NA NA 1.9 NA 0.4 0.2 NA ...
##  $ Hunger        : num [1:217] 30.3 5.5 4.7 NA NA 23.9 NA 3.8 4.3 NA ...
##  $ Diabetes      : num [1:217] 9.6 10.1 6.7 NA 8 3.9 13.2 5.5 7.1 11.6 ...
##  $ BirthRate     : num [1:217] 32.5 11.7 22.3 NA NA 41.3 16.1 17 13.1 11 ...
##  $ DeathRate     : num [1:217] 6.6 7.5 4.8 NA NA 8.4 5.8 7.6 9.7 8.9 ...
##  $ ElderlyPop    : num [1:217] 2.6 13.6 6.4 NA NA 2.5 7.2 11.3 11.4 13.6 ...
##  $ LifeExpectancy: num [1:217] 64 78.5 76.3 NA NA 61.8 76.5 76.7 74.8 76 ...
##  $ FemaleLabor   : num [1:217] 50.3 55.9 16.4 NA NA 76.4 NA 57.1 55.8 NA ...
##  $ Unemployment  : num [1:217] 1.5 13.9 12.1 NA NA 7.3 NA 9.5 17.7 NA ...
##  $ Energy        : num [1:217] NA 808 1328 NA NA ...
##  $ Electricity   : num [1:217] NA 2309 1363 NA NA ...
##  $ Developed     : num [1:217] NA 1 1 NA NA 1 NA 2 1 NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   Code = col_character(),
##   ..   LandArea = col_double(),
##   ..   Population = col_double(),
##   ..   Density = col_double(),
##   ..   GDP = col_double(),
##   ..   Rural = col_double(),
##   ..   CO2 = col_double(),
##   ..   PumpPrice = col_double(),
##   ..   Military = col_double(),
##   ..   Health = col_double(),
##   ..   ArmedForces = col_double(),
##   ..   Internet = col_double(),
##   ..   Cell = col_double(),
##   ..   HIV = col_double(),
##   ..   Hunger = col_double(),
##   ..   Diabetes = col_double(),
##   ..   BirthRate = col_double(),
##   ..   DeathRate = col_double(),
##   ..   ElderlyPop = col_double(),
##   ..   LifeExpectancy = col_double(),
##   ..   FemaleLabor = col_double(),
##   ..   Unemployment = col_double(),
##   ..   Energy = col_double(),
##   ..   Electricity = col_double(),
##   ..   Developed = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(AllCountries)

## # A tibble: 6 × 26
##   Country Code  LandArea Population Density   GDP Rural   CO2 PumpPrice Military
##   <chr>   <chr>    <dbl>      <dbl>   <dbl> <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 Afghan… AFG     653.       37.2      56.9   521  74.5  0.29      0.7      3.72
## 2 Albania ALB      27.4       2.87    105.   5254  39.7  1.98      1.36     4.08
## 3 Algeria DZA    2382.       42.2      17.7  4279  27.4  3.74      0.28    13.8 
## 4 Americ… ASM       0.2       0.055   277.     NA  12.8 NA        NA       NA   
## 5 Andorra AND       0.47      0.077   164.  42030  11.9  5.83     NA       NA   
## 6 Angola  AGO    1247.       30.8      24.7  3432  34.5  1.29      0.97     9.4 
## # ℹ 16 more variables: Health <dbl>, ArmedForces <dbl>, Internet <dbl>,
## #   Cell <dbl>, HIV <dbl>, Hunger <dbl>, Diabetes <dbl>, BirthRate <dbl>,
## #   DeathRate <dbl>, ElderlyPop <dbl>, LifeExpectancy <dbl>, FemaleLabor <dbl>,
## #   Unemployment <dbl>, Energy <dbl>, Electricity <dbl>, Developed <dbl>

Question 1

Simple Linear Regression (Fitting and Interpretation): Using the AllCountries dataset, fit a simple linear regression model to predict LifeExpectancy (average life expectancy in years) based on GDP (gross domestic product per capita in $US). Report the intercept and slope coefficients and interpret their meaning in the context of the dataset. What does the R² value tell you about how well GDP explains variation in life expectancy across countries?

# In this chunk I am fitting a simple regression model to view how GDP predicts the life expectancy
# I am displaying the results here using the summary function
model <- lm(LifeExpectancy ~ GDP, data = AllCountries)
summary(model)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP, data = AllCountries)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.352  -3.882   1.550   4.458   9.330 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.842e+01  5.415e-01  126.36   <2e-16 ***
## GDP         2.476e-04  2.141e-05   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.901 on 177 degrees of freedom
##   (38 observations deleted due to missingness)
## Multiple R-squared:  0.4304, Adjusted R-squared:  0.4272 
## F-statistic: 133.7 on 1 and 177 DF,  p-value: < 2.2e-16

Analysis

This question is asking me to build and interpret a simple linear regression model to understand how GDP per capita predicts life expectancy across countries. To answer this, I fit a regression model using LifeExpectancy as the response variable and GDP as the predictor.

From the model output, I examined the intercept and slope coefficients. The intercept is about 68.42, which is the predicted life expectancy in years when GDP per capita is zero. No country has a GDP of zero, so it mainly serves as a baseline for the model. The slope coefficient is 0.0002476, which means that for every additional $1 of GDP per capita, predicted life expectancy increases by about 0.0002476 years, or roughly 0.25 years for every additional $1,000. This helps me understand the relationship between economic wealth and life expectancy.

The R² value is 0.4304, which tells me that about 43% of the variation in life expectancy across countries is explained by GDP alone. This shows that GDP is an important predictor of life expectancy, although more than half of the variation is left unexplained.
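
The arithmetic behind this interpretation can be sketched in a few lines. This is only an illustration: the coefficients are copied by hand from the summary output above, `predict_life` is a hypothetical helper name, and the GDP values are made-up round numbers rather than specific countries.

```r
# Coefficients copied from the summary(model) output above
intercept <- 68.42      # predicted life expectancy (years) at GDP = 0
slope     <- 2.476e-04  # change in years per additional $1 of GDP per capita

# Rescale the slope to a more readable unit: years per $1,000 of GDP per capita
slope_per_1000 <- slope * 1000
slope_per_1000

# Hypothetical helper: predicted life expectancy for a given GDP per capita
predict_life <- function(gdp) intercept + slope * gdp

# Illustrative GDP values (not specific countries)
predict_life(c(1000, 10000, 40000))
```

With the full dataset loaded, `predict(model, newdata = data.frame(GDP = 10000))` should give essentially the same answer, up to rounding of the copied coefficients.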

Question 2

Multiple Linear Regression (Fitting and Interpretation) Fit a multiple linear regression model to predict LifeExpectancy using GDP, Health (percentage of government expenditures on healthcare), and Internet (percentage of population with internet access) as predictors. Interpret the coefficient for Health, explaining what it means in terms of life expectancy while controlling for GDP and Internet. How does the adjusted R² compare to the simple regression model from Question 1, and what does this suggest about the additional predictors?

# In this chunk I am fitting a multiple regression model that involves GDP, Health, and Internet
# LifeExpectancy is the response variable, which is what I am predicting
# GDP, Health, and Internet are the predictors used to explain life expectancy
# I am displaying the results using the summary function
model2 <- lm(LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
summary(model2)

## 
## Call:
## lm(formula = LifeExpectancy ~ GDP + Health + Internet, data = AllCountries)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5662  -1.8227   0.4108   2.5422   9.4161 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.908e+01  8.149e-01  72.499  < 2e-16 ***
## GDP         2.367e-05  2.287e-05   1.035 0.302025    
## Health      2.479e-01  6.619e-02   3.745 0.000247 ***
## Internet    1.903e-01  1.656e-02  11.490  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.104 on 169 degrees of freedom
##   (44 observations deleted due to missingness)
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.7164 
## F-statistic: 145.8 on 3 and 169 DF,  p-value: < 2.2e-16

Analysis

This question is asking me to add more predictors and to interpret how one variable affects life expectancy while controlling for others. I decided to fit a multiple regression model using GDP, Health, and Internet as the predictors for life expectancy.

The main thing I focused on was interpreting the coefficient for Health. The coefficient is about 0.248, which means that for every 1 percentage point increase in the share of government expenditures spent on healthcare, life expectancy is expected to increase by about 0.25 years, holding GDP and Internet access constant. This shows that higher healthcare spending is associated with longer life expectancy even after accounting for wealth and access to technology.

I also compared the adjusted R² of this model to the simple regression model from Question 1. The adjusted R² increased from about 0.43 to about 0.72. This suggests that adding Health and Internet substantially improves the model and gives a better understanding of the factors associated with life expectancy across countries. The GDP coefficient is also no longer statistically significant (p ≈ 0.30) once Health and Internet are included.
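
As a rough check on the comparison, the adjusted R² values can be recomputed from the reported R², the sample size, and the number of predictors. The sketch below assumes the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1); `adj_r2` is a hypothetical helper name, and all inputs are copied by hand from the two summary outputs.

```r
# Hypothetical helper: adjusted R-squared from R-squared, sample size n,
# and number of predictors p (intercept not counted in p)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# n = residual degrees of freedom + number of estimated coefficients
n1 <- 177 + 2   # simple model: intercept + GDP
n2 <- 169 + 4   # multiple model: intercept + GDP + Health + Internet

adj1 <- adj_r2(0.4304, n1, 1)
adj2 <- adj_r2(0.7213, n2, 3)
round(c(simple = adj1, multiple = adj2), 4)
```

The two models were fit on slightly different sets of countries (179 versus 173 rows after dropping missing values), so the comparison is informative but not made on exactly the same data.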

Question 3

Checking Assumptions (Homoscedasticity and Normality) For the simple linear regression model from Question 1 (LifeExpectancy ~ GDP), describe how you would check the assumptions of homoscedasticity and normality of residuals. For each assumption, explain what an ideal outcome would look like and what a violation might indicate about the model’s reliability for predicting life expectancy. Afterwards, code your answer and reflect on whether it matched the ideal outcome.

# In this chunk I created a residuals vs fitted plot to check for homoscedasticity
# fitted.values holds the life expectancy predicted by the model
# residuals are the differences between the actual and predicted values
# I also added a horizontal line at zero as a visual reference
plot(model$fitted.values, model$residuals,
     main = "Residuals vs Fitted Values",
     xlab = "Fitted Life Expectancy",
     ylab = "Residuals",
     pch = 19)

abline(h = 0, col = "red")

# In this chunk I created a histogram to check for normality
# Residuals are the differences between the actual life expectancy and the predicted values
# I expect the histogram to look roughly bell-shaped if the normality assumption is met
# breaks = 7 suggests the number of bins in the histogram (R treats it as a suggestion)
# col = "lightblue" sets the fill color of the bars, which improves the visual appearance
hist(model$residuals,
     main = "Histogram of Residuals",
     xlab = "Residuals",
     breaks = 7,
     col = "lightblue")

Analysis

This question is asking me to check whether the assumptions of linear regression are met, specifically homoscedasticity and normality, and to interpret what the plots show.

To check homoscedasticity, I used a residuals vs fitted plot. Ideally, the points should be randomly scattered around zero with a roughly constant spread and no pattern. However, in my plot the spread is uneven, there is a slight pattern as the fitted values increase, and there are some extreme residuals. This suggests the model may be less reliable for certain ranges of life expectancy.

To check normality, I used a histogram of the residuals. Ideally, it should look roughly bell-shaped and centered at zero. In my case the histogram is slightly skewed, which indicates that the residuals are not perfectly normal.

Overall, the model still shows a relationship between GDP and life expectancy, but since the assumptions are not fully met, the predictions are somewhat reliable but not perfect, and other factors also influence life expectancy.
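
Two additional checks could complement the plots: a normal Q-Q plot with a Shapiro-Wilk test for normality, and a rough regression of squared residuals on fitted values for constant spread (in the spirit of a Breusch-Pagan test, not a replacement for one). The sketch below runs on simulated stand-in data so that it works on its own; the numbers are not the AllCountries values, and `sim_model` and `spread_check` are hypothetical names.

```r
# Simulated stand-in data (NOT the AllCountries values), so the block runs alone
set.seed(101)
gdp  <- runif(150, min = 500, max = 60000)
life <- 68 + 0.00025 * gdp + rnorm(150, sd = 6)
sim_model <- lm(life ~ gdp)

# Normality: points in a normal Q-Q plot should fall close to the reference line
qqnorm(resid(sim_model), main = "Normal Q-Q Plot of Residuals")
qqline(resid(sim_model), col = "red")

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
sw <- shapiro.test(resid(sim_model))
sw$p.value

# Homoscedasticity (rough check): regress squared residuals on fitted values;
# a clearly non-zero slope hints that the spread changes with the fitted value
spread_check <- lm(resid(sim_model)^2 ~ fitted(sim_model))
summary(spread_check)$coefficients
```

To apply the same checks to the homework model, `sim_model` would be swapped for `model` from Question 1.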

Question 4

Diagnosing Model Fit (RMSE and Residuals) For the multiple regression model from Question 2 (LifeExpectancy ~ GDP + Health + Internet), calculate the RMSE and explain what it represents in the context of predicting life expectancy. How would large residuals for certain countries (e.g., those with unusually high or low life expectancy) affect your confidence in the model’s predictions, and what might you investigate further?

# In this chunk I am calculating the RMSE to measure the prediction error
# Residuals are calculated as actual values minus predicted values
rmse <- sqrt(mean(model2$residuals^2))
rmse

## [1] 4.056417

Analysis

This question is asking me to evaluate how my multiple regression model performs using RMSE and to understand the impact of, and reasons for, prediction errors. I calculated the RMSE to be about 4.06. This basically means that my model’s predictions are typically off by about 4 years when estimating life expectancy.

This suggests that while the model is reasonably accurate, there are still prediction errors. Large residuals for certain countries mean the model is not predicting well for those cases, which lowers my confidence in its predictions for similar countries. This could be due to missing variables or unique factors affecting those countries. To improve the model, I could investigate outliers, include additional variables, and explore differences between countries to better explain the errors.
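
The RMSE is slightly smaller than the residual standard error (4.104) reported by summary(model2), and the short sketch below shows why. It assumes the usual definitions, RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), and uses values copied by hand from the output above; `rmse_from_rse` is a hypothetical name.

```r
# Values copied from the summary(model2) output above
rse   <- 4.104   # residual standard error
df    <- 169     # residual degrees of freedom
n_obs <- df + 4  # rows used: intercept + 3 predictors were estimated

# RSE = sqrt(SSE / df) and RMSE = sqrt(SSE / n), so RMSE = RSE * sqrt(df / n)
rmse_from_rse <- rse * sqrt(df / n_obs)
rmse_from_rse
```

The result agrees with the directly computed RMSE of about 4.056 up to rounding of the copied standard error, which suggests the two numbers describe the same errors with different divisors.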

Question 5

Hypothetical Example (Multicollinearity in Multiple Regression) Suppose you are analyzing the AllCountries dataset and fit a multiple linear regression model to predict CO2 emissions (metric tons per capita) using Energy (kilotons of oil equivalent) and Electricity (kWh per capita) as predictors. You notice that Energy and Electricity are highly correlated. Explain how this multicollinearity might affect the interpretation of the regression coefficients and the reliability of the model.

This question is asking me to explain what happens when two predictor variables are highly related in a multiple regression model. In this case, Energy and Electricity are highly correlated, which means they convey similar information. Because of this, it becomes challenging to tell how much each variable individually affects CO2 emissions (metric tons per capita). The model has trouble separating their effects, and the coefficients can become unstable, changing noticeably with small changes in the data.

Even if the model looks like it fits well overall, this issue makes the individual coefficients less trustworthy and harder to interpret. I could deal with it by removing one of the variables, and I could check for multicollinearity with a tool such as the variance inflation factor (VIF) to see whether the model is reliable.
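
A minimal sketch of what this looks like in practice is given below. It uses simulated stand-in data, not the AllCountries values, with `electricity` constructed to be nearly a rescaled copy of `energy`. The VIF is computed by hand as 1 / (1 - R²) from regressing one predictor on the other, and all variable names are hypothetical.

```r
# Simulated stand-in data (NOT the AllCountries values)
set.seed(101)
n <- 200
energy      <- rnorm(n, mean = 2000, sd = 500)
electricity <- 1.5 * energy + rnorm(n, sd = 50)   # nearly a rescaled copy of energy
co2         <- 1 + 0.002 * energy + rnorm(n, sd = 1)

# Correlation between the two predictors
cor(energy, electricity)

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictors
r2_energy  <- summary(lm(energy ~ electricity))$r.squared
vif_energy <- 1 / (1 - r2_energy)
vif_energy

# Compare the standard error of the energy coefficient with and without
# the correlated partner in the model
se_both  <- summary(lm(co2 ~ energy + electricity))$coefficients["energy", "Std. Error"]
se_alone <- summary(lm(co2 ~ energy))$coefficients["energy", "Std. Error"]
c(both = se_both, alone = se_alone)
```

The standard error for `energy` is much larger when `electricity` is also in the model, which is the instability described above. If the car package is installed, its `vif()` function reports the same quantity directly for a fitted model.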


2026-04-14
