Homework 6

Your Chance to Do a Prediction

Let’s look at the gapminder data. This is data from the World Bank on a ton of different outcomes, but we are going to look at a specific subset focused on, among other things, infant mortality. We will load it via the “dslabs” package, though there are a few different sources (including a gapminder package)

install.packages(“dslabs”)

library(dslabs)

## Warning: package 'dslabs' was built under R version 4.4.3

data(gapminder)
head(gapminder)

##               country year infant_mortality life_expectancy fertility
## 1             Albania 1960           115.40           62.87      6.19
## 2             Algeria 1960           148.20           47.50      7.65
## 3              Angola 1960           208.00           35.98      7.32
## 4 Antigua and Barbuda 1960               NA           62.97      4.43
## 5           Argentina 1960            59.87           65.39      3.11
## 6             Armenia 1960               NA           66.86      4.55
##   population          gdp continent          region
## 1    1636054           NA    Europe Southern Europe
## 2   11124892  13828152297    Africa Northern Africa
## 3    5270844           NA    Africa   Middle Africa
## 4      54681           NA  Americas       Caribbean
## 5   20619075 108322326649  Americas   South America
## 6    1867396           NA      Asia    Western Asia

colnames(gapminder)

## [1] "country"          "year"             "infant_mortality" "life_expectancy" 
## [5] "fertility"        "population"       "gdp"              "continent"       
## [9] "region"

dim(gapminder)

## [1] 10545     9

Go Through Each of Our Steps

First Step: What do we care about? What is our outcome of interest?

I will do this one for us: infant mortality. This will be the outcome you are going to try to predict.

The rest of the assignment is straightforward; simply follow the steps below:

Second Step, but Question 1!

What could explain levels of infant mortality? Pick at least three of the variables in the dataset that you will use as predictors/covariates, and explain each decision.

#What variables are we using and why?

#Fertility: Countries with higher fertility rates often have higher infant mortality, partly due to resource constraints and limited maternal/child health capacity.

#Life_expectancy: Higher life expectancy is a strong signal of better health systems, nutrition, sanitation, and medical access, which are all linked to lower infant mortality.

#GDP: GDP per capita captures national economic development. Wealthier countries typically invest more in healthcare and public health, reducing infant mortality.

predictors <- c("fertility", "life_expectancy", "gdp")
predictors

## [1] "fertility"       "life_expectancy" "gdp"

Third Step, Question 2:

Set up our datasets for our analysis. Divide your dataset in half so that you have both a training and a testing dataset. Do so using random rows as discussed in class (which is important here!). Ensure that the datasets are mutually exclusive (rows are in one of the datasets, but not the other)

#Creating Training and Testing Sets
set.seed(7) 

n <- nrow(gapminder)

#Randomly sample half the rows for training
train_index <- sample(1:n, size = n/2, replace = FALSE)

#Create training and testing datasets
train_set <- gapminder[train_index, ]
test_set  <- gapminder[-train_index, ]

#Check 
dim(train_set)

## [1] 5272    9

dim(test_set)

## [1] 5273    9

Fourth Step, Question 3:

Regress and Predict. Run the actual regression, and then use the results to predict infant mortality in your testing set. Attach the predictions for the testing set directly to the testing set to make the next step easier.

#Regression model using my predictors
model <- lm(infant_mortality ~ fertility + life_expectancy + gdp, data = train_set)

summary(model)

## 
## Call:
## lm(formula = infant_mortality ~ fertility + life_expectancy + 
##     gdp, data = train_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.638  -9.077   0.226   9.110  94.071 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.344e+02  4.036e+00  58.082  < 2e-16 ***
## fertility        6.091e+00  2.539e-01  23.991  < 2e-16 ***
## life_expectancy -3.155e+00  4.854e-02 -65.010  < 2e-16 ***
## gdp              1.506e-12  4.087e-13   3.685 0.000232 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.99 on 3558 degrees of freedom
##   (1710 observations deleted due to missingness)
## Multiple R-squared:  0.8679, Adjusted R-squared:  0.8678 
## F-statistic:  7793 on 3 and 3558 DF,  p-value: < 2.2e-16

#Predict infant mortality for the testing set
test_set$predicted_im <- predict(model, newdata = test_set)

#Check the first few 
head(test_set[, c("infant_mortality", "predicted_im")])

##    infant_mortality predicted_im
## 1             115.4           NA
## 4                NA           NA
## 7                NA           NA
## 14             69.5     65.77299
## 16             29.5     30.75759
## 19            175.0           NA

#Removing rows in the test set that are missing predictors
test_clean <- test_set %>%
  filter(!is.na(fertility),
         !is.na(life_expectancy),
         !is.na(gdp),
         !is.na(infant_mortality))

#Make predictions only on rows we can actually predict
test_clean$predicted_im <- predict(model, newdata = test_clean)

#Compare actual vs predicted
head(test_clean[, c("infant_mortality", "predicted_im")])

##   infant_mortality predicted_im
## 1             69.5     65.77299
## 2             29.5     30.75759
## 3            115.5    115.87990
## 4            190.0    162.47489
## 5             89.3     92.80090
## 6            209.6    122.36408

Fifth Step, Question 4:

Assess Prediction. Calculate the MAE and RMSE (mean absolute error and root mean square error) for your predictions. Compare them to the MAE and RMSE of simply guessing the mean infant mortality or median infant mortality as the rate for a given country in a given year.

#Absolute error for each observation
test_clean$abs_error <- abs(test_clean$infant_mortality - test_clean$predicted_im)

#Squared error for each observation
test_clean$sq_error <- (test_clean$infant_mortality - test_clean$predicted_im)^2

# MAE
MAE_model <- mean(test_clean$abs_error, na.rm = TRUE)

# RMSE
RMSE_model <- sqrt(mean(test_clean$sq_error, na.rm = TRUE))

MAE_model

## [1] 12.15373

RMSE_model

## [1] 16.81808

#Baseline: Mean guess
mean_guess <- mean(train_set$infant_mortality, na.rm = TRUE)

#Errors from simply guessing the mean
test_clean$mean_abs_error <- abs(test_clean$infant_mortality - mean_guess)
test_clean$mean_sq_error  <- (test_clean$infant_mortality - mean_guess)^2

MAE_mean <- mean(test_clean$mean_abs_error)
RMSE_mean <- sqrt(mean(test_clean$mean_sq_error))

#Baseline: Median guess
median_guess <- median(train_set$infant_mortality, na.rm = TRUE)

test_clean$med_abs_error <- abs(test_clean$infant_mortality - median_guess)
test_clean$med_sq_error  <- (test_clean$infant_mortality - median_guess)^2

MAE_median <- mean(test_clean$med_abs_error)
RMSE_median <- sqrt(mean(test_clean$med_sq_error))

#Print everything
MAE_model

## [1] 12.15373

RMSE_model

## [1] 16.81808

MAE_mean

## [1] 38.50456

RMSE_mean

## [1] 46.45832

MAE_median

## [1] 37.22098

RMSE_median

## [1] 48.34491

#My regression model seem to have performed better than either a mean-guess or median-guess baseline. The model produced a MAE of 12.15 and a RMSE of 16.82, showing that the typical prediction error was around 12 infant deaths per 1,000 live births. In contrast, simply guessing the mean infant mortality rate resulted in a MAE of 38.50 and RMSE of 46.46, while the median guess resulted in a MAE of 37.22 and RMSE of 48.34. Both approaches performed worse than the regression model. These results suggest that the predictors I selected (fertility, life expectancy, and GDP) provide meaningful explanatory power.