Project 1: Global Statistics

Introduction & Background of Dataset

In this project, a single dataset consisting of information about global mortality rates and population growth will be used. This dataset was pulled from UNdata, a public database containing global statistics provided by intergovernmental and international organizations such as the United Nations. Below, the dataset will be imported in the csv file format and converted to a data frame with the read.csv() function.

population_data <- read.csv("population_mortality.csv")

This dataset file has been successfully assigned to the name population_data. Before we proceed, it is necessary that several libraries or packages are imported to plot different variables against each other and visualize much of the data.

What libraries are needed?

library(dplyr)
library(ggplot2)

The dplyr library provides functions for filtering, selecting, renaming, and arranging data, which reduce a dataset to the parts that are relevant to the question being studied and easier to visualize. The ggplot2 library is used to build two-dimensional plots, including the scatterplot with a fitted simple linear regression line, the residual plots, and the barplot that appear later in this report.

Breakdown of Dataset

Numerous simple functions can be utilized to better understand the data set and prepare for data analysis as we examine specific variables that will be explored. Firstly, it will be helpful to obtain the dimensions of the data frame, and the nrow() and ncol() functions are one such method out of many that can be used to retrieve the number of columns and rows.

nrow(population_data)

## [1] 6832

ncol(population_data)

## [1] 7

colnames(population_data)

## [1] "T03"                                                        
## [2] "Population.growth.and.indicators.of.fertility.and.mortality"
## [3] "X"                                                          
## [4] "X.1"                                                        
## [5] "X.2"                                                        
## [6] "X.3"                                                        
## [7] "X.4"

Based on the outputs of the three functions, population_data has 7 distinct columns and 6832 entries or rows, but the colnames() function returns column names that are unintelligible: most are placeholders such as X and X.1 rather than descriptive labels. This needs to be fixed for easier readability when we later create various plots, so we will be giving the columns more appropriate names where:

T03 = regionalNumber, or numbers associated with different global regions
Population.growth.and.indicators.of.fertility.and.mortality = globalRegion, or name of country or region
X = Year
X.1 = Variable, or what the value is measuring (expanded on below)
X.2 = Value
X.3 = Footnotes, or extra notes within dataset
X.4 = Source, or from where the information from that particular row was provided

Additionally, we will be creating a tibble from this data frame that excludes the first row (provides true name of column) and unnecessary columns such as Footnotes, Source, and Regional Number.

population <- as_tibble(population_data)

population_tibble <- population %>% 
  rename(globalRegion = Population.growth.and.indicators.of.fertility.and.mortality,
         Year = X, Variable = X.1, Value = X.2) %>% select(-c(T03,X.3,X.4))

population_tibble <- population_tibble[-1, ]

summary(population_tibble)

##  globalRegion           Year             Variable            Value          
##  Length:6831        Length:6831        Length:6831        Length:6831       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

The output of the summary() function confirms that we have successfully simplified the tibble from 7 columns to 4 and named each of them appropriately.
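As a possible alternative to renaming the columns and dropping the first row by hand, read.csv() can skip the title line so that the real header row supplies the column names. The sketch below uses a made-up four-line file; its header names and values are assumptions for illustration and are not the contents of population_mortality.csv.

```r
# Hypothetical miniature of the raw file: a title line, then the real header.
# The header names and the two data rows are invented for this sketch.
raw_lines <- c(
  'T03,Population growth and indicators of fertility and mortality,,,,,',
  'regionalNumber,globalRegion,Year,Variable,Value,Footnotes,Source',
  '1,"Total, all countries or areas",2010,Total fertility rate (children per women),2.5,,Demo source',
  '2,Africa,2010,Total fertility rate (children per women),4.5,,Demo source'
)
tmp <- tempfile(fileext = ".csv")
writeLines(raw_lines, tmp)

# skip = 1 ignores the title line, so the second line is read as the header.
demo <- read.csv(tmp, skip = 1)

colnames(demo)   # descriptive names instead of X, X.1, ...
str(demo)        # Year and Value are read as numbers, not characters
```

Whether this works unchanged on the real file depends on how its values are formatted, so the column classes should still be checked after import.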

It is important to note that the Variable column contains 7 different variables that include:

Population annual rate of increase (percent)
Total fertility rate (children per women)
Under five mortality rate for both sexes (per 1,000 live births)
Maternal mortality ratio (deaths per 100,000 population)
Life expectancy at birth for both sexes (years)
Life expectancy at birth for males (years)
Life expectancy at birth for females (years)

Since this column holds several categories, it is best to convert it to a factor with one level for each of the variables listed above, as we will be filtering on these levels for the different data visualizations.

population_tibble$Variable <- factor(population_tibble$Variable)

levels(population_tibble$Variable)

## [1] "Life expectancy at birth for both sexes (years)"                 
## [2] "Life expectancy at birth for females (years)"                    
## [3] "Life expectancy at birth for males (years)"                      
## [4] "Maternal mortality ratio (deaths per 100,000 population)"        
## [5] "Population annual rate of increase (percent)"                    
## [6] "Total fertility rate (children per women)"                       
## [7] "Under five mortality rate for both sexes (per 1,000 live births)"
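Once a column is a factor, its levels can be counted and subset directly. A minimal sketch with a made-up three-element factor (the shortened labels are illustrative, not the dataset's level names):

```r
# Made-up factor with two distinct categories.
variable <- factor(c("Total fertility rate",
                     "Life expectancy",
                     "Total fertility rate"))

levels(variable)   # levels are sorted alphabetically
table(variable)    # number of entries per level

# Keep one category, then drop the levels that no longer occur.
kept <- droplevels(variable[variable == "Total fertility rate"])
levels(kept)
```

Without droplevels(), the unused level would remain attached to the subset and could still show up in tables or plot legends.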

Lastly, the Year and Value columns are in the character class and must be converted to numeric for data visualization. The following code confirms the changes made in the dataset.

population_tibble$Year <- as.numeric(population_tibble$Year)
population_tibble$Value <- as.numeric(population_tibble$Value)

class(population_tibble$Year)

## [1] "numeric"

class(population_tibble$Value)

## [1] "numeric"
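Checking the class alone does not reveal whether any entries failed to convert, because as.numeric() turns text it cannot parse into NA. The sketch below, on made-up strings, shows one way to list such entries; the thousands separator in "1,234" is an assumed example of a problem value, not something observed in this dataset.

```r
# Made-up character values: one clean number, one with a comma, one non-numeric.
raw_value <- c("70.1", "1,234", "n/a")

# as.numeric() warns and returns NA for anything it cannot parse.
converted <- suppressWarnings(as.numeric(raw_value))
converted

# Entries that were present before conversion but are NA afterwards.
failed <- raw_value[is.na(converted) & !is.na(raw_value)]
failed

# Removing the comma first recovers the number.
as.numeric(gsub(",", "", "1,234"))
```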

Problem Definitions

Considering this dataset:

Problem 1: What predictions can be made based on the relationship present between global fertility rates and global under-five mortality rates?
Problem 2: How have life expectancies for both sexes changed globally as the years have progressed?

Problem 1: Infant Mortality vs. Fertility Rate

Since the first question calls for a prediction model, the best-suited tool is simple linear regression, which fits a straight line to describe the relationship between the two variables of interest. In this case, the independent variable is the under-five mortality rate, and the dependent variable is the fertility rate.
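In symbols, the model being fitted can be written as follows, where \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\varepsilon\) is the error term. The notation is the standard one for simple linear regression, and the variable names are the column names used later in this report.

```latex
\[
  \text{fertilityRate} = \beta_0 + \beta_1 \cdot \text{mortalityRate} + \varepsilon
\]
```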

Filtering Dataset

To proceed with this problem, we must filter the globalRegion column to all entries except for those that fall under the category of “Total, all countries or areas” since we only want to consider specific regions or areas. Following that, we will filter the Variable column for entries related to our identified independent and dependent variables.

We are going to study this problem using the latest year available in the dataset, which can be found with the max() function.

max(population_tibble$Year)

## [1] 2024

Since the latest year is 2024, we will be filtering for all entries that were recorded in the year 2024 below.
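One detail matters when the region label is typed into a filter: R keeps any line break that falls inside the quotes, so a label wrapped onto two lines no longer equals the label stored in the data. A small check, with illustrative variable names:

```r
label <- "Total, all countries or areas"

# "\n" stands for the line break that R stores when a quoted string
# is wrapped onto a second line in the script.
broken <- "Total, all \ncountries or areas"

label == broken        # FALSE: the two strings differ

regions <- c("Africa", label)
regions != broken      # TRUE TRUE: no row is excluded
regions != label       # TRUE FALSE: the aggregate row is excluded
```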

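# Keep the region label on one line: a line break inside the quotes becomes
# part of the string, the != test is then TRUE for every row, and the
# aggregate row is kept, as the first row of the str() output further down shows.
mort_tibble <- population_tibble %>% 
  filter(globalRegion != "Total, all countries or areas", 
         Variable == "Under five mortality rate for both sexes (per 1,000 live births)", 
         Year == 2024) %>% 
  rename(mortalityRate = Value)

fert_tibble <- population_tibble %>% 
  filter(globalRegion != "Total, all countries or areas", 
         Variable == "Total fertility rate (children per women)", 
         Year == 2024) %>% 
  rename(fertilityRate = Value)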

mort_vs_fert <- tibble(globalRegion = mort_tibble$globalRegion, 
                       Year = mort_tibble$Year, 
                       mortalityRate = mort_tibble$mortalityRate, 
                       fertilityRate = fert_tibble$fertilityRate)
str(mort_vs_fert)

## tibble [270 × 4] (S3: tbl_df/tbl/data.frame)
##  $ globalRegion : chr [1:270] "Total, all countries or areas" "Africa" "Northern Africa" "Sub-Saharan Africa" ...
##  $ Year         : num [1:270] 2024 2024 2024 2024 2024 ...
##  $ mortalityRate: num [1:270] 36 62.4 26.9 67.6 48.5 72.1 33.6 89.2 12.7 5.8 ...
##  $ fertilityRate: num [1:270] 2.2 4 2.9 4.3 4 5.4 2.3 4.4 1.7 1.6 ...
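Building mort_vs_fert column by column relies on both filtered tibbles listing the same regions in the same order. A key-based join does not need that assumption, because rows are matched on region and year. The sketch below uses base R's merge() on made-up values; dplyr's inner_join() would be the equivalent in the pipeline style used in this report.

```r
# Made-up data: the second table is in a different order and lacks "Asia".
mort_demo <- data.frame(
  globalRegion  = c("Africa", "Asia", "Europe"),
  Year          = c(2024, 2024, 2024),
  mortalityRate = c(60, 25, 5)
)
fert_demo <- data.frame(
  globalRegion  = c("Europe", "Africa"),
  Year          = c(2024, 2024),
  fertilityRate = c(1.5, 4.0)
)

# Rows are paired by region and year, not by position;
# regions missing from either table are dropped.
joined <- merge(mort_demo, fert_demo, by = c("globalRegion", "Year"))
joined
```

Pairing by position would have attached Europe's fertility rate to Africa's mortality rate here, which is the kind of silent mismatch a join avoids.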

Plotting Simple Linear Regression Model

This newly created tibble will be used to build the simple linear regression model, as mortalityRate and fertilityRate have been paired within the same tibble in separate columns. We will be plotting the independent variable (mortalityRate) on the x-axis and the dependent variable (fertilityRate) on the y-axis.

ggplot(mort_vs_fert, aes(x = mortalityRate, y = fertilityRate)) + 
  geom_point(shape = 1) + geom_smooth(method = "lm", formula = y ~ x, color = "red") + 
  labs(title = "Simple Linear Regression: Mortality vs. Fertility Rate", 
       x = "Under-Five Mortality Rate (Per 1,000 Live Births)", 
       y = "Fertility Rate (Children Per Woman)") + theme_minimal()

We can utilize the lm() function to get a mathematical representation of the line in the model above.

mort_fert_line <- lm(mort_vs_fert$fertilityRate ~ mort_vs_fert$mortalityRate, 
                     data = mort_vs_fert)
summary(mort_fert_line)

## 
## Call:
## lm(formula = mort_vs_fert$fertilityRate ~ mort_vs_fert$mortalityRate, 
##     data = mort_vs_fert)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6469 -0.3864 -0.1087  0.3061  2.7283 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                1.425300   0.053299   26.74   <2e-16 ***
## mort_vs_fert$mortalityRate 0.040953   0.001645   24.89   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6315 on 268 degrees of freedom
## Multiple R-squared:  0.698,  Adjusted R-squared:  0.6969 
## F-statistic: 619.4 on 1 and 268 DF,  p-value: < 2.2e-16

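As shown by the coefficients section in the output, the simple linear regression line can be represented by \(y = 0.04095x + 1.42530\), where \(x\) is the under-five mortality rate and \(y\) is the fertility rate. The value 0.05330 is the standard error of the intercept estimate, and 0.00165 is the standard error of the slope; both measure the uncertainty of the coefficients and are not terms to be added to the equation. The residual standard error is 0.6315, and the multiple R-squared is 0.698.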
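To see how the fitted line turns a mortality rate into a predicted fertility rate, the estimates printed in the coefficient table can be plugged into a small function. The function name and the input of 50 deaths per 1,000 live births are illustrative choices, not values from the report.

```r
# Estimates copied from the coefficient table of summary(mort_fert_line).
intercept <- 1.425300
slope     <- 0.040953

# Hypothetical helper: predicted fertility rate for a given
# under-five mortality rate (per 1,000 live births).
predict_fertility <- function(mortality_rate) {
  intercept + slope * mortality_rate
}

predict_fertility(50)   # 1.4253 + 0.040953 * 50 = 3.47295
```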

Building a Residual Model

A residual plot, referred to here as a residual model, is the final check on whether our linear model is an adequate predictive model, based on how the residuals are distributed. In the previous subsection, the summary of the lm() fit reported the intercept, the regression coefficient, and their standard errors; the same fitted object will be used to build the residual model.

Specifically, the object returned by lm() already stores the fitted values, and the predict() and residuals() functions extract the predicted values and the residuals from it. A residual is the actual value minus the predicted value. Both sets of values are created here:

residuals <- residuals(mort_fert_line)
predicted <- predict(mort_fert_line)

In order for these sets of values to be read well by the ggplot() function, we must create a tibble with both.

resid_vs_pred <- tibble(Residuals = residuals, Predicted = predicted)

ggplot(resid_vs_pred, aes(x = Predicted, y = Residuals)) + 
  geom_point(shape = 1, color = "blue") + geom_hline(yintercept = 0, 
                                                      color = "red") + 
  labs(title = "Residual Model Scatterplot (1): Mortality vs. Fertility", 
       x = "Fitted Values", y = "Residual Values") + theme_minimal()

ggplot(resid_vs_pred, aes(y = Residuals)) + geom_boxplot(fill = "maroon") + 
  labs(title = "Residual Model Boxplot (2): Mortality vs. Fertility", 
       y = "Residual Values") + theme_minimal()
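The residuals plotted above are actual values minus fitted values. This can be checked on a small made-up dataset, sketched below. The toy fit writes its formula with bare column names (y ~ x) together with data =, which is also what allows predict() to accept new data.

```r
# Made-up data that roughly follows a straight line.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

fit <- lm(y ~ x, data = toy)

# Residual = actual - fitted; this matches what residuals() returns.
manual_resid <- toy$y - fitted(fit)
all.equal(unname(manual_resid), unname(residuals(fit)))

# With an intercept in the model, least-squares residuals sum to
# (numerically) zero.
sum(residuals(fit))

# Prediction for a value of x that is not in the data.
new_pred <- predict(fit, newdata = data.frame(x = 6))
new_pred
```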

Residual Model Analysis

Residual Model 1 - Scatterplot: We can observe that the values are clustered around the smaller fitted values with more points between -1 and 0 of the residual values (y-axis), growing more sparse as the fitted values increase. They are, however, clustered around the red line at 0.

Based on this, we can tentatively conclude that the linear model is not an accurate predictive model for the relationship between the under-five mortality rate and the fertility rate (children per woman). The plot shows an uneven distribution, since a majority of the points fall below the red line at 0. Because a residual is the actual value minus the predicted value, negative residuals mean that, for most regions, the linear model predicts a value that is greater than the actual one. The median residual of -0.1087 in the model summary is consistent with this.

Residual Model 2 - Boxplot: The key sign of a good or bad residual boxplot is the placement of the median line and its proximity to zero. In this case, the median line is close to zero, but the width of the box, coupled with the presence of many outliers, can be a sign of inaccuracy in the linear model.
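The quantities read off a boxplot (median, box width, and outliers) can also be computed directly. The sketch below does so for a made-up vector of residuals, assuming the conventional rule that flags points lying more than 1.5 times the interquartile range beyond the quartiles.

```r
# Made-up residuals for illustration only.
res_demo <- c(-1.6, -0.4, -0.3, -0.1, 0.0, 0.2, 0.3, 0.5, 2.7)

med <- median(res_demo)                              # position of the median line
q   <- unname(quantile(res_demo, c(0.25, 0.75)))     # edges of the box
iqr <- q[2] - q[1]                                   # width of the box

# Points beyond 1.5 * IQR from the box are drawn as outliers.
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- res_demo[res_demo < lower | res_demo > upper]

med
iqr
outliers
```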

Problem 1: Conclusion

Given the original question, the simple linear regression model was not a good fit based on the distributions of the two residual models, one being a scatterplot and the other a boxplot. Both pointed towards large errors and inaccuracies, as mentioned in the previous subsection. The linear model cannot be used to form any predictions regarding the relationship between under-five mortality rates and the number of children per woman.

Problem 2: Life Expectancies

Question: How have life expectancies for both sexes changed globally as the years have progressed?

Filtering Dataset

For this problem, we will be considering the whole world which means that we must filter for the global region that accounts for all countries or areas. Below, we have also filtered for the life expectancy variable that accounts for both sexes as well, creating a new filtered tibble.

life_exp_tibble <- population_tibble %>% filter(
  Variable == "Life expectancy at birth for both sexes (years)" & 
    globalRegion == "Total, all countries or areas") %>% 
  rename(lifeExpectancy = Value) %>% select(-Variable)

life_exp_tibble

## # A tibble: 4 × 3
##   globalRegion                   Year lifeExpectancy
##   <chr>                         <dbl>          <dbl>
## 1 Total, all countries or areas  2010           70.1
## 2 Total, all countries or areas  2015           71.6
## 3 Total, all countries or areas  2020           71.9
## 4 Total, all countries or areas  2024           73.3

Plotting Barplot

life_exp_tibble$Year <- as.character(life_exp_tibble$Year)

ggplot(data = life_exp_tibble, aes(x = Year, y = lifeExpectancy, 
                                   fill = lifeExpectancy)) + 
  geom_col() +                                            # bar height = value
  geom_text(aes(label = lifeExpectancy), vjust = -0.3) +  # label above each bar
  scale_y_continuous(breaks = seq(0, 70, by = 10)) + 
  scale_fill_gradient(name = "Life Expectancy", 
                      low = "lightblue", high = "darkblue") +
  labs(title = "Barplot: Global Life Expectancy Over The Years", 
       x = "Year", y = "Life Expectancy (Both Sexes)") + theme_minimal()

Above, we converted the Year variable from the numeric class to the character class so that each year is treated as a separate category on the x-axis. We then used the ggplot() function to plot the global life expectancy for each year in the filtered tibble (2010, 2015, 2020, and 2024).

Problem 2: Observations & Conclusion

Based on the barplot above, life expectancy for both sexes combined has increased worldwide over a span of 14 years, from 70.1 years in 2010 to 73.3 years in 2024. Improvements in technology and medical treatment are a plausible explanation, although this dataset does not test that. This answers the second problem.
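The size of the increase can be quantified from the four values shown in the filtered tibble. A minimal sketch, with the values retyped by hand into a plain data frame whose name is illustrative:

```r
# Values copied from the life_exp_tibble output above.
life_exp <- data.frame(
  Year           = c(2010, 2015, 2020, 2024),
  lifeExpectancy = c(70.1, 71.6, 71.9, 73.3)
)

# Change between consecutive observations.
step_change <- diff(life_exp$lifeExpectancy)
step_change

# Total change and average change per year over the whole span.
total_change <- life_exp$lifeExpectancy[4] - life_exp$lifeExpectancy[1]
years_span   <- life_exp$Year[4] - life_exp$Year[1]
total_change                 # about 3.2 years
total_change / years_span    # about 0.23 years per calendar year
```

Every step is positive, although the gain between 2015 and 2020 is visibly smaller than in the other two intervals.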

Alysa Jacob

11-02-2025
