This project uses a single dataset on global mortality rates and population growth. The dataset was pulled from UNdata, a public database of global statistics provided by intergovernmental and international organizations such as the United Nations. Below, the dataset is imported from a CSV file and converted to a data frame with the read.csv() function.
population_data <- read.csv("population_mortality.csv")
The dataset has been assigned to the name population_data. Before we proceed, we need to load the packages that will be used to manipulate the data and to plot variables against each other.
library(dplyr)
library(ggplot2)
The dplyr library provides functions for filtering, arranging, and reshaping the dataset into a form that is easier to visualize and more relevant to the question being studied. The ggplot2 library is used to build the two-dimensional plots in this report, especially those that visualize statistical models such as simple linear regression and ANOVA (Analysis of Variance). These concepts will be introduced and expanded on later in this report.
A few simple functions help us understand the dataset and prepare for analysis. First, it is helpful to obtain the dimensions of the data frame; the nrow() and ncol() functions are one of several ways to retrieve the number of rows and columns, and colnames() lists the column names.
nrow(population_data)
## [1] 6832
ncol(population_data)
## [1] 7
colnames(population_data)
## [1] "T03"
## [2] "Population.growth.and.indicators.of.fertility.and.mortality"
## [3] "X"
## [4] "X.1"
## [5] "X.2"
## [6] "X.3"
## [7] "X.4"
Based on the outputs of these functions, population_data has 7 columns and 6832 rows, but colnames() returns names that are not meaningful: the first line of the file is a title rather than a header, so most columns received placeholder names such as X and X.1. This needs to be fixed for readability when we later create plots, so we will rename the columns as follows: Population.growth.and.indicators.of.fertility.and.mortality becomes globalRegion, X becomes Year, X.1 becomes Variable, and X.2 becomes Value.
Additionally, we will create a tibble from this data frame that excludes the first row (which holds the true column names) and the unnecessary columns T03, X.3, and X.4, which hold the Regional Number, Footnotes, and Source fields.
population <- as_tibble(population_data)
population_tibble <- population %>%
  rename(globalRegion = Population.growth.and.indicators.of.fertility.and.mortality,
         Year = X, Variable = X.1, Value = X.2) %>%
  select(-c(T03, X.3, X.4))
population_tibble <- population_tibble[-1, ]
summary(population_tibble)
## globalRegion Year Variable Value
## Length:6831 Length:6831 Length:6831 Length:6831
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
The output of the summary() function confirms that we have successfully simplified the tibble from 7 columns to 4 and named each of them appropriately.
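An alternative to removing the first row after import is to skip the title line while reading the file, so that the second line supplies the column names and numeric columns are parsed as numbers. The sketch below uses a small hypothetical excerpt in place of the real file; the header names and the two data rows are assumptions for illustration, not values taken from population_mortality.csv.

```r
# Hypothetical excerpt mimicking the file layout: a title line, then the real header.
# The header names and data values below are illustrative assumptions.
raw_lines <- c(
  "T03,Population growth and indicators of fertility and mortality,,,,,",
  "Region/Country/Area,,Year,Series,Value,Footnotes,Source",
  "1,\"Total, all countries or areas\",2010,Example series A,2.5,,Example source",
  "2,Africa,2010,Example series A,4.5,,Example source"
)

# skip = 1 drops the title line, so the second line is used as the header.
skipped <- read.csv(text = raw_lines, skip = 1)

# Year and Value are now parsed as numbers rather than as character strings.
str(skipped)
```

With the real file, the same idea would be read.csv("population_mortality.csv", skip = 1), after which the columns could be renamed as before.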
It is important to note that the Variable column contains 7 different variables.
Since this column has multiple categories, it is best to convert it to a factor with one level per variable, as shown in the output below, because we will be filtering on these levels for the different data visualizations.
population_tibble$Variable <- factor(population_tibble$Variable)
levels(population_tibble$Variable)
## [1] "Life expectancy at birth for both sexes (years)"
## [2] "Life expectancy at birth for females (years)"
## [3] "Life expectancy at birth for males (years)"
## [4] "Maternal mortality ratio (deaths per 100,000 population)"
## [5] "Population annual rate of increase (percent)"
## [6] "Total fertility rate (children per women)"
## [7] "Under five mortality rate for both sexes (per 1,000 live births)"
Lastly, the Year and Value columns are of class character (most likely because the row holding the true column names was read in as data) and must be converted to numeric before they can be plotted or modeled. The class() calls below confirm the conversion.
population_tibble$Year <- as.numeric(population_tibble$Year)
population_tibble$Value <- as.numeric(population_tibble$Value)
class(population_tibble$Year)
## [1] "numeric"
class(population_tibble$Value)
## [1] "numeric"
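One thing to watch with as.numeric() is that any string it cannot parse, such as a number written with a thousands separator, becomes NA with only a warning. Whether the Value column of this dataset contains such strings is not shown in the report, so the vector below is purely hypothetical; the sketch only illustrates how one might check for and guard against lost values.

```r
# Hypothetical character values, including one with a thousands separator.
raw_values <- c("36", "62.4", "1,120", "2.2")

# Direct conversion turns the unparseable "1,120" into NA (with a warning).
direct <- suppressWarnings(as.numeric(raw_values))
sum(is.na(direct))   # number of values lost by the direct conversion

# Removing the commas first keeps every value.
cleaned <- as.numeric(gsub(",", "", raw_values, fixed = TRUE))
cleaned
```

Running sum(is.na(population_tibble$Value)) after the conversion would be a quick way to see whether any entries were affected.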
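Considering this dataset, two questions will be explored: whether regional under-five mortality rates can be used to predict fertility rates, and how global life expectancy for both sexes has changed over the years.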
Since the first question calls for a prediction model, simple linear regression is a suitable tool for estimating the linear relationship between the two variables of interest. In this case, the independent variable is the under-five mortality rate, and the dependent variable is the total fertility rate.
To proceed with this problem, we must filter the globalRegion column to keep all entries except those labeled "Total, all countries or areas", since we only want to consider specific regions or areas. Following that, we will filter the Variable column for the entries that correspond to our independent and dependent variables.
We will study this problem using the latest year available in the dataset, which we can find with the max() function.
max(population_tibble$Year)
## [1] 2024
Since the latest year is 2024, we will be filtering for all entries that were recorded in the year 2024 below.
mort_tibble <- population_tibble %>% filter(globalRegion != "Total, all
countries or areas" &
Variable == "Under five mortality rate for both sexes (per 1,000 live births)"
& Year == 2024) %>% rename(mortalityRate = Value)
fert_tibble <- population_tibble %>% filter(globalRegion != "Total, all
countries or areas" &
Variable == "Total fertility rate (children per women)" & Year == 2024) %>%
rename(fertilityRate = Value)
mort_vs_fert <- tibble(globalRegion = mort_tibble$globalRegion,
Year = mort_tibble$Year,
mortalityRate = mort_tibble$mortalityRate,
fertilityRate = fert_tibble$fertilityRate)
str(mort_vs_fert)
## tibble [270 × 4] (S3: tbl_df/tbl/data.frame)
## $ globalRegion : chr [1:270] "Total, all countries or areas" "Africa" "Northern Africa" "Sub-Saharan Africa" ...
## $ Year : num [1:270] 2024 2024 2024 2024 2024 ...
## $ mortalityRate: num [1:270] 36 62.4 26.9 67.6 48.5 72.1 33.6 89.2 12.7 5.8 ...
## $ fertilityRate: num [1:270] 2.2 4 2.9 4.3 4 5.4 2.3 4.4 1.7 1.6 ...
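This newly created tibble will be used to build the simple linear regression model, since mortalityRate and fertilityRate have now been paired in separate columns of the same tibble. Note that the str() output still lists "Total, all countries or areas" as its first entry: the region label in the filter() calls is split across two lines, so the exclusion never matches and the aggregate row remains among the 270 observations used below. We will plot the independent variable (mortalityRate) on the x-axis and the dependent variable (fertilityRate) on the y-axis.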
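A minimal sketch of how the two filtered tables could be paired by key rather than by position, with the region label kept on a single line (a label broken across lines contains a newline character and will not match any entry). Pairing with inner_join() does not depend on both tables listing the regions in the same order. The toy tibble reuses the first three rows of the str() output above; the names toy, aggregate_label, mort, fert, and joined are introduced here only for illustration, and refitting the model on data built this way would give results that differ from those reported in this section.

```r
library(dplyr)

# Toy data in the same long format as population_tibble (values from the str() output).
toy <- tibble(
  globalRegion = rep(c("Total, all countries or areas", "Africa", "Northern Africa"),
                     each = 2),
  Year = 2024,
  Variable = rep(c("Under five mortality rate for both sexes (per 1,000 live births)",
                   "Total fertility rate (children per women)"), times = 3),
  Value = c(36, 2.2, 62.4, 4, 26.9, 2.9)
)

# Keep the label on one line so the comparison can actually match.
aggregate_label <- "Total, all countries or areas"

mort <- toy %>%
  filter(globalRegion != aggregate_label,
         Variable == "Under five mortality rate for both sexes (per 1,000 live births)") %>%
  select(globalRegion, Year, mortalityRate = Value)

fert <- toy %>%
  filter(globalRegion != aggregate_label,
         Variable == "Total fertility rate (children per women)") %>%
  select(globalRegion, Year, fertilityRate = Value)

# Match rows on region and year instead of relying on row order.
joined <- inner_join(mort, fert, by = c("globalRegion", "Year"))
joined
```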
ggplot(mort_vs_fert, aes(x = mortalityRate, y = fertilityRate)) +
geom_point(shape = 1) + geom_smooth(method = "lm", formula = y ~ x, color = "red") +
labs(title = "Simple Linear Regression: Mortality vs. Fertility Rate",
x = "Under-Five Mortality Rate (Per 1,000 Live Births)",
y = "Fertility Rate (Children Per Woman)") + theme_minimal()
We can use the lm() function to obtain a mathematical representation of the fitted line in the plot above.
mort_fert_line <- lm(mort_vs_fert$fertilityRate ~ mort_vs_fert$mortalityRate,
data = mort_vs_fert)
summary(mort_fert_line)
##
## Call:
## lm(formula = mort_vs_fert$fertilityRate ~ mort_vs_fert$mortalityRate,
## data = mort_vs_fert)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6469 -0.3864 -0.1087 0.3061 2.7283
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.425300 0.053299 26.74 <2e-16 ***
## mort_vs_fert$mortalityRate 0.040953 0.001645 24.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6315 on 268 degrees of freedom
## Multiple R-squared: 0.698, Adjusted R-squared: 0.6969
## F-statistic: 619.4 on 1 and 268 DF, p-value: < 2.2e-16
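As shown by the coefficients section in the output, the simple linear regression line can be represented by \(y = 0.04095x + 1.42530\), where \(x\) is the under-five mortality rate and \(y\) is the fertility rate. The Std. Error column (0.0533 for the intercept and 0.0016 for the slope) measures the uncertainty in each estimate; it is not a term to be added to the equation. The multiple R-squared of 0.698 indicates that the line accounts for roughly 70% of the variance in fertility rate.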
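As a worked example of the fitted equation, the sketch below plugs the first row of mort_vs_fert (mortality rate 36, observed fertility rate 2.2) into the line. The helper name predict_fertility is hypothetical and simply wraps the two coefficients copied from the summary. The prediction comes out at about 2.90 children per woman, so the residual for that row is about -0.70.

```r
# Coefficients copied from the summary() output above.
intercept <- 1.425300
slope     <- 0.040953

# Hypothetical helper: predicted fertility rate for a given under-five mortality rate.
predict_fertility <- function(mortality_rate) {
  intercept + slope * mortality_rate
}

# First row of mort_vs_fert: mortality rate 36, observed fertility rate 2.2.
predicted_value <- predict_fertility(36)
residual_value  <- 2.2 - predicted_value   # observed minus predicted

predicted_value
residual_value
```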
Residual plots are the final check on whether our linear model is an adequate predictive model, based on how the residuals are distributed. In the previous subsection, the summary of the lm() fit supplied the intercept, the regression coefficient, and their standard errors; the same fitted model is used here to build the residual plots.
Specifically, the object returned by lm() already contains the fitted values, and we can use the predict() and residuals() functions to extract them. A residual is the difference between the actual (observed) value and the predicted value:
residuals <- residuals(mort_fert_line)
predicted <- predict(mort_fert_line)
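To make the relationship between these two vectors concrete, the sketch below fits a line to only the first ten rows of mort_vs_fert (copied from the str() output earlier), so its coefficients will not match the full model. It also writes the formula with bare column names, which is an assumption about style rather than the approach used above; one benefit is that predict() can then be given new data. The names first_ten, small_fit, fitted_values, and resid_values are illustrative.

```r
# First ten rows of mort_vs_fert, copied from the str() output.
first_ten <- data.frame(
  mortalityRate = c(36, 62.4, 26.9, 67.6, 48.5, 72.1, 33.6, 89.2, 12.7, 5.8),
  fertilityRate = c(2.2, 4, 2.9, 4.3, 4, 5.4, 2.3, 4.4, 1.7, 1.6)
)

# Bare column names in the formula let predict() accept a newdata argument later.
small_fit <- lm(fertilityRate ~ mortalityRate, data = first_ten)

fitted_values <- predict(small_fit)
resid_values  <- residuals(small_fit)

# A residual is the observed value minus the fitted value.
all.equal(unname(resid_values), first_ten$fertilityRate - unname(fitted_values))

# Prediction for a mortality rate that is not in the data.
new_prediction <- predict(small_fit, newdata = data.frame(mortalityRate = 50))
new_prediction
```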
Because ggplot() expects its data in a data frame, we combine both sets of values into a tibble.
resid_vs_pred <- tibble(Residuals = residuals, Predicted = predicted)
ggplot(resid_vs_pred, aes(x = Predicted, y = Residuals)) +
geom_point(shape = 1, color = "blue") + geom_hline(yintercept = 0,
color = "red") +
labs(title = "Residual Model Scatterplot (1): Mortality vs. Fertility",
x = "Fitted Values", y = "Residual Values") + theme_minimal()
ggplot(resid_vs_pred, aes(y = Residuals)) + geom_boxplot(fill = "maroon") +
labs(title = "Residual Model Boxplot (2): Mortality vs. Fertility",
y = "Residual Values") + theme_minimal()
Residual Model 1 - Scatterplot: We can observe that the points are concentrated at the smaller fitted values, with more points between -1 and 0 on the residual axis (y-axis), and that they grow sparser as the fitted values increase. They are, however, clustered around the red line at 0.
Based on this, we can tentatively conclude that the linear model is not an accurate predictive model for the relationship between the under-five mortality rate and the fertility rate (children per woman). The plot shows an uneven distribution, with a majority of the points lying below the red line at 0 (the median residual is -0.1087). Since a residual is the actual value minus the predicted value, this means that for most regions the model predicts a fertility rate higher than the actual one, offset by a smaller number of large positive residuals (up to 2.7283) where it predicts too low.
Residual Model 2 - Boxplot: The key sign of a good or bad residual boxplot is the placement of the median line and its proximity to zero. In this case, the median line is close to zero, but the spread of the box, coupled with the many outliers beyond the whiskers, suggests that individual predictions can be substantially inaccurate.
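The outliers in the boxplot can be cross-checked against the residual summary printed by summary(mort_fert_line). The sketch below assumes the default rule used by geom_boxplot(), where points more than 1.5 times the interquartile range beyond the box are drawn as outliers, and uses the printed quartiles as an approximation of the hinges.

```r
# Quartiles and extremes of the residuals, copied from the summary() output.
q1 <- -0.3864
q3 <-  0.3061
resid_min <- -1.6469
resid_max <-  2.7283

iqr <- q3 - q1                 # height of the box
lower_fence <- q1 - 1.5 * iqr  # points below this are drawn as outliers
upper_fence <- q3 + 1.5 * iqr  # points above this are drawn as outliers

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
resid_min < lower_fence  # TRUE: at least one low outlier
resid_max > upper_fence  # TRUE: at least one high outlier
```

The largest residual lies much further beyond its fence than the smallest one does, which is consistent with the right-skewed residuals seen in the scatterplot.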
Given the original question, the simple linear regression model was not a good fit based on the distributions in the two residual plots, one a scatterplot and the other a boxplot. Both pointed towards large errors and inaccuracies, as mentioned in the previous subsection. The linear model should therefore not be relied on to predict the number of children per woman from the under-five mortality rate.
Question: How has life expectancy for both sexes changed globally over the years?
For this problem, we will be considering the whole world, which means that we must filter for the global region that accounts for all countries or areas. Below, we also filter for the life expectancy variable that covers both sexes, creating a new filtered tibble.
life_exp_tibble <- population_tibble %>% filter(
Variable == "Life expectancy at birth for both sexes (years)" &
globalRegion == "Total, all countries or areas") %>%
rename(lifeExpectancy = Value) %>% select(-Variable)
life_exp_tibble
## # A tibble: 4 × 3
## globalRegion Year lifeExpectancy
## <chr> <dbl> <dbl>
## 1 Total, all countries or areas 2010 70.1
## 2 Total, all countries or areas 2015 71.6
## 3 Total, all countries or areas 2020 71.9
## 4 Total, all countries or areas 2024 73.3
life_exp_tibble$Year <- as.character(life_exp_tibble$Year)
ggplot(data = life_exp_tibble,
       aes(x = Year, y = lifeExpectancy, fill = lifeExpectancy)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = lifeExpectancy), vjust = -0.3) +
  scale_y_continuous(breaks = seq(0, 70, by = 10)) +
  scale_fill_gradient(name = "Life Expectancy",
                      low = "lightblue", high = "darkblue") +
  labs(title = "Barplot: Global Life Expectancy Over The Years",
       x = "Year", y = "Life Expectancy (Both Sexes)") +
  theme_minimal()
Above, we converted the Year variable from the numeric class to the character class so that each year is treated as a separate category on the x-axis. Then, we used the ggplot() function to plot global life expectancy at birth for each of the years available in the dataset, from 2010 to 2024.
Based on the barplot above, global life expectancy for both sexes increased from 70.1 years in 2010 to 73.3 years in 2024, a gain of 3.2 years over a span of 14 years. Improvements in technology and medical treatment are a plausible explanation, although this dataset alone cannot confirm the cause. This answers the second question.
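The size and pace of the increase can be computed directly from the four values in life_exp_tibble. The sketch below copies those values into plain vectors (the variable names are illustrative) and shows that the gain was not uniform: the average yearly gain was smallest between 2015 and 2020.

```r
# Values copied from life_exp_tibble above.
years    <- c(2010, 2015, 2020, 2024)
life_exp <- c(70.1, 71.6, 71.9, 73.3)

total_gain      <- life_exp[length(life_exp)] - life_exp[1]  # gain over the whole period
gain_per_period <- diff(life_exp)                            # gain between consecutive listed years
gain_per_year   <- gain_per_period / diff(years)             # average yearly gain in each period

total_gain
round(gain_per_year, 2)
```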