The data set contains socio-economic indicators from 2004 to 2015 for several countries, including the United States, Germany, Japan, Canada, the United Kingdom, Australia, India, Brazil, South Africa, and Mexico. This data allows for cross-country comparisons and analysis of socio-economic trends and their impacts.
The objectives of the analysis are to apply appropriate methods for preparing, validating, analyzing, modeling, and predicting data, focusing on understanding the evolution of economic development and identifying patterns or trends in GDP growth across the countries.
# Loading all required packages and data
library(datarium)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(TTR)
library(ggplot2)
library(qqplotr)
##
## Attaching package: 'qqplotr'
##
## The following objects are masked from 'package:ggplot2':
##
## stat_qq_line, StatQqLine
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
## method from
## fortify.SpatialPolygonsDataFrame ggplot2
##
## The 'mosaic' package masks several functions from core packages in order to add
## additional features. The original behavior of these functions should not be affected by this.
##
## Attaching package: 'mosaic'
##
## The following object is masked from 'package:Matrix':
##
## mean
##
## The following objects are masked from 'package:dplyr':
##
## count, do, tally
##
## The following object is masked from 'package:purrr':
##
## cross
##
## The following object is masked from 'package:ggplot2':
##
## stat
##
## The following objects are masked from 'package:stats':
##
## binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
## quantile, sd, t.test, var
##
## The following objects are masked from 'package:base':
##
## max, mean, min, prod, range, sample, sum
library(melt)
##
## Attaching package: 'melt'
##
## The following object is masked from 'package:mosaic':
##
## chisq
library(e1071)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(plm)
##
## Attaching package: 'plm'
##
## The following object is masked from 'package:melt':
##
## nobs
##
## The following object is masked from 'package:mosaic':
##
## r.squared
##
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following objects are masked from 'package:mosaic':
##
## deltaMethod, logit
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
data <- read_csv("data.csv")
## Rows: 132 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Country Name, Country Code, Time Code
## dbl (12): Time, Adolescent fertility rate, Birth rate, crude, Death rate, cr...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data set covers 132 observations and 15 variables from 2004 to 2015 and includes key socio-economic indicators such as adolescent fertility rate, birth rate, and death rate. Initial data checks confirmed no missing values, and columns were renamed for clarity (e.g., “Country Name” to “Country”).
# Preview the first rows of the dataset
head(data)
## # A tibble: 6 × 15
## `Country Name` `Country Code` Time `Time Code` `Adolescent fertility rate`
## <chr> <chr> <dbl> <chr> <dbl>
## 1 United States USA 2004 YR2004 39.7
## 2 United States USA 2005 YR2005 39.0
## 3 United States USA 2006 YR2006 39.8
## 4 United States USA 2007 YR2007 40.4
## 5 United States USA 2008 YR2008 39.6
## 6 United States USA 2009 YR2009 37.6
## # ℹ 10 more variables: `Birth rate, crude` <dbl>, `Death rate, crude` <dbl>,
## # `Employment to population ratio` <dbl>, `Fertility rate, total` <dbl>,
## # `GDP per capita growth` <dbl>, `GDP growth` <dbl>,
## # `Life expectancy at birth` <dbl>, `Mortality rate, infant` <dbl>,
## # `Population ages 65 and above` <dbl>, `Population growth` <dbl>
# EXPLORATORY DATA ANALYSIS
names(data)
## [1] "Country Name" "Country Code"
## [3] "Time" "Time Code"
## [5] "Adolescent fertility rate" "Birth rate, crude"
## [7] "Death rate, crude" "Employment to population ratio"
## [9] "Fertility rate, total" "GDP per capita growth"
## [11] "GDP growth" "Life expectancy at birth"
## [13] "Mortality rate, infant" "Population ages 65 and above"
## [15] "Population growth"
# Check data types
str(data)
## spc_tbl_ [132 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country Name : chr [1:132] "United States" "United States" "United States" "United States" ...
## $ Country Code : chr [1:132] "USA" "USA" "USA" "USA" ...
## $ Time : num [1:132] 2004 2005 2006 2007 2008 ...
## $ Time Code : chr [1:132] "YR2004" "YR2005" "YR2006" "YR2007" ...
## $ Adolescent fertility rate : num [1:132] 39.7 39 39.8 40.4 39.6 ...
## $ Birth rate, crude : num [1:132] 14 14 14.3 14.3 14 13.5 13 12.7 12.6 12.4 ...
## $ Death rate, crude : num [1:132] 8.2 8.3 8.1 8 8.1 ...
## $ Employment to population ratio: num [1:132] 61.2 61.5 61.9 61.8 61 ...
## $ Fertility rate, total : num [1:132] 2.05 2.06 2.11 2.12 2.07 ...
## $ GDP per capita growth : num [1:132] 2.9 2.53 1.8 1.04 -0.82 ...
## $ GDP growth : num [1:132] 3.853 3.483 2.783 2.011 0.122 ...
## $ Life expectancy at birth : num [1:132] 77.5 77.5 77.7 78 78 ...
## $ Mortality rate, infant : num [1:132] 6.8 6.7 6.7 6.6 6.5 6.4 6.2 6.1 6 6 ...
## $ Population ages 65 and above : num [1:132] 12.3 12.4 12.4 12.5 12.7 ...
## $ Population growth : num [1:132] 0.925 0.922 0.964 0.951 0.946 ...
## - attr(*, "spec")=
## .. cols(
## .. `Country Name` = col_character(),
## .. `Country Code` = col_character(),
## .. Time = col_double(),
## .. `Time Code` = col_character(),
## .. `Adolescent fertility rate` = col_double(),
## .. `Birth rate, crude` = col_double(),
## .. `Death rate, crude` = col_double(),
## .. `Employment to population ratio` = col_double(),
## .. `Fertility rate, total` = col_double(),
## .. `GDP per capita growth` = col_double(),
## .. `GDP growth` = col_double(),
## .. `Life expectancy at birth` = col_double(),
## .. `Mortality rate, infant` = col_double(),
## .. `Population ages 65 and above` = col_double(),
## .. `Population growth` = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# Checking for missing values in the entire dataset
missing_values <- sum(is.na(data))
# Display the count of missing values
cat("Total missing values in the dataset:", missing_values, "\n")
## Total missing values in the dataset: 0
# Display the count of missing values for each column
col_missing_values <- colSums(is.na(data))
cat("Missing values by column:\n")
## Missing values by column:
print(col_missing_values)
## Country Name Country Code
## 0 0
## Time Time Code
## 0 0
## Adolescent fertility rate Birth rate, crude
## 0 0
## Death rate, crude Employment to population ratio
## 0 0
## Fertility rate, total GDP per capita growth
## 0 0
## GDP growth Life expectancy at birth
## 0 0
## Mortality rate, infant Population ages 65 and above
## 0 0
## Population growth
## 0
# Rename columns
data <- data %>%
  rename(
    Country = "Country Name",
    Country_Code = "Country Code",
    Year = "Time",
    Year_Code = "Time Code",
    Adol_Fert = "Adolescent fertility rate",
    Birth_Rate = "Birth rate, crude",
    Death_Rate = "Death rate, crude",
    Employedtopopul = "Employment to population ratio",
    Fertility_Rate = "Fertility rate, total",
    GDP_PC = "GDP per capita growth",
    GDP_GROWTH = "GDP growth",
    Life_expt = "Life expectancy at birth",
    Mort_Rate = "Mortality rate, infant",
    Popul65 = "Population ages 65 and above",
    Popul_Growth = "Population growth"
  )
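The missing-value check does not by itself confirm that every country contributes one row per year. With 132 rows over 12 years, a balanced panel would contain 11 countries. The sketch below shows one way such a check could look in base R. The toy data frame and the country labels "A" and "B" are hypothetical and only illustrate the idea.

```r
# Toy panel with hypothetical countries "A" and "B"; "B" is missing 2005.
panel <- data.frame(
  Country = c(rep("A", 3), rep("B", 2)),
  Year    = c(2004, 2005, 2006, 2004, 2006)
)

# Number of rows per country; a balanced panel has the same count everywhere.
rows_per_country <- table(panel$Country)

# Duplicate country-year pairs would also break the panel structure.
has_duplicates <- any(duplicated(panel[, c("Country", "Year")]))

# Countries whose row count differs from the number of distinct years.
n_years <- length(unique(panel$Year))
incomplete <- names(rows_per_country)[as.vector(rows_per_country) != n_years]

print(rows_per_country)
print(incomplete)
```

On the renamed data, the same steps would use `data$Country` and `data$Year` in place of the toy columns.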
An exploratory data analysis (EDA) was conducted to provide an overview of the demographic, health, and economic indicators in the data.
The main findings of the descriptive analysis are summarized below.
Adolescent fertility rates span a wide range (about 4 to 131), indicating large differences in teenage birth rates across countries. The crude birth rate also varies widely (8.0 to 43.2), while the crude death rate has a narrower range (5.0 to 16.3) and is more consistent across countries.
The employment to population ratio, the share of the working-age population that is employed, averages about 57%, with moderate variation (41.6% to 63.4%) that suggests differing labor market conditions. The total fertility rate averages about 2.3 children per woman, with a maximum of about 6.1.
GDP growth varies considerably (from -5.7% to 9.3%), reflecting different economic conditions among the countries. Life expectancy at birth also spans a substantial range (48.8 to 83.8 years), with some countries having much longer average lifespans.
Meanwhile, the infant mortality rate varies widely (2.0 to 97.9), which may reflect differences in healthcare quality and access. Population growth and the share of the population aged 65 and above also differ markedly across countries, which has implications for social and economic policy.
# Descriptive statistics: mean, median, mode, standard deviation, skewness, and kurtosis
# Display summary statistics
summary(data)
## Country Country_Code Year Year_Code
## Length:132 Length:132 Min. :2004 Length:132
## Class :character Class :character 1st Qu.:2007 Class :character
## Mode :character Mode :character Median :2010 Mode :character
## Mean :2010
## 3rd Qu.:2012
## Max. :2015
## Adol_Fert Birth_Rate Death_Rate Employedtopopul
## Min. : 4.027 Min. : 8.00 Min. : 5.001 Min. :41.61
## 1st Qu.: 13.874 1st Qu.:11.28 1st Qu.: 6.600 1st Qu.:55.12
## Median : 32.672 Median :14.00 Median : 8.101 Median :57.96
## Mean : 43.476 Mean :17.16 Mean : 8.743 Mean :56.62
## 3rd Qu.: 70.797 3rd Qu.:21.27 3rd Qu.:10.025 3rd Qu.:59.56
## Max. :131.024 Max. :43.15 Max. :16.271 Max. :63.41
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt
## Min. :1.260 Min. :-6.4499 Min. :-5.694 Min. :48.77
## 1st Qu.:1.665 1st Qu.: 0.8249 1st Qu.: 1.558 1st Qu.:68.32
## Median :1.915 Median : 1.7893 Median : 2.708 Median :78.47
## Mean :2.304 Mean : 1.8357 Mean : 2.929 Mean :73.38
## 3rd Qu.:2.430 3rd Qu.: 3.1453 3rd Qu.: 4.189 3rd Qu.:80.95
## Max. :6.085 Max. : 7.0132 Max. : 9.251 Max. :83.79
## Mort_Rate Popul65 Popul_Growth
## Min. : 2.000 Min. : 3.026 Min. :-1.8537
## 1st Qu.: 4.075 1st Qu.: 5.411 1st Qu.: 0.7364
## Median : 6.300 Median :12.822 Median : 1.0121
## Mean :21.152 Mean :11.548 Mean : 1.0613
## 3rd Qu.:31.225 3rd Qu.:15.918 3rd Qu.: 1.3806
## Max. :97.900 Max. :27.328 Max. : 2.7641
# Mean
# Calculate means for the numeric columns only (character columns have no mean)
numeric_columns <- sapply(data, is.numeric)
mean_values <- sapply(data[, numeric_columns], base::mean, na.rm = TRUE)
print(mean_values)
## Year Adol_Fert Birth_Rate Death_Rate Employedtopopul
## 2009.500000 43.476136 17.164439 8.742803 56.623152
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt Mort_Rate
## 2.304424 1.835657 2.929413 73.383285 21.151515
## Popul65 Popul_Growth
## 11.548117 1.061268
# Median
# Identifying numeric columns in the dataframe
numeric_columns <- sapply(data, is.numeric)
# Calculate median values for numeric columns, handling missing values
median_values <- sapply(data[, numeric_columns], median, na.rm = TRUE)
# Display median values
cat("\nMedian values:\n")
##
## Median values:
print(median_values)
## Year Adol_Fert Birth_Rate Death_Rate Employedtopopul
## 2009.500000 32.671500 14.000000 8.101000 57.961000
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt Mort_Rate
## 1.915000 1.789341 2.707613 78.465854 6.300000
## Popul65 Popul_Growth
## 12.822358 1.012106
# Mode, estimated as the peak of a kernel density estimate
# (density() comes from the stats package, not e1071)
mode_values <- sapply(data[, numeric_columns], function(x) {
  dens <- density(x, na.rm = TRUE)
  dens$x[which.max(dens$y)]
})
cat("\nMode values:\n")
##
## Mode values:
print(mode_values)
## Year Adol_Fert Birth_Rate Death_Rate Employedtopopul
## 2009.5176586 14.8737331 12.1335519 6.9137849 58.0675711
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt Mort_Rate
## 1.8274972 1.5945390 2.5010266 80.2291190 5.2770271
## Popul65 Popul_Growth
## 5.2676630 0.9643459
The standard deviation measures the variability of the data. Adolescent fertility (SD ≈ 35.9) and infant mortality (SD ≈ 25.6) show high variability, suggesting large differences between countries.
Skewness describes the asymmetry of a distribution. The fertility rate is positively skewed (≈ 2.29): most observations have low fertility rates, and a few have very high ones.
Kurtosis describes how heavy the tails are relative to a normal distribution. The values reported here are excess kurtosis (0 for a normal distribution), so the fertility rate's value of about 4.23 points to more extreme values than a normal distribution would produce.
# Standard Deviation
sd_values <- sapply(data[, numeric_columns], sd, na.rm = TRUE)
cat("\nStandard Deviation values:\n")
##
## Standard Deviation values:
print(sd_values)
## Year Adol_Fert Birth_Rate Death_Rate Employedtopopul
## 3.4652033 35.9157688 9.1745917 2.6768795 4.9973621
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt Mort_Rate
## 1.2218254 2.4904089 2.8003458 10.1700999 25.5906396
## Popul65 Popul_Growth
## 6.5628343 0.7711411
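Standard deviations are hard to compare across indicators measured in different units. As an additional illustrative step that is not part of the original analysis, the sketch below computes the coefficient of variation (standard deviation divided by the mean) for three indicators, using the values printed above. Growth rates are left out because a mean close to zero makes the ratio unstable.

```r
# Standard deviations and means copied from the printed output above.
sd_vals <- c(Adol_Fert = 35.9157688, Mort_Rate = 25.5906396, Life_expt = 10.1700999)
mean_vals <- c(Adol_Fert = 43.476136, Mort_Rate = 21.151515, Life_expt = 73.383285)

# Coefficient of variation: unit-free measure of relative spread.
cv <- sd_vals / abs(mean_vals)

# Largest relative spread first.
print(round(sort(cv, decreasing = TRUE), 2))
```

Read this way, infant mortality is the most dispersed of the three relative to its mean, and life expectancy the least.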
# Skewness (using skewness function from e1071 package)
skewness_values <- sapply(data[, numeric_columns], skewness, na.rm = TRUE)
cat("\nSkewness values:\n")
##
## Skewness values:
print(skewness_values)
## Year Adol_Fert Birth_Rate Death_Rate Employedtopopul
## 0.0000000 0.9027195 1.6076668 0.9984417 -1.1382690
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt Mort_Rate
## 2.2878265 -0.6236192 -0.4160655 -1.1599462 1.5572305
## Popul65 Popul_Growth
## 0.3927286 0.0481546
# Kurtosis (using kurtosis function from e1071 package)
kurtosis_values <- sapply(data[, numeric_columns], kurtosis, na.rm = TRUE)
cat("\nKurtosis values:\n")
##
## Kurtosis values:
print(kurtosis_values)
## Year Adol_Fert Birth_Rate Death_Rate Employedtopopul
## -1.24369931 -0.11490047 2.03703385 0.30365856 0.48732298
## Fertility_Rate GDP_PC GDP_GROWTH Life_expt Mort_Rate
## 4.22768062 1.55223707 1.17574043 0.06251764 1.36646007
## Popul65 Popul_Growth
## -0.99317987 1.22640634
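To make the skewness and kurtosis figures more concrete, the sketch below computes both from central moments in base R. It assumes the convention m3 / s^3 for skewness and m4 / s^4 - 3 for excess kurtosis, with s the sample standard deviation, which to my understanding corresponds to the default `type = 3` in e1071. The helper names `skew_b1` and `kurt_b2` and the toy vector are hypothetical.

```r
# Skewness as the third central moment divided by the cubed sample SD.
skew_b1 <- function(x) {
  x <- x[!is.na(x)]
  m3 <- mean((x - mean(x))^3)
  m3 / sd(x)^3
}

# Excess kurtosis: fourth central moment over SD^4, minus 3 (normal = 0).
kurt_b2 <- function(x) {
  x <- x[!is.na(x)]
  m4 <- mean((x - mean(x))^4)
  m4 / sd(x)^4 - 3
}

# Hypothetical fertility-like values: mostly low, one very high.
x <- c(1.3, 1.5, 1.7, 1.9, 2.1, 6.0)

print(skew_b1(x))  # positive: long right tail
print(kurt_b2(x))
```

A single large value is enough to pull the skewness well above zero, which is the pattern the fertility rate shows in the output above.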
We looked into several key economic indicators to understand the economic conditions and trends over time. Here’s a summary of our findings:
The variables GDP_PC and GDP_GROWTH hold annual growth rates (in %) of GDP per capita and of total GDP, not levels. Most observations of GDP per capita growth lie close to the median (about 1.8%), but a few are strongly negative or strongly positive (from -6.4% to 7.0%). Year-over-year changes in both growth rates were also computed for each country, and the time series show varying trends across countries. See the figures below:
# Extracting relevant columns
economic_indicators <- data[, c("GDP_PC", "GDP_GROWTH", "Employedtopopul",
"Fertility_Rate", "Life_expt", "Popul_Growth")]
# Check for missing values
sum(is.na(economic_indicators))
## [1] 0
# Descriptive statistics
summary(economic_indicators)
## GDP_PC GDP_GROWTH Employedtopopul Fertility_Rate
## Min. :-6.4499 Min. :-5.694 Min. :41.61 Min. :1.260
## 1st Qu.: 0.8249 1st Qu.: 1.558 1st Qu.:55.12 1st Qu.:1.665
## Median : 1.7893 Median : 2.708 Median :57.96 Median :1.915
## Mean : 1.8357 Mean : 2.929 Mean :56.62 Mean :2.304
## 3rd Qu.: 3.1453 3rd Qu.: 4.189 3rd Qu.:59.56 3rd Qu.:2.430
## Max. : 7.0132 Max. : 9.251 Max. :63.41 Max. :6.085
## Life_expt Popul_Growth
## Min. :48.77 Min. :-1.8537
## 1st Qu.:68.32 1st Qu.: 0.7364
## Median :78.47 Median : 1.0121
## Mean :73.38 Mean : 1.0613
## 3rd Qu.:80.95 3rd Qu.: 1.3806
## Max. :83.79 Max. : 2.7641
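# Year-over-year relative change (%) in the two growth-rate series, per country.
# Note: GDP_PC and GDP_GROWTH are already growth rates, so these ratios are
# unstable when the previous year's value is close to zero.
# dplyr::lag() is called explicitly because plm masks lag() (see the startup messages).
data <- data %>%
  arrange(Country, Year) %>%
  group_by(Country) %>%
  mutate(GDP_PC_Growth = (GDP_PC / dplyr::lag(GDP_PC) - 1) * 100,
         GDP_Growth_Rate = (GDP_GROWTH / dplyr::lag(GDP_GROWTH) - 1) * 100) %>%
  ungroup()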
# Plotting time series for GDP per capita growth
ggplot(data, aes(x = Year, y = GDP_PC, color = Country)) +
  geom_line() +
  labs(title = "GDP per Capita Growth Over Time", x = "Year", y = "GDP per Capita Growth (%)") +
  theme_minimal()
# Plotting time series for GDP growth
ggplot(data, aes(x = Year, y = GDP_GROWTH, color = Country)) +
  geom_line() +
  labs(title = "GDP Growth Over Time", x = "Year", y = "GDP Growth (%)") +
  theme_minimal()
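Because the two series are already growth rates, a change in percentage points is often easier to read than a ratio of rates. The base R sketch below shows one way to compute it, using the rounded U.S. GDP growth values for 2004 to 2008 from the `str()` output. The helper `prev_value` is hypothetical. The last lines illustrate why the choice of `lag()` matters once plm masks the dplyr version: `stats::lag()` shifts only the time index of a series and leaves the values where they are.

```r
# U.S. GDP growth for 2004-2008 (rounded), copied from the str() output.
gdp_growth <- c(3.853, 3.483, 2.783, 2.011, 0.122)

# Previous-year value: shift the series down by one and pad with NA.
prev_value <- function(x) c(NA, head(x, -1))

# Change in percentage points relative to the previous year.
yoy_change <- gdp_growth - prev_value(gdp_growth)

# stats::lag() only shifts the time index of a series; the values stay put.
unshifted <- as.numeric(stats::lag(gdp_growth))

print(yoy_change)
print(identical(unshifted, gdp_growth))
```

In a grouped pipeline the same shift is what `dplyr::lag()` provides, so naming the package explicitly avoids depending on the order in which packages were attached.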
To examine how the economic indicators relate to each other, we computed pairwise correlations. GDP per capita growth and GDP growth are very strongly correlated (r ≈ 0.96), which is expected because the two differ mainly by population growth. Fertility rate and population growth are also strongly positively correlated (r ≈ 0.81).
In the same vein, life expectancy and fertility rate are strongly negatively correlated (r ≈ -0.85): countries with higher life expectancy tend to have lower fertility rates. Lastly, higher life expectancy is associated with lower population growth (r ≈ -0.68).
# Calculating correlation matrix
correlation_matrix <- cor(economic_indicators, use = "complete.obs")
# Print correlation matrix
print(correlation_matrix)
## GDP_PC GDP_GROWTH Employedtopopul Fertility_Rate Life_expt
## GDP_PC 1.0000000 0.9602481 -0.23484760 0.3092626 -0.3792636
## GDP_GROWTH 0.9602481 1.0000000 -0.22362501 0.5113965 -0.5373129
## Employedtopopul -0.2348476 -0.2236250 1.00000000 -0.1621408 0.5444490
## Fertility_Rate 0.3092626 0.5113965 -0.16214078 1.0000000 -0.8533639
## Life_expt -0.3792636 -0.5373129 0.54444898 -0.8533639 1.0000000
## Popul_Growth 0.2175076 0.4812365 -0.04360777 0.8105208 -0.6838790
## Popul_Growth
## GDP_PC 0.21750765
## GDP_GROWTH 0.48123647
## Employedtopopul -0.04360777
## Fertility_Rate 0.81052080
## Life_expt -0.68387904
## Popul_Growth 1.00000000
# Visualizing the correlation matrix using a heatmap
# (reshape2::melt() is named explicitly to avoid confusion with the separately attached melt package)
ggplot(data = reshape2::melt(correlation_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0,
                       limits = c(-1, 1), space = "Lab", name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
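The heatmap needs the correlation matrix in long form, with one row per pair of variables. As a minimal sketch of that reshaping step, the base R code below converts a 3 × 3 excerpt of the matrix printed above into long form and picks out the strongest off-diagonal pair. It uses `as.data.frame(as.table(...))` as a stand-in for the melt step and is not the code used for the figure.

```r
# 3 x 3 excerpt of the correlation matrix printed above.
vars <- c("GDP_GROWTH", "Fertility_Rate", "Life_expt")
corr <- matrix(
  c( 1.0000000,  0.5113965, -0.5373129,
     0.5113965,  1.0000000, -0.8533639,
    -0.5373129, -0.8533639,  1.0000000),
  nrow = 3, dimnames = list(vars, vars)
)

# Long form: one row per (Var1, Var2) pair with its correlation in `value`.
long <- as.data.frame(as.table(corr), responseName = "value")

# Keep each unordered pair once and drop the diagonal.
off_diag <- long[as.character(long$Var1) < as.character(long$Var2), ]

# Pair with the largest absolute correlation.
strongest <- off_diag[which.max(abs(off_diag$value)), ]

print(long)
print(strongest)
```

The long table has the same `Var1`, `Var2`, and `value` columns that the heatmap code maps to the x axis, y axis, and fill.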
The employment to population ratio is the proportion of the working-age population that is employed. Countries show different employment trends over time. See the figure below:
ggplot(data, aes(x = Year, y = Employedtopopul, color = Country)) +
geom_line() +
labs(title = "Employment to Population Ratio Over Time", x = "Year", y = "Employment to Population Ratio") +
theme_minimal()
Life expectancy trends over the years differ across countries, reflecting improvements or challenges in healthcare and living standards. This can be seen in the figure below:
ggplot(data, aes(x = Year, y = Life_expt, color = Country)) +
geom_line() +
labs(title = "Life Expectancy Over Time", x = "Year", y = "Life Expectancy at Birth") +
theme_minimal()
This analysis provided insights into the socio-economic development trends across various countries from 2004 to 2015. Key findings include:
Significant variations in adolescent fertility rates, birth rates, and infant mortality rates among countries, indicating differences in health and social conditions.
A general trend of economic growth, although the rates vary considerably between countries.
Strong correlations between indicators: GDP per capita growth moves closely with GDP growth, and lower fertility rates are linked with higher life expectancy.
Employment trends and life expectancy show differing patterns over time, reflecting the unique socio-economic conditions of each country.