Introduction

The data set contains socio-economic indicators from 2004 to 2015 for several countries, including the United States, Germany, Japan, Canada, the United Kingdom, Australia, India, Brazil, South Africa, and Mexico. These data allow cross-country comparison and analysis of socio-economic trends and their impacts.

The objective of the analysis is to apply appropriate methods for preparing, validating, analyzing, modeling, and predicting the data, with a focus on understanding how economic development has evolved and identifying patterns or trends in GDP growth across the countries.

# Loading all required packages and data

library(datarium)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.92 loaded
library(TTR)
library(ggplot2)
library(qqplotr)
## 
## Attaching package: 'qqplotr'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine
library(forecast)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(mosaic)
## Registered S3 method overwritten by 'mosaic':
##   method                           from   
##   fortify.SpatialPolygonsDataFrame ggplot2
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Attaching package: 'mosaic'
## 
## The following object is masked from 'package:Matrix':
## 
##     mean
## 
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## 
## The following object is masked from 'package:purrr':
## 
##     cross
## 
## The following object is masked from 'package:ggplot2':
## 
##     stat
## 
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
##     quantile, sd, t.test, var
## 
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
library(melt)
## 
## Attaching package: 'melt'
## 
## The following object is masked from 'package:mosaic':
## 
##     chisq
library(e1071)
library(reshape2)
## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plm)
## 
## Attaching package: 'plm'
## 
## The following object is masked from 'package:melt':
## 
##     nobs
## 
## The following object is masked from 'package:mosaic':
## 
##     r.squared
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following objects are masked from 'package:mosaic':
## 
##     deltaMethod, logit
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
data <- read_csv("data.csv")
## Rows: 132 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): Country Name, Country Code, Time Code
## dbl (12): Time, Adolescent fertility rate, Birth rate, crude, Death rate, cr...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Preparation and Analysis

The data set covers 132 observations and 15 variables from 2004 to 2015 and includes key socio-economic indicators such as adolescent fertility rate, birth rate, and death rate. Initial checks confirmed that there are no missing values, and columns were renamed for clarity (e.g., “Country Name” to “Country”).

# Preview the first rows of the dataset
head(data)
## # A tibble: 6 × 15
##   `Country Name` `Country Code`  Time `Time Code` `Adolescent fertility rate`
##   <chr>          <chr>          <dbl> <chr>                             <dbl>
## 1 United States  USA             2004 YR2004                             39.7
## 2 United States  USA             2005 YR2005                             39.0
## 3 United States  USA             2006 YR2006                             39.8
## 4 United States  USA             2007 YR2007                             40.4
## 5 United States  USA             2008 YR2008                             39.6
## 6 United States  USA             2009 YR2009                             37.6
## # ℹ 10 more variables: `Birth rate, crude` <dbl>, `Death rate, crude` <dbl>,
## #   `Employment to population ratio` <dbl>, `Fertility rate, total` <dbl>,
## #   `GDP per capita growth` <dbl>, `GDP growth` <dbl>,
## #   `Life expectancy at birth` <dbl>, `Mortality rate, infant` <dbl>,
## #   `Population ages 65 and above` <dbl>, `Population growth` <dbl>
# EXPLORATORY DATA ANALYSIS

names(data)
##  [1] "Country Name"                   "Country Code"                  
##  [3] "Time"                           "Time Code"                     
##  [5] "Adolescent fertility rate"      "Birth rate, crude"             
##  [7] "Death rate, crude"              "Employment to population ratio"
##  [9] "Fertility rate, total"          "GDP per capita growth"         
## [11] "GDP growth"                     "Life expectancy at birth"      
## [13] "Mortality rate, infant"         "Population ages 65 and above"  
## [15] "Population growth"
# Check data types

str(data)
## spc_tbl_ [132 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country Name                  : chr [1:132] "United States" "United States" "United States" "United States" ...
##  $ Country Code                  : chr [1:132] "USA" "USA" "USA" "USA" ...
##  $ Time                          : num [1:132] 2004 2005 2006 2007 2008 ...
##  $ Time Code                     : chr [1:132] "YR2004" "YR2005" "YR2006" "YR2007" ...
##  $ Adolescent fertility rate     : num [1:132] 39.7 39 39.8 40.4 39.6 ...
##  $ Birth rate, crude             : num [1:132] 14 14 14.3 14.3 14 13.5 13 12.7 12.6 12.4 ...
##  $ Death rate, crude             : num [1:132] 8.2 8.3 8.1 8 8.1 ...
##  $ Employment to population ratio: num [1:132] 61.2 61.5 61.9 61.8 61 ...
##  $ Fertility rate, total         : num [1:132] 2.05 2.06 2.11 2.12 2.07 ...
##  $ GDP per capita growth         : num [1:132] 2.9 2.53 1.8 1.04 -0.82 ...
##  $ GDP growth                    : num [1:132] 3.853 3.483 2.783 2.011 0.122 ...
##  $ Life expectancy at birth      : num [1:132] 77.5 77.5 77.7 78 78 ...
##  $ Mortality rate, infant        : num [1:132] 6.8 6.7 6.7 6.6 6.5 6.4 6.2 6.1 6 6 ...
##  $ Population ages 65 and above  : num [1:132] 12.3 12.4 12.4 12.5 12.7 ...
##  $ Population growth             : num [1:132] 0.925 0.922 0.964 0.951 0.946 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `Country Name` = col_character(),
##   ..   `Country Code` = col_character(),
##   ..   Time = col_double(),
##   ..   `Time Code` = col_character(),
##   ..   `Adolescent fertility rate` = col_double(),
##   ..   `Birth rate, crude` = col_double(),
##   ..   `Death rate, crude` = col_double(),
##   ..   `Employment to population ratio` = col_double(),
##   ..   `Fertility rate, total` = col_double(),
##   ..   `GDP per capita growth` = col_double(),
##   ..   `GDP growth` = col_double(),
##   ..   `Life expectancy at birth` = col_double(),
##   ..   `Mortality rate, infant` = col_double(),
##   ..   `Population ages 65 and above` = col_double(),
##   ..   `Population growth` = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Checking for missing values in the entire dataset
missing_values <- sum(is.na(data))

# Display the count of missing values
cat("Total missing values in the dataset:", missing_values, "\n")
## Total missing values in the dataset: 0
# Display the count of missing values for each column
col_missing_values <- colSums(is.na(data))
cat("Missing values by column:\n")
## Missing values by column:
print(col_missing_values)
##                   Country Name                   Country Code 
##                              0                              0 
##                           Time                      Time Code 
##                              0                              0 
##      Adolescent fertility rate              Birth rate, crude 
##                              0                              0 
##              Death rate, crude Employment to population ratio 
##                              0                              0 
##          Fertility rate, total          GDP per capita growth 
##                              0                              0 
##                     GDP growth       Life expectancy at birth 
##                              0                              0 
##         Mortality rate, infant   Population ages 65 and above 
##                              0                              0 
##              Population growth 
##                              0
# Rename columns
data <- data %>%
  rename(
    Country = "Country Name",
    Country_Code = "Country Code",
    Year = "Time",
    Year_Code = "Time Code",
    Adol_Fert = "Adolescent fertility rate",
    Birth_Rate = "Birth rate, crude",
    Death_Rate = "Death rate, crude",
    Employedtopopul = "Employment to population ratio",
    Fertility_Rate = "Fertility rate, total",
    GDP_PC = "GDP per capita growth",
    GDP_GROWTH = "GDP growth",
    Life_expt = "Life expectancy at birth",
    Mort_Rate = "Mortality rate, infant",
    Popul65 = "Population ages 65 and above",
    Popul_Growth = "Population growth"
  )
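
Beyond missing values, the panel structure itself can be validated. 132 rows over the 12 years from 2004 to 2015 would correspond to 11 countries if every country has exactly one row per year. A minimal sketch of such a check is shown below; the `toy` data frame and its values are hypothetical and are not taken from data.csv.

```r
# Hypothetical toy panel: 2 countries x 3 years (values are made up)
toy <- data.frame(
  Country    = rep(c("A", "B"), each = 3),
  Year       = rep(2004:2006, times = 2),
  GDP_GROWTH = c(3.9, 3.5, 2.8, 1.2, 0.7, 3.8)
)

# 1. Count duplicated country-year pairs (should be 0)
n_duplicates <- sum(duplicated(toy[, c("Country", "Year")]))

# 2. Check that every country has the same number of rows (balanced panel)
rows_per_country <- table(toy$Country)
is_balanced <- length(unique(as.vector(rows_per_country))) == 1

cat("Duplicated country-year rows:", n_duplicates, "\n")
cat("Balanced panel:", is_balanced, "\n")
```

The same two checks could be run on the renamed `data` object by replacing `toy` with `data`.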

Exploratory Analysis and Insight

An exploratory data analysis (EDA) was conducted to give an overview of the demographic, health, and economic indicators in the data.

The main findings of the descriptive analysis are summarized below.

Adolescent fertility rates span a wide range (about 4 to 131), indicating large differences in teenage birth rates across countries. The birth rate also varies widely (8.0 to 43.2), whereas the death rate has a narrower range (5.0 to 16.3) and is more consistent across countries.

The employment to population ratio, which shows how many people are employed relative to the total population, averages about 57% and varies moderately (41.6% to 63.4%), suggesting differing labor market conditions. The total fertility rate averages around 2.3 children per woman, with some observations as high as 6.1.

GDP growth varies considerably (from -5.7 to 9.3), reflecting different economic conditions among the countries. Life expectancy also spans a substantial range (48.8 to 83.8 years), with some countries having much longer average lifespans.

Infant mortality varies widely (2.0 to 97.9), suggesting differences in healthcare quality and access. Population growth and the share of the population aged 65 and above (3.0 to 27.3) also differ markedly, which has implications for social and economic policy.

# Perform a descriptive statistical analysis of the data: mean, median, mode, standard deviation, skewness, and kurtosis.
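
Because each country contributes one row per year, pooled statistics mix differences between countries with changes over time. One way to separate the two is to summarise each country on its own and compare the within-country spread with the pooled spread. The sketch below does this in base R on a hypothetical data frame (`toy` and its values are made up for illustration, not taken from the data set).

```r
# Hypothetical toy data: infant mortality for 2 countries over 3 years
toy <- data.frame(
  Country   = rep(c("A", "B"), each = 3),
  Mort_Rate = c(6.8, 6.7, 6.6, 50.0, 48.0, 46.0)
)

# Mean per country (rows are returned sorted by Country)
country_means <- aggregate(Mort_Rate ~ Country, data = toy, FUN = mean)

# Spread within each country versus spread of the pooled data
within_sd  <- aggregate(Mort_Rate ~ Country, data = toy, FUN = sd)
overall_sd <- sd(toy$Mort_Rate)

print(country_means)
print(within_sd)
cat("Pooled standard deviation:", overall_sd, "\n")
```

When the within-country standard deviations are much smaller than the pooled one, as in this toy example, most of the variation comes from differences between countries and not from changes over time.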

# Display summary statistics
summary(data)
##    Country          Country_Code            Year       Year_Code        
##  Length:132         Length:132         Min.   :2004   Length:132        
##  Class :character   Class :character   1st Qu.:2007   Class :character  
##  Mode  :character   Mode  :character   Median :2010   Mode  :character  
##                                        Mean   :2010                     
##                                        3rd Qu.:2012                     
##                                        Max.   :2015                     
##    Adol_Fert         Birth_Rate      Death_Rate     Employedtopopul
##  Min.   :  4.027   Min.   : 8.00   Min.   : 5.001   Min.   :41.61  
##  1st Qu.: 13.874   1st Qu.:11.28   1st Qu.: 6.600   1st Qu.:55.12  
##  Median : 32.672   Median :14.00   Median : 8.101   Median :57.96  
##  Mean   : 43.476   Mean   :17.16   Mean   : 8.743   Mean   :56.62  
##  3rd Qu.: 70.797   3rd Qu.:21.27   3rd Qu.:10.025   3rd Qu.:59.56  
##  Max.   :131.024   Max.   :43.15   Max.   :16.271   Max.   :63.41  
##  Fertility_Rate      GDP_PC          GDP_GROWTH       Life_expt    
##  Min.   :1.260   Min.   :-6.4499   Min.   :-5.694   Min.   :48.77  
##  1st Qu.:1.665   1st Qu.: 0.8249   1st Qu.: 1.558   1st Qu.:68.32  
##  Median :1.915   Median : 1.7893   Median : 2.708   Median :78.47  
##  Mean   :2.304   Mean   : 1.8357   Mean   : 2.929   Mean   :73.38  
##  3rd Qu.:2.430   3rd Qu.: 3.1453   3rd Qu.: 4.189   3rd Qu.:80.95  
##  Max.   :6.085   Max.   : 7.0132   Max.   : 9.251   Max.   :83.79  
##    Mort_Rate         Popul65        Popul_Growth    
##  Min.   : 2.000   Min.   : 3.026   Min.   :-1.8537  
##  1st Qu.: 4.075   1st Qu.: 5.411   1st Qu.: 0.7364  
##  Median : 6.300   Median :12.822   Median : 1.0121  
##  Mean   :21.152   Mean   :11.548   Mean   : 1.0613  
##  3rd Qu.:31.225   3rd Qu.:15.918   3rd Qu.: 1.3806  
##  Max.   :97.900   Max.   :27.328   Max.   : 2.7641
# Mean
# Calculate means for the numeric columns only, handling missing values
numeric_columns <- sapply(data, is.numeric)
mean_values <- sapply(data[, numeric_columns], mean, na.rm = TRUE)
print(mean_values)
##            Year       Adol_Fert      Birth_Rate      Death_Rate Employedtopopul 
##     2009.500000       43.476136       17.164439        8.742803       56.623152 
##  Fertility_Rate          GDP_PC      GDP_GROWTH       Life_expt       Mort_Rate 
##        2.304424        1.835657        2.929413       73.383285       21.151515 
##         Popul65    Popul_Growth 
##       11.548117        1.061268
# Median
# Identifying numeric columns in the dataframe
numeric_columns <- sapply(data, is.numeric)

# Calculate median values for numeric columns, handling missing values
median_values <- sapply(data[, numeric_columns], median, na.rm = TRUE)

# Display median values
cat("\nMedian values:\n")
## 
## Median values:
print(median_values)
##            Year       Adol_Fert      Birth_Rate      Death_Rate Employedtopopul 
##     2009.500000       32.671500       14.000000        8.101000       57.961000 
##  Fertility_Rate          GDP_PC      GDP_GROWTH       Life_expt       Mort_Rate 
##        1.915000        1.789341        2.707613       78.465854        6.300000 
##         Popul65    Popul_Growth 
##       12.822358        1.012106
# Mode (estimated as the peak of a kernel density estimate, using density() from the stats package)
mode_values <- sapply(data[, numeric_columns], function(x) {
  dens <- density(x, na.rm = TRUE)
  dens$x[which.max(dens$y)]
})
cat("\nMode values:\n")
## 
## Mode values:
print(mode_values)
##            Year       Adol_Fert      Birth_Rate      Death_Rate Employedtopopul 
##    2009.5176586      14.8737331      12.1335519       6.9137849      58.0675711 
##  Fertility_Rate          GDP_PC      GDP_GROWTH       Life_expt       Mort_Rate 
##       1.8274972       1.5945390       2.5010266      80.2291190       5.2770271 
##         Popul65    Popul_Growth 
##       5.2676630       0.9643459
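
The mode values above are the locations of the highest point of a kernel density estimate, not the most frequently observed values. For continuous indicators this distinction matters, because every observed value may occur only once. A small sketch with made-up numbers (the vector `x` is hypothetical) illustrates the difference:

```r
# Hypothetical continuous sample in which every value is unique
x <- c(1.1, 1.9, 2.0, 2.1, 2.2, 7.5)

# Frequency-based mode is uninformative: every value occurs exactly once
max_frequency <- max(table(x))

# Density-based mode: location of the peak of the kernel density estimate
dens <- density(x)
kde_mode <- dens$x[which.max(dens$y)]

cat("Highest observed frequency:", max_frequency, "\n")
cat("Density-based mode:", kde_mode, "\n")
```

The density-based estimate depends on the kernel bandwidth (here the `density()` default), so it should be read as an approximate location of the most typical values.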

Statistical Observations

The standard deviation measures the variability of the data. Adolescent fertility (standard deviation of about 35.9) and infant mortality (about 25.6) show high variability, suggesting large differences between countries.

Skewness describes the asymmetry of a distribution. The fertility rate is strongly positively skewed (2.29), meaning most countries have lower fertility rates, with a few having very high rates.

Kurtosis describes the weight of the tails of a distribution and so hints at the presence of extreme values. The values reported here are excess kurtosis, for which a normal distribution scores 0, so the fertility rate's value of 4.23 indicates more extreme values than a normal distribution would produce.
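
To make these two measures concrete, the sketch below computes skewness and excess kurtosis directly from central moments in base R. The helper names `moment_skewness` and `moment_excess_kurtosis` and the sample `x` are hypothetical. These are the plain moment-based definitions; the e1071 functions apply a small-sample adjustment by default, so their results can differ slightly.

```r
# Moment-based skewness: third central moment scaled by m2^(3/2)
moment_skewness <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^(3 / 2)
}

# Moment-based excess kurtosis: fourth central moment scaled by m2^2, minus 3
moment_excess_kurtosis <- function(x) {
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m4 <- mean(d^4)
  m4 / m2^2 - 3
}

# Hypothetical right-skewed sample with one large value (made-up numbers)
x <- c(1.3, 1.5, 1.7, 1.8, 1.9, 2.0, 2.4, 6.1)

cat("Skewness:", moment_skewness(x), "\n")
cat("Excess kurtosis:", moment_excess_kurtosis(x), "\n")
```

A single large value pulls the third moment upward, which is why a few very high fertility rates are enough to produce a clearly positive skewness.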

# Standard Deviation
sd_values <- sapply(data[, numeric_columns], sd, na.rm = TRUE)
cat("\nStandard Deviation values:\n")
## 
## Standard Deviation values:
print(sd_values)
##            Year       Adol_Fert      Birth_Rate      Death_Rate Employedtopopul 
##       3.4652033      35.9157688       9.1745917       2.6768795       4.9973621 
##  Fertility_Rate          GDP_PC      GDP_GROWTH       Life_expt       Mort_Rate 
##       1.2218254       2.4904089       2.8003458      10.1700999      25.5906396 
##         Popul65    Popul_Growth 
##       6.5628343       0.7711411
# Skewness (using skewness function from e1071 package)
skewness_values <- sapply(data[, numeric_columns], skewness, na.rm = TRUE)
cat("\nSkewness values:\n")
## 
## Skewness values:
print(skewness_values)
##            Year       Adol_Fert      Birth_Rate      Death_Rate Employedtopopul 
##       0.0000000       0.9027195       1.6076668       0.9984417      -1.1382690 
##  Fertility_Rate          GDP_PC      GDP_GROWTH       Life_expt       Mort_Rate 
##       2.2878265      -0.6236192      -0.4160655      -1.1599462       1.5572305 
##         Popul65    Popul_Growth 
##       0.3927286       0.0481546
# Kurtosis (using kurtosis function from e1071 package)
kurtosis_values <- sapply(data[, numeric_columns], kurtosis, na.rm = TRUE)
cat("\nKurtosis values:\n")
## 
## Kurtosis values:
print(kurtosis_values)
##            Year       Adol_Fert      Birth_Rate      Death_Rate Employedtopopul 
##     -1.24369931     -0.11490047      2.03703385      0.30365856      0.48732298 
##  Fertility_Rate          GDP_PC      GDP_GROWTH       Life_expt       Mort_Rate 
##      4.22768062      1.55223707      1.17574043      0.06251764      1.36646007 
##         Popul65    Popul_Growth 
##     -0.99317987      1.22640634

Conclusion

This analysis provided insights into the socio-economic development trends across various countries from 2004 to 2015. Key findings include: