The here, we will use the world_development_indicators.csv dataset.
We start by installing the relevant packages and loading the libraries as below
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(readr)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
library(tidyr)
library(naniar)
#1 Load and preview the data
We will read the data into the wd_dev_indicators data frame. The preview of the data shows that the data set has 59 columns/variables. We will start by selecting only the variable of interest.
wd_dev_indicators <- read_csv("https://raw.githubusercontent.com/hawa1983/Project-2/main/world_development_indicators_1.csv")
## Rows: 165 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Series Name, country, 1990 [YR1990], 2000 [YR2000], 2013 [YR2013],...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wd_dev_indicators <- wd_dev_indicators %>%
rename(
"1990" = `1990 [YR1990]`,
"2000" = `2000 [YR2000]`,
"2013" = `2013 [YR2013]`,
"2014" = `2014 [YR2014]`,
"2015" = `2015 [YR2015]`,
"2016" = `2016 [YR2016]`,
"2017" = `2017 [YR2017]`,
"2018" = `2018 [YR2018]`,
"2019" = `2019 [YR2019]`,
"2020" = `2020 [YR2020]`,
"2021" = `2021 [YR2021]`,
"2022" = `2022 [YR2022]`
)
glimpse(wd_dev_indicators)
## Rows: 165
## Columns: 14
## $ `Series Name` <chr> "Population", "Population", "Population", "Population gr…
## $ country <chr> "Canada", "Mexico", "United States", "Canada", "Mexico",…
## $ `1990` <chr> "27691138", "81720428", "249623000", "1.507660673", "1.8…
## $ `2000` <chr> "30685730", "97873442", "282162411", "0.931281553", "1.5…
## $ `2013` <chr> "35082954", "117290686", "316059947", "1.056591259", "1.…
## $ `2014` <chr> "35437435", "118755887", "318386329", "1.005337579", "1.…
## $ `2015` <chr> "35702908", "120149897", "320738994", "0.746339478", "1.…
## $ `2016` <chr> "36109487", "121519221", "323071755", "1.132348655", "1.…
## $ `2017` <chr> "36545236", "122839258", "325122128", "1.199520711", "1.…
## $ `2018` <chr> "37065084", "124013861", "326838199", "1.412455869", "0.…
## $ `2019` <chr> "37601230", "125085311", "328329953", "1.436136823", "0.…
## $ `2020` <chr> "38007166", "125998302", "331511512", "1.073795827", "0.…
## $ `2021` <chr> "38226498", "126705138", "332031554", "0.575421916", "0.…
## $ `2022` <chr> "38929902", "127504125", "333287557", "1.823370422", "0.…
#2 Convert the Year columns to numeric type
# Define a list of the columns you want to convert to numeric (in this case, columns from '1990' to '2022')
columns_to_convert <- c('1990', '2000', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022')
# Use the mutate function to convert the specified columns to numeric
wd_dev_indicators <- wd_dev_indicators %>%
mutate(across(all_of(columns_to_convert), as.numeric))
## Warning: There were 12 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(all_of(columns_to_convert), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 11 remaining warnings.
#3 Select the relevant data for our analysis
In this analysis, we will be looking at economic and migration data. The countries of interest are United States, Canada, and Mexico. The analysis will also cover from 2013 to 2021
us_mex_can_dev_indicators <- wd_dev_indicators |>
filter(country %in% c("Canada", "Mexico", "United States"))
us_mex_can_dev_indicators <- us_mex_can_dev_indicators %>%
select(-c("1990", "2000", "2022"))
us_mex_can_dev_indicators
## # A tibble: 165 × 11
## `Series Name` country `2013` `2014` `2015` `2016` `2017` `2018` `2019`
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Population Canada 3.51e+7 3.54e+7 3.57e+7 3.61e+7 3.65e+7 3.71e+7 3.76e+7
## 2 Population Mexico 1.17e+8 1.19e+8 1.20e+8 1.22e+8 1.23e+8 1.24e+8 1.25e+8
## 3 Population United… 3.16e+8 3.18e+8 3.21e+8 3.23e+8 3.25e+8 3.27e+8 3.28e+8
## 4 Population g… Canada 1.06e+0 1.01e+0 7.46e-1 1.13e+0 1.20e+0 1.41e+0 1.44e+0
## 5 Population g… Mexico 1.32e+0 1.24e+0 1.17e+0 1.13e+0 1.08e+0 9.52e-1 8.60e-1
## 6 Population g… United… 6.93e-1 7.33e-1 7.36e-1 7.25e-1 6.33e-1 5.26e-1 4.55e-1
## 7 Surface area… Canada 9.88e+6 9.88e+6 9.88e+6 9.88e+6 9.88e+6 9.88e+6 9.88e+6
## 8 Surface area… Mexico 1.96e+6 1.96e+6 1.96e+6 1.96e+6 1.96e+6 1.96e+6 1.96e+6
## 9 Surface area… United… 9.83e+6 9.83e+6 9.83e+6 9.83e+6 9.83e+6 9.83e+6 9.83e+6
## 10 Population d… Canada 3.91e+0 3.95e+0 3.98e+0 4.03e+0 4.08e+0 4.13e+0 4.19e+0
## # ℹ 155 more rows
## # ℹ 2 more variables: `2020` <dbl>, `2021` <dbl>
#4 Reshape the data Reshape the data from wide to long format using pivot_longer. Impute missing values. Then, reshape the data again to wide for analysis.
us_mex_can_dev_indicators_long <- us_mex_can_dev_indicators |>
pivot_longer(cols = c("2013":"2021"),
names_to = "year",
values_to = "value")
us_mex_can_dev_indicators_long
## # A tibble: 1,485 × 4
## `Series Name` country year value
## <chr> <chr> <chr> <dbl>
## 1 Population Canada 2013 35082954
## 2 Population Canada 2014 35437435
## 3 Population Canada 2015 35702908
## 4 Population Canada 2016 36109487
## 5 Population Canada 2017 36545236
## 6 Population Canada 2018 37065084
## 7 Population Canada 2019 37601230
## 8 Population Canada 2020 38007166
## 9 Population Canada 2021 38226498
## 10 Population Mexico 2013 117290686
## # ℹ 1,475 more rows
# Reshape the data from long to wide format
us_mex_can_dev_indicators_wider <- us_mex_can_dev_indicators_long |>
pivot_wider(names_from = `Series Name`,
values_from = value
)
# Convert to a tidy format
data_tidy <- pivot_longer(us_mex_can_dev_indicators_wider, -c("country", "year"), names_to = "Variable", values_to = "Value")
# Impute missing values using the mean
data_tidy <- data_tidy %>%
group_by(Variable) %>%
mutate(Value = ifelse(is.na(Value), mean(Value, na.rm = TRUE), Value))
# Convert back to the original format
data_imputed <- pivot_wider(data_tidy, names_from = Variable, values_from = Value)
# Print the imputed data
glimpse(data_imputed)
## Rows: 27
## Columns: 57
## $ country <chr> …
## $ year <chr> …
## $ Population <dbl> …
## $ `Population growth (annual %)` <dbl> …
## $ `Surface area (sq. km)` <dbl> …
## $ `Population density (people per sq. km of land area)` <dbl> …
## $ `Poverty headcount ratio at national poverty lines (% of population)` <dbl> …
## $ `Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population)` <dbl> …
## $ `GNI, Atlas method (current US$)` <dbl> …
## $ `GNI per capita, Atlas method (current US$)` <dbl> …
## $ `GNI, PPP (current international $)` <dbl> …
## $ `GNI per capita, PPP (current international $)` <dbl> …
## $ `Income share held by lowest 20%` <dbl> …
## $ `Life expectancy at birth, total (years)` <dbl> …
## $ `Fertility rate, total (births per woman)` <dbl> …
## $ `Adolescent fertility rate (births per 1,000 women ages 15-19)` <dbl> …
## $ `Contraceptive prevalence, any method (% of married women ages 15-49)` <dbl> …
## $ `Births attended by skilled health staff (% of total)` <dbl> …
## $ `Mortality rate, under-5 (per 1,000 live births)` <dbl> …
## $ `Prevalence of underweight, weight for age (% of children under 5)` <dbl> …
## $ `Immunization, measles (% of children ages 12-23 months)` <dbl> …
## $ `Primary completion rate, total (% of relevant age group)` <dbl> …
## $ `School enrollment, primary (% gross)` <dbl> …
## $ `School enrollment, secondary (% gross)` <dbl> …
## $ `School enrollment, primary and secondary (gross), gender parity index (GPI)` <dbl> …
## $ `Prevalence of HIV, total (% of population ages 15-49)` <dbl> …
## $ `Forest area (sq. km)` <dbl> …
## $ `Terrestrial and marine protected areas (% of total territorial area)` <dbl> …
## $ `Annual freshwater withdrawals, total (% of internal resources)` <dbl> …
## $ `Urban population growth (annual %)` <dbl> …
## $ `Energy use (kg of oil equivalent per capita)` <dbl> …
## $ `CO2 emissions (metric tons per capita)` <dbl> …
## $ `Electric power consumption (kWh per capita)` <dbl> …
## $ `GDP (current US$)` <dbl> …
## $ `GDP growth (annual %)` <dbl> …
## $ `Inflation, GDP deflator (annual %)` <dbl> …
## $ `Agriculture, forestry, and fishing, value added (% of GDP)` <dbl> …
## $ `Industry (including construction), value added (% of GDP)` <dbl> …
## $ `Exports of goods and services (% of GDP)` <dbl> …
## $ `Imports of goods and services (% of GDP)` <dbl> …
## $ `Gross capital formation (% of GDP)` <dbl> …
## $ `Revenue, excluding grants (% of GDP)` <dbl> …
## $ `Time required to start a business (days)` <dbl> …
## $ `Domestic credit provided by financial sector (% of GDP)` <dbl> …
## $ `Tax revenue (% of GDP)` <dbl> …
## $ `Military expenditure (% of GDP)` <dbl> …
## $ `Mobile cellular subscriptions (per 100 people)` <dbl> …
## $ `High-technology exports (% of manufactured exports)` <dbl> …
## $ `Statistical Capacity Score (Overall Average) (scale 0 - 100)` <dbl> …
## $ `Merchandise trade (% of GDP)` <dbl> …
## $ `Net barter terms of trade index (2015 = 100)` <dbl> …
## $ `External debt stocks, total (DOD, current US$)` <dbl> …
## $ `Total debt service (% of exports of goods, services and primary income)` <dbl> …
## $ `Net migration` <dbl> …
## $ `Personal remittances, received (current US$)` <dbl> …
## $ `Foreign direct investment, net inflows (BoP, current US$)` <dbl> …
## $ `Net official development assistance and official aid received (current US$)` <dbl> …
#write_csv(data_imputed, "C:/Users/RemoteUser/Documents/DATA607/Project2/tidy.csv")
The annual population growth rate has been declining, indicating slower population growth over the years.
data_imputed$year <- as.Date(paste0(data_imputed$year, "-01-01"))
data_imputed$country <- as.factor(data_imputed$country)
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `Population growth (annual %)`, color = country)) +
geom_line() +
labs(title = "Population Growth Rate Over Time by Country",
x = "year",
y = "Population growth") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
# Economic Development
The GDP of the three countries deccreased from 2013. It the decline gradually until 2020 when it saw a sharp decline due to the Corvid-19 pandemic. The GDP then rebounded sharply reflecting robust economic growth.
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `GDP growth (annual %)`, color = country)) +
geom_line() +
labs(title = "GDP Growth Over Time by Country",
x = "year",
y = "GDP growth (annual %)") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
## GNI per capita, Atlas method (current US$)
The GNI per capita for United States increased since 2013 showing rising income levels. The GNI per capita of Canada and Mexico has be declining since 2013 indicating decreasing income levels.
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `GNI per capita, Atlas method (current US$)`, color = country)) +
geom_line() +
labs(title = "GNI per capita Over Time by Country",
x = "year",
y = "GNI per capita") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
##Primary completion rate, total (% of relevant age group) The primary completion rate is flat for Canada, declined but recovered for the United States, suggesting good access to primary education. Mexico has experienced a decline in primary completion rate.
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `Primary completion rate, total (% of relevant age group)`, color = country)) +
geom_line() +
labs(title = "Primary completion rate, total (% of relevant age group)",
x = "year",
y = "Primary completion rate") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
# Infrastructure and Technology
The mobile subscription rate has seen significant growth, reflecting high mobile phone usage.
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `Mobile cellular subscriptions (per 100 people)`, color = country)) +
geom_line() +
labs(title = "Primary completion rate, total (% of relevant age group)",
x = "year",
y = "Primary completion rate") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
## High-technology exports (% of manufactured exports) High-tech exports
as a percentage of total exports have remained relatively stable.
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `High-technology exports (% of manufactured exports)`, color = country)) +
geom_line() +
labs(title = "Primary completion rate, total (% of relevant age group)",
x = "year",
y = "Primary completion rate") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
# Distribution of Wealth The income share heald by the bottom 20% have
declined over the years, indicating a skewed distribution of wealth.
# Create a ggplot
ggplot(data_imputed, aes(x = year, y = `Income share held by lowest 20%` , color = country)) +
geom_line() +
labs(title = "Income share held by lowest 20%",
x = "year",
y = "Income share") +
theme_minimal() +
scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))
Social Indicators
Life expectancy at birth, total (years):
The life expectancy in Canada has steadily increased since the year 2013, indicating improvements in healthcare and living conditions. Mexico and the United States have seen a decrease in life expectancy. All three countries saw a sharp decline in life expectancy between 2019 and 2021 most likely due to the Covid-19 pandemic.
Fertility rate, total (births per woman)
The fertility rate of all three countries decreased, which suggests declining birth rates.
Tax revenue (% of GDP)
Tax revenue as a percentage of GDP increased for the Canada and Mexico which might indicate a higher tax burden on the economy. Tax revenue as a percentage of GDP increase for the United States also decrease in 2017 to 2018. This may be due to tax laws.
Foreign direct investment, net inflows (BoP, current US$)
United States received substantial net foreign direct investment, indicating its attractiveness to foreign investors.Net foreign direct investment for Mexico and Canada has been declining.