World Development Indicators

The here, we will use the world_development_indicators.csv dataset.

Load the relevant libraries

We start by installing the relevant packages and loading the libraries as below

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(RMySQL)

## Loading required package: DBI

library(DBI)
library(readr)
library(data.table)

## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## The following object is masked from 'package:purrr':
## 
##     transpose

library(tidyr)
library(naniar)

#1 Load and preview the data

We will read the data into the wd_dev_indicators data frame. The preview of the data shows that the data set has 59 columns/variables. We will start by selecting only the variable of interest.

wd_dev_indicators <- read_csv("C:/Users/RemoteUser/Documents/DATA607/Project2/World Dev Indicators/world_development_indicators_1.csv")

## Rows: 165 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): Series Name, country, 1990 [YR1990], 2000 [YR2000], 2013 [YR2013],...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

wd_dev_indicators <- wd_dev_indicators %>%
  rename(
    "1990" = `1990 [YR1990]`,
    "2000" = `2000 [YR2000]`,
    "2013" = `2013 [YR2013]`,
    "2014" = `2014 [YR2014]`,
    "2015" = `2015 [YR2015]`,
    "2016" = `2016 [YR2016]`,
    "2017" = `2017 [YR2017]`,
    "2018" = `2018 [YR2018]`,
    "2019" = `2019 [YR2019]`,
    "2020" = `2020 [YR2020]`,
    "2021" = `2021 [YR2021]`,
    "2022" = `2022 [YR2022]`
  )

glimpse(wd_dev_indicators)

## Rows: 165
## Columns: 14
## $ `Series Name` <chr> "Population", "Population", "Population", "Population gr…
## $ country       <chr> "Canada", "Mexico", "United States", "Canada", "Mexico",…
## $ `1990`        <chr> "27691138", "81720428", "249623000", "1.507660673", "1.8…
## $ `2000`        <chr> "30685730", "97873442", "282162411", "0.931281553", "1.5…
## $ `2013`        <chr> "35082954", "117290686", "316059947", "1.056591259", "1.…
## $ `2014`        <chr> "35437435", "118755887", "318386329", "1.005337579", "1.…
## $ `2015`        <chr> "35702908", "120149897", "320738994", "0.746339478", "1.…
## $ `2016`        <chr> "36109487", "121519221", "323071755", "1.132348655", "1.…
## $ `2017`        <chr> "36545236", "122839258", "325122128", "1.199520711", "1.…
## $ `2018`        <chr> "37065084", "124013861", "326838199", "1.412455869", "0.…
## $ `2019`        <chr> "37601230", "125085311", "328329953", "1.436136823", "0.…
## $ `2020`        <chr> "38007166", "125998302", "331511512", "1.073795827", "0.…
## $ `2021`        <chr> "38226498", "126705138", "332031554", "0.575421916", "0.…
## $ `2022`        <chr> "38929902", "127504125", "333287557", "1.823370422", "0.…

#2 Convert the Year columns to numeric type

# Define a list of the columns you want to convert to numeric (in this case, columns from '1990' to '2022')
columns_to_convert <- c('1990', '2000', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022')

# Use the mutate function to convert the specified columns to numeric
wd_dev_indicators <- wd_dev_indicators %>%
  mutate(across(all_of(columns_to_convert), as.numeric))

## Warning: There were 12 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `across(all_of(columns_to_convert), as.numeric)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 11 remaining warnings.

#3 Select the relevant data for our analysis

In this analysis, we will be looking at economic and migration data. The countries of interest are United States, Canada, and Mexico. The analysis will also cover from 2013 to 2021

us_mex_can_dev_indicators <- wd_dev_indicators |> 
  filter(country %in% c("Canada", "Mexico", "United States"))

us_mex_can_dev_indicators <- us_mex_can_dev_indicators %>%
  select(-c("1990", "2000", "2022"))

us_mex_can_dev_indicators

## # A tibble: 165 × 11
##    `Series Name` country  `2013`  `2014`  `2015`  `2016`  `2017`  `2018`  `2019`
##    <chr>         <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 Population    Canada  3.51e+7 3.54e+7 3.57e+7 3.61e+7 3.65e+7 3.71e+7 3.76e+7
##  2 Population    Mexico  1.17e+8 1.19e+8 1.20e+8 1.22e+8 1.23e+8 1.24e+8 1.25e+8
##  3 Population    United… 3.16e+8 3.18e+8 3.21e+8 3.23e+8 3.25e+8 3.27e+8 3.28e+8
##  4 Population g… Canada  1.06e+0 1.01e+0 7.46e-1 1.13e+0 1.20e+0 1.41e+0 1.44e+0
##  5 Population g… Mexico  1.32e+0 1.24e+0 1.17e+0 1.13e+0 1.08e+0 9.52e-1 8.60e-1
##  6 Population g… United… 6.93e-1 7.33e-1 7.36e-1 7.25e-1 6.33e-1 5.26e-1 4.55e-1
##  7 Surface area… Canada  9.88e+6 9.88e+6 9.88e+6 9.88e+6 9.88e+6 9.88e+6 9.88e+6
##  8 Surface area… Mexico  1.96e+6 1.96e+6 1.96e+6 1.96e+6 1.96e+6 1.96e+6 1.96e+6
##  9 Surface area… United… 9.83e+6 9.83e+6 9.83e+6 9.83e+6 9.83e+6 9.83e+6 9.83e+6
## 10 Population d… Canada  3.91e+0 3.95e+0 3.98e+0 4.03e+0 4.08e+0 4.13e+0 4.19e+0
## # ℹ 155 more rows
## # ℹ 2 more variables: `2020` <dbl>, `2021` <dbl>

#4 Reshape the data Reshape the data from wide to long format using pivot_longer. Impute missing values. Then, reshape the data again to wide for analysis.

us_mex_can_dev_indicators_long <- us_mex_can_dev_indicators |>
  pivot_longer(cols = c("2013":"2021"),
               names_to = "year",
               values_to = "value")


us_mex_can_dev_indicators_long

## # A tibble: 1,485 × 4
##    `Series Name` country year      value
##    <chr>         <chr>   <chr>     <dbl>
##  1 Population    Canada  2013   35082954
##  2 Population    Canada  2014   35437435
##  3 Population    Canada  2015   35702908
##  4 Population    Canada  2016   36109487
##  5 Population    Canada  2017   36545236
##  6 Population    Canada  2018   37065084
##  7 Population    Canada  2019   37601230
##  8 Population    Canada  2020   38007166
##  9 Population    Canada  2021   38226498
## 10 Population    Mexico  2013  117290686
## # ℹ 1,475 more rows

# Reshape the data from long to wide format
us_mex_can_dev_indicators_wider <- us_mex_can_dev_indicators_long |>
  pivot_wider(names_from = `Series Name`, 
              values_from = value
  )

# Convert to a tidy format
data_tidy <- pivot_longer(us_mex_can_dev_indicators_wider, -c("country", "year"), names_to = "Variable", values_to = "Value")

# Impute missing values using the mean
data_tidy <- data_tidy %>% 
  group_by(Variable) %>% 
  mutate(Value = ifelse(is.na(Value), mean(Value, na.rm = TRUE), Value))

# Convert back to the original format
data_imputed <- pivot_wider(data_tidy, names_from = Variable, values_from = Value)

# Print the imputed data
glimpse(data_imputed)

## Rows: 27
## Columns: 57
## $ country                                                                       <chr> …
## $ year                                                                          <chr> …
## $ Population                                                                    <dbl> …
## $ `Population growth (annual %)`                                                <dbl> …
## $ `Surface area (sq. km)`                                                       <dbl> …
## $ `Population density (people per sq. km of land area)`                         <dbl> …
## $ `Poverty headcount ratio at national poverty lines (% of population)`         <dbl> …
## $ `Poverty headcount ratio at $2.15 a day (2017 PPP) (% of population)`         <dbl> …
## $ `GNI, Atlas method (current US$)`                                             <dbl> …
## $ `GNI per capita, Atlas method (current US$)`                                  <dbl> …
## $ `GNI, PPP (current international $)`                                          <dbl> …
## $ `GNI per capita, PPP (current international $)`                               <dbl> …
## $ `Income share held by lowest 20%`                                             <dbl> …
## $ `Life expectancy at birth, total (years)`                                     <dbl> …
## $ `Fertility rate, total (births per woman)`                                    <dbl> …
## $ `Adolescent fertility rate (births per 1,000 women ages 15-19)`               <dbl> …
## $ `Contraceptive prevalence, any method (% of married women ages 15-49)`        <dbl> …
## $ `Births attended by skilled health staff (% of total)`                        <dbl> …
## $ `Mortality rate, under-5 (per 1,000 live births)`                             <dbl> …
## $ `Prevalence of underweight, weight for age (% of children under 5)`           <dbl> …
## $ `Immunization, measles (% of children ages 12-23 months)`                     <dbl> …
## $ `Primary completion rate, total (% of relevant age group)`                    <dbl> …
## $ `School enrollment, primary (% gross)`                                        <dbl> …
## $ `School enrollment, secondary (% gross)`                                      <dbl> …
## $ `School enrollment, primary and secondary (gross), gender parity index (GPI)` <dbl> …
## $ `Prevalence of HIV, total (% of population ages 15-49)`                       <dbl> …
## $ `Forest area (sq. km)`                                                        <dbl> …
## $ `Terrestrial and marine protected areas (% of total territorial area)`        <dbl> …
## $ `Annual freshwater withdrawals, total (% of internal resources)`              <dbl> …
## $ `Urban population growth (annual %)`                                          <dbl> …
## $ `Energy use (kg of oil equivalent per capita)`                                <dbl> …
## $ `CO2 emissions (metric tons per capita)`                                      <dbl> …
## $ `Electric power consumption (kWh per capita)`                                 <dbl> …
## $ `GDP (current US$)`                                                           <dbl> …
## $ `GDP growth (annual %)`                                                       <dbl> …
## $ `Inflation, GDP deflator (annual %)`                                          <dbl> …
## $ `Agriculture, forestry, and fishing, value added (% of GDP)`                  <dbl> …
## $ `Industry (including construction), value added (% of GDP)`                   <dbl> …
## $ `Exports of goods and services (% of GDP)`                                    <dbl> …
## $ `Imports of goods and services (% of GDP)`                                    <dbl> …
## $ `Gross capital formation (% of GDP)`                                          <dbl> …
## $ `Revenue, excluding grants (% of GDP)`                                        <dbl> …
## $ `Time required to start a business (days)`                                    <dbl> …
## $ `Domestic credit provided by financial sector (% of GDP)`                     <dbl> …
## $ `Tax revenue (% of GDP)`                                                      <dbl> …
## $ `Military expenditure (% of GDP)`                                             <dbl> …
## $ `Mobile cellular subscriptions (per 100 people)`                              <dbl> …
## $ `High-technology exports (% of manufactured exports)`                         <dbl> …
## $ `Statistical Capacity Score (Overall Average) (scale 0 - 100)`                <dbl> …
## $ `Merchandise trade (% of GDP)`                                                <dbl> …
## $ `Net barter terms of trade index (2015 = 100)`                                <dbl> …
## $ `External debt stocks, total (DOD, current US$)`                              <dbl> …
## $ `Total debt service (% of exports of goods, services and primary income)`     <dbl> …
## $ `Net migration`                                                               <dbl> …
## $ `Personal remittances, received (current US$)`                                <dbl> …
## $ `Foreign direct investment, net inflows (BoP, current US$)`                   <dbl> …
## $ `Net official development assistance and official aid received (current US$)` <dbl> …

#write_csv(data_imputed, "C:/Users/RemoteUser/Documents/DATA607/Project2/tidy.csv")

Population Analysis

The annual population growth rate has been declining, indicating slower population growth over the years.

data_imputed$year <- as.Date(paste0(data_imputed$year, "-01-01"))
data_imputed$country <- as.factor(data_imputed$country)

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Population growth (annual %)`, color = country)) +
  geom_line() +
  labs(title = "Population Growth Rate Over Time by Country",
       x = "year",
       y = "Population growth") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

# Economic Development

GDP (current US$)

The GDP of the three countries deccreased from 2013. It the decline gradually until 2020 when it saw a sharp decline due to the Corvid-19 pandemic. The GDP then rebounded sharply reflecting robust economic growth.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `GDP growth (annual %)`, color = country)) +
  geom_line() +
  labs(title = "GDP Growth Over Time by Country",
       x = "year",
       y = "GDP growth (annual %)") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

## GNI per capita, Atlas method (current US$)

The GNI per capita for United States increased since 2013 showing rising income levels. The GNI per capita of Canada and Mexico has be declining since 2013 indicating decreasing income levels.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `GNI per capita, Atlas method (current US$)`, color = country)) +
  geom_line() +
  labs(title = "GNI per capita Over Time by Country",
       x = "year",
       y = "GNI per capita") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

Social Indicators

Life expectancy at birth, total (years):

The life expectancy in Canada has steadily increased since the year 2013, indicating improvements in healthcare and living conditions. Mexico and the United States have seen a decrease in life expectancy. All three countries saw a sharp decline in life expectancy between 2019 and 2021 most likely due to the Covid-19 pandemic.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Life expectancy at birth, total (years)`, color = country)) +
  geom_line() +
  labs(title = "Life expectancy Over Time by Country",
       x = "year",
       y = "Life expectancy") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

Fertility rate, total (births per woman)

The fertility rate of all three countries decreased, which suggests declining birth rates.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Fertility rate, total (births per woman)`, color = country)) +
  geom_line() +
  labs(title = "Fertility rate, total (births per woman)",
       x = "year",
       y = "Fertility rate") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

# Economic and Financial Indicators

Tax revenue (% of GDP)

Tax revenue as a percentage of GDP increased for the Canada and Mexico which might indicate a higher tax burden on the economy. Tax revenue as a percentage of GDP increase for the United States also decrease in 2017 to 2018. This may be due to tax laws.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Tax revenue (% of GDP)`, color = country)) +
  geom_line() +
  labs(title = "Tax revenue Over Time by Country",
       x = "year",
       y = "Tax revenue (% of GDP)") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

Foreign direct investment, net inflows (BoP, current US$)

United States received substantial net foreign direct investment, indicating its attractiveness to foreign investors.Net foreign direct investment for Mexico and Canada has been declining.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Foreign direct investment, net inflows (BoP, current US$)`, color = country)) +
  geom_line() +
  labs(title = "Foreign direct investment Over Time by Country",
       x = "year",
       y = "Foreign direct investment") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

Education and Gender Parity

##Primary completion rate, total (% of relevant age group) The primary completion rate is flat for Canada, declined but recovered for the United States, suggesting good access to primary education. Mexico has experienced a decline in primary completion rate.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Primary completion rate, total (% of relevant age group)`, color = country)) +
  geom_line() +
  labs(title = "Primary completion rate, total (% of relevant age group)",
       x = "year",
       y = "Primary completion rate") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

# Infrastructure and Technology

Mobile cellular subscriptions (per 100 people)

The mobile subscription rate has seen significant growth, reflecting high mobile phone usage.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Mobile cellular subscriptions (per 100 people)`, color = country)) +
  geom_line() +
  labs(title = "Primary completion rate, total (% of relevant age group)",
       x = "year",
       y = "Primary completion rate") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

## High-technology exports (% of manufactured exports) High-tech exports as a percentage of total exports have remained relatively stable.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `High-technology exports (% of manufactured exports)`, color = country)) +
  geom_line() +
  labs(title = "Primary completion rate, total (% of relevant age group)",
       x = "year",
       y = "Primary completion rate") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

# Distribution of Wealth The income share heald by the bottom 20% have declined over the years, indicating a skewed distribution of wealth.

# Create a ggplot
ggplot(data_imputed, aes(x = year, y =  `Income share held by lowest 20%` , color = country)) +
  geom_line() +
  labs(title = "Income share held by lowest 20%",
       x = "year",
       y = "Income share") +
  theme_minimal() +
  scale_color_manual(values = c("Canada" = "blue", "Mexico" = "red", "United States" = "green"))

Conclusion

AMWEST provides the most flights and has the highest proportion of delayed flights for all destinations.
Phoenix is the destination with the most arrivals but with the lowest proportion of delayed arrivals.
San Francisco, though it comes third to last in arrivals flight has the highest proportion of delayed arrivals.
For any given destination, ALASKA has the lowest proportion of delayed flights.

Project 2

Fomba Kassoh, Group Member: Souleymane Doumbia

2023-10-15