library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.2
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'readr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Load the dataset

country <- read_csv("country_stat.csv")
## Rows: 10545 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country, continent, region
## dbl (6): year, infant_mortality, life_expectancy, fertility, population, gdp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(country)
## # A tibble: 6 × 9
##   country    year infant_mortality life_expectancy fertility population      gdp
##   <chr>     <dbl>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
## 1 Albania    1960            115.             62.9      6.19    1636054 NA      
## 2 Algeria    1960            148.             47.5      7.65   11124892  1.38e10
## 3 Angola     1960            208              36.0      7.32    5270844 NA      
## 4 Antigua …  1960             NA              63.0      4.43      54681 NA      
## 5 Argentina  1960             59.9            65.4      3.11   20619075  1.08e11
## 6 Armenia    1960             NA              66.9      4.55    1867396 NA      
## # ℹ 2 more variables: continent <chr>, region <chr>

QUESTION 1 : Are there missing values in the data? If so, can you show how data is missing? (open-ended question)

Objective

Check for missing values and visualize missingness pattern.

summary(country)
##    country               year      infant_mortality life_expectancy
##  Length:10545       Min.   :1960   Min.   :  1.50   Min.   :13.20  
##  Class :character   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
##  Mode  :character   Median :1988   Median : 41.50   Median :67.54  
##                     Mean   :1988   Mean   : 55.31   Mean   :64.81  
##                     3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
##                     Max.   :2016   Max.   :276.90   Max.   :83.90  
##                                    NA's   :1453                    
##    fertility       population             gdp             continent        
##  Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Length:10545      
##  1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Class :character  
##  Median :3.750   Median :5.009e+06   Median :7.794e+09   Mode  :character  
##  Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11                     
##  3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10                     
##  Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                     
##  NA's   :187     NA's   :185         NA's   :2972                          
##     region         
##  Length:10545      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

Interpretation

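Yes. There are 1,453 NAs in infant_mortality (about 14% of the 10,545 rows), 187 in fertility, 185 in population, and 2,972 in gdp (about 28%). The year and life_expectancy columns have none. Note that summary() does not report NAs for the character columns (country, continent, region), so those need a separate check.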
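
To see how the data is missing rather than only how much, one option is to count NAs per column and then compute the share of missing values per year. The sketch below uses a small made-up tibble so that it runs on its own; the names `toy`, `na_by_column`, `na_by_year`, and `gdp_gap_plot` are illustrative assumptions, not part of the original analysis. Unlike `summary()`, `is.na()` also covers the character columns.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `country`: hypothetical values, not the real data
toy <- tibble(
  country          = c("A", "B", "A", "B"),
  year             = c(1960, 1960, 2016, 2016),
  infant_mortality = c(NA, 100, 20, NA),
  gdp              = c(NA, NA, 5e9, NA)
)

# Number of missing values in every column, character columns included
na_by_column <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Share of missing values per year for the two numeric columns
na_by_year <- toy %>%
  group_by(year) %>%
  summarise(across(c(infant_mortality, gdp), ~ mean(is.na(.x))))

# Bar chart of the share of missing gdp values by year
gdp_gap_plot <- ggplot(na_by_year, aes(x = year, y = gdp)) +
  geom_col() +
  labs(x = "Year", y = "Share of gdp values missing")
```

Replacing `toy` with `country` would apply the same steps to the full dataset, and `print(gdp_gap_plot)` draws the chart.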


QUESTION 2 : How many unique countries are included in the data? How many years of observations are included in the data?

Objective

Determine the number of unique countries and years of observations.

length(unique(country$country))
## [1] 185
diff(range(country$year)) + 1
## [1] 57

Interpretation

The data cover 185 unique countries observed over 57 years, from 1960 to 2016.
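
One caveat on the year count: `diff(range(x)) + 1` measures the span between the first and last year, which matches the number of observed years only when no year in between is absent. Counting distinct values avoids that assumption. A minimal base R sketch with made-up years (the vector `years` is hypothetical):

```r
# Hypothetical years with a gap at 1962
years <- c(1960, 1961, 1963, 1963)

# Span from first to last year, inclusive
span_years <- diff(range(years)) + 1

# Number of distinct years actually observed
observed_years <- length(unique(years))
```

Here `span_years` is 4 while `observed_years` is 3. In this dataset, 185 countries times 57 years equals the 10,545 rows, which is consistent with every year being present, so the two counts would agree.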


QUESTION 3 : In the data, create a new variable called GDP_per_capita, equal to GDP/population.

Objective

To create a new variable for GDP per capita.

# Add GDP per capita; rows with missing gdp or population yield NA
country_per <- country %>%
  mutate(GDP_per_capita = gdp / population)
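
Because gdp and population both contain NAs, the division returns NA for any row where either input is missing. A small sketch with made-up numbers illustrates this; `toy`, `toy_per`, and `n_missing` are hypothetical names, and the values are not from the real data.

```r
library(dplyr)

# Hypothetical rows: one complete, one with missing gdp
toy <- tibble(
  country    = c("A", "B"),
  gdp        = c(1e10, NA),
  population = c(1e7, 2e6)
)

toy_per <- toy %>%
  mutate(GDP_per_capita = gdp / population)

# Number of rows where GDP per capita could not be computed
n_missing <- sum(is.na(toy_per$GDP_per_capita))
```

Country A gets 1e10 / 1e7 = 1000, while country B stays NA. Running the last line on `country_per` would show how many rows lack a GDP per capita value.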

The histogram shows the frequency distribution of life expectancy values. We can observe the range of life expectancy values and identify any noticeable patterns, such as whether the distribution is skewed or symmetric. This visualization provides insights into the overall life expectancy distribution among the countries in the dataset.

QUESTION 4.4: How does infant mortality vary by continent?

Objective

To compare the variation in infant mortality rates across different continents.

# Box plot of infant mortality by continent
infant_mortality_boxplot <- ggplot(country, aes(x = continent, y = infant_mortality, fill = continent)) +
  geom_boxplot() +
  labs(title = "Infant Mortality Rate by Continent", x = "Continent", y = "Infant Mortality Rate") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

print(infant_mortality_boxplot)
## Warning: Removed 1453 rows containing non-finite values (`stat_boxplot()`).

Interpretation

The box plot compares the distribution of infant mortality rates across continents, pooling all years from 1960 to 2016. In each box, the center line marks the median, the box spans the interquartile range, and the whiskers and outlier points show how far the extreme rates reach. Differences in these features between continents point to geographic disparities that may reflect healthcare and socio-economic conditions. The 1,453 rows with missing infant_mortality are dropped from the plot (see the warning above), so the comparison covers only country-years with a recorded rate.
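
The visual comparison can be backed with numbers by computing the median and interquartile range per continent. The sketch below is one way to do that, using made-up values (`toy` and `by_continent` are hypothetical names) and `na.rm = TRUE` to mirror the rows the box plot drops.

```r
library(dplyr)

# Hypothetical values, not the real data
toy <- tibble(
  continent        = c("X", "X", "X", "Y", "Y", "Y"),
  infant_mortality = c(10, 20, NA, 50, 70, 90)
)

# Median, interquartile range, and count of missing values per continent
by_continent <- toy %>%
  group_by(continent) %>%
  summarise(
    median_im = median(infant_mortality, na.rm = TRUE),
    iqr_im    = IQR(infant_mortality, na.rm = TRUE),
    n_missing = sum(is.na(infant_mortality))
  )
```

Swapping `toy` for `country` would give the same table for the real continents, and the `n_missing` column shows how unevenly the dropped rows are spread.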