Chhabra_Ridhi_BANA4137

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.3.2

## Warning: package 'ggplot2' was built under R version 4.3.2

## Warning: package 'readr' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the dataset

country = read_csv("country_stat.csv")

## Rows: 10545 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): country, continent, region
## dbl (6): year, infant_mortality, life_expectancy, fertility, population, gdp
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(country)

## # A tibble: 6 × 9
##   country    year infant_mortality life_expectancy fertility population      gdp
##   <chr>     <dbl>            <dbl>           <dbl>     <dbl>      <dbl>    <dbl>
## 1 Albania    1960            115.             62.9      6.19    1636054 NA      
## 2 Algeria    1960            148.             47.5      7.65   11124892  1.38e10
## 3 Angola     1960            208              36.0      7.32    5270844 NA      
## 4 Antigua …  1960             NA              63.0      4.43      54681 NA      
## 5 Argentina  1960             59.9            65.4      3.11   20619075  1.08e11
## 6 Armenia    1960             NA              66.9      4.55    1867396 NA      
## # ℹ 2 more variables: continent <chr>, region <chr>

QUESTION 1 : Are there missing values in the data? If so, can you show how data is missing? (open-ended question)

Objective

Check for missing values and visualize missingness pattern.

summary(country)

##    country               year      infant_mortality life_expectancy
##  Length:10545       Min.   :1960   Min.   :  1.50   Min.   :13.20  
##  Class :character   1st Qu.:1974   1st Qu.: 16.00   1st Qu.:57.50  
##  Mode  :character   Median :1988   Median : 41.50   Median :67.54  
##                     Mean   :1988   Mean   : 55.31   Mean   :64.81  
##                     3rd Qu.:2002   3rd Qu.: 85.10   3rd Qu.:73.00  
##                     Max.   :2016   Max.   :276.90   Max.   :83.90  
##                                    NA's   :1453                    
##    fertility       population             gdp             continent        
##  Min.   :0.840   Min.   :3.124e+04   Min.   :4.040e+07   Length:10545      
##  1st Qu.:2.200   1st Qu.:1.333e+06   1st Qu.:1.846e+09   Class :character  
##  Median :3.750   Median :5.009e+06   Median :7.794e+09   Mode  :character  
##  Mean   :4.084   Mean   :2.701e+07   Mean   :1.480e+11                     
##  3rd Qu.:6.000   3rd Qu.:1.523e+07   3rd Qu.:5.540e+10                     
##  Max.   :9.220   Max.   :1.376e+09   Max.   :1.174e+13                     
##  NA's   :187     NA's   :185         NA's   :2972                          
##     region         
##  Length:10545      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
##

Interpretation

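There are 1,453 missing values (NA) in infant_mortality, 187 in fertility, 185 in population, and 2,972 in gdp. The numeric columns year and life_expectancy have none. Note that summary() does not report NA counts for the character columns (country, continent, region), so those need a separate check.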
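
The objective also asks how the data are missing, which summary() alone does not show. One possible approach, sketched below, is to compute the share of missing values per variable and year and plot it as a line chart. The `toy` tibble and its values are made up for illustration only; with the real data, `toy` would be replaced by `country` and the column list extended as needed.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for `country` (hypothetical values, for illustration only).
toy <- tibble(
  country          = c("A", "A", "B", "B"),
  year             = c(1960, 1961, 1960, 1961),
  gdp              = c(NA, 10, NA, NA),
  infant_mortality = c(50, NA, 40, 35)
)

# Share of missing values per variable and year, reshaped to long format.
missing_by_year <- toy %>%
  group_by(year) %>%
  summarise(across(c(gdp, infant_mortality), ~ mean(is.na(.x))),
            .groups = "drop") %>%
  pivot_longer(-year, names_to = "variable", values_to = "share_missing")

# One line per variable: share of countries with a missing value in each year.
missing_plot <- ggplot(missing_by_year,
                       aes(x = year, y = share_missing, color = variable)) +
  geom_line() +
  labs(title = "Share of Missing Values by Year",
       x = "Year", y = "Share missing") +
  theme_minimal()
```

Calling `print(missing_plot)` would display the chart. A plot like this can indicate whether the gaps cluster in particular years rather than being spread evenly.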

QUESTION 2 : How many unique countries are included in the data? How many years of observations are included in the data?

Objective

Determine the number of unique countries and years of observations.

length(unique(country$country))

## [1] 185

diff(range(country$year)) + 1

## [1] 57

Interpretation

The data cover 185 unique countries and 57 years of observations (1960 to 2016).
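
The expression `diff(range(year)) + 1` measures the span between the first and last year, so it only equals the number of observed years when no year is skipped. Here the row count agrees with a complete panel (185 × 57 = 10,545), but a direct count is a safer check. The sketch below uses a made-up `toy` tibble with a deliberate gap to show the difference; none of these values come from the dataset.

```r
library(dplyr)

# Toy stand-in with a gap: 1962 is absent (hypothetical values).
toy <- tibble(
  country = c("A", "A", "A", "B", "B"),
  year    = c(1960, 1961, 1963, 1960, 1963)
)

# Span of years: counts 1962 even though it never occurs.
span_years <- diff(range(toy$year)) + 1

# Distinct years: counts only years that actually appear.
distinct_years <- n_distinct(toy$year)

# Number of distinct years observed for each country.
years_per_country <- toy %>%
  group_by(country) %>%
  summarise(n_years = n_distinct(year), .groups = "drop")
```

With the real data, `n_distinct(country$year)` and the per-country count would confirm whether every country has all 57 years.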

QUESTION 3 : In the data, create a new variable called GDP_per_capita which equals GDP/population.

Objective

To create a new variable for GDP per capita.

country_per <- country %>%
  mutate(GDP_per_capita = gdp/population)
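
Because gdp and population both contain missing values, the ratio is NA whenever either input is NA. The short sketch below illustrates this on a made-up `toy` tibble (the values are hypothetical); the same check could be run on `country_per`.

```r
library(dplyr)

# Toy stand-in (hypothetical values): one complete row, two rows with an NA input.
toy <- tibble(
  gdp        = c(1e9, NA, 5e8),
  population = c(1e6, 2e6, NA)
)

toy_per <- toy %>%
  mutate(GDP_per_capita = gdp / population)

# Count rows where the ratio could not be computed.
n_missing_ratio <- sum(is.na(toy_per$GDP_per_capita))
```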

QUESTION 4 : Propose four questions you would like to answer with this data. At least one question needs to be related to time series and be answered using time series data visualization. Some example questions can be: Do developing countries grow more slowly than developed countries? Is Africa catching up with the world or being left behind? Is the world more divided now than it was 50 years ago? Please propose your own questions and do not use exactly the same questions listed above.

QUESTION 4.1 : How has the infant mortality rate changed over time in each region?

Objective

To analyze the trend of infant mortality rate over time for different regions.

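# Note: `country` has one row per country and year, and only `color = region` is mapped,
# so geom_line() draws a single path per region that runs through every country's values.
infant_mortality_by_region_plot <- ggplot(country, aes(x = year, y = infant_mortality, color = region)) +
  geom_line() +
  labs(title = "Infant Mortality Rate by Region Over Time", x = "Year", y = "Infant Mortality Rate") +
  theme_minimal()
print(infant_mortality_by_region_plot)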

## Warning: Removed 249 rows containing missing values (`geom_line()`).

Interpretation

The plot shows how infant mortality rates differ across regions and how they change over time, which helps reveal regional patterns and gaps.
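
Since the data have one row per country and year, a plot that maps only `color = region` connects all countries of a region in one path. One way to get a single, readable trend per region is to average within each region and year before plotting. The sketch below does this on a made-up `toy` tibble (hypothetical values); it uses an unweighted mean across countries, which is an assumption, and a population-weighted mean would be a reasonable alternative.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for `country` (hypothetical values, for illustration only).
toy <- tibble(
  country = rep(c("A", "B", "C"), each = 2),
  region  = rep(c("North", "North", "South"), each = 2),
  year    = rep(c(1960, 1961), times = 3),
  infant_mortality = c(100, 90, 60, NA, 150, 140)
)

# Unweighted mean infant mortality per region and year, ignoring NAs.
regional_trend <- toy %>%
  group_by(region, year) %>%
  summarise(mean_infant_mortality = mean(infant_mortality, na.rm = TRUE),
            .groups = "drop")

# One line per region.
regional_plot <- ggplot(regional_trend,
                        aes(x = year, y = mean_infant_mortality, color = region)) +
  geom_line() +
  labs(title = "Mean Infant Mortality Rate by Region Over Time",
       x = "Year", y = "Mean Infant Mortality Rate") +
  theme_minimal()
```

An alternative that keeps every country visible would be to add `group = country` to the aesthetics of the original plot.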

QUESTION 4.2 : Which Asian countries have found financial success in recent years?

Objective

(Part 1) To identify Asian countries that have been financially successful, and (Part 2) to focus on those with GDP per capita above 2,000.

country_per %>%
  filter(continent == "Asia") %>%
  ggplot() +
  geom_line(aes(x = year, y = GDP_per_capita, color = country)) +
  theme(legend.text = element_text(size = 7), legend.title = element_text(size = 7))

## Warning: Removed 959 rows containing missing values (`geom_line()`).

# Now looking at Asian countries which have GDP per capita > 2000
# (rows with missing GDP_per_capita are dropped by filter())
country_per %>%
  filter(continent == "Asia", GDP_per_capita > 2000) %>%
  ggplot() +
  geom_line(aes(x = year, y = GDP_per_capita, color = country))

Interpretation

The plots suggest that the United Arab Emirates' GDP per capita has risen in the last couple of years, and that several other Asian countries have also reached a GDP per capita above 2,000.
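
A ranked table can make this reading easier to verify than a line chart with many colors. The sketch below takes each country's most recent non-missing GDP per capita and keeps those above the threshold. The `toy` tibble and its values are made up for illustration; with the real data, `toy` would be replaced by `country_per`.

```r
library(dplyr)

# Toy stand-in for `country_per` (hypothetical values, for illustration only).
toy <- tibble(
  country   = rep(c("X", "Y", "Z"), each = 2),
  continent = "Asia",
  year      = rep(c(2000, 2010), times = 3),
  GDP_per_capita = c(1000, 3000, 2500, 2600, 500, NA)
)

# Latest available GDP per capita per Asian country, above 2000, highest first.
latest_above_threshold <- toy %>%
  filter(continent == "Asia", !is.na(GDP_per_capita)) %>%
  group_by(country) %>%
  slice_max(year, n = 1) %>%
  ungroup() %>%
  filter(GDP_per_capita > 2000) %>%
  arrange(desc(GDP_per_capita))
```

Here country Z drops out because its latest available value (from 2000) is below the threshold.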

QUESTION 4.3: What is the distribution of life expectancy across all countries?

Objective

To visually explore the distribution of life expectancy across all countries in the dataset.

# Histogram of life expectancy
life_expectancy_histogram <- ggplot(country, aes(x = life_expectancy)) +
  geom_histogram(binwidth = 2, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Life Expectancy", x = "Life Expectancy", y = "Frequency") +
  theme_minimal()

print(life_expectancy_histogram)

Interpretation

The histogram shows the frequency distribution of life expectancy across all country-year observations, in 2-year bins. According to the summary() output above, values range from 13.2 to 83.9 years, and the mean (64.81) is below the median (67.54), which is consistent with a left-skewed distribution: most observations sit at higher life expectancies, with a tail toward low values.
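
Skewness can also be checked numerically rather than by eye. The sketch below compares the mean with the median and computes a simple moment-based skewness coefficient. The helper `moment_skewness()` and the `toy_life` vector are hypothetical and written for illustration only; with the real data, `country$life_expectancy` would be passed instead.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5 (hypothetical helper).
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Toy values with a tail toward low numbers (made up for illustration).
toy_life <- c(40, 55, 62, 68, 70, 72, 75)

mean_below_median <- mean(toy_life) < median(toy_life)
skew <- moment_skewness(toy_life)
```

A negative coefficient together with a mean below the median points to a left-skewed distribution.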

QUESTION 4.4: How does infant mortality vary by continent?

Objective

To compare the variation in infant mortality rates across different continents.

# Box plot of infant mortality by continent
infant_mortality_boxplot <- ggplot(country, aes(x = continent, y = infant_mortality, fill = continent)) +
  geom_boxplot() +
  labs(title = "Infant Mortality Rate by Continent", x = "Continent", y = "Infant Mortality Rate") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

print(infant_mortality_boxplot)

## Warning: Removed 1453 rows containing non-finite values (`stat_boxplot()`).

Interpretation

The box plot compares the distribution of infant mortality rates across continents. Comparing the boxes shows differences in the median and spread (interquartile range) for each continent, while whisker lengths and outliers show how far the more extreme values reach. Each observation is a country-year, so every box pools all years from 1960 to 2016, and the 1,453 rows with missing infant mortality are excluded. The differences between continents point to potential disparities in healthcare and socio-economic conditions.
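
The quantities a box plot displays can also be tabulated, which makes the comparison between continents explicit. The sketch below computes the median, interquartile range, and number of missing values per group on a made-up `toy` tibble (hypothetical values and continent labels); with the real data, `toy` would be replaced by `country`.

```r
library(dplyr)

# Toy stand-in for `country` (hypothetical values, for illustration only).
toy <- tibble(
  continent = rep(c("P", "Q"), each = 4),
  infant_mortality = c(10, 20, 30, NA, 80, 100, 120, 140)
)

# Median, IQR, and missing count per continent, ignoring NAs in the statistics.
by_continent <- toy %>%
  group_by(continent) %>%
  summarise(
    n_missing = sum(is.na(infant_mortality)),
    median_im = median(infant_mortality, na.rm = TRUE),
    iqr_im    = IQR(infant_mortality, na.rm = TRUE),
    .groups   = "drop"
  )
```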

Chhabra_Ridhi_BANA4137_Homework9

Ridhi Chhabra

2024-04-01
