DSLabs Assignment

Author

Duchelle K

Load the dslabs package

library(dslabs)
Warning: package 'dslabs' was built under R version 4.4.1
data(package = "dslabs")
list.files(system.file("script", package = "dslabs"))
 [1] "make-admissions.R"                   
 [2] "make-brca.R"                         
 [3] "make-brexit_polls.R"                 
 [4] "make-calificaciones.R"               
 [5] "make-death_prob.R"                   
 [6] "make-divorce_margarine.R"            
 [7] "make-gapminder-rdas.R"               
 [8] "make-greenhouse_gases.R"             
 [9] "make-historic_co2.R"                 
[10] "make-mice_weights.R"                 
[11] "make-mnist_127.R"                    
[12] "make-mnist_27.R"                     
[13] "make-movielens.R"                    
[14] "make-murders-rda.R"                  
[15] "make-na_example-rda.R"               
[16] "make-nyc_regents_scores.R"           
[17] "make-olive.R"                        
[18] "make-outlier_example.R"              
[19] "make-polls_2008.R"                   
[20] "make-polls_us_election_2016.R"       
[21] "make-pr_death_counts.R"              
[22] "make-reported_heights-rda.R"         
[23] "make-research_funding_rates.R"       
[24] "make-stars.R"                        
[25] "make-temp_carbon.R"                  
[26] "make-tissue-gene-expression.R"       
[27] "make-trump_tweets.R"                 
[28] "make-weekly_us_contagious_diseases.R"
[29] "save-gapminder-example-csv.R"        

Choose a dataset and load additional packages to explore it

I have chosen the gapminder dataset. It contains health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, oecd and opec, with the names of the OECD and OPEC countries as of 2016.
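Before working with the data, it can help to confirm what was actually loaded. The following is a minimal sketch, assuming the dslabs package is installed and that data("gapminder") also makes the oecd and opec vectors available, as the package documentation describes.

```r
library(dslabs)

# Load gapminder; the oecd and opec vectors are expected to load with it
data("gapminder")

# Column names, types, and a preview of the values
str(gapminder)

# Range of years covered and number of distinct countries
range(gapminder$year)
length(unique(gapminder$country))

# First few OPEC and OECD member names
head(opec)
head(oecd)
```

Checking str() first shows which columns are factors (such as region and continent), which matters for the grouping step later on.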

OPEC, the Organization of the Petroleum Exporting Countries, was formed in 1960 and consists of 13 member states, including Saudi Arabia, Iraq, and Venezuela. These countries collectively hold over 80% of the world’s proven oil reserves and produce about 40% of the world’s crude oil, with their exports accounting for around 60% of global petroleum trade. OPEC aims to coordinate petroleum policies among its members to stabilize oil prices, ensure supply, and provide a fair return on investment.

The OECD (Organisation for Economic Co-operation and Development) is made up of industrialized countries, such as the United States and much of Europe. While these economies consume less oil than non-OECD countries, they still have a significant impact on global oil demand.

data("gapminder")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
# Save the gapminder dataset to the working directory with write_csv
write_csv(gapminder, "gapminder.csv", na="")

Group the regions using the %in% operator: The West, East Asia, Latin America, Sub-Saharan Africa, and Others

The mutate function adds a new column, group, based on the conditions specified in the case_when function. The conditions are checked in order, and the first one that matches determines the group:

  • If the region is in the west vector, the group is “The West”.

  • If the region is in “Eastern Asia” or “South-Eastern Asia”, the group is “East Asia”.

  • If the region is in “Caribbean”, “Central America”, or “South America”, the group is “Latin America”.

  • If the continent is “Africa” and the region is not “Northern Africa”, the group is “Sub-Saharan Africa”.

  • For all other cases, the group is “Others”.
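The rules above rely on case_when returning the value for the first condition that matches, with TRUE acting as the fallback. The sketch below illustrates this on a small made-up data frame (the rows and the shortened west vector are hypothetical, chosen only to exercise three branches).

```r
library(dplyr)

# Hypothetical toy data: three rows that exercise different branches
toy <- data.frame(
  continent = c("Africa", "Africa", "Europe"),
  region    = c("Northern Africa", "Western Africa", "Western Europe")
)

# Shortened version of the west vector, for illustration only
west <- c("Western Europe", "Northern America")

toy_grouped <- toy %>%
  mutate(group = case_when(
    region %in% west ~ "The West",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"  # fallback when no earlier condition matched
  ))

toy_grouped$group
# "Others" "Sub-Saharan Africa" "The West"
```

Northern Africa falls through to “Others” because it fails the Sub-Saharan condition and matches none of the earlier ones.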

The second mutate call converts the group column into a factor with explicit levels. Because the vector of names is wrapped in rev, the levels are stored in the reverse of the order written in the code: “Sub-Saharan Africa”, “East Asia”, “Latin America”, “Others”, “The West”.
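The level order that rev produces can be checked directly in base R. This is a small sketch; the sample values in x are hypothetical and only serve as something to convert.

```r
groups <- c("The West", "Others", "Latin America",
            "East Asia", "Sub-Saharan Africa")

# Hypothetical sample values, just to have something to convert
x <- c("East Asia", "The West", "Others")

# Same construction as in the mutate call: levels are the reversed vector
f <- factor(x, levels = rev(groups))

levels(f)
# "Sub-Saharan Africa" "East Asia" "Latin America" "Others" "The West"
```

ggplot2 follows the factor's level order when it lays out a discrete legend, so this is the order in which the groups appear there.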

# Load the Gapminder Data into the global environment
data("gapminder")
# Define the western regions: the character vector west lists the regions considered part of "The West"
west <- c("Western Europe","Northern Europe","Southern Europe", 
 "Northern America","Australia and New Zealand")
# Mutate the Dataset to Add a Group Column
gapminder2 <- gapminder %>%
 mutate(group = case_when(
 region %in% west ~ "The West", 
 region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia", 
 region %in% c("Caribbean", "Central America", "South America") ~ "Latin America", 
 continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa", 
 TRUE ~ "Others"))
# Set Factor Levels for the Group Column
gapminder2 <- gapminder2 %>%
 mutate(group = factor(group, levels = rev(c("The West", "Others", "Latin America",
                                             "East Asia","Sub-Saharan Africa"))))

Compare infant mortality to life expectancy

I adjusted the geom_text arguments from the class notes so that the year label fits inside each panel. Within the aes function of ggplot, I also added a text aesthetic containing the information to display when hovering over a point (country, infant mortality, life expectancy, GDP, and fertility).
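To see what a single tooltip string looks like, the paste call can be run on its own. The values below are hypothetical placeholders, not taken from the dataset, and only three of the five fields are shown to keep the sketch short.

```r
# Hypothetical values for a single country-year, not taken from the dataset
country <- "Exampleland"
infant_mortality <- 12.3
life_expectancy <- 74.5

# paste joins its arguments with a space; "<br>" becomes a line break
# when plotly renders the tooltip as HTML
tooltip <- paste("Country:", country, "<br>",
                 "Infant Mortality:", infant_mortality, "<br>",
                 "Life Expectancy:", life_expectancy)

tooltip
# "Country: Exampleland <br> Infant Mortality: 12.3 <br> Life Expectancy: 74.5"
```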

# Filter and transform the data
gapminder_plot <- gapminder2 %>%
  filter(year %in% c(1960, 2000, 2015) & !is.na(group) &
           !is.na(infant_mortality) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = population / 10^6) %>%
  # Create the scatter plot; the text aesthetic supplies the plotly tooltip
  ggplot(aes(x = infant_mortality, y = life_expectancy, col = group,
             size = population_in_millions,
             text = paste("Country:", country, "<br>",
                          "Infant Mortality:", infant_mortality, "<br>",
                          "Life Expectancy:", life_expectancy, "<br>",
                          "GDP:", gdp, "<br>",
                          "Fertility:", fertility))) +
  # Assign a fixed color to each group
  scale_color_manual(values = c("Sub-Saharan Africa" = "brown", "East Asia" = "violet",
                                "Latin America" = "orange", "Others" = "green",
                                "The West" = "lightblue")) +
  # Add scatter plot points with semi-transparency (alpha = 0.6)
  geom_point(alpha = 0.6) +
  # Remove the legend for the size aesthetic ("none" replaces the deprecated FALSE)
  guides(size = "none") +
  # Customize the plot appearance, removing the legend title and centering the plot title
  theme(legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  # Set the y-axis limits to range from 30 to 85
  coord_cartesian(ylim = c(30, 85)) +
  # Set the labels for the x and y axes
  xlab("Infant Mortality Rate") +
  ylab("Life Expectancy") +
  # Add a title to the plot
  ggtitle("Infant Mortality Rate and Life Expectancy Relationship") +
  # Add the year as a large grey label (size = 10) at x = 150, y = 83
  geom_text(aes(x = 150, y = 83, label = year), size = 10, color = "grey") +
  # Create a separate panel for each year (1960, 2000, and 2015)
  facet_grid(. ~ year) +
  # Remove the facet strip background and text, and place the legend at the bottom
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
        legend.position = "bottom")
gapminder_plot

Convert ggplot to plotly for interactivity

The ggplotly function converts the ggplot object to a plotly object. Setting tooltip = “text” tells it to show only the custom text aesthetic in the tooltips.

gapminder_plotly <- ggplotly(gapminder_plot, tooltip = "text")

gapminder_plotly

Description

The scatter plot illustrates the relationship between infant mortality rate and life expectancy for the years 1960, 2000, and 2015. Points are colored by the group variable, which represents the regional groups defined earlier, and sized by population in millions, giving a visual cue for each country’s relative population. Faceting by year makes it easy to compare the three years side by side. Hovering over the points suggests that countries with higher life expectancy also tend to have lower fertility rates and higher GDP.

We observe the following trends over time:

  • 1960: There is a clear negative relationship between infant mortality rate and life expectancy. Countries with higher infant mortality rates tend to have lower life expectancy. The data points are more spread out, indicating a wide variation in both metrics.
  • 2000: The negative relationship persists, but the spread of data points is somewhat narrower compared to 1960. Some regions have improved (lower infant mortality and higher life expectancy), but disparities still exist.
  • 2015: The relationship remains negative but with a more pronounced clustering of countries having low infant mortality rates and higher life expectancy. This indicates overall global improvements in health metrics.
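The trends above are read off the plot by eye. One way to put numbers on them is to compute the correlation and spread for each year. This is a sketch, assuming dslabs and dplyr are installed; the summary column names are my own, and no output is shown because the values depend on the installed version of the data.

```r
library(dslabs)
library(dplyr)

data("gapminder")

# Correlation between the two measures, and the spread of infant
# mortality, for each of the three years shown in the plot
cor_by_year <- gapminder %>%
  filter(year %in% c(1960, 2000, 2015),
         !is.na(infant_mortality), !is.na(life_expectancy)) %>%
  group_by(year) %>%
  summarize(
    n_countries = n(),
    correlation = cor(infant_mortality, life_expectancy),
    sd_infant_mortality = sd(infant_mortality)
  )

cor_by_year
```

A negative correlation in every row would support the negative relationship described in each bullet, and a shrinking standard deviation would support the narrowing spread.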

We can also observe regional differences:

  • Sub-Saharan Africa, represented by brown points, consistently shows higher infant mortality rates and lower life expectancy across all three years, though there is improvement from 1960 to 2015.
  • East Asia, violet points, tends to cluster at lower infant mortality rates and higher life expectancy, particularly evident in 2000 and 2015.
  • Latin America, orange points, shows moderate improvements over time with a shift towards lower infant mortality and higher life expectancy.
  • Others, green points, shows significant improvement over time, with many countries achieving low infant mortality rates and high life expectancy by 2015.
  • The West, light blue points, consistently shows low infant mortality and high life expectancy across all three years.

The size of the points represents population in millions. Larger circles indicate countries with larger populations. Notably, in 1960, some countries with very high infant mortality rates and low life expectancy also have large populations, highlighting regions with significant public health challenges.

By 2015, the larger population points tend to be in regions with lower infant mortality and higher life expectancy, reflecting improvements in populous countries.

The plot shows a general trend of improvement over time, with countries moving towards lower infant mortality rates and higher life expectancy from 1960 to 2015. The clustering of points in the 2015 panel suggests that many countries have achieved significant health improvements, though some regions still lag behind.

Overall, the plot effectively communicates the relationship between infant mortality rate and life expectancy over time, highlighting regional disparities and improvements in global health metrics.