TidyVerse Extend

DATA 607 TidyVerse Extend

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/SPRING2024TIDYVERSE

FiveThirtyEight.com datasets.

Kaggle datasets.

Your task here is to Extend an Existing Example. Using one of your classmate’s examples (as created above), extend his or her example with additional annotated code. (15 points)

You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.

After you’ve extended your classmate’s vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.

You should complete your submission on the schedule stated in the course syllabus.

_________________________________________________________________________________________________________ Tidyverse Extension is below the create section

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(reactable)
library(purrr)

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Introduction

For this tidyverse assignment we were to pick a dataset from fivethirtyeight.com or Kaggle and use one of the tidyverse package to create a vignette. The Article I will be using is from Kraggle and my dataset is World happiness report.

What is the purr package?

Purrr is a popular R Programming package that provides a consistent and powerful set of tools for working with functions and vectors. It was developed by Hadley Wickham and is part of the tidyverse suite of packages. Purrr is an essential package for functional programming in R. According to purrr.tidyverse.org, purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for data science.

Data Import

This step below I will be importing the world happiness dataset from my github account URL: (https://github.com/jnaval88/DATA607/blob/fc9b840efccb9a4f2743a21e3217acef8cb85cf1/Tidyverse_Assignment/world-happiness-report.csv.)

worldhappiness <- read.csv(file = "https://raw.githubusercontent.com/jnaval88/DATA607/main/Tidyverse_Assignment/world-happiness-report.csv")

Data filter and maping

First I will filter the data for a specific year.

worldhappiness2020 <- worldhappiness %>% 
  filter( year == '2020')

I filter the data for year 2020, which mean I will looking at information equivalent that year only.

Calculating the Average

For this step I will calculate the average life expectancy at birth for the year 2020

mean(worldhappiness2020$Healthy.life.expectancy.at.birth, na.rm = TRUE)

## [1] 67.09957

Purrr map function

Now I will be using the mapping function from the purrr package on world hapiness dataset using the year filter 2020, I will be looking at healthy life expectancy at birth.

worldhappiness2020$Healthy.life.expectancy.at.birth %>% map_dbl(mean)

##  [1] 69.30 69.20 74.20 73.60 69.70 65.30 72.40 55.10 64.20 68.40 66.80 67.20
## [13] 62.40 54.30 74.00 70.10 69.90 68.30 71.40 74.10 71.30 73.00 66.40 69.10
## [25] 62.30 66.70 69.00 59.50 72.10 74.20 64.10 72.80 58.00 72.80    NA 68.40
## [37] 73.00 60.90 66.60 61.40 72.50 73.70 74.00 50.70 75.20 67.20 65.80 61.30
## [49]    NA 64.70 59.50 67.40 68.50 72.20 67.00 68.90 66.40 62.70 68.90 66.50
## [61] 59.60 57.10 72.50 73.60 50.50 65.56 73.40 62.10 70.10 72.80 65.10 66.90
## [73] 69.00 69.50 71.70 57.30 74.20 75.00 72.80 74.70    NA 64.70 58.50 67.60
## [85] 67.50 67.60 56.50 65.20 67.50 72.70 68.10 69.20 66.90 56.30 56.80

For this step I am using the same map function and extended it to multiple columns.

worldhappiness %>% 
  select( "Healthy.life.expectancy.at.birth", "Freedom.to.make.life.choices" ) %>% 
  map(~mean(.,na.rm = TRUE))

## $Healthy.life.expectancy.at.birth
## [1] 63.35937
## 
## $Freedom.to.make.life.choices
## [1] 0.7425576

Exploring map function futher more

Below I will use the map function a bit more. I will split the original data frame by year, and run a linear model on each year. I then apply the summary function the results from each model and then again use the map function to obtain the r.squared value for each year.

worldhappiness %>%  
  split(.$year) %>% 
  map(~lm( `Healthy.life.expectancy.at.birth` ~`Log.GDP.per.capita`  , data = .) ) %>% 
  map(summary) %>% 
  map_df("r.squared") %>% 
  
  reactable()

Conclusion

From the purrr package in the tidyverse I use the map function to show how to manipulate vector.

Extension:

library(tidyverse)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

This code sparked my interest in many different dynamics regarding the people of the world and their happiness. I wanted to create visualizations to bring these ideas to life and understand how different dynamics may correlate and what conclusions can be drawn from such correlations.

I first wondered what was the different life expectancy’s of the people around the world and what was the ranges and the amount of countries in such ranges. To obtain such result I decided to utilize the TidyVerse package by using ‘GGPLOT’ to create a density plot of life expectancy in the most recent year of data we have which is 2020.

density_plot_2020 <- ggplot(worldhappiness2020, aes(x = Healthy.life.expectancy.at.birth)) +
geom_density(fill = "green", alpha = 0.5) + 
labs(title = "Density Plot of Life Expectancy in 2020",
  x = "Life Expectancy at Birth",
  y = "Density")

density_plot_2020

## Warning: Removed 3 rows containing non-finite values (`stat_density()`).

This was a great visualization and I was able to conclude from this that most countries had a life expectancy at birth of 65 to about 72 years of age. The most dense age is 68 which means this is what most countries believe life expectancy will be at birth.

This gave me great insights into the world population and life expectancy but I now wanted to explore the correlation between GDP per capita and Happiness in every country. I originally did this using just the data from 2020 but I wanted to see if these results were consistent across the years. I knew that I wanted this graph to be interactive so I can look at these scores year by year. I discovered and decided to use the ‘Plotly’ library which helped me create an interactive visualization in which you could look at these plots year by year.

Happiness_vs_GDP_Plot <- plot_ly(worldhappiness, x = ~Log.GDP.per.capita, y = ~Life.Ladder, color = ~as.factor(year), type = "scatter", mode = "markers") %>%
  layout(title = "Happiness Score vs. GDP per Capita (All Years)",
         xaxis = list(title = "Log GDP per Capita"),
         yaxis = list(title = "Happiness Score"),
         colorway = c("#636EFA", "#EF553B", "#00CC96", "#AB63FA", "#FFA15A", "#19D3F3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"),
         hovermode = "closest",
         updatemenus = list(
           list(
             buttons = list(
               list(method = "restyle",
                    args = list("visible", list(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE)),
                    label = "All"),
               list(method = "restyle",
                    args = list("visible", list(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                    label = "2005"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                    label = "2006"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                    label = "2007"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                    label = "2008"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)),
                    label = "2009"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE)),
                    label = "2010"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)),
                    label = "2011"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE)),
                    label = "2012"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE)),
                    label = "2013"),
               list(method = "restyle",
                    args = list("visible", list(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)),
                    label = "2014")
             ),
             direction = "down",
             showactive = TRUE,
             x = 0.1,
             xanchor = "left",
             y = 1.1,
             yanchor = "top"
           )
         )
  )

Happiness_vs_GDP_Plot

## Warning: Ignoring 36 observations

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Conclusion

This visualization gave me many insights regarding the correlation between a country’s GDP per capita and their happiness. I first noticed that all of the scatter plots for every different year was mostly skewed to left which indicated to me that as a country’s GDP per capita rises, its Happiness score also increases. I found the best results by simply comparing the plots of 2 years which were relatively 10 years apart. I did 2010 and 2020 which pretty much showed the same things which were left skewed scatter plots that support my theory. Using ‘GGPLOT’ was great in creating the initial plot which required me to implement ‘PLOTLY’ to make the scatter plot interactive.

TidyVerse Extend

Biyag Dukuray, James Naval

2024-04-19

DATA 607 TidyVerse Extend

Introduction

Data Import

Data filter and maping

Calculating the Average

Purrr map function

Exploring map function futher more

Conclusion

Extension:

Conclusion