In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/acatlin/SPRING2024TIDYVERSE
FiveThirtyEight.com datasets.
Kaggle datasets.
Your task here is to Extend an Existing Example. Using one of your classmate’s examples (as created above), extend his or her example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve extended your classmate’s vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.
Note: the Tidyverse Extension appears below the Create section.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reactable)
library(purrr)
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
For this tidyverse assignment, we were asked to pick a dataset from fivethirtyeight.com or Kaggle and use one of the tidyverse packages to create a vignette. My dataset is the World Happiness Report, taken from Kaggle.
What is the purrr package?
purrr is a popular R package that provides a consistent and powerful set of tools for working with functions and vectors. It was developed by Hadley Wickham, is part of the tidyverse suite of packages, and is essential for functional programming in R. According to purrr.tidyverse.org, purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions, which allow you to replace many for loops with code that is both more succinct and easier to read. The best place to learn about the map() functions is the iteration chapter in R for Data Science.
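To make the "replace a for loop" idea concrete, here is a minimal sketch. The `scores` list is hypothetical toy data, not part of the happiness dataset; it only shows that a `map_dbl()` call can produce the same result as an explicit loop.

```r
library(purrr)

# Hypothetical toy input: three numeric vectors of different lengths
scores <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, NA, 7))

# for-loop version: preallocate the result, then fill it element by element
loop_means <- numeric(length(scores))
for (i in seq_along(scores)) {
  loop_means[i] <- mean(scores[[i]], na.rm = TRUE)
}
names(loop_means) <- names(scores)

# map_dbl() version: one call that returns a named numeric vector
map_means <- map_dbl(scores, ~ mean(.x, na.rm = TRUE))

print(map_means)
```

Both versions compute one mean per list element; the `map_dbl()` call also keeps the element names without extra bookkeeping.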
In the step below, I import the World Happiness dataset from my GitHub account (URL: https://github.com/jnaval88/DATA607/blob/fc9b840efccb9a4f2743a21e3217acef8cb85cf1/Tidyverse_Assignment/world-happiness-report.csv).
worldhappiness <- read.csv(file = "https://raw.githubusercontent.com/jnaval88/DATA607/main/Tidyverse_Assignment/world-happiness-report.csv")
First I will filter the data for a specific year.
# year is read in as a number, so compare it with a number rather than a string
worldhappiness2020 <- worldhappiness %>%
  filter(year == 2020)
I filtered the data for the year 2020, which means I will be looking only at observations from that year.
For this step, I calculate the average healthy life expectancy at birth for the year 2020.
mean(worldhappiness2020$Healthy.life.expectancy.at.birth, na.rm = TRUE)
## [1] 67.09957
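Now I use a mapping function from the purrr package on the World Happiness dataset filtered to 2020, looking at healthy life expectancy at birth. Because map_dbl() applies mean() to each element of the vector separately, and the mean of a single number is that number, the output is simply each country's own value (with NA where data is missing).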
worldhappiness2020$Healthy.life.expectancy.at.birth %>% map_dbl(mean)
## [1] 69.30 69.20 74.20 73.60 69.70 65.30 72.40 55.10 64.20 68.40 66.80 67.20
## [13] 62.40 54.30 74.00 70.10 69.90 68.30 71.40 74.10 71.30 73.00 66.40 69.10
## [25] 62.30 66.70 69.00 59.50 72.10 74.20 64.10 72.80 58.00 72.80 NA 68.40
## [37] 73.00 60.90 66.60 61.40 72.50 73.70 74.00 50.70 75.20 67.20 65.80 61.30
## [49] NA 64.70 59.50 67.40 68.50 72.20 67.00 68.90 66.40 62.70 68.90 66.50
## [61] 59.60 57.10 72.50 73.60 50.50 65.56 73.40 62.10 70.10 72.80 65.10 66.90
## [73] 69.00 69.50 71.70 57.30 74.20 75.00 72.80 74.70 NA 64.70 58.50 67.60
## [85] 67.50 67.60 56.50 65.20 67.50 72.70 68.10 69.20 66.90 56.30 56.80
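The output above matches the input values one for one. The following sketch, using hypothetical toy values and a made-up `regions` grouping (neither comes from the dataset), contrasts element-wise mapping with mapping over groups, where `map_dbl()` yields one summary per group.

```r
library(purrr)

# Hypothetical values standing in for one column of the 2020 data
life_exp <- c(69.3, 69.2, NA, 74.2)

# mean() of a single number is that number, so the input comes back unchanged
per_element <- map_dbl(life_exp, mean)

# A single summary value needs one call on the whole vector
overall <- mean(life_exp, na.rm = TRUE)

# Hypothetical grouping: split the vector first, then map over the groups
regions <- c("A", "A", "B", "B")
by_region <- map_dbl(split(life_exp, regions), ~ mean(.x, na.rm = TRUE))

print(by_region)
```

Mapping becomes useful once each element is itself a collection, such as a group of rows or a column, rather than a single number.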
For this step, I use the same map function and extend it to multiple columns of the full dataset (all years).
worldhappiness %>%
  select("Healthy.life.expectancy.at.birth", "Freedom.to.make.life.choices") %>%
  map(~ mean(., na.rm = TRUE))
## $Healthy.life.expectancy.at.birth
## [1] 63.35937
##
## $Freedom.to.make.life.choices
## [1] 0.7425576
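The list output above can also be returned in more compact shapes. This is a sketch on a hypothetical three-row `toy` data frame that reuses the two column names. It shows `map_dbl()` returning a named numeric vector and `summarise(across())` returning a one-row data frame; both are alternatives, not the approach used in the original vignette.

```r
library(dplyr)
library(purrr)

# Hypothetical toy data reusing the two column names from the dataset
toy <- data.frame(
  Healthy.life.expectancy.at.birth = c(60, 70, NA),
  Freedom.to.make.life.choices = c(0.5, 0.7, 0.9)
)

# map_dbl() iterates over columns and returns a named numeric vector
col_means <- toy %>% map_dbl(~ mean(.x, na.rm = TRUE))

# dplyr alternative: the same column means as a one-row data frame
col_means_df <- toy %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

print(col_means)
print(col_means_df)
```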
Below I use the map function a bit more. I split the original data frame by year and run a linear model on each year. I then apply the summary function to the results from each model, and use the map function again to obtain the r.squared value for each year.
worldhappiness %>%
  split(.$year) %>%
  map(~ lm(`Healthy.life.expectancy.at.birth` ~ `Log.GDP.per.capita`, data = .)) %>%
  map(summary) %>%
  map_df("r.squared") %>%
  reactable()
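The same split, model, summarise pattern can be tried without downloading anything. The sketch below swaps in the built-in `mtcars` data (an assumption made purely for illustration) and uses `map_dbl()` instead of `map_df()`, so the result is a named numeric vector with one R-squared value per group.

```r
library(purrr)

# Built-in mtcars stands in for the happiness data; cyl plays the role of year
r_squared <- split(mtcars, mtcars$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%   # one linear model per group
  map(summary) %>%                     # one model summary per group
  map_dbl("r.squared")                 # pull the r.squared element out of each

print(r_squared)
```

Passing a string such as `"r.squared"` to a map function extracts the element with that name from each list item, which is why no anonymous function is needed in the last step.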
In this vignette, I used the map functions from the purrr package in the tidyverse to show how to apply a function across the elements of a vector, data frame, or list.
library(tidyverse)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
This code sparked my interest in many different dynamics regarding the people of the world and their happiness. I wanted to create visualizations to bring these ideas to life and understand how different dynamics may correlate and what conclusions can be drawn from such correlations.
I first wondered what the different life expectancies of people around the world were, what ranges they fell into, and how many countries fell into each range. To find out, I used ggplot2 from the tidyverse to create a density plot of life expectancy in the most recent year of data we have, which is 2020.
density_plot_2020 <- ggplot(worldhappiness2020, aes(x = Healthy.life.expectancy.at.birth)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Density Plot of Life Expectancy in 2020",
       x = "Life Expectancy at Birth",
       y = "Density")
density_plot_2020
## Warning: Removed 3 rows containing non-finite values (`stat_density()`).
This was a great visualization. From it I concluded that most countries had a healthy life expectancy at birth between about 65 and 72 years. The density peaks near 68, which means that values around 68 years were the most common across countries in 2020.
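Reading a peak off a density plot by eye can be cross-checked numerically. This is a minimal sketch on hypothetical values (not the real 2020 column), using base R's `density()` to locate the peak and `quantile()` to bracket the middle half of the data.

```r
# Hypothetical life expectancy values, including one missing entry
life_exp <- c(55, 60, 64, 66, 67, 68, 68, 69, 70, 72, 74, NA)

# Kernel density estimate; na.rm = TRUE drops the missing value
d <- density(life_exp, na.rm = TRUE)

# x position where the estimated density is highest
peak <- d$x[which.max(d$y)]

# Range covering the middle 50% of the values
middle_half <- quantile(life_exp, c(0.25, 0.75), na.rm = TRUE)

print(peak)
print(middle_half)
```

Applied to `worldhappiness2020$Healthy.life.expectancy.at.birth`, the same calls would give a number to compare against the visual estimate of the peak.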
This gave me useful insight into life expectancy around the world, but I also wanted to explore the correlation between GDP per capita and happiness in every country. I originally did this using only the data from 2020, but I wanted to see whether the results were consistent across the years. I wanted the graph to be interactive so that I could look at the scores year by year, so I used the plotly library to create an interactive visualization in which the plots can be viewed one year at a time.
# Years present in the data. plotly draws one trace per factor level, in this sorted order.
# Assumption: every year keeps at least one complete row, so traces and years line up.
years <- sort(unique(worldhappiness$year))

# One dropdown button per year that shows only that year's trace.
# Built from the data so that every year through 2020 gets a button,
# instead of ten hard-coded entries labelled 2005 to 2014.
year_buttons <- map(years, function(yr) {
  list(method = "restyle",
       args = list("visible", as.list(years == yr)),
       label = as.character(yr))
})

# Button that makes every trace visible again
all_button <- list(method = "restyle",
                   args = list("visible", as.list(rep(TRUE, length(years)))),
                   label = "All")

Happiness_vs_GDP_Plot <- plot_ly(worldhappiness, x = ~Log.GDP.per.capita, y = ~Life.Ladder,
                                 color = ~as.factor(year), type = "scatter", mode = "markers") %>%
  layout(title = "Happiness Score vs. GDP per Capita (All Years)",
         xaxis = list(title = "Log GDP per Capita"),
         yaxis = list(title = "Happiness Score"),
         colorway = c("#636EFA", "#EF553B", "#00CC96", "#AB63FA", "#FFA15A",
                      "#19D3F3", "#FF6692", "#B6E880", "#FF97FF", "#FECB52"),
         hovermode = "closest",
         updatemenus = list(
           list(
             buttons = c(list(all_button), year_buttons),
             direction = "down",
             showactive = TRUE,
             x = 0.1,
             xanchor = "left",
             y = 1.1,
             yanchor = "top"
           )
         )
  )
Happiness_vs_GDP_Plot
## Warning: Ignoring 36 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
This visualization gave me many insights into the relationship between a country’s GDP per capita and its happiness. I first noticed that the scatter plot for every year shows an upward trend from lower left to upper right, which indicates that as a country’s GDP per capita rises, its happiness score also tends to rise. I found the clearest comparison by looking at two years about 10 years apart: 2010 and 2020 showed much the same upward pattern, which supports my theory. ggplot2 was great for creating the initial static plot, and plotly let me make the scatter plot interactive.
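The visual comparison between years can be backed by a number. The sketch below uses a hypothetical `toy` data frame that reuses the column names `year`, `Log.GDP.per.capita`, and `Life.Ladder` with made-up values. It splits by year and computes one Pearson correlation per year with `map_dbl()`.

```r
library(purrr)

# Hypothetical toy data: four countries in each of two years
toy <- data.frame(
  year = rep(c(2010, 2020), each = 4),
  Log.GDP.per.capita = c(7, 8, 9, 10, 7.5, 8.5, 9.5, 10.5),
  Life.Ladder = c(4.0, 5.1, 5.9, 7.2, 4.2, 5.0, 6.3, 7.0)
)

# One correlation coefficient per year; complete.obs skips rows with NA
cor_by_year <- split(toy, toy$year) %>%
  map_dbl(~ cor(.x$Log.GDP.per.capita, .x$Life.Ladder, use = "complete.obs"))

print(cor_by_year)
```

Replacing `toy` with `worldhappiness` would give a per-year coefficient to set beside the scatter plots; values near 1 would correspond to the upward pattern described above.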