Data 607 - TidyVerse CREATE assignment

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/acatlin/FALL2022TIDYVERSE

FiveThirtyEight.com datasets.

Kaggle datasets.

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Load libraries

library(ggplot2)
library(gganimate)
library(dplyr)
library(gifski)
library(ggthemes)

Read csv file

world_energy <- read.csv("https://raw.githubusercontent.com/saniewor/MSDS/main/datasets/World%20Energy%20Consumption.csv")

I wanted to see how US consumption of renewable energy, specifically solar and wind, has changed over time, using an animated plot. The original data can be found at: https://www.kaggle.com/datasets/pralabhpoudel/world-energy-consumption

Regular ggplot using scale limits and theme

usa_solar <- world_energy %>%
  filter(iso_code == "USA") %>%
  ggplot(aes(year, solar_consumption, group = 1)) +
  geom_point(na.rm = TRUE, color = "blue") +
  geom_line(na.rm = TRUE, color = "green") +
  labs(title = "USA solar consumption over years", x = "Year", y = "Solar consumption") +
  theme_stata() +
  xlim(2000, 2025)
usa_solar

Transition over time using gganimate

animate(usa_solar + transition_reveal(year),fps = 5, duration = 15, height = 500, width = 675)

## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

Looking at wind

usa_wind <- world_energy %>%
  filter(iso_code == "USA") %>%
  ggplot(aes(year, wind_consumption)) +
  geom_point(na.rm = TRUE, color = "yellow") +
  geom_line(na.rm = TRUE, color = "red") +
  labs(title = "USA wind consumption over years", x = "Year", y = "Wind consumption") +
  theme_economist() +
  xlim(1990, 2025)
usa_wind

Transition over time using gganimate

animate(usa_wind + transition_reveal(year), fps = 5, duration = 15, height = 500, width = 675)

## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

Extended by Daria Dubovskaia

Dplyr package

The dplyr package is one of the most useful parts of the tidyverse. It is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In addition to filter(), used above, we will consider rename(), select(), distinct(), group_by(), arrange(), and summarise(). dplyr also re-exports the pipe operator %>% from magrittr, so the result of one step is “piped” into the next: df %>% f(y) is equivalent to f(df, y).
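The equivalence between df %>% f(y) and f(df, y) can be checked directly. The following is a minimal sketch on a hypothetical toy data frame (the name df and its values are illustrative assumptions, not part of the world energy data):

```r
library(dplyr)

# Hypothetical toy data, not part of the world energy dataset
df <- data.frame(x = c(3, 1, 2))

# The pipe passes df as the first argument of arrange(),
# so these two calls do the same thing
piped  <- df %>% arrange(x)
nested <- arrange(df, x)

identical(piped, nested)
```

The piped form reads left to right, which becomes more helpful as more steps are chained.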

Rename()

rename() changes the names of individual variables using new_name = old_name syntax. The world energy data has 122 columns, many with names that are hard to interpret, so we rename a few of them for clarity.

world_energy_rename <- world_energy %>%
  rename("primary_wind_consumption_twh" = "wind_consumption",
         "share_cons_coal_energy" = "coal_share_energy",
         "low_carbon_elect_cons_percapita" = "low_carbon_energy_per_capita",
         "gross_domestic_product" = "gdp")

Select(), Distinct()

select() keeps only the columns you name, which is useful when we need a few columns rather than the entire data frame. The first argument is the data frame/tibble; the remaining arguments are one or more unquoted expressions separated by commas.
distinct() keeps only the unique rows of a data frame. The first argument is a data frame/tibble; the remaining arguments are optional variables to use when determining uniqueness.
We will select the year column to see which years the data covers, and the country column to see which countries are included. The data covers 121 years, from 1900 to 2020, and has 242 distinct values in the country column. As the output below shows, these include regional aggregates such as Africa, not only individual countries.

year <- world_energy_rename %>% select(year)  %>% distinct()
summary(year)

##       year     
##  Min.   :1900  
##  1st Qu.:1930  
##  Median :1960  
##  Mean   :1960  
##  3rd Qu.:1990  
##  Max.   :2020

head(year)

##   year
## 1 1900
## 2 1901
## 3 1902
## 4 1903
## 5 1904
## 6 1905

country <- world_energy_rename %>% select(country)  %>% distinct()
summary(country)

##    country         
##  Length:242        
##  Class :character  
##  Mode  :character

head(country)

##          country
## 1    Afghanistan
## 2         Africa
## 3        Albania
## 4        Algeria
## 5 American Samoa
## 6         Angola
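distinct() also accepts the optional variables mentioned above. The following is a minimal sketch on a hypothetical toy data frame (toy and its values are illustrative assumptions), showing uniqueness determined by a combination of two columns:

```r
library(dplyr)

# Hypothetical toy data, not part of the world energy dataset
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2000, 2000, 2001),
  value   = c(1, 2, 3, 4)
)

# Unique country-year combinations; the other columns are dropped
pairs <- toy %>% distinct(country, year)

# .keep_all = TRUE keeps all columns, using the first row of each combination
first_rows <- toy %>% distinct(country, year, .keep_all = TRUE)

pairs
first_rows
```

Passing variables to distinct() is equivalent in effect to the select() followed by distinct() pattern used above, while .keep_all = TRUE is useful when the remaining columns are still needed.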

Group_by(), Arrange(), Summarise()

group_by() takes an existing tbl and converts it into a grouped data frame/tibble where operations are performed “by group”. Its arguments are a data frame/tibble, the variables or computations to group by, .add (the default FALSE overrides existing groups; TRUE adds to them), and .drop (whether to drop groups formed by factor levels that don’t appear in the data).
arrange() orders the rows by the values of the selected columns. Its arguments are .data (a data frame/tibble), variables or functions of variables (use desc() to sort in descending order), and .by_group (TRUE sorts by the grouping variables first).
summarise() (also spelled summarize()) collapses each group into a summary: the result typically has one row for each combination of grouping variables. Its arguments are .data, name-value pairs of summary functions (the name becomes the column name in the result), and .groups (the grouping structure of the result).
We will check total coal production over the 121 years for every entry in the data. First we group by country, then we sum coal_production within each group (ignoring missing values), and finally we arrange the totals in descending order. China ranks first among individual countries; the World and Asia Pacific rows above it are regional aggregates that the country column also contains.

world_energy_rename %>%
  group_by(country) %>%
  summarize(coal_production_total = sum(coal_production, na.rm = TRUE)) %>%
  arrange(desc(coal_production_total))

## # A tibble: 242 x 2
##    country        coal_production_total
##    <chr>                          <dbl>
##  1 World                       1260113.
##  2 Asia Pacific                 690240.
##  3 China                        480890.
##  4 North America                476268.
##  5 United States                450350.
##  6 Europe                       421567.
##  7 Russia                       176007.
##  8 United Kingdom               115912.
##  9 Germany                      113268.
## 10 CIS                          100362.
## # ... with 232 more rows
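The .add, .groups, and .by_group arguments described above are not used in the pipeline itself. The following is a minimal sketch of what each one does, on a hypothetical toy data frame (toy, coal, and the values are illustrative assumptions, not taken from the world energy data):

```r
library(dplyr)

# Hypothetical toy data, not part of the world energy dataset
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  coal    = c(5, 7, 1, NA)
)

# .add = TRUE keeps the existing grouping (country) and adds year to it
by_both <- toy %>%
  group_by(country) %>%
  group_by(year, .add = TRUE)
group_vars(by_both)

# .groups = "drop" returns an ungrouped tibble after summarising
totals <- toy %>%
  group_by(country) %>%
  summarise(coal_total = sum(coal, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(coal_total))
totals

# .by_group = TRUE sorts by the grouping variable first, then by coal
sorted <- toy %>%
  group_by(country) %>%
  arrange(desc(coal), .by_group = TRUE)
sorted
```

Setting .groups explicitly makes the grouping of the result clear to the reader, which matters most when grouping by more than one variable.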

Conclusion by Daria Dubovskaia

In the extended part of the assignment, further dplyr functions were demonstrated: rename(), select(), distinct(), group_by(), arrange(), and summarise(). The tidyverse is a convenient toolkit for transforming data ahead of further analysis. There is more to discover in each of the packages used above, and the tidyverse also contains packages not covered in this “vignette”, such as tidyr, tibble, and purrr.