In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/acatlin/FALL2022TIDYVERSE
FiveThirtyEight.com datasets.
Kaggle datasets.
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Load libraries
library(ggplot2)
library(gganimate)
library(dplyr)
library(gifski)
library(ggthemes)
world_energy <- read.csv("https://raw.githubusercontent.com/saniewor/MSDS/main/datasets/World%20Energy%20Consumption.csv")
I wanted to view the US energy usage of renewables such as solar and wind and the change over time using an animated plot. Original data can be found at the link : https://www.kaggle.com/datasets/pralabhpoudel/world-energy-consumption
Regular ggplot using scale limits and theme
usa_solar <- world_energy%>%
filter(iso_code == "USA") %>%
ggplot(aes(year, solar_consumption, group = 1)) + geom_point(na.rm=TRUE, color = "blue") + geom_line(na.rm=TRUE, color = "green")+
labs(title = "USA solar consumption over years", x = "Year", y = "Solar consumption")+theme_stata()+
xlim(2000,2025)
usa_solar
Transition over time using gganimate
animate(usa_solar + transition_reveal(year),fps = 5, duration = 15, height = 500, width = 675)
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
Looking at wind
usa_wind <- world_energy%>%
filter(iso_code == "USA") %>%
ggplot(aes(year, wind_consumption)) + geom_point(na.rm=TRUE, color = "yellow") + geom_line(na.rm=TRUE, color = "red")+
labs(title = "USA wind consumption over years",x = "Year", y = "Wind consumption")+ theme_economist()+
xlim(1990,2025)
usa_wind
Transition over time using gganimate
animate(usa_wind + transition_reveal(year), fps = 5, duration = 15, height = 500, width = 675, )
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
Dplyr package is one of the most useful part of the tidyverse library. Dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. In addition to the above functions, we will consider: rename(), select(), distinct(), mutate(), group_by(), arrange(), summarise(). The dplyr provides the pipe %>% operator, so the result from one step is then “piped” into the next step df %>% f(y).
Rename() changes the names of individual variables using new_name = old_name syntax. The data on world energy consists of 122 columns with names that are difficult to understand. Some of the columns we can rename for better understanding.
world_energy_rename <- world_energy %>%
rename("primary_wind_consumption_twh" = "wind_consumption",
"share_cons_coal_energy" = "coal_share_energy", "low_carbon_elect_cons_percapita"="low_carbon_energy_per_capita", "gross_domestic_product" = "gdp")
Select function changes whether or not a column is included in case we need only several columns instead of the entire data frame, the first argument is the data frame/tibble, the further arguments are one or more unquoted expressions separated by commas.
Distinct function select only unique/distinct rows from a data frame.The first argument is a data frame/tibble, the further arguments are optional variables to use when determining uniqueness.
We will select column name with year to see what years are included in the data as well as what countries are mentioned. The data included the observations on 242 different countries over 121 years, from 1900 to 2020.
year <- world_energy_rename %>% select(year) %>% distinct()
summary(year)
## year
## Min. :1900
## 1st Qu.:1930
## Median :1960
## Mean :1960
## 3rd Qu.:1990
## Max. :2020
head(year)
## year
## 1 1900
## 2 1901
## 3 1902
## 4 1903
## 5 1904
## 6 1905
country <- world_energy_rename %>% select(country) %>% distinct()
summary(country)
## country
## Length:242
## Class :character
## Mode :character
head(country)
## country
## 1 Afghanistan
## 2 Africa
## 3 Albania
## 4 Algeria
## 5 American Samoa
## 6 Angola
Group_by() takes an existing tbl and converts it into a grouped data frame/tibble where operations are performed “by group”. Arguments are a data frame/tibble, variables or computations to group by, .add (FALSE will override existing groups, .add = TRUE will add to the existing groups), .drop (drop groups formed by factor levels that don’t appear in the data).
Arrange() function changes the order of the rows by the values of selected columns. The arguments are .data (data frame/tibble), variables, or functions of variables (use desc() to sort a variable in descending order), .by_group (TRUE will sort first by grouping variable).
Summarise() collapses a group into a single row, it will have one (or more) rows for each combination of grouping variables. The arguments are .data, name-value pairs of summary functions. The name will be the name of the variable in the result, .groups (grouping structure of the result).
We will check the total coal production for over 121 years in the all the countries. First, we group by the iso_code, then we will summarize by the coal_production to find the sum of coal production over 121 in each country, and arrange in desc order. Chine is the first over other countries.
world_energy_rename %>%
group_by(country) %>%
summarize(coal_production_total=sum(coal_production,na.rm = TRUE)) %>%
arrange(desc(coal_production_total))
## # A tibble: 242 x 2
## country coal_production_total
## <chr> <dbl>
## 1 World 1260113.
## 2 Asia Pacific 690240.
## 3 China 480890.
## 4 North America 476268.
## 5 United States 450350.
## 6 Europe 421567.
## 7 Russia 176007.
## 8 United Kingdom 115912.
## 9 Germany 113268.
## 10 CIS 100362.
## # ... with 232 more rows
In the extended part of the assignment, some other functions of the dplyr package were added such as rename(), select(), distinct(), group_by(), arrange(), summarise(). Tidyverse package is the best tool to transform the data for the further analysis. There are more to discover within each of the libraries mentioned above as well as tidyverse package contains a lot of other libraries that were not mentioned in the current “vignette” such as tidyr, tibble, purrr.