Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.

  1. Get dataset from 538.

  2. tidyr separate function:

  3. dplyr select function:
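
The tidyr and dplyr steps in the outline above can be sketched as follows. This is a minimal illustration rather than Euclid’s original code: the data frame weather_toy and its values are made up, and the sketch assumes the date column is stored as text in year-month-day form.

```r
library(tidyr)
library(dplyr)

# Toy stand-in for the 538 weather data; all values are made up for illustration
weather_toy <- data.frame(
  date = c("2014-7-1", "2014-7-2", "2014-8-1"),
  actual_max_temp = c(95, 93, 90),
  record_max_temp = c(101, 103, 104),
  stringsAsFactors = FALSE
)

# separate() splits one character column into several at the given separator
weather_split <- weather_toy %>%
  separate(date, into = c("year", "month", "day"), sep = "-")

# select() keeps only the named columns, in the order given
weather_small <- weather_split %>%
  select(month, actual_max_temp, record_max_temp)
```

Note that separate() returns the new columns as character vectors unless convert = TRUE is passed, which is why a month column produced this way needs as.numeric() before it can be sorted numerically.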


Zach’s Extension - dplyr package


I’ve commented out Euclid’s work to make my extension easier to follow.

Sometimes, it’s valuable to manipulate a dataset into smaller groups, and to summarize information based on column values. We can use TidyVerse packages and built-in functions to accomplish this.

To extend Euclid’s example, I’ll use the dplyr functions group_by(), summarise(), mutate(), and arrange(), together with base R’s mean(), to obtain the average actual maximum temperature by month for 2014:

weather2_group <- weather2 %>% 
  group_by(month) %>%                                    # one group per month
  summarise(max_temp_mean = mean(actual_max_temp)) %>%   # average daily max within each month
  mutate(month = as.numeric(month)) %>% # in order to arrange by month, need to convert to numeric
  arrange(month)

weather2_group
## # A tibble: 12 x 2
##    month max_temp_mean
##    <dbl>         <dbl>
##  1     1          51.3
##  2     2          47.6
##  3     3          66  
##  4     4          72.4
##  5     5          83.9
##  6     6          91.8
##  7     7          87.3
##  8     8          86.0
##  9     9          80.7
## 10    10          75.5
## 11    11          59.4
## 12    12          56

Zach’s Extension - using ggplot2


Additionally, similar to Euclid, I’ll use the ggplot2 package to display data present in the weather dataset.

In our example, we’ll create a faceted bar chart that compares the average actual max temperature for 2014 with the average record max temperature for each month.

Before plotting, we’ll build a second summary table that holds the average record max temperature for each month:

weather2_group_record <- weather2 %>% 
  group_by(month) %>% 
  summarise(record_max_temp_mean = mean(record_max_temp)) %>%
  mutate(month = as.numeric(month)) %>% # in order to arrange by month, need to convert to numeric
  arrange(month)

Then, we’ll combine the mean values from both summary tables into one data frame with base R’s merge(), and use the tidyr gather() function to reshape the result into the long format that ggplot2 expects:

weather_final <- merge(weather2_group, weather2_group_record, by = "month")

weather_final <- weather_final %>% 
  gather(max_temp_mean, record_max_temp_mean, key = '2014_record', value = 'temperature') %>% 
  mutate(`2014_record` = ifelse(`2014_record` == 'max_temp_mean', '2014 average max', 'Record average max')) %>% 
  arrange(month) %>% 
  select(month, `2014_record`, temperature)

months <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")

# Replace each month number with its name in one vectorized step
# (months[1] is "January", months[2] is "February", and so on)
weather_final$month <- months[weather_final$month]

weather_final
##        month        2014_record temperature
## 1    January   2014 average max    51.25806
## 2    January Record average max    74.90323
## 3   February   2014 average max    47.60714
## 4   February Record average max    77.28571
## 5      March   2014 average max    66.00000
## 6      March Record average max    85.29032
## 7      April   2014 average max    72.43333
## 8      April Record average max    90.06667
## 9        May   2014 average max    83.93548
## 10       May Record average max    94.70968
## 11      June   2014 average max    91.80000
## 12      June Record average max   100.06667
## 13      July   2014 average max    87.29032
## 14      July Record average max   101.03226
## 15    August   2014 average max    85.96774
## 16    August Record average max   100.16129
## 17 September   2014 average max    80.66667
## 18 September Record average max    96.93333
## 19   October   2014 average max    75.48387
## 20   October Record average max    88.64516
## 21  November   2014 average max    59.36667
## 22  November Record average max    79.56667
## 23  December   2014 average max    56.00000
## 24  December Record average max    75.35484

Now that the data is ready, we can plot this in a faceted bar graph, by month:

library(ggplot2)

# month_f is not created earlier, so build it here as a factor in calendar order;
# facet_wrap() would otherwise sort the month names alphabetically
weather_final$month_f <- factor(weather_final$month, levels = months)

plot <- ggplot(weather_final, aes(x = ".", y = temperature, fill = `2014_record`))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", position = "dodge", color = "#dddddd")
plot <- plot + facet_wrap(~ month_f)
# round() gives whole-degree labels without the padding that format() adds
plot <- plot + geom_text(aes(label = round(temperature)), vjust = 1.5, position = position_dodge(0.9), color = "#ffffff", fontface = "bold")
plot <- plot + theme(legend.title = element_blank())
plot


  1. dplyr filter (subsetting dataset)

  2. ggplot histogram for actual_mean_temp
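
The two steps in this outline can be sketched as follows. The original code is not shown here, so this is only an illustration under stated assumptions: weather_toy and its values are made up, and the month column is assumed to be text, as it would be after tidyr::separate().

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the weather data; all values are made up for illustration
weather_toy <- data.frame(
  month = c("6", "6", "7", "7", "8"),
  actual_mean_temp = c(78, 81, 85, 83, 80),
  stringsAsFactors = FALSE
)

# filter() keeps only the rows for which every condition is TRUE
july_hot <- weather_toy %>%
  filter(month == "7", actual_mean_temp > 80)

# geom_histogram() counts how many observations fall into each bin
hist_plot <- ggplot(weather_toy, aes(x = actual_mean_temp)) +
  geom_histogram(binwidth = 2) +
  labs(x = "Actual mean temperature", y = "Number of days")
```

Printing hist_plot draws the histogram. The binwidth of 2 degrees is an arbitrary choice for the toy data and would likely need adjusting for a full year of observations.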

Update with new examples of the Tidyverse packages used above (tidyr, dplyr and ggplot2)

tidyr::gather - a function that transforms the values from multiple columns into one column, with the original column names stored as keys in a newly created column.
For example, we can combine actual_mean_temp, actual_min_temp and actual_max_temp into one column:
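
A minimal sketch of that reshaping step, assuming a data frame with the three temperature columns named above. The data frame weather_toy, its values, and the new column names measure and temperature are made up for illustration.

```r
library(tidyr)

# Toy stand-in for the weather data; all values are made up for illustration
weather_toy <- data.frame(
  date = c("2014-7-1", "2014-7-2"),
  actual_mean_temp = c(81, 82),
  actual_min_temp = c(70, 72),
  actual_max_temp = c(91, 93),
  stringsAsFactors = FALSE
)

# gather() stacks the three temperature columns into key/value pairs:
# "measure" holds the old column name, "temperature" holds its value
weather_long <- gather(
  weather_toy,
  key = "measure",
  value = "temperature",
  actual_mean_temp, actual_min_temp, actual_max_temp
)
```

Each original row becomes three rows in weather_long, one per temperature column. gather() still works but is superseded in current tidyr; pivot_longer() is the recommended replacement for new code.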

dplyr::group_by - groups the rows by the values in one or more columns
dplyr::summarise - for each group defined by group_by, generates a summary value computed by a summary function such as mean()
dplyr::arrange - orders the rows by the values in one or more columns
For example, we can group the rows by months and summarize the average of the mean temperature in each month:
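
A minimal sketch of that pipeline, assuming a numeric month column and an actual_mean_temp column. The data frame weather_toy and its values are made up for illustration.

```r
library(dplyr)

# Toy stand-in for the weather data; all values are made up for illustration
weather_toy <- data.frame(
  month = c(8, 7, 7, 8, 1),
  actual_mean_temp = c(80, 84, 86, 82, 40)
)

monthly_mean <- weather_toy %>%
  group_by(month) %>%                              # one group per month
  summarise(mean_temp = mean(actual_mean_temp)) %>% # one row per group
  arrange(month)                                   # sort rows by month number
```

The result has one row per month, sorted from the lowest month number to the highest.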

ggplot2 - the package provides various functions to plot multiple components on the same graph.
For example, the geom_line function plots each (x, y) pair from the supplied x and y values and connects them in order.
The geom_smooth function plots a smoothed line across the points.
Labels can be created with the labs function.
In the following example, we plot the actual mean temperature for all 365 days:
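
A minimal sketch of such a plot. To stay self-contained it uses ten made-up days instead of the full year; weather_toy, its values, and the choice of a loess smoother are assumptions for illustration.

```r
library(ggplot2)

# Toy stand-in for the weather data: ten days of made-up values
weather_toy <- data.frame(
  date = seq(as.Date("2014-07-01"), by = "day", length.out = 10),
  actual_mean_temp = c(81, 82, 80, 79, 83, 85, 84, 86, 83, 82)
)

mean_plot <- ggplot(weather_toy, aes(x = date, y = actual_mean_temp)) +
  geom_line() +                                               # connect the daily values
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + # smoothed trend line
  labs(
    title = "Actual mean temperature",
    x = "Date",
    y = "Temperature"
  )
```

Printing mean_plot draws the chart. Setting se = FALSE hides the confidence band, which is rarely informative on so few points.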

We can also plot multiple lines on the same graph. For example, we can add the actual min temperature and max temperature to the same plot.
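
One way to sketch this is to add one geom_line layer per temperature column. The data frame weather_toy and its values are made up for illustration, and the legend labels are arbitrary.

```r
library(ggplot2)

# Toy stand-in for the weather data: five days of made-up values
weather_toy <- data.frame(
  date = seq(as.Date("2014-07-01"), by = "day", length.out = 5),
  actual_mean_temp = c(81, 82, 80, 79, 83),
  actual_min_temp = c(70, 72, 69, 68, 71),
  actual_max_temp = c(91, 93, 90, 89, 94)
)

# Each layer maps a different column to y; mapping colour to a constant
# label inside aes() makes ggplot2 build a legend entry for that line
temp_plot <- ggplot(weather_toy, aes(x = date)) +
  geom_line(aes(y = actual_mean_temp, colour = "Mean")) +
  geom_line(aes(y = actual_min_temp, colour = "Min")) +
  geom_line(aes(y = actual_max_temp, colour = "Max")) +
  labs(x = "Date", y = "Temperature", colour = "Measure")
```

An alternative design is to reshape the three columns into long format first, for example with tidyr::gather as described above, and then use a single geom_line with the colour mapped to the key column.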