Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset.
Get dataset from 538.
tidyr separate function:
dplyr select function:
I’ve commented out Euclid’s work to make my extension easier to follow.
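For readers who want to run the extension on its own: the code that follows expects a table named weather2 with a month column. Below is a minimal sketch of how such a table could be built with separate() and select(). The rows are made up, and the assumption that the date is stored as text such as "2014-7-1" is mine, so this is an illustration rather than Euclid's original code.

```r
library(dplyr)
library(tidyr)

# Made-up rows (hypothetical values); the date-as-text layout is an assumption.
weather <- tibble(
  date = c("2014-7-1", "2014-7-2", "2014-8-1"),
  actual_mean_temp = c(81, 82, 79),
  actual_min_temp = c(70, 72, 68),
  actual_max_temp = c(91, 92, 90),
  record_max_temp = c(100, 101, 99)
)

# separate() splits one column into several at the "-" separator;
# the new columns are character by default.
weather2 <- weather %>%
  separate(date, into = c("year", "month", "day"), sep = "-") %>%
  # select() keeps only the named columns, in the order given.
  select(year, month, day, actual_mean_temp, actual_min_temp,
         actual_max_temp, record_max_temp)
```

Because separate() returns character columns here, month holds text such as "7", which is why the later code converts it with as.numeric() before sorting.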
Sometimes, it’s valuable to manipulate a dataset into smaller groups, and to summarize information based on column values. We can use TidyVerse packages and built-in functions to accomplish this.
To extend Euclid’s example, I’ll use the group_by(), summarise(), mutate(), and arrange() functions from the dplyr package, together with base R’s mean(), to obtain the average actual maximum temperature by month for 2014:
weather2_group <- weather2 %>%
  group_by(month) %>%
  summarise(max_temp_mean = mean(actual_max_temp)) %>%
  mutate(month = as.numeric(month)) %>% # in order to arrange by month, need to convert to numeric
  arrange(month)
## # A tibble: 12 x 2
## month max_temp_mean
## <dbl> <dbl>
## 1 1 51.3
## 2 2 47.6
## 3 3 66
## 4 4 72.4
## 5 5 83.9
## 6 6 91.8
## 7 7 87.3
## 8 8 86.0
## 9 9 80.7
## 10 10 75.5
## 11 11 59.4
## 12 12 56
summarise() – applies a summary function to a column and returns a new table with one row per group. In our case, it collapsed the daily actual_max_temp values into the new column max_temp_mean.
group_by() – creates a grouped copy of our table, so that the summary is computed separately for each month.
mutate() – creates a new column or overwrites an existing one. In our case, it converted the month column from text to numeric.
arrange() – orders the rows by the values of a column. In our case, we ordered the new table by month number, from 1 to 12.
mean() – takes the mean of a set of values. In our case, it took the mean of the actual_max_temp values within each month. It comes from base R rather than dplyr.
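The as.numeric() conversion matters because month numbers stored as text sort character by character. A small sketch with made-up month values shows the difference:

```r
library(dplyr)

# Month numbers stored as text, as they would be after splitting a date string.
months_chr <- tibble(month = c("1", "2", "10", "11", "12"))

# Text sorting compares character by character, so "10" lands before "2".
chr_order <- months_chr %>%
  arrange(month) %>%
  pull(month)

# Converting to numeric first gives calendar order.
num_order <- months_chr %>%
  mutate(month = as.numeric(month)) %>%
  arrange(month) %>%
  pull(month)
```

Here chr_order is "1", "10", "11", "12", "2", while num_order is 1, 2, 10, 11, 12.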
Additionally, similar to Euclid, I’ll use the ggplot2 package to display data present in the weather dataset.
In our example, we’ll create a faceted bar chart that compares the average actual max temperature for 2014 to the average record max temperature for each month.
Before plotting, we’ll have to create a second summary table that holds the average record max temperature for each month:
weather2_group_record <- weather2 %>%
  group_by(month) %>%
  summarise(record_max_temp_mean = mean(record_max_temp)) %>%
  mutate(month = as.numeric(month)) %>% # in order to arrange by month, need to convert to numeric
  arrange(month)
Then, we’ll combine the mean values from both datasets into one dataset and use the tidyr package to tidy the data into a format conducive to plotting. For this section, I used the gather() function to do this:
weather_final <- merge(weather2_group, weather2_group_record, by = "month")
weather_final <- weather_final %>%
  gather(max_temp_mean, record_max_temp_mean, key = '2014_record', value = 'temperature') %>%
  mutate(`2014_record` = ifelse(`2014_record` == 'max_temp_mean', '2014 average max', 'Record average max')) %>%
  arrange(month) %>%
  select(month, `2014_record`, temperature)
months <- c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December")
# month holds the numbers 1 to 12, so it can index the vector of names directly
weather_final$month <- months[weather_final$month]
weather_final
## month 2014_record temperature
## 1 January 2014 average max 51.25806
## 2 January Record average max 74.90323
## 3 February 2014 average max 47.60714
## 4 February Record average max 77.28571
## 5 March 2014 average max 66.00000
## 6 March Record average max 85.29032
## 7 April 2014 average max 72.43333
## 8 April Record average max 90.06667
## 9 May 2014 average max 83.93548
## 10 May Record average max 94.70968
## 11 June 2014 average max 91.80000
## 12 June Record average max 100.06667
## 13 July 2014 average max 87.29032
## 14 July Record average max 101.03226
## 15 August 2014 average max 85.96774
## 16 August Record average max 100.16129
## 17 September 2014 average max 80.66667
## 18 September Record average max 96.93333
## 19 October 2014 average max 75.48387
## 20 October Record average max 88.64516
## 21 November 2014 average max 59.36667
## 22 November Record average max 79.56667
## 23 December 2014 average max 56.00000
## 24 December Record average max 75.35484
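In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). As a sketch of how the same reshaping step could be written, the example below uses two months of the merged table, with values rounded from the output above. The names weather_wide and weather_long are hypothetical.

```r
library(dplyr)
library(tidyr)

# Two months of the merged table, rounded from the output shown above.
weather_wide <- tibble(
  month = c(1, 2),
  max_temp_mean = c(51.3, 47.6),
  record_max_temp_mean = c(74.9, 77.3)
)

weather_long <- weather_wide %>%
  # names_to plays the role of gather()'s key, values_to the role of value.
  pivot_longer(
    cols = c(max_temp_mean, record_max_temp_mean),
    names_to = "2014_record",
    values_to = "temperature"
  ) %>%
  mutate(`2014_record` = ifelse(`2014_record` == "max_temp_mean",
                                "2014 average max", "Record average max"))
```

pivot_longer() emits the rows month by month, so the two values for each month already sit next to each other without a separate arrange() step.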
Now that the data is ready, we can plot this in a faceted bar graph, by month:
library(ggplot2)
# month_f is not created earlier; assuming it is month as a factor whose levels keep the facets in calendar order
weather_final$month_f <- factor(weather_final$month, levels = months)
plot <- ggplot(weather_final, aes(x = ".", y = temperature, fill= `2014_record`))
plot <- plot + scale_y_continuous()
plot <- plot + theme(legend.position = "right", axis.text.x = element_blank(), axis.ticks.x = element_blank(), axis.title = element_blank())
plot <- plot + geom_bar(stat = "identity", position = "dodge", color="#dddddd")
plot <- plot + facet_wrap(month_f ~ .)
plot <- plot + geom_text(aes(label=format(temperature, digits = 0)), vjust = 1.5, position = position_dodge(0.9), color="#ffffff", fontface="bold")
plot <- plot + theme(legend.title = element_blank())
plot
dplyr filter (subsetting dataset)
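A minimal sketch of subsetting with filter(), using made-up rows in place of the real dataset (weather_toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical rows; the real table has one row per day.
weather_toy <- tibble(
  month = c("6", "7", "7", "8"),
  day = c("30", "1", "2", "1"),
  actual_max_temp = c(95, 91, 88, 90)
)

# Keep only the July rows.
july <- weather_toy %>%
  filter(month == "7")

# Conditions separated by commas are combined with AND.
hot_july <- weather_toy %>%
  filter(month == "7", actual_max_temp > 90)
```

july keeps the two July rows, and hot_july keeps only the single July row above 90.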
ggplot histogram for actual_mean_temp
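A minimal sketch of such a histogram, assuming a data frame with an actual_mean_temp column. The values, the bin width, and the colours below are illustrative choices, not taken from the original.

```r
library(ggplot2)

# Made-up daily mean temperatures (hypothetical values, not the 538 data).
weather_toy <- data.frame(
  actual_mean_temp = c(45, 48, 52, 55, 58, 61, 63, 66, 70, 72, 75, 78, 80, 83)
)

# geom_histogram() counts how many rows fall into each 5-degree bin.
p <- ggplot(weather_toy, aes(x = actual_mean_temp)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(x = "Actual mean temperature", y = "Number of days")
p
```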
Update with new examples of the Tidyverse packages used above (tidyr, dplyr and ggplot2)
tidyr::gather - a function that transforms the values from multiple columns into one column, with the original column names stored as keys in a newly created column.
For example, we can combine actual_mean_temp, actual_min_temp and actual_max_temp into one column:
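A minimal sketch of that step on two made-up days. The table weather_toy and the new column names measure and temperature are hypothetical.

```r
library(dplyr)
library(tidyr)

# Two made-up days (hypothetical values).
weather_toy <- tibble(
  day = c(1, 2),
  actual_mean_temp = c(81, 82),
  actual_min_temp = c(70, 72),
  actual_max_temp = c(91, 92)
)

# The three temperature columns are stacked into one "temperature" column;
# "measure" records which original column each value came from.
weather_gathered <- weather_toy %>%
  gather(actual_mean_temp, actual_min_temp, actual_max_temp,
         key = "measure", value = "temperature")
```

The result has six rows (three measures for each of the two days) and the columns day, measure and temperature.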
dplyr::group_by - group the rows by the values in column(s)
dplyr::summarise - for each group determined by the group_by function, generate a summary value that is defined by a summary function such as mean()
dplyr::arrange - arrange the order of the rows by the values in column(s)
For example, we can group the rows by months and summarize the average of the mean temperature in each month:
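A minimal sketch of that pipeline on four made-up rows (weather_toy and mean_temp are hypothetical names):

```r
library(dplyr)

# Two made-up days in each of two months (hypothetical values).
weather_toy <- tibble(
  month = c(8, 8, 7, 7),
  actual_mean_temp = c(78, 80, 81, 83)
)

monthly_mean <- weather_toy %>%
  group_by(month) %>%                              # one group per month
  summarise(mean_temp = mean(actual_mean_temp)) %>% # one row per group
  arrange(month)                                   # sort by month number
```

monthly_mean has one row per month: 82 for month 7 and 79 for month 8.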
ggplot2 - the package provides various functions to plot multiple components on the same graph.
For example, the geom_line function plots and connects each (x, y) pair from the provided x and y inputs.
The geom_smooth function plots a smoothed line through the points.
Labels can be created by the labs function.
In the following example we plot the actual mean temperature over 365 days:
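A minimal sketch combining geom_line, geom_smooth and labs. To stay self-contained it uses 30 made-up days instead of the full year. The day_of_year column, the sine-shaped values, and the choice of a loess smoother are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical data: 30 days following a made-up curve, not the 538 values.
weather_toy <- data.frame(day_of_year = 1:30)
weather_toy$actual_mean_temp <- 50 + 10 * sin(weather_toy$day_of_year / 5)

p <- ggplot(weather_toy, aes(x = day_of_year, y = actual_mean_temp)) +
  geom_line() +                                               # connects the daily points
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) + # smoothed trend line
  labs(title = "Actual mean temperature", x = "Day", y = "Temperature")
p
```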
We can also plot multiple lines on the same graph. For example, we can add the actual min temperature and max temperature to the same plot.
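One way to sketch this is to add one geom_line layer per temperature column and map a fixed label to color so that a legend is produced. The data and the legend title below are made up for illustration.

```r
library(ggplot2)

# Hypothetical data: min and max are offsets from a made-up mean curve.
weather_toy <- data.frame(day_of_year = 1:30)
weather_toy$actual_mean_temp <- 50 + 10 * sin(weather_toy$day_of_year / 5)
weather_toy$actual_min_temp <- weather_toy$actual_mean_temp - 8
weather_toy$actual_max_temp <- weather_toy$actual_mean_temp + 8

# Each layer draws one column; the color label inside aes() names it in the legend.
p <- ggplot(weather_toy, aes(x = day_of_year)) +
  geom_line(aes(y = actual_mean_temp, color = "mean")) +
  geom_line(aes(y = actual_min_temp, color = "min")) +
  geom_line(aes(y = actual_max_temp, color = "max")) +
  labs(x = "Day", y = "Temperature", color = "Series")
p
```

An alternative is to reshape the three columns into long form with tidyr first and use a single geom_line with color mapped to the measure column.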