Using the Tidyverse

Harold Nelson

2024-02-26

Setup

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Section 3.2.3

Review the section

Find the row(s) with the maximum air temperature using the tidyverse.

Solution

max_temp_rows = airquality %>% 
  filter(Temp == max(Temp))

max_temp_rows
##   Ozone Solar.R Wind Temp Month Day
## 1    76     203  9.7   97     8  28
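
As a cross-check on the filter approach, the same row can be pulled with dplyr's slice_max(). This is only a sketch of an alternative, and the name hottest is made up for illustration. By default slice_max() keeps ties, so it would return every row that shares the maximum, just as the filter does.

```r
library(dplyr)

# Alternative to filter(Temp == max(Temp)): keep the row(s) with the largest Temp.
# with_ties = TRUE is the default, so tied maxima would all be returned.
hottest = airquality %>% 
  slice_max(Temp, n = 1)

hottest
```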

Hot Days

Use mutate to create the variable hot in the airquality dataframe. A day is hot if Temp > 90; otherwise it is not_hot. Count the number of days of each type. Also compute the fraction of the total days that falls in each type.

Solution

airquality_counts = airquality %>% 
  mutate(hot = Temp > 90,
         not_hot = hot == FALSE) %>% 
  summarize(hot_count = sum(hot),
            fract_hot = mean(hot),
            not_hot_count = sum(not_hot),
            fract_not_hot = mean(not_hot))

airquality_counts
##   hot_count  fract_hot not_hot_count fract_not_hot
## 1        14 0.09150327           139     0.9084967
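
The same counts can also be laid out with one row per type of day. The sketch below is an alternative, not the solution above: it labels each day with if_else() and then uses count(). The names hot_summary and fraction are illustrative.

```r
library(dplyr)

# Label each day, count the labels, then convert counts to fractions.
hot_summary = airquality %>% 
  mutate(hot = if_else(Temp > 90, "hot", "not_hot")) %>% 
  count(hot) %>% 
  mutate(fraction = n / sum(n))

hot_summary
```

This long layout extends naturally if more categories are added later, since no new summary columns are needed.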

Section 3.3

Put the philosophers.csv file in your current working directory. Then use the import control in RStudio to import it.

Copy the import command from the console and save it in a chunk.

philosophers <- read.csv("~/Library/CloudStorage/Dropbox/Documents/SMU/CSC 201/philosophers.csv")
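
If the file really is in the current working directory, a relative path is enough and the chunk will also run on another machine. The sketch below assumes that location and uses readr's read_csv(), which returns a tibble rather than a base data frame. The file.exists() guard is only there so the chunk does not fail when the file is missing.

```r
library(readr)

# Assumption: philosophers.csv has been copied into the working directory.
csv_path = "philosophers.csv"

if (file.exists(csv_path)) {
  # show_col_types = FALSE suppresses the column specification message.
  philosophers = read_csv(csv_path, show_col_types = FALSE)
  print(head(philosophers))
} else {
  message("File not found in ", getwd())
}
```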

Section 3.8

Use group_by and summarize to get the mean value of Ozone for each month from the airquality dataframe. Also get the count of missing Ozone values for each month.

ozone_mo = airquality %>% 
  group_by(Month) %>% 
  summarize(mean_ozone = mean(Ozone, na.rm = TRUE),
            na_days = sum(is.na(Ozone)))

ozone_mo
## # A tibble: 5 × 3
##   Month mean_ozone na_days
##   <int>      <dbl>   <int>
## 1     5       23.6       5
## 2     6       29.4      21
## 3     7       59.1       5
## 4     8       60.0       5
## 5     9       31.4       1
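
Raw counts of missing values are easier to judge next to the number of days in each month. The sketch below adds n() and the fraction missing to the same grouped summary. The names ozone_missing, days, and na_fraction are illustrative.

```r
library(dplyr)

# For each month: total days, days with missing Ozone, and the fraction missing.
ozone_missing = airquality %>% 
  group_by(Month) %>% 
  summarize(days = n(),
            na_days = sum(is.na(Ozone)),
            na_fraction = mean(is.na(Ozone)))

ozone_missing
```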

Exercise 3.13

Do the work using the tidyverse.

Solution

library(datasauRus)
datasaurus_dozen %>% 
  group_by(dataset) %>% 
  summarize(cor = cor(x, y),
            mean_x = mean(x),
            sd_x = sd(x),
            mean_y = mean(y),
            sd_y = sd(y))
## # A tibble: 13 × 6
##    dataset        cor mean_x  sd_x mean_y  sd_y
##    <chr>        <dbl>  <dbl> <dbl>  <dbl> <dbl>
##  1 away       -0.0641   54.3  16.8   47.8  26.9
##  2 bullseye   -0.0686   54.3  16.8   47.8  26.9
##  3 circle     -0.0683   54.3  16.8   47.8  26.9
##  4 dino       -0.0645   54.3  16.8   47.8  26.9
##  5 dots       -0.0603   54.3  16.8   47.8  26.9
##  6 h_lines    -0.0617   54.3  16.8   47.8  26.9
##  7 high_lines -0.0685   54.3  16.8   47.8  26.9
##  8 slant_down -0.0690   54.3  16.8   47.8  26.9
##  9 slant_up   -0.0686   54.3  16.8   47.8  26.9
## 10 star       -0.0630   54.3  16.8   47.8  26.9
## 11 v_lines    -0.0694   54.3  16.8   47.8  26.9
## 12 wide_lines -0.0666   54.3  16.8   47.8  26.9
## 13 x_shape    -0.0656   54.3  16.8   47.8  26.9
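
The four mean and sd columns follow one pattern, so they can also be generated with across(). The sketch below is an alternative to the solution above and assumes the datasauRus package is installed. Note that across() names the columns x_mean, x_sd, y_mean, and y_sd rather than mean_x and so on. The name dozen_stats is illustrative.

```r
library(dplyr)
library(datasauRus)

# Same statistics as above, with mean and sd applied to x and y via across().
dozen_stats = datasaurus_dozen %>% 
  group_by(dataset) %>% 
  summarize(cor = cor(x, y),
            across(c(x, y), list(mean = mean, sd = sd)))

dozen_stats
```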

3.13 Follow-up

Use faceting to look at scatterplots of x and y by dataset.

Solution

datasaurus_dozen %>% 
  ggplot(aes(x = x, y = y)) +
  geom_point(size = .5) +
  facet_wrap(~dataset) +
  ggtitle("Woe be he...")

A Weather Report

Load the OAW2309 dataframe.
Create the dataframe Mar01 using filter.
Verify your work using head().

Solution

load("OAW2309.Rdata")
Mar01 = OAW2309 %>% 
  filter(mo == 3 & dy == 1)

head(Mar01)
## # A tibble: 6 × 7
##   DATE        PRCP  TMAX  TMIN mo       dy    yr
##   <date>     <dbl> <dbl> <dbl> <fct> <int> <dbl>
## 1 1942-03-01  0.16    58    35 3         1  1942
## 2 1943-03-01  0       57    28 3         1  1943
## 3 1944-03-01  0       55    25 3         1  1944
## 4 1945-03-01  0.01    53    32 3         1  1945
## 5 1946-03-01  0.19    53    41 3         1  1946
## 6 1947-03-01  0       57    24 3         1  1947

Forecast

What is the probability of rain?

Solution

mean(Mar01$PRCP > 0)
## [1] 0.5853659

Describe

Describe the possibilities for TMAX using the values produced by summary().

Solution

summary(Mar01$TMAX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   36.00   46.00   50.00   50.46   55.00   62.00

A typical maximum temperature on this date is about 50, which is the median. However, maximum temperatures on this date have ranged from 36 to 62, and the middle 50% have fallen between 46 and 55.

Generalize with a Function

Create a function weather_forecast that accepts any month and day to produce these results.

Solution

weather_forecast = function(month,day){
  days = filter(OAW2309,
                mo == month,
                dy == day)
  
  rain_prob = mean(days$PRCP > 0)
  print(paste("The probability of rain is ",round(rain_prob,2)))
  print('')
  print("Summary of the Maximum Temperature")
  summary(days$TMAX)
}

weather_forecast(7,4)
## [1] "The probability of rain is  0.23"
## [1] ""
## [1] "Summary of the Maximum Temperature"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   61.00   69.00   74.00   74.88   80.00   93.00
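
The function above reads OAW2309 from the global environment and prints its results. The variant below is a sketch: it takes the dataframe as an argument and returns the results in a list, so they can be reused or tested. The name weather_forecast2 and the toy dataframe are hypothetical. The toy rows are made-up values that only mimic the column names of OAW2309; they are not real observations.

```r
library(dplyr)

# Variant of weather_forecast() that receives the data explicitly
# and returns its results instead of printing them.
weather_forecast2 = function(data, month, day){
  days = filter(data,
                mo == month,
                dy == day)
  
  list(rain_prob = mean(days$PRCP > 0),
       tmax_summary = summary(days$TMAX))
}

# Hypothetical stand-in data with the same column names as OAW2309.
toy = data.frame(
  mo = factor(c(7, 7, 7, 7, 8)),
  dy = c(4L, 4L, 4L, 4L, 1L),
  PRCP = c(0, 0.1, 0, 0, 0.5),
  TMAX = c(70, 75, 80, 85, 90)
)

result = weather_forecast2(toy, 7, 4)
result$rain_prob
result$tmax_summary
```

With the real data loaded, the call would be weather_forecast2(OAW2309, 7, 4).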