Housekeeping

Create a new project, OAW.

Download the file OAW2309.Rdata.

Move the file to your project directory.

file.choose().

getwd().

Setup

Get the tidyverse and the OAW2309 dataframe.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

load("OAW2309.Rdata")

Inspect the Data

Use str()

Solution

str(OAW2309)

## tibble [30,075 × 7] (S3: tbl_df/tbl/data.frame)
##  $ DATE: Date[1:30075], format: "1941-05-13" "1941-05-14" ...
##  $ PRCP: num [1:30075] 0 0 0.3 1.08 0.06 0 0 0 0 0 ...
##  $ TMAX: num [1:30075] 66 63 58 55 57 59 58 65 68 85 ...
##  $ TMIN: num [1:30075] 50 47 44 45 46 39 40 50 42 46 ...
##  $ mo  : Factor w/ 12 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ dy  : int [1:30075] 13 14 15 16 17 18 19 20 21 22 ...
##  $ yr  : num [1:30075] 1941 1941 1941 1941 1941 ...

Try glimpse()

It is an alternative to str() and part of the tidyverse.

Solution

glimpse(OAW2309)

## Rows: 30,075
## Columns: 7
## $ DATE <date> 1941-05-13, 1941-05-14, 1941-05-15, 1941-05-16, 1941-05-17, 1941…
## $ PRCP <dbl> 0.00, 0.00, 0.30, 1.08, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ TMAX <dbl> 66, 63, 58, 55, 57, 59, 58, 65, 68, 85, 84, 75, 72, 59, 61, 59, 6…
## $ TMIN <dbl> 50, 47, 44, 45, 46, 39, 40, 50, 42, 46, 46, 50, 41, 37, 48, 46, 4…
## $ mo   <fct> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6,…
## $ dy   <int> 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 2…
## $ yr   <dbl> 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941,…

skimr

This is a useful tool to inspect a dataframe. Install it, then load it.

library(skimr)
sOAW2309 = skim(OAW2309)
#View(sOAW2309)

Inspect TMAX

Do a summary() Use geom_density() and geom_rug()

Solution

summary(OAW2309$TMAX)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   50.00   59.00   60.62   71.00  110.00

OAW2309 %>% 
  ggplot(aes(x = TMAX)) +
  geom_density() +
  geom_rug()

Solution

summary(OAW2309$TMIN)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -8.00   33.00   40.00   39.86   47.00   69.00

OAW2309 %>% 
  ggplot(aes(x = TMIN)) +
  geom_density() +
  geom_rug()

Solution

summary(OAW2309$PRCP)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1362  0.1400  4.8200

OAW2309 %>% 
  ggplot(aes(x = PRCP)) +
  geom_density() +
  geom_rug()

Log Transformation

Transform to the base 10 logarithm. Add .1 to avoid zero values.

Solution

OAW2309 %>% 
  mutate(LPRCP = log10(PRCP + .1)) %>% 
  ggplot(aes(x = LPRCP)) +
  geom_density(adjust = .5) +
  geom_rug()

## Zero Values

What fraction of the days have zero values of PRCP?

Solution

mean(OAW2309$PRCP == 0)

## [1] 0.5546467

Smallest non-zero value

What is the smallest non-zero value of PRCP?

Solution

non_zero = OAW2309 %>% 
  filter(PRCP > 0) %>% 
  arrange(PRCP)

head(non_zero)

## # A tibble: 6 × 7
##   DATE        PRCP  TMAX  TMIN mo       dy    yr
##   <date>     <dbl> <dbl> <dbl> <fct> <int> <dbl>
## 1 1941-05-26  0.01    59    37 5        26  1941
## 2 1941-06-01  0.01    65    48 6         1  1941
## 3 1941-08-28  0.01    70    51 8        28  1941
## 4 1941-09-11  0.01    66    44 9        11  1941
## 5 1941-11-05  0.01    64    40 11        5  1941
## 6 1941-12-11  0.01    44    34 12       11  1941

How Frequent?

What fraction of the non-zero value days have PRCP > .01?

Solution

mean(non_zero$PRCP > .01001)

## [1] 0.9186949

Years

Which years have the greatest number of non-zero days.

Solution

years = non_zero %>% 
  group_by(yr) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))

head(years,10)

## # A tibble: 10 × 2
##       yr count
##    <dbl> <int>
##  1  1950   193
##  2  2016   190
##  3  1997   188
##  4  1948   186
##  5  1999   185
##  6  1968   184
##  7  1953   183
##  8  1955   183
##  9  1984   183
## 10  1990   183

tail(years,10)

## # A tibble: 10 × 2
##       yr count
##    <dbl> <int>
##  1  1952   143
##  2  1970   141
##  3  1993   141
##  4  2015   141
##  5  1985   131
##  6  1944   130
##  7  1987   130
##  8  1943   129
##  9  2023   107
## 10  1941    97

Time Trend?

Exclude 1941 and 2023, which are incomplete. Then do a scatterplot with year on the horizontal axis and count on the vertical.

Solution

years %>% 
  filter(yr != 1941 & yr != 2023) %>% 
  ggplot(aes(x = yr, y = count)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red")

## `geom_smooth()` using formula = 'y ~ x'

Bad Days

Define a day as bad if PRCP > 0 and TMIN <= 32. Calculate and display the probability that a day will be bad for each month.

Solution

OAW2309 %>% 
  mutate(bad = PRCP > 0 & TMIN <= 32) %>% 
  group_by(mo) %>% 
  summarize(fract_bad = mean(bad)) %>% 
  ungroup()

## # A tibble: 12 × 2
##    mo    fract_bad
##    <fct>     <dbl>
##  1 1      0.207   
##  2 2      0.184   
##  3 3      0.154   
##  4 4      0.0709  
##  5 5      0.00900 
##  6 6      0.000402
##  7 7      0       
##  8 8      0       
##  9 9      0       
## 10 10     0.0232  
## 11 11     0.117   
## 12 12     0.208

OAW Analysis

Housekeeping

Setup

Inspect the Data

Solution

Try glimpse()

Solution

skimr

Inspect TMAX

Solution

Now TMIN

Solution

Now PRCP

Solution

Log Transformation

Solution

Solution

Smallest non-zero value

Solution

How Frequent?

Solution

Years

Solution

Time Trend?

Solution

Bad Days

Solution