Why Data Analysis

Harold Nelson

10/29/2019

Who Needs Data Analysis

Data Analysis vs Statistics

CSC 360

Prerequisites

CSC 360 Resources

Example: Olympia Weather

Let’s see what you can do with weather data. I’ll use Olympia’s weather records to answer a few questions.

I obtained the data from NOAA at https://www.ncdc.noaa.gov/. It comes as a CSV file; after a little cleanup in Excel, it’s ready to import into R.
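
The code that follows assumes the tidyverse and lubridate packages are attached. The setup chunk isn’t shown here, but a minimal version would look like this:

library(tidyverse)   # provides read_csv(), the dplyr verbs, and ggplot2
library(lubridate)   # provides year(), month(), and day()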

olywthr <- read_csv("2019 10 26.csv",
                    col_types = cols(SNOW = col_double(),
                                     TMAX = col_double(),
                                     TMIN = col_double()))

Check

Look at the data with a few simple commands to understand what it means. Does it pass some obvious validity checks?

glimpse(olywthr)
## Observations: 28,649
## Variables: 5
## $ DATE <date> 1941-05-13, 1941-05-14, 1941-05-15, 1941-05-16, 1941-05-17…
## $ PRCP <dbl> 0.00, 0.00, 0.30, 1.08, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ SNOW <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ TMAX <dbl> 66, 63, 58, 55, 57, 59, 58, 65, 68, 85, 84, 75, 72, 59, 61,…
## $ TMIN <dbl> 50, 47, 44, 45, 46, 39, 40, 50, 42, 46, 46, 50, 41, 37, 48,…
summary(olywthr)
##       DATE                 PRCP             SNOW            TMAX       
##  Min.   :1941-05-13   Min.   :0.0000   Min.   : 0.00   Min.   : 18.00  
##  1st Qu.:1960-12-21   1st Qu.:0.0000   1st Qu.: 0.00   1st Qu.: 50.00  
##  Median :1980-07-31   Median :0.0000   Median : 0.00   Median : 59.00  
##  Mean   :1980-07-31   Mean   :0.1362   Mean   : 0.04   Mean   : 60.56  
##  3rd Qu.:2000-03-10   3rd Qu.:0.1400   3rd Qu.: 0.00   3rd Qu.: 71.00  
##  Max.   :2019-10-19   Max.   :4.8200   Max.   :14.20   Max.   :104.00  
##                       NA's   :3        NA's   :5407    NA's   :11      
##       TMIN      
##  Min.   :-8.00  
##  1st Qu.:33.00  
##  Median :40.00  
##  Mean   :39.82  
##  3rd Qu.:47.00  
##  Max.   :69.00  
##  NA's   :11
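
As a sketch of the kind of validity check I have in mind (these lines aren’t part of the original analysis), we could confirm that the daily maximum never falls below the daily minimum and count the missing values in each column:

olywthr %>% filter(TMAX < TMIN)   # any rows here would be suspect; we expect none
colSums(is.na(olywthr))           # missing values per column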

Recode/Add

For my purposes, I want direct access to the components of the date. Those are easy to extract with the year(), month(), and day() functions from the lubridate package.

olywthr %>% 
  mutate(yr = year(DATE),
         mo = month(DATE),
         dy = day(DATE)) -> olywthr

glimpse(olywthr)
## Observations: 28,649
## Variables: 8
## $ DATE <date> 1941-05-13, 1941-05-14, 1941-05-15, 1941-05-16, 1941-05-17…
## $ PRCP <dbl> 0.00, 0.00, 0.30, 1.08, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ SNOW <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ TMAX <dbl> 66, 63, 58, 55, 57, 59, 58, 65, 68, 85, 84, 75, 72, 59, 61,…
## $ TMIN <dbl> 50, 47, 44, 45, 46, 39, 40, 50, 42, 46, 46, 50, 41, 37, 48,…
## $ yr   <dbl> 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941,…
## $ mo   <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6,…
## $ dy   <int> 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,…

Historical Data for Today

Today is October 29, 2019.

Let’s extract all of the historical data for this month and day.

olywthr %>% 
  filter(mo == 10 & dy == 29) -> oct29

Let’s run a summary of all of the variables in our smaller dataframe.

summary(oct29)
##       DATE                 PRCP             SNOW         TMAX      
##  Min.   :1941-10-29   Min.   :0.0000   Min.   :0    Min.   :44.00  
##  1st Qu.:1961-01-28   1st Qu.:0.0000   1st Qu.:0    1st Qu.:52.00  
##  Median :1980-04-29   Median :0.0200   Median :0    Median :55.00  
##  Mean   :1980-04-29   Mean   :0.1465   Mean   :0    Mean   :55.24  
##  3rd Qu.:1999-07-29   3rd Qu.:0.1800   3rd Qu.:0    3rd Qu.:58.00  
##  Max.   :2018-10-29   Max.   :1.6400   Max.   :0    Max.   :68.00  
##                                        NA's   :15                  
##       TMIN             yr             mo           dy    
##  Min.   :21.00   Min.   :1941   Min.   :10   Min.   :29  
##  1st Qu.:32.00   1st Qu.:1960   1st Qu.:10   1st Qu.:29  
##  Median :39.50   Median :1980   Median :10   Median :29  
##  Mean   :37.97   Mean   :1980   Mean   :10   Mean   :29  
##  3rd Qu.:43.00   3rd Qu.:1999   3rd Qu.:10   3rd Qu.:29  
##  Max.   :51.00   Max.   :2018   Max.   :10   Max.   :29  
## 

Significant Variables

There are three variables we’re most interested in: TMAX, TMIN, and PRCP.

We can examine each of these graphically using ggplot2. I’ll show both a histogram and a density plot; of the two, I prefer the density plot.

First the maximum temperature.

oct29 %>% ggplot(aes(x=TMAX)) + geom_histogram(binwidth = 3) 

oct29 %>% ggplot(aes(x=TMAX)) + geom_density() 

Next the minimum temperature.

oct29 %>% ggplot(aes(x=TMIN)) + geom_histogram(binwidth = 3)

oct29 %>% ggplot(aes(x=TMIN)) + geom_density()

Finally, the precipitation.

oct29 %>% ggplot(aes(x=PRCP)) + geom_histogram(binwidth = 0.1) # PRCP tops out near 1.6 in, so it needs a much smaller binwidth than the temperature plots

oct29 %>% ggplot(aes(x=PRCP)) + geom_density()

History and Future

I’ll create a similar dataframe with data for the 29th of each month from August through December, covering a couple of months back and a couple of months ahead. Then I’ll build a faceted graphic to compare the months.

First the data.

olywthr %>% 
  filter(mo %in% c(8,9,10,11,12) & dy == 29) -> allmo

Now the graphic.

allmo %>% ggplot(aes(x = TMAX)) + 
  geom_density() + facet_wrap(~mo,ncol = 1)

I can see that the distribution of daily maximum temperatures has drifted down since August and that it will decline further as the months advance toward December.

The Whole Year?

I’ll stick with the 29th day of each month and do a different style of graph, a side-by-side boxplot.

olywthr %>% filter(dy == 29) %>% 
  ggplot(aes(x=factor(mo),y=TMAX)) +
  geom_boxplot()

July 2018

During July 2018, I thought the weather was unusually hot for July. Let’s look at this question graphically. I’ll create a marker variable for 2018 and filter the data to include only July days. Then I can compare the distribution of temperatures for the 31 days of July 2018 with all of the other July days.

olywthr %>% filter(mo == 7) %>% 
  mutate(marker = ifelse(yr == 2018,"YR_2018","Other")) -> QJ18 
  
QJ18 %>% ggplot(aes(x=TMAX,color=marker)) +
  geom_density()

Traditional Statistical Inference

Most of the time I’m satisfied with graphical answers to my questions. However, it’s certainly possible to use traditional statistical methods. Here, I should test the null hypothesis that the average daily maximum temperature for July 2018 is the same as the average daily maximum temperature for all other Julys. In other words, is the difference I see in the graphic nothing but random fluctuation?

I’ll use t.test() to do this.

t.test(TMAX ~ marker, data = QJ18)
## 
##  Welch Two Sample t-test
## 
## data:  TMAX by marker
## t = -3.4349, df = 30.678, p-value = 0.001721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -8.568658 -2.182376
## sample estimates:
##   mean in group Other mean in group YR_2018 
##              77.33416              82.70968

The p-value is far below 5%, so I’ll reject the idea that the difference is just random fluctuation.
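
As a quick cross-check (not part of the original analysis), the two group means the test compares can also be computed directly with dplyr; they should match the sample estimates reported above:

QJ18 %>% 
  group_by(marker) %>% 
  summarise(mean_TMAX = mean(TMAX, na.rm = TRUE),   # average daily maximum per group
            days = n())                             # number of July days in each group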