The elements of the checklist are

  1. Formulate your question

  2. Read in your data

  3. Check the packaging

  4. Run str()

  5. Look at the top and the bottom of your data

  6. Check your “n”s

  7. Validate with at least one external data source

  8. Try the easy solution first

  9. Challenge your solution

  10. Follow up

Formulate your question

Notes:

Read in your data

Notes:

# install.packages("readr")

# Load the reader library
library(readr)

# Load the dplyr library
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the data
ozone <- read_csv("hourly_44201_2014.csv")

# the names of the columns to remove any spaces.
names(ozone) <- make.names(names(ozone))

Check the packaging & Run str() & look at TOP and BOTTOM of your data

Notes:

# check the number of rows and columns.

nrow(ozone)
## [1] 9085391
ncol(ozone)
## [1] 24
str(ozone)
## Classes 'tbl_df', 'tbl' and 'data.frame':    9085391 obs. of  24 variables:
##  $ State.Code         : chr  "01" "01" "01" "01" ...
##  $ County.Code        : chr  "003" "003" "003" "003" ...
##  $ Site.Num           : chr  "0010" "0010" "0010" "0010" ...
##  $ Parameter.Code     : int  44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
##  $ POC                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Latitude           : num  30.5 30.5 30.5 30.5 30.5 ...
##  $ Longitude          : num  -87.9 -87.9 -87.9 -87.9 -87.9 ...
##  $ Datum              : chr  "NAD83" "NAD83" "NAD83" "NAD83" ...
##  $ Parameter.Name     : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
##  $ Date.Local         : Date, format: "2014-03-01" "2014-03-01" ...
##  $ Time.Local         : chr  "01:00" "02:00" "03:00" "04:00" ...
##  $ Date.GMT           : Date, format: "2014-03-01" "2014-03-01" ...
##  $ Time.GMT           : chr  "07:00" "08:00" "09:00" "10:00" ...
##  $ Sample.Measurement : num  0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.037 0.044 0.046 ...
##  $ Units.of.Measure   : chr  "Parts per million" "Parts per million" "Parts per million" "Parts per million" ...
##  $ MDL                : num  0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
##  $ Uncertainty        : chr  "" "" "" "" ...
##  $ Qualifier          : chr  "" "" "" "" ...
##  $ Method.Type        : chr  "FEM" "FEM" "FEM" "FEM" ...
##  $ Method.Code        : chr  "047" "047" "047" "047" ...
##  $ Method.Name        : chr  "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" ...
##  $ State.Name         : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ County.Name        : chr  "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
##  $ Date.of.Last.Change: Date, format: "2014-06-30" "2014-06-30" ...
# Check to TOP of the data
head(ozone[, c(6:7, 10)])
## Source: local data frame [6 x 3]
## 
##   Latitude Longitude Date.Local
##      (dbl)     (dbl)     (date)
## 1   30.498 -87.88141 2014-03-01
## 2   30.498 -87.88141 2014-03-01
## 3   30.498 -87.88141 2014-03-01
## 4   30.498 -87.88141 2014-03-01
## 5   30.498 -87.88141 2014-03-01
## 6   30.498 -87.88141 2014-03-01
# Check to BOTTOM of the data
tail(ozone[, c(6:7, 10)])
## Source: local data frame [6 x 3]
## 
##   Latitude Longitude Date.Local
##      (dbl)     (dbl)     (date)
## 1  31.7122 -106.3953 2014-08-31
## 2  31.7122 -106.3953 2014-08-31
## 3  31.7122 -106.3953 2014-08-31
## 4  31.7122 -106.3953 2014-08-31
## 5  31.7122 -106.3953 2014-08-31
## 6  31.7122 -106.3953 2014-08-31

Check your “n”s

Notes:

table(ozone$Time.Local)
## 
##  00:00  01:00  02:00  03:00  04:00  05:00  06:00  07:00  08:00  09:00 
## 369790 371153 360748 354518 358642 384903 384525 383040 380256 378408 
##  10:00  11:00  12:00  13:00  14:00  15:00  16:00  17:00  18:00  19:00 
## 378208 379448 379732 381761 383652 385296 386404 386930 386849 387370 
##  20:00  21:00  22:00  23:00 
## 387309 386087 376100 374262

Notes:

# Perform the check
#
filter(ozone, Time.Local == "13:14") %>%
  select(State.Name, County.Name, Date.Local,
  Time.Local, Sample.Measurement)
## Source: local data frame [0 x 5]
## 
## Variables not shown: State.Name (chr), County.Name (chr), Date.Local
##   (date), Time.Local (chr), Sample.Measurement (dbl)
# Nothing returned.

# Perform alternative check
#
filter(ozone, State.Code == "36"
 & County.Code == "033"
 & Date.Local == "2014-09-30") %>%
 select(Date.Local, Time.Local,
 Sample.Measurement) %>%
 as.data.frame
##    Date.Local Time.Local Sample.Measurement
## 1  2014-09-30      00:00              0.011
## 2  2014-09-30      01:00              0.012
## 3  2014-09-30      02:00              0.012
## 4  2014-09-30      03:00              0.011
## 5  2014-09-30      04:00              0.011
## 6  2014-09-30      05:00              0.011
## 7  2014-09-30      06:00              0.010
## 8  2014-09-30      07:00              0.010
## 9  2014-09-30      08:00              0.010
## 10 2014-09-30      09:00              0.010
## 11 2014-09-30      10:00              0.010
## 12 2014-09-30      11:00              0.012
## 13 2014-09-30      12:00              0.011
## 14 2014-09-30      13:00              0.013
## 15 2014-09-30      14:00              0.016
## 16 2014-09-30      15:00              0.017
## 17 2014-09-30      16:00              0.017
## 18 2014-09-30      17:00              0.015
## 19 2014-09-30      18:00              0.017
## 20 2014-09-30      19:00              0.014
## 21 2014-09-30      20:00              0.014
## 22 2014-09-30      21:00              0.011
## 23 2014-09-30      22:00              0.010
## 24 2014-09-30      23:00              0.010
## 25 2014-09-30      00:00              0.010
## 26 2014-09-30      01:00              0.011
## 27 2014-09-30      02:00              0.011
## 28 2014-09-30      03:00              0.010
## 29 2014-09-30      04:00              0.010
## 30 2014-09-30      05:00              0.010
## 31 2014-09-30      06:00              0.009
## 32 2014-09-30      07:00              0.008
## 33 2014-09-30      08:00              0.009
## 34 2014-09-30      09:00              0.009
## 35 2014-09-30      10:00              0.009
## 36 2014-09-30      11:00              0.011
## 37 2014-09-30      12:00              0.010
## 38 2014-09-30      13:00              0.012
## 39 2014-09-30      14:00              0.015
## 40 2014-09-30      15:00              0.016
## 41 2014-09-30      16:00              0.016
## 42 2014-09-30      17:00              0.014
## 43 2014-09-30      18:00              0.016
## 44 2014-09-30      19:00              0.013
## 45 2014-09-30      20:00              0.013
## 46 2014-09-30      21:00              0.010
## 47 2014-09-30      22:00              0.009
## 48 2014-09-30      23:00              0.009
# Perhaps we should see exactly how many states are represented in this dataset
select(ozone, State.Name) %>% unique %>% nrow
## [1] 53
# We can take a look at the unique elements of the State.Name variable to see what's going on.
#
unique(ozone$State.Name)
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District Of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"             
## [52] "Puerto Rico"          "Country Of Mexico"

Validate with at least one external data source

Notes:

# look at the hourly measurements of ozone
summary(ozone$Sample.Measurement)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01900 0.03000 0.03026 0.04100 0.24100
# We can get a bit more detail on the distribution by looking at deciles of the data.
quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
## 0.000 0.009 0.016 0.022 0.026 0.030 0.034 0.039 0.043 0.050 0.241

Try the easy solution first

Notes:

# To identify each county we will use a combination of the State.Name and the County.Name Svariables.
#
ranking <- group_by(ozone, State.Name, County.Name) %>%
 summarize(ozone = mean(Sample.Measurement)) %>%
 as.data.frame %>%
 arrange(desc(ozone))

# look at the top 10 counties in this ranking
head(ranking, 10)
##    State.Name County.Name      ozone
## 1    Colorado Clear Creek 0.05022088
## 2  California    Mariposa 0.04849027
## 3  California      Nevada 0.04821713
## 4    Colorado        Park 0.04799975
## 5     Wyoming      Albany 0.04738065
## 6    Colorado     Chaffee 0.04732017
## 7  California        Inyo 0.04469113
## 8        Utah    San Juan 0.04435075
## 9  California   El Dorado 0.04363664
## 10     Nevada        Elko 0.04346459
# we can look at the 10 lowest counties too.
tail(ranking, 10)
##      State.Name          County.Name       ozone
## 788     Georgia               DeKalb 0.020429020
## 789  New Jersey               Warren 0.020400928
## 790   Louisiana     West Baton Rouge 0.020229531
## 791  Washington              Whatcom 0.020114267
## 792      Oregon           Washington 0.019650027
## 793    Oklahoma                Caddo 0.017435731
## 794 Puerto Rico               Juncos 0.013466699
## 795      Alaska Fairbanks North Star 0.013419708
## 796 Puerto Rico              Bayamon 0.009246600
## 797 Puerto Rico               Catano 0.005014176
# Let's take a look at one of the higest level counties, Mariposa County, California. First
# let's see how many observations there are for this county in the dataset.
#
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>% nrow
## [1] 12130
# We can take a look at how ozone varies through the year in this county by looking at
# monthly averages. First we'll need to convert the date variable into a Date class.
ozone <- mutate(ozone, Date.Local = as.Date(Date.Local))

# Then we will split the data by month to look at the average hourly levels.
#
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
 mutate(month = factor(months(Date.Local), levels = month.name)) %>%
 group_by(month) %>%
 summarize(ozone = mean(Sample.Measurement))
## Source: local data frame [12 x 2]
## 
##        month      ozone
##       (fctr)      (dbl)
## 1    January 0.04081929
## 2   February 0.03884305
## 3      March 0.04548374
## 4      April 0.04976124
## 5        May 0.05047637
## 6       June 0.05639331
## 7       July 0.05224286
## 8     August 0.05541681
## 9  September 0.05250999
## 10   October 0.04894909
## 11  November 0.03536872
## 12  December 0.03472859
# Now let's take a look at one of the lowest level counties, Caddo County, Oklahoma.
#
filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>% nrow
## [1] 7562
# Here we see that there are perhaps fewer observations than we would expect for a
# monitor that was measuring 24 hours a day all year. We can check the data to see if
# anything funny is going on.


filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
 mutate(month = factor(months(Date.Local), levels = month.name)) %>%
 group_by(month) %>%
 summarize(ozone = mean(Sample.Measurement))
## Source: local data frame [12 x 2]
## 
##        month       ozone
##       (fctr)       (dbl)
## 1    January 0.018732719
## 2   February 0.002060543
## 3      March 0.002000000
## 4      April 0.023208955
## 5        May 0.024182927
## 6       June 0.020195936
## 7       July 0.019112745
## 8     August 0.023452830
## 9  September 0.019302521
## 10   October 0.021025780
## 11  November 0.020529210
## 12  December 0.012710432

First, ozone appears to be higher in the summer months and lower in the winter months.

Second, there are two months missing (November and December) from the data.

It’s not immediately clear why that is, but it’s probably worth investigating a bit later on.

Challenge your solution

Notes:

set.seed(10234)
N <- nrow(ozone)
idx <- sample(N, N, replace = TRUE)
ozone2 <- ozone[idx, ]


ranking2 <- group_by(ozone2, State.Name, County.Name) %>%
 summarize(ozone = mean(Sample.Measurement)) %>%
 as.data.frame %>%
 arrange(desc(ozone))


cbind(head(ranking, 10), head(ranking2, 10))
##    State.Name County.Name      ozone State.Name County.Name      ozone
## 1    Colorado Clear Creek 0.05022088   Colorado Clear Creek 0.05031511
## 2  California    Mariposa 0.04849027 California    Mariposa 0.04847238
## 3  California      Nevada 0.04821713 California      Nevada 0.04811088
## 4    Colorado        Park 0.04799975   Colorado        Park 0.04794328
## 5     Wyoming      Albany 0.04738065    Wyoming      Albany 0.04740893
## 6    Colorado     Chaffee 0.04732017   Colorado     Chaffee 0.04733954
## 7  California        Inyo 0.04469113 California        Inyo 0.04458294
## 8        Utah    San Juan 0.04435075       Utah    San Juan 0.04438289
## 9  California   El Dorado 0.04363664 California   El Dorado 0.04375390
## 10     Nevada        Elko 0.04346459     Nevada        Elko 0.04340676
cbind(tail(ranking, 10),  tail(ranking2, 10))
##      State.Name          County.Name       ozone  State.Name
## 788     Georgia               DeKalb 0.020429020  New Jersey
## 789  New Jersey               Warren 0.020400928   Louisiana
## 790   Louisiana     West Baton Rouge 0.020229531     Georgia
## 791  Washington              Whatcom 0.020114267  Washington
## 792      Oregon           Washington 0.019650027      Oregon
## 793    Oklahoma                Caddo 0.017435731    Oklahoma
## 794 Puerto Rico               Juncos 0.013466699      Alaska
## 795      Alaska Fairbanks North Star 0.013419708 Puerto Rico
## 796 Puerto Rico              Bayamon 0.009246600 Puerto Rico
## 797 Puerto Rico               Catano 0.005014176 Puerto Rico
##              County.Name       ozone
## 788               Warren 0.020522180
## 789     West Baton Rouge 0.020317902
## 790               DeKalb 0.020284889
## 791              Whatcom 0.019888536
## 792           Washington 0.019570696
## 793                Caddo 0.017259953
## 794 Fairbanks North Star 0.013629980
## 795               Juncos 0.013587565
## 796              Bayamon 0.009459922
## 797               Catano 0.005022035

Follow up questions

At this point it’s useful to consider a few followup questions.

Notes: