EDA Checklist

The elements of the checklist are

Formulate your question
Read in your data
Check the packaging
Run str()
Look at the top and the bottom of your data
Check your “n”s
Validate with at least one external data source
Try the easy solution first
Challenge your solution
Follow up

Formulate your question

Formulating a question can be a useful way to guide the exploratory data analysis process and to limit the exponential number of paths that can be taken with any sizeable dataset. In particular, a sharp question or hypothesis can serve as a dimension reduction tool that can eliminate variables that are not immediately relevant to the question.

Notes:

Read in your data

Notes:

# install.packages("readr")

# Load the reader library
library(readr)

# Load the dplyr library
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the data
ozone <- read_csv("hourly_44201_2014.csv")

# the names of the columns to remove any spaces.
names(ozone) <- make.names(names(ozone))

Check the packaging & Run str() & look at TOP and BOTTOM of your data

Notes:

# check the number of rows and columns.

nrow(ozone)

## [1] 9085391

ncol(ozone)

## [1] 24

str(ozone)

## Classes 'tbl_df', 'tbl' and 'data.frame':    9085391 obs. of  24 variables:
##  $ State.Code         : chr  "01" "01" "01" "01" ...
##  $ County.Code        : chr  "003" "003" "003" "003" ...
##  $ Site.Num           : chr  "0010" "0010" "0010" "0010" ...
##  $ Parameter.Code     : int  44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
##  $ POC                : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Latitude           : num  30.5 30.5 30.5 30.5 30.5 ...
##  $ Longitude          : num  -87.9 -87.9 -87.9 -87.9 -87.9 ...
##  $ Datum              : chr  "NAD83" "NAD83" "NAD83" "NAD83" ...
##  $ Parameter.Name     : chr  "Ozone" "Ozone" "Ozone" "Ozone" ...
##  $ Date.Local         : Date, format: "2014-03-01" "2014-03-01" ...
##  $ Time.Local         : chr  "01:00" "02:00" "03:00" "04:00" ...
##  $ Date.GMT           : Date, format: "2014-03-01" "2014-03-01" ...
##  $ Time.GMT           : chr  "07:00" "08:00" "09:00" "10:00" ...
##  $ Sample.Measurement : num  0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.037 0.044 0.046 ...
##  $ Units.of.Measure   : chr  "Parts per million" "Parts per million" "Parts per million" "Parts per million" ...
##  $ MDL                : num  0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
##  $ Uncertainty        : chr  "" "" "" "" ...
##  $ Qualifier          : chr  "" "" "" "" ...
##  $ Method.Type        : chr  "FEM" "FEM" "FEM" "FEM" ...
##  $ Method.Code        : chr  "047" "047" "047" "047" ...
##  $ Method.Name        : chr  "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" ...
##  $ State.Name         : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ County.Name        : chr  "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
##  $ Date.of.Last.Change: Date, format: "2014-06-30" "2014-06-30" ...

# Check to TOP of the data
head(ozone[, c(6:7, 10)])

## Source: local data frame [6 x 3]
## 
##   Latitude Longitude Date.Local
##      (dbl)     (dbl)     (date)
## 1   30.498 -87.88141 2014-03-01
## 2   30.498 -87.88141 2014-03-01
## 3   30.498 -87.88141 2014-03-01
## 4   30.498 -87.88141 2014-03-01
## 5   30.498 -87.88141 2014-03-01
## 6   30.498 -87.88141 2014-03-01

# Check to BOTTOM of the data
tail(ozone[, c(6:7, 10)])

## Source: local data frame [6 x 3]
## 
##   Latitude Longitude Date.Local
##      (dbl)     (dbl)     (date)
## 1  31.7122 -106.3953 2014-08-31
## 2  31.7122 -106.3953 2014-08-31
## 3  31.7122 -106.3953 2014-08-31
## 4  31.7122 -106.3953 2014-08-31
## 5  31.7122 -106.3953 2014-08-31
## 6  31.7122 -106.3953 2014-08-31

Check your “n”s

In general, counting things is usually a good way to figure out if anything is wrong or not. To do this properly, you need to identify some landmarks that can be used to check against your data.
In this example, we will use the fact that the dataset purportedly contains hourly data for the entire country. These will be our two landmarks for comparison.
We have hourly ozone data that comes from monitors across the country. The monitors should be monitoring continuously during the day, so all hours should be represented. We can take a look at the Time.Local variable to see what time measurements are recorded as being taken.

Notes:

table(ozone$Time.Local)

## 
##  00:00  01:00  02:00  03:00  04:00  05:00  06:00  07:00  08:00  09:00 
## 369790 371153 360748 354518 358642 384903 384525 383040 380256 378408 
##  10:00  11:00  12:00  13:00  14:00  15:00  16:00  17:00  18:00  19:00 
## 378208 379448 379732 381761 383652 385296 386404 386930 386849 387370 
##  20:00  21:00  22:00  23:00 
## 387309 386087 376100 374262

One thing we notice here is that while almost all measurements in the dataset are recorded as being taken on the hour, some are taken at slightly different times. Such a small number of readings are taken at these off times that we might not want to care. But it does seem a bit odd, so it might be worth a quick check.

Notes:

# Perform the check
#
filter(ozone, Time.Local == "13:14") %>%
  select(State.Name, County.Name, Date.Local,
  Time.Local, Sample.Measurement)

## Source: local data frame [0 x 5]
## 
## Variables not shown: State.Name (chr), County.Name (chr), Date.Local
##   (date), Time.Local (chr), Sample.Measurement (dbl)

# Nothing returned.

# Perform alternative check
#
filter(ozone, State.Code == "36"
 & County.Code == "033"
 & Date.Local == "2014-09-30") %>%
 select(Date.Local, Time.Local,
 Sample.Measurement) %>%
 as.data.frame

##    Date.Local Time.Local Sample.Measurement
## 1  2014-09-30      00:00              0.011
## 2  2014-09-30      01:00              0.012
## 3  2014-09-30      02:00              0.012
## 4  2014-09-30      03:00              0.011
## 5  2014-09-30      04:00              0.011
## 6  2014-09-30      05:00              0.011
## 7  2014-09-30      06:00              0.010
## 8  2014-09-30      07:00              0.010
## 9  2014-09-30      08:00              0.010
## 10 2014-09-30      09:00              0.010
## 11 2014-09-30      10:00              0.010
## 12 2014-09-30      11:00              0.012
## 13 2014-09-30      12:00              0.011
## 14 2014-09-30      13:00              0.013
## 15 2014-09-30      14:00              0.016
## 16 2014-09-30      15:00              0.017
## 17 2014-09-30      16:00              0.017
## 18 2014-09-30      17:00              0.015
## 19 2014-09-30      18:00              0.017
## 20 2014-09-30      19:00              0.014
## 21 2014-09-30      20:00              0.014
## 22 2014-09-30      21:00              0.011
## 23 2014-09-30      22:00              0.010
## 24 2014-09-30      23:00              0.010
## 25 2014-09-30      00:00              0.010
## 26 2014-09-30      01:00              0.011
## 27 2014-09-30      02:00              0.011
## 28 2014-09-30      03:00              0.010
## 29 2014-09-30      04:00              0.010
## 30 2014-09-30      05:00              0.010
## 31 2014-09-30      06:00              0.009
## 32 2014-09-30      07:00              0.008
## 33 2014-09-30      08:00              0.009
## 34 2014-09-30      09:00              0.009
## 35 2014-09-30      10:00              0.009
## 36 2014-09-30      11:00              0.011
## 37 2014-09-30      12:00              0.010
## 38 2014-09-30      13:00              0.012
## 39 2014-09-30      14:00              0.015
## 40 2014-09-30      15:00              0.016
## 41 2014-09-30      16:00              0.016
## 42 2014-09-30      17:00              0.014
## 43 2014-09-30      18:00              0.016
## 44 2014-09-30      19:00              0.013
## 45 2014-09-30      20:00              0.013
## 46 2014-09-30      21:00              0.010
## 47 2014-09-30      22:00              0.009
## 48 2014-09-30      23:00              0.009

# Perhaps we should see exactly how many states are represented in this dataset
select(ozone, State.Name) %>% unique %>% nrow

## [1] 53

# We can take a look at the unique elements of the State.Name variable to see what's going on.
#
unique(ozone$State.Name)

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District Of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"             
## [52] "Puerto Rico"          "Country Of Mexico"

NOTE: - Now we can see that this monitor just records its values at odd times, rather than on the hour. It seems, from looking at the previous output, that this is the only monitor in the country that does this, so it’s probably not something we should worry about.

Validate with at least one external data source

Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be and it serves as a check on what other things might be wrong in your dataset. External validation can often be as simple as checking your data against a single number, as we will do here.

Notes:

# look at the hourly measurements of ozone
summary(ozone$Sample.Measurement)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.01900 0.03000 0.03026 0.04100 0.24100

# We can get a bit more detail on the distribution by looking at deciles of the data.
quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))

##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
## 0.000 0.009 0.016 0.022 0.026 0.030 0.034 0.039 0.043 0.050 0.241

From the summary we can see that the maximum hourly concentration is quite high (0.241 ppm) but that in general, the bulk of the distribution is far below 0.03.
Knowing that the national standard for ozone is something like 0.075, we can see from the data that:
The data are at least of the right order of magnitude (i.e. the units are correct)
The range of the distribution is roughly what we’d expect, given the regulation around ambient pollution levels
Some hourly levels (less than 10%) are above 0.075 but this may be reasonable given the wording of the standard and the averaging involved.

Try the easy solution first

What’s the simplest answer we could provide to this question? For the moment, don’t worry about whether the answer is correct, but the point is how could you provide prima facie evidence for your hypothesis or question. You may refute that evidence later with deeper analysis, but this is the first pass.

Notes:

# To identify each county we will use a combination of the State.Name and the County.Name Svariables.
#
ranking <- group_by(ozone, State.Name, County.Name) %>%
 summarize(ozone = mean(Sample.Measurement)) %>%
 as.data.frame %>%
 arrange(desc(ozone))

# look at the top 10 counties in this ranking
head(ranking, 10)

##    State.Name County.Name      ozone
## 1    Colorado Clear Creek 0.05022088
## 2  California    Mariposa 0.04849027
## 3  California      Nevada 0.04821713
## 4    Colorado        Park 0.04799975
## 5     Wyoming      Albany 0.04738065
## 6    Colorado     Chaffee 0.04732017
## 7  California        Inyo 0.04469113
## 8        Utah    San Juan 0.04435075
## 9  California   El Dorado 0.04363664
## 10     Nevada        Elko 0.04346459

# we can look at the 10 lowest counties too.
tail(ranking, 10)

##      State.Name          County.Name       ozone
## 788     Georgia               DeKalb 0.020429020
## 789  New Jersey               Warren 0.020400928
## 790   Louisiana     West Baton Rouge 0.020229531
## 791  Washington              Whatcom 0.020114267
## 792      Oregon           Washington 0.019650027
## 793    Oklahoma                Caddo 0.017435731
## 794 Puerto Rico               Juncos 0.013466699
## 795      Alaska Fairbanks North Star 0.013419708
## 796 Puerto Rico              Bayamon 0.009246600
## 797 Puerto Rico               Catano 0.005014176

# Let's take a look at one of the higest level counties, Mariposa County, California. First
# let's see how many observations there are for this county in the dataset.
#
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>% nrow

## [1] 12130

# We can take a look at how ozone varies through the year in this county by looking at
# monthly averages. First we'll need to convert the date variable into a Date class.
ozone <- mutate(ozone, Date.Local = as.Date(Date.Local))

# Then we will split the data by month to look at the average hourly levels.
#
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
 mutate(month = factor(months(Date.Local), levels = month.name)) %>%
 group_by(month) %>%
 summarize(ozone = mean(Sample.Measurement))

## Source: local data frame [12 x 2]
## 
##        month      ozone
##       (fctr)      (dbl)
## 1    January 0.04081929
## 2   February 0.03884305
## 3      March 0.04548374
## 4      April 0.04976124
## 5        May 0.05047637
## 6       June 0.05639331
## 7       July 0.05224286
## 8     August 0.05541681
## 9  September 0.05250999
## 10   October 0.04894909
## 11  November 0.03536872
## 12  December 0.03472859

# Now let's take a look at one of the lowest level counties, Caddo County, Oklahoma.
#
filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>% nrow

## [1] 7562

# Here we see that there are perhaps fewer observations than we would expect for a
# monitor that was measuring 24 hours a day all year. We can check the data to see if
# anything funny is going on.


filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
 mutate(month = factor(months(Date.Local), levels = month.name)) %>%
 group_by(month) %>%
 summarize(ozone = mean(Sample.Measurement))

## Source: local data frame [12 x 2]
## 
##        month       ozone
##       (fctr)       (dbl)
## 1    January 0.018732719
## 2   February 0.002060543
## 3      March 0.002000000
## 4      April 0.023208955
## 5        May 0.024182927
## 6       June 0.020195936
## 7       July 0.019112745
## 8     August 0.023452830
## 9  September 0.019302521
## 10   October 0.021025780
## 11  November 0.020529210
## 12  December 0.012710432

Always be checking. Does that number of observations sound right?
A few things stand out here:

First, ozone appears to be higher in the summer months and lower in the winter months.

Second, there are two months missing (November and December) from the data.

It’s not immediately clear why that is, but it’s probably worth investigating a bit later on.

Here we can see that the levels of ozone are much lower in this county and that also three months are missing (October, November, and December). Given the seasonal nature of ozone, it’s possible that the levels of ozone are so low in those months that it’s not even worth measuring. In fact some of the monthly averages are below the typical method detection limit of the measurement technology, meaning that those values are highly uncertain and likely not distinguishable from zero.

Challenge your solution

The easy solution is nice because it is, well, easy, but you should never allow those results to hold the day. You should always be thinking of ways to challenge the results, especially if those results comport with your prior expectation.

Notes:

set.seed(10234)
N <- nrow(ozone)
idx <- sample(N, N, replace = TRUE)
ozone2 <- ozone[idx, ]


ranking2 <- group_by(ozone2, State.Name, County.Name) %>%
 summarize(ozone = mean(Sample.Measurement)) %>%
 as.data.frame %>%
 arrange(desc(ozone))


cbind(head(ranking, 10), head(ranking2, 10))

##    State.Name County.Name      ozone State.Name County.Name      ozone
## 1    Colorado Clear Creek 0.05022088   Colorado Clear Creek 0.05031511
## 2  California    Mariposa 0.04849027 California    Mariposa 0.04847238
## 3  California      Nevada 0.04821713 California      Nevada 0.04811088
## 4    Colorado        Park 0.04799975   Colorado        Park 0.04794328
## 5     Wyoming      Albany 0.04738065    Wyoming      Albany 0.04740893
## 6    Colorado     Chaffee 0.04732017   Colorado     Chaffee 0.04733954
## 7  California        Inyo 0.04469113 California        Inyo 0.04458294
## 8        Utah    San Juan 0.04435075       Utah    San Juan 0.04438289
## 9  California   El Dorado 0.04363664 California   El Dorado 0.04375390
## 10     Nevada        Elko 0.04346459     Nevada        Elko 0.04340676

cbind(tail(ranking, 10),  tail(ranking2, 10))

##      State.Name          County.Name       ozone  State.Name
## 788     Georgia               DeKalb 0.020429020  New Jersey
## 789  New Jersey               Warren 0.020400928   Louisiana
## 790   Louisiana     West Baton Rouge 0.020229531     Georgia
## 791  Washington              Whatcom 0.020114267  Washington
## 792      Oregon           Washington 0.019650027      Oregon
## 793    Oklahoma                Caddo 0.017435731    Oklahoma
## 794 Puerto Rico               Juncos 0.013466699      Alaska
## 795      Alaska Fairbanks North Star 0.013419708 Puerto Rico
## 796 Puerto Rico              Bayamon 0.009246600 Puerto Rico
## 797 Puerto Rico               Catano 0.005014176 Puerto Rico
##              County.Name       ozone
## 788               Warren 0.020522180
## 789     West Baton Rouge 0.020317902
## 790               DeKalb 0.020284889
## 791              Whatcom 0.019888536
## 792           Washington 0.019570696
## 793                Caddo 0.017259953
## 794 Fairbanks North Star 0.013629980
## 795               Juncos 0.013587565
## 796              Bayamon 0.009459922
## 797               Catano 0.005022035

We can see that the rankings based on the resampled data (columns 4-6 on the right) are very close to the original, with the first 7 being identical. Numbers 8 and 9 get flipped in the resampled rankings but that’s about it. This might suggest that the original rankings are somewhat stable.
Here we can see that the bottom 7 counties are identical in both rankings, but after that things shuffle a bit. We’re less concerned with the counties at the bottom of the list, but this suggests there is also reasonable stability.

Follow up questions

At this point it’s useful to consider a few followup questions.

1. Do you have the right data? Sometimes at the conclusion of an exploratory data analysis, the conclusion is that the dataset is not really appropriate for this question. In this case, the dataset seemed perfectly fine for answering the question of which counties had the highest levels of ozone.
1. Do you need other data? One sub-question we tried to address was whether the county rankings were stable across years. We addressed this by resampling the data once to see if the rankings changed, but the better way to do this would be to simply get the data for previous years and re-do the rankings.
1. Do you have the right question? In this case, it’s not clear that the question we tried to answer has immediate relevance, and the data didn’t really indicate anything to increase the question’s relevance. For example, it might have been more interesting to assess which counties were in violation of the national ambient air quality standard, because determining this could have regulatory implications. However, this is a much more complicated calculation to do, requiring data from at least 3 previous years.
The goal of exploratory data analysis is to get you thinking about your data and reasoning about your question. At this point, we can refine our question or collect new data, all in an iterative process to get at the truth.

Notes:

EDA Checklist

Matthew R. Versaggi

May 19, 2016

Formulate your question

Read in your data

Check the packaging & Run str() & look at TOP and BOTTOM of your data

Check your “n”s

Validate with at least one external data source

Try the easy solution first

Challenge your solution

Follow up questions