The elements of the checklist are
Formulate your question
Read in your data
Check the packaging
Run str()
Look at the top and the bottom of your data
Check your “n”s
Validate with at least one external data source
Try the easy solution first
Challenge your solution
Follow up
Notes:
Notes:
# install.packages("readr")
# Load the reader library
library(readr)
# Load the dplyr library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the data
ozone <- read_csv("hourly_44201_2014.csv")
# the names of the columns to remove any spaces.
names(ozone) <- make.names(names(ozone))
Notes:
# check the number of rows and columns.
nrow(ozone)
## [1] 9085391
ncol(ozone)
## [1] 24
str(ozone)
## Classes 'tbl_df', 'tbl' and 'data.frame': 9085391 obs. of 24 variables:
## $ State.Code : chr "01" "01" "01" "01" ...
## $ County.Code : chr "003" "003" "003" "003" ...
## $ Site.Num : chr "0010" "0010" "0010" "0010" ...
## $ Parameter.Code : int 44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
## $ POC : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Latitude : num 30.5 30.5 30.5 30.5 30.5 ...
## $ Longitude : num -87.9 -87.9 -87.9 -87.9 -87.9 ...
## $ Datum : chr "NAD83" "NAD83" "NAD83" "NAD83" ...
## $ Parameter.Name : chr "Ozone" "Ozone" "Ozone" "Ozone" ...
## $ Date.Local : Date, format: "2014-03-01" "2014-03-01" ...
## $ Time.Local : chr "01:00" "02:00" "03:00" "04:00" ...
## $ Date.GMT : Date, format: "2014-03-01" "2014-03-01" ...
## $ Time.GMT : chr "07:00" "08:00" "09:00" "10:00" ...
## $ Sample.Measurement : num 0.047 0.047 0.043 0.038 0.035 0.035 0.034 0.037 0.044 0.046 ...
## $ Units.of.Measure : chr "Parts per million" "Parts per million" "Parts per million" "Parts per million" ...
## $ MDL : num 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 0.005 ...
## $ Uncertainty : chr "" "" "" "" ...
## $ Qualifier : chr "" "" "" "" ...
## $ Method.Type : chr "FEM" "FEM" "FEM" "FEM" ...
## $ Method.Code : chr "047" "047" "047" "047" ...
## $ Method.Name : chr "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" "INSTRUMENTAL - ULTRA VIOLET" ...
## $ State.Name : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ County.Name : chr "Baldwin" "Baldwin" "Baldwin" "Baldwin" ...
## $ Date.of.Last.Change: Date, format: "2014-06-30" "2014-06-30" ...
# Check to TOP of the data
head(ozone[, c(6:7, 10)])
## Source: local data frame [6 x 3]
##
## Latitude Longitude Date.Local
## (dbl) (dbl) (date)
## 1 30.498 -87.88141 2014-03-01
## 2 30.498 -87.88141 2014-03-01
## 3 30.498 -87.88141 2014-03-01
## 4 30.498 -87.88141 2014-03-01
## 5 30.498 -87.88141 2014-03-01
## 6 30.498 -87.88141 2014-03-01
# Check to BOTTOM of the data
tail(ozone[, c(6:7, 10)])
## Source: local data frame [6 x 3]
##
## Latitude Longitude Date.Local
## (dbl) (dbl) (date)
## 1 31.7122 -106.3953 2014-08-31
## 2 31.7122 -106.3953 2014-08-31
## 3 31.7122 -106.3953 2014-08-31
## 4 31.7122 -106.3953 2014-08-31
## 5 31.7122 -106.3953 2014-08-31
## 6 31.7122 -106.3953 2014-08-31
In general, counting things is usually a good way to figure out if anything is wrong or not. To do this properly, you need to identify some landmarks that can be used to check against your data.
In this example, we will use the fact that the dataset purportedly contains hourly data for the entire country. These will be our two landmarks for comparison.
We have hourly ozone data that comes from monitors across the country. The monitors should be monitoring continuously during the day, so all hours should be represented. We can take a look at the Time.Local variable to see what time measurements are recorded as being taken.
Notes:
table(ozone$Time.Local)
##
## 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00
## 369790 371153 360748 354518 358642 384903 384525 383040 380256 378408
## 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00
## 378208 379448 379732 381761 383652 385296 386404 386930 386849 387370
## 20:00 21:00 22:00 23:00
## 387309 386087 376100 374262
Notes:
# Perform the check
#
filter(ozone, Time.Local == "13:14") %>%
select(State.Name, County.Name, Date.Local,
Time.Local, Sample.Measurement)
## Source: local data frame [0 x 5]
##
## Variables not shown: State.Name (chr), County.Name (chr), Date.Local
## (date), Time.Local (chr), Sample.Measurement (dbl)
# Nothing returned.
# Perform alternative check
#
filter(ozone, State.Code == "36"
& County.Code == "033"
& Date.Local == "2014-09-30") %>%
select(Date.Local, Time.Local,
Sample.Measurement) %>%
as.data.frame
## Date.Local Time.Local Sample.Measurement
## 1 2014-09-30 00:00 0.011
## 2 2014-09-30 01:00 0.012
## 3 2014-09-30 02:00 0.012
## 4 2014-09-30 03:00 0.011
## 5 2014-09-30 04:00 0.011
## 6 2014-09-30 05:00 0.011
## 7 2014-09-30 06:00 0.010
## 8 2014-09-30 07:00 0.010
## 9 2014-09-30 08:00 0.010
## 10 2014-09-30 09:00 0.010
## 11 2014-09-30 10:00 0.010
## 12 2014-09-30 11:00 0.012
## 13 2014-09-30 12:00 0.011
## 14 2014-09-30 13:00 0.013
## 15 2014-09-30 14:00 0.016
## 16 2014-09-30 15:00 0.017
## 17 2014-09-30 16:00 0.017
## 18 2014-09-30 17:00 0.015
## 19 2014-09-30 18:00 0.017
## 20 2014-09-30 19:00 0.014
## 21 2014-09-30 20:00 0.014
## 22 2014-09-30 21:00 0.011
## 23 2014-09-30 22:00 0.010
## 24 2014-09-30 23:00 0.010
## 25 2014-09-30 00:00 0.010
## 26 2014-09-30 01:00 0.011
## 27 2014-09-30 02:00 0.011
## 28 2014-09-30 03:00 0.010
## 29 2014-09-30 04:00 0.010
## 30 2014-09-30 05:00 0.010
## 31 2014-09-30 06:00 0.009
## 32 2014-09-30 07:00 0.008
## 33 2014-09-30 08:00 0.009
## 34 2014-09-30 09:00 0.009
## 35 2014-09-30 10:00 0.009
## 36 2014-09-30 11:00 0.011
## 37 2014-09-30 12:00 0.010
## 38 2014-09-30 13:00 0.012
## 39 2014-09-30 14:00 0.015
## 40 2014-09-30 15:00 0.016
## 41 2014-09-30 16:00 0.016
## 42 2014-09-30 17:00 0.014
## 43 2014-09-30 18:00 0.016
## 44 2014-09-30 19:00 0.013
## 45 2014-09-30 20:00 0.013
## 46 2014-09-30 21:00 0.010
## 47 2014-09-30 22:00 0.009
## 48 2014-09-30 23:00 0.009
# Perhaps we should see exactly how many states are represented in this dataset
select(ozone, State.Name) %>% unique %>% nrow
## [1] 53
# We can take a look at the unique elements of the State.Name variable to see what's going on.
#
unique(ozone$State.Name)
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District Of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
## [52] "Puerto Rico" "Country Of Mexico"
Notes:
# look at the hourly measurements of ozone
summary(ozone$Sample.Measurement)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.01900 0.03000 0.03026 0.04100 0.24100
# We can get a bit more detail on the distribution by looking at deciles of the data.
quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.000 0.009 0.016 0.022 0.026 0.030 0.034 0.039 0.043 0.050 0.241
From the summary we can see that the maximum hourly concentration is quite high (0.241 ppm) but that in general, the bulk of the distribution is far below 0.03.
Knowing that the national standard for ozone is something like 0.075, we can see from the data that:
The data are at least of the right order of magnitude (i.e. the units are correct)
The range of the distribution is roughly what we’d expect, given the regulation around ambient pollution levels
Some hourly levels (less than 10%) are above 0.075 but this may be reasonable given the wording of the standard and the averaging involved.
Notes:
# To identify each county we will use a combination of the State.Name and the County.Name Svariables.
#
ranking <- group_by(ozone, State.Name, County.Name) %>%
summarize(ozone = mean(Sample.Measurement)) %>%
as.data.frame %>%
arrange(desc(ozone))
# look at the top 10 counties in this ranking
head(ranking, 10)
## State.Name County.Name ozone
## 1 Colorado Clear Creek 0.05022088
## 2 California Mariposa 0.04849027
## 3 California Nevada 0.04821713
## 4 Colorado Park 0.04799975
## 5 Wyoming Albany 0.04738065
## 6 Colorado Chaffee 0.04732017
## 7 California Inyo 0.04469113
## 8 Utah San Juan 0.04435075
## 9 California El Dorado 0.04363664
## 10 Nevada Elko 0.04346459
# we can look at the 10 lowest counties too.
tail(ranking, 10)
## State.Name County.Name ozone
## 788 Georgia DeKalb 0.020429020
## 789 New Jersey Warren 0.020400928
## 790 Louisiana West Baton Rouge 0.020229531
## 791 Washington Whatcom 0.020114267
## 792 Oregon Washington 0.019650027
## 793 Oklahoma Caddo 0.017435731
## 794 Puerto Rico Juncos 0.013466699
## 795 Alaska Fairbanks North Star 0.013419708
## 796 Puerto Rico Bayamon 0.009246600
## 797 Puerto Rico Catano 0.005014176
# Let's take a look at one of the higest level counties, Mariposa County, California. First
# let's see how many observations there are for this county in the dataset.
#
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>% nrow
## [1] 12130
# We can take a look at how ozone varies through the year in this county by looking at
# monthly averages. First we'll need to convert the date variable into a Date class.
ozone <- mutate(ozone, Date.Local = as.Date(Date.Local))
# Then we will split the data by month to look at the average hourly levels.
#
filter(ozone, State.Name == "California" & County.Name == "Mariposa") %>%
mutate(month = factor(months(Date.Local), levels = month.name)) %>%
group_by(month) %>%
summarize(ozone = mean(Sample.Measurement))
## Source: local data frame [12 x 2]
##
## month ozone
## (fctr) (dbl)
## 1 January 0.04081929
## 2 February 0.03884305
## 3 March 0.04548374
## 4 April 0.04976124
## 5 May 0.05047637
## 6 June 0.05639331
## 7 July 0.05224286
## 8 August 0.05541681
## 9 September 0.05250999
## 10 October 0.04894909
## 11 November 0.03536872
## 12 December 0.03472859
# Now let's take a look at one of the lowest level counties, Caddo County, Oklahoma.
#
filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>% nrow
## [1] 7562
# Here we see that there are perhaps fewer observations than we would expect for a
# monitor that was measuring 24 hours a day all year. We can check the data to see if
# anything funny is going on.
filter(ozone, State.Name == "Oklahoma" & County.Name == "Caddo") %>%
mutate(month = factor(months(Date.Local), levels = month.name)) %>%
group_by(month) %>%
summarize(ozone = mean(Sample.Measurement))
## Source: local data frame [12 x 2]
##
## month ozone
## (fctr) (dbl)
## 1 January 0.018732719
## 2 February 0.002060543
## 3 March 0.002000000
## 4 April 0.023208955
## 5 May 0.024182927
## 6 June 0.020195936
## 7 July 0.019112745
## 8 August 0.023452830
## 9 September 0.019302521
## 10 October 0.021025780
## 11 November 0.020529210
## 12 December 0.012710432
Always be checking. Does that number of observations sound right?
A few things stand out here:
First, ozone appears to be higher in the summer months and lower in the winter months.
Second, there are two months missing (November and December) from the data.
It’s not immediately clear why that is, but it’s probably worth investigating a bit later on.
Notes:
set.seed(10234)
N <- nrow(ozone)
idx <- sample(N, N, replace = TRUE)
ozone2 <- ozone[idx, ]
ranking2 <- group_by(ozone2, State.Name, County.Name) %>%
summarize(ozone = mean(Sample.Measurement)) %>%
as.data.frame %>%
arrange(desc(ozone))
cbind(head(ranking, 10), head(ranking2, 10))
## State.Name County.Name ozone State.Name County.Name ozone
## 1 Colorado Clear Creek 0.05022088 Colorado Clear Creek 0.05031511
## 2 California Mariposa 0.04849027 California Mariposa 0.04847238
## 3 California Nevada 0.04821713 California Nevada 0.04811088
## 4 Colorado Park 0.04799975 Colorado Park 0.04794328
## 5 Wyoming Albany 0.04738065 Wyoming Albany 0.04740893
## 6 Colorado Chaffee 0.04732017 Colorado Chaffee 0.04733954
## 7 California Inyo 0.04469113 California Inyo 0.04458294
## 8 Utah San Juan 0.04435075 Utah San Juan 0.04438289
## 9 California El Dorado 0.04363664 California El Dorado 0.04375390
## 10 Nevada Elko 0.04346459 Nevada Elko 0.04340676
cbind(tail(ranking, 10), tail(ranking2, 10))
## State.Name County.Name ozone State.Name
## 788 Georgia DeKalb 0.020429020 New Jersey
## 789 New Jersey Warren 0.020400928 Louisiana
## 790 Louisiana West Baton Rouge 0.020229531 Georgia
## 791 Washington Whatcom 0.020114267 Washington
## 792 Oregon Washington 0.019650027 Oregon
## 793 Oklahoma Caddo 0.017435731 Oklahoma
## 794 Puerto Rico Juncos 0.013466699 Alaska
## 795 Alaska Fairbanks North Star 0.013419708 Puerto Rico
## 796 Puerto Rico Bayamon 0.009246600 Puerto Rico
## 797 Puerto Rico Catano 0.005014176 Puerto Rico
## County.Name ozone
## 788 Warren 0.020522180
## 789 West Baton Rouge 0.020317902
## 790 DeKalb 0.020284889
## 791 Whatcom 0.019888536
## 792 Washington 0.019570696
## 793 Caddo 0.017259953
## 794 Fairbanks North Star 0.013629980
## 795 Juncos 0.013587565
## 796 Bayamon 0.009459922
## 797 Catano 0.005022035
We can see that the rankings based on the resampled data (columns 4-6 on the right) are very close to the original, with the first 7 being identical. Numbers 8 and 9 get flipped in the resampled rankings but that’s about it. This might suggest that the original rankings are somewhat stable.
Here we can see that the bottom 7 counties are identical in both rankings, but after that things shuffle a bit. We’re less concerned with the counties at the bottom of the list, but this suggests there is also reasonable stability.
At this point it’s useful to consider a few followup questions.
Notes: