Quality Assurance

QA is basically the same for all data analysis, regardless of what software package you are using: Excel, SAS, SPSS, or R. As you know, the idea when QAing data is to make sure that everything in your dataset is roughly what you expect it to be. Below are some tips for doing this in R:

library(region5air)
data(airdata)
str(airdata)
## 'data.frame':    367595 obs. of  20 variables:
##  $ site       : chr  "840170890005" "840170311601" "840170314002" "840170310001" ...
##  $ data_status: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ action_code: int  10 10 10 10 10 10 10 10 10 10 ...
##  $ datetime   : chr  "20141231T0100-0600" "20141231T0100-0600" "20141231T0100-0600" "20141231T0100-0600" ...
##  $ parameter  : int  44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
##  $ duration   : int  60 60 60 60 60 60 60 60 60 60 ...
##  $ frequency  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ value      : num  0.022 0.021 0.018 0.021 0.023 0.026 0.023 0.017 0.018 0.021 ...
##  $ unit       : int  7 7 7 7 7 7 7 7 7 7 ...
##  $ qc         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poc        : int  1 1 1 1 1 1 1 1 2 1 ...
##  $ lat        : num  42 41.7 41.9 41.7 42.2 ...
##  $ lon        : num  -88.3 -88 -87.8 -87.7 -88.2 ...
##  $ GISDatum   : chr  "WGS84" "WGS84" "WGS84" "WGS84" ...
##  $ elev       : int  229 226 184 188 235 178 198 186 195 181 ...
##  $ method_code: int  87 87 87 87 87 87 87 87 87 87 ...
##  $ mpc        : chr  "1" "1" "1" "1" ...
##  $ mpc_value  : chr  "0.005" "0.005" "0.005" "0.005" ...
##  $ uncertainty: logi  NA NA NA NA NA NA ...
##  $ qualifiers : chr  NA NA NA NA ...

‘Illegal’ Data

You can set up functions that automatically remove ‘illegal’ data, that is, values that are not conceivably possible and have ended up in your dataset by accident.

airdata$value[which(airdata$value > 1.0)] <- NA  # the value column holds the ozone concentrations; anything above 1 ppm is not physically plausible
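Since the text above suggests setting up functions for this, here is a minimal sketch of a reusable check; the name remove_illegal and the 0 to 1 ppm range are illustrative assumptions, not part of the dataset's documentation:

# Hypothetical helper: values outside a plausible [low, high] range become NA.
# Choose the limits based on what is physically possible for your measurement.
remove_illegal <- function(x, low, high) {
  x[which(x < low | x > high)] <- NA
  x
}

airdata$value <- remove_illegal(airdata$value, low = 0, high = 1.0)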

Miscoded Data

This is trickier, because miscoded data can be within the realm of possibility.

  • The best way to detect miscoded values is to look at a summary of your data and pay attention to the minimum and maximum values (see the summary() example after this list).
  • You can plot the data to see if anything looks odd on visual inspection.
    • Boxplots, which flag outliers, are typically handy for this.
  • You can also run outlier tests on your dataset.
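For the first bullet, a quick look with summary() (using the chicago_air dataset, which also comes with the region5air package) shows the minimum and maximum at a glance:

data(chicago_air)
summary(chicago_air$ozone)  # do Min. and Max. look physically plausible for ozone?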
box.stat <- boxplot(chicago_air$ozone)

box.stat$out  # accesses the outlier values stored in the boxplot object
## [1] 0.081 0.078
library(psych)
outlier.data <- outlier(sat.act, plot = TRUE)  # Mahalanobis distances for the psych package's sat.act dataset

head(outlier.data)
##    29442    29457    29498    29503    29504    29518 
## 2.287859 7.086753 3.384852 3.625971 5.894724 4.919817

Missing Data

You want to make sure that all of your missing-data codes (e.g. -99, -999, -9999) end up coded as NA. You can use the following code to find them and replace them with NA in your dataset.

my.met <- c(2, 2, 2, 1.6, 1.6, 1.4, 0.6, -99, 0.7, 1.2, -99, 2.3, -999)
my.met[which(my.met == -99 | my.met == -999)] <- NA

# Alternatively, when you read in a file, the na.strings argument of
# read.csv() (or read.table()) converts missing-data codes to NA on import
read.csv("E:/RIntro/datasets/dates_values.csv")
##     Date Value Value2
## 1  42393     1      5
## 2  42394     2      8
## 3  42395     3    -99
## 4  42396     4      3
## 5  42397     5      4
## 6  42398     6   -999
## 7  42399     7      6
## 8  42400     8      1
## 9  42401     9      3
## 10 42402    10      4
read.csv("E:/RIntro/datasets/dates_values.csv", na.strings = c('-99','-999'))
##     Date Value Value2
## 1  42393     1      5
## 2  42394     2      8
## 3  42395     3     NA
## 4  42396     4      3
## 5  42397     5      4
## 6  42398     6     NA
## 7  42399     7      6
## 8  42400     8      1
## 9  42401     9      3
## 10 42402    10      4

Some options for filling in NAs

The zoo package provides several convenience functions for filling in missing values:

install.packages("zoo")
library(zoo)
my.met
##  [1] 2.0 2.0 2.0 1.6 1.6 1.4 0.6  NA 0.7 1.2  NA 2.3  NA
na.locf(my.met)  # replaces each NA with the most recent non-NA value before it
##  [1] 2.0 2.0 2.0 1.6 1.6 1.4 0.6 0.6 0.7 1.2 1.2 2.3 2.3
na.approx(my.met)  # replaces NA with a linearly interpolated value
# (note: the trailing NA is dropped because there is no later value to interpolate toward)
##  [1] 2.00 2.00 2.00 1.60 1.60 1.40 0.60 0.65 0.70 1.20 1.75 2.30
na.spline(my.met)  # replaces NA with a cubic spline interpolation
##  [1] 2.0000000 2.0000000 2.0000000 1.6000000 1.6000000 1.4000000 0.6000000
##  [8] 0.3755302 0.7000000 1.2000000 1.7774878 2.3000000 2.6075367


Other QA Issues

  • Neglecting due diligence on a package (e.g. is the author qualified to develop a statistical package?)
  • Not reading the documentation, and misusing a package or function as a result
  • Making a mistake in the code and not checking your work
    • You should always check that your code is behaving as expected! (See the stopifnot() sketch below.)
  • Poor file organization
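As a minimal sketch of that last point, base R's stopifnot() halts a script the moment an expectation fails; the particular checks below are illustrative assumptions about the my.met vector from earlier:

# Stop immediately if the data are not what we expect them to be
stopifnot(
  is.numeric(my.met),                     # measurements should be numeric
  !any(my.met %in% c(-99, -999, -9999)),  # missing-data codes were all converted to NA
  all(my.met >= 0, na.rm = TRUE)          # no negative values (illustrative check)
)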

Common Pitfalls

x <- read.csv("E:/RIntro/datasets/weird_columns.csv")
# read.csv() automatically places an "X" in front of any column name that starts
# with a number, and replaces spaces and symbols like $%# with dots (.)
names(x)
## [1] "Date"                "Value"               "X3...value.2.5.junk"


A great tool for dealing with long names (like some of our site names) is the abbreviate() function, available in base R.

wi_site <- read.csv("E:/RIntro/datasets/wi_sites.csv")  # Load dataset with long site names

wi_site$new_name <- abbreviate(wi_site$site_name) # Create a new column with the abbreviated site names
wi_site
##                            site_name     county new_name
## 1 Bad River - Tribal School - Odanah    Ashland  BR-TS-O
## 2                Green Bay East High    Ashland     GBEH
## 3                       Green Bay UW    Ashland     GBUW
## 4         Eau Claire - DOT Sign Shop Eau Claire   EC-DSS


Or replace the existing column with the abbreviated names:

wi_site <- read.csv("E:/RIntro/datasets/wi_sites.csv")  # reload the csv file

wi_site$site_name <- abbreviate(wi_site$site_name)  # overwrite the column with the abbreviated site names
wi_site
##   site_name     county
## 1   BR-TS-O    Ashland
## 2      GBEH    Ashland
## 3      GBUW    Ashland
## 4    EC-DSS Eau Claire

Confusion with other software: R for SAS users


A few other items of note

  • read.csv() and write.csv()
# read.csv(filename, stringsAsFactors = FALSE,
#          colClasses = c("character", "character", "numeric", "factor"))
# write.csv(dataframe, filename, row.names = FALSE)
  • Should you save your workspace? save.image()

  • attach()

    • The attach() function can be used to make the columns of a data frame accessible by name alone, with fewer keystrokes. For example:
data(chicago_air)
chicago_air$ozone
attach(chicago_air)  # once attached, you no longer have to use the $ symbol
ozone

The attach() function can be dangerous because you may have other objects in memory with the same or very similar names, so it becomes ambiguous which object your code is actually using. Your code also becomes harder to read for somebody else (or future you), and the likelihood of errors increases.
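A safer pattern is with(), which gives the same shorthand for a single expression without leaving the data frame on the search path; and if you do use attach(), remember to detach() when you are done:

with(chicago_air, mean(ozone, na.rm = TRUE))  # temporary shorthand, no lingering side effects

attach(chicago_air)
mean(ozone, na.rm = TRUE)
detach(chicago_air)  # removes the data frame from the search path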



What to do with R errors

Syntax errors

  • If you get a syntax error, then you’ve entered a command that R can’t understand. Generally the error message is pretty good about pointing to the approximate place in the command where the error is.

  • Common syntax mistakes are missing commas, unmatched parentheses, and the wrong type of closing brace (for example, an opening square bracket but a closing parenthesis).
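For instance, both lines below are deliberately broken (the messages in the comments are approximately what R prints):

mean(c(2, 4 6))  # missing comma: Error: unexpected numeric constant
seq(1, 10]       # wrong closing brace: Error: unexpected ']'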

Object not found

Errors of the object-not-found variety can have one of several causes:

  • The name is not spelled correctly, or the capitalization is wrong
  • The package or file containing the object is not on the search list
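Both causes are easy to reproduce with objects from earlier in this section:

my.Met           # Error: object 'my.Met' not found (the object is my.met, lowercase)
na.locf(my.met)  # "could not find function" error unless library(zoo) has been loaded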


Open discussion on R pitfalls…what problems have you had?


Now let’s try some exercises to test our understanding of performing data QA in R.

Exercise 8

http://rpubs.com/kfrost14/Ex8