QA is basically the same for any data analysis, regardless of the software package you are using: Excel, SAS, SPSS, or R. As you know, the idea when QAing data is to make sure that all of the data in your dataset is roughly what you expect it to be. Below are some tips for doing this in R:
Use str() after importing your data to ensure that everything was imported in the expected format.
library(region5air)
data(airdata)
str(airdata)
## 'data.frame': 367595 obs. of 20 variables:
## $ site : chr "840170890005" "840170311601" "840170314002" "840170310001" ...
## $ data_status: int 0 0 0 0 0 0 0 0 0 0 ...
## $ action_code: int 10 10 10 10 10 10 10 10 10 10 ...
## $ datetime : chr "20141231T0100-0600" "20141231T0100-0600" "20141231T0100-0600" "20141231T0100-0600" ...
## $ parameter : int 44201 44201 44201 44201 44201 44201 44201 44201 44201 44201 ...
## $ duration : int 60 60 60 60 60 60 60 60 60 60 ...
## $ frequency : int 0 0 0 0 0 0 0 0 0 0 ...
## $ value : num 0.022 0.021 0.018 0.021 0.023 0.026 0.023 0.017 0.018 0.021 ...
## $ unit : int 7 7 7 7 7 7 7 7 7 7 ...
## $ qc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poc : int 1 1 1 1 1 1 1 1 2 1 ...
## $ lat : num 42 41.7 41.9 41.7 42.2 ...
## $ lon : num -88.3 -88 -87.8 -87.7 -88.2 ...
## $ GISDatum : chr "WGS84" "WGS84" "WGS84" "WGS84" ...
## $ elev : int 229 226 184 188 235 178 198 186 195 181 ...
## $ method_code: int 87 87 87 87 87 87 87 87 87 87 ...
## $ mpc : chr "1" "1" "1" "1" ...
## $ mpc_value : chr "0.005" "0.005" "0.005" "0.005" ...
## $ uncertainty: logi NA NA NA NA NA NA ...
## $ qualifiers : chr NA NA NA NA ...
You can set up functions or checks that will automatically remove ‘illegal’ data, that is, data that is not physically possible and has ended up in your dataset by accident. For example, ozone concentrations above 1 ppm are not plausible, so we can set them to NA (ozone is parameter code 44201 in airdata):
airdata$value[which(airdata$parameter == 44201 & airdata$value > 1.0)] <- NA
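Here is a minimal sketch of such a check wrapped in a reusable function; the function name and the 0 to 1 ppm range are illustrative assumptions, not part of the region5air package.
# Hypothetical helper: set values outside a plausible range to NA
remove_illegal <- function(x, min_value, max_value) {
  x[!is.na(x) & (x < min_value | x > max_value)] <- NA
  x
}
# Example usage: apply the 0 to 1 ppm range to the ozone rows only
ozone_rows <- airdata$parameter == 44201
airdata$value[ozone_rows] <- remove_illegal(airdata$value[ozone_rows], min_value = 0, max_value = 1)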
Identifying miscoded data is trickier, because miscoded values can be within the realm of possibility. Plotting can help flag suspicious values; for example, boxplot() stores potential outliers in its return value.
data(chicago_air)
box.stat <- boxplot(chicago_air$ozone)
box.stat$out ##accesses the outlier information embedded in the boxplot
## [1] 0.081 0.078
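As a quick follow-up sketch, the flagged values in box.stat$out can be used to pull out the corresponding rows of the data frame:
chicago_air[chicago_air$ozone %in% box.stat$out, ]  # rows whose ozone value was flagged as an outlier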
The psych package also has an outlier() function, which computes a Mahalanobis distance for each observation (demonstrated here with the package's built-in sat.act dataset).
library(psych)
outlier.data <- outlier(sat.act,plot=TRUE)
head(outlier.data)
## 29442 29457 29498 29503 29504 29518
## 2.287859 7.086753 3.384852 3.625971 5.894724 4.919817
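One way to use these distances, sketched here with an arbitrary 0.001 alpha level that is not from the course material, is to compare them to a chi-squared cutoff with degrees of freedom equal to the number of variables.
cutoff <- qchisq(0.999, df = ncol(sat.act))  # chi-squared cutoff for the squared Mahalanobis distances
sum(outlier.data > cutoff, na.rm = TRUE)     # count of observations flagged as potential outliers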
You want to make sure that all of your missing data codes (e.g. -99, -999, -9999) are converted to NA so that R treats them as missing. You can use the following code to set them equal to NA in your dataset.
my.met <- c(2, 2, 2, 1.6, 1.6, 1.4, 0.6, -99, 0.7, 1.2,-99, 2.3,-999)
my.met[which(my.met == -99 | my.met == -999)] = NA
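If you have more than a couple of missing data codes, the same replacement can be written with %in% (a small sketch, not from the original code):
missing_codes <- c(-99, -999, -9999)
my.met[my.met %in% missing_codes] <- NA  # replace any of the listed codes with NA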
# Alternatively, when you read in a file you can use the na.strings argument in read.csv() or read.table() to convert missing data codes to NA as the data is imported
read.csv("E:/RIntro/datasets/dates_values.csv")
read.csv("E:/RIntro/datasets/dates_values.csv", na.strings = c('-99','-999'))
## Date Value Value2
## 1 42393 1 5
## 2 42394 2 8
## 3 42395 3 -99
## 4 42396 4 3
## 5 42397 5 4
## 6 42398 6 -999
## 7 42399 7 6
## 8 42400 8 1
## 9 42401 9 3
## 10 42402 10 4
## Date Value Value2
## 1 42393 1 5
## 2 42394 2 8
## 3 42395 3 NA
## 4 42396 4 3
## 5 42397 5 4
## 6 42398 6 NA
## 7 42399 7 6
## 8 42400 8 1
## 9 42401 9 3
## 10 42402 10 4
The zoo package provides several functions for filling in missing values:
install.packages("zoo")
library(zoo)
my.met
## [1] 2.0 2.0 2.0 1.6 1.6 1.4 0.6 NA 0.7 1.2 NA 2.3 NA
na.locf(my.met) # replaces the NA with the most recent non-NA prior to it
## [1] 2.0 2.0 2.0 1.6 1.6 1.4 0.6 0.6 0.7 1.2 1.2 2.3 2.3
na.approx(my.met) # replaces NA with a linearly interpolated value
## [1] 2.00 2.00 2.00 1.60 1.60 1.40 0.60 0.65 0.70 1.20 1.75 2.30
na.spline(my.met) # replaces NA with a cubic spline interpolation
## [1] 2.0000000 2.0000000 2.0000000 1.6000000 1.6000000 1.4000000 0.6000000
## [8] 0.3755302 0.7000000 1.2000000 1.7774878 2.3000000 2.6075367
It is also a good idea to check your column names after importing.
x <- read.csv("E:/RIntro/datasets/weird_columns.csv")
names(x) # read.csv() will automatically place an "X" in front of any column name that starts with a number. It will also replace blank spaces and symbols such as $%# with dots (.)
## [1] "Date" "Value" "X3...value.2.5.junk"
A great tool for dealing with long names (like some of our site names) is the abbreviate() function available in base R.
wi_site <- read.csv("E:/RIntro/datasets/wi_sites.csv") # Load dataset with long site names
wi_site$new_name <- abbreviate(wi_site$site_name) # Create a new column with the abbreviated site names
wi_site
## site_name county new_name
## 1 Bad River - Tribal School - Odanah Ashland BR-TS-O
## 2 Green Bay East High Ashland GBEH
## 3 Green Bay UW Ashland GBUW
## 4 Eau Claire - DOT Sign Shop Eau Claire EC-DSS
Or replace the existing columns with an abbreviated one
wi_site <- read.csv("E:/RIntro/datasets/wi_sites.csv") # reload the csv file again
wi_site$site_name <- abbreviate(wi_site$site_name) # replaces the existing column with abbreviated site names
wi_site
## site_name county
## 1 BR-TS-O Ashland
## 2 GBEH Ashland
## 3 GBUW Ashland
## 4 EC-DSS Eau Claire
read.csv() and write.csv()
Useful arguments to remember:
# read.csv(filename, stringsAsFactors = FALSE, colClasses = c("character", "character", "numeric", "factor"))
# write.csv(dataframe, filename, row.names = FALSE)
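Here is a small runnable sketch of the round trip, using a made-up data frame and file name:
df <- data.frame(site = c("A", "B", "C"), ozone = c(0.021, 0.035, 0.018))
write.csv(df, "qa_example.csv", row.names = FALSE)           # write without the row-number column
df2 <- read.csv("qa_example.csv", stringsAsFactors = FALSE)  # keep character columns as characters
str(df2)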
Should you save your workspace? The save.image() function saves every object in your current session to an .RData file.
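A quick sketch of how this works (the file name is illustrative):
save.image("qa_workspace.RData")  # saves all objects in the current session to a file
# ...later, in a new session...
load("qa_workspace.RData")        # restores the saved objects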
attach()
The attach() function in R can be used to make columns within data frames accessible with fewer keystrokes. For example:
data(chicago_air)
chicago_air$ozone
attach(chicago_air) #Once you attach a dataset you no longer have to use the `$` symbol
ozone
The attach() function can be dangerous because you may have other objects in memory with the same or very similar names. This means your code could be very difficult to read for somebody else (or future you) and the likelihood of errors increases.
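A minimal sketch of the problem (the ozone vector created here is hypothetical):
ozone <- c(1, 2, 3)  # an unrelated object that happens to share a column name
attach(chicago_air)  # R warns that ozone is masked by the global environment
ozone                # still refers to c(1, 2, 3), not chicago_air$ozone
detach(chicago_air)  # detach when you are done to clean up the search path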
If you get a syntax error, then you’ve entered a command that R can’t understand. Generally the error message is pretty good about pointing to the approximate point in the command where the error is.
Common syntax mistakes are missing commas, unmatched parentheses, and the wrong type of closing brace (for example, an opening square bracket but a closing parenthesis).
Errors of the object-not-found variety can have one of several causes: the object's name is misspelled (remember that R is case sensitive), the package or dataset containing the object has not been loaded, or the code that creates the object has not been run yet.
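A hypothetical example of this kind of error:
mean(Chicago_air$ozone)  # Error: object 'Chicago_air' not found (the data frame is named chicago_air)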
Now let’s try some exercises to test our understanding of performing data QA in R.