Harold Nelson
March 29, 2016
There are some things you should always do, but they won’t find all problems.
The more questions you ask, the more problems you’ll find.
For the dataframe as a whole do head(), tail(), str() and summary().
For numeric variables:
- hist(x)
- summary(x)
- boxplot(x)
- plot(density(x))
For qualitative variables.
Relationships
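The notes name no specific functions for qualitative variables or relationships, so what follows is only a sketch of common base R choices. It runs on the built-in mtcars data rather than the birth records, and treating cyl as a qualitative variable is an assumption made purely for illustration.

```r
# Illustrative only: mtcars ships with R; it is not the dataset used in these notes.
df = mtcars
df$cyl = factor(df$cyl)   # assumption: treat cylinder count as qualitative

# Qualitative variable: counts and proportions per level
cyl.counts = table(df$cyl, useNA = "ifany")
cyl.props = prop.table(cyl.counts)
print(cyl.counts)
print(round(cyl.props, 3))

# Relationship between a numeric and a qualitative variable
mpg.by.cyl = tapply(df$mpg, df$cyl, mean)
print(mpg.by.cyl)

# Relationship between two numeric variables
mpg.wt.cor = cor(df$mpg, df$wt)
print(mpg.wt.cor)
```

Passing useNA = "ifany" to table() makes missing values show up as their own level instead of silently disappearing from the counts.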
You need to examine the data you have left after taking care of the bad data.
Do you still have a random sample of your population?
Look at the data in the 1000births file. I’ll call it nc since it contains a sample of birth records from North Carolina.
library(readxl)
nc = read_excel("1000births.xlsx")
# Look at the structure
str(nc)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 14 variables:
## $ fage : num NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : num 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : chr "younger mom" "younger mom" "younger mom" "younger mom" ...
## $ weeks : num 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : chr "full term" "full term" "full term" "full term" ...
## $ visits : num 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : num 2 2 2 2 2 2 2 2 2 2 ...
## $ racemom : num 2 2 1 1 2 2 2 2 1 1 ...
## $ hispmom : chr "N" "N" "M" "M" ...
## $ gained : num 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: chr "not low" "not low" "not low" "not low" ...
## $ sexbaby : chr "male" "male" "female" "male" ...
## $ habit : chr "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
# Look at the data
summary(nc)
## fage mage mature weeks
## Min. :14.00 Min. :13 Length:1000 Min. :20.00
## 1st Qu.:25.00 1st Qu.:22 Class :character 1st Qu.:37.00
## Median :30.00 Median :27 Mode :character Median :39.00
## Mean :30.26 Mean :27 Mean :38.33
## 3rd Qu.:35.00 3rd Qu.:32 3rd Qu.:40.00
## Max. :55.00 Max. :50 Max. :45.00
## NA's :171 NA's :2
## premie visits marital racemom
## Length:1000 Min. : 0.0 Min. :1.000 Min. :0.000
## Class :character 1st Qu.:10.0 1st Qu.:1.000 1st Qu.:1.000
## Mode :character Median :12.0 Median :1.000 Median :1.000
## Mean :12.1 Mean :1.386 Mean :1.423
## 3rd Qu.:15.0 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :30.0 Max. :2.000 Max. :8.000
## NA's :9 NA's :1
## hispmom gained weight lowbirthweight
## Length:1000 Min. : 0.00 Min. : 1.000 Length:1000
## Class :character 1st Qu.:20.00 1st Qu.: 6.380 Class :character
## Mode :character Median :30.00 Median : 7.310 Mode :character
## Mean :30.33 Mean : 7.101
## 3rd Qu.:38.00 3rd Qu.: 8.060
## Max. :85.00 Max. :11.750
## NA's :27
## sexbaby habit
## Length:1000 Length:1000
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
There are many missing values in several variables, mostly in fage, the age of the father. We can eliminate every record that has a missing value using the complete.cases() function. Let’s see what the function does. First use it to create nc.cc and see what nc.cc is.
nc.cc = complete.cases(nc)
str(nc.cc)
## logi [1:1000] FALSE FALSE TRUE TRUE FALSE FALSE ...
summary(nc.cc)
## Mode FALSE TRUE NA's
## logical 198 802 0
We can use this logical vector to get a clean dataset for analysis, but dropping records this way may introduce a bias. To check, let’s add an indicator column to the dataset based on the complete.cases() values.
nc$goodRec = complete.cases(nc)
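As a side note, here is a minimal sketch of how a complete.cases() vector can be used to subset a data frame. The data frame toy and its values are hypothetical and are not taken from the birth records.

```r
# Hypothetical toy data, not the birth records
toy = data.frame(
  mage   = c(20, 25, 30, 35),
  fage   = c(NA, 27, 31, NA),
  weight = c(6.5, 7.2, NA, 8.0)
)

keep = complete.cases(toy)   # TRUE where a row has no NA in any column
toy.clean = toy[keep, ]      # keep only the complete rows
toy.alt = na.omit(toy)       # base R shortcut that keeps the same rows

print(keep)
print(toy.clean)
```

Only the second row survives here, because each of the other rows has an NA in at least one column.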
Now we can see whether the records with incomplete data differ from those with complete data. The variables mage and weight have no missing values. Let’s do side-by-side boxplots and summaries of each, grouped by goodRec.
boxplot(nc$mage~nc$goodRec,main = "mage by goodRec")
tapply(nc$mage,nc$goodRec,summary)
## $`FALSE`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 20.00 24.00 24.48 28.00 41.00
##
## $`TRUE`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 22.00 28.00 27.62 33.00 50.00
boxplot(nc$weight~nc$goodRec,main = "Weight by goodRec")
tapply(nc$weight,nc$goodRec,summary)
## $`FALSE`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.940 6.910 6.657 7.690 11.750
##
## $`TRUE`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.515 7.380 7.211 8.130 11.630
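Here’s the bottom line. If we eliminate all of the cases with missing data from our analysis, we will overestimate both the average weight of a baby and the average age of the mother: the complete records have a mean weight of 7.211 versus 6.657 for the incomplete ones, and a mean mage of 27.62 versus 24.48.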
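The size of this effect can be estimated from the numbers already printed. The sketch below copies the group sizes and group means from the output above and recombines them. The variable names are mine, and the results carry the rounding of the printed summaries.

```r
# Group sizes and means copied from the summaries above
n = c(incomplete = 198, complete = 802)
weight.mean = c(incomplete = 6.657, complete = 7.211)
mage.mean = c(incomplete = 24.48, complete = 27.62)

# The overall mean is the size-weighted average of the group means
weight.all = weighted.mean(weight.mean, n)
mage.all = weighted.mean(mage.mean, n)

# How far the complete-case mean sits above the overall mean
weight.bias = weight.mean[["complete"]] - weight.all
mage.bias = mage.mean[["complete"]] - mage.all

print(round(c(weight.all = weight.all, weight.bias = weight.bias), 3))
print(round(c(mage.all = mage.all, mage.bias = mage.bias), 3))
```

The recombined means agree with the overall means reported by summary(nc), about 7.101 for weight and 27 for mage. Using only the complete records would therefore overstate mean weight by roughly 0.11 and mean mage by roughly 0.6.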
We can use the function is.na() to mark the set of records with a missing value of fage. This resembles the handling of NULL in SQL, where you test with IS NULL rather than = NULL. You can’t say “== NA” in R, because any comparison with NA returns NA.
I’ll do this in two ways. We could use either of these to define a subset of the dataset.
fageKnown = !is.na(nc$fage)
fageUnknown = is.na(nc$fage)
table(fageKnown,fageUnknown)
##          fageUnknown
## fageKnown FALSE TRUE
##     FALSE     0  171
##     TRUE    829    0
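The remark about “== NA” can be checked directly, and the same logical vectors can be used to pull out a subset. This is a minimal sketch on hypothetical values; the names x, toy, known and unknown are introduced only for illustration.

```r
# Comparing with NA never yields TRUE or FALSE, only NA
x = c(19, NA, 21)
print(x == NA)     # NA NA NA
print(is.na(x))    # FALSE TRUE FALSE

# A logical vector from is.na() can index a data frame directly
toy = data.frame(mage = c(15, 16, 17), fage = c(19, NA, 21))  # hypothetical values
known = toy[!is.na(toy$fage), ]     # rows where fage is present
unknown = toy[is.na(toy$fage), ]    # rows where fage is missing
print(known)
print(unknown)
```

The same indexing pattern applied to nc with fageKnown or fageUnknown would give the two subsets counted in the table above.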