CSC 360 Lecture 3 Notes

Harold Nelson

March 29, 2016

Getting Organized

The Working Directory

What Kind of File?

Looking for Bad Data

There are some things you should always do, but they won’t find all problems.

The more questions you ask, the more problems you’ll find.

For the dataframe as a whole do head(), tail(), str() and summary().
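As a minimal sketch of these four calls, here is a small made-up data frame with one missing value and one implausible value planted in it. The name births and all of its values are hypothetical and are not taken from the course data.

```r
# Hypothetical toy data: one NA in fage and one implausible weight (70.5)
births = data.frame(
  mage   = c(19, 24, 31, 28, 35, 22),
  fage   = c(21, NA, 33, 30, 40, 25),
  weight = c(7.1, 6.8, 70.5, 7.9, 8.2, 6.4)
)

head(births, 3)   # first three rows
tail(births, 3)   # last three rows
str(births)       # type of every column and the number of rows
summary(births)   # quartiles per column, plus a count of NA's where present
```

In the summary() output, the NA count under fage and the maximum of weight are the two items that should prompt further questions.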

For numeric variables:

- hist(x)
- summary(x)
- boxplot(x)
- plot(density(x))
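To illustrate how each of these calls reacts to a bad value, the sketch below uses a hypothetical vector of birth weights with one planted data-entry error. The values are invented for the example.

```r
# Hypothetical birth weights in pounds; 70.5 is a planted data-entry error
x = c(7.1, 6.8, 70.5, 7.9, 8.2, 6.4, 7.3, 5.9, 8.0, 7.6)

summary(x)        # the maximum is far from the third quartile
hist(x)           # one isolated bar far to the right
boxplot(x)        # the bad value is drawn as a separate point
plot(density(x))  # a small second bump near 70
```

Each view shows the same problem from a different angle, which is why it pays to run all four rather than rely on one.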

For qualitative variables.
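The notes do not list specific calls for qualitative variables. One common approach, offered here as an assumption rather than as the lecture's own list, is to tabulate the values so that unexpected codes and missing values stand out. The vector below is hypothetical.

```r
# Hypothetical smoking-habit codes with an inconsistent spelling and an NA
habit = c("nonsmoker", "smoker", "nonsmoker", "Nonsmoker", NA, "smoker")

table(habit, useNA = "ifany")  # counts per distinct value, NA included
unique(habit)                  # the distinct values, to spot stray codes
```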

Relationships

Consequences of Cleaning

You need to examine the data you have left after taking care of the bad data.

Do you still have a random sample of your population?

An Example

Look at the data in the 1000births file. I’ll call it nc since it contains a sample of birth records from North Carolina.

library(readxl)
nc = read_excel("1000births.xlsx")

# Look at the structure
str(nc)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  14 variables:
##  $ fage          : num  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : num  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : chr  "younger mom" "younger mom" "younger mom" "younger mom" ...
##  $ weeks         : num  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : chr  "full term" "full term" "full term" "full term" ...
##  $ visits        : num  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ racemom       : num  2 2 1 1 2 2 2 2 1 1 ...
##  $ hispmom       : chr  "N" "N" "M" "M" ...
##  $ gained        : num  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: chr  "not low" "not low" "not low" "not low" ...
##  $ sexbaby       : chr  "male" "male" "female" "male" ...
##  $ habit         : chr  "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...
# Look at the data
summary(nc)
##       fage            mage       mature              weeks      
##  Min.   :14.00   Min.   :13   Length:1000        Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   Class :character   1st Qu.:37.00  
##  Median :30.00   Median :27   Mode  :character   Median :39.00  
##  Mean   :30.26   Mean   :27                      Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                      3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                      Max.   :45.00  
##  NA's   :171                                     NA's   :2      
##     premie              visits        marital         racemom     
##  Length:1000        Min.   : 0.0   Min.   :1.000   Min.   :0.000  
##  Class :character   1st Qu.:10.0   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Median :12.0   Median :1.000   Median :1.000  
##                     Mean   :12.1   Mean   :1.386   Mean   :1.423  
##                     3rd Qu.:15.0   3rd Qu.:2.000   3rd Qu.:2.000  
##                     Max.   :30.0   Max.   :2.000   Max.   :8.000  
##                     NA's   :9      NA's   :1                      
##    hispmom              gained          weight       lowbirthweight    
##  Length:1000        Min.   : 0.00   Min.   : 1.000   Length:1000       
##  Class :character   1st Qu.:20.00   1st Qu.: 6.380   Class :character  
##  Mode  :character   Median :30.00   Median : 7.310   Mode  :character  
##                     Mean   :30.33   Mean   : 7.101                     
##                     3rd Qu.:38.00   3rd Qu.: 8.060                     
##                     Max.   :85.00   Max.   :11.750                     
##                     NA's   :27                                         
##    sexbaby             habit          
##  Length:1000        Length:1000       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 

Several variables have missing values, most often the age of the father (fage), which is missing in 171 records. We can eliminate every record that has a missing value by using the complete.cases() function. Let’s see what the function does. First use it to create nc.cc, then look at what nc.cc is.

nc.cc = complete.cases(nc)
str(nc.cc)
##  logi [1:1000] FALSE FALSE TRUE TRUE FALSE FALSE ...
summary(nc.cc)
##    Mode   FALSE    TRUE    NA's 
## logical     198     802       0
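On a smaller scale, the behavior of complete.cases() is easy to check by eye. The sketch below uses a hypothetical four-row data frame (toy, keep, and toy.clean are names made up for this example) and also shows how the logical vector can be used to filter rows.

```r
# Hypothetical toy data frame with missing values in two different columns
toy = data.frame(
  mage   = c(15, 22, 30, 27),
  fage   = c(NA, 25, 33, 29),
  gained = c(20, 38, NA, 30)
)

keep = complete.cases(toy)  # TRUE only where a row has no NA in any column
keep
## [1] FALSE  TRUE FALSE  TRUE

toy.clean = toy[keep, ]     # logical indexing keeps the TRUE rows
nrow(toy.clean)
## [1] 2
```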

We can use this vector to get a clean dataset for analysis. But this may introduce a bias. Let’s introduce an indicator into the dataset based on the complete.cases() values.

nc$goodRec = complete.cases(nc)

Now we can see whether the records with incomplete data differ from those with complete data. The variables mage and weight have no missing values, so we can compare them across both groups. Let’s do side-by-side boxplots and summaries of each, grouped by goodRec.

boxplot(nc$mage~nc$goodRec,main = "mage by goodRec")

tapply(nc$mage,nc$goodRec,summary)
## $`FALSE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   24.00   24.48   28.00   41.00 
## 
## $`TRUE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   22.00   28.00   27.62   33.00   50.00
boxplot(nc$weight~nc$goodRec,main = "Weight by goodRec")

tapply(nc$weight,nc$goodRec,summary)
## $`FALSE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.940   6.910   6.657   7.690  11.750 
## 
## $`TRUE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.515   7.380   7.211   8.130  11.630

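Here’s the bottom line. If we eliminated all of the incomplete cases from our analysis, our estimates of the average weight of a baby and the average age of the mother would be too high. The complete records have a mean mage of 27.62 against 27 for all 1000 records, and a mean weight of 7.211 against 7.101.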
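The mechanism behind this bias can be reproduced with simulated data. The sketch below is not the North Carolina data: it assumes, purely for illustration, that the father's age is missing much more often when the mother is under 25, and then compares the mean mother's age with and without the incomplete records. All names and probabilities are made up.

```r
set.seed(360)  # arbitrary seed so the sketch is repeatable

n    = 1000
mage = round(runif(n, 15, 45))              # simulated mother's age
fage = mage + 3                             # simulated father's age
# Assumption: fage is missing far more often when the mother is under 25
pMissing = ifelse(mage < 25, 0.40, 0.05)
fage[runif(n) < pMissing] = NA

sim = data.frame(mage, fage)
meanAll      = mean(sim$mage)                       # every record
meanComplete = mean(sim$mage[complete.cases(sim)])  # complete records only

c(all = meanAll, complete = meanComplete)
```

The complete-case mean comes out higher because younger mothers are dropped disproportionately, which is the same pattern seen in the boxplots of mage by goodRec.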

Eliminate Records with Missing fage

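We can use the function is.na() to mark the records with a missing value of fage. This resembles the handling of NULL in SQL, where you test with IS NULL rather than = NULL. You can’t test with “== NA” in R, because any comparison with NA returns NA rather than TRUE or FALSE.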
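A short check makes the difference concrete. The vector here is hypothetical.

```r
# Hypothetical father's ages, one of them missing
fage = c(19, NA, 21)

fage == NA    # every comparison with NA is itself NA
## [1] NA NA NA

is.na(fage)   # the reliable test for missing values
## [1] FALSE  TRUE FALSE
```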

I’ll do this in two ways. We could use either of these to define a subset of the dataset.

fageKnown = !is.na(nc$fage)
fageUnknown = is.na(nc$fage)
table(fageKnown,fageUnknown)
##          fageUnknown
## fageKnown FALSE TRUE
##     FALSE     0  171
##     TRUE    829    0
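Either vector can then be used as a row index to define the subset. The sketch below shows both routes on a hypothetical five-row data frame (toy, known1, and known2 are names invented for this example) and confirms that they select the same rows.

```r
# Hypothetical toy data with two missing father's ages
toy = data.frame(
  mage = c(15, 22, 30, 27, 18),
  fage = c(NA, 25, 33, NA, 20)
)

fageKnown   = !is.na(toy$fage)
fageUnknown = is.na(toy$fage)

known1 = toy[fageKnown, ]     # keep rows where fage is present
known2 = toy[!fageUnknown, ]  # same rows, selected by negating the other vector

identical(known1, known2)
## [1] TRUE
nrow(known1)
## [1] 3
```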