CSC 360 Lecture 3 Notes

Harold Nelson

March 29, 2016

Getting Organized

Organize your work into projects
Create one folder for all R Stuff
Mine is in Dropbox
Create new projects as subfolders
Use File -> New Project in RStudio

The Working Directory

This is where R will store things and look for things by default.
Creating a new project sets the working directory and makes many things easier.
getwd() to see what yours is
setwd() to set your working directory
Move the file you want to import into your working directory.
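A minimal sketch of checking and changing the working directory. The temporary folder and the file name note.txt are assumptions for illustration and stand in for a real project folder; the original directory is restored at the end.

```r
# Remember where we started so we can go back
old_wd <- getwd()
print(old_wd)

# A temporary folder stands in for a project folder (assumption for illustration)
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Make the project folder the working directory
setwd(project_dir)

# A file written with a relative path now lands in the project folder
writeLines("hello", "note.txt")
found <- file.exists(file.path(project_dir, "note.txt"))
print(found)

# Restore the original working directory
setwd(old_wd)
```

Opening an RStudio project does the setwd() step for you, which is why projects make relative paths easier to work with.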

What Kind of File?

RDA, or RDATA: Click on “File” then on “Open File” then navigate and double-click
Note the load command in the console for future reference.
Excel: Save as CSV or TSV from Excel
Prep file following DataCamp tutorial
Use fn = file.choose() to get an accurate path.
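read_csv(fn) or read_tsv(fn) to get the file into a dataframe. The _ functions (in readr) are usually a better choice than the . functions such as read.csv() (in base): they are typically faster on large files and leave character columns as character.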
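The two import routes above can be sketched as a round trip. The data frame births and the file names are made up for illustration, and the code falls back to base read.csv() if readr is not installed.

```r
# A small made-up data frame to write out and read back
births <- data.frame(mage = c(13, 14, 15), weight = c(7.63, 7.88, 6.63))

# RDA: save() writes named objects, load() restores them under the same names
rda_path <- file.path(tempdir(), "births.rda")
save(births, file = rda_path)
rm(births)
loaded_names <- load(rda_path)   # returns the names of the restored objects
print(loaded_names)

# CSV: in an interactive session, fn <- file.choose() would supply the path
fn <- file.path(tempdir(), "births.csv")
write.csv(births, fn, row.names = FALSE)

# Use readr::read_csv() when the package is installed; otherwise use base R
if (requireNamespace("readr", quietly = TRUE)) {
  births_csv <- suppressMessages(readr::read_csv(fn))
} else {
  births_csv <- read.csv(fn)
}
str(births_csv)
```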

Looking for Bad Data

There are some things you should always do, but they won’t find all problems.

The more questions you ask, the more problems you’ll find.

For the dataframe as a whole do head(), tail(), str() and summary().

For numeric variables:

hist(x)
summary(x)
boxplot(x)
plot(density(x))

For qualitative variables:

table(x)
barplot(table(x))
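Here is a small sketch of these checks on made-up data with two planted problems: an impossible age and a misspelled category. The data frame df and its columns are assumptions for illustration only.

```r
# Made-up data: 290 is an impossible age, "smokr" is a typo
df <- data.frame(
  age   = c(23, 31, 27, 290, 35, NA, 29),
  habit = c("smoker", "nonsmoker", "nonsmoker", "nonsmoker",
            "smokr", "smoker", "nonsmoker"),
  stringsAsFactors = FALSE
)

# Whole-dataframe checks
head(df, 3)
str(df)
summary(df)

# Numeric variable: the maximum stands out in the summary and the plots
age_summary <- summary(df$age)
print(age_summary)
boxplot(df$age, main = "age")
hist(df$age, main = "age")
# density() stops with an error on NA unless na.rm = TRUE
plot(density(df$age, na.rm = TRUE), main = "age")

# Qualitative variable: the misspelling shows up as its own category
habit_counts <- table(df$habit)
print(habit_counts)
barplot(habit_counts)
```

Note that none of these checks would flag a plausible but wrong value, such as an age of 32 entered as 23.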

Relationships

plot(x,y) for pairs of numeric variables
boxplot(x~y) for numeric variable x and categorical variable y
mosaicplot(table(x,y)) for pairs of categorical variables
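A minimal sketch of the three relationship plots, using simulated data. The variable names group and flag and the simulated values are assumptions for illustration.

```r
# Simulated data for illustration only
set.seed(1)
n <- 60
group <- sample(c("a", "b"), n, replace = TRUE)
flag  <- sample(c("yes", "no"), n, replace = TRUE)
x <- rnorm(n, mean = 10)
y <- 2 * x + rnorm(n)

plot(x, y)                      # numeric vs numeric
boxplot(x ~ group)              # numeric split by a categorical variable
two_way <- table(group, flag)   # counts for each pair of categories
print(two_way)
mosaicplot(two_way)             # categorical vs categorical
```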

Consequences of Cleaning

You need to examine the data you have left after taking care of the bad data.

Do you still have a random sample of your population?

An Example

Look at the data in the 1000births file. I’ll call it nc since it contains a sample of birth records from North Carolina.

library(readxl)
nc = read_excel("1000births.xlsx")

# Look at the structure
str(nc)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  14 variables:
##  $ fage          : num  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : num  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : chr  "younger mom" "younger mom" "younger mom" "younger mom" ...
##  $ weeks         : num  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : chr  "full term" "full term" "full term" "full term" ...
##  $ visits        : num  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ racemom       : num  2 2 1 1 2 2 2 2 1 1 ...
##  $ hispmom       : chr  "N" "N" "M" "M" ...
##  $ gained        : num  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: chr  "not low" "not low" "not low" "not low" ...
##  $ sexbaby       : chr  "male" "male" "female" "male" ...
##  $ habit         : chr  "nonsmoker" "nonsmoker" "nonsmoker" "nonsmoker" ...

# Look at the data
summary(nc)

##       fage            mage       mature              weeks      
##  Min.   :14.00   Min.   :13   Length:1000        Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   Class :character   1st Qu.:37.00  
##  Median :30.00   Median :27   Mode  :character   Median :39.00  
##  Mean   :30.26   Mean   :27                      Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                      3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                      Max.   :45.00  
##  NA's   :171                                     NA's   :2      
##     premie              visits        marital         racemom     
##  Length:1000        Min.   : 0.0   Min.   :1.000   Min.   :0.000  
##  Class :character   1st Qu.:10.0   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Median :12.0   Median :1.000   Median :1.000  
##                     Mean   :12.1   Mean   :1.386   Mean   :1.423  
##                     3rd Qu.:15.0   3rd Qu.:2.000   3rd Qu.:2.000  
##                     Max.   :30.0   Max.   :2.000   Max.   :8.000  
##                     NA's   :9      NA's   :1                      
##    hispmom              gained          weight       lowbirthweight    
##  Length:1000        Min.   : 0.00   Min.   : 1.000   Length:1000       
##  Class :character   1st Qu.:20.00   1st Qu.: 6.380   Class :character  
##  Mode  :character   Median :30.00   Median : 7.310   Mode  :character  
##                     Mean   :30.33   Mean   : 7.101                     
##                     3rd Qu.:38.00   3rd Qu.: 8.060                     
##                     Max.   :85.00   Max.   :11.750                     
##                     NA's   :27                                         
##    sexbaby             habit          
##  Length:1000        Length:1000       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
##

Several variables have missing values, most often the age of the father. We can eliminate every record that has a missing value using the complete.cases() function. Let’s see what the function does. First use it to create nc.cc and see what nc.cc is.

nc.cc = complete.cases(nc)
str(nc.cc)

##  logi [1:1000] FALSE FALSE TRUE TRUE FALSE FALSE ...

summary(nc.cc)

##    Mode   FALSE    TRUE    NA's 
## logical     198     802       0

We can use this vector to get a clean dataset for analysis. But this may introduce a bias. Let’s introduce an indicator into the dataset based on the complete.cases() values.

nc$goodRec = complete.cases(nc)
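Getting the clean dataset itself is a matter of using the logical vector as a row index. A minimal sketch on a made-up data frame (toy stands in for nc, and its values are invented):

```r
# Made-up stand-in for nc with a few missing values
toy <- data.frame(
  fage   = c(NA, 19, 21, NA, 30),
  mage   = c(13, 15, 15, 16, 28),
  gained = c(38, 38, NA, 20, 25)
)

ok <- complete.cases(toy)   # TRUE where a row has no NA in any column
print(ok)

toy_clean <- toy[ok, ]      # keep only the complete rows
print(toy_clean)

# na.omit() drops the same rows in one step
same_rows <- identical(rownames(toy_clean), rownames(na.omit(toy)))
print(same_rows)
```

Keeping the indicator as a column, as done above with goodRec, has the advantage that the dropped rows stay available for comparison.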

Now we can see if the records with incomplete data are different from those with complete data. The variables mage and weight are never missing (summary() shows no NA's for them). Let’s do side-by-side boxplots and summaries of these with the values of goodRec.

boxplot(nc$mage~nc$goodRec,main = "mage by goodRec")

tapply(nc$mage,nc$goodRec,summary)

## $`FALSE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   24.00   24.48   28.00   41.00 
## 
## $`TRUE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   15.00   22.00   28.00   27.62   33.00   50.00

boxplot(nc$weight~nc$goodRec,main = "Weight by goodRec")

tapply(nc$weight,nc$goodRec,summary)

## $`FALSE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.940   6.910   6.657   7.690  11.750 
## 
## $`TRUE`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.515   7.380   7.211   8.130  11.630

Here’s the bottom line. If we eliminated all of the cases with missing data from our analysis, our estimates of the average weight of a baby and the average age of the mother would be too high.

Eliminate Records with Missing fage.

We can use the function is.na() to mark the set of records with a missing value of fage. This resembles the handling of NULL in SQL, where you test with IS NULL rather than = NULL. You can’t say “== NA” in R; the comparison returns NA instead of TRUE or FALSE.

I’ll do this in two ways. We could use either of these to define a subset of the dataset.

fageKnown = !is.na(nc$fage)
fageUnknown = is.na(nc$fage)
table(fageKnown,fageUnknown)

##          fageUnknown
## fageKnown FALSE TRUE
##     FALSE     0  171
##     TRUE    829    0
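A short sketch of why is.na() is needed and how the resulting vector defines a subset. The fage values here are made up rather than taken from nc.

```r
fage <- c(NA, 19, 21, NA, 30)   # made-up values

# Comparing with == gives NA for every element, not TRUE or FALSE
eq_result <- fage == NA
print(eq_result)

# is.na() gives a usable logical vector
known <- !is.na(fage)
print(known)

# Use it to subset, or let mean() skip the missing values directly
fage_known <- fage[known]
print(mean(fage_known))
print(mean(fage, na.rm = TRUE))
```

The same indexing works on rows of a dataframe, for example nc[fageKnown, ] for the records where the father's age is recorded.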