Exploring Data

Harold Nelson

8/14/2017

This presentation uses a dataset of North Carolina birth data. It is based on an OpenIntro lab, which you can find at https://www.openintro.org/stat/data/?data=nc. Visit this link for basic documentation on the data.

Obtain the Data

To obtain the data, use the following commands. After the load() command, run str() to verify that the load succeeded. How many variables are there? How many observations? Does the file look like the one documented on the OpenIntro website?

# First get the dataset, which is stored as an RData file, by copying it from the OpenIntro website to your computer. This only has to be done once.

download.file("http://www.openintro.org/stat/data/nc.RData",
              destfile = "nc.RData")
# Note that after you have run the download command above once successfully, you should delete it.

# Running it once successfully places the data on your computer.

# However, the load command below must be run every time you run knitr.

load("nc.RData")
str(nc)
## 'data.frame':    1000 obs. of  13 variables:
##  $ fage          : int  NA NA 19 21 NA NA 18 17 NA 20 ...
##  $ mage          : int  13 14 15 15 15 15 15 15 16 16 ...
##  $ mature        : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
##  $ weeks         : int  39 42 37 41 39 38 37 35 38 37 ...
##  $ premie        : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
##  $ visits        : int  10 15 11 6 9 19 12 5 9 13 ...
##  $ marital       : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gained        : int  38 20 38 34 27 22 76 15 NA 52 ...
##  $ weight        : num  7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
##  $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
##  $ gender        : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
##  $ habit         : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
##  $ whitemom      : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
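Rather than deleting the download command after the first successful run, one option is to guard it so that it only runs when the file is missing. The sketch below assumes a hypothetical helper, fetch_once(), which is not part of the original lab. Its fetch argument is there only so the helper can be tried without a network connection.

```r
# Hypothetical helper (not part of the OpenIntro lab): download a file
# only if it is not already on disk. Returns TRUE if a download happened.
fetch_once <- function(url, destfile, fetch = download.file) {
  if (file.exists(destfile)) {
    return(FALSE)  # already downloaded; nothing to do
  }
  # mode = "wb" keeps the binary RData file intact on Windows
  fetch(url, destfile = destfile, mode = "wb")
  TRUE
}

# Intended use, with the URL and file name from the commands above:
# fetch_once("http://www.openintro.org/stat/data/nc.RData", "nc.RData")
# load("nc.RData")
```

With a guard like this, the same document can be knit repeatedly without downloading the file again.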

To hear my commentary on the section above, visit http://www.youtube.com/watch?v=PVca059TWiU.

Problem 1

Run a summary of the dataframe to look for anomalies. Do all of the minimum and maximum values make sense? Many of the variables have NA values. Offer a reason why NA values might be present in each of these variables.

summary(nc)
##       fage            mage            mature        weeks      
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00  
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00  
##  Median :30.00   Median :27                     Median :39.00  
##  Mean   :30.26   Mean   :27                     Mean   :38.33  
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00  
##  Max.   :55.00   Max.   :50                     Max.   :45.00  
##  NA's   :171                                    NA's   :2      
##        premie        visits            marital        gained     
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.00  
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.00  
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.00  
##                  Mean   :12.1                     Mean   :30.33  
##                  3rd Qu.:15.0                     3rd Qu.:38.00  
##                  Max.   :30.0                     Max.   :85.00  
##                  NA's   :9                        NA's   :27     
##      weight       lowbirthweight    gender          habit    
##  Min.   : 1.000   low    :111    female:503   nonsmoker:873  
##  1st Qu.: 6.380   not low:889    male  :497   smoker   :126  
##  Median : 7.310                               NA's     :  1  
##  Mean   : 7.101                                              
##  3rd Qu.: 8.060                                              
##  Max.   :11.750                                              
##                                                              
##       whitemom  
##  not white:284  
##  white    :714  
##  NA's     :  2  
##                 
##                 
##                 
## 

The minimum values for both father’s age and mother’s age are suspiciously low, but not impossible.

The maximum value for mother’s age is high but not impossible.

The maximum value for gestation, 45, is probably incorrect.

Several variables have small numbers of NA values, probably due to refusals to answer or transcription errors.

There are many NA values for fage. When the father’s age is missing, the relationship between the father and the mother was probably weak. It would be interesting to see if any other variables in the dataset are related to the absence of the father’s age.

The minimum value for weight gained is suspicious but not impossible.
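One way to follow up on the missing values is to count the NAs in each column and then cross-tabulate "father's age is missing" against another variable. The sketch below uses a small made-up data frame, toy, rather than the real nc data, so its counts are illustrative only. The column names fage and marital are borrowed from nc.

```r
# Made-up data for illustration; not the nc dataset.
toy <- data.frame(
  fage    = c(NA, 25, NA, 31, 40, NA),
  marital = factor(c("not married", "married", "not married",
                     "married", "married", "married"))
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Cross-tabulate "fage is missing" against marital status
fage_missing <- is.na(toy$fage)
missing_by_marital <- table(fage_missing, toy$marital)

na_counts
missing_by_marital
```

The same two steps should apply to nc by replacing toy with nc.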

Problem 2

Provide the basic descriptive statistics for the weight of the baby. You should give the mean, median, 25th percentile, minimum, maximum, range and interquartile range.

# Place the R code you need to answer this question in this chunk.
summary(nc$weight)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.380   7.310   7.101   8.060  11.750
IQR(nc$weight)
## [1] 1.68
max(nc$weight) - min(nc$weight) # Range
## [1] 10.75
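The interquartile range is the third quartile minus the first quartile. For the reported quartiles that is 8.060 - 6.380 = 1.68, which matches the IQR() output. The sketch below shows the same relationship, and the range as maximum minus minimum, on a small made-up vector x (not the nc data).

```r
# Made-up values for illustration; not the nc dataset.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# 25th and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)

# IQR computed by hand should agree with IQR()
iqr_by_hand <- q[2] - q[1]

# Range as a single number: maximum minus minimum
range_width <- diff(range(x))
```

Here diff(range(x)) is an alternative to max(x) - min(x) that gives the same result.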

Problem 3

Produce a histogram and boxplot of the weights of the babies here.

hist(nc$weight)

boxplot(nc$weight, horizontal = TRUE)

Describe the information you see in the graphic displays. Make two accurate statements based on these graphics.

The weights are centered at approximately seven pounds and are slightly left-skewed.

Problem 4

We want to compare the distribution of weight for male and female babies. To do this graphically, produce side-by-side boxplots. For numerical descriptive statistics, use tapply and the summary function.

# Place the R code you need to answer this question in this chunk.
boxplot(nc$weight~nc$gender)

tapply(nc$weight,nc$gender,summary)
## $female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.250   7.130   6.903   7.750  11.630 
## 
## $male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.380   6.560   7.440   7.302   8.310  11.750

Make two accurate statements describing the facts you see in these results.

All of the descriptive statistics confirm that male babies tend to be heavier than female babies.

Graphically, male babies are slightly heavier than female babies but have fewer high outliers.
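A statement about outliers can be checked numerically as well as by eye. The sketch below counts high outliers per group using boxplot.stats(), which applies the same 1.5 x IQR rule that boxplot() uses by default. The data frame toy and the helper count_high_outliers() are made up for illustration and are not part of the lab.

```r
# Made-up data for illustration; not the nc dataset.
toy <- data.frame(
  weight = c(6, 7, 7, 7, 8, 12,   7, 7, 8, 8, 8, 9),
  gender = factor(rep(c("female", "male"), each = 6))
)

# Hypothetical helper: number of boxplot outliers that lie above the median
count_high_outliers <- function(x) {
  s <- boxplot.stats(x)      # stats[3] is the median; out holds the outliers
  sum(s$out > s$stats[3])
}

high_outliers <- tapply(toy$weight, toy$gender, count_high_outliers)
high_outliers
```

Applied to nc$weight and nc$gender, the same call would give the counts behind the graphical impression.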

Problem 5

Produce a table and a barplot to describe the variable habit.

# Place the R code you need to answer this question in this chunk.
table(nc$habit)
## 
## nonsmoker    smoker 
##       873       126
barplot(table(nc$habit))

Make one accurate statement describing the distribution of the variable habit here.

There are more nonsmokers than smokers.
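Counts are often easier to interpret as proportions. The sketch below re-enters the two counts from the table output above by hand and converts them with prop.table(). With the data loaded, prop.table(table(nc$habit)) should give the same result.

```r
# Counts copied from the table(nc$habit) output above
habit_counts <- c(nonsmoker = 873, smoker = 126)

# Convert counts to proportions of the non-missing total
habit_props <- prop.table(habit_counts)

# Express as percentages, rounded to one decimal place
round(100 * habit_props, 1)
```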

Problem 6

Produce a table and a mosaicplot to describe the relationship between the variables habit and lowbirthweight.

# Place the R code you need to answer this question in this chunk.
table(nc$habit,nc$lowbirthweight)
##            
##             low not low
##   nonsmoker  92     781
##   smoker     18     108
mosaicplot(table(nc$habit,nc$lowbirthweight))

Make two accurate statements to describe the relationship between habit and lowbirthweight.

Smokers had a higher likelihood of a low birthweight baby. 18 out of 126 smokers (about 14.3%) had low birthweight babies, compared with 92 out of 873 nonsmokers (about 10.5%).

The graphical results confirm this.
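The comparison of rates can be made directly with row proportions. The sketch below rebuilds the two-way table from the counts printed above and divides each row by its total using prop.table() with margin = 1. The object names tab and row_props are chosen here for illustration.

```r
# Counts copied from the table(nc$habit, nc$lowbirthweight) output above.
# matrix() fills column by column: first the "low" column, then "not low".
tab <- matrix(
  c(92, 18, 781, 108), nrow = 2,
  dimnames = list(habit = c("nonsmoker", "smoker"),
                  lowbirthweight = c("low", "not low"))
)

# Proportion of low / not low within each habit group (rows sum to 1)
row_props <- prop.table(tab, margin = 1)
round(row_props, 3)
```

With the data loaded, prop.table(table(nc$habit, nc$lowbirthweight), margin = 1) should produce the same proportions.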

Note: For Problems 7 and 8 it is important to review the corresponding material in the Module 2 Lecture Notes.

Problem 7

Produce a scatterplot to describe the relationship between the weeks of gestation and the birthweight.

# Place the R code you need to answer this question in this chunk.
plot(nc$weeks,nc$weight)

Make one accurate statement to describe the relationship between the weeks of gestation and the birthweight.

Babies with more weeks of gestation tend to weigh more at birth.
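The direction and strength of a linear relationship in a scatterplot can be summarized with a correlation coefficient. Because weeks has missing values in nc, cor() needs use = "complete.obs"; otherwise it returns NA. The sketch below shows this on a small made-up data frame, toy, not the real data.

```r
# Made-up data for illustration; not the nc dataset.
toy <- data.frame(
  weeks  = c(35, 37, NA, 39, 40, 41),
  weight = c(5.1, 6.4, 7.0, 7.2, 7.9, 8.3)
)

# Default behavior: any NA in the inputs makes the result NA
r_default <- cor(toy$weeks, toy$weight)

# Drop incomplete pairs before computing the correlation
r <- cor(toy$weeks, toy$weight, use = "complete.obs")
```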

Problem 8

Produce a linear model which shows how birthweight depends on the weeks of gestation. Display the summary results of the linear model.

# Place the R code you need to answer this question in this chunk.
Joe = lm(nc$weight~nc$weeks)
summary(Joe)
## 
## Call:
## lm(formula = nc$weight ~ nc$weeks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5775 -0.7048 -0.0235  0.7022  4.4165 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.09529    0.46464  -13.12   <2e-16 ***
## nc$weeks     0.34433    0.01209   28.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.119 on 996 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.449,  Adjusted R-squared:  0.4485 
## F-statistic: 811.7 on 1 and 996 DF,  p-value: < 2.2e-16

According to this model, how much extra birth weight would you expect to be associated with one extra week of gestation?

According to the model, one extra week of gestation is associated with about 0.34 pounds of additional birthweight.
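The slope can be read as the difference between two predictions that are one week apart. The sketch below plugs the intercept and slope from the summary output above into the fitted line. The function name predict_weight() is made up for illustration.

```r
# Coefficients copied from the summary(Joe) output above
intercept <- -6.09529
slope     <-  0.34433

# Fitted line: predicted birthweight (pounds) for a given number of weeks
predict_weight <- function(weeks) intercept + slope * weeks

# The difference between predictions one week apart equals the slope
extra_per_week <- predict_weight(40) - predict_weight(39)
extra_per_week
```

Predictions from this line are only meaningful within the range of gestation lengths seen in the data.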