Harold Nelson
8/14/2017
This presentation uses a dataset of North Carolina birth data. It is based on an OpenIntro lab, which you can find at https://www.openintro.org/stat/data/?data=nc. Visit this link for basic documentation on the data.
To obtain the data, use the following commands. After the load() command, run str() to verify that the load succeeded. How many variables are there? How many observations? Does the file look like the one documented on the OpenIntro website?
# First get the dataset, which is stored as an RData file, by downloading it from the OpenIntro website to your computer. This only has to be done once.
download.file("http://www.openintro.org/stat/data/nc.RData",
destfile = "nc.RData")
# Note that after you have run the download command above once successfully, you should delete it.
# Just running it once successfully will place the data on your computer.
# However, the load command below must be run every time you run knitr.
load("nc.RData")
str(nc)
## 'data.frame': 1000 obs. of 13 variables:
## $ fage : int NA NA 19 21 NA NA 18 17 NA 20 ...
## $ mage : int 13 14 15 15 15 15 15 15 16 16 ...
## $ mature : Factor w/ 2 levels "mature mom","younger mom": 2 2 2 2 2 2 2 2 2 2 ...
## $ weeks : int 39 42 37 41 39 38 37 35 38 37 ...
## $ premie : Factor w/ 2 levels "full term","premie": 1 1 1 1 1 1 1 2 1 1 ...
## $ visits : int 10 15 11 6 9 19 12 5 9 13 ...
## $ marital : Factor w/ 2 levels "married","not married": 1 1 1 1 1 1 1 1 1 1 ...
## $ gained : int 38 20 38 34 27 22 76 15 NA 52 ...
## $ weight : num 7.63 7.88 6.63 8 6.38 5.38 8.44 4.69 8.81 6.94 ...
## $ lowbirthweight: Factor w/ 2 levels "low","not low": 2 2 2 2 2 1 2 1 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 1 2 1 2 2 2 2 1 ...
## $ habit : Factor w/ 2 levels "nonsmoker","smoker": 1 1 1 1 1 1 1 1 1 1 ...
## $ whitemom : Factor w/ 2 levels "not white","white": 1 1 2 2 1 1 1 1 2 2 ...
To hear my commentary on the section above, visit http://www.youtube.com/watch?v=PVca059TWiU.
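As an alternative to deleting the download command after the first run, the download can be guarded so that it only happens when the file is missing. The following is a minimal sketch, not part of the original lab; `fetch_once` is a hypothetical helper name, and its `downloader` argument exists only so the function can be exercised without a network connection.

```r
# Download a file only if it is not already present locally.
# `fetch_once` is a hypothetical helper, not part of the OpenIntro lab.
fetch_once <- function(url, destfile, downloader = download.file) {
  if (!file.exists(destfile)) {
    downloader(url, destfile = destfile)
  }
  invisible(destfile)
}

# Intended use in this lab (needs a network connection on the first run only):
# fetch_once("http://www.openintro.org/stat/data/nc.RData", "nc.RData")
# load("nc.RData")
# dim(nc)  # number of observations, then number of variables
```

With this guard in place the chunk can stay in the document, and knitr will skip the download whenever nc.RData already exists.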
Run a summary of the dataframe to look for anomalies. Do all of the minimum and maximum values make sense? Many of the variables have NA values. Offer a reason why NA values might be present in each of these variables.
summary(nc)
##       fage            mage           mature        weeks
##  Min.   :14.00   Min.   :13   mature mom :133   Min.   :20.00
##  1st Qu.:25.00   1st Qu.:22   younger mom:867   1st Qu.:37.00
##  Median :30.00   Median :27                     Median :39.00
##  Mean   :30.26   Mean   :27                     Mean   :38.33
##  3rd Qu.:35.00   3rd Qu.:32                     3rd Qu.:40.00
##  Max.   :55.00   Max.   :50                     Max.   :45.00
##  NA's   :171                                    NA's   :2
##        premie        visits            marital        gained
##  full term:846   Min.   : 0.0   married    :386   Min.   : 0.00
##  premie   :152   1st Qu.:10.0   not married:613   1st Qu.:20.00
##  NA's     :  2   Median :12.0   NA's       :  1   Median :30.00
##                  Mean   :12.1                     Mean   :30.33
##                  3rd Qu.:15.0                     3rd Qu.:38.00
##                  Max.   :30.0                     Max.   :85.00
##                  NA's   :9                        NA's   :27
##      weight       lowbirthweight    gender          habit
##  Min.   : 1.000   low    :111     female:503   nonsmoker:873
##  1st Qu.: 6.380   not low:889     male  :497   smoker   :126
##  Median : 7.310                                NA's     :  1
##  Mean   : 7.101
##  3rd Qu.: 8.060
##  Max.   :11.750
##
##       whitemom
##  not white:284
##  white    :714
##  NA's     :  2
##
##
##
##
The minimum values for both father’s age and mother’s age are suspiciously low, but not impossible.
The maximum value for mother’s age is high but not impossible.
The maximum value for gestation, 45, is probably incorrect.
Several variables have small numbers of NA values, probably refusal to answer or transcription errors.
There are many NA values for fage (171 of 1000). When the father’s age is missing, there was probably a weak relationship between the father and the mother. It would be interesting to see if any other variables in the dataset are related to the absence of the father’s age.
The minimum value for weight gained is suspicious but not impossible.
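One way to explore whether a missing father’s age is related to another variable is to cross-tabulate an is.na() indicator against that variable. The sketch below uses a small invented data frame (`toy`) rather than the real nc data, so the counts it prints say nothing about the actual dataset; only the technique carries over.

```r
# Toy stand-in for nc: these six rows are invented for illustration only.
toy <- data.frame(
  fage    = c(NA, 25, NA, 31, 40, NA),
  marital = factor(c("not married", "married", "not married",
                     "married", "married", "married"))
)

# TRUE where the father's age was not recorded
fage_missing <- is.na(toy$fage)

# Cross-tabulate missingness against marital status
tab <- table(fage_missing = fage_missing, marital = toy$marital)
print(tab)

# Share of missing father's ages within each marital group (columns sum to 1)
print(prop.table(tab, margin = 2))
```

On the real data, the same idea would be `table(is.na(nc$fage), nc$marital)`, run after nc has been loaded.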
Provide the basic descriptive statistics for the weight of the baby. You should give the mean, median, 25th percentile, minimum, maximum, range and interquartile range.
# Place the R code you need to answer this question in this chunk.
summary(nc$weight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.380 7.310 7.101 8.060 11.750
IQR(nc$weight)
## [1] 1.68
max(nc$weight) - min(nc$weight) # Range
## [1] 10.75
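As a cross-check, the interquartile range and the range can be recomputed by hand from the numbers printed by summary(). This is a small sketch that copies those printed values; the variable names are chosen here for illustration.

```r
# Quartiles, minimum, and maximum copied from the summary(nc$weight) output
q1    <- 6.380
q3    <- 8.060
w_min <- 1.000
w_max <- 11.750

iqr_by_hand   <- q3 - q1        # interquartile range: 3rd quartile minus 1st
range_by_hand <- w_max - w_min  # range as a single number: max minus min

print(c(IQR = iqr_by_hand, range = range_by_hand))
```

Both values agree with the IQR() and max() - min() results above. Note that R's own range() function returns the minimum and maximum as a pair, so `diff(range(nc$weight))` is an equivalent way to get the single-number range.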
Produce a histogram and boxplot of the weights of the babies here.
hist(nc$weight)
boxplot(nc$weight, horizontal = TRUE)
Describe the information you see in the graphic displays. Make two accurate statements based on these graphics.
The weights are centered at approximately seven pounds and are slightly left-skewed. The skew is consistent with the mean (7.101) falling below the median (7.310).
We want to compare the distribution of weight for male and female babies. To do this graphically, produce side-by-side boxplots. For numerical descriptive statistics, use tapply and the summary function.
# Place the R code you need to answer this question in this chunk.
boxplot(nc$weight~nc$gender)
tapply(nc$weight,nc$gender,summary)
## $female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.250 7.130 6.903 7.750 11.630
##
## $male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.380 6.560 7.440 7.302 8.310 11.750
Make two accurate statements describing the facts you see in these results.
All of the descriptive statistics confirm that male babies tend to be heavier than female babies.
Graphically, male babies are slightly heavier than female babies, but have fewer high outliers.
Produce a table and a barplot to describe the variable habit.
# Place the R code you need to answer this question in this chunk.
table(nc$habit)
##
## nonsmoker smoker
## 873 126
barplot(table(nc$habit))
Make one accurate statement describing the distribution of the variable habit here.
There are more nonsmokers than smokers.
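To put a number on this statement, the counts can be converted to proportions with prop.table(). The sketch below re-enters the two counts from the table output by hand, so it runs without the dataset; `habit_counts` and `habit_share` are names chosen here.

```r
# Counts copied from the table(nc$habit) output (the one NA is excluded)
habit_counts <- c(nonsmoker = 873, smoker = 126)

# Convert counts to shares of the 999 mothers with a recorded habit
habit_share <- prop.table(habit_counts)
print(round(habit_share, 3))
```

With the data loaded, `prop.table(table(nc$habit))` should give the same shares directly: roughly 87% nonsmokers and 13% smokers.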
Produce a table and a mosaicplot to describe the relationship between the variables habit and lowbirthweight.
# Place the R code you need to answer this question in this chunk.
table(nc$habit,nc$lowbirthweight)
##
## low not low
## nonsmoker 92 781
## smoker 18 108
mosaicplot(table(nc$habit,nc$lowbirthweight))
Make two accurate statements to describe the relationship between habit and lowbirthweight.
Smokers had a higher likelihood of a low birthweight baby. 18 out of 126 smokers (about 14.3%) had low birthweight babies. 92 out of 873 non-smokers (about 10.5%) had low birthweight babies.
The graphical results confirm this.
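The comparison between smokers and non-smokers rests on row proportions of the two-way table. A minimal sketch of that calculation follows, with the four counts re-entered by hand from the output above so that it runs without the dataset; `counts` and `row_props` are names chosen here.

```r
# Counts copied from the table(nc$habit, nc$lowbirthweight) output
counts <- matrix(c(92, 781,
                   18, 108),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(habit = c("nonsmoker", "smoker"),
                                 lowbirthweight = c("low", "not low")))

# margin = 1 divides each row by its row total, so each row sums to 1
row_props <- prop.table(counts, margin = 1)
print(round(row_props, 3))
```

With the data loaded, `prop.table(table(nc$habit, nc$lowbirthweight), margin = 1)` is the equivalent one-liner.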
Note: For problems 7 and 8, it is important to review the corresponding material in the Module 2 Lecture Notes.
Produce a scatterplot to describe the relationship between the weeks of gestation and the birthweight.
# Place the R code you need to answer this question in this chunk.
plot(nc$weeks,nc$weight)
Make one accurate statement to describe the relationship between the weeks of gestation and the birthweight.
Longer gestation tends to be associated with a higher birthweight.
Produce a linear model which shows how birthweight depends on the weeks of gestation. Display the summary results of the linear model.
# Place the R code you need to answer this question in this chunk.
Joe = lm(nc$weight~nc$weeks)
summary(Joe)
##
## Call:
## lm(formula = nc$weight ~ nc$weeks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5775 -0.7048 -0.0235 0.7022 4.4165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.09529 0.46464 -13.12 <2e-16 ***
## nc$weeks 0.34433 0.01209 28.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.119 on 996 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.449, Adjusted R-squared: 0.4485
## F-statistic: 811.7 on 1 and 996 DF, p-value: < 2.2e-16
According to this model, how much extra birth weight would you expect to be associated with one extra week of gestation?
According to the model, each extra week of gestation is associated with about 0.34 pounds of additional birthweight.
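The slope interpretation can be checked by plugging two gestation lengths into the fitted line. The sketch below copies the intercept and slope from the coefficient table above; `predict_weight` is a hypothetical helper written for this illustration, not a function from the lab.

```r
# Coefficients copied from the summary(Joe) output
intercept <- -6.09529
slope     <- 0.34433

# Fitted line: predicted birthweight in pounds for a given number of weeks
predict_weight <- function(weeks) intercept + slope * weeks

# Predicted weight at 40 weeks, and the change from 39 to 40 weeks
print(predict_weight(40))
print(predict_weight(40) - predict_weight(39))
```

The difference between any two consecutive weeks equals the slope, which is the sense in which one extra week corresponds to about 0.34 pounds. The negative intercept is not a meaningful weight; it reflects extrapolating the line to zero weeks, far below the observed range of 20 to 45 weeks.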