In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.
Load the nc data set into your workspace:
download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
library(ggplot2)
The nc dataset has 1000 observations on 13 different
variables, some discrete and some continuous. The meaning of each
variable is as follows.
| variable | description |
|---|---|
| `fage` | father's age in years. |
| `mage` | mother's age in years. |
| `mature` | maturity status of mother. |
| `weeks` | length of pregnancy in weeks. |
| `premie` | whether the birth was classified as premature (premie) or full-term. |
| `visits` | number of hospital visits during pregnancy. |
| `marital` | whether mother is married or not married at birth. |
| `gained` | weight gained by mother during pregnancy in pounds. |
| `weight` | weight of the baby at birth in pounds. |
| `lowbirthweight` | whether baby was classified as low birthweight (low) or not (not low). |
| `gender` | gender of the baby, female or male. |
| `habit` | status of the mother as a nonsmoker or a smoker. |
| `whitemom` | whether mom is white or not white. |
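
A quick way to confirm the number of observations and variables, and to see which variables are numeric, is to inspect the object after loading it. The sketch below assumes `nc.RData` has already been loaded; the small stand-in data frame is made up and is only there so the snippet runs when `nc` is not in the workspace.

```r
# Fall back to a tiny made-up stand-in if the real data is not loaded.
# (The stand-in values are hypothetical, not taken from the nc data.)
if (!exists("nc")) {
  nc <- data.frame(
    mage  = c(20, 25, 31),
    fage  = c(22, NA, 35),
    habit = factor(c("nonsmoker", "smoker", NA))
  )
}

dim(nc)   # number of rows (observations) and columns (variables)
str(nc)   # type and first few values of each variable

# TRUE for each numeric column, FALSE for factors and other types
var_is_numeric <- sapply(nc, is.numeric)
names(nc)[var_is_numeric]
```
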
ggplot(nc, aes(x = mage)) + geom_histogram()
The histogram shows that mothers' ages peak at about 20. Many mothers are recorded as exactly 20, but some of this may be rounding, because people tend to avoid reporting ages they think of as awkward numbers.
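
One way to check whether a spike at a single age is real, rather than an effect of how the histogram bins the data, is to tabulate the ages directly. The sketch below uses a short made-up vector (`ages_demo`, not values from the data set) so that it runs on its own; with the real data, `nc$mage` would take its place.

```r
# Made-up ages for illustration only; substitute nc$mage for the real data.
ages_demo <- c(19, 20, 20, 20, 21, 25, 25, 30, 34)

# Number of observations at each distinct age
age_counts <- table(ages_demo)

# Most frequent ages first
top_ages <- sort(age_counts, decreasing = TRUE)
head(top_ages, 3)

# Age(s) with the highest count
names(age_counts)[age_counts == max(age_counts)]
```
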
```
ggplot(nc, aes(x = mage, y = fage)) + geom_point()
```
The scatter plot shows an increasing trend: older mothers tend to be paired with older fathers. For some couples the mother and father are the same age, while for others the age difference is larger. Mother's age is on the x-axis and father's age is on the y-axis.
3. One problem with the graph from question 2 is that many points lie on top of one another. For example, all observations where mother and father were both 20 years old lie on top of one another. Do the graph from question 2 again, but replace `geom_point()` with `geom_jitter()`. Enter the new command you used to create the graph, and a sentence or two describing what you see.
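```
ggplot(data = nc, aes(x = mage, y = fage)) +
  geom_jitter()
```
The jittered plot still shows a positive association between mother's age and father's age. The difference is that points which previously lay on top of one another are now spread out, so more individual observations are visible, along with a wide scatter of outlying couples with large age differences.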
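
To put a number on the positive association, one option is Pearson's correlation, skipping any pair where either age is missing. The vectors below are made up for illustration; with the real data they would be `nc$mage` and `nc$fage`.

```r
# Made-up ages for illustration only; substitute nc$mage and nc$fage.
mage_demo <- c(20, 24, 27, 31, 35, 22)
fage_demo <- c(22, 25, NA, 34, 36, 21)

# use = "complete.obs" drops pairs in which either value is NA
r <- cor(mage_demo, fage_demo, use = "complete.obs")
r
```

A value close to 1 would indicate a strong positive linear relationship, while a value near 0 would indicate little linear relationship.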
4. Now repeat the graph from question 3, but make separate graphs for mothers who were smokers and mothers who were nonsmokers. Enter the command you used to create the graph, and a sentence or two describing what you see.
```
ggplot(data = nc, aes(x = mage, y = fage)) + geom_jitter() + facet_wrap(~ habit)
```
The nonsmoker panel has many more points, which shows that more mothers in the sample are nonsmokers than smokers. There is also a third panel labelled NA. It holds the observations where the mother's smoking habit was not recorded. R plots these separately because it cannot fill in the missing values.
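
The size of each group can be checked by counting the levels of `habit` with missing values included, and the extra panel can be avoided by dropping rows with no recorded habit before plotting. This is a sketch on a made-up data frame (`demo`), not the real data; with `nc` the same calls would apply.

```r
# Made-up rows for illustration only; substitute nc for the real data.
demo <- data.frame(
  mage  = c(20, 25, 31, 28),
  fage  = c(22, 27, 35, NA),
  habit = factor(c("nonsmoker", "nonsmoker", "smoker", NA))
)

# Count each level of habit, with missing values as their own category
habit_counts <- table(demo$habit, useNA = "ifany")
habit_counts

# Keep only rows where habit was recorded; plotting this subset with
# facet_wrap(~ habit) would give two panels instead of three
demo_known <- demo[!is.na(demo$habit), ]
nrow(demo_known)
```
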