North Carolina births

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.

Exploratory analysis

Load the nc data set into your workspace:

download.file("http://www.openintro.org/stat/data/nc.RData", destfile = "nc.RData")
load("nc.RData")
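
To confirm what `load()` put into the workspace, it can help to capture its return value, which is a character vector of the restored object names. The sketch below is only an illustration: it saves a small made-up data frame (`toy`, a hypothetical stand-in for `nc`) to a temporary file so that it runs without the download. With the real file, the same calls on `"nc.RData"` should report an object named `nc`.

```r
# Made-up stand-in for the real data; the values are illustrative only.
toy <- data.frame(mage = c(19, 20, 20, 31), fage = c(21, 20, NA, 35))

# Save it to a temporary .RData file, then remove it from the workspace.
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() restores objects under their saved names and (invisibly)
# returns those names as a character vector.
loaded <- load(path)
print(loaded)

# Basic checks on the restored data frame: dimensions and column types.
print(dim(toy))
str(toy)
```

Running `dim(nc)` and `str(nc)` in the same way is a quick check of the number of observations and variables in the real sample.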

1000 obs. of 13 variables

The nc dataset has 1000 observations on 13 different variables, some discrete and some continuous. The meaning of each variable is as follows.

| variable | description |
| --- | --- |
| `fage` | father’s age in years. |
| `mage` | mother’s age in years. |
| `mature` | maturity status of mother. |
| `weeks` | length of pregnancy in weeks. |
| `premie` | whether the birth was classified as premature (premie) or full-term. |
| `visits` | number of hospital visits during pregnancy. |
| `marital` | whether mother is married or not married at birth. |
| `gained` | weight gained by mother during pregnancy in pounds. |
| `weight` | weight of the baby at birth in pounds. |
| `lowbirthweight` | whether baby was classified as low birthweight (low) or not (not low). |
| `gender` | gender of the baby, female or male. |
| `habit` | status of the mother as a nonsmoker or a smoker. |
| `whitemom` | whether mom is white or not white. |

1. Use ggplot2 to create a histogram of mother’s age. Enter two things below: the command you used to create the histogram, and a sentence or two describing what you see. (Remember, you need to install and load ggplot2 before you can use it. See the slides for more detail.)

ggplot(nc, aes(x = mage)) + geom_histogram()

The histogram shows that mothers' ages peak at around 20. The large count at age 20 may partly reflect rounding, since some people may report a round age rather than their exact one.
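
Whether one age really stands out depends partly on how the histogram is binned: `geom_histogram()` defaults to 30 bins, so a single bar may cover more than one year of age. One hedged way to check is to set `binwidth = 1` and to tabulate the ages directly. The sketch assumes ggplot2 is installed and uses a made-up set of ages (`toy`, hypothetical); substitute `nc` to run it on the real sample.

```r
library(ggplot2)

# Made-up ages for illustration only; replace `toy` with `nc` for the real data.
toy <- data.frame(mage = c(18, 19, 20, 20, 20, 21, 23, 25, 25, 30, 34))

# One bar per year of age, so a spike at a single age is not hidden by wide bins.
p <- ggplot(toy, aes(x = mage)) +
  geom_histogram(binwidth = 1)
print(p)

# Exact counts per age, as a numeric cross-check on the picture.
age_counts <- table(toy$mage)
print(age_counts)

# The most frequent age(s) in the data.
modal_age <- names(age_counts)[age_counts == max(age_counts)]
print(modal_age)
```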

2. Use ggplot2 to make a graph (scatterplot) of mother’s age versus father’s age, where every point represents one couple. Again, enter two things below: the command you used to create the graph, and a sentence or two describing what you see.

ggplot(nc, aes(x = mage, y = fage)) + geom_point()

The scatterplot shows an increasing trend: older mothers tend to be paired with older fathers. For some couples the two ages are the same, and for others the age difference is larger. Mother's age is on the x-axis and father's age is on the y-axis.
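
The upward trend in the scatterplot can be summarized with a correlation coefficient, and the number of couples with identical ages can be counted directly. The sketch below uses made-up couples (`toy`, hypothetical) and skips rows where either age is missing, in case the real data have such rows; substitute `nc` to run it on the real sample.

```r
# Made-up couples for illustration only; replace `toy` with `nc` for the real data.
toy <- data.frame(
  mage = c(18, 20, 20, 24, 29, 33, 36),
  fage = c(19, 20, 25, NA, 31, 32, 41)
)

# Pearson correlation, using only rows where both ages are present.
r <- cor(toy$mage, toy$fage, use = "complete.obs")
print(r)

# Number of couples where both parents have the same recorded age.
same_age <- sum(toy$mage == toy$fage, na.rm = TRUE)
print(same_age)
```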

3. One problem with the graph from question 2 is that many points lie on top of one another. For example, all observations where mother and father were both 20 years old lie on top of one another. Do the graph from question 2 again, but replace `geom_point()` with `geom_jitter()`. Enter the new command you used to create the graph, and a sentence or two describing what you see.

ggplot(data = nc, aes(x = mage, y = fage)) +
  geom_jitter()

The jittered plot still shows a positive association between the two ages. The difference is that points that were stacked on top of one another are now nudged slightly apart, so many more individual points are visible, including some outlying couples with large age differences.

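
The amount of overplotting can be counted directly: `duplicated()` flags every row whose (`mage`, `fage`) pair has already appeared. `geom_jitter()` then adds a small random offset to each point; its `width` and `height` arguments control the size of that offset, and `alpha` makes dense regions appear darker. The values 0.3 and 0.4 and the `toy` data below are arbitrary assumptions for illustration, and ggplot2 is assumed to be installed.

```r
library(ggplot2)

# Made-up couples for illustration only; replace `toy` with `nc` for the real data.
toy <- data.frame(
  mage = c(20, 20, 20, 25, 25, 31),
  fage = c(20, 20, 22, 27, 27, 35)
)

# Rows that repeat an earlier (mage, fage) pair would be hidden by geom_point().
n_hidden <- sum(duplicated(toy[, c("mage", "fage")]))
print(n_hidden)

# Jitter each point by up to 0.3 years in each direction and make points
# semi-transparent, so stacked observations become visible.
p <- ggplot(toy, aes(x = mage, y = fage)) +
  geom_jitter(width = 0.3, height = 0.3, alpha = 0.4)
print(p)
```

Because the offsets are random, the exact positions of the points change slightly each time the plot is drawn.
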
4. Now repeat the graph from question 3, but make separate graphs for mothers who were smokers and mothers who were nonsmokers. Enter the command you used to create the graph, and a sentence or two describing what you see.

ggplot(data = nc, aes(x = mage, y = fage)) + geom_jitter() + facet_wrap(~ habit)

The nonsmoker panel has many more points than the smoker panel, which shows that most mothers in the sample did not smoke. There is also a third panel labeled NA. It holds the births for which the mother's smoking habit was not recorded; since those rows cannot be assigned to either group, ggplot2 draws them in their own panel.
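
The extra panel can be checked by counting the `habit` values with missing entries included, and it goes away if rows with a missing `habit` are filtered out before plotting. The sketch below assumes ggplot2 is installed and uses made-up data (`toy`, hypothetical); substitute `nc` to run it on the real sample.

```r
library(ggplot2)

# Made-up births for illustration only; replace `toy` with `nc` for the real data.
toy <- data.frame(
  mage  = c(19, 22, 25, 28, 31, 35),
  fage  = c(21, 22, 30, 27, 33, 40),
  habit = factor(c("nonsmoker", "nonsmoker", "smoker", NA, "nonsmoker", NA))
)

# Count each habit level, with missing values shown as their own category.
habit_counts <- table(toy$habit, useNA = "ifany")
print(habit_counts)

# Keep only rows where habit was recorded, then facet as before.
recorded <- subset(toy, !is.na(habit))
p <- ggplot(recorded, aes(x = mage, y = fage)) +
  geom_jitter() +
  facet_wrap(~ habit)
print(p)
```

Dropping the rows changes which births are shown, so the number of removed rows is worth reporting alongside the plot.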