STA 111 Lab 4
Goal
In this lab, we are going to explore the normal distribution, which is also called the Gaussian distribution. We are going to see a few examples of things that are normally distributed, and see how we can use properties of the normal distribution to work with data.
Loading the Data
We are going to work with data from a study of pregnant women and their newborn children. To load the data, put the following in a chunk of code and press play.
babyData <- read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv')
The data provide information on \(n = 1236\) individuals. The variables contained in the data are:
- bwt: birth weight of the baby (in ounces)
- gestation: length of the pregnancy (in days)
- parity: 1 if baby was the parent’s first born child, 0 otherwise
- age: mother’s age (in years)
- height: mother’s height (in inches)
- weight: mother’s weight (in lbs.)
- smoke: 1 if the mother is a smoker, 0 otherwise
Exploring the mother’s height
The first variable we are going to explore is the height of the mother, in inches. We are interested in the probability that a mother is less than 69 inches (5 foot 9) in height. To answer questions like this, it is first helpful to check whether the distribution of heights matches a named distribution, like the normal distribution.
Question 1
Create a histogram to explore the distribution of height, and describe the distribution. Make sure to label your axes!
Once we have looked at the distribution, we start to consider whether the distribution we are seeing matches a named distribution. A named distribution is just a distribution of data that has been given a name and whose properties are well known. Some examples are the Normal, the F, the T, the Chi-Square, and the Gamma distributions. In this lab, we are focusing on the normal distribution.
One thing that we know about the normal distribution is that it is symmetric. When we have a symmetric distribution, we know that when we look at the mean and the median of the distribution, these two measures of center should be roughly the same.
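To see why comparing the mean and the median is a useful check, here is a small sketch using simulated data (not the lab data). The names sym_data and skew_data, the seed, and the chosen parameters are all assumptions made for illustration.

```r
# Simulated data only; these are NOT the heights from babyData
set.seed(111)

# Symmetric example: draws from a normal distribution
sym_data <- rnorm(10000, mean = 50, sd = 5)

# Right-skewed example: draws from an exponential distribution
skew_data <- rexp(10000, rate = 1 / 50)

# For the symmetric data, the mean and median are close together
mean(sym_data)
median(sym_data)

# For the right-skewed data, the mean is pulled above the median
mean(skew_data)
median(skew_data)
```

A mean and median that are close together are consistent with symmetry, but they do not prove it, which is why we also look at the histogram.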
Question 2
Create a summary() of height. Are the mean and the median different, or are they roughly the same?
Question 3
A normal distribution must be symmetric and unimodal. Based on what we have done so far, does anything suggest this distribution might not be a normal distribution?
Defining the Parameters
If we have decided that a normal distribution might be a reasonable way to describe the distribution of the data, the next step is to find the value of the two parameters that describe the distribution. These two parameters are:
- \(\mu\), which is the mean.
- \(\sigma\), which is the standard deviation.
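For reference, this is how the two parameters enter the curve of a normal distribution. This is the standard textbook density formula, not something specific to this lab:

\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)
\]

The curve is centered at \(\mu\), and a larger \(\sigma\) makes the curve wider and flatter.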
Question 4
If we did use a normal distribution to describe the height of mothers in these data, what value would you suggest we use for \(\mu\)?
Question 5
If we did use a normal distribution to describe the height of mothers in these data, what value would you suggest we use for \(\sigma\)? Hint: We have some missing data in this data set. This means that the usual code (sd()) will not quite work. Instead, we need to use sd(babyData$height, na.rm = TRUE).
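The effect of na.rm = TRUE can be seen on a small made-up vector. This is only a sketch: toy_heights is a hypothetical name and the values are not taken from the lab data.

```r
# Toy vector with one missing value; NOT the lab data
toy_heights <- c(62, 64, NA, 66, 68)

# Without na.rm, a single NA makes the whole result NA
sd(toy_heights)

# With na.rm = TRUE, the NA is dropped before the calculation
sd(toy_heights, na.rm = TRUE)

# mean() accepts the same argument
mean(toy_heights, na.rm = TRUE)
```

The same argument works for several other summary functions in R, such as median() and max().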
Question 6
Based on Questions 4 and 5, write down the normal distribution you would use to describe the heights of mothers in this data set. Hint: Use notation \(N(\mu, \sigma)\).
Okay, so at this point, we have a good idea that a normal distribution is a plausible choice for describing the distribution of heights, and we also have an idea of the center (\(\mu\)) and spread (\(\sigma\)) of the distribution. The last step is to actually draw the normal distribution you specified in Question 6 on top of our histogram to make sure that the curve is a good representation of the bars in the histogram.
We can use R to draw our normal distribution on top of the data, but this takes a few steps. The first step is to tell R the range of values that are on the X axis. In other words, what is the smallest value of height (from =) and what is the largest value of height (to =) in our data? This information can be found in the summary we did earlier. Copy the code below, put it in a chunk, and press play.
# Step 1: Specify the range of your x axis
range <- seq(from = 53, to = 72, length.out = 40)
The next step is to tell R the parameters for the normal distribution you want to draw. We already know what those values are, but now we need to tell R what we have chosen. To do this, copy the code below and put it in a chunk. Then, replace MU with the numeric value for \(\mu\) you chose in Question 4, and SIGMA with the numeric value for \(\sigma\) that you chose in Question 5.
# Step 2: Specify the parameters of the normal curve
fun <- dnorm(range, mean = MU, sd = SIGMA)
The final step is to actually draw the curve on the histogram!
# Step 3: Histogram
hist(babyData$height, prob = TRUE, col = "white",
ylim = c(0, max(fun)),
main = "Histogram with normal curve",
xlab = "Height (in inches)")
lines(range, fun, col = 2, lwd = 2)
You will notice that this looks a little different from our standard histogram code. The additions are to allow us to draw the normal distribution on top of the histogram.
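Here is a sketch of what prob = TRUE changes, using simulated values rather than the lab data (toy_values and h are hypothetical names). With prob = TRUE, the bar heights are densities, so the areas of the bars add up to 1. That puts the histogram on the same scale as the curve from dnorm().

```r
# Simulated values only; NOT the lab data
set.seed(111)
toy_values <- rnorm(500, mean = 10, sd = 2)

# Compute the histogram without drawing it
h <- hist(toy_values, plot = FALSE)

# Default scale: the bar heights are counts, which add up to the sample size
sum(h$counts)

# prob = TRUE scale: bar height times bar width adds up to 1
sum(h$density * diff(h$breaks))
```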
Question 7
Create the plot using the three steps of code above. Does it look like the normal distribution (the curve) is a good fit for the histogram? In other words, does it look like the curve matches with the shape of the histogram?
Using the Normal Distribution
Once we have decided that a normal distribution is a reasonable choice, we can use that to answer questions about the data.
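As a sketch of how such a question can be answered in R, pnorm() returns the area under a normal curve to the left of a value. The parameters below (mean 100, standard deviation 15) and the names toy_mu, toy_sigma, p_below, and p_above are made up for illustration. They are not the values for the lab data.

```r
# Hypothetical parameters; NOT the values for the lab data
toy_mu <- 100
toy_sigma <- 15

# P(X < 110): area to the left of 110
p_below <- pnorm(110, mean = toy_mu, sd = toy_sigma)
p_below

# P(X > 130): area to the right of 130
p_above <- pnorm(130, mean = toy_mu, sd = toy_sigma, lower.tail = FALSE)
p_above

# The same upper-tail probability using the complement rule
1 - pnorm(130, mean = toy_mu, sd = toy_sigma)
```

Because pnorm() gives the area to the left by default, a "more than" question needs either lower.tail = FALSE or the complement rule.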
Question 8
Based on the normal distribution we have chosen, what is the approximate probability that a mother is less than 69 inches (5 foot 9)?
Another Example
We have just seen one example of where the normal distribution occurs in nature - human heights. It turns out that it is not just height that tends to be normally distributed!
Question 9
Create a histogram to explore the distribution of birth weight, and describe the distribution. Make sure to label your axes!
Question 10
Use the three steps of code from before Question 7 to draw the appropriate normal curve on the histogram of birth weight. State the parameters (mean and standard deviation) of the normal distribution you chose.
Question 11
Does it look like this normal distribution might be appropriate for birth weight? Explain why or why not.
Question 12
What is the probability that a baby weighs less than approximately 64.3 ounces at birth?
Question 13
What is the probability that a baby has a weight more than 1 standard deviation above the mean?
Note:
The data for this lab were retrieved from http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv on July 19, 2022.
This lab was created by Nicole Dalzell, Assistant Teaching Professor of Wake Forest University, on July 21, 2022.