In this lab, we are going to explore the normal distribution, which is also called the Gaussian distribution. We are going to see a few examples of things that are normally distributed, as well as seeing how we can use properties of the normal distribution to work with data.
If you want a reminder on how to use R or R markdown, look back at Lab 1!
We are going to work with data from a study of pregnant women and their newborn children. To load the data, put the following in a chunk of code and press play.
library(dplyr)
babyData= read.csv('http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv')
The data provide information on n=1236 individuals. The variables contained in the data are:
*bwt - birth weight of the baby (in ounces)
*gestation - length of the pregnancy (in days)
*parity - 1 if baby was the parent’s first born child, 0 otherwise
*age - mother’s age (in years)
*height - mother’s height (in inches)
*weight - mother’s weight (in lbs.)
*smoke - 1 if the mother is a smoker, 0 otherwise
The first variable we are going to explore is the height of the mother, in inches. We are interested in the probability that a mother is less than 69 inches (5 foot 9) in height. To answer questions like this, it is helpful to first check whether the distribution of heights matches a named distribution, like the normal distribution.
Create a histogram to explore the distribution of height, and describe the distribution. Make sure to label your axes!
Use the code below to draw the histogram. Give the graph a proper title in main = "" and label the x axis in xlab = "". Your object will be the height variable.
hist(object, main = "Title of the Plot", xlab = "Your X Axis")
Ans: See histogram below
hist(babyData$height, main = "Mother's Height", xlab = "Height (inches)")
Once we have looked at the distribution, we start to consider whether the distribution we are seeing matches a named distribution. A named distribution is just a distribution of data that has been given a name and whose properties are well known. Some examples are the Normal, the F, the T, the Chi-Square, and the Gamma distributions. In this lab, we are focusing on the normal distribution.
One thing that we know about the normal distribution is that it is symmetric. When we have a symmetric distribution, we know that when we look at the mean and the median of the distribution, these two measures of center should be roughly the same.
Create a summary() of height. Are the mean and the median different, or are they roughly the same?
summary(babyData$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 53.00 62.00 64.00 64.05 66.00 72.00 22
Ans: The mean is 64.05 inches while the median is 64 inches. These would be considered roughly the same.
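The mean-versus-median comparison can also be made explicit by computing both and looking at the gap between them. The sketch below uses simulated heights so that it runs without downloading anything; the vector `heights` and its parameters are hypothetical stand-ins, not the lab data. With the lab data you would use `babyData$height` instead.

```r
# Hypothetical stand-in for babyData$height: 200 simulated heights plus two NAs
set.seed(1)
heights <- c(rnorm(200, mean = 64, sd = 2.5), NA, NA)

# mean() and median() return NA when values are missing unless na.rm = TRUE
height_mean   <- mean(heights, na.rm = TRUE)
height_median <- median(heights, na.rm = TRUE)

# Gap between the two measures of center, in inches
center_gap <- abs(height_mean - height_median)
center_gap
```

A gap that is small compared with the spread of the data is what we expect from a roughly symmetric distribution.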
A normal distribution must be symmetric and unimodal. Based on what we have done so far, does anything suggest this distribution might not be a normal distribution?
Ans: No, the information so far is consistent with the data being normally distributed.
If we have decided that a normal distribution might be a reasonable way to describe the distribution of the data, the next step is to find the value of the two parameters that describe the distribution. These two parameters are:
μ, which is the mean. σ, which is the standard deviation.
If we did use a normal distribution to describe the height of mothers in these data, what value would you suggest we use for μ?
Ans: 64.05 inches, the mean.
If we did use a normal distribution to describe the height of mothers in these data, what value would you suggest we use for σ? Hint: We have some missing data in this data set. This means that the usual code, sd(babyData$height), will return NA. Instead, we need to use sd(babyData$height, na.rm = TRUE)
sd(babyData$height,na.rm=TRUE)
## [1] 2.533409
Ans: Standard deviation, σ, = 2.533409 inches.
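To see why `na.rm = TRUE` is needed, here is a minimal sketch on a tiny made-up vector (the values in `x` are for illustration only and are not taken from the lab data).

```r
# Tiny made-up vector with one missing value
x <- c(62, 64, 66, NA)

sd_with_na    <- sd(x)                # NA: the missing value propagates
sd_without_na <- sd(x, na.rm = TRUE)  # computed from 62, 64, 66 only
n_missing     <- sum(is.na(x))        # how many values are missing

sd_with_na
sd_without_na
n_missing
```

The same `na.rm = TRUE` argument works for `mean()` and `median()`.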
Based on Questions 4 and 5, write down the normal distribution you would use to describe the heights of mothers in this data set. Hint: Use the notation N(μ,σ).
Ans: N(μ=64.05,σ=2.533409)
Okay, so at this point, we have a good idea that a normal distribution is a plausible choice for describing the distribution of heights, and we also have an idea of the center (μ) and spread (σ) of the distribution. The last step is to actually draw the normal distribution you specified in Question 6 on top of our histogram to make sure that the curve is a good representation of the bars in the histogram.
We can use R to draw our normal distribution on top of the data, but this takes a few steps. The first step is to tell R the range of values that are on the X axis. In other words, what is the smallest value of height (from =) and what is the largest value of height (to =) in our data? This information can be found in the summary we did earlier. Copy the code below, put it in a chunk, and press play.
summary(babyData$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 53.00 62.00 64.00 64.05 66.00 72.00 22
Specify the range of your x axis:
range <- seq(from = 53, to = 72, length = 40)
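Instead of typing the endpoints by hand, they can be pulled from the data. This is a sketch under assumptions: `heights` is a small made-up vector standing in for `babyData$height`, and `x_grid` is a hypothetical name used in place of the lab's `range` so that it is not confused with R's built-in `range()` function.

```r
# Hypothetical stand-in for babyData$height
heights <- c(53, 60, 62, 64, 64, 66, 68, 72, NA)

# length.out is the full name of the argument that `length = 40` abbreviates
x_grid <- seq(from = min(heights, na.rm = TRUE),
              to   = max(heights, na.rm = TRUE),
              length.out = 40)

length(x_grid)  # number of evenly spaced points on the x axis
range(x_grid)   # smallest and largest grid value
```

The more points in the grid, the smoother the curve drawn through them will look.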
The next step is to tell R the parameters for the normal distribution you want to draw. We already know what those values are, but now we need to tell R what we have chosen. To do this, copy the code below and put it in a chunk. Then, replace MU with the numeric value for μ you chose in Question 4, and SIGMA with the numeric value for σ that you chose in Question 5.
(Copy the code below and put it in the chunk. You will have to change the MU and SIGMA)
fun <- dnorm(range, mean = MU, sd = SIGMA)
fun <- dnorm(range, mean = 64.05, sd = 2.533)
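For reference, `dnorm()` evaluates the normal density curve at each x value. The sketch below writes the density formula out by hand and compares it with `dnorm()`; the three heights in `x` are picked only for illustration.

```r
mu    <- 64.05   # mean chosen in the lab
sigma <- 2.533   # standard deviation chosen in the lab (rounded)
x     <- c(60, 64.05, 69)  # a few heights, in inches, picked for illustration

# Normal density written out by hand
manual_density <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

# Same values from R's built-in density function
builtin_density <- dnorm(x, mean = mu, sd = sigma)

all.equal(manual_density, builtin_density)
```

The density is highest at the mean, which is why the curve peaks at 64.05 inches.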
The final step is to actually draw the curve on the histogram!
Histogram
(Copy the code below and put it in the chunk)
hist(babyData$height, prob = TRUE, col = "white", ylim = c(0, max(fun)), main = "Histogram with normal curve", xlab = "Height (in inches)")
lines(range, fun, col = 2, lwd = 2)
You will notice that this looks a little different from our standard histogram code. The additions are to allow us to draw the normal distribution on top of the histogram.
Create the plot using the three steps of code above. Does it look like the normal distribution (the curve) is a good fit for the histogram? In other words, does it look like the curve matches with the shape of the histogram?
Ans: Yes! See below.
range <- seq(from = 53, to = 72, length = 40)
fun <- dnorm(range, mean = 64.05, sd = 2.533)
hist(babyData$height, prob = TRUE, col = "white",
ylim = c(0, max(fun)),
main = "Histogram with normal curve", xlab = "Height (in inches)")
lines(range, fun, col = 2, lwd = 2)
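One thing to watch for: `ylim = c(0, max(fun))` makes the y axis only as tall as the curve, so a histogram bar that is taller than the curve's peak would be cut off. A possible safeguard, sketched here on simulated data (the vector `heights` and the names `x_grid`, `h`, and `y_top` are hypothetical, not part of the lab code), is to take the larger of the tallest bar and the curve's peak.

```r
# Hypothetical stand-in for babyData$height (simulated, not the lab data)
set.seed(42)
heights <- rnorm(300, mean = 64.05, sd = 2.533)

x_grid <- seq(from = 53, to = 72, length.out = 40)
fun    <- dnorm(x_grid, mean = 64.05, sd = 2.533)

# Compute the histogram without drawing it; $density holds the bar heights
# on the same scale that prob = TRUE uses
h <- hist(heights, plot = FALSE)

# Tall enough for both the tallest bar and the peak of the curve
y_top <- max(h$density, fun)
y_top
```

Passing `ylim = c(0, y_top)` to `hist()` in place of `ylim = c(0, max(fun))` would then keep both the bars and the curve fully visible.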
Once we have decided that a normal distribution is a reasonable choice, we can use that to answer questions about the data.
Based on the normal distribution we have chosen, what is the approximate probability that a mother is less than 69 inches (5 foot 9)? Calculate the Z score, then use the probability table to find the probability. Report both the Z score and the probability. Use of R is not necessary here.
Ans: Z = (69 - 64.05) / 2.533409 = 1.954, so the probability is about 0.9744. About 97.44% of mothers are less than 69 inches tall.
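Although R is not required for this question, `pnorm()` offers a quick check of the table lookup. This is a sketch using the μ and σ chosen earlier in the lab.

```r
mu    <- 64.05
sigma <- 2.533409

z <- (69 - mu) / sigma   # Z score for 69 inches
p <- pnorm(z)            # P(Z < z) under the standard normal

round(z, 3)
p
```

The result may differ from the table value in the last digit or two, because the table rounds Z to two decimal places before the lookup.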
We have just seen one example of where the normal distribution occurs in nature - human heights. It turns out that it is not just height that tends to be normally distributed!
Create a histogram to explore the distribution of birth weight, and describe the distribution. Make sure to label your axes!
hist(babyData$bwt, main= "Birth Weights Distribution", xlab= "Weight in Ounces")
Use the code from Question 7 above to draw the appropriate normal curve on the histogram. State the parameters (mean and standard deviation) of the normal distribution you chose.
summary(babyData$bwt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 55.0 108.8 120.0 119.6 131.0 176.0
sd(babyData$bwt, na.rm = TRUE)
## [1] 18.23645
Ans: mean = 119.6, standard deviation = 18.24
Does it look like this normal distribution might be appropriate to birthweight? Explain why or why not.
Ans: Yes, it appears that the normal curve matches the histogram well. The distribution is unimodal and symmetric.
What is the probability that a baby weighs less than (approximately) 64.3 ounces at birth? (Similar to Question 8)
Ans: Z = (64.3 - 119.6) / 18.24 = -3.03, so about 0.12% of babies weigh less than 64.3 ounces at birth.
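`pnorm()` can also take the mean and standard deviation directly, which skips the Z score step. The sketch below, using the parameters chosen for birth weight, checks that both routes give the same probability.

```r
# P(birth weight < 64.3 ounces) under N(mean = 119.6, sd = 18.24)
p_direct <- pnorm(64.3, mean = 119.6, sd = 18.24)

# Same probability via the Z score
z        <- (64.3 - 119.6) / 18.24
p_from_z <- pnorm(z)

all.equal(p_direct, p_from_z)
round(100 * p_direct, 2)  # probability as a percentage
```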
What is the probability that a baby has a weight more than 1 standard deviation above the mean? (Similar to Question 8)
Ans: One standard deviation above the mean is 119.6 + 18.24 = 137.84 ounces, so Z = (137.84 - 119.6) / 18.24 = 1.
The table gives 84.13% below Z = 1, and 100% - 84.13% = 15.87%. So, 15.87% of babies will weigh more than 1 standard deviation above the mean for this data.
The data for this lab were retrieved from http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/babies.csv on July 19, 2022.
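For "more than" questions, `pnorm()` can return the upper tail directly through its `lower.tail` argument. This sketch compares that with subtracting the lower tail from 1.

```r
upper_tail <- pnorm(1, lower.tail = FALSE)  # P(Z > 1)
complement <- 1 - pnorm(1)                  # same probability, by subtraction

all.equal(upper_tail, complement)
round(100 * upper_tail, 2)  # probability as a percentage
```

Because the Z score is exactly 1 here, the answer does not depend on the particular mean and standard deviation chosen.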