#In this lab we'll investigate the probability distribution that is most central to statistics:
# the normal distribution. If we are confident that our data are nearly normal, that opens the door to many powerful statistical methods. Here we'll use the graphical tools of R to assess the normality of our data and also learn how to generate random numbers from a normal distribution.
#The Data
# This week we'll be working with measurements of body dimensions. This data set contains measurements from 247 men and
# 260 women, most of whom were considered healthy young adults.
download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
#Let's take a quick peek at the first few rows of the data.
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
# You'll see that for every observation we have 25 measurements, many of which are either diameters or girths.
# A key to the variable names can be found at http://www.openintro.org/stat/data/bdims.php, but we'll be focusing on just
# three columns to get started: weight in kg ( wgt ), height in cm ( hgt ), and sex ( 1 indicates male, 0 indicates female).
#Since males and females tend to have different body dimensions, it will be useful to create two additional data sets:
#one with only men and another with only women.``
mdims <- subset(bdims, sex == 1)
fdims <- subset(bdims, sex == 0)
# 1.Make a histogram of men's heights and a histogram of women's heights.
# How would you compare the various aspects of the two distributions?
#The normal distribution
# In your description of the distributions, did you use words like bell-shaped or normal?
# It's tempting to say so when faced with a unimodal symmetric distribution.
# To see how accurate that description is, we can plot a normal distribution curve on top of a histogram to see how closely
# the data follow a normal distribution. This normal curve should have the same mean and standard deviation as the data.
# We'll be working with women's heights, so let's store them as a separate object and then calculate some statistics that
#will be referenced later.
fhgtmean <- mean(fdims$hgt) # mean of Female Height
fhgtsd <- sd(fdims$hgt) # StdDev of Female Height
# Next we make a density histogram to use as the backdrop and use the lines function to overlay a normal probability curve.
# The difference between a frequency histogram and a density histogram is that while in a frequency histogram the heights
# of the bars add up to the total number of observations, in a density histogram the areas of the bars add up to 1.
# The area of each bar can be calculated as simply the height times the width of the bar.
# Using a density histogram allows us to properly overlay a normal distribution curve over the histogram since the curve
# is a normal probability density function. Frequency and density histograms both display the same exact shape;
# they only differ in their y-axis. You can verify this by comparing the frequency histogram you constructed earlier and the
# density histogram created by the commands below.
hist(fdims$hgt ) # hist(fdims$hgt)

hist(fdims$hgt, probability = TRUE ,ylim = c(0, 0.06) ) # hist(fdims$hgt)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

## EXPLANATION OF THE ABOVE STATEMENTS FOR PLOTTING
# After plotting the density histogram with the first command,
# we create the x- and y-coordinates for the normal curve.
# We chose the x range as 140 to 190 in order to span the entire range of
# fheight .
# To create y , we use dnorm to calculate the density of each of those
# x-values in a distribution that is normal with mean fhgtmean and
# standard deviation fhgtsd .
# The final command draws a curve on the existing plot (the density histogram)
# by connecting each of the points specified by x and y .
# The argument col simply sets the color for the line to be drawn.
# If we left it out, the line would be drawn in black.
# The top of the curve is cut off because the limits of the
# x- and y-axes are set to best fit the histogram.
# To adjust the y-axis you can add a third argument to the histogram
# function: ylim = c(0, 0.06) .
## try this for your self
###############################################
##
## 2.Based on the this plot, does it appear that the data follow a nearly normal distribution?
## Evaluating the normal distribution
# #Eyeballing the shape of the histogram is one way to determine if the data appear
# to be nearly normally distributed, but it can be frustrating to decide just how
# close the histogram is to the curve. An alternative approach involves constructing
# a normal probability plot, also called a normal Q-Q plot for "quantile-quantile".
qqnorm(fdims$hgt)
qqline(fdims$hgt)

# A data set that is nearly normal will result in a probability plot where the points closely follow
# the line. Any deviations from normality leads to deviations of these points from the line.
# The plot for female heights shows points that tend to follow the line but with some errant points
# towards the tails. We're left with the same problem that we encountered with the histogram above:
# how close is close enough?
# A useful way to address this question is to rephrase it as: what do probability plots look
# like for data that I know came from a normal distribution?
# We can answer this by simulating data from a normal distribution using rnorm .
sim_norm <- rnorm(n = length(fdims$hgt), mean = fhgtmean, sd = fhgtsd)
# The first argument indicates how many numbers you'd like to generate, which we specify to be the same
# number of heights in the fdims data set using the length function.
# The last two arguments determine the mean and standard deviation of the normal distribution from
# which the simulated sample will be generated. We can take a look at the shape of our simulated data set,
# sim_norm , as well as its normal probability plot.
####################################################
# 3.Make a normal probability plot of sim_norm . Do all of the points fall on the line?
## How does this plot compare to the probability plot for the real data?
qqnorm(sim_norm)
qqline(sim_norm)

# Even better than comparing the original plot to a single plot generated from a normal distribution is
# to compare it to many more plots using the following function. Essentially you will simulate multiple distribution
# and see how they look vs the actual data
# It may be helpful to click the zoom button in the plot window.
qqnormsim(fdims$hgt)

# 4.Does the normal probability plot for fdims$hgt look similar to the plots created for the simulated
# data? That is, do plots provide evidence that the female heights are nearly normal?
#5.Using the same technique, determine whether or not female weights appear to come from a normal distribution.
#
## Normal probabilities
## Okay, so now you have a slew of tools to judge whether or not a variable is normally distributed.
## Why should we care?
## It turns out that statisticians know a lot about the normal distribution.
## Once we decide that a random variable is approximately normal, we can answer all sorts of questions
## about that variable related to probability. Take, for example, the question of,
## "What is the probability that a randomly chosen young adult female is taller than 6 feet
# (about 182 cm)?"
## CAVEAT!!!
## (The study that published this data set is clear to point out that the sample was not random and
##therefore inference to a general population is not suggested. We do so here only as an exercise.)
# SO--
## "What is the probability that a randomly chosen young adult female is taller than 6 feet
# (about 182 cm)
#Let US Drawa the desity plots again for illustration!!
hist(fdims$hgt, probability = TRUE, ylim = c(0, 0.06) ) # hist(fdims$hgt, probability = F)
x <- 140:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")

# If we assume that female heights are normally distributed (a very close approximation is also okay),
# we can find this probability by calculating a Z score and consulting a Z table (also called a normal
# probability table). In R, this is done in one step with the function pnorm .
1 - pnorm(q = 182, mean = fhgtmean, sd = fhgtsd)
## [1] 0.004434387
# Note that the function pnorm gives the area under the normal curve below a given value, q, with a given mean and standard deviation.
# Since we're interested in the probability that someone is taller than 182 cm, we have to take
# one minus that probability.
## Assuming a normal distribution has allowed us to calculate a theoretical probability.
## If we want to calculate the probability empirically, we simply need to determine how many
## observations fall above 182 then divide this number by the total sample size.
sum(fdims$hgt > 182) / length(fdims$hgt)
## [1] 0.003846154
# Although the probabilities are not exactly the same, they are reasonably close.
## The closer that your distribution is to being normal, the more accurate the theoretical probabilities
## will be.
# 6.Write out two probability questions that you would like to answer;
# a) one regarding female heights and
# b) one regarding female weights.
#Calculate the those probabilities using both the theoretical normal distribution as well as the
# empirical distribution (four probabilities in all).
# Which variable, height or weight,had a closer agreement between the two methods?
# On Your Own
#.Now let's consider some of the other variables in the body dimensions data set.
# Using the figures at the end of the exercises, match the histogram to its normal probability plot.
# All of the variables have been standardized (first subtract the mean, then divide by the
# standard deviation), so the units won't be of any help.
# If you are uncertain based on these figures, generate the plots in R to check. (B,C,D)
#a. The histogram for female biiliac (pelvic) diameter ( bii.di ) belongs to normal probability plot letter _B___.
#b. The histogram for female elbow diameter ( elb.di ) belongs to normal probability plot letter __C__.
#c. The histogram for general age ( age ) belongs to normal probability plot letter __D__.
#d. The histogram for female chest depth ( che.de ) belongs to normal probability plot letter _A___.
#.Note that normal probability plots C and D have a slight stepwise pattern.
#Why do you think this is the case?
# Slight stepwise pattern could be understood as age & elbow diameter are not the same. But they are in a very similar
# increasing trending.
#.As you can see, normal probability plots can be used both to assess normality and visualize skewness.
#Make a normal probability plot for female knee diameter ( kne.di ).
#Based on this normal probability plot, is this variable left skewed, symmetric, or right skewed?
#Use a histogram to confirm your findings.
fkneeMean = fdims$kne.di
fkneeSd = sd(fdims$kne.di)
hist(fdims$kne.di, probability = TRUE)

# From this histogram, this is left skewed.
#B
qqnorm(fdims$bii.di)
qqline(fdims$bii.di)

#C
qqnorm(fdims$elb.di)
qqline(fdims$elb.di)

#D
qqnorm(fdims$age)
qqline(fdims$age)

#A
qqnorm(fdims$che.de)
qqline(fdims$che.de)
