load("more/bdims.RData")
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8
## che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 1 89.5 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5
## 2 97.0 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5
## 3 97.5 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9
## 4 97.0 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0
## 5 97.5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4
## 6 99.9 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5
## wri.gi age wgt hgt sex
## 1 16.5 21 65.6 174.0 1
## 2 17.0 23 71.8 175.3 1
## 3 16.9 28 80.7 193.5 1
## 4 16.6 23 72.6 186.5 1
## 5 18.0 22 78.8 187.2 1
## 6 16.9 21 74.8 181.5 1
Make a histogram of men’s heights and a histogram of women’s heights. How would you compare the various aspects of the two distributions?
mens <- subset(bdims, sex == 1)
females <- subset(bdims, sex == 0)
hist(mens$hgt, xlab = "Height of Men in cm", main = "Histogram For Men's Height")
hist(females$hgt, xlab = "Height of Females in cm", main = "Histogram For Females Height")
The mean of the men’s heights looks greater than the mean of the women’s heights. The men’s height distribution looks unimodal and symmetric. The distribution of female heights is also unimodal, and it seems to be skewed left.
Based on this plot, does it appear that the data follow a nearly normal distribution?
fhgtmean <- mean(females$hgt)
fhgtsd <- sd(females$hgt)
hist(females$hgt, probability = TRUE)
x <- 150:190
y <- dnorm(x = x, mean = fhgtmean, sd = fhgtsd)
lines(x = x, y = y, col = "blue")
It does look like a normal distribution, though it is hard to tell just by looking at the histogram.
We can draw a Q-Q plot to take a closer look. If the data points follow the line closely, the data are approximately normally distributed.
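To make the idea of a Q-Q plot concrete, the sketch below builds one by hand. It uses simulated stand-in values (an assumption, not the `bdims` measurements), and the variable names `heights`, `theoretical`, and `observed` are illustrative only.

```r
# Simulated stand-in data (assumption): 260 roughly normal "heights" in cm.
set.seed(1)
heights <- rnorm(260, mean = 165, sd = 6.5)

# Theoretical quantiles: standard normal quantiles at evenly spaced probabilities.
theoretical <- qnorm(ppoints(length(heights)))

# Sample quantiles: the observed values in increasing order.
observed <- sort(heights)

# Plotting one against the other gives the same picture that qqnorm() draws.
plot(theoretical, observed,
     xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
     main = "Normal Q-Q Plot Built by Hand")

# Like qqline(), draw the line through the first and third quartiles.
slope <- unname(diff(quantile(heights, c(0.25, 0.75))) / diff(qnorm(c(0.25, 0.75))))
intercept <- unname(quantile(heights, 0.25)) - slope * qnorm(0.25)
abline(a = intercept, b = slope, col = "blue")
```

If the sample is close to normal, the sorted values grow roughly linearly with the theoretical quantiles, which is why points hugging the line suggest normality.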
qqnorm(females$hgt)
qqline(females$hgt)
As we can see, the data points lie very close to the line, but there are some outliers toward the top and the bottom, so the distribution is very close to normal but not exactly normal.
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data?
sim_norm <- rnorm(n = length(females$hgt), mean = fhgtmean, sd = fhgtsd)
qqnorm(sim_norm)
qqline(sim_norm)
qqnormsim(females$hgt)
Not all of the points fall on the line; a few points at the far ends deviate from it. Even so, the plot for the simulated data looks very similar to the plot for the actual female heights.
Does the normal probability plot for females$hgt look similar to the plots created for the simulated data? That is, do the plots provide evidence that the female heights are nearly normal?
Yes, the normal probability plot for the female heights looks very similar to the plots for the simulated data. The points do not fall perfectly on the line, but the outliers at the ends resemble the outliers in the simulations. There is sufficient evidence to say that the female heights are close to normal.
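The comparison against simulated data can be sketched as follows. The function `qq_sim_sketch` is a hypothetical stand-in written for illustration; it is not the lab's `qqnormsim()`, and the 3-by-3 layout and the simulated input are assumptions.

```r
# Hypothetical helper: NOT the lab's qqnormsim(), only a sketch of the same idea.
qq_sim_sketch <- function(dat, n_sim = 8) {
  old_par <- par(mfrow = c(3, 3), mar = c(2, 2, 2, 1))
  on.exit(par(old_par))

  # First panel: the data being checked.
  qqnorm(dat, main = "Data")
  qqline(dat)

  # Remaining panels: normal samples with the same size, mean, and sd.
  for (i in seq_len(n_sim)) {
    sim <- rnorm(n = length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(sim, main = paste("Sim", i))
    qqline(sim)
  }
  invisible(NULL)
}

set.seed(42)
stand_in <- rnorm(260, mean = 165, sd = 6.5)  # assumed stand-in for females$hgt
qq_sim_sketch(stand_in)
```

If the data panel is hard to tell apart from the simulated panels, the departures from the line are within what truly normal samples of that size produce.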
Using the same technique, determine whether or not female weights appear to come from a normal distribution.
female_weight_mean <- mean(females$wgt)
female_weight_sd <- sd(females$wgt)
hist(females$wgt, probability = TRUE, main = "Histogram - Female Weights Density", xlab = "Female Weight in Kg")
x <- 40:120
y <- dnorm(x = x, mean = female_weight_mean, sd = female_weight_sd)
lines(x = x, y = y, col = "blue")
The female weights appear to follow a unimodal distribution with a longer tail to the right. Compared with the normal curve, the histogram shows no symmetry, so the data are visibly asymmetric.
qqnorm(females$wgt)
qqline(females$wgt)
In the Q-Q plot there is a large deviation between the line and the data. There are a lot of outliers toward the top, and they are very far from the line. However, there are not many outliers at the lower end.
qqnormsim(females$wgt)
The Q-Q plots of the simulations clearly do not match the plot of the actual female weights: the simulated points stay very close to the lines and have few outliers.
Write out two probability questions that you would like to answer; one regarding female heights and one regarding female weights. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which variable, height or weight, had a closer agreement between the two methods?
Let’s find the probability that a woman’s height is greater than 5 feet 10 inches.
5 feet 10 inches = 70 inches = 177.8 cm
pnorm(q = 177.8, mean = fhgtmean, sd = fhgtsd, lower.tail = FALSE)
## [1] 0.02411585
sum(females$hgt > 177.8)/length(females$hgt)
## [1] 0.01923077
Next, find the probability that a woman weighs more than 85 kg.
pnorm(q = 85, mean = female_weight_mean, sd = female_weight_sd, lower.tail = FALSE)
## [1] 0.005582733
sum(females$wgt > 85)/length(females$wgt)
## [1] 0.01538462
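The height variable had the closer agreement between the two methods: 0.0241 (theoretical) versus 0.0192 (empirical), compared with 0.0056 versus 0.0154 for weight. This is consistent with the Q-Q plots, where heights followed the line more closely than weights.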
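The two calculations can be wrapped in one small function so they are always made with the same cutoff. This is a minimal sketch: `tail_prob` is a hypothetical helper, and the input is simulated stand-in data rather than the `bdims` measurements.

```r
# Hypothetical helper: upper-tail probability P(X > q) computed two ways.
tail_prob <- function(x, q) {
  c(
    # Theoretical: area above q under a normal curve fitted to x.
    theoretical = pnorm(q, mean = mean(x), sd = sd(x), lower.tail = FALSE),
    # Empirical: share of observed values above q.
    empirical = mean(x > q)
  )
}

# Unit conversion used in the text: 5 ft 10 in = 70 in, and 1 in = 2.54 cm.
cutoff_cm <- (5 * 12 + 10) * 2.54

# Simulated stand-in data (assumption), not the bdims measurements.
set.seed(7)
sim_hgt <- rnorm(260, mean = 165, sd = 6.5)
tail_prob(sim_hgt, cutoff_cm)
```

The closer the data are to normal, the closer the two entries tend to be, which is why the gap between them is a rough check on the normal model.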
Now let’s consider some of the other variables in the body dimensions data set. Using the figures at the end of the exercises, match the histogram to its normal probability plot. All of the variables have been standardized (first subtract the mean, then divide by the standard deviation), so the units won’t be of any help. If you are uncertain based on these figures, generate the plots in R to check.
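The standardization described above (subtract the mean, then divide by the standard deviation) can be sketched as follows. The data are simulated for illustration, and the names `x`, `z`, and `z_scale` are assumptions.

```r
# Simulated stand-in measurements (assumption), e.g. weights in kg.
set.seed(3)
x <- rnorm(100, mean = 60, sd = 10)

# Standardize: subtract the mean, then divide by the standard deviation.
z <- (x - mean(x)) / sd(x)

# The standardized values have mean 0 and standard deviation 1.
c(mean = mean(z), sd = sd(z))

# Base R's scale() performs the same centering and scaling.
z_scale <- as.numeric(scale(x))
all.equal(z, z_scale)
```

Standardizing changes only the axis units, not the ordering or relative spacing of the values, so the shape of a histogram or normal probability plot is unaffected.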
qqnorm(females$bii.di)
qqline(females$bii.di)
qqnorm(females$elb.di)
qqline(females$elb.di)
qqnorm(bdims$age)
qqline(bdims$age)
qqnorm(females$che.de)
qqline(females$che.de)
I think the stepwise pattern appears because the variables are discrete. A variable such as age is recorded in whole units, so many observations share the same value and form flat steps in the normal probability plot.
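The effect of discreteness can be seen by rounding a continuous sample, as in this sketch. The data are simulated and only loosely age-like (an assumption); they are not taken from `bdims`.

```r
# Simulated continuous values (assumption), then rounded to whole units.
set.seed(11)
continuous <- rnorm(200, mean = 30, sd = 8)
discrete <- round(continuous)

# Rounding collapses many observations onto the same value.
length(unique(continuous))
length(unique(discrete))

# Side by side: the rounded version shows flat steps of tied values.
old_par <- par(mfrow = c(1, 2))
qqnorm(continuous, main = "Continuous")
qqline(continuous)
qqnorm(discrete, main = "Rounded")
qqline(discrete)
par(old_par)
```

Each flat step in the right-hand panel is a group of tied observations that share one sample quantile while their theoretical quantiles keep increasing.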
qqnorm(females$kne.di)
qqline(females$kne.di)
The points curve upward and away from the line at the upper end, so the variable looks right skewed, with outliers at the top end.
hist(females$kne.di, xlab = "Female Knee Diameter in cm", main = "Histogram for Female Knee Diameter")
We can clearly see that the distribution is right skewed.
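The visual impression of skew can be backed by a number. The sketch below computes a moment-based sample skewness on simulated right-skewed data; `skewness` is a hypothetical helper defined here, and the lognormal sample is an assumption, not the knee diameter data.

```r
# Hypothetical helper: moment-based sample skewness.
# Positive values indicate a longer right tail, negative a longer left tail.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Simulated right-skewed sample (assumption).
set.seed(5)
right_skewed <- rlnorm(260, meanlog = log(18), sdlog = 0.4)

skewness(right_skewed)

# In a right-skewed sample the mean is typically pulled above the median.
mean(right_skewed) > median(right_skewed)
```

Applying the same function to a column such as `females$kne.di` would give a numeric check on the direction of the skew seen in the histogram.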