download.file("http://www.openintro.org/stat/data/bdims.RData", destfile = "bdims.RData")
load("bdims.RData")
table(bdims$sex) #Confirming there are 260 women and 247 men
##
## 0 1
## 260 247
head(bdims)
## bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi che.gi
## 1 42.9 26.0 31.5 17.7 28.0 13.1 10.4 18.8 14.1 106.2 89.5
## 2 43.7 28.5 33.5 16.9 30.8 14.0 11.8 20.6 15.1 110.5 97.0
## 3 40.1 28.2 33.3 20.9 31.7 13.9 10.9 19.7 14.1 115.1 97.5
## 4 44.3 29.9 34.0 18.4 28.2 13.9 11.2 20.9 15.0 104.5 97.0
## 5 42.5 29.9 34.0 21.5 29.4 15.2 11.6 20.7 14.9 107.5 97.5
## 6 43.3 27.0 31.5 19.6 31.3 14.0 11.5 18.8 13.9 119.8 99.9
## wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi wri.gi age
## 1 71.5 74.5 93.5 51.5 32.5 26.0 34.5 36.5 23.5 16.5 21
## 2 79.0 86.5 94.8 51.5 34.4 28.0 36.5 37.5 24.5 17.0 23
## 3 83.2 82.9 95.0 57.3 33.4 28.8 37.0 37.3 21.9 16.9 28
## 4 77.8 78.8 94.0 53.0 31.0 26.2 37.0 34.8 23.0 16.6 23
## 5 80.0 82.5 98.5 55.4 32.0 28.4 37.7 38.6 24.4 18.0 22
## 6 82.5 80.1 95.3 57.5 33.0 28.0 36.6 36.1 23.5 16.9 21
## wgt hgt sex
## 1 65.6 174.0 1
## 2 71.8 175.3 1
## 3 80.7 193.5 1
## 4 72.6 186.5 1
## 5 78.8 187.2 1
## 6 74.8 181.5 1
mend<-subset(bdims, sex==1)
wmend<-subset(bdims,sex==0)
hist(mend$hgt, main="Histogram of Men's Height", xlab="Height in cm",labels=TRUE,ylim=c(0,90) )
hist(wmend$hgt, main="Histogram of Women's Height",xlab="Height in cm",labels=TRUE,ylim=c(0,80))
The men’s heights span a wider range than the women’s. The women’s histogram has taller bars, so larger groups of women share similar heights than is the case for men. Does this mean the men’s heights are more spread out?
Everything changes if I do this:
hist(mend$hgt, main="Histogram of Men's Height", xlab="Height in cm", breaks=50)
hist(wmend$hgt, main="Histogram of Women's Height",xlab="Height in cm", breaks=50)
With more bins, the distribution of men’s heights still looks very similar to the one shown earlier, but the women’s distribution now looks more bimodal to me, or at least it no longer looks the same. Is there any advantage to one histogram over the other? Should we ignore one of the two? How should we interpret each one?
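One way to explore the question above is to keep the data fixed and vary only `breaks`. The sketch below uses simulated heights (the mean of 165 and standard deviation of 6.5 are made-up placeholders, not values from `bdims`) and `plot = FALSE`, so that only the bin counts are compared. Note that `hist()` treats a single number passed to `breaks` as a suggestion, so the actual number of bins may differ from it.

```r
set.seed(1)  # make the simulated sample reproducible
# Hypothetical sample: 260 heights from a normal distribution (placeholder parameters)
sim_hgt <- rnorm(n = 260, mean = 165, sd = 6.5)

# Same data, two bin settings; plot = FALSE returns the bins without drawing
h_default <- hist(sim_hgt, plot = FALSE)
h_fine    <- hist(sim_hgt, breaks = 50, plot = FALSE)

length(h_default$counts)  # number of bins chosen by default
length(h_fine$counts)     # number of bins when 50 are suggested
max(h_fine$counts)        # tallest bar when the bins are narrow
```

With many narrow bins each bar holds only a handful of observations, so random bumps that resemble extra modes can appear even in data that are truly normal. Both histograms show the same data, only at different resolutions.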
fhmean<-mean(wmend$hgt)
fhsd<-sd(wmend$hgt)
To create the density histogram:
hist(wmend$hgt,probability = TRUE, main="Density Histogram of Women's Height",xlab="Height in cm")
hist(wmend$hgt,freq=FALSE) #It is the same as doing this
hist(wmend$hgt,probability = TRUE, main="Density Histogram of Women's Height",xlab="Height in cm",ylim=c(0,0.06))
x<-140:190
y<-dnorm(x=x,mean=fhmean,sd=fhsd)
lines(x=x,y=y,col="blue")
Based on this plot, the data do seem to follow a normal distribution.
qqnorm(wmend$hgt)
qqline(wmend$hgt)
We will now simulate data that we know is coming from a normal distribution.
sim_norm<-rnorm(n=length(wmend$hgt), mean=fhmean,sd=fhsd)
qqnorm(sim_norm)
qqline(sim_norm)
Even though we know these data come from a normal distribution (we built them), not all points fall on the line. Both QQ plots are very similar: the points cluster densely along the line around the center and drift slightly away from it toward the tails.
qqnormsim(wmend$hgt)
Yes, the features of the QQ plot for women’s height also appear across the QQ plots of the simulated normal data. This is good evidence that the women’s height data most probably follow a normal distribution.
I was trying to understand the provided function (qqnormsim). This is its code:
function (dat)
{
    par(mfrow = c(3, 3))
    qqnorm(dat, main = "Normal QQ Plot (Data)")
    qqline(dat)
    for (i in 1:8) {
        simnorm <- rnorm(n = length(dat), mean = mean(dat),
            sd = sd(dat))
        qqnorm(simnorm, main = "Normal QQ Plot (Sim)")
        qqline(simnorm)
    }
    par(mfrow = c(1, 1))
}
## function (dat)
## {
## par(mfrow = c(3, 3))
## qqnorm(dat, main = "Normal QQ Plot (Data)")
## qqline(dat)
## for (i in 1:8) {
## simnorm <- rnorm(n = length(dat), mean = mean(dat),
## sd = sd(dat))
## qqnorm(simnorm, main = "Normal QQ Plot (Sim)")
## qqline(simnorm)
## }
## par(mfrow = c(1, 1))
## }
However, I do not understand what the last line does. To test it, I created a function without it, but my results look the same. Question: what is the line “par(mfrow=c(1,1))” doing?
myfunction<-function(dat)
{
    par(mfrow=c(3,3))
    qqnorm(dat, main = "Normal QQ Plot (Data)")
    qqline(dat)
    for (i in 1:8) {
        simnorm <- rnorm(n = length(dat), mean = mean(dat),
            sd = sd(dat))
        qqnorm(simnorm, main = "Normal QQ Plot (Sim)")
        qqline(simnorm)
    }
}
myfunction(wmend$hgt)
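Regarding the question above: `par(mfrow = c(1, 1))` resets the plotting layout to one plot per page, so that plots drawn after the function returns are not squeezed into a leftover 3 x 3 grid. In a knitted document each chunk typically starts with fresh graphical parameters, which may be why removing the line appeared to change nothing. The sketch below checks the layout setting directly. `qq_grid` is a hypothetical name, and restoring the old settings with `on.exit()` is one common idiom, not what `qqnormsim` itself does.

```r
# Hypothetical variant of the lab's function that restores the previous layout
qq_grid <- function(dat) {
  old <- par(mfrow = c(3, 3))  # switch to a 3 x 3 grid, remember the old setting
  on.exit(par(old))            # restore the old layout when the function exits
  qqnorm(dat, main = "Normal QQ Plot (Data)")
  qqline(dat)
  for (i in 1:8) {
    simnorm <- rnorm(n = length(dat), mean = mean(dat), sd = sd(dat))
    qqnorm(simnorm, main = "Normal QQ Plot (Sim)")
    qqline(simnorm)
  }
}

pdf(NULL)                      # throwaway graphics device, no file is written
set.seed(1)
qq_grid(rnorm(100))            # simulated input, only for illustration
layout_after <- par("mfrow")   # layout that any later plot would use
invisible(dev.off())
layout_after
```

Without the reset (or the `on.exit()` restore), `par("mfrow")` would still report a 3 x 3 grid after the call, and the next plot on the same device would fill only one of nine panels.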
qqnormsim(wmend$wgt)
Although most of the plot looks similar to a normal distribution’s QQ plot, I would argue that a curve drawn through the points would be concave up, which is not something we see in the simulated plots. Also, compared with the simulated plots, noticeably more points lie above the line.
To calculate the probability that a woman is taller than 182 cm:
1-pnorm(182,mean=fhmean,sd=fhsd)
## [1] 0.004434387
pnorm(182,mean=fhmean,sd=fhsd,lower.tail=FALSE) #It is the same! :)
## [1] 0.004434387
sum(wmend$hgt>182)/length(wmend$hgt)
## [1] 0.003846154
Define mean and sd for weight
fwmean<-mean(wmend$wgt)
fwsd<-sd(wmend$wgt)
Two probabilities for height: 1) What is the probability that a woman is shorter than 175 cm? 2) What is the probability that a woman is between 155 and 165 cm?
Three probabilities for weight: 3) What is the probability that a woman’s weight is between 40 and 50 kg? 4) What is the probability that a woman’s weight is exactly 50 kg? 5) What is the probability that a woman’s weight is greater than 60 kg?
pnorm(175,mean=fhmean,sd=fhsd)
## [1] 0.9391272
sum(wmend$hgt<175)/length(wmend$hgt)
## [1] 0.9153846
pnorm(165,mean=fhmean,sd=fhsd)-pnorm(155,mean=fhmean,sd=fhsd)
## [1] 0.4420656
(sum(wmend$hgt<165)-sum(wmend$hgt<155))/length(wmend$hgt)
## [1] 0.4461538
hist(wmend$wgt)
pnorm(50,mean=fwmean,sd=fwsd)-pnorm(40,mean=fwmean,sd=fwsd)
## [1] 0.1190612
range(wmend$wgt)
## [1] 42.0 105.2
(sum(wmend$wgt>40)-sum(wmend$wgt>50))/length(wmend$wgt)
## [1] 0.1192308
Given that the lowest weight in the data is 42 kg, should I approach this problem differently from what I just did?
For probability 4), the normal distribution gives a probability of 0, because a continuous distribution assigns zero probability to any single exact value. Are questions of this type therefore usually set aside when working with normal probability distributions? Put differently: since this is only a method for approximating a probability, and we (as an academic community) know that such questions yield a probability of 0, do we simply accept that this method should not be used to answer them?
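The point about probability 4) can be checked numerically. One common workaround is to reinterpret "exactly 50 kg" as a small interval around 50. The sketch below uses placeholder parameters (`m` and `s` are hypothetical, not the values of `fwmean` and `fwsd`), and the half-kilogram window is an assumption about how the weights might have been rounded.

```r
m <- 60  # hypothetical mean weight in kg (placeholder)
s <- 10  # hypothetical standard deviation in kg (placeholder)

# Probability of exactly 50 kg: area under the curve over a zero-width interval
p_exact <- pnorm(50, mean = m, sd = s) - pnorm(50, mean = m, sd = s)

# Probability of a weight that would be recorded as 50 kg if rounded to the nearest kg
p_window <- pnorm(50.5, mean = m, sd = s) - pnorm(49.5, mean = m, sd = s)

p_exact
p_window
```

The first value is zero by construction, while the second is small but positive, which is usually the quantity one actually cares about.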
pnorm(60,mean=fwmean,sd=fwsd,lower.tail = FALSE)
## [1] 0.524893
sum(wmend$wgt>60)/length(wmend$wgt)
## [1] 0.4384615
In general, the results of the two methods for calculating probabilities were closer for the height variable. However, for probability 3) the two methods give nearly identical results (0.1191 versus 0.1192).
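The comparisons above repeat the same two steps: a theoretical probability from `pnorm()` and an empirical proportion from the sample. A small helper can make that pattern explicit. `compare_prob` is a hypothetical name, and the data here are simulated with placeholder parameters rather than taken from `bdims`.

```r
# Hypothetical helper: P(lower < X <= upper) under a fitted normal vs. in the sample
compare_prob <- function(x, lower = -Inf, upper = Inf) {
  theoretical <- pnorm(upper, mean = mean(x), sd = sd(x)) -
    pnorm(lower, mean = mean(x), sd = sd(x))
  empirical <- mean(x > lower & x <= upper)  # proportion of observations in the interval
  c(theoretical = theoretical, empirical = empirical)
}

set.seed(1)
sim_x <- rnorm(n = 260, mean = 165, sd = 6.5)  # simulated sample, placeholder parameters
res <- compare_prob(sim_x, lower = 155, upper = 165)
res
```

Taking `mean()` of a logical vector gives the same proportion as `sum(...)/length(...)`, and a large gap between the two numbers would suggest that the normal model fits the variable poorly in that range.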
Plotting the actual QQ plots:
qqnorm(wmend$bii.di)
qqline(wmend$bii.di)
My guess of a) was correct! :)
qqnorm(wmend$elb.di)
qqline(wmend$elb.di)
My guess of b) was correct! :)
qqnorm(wmend$age)
qqline(wmend$age)
My guess of c) was incorrect. The actual plot is QQ D.
qqnorm(wmend$che.de)
qqline(wmend$che.de)
My guess of d) was incorrect. The actual plot is QQ A.
Elbow diameter and age are the variables whose QQ plots show a slight stepwise pattern. I assume this is because each of them has many repeated values in the data set.
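The assumption above can be illustrated with simulated data: rounding normal values to a coarse grid produces many ties, and tied values sit at the same height in a QQ plot, which gives the stepwise look. The parameters below are placeholders, not estimates from `bdims`.

```r
set.seed(1)
raw     <- rnorm(n = 260, mean = 30, sd = 9)  # simulated values, placeholder parameters
rounded <- round(raw)                         # mimic a variable recorded in whole units

length(unique(raw))      # essentially every value is distinct
length(unique(rounded))  # far fewer distinct values, so many ties

pdf(NULL)                # throwaway graphics device, no file is written
par(mfrow = c(1, 2))     # two panels side by side
qqnorm(raw, main = "Unrounded")
qqline(raw)
qqnorm(rounded, main = "Rounded")
qqline(rounded)
invisible(dev.off())
```

Dropping the `pdf(NULL)` and `dev.off()` lines draws the two panels on screen, where the rounded version should show flat runs of points at each repeated value.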
qqnorm(wmend$kne.di)
qqline(wmend$kne.di)
hist(wmend$kne.di, freq=FALSE, main="Density Plot for Women's Knee Diameter", xlab="Diameter")
The data for women’s knee diameter follow a right-skewed distribution.