Assignment 2:
Part 1: Suppose that we are told that the heights of adult males in a particular region of the world are normally distributed with mean of 70 inches and variance of 25 inches squared.
mu <- 70
var <- 25
sigma <- sqrt(var)
Before we start lets write a few functions to assist us in the questions to follow
lessX <- function (x, mean, sd){
#returns a probabilty for a normal distribution with spectified Mean and Standard Deviation
return (pnorm(x, mean, sd))
}
lessZ <- function(x, mean , sd){
#returns a probabilty for a normal distribution with spectified Mean and Standard Deviation convered to a standard normal distribution
z <- (x-mean)/sd
return (pnorm(z))
}
These 2 functions return
\[Pr(x\le X)\] which can only be evaluated using tables
but while one uses a normal distribution with the specified mean and deviation the second one uses the standard normal distribution ( mean = 0, sigma = 1) which translates the \(x\) value above to a Z score using \[Z = \frac{(x-\mu)}{\sigma}\] Running both of these 2 function should give the same result
The next 2 functions are the reverse giving \[Pr(x \ge X) = 1- Pr(x \le X)\]
greaterX <- function(x, mean, sd){
ans<- 1-pnorm(x, mean, sd)
return (ans)
}
greaterZ <- function(x, mean, sd){
z<-(x-mean)/sd
ans<- 1-pnorm(z)
return (ans)
}
Finally we write 2 functions for \[Pr(X\le x \le Y)= Pr(x \le Y) - Pr(x \le X)\]
betweenX<-function(low, high, mean, sd){
ans<-pnorm(high, mean, sd)-pnorm(low, mean, sd)
return (ans)
}
betweenZ<-function(low, high, mean, sd){
zhigh<-(high-mean)/sd
zlow<-(low-mean)/sd
ans<- pnorm(zhigh)-pnorm(zlow)
return (ans)
}
One final function to create is one that assists with the shading under a curve
getPolygon <- function(x1min=-Inf, x1max=Inf, mu, sigma, x2min=NULL, x2max=NULL, x3min=NULL, x3max=NULL, x1col='grey', x2col='grey', x3col='grey') {
x1value = c(x1min, seq(x1min, x1max, 0.01),x1max)
y1value= c(0,dnorm(seq(x1min, x1max, 0.01), mean=mu, sd=sigma),0)
if(!is.null(x2min)){
if(is.null(x2max)){
x2max=Inf
}
x2value <- c(x2min, seq(x2min, x2max, 0.01),x2max)
y2value <- c(0,dnorm(seq(x2min, x2max, 0.01), mean=mu, sd=sigma),0)
}else{
x2value <- NULL
y2value <- NULL
}
if(!is.null(x3min)){
if(is.null(x3max)){
x3max=Inf
}
x3value = c(x3min, seq(x3min, x3max, 0.01),x3max)
y3value= c(0,dnorm(seq(x3min, x3max, 0.01), mean=mu, sd=sigma),0)
} else{
x3value <- NULL
y3value <- NULL
}
response<- list(polygon(x1value, y1value, col=x1col))
if(!is.null(x2value)){
response<- list(polygon(x2value, y2value, col=x2col),polygon(x1value, y1value, col=x1col))
}
if(!is.null(x3value)){
response <- list(polygon(x3value, y3value, col=x3col),polygon(x2value, y2value,col=x2col),polygon(x1value, y1value, col=x1col))
}
return(response)
}
With these functions we can now begin to solve problems with these formulae acting as checks for each other
Q1 Find the probability that a randomly sampled adult male has height between 65 inches and 75 inches. Here we can use the between functions we described To plot the function:
curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequenc of Height")
polygonVals<-getPolygon(x1min=65, x1max=75, mu=mu, sigma=sigma)
text<-("65 < x < 75 ")
legend('topright',text, fill='grey', cex= 0.75)
They grey area is what we are looking to calculate
betweenX(65, 75, mu, sigma)
## [1] 0.6826895
betweenZ(65, 75, mu, sigma)
## [1] 0.6826895
Q2 Find the probability that a randomly sampled adult male has height either less than 70 inches or more than 75 inches. To graph
curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequenc of Height")
polygonVals<-getPolygon(x1min=(mu-(4*sigma)), x1max=70, mu=mu, sigma=sigma,x2min=75,x2max=(mu+4*(sigma)) )
legend('topleft',"x<70",fill='grey', cex= 0.75)
legend('topright',"x>75",fill='grey', cex=0.75)
We will save these answers to check our answers in the next round. For this we need to use \[Pr(A \cap B) = Pr(A) + Pr(B)-Pr(A \cup B)\] And we know that in this case A and B are independant (cannot be both heights) we have \[Pr(A \cap B) = Pr(A) + Pr(B)\] So we have \[Pr(X \le 70 || X\ge75 )= Pr(x\le70)+Pr(x\ge75)\]
Q2X <- lessX(70, mu, sigma)+greaterX(75, mu, sigma)
Q2Z <- lessZ(70, mu, sigma)+greaterZ(75, mu, sigma)
Q2X
## [1] 0.6586553
Q2Z
## [1] 0.6586553
Q3 Find the probability that a randomly sampled adult male has height between 70 inches and 75 inches.
curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequenc of Height")
polygonVals<-getPolygon(x1min=70, x1max=75, mu=mu, sigma=sigma)
legend('topright',"70<x<75",fill='grey', cex=0.75)
Same as Q1 above we can use the between function and the graph function to get this
betweenX(70, 75, mu, sigma)
## [1] 0.3413447
betweenZ(70, 75, mu, sigma)
## [1] 0.3413447
We can also use the fact that these are the inverse of Q2 so we can use ( as can be seen from the graph above) \[1-(Pr(X \le 70 || X\ge75 ))\]
1-Q2X
## [1] 0.3413447
1-Q2Z
## [1] 0.3413447
As we see here all 4 match so we can be confident that the formulae are accurate. It is also half ( due to the symmatery of the distribution of the answer we got in Q1)
Q4 Plot the curve of the PDF of the random variable representing the height of the adult sampled
x1min<-mu-sigma
x1max<-mu+sigma
x2min<-mu-(2*sigma)
x2max<-mu+(2*sigma)
x3min<-mu-(3*sigma)
x3max<-mu+(3*sigma)
curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequency of Height")
polygons <- getPolygon(x1min, x1max, mu, sigma, x2min, x2max, x3min, x3max, x1col='grey', x2col='darkblue', x3col='lightblue')
abline(v=mu, col='black')
text((mu),0.04, "mean = 70")
cols <- c("gray", "dark blue", "light blue")
key <- c("68% of values are here", "95% of values are here", "99% of values are here")
legend('topright',key, fill = cols, cex= 0.75)
TASK 2 Load the dataset run containing 10K run finishing times of 346 people measured in two different times: ( inspecting the data also)
data <- url("https://acaimo.github.io/teaching/9002_CAs/run.Rdata")
load( data)
head(run)
Q5 Plot the histograms of T1 = times_1 and T2 = times_2
t1<- run$times_1
t2<- run$times_2
Write a function to make pretty plotting a bit easier
plotHistogram <- function(data, unit, title, xlab, ylab) {
#data for mean and median
mean=round(mean(data), digits= 2)
median=round(median(data), digits = 2)
# get the legend
Mean <- paste("Mean:", toString(mean), unit, sep=" ")
Median <- paste("Median: ", toString(median), unit, sep=" ")
key <- c(Mean, Median)
color <- c("red", "blue")
#data for scaling
ymax <- max(hist(data, plot = FALSE)$counts)
hist(data, main=title, xlab = xlab, ylab=ylab, labels=TRUE, ylim=c(0, 1.1*ymax), col=rgb(0,0,1,alpha=.25))
#add mean median and legend
abline(v=mean, col= 'red')
abline(v=median, col='blue')
legend('topright',key, fill = color, cex= 0.75)
}
To plot
plotHistogram(data = t1,
unit = "mins",
title =" Histogram of T1 times for the supplied sample ",
xlab="Time in mins",
ylab= "frequency of time occuring")
plotHistogram(data = t2,
unit = "mins",
title =" Histogram of T2 times for the supplied sample ",
xlab="Time in mins",
ylab= "frequency of time occuring")
Q6 Calculate the sample mean x¯ and standard deviation s for T1 and T2.
we can use built in functions for this( some of it already calcluated on the graphs but here for completeness)
xbarT1<-round(mean(t1), digits = 4)
ST1<-round(sd(t1), digits = 4)
paste("The mean of T1 is", toString(xbarT1), "mins and the standard deviation is",toString(ST1),"mins", sep=" ")
## [1] "The mean of T1 is 31.9707 mins and the standard deviation is 5.1405 mins"
and for t2
xbarT2<-round(mean(t2), digits = 4)
ST2<-round(sd(t2), digits = 4)
paste("The mean of T2 is", toString(xbarT2), " mins and the standard deviation is",toString(ST2),"mins", sep=" ")
## [1] "The mean of T2 is 35.1453 mins and the standard deviation is 3.7832 mins"
Q7 Assuming that both variables are normally distributed, plot the PDF of the sample mean X¯for T1 and T2.
legend<-paste( "mean =", xbarT1, sep=" ")
graph<- curve( dnorm(x,mean=xbarT1, sd=ST1), from = xbarT1-(4*ST1), to = xbarT1+(4*ST1),main= "Plot of the T1 times for our sample \n( assuming Normal Distribution)", xlab = "T1 ( mins)", ylab = "Frequency of occurance" )
abline( v= xbarT1, col= "dark Grey")
text(32, .02, legend, col = "grey")
legend<-paste( "mean =", xbarT2, sep=" ")
graph<- curve( dnorm(x,mean=xbarT2, sd=ST2), from = xbarT2-(4*ST2), to = xbarT2+(4*ST2), main= "Plot of the T2 times for our sample \n(assuming Normal Distribution)", xlab = "T2 ( mins)", ylab = "Frequency of occurance" )
abline(v = xbarT2, col="grey")
text(35.15, .02, legend, col = "grey")
Q8 Assuming that both variables are normally distributed, find the 95% confidence interval for the mean ?? for both variables. Formula for this is \[CI= \left(
\bar{x} - t_{n-1, \frac{\alpha}{2}} \frac{s}{\sqrt{n}} \ , \ \bar{x} + t_{n-1, \frac{\alpha}{2}} \frac{s}{\sqrt{n}}
\right).\]
where s is the standard deviation of the sample, n the number of observations of the sample and \((1-\alpha)*100\) = the % confidence interval
For this we can need to create a few more variables and functions to make life easier
CI<- function(xbar, sd, n, alpha){
se <- sd/sqrt(n)
t <- qt(1 - alpha/2, df = n - 1)
CI <- c(round(xbar-(t*se), digits=2),round(xbar+(t*se), digits=2))
return (CI)
}
This function collects the confidence interval for the given parameters and returns an upper and lower limit on the variance
Lets test this for T1 and T2
T1CI <- CI(xbarT1, ST1, length(t1), 0.05)
T2CI <- CI(xbarT2, ST2, length(t2), 0.05)
paste("T1 95% confidence interval is from ", toString(T1CI[1]), "to",toString(T1CI[2]), sep=" ")
## [1] "T1 95% confidence interval is from 31.43 to 32.51"
paste("T2 95% confidence interval is from ", toString(T2CI[1]), "to",toString(T2CI[2]), sep=" ")
## [1] "T2 95% confidence interval is from 34.75 to 35.55"
Q9 Test the hypothesis: \(H0:??T1=29;H1:??T1\ne29\) and draw your conclusion. Use \(??=0.05\).
For this we can use the built in ttest function to get a value very quicky but we can also write a code to use the confidence intervals and the mean, sd etc we have already calculated
ttest <- function(xbar, sd, n, alpha, mu0){
se <- sd/sqrt(n)
t <- (xbar - mu0)/se
k <- qt(1 - alpha / 2, df = n - 1)
ci<-c(-k, k)
if(t < ci[1]){
ans<-" Ho rejected"
reason <-paste("Statistic ",t, " is less than ", ci[1], sep=" ")
}
else if (t>ci[2]){
ans<-"Ho rejected"
reason <-paste("Statistic ",t, " is greater than ", ci[2], sep=" ")
}
else{
ans <- "Ho Not rejected"
reason<- paste("Statistic ", t, " is between ", ci[1], " and ", ci[2], sep=" ")
}
#convert for string
response <- c(ans, reason)
ans=toString(response)
return(response)
}
for our test we use
ttest(xbarT1, ST1, length(t1), 0.05, 29)
## [1] "Ho rejected"
## [2] "Statistic 10.7495796534868 is greater than 1.96686390888117"
Lets check this against the built in test function
t.test(t1, mu = 29)
##
## One Sample t-test
##
## data: t1
## t = 10.75, df = 345, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 29
## 95 percent confidence interval:
## 31.42715 32.51424
## sample estimates:
## mean of x
## 31.97069
n= length(t1)
k <- round(qt(0.95, df = n - 1),3); k
## [1] 1.649
As we can see here both using our written code to check against the T-values of our confidece interval and indeed calculating directly from our data we can see that our p-value is very low( 2.2e-16 ) and as described above t greater than CI upper limit we can reject the null hypothesis with 95% certainity
We reject the hypothesis that \(\mu_{t1}=29\)
Q10 Test the hypothesis: H0:??T2=35;H1:??T2>35 and draw your conclusion. Use ??=0.05.
For this we use the t.test function with alternative greater set
t.test(t2, mu = 35, alternative = 'greater')
##
## One Sample t-test
##
## data: t2
## t = 0.71463, df = 345, p-value = 0.2377
## alternative hypothesis: true mean is greater than 35
## 95 percent confidence interval:
## 34.8099 Inf
## sample estimates:
## mean of x
## 35.14535
here we see the T value is 0.71463
comparing to the k alpha value for what we want
n = length(t2)
k <- round(qt(0.95, df = n - 1),3); k
## [1] 1.649
since T < k we fail to reject the null hypothesis that $ _{t2}=35$