CA 2

Assignment 2:

Part 1: Suppose that we are told that the heights of adult males in a particular region of the world are normally distributed with mean of 70 inches and variance of 25 inches squared.

mu <- 70 
var <- 25
sigma <- sqrt(var)

Before we start lets write a few functions to assist us in the questions to follow

lessX <- function (x, mean, sd){
  #returns a probabilty for a normal distribution with spectified Mean and Standard Deviation
  return (pnorm(x, mean, sd))
}


lessZ <- function(x, mean , sd){
  #returns a probabilty for a normal distribution with spectified Mean and Standard Deviation convered to a standard normal distribution
  z <- (x-mean)/sd
  return (pnorm(z))
}

These 2 functions return

\[Pr(x\le X)\] which can only be evaluated using tables

but while one uses a normal distribution with the specified mean and deviation the second one uses the standard normal distribution ( mean = 0, sigma = 1) which translates the $x$ value above to a Z score using \[Z = \frac{(x-\mu)}{\sigma}\] Running both of these 2 function should give the same result

The next 2 functions are the reverse giving \[Pr(x \ge X) = 1- Pr(x \le X)\]

greaterX <- function(x, mean, sd){
  ans<- 1-pnorm(x, mean, sd)
  return (ans)
}

greaterZ <- function(x, mean, sd){
  z<-(x-mean)/sd
  ans<- 1-pnorm(z)
  return (ans)
}

Finally we write 2 functions for \[Pr(X\le x \le Y)= Pr(x \le Y) - Pr(x \le X)\]

betweenX<-function(low, high, mean, sd){
  ans<-pnorm(high, mean, sd)-pnorm(low, mean, sd)
  return (ans)
}

betweenZ<-function(low, high, mean, sd){
  zhigh<-(high-mean)/sd
  zlow<-(low-mean)/sd
  
  ans<- pnorm(zhigh)-pnorm(zlow)
  return (ans)
  
}

One final function to create is one that assists with the shading under a curve

getPolygon <- function(x1min=-Inf, x1max=Inf, mu, sigma, x2min=NULL, x2max=NULL, x3min=NULL, x3max=NULL, x1col='grey', x2col='grey', x3col='grey') {
  
  x1value = c(x1min, seq(x1min, x1max, 0.01),x1max)
  y1value= c(0,dnorm(seq(x1min, x1max, 0.01), mean=mu, sd=sigma),0)
  
  
  
  if(!is.null(x2min)){
    if(is.null(x2max)){
     x2max=Inf
   }
   x2value <- c(x2min, seq(x2min, x2max, 0.01),x2max)
   y2value <- c(0,dnorm(seq(x2min, x2max, 0.01), mean=mu, sd=sigma),0)
  }else{
    x2value <- NULL
    y2value <- NULL
  }
  

  if(!is.null(x3min)){
    if(is.null(x3max)){
      x3max=Inf
    }
    x3value = c(x3min, seq(x3min, x3max, 0.01),x3max)
    y3value= c(0,dnorm(seq(x3min, x3max, 0.01), mean=mu, sd=sigma),0)
  } else{
    x3value <- NULL
    y3value <- NULL
  }
  
  
  
  response<- list(polygon(x1value, y1value, col=x1col))

  if(!is.null(x2value)){
    response<- list(polygon(x2value, y2value, col=x2col),polygon(x1value, y1value, col=x1col))
  }
  if(!is.null(x3value)){
   response <- list(polygon(x3value, y3value, col=x3col),polygon(x2value, y2value,col=x2col),polygon(x1value, y1value, col=x1col))
   }
    return(response)
}

With these functions we can now begin to solve problems with these formulae acting as checks for each other

Q1 Find the probability that a randomly sampled adult male has height between 65 inches and 75 inches. Here we can use the between functions we described To plot the function:

curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequenc of Height")
polygonVals<-getPolygon(x1min=65, x1max=75, mu=mu, sigma=sigma)
text<-("65 < x < 75 ")
legend('topright',text, fill='grey', cex= 0.75)

They grey area is what we are looking to calculate

betweenX(65, 75, mu, sigma)

## [1] 0.6826895

betweenZ(65, 75, mu, sigma)

## [1] 0.6826895

Q2 Find the probability that a randomly sampled adult male has height either less than 70 inches or more than 75 inches. To graph

curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequenc of Height")
polygonVals<-getPolygon(x1min=(mu-(4*sigma)), x1max=70, mu=mu, sigma=sigma,x2min=75,x2max=(mu+4*(sigma)) )
legend('topleft',"x<70",fill='grey', cex= 0.75)
legend('topright',"x>75",fill='grey', cex=0.75)

We will save these answers to check our answers in the next round. For this we need to use \[Pr(A \cap B) = Pr(A) + Pr(B)-Pr(A \cup B)\] And we know that in this case A and B are independant (cannot be both heights) we have \[Pr(A \cap B) = Pr(A) + Pr(B)\] So we have \[Pr(X \le 70 || X\ge75 )= Pr(x\le70)+Pr(x\ge75)\]

Q2X <- lessX(70, mu, sigma)+greaterX(75, mu, sigma)
Q2Z <- lessZ(70, mu, sigma)+greaterZ(75, mu, sigma)

Q2X

## [1] 0.6586553

Q2Z

## [1] 0.6586553

Q3 Find the probability that a randomly sampled adult male has height between 70 inches and 75 inches.

curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequenc of Height")
polygonVals<-getPolygon(x1min=70, x1max=75, mu=mu, sigma=sigma)
legend('topright',"70<x<75",fill='grey', cex=0.75)

Same as Q1 above we can use the between function and the graph function to get this

betweenX(70, 75, mu, sigma)

## [1] 0.3413447

betweenZ(70, 75, mu, sigma)

## [1] 0.3413447

We can also use the fact that these are the inverse of Q2 so we can use ( as can be seen from the graph above) \[1-(Pr(X \le 70 || X\ge75 ))\]

1-Q2X

## [1] 0.3413447

1-Q2Z

## [1] 0.3413447

As we see here all 4 match so we can be confident that the formulae are accurate. It is also half ( due to the symmatery of the distribution of the answer we got in Q1)

Q4 Plot the curve of the PDF of the random variable representing the height of the adult sampled

x1min<-mu-sigma
x1max<-mu+sigma

x2min<-mu-(2*sigma)
x2max<-mu+(2*sigma)

x3min<-mu-(3*sigma)
x3max<-mu+(3*sigma)
  
curve( dnorm(x,mean=mu, sd=sigma), from = (mu-(4*sigma)), to = (mu+(4*sigma)), main = "Height Distribution \n ( assuming Normal Distribution)", xlab= "Height (inches)", ylab= "Projected frequency of Height")

polygons <- getPolygon(x1min, x1max, mu, sigma, x2min, x2max, x3min, x3max, x1col='grey', x2col='darkblue', x3col='lightblue')

abline(v=mu, col='black')
text((mu),0.04, "mean = 70")

cols <- c("gray", "dark blue", "light blue")
key <- c("68% of values are here", "95% of values are here", "99% of values are here")
legend('topright',key, fill = cols, cex= 0.75)

TASK 2 Load the dataset run containing 10K run finishing times of 346 people measured in two different times: ( inspecting the data also)

data <- url("https://acaimo.github.io/teaching/9002_CAs/run.Rdata") 
load( data)
head(run)

Q5 Plot the histograms of T1 = times_1 and T2 = times_2

t1<- run$times_1
t2<- run$times_2

Write a function to make pretty plotting a bit easier

plotHistogram <- function(data, unit, title, xlab, ylab) {
  
  #data for mean and median
  mean=round(mean(data), digits= 2)
  median=round(median(data), digits = 2)
  # get the legend
  
  Mean <- paste("Mean:", toString(mean), unit, sep=" ")
  Median <-  paste("Median: ",  toString(median), unit, sep=" ")
  key <- c(Mean, Median)
  color <- c("red", "blue")
  #data for scaling 
  ymax <- max(hist(data, plot = FALSE)$counts)
 
  
  hist(data, main=title, xlab = xlab, ylab=ylab, labels=TRUE, ylim=c(0, 1.1*ymax), col=rgb(0,0,1,alpha=.25))


  #add mean median and legend
  abline(v=mean, col= 'red')
  abline(v=median, col='blue')
  legend('topright',key, fill = color, cex= 0.75)
}

To plot

plotHistogram(data = t1, 
              unit = "mins",
              title =" Histogram of T1 times for the supplied sample ",
              xlab="Time in mins", 
              ylab= "frequency of time occuring")

plotHistogram(data = t2, 
              unit = "mins",
              title =" Histogram of T2 times for the supplied sample ",
              xlab="Time in mins", 
              ylab= "frequency of time occuring")

Q6 Calculate the sample mean xÂ¯ and standard deviation s for T1 and T2.

we can use built in functions for this( some of it already calcluated on the graphs but here for completeness)

xbarT1<-round(mean(t1), digits = 4)

ST1<-round(sd(t1), digits = 4)

paste("The mean of T1 is", toString(xbarT1), "mins and the standard deviation is",toString(ST1),"mins", sep=" ")

## [1] "The mean of T1 is 31.9707 mins and the standard deviation is 5.1405 mins"

and for t2

xbarT2<-round(mean(t2), digits = 4)

ST2<-round(sd(t2), digits = 4)

paste("The mean of T2 is", toString(xbarT2), " mins and the standard deviation is",toString(ST2),"mins", sep=" ")

## [1] "The mean of T2 is 35.1453  mins and the standard deviation is 3.7832 mins"

Q7 Assuming that both variables are normally distributed, plot the PDF of the sample mean XÂ¯for T1 and T2.

legend<-paste( "mean =", xbarT1, sep=" ")
graph<- curve( dnorm(x,mean=xbarT1, sd=ST1), from = xbarT1-(4*ST1), to = xbarT1+(4*ST1),main= "Plot of the  T1 times for our sample \n( assuming Normal Distribution)", xlab = "T1 ( mins)", ylab = "Frequency of occurance" )
abline( v= xbarT1, col= "dark Grey")
text(32, .02, legend, col = "grey")

legend<-paste( "mean =", xbarT2, sep=" ")
graph<- curve( dnorm(x,mean=xbarT2, sd=ST2), from = xbarT2-(4*ST2), to = xbarT2+(4*ST2), main= "Plot of the  T2 times for our sample \n(assuming Normal Distribution)", xlab = "T2 ( mins)", ylab = "Frequency of occurance" )

abline(v = xbarT2, col="grey")

text(35.15, .02, legend, col = "grey")

Q8 Assuming that both variables are normally distributed, find the 95% confidence interval for the mean ?? for both variables. Formula for this is \[CI= \left( \bar{x} - t_{n-1, \frac{\alpha}{2}} \frac{s}{\sqrt{n}} \ , \ \bar{x} + t_{n-1, \frac{\alpha}{2}} \frac{s}{\sqrt{n}} \right).\]

where s is the standard deviation of the sample, n the number of observations of the sample and $(1-\alpha)*100$ = the % confidence interval

For this we can need to create a few more variables and functions to make life easier

CI<- function(xbar, sd, n, alpha){
  se <- sd/sqrt(n)
  t <- qt(1 - alpha/2, df = n - 1)
  CI <- c(round(xbar-(t*se), digits=2),round(xbar+(t*se), digits=2))
  return (CI)
}

This function collects the confidence interval for the given parameters and returns an upper and lower limit on the variance

Lets test this for T1 and T2

T1CI <- CI(xbarT1, ST1, length(t1), 0.05)
T2CI <- CI(xbarT2, ST2, length(t2), 0.05)

paste("T1 95% confidence interval is from ", toString(T1CI[1]), "to",toString(T1CI[2]), sep=" ")

## [1] "T1 95% confidence interval is from  31.43 to 32.51"

paste("T2 95% confidence interval is from ", toString(T2CI[1]), "to",toString(T2CI[2]), sep=" ")

## [1] "T2 95% confidence interval is from  34.75 to 35.55"

Q9 Test the hypothesis: $H0:??T1=29;H1:??T1\ne29$ and draw your conclusion. Use $??=0.05$.

For this we can use the built in ttest function to get a value very quicky but we can also write a code to use the confidence intervals and the mean, sd etc we have already calculated

ttest <- function(xbar, sd, n, alpha, mu0){
  se <- sd/sqrt(n)
  t <- (xbar - mu0)/se
  k <- qt(1 - alpha / 2, df = n - 1)
  ci<-c(-k, k)
  
  if(t < ci[1]){
      ans<-" Ho rejected"
      reason <-paste("Statistic ",t, " is less than ", ci[1], sep=" ")
  }
  else if (t>ci[2]){
    
    ans<-"Ho rejected"
    reason <-paste("Statistic ",t, " is greater than ", ci[2], sep=" ")
  }
  else{
    ans <- "Ho Not rejected"
    reason<- paste("Statistic ", t, " is between ", ci[1], " and ", ci[2], sep=" ") 
  }
    
  
  
  #convert for string 
  response <- c(ans, reason)
  ans=toString(response)
  return(response)
}

for our test we use

ttest(xbarT1, ST1, length(t1), 0.05, 29)

## [1] "Ho rejected"                                                   
## [2] "Statistic  10.7495796534868  is greater than  1.96686390888117"

Lets check this against the built in test function

t.test(t1, mu = 29)

## 
##  One Sample t-test
## 
## data:  t1
## t = 10.75, df = 345, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 29
## 95 percent confidence interval:
##  31.42715 32.51424
## sample estimates:
## mean of x 
##  31.97069

n= length(t1)
k <- round(qt(0.95, df = n - 1),3); k

## [1] 1.649

As we can see here both using our written code to check against the T-values of our confidece interval and indeed calculating directly from our data we can see that our p-value is very low( 2.2e-16 ) and as described above t greater than CI upper limit we can reject the null hypothesis with 95% certainity

We reject the hypothesis that $\mu_{t1}=29$

Q10 Test the hypothesis: H0:??T2=35;H1:??T2>35 and draw your conclusion. Use ??=0.05.

For this we use the t.test function with alternative greater set

t.test(t2, mu = 35, alternative = 'greater')

## 
##  One Sample t-test
## 
## data:  t2
## t = 0.71463, df = 345, p-value = 0.2377
## alternative hypothesis: true mean is greater than 35
## 95 percent confidence interval:
##  34.8099     Inf
## sample estimates:
## mean of x 
##  35.14535

here we see the T value is 0.71463

comparing to the k alpha value for what we want

n = length(t2)
k <- round(qt(0.95, df = n - 1),3); k

## [1] 1.649

since T < k we fail to reject the null hypothesis that $ _{t2}=35$

CA 2

Ciaran Darcy D10123132 d10123132@mydit.ie