As described in the abstract we have 100 subjects undergoing various tests. The first thing to do is to read in the data Q1 Read in the data make sure that you are in the correct directory to read the data in. You can use the commented out setwd() command to bring you to the correct area if you need to.

#setwd("<YOUR DIRECTORY NAME HERE>")
data = read.csv("d10123132.csv", header = TRUE)

With that done we can begin to look at some of the features of the data Q2 Summarise the data and define types

Age=data$Age
Sex=data$Sex
Height=data$Height
ReactionTime=data$ReactionTime
AgeGroup= data$AGE_GROUP

Lets take a look at the basic data from the group

summary(data)
       X               Age        Sex        Height       ReactionTime  
 Min.   :  1.00   Min.   :20.36   F:48   Min.   :153.4   Min.   :143.8  
 1st Qu.: 25.75   1st Qu.:32.14   M:52   1st Qu.:162.6   1st Qu.:376.4  
 Median : 50.50   Median :46.08          Median :166.6   Median :508.5  
 Mean   : 50.50   Mean   :45.47          Mean   :168.4   Mean   :489.8  
 3rd Qu.: 75.25   3rd Qu.:58.86          3rd Qu.:173.7   3rd Qu.:593.8  
 Max.   :100.00   Max.   :69.77          Max.   :190.4   Max.   :800.6  
 AGE_GROUP 
 20-40:37  
 40-70:63  
           
           
           
           

Looking for the means and Standard Deviations where Available Q3 Calculate the mean and standard deviation of the Age, Height and Reaction Times of the test group

print(paste("Average Age: ",mean(Age), " years"))
[1] "Average Age:  45.47  years"
print(paste("Standard Deviation Age:", sd(Age)," years" ))
[1] "Standard Deviation Age: 14.6462281144258  years"
       
print(paste("Average Height: ",mean(Height), " cm"))
[1] "Average Height:  168.4159  cm"
print(paste("Standard Deviation Heigh:", sd(Height), " cm"))
[1] "Standard Deviation Heigh: 8.65580731023509  cm"
print(paste("Average Reaction Time: ",mean(ReactionTime), " ms"))
[1] "Average Reaction Time:  489.813  ms"
print(paste("Standard Deviation Reaction Time:", sd(ReactionTime), " ms"))
[1] "Standard Deviation Reaction Time: 138.382611533196  ms"

Lets see what some of the data looks like Q4A Plot a histogram of the i) reaction time data, ii) age data, and iii) height data

Since there is some plotting to be done here lets write a function to produce some nice plots from given parameters

plotHistogram <- function(data, unit, title, xlab, ylab) {
  
  #data for mean and median
  mean=round(mean(data), digits= 2)
  median=round(median(data), digits = 2)
  # get the legend
  
  Mean <- paste("Mean:", toString(mean), unit, sep=" ")
  Median <-  paste("Median: ",  toString(median), unit, sep=" ")
  key <- c(Mean, Median)
  color <- c("red", "blue")
  #data for scaling 
  ymax <- max(hist(data, plot = FALSE)$counts)
 
  
  hist(data, main=title, xlab = xlab, ylab=ylab, labels=TRUE, ylim=c(0, 1.1*ymax), col=rgb(0,0,1,alpha=.25))
  #add mean median and legend
  abline(v=mean, col= 'red')
  abline(v=median, col='blue')
  legend('topright',key, fill = color, cex= 0.75)
}

This function takes in the data to be graphed, A unit of measure for the x values of the data, and some labelling information in Title, X-axis label and Y-axis label. It then returns a labelled plot with the mean and median marked in it. Lets call it for Reaction Time data

plotHistogram(data=ReactionTime, 
              unit="ms", 
              title= "Frequence of reaction Times", 
              xlab = "Reaction Times (ms)",
              ylab = "Frequecy of occurance")

This histogram shows that the data for the reaction time is roughly symmetrical with a mean at 489.8 ms and median of 508.5 ms. This tightness of mean to median suggests that there are few outliers to the data here. We also see that there is some symmatery to the data suggesting that the data may come from a normal distribution. There a are 2 peaks to the data at about 300-400 ms and at about 500-600 ms which hopefully can be explained with further analysis

plotHistogram(data=Height, 
              unit="cm", 
              title= "Frequence of Heights in Sample group", 
              xlab = "Height(cm)",
              ylab = "Frequecy of occurance")

Here we see the data is slightly differently shaped. The mean and median are slightly more seperated and there is a tail to the data at the higher end ( > 190 cm) of the data. This suggests the presence of an outlier ‘pulling’ the data towards that end eg. A really tall person sampled.

plotHistogram(data = Age, 
              unit = "years",
              title = "Plot of the frequency of ages in the sample group",
              xlab = "Age in years",
              ylab = "Frequecy of ages"
                )

Finally this data is quite flat. The mean and median are almost identical suggeting that the data does not have many outliers or extreme values pulling at it. its also shows that the sample group is roughly equally likely to have ages anywhere from 20-70

b. Plot a histogram of the reaction time data for each age group (20-40yrs, 40-70yrs) First we need to group the sample into the age groups looked for then we can plot the data

younger<- subset(data, AGE_GROUP=='20-40')
older <- subset(data, AGE_GROUP=="40-70")
youngReaction <- younger$ReactionTime
olderReaction <- older$ReactionTime

Once that is done we can write some code to prettily plot the graphs side by side

sidebyside<- function(X,Y,title=NULL,xlabel=NULL, ylabel=NULL, legend=NULL){
#set parameters for the scale of the graph
if (max(X) > max(Y)){
  xmax <- max(X)
}else{
  xmax <- max(Y)
}
xtall <- max(hist(X, plot=FALSE)$counts)
ytall <- max(hist(Y, plot=FALSE)$counts)
if (xtall>ytall){
  ymax=xtall
}else{
  ymax=ytall
}
#set the legends
key <- legend
color <- c("blue", "red")
meanX<- mean(X)
meanY<- mean(Y)
#plot the 2 graphs side by side 
hist(X, col=rgb(0,0,1,alpha=.25), 
     main=title, xlab=xlabel, ylab=ylabel, xlim = c(0, 1.1*xmax), labels=TRUE,  ylim=c(0,1.1*ymax)) 
hist(Y, col=rgb(1,0,0, alpha=0.25), xlim = c(0, 1.1*xmax), ylim=c(0,1.1*ymax),labels=TRUE, add=TRUE) 
#add lines showing mean of each section
abline(v=meanX, col= 'blue')
text(meanX, ymax, paste("mean:", round(meanX, digits=2)) , col = "blue", cex=0.5) 
abline(v=meanY, col='red')
text(meanY, (ymax-0.5), paste("mean:", round(meanY, digits=2)) , col = "red", cex=0.5) 
#include a legend
legend('topleft',key, fill = color, cex= 0.5 )
}

Finally with all of the prep work done we can plot the data

sidebyside(youngReaction, olderReaction, title="Comparative reaction Times by Age", xlabel = "Reaction Times (ms)", ylabel="counts", legend=c("Age Group 20-40", "Age Group 40-70"))

Here we can see that for younger people the mean reaction is lower than for older people. The older population (in red) tends towars the 300-900 ms range while the younger population tends towards the 100-700 ms. We see a longer tail on the younger population that the corresponding tail on the older one which is slightly more symmaterical. When looking at this graph we see a reason for the 2 peaks meantioned in the previous reaction times graph. The 2 peaks of the combined graph coincide with the means of each of the age groups on this graph.

Given that the populations mean reaction time is 450ms. Conduct a hypothesis test stating the Null H0 and alternative Hα, calculating the test statistics, stating your criteria for rejection, and your conclusion for the reaction time data of the: a.20-40 year old group b. 40-70 year old group

  1. First lets write a function to calculate some Confidence intervals for the data and we will also use the features of the data to calculate them
findCI <- function(x, alpha){
  
  #set up variables
  xbar <- mean(x)
  sigma <- sd(x)
  n <- length(x)
  se <- sigma/sqrt(n)
  
  
  #calculate t
  tn_1_alpha.2 <- qt(1 - alpha/2, df = n - 1)
  
  #return the CI
  
  CI <-  c(xbar-tn_1_alpha.2*se,xbar+tn_1_alpha.2*se)
  return (CI)
  
} 

This takes a rejection criteria ( in decimal ) and a data set and calculates a confidence interval for the test statistic to fall inside in order for the null hypothesis to be accepted.

findCI(youngReaction,0.05)
[1] 354.5534 437.6536

We would like our null hypothesis to fall within this region if we are to accept the null hypothesis

CItest <- function ( h0, x, alpha){
  CI <- findCI( x, alpha)
  if(h0 < CI[1]){
    ans<-" H0 rejected"
    reason <-paste(h0, "falls below the lower limit of our", (1-alpha)*100, "% confidence interval which is found to be: " ,CI[1], sep=" ")
  }
  else if (h0>CI[2]){
    
    ans<-"H0 rejected"
    reason <-paste(h0, "falls above the upper limit of our", (1-alpha)*100, "% confidence interval which is found to be: " ,CI[2], sep=" ")
  }
  else{
    ans <- "H0 Not rejected"
    reason <-paste(h0, "falls between the lower and upper limit of our", (1-alpha)*100, "% confidence interval which is found to be between : " ,CI[1],"and",CI[2], sep=" ")
  }
  
return(paste(ans, reason, sep=" : "))  
}

Lets call this function to test what we can already see (h0: \(mu\)= 450 does not fit in our confidence interval)

CItest(450, youngReaction, 0.05)
[1] "H0 rejected : 450 falls above the upper limit of our 95 % confidence interval which is found to be:  437.653611494543"

So for this test we reject the null hypothesis. Because h0 falls outside the upper boundary of the the Confidence Interval we can deduce that the real mean is lower than 450.

t.test(youngReaction, mu=450, conf.level = .95)

    One Sample t-test

data:  youngReaction
t = -2.6307, df = 36, p-value = 0.01246
alternative hypothesis: true mean is not equal to 450
95 percent confidence interval:
 354.5534 437.6536
sample estimates:
mean of x 
 396.1035 

This built in function gives some more information, most notably the p-value which here is significantly less than the \(t{a/2},{n-1}\) so once again we can reject the null hypothesis lets graph this on the histogram of ages ( means more functions.)

histCI <- function(data, alpha, title=NULL, xlab=NULL, ylab=NULL, h0) {
  ymax=max(hist(data, plot=FALSE)$counts)
  hist(data, col=rgb(0,0,1,alpha=.25), 
       main=paste(title, " with ", ((1-alpha)*100), " Confidence Interval"), xlab=xlab, ylab=ylab,  ylim=c(0,1.1*ymax)) 
  abline(v=h0, col= 'blue')
  text(h0, ymax, "h0: mean=450ms" , col = "blue", cex=0.5) 
  CI <- findCI(data,alpha)
  CImin <- CI[1]
  CImax <- CI[2]
  abline(v=CImin, col= 'red')
  text(CImin, ymax-5, paste("CI min:", round(CImin, digits=2), " ms") , col = "red", cex=0.5) 
  abline(v=CImax, col= 'red')
  text(CImax, ymax-2.5, paste("CI max:", round(CImax, digits=2), " ms") , col = "red", cex=0.5) 
}
histCI(youngReaction, 0.05, "Young reaction times", "Reaction Times (ms)", "counts", 450)

This shows that the hypothesised mean falls ( just) outside the confidence intervals of the distribution b. 40 -70 age group We just repeat this for the older age group

CItest(450, olderReaction, 0.05)
[1] " H0 rejected : 450 falls below the lower limit of our 95 % confidence interval which is found to be:  515.867635752091"
histCI(olderReaction, 0.05, "Older reaction times", "Reaction Times (ms)", "counts", 450)

Here we see that that the Confidence interval is higher for the sample group, so that our mean of 450 is too low for the sample. Lets check the Ttest function to be sure

t.test(olderReaction, mu=450)

    One Sample t-test

data:  olderReaction
t = 6.5422, df = 62, p-value = 1.324e-08
alternative hypothesis: true mean is not equal to 450
95 percent confidence interval:
 515.8676 573.8298
sample estimates:
mean of x 
 544.8487 

Again if we look at the p-value it is far below the 1.96 needed to accept the null hypothesis

SO once again we reject the Null hypothesis that \({mu}= 450ms\)

Linear regression analysis a.Plot a scatterplot of the reaction time data (y-axis) as a function of age(x-axis). Conduct a linear regression analysis of reaction time as a functionage and interpret the results.

First lets plot the data

scatter.smooth(x=Age, y=ReactionTime, main="Reaction Time ~ Age")  # scatterplot

Lets see if we can clean this up first

par(mfrow=c(1,2))
boxplot(Age, main="Age", sub=paste("Outlier rows: ", boxplot.stats(Age)$out))  # box plot for 'Age'
boxplot(ReactionTime, main="Reaction Time", sub=paste("Outlier rows: ", boxplot.stats(ReactionTime)$out))  # box plot for 'Reaction'

From these two box plots we see that there are no outliers affecting the graph so there is not a lot of clean up available for this particular data

On its own this gives some information but its quite sparse.We see a line suggesting that the higher the age the higher the reaction time but we have no numbers for this or indeed how close these points come to the line. Lets try to fix that

First the regression line. Ideally we want it to be of the form ( in this simple case) \[ response= {\beta}{_0}+{\beta}{_1}Indicator{_1}\] or \[y=c+mx\] where m is the slope of the line and c the intercept.

linearReg <- lm(ReactionTime ~ Age, data)
Regsummary<-summary(linearReg)
Regsummary

Call:
lm(formula = ReactionTime ~ Age, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-244.86  -66.97  -13.19   69.96  306.68 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 226.1311    35.9788   6.285 9.09e-09 ***
Age           5.7990     0.7535   7.696 1.12e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 109.8 on 98 degrees of freedom
Multiple R-squared:  0.3767,    Adjusted R-squared:  0.3703 
F-statistic: 59.23 on 1 and 98 DF,  p-value: 1.123e-11

Taking the data we need we now we can see that the line we want is \[Reaction Time = 226.1311 + 5.7990 * age\] The next thing we want to calculate is how much of this results is explained by the model. For this we use the \(R ^{2}\) value of .3703 which suggests that 37% of the reaction time is related to age.

Finally lets conside the error of the slope and work out a confidence interval for the slope of the regression line. The standard error for the slope of this line is give as 0.7535 and we kknow that the slope is 5.770 so a 95% confidence interval for this is

getCI <- function(xbar, se, n, alpha){
  t=qt(1-(alpha/2), df=n-2)
  me=t*se
  CI=c(xbar-me, xbar+se)
  return(CI)
}
CIslope<-getCI(Regsummary$coefficients[2], Regsummary$coefficients[4], 100, 0.05)
CIslope
[1] 4.303709 6.552541

so we have 2 slopes for our line and can do the same for the intercept

CIintercept<-getCI(Regsummary$coefficients[1], Regsummary$coefficients[2], 100, 0.05)
CIintercept
[1] 214.6232 231.9302

Lets use all this information to show a better plot of the data we looked at earlier( another function )

plotregression<- function(X, Y, title, xlab, ylab, c,clow, chigh, m_bar, mlow, mhigh, CI)
{
plot(X, Y, main=title, xlab=xlab, ylab=ylab)
abline(c, m_bar , col="black")
abline(clow, mlow, col= 'red')
abline(chigh, mhigh, col= 'red')
line<- paste(ylab, " = ", round(c, digits=2), " + ",round(m_bar, digits= 2), " * ", xlab )
confidence<- paste(CI, "% condfidence interval for the Regression Line")
key <-c(line, confidence )
color<- c("black", "red")
legend("topleft", key, fill = color, cex=0.5)
}
plotregression(Age, ReactionTime,
               "Plot of Reaction Time -V- Age /w regression lines",
               "Age(years)", "Reaction Time(ms)",
               Regsummary$coefficients[1], CIintercept[1], CIintercept[2],
               Regsummary$coefficients[2], CIslope[1], CIslope[2], 95
               )

b.Plot a scatterplot of the reaction time data (y-axis) as a function of height(x-axis). Conduct a linear regression analysis of reaction time as a function of height and interpret the results. First lets plot the data

scatter.smooth(x=Height, y=ReactionTime, main="Reaction Time ~ Height")  # scatterplot

Here we see a far flatter graph ( again no info provided) and no indication there is any trends bar a reduction in reaction times at the very tallest end of the graph

Lets look for outliers

par(mfrow=c(1,2))
boxplot(Height, main="Height", sub=paste("Outlier rows: ", boxplot.stats(Age)$out))  # box plot for 'Age'
boxplot(ReactionTime, main="Reaction Time", sub=paste("Outlier rows: ", boxplot.stats(ReactionTime)$out))  # box plot for 'Reaction'

Again no huge outliers which is good ( data is ‘clean’ ) so lets move on with the regression analysis

linearReg <- lm(ReactionTime ~ Height, data)
Regsummary<-summary(linearReg)
Regsummary

Call:
lm(formula = ReactionTime ~ Height, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-320.20 -115.69   25.34  100.50  310.46 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  687.439    271.605   2.531    0.013 *
Height        -1.173      1.611  -0.729    0.468  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 138.7 on 98 degrees of freedom
Multiple R-squared:  0.005387,  Adjusted R-squared:  -0.004762 
F-statistic: 0.5308 on 1 and 98 DF,  p-value: 0.468

Here we get the Linear regression formula to be \[Reaction Time(ms) = 687.439 - 1.173*Height( in cm) \] and the \(R^{2}\) value for this is 0.005358 which translates to 0.5% of the reaction time explained by the height, so it would be fair to say that Height has miniscule influence on the reaction time

Lets do Confidence intervals and the prettied up regression line( for what its worth)

CIslope<-getCI(Regsummary$coefficients[2], Regsummary$coefficients[4], 100, 0.05)
CIintercept<-getCI(Regsummary$coefficients[1], Regsummary$coefficients[3], 100, 0.05)
plotregression(Age, ReactionTime,
               "Plot of Reaction Time -V- Height /w regression lines",
               "Height(cm)", "Reaction Time(ms)",
               Regsummary$coefficients[1], CIintercept[1], CIintercept[2],
               Regsummary$coefficients[2], CIslope[1], CIslope[2], 95)

To see why the CI lines dont show on the graph lets look at the CI data for the intercept

CIintercept
[1] 148.4473 959.0442

So we see that this does not actually fit onto the graph at the scale we are at showing how little value this regression brings to the data

Logistic regression analysis: a.Conduct a Logistic regression analysis of age group as a function of reaction time and interpret the results. Lets run the regression first

logistic<- glm(formula=AgeGroup~ReactionTime, family=binomial("logit"), data  )
summary(logistic)

Call:
glm(formula = AgeGroup ~ ReactionTime, family = binomial("logit"), 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2024  -0.8247   0.4806   0.7638   1.6846  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -4.156989   1.014849  -4.096  4.2e-05 ***
ReactionTime  0.009952   0.002145   4.639  3.5e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 131.79  on 99  degrees of freedom
Residual deviance: 101.41  on 98  degrees of freedom
AIC: 105.41

Number of Fisher Scoring iterations: 4

From this we see that the logistic regression takes the form of \[Pr(Age Group 40-70|Reaction time)= \frac{e^{\eta_{i}}}{1+e^{\eta_{i}} }\] where \[\eta_{i}=-4.156989+0.009953*Reaction Time\]

and the odds on being in the the40-70 age group given a reaction time are \[e^{\eta_{i}}:1\] With eta as desribed.

Now we can investogate the significance of this model by investigating if a 95% Confidence interval have values either side of zero

CIintercept<-getCI(4.156989,1.014849, 100, 0.05)
CIslope<- getCI( 0.009952,0.002145, 100, 0.05)
CIintercept
[1] 2.143054 5.171838
CIslope
[1] 0.005695317 0.012097000

As we see ofr each of the values in the CI are positive so we can say that this model is significant.

Finally lets plot the regression

 # assign 'survival' to these 20 individuals non-randomly... most mortality occurs at smaller body size
dat=as.data.frame(cbind(ReactionTime,AgeGroup)) # saves dataframe with two columns: body size & survival
dat$AgeGroup=dat$AgeGroup-1 # just 
plot(dat$ReactionTime,dat$AgeGroup,xlab="Reaction Time",ylab="Probability age between 40-70", main='Logistic regression for Age Group V Reaction Time') # plot with body size on x-axis and survival (0 or 1) on y-axis
curve(predict(logistic,data.frame(ReactionTime=x),type="resp"),add=TRUE, col="red") # draws a curve based on prediction from logistic regression model
key<-c('Value of Reaction Time and Age group', 'Regression line of probability')
col<-c('black', 'red')
legend(600, 0.4, key, fill=col,cex=0.5 )

As we see here the graph is of a typical logistic regression shape so we can summarise that our models fits the data quite well

b.Conduct a Logistic regression analysis of age group as a function of height and interpret the results.

logistic<- glm(formula=AgeGroup~Height, family=binomial("logit"), data  )
summary(logistic)

Call:
glm(formula = AgeGroup ~ Height, family = binomial("logit"), 
    data = data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5539  -1.3860   0.8904   0.9689   1.1341  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  4.24830    4.05677   1.047    0.295
Height      -0.02204    0.02400  -0.918    0.358

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 131.79  on 99  degrees of freedom
Residual deviance: 130.95  on 98  degrees of freedom
AIC: 134.95

Number of Fisher Scoring iterations: 4

From this we see that the logistic regression takes the form of \[Pr(Age Group 40-70|Height)= \frac{e^{\eta_{i}}}{1+e^{\eta_{i}} }\]

where \[\eta_{i}=4.24830-0.02204*Height\]

and the odds on being in the the40-70 age group given a reaction time are

\[e^{\eta_{i}}:1\]

With eta as desribed.

Now we can investogate the significance of this model by investigating if a 95% Confidence interval have values either side of zero

CIintercept<-getCI( 4.24830,4.05677 , 100, 0.05)
CIslope<- getCI( -0.02204,0.024005, 100, 0.05)
CIintercept
[1] -3.802228  8.305070
CIslope
[1] -0.06967714  0.00196500

In this case we see that the CI clealy passes over 0 in both cases so we can say that the model is not significant. We can also plot this regression to see what it tells us.

 # assign 'survival' to these 20 individuals non-randomly... most mortality occurs at smaller body size
dat=as.data.frame(cbind(Height,AgeGroup)) # saves dataframe with two columns: body size & survival
dat$AgeGroup=dat$AgeGroup-1 # just 
plot(dat$Height,dat$AgeGroup,xlab="Height",ylab="Probability age between 40-70", main='Logistic regression for Age Group V Height') # plot with body size on x-axis and survival (0 or 1) on y-axis
curve(predict(logistic,data.frame(Height=x),type="resp"),add=TRUE, col="red") # draws a curve based on prediction from logistic regression model
key<-c('Value of Height and Age group', 'Regression line of probability')
col<-c('black', 'red')
legend(180, 0.4, key, fill=col,cex=0.5 )

Here we see the regression line is flatter and has no real way to predict the age group from the Height given( as we alluded to with the confidence intervals)

---
title: "Take Home Assignment"
subtitle: Ciaran Darcy D10123132 d10123132@mydit.ie
output:
  html_notebook: default
  pdf_document: default
  word_document: default
---

As described in the abstract we have 100 subjects undergoing various tests. The first thing to do is to read in the data 
**Q1  Read in the data** make sure that you are in the correct directory to read the data in. You can use the commented out setwd() command to bring you to the correct area if you need to. 
```{r}
#setwd("<YOUR DIRECTORY NAME HERE>")
data = read.csv("d10123132.csv", header = TRUE)

```
With that done we can begin to look at some of the features of the data
**Q2 Summarise the data and define types**
```{r}
Age=data$Age
Sex=data$Sex
Height=data$Height
ReactionTime=data$ReactionTime
AgeGroup= data$AGE_GROUP
```
Lets take a look at the basic data from the group
```{r}
summary(data)

```
Looking for the means and Standard Deviations where Available
**Q3 Calculate the mean and standard deviation of the Age, Height and Reaction Times of the test group**
```{r}
print(paste("Average Age: ",mean(Age), " years"))
print(paste("Standard Deviation Age:", sd(Age)," years" ))
       
```
```{r}
print(paste("Average Height: ",mean(Height), " cm"))
print(paste("Standard Deviation Heigh:", sd(Height), " cm"))
```
```{r}
print(paste("Average Reaction Time: ",mean(ReactionTime), " ms"))
print(paste("Standard Deviation Reaction Time:", sd(ReactionTime), " ms"))
```
Lets see what some of the data looks like
**Q4A Plot a histogram of the i) reaction time data, ii) age data, and iii) height data**

Since there is some plotting to be done here lets write a function to produce some nice plots from given parameters
```{r}
plotHistogram <- function(data, unit, title, xlab, ylab) {
  
  #data for mean and median
  mean=round(mean(data), digits= 2)
  median=round(median(data), digits = 2)
  # get the legend
  
  Mean <- paste("Mean:", toString(mean), unit, sep=" ")
  Median <-  paste("Median: ",  toString(median), unit, sep=" ")
  key <- c(Mean, Median)
  color <- c("red", "blue")
  #data for scaling 
  ymax <- max(hist(data, plot = FALSE)$counts)
 
  
  hist(data, main=title, xlab = xlab, ylab=ylab, labels=TRUE, ylim=c(0, 1.1*ymax), col=rgb(0,0,1,alpha=.25))


  #add mean median and legend
  abline(v=mean, col= 'red')
  abline(v=median, col='blue')
  legend('topright',key, fill = color, cex= 0.75)
}

```
This function takes in the data to be graphed, A unit of measure for the x values of the data, and some labelling information in Title, X-axis label and Y-axis label. It then returns a labelled plot with the mean and median marked in it.
Lets call it for Reaction Time data
```{r}
plotHistogram(data=ReactionTime, 
              unit="ms", 
              title= "Frequence of reaction Times", 
              xlab = "Reaction Times (ms)",
              ylab = "Frequecy of occurance")

```

This histogram shows that the data for the reaction time is roughly symmetrical with a mean at 489.8 ms and median of 508.5 ms. This tightness of mean to median suggests that there are few outliers to the data here. We also see that there is some symmatery to the data suggesting that the data may come from a normal distribution. There a are 2 peaks to the data at about 300-400 ms and at about 500-600 ms which hopefully can be explained with further analysis

```{r}
plotHistogram(data=Height, 
              unit="cm", 
              title= "Frequence of Heights in Sample group", 
              xlab = "Height(cm)",
              ylab = "Frequecy of occurance")
```
Here we see the data is slightly differently shaped. The mean and median are slightly more seperated and there is a tail to the data at the higher end ( > 190 cm) of the data. This suggests the presence of an outlier 'pulling' the data towards that end eg. A really tall person sampled.  
```{r}
plotHistogram(data = Age, 
              unit = "years",
              title = "Plot of the frequency of ages in the sample group",
              xlab = "Age in years",
              ylab = "Frequecy of ages"
                )
```

Finally this data is quite flat. The mean and median are almost identical suggeting that the data does not have many outliers or extreme values pulling at it. its also shows that the sample group is roughly equally likely to have ages anywhere from 20-70 


**b. Plot a histogram of the reaction time data for each age group (20-40yrs, 40-70yrs)**
First we need to group the sample into the age groups looked for then we can plot the data
```{r}
younger<- subset(data, AGE_GROUP=='20-40')
older <- subset(data, AGE_GROUP=="40-70")

youngReaction <- younger$ReactionTime
olderReaction <- older$ReactionTime

```
Once that is done we can write some code to prettily plot the graphs side by side 
```{r}
sidebyside<- function(X,Y,title=NULL,xlabel=NULL, ylabel=NULL, legend=NULL){
#set parameters for the scale of the graph
if (max(X) > max(Y)){
  xmax <- max(X)
}else{
  xmax <- max(Y)
}

xtall <- max(hist(X, plot=FALSE)$counts)
ytall <- max(hist(Y, plot=FALSE)$counts)

if (xtall>ytall){
  ymax=xtall
}else{
  ymax=ytall
}

#set the legends
key <- legend
color <- c("blue", "red")

meanX<- mean(X)
meanY<- mean(Y)


#plot the 2 graphs side by side 
hist(X, col=rgb(0,0,1,alpha=.25), 
     main=title, xlab=xlabel, ylab=ylabel, xlim = c(0, 1.1*xmax), labels=TRUE,  ylim=c(0,1.1*ymax)) 
hist(Y, col=rgb(1,0,0, alpha=0.25), xlim = c(0, 1.1*xmax), ylim=c(0,1.1*ymax),labels=TRUE, add=TRUE) 

#add lines showing mean of each section
abline(v=meanX, col= 'blue')
text(meanX, ymax, paste("mean:", round(meanX, digits=2)) , col = "blue", cex=0.5) 

abline(v=meanY, col='red')
text(meanY, (ymax-0.5), paste("mean:", round(meanY, digits=2)) , col = "red", cex=0.5) 

#include a legend
legend('topleft',key, fill = color, cex= 0.5 )

}
```

Finally with all of the prep work done we can plot the data
```{r}
sidebyside(youngReaction, olderReaction, title="Comparative reaction Times by Age", xlabel = "Reaction Times (ms)", ylabel="counts", legend=c("Age Group 20-40", "Age Group 40-70"))

```
Here we can see that for younger people the mean reaction is lower than for older people. The older population (in red) tends towars the 300-900 ms range while the younger population tends towards the 100-700 ms. We see a longer tail on the younger population that the corresponding tail on the older one which is slightly more symmaterical. When looking at this graph we see a reason for the 2 peaks meantioned in the previous reaction times graph. The 2 peaks of the combined graph coincide with the means of each of the age groups on this graph. 


**Given that the 
populations mean reaction time is 450ms. Conduct a hypothesis test stating the Null H0 and alternative Hα, calculating the test statistics, stating your criteria for rejection, and your conclusion for the reaction time data of the:
a.20-40 year old group
b. 40-70 year old group**

a. First lets write a function to calculate some Confidence intervals for the data and we will also use the features of the data to calculate them 


```{r}
findCI <- function(x, alpha){
  
  #set up variables
  xbar <- mean(x)
  sigma <- sd(x)
  n <- length(x)
  se <- sigma/sqrt(n)
  
  
  #calculate t
  tn_1_alpha.2 <- qt(1 - alpha/2, df = n - 1)
  
  #return the CI
  
  CI <-  c(xbar-tn_1_alpha.2*se,xbar+tn_1_alpha.2*se)
  return (CI)
  
} 

```
This takes a rejection criteria ( in decimal ) and a data set and calculates a confidence interval for the test statistic to fall inside in order for the null hypothesis to be accepted.
```{r}
findCI(youngReaction,0.05)
```

We would like our null hypothesis to fall within this region if we are to accept the null hypothesis
```{r}
CItest <- function ( h0, x, alpha){
  CI <- findCI( x, alpha)
  if(h0 < CI[1]){
    ans<-" H0 rejected"
    reason <-paste(h0, "falls below the lower limit of our", (1-alpha)*100, "% confidence interval which is found to be: " ,CI[1], sep=" ")
  }
  else if (h0>CI[2]){
    
    ans<-"H0 rejected"
    reason <-paste(h0, "falls above the upper limit of our", (1-alpha)*100, "% confidence interval which is found to be: " ,CI[2], sep=" ")
  }

  else{
    ans <- "H0 Not rejected"
    reason <-paste(h0, "falls between the lower and upper limit of our", (1-alpha)*100, "% confidence interval which is found to be between : " ,CI[1],"and",CI[2], sep=" ")
  }
  
return(paste(ans, reason, sep=" : "))  
}
```
Lets call this function to test what we can already see (h0: $mu$= 450 does not fit in our confidence interval)
```{r}


```

```{r}
CItest(450, youngReaction, 0.05)
```


So for this test we reject the null hypothesis. Because h0 falls outside the upper boundary of the the Confidence Interval we can deduce that the real mean is lower than 450. 
```{r}
t.test(youngReaction, mu=450, conf.level = .95)

```
This built in function gives some more information, most notably the p-value which here is significantly less than the $t{a/2},{n-1}$ so once again we can reject the null hypothesis
lets graph this on the histogram of ages ( means more functions.)

```{r}
histCI <- function(data, alpha, title=NULL, xlab=NULL, ylab=NULL, h0) {
  ymax=max(hist(data, plot=FALSE)$counts)
  hist(data, col=rgb(0,0,1,alpha=.25), 
       main=paste(title, " with ", ((1-alpha)*100), " Confidence Interval"), xlab=xlab, ylab=ylab,  ylim=c(0,1.1*ymax)) 
  abline(v=h0, col= 'blue')
  text(h0, ymax, "h0: mean=450ms" , col = "blue", cex=0.5) 
  CI <- findCI(data,alpha)
  CImin <- CI[1]
  CImax <- CI[2]
  abline(v=CImin, col= 'red')
  text(CImin, ymax-5, paste("CI min:", round(CImin, digits=2), " ms") , col = "red", cex=0.5) 
  abline(v=CImax, col= 'red')
  text(CImax, ymax-2.5, paste("CI max:", round(CImax, digits=2), " ms") , col = "red", cex=0.5) 
}
```

```{r}
histCI(youngReaction, 0.05, "Young reaction times", "Reaction Times (ms)", "counts", 450)
```


This shows that the hypothesised mean falls ( just)  outside the confidence intervals of the distribution
**b. 40 -70 age group**
We just repeat this for the older age group

```{r}
CItest(450, olderReaction, 0.05)
```

```{r}
histCI(olderReaction, 0.05, "Older reaction times", "Reaction Times (ms)", "counts", 450)
```
Here we see that that the Confidence interval is higher for the sample group, so that our mean of 450 is too low for the sample. Lets check the Ttest function to be sure

```{r}
t.test(olderReaction, mu=450)
```

Again if we look at the p-value it is far below the 1.96 needed to accept the null hypothesis

SO once again we reject the Null hypothesis that ${mu}= 450ms$


**Linear regression analysis**
*a.Plot a scatterplot of the reaction time data (y-axis) as a function of age(x-axis). Conduct a linear regression analysis of reaction time as a functionage and interpret the results.*

First lets plot the data
```{r}
scatter.smooth(x=Age, y=ReactionTime, main="Reaction Time ~ Age")  # scatterplot
```



Lets see if we can clean this up first

```{r}
par(mfrow=c(1,2))
boxplot(Age, main="Age", sub=paste("Outlier rows: ", boxplot.stats(Age)$out))  # box plot for 'Age'
boxplot(ReactionTime, main="Reaction Time", sub=paste("Outlier rows: ", boxplot.stats(ReactionTime)$out))  # box plot for 'Reaction'
```
From these two box plots we see that there are no outliers affecting the graph so there is not a lot of clean up available for this particular data


On its own this gives some information but its quite sparse.We see a line suggesting that the higher the age the higher the reaction time but we have no numbers for this or indeed how close these points come to the line. Lets try to fix that 

First the regression line. Ideally we want it to be of the form ( in this simple case) 
$$ response= {\beta}{_0}+{\beta}{_1}Indicator{_1}$$
or 
$$y=c+mx$$
where m is the slope of the line and c the intercept. 
```{r}

linearReg <- lm(ReactionTime ~ Age, data)
Regsummary<-summary(linearReg)
Regsummary

```

Taking the data we need we now we can see that the line we want is 
$$Reaction Time = 226.1311 + 5.7990 * age$$
The next thing we want to calculate is how much of this results is explained by the model. For this we use the $R ^{2}$ value of .3703 which suggests that **37%** of the reaction time is related to age. 

Finally lets conside the error of the slope and work out a confidence interval for the slope of the regression line.
The standard error for the slope of this line is give as **0.7535** and we kknow that the slope is **5.770** so a 95% confidence interval for this is 

```{r}
getCI <- function(xbar, se, n, alpha){
  t=qt(1-(alpha/2), df=n-2)
  me=t*se
  CI=c(xbar-me, xbar+se)
  return(CI)
}
```


```{r}
CIslope<-getCI(Regsummary$coefficients[2], Regsummary$coefficients[4], 100, 0.05)
CIslope
```

so we have 2 slopes for our line and can do the same for the intercept
```{r}
CIintercept<-getCI(Regsummary$coefficients[1], Regsummary$coefficients[2], 100, 0.05)
CIintercept
```

Lets use all this information to show a better plot of the data we looked at earlier( another function )

```{r}
plotregression<- function(X, Y, title, xlab, ylab, c,clow, chigh, m_bar, mlow, mhigh, CI)
{
plot(X, Y, main=title, xlab=xlab, ylab=ylab)

abline(c, m_bar , col="black")
abline(clow, mlow, col= 'red')
abline(chigh, mhigh, col= 'red')

line<- paste(ylab, " = ", round(c, digits=2), " + ",round(m_bar, digits= 2), " * ", xlab )
confidence<- paste(CI, "% condfidence interval for the Regression Line")
key <-c(line, confidence )
color<- c("black", "red")
legend("topleft", key, fill = color, cex=0.5)
}
```
```{r}
plotregression(Age, ReactionTime,
               "Plot of Reaction Time -V- Age /w regression lines",
               "Age(years)", "Reaction Time(ms)",
               Regsummary$coefficients[1], CIintercept[1], CIintercept[2],
               Regsummary$coefficients[2], CIslope[1], CIslope[2], 95
               )
```


*b.Plot a scatterplot of the reaction time data (y-axis) as a function of height(x-axis). Conduct a linear regression analysis of reaction time as a function of height and interpret the results.*
First lets plot the data
```{r}
scatter.smooth(x=Height, y=ReactionTime, main="Reaction Time ~ Height")  # scatterplot
```
Here we see a far flatter graph ( again no info provided) and no indication there is any trends bar a reduction in reaction times at the very tallest end of the graph

Lets look for outliers

```{r}
par(mfrow=c(1,2))
boxplot(Height, main="Height", sub=paste("Outlier rows: ", boxplot.stats(Age)$out))  # box plot for 'Age'
boxplot(ReactionTime, main="Reaction Time", sub=paste("Outlier rows: ", boxplot.stats(ReactionTime)$out))  # box plot for 'Reaction'
```
Again no huge outliers which is good ( data is 'clean' ) so lets move on with the regression analysis
```{r}
linearReg <- lm(ReactionTime ~ Height, data)
Regsummary<-summary(linearReg)
Regsummary
```
Here we get the Linear regression formula to be
$$Reaction Time(ms) = 687.439 - 1.173*Height( in cm) $$
and the $R^{2}$ value for this is 0.005358 which translates to **0.5%** of the reaction time explained by the height, so it would be fair to say that Height has miniscule influence on the reaction time


Lets do Confidence intervals and the prettied up regression line( for what its worth)

```{r}
CIslope<-getCI(Regsummary$coefficients[2], Regsummary$coefficients[4], 100, 0.05)


CIintercept<-getCI(Regsummary$coefficients[1], Regsummary$coefficients[3], 100, 0.05)


plotregression(Age, ReactionTime,
               "Plot of Reaction Time -V- Height /w regression lines",
               "Height(cm)", "Reaction Time(ms)",
               Regsummary$coefficients[1], CIintercept[1], CIintercept[2],
               Regsummary$coefficients[2], CIslope[1], CIslope[2], 95)
```
To see why the CI lines dont show on the graph lets look at the CI data for the intercept
```{r}

CIintercept
```
So we see that this does not actually fit onto the graph at the scale we are at showing how little value this regression brings to the data

**Logistic regression analysis:**
*a.Conduct a Logistic regression analysis of age group as a function of reaction time and interpret the results.*
Lets run the regression first 

```{r}
logistic<- glm(formula=AgeGroup~ReactionTime, family=binomial("logit"), data  )
summary(logistic)
```
From this we see that the logistic regression takes the form of
$$Pr(Age Group 40-70|Reaction time)= \frac{e^{\eta_{i}}}{1+e^{\eta_{i}} }$$
where
$$\eta_{i}=-4.156989+0.009953*Reaction Time$$

and the odds on being in the the40-70 age group given a reaction time are
$$e^{\eta_{i}}:1$$
With eta as desribed.

Now we can investogate the significance of this model by investigating if a  95% Confidence interval have values either side of zero

```{r}
CIintercept<-getCI(4.156989,1.014849, 100, 0.05)
CIslope<- getCI( 0.009952,0.002145, 100, 0.05)

CIintercept
CIslope
```
As we see ofr each of the values in the CI are positive so we can say that this model is significant. 

Finally lets plot the regression
```{r}
 # assign 'survival' to these 20 individuals non-randomly... most mortality occurs at smaller body size
dat=as.data.frame(cbind(ReactionTime,AgeGroup)) # saves dataframe with two columns: body size & survival
dat$AgeGroup=dat$AgeGroup-1 # just 

plot(dat$ReactionTime,dat$AgeGroup,xlab="Reaction Time",ylab="Probability age between 40-70", main='Logistic regression for Age Group V Reaction Time') # plot with body size on x-axis and survival (0 or 1) on y-axis

curve(predict(logistic,data.frame(ReactionTime=x),type="resp"),add=TRUE, col="red") # draws a curve based on prediction from logistic regression model

key<-c('Value of Reaction Time and Age group', 'Regression line of probability')
col<-c('black', 'red')

legend(600, 0.4, key, fill=col,cex=0.5 )
```
As we see here the graph is of a typical logistic regression shape so we can summarise that our models fits the data quite well

*b.Conduct a Logistic regression analysis of age group as a function of height and interpret the results.*

```{r}
logistic<- glm(formula=AgeGroup~Height, family=binomial("logit"), data  )
summary(logistic)
```
From this we see that the logistic regression takes the form of
$$Pr(Age Group 40-70|Height)= \frac{e^{\eta_{i}}}{1+e^{\eta_{i}} }$$

where
$$\eta_{i}=4.24830-0.02204*Height$$

and the odds on being in the the40-70 age group given a reaction time are

$$e^{\eta_{i}}:1$$

With eta as desribed.

Now we can investogate the significance of this model by investigating if a  95% Confidence interval have values either side of zero

```{r}
CIintercept<-getCI( 4.24830,4.05677 , 100, 0.05)
CIslope<- getCI( -0.02204,0.024005, 100, 0.05)

CIintercept
CIslope
```
In this case we see that the CI clealy passes over 0 in both cases so we can say that the model is not significant. We can also plot this regression to see what it tells us.
```{r}
 # assign 'survival' to these 20 individuals non-randomly... most mortality occurs at smaller body size
dat=as.data.frame(cbind(Height,AgeGroup)) # saves dataframe with two columns: body size & survival
dat$AgeGroup=dat$AgeGroup-1 # just 

plot(dat$Height,dat$AgeGroup,xlab="Height",ylab="Probability age between 40-70", main='Logistic regression for Age Group V Height') # plot with body size on x-axis and survival (0 or 1) on y-axis

curve(predict(logistic,data.frame(Height=x),type="resp"),add=TRUE, col="red") # draws a curve based on prediction from logistic regression model

key<-c('Value of Height and Age group', 'Regression line of probability')
col<-c('black', 'red')

legend(180, 0.4, key, fill=col,cex=0.5 )
```

Here we see the regression line is flatter and has no real way to predict the age group from the Height given( as we alluded to with the confidence intervals)