2024-02-20

Exercise 6.1.

  • How do you quantify the monotonic relationship between Handspan and Height in the qanda dataset?

  • Can you guess the R function to test this relationship formally?

  • Write down explicitly your null and alternative hypotheses.

Hypotheses:

\(H_0\): there is no monotonic relationship between Handspan and Height (\(\rho = 0\)).

\(H_1\): there is a monotonic relationship between Handspan and Height (\(\rho \neq 0\)).

Correlation test

cor.test (qanda$Handspan, qanda$Height)
## 
##  Pearson's product-moment correlation
## 
## data:  qanda$Handspan and qanda$Height
## t = 3.5256, df = 228, p-value = 0.0005106
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1009848 0.3465390
## sample estimates:
##       cor 
## 0.2273731

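The correlation coefficient quantifies the relationship. Pearson's r (0.227 above) measures linear association; a monotonic relationship is better measured by a rank correlation such as Spearman's rho. The p-value does not measure the strength of the relationship; it only indicates whether the correlation differs significantly from zero.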

R function to test it in a formal way

I use the Spearman rank correlation test:

cor.test(qanda$Handspan, qanda$Height, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  qanda$Handspan and qanda$Height
## S = 1340216, p-value = 1.354e-07
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.3390771
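
To illustrate why a rank-based coefficient suits a monotonic relationship, here is a minimal sketch on made-up data. The vectors `x` and `y` are illustrative only and are not taken from the qanda dataset. Because `y` increases with `x` but not linearly, Spearman's rho reaches 1 while Pearson's r stays below it.

```r
# Illustrative data only (not from qanda): y increases with x, but not linearly.
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotonic association

cat("Pearson r:   ", round(pearson, 3), "\n")
cat("Spearman rho:", round(spearman, 3), "\n")
```

Spearman's rho is computed on ranks, so any strictly increasing transformation of either variable leaves it unchanged.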

Exercise 6.2.

  • Begin a new program. Download Exam_results.csv.

  • What type of data do we have?

  • Create a scatter plot for the data.

  • Does there appear to be a connection between the orders of the two sets of exam results? Save the plot as a file.

Type of data:

We have bivariate discrete data: each observation is a pair of exam results.

Scatter plot:

  • Yes, there appears to be a connection.
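
The plotting step might look roughly like the sketch below. It assumes the column names `Exam1` and `Exam2` (as used in the Wilcoxon code for Exercise 6.3) and uses simulated stand-in data so that it runs without the CSV file. The output file name is hypothetical.

```r
# Stand-in data; in the exercise this would instead be something like
#   er <- read.csv("Exam_results.csv")
set.seed(1)
exam1 <- round(runif(30, 30, 90))
exam2 <- pmin(100, pmax(0, exam1 + round(rnorm(30, mean = 5, sd = 8))))
er <- data.frame(Exam1 = exam1, Exam2 = exam2)

# Open a PDF device, draw the scatter plot, then close the device
# so that the file is written to disk.
out_file <- file.path(tempdir(), "exam_scatter.pdf")  # hypothetical file name
pdf(out_file)
plot(er$Exam1, er$Exam2,
     xlab = "Exam 1", ylab = "Exam 2",
     main = "Exam results")
invisible(dev.off())
```

A PDF device is used here because it is available in every R installation; `png()` works the same way where it is supported.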

Exercises 6.3-4

Exercise 6.3.

  • Update your program from Exercise 6.2 so that it carries out the Wilcoxon test and,

  • when you source() it (see Section 3.2), it outputs the test statistic, p-value and method used, to the screen with appropriate text descriptions.

Exercise 6.4.

  • Extend your program from Exercise 6.3 so that it

  • outputs a conclusion (using if()), which depends upon the p-value.

Exercises 6.3-4

# er is the data frame of exam results loaded in Exercise 6.2
result <- wilcox.test (er$Exam1, er$Exam2, paired = TRUE)

cat ("The test statistic is:", result$statistic)
cat ("\n The p-value is:", result$p.value)
cat ("\n The method I use is:", result$method, "\n\n")

p <- result$p.value

if (p < 0.05) {
  print ("There is a difference between the orders of the two test results")
} else{
  print ("There is no difference between the orders of the two test results")
}

Exercises 6.3-4

## The test statistic is: 1143
## 
##  The p-value is: 0.01482317
## 
##  The method I use is: Wilcoxon signed rank test with continuity correction
## [1] "There is a difference between the orders of the two test results"
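
Since Exercise 6.5 repeats the same reporting steps, they could be wrapped in a small function. The sketch below is one possible way to do that, not part of the original solution: `report_test` and its `alpha` argument are hypothetical names, and the paired data are simulated. Explicit `cat()` calls matter here because a script run with `source()` does not auto-print values by default.

```r
# Hypothetical helper: prints the statistic, p-value and method of an
# "htest" object (as returned by wilcox.test) and returns a conclusion string.
report_test <- function(result, alpha = 0.05) {
  cat("The test statistic is:", result$statistic, "\n")
  cat("The p-value is:", signif(result$p.value, 4), "\n")
  cat("The method used is:", result$method, "\n")
  conclusion <- if (result$p.value < alpha) {
    "Reject H0 at the chosen level"
  } else {
    "Do not reject H0 at the chosen level"
  }
  cat("Conclusion:", conclusion, "\n")
  invisible(conclusion)
}

# Illustrative paired data (not the exam results).
set.seed(42)
before <- rnorm(25, mean = 50, sd = 10)
after  <- before + rnorm(25, mean = 5, sd = 4)
conclusion <- report_test(wilcox.test(before, after, paired = TRUE))
```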

Exercise 6.5. Start a new program.

  • Generate two random samples (simulations):

  • one of length 24 from an exponential distribution with mean 1 and

  • one of length 35 from an exponential distribution with mean 2

  • Test the hypothesis that they come from the same population,

  • outputting the p-value and a written conclusion to your test, similar to that you produced in Exercise 6.4.

Exercise 6.5.

set.seed (100)

sa <- rexp (n = 24, rate = 1)

sb <- rexp (n = 35, rate = 1/2)

result_ab <- wilcox.test(sa, sb)

cat ("The test statistic is:", result_ab$statistic)
cat ("\n The p-value is:", result_ab$p.value)
cat ("\n The method I use is:", result_ab$method, "\n\n")

p <- result_ab$p.value

if (p < 0.05) {
  print ("The two samples appear to come from different populations")
} else {
  print ("There is no evidence that the two samples come from different populations")
}

Exercise 6.5.

## The test statistic is: 201
## 
##  The p-value is: 0.0005453868
## 
##  The method I use is: Wilcoxon rank sum exact test
## [1] "The two samples appear to come from different populations"
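
The choice of `rate = 1` and `rate = 1/2` follows from the fact that `rexp()` is parameterised by the rate, and the mean of an exponential distribution is 1/rate. A quick check with large simulated samples (the sample size `n` is an arbitrary choice for this sketch):

```r
set.seed(100)
n <- 1e5  # arbitrary large sample size for the check

mean_rate_1    <- mean(rexp(n, rate = 1))      # should be close to 1
mean_rate_half <- mean(rexp(n, rate = 1 / 2))  # should be close to 2

cat("Sample mean with rate 1:  ", round(mean_rate_1, 3), "\n")
cat("Sample mean with rate 1/2:", round(mean_rate_half, 3), "\n")
```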

Exercise 6.6.

  • Have another look at the histogram for the log of population you created in Exercise 2.6 and update it so that

  • it has breaks from 11.5 to 21.5 with intervals of 1.

On your plot,

  • add points representing the values of a Poisson distribution with mean 16 (roughly the value of the mean of the log data) for the range of values.

  • Does that distribution appear to fit the data well?

  • Carry out a goodness of fit test comparing the data and distribution for four values: less than 16, 16 - 17, 18 - 19, greater than 19. What is the conclusion?

Take a look at the log of population histogram:

Add breaks from 11.5 to 21.5 with intervals of 1

Add points representing the values of a Poisson distribution with mean 16

Does that distribution appear to fit the data well?

It does not appear to.
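
The histogram and overlay steps could be sketched as follows. The original plotting code is not shown, so this is an assumption about how it might look: `logpop` is a simulated stand-in for the log-population values, and the Poisson probabilities are scaled by the sample size so the points are comparable with the histogram counts.

```r
# Stand-in for the log-population values from Exercise 2.6 (simulated here).
set.seed(2)
logpop <- rnorm(200, mean = 16, sd = 1.8)
logpop <- pmin(pmax(logpop, 11.6), 21.4)  # keep all values inside the breaks

breaks   <- seq(11.5, 21.5, by = 1)       # bins centred on 12, 13, ..., 21
k        <- 12:21                         # bin midpoints
expected <- dpois(k, lambda = 16) * length(logpop)  # expected count per bin

# Compute the histogram first so the y-axis can fit both bars and points.
h <- hist(logpop, breaks = breaks, plot = FALSE)

out_file <- file.path(tempdir(), "logpop_hist.pdf")  # hypothetical file name
pdf(out_file)
plot(h, ylim = c(0, max(h$counts, expected)),
     xlab = "log(population)",
     main = "Log of population with Pois(16) points")
points(k, expected, pch = 19, col = "red")
invisible(dev.off())
```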

Carry out a goodness of fit test

dfpoints <- data.frame (pop = pop,
                        prob_pop = poispoints,
                        expected_pop = poispoints * sum(pop))

Less than 16

observed_less16 <- pop [pop < 16]
row_less16 <- which (dfpoints$pop < 16)
expected_less16 <- dfpoints[row_less16, ] $expected_pop

result_less16 <-  chisq.test(observed_less16, 
                             p = expected_less16 / sum(expected_less16))
print (result_less16)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_less16
## X-squared = 31.328, df = 14, p-value = 0.004986

16 to 17

observed_1617 <- pop [pop > 16 & pop < 17]
row_1617 <- which (dfpoints$pop > 16 & dfpoints$pop < 17)
expected_1617 <- dfpoints[row_1617, ] $expected_pop

result_1617 <-  chisq.test(observed_1617, 
                           p = expected_1617 / sum(expected_1617))
print (result_1617)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_1617
## X-squared = 3.4023, df = 7, p-value = 0.8455

18 to 19

observed_1819 <- pop [pop > 18 & pop < 19]
row_1819 <- which (dfpoints$pop > 18 & dfpoints$pop < 19)
expected_1819 <- dfpoints[row_1819, ] $expected_pop

result_1819 <-  chisq.test(observed_1819, 
                           p = expected_1819 / sum(expected_1819))
print (result_1819)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_1819
## X-squared = 8.5813, df = 2, p-value = 0.0137

19+

observed_more19 <- pop [pop > 19]
row_more19 <- which (dfpoints$pop > 19)
expected_more19 <- dfpoints[row_more19, ] $expected_pop

result_more19 <-  chisq.test(observed_more19, 
                           p = expected_more19 / sum(expected_more19))
print (result_more19)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed_more19
## X-squared = 3.0322, df = 3, p-value = 0.3867

Table of these results

pdf <- data.frame (group = c ("less than 16", "16 - 17",
                              "18 - 19", "greater than 19"),
                   "p-values" = c (result_less16$p.value,
                                   result_1617$p.value,
                                   result_1819$p.value,
                                   result_more19$p.value))
print (pdf)
##             group    p.values
## 1    less than 16 0.004985531
## 2         16 - 17 0.845465442
## 3         18 - 19 0.013695929
## 4 greater than 19 0.386686989

Overall?

observed <- pop

expected <- dfpoints$expected_pop

result <-  chisq.test(observed, 
                      p = expected / sum(expected))

print(result)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 65.058, df = 35, p-value = 0.001505

Updated table

pdf <- data.frame (group = c ("less than 16", "16 - 17",
                              "18 - 19", "greater than 19",
                              "overall"),
                   "p-values" = c (result_less16$p.value,
                                   result_1617$p.value,
                                   result_1819$p.value,
                                   result_more19$p.value,
                                   result$p.value))
print (pdf)
##             group    p.values
## 1    less than 16 0.004985531
## 2         16 - 17 0.845465442
## 3         18 - 19 0.013695929
## 4 greater than 19 0.386686989
## 5         overall 0.001505146

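Is Pois(16) a good model?

  • \(H_0\): The Poisson model with mean 16 is a good fit.

  • \(H_1\): The Poisson model with mean 16 is not a good fit.

  • Conclusion: the overall p-value (0.0015) is below 0.05, so we reject \(H_0\); the Poisson model with mean 16 does not appear to fit the data well.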
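
Another way to read the exercise is as a single chi-squared test on four binned counts (less than 16, 16 - 17, 18 - 19, greater than 19), with the expected proportions taken from the Poisson(16) distribution. The sketch below shows that reading under stated assumptions: `logpop` is simulated stand-in data, and the log values are rounded to whole numbers before binning so that they line up with Poisson counts.

```r
# Stand-in for the log-population values (simulated here).
set.seed(3)
logpop <- rnorm(200, mean = 16, sd = 1.8)

# Round to whole numbers, then bin into the four groups from the exercise:
# <= 15, 16-17, 18-19, >= 20.
groups <- cut(round(logpop),
              breaks = c(-Inf, 15, 17, 19, Inf),
              labels = c("less than 16", "16 - 17",
                         "18 - 19", "greater than 19"))
observed <- table(groups)

# Poisson(16) probability of each group; the four values sum to 1.
cum   <- ppois(c(15, 17, 19), lambda = 16)
probs <- diff(c(0, cum, 1))

# One goodness-of-fit test across the four groups (df = 4 - 1 = 3).
gof <- chisq.test(as.vector(observed), p = probs)
print(gof)
```

With this setup the test compares observed counts per group against the counts expected under Pois(16), and the degrees of freedom equal the number of groups minus one.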