October 17, 2022

Week 12

Are you looking at news articles differently after this introduction to statistics?

Sometimes you don’t need to be a statistician to call BS!

Recap week 11

  • We can perform a linear regression
  • We can interpret the output of a linear regression
  • We know what assumptions to check for (and how)
  • We know how to report the results of a linear regression analysis
  • We can plot the results of a linear regression
  • We can predict a y value based on any x value, the intercept, and the slope

What’s left

  • Revision questions
  • Old lab reports
  • SPEQ: please fill in the questionnaire online!

How did we do…

‘Mathews… we are getting another of those strange “aw blah es pan yol” sounds.’

Was there enough…

  • Common sense
  • Communication
  • Rules?

How did you do…? Be honest with yourself

  • COME TO YOUR LECTURES/LABS: those who are present tend to do well
  • Have the right attitude (YOU are responsible for your learning)
  • Be organised, time yourself: Revise your labs every single week!
  • Be creative, be inquisitive, ask questions, and don’t let up!
  • Be realistic: 102 hours of self-directed learning (as per paper descriptor) equates to 8.5 hours per week over 12 weeks; that’s 2 afternoons per week!
  • Don’t get lost in the ocean of available resources: choose one, stick to it, and read!

Week 2

| Type of variable | Categorical (Binomial) | Categorical (Nominal) | Categorical (Ordinal) | Continuous |
|---|---|---|---|---|
| Predictor | smoker, gender, handedness | state of mind, hair colour | age class, rank | long jump results, body weight |
| Response | survival, handedness | employment type, hair colour | income bracket, clutch size | cholesterol level, body weight |

A variable has a name and values. Examples:

  • Variable name: handedness, values: left, right
  • Variable name: body weight, values: 63.4, 88.2, …
  • A categorical predictor variable is often called a ‘factor’, its values ‘factor levels’
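In R, the two kinds of variables above might be stored as follows. This is a minimal sketch: the variable names come from the examples, but the individual values are made up for illustration.

```r
# Hypothetical values; only the variable names come from the examples above
handedness <- factor(c("left", "right", "right", "left", "right"))
body_weight <- c(63.4, 88.2, 71.0, 59.8, 80.5)

levels(handedness)       # the factor levels
nlevels(handedness)      # how many factor levels there are
table(handedness)        # number of observations per factor level
is.numeric(body_weight)  # a continuous variable is stored as plain numbers
```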

Week 3

  • The variance has one problem: it is measured in units squared
  • Squared units are hard to interpret, so we take the square root
  • This is the standard deviation (\(s\), sometimes \(sd\)):

\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]

In R:

friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175
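
To see where that number comes from, the formula can be followed step by step. The sketch below uses the same `friends` vector; the intermediate variable names are chosen here for illustration only.

```r
friends <- c(1, 2, 3, 3, 4)

deviations <- friends - mean(friends)  # x_i minus the mean
sum_of_squares <- sum(deviations^2)    # numerator under the square root
n <- length(friends)

variance <- sum_of_squares / (n - 1)   # divide by the degrees of freedom
s <- sqrt(variance)                    # back to the original units

s
sd(friends)  # the built-in function should return the same value
```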

Week 4

Week 5

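We can commit two types of errors: a type I error (rejecting a null hypothesis that is true) and a type II error (failing to reject a null hypothesis that is false).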

Week 6

Week 7

We need a metric (a test statistic!) that puts the difference between the samples into perspective with

  • the difference between the samples that we would expect by chance, and
  • the standard deviations of the two samples

This is called the t-statistic:

\[t = \frac{\text{observed difference} - \text{expected difference}}{\text{estimate of the standard error of the difference}}\]

Under the null hypothesis, the expected difference is usually zero (this is the case in the following examples).
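
The t-statistic can be computed by hand and compared with `t.test()`. This is a sketch under stated assumptions: the two groups are made-up numbers, and the denominator is taken to be the standard error of the difference in its Welch form, which is what `t.test()` uses by default.

```r
# Made-up measurements for two independent groups (illustrative only)
group_a <- c(5.1, 4.8, 6.0, 5.5, 5.9, 4.7)
group_b <- c(6.2, 6.8, 5.9, 7.1, 6.5, 6.0)

observed_difference <- mean(group_a) - mean(group_b)
expected_difference <- 0  # value assumed under the null hypothesis

# Standard error of the difference (Welch form, unequal variances allowed)
se_difference <- sqrt(var(group_a) / length(group_a) +
                      var(group_b) / length(group_b))

t_manual <- (observed_difference - expected_difference) / se_difference
t_manual

# The statistic reported by t.test() should agree with the manual value
t.test(group_a, group_b)$statistic
```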

Week 8

There is a trade-off between sample size, standard deviation, expected difference, type I error probability (\(\alpha\)), and type II error probability (\(\beta\)), where power is \(1-\beta\)!
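
This trade-off can be explored with `power.t.test()`, which solves for whichever quantity is left unspecified. The numbers below are assumptions picked for illustration, not values from the course.

```r
# All numbers are assumptions chosen for illustration
result <- power.t.test(
  delta = 1,         # expected difference between the group means
  sd = 1,            # assumed standard deviation within each group
  sig.level = 0.05,  # type I error probability (alpha)
  power = 0.8        # 1 - beta
)                    # n is left out, so it is the quantity solved for

result
ceiling(result$n)    # round up to whole observations per group

# With a smaller standard deviation, fewer observations should be needed
smaller_sd <- power.t.test(delta = 1, sd = 0.5, sig.level = 0.05, power = 0.8)
ceiling(smaller_sd$n)
```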

Week 9

  • Create a data frame in R that contains the contingency table you would like to test
  • Use the chisq.test() function on it
cats = data.frame(food = c(28, 10), affection = c(48, 114))
cats
  food affection
1   28        48
2   10       114
chisq.test(cats)

    Pearson's Chi-squared test with Yates' continuity correction

data:  cats
X-squared = 23.52, df = 1, p-value = 1.236e-06

Note that the computed chi-squared value is slightly different from the one we calculated by hand. This is because R applies Yates' continuity correction to 2 × 2 tables by default, which we need not worry about now.
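
To check the hand calculation against R, the correction can be switched off with `correct = FALSE`. The sketch below also computes the uncorrected statistic directly from observed and expected counts; the intermediate variable names are illustrative.

```r
cats <- data.frame(food = c(28, 10), affection = c(48, 114))

# Pearson statistic computed from observed and expected counts
observed <- as.matrix(cats)
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
chi_squared_manual <- sum((observed - expected)^2 / expected)
chi_squared_manual

# Same test in R with the continuity correction turned off
uncorrected <- chisq.test(cats, correct = FALSE)
uncorrected$statistic
```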

What does the p-value tell you?

Week 10

  • The correlation coefficient, \(R\), shows us how strongly two variables are correlated
  • It ranges from -1 to 1; a negative sign means a negative correlation
  • It is an effect size:
  • ± 0.1 = small effect
  • ± 0.3 = medium effect
  • ± 0.5 = large effect
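
A correlation coefficient can be obtained in R with `cor()` and tested with `cor.test()`. The data below are made up for illustration, and the variable names are hypothetical.

```r
# Made-up paired measurements (illustrative only)
hours_revised <- c(2, 4, 5, 7, 8, 10, 11, 13)
exam_score    <- c(48, 55, 53, 62, 70, 68, 77, 83)

# Covariance scaled by the two standard deviations
r_manual <- cov(hours_revised, exam_score) /
  (sd(hours_revised) * sd(exam_score))
r_manual

cor(hours_revised, exam_score)       # built-in Pearson coefficient
cor.test(hours_revised, exam_score)  # also reports a significance test
```

Passing `method = "spearman"` or `method = "kendall"` to either function switches to the rank-based coefficients.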

Week 11

summary(m1) #m1 as defined in corresponding slides of week 11
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2067 -0.5234 -0.1801  0.5494  1.3408 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.8008     0.6415  -1.248    0.247    
x             1.1215     0.1034  10.848 4.61e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9391 on 8 degrees of freedom
Multiple R-squared:  0.9363,    Adjusted R-squared:  0.9284 
F-statistic: 117.7 on 1 and 8 DF,  p-value: 4.61e-06
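
The intercept and slope estimates from this output are enough to predict y for any x. The sketch below copies the two coefficients from the output; the x values are hypothetical.

```r
# Coefficients copied from the summary(m1) output
intercept <- -0.8008
slope <- 1.1215

# Hypothetical x values at which to predict
x_new <- c(2, 5, 8)
y_hat <- intercept + slope * x_new
y_hat
```

If the fitted model `m1` from week 11 is available in the session, `predict(m1, newdata = data.frame(x = x_new))` should give the same predictions up to rounding of the coefficients.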

Tests/models we looked at

  • t-test (incl. power t-test)
  • Wilcoxon test
  • Shapiro-Wilk test
  • Correlation test (Pearson, Spearman, Kendall)
  • Chi-squared test
  • Linear regression

(1) Systematic variation is

  1. variation caused by random effects in any direction
  2. variation coming from a factor that introduces bias in one direction only
  3. always deflating the standard deviation
  4. all of the above

(2) The y-axis of a histogram represents…

  1. the frequencies
  2. the scores
  3. any metric
  4. none of the above

(3) The standard error…

  1. equals the standard deviation divided by the square root of the sample size
  2. estimates the correlation between two variables
  3. equals the mean in most cases
  4. often equals the standard deviation

(4) Pick the variable that is categorical and nominal

  1. Rank in a 100 m race
  2. Civil status
  3. Number of cigarettes smoked per day
  4. Cholesterol level in blood samples (in mg/ml)

(5) Which of the following is not correct?

  1. An independent variable is the same as a predictor variable
  2. A predictor variable can be a factor
  3. A response variable is the same as a factor
  4. A factor can have two or more factor levels
  5. A response variable can be categorical or continuous

(6) Which operation in calculating the variance helps to make sure the latter does not automatically increase with sample size?

  1. Squaring the sum of the differences between observations and the mean
  2. Dividing by the degrees of freedom
  3. Summing up the differences between observations and the mean
  4. All of the above are correct

(7) In this sample: [1, 3, 11, 15, 19, 20, 22, 24, 30, 31, 39], …

  1. the median is 20
  2. the first quartile is 11
  3. the third quartile is 30
  4. the interquartile range is 19
  5. all of the above are correct
  6. only (1) is correct

(8) A population is known to be chi-squared distributed with 4 degrees of freedom. What is the chance of finding a value greater than 10?

  1. About 95%
  2. About 50%
  3. About 4%
  4. less than 4%
  5. Almost zero

(9) You want to find out whether protein content in yogurt (on a continuous scale from 0 to 10) affects the viscosity of the product. What test would you use?

  1. A two-tailed Wilcoxon test
  2. A two-tailed t-test
  3. A correlation analysis
  4. A regression approach

(10) In a power t-test, if you increase your sample size AND you decrease your standard deviation,…

  1. you increase your power
  2. you increase your type I error probability
  3. you decrease your type II error probability
  4. you improve your chances of finding a difference, should there be one
  5. All but (2) are correct

(11) After regression analysis, we need to check…

  1. whether the residuals are continuous
  2. whether the residuals are homogeneous along the fitted values
  3. whether the residuals are normal
  4. all of the above
  5. only (2) and (3) are correct

(12) Which is true for the correlation coefficient?

  1. It ranges from -1 to 1
  2. It is not necessarily associated with a p-value
  3. It has nothing to do with the t-statistic
  4. It is the normalised covariance
  5. All of the above are correct

(13) Just by sketching a standard normal distribution, what is the probability of sampling a value greater than 2 from such a distribution?

  1. the same as sampling a value lower than -2
  2. quite low, maybe a few percent
  3. around 30%
  4. infinitely small
  5. (1) and (2) are correct

(14) In regression analysis, we try to…

  1. minimise the sum of squared differences between the observed values and the mean
  2. minimise the distances between the fitted values and the observed values
  3. minimise the sum of squared differences between the fitted values and the observed values
  4. maximise the sum of squared differences between the fitted values and the observed values

(15) If, in a chi-squared test with 1 degree of freedom, you obtain a chi-squared value of 1, this means that…

  1. your p-value is extremely low
  2. your p-value is clearly not significant (i.e. not below 0.05)
  3. your p-value is above 1
  4. given the information, it is impossible to make a statement on the p-value

Short answer questions

  1. Describe an experimental setting that would require you to conduct a t-test

  2. Make up the corresponding data frame and conduct the test

  3. Describe a situation in which you need a power test to determine the sample size. Come up with some numbers and perform the test.

  4. Describe a setting in which you best use a linear regression model. Invent the numbers, conduct the complete analysis, including checking for the model assumptions and interpretation of the coefficients for the intercept and the slope.