Are you looking at news articles differently after this introduction to statistics?
Sometimes you don’t need to be a statistician to call BS!
Recap week 11
- We can perform a linear regression
- We can interpret the output of a linear regression
- We know what assumptions to check for (and how)
- We know how to report the results of a linear regression analysis
- We can plot the results of a linear regression
- We can predict a y value based on any x value, the intercept, and the slope
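The recap points above can be walked through in a few lines of R. This is only a sketch: the vectors `x` and `y` are invented for illustration and are not the data used in the week 11 slides.

```r
# Hypothetical data, invented for illustration only
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y = c(1.2, 1.9, 3.4, 3.8, 5.3, 5.9, 7.4, 7.7, 9.1, 10.2)

m = lm(y ~ x)                  # perform the linear regression
summary(m)                     # output to interpret and report

# Assumption checks on the residuals
shapiro.test(residuals(m))     # are the residuals normal?
plot(fitted(m), residuals(m))  # spread should look even along the fitted values
abline(h = 0, lty = 2)

# Plot the data together with the fitted line
plot(x, y)
abline(m)

# Predict y for a new x value, by hand and with predict()
b = coef(m)                    # b[1] is the intercept, b[2] is the slope
y_hand = b[1] + b[2] * 5.5
y_pred = predict(m, newdata = data.frame(x = 5.5))
```

Both `y_hand` and `y_pred` should give the same value, because `predict()` applies the same intercept-plus-slope formula.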
What’s left
- Revision questions
- Old lab reports
- SPEQ: please fill in the questionnaire online!
How did we do…
‘Mathews… we are getting another of those strange “aw blah es pan yol” sounds.’
Was there enough…
- Common sense
- Communication
- Rules?
How did you do…? Be honest with yourself
- COME TO YOUR LECTURES/LABS: those who are present tend to do well
- Have the right attitude (YOU are responsible for your learning)
- Be organised, time yourself: Revise your labs every single week!
- Be creative, inquisitive, ask questions, don’t give up!
- Be realistic: 102 hours self-directed learning (as per paper descriptor) equates to 8.5 hours per week, that’s 2 afternoons/week!
- Don’t get lost in the ocean of available resources, choose one, stick to it, read!
Week 2
| Type of variable | Categorical (Binomial) | Categorical (Nominal) | Categorical (Ordinal) | Continuous |
|---|---|---|---|---|
| Predictor | smoker, gender, handedness | state of mind, hair colour | age class, rank | long jump results, body weight |
| Response | survival, handedness | employment type, hair colour | income bracket, clutch size | cholesterol level, body weight |
A variable has a name and values. Examples:
- Variable name: handedness, values: left, right
- Variable name: body weight, values: 63.4, 88.2, …
- A categorical predictor variable is often called a ‘factor’, its values ‘factor levels’
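In R, this distinction might look as follows. The sketch assumes a handful of made-up observations; only the values 63.4 and 88.2 come from the example above.

```r
# A categorical variable stored as a factor (observations are invented)
handedness = factor(c("left", "right", "right", "left", "right"))
levels(handedness)      # the factor levels: "left" "right"
table(handedness)       # how often each level occurs

# A continuous variable stored as a numeric vector
body_weight = c(63.4, 88.2, 71.0, 59.8, 80.5)

is.factor(handedness)   # TRUE
is.numeric(body_weight) # TRUE
```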
Week 3
- The variance has one problem: it is measured in units squared
- This isn’t a very meaningful metric so we take the square root value
- This is the standard deviation (\(s\), sometimes \(sd\)):
\[s = \sqrt{\frac{\sum(x_i-\bar{x})^2}{n-1}} = \sqrt{\frac{5.2}{4}} = 1.14\]
In R:
friends = c(1, 2, 3, 3, 4)
sd(friends)
[1] 1.140175
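To see where the numbers in the formula come from, the same calculation can be broken into steps. The intermediate variable names below are chosen for illustration.

```r
friends = c(1, 2, 3, 3, 4)

n = length(friends)                    # sample size: 5
deviations = friends - mean(friends)   # x_i minus the mean (the mean is 2.6)
ss = sum(deviations^2)                 # sum of squared deviations: 5.2
variance = ss / (n - 1)                # 5.2 / 4 = 1.3, in units squared
s = sqrt(variance)                     # back to the original units: 1.140175

s
sd(friends)                            # same result
```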
Week 4
Week 5
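We can commit 2 types of errors:
- Type-I error (probability \(\alpha\)): we reject the null hypothesis although it is true
- Type-II error (probability \(\beta\)): we fail to reject the null hypothesis although it is false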
Week 6
Week 7
We need a metric (a test statistic!) that puts the difference between the samples into perspective with
- the difference between the samples that we would expect by chance, and
- the standard deviations of the two samples
This is called the t-statistic:
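\[t = \frac{\text{observed difference} - \text{expected difference}}{\text{estimate of the standard error of the difference}}\]
The standard error of the difference is computed from the standard deviations and sizes of the two samples.
The expected difference under the null hypothesis is usually zero (this is the case in the following examples)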
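As a rough illustration, the t-statistic can be computed by hand and compared with `t.test()`. The two samples below are invented, and the denominator is estimated here as the standard error of the difference used by R's default (Welch) two-sample test, which is built from the two sample standard deviations and sample sizes.

```r
# Two hypothetical samples, invented for illustration only
a = c(5.1, 4.8, 6.0, 5.5, 5.9, 5.2)
b = c(4.2, 4.9, 4.4, 5.0, 4.1, 4.6)

observed_diff = mean(a) - mean(b)
expected_diff = 0                         # expected difference under the null hypothesis

# Standard error of the difference between the two means
se_diff = sqrt(var(a) / length(a) + var(b) / length(b))

t_hand = (observed_diff - expected_diff) / se_diff
t_hand

t_r = t.test(a, b)$statistic              # Welch two-sample t-test (R's default)
t_r
```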
Week 8
There is a trade-off between sample size, standard deviation, expected difference, type-I error probability (\(\alpha\)) and type-II error probability (\(\beta\); the power is \(1-\beta\))! Once four of these are fixed, the fifth is determined.
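This trade-off can be explored with `power.t.test()`, which solves for whichever quantity is left out. The planning numbers below (a difference of 1, standard deviations of 2 and 3, 20 observations per group) are assumptions made up for the sketch.

```r
# Hypothetical planning numbers, invented for illustration only

# Solve for the sample size per group
res = power.t.test(delta = 1, sd = 2, sig.level = 0.05, power = 0.8)
res$n

# Same difference, but noisier data: more observations are needed
res_noisy = power.t.test(delta = 1, sd = 3, sig.level = 0.05, power = 0.8)
res_noisy$n

# Fix the sample size instead and solve for the power
res_power = power.t.test(n = 20, delta = 1, sd = 2, sig.level = 0.05)
res_power$power
```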
Week 9
- Create a data frame in R that contains the contingency table you would like to test
- Use the chisq.test() function on it
cats = data.frame(food = c(28, 10), affection = c(48, 114))
cats
food affection
1 28 48
2 10 114
chisq.test(cats)
Pearson's Chi-squared test with Yates' continuity correction
data: cats
X-squared = 23.52, df = 1, p-value = 1.236e-06
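Note that the computed chi-squared value is slightly smaller than the one we calculated. This is because chisq.test() applies Yates' continuity correction to 2 × 2 tables by default (it can be switched off with correct = FALSE), which we need not worry about now.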
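For those who want to see the difference, the following sketch computes the chi-squared value with and without the correction. The variable names `observed`, `expected`, `x2` and `x2_yates` are chosen for illustration.

```r
cats = data.frame(food = c(28, 10), affection = c(48, 114))

# Test without the continuity correction
chisq.test(cats, correct = FALSE)

# The same calculation by hand
observed = as.matrix(cats)
expected = outer(rowSums(observed), colSums(observed)) / sum(observed)

x2 = sum((observed - expected)^2 / expected)                   # uncorrected
x2_yates = sum((abs(observed - expected) - 0.5)^2 / expected)  # with Yates' correction

x2
x2_yates   # matches the X-squared value reported by chisq.test(cats)
```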
What does the p-value tell you?
Week 10
- The correlation coefficient \(R\) shows us how strongly two variables are correlated
- It ranges from -1 to 1; a negative sign means a negative correlation
- It is an effect size:
  - ± 0.1 = small effect
  - ± 0.3 = medium effect
  - ± 0.5 = large effect
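A short sketch of how the correlation coefficient can be obtained in R. The variables `hours` and `score` are hypothetical and the numbers are invented.

```r
# Hypothetical data, invented for illustration only
hours = c(2, 4, 5, 7, 8, 10, 11, 13)
score = c(51, 58, 55, 66, 70, 68, 79, 85)

r = cor(hours, score)    # Pearson correlation coefficient
r

cor.test(hours, score)                       # adds a t-statistic and a p-value
cor.test(hours, score, method = "spearman")  # rank-based alternative

# The coefficient is the covariance divided by both standard deviations
r_hand = cov(hours, score) / (sd(hours) * sd(score))
r_hand
```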
Week 11
summary(m1) #m1 as defined in corresponding slides of week 11
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-1.2067 -0.5234 -0.1801 0.5494 1.3408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.8008 0.6415 -1.248 0.247
x 1.1215 0.1034 10.848 4.61e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9391 on 8 degrees of freedom
Multiple R-squared: 0.9363, Adjusted R-squared: 0.9284
F-statistic: 117.7 on 1 and 8 DF, p-value: 4.61e-06
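The coefficients in this output are all that is needed to predict a y value by hand. The helper `predict_y()` below is a hypothetical name introduced for the sketch; the intercept and slope are copied from the output above.

```r
intercept = -0.8008   # estimate for (Intercept)
slope = 1.1215        # estimate for x

# Hypothetical helper: predicted y for any x value
predict_y = function(x) intercept + slope * x

predict_y(5)          # -0.8008 + 1.1215 * 5 = 4.8067
predict_y(0)          # at x = 0 the prediction equals the intercept

# The t value is the estimate divided by its standard error
slope / 0.1034        # close to the reported 10.848 (inputs are rounded)
```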
Tests/models we looked at
- t-test (incl. power t-test)
- Wilcoxon test
- Shapiro-Wilk test
- Correlation test (Pearson, Spearman, Kendall)
- Chi-squared test
- Linear regression
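As a memory aid, these are the R functions that go with the list above. All data are simulated for the sketch and have no connection to the lab data sets; the contingency table reuses the cat counts from week 9.

```r
# Simulated data, for illustration only
set.seed(42)
group_a = rnorm(15, mean = 10, sd = 2)
group_b = rnorm(15, mean = 12, sd = 2)
x = 1:15
y = 2 + 0.5 * x + rnorm(15)

t.test(group_a, group_b)                   # t-test
power.t.test(delta = 2, sd = 2, power = 0.8)  # power t-test (solves for n)
wilcox.test(group_a, group_b)              # Wilcoxon test
shapiro.test(group_a)                      # Shapiro-Wilk test for normality
cor.test(x, y)                             # Pearson; see method = "spearman" or "kendall"
chisq.test(matrix(c(28, 10, 48, 114), nrow = 2))  # Chi-squared test
summary(lm(y ~ x))                         # linear regression
```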
(1) Systematic variation is
- variation caused by random effects in any direction
- variation coming from a factor that introduces bias in one direction only
- always deflating the standard deviation
- all of the above
(2) The y-axis of a histogram represents…
- the frequencies
- the scores
- any metric
- none of the above
(3) The standard error…
- equals the standard deviation divided by the square root of the sample size
- estimates the correlation between two variables
- equals the mean in most cases
- often equals the standard deviation
(4) Pick the variable that is categorical and nominal
- Rank in a 100 m race
- Civil status
- Number of cigarettes smoked per day
- Cholesterol level in blood samples (in mg/ml)
(5) Which of the following is not correct:
- An independent variable is the same as a predictor variable
- A predictor variable can be a factor
- A response variable is the same as a factor
- A factor can have two or more factor levels
- A response variable can be categorical or continuous
(6) Which operation in calculating the variance helps to make sure the latter does not automatically increase with sample size?
- Squaring the sum of the differences between observations and the mean
- Dividing by the degrees of freedom
- Summing up the differences between observations and the mean
- All of the above are correct
(7) In this sample: [1, 3, 11, 15, 19, 20, 22, 24, 30, 31, 39], …
- the median is 20
- the first quartile is 11
- the third quartile is 30
- the interquartile range is 19
- all of the above are correct
- only (1) is correct
(8) A population is known to be chi-squared distributed with 4 degrees of freedom. What is the chance of finding a value greater than 10?
- About 95%
- About 50%
- About 4%
- less than 4%
- Almost zero
(9) You want to find out whether protein content in yogurt (on a continuous scale from 0 to 10) affects the viscosity of the product. What test would you use?
- A two-tailed Wilcoxon test
- A two-tailed t-test
- A correlation analysis
- A regression approach
(10) In a power t-test, if you increase your sample size AND you decrease your standard deviation,…
- you increase your power
- you increase your type I error probability
- you decrease your type II error probability
- you improve your chances to find a difference, should there be one
- All but (2) are correct
(11) After regression analysis, we need to check…
- whether the residuals are continuous
- whether the residuals are homogeneous along the fitted values
- whether the residuals are normal
- all of the above
- only (2) and (3) are correct
(12) Which is true for the correlation coefficient?
- It ranges from -1 to 1
- It is not necessarily associated with a p-value
- It has nothing to do with the t-statistic
- It is the normalised covariance
- All of the above are correct
(13) Just by sketching a standard normal distribution, what is the probability of sampling a value greater than 2 from such a distribution?
- the same as sampling a value lower than -2
- quite low, maybe a few percent
- around 30%
- infinitely small
- (1) and (2) are correct
(14) In regression analysis, we try to…
- minimise the sum of squared differences between the observed values and the mean
- minimise the distances between the fitted values and the observed values
- minimise the sum of squared differences between the fitted values and the observed values
- maximise the sum of squared differences between the fitted values and the observed values
(15) If in a Chi-squared test with 1 degree of freedom, you obtain a Chi-squared value of 1, this means that
- your p-value is extremely low
- your p-value is clearly not significant (i.e. not below 0.05)
- your p-value is above 1
- given the information, it is impossible to make a statement on the p-value
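After answering questions (8), (13) and (15) from a sketch, the tail probabilities can be checked in R. The variable names are chosen for illustration; try the questions first, then run the lines.

```r
# Question (8): chi-squared distribution with 4 df, value greater than 10
p_q8 = pchisq(10, df = 4, lower.tail = FALSE)

# Question (13): standard normal distribution, value greater than 2
p_q13 = pnorm(2, lower.tail = FALSE)
p_q13_lower = pnorm(-2)   # value lower than -2, for comparison

# Question (15): chi-squared value of 1 with 1 df
p_q15 = pchisq(1, df = 1, lower.tail = FALSE)

c(p_q8, p_q13, p_q13_lower, p_q15)
```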
Short answer questions
Describe an experimental setting that would require you to conduct a t-test
Make up the corresponding data frame and conduct the test
Describe a situation in which you need a power test to determine the sample size. Come up with some numbers and perform the test.
Describe a setting in which you best use a linear regression model. Invent the numbers, conduct the complete analysis, including checking for the model assumptions and interpretation of the coefficients for the intercept and the slope.