As a way to help you prepare for the final exam, this lab presents several different tables and visualizations and asks you to interpret them. The correct answers can be found by clicking the tab that says “ANSWER” below each question.
I do not show you the code I used to prepare the figures because you do not need to write or interpret code on the final exam. (Although you may be asked about some of the functions and packages that we have relied on consistently throughout the course.)
In this example, we examine several variables and their relationships. We finish by interpreting a multiple linear regression.
Is it unimodal, bimodal, or multi-modal? What can we learn about how respondents feel about Justin Trudeau?
Do feelings for Trudeau vary by age?

Feelings Towards Trudeau | Age: young | Age: middle-aged | Age: older working | Age: retirees |
---|---|---|---|---|
dislike | 0.31 | 0.35 | 0.43 | 0.45 |
meh | 0.36 | 0.31 | 0.26 | 0.23 |
like | 0.33 | 0.34 | 0.31 | 0.32 |
The values in the top row increase consistently (i.e., as age increases, so too does dislike for Trudeau), but the bottom row does not show the opposite pattern: the share who “like” Trudeau stays between 0.31 and 0.34 in every age group. We therefore cannot be sure that younger people tend to like Trudeau more, and we cannot be confident that there is a clear, negative relationship. There is a moderate correlation, however.
If we were to find a positive relationship, we would see “like” for Trudeau concentrated among older individuals (increasing values as we look across the bottom row) and “dislike” for Trudeau concentrated among young people (decreasing values as we look across the top row). In other words, as age decreases, so too would feelings towards Trudeau.
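One way to check a reading of the cross-tab is to rebuild it and compare the two ends of each row. The sketch below types in the proportions from the table above by hand; the object names `tab` and `change` are made up for this illustration and are not from the lab's own code.

```r
# Proportions copied by hand from the cross-tab above
tab <- matrix(
  c(0.31, 0.35, 0.43, 0.45,
    0.36, 0.31, 0.26, 0.23,
    0.33, 0.34, 0.31, 0.32),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    feeling = c("dislike", "meh", "like"),
    age     = c("young", "middle-aged", "older working", "retirees")
  )
)

# Each age group (column) sums to 1, so these are column proportions
colSums(tab)

# Change in each row from the youngest to the oldest age group
change <- tab[, "retirees"] - tab[, "young"]
round(change, 2)
```

The “dislike” row rises by 0.14 and the “meh” row falls by 0.13, while the “like” row barely moves (-0.01). That is why the top row suggests a pattern but the bottom row does not confirm it.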
For a refresher on how positive, negative, and no relationship look in a cross-tab, see slides 10-12 from Lecture 9.

library(dplyr)    # filter() and the %>% pipe
library(ggplot2)  # plotting functions

# Box plot of feelings towards Trudeau by gender (the "other" category is dropped)
df2 %>%
  filter(gender != "other") %>%
  ggplot(aes(x = gender, y = cps21_lead_rating_23, fill = gender)) +
  geom_boxplot(color = "black") +
  scale_fill_manual(values = c("woman" = "pink", "man" = "blue")) +
  labs(x = "Gender", y = "How do you feel about Justin Trudeau?\n(0 = really dislike, 100 = really like)", title = "Feelings Towards Trudeau") +
  theme_minimal()

# Box plot of feelings towards Trudeau by province
ggplot(df2, aes(x = province, y = cps21_lead_rating_23, fill = province)) +
  geom_boxplot(color = "black") +
  scale_fill_viridis_d() + # generates colours from viridis automatically
  labs(x = "Province", y = "Feelings Towards Trudeau\n(0 = really dislike, 100 = really like)", title = "Do respondents' feelings towards Trudeau vary by province?") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Below, I generate a fake dataset of 25 values. You’ll notice that the dataset is right-skewed: the mean (53.27) is larger than the median (52.00). Most values are concentrated on the lower end, but there are a few values on the higher end creating a long tail.
## Value
## 1 40.16691
## 2 43.67469
## 3 46.56574
## 4 47.19762
## 5 47.22079
## 6 47.63604
## 7 47.77169
## 8 48.84911
## 9 50.35254
## 10 50.55341
## 11 50.64644
## 12 51.79907
## 13 52.00386
## 14 52.30458
## 15 52.48925
## 16 53.50678
## 17 56.12041
## 18 57.79354
## 19 58.57532
## 20 58.93457
## 21 62.86435
## 22 62.94799
## 23 63.54222
## 24 63.74992
## 25 64.56405
## Mean = 53.27324
## Median = 52.00386
Here is a histogram that shows the distribution of this variable (note the long tail to the right).
Let’s look at where Q1 (25th percentile), Q2 (median), and Q3 (75th percentile) lie:
## Q1 = 47.77169
## Q2 (Median) = 52.00386
## Q3 = 58.57532
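The summary numbers above can be reproduced from the 25 printed values. This is a minimal sketch that assumes R's default `quantile()` method was used (the source does not say); the object names `value` and `q` are made up for the illustration.

```r
# The 25 values printed above, typed in by hand
value <- c(
  40.16691, 43.67469, 46.56574, 47.19762, 47.22079,
  47.63604, 47.77169, 48.84911, 50.35254, 50.55341,
  50.64644, 51.79907, 52.00386, 52.30458, 52.48925,
  53.50678, 56.12041, 57.79354, 58.57532, 58.93457,
  62.86435, 62.94799, 63.54222, 63.74992, 64.56405
)

mean(value)    # compare with the mean reported above
median(value)  # the 13th of the 25 sorted values

# Q1, Q2 (median), and Q3 using R's default quantile method
q <- quantile(value, probs = c(0.25, 0.50, 0.75))
q
```

With 25 sorted values, the default method places Q1, the median, and Q3 exactly on the 7th, 13th, and 19th values, so no interpolation between neighbouring values is needed.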
Below, you’ll see all of the different values of the variable called “Value”, followed by a box plot that visualizes its distribution. I’ve included another variable called “Quartile”, which tells you whether each value of “Value” appears inside or outside of the box, based on the percentile category that the data point falls into. The middle 50% of the points fall inside of the box.
## Value Quartile
## 1 40.16691 Not in the box (bottom 25th%)
## 2 43.67469 Not in the box (bottom 25th%)
## 3 46.56574 Not in the box (bottom 25th%)
## 4 47.19762 Not in the box (bottom 25th%)
## 5 47.22079 Not in the box (bottom 25th%)
## 6 47.63604 Not in the box (bottom 25th%)
## 7 47.77169 Not in the box (bottom 25th%)
## 8 48.84911 In the box (25-50th%)
## 9 50.35254 In the box (25-50th%)
## 10 50.55341 In the box (25-50th%)
## 11 50.64644 In the box (25-50th%)
## 12 51.79907 In the box (25-50th%)
## 13 52.00386 In the box (25-50th%)
## 14 52.30458 In the box (50-75th%)
## 15 52.48925 In the box (50-75th%)
## 16 53.50678 In the box (50-75th%)
## 17 56.12041 In the box (50-75th%)
## 18 57.79354 In the box (50-75th%)
## 19 58.57532 In the box (50-75th%)
## 20 58.93457 Not in the box (top 75th%)
## 21 62.86435 Not in the box (top 75th%)
## 22 62.94799 Not in the box (top 75th%)
## 23 63.54222 Not in the box (top 75th%)
## 24 63.74992 Not in the box (top 75th%)
## 25 64.56405 Not in the box (top 75th%)
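The inside/outside labels in the listing above can be sketched with two comparisons against Q1 and Q3. This is an illustration only: the rule that a value exactly equal to Q1 counts as outside the box and a value equal to Q3 counts as inside is inferred from rows 7 and 19 of the listing, and the name `in_box` is made up.

```r
# The 25 values from the listing above, typed in by hand
value <- c(
  40.16691, 43.67469, 46.56574, 47.19762, 47.22079,
  47.63604, 47.77169, 48.84911, 50.35254, 50.55341,
  50.64644, 51.79907, 52.00386, 52.30458, 52.48925,
  53.50678, 56.12041, 57.79354, 58.57532, 58.93457,
  62.86435, 62.94799, 63.54222, 63.74992, 64.56405
)

q1 <- unname(quantile(value, 0.25))
q3 <- unname(quantile(value, 0.75))

# TRUE when a value falls in the box: above Q1 and at or below Q3
in_box <- value > q1 & value <= q3

# How many values are inside vs. outside the box
table(in_box)
```

Under that rule, 12 of the 25 values fall inside the box and 13 fall outside it, matching the listing.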
The box plot reflects how the middle 50% of the data is not evenly distributed. The median is shifted towards the lower end (Q1) of the box, and there are higher values pulling the box upwards.
The median is fairly low (about 52.0): it sits closer to Q1 (about 47.8) than to Q3 (about 58.6). The values between 52.0 and 58.6 that are part of the middle 50% of the data are therefore more spread out than the values between 47.8 and 52.0.
Let’s run a regression with the dependent variable as feelings towards Trudeau (0-100, continuous measure) and three independent variables (gender, age, province).
Can you interpret the results?
##
## ====================================================
## Dependent variable:
## --------------------------------
## Feelings Towards Trudeau (0-100)
## ----------------------------------------------------
## Woman 5.872***
## (0.509)
##
## Non-binary, other -4.206
## (3.341)
##
## Age -0.122***
## (0.016)
##
## British Columbia 11.091***
## (1.026)
##
## Manitoba 8.818***
## (1.460)
##
## New Brunswick 17.095***
## (1.961)
##
## Newfoundland 18.833***
## (2.640)
##
## Nova Scotia 18.074***
## (1.707)
##
## Ontario 12.953***
## (0.816)
##
## PEI 7.264
## (4.910)
##
## Quebec 11.069***
## (0.834)
##
## Constant 37.186***
## (1.194)
##
## ----------------------------------------------------
## Observations 17,156
## R2 0.031
## Adjusted R2 0.031
## Residual Std. Error 32.592 (df = 17144)
## F Statistic 50.105*** (df = 11; 17144)
## ====================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The coefficient for each independent variable is the number NOT listed in brackets (the number in brackets is its standard error). A coefficient tells us how much our dependent variable changes for a one-unit change in the independent variable (or compared to the left-out category).
As age increases by one year, ratings of Trudeau decrease by 0.12 points, holding all other variables constant.
For categorical variables, we compare to the left-out category: compared to men, women rate Trudeau 5.87 points higher, holding all other variables constant. Compared to people who live in Alberta, people who live in Ontario rate Trudeau 12.95 points higher, holding all other variables constant.
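To see how the coefficients combine, here is a small worked example. The respondent (a 40-year-old woman living in Ontario) is hypothetical, and the object names are made up; the coefficients are copied from the regression table above.

```r
# Coefficients copied from the regression table above
b_constant <- 37.186
b_woman    <- 5.872
b_age      <- -0.122
b_ontario  <- 12.953

# Hypothetical respondent: a 40-year-old woman living in Ontario
age <- 40
predicted <- b_constant + b_woman * 1 + b_age * age + b_ontario * 1
predicted

# Same age, but a man living in Alberta (both are left-out categories,
# so only the constant and the age term apply)
baseline <- b_constant + b_age * age
baseline

# The gap is the sum of the "Woman" and "Ontario" coefficients
predicted - baseline
```

The predicted rating is about 51.1 for the first respondent and about 32.3 for the second. The gap of about 18.8 points is just 5.872 + 12.953, because age is held constant.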
See the slides from Lecture 10, as well as the Semra Sevi reading assigned for Week 10, for a refresher on how to interpret coefficients.
The stars indicate statistical significance of our coefficients. We typically choose a statistical significance level before we run our model. In this case, I would say as long as p<0.05 (indicated by two or more stars), I am confident enough to reject the null hypothesis (and interpret the coefficient provided by the model). In other words, if there is less than a 5% chance that we would find a relationship (coefficient) this strong in our sample when there is no relationship in the population, then I am confident enough to interpret the coefficient.
When selecting the level of statistical significance, you interpret all coefficients that have that level or greater (indicated by X number of stars or more). If we select p<0.05, then we interpret coefficients indicated by two or three stars. If we select p<0.1, then we interpret coefficients indicated by one, two, or three stars. If we select p<0.01, then we only interpret coefficients with three stars.
If a coefficient is not statistically significant, then we do NOT interpret the coefficient.
We can see that all of the reported coefficients have three stars, with the exception of PEI and gender-other, which have zero stars. In this table, three stars indicate p<0.01, which is less than 0.05 (i.e. three stars is a higher threshold for statistical significance). We would not interpret the PEI and gender-other coefficients because their effects are not statistically significant (their p-values are larger than 0.05). In other words, we are not confident that the coefficient reported is the “true” value or that it is significantly different from zero. We are not confident we can generalize to the population. Statistical significance is used to help us make generalizations or inferences from our sample (the 17,156 respondents from the 2021 CES data used in this model) to the Canadian population.
Remember, we also looked at a regression output in this format when we were working in R:
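The link between a coefficient, its standard error, and its stars can be sketched as follows. The estimates and standard errors are copied from the regression table above (rounded to three decimals, so the t-values can differ slightly from R's own output); the object names and the choice of four coefficients are just for illustration.

```r
# Estimates and standard errors copied from the regression table above
est <- c(woman = 5.872, other = -4.206, age = -0.122, pei = 7.264)
se  <- c(woman = 0.509, other = 3.341,  age = 0.016,  pei = 4.910)
df_resid <- 17144  # residual degrees of freedom reported in the table

# t-statistic: how many standard errors the estimate is from zero
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df_resid)

# Stars using the cut-offs in the table's note: * p<0.1, ** p<0.05, *** p<0.01
stars <- cut(p_val, breaks = c(0, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ""), include.lowest = TRUE)

data.frame(t = round(t_stat, 2), p = signif(p_val, 2), stars)
```

The “Woman” and “Age” coefficients are many standard errors away from zero and get three stars. The “Non-binary, other” and PEI coefficients are less than two standard errors from zero, so their p-values are above 0.1 and they get no stars.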
summary(mod1)
##
## Call:
## lm(formula = cps21_lead_rating_23 ~ gender + age + province,
## data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.196 -31.746 2.871 27.590 72.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.1862 1.1945 31.132 < 2e-16 ***
## genderwoman 5.8723 0.5091 11.535 < 2e-16 ***
## genderother -4.2061 3.3413 -1.259 0.208
## age -0.1223 0.0158 -7.744 1.02e-14 ***
## provinceBritish Columbia 11.0912 1.0262 10.808 < 2e-16 ***
## provinceManitoba 8.8179 1.4605 6.038 1.59e-09 ***
## provinceNew Brunswick 17.0948 1.9609 8.718 < 2e-16 ***
## provinceNewfoundland 18.8326 2.6403 7.133 1.02e-12 ***
## provinceNova Scotia 18.0740 1.7068 10.589 < 2e-16 ***
## provinceOntario 12.9535 0.8163 15.869 < 2e-16 ***
## provincePEI 7.2645 4.9096 1.480 0.139
## provinceQuebec 11.0686 0.8338 13.275 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.59 on 17144 degrees of freedom
## Multiple R-squared: 0.03115, Adjusted R-squared: 0.03053
## F-statistic: 50.1 on 11 and 17144 DF, p-value: < 2.2e-16
The “Estimate” column shows the coefficients for each independent variable.
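If you want to pull the “Estimate” column out of a model yourself, one option is `coef(summary(...))`. The sketch below uses R's built-in `mtcars` data as a stand-in, because the course data is not reproduced here; the model `fit` is hypothetical and is not the lab's `mod1`.

```r
# Stand-in model on R's built-in mtcars data (NOT the course data)
fit <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# The coefficient table that summary() prints, stored as a matrix
coefs <- coef(summary(fit))
colnames(coefs)

# Keep only the coefficients (the "Estimate" column)
coefs[, "Estimate"]
```

The column names match the headings in the printed output above: “Estimate”, “Std. Error”, “t value”, and “Pr(>|t|)”.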
Party | Frequency (n) | Cumulative frequency |
---|---|---|
Party C | 45 | 45 |
Party B | 39 | NA |
Party A | 34 | 118 |
Party D | 29 | 147 |
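A cumulative frequency is a running total of the frequencies in the order the rows are listed. As a minimal sketch, assuming the NA cell in the table above is meant to be filled in with that running total, `cumsum()` does the arithmetic (the object names are made up):

```r
# Frequencies copied from the table above, in the order shown
freq <- c("Party C" = 45, "Party B" = 39, "Party A" = 34, "Party D" = 29)

# Running total down the rows
cum_freq <- cumsum(freq)
cum_freq
```

The third and fourth running totals (118 and 147) match the values already shown in the table, and the second is 45 + 39 = 84.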
The correct answer would be 0.35.
How did I come to this answer? 0.02 + 0.38 + 0.25 = 0.65 (all other income categories total 65% of observations). 1 - 0.65 = 0.35 (the 60,001-90,000 category is 0.35 or 35% of observations).

Other calculations you may be asked to complete on the final exam:
I will not ask you to…:
Compute the standard deviation by hand. (This does not mean that you shouldn’t be able to interpret the standard deviation or understand what the standard deviation of a variable communicates.)
Compute the interquartile range of a variable.
Complete linear regression by hand. (This does not mean that you shouldn’t be able to interpret a linear regression output. You need to understand why we use linear regression.)