Introduction

As a way to help you prepare for the final exam, this lab presents several different tables and visualizations and asks you to interpret them. The correct answers can be found by clicking the tab that says “ANSWER” below each question.

I do not show you the code I used to prepare the figures because you do not need to write or interpret code on the final exam. (Although you may be asked about some of the functions and packages that we have relied on consistently throughout the course.)

Example 1:

In this example, we examine several variables and their relationships. We finish by interpreting a multiple linear regression.

Distribution of the ‘feelings towards Trudeau’ variable

Question: Is the plot above a histogram, bar plot, scatterplot, box plot, or pie chart? Is this an appropriate plot to visualize the distribution of a continuous variable?

ANSWER: The above plot is a histogram, which is a type of bar plot used to visualize the distribution of a continuous variable, so it is an appropriate choice here.

Question: How would you describe the variable’s distribution?

Is it unimodal, bimodal, or multi-modal? What can we learn about how respondents feel about Justin Trudeau?

ANSWER: Bimodal - two peaks: one at 0 and one around 70. The largest single group of respondents (~4,000) strongly dislikes Trudeau (a rating of 0), although another ~2,100 respondents gave him a rating of about 70/100.

Relationship between age (X) and feelings towards Trudeau (Y)

Scatterplot

Question: How would you describe the relationship between age and feelings towards Justin Trudeau?

ANSWER: There does not appear to be much of a relationship between age and feelings about Trudeau in our sample. There may be a slightly negative relationship (e.g. people below the age of 40 give slightly higher ratings of Trudeau than people older than 40, who give him slightly lower ratings). A Pearson’s r correlation coefficient of -0.07 suggests the variables are weakly correlated (since |r| < 0.3 but not zero) and that the relationship is negative (r is negative). See the Lecture 9 slides for a review of Pearson’s r and relationships between two variables in a scatterplot. Slide 14 in Lecture 9 provides a decision tree for describing the relationship between two continuous variables.
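As a rough illustration of how a correlation coefficient is read, the sketch below computes Pearson’s r with base R’s cor() and labels its direction and strength. The data are made-up toy values (not the CES survey), and describe_r() is a hypothetical helper that only encodes the rule used in the answer above (|r| < 0.3 counts as weak); the decision tree in the lecture slides may draw finer distinctions.

```r
# Hypothetical toy data (NOT the CES survey): six made-up respondents.
age    <- c(22, 35, 41, 58, 63, 70)
rating <- c(60, 40, 70, 30, 55, 35)

# cor() computes Pearson's r by default (method = "pearson").
r <- cor(age, rating)

# Illustrative helper (an assumption, not from the course materials):
# reports the direction of r and calls it "weak" when |r| < 0.3.
describe_r <- function(r) {
  stopifnot(is.numeric(r), length(r) == 1, abs(r) <= 1)
  if (r == 0) return("no linear relationship")
  direction <- if (r > 0) "positive" else "negative"
  strength  <- if (abs(r) < 0.3) "weak" else "moderate or strong"
  paste(strength, direction)
}

describe_r(r)      # label for the toy data
describe_r(-0.07)  # the value reported above for age and Trudeau ratings
```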

Cross-tab

Do feelings for Trudeau vary by age?

Feelings Towards Trudeau   Age
                           young   middle-aged   older working   retirees
dislike                    0.31    0.35          0.43            0.45
meh                        0.36    0.31          0.26            0.23
like                       0.33    0.34          0.31            0.32

Question: Is there a positive relationship, negative relationship, or no relationship between age and feelings towards Trudeau?

ANSWER:

While the values in the top row consistently increase (i.e. as age increases, so too does dislike for Trudeau), the bottom row does not show the opposite pattern: “like” stays roughly flat across the age groups (between 0.31 and 0.34). We therefore cannot be sure that younger people tend to like Trudeau more, and we cannot be confident that there is a clear, negative relationship. At most, there is a weak association, driven by the “dislike” row.

If we were to find a positive relationship, “like” for Trudeau would be concentrated among older individuals (increasing values as we look across the bottom row), and “dislike” for Trudeau would be concentrated among young people (decreasing values as we look across the top row). In other words, as age decreases, so too would feelings towards Trudeau.

For a refresher on how positive, negative, and no relationship looks in a cross-tab, see slides 10-12 from Lecture 9.
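To make the row-by-row reading of the cross-tab concrete, here is a minimal sketch that re-enters the proportions from the table above as a matrix, checks that each age group sums to 1, and compares the youngest and oldest groups. The object names (xtab, change) are my own, and the numbers are typed in by hand rather than computed from the survey data.

```r
# Column proportions copied by hand from the cross-tab above.
xtab <- matrix(
  c(0.31, 0.35, 0.43, 0.45,
    0.36, 0.31, 0.26, 0.23,
    0.33, 0.34, 0.31, 0.32),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    feelings = c("dislike", "meh", "like"),
    age      = c("young", "middle-aged", "older working", "retirees")
  )
)

# Each age group (column) sums to 1, so these are column proportions.
colSums(xtab)

# Change from the youngest to the oldest group, one value per row.
change <- xtab[, "retirees"] - xtab[, "young"]
round(change, 2)  # dislike rises, meh falls, like barely moves
```

With the raw data, a table like this is usually built with prop.table(table(...), margin = 2); the column names to pass in depend on how the variables are coded in your data frame.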

Relationship between gender and feelings towards Justin Trudeau

df2 %>%
  filter(gender != "other") %>%
  ggplot(aes(x = gender, y = cps21_lead_rating_23, fill = gender)) +
  geom_boxplot(color = "black") +
  scale_fill_manual(values = c("woman" = "pink", "man" = "blue")) +
  labs(x = "Gender", y = "How do you feel about Justin Trudeau?\n(0 = really dislike, 100 = really like)", title="Feelings Towards Trudeau") +
  theme_minimal()

Question: What is the relationship between gender and feelings towards Trudeau?

ANSWER: There is more variability in men’s ratings of Trudeau (indicated by a taller box) and they tend to dislike Trudeau more than women (based on the median ratings of Trudeau for men and women and the extension of the “man” box to the lower end of the rating scale). (I excluded the “other” gender to provide a clear example of an IV with two categories - we will add the other gender category back in when we run the regression below.)

Province and Feelings Towards Trudeau

ggplot(df2, aes(x = province, y = cps21_lead_rating_23, fill = province)) +
  geom_boxplot(color = "black") +
  scale_fill_viridis_d() +  # generates colours from viridis automatically
  labs(x = "Province", y = "Feelings Towards Trudeau\n(0 = really dislike, 100 = really like)", title="Do respondents' feelings towards Trudeau vary by province?") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 

Question: What is the relationship between where someone lives and their feelings towards Trudeau?

ANSWER: There appears to be some variation in Trudeau’s popularity across the provinces. It looks like people in Alberta especially tend to hold more negative feelings towards Trudeau. The median rating is especially low compared to other provinces (around 23). New Brunswick, Newfoundland, and Nova Scotia appear to hold more positive feelings towards Trudeau than the other provinces (e.g. the middle 50% of responses are within the 25-75 rating range). R Lesson Class 7 includes the above graphic, which provides information on how to interpret box plots.

Digression: why isn’t the median in the centre of the box?

Below, I generate a fake dataset of 25 values. You’ll notice that the dataset is skewed: the mean is larger than the median. Most values are concentrated on the lower end, but there are a few values on the higher end creating a long tail.

##       Value
## 1  40.16691
## 2  43.67469
## 3  46.56574
## 4  47.19762
## 5  47.22079
## 6  47.63604
## 7  47.77169
## 8  48.84911
## 9  50.35254
## 10 50.55341
## 11 50.64644
## 12 51.79907
## 13 52.00386
## 14 52.30458
## 15 52.48925
## 16 53.50678
## 17 56.12041
## 18 57.79354
## 19 58.57532
## 20 58.93457
## 21 62.86435
## 22 62.94799
## 23 63.54222
## 24 63.74992
## 25 64.56405
## Mean = 53.27324
## Median = 52.00386

Here is a histogram that shows the distribution of this variable (note the long tail to the right).

Let’s look at where Q1 (25th percentile), Q2 (median), and Q3 (75th percentile) lie:

## Q1 = 47.77169
## Q2 (Median) = 52.00386
## Q3 = 58.57532

Below, you’ll see all of the different values of the variable called “Value”. We create a boxplot below to visualize the distribution of this variable. I’ve included another variable called “Quartile” which tells you which values of “Value” appear inside or outside of the box, based on the percentile category that each data point falls into. The middle 50% of the points fall inside of the box.

##       Value                      Quartile
## 1  40.16691 Not in the box (bottom 25th%)
## 2  43.67469 Not in the box (bottom 25th%)
## 3  46.56574 Not in the box (bottom 25th%)
## 4  47.19762 Not in the box (bottom 25th%)
## 5  47.22079 Not in the box (bottom 25th%)
## 6  47.63604 Not in the box (bottom 25th%)
## 7  47.77169 Not in the box (bottom 25th%)
## 8  48.84911         In the box (25-50th%)
## 9  50.35254         In the box (25-50th%)
## 10 50.55341         In the box (25-50th%)
## 11 50.64644         In the box (25-50th%)
## 12 51.79907         In the box (25-50th%)
## 13 52.00386         In the box (25-50th%)
## 14 52.30458         In the box (50-75th%)
## 15 52.48925         In the box (50-75th%)
## 16 53.50678         In the box (50-75th%)
## 17 56.12041         In the box (50-75th%)
## 18 57.79354         In the box (50-75th%)
## 19 58.57532         In the box (50-75th%)
## 20 58.93457    Not in the box (top 75th%)
## 21 62.86435    Not in the box (top 75th%)
## 22 62.94799    Not in the box (top 75th%)
## 23 63.54222    Not in the box (top 75th%)
## 24 63.74992    Not in the box (top 75th%)
## 25 64.56405    Not in the box (top 75th%)

The box plot reflects how the middle 50% of the data is not evenly distributed. The median is shifted towards the lower end (Q1) of the box, and there are higher values pulling the box upwards.

While the median is fairly low (about 52.0), and quite a few values sit just below it, there are several values between 52.0 and 58.6 (the median and Q3) that are a part of the middle 50% of the data.
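The summary numbers in this digression can be reproduced from the 25 listed values. The sketch below is one way to do it in base R; it assumes the default quantile() method (type 7), which matches the Q1 and Q3 reported above, and the object names are my own.

```r
# The 25 values from the listing above, typed in by hand.
values <- c(40.16691, 43.67469, 46.56574, 47.19762, 47.22079,
            47.63604, 47.77169, 48.84911, 50.35254, 50.55341,
            50.64644, 51.79907, 52.00386, 52.30458, 52.48925,
            53.50678, 56.12041, 57.79354, 58.57532, 58.93457,
            62.86435, 62.94799, 63.54222, 63.74992, 64.56405)

mean(values)    # larger than the median: a sign of right skew
median(values)

# Q1, median, and Q3 using R's default quantile method (type = 7).
q <- quantile(values, probs = c(0.25, 0.5, 0.75))
q

# Height of the lower and upper halves of the box.
lower_half <- q[["50%"]] - q[["25%"]]  # median minus Q1
upper_half <- q[["75%"]] - q[["50%"]]  # Q3 minus median
c(lower = lower_half, upper = upper_half)
```

The upper half of the box is taller than the lower half, which is why the median line sits closer to Q1 than to Q3.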

Regression

Let’s run a regression with the dependent variable as feelings towards Trudeau (0-100, continuous measure) and three independent variables (gender, age, province).

Can you interpret the results?

## 
## ====================================================
##                           Dependent variable:       
##                     --------------------------------
##                     Feelings Towards Trudeau (0-100)
## ----------------------------------------------------
## Woman                           5.872***            
##                                 (0.509)             
##                                                     
## Non-binary, other                -4.206             
##                                 (3.341)             
##                                                     
## Age                            -0.122***            
##                                 (0.016)             
##                                                     
## British Columbia               11.091***            
##                                 (1.026)             
##                                                     
## Manitoba                        8.818***            
##                                 (1.460)             
##                                                     
## New Brunswick                  17.095***            
##                                 (1.961)             
##                                                     
## Newfoundland                   18.833***            
##                                 (2.640)             
##                                                     
## Nova Scotia                    18.074***            
##                                 (1.707)             
##                                                     
## Ontario                        12.953***            
##                                 (0.816)             
##                                                     
## PEI                              7.264              
##                                 (4.910)             
##                                                     
## Quebec                         11.069***            
##                                 (0.834)             
##                                                     
## Constant                       37.186***            
##                                 (1.194)             
##                                                     
## ----------------------------------------------------
## Observations                     17,156             
## R2                               0.031              
## Adjusted R2                      0.031              
## Residual Std. Error       32.592 (df = 17144)       
## F Statistic            50.105*** (df = 11; 17144)   
## ====================================================
## Note:                    *p<0.1; **p<0.05; ***p<0.01

The coefficient for each independent variable is the number NOT listed in brackets. A coefficient tells us, for a one unit change in the independent variable (or compared to the left-out category), how much does our dependent variable change?

Question: Interpret at least two coefficients in the model output above.

ANSWER:

As age increases by one year, ratings of Trudeau decrease by 0.12 points, holding all other variables constant.

For categorical variables, we compare to the left-out category: compared to men, women rate Trudeau 5.87 points higher, holding all other variables constant. Compared to people who live in Alberta, people who live in Ontario rate Trudeau 12.95 points higher, holding all other variables constant.

See the slides from Lecture 10, as well as the Semra Sevi reading assigned for Week 10 for a refresher on how to interpret coefficients.
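One way to check your reading of the coefficients is to plug them into the regression equation for a specific person. The sketch below does this by hand for two illustrative cases, using estimates copied from the output; predict_rating() is a hypothetical helper written for this example, and only the woman, age, and Ontario terms are included, so every other gender and province indicator is treated as 0.

```r
# Estimates copied by hand from the regression output. Men and Alberta are
# the left-out categories, as described in the answer above.
b <- c(intercept = 37.1862, woman = 5.8723, age = -0.1223, ontario = 12.9535)

# Hypothetical helper (not from the course): predicted rating for one person.
# TRUE/FALSE are treated as 1/0, exactly like dummy variables.
predict_rating <- function(age, woman = FALSE, ontario = FALSE) {
  unname(b["intercept"] + b["woman"] * woman +
           b["age"] * age + b["ontario"] * ontario)
}

man_ab   <- predict_rating(age = 40)                               # man, Alberta
woman_on <- predict_rating(age = 40, woman = TRUE, ontario = TRUE) # woman, Ontario

man_ab
woman_on
woman_on - man_ab  # equals the woman coefficient plus the Ontario coefficient
```

Because age is held at 40 in both cases, the gap between the two predictions is just the sum of the two categorical coefficients, which is what “holding all other variables constant” means in practice.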

Review of statistical significance:

The stars indicate statistical significance of our coefficients. We typically choose a statistical significance level before we run our model. In this case, I would say as long as p<0.05 (indicated by two or more stars), I am confident enough to reject the null hypothesis (and interpret the coefficient provided by the model). In other words, if there is a less than 5% chance that we would see a relationship (coefficient) this strong in our sample when there is no relationship in the population, then I am confident enough to interpret the coefficient.

When selecting the level of statistical significance, you interpret all coefficients that have that level or greater (indicated by X number of stars or more). If we select p<0.05, then we interpret coefficients indicated by two or three stars. For p<0.1, then we interpret coefficients indicated by one, two, or three stars. If we selected p<0.01, then we only interpret coefficients with three stars.

If a coefficient is not statistically significant, then we do NOT interpret the coefficient.

We can see that all the reported coefficients have three stars, with the exception of PEI and gender-other, which have zero stars. In this table, three stars indicate a statistical significance level of p<0.01 (see the note at the bottom of the table), which is less than 0.05 (i.e. three stars is a stricter threshold for statistical significance). We would not interpret the PEI and gender-other coefficients because their effects are not statistically significant (the p value is larger than 0.05). In other words, we are not confident that the coefficient reported is the “true” value or that it is significantly different from zero. We are not confident we can generalize to the population. Statistical significance is used to help us make generalizations or inferences from our sample (2021 CES data; 17,156 respondents in this model) to the Canadian population.
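To see where the stars come from, the sketch below recomputes t statistics and two-sided p values for four of the coefficients from their estimates and standard errors, then assigns stars using the cut-offs in the table note (*p<0.1; **p<0.05; ***p<0.01). The helper sig_stars() is my own illustrative function, and because the inputs are rounded, the results can differ slightly in the last digits from the values R prints.

```r
# Estimates and standard errors copied by hand from the regression output.
est <- c(woman = 5.8723, other = -4.2061, age = -0.1223, PEI = 7.2645)
se  <- c(woman = 0.5091, other = 3.3413,  age = 0.0158,  PEI = 4.9096)
df_resid <- 17144  # residual degrees of freedom reported in the output

# t statistic = estimate / standard error; two-sided p value from the t distribution.
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df_resid)

# Illustrative helper (an assumption, not a package function):
# stars under the cut-offs *p<0.1; **p<0.05; ***p<0.01.
sig_stars <- function(p) {
  ifelse(p < 0.01, "***", ifelse(p < 0.05, "**", ifelse(p < 0.1, "*", "")))
}

data.frame(estimate = est,
           t        = round(t_stat, 3),
           p        = signif(p_val, 3),
           stars    = sig_stars(p_val))
```

The “other” and PEI rows come out with p values above 0.1 and no stars, which is why those two coefficients are not interpreted.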

Remember, we also looked at a regression output in this format when we were working in R:

summary(mod1)
## 
## Call:
## lm(formula = cps21_lead_rating_23 ~ gender + age + province, 
##     data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.196 -31.746   2.871  27.590  72.600 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               37.1862     1.1945  31.132  < 2e-16 ***
## genderwoman                5.8723     0.5091  11.535  < 2e-16 ***
## genderother               -4.2061     3.3413  -1.259    0.208    
## age                       -0.1223     0.0158  -7.744 1.02e-14 ***
## provinceBritish Columbia  11.0912     1.0262  10.808  < 2e-16 ***
## provinceManitoba           8.8179     1.4605   6.038 1.59e-09 ***
## provinceNew Brunswick     17.0948     1.9609   8.718  < 2e-16 ***
## provinceNewfoundland      18.8326     2.6403   7.133 1.02e-12 ***
## provinceNova Scotia       18.0740     1.7068  10.589  < 2e-16 ***
## provinceOntario           12.9535     0.8163  15.869  < 2e-16 ***
## provincePEI                7.2645     4.9096   1.480    0.139    
## provinceQuebec            11.0686     0.8338  13.275  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32.59 on 17144 degrees of freedom
## Multiple R-squared:  0.03115,    Adjusted R-squared:  0.03053 
## F-statistic:  50.1 on 11 and 17144 DF,  p-value: < 2.2e-16

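The “Estimate” column shows the coefficients for each independent variable. Note that the stars in this output use different cut-offs from the table above (see the “Signif. codes” line): here three stars mean p<0.001, two stars mean p<0.01, and one star means p<0.05.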

Example 2: Calculations you might be asked to do on the exam

Party     Frequency (n)   Cumulative frequency
Party C   45              45
Party B   39              NA
Party A   34              118
Party D   29              147

Question: What is the missing value in the table above?

ANSWER: The missing value is 84. If we add 45 (frequency of Party C - first row) + 39 (frequency of Party B - second row) = 84.
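If you want to check a cumulative frequency column yourself, base R’s cumsum() does the running total. This is a small sketch using the frequencies from the table above; the object names are my own.

```r
# Frequencies from the table above, in the order the rows appear.
freq <- c("Party C" = 45, "Party B" = 39, "Party A" = 34, "Party D" = 29)

# Running total: each entry adds the current frequency to everything above it.
cum_freq <- cumsum(freq)
cum_freq

cum_freq[["Party B"]]  # the value that was missing from the table
```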

Question: What is the missing value in the table above?

ANSWER:

The correct answer would be 0.35.

How did I come to this answer? 0.02 + 0.38 + 0.25 = 0.65 (all other income categories total 65% of observations). 1 - 0.65 = 0.35 (the 60,001-90,000 category is 0.35 or 35% of observations).
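The same arithmetic can be written as a short check in R. The three known proportions are taken from the answer above; the object names are my own.

```r
# Proportions of the other income categories, as given in the answer above.
known <- c(0.02, 0.38, 0.25)

# All categories must sum to 1, so the missing share is whatever is left over.
missing_share <- 1 - sum(known)
round(missing_share, 2)
```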

Note:

Other calculations you may be asked to complete on the final exam:

  • Mean, median, or mode of a variable.
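As a reminder of what these three measures are, here is a minimal sketch with a made-up variable. Base R has mean() and median(), but no built-in function for the statistical mode (mode() reports how an object is stored), so stat_mode() below is a hypothetical helper written for this example.

```r
# Hypothetical values for illustration only.
x <- c(2, 3, 3, 5, 7)

mean(x)    # sum of the values divided by how many there are
median(x)  # middle value once the data are sorted

# Illustrative helper: the most frequent value(s) of a numeric variable.
stat_mode <- function(v) {
  counts <- table(v)                                 # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])   # value(s) with the highest count
}

stat_mode(x)
```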

I will not ask you to…:

  • Compute the standard deviation by hand. (This does not mean that you shouldn’t be able to interpret the standard deviation or understand what the standard deviation of a variable communicates.)

  • Compute the interquartile range of a variable.

  • Complete linear regression by hand. (This does not mean that you shouldn’t be able to interpret a linear regression output. You need to understand why we use linear regression.)