Introduction

Sections

In this lesson, we focus on how best to describe or summarize a single variable by examining:

  • Distribution of a variable
  • Measures of central tendency
  • Measures of dispersion

Our discussion of how to best summarize or describe a single variable will extend into the next lab when we focus on visualizing a variable.

Getting Started

Before we get going, remember the “best practices” for starting a new R Script/R Markdown document:

  • Create a new file folder for this lesson.
  • Open a new R script (or R Markdown).
  • Set your working directory so R is linked to that new file folder. You can do this with Session -> Set Working Directory -> Choose Directory. Then copy and paste the code from your Console (bottom left panel) into your R Script.
  • Save the R script (or R Markdown).
  • Load any packages that you think we’ll need for today’s lab (e.g., rio and tidyverse).
library(rio)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Import Data

Let’s start by importing the data that we will be using for today’s lab.

We are going to use the same data that we used for last week’s lab. There are a couple of ways that you could open the data from the folder you set up last week (without having to move the data to our current working directory).

Probably the easiest way is to specify the exact pathway (tell R exactly where the “DC 2021 v1.dta” file is saved on your computer):

df <- import("/Users/shanayavanhooren/Documents/POL3325G/Lectures/Lecture 4/DC 2021 v1.dta") # here I specify the path 

Other options to import the data into R:

  • Change your working directory to the folder where the data is saved and then use import("data file name") like we usually do (see the sketch after this list).
  • Duplicate and then move the data file to your current working directory.
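For example, the first option might look like the sketch below (the folder path is just a placeholder; point it at the folder where you saved the data last week):

setwd("/path/to/last-weeks-folder") # placeholder path: replace with your own folder
df <- import("DC 2021 v1.dta")      # rio's import() detects the .dta format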

Wrangle the Data

Let’s start by doing the basic data wrangling that we need in order for the data to be in a usable format.

First, consider which variables you are interested in exploring and drop any variables that we do not need. We will use select() to narrow down the variables in our dataset.

Second, let’s rename variables to make their variable names more intuitive and/or shorter (this is a best practice). Take a look at the dataset codebook to get a better idea of what the variables mean.

Below, we are going to do both of these steps in one code chunk thanks to the pipe operator %>%, which tells R to execute the select() function and THEN execute the rename() function.

Pro tip: Remember, spelling, capitalization, and punctuation matter in R! If you try to rename the variable ResponseId using the rename() function but write the old variable name as ResponseID (instead of ResponseId), R won’t know what variable you are referring to. This is because you wrote an upper-case ‘D’ instead of a lower-case ‘d’. The variable exists in the dataset as ResponseId, and R can only find that exact spelling of the variable name.

df2 <- df %>% 
  select(ResponseId, age_in_years, dc21_province, # only keep variables we care about
         dc21_education, dc21_vote_choice, 
         dc21_disability, dc21_disabilitytype, dc21_income_category) %>%
  rename(ID = ResponseId, # assign new names to our variables
         age = age_in_years, 
         province = dc21_province,
         education = dc21_education,
         disability = dc21_disability,
         disability_type = dc21_disabilitytype,
         fed_votechoice = dc21_vote_choice,
         income = dc21_income_category)

head(df2, 4)
##                  ID age province education fed_votechoice disability
## 1 R_0008EntEqvREVod  41        2        11             NA          1
## 2 R_009hnGPoP52SECd  47        4         6              2          1
## 3 R_00asvSPzZbLnazT  74        9         8              2          2
## 4 R_00dQgcddqPC9x97  67        9         5              3          2
##   disability_type income
## 1               3      4
## 2               2      3
## 3              NA      4
## 4              NA      3

You’ll notice that there are some ‘NA’ values. As noted in the Llaudet and Imai reading, missing values are represented as NA in R. We will talk more about NA values as we start doing calculations and creating visualizations of our variables.

Distribution

  • How many cases (observations) take on each unique value or category (of our variable)?

Maybe the first thing that we want to know about this data is how many respondents are there from each province? How many cases or observations take on each unique score of the province variable?

Let’s take a look at the variable class/type before we begin.

class(df2$province)
## [1] "numeric"

It appears to be coded as a numeric variable. When we look at the codebook, we see why this is the case. Numbers are used as labels to represent the different provinces. We should re-code the variable so that it is a character (or factor) variable since we know that the level of measurement is nominal (categorical without any order to the categories).

Let’s recode our variable and then try again.

Variable Recoding…

df2 <- df2 %>% 
  mutate(prov_name = case_match(province,
    1 ~ "AB", 
    2 ~ "BC",
    3 ~ "MB",
    4 ~ "NB", 
    5 ~ "NL",
    7 ~ "NS",
    9 ~ "ON",
    10 ~ "PEI",
    11 ~ "QE",
    12 ~ "SK",
    .default = "TR" # here we specify any other values to be coded as "TR" for territory.
  ))

class(df2$prov_name)
## [1] "character"

If we opened and scrolled through our dataset, we might get an initial sense of which provinces are more prevalent and which are less prevalent, but it would not be a very accurate picture.

We would probably want to create a frequency table instead.

Frequency Tables

A frequency table shows us the different values or categories a variable can take and the number of times each appears.

We can use the table() function from base R to easily create a frequency table.

Reminder: $ is the character we use to access an element inside of an object. In this case, we are trying to access a variable (element) inside of our dataframe called df2 (object).

table(df2$province)
## 
##    1    2    3    4    5    7    9   10   11   12 
##  868  873  290  144  123  195 3176   36 1996  248

This is not very intuitive since our province categories were coded as numbers in the original data. It would be better to use our re-coded province variable so that the different values of the variable make sense. (This just highlights the importance of data wrangling!)

Let’s try again to make a frequency table.

table(df2$prov_name)
## 
##   AB   BC   MB   NB   NL   NS   ON  PEI   QE   SK 
##  868  873  290  144  123  195 3176   36 1996  248

Using tidyverse (specifically, dplyr), we can use the count() function to create a frequency table according to tidy principles. We use the assignment operator to save this as a new object (it is now another dataframe in our global environment called “prov_freq_df”).

prov_freq_df <- df2 %>% 
  count(prov_name)

print(prov_freq_df)
##    prov_name    n
## 1         AB  868
## 2         BC  873
## 3         MB  290
## 4         NB  144
## 5         NL  123
## 6         NS  195
## 7         ON 3176
## 8        PEI   36
## 9         QE 1996
## 10        SK  248

We have the most observations in our sample (dataset) from Ontario, and then Quebec. The fewest respondents were from PEI. These raw numbers are not that informative, however. What does it mean to have 3,176 respondents from Ontario? We need to know the total number of observations in our dataset to make sense of this information, but even then, it is not very informative. (Remember, you can find the total number of observations in the data in the global environment.)
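You can also get the total directly in code with base R’s nrow(), which returns the number of rows (observations) in a dataframe:

nrow(df2) # total number of observations (rows) in our dataset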

Tables of Proportions

A table of proportions shows the proportion of observations that take each value or category of a variable. You can think of a proportion as expressing what share of observations belong to one category of our variable in relation to the whole. In the code below, we calculate the proportion by first calculating n for each category using count() (more specifically, “n” shows the frequencies of the province categories). Then, we calculate the proportions for each category as n divided by the total observations in our dataset (proportion = n/sum(n)).

prov_prop_df <- df2 %>%
  count(prov_name) %>%
  mutate(proportion = n/sum(n)) %>%
  mutate(prop = round(proportion, 2)) # create a rounded version of the proportion variable 

print(prov_prop_df)
##    prov_name    n  proportion prop
## 1         AB  868 0.109196125 0.11
## 2         BC  873 0.109825135 0.11
## 3         MB  290 0.036482576 0.04
## 4         NB  144 0.018115486 0.02
## 5         NL  123 0.015473644 0.02
## 6         NS  195 0.024531388 0.02
## 7         ON 3176 0.399547113 0.40
## 8        PEI   36 0.004528872 0.00
## 9         QE 1996 0.251100767 0.25
## 10        SK  248 0.031198893 0.03

The proportions should sum to 1.

To present a proportion as a percentage, we multiply the proportion by 100 (move the decimal two places to the right). It is more common to present percentages than proportions since percentages are fairly straightforward to understand. Let’s use mutate() to create a new percentage column in our dataset by multiplying the “prop” variable (column) by 100.

prov_prop_df <- prov_prop_df %>%
  mutate(percentage = prop * 100) # the * symbol means "multiply" (prop x 100)

print(prov_prop_df)
##    prov_name    n  proportion prop percentage
## 1         AB  868 0.109196125 0.11         11
## 2         BC  873 0.109825135 0.11         11
## 3         MB  290 0.036482576 0.04          4
## 4         NB  144 0.018115486 0.02          2
## 5         NL  123 0.015473644 0.02          2
## 6         NS  195 0.024531388 0.02          2
## 7         ON 3176 0.399547113 0.40         40
## 8        PEI   36 0.004528872 0.00          0
## 9         QE 1996 0.251100767 0.25         25
## 10        SK  248 0.031198893 0.03          3

This is a lot more informative. For example, we can communicate to our audience that 11% of respondents to the Democracy Checkup 2021 survey were from Alberta, while 40% of respondents were from Ontario. This time, the percentage column should sum to 100 (not 1). You could “check” this by running sum(prov_prop_df$percentage) - this adds together all of the values of the percentage column (or the elements in the percentage vector).
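Here is that check (because we rounded the prop column before multiplying by 100, the total can land slightly above or below 100):

sum(prov_prop_df$percentage) # adds up the percentage column; should be (approximately) 100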

Digression: NA values

You should check how many NA values exist in your dataset as a whole, but also in the specific variable or variables that you’re interested in.

Below, I create a fake dataset with four NA values. I show you several lines of code that you can use to determine the number of NAs and where they’re located in your dataset.

fake_data <- data.frame(
  id = c(1,2,3,4,5,6),
  gender = c("M", "F", "M", "F", "Other", "F"),
  favourite_team = c("Bills", "Eagles", "Chiefs", NA, "Bills", "Chiefs"),
  income = c(NA, 1, 2, 4, NA, NA)
)

head(fake_data)
##   id gender favourite_team income
## 1  1      M          Bills     NA
## 2  2      F         Eagles      1
## 3  3      M         Chiefs      2
## 4  4      F           <NA>      4
## 5  5  Other          Bills     NA
## 6  6      F         Chiefs     NA
# return boolean (TRUE/FALSE) answer to the question:
# "which values in the object are NAs"? : 
is.na(fake_data) 
##         id gender favourite_team income
## [1,] FALSE  FALSE          FALSE   TRUE
## [2,] FALSE  FALSE          FALSE  FALSE
## [3,] FALSE  FALSE          FALSE  FALSE
## [4,] FALSE  FALSE           TRUE  FALSE
## [5,] FALSE  FALSE          FALSE   TRUE
## [6,] FALSE  FALSE          FALSE   TRUE
# return total number of NAs in the dataset: 
sum(is.na(fake_data)) 
## [1] 4
# return rows where at least one column has an NA value: 
fake_data %>%  
  filter(if_any(everything(), is.na))
##   id gender favourite_team income
## 1  1      M          Bills     NA
## 2  4      F           <NA>      4
## 3  5  Other          Bills     NA
## 4  6      F         Chiefs     NA
# return rows where a specific column (in this case, income) has NA: 
fake_data %>% 
  filter(is.na(income))
##   id gender favourite_team income
## 1  1      M          Bills     NA
## 2  5  Other          Bills     NA
## 3  6      F         Chiefs     NA

Frequency Tables for Continuous Variables?

Let’s walk through another example, except this time we are going to try looking at a continuous variable (age).

Spoiler alert: The purpose of showing you a table of frequencies and proportions for a continuous variable (with a lot of unique values) is to demonstrate why we usually do not present such a table and instead turn to visualizing the variable (something we will discuss in the next lab).

Let’s start by checking the variable type to make sure it matches our understanding of how the variable is measured (see R Lesson titled “Class 3” section titled “Variable Classes” for a reminder of how variable type/class aligns with different levels of measurement).

class(df2$age)
## [1] "numeric"
df2 %>% 
  filter(is.na(age)) 
## [1] ID              age             province        education      
## [5] fed_votechoice  disability      disability_type income         
## [9] prov_name      
## <0 rows> (or 0-length row.names)
table(df2$age)
## 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38 
##  95  72 123  96  78  91  98 103 118 121 123 115 134 104 119 119 124 197 131 139 
##  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58 
## 148 130 171 143 122 139 151 161 159 134 142 150 174 158 176 173 120 117 140 137 
##  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78 
## 139 149 151 132 146 145 143 163 153 145 128 129 146 128 113  98 103  63  72  55 
##  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  95  96 101 
##  50  34  25  22  19  10   5   7   7   6   4   4   1   3   1   1   1   3

Yikes! This is not very informative. Would it be better to use tidyverse (as opposed to table() in Base R) to create a frequency table?

df2 %>%
  count(age) %>%
  mutate(proportion = n/sum(n)) %>% # sum(n) = total number obs. in our data
  mutate(prop = round(proportion, 2)) %>%
  mutate(percentage = prop * 100)
##    age   n   proportion prop percentage
## 1   19  95 0.0119511888 0.01          1
## 2   20  72 0.0090577431 0.01          1
## 3   21 123 0.0154736445 0.02          2
## 4   22  96 0.0120769908 0.01          1
## 5   23  78 0.0098125550 0.01          1
## 6   24  91 0.0114479809 0.01          1
## 7   25  98 0.0123285948 0.01          1
## 8   26 103 0.0129576047 0.01          1
## 9   27 118 0.0148446345 0.01          1
## 10  28 121 0.0152220405 0.02          2
## 11  29 123 0.0154736445 0.02          2
## 12  30 115 0.0144672286 0.01          1
## 13  31 134 0.0168574663 0.02          2
## 14  32 104 0.0130834067 0.01          1
## 15  33 119 0.0149704365 0.01          1
## 16  34 119 0.0149704365 0.01          1
## 17  35 124 0.0155994465 0.02          2
## 18  36 197 0.0247829916 0.02          2
## 19  37 131 0.0164800604 0.02          2
## 20  38 139 0.0174864763 0.02          2
## 21  39 148 0.0186186942 0.02          2
## 22  40 130 0.0163542584 0.02          2
## 23  41 171 0.0215121399 0.02          2
## 24  42 143 0.0179896842 0.02          2
## 25  43 122 0.0153478425 0.02          2
## 26  44 139 0.0174864763 0.02          2
## 27  45 151 0.0189961001 0.02          2
## 28  46 161 0.0202541200 0.02          2
## 29  47 159 0.0200025160 0.02          2
## 30  48 134 0.0168574663 0.02          2
## 31  49 142 0.0178638822 0.02          2
## 32  50 150 0.0188702982 0.02          2
## 33  51 174 0.0218895459 0.02          2
## 34  52 158 0.0198767141 0.02          2
## 35  53 176 0.0221411498 0.02          2
## 36  54 173 0.0217637439 0.02          2
## 37  55 120 0.0150962385 0.02          2
## 38  56 117 0.0147188326 0.01          1
## 39  57 140 0.0176122783 0.02          2
## 40  58 137 0.0172348723 0.02          2
## 41  59 139 0.0174864763 0.02          2
## 42  60 149 0.0187444962 0.02          2
## 43  61 151 0.0189961001 0.02          2
## 44  62 132 0.0166058624 0.02          2
## 45  63 146 0.0183670902 0.02          2
## 46  64 145 0.0182412882 0.02          2
## 47  65 143 0.0179896842 0.02          2
## 48  66 163 0.0205057240 0.02          2
## 49  67 153 0.0192477041 0.02          2
## 50  68 145 0.0182412882 0.02          2
## 51  69 128 0.0161026544 0.02          2
## 52  70 129 0.0162284564 0.02          2
## 53  71 146 0.0183670902 0.02          2
## 54  72 128 0.0161026544 0.02          2
## 55  73 113 0.0142156246 0.01          1
## 56  74  98 0.0123285948 0.01          1
## 57  75 103 0.0129576047 0.01          1
## 58  76  63 0.0079255252 0.01          1
## 59  77  72 0.0090577431 0.01          1
## 60  78  55 0.0069191093 0.01          1
## 61  79  50 0.0062900994 0.01          1
## 62  80  34 0.0042772676 0.00          0
## 63  81  25 0.0031450497 0.00          0
## 64  82  22 0.0027676437 0.00          0
## 65  83  19 0.0023902378 0.00          0
## 66  84  10 0.0012580199 0.00          0
## 67  85   5 0.0006290099 0.00          0
## 68  86   7 0.0008806139 0.00          0
## 69  87   7 0.0008806139 0.00          0
## 70  88   6 0.0007548119 0.00          0
## 71  89   4 0.0005032080 0.00          0
## 72  90   4 0.0005032080 0.00          0
## 73  91   1 0.0001258020 0.00          0
## 74  92   3 0.0003774060 0.00          0
## 75  93   1 0.0001258020 0.00          0
## 76  95   1 0.0001258020 0.00          0
## 77  96   1 0.0001258020 0.00          0
## 78 101   3 0.0003774060 0.00          0

This is a pretty big table: it has 78 rows (one row for each unique value of age). What would be a better way to understand the distribution of the age variable (or how many cases we have at the different values of the age variable)?

When we have a continuous variable that can take on a lot of different values, it is better to visualize (plot) the distribution of the variable than create a frequency table. You will learn how to do this in next week’s class.
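In the meantime, if you do need a table for a variable like this, one common stopgap (not part of this lab’s code, just a sketch) is to collapse the variable into a handful of bins first, for example with base R’s cut():

df2 %>%
  mutate(age_group = cut(age, breaks = c(18, 30, 45, 60, 75, 102))) %>% # bin age into 5 intervals
  count(age_group) # frequency table with one row per bin instead of one row per age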

Measures of Central Tendency

Measures of central tendency indicate the most typical value of our variable.

  • What is the most typical value?
  • Which one value of our variable would best represent the entire distribution?

Mode

The mode tells us the most frequently appearing value or category of our variable.

For nominal level variables, we summarize the most typical value by figuring out which category of our variable occurs most often.

Let’s look at the province variable by sorting the table of frequencies/proportions that we made above in descending order and then (%>%) selecting the first row of the dataframe using slice(1). Below, I first show how to arrange the dataset, and then how to arrange it and pipe (%>%) the result into slice(1) to return the first row.

prov_prop_df %>%
  arrange(desc(n))
##    prov_name    n  proportion prop percentage
## 1         ON 3176 0.399547113 0.40         40
## 2         QE 1996 0.251100767 0.25         25
## 3         BC  873 0.109825135 0.11         11
## 4         AB  868 0.109196125 0.11         11
## 5         MB  290 0.036482576 0.04          4
## 6         SK  248 0.031198893 0.03          3
## 7         NS  195 0.024531388 0.02          2
## 8         NB  144 0.018115486 0.02          2
## 9         NL  123 0.015473644 0.02          2
## 10       PEI   36 0.004528872 0.00          0
prov_prop_df %>%
  arrange(desc(n)) %>%
  slice(1) 
##   prov_name    n proportion prop percentage
## 1        ON 3176  0.3995471  0.4         40

Another way to find the mode (which uses less code): below, we filter the table of proportions that we created earlier using the n variable (the frequency of each category). We ask R to return the row(s) of our table with the largest value of n (more than one category will be returned if there is a tie).

prov_prop_df %>%
  filter(n == max(n)) 
##   prov_name    n proportion prop percentage
## 1        ON 3176  0.3995471  0.4         40

Median

The median tells us the value of the middle case. The median cannot be used at the nominal level because there’s no inherent order to the data (so we cannot order the data to find the middle case).

Simple Example

Let’s take a simple example where we have 3 possible categories of our variable (called variable1). Let’s say the value of 1 represents dislike, 2 represents neutral, and 3 represents like. If our dataset is small enough, we could just sort the responses to find the middle value.

Below, our data has 9 rows or 9 unique cases (n). When n is odd (like 9), we first sort our variable so the categories are ordered from lowest to highest, and then we take the value of the middle case. Here’s how you can figure out the position of the middle case:

\[ \text{Median position} = \frac{n + 1}{2} \]

\[ \frac{9 + 1}{2} = \frac{10}{2} = 5 \]

So the median is the value of the 5th case in the sorted data.

fake_data2 <- data.frame(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  gender = c("M", "F", "M", "M", "F", "F", "F","F", "F"),
  variable1 = c(3, 2, 1, 1, 2, 3, 2, 3, 1)
)

fake_data2
##   ID gender variable1
## 1  1      M         3
## 2  2      F         2
## 3  3      M         1
## 4  4      M         1
## 5  5      F         2
## 6  6      F         3
## 7  7      F         2
## 8  8      F         3
## 9  9      F         1
fake_data2 %>% 
  arrange(variable1)
##   ID gender variable1
## 1  3      M         1
## 2  4      M         1
## 3  9      F         1
## 4  2      F         2
## 5  5      F         2
## 6  7      F         2
## 7  1      M         3
## 8  6      F         3
## 9  8      F         3

After sorting variable1 from lowest to highest, the median is the value of the middle case (the fifth observation in our dataset, or fifth row) which is 2.

Below, I use the example of the income variable to show you different code to find the median when your dataset is large.

Variable recoding

First, we should recode income so that it is an ordinal level variable. Check the codebook to learn how the variable is coded in the original data so that you can re-code it in an informative way.

table(df2$income)
## 
##  -99    1    2    3    4    5    6    7    8 
##   65  149 1130 1903 1745 1008 1061  578  310

Above, I’ve made a frequency table of the variable before we’ve recoded it. You’ll notice that there are 65 cases assigned to “-99”. If you search in the codebook, you’ll read that these were respondents who were coded as missing:

“If the respondent did not respond to a question or a component of a question (for example, did not click on or move a slider), then their response to that question, or that component of the question, was recorded as seen but missing (-99).”

Let’s recode income so that it is an ordinal level variable.

df2 <- df2 %>% 
  mutate(income_fct = 
           case_match(income, 
                      1 ~ "No income",
                      c(2, 3) ~ "1-60 thousand",
                      c(4, 5) ~ "60,001 - 90,000",
                      c(6,7,8) ~ ">90,001",
                      -99 ~ NA)) %>%
  mutate(income_fct = factor(income_fct, levels=c("No income", "1-60 thousand", "60,001 - 90,000", ">90,001")))

class(df2$income_fct)
## [1] "factor"
levels(df2$income_fct)
## [1] "No income"       "1-60 thousand"   "60,001 - 90,000" ">90,001"
table(df2$income, df2$income_fct) # Check variable recoding
##      
##       No income 1-60 thousand 60,001 - 90,000 >90,001
##   -99         0             0               0       0
##   1         149             0               0       0
##   2           0          1130               0       0
##   3           0          1903               0       0
##   4           0             0            1745       0
##   5           0             0            1008       0
##   6           0             0               0    1061
##   7           0             0               0     578
##   8           0             0               0     310

Just to be extra clear, you can use the code below to filter the dataset and return observations that are assigned NA on the income_fct variable. What do these observations look like on the original income variable vs. our new income_fct variable? We want to see NA on income_fct paired with -99 on income; this confirms that the -99 values were successfully recoded to NA.

df2 %>% 
  filter(is.na(income_fct))

Method 1:

We could use cumulative proportions to figure out the median category. The median will be the first category where the cumulative proportion is greater than or equal to the median position (0.5). We are dividing the data in half to the best of our ability, given that the variable is categorical.

df2 %>%
  count(income_fct) %>%
  mutate(proportion = n/sum(n)) %>%
  mutate(cumulative_prop = cumsum(proportion))
##        income_fct    n  proportion cumulative_prop
## 1       No income  149 0.018744496       0.0187445
## 2   1-60 thousand 3033 0.381557429       0.4003019
## 3 60,001 - 90,000 2753 0.346332872       0.7466348
## 4         >90,001 1949 0.245188074       0.9918229
## 5            <NA>   65 0.008177129       1.0000000

Method 2: median()

An even easier method to calculate the median of an ordinal level variable is to convert your factor variable to a numeric variable and then compute the median. The factor variable MUST be coded in a way that accurately specifies the levels from lowest to highest, otherwise this will not work as expected. Then, you can use the median() function from base R inside dplyr’s summarize() to return the median. It will return NA if there are NA values on the income variable, so you should specify na.rm=TRUE or remove the NA values prior to the calculation.

df2 %>% 
  mutate(income_numeric = as.numeric(income_fct)) %>%
  summarize(median_income = median(income_numeric, na.rm=TRUE))
##   median_income
## 1             3

It returns “3” - what does 3 mean? Three specifies which level of our factor variable is the median category. We can figure out which category is the third level by running the code below and counting the categories from left to right. The third category is the 60,001 - 90,000 category.

levels(df2$income_fct)
## [1] "No income"       "1-60 thousand"   "60,001 - 90,000" ">90,001"

Mean

For continuous variables, we typically want to take a look at the mean (or the average value of our variable). The mean is the point closest to every single value at the same time. Sometimes we will look at the median instead of the mean for continuous variables (e.g. when we have outliers in our data) - more on this next week.
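As a quick illustration of how an outlier pulls the mean but not the median (a made-up vector, not from our data):

incomes <- c(20, 22, 25, 30, 95) # one atypically high value (95)
mean(incomes)   # 38.4 - pulled upward by the outlier
median(incomes) # 25 - unaffected by the outlier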

mean()

Calculating the mean in Base R:

mean(df2$age)
## [1] 49.77054

Calculating the mean using dplyr:

df2 %>% 
  summarize(mean = mean(age))
##       mean
## 1 49.77054

Measures of Dispersion

Measures of dispersion indicate variety, diversity, or the amount of variation in a variable. We might wonder: how is wealth distributed in different countries? If we have a variable for income, for instance, we might look at its variation by taking the difference between the highest and lowest incomes in our dataset. This is one example of a measure of dispersion (specifically, the range). In Norway, the difference in income between the wealthiest and poorest individuals is relatively small; in the United States, the difference is much larger.

Variation Ratio

The variation ratio tells us what percentage or proportion of cases do not fall into the modal category. We can calculate this simply by taking the percentage of cases that fall into the modal category and subtracting that from 100%. Or, if working with proportions, we take the proportion of cases in the modal category and subtract that from 1.

Here’s how we can calculate the variation ratio by using the table of proportions we created earlier:

prov_prop_df %>% 
  filter(n == max(n)) %>% # filter to keep only the modal category
  mutate(variation_ratio = 100 - percentage)
##   prov_name    n proportion prop percentage variation_ratio
## 1        ON 3176  0.3995471  0.4         40              60

0.6 (or 60%) of observations do not fall into the modal category (Ontario). In other words, 40% of our data is from Ontario, the modal category, so the remaining 60% falls outside of it.

Let’s do the same thing, except this time, let’s calculate the variation ratio starting from the original dataframe (not the frequency table for the province variable).

df2 %>%
  count(prov_name) %>% # create frequency table 
  mutate(total_obs = sum(n)) %>%  # create column that lists total observations in dataset (it will have the same value in every row)
  filter(n == max(n)) %>% # filter to keep only the modal category
  summarize(variation_ratio = 1 - max(n) / first(total_obs)) # calculate variation ratio: 1 - the mode divided by the total observations
##   variation_ratio
## 1       0.6004529

For ordinal level data, we often just refer to our table of proportions to get a sense of dispersion.
If, for example, observations are highly concentrated in one category of our variable (e.g. 90% of observations are from Ontario), we say that dispersion is low. If categories are more evenly distributed (e.g., most provinces have around 15-30% of observations each), we would be more inclined to say that dispersion is high. Dispersion is relatively higher in the example below.

income_dist <- df2 %>%
  count(income_fct) %>% 
  mutate(proportion = n/sum(n))

print(income_dist)
##        income_fct    n  proportion
## 1       No income  149 0.018744496
## 2   1-60 thousand 3033 0.381557429
## 3 60,001 - 90,000 2753 0.346332872
## 4         >90,001 1949 0.245188074
## 5            <NA>   65 0.008177129

Range

Quite simply, the range is the highest value minus the lowest value.

The range can be distorted by atypically high or low scores (outliers in our data).

df2 %>%
  summarize(
    min_age = min(age),
    max_age = max(age),
    range_age = max_age - min_age
  )
##   min_age max_age range_age
## 1      19     101        82

We might also discuss the range of an ordinal level variable - what are the highest and lowest categories, for instance? We can use levels() to check the levels of our ordinal (factor) variables.
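For example, using the income variable we recoded earlier (the first level is the lowest category and the last level is the highest):

levels(df2$income_fct) # first level = lowest category, last level = highest category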

Standard deviation

The standard deviation tells us the average distance of each score from the mean. A standard deviation of zero would indicate no dispersion and it increases in value as the distribution of scores becomes more diverse.

The formula for sample standard deviation is

\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}} \quad \text{where:} \quad \]

\[ \begin{aligned} x_i &\text{ is a value/score of the variable of interest,} \\ \bar{x} &\text{ is the mean of the variable of interest,} \\ n &\text{ is the number of cases of the variable of interest.} \end{aligned} \]

Calculating sd the long way

Let’s look at the standard deviation of a fake age variable by doing the calculation step-by-step. This is for illustrative purposes only - you do not need to know how to do this calculation. I will show you how to do this using a single function below…

fake_data2 <- data.frame(
  age = c(18, 22, 25, 30, 35, 40, 45, 50, 55, 60)
)

# Step 1: calculate the "sample mean" (mean of the age variable in our sample)
mean_age <- mean(fake_data2$age)

# Step 2: compute the deviations (subtract the mean from each value of age)
fake_data2 <- fake_data2 %>% 
  mutate(deviations = age - mean_age)

print(fake_data2)
##    age deviations
## 1   18        -20
## 2   22        -16
## 3   25        -13
## 4   30         -8
## 5   35         -3
## 6   40          2
## 7   45          7
## 8   50         12
## 9   55         17
## 10  60         22
# Step 3: compute the squared deviations 
fake_data2 <- fake_data2 %>% 
  mutate(deviations_sqd = deviations^2)
print(fake_data2)
##    age deviations deviations_sqd
## 1   18        -20            400
## 2   22        -16            256
## 3   25        -13            169
## 4   30         -8             64
## 5   35         -3              9
## 6   40          2              4
## 7   45          7             49
## 8   50         12            144
## 9   55         17            289
## 10  60         22            484
# Step 4: sum the squared deviations 
sum_squared_deviations <- sum(fake_data2$deviations_sqd)

# Step 5: divide the sum of the squared deviations by n - 1 to get the variance
variance <- sum_squared_deviations / (nrow(fake_data2) - 1)

# Step 6: take the square root of the variance to get the standard deviation
standard_deviation <- sqrt(variance)

print(standard_deviation)
## [1] 14.40679

sd()

Lucky for us, there is a sd() function in base R (from the stats package) that we can use inside dplyr’s summarize(). Since the standard deviation requires a mean, we know that it can only be calculated for interval/ratio level data.

fake_data2 %>% 
  summarize(std_dev_age = sd(age))
##   std_dev_age
## 1    14.40679
# Or here, we are doing the same calculation, but we store 
# the standard deviation in a column in our dataset: 
fake_data2 <- fake_data2 %>% mutate(
  std_dev_age = sd(age))

print(fake_data2)
##    age deviations deviations_sqd std_dev_age
## 1   18        -20            400    14.40679
## 2   22        -16            256    14.40679
## 3   25        -13            169    14.40679
## 4   30         -8             64    14.40679
## 5   35         -3              9    14.40679
## 6   40          2              4    14.40679
## 7   45          7             49    14.40679
## 8   50         12            144    14.40679
## 9   55         17            289    14.40679
## 10  60         22            484    14.40679

The exact meaning of our standard deviation depends on the scale of our data. We can better understand our standard deviation by comparing it to the range of the data. If the standard deviation is a large portion of the range, we can say that there is a lot of dispersion in our data.

df2 %>%
  summarize(
    min_age = min(age), # returns the lowest value of age
    max_age = max(age), # returns the highest value of age
    range_age = max_age - min_age, # returns the range of age 
    sd_age = sd(age, na.rm=TRUE) # returns the standard deviation of age 
  )
##   min_age max_age range_age   sd_age
## 1      19     101        82 16.61546

Digression: Understanding Standard Deviation

Let’s visualize two fake (made up) variables: one that has a high standard deviation and one that has a low standard deviation. These variables have similar means.

In the plot above, we can see the distributions of two variables (something we will learn how to do next week). The distribution of the red variable has a mean of 49 and a standard deviation of 20. The distribution of the blue variable has a mean of 52 and a standard deviation of 5. What does this mean?

When we average the values of the two variables, we find that they are similar. Their standard deviations, however, are drastically different, which tells us how dispersed the data values are around the mean and shapes the distributions we see in the plot. When we have a high standard deviation, the values of the variable fluctuate (or DISPERSE) much more widely around the mean. You can see this in the tables below, where I have printed just the first 5 rows of each dataset. The first table shows the values of the variable with a high standard deviation (20): we see values as high as 81 and as low as 38 in just the first 5 observations. The second table shows the first five values of the variable with a low standard deviation: most values cluster around the mean (at least judging from the first 5 rows).

head(high_sd_data, 5)
##     group    value
## 1 High SD 38.79049
## 2 High SD 45.39645
## 3 High SD 81.17417
## 4 High SD 51.41017
## 5 High SD 52.58575
head(low_sd_data, 5)
##    group    value
## 1 Low SD 46.44797
## 2 Low SD 51.28442
## 3 Low SD 48.76654
## 4 Low SD 48.26229
## 5 Low SD 45.24191
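The code that created high_sd_data and low_sd_data is not shown above, but here is a sketch of how you could generate similar data yourself (the seed, sample size, and exact means here are assumptions, so your values will not match the tables above exactly):

set.seed(123) # any seed works; setting one just makes the random draws reproducible
high_sd_data <- data.frame(group = "High SD", value = rnorm(100, mean = 50, sd = 20))
low_sd_data  <- data.frame(group = "Low SD",  value = rnorm(100, mean = 52, sd = 5))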

Wrapping Up

Important functions

  • summarize()
  • count()
  • sd()
  • median()
  • mean()
  • is.na()

Exercises

  1. Using the Democracy Checkup 2021 data, choose one of the questions that focuses on attitudes towards Indigenous peoples (HINT: look at the dataset codebook!). What was the question being asked? Do respondents tend to hold positive or negative views about Indigenous peoples? Did respondents tend to share similar beliefs or attitudes, or do the attitudes vary?

  2. Again, looking at the Democracy Checkup 2021 data: how many respondents said that they had a “great deal of interest” in politics? What would be the best way to describe the central tendency of this variable?