Essentials for Analyzing Survey Data Using R

1. Introduction to This Course

Studies of behavior often require the use of surveys to collect self-reported data about a participants perceptions, motives, beliefs, and actions. Whether you are studying how and why people get vaccinated against a disease or why they engage in other health related behaviours, such as exercise, socialization, or eating, surveys can provide you insights into the behaviour of individuals and populations.

Surveys can be administered in a variety of formats, ranging from telephone surveys, paper questionnaires, or computer-assisted online surveys. While this review does not explore data collection, each of these types of surveys results in a dataset that will need to be analyzed.

“Essentials for Analyzing Survey Data Using R” will teach you how to conduct statistical analyses using R Studio. In doing so, we often provide the simplest methods for meeting each objective. However, in R there are numerous ways to complete a given task and you are welcome to explore and identify strategies that work better for you.

2. Introduction to R and R Studio

This section introduces you to the following concepts and their corresponding functions:

Installing R and R Studio
The R-Studio Environment
Getting Help with R: ?
R Syntax
R Packages: install.packages(), library(), update.packages()
Annotating Your Code: #
Creating R Objects: <-, c(), :
Bringing Data into R: \, read.csv(), read_xlsx()
Understanding Your Data table(), class(), levels(), colnames(), rownames(), as.factor(), as.character(), as.numeric(), as.logical, as.integer()
Merging Datasets: merge()
Subsetting Datasets: [], names()

R is a statistical language commonly accessed through the integrated development environment called RStudio. Base R performs a vast number of useful statistical operations, but it can be enhanced with packages. These packages are open source and created by the community of R users, making these resources open for public use free of charge.

Install R and R-Studio

Follow the instructions at this link to install both R and RStudio on your computer. Install R first then R-Studio.

The R-Studio Environment

There are several important windows to become familiar with when working in RStudio:

Four Essential Windows in R Studio

The top left window is a script editor. Here you can edit and execute R code (by highlighting and hitting ctrl+Enter).
The bottom left is the R console, where the R engine does its work. You’re your code is executed, with results and error notifications. You can also enter ad hoc commands in this window, but these commands will not be saved.
In the top right, R objects are displayed and summarized. Objects are named sets of data, whether a set of data from a survey, a list of names, a set of randomly-generated numbers, or even functions you have defined (created) yourself.
The bottom right contains several useful tools: a file explorer, a graphical device for displaying any plots you create (histograms, bar charts, and the like), a list of the packages you have installed, a viewer for function help documentation, and a viewer for web content.

R Syntax

An R function almost universally follows this form: function(arguments).

For instance, the mean() function can take the following arguments, each of which is described in the functions help file:

Argument	Description
x	An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.
trim	The fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
na.rm	A logical value indicating whether NAvalues should be stripped before the computation proceeds.
…	Further arguments passed to or from other methods.

R takes the arguments in a given order (usually as listed in the help file. This means that you can choose whether or not to name your arguments.

The text below names each argument:

mean(x = dataset$variable_name, trim = 0, na.rm = TRUE)

The text below does not name each argument:

mean(dataset$variable_name, 0, TRUE)

R Packages

R comes ready with a large set of useful functions, but the R community has created many new functions which perform more complex or specialized tasks or which simplify the language around basic R tasks. For the latter purpose, I highly recommend installing and familiarizing yourself with the Tidyverse family of packages, particularly plyr, dplyr, ggplot2, and readr. These simplify and extend data reading, cleaning, and plotting.

You will need these packages to perform the tasks I demonstrate below. Install them from RStudio under Tools > Install Packages… Input the list of desired packages: plyr tidyverse. Hit OK and RStudio will download and install the packages from CRAN. You can also download the packages from the R console:

install.packages(pkgs = c("plyr", "tidyverse"))

Finally, to activate all of the functions in each package, enter the following in the script editor or R console:

library(plyr) 

library(tidyverse)

This will load all of the functions contained in these packages into your R session; without doing this, R will not have access to many of the functions used in this course.

Getting Help with R

Specific functions are described fully in R. The ? symbol is the basic function for calling up documentation about a given function. To use it, you simply add a question mark before the name of a function. For example, to get information about the mean() function, you can enter the following code:

?mean #Calls up the help documentation for the mean function.

When this code is run, the documentation for the mean function will appear in the”Help” window in Rstudio.

When the information provided is not enough, or you have questions about a specific function or error, the R community is extremely active. Your question has likely already been asked on one of the following websites:

Often times you can simply Google the error you encountered and you will be able to find a solution.

Annotating Your Code

R has a comment flag, the hashtag (#). Any text on a line of code beginning with a # will not be read by the R engine. This allows you to explain what your code is trying to accomplish for your future self or for other readers of your code. It is vital that you comment your code, and it is best that you comment as you go. You will thank yourself time and again for developing this habit.

Creating R Objects

R is an object-based programming language. An object is a data structure used to store information. Objects can include vectors, lists, matrices, data frames, and many other types of data.

The “<-” symbol is used to assign values to an object. For example, the code below creates an object called “VARIABLE_LIST” than includes the following values: 1, 2, 3, 4, 5, 6, 7.

VARIABLE_LIST <- c(1,2,3,4,5,6,7) #Create an object called "VARIABLE_LIST" that contains the values 1, 2, 3, 4, 5, 6, 7.

You will notice that the values are enclosed in the c() function.

The c() function in R is a type of function known as a constructor. It is used to combine multiple objects into a vector. It can combine numeric values, character strings, logical values (TRUE or FALSE), and many other objects into a single vector.

There are always multiple ways to accomplish a given task in R. For example, since the values in our list are sequential, we could have used the following syntax to create the same list

VARIABLE_LIST_2 <- (1:7) #Create an object called "VARIABLE_LIST" that contains the values 1, 2, 3, 4, 5, 6, 7.

To view these objects, you can simply type the object names into the console window after creating them.

VARIABLE_LIST #Print the VARIABLE_LIST object.

## [1] 1 2 3 4 5 6 7

VARIABLE_LIST_2 #Print the VARIABLE_LIST_2 object.

## [1] 1 2 3 4 5 6 7

Now that you know how to create an object, you are ready to read your data into R.

Brining Data Into R

About The Data Used In This Course

For this course will utilize data from the 2021 Canadian Social Connection Survey (CSCS). The CSCS is a serial cross-sectional survey with a longitudinal sub-cohort that aims to study the social health and well-being of Canadians in the wake of the COVID-19 pandemic. Data from Wave 1 of the CSCS was collected between April 21st, 2021 and June 1st, 2021, during the third wave of the COVID-19 Pandemic in Canada. Throughout this period, participants were recruited using paid advertising in French and English on Facebook, Twitter, Instagram, and Google. Advertisements were targeted to people aged 16 years of age or older across Canada. Participants were eligible to participate if they were 16 years of age or older, lived in Canada, were able to complete the survey in English or French, and provided informed consent. Upon completion of the survey, participants were eligible to enter a prize draw for a $200 VISA gift card. Ethics review for the CSCS was conducted by the Research Ethics Board at the University of Victoria (Ethics Protocol Number 21-0115).

The questionnaire for this dataset is in a file called “CSCS_Questionnaire.pdf” and the question text and variable names are cross indexed in the file called: “variable_labels.csv”.

On my computer, the CSCS data file is stored at the following path. It is important to recognize that R is case sensitive, so it is important that everything is copied exactly as they are stored:

*“C:- Simon Fraser University (1sfu)Lab\2. Projects\9. Learning HubR for Analyzing Survey Data_data.csv”*

The backslash is a special character in R. It suppresses the special meaning of the character it proceeds. Therefore, to turn the file path into a usable path, you must replace each backslash with a double backslash, as follows:

“C:\Users\Kiffer\OneDrive - Simon Fraser University (1sfu)\HEAL Lab\2. Projects\9. Learning Hub\Using R for Analyzing Survey Data\CSCS_data.csv”

Reading in .csv Files

Data are stored in different file formats. For example, Microsoft Excel files often use the extensions .xlsx, .xls, or .csv. These extensions describe the format of the file your data is stored in. Different file formats require different functions in ordder to read their data into R.

Lets begin by learning how to read in .csv files. This can be done using the read.csv() function, as shown below:

CSCS_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\CSCS_data.csv", na.strings = c(" ", "", "NA", "9999: Missing")) #This command creates an object called "CSCS_data" using the read.csv() function. We have used two arguments within this function to specify the file name and the character strings that should be read in as missing.

You can see that after running the command lines above, that the data object has now been added to your environment window in the top right. Next to the object name, it shows the text “2448 obs. of 408 variables.” This indicates that there are 2448 rows of data and 408 columns. Each column is a variable. Each row is a participant observation. You can click on the data object and view the spreadsheet you read in.

You might also notice that the syntax I used above included several arguments within the read.csv() function. These arguments were the “file =” and “na.strings =” arguments. To familiarize yourself with what these arguments do, you can retrieve help documentation on any function, by using typing the ? symbol and the function name, as follows:

?read.csv #Using the ? in front of a function name opens the help file for that function

If you scroll through the help file you will see sections on usage and arguments. These sections tell you what the arguments do.

The “file =” argument is “the name of the file which the data are to be read from. In this case I am reading the file from a file path, which I have provided within quotation marks.
The”na.strings = ” argument is a character vector of strings which are interpreted as NA. In my example above, you will see that I created a list using the c() function. The string in this list says that if the data has a cell that is blank, a single space, NA, or says 9999: Missing, that the data should be read in as missing. You could add other character strings in this list, depending on how missing values are stored in the dataset you are reading in. Removing missing data may make data analysis easier. However, you should be cautious as sometimes missing data has a specific meaning. For example, if skip logic is used to determine which participants answer a given question, they may be missing data because they were not presented a given question.

Now that you understand what the code above accomplished, you can now view your data to confirm that it was read into R correctly. To view your data you can use the View() function.

View(CSCS_data) #The View() function will open the object called "CSCS_data" in a new window. Note that the View() function is typed with a capital V. If you were to use a lowercase V it would not work.

Note that above we read in a .csv file using the ‘read.csv()’ function. As I mentioned, different file formats require using different functions to be read into R.

R includes many functions in its base package. However some functions require you to install a new package.

While beyond the scope of this course, there are other functions and packages for other file formats that work very similar to those covered here. You can usually google and find an appropriate function that suits your specific needs. A few common functions include the read.spss() function, which is used to open SPSS files and the read_sas() function, which is used to open SAS files. As with the example below, you will need to install the packages that contain these functions.

Reading in .xlsx Files

To install a function that is not included in Base R, you will need to use the ‘install.packages()’ function. Lets take a look at the help file for this function.

?install.packages #Using the ? to open help documentation for the install.packages() function.

The ‘install.packages()’ function has an argument called pkgs. We can use this argument to specify the name of the package we want to download. In our case we want the readxl package

install.packages(pkgs = "readxl", repos = "http://cran.us.r-project.org") #Installing the readxl package using the install.packages() function. Note that the pkgs argument specifies the name of the package and the repos = function specifies that repository used. CRAN is the official repository for most R packages.

## package 'readxl' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

install.packages(pkgs = "Rcpp", repos = "http://cran.us.r-project.org") #Installing the Rcpp package using the install.packages() function.

## package 'Rcpp' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

Before an installed package can be used, you need to tell R that you are using a specific language library. This is because some functions have the same name as functions in other packages. To tell R we are using the readxl library we can use the library() function

library(readxl) #Telling R I want to use the readxl library
library(Rcpp) #Telling R I want to use the readxl library

With the package installed, we are now able to read in .xlsx data using the read_xlsx() function. Lets try this with the forward sortation area census data file. If you need to figure out how to use this function or if it ends up not working as expected, remember you can retrieve help documentation by typing the ? in front of the package name.

census_data <- read_xlsx(path = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\census_data.xlsx", sheet = 1) #Using the read_xlsx() function, which is part of the readxl package to create an object called "census_data" from the first sheet of the .xlsx file. Notice that since I set up a working directory earlier, I did not need to copy the entire file path. I only needed the name of the file, which I found by going to the files tab in the bottom right window of R.

Understanding Your Data

As you can see, it is difficult to make sense of the large volume of data displayed using the command above. There are three helpful functions that you can use to better understand the nature of a variable you are working with: class(), levels() and table().

Lets start with the class variable:

class(CSCS_data$gender) #tells you the class of the variable you are working with.

## [1] "character"

As you can see the gender variable is currently stored as a character variable. Data can be stored as character, numbers, factors, as true/FALSE logical values, or some other structure. Different functions specify what types of variables they can work with. For example, lets look at the help documentation for the levels() function:

?levels #retrieve help file for the levels() function.

As you can see the levels function can be used on factors. Currently the gender variable in our dataset is stored as a character variable. We can change it to a factor variable using the as.factor() function. Other similar functions include the as.character(), as.logical(), as.integer(), and as.numeric() functions which can be used to set a variable type to character, logical, integer, or numeric.

CSCS_data$gender <- as.factor(x = CSCS_data$gender) #replace the gender variable with a factor variable representing the same data.
class(x = CSCS_data$gender) #Confirm that the gender variable is now a factor.

## [1] "factor"

You can now run the levels function to identify levels() of the factor

levels(x = CSCS_data$gender)

## [1] "Man"        "Non-binary" "Woman"

The levels() function shows that there are three levels: “Man”, “Non-binary”, and “Woman” You can use the table() function to see how many of the 2448 participants identify as each gender. Notice that I include the useNA = argument to show the number of missing observations for the gender variable.

table(CSCS_data$gender, useNA = "always") #create a frequency table for the gender variable.

## 
##        Man Non-binary      Woman       <NA> 
##       1157         59       1232          0

Merging Datasets

We now have two separate data files loaded in as R objects. The first object (i.e., CSCS_data) consists of survey responses. Each row represents a participant and each column a variable. The second object (i.e., census_data) consists of census data. Each row represents a forward sortation area (i.e., the first 3 letters of a postal code) and each column represents a variable. To use these two data frames together, we will need to merge them. The merge() function is one function that can be used for this purpose. Lets review the help file for the merge function:

?merge #Review help documentation fir the merge() function

As you can see, the merge function asks for two data frames or objects that you want to merge. You can merge these using a specific variable. If the variable you are merging the data on is labeled differently in the two datasets, you can specify the variable names using the by.x and by.y arguments. In our case, we want to match forward sortation area level data for each participant using their self-reported FSA code. In both datasets this variable is labeled “FSA”. Therefore we can use the “by =” argument. There is also another set of arguments that indicate what you should do in cases where there are not matches between the two datasets. For example, in our case we might not have recruited participants in every postal area. Also, not all participants provided a postal code. To ensure we keep all the participants in our survey, regardless of whether they have a matching postal code, we will use the “all.x = TRUE” argument. Because we don’t want to create new rows for postal codes that dont have any participants living in them, we will also use the “all.y = FALSE” argument. Lets try merging the dataframes together using merge

merged_data <- merge(x = CSCS_data, y = census_data, by = "FSA", all.x = TRUE, all.y = FALSE) #Merge CSCS_data and census_data to create merged_data

You can see that the new variable has 2,448 observations (1 for each participant) and 485 variables (80 from the census_data and 406 from the CSCS_data). Because the FSA variable matched in the two datasets, you have 485 variables in the new merged_data file instead of 486. You can review the names of the columns and rows, using the colnames() and rownames() function.

colnames(x = merged_data) #print names of columns
rownames(x = merged_data) #print names of rows

Subsetting Datasets

As you can see there are lots of columns that you might not be interested in working with. It is sometimes helpful and sometimes necessary to create a data file that contains only the variables you want to work with. Likewise, there may be specific observations that you are interested in. To omit rows or columns you do not want to work with it is important to know how to subset a data file.

The bracket (“[]”) symbol provides an easy way to subset your data so that it only includes specific observations or rows from your dataset. Lets create a dataset for each gender. Before we do, however, it is necessary to take a closer look at the gender variable in R. In R, you can call up data about a specific variable by using the $ symbol following your call for the dataframe. So in this example we would call the variable by specifying the name of the data frame, followed by the $ symbol, followed by the name of the gender variable:

merged_data$gender #Print the gender column in the merged_data dataframe

Selecting Specific Observations for Your Dataset

Using the output from the functions above, we are now ready to subset our dataset using the [] symbol. To do so, we will create three datasets called “data_men”, “data_women”, and “data_nonbinary”. Notice that to accomplish this we specify the source data (i.e., merged_data) and then add the bracket symbol immediately thereafter. The [] symbols can be used to specify a subset. I often read them as “where…” Within the [] symbol we then write a statement for each dataset that indicates which observations we want included in the new dataset. In doing so, we use the == sign to indicate “equals”. You will also notice that following our statement there is a comma. When using the [] symbols, information prior to the comma specifies information about rows, and information following the comma specifies information about columns.

data_men <- merged_data[merged_data$gender == "Man" ,]
data_women <- merged_data[merged_data$gender == "Non-binary" ,]
data_nonbinary <- merged_data[merged_data$gender == "Woman" ,]

You can now use the table statement on the original dataset to ensure that the correct number of observations are stored in each of the new data files.

table(merged_data$gender, useNA = "always") #confirm that the number of observations listed in the environment window matches what we expected from the original dataset.

## 
##        Man Non-binary      Woman       <NA> 
##       1157         59       1232          0

Selecting Specific Variables for Your Dataset

Now that you have a dataset for each gender, lets try creating a dataset that includes only a small subset of variables that we’re interested in working with. In this case, I would like to only include variables included in the Malach-Pines Burnout Measure. To do this, I create an object called keepvars that includes a list of the names of the relevant variables I want to keep.

keepvars <- c("burnout_tired",
              "burnout_disappointed",
              "burnout_hopeless",
              "burnout_trapped",
              "burnout_helpless",
              "burnout_depressed",
              "burnout_sick",
              "burnout_worthless",
              "burnout_difficulty_sleeping",
              "burnout_had_it") #Creates a list of variable names and stores it as an object called "keepvars"

Using this newly created object, which represents a list of variable names, I will use the [] symbols to create a new dataset called “data_men_burnout_vars” that only contains these variables. Because I am referring to column names within the “data_men” dataframe, I am using the names() function in combination with the %in% function, which indicates “in”. So you might read the code below as “create an object called ‘data_men_burnout_vars’ from the”data_men” dataframe where, the column’s names are in the keepvars object.”

data_men_burnout_vars <- data_men[, names(data_men) %in% keepvars] #create a new dataset called "data_men_burnout_vars" that only contains variables that are listed in the keepvars object

If you wanted to create a dataset that did included all variables, except those listed variables, you could create a similar list, which I will instead call “dropvars.” This name is arbitrary (and I could have in fact used the same object–keepvars–from above), but descriptive names are helpful.

dropvars <- c("burnout_tired",
              "burnout_disappointed",
              "burnout_hopeless",
              "burnout_trapped",
              "burnout_helpless",
              "burnout_depressed",
              "burnout_sick",
              "burnout_worthless",
              "burnout_difficulty_sleeping",
              "burnout_had_it") #Creates a list of variable names and stores it as an object called "dropvars"

To drop the variables contained in the “dropvars” list, I again use the brackets function. This time, I use the !() function to tell R that I don’t want these variables. In R, the exclamation symbol indicates “NOT”.

data_men_no_burnout_vars <- data_men[,!(names(data_men) %in% dropvars)] #create a new dataset called "data_men_no_burnout_vars" that contains variables that are NOT listed in the "dropvars" object

3. Descriptive Statistics

This section introduces you to the following concepts and their corresponding functions:

Descriptive Statistics: summary(), sd(), table(), prop.table(), write.csv()
Creating Plots: hist(), boxplot(), dotchart(), barplot()

Getting Started

To begin, we will read in and merge the datasets that we will be working with.

library(readxl) #Tell R I am using the readxl library

CSCS_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\CSCS_data.csv", na.strings = c(" ", "", "NA", "9999: Missing")) #read in CSCS data

census_data <- read_xlsx(path = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\census_data.xlsx", sheet = 1) #read in census data

merged_data <- merge(x = CSCS_data, y = census_data, by = "FSA", all.x = TRUE, all.y = FALSE) #Merge CSCS_data and census_data to create merged_data

As you can see there are many variables in this dataset. Lets use the skills we learned last week to create a smaller dataset. To select these variables I went through the questionnaire and data dictionary and identified those that I thought might be interesting to explore. I decided that I would look at four main groups of variables (1) demographic variables, (2) those related to social connection, (3) those related to personality (i.e., attachment, and personality), (4) and those related to work and burnout.

keepvars <- c("gender",
              "age",
              "ethnicity",
              "income",
              "relationship_status",
              "identity_vetrans",
              "identity_indigenous",
              "identity_lgbtq",
              "identity_disability",
              "identity_bipoc",
              "identity_pwud",
              "identity_newcomers",
              "identity_homeless",
              "identity_mental_health",
              "educational_attainment",
              "lonely_change_covid",
              "province",
              "POP_DENS_SQ_KM",
              "FAMILY_SIZE_AVG",
              "UNSUITABLE_HOUSING_PER",
              "UCLA3Score",
              "lonely_direct",
              "lonely_duration",
              "close_friends_num",
              "live_with_partner",
              "zimet_overall_score",
              "discrimination_score",
              "tipi_agreeable_score",
              "tipi_extraversion_score",
              "tipi_conscientiousness_score",
              "tipi_emotional_stability_score",
              "tipi_openness_score",
              "attachment_secure_score",
              "attachment_anxious_score",
              "attachment_avoidant_score",
              "employment",
              "money_concerned",
              "work_paid_enough",
              "work_quitting",
              "work_dignity",
              "work_support",
              "work_quitting",
              "burnout_tired",
              "burnout_disappointed",
              "burnout_hopeless",
              "burnout_trapped",
              "burnout_helpless",
              "burnout_depressed",
              "burnout_sick",
              "burnout_worthless",
              "burnout_difficulty_sleeping",
              "burnout_had_it",
              "BurnoutScore",
              "BurnoutScore_groups",
              "weightvec") #Creates a list of variable names and stores it as an object called "keepvars"

data_burnout <- merged_data[, names(merged_data) %in% keepvars] #create a new dataset called "data_men_burnout_vars" that only contains variables that are listed in the keepvars object

Now that I have created a new smaller dataset, I want to save it for future use. That way I don’t have to recreate my dataset everytime I want to call it in future lectures. To save a dataset I will use the write.csv() function to save it as a csv:

write.csv(data_burnout, file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Save data_burnout to the working directory. Will be able to now load it in future lectures without having to merge the datasets or select my variables.

Now that we have a more manageable sized dataset, lets learn how to calculate descriptive statistics for (1) categorical (i.e., factor/character/logical) and (2) numeric (i.e., numeric/integer) variables.

Descriptive Statistics

Calculating Descriptive Statistics for Categorical Variables

Lets start with a variable that we know is a categorical variable, such as gender. However, we should always start by verifying the class using the class() function.

class(data_burnout$gender) #Show the class of the gender variable

## [1] "character"

As we suspected Gender is saved as a character variable. To get descriptive statistics for a categorical variable, we can use the table() and prop.table() functions. The table() function provides frequencies and the prop.table() function can be combined with the table function to calculate proportions. Functions can be combined by nesting them. R processess these functions by running the function from the inside out. Take a look:

table(data_burnout$gender) #calculate frequencies of gender

## 
##        Man Non-binary      Woman 
##       1157         59       1232

prop.table(table(data_burnout$gender)) #calculate proportions of gender

## 
##        Man Non-binary      Woman 
## 0.47263072 0.02410131 0.50326797

As you can see, 50.4% (n = 1232) identified as a woman, 47.3% (n = 1157) identified as a man, and 2.4% identified as non-binary.

Frequently, you are interested in providing descriptive statistics stratified by a second (or even third) variable. Lets first try this for a categorical variable. For the table function, we can simply seperate the variables by a comma:

table(data_burnout$gender, data_burnout$BurnoutScore_groups) #calculate frequencies of gender

##             
##                0   1
##   Man        802 308
##   Non-binary  30  24
##   Woman      783 409

One nice feature of the table function is you can add labels if it is helpful.

table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups) #calculate frequencies of gender

##             Burnt_Out
## gender         0   1
##   Man        802 308
##   Non-binary  30  24
##   Woman      783 409

Getting proportions requires just one more step. You simply nest the table function from above within the prop.table() function and then specify a margin (1 or 2) that is used to calculate the percentage. The Margin = 1 argument calculates row total percentages, and Margin = 2 calculates column total percentages.

prop.table(table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups), margin = 1) #calculate percent of men that are burnt out.

##             Burnt_Out
## gender               0         1
##   Man        0.7225225 0.2774775
##   Non-binary 0.5555556 0.4444444
##   Woman      0.6568792 0.3431208

prop.table(table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups), margin = 2) # calculate the percent of burnt out people that are men.

##             Burnt_Out
## gender                0          1
##   Man        0.49659443 0.41565452
##   Non-binary 0.01857585 0.03238866
##   Woman      0.48482972 0.55195682

You can add additional variables, as needed, simply by adding them to the list:

table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups, LGBTQ = data_burnout$identity_lgbtq) #calculate the frequency of burnout by gender AND sexual orientation

## , , LGBTQ = 8888: Not Selected
## 
##             Burnt_Out
## gender         0   1
##   Man        723 257
##   Non-binary  14  10
##   Woman      737 372
## 
## , , LGBTQ = Sexual or gender minorities (e.g., LGBTQ2+)
## 
##             Burnt_Out
## gender         0   1
##   Man         77  49
##   Non-binary  15  14
##   Woman       41  27

Calculating Descriptive Statistics for Numeric Variables

Lets now try a numeric variable, such as Burnout Score. To do so we will use the summary() function.

class(data_burnout$BurnoutScore) #Confirm the class of the variable

## [1] "numeric"

summary(data_burnout$BurnoutScore) #get a summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.70    3.50    3.42    4.10    7.00      92

As you can see, the summary() function gives us the minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum, and the number of missing variables. Since we have the mean, we probably also want to standard deviation, which can be calculated using the sd() function

sd(data_burnout$BurnoutScore, na.rm = TRUE) #get the sd

## [1] 1.096954

Creating Plots

The numbers summarizing the variable are great, but sometimes it is helpful to look at the data graphically. Lets review a few of the functions that can be used to create plots:

Dotcharts

A dotchart (also known as a dot plot) is a type of chart that uses dots or markers to represent the values of a data set. The position of each dot on the chart represents one of the values in the data set, and the dots are often connected with a line to better visualize the trends in the data. Dotcharts are used to compare groups of related data and to examine the distribution of values within a data set.

dotchart(data_burnout$BurnoutScore) #Create a dotchart, the dotchat gives you a sense of the volume of data in given areas.

Histograms

A histogram is a type of graph that displays the frequency of data within a set of ranges. It is used to visualize the distribution of numerical data. Histograms are useful for understanding the shape of a data set and for identifying skew in a dataset.

hist(data_burnout$BurnoutScore) #create a histogram. The histogram shows that the basic shape is a bellcurve.

Boxplots

A boxplot is a graphical representation of numerical data that displays the median, quartiles, and range of the data set. It is a useful tool for understanding the distribution of a data set and identifying potential outliers. Boxplots are also known as box and whisker plots

boxplot(data_burnout$BurnoutScore) #create a boxplot. The boxplot shows that the first quartile is slightly below 3 and the 3rd quartile is slightly above 4 and there are some outliers at the upper end of the distribution.

As you can see, all the charts show that the data is centered around 3-4, which agrees with our summary statistics (Median = 3.5, Mean = 3.4), the data are also slightly right-skewed, but basically the shape is normal, and the minimum of 1 and maximum of 7 are clearly identified.

Barcharts

A bar chart is a graphical representation of data using rectangular bars to illustrate numerical comparisons between categories. The bars can be plotted horizontally or vertically, and they are often used to compare values across different categories, such as sales performance over a specific period of time. You can create an easy bar chart for the categorical variables, by nesting a table() function within the barplot() function.

barplot(table(data_burnout$gender))

Saving Charts

To save the images, you can go to the plots window in the bottom right and select “Export” then “Save as Image”

Getting Started

Another helpful way to view your data, is to create a table that shows the descriptive statistics for all your variables. This can be done using the tableone package.

install.packages("tableone", repos = "http://cran.us.r-project.org")

## package 'tableone' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library(tableone)

In this example, we are going to create a table with descriptive statistics reported separately for two groups. Often times, a descriptives table is stratified by your outcome variable, but this is not necessarily required.

In this case, we will use gender.

table(merged_data$gender)

## 
##        Man Non-binary      Woman 
##       1157         59       1232

class(merged_data$gender)

## [1] "character"

merged_data$outcome_var <- merged_data$gender

The CreateTableOne() function requires you to identifty a dataset. The “strata =” arguement is used to specify the variable you would like to stratify your table by (i.e., often your outcome). If you want results for the overall sample, you do not need to include the “strata =” argument.

Once the CreateTableOne() function creates the table, the print() function is used to create the table. There are a variety of options available, and you can look at the help documentation to determine which of these you would like to use. Finally, as.data.frame() is used to save the table into a format that could then be saved as a .csv file using the write_csv() function.

Table <- CreateTableOne(data = merged_data, strata = "outcome_var")

library(dplyr)
Frequencies <- print(Table, showAllLevels = TRUE, missing = FALSE, printToggle = FALSE, cramVars = TRUE, varLabels = TRUE, explain = FALSE) %>%
  as_tibble(rownames = NA) 

Table <- as.data.frame(Frequencies)

write_csv(Table, "descriptives_by_gender.csv")

This table will now allow you to review your dataset, helping you identify potential changes that you may need to make as part of the data manipulation stage of your analysis.

4. Data manipulation

This section introduces you to the following concepts and their corresponding functions:

Recoding Variables: <, <=, >, >=, ==, !=, | , &
Mathematical operations: +, -, /, *, exp(), sum(), sqrt(), mean()

Getting Started

To get started, lets read in the burnout dataset we created last week.

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data

Recoding Variables

Now that the data is read in, lets take a look at the intersection of two variables: Gender and Relationship status. Remember we can use the table() to create a cross tab.

table(burnout_data$gender, burnout_data$relationship_status) #cross tab for gender and relationship status

##             
##              In a relationship Single and dating Single and not dating
##   Man                      553               307                   282
##   Non-binary                22                16                    19
##   Woman                    579               204                   435

The function above presents these variables together. But imagine that we wanted to analyze these data as a single variable with four levels: Men in a relationship, women in a relationship, single men, and single women. To do this we would have to edit and combine the variable. Lets start by simplifying the relationship status variable. To do so, we will use the [] symbol to tell R that we want to conditionally recode the data. The first step, however, is to create a new variable, as we never want to over-write our data in case we must make edits later. I will create an empty variable called rel_di in the burnout_data dataframe:

burnout_data$rel_di <- NA #Create empty variable

Now that I have the rel_di variable created, I will fill it conditionally based on the old variable, as follows:

burnout_data$rel_di[burnout_data$relationship_status == "Single and dating"] <- "Single" #Recode Single and dating as just single
burnout_data$rel_di[burnout_data$relationship_status == "Single and not dating"] <- "Single" #Recode Single and not-dating as just single
burnout_data$rel_di[burnout_data$relationship_status == "In a relationship"] <- "Not Single" #Recode In a relationship as not single
table(burnout_data$rel_di, useNA = "always") #Check to make sure it worked

## 
## Not Single     Single       <NA> 
##       1154       1263         31

As you can see above, I used the == sign to tell R that where the old relationship_status was equal to “Single and dating”, I wanted to assign “Single” to the new variable. The brackets allowed me to create that conditional recode. #I could have also coded the data using the != symbols, which indicates “Does not equal.” Lets give that a try:

burnout_data$rel_di_version_two <- NA #Create empty variable
burnout_data$rel_di_version_two[burnout_data$relationship_status != "In a relationship"] <- "Single" #Recode "Single and not dating" and "Single and dating" as "single"
burnout_data$rel_di_version_two[burnout_data$relationship_status == "In a relationship"] <- "Not Single" #Recode "In a relationship" as "not single"

Lets compare these two versions:

table(burnout_data$rel_di_version_two, burnout_data$rel_di, useNA = "always") #Compare to new variables

##             
##              Not Single Single <NA>
##   Not Single       1154      0    0
##   Single              0   1263    0
##   <NA>                0      0   31

Note that the values that were originally NA were ignored, this is because we use a special function to deal with missing data: The is.na() function. Here is how it works

burnout_data$rel_di_version_three[is.na(burnout_data$relationship_status)] <- "9999: Missing" #Assign NA the "9999: Missing" value

table(burnout_data$rel_di_version_three, useNA = "always") #Make sure it worked

## 
## 9999: Missing          <NA> 
##            31          2417

You can see that the NA’s have now been reassigned using the is.na() function. I can also insert an ! point in from the is.na function, to specify when I want to recode every value that is not missing:

burnout_data$rel_di_version_three[!is.na(burnout_data$rel_di_version_two)] <- "4444: Not Missing" #Assign all non missing variables as missing
table(burnout_data$rel_di_version_three) #Make sure it worked!

## 
## 4444: Not Missing     9999: Missing 
##              2417                31

Now that we have a simplified relationship status variable, lets create a new variable that combines relationship and gender:

burnout_data$rel_status_and_gender <- NA #Create an empty variable

To combine variables we need the | symbol and the & symbol. | indicates “or” and & indicates “and”. We can use these to stack conditions as we recode:

burnout_data$rel_status_and_gender[burnout_data$gender == "Man" & 
                                     burnout_data$rel_di == "Single"] <- "Single Men" #Create Single Men level

burnout_data$rel_status_and_gender[burnout_data$gender == "Woman" & 
                                     burnout_data$rel_di == "Single"] <- "Single Women" #Create Single Women level

burnout_data$rel_status_and_gender[burnout_data$gender == "Non-binary" & 
                                     burnout_data$rel_di == "Single"] <- "Single Non-binary folk" #Create Single Women level

burnout_data$rel_status_and_gender[burnout_data$gender == "Man" & 
                                     burnout_data$rel_di == "Not Single"] <- "Non-single Men" #Create Single Men level

burnout_data$rel_status_and_gender[burnout_data$gender == "Woman" & 
                                     burnout_data$rel_di == "Not Single"] <- "Non-single Women" #Create Single Women level

burnout_data$rel_status_and_gender[burnout_data$gender == "Non-binary" & 
                                     burnout_data$rel_di == "Not Single"] <- "Non-single Non-binary folk" #Create Single Women level

table(burnout_data$rel_status_and_gender) #Make Sure it worked

## 
##             Non-single Men Non-single Non-binary folk 
##                        553                         22 
##           Non-single Women                 Single Men 
##                        579                        589 
##     Single Non-binary folk               Single Women 
##                         35                        639

We now have a variable that represents both relationship status and gender, lets take a look at the % of people who are burnt out in each category

prop.table(table(burnout_data$BurnoutScore_groups, burnout_data$rel_status_and_gender), margin = 2) #Cross tab burnout and gender/rel status

##    
##     Non-single Men Non-single Non-binary folk Non-single Women Single Men
##   0      0.7528090                  0.6190476        0.7051510  0.6951872
##   1      0.2471910                  0.3809524        0.2948490  0.3048128
##    
##     Single Non-binary folk Single Women
##   0              0.5161290    0.6130081
##   1              0.4838710    0.3869919

You can see in this data that 24.7% of men in relationships, 29.5% of women in relationships, 38.1% of non-binary folk in relationships, 30.5% of single men, 38.7% of single women, and 48.4% of single non-binary folks were burnt out.

Now imagine, that you wanted to consider the effect that age was playing in burnout within these groups. To do so we would want to take a look at the age variable.

class(burnout_data$age) #Check class

## [1] "integer"

summary(burnout_data$age) #Get summary

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    16.0    27.0    33.0    38.7    48.0    89.0

As you can see the median age was 33, and the data are stored as an “integer” Lets say that based on this data, you wanted to know whether the patterns shown above differed for those older than 33 compared to those younger than 33. The first step we’d need to do was to create a new age variable. To do so, we can use the greater than (>), greater than or equal to (>=). less than (<) and less than or equal to (<=) symbols and conditionally recode the data, as follows:

burnout_data$age_di_at_33 <- NA #Create an empty variable
burnout_data$age_di_at_33[burnout_data$age >= 33] <- "33 years or older"
burnout_data$age_di_at_33[burnout_data$age < 33] <- "Younger than 33"
table(burnout_data$age_di_at_33, burnout_data$age) #make sure it worked

##                    
##                      16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
##   33 years or older   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   Younger than 33     3   4  15   8  30  23  48  80  57 154 138  61 154  96 159
##                    
##                      31  32  33  34  35  36  37  38  39  40  41  42  43  44  45
##   33 years or older   0   0  59  50 114  86  30  47  37  46  27  28  20  15  35
##   Younger than 33    58 114   0   0   0   0   0   0   0   0   0   0   0   0   0
##                    
##                      46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
##   33 years or older  17  18  11  11  26  18  20  17  21  24  32  20  33  20  29
##   Younger than 33     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##                    
##                      61  62  63  64  65  66  67  68  69  70  71  72  73  74  75
##   33 years or older  22  27  19  23  26  18  18  24  18  21  16  24  15  12   6
##   Younger than 33     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##                    
##                      76  77  78  79  80  81  82  83  85  88  89
##   33 years or older   6  11   6   5   3   3   4   1   1   1   5
##   Younger than 33     0   0   0   0   0   0   0   0   0   0   0

Note that we could have created more groups if we had wanted to!, we could have broken down the data according to quartiles (Younger than 27, 27-32, 33-47, 48+):

burnout_data$age_quartiles <- NA #Create an empty variable
burnout_data$age_quartiles[burnout_data$age < 27] <- "Younger than 27"
burnout_data$age_quartiles[burnout_data$age >= 27 & 
                             burnout_data$age <= 32] <- "27-32"
burnout_data$age_quartiles[burnout_data$age >= 33 & 
                             burnout_data$age <= 47] <- "33-47"
burnout_data$age_quartiles[burnout_data$age >= 48] <- "48+"
table(burnout_data$age_quartiles, burnout_data$age) #make sure it worked

##                  
##                    16  17  18  19  20  21  22  23  24  25  26  27  28  29  30
##   27-32             0   0   0   0   0   0   0   0   0   0   0  61 154  96 159
##   33-47             0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   48+               0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   Younger than 27   3   4  15   8  30  23  48  80  57 154 138   0   0   0   0
##                  
##                    31  32  33  34  35  36  37  38  39  40  41  42  43  44  45
##   27-32            58 114   0   0   0   0   0   0   0   0   0   0   0   0   0
##   33-47             0   0  59  50 114  86  30  47  37  46  27  28  20  15  35
##   48+               0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   Younger than 27   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##                  
##                    46  47  48  49  50  51  52  53  54  55  56  57  58  59  60
##   27-32             0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   33-47            17  18   0   0   0   0   0   0   0   0   0   0   0   0   0
##   48+               0   0  11  11  26  18  20  17  21  24  32  20  33  20  29
##   Younger than 27   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##                  
##                    61  62  63  64  65  66  67  68  69  70  71  72  73  74  75
##   27-32             0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   33-47             0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##   48+              22  27  19  23  26  18  18  24  18  21  16  24  15  12   6
##   Younger than 27   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
##                  
##                    76  77  78  79  80  81  82  83  85  88  89
##   27-32             0   0   0   0   0   0   0   0   0   0   0
##   33-47             0   0   0   0   0   0   0   0   0   0   0
##   48+               6  11   6   5   3   3   4   1   1   1   5
##   Younger than 27   0   0   0   0   0   0   0   0   0   0   0

Lets look at the cross tabs between these variables and the other two variables we have been working with:

table(burnout_data$age_quartiles, burnout_data$rel_status_and_gender) #Cross tab burnout and gender/rel status

##                  
##                   Non-single Men Non-single Non-binary folk Non-single Women
##   27-32                      158                          3              146
##   33-47                      182                         10              161
##   48+                        144                          7              199
##   Younger than 27             69                          2               73
##                  
##                   Single Men Single Non-binary folk Single Women
##   27-32                  182                      8          140
##   33-47                  126                     14          122
##   48+                     71                      4          188
##   Younger than 27        210                      9          189

table(burnout_data$age_di_at_33, burnout_data$rel_status_and_gender) #Cross tab burnout and gender/rel status

##                    
##                     Non-single Men Non-single Non-binary folk Non-single Women
##   33 years or older            326                         17              360
##   Younger than 33              227                          5              219
##                    
##                     Single Men Single Non-binary folk Single Women
##   33 years or older        197                     18          310
##   Younger than 33          392                     17          329

table(burnout_data$age_quartiles, burnout_data$BurnoutScore_groups) #Cross tab burnout and gender/rel status

##                  
##                     0   1
##   27-32           431 190
##   33-47           404 196
##   48+             403 190
##   Younger than 27 377 165

table(burnout_data$age_di_at_33, burnout_data$BurnoutScore_groups) #Cross tab burnout and gender/rel status

##                    
##                       0   1
##   33 years or older 807 386
##   Younger than 33   808 355

You can see that for the four level age breakdown, some of the cell counts get quite small when looking at gender, so preceding with the two level age variable might be better (though sample sizes for the non-binary folk are still quite small).

To now look at burnout groups among those under and over 33, we can use the [] to conditionally select the data where

prop.table(table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "33 years or older"], 
                 burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "33 years or older"]), margin = 2) #older xtab

##    
##     Non-single Men Non-single Non-binary folk Non-single Women Single Men
##   0      0.7738854                  0.6875000        0.7089337  0.6054054
##   1      0.2261146                  0.3125000        0.2910663  0.3945946
##    
##     Single Non-binary folk Single Women
##   0              0.4666667    0.5939597
##   1              0.5333333    0.4060403

prop.table(table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "Younger than 33"], 
                 burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "Younger than 33"]), margin = 2) #younger xtab

##    
##     Non-single Men Non-single Non-binary folk Non-single Women Single Men
##   0      0.7227273                  0.4000000        0.6990741  0.7393617
##   1      0.2772727                  0.6000000        0.3009259  0.2606383
##    
##     Single Non-binary folk Single Women
##   0              0.5625000    0.6309148
##   1              0.4375000    0.3690852

Looking at these data, one thing that sticks out to me is that the proportion of non-single non-binary folk who are burnt out nearly doubles when looking at those younger than 33 compared to those older than 33. To make sure this data, makes sense, it’s always a good idea to double check the underlying frequencies:

table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "33 years or older"], 
                 burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "33 years or older"]) #older xtab

##    
##     Non-single Men Non-single Non-binary folk Non-single Women Single Men
##   0            243                         11              246        112
##   1             71                          5              101         73
##    
##     Single Non-binary folk Single Women
##   0                      7          177
##   1                      8          121

table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "Younger than 33"], 
                 burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "Younger than 33"]) #younger xtab

##    
##     Non-single Men Non-single Non-binary folk Non-single Women Single Men
##   0            159                          2              151        278
##   1             61                          3               65         98
##    
##     Single Non-binary folk Single Women
##   0                      9          200
##   1                      7          117

As you can see from this quality check, it seems the jump in burnout is based on only a few observations. Its always important to be aware that when you’re slicing up data in this way that it can create small sample sizes that make data analyses less reliable.

Mathematical Operations

So far, we’ve covered how to edit variables. One of the common reasons that you have to edit variables, is when you are trying to calculate scale scores. Many of the variables included in this dataset consist of scale items, that need to be calculated to create scale scores. In most cases, I have already done that for the CSCS data, but nevertheless, many datasets that you might work with will not already have these scores calculated, so it is important to learn how to do.

Lets recreate the burnout score. The burnout scale consists of 10 items:

“burnout_tired” “burnout_disappointed” “burnout_hopeless” “burnout_trapped” “burnout_helpless” “burnout_depressed” “burnout_sick” “burnout_worthless” “burnout_difficulty_sleeping” “burnout_had_it”

If you look at these variables, they are scores from “never” to “always”:

table(burnout_data$burnout_tired) #check

## 
## Almost never       Always        Never        Often       Rarely    Sometimes 
##          227           97          146          455          504          798 
##   Very Often 
##          216

In order to create a scale score, I will need to recode these variables as numeric variables and the sum their values together to create a final scale score. To do so, I will recode the values using the same variable editing skills taught above:

burnout_data$burnout_tired[burnout_data$burnout_tired == "Never"] <- 1 #Assign Never a value of 1
burnout_data$burnout_tired[burnout_data$burnout_tired == "Almost never"] <- 2  #Assign Almost never a value of 2
burnout_data$burnout_tired[burnout_data$burnout_tired == "Rarely"] <- 3  #Assign Rarely a value of 3
burnout_data$burnout_tired[burnout_data$burnout_tired == "Sometimes"] <- 4  #Assign Sometimes a value of 4
burnout_data$burnout_tired[burnout_data$burnout_tired == "Often"] <- 5  #Assign Often a value of 5
burnout_data$burnout_tired[burnout_data$burnout_tired == "Very Often"] <- 6  #Assign Very Often a value of 6
burnout_data$burnout_tired[burnout_data$burnout_tired == "Always"] <- 7  #Assign Always a value of 7
table(burnout_data$burnout_tired) #Make sure it worked

## 
##   1   2   3   4   5   6   7 
## 146 227 504 798 455 216  97

As you can see, this would take a few lines of code to do for each of the 10 variables. To make this go a little bit faster, I am going to use the mutate_at, vars, and case_when functions. The intricacies of this code are beyond the scope of this course, but I will provide comments on each line to help you follow.

library(dplyr)
burnout_data <- mutate_at(.tbl = burnout_data, #Use the mutate function to recreate the burnout data
                  .vars = vars(burnout_disappointed:burnout_had_it), #Edit the vars in the dataset from "burnout_disappointed" to "burnout_had_it"
                  .funs = ~case_when(. == "Never" ~ "1", #When the old variable says "Never", assign a value of 1
                                     . == "Almost never" ~ "2", #When the old variable says "Almost never", assign  a value of 2
                                     . == "Rarely" ~ "3", #When the old variable says "Rarely", assign  a value of 3
                                     . == "Sometimes" ~ "4", #When the old variable says "Sometimes", assign  a value of 4
                                     . == "Often" ~ "5", #When the old variable says "Often", assign  a value of 5
                                     . == "Very Often" ~ "6", #When the old variable says "Very Often", assign  a value of 6
                                     . == "Always" ~ "7", #When the old variable says "Always", assign  a value of 7
                                     TRUE ~ .)) #Everywhere else just leave it whatever it was in the old dataset

table(burnout_data$burnout_disappointed, useNA = "always") #Check the first variable to make sure it worked

## 
##    1    2    3    4    5    6    7 <NA> 
##  132  304  606  770  361  194   66   15

Don’t worry, you wont be tested on the intricacies of the mutate_at function, but as you can see, there is usually an easier more efficient way to do something in R. In this course, our goal is to teach the most straightforward way. But Learning new things is a great way to make your R coding experience better.

The next step in calculating the scale scores is to add the individual variable scores together. Before we do this, lets make sure they are numeric variables (remember they were originally stored as characters)

class(burnout_data$burnout_disappointed) #check class

## [1] "character"

burnout_data$burnout_tired <- as.numeric(burnout_data$burnout_tired) #recode as numeric
burnout_data$burnout_disappointed <- as.numeric(burnout_data$burnout_disappointed) #recode as numeric
burnout_data$burnout_hopeless <- as.numeric(burnout_data$burnout_hopeless) #recode as numeric
burnout_data$burnout_trapped <- as.numeric(burnout_data$burnout_trapped) #recode as numeric
burnout_data$burnout_helpless <- as.numeric(burnout_data$burnout_helpless) #recode as numeric
burnout_data$burnout_depressed <- as.numeric(burnout_data$burnout_depressed) #recode as numeric
burnout_data$burnout_sick <- as.numeric(burnout_data$burnout_sick) #recode as numeric
burnout_data$burnout_worthless <- as.numeric(burnout_data$burnout_worthless) #recode as numeric
burnout_data$burnout_difficulty_sleeping <- as.numeric(burnout_data$burnout_difficulty_sleeping) #recode as numeric
burnout_data$burnout_had_it <- as.numeric(burnout_data$burnout_had_it) #recode as numeric
class(burnout_data$burnout_disappointed) #make sure it worked

## [1] "numeric"

Now to add these numeric variables together, we can apply mathmatical transformations (e.g., +, -, *, /).

burnout_data$NewScore <- burnout_data$burnout_tired +
  burnout_data$burnout_disappointed +
  burnout_data$burnout_hopeless +
  burnout_data$burnout_trapped +
  burnout_data$burnout_helpless +
  burnout_data$burnout_depressed +
  burnout_data$burnout_sick +
  burnout_data$burnout_worthless +
  burnout_data$burnout_difficulty_sleeping +
  burnout_data$burnout_had_it
summary(burnout_data$NewScore) #check to make sure it worked

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    10.0    27.0    35.0    34.2    41.0    70.0      92

summary(burnout_data$BurnoutScore) #compare with old score

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.70    3.50    3.42    4.10    7.00      92

As you can see there is one big difference between the old score and the new score. The new score is 10 times bigger! This is because the burnout inventory is not actually calculated as a sum of the individual items, it is an average of score items. This is an easy fix, we simply divide by 10!

burnout_data$NewScore  <- burnout_data$NewScore / 10 #Calculate average of scale items, by dividing by number of items
summary(burnout_data$NewScore) #check to make sure it worked

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.70    3.50    3.42    4.10    7.00      92

While not shown here, you can use other mathematical functions, such as exp() to exponentiate, sum() to add values together, sqrt() to take the square root, mean() to get the average.

?exp
?sum
?sqrt
?mean

5. Correlation

This section introduces you to the following concepts and their corresponding functions:

Correlation: chisq.test() cor.test(), plot(), jitter(), rcorr(), cor(), corrplot(), chart.Correlation()

Getting Started

To get started, lets read in our burnout dataset.

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data

Correlation

Correlation is typically used when comparing the association between two numeric variables. In our dataset we have a few continuous variables to work with. Lets start with looking at the association between burnout and loneliness scores. To do so, we will first want to inspect the variables and ensure they are in the correct format (i.e., numeric):

class(burnout_data$BurnoutScore) #Make sure they are numeric

## [1] "numeric"

class(burnout_data$UCLA3Score) #Make sure they are numeric

## [1] "integer"

burnout_data$UCLA3Score <- as.numeric(burnout_data$UCLA3Score) #Convert to numeric

Next, we can use the plot() function to visually inspect the relationship between the variables:

plot(burnout_data$BurnoutScore, burnout_data$UCLA3Score) #Plot the correlation between burnout scores and UCLA loneliness scores.

As you can see the top left part of the plot and bottom right part of the plot are quite sparse. You can imagine that if you were to draw a line through this data, you would probably draw it going from the bottom left to the top right. This indicates a positive association. However, with many scales that have a limited range, it can be difficult to ascertain an association visually because observations are overlapping. One strategy you can do is nest your variables within a jitter() function. The jitter() function adds a little bit of noise to your data so that all your observations are not overlapping one another. This can help you see trends a little bit clearer:

plot(jitter(burnout_data$BurnoutScore), jitter(burnout_data$UCLA3Score)) #Plot the correlation between burnout scores and UCLA loneliness scores using Jitter.

While in our data the jitter() function makes it easier to draw a line through the “bulky” party of the data. In other situations it might still be difficult plotting a line. As such, it is useful to calculate statistics that can help tell us the strength of an association objectively.

The cor.test() function can be used to calculate the correlation between two continuous variables

cor.test(burnout_data$BurnoutScore, burnout_data$UCLA3Score)

## 
##  Pearson's product-moment correlation
## 
## data:  burnout_data$BurnoutScore and burnout_data$UCLA3Score
## t = 31.193, df = 2336, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5130031 0.5702669
## sample estimates:
##       cor 
## 0.5422644

Correlation ranges from -1 (Strong negative association) to +1 (Strong Positive Association):

Correlation	Interpretation
0.00 to 0.19	Very Weak
0.20 to 0.39	Weak
0.40 to 0.59	Moderate
0.60 to 0.79	Strong
0.80 to 1.00	Very Strong

The correlation here is .5422 and the p-value is very small (i.e, 2.2e-16) Therefore we find that there is a moderately strong, statistically-signficant correlation between loneliness and burn out scores.

Correlation Matrices

When loooking at correlations for many variables as once, it is useful to create a correlation matrix. A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. To create a correlation matrix we use the corrplot() function, within the corrplot package

install.packages("corrplot", repos = "http://cran.us.r-project.org") #Install the corrplot package

## package 'corrplot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library(corrplot) #read in the corrplot library

The first step to creating a correlation is to subset into a new dataframe the variables you are wanting to build a correlation matrix for. Otherwise, your correlation plot would be way to big!

keepvars <- c("BurnoutScore", #Select numeric variables you want included in your correlation matrix
              "UCLA3Score",
              "zimet_overall_score",
              "discrimination_score",
              "attachment_secure_score",
              "attachment_avoidant_score",
              "attachment_anxious_score")
corr_data <- burnout_data[keepvars] #Create new dataframe with the variables you selected

Now that you have the variables selected, you need to create and plot a correlation matrix using the cor() and corrplot() functions respectively.

corr_matrix<-cor(corr_data, use = "pairwise.complete.obs") #Create Correlation Matrix, removing any data with missing for each  correlation.
corrplot(corr_matrix, is.corr = FALSE, method = "number") #Plot correlation matrix, and show results as "numbers"

This correlation matrix shows you how strongly the variables in your dataset are correlated using the colour and number. You can choose the method to display the plot using other symbols as well.

Another function you can use to create a correlation plot is the chart.Correlation() function within the PerformanceAnalytics packages. This method builds and plots the correlation plot in one step and has lots more useful graphs (histograms, individual correlation plots, etc), shows statistical significance of each test (the stars), and the size of effect. Note that the variables are listed in the middle square over top the historgram

install.packages("PerformanceAnalytics", repos = "http://cran.us.r-project.org") #Load in PerformanceAnalytics package

## package 'PerformanceAnalytics' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library(PerformanceAnalytics) #Use PerformanceAnalytics library
chart.Correlation(corr_data, histogram=TRUE) #Create plots

Note that there are three kinds of correlation coefficients that are widely used. Most of the correlation functions allow you to specify which one: The Default is the Pearsons correlation coefficient, which requires the variables to be normally distributed, linearly related, and have equal errors across all levels (i.e., homoscedastic). Next week, when we talk about linear regression, We will look at ways of checking whether your data meat these assumptions. In cases where these assumptions are not likely met, you can use the Spearman rank correlation, which is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. Examples of using the spearman are provided below

chart.Correlation(corr_data, histogram=TRUE, method = "spearman") #Create correlation matrix using the spearman correlation

cor.test(burnout_data$BurnoutScore, burnout_data$UCLA3Score, method = "spearman") #Create Correlation between two variables with the spearman method argument

## 
##  Spearman's rank correlation rho
## 
## data:  burnout_data$BurnoutScore and burnout_data$UCLA3Score
## S = 1007256512, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5271124

6. Linear Regression

This section introduces you to the following concepts and their corresponding functions:

Linear Regression: lm(), summary(), relevel(), plot(), shapiro.test(), ks.test(), VIF(), varImp()

The last chapter reviewed the concept of correlation. This chapter focuses on linear regression. As you will see, linear regression is based on many of the same ideas as correlation. However, linear regression is advantageous because you can begin to work with multiple variables and you can mix variable types quite easily. Linear regression is used when you have a numeric outcome, such as the Burnout Score we have been working with.

Getting Started

To get started with linear regression, lets read in our burnout dataset.

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data

Linear Regression

Now that we have our dataset, lets jump right in an build a linear regression model by creating an object called “model” using the lm() function.

?lm #Get information about the lm() function.

The main pieces of the lm model are the formula and dataset arguments. A formula is written using the ~ as an equals sign (because = and == are already used for other processes in R). When writing a formula, the variable on the left side of the ~ is your outcome. For a linear regression model it must be a numeric variable. The variables on the right side of the ~ are your explanatory factors. These are stratified by + signs. Lets give it a try to model whether burnout and gender are related:

model_1 <- lm(formula = BurnoutScore ~ gender, data = burnout_data) #Create a linear regression model object
broom::tidy(model_1) # get results for the model object

## # A tibble: 3 × 5
##   term             estimate std.error statistic    p.value
##   <chr>               <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)         3.30     0.0327    101.   0         
## 2 genderNon-binary    0.490    0.152       3.22 0.00129   
## 3 genderWoman         0.221    0.0455      4.86 0.00000126

As you can see, the model provides regression coefficients (etimate), standard errors, t-values, and p-values (pr(>|t|)).

The intercept is listed along with two coefficients.

P-values and coefficients in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p-values for the coefficients indicate whether these relationships are statistically significant. A p-value less than 0.05 is typically considered to be statistically significant. The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease. The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable.

Assigning Reference Levels

You may have noticed that the third coefficient for men, is missing. This is because it is the referent level. You can change the referent level of a factor variable, as follows:

class(burnout_data$gender)

## [1] "character"

burnout_data$gender <- as.factor(burnout_data$gender)
burnout_data$gender <- relevel(x = burnout_data$gender, ref = "Woman")

Now if we run the model again, you will see that “woman” is the referent level:

model_2 <- lm(formula = BurnoutScore ~ gender, data = burnout_data) #Create a linear regression model object
broom::tidy(model_2) # get results for the model object

## # A tibble: 3 × 5
##   term             estimate std.error statistic    p.value
##   <chr>               <dbl>     <dbl>     <dbl>      <dbl>
## 1 (Intercept)         3.52     0.0316    111.   0         
## 2 genderMan          -0.221    0.0455     -4.86 0.00000126
## 3 genderNon-binary    0.269    0.152       1.77 0.0767

When selecting a referent, you should choose one that theoretically makes sense or sometimes if we don’t particularly care, we just choose the largest group.

In the model we just created our results show that for comparing men to women, burnout scores are -0.2209 points lower. We can compare this with the descriptive results, where there is a 0.2209 difference in the mean burnout scores.

summary(burnout_data$BurnoutScore[burnout_data$gender == "Man"]) #Burnout Score summary is 3.297 for men

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.600   3.400   3.297   4.000   7.000      47

summary(burnout_data$BurnoutScore[burnout_data$gender == "Woman"]) #Burnout Score summary is 3.518 for woman

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   2.800   3.500   3.518   4.200   7.000      40

3.518-3.297 #Calculate difference in means for women and men

## [1] 0.221

summary(burnout_data$BurnoutScore[burnout_data$gender == "Non-binary"]) #Burnout Score summary is 3.518 for woman

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   1.000   3.150   3.800   3.787   4.475   6.900       5

3.518-3.787  #Calculate difference in means for women and non-binary folk

## [1] -0.269

As you can see, the coefficients bare a resemblance to the data. Its important to note however, that only the coefficient for men was statistically significant (p < 0.05), whereas the coefficient for non-binary folks was near, but not statistically significant (p > 0.05). This is probably because of the small sample size of non-binary folks.

Multivariable Linear Regression

The simple regression model can be used to control for confounding variables. Lets build a multivariable regression model controlling for age, loneliness, social support, discrimination, and other personality traits we worked with last week:

model_3 <- lm(formula = BurnoutScore ~ gender + age + UCLA3Score + zimet_overall_score + discrimination_score + tipi_extraversion_score + tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + attachment_avoidant_score + attachment_anxious_score, data = burnout_data) #Create a linear regression model object
broom::tidy(model_3) # View Results

## # A tibble: 15 × 5
##    term                             estimate std.error statistic  p.value
##    <chr>                               <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                     3.06        0.514     5.95    5.02e- 9
##  2 genderMan                      -0.0326      0.0862   -0.378   7.06e- 1
##  3 genderNon-binary               -0.0402      0.332    -0.121   9.04e- 1
##  4 age                             0.000234    0.00321   0.0731  9.42e- 1
##  5 UCLA3Score                      0.236       0.0298    7.92    1.62e-14
##  6 zimet_overall_score            -0.146       0.0402   -3.64    3.04e- 4
##  7 discrimination_score            0.00890     0.00471   1.89    5.94e- 2
##  8 tipi_extraversion_score         0.00813     0.0193    0.421   6.74e- 1
##  9 tipi_agreeable_score           -0.0381      0.0215   -1.77    7.74e- 2
## 10 tipi_conscientiousness_score    0.0000585   0.0199    0.00294 9.98e- 1
## 11 tipi_emotional_stability_score -0.0466      0.0208   -2.24    2.56e- 2
## 12 tipi_openness_score             0.0482      0.0204    2.36    1.87e- 2
## 13 attachment_secure_score        -0.0202      0.0114   -1.77    7.68e- 2
## 14 attachment_avoidant_score      -0.00337     0.0119   -0.282   7.78e- 1
## 15 attachment_anxious_score        0.0154      0.0131    1.17    2.41e- 1

Model Interpretation

Regression coefficients in this model are interpreted independently – they assume all other variables in the model are held constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others. We would interpret these findings as suggesting that burnout is a function of loneliness, and social support scores and personality traits – particularly emotional stability and oppenness to new experiences.

One other factor to take notice of is the change in the R-squared between our simple model and our multivariable model. R-squared is a measure of model fit in linear regression – it is the proportion of the variation in the outcome that is predictable from the explanatory variables. The Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases when the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected. In our simple model the adjusted R-squared was 0.01169 and in the multivariable model the R squared was 0.3618. This suggests that our current model predicts about 36% of the variation in Burnout scores.

The output for your model also includes a measure called the residual standard error. In the simple regression model, the residual standard error was 1.091 and in the multivariable model, the RSE was 0.8974. A smaller RSE indicates better model fit because the RSE is the average variation of points around the fitted regression line. Basically it tells you that your line is better

You can use the RSE and R-squared to help you choose between models. The goal is to have higher R-squared and lower RSE. Sometimes this encourages us to put more and more variables into a model. However, generally speaking, we should prefer smaller less complex models to more complex ones – so if you are not seeing improvement you can opt for the more parsimonious (or simple) model.

Model Fit and Assumptions of Linear Regression

In addition to checking model fit, you also have to check to make sure that your model meets the assumptions of a linear regression model, which are: 1) Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear. 2) Normality of residuals. The residual errors are assumed to be normally distributed. 3) Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity) 4) Independence of residuals error terms. 5) No multi-colinearity of variables included in model. Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.

Linearity Assumption

Lets start with checking the linearity assumption, which can be checked using the residuals vs. fitted plot, we can call up this plot using a simple plot function:

plot(model_3, 1) #The linearity assumption is met when the red line is approximately horizontal and centred on 0.

Homogeneity of Variance

Homogeneity of variance can be checked using the scale-location plot, which can be called using the 3rd type of plot called by the plot function:

plot(model_3, 3) #The homogeneity of variance assumption is met when the red line is approximately horizontal. Large swings indicate that the variance changes as the values of the outcome changes, which indicates heterogeneity.

Normality of Residuals

The normality assumption can be checked using the normality of residuals plot, which can be called using the following function:

plot(model_3, 2) #In this plot, you want to see the data approximately on a diagonal line.

You can also visually inspect your outcome to get a sense of whether your outcome is normally distributed. However, you should note that the assumption for linear regression is the normality of the residuals, and not necessarily the normality of your variable outcome. Of course, these frequently go together and so we like to know whether our outcome is distributed normally.

hist(burnout_data$BurnoutScore) #Plot a histogram: As you can see there is some skew from the high burnout end, however, overall I think its a pretty healthy bell curve.

Sometimes a visual inspection wont cut it and you can check out a few quantitative measures, such as shapiro wilk test or kolmogorov-smirnov test.

shapiro.test(burnout_data$BurnoutScore) #Run the shapiro-wilk test;

## 
##  Shapiro-Wilk normality test
## 
## data:  burnout_data$BurnoutScore
## W = 0.99283, p-value = 2.327e-09

ks.test(burnout_data$BurnoutScore, "pnorm") #run the KS test; pnorm indicates we are testing against a normal distribution

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  burnout_data$BurnoutScore
## D = 0.89342, p-value < 2.2e-16
## alternative hypothesis: two-sided

Note that in both of the above tests, the p-values are statistically significant, indicating that the data are non-normal. This is because we have a large sample size, so detecting deviations from normality is really easy. As such, we usually do not rely on these tests solely. The histogram and Q-Q plots in the present case were satisfactory. With enough experience you will “know it when you see it” when it comes to assessing normality.

Outliers

Next we want to check for outliers. The plot() function can be used to identify these:

plot(model_3, 5) #Observations with standardized residuals greater than or less than 3 may be outliers. You could try modeling the data by omitting those variables. They are numbered in the dataset so they can be easily removed by nesting a list within the bracket symbol and using the - sign to say I do not want these variables. Remember that from the first class, the content in brackets to the left of the comma refers to rows and to the right of the comma refers to columns

burnout_data_no_outliers <- burnout_data[-c(370, 392, 1063)] #

Lets try the model without the outliers:

model_4 <- lm(formula = BurnoutScore ~ gender + age + UCLA3Score + zimet_overall_score + discrimination_score + tipi_extraversion_score + tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + attachment_avoidant_score + attachment_anxious_score, data = burnout_data_no_outliers) #Create a linear regression model object
broom::tidy(model_4) # View Results

## # A tibble: 15 × 5
##    term                             estimate std.error statistic  p.value
##    <chr>                               <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)                     3.06        0.514     5.95    5.02e- 9
##  2 genderMan                      -0.0326      0.0862   -0.378   7.06e- 1
##  3 genderNon-binary               -0.0402      0.332    -0.121   9.04e- 1
##  4 age                             0.000234    0.00321   0.0731  9.42e- 1
##  5 UCLA3Score                      0.236       0.0298    7.92    1.62e-14
##  6 zimet_overall_score            -0.146       0.0402   -3.64    3.04e- 4
##  7 discrimination_score            0.00890     0.00471   1.89    5.94e- 2
##  8 tipi_extraversion_score         0.00813     0.0193    0.421   6.74e- 1
##  9 tipi_agreeable_score           -0.0381      0.0215   -1.77    7.74e- 2
## 10 tipi_conscientiousness_score    0.0000585   0.0199    0.00294 9.98e- 1
## 11 tipi_emotional_stability_score -0.0466      0.0208   -2.24    2.56e- 2
## 12 tipi_openness_score             0.0482      0.0204    2.36    1.87e- 2
## 13 attachment_secure_score        -0.0202      0.0114   -1.77    7.68e- 2
## 14 attachment_avoidant_score      -0.00337     0.0119   -0.282   7.78e- 1
## 15 attachment_anxious_score        0.0154      0.0131    1.17    2.41e- 1

plot(model_4, 5) #plot the residuals vs. leverage. If the red line continues to be quite skewed, you might identify additional outliers and remove these as well by adding them to your list above. In this case, the line looks quite good.

Another type of outlier you might want to consider is the presence of influential values. This can be done by looking at cooks distance. A rule of thumb is that an observation has high influence if Cook’s distance exceeds 4/(n - p - 1), where n is the number of observations and p the number of explanatory variables. So in our case, we can identify a thresshold using the formula below

4/(497 - 13 - 1) # Thresshold is 0.008281573 for model 3, which is based on 497 observations and has 13 predictor variables.

## [1] 0.008281573

plot(model_3, 4) #As you can see, observations 392, 370, and 1063 (the same outliers identified in the last model, could be removed). Note there may be others that are below our thresshold, but we know that these are the worrying variables we should deal with first. Generally speaking, we want to limit the data we remove as then our model may become less externally valid to the real world.

Multicolinearity

The final thing we want to check for is multicolinearity, which can be done by calculating the variable inflation factor for each variable in our model. This can be done using the VIF() function from the regclass package.

install.packages("regclass", repos = "http://cran.us.r-project.org") #Install package

## package 'regclass' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library(regclass) #read library
VIF(model_3) #Calculate VIFs;

##                                    GVIF Df GVIF^(1/(2*Df))
## gender                         1.198890  2        1.046393
## age                            1.453819  1        1.205744
## UCLA3Score                     1.646890  1        1.283312
## zimet_overall_score            1.409677  1        1.187298
## discrimination_score           1.787552  1        1.336994
## tipi_extraversion_score        1.266307  1        1.125303
## tipi_agreeable_score           1.737687  1        1.318214
## tipi_conscientiousness_score   1.555833  1        1.247330
## tipi_emotional_stability_score 1.629230  1        1.276413
## tipi_openness_score            1.344348  1        1.159460
## attachment_secure_score        1.721943  1        1.312228
## attachment_avoidant_score      1.838748  1        1.356004
## attachment_anxious_score       2.370624  1        1.539683

Values from 5-10 or up indicate potential multicolinearity and could be a good sign for you to consider the variables in your model from a theoretical perspective. Note that multicolinearity is less a concern whenthe variables with high VIFs are control variables, and the variables of interest do not have high VIFs. Here’s the thing about multicollinearity: it’s only a problem for the variables that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But so long as the collinear variables are only used as control variables, and they are not collinear with your variables of interest, there’s no problem. The coefficients of the variables of interest are not affected, and the performance of the control variables as controls is not impaired. Also, note that If the proportion of cases in the reference category is small, the indicator variables will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model.

If any of the results from the test indicate violation of the assumptions, it is good to test explanatory models 1 by 1 to see which ones might be the issue. If the issue is with your outcome, you may need to convert it to a categorical variable and analyze it using logistic regression (which we will discuss next week). Alternatively, if the issue is with an explanatory variable, you might need to drop it form your model, or transform it in some way, or remove specific observations that might be impacting the results.

Variable Importance

Once you have a model that you are fairly satisfied with and you feel meets the assumptions of linear regression at least as best as you can, one additional thing you might do is calculate the relative importance of the variables in your model. There are lots of ways to do this, and little agreement as to which is best. I suggest using the varImp function, of the caret pacakge:

install.packages("caret", repos = "http://cran.us.r-project.org") #install package

## package 'caret' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("caret") #load library for package

varImp(model_3) #Run the varImp function on model 3, Larger values indicate higher contribution to the model. As we can see here, the loneliness scores and social support scores are the most important.

##                                    Overall
## genderMan                      0.377635661
## genderNon-binary               0.120859591
## age                            0.073085424
## UCLA3Score                     7.923217142
## zimet_overall_score            3.638782866
## discrimination_score           1.889829241
## tipi_extraversion_score        0.420606825
## tipi_agreeable_score           1.769478261
## tipi_conscientiousness_score   0.002939864
## tipi_emotional_stability_score 2.239350494
## tipi_openness_score            2.358682784
## attachment_secure_score        1.773075965
## attachment_avoidant_score      0.281953095
## attachment_anxious_score       1.173279397

With that, you’ve now got what you need to know to construct a linear regression model. As you have seen, linear regression is all about key assumptions. As such, more flexible models are often used. Next time we will talk about one such flexible approach, logistic regression.

7. Logistic Regression

This section introduces you to the following concepts and their corresponding functions:

Logistic regression: glm(), coef(), confint(),logitgof (), PseudoR2(), varImp(), predict(), ifelse()
Multinomial logistic regression: multinom(), tidy()
Ordinal logistic regression: polr()

In the last chater, we learned how to construct linear regression models. Logistic Regression models are another useful modeling approach. This time we will cover three types of logistic regresison models: binomial models, multinomial models, and ordinal models.

Getting Started

To get started with learning about logistic regression, lets read in our burnout dataset.

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data

Binomial Logistic Regression

Now lets explore the use of the glm() function, which is used to create a logsitic regression model.

?glm #Call help function

The glm() function allows you to construct several types of models. You can read in the help file that these models are identified using the “family =” argument. In this chapter we forcus on “binomial” family, which is used for the binomial logistic regression model.

If you encounter a scenario in which you would be required to use a different family of models (e.g., you have count data and are therefore considering a Poisson model), it may be easier to restructure the variable as a binary variable so that you can use the “binomial” option, as it is fairly flexible compared to many of these. However, it is easy to learn about the other types of models from google or other sources. Outside of linear and logistic regression models, the most commonly used models are Poisson models.

For this exercise, we want to create a binomial logistic regression model, the kind used to analyze a binary outcome variable. To do so, I will use the categorical version of the Burnout Score as the outcome and the continuous version of the loneliness scale score as the explanatory factor:

burnout_data$BurnoutScore_groups <- as.factor(burnout_data$BurnoutScore_groups) #make sure outcome is a factor
glm_model <- glm(formula = BurnoutScore_groups ~ UCLA3Score, family = binomial(link = "logit"), data = burnout_data)
broom::tidy(glm_model)

## # A tibble: 2 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)   -4.28     0.199      -21.5 9.61e-103
## 2 UCLA3Score     0.603    0.0321      18.8 1.67e- 78

The coefficient here is interpreted as for each 1-point increase in the UCLA 3 loneliness score there is a 0.60 increase in the log odds of being burnt out. Now log odds are kind of a pain, so when epidemiologists work with logistic regression models, we typically convert the log odds to an odds ratio. To do this we simply exponentiate the estimate (coefficient) using the exp() function. To call the coefficients out of the model, we could use the coef() function or you could use model$coefficients:

exp((coef(glm_model))) #Use coef() function to call results, and exp() to convert to Odds Ratio

## (Intercept)  UCLA3Score 
##  0.01378765  1.82696096

exp(glm_model$coefficients) #use $ to refer to coefficients element of object and exp() to convert to Odds Ratio

## (Intercept)  UCLA3Score 
##  0.01378765  1.82696096

We interpret the odds ratio as there being a 82.7% increase in odds of being burnt out for each 1 point increase in the UCLA 3 scores. It is also common to report an odds ratio along with a confidence interval to communicate uncertainty around your estimate. To calculate the confidence intervals, we can use the confint() function.

exp(confint(glm_model))

##                   2.5 %    97.5 %
## (Intercept) 0.009272364 0.0202388
## UCLA3Score  1.717042436 1.9476053

You can use the cbind() function to save these results:

cbind(exp((coef(glm_model))), exp(confint(glm_model))) #Bind the columns from the coef() and confint() model

##                              2.5 %    97.5 %
## (Intercept) 0.01378765 0.009272364 0.0202388
## UCLA3Score  1.82696096 1.717042436 1.9476053

You can also assign this as a dataframe and save the dataframe as a csv using the following code:

GLM_table <- cbind(exp((coef(glm_model))), exp(confint(glm_model))) #save cbind results as object
GLM_table <- as.data.frame(GLM_table) #convert object to data frame
write.csv(x = GLM_table, file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\GLM_table.csv") #Write dataset to csv on computer

Model Fit

Now that you’ve got some results, you can assess the goodness of fit for your model, using a hosmer-Lemeshow goodness of test fit using the logitgof() function, which is part of the generalhoslem package. Note that Traditional residual plots are not very helpful with logistic regression.

install.packages("generalhoslem", repos = "http://cran.us.r-project.org") #install package

## package 'generalhoslem' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("generalhoslem") #load library for package

logitgof(obs = glm_model$y, fitted(glm_model)) #run hoslem test

## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  glm_model$y, fitted(glm_model)
## X-squared = 8.6349, df = 4, p-value = 0.0709

Notice that the observed values are extracted from the outcome of the model and then the expected values are calculated using the fitted() function. Because different objects have different features, each of the examples today gets to the observed and fitted values differently.

The results of this test have a p-value of 0.0709, indicating no evidence of poor fit (but very close). This tells you that your model has adequately acceptable goodness of fit to your data, but in our case it is borderline so you may want to consider a more complex model.

In addition to testing for goodness of fit, logistic regression are also frequently evaluated based on their predictive power, similar to how we used R-squared in linear regression. However, the R-squared does not work well in logistic regression, so we employ McFadden’s Pseudo R-squared using the PseudoR2() function from the DescTools package:

install.packages("DescTools", repos = "http://cran.us.r-project.org") #install package

## package 'DescTools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("DescTools") #load library for package

PseudoR2(glm_model, "all") #run McFadden's Pseudo R-squared; #All specifies we want all model statistics.

##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
##       0.1530704       0.1516964       0.1735279       0.2436881       0.1600797 
## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
##       0.2886466       0.1834035       0.2469587       0.1850409    2469.4635165 
##             BIC          logLik         logLik0              G2 
##    2480.9776188   -1232.7317583   -1455.5304125     445.5973085

The values for a pseudo R squared tend to be considerably lower than R squared; The creator for the statistic says that values of 0.2 to 0.4 indicates good fit. In our case our score is below that, so perhaps we need to revise our model and compare the Pseudo R2 of the old and new models to see if we can improve things?

glm_model_2 <- glm(formula = BurnoutScore_groups ~ gender + age + UCLA3Score + zimet_overall_score + discrimination_score + tipi_extraversion_score + tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + attachment_avoidant_score + attachment_anxious_score, family = binomial(link = "logit"), data = burnout_data)
summary(glm_model_2)

## 
## Call:
## glm(formula = BurnoutScore_groups ~ gender + age + UCLA3Score + 
##     zimet_overall_score + discrimination_score + tipi_extraversion_score + 
##     tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + 
##     tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + 
##     attachment_avoidant_score + attachment_anxious_score, family = binomial(link = "logit"), 
##     data = burnout_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0363  -0.7954  -0.3082   0.7351   3.0283  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -3.041738   1.551997  -1.960  0.05001 .  
## genderNon-binary               -0.246547   0.962630  -0.256  0.79786    
## genderWoman                     0.049797   0.248916   0.200  0.84144    
## age                             0.003721   0.010537   0.353  0.72398    
## UCLA3Score                      0.552657   0.093435   5.915 3.32e-09 ***
## zimet_overall_score            -0.327889   0.117585  -2.789  0.00529 ** 
## discrimination_score            0.030949   0.014488   2.136  0.03266 *  
## tipi_extraversion_score         0.004050   0.059564   0.068  0.94579    
## tipi_agreeable_score           -0.120926   0.066214  -1.826  0.06781 .  
## tipi_conscientiousness_score   -0.011387   0.060180  -0.189  0.84992    
## tipi_emotional_stability_score -0.122803   0.063711  -1.927  0.05392 .  
## tipi_openness_score             0.104328   0.065496   1.593  0.11118    
## attachment_secure_score        -0.039546   0.038326  -1.032  0.30214    
## attachment_avoidant_score       0.014096   0.035458   0.398  0.69097    
## attachment_anxious_score        0.098093   0.041621   2.357  0.01843 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 635.88  on 496  degrees of freedom
## Residual deviance: 452.41  on 482  degrees of freedom
##   (1951 observations deleted due to missingness)
## AIC: 482.41
## 
## Number of Fisher Scoring iterations: 5

PseudoR2(glm_model_2, "all") #rerun McFadden's Pseudo R-squared for model_2

##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
##       0.2885335       0.2413548       0.3086856       0.4276572       0.2696255 
## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
##       0.4803630       0.3200616       0.5307165       0.3231435     482.4077537 
##             BIC          logLik         logLik0              G2 
##     545.5366041    -226.2038768    -317.9402944     183.4728352

PseudoR2(glm_model, "all") #rerun McFadden's Pseudo R-squared for model

##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
##       0.1530704       0.1516964       0.1735279       0.2436881       0.1600797 
## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
##       0.2886466       0.1834035       0.2469587       0.1850409    2469.4635165 
##             BIC          logLik         logLik0              G2 
##    2480.9776188   -1232.7317583   -1455.5304125     445.5973085

As you can the new model substantially improves over the other models. But how do we know which variables in our model are relatively more important? Simply adding variables to a model to increase its Pseudo R-squared seems like a dubious practice. I suggest using the varImp function, of the caret pacakge:

install.packages("caret", repos = "http://cran.us.r-project.org") #install package
library("caret") #load library for package

varImp(glm_model_2) #Run the varImp function on model 2, which provides the absolute value of the t-statistic for each model parameter. Variables with large absolute values are relatively more important. As we can see here, the loneliness scores and social support scores are the most important (similar to what we found when modelling our data as a linear regression model using the continuous burnout scores).

##                                   Overall
## genderNon-binary               0.25611787
## genderWoman                    0.20005748
## age                            0.35314959
## UCLA3Score                     5.91489834
## zimet_overall_score            2.78853205
## discrimination_score           2.13625902
## tipi_extraversion_score        0.06799547
## tipi_agreeable_score           1.82627405
## tipi_conscientiousness_score   0.18922460
## tipi_emotional_stability_score 1.92749758
## tipi_openness_score            1.59289367
## attachment_secure_score        1.03184554
## attachment_avoidant_score      0.39753352
## attachment_anxious_score       2.35679207

Another thing you might want to do is to test model accuracy. The model accuracy is measured as the proportion of observations that have been correctly classified. It can be calculated using the predict() function.

glm.probs <- predict(glm_model_2,type = "response") #Predict() assigns fitted model values for each participants based on the model
glm.pred <- ifelse(glm.probs > 0.5, "1", "0") #ifelse() is a function that Assigns 1 if prediction is above 50%, and 0 if prediction is below 50%
prop.table(table(glm.pred == glm_model_2$y)) #prop.table() function tells me that the correct assignment is made about 75.7% of the time (i.e., the predicted value == the actual value)

## 
##     FALSE      TRUE 
## 0.2434608 0.7565392

table(predicted = glm.pred, actual = glm_model_2$y)

##          actual
## predicted   0   1
##         0 285  77
##         1  44  91

This sounds good, but a coin flip has a 50-50 chance of being accurate. When can dive deeper, by constructing a table, you can see that 77 people who were burnt out were predicted by our model not to be (FALSE negative); and 44 of those who were not burnt out were predicted to be (FALSE positive). This suggests that your model predicts better than random, but its still not perfect. You don’t need to search for a perfect model, but you should be cautious claiming that your model is good, when it misclassifies almost one in four participants. #later in this course we will discuss more about comparing models. Sometimes the best strategy is to just build a better model than you have, because no model is perfect.

Check Multicolinearity

As with linear regression modes, binomial logistic regression models also assume that the variables are not co-linear. As such, you can use the VIF() function from the regclass package to assess this.

library("regclass") #load library for package
VIF(glm_model_2) #looks good no extremely high values

##                                    GVIF Df GVIF^(1/(2*Df))
## gender                         1.231106  2        1.053353
## age                            1.594795  1        1.262852
## UCLA3Score                     1.281699  1        1.132121
## zimet_overall_score            1.203995  1        1.097267
## discrimination_score           1.653544  1        1.285902
## tipi_extraversion_score        1.247772  1        1.117037
## tipi_agreeable_score           1.590532  1        1.261163
## tipi_conscientiousness_score   1.400683  1        1.183505
## tipi_emotional_stability_score 1.147639  1        1.071279
## tipi_openness_score            1.473251  1        1.213776
## attachment_secure_score        1.693511  1        1.301350
## attachment_avoidant_score      1.455864  1        1.206592
## attachment_anxious_score       1.988822  1        1.410256

Multinomial Logistic Regression

Binomial logistic regression is great when you have just two levels in your outcome. In our dataset, we have a variable that asked participants whether the pandemic had made them more lonely, less lonely, or whether they were about the same as they had always been. Binomial logistic regression would not work on this variable because it has more than two levels.

table(burnout_data$lonely_change_covid)

## 
##       About the same     Much less lonely     Much more lonely 
##                  634                  109                  523 
## Somewhat less lonely Somewhat more lonely 
##                  200                  964

As you can see most people felt somewhat (n = 964) or much (n = 523) more lonely. BUt sizeable numbers reported that their loneliness was about the same as it was before the pandemic. We might be curious what factors are associated with these experiences. Lets examine, whether women or men were more likely to report changes in their levels of loneliness. TO do this, we will run a multinomial logistic regression using the multinom() function from the nnet package.

install.packages("nnet", repos = "http://cran.us.r-project.org") #install package

## package 'nnet' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("nnet") #load library for package

multinom_model <- multinom(formula = lonely_change_covid ~ gender, model = TRUE, data = burnout_data)

## # weights:  20 (12 variable)
## initial  value 3910.934127 
## iter  10 value 3368.824089
## final  value 3362.277471 
## converged

summary(multinom_model)

## Call:
## multinom(formula = lonely_change_covid ~ gender, data = burnout_data, 
##     model = TRUE)
## 
## Coefficients:
##                      (Intercept) genderNon-binary   genderWoman
## Much less lonely      -1.8499554        0.2405371  0.1669536068
## Much more lonely      -0.5203069       -0.3961239  0.6168463358
## Somewhat less lonely  -1.1568775        0.1071966 -0.0008526973
## Somewhat more lonely   0.4425127       -0.4938345 -0.0236253960
## 
## Std. Errors:
##                      (Intercept) genderNon-binary genderWoman
## Much less lonely      0.15212931        0.5684499   0.2114312
## Much more lonely      0.09184618        0.4283123   0.1219902
## Somewhat less lonely  0.11465031        0.4538499   0.1651529
## Somewhat more lonely  0.07186705        0.3283244   0.1037569
## 
## Residual Deviance: 6724.555 
## AIC: 6748.555

The output for this multinom model is somewhat hard to look at because we recieve a coefficient for every level of the outcome. The reference variable in this case is “About the same”, which you could change using the relevel() function if you wanted to. We will leave it as is for now. Lets first use the tidy() function in the broom pacakge to convert the results to odds ratios (i.e., exponentiate = TRUE) and calculate confidence intervals (i.e., conf.int = TRUE, conf.level = 0.95, )

install.packages("broom", repos = "http://cran.us.r-project.org") #install package

## package 'broom' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("broom") #load library for package

broom::tidy(multinom_model, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE) #Tidy of the results using the tidy function!!

## # A tibble: 12 × 8
##    y.level              term   estim…¹ std.e…² statis…³  p.value conf.…⁴ conf.…⁵
##    <chr>                <chr>    <dbl>   <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
##  1 Much less lonely     (Inte…   0.157  0.152  -1.22e+1 5.05e-34   0.117   0.212
##  2 Much less lonely     gende…   1.27   0.568   4.23e-1 6.72e- 1   0.417   3.88 
##  3 Much less lonely     gende…   1.18   0.211   7.90e-1 4.30e- 1   0.781   1.79 
##  4 Much more lonely     (Inte…   0.594  0.0918 -5.66e+0 1.47e- 8   0.496   0.712
##  5 Much more lonely     gende…   0.673  0.428  -9.25e-1 3.55e- 1   0.291   1.56 
##  6 Much more lonely     gende…   1.85   0.122   5.06e+0 4.27e- 7   1.46    2.35 
##  7 Somewhat less lonely (Inte…   0.314  0.115  -1.01e+1 6.09e-24   0.251   0.394
##  8 Somewhat less lonely gende…   1.11   0.454   2.36e-1 8.13e- 1   0.457   2.71 
##  9 Somewhat less lonely gende…   0.999  0.165  -5.16e-3 9.96e- 1   0.723   1.38 
## 10 Somewhat more lonely (Inte…   1.56   0.0719  6.16e+0 7.40e-10   1.35    1.79 
## 11 Somewhat more lonely gende…   0.610  0.328  -1.50e+0 1.33e- 1   0.321   1.16 
## 12 Somewhat more lonely gende…   0.977  0.104  -2.28e-1 8.20e- 1   0.797   1.20 
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## #   ⁴conf.low, ⁵conf.high

In these results, the “About the same” group is your referent, so all comparisons are against that group. So, compared to people who’s loneliness did not change, people who were much more lonely had 85% higher odds of being a woman.

Lets do a conceptual check on this data by reviewing the descriptive:

prop.table(table(burnout_data$lonely_change_covid, burnout_data$gender), margin = 1) #calculate % of each group who were each gender

##                       
##                               Man Non-binary      Woman
##   About the same       0.50157729 0.03154574 0.46687697
##   Much less lonely     0.45871560 0.03669725 0.50458716
##   Much more lonely     0.36137667 0.01529637 0.62332696
##   Somewhat less lonely 0.50000000 0.03500000 0.46500000
##   Somewhat more lonely 0.51348548 0.01970954 0.46680498

As you can see, 50.2% of those who were “about the same,” were men compared to only about 36.1% of those who were much more lonely. We could also look at this data by gender:

prop.table(table(burnout_data$gender, burnout_data$lonely_change_covid), margin = 1) #calculate % of each group who were each gender

##             
##              About the same Much less lonely Much more lonely
##   Man            0.27604167       0.04340278       0.16406250
##   Non-binary     0.34482759       0.06896552       0.13793103
##   Woman          0.24262295       0.04508197       0.26721311
##             
##              Somewhat less lonely Somewhat more lonely
##   Man                  0.08680556           0.42968750
##   Non-binary           0.12068966           0.32758621
##   Woman                0.07622951           0.36885246

Only 16.4% of men were much more lonely as aresult of the pandemic, compared to 26.7% of men. These descriptive statistics agree with the results found in the model!

We can also use the same functions above to check goodness of fit, the pseudo R squared, and identify the importance of variables

Model Fit

The model fit for a multinomial logistic regression model can also be tested using the logitgof() function:

library("generalhoslem")
logitgof(obs = na.omit(burnout_data$lonely_change_covid), exp = fitted(multinom_model)) #notice I use the na.omit function here to restrict down the outcome variable to include only those that were present in the observed dataset na.omit() removes mising. Then the expected values are calculated using the fitted() function.

## 
##  Hosmer and Lemeshow test (multinomial model)
## 
## data:  na.omit(burnout_data$lonely_change_covid), fitted(multinom_model)
## X-squared = 3.7544e-07, df = 0, p-value < 2.2e-16

Likewise, you can also check Pseudo R-squared to assess model fit.

library("DescTools") #load package
PseudoR2(multinom_model, "all")

##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
##    6.509770e-03    2.963994e-03    1.796916e-02    1.915080e-02              NA 
## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
##              NA              NA              NA              NA    6.748555e+03 
##             BIC          logLik         logLik0              G2 
##    6.818103e+03   -3.362277e+03   -3.384309e+03    4.406214e+01

Variable Importance

As with other models, it is informative to check variable importance. However, as we only have one variable, lets first build a model that includes the age variable.

library("nnet") #load package
multinom_model_2 <- multinom(formula = lonely_change_covid ~ gender + age, model = TRUE, data = burnout_data)

## # weights:  25 (16 variable)
## initial  value 3910.934127 
## iter  10 value 3538.470171
## iter  20 value 3331.388276
## final  value 3330.671460 
## converged

summary(multinom_model_2)

## Call:
## multinom(formula = lonely_change_covid ~ gender + age, data = burnout_data, 
##     model = TRUE)
## 
## Coefficients:
##                      (Intercept) genderNon-binary genderWoman          age
## Much less lonely      -2.3918714        0.2231964 0.100150457  0.014287964
## Much more lonely      -1.1051994       -0.4150591 0.544377317  0.015371055
## Somewhat less lonely  -0.5111307        0.1329242 0.060438748 -0.018495488
## Somewhat more lonely   0.7185628       -0.4829126 0.005327498 -0.007717216
## 
## Std. Errors:
##                      (Intercept) genderNon-binary genderWoman         age
## Much less lonely       0.2907118        0.5691078   0.2140786 0.006349814
## Much more lonely       0.1693394        0.4294329   0.1235672 0.003716101
## Somewhat less lonely   0.2375174        0.4549662   0.1664447 0.006099787
## Somewhat more lonely   0.1443500        0.3286260   0.1047088 0.003487082
## 
## Residual Deviance: 6661.343 
## AIC: 6693.343

library("caret") #load package
varImp(multinom_model_2) #Check variable importance

##                     Overall
## genderNon-binary 1.25409225
## genderWoman      0.71029402
## age              0.05587172

Ordinal Logistic Regression

The third type of regression we want to cover today is an ordinal logistic regression model. This is useful when your outcome has a natural order to it. Usually your selected referent is either the highest or lowest level of the variable. The first step is to make sure your variable is saved as an ordered factor. Lets give this a try using the question that assesses how frequently participants feel lonely:

table(burnout_data$lonely_direct) #call descriptives

## 
##                          All of the time (e.g. 5-7 days)] 
##                                                       137 
##                           None of the time (e.g., 0 days) 
##                                                       384 
## Occasionally or a moderate amount of time (e.g. 3-4 days) 
##                                                       307 
##                             Rarely (e.g. less than 1 day) 
##                                                       869 
##              Some or a little of the time (e.g. 1-2 days) 
##                                                       748

As you can see, the variable levels are currently ordered alphabetically. We will need to reorder the variable using the ordered() function:

burnout_data$lonely_direct <- ordered(burnout_data$lonely_direct, levels = c("None of the time (e.g., 0 days)", "Rarely (e.g. less than 1 day)", "Some or a little of the time (e.g. 1-2 days)", "Occasionally or a moderate amount of time (e.g. 3-4 days)", "All of the time (e.g. 5-7 days)]")) #Convert to factor and specify order

table(burnout_data$lonely_direct) #call descriptives; As you can see now, the variable is now ordered. We can construct an ordinal logistic regresison using the polr() function from the MASS package

## 
##                           None of the time (e.g., 0 days) 
##                                                       384 
##                             Rarely (e.g. less than 1 day) 
##                                                       869 
##              Some or a little of the time (e.g. 1-2 days) 
##                                                       748 
## Occasionally or a moderate amount of time (e.g. 3-4 days) 
##                                                       307 
##                          All of the time (e.g. 5-7 days)] 
##                                                       137

install.packages("MASS", repos = "http://cran.us.r-project.org") #install package
library(MASS) #load library for package

ord_model <- polr(formula = lonely_direct ~ age + gender + lonely_change_covid + BurnoutScore, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model) # Get a summary, which is hard to read, so convert to a tidy table using the broom package

## # A tibble: 12 × 5
##    term                                          estim…¹ std.e…² stati…³ coef.…⁴
##    <chr>                                           <dbl>   <dbl>   <dbl> <chr>  
##  1 age                                           0.00402 0.00263  1.53   coeffi…
##  2 genderNon-binary                              0.115   0.274    0.421  coeffi…
##  3 genderWoman                                   0.0300  0.0794   0.378  coeffi…
##  4 lonely_change_covidMuch less lonely           0.0188  0.214    0.0878 coeffi…
##  5 lonely_change_covidMuch more lonely           1.59    0.123   12.9    coeffi…
##  6 lonely_change_covidSomewhat less lonely       0.0633  0.157    0.404  coeffi…
##  7 lonely_change_covidSomewhat more lonely       0.400   0.0979   4.08   coeffi…
##  8 BurnoutScore                                  0.923   0.0434  21.3    coeffi…
##  9 None of the time (e.g., 0 days)|Rarely (e.g.… 1.66    0.190    8.73   scale  
## 10 Rarely (e.g. less than 1 day)|Some or a litt… 3.80    0.199   19.1    scale  
## 11 Some or a little of the time (e.g. 1-2 days)… 5.74    0.219   26.3    scale  
## 12 Occasionally or a moderate amount of time (e… 7.43    0.246   30.2    scale  
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## #   ⁴coef.type

library(broom) #load library for broom package
broom::tidy(ord_model, conf.int = TRUE, conf.level = 0.95, pvalue = TRUE, exponentiate = TRUE) #Tidy of the results using the tidy function!!

## # A tibble: 12 × 7
##    term                          estim…¹ std.e…² stati…³ conf.…⁴ conf.…⁵ coef.…⁶
##    <chr>                           <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>  
##  1 age                              1.00 0.00263  1.53     0.999    1.01 coeffi…
##  2 genderNon-binary                 1.12 0.274    0.421    0.656    1.92 coeffi…
##  3 genderWoman                      1.03 0.0794   0.378    0.882    1.20 coeffi…
##  4 lonely_change_covidMuch less…    1.02 0.214    0.0878   0.668    1.55 coeffi…
##  5 lonely_change_covidMuch more…    4.90 0.123   12.9      3.85     6.25 coeffi…
##  6 lonely_change_covidSomewhat …    1.07 0.157    0.404    0.783    1.45 coeffi…
##  7 lonely_change_covidSomewhat …    1.49 0.0979   4.08     1.23     1.81 coeffi…
##  8 BurnoutScore                     2.52 0.0434  21.3      2.31     2.74 coeffi…
##  9 None of the time (e.g., 0 da…    5.24 0.190    8.73    NA       NA    scale  
## 10 Rarely (e.g. less than 1 day…   44.5  0.199   19.1     NA       NA    scale  
## 11 Some or a little of the time…  312.   0.219   26.3     NA       NA    scale  
## 12 Occasionally or a moderate a… 1685.   0.246   30.2     NA       NA    scale  
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## #   ⁴conf.low, ⁵conf.high, ⁶coef.type

As you can see, in this model, we do not have to worry about multiple coefficients for each explanatory variable. The nice thing with ordinal regression is it gives us a single effect for each term. So we might interpret the effect of Burnout as “Each 1 point increase in burnout scores was associated with a 2.52 times (or 152%) increase in odds of being lonely more frequently. The other variable level with a significant effect was among those who said they were filling much more lonelier since the start of the pandemic: We would interpret this as”People who were much more lonelier since the start of the pandemic had 4.90 times (or 390%) higher odds of being lonely more frequently.

Model Fit and Assumptions

One assumption unique to an ordinal model is that no variable in your model has a disproportionate effect on a specific level of your outcome variable. To test for this assumption, we conduct the Brant-Wald test using the brant() function within the brant package.

install.packages("brant", repos = "http://cran.us.r-project.org")

## package 'brant' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("brant")
brant::brant(ord_model)

## ---------------------------------------------------------------------------- 
## Test for                 X2  df  probability 
## ---------------------------------------------------------------------------- 
## Omnibus                      278.25  24  0
## age                      112.63  3   0
## genderNon-binary             18.25   3   0
## genderWoman                  6.83    3   0.08
## lonely_change_covidMuch less lonely      31.97   3   0
## lonely_change_covidMuch more lonely      20.72   3   0
## lonely_change_covidSomewhat less lonely  2.19    3   0.53
## lonely_change_covidSomewhat more lonely  14.52   3   0
## BurnoutScore                 40.19   3   0
## ---------------------------------------------------------------------------- 
## 
## H0: Parallel Regression Assumption holds

A p-value of less than 0.05 on this test-particularly on the Omnibus plus at least one of the variables-should be interpreted as a failure of the proportional odds assumption. In this case, we would likely opt for a multinomial model or collapse our variable in a way that would make it suitable for a binomial logistic regression model.

The model fit can also be assessed by the logitgof() function:

library(generalhoslem)
logitgof(ord_model$model$lonely_direct, exp = fitted(ord_model), ord = TRUE)

## 
##  Hosmer and Lemeshow test (ordinal model)
## 
## data:  ord_model$model$lonely_direct, fitted(ord_model)
## X-squared = 88.288, df = 35, p-value = 1.703e-06

We can further check the pseudo R-squared:

library("DescTools") #load package
PseudoR2(ord_model, "all")

##        McFadden        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann 
##       0.1370127       0.3265479       0.3458577              NA              NA 
##           Efron McKelveyZavoina            Tjur             AIC             BIC 
##              NA              NA              NA    5848.2884448    5917.3781900 
##          logLik         logLik0              G2 
##   -2912.1442224   -3374.4924412     924.6964376

Finally, it is informative to check the variable importance. Unfortunately the varImp() function does not have a method for dealing with a polr() objects, as such it cannot be used for ordinal regression.

One option you have it to construct the model without each variable and then test the pseudo R squared on each, taking note of which variables have the biggest impact on values.

ord_model_no_age <- polr(formula = lonely_direct ~ gender + lonely_change_covid + BurnoutScore, Hess = TRUE, data = burnout_data) #construct without age

ord_model_no_gender <- polr(formula = lonely_direct ~ age + lonely_change_covid + BurnoutScore, Hess = TRUE, data = burnout_data) #construct without gender

ord_model_no_lonely_change_covid <- polr(formula = lonely_direct ~ age + gender + BurnoutScore, Hess = TRUE, data = burnout_data) #construct without lonely_change

ord_model_no_burnoutScore <- polr(formula = lonely_direct ~ age + gender + lonely_change_covid, Hess = TRUE, data = burnout_data) #construct without burnout

Once the models are each constructed, you can test the pseudo R-squared of each and compare.

PseudoR2(ord_model, "all") #Mcfadden = 0.1370127

##        McFadden        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann 
##       0.1370127       0.3265479       0.3458577              NA              NA 
##           Efron McKelveyZavoina            Tjur             AIC             BIC 
##              NA              NA              NA    5848.2884448    5917.3781900 
##          logLik         logLik0              G2 
##   -2912.1442224   -3374.4924412     924.6964376

PseudoR2(ord_model_no_age, "all")  #Mcfadden = 0.1366661

##        McFadden        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann 
##       0.1366661       0.3258741       0.3451440              NA              NA 
##           Efron McKelveyZavoina            Tjur             AIC             BIC 
##              NA              NA              NA    5848.6275130    5911.9597794 
##          logLik         logLik0              G2 
##   -2913.3137565   -3374.4924412     922.3573694

PseudoR2(ord_model_no_gender, "all")  #Mcfadden = 0.1369715

##        McFadden        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann 
##       0.1369715       0.3264678       0.3457729              NA              NA 
##           Efron McKelveyZavoina            Tjur             AIC             BIC 
##              NA              NA              NA    5844.5664064    5902.1411941 
##          logLik         logLik0              G2 
##   -2912.2832032   -3374.4924412     924.4184760

PseudoR2(ord_model_no_lonely_change_covid, "all")  #Mcfadden = 0.1100810

##        McFadden        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann 
##       0.1100810       0.2725013       0.2885357              NA              NA 
##           Efron McKelveyZavoina            Tjur             AIC             BIC 
##              NA              NA              NA    6070.3457893    6116.4567596 
##          logLik         logLik0              G2 
##   -3027.1728946   -3401.6273375     748.9088857

PseudoR2(ord_model_no_burnoutScore, "all")  #Mcfadden = 0.0635745

##        McFadden        CoxSnell      Nagelkerke   AldrichNelson VeallZimmermann 
##       0.0635745       0.1674025       0.1773402              NA              NA 
##           Efron McKelveyZavoina            Tjur             AIC             BIC 
##              NA              NA              NA    6571.3339832    6635.0725065 
##          logLik         logLik0              G2 
##   -3274.6669916   -3496.9861484     444.6383136

Finally, we can calculate the change in R-squared values by subtracting the pseudo R-squared value of each nested model from the pseudo R-squared value of the full model.

0.1370127 - 0.1366661 #ord_model_no_age = 0.0003466

## [1] 0.0003466

0.1370127 - 0.1369715 #ord_model_no_gender = 0.0000412

## [1] 4.12e-05

0.1370127 - 0.1100810 #ord_model_no_lonely_change_covid = 0.0269317

## [1] 0.0269317

0.1370127 - 0.0635745 #ord_model_no_burnoutScore = 0.0734382

## [1] 0.0734382

Here we can see that the biggest variable impact was on the removal of the burnout score. Again highlighting a strong connection between loneliness and burnout. Relatively small changes in the model were attributed to age and gender. This gives you a sense of what the most important variables in the model R. Technically, a similar approach could be used in other models.

Multicolinearity

The other limitation of ordinal models is they do not work with your typical variable inflation factors. You can however check multicolienarity by (1) converting your outcome to a numeric variable, (2) constructing a linear regression model, and (3) calculating the VIF scores to test for multicolinearity.

Step 1. Create numeric version of the outcome

burnout_data$lonely_direct <- as.character(burnout_data$lonely_direct)
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "None of the time (e.g., 0 days)"] <- 1
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "Rarely (e.g. less than 1 day)"] <- 2
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "Some or a little of the time (e.g. 1-2 days)"] <- 3
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "Occasionally or a moderate amount of time (e.g. 3-4 days)"] <- 4
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "All of the time (e.g. 5-7 days)]"] <- 5

Step 2. Create a linear regression model

linear_model <- lm(formula = lonely_direct_numeric ~ age + gender + lonely_change_covid + BurnoutScore, data = burnout_data)

Step 3. Calculate VIF to measure multicolinearity

VIF(linear_model)

##                         GVIF Df GVIF^(1/(2*Df))
## age                 1.057796  1        1.028492
## gender              1.047706  2        1.011719
## lonely_change_covid 1.197383  4        1.022773
## BurnoutScore        1.162099  1        1.078007

The above is a nice work-around that should satisfy. Since the VIF is really a function of inter-correlations in the design matrix (which doesn’t depend on the dependent variable or the non-linear mapping from the linear predictor into the space of the response variable [i.e., the link function in a glm]).

8. Variable and Model Selection

This section introduces you to the following concepts and their corresponding functions:

Model selection: AIC(), lrtest(), coxtest(), step()
Variable selection: regTermTest()

So far, we have covered linear and logistic regression models and how to evaluate them. Now, our focus is on how to compare models to decide which one is better. We will also review some different ways to construct a model. These skills will add to those we’ve already learned to help you build better models by deciding what variables and combinations of variables you should include.

Getting Started

As always, lets start by reading in our burnout dataset:

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data

Model Selection

Linear Regression

Now lets learn how to compare linear regression models first. We will start by building three similar linear regression models using the BurnoutScore, age, income, and ethnicity variables. To start I am going to create a new dataset that removes missing observations that are missing any of the variables of interest. This ensures that all the models I am looking at are built with the same underlying data.

burnout_data_subset <- burnout_data[!is.na(burnout_data$BurnoutScore),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$age),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$income),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$ethnicity),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$BurnoutScore_groups),]

Model 1 predicts burnout scores using age as the explanatory factor:

model_1 <- lm(formula = BurnoutScore ~ age, data = burnout_data_subset) #Create a linear regression model object
broom::tidy(model_1) # get results for the model object

## # A tibble: 2 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  3.53      0.0607      58.2   0     
## 2 age         -0.00294   0.00146     -2.02  0.0434

Model 2 predicts burnout scores using age and income as the explanatory factors:

model_2 <- lm(formula = BurnoutScore ~ age + income, data = burnout_data_subset) #Create a linear regression model object
broom::tidy(model_2) # get results for the model object

## # A tibble: 19 × 5
##    term                       estimate std.error statistic   p.value
##    <chr>                         <dbl>     <dbl>     <dbl>     <dbl>
##  1 (Intercept)                 3.67      0.110     33.4    1.01e-200
##  2 age                        -0.00318   0.00147   -2.16   3.10e-  2
##  3 income$100,000 to $149,999 -0.293     0.116     -2.52   1.18e-  2
##  4 income$15,000 to $19,999   -0.309     0.137     -2.26   2.39e-  2
##  5 income$150,000 to $199,999 -0.324     0.134     -2.42   1.57e-  2
##  6 income$20,000 to $24,999    0.113     0.132      0.857  3.91e-  1
##  7 income$200,000 or more     -0.149     0.186     -0.803  4.22e-  1
##  8 income$25,000 to $29,999   -0.0159    0.136     -0.117  9.07e-  1
##  9 income$30,000 to $34,999   -0.189     0.132     -1.43   1.52e-  1
## 10 income$35,000 to $39,999   -0.204     0.136     -1.50   1.34e-  1
## 11 income$40,000 to $44,999   -0.186     0.134     -1.39   1.63e-  1
## 12 income$45,000 to $49,999   -0.0902    0.142     -0.633  5.27e-  1
## 13 income$5,000 to $9,999     -0.0355    0.145     -0.244  8.07e-  1
## 14 income$50,000 to $59,999    0.0895    0.127      0.705  4.81e-  1
## 15 income$60,000 to $69,999   -0.00263   0.132     -0.0200 9.84e-  1
## 16 income$70,000 to $79,999   -0.152     0.129     -1.18   2.37e-  1
## 17 income$80,000 to $89,999   -0.103     0.140     -0.732  4.64e-  1
## 18 income$90,000 to $99,999   -0.163     0.133     -1.22   2.22e-  1
## 19 incomeUnder $5,000         -0.124     0.172     -0.722  4.70e-  1

Model 3 predicts burnout scores using age and ethnicity as the explanatory factors:

model_3 <- lm(formula = BurnoutScore ~ age + ethnicity, data = burnout_data_subset) #Create a linear regression model object
broom::tidy(model_3) # get results for the model object

## # A tibble: 14 × 5
##    term                                       estimate std.e…¹ stati…²   p.value
##    <chr>                                         <dbl>   <dbl>   <dbl>     <dbl>
##  1 (Intercept)                                 3.28    0.0877    37.4  5.80e-240
##  2 age                                        -0.00435 0.00152   -2.86 4.31e-  3
##  3 ethnicityArab                               0.412   0.160      2.57 1.03e-  2
##  4 ethnicityChinese                            0.245   0.154      1.58 1.13e-  1
##  5 ethnicityFilipino                           0.407   0.261      1.56 1.20e-  1
##  6 ethnicityIndigenous                         0.184   0.112      1.65 9.95e-  2
##  7 ethnicityJapanese                           0.307   0.292      1.05 2.92e-  1
##  8 ethnicityKorean                             0.292   0.261      1.12 2.64e-  1
##  9 ethnicityLatin American                     0.358   0.137      2.61 8.99e-  3
## 10 ethnicityNone of the above, I identify as   0.543   0.144      3.78 1.64e-  4
## 11 ethnicitySouth Asian (e.g., East Indian, …  0.630   0.174      3.61 3.07e-  4
## 12 ethnicitySoutheast Asian (e.g., Vietnames…  0.320   0.219      1.46 1.45e-  1
## 13 ethnicityWest Asian (e.g., Iranian, Afgha…  0.684   0.292      2.34 1.92e-  2
## 14 ethnicityWhite                              0.340   0.0811     4.20 2.82e-  5
## # … with abbreviated variable names ¹std.error, ²statistic

To compare these models, we use something called a fit index, in particular we frequently rely on the Adjusted Information Criterion (AIC). AIC is not displayed by default. However, can be extracted using the AIC() function, as follows:

AIC(model_1, model_2, model_3)

##         df      AIC
## model_1  3 7122.987
## model_2 20 7122.044
## model_3 15 7116.428

As you can see, model 3 has the lowest AIC, and is therefore the best fitting model of those reviewed. We can test whether these differences are statistically significant using the lrtest() function from the lmtest package.

install.packages("lmtest", repos = "http://cran.us.r-project.org")

## package 'lmtest' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("lmtest")
lrtest(model_1, model_2)

## Likelihood ratio test
## 
## Model 1: BurnoutScore ~ age
## Model 2: BurnoutScore ~ age + income
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1   3 -3558.5                        
## 2  20 -3541.0 17 34.943   0.006329 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the result shows a p-value (0.006329). This means that adding the income variable to the model did lead to a significantly improved fit over the model 1.

Model 1 and model 2 can be compared using the lrtest() function, only when they are nested (meaning that model 2 contains all the variables that model 1 does, plus some additional ones). In our example model 1 is tested inside model 2. However, model 2 is not nested inside model 3, because model 3 does not include ethnicity. Therefore we could compare model 3 to model 1 using lrtest, because they are nested:

lrtest(model_1, model_3)

## Likelihood ratio test
## 
## Model 1: BurnoutScore ~ age
## Model 2: BurnoutScore ~ age + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)   
## 1   3 -3558.5                        
## 2  15 -3543.2 12 30.559   0.002299 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

However, to compare model 2 and model 3, we need to use the coxtest() function, also from the lmtest package

coxtest(model_2, model_3)

## Cox test
## 
## Model 1: BurnoutScore ~ age + income
## Model 2: BurnoutScore ~ age + ethnicity
##                 Estimate Std. Error z value  Pr(>|z|)    
## fitted(M1) ~ M2  -15.139    0.52066 -29.077 < 2.2e-16 ***
## fitted(M2) ~ M1  -17.302    0.57398 -30.143 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As you can see, the p-value is statistically significant, suggesting that the difference in model fit is statistically significant. Combine this with what you know about the AIC, and you can tell that model_3 is a better fit than model_2.

As models become more and more complex, we can also use functions that automate the variable selection based on the AIC value. One such function is the step() function. Lets build ourselves a complex model and test the step function out:

linear_model_complex <- lm(formula = BurnoutScore ~ age + gender + ethnicity + income + educational_attainment + identity_lgbtq + identity_disability + lonely_direct + work_dignity + work_support + discrimination_score, model = TRUE, data = burnout_data)
broom::tidy(linear_model_complex)

## # A tibble: 58 × 5
##    term                estimate std.error statistic  p.value
##    <chr>                  <dbl>     <dbl>     <dbl>    <dbl>
##  1 (Intercept)          4.93      0.250      19.8   2.44e-75
##  2 age                 -0.00726   0.00212    -3.42  6.37e- 4
##  3 genderNon-binary     0.174     0.190       0.917 3.59e- 1
##  4 genderWoman          0.171     0.0541      3.15  1.65e- 3
##  5 ethnicityArab        0.222     0.197       1.13  2.60e- 1
##  6 ethnicityChinese     0.283     0.191       1.48  1.39e- 1
##  7 ethnicityFilipino    0.168     0.311       0.541 5.89e- 1
##  8 ethnicityIndigenous  0.114     0.129       0.888 3.75e- 1
##  9 ethnicityJapanese    0.0788    0.377       0.209 8.35e- 1
## 10 ethnicityKorean      0.0852    0.311       0.274 7.84e- 1
## # … with 48 more rows

step(linear_model_complex, direction = "backward") #backwards selection of variables

## Start:  AIC=-256.57
## BurnoutScore ~ age + gender + ethnicity + income + educational_attainment + 
##     identity_lgbtq + identity_disability + lonely_direct + work_dignity + 
##     work_support + discrimination_score
## 
##                          Df Sum of Sq     RSS     AIC
## - ethnicity              12     6.114  918.09 -272.34
## - income                 17    22.540  934.51 -260.49
## <none>                                 911.97 -256.57
## - identity_lgbtq          1     1.986  913.96 -255.89
## - work_dignity            4     7.212  919.18 -254.86
## - work_support            4     8.782  920.75 -252.76
## - gender                  2     7.904  919.87 -249.94
## - educational_attainment  8    17.974  929.94 -248.52
## - age                     1     9.102  921.07 -246.33
## - identity_disability     1    13.340  925.31 -240.68
## - discrimination_score    1    34.457  946.43 -212.88
## - lonely_direct           4   269.467 1181.44   54.37
## 
## Step:  AIC=-272.34
## BurnoutScore ~ age + gender + income + educational_attainment + 
##     identity_lgbtq + identity_disability + lonely_direct + work_dignity + 
##     work_support + discrimination_score
## 
##                          Df Sum of Sq     RSS      AIC
## - income                 17    23.309  941.39 -275.447
## <none>                                 918.09 -272.336
## - identity_lgbtq          1     1.817  919.90 -271.900
## - work_dignity            4     6.816  924.90 -271.224
## - work_support            4     9.001  927.09 -268.316
## - gender                  2     8.257  926.34 -265.305
## - educational_attainment  8    18.832  936.92 -263.321
## - age                     1     8.747  926.83 -262.654
## - identity_disability     1    13.944  932.03 -255.764
## - discrimination_score    1    33.142  951.23 -230.647
## - lonely_direct           4   275.967 1194.05   43.456
## 
## Step:  AIC=-275.45
## BurnoutScore ~ age + gender + educational_attainment + identity_lgbtq + 
##     identity_disability + lonely_direct + work_dignity + work_support + 
##     discrimination_score
## 
##                          Df Sum of Sq     RSS      AIC
## <none>                                 941.39 -275.447
## - identity_lgbtq          1     2.025  943.42 -274.800
## - work_dignity            4     7.686  949.08 -273.429
## - work_support            4     8.544  949.94 -272.316
## - gender                  2    10.035  951.43 -266.384
## - educational_attainment  8    19.863  961.26 -265.723
## - age                     1     9.495  950.89 -265.084
## - identity_disability     1    15.564  956.96 -257.245
## - discrimination_score    1    35.727  977.12 -231.557
## - lonely_direct           4   276.194 1217.59   33.503

## 
## Call:
## lm(formula = BurnoutScore ~ age + gender + educational_attainment + 
##     identity_lgbtq + identity_disability + lonely_direct + work_dignity + 
##     work_support + discrimination_score, data = burnout_data, 
##     model = TRUE)
## 
## Coefficients:
##                                                                           (Intercept)  
##                                                                              5.083453  
##                                                                                   age  
##                                                                             -0.007172  
##                                                                      genderNon-binary  
##                                                                              0.195005  
##                                                                           genderWoman  
##                                                                              0.190145  
##                                               educational_attainmentBachelor's degree  
##                                                                             -0.313381  
##   educational_attainmentCollege, CEGEP or other non-university certificate or diploma  
##                                                                             -0.207758  
## educational_attainmentDegree in medicine, dentistry, veterinary medicine or optometry  
##                                                                             -0.260849  
##                                                educational_attainmentEarned doctorate  
##                                                                              0.139102  
##                                    educational_attainmentHigh School Diploma or Lower  
##                                                                             -0.241908  
##                                                 educational_attainmentMaster's degree  
##                                                                             -0.215207  
##          educational_attainmentUniversity certificate or diploma above bachelor level  
##                                                                             -0.322637  
##          educational_attainmentUniversity certificate or diploma below bachelor level  
##                                                                             -0.496635  
##                             identity_lgbtqSexual or gender minorities (e.g., LGBTQ2+)  
##                                                                              0.154236  
##                                                 identity_disability8888: Not Selected  
##                                                                             -0.328294  
##                                          lonely_directNone of the time (e.g., 0 days)  
##                                                                             -1.948005  
##                lonely_directOccasionally or a moderate amount of time (e.g. 3-4 days)  
##                                                                             -0.624743  
##                                            lonely_directRarely (e.g. less than 1 day)  
##                                                                             -1.453003  
##                             lonely_directSome or a little of the time (e.g. 1-2 days)  
##                                                                             -1.154553  
##                                                                work_dignityCompletely  
##                                                                             -0.484545  
##                                                                work_dignityNot at all  
##                                                                             -0.209211  
##                                                                  work_dignitySomewhat  
##                                                                             -0.207658  
##                                                               work_dignityVery Little  
##                                                                             -0.100759  
##                                                                 work_dignityVery Well  
##                                                                             -0.287573  
##                                                                work_supportCompletely  
##                                                                              0.078410  
##                                                                work_supportNot at all  
##                                                                              0.348530  
##                                                                  work_supportSomewhat  
##                                                                              0.234562  
##                                                               work_supportVery Little  
##                                                                              0.204010  
##                                                                 work_supportVery Well  
##                                                                                    NA  
##                                                                  discrimination_score  
##                                                                              0.017826

The following output is the last step of the backwards elimination procedure. Backwards elimiation begins with all variables listed in the model, then removes the least significant variables until AIC in minimized. You can see here that age, gender, educational_attainment, identity_lgbtq, identity_disability, lonely_direct, work_dignity, work_support, discrimination_score are retained while ethnicity and income are removed.

So for our final model, we can simply build the model with the variables that remained in the model.

final_linear_model <- lm(formula = BurnoutScore ~ age + gender + educational_attainment + identity_lgbtq + identity_disability + lonely_direct + work_dignity + work_support + discrimination_score, model = TRUE, data = burnout_data) #create final multivariable model

library(broom) #load in broom package to create tidy table
table <- tidy(final_linear_model, conf.int = TRUE, conf.level = 0.95) #create and save table using tidy() function

Binomial Logistic Regression

Lets now look at model fit for a binomial logistic regression models: To start we will create three binomial logistic regression models.

Model 1 predicts burnout score groups using age as the explanatory factor:

bi_model_1 <- glm(formula = BurnoutScore_groups ~ age, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
broom::tidy(bi_model_1) # get results for the model object

## # A tibble: 2 × 5
##   term        estimate std.error statistic      p.value
##   <chr>          <dbl>     <dbl>     <dbl>        <dbl>
## 1 (Intercept) -0.646     0.120       -5.40 0.0000000670
## 2 age         -0.00347   0.00289     -1.20 0.231

Model 2 predicts burnout score groups using age and income as the explanatory factors:

bi_model_2 <- glm(formula = BurnoutScore_groups ~ age + income, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
broom::tidy(bi_model_2) # get results for the model object

## # A tibble: 19 × 5
##    term                       estimate std.error statistic p.value
##    <chr>                         <dbl>     <dbl>     <dbl>   <dbl>
##  1 (Intercept)                -0.371     0.210      -1.77   0.0772
##  2 age                        -0.00334   0.00294    -1.13   0.257 
##  3 income$100,000 to $149,999 -0.473     0.226      -2.09   0.0366
##  4 income$15,000 to $19,999   -0.368     0.266      -1.38   0.167 
##  5 income$150,000 to $199,999 -0.652     0.271      -2.41   0.0160
##  6 income$20,000 to $24,999    0.119     0.249       0.479  0.632 
##  7 income$200,000 or more     -0.553     0.380      -1.45   0.146 
##  8 income$25,000 to $29,999   -0.259     0.263      -0.985  0.325 
##  9 income$30,000 to $34,999   -0.538     0.262      -2.05   0.0403
## 10 income$35,000 to $39,999   -0.494     0.269      -1.83   0.0666
## 11 income$40,000 to $44,999   -0.328     0.259      -1.26   0.206 
## 12 income$45,000 to $49,999   -0.523     0.284      -1.84   0.0658
## 13 income$5,000 to $9,999     -0.400     0.285      -1.40   0.160 
## 14 income$50,000 to $59,999    0.143     0.239       0.598  0.550 
## 15 income$60,000 to $69,999   -0.183     0.252      -0.727  0.468 
## 16 income$70,000 to $79,999   -0.0875    0.244      -0.358  0.720 
## 17 income$80,000 to $89,999   -0.292     0.272      -1.08   0.282 
## 18 income$90,000 to $99,999   -0.356     0.259      -1.37   0.169 
## 19 incomeUnder $5,000         -0.0973    0.328      -0.297  0.767

Model 3 predicts burnout score groups using age, ethnicity, and Loneliness scores as the explanatory factors:

bi_model_3 <- glm(formula = BurnoutScore_groups ~ age + ethnicity, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
broom::tidy(bi_model_3) # get results for the model object

## # A tibble: 14 × 5
##    term                                         estimate std.e…¹ stati…² p.value
##    <chr>                                           <dbl>   <dbl>   <dbl>   <dbl>
##  1 (Intercept)                                  -1.01    0.186    -5.45  5.13e-8
##  2 age                                          -0.00612 0.00304  -2.02  4.39e-2
##  3 ethnicityArab                                 0.296   0.330     0.898 3.69e-1
##  4 ethnicityChinese                              0.0803  0.330     0.243 8.08e-1
##  5 ethnicityFilipino                             0.915   0.492     1.86  6.33e-2
##  6 ethnicityIndigenous                           0.265   0.235     1.13  2.59e-1
##  7 ethnicityJapanese                             0.521   0.571     0.912 3.62e-1
##  8 ethnicityKorean                               0.658   0.502     1.31  1.90e-1
##  9 ethnicityLatin American                       0.460   0.277     1.66  9.73e-2
## 10 ethnicityNone of the above, I identify as     0.958   0.280     3.42  6.26e-4
## 11 ethnicitySouth Asian (e.g., East Indian, Pa…  0.976   0.333     2.93  3.39e-3
## 12 ethnicitySoutheast Asian (e.g., Vietnamese,…  0.481   0.436     1.10  2.70e-1
## 13 ethnicityWest Asian (e.g., Iranian, Afghan,…  1.11    0.543     2.05  4.01e-2
## 14 ethnicityWhite                                0.514   0.173     2.97  3.02e-3
## # … with abbreviated variable names ¹std.error, ²statistic

The AIC Scores can again be extracted (though you may have noticed they are provided in the default output of the glm() function)

AIC(bi_model_1, bi_model_2, bi_model_3)

##            df      AIC
## bi_model_1  2 2936.558
## bi_model_2 19 2942.471
## bi_model_3 14 2937.738

Model_1 in this case shows the lowest AIC score. However, lets use lrtest() to compare the nested models and coxtest to compare the non-nested models:

lrtest(bi_model_1, bi_model_2)

## Likelihood ratio test
## 
## Model 1: BurnoutScore_groups ~ age
## Model 2: BurnoutScore_groups ~ age + income
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   2 -1466.3                       
## 2  19 -1452.2 17 28.087    0.04393 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lrtest(bi_model_1, bi_model_3)

## Likelihood ratio test
## 
## Model 1: BurnoutScore_groups ~ age
## Model 2: BurnoutScore_groups ~ age + ethnicity
##   #Df  LogLik Df Chisq Pr(>Chisq)  
## 1   2 -1466.3                      
## 2  14 -1454.9 12 22.82    0.02929 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

coxtest(bi_model_2, bi_model_3)

## Cox test
## 
## Model 1: BurnoutScore_groups ~ age + income
## Model 2: BurnoutScore_groups ~ age + ethnicity
##                 Estimate Std. Error z value  Pr(>|z|)    
## fitted(M1) ~ M2  -11.407    0.43043 -26.501 < 2.2e-16 ***
## fitted(M2) ~ M1  -14.126    0.52720 -26.795 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that both models 2 and 3 are statistically better than model 1 (and model 3 is better than 2), despite model 1 having the lowest AIC. However, we can see that the p-values comparing model 1 to the other two models are not very small. This demonstrates that model fit is complex. On one hand the lrtest() and coxtest() are saying “choose model 3” and the AIC() values are saying “choose model 1”. Its important to know that the AIC penalizes model complexity – therefore trying to strike a good balance between model fit and model simplicity. What we know from reading all these results together, plus looking at model coefficients, is that the contributions of gender and ethnicity alone do not necessarily justify a more complex model. Thats not saying that a more complex model can’t be used, just that the evidence we have is conflicting and not very strong one way or another. For the sake of exploration, lets create a fourth model, combining age, gender, and ethnicity and compare the aic values and conduct a liklihood ratio test using the lrtest()

bi_model_4 <- glm(formula = BurnoutScore_groups ~ age + gender + ethnicity, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
summary(bi_model_4) # get results for the model object

## 
## Call:
## glm(formula = BurnoutScore_groups ~ age + gender + ethnicity, 
##     family = "binomial", data = burnout_data_subset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2714  -0.8737  -0.8071   1.4078   1.8624  
## 
## Coefficients:
##                                                                              Estimate
## (Intercept)                                                                 -1.114004
## age                                                                         -0.007474
## genderNon-binary                                                             0.668406
## genderWoman                                                                  0.323814
## ethnicityArab                                                                0.272512
## ethnicityChinese                                                             0.035787
## ethnicityFilipino                                                            0.871382
## ethnicityIndigenous                                                          0.226262
## ethnicityJapanese                                                            0.465545
## ethnicityKorean                                                              0.605984
## ethnicityLatin American                                                      0.446422
## ethnicityNone of the above, I identify as                                    0.910566
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.)        0.939222
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.)  0.460146
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.)                            1.093747
## ethnicityWhite                                                               0.479927
##                                                                             Std. Error
## (Intercept)                                                                   0.188446
## age                                                                           0.003074
## genderNon-binary                                                              0.284727
## genderWoman                                                                   0.092278
## ethnicityArab                                                                 0.330832
## ethnicityChinese                                                              0.331297
## ethnicityFilipino                                                             0.496211
## ethnicityIndigenous                                                           0.235486
## ethnicityJapanese                                                             0.574038
## ethnicityKorean                                                               0.505114
## ethnicityLatin American                                                       0.278324
## ethnicityNone of the above, I identify as                                     0.281974
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.)         0.334630
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.)   0.437881
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.)                             0.544679
## ethnicityWhite                                                                0.174245
##                                                                             z value
## (Intercept)                                                                  -5.912
## age                                                                          -2.432
## genderNon-binary                                                              2.348
## genderWoman                                                                   3.509
## ethnicityArab                                                                 0.824
## ethnicityChinese                                                              0.108
## ethnicityFilipino                                                             1.756
## ethnicityIndigenous                                                           0.961
## ethnicityJapanese                                                             0.811
## ethnicityKorean                                                               1.200
## ethnicityLatin American                                                       1.604
## ethnicityNone of the above, I identify as                                     3.229
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.)         2.807
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.)   1.051
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.)                             2.008
## ethnicityWhite                                                                2.754
##                                                                             Pr(>|z|)
## (Intercept)                                                                 3.39e-09
## age                                                                          0.01503
## genderNon-binary                                                             0.01890
## genderWoman                                                                  0.00045
## ethnicityArab                                                                0.41010
## ethnicityChinese                                                             0.91398
## ethnicityFilipino                                                            0.07908
## ethnicityIndigenous                                                          0.33664
## ethnicityJapanese                                                            0.41737
## ethnicityKorean                                                              0.23026
## ethnicityLatin American                                                      0.10872
## ethnicityNone of the above, I identify as                                    0.00124
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.)        0.00500
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.)  0.29333
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.)                            0.04464
## ethnicityWhite                                                               0.00588
##                                                                                
## (Intercept)                                                                 ***
## age                                                                         *  
## genderNon-binary                                                            *  
## genderWoman                                                                 ***
## ethnicityArab                                                                  
## ethnicityChinese                                                               
## ethnicityFilipino                                                           .  
## ethnicityIndigenous                                                            
## ethnicityJapanese                                                              
## ethnicityKorean                                                                
## ethnicityLatin American                                                        
## ethnicityNone of the above, I identify as                                   ** 
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.)       ** 
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.)    
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.)                           *  
## ethnicityWhite                                                              ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2934.0  on 2355  degrees of freedom
## Residual deviance: 2894.3  on 2340  degrees of freedom
## AIC: 2926.3
## 
## Number of Fisher Scoring iterations: 4

AIC(bi_model_1, bi_model_4) #compare AIC

##            df      AIC
## bi_model_1  2 2936.558
## bi_model_4 16 2926.271

lrtest(bi_model_1, bi_model_4) #compare lrtest()

## Likelihood ratio test
## 
## Model 1: BurnoutScore_groups ~ age
## Model 2: BurnoutScore_groups ~ age + gender + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   2 -1466.3                         
## 2  16 -1447.1 14 38.288  0.0004695 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that the full model is improved compared to the age only model.

Multinomial Logistic Regression

Lets now build some multinomial logistic regresison models using the multinom() function from the nnet package and use the AIC function to compare:

install.packages("nnet", repos = "http://cran.us.r-project.org") #install package
library("nnet") #load library for package

Age only model

multinom_model_1 <- multinom(formula = lonely_change_covid ~ age, model = TRUE, data = burnout_data)

## # weights:  15 (8 variable)
## initial  value 3910.934127 
## iter  10 value 3404.631506
## final  value 3346.744956 
## converged

broom::tidy(multinom_model_1)

## # A tibble: 8 × 6
##   y.level              term        estimate std.error statistic  p.value
##   <chr>                <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 Much less lonely     (Intercept) -2.35      0.280       -8.38 5.13e-17
## 2 Much less lonely     age          0.0147    0.00627      2.34 1.94e- 2
## 3 Much more lonely     (Intercept) -0.917     0.161       -5.70 1.17e- 8
## 4 Much more lonely     age          0.0178    0.00367      4.86 1.19e- 6
## 5 Somewhat less lonely (Intercept) -0.490     0.230       -2.13 3.29e- 2
## 6 Somewhat less lonely age         -0.0182    0.00605     -3.00 2.68e- 3
## 7 Somewhat more lonely (Intercept)  0.707     0.140        5.05 4.41e- 7
## 8 Somewhat more lonely age         -0.00765   0.00345     -2.22 2.67e- 2

Age gender model

multinom_model_2 <- multinom(formula = lonely_change_covid ~ age + gender, model = TRUE, data = burnout_data)

## # weights:  25 (16 variable)
## initial  value 3910.934127 
## iter  10 value 3538.470171
## iter  20 value 3331.388276
## final  value 3330.671460 
## converged

broom::tidy(multinom_model_2)

## # A tibble: 16 × 6
##    y.level              term             estimate std.error statistic  p.value
##    <chr>                <chr>               <dbl>     <dbl>     <dbl>    <dbl>
##  1 Much less lonely     (Intercept)      -2.39      0.291     -8.23   1.91e-16
##  2 Much less lonely     age               0.0143    0.00635    2.25   2.44e- 2
##  3 Much less lonely     genderNon-binary  0.223     0.569      0.392  6.95e- 1
##  4 Much less lonely     genderWoman       0.100     0.214      0.468  6.40e- 1
##  5 Much more lonely     (Intercept)      -1.11      0.169     -6.53   6.73e-11
##  6 Much more lonely     age               0.0154    0.00372    4.14   3.53e- 5
##  7 Much more lonely     genderNon-binary -0.415     0.429     -0.967  3.34e- 1
##  8 Much more lonely     genderWoman       0.544     0.124      4.41   1.06e- 5
##  9 Somewhat less lonely (Intercept)      -0.511     0.238     -2.15   3.14e- 2
## 10 Somewhat less lonely age              -0.0185    0.00610   -3.03   2.43e- 3
## 11 Somewhat less lonely genderNon-binary  0.133     0.455      0.292  7.70e- 1
## 12 Somewhat less lonely genderWoman       0.0604    0.166      0.363  7.17e- 1
## 13 Somewhat more lonely (Intercept)       0.719     0.144      4.98   6.43e- 7
## 14 Somewhat more lonely age              -0.00772   0.00349   -2.21   2.69e- 2
## 15 Somewhat more lonely genderNon-binary -0.483     0.329     -1.47   1.42e- 1
## 16 Somewhat more lonely genderWoman       0.00533   0.105      0.0509 9.59e- 1

Age ethnicity model

multinom_model_3 <- multinom(formula = lonely_change_covid ~ age + ethnicity, model = TRUE, data = burnout_data)

## # weights:  75 (56 variable)
## initial  value 3910.934127 
## iter  10 value 3534.576873
## iter  20 value 3306.453616
## iter  30 value 3294.440945
## iter  40 value 3293.951705
## iter  50 value 3293.882902
## iter  60 value 3293.874501
## final  value 3293.874421 
## converged

broom::tidy(multinom_model_3)

## # A tibble: 56 × 6
##    y.level          term                       estimate std.e…¹ statis…² p.value
##    <chr>            <chr>                         <dbl>   <dbl>    <dbl>   <dbl>
##  1 Much less lonely (Intercept)                 -1.97   3.81e-1 -5.17e+0 2.31e-7
##  2 Much less lonely age                          0.0150 6.63e-3  2.26e+0 2.36e-2
##  3 Much less lonely ethnicityArab               -0.850  8.06e-1 -1.05e+0 2.92e-1
##  4 Much less lonely ethnicityChinese            -0.109  5.85e-1 -1.86e-1 8.52e-1
##  5 Much less lonely ethnicityFilipino          -15.4    1.96e-7 -7.85e+7 0      
##  6 Much less lonely ethnicityIndigenous         -0.232  5.21e-1 -4.45e-1 6.56e-1
##  7 Much less lonely ethnicityJapanese            1.06   9.68e-1  1.09e+0 2.75e-1
##  8 Much less lonely ethnicityKorean            -15.4    1.88e-7 -8.22e+7 0      
##  9 Much less lonely ethnicityLatin American     -1.04   8.01e-1 -1.30e+0 1.95e-1
## 10 Much less lonely ethnicityNone of the abov…  -0.645  7.02e-1 -9.18e-1 3.59e-1
## # … with 46 more rows, and abbreviated variable names ¹std.error, ²statistic

test AIC

AIC(multinom_model_1, multinom_model_2, multinom_model_3) #compare AIC

##                  df      AIC
## multinom_model_1  8 6709.490
## multinom_model_2 16 6693.343
## multinom_model_3 56 6699.749

Here we see that the second model has the lowest AIC

Lets use lrtest() to compare the nested models:

lrtest(multinom_model_1, multinom_model_2)

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + gender
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -3346.7                         
## 2  16 -3330.7  8 32.147  8.766e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lrtest(multinom_model_1, multinom_model_3)

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -3346.7                         
## 2  56 -3293.9 48 105.74  3.148e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the case of the non-nested models, there is no function readily available to asess relative fit of non-nested multinomial logistic regression models. As such, we rely on the AIC comparisons and conclude that model 2 is preferable. However, we can turn any non-nested model into a nested model and test whether adding a specific variable enhances the model overall. Lets give that a try:

multinom_model_4 <- multinom(formula = lonely_change_covid ~ age + gender + ethnicity, model = TRUE, data = burnout_data)

## # weights:  85 (64 variable)
## initial  value 3910.934127 
## iter  10 value 3586.207423
## iter  20 value 3316.597365
## iter  30 value 3279.531244
## iter  40 value 3278.968796
## iter  50 value 3278.777025
## iter  60 value 3278.757737
## final  value 3278.756046 
## converged

broom::tidy(multinom_model_4)

## # A tibble: 64 × 6
##    y.level          term                estimate     std.error statistic p.value
##    <chr>            <chr>                  <dbl>         <dbl>     <dbl>   <dbl>
##  1 Much less lonely (Intercept)          -2.00   0.385          -5.19e+0 2.05e-7
##  2 Much less lonely age                   0.0147 0.00670         2.20e+0 2.79e-2
##  3 Much less lonely genderNon-binary      0.322  0.576           5.58e-1 5.77e-1
##  4 Much less lonely genderWoman           0.0920 0.215           4.27e-1 6.69e-1
##  5 Much less lonely ethnicityArab        -0.860  0.806          -1.07e+0 2.86e-1
##  6 Much less lonely ethnicityChinese     -0.121  0.586          -2.06e-1 8.37e-1
##  7 Much less lonely ethnicityFilipino   -19.0    0.00000000585  -3.24e+9 0      
##  8 Much less lonely ethnicityIndigenous  -0.245  0.522          -4.69e-1 6.39e-1
##  9 Much less lonely ethnicityJapanese     1.03   0.970           1.06e+0 2.89e-1
## 10 Much less lonely ethnicityKorean     -19.2    0.00000000447  -4.29e+9 0      
## # … with 54 more rows

lrtest(multinom_model_3, multinom_model_4)

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age + ethnicity
## Model 2: lonely_change_covid ~ age + gender + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  56 -3293.9                         
## 2  64 -3278.8  8 30.237  0.0001919 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that adding the variable is beneficial. Note we could also run the lrtest by calling specific variables:

lrtest(multinom_model_4, "gender")

## # weights:  75 (56 variable)
## initial  value 3910.934127 
## iter  10 value 3534.576873
## iter  20 value 3306.453616
## iter  30 value 3294.440945
## iter  40 value 3293.951705
## iter  50 value 3293.882902
## iter  60 value 3293.874501
## final  value 3293.874421 
## converged

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age + gender + ethnicity
## Model 2: lonely_change_covid ~ age + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  64 -3278.8                         
## 2  56 -3293.9 -8 30.237  0.0001919 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lrtest(multinom_model_4, "ethnicity")

## # weights:  25 (16 variable)
## initial  value 3910.934127 
## iter  10 value 3538.470171
## iter  20 value 3331.388276
## final  value 3330.671460 
## converged

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age + gender + ethnicity
## Model 2: lonely_change_covid ~ age + gender
##   #Df  LogLik  Df  Chisq Pr(>Chisq)    
## 1  64 -3278.8                          
## 2  16 -3330.7 -48 103.83  5.449e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lrtest(multinom_model_4, "age")

## # weights:  80 (60 variable)
## initial  value 3910.934127 
## iter  10 value 3356.260039
## iter  20 value 3302.449937
## iter  30 value 3298.120455
## iter  40 value 3297.618478
## iter  50 value 3297.574580
## iter  60 value 3297.568501
## final  value 3297.568363 
## converged

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age + gender + ethnicity
## Model 2: lonely_change_covid ~ gender + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1  64 -3278.8                         
## 2  60 -3297.6 -4 37.625  1.339e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Ordinal Logistic Regression

Lets go through this one more time, with ordinal logistic regression models created using the polr() function.

Convert out outcome to an ordinal variable

burnout_data$lonely_direct <- ordered(burnout_data$lonely_direct, levels = c("None of the time (e.g., 0 days)", "Rarely (e.g. less than 1 day)", "Some or a little of the time (e.g. 1-2 days)", "Occasionally or a moderate amount of time (e.g. 3-4 days)", "All of the time (e.g. 5-7 days)]")) #Convert to factor and specify order

table(burnout_data$lonely_direct) #call descriptives; As you can see now, the variable is now ordered. We can construct an ordinal logistic regresison using the polr() function from the MASS package

## 
##                           None of the time (e.g., 0 days) 
##                                                       384 
##                             Rarely (e.g. less than 1 day) 
##                                                       869 
##              Some or a little of the time (e.g. 1-2 days) 
##                                                       748 
## Occasionally or a moderate amount of time (e.g. 3-4 days) 
##                                                       307 
##                          All of the time (e.g. 5-7 days)] 
##                                                       137

install.packages("MASS", repos = "http://cran.us.r-project.org") #install package
library("MASS") #load library for package

ord_model_1 <- polr(formula = lonely_direct ~ age, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model_1)

## # A tibble: 5 × 5
##   term                                          estimate std.e…¹ stati…² coef.…³
##   <chr>                                            <dbl>   <dbl>   <dbl> <chr>  
## 1 age                                            0.00484 0.00251    1.93 coeffi…
## 2 None of the time (e.g., 0 days)|Rarely (e.g.… -1.50    0.108    -13.8  scale  
## 3 Rarely (e.g. less than 1 day)|Some or a litt…  0.228   0.101      2.26 scale  
## 4 Some or a little of the time (e.g. 1-2 days)…  1.69    0.109     15.5  scale  
## 5 Occasionally or a moderate amount of time (e…  3.01    0.132     22.9  scale  
## # … with abbreviated variable names ¹std.error, ²statistic, ³coef.type

ord_model_2 <- polr(formula = lonely_direct ~ age + gender, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model_2)

## # A tibble: 7 × 5
##   term                                          estimate std.e…¹ stati…² coef.…³
##   <chr>                                            <dbl>   <dbl>   <dbl> <chr>  
## 1 age                                            0.00345 0.00254    1.36 coeffi…
## 2 genderNon-binary                               0.537   0.257      2.09 coeffi…
## 3 genderWoman                                    0.275   0.0750     3.67 coeffi…
## 4 None of the time (e.g., 0 days)|Rarely (e.g.… -1.41    0.110    -12.8  scale  
## 5 Rarely (e.g. less than 1 day)|Some or a litt…  0.321   0.104      3.10 scale  
## 6 Some or a little of the time (e.g. 1-2 days)…  1.79    0.112     16.0  scale  
## 7 Occasionally or a moderate amount of time (e…  3.12    0.134     23.2  scale  
## # … with abbreviated variable names ¹std.error, ²statistic, ³coef.type

ord_model_3 <- polr(formula = lonely_direct ~ age + ethnicity, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model_3)

## # A tibble: 17 × 5
##    term                                         estimate std.e…¹ stati…² coef.…³
##    <chr>                                           <dbl>   <dbl>   <dbl> <chr>  
##  1 age                                           0.00263 0.00261  1.01   coeffi…
##  2 ethnicityArab                                -0.260   0.255   -1.02   coeffi…
##  3 ethnicityChinese                              0.00315 0.251    0.0125 coeffi…
##  4 ethnicityFilipino                             0.916   0.393    2.33   coeffi…
##  5 ethnicityIndigenous                           0.0645  0.177    0.365  coeffi…
##  6 ethnicityJapanese                             0.733   0.445    1.65   coeffi…
##  7 ethnicityKorean                               0.336   0.408    0.825  coeffi…
##  8 ethnicityLatin American                       0.147   0.211    0.698  coeffi…
##  9 ethnicityNone of the above, I identify as     0.409   0.251    1.63   coeffi…
## 10 ethnicitySouth Asian (e.g., East Indian, Pa…  0.544   0.268    2.03   coeffi…
## 11 ethnicitySoutheast Asian (e.g., Vietnamese,…  0.380   0.326    1.17   coeffi…
## 12 ethnicityWest Asian (e.g., Iranian, Afghan,…  0.487   0.431    1.13   coeffi…
## 13 ethnicityWhite                                0.307   0.127    2.41   coeffi…
## 14 None of the time (e.g., 0 days)|Rarely (e.g… -1.34    0.147   -9.17   scale  
## 15 Rarely (e.g. less than 1 day)|Some or a lit…  0.393   0.142    2.78   scale  
## 16 Some or a little of the time (e.g. 1-2 days…  1.86    0.148   12.6    scale  
## 17 Occasionally or a moderate amount of time (…  3.19    0.166   19.2    scale  
## # … with abbreviated variable names ¹std.error, ²statistic, ³coef.type

AIC(ord_model_1, ord_model_2, ord_model_3) #Check AIC, see that model 2 has the lowest; model 2 is better

##             df      AIC
## ord_model_1  5 7061.353
## ord_model_2  7 7049.343
## ord_model_3 17 7064.581

lrtest(multinom_model_1, multinom_model_2) #Check nested comparison of 1 and 2; 2 is better

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + gender
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -3346.7                         
## 2  16 -3330.7  8 32.147  8.766e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lrtest(multinom_model_1, multinom_model_3) #Check nested comparison of 1 and 3; 3 is better

## Likelihood ratio test
## 
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + ethnicity
##   #Df  LogLik Df  Chisq Pr(>Chisq)    
## 1   8 -3346.7                         
## 2  56 -3293.9 48 105.74  3.148e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, comparing the non-nested models is difficult, but the AIC is pretty good and you could create a nested model or check to see if the additional variable is worth it.

Note that step() also works with the non-linear models.

9. Factor Analysis

This section introduces you to the following concepts and their corresponding functions:

Internal Consistency: alpha(),
Exploratory Factor Analysis: factanal(), fa.parallel()

This chapter focuses on Factor Analysis, a method designed to uncover the relationships between a given set of variables. It looks for a smaller set of underlying or latent constructs that can explain the relationships among the observed or manifest variables. By the end of this lecture, you should be able to conduct a factor analysis and understand some scenarios in which it might be useful to conduct one.

Getting Started

Lets start today by reading in our data:

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv", na.strings = c("", "NA")) #Read in burnout_data

One of the requirements for a factor analysis is to use numeric variables. As such, we need to convert the items into numeric variables.

install.packages("tidyverse", repos = "http://cran.us.r-project.org")
library(tidyverse)

burnout_data <- mutate_at(.tbl = burnout_data, #Use the mutate function to recreate the dataset
                                      .vars = vars(burnout_tired:burnout_had_it), #Identify the vars you want to edit
                                      .funs = ~case_when(. == "Always" ~ "7", #recode level
                                                         . == "Very Often" ~ "6", #recode level
                                                         . == "Often" ~ "5", #recode level
                                                         . == "Sometimes" ~ "4", #recode level
                                                         . == "Rarely" ~ "3", #recode level
                                                         . == "Almost never" ~ "2", #recode level
                                                         . == "Never" ~ "1", #recode level
                                                         TRUE ~ .)) #Everywhere else just leave it whatever it was in the old dataset

burnout_data$burnout_tired <- as.numeric(burnout_data$burnout_tired)
burnout_data$burnout_disappointed <- as.numeric(burnout_data$burnout_disappointed)
burnout_data$burnout_hopeless <- as.numeric(burnout_data$burnout_hopeless)
burnout_data$burnout_trapped <- as.numeric(burnout_data$burnout_trapped)
burnout_data$burnout_helpless <- as.numeric(burnout_data$burnout_helpless)
burnout_data$burnout_depressed <- as.numeric(burnout_data$burnout_depressed)
burnout_data$burnout_sick <- as.numeric(burnout_data$burnout_sick)
burnout_data$burnout_worthless <- as.numeric(burnout_data$burnout_worthless)
burnout_data$burnout_difficulty_sleeping <- as.numeric(burnout_data$burnout_difficulty_sleeping)
burnout_data$burnout_had_it <- as.numeric(burnout_data$burnout_had_it)

We will work with a single scale, called the ten item personality inventory, which includes the following variables:

burnout_vars <- c("burnout_tired",
               "burnout_disappointed",
               "burnout_hopeless",
               "burnout_trapped",
               "burnout_helpless",
               "burnout_depressed",
               "burnout_sick",
               "burnout_worthless",
               "burnout_difficulty_sleeping",
               "burnout_had_it")

Lets create a dataframe with only these vars:

burnout_data_scale_items <- burnout_data[, names(burnout_data) %in% burnout_vars] #remove unnecessary columns
burnout_data_scale_items <- na.omit(burnout_data_scale_items) #remove observations with missing data

Internal Consistency

Now that our dataset is ready for analysis, we should first assess the internal consistency of our scale items. We can use Chronbach’s alpha to do this. Chronbach’s alpha tells us how closely related a set of items in a group are. The alpha() function in the psych package can be used to assess internal consistency:

library("psych")
alpha(burnout_data_scale_items)

## 
## Reliability analysis   
## Call: alpha(x = burnout_data_scale_items)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean  sd median_r
##       0.91      0.91    0.92      0.51  11 0.0026  3.4 1.1     0.51
## 
##     95% confidence boundaries 
##          lower alpha upper
## Feldt     0.91  0.91  0.92
## Duhachek  0.91  0.91  0.92
## 
##  Reliability if an item is dropped:
##                             raw_alpha std.alpha G6(smc) average_r  S/N alpha se
## burnout_tired                    0.91      0.91    0.91      0.53 10.1   0.0028
## burnout_disappointed             0.91      0.91    0.91      0.53 10.0   0.0028
## burnout_hopeless                 0.90      0.90    0.90      0.50  9.0   0.0031
## burnout_trapped                  0.90      0.90    0.90      0.50  9.1   0.0031
## burnout_helpless                 0.90      0.90    0.90      0.50  9.1   0.0031
## burnout_depressed                0.90      0.90    0.90      0.50  9.0   0.0031
## burnout_sick                     0.91      0.91    0.91      0.52  9.8   0.0029
## burnout_worthless                0.90      0.91    0.91      0.51  9.5   0.0029
## burnout_difficulty_sleeping      0.91      0.91    0.91      0.53 10.0   0.0028
## burnout_had_it                   0.91      0.91    0.91      0.53 10.0   0.0028
##                              var.r med.r
## burnout_tired               0.0083  0.52
## burnout_disappointed        0.0091  0.52
## burnout_hopeless            0.0073  0.50
## burnout_trapped             0.0077  0.50
## burnout_helpless            0.0075  0.50
## burnout_depressed           0.0092  0.47
## burnout_sick                0.0101  0.52
## burnout_worthless           0.0085  0.51
## burnout_difficulty_sleeping 0.0101  0.53
## burnout_had_it              0.0096  0.51
## 
##  Item statistics 
##                                n raw.r std.r r.cor r.drop mean  sd
## burnout_tired               2356  0.67  0.68  0.64   0.59  3.9 1.4
## burnout_disappointed        2356  0.69  0.70  0.66   0.62  3.7 1.4
## burnout_hopeless            2356  0.83  0.83  0.82   0.78  3.2 1.4
## burnout_trapped             2356  0.81  0.81  0.80   0.76  3.3 1.5
## burnout_helpless            2356  0.81  0.80  0.79   0.75  3.3 1.5
## burnout_depressed           2356  0.82  0.82  0.81   0.77  3.5 1.4
## burnout_sick                2356  0.71  0.71  0.67   0.64  3.2 1.3
## burnout_worthless           2356  0.76  0.75  0.72   0.69  3.0 1.5
## burnout_difficulty_sleeping 2356  0.70  0.70  0.65   0.62  3.6 1.5
## burnout_had_it              2356  0.71  0.70  0.65   0.62  3.5 1.6
## 
## Non missing response frequency for each item
##                                1    2    3    4    5    6    7 miss
## burnout_tired               0.06 0.09 0.21 0.33 0.19 0.09 0.04    0
## burnout_disappointed        0.05 0.12 0.25 0.32 0.15 0.08 0.03    0
## burnout_hopeless            0.14 0.19 0.24 0.25 0.11 0.04 0.02    0
## burnout_trapped             0.15 0.18 0.22 0.25 0.12 0.05 0.03    0
## burnout_helpless            0.16 0.16 0.23 0.26 0.12 0.06 0.02    0
## burnout_depressed           0.10 0.14 0.24 0.31 0.13 0.06 0.02    0
## burnout_sick                0.12 0.19 0.27 0.27 0.10 0.03 0.01    0
## burnout_worthless           0.21 0.20 0.24 0.21 0.09 0.04 0.02    0
## burnout_difficulty_sleeping 0.09 0.15 0.22 0.28 0.15 0.08 0.04    0
## burnout_had_it              0.13 0.14 0.20 0.28 0.14 0.07 0.03    0

As you can see the Cronbach’s alpha is very high at 0.91, which indicates excellent internal consistency:

Value	Interpretation
0.90 to 1.00	Excellent
0.80 to 0.89	Good
0.70 to 0.79	Acceptable
0.60 to 0.69	Questionable
0.50 to 0.59	Poor
0.00 to 0.50	Unacceptable

You can also look under the item statistics at the following statistics: n = number of complete cases raw.r = correlation of each item with the total score, not corrected for item overlap std.r = standardized correlation with total score, not correct for item overlap r.cor = correlation of each item, corrected for item overlap and scale reliability. r.drop = correlation of each item with total score, not including the item mean = mean sd = standard deviation

Exploratory Factor Analysis

This tells us that all the items in the scale are highly correlated (and that being hopeless and depressed are the items most correlated with total scores). However, Cronbach’s alpha does not mean that the items are unidimensional. Since the Burnout Measure is a single scale, we assume it represents a single factor. Therefore, lets construct a factor model, with 1 factor using the factanal() function.

factanal(x = burnout_data_scale_items, factors = 1, rotation="varimax")

## 
## Call:
## factanal(x = burnout_data_scale_items, factors = 1, rotation = "varimax")
## 
## Uniquenesses:
##               burnout_tired        burnout_disappointed 
##                       0.648                       0.604 
##            burnout_hopeless             burnout_trapped 
##                       0.297                       0.332 
##            burnout_helpless           burnout_depressed 
##                       0.343                       0.355 
##                burnout_sick           burnout_worthless 
##                       0.573                       0.457 
## burnout_difficulty_sleeping              burnout_had_it 
##                       0.617                       0.574 
## 
## Loadings:
##                             Factor1
## burnout_tired               0.593  
## burnout_disappointed        0.630  
## burnout_hopeless            0.838  
## burnout_trapped             0.817  
## burnout_helpless            0.810  
## burnout_depressed           0.803  
## burnout_sick                0.653  
## burnout_worthless           0.737  
## burnout_difficulty_sleeping 0.619  
## burnout_had_it              0.653  
## 
##                Factor1
## SS loadings      5.198
## Proportion Var   0.520
## 
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 1196.39 on 35 degrees of freedom.
## The p-value is 1.27e-228

The output from the factanal() function is robust: The first set of results deal with uniqueness. Uniqueness is the variance that is ‘unique’ to the variable and not shared with other variables. A variable with high uniqueness is not explained well by the factor model. A variable with a low uniqueness is explained well by the factor model. Based on this we can see that hopelessness, helplessness, being trapped, and being depressed are expalined well by the factor model, while being tired, dissapointeded, and not sleeping well are less so described by the variable.

The next group of output is the factor loadings. These numbers describe how well each variable maps onto a single factor. Since we only have one factor, we see that all variables map onto it. Again we see that hopelessness, trapped, helplessness, and depression map well onto the one factor model.

In the next section, we have SS loadings. Factors with SS loadings greater than 1 indicate they should be kept. The Proportion Var variable indicates the proportion of variance explained by each factor. If there were multiple factors, we would also see a row for Cumulative Variance, which would indicate the variance of each additional factor. You want these values to be high as they tell you the variance in the data that is explained by the factor model.

Finally, at the bottom is a hypothesis test. This tests whether 1 factor model sufficiently characterizes the variance in the data. A significant p-value suggests that the model is NOT sufficient. To address this we can turn to the fa.parallel() function in the psych package

psych::fa.parallel(x = burnout_data_scale_items) #Create a scree plot. The point in the scree plot where things clearly start to level off (i.e., the elbow) indicates the number of factors that you should use. The function also runs a parallel analysis tells you the optimal numbers of factors (in this case 3).

## Parallel analysis suggests that the number of factors =  3  and the number of components =  1

Given this test, lets run a factor analysis with three factors:

factanal(x = burnout_data_scale_items, factors = 3, rotation="varimax")

## 
## Call:
## factanal(x = burnout_data_scale_items, factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##               burnout_tired        burnout_disappointed 
##                       0.250                       0.436 
##            burnout_hopeless             burnout_trapped 
##                       0.256                       0.274 
##            burnout_helpless           burnout_depressed 
##                       0.303                       0.347 
##                burnout_sick           burnout_worthless 
##                       0.454                       0.394 
## burnout_difficulty_sleeping              burnout_had_it 
##                       0.500                       0.580 
## 
## Loadings:
##                             Factor1 Factor2 Factor3
## burnout_tired               0.174   0.799   0.287  
## burnout_disappointed        0.371   0.630   0.173  
## burnout_hopeless            0.755   0.315   0.274  
## burnout_trapped             0.760   0.312   0.228  
## burnout_helpless            0.726   0.205   0.358  
## burnout_depressed           0.536   0.372   0.477  
## burnout_sick                0.342   0.247   0.607  
## burnout_worthless           0.591   0.126   0.491  
## burnout_difficulty_sleeping 0.266   0.375   0.537  
## burnout_had_it              0.486   0.255   0.345  
## 
##                Factor1 Factor2 Factor3
## SS loadings      2.901   1.695   1.610
## Proportion Var   0.290   0.170   0.161
## Cumulative Var   0.290   0.460   0.621
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 112 on 18 degrees of freedom.
## The p-value is 1.33e-15

In the new output, we see considerbly lower uniqueness scores. In the factor loadings, we see the first factor relates to being hopeless, trapped, and helpless. The second factor represents being tired and dissapointed. The third factor relates to being sick and having difficulty sleeping.

Judging by the SS loadings, we should keep all three factors, and a higher ammount of variance is explained (62%). Technically speaking, the p-value remains insignificant, suggesting we could look at the larger factors, but we should have enough support from our parallel analysis and scree plot to decide that we’ve gone far enough. That said, lets take a look at the other options.

factanal(x = burnout_data_scale_items, factors = 4, rotation="varimax") #p-value = 0.0004

## 
## Call:
## factanal(x = burnout_data_scale_items, factors = 4, rotation = "varimax")
## 
## Uniquenesses:
##               burnout_tired        burnout_disappointed 
##                       0.333                       0.359 
##            burnout_hopeless             burnout_trapped 
##                       0.253                       0.251 
##            burnout_helpless           burnout_depressed 
##                       0.293                       0.342 
##                burnout_sick           burnout_worthless 
##                       0.485                       0.194 
## burnout_difficulty_sleeping              burnout_had_it 
##                       0.483                       0.569 
## 
## Loadings:
##                             Factor1 Factor2 Factor3 Factor4
## burnout_tired               0.185   0.702   0.369          
## burnout_disappointed        0.319   0.702   0.154   0.152  
## burnout_hopeless            0.686   0.343   0.202   0.342  
## burnout_trapped             0.760   0.299   0.229   0.174  
## burnout_helpless            0.706   0.180   0.335   0.254  
## burnout_depressed           0.509   0.343   0.476   0.233  
## burnout_sick                0.312   0.238   0.530   0.282  
## burnout_worthless           0.440   0.136   0.338   0.692  
## burnout_difficulty_sleeping 0.247   0.333   0.562   0.171  
## burnout_had_it              0.475   0.236   0.355   0.154  
## 
##                Factor1 Factor2 Factor3 Factor4
## SS loadings      2.519   1.585   1.429   0.905
## Proportion Var   0.252   0.158   0.143   0.090
## Cumulative Var   0.252   0.410   0.553   0.644
## 
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 27.39 on 11 degrees of freedom.
## The p-value is 0.00402

factanal(x = burnout_data_scale_items, factors = 5, rotation="varimax") #p-value = 0.0385

## 
## Call:
## factanal(x = burnout_data_scale_items, factors = 5, rotation = "varimax")
## 
## Uniquenesses:
##               burnout_tired        burnout_disappointed 
##                       0.315                       0.367 
##            burnout_hopeless             burnout_trapped 
##                       0.247                       0.251 
##            burnout_helpless           burnout_depressed 
##                       0.288                       0.348 
##                burnout_sick           burnout_worthless 
##                       0.478                       0.239 
## burnout_difficulty_sleeping              burnout_had_it 
##                       0.489                       0.005 
## 
## Loadings:
##                             Factor1 Factor2 Factor3 Factor4 Factor5
## burnout_tired               0.171   0.722   0.348   0.112          
## burnout_disappointed        0.311   0.691   0.135   0.152   0.131  
## burnout_hopeless            0.697   0.338   0.216   0.195   0.262  
## burnout_trapped             0.745   0.300   0.221   0.223          
## burnout_helpless            0.704   0.180   0.348   0.212   0.130  
## burnout_depressed           0.493   0.351   0.461   0.235   0.137  
## burnout_sick                0.316   0.242   0.551   0.159   0.187  
## burnout_worthless           0.499   0.124   0.415   0.164   0.546  
## burnout_difficulty_sleeping 0.235   0.345   0.545   0.179          
## burnout_had_it              0.302   0.199   0.224   0.897          
## 
##                Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings      2.401   1.591   1.391   1.112   0.476
## Proportion Var   0.240   0.159   0.139   0.111   0.048
## Cumulative Var   0.240   0.399   0.538   0.650   0.697
## 
## Test of the hypothesis that 5 factors are sufficient.
## The chi square statistic is 11.74 on 5 degrees of freedom.
## The p-value is 0.0385

factanal(x = burnout_data_scale_items, factors = 6, rotation="varimax") #p-value test satisfied

## 
## Call:
## factanal(x = burnout_data_scale_items, factors = 6, rotation = "varimax")
## 
## Uniquenesses:
##               burnout_tired        burnout_disappointed 
##                       0.388                       0.286 
##            burnout_hopeless             burnout_trapped 
##                       0.266                       0.253 
##            burnout_helpless           burnout_depressed 
##                       0.291                       0.005 
##                burnout_sick           burnout_worthless 
##                       0.512                       0.030 
## burnout_difficulty_sleeping              burnout_had_it 
##                       0.420                       0.005 
## 
## Loadings:
##                             Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
## burnout_tired               0.174   0.630           0.111   0.369   0.178  
## burnout_disappointed        0.278   0.754   0.133   0.143   0.136   0.108  
## burnout_hopeless            0.654   0.331   0.307   0.189   0.177   0.188  
## burnout_trapped             0.721   0.287   0.179   0.213   0.192   0.175  
## burnout_helpless            0.677   0.169   0.254   0.205   0.275   0.198  
## burnout_depressed           0.384   0.288   0.214   0.203   0.299   0.767  
## burnout_sick                0.294   0.226   0.278   0.158   0.456   0.202  
## burnout_worthless           0.382   0.137   0.834   0.155   0.246   0.160  
## burnout_difficulty_sleeping 0.215   0.288   0.169   0.167   0.609   0.150  
## burnout_had_it              0.287   0.188   0.147   0.892   0.197   0.143  
## 
##                Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
## SS loadings      2.021   1.458   1.081   1.070   1.066   0.846
## Proportion Var   0.202   0.146   0.108   0.107   0.107   0.085
## Cumulative Var   0.202   0.348   0.456   0.563   0.670   0.754
## 
## The degrees of freedom for the model is 0 and the fit was 0.0014

As we can see, it takes six factors to satisfy the p-value requirement. However, one reason for this is that we have quite a large dataset. As sample size increases, it is hard to have a “good fitting” model. I personally, would rely on the results of the fa.parallel() analysis, however the decision is subjective.

Given our factor analysis, we could choose to create three new subscales for burnout, assigning each variable to its highest scoring factor:

burnout_data$hope_trapped_helpless_subscale <- burnout_data$burnout_hopeless + burnout_data$burnout_trapped + burnout_data$burnout_helpless + burnout_data$burnout_depressed + burnout_data$burnout_worthless + burnout_data$burnout_had_it

burnout_data$tired_disappointed_subscale <- burnout_data$burnout_tired + burnout_data$burnout_disappointed

burnout_data$sick_subscale <- burnout_data$burnout_sick + burnout_data$burnout_difficulty_sleeping

burnout_data$burnout_scale <- burnout_data$burnout_sick + 
  burnout_data$burnout_difficulty_sleeping + 
  burnout_data$burnout_hopeless + 
  burnout_data$burnout_trapped + 
  burnout_data$burnout_helpless + 
  burnout_data$burnout_depressed + 
  burnout_data$burnout_worthless + 
  burnout_data$burnout_had_it + 
  burnout_data$burnout_tired + 
  burnout_data$burnout_disappointed

Now we can test correlations with the full subscale:

cor.test(x = burnout_data$burnout_scale, burnout_data$hope_trapped_helpless_subscale)

## 
##  Pearson's product-moment correlation
## 
## data:  burnout_data$burnout_scale and burnout_data$hope_trapped_helpless_subscale
## t = 168.57, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9577749 0.9639629
## sample estimates:
##       cor 
## 0.9609889

cor.test(x = burnout_data$burnout_scale, burnout_data$tired_disappointed_subscale)

## 
##  Pearson's product-moment correlation
## 
## data:  burnout_data$burnout_scale and burnout_data$tired_disappointed_subscale
## t = 56.63, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7417640 0.7759856
## sample estimates:
##       cor 
## 0.7593996

cor.test(x = burnout_data$burnout_scale, burnout_data$sick_subscale)

## 
##  Pearson's product-moment correlation
## 
## data:  burnout_data$burnout_scale and burnout_data$sick_subscale
## t = 66.667, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7940899 0.8220856
## sample estimates:
##       cor 
## 0.8085448

Clearly the mental manifestations subscales had the highest correlation: r = 0.961, followed by the the exhaustion subscale (r = 0.809), followed by the physical manifestations subscale (r = 0.759)

Convergent and Discriminant Validity

Testing for positive correlation with variables that you expect to be correlated with an outcome provides evidence for convergent validity, whereas testing for negative correlation with variables that you think are inversely associated with your outcome provides evidence for discriminant validity. With these new scales, we could examine which part of the scale items are most strongly impacted by loneliness (a variable we’ve seen pop up again and again as being important to burnout).

model <- lm(formula = UCLA3Score ~ sick_subscale + tired_disappointed_subscale + hope_trapped_helpless_subscale, data = burnout_data) #include subscales as explanatory factors for UCLA loneliness scores
broom::tidy(model) #summarize model

## # A tibble: 4 × 5
##   term                           estimate std.error statistic   p.value
##   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)                      2.59     0.103       25.0  1.18e-122
## 2 sick_subscale                   -0.0192   0.0168      -1.14 2.53e-  1
## 3 tired_disappointed_subscale      0.158    0.0154      10.2  4.60e- 24
## 4 hope_trapped_helpless_subscale   0.0952   0.00589     16.2  7.81e- 56

varImp(model) #check variable importance

##                                  Overall
## sick_subscale                   1.142389
## tired_disappointed_subscale    10.231031
## hope_trapped_helpless_subscale 16.170933

From the above, we see that loneliness is especially related to exhaustion and mental manifestations of burnout and not the physical manifestations of burnout. This sort of analysis illustrates why it may be sometimes helpful to identify individual factors from a larger scale.

We could also use the information from a factor analysis to create a shortened version of the scale, making sure to select the one or two most improtant variables for each factor. I’ll choose an arbitrary cut point of 0.60 to measure inclusion:

burnout_data$short_burnout <- burnout_data$burnout_trapped + burnout_data$burnout_tired + burnout_data$burnout_sick

cor.test(x = burnout_data$burnout_scale, y = burnout_data$short_burnout)

## 
##  Pearson's product-moment correlation
## 
## data:  burnout_data$burnout_scale and burnout_data$short_burnout
## t = 115.27, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9153694 0.9275417
## sample estimates:
##       cor 
## 0.9216821

As you can see, the new 3 item burnout measure is strongly correlated with the full burnout scale. Shortening a measure might be useful in instances where using the full scale would be too cumbersome (i.e., future research projects)

We could also use this information to test whether a scale works the same way in different populations. For example, we might run a factor analysis on two separate datasets representing people with and without disability:

burnout_data_disability <- burnout_data[burnout_data$identity_disability != "8888: Not Selected", ] #create dataset for people with disability

burnout_data_disability <- burnout_data_disability[, names(burnout_data_disability) %in% burnout_vars] #remove unnecessary columns

burnout_data_disability <- na.omit(burnout_data_disability) #remove observations with missing data

burnout_data_no_disability <- burnout_data[burnout_data$identity_disability == "8888: Not Selected", ] 

burnout_data_no_disability <- burnout_data_no_disability[, names(burnout_data_no_disability) %in% burnout_vars] #remove unnecessary columns

burnout_data_no_disability <- na.omit(burnout_data_no_disability) #remove observations with missing data

Within each model we can then check for the number of factors:

psych::fa.parallel(x = burnout_data_disability) # Parallel analysis Suggests 2 factors

## Parallel analysis suggests that the number of factors =  2  and the number of components =  1

psych::fa.parallel(x = burnout_data_no_disability)  #Parallel analysis  Suggests 3 factors

## Parallel analysis suggests that the number of factors =  3  and the number of components =  1

Next, we can compare the factor loadings for the three factor models for each group using the fa.congruence() function:

disability_loading <- factanal(x = burnout_data_disability, factors = 3, rotation="varimax") #Model for those with disability: factor 1 = hopeless, depressed, and wortheless factor 2 = tired factor 3 = trapped

no_disability_loading <- factanal(x = burnout_data_no_disability, factors = 3, rotation="varimax") #model for those without disability: factor 1 = hopeless, trapped, helpless, worthless factor 2 = sick, difficulty sleeping factor 3 = tired, disappointed

fa.congruence(disability_loading$loadings, no_disability_loading$loadings)

##         Factor1 Factor2 Factor3
## Factor1    0.95    0.93    0.69
## Factor2    0.67    0.78    0.98
## Factor3    0.96    0.73    0.67

The same can be done for the two factor models for each group:

disability_loading <- factanal(x = burnout_data_disability, factors = 2, rotation="varimax")
disability_loading #factor 1 = hopeless, depressed, and wortheless factor 2 = tired factor 3 = trapped

## 
## Call:
## factanal(x = burnout_data_disability, factors = 2, rotation = "varimax")
## 
## Uniquenesses:
##               burnout_tired        burnout_disappointed 
##                       0.284                       0.564 
##            burnout_hopeless             burnout_trapped 
##                       0.274                       0.304 
##            burnout_helpless           burnout_depressed 
##                       0.296                       0.296 
##                burnout_sick           burnout_worthless 
##                       0.641                       0.424 
## burnout_difficulty_sleeping              burnout_had_it 
##                       0.608                       0.478 
## 
## Loadings:
##                             Factor1 Factor2
## burnout_tired               0.154   0.832  
## burnout_disappointed        0.328   0.574  
## burnout_hopeless            0.773   0.359  
## burnout_trapped             0.787   0.278  
## burnout_helpless            0.796   0.266  
## burnout_depressed           0.698   0.466  
## burnout_sick                0.339   0.494  
## burnout_worthless           0.718   0.245  
## burnout_difficulty_sleeping 0.310   0.544  
## burnout_had_it              0.657   0.301  
## 
##                Factor1 Factor2
## SS loadings      3.625   2.207
## Proportion Var   0.363   0.221
## Cumulative Var   0.363   0.583
## 
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 102.45 on 26 degrees of freedom.
## The p-value is 5.01e-11

no_disability_loading <- factanal(x = burnout_data_no_disability, factors = 2, rotation="varimax") #factor 1 = hopeless, trapped, helpless, worthless factor 2 = sick, difficulty sleeping factor 3 = tired, disappointed

fa.congruence(disability_loading$loadings, no_disability_loading$loadings)

##         Factor1 Factor2
## Factor1    0.99    0.72
## Factor2    0.78    0.99

A congruence coefficient of 0.95 is interpreted as nearly identical, 0.90 as high similarity, and above 0.85 is a fair similarity. Below .85 is indication of bias. Results from our analysis show that factor 1 among people with disability is similar to both factors 1 and 2, factor 2 for PWD is similar to factor 3, and factor 3 is similar to factor 1.

Based on these differences, it is apparent that Burnout may represent a slightly different concept among people with disabilities. At the very least, we should carefully look for differences in the performance of any Burnout scales we use when modeling. Furthermore, this may motivate us to model the effects of burnout using stratified models to see if the predictors of burnout are different in these two populations. In the present case, the two-item factors are more comparable for people with and without disabilities, therefore, we might be motivated to construct and use subscales representing two rather than three subscales – perhaps one subscale reflecting the mental manifestations and second scale representing the physical manifestations.

10. Mediation and Moderation Analyses

This section introduces you to the following concepts and their corresponding functions:

Mediation analysis: mediate()
Stratification: “subset =”
Interactions: *, :, plot_model()

Getting Started

Lets start by reading in our dataset.

burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data

Mediation

The first method we are going to review in this chapter is called mediation. Mediation seeks to identify and explain the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third hypothetical variable, known as a mediator variable. There are four steps to create a mediation model:

Step 1 - Model relationship between your explanatory factor and outcome. In this case, lets use Burnout scores as our outcome and support from coworkers as our explanatory factor (we will also control for age and gender).

table(burnout_data$work_support) #Check out our variable

## 
## 8888:Not Working During COVID                    Completely 
##                           403                           152 
##                    Not at all                      Somewhat 
##                            12                           313 
##                   Very Little                     Very Well 
##                           101                           354

burnout_data$work_support[burnout_data$work_support == "8888:Not Working During COVID"] <- NA#remove those not employed during covid

I’m going to use a binary explanatory factor in this case, so will collapse the not at all, somewhat, and very little groups and the completely and very well groups:

burnout_data$work_support_di[burnout_data$work_support == "Completely" |
                               burnout_data$work_support == "Very Well"] <- "Completely/Very Well"

burnout_data$work_support_di[burnout_data$work_support == "Somewhat" | 
                               burnout_data$work_support == "Very Little" |
                               burnout_data$work_support == "Not at all"] <- "Somewhat/Very Little/Not at all"

burnout_data$work_support_di <- relevel(as.factor(burnout_data$work_support_di), ref = "Completely/Very Well") #Set reference level for variable

table(burnout_data$work_support_di)

## 
##            Completely/Very Well Somewhat/Very Little/Not at all 
##                             506                             426

Remove participants with missing observations across key variables <- Necessary step because each model needs to be based on the same underlying data. The work_support variable was only asked of one-third of respondents.

burnout_data_subset <- burnout_data[!is.na(burnout_data$BurnoutScore),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$work_support_di),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$UCLA3Score),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$age),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$gender),]

Y_X <- lm(formula = BurnoutScore ~ work_support_di + age + gender, data = burnout_data_subset)
broom::tidy(Y_X) #Results show that those who are less supported are more likely to burnout.

## # A tibble: 5 × 5
##   term                                        estimate std.e…¹ stati…²   p.value
##   <chr>                                          <dbl>   <dbl>   <dbl>     <dbl>
## 1 (Intercept)                                  3.15    0.116     27.1  2.80e-118
## 2 work_support_diSomewhat/Very Little/Not at…  0.541   0.0681     7.94 5.98e- 15
## 3 age                                         -0.00357 0.00305   -1.17 2.43e-  1
## 4 genderNon-binary                             0.359   0.232      1.55 1.21e-  1
## 5 genderWoman                                  0.230   0.0691     3.32 9.23e-  4
## # … with abbreviated variable names ¹std.error, ²statistic

Step 2 - Model relationship between your explanatory factor/outcome and your mediator. In this case, our mediator will be lonelinesss. I hypothesize that people who don’t feel supported, experience loneliness, and that loneliness contributes to burnout.

M_X <- lm(formula = UCLA3Score ~ work_support_di + age + gender, data = burnout_data_subset)
broom::tidy(M_X) #less worplace support is associated with loneliness

## # A tibble: 5 × 5
##   term                                         estim…¹ std.e…² stati…³   p.value
##   <chr>                                          <dbl>   <dbl>   <dbl>     <dbl>
## 1 (Intercept)                                  4.74    0.177    26.7   4.25e-116
## 2 work_support_diSomewhat/Very Little/Not at … 0.757   0.104     7.28  7.43e- 13
## 3 age                                          0.00327 0.00466   0.702 4.83e-  1
## 4 genderNon-binary                             0.371   0.354     1.05  2.94e-  1
## 5 genderWoman                                  0.474   0.105     4.50  7.85e-  6
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic

Y_M <- lm(formula = BurnoutScore ~ UCLA3Score + age + gender, data = burnout_data_subset)
broom::tidy(M_X) #Loneliness is associated with burnout

## # A tibble: 5 × 5
##   term                                         estim…¹ std.e…² stati…³   p.value
##   <chr>                                          <dbl>   <dbl>   <dbl>     <dbl>
## 1 (Intercept)                                  4.74    0.177    26.7   4.25e-116
## 2 work_support_diSomewhat/Very Little/Not at … 0.757   0.104     7.28  7.43e- 13
## 3 age                                          0.00327 0.00466   0.702 4.83e-  1
## 4 genderNon-binary                             0.371   0.354     1.05  2.94e-  1
## 5 genderWoman                                  0.474   0.105     4.50  7.85e-  6
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic

Step 3 - Model relationship between your explanatory factor and your outcome, controlling for mediator.

Y_XM <- lm(formula = BurnoutScore ~ work_support_di + UCLA3Score + age + gender, data = burnout_data_subset)
broom::tidy(Y_XM) #Thou the effect  remain statistically significant, You can see that the effect sizes dropped:

## # A tibble: 6 × 5
##   term                                         estimate std.e…¹ stati…²  p.value
##   <chr>                                           <dbl>   <dbl>   <dbl>    <dbl>
## 1 (Intercept)                                   1.73    0.139     12.4  6.87e-33
## 2 work_support_diSomewhat/Very Little/Not at …  0.315   0.0624     5.04 5.53e- 7
## 3 UCLA3Score                                    0.299   0.0196    15.3  6.48e-47
## 4 age                                          -0.00455 0.00272   -1.67 9.48e- 2
## 5 genderNon-binary                              0.248   0.206      1.20 2.30e- 1
## 6 genderWoman                                   0.0880  0.0622     1.41 1.57e- 1
## # … with abbreviated variable names ¹std.error, ²statistic

Step 4 - I can now use the mediate() function from the mediate package

install.packages("mediation", repos = "http://cran.us.r-project.org")

## package 'mediation' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("mediation")

mediation_results <- mediate(M_X, Y_XM, treat='work_support_di', mediator='UCLA3Score', control.value = "Completely/Very Well", treat.value = "Somewhat/Very Little/Not at all", boot=TRUE, sims=100) #Test for mediation using 100 bootstrap simulations

summary(mediation_results) #Get mediation test results.

## 
## Causal Mediation Analysis 
## 
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
## 
##                Estimate 95% CI Lower 95% CI Upper p-value    
## ACME              0.226        0.163         0.30  <2e-16 ***
## ADE               0.315        0.191         0.45  <2e-16 ***
## Total Effect      0.541        0.404         0.67  <2e-16 ***
## Prop. Mediated    0.418        0.301         0.55  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Sample Size Used: 894 
## 
## 
## Simulations: 100

Interpretation of Results

ACME (Effect of mediator/Indirect Effect) - approx 0.224 is the average increase in the outcome variable among the treatment group (i.e. those with low social support) that arrives as a result of loneliness rather than ‘directly’ from the treatment.
ADE (Direct Effect of Explanatory Factor) - approx 0.325 is the average increase in the outcome variable among the treatment group (i.e. those with low social support) that arrives ‘directly’ from the treatment.
Total effects, is the sum of the ACME and ADE.
Prop. Mediated is the proportion of the effect of your outcome that is mediated by your mediator.

Stratification

Mediation is used when you want to understand the mechanism or pathway by which an explanatory factor impacts an outcome. However, when you want to understand whether an effect operates differently in two populations, you should instead stratify your analysis. Lets try this by seeing if the effect of discrimination on burnout is the same on people who identify as Black, indigenous, or as a person of colour compared to those who do not. This can be done either by creating two seperate datasets (which you already know how to do) or by using the subset function.

table(burnout_data$identity_bipoc) #Check out our stratifiying variable (i.e., ethnicity)

## 
##                                                              8888: Not Selected 
##                                                                            2028 
## People of colour (e.g., Black, Indigenous, Asian  or other racialized minority) 
##                                                                             400

summary(burnout_data$discrimination_score) #Check out our explanatory variable (i.e., discrimination)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    9.00   14.00   25.00   24.86   34.00   54.00      92

summary(burnout_data$BurnoutScore) #Check out our outcome variable (i.e., burnout)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    1.00    2.70    3.50    3.42    4.10    7.00      92

BIPOC_model <- lm(formula = BurnoutScore ~ discrimination_score + age + gender, data = burnout_data, subset = (identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian  or other racialized minority)"))
summary(BIPOC_model)

## 
## Call:
## lm(formula = BurnoutScore ~ discrimination_score + age + gender, 
##     data = burnout_data, subset = (identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian  or other racialized minority)"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4393 -0.7006  0.0234  0.7108  3.3677 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.949119   0.246546  11.962  < 2e-16 ***
## discrimination_score  0.016692   0.004821   3.463 0.000598 ***
## age                  -0.005415   0.005528  -0.980 0.327973    
## genderNon-binary      0.234434   0.535565   0.438 0.661840    
## genderWoman           0.238843   0.111277   2.146 0.032502 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.058 on 365 degrees of freedom
##   (50 observations deleted due to missingness)
## Multiple R-squared:  0.04488,    Adjusted R-squared:  0.03441 
## F-statistic: 4.287 on 4 and 365 DF,  p-value: 0.002109

#R^2 0.03441 #Effect of discrimination: 0.016692   

Non_BIPOC_model <- lm(formula = BurnoutScore ~ discrimination_score + age + gender, data = burnout_data, subset = (identity_bipoc == "8888: Not Selected"))
summary(Non_BIPOC_model)

## 
## Call:
## lm(formula = BurnoutScore ~ discrimination_score + age + gender, 
##     data = burnout_data, subset = (identity_bipoc == "8888: Not Selected"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2650 -0.6880  0.0317  0.6491  3.5285 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.597136   0.110941  23.410  < 2e-16 ***
## discrimination_score 0.023961   0.002397   9.996  < 2e-16 ***
## age                  0.002214   0.001684   1.315 0.188632    
## genderNon-binary     0.574048   0.159085   3.608 0.000316 ***
## genderWoman          0.283173   0.050536   5.603 2.41e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.062 on 1878 degrees of freedom
##   (165 observations deleted due to missingness)
## Multiple R-squared:  0.06767,    Adjusted R-squared:  0.06568 
## F-statistic: 34.08 on 4 and 1878 DF,  p-value: < 2.2e-16

#R^2 0.06568  #Effect of discrimination: 0.023961

From the output of the two models above, we can see that by stratifying our analyses, we actually see that the effect of discrimination is smaller on BIPOC vs. non-BIPOC participants. Additionally, the model among BIPOC individuals explains less of the variance in Burnout Scores than the model among non-BIPOC participants (0.016692 vs. 0.023961). This might suggest that discrimination is not the only factor driving burnout among BIPOC people, or perhaps BIPOC people are just more resilient to experiences of discrimination (because they face it so often)

We can do some checks, such as look at the distribution of burnout among non-BIPOC vs. BIPOC- identified participants. To do so, I am going to use an advanced graphics package called ggplot2 and use the functions, ggplot() and geom_density to display the densities of discrimination scores, stratified by identity_bipoc using the fill = function

install.packages("ggplot2", repos = "http://cran.us.r-project.org") #read in package
library("ggplot2") #read in library
ggplot(burnout_data, aes(discrimination_score, fill = !is.na(identity_bipoc))) + #aes() is the mapping command in ggplot, I tell it the variable I want plotted; fill = tells R that I want the data stratified, !is.na() ensures that the non-missing data is not plotted.
  geom_density(alpha = 0.5) #alpha = 0.5 controls transparency #I use the plus sign above to allow me to connect the ggplot() mapping to the type of graph I want()

A simpler way of doing this graph would be to just graph each histogram separately:

hist(burnout_data$discrimination_score[burnout_data$identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian  or other racialized minority)"]) #Plot BIPOC

hist(burnout_data$discrimination_score[burnout_data$identity_bipoc == "8888: Not Selected"]) #Plot Non-BIPOC

Either graphing method reveals a similar story. Many more of the BIPOC folk experience daily discrimination. You can verify this by looking at the descriptives, stratified by the identity_BIPOC variable:

summary(burnout_data$discrimination_score[burnout_data$identity_bipoc == "8888: Not Selected"]) #Summary for Non-BIPOC

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    9.00   14.00   24.00   24.29   33.00   54.00      93

summary(burnout_data$discrimination_score[burnout_data$identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian  or other racialized minority)"]) #Plot BIPOC

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    9.00   18.00   30.00   28.21   36.00   54.00      39

Interactions

I could go through and compare these two models more thoroughly – assess their fit. But you can see from this example that stratifying can be a complex approach to understanding the effect. Another way you can understand these sorts of moderators is through the use of an interaction term. Interactions are specified in a model using the * symbol, as shown below:

Interaction_model <- lm(formula = BurnoutScore ~ discrimination_score*identity_bipoc + age + gender, data = burnout_data)

broom::tidy(Interaction_model) #Use tidy to view results, notice I use :: to specify the package that tidy belongs to, which saves me from using the library() function to call in the broom library

## # A tibble: 7 × 5
##   term                                        estimate std.e…¹ stati…²   p.value
##   <chr>                                          <dbl>   <dbl>   <dbl>     <dbl>
## 1 (Intercept)                                  2.63    0.107    24.5   6.65e-118
## 2 discrimination_score                         0.0236  0.00238   9.92  1.01e- 22
## 3 identity_bipocPeople of colour (e.g., Blac…  0.0487  0.160     0.303 7.62e-  1
## 4 age                                          0.00159 0.00161   0.992 3.22e-  1
## 5 genderNon-binary                             0.545   0.152     3.58  3.50e-  4
## 6 genderWoman                                  0.278   0.0460    6.06  1.64e-  9
## 7 discrimination_score:identity_bipocPeople … -0.00630 0.00535  -1.18  2.39e-  1
## # … with abbreviated variable names ¹std.error, ²statistic

In this model, the effects of discrimination and being a person of colour are both statistically significant predictors of burnout. However, the effect of the interaction term (I.e, the last variable listed showing discrimination:identity_bipoc) is not significant. This helps us understand the results better from our stratified model: Namely, that the effect of discrimination is not statistically different for BIPOC vs. non-BIPOC people. You will notive that the effect of the interaction term is actually negative. This is an artifact for the fact that I am also controlling for the independent effect of discrimination. If you want to include an interaction term without including the individual main effects you can seperate your interaction variables by the : symbol, as shown below

Interaction_model_2 <- lm(formula = BurnoutScore ~ discrimination_score:identity_bipoc + age + gender, data = burnout_data) #Create interaction without independent main effects of interaction term
broom::tidy(Interaction_model_2) #call results

## # A tibble: 6 × 5
##   term                                         estim…¹ std.e…² stati…³   p.value
##   <chr>                                          <dbl>   <dbl>   <dbl>     <dbl>
## 1 (Intercept)                                  2.65    0.0999   26.5   8.98e-135
## 2 age                                          0.00150 0.00157   0.950 3.42e-  1
## 3 genderNon-binary                             0.546   0.152     3.58  3.45e-  4
## 4 genderWoman                                  0.278   0.0459    6.05  1.69e-  9
## 5 discrimination_score:identity_bipoc8888: No… 0.0233  0.00219  10.6   1.03e- 25
## 6 discrimination_score:identity_bipocPeople o… 0.0185  0.00263   7.04  2.50e- 12
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic

This approach is similar to (though not identical to) stratification. It allows us to see the effect of discrimination seperately for BIPOC and non-BIPOC seperately.

Lets cover one more example of interaction with two continuous variables. We will look at the interaction of loneliness and extraversion in predicting burnout. I hypothesize that loneliness contributes more to burnout among those with high extraversion (because they need more social connection).

Interaction_model_3 <- lm(formula = BurnoutScore ~ UCLA3Score*tipi_extraversion_score + age + gender, data = burnout_data) #Create interaction without independent main effects of interaction term
broom::tidy(Interaction_model_3) #call results

## # A tibble: 7 × 5
##   term                               estimate std.error statistic  p.value
##   <chr>                                 <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)                         2.04      0.330       6.20  8.14e-10
## 2 UCLA3Score                          0.324     0.0514      6.31  3.95e-10
## 3 tipi_extraversion_score            -0.0431    0.0390     -1.11  2.69e- 1
## 4 age                                -0.00579   0.00186    -3.11  1.93e- 3
## 5 genderNon-binary                    0.550     0.198       2.78  5.48e- 3
## 6 genderWoman                         0.0462    0.0578      0.799 4.24e- 1
## 7 UCLA3Score:tipi_extraversion_score  0.00252   0.00618     0.407 6.84e- 1

Surprisingly the, interaction term is not significant, though we do see that extravertedness is associated with lower burnout and loneliness is associated with higher burnout. We can explore these effects further using the plot_model() function of the sjPlot package

install.packages("sjPlot", repos = "http://cran.us.r-project.org") #install package

## package 'sjPlot' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages

library("sjPlot") #Install library
plot_model(Interaction_model_3, type = "pred", terms = c("UCLA3Score", "tipi_extraversion_score")) #Plot the interaction model. type = "pred" is an argument to plot the marginal effects (or predicted values of Burnout across levels of loneliness)

As you can see the predicted levels of Burnout across UCLA scores are basically the same (overlapping) for the three levels displayed of the etraversion score, which were defined based on the mean and sd. So the red line represents those values 1 standard deviation below the mean, the blue represents the values up to the mean, and the green represents greater than 1 standard deviation above the mean:

summary(burnout_data$tipi_extraversion_score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.000   7.000   8.000   7.918   9.000  14.000    1266

sd(burnout_data$tipi_extraversion_score, na.rm = TRUE)

## [1] 2.402553

We can also plot a chart for the interaction_model_2:

plot_model(Interaction_model_2, type = "pred", terms = c("discrimination_score", "identity_bipoc"))

Based on these results, neither of my hypothesized interactions were confirmed. But hopefully, you see the value of using interaction terms. If you recall from last week, we talked about comparing model fit. These same methods are helpful when you are considering whether to include an interaction term.

Let’s check the AIC values for the interaction model

no_interaction <- lm(formula = BurnoutScore ~ UCLA3Score + tipi_extraversion_score + age + gender, data = burnout_data) #Create interaction without independent main effects of interaction term
broom::tidy(no_interaction) #call results

## # A tibble: 6 × 5
##   term                    estimate std.error statistic  p.value
##   <chr>                      <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)              1.92      0.151      12.7   1.10e-34
## 2 UCLA3Score               0.344     0.0167     20.7   1.02e-80
## 3 tipi_extraversion_score -0.0280    0.0117     -2.39  1.71e- 2
## 4 age                     -0.00581   0.00186    -3.12  1.85e- 3
## 5 genderNon-binary         0.547     0.197       2.77  5.67e- 3
## 6 genderWoman              0.0459    0.0578      0.793 4.28e- 1

Now we can compare that to the new model without the interaction:

AIC(Interaction_model_3, no_interaction)

##                     df      AIC
## Interaction_model_3  8 3110.899
## no_interaction       7 3109.066

lrtest(Interaction_model_3, no_interaction)

## Likelihood ratio test
## 
## Model 1: BurnoutScore ~ UCLA3Score * tipi_extraversion_score + age + gender
## Model 2: BurnoutScore ~ UCLA3Score + tipi_extraversion_score + age + gender
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   8 -1547.5                     
## 2   7 -1547.5 -1 0.1666     0.6831

As you can see the model without the interaction is superior. Adding the interaction, while perhaps interesting to demonstrate a null result from my hypothesized effect, does not add information to the model. This can be confirmed using the liklihood ratio test since the non-interaction model is nested within the interaction model.

Conclusion

In this course, you have learned many of the essential skills for analyzing survey data with R. I hope that it has been an enjoyable learning experience for you.

If you find any errors in this text or have suggestions on how the course could be improved, please email kcard@sfu.ca.