Studies of behavior often require the use of surveys to collect self-reported data about a participants perceptions, motives, beliefs, and actions. Whether you are studying how and why people get vaccinated against a disease or why they engage in other health related behaviours, such as exercise, socialization, or eating, surveys can provide you insights into the behaviour of individuals and populations.
Surveys can be administered in a variety of formats, ranging from telephone surveys, paper questionnaires, or computer-assisted online surveys. While this review does not explore data collection, each of these types of surveys results in a dataset that will need to be analyzed.
“Essentials for Analyzing Survey Data Using R” will teach you how to conduct statistical analyses using R Studio. In doing so, we often provide the simplest methods for meeting each objective. However, in R there are numerous ways to complete a given task and you are welcome to explore and identify strategies that work better for you.
This section introduces you to the following concepts and their corresponding functions:
R is a statistical language commonly accessed through the integrated development environment called RStudio. Base R performs a vast number of useful statistical operations, but it can be enhanced with packages. These packages are open source and created by the community of R users, making these resources open for public use free of charge.
Follow the instructions at this link to install both R and RStudio on your computer. Install R first then R-Studio.
There are several important windows to become familiar with when working in RStudio:
Four Essential Windows in R Studio
The top left window is a script editor. Here you can edit and execute R code (by highlighting and hitting ctrl+Enter).
The bottom left is the R console, where the R engine does its work. You’re your code is executed, with results and error notifications. You can also enter ad hoc commands in this window, but these commands will not be saved.
In the top right, R objects are displayed and summarized. Objects are named sets of data, whether a set of data from a survey, a list of names, a set of randomly-generated numbers, or even functions you have defined (created) yourself.
The bottom right contains several useful tools: a file explorer, a graphical device for displaying any plots you create (histograms, bar charts, and the like), a list of the packages you have installed, a viewer for function help documentation, and a viewer for web content.
An R function almost universally follows this form: function(arguments).
For instance, the mean() function can take the following arguments, each of which is described in the functions help file:
Argument | Description |
---|---|
x | An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only. |
trim | The fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint. |
na.rm | A logical value indicating whether NAvalues should be stripped before the computation proceeds. |
… | Further arguments passed to or from other methods. |
R takes the arguments in a given order (usually as listed in the help file. This means that you can choose whether or not to name your arguments.
The text below names each argument:
mean(x = dataset$variable_name, trim = 0, na.rm = TRUE)
The text below does not name each argument:
mean(dataset$variable_name, 0, TRUE)
R comes ready with a large set of useful functions, but the R community has created many new functions which perform more complex or specialized tasks or which simplify the language around basic R tasks. For the latter purpose, I highly recommend installing and familiarizing yourself with the Tidyverse family of packages, particularly plyr, dplyr, ggplot2, and readr. These simplify and extend data reading, cleaning, and plotting.
You will need these packages to perform the tasks I demonstrate below. Install them from RStudio under Tools > Install Packages… Input the list of desired packages: plyr tidyverse. Hit OK and RStudio will download and install the packages from CRAN. You can also download the packages from the R console:
install.packages(pkgs = c("plyr", "tidyverse"))
Finally, to activate all of the functions in each package, enter the following in the script editor or R console:
library(plyr)
library(tidyverse)
This will load all of the functions contained in these packages into your R session; without doing this, R will not have access to many of the functions used in this course.
Specific functions are described fully in R. The ? symbol is the basic function for calling up documentation about a given function. To use it, you simply add a question mark before the name of a function. For example, to get information about the mean() function, you can enter the following code:
?mean #Calls up the help documentation for the mean function.
When this code is run, the documentation for the mean function will appear in the”Help” window in Rstudio.
When the information provided is not enough, or you have questions about a specific function or error, the R community is extremely active. Your question has likely already been asked on one of the following websites:
Often times you can simply Google the error you encountered and you will be able to find a solution.
R has a comment flag, the hashtag (#). Any text on a line of code beginning with a # will not be read by the R engine. This allows you to explain what your code is trying to accomplish for your future self or for other readers of your code. It is vital that you comment your code, and it is best that you comment as you go. You will thank yourself time and again for developing this habit.
R is an object-based programming language. An object is a data structure used to store information. Objects can include vectors, lists, matrices, data frames, and many other types of data.
The “<-” symbol is used to assign values to an object. For example, the code below creates an object called “VARIABLE_LIST” than includes the following values: 1, 2, 3, 4, 5, 6, 7.
VARIABLE_LIST <- c(1,2,3,4,5,6,7) #Create an object called "VARIABLE_LIST" that contains the values 1, 2, 3, 4, 5, 6, 7.
You will notice that the values are enclosed in the c() function.
The c() function in R is a type of function known as a constructor. It is used to combine multiple objects into a vector. It can combine numeric values, character strings, logical values (TRUE or FALSE), and many other objects into a single vector.
There are always multiple ways to accomplish a given task in R. For example, since the values in our list are sequential, we could have used the following syntax to create the same list
VARIABLE_LIST_2 <- (1:7) #Create an object called "VARIABLE_LIST" that contains the values 1, 2, 3, 4, 5, 6, 7.
To view these objects, you can simply type the object names into the console window after creating them.
VARIABLE_LIST #Print the VARIABLE_LIST object.
## [1] 1 2 3 4 5 6 7
VARIABLE_LIST_2 #Print the VARIABLE_LIST_2 object.
## [1] 1 2 3 4 5 6 7
Now that you know how to create an object, you are ready to read your data into R.
For this course will utilize data from the 2021 Canadian Social Connection Survey (CSCS). The CSCS is a serial cross-sectional survey with a longitudinal sub-cohort that aims to study the social health and well-being of Canadians in the wake of the COVID-19 pandemic. Data from Wave 1 of the CSCS was collected between April 21st, 2021 and June 1st, 2021, during the third wave of the COVID-19 Pandemic in Canada. Throughout this period, participants were recruited using paid advertising in French and English on Facebook, Twitter, Instagram, and Google. Advertisements were targeted to people aged 16 years of age or older across Canada. Participants were eligible to participate if they were 16 years of age or older, lived in Canada, were able to complete the survey in English or French, and provided informed consent. Upon completion of the survey, participants were eligible to enter a prize draw for a $200 VISA gift card. Ethics review for the CSCS was conducted by the Research Ethics Board at the University of Victoria (Ethics Protocol Number 21-0115).
The questionnaire for this dataset is in a file called “CSCS_Questionnaire.pdf” and the question text and variable names are cross indexed in the file called: “variable_labels.csv”.
On my computer, the CSCS data file is stored at the following path. It is important to recognize that R is case sensitive, so it is important that everything is copied exactly as they are stored:
*“C:- Simon Fraser University (1sfu)Lab\2. Projects\9. Learning HubR for Analyzing Survey Data_data.csv”*
The backslash is a special character in R. It suppresses the special meaning of the character it proceeds. Therefore, to turn the file path into a usable path, you must replace each backslash with a double backslash, as follows:
“C:\Users\Kiffer\OneDrive - Simon Fraser University (1sfu)\HEAL Lab\2. Projects\9. Learning Hub\Using R for Analyzing Survey Data\CSCS_data.csv”
Data are stored in different file formats. For example, Microsoft Excel files often use the extensions .xlsx, .xls, or .csv. These extensions describe the format of the file your data is stored in. Different file formats require different functions in ordder to read their data into R.
Lets begin by learning how to read in .csv files. This can be done using the read.csv() function, as shown below:
CSCS_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\CSCS_data.csv", na.strings = c(" ", "", "NA", "9999: Missing")) #This command creates an object called "CSCS_data" using the read.csv() function. We have used two arguments within this function to specify the file name and the character strings that should be read in as missing.
You can see that after running the command lines above, that the data object has now been added to your environment window in the top right. Next to the object name, it shows the text “2448 obs. of 408 variables.” This indicates that there are 2448 rows of data and 408 columns. Each column is a variable. Each row is a participant observation. You can click on the data object and view the spreadsheet you read in.
You might also notice that the syntax I used above included several arguments within the read.csv() function. These arguments were the “file =” and “na.strings =” arguments. To familiarize yourself with what these arguments do, you can retrieve help documentation on any function, by using typing the ? symbol and the function name, as follows:
?read.csv #Using the ? in front of a function name opens the help file for that function
If you scroll through the help file you will see sections on usage and arguments. These sections tell you what the arguments do.
The “file =” argument is “the name of the file which the data are to be read from. In this case I am reading the file from a file path, which I have provided within quotation marks.
The”na.strings = ” argument is a character vector of strings which are interpreted as NA. In my example above, you will see that I created a list using the c() function. The string in this list says that if the data has a cell that is blank, a single space, NA, or says 9999: Missing, that the data should be read in as missing. You could add other character strings in this list, depending on how missing values are stored in the dataset you are reading in. Removing missing data may make data analysis easier. However, you should be cautious as sometimes missing data has a specific meaning. For example, if skip logic is used to determine which participants answer a given question, they may be missing data because they were not presented a given question.
Now that you understand what the code above accomplished, you can now view your data to confirm that it was read into R correctly. To view your data you can use the View() function.
View(CSCS_data) #The View() function will open the object called "CSCS_data" in a new window. Note that the View() function is typed with a capital V. If you were to use a lowercase V it would not work.
Note that above we read in a .csv file using the ‘read.csv()’ function. As I mentioned, different file formats require using different functions to be read into R.
R includes many functions in its base package. However some functions require you to install a new package.
While beyond the scope of this course, there are other functions and packages for other file formats that work very similar to those covered here. You can usually google and find an appropriate function that suits your specific needs. A few common functions include the read.spss() function, which is used to open SPSS files and the read_sas() function, which is used to open SAS files. As with the example below, you will need to install the packages that contain these functions.
To install a function that is not included in Base R, you will need to use the ‘install.packages()’ function. Lets take a look at the help file for this function.
?install.packages #Using the ? to open help documentation for the install.packages() function.
The ‘install.packages()’ function has an argument called pkgs. We can use this argument to specify the name of the package we want to download. In our case we want the readxl package
install.packages(pkgs = "readxl", repos = "http://cran.us.r-project.org") #Installing the readxl package using the install.packages() function. Note that the pkgs argument specifies the name of the package and the repos = function specifies that repository used. CRAN is the official repository for most R packages.
## package 'readxl' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
install.packages(pkgs = "Rcpp", repos = "http://cran.us.r-project.org") #Installing the Rcpp package using the install.packages() function.
## package 'Rcpp' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
Before an installed package can be used, you need to tell R that you are using a specific language library. This is because some functions have the same name as functions in other packages. To tell R we are using the readxl library we can use the library() function
library(readxl) #Telling R I want to use the readxl library
library(Rcpp) #Telling R I want to use the readxl library
With the package installed, we are now able to read in .xlsx data using the read_xlsx() function. Lets try this with the forward sortation area census data file. If you need to figure out how to use this function or if it ends up not working as expected, remember you can retrieve help documentation by typing the ? in front of the package name.
census_data <- read_xlsx(path = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\census_data.xlsx", sheet = 1) #Using the read_xlsx() function, which is part of the readxl package to create an object called "census_data" from the first sheet of the .xlsx file. Notice that since I set up a working directory earlier, I did not need to copy the entire file path. I only needed the name of the file, which I found by going to the files tab in the bottom right window of R.
As you can see, it is difficult to make sense of the large volume of data displayed using the command above. There are three helpful functions that you can use to better understand the nature of a variable you are working with: class(), levels() and table().
Lets start with the class variable:
class(CSCS_data$gender) #tells you the class of the variable you are working with.
## [1] "character"
As you can see the gender variable is currently stored as a character variable. Data can be stored as character, numbers, factors, as true/FALSE logical values, or some other structure. Different functions specify what types of variables they can work with. For example, lets look at the help documentation for the levels() function:
?levels #retrieve help file for the levels() function.
As you can see the levels function can be used on factors. Currently the gender variable in our dataset is stored as a character variable. We can change it to a factor variable using the as.factor() function. Other similar functions include the as.character(), as.logical(), as.integer(), and as.numeric() functions which can be used to set a variable type to character, logical, integer, or numeric.
CSCS_data$gender <- as.factor(x = CSCS_data$gender) #replace the gender variable with a factor variable representing the same data.
class(x = CSCS_data$gender) #Confirm that the gender variable is now a factor.
## [1] "factor"
You can now run the levels function to identify levels() of the factor
levels(x = CSCS_data$gender)
## [1] "Man" "Non-binary" "Woman"
The levels() function shows that there are three levels: “Man”, “Non-binary”, and “Woman” You can use the table() function to see how many of the 2448 participants identify as each gender. Notice that I include the useNA = argument to show the number of missing observations for the gender variable.
table(CSCS_data$gender, useNA = "always") #create a frequency table for the gender variable.
##
## Man Non-binary Woman <NA>
## 1157 59 1232 0
We now have two separate data files loaded in as R objects. The first object (i.e., CSCS_data) consists of survey responses. Each row represents a participant and each column a variable. The second object (i.e., census_data) consists of census data. Each row represents a forward sortation area (i.e., the first 3 letters of a postal code) and each column represents a variable. To use these two data frames together, we will need to merge them. The merge() function is one function that can be used for this purpose. Lets review the help file for the merge function:
?merge #Review help documentation fir the merge() function
As you can see, the merge function asks for two data frames or objects that you want to merge. You can merge these using a specific variable. If the variable you are merging the data on is labeled differently in the two datasets, you can specify the variable names using the by.x and by.y arguments. In our case, we want to match forward sortation area level data for each participant using their self-reported FSA code. In both datasets this variable is labeled “FSA”. Therefore we can use the “by =” argument. There is also another set of arguments that indicate what you should do in cases where there are not matches between the two datasets. For example, in our case we might not have recruited participants in every postal area. Also, not all participants provided a postal code. To ensure we keep all the participants in our survey, regardless of whether they have a matching postal code, we will use the “all.x = TRUE” argument. Because we don’t want to create new rows for postal codes that dont have any participants living in them, we will also use the “all.y = FALSE” argument. Lets try merging the dataframes together using merge
merged_data <- merge(x = CSCS_data, y = census_data, by = "FSA", all.x = TRUE, all.y = FALSE) #Merge CSCS_data and census_data to create merged_data
You can see that the new variable has 2,448 observations (1 for each participant) and 485 variables (80 from the census_data and 406 from the CSCS_data). Because the FSA variable matched in the two datasets, you have 485 variables in the new merged_data file instead of 486. You can review the names of the columns and rows, using the colnames() and rownames() function.
colnames(x = merged_data) #print names of columns
rownames(x = merged_data) #print names of rows
As you can see there are lots of columns that you might not be interested in working with. It is sometimes helpful and sometimes necessary to create a data file that contains only the variables you want to work with. Likewise, there may be specific observations that you are interested in. To omit rows or columns you do not want to work with it is important to know how to subset a data file.
The bracket (“[]”) symbol provides an easy way to subset your data so that it only includes specific observations or rows from your dataset. Lets create a dataset for each gender. Before we do, however, it is necessary to take a closer look at the gender variable in R. In R, you can call up data about a specific variable by using the $ symbol following your call for the dataframe. So in this example we would call the variable by specifying the name of the data frame, followed by the $ symbol, followed by the name of the gender variable:
merged_data$gender #Print the gender column in the merged_data dataframe
Using the output from the functions above, we are now ready to subset our dataset using the [] symbol. To do so, we will create three datasets called “data_men”, “data_women”, and “data_nonbinary”. Notice that to accomplish this we specify the source data (i.e., merged_data) and then add the bracket symbol immediately thereafter. The [] symbols can be used to specify a subset. I often read them as “where…” Within the [] symbol we then write a statement for each dataset that indicates which observations we want included in the new dataset. In doing so, we use the == sign to indicate “equals”. You will also notice that following our statement there is a comma. When using the [] symbols, information prior to the comma specifies information about rows, and information following the comma specifies information about columns.
data_men <- merged_data[merged_data$gender == "Man" ,]
data_women <- merged_data[merged_data$gender == "Non-binary" ,]
data_nonbinary <- merged_data[merged_data$gender == "Woman" ,]
You can now use the table statement on the original dataset to ensure that the correct number of observations are stored in each of the new data files.
table(merged_data$gender, useNA = "always") #confirm that the number of observations listed in the environment window matches what we expected from the original dataset.
##
## Man Non-binary Woman <NA>
## 1157 59 1232 0
Now that you have a dataset for each gender, lets try creating a dataset that includes only a small subset of variables that we’re interested in working with. In this case, I would like to only include variables included in the Malach-Pines Burnout Measure. To do this, I create an object called keepvars that includes a list of the names of the relevant variables I want to keep.
keepvars <- c("burnout_tired",
"burnout_disappointed",
"burnout_hopeless",
"burnout_trapped",
"burnout_helpless",
"burnout_depressed",
"burnout_sick",
"burnout_worthless",
"burnout_difficulty_sleeping",
"burnout_had_it") #Creates a list of variable names and stores it as an object called "keepvars"
Using this newly created object, which represents a list of variable names, I will use the [] symbols to create a new dataset called “data_men_burnout_vars” that only contains these variables. Because I am referring to column names within the “data_men” dataframe, I am using the names() function in combination with the %in% function, which indicates “in”. So you might read the code below as “create an object called ‘data_men_burnout_vars’ from the”data_men” dataframe where, the column’s names are in the keepvars object.”
data_men_burnout_vars <- data_men[, names(data_men) %in% keepvars] #create a new dataset called "data_men_burnout_vars" that only contains variables that are listed in the keepvars object
If you wanted to create a dataset that did included all variables, except those listed variables, you could create a similar list, which I will instead call “dropvars.” This name is arbitrary (and I could have in fact used the same object–keepvars–from above), but descriptive names are helpful.
dropvars <- c("burnout_tired",
"burnout_disappointed",
"burnout_hopeless",
"burnout_trapped",
"burnout_helpless",
"burnout_depressed",
"burnout_sick",
"burnout_worthless",
"burnout_difficulty_sleeping",
"burnout_had_it") #Creates a list of variable names and stores it as an object called "dropvars"
To drop the variables contained in the “dropvars” list, I again use the brackets function. This time, I use the !() function to tell R that I don’t want these variables. In R, the exclamation symbol indicates “NOT”.
data_men_no_burnout_vars <- data_men[,!(names(data_men) %in% dropvars)] #create a new dataset called "data_men_no_burnout_vars" that contains variables that are NOT listed in the "dropvars" object
This section introduces you to the following concepts and their corresponding functions:
To begin, we will read in and merge the datasets that we will be working with.
library(readxl) #Tell R I am using the readxl library
CSCS_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\CSCS_data.csv", na.strings = c(" ", "", "NA", "9999: Missing")) #read in CSCS data
census_data <- read_xlsx(path = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\census_data.xlsx", sheet = 1) #read in census data
merged_data <- merge(x = CSCS_data, y = census_data, by = "FSA", all.x = TRUE, all.y = FALSE) #Merge CSCS_data and census_data to create merged_data
As you can see there are many variables in this dataset. Lets use the skills we learned last week to create a smaller dataset. To select these variables I went through the questionnaire and data dictionary and identified those that I thought might be interesting to explore. I decided that I would look at four main groups of variables (1) demographic variables, (2) those related to social connection, (3) those related to personality (i.e., attachment, and personality), (4) and those related to work and burnout.
keepvars <- c("gender",
"age",
"ethnicity",
"income",
"relationship_status",
"identity_vetrans",
"identity_indigenous",
"identity_lgbtq",
"identity_disability",
"identity_bipoc",
"identity_pwud",
"identity_newcomers",
"identity_homeless",
"identity_mental_health",
"educational_attainment",
"lonely_change_covid",
"province",
"POP_DENS_SQ_KM",
"FAMILY_SIZE_AVG",
"UNSUITABLE_HOUSING_PER",
"UCLA3Score",
"lonely_direct",
"lonely_duration",
"close_friends_num",
"live_with_partner",
"zimet_overall_score",
"discrimination_score",
"tipi_agreeable_score",
"tipi_extraversion_score",
"tipi_conscientiousness_score",
"tipi_emotional_stability_score",
"tipi_openness_score",
"attachment_secure_score",
"attachment_anxious_score",
"attachment_avoidant_score",
"employment",
"money_concerned",
"work_paid_enough",
"work_quitting",
"work_dignity",
"work_support",
"work_quitting",
"burnout_tired",
"burnout_disappointed",
"burnout_hopeless",
"burnout_trapped",
"burnout_helpless",
"burnout_depressed",
"burnout_sick",
"burnout_worthless",
"burnout_difficulty_sleeping",
"burnout_had_it",
"BurnoutScore",
"BurnoutScore_groups",
"weightvec") #Creates a list of variable names and stores it as an object called "keepvars"
data_burnout <- merged_data[, names(merged_data) %in% keepvars] #create a new dataset called "data_men_burnout_vars" that only contains variables that are listed in the keepvars object
Now that I have created a new smaller dataset, I want to save it for future use. That way I don’t have to recreate my dataset everytime I want to call it in future lectures. To save a dataset I will use the write.csv() function to save it as a csv:
write.csv(data_burnout, file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Save data_burnout to the working directory. Will be able to now load it in future lectures without having to merge the datasets or select my variables.
Now that we have a more manageable sized dataset, lets learn how to calculate descriptive statistics for (1) categorical (i.e., factor/character/logical) and (2) numeric (i.e., numeric/integer) variables.
Lets start with a variable that we know is a categorical variable, such as gender. However, we should always start by verifying the class using the class() function.
class(data_burnout$gender) #Show the class of the gender variable
## [1] "character"
As we suspected Gender is saved as a character variable. To get descriptive statistics for a categorical variable, we can use the table() and prop.table() functions. The table() function provides frequencies and the prop.table() function can be combined with the table function to calculate proportions. Functions can be combined by nesting them. R processess these functions by running the function from the inside out. Take a look:
table(data_burnout$gender) #calculate frequencies of gender
##
## Man Non-binary Woman
## 1157 59 1232
prop.table(table(data_burnout$gender)) #calculate proportions of gender
##
## Man Non-binary Woman
## 0.47263072 0.02410131 0.50326797
As you can see, 50.4% (n = 1232) identified as a woman, 47.3% (n = 1157) identified as a man, and 2.4% identified as non-binary.
Frequently, you are interested in providing descriptive statistics stratified by a second (or even third) variable. Lets first try this for a categorical variable. For the table function, we can simply seperate the variables by a comma:
table(data_burnout$gender, data_burnout$BurnoutScore_groups) #calculate frequencies of gender
##
## 0 1
## Man 802 308
## Non-binary 30 24
## Woman 783 409
One nice feature of the table function is you can add labels if it is helpful.
table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups) #calculate frequencies of gender
## Burnt_Out
## gender 0 1
## Man 802 308
## Non-binary 30 24
## Woman 783 409
Getting proportions requires just one more step. You simply nest the table function from above within the prop.table() function and then specify a margin (1 or 2) that is used to calculate the percentage. The Margin = 1 argument calculates row total percentages, and Margin = 2 calculates column total percentages.
prop.table(table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups), margin = 1) #calculate percent of men that are burnt out.
## Burnt_Out
## gender 0 1
## Man 0.7225225 0.2774775
## Non-binary 0.5555556 0.4444444
## Woman 0.6568792 0.3431208
prop.table(table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups), margin = 2) # calculate the percent of burnt out people that are men.
## Burnt_Out
## gender 0 1
## Man 0.49659443 0.41565452
## Non-binary 0.01857585 0.03238866
## Woman 0.48482972 0.55195682
You can add additional variables, as needed, simply by adding them to the list:
table(gender = data_burnout$gender, Burnt_Out = data_burnout$BurnoutScore_groups, LGBTQ = data_burnout$identity_lgbtq) #calculate the frequency of burnout by gender AND sexual orientation
## , , LGBTQ = 8888: Not Selected
##
## Burnt_Out
## gender 0 1
## Man 723 257
## Non-binary 14 10
## Woman 737 372
##
## , , LGBTQ = Sexual or gender minorities (e.g., LGBTQ2+)
##
## Burnt_Out
## gender 0 1
## Man 77 49
## Non-binary 15 14
## Woman 41 27
Lets now try a numeric variable, such as Burnout Score. To do so we will use the summary() function.
class(data_burnout$BurnoutScore) #Confirm the class of the variable
## [1] "numeric"
summary(data_burnout$BurnoutScore) #get a summary
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.70 3.50 3.42 4.10 7.00 92
As you can see, the summary() function gives us the minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum, and the number of missing variables. Since we have the mean, we probably also want to standard deviation, which can be calculated using the sd() function
sd(data_burnout$BurnoutScore, na.rm = TRUE) #get the sd
## [1] 1.096954
The numbers summarizing the variable are great, but sometimes it is helpful to look at the data graphically. Lets review a few of the functions that can be used to create plots:
A dotchart (also known as a dot plot) is a type of chart that uses dots or markers to represent the values of a data set. The position of each dot on the chart represents one of the values in the data set, and the dots are often connected with a line to better visualize the trends in the data. Dotcharts are used to compare groups of related data and to examine the distribution of values within a data set.
dotchart(data_burnout$BurnoutScore) #Create a dotchart, the dotchat gives you a sense of the volume of data in given areas.
A histogram is a type of graph that displays the frequency of data within a set of ranges. It is used to visualize the distribution of numerical data. Histograms are useful for understanding the shape of a data set and for identifying skew in a dataset.
hist(data_burnout$BurnoutScore) #create a histogram. The histogram shows that the basic shape is a bellcurve.
A boxplot is a graphical representation of numerical data that displays the median, quartiles, and range of the data set. It is a useful tool for understanding the distribution of a data set and identifying potential outliers. Boxplots are also known as box and whisker plots
boxplot(data_burnout$BurnoutScore) #create a boxplot. The boxplot shows that the first quartile is slightly below 3 and the 3rd quartile is slightly above 4 and there are some outliers at the upper end of the distribution.
As you can see, all the charts show that the data is centered around 3-4, which agrees with our summary statistics (Median = 3.5, Mean = 3.4), the data are also slightly right-skewed, but basically the shape is normal, and the minimum of 1 and maximum of 7 are clearly identified.
A bar chart is a graphical representation of data using rectangular bars to illustrate numerical comparisons between categories. The bars can be plotted horizontally or vertically, and they are often used to compare values across different categories, such as sales performance over a specific period of time. You can create an easy bar chart for the categorical variables, by nesting a table() function within the barplot() function.
barplot(table(data_burnout$gender))
To save the images, you can go to the plots window in the bottom right and select “Export” then “Save as Image”
Another helpful way to view your data, is to create a table that shows the descriptive statistics for all your variables. This can be done using the tableone package.
install.packages("tableone", repos = "http://cran.us.r-project.org")
## package 'tableone' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library(tableone)
In this example, we are going to create a table with descriptive statistics reported separately for two groups. Often times, a descriptives table is stratified by your outcome variable, but this is not necessarily required.
In this case, we will use gender.
table(merged_data$gender)
##
## Man Non-binary Woman
## 1157 59 1232
class(merged_data$gender)
## [1] "character"
merged_data$outcome_var <- merged_data$gender
The CreateTableOne() function requires you to identifty a dataset. The “strata =” arguement is used to specify the variable you would like to stratify your table by (i.e., often your outcome). If you want results for the overall sample, you do not need to include the “strata =” argument.
Once the CreateTableOne() function creates the table, the print() function is used to create the table. There are a variety of options available, and you can look at the help documentation to determine which of these you would like to use. Finally, as.data.frame() is used to save the table into a format that could then be saved as a .csv file using the write_csv() function.
Table <- CreateTableOne(data = merged_data, strata = "outcome_var")
library(dplyr)
Frequencies <- print(Table, showAllLevels = TRUE, missing = FALSE, printToggle = FALSE, cramVars = TRUE, varLabels = TRUE, explain = FALSE) %>%
as_tibble(rownames = NA)
Table <- as.data.frame(Frequencies)
write_csv(Table, "descriptives_by_gender.csv")
This table will now allow you to review your dataset, helping you identify potential changes that you may need to make as part of the data manipulation stage of your analysis.
This section introduces you to the following concepts and their corresponding functions:
To get started, lets read in the burnout dataset we created last week.
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data
Now that the data is read in, lets take a look at the intersection of two variables: Gender and Relationship status. Remember we can use the table() to create a cross tab.
table(burnout_data$gender, burnout_data$relationship_status) #cross tab for gender and relationship status
##
## In a relationship Single and dating Single and not dating
## Man 553 307 282
## Non-binary 22 16 19
## Woman 579 204 435
The function above presents these variables together. But imagine that we wanted to analyze these data as a single variable with four levels: Men in a relationship, women in a relationship, single men, and single women. To do this we would have to edit and combine the variable. Lets start by simplifying the relationship status variable. To do so, we will use the [] symbol to tell R that we want to conditionally recode the data. The first step, however, is to create a new variable, as we never want to over-write our data in case we must make edits later. I will create an empty variable called rel_di in the burnout_data dataframe:
burnout_data$rel_di <- NA #Create empty variable
Now that I have the rel_di variable created, I will fill it conditionally based on the old variable, as follows:
burnout_data$rel_di[burnout_data$relationship_status == "Single and dating"] <- "Single" #Recode Single and dating as just single
burnout_data$rel_di[burnout_data$relationship_status == "Single and not dating"] <- "Single" #Recode Single and not-dating as just single
burnout_data$rel_di[burnout_data$relationship_status == "In a relationship"] <- "Not Single" #Recode In a relationship as not single
table(burnout_data$rel_di, useNA = "always") #Check to make sure it worked
##
## Not Single Single <NA>
## 1154 1263 31
As you can see above, I used the == sign to tell R that where the old relationship_status was equal to “Single and dating”, I wanted to assign “Single” to the new variable. The brackets allowed me to create that conditional recode. #I could have also coded the data using the != symbols, which indicates “Does not equal.” Lets give that a try:
burnout_data$rel_di_version_two <- NA #Create empty variable
burnout_data$rel_di_version_two[burnout_data$relationship_status != "In a relationship"] <- "Single" #Recode "Single and not dating" and "Single and dating" as "single"
burnout_data$rel_di_version_two[burnout_data$relationship_status == "In a relationship"] <- "Not Single" #Recode "In a relationship" as "not single"
Lets compare these two versions:
table(burnout_data$rel_di_version_two, burnout_data$rel_di, useNA = "always") #Compare to new variables
##
## Not Single Single <NA>
## Not Single 1154 0 0
## Single 0 1263 0
## <NA> 0 0 31
Note that the values that were originally NA were ignored, this is because we use a special function to deal with missing data: The is.na() function. Here is how it works
burnout_data$rel_di_version_three[is.na(burnout_data$relationship_status)] <- "9999: Missing" #Assign NA the "9999: Missing" value
table(burnout_data$rel_di_version_three, useNA = "always") #Make sure it worked
##
## 9999: Missing <NA>
## 31 2417
You can see that the NA’s have now been reassigned using the is.na() function. I can also insert an ! point in from the is.na function, to specify when I want to recode every value that is not missing:
burnout_data$rel_di_version_three[!is.na(burnout_data$rel_di_version_two)] <- "4444: Not Missing" #Assign all non missing variables as missing
table(burnout_data$rel_di_version_three) #Make sure it worked!
##
## 4444: Not Missing 9999: Missing
## 2417 31
Now that we have a simplified relationship status variable, lets create a new variable that combines relationship and gender:
burnout_data$rel_status_and_gender <- NA #Create an empty variable
To combine variables we need the | symbol and the & symbol. | indicates “or” and & indicates “and”. We can use these to stack conditions as we recode:
burnout_data$rel_status_and_gender[burnout_data$gender == "Man" &
burnout_data$rel_di == "Single"] <- "Single Men" #Create Single Men level
burnout_data$rel_status_and_gender[burnout_data$gender == "Woman" &
burnout_data$rel_di == "Single"] <- "Single Women" #Create Single Women level
burnout_data$rel_status_and_gender[burnout_data$gender == "Non-binary" &
burnout_data$rel_di == "Single"] <- "Single Non-binary folk" #Create Single Women level
burnout_data$rel_status_and_gender[burnout_data$gender == "Man" &
burnout_data$rel_di == "Not Single"] <- "Non-single Men" #Create Single Men level
burnout_data$rel_status_and_gender[burnout_data$gender == "Woman" &
burnout_data$rel_di == "Not Single"] <- "Non-single Women" #Create Single Women level
burnout_data$rel_status_and_gender[burnout_data$gender == "Non-binary" &
burnout_data$rel_di == "Not Single"] <- "Non-single Non-binary folk" #Create Single Women level
table(burnout_data$rel_status_and_gender) #Make Sure it worked
##
## Non-single Men Non-single Non-binary folk
## 553 22
## Non-single Women Single Men
## 579 589
## Single Non-binary folk Single Women
## 35 639
We now have a variable that represents both relationship status and gender, lets take a look at the % of people who are burnt out in each category
prop.table(table(burnout_data$BurnoutScore_groups, burnout_data$rel_status_and_gender), margin = 2) #Cross tab burnout and gender/rel status
##
## Non-single Men Non-single Non-binary folk Non-single Women Single Men
## 0 0.7528090 0.6190476 0.7051510 0.6951872
## 1 0.2471910 0.3809524 0.2948490 0.3048128
##
## Single Non-binary folk Single Women
## 0 0.5161290 0.6130081
## 1 0.4838710 0.3869919
You can see in this data that 24.7% of men in relationships, 29.5% of women in relationships, 38.1% of non-binary folk in relationships, 30.5% of single men, 38.7% of single women, and 48.4% of single non-binary folks were burnt out.
Now imagine, that you wanted to consider the effect that age was playing in burnout within these groups. To do so we would want to take a look at the age variable.
class(burnout_data$age) #Check class
## [1] "integer"
summary(burnout_data$age) #Get summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.0 27.0 33.0 38.7 48.0 89.0
As you can see the median age was 33, and the data are stored as an “integer” Lets say that based on this data, you wanted to know whether the patterns shown above differed for those older than 33 compared to those younger than 33. The first step we’d need to do was to create a new age variable. To do so, we can use the greater than (>), greater than or equal to (>=). less than (<) and less than or equal to (<=) symbols and conditionally recode the data, as follows:
burnout_data$age_di_at_33 <- NA #Create an empty variable
burnout_data$age_di_at_33[burnout_data$age >= 33] <- "33 years or older"
burnout_data$age_di_at_33[burnout_data$age < 33] <- "Younger than 33"
table(burnout_data$age_di_at_33, burnout_data$age) #make sure it worked
##
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 33 years or older 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Younger than 33 3 4 15 8 30 23 48 80 57 154 138 61 154 96 159
##
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 33 years or older 0 0 59 50 114 86 30 47 37 46 27 28 20 15 35
## Younger than 33 58 114 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 33 years or older 17 18 11 11 26 18 20 17 21 24 32 20 33 20 29
## Younger than 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 33 years or older 22 27 19 23 26 18 18 24 18 21 16 24 15 12 6
## Younger than 33 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 76 77 78 79 80 81 82 83 85 88 89
## 33 years or older 6 11 6 5 3 3 4 1 1 1 5
## Younger than 33 0 0 0 0 0 0 0 0 0 0 0
Note that we could have created more groups if we had wanted to!, we could have broken down the data according to quartiles (Younger than 27, 27-32, 33-47, 48+):
burnout_data$age_quartiles <- NA #Create an empty variable
burnout_data$age_quartiles[burnout_data$age < 27] <- "Younger than 27"
burnout_data$age_quartiles[burnout_data$age >= 27 &
burnout_data$age <= 32] <- "27-32"
burnout_data$age_quartiles[burnout_data$age >= 33 &
burnout_data$age <= 47] <- "33-47"
burnout_data$age_quartiles[burnout_data$age >= 48] <- "48+"
table(burnout_data$age_quartiles, burnout_data$age) #make sure it worked
##
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 27-32 0 0 0 0 0 0 0 0 0 0 0 61 154 96 159
## 33-47 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 48+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Younger than 27 3 4 15 8 30 23 48 80 57 154 138 0 0 0 0
##
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 27-32 58 114 0 0 0 0 0 0 0 0 0 0 0 0 0
## 33-47 0 0 59 50 114 86 30 47 37 46 27 28 20 15 35
## 48+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## Younger than 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 27-32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 33-47 17 18 0 0 0 0 0 0 0 0 0 0 0 0 0
## 48+ 0 0 11 11 26 18 20 17 21 24 32 20 33 20 29
## Younger than 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 27-32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 33-47 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 48+ 22 27 19 23 26 18 18 24 18 21 16 24 15 12 6
## Younger than 27 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 76 77 78 79 80 81 82 83 85 88 89
## 27-32 0 0 0 0 0 0 0 0 0 0 0
## 33-47 0 0 0 0 0 0 0 0 0 0 0
## 48+ 6 11 6 5 3 3 4 1 1 1 5
## Younger than 27 0 0 0 0 0 0 0 0 0 0 0
Lets look at the cross tabs between these variables and the other two variables we have been working with:
table(burnout_data$age_quartiles, burnout_data$rel_status_and_gender) #Cross tab burnout and gender/rel status
##
## Non-single Men Non-single Non-binary folk Non-single Women
## 27-32 158 3 146
## 33-47 182 10 161
## 48+ 144 7 199
## Younger than 27 69 2 73
##
## Single Men Single Non-binary folk Single Women
## 27-32 182 8 140
## 33-47 126 14 122
## 48+ 71 4 188
## Younger than 27 210 9 189
table(burnout_data$age_di_at_33, burnout_data$rel_status_and_gender) #Cross tab burnout and gender/rel status
##
## Non-single Men Non-single Non-binary folk Non-single Women
## 33 years or older 326 17 360
## Younger than 33 227 5 219
##
## Single Men Single Non-binary folk Single Women
## 33 years or older 197 18 310
## Younger than 33 392 17 329
table(burnout_data$age_quartiles, burnout_data$BurnoutScore_groups) #Cross tab burnout and gender/rel status
##
## 0 1
## 27-32 431 190
## 33-47 404 196
## 48+ 403 190
## Younger than 27 377 165
table(burnout_data$age_di_at_33, burnout_data$BurnoutScore_groups) #Cross tab burnout and gender/rel status
##
## 0 1
## 33 years or older 807 386
## Younger than 33 808 355
You can see that for the four level age breakdown, some of the cell counts get quite small when looking at gender, so preceding with the two level age variable might be better (though sample sizes for the non-binary folk are still quite small).
To now look at burnout groups among those under and over 33, we can use the [] to conditionally select the data where
prop.table(table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "33 years or older"],
burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "33 years or older"]), margin = 2) #older xtab
##
## Non-single Men Non-single Non-binary folk Non-single Women Single Men
## 0 0.7738854 0.6875000 0.7089337 0.6054054
## 1 0.2261146 0.3125000 0.2910663 0.3945946
##
## Single Non-binary folk Single Women
## 0 0.4666667 0.5939597
## 1 0.5333333 0.4060403
prop.table(table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "Younger than 33"],
burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "Younger than 33"]), margin = 2) #younger xtab
##
## Non-single Men Non-single Non-binary folk Non-single Women Single Men
## 0 0.7227273 0.4000000 0.6990741 0.7393617
## 1 0.2772727 0.6000000 0.3009259 0.2606383
##
## Single Non-binary folk Single Women
## 0 0.5625000 0.6309148
## 1 0.4375000 0.3690852
Looking at these data, one thing that sticks out to me is that the proportion of non-single non-binary folk who are burnt out nearly doubles when looking at those younger than 33 compared to those older than 33. To make sure this data, makes sense, it’s always a good idea to double check the underlying frequencies:
table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "33 years or older"],
burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "33 years or older"]) #older xtab
##
## Non-single Men Non-single Non-binary folk Non-single Women Single Men
## 0 243 11 246 112
## 1 71 5 101 73
##
## Single Non-binary folk Single Women
## 0 7 177
## 1 8 121
table(burnout_data$BurnoutScore_groups[burnout_data$age_di_at_33 == "Younger than 33"],
burnout_data$rel_status_and_gender[burnout_data$age_di_at_33 == "Younger than 33"]) #younger xtab
##
## Non-single Men Non-single Non-binary folk Non-single Women Single Men
## 0 159 2 151 278
## 1 61 3 65 98
##
## Single Non-binary folk Single Women
## 0 9 200
## 1 7 117
As you can see from this quality check, it seems the jump in burnout is based on only a few observations. Its always important to be aware that when you’re slicing up data in this way that it can create small sample sizes that make data analyses less reliable.
So far, we’ve covered how to edit variables. One of the common reasons that you have to edit variables, is when you are trying to calculate scale scores. Many of the variables included in this dataset consist of scale items, that need to be calculated to create scale scores. In most cases, I have already done that for the CSCS data, but nevertheless, many datasets that you might work with will not already have these scores calculated, so it is important to learn how to do.
Lets recreate the burnout score. The burnout scale consists of 10 items:
“burnout_tired” “burnout_disappointed” “burnout_hopeless” “burnout_trapped” “burnout_helpless” “burnout_depressed” “burnout_sick” “burnout_worthless” “burnout_difficulty_sleeping” “burnout_had_it”
If you look at these variables, they are scores from “never” to “always”:
table(burnout_data$burnout_tired) #check
##
## Almost never Always Never Often Rarely Sometimes
## 227 97 146 455 504 798
## Very Often
## 216
In order to create a scale score, I will need to recode these variables as numeric variables and the sum their values together to create a final scale score. To do so, I will recode the values using the same variable editing skills taught above:
burnout_data$burnout_tired[burnout_data$burnout_tired == "Never"] <- 1 #Assign Never a value of 1
burnout_data$burnout_tired[burnout_data$burnout_tired == "Almost never"] <- 2 #Assign Almost never a value of 2
burnout_data$burnout_tired[burnout_data$burnout_tired == "Rarely"] <- 3 #Assign Rarely a value of 3
burnout_data$burnout_tired[burnout_data$burnout_tired == "Sometimes"] <- 4 #Assign Sometimes a value of 4
burnout_data$burnout_tired[burnout_data$burnout_tired == "Often"] <- 5 #Assign Often a value of 5
burnout_data$burnout_tired[burnout_data$burnout_tired == "Very Often"] <- 6 #Assign Very Often a value of 6
burnout_data$burnout_tired[burnout_data$burnout_tired == "Always"] <- 7 #Assign Always a value of 7
table(burnout_data$burnout_tired) #Make sure it worked
##
## 1 2 3 4 5 6 7
## 146 227 504 798 455 216 97
As you can see, this would take a few lines of code to do for each of the 10 variables. To make this go a little bit faster, I am going to use the mutate_at, vars, and case_when functions. The intricacies of this code are beyond the scope of this course, but I will provide comments on each line to help you follow.
library(dplyr)
burnout_data <- mutate_at(.tbl = burnout_data, #Use the mutate function to recreate the burnout data
.vars = vars(burnout_disappointed:burnout_had_it), #Edit the vars in the dataset from "burnout_disappointed" to "burnout_had_it"
.funs = ~case_when(. == "Never" ~ "1", #When the old variable says "Never", assign a value of 1
. == "Almost never" ~ "2", #When the old variable says "Almost never", assign a value of 2
. == "Rarely" ~ "3", #When the old variable says "Rarely", assign a value of 3
. == "Sometimes" ~ "4", #When the old variable says "Sometimes", assign a value of 4
. == "Often" ~ "5", #When the old variable says "Often", assign a value of 5
. == "Very Often" ~ "6", #When the old variable says "Very Often", assign a value of 6
. == "Always" ~ "7", #When the old variable says "Always", assign a value of 7
TRUE ~ .)) #Everywhere else just leave it whatever it was in the old dataset
table(burnout_data$burnout_disappointed, useNA = "always") #Check the first variable to make sure it worked
##
## 1 2 3 4 5 6 7 <NA>
## 132 304 606 770 361 194 66 15
Don’t worry, you wont be tested on the intricacies of the mutate_at function, but as you can see, there is usually an easier more efficient way to do something in R. In this course, our goal is to teach the most straightforward way. But Learning new things is a great way to make your R coding experience better.
The next step in calculating the scale scores is to add the individual variable scores together. Before we do this, lets make sure they are numeric variables (remember they were originally stored as characters)
class(burnout_data$burnout_disappointed) #check class
## [1] "character"
burnout_data$burnout_tired <- as.numeric(burnout_data$burnout_tired) #recode as numeric
burnout_data$burnout_disappointed <- as.numeric(burnout_data$burnout_disappointed) #recode as numeric
burnout_data$burnout_hopeless <- as.numeric(burnout_data$burnout_hopeless) #recode as numeric
burnout_data$burnout_trapped <- as.numeric(burnout_data$burnout_trapped) #recode as numeric
burnout_data$burnout_helpless <- as.numeric(burnout_data$burnout_helpless) #recode as numeric
burnout_data$burnout_depressed <- as.numeric(burnout_data$burnout_depressed) #recode as numeric
burnout_data$burnout_sick <- as.numeric(burnout_data$burnout_sick) #recode as numeric
burnout_data$burnout_worthless <- as.numeric(burnout_data$burnout_worthless) #recode as numeric
burnout_data$burnout_difficulty_sleeping <- as.numeric(burnout_data$burnout_difficulty_sleeping) #recode as numeric
burnout_data$burnout_had_it <- as.numeric(burnout_data$burnout_had_it) #recode as numeric
class(burnout_data$burnout_disappointed) #make sure it worked
## [1] "numeric"
Now to add these numeric variables together, we can apply mathmatical transformations (e.g., +, -, *, /).
burnout_data$NewScore <- burnout_data$burnout_tired +
burnout_data$burnout_disappointed +
burnout_data$burnout_hopeless +
burnout_data$burnout_trapped +
burnout_data$burnout_helpless +
burnout_data$burnout_depressed +
burnout_data$burnout_sick +
burnout_data$burnout_worthless +
burnout_data$burnout_difficulty_sleeping +
burnout_data$burnout_had_it
summary(burnout_data$NewScore) #check to make sure it worked
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 10.0 27.0 35.0 34.2 41.0 70.0 92
summary(burnout_data$BurnoutScore) #compare with old score
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.70 3.50 3.42 4.10 7.00 92
As you can see there is one big difference between the old score and the new score. The new score is 10 times bigger! This is because the burnout inventory is not actually calculated as a sum of the individual items, it is an average of score items. This is an easy fix, we simply divide by 10!
burnout_data$NewScore <- burnout_data$NewScore / 10 #Calculate average of scale items, by dividing by number of items
summary(burnout_data$NewScore) #check to make sure it worked
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.70 3.50 3.42 4.10 7.00 92
While not shown here, you can use other mathematical functions, such as exp() to exponentiate, sum() to add values together, sqrt() to take the square root, mean() to get the average.
?exp
?sum
?sqrt
?mean
This section introduces you to the following concepts and their corresponding functions:
To get started, lets read in our burnout dataset.
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data
Correlation is typically used when comparing the association between two numeric variables. In our dataset we have a few continuous variables to work with. Lets start with looking at the association between burnout and loneliness scores. To do so, we will first want to inspect the variables and ensure they are in the correct format (i.e., numeric):
class(burnout_data$BurnoutScore) #Make sure they are numeric
## [1] "numeric"
class(burnout_data$UCLA3Score) #Make sure they are numeric
## [1] "integer"
burnout_data$UCLA3Score <- as.numeric(burnout_data$UCLA3Score) #Convert to numeric
Next, we can use the plot() function to visually inspect the relationship between the variables:
plot(burnout_data$BurnoutScore, burnout_data$UCLA3Score) #Plot the correlation between burnout scores and UCLA loneliness scores.
As you can see the top left part of the plot and bottom right part of the plot are quite sparse. You can imagine that if you were to draw a line through this data, you would probably draw it going from the bottom left to the top right. This indicates a positive association. However, with many scales that have a limited range, it can be difficult to ascertain an association visually because observations are overlapping. One strategy you can do is nest your variables within a jitter() function. The jitter() function adds a little bit of noise to your data so that all your observations are not overlapping one another. This can help you see trends a little bit clearer:
plot(jitter(burnout_data$BurnoutScore), jitter(burnout_data$UCLA3Score)) #Plot the correlation between burnout scores and UCLA loneliness scores using Jitter.
While in our data the jitter() function makes it easier to draw a line through the “bulky” party of the data. In other situations it might still be difficult plotting a line. As such, it is useful to calculate statistics that can help tell us the strength of an association objectively.
The cor.test() function can be used to calculate the correlation between two continuous variables
cor.test(burnout_data$BurnoutScore, burnout_data$UCLA3Score)
##
## Pearson's product-moment correlation
##
## data: burnout_data$BurnoutScore and burnout_data$UCLA3Score
## t = 31.193, df = 2336, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5130031 0.5702669
## sample estimates:
## cor
## 0.5422644
Correlation ranges from -1 (Strong negative association) to +1 (Strong Positive Association):
Correlation | Interpretation |
---|---|
0.00 to 0.19 | Very Weak |
0.20 to 0.39 | Weak |
0.40 to 0.59 | Moderate |
0.60 to 0.79 | Strong |
0.80 to 1.00 | Very Strong |
The correlation here is .5422 and the p-value is very small (i.e, 2.2e-16) Therefore we find that there is a moderately strong, statistically-signficant correlation between loneliness and burn out scores.
When loooking at correlations for many variables as once, it is useful to create a correlation matrix. A correlation matrix is a table showing correlation coefficients between variables. Each cell in the table shows the correlation between two variables. A correlation matrix is used to summarize data, as an input into a more advanced analysis, and as a diagnostic for advanced analyses. To create a correlation matrix we use the corrplot() function, within the corrplot package
install.packages("corrplot", repos = "http://cran.us.r-project.org") #Install the corrplot package
## package 'corrplot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library(corrplot) #read in the corrplot library
The first step to creating a correlation is to subset into a new dataframe the variables you are wanting to build a correlation matrix for. Otherwise, your correlation plot would be way to big!
keepvars <- c("BurnoutScore", #Select numeric variables you want included in your correlation matrix
"UCLA3Score",
"zimet_overall_score",
"discrimination_score",
"attachment_secure_score",
"attachment_avoidant_score",
"attachment_anxious_score")
corr_data <- burnout_data[keepvars] #Create new dataframe with the variables you selected
Now that you have the variables selected, you need to create and plot a correlation matrix using the cor() and corrplot() functions respectively.
corr_matrix<-cor(corr_data, use = "pairwise.complete.obs") #Create Correlation Matrix, removing any data with missing for each correlation.
corrplot(corr_matrix, is.corr = FALSE, method = "number") #Plot correlation matrix, and show results as "numbers"
This correlation matrix shows you how strongly the variables in your dataset are correlated using the colour and number. You can choose the method to display the plot using other symbols as well.
Another function you can use to create a correlation plot is the chart.Correlation() function within the PerformanceAnalytics packages. This method builds and plots the correlation plot in one step and has lots more useful graphs (histograms, individual correlation plots, etc), shows statistical significance of each test (the stars), and the size of effect. Note that the variables are listed in the middle square over top the historgram
install.packages("PerformanceAnalytics", repos = "http://cran.us.r-project.org") #Load in PerformanceAnalytics package
## package 'PerformanceAnalytics' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library(PerformanceAnalytics) #Use PerformanceAnalytics library
chart.Correlation(corr_data, histogram=TRUE) #Create plots
Note that there are three kinds of correlation coefficients that are widely used. Most of the correlation functions allow you to specify which one: The Default is the Pearsons correlation coefficient, which requires the variables to be normally distributed, linearly related, and have equal errors across all levels (i.e., homoscedastic). Next week, when we talk about linear regression, We will look at ways of checking whether your data meat these assumptions. In cases where these assumptions are not likely met, you can use the Spearman rank correlation, which is a non-parametric test that is used to measure the degree of association between two variables. The Spearman rank correlation test does not carry any assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal. Examples of using the spearman are provided below
chart.Correlation(corr_data, histogram=TRUE, method = "spearman") #Create correlation matrix using the spearman correlation
cor.test(burnout_data$BurnoutScore, burnout_data$UCLA3Score, method = "spearman") #Create Correlation between two variables with the spearman method argument
##
## Spearman's rank correlation rho
##
## data: burnout_data$BurnoutScore and burnout_data$UCLA3Score
## S = 1007256512, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.5271124
This section introduces you to the following concepts and their corresponding functions:
The last chapter reviewed the concept of correlation. This chapter focuses on linear regression. As you will see, linear regression is based on many of the same ideas as correlation. However, linear regression is advantageous because you can begin to work with multiple variables and you can mix variable types quite easily. Linear regression is used when you have a numeric outcome, such as the Burnout Score we have been working with.
To get started with linear regression, lets read in our burnout dataset.
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data
Now that we have our dataset, lets jump right in an build a linear regression model by creating an object called “model” using the lm() function.
?lm #Get information about the lm() function.
The main pieces of the lm model are the formula and dataset arguments. A formula is written using the ~ as an equals sign (because = and == are already used for other processes in R). When writing a formula, the variable on the left side of the ~ is your outcome. For a linear regression model it must be a numeric variable. The variables on the right side of the ~ are your explanatory factors. These are stratified by + signs. Lets give it a try to model whether burnout and gender are related:
model_1 <- lm(formula = BurnoutScore ~ gender, data = burnout_data) #Create a linear regression model object
broom::tidy(model_1) # get results for the model object
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.30 0.0327 101. 0
## 2 genderNon-binary 0.490 0.152 3.22 0.00129
## 3 genderWoman 0.221 0.0455 4.86 0.00000126
As you can see, the model provides regression coefficients (etimate), standard errors, t-values, and p-values (pr(>|t|)).
The intercept is listed along with two coefficients.
P-values and coefficients in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p-values for the coefficients indicate whether these relationships are statistically significant. A p-value less than 0.05 is typically considered to be statistically significant. The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease. The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable.
You may have noticed that the third coefficient for men, is missing. This is because it is the referent level. You can change the referent level of a factor variable, as follows:
class(burnout_data$gender)
## [1] "character"
burnout_data$gender <- as.factor(burnout_data$gender)
burnout_data$gender <- relevel(x = burnout_data$gender, ref = "Woman")
Now if we run the model again, you will see that “woman” is the referent level:
model_2 <- lm(formula = BurnoutScore ~ gender, data = burnout_data) #Create a linear regression model object
broom::tidy(model_2) # get results for the model object
## # A tibble: 3 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.52 0.0316 111. 0
## 2 genderMan -0.221 0.0455 -4.86 0.00000126
## 3 genderNon-binary 0.269 0.152 1.77 0.0767
When selecting a referent, you should choose one that theoretically makes sense or sometimes if we don’t particularly care, we just choose the largest group.
In the model we just created our results show that for comparing men to women, burnout scores are -0.2209 points lower. We can compare this with the descriptive results, where there is a 0.2209 difference in the mean burnout scores.
summary(burnout_data$BurnoutScore[burnout_data$gender == "Man"]) #Burnout Score summary is 3.297 for men
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.600 3.400 3.297 4.000 7.000 47
summary(burnout_data$BurnoutScore[burnout_data$gender == "Woman"]) #Burnout Score summary is 3.518 for woman
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.800 3.500 3.518 4.200 7.000 40
3.518-3.297 #Calculate difference in means for women and men
## [1] 0.221
summary(burnout_data$BurnoutScore[burnout_data$gender == "Non-binary"]) #Burnout Score summary is 3.518 for woman
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.150 3.800 3.787 4.475 6.900 5
3.518-3.787 #Calculate difference in means for women and non-binary folk
## [1] -0.269
As you can see, the coefficients bare a resemblance to the data. Its important to note however, that only the coefficient for men was statistically significant (p < 0.05), whereas the coefficient for non-binary folks was near, but not statistically significant (p > 0.05). This is probably because of the small sample size of non-binary folks.
The simple regression model can be used to control for confounding variables. Lets build a multivariable regression model controlling for age, loneliness, social support, discrimination, and other personality traits we worked with last week:
model_3 <- lm(formula = BurnoutScore ~ gender + age + UCLA3Score + zimet_overall_score + discrimination_score + tipi_extraversion_score + tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + attachment_avoidant_score + attachment_anxious_score, data = burnout_data) #Create a linear regression model object
broom::tidy(model_3) # View Results
## # A tibble: 15 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.06 0.514 5.95 5.02e- 9
## 2 genderMan -0.0326 0.0862 -0.378 7.06e- 1
## 3 genderNon-binary -0.0402 0.332 -0.121 9.04e- 1
## 4 age 0.000234 0.00321 0.0731 9.42e- 1
## 5 UCLA3Score 0.236 0.0298 7.92 1.62e-14
## 6 zimet_overall_score -0.146 0.0402 -3.64 3.04e- 4
## 7 discrimination_score 0.00890 0.00471 1.89 5.94e- 2
## 8 tipi_extraversion_score 0.00813 0.0193 0.421 6.74e- 1
## 9 tipi_agreeable_score -0.0381 0.0215 -1.77 7.74e- 2
## 10 tipi_conscientiousness_score 0.0000585 0.0199 0.00294 9.98e- 1
## 11 tipi_emotional_stability_score -0.0466 0.0208 -2.24 2.56e- 2
## 12 tipi_openness_score 0.0482 0.0204 2.36 1.87e- 2
## 13 attachment_secure_score -0.0202 0.0114 -1.77 7.68e- 2
## 14 attachment_avoidant_score -0.00337 0.0119 -0.282 7.78e- 1
## 15 attachment_anxious_score 0.0154 0.0131 1.17 2.41e- 1
Regression coefficients in this model are interpreted independently – they assume all other variables in the model are held constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others. We would interpret these findings as suggesting that burnout is a function of loneliness, and social support scores and personality traits – particularly emotional stability and oppenness to new experiences.
One other factor to take notice of is the change in the R-squared between our simple model and our multivariable model. R-squared is a measure of model fit in linear regression – it is the proportion of the variation in the outcome that is predictable from the explanatory variables. The Adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases when the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected. In our simple model the adjusted R-squared was 0.01169 and in the multivariable model the R squared was 0.3618. This suggests that our current model predicts about 36% of the variation in Burnout scores.
The output for your model also includes a measure called the residual standard error. In the simple regression model, the residual standard error was 1.091 and in the multivariable model, the RSE was 0.8974. A smaller RSE indicates better model fit because the RSE is the average variation of points around the fitted regression line. Basically it tells you that your line is better
You can use the RSE and R-squared to help you choose between models. The goal is to have higher R-squared and lower RSE. Sometimes this encourages us to put more and more variables into a model. However, generally speaking, we should prefer smaller less complex models to more complex ones – so if you are not seeing improvement you can opt for the more parsimonious (or simple) model.
In addition to checking model fit, you also have to check to make sure that your model meets the assumptions of a linear regression model, which are: 1) Linearity of the data. The relationship between the predictor (x) and the outcome (y) is assumed to be linear. 2) Normality of residuals. The residual errors are assumed to be normally distributed. 3) Homogeneity of residuals variance. The residuals are assumed to have a constant variance (homoscedasticity) 4) Independence of residuals error terms. 5) No multi-colinearity of variables included in model. Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.
Lets start with checking the linearity assumption, which can be checked using the residuals vs. fitted plot, we can call up this plot using a simple plot function:
plot(model_3, 1) #The linearity assumption is met when the red line is approximately horizontal and centred on 0.
Homogeneity of variance can be checked using the scale-location plot, which can be called using the 3rd type of plot called by the plot function:
plot(model_3, 3) #The homogeneity of variance assumption is met when the red line is approximately horizontal. Large swings indicate that the variance changes as the values of the outcome changes, which indicates heterogeneity.
The normality assumption can be checked using the normality of residuals plot, which can be called using the following function:
plot(model_3, 2) #In this plot, you want to see the data approximately on a diagonal line.
You can also visually inspect your outcome to get a sense of whether your outcome is normally distributed. However, you should note that the assumption for linear regression is the normality of the residuals, and not necessarily the normality of your variable outcome. Of course, these frequently go together and so we like to know whether our outcome is distributed normally.
hist(burnout_data$BurnoutScore) #Plot a histogram: As you can see there is some skew from the high burnout end, however, overall I think its a pretty healthy bell curve.
Sometimes a visual inspection wont cut it and you can check out a few quantitative measures, such as shapiro wilk test or kolmogorov-smirnov test.
shapiro.test(burnout_data$BurnoutScore) #Run the shapiro-wilk test;
##
## Shapiro-Wilk normality test
##
## data: burnout_data$BurnoutScore
## W = 0.99283, p-value = 2.327e-09
ks.test(burnout_data$BurnoutScore, "pnorm") #run the KS test; pnorm indicates we are testing against a normal distribution
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: burnout_data$BurnoutScore
## D = 0.89342, p-value < 2.2e-16
## alternative hypothesis: two-sided
Note that in both of the above tests, the p-values are statistically significant, indicating that the data are non-normal. This is because we have a large sample size, so detecting deviations from normality is really easy. As such, we usually do not rely on these tests solely. The histogram and Q-Q plots in the present case were satisfactory. With enough experience you will “know it when you see it” when it comes to assessing normality.
Next we want to check for outliers. The plot() function can be used to identify these:
plot(model_3, 5) #Observations with standardized residuals greater than or less than 3 may be outliers. You could try modeling the data by omitting those variables. They are numbered in the dataset so they can be easily removed by nesting a list within the bracket symbol and using the - sign to say I do not want these variables. Remember that from the first class, the content in brackets to the left of the comma refers to rows and to the right of the comma refers to columns
burnout_data_no_outliers <- burnout_data[-c(370, 392, 1063)] #
Lets try the model without the outliers:
model_4 <- lm(formula = BurnoutScore ~ gender + age + UCLA3Score + zimet_overall_score + discrimination_score + tipi_extraversion_score + tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + attachment_avoidant_score + attachment_anxious_score, data = burnout_data_no_outliers) #Create a linear regression model object
broom::tidy(model_4) # View Results
## # A tibble: 15 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.06 0.514 5.95 5.02e- 9
## 2 genderMan -0.0326 0.0862 -0.378 7.06e- 1
## 3 genderNon-binary -0.0402 0.332 -0.121 9.04e- 1
## 4 age 0.000234 0.00321 0.0731 9.42e- 1
## 5 UCLA3Score 0.236 0.0298 7.92 1.62e-14
## 6 zimet_overall_score -0.146 0.0402 -3.64 3.04e- 4
## 7 discrimination_score 0.00890 0.00471 1.89 5.94e- 2
## 8 tipi_extraversion_score 0.00813 0.0193 0.421 6.74e- 1
## 9 tipi_agreeable_score -0.0381 0.0215 -1.77 7.74e- 2
## 10 tipi_conscientiousness_score 0.0000585 0.0199 0.00294 9.98e- 1
## 11 tipi_emotional_stability_score -0.0466 0.0208 -2.24 2.56e- 2
## 12 tipi_openness_score 0.0482 0.0204 2.36 1.87e- 2
## 13 attachment_secure_score -0.0202 0.0114 -1.77 7.68e- 2
## 14 attachment_avoidant_score -0.00337 0.0119 -0.282 7.78e- 1
## 15 attachment_anxious_score 0.0154 0.0131 1.17 2.41e- 1
plot(model_4, 5) #plot the residuals vs. leverage. If the red line continues to be quite skewed, you might identify additional outliers and remove these as well by adding them to your list above. In this case, the line looks quite good.
Another type of outlier you might want to consider is the presence of influential values. This can be done by looking at cooks distance. A rule of thumb is that an observation has high influence if Cook’s distance exceeds 4/(n - p - 1), where n is the number of observations and p the number of explanatory variables. So in our case, we can identify a thresshold using the formula below
4/(497 - 13 - 1) # Thresshold is 0.008281573 for model 3, which is based on 497 observations and has 13 predictor variables.
## [1] 0.008281573
plot(model_3, 4) #As you can see, observations 392, 370, and 1063 (the same outliers identified in the last model, could be removed). Note there may be others that are below our thresshold, but we know that these are the worrying variables we should deal with first. Generally speaking, we want to limit the data we remove as then our model may become less externally valid to the real world.
The final thing we want to check for is multicolinearity, which can be done by calculating the variable inflation factor for each variable in our model. This can be done using the VIF() function from the regclass package.
install.packages("regclass", repos = "http://cran.us.r-project.org") #Install package
## package 'regclass' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library(regclass) #read library
VIF(model_3) #Calculate VIFs;
## GVIF Df GVIF^(1/(2*Df))
## gender 1.198890 2 1.046393
## age 1.453819 1 1.205744
## UCLA3Score 1.646890 1 1.283312
## zimet_overall_score 1.409677 1 1.187298
## discrimination_score 1.787552 1 1.336994
## tipi_extraversion_score 1.266307 1 1.125303
## tipi_agreeable_score 1.737687 1 1.318214
## tipi_conscientiousness_score 1.555833 1 1.247330
## tipi_emotional_stability_score 1.629230 1 1.276413
## tipi_openness_score 1.344348 1 1.159460
## attachment_secure_score 1.721943 1 1.312228
## attachment_avoidant_score 1.838748 1 1.356004
## attachment_anxious_score 2.370624 1 1.539683
Values from 5-10 or up indicate potential multicolinearity and could be a good sign for you to consider the variables in your model from a theoretical perspective. Note that multicolinearity is less a concern whenthe variables with high VIFs are control variables, and the variables of interest do not have high VIFs. Here’s the thing about multicollinearity: it’s only a problem for the variables that are collinear. It increases the standard errors of their coefficients, and it may make those coefficients unstable in several ways. But so long as the collinear variables are only used as control variables, and they are not collinear with your variables of interest, there’s no problem. The coefficients of the variables of interest are not affected, and the performance of the control variables as controls is not impaired. Also, note that If the proportion of cases in the reference category is small, the indicator variables will necessarily have high VIFs, even if the categorical variable is not associated with other variables in the regression model.
If any of the results from the test indicate violation of the assumptions, it is good to test explanatory models 1 by 1 to see which ones might be the issue. If the issue is with your outcome, you may need to convert it to a categorical variable and analyze it using logistic regression (which we will discuss next week). Alternatively, if the issue is with an explanatory variable, you might need to drop it form your model, or transform it in some way, or remove specific observations that might be impacting the results.
Once you have a model that you are fairly satisfied with and you feel meets the assumptions of linear regression at least as best as you can, one additional thing you might do is calculate the relative importance of the variables in your model. There are lots of ways to do this, and little agreement as to which is best. I suggest using the varImp function, of the caret pacakge:
install.packages("caret", repos = "http://cran.us.r-project.org") #install package
## package 'caret' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("caret") #load library for package
varImp(model_3) #Run the varImp function on model 3, Larger values indicate higher contribution to the model. As we can see here, the loneliness scores and social support scores are the most important.
## Overall
## genderMan 0.377635661
## genderNon-binary 0.120859591
## age 0.073085424
## UCLA3Score 7.923217142
## zimet_overall_score 3.638782866
## discrimination_score 1.889829241
## tipi_extraversion_score 0.420606825
## tipi_agreeable_score 1.769478261
## tipi_conscientiousness_score 0.002939864
## tipi_emotional_stability_score 2.239350494
## tipi_openness_score 2.358682784
## attachment_secure_score 1.773075965
## attachment_avoidant_score 0.281953095
## attachment_anxious_score 1.173279397
With that, you’ve now got what you need to know to construct a linear regression model. As you have seen, linear regression is all about key assumptions. As such, more flexible models are often used. Next time we will talk about one such flexible approach, logistic regression.
This section introduces you to the following concepts and their corresponding functions:
In the last chater, we learned how to construct linear regression models. Logistic Regression models are another useful modeling approach. This time we will cover three types of logistic regresison models: binomial models, multinomial models, and ordinal models.
To get started with learning about logistic regression, lets read in our burnout dataset.
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data
Now lets explore the use of the glm() function, which is used to create a logsitic regression model.
?glm #Call help function
The glm() function allows you to construct several types of models. You can read in the help file that these models are identified using the “family =” argument. In this chapter we forcus on “binomial” family, which is used for the binomial logistic regression model.
If you encounter a scenario in which you would be required to use a different family of models (e.g., you have count data and are therefore considering a Poisson model), it may be easier to restructure the variable as a binary variable so that you can use the “binomial” option, as it is fairly flexible compared to many of these. However, it is easy to learn about the other types of models from google or other sources. Outside of linear and logistic regression models, the most commonly used models are Poisson models.
For this exercise, we want to create a binomial logistic regression model, the kind used to analyze a binary outcome variable. To do so, I will use the categorical version of the Burnout Score as the outcome and the continuous version of the loneliness scale score as the explanatory factor:
burnout_data$BurnoutScore_groups <- as.factor(burnout_data$BurnoutScore_groups) #make sure outcome is a factor
glm_model <- glm(formula = BurnoutScore_groups ~ UCLA3Score, family = binomial(link = "logit"), data = burnout_data)
broom::tidy(glm_model)
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -4.28 0.199 -21.5 9.61e-103
## 2 UCLA3Score 0.603 0.0321 18.8 1.67e- 78
The coefficient here is interpreted as for each 1-point increase in the UCLA 3 loneliness score there is a 0.60 increase in the log odds of being burnt out. Now log odds are kind of a pain, so when epidemiologists work with logistic regression models, we typically convert the log odds to an odds ratio. To do this we simply exponentiate the estimate (coefficient) using the exp() function. To call the coefficients out of the model, we could use the coef() function or you could use model$coefficients:
exp((coef(glm_model))) #Use coef() function to call results, and exp() to convert to Odds Ratio
## (Intercept) UCLA3Score
## 0.01378765 1.82696096
exp(glm_model$coefficients) #use $ to refer to coefficients element of object and exp() to convert to Odds Ratio
## (Intercept) UCLA3Score
## 0.01378765 1.82696096
We interpret the odds ratio as there being a 82.7% increase in odds of being burnt out for each 1 point increase in the UCLA 3 scores. It is also common to report an odds ratio along with a confidence interval to communicate uncertainty around your estimate. To calculate the confidence intervals, we can use the confint() function.
exp(confint(glm_model))
## 2.5 % 97.5 %
## (Intercept) 0.009272364 0.0202388
## UCLA3Score 1.717042436 1.9476053
You can use the cbind() function to save these results:
cbind(exp((coef(glm_model))), exp(confint(glm_model))) #Bind the columns from the coef() and confint() model
## 2.5 % 97.5 %
## (Intercept) 0.01378765 0.009272364 0.0202388
## UCLA3Score 1.82696096 1.717042436 1.9476053
You can also assign this as a dataframe and save the dataframe as a csv using the following code:
GLM_table <- cbind(exp((coef(glm_model))), exp(confint(glm_model))) #save cbind results as object
GLM_table <- as.data.frame(GLM_table) #convert object to data frame
write.csv(x = GLM_table, file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\GLM_table.csv") #Write dataset to csv on computer
Now that you’ve got some results, you can assess the goodness of fit for your model, using a hosmer-Lemeshow goodness of test fit using the logitgof() function, which is part of the generalhoslem package. Note that Traditional residual plots are not very helpful with logistic regression.
install.packages("generalhoslem", repos = "http://cran.us.r-project.org") #install package
## package 'generalhoslem' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("generalhoslem") #load library for package
logitgof(obs = glm_model$y, fitted(glm_model)) #run hoslem test
##
## Hosmer and Lemeshow test (binary model)
##
## data: glm_model$y, fitted(glm_model)
## X-squared = 8.6349, df = 4, p-value = 0.0709
Notice that the observed values are extracted from the outcome of the model and then the expected values are calculated using the fitted() function. Because different objects have different features, each of the examples today gets to the observed and fitted values differently.
The results of this test have a p-value of 0.0709, indicating no evidence of poor fit (but very close). This tells you that your model has adequately acceptable goodness of fit to your data, but in our case it is borderline so you may want to consider a more complex model.
In addition to testing for goodness of fit, logistic regression are also frequently evaluated based on their predictive power, similar to how we used R-squared in linear regression. However, the R-squared does not work well in logistic regression, so we employ McFadden’s Pseudo R-squared using the PseudoR2() function from the DescTools package:
install.packages("DescTools", repos = "http://cran.us.r-project.org") #install package
## package 'DescTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("DescTools") #load library for package
PseudoR2(glm_model, "all") #run McFadden's Pseudo R-squared; #All specifies we want all model statistics.
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.1530704 0.1516964 0.1735279 0.2436881 0.1600797
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.2886466 0.1834035 0.2469587 0.1850409 2469.4635165
## BIC logLik logLik0 G2
## 2480.9776188 -1232.7317583 -1455.5304125 445.5973085
The values for a pseudo R squared tend to be considerably lower than R squared; The creator for the statistic says that values of 0.2 to 0.4 indicates good fit. In our case our score is below that, so perhaps we need to revise our model and compare the Pseudo R2 of the old and new models to see if we can improve things?
glm_model_2 <- glm(formula = BurnoutScore_groups ~ gender + age + UCLA3Score + zimet_overall_score + discrimination_score + tipi_extraversion_score + tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score + tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score + attachment_avoidant_score + attachment_anxious_score, family = binomial(link = "logit"), data = burnout_data)
summary(glm_model_2)
##
## Call:
## glm(formula = BurnoutScore_groups ~ gender + age + UCLA3Score +
## zimet_overall_score + discrimination_score + tipi_extraversion_score +
## tipi_agreeable_score + tipi_conscientiousness_score + tipi_emotional_stability_score +
## tipi_emotional_stability_score + tipi_openness_score + attachment_secure_score +
## attachment_avoidant_score + attachment_anxious_score, family = binomial(link = "logit"),
## data = burnout_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0363 -0.7954 -0.3082 0.7351 3.0283
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.041738 1.551997 -1.960 0.05001 .
## genderNon-binary -0.246547 0.962630 -0.256 0.79786
## genderWoman 0.049797 0.248916 0.200 0.84144
## age 0.003721 0.010537 0.353 0.72398
## UCLA3Score 0.552657 0.093435 5.915 3.32e-09 ***
## zimet_overall_score -0.327889 0.117585 -2.789 0.00529 **
## discrimination_score 0.030949 0.014488 2.136 0.03266 *
## tipi_extraversion_score 0.004050 0.059564 0.068 0.94579
## tipi_agreeable_score -0.120926 0.066214 -1.826 0.06781 .
## tipi_conscientiousness_score -0.011387 0.060180 -0.189 0.84992
## tipi_emotional_stability_score -0.122803 0.063711 -1.927 0.05392 .
## tipi_openness_score 0.104328 0.065496 1.593 0.11118
## attachment_secure_score -0.039546 0.038326 -1.032 0.30214
## attachment_avoidant_score 0.014096 0.035458 0.398 0.69097
## attachment_anxious_score 0.098093 0.041621 2.357 0.01843 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 635.88 on 496 degrees of freedom
## Residual deviance: 452.41 on 482 degrees of freedom
## (1951 observations deleted due to missingness)
## AIC: 482.41
##
## Number of Fisher Scoring iterations: 5
PseudoR2(glm_model_2, "all") #rerun McFadden's Pseudo R-squared for model_2
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.2885335 0.2413548 0.3086856 0.4276572 0.2696255
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.4803630 0.3200616 0.5307165 0.3231435 482.4077537
## BIC logLik logLik0 G2
## 545.5366041 -226.2038768 -317.9402944 183.4728352
PseudoR2(glm_model, "all") #rerun McFadden's Pseudo R-squared for model
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 0.1530704 0.1516964 0.1735279 0.2436881 0.1600797
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## 0.2886466 0.1834035 0.2469587 0.1850409 2469.4635165
## BIC logLik logLik0 G2
## 2480.9776188 -1232.7317583 -1455.5304125 445.5973085
As you can the new model substantially improves over the other models. But how do we know which variables in our model are relatively more important? Simply adding variables to a model to increase its Pseudo R-squared seems like a dubious practice. I suggest using the varImp function, of the caret pacakge:
install.packages("caret", repos = "http://cran.us.r-project.org") #install package
library("caret") #load library for package
varImp(glm_model_2) #Run the varImp function on model 2, which provides the absolute value of the t-statistic for each model parameter. Variables with large absolute values are relatively more important. As we can see here, the loneliness scores and social support scores are the most important (similar to what we found when modelling our data as a linear regression model using the continuous burnout scores).
## Overall
## genderNon-binary 0.25611787
## genderWoman 0.20005748
## age 0.35314959
## UCLA3Score 5.91489834
## zimet_overall_score 2.78853205
## discrimination_score 2.13625902
## tipi_extraversion_score 0.06799547
## tipi_agreeable_score 1.82627405
## tipi_conscientiousness_score 0.18922460
## tipi_emotional_stability_score 1.92749758
## tipi_openness_score 1.59289367
## attachment_secure_score 1.03184554
## attachment_avoidant_score 0.39753352
## attachment_anxious_score 2.35679207
Another thing you might want to do is to test model accuracy. The model accuracy is measured as the proportion of observations that have been correctly classified. It can be calculated using the predict() function.
glm.probs <- predict(glm_model_2,type = "response") #Predict() assigns fitted model values for each participants based on the model
glm.pred <- ifelse(glm.probs > 0.5, "1", "0") #ifelse() is a function that Assigns 1 if prediction is above 50%, and 0 if prediction is below 50%
prop.table(table(glm.pred == glm_model_2$y)) #prop.table() function tells me that the correct assignment is made about 75.7% of the time (i.e., the predicted value == the actual value)
##
## FALSE TRUE
## 0.2434608 0.7565392
table(predicted = glm.pred, actual = glm_model_2$y)
## actual
## predicted 0 1
## 0 285 77
## 1 44 91
This sounds good, but a coin flip has a 50-50 chance of being accurate. When can dive deeper, by constructing a table, you can see that 77 people who were burnt out were predicted by our model not to be (FALSE negative); and 44 of those who were not burnt out were predicted to be (FALSE positive). This suggests that your model predicts better than random, but its still not perfect. You don’t need to search for a perfect model, but you should be cautious claiming that your model is good, when it misclassifies almost one in four participants. #later in this course we will discuss more about comparing models. Sometimes the best strategy is to just build a better model than you have, because no model is perfect.
As with linear regression modes, binomial logistic regression models also assume that the variables are not co-linear. As such, you can use the VIF() function from the regclass package to assess this.
library("regclass") #load library for package
VIF(glm_model_2) #looks good no extremely high values
## GVIF Df GVIF^(1/(2*Df))
## gender 1.231106 2 1.053353
## age 1.594795 1 1.262852
## UCLA3Score 1.281699 1 1.132121
## zimet_overall_score 1.203995 1 1.097267
## discrimination_score 1.653544 1 1.285902
## tipi_extraversion_score 1.247772 1 1.117037
## tipi_agreeable_score 1.590532 1 1.261163
## tipi_conscientiousness_score 1.400683 1 1.183505
## tipi_emotional_stability_score 1.147639 1 1.071279
## tipi_openness_score 1.473251 1 1.213776
## attachment_secure_score 1.693511 1 1.301350
## attachment_avoidant_score 1.455864 1 1.206592
## attachment_anxious_score 1.988822 1 1.410256
Binomial logistic regression is great when you have just two levels in your outcome. In our dataset, we have a variable that asked participants whether the pandemic had made them more lonely, less lonely, or whether they were about the same as they had always been. Binomial logistic regression would not work on this variable because it has more than two levels.
table(burnout_data$lonely_change_covid)
##
## About the same Much less lonely Much more lonely
## 634 109 523
## Somewhat less lonely Somewhat more lonely
## 200 964
As you can see most people felt somewhat (n = 964) or much (n = 523) more lonely. BUt sizeable numbers reported that their loneliness was about the same as it was before the pandemic. We might be curious what factors are associated with these experiences. Lets examine, whether women or men were more likely to report changes in their levels of loneliness. TO do this, we will run a multinomial logistic regression using the multinom() function from the nnet package.
install.packages("nnet", repos = "http://cran.us.r-project.org") #install package
## package 'nnet' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("nnet") #load library for package
multinom_model <- multinom(formula = lonely_change_covid ~ gender, model = TRUE, data = burnout_data)
## # weights: 20 (12 variable)
## initial value 3910.934127
## iter 10 value 3368.824089
## final value 3362.277471
## converged
summary(multinom_model)
## Call:
## multinom(formula = lonely_change_covid ~ gender, data = burnout_data,
## model = TRUE)
##
## Coefficients:
## (Intercept) genderNon-binary genderWoman
## Much less lonely -1.8499554 0.2405371 0.1669536068
## Much more lonely -0.5203069 -0.3961239 0.6168463358
## Somewhat less lonely -1.1568775 0.1071966 -0.0008526973
## Somewhat more lonely 0.4425127 -0.4938345 -0.0236253960
##
## Std. Errors:
## (Intercept) genderNon-binary genderWoman
## Much less lonely 0.15212931 0.5684499 0.2114312
## Much more lonely 0.09184618 0.4283123 0.1219902
## Somewhat less lonely 0.11465031 0.4538499 0.1651529
## Somewhat more lonely 0.07186705 0.3283244 0.1037569
##
## Residual Deviance: 6724.555
## AIC: 6748.555
The output for this multinom model is somewhat hard to look at because we recieve a coefficient for every level of the outcome. The reference variable in this case is “About the same”, which you could change using the relevel() function if you wanted to. We will leave it as is for now. Lets first use the tidy() function in the broom pacakge to convert the results to odds ratios (i.e., exponentiate = TRUE) and calculate confidence intervals (i.e., conf.int = TRUE, conf.level = 0.95, )
install.packages("broom", repos = "http://cran.us.r-project.org") #install package
## package 'broom' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("broom") #load library for package
broom::tidy(multinom_model, conf.int = TRUE, conf.level = 0.95, exponentiate = TRUE) #Tidy of the results using the tidy function!!
## # A tibble: 12 × 8
## y.level term estim…¹ std.e…² statis…³ p.value conf.…⁴ conf.…⁵
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Much less lonely (Inte… 0.157 0.152 -1.22e+1 5.05e-34 0.117 0.212
## 2 Much less lonely gende… 1.27 0.568 4.23e-1 6.72e- 1 0.417 3.88
## 3 Much less lonely gende… 1.18 0.211 7.90e-1 4.30e- 1 0.781 1.79
## 4 Much more lonely (Inte… 0.594 0.0918 -5.66e+0 1.47e- 8 0.496 0.712
## 5 Much more lonely gende… 0.673 0.428 -9.25e-1 3.55e- 1 0.291 1.56
## 6 Much more lonely gende… 1.85 0.122 5.06e+0 4.27e- 7 1.46 2.35
## 7 Somewhat less lonely (Inte… 0.314 0.115 -1.01e+1 6.09e-24 0.251 0.394
## 8 Somewhat less lonely gende… 1.11 0.454 2.36e-1 8.13e- 1 0.457 2.71
## 9 Somewhat less lonely gende… 0.999 0.165 -5.16e-3 9.96e- 1 0.723 1.38
## 10 Somewhat more lonely (Inte… 1.56 0.0719 6.16e+0 7.40e-10 1.35 1.79
## 11 Somewhat more lonely gende… 0.610 0.328 -1.50e+0 1.33e- 1 0.321 1.16
## 12 Somewhat more lonely gende… 0.977 0.104 -2.28e-1 8.20e- 1 0.797 1.20
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴conf.low, ⁵conf.high
In these results, the “About the same” group is your referent, so all comparisons are against that group. So, compared to people who’s loneliness did not change, people who were much more lonely had 85% higher odds of being a woman.
Lets do a conceptual check on this data by reviewing the descriptive:
prop.table(table(burnout_data$lonely_change_covid, burnout_data$gender), margin = 1) #calculate % of each group who were each gender
##
## Man Non-binary Woman
## About the same 0.50157729 0.03154574 0.46687697
## Much less lonely 0.45871560 0.03669725 0.50458716
## Much more lonely 0.36137667 0.01529637 0.62332696
## Somewhat less lonely 0.50000000 0.03500000 0.46500000
## Somewhat more lonely 0.51348548 0.01970954 0.46680498
As you can see, 50.2% of those who were “about the same,” were men compared to only about 36.1% of those who were much more lonely. We could also look at this data by gender:
prop.table(table(burnout_data$gender, burnout_data$lonely_change_covid), margin = 1) #calculate % of each group who were each gender
##
## About the same Much less lonely Much more lonely
## Man 0.27604167 0.04340278 0.16406250
## Non-binary 0.34482759 0.06896552 0.13793103
## Woman 0.24262295 0.04508197 0.26721311
##
## Somewhat less lonely Somewhat more lonely
## Man 0.08680556 0.42968750
## Non-binary 0.12068966 0.32758621
## Woman 0.07622951 0.36885246
Only 16.4% of men were much more lonely as aresult of the pandemic, compared to 26.7% of men. These descriptive statistics agree with the results found in the model!
We can also use the same functions above to check goodness of fit, the pseudo R squared, and identify the importance of variables
The model fit for a multinomial logistic regression model can also be tested using the logitgof() function:
library("generalhoslem")
logitgof(obs = na.omit(burnout_data$lonely_change_covid), exp = fitted(multinom_model)) #notice I use the na.omit function here to restrict down the outcome variable to include only those that were present in the observed dataset na.omit() removes mising. Then the expected values are calculated using the fitted() function.
##
## Hosmer and Lemeshow test (multinomial model)
##
## data: na.omit(burnout_data$lonely_change_covid), fitted(multinom_model)
## X-squared = 3.7544e-07, df = 0, p-value < 2.2e-16
Likewise, you can also check Pseudo R-squared to assess model fit.
library("DescTools") #load package
PseudoR2(multinom_model, "all")
## McFadden McFaddenAdj CoxSnell Nagelkerke AldrichNelson
## 6.509770e-03 2.963994e-03 1.796916e-02 1.915080e-02 NA
## VeallZimmermann Efron McKelveyZavoina Tjur AIC
## NA NA NA NA 6.748555e+03
## BIC logLik logLik0 G2
## 6.818103e+03 -3.362277e+03 -3.384309e+03 4.406214e+01
As with other models, it is informative to check variable importance. However, as we only have one variable, lets first build a model that includes the age variable.
library("nnet") #load package
multinom_model_2 <- multinom(formula = lonely_change_covid ~ gender + age, model = TRUE, data = burnout_data)
## # weights: 25 (16 variable)
## initial value 3910.934127
## iter 10 value 3538.470171
## iter 20 value 3331.388276
## final value 3330.671460
## converged
summary(multinom_model_2)
## Call:
## multinom(formula = lonely_change_covid ~ gender + age, data = burnout_data,
## model = TRUE)
##
## Coefficients:
## (Intercept) genderNon-binary genderWoman age
## Much less lonely -2.3918714 0.2231964 0.100150457 0.014287964
## Much more lonely -1.1051994 -0.4150591 0.544377317 0.015371055
## Somewhat less lonely -0.5111307 0.1329242 0.060438748 -0.018495488
## Somewhat more lonely 0.7185628 -0.4829126 0.005327498 -0.007717216
##
## Std. Errors:
## (Intercept) genderNon-binary genderWoman age
## Much less lonely 0.2907118 0.5691078 0.2140786 0.006349814
## Much more lonely 0.1693394 0.4294329 0.1235672 0.003716101
## Somewhat less lonely 0.2375174 0.4549662 0.1664447 0.006099787
## Somewhat more lonely 0.1443500 0.3286260 0.1047088 0.003487082
##
## Residual Deviance: 6661.343
## AIC: 6693.343
library("caret") #load package
varImp(multinom_model_2) #Check variable importance
## Overall
## genderNon-binary 1.25409225
## genderWoman 0.71029402
## age 0.05587172
The third type of regression we want to cover today is an ordinal logistic regression model. This is useful when your outcome has a natural order to it. Usually your selected referent is either the highest or lowest level of the variable. The first step is to make sure your variable is saved as an ordered factor. Lets give this a try using the question that assesses how frequently participants feel lonely:
table(burnout_data$lonely_direct) #call descriptives
##
## All of the time (e.g. 5-7 days)]
## 137
## None of the time (e.g., 0 days)
## 384
## Occasionally or a moderate amount of time (e.g. 3-4 days)
## 307
## Rarely (e.g. less than 1 day)
## 869
## Some or a little of the time (e.g. 1-2 days)
## 748
As you can see, the variable levels are currently ordered alphabetically. We will need to reorder the variable using the ordered() function:
burnout_data$lonely_direct <- ordered(burnout_data$lonely_direct, levels = c("None of the time (e.g., 0 days)", "Rarely (e.g. less than 1 day)", "Some or a little of the time (e.g. 1-2 days)", "Occasionally or a moderate amount of time (e.g. 3-4 days)", "All of the time (e.g. 5-7 days)]")) #Convert to factor and specify order
table(burnout_data$lonely_direct) #call descriptives; As you can see now, the variable is now ordered. We can construct an ordinal logistic regresison using the polr() function from the MASS package
##
## None of the time (e.g., 0 days)
## 384
## Rarely (e.g. less than 1 day)
## 869
## Some or a little of the time (e.g. 1-2 days)
## 748
## Occasionally or a moderate amount of time (e.g. 3-4 days)
## 307
## All of the time (e.g. 5-7 days)]
## 137
install.packages("MASS", repos = "http://cran.us.r-project.org") #install package
library(MASS) #load library for package
ord_model <- polr(formula = lonely_direct ~ age + gender + lonely_change_covid + BurnoutScore, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model) # Get a summary, which is hard to read, so convert to a tidy table using the broom package
## # A tibble: 12 × 5
## term estim…¹ std.e…² stati…³ coef.…⁴
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 age 0.00402 0.00263 1.53 coeffi…
## 2 genderNon-binary 0.115 0.274 0.421 coeffi…
## 3 genderWoman 0.0300 0.0794 0.378 coeffi…
## 4 lonely_change_covidMuch less lonely 0.0188 0.214 0.0878 coeffi…
## 5 lonely_change_covidMuch more lonely 1.59 0.123 12.9 coeffi…
## 6 lonely_change_covidSomewhat less lonely 0.0633 0.157 0.404 coeffi…
## 7 lonely_change_covidSomewhat more lonely 0.400 0.0979 4.08 coeffi…
## 8 BurnoutScore 0.923 0.0434 21.3 coeffi…
## 9 None of the time (e.g., 0 days)|Rarely (e.g.… 1.66 0.190 8.73 scale
## 10 Rarely (e.g. less than 1 day)|Some or a litt… 3.80 0.199 19.1 scale
## 11 Some or a little of the time (e.g. 1-2 days)… 5.74 0.219 26.3 scale
## 12 Occasionally or a moderate amount of time (e… 7.43 0.246 30.2 scale
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴coef.type
library(broom) #load library for broom package
broom::tidy(ord_model, conf.int = TRUE, conf.level = 0.95, pvalue = TRUE, exponentiate = TRUE) #Tidy of the results using the tidy function!!
## # A tibble: 12 × 7
## term estim…¹ std.e…² stati…³ conf.…⁴ conf.…⁵ coef.…⁶
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 age 1.00 0.00263 1.53 0.999 1.01 coeffi…
## 2 genderNon-binary 1.12 0.274 0.421 0.656 1.92 coeffi…
## 3 genderWoman 1.03 0.0794 0.378 0.882 1.20 coeffi…
## 4 lonely_change_covidMuch less… 1.02 0.214 0.0878 0.668 1.55 coeffi…
## 5 lonely_change_covidMuch more… 4.90 0.123 12.9 3.85 6.25 coeffi…
## 6 lonely_change_covidSomewhat … 1.07 0.157 0.404 0.783 1.45 coeffi…
## 7 lonely_change_covidSomewhat … 1.49 0.0979 4.08 1.23 1.81 coeffi…
## 8 BurnoutScore 2.52 0.0434 21.3 2.31 2.74 coeffi…
## 9 None of the time (e.g., 0 da… 5.24 0.190 8.73 NA NA scale
## 10 Rarely (e.g. less than 1 day… 44.5 0.199 19.1 NA NA scale
## 11 Some or a little of the time… 312. 0.219 26.3 NA NA scale
## 12 Occasionally or a moderate a… 1685. 0.246 30.2 NA NA scale
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
## # ⁴conf.low, ⁵conf.high, ⁶coef.type
As you can see, in this model, we do not have to worry about multiple coefficients for each explanatory variable. The nice thing with ordinal regression is it gives us a single effect for each term. So we might interpret the effect of Burnout as “Each 1 point increase in burnout scores was associated with a 2.52 times (or 152%) increase in odds of being lonely more frequently. The other variable level with a significant effect was among those who said they were filling much more lonelier since the start of the pandemic: We would interpret this as”People who were much more lonelier since the start of the pandemic had 4.90 times (or 390%) higher odds of being lonely more frequently.
One assumption unique to an ordinal model is that no variable in your model has a disproportionate effect on a specific level of your outcome variable. To test for this assumption, we conduct the Brant-Wald test using the brant() function within the brant package.
install.packages("brant", repos = "http://cran.us.r-project.org")
## package 'brant' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("brant")
brant::brant(ord_model)
## ----------------------------------------------------------------------------
## Test for X2 df probability
## ----------------------------------------------------------------------------
## Omnibus 278.25 24 0
## age 112.63 3 0
## genderNon-binary 18.25 3 0
## genderWoman 6.83 3 0.08
## lonely_change_covidMuch less lonely 31.97 3 0
## lonely_change_covidMuch more lonely 20.72 3 0
## lonely_change_covidSomewhat less lonely 2.19 3 0.53
## lonely_change_covidSomewhat more lonely 14.52 3 0
## BurnoutScore 40.19 3 0
## ----------------------------------------------------------------------------
##
## H0: Parallel Regression Assumption holds
A p-value of less than 0.05 on this test-particularly on the Omnibus plus at least one of the variables-should be interpreted as a failure of the proportional odds assumption. In this case, we would likely opt for a multinomial model or collapse our variable in a way that would make it suitable for a binomial logistic regression model.
The model fit can also be assessed by the logitgof() function:
library(generalhoslem)
logitgof(ord_model$model$lonely_direct, exp = fitted(ord_model), ord = TRUE)
##
## Hosmer and Lemeshow test (ordinal model)
##
## data: ord_model$model$lonely_direct, fitted(ord_model)
## X-squared = 88.288, df = 35, p-value = 1.703e-06
We can further check the pseudo R-squared:
library("DescTools") #load package
PseudoR2(ord_model, "all")
## McFadden CoxSnell Nagelkerke AldrichNelson VeallZimmermann
## 0.1370127 0.3265479 0.3458577 NA NA
## Efron McKelveyZavoina Tjur AIC BIC
## NA NA NA 5848.2884448 5917.3781900
## logLik logLik0 G2
## -2912.1442224 -3374.4924412 924.6964376
Finally, it is informative to check the variable importance. Unfortunately the varImp() function does not have a method for dealing with a polr() objects, as such it cannot be used for ordinal regression.
One option you have it to construct the model without each variable and then test the pseudo R squared on each, taking note of which variables have the biggest impact on values.
ord_model_no_age <- polr(formula = lonely_direct ~ gender + lonely_change_covid + BurnoutScore, Hess = TRUE, data = burnout_data) #construct without age
ord_model_no_gender <- polr(formula = lonely_direct ~ age + lonely_change_covid + BurnoutScore, Hess = TRUE, data = burnout_data) #construct without gender
ord_model_no_lonely_change_covid <- polr(formula = lonely_direct ~ age + gender + BurnoutScore, Hess = TRUE, data = burnout_data) #construct without lonely_change
ord_model_no_burnoutScore <- polr(formula = lonely_direct ~ age + gender + lonely_change_covid, Hess = TRUE, data = burnout_data) #construct without burnout
Once the models are each constructed, you can test the pseudo R-squared of each and compare.
PseudoR2(ord_model, "all") #Mcfadden = 0.1370127
## McFadden CoxSnell Nagelkerke AldrichNelson VeallZimmermann
## 0.1370127 0.3265479 0.3458577 NA NA
## Efron McKelveyZavoina Tjur AIC BIC
## NA NA NA 5848.2884448 5917.3781900
## logLik logLik0 G2
## -2912.1442224 -3374.4924412 924.6964376
PseudoR2(ord_model_no_age, "all") #Mcfadden = 0.1366661
## McFadden CoxSnell Nagelkerke AldrichNelson VeallZimmermann
## 0.1366661 0.3258741 0.3451440 NA NA
## Efron McKelveyZavoina Tjur AIC BIC
## NA NA NA 5848.6275130 5911.9597794
## logLik logLik0 G2
## -2913.3137565 -3374.4924412 922.3573694
PseudoR2(ord_model_no_gender, "all") #Mcfadden = 0.1369715
## McFadden CoxSnell Nagelkerke AldrichNelson VeallZimmermann
## 0.1369715 0.3264678 0.3457729 NA NA
## Efron McKelveyZavoina Tjur AIC BIC
## NA NA NA 5844.5664064 5902.1411941
## logLik logLik0 G2
## -2912.2832032 -3374.4924412 924.4184760
PseudoR2(ord_model_no_lonely_change_covid, "all") #Mcfadden = 0.1100810
## McFadden CoxSnell Nagelkerke AldrichNelson VeallZimmermann
## 0.1100810 0.2725013 0.2885357 NA NA
## Efron McKelveyZavoina Tjur AIC BIC
## NA NA NA 6070.3457893 6116.4567596
## logLik logLik0 G2
## -3027.1728946 -3401.6273375 748.9088857
PseudoR2(ord_model_no_burnoutScore, "all") #Mcfadden = 0.0635745
## McFadden CoxSnell Nagelkerke AldrichNelson VeallZimmermann
## 0.0635745 0.1674025 0.1773402 NA NA
## Efron McKelveyZavoina Tjur AIC BIC
## NA NA NA 6571.3339832 6635.0725065
## logLik logLik0 G2
## -3274.6669916 -3496.9861484 444.6383136
Finally, we can calculate the change in R-squared values by subtracting the pseudo R-squared value of each nested model from the pseudo R-squared value of the full model.
0.1370127 - 0.1366661 #ord_model_no_age = 0.0003466
## [1] 0.0003466
0.1370127 - 0.1369715 #ord_model_no_gender = 0.0000412
## [1] 4.12e-05
0.1370127 - 0.1100810 #ord_model_no_lonely_change_covid = 0.0269317
## [1] 0.0269317
0.1370127 - 0.0635745 #ord_model_no_burnoutScore = 0.0734382
## [1] 0.0734382
Here we can see that the biggest variable impact was on the removal of the burnout score. Again highlighting a strong connection between loneliness and burnout. Relatively small changes in the model were attributed to age and gender. This gives you a sense of what the most important variables in the model R. Technically, a similar approach could be used in other models.
The other limitation of ordinal models is they do not work with your typical variable inflation factors. You can however check multicolienarity by (1) converting your outcome to a numeric variable, (2) constructing a linear regression model, and (3) calculating the VIF scores to test for multicolinearity.
Step 1. Create numeric version of the outcome
burnout_data$lonely_direct <- as.character(burnout_data$lonely_direct)
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "None of the time (e.g., 0 days)"] <- 1
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "Rarely (e.g. less than 1 day)"] <- 2
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "Some or a little of the time (e.g. 1-2 days)"] <- 3
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "Occasionally or a moderate amount of time (e.g. 3-4 days)"] <- 4
burnout_data$lonely_direct_numeric[burnout_data$lonely_direct == "All of the time (e.g. 5-7 days)]"] <- 5
Step 2. Create a linear regression model
linear_model <- lm(formula = lonely_direct_numeric ~ age + gender + lonely_change_covid + BurnoutScore, data = burnout_data)
Step 3. Calculate VIF to measure multicolinearity
VIF(linear_model)
## GVIF Df GVIF^(1/(2*Df))
## age 1.057796 1 1.028492
## gender 1.047706 2 1.011719
## lonely_change_covid 1.197383 4 1.022773
## BurnoutScore 1.162099 1 1.078007
The above is a nice work-around that should satisfy. Since the VIF is really a function of inter-correlations in the design matrix (which doesn’t depend on the dependent variable or the non-linear mapping from the linear predictor into the space of the response variable [i.e., the link function in a glm]).
This section introduces you to the following concepts and their corresponding functions:
So far, we have covered linear and logistic regression models and how to evaluate them. Now, our focus is on how to compare models to decide which one is better. We will also review some different ways to construct a model. These skills will add to those we’ve already learned to help you build better models by deciding what variables and combinations of variables you should include.
As always, lets start by reading in our burnout dataset:
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data
Now lets learn how to compare linear regression models first. We will start by building three similar linear regression models using the BurnoutScore, age, income, and ethnicity variables. To start I am going to create a new dataset that removes missing observations that are missing any of the variables of interest. This ensures that all the models I am looking at are built with the same underlying data.
burnout_data_subset <- burnout_data[!is.na(burnout_data$BurnoutScore),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$age),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$income),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$ethnicity),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$BurnoutScore_groups),]
Model 1 predicts burnout scores using age as the explanatory factor:
model_1 <- lm(formula = BurnoutScore ~ age, data = burnout_data_subset) #Create a linear regression model object
broom::tidy(model_1) # get results for the model object
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.53 0.0607 58.2 0
## 2 age -0.00294 0.00146 -2.02 0.0434
Model 2 predicts burnout scores using age and income as the explanatory factors:
model_2 <- lm(formula = BurnoutScore ~ age + income, data = burnout_data_subset) #Create a linear regression model object
broom::tidy(model_2) # get results for the model object
## # A tibble: 19 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.67 0.110 33.4 1.01e-200
## 2 age -0.00318 0.00147 -2.16 3.10e- 2
## 3 income$100,000 to $149,999 -0.293 0.116 -2.52 1.18e- 2
## 4 income$15,000 to $19,999 -0.309 0.137 -2.26 2.39e- 2
## 5 income$150,000 to $199,999 -0.324 0.134 -2.42 1.57e- 2
## 6 income$20,000 to $24,999 0.113 0.132 0.857 3.91e- 1
## 7 income$200,000 or more -0.149 0.186 -0.803 4.22e- 1
## 8 income$25,000 to $29,999 -0.0159 0.136 -0.117 9.07e- 1
## 9 income$30,000 to $34,999 -0.189 0.132 -1.43 1.52e- 1
## 10 income$35,000 to $39,999 -0.204 0.136 -1.50 1.34e- 1
## 11 income$40,000 to $44,999 -0.186 0.134 -1.39 1.63e- 1
## 12 income$45,000 to $49,999 -0.0902 0.142 -0.633 5.27e- 1
## 13 income$5,000 to $9,999 -0.0355 0.145 -0.244 8.07e- 1
## 14 income$50,000 to $59,999 0.0895 0.127 0.705 4.81e- 1
## 15 income$60,000 to $69,999 -0.00263 0.132 -0.0200 9.84e- 1
## 16 income$70,000 to $79,999 -0.152 0.129 -1.18 2.37e- 1
## 17 income$80,000 to $89,999 -0.103 0.140 -0.732 4.64e- 1
## 18 income$90,000 to $99,999 -0.163 0.133 -1.22 2.22e- 1
## 19 incomeUnder $5,000 -0.124 0.172 -0.722 4.70e- 1
Model 3 predicts burnout scores using age and ethnicity as the explanatory factors:
model_3 <- lm(formula = BurnoutScore ~ age + ethnicity, data = burnout_data_subset) #Create a linear regression model object
broom::tidy(model_3) # get results for the model object
## # A tibble: 14 × 5
## term estimate std.e…¹ stati…² p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.28 0.0877 37.4 5.80e-240
## 2 age -0.00435 0.00152 -2.86 4.31e- 3
## 3 ethnicityArab 0.412 0.160 2.57 1.03e- 2
## 4 ethnicityChinese 0.245 0.154 1.58 1.13e- 1
## 5 ethnicityFilipino 0.407 0.261 1.56 1.20e- 1
## 6 ethnicityIndigenous 0.184 0.112 1.65 9.95e- 2
## 7 ethnicityJapanese 0.307 0.292 1.05 2.92e- 1
## 8 ethnicityKorean 0.292 0.261 1.12 2.64e- 1
## 9 ethnicityLatin American 0.358 0.137 2.61 8.99e- 3
## 10 ethnicityNone of the above, I identify as 0.543 0.144 3.78 1.64e- 4
## 11 ethnicitySouth Asian (e.g., East Indian, … 0.630 0.174 3.61 3.07e- 4
## 12 ethnicitySoutheast Asian (e.g., Vietnames… 0.320 0.219 1.46 1.45e- 1
## 13 ethnicityWest Asian (e.g., Iranian, Afgha… 0.684 0.292 2.34 1.92e- 2
## 14 ethnicityWhite 0.340 0.0811 4.20 2.82e- 5
## # … with abbreviated variable names ¹std.error, ²statistic
To compare these models, we use something called a fit index, in particular we frequently rely on the Adjusted Information Criterion (AIC). AIC is not displayed by default. However, can be extracted using the AIC() function, as follows:
AIC(model_1, model_2, model_3)
## df AIC
## model_1 3 7122.987
## model_2 20 7122.044
## model_3 15 7116.428
As you can see, model 3 has the lowest AIC, and is therefore the best fitting model of those reviewed. We can test whether these differences are statistically significant using the lrtest() function from the lmtest package.
install.packages("lmtest", repos = "http://cran.us.r-project.org")
## package 'lmtest' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("lmtest")
lrtest(model_1, model_2)
## Likelihood ratio test
##
## Model 1: BurnoutScore ~ age
## Model 2: BurnoutScore ~ age + income
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -3558.5
## 2 20 -3541.0 17 34.943 0.006329 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, the result shows a p-value (0.006329). This means that adding the income variable to the model did lead to a significantly improved fit over the model 1.
Model 1 and model 2 can be compared using the lrtest() function, only when they are nested (meaning that model 2 contains all the variables that model 1 does, plus some additional ones). In our example model 1 is tested inside model 2. However, model 2 is not nested inside model 3, because model 3 does not include ethnicity. Therefore we could compare model 3 to model 1 using lrtest, because they are nested:
lrtest(model_1, model_3)
## Likelihood ratio test
##
## Model 1: BurnoutScore ~ age
## Model 2: BurnoutScore ~ age + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 3 -3558.5
## 2 15 -3543.2 12 30.559 0.002299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, to compare model 2 and model 3, we need to use the coxtest() function, also from the lmtest package
coxtest(model_2, model_3)
## Cox test
##
## Model 1: BurnoutScore ~ age + income
## Model 2: BurnoutScore ~ age + ethnicity
## Estimate Std. Error z value Pr(>|z|)
## fitted(M1) ~ M2 -15.139 0.52066 -29.077 < 2.2e-16 ***
## fitted(M2) ~ M1 -17.302 0.57398 -30.143 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As you can see, the p-value is statistically significant, suggesting that the difference in model fit is statistically significant. Combine this with what you know about the AIC, and you can tell that model_3 is a better fit than model_2.
As models become more and more complex, we can also use functions that automate the variable selection based on the AIC value. One such function is the step() function. Lets build ourselves a complex model and test the step function out:
linear_model_complex <- lm(formula = BurnoutScore ~ age + gender + ethnicity + income + educational_attainment + identity_lgbtq + identity_disability + lonely_direct + work_dignity + work_support + discrimination_score, model = TRUE, data = burnout_data)
broom::tidy(linear_model_complex)
## # A tibble: 58 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.93 0.250 19.8 2.44e-75
## 2 age -0.00726 0.00212 -3.42 6.37e- 4
## 3 genderNon-binary 0.174 0.190 0.917 3.59e- 1
## 4 genderWoman 0.171 0.0541 3.15 1.65e- 3
## 5 ethnicityArab 0.222 0.197 1.13 2.60e- 1
## 6 ethnicityChinese 0.283 0.191 1.48 1.39e- 1
## 7 ethnicityFilipino 0.168 0.311 0.541 5.89e- 1
## 8 ethnicityIndigenous 0.114 0.129 0.888 3.75e- 1
## 9 ethnicityJapanese 0.0788 0.377 0.209 8.35e- 1
## 10 ethnicityKorean 0.0852 0.311 0.274 7.84e- 1
## # … with 48 more rows
step(linear_model_complex, direction = "backward") #backwards selection of variables
## Start: AIC=-256.57
## BurnoutScore ~ age + gender + ethnicity + income + educational_attainment +
## identity_lgbtq + identity_disability + lonely_direct + work_dignity +
## work_support + discrimination_score
##
## Df Sum of Sq RSS AIC
## - ethnicity 12 6.114 918.09 -272.34
## - income 17 22.540 934.51 -260.49
## <none> 911.97 -256.57
## - identity_lgbtq 1 1.986 913.96 -255.89
## - work_dignity 4 7.212 919.18 -254.86
## - work_support 4 8.782 920.75 -252.76
## - gender 2 7.904 919.87 -249.94
## - educational_attainment 8 17.974 929.94 -248.52
## - age 1 9.102 921.07 -246.33
## - identity_disability 1 13.340 925.31 -240.68
## - discrimination_score 1 34.457 946.43 -212.88
## - lonely_direct 4 269.467 1181.44 54.37
##
## Step: AIC=-272.34
## BurnoutScore ~ age + gender + income + educational_attainment +
## identity_lgbtq + identity_disability + lonely_direct + work_dignity +
## work_support + discrimination_score
##
## Df Sum of Sq RSS AIC
## - income 17 23.309 941.39 -275.447
## <none> 918.09 -272.336
## - identity_lgbtq 1 1.817 919.90 -271.900
## - work_dignity 4 6.816 924.90 -271.224
## - work_support 4 9.001 927.09 -268.316
## - gender 2 8.257 926.34 -265.305
## - educational_attainment 8 18.832 936.92 -263.321
## - age 1 8.747 926.83 -262.654
## - identity_disability 1 13.944 932.03 -255.764
## - discrimination_score 1 33.142 951.23 -230.647
## - lonely_direct 4 275.967 1194.05 43.456
##
## Step: AIC=-275.45
## BurnoutScore ~ age + gender + educational_attainment + identity_lgbtq +
## identity_disability + lonely_direct + work_dignity + work_support +
## discrimination_score
##
## Df Sum of Sq RSS AIC
## <none> 941.39 -275.447
## - identity_lgbtq 1 2.025 943.42 -274.800
## - work_dignity 4 7.686 949.08 -273.429
## - work_support 4 8.544 949.94 -272.316
## - gender 2 10.035 951.43 -266.384
## - educational_attainment 8 19.863 961.26 -265.723
## - age 1 9.495 950.89 -265.084
## - identity_disability 1 15.564 956.96 -257.245
## - discrimination_score 1 35.727 977.12 -231.557
## - lonely_direct 4 276.194 1217.59 33.503
##
## Call:
## lm(formula = BurnoutScore ~ age + gender + educational_attainment +
## identity_lgbtq + identity_disability + lonely_direct + work_dignity +
## work_support + discrimination_score, data = burnout_data,
## model = TRUE)
##
## Coefficients:
## (Intercept)
## 5.083453
## age
## -0.007172
## genderNon-binary
## 0.195005
## genderWoman
## 0.190145
## educational_attainmentBachelor's degree
## -0.313381
## educational_attainmentCollege, CEGEP or other non-university certificate or diploma
## -0.207758
## educational_attainmentDegree in medicine, dentistry, veterinary medicine or optometry
## -0.260849
## educational_attainmentEarned doctorate
## 0.139102
## educational_attainmentHigh School Diploma or Lower
## -0.241908
## educational_attainmentMaster's degree
## -0.215207
## educational_attainmentUniversity certificate or diploma above bachelor level
## -0.322637
## educational_attainmentUniversity certificate or diploma below bachelor level
## -0.496635
## identity_lgbtqSexual or gender minorities (e.g., LGBTQ2+)
## 0.154236
## identity_disability8888: Not Selected
## -0.328294
## lonely_directNone of the time (e.g., 0 days)
## -1.948005
## lonely_directOccasionally or a moderate amount of time (e.g. 3-4 days)
## -0.624743
## lonely_directRarely (e.g. less than 1 day)
## -1.453003
## lonely_directSome or a little of the time (e.g. 1-2 days)
## -1.154553
## work_dignityCompletely
## -0.484545
## work_dignityNot at all
## -0.209211
## work_dignitySomewhat
## -0.207658
## work_dignityVery Little
## -0.100759
## work_dignityVery Well
## -0.287573
## work_supportCompletely
## 0.078410
## work_supportNot at all
## 0.348530
## work_supportSomewhat
## 0.234562
## work_supportVery Little
## 0.204010
## work_supportVery Well
## NA
## discrimination_score
## 0.017826
The following output is the last step of the backwards elimination procedure. Backwards elimiation begins with all variables listed in the model, then removes the least significant variables until AIC in minimized. You can see here that age, gender, educational_attainment, identity_lgbtq, identity_disability, lonely_direct, work_dignity, work_support, discrimination_score are retained while ethnicity and income are removed.
So for our final model, we can simply build the model with the variables that remained in the model.
final_linear_model <- lm(formula = BurnoutScore ~ age + gender + educational_attainment + identity_lgbtq + identity_disability + lonely_direct + work_dignity + work_support + discrimination_score, model = TRUE, data = burnout_data) #create final multivariable model
library(broom) #load in broom package to create tidy table
table <- tidy(final_linear_model, conf.int = TRUE, conf.level = 0.95) #create and save table using tidy() function
Lets now look at model fit for a binomial logistic regression models: To start we will create three binomial logistic regression models.
Model 1 predicts burnout score groups using age as the explanatory factor:
bi_model_1 <- glm(formula = BurnoutScore_groups ~ age, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
broom::tidy(bi_model_1) # get results for the model object
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.646 0.120 -5.40 0.0000000670
## 2 age -0.00347 0.00289 -1.20 0.231
Model 2 predicts burnout score groups using age and income as the explanatory factors:
bi_model_2 <- glm(formula = BurnoutScore_groups ~ age + income, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
broom::tidy(bi_model_2) # get results for the model object
## # A tibble: 19 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.371 0.210 -1.77 0.0772
## 2 age -0.00334 0.00294 -1.13 0.257
## 3 income$100,000 to $149,999 -0.473 0.226 -2.09 0.0366
## 4 income$15,000 to $19,999 -0.368 0.266 -1.38 0.167
## 5 income$150,000 to $199,999 -0.652 0.271 -2.41 0.0160
## 6 income$20,000 to $24,999 0.119 0.249 0.479 0.632
## 7 income$200,000 or more -0.553 0.380 -1.45 0.146
## 8 income$25,000 to $29,999 -0.259 0.263 -0.985 0.325
## 9 income$30,000 to $34,999 -0.538 0.262 -2.05 0.0403
## 10 income$35,000 to $39,999 -0.494 0.269 -1.83 0.0666
## 11 income$40,000 to $44,999 -0.328 0.259 -1.26 0.206
## 12 income$45,000 to $49,999 -0.523 0.284 -1.84 0.0658
## 13 income$5,000 to $9,999 -0.400 0.285 -1.40 0.160
## 14 income$50,000 to $59,999 0.143 0.239 0.598 0.550
## 15 income$60,000 to $69,999 -0.183 0.252 -0.727 0.468
## 16 income$70,000 to $79,999 -0.0875 0.244 -0.358 0.720
## 17 income$80,000 to $89,999 -0.292 0.272 -1.08 0.282
## 18 income$90,000 to $99,999 -0.356 0.259 -1.37 0.169
## 19 incomeUnder $5,000 -0.0973 0.328 -0.297 0.767
Model 3 predicts burnout score groups using age, ethnicity, and Loneliness scores as the explanatory factors:
bi_model_3 <- glm(formula = BurnoutScore_groups ~ age + ethnicity, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
broom::tidy(bi_model_3) # get results for the model object
## # A tibble: 14 × 5
## term estimate std.e…¹ stati…² p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -1.01 0.186 -5.45 5.13e-8
## 2 age -0.00612 0.00304 -2.02 4.39e-2
## 3 ethnicityArab 0.296 0.330 0.898 3.69e-1
## 4 ethnicityChinese 0.0803 0.330 0.243 8.08e-1
## 5 ethnicityFilipino 0.915 0.492 1.86 6.33e-2
## 6 ethnicityIndigenous 0.265 0.235 1.13 2.59e-1
## 7 ethnicityJapanese 0.521 0.571 0.912 3.62e-1
## 8 ethnicityKorean 0.658 0.502 1.31 1.90e-1
## 9 ethnicityLatin American 0.460 0.277 1.66 9.73e-2
## 10 ethnicityNone of the above, I identify as 0.958 0.280 3.42 6.26e-4
## 11 ethnicitySouth Asian (e.g., East Indian, Pa… 0.976 0.333 2.93 3.39e-3
## 12 ethnicitySoutheast Asian (e.g., Vietnamese,… 0.481 0.436 1.10 2.70e-1
## 13 ethnicityWest Asian (e.g., Iranian, Afghan,… 1.11 0.543 2.05 4.01e-2
## 14 ethnicityWhite 0.514 0.173 2.97 3.02e-3
## # … with abbreviated variable names ¹std.error, ²statistic
The AIC Scores can again be extracted (though you may have noticed they are provided in the default output of the glm() function)
AIC(bi_model_1, bi_model_2, bi_model_3)
## df AIC
## bi_model_1 2 2936.558
## bi_model_2 19 2942.471
## bi_model_3 14 2937.738
Model_1 in this case shows the lowest AIC score. However, lets use lrtest() to compare the nested models and coxtest to compare the non-nested models:
lrtest(bi_model_1, bi_model_2)
## Likelihood ratio test
##
## Model 1: BurnoutScore_groups ~ age
## Model 2: BurnoutScore_groups ~ age + income
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -1466.3
## 2 19 -1452.2 17 28.087 0.04393 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(bi_model_1, bi_model_3)
## Likelihood ratio test
##
## Model 1: BurnoutScore_groups ~ age
## Model 2: BurnoutScore_groups ~ age + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -1466.3
## 2 14 -1454.9 12 22.82 0.02929 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coxtest(bi_model_2, bi_model_3)
## Cox test
##
## Model 1: BurnoutScore_groups ~ age + income
## Model 2: BurnoutScore_groups ~ age + ethnicity
## Estimate Std. Error z value Pr(>|z|)
## fitted(M1) ~ M2 -11.407 0.43043 -26.501 < 2.2e-16 ***
## fitted(M2) ~ M1 -14.126 0.52720 -26.795 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that both models 2 and 3 are statistically better than model 1 (and model 3 is better than 2), despite model 1 having the lowest AIC. However, we can see that the p-values comparing model 1 to the other two models are not very small. This demonstrates that model fit is complex. On one hand the lrtest() and coxtest() are saying “choose model 3” and the AIC() values are saying “choose model 1”. Its important to know that the AIC penalizes model complexity – therefore trying to strike a good balance between model fit and model simplicity. What we know from reading all these results together, plus looking at model coefficients, is that the contributions of gender and ethnicity alone do not necessarily justify a more complex model. Thats not saying that a more complex model can’t be used, just that the evidence we have is conflicting and not very strong one way or another. For the sake of exploration, lets create a fourth model, combining age, gender, and ethnicity and compare the aic values and conduct a liklihood ratio test using the lrtest()
bi_model_4 <- glm(formula = BurnoutScore_groups ~ age + gender + ethnicity, data = burnout_data_subset, family = "binomial") #Create a linear regression model object
summary(bi_model_4) # get results for the model object
##
## Call:
## glm(formula = BurnoutScore_groups ~ age + gender + ethnicity,
## family = "binomial", data = burnout_data_subset)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2714 -0.8737 -0.8071 1.4078 1.8624
##
## Coefficients:
## Estimate
## (Intercept) -1.114004
## age -0.007474
## genderNon-binary 0.668406
## genderWoman 0.323814
## ethnicityArab 0.272512
## ethnicityChinese 0.035787
## ethnicityFilipino 0.871382
## ethnicityIndigenous 0.226262
## ethnicityJapanese 0.465545
## ethnicityKorean 0.605984
## ethnicityLatin American 0.446422
## ethnicityNone of the above, I identify as 0.910566
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.) 0.939222
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.) 0.460146
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.) 1.093747
## ethnicityWhite 0.479927
## Std. Error
## (Intercept) 0.188446
## age 0.003074
## genderNon-binary 0.284727
## genderWoman 0.092278
## ethnicityArab 0.330832
## ethnicityChinese 0.331297
## ethnicityFilipino 0.496211
## ethnicityIndigenous 0.235486
## ethnicityJapanese 0.574038
## ethnicityKorean 0.505114
## ethnicityLatin American 0.278324
## ethnicityNone of the above, I identify as 0.281974
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.) 0.334630
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.) 0.437881
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.) 0.544679
## ethnicityWhite 0.174245
## z value
## (Intercept) -5.912
## age -2.432
## genderNon-binary 2.348
## genderWoman 3.509
## ethnicityArab 0.824
## ethnicityChinese 0.108
## ethnicityFilipino 1.756
## ethnicityIndigenous 0.961
## ethnicityJapanese 0.811
## ethnicityKorean 1.200
## ethnicityLatin American 1.604
## ethnicityNone of the above, I identify as 3.229
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.) 2.807
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.) 1.051
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.) 2.008
## ethnicityWhite 2.754
## Pr(>|z|)
## (Intercept) 3.39e-09
## age 0.01503
## genderNon-binary 0.01890
## genderWoman 0.00045
## ethnicityArab 0.41010
## ethnicityChinese 0.91398
## ethnicityFilipino 0.07908
## ethnicityIndigenous 0.33664
## ethnicityJapanese 0.41737
## ethnicityKorean 0.23026
## ethnicityLatin American 0.10872
## ethnicityNone of the above, I identify as 0.00124
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.) 0.00500
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.) 0.29333
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.) 0.04464
## ethnicityWhite 0.00588
##
## (Intercept) ***
## age *
## genderNon-binary *
## genderWoman ***
## ethnicityArab
## ethnicityChinese
## ethnicityFilipino .
## ethnicityIndigenous
## ethnicityJapanese
## ethnicityKorean
## ethnicityLatin American
## ethnicityNone of the above, I identify as **
## ethnicitySouth Asian (e.g., East Indian, Pakistani, Sri Lankan, etc.) **
## ethnicitySoutheast Asian (e.g., Vietnamese, Cambodian, Laotian, Thai, etc.)
## ethnicityWest Asian (e.g., Iranian, Afghan, etc.) *
## ethnicityWhite **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2934.0 on 2355 degrees of freedom
## Residual deviance: 2894.3 on 2340 degrees of freedom
## AIC: 2926.3
##
## Number of Fisher Scoring iterations: 4
AIC(bi_model_1, bi_model_4) #compare AIC
## df AIC
## bi_model_1 2 2936.558
## bi_model_4 16 2926.271
lrtest(bi_model_1, bi_model_4) #compare lrtest()
## Likelihood ratio test
##
## Model 1: BurnoutScore_groups ~ age
## Model 2: BurnoutScore_groups ~ age + gender + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 2 -1466.3
## 2 16 -1447.1 14 38.288 0.0004695 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that the full model is improved compared to the age only model.
Lets now build some multinomial logistic regresison models using the multinom() function from the nnet package and use the AIC function to compare:
install.packages("nnet", repos = "http://cran.us.r-project.org") #install package
library("nnet") #load library for package
Age only model
multinom_model_1 <- multinom(formula = lonely_change_covid ~ age, model = TRUE, data = burnout_data)
## # weights: 15 (8 variable)
## initial value 3910.934127
## iter 10 value 3404.631506
## final value 3346.744956
## converged
broom::tidy(multinom_model_1)
## # A tibble: 8 × 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Much less lonely (Intercept) -2.35 0.280 -8.38 5.13e-17
## 2 Much less lonely age 0.0147 0.00627 2.34 1.94e- 2
## 3 Much more lonely (Intercept) -0.917 0.161 -5.70 1.17e- 8
## 4 Much more lonely age 0.0178 0.00367 4.86 1.19e- 6
## 5 Somewhat less lonely (Intercept) -0.490 0.230 -2.13 3.29e- 2
## 6 Somewhat less lonely age -0.0182 0.00605 -3.00 2.68e- 3
## 7 Somewhat more lonely (Intercept) 0.707 0.140 5.05 4.41e- 7
## 8 Somewhat more lonely age -0.00765 0.00345 -2.22 2.67e- 2
Age gender model
multinom_model_2 <- multinom(formula = lonely_change_covid ~ age + gender, model = TRUE, data = burnout_data)
## # weights: 25 (16 variable)
## initial value 3910.934127
## iter 10 value 3538.470171
## iter 20 value 3331.388276
## final value 3330.671460
## converged
broom::tidy(multinom_model_2)
## # A tibble: 16 × 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Much less lonely (Intercept) -2.39 0.291 -8.23 1.91e-16
## 2 Much less lonely age 0.0143 0.00635 2.25 2.44e- 2
## 3 Much less lonely genderNon-binary 0.223 0.569 0.392 6.95e- 1
## 4 Much less lonely genderWoman 0.100 0.214 0.468 6.40e- 1
## 5 Much more lonely (Intercept) -1.11 0.169 -6.53 6.73e-11
## 6 Much more lonely age 0.0154 0.00372 4.14 3.53e- 5
## 7 Much more lonely genderNon-binary -0.415 0.429 -0.967 3.34e- 1
## 8 Much more lonely genderWoman 0.544 0.124 4.41 1.06e- 5
## 9 Somewhat less lonely (Intercept) -0.511 0.238 -2.15 3.14e- 2
## 10 Somewhat less lonely age -0.0185 0.00610 -3.03 2.43e- 3
## 11 Somewhat less lonely genderNon-binary 0.133 0.455 0.292 7.70e- 1
## 12 Somewhat less lonely genderWoman 0.0604 0.166 0.363 7.17e- 1
## 13 Somewhat more lonely (Intercept) 0.719 0.144 4.98 6.43e- 7
## 14 Somewhat more lonely age -0.00772 0.00349 -2.21 2.69e- 2
## 15 Somewhat more lonely genderNon-binary -0.483 0.329 -1.47 1.42e- 1
## 16 Somewhat more lonely genderWoman 0.00533 0.105 0.0509 9.59e- 1
Age ethnicity model
multinom_model_3 <- multinom(formula = lonely_change_covid ~ age + ethnicity, model = TRUE, data = burnout_data)
## # weights: 75 (56 variable)
## initial value 3910.934127
## iter 10 value 3534.576873
## iter 20 value 3306.453616
## iter 30 value 3294.440945
## iter 40 value 3293.951705
## iter 50 value 3293.882902
## iter 60 value 3293.874501
## final value 3293.874421
## converged
broom::tidy(multinom_model_3)
## # A tibble: 56 × 6
## y.level term estimate std.e…¹ statis…² p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Much less lonely (Intercept) -1.97 3.81e-1 -5.17e+0 2.31e-7
## 2 Much less lonely age 0.0150 6.63e-3 2.26e+0 2.36e-2
## 3 Much less lonely ethnicityArab -0.850 8.06e-1 -1.05e+0 2.92e-1
## 4 Much less lonely ethnicityChinese -0.109 5.85e-1 -1.86e-1 8.52e-1
## 5 Much less lonely ethnicityFilipino -15.4 1.96e-7 -7.85e+7 0
## 6 Much less lonely ethnicityIndigenous -0.232 5.21e-1 -4.45e-1 6.56e-1
## 7 Much less lonely ethnicityJapanese 1.06 9.68e-1 1.09e+0 2.75e-1
## 8 Much less lonely ethnicityKorean -15.4 1.88e-7 -8.22e+7 0
## 9 Much less lonely ethnicityLatin American -1.04 8.01e-1 -1.30e+0 1.95e-1
## 10 Much less lonely ethnicityNone of the abov… -0.645 7.02e-1 -9.18e-1 3.59e-1
## # … with 46 more rows, and abbreviated variable names ¹std.error, ²statistic
test AIC
AIC(multinom_model_1, multinom_model_2, multinom_model_3) #compare AIC
## df AIC
## multinom_model_1 8 6709.490
## multinom_model_2 16 6693.343
## multinom_model_3 56 6699.749
Here we see that the second model has the lowest AIC
Lets use lrtest() to compare the nested models:
lrtest(multinom_model_1, multinom_model_2)
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + gender
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -3346.7
## 2 16 -3330.7 8 32.147 8.766e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(multinom_model_1, multinom_model_3)
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -3346.7
## 2 56 -3293.9 48 105.74 3.148e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the case of the non-nested models, there is no function readily available to asess relative fit of non-nested multinomial logistic regression models. As such, we rely on the AIC comparisons and conclude that model 2 is preferable. However, we can turn any non-nested model into a nested model and test whether adding a specific variable enhances the model overall. Lets give that a try:
multinom_model_4 <- multinom(formula = lonely_change_covid ~ age + gender + ethnicity, model = TRUE, data = burnout_data)
## # weights: 85 (64 variable)
## initial value 3910.934127
## iter 10 value 3586.207423
## iter 20 value 3316.597365
## iter 30 value 3279.531244
## iter 40 value 3278.968796
## iter 50 value 3278.777025
## iter 60 value 3278.757737
## final value 3278.756046
## converged
broom::tidy(multinom_model_4)
## # A tibble: 64 × 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Much less lonely (Intercept) -2.00 0.385 -5.19e+0 2.05e-7
## 2 Much less lonely age 0.0147 0.00670 2.20e+0 2.79e-2
## 3 Much less lonely genderNon-binary 0.322 0.576 5.58e-1 5.77e-1
## 4 Much less lonely genderWoman 0.0920 0.215 4.27e-1 6.69e-1
## 5 Much less lonely ethnicityArab -0.860 0.806 -1.07e+0 2.86e-1
## 6 Much less lonely ethnicityChinese -0.121 0.586 -2.06e-1 8.37e-1
## 7 Much less lonely ethnicityFilipino -19.0 0.00000000585 -3.24e+9 0
## 8 Much less lonely ethnicityIndigenous -0.245 0.522 -4.69e-1 6.39e-1
## 9 Much less lonely ethnicityJapanese 1.03 0.970 1.06e+0 2.89e-1
## 10 Much less lonely ethnicityKorean -19.2 0.00000000447 -4.29e+9 0
## # … with 54 more rows
lrtest(multinom_model_3, multinom_model_4)
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age + ethnicity
## Model 2: lonely_change_covid ~ age + gender + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 56 -3293.9
## 2 64 -3278.8 8 30.237 0.0001919 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that adding the variable is beneficial. Note we could also run the lrtest by calling specific variables:
lrtest(multinom_model_4, "gender")
## # weights: 75 (56 variable)
## initial value 3910.934127
## iter 10 value 3534.576873
## iter 20 value 3306.453616
## iter 30 value 3294.440945
## iter 40 value 3293.951705
## iter 50 value 3293.882902
## iter 60 value 3293.874501
## final value 3293.874421
## converged
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age + gender + ethnicity
## Model 2: lonely_change_covid ~ age + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 64 -3278.8
## 2 56 -3293.9 -8 30.237 0.0001919 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(multinom_model_4, "ethnicity")
## # weights: 25 (16 variable)
## initial value 3910.934127
## iter 10 value 3538.470171
## iter 20 value 3331.388276
## final value 3330.671460
## converged
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age + gender + ethnicity
## Model 2: lonely_change_covid ~ age + gender
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 64 -3278.8
## 2 16 -3330.7 -48 103.83 5.449e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(multinom_model_4, "age")
## # weights: 80 (60 variable)
## initial value 3910.934127
## iter 10 value 3356.260039
## iter 20 value 3302.449937
## iter 30 value 3298.120455
## iter 40 value 3297.618478
## iter 50 value 3297.574580
## iter 60 value 3297.568501
## final value 3297.568363
## converged
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age + gender + ethnicity
## Model 2: lonely_change_covid ~ gender + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 64 -3278.8
## 2 60 -3297.6 -4 37.625 1.339e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Lets go through this one more time, with ordinal logistic regression models created using the polr() function.
Convert out outcome to an ordinal variable
burnout_data$lonely_direct <- ordered(burnout_data$lonely_direct, levels = c("None of the time (e.g., 0 days)", "Rarely (e.g. less than 1 day)", "Some or a little of the time (e.g. 1-2 days)", "Occasionally or a moderate amount of time (e.g. 3-4 days)", "All of the time (e.g. 5-7 days)]")) #Convert to factor and specify order
table(burnout_data$lonely_direct) #call descriptives; As you can see now, the variable is now ordered. We can construct an ordinal logistic regresison using the polr() function from the MASS package
##
## None of the time (e.g., 0 days)
## 384
## Rarely (e.g. less than 1 day)
## 869
## Some or a little of the time (e.g. 1-2 days)
## 748
## Occasionally or a moderate amount of time (e.g. 3-4 days)
## 307
## All of the time (e.g. 5-7 days)]
## 137
install.packages("MASS", repos = "http://cran.us.r-project.org") #install package
library("MASS") #load library for package
ord_model_1 <- polr(formula = lonely_direct ~ age, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model_1)
## # A tibble: 5 × 5
## term estimate std.e…¹ stati…² coef.…³
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 age 0.00484 0.00251 1.93 coeffi…
## 2 None of the time (e.g., 0 days)|Rarely (e.g.… -1.50 0.108 -13.8 scale
## 3 Rarely (e.g. less than 1 day)|Some or a litt… 0.228 0.101 2.26 scale
## 4 Some or a little of the time (e.g. 1-2 days)… 1.69 0.109 15.5 scale
## 5 Occasionally or a moderate amount of time (e… 3.01 0.132 22.9 scale
## # … with abbreviated variable names ¹std.error, ²statistic, ³coef.type
ord_model_2 <- polr(formula = lonely_direct ~ age + gender, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model_2)
## # A tibble: 7 × 5
## term estimate std.e…¹ stati…² coef.…³
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 age 0.00345 0.00254 1.36 coeffi…
## 2 genderNon-binary 0.537 0.257 2.09 coeffi…
## 3 genderWoman 0.275 0.0750 3.67 coeffi…
## 4 None of the time (e.g., 0 days)|Rarely (e.g.… -1.41 0.110 -12.8 scale
## 5 Rarely (e.g. less than 1 day)|Some or a litt… 0.321 0.104 3.10 scale
## 6 Some or a little of the time (e.g. 1-2 days)… 1.79 0.112 16.0 scale
## 7 Occasionally or a moderate amount of time (e… 3.12 0.134 23.2 scale
## # … with abbreviated variable names ¹std.error, ²statistic, ³coef.type
ord_model_3 <- polr(formula = lonely_direct ~ age + ethnicity, Hess = TRUE, data = burnout_data)
broom::tidy(ord_model_3)
## # A tibble: 17 × 5
## term estimate std.e…¹ stati…² coef.…³
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 age 0.00263 0.00261 1.01 coeffi…
## 2 ethnicityArab -0.260 0.255 -1.02 coeffi…
## 3 ethnicityChinese 0.00315 0.251 0.0125 coeffi…
## 4 ethnicityFilipino 0.916 0.393 2.33 coeffi…
## 5 ethnicityIndigenous 0.0645 0.177 0.365 coeffi…
## 6 ethnicityJapanese 0.733 0.445 1.65 coeffi…
## 7 ethnicityKorean 0.336 0.408 0.825 coeffi…
## 8 ethnicityLatin American 0.147 0.211 0.698 coeffi…
## 9 ethnicityNone of the above, I identify as 0.409 0.251 1.63 coeffi…
## 10 ethnicitySouth Asian (e.g., East Indian, Pa… 0.544 0.268 2.03 coeffi…
## 11 ethnicitySoutheast Asian (e.g., Vietnamese,… 0.380 0.326 1.17 coeffi…
## 12 ethnicityWest Asian (e.g., Iranian, Afghan,… 0.487 0.431 1.13 coeffi…
## 13 ethnicityWhite 0.307 0.127 2.41 coeffi…
## 14 None of the time (e.g., 0 days)|Rarely (e.g… -1.34 0.147 -9.17 scale
## 15 Rarely (e.g. less than 1 day)|Some or a lit… 0.393 0.142 2.78 scale
## 16 Some or a little of the time (e.g. 1-2 days… 1.86 0.148 12.6 scale
## 17 Occasionally or a moderate amount of time (… 3.19 0.166 19.2 scale
## # … with abbreviated variable names ¹std.error, ²statistic, ³coef.type
AIC(ord_model_1, ord_model_2, ord_model_3) #Check AIC, see that model 2 has the lowest; model 2 is better
## df AIC
## ord_model_1 5 7061.353
## ord_model_2 7 7049.343
## ord_model_3 17 7064.581
lrtest(multinom_model_1, multinom_model_2) #Check nested comparison of 1 and 2; 2 is better
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + gender
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -3346.7
## 2 16 -3330.7 8 32.147 8.766e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lrtest(multinom_model_1, multinom_model_3) #Check nested comparison of 1 and 3; 3 is better
## Likelihood ratio test
##
## Model 1: lonely_change_covid ~ age
## Model 2: lonely_change_covid ~ age + ethnicity
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -3346.7
## 2 56 -3293.9 48 105.74 3.148e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Again, comparing the non-nested models is difficult, but the AIC is pretty good and you could create a nested model or check to see if the additional variable is worth it.
Note that step() also works with the non-linear models.
This section introduces you to the following concepts and their corresponding functions:
Internal Consistency: alpha(),
Exploratory Factor Analysis: factanal(), fa.parallel()
This chapter focuses on Factor Analysis, a method designed to uncover the relationships between a given set of variables. It looks for a smaller set of underlying or latent constructs that can explain the relationships among the observed or manifest variables. By the end of this lecture, you should be able to conduct a factor analysis and understand some scenarios in which it might be useful to conduct one.
Lets start today by reading in our data:
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv", na.strings = c("", "NA")) #Read in burnout_data
One of the requirements for a factor analysis is to use numeric variables. As such, we need to convert the items into numeric variables.
install.packages("tidyverse", repos = "http://cran.us.r-project.org")
library(tidyverse)
burnout_data <- mutate_at(.tbl = burnout_data, #Use the mutate function to recreate the dataset
.vars = vars(burnout_tired:burnout_had_it), #Identify the vars you want to edit
.funs = ~case_when(. == "Always" ~ "7", #recode level
. == "Very Often" ~ "6", #recode level
. == "Often" ~ "5", #recode level
. == "Sometimes" ~ "4", #recode level
. == "Rarely" ~ "3", #recode level
. == "Almost never" ~ "2", #recode level
. == "Never" ~ "1", #recode level
TRUE ~ .)) #Everywhere else just leave it whatever it was in the old dataset
burnout_data$burnout_tired <- as.numeric(burnout_data$burnout_tired)
burnout_data$burnout_disappointed <- as.numeric(burnout_data$burnout_disappointed)
burnout_data$burnout_hopeless <- as.numeric(burnout_data$burnout_hopeless)
burnout_data$burnout_trapped <- as.numeric(burnout_data$burnout_trapped)
burnout_data$burnout_helpless <- as.numeric(burnout_data$burnout_helpless)
burnout_data$burnout_depressed <- as.numeric(burnout_data$burnout_depressed)
burnout_data$burnout_sick <- as.numeric(burnout_data$burnout_sick)
burnout_data$burnout_worthless <- as.numeric(burnout_data$burnout_worthless)
burnout_data$burnout_difficulty_sleeping <- as.numeric(burnout_data$burnout_difficulty_sleeping)
burnout_data$burnout_had_it <- as.numeric(burnout_data$burnout_had_it)
We will work with a single scale, called the ten item personality inventory, which includes the following variables:
burnout_vars <- c("burnout_tired",
"burnout_disappointed",
"burnout_hopeless",
"burnout_trapped",
"burnout_helpless",
"burnout_depressed",
"burnout_sick",
"burnout_worthless",
"burnout_difficulty_sleeping",
"burnout_had_it")
Lets create a dataframe with only these vars:
burnout_data_scale_items <- burnout_data[, names(burnout_data) %in% burnout_vars] #remove unnecessary columns
burnout_data_scale_items <- na.omit(burnout_data_scale_items) #remove observations with missing data
Now that our dataset is ready for analysis, we should first assess the internal consistency of our scale items. We can use Chronbach’s alpha to do this. Chronbach’s alpha tells us how closely related a set of items in a group are. The alpha() function in the psych package can be used to assess internal consistency:
library("psych")
alpha(burnout_data_scale_items)
##
## Reliability analysis
## Call: alpha(x = burnout_data_scale_items)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.91 0.91 0.92 0.51 11 0.0026 3.4 1.1 0.51
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.91 0.91 0.92
## Duhachek 0.91 0.91 0.92
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se
## burnout_tired 0.91 0.91 0.91 0.53 10.1 0.0028
## burnout_disappointed 0.91 0.91 0.91 0.53 10.0 0.0028
## burnout_hopeless 0.90 0.90 0.90 0.50 9.0 0.0031
## burnout_trapped 0.90 0.90 0.90 0.50 9.1 0.0031
## burnout_helpless 0.90 0.90 0.90 0.50 9.1 0.0031
## burnout_depressed 0.90 0.90 0.90 0.50 9.0 0.0031
## burnout_sick 0.91 0.91 0.91 0.52 9.8 0.0029
## burnout_worthless 0.90 0.91 0.91 0.51 9.5 0.0029
## burnout_difficulty_sleeping 0.91 0.91 0.91 0.53 10.0 0.0028
## burnout_had_it 0.91 0.91 0.91 0.53 10.0 0.0028
## var.r med.r
## burnout_tired 0.0083 0.52
## burnout_disappointed 0.0091 0.52
## burnout_hopeless 0.0073 0.50
## burnout_trapped 0.0077 0.50
## burnout_helpless 0.0075 0.50
## burnout_depressed 0.0092 0.47
## burnout_sick 0.0101 0.52
## burnout_worthless 0.0085 0.51
## burnout_difficulty_sleeping 0.0101 0.53
## burnout_had_it 0.0096 0.51
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## burnout_tired 2356 0.67 0.68 0.64 0.59 3.9 1.4
## burnout_disappointed 2356 0.69 0.70 0.66 0.62 3.7 1.4
## burnout_hopeless 2356 0.83 0.83 0.82 0.78 3.2 1.4
## burnout_trapped 2356 0.81 0.81 0.80 0.76 3.3 1.5
## burnout_helpless 2356 0.81 0.80 0.79 0.75 3.3 1.5
## burnout_depressed 2356 0.82 0.82 0.81 0.77 3.5 1.4
## burnout_sick 2356 0.71 0.71 0.67 0.64 3.2 1.3
## burnout_worthless 2356 0.76 0.75 0.72 0.69 3.0 1.5
## burnout_difficulty_sleeping 2356 0.70 0.70 0.65 0.62 3.6 1.5
## burnout_had_it 2356 0.71 0.70 0.65 0.62 3.5 1.6
##
## Non missing response frequency for each item
## 1 2 3 4 5 6 7 miss
## burnout_tired 0.06 0.09 0.21 0.33 0.19 0.09 0.04 0
## burnout_disappointed 0.05 0.12 0.25 0.32 0.15 0.08 0.03 0
## burnout_hopeless 0.14 0.19 0.24 0.25 0.11 0.04 0.02 0
## burnout_trapped 0.15 0.18 0.22 0.25 0.12 0.05 0.03 0
## burnout_helpless 0.16 0.16 0.23 0.26 0.12 0.06 0.02 0
## burnout_depressed 0.10 0.14 0.24 0.31 0.13 0.06 0.02 0
## burnout_sick 0.12 0.19 0.27 0.27 0.10 0.03 0.01 0
## burnout_worthless 0.21 0.20 0.24 0.21 0.09 0.04 0.02 0
## burnout_difficulty_sleeping 0.09 0.15 0.22 0.28 0.15 0.08 0.04 0
## burnout_had_it 0.13 0.14 0.20 0.28 0.14 0.07 0.03 0
As you can see the Cronbach’s alpha is very high at 0.91, which indicates excellent internal consistency:
Value | Interpretation |
---|---|
0.90 to 1.00 | Excellent |
0.80 to 0.89 | Good |
0.70 to 0.79 | Acceptable |
0.60 to 0.69 | Questionable |
0.50 to 0.59 | Poor |
0.00 to 0.50 | Unacceptable |
You can also look under the item statistics at the following statistics: n = number of complete cases raw.r = correlation of each item with the total score, not corrected for item overlap std.r = standardized correlation with total score, not correct for item overlap r.cor = correlation of each item, corrected for item overlap and scale reliability. r.drop = correlation of each item with total score, not including the item mean = mean sd = standard deviation
This tells us that all the items in the scale are highly correlated (and that being hopeless and depressed are the items most correlated with total scores). However, Cronbach’s alpha does not mean that the items are unidimensional. Since the Burnout Measure is a single scale, we assume it represents a single factor. Therefore, lets construct a factor model, with 1 factor using the factanal() function.
factanal(x = burnout_data_scale_items, factors = 1, rotation="varimax")
##
## Call:
## factanal(x = burnout_data_scale_items, factors = 1, rotation = "varimax")
##
## Uniquenesses:
## burnout_tired burnout_disappointed
## 0.648 0.604
## burnout_hopeless burnout_trapped
## 0.297 0.332
## burnout_helpless burnout_depressed
## 0.343 0.355
## burnout_sick burnout_worthless
## 0.573 0.457
## burnout_difficulty_sleeping burnout_had_it
## 0.617 0.574
##
## Loadings:
## Factor1
## burnout_tired 0.593
## burnout_disappointed 0.630
## burnout_hopeless 0.838
## burnout_trapped 0.817
## burnout_helpless 0.810
## burnout_depressed 0.803
## burnout_sick 0.653
## burnout_worthless 0.737
## burnout_difficulty_sleeping 0.619
## burnout_had_it 0.653
##
## Factor1
## SS loadings 5.198
## Proportion Var 0.520
##
## Test of the hypothesis that 1 factor is sufficient.
## The chi square statistic is 1196.39 on 35 degrees of freedom.
## The p-value is 1.27e-228
The output from the factanal() function is robust: The first set of results deal with uniqueness. Uniqueness is the variance that is ‘unique’ to the variable and not shared with other variables. A variable with high uniqueness is not explained well by the factor model. A variable with a low uniqueness is explained well by the factor model. Based on this we can see that hopelessness, helplessness, being trapped, and being depressed are expalined well by the factor model, while being tired, dissapointeded, and not sleeping well are less so described by the variable.
The next group of output is the factor loadings. These numbers describe how well each variable maps onto a single factor. Since we only have one factor, we see that all variables map onto it. Again we see that hopelessness, trapped, helplessness, and depression map well onto the one factor model.
In the next section, we have SS loadings. Factors with SS loadings greater than 1 indicate they should be kept. The Proportion Var variable indicates the proportion of variance explained by each factor. If there were multiple factors, we would also see a row for Cumulative Variance, which would indicate the variance of each additional factor. You want these values to be high as they tell you the variance in the data that is explained by the factor model.
Finally, at the bottom is a hypothesis test. This tests whether 1 factor model sufficiently characterizes the variance in the data. A significant p-value suggests that the model is NOT sufficient. To address this we can turn to the fa.parallel() function in the psych package
psych::fa.parallel(x = burnout_data_scale_items) #Create a scree plot. The point in the scree plot where things clearly start to level off (i.e., the elbow) indicates the number of factors that you should use. The function also runs a parallel analysis tells you the optimal numbers of factors (in this case 3).
## Parallel analysis suggests that the number of factors = 3 and the number of components = 1
Given this test, lets run a factor analysis with three factors:
factanal(x = burnout_data_scale_items, factors = 3, rotation="varimax")
##
## Call:
## factanal(x = burnout_data_scale_items, factors = 3, rotation = "varimax")
##
## Uniquenesses:
## burnout_tired burnout_disappointed
## 0.250 0.436
## burnout_hopeless burnout_trapped
## 0.256 0.274
## burnout_helpless burnout_depressed
## 0.303 0.347
## burnout_sick burnout_worthless
## 0.454 0.394
## burnout_difficulty_sleeping burnout_had_it
## 0.500 0.580
##
## Loadings:
## Factor1 Factor2 Factor3
## burnout_tired 0.174 0.799 0.287
## burnout_disappointed 0.371 0.630 0.173
## burnout_hopeless 0.755 0.315 0.274
## burnout_trapped 0.760 0.312 0.228
## burnout_helpless 0.726 0.205 0.358
## burnout_depressed 0.536 0.372 0.477
## burnout_sick 0.342 0.247 0.607
## burnout_worthless 0.591 0.126 0.491
## burnout_difficulty_sleeping 0.266 0.375 0.537
## burnout_had_it 0.486 0.255 0.345
##
## Factor1 Factor2 Factor3
## SS loadings 2.901 1.695 1.610
## Proportion Var 0.290 0.170 0.161
## Cumulative Var 0.290 0.460 0.621
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 112 on 18 degrees of freedom.
## The p-value is 1.33e-15
In the new output, we see considerbly lower uniqueness scores. In the factor loadings, we see the first factor relates to being hopeless, trapped, and helpless. The second factor represents being tired and dissapointed. The third factor relates to being sick and having difficulty sleeping.
Judging by the SS loadings, we should keep all three factors, and a higher ammount of variance is explained (62%). Technically speaking, the p-value remains insignificant, suggesting we could look at the larger factors, but we should have enough support from our parallel analysis and scree plot to decide that we’ve gone far enough. That said, lets take a look at the other options.
factanal(x = burnout_data_scale_items, factors = 4, rotation="varimax") #p-value = 0.0004
##
## Call:
## factanal(x = burnout_data_scale_items, factors = 4, rotation = "varimax")
##
## Uniquenesses:
## burnout_tired burnout_disappointed
## 0.333 0.359
## burnout_hopeless burnout_trapped
## 0.253 0.251
## burnout_helpless burnout_depressed
## 0.293 0.342
## burnout_sick burnout_worthless
## 0.485 0.194
## burnout_difficulty_sleeping burnout_had_it
## 0.483 0.569
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4
## burnout_tired 0.185 0.702 0.369
## burnout_disappointed 0.319 0.702 0.154 0.152
## burnout_hopeless 0.686 0.343 0.202 0.342
## burnout_trapped 0.760 0.299 0.229 0.174
## burnout_helpless 0.706 0.180 0.335 0.254
## burnout_depressed 0.509 0.343 0.476 0.233
## burnout_sick 0.312 0.238 0.530 0.282
## burnout_worthless 0.440 0.136 0.338 0.692
## burnout_difficulty_sleeping 0.247 0.333 0.562 0.171
## burnout_had_it 0.475 0.236 0.355 0.154
##
## Factor1 Factor2 Factor3 Factor4
## SS loadings 2.519 1.585 1.429 0.905
## Proportion Var 0.252 0.158 0.143 0.090
## Cumulative Var 0.252 0.410 0.553 0.644
##
## Test of the hypothesis that 4 factors are sufficient.
## The chi square statistic is 27.39 on 11 degrees of freedom.
## The p-value is 0.00402
factanal(x = burnout_data_scale_items, factors = 5, rotation="varimax") #p-value = 0.0385
##
## Call:
## factanal(x = burnout_data_scale_items, factors = 5, rotation = "varimax")
##
## Uniquenesses:
## burnout_tired burnout_disappointed
## 0.315 0.367
## burnout_hopeless burnout_trapped
## 0.247 0.251
## burnout_helpless burnout_depressed
## 0.288 0.348
## burnout_sick burnout_worthless
## 0.478 0.239
## burnout_difficulty_sleeping burnout_had_it
## 0.489 0.005
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4 Factor5
## burnout_tired 0.171 0.722 0.348 0.112
## burnout_disappointed 0.311 0.691 0.135 0.152 0.131
## burnout_hopeless 0.697 0.338 0.216 0.195 0.262
## burnout_trapped 0.745 0.300 0.221 0.223
## burnout_helpless 0.704 0.180 0.348 0.212 0.130
## burnout_depressed 0.493 0.351 0.461 0.235 0.137
## burnout_sick 0.316 0.242 0.551 0.159 0.187
## burnout_worthless 0.499 0.124 0.415 0.164 0.546
## burnout_difficulty_sleeping 0.235 0.345 0.545 0.179
## burnout_had_it 0.302 0.199 0.224 0.897
##
## Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings 2.401 1.591 1.391 1.112 0.476
## Proportion Var 0.240 0.159 0.139 0.111 0.048
## Cumulative Var 0.240 0.399 0.538 0.650 0.697
##
## Test of the hypothesis that 5 factors are sufficient.
## The chi square statistic is 11.74 on 5 degrees of freedom.
## The p-value is 0.0385
factanal(x = burnout_data_scale_items, factors = 6, rotation="varimax") #p-value test satisfied
##
## Call:
## factanal(x = burnout_data_scale_items, factors = 6, rotation = "varimax")
##
## Uniquenesses:
## burnout_tired burnout_disappointed
## 0.388 0.286
## burnout_hopeless burnout_trapped
## 0.266 0.253
## burnout_helpless burnout_depressed
## 0.291 0.005
## burnout_sick burnout_worthless
## 0.512 0.030
## burnout_difficulty_sleeping burnout_had_it
## 0.420 0.005
##
## Loadings:
## Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
## burnout_tired 0.174 0.630 0.111 0.369 0.178
## burnout_disappointed 0.278 0.754 0.133 0.143 0.136 0.108
## burnout_hopeless 0.654 0.331 0.307 0.189 0.177 0.188
## burnout_trapped 0.721 0.287 0.179 0.213 0.192 0.175
## burnout_helpless 0.677 0.169 0.254 0.205 0.275 0.198
## burnout_depressed 0.384 0.288 0.214 0.203 0.299 0.767
## burnout_sick 0.294 0.226 0.278 0.158 0.456 0.202
## burnout_worthless 0.382 0.137 0.834 0.155 0.246 0.160
## burnout_difficulty_sleeping 0.215 0.288 0.169 0.167 0.609 0.150
## burnout_had_it 0.287 0.188 0.147 0.892 0.197 0.143
##
## Factor1 Factor2 Factor3 Factor4 Factor5 Factor6
## SS loadings 2.021 1.458 1.081 1.070 1.066 0.846
## Proportion Var 0.202 0.146 0.108 0.107 0.107 0.085
## Cumulative Var 0.202 0.348 0.456 0.563 0.670 0.754
##
## The degrees of freedom for the model is 0 and the fit was 0.0014
As we can see, it takes six factors to satisfy the p-value requirement. However, one reason for this is that we have quite a large dataset. As sample size increases, it is hard to have a “good fitting” model. I personally, would rely on the results of the fa.parallel() analysis, however the decision is subjective.
Given our factor analysis, we could choose to create three new subscales for burnout, assigning each variable to its highest scoring factor:
burnout_data$hope_trapped_helpless_subscale <- burnout_data$burnout_hopeless + burnout_data$burnout_trapped + burnout_data$burnout_helpless + burnout_data$burnout_depressed + burnout_data$burnout_worthless + burnout_data$burnout_had_it
burnout_data$tired_disappointed_subscale <- burnout_data$burnout_tired + burnout_data$burnout_disappointed
burnout_data$sick_subscale <- burnout_data$burnout_sick + burnout_data$burnout_difficulty_sleeping
burnout_data$burnout_scale <- burnout_data$burnout_sick +
burnout_data$burnout_difficulty_sleeping +
burnout_data$burnout_hopeless +
burnout_data$burnout_trapped +
burnout_data$burnout_helpless +
burnout_data$burnout_depressed +
burnout_data$burnout_worthless +
burnout_data$burnout_had_it +
burnout_data$burnout_tired +
burnout_data$burnout_disappointed
Now we can test correlations with the full subscale:
cor.test(x = burnout_data$burnout_scale, burnout_data$hope_trapped_helpless_subscale)
##
## Pearson's product-moment correlation
##
## data: burnout_data$burnout_scale and burnout_data$hope_trapped_helpless_subscale
## t = 168.57, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9577749 0.9639629
## sample estimates:
## cor
## 0.9609889
cor.test(x = burnout_data$burnout_scale, burnout_data$tired_disappointed_subscale)
##
## Pearson's product-moment correlation
##
## data: burnout_data$burnout_scale and burnout_data$tired_disappointed_subscale
## t = 56.63, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7417640 0.7759856
## sample estimates:
## cor
## 0.7593996
cor.test(x = burnout_data$burnout_scale, burnout_data$sick_subscale)
##
## Pearson's product-moment correlation
##
## data: burnout_data$burnout_scale and burnout_data$sick_subscale
## t = 66.667, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7940899 0.8220856
## sample estimates:
## cor
## 0.8085448
Clearly the mental manifestations subscales had the highest correlation: r = 0.961, followed by the the exhaustion subscale (r = 0.809), followed by the physical manifestations subscale (r = 0.759)
Testing for positive correlation with variables that you expect to be correlated with an outcome provides evidence for convergent validity, whereas testing for negative correlation with variables that you think are inversely associated with your outcome provides evidence for discriminant validity. With these new scales, we could examine which part of the scale items are most strongly impacted by loneliness (a variable we’ve seen pop up again and again as being important to burnout).
model <- lm(formula = UCLA3Score ~ sick_subscale + tired_disappointed_subscale + hope_trapped_helpless_subscale, data = burnout_data) #include subscales as explanatory factors for UCLA loneliness scores
broom::tidy(model) #summarize model
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.59 0.103 25.0 1.18e-122
## 2 sick_subscale -0.0192 0.0168 -1.14 2.53e- 1
## 3 tired_disappointed_subscale 0.158 0.0154 10.2 4.60e- 24
## 4 hope_trapped_helpless_subscale 0.0952 0.00589 16.2 7.81e- 56
varImp(model) #check variable importance
## Overall
## sick_subscale 1.142389
## tired_disappointed_subscale 10.231031
## hope_trapped_helpless_subscale 16.170933
From the above, we see that loneliness is especially related to exhaustion and mental manifestations of burnout and not the physical manifestations of burnout. This sort of analysis illustrates why it may be sometimes helpful to identify individual factors from a larger scale.
We could also use the information from a factor analysis to create a shortened version of the scale, making sure to select the one or two most improtant variables for each factor. I’ll choose an arbitrary cut point of 0.60 to measure inclusion:
burnout_data$short_burnout <- burnout_data$burnout_trapped + burnout_data$burnout_tired + burnout_data$burnout_sick
cor.test(x = burnout_data$burnout_scale, y = burnout_data$short_burnout)
##
## Pearson's product-moment correlation
##
## data: burnout_data$burnout_scale and burnout_data$short_burnout
## t = 115.27, df = 2354, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9153694 0.9275417
## sample estimates:
## cor
## 0.9216821
As you can see, the new 3 item burnout measure is strongly correlated with the full burnout scale. Shortening a measure might be useful in instances where using the full scale would be too cumbersome (i.e., future research projects)
We could also use this information to test whether a scale works the same way in different populations. For example, we might run a factor analysis on two separate datasets representing people with and without disability:
burnout_data_disability <- burnout_data[burnout_data$identity_disability != "8888: Not Selected", ] #create dataset for people with disability
burnout_data_disability <- burnout_data_disability[, names(burnout_data_disability) %in% burnout_vars] #remove unnecessary columns
burnout_data_disability <- na.omit(burnout_data_disability) #remove observations with missing data
burnout_data_no_disability <- burnout_data[burnout_data$identity_disability == "8888: Not Selected", ]
burnout_data_no_disability <- burnout_data_no_disability[, names(burnout_data_no_disability) %in% burnout_vars] #remove unnecessary columns
burnout_data_no_disability <- na.omit(burnout_data_no_disability) #remove observations with missing data
Within each model we can then check for the number of factors:
psych::fa.parallel(x = burnout_data_disability) # Parallel analysis Suggests 2 factors
## Parallel analysis suggests that the number of factors = 2 and the number of components = 1
psych::fa.parallel(x = burnout_data_no_disability) #Parallel analysis Suggests 3 factors
## Parallel analysis suggests that the number of factors = 3 and the number of components = 1
Next, we can compare the factor loadings for the three factor models for each group using the fa.congruence() function:
disability_loading <- factanal(x = burnout_data_disability, factors = 3, rotation="varimax") #Model for those with disability: factor 1 = hopeless, depressed, and wortheless factor 2 = tired factor 3 = trapped
no_disability_loading <- factanal(x = burnout_data_no_disability, factors = 3, rotation="varimax") #model for those without disability: factor 1 = hopeless, trapped, helpless, worthless factor 2 = sick, difficulty sleeping factor 3 = tired, disappointed
fa.congruence(disability_loading$loadings, no_disability_loading$loadings)
## Factor1 Factor2 Factor3
## Factor1 0.95 0.93 0.69
## Factor2 0.67 0.78 0.98
## Factor3 0.96 0.73 0.67
The same can be done for the two factor models for each group:
disability_loading <- factanal(x = burnout_data_disability, factors = 2, rotation="varimax")
disability_loading #factor 1 = hopeless, depressed, and wortheless factor 2 = tired factor 3 = trapped
##
## Call:
## factanal(x = burnout_data_disability, factors = 2, rotation = "varimax")
##
## Uniquenesses:
## burnout_tired burnout_disappointed
## 0.284 0.564
## burnout_hopeless burnout_trapped
## 0.274 0.304
## burnout_helpless burnout_depressed
## 0.296 0.296
## burnout_sick burnout_worthless
## 0.641 0.424
## burnout_difficulty_sleeping burnout_had_it
## 0.608 0.478
##
## Loadings:
## Factor1 Factor2
## burnout_tired 0.154 0.832
## burnout_disappointed 0.328 0.574
## burnout_hopeless 0.773 0.359
## burnout_trapped 0.787 0.278
## burnout_helpless 0.796 0.266
## burnout_depressed 0.698 0.466
## burnout_sick 0.339 0.494
## burnout_worthless 0.718 0.245
## burnout_difficulty_sleeping 0.310 0.544
## burnout_had_it 0.657 0.301
##
## Factor1 Factor2
## SS loadings 3.625 2.207
## Proportion Var 0.363 0.221
## Cumulative Var 0.363 0.583
##
## Test of the hypothesis that 2 factors are sufficient.
## The chi square statistic is 102.45 on 26 degrees of freedom.
## The p-value is 5.01e-11
no_disability_loading <- factanal(x = burnout_data_no_disability, factors = 2, rotation="varimax") #factor 1 = hopeless, trapped, helpless, worthless factor 2 = sick, difficulty sleeping factor 3 = tired, disappointed
fa.congruence(disability_loading$loadings, no_disability_loading$loadings)
## Factor1 Factor2
## Factor1 0.99 0.72
## Factor2 0.78 0.99
A congruence coefficient of 0.95 is interpreted as nearly identical, 0.90 as high similarity, and above 0.85 is a fair similarity. Below .85 is indication of bias. Results from our analysis show that factor 1 among people with disability is similar to both factors 1 and 2, factor 2 for PWD is similar to factor 3, and factor 3 is similar to factor 1.
Based on these differences, it is apparent that Burnout may represent a slightly different concept among people with disabilities. At the very least, we should carefully look for differences in the performance of any Burnout scales we use when modeling. Furthermore, this may motivate us to model the effects of burnout using stratified models to see if the predictors of burnout are different in these two populations. In the present case, the two-item factors are more comparable for people with and without disabilities, therefore, we might be motivated to construct and use subscales representing two rather than three subscales – perhaps one subscale reflecting the mental manifestations and second scale representing the physical manifestations.
This section introduces you to the following concepts and their corresponding functions:
Lets start by reading in our dataset.
burnout_data <- read.csv(file = "C:\\Users\\Kiffer\\OneDrive - Simon Fraser University (1sfu)\\HEAL Lab\\2. Projects\\9. Learning Hub\\Using R for Analyzing Survey Data\\burnout_data.csv") #Read in burnout_data
The first method we are going to review in this chapter is called mediation. Mediation seeks to identify and explain the mechanism or process that underlies an observed relationship between an independent variable and a dependent variable via the inclusion of a third hypothetical variable, known as a mediator variable. There are four steps to create a mediation model:
Step 1 - Model relationship between your explanatory factor and outcome. In this case, lets use Burnout scores as our outcome and support from coworkers as our explanatory factor (we will also control for age and gender).
table(burnout_data$work_support) #Check out our variable
##
## 8888:Not Working During COVID Completely
## 403 152
## Not at all Somewhat
## 12 313
## Very Little Very Well
## 101 354
burnout_data$work_support[burnout_data$work_support == "8888:Not Working During COVID"] <- NA#remove those not employed during covid
I’m going to use a binary explanatory factor in this case, so will collapse the not at all, somewhat, and very little groups and the completely and very well groups:
burnout_data$work_support_di[burnout_data$work_support == "Completely" |
burnout_data$work_support == "Very Well"] <- "Completely/Very Well"
burnout_data$work_support_di[burnout_data$work_support == "Somewhat" |
burnout_data$work_support == "Very Little" |
burnout_data$work_support == "Not at all"] <- "Somewhat/Very Little/Not at all"
burnout_data$work_support_di <- relevel(as.factor(burnout_data$work_support_di), ref = "Completely/Very Well") #Set reference level for variable
table(burnout_data$work_support_di)
##
## Completely/Very Well Somewhat/Very Little/Not at all
## 506 426
Remove participants with missing observations across key variables <- Necessary step because each model needs to be based on the same underlying data. The work_support variable was only asked of one-third of respondents.
burnout_data_subset <- burnout_data[!is.na(burnout_data$BurnoutScore),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$work_support_di),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$UCLA3Score),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$age),]
burnout_data_subset <- burnout_data_subset[!is.na(burnout_data_subset$gender),]
Y_X <- lm(formula = BurnoutScore ~ work_support_di + age + gender, data = burnout_data_subset)
broom::tidy(Y_X) #Results show that those who are less supported are more likely to burnout.
## # A tibble: 5 × 5
## term estimate std.e…¹ stati…² p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 3.15 0.116 27.1 2.80e-118
## 2 work_support_diSomewhat/Very Little/Not at… 0.541 0.0681 7.94 5.98e- 15
## 3 age -0.00357 0.00305 -1.17 2.43e- 1
## 4 genderNon-binary 0.359 0.232 1.55 1.21e- 1
## 5 genderWoman 0.230 0.0691 3.32 9.23e- 4
## # … with abbreviated variable names ¹std.error, ²statistic
Step 2 - Model relationship between your explanatory factor/outcome and your mediator. In this case, our mediator will be lonelinesss. I hypothesize that people who don’t feel supported, experience loneliness, and that loneliness contributes to burnout.
M_X <- lm(formula = UCLA3Score ~ work_support_di + age + gender, data = burnout_data_subset)
broom::tidy(M_X) #less worplace support is associated with loneliness
## # A tibble: 5 × 5
## term estim…¹ std.e…² stati…³ p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.74 0.177 26.7 4.25e-116
## 2 work_support_diSomewhat/Very Little/Not at … 0.757 0.104 7.28 7.43e- 13
## 3 age 0.00327 0.00466 0.702 4.83e- 1
## 4 genderNon-binary 0.371 0.354 1.05 2.94e- 1
## 5 genderWoman 0.474 0.105 4.50 7.85e- 6
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
Y_M <- lm(formula = BurnoutScore ~ UCLA3Score + age + gender, data = burnout_data_subset)
broom::tidy(M_X) #Loneliness is associated with burnout
## # A tibble: 5 × 5
## term estim…¹ std.e…² stati…³ p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.74 0.177 26.7 4.25e-116
## 2 work_support_diSomewhat/Very Little/Not at … 0.757 0.104 7.28 7.43e- 13
## 3 age 0.00327 0.00466 0.702 4.83e- 1
## 4 genderNon-binary 0.371 0.354 1.05 2.94e- 1
## 5 genderWoman 0.474 0.105 4.50 7.85e- 6
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
Step 3 - Model relationship between your explanatory factor and your outcome, controlling for mediator.
Y_XM <- lm(formula = BurnoutScore ~ work_support_di + UCLA3Score + age + gender, data = burnout_data_subset)
broom::tidy(Y_XM) #Thou the effect remain statistically significant, You can see that the effect sizes dropped:
## # A tibble: 6 × 5
## term estimate std.e…¹ stati…² p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.73 0.139 12.4 6.87e-33
## 2 work_support_diSomewhat/Very Little/Not at … 0.315 0.0624 5.04 5.53e- 7
## 3 UCLA3Score 0.299 0.0196 15.3 6.48e-47
## 4 age -0.00455 0.00272 -1.67 9.48e- 2
## 5 genderNon-binary 0.248 0.206 1.20 2.30e- 1
## 6 genderWoman 0.0880 0.0622 1.41 1.57e- 1
## # … with abbreviated variable names ¹std.error, ²statistic
Step 4 - I can now use the mediate() function from the mediate package
install.packages("mediation", repos = "http://cran.us.r-project.org")
## package 'mediation' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("mediation")
mediation_results <- mediate(M_X, Y_XM, treat='work_support_di', mediator='UCLA3Score', control.value = "Completely/Very Well", treat.value = "Somewhat/Very Little/Not at all", boot=TRUE, sims=100) #Test for mediation using 100 bootstrap simulations
summary(mediation_results) #Get mediation test results.
##
## Causal Mediation Analysis
##
## Nonparametric Bootstrap Confidence Intervals with the Percentile Method
##
## Estimate 95% CI Lower 95% CI Upper p-value
## ACME 0.226 0.163 0.30 <2e-16 ***
## ADE 0.315 0.191 0.45 <2e-16 ***
## Total Effect 0.541 0.404 0.67 <2e-16 ***
## Prop. Mediated 0.418 0.301 0.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Sample Size Used: 894
##
##
## Simulations: 100
ACME (Effect of mediator/Indirect Effect) - approx 0.224 is the average increase in the outcome variable among the treatment group (i.e. those with low social support) that arrives as a result of loneliness rather than ‘directly’ from the treatment.
ADE (Direct Effect of Explanatory Factor) - approx 0.325 is the average increase in the outcome variable among the treatment group (i.e. those with low social support) that arrives ‘directly’ from the treatment.
Total effects, is the sum of the ACME and ADE.
Prop. Mediated is the proportion of the effect of your outcome that is mediated by your mediator.
Mediation is used when you want to understand the mechanism or pathway by which an explanatory factor impacts an outcome. However, when you want to understand whether an effect operates differently in two populations, you should instead stratify your analysis. Lets try this by seeing if the effect of discrimination on burnout is the same on people who identify as Black, indigenous, or as a person of colour compared to those who do not. This can be done either by creating two seperate datasets (which you already know how to do) or by using the subset function.
table(burnout_data$identity_bipoc) #Check out our stratifiying variable (i.e., ethnicity)
##
## 8888: Not Selected
## 2028
## People of colour (e.g., Black, Indigenous, Asian or other racialized minority)
## 400
summary(burnout_data$discrimination_score) #Check out our explanatory variable (i.e., discrimination)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 14.00 25.00 24.86 34.00 54.00 92
summary(burnout_data$BurnoutScore) #Check out our outcome variable (i.e., burnout)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.00 2.70 3.50 3.42 4.10 7.00 92
BIPOC_model <- lm(formula = BurnoutScore ~ discrimination_score + age + gender, data = burnout_data, subset = (identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian or other racialized minority)"))
summary(BIPOC_model)
##
## Call:
## lm(formula = BurnoutScore ~ discrimination_score + age + gender,
## data = burnout_data, subset = (identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian or other racialized minority)"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4393 -0.7006 0.0234 0.7108 3.3677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.949119 0.246546 11.962 < 2e-16 ***
## discrimination_score 0.016692 0.004821 3.463 0.000598 ***
## age -0.005415 0.005528 -0.980 0.327973
## genderNon-binary 0.234434 0.535565 0.438 0.661840
## genderWoman 0.238843 0.111277 2.146 0.032502 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.058 on 365 degrees of freedom
## (50 observations deleted due to missingness)
## Multiple R-squared: 0.04488, Adjusted R-squared: 0.03441
## F-statistic: 4.287 on 4 and 365 DF, p-value: 0.002109
#R^2 0.03441 #Effect of discrimination: 0.016692
Non_BIPOC_model <- lm(formula = BurnoutScore ~ discrimination_score + age + gender, data = burnout_data, subset = (identity_bipoc == "8888: Not Selected"))
summary(Non_BIPOC_model)
##
## Call:
## lm(formula = BurnoutScore ~ discrimination_score + age + gender,
## data = burnout_data, subset = (identity_bipoc == "8888: Not Selected"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2650 -0.6880 0.0317 0.6491 3.5285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.597136 0.110941 23.410 < 2e-16 ***
## discrimination_score 0.023961 0.002397 9.996 < 2e-16 ***
## age 0.002214 0.001684 1.315 0.188632
## genderNon-binary 0.574048 0.159085 3.608 0.000316 ***
## genderWoman 0.283173 0.050536 5.603 2.41e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.062 on 1878 degrees of freedom
## (165 observations deleted due to missingness)
## Multiple R-squared: 0.06767, Adjusted R-squared: 0.06568
## F-statistic: 34.08 on 4 and 1878 DF, p-value: < 2.2e-16
#R^2 0.06568 #Effect of discrimination: 0.023961
From the output of the two models above, we can see that by stratifying our analyses, we actually see that the effect of discrimination is smaller on BIPOC vs. non-BIPOC participants. Additionally, the model among BIPOC individuals explains less of the variance in Burnout Scores than the model among non-BIPOC participants (0.016692 vs. 0.023961). This might suggest that discrimination is not the only factor driving burnout among BIPOC people, or perhaps BIPOC people are just more resilient to experiences of discrimination (because they face it so often)
We can do some checks, such as look at the distribution of burnout among non-BIPOC vs. BIPOC- identified participants. To do so, I am going to use an advanced graphics package called ggplot2 and use the functions, ggplot() and geom_density to display the densities of discrimination scores, stratified by identity_bipoc using the fill = function
install.packages("ggplot2", repos = "http://cran.us.r-project.org") #read in package
library("ggplot2") #read in library
ggplot(burnout_data, aes(discrimination_score, fill = !is.na(identity_bipoc))) + #aes() is the mapping command in ggplot, I tell it the variable I want plotted; fill = tells R that I want the data stratified, !is.na() ensures that the non-missing data is not plotted.
geom_density(alpha = 0.5) #alpha = 0.5 controls transparency #I use the plus sign above to allow me to connect the ggplot() mapping to the type of graph I want()
A simpler way of doing this graph would be to just graph each histogram separately:
hist(burnout_data$discrimination_score[burnout_data$identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian or other racialized minority)"]) #Plot BIPOC
hist(burnout_data$discrimination_score[burnout_data$identity_bipoc == "8888: Not Selected"]) #Plot Non-BIPOC
Either graphing method reveals a similar story. Many more of the BIPOC folk experience daily discrimination. You can verify this by looking at the descriptives, stratified by the identity_BIPOC variable:
summary(burnout_data$discrimination_score[burnout_data$identity_bipoc == "8888: Not Selected"]) #Summary for Non-BIPOC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 14.00 24.00 24.29 33.00 54.00 93
summary(burnout_data$discrimination_score[burnout_data$identity_bipoc == "People of colour (e.g., Black, Indigenous, Asian or other racialized minority)"]) #Plot BIPOC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9.00 18.00 30.00 28.21 36.00 54.00 39
I could go through and compare these two models more thoroughly – assess their fit. But you can see from this example that stratifying can be a complex approach to understanding the effect. Another way you can understand these sorts of moderators is through the use of an interaction term. Interactions are specified in a model using the * symbol, as shown below:
Interaction_model <- lm(formula = BurnoutScore ~ discrimination_score*identity_bipoc + age + gender, data = burnout_data)
broom::tidy(Interaction_model) #Use tidy to view results, notice I use :: to specify the package that tidy belongs to, which saves me from using the library() function to call in the broom library
## # A tibble: 7 × 5
## term estimate std.e…¹ stati…² p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.63 0.107 24.5 6.65e-118
## 2 discrimination_score 0.0236 0.00238 9.92 1.01e- 22
## 3 identity_bipocPeople of colour (e.g., Blac… 0.0487 0.160 0.303 7.62e- 1
## 4 age 0.00159 0.00161 0.992 3.22e- 1
## 5 genderNon-binary 0.545 0.152 3.58 3.50e- 4
## 6 genderWoman 0.278 0.0460 6.06 1.64e- 9
## 7 discrimination_score:identity_bipocPeople … -0.00630 0.00535 -1.18 2.39e- 1
## # … with abbreviated variable names ¹std.error, ²statistic
In this model, the effects of discrimination and being a person of colour are both statistically significant predictors of burnout. However, the effect of the interaction term (I.e, the last variable listed showing discrimination:identity_bipoc) is not significant. This helps us understand the results better from our stratified model: Namely, that the effect of discrimination is not statistically different for BIPOC vs. non-BIPOC people. You will notive that the effect of the interaction term is actually negative. This is an artifact for the fact that I am also controlling for the independent effect of discrimination. If you want to include an interaction term without including the individual main effects you can seperate your interaction variables by the : symbol, as shown below
Interaction_model_2 <- lm(formula = BurnoutScore ~ discrimination_score:identity_bipoc + age + gender, data = burnout_data) #Create interaction without independent main effects of interaction term
broom::tidy(Interaction_model_2) #call results
## # A tibble: 6 × 5
## term estim…¹ std.e…² stati…³ p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.65 0.0999 26.5 8.98e-135
## 2 age 0.00150 0.00157 0.950 3.42e- 1
## 3 genderNon-binary 0.546 0.152 3.58 3.45e- 4
## 4 genderWoman 0.278 0.0459 6.05 1.69e- 9
## 5 discrimination_score:identity_bipoc8888: No… 0.0233 0.00219 10.6 1.03e- 25
## 6 discrimination_score:identity_bipocPeople o… 0.0185 0.00263 7.04 2.50e- 12
## # … with abbreviated variable names ¹estimate, ²std.error, ³statistic
This approach is similar to (though not identical to) stratification. It allows us to see the effect of discrimination seperately for BIPOC and non-BIPOC seperately.
Lets cover one more example of interaction with two continuous variables. We will look at the interaction of loneliness and extraversion in predicting burnout. I hypothesize that loneliness contributes more to burnout among those with high extraversion (because they need more social connection).
Interaction_model_3 <- lm(formula = BurnoutScore ~ UCLA3Score*tipi_extraversion_score + age + gender, data = burnout_data) #Create interaction without independent main effects of interaction term
broom::tidy(Interaction_model_3) #call results
## # A tibble: 7 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 2.04 0.330 6.20 8.14e-10
## 2 UCLA3Score 0.324 0.0514 6.31 3.95e-10
## 3 tipi_extraversion_score -0.0431 0.0390 -1.11 2.69e- 1
## 4 age -0.00579 0.00186 -3.11 1.93e- 3
## 5 genderNon-binary 0.550 0.198 2.78 5.48e- 3
## 6 genderWoman 0.0462 0.0578 0.799 4.24e- 1
## 7 UCLA3Score:tipi_extraversion_score 0.00252 0.00618 0.407 6.84e- 1
Surprisingly the, interaction term is not significant, though we do see that extravertedness is associated with lower burnout and loneliness is associated with higher burnout. We can explore these effects further using the plot_model() function of the sjPlot package
install.packages("sjPlot", repos = "http://cran.us.r-project.org") #install package
## package 'sjPlot' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Kiffer\AppData\Local\Temp\RtmpAVsCB5\downloaded_packages
library("sjPlot") #Install library
plot_model(Interaction_model_3, type = "pred", terms = c("UCLA3Score", "tipi_extraversion_score")) #Plot the interaction model. type = "pred" is an argument to plot the marginal effects (or predicted values of Burnout across levels of loneliness)
As you can see the predicted levels of Burnout across UCLA scores are basically the same (overlapping) for the three levels displayed of the etraversion score, which were defined based on the mean and sd. So the red line represents those values 1 standard deviation below the mean, the blue represents the values up to the mean, and the green represents greater than 1 standard deviation above the mean:
summary(burnout_data$tipi_extraversion_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.000 7.000 8.000 7.918 9.000 14.000 1266
sd(burnout_data$tipi_extraversion_score, na.rm = TRUE)
## [1] 2.402553
We can also plot a chart for the interaction_model_2:
plot_model(Interaction_model_2, type = "pred", terms = c("discrimination_score", "identity_bipoc"))
Based on these results, neither of my hypothesized interactions were confirmed. But hopefully, you see the value of using interaction terms. If you recall from last week, we talked about comparing model fit. These same methods are helpful when you are considering whether to include an interaction term.
Let’s check the AIC values for the interaction model
no_interaction <- lm(formula = BurnoutScore ~ UCLA3Score + tipi_extraversion_score + age + gender, data = burnout_data) #Create interaction without independent main effects of interaction term
broom::tidy(no_interaction) #call results
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1.92 0.151 12.7 1.10e-34
## 2 UCLA3Score 0.344 0.0167 20.7 1.02e-80
## 3 tipi_extraversion_score -0.0280 0.0117 -2.39 1.71e- 2
## 4 age -0.00581 0.00186 -3.12 1.85e- 3
## 5 genderNon-binary 0.547 0.197 2.77 5.67e- 3
## 6 genderWoman 0.0459 0.0578 0.793 4.28e- 1
Now we can compare that to the new model without the interaction:
AIC(Interaction_model_3, no_interaction)
## df AIC
## Interaction_model_3 8 3110.899
## no_interaction 7 3109.066
lrtest(Interaction_model_3, no_interaction)
## Likelihood ratio test
##
## Model 1: BurnoutScore ~ UCLA3Score * tipi_extraversion_score + age + gender
## Model 2: BurnoutScore ~ UCLA3Score + tipi_extraversion_score + age + gender
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 8 -1547.5
## 2 7 -1547.5 -1 0.1666 0.6831
As you can see the model without the interaction is superior. Adding the interaction, while perhaps interesting to demonstrate a null result from my hypothesized effect, does not add information to the model. This can be confirmed using the liklihood ratio test since the non-interaction model is nested within the interaction model.
In this course, you have learned many of the essential skills for analyzing survey data with R. I hope that it has been an enjoyable learning experience for you.
If you find any errors in this text or have suggestions on how the course could be improved, please email kcard@sfu.ca.