Quantitative Methods Workshop
1 Information
This page is meant to be used for the course HE-942 at the MGH Institute of Health Professions.
1.1 Reminders
- Link to this page: tinyurl.com/he942quant or http://rpubs.com/anshulkumar/he942quant
- Use
copyandpastewhenever you can to complete this activity.
1.2 Learning objectives
Today we will learn how to:
- View data in Excel
- Import Excel files into R
- Make a nice-looking file in R that integrates the text we want to write and our R code
- Calculate basic descriptive statistics in R
- Use visualizations to explore data in R
2 Explore Data in Excel
Task 1: Download (to your computer) the file WPBA-Baker-Workshop-20191204.xlsx which is available in D2L. Open the file once it is on your computer.
Task 2: What is the unit of observation in this data?
Task 3: What information do we have about each observation (row)?
Task 4: For the surgery with RecordID = 14, who was the resident and who was the attending?
Task 5: What are the “variables” in this Excel sheet?
3 Starting RStudio Cloud
Any time you need to view or analyze some data—whether by yourself or in a team—you can do so at http://rstudio.cloud for free! Just sign in with your Google account or make a new account.
3.1 Make an RMarkdown document
Task 6: Insert code chunk.
Task 7: Try arithmetic.
## [1] 9
Task 8: Store values as variables and then add them.
## [1] 395
Task 9: Change the values of SomeNumber and AnotherNumber. Add once again add them together and see what you get.
Task 10: Repeat the previous task but change the names of the variables SomeNumber and AnotherNumber to anything else. This time subtract one from the other.
Note that all you did to accomplish the last two tasks is you took the code that was initially provided to you and modified it a little bit. A lot of quantitative analysis involves making slight modifications to procedures or code that others have done or written!
Task 11: Publish document using Knit menu.
4 Loading and Exploring Data
4.1 Load data into R
Task 12: Upload our Excel data file into RStudio Cloud.
Task 13: Load the xlsx package with the code below. This package gives us the capability to load Excel files into R.
Task 14: Import the data file. First run this command:
file.choose()
Select the Excel file you want to load.
You will see a file path returned to you that looks like this:
/cloud/project/WPBA-Baker-Workshop-20191204.xlsx
Make sure the above file path matches the file path in the command below:
Now you know how to load an Excel file into R! The data file is loaded and its name is d. In R, d is called a dataframe. But we often refer to d as a dataset or simply our data.
Task 15: View the data within R.
View(d)
It should look exactly the same as it does in Excel.
4.2 Explore the dataset
Task 16: List the variables in the dataset. The OpTrust variables are the ones that start with the label OpT_.
## [1] "RecordID" "AttFamiliar"
## [3] "AttPrevOps" "AttPrevThisOp"
## [5] "AttDifficulty" "AttComplication"
## [7] "AttZwisch" "ResRotTime"
## [9] "ResPrevOpsAttending" "ResPrevThisOppAtt"
## [11] "ResPrevOp" "ResFamiliarAtt"
## [13] "ResDifficulty" "ResComplication"
## [15] "ResZwisch" "RNFAPresent"
## [17] "OpT_AttQuesAsked" "OpT_AttOperPlan"
## [19] "OpT_AttInstruction" "OpT_AttProbSolve"
## [21] "OpT_AttLeadershipRes" "OpT_ResQuesAsked"
## [23] "OpT_ResOpPlan" "OpT_ResInstruction"
## [25] "OpT_ResProbSolve" "OpT_ResLeadershipRes"
## [27] "ResID" "ResidentPGY"
## [29] "AttID" "ObsID"
## [31] "Date" "ObsStartTime"
## [33] "ObsEndTime" "ObsProcedureType"
## [35] "ObsLearnersPresent" "ObsInstructionByAttending"
## [37] "AttPosition" "AttDivision"
## [39] "AttCertYear" "AttGender"
## [41] "ResPGY" "ResGender"
## [43] "ZwischDiff" "na.strings"
Task 17: Look at the distribution of the OpTrust variable that measures the attending’s instruction.
##
## 1 2 3 4
## 15 46 29 7
The command above translates like this: create a table of the variable OpT_AttInstruction within dataframe d. We use the dollar sign operator to select a variable from within a dataframe.
Visualization 1. Histogram.
Task 18: Below is code to make a table that counts how many cases in our data were supervised by each attending. Adapt this code by changing the variable name within the table() function to see how many cases each resident took part in.
##
## Att_1 Att_10 Att_2 Att_3 Att_4 Att_5 Att_6 Att_7 Att_8 Att_9
## 4 1 1 8 15 15 6 31 5 11
Task 19: We can also figure out how many attendings are in our data by taking the exact code above and pasting it within the length() function, shown below. Please adapt this code to figure out how many residents are in the data.
## [1] 10
Task 20: Make an OpTrust index score variable for attendings. First we’ll do this together in Excel. Then we’ll do it together in R. To do this, we’ll need to add together the following variables: OpT_AttQuesAsked, OpT_AttOperPlan, OpT_AttInstruction, OpT_AttProbSolve, OpT_AttLeadershipRes.
Translation of above: Create a new variable within the dataframe d called OpTAttIndex. This new variable will be equal to the sum of [the five variables that follow], each also within the dataset d.
Now it’s your turn…
Task 21: Create a similar index for the five variables related to OpTrust for residents, both in Excel and R.
Task 22: Explore the distribution and trends within the Attending and Resident OpTrust indexes. Some options for this exploration are presented below, only for the Attending OpTrust index. Please modify the code below to explore the Resident OpTrust index that you created.
Visualization 2. Histogram of index distribution:
Summary statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 9.00 11.00 11.14 14.00 20.00
## [1] 11.14433
## [1] 3.47008
Two-way table (cross table or “cross-tab”) of the index and resident PGY:
##
## 1 2 3 5
## 5 2 3 2 0
## 6 1 2 1 0
## 7 1 1 2 0
## 8 1 6 0 0
## 9 2 5 2 1
## 10 1 4 0 3
## 11 1 5 5 2
## 12 1 5 3 2
## 13 0 3 2 2
## 14 0 6 2 2
## 15 1 2 1 3
## 16 0 1 0 2
## 17 0 1 1 0
## 18 0 0 0 2
## 19 0 0 0 1
## 20 0 0 0 1
Do you notice any trends from this table above?
Maybe we need a scatterplot:
Visualization 3. Scatterplot.
Do you see a trend in the plot above?
Let’s re-color the points by resident gender:
Visualization 4. Grouped scatterplot.
We can also create side-by-side plots instead:
Visualization 5. Side-by-side scatterplot.
But what if we use this side-by-side capability to add a third variable into the mix?
Visualization 6. Grouped side-by-side scatterplot
Let’s see how correlated the five OpTrust dimensions are with each other, for attendings:
Visualization 7. Pairwise correlation grid
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
The code above is saying: Look inside dataframe d and take just the five variables that we have listed. Create a pairwise grid that plots each possible pair combination out of these five variables and show us each pair’s scatterplot and correlation. Also, show us the distribution curve of each variable.
5 Finding Groups Within Our Data
We will use PCA (principal component analysis) to see if the five OpTrust dimensions for attendings tell us if the surgeries in our dataset fall into distinct groups or not. Please follow along and do the same for the OpTrust variables for residents.
Task 23: Run the PCA code below.
pca1 <- prcomp(d[c("OpT_AttQuesAsked", "OpT_AttOperPlan", "OpT_AttInstruction", "OpT_AttProbSolve", "OpT_AttLeadershipRes")], scale = TRUE, center = TRUE)
summary(pca1)## Importance of components:
## PC1 PC2 PC3 PC4 PC5
## Standard deviation 1.6855 0.8807 0.8022 0.62471 0.59135
## Proportion of Variance 0.5682 0.1551 0.1287 0.07805 0.06994
## Cumulative Proportion 0.5682 0.7233 0.8520 0.93006 1.00000
The majority of the variance in the five OpTrust variables that we fed into the PCA has been reduced into two “components”.
Visualization 8. PCA components.
## Loading required package: usethis
## Loading required package: plyr
## Loading required package: scales
## Loading required package: grid
Visualization 9. PCA with attending groups.
Visualization 10. PCA with attending gender groups.
Task 24: Create a visualization based on the PCA with resident gender groups (instead of attending gender groups like above).
Task 25: Separate the groups by the gender pairings during each surgery.
First, create a new variable in our dataset d called GenderResAtt.
d$GenderResAtt <- factor(rep(NA, nrow(d)), levels= c("ResFemale AttFemale","ResFemale AttMale","ResMale AttFemale","ResMale AttMale"))
d$GenderResAtt[d$ResGender=="F" & d$AttGender=="F"] <- "ResFemale AttFemale"
d$GenderResAtt[d$ResGender=="F" & d$AttGender=="M"] <- "ResFemale AttMale"
d$GenderResAtt[d$ResGender=="M" & d$AttGender=="F"] <- "ResMale AttFemale"
d$GenderResAtt[d$ResGender=="M" & d$AttGender=="M"] <- "ResMale AttMale"This table below shows us if our new variable was coded correctly:
## , , = F
##
##
## F M
## ResFemale AttFemale 21 0
## ResFemale AttMale 0 0
## ResMale AttFemale 0 13
## ResMale AttMale 0 0
##
## , , = M
##
##
## F M
## ResFemale AttFemale 0 0
## ResFemale AttMale 46 0
## ResMale AttFemale 0 0
## ResMale AttMale 0 17
And here is the new visualization with four groups:
Visualization 11. PCA with resident-attending gender pair groups.