Quantitative Methods Workshop

1 Information
- 1.1 Reminders
- 1.2 Learning objectives
2 Explore Data in Excel
3 Starting RStudio Cloud
- 3.1 Make an RMarkdown document
4 Loading and Exploring Data
- 4.1 Load data into R
- 4.2 Explore the dataset
5 Finding Groups Within Our Data

1 Information

This page is meant to be used for the course HE-942 at the MGH Institute of Health Professions.

1.1 Reminders

Link to this page: tinyurl.com/he942quant or http://rpubs.com/anshulkumar/he942quant
Use copy and paste whenever you can to complete this activity.

1.2 Learning objectives

Today we will learn how to:

View data in Excel
Import Excel files into R
Make a nice-looking file in R that integrates the text we want to write and our R code
Calculate basic descriptive statistics in R
Use visualizations to explore data in R

2 Explore Data in Excel

Task 1: Download (to your computer) the file WPBA-Baker-Workshop-20191204.xlsx which is available in D2L. Open the file once it is on your computer.

Task 2: What is the unit of observation in this data?

Task 3: What information do we have about each observation (row)?

Task 4: For the surgery with RecordID = 14, who was the resident and who was the attending?

Task 5: What are the “variables” in this Excel sheet?

3 Starting RStudio Cloud

Any time you need to view or analyze some data—whether by yourself or in a team—you can do so at http://rstudio.cloud for free! Just sign in with your Google account or make a new account.

3.1 Make an RMarkdown document

Task 6: Insert code chunk.

Task 7: Try arithmetic.

2+7

## [1] 9

Task 8: Store values as variables and then add them.

SomeNumber <- 3
AnotherNumber <- 392
SomeNumber + AnotherNumber

## [1] 395

Task 9: Change the values of SomeNumber and AnotherNumber. Add once again add them together and see what you get.

Task 10: Repeat the previous task but change the names of the variables SomeNumber and AnotherNumber to anything else. This time subtract one from the other.

Note that all you did to accomplish the last two tasks is you took the code that was initially provided to you and modified it a little bit. A lot of quantitative analysis involves making slight modifications to procedures or code that others have done or written!

Task 11: Publish document using Knit menu.

4 Loading and Exploring Data

4.1 Load data into R

Task 12: Upload our Excel data file into RStudio Cloud.

Task 13: Load the xlsx package with the code below. This package gives us the capability to load Excel files into R.

library(xlsx)

Task 14: Import the data file. First run this command:

file.choose()

Select the Excel file you want to load.

You will see a file path returned to you that looks like this:

/cloud/project/WPBA-Baker-Workshop-20191204.xlsx

Make sure the above file path matches the file path in the command below:

d <- read.xlsx("/cloud/project/HE942/WPBA-Baker-Workshop-20191204.xlsx", header=TRUE, sheetIndex=1, na.strings="NA")

Now you know how to load an Excel file into R! The data file is loaded and its name is d. In R, d is called a dataframe. But we often refer to d as a dataset or simply our data.

Task 15: View the data within R.

View(d)

It should look exactly the same as it does in Excel.

4.2 Explore the dataset

Task 16: List the variables in the dataset. The OpTrust variables are the ones that start with the label OpT_.

names(d)

##  [1] "RecordID"                  "AttFamiliar"              
##  [3] "AttPrevOps"                "AttPrevThisOp"            
##  [5] "AttDifficulty"             "AttComplication"          
##  [7] "AttZwisch"                 "ResRotTime"               
##  [9] "ResPrevOpsAttending"       "ResPrevThisOppAtt"        
## [11] "ResPrevOp"                 "ResFamiliarAtt"           
## [13] "ResDifficulty"             "ResComplication"          
## [15] "ResZwisch"                 "RNFAPresent"              
## [17] "OpT_AttQuesAsked"          "OpT_AttOperPlan"          
## [19] "OpT_AttInstruction"        "OpT_AttProbSolve"         
## [21] "OpT_AttLeadershipRes"      "OpT_ResQuesAsked"         
## [23] "OpT_ResOpPlan"             "OpT_ResInstruction"       
## [25] "OpT_ResProbSolve"          "OpT_ResLeadershipRes"     
## [27] "ResID"                     "ResidentPGY"              
## [29] "AttID"                     "ObsID"                    
## [31] "Date"                      "ObsStartTime"             
## [33] "ObsEndTime"                "ObsProcedureType"         
## [35] "ObsLearnersPresent"        "ObsInstructionByAttending"
## [37] "AttPosition"               "AttDivision"              
## [39] "AttCertYear"               "AttGender"                
## [41] "ResPGY"                    "ResGender"                
## [43] "ZwischDiff"                "na.strings"

Task 17: Look at the distribution of the OpTrust variable that measures the attending’s instruction.

table(d$OpT_AttInstruction)

## 
##  1  2  3  4 
## 15 46 29  7

The command above translates like this: create a table of the variable OpT_AttInstruction within dataframe d. We use the dollar sign operator to select a variable from within a dataframe.

Visualization 1. Histogram.

hist(d$OpT_AttInstruction)

Task 18: Below is code to make a table that counts how many cases in our data were supervised by each attending. Adapt this code by changing the variable name within the table() function to see how many cases each resident took part in.

table(d$AttID)

## 
##  Att_1 Att_10  Att_2  Att_3  Att_4  Att_5  Att_6  Att_7  Att_8  Att_9 
##      4      1      1      8     15     15      6     31      5     11

Task 19: We can also figure out how many attendings are in our data by taking the exact code above and pasting it within the length() function, shown below. Please adapt this code to figure out how many residents are in the data.

length(table(d$AttID))

## [1] 10

Task 20: Make an OpTrust index score variable for attendings. First we’ll do this together in Excel. Then we’ll do it together in R. To do this, we’ll need to add together the following variables: OpT_AttQuesAsked, OpT_AttOperPlan, OpT_AttInstruction, OpT_AttProbSolve, OpT_AttLeadershipRes.

d$OpTAttIndex <- d$OpT_AttQuesAsked + d$OpT_AttOperPlan + d$OpT_AttInstruction + d$OpT_AttProbSolve + d$OpT_AttLeadershipRes

Translation of above: Create a new variable within the dataframe d called OpTAttIndex. This new variable will be equal to the sum of [the five variables that follow], each also within the dataset d.

Now it’s your turn…

Task 21: Create a similar index for the five variables related to OpTrust for residents, both in Excel and R.

Task 22: Explore the distribution and trends within the Attending and Resident OpTrust indexes. Some options for this exploration are presented below, only for the Attending OpTrust index. Please modify the code below to explore the Resident OpTrust index that you created.

Visualization 2. Histogram of index distribution:

hist(d$OpTAttIndex)

Summary statistics:

summary(d$OpTAttIndex)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    9.00   11.00   11.14   14.00   20.00

mean(d$OpTAttIndex)

## [1] 11.14433

sd(d$OpTAttIndex)

## [1] 3.47008

Two-way table (cross table or “cross-tab”) of the index and resident PGY:

table(d$OpTAttIndex, d$ResPGY)

##     
##      1 2 3 5
##   5  2 3 2 0
##   6  1 2 1 0
##   7  1 1 2 0
##   8  1 6 0 0
##   9  2 5 2 1
##   10 1 4 0 3
##   11 1 5 5 2
##   12 1 5 3 2
##   13 0 3 2 2
##   14 0 6 2 2
##   15 1 2 1 3
##   16 0 1 0 2
##   17 0 1 1 0
##   18 0 0 0 2
##   19 0 0 0 1
##   20 0 0 0 1

Do you notice any trends from this table above?

Maybe we need a scatterplot:

Visualization 3. Scatterplot.

plot(d$ResPGY, d$OpTAttIndex)

Do you see a trend in the plot above?

Let’s re-color the points by resident gender:

Visualization 4. Grouped scatterplot.

library(ggplot2)
ggplot(d, aes(x = ResPGY, y = OpTAttIndex, colour = ResGender)) + geom_point()

We can also create side-by-side plots instead:

Visualization 5. Side-by-side scatterplot.

ggplot(d, aes(x = ResPGY, y = OpTAttIndex, colour = ResGender)) + geom_point() + facet_wrap( ~ ResGender)

But what if we use this side-by-side capability to add a third variable into the mix?

Visualization 6. Grouped side-by-side scatterplot

ggplot(d, aes(x = ResPGY, y = OpTAttIndex, colour = ResGender)) + geom_point() + facet_wrap( ~ AttGender) + labs(title="Scatterplots by attending gender") + labs(x = "Resident PGY") + labs(y = "Attending OpTrust Index")

Let’s see how correlated the five OpTrust dimensions are with each other, for attendings:

Visualization 7. Pairwise correlation grid

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(d[c("OpT_AttQuesAsked", "OpT_AttOperPlan", "OpT_AttInstruction", "OpT_AttProbSolve", "OpT_AttLeadershipRes")])

The code above is saying: Look inside dataframe d and take just the five variables that we have listed. Create a pairwise grid that plots each possible pair combination out of these five variables and show us each pair’s scatterplot and correlation. Also, show us the distribution curve of each variable.

5 Finding Groups Within Our Data

We will use PCA (principal component analysis) to see if the five OpTrust dimensions for attendings tell us if the surgeries in our dataset fall into distinct groups or not. Please follow along and do the same for the OpTrust variables for residents.

Task 23: Run the PCA code below.

pca1 <- prcomp(d[c("OpT_AttQuesAsked", "OpT_AttOperPlan", "OpT_AttInstruction", "OpT_AttProbSolve", "OpT_AttLeadershipRes")], scale = TRUE, center = TRUE)
summary(pca1)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.6855 0.8807 0.8022 0.62471 0.59135
## Proportion of Variance 0.5682 0.1551 0.1287 0.07805 0.06994
## Cumulative Proportion  0.5682 0.7233 0.8520 0.93006 1.00000

The majority of the variance in the five OpTrust variables that we fed into the PCA has been reduced into two “components”.

Visualization 8. PCA components.

library(devtools)

## Loading required package: usethis

# install_github("vqv/ggbiplot")
library(ggbiplot)

## Loading required package: plyr

## Loading required package: scales

## Loading required package: grid

ggbiplot(pca1)

Visualization 9. PCA with attending groups.

ggbiplot(pca1,ellipse=TRUE, labels=rownames(d), groups = d$AttID)

Visualization 10. PCA with attending gender groups.

ggbiplot(pca1,ellipse=TRUE, labels=rownames(d), groups = d$AttGender)

Task 24: Create a visualization based on the PCA with resident gender groups (instead of attending gender groups like above).

Task 25: Separate the groups by the gender pairings during each surgery.

First, create a new variable in our dataset d called GenderResAtt.

d$GenderResAtt <- factor(rep(NA, nrow(d)), levels= c("ResFemale AttFemale","ResFemale AttMale","ResMale AttFemale","ResMale AttMale"))

d$GenderResAtt[d$ResGender=="F" & d$AttGender=="F"] <- "ResFemale AttFemale"
d$GenderResAtt[d$ResGender=="F" & d$AttGender=="M"] <- "ResFemale AttMale"
d$GenderResAtt[d$ResGender=="M" & d$AttGender=="F"] <- "ResMale AttFemale"
d$GenderResAtt[d$ResGender=="M" & d$AttGender=="M"] <- "ResMale AttMale"

This table below shows us if our new variable was coded correctly:

table(d$GenderResAtt, d$ResGender,d$AttGender)

## , ,  = F
## 
##                      
##                        F  M
##   ResFemale AttFemale 21  0
##   ResFemale AttMale    0  0
##   ResMale AttFemale    0 13
##   ResMale AttMale      0  0
## 
## , ,  = M
## 
##                      
##                        F  M
##   ResFemale AttFemale  0  0
##   ResFemale AttMale   46  0
##   ResMale AttFemale    0  0
##   ResMale AttMale      0 17

And here is the new visualization with four groups:

Visualization 11. PCA with resident-attending gender pair groups.

ggbiplot(pca1,ellipse=TRUE, labels=rownames(d), groups = d$GenderResAtt)