Load libraries.

## here() starts at /Users/galraz1/Desktop/Polygence/Varun
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.4     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Load the data.

Inspect data.

head(match_data)
## # A tibble: 6 x 5
##      X1 all_rating_differ… all_match_results all_Player1Region all_Player2Region
##   <dbl>              <dbl>             <dbl> <chr>             <chr>            
## 1     1              -0.22                 1 #N/A              East Coast       
## 2     2               0.19                 0 East Coast        East Coast       
## 3     3              -0.13                 0 Mid Atlantic and… Mid Atlantic and…
## 4     4               0.19                 1 Mid Atlantic and… East Coast       
## 5     5              -0.26                 0 East Coast        East Coast       
## 6     6               0.04                 0 West Coast        East Coast

Clean data: 1) remove the unnecessary X1 column, and 2) set ‘#N/A’ regions to real NA values instead of strings.

# 1)
match_data <- match_data %>%
  mutate(X1 = NULL) 

# 2)
match_data[match_data == '#N/A'] = NA

# Inspect
head(match_data)
## # A tibble: 6 x 4
##   all_rating_differe… all_match_resul… all_Player1Region    all_Player2Region   
##                 <dbl>            <dbl> <chr>                <chr>               
## 1               -0.22                1 <NA>                 East Coast          
## 2                0.19                0 East Coast           East Coast          
## 3               -0.13                0 Mid Atlantic and So… Mid Atlantic and So…
## 4                0.19                1 Mid Atlantic and So… East Coast          
## 5               -0.26                0 East Coast           East Coast          
## 6                0.04                0 West Coast           East Coast
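The same two cleaning steps can also be written as one dplyr pipeline. The following is only a sketch on a made-up three-row tibble (`toy_matches` and its column names are hypothetical, not the real data), assuming dplyr 1.0 or later so that `across()` is available.

```r
library(dplyr)

# Hypothetical toy data that mimics the structure of match_data
toy_matches <- tibble(
  X1 = 1:3,
  rating_diff = c(-0.22, 0.19, -0.13),
  region = c("#N/A", "East Coast", "West Coast")
)

toy_clean <- toy_matches %>%
  select(-X1) %>%                                            # 1) drop the index column
  mutate(across(where(is.character), ~ na_if(.x, "#N/A")))   # 2) '#N/A' strings become NA

toy_clean
```

Restricting `na_if()` to character columns leaves the numeric columns untouched, which is a slightly narrower operation than the whole-table comparison used above.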

Set up predictors. In particular, set up the coding scheme: by default R uses dummy (treatment) coding for factors, but we want effect coding, also called sum coding.

match_results <- match_data$all_match_results
rating_diff <- match_data$all_rating_differences

# need to tell R that these are categorical variables
player1_region <- as.factor(match_data$all_Player1Region)
player2_region <- as.factor(match_data$all_Player2Region)

# contrast scheme: contr.sum(number of levels)
contrasts(player1_region) = contr.sum(4)
contrasts(player2_region) = contr.sum(4)
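To see what changing the contrast scheme actually does, it can help to print both contrast matrices for a small factor. This is a sketch on a hypothetical four-level factor (`toy_region`, with placeholder level names), not the real region variables.

```r
# Hypothetical factor with four placeholder levels
toy_region <- factor(c("A", "B", "C", "D"))

# Default: treatment (dummy) coding, where the first level is the reference
dummy_contrasts <- contrasts(toy_region)
dummy_contrasts

# Sum (effect) coding: the last level is coded -1 in every column
contrasts(toy_region) <- contr.sum(nlevels(toy_region))
sum_contrasts <- contrasts(toy_region)
sum_contrasts
```

Passing `nlevels()` instead of a hard-coded 4 is a small design choice: the code keeps working if the number of levels changes.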

Run model.

region_model <- glm(match_results ~ rating_diff + player1_region + player2_region, family = 'binomial')

summary(region_model)
## 
## Call:
## glm(formula = match_results ~ rating_diff + player1_region + 
##     player2_region, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3088  -1.1429  -0.9183   1.1458   1.5426  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -0.29678    0.11138  -2.665  0.00771 **
## rating_diff     -0.05069    0.16510  -0.307  0.75885   
## player1_region1  0.29934    0.11172   2.679  0.00737 **
## player1_region2 -0.15811    0.17812  -0.888  0.37472   
## player1_region3 -0.24385    0.18766  -1.299  0.19381   
## player2_region1  0.09620    0.10966   0.877  0.38035   
## player2_region2 -0.12110    0.17426  -0.695  0.48711   
## player2_region3  0.27079    0.17098   1.584  0.11325   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1163.3  on 839  degrees of freedom
## Residual deviance: 1151.3  on 832  degrees of freedom
##   (160 observations deleted due to missingness)
## AIC: 1167.3
## 
## Number of Fisher Scoring iterations: 4

The way we want to interpret this table is a little different from what we discussed for dummy coding: the intercept no longer represents a particular condition (i.e. the omitted one). Instead it represents the grand mean: the log odds of the outcome averaged over all region levels, with rating_diff at 0.

You can continue to interpret the coefficient for the rating difference as the change in log odds for a one-unit change in rating_diff, since rating_diff is a continuous variable and the contrast coding scheme does not apply to it.
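Log odds are easier to read once exponentiated into an odds ratio. The sketch below copies the rating_diff estimate and standard error from the summary table above; the Wald-style 95% interval is an assumption made for illustration, not something computed in the original analysis.

```r
# Values copied from the rating_diff row of the summary table
rating_coef <- -0.05069
rating_se   <- 0.16510

# Odds ratio for a one-unit increase in rating_diff
odds_ratio <- exp(rating_coef)

# Approximate 95% Wald interval on the odds-ratio scale (an assumption)
wald_ci <- exp(rating_coef + c(-1, 1) * qnorm(0.975) * rating_se)

odds_ratio
wald_ci
```

An interval that contains 1 is consistent with the non-significant p-value reported for rating_diff.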

To understand how to interpret the coefficients for the region variables, let’s look at the contrast table we made earlier when we used contr.sum():

head(contrasts(player1_region))
##                             [,1] [,2] [,3]
## East Coast                     1    0    0
## Mid Atlantic and South East    0    1    0
## Midwest and Texas              0    0    1
## West Coast                    -1   -1   -1

What sum coding/effect coding does is recast the coefficients so that they represent the difference from the grand mean, instead of the difference from the omitted condition (as with dummy coding). So we interpret the coefficient for a region as ‘how different are the match outcomes for players from this region compared to the grand mean?’.

If we wanted to say something like ‘the log odds of the match outcome when player 1 is from region X are …’, we would need to take the intercept coefficient and add it to the coefficient for the region of interest.

This is true except for the last region, which you can see is coded as -1 -1 -1 and so has no coefficient of its own in the summary table. To get the value for this region we would need to take all the coefficients for that player’s region variable and subtract them from the grand mean, i.e. the intercept.
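The arithmetic is easiest to check on a toy example where the answer is known in advance. The sketch below uses a hypothetical data frame (`toy`) with three groups whose means are 1, 2 and 6, and fits `lm()` instead of `glm()` so the numbers come out exactly; the same logic carries over to the log-odds scale of a logistic model.

```r
# Hypothetical toy data: group means are a = 1, b = 2, c = 6
toy <- data.frame(
  y = c(0, 2, 1, 3, 5, 7),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)

# Use sum coding for the factor
contrasts(toy$g) <- contr.sum(3)

fit <- lm(y ~ g, data = toy)
b <- coef(fit)

# Intercept is the mean of the three group means: (1 + 2 + 6) / 3
grand_mean <- b[[1]]

# Rebuild each group mean from the coefficients
group_means <- c(
  a = b[[1]] + b[[2]],            # intercept + first coefficient
  b = b[[1]] + b[[3]],            # intercept + second coefficient
  c = b[[1]] - b[[2]] - b[[3]]    # last level: intercept minus all coefficients
)

grand_mean
group_means
```

With more than one predictor, as in region_model, the reconstructed value for a region additionally holds the other predictors at their reference point (rating_diff at 0 and the other region factor at its grand mean).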

# get coefficients
coeffs = as.numeric(coef(summary(region_model))[,1])

# effect of player 1 being from East Coast [we're using coeffs[3] because that's the coefficient that goes with player1_region1, see summary table above]
p1_east_coast = coeffs[1] + coeffs[3]

# effect of player 1 being from Mid Atlantic & SE
p1_mid_atl = coeffs[1] + coeffs[4]

# effect of player 1 being from Midwest & Texas
p1_midw = coeffs[1] + coeffs[5]

# effect of player 1 being from West Coast
p1_west_coast = coeffs[1] - coeffs[3] - coeffs[4] - coeffs[5]
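These region values are on the log-odds scale. One way to make them more readable is to convert them to probabilities with `plogis()`, the inverse logit function in base R. The sketch below copies the coefficients from the summary table above instead of refitting the model; the short vector names are placeholders chosen for this example.

```r
# Coefficients copied from the summary table (log-odds scale)
intercept <- -0.29678
p1_region <- c(east_coast = 0.29934, mid_atl = -0.15811, midw = -0.24385)

# Log odds per player 1 region; West Coast is the intercept minus all three coefficients
log_odds <- c(
  intercept + p1_region,
  west_coast = intercept - sum(p1_region)
)

# Convert log odds to the probability that match_results equals 1
probs <- plogis(log_odds)
round(probs, 3)
```

These probabilities apply at rating_diff = 0 with the player 2 region held at its grand mean, so they are best read as a comparison between player 1 regions, not as forecasts for a specific match.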

Further readings on contrast coding:

  1. Practical question on dummy coding vs sum coding: https://stats.stackexchange.com/questions/52132/how-to-do-regression-with-effect-coding-instead-of-dummy-coding-in-r

  2. Implementing different coding schemes in R: https://marissabarlaz.github.io/portfolio/contrastcoding/

  3. Overview/more explanation of coding schemes: https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/