Context

This code is designed to explore the factors associated with insurance acceptance among mental health providers (data from psychology today).

  • The following steps were taken:
    • Exploratory univariate analysis
    • Building the full logistic regression model
    • Exploring provider supply variables–binary vs continuous
    • Comparing reduced models to full model
    • Exploring the interaction between provider supply and rurality
    • Exploring a stratified model (which was for visual purposes, stratified model was not ultimately chosen as the correct choice)
    • Running full model through a training and testing sample, then exploring the accuracy of the model
    • Running the model through a random forest to better understand the importance of the variables on insurance acceptance among mental health providers

Code book for variables in full model:

  • c_provider_1k: the supply of all types of mental health providers per 1000k county residents, centered around its mean
    • Mean: 0.94
  • provs: provider type in categories.
    • Levels: “Masters level therapists and counselors”, “All other therapists and counselors”, “Psychologists/PhD”, “Advanced practice providers”
    • Level names were changed to be better on output
  • any_child: does a provider also see clients under 18?
    • Binary
  • Rurality: Binary–urban or rural.
    • Based on Categories of the 2013 and 2006 NCHS Urban-Rural Classification Scheme for Counties.
    • Metro distinctions 1-4 = urban
    • metro distinctions 5-6 = rural (micropoliton and non-core)
  • income_med: Binary
    • whether provider is in a county with a median household income that is above or below the median of the sample.
    • Data from ACS 5-year survey
  • group_pr: is the provider part of a group practice?
    • Determined by grouping based on exact addresses.
    • Binary
  • frac_white_med: binary
    • Data from 5-year ACS
    • Binary, whether provider is in a county with a fraction of county population that is white that is above or below the median of the sample

data_acs <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/working data acs 4-5-24.csv")

data_acs <- data_acs[which(!is.na(data_acs$id)), ] #just in case it reuploads weirdly from csv
#Desc(data_acs$id)
#length of df should be 184'957
# unique IDs should be 167'854

Load Required Packages


Explore urbanicity

metros <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/NCHSURCodes2013.csv")
data_acs <- left_join(data_acs, metros, by = "fips")

data_acs <- data_acs %>%
  mutate(rurality = case_when(metro_codes==1 | metro_codes==2 | metro_codes==3 | metro_codes==4 ~ "Urban",
                            metro_codes==5 | metro_codes==6 ~ "Rural")
         )

Model building

Exhibit 4. Regression of provider- and county/zip level covariates on insurance acceptance 1. County or zip code composition (population, % white, HRSA supply of MH professionals, income) 2. Provider-level covariates – years in practice, psychologist vs. therapist, telehealth, gender, # of locations (can use addresses to identify multi-provider groups e.g. count number of providers at same location), whether providers themselves practice in more than 1 location


Preparing the data

data_acs_inc <- data_acs[which(data_acs$excluded==0),]

data_acs_inc <- data_acs_inc %>%
  group_by(GEOID) %>%
  mutate(provider_1k = (n() / county_population)*1000) %>%
  ungroup()

# these are not at the "county level", they are at the provider level
# comparing obvs in the dataset against each other rather than what county they are in

data_acs_inc <- data_acs_inc %>%
  mutate(supply_med = if_else(provider_1k>median(provider_1k), 1, 0)
         )

data_acs_inc <- data_acs_inc %>%
   mutate(income_med = if_else(median_income>median(median_income), 1, 0)
         )

data_acs_inc <- data_acs_inc %>%
   mutate(frac_white_med = if_else(fraction_white>median(fraction_white), 1, 0)
         )


# var manipulation
skim(data_acs_inc$provider_1k)
Data summary
Name data_acs_inc$provider_1k
Number of rows 175083
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 0.94 0.65 0 0.52 0.76 1.23 10.49 ▇▁▁▁▁
skim(data_acs_inc$median_income)
Data summary
Name data_acs_inc$median_incom…
Number of rows 175083
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 79975.52 19788.68 24349 65554.5 76367 92600 156821 ▁▇▅▂▁
skim(data_acs_inc$fraction_white)
Data summary
Name data_acs_inc$fraction_whi…
Number of rows 175083
Number of columns 1
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
data 0 1 0.65 0.15 0.1 0.53 0.66 0.77 0.98 ▁▂▆▇▃
#binary variable for provider supply above or below the median
Desc(data_acs_inc$supply_med)
## ------------------------------------------------------------------------------ 
## data_acs_inc$supply_med (numeric)
## 
##    length        n    NAs  unique      0s  mean  meanCI'
##   175'083  175'083      0       2  87'734  0.50    0.50
##             100.0%   0.0%           50.1%          0.50
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      1.00     0.50   1.00    0.00    1.00  0.00   -2.00
##                                                        
## 
##    value    freq   perc  cumfreq  cumperc
## 1      0  87'734  50.1%   87'734    50.1%
## 2      1  87'349  49.9%  175'083   100.0%
## 
## ' 95%-CI (classic)

Desc(data_acs_inc$income_med)
## ------------------------------------------------------------------------------ 
## data_acs_inc$income_med (numeric)
## 
##    length        n    NAs  unique      0s  mean  meanCI'
##   175'083  175'083      0       2  92'689  0.47    0.47
##             100.0%   0.0%           52.9%          0.47
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      1.00     0.50   1.06    0.00    1.00  0.12   -1.99
##                                                        
## 
##    value    freq   perc  cumfreq  cumperc
## 1      0  92'689  52.9%   92'689    52.9%
## 2      1  82'394  47.1%  175'083   100.0%
## 
## ' 95%-CI (classic)

Desc(data_acs_inc$frac_white_med)
## ------------------------------------------------------------------------------ 
## data_acs_inc$frac_white_med (numeric)
## 
##    length        n    NAs  unique      0s  mean  meanCI'
##   175'083  175'083      0       2  87'597  0.50    0.50
##             100.0%   0.0%           50.0%          0.50
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      1.00     0.50   1.00    0.00    1.00  0.00   -2.00
##                                                        
## 
##    value    freq   perc  cumfreq  cumperc
## 1      0  87'597  50.0%   87'597    50.0%
## 2      1  87'486  50.0%  175'083   100.0%
## 
## ' 95%-CI (classic)

This probably is not needed anymore

#making tertiles
data_acs_inc <- data_acs_inc %>%
  mutate(supply_tert = ntile(provider_1k, 3))

data_acs_inc <- data_acs_inc %>%
  mutate(supply_tert = case_when(supply_tert==1 ~ "low",
                            supply_tert==2 ~ "mid",
                            supply_tert==3 ~ "high")
         )

# Think about keeping supply as a continuous variable
# look at distribution of continuous supply var
data_acs_inc$supply_tert <- as.factor(data_acs_inc$supply_tert)
data_acs_inc <- data_acs_inc %>%
  mutate(provider_type = case_when(provider_type=="Masters level therapists and counselors" ~ "masters",
                            provider_type=="All other therapists and counselors" ~ "counselors",
                            provider_type=="Psychologists/PhD" ~ "psych",
                            provider_type=="Advanced practice providers" ~ "advanced")
         )

data_acs_inc$provs <- as.factor(data_acs_inc$provider_type)
data_acs_inc$insurance_flag <- as.factor(data_acs_inc$insurance_flag)
data_acs_inc$any_child <- as.factor(data_acs_inc$any_child)
data_acs_inc$region <- as.factor(data_acs_inc$region)
data_acs_inc$group_pr <- as.factor(data_acs_inc$group_pr)
data_acs_inc$telehealth <- as.factor(data_acs_inc$telehealth)
data_acs_inc$rurality <- as.factor(data_acs_inc$rurality)
data_acs_inc$supply_med <- as.factor(data_acs_inc$supply_med)
data_acs_inc$income_med <- as.factor(data_acs_inc$income_med)
data_acs_inc$frac_white_med <- as.factor(data_acs_inc$frac_white_med)

#i need to adjust the names in the provider_type variable so it doesn't mess with 
# our output

Adjustments after meeting with Jane on 4/23/24

indicator var: rural versus not (micropolitan and non-core versus other categories) then relook at supply as continuous or above median and below median (dichotimous var). possibly then also a possible interaction term between rurality and supply. for median income: either dichotimous var or percentiles (non-continuous). also looking at fraction white in percentiles

Exploring distrobution of provider supply (continuous)

#provider supply

data_acs_inc %>%
  group_by(insurance_flag) %>%
 skim(provider_1k)
Data summary
Name Piped data
Number of rows 175083
Number of columns 72
_______________________
Column type frequency:
numeric 1
________________________
Group variables insurance_flag

Variable type: numeric

skim_variable insurance_flag n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
provider_1k 0 0 1 0.99 0.74 0 0.53 0.76 1.3 10.49 ▇▁▁▁▁
provider_1k 1 0 1 0.91 0.59 0 0.51 0.75 1.2 10.49 ▇▁▁▁▁
#providers that do accept insurance
p <- data_acs_inc %>%
  ggplot( aes(x=provider_1k, fill= insurance_flag)) +
    geom_histogram(alpha=0.9, color="white") +
    ggtitle("Distrobution of provider supply and insurance acceptance") +
    theme(
      plot.title = element_text(size=10)
    )

#rurality
p2 <- data_acs_inc %>%
  ggplot( aes(x=provider_1k, fill=rurality)) +
    geom_histogram(alpha=0.9, color="white") +
  ggtitle("Distrobution of provider supply in rural and urban areas") +
    theme(
      plot.title = element_text(size=10)
    )

p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exploring distrobution of median county income (continuous)

#median income

data_acs_inc %>%
  group_by(insurance_flag) %>%
 skim(median_income)
Data summary
Name Piped data
Number of rows 175083
Number of columns 72
_______________________
Column type frequency:
numeric 1
________________________
Group variables insurance_flag

Variable type: numeric

skim_variable insurance_flag n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
median_income 0 0 1 82747.92 21062.36 26395 66047 77635 93956 156821 ▁▇▅▂▁
median_income 1 0 1 78477.32 18896.55 24349 64522 75485 88517 156821 ▁▇▅▂▁
#providers that do accept insurance
p <- data_acs_inc %>%
  ggplot( aes(x=median_income, fill=insurance_flag)) +
    geom_histogram(color="white", alpha=0.9) +
  ggtitle("Distrobution of median income by insurance acceptance") +
    theme(
      plot.title = element_text(size=10)
    )

#providers that do accept insurance
p2 <- data_acs_inc %>%
  ggplot( aes(x=fraction_white, fill=insurance_flag)) +
    geom_histogram(color="white", alpha=0.9) +
  ggtitle("Distrobution of fraction white county pop by insurance acceptance") +
    theme(
      plot.title = element_text(size=10)
    )
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Exploring the distrobutions

  • The psych_forest dataset has all of our variables of interest, it will be used for the random forest in later code but I will use it now as well to easily explore the distributions of the variables of interest
psych_forest <- as.data.frame(select(data_acs_inc, c("insurance_flag","provs","any_child","group_pr", "rurality", "supply_med", "income_med", "frac_white_med", "median_income", "provider_1k","fraction_white")))

#remove frac no health insurance
#rural urban versus region (take out region and include rural urban on its own)
# then we can see 

Desc(psych_forest[1:8])
## ------------------------------------------------------------------------------ 
## Describe psych_forest[1:8] (data.frame):
## 
## data frame:  175083 obs. of  8 variables
##      175083 complete cases (100.0%)
## 
##   Nr  ColName         Class   NAs  Levels                                   
##   1   insurance_flag  factor  .    (2): 1-0, 2-1                            
##   2   provs           factor  .    (4): 1-advanced, 2-counselors, 3-masters,
##                                    4-psych                                  
##   3   any_child       factor  .    (2): 1-0, 2-1                            
##   4   group_pr        factor  .    (2): 1-0, 2-1                            
##   5   rurality        factor  .    (2): 1-Rural, 2-Urban                    
##   6   supply_med      factor  .    (2): 1-0, 2-1                            
##   7   income_med      factor  .    (2): 1-0, 2-1                            
##   8   frac_white_med  factor  .    (2): 1-0, 2-1                            
## 
## 
## ------------------------------------------------------------------------------ 
## 1 - insurance_flag (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0   61'422  35.1%   34.9%   35.3%
## 1  113'661  64.9%   64.7%   65.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 2 - provs (factor)
## 
##    length       n     NAs  unique  levels   dupes
##   175'083 175'083       0       4       4       y
##            100.0%    0.0%                        
## 
##         level    freq   perc  cumfreq  cumperc
## 1     masters  80'573  46.0%   80'573    46.0%
## 2  counselors  65'478  37.4%  146'051    83.4%
## 3       psych  26'489  15.1%  172'540    98.5%
## 4    advanced   2'543   1.5%  175'083   100.0%

## ------------------------------------------------------------------------------ 
## 3 - any_child (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0   61'038  34.9%   34.6%   35.1%
## 1  114'045  65.1%   64.9%   65.4%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 4 - group_pr (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0  104'650  59.8%   59.5%   60.0%
## 1   70'433  40.2%   40.0%   40.5%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 5 - rurality (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##           freq   perc  lci.95  uci.95'
## Rural    7'010   4.0%    3.9%    4.1%
## Urban  168'073  96.0%   95.9%   96.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 6 - supply_med (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## 0  87'734  50.1%   49.9%   50.3%
## 1  87'349  49.9%   49.7%   50.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 7 - income_med (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## 0  92'689  52.9%   52.7%   53.2%
## 1  82'394  47.1%   46.8%   47.3%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 8 - frac_white_med (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## 0  87'597  50.0%   49.8%   50.3%
## 1  87'486  50.0%   49.7%   50.2%
## 
## ' 95%-CI (Wilson)

skim(psych_forest[9:11])
Data summary
Name psych_forest[9:11]
Number of rows 175083
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
median_income 0 1 79975.52 19788.68 24349.0 65554.50 76367.00 92600.00 156821.00 ▁▇▅▂▁
provider_1k 0 1 0.94 0.65 0.0 0.52 0.76 1.23 10.49 ▇▁▁▁▁
fraction_white 0 1 0.65 0.15 0.1 0.53 0.66 0.77 0.98 ▁▂▆▇▃

Exploration of possible rurality*supply interaction term

xtabs(~ rurality + supply_med, data = psych_forest)
##         supply_med
## rurality     0     1
##    Rural  4859  2151
##    Urban 82875 85198

Univariate models

#this is currently comparing against advanced practice providers
#advanced practice providers are the smallest group, which may not lead to a great 
#reference category if cells are small. I will change this to be masters level
#providers for now since that is the largest group
data_acs_inc$provs <- contr_code_treatment(data_acs_inc$provs, 
                                                   base = "masters")
uni_prov <- glm(insurance_flag ~ provs, data = data_acs_inc, 
                family = "binomial")
summary(uni_prov) 
## 
## Call:
## glm(formula = insurance_flag ~ provs, family = "binomial", data = data_acs_inc)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.50325    0.00727  69.222   <2e-16 ***
## provs.advanced-masters   -0.38909    0.04038  -9.635   <2e-16 ***
## provs.counselors-masters  0.44135    0.01134  38.917   <2e-16 ***
## provs.psych-masters      -0.24335    0.01437 -16.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 224196  on 175079  degrees of freedom
## AIC: 224204
## 
## Number of Fisher Scoring iterations: 4
uni_prov_con <- glm(insurance_flag ~ provider_1k, data = data_acs_inc, 
                family = "binomial")
summary(uni_prov_con) 
## 
## Call:
## glm(formula = insurance_flag ~ provider_1k, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.787316   0.008931   88.16   <2e-16 ***
## provider_1k -0.181152   0.007736  -23.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226327  on 175081  degrees of freedom
## AIC: 226331
## 
## Number of Fisher Scoring iterations: 4
#some other primary independent var models
uni_child <- glm(insurance_flag ~ any_child, data = data_acs_inc, 
                 family = "binomial")
summary(uni_child)
## 
## Call:
## glm(formula = insurance_flag ~ any_child, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.509375   0.008359   60.94   <2e-16 ***
## any_child1  0.164224   0.010444   15.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226644  on 175081  degrees of freedom
## AIC: 226648
## 
## Number of Fisher Scoring iterations: 4
uni_rural <- glm(insurance_flag ~ rurality, data = data_acs_inc, 
                  family = "binomial")
summary(uni_rural)
## 
## Call:
## glm(formula = insurance_flag ~ rurality, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.01039    0.02700   37.42   <2e-16 ***
## ruralityUrban -0.41027    0.02748  -14.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226655  on 175081  degrees of freedom
## AIC: 226659
## 
## Number of Fisher Scoring iterations: 4
uni_income <- glm(insurance_flag ~ income_med, data = data_acs_inc, 
                  family = "binomial")
summary(uni_income)
## 
## Call:
## glm(formula = insurance_flag ~ income_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.749584   0.007036  106.53   <2e-16 ***
## income_med1 -0.278914   0.010040  -27.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226117  on 175081  degrees of freedom
## AIC: 226121
## 
## Number of Fisher Scoring iterations: 4
uni_group <- glm(insurance_flag ~ group_pr, data = data_acs_inc, 
                 family = "binomial")
summary(uni_group)
## 
## Call:
## glm(formula = insurance_flag ~ group_pr, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.524293   0.006396   81.97   <2e-16 ***
## group_pr1   0.231474   0.010306   22.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226382  on 175081  degrees of freedom
## AIC: 226386
## 
## Number of Fisher Scoring iterations: 4
uni_prov1k <- glm(insurance_flag ~ supply_med, data = data_acs_inc, 
                  family = "binomial")
summary(uni_prov1k)
## 
## Call:
## glm(formula = insurance_flag ~ supply_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.656524   0.007119  92.218  < 2e-16 ***
## supply_med1 -0.081823   0.010018  -8.167 3.15e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226824  on 175081  degrees of freedom
## AIC: 226828
## 
## Number of Fisher Scoring iterations: 4
uni_frac_wh <- glm(insurance_flag ~ frac_white_med, data = data_acs_inc, 
                   family = "binomial")
summary(uni_frac_wh)
## 
## Call:
## glm(formula = insurance_flag ~ frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.465865   0.006942   67.11   <2e-16 ***
## frac_white_med1 0.306381   0.010053   30.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 225958  on 175081  degrees of freedom
## AIC: 225962
## 
## Number of Fisher Scoring iterations: 4

Full model with all vars


Checking the binary variable for provider supply

model1 <- glm(insurance_flag ~ supply_med + provs + any_child 
              + rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = insurance_flag ~ supply_med + provs + any_child + 
##     rurality + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.494212   0.030202  16.364  < 2e-16 ***
## supply_med1              -0.009605   0.010769  -0.892    0.372    
## provs.advanced-masters   -0.400754   0.040696  -9.848  < 2e-16 ***
## provs.counselors-masters  0.461996   0.011450  40.348  < 2e-16 ***
## provs.psych-masters      -0.193292   0.014536 -13.297  < 2e-16 ***
## any_child1                0.139653   0.010696  13.057  < 2e-16 ***
## ruralityUrban            -0.194344   0.028460  -6.829 8.57e-12 ***
## income_med1              -0.281956   0.010880 -25.914  < 2e-16 ***
## group_pr1                 0.214758   0.010564  20.328  < 2e-16 ***
## frac_white_med1           0.300357   0.010387  28.916  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221764  on 175073  degrees of freedom
## AIC: 221784
## 
## Number of Fisher Scoring iterations: 4
model2 <- glm(insurance_flag ~ provs + any_child 
              + rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model2)
## 
## Call:
## glm(formula = insurance_flag ~ provs + any_child + rurality + 
##     income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.49156    0.03005  16.356  < 2e-16 ***
## provs.advanced-masters   -0.40106    0.04069  -9.856  < 2e-16 ***
## provs.counselors-masters  0.46156    0.01144  40.348  < 2e-16 ***
## provs.psych-masters      -0.19360    0.01453 -13.322  < 2e-16 ***
## any_child1                0.14008    0.01069  13.109  < 2e-16 ***
## ruralityUrban            -0.19494    0.02845  -6.852  7.3e-12 ***
## income_med1              -0.28496    0.01035 -27.542  < 2e-16 ***
## group_pr1                 0.21406    0.01054  20.318  < 2e-16 ***
## frac_white_med1           0.30049    0.01039  28.931  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221765  on 175074  degrees of freedom
## AIC: 221783
## 
## Number of Fisher Scoring iterations: 4
anova(model2, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221765.1 NA NA NA
175073 221764.3 1 0.7955322 0.3724325
  • There is no evidence that a binary supply variable is a better fit than no supply variable. I will choose to use the continuous version of the variable

#centering the provider supply variable for ease of interpretation
mean(data_acs_inc$provider_1k)
## [1] 0.9398009
data_acs_inc$c_provider_1k <- data_acs_inc$provider_1k - mean(data_acs_inc$provider_1k)
#because of rounding errors in R, it makes it so that the mean isn't EXACTLY centered at 0 but extremely close

model1 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)
## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     rurality + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.433403   0.030281  14.312  < 2e-16 ***
## c_provider_1k            -0.132942   0.008291 -16.035  < 2e-16 ***
## provs.advanced-masters   -0.401879   0.040727  -9.868  < 2e-16 ***
## provs.counselors-masters  0.466670   0.011454  40.741  < 2e-16 ***
## provs.psych-masters      -0.185884   0.014553 -12.773  < 2e-16 ***
## any_child1                0.131055   0.010708  12.239  < 2e-16 ***
## ruralityUrban            -0.167692   0.028511  -5.882 4.06e-09 ***
## income_med1              -0.224697   0.011018 -20.394  < 2e-16 ***
## group_pr1                 0.226017   0.010571  21.382  < 2e-16 ***
## frac_white_med1           0.305407   0.010401  29.364  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221506  on 175073  degrees of freedom
## AIC: 221526
## 
## Number of Fisher Scoring iterations: 4
vif(model1)
##                    GVIF Df GVIF^(1/(2*Df))
## c_provider_1k  1.160592  1        1.077308
## provs          1.012826  3        1.002126
## any_child      1.020213  1        1.010056
## rurality       1.056585  1        1.027903
## income_med     1.172852  1        1.082983
## group_pr       1.023455  1        1.011659
## frac_white_med 1.043309  1        1.021425
anova(model1, test = 'Chisq')
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL NA NA 175082 226890.3 NA
c_provider_1k 1 562.8089 175081 226327.5 0
provs 3 2700.1497 175078 223627.3 0
any_child 1 252.1089 175077 223375.2 0
rurality 1 152.3336 175076 223222.9 0
income_med 1 355.7611 175075 222867.1 0
group_pr 1 494.8203 175074 222372.3 0
frac_white_med 1 866.0003 175073 221506.3 0

Step-wise deletion and comparison with full model

options(width = 60)
model2 <- glm(insurance_flag ~  provs + any_child 
              + rurality + income_med + group_pr + frac_white_med,
              data = data_acs_inc, family = "binomial")
#summary(model2)
anova(model2, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221765.1 NA NA NA
175073 221506.3 1 258.7809 0
model3 <- glm(insurance_flag ~ c_provider_1k  + any_child 
              + rurality + income_med + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model3)
anova(model3, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175076 224156.2 NA NA NA
175073 221506.3 3 2649.896 0
model4 <- glm(insurance_flag ~  c_provider_1k + provs 
              + rurality + income_med + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model4)
anova(model4, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221655.6 NA NA NA
175073 221506.3 1 149.348 0
model5 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + income_med + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model5)
anova(model5, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221541.6 NA NA NA
175073 221506.3 1 35.29536 0
model6 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model6)
anova(model6, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221922.0 NA NA NA
175073 221506.3 1 415.6647 0
model7 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + income_med + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model7)
anova(model7, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221966.2 NA NA NA
175073 221506.3 1 459.9373 0
model8 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + income_med + group_pr, 
              data = data_acs_inc, family = "binomial")
#summary(model8)
anova(model8, model1, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 222372.3 NA NA NA
175073 221506.3 1 866.0003 0

Full model (no interaction term)

summary(model1)
## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     rurality + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                           Estimate Std. Error z value
## (Intercept)               0.433403   0.030281  14.312
## c_provider_1k            -0.132942   0.008291 -16.035
## provs.advanced-masters   -0.401879   0.040727  -9.868
## provs.counselors-masters  0.466670   0.011454  40.741
## provs.psych-masters      -0.185884   0.014553 -12.773
## any_child1                0.131055   0.010708  12.239
## ruralityUrban            -0.167692   0.028511  -5.882
## income_med1              -0.224697   0.011018 -20.394
## group_pr1                 0.226017   0.010571  21.382
## frac_white_med1           0.305407   0.010401  29.364
##                          Pr(>|z|)    
## (Intercept)               < 2e-16 ***
## c_provider_1k             < 2e-16 ***
## provs.advanced-masters    < 2e-16 ***
## provs.counselors-masters  < 2e-16 ***
## provs.psych-masters       < 2e-16 ***
## any_child1                < 2e-16 ***
## ruralityUrban            4.06e-09 ***
## income_med1               < 2e-16 ***
## group_pr1                 < 2e-16 ***
## frac_white_med1           < 2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221506  on 175073  degrees of freedom
## AIC: 221526
## 
## Number of Fisher Scoring iterations: 4

Odds ratios

round(
  exp(
    cbind(OR = coef(model1), confint(model1))
    ), 
  2)
## Waiting for profiling to be done...
##                            OR 2.5 % 97.5 %
## (Intercept)              1.54  1.45   1.64
## c_provider_1k            0.88  0.86   0.89
## provs.advanced-masters   0.67  0.62   0.72
## provs.counselors-masters 1.59  1.56   1.63
## provs.psych-masters      0.83  0.81   0.85
## any_child1               1.14  1.12   1.16
## ruralityUrban            0.85  0.80   0.89
## income_med1              0.80  0.78   0.82
## group_pr1                1.25  1.23   1.28
## frac_white_med1          1.36  1.33   1.39

Exploring interaction term between rurality and supply

Looking at stratified models (rural vs urban)

data_acs_rural <- subset(data_acs_inc, rurality=="Rural")
data_acs_urban <- subset(data_acs_inc, rurality=="Urban")

Rural

modelrural <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
               + income_med + group_pr + frac_white_med, data = data_acs_rural, family = "binomial")
summary(modelrural)
## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_rural)
## 
## Coefficients:
##                          Estimate Std. Error z value
## (Intercept)               1.08052    0.11032   9.795
## c_provider_1k             0.14980    0.05714   2.622
## provs.advanced-masters   -0.60596    0.21420  -2.829
## provs.counselors-masters  0.17028    0.06075   2.803
## provs.psych-masters      -0.48604    0.08992  -5.405
## any_child1                0.42007    0.05786   7.260
## income_med1              -0.44180    0.08523  -5.184
## group_pr1                 0.66626    0.06838   9.743
## frac_white_med1          -0.45054    0.09740  -4.626
##                          Pr(>|z|)    
## (Intercept)               < 2e-16 ***
## c_provider_1k             0.00875 ** 
## provs.advanced-masters    0.00467 ** 
## provs.counselors-masters  0.00506 ** 
## provs.psych-masters      6.48e-08 ***
## any_child1               3.88e-13 ***
## income_med1              2.17e-07 ***
## group_pr1                 < 2e-16 ***
## frac_white_med1          3.73e-06 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8133.8  on 7009  degrees of freedom
## Residual deviance: 7838.2  on 7001  degrees of freedom
## AIC: 7856.2
## 
## Number of Fisher Scoring iterations: 4
round(
  exp(
    cbind(OR = coef(modelrural), confint(modelrural))
    ), 
  2)
## Waiting for profiling to be done...
##                            OR 2.5 % 97.5 %
## (Intercept)              2.95  2.38   3.67
## c_provider_1k            1.16  1.04   1.30
## provs.advanced-masters   0.55  0.36   0.83
## provs.counselors-masters 1.19  1.05   1.34
## provs.psych-masters      0.62  0.52   0.73
## any_child1               1.52  1.36   1.70
## income_med1              0.64  0.54   0.76
## group_pr1                1.95  1.70   2.23
## frac_white_med1          0.64  0.53   0.77
#the reason that there might be some stronger associations than excepted could be slightly less power 
# however, the cell sizes aren't horribly small
# this should most likely be interpreted carefully, as it is comparing to the median of the entire sample (urban and rural) and not rural alone--meaning that the interpretation of the median variables (income and frac white) may be unrepresentative in the stratified models
xtabs(~ insurance_flag + frac_white_med, data = data_acs_rural)
##               frac_white_med
## insurance_flag    0    1
##              0  154 1717
##              1  620 4519
#this one is arguably worse
xtabs(~ insurance_flag + provs, data = data_acs_rural)
##               provs
## insurance_flag advanced counselors masters psych
##              0       40        620     958   253
##              1       56       1976    2685   422

Urban

modelurban <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
               + income_med + group_pr + frac_white_med, data = data_acs_urban, family = "binomial")
summary(modelurban)
## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_urban)
## 
## Coefficients:
##                           Estimate Std. Error z value
## (Intercept)               0.264493   0.012969  20.394
## c_provider_1k            -0.140263   0.008412 -16.674
## provs.advanced-masters   -0.390371   0.041492  -9.408
## provs.counselors-masters  0.479358   0.011671  41.073
## provs.psych-masters      -0.174075   0.014754 -11.799
## any_child1                0.120237   0.010899  11.032
## income_med1              -0.220380   0.011121 -19.817
## group_pr1                 0.213119   0.010720  19.880
## frac_white_med1           0.315798   0.010485  30.120
##                          Pr(>|z|)    
## (Intercept)                <2e-16 ***
## c_provider_1k              <2e-16 ***
## provs.advanced-masters     <2e-16 ***
## provs.counselors-masters   <2e-16 ***
## provs.psych-masters        <2e-16 ***
## any_child1                 <2e-16 ***
## income_med1                <2e-16 ***
## group_pr1                  <2e-16 ***
## frac_white_med1            <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 218521  on 168072  degrees of freedom
## Residual deviance: 213453  on 168064  degrees of freedom
## AIC: 213471
## 
## Number of Fisher Scoring iterations: 4
round(
  exp(
    cbind(OR = coef(modelurban), confint(modelurban))
    ), 
  2)
## Waiting for profiling to be done...
##                            OR 2.5 % 97.5 %
## (Intercept)              1.30  1.27   1.34
## c_provider_1k            0.87  0.85   0.88
## provs.advanced-masters   0.68  0.62   0.73
## provs.counselors-masters 1.62  1.58   1.65
## provs.psych-masters      0.84  0.82   0.86
## any_child1               1.13  1.10   1.15
## income_med1              0.80  0.78   0.82
## group_pr1                1.24  1.21   1.26
## frac_white_med1          1.37  1.34   1.40

Stratification results

  • The stratified results are interesting, although I believe continuing with the stratified models is inappropriate as I do not believe that stratification appropriately handles all of the predictors. I think that interpreting a model with an interaction term is more appropriate.

Interaction model (not stratified)

Interaction between: rurality (binary) x provider supply per 1000 county population (continuous)

model1int <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child 
               + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1int)
## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k * rurality + provs + 
##     any_child + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                             Estimate Std. Error z value
## (Intercept)                  0.51841    0.03617  14.333
## c_provider_1k                0.10261    0.05322   1.928
## ruralityUrban               -0.25340    0.03477  -7.287
## provs.advanced-masters      -0.39971    0.04073  -9.815
## provs.counselors-masters     0.46780    0.01146  40.828
## provs.psych-masters         -0.18487    0.01455 -12.701
## any_child1                   0.13126    0.01071  12.257
## income_med1                 -0.22313    0.01103 -20.232
## group_pr1                    0.22528    0.01057  21.306
## frac_white_med1              0.30466    0.01040  29.292
## c_provider_1k:ruralityUrban -0.24124    0.05379  -4.485
##                             Pr(>|z|)    
## (Intercept)                  < 2e-16 ***
## c_provider_1k                 0.0539 .  
## ruralityUrban               3.18e-13 ***
## provs.advanced-masters       < 2e-16 ***
## provs.counselors-masters     < 2e-16 ***
## provs.psych-masters          < 2e-16 ***
## any_child1                   < 2e-16 ***
## income_med1                  < 2e-16 ***
## group_pr1                    < 2e-16 ***
## frac_white_med1              < 2e-16 ***
## c_provider_1k:ruralityUrban 7.28e-06 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221486  on 175072  degrees of freedom
## AIC: 221508
## 
## Number of Fisher Scoring iterations: 4
#testing the interaction model against a reduced model without main effect and interaction term
mod_reduced <- glm(insurance_flag ~ rurality + provs + any_child + income_med + group_pr 
                   + frac_white_med, data = data_acs_inc, family = "binomial")

anova(mod_reduced, model1int, test='LR')
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
175074 221765.1 NA NA NA
175072 221485.6 2 279.4293 0
exp(0.51841 - 0.25340)
## [1] 1.303444
  • Interaction term is meaningful

Exponentiated interaction term model: ORs

mod.ORs <- round(
  exp(
    cbind(OR = coef(model1int), confint(model1int))
    ), 
  2)
## Waiting for profiling to be done...
mod.ORs
##                               OR 2.5 % 97.5 %
## (Intercept)                 1.68  1.57   1.80
## c_provider_1k               1.11  1.00   1.23
## ruralityUrban               0.78  0.72   0.83
## provs.advanced-masters      0.67  0.62   0.73
## provs.counselors-masters    1.60  1.56   1.63
## provs.psych-masters         0.83  0.81   0.86
## any_child1                  1.14  1.12   1.16
## income_med1                 0.80  0.78   0.82
## group_pr1                   1.25  1.23   1.28
## frac_white_med1             1.36  1.33   1.38
## c_provider_1k:ruralityUrban 0.79  0.71   0.87

General interpretation of full model

Baseline interpretations: * At the (sample) mean level of provider supply per 1000 county residents (0.94 providers per 1000 county population), the odds of a provider in a rural area accepting insurance is 1.68 (95% CI 1.57, 1.80), when controlling for all other factors * At the (sample) mean level of provider supply per 1000 county residents (0.94 providers per 1000 county population), the odds of a provider in an urban area accepting insurance is 1.30 (95% CI 1.57, 1.80), when controlling for all other factors

General trends: * Psychologists and advanced practice providers have lower odds of accepting insurance compared to masters level counselors, while therapists.social workers, etc are more likely to accept insurance compared to masters levels therapists * Providers in group practices, that see children, and are in a county with a higher fraction of white population (compared to the rest of the sample) have higher odds of accepting insurance * Providers in counties with higher than the sample median of household income have lower odds of accepting insurance


Contrasts for interaction

est.rural <- gmodels::estimable(model1int,
        c("c_provider_1k" = 1),
        conf.int = 0.95)
## Registered S3 method overwritten by 'gdata':
##   method         from     
##   reorder.factor DescTools
est.urban <- gmodels::estimable(model1int,
        c("c_provider_1k" = 1,
          "c_provider_1k:ruralityUrban" = 1),
        conf.int = 0.95)

rbind(exp(est.rural[c("Estimate", "Lower.CI", "Upper.CI")]),
      exp(est.urban[c("Estimate", "Lower.CI", "Upper.CI")]))
Estimate Lower.CI Upper.CI
(0 1 0 0 0 0 0 0 0 0 0) 1.1080540 0.9970241 1.2314484
(0 1 0 0 0 0 0 0 0 0 1) 0.8705436 0.8561578 0.8851711

Interpretation of interaction terms

  • The first row shows the estimate for rural and the second row shows the estimate for urban

For Rural: For every unit increase from the mean of provider supply per 1000 county population, providers in rural areas have an 11% increase in odds of accepting insurance (95% CI: 1.00, 1.23). - The mean of providers per 1000 county population is 0.94 providers per 1000 county population

For Urban: For every unit increase from the mean of provider supply per 1000 county population, providers in Urban areas have an 13% decrease in odds of accepting insurance (95% CI: 0.86, 0.89).


Visualizing interaction

library(sjPlot)
## Install package "strengejacke" from GitHub (`devtools::install_github("strengejacke/strengejacke")`) to load all sj-packages at once!
#non-interaction
plot_model(model1,
           type = "eff",
           terms = c("c_provider_1k", "rurality"),
           title = "No interaction",
           axis.title = c("Provider Supply (per 1000 county population)", 
                          "P(Insurance Acceptance)"),
           legend.title = "Rurality")

# Effect of c_provider_1k at each level of rurality
plot_model(model1int,
           type = "eff",
           terms = c("c_provider_1k", "rurality"),
           title = "Interaction between provider supply and rurality",
           axis.title = c("Provider Supply (per 1000 county population)", 
                          "P(Insurance Acceptance)"),
           legend.title = "Rurality")


Testing the accuracy of this model by training then testing

# Split data into train and test
set.seed(430)

split <- initial_split(data_acs_inc, prop = 0.8, strata = insurance_flag)
train <- split %>% 
         training()
test <- split %>% 
        testing()
mod_train <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child 
               + income_med + group_pr + frac_white_med, 
               data = train, family = "binomial")

glm_probs = data.frame(probs = predict(mod_train, 
                                       newdata = test, 
                                       type="response"))
## Warning: contrasts dropped from factor provs
glm_pred = glm_probs %>%
  mutate(pred = ifelse(probs>.5, "accept", "does not accept"))

glm_pred = cbind(test, glm_pred)

glm_pred %>% 
  count(pred, insurance_flag) %>%
  spread(insurance_flag, n, fill = 0)
pred 0 1
accept 11526 22009
does not accept 759 724
pr <- prediction(glm_probs,test$insurance_flag)
perf <- performance(pr,measure = "tpr",x.measure = "fpr") 
plot(perf)  

auc(test$insurance_flag,glm_probs) 
## Warning in auc(test$insurance_flag, glm_probs): longer
## object length is not a multiple of shorter object length
## [1] 0.6024096

Random forest for variable importance

  • This is where we are using the psych_forest dataset
set.seed(421)

split <- initial_split(psych_forest, prop = 0.7, strata = insurance_flag)
train <- split %>% 
         training()
test <- split %>% 
        testing()

Training

model <- randomForest(y=train[,1], x=train[,c(2:5, 7:8, 10)], data = train, ntree = 500, 
                      mtry = 5, importance = TRUE , proximity = FALSE)

model
## 
## Call:
##  randomForest(x = train[, c(2:5, 7:8, 10)], y = train[, 1], ntree = 500,      mtry = 5, importance = TRUE, proximity = FALSE, data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 31.79%
## Confusion matrix:
##       0     1 class.error
## 0 13811 29184   0.6787766
## 1  9778 69784   0.1228979

The next step is to validate our model using the test data

set.seed(421)
# The next step is to validate our model using the test data

prediction <- predict(model, newdata = test)

table(prediction, test$insurance_flag) 
##           
## prediction     0     1
##          0  5958  4143
##          1 12469 29956
results<-cbind(prediction,test$insurance_flag)

colnames(results)<-c('pred','real')

results<-as.data.frame(results)

# Finally, let’s calculate the accuracy of the model

sum(prediction==test$insurance_flag) / nrow(test)
## [1] 0.6837376

Variable importance

  • The first column shows the predictor variables used in the RF model, the following columns shows how much each of those variables contribute to predicting absences (0), presences (1) or both (MeanDecreaseAccuracy).
  • The last column shows the node impurity of the model.
varImpPlot(model, sort=T, n.var= 7, 
           main= "Health insurance acceptance: yes or no", pch=16)

#another way to view this
imp=as.data.frame(importance(model)) #Analyze the importance of each variable in the model
imp=cbind(vars=rownames(imp),imp)
print(imp)
##                          vars         0         1
## provs                   provs 162.63527 125.04779
## any_child           any_child  41.14348 100.36086
## group_pr             group_pr  56.62383 113.33937
## rurality             rurality  34.71668  19.52115
## income_med         income_med 102.98961  52.83670
## frac_white_med frac_white_med 154.88514  73.26973
## provider_1k       provider_1k 202.35714 101.38273
##                MeanDecreaseAccuracy MeanDecreaseGini
## provs                      244.6241        1275.8302
## any_child                  121.7993         466.0200
## group_pr                   148.5111         415.3831
## rurality                    56.6692         165.5525
## income_med                 147.2510         386.2581
## frac_white_med             206.1214         393.0726
## provider_1k                277.9238        6586.7761
barplot(imp$MeanDecreaseAccuracy, names.arg=imp$vars)

Examples of variable relationship with outcome and importance

  • Partial dependence plots
par(mfrow = c(2, 2))

partialPlot(model, test, provider_1k, "1", xlab="Providers per 1000 county population", 
            ylab="Log of fraction of votes", lwd=4, col="green")

partialPlot(x=model, pred.data=test, x.var=any_child, which.class="1")
partialPlot(x=model, pred.data=test, x.var=rurality, which.class="1")
partialPlot(x=model, pred.data=test, x.var=provs, which.class="1")

par(mfrow = c(2, 2))
partialPlot(x=model, pred.data=test, x.var=income_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=frac_white_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=group_pr, which.class="1")


Session Info