Context

This code is designed to explore the factors associated with insurance acceptance among mental health providers (data from psychology today).

The following steps were taken:
- Exploratory univariate analysis
- Building the full logistic regression model
- Exploring provider supply variables–binary vs continuous
- Comparing reduced models to full model
- Exploring the interaction between provider supply and rurality
- Exploring a stratified model (which was for visual purposes, stratified model was not ultimately chosen as the correct choice)
- Running full model through a training and testing sample, then exploring the accuracy of the model
- Running the model through a random forest to better understand the importance of the variables on insurance acceptance among mental health providers

Code book for variables in full model:

c_provider_1k: the supply of all types of mental health providers per 1000k county residents, centered around its mean
- Mean: 0.94
provs: provider type in categories.
- Levels: “Masters level therapists and counselors”, “All other therapists and counselors”, “Psychologists/PhD”, “Advanced practice providers”
- Level names were changed to be better on output
any_child: does a provider also see clients under 18?
- Binary
Rurality: Binary–urban or rural.
- Based on Categories of the 2013 and 2006 NCHS Urban-Rural Classification Scheme for Counties.
- Metro distinctions 1-4 = urban
- metro distinctions 5-6 = rural (micropoliton and non-core)
income_med: Binary
- whether provider is in a county with a median household income that is above or below the median of the sample.
- Data from ACS 5-year survey
group_pr: is the provider part of a group practice?
- Determined by grouping based on exact addresses.
- Binary
frac_white_med: binary
- Data from 5-year ACS
- Binary, whether provider is in a county with a fraction of county population that is white that is above or below the median of the sample

data_acs <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/working data acs 4-5-24.csv")

data_acs <- data_acs[which(!is.na(data_acs$id)), ] #just in case it reuploads weirdly from csv
#Desc(data_acs$id)
#length of df should be 184'957
# unique IDs should be 167'854

Load Required Packages

Explore urbanicity

This is based on https://www.cdc.gov/nchs/data/data_acces_files/NCHSUrbruralFileDocumentationInternet2.pdf
quick example of potential interaction between urbanicity and region: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4219594/

metros <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/NCHSURCodes2013.csv")
data_acs <- left_join(data_acs, metros, by = "fips")

data_acs <- data_acs %>%
  mutate(rurality = case_when(metro_codes==1 | metro_codes==2 | metro_codes==3 | metro_codes==4 ~ "Metro",
                            metro_codes==5 | metro_codes==6 ~ "Rural")
         )

Model building

Exhibit 4. Regression of provider- and county/zip level covariates on insurance acceptance 1. County or zip code composition (population, % white, HRSA supply of MH professionals, income) 2. Provider-level covariates – years in practice, psychologist vs. therapist, telehealth, gender, # of locations (can use addresses to identify multi-provider groups e.g. count number of providers at same location), whether providers themselves practice in more than 1 location

Preparing the data

data_acs_inc <- data_acs[which(data_acs$excluded==0),]

data_acs_inc <- data_acs_inc %>%
  group_by(GEOID) %>%
  mutate(provider_1k = (n() / county_population)*1000) %>%
  ungroup()

# these are not at the "county level", they are at the provider level
# comparing obvs in the dataset against each other rather than what county they are in

data_acs_inc <- data_acs_inc %>%
  mutate(supply_med = if_else(provider_1k>median(provider_1k), 1, 0)
         )

data_acs_inc <- data_acs_inc %>%
   mutate(income_med = if_else(median_income>median(median_income), 1, 0)
         )

data_acs_inc <- data_acs_inc %>%
   mutate(frac_white_med = if_else(fraction_white>median(fraction_white), 1, 0)
         )


# var manipulation
skim(data_acs_inc$provider_1k)

Data summary
Name	data_acs_inc$provider_1k
Number of rows	175083
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	0.94	0.65	0	0.52	0.76	1.23	10.49	▇▁▁▁▁

skim(data_acs_inc$median_income)

Data summary
Name	data_acs_inc$median_incom…
Number of rows	175083
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	79975.52	19788.68	24349	65554.5	76367	92600	156821	▁▇▅▂▁

skim(data_acs_inc$fraction_white)

Data summary
Name	data_acs_inc$fraction_whi…
Number of rows	175083
Number of columns	1
_______________________
Column type frequency:
numeric	1
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
data	0	1	0.65	0.15	0.1	0.53	0.66	0.77	0.98	▁▂▆▇▃

#binary variable for provider supply above or below the median
Desc(data_acs_inc$supply_med)

## ------------------------------------------------------------------------------ 
## data_acs_inc$supply_med (numeric)
## 
##    length        n    NAs  unique      0s  mean  meanCI'
##   175'083  175'083      0       2  87'734  0.50    0.50
##             100.0%   0.0%           50.1%          0.50
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      1.00     0.50   1.00    0.00    1.00  0.00   -2.00
##                                                        
## 
##    value    freq   perc  cumfreq  cumperc
## 1      0  87'734  50.1%   87'734    50.1%
## 2      1  87'349  49.9%  175'083   100.0%
## 
## ' 95%-CI (classic)

Desc(data_acs_inc$income_med)

## ------------------------------------------------------------------------------ 
## data_acs_inc$income_med (numeric)
## 
##    length        n    NAs  unique      0s  mean  meanCI'
##   175'083  175'083      0       2  92'689  0.47    0.47
##             100.0%   0.0%           52.9%          0.47
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      1.00     0.50   1.06    0.00    1.00  0.12   -1.99
##                                                        
## 
##    value    freq   perc  cumfreq  cumperc
## 1      0  92'689  52.9%   92'689    52.9%
## 2      1  82'394  47.1%  175'083   100.0%
## 
## ' 95%-CI (classic)

Desc(data_acs_inc$frac_white_med)

## ------------------------------------------------------------------------------ 
## data_acs_inc$frac_white_med (numeric)
## 
##    length        n    NAs  unique      0s  mean  meanCI'
##   175'083  175'083      0       2  87'597  0.50    0.50
##             100.0%   0.0%           50.0%          0.50
##                                                        
##       .05      .10    .25  median     .75   .90     .95
##      0.00     0.00   0.00    0.00    1.00  1.00    1.00
##                                                        
##     range       sd  vcoef     mad     IQR  skew    kurt
##      1.00     0.50   1.00    0.00    1.00  0.00   -2.00
##                                                        
## 
##    value    freq   perc  cumfreq  cumperc
## 1      0  87'597  50.0%   87'597    50.0%
## 2      1  87'486  50.0%  175'083   100.0%
## 
## ' 95%-CI (classic)

This probably is not needed anymore

#making tertiles
data_acs_inc <- data_acs_inc %>%
  mutate(supply_tert = ntile(provider_1k, 3))

data_acs_inc <- data_acs_inc %>%
  mutate(supply_tert = case_when(supply_tert==1 ~ "low",
                            supply_tert==2 ~ "mid",
                            supply_tert==3 ~ "high")
         )

# Think about keeping supply as a continuous variable
# look at distribution of continuous supply var
data_acs_inc$supply_tert <- as.factor(data_acs_inc$supply_tert)

data_acs_inc <- data_acs_inc %>%
  mutate(provider_type = case_when(provider_type=="Masters level therapists and counselors" ~ "masters",
                            provider_type=="All other therapists and counselors" ~ "counselors",
                            provider_type=="Psychologists/PhD" ~ "psych",
                            provider_type=="Advanced practice providers" ~ "advanced")
         )

data_acs_inc$provs <- as.factor(data_acs_inc$provider_type)
data_acs_inc$insurance_flag <- as.factor(data_acs_inc$insurance_flag)
data_acs_inc$any_child <- as.factor(data_acs_inc$any_child)
data_acs_inc$region <- as.factor(data_acs_inc$region)
data_acs_inc$group_pr <- as.factor(data_acs_inc$group_pr)
data_acs_inc$telehealth <- as.factor(data_acs_inc$telehealth)
data_acs_inc$rurality <- as.factor(data_acs_inc$rurality)
data_acs_inc$supply_med <- as.factor(data_acs_inc$supply_med)
data_acs_inc$income_med <- as.factor(data_acs_inc$income_med)
data_acs_inc$frac_white_med <- as.factor(data_acs_inc$frac_white_med)

#i need to adjust the names in the provider_type variable so it doesn't mess with 
# our output

Desc(data_acs_inc$rurality)

## ------------------------------------------------------------------------------ 
## data_acs_inc$rurality (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##           freq   perc  lci.95  uci.95'
## Metro  168'073  96.0%   95.9%   96.1%
## Rural    7'010   4.0%    3.9%    4.1%
## 
## ' 95%-CI (Wilson)

Adjustments after meeting with Jane on 4/23/24

indicator var: rural versus not (micropolitan and non-core versus other categories) then relook at supply as continuous or above median and below median (dichotimous var). possibly then also a possible interaction term between rurality and supply. for median income: either dichotimous var or percentiles (non-continuous). also looking at fraction white in percentiles

Exploring distrobution of provider supply (continuous)

#provider supply

data_acs_inc %>%
  group_by(insurance_flag) %>%
 skim(provider_1k)

Data summary
Name	Piped data
Number of rows	175083
Number of columns	72
_______________________
Column type frequency:
numeric	1
________________________
Group variables	insurance_flag

Variable type: numeric

skim_variable	insurance_flag	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
provider_1k	0	0	1	0.99	0.74	0	0.53	0.76	1.3	10.49	▇▁▁▁▁
provider_1k	1	0	1	0.91	0.59	0	0.51	0.75	1.2	10.49	▇▁▁▁▁

#providers that do accept insurance
p <- data_acs_inc %>%
  ggplot( aes(x=provider_1k, fill= insurance_flag)) +
    geom_histogram(alpha=0.9, color="white") +
    ggtitle("Distrobution of provider supply and insurance acceptance") +
    theme(
      plot.title = element_text(size=10)
    )

#rurality
p2 <- data_acs_inc %>%
  ggplot( aes(x=provider_1k, fill=rurality)) +
    geom_histogram(alpha=0.9, color="white") +
  ggtitle("Distrobution of provider supply in rural and non-rural areas") +
    theme(
      plot.title = element_text(size=10)
    )

p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exploring distrobution of median county income (continuous)

#median income

data_acs_inc %>%
  group_by(insurance_flag) %>%
 skim(median_income)

Data summary
Name	Piped data
Number of rows	175083
Number of columns	72
_______________________
Column type frequency:
numeric	1
________________________
Group variables	insurance_flag

Variable type: numeric

skim_variable	insurance_flag	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
median_income	0	0	1	82747.92	21062.36	26395	66047	77635	93956	156821	▁▇▅▂▁
median_income	1	0	1	78477.32	18896.55	24349	64522	75485	88517	156821	▁▇▅▂▁

#providers that do accept insurance
p <- data_acs_inc %>%
  ggplot( aes(x=median_income, fill=insurance_flag)) +
    geom_histogram(color="white", alpha=0.9) +
  ggtitle("Distrobution of median income by insurance acceptance") +
    theme(
      plot.title = element_text(size=10)
    )

#providers that do accept insurance
p2 <- data_acs_inc %>%
  ggplot( aes(x=fraction_white, fill=insurance_flag)) +
    geom_histogram(color="white", alpha=0.9) +
  ggtitle("Distrobution of fraction white county pop by insurance acceptance") +
    theme(
      plot.title = element_text(size=10)
    )
p

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exploring the distrobutions

The psych_forest dataset has all of our variables of interest, it will be used for the random forest in later code but I will use it now as well to easily explore the distributions of the variables of interest

psych_forest <- as.data.frame(select(data_acs_inc, c("insurance_flag","provs","any_child","group_pr", "rurality", "supply_med", "income_med", "frac_white_med", "median_income", "provider_1k","fraction_white")))

#remove frac no health insurance
#rural urban versus region (take out region and include rural urban on its own)
# then we can see 

Desc(psych_forest[1:8])

## ------------------------------------------------------------------------------ 
## Describe psych_forest[1:8] (data.frame):
## 
## data frame:  175083 obs. of  8 variables
##      175083 complete cases (100.0%)
## 
##   Nr  ColName         Class   NAs  Levels                                   
##   1   insurance_flag  factor  .    (2): 1-0, 2-1                            
##   2   provs           factor  .    (4): 1-advanced, 2-counselors, 3-masters,
##                                    4-psych                                  
##   3   any_child       factor  .    (2): 1-0, 2-1                            
##   4   group_pr        factor  .    (2): 1-0, 2-1                            
##   5   rurality        factor  .    (2): 1-Metro, 2-Rural                    
##   6   supply_med      factor  .    (2): 1-0, 2-1                            
##   7   income_med      factor  .    (2): 1-0, 2-1                            
##   8   frac_white_med  factor  .    (2): 1-0, 2-1                            
## 
## 
## ------------------------------------------------------------------------------ 
## 1 - insurance_flag (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0   61'422  35.1%   34.9%   35.3%
## 1  113'661  64.9%   64.7%   65.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 2 - provs (factor)
## 
##    length       n     NAs  unique  levels   dupes
##   175'083 175'083       0       4       4       y
##            100.0%    0.0%                        
## 
##         level    freq   perc  cumfreq  cumperc
## 1     masters  80'573  46.0%   80'573    46.0%
## 2  counselors  65'478  37.4%  146'051    83.4%
## 3       psych  26'489  15.1%  172'540    98.5%
## 4    advanced   2'543   1.5%  175'083   100.0%

## ------------------------------------------------------------------------------ 
## 3 - any_child (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0   61'038  34.9%   34.6%   35.1%
## 1  114'045  65.1%   64.9%   65.4%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 4 - group_pr (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##       freq   perc  lci.95  uci.95'
## 0  104'650  59.8%   59.5%   60.0%
## 1   70'433  40.2%   40.0%   40.5%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 5 - rurality (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##           freq   perc  lci.95  uci.95'
## Metro  168'073  96.0%   95.9%   96.1%
## Rural    7'010   4.0%    3.9%    4.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 6 - supply_med (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## 0  87'734  50.1%   49.9%   50.3%
## 1  87'349  49.9%   49.7%   50.1%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 7 - income_med (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## 0  92'689  52.9%   52.7%   53.2%
## 1  82'394  47.1%   46.8%   47.3%
## 
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------ 
## 8 - frac_white_med (factor - dichotomous)
## 
##    length       n     NAs  unique
##   175'083 175'083       0       2
##            100.0%    0.0%        
## 
##      freq   perc  lci.95  uci.95'
## 0  87'597  50.0%   49.8%   50.3%
## 1  87'486  50.0%   49.7%   50.2%
## 
## ' 95%-CI (Wilson)

skim(psych_forest[9:11])

Data summary
Name	psych_forest[9:11]
Number of rows	175083
Number of columns	3
_______________________
Column type frequency:
numeric	3
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
median_income	1	79975.52	19788.68	24349.0	65554.50	76367.00	92600.00	156821.00	▁▇▅▂▁
provider_1k	1	0.94	0.65	0.0	0.52	0.76	1.23	10.49	▇▁▁▁▁
fraction_white	1	0.65	0.15	0.1	0.53	0.66	0.77	0.98	▁▂▆▇▃

Exploration of possible rurality*supply interaction term

xtabs(~ rurality + supply_med, data = psych_forest)

##         supply_med
## rurality     0     1
##    Metro 82875 85198
##    Rural  4859  2151

Univariate models

#this is currently comparing against advanced practice providers
#advanced practice providers are the smallest group, which may not lead to a great 
#reference category if cells are small. I will change this to be masters level
#providers for now since that is the largest group
data_acs_inc$provs <- contr_code_treatment(data_acs_inc$provs, 
                                                   base = "masters")
uni_prov <- glm(insurance_flag ~ provs, data = data_acs_inc, 
                family = "binomial")
summary(uni_prov)

## 
## Call:
## glm(formula = insurance_flag ~ provs, family = "binomial", data = data_acs_inc)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.50325    0.00727  69.222   <2e-16 ***
## provs.advanced-masters   -0.38909    0.04038  -9.635   <2e-16 ***
## provs.counselors-masters  0.44135    0.01134  38.917   <2e-16 ***
## provs.psych-masters      -0.24335    0.01437 -16.938   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 224196  on 175079  degrees of freedom
## AIC: 224204
## 
## Number of Fisher Scoring iterations: 4

uni_prov_con <- glm(insurance_flag ~ provider_1k, data = data_acs_inc, 
                family = "binomial")
summary(uni_prov_con)

## 
## Call:
## glm(formula = insurance_flag ~ provider_1k, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.787316   0.008931   88.16   <2e-16 ***
## provider_1k -0.181152   0.007736  -23.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226327  on 175081  degrees of freedom
## AIC: 226331
## 
## Number of Fisher Scoring iterations: 4

#some other primary independent var models
uni_child <- glm(insurance_flag ~ any_child, data = data_acs_inc, 
                 family = "binomial")
summary(uni_child)

## 
## Call:
## glm(formula = insurance_flag ~ any_child, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.509375   0.008359   60.94   <2e-16 ***
## any_child1  0.164224   0.010444   15.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226644  on 175081  degrees of freedom
## AIC: 226648
## 
## Number of Fisher Scoring iterations: 4

uni_rural <- glm(insurance_flag ~ rurality, data = data_acs_inc, 
                  family = "binomial")
summary(uni_rural)

## 
## Call:
## glm(formula = insurance_flag ~ rurality, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.60012    0.00510  117.68   <2e-16 ***
## ruralityRural  0.41027    0.02748   14.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226655  on 175081  degrees of freedom
## AIC: 226659
## 
## Number of Fisher Scoring iterations: 4

uni_income <- glm(insurance_flag ~ income_med, data = data_acs_inc, 
                  family = "binomial")
summary(uni_income)

## 
## Call:
## glm(formula = insurance_flag ~ income_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.749584   0.007036  106.53   <2e-16 ***
## income_med1 -0.278914   0.010040  -27.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226117  on 175081  degrees of freedom
## AIC: 226121
## 
## Number of Fisher Scoring iterations: 4

uni_group <- glm(insurance_flag ~ group_pr, data = data_acs_inc, 
                 family = "binomial")
summary(uni_group)

## 
## Call:
## glm(formula = insurance_flag ~ group_pr, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 0.524293   0.006396   81.97   <2e-16 ***
## group_pr1   0.231474   0.010306   22.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226382  on 175081  degrees of freedom
## AIC: 226386
## 
## Number of Fisher Scoring iterations: 4

uni_prov1k <- glm(insurance_flag ~ supply_med, data = data_acs_inc, 
                  family = "binomial")
summary(uni_prov1k)

## 
## Call:
## glm(formula = insurance_flag ~ supply_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.656524   0.007119  92.218  < 2e-16 ***
## supply_med1 -0.081823   0.010018  -8.167 3.15e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 226824  on 175081  degrees of freedom
## AIC: 226828
## 
## Number of Fisher Scoring iterations: 4

uni_frac_wh <- glm(insurance_flag ~ frac_white_med, data = data_acs_inc, 
                   family = "binomial")
summary(uni_frac_wh)

## 
## Call:
## glm(formula = insurance_flag ~ frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     0.465865   0.006942   67.11   <2e-16 ***
## frac_white_med1 0.306381   0.010053   30.48   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 225958  on 175081  degrees of freedom
## AIC: 225962
## 
## Number of Fisher Scoring iterations: 4

Full model with all vars

Checking the binary variable for provider supply

model1 <- glm(insurance_flag ~ supply_med + provs + any_child 
              + rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)

## 
## Call:
## glm(formula = insurance_flag ~ supply_med + provs + any_child + 
##     rurality + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.299868   0.013181  22.749  < 2e-16 ***
## supply_med1              -0.009605   0.010769  -0.892    0.372    
## provs.advanced-masters   -0.400754   0.040696  -9.848  < 2e-16 ***
## provs.counselors-masters  0.461996   0.011450  40.348  < 2e-16 ***
## provs.psych-masters      -0.193292   0.014536 -13.297  < 2e-16 ***
## any_child1                0.139653   0.010696  13.057  < 2e-16 ***
## ruralityRural             0.194344   0.028460   6.829 8.57e-12 ***
## income_med1              -0.281956   0.010880 -25.914  < 2e-16 ***
## group_pr1                 0.214758   0.010564  20.328  < 2e-16 ***
## frac_white_med1           0.300357   0.010387  28.916  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221764  on 175073  degrees of freedom
## AIC: 221784
## 
## Number of Fisher Scoring iterations: 4

model2 <- glm(insurance_flag ~ provs + any_child 
              + rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model2)

## 
## Call:
## glm(formula = insurance_flag ~ provs + any_child + rurality + 
##     income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.29662    0.01267  23.418  < 2e-16 ***
## provs.advanced-masters   -0.40106    0.04069  -9.856  < 2e-16 ***
## provs.counselors-masters  0.46156    0.01144  40.348  < 2e-16 ***
## provs.psych-masters      -0.19360    0.01453 -13.322  < 2e-16 ***
## any_child1                0.14008    0.01069  13.109  < 2e-16 ***
## ruralityRural             0.19494    0.02845   6.852  7.3e-12 ***
## income_med1              -0.28496    0.01035 -27.542  < 2e-16 ***
## group_pr1                 0.21406    0.01054  20.318  < 2e-16 ***
## frac_white_med1           0.30049    0.01039  28.931  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221765  on 175074  degrees of freedom
## AIC: 221783
## 
## Number of Fisher Scoring iterations: 4

anova(model2, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221765.1	NA	NA	NA
175073	221764.3	1	0.7955322	0.3724325

There is no evidence that a binary supply variable is a better fit than no supply variable. I will choose to use the continuous version of the variable

#centering the provider supply variable for ease of interpretation
mean(data_acs_inc$provider_1k)

## [1] 0.9398009

data_acs_inc$c_provider_1k <- data_acs_inc$provider_1k - mean(data_acs_inc$provider_1k)
#because of rounding errors in R, it makes it so that the mean isn't EXACTLY centered at 0 but extremely close

model1 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)

## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     rurality + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               0.265712   0.012821  20.724  < 2e-16 ***
## c_provider_1k            -0.132942   0.008291 -16.035  < 2e-16 ***
## provs.advanced-masters   -0.401879   0.040727  -9.868  < 2e-16 ***
## provs.counselors-masters  0.466670   0.011454  40.741  < 2e-16 ***
## provs.psych-masters      -0.185884   0.014553 -12.773  < 2e-16 ***
## any_child1                0.131055   0.010708  12.239  < 2e-16 ***
## ruralityRural             0.167692   0.028511   5.882 4.06e-09 ***
## income_med1              -0.224697   0.011018 -20.394  < 2e-16 ***
## group_pr1                 0.226017   0.010571  21.382  < 2e-16 ***
## frac_white_med1           0.305407   0.010401  29.364  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221506  on 175073  degrees of freedom
## AIC: 221526
## 
## Number of Fisher Scoring iterations: 4

vif(model1)

##                    GVIF Df GVIF^(1/(2*Df))
## c_provider_1k  1.160592  1        1.077308
## provs          1.012826  3        1.002126
## any_child      1.020213  1        1.010056
## rurality       1.056585  1        1.027903
## income_med     1.172852  1        1.082983
## group_pr       1.023455  1        1.011659
## frac_white_med 1.043309  1        1.021425

anova(model1, test = 'Chisq')

	Df	Deviance	Resid. Df	Resid. Dev	Pr(>Chi)
NULL	NA	NA	175082	226890.3	NA
c_provider_1k	1	562.8089	175081	226327.5	0
provs	3	2700.1497	175078	223627.3	0
any_child	1	252.1089	175077	223375.2	0
rurality	1	152.3336	175076	223222.9	0
income_med	1	355.7611	175075	222867.1	0
group_pr	1	494.8203	175074	222372.3	0
frac_white_med	1	866.0003	175073	221506.3	0

Step-wise deletion and comparison with full model

options(width = 60)
model2 <- glm(insurance_flag ~  provs + any_child 
              + rurality + income_med + group_pr + frac_white_med,
              data = data_acs_inc, family = "binomial")
#summary(model2)
anova(model2, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221765.1	NA	NA	NA
175073	221506.3	1	258.7809	0

model3 <- glm(insurance_flag ~ c_provider_1k  + any_child 
              + rurality + income_med + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model3)
anova(model3, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175076	224156.2	NA	NA	NA
175073	221506.3	3	2649.896	0

model4 <- glm(insurance_flag ~  c_provider_1k + provs 
              + rurality + income_med + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model4)
anova(model4, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221655.6	NA	NA	NA
175073	221506.3	1	149.348	0

model5 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + income_med + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model5)
anova(model5, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221541.6	NA	NA	NA
175073	221506.3	1	35.29536	0

model6 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + group_pr + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model6)
anova(model6, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221922.0	NA	NA	NA
175073	221506.3	1	415.6647	0

model7 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + income_med + frac_white_med, 
              data = data_acs_inc, family = "binomial")
#summary(model7)
anova(model7, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221966.2	NA	NA	NA
175073	221506.3	1	459.9373	0

model8 <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
              + rurality + income_med + group_pr, 
              data = data_acs_inc, family = "binomial")
#summary(model8)
anova(model8, model1, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	222372.3	NA	NA	NA
175073	221506.3	1	866.0003	0

Full model (no interaction term)

summary(model1)

## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     rurality + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                           Estimate Std. Error z value
## (Intercept)               0.265712   0.012821  20.724
## c_provider_1k            -0.132942   0.008291 -16.035
## provs.advanced-masters   -0.401879   0.040727  -9.868
## provs.counselors-masters  0.466670   0.011454  40.741
## provs.psych-masters      -0.185884   0.014553 -12.773
## any_child1                0.131055   0.010708  12.239
## ruralityRural             0.167692   0.028511   5.882
## income_med1              -0.224697   0.011018 -20.394
## group_pr1                 0.226017   0.010571  21.382
## frac_white_med1           0.305407   0.010401  29.364
##                          Pr(>|z|)    
## (Intercept)               < 2e-16 ***
## c_provider_1k             < 2e-16 ***
## provs.advanced-masters    < 2e-16 ***
## provs.counselors-masters  < 2e-16 ***
## provs.psych-masters       < 2e-16 ***
## any_child1                < 2e-16 ***
## ruralityRural            4.06e-09 ***
## income_med1               < 2e-16 ***
## group_pr1                 < 2e-16 ***
## frac_white_med1           < 2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221506  on 175073  degrees of freedom
## AIC: 221526
## 
## Number of Fisher Scoring iterations: 4

Odds ratios

round(
  exp(
    cbind(OR = coef(model1), confint(model1))
    ), 
  2)

## Waiting for profiling to be done...

##                            OR 2.5 % 97.5 %
## (Intercept)              1.30  1.27   1.34
## c_provider_1k            0.88  0.86   0.89
## provs.advanced-masters   0.67  0.62   0.72
## provs.counselors-masters 1.59  1.56   1.63
## provs.psych-masters      0.83  0.81   0.85
## any_child1               1.14  1.12   1.16
## ruralityRural            1.18  1.12   1.25
## income_med1              0.80  0.78   0.82
## group_pr1                1.25  1.23   1.28
## frac_white_med1          1.36  1.33   1.39

Exploring interaction term between rurality and supply

Looking at stratified models (rural vs urban)

data_acs_rural <- subset(data_acs_inc, rurality=="Rural")
data_acs_urban <- subset(data_acs_inc, rurality=="Metro")

Rural

modelrural <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
               + income_med + group_pr + frac_white_med, data = data_acs_rural, family = "binomial")
summary(modelrural)

## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_rural)
## 
## Coefficients:
##                          Estimate Std. Error z value
## (Intercept)               1.08052    0.11032   9.795
## c_provider_1k             0.14980    0.05714   2.622
## provs.advanced-masters   -0.60596    0.21420  -2.829
## provs.counselors-masters  0.17028    0.06075   2.803
## provs.psych-masters      -0.48604    0.08992  -5.405
## any_child1                0.42007    0.05786   7.260
## income_med1              -0.44180    0.08523  -5.184
## group_pr1                 0.66626    0.06838   9.743
## frac_white_med1          -0.45054    0.09740  -4.626
##                          Pr(>|z|)    
## (Intercept)               < 2e-16 ***
## c_provider_1k             0.00875 ** 
## provs.advanced-masters    0.00467 ** 
## provs.counselors-masters  0.00506 ** 
## provs.psych-masters      6.48e-08 ***
## any_child1               3.88e-13 ***
## income_med1              2.17e-07 ***
## group_pr1                 < 2e-16 ***
## frac_white_med1          3.73e-06 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 8133.8  on 7009  degrees of freedom
## Residual deviance: 7838.2  on 7001  degrees of freedom
## AIC: 7856.2
## 
## Number of Fisher Scoring iterations: 4

round(
  exp(
    cbind(OR = coef(modelrural), confint(modelrural))
    ), 
  2)

## Waiting for profiling to be done...

##                            OR 2.5 % 97.5 %
## (Intercept)              2.95  2.38   3.67
## c_provider_1k            1.16  1.04   1.30
## provs.advanced-masters   0.55  0.36   0.83
## provs.counselors-masters 1.19  1.05   1.34
## provs.psych-masters      0.62  0.52   0.73
## any_child1               1.52  1.36   1.70
## income_med1              0.64  0.54   0.76
## group_pr1                1.95  1.70   2.23
## frac_white_med1          0.64  0.53   0.77

#the reason that there might be some stronger associations than excepted could be slightly less power 
# however, the cell sizes aren't horribly small
# this should most likely be interpreted carefully, as it is comparing to the median of the entire sample (urban and rural) and not rural alone--meaning that the interpretation of the median variables (income and frac white) may be unrepresentative in the stratified models
xtabs(~ insurance_flag + frac_white_med, data = data_acs_rural)

##               frac_white_med
## insurance_flag    0    1
##              0  154 1717
##              1  620 4519

#this one is arguably worse
xtabs(~ insurance_flag + provs, data = data_acs_rural)

##               provs
## insurance_flag advanced counselors masters psych
##              0       40        620     958   253
##              1       56       1976    2685   422

Non-rural

modelnonrural <- glm(insurance_flag ~ c_provider_1k + provs + any_child 
               + income_med + group_pr + frac_white_med, data = data_acs_urban, family = "binomial")
summary(modelnonrural)

## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child + 
##     income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_urban)
## 
## Coefficients:
##                           Estimate Std. Error z value
## (Intercept)               0.264493   0.012969  20.394
## c_provider_1k            -0.140263   0.008412 -16.674
## provs.advanced-masters   -0.390371   0.041492  -9.408
## provs.counselors-masters  0.479358   0.011671  41.073
## provs.psych-masters      -0.174075   0.014754 -11.799
## any_child1                0.120237   0.010899  11.032
## income_med1              -0.220380   0.011121 -19.817
## group_pr1                 0.213119   0.010720  19.880
## frac_white_med1           0.315798   0.010485  30.120
##                          Pr(>|z|)    
## (Intercept)                <2e-16 ***
## c_provider_1k              <2e-16 ***
## provs.advanced-masters     <2e-16 ***
## provs.counselors-masters   <2e-16 ***
## provs.psych-masters        <2e-16 ***
## any_child1                 <2e-16 ***
## income_med1                <2e-16 ***
## group_pr1                  <2e-16 ***
## frac_white_med1            <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 218521  on 168072  degrees of freedom
## Residual deviance: 213453  on 168064  degrees of freedom
## AIC: 213471
## 
## Number of Fisher Scoring iterations: 4

round(
  exp(
    cbind(OR = coef(modelnonrural), confint(modelnonrural))
    ), 
  2)

## Waiting for profiling to be done...

##                            OR 2.5 % 97.5 %
## (Intercept)              1.30  1.27   1.34
## c_provider_1k            0.87  0.85   0.88
## provs.advanced-masters   0.68  0.62   0.73
## provs.counselors-masters 1.62  1.58   1.65
## provs.psych-masters      0.84  0.82   0.86
## any_child1               1.13  1.10   1.15
## income_med1              0.80  0.78   0.82
## group_pr1                1.24  1.21   1.26
## frac_white_med1          1.37  1.34   1.40

Stratification results

The stratified results are interesting, although I believe continuing with the stratified models is inappropriate as I do not believe that stratification appropriately handles all of the predictors. I think that interpreting a model with an interaction term is more appropriate.

Interaction model (not stratified)

Interaction between: rurality (binary) x provider supply per 1000 county population (continuous)

model1int <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child 
               + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1int)

## 
## Call:
## glm(formula = insurance_flag ~ c_provider_1k * rurality + provs + 
##     any_child + income_med + group_pr + frac_white_med, family = "binomial", 
##     data = data_acs_inc)
## 
## Coefficients:
##                              Estimate Std. Error z value
## (Intercept)                  0.265019   0.012824  20.666
## c_provider_1k               -0.138637   0.008399 -16.507
## ruralityRural                0.253396   0.034775   7.287
## provs.advanced-masters      -0.399714   0.040726  -9.815
## provs.counselors-masters     0.467796   0.011458  40.828
## provs.psych-masters         -0.184868   0.014555 -12.701
## any_child1                   0.131262   0.010709  12.257
## income_med1                 -0.223129   0.011029 -20.232
## group_pr1                    0.225276   0.010573  21.306
## frac_white_med1              0.304661   0.010401  29.292
## c_provider_1k:ruralityRural  0.241243   0.053786   4.485
##                             Pr(>|z|)    
## (Intercept)                  < 2e-16 ***
## c_provider_1k                < 2e-16 ***
## ruralityRural               3.18e-13 ***
## provs.advanced-masters       < 2e-16 ***
## provs.counselors-masters     < 2e-16 ***
## provs.psych-masters          < 2e-16 ***
## any_child1                   < 2e-16 ***
## income_med1                  < 2e-16 ***
## group_pr1                    < 2e-16 ***
## frac_white_med1              < 2e-16 ***
## c_provider_1k:ruralityRural 7.28e-06 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 226890  on 175082  degrees of freedom
## Residual deviance: 221486  on 175072  degrees of freedom
## AIC: 221508
## 
## Number of Fisher Scoring iterations: 4

#testing the interaction model against a reduced model without main effect and interaction term
mod_reduced <- glm(insurance_flag ~ rurality + provs + any_child + income_med + group_pr 
                   + frac_white_med, data = data_acs_inc, family = "binomial")

anova(mod_reduced, model1int, test='LR')

Resid. Df	Resid. Dev	Df	Deviance	Pr(>Chi)
175074	221765.1	NA	NA	NA
175072	221485.6	2	279.4293	0

Interaction term is meaningful

Exponentiated interaction term model: ORs

mod.ORs <- round(
  exp(
    cbind(OR = coef(model1int), confint(model1int))
    ), 
  2)

## Waiting for profiling to be done...

mod.ORs

##                               OR 2.5 % 97.5 %
## (Intercept)                 1.30  1.27   1.34
## c_provider_1k               0.87  0.86   0.88
## ruralityRural               1.29  1.20   1.38
## provs.advanced-masters      0.67  0.62   0.73
## provs.counselors-masters    1.60  1.56   1.63
## provs.psych-masters         0.83  0.81   0.86
## any_child1                  1.14  1.12   1.16
## income_med1                 0.80  0.78   0.82
## group_pr1                   1.25  1.23   1.28
## frac_white_med1             1.36  1.33   1.38
## c_provider_1k:ruralityRural 1.27  1.15   1.42

#baseline estimates for non-rural
exp(0.51841 - 0.25340)

## [1] 1.303444

exp(0.447929860-0.322018859)

## [1] 1.134181

exp(0.5897366-0.1856781)

## [1] 1.497892

General interpretation of full model

Baseline interpretations:
* At the (sample) mean level of provider supply per 1000 county residents (0.94 providers per 1000 county population), the odds of a provider in a rural area accepting insurance is 1.68 (95% CI 1.57, 1.80), when controlling for all other factors
* At the (sample) mean level of provider supply per 1000 county residents (0.94 providers per 1000 county population), the odds of a provider in non-rural areas accepting insurance is 1.30 (95% 1.31, 1.50), when controlling for all other factors

General trends:
* Psychologists and advanced practice providers have lower odds of accepting insurance compared to masters level counselors, while therapists.social workers, etc are more likely to accept insurance compared to masters levels therapists
* Providers in group practices, that see children, and are in a county with a higher fraction of white population (compared to the rest of the sample) have higher odds of accepting insurance
* Providers in counties with higher than the sample median of household income have lower odds of accepting insurance

Contrasts for interaction

est.rural <- gmodels::estimable(model1int,
        c("c_provider_1k" = 1),
        conf.int = 0.95)

## Registered S3 method overwritten by 'gdata':
##   method         from     
##   reorder.factor DescTools

est.urban <- gmodels::estimable(model1int,
        c("c_provider_1k" = 1,
          "c_provider_1k:ruralityRural" = 1),
        conf.int = 0.95)

rbind(exp(est.rural[c("Estimate", "Lower.CI", "Upper.CI")]),
      exp(est.urban[c("Estimate", "Lower.CI", "Upper.CI")]))

	Estimate	Lower.CI	Upper.CI
(0 1 0 0 0 0 0 0 0 0 0)	0.8705436	0.8561578	0.8851711
(0 1 0 0 0 0 0 0 0 0 1)	1.1080540	0.9970241	1.2314484

Interpretation of interaction terms

The first row shows the estimate for rural and the second row shows the estimate for urban

For Rural: For every unit increase from the mean of provider supply per 1000 county population, providers in rural areas have an 11% increase in odds of accepting insurance (95% CI: 1.00, 1.23). - The mean of providers per 1000 county population is 0.94 providers per 1000 county population

For non-rural: For every unit increase from the mean of provider supply per 1000 county population, providers in non-rural areas have an 13% decrease in odds of accepting insurance (95% CI: 0.86, 0.89).

Visualizing interaction

library(sjPlot)

## #refugeeswelcome

#non-interaction
plot_model(model1,
           type = "eff",
           terms = c("c_provider_1k", "rurality"),
           title = "No interaction",
           axis.title = c("Provider Supply (per 1000 county population)", 
                          "P(Insurance Acceptance)"),
           legend.title = "Rurality")

# Effect of c_provider_1k at each level of rurality
plot_model(model1int,
           type = "eff",
           terms = c("c_provider_1k", "rurality"),
           title = "Interaction between provider supply and rurality",
           axis.title = c("Provider Supply (per 1000 county population)", 
                          "P(Insurance Acceptance)"),
           legend.title = "Rurality")

Testing the accuracy of this model by training then testing

# Split data into train and test
set.seed(430)

split <- initial_split(data_acs_inc, prop = 0.8, strata = insurance_flag)
train <- split %>% 
         training()
test <- split %>% 
        testing()

mod_train <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child 
               + income_med + group_pr + frac_white_med, 
               data = train, family = "binomial")

glm_probs = data.frame(probs = predict(mod_train, 
                                       newdata = test, 
                                       type="response"))

## Warning: contrasts dropped from factor provs

glm_pred = glm_probs %>%
  mutate(pred = ifelse(probs>.5, "accept", "does not accept"))

glm_pred = cbind(test, glm_pred)

glm_pred %>% 
  count(pred, insurance_flag) %>%
  spread(insurance_flag, n, fill = 0)

pred	0	1
accept	11526	22009
does not accept	759	724

pr <- prediction(glm_probs,test$insurance_flag)
perf <- performance(pr,measure = "tpr",x.measure = "fpr") 
plot(perf)

auc(test$insurance_flag,glm_probs)

## Warning in auc(test$insurance_flag, glm_probs): longer
## object length is not a multiple of shorter object length

## [1] 0.6024096

Random forest for variable importance

This is where we are using the psych_forest dataset

set.seed(421)

split <- initial_split(psych_forest, prop = 0.7, strata = insurance_flag)
train <- split %>% 
         training()
test <- split %>% 
        testing()

Training

model <- randomForest(y=train[,1], x=train[,c(2:5, 7:8, 10)], data = train, ntree = 500, 
                      mtry = 5, importance = TRUE , proximity = FALSE)

model

## 
## Call:
##  randomForest(x = train[, c(2:5, 7:8, 10)], y = train[, 1], ntree = 500,      mtry = 5, importance = TRUE, proximity = FALSE, data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 31.75%
## Confusion matrix:
##       0     1 class.error
## 0 13878 29117   0.6772183
## 1  9791 69771   0.1230613

The next step is to validate our model using the test data

set.seed(421)
# The next step is to validate our model using the test data

prediction <- predict(model, newdata = test)

table(prediction, test$insurance_flag)

##           
## prediction     0     1
##          0  5968  4165
##          1 12459 29934

results<-cbind(prediction,test$insurance_flag)

colnames(results)<-c('pred','real')

results<-as.data.frame(results)

# Finally, let’s calculate the accuracy of the model

sum(prediction==test$insurance_flag) / nrow(test)

## [1] 0.6835091

Variable importance

The first column shows the predictor variables used in the RF model, the following columns shows how much each of those variables contribute to predicting absences (0), presences (1) or both (MeanDecreaseAccuracy).
The last column shows the node impurity of the model.

varImpPlot(model, sort=T, n.var= 7, 
           main= "Health insurance acceptance: yes or no", pch=16)

#another way to view this
imp=as.data.frame(importance(model)) #Analyze the importance of each variable in the model
imp=cbind(vars=rownames(imp),imp)
print(imp)

##                          vars         0         1
## provs                   provs 159.20444 123.23111
## any_child           any_child  40.59555  93.43049
## group_pr             group_pr  54.34715 118.34899
## rurality             rurality  34.67945  19.18478
## income_med         income_med 103.78935  57.29550
## frac_white_med frac_white_med 142.70234  73.24832
## provider_1k       provider_1k 198.04780 104.40459
##                MeanDecreaseAccuracy MeanDecreaseGini
## provs                     244.87050        1264.2493
## any_child                 116.21387         475.6065
## group_pr                  151.69823         409.9147
## rurality                   52.61039         164.3426
## income_med                149.13921         386.2367
## frac_white_med            200.31552         390.9788
## provider_1k               283.47960        6587.6862

barplot(imp$MeanDecreaseAccuracy, names.arg=imp$vars)

Examples of variable relationship with outcome and importance

Partial dependence plots

par(mfrow = c(2, 2))

partialPlot(model, test, provider_1k, "1", xlab="Providers per 1000 county population", 
            ylab="Log of fraction of votes", lwd=4, col="green")

partialPlot(x=model, pred.data=test, x.var=any_child, which.class="1")
partialPlot(x=model, pred.data=test, x.var=rurality, which.class="1")
partialPlot(x=model, pred.data=test, x.var=provs, which.class="1")

par(mfrow = c(2, 2))
partialPlot(x=model, pred.data=test, x.var=income_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=frac_white_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=group_pr, which.class="1")

Psychology Today Models:

Logistic regression for insurance acceptance outcome

Aine Huntington

Friday, 26 April, 2024

Context

Code book for variables in full model:

Load Required Packages

Explore urbanicity

Model building

Preparing the data

This probably is not needed anymore

Adjustments after meeting with Jane on 4/23/24

Exploring distrobution of provider supply (continuous)

Exploring distrobution of median county income (continuous)

Exploring the distrobutions

Exploration of possible rurality*supply interaction term

Univariate models

Full model with all vars

Checking the binary variable for provider supply

Step-wise deletion and comparison with full model

Full model (no interaction term)

Odds ratios

Exploring interaction term between rurality and supply

Looking at stratified models (rural vs urban)

Rural

Non-rural

Stratification results

Interaction model (not stratified)

Interaction between: rurality (binary) x provider supply per 1000 county population (continuous)

Exponentiated interaction term model: ORs

General interpretation of full model

Contrasts for interaction

Interpretation of interaction terms

Visualizing interaction

Testing the accuracy of this model by training then testing

Random forest for variable importance

Training

The next step is to validate our model using the test data

Variable importance

Examples of variable relationship with outcome and importance

Session Info