Context
This code is designed to explore the factors associated with
insurance acceptance among mental health providers (data from psychology
today).
- The following steps were taken:
- Exploratory univariate analysis
- Building the full logistic regression model
- Exploring provider supply variables–binary vs continuous
- Comparing reduced models to full model
- Exploring the interaction between provider supply and rurality
- Exploring a stratified model (which was for visual purposes,
stratified model was not ultimately chosen as the correct choice)
- Running full model through a training and testing sample, then
exploring the accuracy of the model
- Running the model through a random forest to better understand the
importance of the variables on insurance acceptance among mental health
providers
Code book for variables in full model:
- c_provider_1k: the supply of all types of mental health providers
per 1000k county residents, centered around its mean
- provs: provider type in categories.
- Levels: “Masters level therapists and counselors”, “All other
therapists and counselors”, “Psychologists/PhD”, “Advanced practice
providers”
- Level names were changed to be better on output
- any_child: does a provider also see clients under 18?
- Rurality: Binary–urban or rural.
- Based on Categories of the 2013 and 2006 NCHS Urban-Rural
Classification Scheme for Counties.
- Metro distinctions 1-4 = urban
- metro distinctions 5-6 = rural (micropoliton and non-core)
- income_med: Binary
- whether provider is in a county with a median household income that
is above or below the median of the sample.
- Data from ACS 5-year survey
- group_pr: is the provider part of a group practice?
- Determined by grouping based on exact addresses.
- Binary
- frac_white_med: binary
- Data from 5-year ACS
- Binary, whether provider is in a county with a fraction of county
population that is white that is above or below the median of the
sample
data_acs <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/working data acs 4-5-24.csv")
data_acs <- data_acs[which(!is.na(data_acs$id)), ] #just in case it reuploads weirdly from csv
#Desc(data_acs$id)
#length of df should be 184'957
# unique IDs should be 167'854
Load Required Packages
Explore urbanicity
metros <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/NCHSURCodes2013.csv")
data_acs <- left_join(data_acs, metros, by = "fips")
data_acs <- data_acs %>%
mutate(rurality = case_when(metro_codes==1 | metro_codes==2 | metro_codes==3 | metro_codes==4 ~ "Metro",
metro_codes==5 | metro_codes==6 ~ "Rural")
)
Model building
Exhibit 4. Regression of provider- and county/zip level covariates on
insurance acceptance 1. County or zip code composition (population, %
white, HRSA supply of MH professionals, income) 2. Provider-level
covariates – years in practice, psychologist vs. therapist, telehealth,
gender, # of locations (can use addresses to identify multi-provider
groups e.g. count number of providers at same location), whether
providers themselves practice in more than 1 location
Preparing the data
data_acs_inc <- data_acs[which(data_acs$excluded==0),]
data_acs_inc <- data_acs_inc %>%
group_by(GEOID) %>%
mutate(provider_1k = (n() / county_population)*1000) %>%
ungroup()
# these are not at the "county level", they are at the provider level
# comparing obvs in the dataset against each other rather than what county they are in
data_acs_inc <- data_acs_inc %>%
mutate(supply_med = if_else(provider_1k>median(provider_1k), 1, 0)
)
data_acs_inc <- data_acs_inc %>%
mutate(income_med = if_else(median_income>median(median_income), 1, 0)
)
data_acs_inc <- data_acs_inc %>%
mutate(frac_white_med = if_else(fraction_white>median(fraction_white), 1, 0)
)
# var manipulation
skim(data_acs_inc$provider_1k)
Data summary
| Name |
data_acs_inc$provider_1k |
| Number of rows |
175083 |
| Number of columns |
1 |
| _______________________ |
|
| Column type frequency: |
|
| numeric |
1 |
| ________________________ |
|
| Group variables |
None |
Variable type: numeric
| data |
0 |
1 |
0.94 |
0.65 |
0 |
0.52 |
0.76 |
1.23 |
10.49 |
▇▁▁▁▁ |
skim(data_acs_inc$median_income)
Data summary
| Name |
data_acs_inc$median_incom… |
| Number of rows |
175083 |
| Number of columns |
1 |
| _______________________ |
|
| Column type frequency: |
|
| numeric |
1 |
| ________________________ |
|
| Group variables |
None |
Variable type: numeric
| data |
0 |
1 |
79975.52 |
19788.68 |
24349 |
65554.5 |
76367 |
92600 |
156821 |
▁▇▅▂▁ |
skim(data_acs_inc$fraction_white)
Data summary
| Name |
data_acs_inc$fraction_whi… |
| Number of rows |
175083 |
| Number of columns |
1 |
| _______________________ |
|
| Column type frequency: |
|
| numeric |
1 |
| ________________________ |
|
| Group variables |
None |
Variable type: numeric
| data |
0 |
1 |
0.65 |
0.15 |
0.1 |
0.53 |
0.66 |
0.77 |
0.98 |
▁▂▆▇▃ |
#binary variable for provider supply above or below the median
Desc(data_acs_inc$supply_med)
## ------------------------------------------------------------------------------
## data_acs_inc$supply_med (numeric)
##
## length n NAs unique 0s mean meanCI'
## 175'083 175'083 0 2 87'734 0.50 0.50
## 100.0% 0.0% 50.1% 0.50
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 1.00 0.50 1.00 0.00 1.00 0.00 -2.00
##
##
## value freq perc cumfreq cumperc
## 1 0 87'734 50.1% 87'734 50.1%
## 2 1 87'349 49.9% 175'083 100.0%
##
## ' 95%-CI (classic)

Desc(data_acs_inc$income_med)
## ------------------------------------------------------------------------------
## data_acs_inc$income_med (numeric)
##
## length n NAs unique 0s mean meanCI'
## 175'083 175'083 0 2 92'689 0.47 0.47
## 100.0% 0.0% 52.9% 0.47
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 1.00 0.50 1.06 0.00 1.00 0.12 -1.99
##
##
## value freq perc cumfreq cumperc
## 1 0 92'689 52.9% 92'689 52.9%
## 2 1 82'394 47.1% 175'083 100.0%
##
## ' 95%-CI (classic)

Desc(data_acs_inc$frac_white_med)
## ------------------------------------------------------------------------------
## data_acs_inc$frac_white_med (numeric)
##
## length n NAs unique 0s mean meanCI'
## 175'083 175'083 0 2 87'597 0.50 0.50
## 100.0% 0.0% 50.0% 0.50
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 1.00 0.50 1.00 0.00 1.00 0.00 -2.00
##
##
## value freq perc cumfreq cumperc
## 1 0 87'597 50.0% 87'597 50.0%
## 2 1 87'486 50.0% 175'083 100.0%
##
## ' 95%-CI (classic)

This probably is not needed anymore
#making tertiles
data_acs_inc <- data_acs_inc %>%
mutate(supply_tert = ntile(provider_1k, 3))
data_acs_inc <- data_acs_inc %>%
mutate(supply_tert = case_when(supply_tert==1 ~ "low",
supply_tert==2 ~ "mid",
supply_tert==3 ~ "high")
)
# Think about keeping supply as a continuous variable
# look at distribution of continuous supply var
data_acs_inc$supply_tert <- as.factor(data_acs_inc$supply_tert)
data_acs_inc <- data_acs_inc %>%
mutate(provider_type = case_when(provider_type=="Masters level therapists and counselors" ~ "masters",
provider_type=="All other therapists and counselors" ~ "counselors",
provider_type=="Psychologists/PhD" ~ "psych",
provider_type=="Advanced practice providers" ~ "advanced")
)
data_acs_inc$provs <- as.factor(data_acs_inc$provider_type)
data_acs_inc$insurance_flag <- as.factor(data_acs_inc$insurance_flag)
data_acs_inc$any_child <- as.factor(data_acs_inc$any_child)
data_acs_inc$region <- as.factor(data_acs_inc$region)
data_acs_inc$group_pr <- as.factor(data_acs_inc$group_pr)
data_acs_inc$telehealth <- as.factor(data_acs_inc$telehealth)
data_acs_inc$rurality <- as.factor(data_acs_inc$rurality)
data_acs_inc$supply_med <- as.factor(data_acs_inc$supply_med)
data_acs_inc$income_med <- as.factor(data_acs_inc$income_med)
data_acs_inc$frac_white_med <- as.factor(data_acs_inc$frac_white_med)
#i need to adjust the names in the provider_type variable so it doesn't mess with
# our output
Desc(data_acs_inc$rurality)
## ------------------------------------------------------------------------------
## data_acs_inc$rurality (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## Metro 168'073 96.0% 95.9% 96.1%
## Rural 7'010 4.0% 3.9% 4.1%
##
## ' 95%-CI (Wilson)

Adjustments after meeting with Jane on 4/23/24
indicator var: rural versus not (micropolitan and non-core versus
other categories) then relook at supply as continuous or above median
and below median (dichotimous var). possibly then also a possible
interaction term between rurality and supply. for median income: either
dichotimous var or percentiles (non-continuous). also looking at
fraction white in percentiles
Exploring distrobution of provider supply (continuous)
#provider supply
data_acs_inc %>%
group_by(insurance_flag) %>%
skim(provider_1k)
Data summary
| Name |
Piped data |
| Number of rows |
175083 |
| Number of columns |
72 |
| _______________________ |
|
| Column type frequency: |
|
| numeric |
1 |
| ________________________ |
|
| Group variables |
insurance_flag |
Variable type: numeric
| provider_1k |
0 |
0 |
1 |
0.99 |
0.74 |
0 |
0.53 |
0.76 |
1.3 |
10.49 |
▇▁▁▁▁ |
| provider_1k |
1 |
0 |
1 |
0.91 |
0.59 |
0 |
0.51 |
0.75 |
1.2 |
10.49 |
▇▁▁▁▁ |
#providers that do accept insurance
p <- data_acs_inc %>%
ggplot( aes(x=provider_1k, fill= insurance_flag)) +
geom_histogram(alpha=0.9, color="white") +
ggtitle("Distrobution of provider supply and insurance acceptance") +
theme(
plot.title = element_text(size=10)
)
#rurality
p2 <- data_acs_inc %>%
ggplot( aes(x=provider_1k, fill=rurality)) +
geom_histogram(alpha=0.9, color="white") +
ggtitle("Distrobution of provider supply in rural and non-rural areas") +
theme(
plot.title = element_text(size=10)
)
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exploring the distrobutions
- The psych_forest dataset has all of our variables of interest, it
will be used for the random forest in later code but I will use it now
as well to easily explore the distributions of the variables of
interest
psych_forest <- as.data.frame(select(data_acs_inc, c("insurance_flag","provs","any_child","group_pr", "rurality", "supply_med", "income_med", "frac_white_med", "median_income", "provider_1k","fraction_white")))
#remove frac no health insurance
#rural urban versus region (take out region and include rural urban on its own)
# then we can see
Desc(psych_forest[1:8])
## ------------------------------------------------------------------------------
## Describe psych_forest[1:8] (data.frame):
##
## data frame: 175083 obs. of 8 variables
## 175083 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 insurance_flag factor . (2): 1-0, 2-1
## 2 provs factor . (4): 1-advanced, 2-counselors, 3-masters,
## 4-psych
## 3 any_child factor . (2): 1-0, 2-1
## 4 group_pr factor . (2): 1-0, 2-1
## 5 rurality factor . (2): 1-Metro, 2-Rural
## 6 supply_med factor . (2): 1-0, 2-1
## 7 income_med factor . (2): 1-0, 2-1
## 8 frac_white_med factor . (2): 1-0, 2-1
##
##
## ------------------------------------------------------------------------------
## 1 - insurance_flag (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 61'422 35.1% 34.9% 35.3%
## 1 113'661 64.9% 64.7% 65.1%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 2 - provs (factor)
##
## length n NAs unique levels dupes
## 175'083 175'083 0 4 4 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 masters 80'573 46.0% 80'573 46.0%
## 2 counselors 65'478 37.4% 146'051 83.4%
## 3 psych 26'489 15.1% 172'540 98.5%
## 4 advanced 2'543 1.5% 175'083 100.0%

## ------------------------------------------------------------------------------
## 3 - any_child (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 61'038 34.9% 34.6% 35.1%
## 1 114'045 65.1% 64.9% 65.4%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 4 - group_pr (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 104'650 59.8% 59.5% 60.0%
## 1 70'433 40.2% 40.0% 40.5%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 5 - rurality (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## Metro 168'073 96.0% 95.9% 96.1%
## Rural 7'010 4.0% 3.9% 4.1%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 6 - supply_med (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 87'734 50.1% 49.9% 50.3%
## 1 87'349 49.9% 49.7% 50.1%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 7 - income_med (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 92'689 52.9% 52.7% 53.2%
## 1 82'394 47.1% 46.8% 47.3%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 8 - frac_white_med (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 87'597 50.0% 49.8% 50.3%
## 1 87'486 50.0% 49.7% 50.2%
##
## ' 95%-CI (Wilson)

Data summary
| Name |
psych_forest[9:11] |
| Number of rows |
175083 |
| Number of columns |
3 |
| _______________________ |
|
| Column type frequency: |
|
| numeric |
3 |
| ________________________ |
|
| Group variables |
None |
Variable type: numeric
| median_income |
0 |
1 |
79975.52 |
19788.68 |
24349.0 |
65554.50 |
76367.00 |
92600.00 |
156821.00 |
▁▇▅▂▁ |
| provider_1k |
0 |
1 |
0.94 |
0.65 |
0.0 |
0.52 |
0.76 |
1.23 |
10.49 |
▇▁▁▁▁ |
| fraction_white |
0 |
1 |
0.65 |
0.15 |
0.1 |
0.53 |
0.66 |
0.77 |
0.98 |
▁▂▆▇▃ |
Exploration of possible rurality*supply interaction term
xtabs(~ rurality + supply_med, data = psych_forest)
## supply_med
## rurality 0 1
## Metro 82875 85198
## Rural 4859 2151
Univariate models
#this is currently comparing against advanced practice providers
#advanced practice providers are the smallest group, which may not lead to a great
#reference category if cells are small. I will change this to be masters level
#providers for now since that is the largest group
data_acs_inc$provs <- contr_code_treatment(data_acs_inc$provs,
base = "masters")
uni_prov <- glm(insurance_flag ~ provs, data = data_acs_inc,
family = "binomial")
summary(uni_prov)
##
## Call:
## glm(formula = insurance_flag ~ provs, family = "binomial", data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.50325 0.00727 69.222 <2e-16 ***
## provs.advanced-masters -0.38909 0.04038 -9.635 <2e-16 ***
## provs.counselors-masters 0.44135 0.01134 38.917 <2e-16 ***
## provs.psych-masters -0.24335 0.01437 -16.938 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 224196 on 175079 degrees of freedom
## AIC: 224204
##
## Number of Fisher Scoring iterations: 4
uni_prov_con <- glm(insurance_flag ~ provider_1k, data = data_acs_inc,
family = "binomial")
summary(uni_prov_con)
##
## Call:
## glm(formula = insurance_flag ~ provider_1k, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.787316 0.008931 88.16 <2e-16 ***
## provider_1k -0.181152 0.007736 -23.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226327 on 175081 degrees of freedom
## AIC: 226331
##
## Number of Fisher Scoring iterations: 4
#some other primary independent var models
uni_child <- glm(insurance_flag ~ any_child, data = data_acs_inc,
family = "binomial")
summary(uni_child)
##
## Call:
## glm(formula = insurance_flag ~ any_child, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.509375 0.008359 60.94 <2e-16 ***
## any_child1 0.164224 0.010444 15.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226644 on 175081 degrees of freedom
## AIC: 226648
##
## Number of Fisher Scoring iterations: 4
uni_rural <- glm(insurance_flag ~ rurality, data = data_acs_inc,
family = "binomial")
summary(uni_rural)
##
## Call:
## glm(formula = insurance_flag ~ rurality, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60012 0.00510 117.68 <2e-16 ***
## ruralityRural 0.41027 0.02748 14.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226655 on 175081 degrees of freedom
## AIC: 226659
##
## Number of Fisher Scoring iterations: 4
uni_income <- glm(insurance_flag ~ income_med, data = data_acs_inc,
family = "binomial")
summary(uni_income)
##
## Call:
## glm(formula = insurance_flag ~ income_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.749584 0.007036 106.53 <2e-16 ***
## income_med1 -0.278914 0.010040 -27.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226117 on 175081 degrees of freedom
## AIC: 226121
##
## Number of Fisher Scoring iterations: 4
uni_group <- glm(insurance_flag ~ group_pr, data = data_acs_inc,
family = "binomial")
summary(uni_group)
##
## Call:
## glm(formula = insurance_flag ~ group_pr, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.524293 0.006396 81.97 <2e-16 ***
## group_pr1 0.231474 0.010306 22.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226382 on 175081 degrees of freedom
## AIC: 226386
##
## Number of Fisher Scoring iterations: 4
uni_prov1k <- glm(insurance_flag ~ supply_med, data = data_acs_inc,
family = "binomial")
summary(uni_prov1k)
##
## Call:
## glm(formula = insurance_flag ~ supply_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.656524 0.007119 92.218 < 2e-16 ***
## supply_med1 -0.081823 0.010018 -8.167 3.15e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226824 on 175081 degrees of freedom
## AIC: 226828
##
## Number of Fisher Scoring iterations: 4
uni_frac_wh <- glm(insurance_flag ~ frac_white_med, data = data_acs_inc,
family = "binomial")
summary(uni_frac_wh)
##
## Call:
## glm(formula = insurance_flag ~ frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.465865 0.006942 67.11 <2e-16 ***
## frac_white_med1 0.306381 0.010053 30.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 225958 on 175081 degrees of freedom
## AIC: 225962
##
## Number of Fisher Scoring iterations: 4
Full model with all vars
Checking the binary variable for provider supply
model1 <- glm(insurance_flag ~ supply_med + provs + any_child
+ rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = insurance_flag ~ supply_med + provs + any_child +
## rurality + income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.299868 0.013181 22.749 < 2e-16 ***
## supply_med1 -0.009605 0.010769 -0.892 0.372
## provs.advanced-masters -0.400754 0.040696 -9.848 < 2e-16 ***
## provs.counselors-masters 0.461996 0.011450 40.348 < 2e-16 ***
## provs.psych-masters -0.193292 0.014536 -13.297 < 2e-16 ***
## any_child1 0.139653 0.010696 13.057 < 2e-16 ***
## ruralityRural 0.194344 0.028460 6.829 8.57e-12 ***
## income_med1 -0.281956 0.010880 -25.914 < 2e-16 ***
## group_pr1 0.214758 0.010564 20.328 < 2e-16 ***
## frac_white_med1 0.300357 0.010387 28.916 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 221764 on 175073 degrees of freedom
## AIC: 221784
##
## Number of Fisher Scoring iterations: 4
model2 <- glm(insurance_flag ~ provs + any_child
+ rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = insurance_flag ~ provs + any_child + rurality +
## income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.29662 0.01267 23.418 < 2e-16 ***
## provs.advanced-masters -0.40106 0.04069 -9.856 < 2e-16 ***
## provs.counselors-masters 0.46156 0.01144 40.348 < 2e-16 ***
## provs.psych-masters -0.19360 0.01453 -13.322 < 2e-16 ***
## any_child1 0.14008 0.01069 13.109 < 2e-16 ***
## ruralityRural 0.19494 0.02845 6.852 7.3e-12 ***
## income_med1 -0.28496 0.01035 -27.542 < 2e-16 ***
## group_pr1 0.21406 0.01054 20.318 < 2e-16 ***
## frac_white_med1 0.30049 0.01039 28.931 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 221765 on 175074 degrees of freedom
## AIC: 221783
##
## Number of Fisher Scoring iterations: 4
anova(model2, model1, test='LR')
| 175074 |
221765.1 |
NA |
NA |
NA |
| 175073 |
221764.3 |
1 |
0.7955322 |
0.3724325 |
- There is no evidence that a binary supply variable is a better fit
than no supply variable. I will choose to use the continuous version of
the variable
#centering the provider supply variable for ease of interpretation
mean(data_acs_inc$provider_1k)
## [1] 0.9398009
data_acs_inc$c_provider_1k <- data_acs_inc$provider_1k - mean(data_acs_inc$provider_1k)
#because of rounding errors in R, it makes it so that the mean isn't EXACTLY centered at 0 but extremely close
model1 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child +
## rurality + income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.265712 0.012821 20.724 < 2e-16 ***
## c_provider_1k -0.132942 0.008291 -16.035 < 2e-16 ***
## provs.advanced-masters -0.401879 0.040727 -9.868 < 2e-16 ***
## provs.counselors-masters 0.466670 0.011454 40.741 < 2e-16 ***
## provs.psych-masters -0.185884 0.014553 -12.773 < 2e-16 ***
## any_child1 0.131055 0.010708 12.239 < 2e-16 ***
## ruralityRural 0.167692 0.028511 5.882 4.06e-09 ***
## income_med1 -0.224697 0.011018 -20.394 < 2e-16 ***
## group_pr1 0.226017 0.010571 21.382 < 2e-16 ***
## frac_white_med1 0.305407 0.010401 29.364 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 221506 on 175073 degrees of freedom
## AIC: 221526
##
## Number of Fisher Scoring iterations: 4
## GVIF Df GVIF^(1/(2*Df))
## c_provider_1k 1.160592 1 1.077308
## provs 1.012826 3 1.002126
## any_child 1.020213 1 1.010056
## rurality 1.056585 1 1.027903
## income_med 1.172852 1 1.082983
## group_pr 1.023455 1 1.011659
## frac_white_med 1.043309 1 1.021425
anova(model1, test = 'Chisq')
| NULL |
NA |
NA |
175082 |
226890.3 |
NA |
| c_provider_1k |
1 |
562.8089 |
175081 |
226327.5 |
0 |
| provs |
3 |
2700.1497 |
175078 |
223627.3 |
0 |
| any_child |
1 |
252.1089 |
175077 |
223375.2 |
0 |
| rurality |
1 |
152.3336 |
175076 |
223222.9 |
0 |
| income_med |
1 |
355.7611 |
175075 |
222867.1 |
0 |
| group_pr |
1 |
494.8203 |
175074 |
222372.3 |
0 |
| frac_white_med |
1 |
866.0003 |
175073 |
221506.3 |
0 |
Step-wise deletion and comparison with full model
options(width = 60)
model2 <- glm(insurance_flag ~ provs + any_child
+ rurality + income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model2)
anova(model2, model1, test='LR')
| 175074 |
221765.1 |
NA |
NA |
NA |
| 175073 |
221506.3 |
1 |
258.7809 |
0 |
model3 <- glm(insurance_flag ~ c_provider_1k + any_child
+ rurality + income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model3)
anova(model3, model1, test='LR')
| 175076 |
224156.2 |
NA |
NA |
NA |
| 175073 |
221506.3 |
3 |
2649.896 |
0 |
model4 <- glm(insurance_flag ~ c_provider_1k + provs
+ rurality + income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model4)
anova(model4, model1, test='LR')
| 175074 |
221655.6 |
NA |
NA |
NA |
| 175073 |
221506.3 |
1 |
149.348 |
0 |
model5 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model5)
anova(model5, model1, test='LR')
| 175074 |
221541.6 |
NA |
NA |
NA |
| 175073 |
221506.3 |
1 |
35.29536 |
0 |
model6 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model6)
anova(model6, model1, test='LR')
| 175074 |
221922.0 |
NA |
NA |
NA |
| 175073 |
221506.3 |
1 |
415.6647 |
0 |
model7 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + income_med + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model7)
anova(model7, model1, test='LR')
| 175074 |
221966.2 |
NA |
NA |
NA |
| 175073 |
221506.3 |
1 |
459.9373 |
0 |
model8 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + income_med + group_pr,
data = data_acs_inc, family = "binomial")
#summary(model8)
anova(model8, model1, test='LR')
| 175074 |
222372.3 |
NA |
NA |
NA |
| 175073 |
221506.3 |
1 |
866.0003 |
0 |
Exploring interaction term between rurality and supply
Looking at stratified models (rural vs urban)
data_acs_rural <- subset(data_acs_inc, rurality=="Rural")
data_acs_urban <- subset(data_acs_inc, rurality=="Metro")
Rural
modelrural <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ income_med + group_pr + frac_white_med, data = data_acs_rural, family = "binomial")
summary(modelrural)
##
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child +
## income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_rural)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.08052 0.11032 9.795
## c_provider_1k 0.14980 0.05714 2.622
## provs.advanced-masters -0.60596 0.21420 -2.829
## provs.counselors-masters 0.17028 0.06075 2.803
## provs.psych-masters -0.48604 0.08992 -5.405
## any_child1 0.42007 0.05786 7.260
## income_med1 -0.44180 0.08523 -5.184
## group_pr1 0.66626 0.06838 9.743
## frac_white_med1 -0.45054 0.09740 -4.626
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## c_provider_1k 0.00875 **
## provs.advanced-masters 0.00467 **
## provs.counselors-masters 0.00506 **
## provs.psych-masters 6.48e-08 ***
## any_child1 3.88e-13 ***
## income_med1 2.17e-07 ***
## group_pr1 < 2e-16 ***
## frac_white_med1 3.73e-06 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8133.8 on 7009 degrees of freedom
## Residual deviance: 7838.2 on 7001 degrees of freedom
## AIC: 7856.2
##
## Number of Fisher Scoring iterations: 4
round(
exp(
cbind(OR = coef(modelrural), confint(modelrural))
),
2)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.95 2.38 3.67
## c_provider_1k 1.16 1.04 1.30
## provs.advanced-masters 0.55 0.36 0.83
## provs.counselors-masters 1.19 1.05 1.34
## provs.psych-masters 0.62 0.52 0.73
## any_child1 1.52 1.36 1.70
## income_med1 0.64 0.54 0.76
## group_pr1 1.95 1.70 2.23
## frac_white_med1 0.64 0.53 0.77
#the reason that there might be some stronger associations than excepted could be slightly less power
# however, the cell sizes aren't horribly small
# this should most likely be interpreted carefully, as it is comparing to the median of the entire sample (urban and rural) and not rural alone--meaning that the interpretation of the median variables (income and frac white) may be unrepresentative in the stratified models
xtabs(~ insurance_flag + frac_white_med, data = data_acs_rural)
## frac_white_med
## insurance_flag 0 1
## 0 154 1717
## 1 620 4519
#this one is arguably worse
xtabs(~ insurance_flag + provs, data = data_acs_rural)
## provs
## insurance_flag advanced counselors masters psych
## 0 40 620 958 253
## 1 56 1976 2685 422
Non-rural
modelnonrural <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ income_med + group_pr + frac_white_med, data = data_acs_urban, family = "binomial")
summary(modelnonrural)
##
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child +
## income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_urban)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.264493 0.012969 20.394
## c_provider_1k -0.140263 0.008412 -16.674
## provs.advanced-masters -0.390371 0.041492 -9.408
## provs.counselors-masters 0.479358 0.011671 41.073
## provs.psych-masters -0.174075 0.014754 -11.799
## any_child1 0.120237 0.010899 11.032
## income_med1 -0.220380 0.011121 -19.817
## group_pr1 0.213119 0.010720 19.880
## frac_white_med1 0.315798 0.010485 30.120
## Pr(>|z|)
## (Intercept) <2e-16 ***
## c_provider_1k <2e-16 ***
## provs.advanced-masters <2e-16 ***
## provs.counselors-masters <2e-16 ***
## provs.psych-masters <2e-16 ***
## any_child1 <2e-16 ***
## income_med1 <2e-16 ***
## group_pr1 <2e-16 ***
## frac_white_med1 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 218521 on 168072 degrees of freedom
## Residual deviance: 213453 on 168064 degrees of freedom
## AIC: 213471
##
## Number of Fisher Scoring iterations: 4
round(
exp(
cbind(OR = coef(modelnonrural), confint(modelnonrural))
),
2)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.30 1.27 1.34
## c_provider_1k 0.87 0.85 0.88
## provs.advanced-masters 0.68 0.62 0.73
## provs.counselors-masters 1.62 1.58 1.65
## provs.psych-masters 0.84 0.82 0.86
## any_child1 1.13 1.10 1.15
## income_med1 0.80 0.78 0.82
## group_pr1 1.24 1.21 1.26
## frac_white_med1 1.37 1.34 1.40
Stratification results
- The stratified results are interesting, although I believe
continuing with the stratified models is inappropriate as I do not
believe that stratification appropriately handles all of the predictors.
I think that interpreting a model with an interaction term is more
appropriate.
Testing the accuracy of this model by training then testing
# Split data into train and test
set.seed(430)
split <- initial_split(data_acs_inc, prop = 0.8, strata = insurance_flag)
train <- split %>%
training()
test <- split %>%
testing()
mod_train <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child
+ income_med + group_pr + frac_white_med,
data = train, family = "binomial")
glm_probs = data.frame(probs = predict(mod_train,
newdata = test,
type="response"))
## Warning: contrasts dropped from factor provs
glm_pred = glm_probs %>%
mutate(pred = ifelse(probs>.5, "accept", "does not accept"))
glm_pred = cbind(test, glm_pred)
glm_pred %>%
count(pred, insurance_flag) %>%
spread(insurance_flag, n, fill = 0)
| accept |
11526 |
22009 |
| does not accept |
759 |
724 |
pr <- prediction(glm_probs,test$insurance_flag)
perf <- performance(pr,measure = "tpr",x.measure = "fpr")
plot(perf)

auc(test$insurance_flag,glm_probs)
## Warning in auc(test$insurance_flag, glm_probs): longer
## object length is not a multiple of shorter object length
## [1] 0.6024096
Random forest for variable importance
- This is where we are using the psych_forest dataset
set.seed(421)
split <- initial_split(psych_forest, prop = 0.7, strata = insurance_flag)
train <- split %>%
training()
test <- split %>%
testing()
Training
model <- randomForest(y=train[,1], x=train[,c(2:5, 7:8, 10)], data = train, ntree = 500,
mtry = 5, importance = TRUE , proximity = FALSE)
model
##
## Call:
## randomForest(x = train[, c(2:5, 7:8, 10)], y = train[, 1], ntree = 500, mtry = 5, importance = TRUE, proximity = FALSE, data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 31.75%
## Confusion matrix:
## 0 1 class.error
## 0 13878 29117 0.6772183
## 1 9791 69771 0.1230613
The next step is to validate our model using the test data
set.seed(421)
# The next step is to validate our model using the test data
prediction <- predict(model, newdata = test)
table(prediction, test$insurance_flag)
##
## prediction 0 1
## 0 5968 4165
## 1 12459 29934
results<-cbind(prediction,test$insurance_flag)
colnames(results)<-c('pred','real')
results<-as.data.frame(results)
# Finally, let’s calculate the accuracy of the model
sum(prediction==test$insurance_flag) / nrow(test)
## [1] 0.6835091
Variable importance
- The first column shows the predictor variables used in the RF model,
the following columns shows how much each of those variables contribute
to predicting absences (0), presences (1) or both
(MeanDecreaseAccuracy).
- The last column shows the node impurity of the model.
varImpPlot(model, sort=T, n.var= 7,
main= "Health insurance acceptance: yes or no", pch=16)

#another way to view this
imp=as.data.frame(importance(model)) #Analyze the importance of each variable in the model
imp=cbind(vars=rownames(imp),imp)
print(imp)
## vars 0 1
## provs provs 159.20444 123.23111
## any_child any_child 40.59555 93.43049
## group_pr group_pr 54.34715 118.34899
## rurality rurality 34.67945 19.18478
## income_med income_med 103.78935 57.29550
## frac_white_med frac_white_med 142.70234 73.24832
## provider_1k provider_1k 198.04780 104.40459
## MeanDecreaseAccuracy MeanDecreaseGini
## provs 244.87050 1264.2493
## any_child 116.21387 475.6065
## group_pr 151.69823 409.9147
## rurality 52.61039 164.3426
## income_med 149.13921 386.2367
## frac_white_med 200.31552 390.9788
## provider_1k 283.47960 6587.6862
barplot(imp$MeanDecreaseAccuracy, names.arg=imp$vars)

Examples of variable relationship with outcome and importance
par(mfrow = c(2, 2))
partialPlot(model, test, provider_1k, "1", xlab="Providers per 1000 county population",
ylab="Log of fraction of votes", lwd=4, col="green")
partialPlot(x=model, pred.data=test, x.var=any_child, which.class="1")
partialPlot(x=model, pred.data=test, x.var=rurality, which.class="1")
partialPlot(x=model, pred.data=test, x.var=provs, which.class="1")

par(mfrow = c(2, 2))
partialPlot(x=model, pred.data=test, x.var=income_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=frac_white_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=group_pr, which.class="1")
