Context
This code explores the factors associated with insurance acceptance among mental health providers (data from Psychology Today).
- The following steps were taken:
- Exploratory univariate analysis
- Building the full logistic regression model
- Exploring the provider supply variable: binary vs. continuous
- Comparing reduced models to the full model
- Exploring the interaction between provider supply and rurality
- Exploring stratified models (for visual comparison only; stratification was not ultimately chosen as the final approach)
- Running the full model through a training/testing split and assessing its accuracy
- Running the model through a random forest to better understand the importance of each variable for insurance acceptance among mental health providers
Code book for variables in full model:
- c_provider_1k: the supply of all types of mental health providers per 1,000 county residents, centered at its mean
- provs: provider type, in categories
- Levels: "Masters level therapists and counselors", "All other therapists and counselors", "Psychologists/PhD", "Advanced practice providers"
- Level names were shortened so they display cleanly in output
- any_child: does the provider also see clients under 18?
- rurality: binary, urban or rural
- Based on the categories of the 2013 and 2006 NCHS Urban-Rural Classification Scheme for Counties
- Metro categories 1-4 = urban
- Metro categories 5-6 = rural (micropolitan and non-core)
- income_med: binary
- Whether the provider is in a county with a median household income above (1) or below (0) the median of the sample
- Data from the ACS 5-year estimates
- group_pr: is the provider part of a group practice?
- Binary; determined by grouping providers on exact addresses
- frac_white_med: binary
- Whether the provider is in a county whose fraction of white residents is above (1) or below (0) the median of the sample
- Data from the ACS 5-year estimates
data_acs <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/working data acs 4-5-24.csv")
data_acs <- data_acs[which(!is.na(data_acs$id)), ] # drop rows with missing id, in case the CSV re-import introduces blank rows
#Desc(data_acs$id)
#length of df should be 184'957
# unique IDs should be 167'854
Load Required Packages
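The package chunk itself is not shown in the source; below is a sketch of the library() calls implied by the functions used later in this script. The package attributions are my assumptions.
# Assumed package list, inferred from the function calls that follow
library(dplyr)        # left_join(), mutate(), case_when(), group_by(), ntile(), select()
library(tidyr)        # spread()
library(ggplot2)      # histograms
library(skimr)        # skim()
library(DescTools)    # Desc()
library(faux)         # contr_code_treatment()
library(car)          # vif() / GVIF output
library(rsample)      # initial_split(), training(), testing()
library(ROCR)         # prediction(), performance()
library(Metrics)      # auc() (assumed source of the auc() call)
library(randomForest) # randomForest(), varImpPlot(), partialPlot(), importance()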
Explore urbanicity
metros <- read.csv("K:/WorkingData2/User_Folders/Aine/Project Work/Psychology Today/Original files/NCHSURCodes2013.csv")
data_acs <- left_join(data_acs, metros, by = "fips")
data_acs <- data_acs %>%
  mutate(rurality = case_when(metro_codes %in% 1:4 ~ "Urban",
                              metro_codes %in% 5:6 ~ "Rural"))
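A quick sanity check, not in the original, that the recode covered every record; any NA rurality would flag an unmatched FIPS code or a metro code outside 1-6 after the join.
# hypothetical check of the rurality recode
table(data_acs$rurality, useNA = "ifany")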
Model building
Exhibit 4. Regression of provider- and county/zip-level covariates on insurance acceptance:
1. County or zip code composition (population, % white, HRSA supply of MH professionals, income)
2. Provider-level covariates: years in practice, psychologist vs. therapist, telehealth, gender, number of locations (addresses can be used to identify multi-provider groups, e.g., counting the number of providers at the same location), and whether providers themselves practice in more than one location
Preparing the data
data_acs_inc <- data_acs[which(data_acs$excluded==0),]
data_acs_inc <- data_acs_inc %>%
group_by(GEOID) %>%
mutate(provider_1k = (n() / county_population)*1000) %>%
ungroup()
# note: the median splits below are at the provider level, not the county level;
# they compare observations (providers) in the dataset against each other rather than comparing counties
data_acs_inc <- data_acs_inc %>%
mutate(supply_med = if_else(provider_1k>median(provider_1k), 1, 0)
)
data_acs_inc <- data_acs_inc %>%
mutate(income_med = if_else(median_income>median(median_income), 1, 0)
)
data_acs_inc <- data_acs_inc %>%
mutate(frac_white_med = if_else(fraction_white>median(fraction_white), 1, 0)
)
# var manipulation
skim(data_acs_inc$provider_1k)
Data summary: data_acs_inc$provider_1k (175083 rows, 1 numeric column, no group variables)

Variable type: numeric

| variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 0.94 | 0.65 | 0 | 0.52 | 0.76 | 1.23 | 10.49 | ▇▁▁▁▁ |
skim(data_acs_inc$median_income)
Data summary: data_acs_inc$median_income (175083 rows, 1 numeric column, no group variables)

Variable type: numeric

| variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 79975.52 | 19788.68 | 24349 | 65554.5 | 76367 | 92600 | 156821 | ▁▇▅▂▁ |
skim(data_acs_inc$fraction_white)
Data summary: data_acs_inc$fraction_white (175083 rows, 1 numeric column, no group variables)

Variable type: numeric

| variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 0 | 1 | 0.65 | 0.15 | 0.1 | 0.53 | 0.66 | 0.77 | 0.98 | ▁▂▆▇▃ |
#binary variable for provider supply above or below the median
Desc(data_acs_inc$supply_med)
## ------------------------------------------------------------------------------
## data_acs_inc$supply_med (numeric)
##
## length n NAs unique 0s mean meanCI'
## 175'083 175'083 0 2 87'734 0.50 0.50
## 100.0% 0.0% 50.1% 0.50
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 1.00 0.50 1.00 0.00 1.00 0.00 -2.00
##
##
## value freq perc cumfreq cumperc
## 1 0 87'734 50.1% 87'734 50.1%
## 2 1 87'349 49.9% 175'083 100.0%
##
## ' 95%-CI (classic)

Desc(data_acs_inc$income_med)
## ------------------------------------------------------------------------------
## data_acs_inc$income_med (numeric)
##
## length n NAs unique 0s mean meanCI'
## 175'083 175'083 0 2 92'689 0.47 0.47
## 100.0% 0.0% 52.9% 0.47
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 1.00 0.50 1.06 0.00 1.00 0.12 -1.99
##
##
## value freq perc cumfreq cumperc
## 1 0 92'689 52.9% 92'689 52.9%
## 2 1 82'394 47.1% 175'083 100.0%
##
## ' 95%-CI (classic)

Desc(data_acs_inc$frac_white_med)
## ------------------------------------------------------------------------------
## data_acs_inc$frac_white_med (numeric)
##
## length n NAs unique 0s mean meanCI'
## 175'083 175'083 0 2 87'597 0.50 0.50
## 100.0% 0.0% 50.0% 0.50
##
## .05 .10 .25 median .75 .90 .95
## 0.00 0.00 0.00 0.00 1.00 1.00 1.00
##
## range sd vcoef mad IQR skew kurt
## 1.00 0.50 1.00 0.00 1.00 0.00 -2.00
##
##
## value freq perc cumfreq cumperc
## 1 0 87'597 50.0% 87'597 50.0%
## 2 1 87'486 50.0% 175'083 100.0%
##
## ' 95%-CI (classic)

The tertile coding below probably is not needed anymore
#making tertiles
data_acs_inc <- data_acs_inc %>%
mutate(supply_tert = ntile(provider_1k, 3))
data_acs_inc <- data_acs_inc %>%
mutate(supply_tert = case_when(supply_tert==1 ~ "low",
supply_tert==2 ~ "mid",
supply_tert==3 ~ "high")
)
# Think about keeping supply as a continuous variable
# look at distribution of continuous supply var
data_acs_inc$supply_tert <- as.factor(data_acs_inc$supply_tert)
data_acs_inc <- data_acs_inc %>%
mutate(provider_type = case_when(provider_type=="Masters level therapists and counselors" ~ "masters",
provider_type=="All other therapists and counselors" ~ "counselors",
provider_type=="Psychologists/PhD" ~ "psych",
provider_type=="Advanced practice providers" ~ "advanced")
)
data_acs_inc$provs <- as.factor(data_acs_inc$provider_type)
data_acs_inc$insurance_flag <- as.factor(data_acs_inc$insurance_flag)
data_acs_inc$any_child <- as.factor(data_acs_inc$any_child)
data_acs_inc$region <- as.factor(data_acs_inc$region)
data_acs_inc$group_pr <- as.factor(data_acs_inc$group_pr)
data_acs_inc$telehealth <- as.factor(data_acs_inc$telehealth)
data_acs_inc$rurality <- as.factor(data_acs_inc$rurality)
data_acs_inc$supply_med <- as.factor(data_acs_inc$supply_med)
data_acs_inc$income_med <- as.factor(data_acs_inc$income_med)
data_acs_inc$frac_white_med <- as.factor(data_acs_inc$frac_white_med)
# the provider_type level names were shortened above so the long labels do not clutter
# the model output
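As an aside, the repeated as.factor() assignments above could be collapsed with mutate(across()); an equivalent sketch (same behavior, not the original code):
data_acs_inc <- data_acs_inc %>%
  mutate(provs = as.factor(provider_type),
         across(c(insurance_flag, any_child, region, group_pr, telehealth,
                  rurality, supply_med, income_med, frac_white_med),
                as.factor))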
Adjustments after meeting with Jane on 4/23/24
- Indicator variable: rural versus not rural (micropolitan and non-core versus the other categories)
- Then revisit supply as continuous versus above/below the median (dichotomous)
- Possibly also an interaction term between rurality and supply
- For median income: either a dichotomous variable or percentiles (non-continuous)
- Also look at fraction white in percentiles
Exploring the distribution of provider supply (continuous)
#provider supply
data_acs_inc %>%
group_by(insurance_flag) %>%
skim(provider_1k)
Data summary: piped data, grouped by insurance_flag (175083 rows, 72 columns, 1 numeric column skimmed)

Variable type: numeric

| variable | insurance_flag | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| provider_1k | 0 | 0 | 1 | 0.99 | 0.74 | 0 | 0.53 | 0.76 | 1.3 | 10.49 | ▇▁▁▁▁ |
| provider_1k | 1 | 0 | 1 | 0.91 | 0.59 | 0 | 0.51 | 0.75 | 1.2 | 10.49 | ▇▁▁▁▁ |
# provider supply by insurance acceptance
p <- data_acs_inc %>%
  ggplot(aes(x = provider_1k, fill = insurance_flag)) +
  geom_histogram(alpha = 0.9, color = "white") +
  ggtitle("Distribution of provider supply and insurance acceptance") +
  theme(
    plot.title = element_text(size = 10)
  )
# provider supply by rurality
p2 <- data_acs_inc %>%
  ggplot(aes(x = provider_1k, fill = rurality)) +
  geom_histogram(alpha = 0.9, color = "white") +
  ggtitle("Distribution of provider supply in rural and urban areas") +
  theme(
    plot.title = element_text(size = 10)
  )
p
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Exploring the distributions
- The psych_forest dataset contains all of our variables of interest. It will be used for the random forest later in the code, but it is convenient here as well for exploring the distributions of those variables.
psych_forest <- as.data.frame(select(data_acs_inc, c("insurance_flag","provs","any_child","group_pr", "rurality", "supply_med", "income_med", "frac_white_med", "median_income", "provider_1k","fraction_white")))
#remove frac no health insurance
#rural urban versus region (take out region and include rural urban on its own)
# then we can see
Desc(psych_forest[1:8])
## ------------------------------------------------------------------------------
## Describe psych_forest[1:8] (data.frame):
##
## data frame: 175083 obs. of 8 variables
## 175083 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 insurance_flag factor . (2): 1-0, 2-1
## 2 provs factor . (4): 1-advanced, 2-counselors, 3-masters,
## 4-psych
## 3 any_child factor . (2): 1-0, 2-1
## 4 group_pr factor . (2): 1-0, 2-1
## 5 rurality factor . (2): 1-Rural, 2-Urban
## 6 supply_med factor . (2): 1-0, 2-1
## 7 income_med factor . (2): 1-0, 2-1
## 8 frac_white_med factor . (2): 1-0, 2-1
##
##
## ------------------------------------------------------------------------------
## 1 - insurance_flag (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 61'422 35.1% 34.9% 35.3%
## 1 113'661 64.9% 64.7% 65.1%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 2 - provs (factor)
##
## length n NAs unique levels dupes
## 175'083 175'083 0 4 4 y
## 100.0% 0.0%
##
## level freq perc cumfreq cumperc
## 1 masters 80'573 46.0% 80'573 46.0%
## 2 counselors 65'478 37.4% 146'051 83.4%
## 3 psych 26'489 15.1% 172'540 98.5%
## 4 advanced 2'543 1.5% 175'083 100.0%

## ------------------------------------------------------------------------------
## 3 - any_child (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 61'038 34.9% 34.6% 35.1%
## 1 114'045 65.1% 64.9% 65.4%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 4 - group_pr (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 104'650 59.8% 59.5% 60.0%
## 1 70'433 40.2% 40.0% 40.5%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 5 - rurality (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## Rural 7'010 4.0% 3.9% 4.1%
## Urban 168'073 96.0% 95.9% 96.1%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 6 - supply_med (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 87'734 50.1% 49.9% 50.3%
## 1 87'349 49.9% 49.7% 50.1%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 7 - income_med (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 92'689 52.9% 52.7% 53.2%
## 1 82'394 47.1% 46.8% 47.3%
##
## ' 95%-CI (Wilson)

## ------------------------------------------------------------------------------
## 8 - frac_white_med (factor - dichotomous)
##
## length n NAs unique
## 175'083 175'083 0 2
## 100.0% 0.0%
##
## freq perc lci.95 uci.95'
## 0 87'597 50.0% 49.8% 50.3%
## 1 87'486 50.0% 49.7% 50.2%
##
## ' 95%-CI (Wilson)

skim(psych_forest[9:11])

Data summary: psych_forest[9:11] (175083 rows, 3 numeric columns, no group variables)

Variable type: numeric

| variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| median_income | 0 | 1 | 79975.52 | 19788.68 | 24349.0 | 65554.50 | 76367.00 | 92600.00 | 156821.00 | ▁▇▅▂▁ |
| provider_1k | 0 | 1 | 0.94 | 0.65 | 0.0 | 0.52 | 0.76 | 1.23 | 10.49 | ▇▁▁▁▁ |
| fraction_white | 0 | 1 | 0.65 | 0.15 | 0.1 | 0.53 | 0.66 | 0.77 | 0.98 | ▁▂▆▇▃ |
Exploration of possible rurality*supply interaction term
xtabs(~ rurality + supply_med, data = psych_forest)
## supply_med
## rurality 0 1
## Rural 4859 2151
## Urban 82875 85198
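Row proportions (not computed in the original) make the imbalance clearer: rural providers sit mostly in below-median-supply counties, while urban providers are split roughly evenly.
# row proportions of the cross-tab above
round(prop.table(xtabs(~ rurality + supply_med, data = psych_forest), margin = 1), 3)
# Rural: about 0.69 below the median supply vs 0.31 above; Urban: about 0.49 vs 0.51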
Univariate models
#by default the reference category is advanced practice providers (the first factor level)
#advanced practice providers are also the smallest group, which may not make a good
#reference category if cells are small. I will switch the reference to masters-level
#providers for now, since that is the largest group
data_acs_inc$provs <- contr_code_treatment(data_acs_inc$provs,
base = "masters")
uni_prov <- glm(insurance_flag ~ provs, data = data_acs_inc,
family = "binomial")
summary(uni_prov)
##
## Call:
## glm(formula = insurance_flag ~ provs, family = "binomial", data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.50325 0.00727 69.222 <2e-16 ***
## provs.advanced-masters -0.38909 0.04038 -9.635 <2e-16 ***
## provs.counselors-masters 0.44135 0.01134 38.917 <2e-16 ***
## provs.psych-masters -0.24335 0.01437 -16.938 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 224196 on 175079 degrees of freedom
## AIC: 224204
##
## Number of Fisher Scoring iterations: 4
uni_prov_con <- glm(insurance_flag ~ provider_1k, data = data_acs_inc,
family = "binomial")
summary(uni_prov_con)
##
## Call:
## glm(formula = insurance_flag ~ provider_1k, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.787316 0.008931 88.16 <2e-16 ***
## provider_1k -0.181152 0.007736 -23.42 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226327 on 175081 degrees of freedom
## AIC: 226331
##
## Number of Fisher Scoring iterations: 4
#some other primary independent var models
uni_child <- glm(insurance_flag ~ any_child, data = data_acs_inc,
family = "binomial")
summary(uni_child)
##
## Call:
## glm(formula = insurance_flag ~ any_child, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.509375 0.008359 60.94 <2e-16 ***
## any_child1 0.164224 0.010444 15.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226644 on 175081 degrees of freedom
## AIC: 226648
##
## Number of Fisher Scoring iterations: 4
uni_rural <- glm(insurance_flag ~ rurality, data = data_acs_inc,
family = "binomial")
summary(uni_rural)
##
## Call:
## glm(formula = insurance_flag ~ rurality, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.01039 0.02700 37.42 <2e-16 ***
## ruralityUrban -0.41027 0.02748 -14.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226655 on 175081 degrees of freedom
## AIC: 226659
##
## Number of Fisher Scoring iterations: 4
uni_income <- glm(insurance_flag ~ income_med, data = data_acs_inc,
family = "binomial")
summary(uni_income)
##
## Call:
## glm(formula = insurance_flag ~ income_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.749584 0.007036 106.53 <2e-16 ***
## income_med1 -0.278914 0.010040 -27.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226117 on 175081 degrees of freedom
## AIC: 226121
##
## Number of Fisher Scoring iterations: 4
uni_group <- glm(insurance_flag ~ group_pr, data = data_acs_inc,
family = "binomial")
summary(uni_group)
##
## Call:
## glm(formula = insurance_flag ~ group_pr, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.524293 0.006396 81.97 <2e-16 ***
## group_pr1 0.231474 0.010306 22.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226382 on 175081 degrees of freedom
## AIC: 226386
##
## Number of Fisher Scoring iterations: 4
uni_prov1k <- glm(insurance_flag ~ supply_med, data = data_acs_inc,
family = "binomial")
summary(uni_prov1k)
##
## Call:
## glm(formula = insurance_flag ~ supply_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.656524 0.007119 92.218 < 2e-16 ***
## supply_med1 -0.081823 0.010018 -8.167 3.15e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 226824 on 175081 degrees of freedom
## AIC: 226828
##
## Number of Fisher Scoring iterations: 4
uni_frac_wh <- glm(insurance_flag ~ frac_white_med, data = data_acs_inc,
family = "binomial")
summary(uni_frac_wh)
##
## Call:
## glm(formula = insurance_flag ~ frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.465865 0.006942 67.11 <2e-16 ***
## frac_white_med1 0.306381 0.010053 30.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 225958 on 175081 degrees of freedom
## AIC: 225962
##
## Number of Fisher Scoring iterations: 4
Full model with all vars
Checking the binary variable for provider supply
model1 <- glm(insurance_flag ~ supply_med + provs + any_child
+ rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = insurance_flag ~ supply_med + provs + any_child +
## rurality + income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.494212 0.030202 16.364 < 2e-16 ***
## supply_med1 -0.009605 0.010769 -0.892 0.372
## provs.advanced-masters -0.400754 0.040696 -9.848 < 2e-16 ***
## provs.counselors-masters 0.461996 0.011450 40.348 < 2e-16 ***
## provs.psych-masters -0.193292 0.014536 -13.297 < 2e-16 ***
## any_child1 0.139653 0.010696 13.057 < 2e-16 ***
## ruralityUrban -0.194344 0.028460 -6.829 8.57e-12 ***
## income_med1 -0.281956 0.010880 -25.914 < 2e-16 ***
## group_pr1 0.214758 0.010564 20.328 < 2e-16 ***
## frac_white_med1 0.300357 0.010387 28.916 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 221764 on 175073 degrees of freedom
## AIC: 221784
##
## Number of Fisher Scoring iterations: 4
model2 <- glm(insurance_flag ~ provs + any_child
+ rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = insurance_flag ~ provs + any_child + rurality +
## income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.49156 0.03005 16.356 < 2e-16 ***
## provs.advanced-masters -0.40106 0.04069 -9.856 < 2e-16 ***
## provs.counselors-masters 0.46156 0.01144 40.348 < 2e-16 ***
## provs.psych-masters -0.19360 0.01453 -13.322 < 2e-16 ***
## any_child1 0.14008 0.01069 13.109 < 2e-16 ***
## ruralityUrban -0.19494 0.02845 -6.852 7.3e-12 ***
## income_med1 -0.28496 0.01035 -27.542 < 2e-16 ***
## group_pr1 0.21406 0.01054 20.318 < 2e-16 ***
## frac_white_med1 0.30049 0.01039 28.931 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 221765 on 175074 degrees of freedom
## AIC: 221783
##
## Number of Fisher Scoring iterations: 4
anova(model2, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model2 (no supply term) | 175074 | 221765.1 | NA | NA | NA |
| model1 (binary supply) | 175073 | 221764.3 | 1 | 0.7955322 | 0.3724325 |
- The likelihood ratio test gives no evidence that the binary supply variable improves fit over the model without it (p = 0.37). I will use the continuous version of the supply variable instead.
#centering the provider supply variable for ease of interpretation
mean(data_acs_inc$provider_1k)
## [1] 0.9398009
data_acs_inc$c_provider_1k <- data_acs_inc$provider_1k - mean(data_acs_inc$provider_1k)
# floating-point rounding means the centered mean is not exactly 0, but it is extremely close
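A quick check, not in the original, that centering behaved as expected:
mean(data_acs_inc$c_provider_1k) # essentially 0 (floating-point noise)
sd(data_acs_inc$c_provider_1k)   # unchanged by centering; same sd as provider_1k (about 0.65)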
model1 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + income_med + group_pr + frac_white_med, data = data_acs_inc, family = "binomial")
summary(model1)
##
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child +
## rurality + income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_inc)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.433403 0.030281 14.312 < 2e-16 ***
## c_provider_1k -0.132942 0.008291 -16.035 < 2e-16 ***
## provs.advanced-masters -0.401879 0.040727 -9.868 < 2e-16 ***
## provs.counselors-masters 0.466670 0.011454 40.741 < 2e-16 ***
## provs.psych-masters -0.185884 0.014553 -12.773 < 2e-16 ***
## any_child1 0.131055 0.010708 12.239 < 2e-16 ***
## ruralityUrban -0.167692 0.028511 -5.882 4.06e-09 ***
## income_med1 -0.224697 0.011018 -20.394 < 2e-16 ***
## group_pr1 0.226017 0.010571 21.382 < 2e-16 ***
## frac_white_med1 0.305407 0.010401 29.364 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 226890 on 175082 degrees of freedom
## Residual deviance: 221506 on 175073 degrees of freedom
## AIC: 221526
##
## Number of Fisher Scoring iterations: 4
vif(model1) # generalized variance inflation factors (assumed call: car::vif; not shown with this output)
##                    GVIF Df GVIF^(1/(2*Df))
## c_provider_1k 1.160592 1 1.077308
## provs 1.012826 3 1.002126
## any_child 1.020213 1 1.010056
## rurality 1.056585 1 1.027903
## income_med 1.172852 1 1.082983
## group_pr 1.023455 1 1.011659
## frac_white_med 1.043309 1 1.021425
anova(model1, test = 'Chisq')
| Term | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) |
|---|---|---|---|---|---|
| NULL | NA | NA | 175082 | 226890.3 | NA |
| c_provider_1k | 1 | 562.8089 | 175081 | 226327.5 | 0 |
| provs | 3 | 2700.1497 | 175078 | 223627.3 | 0 |
| any_child | 1 | 252.1089 | 175077 | 223375.2 | 0 |
| rurality | 1 | 152.3336 | 175076 | 223222.9 | 0 |
| income_med | 1 | 355.7611 | 175075 | 222867.1 | 0 |
| group_pr | 1 | 494.8203 | 175074 | 222372.3 | 0 |
| frac_white_med | 1 | 866.0003 | 175073 | 221506.3 | 0 |
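Odds ratios for the full model are not shown in the source; the same extraction pattern used later for the stratified models could be applied here, a sketch:
round(
  exp(
    cbind(OR = coef(model1), confint(model1))
  ),
2)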
Stepwise deletion and comparison with the full model
options(width = 60)
model2 <- glm(insurance_flag ~ provs + any_child
+ rurality + income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model2)
anova(model2, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model2 (drops c_provider_1k) | 175074 | 221765.1 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 1 | 258.7809 | 0 |
model3 <- glm(insurance_flag ~ c_provider_1k + any_child
+ rurality + income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model3)
anova(model3, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model3 (drops provs) | 175076 | 224156.2 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 3 | 2649.896 | 0 |
model4 <- glm(insurance_flag ~ c_provider_1k + provs
+ rurality + income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model4)
anova(model4, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model4 (drops any_child) | 175074 | 221655.6 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 1 | 149.348 | 0 |
model5 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ income_med + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model5)
anova(model5, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model5 (drops rurality) | 175074 | 221541.6 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 1 | 35.29536 | 0 |
model6 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + group_pr + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model6)
anova(model6, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model6 (drops income_med) | 175074 | 221922.0 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 1 | 415.6647 | 0 |
model7 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + income_med + frac_white_med,
data = data_acs_inc, family = "binomial")
#summary(model7)
anova(model7, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model7 (drops group_pr) | 175074 | 221966.2 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 1 | 459.9373 | 0 |
model8 <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ rurality + income_med + group_pr,
data = data_acs_inc, family = "binomial")
#summary(model8)
anova(model8, model1, test='LR')
| Model | Resid. Df | Resid. Dev | Df | Deviance | Pr(>Chi) |
|---|---|---|---|---|---|
| model8 (drops frac_white_med) | 175074 | 222372.3 | NA | NA | NA |
| model1 (full) | 175073 | 221506.3 | 1 | 866.0003 | 0 |
Exploring the interaction term between rurality and supply
Looking at stratified models (rural vs. urban)
data_acs_rural <- subset(data_acs_inc, rurality=="Rural")
data_acs_urban <- subset(data_acs_inc, rurality=="Urban")
Rural
modelrural <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ income_med + group_pr + frac_white_med, data = data_acs_rural, family = "binomial")
summary(modelrural)
##
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child +
## income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_rural)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.08052 0.11032 9.795
## c_provider_1k 0.14980 0.05714 2.622
## provs.advanced-masters -0.60596 0.21420 -2.829
## provs.counselors-masters 0.17028 0.06075 2.803
## provs.psych-masters -0.48604 0.08992 -5.405
## any_child1 0.42007 0.05786 7.260
## income_med1 -0.44180 0.08523 -5.184
## group_pr1 0.66626 0.06838 9.743
## frac_white_med1 -0.45054 0.09740 -4.626
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## c_provider_1k 0.00875 **
## provs.advanced-masters 0.00467 **
## provs.counselors-masters 0.00506 **
## provs.psych-masters 6.48e-08 ***
## any_child1 3.88e-13 ***
## income_med1 2.17e-07 ***
## group_pr1 < 2e-16 ***
## frac_white_med1 3.73e-06 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8133.8 on 7009 degrees of freedom
## Residual deviance: 7838.2 on 7001 degrees of freedom
## AIC: 7856.2
##
## Number of Fisher Scoring iterations: 4
round(
exp(
cbind(OR = coef(modelrural), confint(modelrural))
),
2)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.95 2.38 3.67
## c_provider_1k 1.16 1.04 1.30
## provs.advanced-masters 0.55 0.36 0.83
## provs.counselors-masters 1.19 1.05 1.34
## provs.psych-masters 0.62 0.52 0.73
## any_child1 1.52 1.36 1.70
## income_med1 0.64 0.54 0.76
## group_pr1 1.95 1.70 2.23
## frac_white_med1 0.64 0.53 0.77
#stronger-than-expected associations here may partly reflect the lower power of the rural subsample
# however, the cell sizes are not alarmingly small
# interpret with care: the median-split variables (income and fraction white) are defined relative to the full sample (urban and rural combined), not the rural subsample alone, so their interpretation in the stratified models may be unrepresentative
xtabs(~ insurance_flag + frac_white_med, data = data_acs_rural)
## frac_white_med
## insurance_flag 0 1
## 0 154 1717
## 1 620 4519
#the cell sizes here are arguably more of a concern (advanced practice providers are sparse in the rural subsample)
xtabs(~ insurance_flag + provs, data = data_acs_rural)
## provs
## insurance_flag advanced counselors masters psych
## 0 40 620 958 253
## 1 56 1976 2685 422
Urban
modelurban <- glm(insurance_flag ~ c_provider_1k + provs + any_child
+ income_med + group_pr + frac_white_med, data = data_acs_urban, family = "binomial")
summary(modelurban)
##
## Call:
## glm(formula = insurance_flag ~ c_provider_1k + provs + any_child +
## income_med + group_pr + frac_white_med, family = "binomial",
## data = data_acs_urban)
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.264493 0.012969 20.394
## c_provider_1k -0.140263 0.008412 -16.674
## provs.advanced-masters -0.390371 0.041492 -9.408
## provs.counselors-masters 0.479358 0.011671 41.073
## provs.psych-masters -0.174075 0.014754 -11.799
## any_child1 0.120237 0.010899 11.032
## income_med1 -0.220380 0.011121 -19.817
## group_pr1 0.213119 0.010720 19.880
## frac_white_med1 0.315798 0.010485 30.120
## Pr(>|z|)
## (Intercept) <2e-16 ***
## c_provider_1k <2e-16 ***
## provs.advanced-masters <2e-16 ***
## provs.counselors-masters <2e-16 ***
## provs.psych-masters <2e-16 ***
## any_child1 <2e-16 ***
## income_med1 <2e-16 ***
## group_pr1 <2e-16 ***
## frac_white_med1 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 218521 on 168072 degrees of freedom
## Residual deviance: 213453 on 168064 degrees of freedom
## AIC: 213471
##
## Number of Fisher Scoring iterations: 4
round(
exp(
cbind(OR = coef(modelurban), confint(modelurban))
),
2)
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.30 1.27 1.34
## c_provider_1k 0.87 0.85 0.88
## provs.advanced-masters 0.68 0.62 0.73
## provs.counselors-masters 1.62 1.58 1.65
## provs.psych-masters 0.84 0.82 0.86
## any_child1 1.13 1.10 1.15
## income_med1 0.80 0.78 0.82
## group_pr1 1.24 1.21 1.26
## frac_white_med1 1.37 1.34 1.40
Stratification results
- The stratified results are interesting, but continuing with the stratified models seems inappropriate: stratification does not handle all of the predictors well (notably the sample-wide median splits). Interpreting a single model with an interaction term is more appropriate.
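The interaction model itself is not fit in this section; a minimal sketch of what it would look like, using the same formula later used for mod_train, with a likelihood ratio comparison against model1 (which is nested within it):
model_int <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child
                 + income_med + group_pr + frac_white_med,
                 data = data_acs_inc, family = "binomial")
#summary(model_int)
anova(model1, model_int, test = 'LR') # tests whether the supply-by-rurality interaction improves fit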
Testing the accuracy of this model with a train/test split
# Split data into train and test
set.seed(430)
split <- initial_split(data_acs_inc, prop = 0.8, strata = insurance_flag)
train <- split %>%
training()
test <- split %>%
testing()
mod_train <- glm(insurance_flag ~ c_provider_1k*rurality + provs + any_child
+ income_med + group_pr + frac_white_med,
data = train, family = "binomial")
glm_probs = data.frame(probs = predict(mod_train,
newdata = test,
type="response"))
## Warning: contrasts dropped from factor provs
glm_pred = glm_probs %>%
mutate(pred = ifelse(probs>.5, "accept", "does not accept"))
glm_pred = cbind(test, glm_pred)
glm_pred %>%
count(pred, insurance_flag) %>%
spread(insurance_flag, n, fill = 0)
| pred | 0 | 1 |
|---|---|---|
| accept | 11526 | 22009 |
| does not accept | 759 | 724 |
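Overall accuracy at the 0.5 threshold is not computed in the original; a quick sketch using the predictions above (from the table, roughly (22009 + 759) / 35018, or about 0.65):
# share of test-set providers whose predicted label matches the observed flag
mean((glm_pred$pred == "accept") == (glm_pred$insurance_flag == "1"))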
pr <- prediction(glm_probs,test$insurance_flag)
perf <- performance(pr,measure = "tpr",x.measure = "fpr")
plot(perf)

auc(test$insurance_flag,glm_probs)
## Warning in auc(test$insurance_flag, glm_probs): longer
## object length is not a multiple of shorter object length
## [1] 0.6024096
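The length warning above is likely caused by passing the one-column data frame glm_probs, rather than the numeric probability vector, to auc(). A sketch of the call with the column extracted, assuming auc() takes actual then predicted:
auc(test$insurance_flag, glm_probs$probs)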
Random forest for variable importance
- This is where we use the psych_forest dataset
set.seed(421)
split <- initial_split(psych_forest, prop = 0.7, strata = insurance_flag)
train <- split %>%
training()
test <- split %>%
testing()
Training
model <- randomForest(y=train[,1], x=train[,c(2:5, 7:8, 10)], data = train, ntree = 500,
mtry = 5, importance = TRUE , proximity = FALSE)
model
##
## Call:
## randomForest(x = train[, c(2:5, 7:8, 10)], y = train[, 1], ntree = 500, mtry = 5, importance = TRUE, proximity = FALSE, data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 31.79%
## Confusion matrix:
## 0 1 class.error
## 0 13811 29184 0.6787766
## 1 9778 69784 0.1228979
The next step is to validate our model using the test data
set.seed(421)
# The next step is to validate our model using the test data
prediction <- predict(model, newdata = test)
table(prediction, test$insurance_flag)
##
## prediction 0 1
## 0 5958 4143
## 1 12469 29956
results<-cbind(prediction,test$insurance_flag)
colnames(results)<-c('pred','real')
results<-as.data.frame(results)
# Finally, let’s calculate the accuracy of the model
sum(prediction==test$insurance_flag) / nrow(test)
## [1] 0.6837376
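Overall accuracy hides the class imbalance visible in the confusion matrix (most providers who do not accept insurance are misclassified); a sketch, not in the original, of per-class recall from the test table:
cm <- table(prediction, test$insurance_flag)
cm["1", "1"] / sum(cm[, "1"]) # recall for providers accepting insurance, about 0.88
cm["0", "0"] / sum(cm[, "0"]) # recall for providers not accepting insurance, about 0.32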
Variable importance
- The first column lists the predictor variables used in the RF model; the next columns show how much each variable contributes to predicting not accepting insurance (0), accepting insurance (1), or both (MeanDecreaseAccuracy).
- The last column, MeanDecreaseGini, is a node-impurity-based measure of importance.
varImpPlot(model, sort=T, n.var= 7,
main= "Health insurance acceptance: yes or no", pch=16)

#another way to view this
imp=as.data.frame(importance(model)) #Analyze the importance of each variable in the model
imp=cbind(vars=rownames(imp),imp)
print(imp)
## vars 0 1
## provs provs 162.63527 125.04779
## any_child any_child 41.14348 100.36086
## group_pr group_pr 56.62383 113.33937
## rurality rurality 34.71668 19.52115
## income_med income_med 102.98961 52.83670
## frac_white_med frac_white_med 154.88514 73.26973
## provider_1k provider_1k 202.35714 101.38273
## MeanDecreaseAccuracy MeanDecreaseGini
## provs 244.6241 1275.8302
## any_child 121.7993 466.0200
## group_pr 148.5111 415.3831
## rurality 56.6692 165.5525
## income_med 147.2510 386.2581
## frac_white_med 206.1214 393.0726
## provider_1k 277.9238 6586.7761
barplot(imp$MeanDecreaseAccuracy, names.arg=imp$vars)

Examples of variable relationships with the outcome (partial dependence) and their importance
par(mfrow = c(2, 2))
partialPlot(model, test, provider_1k, "1", xlab="Providers per 1000 county population",
ylab="Partial dependence (logit scale)", lwd=4, col="green")
partialPlot(x=model, pred.data=test, x.var=any_child, which.class="1")
partialPlot(x=model, pred.data=test, x.var=rurality, which.class="1")
partialPlot(x=model, pred.data=test, x.var=provs, which.class="1")

par(mfrow = c(2, 2))
partialPlot(x=model, pred.data=test, x.var=income_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=frac_white_med, which.class="1")
partialPlot(x=model, pred.data=test, x.var=group_pr, which.class="1")
