If we look at the dataset, “Number of doctor visits in past 2 weeks” is a count data and it can be a good response variable. Because the visits might have relation to other variables such as gender, age or income and so on. Let’s consider visits as response variable (Y) and others as explanatory/predictor/indicator variables. Let’s find out whether these indicator variables have relation or not with the response variable. Let’s look at the response variable agian, Number of doctor visits in past 2 weeks, which can be considered as number of events per certain time; thus we can run a poisson regression. Also, we can see that in the response variable, visits, there are many zeros; so i will run a zero inflated count model as well. Furthermore, in the response variable, there is an order and the number of visits has a limit which is 10 and the data set is very large; thus we can run a multinomial regression. These are all regressions/methods i will use to find whether there is any relation between response and indicator variables. Before doing any regression, i want to see how individual indicator variable relates to response variable so we can narrow down our indicator variables.
The data is called DoctorVisits which is in AER package. This is cross-section data from the 1977–1978 Australian Health Survey containing 5,190 observations on 12 variables. Here are all the variables:
visits = Number of doctor visits in past 2 weeks
gender = Factor indicating gender.
age = Age in years divided by 100.
income = Annual income in tens of thousands of dollars.
illness = Number of illnesses in past 2 weeks.
reduced = Number of days of reduced activity in past 2 weeks due to illness or injury.
health = General health questionnaire score using Goldberg’s method.
private = Factor. Does the individual have private health insurance? yes or no.
freepoor = Factor. Does the individual have free government health insurance due to low income? yes or no.
freerepat = Factor. Does the individual have free government health insurance due to old age, disability or veteran status? yes or no.
nchronic = Factor. Is there a chronic condition not limiting activity? yes or no.
lchronic = Factor. Is there a chronic condition limiting activity? yes or no.
Since a continuous variable can not be mapped to linetype we will create a new column under “new.visits” name where none= zero number of doctor visit in past 2 weeks, once= one number of doctor visit in past 2 weeks, twice = 2 number of doctor visit in past 2 weeks and so forth, so we can create a ggplot for each pair.
visits vs gender
Gender doesn’t affect visits that much.
visits vs age
As people get older, probability(none) decreases and probability(once) increases; age might be a good indicator variable.
visits vs income
Income doesn’t have that much effect.
visit vs illness
As the number of illness increases, the probability(once) increases too; illness might be a good indicator variale.
visit vs reduced
From the plot, we can see reduced definitely has effect on number of visits.
visits vs health
As health increases, the probability(once) increases indicating a appropiate response variable.
visits vs private
private doesn’t have any effect on number of visits.
visits vs freepoor
People with free government health insurance dont seem to see doctor that much.
visits vs freerepat
People with freerepat seem to see doctor little more than without one.
visits vs nchronic
nchronic doesn’t seem to have effect on number of visits.
visits vs lchronic
People with lchronic seem to see doctor little more than without one.
Since our considered response variable is count data we will use poisson regression
##
## Call:
## glm(formula = visits ~ ., family = poisson, data = DoctorVisits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9502 -0.6858 -0.5747 -0.4852 5.7055
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.097821 0.101554 -20.657 < 2e-16 ***
## genderfemale 0.156490 0.056139 2.788 0.00531 **
## age 0.279123 0.165981 1.682 0.09264 .
## income -0.187416 0.085478 -2.193 0.02834 *
## illness 0.186156 0.018263 10.193 < 2e-16 ***
## reduced 0.126690 0.005031 25.184 < 2e-16 ***
## health 0.030683 0.010074 3.046 0.00232 **
## privateyes 0.126498 0.071552 1.768 0.07707 .
## freepooryes -0.438462 0.179799 -2.439 0.01474 *
## freerepatyes 0.083640 0.092070 0.908 0.36365
## nchronicyes 0.117300 0.066545 1.763 0.07795 .
## lchronicyes 0.150717 0.082260 1.832 0.06692 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4380.1 on 5178 degrees of freedom
## AIC: 6735.7
##
## Number of Fisher Scoring iterations: 6
P-value
## [1] 1
Since the p value is 1 and deviance is 4380.1 0n 5178 degrees of freedom we fail to reject the null hypothesis indicating an good-fitting model. Based on the results, let’s reduce the indicator variables as income, illness, reduced, health and freepoor.
##
## Call:
## glm(formula = visits ~ income + illness + reduced + health +
## freepoor, family = poisson, data = DoctorVisits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8049 -0.6742 -0.5800 -0.5057 5.7636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.703968 0.063928 -26.654 < 2e-16 ***
## income -0.320842 0.076246 -4.208 2.58e-05 ***
## illness 0.210165 0.017328 12.128 < 2e-16 ***
## reduced 0.130412 0.004834 26.976 < 2e-16 ***
## health 0.030798 0.009927 3.102 0.00192 **
## freepooryes -0.642487 0.172180 -3.731 0.00019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4412.2 on 5184 degrees of freedom
## AIC: 6755.8
##
## Number of Fisher Scoring iterations: 6
We can see that illness and reduced are the most positively significant indicator variables followed by health whereas income is the most negatively significant indicator variable followed by freepoor.
Coefficients of these variables
## (Intercept) income illness reduced health freepooryes
## 0.1819601 0.7255381 1.2338811 1.1392981 1.0312775 0.5259826
As you increase the age by one, the expected number of visits for a parson decreases by a factor of 0.73. As you increase the illness by one, the expected number of visits for a parson increases by a factor of 1.23. As you increase the reduced by one, the expected number of visits for a parson increases by a factor of 1.14.As you increase the health by one, the expected number of visits for a parson increases by a factor of 1.03. As you increase the freepooryes by one, the expected number of visits for a parson decreases by a factor of 0.53.
We can see that the residual deviance for the model is normally distributed.
## # A tibble: 10 x 2
## visits n
## <dbl> <int>
## 1 0. 4141
## 2 1. 782
## 3 2. 174
## 4 3. 30
## 5 4. 24
## 6 5. 9
## 7 6. 12
## 8 7. 12
## 9 8. 5
## 10 9. 1
We see there are many zeros in response variable. Let’s use zero inflated count model regression
Zero Inflated Model
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
##
## Call:
## zeroinfl(formula = visits ~ . | reduced, data = DoctorVisits)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.4907 -0.4420 -0.3809 -0.3252 12.0230
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.493646 0.130825 -11.417 < 2e-16 ***
## genderfemale 0.172259 0.060884 2.829 0.00466 **
## age 0.282190 0.181252 1.557 0.11950
## income -0.233319 0.094707 -2.464 0.01376 *
## illness 0.183538 0.020363 9.013 < 2e-16 ***
## reduced 0.109244 0.007050 15.496 < 2e-16 ***
## health 0.028700 0.010875 2.639 0.00831 **
## privateyes 0.106815 0.078105 1.368 0.17144
## freepooryes -0.441772 0.193674 -2.281 0.02255 *
## freerepatyes 0.005954 0.100379 0.059 0.95270
## nchronicyes 0.117323 0.072478 1.619 0.10550
## lchronicyes 0.134158 0.089427 1.500 0.13357
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.24367 0.14350 -1.698 0.0895 .
## reduced -0.08485 0.02116 -4.011 6.05e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 27
## Log-likelihood: -3280 on 14 Df
## [1] 198.7991
## [1] 0
Since the P-value is 0 we can conclude this model is not justifiable.
Using QuasiPoisson Regression
##
## Call:
## glm(formula = visits ~ ., family = quasipoisson, data = DoctorVisits)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9502 -0.6858 -0.5747 -0.4852 5.7055
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.097821 0.117010 -17.929 < 2e-16 ***
## genderfemale 0.156490 0.064683 2.419 0.01558 *
## age 0.279123 0.191244 1.460 0.14448
## income -0.187416 0.098488 -1.903 0.05711 .
## illness 0.186156 0.021043 8.847 < 2e-16 ***
## reduced 0.126690 0.005796 21.857 < 2e-16 ***
## health 0.030683 0.011607 2.644 0.00823 **
## privateyes 0.126498 0.082442 1.534 0.12500
## freepooryes -0.438462 0.207164 -2.116 0.03435 *
## freerepatyes 0.083640 0.106083 0.788 0.43048
## nchronicyes 0.117300 0.076674 1.530 0.12611
## lchronicyes 0.150717 0.094780 1.590 0.11186
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 1.327571)
##
## Null deviance: 5634.8 on 5189 degrees of freedom
## Residual deviance: 4380.1 on 5178 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
The results are close to the results of poisson regression
Since we have a large data set and there is a order in response variable we can use ordinal multinomial regression
## Start: AIC=6361.51
## factor(visits) ~ gender + age + income + illness + reduced +
## health + private + freepoor + freerepat + nchronic + lchronic
##
## Df AIC
## - income 1 6360.5
## - nchronic 1 6361.2
## <none> 6361.5
## - age 1 6362.9
## - lchronic 1 6363.1
## - freerepat 1 6366.3
## - private 1 6367.3
## - freepoor 1 6368.1
## - gender 1 6368.5
## - health 1 6372.7
## - illness 1 6450.7
## - reduced 1 6616.7
##
## Step: AIC=6360.48
## factor(visits) ~ gender + age + illness + reduced + health +
## private + freepoor + freerepat + nchronic + lchronic
##
## Df AIC
## - nchronic 1 6360.1
## <none> 6360.5
## - lchronic 1 6362.0
## - age 1 6362.3
## - private 1 6365.8
## - freepoor 1 6366.3
## - freerepat 1 6366.5
## - gender 1 6369.2
## - health 1 6371.8
## - illness 1 6450.9
## - reduced 1 6615.0
##
## Step: AIC=6360.07
## factor(visits) ~ gender + age + illness + reduced + health +
## private + freepoor + freerepat + lchronic
##
## Df AIC
## <none> 6360.1
## - lchronic 1 6360.2
## - age 1 6363.4
## - private 1 6365.8
## - freepoor 1 6365.8
## - freerepat 1 6366.3
## - gender 1 6369.4
## - health 1 6371.3
## - illness 1 6464.7
## - reduced 1 6614.7
##
## Re-fitting to get Hessian
## Call:
## polr(formula = factor(visits) ~ gender + age + illness + reduced +
## health + private + freepoor + freerepat + lchronic, data = DoctorVisits)
##
## Coefficients:
## Value Std. Error t value
## genderfemale 0.26439 0.07888 3.352
## age 0.54017 0.23382 2.310
## illness 0.27975 0.02690 10.400
## reduced 0.17203 0.01072 16.040
## health 0.06087 0.01646 3.697
## privateyes 0.27097 0.09822 2.759
## freepooryes -0.67465 0.25831 -2.612
## freerepatyes 0.38285 0.13348 2.868
## lchronicyes 0.15623 0.10605 1.473
##
## Intercepts:
## Value Std. Error t value
## 0|1 2.7060 0.1095 24.7031
## 1|2 4.4757 0.1279 34.9847
## 2|3 5.6652 0.1561 36.2868
## 3|4 6.0844 0.1726 35.2592
## 4|5 6.5978 0.2000 32.9966
## 5|6 6.8800 0.2191 31.4027
## 6|7 7.4202 0.2660 27.8966
## 7|8 8.5433 0.4269 20.0104
## 8|9 10.3476 1.0079 10.2668
##
## Residual Deviance: 6324.068
## AIC: 6360.068
Three most significant predictors are reduced, illness and health based on largest t-values. We can say that the odds of moving from 0 to 1|2 (or from 0|1) increase by a factor of exp(.172)=1.187 as reduced increases by one unit. A similar argument can be used for interpretation of other variables.
Number of visits per last 2 weeks, the response variable, significantly depends on reduced, illness and health variables.
### creating new column
data(DoctorVisits, package="AER")
new.visits <- factor(DoctorVisits$visits)
levels(new.visits) <- list(none = as.character(0),once = as.character(1),
twice = as.character(2), three.times=as.character(3),
four.times=as.character(4),five.times=as.character(5),
six.times=as.character(6),seven.times=as.character(7),
eight.times=as.character(8),nine.times=as.character(9),
ten.times=as.character(10))
library(tibble)
new.doctorvisit.data <- add_column(DoctorVisits,new.visits, .after = 1)
## visits vs gender
library(dplyr)
library(ggplot2)
visits.vs.gender <- group_by(new.doctorvisit.data, gender, new.visits) %>%
summarise(count=n()) %>% group_by(gender) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.gender, aes(x=gender, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs age
visits.vs.age <- group_by(new.doctorvisit.data, age, new.visits) %>%
summarise(count=n()) %>% group_by(age) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.age, aes(x=age, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs income
visits.vs.income <- group_by(new.doctorvisit.data, income, new.visits) %>%
summarise(count=n()) %>% group_by(income) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.income, aes(x=income, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visit vs illness
visits.vs.illness <- group_by(new.doctorvisit.data, illness, new.visits) %>%
summarise(count=n()) %>% group_by(illness) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.illness, aes(x=illness, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visit vs reduced
visits.vs.reduced <- group_by(new.doctorvisit.data, reduced, new.visits) %>%
summarise(count=n()) %>% group_by(reduced) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.reduced, aes(x=reduced, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs health
visits.vs.health <- group_by(new.doctorvisit.data, health, new.visits) %>%
summarise(count=n()) %>% group_by(health) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.health, aes(x=health, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs private
visits.vs.private <- group_by(new.doctorvisit.data, private, new.visits) %>%
summarise(count=n()) %>% group_by(private) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.private, aes(x=private, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs freepoor
visits.vs.freepoor <- group_by(new.doctorvisit.data, freepoor, new.visits) %>%
summarise(count=n()) %>% group_by(freepoor) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.freepoor, aes(x=freepoor, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs freerepat
visits.vs.freerepat <- group_by(new.doctorvisit.data, freerepat, new.visits) %>%
summarise(count=n()) %>% group_by(freerepat) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.freerepat, aes(x=freerepat, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs nchronic
visits.vs.nchronic <- group_by(new.doctorvisit.data,nchronic, new.visits) %>%
summarise(count=n()) %>% group_by(nchronic) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.nchronic, aes(x=nchronic, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
## visits vs lchronic
visits.vs.lchronic <- group_by(new.doctorvisit.data,lchronic, new.visits) %>%
summarise(count=n()) %>% group_by(lchronic) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.lchronic, aes(x=lchronic, y=proportion, group=new.visits,
linetype=new.visits))+geom_line()
##Poisson Regression
data(DoctorVisits, package="AER")
mod.poisson <- glm(visits~., family = poisson, DoctorVisits)
summary(mod.poisson)
## P-value
pchisq(mod.poisson$deviance, df=mod.poisson$df.residual, lower.tail=FALSE)
##Poisson Reduced Model
mod.poisson.1 <- glm(visits~income+illness+reduced+health+freepoor,
family = poisson, DoctorVisits)
summary(mod.poisson.1)
exp(coef(mod.poisson.1))
library(faraway)
qqnorm(residuals(mod.poisson.1))
halfnorm(residuals(mod.poisson))
plot(mod.poisson.1,1)
## Counting Zeros
DoctorVisits %>% group_by(visits) %>% tally()
##zero inflated model
library(pscl)
modz <- zeroinfl(visits ~. , data=DoctorVisits)
summary(modz)
modz2 <- zeroinfl(visits ~.|reduced , data=DoctorVisits)
summary(modz2)
lrt <- 2*(modz$loglik - modz2$loglik)
lrt
1-pchisq(198.7991,14)
##Using QuasiPoisson Regression
modq <- glm(visits ~ ., family=quasipoisson,DoctorVisits)
summary(modq)
### Multinomial regression
library(MASS)
pomod <- polr(factor(visits)~ gender+age+income+illness+reduced+health+private+
freepoor+freerepat+nchronic+lchronic, data=DoctorVisits)
pomodi <- step(pomod)
summary(step(pomod))