STAT 707 Final

Introduction

If we look at the dataset, “Number of doctor visits in past 2 weeks” is a count data and it can be a good response variable. Because the visits might have relation to other variables such as gender, age or income and so on. Let’s consider visits as response variable (Y) and others as explanatory/predictor/indicator variables. Let’s find out whether these indicator variables have relation or not with the response variable. Let’s look at the response variable agian, Number of doctor visits in past 2 weeks, which can be considered as number of events per certain time; thus we can run a poisson regression. Also, we can see that in the response variable, visits, there are many zeros; so i will run a zero inflated count model as well. Furthermore, in the response variable, there is an order and the number of visits has a limit which is 10 and the data set is very large; thus we can run a multinomial regression. These are all regressions/methods i will use to find whether there is any relation between response and indicator variables. Before doing any regression, i want to see how individual indicator variable relates to response variable so we can narrow down our indicator variables.

Data

The data is called DoctorVisits which is in AER package. This is cross-section data from the 1977–1978 Australian Health Survey containing 5,190 observations on 12 variables. Here are all the variables:

visits = Number of doctor visits in past 2 weeks
gender = Factor indicating gender.
age = Age in years divided by 100.
income = Annual income in tens of thousands of dollars.
illness = Number of illnesses in past 2 weeks.
reduced = Number of days of reduced activity in past 2 weeks due to illness or injury.
health = General health questionnaire score using Goldberg’s method.
private = Factor. Does the individual have private health insurance? yes or no.
freepoor = Factor. Does the individual have free government health insurance due to low income? yes or no.
freerepat = Factor. Does the individual have free government health insurance due to old age, disability or veteran status? yes or no.
nchronic = Factor. Is there a chronic condition not limiting activity? yes or no.
lchronic = Factor. Is there a chronic condition limiting activity? yes or no.

Analysis

Since a continuous variable can not be mapped to linetype we will create a new column under “new.visits” name where none= zero number of doctor visit in past 2 weeks, once= one number of doctor visit in past 2 weeks, twice = 2 number of doctor visit in past 2 weeks and so forth, so we can create a ggplot for each pair.

visits vs gender

Gender doesn’t affect visits that much.

visits vs age

As people get older, probability(none) decreases and probability(once) increases; age might be a good indicator variable.

visits vs income

Income doesn’t have that much effect.

visit vs illness

As the number of illness increases, the probability(once) increases too; illness might be a good indicator variale.

visit vs reduced

From the plot, we can see reduced definitely has effect on number of visits.

visits vs health

As health increases, the probability(once) increases indicating a appropiate response variable.

visits vs private

private doesn’t have any effect on number of visits.

visits vs freepoor

People with free government health insurance dont seem to see doctor that much.

visits vs freerepat

People with freerepat seem to see doctor little more than without one.

visits vs nchronic

nchronic doesn’t seem to have effect on number of visits.

visits vs lchronic

People with lchronic seem to see doctor little more than without one.

Poisson Regression

Since our considered response variable is count data we will use poisson regression

## 
## Call:
## glm(formula = visits ~ ., family = poisson, data = DoctorVisits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9502  -0.6858  -0.5747  -0.4852   5.7055  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.097821   0.101554 -20.657  < 2e-16 ***
## genderfemale  0.156490   0.056139   2.788  0.00531 ** 
## age           0.279123   0.165981   1.682  0.09264 .  
## income       -0.187416   0.085478  -2.193  0.02834 *  
## illness       0.186156   0.018263  10.193  < 2e-16 ***
## reduced       0.126690   0.005031  25.184  < 2e-16 ***
## health        0.030683   0.010074   3.046  0.00232 ** 
## privateyes    0.126498   0.071552   1.768  0.07707 .  
## freepooryes  -0.438462   0.179799  -2.439  0.01474 *  
## freerepatyes  0.083640   0.092070   0.908  0.36365    
## nchronicyes   0.117300   0.066545   1.763  0.07795 .  
## lchronicyes   0.150717   0.082260   1.832  0.06692 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4380.1  on 5178  degrees of freedom
## AIC: 6735.7
## 
## Number of Fisher Scoring iterations: 6

P-value

## [1] 1

Since the p value is 1 and deviance is 4380.1 0n 5178 degrees of freedom we fail to reject the null hypothesis indicating an good-fitting model. Based on the results, let’s reduce the indicator variables as income, illness, reduced, health and freepoor.

## 
## Call:
## glm(formula = visits ~ income + illness + reduced + health + 
##     freepoor, family = poisson, data = DoctorVisits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8049  -0.6742  -0.5800  -0.5057   5.7636  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.703968   0.063928 -26.654  < 2e-16 ***
## income      -0.320842   0.076246  -4.208 2.58e-05 ***
## illness      0.210165   0.017328  12.128  < 2e-16 ***
## reduced      0.130412   0.004834  26.976  < 2e-16 ***
## health       0.030798   0.009927   3.102  0.00192 ** 
## freepooryes -0.642487   0.172180  -3.731  0.00019 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4412.2  on 5184  degrees of freedom
## AIC: 6755.8
## 
## Number of Fisher Scoring iterations: 6

We can see that illness and reduced are the most positively significant indicator variables followed by health whereas income is the most negatively significant indicator variable followed by freepoor.

Coefficients of these variables

## (Intercept)      income     illness     reduced      health freepooryes 
##   0.1819601   0.7255381   1.2338811   1.1392981   1.0312775   0.5259826

As you increase the age by one, the expected number of visits for a parson decreases by a factor of 0.73. As you increase the illness by one, the expected number of visits for a parson increases by a factor of 1.23. As you increase the reduced by one, the expected number of visits for a parson increases by a factor of 1.14.As you increase the health by one, the expected number of visits for a parson increases by a factor of 1.03. As you increase the freepooryes by one, the expected number of visits for a parson decreases by a factor of 0.53.

We can see that the residual deviance for the model is normally distributed.

Zero Inflated Count Model

## # A tibble: 10 x 2
##    visits     n
##     <dbl> <int>
##  1     0.  4141
##  2     1.   782
##  3     2.   174
##  4     3.    30
##  5     4.    24
##  6     5.     9
##  7     6.    12
##  8     7.    12
##  9     8.     5
## 10     9.     1

We see there are many zeros in response variable. Let’s use zero inflated count model regression

Zero Inflated Model

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

## 
## Call:
## zeroinfl(formula = visits ~ . | reduced, data = DoctorVisits)
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4907 -0.4420 -0.3809 -0.3252 12.0230 
## 
## Count model coefficients (poisson with log link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.493646   0.130825 -11.417  < 2e-16 ***
## genderfemale  0.172259   0.060884   2.829  0.00466 ** 
## age           0.282190   0.181252   1.557  0.11950    
## income       -0.233319   0.094707  -2.464  0.01376 *  
## illness       0.183538   0.020363   9.013  < 2e-16 ***
## reduced       0.109244   0.007050  15.496  < 2e-16 ***
## health        0.028700   0.010875   2.639  0.00831 ** 
## privateyes    0.106815   0.078105   1.368  0.17144    
## freepooryes  -0.441772   0.193674  -2.281  0.02255 *  
## freerepatyes  0.005954   0.100379   0.059  0.95270    
## nchronicyes   0.117323   0.072478   1.619  0.10550    
## lchronicyes   0.134158   0.089427   1.500  0.13357    
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.24367    0.14350  -1.698   0.0895 .  
## reduced     -0.08485    0.02116  -4.011 6.05e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 27 
## Log-likelihood: -3280 on 14 Df

## [1] 198.7991

## [1] 0

Since the P-value is 0 we can conclude this model is not justifiable.

Using QuasiPoisson Regression

## 
## Call:
## glm(formula = visits ~ ., family = quasipoisson, data = DoctorVisits)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9502  -0.6858  -0.5747  -0.4852   5.7055  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.097821   0.117010 -17.929  < 2e-16 ***
## genderfemale  0.156490   0.064683   2.419  0.01558 *  
## age           0.279123   0.191244   1.460  0.14448    
## income       -0.187416   0.098488  -1.903  0.05711 .  
## illness       0.186156   0.021043   8.847  < 2e-16 ***
## reduced       0.126690   0.005796  21.857  < 2e-16 ***
## health        0.030683   0.011607   2.644  0.00823 ** 
## privateyes    0.126498   0.082442   1.534  0.12500    
## freepooryes  -0.438462   0.207164  -2.116  0.03435 *  
## freerepatyes  0.083640   0.106083   0.788  0.43048    
## nchronicyes   0.117300   0.076674   1.530  0.12611    
## lchronicyes   0.150717   0.094780   1.590  0.11186    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 1.327571)
## 
##     Null deviance: 5634.8  on 5189  degrees of freedom
## Residual deviance: 4380.1  on 5178  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

The results are close to the results of poisson regression

Multinomial Regression

Since we have a large data set and there is a order in response variable we can use ordinal multinomial regression

## Start:  AIC=6361.51
## factor(visits) ~ gender + age + income + illness + reduced + 
##     health + private + freepoor + freerepat + nchronic + lchronic
## 
##             Df    AIC
## - income     1 6360.5
## - nchronic   1 6361.2
## <none>         6361.5
## - age        1 6362.9
## - lchronic   1 6363.1
## - freerepat  1 6366.3
## - private    1 6367.3
## - freepoor   1 6368.1
## - gender     1 6368.5
## - health     1 6372.7
## - illness    1 6450.7
## - reduced    1 6616.7
## 
## Step:  AIC=6360.48
## factor(visits) ~ gender + age + illness + reduced + health + 
##     private + freepoor + freerepat + nchronic + lchronic
## 
##             Df    AIC
## - nchronic   1 6360.1
## <none>         6360.5
## - lchronic   1 6362.0
## - age        1 6362.3
## - private    1 6365.8
## - freepoor   1 6366.3
## - freerepat  1 6366.5
## - gender     1 6369.2
## - health     1 6371.8
## - illness    1 6450.9
## - reduced    1 6615.0
## 
## Step:  AIC=6360.07
## factor(visits) ~ gender + age + illness + reduced + health + 
##     private + freepoor + freerepat + lchronic
## 
##             Df    AIC
## <none>         6360.1
## - lchronic   1 6360.2
## - age        1 6363.4
## - private    1 6365.8
## - freepoor   1 6365.8
## - freerepat  1 6366.3
## - gender     1 6369.4
## - health     1 6371.3
## - illness    1 6464.7
## - reduced    1 6614.7

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = factor(visits) ~ gender + age + illness + reduced + 
##     health + private + freepoor + freerepat + lchronic, data = DoctorVisits)
## 
## Coefficients:
##                 Value Std. Error t value
## genderfemale  0.26439    0.07888   3.352
## age           0.54017    0.23382   2.310
## illness       0.27975    0.02690  10.400
## reduced       0.17203    0.01072  16.040
## health        0.06087    0.01646   3.697
## privateyes    0.27097    0.09822   2.759
## freepooryes  -0.67465    0.25831  -2.612
## freerepatyes  0.38285    0.13348   2.868
## lchronicyes   0.15623    0.10605   1.473
## 
## Intercepts:
##     Value   Std. Error t value
## 0|1  2.7060  0.1095    24.7031
## 1|2  4.4757  0.1279    34.9847
## 2|3  5.6652  0.1561    36.2868
## 3|4  6.0844  0.1726    35.2592
## 4|5  6.5978  0.2000    32.9966
## 5|6  6.8800  0.2191    31.4027
## 6|7  7.4202  0.2660    27.8966
## 7|8  8.5433  0.4269    20.0104
## 8|9 10.3476  1.0079    10.2668
## 
## Residual Deviance: 6324.068 
## AIC: 6360.068

Three most significant predictors are reduced, illness and health based on largest t-values. We can say that the odds of moving from 0 to 1|2 (or from 0|1) increase by a factor of exp(.172)=1.187 as reduced increases by one unit. A similar argument can be used for interpretation of other variables.

Conclusion

Number of visits per last 2 weeks, the response variable, significantly depends on reduced, illness and health variables.

Code/Appendix

### creating new column
data(DoctorVisits, package="AER")
new.visits <- factor(DoctorVisits$visits)
levels(new.visits) <- list(none = as.character(0),once = as.character(1),
                           twice = as.character(2), three.times=as.character(3),
                           four.times=as.character(4),five.times=as.character(5),
                           six.times=as.character(6),seven.times=as.character(7),
                           eight.times=as.character(8),nine.times=as.character(9),
                           ten.times=as.character(10))
library(tibble)
new.doctorvisit.data <- add_column(DoctorVisits,new.visits, .after = 1)


## visits vs gender 
library(dplyr)
library(ggplot2)
visits.vs.gender <- group_by(new.doctorvisit.data, gender, new.visits) %>% 
  summarise(count=n()) %>% group_by(gender) %>% mutate(etotal=sum(count),
   proportion= count/etotal)
ggplot(visits.vs.gender, aes(x=gender, y=proportion, group=new.visits, 
                             linetype=new.visits))+geom_line()

## visits vs age
visits.vs.age <- group_by(new.doctorvisit.data, age, new.visits) %>% 
  summarise(count=n()) %>% group_by(age) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.age, aes(x=age, y=proportion, group=new.visits, 
                          linetype=new.visits))+geom_line()

## visits vs income 
visits.vs.income <- group_by(new.doctorvisit.data, income, new.visits) %>% 
  summarise(count=n()) %>% group_by(income) %>% mutate(etotal=sum(count),
 proportion= count/etotal)
ggplot(visits.vs.income, aes(x=income, y=proportion, group=new.visits, 
                             linetype=new.visits))+geom_line()

## visit vs illness
visits.vs.illness <- group_by(new.doctorvisit.data, illness, new.visits) %>% 
  summarise(count=n()) %>% group_by(illness) %>% mutate(etotal=sum(count),
  proportion= count/etotal)
ggplot(visits.vs.illness, aes(x=illness, y=proportion, group=new.visits, 
                              linetype=new.visits))+geom_line()

## visit vs reduced
visits.vs.reduced <- group_by(new.doctorvisit.data, reduced, new.visits) %>% 
  summarise(count=n()) %>% group_by(reduced) %>% mutate(etotal=sum(count),
 proportion= count/etotal)
ggplot(visits.vs.reduced, aes(x=reduced, y=proportion, group=new.visits, 
                              linetype=new.visits))+geom_line()

## visits vs health 
visits.vs.health <- group_by(new.doctorvisit.data, health, new.visits) %>% 
  summarise(count=n()) %>% group_by(health) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.health, aes(x=health, y=proportion, group=new.visits, 
                             linetype=new.visits))+geom_line()

## visits vs private
visits.vs.private <- group_by(new.doctorvisit.data, private, new.visits) %>% 
  summarise(count=n()) %>% group_by(private) %>% mutate(etotal=sum(count),
proportion= count/etotal)
ggplot(visits.vs.private, aes(x=private, y=proportion, group=new.visits, 
                              linetype=new.visits))+geom_line()

## visits vs freepoor
visits.vs.freepoor <- group_by(new.doctorvisit.data, freepoor, new.visits) %>% 
  summarise(count=n()) %>% group_by(freepoor) %>% mutate(etotal=sum(count),
 proportion= count/etotal)
ggplot(visits.vs.freepoor, aes(x=freepoor, y=proportion, group=new.visits, 
                               linetype=new.visits))+geom_line()

## visits vs freerepat
visits.vs.freerepat <- group_by(new.doctorvisit.data, freerepat, new.visits) %>% 
  summarise(count=n()) %>% group_by(freerepat) %>% mutate(etotal=sum(count),
 proportion= count/etotal)
ggplot(visits.vs.freerepat, aes(x=freerepat, y=proportion, group=new.visits, 
                                linetype=new.visits))+geom_line()

## visits vs nchronic
visits.vs.nchronic <- group_by(new.doctorvisit.data,nchronic, new.visits) %>% 
  summarise(count=n()) %>% group_by(nchronic) %>% mutate(etotal=sum(count),
 proportion= count/etotal)
ggplot(visits.vs.nchronic, aes(x=nchronic, y=proportion, group=new.visits, 
                               linetype=new.visits))+geom_line()


## visits vs lchronic
visits.vs.lchronic <- group_by(new.doctorvisit.data,lchronic, new.visits) %>% 
  summarise(count=n()) %>% group_by(lchronic) %>% mutate(etotal=sum(count),
 proportion= count/etotal)
ggplot(visits.vs.lchronic, aes(x=lchronic, y=proportion, group=new.visits, 
                               linetype=new.visits))+geom_line()

##Poisson Regression
data(DoctorVisits, package="AER")
mod.poisson <- glm(visits~., family = poisson, DoctorVisits)
summary(mod.poisson)

## P-value
pchisq(mod.poisson$deviance, df=mod.poisson$df.residual, lower.tail=FALSE)

##Poisson Reduced Model
mod.poisson.1 <- glm(visits~income+illness+reduced+health+freepoor, 
                  family = poisson, DoctorVisits)
summary(mod.poisson.1)

exp(coef(mod.poisson.1))

library(faraway)
qqnorm(residuals(mod.poisson.1))
halfnorm(residuals(mod.poisson))
plot(mod.poisson.1,1)

## Counting Zeros 
DoctorVisits %>% group_by(visits) %>% tally()

##zero inflated model 
library(pscl)
modz <- zeroinfl(visits ~. , data=DoctorVisits)
summary(modz)
modz2 <- zeroinfl(visits ~.|reduced , data=DoctorVisits)
summary(modz2)

lrt <- 2*(modz$loglik - modz2$loglik)
lrt
1-pchisq(198.7991,14)

##Using QuasiPoisson Regression
modq <- glm(visits ~ ., family=quasipoisson,DoctorVisits)
summary(modq)

### Multinomial regression
library(MASS)
pomod <- polr(factor(visits)~ gender+age+income+illness+reduced+health+private+
                freepoor+freerepat+nchronic+lchronic, data=DoctorVisits)
pomodi <- step(pomod)
summary(step(pomod))