Poisson Regression Model:

I am interested to build model to find correlation between Total number of people diagonoses with aids and othe independent variable like sex,race, Neighborhood and total mubmer of hiv diagonoses.I will use poisson regression model because my dependent variable represents the number of independent events that occur during a fixed period of time. I am assuming that number of people diagnoses with AIDS will be different amoung different race, gender and how many people diagnoses with HIV.

About The Data:

These data were reported to the NYC DOHMH by June 30, 2014.This dataset includes data on new diagnoses of HIV and AIDS in NYC for the calendar years 2010 through 2013 by Neighborhood, Sex, and Race/Ethnicity.

Data Import and Preparation:

library(dplyr)
library(tidyr)
library(ggplot2)
library(Zelig)
library(DT)
Hiv<-read.csv("C:/Users/Afzal Hossain/Downloads/HIV_AIDS_Diagnoses_by_Neighborhood__Sex__and_Race_Ethnicity.csv")
hiv<-select(Hiv, "YEAR","Neighborhood..U.H.F.","SEX","RACE.ETHNICITY","TOTAL.NUMBER.OF.AIDS.DIAGNOSES","TOTAL.NUMBER.OF.HIV.DIAGNOSES","TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES")
datatable(hiv)

Recoding and Removing NA from Race and Sex Variables:

hiv$SEX<-as.character(hiv$SEX)
hiv$SEX[hiv$SEX=='All']<- NA
hiv$RACE.ETHNICITY<-as.character(hiv$RACE.ETHNICITY)
hiv$RACE.ETHNICITY[hiv$RACE.ETHNICITY=="All" | hiv$RACE.ETHNICITY =="Unknown"]<-NA
hiv$RACE.ETHNICITY=recode(hiv$RACE.ETHNICITY,'Native American' = 'Other', "Multiracial"="Other", "Asian/Pacific Islander"="Other")

Changing Data Type:

hiv$YEAR<-as.factor(hiv$YEAR)
hiv$SEX<-as.factor(hiv$SEX)
hiv$RACE.ETHNICITY<-as.factor(hiv$RACE.ETHNICITY)
hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES<-as.integer(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES)
hiv$TOTAL.NUMBER.OF.AIDS.DIAGNOSES<-as.integer(hiv$TOTAL.NUMBER.OF.AIDS.DIAGNOSES)
hiv$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES<-as.integer(hiv$TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES)
str(hiv)
## 'data.frame':    2928 obs. of  7 variables:
##  $ YEAR                                         : Factor w/ 4 levels "2010","2011",..: 1 2 1 3 4 4 4 4 3 1 ...
##  $ Neighborhood..U.H.F.                         : Factor w/ 44 levels "All","Bayside - Little Neck",..: 19 35 33 40 44 12 12 34 16 15 ...
##  $ SEX                                          : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 1 2 NA ...
##  $ RACE.ETHNICITY                               : Factor w/ 4 levels "Black","Hispanic",..: 1 3 NA NA NA 1 3 NA NA NA ...
##  $ TOTAL.NUMBER.OF.AIDS.DIAGNOSES               : int  78 2 23 2 2 60 2 2 2 15 ...
##  $ TOTAL.NUMBER.OF.HIV.DIAGNOSES                : int  119 2 67 2 2 113 2 2 2 36 ...
##  $ TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES: int  2 2 41 2 2 49 2 2 2 41 ...

Building Count Data Model Using Zelig:

First, I will build a simple model with Sex and Year. Then I will use all the variables for my second model and Finally, I will add an interaction between Sex and Race.

#Simple Model
z.t <- zelig(TOTAL.NUMBER.OF.AIDS.DIAGNOSES~SEX +YEAR , model = "poisson", data = hiv, cite = FALSE)

# Addiing all variables
z.t1 <- zelig(TOTAL.NUMBER.OF.AIDS.DIAGNOSES~SEX+ TOTAL.NUMBER.OF.HIV.DIAGNOSES + RACE.ETHNICITY +TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES + YEAR , model = "poisson", data = hiv, cite = FALSE)

# Adding Interaction
z.t2 <- zelig(TOTAL.NUMBER.OF.AIDS.DIAGNOSES~SEX* RACE.ETHNICITY+TOTAL.NUMBER.OF.HIV.DIAGNOSES  +TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES + YEAR, model = "poisson", data = hiv, cite = FALSE)

#looking at the result
texreg::htmlreg(list(z.t, z.t1, z.t2),doctype = FALSE)
Statistical models
Model 1 Model 2 Model 3
(Intercept) 2.98*** 2.92*** 3.09***
(0.01) (0.02) (0.02)
SEXMale 0.45*** 0.28*** 0.08***
(0.01) (0.01) (0.02)
YEAR2011 -0.09*** -0.10*** -0.09***
(0.01) (0.01) (0.01)
YEAR2012 -0.10*** -0.07*** -0.06***
(0.01) (0.01) (0.01)
YEAR2013 -0.20*** -0.21*** -0.20***
(0.01) (0.01) (0.01)
TOTAL.NUMBER.OF.HIV.DIAGNOSES 0.01*** 0.01***
(0.00) (0.00)
RACE.ETHNICITYHispanic -0.04** 0.02
(0.01) (0.02)
RACE.ETHNICITYOther -1.72*** -2.14***
(0.02) (0.03)
RACE.ETHNICITYWhite -0.26*** -1.12***
(0.01) (0.03)
TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES 0.01*** 0.01***
(0.00) (0.00)
SEXMale:RACE.ETHNICITYHispanic -0.09***
(0.02)
SEXMale:RACE.ETHNICITYOther 0.61***
(0.04)
SEXMale:RACE.ETHNICITYWhite 1.15***
(0.03)
AIC 115678.03 40598.95 38700.60
BIC 115707.63 40655.28 38773.82
Log Likelihood -57834.01 -20289.48 -19337.30
Deviance 105199.85 32903.57 30999.21
Num. obs. 2752 2064 2064
p < 0.001, p < 0.01, p < 0.05

Example Interpretation:

The coefficient for TOTAL.NUMBER.OF.HIV.DIAGNOSES is .01. This means that the expected log count for a one-unit increase in TOTAL.NUMBER.OF.HIV.DIAGNOSES is .01.HIV causes AIDS therefore, it is expected that incerase in HIV diagnoses will incerease in AIDS diagnoses.

Now, I will use simulation to interpreter our result of my best model based on AIC ad BIC which is more understandable for human mind.

Making sense of Poisson Regression Model Results:

HIV Effect in Standard Deviation Increase:

x <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = mean(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES))

# Adding one standard Deviation
x1 <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = mean(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES) + sd(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES))

#Simulate the Quantities of Interest
s <- sim(z.t2, x = x, x1 = x1)

# First Difference
fd <- s$get_qi(xvalue="x1", qi="fd")
summary(fd)
##        V1       
##  Min.   :2.772  
##  1st Qu.:2.970  
##  Median :3.019  
##  Mean   :3.023  
##  3rd Qu.:3.081  
##  Max.   :3.246

Simulated first difference telling me that increase in one standard deviation of total number of people diagnoses with HIV will increase total number of people diagnoses with AIDs by 3.03 on average.

HIV Effect in Different Quantile:

# 25% Quantile
xl <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = quantile(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES, .25))
xl1 <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = quantile(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES, .25) + sd(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES))
sl <- sim(z.t2, x = xl, x1 = xl1)
fdl <- sl$get_qi(xvalue="x1", qi="fd")

# 50% Quantile
xm <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = quantile(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES, .50))
xm1 <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = quantile(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES, .50) + sd(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES))
sm <- sim(z.t2, x = xm, x1 = xm1)
fdm <- sm$get_qi(xvalue="x1", qi="fd")

# 75% Quantile
xh <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = quantile(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES, .75))
xh1 <- setx(z.t2, TOTAL.NUMBER.OF.HIV.DIAGNOSES = quantile(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES, .75) + sd(hiv$TOTAL.NUMBER.OF.HIV.DIAGNOSES))
sh <- sim(z.t2, x = xh, x1 = xh1)
fdh <- sh$get_qi(xvalue="x1", qi="fd")

# Putting all of them in one place
h <- as.data.frame(cbind(fdl, fdm, fdh))

hqdd <- h %>% 
  gather(quantile, aidsmv, 1:3)

# Renaming
hqdd$quantile=recode(hqdd$quantile,'V1' = 'Q1', "V2"="Q2", "V3"="Q3")

#Group By Quantile
Q_mean<-hqdd%>% 
  group_by(quantile) %>% 
  summarise(mean = mean(aidsmv), sd = sd(aidsmv))

datatable(Q_mean)

Plotting the Distributions:

ggplot(hqdd, aes(aidsmv)) + geom_histogram() + facet_grid(~quantile,scales = "free")+geom_vline(data=Q_mean,aes(xintercept=mean), color="red",linetype = "dashed",size = 1)+xlab("First Differences Mean Value")+ theme(panel.background = element_rect(fill = "aliceblue", colour = "aliceblue"))

From this analysis, it shows that as total number of people diagnoses with HIV increase, on average the total number of people diagnoses with AIDs increase as well. The increase is dramatic from 50% quantile to 75% quantile where on average diagnoses with AIDs increase from 2.24 to 4.1. The distributions for Q1, Q2 and Q3 are very similar and approximately symmetrical.

Gender Effect:

mx <- setx(z.t2, SEX = "Male")
fx1 <- setx(z.t2, SEX = "Female")

gs <- sim(z.t2, x = x, x1 = fx1)

# First Differences
gfd <- gs$get_qi(xvalue="x1", qi="fd")

summary(gfd)
##        V1        
##  Min.   :-3.593  
##  1st Qu.:-3.250  
##  Median :-3.156  
##  Mean   :-3.155  
##  3rd Qu.:-3.065  
##  Max.   :-2.690

I can see from the first difference between male and female that female are on average 3.2 less likely to diagnoses with AIDS than male.

Creating Dataframe for Male and Female for plotting:

m1x <- setx(z.t2, SEX = "Male")
ms <- sim(z.t2, x = x)

# Quantity of interest expected values instead of first difference
mfd<-ms$get_qi(xvalue="x", qi="ev")

f1x1 <- setx(z.t2, SEX = "Female")
fs <- sim(z.t2, x = f1x1)

#Quantity of interest expected values
ffd <- gs$get_qi(xvalue="x1", qi="ev")


# Putting all of them in one place
gend <- as.data.frame(cbind(mfd,ffd))
gendd <- gend %>% 
  gather(Sex, Aidsmv, 1:2)


# Renaming
gendd$Sex=recode(gendd$Sex,'V1' = 'MALE', "V2"="FEMALE")

# group by sex 
ms_gend<- gendd %>% group_by(Sex) %>% summarise(mean = mean(Aidsmv), sd = sd(Aidsmv))

datatable(ms_gend)

Form hear we can see the same thing as first difference that on average female are 3.15(6.1-2.95) less likely diagnoses with AIDS than male. Female distribution is little right tail, but Male distribution is approximately symmetrical.

Plotting the Distributions of Gender:

ggplot(gendd, aes(Aidsmv)) + geom_histogram() + facet_grid(~Sex,scales = "free")+geom_vline(data=ms_gend,aes(xintercept=mean), color="red",linetype = "dashed",size = 1)+xlab("AIDS Expected Mean Value")+theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

Testing Effect of Sex in Different Race(My Interaction Term):

# White
h1x <- setx(z.t2, SEX = "Male", RACE.ETHNICITY = "White")
h1x1 <- setx(z.t2, SEX = "Female", RACE.ETHNICITY = "White")
h1s <- sim(z.t2, x = h1x, x1 = h1x1)

# Black
h2x <- setx(z.t2, SEX = "Male", RACE.ETHNICITY = "Black")
h2x1 <- setx(z.t2, SEX = "Female", RACE.ETHNICITY = "Black")
h2s <- sim(z.t2, x = h2x, x1 = h2x1)

# Hispanic
h3x <- setx(z.t2, SEX = "Male", RACE.ETHNICITY = "Hispanic")
h3x1 <- setx(z.t2, SEX = "Female", RACE.ETHNICITY = "Hispanic")
h3s <- sim(z.t2, x = h3x, x1 = h3x1)

# Other
h4x <- setx(z.t2, SEX = "Male", RACE.ETHNICITY = "Other")
h4x1 <- setx(z.t2, SEX = "Female", RACE.ETHNICITY = "Other")
h4s <- sim(z.t2, x = h4x, x1 = h4x1)

Putting All of Them In One Place:

hd1 <- h1s$get_qi(xvalue="x1", qi="fd")
hd2 <- h2s$get_qi(xvalue="x1", qi="fd")
hd3 <- h3s$get_qi(xvalue="x1", qi="fd")
hd4 <- h4s$get_qi(xvalue="x1", qi="fd")

hdfd <- as.data.frame(cbind(hd1, hd2, hd3,hd4))

hdd <- hdfd %>% 
  gather(Race, Aidmv, 1:4)


datatable(hdd)

Group Different Race:

# Renaming
hdd$Race=recode(hdd$Race,'V1' = 'WHITE_FEMALE', "V2"="BLACK_FEMALE", "V3"="HISPANIC_FEMALE","V4" ="OTHER_FEMALE")

# Group By
rf_mean<-hdd %>% 
  group_by(Race) %>% 
  summarise(mean = mean(Aidmv), sd = sd(Aidmv))

datatable(rf_mean)

From this result, I can say that Female of all race are on average less likely to diagnoses with AIDS than male exccept for Hispanic female. Difference between male and female in terms of diagnoses with AIDS among white race is the significant compare to any other races where white female are on average 19.7 less likely to diagnoses with AIDS.

Plotting the Distributions:

ggplot(hdd, aes(Aidmv)) + geom_histogram() + facet_grid(~Race,scales = "free")+geom_vline(data=rf_mean,aes(xintercept=mean), color="red",linetype = "dashed",size = 1)+xlab("First Differences Mean Value ")+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

Result and Limitation :

By conducting simulation base count data regression analysis, I can say that my hypothesis was right that different factor like race, sext and amount of people diagnoses does effect on AIDS diagnoses count. Nevertheless, the important variables that I could not control for were age and income and this variables can effect the number of ADIS diagnosos.The dataset doesn’t contain those variable. Therefore, my analysis can have potential bias, but like every mode, my model is not perfect.

Some More Visitation Using jtools(Without Zelig):

#My Best Model from Zelig

library(jtools)
a.test <- glm(TOTAL.NUMBER.OF.AIDS.DIAGNOSES~SEX* RACE.ETHNICITY+TOTAL.NUMBER.OF.HIV.DIAGNOSES  +TOTAL.NUMBER.OF.CONCURRENT.HIV.AIDS.DIAGNOSES + YEAR, family = poisson(link = "log"), data = hiv)

Creating Poisson Regression Coefficient Plots With ggplot2:

plot_summs(a.test, scale = TRUE, exp = TRUE,color.class = "forestgreen")+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

We can see White Race are significantly differ from other in term how sprade the coefficient are, as we saw before in the expected mean of female AID diagnoses which are really low compare to male.

Ploting Race:

effect_plot(a.test, pred=RACE.ETHNICITY, int.width = .8,colors="forestgreen")+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

This Shows that even though White Female significantly less likely to diagnosis with AIDS than male(from our previous simulation) but model prediction that overall whites are less likely to have AIDS diagnoses compare to Black

Ploting Number of HIV Diagnoses:

effect_plot(a.test, pred = TOTAL.NUMBER.OF.HIV.DIAGNOSES,colors='forestgreen')+ theme(panel.background = element_rect(fill = "aliceblue",colour = "aliceblue"))

As number of HIV Diagnosis increase the number of AIDS increase which we saw before.