1 Categorical Analysis on Mental Health in Tech Company Survey

1.1 Introduction

We will perfom a categorical analysis on the mental survey in the tech workplace from 2014. We will use Treatment (have you thought treatment for a mental health condition) as the reponse variable to perform the categorical analysis. We want to know what are the strongest predictors of mental health illness or certain attitudes towards mental health issue. In addition, we would like to evaluate our model by comparing the predicted treatment vs the actual treatment variable.

1.1.1 Data cleaning

We will clean the data and assign value and groups in this section.

gender. categorize them as female, male, undecided.
treatment, assign value to treatment, if consider treatment =“Yes” then we assign the value= 1, otherwise treat=0 .then we acculmate the the treatment consideration in each country, to calucate the response rate and map it to the geographic locations. Due to the limited data and uneven data collections, the response rate might be locationally biased. For example Japan, and some european countries has 100% response rate, because there’s only 1 patient in the country.
age group, look at the age distribution, and we have decided to stratify into 2 groups, <=30 or >30.
Company Size, we assign no_employees <100 to be small company, 100-1000 to medium company, >= 1000 is large company for future categorical analysis.

#Sex
survey$s<- ifelse(survey$Gender %in% c("F" , "Female", "FEMALE" ,"female", "f"), "Female" ,
                          ifelse(survey$Gender %in% c("Male", "male","M","m","MALE","maile"), "Male", "Undecided"))

#treatment score
survey$treats <- ifelse(survey$treatment=="Yes", 1,0)
                    

#work interfere
survey$worki <- ifelse(is.na(survey$work_interfere), "O", survey$work_interfere)


# company size
survey$cpsize<- ifelse(survey$no_employees %in% c("1-5" , "6-25", "26-100"), "Small",
                ifelse(survey$no_employees %in% c("100-500","500-1000"), "Meduim", "Large"))


#0. age
age1 <- ggplot(survey, aes(Age))+
  geom_histogram()+xlim(0,75)+labs(title="Age Distribution")

age1

## we split the age into two groups. ( <30 or  >=30)
survey$agegp <- ifelse(survey$Age< 30 , "< 30", ">=30")
### graphe age group

#Company size

msp6 <- ggplot(survey, aes(x=no_employees))+
  geom_bar(fill="violet")+
  labs(x="Company size", y="Count",
       title="Company Size Distribution")+ theme(legend.position="none") 
msp6

1.1.2 Geographical Distribution

We will what is the percentage of people considerating a mental health treatment worldwide. In addition, how many people are considerating a mental treatment inside USA in each state

1.1.2.1 Worldwide

In the worldwide mental health treatment consideration map, we used response rate for each country.

\[ response \; rate = \frac{\# people \; consider \; treatment }{ Total \# \; people \; in \; the \; country } \times 100 \]

the graph showed USA has approximately 50% percent of employees consider a mental health treatment. although some countries like China, Russia has a low response rate, we have to keep in mind this does not imply everyone is mentally healthy, simply could be the amount of data is too small or people are not aware of mental health issue yet.

#percentage for each country, thought about rescaling, but will lose much info
icountry <- group_by(survey, Country)
ic2 <- dplyr::summarise(icountry, add=sum(treats,na.rm=TRUE) , n=n())
ic2$treatpct <- ((ic2$add*100)/ (ic2$n))
ic2 <- arrange(ic2, desc(treatpct))

n <- joinCountryData2Map(ic2, joinCode="NAME", nameJoinColumn="Country")

## 48 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 195 codes from the map weren't represented in your data

mapCountryData(n, nameColumnToPlot="treatpct", mapTitle="Worldwide Mental Treatment Consideration Distribution", catMethod="fixedWidth", colourPalette = "rainbow")

# state
istate <- group_by(survey, state )
istate2 <- dplyr::summarise(istate, add=sum(treats, na.rm=TRUE))
istate3 <- filter (istate2, !is.na(state) & add>0 )

map<- map_data("state")

maptest<- data.frame("region"= c("alabama", "arizona",  "california", "colorado", "connecticut" ,"district of columbia","florida", "georgia", "iowa" ,"idaho"  , 
                                 "illinois", "indiana"  , "kentucky"   ,  "louisiana"  , "massachusetts"   , "maryland",  "maine", "michigan","minnesota" , "Missouri" ,"Mississippi" ,      "north carolina", "nebraska" , "new hampshire" , "new jersey", "nevada" ,"new york" , "ohio" , "oklahoma" ,  "oregon" , "pennsylvania"  , "south carolina", "south dakota" , "tennessee" , "texas" , "utah" ,  "virginia", "washington" , "wisconsin" , "wyoming"), "count"=c(7 , 6, 86,  4,  2,  1,  8,  6,  3,  1, 20, 13,  1,  1 ,10,  2,  1, 10, 12,  4,  1,  7, 1,  2,  3 , 2, 30, 20,  2, 17, 14,  2 , 1, 18, 25
, 7 , 6, 41,  9,1  ))

total<- merge(map, maptest, y.by="region")

data("fifty_states")

1.1.2.2 USA

Since majority of the data are collected in USA, 710 employees. So we will focus on exploring the employees’ attitude towards mental health treatment in USA. As the gradient map shows, many people consider the treatment in California, Washington state, New york. We have noticed those states are foundations of tech company.

pt1 <- ggplot(maptest, aes(map_id = region)) + 
  # map points to the fifty_states shape data
  geom_map(aes(fill = count), map = fifty_states) + 
  expand_limits(x = fifty_states$long, y = fifty_states$lat) +
  coord_map() +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "", title="Mental Health Treament Distribution in USA") +
  theme(legend.position = "bottom", 
        panel.background = element_blank()) 

pt1

1.2 General Analysis on Mental Treatment

In this section we will graph all the mental health related variables distributions in three subgroups, Personal Info, Professional working enivorment and Personal attitude. By observating the graphs, we exclude some variables for future analysis. Such as, If two variables are similar (care_options, benefits), the we will pick one. We want to exclude the variable contains too many “NA” or “Don’t Know” .

sex1 <- group_by(survey, s )
sex2 <- dplyr::summarise(sex1,  count=n())


#0. Age group
sp0 <- ggplot(survey, aes(x=agegp))+
  geom_bar( fill="yellow")+
  labs(x="Age group", y="Count",
       title="Age Group")+ theme(legend.position="none") 


#1. Gender
sp1 <- ggplot(sex2, aes(x=s,y=count))+
  geom_bar(aes(y=count, fill="s"), stat="identity")+
  labs(x="Gender", y="Count",
       title="Gender distribution")+ theme(legend.position="none") 

#2. Treatment consideration
ms1 <- group_by(survey, treatment )
ms2 <- dplyr::summarise(ms1,  count=n())

msp1 <- ggplot(ms2, aes(x=treatment,y=count))+
  geom_bar(aes(y=count, fill="tretment"), stat="identity")+
  labs(x="Treatment", y="Count",
       title="Treatment Consideration?")+ theme(legend.position="none") 

#3. family history
msp2 <- ggplot(survey, aes(x=family_history))+
  geom_bar(fill="coral")+
  labs(x="Family History", y="Count",
       title="Family History ")+ theme(legend.position="none") 

#4. remote work
msp3 <- ggplot(survey, aes(x=remote_work))+
  geom_bar(fill="blue")+
  labs(x="Remote work", y="Count",
       title="Working Location")+ theme(legend.position="none") 

#5. benefit
msp4 <- ggplot(survey, aes(x=benefits))+
  geom_bar(fill="darkblue")+
  labs(x="Benefits", y="Count",
       title="Benefits ?")+ theme(legend.position="none") 

#6. Tech company
msp5 <- ggplot(survey, aes(x=tech_company))+
  geom_bar(fill="skyblue")+
  labs(x="Tech", y="Count",
       title="Company Type")+ theme(legend.position="none") 


#7. Company size
msp61 <- ggplot(survey, aes(x=cpsize))+
  geom_bar(fill="violet")+
  labs(x="Company size", y="Count",
       title="Company Size")+ theme(legend.position="none") 

#8. seek help
msp7 <- ggplot(survey, aes(x=seek_help))+
  geom_bar(fill="purple")+
  labs(x="Help", y="Count",
       title="Help")+ theme(legend.position="none") 

#9. care_option
msp8 <- ggplot(survey, aes(x=care_options))+
  geom_bar(fill="pink")+
  labs(x="care Option", y="Count",
       title="care option ?")+ theme(legend.position="none") 

#10. mental_health_consequences
msp9 <- ggplot(survey, aes(x=mental_health_consequence))+
  geom_bar(fill="green")+
  labs(x="Mental cons", y="Count",
       title="Mental cons")+ theme(legend.position="none") 

#11. Physical health
msp10 <- ggplot(survey, aes(x=phys_health_consequence))+
  geom_bar(fill="seagreen")+
  labs(x="Physical Health", y="Count",
       title="physical health")+ theme(legend.position="none") 

#12. Mental health interview
msp11 <- ggplot(survey, aes(x=mental_health_interview))+
  geom_bar(fill="darkgreen")+
  labs(x="Mental interview", y="Count",
       title="Mental interview")+ theme(legend.position="none") 

#13. mental vs physical
msp12 <- ggplot(survey, aes(x=mental_vs_physical))+
  geom_bar(fill="seagreen2")+
  labs(x="Mental vs Physical", y="Count",
       title="Mental vs Physical")+ theme(legend.position="none") 

#14 work interfere
msp13 <- ggplot(survey, aes(x=work_interfere))+
  geom_bar(fill="seagreen3")+
  labs(x="Work interfere", y="Count",
       title="Work interfere")+ theme(legend.position="none", axis.text.x=element_text(size=rel(0.7))) 


#15 leave work for mental health issue
msp14 <- ggplot(survey, aes(x=leave))+
  geom_bar(fill="seagreen4")+
  labs(x="Leave work", y="Count",
       title="Work Leave")+ theme(legend.position="none", axis.text.x=element_text(size=rel(0.5))) 


#16 coworker relations
msp15 <- ggplot(survey, aes(x=coworkers))+
  geom_bar(fill="cadetblue")+
  labs(x="Coworker relation", y="Count",
       title="Coworker relation")+ theme(legend.position="none")


#17 supervisor relations
msp16 <- ggplot(survey, aes(x=supervisor))+
  geom_bar(fill="cadetblue1")+
  labs(x="Supervisor relation", y="Count",
       title="Supervisor relation")+ theme(legend.position="none")

1.2.1 Personal Info

we will look into the employees personal info, such as age, family history, gender, consider of treatment, phyical vs mental. From those info, we can see the percentage to consider treatment is very close. We have more employee greater than 30 years old. Since the survey is done in a tech related company, we have a lot more male than felmales. The mental vs Physical condition contains too many “Don’t Know”, so we will drop this variable for our analysis. Most of people do not suffer a mental health issue. However, we can see there are also quiet a number of people have various degree of mental issues.

mplot1 <- multiplot(sp0, sp1, msp1, msp2, msp12, msp11,  cols=3)

1.2.2 Professional Working Enviroment

We will show the distribution that is professional related , such as company size, benefit working enivorment . The difficult level of leaving company? We can see most of the survey are from a small Tech company (<100 people). about 2/3 of them are working on site. We are not sure if care option or Working intereference will do to the model, so we will exam them in the later section, however, we will exclude Leave for work and Benefits variables. Because there are too much “Don’t Know”, even it has association with the reponse variable “Treatment”, we can’t give a definitely classification.

mplot2 <- multiplot(msp3, msp4, msp5, msp61, msp8, msp13, msp14,  cols=3)

1.2.3 Personal Attitude

we will show the graph that’s related to personal attitude towards mental health. For instance, the willingness to talk to their problems to supervisor and coworker. Are they provided mental health learning and channels to seek for help? The opinion of physical health to mental health consequence. From the graph it shows most of the people are not provided enough mental health info for them to learn.

mplot3 <- multiplot(msp7, msp9, msp10, msp15, msp16,  cols=3)

1.3 Association Selection

The previous section, we ruled out some variables for our final model selection by visualization. Now we want to use the mosaic plot to see some ambigous variables such as “coworked relation” would be an important factor for Treatment consideration.

1.3.1 Personal info

We can see age, gender and family history definatley matters , as well as bring the mental health issue to a potential employer

#Age group + family hitory
arwork <- group_by(survey, treatment, agegp, family_history)
arwork1 <- summarise(arwork,  count=n())
a.lm <- loglm(count~ agegp +treatment+ family_history, data=arwork1)
a.lm1<-mosaic(a.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

# gender
grwork <- group_by(survey, treatment, s)
grwork1 <- summarise(grwork,  count=n())
g.lm <- loglm(count~ s +treatment, data=grwork1)
g.lm1<-mosaic(g.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

#mental interview
mtiwork <- group_by(survey, treatment, mental_health_interview)
mtiwork1 <- summarise(mtiwork,  count=n())
mti.lm <- loglm(count~ mental_health_interview +treatment, data=mtiwork1)
mti.lm1<-mosaic(mti.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

1.3.2 Professional Enviroment

we can see the size didn’t matter, neither does the company type and working locations. but the Care options mattered, when compnay provided caring option the more people attend the treatment or considering the physical treatment. or if the work has been interfered, we can see never interfered doesnt need treatment, sometimes and often interfered people demonstrated a strong tendency need treatment.

#Remote work
mrwork <- group_by(survey, treatment, remote_work)
mrwork1 <- summarise(mrwork,  count=n())
w.lm <- loglm(count~ remote_work+treatment, data=mrwork1)
w.lm1<-mosaic(w.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

#company size
mcwork <- group_by(survey, treatment, cpsize)
mcwork1 <- summarise(mcwork,  count=n())
wc.lm <- loglm(count~ cpsize+treatment, data=mcwork1)
wc.lm1<-mosaic(wc.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

#Company type (tech vs non-tech)
mtwork <- group_by(survey, treatment, tech_company)
mtwork1 <- summarise(mtwork,  count=n())
wt.lm <- loglm(count~ tech_company+treatment, data=mtwork1)
wt.lm1<-mosaic(wt.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

#Company care option
ctwork <- group_by(survey, treatment, care_options)
ctwork1 <- summarise(ctwork,  count=n())
ct.lm <- loglm(count~ care_options +treatment, data=ctwork1)
ct.lm1<-mosaic(ct.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

#Working Interfere 
itwork <- group_by(survey, treatment, work_interfere)
itwork1 <- summarise(itwork,  count=n())
it.lm <- loglm(count~ work_interfere +treatment, data=itwork1)
it.lm1<-mosaic(it.lm , clip=FALSE, abbreviate_labs=TRUE)

1.3.3 Personal Attitude

From the Mosaic graph, we can see coworker actually made a differences with treatment, however, supervisor doesn’t. In addition, seek for help and mental health consequences are associated with Mental health,too.

# coworker
mcowork <- group_by(survey, treatment, coworkers)
mcowork1 <- summarise(mcowork,  count=n())
v.lm <- loglm(count~ coworkers+treatment, data=mcowork1)
v.m1<-mosaic(v.lm , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

# supervisor
msup <- group_by(survey, treatment, supervisor)
msup1 <- summarise(msup,  count=n())
v.lm1 <- loglm(count~ supervisor+treatment, data=msup1)
v.m1<-mosaic(v.lm1 , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

# seek for help
mseek <- group_by(survey, treatment, seek_help)
mseek1 <- summarise(mseek,  count=n())
v.lm2 <- loglm(count~ seek_help+treatment, data=mseek1)
v.m2<-mosaic(v.lm2 , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

# physical health consequence
mphy <- group_by(survey, treatment, phys_health_consequence)
mphy1 <- summarise(mphy,  count=n())
v.lm3 <- loglm(count~ phys_health_consequence+treatment, data=mphy1)
v.m3<-mosaic(v.lm3 , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

# mental health consequence
mment <- group_by(survey, treatment, mental_health_consequence)
mment1 <- summarise(mment,  count=n())
v.lm4 <- loglm(count~ mental_health_consequence+treatment, data=mment1)
v.m4<-mosaic(v.lm4 , clip=FALSE, gp_args = list(interpolate = c(1, 1.8)))

1.4 Logistic Model

From the above simple mosaic graph analysis and p-value. we decide to continue working with the following variables : age group, gender, mental_interview, family history, care options, work interfere, seek for help, cowork and mental health consequences. We will use all the variable at once for the Logistic model, it is not going to be visually friendly, so we will keep seperating the variables into three categories as before for visualization purpose.

1.4.1 Personal info Model

we use age group, family history, gender and mental_health_interview to predict the treatment. As the result we can see, Female with a family history is the “high risk” population. Undecided gender is high risk too. maybe they are very vulnerable .

arworkt <- group_by(survey, treatment, agegp, family_history, mental_health_interview, s)
arwork1t <- summarise(arworkt,  count=n())

model1 <- multinom(treatment ~ family_history + agegp +s +mental_health_interview, weight=count, data= arwork1t)

## # weights:  8 (7 variable)
## initial  value 872.672300 
## iter  10 value 758.409887
## final  value 758.369928 
## converged

model1

## Call:
## multinom(formula = treatment ~ family_history + agegp + s + mental_health_interview, 
##     data = arwork1t, weights = count)
## 
## Coefficients:
##                (Intercept)          family_historyYes 
##                 -0.3336442                  1.5854053 
##                  agegp>=30                      sMale 
##                  0.2716196                 -0.8854718 
##                 sUndecided  mental_health_interviewNo 
##                 -0.3560520                  0.3459768 
## mental_health_interviewYes 
##                  0.8638573 
## 
## Residual Deviance: 1516.74 
## AIC: 1530.74

arwork1t$probability<- (predict(model1, type="probs"))

gg <- ggplot(arwork1t, aes(x=count, y=probability, colour=s))+        geom_line(size=2.5) + theme_bw() +
      xlim(0,150)+ geom_point(color="black", size=1.5) +facet_grid(agegp~family_history) +labs (title="Treatment ( Age group vs family history)")

gg

1.4.2 Professional Enviorment

we can see never being bothered (1) has lower probability of getting a treatment, where has often bothered, sometimes being bothered has higher probability of considerating a treatment.

arworkc <- group_by(survey, treatment,care_options, worki)
arwork1c <- summarise(arworkc,  count=n())

model1c <- multinom(treatment ~ care_options + worki, weight=count, data= arwork1c)

## # weights:  8 (7 variable)
## initial  value 872.672300 
## iter  10 value 506.770431
## final  value 504.977857 
## converged

model1c

## Call:
## multinom(formula = treatment ~ care_options + worki, data = arwork1c, 
##     weights = count)
## 
## Coefficients:
##          (Intercept) care_optionsNot sure      care_optionsYes 
##          -2.13489667           0.03120351           0.99877111 
##               worki2               worki3               worki4 
##           3.53559286           2.57883530           2.98334912 
##               workiO 
##          -2.33022654 
## 
## Residual Deviance: 1009.956 
## AIC: 1023.956

arwork1c$probability<- (predict(model1c, type="probs"))

gg1 <- ggplot(arwork1c, aes(x=count, y=probability, colour=treatment))+        geom_line(size=2.5) + theme_bw() +
      xlim(0,150)+ geom_point(position = position_jitter(height = 0.02, width = 0)) + geom_smooth(method = "glm", method.args = list(family="quasibinomial"),alpha= 0.2, aes(fill=treatment), size= 2.5, fullrange = TRUE)+labs (title="Treatment  vs Work interfer")

gg2a <- gg1+ facet_wrap(~ worki)

gg2a

1.4.3 Personal Attitude

We can see if an employee open to talk to a mental health problem with a coworker and has a chanel to seek for help has higher rate of getting a treatment(prob in (0.5 to 0.7)) ,than dont know a chanel or doesnt have a chanel to seek for help and dont want to disscuss the mental problems with coworkers(prob in (0.3 ,0.5))

arworkp <- group_by(survey, treatment, seek_help, mental_health_consequence, coworkers)
arwork1p <- summarise(arworkp,  count=n())

model1p <- multinom(treatment ~ seek_help + coworkers  +mental_health_consequence, weight=count, data= arwork1p)

## # weights:  8 (7 variable)
## initial  value 872.672300 
## iter  10 value 842.598633
## final  value 842.150982 
## converged

model1p

## Call:
## multinom(formula = treatment ~ seek_help + coworkers + mental_health_consequence, 
##     data = arwork1p, weights = count)
## 
## Coefficients:
##                  (Intercept)                  seek_helpNo 
##                  -0.46433290                   0.06566466 
##                 seek_helpYes        coworkersSome of them 
##                   0.54904249                   0.49819807 
##                 coworkersYes  mental_health_consequenceNo 
##                   1.05447601                  -0.65080606 
## mental_health_consequenceYes 
##                   0.45485041 
## 
## Residual Deviance: 1684.302 
## AIC: 1698.302

arwork1p$probability<- (predict(model1p, type="probs"))

gg2 <- ggplot(arwork1p, aes(x=count, y=probability, colour=treatment))+        geom_line(size=2.5) + theme_bw() +
      xlim(0,70)+ geom_point(color="black", size=1.5) +facet_grid(seek_help~ coworkers) +labs (title="Treatment (seek help vs coworkers)")
gg2

1.5 Model Evaluating

1.5.1 Logistic Regression Review

we will use a logistic regression to predict treatment(Yes, No) base on multtiple predictors

The logistic regression model is a linear model for the log odds, or logit that $ Y = 1 $ , given the values in x,

\[Logit(\Pi_i)= log(\frac{\Pi_i}{1- \Pi_i})= \alpha +{\beta}*x_i^{T} = \alpha+\beta_1*x_{i1}+\beta_2*x_{i2}+ ..+ \beta_t*x_{it}\]

1.5.2 Modeling

From the previous work , we will construct a final model with all the variables that we think it is signifcant and use it to predict treatment. If predicted probability <0.5 , then predicted treatment =“No”, otherwise if predicted probability >= 0.5, then predicted treatment=“Yes”. then we will compare the predicted value and actual value. Since the response variable is categorical, either yes or no. We will use Logistic method to do the prediction. We can see our model has a 83.5% correct rate. So we think it is a good model

final.glm <- glm(formula= treatment ~ agegp + s+ family_history + mental_health_interview + care_options+ worki + seek_help+ mental_health_consequence + coworkers , data= survey, family=binomial)

summary(final.glm)

## 
## Call:
## glm(formula = treatment ~ agegp + s + family_history + mental_health_interview + 
##     care_options + worki + seek_help + mental_health_consequence + 
##     coworkers, family = binomial, data = survey)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6452  -0.3760   0.2197   0.6285   3.2411  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.02061    0.45032  -4.487 7.22e-06 ***
## agegp>=30                     0.18516    0.17202   1.076 0.281754    
## sMale                        -0.83671    0.23117  -3.620 0.000295 ***
## sUndecided                   -0.63115    0.45953  -1.373 0.169610    
## family_historyYes             0.98502    0.17296   5.695 1.23e-08 ***
## mental_health_interviewNo     0.38568    0.25605   1.506 0.132009    
## mental_health_interviewYes    1.09313    0.55645   1.964 0.049474 *  
## care_optionsNot sure         -0.19165    0.21805  -0.879 0.379447    
## care_optionsYes               0.93628    0.20640   4.536 5.73e-06 ***
## worki2                        3.42097    0.33244  10.291  < 2e-16 ***
## worki3                        2.41564    0.27794   8.691  < 2e-16 ***
## worki4                        2.80189    0.24505  11.434  < 2e-16 ***
## workiO                       -2.34152    0.54905  -4.265 2.00e-05 ***
## seek_helpNo                  -0.49140    0.21023  -2.337 0.019416 *  
## seek_helpYes                 -0.35605    0.26376  -1.350 0.177043    
## mental_health_consequenceNo  -0.25363    0.20986  -1.209 0.226827    
## mental_health_consequenceYes -0.08125    0.21905  -0.371 0.710716    
## coworkersSome of them         0.31102    0.21691   1.434 0.151609    
## coworkersYes                  0.98924    0.31582   3.132 0.001734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1745.17  on 1258  degrees of freedom
## Residual deviance:  932.67  on 1240  degrees of freedom
## AIC: 970.67
## 
## Number of Fisher Scoring iterations: 6

survey$predict <- predict(final.glm, survey, type="response")
survey$predictc <- ifelse( survey$predict<0.5 , "No", "Yes")

c <- table(survey$treatment, survey$predictc, dnn=c("Real","Predict") )
c

##      Predict
## Real   No Yes
##   No  474 148
##   Yes  59 578

#Correct Percentage 

correct.percentage <-(474+ 578)*100/ (1259)
correct.percentage

## [1] 83.55838

#Final graph

d <- survey[, c(8,34)]
df <- group_by(d, treatment, predictc)
df1 <- summarise(df,  count=n())
df1$label <- paste(df1$treatment, df1$predictc, sep=",")
df1$predict <- ifelse(df1$label %in% c("Yes,Yes" ,"No,No") , "Correct","Wrong")

gg.final <- ggplot(df1, aes(x=reorder(label,count), y=count))+  geom_bar(aes(y=count, fill=predict), stat="identity") +theme_bw() + labs(title="Treatment Prediction Correctness Distribution" ,x="Real vs Predict value")

gg.final

1.6 Conclusion

From previous categorical analysis on Mental health, we observed personal fact does matter, in addition , out attitude towards mental health matter as well. On the contrast, the com

comments <- survey[ , c(1,27)]
comments1 <- filter(comments, !is.na(comments))
comments2 <- comments1 [ , c(2)]

set.seed(888)
wordcloud(comments2, max.words=100 ,random.order=FALSE,rot.per=0.35,colors=brewer.pal(8, "Dark2"), main="Title")

a545_lab3

Fangya Tan

March 12, 2018