Introduction

“A habit is defined as a way of behaving that is repeated so often it no longer involves a conscious thought.” http://www.encyclopedia.com/medicine/news-wires-white-papers-and-books/habits-and-behaviors Human beings are surrounded by habits because by nature they are creatures of habit. Some of them are considered “good” or positive like exercising every morning from 7 to 8am. Others are considered “bad” or even prejudicial, such as smoking a cigarette or taking drugs. In the United States, there are a few studies that identify these negative habits and analyze their influence on one’s behavior. More specifically, how these habits can lead to serious consequences such as criminal activity and mental illnesses that can produce harm to others and to themselves.

Description of Data

[The data used was part of a study entitled “Gender, Mental Illness and Crime in the United States” developed by Melissa Thompson as part of the Inter-university Consortium for Political and Social Research] http://www.icpsr.umich.edu/cocoon/ICPSR/TERMS/27521.xml. This study was developed to capture the effect of gender on some mental illnesses such as depression, on drug use and on crime. This study also tries to capture the interaction with the criminal justice system on depression and drug use. This database is composed of 55,602 respondents from the National Household Survey on Drug Use and Health (NSDUH) in 2004. Respondents were members of United States households from ages 12 and above. Information related to the use of illicit drugs, alcohol and tobacco, as well as the criminal activity, depression and other factors were included. Other additional variables, such as the depression index for both adults and youth were included by the researcher.

Background

A common mental illness that affects more than 15 million American adults is known as depression. There are different forms of depression, one is acknowledged as Persistent Depressive Disorder, or PDD and is formely known as dysthymia. This one affects about 3.3 million American adults https://www.adaa.org/about-adaa/press-room/facts-statistics.

According to the World Health Organization (WHO, 2010), major depression also carries the heaviest burden of disability among mental and behavioral disorders. According to the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV), a major depressive episode is known as a period of two or more weeks of a depressed mood or loss of interest. It can be reflected through a number of symptoms such as sleeping problems, eating disorders, lack of energy, concentration and self-image issues. According to the National Institute of Mental Health, an estimated 16.1 million adults aged 18 or older in the United States had at least one major depressive episode in the past year https://www.nimh.nih.gov/health/statistics/prevalence/major-depression-among-adults.shtml.

What are the causes of depression? Many factors and reasons have been given to the causes of depression but there are some that seem very logical. Addiction is a well-known negative habit that is not only related to drug use, it can also relate to tobacco use and alcoholic drinking. These habits are more common in depressed individuals than in the general population. As the National Institute on Alcohol Abuse and Alcoholism, both men and women who are diagnosed with depression are most likely to be alcohol-dependent http://pubs.niaaa.nih.gov/publications/arh26-2/90-98.htm.

Purpose of this Study

The purpose of this study in particular is to understand the influence of gender and other population characteristics (age, income and/or education levels) and the addiction level (use of drugs, tobacco and alcohol consumption) in mental illness, particularly depression. The Thompson’s study includes criminal activity, yet this study will only be limited by the depression index calculated for both youth and adults in the study. An experimental design is introduced to identify the differences between population groups and different levels of addiction to depression. ANOVA is used as a tool of analysis for testing these differences and plots of main and interaction effects are introduced to provide a visual aid for the analysis.

Importing and Cleaning Data Set

To start the analysis, the original dataset was imported, which included a total of 55,602 observations and 3,011 variables.

#loading data from TSV file
crimedata <- read.delim("CrimeData.tsv")
newdata <- data.frame(crimedata$CASEID, crimedata$IRSEX, crimedata$CATAG2, crimedata$INCOME_R, crimedata$EDU_DUMMY, crimedata$BINGEHVY, crimedata$CDCGMO, crimedata$CDNOCGMO, crimedata$CDUFLAG, crimedata$SUMFLAG, crimedata$MJOFLAG, crimedata$IEMFLAG, crimedata$PSYFLAG2, crimedata$ANYCRIME, crimedata$DEPRESSIONINDEX2)

As observed above, the original dataset included a total of 55,602 observations and 3,011 variables. In accordance to the purpose of the study only 15 variables remained, which are contained in newdata dataset. A glimpse of this dataset can be observed below:

load("data2.RData")
head(newdata)
##   id IRSEX CATAG2 INCOME_R EDU_DUMMY BINGEHVY CDCGMO CDNOCGMO CDUFLAG
## 1  1     1      2        1         1        1      0        0       0
## 2  2     1      2        1         1        3      0        0       0
## 3  3     1      3        3         1        4      0        1       1
## 4  4     1      2        1         1        1      0        1       1
## 5  5     2      3        1         0        4      0        0       0
## 6  6     1      3        2         0        3      0        1       1
##   SUMFLAG MJOFLAG IEMFLAG ANYCRIME DEPRESSIONINDEX2
## 1       1       0       1        0               -1
## 2       1       1       0        0               -1
## 3       0       0       0        0               -1
## 4       1       0       1        0               -1
## 5       1       1       0        0               -1
## 6       0       0       0        0                5
tail(newdata)
##          id IRSEX CATAG2 INCOME_R EDU_DUMMY BINGEHVY CDCGMO CDNOCGMO
## 55597 55597     2      2        3         1        2      1        0
## 55598 55598     1      2        1         1        2      0        0
## 55599 55599     1      2        2         0        1      1        0
## 55600 55600     2      2        1         0        3      1        0
## 55601 55601     1      3        2         1        3      0        0
## 55602 55602     1      3        7         1        3      0        0
##       CDUFLAG SUMFLAG MJOFLAG IEMFLAG ANYCRIME DEPRESSIONINDEX2
## 55597       1       1       0       1        0               -1
## 55598       0       0       0       0        0               -1
## 55599       1       1       0       1        0               -1
## 55600       1       1       0       1        0               -1
## 55601       0       1       1       0        0               -1
## 55602       0       1       0       1        0               -1

A description of these variables are presented in Table 1.

Initial summary statistics were also produced for each one of the variables under study:

summary(newdata$IRSEX)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    1.52    2.00    2.00
summary(newdata$CATAG2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    2.00    2.01    3.00    3.00
summary(newdata$INCOME_R)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    1.00    1.00    2.06    3.00    7.00
summary(newdata$EDU_DUMMY)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3222  1.0000  1.0000
summary(newdata$DEPRESSIONINDEX2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.0000 -1.0000 -1.0000 -0.4923 -1.0000  9.0000

As observed, these variables need some treatment to work for the experimental design. Factor variables were created representing demographic groups gender, age, income; and addiction related subgroups cigar, druguse and alcohol. The continuous variable under study was depression. This variable included a -1- value that was affecting the mean value, as observed above. This value was assigned to non-respondents, thus the observations had to be dropped. The final subset was generated as study2.

The new structure of this dataset includes the factor variables created as follows:

str(study2)
## 'data.frame':    3808 obs. of  9 variables:
##  $ gender    : Factor w/ 2 levels "male","female": 1 2 2 1 2 1 1 2 1 2 ...
##  $ age       : Factor w/ 3 levels "young","yadult",..: 3 3 2 2 3 3 3 2 2 2 ...
##  $ income    : int  2 1 3 3 5 1 3 1 3 1 ...
##  $ educ      : Factor w/ 2 levels "low","high": NA NA NA 1 1 NA 1 NA NA 1 ...
##  $ alcohol   : Factor w/ 4 levels "none","low","binge",..: 2 3 1 4 1 1 4 1 3 3 ...
##  $ druguse   : Factor w/ 3 levels "none","low","high": 1 3 3 3 2 2 2 2 1 3 ...
##  $ cigar     : Factor w/ 3 levels "nodaily","dailylow",..: 3 2 2 2 3 2 2 1 1 2 ...
##  $ depression: int  5 9 5 7 9 8 4 6 4 9 ...
##  $ crime     : int  0 0 0 0 0 0 0 1 0 1 ...

The response variable depression after cleaning the data can be observed as follows:

summary(study2$depression)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   5.000   7.000   6.414   8.000   9.000

Exploratory Data Analysis

The factors selected for the study were evaluated and the response variable distribution was also observed. The following histogram presents the values for depression:

hist(study2$depression)

###Main Effects

A box plot for each subgroup is generated to observe the main effects of each one of the factors. The effect of a factor is defined to be the change in the response variable produced by a change in the level of the factor (Montgomery, 2016).

Let’s see for example for Gender, it can be observed that there is a slight difference between males and females, but it appears they are not completely different from one another.

#Boxplot for Gender
boxplot(study2$depression ~ study2$gender, main = "Depression Index per Gender", xlab= "Gender", ylab ="depression index")

Now let’s observe the variable age. Notice that the youth group was not recorded in the depression index as observed, thus the young adults and adults are the remaining levels for this factor:

#Boxplot for Age group
boxplot(study2$depression ~ study2$age, main = "Depression Index for age group", xlab= "Age", ylab ="depression index")

For the income variable, we can observe the different levels:

#Boxplot for Income
boxplot(study2$depression ~ study2$income, main = "Depression Index for income ranges", xlab= "Income Range", ylab ="depression index")

Now, for the druguse variable, we can observe the difference among the different druguse habits:

#Boxplot for Drug Use
boxplot(study2$depression ~ study2$druguse, main = "Depression Index based on Drug Use", xlab= "Level of Drug Use", ylab ="depression index")

The same analysis is followed by the cigar variable:

#Boxplot for Daily Cigar Use
boxplot(study2$depression ~ study2$cigar, main = "Depression Index based on Daily Cigar Use", xlab= "Level of Daily Cigar Use", ylab ="depression index")

Finally, we are able to observe the last variable alcohol related to Alcohol Consumption:

#Boxplot for Alcohol
boxplot(study2$depression ~ study2$alcohol, main = "Depression Index based on Alcohol", xlab= "Level of Drug Use", ylab ="depression index")

The numerical results of these main effects are provided in this report for gender, druguse, cigar and alcohol . For simplicity reasons, we are not calculating the other factors, although the code is provided to perform de calculations to the other variables in the APPENDIX. Main effects calculated were as follows:

me_gender <- mean(subset(study2$depression, study2$gender== "male")) - mean(subset(study2$depression, study2$gender == "female"))
me_gender
## [1] -0.4167738
me_druguse1 <- mean(subset(study2$depression, study2$druguse== "low")) - mean(subset(study2$depression, study2$druguse == "none"))
me_druguse1
## [1] 0.4621165
me_druguse2 <- mean(subset(study2$depression, study2$druguse== "high")) - mean(subset(study2$depression, study2$druguse == "low"))
me_druguse2
## [1] 0.248422
me_druguse3 <- mean(subset(study2$depression, study2$druguse== "high")) - mean(subset(study2$depression, study2$druguse == "none"))
me_druguse3
## [1] 0.7105384
me_alcohol <- mean(subset(study2$depression, study2$alcohol== "heavy")) - mean(subset(study2$depression, study2$alcohol == "none"))
me_alcohol
## [1] -0.007538116
me_cigar1 <- mean(subset(study2$depression, study2$cigar== "dailylow")) - mean(subset(study2$depression, study2$cigar == "nodaily"))
me_cigar1
## [1] 0.7657666
me_cigar2 <- mean(subset(study2$depression, study2$cigar== "dailyhigh")) - mean(subset(study2$depression, study2$cigar == "dailylow"))
me_cigar2
## [1] -0.2777636
me_cigar3 <- mean(subset(study2$depression, study2$cigar== "dailyhigh")) - mean(subset(study2$depression, study2$cigar == "nodaily"))
me_cigar3
## [1] 0.488003

Two Factor Interactions

When observing the differences between multiple factors, it is often important to analyze the interactions among them. The two-factor interactions can provide a magnitude of these differences when the main effects is not significant in at least one of the variables (Montgomery, 2016).

The following plots present the two factor interactions for the combination of the four variables mentioned above.

Results from these plots indicate that there might be some significant interaction effects between druguse, cigar and alcohol.

ANOVA

The Analysis of Variance (ANOVA) is a statistical procedure used for comparing means from subsets of data and/or test the degree to which they vary or differ in an experiment. More specifically, this procedure provides the test that that compares the variation among the observations that belong to this subsets.

One-Way ANOVA

The one-way Analysis of Variance is used to compare one independent (categorical) variable to one continuous variable. The latter is an extension of the two-sample t test for independent groups, where more than two groups are being compared. As explained by Hall in https://web.mst.edu/~psyworld/anovadescribe.htm, the data is sub-divided into different groups based on a single factor and the treatment or set of factor levels. Then the one-way ANOVA investigates whether the variation in the measurements taken on the individual components of the data sets can be explained by the treatment levels introduced by the factor.

An example of this can be observed for each one of the most significant variables found in the study: gender, income, cigar and druguse:

##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$gender    1    149  148.56   23.64 1.21e-06 ***
## Residuals     3806  23917    6.28                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##               Df Sum Sq Mean Sq F value Pr(>F)
## study2$age     1      4   4.471   0.707    0.4
## Residuals   3806  24061   6.322
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$income    1    111  110.63   17.58 2.82e-05 ***
## Residuals     3806  23955    6.29                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                  Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$druguse    2    355  177.27   28.45 5.46e-13 ***
## Residuals      3805  23711    6.23                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                Df Sum Sq Mean Sq F value Pr(>F)    
## study2$cigar    2    480   239.9    38.7 <2e-16 ***
## Residuals    3805  23586     6.2                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                  Df Sum Sq Mean Sq F value Pr(>F)
## study2$alcohol    3     17   5.818    0.92   0.43
## Residuals      3804  24048   6.322

As observed by the F-statistic of the first ANOVA, there is a significant difference between males and females on depression level as the test proves their means to be statistically different at a 99.9% level. Likewise, the second ANOVA suggests that a person who has a higher exposure to drugs could be highly depressive as this test provides a significant difference between these groups on depression. The same is true for cigar and income. There are no significant differences among age groups nor alcohol levels, as the p-values exceed 0.05 for a 95% confidence interval.

Two-Way ANOVA

The two-way ANOVA is also provided to observe the differences the demographic variables with the addiction related variables, as we may observe a combined effect that it may present on depression. For example, males who do heavy drinking could be less depressive than women who also do heavy drinking. The following analysis for the two-way ANOVA to evaluate gender interactions are shown below:

anova14 <- aov(study2$depression ~ study2$gender*study2$druguse)
summary(anova14)
##                                Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$gender                   1    149  148.56  24.040 9.83e-07 ***
## study2$druguse                  2    403  201.66  32.633 8.87e-15 ***
## study2$gender:study2$druguse    2     19    9.40   1.521    0.219    
## Residuals                    3802  23495    6.18                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova15 <- aov(study2$depression ~ study2$gender*study2$cigar)
summary(anova15)
##                              Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$gender                 1    149  148.56  24.142 9.33e-07 ***
## study2$cigar                  2    505  252.34  41.008  < 2e-16 ***
## study2$gender:study2$cigar    2     17    8.55   1.389    0.249    
## Residuals                  3802  23395    6.15                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova16 <- aov(study2$depression ~ study2$gender*study2$alcohol)
summary(anova16)
##                                Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$gender                   1    149  148.56  23.684 1.18e-06 ***
## study2$alcohol                  3     34   11.30   1.801    0.145    
## study2$gender:study2$alcohol    3     48   15.91   2.536    0.055 .  
## Residuals                    3800  23835    6.27                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the previous results it seems there is potentially a difference among gender and alcohol level interactions. The F-statistic is not as high, but it assures a 90% confidence interval.

Now, given that the income level groups appeared to be statistically significant in the individual analysis, a two-way ANOVA was developed as to observe the effect of the interaction of the income with respect to the different addiction levels:

anova34 <- aov(study2$depression ~ study2$income*study2$druguse)
summary(anova34)
##                                Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$income                   1    111  110.63  17.839 2.46e-05 ***
## study2$druguse                  2    367  183.72  29.626 1.71e-13 ***
## study2$income:study2$druguse    2     10    4.94   0.797    0.451    
## Residuals                    3802  23578    6.20                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova35 <- aov(study2$depression ~ study2$income*study2$cigar)
summary(anova35)
##                              Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$income                 1    111  110.63  17.906 2.38e-05 ***
## study2$cigar                  2    462  231.17  37.417  < 2e-16 ***
## study2$income:study2$cigar    2      3    1.46   0.236     0.79    
## Residuals                  3802  23490    6.18                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova36 <- aov(study2$depression ~ study2$income*study2$alcohol)
summary(anova36)
##                                Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$income                   1    111  110.63  17.570 2.83e-05 ***
## study2$alcohol                  3      8    2.61   0.414    0.743    
## study2$income:study2$alcohol    3     22    7.23   1.148    0.328    
## Residuals                    3800  23925    6.30                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this analysis it proves that differences are significant (with the exception of alcohol) but their interactions are not.

Further and based on the Interaction Plots observed previously, the final ANOVA tables provide the analysis for interactions among the addiction levels.

anova45 <- aov(study2$depression ~ study2$druguse*study2$cigar)
summary(anova45)
##                               Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$druguse                 2    355  177.27  28.725 4.15e-13 ***
## study2$cigar                   2    250  124.95  20.247 1.79e-09 ***
## study2$druguse:study2$cigar    4     17    4.24   0.687    0.601    
## Residuals                   3799  23444    6.17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova46 <- aov(study2$depression ~ study2$druguse*study2$alcohol)
summary(anova46)
##                                 Df Sum Sq Mean Sq F value   Pr(>F)    
## study2$druguse                   2    355  177.27  28.505 5.16e-13 ***
## study2$alcohol                   3     52   17.23   2.771   0.0401 *  
## study2$druguse:study2$alcohol    6     52    8.69   1.397   0.2119    
## Residuals                     3796  23607    6.22                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova56 <- aov(study2$depression ~ study2$cigar*study2$alcohol)
summary(anova56)
##                               Df Sum Sq Mean Sq F value Pr(>F)    
## study2$cigar                   2    480  239.88  38.728 <2e-16 ***
## study2$alcohol                 3     32   10.55   1.703  0.164    
## study2$cigar:study2$alcohol    6     42    6.98   1.127  0.344    
## Residuals                   3796  23512    6.19                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results show no significant interaction differences among the population segments.

Final Remarks

This report developed an analysis that related depression on different segments of the population and addictive habits. The Analysis of Variance was used to identified potential differences in depression based on different levels of addiction, on gender, age groups and income level.

Results have shown that there is in fact a strong statistical difference among male and females with respect to depression. Also, it is important to highlight the income level groups are also differentiated in the depression indexes calculated from the study. Another important aspect of the study was the drug use levels clearly produce different depression indices among the population groups and daily cigar use follow the same trend. It was interesting to notice that the level of alcohol consumption was not captured in the study, only when interacting with gender. Following this study it should be noticed that eventhough alcohol levels do not potentially affect depression, it might affect other behavior traits, such as criminal activity. Further studies on criminal activity could provide a broader analysis to the research findings.

References

  1. ADAA.org, Facts and Statistics, https://www.adaa.org/about-adaa/press-room/facts-statistics
  2. Hall, R.(1998) https://web.mst.edu/~psyworld/anovadescribe.htm
  3. Montgomery, D. (2016), Design and Analysis of Experiments, Wiley.
  4. National Institute of Mental Health https://www.nimh.nih.gov/health/statistics/prevalence/major-depression-among-adults.shtml
  5. National Institute on Alcohol Abuse and Alcoholism, http://pubs.niaaa.nih.gov/publications/arh26-2/90-98.htm
  6. Thompson, M. (2011), Gender, Mental Illness, and Crime in the United States, 2004. ICPSR27521-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2011-02-10. http://doi.org/10.3886/ICPSR27521.v1”)

APPENDIX

Complete code is provided below:

#loading data from TSV file
crimedata <- read.delim("CrimeData.tsv")
str(crimedata)

#creating new dataframe with variables selected for this study
newdata <- data.frame(id = crimedata$CASEID, IRSEX = crimedata$IRSEX, CATAG2 = crimedata$CATAG2, INCOME_R = crimedata$INCOME_R, EDU_DUMMY= crimedata$EDU_DUMMY, BINGEHVY = crimedata$BINGEHVY, CDCGMO = crimedata$CDCGMO, CDNOCGMO = crimedata$CDNOCGMO, CDUFLAG = crimedata$CDUFLAG, SUMFLAG =  crimedata$SUMFLAG, MJOFLAG = crimedata$MJOFLAG, IEMFLAG = crimedata$IEMFLAG, ANYCRIME = crimedata$ANYCRIME, DEPRESSIONINDEX2 = crimedata$DEPRESSIONINDEX2)
save(newdata,file="data2.RData")
load("data2.RData")
head(newdata)
tail(newdata)

summary(newdata$IRSEX)
summary(newdata$CATAG2)
summary(newdata$INCOME_R)
summary(newdata$EDU_DUMMY)
summary(newdata$DEPRESSIONINDEX2)

gen_levels <- c(male = 1, female = 2)
age_levels <- c(young = 1, yadult = 2, adult = 3)
edu_levels <- c(low = 1, high = 2)
alc_levels <- c(none = 4, low =3, binge=2 , heavy=1) 
drug_levels <- c(none = 1, low =2, high=3) 
cig_levels <- c(nodaily=1, dailylow=2, dailyhigh=3) 

cigarrel<- ifelse(newdata$CDUFLAG==0, 1,ifelse(newdata$CDCGMO==1, 2, 
                                              ifelse(newdata$CDNOCGMO==1,3,0)))
drugrel<- ifelse(newdata$SUMFLAG==0, 1, 
                 ifelse(newdata$MJOFLAG==1, 2, 3))

studydata <- data.frame(gender = (gender = factor(newdata$IRSEX, levels = gen_levels, labels = names(gen_levels))),
                        age = (age = factor(newdata$CATAG2, levels = age_levels, labels = names(age_levels))), 
                        income = newdata$INCOME_R, educ = (educ = factor(newdata$EDU_DUMMY, levels = edu_levels, labels = names(edu_levels))),
                        alcohol = (alcohol = factor(newdata$BINGEHVY, levels = alc_levels, labels = names(alc_levels))), 
                        druguse= (druguse = factor(drugrel, levels = drug_levels, labels = names(drug_levels))), 
                        cigar= (cigar = factor(cigarrel,levels = cig_levels, labels = names(cig_levels))), 
                        depression = newdata$DEPRESSIONINDEX2, crime = newdata$ANYCRIME)

study2 <- subset(studydata, studydata$depression>=0)
str(study2)
head(study2)
tail(study2)
summary(study2$depression)

#Boxplot for Gender
boxplot(study2$depression ~ study2$gender, main = "Depression Index per Gender", xlab= "Gender", ylab ="depression index")

#Boxplot for Age group
boxplot(study2$depression ~ study2$age, main = "Depression Index for age group", xlab= "Age", ylab ="depression index")

#Boxplot for Income
boxplot(study2$depression ~ study2$income, main = "Depression Index for income ranges", xlab= "Income Range", ylab ="depression index")

#Boxplot for Drug Use
boxplot(study2$depression ~ study2$druguse, main = "Depression Index based on Drug Use", xlab= "Level of Drug Use", ylab ="depression index")

#Boxplot for Daily Cigar Use
boxplot(study2$depression ~ study2$cigar, main = "Depression Index based on Daily Cigar Use", xlab= "Level of Daily Cigar Use", ylab ="depression index")

#Boxplot for Alcohol
boxplot(study2$depression ~ study2$alcohol, main = "Depression Index based on Alcohol", xlab= "Level of Drug Use", ylab ="depression index")

#Calculating main effects
me_gender <- mean(subset(study2$depression, study2$gender== "male")) - mean(subset(study2$depression, study2$gender == "female"))
me_gender
me_druguse1 <- mean(subset(study2$depression, study2$druguse== "low")) - mean(subset(study2$depression, study2$druguse == "none"))
me_druguse1
me_druguse2 <- mean(subset(study2$depression, study2$druguse== "high")) - mean(subset(study2$depression, study2$druguse == "low"))
me_druguse2
me_druguse3 <- mean(subset(study2$depression, study2$druguse== "high")) - mean(subset(study2$depression, study2$druguse == "none"))
me_druguse3
me_alcohol <- mean(subset(study2$depression, study2$alcohol== "heavy")) - mean(subset(study2$depression, study2$alcohol == "none"))
me_alcohol
me_cigar1 <- mean(subset(study2$depression, study2$cigar== "dailylow")) - mean(subset(study2$depression, study2$cigar == "nodaily"))
me_cigar1
me_cigar2 <- mean(subset(study2$depression, study2$cigar== "dailyhigh")) - mean(subset(study2$depression, study2$cigar == "dailylow"))
me_cigar2
me_cigar3 <- mean(subset(study2$depression, study2$cigar== "dailyhigh")) - mean(subset(study2$depression, study2$cigar == "nodaily"))
me_cigar3

#Interaction Effect Plots
interaction.plot(study2$gender, study2$druguse, study2$depression)
interaction.plot(study2$gender, study2$cigar, study2$depression)
interaction.plot (study2$gender, study2$alcohol, study2$depression)
interaction.plot(study2$druguse, study2$cigar, study2$depression)
interaction.plot(study2$druguse, study2$alcohol, study2$depression)
interaction.plot(study2$cigar, study2$alcohol, study2$depression)


###Analysis of Variance (ANOVA)
## One-way ANOVA
anova1 <- aov(study2$depression ~ study2$gender)
summary(anova1)
anova2 <- aov(study2$depression ~ study2$age)
summary(anova2)
anova3 <- aov(study2$depression ~ study2$income)
summary(anova3)
anova4 <- aov(study2$depression ~ study2$druguse)
summary(anova4)
anova5 <- aov(study2$depression ~ study2$cigar)
summary(anova5)
anova6 <- aov(study2$depression ~ study2$alcohol)
summary(anova6)

## Two-way ANOVA
anova14 <- aov(study2$depression ~ study2$gender*study2$druguse)
summary(anova14)
anova15 <- aov(study2$depression ~ study2$gender*study2$cigar)
summary(anova15)
anova16 <- aov(study2$depression ~ study2$gender*study2$alcohol)
summary(anova16)

anova34 <- aov(study2$depression ~ study2$income*study2$druguse)
summary(anova34)
anova35 <- aov(study2$depression ~ study2$income*study2$cigar)
summary(anova35)
anova36 <- aov(study2$depression ~ study2$income*study2$alcohol)
summary(anova36)

anova45 <- aov(study2$depression ~ study2$druguse*study2$cigar)
summary(anova45)
anova46 <- aov(study2$depression ~ study2$druguse*study2$alcohol)
summary(anova46)
anova56 <- aov(study2$depression ~ study2$cigar*study2$alcohol)
summary(anova56)