Introduction

Fractional Factorial Designs are used widely in Experimental Design when there are no possibilities of running full-factorial designs given the large effort involved in generating all the runs. Some of the applications are of more exploratory nature and involve a large number of potential factors of interest. These applications require a fractional design, which emphasize on main effects and a reduced number of low order interactions (Cox and Reid, 2000)

Description of Data

[The data used was part of a study entitled “Gender, Mental Illness and Crime in the United States” developed by Melissa Thompson as part of the Inter-university Consortium for Political and Social Research] http://www.icpsr.umich.edu/cocoon/ICPSR/TERMS/27521.xml. This database is composed of 55,602 respondents from the National Household Survey on Drug Use and Health (NSDUH) in 2004. Respondents were members of United States households from ages 12 and above. Information related to the use of illicit drugs, alcohol and tobacco, as well as the criminal activity, depression and other factors were included in the database. Other additional variables, such as the depression index for both adults and youth were included by the researcher. Of interest to this study is the depression variable for youth population.

Purpose of this Study

The purpose of this study in particular is to understand the influence of gender and/or other population characteristics (i.e. age, education levels) and the addiction level (use of drugs, tobacco and alcohol consumption) in mental illness, particularly depression. The Thompson’s study includes criminal activity, yet this study will only be limited by the depression index calculated for both youth and adults in the study. A fractional factorial experimental design is introduced to identify the main differences between population groups and different levels of addiction to depression of the youth population. ANOVA is used as a tool of analysis for testing these differences and plots of main and interaction effects are introduced to provide a visual aid for the analysis.

DataSet: Selection of IVs and Response Variables

Data Importing and Cleaning

To start the analysis, the original dataset was imported, which included a total of 55,602 observations and 3,011 variables.

#loading data from TSV file
crimedata <- read.delim("CrimeData.tsv")
probcrime <- ifelse(crimedata$YEYATTAK==1, 0.368,
       ifelse(crimedata$YEYATTAK ==2, 0.92, 
              ifelse(crimedata$YEYATTAK ==3, 0.996, 
                     ifelse(crimedata$YEYATTAK ==4, 0.999, 
                            ifelse(crimedata$YEYATTAK ==5, 1, NA)))))
crime1 = ifelse(crimedata$ANYCRIME==0, 0,
                ifelse(crimedata$ANYCRIME ==1, 1,NA))
newdata <- data.frame(id = crimedata$CASEID, IRSEX = crimedata$IRSEX, CATAG2 = crimedata$CATAG2, AGEYOUNG= crimedata$AGE2, INCOME_R = crimedata$INCOME_R, EDU_DUMMY= crimedata$EDU_DUMMY, EDUC2 = crimedata$IREDUC2, CATAG2 = crimedata$CATAG2,  BINGEHVY = crimedata$BINGEHVY, CDCGMO = crimedata$CDCGMO, CDNOCGMO = crimedata$CDNOCGMO, CDUFLAG = crimedata$CDUFLAG, SUMFLAG =  crimedata$SUMFLAG, MJOFLAG = crimedata$MJOFLAG, IEMFLAG = crimedata$IEMFLAG, ANYCRIME = crime1, CRIME2 = probcrime, ARREST = crimedata$NUMARREST, DEPRESS = crimedata$YODEPRESSIONINDEX)

As observed above, the original dataset included a total of 55,602 observations and 3,011 variables. In accordance to the purpose of the study only 19 variables remained, which are contained in newdata dataset.

load("data2.RData")
str(newdata)
## 'data.frame':    55602 obs. of  19 variables:
##  $ id       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ IRSEX    : int  1 1 1 1 2 1 2 2 2 1 ...
##  $ CATAG2   : int  2 2 3 2 3 3 3 2 3 3 ...
##  $ AGEYOUNG : int  11 9 17 10 15 17 15 8 16 14 ...
##  $ INCOME_R : int  1 1 3 1 1 2 3 1 4 3 ...
##  $ EDU_DUMMY: int  1 1 1 1 0 0 1 0 1 0 ...
##  $ EDUC2    : int  11 10 10 10 8 8 11 8 11 8 ...
##  $ CATAG2.1 : int  2 2 3 2 3 3 3 2 3 3 ...
##  $ BINGEHVY : int  1 3 4 1 4 3 3 4 4 4 ...
##  $ CDCGMO   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ CDNOCGMO : int  0 0 1 1 0 1 1 0 1 0 ...
##  $ CDUFLAG  : int  0 0 1 1 0 1 1 1 1 0 ...
##  $ SUMFLAG  : int  1 1 0 1 1 0 0 1 0 0 ...
##  $ MJOFLAG  : int  0 1 0 0 1 0 0 1 0 0 ...
##  $ IEMFLAG  : int  1 0 0 1 0 0 0 0 0 0 ...
##  $ ANYCRIME : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CRIME2   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ ARREST   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DEPRESS  : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...

Possible factor variables were created representing demographic groups gender, age, education; and addiction-related subgroups cigar, druguse and alcohol. Possible response variables were youth depression index, DEPRESS, and the some criminal-activity based variables such as the number of times arrested, the probability of criminal activity and others. as observed above from ANYCRIME, CRIME2 and ARREST variables in the dataset.

##     gender   age educ cigar alcohol druguse crime prcrime arrests
## 56  female child   hs    no    none    high     1   0.920       0
## 78    male young   hs    no    none    high     0   0.368       0
## 109   male child   hs    no    none    high     0   0.368       0
## 139   male child   hs    no    none    none     1   0.368       0
## 143 female young   hs    no    none    none     0   0.368       0
## 169   male child   hs    no    none    high     0   0.368       0
##     depression
## 56           7
## 78           3
## 109          1
## 139          1
## 143          8
## 169          9

Response Variable

There were several interests in this study. Analyze the criminal outcome of the different population groups according to their addiction levels and demographics or analyzing the depression index. This study was oriented to the youth population, thus there was of major interest to understand how the addiction levels affect the depression of these kids. Thus the variable depression was selected for the study. The distribution of this variable is observed from the following histogram.

Factors and Levels

As observed from the dataset, there are some variables that have two levels, like gender, where the respondents were either male or female. Yet there were others that had multiple levels. The level of education could be divided in two as for the youth population this was divided between only elementary school education elem and the high school education hs. For the age group variable, a similar case was observed, where a child (13 years or younger) was differentiated from a young kid (14-17 years). Addiction-related variables such as the daily use of tobacco cigar was also binary, which meant if they use it daily yes or if they don’t no. The alcohol variable was observed in four levels from the initial study but was conveniently divided in three levels: none, low and high. The druguse variable was also evidenced in three levels: none, low (only Marijuana use) and high (other drugs consumed).

In order to determine the four factors, initial descriptive analysis was produced for each one of the variables under study in order to observe significant differences through their boxplots:

After an initial evaluation of the dataset, the two 2-level factors used were gender and cigar and two 3-level factors used were druguse and alcohol.

Experimental Design

Full Factorial Design

If we were to analyze all four factors under homogeneous conditions, we would have to provide a full factorial design. In this case, it would be a \((2^2)(3^2)\). Yet, this same design can be easily converted into a \(2^k\) by converting the 3 level factors into two 2-level factors. For our experiment, the alcohol and druguse variables were divided in low_alcohol with 2 levels (0 for none and 1 for low) and high_alcohol with 2 levels (0 for low and 1 for high). The same treatment was done with the druguse variable.

##   depression gender cigardaily alcohol drugs low_alc high_alc low_drug
## 1          7 female         no    none  high       0        0        0
## 2          3   male         no    none  high       0        0        0
## 3          1   male         no    none  high       0        0        0
## 4          1   male         no    none  none       0        0        0
## 5          8 female         no    none  none       0        0        0
## 6          9   male         no    none  high       0        0        0
##   high_drug
## 1         1
## 2         1
## 3         1
## 4         0
## 5         0
## 6         1

Thus, if we were to design the full factorial experiment, that would imply a total of 64 runs, as shown:

##    gender cigardaily low_alc high_alc low_drug high_drug
## 1       0          0       0        0        0         0
## 2       1          0       0        0        0         0
## 3       0          1       0        0        0         0
## 4       1          1       0        0        0         0
## 5       0          0       1        0        0         0
## 6       1          0       1        0        0         0
## 7       0          1       1        0        0         0
## 8       1          1       1        0        0         0
## 9       0          0       0        1        0         0
## 10      1          0       0        1        0         0
## 11      0          1       0        1        0         0
## 12      1          1       0        1        0         0
## 13      0          0       1        1        0         0
## 14      1          0       1        1        0         0
## 15      0          1       1        1        0         0
## 16      1          1       1        1        0         0
## 17      0          0       0        0        1         0
## 18      1          0       0        0        1         0
## 19      0          1       0        0        1         0
## 20      1          1       0        0        1         0
## 21      0          0       1        0        1         0
## 22      1          0       1        0        1         0
## 23      0          1       1        0        1         0
## 24      1          1       1        0        1         0
## 25      0          0       0        1        1         0
## 26      1          0       0        1        1         0
## 27      0          1       0        1        1         0
## 28      1          1       0        1        1         0
## 29      0          0       1        1        1         0
## 30      1          0       1        1        1         0
## 31      0          1       1        1        1         0
## 32      1          1       1        1        1         0
## 33      0          0       0        0        0         1
## 34      1          0       0        0        0         1
## 35      0          1       0        0        0         1
## 36      1          1       0        0        0         1
## 37      0          0       1        0        0         1
## 38      1          0       1        0        0         1
## 39      0          1       1        0        0         1
## 40      1          1       1        0        0         1
## 41      0          0       0        1        0         1
## 42      1          0       0        1        0         1
## 43      0          1       0        1        0         1
## 44      1          1       0        1        0         1
## 45      0          0       1        1        0         1
## 46      1          0       1        1        0         1
## 47      0          1       1        1        0         1
## 48      1          1       1        1        0         1
## 49      0          0       0        0        1         1
## 50      1          0       0        0        1         1
## 51      0          1       0        0        1         1
## 52      1          1       0        0        1         1
## 53      0          0       1        0        1         1
## 54      1          0       1        0        1         1
## 55      0          1       1        0        1         1
## 56      1          1       1        0        1         1
## 57      0          0       0        1        1         1
## 58      1          0       0        1        1         1
## 59      0          1       0        1        1         1
## 60      1          1       0        1        1         1
## 61      0          0       1        1        1         1
## 62      1          0       1        1        1         1
## 63      0          1       1        1        1         1
## 64      1          1       1        1        1         1

Fractional Factorial Design

The new structure of this dataset includes the factor variables created as follows:

By using the R package ‘FrF2’ we can determine the fractional factorial design. The highest resolution for this experimental design is Resolution III, with 6 factors and 8 runs.

runs <- 2^(6-3)
nam2 <- c("gender","cigardaily", "low_alc", "high_alc", "low_drug", "high_drug")
frac_design <- FrF2(runs, factor.names = nam2 , default.levels = c("0","1"))
summary(frac_design)
## Call:
## FrF2(runs, factor.names = nam2, default.levels = c("0", "1"))
## 
## Experimental design of type  FrF2 
## 8  runs
## 
## Factor settings (scale ends):
##   gender cigardaily low_alc high_alc low_drug high_drug
## 1      0          0       0        0        0         0
## 2      1          1       1        1        1         1
## 
## Design generating information:
## $legend
## [1] A=gender     B=cigardaily C=low_alc    D=high_alc   E=low_drug  
## [6] F=high_drug 
## 
## $generators
## [1] D=AB E=AC F=BC
## 
## 
## Alias structure:
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD
## 
## 
## The design itself:
##   gender cigardaily low_alc high_alc low_drug high_drug
## 1      0          0       0        1        1         1
## 2      0          1       1        0        0         1
## 3      1          1       1        1        1         1
## 4      1          0       0        0        0         1
## 5      0          1       0        0        1         0
## 6      1          1       0        1        0         0
## 7      0          0       1        1        0         0
## 8      1          0       1        0        1         0
## class=design, type= FrF2
frac1 <-data.frame(frac_design)

As observed, the aliasing structure of the fractional factorial design can be obtained and the generators of this design. In this particular experimental design, the generators of the fractional factorial design are D=AB, E=AC, F=BC, that is, the main effects corresponding to the factors high_alc, low_drug and high_drug, which are aliased with some two-level interactions. This specific structure also suggests that there are some two-level interactions that are highly aliased with each other, AF=BE=CD.

aliasprint(frac_design)
## $legend
## [1] A=gender     B=cigardaily C=low_alc    D=high_alc   E=low_drug  
## [6] F=high_drug 
## 
## $main
## [1] A=BD=CE B=AD=CF C=AE=BF D=AB=EF E=AC=DF F=BC=DE
## 
## $fi2
## [1] AF=BE=CD

By using the aliased interactions and multiplying by their main effects, the generator can also be determined as I=ABD=ACE=BCF. By adding this generator row to the matrix, it can be corroborated that they all equal the generator column:

frac2_design <- FrF2(runs, factor.names = c("A","B","C","D","E","F"), default.levels = c("-1","1"))
frac2 <-data.frame(frac2_design)
frac2["I"] = 1

Randomization,Blocking and Replication

It is important to note the principles that guide experimental design: randomization, blocking and replication. In this experiment, the runs shown above must be completely randomized. That means, they should be randomly selected, assigned and executed. Replication and blocking increases the precision of the results. Yet, in this experimental study we will not be using replication as the purpose is to have reduced number of experimental runs. If we decided to increase the number of runs for some reason, we might be better off generating a higher resolution design, such as a \(2^(6-2)\).

Adding Response Variable to the Design

The next step in the design is to gather the data. As stated previously this has to be randomly selected and the experimental runs must be done in a random order

As a first step, given that in this case we have a database from where to obtain the data, the data is allocated as samples that match the different factor levels:

For our fractional factorial design the order of experimental runs was developed as follows:

rand_fd <- frac1[sample(nrow(frac1)),]
rand_fd
##   gender cigardaily low_alc high_alc low_drug high_drug
## 4      1          0       0        0        0         1
## 3      1          1       1        1        1         1
## 2      0          1       1        0        0         1
## 1      0          0       0        1        1         1
## 6      1          1       0        1        0         0
## 8      1          0       1        0        1         0
## 7      0          0       1        1        0         0
## 5      0          1       0        0        1         0

The main effects for these experimental runs are calculated as follows:

## [1] 1.875
## [1] 0.875
## [1] -1.125
## [1] -0.875
## [1] 0.625
## [1] 0.375

The first value corresponds to gender, thus this high effect compared to the rest indicates it is highly probable that there is a potential difference between males and females.

Estimation

The model can be further tested and the main and interaction effects can be estimating using a linear model:

fit <- lm(proj3$depression ~ (proj3$gender + proj3$cigardaily + proj3$low_alc + proj3$high_alc + proj3$low_drug + proj3$high_drug)^2)
summary(fit)
## 
## Call:
## lm.default(formula = proj3$depression ~ (proj3$gender + proj3$cigardaily + 
##     proj3$low_alc + proj3$high_alc + proj3$low_drug + proj3$high_drug)^2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5406 -1.1254  0.6847  1.6847  3.9563 
## 
## Coefficients: (2 not defined because of singularities)
##                                        Estimate Std. Error t value
## (Intercept)                              5.4045     0.1006  53.701
## proj3$genderfemale                       0.9108     0.1252   7.274
## proj3$cigardailyyes                      1.4668     0.5014   2.926
## proj3$low_alc                            0.0319     0.2585   0.123
## proj3$high_alc                           0.8340     0.7786   1.071
## proj3$low_drug                           0.3134     0.2604   1.204
## proj3$high_drug                          0.5578     0.1724   3.237
## proj3$genderfemale:proj3$cigardailyyes  -0.6342     0.3060  -2.072
## proj3$genderfemale:proj3$low_alc         0.2112     0.2375   0.889
## proj3$genderfemale:proj3$high_alc        0.2858     0.4737   0.603
## proj3$genderfemale:proj3$low_drug        0.4052     0.3015   1.344
## proj3$genderfemale:proj3$high_drug       0.2522     0.2084   1.210
## proj3$cigardailyyes:proj3$low_alc       -0.4499     0.2992  -1.504
## proj3$cigardailyyes:proj3$high_alc      -0.9579     0.4977  -1.925
## proj3$cigardailyyes:proj3$low_drug      -0.3932     0.5422  -0.725
## proj3$cigardailyyes:proj3$high_drug     -0.4174     0.4850  -0.861
## proj3$low_alc:proj3$high_alc                 NA         NA      NA
## proj3$low_alc:proj3$low_drug            -0.7060     0.3296  -2.142
## proj3$low_alc:proj3$high_drug           -0.1533     0.2501  -0.613
## proj3$high_alc:proj3$low_drug           -0.8091     1.0171  -0.795
## proj3$high_alc:proj3$high_drug          -0.4251     0.8046  -0.528
## proj3$low_drug:proj3$high_drug               NA         NA      NA
##                                        Pr(>|t|)    
## (Intercept)                             < 2e-16 ***
## proj3$genderfemale                     4.37e-13 ***
## proj3$cigardailyyes                     0.00346 ** 
## proj3$low_alc                           0.90178    
## proj3$high_alc                          0.28414    
## proj3$low_drug                          0.22882    
## proj3$high_drug                         0.00122 ** 
## proj3$genderfemale:proj3$cigardailyyes  0.03833 *  
## proj3$genderfemale:proj3$low_alc        0.37393    
## proj3$genderfemale:proj3$high_alc       0.54628    
## proj3$genderfemale:proj3$low_drug       0.17915    
## proj3$genderfemale:proj3$high_drug      0.22623    
## proj3$cigardailyyes:proj3$low_alc       0.13280    
## proj3$cigardailyyes:proj3$high_alc      0.05435 .  
## proj3$cigardailyyes:proj3$low_drug      0.46845    
## proj3$cigardailyyes:proj3$high_drug     0.38953    
## proj3$low_alc:proj3$high_alc                 NA    
## proj3$low_alc:proj3$low_drug            0.03228 *  
## proj3$low_alc:proj3$high_drug           0.54001    
## proj3$high_alc:proj3$low_drug           0.42639    
## proj3$high_alc:proj3$high_drug          0.59733    
## proj3$low_drug:proj3$high_drug               NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.372 on 3164 degrees of freedom
## Multiple R-squared:  0.07356,    Adjusted R-squared:  0.06799 
## F-statistic: 13.22 on 19 and 3164 DF,  p-value: < 2.2e-16
anova(fit)
## Analysis of Variance Table
## 
## Response: proj3$depression
##                                    Df  Sum Sq Mean Sq  F value    Pr(>F)
## proj3$gender                        1   812.4  812.36 144.4170 < 2.2e-16
## proj3$cigardaily                    1   169.3  169.35  30.1053 4.414e-08
## proj3$low_alc                       1     5.1    5.06   0.8989   0.34314
## proj3$high_alc                      1    23.0   22.99   4.0873   0.04329
## proj3$low_drug                      1     0.0    0.00   0.0004   0.98452
## proj3$high_drug                     1   289.4  289.43  51.4543 9.086e-13
## proj3$gender:proj3$cigardaily       1    13.4   13.37   2.3763   0.12329
## proj3$gender:proj3$low_alc          1     8.9    8.93   1.5880   0.20771
## proj3$gender:proj3$high_alc         1     2.4    2.41   0.4285   0.51277
## proj3$gender:proj3$low_drug         1     3.6    3.63   0.6454   0.42183
## proj3$gender:proj3$high_drug        1     7.6    7.59   1.3491   0.24553
## proj3$cigardaily:proj3$low_alc      1    10.2   10.22   1.8177   0.17768
## proj3$cigardaily:proj3$high_alc     1    34.4   34.39   6.1131   0.01347
## proj3$cigardaily:proj3$low_drug     1     0.3    0.25   0.0445   0.83294
## proj3$cigardaily:proj3$high_drug    1     4.0    4.02   0.7154   0.39771
## proj3$low_alc:proj3$low_drug        1    23.4   23.44   4.1674   0.04129
## proj3$low_alc:proj3$high_drug       1     2.1    2.07   0.3674   0.54446
## proj3$high_alc:proj3$low_drug       1     2.0    2.01   0.3568   0.55031
## proj3$high_alc:proj3$high_drug      1     1.6    1.57   0.2791   0.59733
## Residuals                        3164 17797.8    5.63                   
##                                     
## proj3$gender                     ***
## proj3$cigardaily                 ***
## proj3$low_alc                       
## proj3$high_alc                   *  
## proj3$low_drug                      
## proj3$high_drug                  ***
## proj3$gender:proj3$cigardaily       
## proj3$gender:proj3$low_alc          
## proj3$gender:proj3$high_alc         
## proj3$gender:proj3$low_drug         
## proj3$gender:proj3$high_drug        
## proj3$cigardaily:proj3$low_alc      
## proj3$cigardaily:proj3$high_alc  *  
## proj3$cigardaily:proj3$low_drug     
## proj3$cigardaily:proj3$high_drug    
## proj3$low_alc:proj3$low_drug     *  
## proj3$low_alc:proj3$high_drug       
## proj3$high_alc:proj3$low_drug       
## proj3$high_alc:proj3$high_drug      
## Residuals                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As observered from this first model results, the main effects concerning the low alcohol and low drug levels were not statistically significant. On the other hand, gender and the level of cigar appears to be significant with the depression index. Analysis of variance performed indicated significant differences among the gender and cigar level groups. The drug and alcohol related factors were still significant at their highest levels. This implies that the factor effects between none and low are not significant in both cases but the difference does exists between low and high levels of alcohol and druguse.

With respect to interaction effects, there is only one interaction effect that seems to be statistically significant, it is the effect of the cigar daily use and high alcohol consumption.

If we reduce the model to the low-high level interaction among alcohol and drug use, the resulting model is as follows:

fit2 <- lm(proj3$depression ~ (proj3$gender + proj3$cigardaily + proj3$high_alc + proj3$high_drug)^2)
summary(fit2)
## 
## Call:
## lm.default(formula = proj3$depression ~ (proj3$gender + proj3$cigardaily + 
##     proj3$high_alc + proj3$high_drug)^2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.354 -1.152  0.583  1.583  3.583 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                             5.41701    0.09041  59.913
## proj3$genderfemale                      1.00433    0.11241   8.935
## proj3$cigardailyyes                     1.16786    0.31887   3.663
## proj3$high_alc                          0.53214    0.56083   0.949
## proj3$high_drug                         0.52278    0.15756   3.318
## proj3$genderfemale:proj3$cigardailyyes -0.56922    0.29824  -1.909
## proj3$genderfemale:proj3$high_alc       0.26754    0.46432   0.576
## proj3$genderfemale:proj3$high_drug      0.20815    0.19104   1.090
## proj3$cigardailyyes:proj3$high_alc     -0.82510    0.45788  -1.802
## proj3$cigardailyyes:proj3$high_drug    -0.39707    0.28932  -1.372
## proj3$high_alc:proj3$high_drug         -0.06204    0.55521  -0.112
##                                        Pr(>|t|)    
## (Intercept)                             < 2e-16 ***
## proj3$genderfemale                      < 2e-16 ***
## proj3$cigardailyyes                    0.000254 ***
## proj3$high_alc                         0.342770    
## proj3$high_drug                        0.000917 ***
## proj3$genderfemale:proj3$cigardailyyes 0.056401 .  
## proj3$genderfemale:proj3$high_alc      0.564529    
## proj3$genderfemale:proj3$high_drug     0.275980    
## proj3$cigardailyyes:proj3$high_alc     0.071641 .  
## proj3$cigardailyyes:proj3$high_drug    0.170025    
## proj3$high_alc:proj3$high_drug         0.911038    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.375 on 3173 degrees of freedom
## Multiple R-squared:  0.06866,    Adjusted R-squared:  0.06572 
## F-statistic: 23.39 on 10 and 3173 DF,  p-value: < 2.2e-16
anova(fit2)
## Analysis of Variance Table
## 
## Response: proj3$depression
##                                    Df  Sum Sq Mean Sq  F value    Pr(>F)
## proj3$gender                        1   812.4  812.36 144.0657 < 2.2e-16
## proj3$cigardaily                    1   169.3  169.35  30.0321 4.582e-08
## proj3$high_alc                      1    19.1   19.13   3.3927   0.06558
## proj3$high_drug                     1   262.7  262.66  46.5813 1.049e-11
## proj3$gender:proj3$cigardaily       1    13.6   13.56   2.4040   0.12112
## proj3$gender:proj3$high_alc         1     1.0    1.00   0.1771   0.67393
## proj3$gender:proj3$high_drug        1     5.7    5.75   1.0190   0.31284
## proj3$cigardaily:proj3$high_alc     1    24.0   24.00   4.2565   0.03918
## proj3$cigardaily:proj3$high_drug    1    11.1   11.06   1.9621   0.16138
## proj3$high_alc:proj3$high_drug      1     0.1    0.07   0.0125   0.91104
## Residuals                        3173 17891.9    5.64                   
##                                     
## proj3$gender                     ***
## proj3$cigardaily                 ***
## proj3$high_alc                   .  
## proj3$high_drug                  ***
## proj3$gender:proj3$cigardaily       
## proj3$gender:proj3$high_alc         
## proj3$gender:proj3$high_drug        
## proj3$cigardaily:proj3$high_alc  *  
## proj3$cigardaily:proj3$high_drug    
## proj3$high_alc:proj3$high_drug      
## Residuals                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This second model seems to be more accurate although it can be even reduced some more as only two interaction effects appear to be significant, the female and cigar daily use and the high alcohol and cigar dailyuse.

Main Effects

The numerical results of these main effects are provided in this report for gender, cigardaily,low_alc,high_alc,low_drug and high_drug. Main effects calculated were as follows:

me_gender <- mean(subset(proj3$depression, proj3$gender== "female")) - mean(subset(proj3$depression, proj3$gender == "male"))
me_gender
## [1] 1.05953
me_cigar <- mean(subset(proj3$depression, proj3$cigardaily== "yes")) - mean(subset(proj3$depression, proj3$cigardaily == "no"))
me_cigar
## [1] 0.7786925
me_lowalc <- mean(subset(proj3$depression, proj3$low_alc== 1)) - mean(subset(proj3$depression, proj3$low_alc== 0))
me_lowalc
## [1] 0.3053936
me_highalc <- mean(subset(proj3$depression, proj3$high_alc== 1)) - mean(subset(proj3$depression, proj3$high_alc== 0))
me_highalc
## [1] 0.6528198
me_lowdrug <- mean(subset(proj3$depression, proj3$low_drug== 1)) - mean(subset(proj3$depression, proj3$low_drug== 0))
me_lowdrug
## [1] 0.07802172
me_highdrug <- mean(subset(proj3$depression, proj3$high_drug== 1)) - mean(subset(proj3$depression, proj3$high_drug== 0))
me_highdrug
## [1] 0.7402103

Two Factor Interactions

The following plots present the two factor interactions for the combination of the factors studied (this case discarting the alcohol and drug lowest level).

Results from these plots confirm the significant interaction effects between cigardaily and high_alc variables.

Final Remarks

This report developed an analysis that related depression on different segments of the population and addictive habits. As response variable, the depression of the youth population was analyzed. As factors this study used gender, cigar daily use, alcohol and drug levels. Using Fractional Factorial designs, a reduced design was generated as to explore the differences among these groups with just a few number of observations. In this case, only 8 observations in total. Random selection of the response variable was developed and a random order as part of the design.

These results from the fractional factorial design indicated some potential differences between gender (main effect of 1.875). This model was estimated with a full estimation of a linear regression model. The Analysis of Variance was used to identified potential differences in depression based on different levels of addiction (cigar, alcohol and drug use) and gender. The results corroborated that gender is statistically significant. Athough there were other differences that could not be captured from the simplified experiments. The daily cigar use was significant and the high levels for both alcohol and drug use.

From these results it seems that females are likely to be more depressive than males. Also, the kids that have a habit of smoking at a daily basis are more susceptible to being depressive. Another important aspect of the study was the drug use levels clearly produce different depression indices among youth population, as kids that do many kinds of drugs are likely to become depressive. The same with respect to the alcohol levels, as the level of alcohol consumption is highest, there is likely to produce higher depression.

Results also show that their is a significant interaction between daily cigar use and high alcohol consumption. This is an important finding that suggests that these habits should be avoided from early stages as it can become serious for the mental health of these children.

References

  1. Cox, D.R. and Reid, N. (2000) The Theory of the Design of Experiments, Chapman & Hall/CRC, 86p.
  2. Thompson, M. (2011), Gender, Mental Illness, and Crime in the United States, 2004. ICPSR27521-v1. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor], 2011-02-10. http://doi.org/10.3886/ICPSR27521.v1”)

APPENDIX

Complete code is provided below:

#loading data from TSV file
crimedata <- read.delim("CrimeData.tsv")
str(crimedata)

#creating new dataframe with variables selected for this study
probcrime <- ifelse(crimedata$YEYATTAK==1, 0.368,
       ifelse(crimedata$YEYATTAK ==2, 0.92, 
              ifelse(crimedata$YEYATTAK ==3, 0.996, 
                     ifelse(crimedata$YEYATTAK ==4, 0.999, 
                            ifelse(crimedata$YEYATTAK ==5, 1, NA)))))
crime1 = ifelse(crimedata$ANYCRIME==0, 0,
                ifelse(crimedata$ANYCRIME ==1, 1,NA))
newdata <- data.frame(id = crimedata$CASEID, IRSEX = crimedata$IRSEX, CATAG2 = crimedata$CATAG2, AGEYOUNG= crimedata$AGE2, INCOME_R = crimedata$INCOME_R, EDU_DUMMY= crimedata$EDU_DUMMY, EDUC2 = crimedata$IREDUC2, CATAG2 = crimedata$CATAG2,  BINGEHVY = crimedata$BINGEHVY, CDCGMO = crimedata$CDCGMO, CDNOCGMO = crimedata$CDNOCGMO, CDUFLAG = crimedata$CDUFLAG, SUMFLAG =  crimedata$SUMFLAG, MJOFLAG = crimedata$MJOFLAG, IEMFLAG = crimedata$IEMFLAG, ANYCRIME = crime1, CRIME2 = probcrime, ARREST = crimedata$NUMARREST, DEPRESS = crimedata$YODEPRESSIONINDEX)

save(newdata,file="data2.RData")
load("data2.RData")
head(newdata)
tail(newdata)

summary(newdata$IRSEX)
summary(newdata$CATAG2)
summary(newdata$INCOME_R)
summary(newdata$EDU_DUMMY)
summary(newdata$IREDUC2)
summary(newdata$DEPRESSIONINDEX2)
tapply(newdata$CRIME2, newdata$IRSEX, summary)
tapply(newdata$ANYCRIME, newdata$IRSEX, summary)
tapply(newdata$CRIME2, newdata$EDU_DUMMY, summary)
tapply(newdata$ANYCRIME, newdata$EDU_DUMMY, summary)
tapply(newdata$DEPRESS, newdata$EDU_DUMMY, summary)
tapply(newdata$DEPRESS, newdata$IRSEX, summary)


gen_levels <- c(male = 1, female = 2)
age_levels <- c(child = 1, young = 2)
edu_levels <- c(elem = 1, hs = 2)

alc_levels <- c(none = 1, low =2, high=3) 
drug_levels <- c(none = 1, low =2, high=3) 
cig_levels <- c(no=1, yes=2) 

age_rel<- ifelse(newdata$AGEYOUNG==1, 1, 
                 ifelse(newdata$AGEYOUNG==2, 1,
                        ifelse(newdata$AGEYOUNG==3, 1, 2)))
edu_rel<- ifelse(newdata$EDUC2==1, 1, 2)
#cigarrel<- ifelse(newdata$CDCGMO==1, 1, ifelse(newdata$CDNOCGMO==1,2,NA))
cigarrel<- ifelse(newdata$CDUFLAG==0, 1, 2)
alc_rel<- ifelse(newdata$BINGEHVY==4, 1, 
                 ifelse(newdata$BINGEHVY==1, 3, 2))
drugrel<- ifelse(newdata$SUMFLAG==0, 1, 
                 ifelse(newdata$MJOFLAG==1, 2, 3))

studydata <- data.frame(gender = (gender = factor(newdata$IRSEX, levels = gen_levels, labels = names(gen_levels))),
                        age = (age = factor(age_rel, levels = age_levels, labels = names(age_levels))),
                        educ = (educ = factor(edu_rel, levels = edu_levels, labels = names(edu_levels))),
                        cigar = (cigar = factor(cigarrel, levels = cig_levels, labels = names(cig_levels))),
                        alcohol = (alcohol = factor(alc_rel, levels = alc_levels, labels = names(alc_levels))), 
                        druguse= (druguse = factor(drugrel, levels = drug_levels, labels = names(drug_levels))), 
                        crime = newdata$ANYCRIME, prcrime=newdata$CRIME2, arrests = newdata$ARREST, depression = newdata$DEPRESS)
#studydata <- data.frame(gender = (gender = factor(newdata$IRSEX, levels = gen_levels, labels = names(gen_levels))),
#                        age = (age = factor(newdata$CATAG2, levels = age_levels, labels = names(age_levels))),
#                        alcohol = (alcohol = factor(alc_rel, levels = alc_levels, labels = names(alc_levels))), 
#                        druguse= (druguse = factor(drugrel, levels = drug_levels, labels = names(drug_levels))), 
#                        crime = newdata$ANYCRIME, prcrime=newdata$CRIME2)

data2proj3 = subset(studydata, studydata$depression>=0)
save(data2proj3,file="data3.RData")
load("data3.RData")

#Data Analysis
hist(data2proj3$depression, main = "Depression Index of the Youth Population", xlab= "Level of Depression (low 0 - high 9)")
boxplot(data2proj3$depression ~ data2proj3$alcohol, main = "Depression Index based on Alcohol", xlab= "Level of Alcohol Use", ylab ="depression index")
boxplot(data2proj3$depression ~ data2proj3$druguse, main = "Depression Index based on Drug Use", xlab= "Level of Drug Use", ylab ="depression index")
boxplot(data2proj3$depression ~ data2proj3$gender, main = "Depression Index based on Gender", xlab= "Gender", ylab ="depression index")
boxplot(data2proj3$depression ~ data2proj3$age, main = "Depression Index based on Age", xlab= "Age of Youth", ylab ="depression index")
boxplot(data2proj3$depression ~ data2proj3$educ , main = "Depression Index based on Education", xlab= "Education", ylab ="depression index")
boxplot(data2proj3$depression ~ data2proj3$cigar, main = "Depression Index based on Cigar Use", xlab= "Cigar Daily Use?", ylab ="depression index")

#Generating Design for 2^k design
proj3 <- data.frame(depression = data2proj3$depression, gender = data2proj3$gender, cigardaily = data2proj3$cigar, alcohol = data2proj3$alcohol, drugs = data2proj3$druguse, low_alc = NA, high_alc = NA, low_drug = NA, high_drug = NA)  
head(proj3)

proj3$low_alc[proj3$alcohol == "none"] <- 0
proj3$low_alc[proj3$alcohol == "low"] <- 1
proj3$low_alc[proj3$alcohol == "high"] <- 0

proj3$high_alc[proj3$alcohol == "none"] <- 0
proj3$high_alc[proj3$alcohol == "low"] <- 0
proj3$high_alc[proj3$alcohol == "high"] <- 1

proj3$low_drug[proj3$drugs == "none"] <- 0
proj3$low_drug[proj3$drugs == "low"] <- 1
proj3$low_drug[proj3$drugs == "high"] <- 0

proj3$high_drug[proj3$drugs == "none"] <- 0
proj3$high_drug[proj3$drugs == "low"] <- 0
proj3$high_drug[proj3$drugs == "high"] <- 1

head(proj3)

#Full Factorial Design
expand.grid(gender = c(0,1), cigardaily = c(0,1), low_alc = c(0,1), high_alc = c(0,1), low_drug = c(0,1), high_drug = c(0,1))

#Fractional Factorial Design
library(FrF2)
runs <- 2^(6-3)
nam2 <- c("gender","cigardaily", "low_alc", "high_alc", "low_drug", "high_drug")
frac_design <- FrF2(runs, factor.names = nam2 , default.levels = c("0","1"))
summary(frac_design)
frac1 <-data.frame(frac_design)

#Aliasing Structure
aliasprint(frac_design)
frac2_design <- FrF2(runs, factor.names = c("A","B","C","D","E","F"), default.levels = c("-1","1"))
frac2 <-data.frame(frac2_design)
frac2["I"] = 1

#assigning response variable
run1 <- subset(proj3, (proj3$gender== 'female' & proj3$cigardaily== 'yes' & proj3$low_alc==0 & proj3$low_drug==0))
run2 <- subset(proj3, (proj3$gender== 'male' & proj3$cigardaily== 'no' & proj3$low_alc==0 & proj3$low_drug==1))
run3 <- subset(proj3, (proj3$gender== 'male' & proj3$cigardaily== 'no' & proj3$low_alc==1 & proj3$low_drug==0))
run4 <- subset(proj3, (proj3$gender== 'male' & proj3$cigardaily== 'yes' & proj3$low_alc==0 & proj3$low_drug==1))
run5 <- subset(proj3, (proj3$gender== 'male' & proj3$cigardaily== 'yes' & proj3$low_alc==1 & proj3$low_drug==0))
run6 <- subset(proj3, (proj3$gender== 'female' & proj3$cigardaily== 'no' & proj3$low_alc==1 & proj3$low_drug==1))
run7 <- subset(proj3, (proj3$gender== 'female' & proj3$cigardaily== 'no' & proj3$low_alc==0 & proj3$low_drug==0))
run8 <- subset(proj3, (proj3$gender== 'female' & proj3$cigardaily== 'yes' & proj3$low_alc==1 & proj3$low_drug==1))

rv <- cbind(sample(run1$depression,1),sample(run2$depression,1),sample(run3$depression,1),sample(run4$depression,1),sample(run5$depression,1),sample(run6$depression,1),sample(run7$depression,1),sample(run8$depression,1))

frac2["RV"]=NA
for (i in 1:8){ frac2$RV[i] = rv[i] }
rand_fd <- frac2[sample(nrow(frac2)),]
rand_fd

me_A <- 1.875
me_A 
me_B <- 0.875
me_B
me_C <- -1.125
me_C
me_D <- -0.875
me_D
me_E <- 0.625
me_E
me_F <- 0.375
me_F


fit <- lm(proj3$depression ~ (proj3$gender + proj3$cigardaily + proj3$low_alc + proj3$high_alc + proj3$low_drug + proj3$high_drug)^2)
summary(fit)
anova(fit)

qqnorm(residuals(fit))
qqline(residuals(fit))

fit2 <- lm(proj3$depression ~ (proj3$gender + proj3$cigardaily + proj3$high_alc + proj3$high_drug)^2)
summary(fit2)
anova(fit2)

#Main effects 
me_gender <- mean(subset(proj3$depression, proj3$gender== "female")) - mean(subset(proj3$depression, proj3$gender == "male"))
me_gender
me_cigar <- mean(subset(proj3$depression, proj3$cigardaily== "yes")) - mean(subset(proj3$depression, proj3$cigardaily == "no"))
me_cigar
me_lowalc <- mean(subset(proj3$depression, proj3$low_alc== 1)) - mean(subset(proj3$depression, proj3$low_alc== 0))
me_lowalc
me_highalc <- mean(subset(proj3$depression, proj3$high_alc== 1)) - mean(subset(proj3$depression, proj3$high_alc== 0))
me_highalc
me_lowdrug <- mean(subset(proj3$depression, proj3$low_drug== 1)) - mean(subset(proj3$depression, proj3$low_drug== 0))
me_lowdrug
me_highdrug <- mean(subset(proj3$depression, proj3$high_drug== 1)) - mean(subset(proj3$depression, proj3$high_drug== 0))
me_highdrug

#Interaction Plots
interaction.plot(proj3$gender, proj3$cigardaily, proj3$depression)
interaction.plot(proj3$gender, proj3$high_alc, proj3$depression)
interaction.plot (proj3$gender, proj3$high_drug, proj3$depression)
interaction.plot(proj3$cigardaily, proj3$high_alc, proj3$depression)
interaction.plot(proj3$cigardaily, proj3$high_drug, proj3$depression)
interaction.plot(proj3$high_alc, proj3$high_drug, proj3$depression)