Rensselaer Polytechnic Institute

Date: December 16, 2016

Version 1.3

1. Setting

In an earlier report, voting polarization was analyzed across three different subreddits on Reddit.com using a fractional factorial design.1 The exploratory data analysis hinted at potentially significant differences between subreddits, but the ANOVA indicated otherwise. The hypothesized cause for the discrepancy was the artificial decomposition of non-ordinal 3-level factors into two 2-level factors. In this analysis, a Taguchi Design is implemented to ask the same research question: do significant differences in voting polarization exist between subreddits?

System under test

GitHub user umbrae gathered top Reddit posts and placed them on GitHub as .csv files arranged by subreddit. The chosen subreddits for this study were r/science, r/politics, and r/news. Links to the GitHub repository can be found in the Appendix below.

Factors and levels

Identical data from the earlier study1 were used. The two 3-level independent variables (IVs) include:

  • subreddit: r/science, r/politics, and r/news
  • link domain: .org, .com, or other (such as .gov or .co.uk)

The two 2-level IVs selected include:

  • relative number of comments: few (bottom 50%), or many (top 50%)
  • time of post: earlier (before cutoff date), or later (after cutoff date)

The 3-level IVs were each broken down into two 2-level IVs: subreddit_name became SR1 and SR2, and domain_type became DT1 and DT2. The middle level of each 3-level factor was mapped to (1,-1) and (-1,1) in an alternating fashion using toggled boolean variables.1 The result was a \(2^{6}\) full factorial design.
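
A minimal sketch of this decomposition (encode_3level is a hypothetical helper introduced here for illustration; the full per-row loop appears in the Appendix):

# Sketch: decompose a 3-level factor into two 2-level factors.
# The low level maps to (-1,-1), the high level to (1,1), and the
# middle level alternates between (1,-1) and (-1,1) via a toggle.
encode_3level <- function(x, low, mid) {
  toggle <- TRUE
  t(sapply(x, function(v) {
    if (v == low) {
      c(-1, -1)
    } else if (v == mid) {
      out <- if (toggle) c(1, -1) else c(-1, 1)
      toggle <<- !toggle
      out
    } else {
      c(1, 1)
    }
  }))
}
encode_3level(c(".org", ".com", ".com", "other"), low = ".org", mid = ".com")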

Continuous variables

Both continuous IVs, relative number of comments and time of post, were cast as 2-level categorical variables to meet the requirements of the experimental design. The variable levels were assigned to two categories based on an arbitrary cutoff value; the median of each factor was chosen as the cutoff to maximize the number of available combinations for the factorial.
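
A vectorized sketch of this dichotomization (equivalent to the per-row loop in the Appendix):

# Sketch: cast a continuous IV as a 2-level factor split at its median
cutoff_comments <- median(all_data$percent_comments)
all_data$comments_code <- factor(ifelse(all_data$percent_comments > cutoff_comments, 1, -1))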

Response variable

The response variable, percent_upvotes, was calculated by dividing the total number of upvotes by the total number of votes (upvotes + downvotes) for that post. A universally liked post would approach 100% upvotes, while a more polarizing post would approach 50%.
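
In R this is a one-liner (reproduced from the Appendix):

# create column with PERCENT_UPVOTES
all_data$percent_upvotes <- (100*all_data$ups/(all_data$ups + all_data$downs))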

The Data

Refer to the earlier report for a description of how the raw data was imported, cleaned, and organized.1

In the \(2^{6}\) full factorial design, there are 64 possible factor-level combinations. A tree-like sorting algorithm was implemented to assign unique experiment codes (exp_code from 0-63) to each factor-level combination.1 These experiment codes are later used to match response variables to the appropriate experimental runs. The structure, head, and tail of the raw data frame (after removing and adding columns) are given below.

## 'data.frame':    2998 obs. of  17 variables:
##  $ created_utc     : num  1.34e+09 1.34e+09 1.35e+09 1.34e+09 1.36e+09 ...
##  $ score           : int  5785 5593 4626 4486 4465 4438 4305 4066 4047 4025 ...
##  $ domain          : chr  "twitter.com" "webcast.web.cern.ch" "nasa.gov" "guardian.co.uk" ...
##  $ ups             : int  53727 33456 42063 18267 33283 9418 16929 16996 19220 12772 ...
##  $ downs           : int  47942 27863 37437 13781 28818 4980 12624 12930 15173 8747 ...
##  $ num_comments    : int  4567 2919 1515 739 1135 1141 1165 1008 717 1069 ...
##  $ subreddit_name  : chr  "science" "science" "science" "science" ...
##  $ percent_comments: num  4.49 4.76 1.91 2.31 1.83 ...
##  $ domain_type     : chr  ".com" "other" "other" "other" ...
##  $ percent_upvotes : num  52.8 54.6 52.9 57 53.6 ...
##  $ time_code       : Factor w/ 2 levels "-1","1": 1 1 1 1 2 2 1 1 1 1 ...
##  $ comments_code   : Factor w/ 2 levels "-1","1": 1 1 1 1 1 2 1 1 1 1 ...
##  $ DT1             : Factor w/ 2 levels "-1","1": 2 2 2 2 1 2 1 2 2 1 ...
##  $ DT2             : Factor w/ 2 levels "-1","1": 1 2 2 2 2 1 2 2 1 2 ...
##  $ SR1             : Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ SR2             : Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ exp_code        : num  56 60 60 60 54 59 52 60 56 52 ...
##   created_utc score              domain   ups downs num_comments
## 1  1344231209  5785         twitter.com 53727 47942         4567
## 2  1341387530  5593 webcast.web.cern.ch 33456 27863         2919
## 3  1354217414  4626            nasa.gov 42063 37437         1515
## 4  1342384318  4486      guardian.co.uk 18267 13781          739
## 5  1361725918  4465         reuters.com 33283 28818         1135
## 6  1363051689  4438          nature.com  9418  4980         1141
##   subreddit_name percent_comments domain_type percent_upvotes time_code
## 1        science         4.492028        .com        52.84502        -1
## 2        science         4.760352       other        54.56058        -1
## 3        science         1.905660       other        52.90943        -1
## 4        science         2.305916       other        56.99888        -1
## 5        science         1.827668        .com        53.59495         1
## 6        science         7.924712        .com        65.41186         1
##   comments_code DT1 DT2 SR1 SR2 exp_code
## 1            -1   1  -1   1   1       56
## 2            -1   1   1   1   1       60
## 3            -1   1   1   1   1       60
## 4            -1   1   1   1   1       60
## 5            -1  -1   1   1   1       54
## 6             1   1  -1   1   1       59
##      created_utc score                domain  ups downs num_comments
## 2993  1366385781  1031           reuters.com 1439   408          157
## 2994  1364274310  1033 earthquake-report.com 1390   357          262
## 2995  1292330835  1024     newsfeed.time.com 2282  1258          206
## 2996  1374840797  1027        guardian.co.uk 1390   363           35
## 2997  1342784992  1023             self.news 1862   839          442
## 2998  1376504442   809            nbcdfw.com 1052   243          176
##      subreddit_name percent_comments domain_type percent_upvotes time_code
## 2993           news         8.500271        .com        77.91012         1
## 2994           news        14.997138        .com        79.56497         1
## 2995           news         5.819209        .com        64.46328        -1
## 2996           news         1.996577       other        79.29264         1
## 2997           news        16.364310       other        68.93743        -1
## 2998           news        13.590734        .com        81.23552         1
##      comments_code DT1 DT2 SR1 SR2 exp_code
## 2993             1  -1   1  -1  -1        7
## 2994             1   1  -1  -1  -1       11
## 2995            -1  -1   1  -1  -1        4
## 2996            -1   1   1  -1  -1       14
## 2997             1   1   1  -1  -1       13
## 2998             1   1  -1  -1  -1       11
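
For reference, the exp_code assignment amounts to reading the six coded factors as a 6-bit binary number, with SR1 as the 32s place down to comments_code as the 1s place. A vectorized sketch, equivalent to the per-row loop in the Appendix:

# Sketch: exp_code as a 6-bit binary number over the coded factor columns
coded <- sapply(all_data2[, c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code")],
                function(f) as.numeric(as.character(f)) == 1)
all_data2$exp_code <- as.vector(coded %*% c(32, 16, 8, 4, 2, 1))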

2. (Experimental) Design

The qualityTools package in R was installed and used to construct the Taguchi Design. Useful functions within this package include taguchiChoose() and taguchiDesign().

Organization of experiment to test the hypothesis

In this experiment, we are performing ANOVA on a linear model developed from a Taguchi Design. Since the design and analysis occurred after the data were collected, the sample was selected by group to provide a balanced design, using the same sampling technique applied in the report on fractional factorial designs.1

Knowing that there were a total of six 2-level factors, the taguchiChoose() function was used to show the possible Taguchi Designs.

# Install and load Quality Tools package
install.packages("qualityTools", repos='http://cran.us.r-project.org')
## package 'qualityTools' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\Deagen\AppData\Local\Temp\RtmpgVWzhc\downloaded_packages
require(qualityTools)

# choose Taguchi Design
taguchiChoose(level1 = 2, factors1 = 6)
## 6 factors on 2 levels and 0 factors on 0 levels with 0 desired interactions to be estimated
## 
## Possible Designs:
## 
## L8_2 L12_2 L16_2 L32_2
## 
## Use taguchiDesign("L8_2") or different to create a taguchi design object

The resulting designs to choose from include L8_2, L12_2, L16_2, L32_2. The notation for Taguchi Designs is as follows:
\[L_{N}(S^{k})\] where \(N\) is the number of experimental runs, \(S\) is the number of levels, and \(k\) is the number of factors.2 Since we are interested in minimizing the number of runs in our experiment, the \(L_{8}(2^{6})\) design was selected; in the notation of qualityTools, this is the L8_2 design. Through this design, we can perform 1/8 of the 64 runs that a \(2^{6}\) full factorial design would otherwise require.
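
The design object itself was created as follows (abridged from the code in the Appendix):

# show Taguchi Design (randomize by default is TRUE)
set.seed(1)
plan <- taguchiDesign("L8_2")
names(plan) = c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code", "empty")
mydesign <- as.data.frame(plan)
mydesign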

##   StandOrder RunOrder Replicate A B C D E F G  y
## 1          5        1         1 2 1 2 1 2 1 2 NA
## 2          7        2         1 2 2 1 1 2 2 1 NA
## 3          1        3         1 1 1 1 1 1 1 1 NA
## 4          3        4         1 1 2 2 1 1 2 2 NA
## 5          4        5         1 1 2 2 2 2 1 1 NA
## 6          8        6         1 2 2 1 2 1 1 2 NA
## 7          6        7         1 2 1 2 2 1 2 1 NA
## 8          2        8         1 1 1 1 2 2 2 2 NA

Rather than (-1,1) encoding for the 2-level factors, qualityTools uses (1,2) encoding. To convert HIGH (2 -> 1) and LOW (1 -> -1), the values in the data frame were multiplied by 2 and 3 was subtracted.

# remove unnecessary columns from mydesign
mydesign <- subset(mydesign, select = -c(StandOrder, RunOrder, Replicate, G, y))

# change column names to match factors in experiment
colnames(mydesign) <- c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code")

# convert encodings
mydesign <- mydesign * 2 - 3

The experimental runs were assigned experiment codes based on the tree-like sorting algorithm, and a histogram confirmed that there were sufficient factor-level combinations for sampling (Fig. 1).

##   SR1 SR2 DT1 DT2 time_code comments_code exp_code
## 1   1  -1   1  -1         1            -1       42
## 2   1   1  -1  -1         1             1       51
## 3  -1  -1  -1  -1        -1            -1        0
## 4  -1   1   1  -1        -1             1       25
## 5  -1   1   1   1         1            -1       30
## 6   1   1  -1   1        -1            -1       52
## 7   1  -1   1   1        -1             1       45
## 8  -1  -1  -1   1         1             1        7

Fig. 1 Distribution of eligible experiment codes for the Taguchi Design.

Rationale for design

The rationale for a Taguchi Design is to minimize trials and maximize efficiency. Taguchi Designs are highly fractional designs well suited to screening experiments and product/process development. What separates the Taguchi Design from other fractional designs is its focus on Parameter Design and Tolerance Design to make a product/process more robust against noise factors.3 For example, the parameters (factors) are divided into an “inner array” (which the engineer controls) and an “outer array” (noise). While some information is certainly lost in a fractional design, Taguchi uses orthogonal arrays to select the trial runs most efficiently.

Randomization

Randomization is implemented by default in the taguchiDesign() method. The response data was also sampled randomly from the dataset.
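
Making this default explicit (randomize is an argument of taguchiDesign() and is TRUE by default, as noted in the Appendix):

# construct the design with an explicitly randomized run order (the default)
plan <- taguchiDesign("L8_2", randomize = TRUE)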

Replication

While there are sufficient data to perform replicates, the goal of this analysis is comparison with the fractional factorial design, which performed only 8 trial runs. Therefore, one response from each factor-level combination in the Taguchi Design was selected.

Blocking

Blocking against nuisance factors was not considered within the scope of this analysis.

3. (Statistical) Analysis

ANOVA was performed on a linear model developed from the response of the 8-run Taguchi Design.

Exploratory data analysis

The factors in the \(2^{6}\) factorial design that had been broken up from 3-level factors do not have any physical significance, as mentioned earlier. The exploratory analysis for the \(2^{2} \times 3^{2}\) factorial design is below.

Box plots of the percent upvotes versus each of the factors (Fig. 2) suggest that there might be a difference across subreddits, and potentially across the relative number of comments. Domain type and time of post do not appear to affect the percent of upvotes. However, in each case there are outliers that could skew the data.

Fig. 2 Box plots showing the main effects for the \(2^{2} \times 3^{2}\) full factorial design.1

For comparison, box plots of the percent upvotes after decomposition of the 3-level factors (subreddit_name, domain_type) into 2-level factors (SR1/SR2 and DT1/DT2, respectively), do suggest similar trends despite some outliers (Fig. 3).

Fig. 3 Box plots showing the main effects for the \(2^{6}\) full factorial design after decomposition of 3-level factors.

However, we know from the earlier report that the response data has a strongly skewed distribution.1 Since the dataset includes “Top Posts” on Reddit, rather than all posts, posts with a lower percent of upvotes were not included in the dataset, resulting in the skew.

Testing

A sample of response data was pulled from the dataset (with exp_code as the filter) with the factor-level combinations defined by the Taguchi Design. In addition, the responses were matched up with the randomized trial order specified by the Taguchi Design and added to the taguchi object.

# create a sample of the eligible data
taguchidata <- eligibledata %>% group_by(exp_code) %>% sample_n(size = 1)

# show the values of the response variable (percent_upvotes) for each corresponding exp_code
response <- taguchidata[,c("exp_code", "percent_upvotes")]

# re-arrange rows of the response variables so that the exp_code matches up with the mydesign dataframe
response <- response[match(as.vector(mydesign$exp_code), response$exp_code),]

# casting as data frame necessary to prevent Effect Plot bug!!
response <- as.data.frame(response)

# add response data to taguchi design
response(plan) = response[,2]
summary(plan)
## Taguchi SINGLE Design
## Information about the factors:
## 
##               A       B       C       D         E             F       G
## value 1      -1      -1      -1      -1        -1            -1      -1
## value 2       1       1       1       1         1             1       1
## name        SR1     SR2     DT1     DT2 time_code comments_code   empty
## unit                                                                   
## type    numeric numeric numeric numeric   numeric       numeric numeric
## 
## -----------
## 
##   StandOrder RunOrder Replicate A B C D E F G response[, 2]
## 1          5        1         1 2 1 2 1 2 1 2      59.37804
## 2          7        2         1 2 2 1 1 2 2 1      64.21902
## 3          1        3         1 1 1 1 1 1 1 1      74.59591
## 4          3        4         1 1 2 2 1 1 2 2      56.74639
## 5          4        5         1 1 2 2 2 2 1 1      54.59522
## 6          8        6         1 2 2 1 2 1 1 2      64.86343
## 7          6        7         1 2 1 2 2 1 2 1      60.03572
## 8          2        8         1 1 1 1 2 2 2 2      59.19547
## 
## -----------

ANOVA on the linear model defined by the main effects resulted in p-values greater than 0.05, thus we cannot say that the differences observed are statistically significant.

# perform ANOVA on the linear model looking only at main effects
mydesign[,"percent_upvotes"] <- response[,"percent_upvotes"]
MyAOV <- aov(lm(mydesign$percent_upvotes ~ (mydesign$SR1 + mydesign$SR2 + mydesign$DT1 + mydesign$DT2 + mydesign$time_code + mydesign$comments_code)))
summary(MyAOV)
##                        Df Sum Sq Mean Sq F value Pr(>F)
## mydesign$SR1            1   1.41    1.41   0.064  0.842
## mydesign$SR2            1  20.42   20.42   0.929  0.512
## mydesign$DT1            1 128.95  128.95   5.865  0.249
## mydesign$DT2            1  33.01   33.01   1.501  0.436
## mydesign$time_code      1  44.43   44.43   2.021  0.390
## mydesign$comments_code  1  21.90   21.90   0.996  0.501
## Residuals               1  21.99   21.99

The effect plots, generated with the effectPlot() function, showed some weak trends, but in the end we cannot say that the main effects are statistically significant (Fig. 4). [NOTE: Ignore the last plot; the seventh column in the Taguchi Design was dropped since we only had six factors.]

Fig. 4 Effect plots for the six 2-level factors in this Taguchi design (ignoring the last plot).

Only 8 runs were conducted in this Taguchi Design, and a larger sample size or additional replicates might have produced more significant results. If the effect sizes are weak, such a small sample is unlikely to detect them.
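
A hypothetical extension (not run in this report) would draw several responses per factor-level combination to add residual degrees of freedom to the ANOVA:

# Sketch: 3 replicates per exp_code instead of 1 (hypothetical)
taguchidata_rep <- eligibledata %>% group_by(exp_code) %>% sample_n(size = 3)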

Diagnostics / Model Adequacy Checking

To check the assumptions on which the model is based, a Normal Q-Q plot was generated from all of the data. The residuals deviate significantly from the reference line at the tails, but the data near the center fit neatly along the line (Fig. 5).
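
The plot was generated from a main-effects linear model fit to the full dataset (reproduced from the Appendix):

# QQ plot of all data
model_all = lm(percent_upvotes ~ SR1 + SR2 + DT1 + DT2 + time_code + comments_code, data = all_data2)
qqnorm(residuals(model_all))
qqline(residuals(model_all))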

Fig. 5 Normal Q-Q plot of residuals from the linear model, taken from all the data.

The Normal Q-Q plot for the 8-run sample of the Taguchi Design has a distinctly different shape, indicating a strongly bimodal distribution (Fig. 6).

Fig. 6 Normal Q-Q plot of residuals from the linear model, taken only from the sample of the Taguchi Design.

Contingencies to the experimental design

As mentioned in the earlier report, Reddit vote tallying, which ultimately determines the list of top posts, was significantly adjusted for the first time in 11 years as Reddit instituted a new voting system that no longer suppresses top scores.4 The dataset used in this analysis applies to the voting system prior to this changeover.

Summary of relevant theory

The results of the Taguchi Design agree with the Fractional Factorial Design in the sense that neither approach detected a statistically significant impact of the main effects. Again, a highly fractional design carries the limitation of reduced information (aliasing, in the language of fractional factorial designs). The difference between the two design matrices is not immediately obvious, other than the fact that the Taguchi Design includes the {1} combination (exp_code = 0) as well as two adjacent factor-level combinations (refer back to Fig. 1).

A final note on Taguchi Designs: the fractionation technique based on orthogonal arrays is not what makes the design favorable. Rather, the design philosophy, including Parameter and Tolerance Design, is what makes the technique useful for robust product/process development. The dataset used in this report is not necessarily optimal for a Taguchi Design. However, if one were to identify “noise” parameters determined by the users of the website, then the domain type and relative number of comments would likely belong in the outer array, while the inner array would consist of the subreddit names.

4. References

  1. Deagen, M. “Voting Polarization Among Reddit Sub-Communities.” RPubs. December 10, 2016. Available December 12, 2016 http://rpubs.com/deagem/reddit.
  2. Sigma Quotient. “Taguchi method - Introduction.” YouTube. October 6, 2016. Available December 12, 2016 https://www.youtube.com/watch?v=VPMhTpr95lM.
  3. “What are Taguchi designs?” Engineering Statistics Handbook. Available December 12, 2016 http://www.itl.nist.gov/div898/handbook/pri/section5/pri56.htm.
  4. KeyserSosa. “Scores on posts are about to start going up.” Reddit. December 6, 2016. Available December 12, 2016 https://www.reddit.com/r/announcements/comments/5gvd6b/scores_on_posts_are_about_to_start_going_up/.

5. Appendices

Raw data

The datasets for all subreddits can be found at https://github.com/umbrae/reddit-top-2.5-million. The chosen datasets for this study were r/science, r/politics, and r/news, which are easily searchable from the URL and can be downloaded as .csv files. In this report, the RCurl library was used to retrieve the raw .csv files directly from their respective URLs.

Complete and documented R code

# Project 4: Optimal Designs
# ISYE 6020
# Mike Deagen

# Objective: Use TAGUCHI DESIGN to explore polarization (in terms of % upvotes) for top posts in various subreddits: r/science, r/politics, and r/news


####### THIS CODE, COPIED FROM PROJECT 3, TO RETRIEVE, CLEAN, AND ORGANIZE THE REDDIT DATA #########
# include RCurl library to be able to retrieve .CSV from a URL
library(RCurl)
# FrF2 library must be installed and loaded
library(FrF2)
# dplyr library must be installed and loaded
library(dplyr)

# obtain raw data from online repository
science_data <- getURL('https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/science.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
politics_data <- getURL('https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/politics.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
news_data <- getURL('https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/news.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
# import raw data as data frames
science_df <- read.csv(textConnection(science_data), header=TRUE)
politics_df <- read.csv(textConnection(politics_data), header=TRUE)
news_df <- read.csv(textConnection(news_data), header=TRUE)

# drop columns NOT of interest
DropColumns <- function(x){
  x <- subset(x, select = -c(id, title, author, permalink, selftext, link_flair_text, over_18, thumbnail, edited, link_flair_css_class, author_flair_css_class, is_self, name, url, distinguished, subreddit_id))
  return(x)
}
# apply the DropColumns function to each dataframe
science_df <- DropColumns(science_df)
politics_df <- DropColumns(politics_df)
news_df <- DropColumns(news_df)


# add column with subreddit name
science_df[,"subreddit_name"] <- c("science")
politics_df[,"subreddit_name"] <- c("politics")
news_df[,"subreddit_name"] <- c("news")

# print the header of each data frame
head(science_df)
head(politics_df)
head(news_df)

# join all datasets together
all_data <- rbind(science_df, politics_df, news_df)

# create column with NORMALIZED comments (as a percent of total votes)
all_data$percent_comments <- (100*all_data$num_comments/(all_data$ups + all_data$downs))

# convert domain column to CHAR type
all_data$domain <- as.character(all_data$domain)

# create column with DOMAIN TYPE (.org, .com, or other)
for(i in 1:nrow(all_data)) {
  
  if (endsWith(all_data[i,'domain'],'.org')) {
    all_data[i,'domain_type'] <- c('.org')
  }
  else if (endsWith(all_data[i,'domain'],'.com')) {
    all_data[i,'domain_type'] <- c('.com')
  }
  else {
    all_data[i,'domain_type'] <- c('other')
  }
}

# create column with PERCENT_UPVOTES
all_data$percent_upvotes <-(100*all_data$ups/(all_data$ups + all_data$downs))

# view structure, head, and tail of the dataset
str(all_data)
head(all_data)
tail(all_data)

# define cutoff for comments HIGH/LOW
cutoff_comments <- median(all_data$percent_comments)

# define cutoff time
cutoff_time <- median(all_data$created_utc)

# determine EXPERIMENT CODE (0-35) based on factor levels
for(i in 1:nrow(all_data)) {
  # initialize coded 0/1 for time and comments cutoffs
  all_data[i,'time_code'] <- -1
  all_data[i,'comments_code'] <- -1
  
  # initialize exp_code as zero
  all_data[i,'exp_code'] <- 0
  
  
  # SORT BY SUBREDDIT NAME
  # if subreddit_name is 'politics', add 12
  if (all_data[i,'subreddit_name']=='politics') {
    all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 12
  }
  # if subreddit_name is 'news', add 24
  else if (all_data[i,'subreddit_name']=='news') {
    all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 24
  }
  
  # SORT BY DOMAIN TYPE
  # if domain_type is '.com', add 4
  if (all_data[i,'domain_type']=='.com') {
    all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 4
  }
  # if domain_type is 'other', add 8
  else if (all_data[i,'domain_type']=='other') {
    all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 8
  }
  
  # SORT BY TIME CUTOFF (and code -1/1 based on designation)
  # if created_utc is GREATER THAN cutoff_time, add 2
  if (all_data[i,'created_utc'] > cutoff_time) {
    all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 2
    all_data[i,'time_code'] <- 1
  }
  
  # SORT BY COMMENTS CUTOFF (and code -1/1 based on designation)
  # if percent_comments is GREATER THAN cutoff_comments, add 1
  if (all_data[i,'percent_comments'] > cutoff_comments) {
    all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 1
    all_data[i,'comments_code'] <- 1
  }
}


# check spread of exp codes
hist(all_data$exp_code, breaks = 36, xlab="Experiment Code (0-35)", main = "Confirmation that all factor-level combinations exist")

# determine the least-occurring exp_code(s), which will limit the number of possible replicates
exp_table <- table(all_data$exp_code)

# create a new dataset for breaking down 3-level factors into two 2-level factors each (and remove old exp_code column)
all_data2 <- subset(all_data, select = -c(exp_code))

# toggle variables that will determine splitting of the middle level of each 3-level factor into (-1,1) or (1,-1)
SR_toggle <- TRUE
DT_toggle <- TRUE

for(i in 1:nrow(all_data2)) {
  
  if (all_data2[i,'domain_type']=='.org') {
    all_data2[i,'DT1'] <- -1
    all_data2[i,'DT2'] <- -1
  }
  else if (all_data2[i,'domain_type']=='.com') {
    if (DT_toggle) {
      all_data2[i,'DT1'] <- 1
      all_data2[i,'DT2'] <- -1
    }
    else {
      all_data2[i,'DT1'] <- -1
      all_data2[i,'DT2'] <- 1
    }
    DT_toggle <- !DT_toggle
  }
  else {
    all_data2[i,'DT1'] <- 1
    all_data2[i,'DT2'] <- 1
  }
  
  if (all_data2[i,'subreddit_name']=='news') {
    all_data2[i,'SR1'] <- -1
    all_data2[i,'SR2'] <- -1
  }
  else if (all_data2[i,'subreddit_name']=='politics') {
    if (SR_toggle) {
      all_data2[i,'SR1'] <- 1
      all_data2[i,'SR2'] <- -1
    }
    else {
      all_data2[i,'SR1'] <- -1
      all_data2[i,'SR2'] <- 1
    }
    SR_toggle <- !SR_toggle
  }
  else {
    all_data2[i,'SR1'] <- 1
    all_data2[i,'SR2'] <- 1
  }
}

# The 2^6 design now has 64 possible experiment codes
# Apply similar sorting structure as above, this time using DT1 and DT2 in place of domain_type and SR1 and SR2 in place of subreddit_name
# determine EXPERIMENT CODE (0-63) based on factor levels
for(i in 1:nrow(all_data2)) {
  # initialize exp_code as zero
  all_data2[i,'exp_code'] <- 0
  
  # SORT BY SR1
  # if SR1 is +1, add 32
  if (all_data2[i,'SR1']=='1') {
    all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 32
  }
  
  # SORT BY SR2
  # if SR2 is +1, add 16
  if (all_data2[i,'SR2']=='1') {
    all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 16
  }
  
  # SORT BY DT1
  # if DT1 is +1, add 8
  if (all_data2[i,'DT1']=='1') {
    all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 8
  }
  
  # SORT BY DT2
  # if DT2 is +1, add 4
  if (all_data2[i,'DT2']=='1') {
    all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 4
  }
  
  # SORT BY TIME CUTOFF (coded -1/1 based on designation)
  # if time_code is +1, add 2
  if (all_data2[i,'time_code']=='1') {
    all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 2
  }
  
  # SORT BY COMMENTS CUTOFF (coded -1/1 based on designation)
  # if comments_code is +1, add 1
  if (all_data2[i,'comments_code']=='1') {
    all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 1
  }
}
# determine the least-occurring exp_code(s), which will limit the number of possible replicates
exp_table2 <- table(all_data2$exp_code)


# set appropriate columns as factors, and return subreddit_name and domain_type to characters
all_data2$SR1 <- as.factor(all_data2$SR1)
all_data2$SR2 <- as.factor(all_data2$SR2)
all_data2$DT1 <- as.factor(all_data2$DT1)
all_data2$DT2 <- as.factor(all_data2$DT2)
all_data2$time_code <- as.factor(all_data2$time_code)
all_data2$comments_code <- as.factor(all_data2$comments_code)
all_data2$subreddit_name <- as.character(all_data2$subreddit_name)
all_data2$domain_type <- as.character(all_data2$domain_type)

######## END OF CODE FROM PROJECT 3 #########



# Install and load Quality Tools package
install.packages("qualityTools", repos='http://cran.us.r-project.org')
require(qualityTools)

# choose Taguchi Design
taguchiChoose(level1 = 2, factors1 = 6)

# show Taguchi Design (randomize by default is TRUE)
set.seed(1)

plan <- taguchiDesign("L8_2")
values(plan) = list(A = c(-1,1), B = c(-1,1), C = c(-1,1), D = c(-1,1), E = c(-1,1), F = c(-1,1), G = c(-1, 1))
names(plan) = c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code", "empty")
mydesign <- as.data.frame(plan)
mydesign

# remove unnecessary columns from mydesign
mydesign <- subset(mydesign, select = -c(StandOrder, RunOrder, Replicate, G, y))

# change column names to match factors in experiment
colnames(mydesign) <- c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code")

# convert encodings
mydesign <- mydesign * 2 - 3
mydesign

# create a new row in mydesign with the experiment code corresponding to the factor level sorting algorithm used earlier
for(i in 1:nrow(mydesign)) {
  # initialize exp_code as zero
  mydesign[i,'exp_code'] <- 0
  
  # SORT BY SR1
  # if SR1 is +1, add 32
  if (mydesign[i,'SR1']=='1') {
    mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 32
  }
  
  # SORT BY SR2
  # if SR2 is +1, add 16
  if (mydesign[i,'SR2']=='1') {
    mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 16
  }
  
  # SORT BY DT1
  # if DT1 is +1, add 8
  if (mydesign[i,'DT1']=='1') {
    mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 8
  }
  
  # SORT BY DT2
  # if DT2 is +1, add 4
  if (mydesign[i,'DT2']=='1') {
    mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 4
  }
  
  # SORT BY TIME CUTOFF (coded -1/1 based on designation)
  # if time_code is +1, add 2
  if (mydesign[i,'time_code']=='1') {
    mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 2
  }
  
  # SORT BY COMMENTS CUTOFF (coded -1/1 based on designation)
  # if comments_code is +1, add 1
  if (mydesign[i,'comments_code']=='1') {
    mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 1
  }
}

# create subset of eligible data that meets the criteria of the fractional factorial design (using exp_code as the filter)
eligibledata <- all_data2[is.element(all_data2$exp_code, mydesign$exp_code),]

# check spread of exp codes for the eligible data
hist(eligibledata$exp_code, breaks = 64, xlab="Experiment Code (0-63)", main = "Eligible data for sampling")

# create a sample of the eligible data
taguchidata <- eligibledata %>% group_by(exp_code) %>% sample_n(size = 1)


# show the values of the response variable (percent_upvotes) for each corresponding exp_code
response <- taguchidata[,c("exp_code", "percent_upvotes")]
colnames(response) <- c("exp_code", "percent_upvotes")

# re-arrange rows of the response variables so that the exp_code matches up with the mydesign dataframe
response <- response[match(as.vector(mydesign$exp_code), response$exp_code),]

# casting as data frame necessary to prevent Effect Plot bug!!
response <- as.data.frame(response)

# add response data to taguchi design
response(plan) = response[,2]
summary(plan)

# print effect plots
effectPlot(plan, main = "Effect Plot for response", ylab = "% upvotes", col = 2)

# perform ANOVA on the linear model looking only at main effects
mydesign[,"percent_upvotes"] <- response[,"percent_upvotes"]
MyAOV <- aov(lm(mydesign$percent_upvotes ~ (mydesign$SR1 + mydesign$SR2 + mydesign$DT1 + mydesign$DT2 + mydesign$time_code + mydesign$comments_code)))
summary(MyAOV)


# EXPLORATORY DATA ANALYSIS

#Boxplots for 2^2*3^2 full factorial
# boxplots of response variables for full dataset
par(mfrow=c(2,3))
boxplot(percent_upvotes~subreddit_name, all_data, xlab="Subreddit", ylab = "Percent upvotes")
boxplot(percent_upvotes~domain_type, all_data, xlab="Domain type", ylab = "Percent upvotes")
boxplot(percent_upvotes~time_code, all_data2, xlab="Time of post", ylab = "Percent upvotes")
boxplot(percent_upvotes~comments_code, all_data2, xlab="Relative number of comments", ylab = "Percent upvotes")

#Boxplots for 2^6 full factorial
# boxplots of response variables after decomposition of subreddit (SR1, SR2) and domain_type (DT1,DT2)
par(mfrow=c(2,3))
boxplot(percent_upvotes~SR1, all_data2, xlab="SR1 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~SR2, all_data2, xlab="SR2 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~DT1, all_data2, xlab="DT1 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~DT2, all_data2, xlab="DT2 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~time_code, all_data2, xlab="Time of post", ylab = "Percent upvotes")
boxplot(percent_upvotes~comments_code, all_data2, xlab="Relative number of comments", ylab = "Percent upvotes")

# Model Adequacy
#QQ plot of all data
model_all = lm(percent_upvotes~SR1 + SR2 + DT1 + DT2 + time_code + comments_code, data = all_data2)
qqnorm(residuals(model_all))
qqline(residuals(model_all))

plot(fitted(model_all), residuals(model_all))

#QQ plot of Taguchi experiment
model_taguchi = lm(percent_upvotes~SR1 + SR2 + DT1 + DT2 + time_code + comments_code, data = mydesign)
qqnorm(residuals(model_taguchi))
qqline(residuals(model_taguchi))

plot(fitted(model_taguchi), residuals(model_taguchi))