In an earlier report, voting polarization was analyzed across three different subreddits on Reddit.com using a fractional factorial design.1 The exploratory data analysis hinted at potentially significant differences between subreddits, but the ANOVA indicated otherwise. The hypothesized cause for the discrepancy was the artificial decomposition of non-ordinal 3-level factors into two 2-level factors. In this analysis, a Taguchi Design is implemented to ask the same research question: do significant differences in voting polarization exist between subreddits?
Github user umbrae gathered top Reddit posts and placed them on Github as .csv files arranged by subreddit. The chosen subreddits for this study were r/science, r/politics, and r/news. Links to the Github repository can be found below in the Appendix.
Identical data from the earlier study1 were used. The two 3-level independent variables (IVs) include:
The two 2-level IVS selected include:
The 3-level IVs were each broken down into two 2-level IVs. subreddit_name became SR1 and SR2, and domain_type became DT1 and DT2. The middle level of each 3-level factor was broken down into (1,-1) and (-1,1) in an alternating fashion using toggled boolean varaibles.1 The result was a \(2^{6}\) full factorial design.
Both continuous IVs, relative number of comments and time of post, were cast as 2-level categorical variables to meet the requirements of the experimental design. The variable levels were appropriated into two categories based on an an arbitrary cutoff value. The median of each factor was chosen as the cutoff value to maximize the number of available combinations for the factorial.
The response variable, percent_upvotes, was simply calculated by dividing the total number of upvotes by the total number of votes (upvotes + downvotes) for that post. A universally-liked post would approach 100% upvotes, while a more polarizing post would approach 50%.
Refer to the earlier report for a description of how the raw data was imported, cleaned, and organized.1
In the \(2^{6}\) full factorial design, there are 64 possible factor-level combinations. A tree-like sorting algorithm was implemented to assign unique experiment codes (exp_code from 0-63) to each factor-level combination.1 These experiment codes are later used to match response variables to the appropriate experimental runs. The structure, head, and tail of the raw data frame (after removing and adding columns) is given below.
## 'data.frame': 2998 obs. of 17 variables:
## $ created_utc : num 1.34e+09 1.34e+09 1.35e+09 1.34e+09 1.36e+09 ...
## $ score : int 5785 5593 4626 4486 4465 4438 4305 4066 4047 4025 ...
## $ domain : chr "twitter.com" "webcast.web.cern.ch" "nasa.gov" "guardian.co.uk" ...
## $ ups : int 53727 33456 42063 18267 33283 9418 16929 16996 19220 12772 ...
## $ downs : int 47942 27863 37437 13781 28818 4980 12624 12930 15173 8747 ...
## $ num_comments : int 4567 2919 1515 739 1135 1141 1165 1008 717 1069 ...
## $ subreddit_name : chr "science" "science" "science" "science" ...
## $ percent_comments: num 4.49 4.76 1.91 2.31 1.83 ...
## $ domain_type : chr ".com" "other" "other" "other" ...
## $ percent_upvotes : num 52.8 54.6 52.9 57 53.6 ...
## $ time_code : Factor w/ 2 levels "-1","1": 1 1 1 1 2 2 1 1 1 1 ...
## $ comments_code : Factor w/ 2 levels "-1","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ DT1 : Factor w/ 2 levels "-1","1": 2 2 2 2 1 2 1 2 2 1 ...
## $ DT2 : Factor w/ 2 levels "-1","1": 1 2 2 2 2 1 2 2 1 2 ...
## $ SR1 : Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ SR2 : Factor w/ 2 levels "-1","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ exp_code : num 56 60 60 60 54 59 52 60 56 52 ...
## created_utc score domain ups downs num_comments
## 1 1344231209 5785 twitter.com 53727 47942 4567
## 2 1341387530 5593 webcast.web.cern.ch 33456 27863 2919
## 3 1354217414 4626 nasa.gov 42063 37437 1515
## 4 1342384318 4486 guardian.co.uk 18267 13781 739
## 5 1361725918 4465 reuters.com 33283 28818 1135
## 6 1363051689 4438 nature.com 9418 4980 1141
## subreddit_name percent_comments domain_type percent_upvotes time_code
## 1 science 4.492028 .com 52.84502 -1
## 2 science 4.760352 other 54.56058 -1
## 3 science 1.905660 other 52.90943 -1
## 4 science 2.305916 other 56.99888 -1
## 5 science 1.827668 .com 53.59495 1
## 6 science 7.924712 .com 65.41186 1
## comments_code DT1 DT2 SR1 SR2 exp_code
## 1 -1 1 -1 1 1 56
## 2 -1 1 1 1 1 60
## 3 -1 1 1 1 1 60
## 4 -1 1 1 1 1 60
## 5 -1 -1 1 1 1 54
## 6 1 1 -1 1 1 59
## created_utc score domain ups downs num_comments
## 2993 1366385781 1031 reuters.com 1439 408 157
## 2994 1364274310 1033 earthquake-report.com 1390 357 262
## 2995 1292330835 1024 newsfeed.time.com 2282 1258 206
## 2996 1374840797 1027 guardian.co.uk 1390 363 35
## 2997 1342784992 1023 self.news 1862 839 442
## 2998 1376504442 809 nbcdfw.com 1052 243 176
## subreddit_name percent_comments domain_type percent_upvotes time_code
## 2993 news 8.500271 .com 77.91012 1
## 2994 news 14.997138 .com 79.56497 1
## 2995 news 5.819209 .com 64.46328 -1
## 2996 news 1.996577 other 79.29264 1
## 2997 news 16.364310 other 68.93743 -1
## 2998 news 13.590734 .com 81.23552 1
## comments_code DT1 DT2 SR1 SR2 exp_code
## 2993 1 -1 1 -1 -1 7
## 2994 1 1 -1 -1 -1 11
## 2995 -1 -1 1 -1 -1 4
## 2996 -1 1 1 -1 -1 14
## 2997 1 1 1 -1 -1 13
## 2998 1 1 -1 -1 -1 11
The qualityTools package in R was installed and implemented to construct the Taguchi Design. Useful functions within this package include taguchiChoose() and taguchiDesign().
In this experiment, we are performing ANOVA on a linear model developed from a Taguchi Design. Since the design and analysis have occured after the data was collected, the sample was selected based on groups in order to provide a balanced design using the same sampling technique applied in the report on fractional factorial designs.1
Knowing that there were a total of six 2-level factors, the taguchiChoose() function was used to show the possible Taguchi Designs.
# Install and load Quality Tools package
install.packages("qualityTools", repos='http://cran.us.r-project.org')
## package 'qualityTools' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\Deagen\AppData\Local\Temp\RtmpgVWzhc\downloaded_packages
require(qualityTools)
# choose Taguchi Design
taguchiChoose(level1 = 2, factors1 = 6)
## 6 factors on 2 levels and 0 factors on 0 levels with 0 desired interactions to be estimated
##
## Possible Designs:
##
## L8_2 L12_2 L16_2 L32_2
##
## Use taguchiDesign("L8_2") or different to create a taguchi design object
The resulting designs to choose from include L8_2, L12_2, L16_2, L32_2. The notation for Taguchi Designs is as follows:
\[L_{N}(S^{k})\] where \(N\) is the number of experimental runs, \(S\) is the number of levels, and \(k\) is the number of factors.2 Since we are interested in minimizing the amount of runs in our experiment, the \(L_{8}(2^{6})\). In the notation of qualityTools, we are interested in the L8_2 design. Through this design, we can run 1/8 of the 64 designs that a \(2^{6}\) full factorial design would otherwise require.
## StandOrder RunOrder Replicate A B C D E F G y
## 1 5 1 1 2 1 2 1 2 1 2 NA
## 2 7 2 1 2 2 1 1 2 2 1 NA
## 3 1 3 1 1 1 1 1 1 1 1 NA
## 4 3 4 1 1 2 2 1 1 2 2 NA
## 5 4 5 1 1 2 2 2 2 1 1 NA
## 6 8 6 1 2 2 1 2 1 1 2 NA
## 7 6 7 1 2 1 2 2 1 2 1 NA
## 8 2 8 1 1 1 1 2 2 2 2 NA
Rather than (-1,1) encoding for the 2-level factors, qualityTools uses (1,2) encoding. To convert HIGH (2 –> 1) and LOW (1 –> -1) encodings, the rows in the dataframe were multiplied by 2, and 3 was subtracted.
# remove unnecessary columns from mydesign
mydesign <- subset(mydesign, select = -c(StandOrder, RunOrder, Replicate, G, y))
# change column names to match factors in experiment
colnames(mydesign) <- c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code")
# convert encodings
mydesign <- mydesign * 2 - 3
The experimental runs were assigned experiment codes based on the tree-like sorting algorithm, and a histogram confirmed that there were sufficient factor-level combinations for sampling (Fig. 1).
## SR1 SR2 DT1 DT2 time_code comments_code exp_code
## 1 1 -1 1 -1 1 -1 42
## 2 1 1 -1 -1 1 1 51
## 3 -1 -1 -1 -1 -1 -1 0
## 4 -1 1 1 -1 -1 1 25
## 5 -1 1 1 1 1 -1 30
## 6 1 1 -1 1 -1 -1 52
## 7 1 -1 1 1 -1 1 45
## 8 -1 -1 -1 1 1 1 7
Fig. 1 Distribution of eligible experiment codes for the Taguchi Design.
The rationale for a Taguchi Design is to minimize trials and maximize efficiency. Taguchi Designs are highly fractional designs good for screening experiments or product/process development. What separates the Taguchi Design from other fractional designs is its focus on Parameter Design and Tolerance Design to make a product/process more robust against noise factors.3 For example, the parameters (factors) are divided into an “inner array” (engineer has control) and an “outer array” (noise). While some information is certainly lost in a fractional design, Taguchi uses orthogonal arrays to select the trial runs most efficiently.
Randomization is implemented by default in the taguchiDesign() method. The response data was also sampled randomly from the dataset.
While there are sufficient data to perform replicates, the goal of this analysis is to compare to the fractional factorial design that performed only 8 trial runs. Therefore, one response from each factor-level combination in the Taguchi Design was selected.
Blocking against nuisance factors was not considered within the scope of this analysis.
ANOVA was performed on a linear model developed from the response of the 8-run Taguchi Design.
The factors in the \(2^{6}\) factorial design that had been broken up from 3-level factors do not have any physical significance, as mentioned earlier. The exploratory analysis for the \(2^{2}*3^{2}\) factorial design is below.
Box plots of the percent upvotes versus each of the factors (Fig. 2) seem to indicate that there might be a difference across subreddit, and also potentially on the relative number of comments. Domain type and Time of post do not seem to affect the percent of upvotes. However, in each case there appear to be outliers that could skew the data.
Fig. 2 Box plots showing the main effects for the \(2^{2}*3^{2}\) full factorial design.1
For comparison, box plots of the percent upvotes after decomposition of the 3-level factors (subreddit_name, domain_type) into 2-level factors (SR1/SR2 and DT1/DT2, respectively), do suggest similar trends despite some outliers (Fig. 3).
Fig. 3 Box plots showing the main effects for the \(2^{6}\) full factorial design after decomposition of 3-level factors.
However, we know from the earlier report that the response data has a strongly skewed distribution.1 Since the dataset includes “Top Posts” on Reddit, rather than all posts, posts with a lower percent of upvotes were not included the dataset, resulting in the skew.
A sample of response data was pulled from the dataset (with exp_code as the filter) with the factor-level combinations defined by the Taguchi Design. In addition, the responses were matched up with the randomized trial order specified by the Taguchi Design and added to the taguchi object.
# create a sample of the eligible data
taguchidata <- eligibledata %>% group_by(exp_code) %>% sample_n(size = 1)
# show the values of the response variable (percent_upvotes) for each corresponding exp_code
response <- taguchidata[,c("exp_code", "percent_upvotes")]
# re-arrange rows of the response variables so that the exp_code matches up with the mydesign dataframe
response <- response[match(as.vector(mydesign$exp_code), response$exp_code),]
# casting as data frame necessary to prevent Effect Plot bug!!
response <- as.data.frame(response)
# add response data to taguchi design
response(plan) = response[,2]
summary(plan)
## Taguchi SINGLE Design
## Information about the factors:
##
## A B C D E F G
## value 1 -1 -1 -1 -1 -1 -1 -1
## value 2 1 1 1 1 1 1 1
## name SR1 SR2 DT1 DT2 time_code comments_code empty
## unit
## type numeric numeric numeric numeric numeric numeric numeric
##
## -----------
##
## StandOrder RunOrder Replicate A B C D E F G response[, 2]
## 1 5 1 1 2 1 2 1 2 1 2 59.37804
## 2 7 2 1 2 2 1 1 2 2 1 64.21902
## 3 1 3 1 1 1 1 1 1 1 1 74.59591
## 4 3 4 1 1 2 2 1 1 2 2 56.74639
## 5 4 5 1 1 2 2 2 2 1 1 54.59522
## 6 8 6 1 2 2 1 2 1 1 2 64.86343
## 7 6 7 1 2 1 2 2 1 2 1 60.03572
## 8 2 8 1 1 1 1 2 2 2 2 59.19547
##
## -----------
ANOVA on the linear model defined by the main effects resulted in p-values greater than 0.05, thus we cannot say that the differences observed are statistically significant.
# perform ANOVA on the linear model looking only at main effects
mydesign[,"percent_upvotes"] <- response[,"percent_upvotes"]
MyAOV <- aov(lm(mydesign$percent_upvotes ~ (mydesign$SR1 + mydesign$SR2 + mydesign$DT1 + mydesign$DT2 + mydesign$time_code + mydesign$comments_code)))
summary(MyAOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## mydesign$SR1 1 1.41 1.41 0.064 0.842
## mydesign$SR2 1 20.42 20.42 0.929 0.512
## mydesign$DT1 1 128.95 128.95 5.865 0.249
## mydesign$DT2 1 33.01 33.01 1.501 0.436
## mydesign$time_code 1 44.43 44.43 2.021 0.390
## mydesign$comments_code 1 21.90 21.90 0.996 0.501
## Residuals 1 21.99 21.99
The main effects, using the effectPlot() function, showed some weak trends but in the end we cannot say that the main effects are statistically significant (Fig. 4). [NOTE: Ignore the last plot; the seventh column in the Taguchi Design was dropped since we only had six factors.]
Fig. 4 Effect plots for the six 2-level factors in this Taguchi design (ignoring the last plot).
Only 8 runs were conducted in this Taguchi Design, and a larger sample size or additional replicates may have produced more significant results. Assuming that the effect sizes are weak, such a small sample is unlikely to produce significant results.
To check the assumptions on which the model is based, a Normal Q-Q plot was generated with all of the data. The residuals begin to deviate significantly from the model at the tail ends of the plot, but the data near the center fits neatly along the model line (Fig. 5).
Fig. 5 Normal Q-Q plot of residuals from the linear model, taken from all the data.
The Normal Q-Q plot for the 8-run sample of the Taguchi Design has a distinctly different shape, indicating a strongly bimodal distribution.
Fig. 6 Normal Q-Q plot of residuals from the linear model, taken only from the sample of the Taguchi Design.
As mentioned in the earlier report, Reddit vote tallying, which ultimately determines the list of top posts, was significantly adjusted for the first time in 11 years as Reddit instituted a new voting system that no longer suppresses top scores.4 The dataset used in this analysis applies to the voting system prior to this changeover.
The results of the Taguchi Design agree with the Fractional Factorial Design in the sense that neither approach detected a statistically significant impact of the main effects. Again, the limitations to a highly fractional design approach include the disadvantage of less information (or aliasing in FFD). The difference between the two design matrices is not immediately obvious, other than the fact that Taguchi includes the {1} combination (exp_code=0) as well as two adjacent factor-level combinations (refer back to Fig. 1).
A final note on Taguchi Designs is that its fractional technique based on orthogonal arrays is not what makes the design favorable. Rather, the design philosophy including the Parameter and Tolerance Design is what makes the technique useful for robust product/process development. The dataset used in this report is not necessarily optimal for a Taguchi Design. However, if one were to identify “noise” parameters determined by the users of the website, then the domain type and relative number of comments would likely be in the outer array. Meanwhile, the inner array would consist of the subreddit names.
The datasets for all subreddits can be found at https://github.com/umbrae/reddit-top-2.5-million. The chosen datasets for this study were r/science, r/politics, and r/news, which are easily searchable from the URL and can be downloaded as .CSV files. In this report, the RCurl library was used to retrieve the raw .CSV files directly from their respective URLs.
# Project 4: Optimal Designs
# ISYE 6020
# Mike Deagen
# Objective: Use TAGUCHI DESIGN to explore polarization (in terms of % upvotes) for top posts in various subreddits: r/science, r/politics, and r/news
####### THIS CODE, COPIED FROM PROJECT 3, TO RETRIEVE, CLEAN, AND ORGANIZE THE REDDIT DATA #########
# include RCurl library to be able to retrieve .CSV from a URL
library(RCurl)
# FrF2 library must be installed and loaded
library(FrF2)
# dplyr library must be installed and loaded
library(dplyr)
# obtain raw data from online repository
science_data <- getURL('https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/science.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
politics_data <- getURL('https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/politics.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
news_data <- getURL('https://raw.githubusercontent.com/umbrae/reddit-top-2.5-million/master/data/news.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)
# import raw data as data frames
science_df <- read.csv(textConnection(science_data), header=TRUE)
politics_df <- read.csv(textConnection(politics_data), header=TRUE)
news_df <- read.csv(textConnection(news_data), header=TRUE)
# drop columns NOT of interest
DropColumns <- function(x){
x <- subset(x, select = -c(id, title, author, permalink, selftext, link_flair_text, over_18, thumbnail, edited, link_flair_css_class, author_flair_css_class, is_self, name, url, distinguished, subreddit_id))
return(x)
}
# apply the DropColumns function to each dataframe
science_df <- DropColumns(science_df)
politics_df <- DropColumns(politics_df)
news_df <- DropColumns(news_df)
# add column with subreddit name
science_df[,"subreddit_name"] <- c("science")
politics_df[,"subreddit_name"] <- c("politics")
news_df[,"subreddit_name"] <- c("news")
# print the header of each data frame
head(science_df)
head(politics_df)
head(news_df)
# join all datasets together
all_data <- rbind(science_df, politics_df, news_df)
# create column with NORMALIZED comments (as a percent of total votes)
all_data$percent_comments <- (100*all_data$num_comments/(all_data$ups + all_data$downs))
# convert domain column to CHAR type
all_data$domain <- as.character(all_data$domain)
# create column with DOMAIN TYPE (.org, .com, or other)
for(i in 1:nrow(all_data)) {
if (endsWith(all_data[i,'domain'],'.org')) {
all_data[i,'domain_type'] <- c('.org')
}
else if (endsWith(all_data[i,'domain'],'.com')) {
all_data[i,'domain_type'] <- c('.com')
}
else {
all_data[i,'domain_type'] <- c('other')
}
}
# create column with PERCENT_UPVOTES
all_data$percent_upvotes <-(100*all_data$ups/(all_data$ups + all_data$downs))
# view structure, head, and tail of the dataset
str(all_data)
head(all_data)
tail(all_data)
# define cutoff for comments HIGH/LOW
cutoff_comments <- median(all_data$percent_comments)
# define cutoff time
cutoff_time <- median(all_data$created_utc)
# determine EXPERIMENT CODE (0-35) based on factor levels
for(i in 1:nrow(all_data)) {
# initialize coded 0/1 for time and comments cutoffs
all_data[i,'time_code'] <- -1
all_data[i,'comments_code'] <- -1
# initialize exp_code as zero
all_data[i,'exp_code'] <- 0
# SORT BY SUBREDDIT NAME
# if subreddit_name is 'politics', add 12
if (all_data[i,'subreddit_name']=='politics') {
all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 12
}
# if subreddit_name is 'politics', add 24
else if (all_data[i,'subreddit_name']=='news') {
all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 24
}
# SORT BY DOMAIN TYPE
# if domain_type is '.com', add 4
if (all_data[i,'domain_type']=='.com') {
all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 4
}
# if domain_type is 'other', add 8
else if (all_data[i,'domain_type']=='other') {
all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 8
}
# SORT BY TIME CUTOFF (and code -1/1 based on designation)
# if created_utc is GREATER THAN cuttoff_time, add 2
if (all_data[i,'created_utc'] > cutoff_time) {
all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 2
all_data[i,'time_code'] <- 1
}
# SORT BY COMMENTS CUTOFF (and code -1/1 based on designation)
# if percent comments is GREATER THAN cuttoff_comments, add 1
if (all_data[i,'percent_comments'] > cutoff_comments) {
all_data[i,'exp_code'] <- all_data[i,'exp_code'] + 1
all_data[i,'comments_code'] <- 1
}
}
# check spread of exp codes
hist(all_data$exp_code, breaks = 36, xlab="Experiment Code (0-35)", main = "Confirmation that all factor-level combinations exist")
# determine the least-occurring exp_code(s), which will limit the number of possible replicates
exp_table <- table(all_data$exp_code)
# create a new dataset for breaking down 3-level factors into two 2-level factors each (and remove old exp_code column)
all_data2 <- subset(all_data, select = -c(exp_code))
# toggle variables that will determing splitting of middle of 3-level into (-1,1) or (1,-1)
SR_toggle <- TRUE
DT_toggle <- TRUE
for(i in 1:nrow(all_data2)) {
if (all_data2[i,'domain_type']=='.org') {
all_data2[i,'DT1'] <- -1
all_data2[i,'DT2'] <- -1
}
else if (all_data[i,'domain_type']=='.com') {
if (DT_toggle) {
all_data2[i,'DT1'] <- 1
all_data2[i,'DT2'] <- -1
}
else {
all_data2[i,'DT1'] <- -1
all_data2[i,'DT2'] <- 1
}
DT_toggle <- !DT_toggle
}
else {
all_data2[i,'DT1'] <- 1
all_data2[i,'DT2'] <- 1
}
if (all_data2[i,'subreddit_name']=='news') {
all_data2[i,'SR1'] <- -1
all_data2[i,'SR2'] <- -1
}
else if (all_data[i,'subreddit_name']=='politics') {
if (SR_toggle) {
all_data2[i,'SR1'] <- 1
all_data2[i,'SR2'] <- -1
}
else {
all_data2[i,'SR1'] <- -1
all_data2[i,'SR2'] <- 1
}
SR_toggle <- !SR_toggle
}
else {
all_data2[i,'SR1'] <- 1
all_data2[i,'SR2'] <- 1
}
}
# The 2^6 design now has 64 possible experiment codes
# Apply similar sorting structure as above, this time using DT1 and DT2 in place of domain_type and SR1 and SR2 in place of subreddit_name
# determine EXPERIMENT CODE (0-63) based on factor levels
for(i in 1:nrow(all_data2)) {
# initialize exp_code as zero
all_data2[i,'exp_code'] <- 0
# SORT BY SR1
# if SR1 is +1, add 32
if (all_data2[i,'SR1']=='1') {
all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 32
}
# SORT BY SR2
# if SR2 is +1, add 16
if (all_data2[i,'SR2']=='1') {
all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 16
}
# SORT BY DT1
# if DT1 is +1, add 8
if (all_data2[i,'DT1']=='1') {
all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 8
}
# SORT BY DT2
# if DT2 is +1, add 4
if (all_data2[i,'DT2']=='1') {
all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 4
}
# SORT BY TIME CUTOFF (coded -1/1 based on designation)
# if time_code is +1, add 2
if (all_data2[i,'time_code']=='1') {
all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 2
}
# SORT BY COMMENTS CUTOFF (coded -1/1 based on designation)
# if comments_code is +1, add 1
if (all_data2[i,'comments_code']=='1') {
all_data2[i,'exp_code'] <- all_data2[i,'exp_code'] + 1
}
}
# determine the least-occurring exp_code(s), which will limit the number of possible replicates
exp_table2 <- table(all_data2$exp_code)
# set appropriate columns as factors, and return subreddit_name and domain_type to characters
all_data2$SR1 <- as.factor(all_data2$SR1)
all_data2$SR2 <- as.factor(all_data2$SR2)
all_data2$DT1 <- as.factor(all_data2$DT1)
all_data2$DT2 <- as.factor(all_data2$DT2)
all_data2$time_code <- as.factor(all_data2$time_code)
all_data2$comments_code <- as.factor(all_data2$comments_code)
all_data2$subreddit_name <- as.character(all_data2$subreddit_name)
all_data2$domain_type <- as.character(all_data2$domain_type)
######## END OF CODE FROM PROJECT 3 #########
# Install and load Quality Tools package
install.packages("qualityTools", repos='http://cran.us.r-project.org')
require(qualityTools)
# choose Taguchi Design
taguchiChoose(level1 = 2, factors1 = 6)
# show Taguchi Design (randomize by default is TRUE)
set.seed(1)
plan <- taguchiDesign("L8_2")
values(plan) = list(A = c(-1,1), B = c(-1,1), C = c(-1,1), D = c(-1,1), E = c(-1,1), F = c(-1,1), G = c(-1, 1))
names(plan) = c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code", "empty")
mydesign <- as.data.frame(plan)
mydesign
# remove unnecessary columns from mydesign
mydesign <- subset(mydesign, select = -c(StandOrder, RunOrder, Replicate, G, y))
# change column names to match factors in experiment
colnames(mydesign) <- c("SR1", "SR2", "DT1", "DT2", "time_code", "comments_code")
# convert encodings
mydesign <- mydesign * 2 - 3
mydesign
# create a new row in mydesign with the experiment code corresponding to the factor level sorting algorithm used earlier
for(i in 1:nrow(mydesign)) {
# initialize exp_code as zero
mydesign[i,'exp_code'] <- 0
# SORT BY SR1
# if SR1 is +1, add 32
if (mydesign[i,'SR1']=='1') {
mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 32
}
# SORT BY SR2
# if SR2 is +1, add 16
if (mydesign[i,'SR2']=='1') {
mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 16
}
# SORT BY DT1
# if DT1 is +1, add 8
if (mydesign[i,'DT1']=='1') {
mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 8
}
# SORT BY DT2
# if DT2 is +1, add 4
if (mydesign[i,'DT2']=='1') {
mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 4
}
# SORT BY TIME CUTOFF (coded -1/1 based on designation)
# if time_code is +1, add 2
if (mydesign[i,'time_code']=='1') {
mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 2
}
# SORT BY COMMENTS CUTOFF (coded -1/1 based on designation)
# if comments_code is +1, add 1
if (mydesign[i,'comments_code']=='1') {
mydesign[i,'exp_code'] <- mydesign[i,'exp_code'] + 1
}
}
# create subset of eligible data that meets the criteria of the fractional factorial design (using exp_code as the filter)
eligibledata <- all_data2[is.element(all_data2$exp_code, mydesign$exp_code),]
# check spread of exp codes for all_data2
hist(eligibledata$exp_code, breaks = 64, xlab="Experiment Code (0-63)", main = "Eligible data for sampling")
# create a sample of the eligible data
taguchidata <- eligibledata %>% group_by(exp_code) %>% sample_n(size = 1)
# show the values of the response variable (percent_upvotes) for each corresponding exp_code
response <- taguchidata[,c("exp_code", "percent_upvotes")]
colnames(response) <- c("exp_code", "percentupvotes")
# re-arrange rows of the response variables so that the exp_code matches up with the mydesign dataframe
response <- response[match(as.vector(mydesign$exp_code), response$exp_code),]
# casting as data frame necessary to prevent Effect Plot bug!!
response <- as.data.frame(response)
# add response data to taguchi design
response(plan) = response[,2]
summary(plan)
# print effect plots
effectPlot(plan, main = "Effect Plot for response", ylab = "% upvotes", col = 2)
# perform ANOVA on the linear model looking only at main effects
mydesign[,"percent_upvotes"] <- response[,"percent_upvotes"]
MyAOV <- aov(lm(mydesign$percent_upvotes ~ (mydesign$SR1 + mydesign$SR2 + mydesign$DT1 + mydesign$DT2 + mydesign$time_code + mydesign$comments_code)))
summary(MyAOV)
# EXPLORATORY DATA ANALYSIS
#Boxplots for 2^2*3^2 full factorial
# boxplots of response variables for full dataset
par(mfrow=c(2,3))
boxplot(percent_upvotes~subreddit_name, all_data, xlab="Subreddit", ylab = "Percent upvotes")
boxplot(percent_upvotes~domain_type, all_data, xlab="Domain type", ylab = "Percent upvotes")
boxplot(percent_upvotes~time_code, all_data2, xlab="Time of post", ylab = "Percent upvotes")
boxplot(percent_upvotes~comments_code, all_data2, xlab="Relative number of comments", ylab = "Percent upvotes")
#Boxplots for 2^6 full factorial
# boxplots of response variables after decomposition of subreddit (SR1, SR2) and domain_type (DT1,DT2)
par(mfrow=c(2,3))
boxplot(percent_upvotes~SR1, all_data2, xlab="SR1 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~SR2, all_data2, xlab="SR2 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~DT1, all_data2, xlab="DT1 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~DT2, all_data2, xlab="DT2 (non-physical)", ylab = "Percent upvotes")
boxplot(percent_upvotes~time_code, all_data2, xlab="Time of post", ylab = "Percent upvotes")
boxplot(percent_upvotes~comments_code, all_data2, xlab="Relative number of comments", ylab = "Percent upvotes")
# Model Adequacy
#QQ plot of all data
model_all = lm(percent_upvotes~SR1 + SR2 + DT1 + DT2 + time_code + comments_code, data = all_data2)
qqnorm(residuals(model_all))
qqline(residuals(model_all))
plot(fitted(model_all), residuals(model_all))
#QQ plot of Taguchi experiment
model_taguchi = lm(percent_upvotes~SR1 + SR2 + DT1 + DT2 + time_code + comments_code, data = mydesign)
qqnorm(residuals(model_taguchi))
qqline(residuals(model_taguchi))
plot(fitted(model_taguchi), residuals(model_taguchi))