Prepare data for analysis

To read and process the data in R, you will need to load the “foreign” and “dplyr” packages using library(). The “foreign” package allows us to read .DTA files and “dplyr” will ease manipulation of the data once read.
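For example (assuming both packages are already installed):

library(foreign)                                # read.dta() and write.dta() for .DTA files
library(dplyr)                                  # filter(), select(), arrange(), etc.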

First we use read.dta() to read in the full dataset as a dataframe that we name “fullData”. We retain the full dataset in case we want to incorporate additional variables into our analysis, but we also create an additional dataframe named “data” that contains only the observations and variables we need. As we only want one observation/row per subject, rather than one observation/row per subject per period, we limit the dataframe to a single period using filter(). Moreover, we’re only looking at preference data from Stages 1 and 3 of the experiment, so we ignore the variables from Stage 2. The select() command allows us to limit our working dataset “data” to only those variables we consider potentially relevant to clustering. For ease of reading, we use arrange() to sort the observations based on their session ID.

fullData <- read.dta("ProdVCM_master_working_160422.dta")

data <-
  arrange(
    filter(
      select(fullData,
              treatment:subjectid, contributionp_i_0:contributionp_i_12,
              contributionp_ii_0:contributionp_ii_12, contributionp_i_13,
              contributionp_ii_13, scenb, scen2b, session_id ),
      period == 15 ),
    session_id )

For the most part, our analysis uses data available in the raw dataset, but we also use two variables that have to be created. We create those variables by using source() to call a data processing R script, which you can find in the same repository as this markdown file. In that script, we also create several additional variables that we and future researchers might find useful for exploratory and formal analysis. Those variables are introduced in the Appendix of this markdown file, and the script that creates them is heavily commented so that you can see how the new variables are created. (The processing script also specifies and loads the cluster graphing function graphCluster_i() which can be helpful in visually inspecting the results of the clustering procedure.)

Sourcing the data processing script results in a series of warnings (which have been suppressed here) regarding the production of NaNs. These NaNs are expected and detailed in the in-code comments throughout the script. Sourcing the data processing script also produces syntactically invalid variable names, which we correct with the make.names() and names() commands.

source("Cluster Parameters_var creation_170222.R")

# create syntactically valid, unique column names and apply them to the dataframe
valid_column_names <- make.names(names=names(data), unique=TRUE, allow_ = TRUE)
names(data) <- valid_column_names

To help future researchers easily assign clusters on their preferred variables and compare those clusters and input variables to ours, we introduce two additional dataframes. The first, “dataClusterOrig”, contains only those variables used to determine our original clustering. The second, “clusterVars”, is then assigned the same values as “dataClusterOrig”.

We explain in our paper our reasons for including the variables in “dataClusterOrig”. Researchers wanting to cluster experimental subjects on an alternative set of variables can create a parallel object (“dataClusterAlt”), which they can then assign to “clusterVars”. All subsequent steps in the clustering procedure use “clusterVars”, so changing that object to your preferred clustering dataframe (i.e., “dataClusterAlt”) will produce new clusters, while simultaneously allowing you to retain the original clustering dataframe (i.e., “dataClusterOrig”) in case you want to make quick comparisons.

dataClusterOrig <- select(data, iTotCon, scenb, scen2b, iCon00, iCon12, iConSd)
# dataClusterAlt <- select(data, var1, var2, ...)

clusterVars <- dataClusterOrig # replace "dataClusterOrig" w/ "dataClusterAlt"

To reiterate: every step that follows operates on “clusterVars”, so reassigning that single object to an alternative dataframe is the only change needed to re-run the clustering on a different set of variables.

Scale data

Although it is not necessary to scale data to perform cluster analyses, it makes sense in this case. The differences between observations for some variables (e.g., iTotCon) are significantly larger than the differences between observations for other variables (e.g., scen2b).

max(data$iTotCon) - min(data$iTotCon)           # variable with large range
## [1] 156
max(data$scen2b) - min(data$scen2b)             # variable with small range
## [1] 0.3516484

Left as is, the variables with the larger differences would be the principal drivers of distance when clustering. There is no theoretical reason to prefer one variable over another in the clustering process, so we will scale the data so that each variable has a mean of 0 and a standard deviation of 1.

To do so, we apply the mean() function to each column of the dataframe. The result is a vector of means, one for each variable. We also apply the sd() function to each column of the dataframe. The result is a vector of standard deviations, one for each variable. Finally, we use the vector of means and the vector of standard deviations to create a scaled dataframe of the cluster variables. The scale() function will subtract the “center” argument (here, the mean of each variable) from every entry in the appropriate column and then divide the result by the “scale” argument (here, the SD of each variable).

varMeans = apply(clusterVars, 2, mean)          # the "2" specifies columns
varSDs = apply(clusterVars, 2, sd)              # the "2" specifies columns

clusterVarsScale = scale(clusterVars, center=varMeans, scale = varSDs)

Instead of the raw value of the measure, each cell in the resulting matrix (called “clusterVarsScale”) contains a z-score.
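As a quick sanity check (not part of the original script), you can confirm that each column of “clusterVarsScale” now has a mean of (approximately) 0 and a standard deviation of 1:

round(apply(clusterVarsScale, 2, mean), 10)     # column means should be ~0
apply(clusterVarsScale, 2, sd)                  # column SDs should be 1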

Calculate the optimal number of clusters and partition

For this section of the analysis we load the “NbClust” and “factoextra” packages using library(). The “NbClust” package “provides 30 indices for determining the relevant number of clusters and proposes to users the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods” (Charrad et al., 2014). The “factoextra” package will allow us to visualize some of the clustering output.
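For example (assuming both packages are already installed):

library(NbClust)                                # NbClust() for the cluster indices
library(factoextra)                             # fviz_nbclust() for visualizing them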

We run NbClust() to determine the most popular number of clusters and the most popular cluster partition across a large set of indices. The minimum number of clusters is set at three because we have good reason to believe—and the literature often conceives of—at least that many preference types (e.g., “altruists”, “free-riders”, and “conditional cooperators”), and the maximum is set at twenty clusters because we want to allow for potentially many (but not infinite) in-between or heretofore unspecified preference types. We run fviz_nbclust() to further visualize the popularity of different numbers of clusters across the range of indices. We hide the output for both commands here, save for a single chart showing the key takeaway: a plurality of indices identify six as the optimal number of clusters.

nbRes <- NbClust(clusterVarsScale, distance = "euclidean",
                 min.nc = 3, max.nc = 20, method = "complete", index ="all")
fviz_nbclust(nbRes) + theme_minimal()

Finally, we view the actual partitioning into clusters that was supported by a plurality of indices. Although some clustering methods (such as k-means) are non-deterministic, NbClust() consistently produces the same partition here, for reasons documented in Charrad et al. (2014). The resulting vector shows the cluster to which each experimental subject is assigned.

nbRes$Best.partition
##   [1] 1 1 1 2 3 4 2 2 5 5 1 1 3 1 3 1 5 1 5 1 3 1 1 1 4 5 3 2 5 5 3 3 1 4 5
##  [36] 1 1 1 2 3 2 2 3 4 5 3 2 1 1 1 3 1 1 3 5 2 2 3 3 3 1 3 5 1 2 3 1 3 3 3
##  [71] 2 3 4 4 3 1 4 1 5 1 2 3 5 4 3 2 1 6 3 5 1 5 1 2 1 1 5 1 1 2 3 1 1 5 1
## [106] 5 2 5 4 1 3 5 1 6 3 1 1 2 1 5 3 5 3 2 3 3 4 2 3 2 5 1 5 1 1 1 5 2 1 3
## [141] 1 5 3 1 2 1 6 5 5 6 2 2 5 1 2 5 1 2 5 3 5 1 1 2 1 5 3 4 2 5 1 2 5 2 2
## [176] 5 1 3 1 4
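As a quick follow-up (not part of the original script), tabulating the partition vector shows how many subjects fall into each of the six clusters:

table(nbRes$Best.partition)                     # count of subjects per cluster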

Append clusters to dataset and save

We create a new dataframe to hold the results of our cluster analysis, leaving our previous dataframes intact in case we need to reevaluate anything. To that results dataframe we append the cluster assignments we just calculated.

clusterRes <- data
clusterRes[, "clust_orig"] <- nbRes$Best.partition

Finally, we write our dataset to the drive as .DTA files that can be merged with the raw, original datasets in Stata. We save two such files. The first is a tidy dataset with one row per subject, and the second is a long dataset, with one row per subject per experimental round (of which there were 15). Note that this last chunk of code is not evaluated here.

cluster2merge <- select(clusterRes, 
                        subjectid, session_id, clust_orig)
write.dta(cluster2merge, "cluster2merge.dta")

clust2mergeFull <- cluster2merge[rep(seq_len(nrow(cluster2merge)), each=15), ]  # one row per subject per round
write.dta(clust2mergeFull, "cluster2mergeFull.dta")

Appendix

Terminology key

The following terms are used in variable names:

  • ‘x’ refers to average group contributions, to which subjects respond with y

  • ‘y’ refers to subject contributions, made in response to a specified x

  • “Con” refers to “contribution”

  • “Leg” refers to a one unit movement along the average group contributions schedule (a one unit change in x)

  • “Seg” refers to a “segment”, which is an uninterrupted series of legs with slopes of the same sign (-, 0, +)

  • “Mag” refers to “magnitude” (i.e., a value of y)

  • “Len” refers to “length” (i.e., a value of x)

  • “Dif” in a varname indicates that the var takes a difference between two values

  • “Num” in the varname indicates that the var is a count

  • “Tot” in the varname indicates that the var is a sum or total

  • “Avg” in the varname indicates that the var is an average

  • “Pos”, “Neg”, and “Zero” in the varname indicate a positive, negative, or flat slope, respectively

  • “i” and “ii” in the varname indicate that the data came from Stage 1 or Stage 3, respectively

Variable key

Variable Description
iCon00 … 12 Contribution given specified mean group contribution
iMinCon Minimum contribution
iMaxCon Maximum contribution
iConVar Variance of the conditional contributions
iConSd Standard deviation of the conditional contributions
iMagLeg01 … 12 Magnitude of the specified leg
iTotCon Sum of contributions above/below mean group contributions
iDifTotAvgCon Total contrib. above/below total mean group contributions
iNumPosLeg Number of legs w/ (+) slopes in contribution schedule
iNumNegLeg Number of legs w/ (-) slopes in contribution schedule
iNumZerLeg Number of legs w/ flat slopes in contribution schedule
iTotMagPosLeg Tot. magnitude of legs w/ (+) slopes in contrib. schedule
iTotMagNegLeg Tot. magnitude of legs w/ (-) slopes in contrib. schedule
iAvgMagPosLeg Mean magnitude of legs w/ (+) slopes in contrib. schedule*
iAvgMagNegLeg Mean magnitude of legs w/ (-) slopes in contrib. schedule*
iNumPosSeg Number of segments w/ (+) slopes in contribution schedule
iNumNegSeg Number of segments w/ (-) slopes in contribution schedule
iNumZerSeg Number of segments w/ flat slopes in contribution schedule
iAvgMagPosSeg Mean magnitude of (+) segments in contribution schedule
iAvgMagNegSeg Mean magnitude of (-) segments in contribution schedule
iAvgLenPosSeg Mean length of (+) segments
iAvgLenNegSeg Mean length of (-) segments
iAvgLenZerSeg Mean length of flat segments
NOTES: *NaNs converted to 0s

Session Information

The following information is the output of sessionInfo() run on the computer that ran the above detailed analysis:

R version 3.3.2 (2016-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows >= 8 x64 (build 9200)

locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] factoextra_1.0.4 ggplot2_2.2.0 NbClust_3.0 dplyr_0.5.0
[5] foreign_0.8-67