Purpose

At our last meeting, we decided to investigate whether the “Storming, Norming and Performing” anecdote holds for the projects we have analyzed so far. Specifically, Sven proposed the following 3 time series for the Triangle Motif Counts:

Positive / Negative
Positive / (Positive + Negative)
Negative / (Positive + Negative)

Some Terminology

Motif Counts are at “Project Granularity” for a given time window. For instance, if a project has a positive triangle motif count of 50 in a 3-month window, there were 50 “cases” (i.e. positive triangles) in which a pair of developers edited the same file within those 3 months and also communicated (hence “positive”). Both the pair of developers and the file can differ from case to case. Analogously, a negative motif count of 60 for the same project and 3-month window means there were 60 cases in which a pair of developers edited the same file during those months but did not communicate (hence “negative”).
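As a small worked example of the three ratios listed in the Purpose section, the sketch below plugs in the illustrative counts from the paragraph above (50 positive and 60 negative triangles in one window). The variable names are chosen here for illustration only and are not part of the pipeline.

```r
# Illustrative counts taken from the example above (not real project data)
positive <- 50
negative <- 60

pos.by.neg <- positive / negative               # Positive / Negative
pos.by.sum <- positive / (positive + negative)  # Positive / (Positive + Negative)
neg.by.sum <- negative / (positive + negative)  # Negative / (Positive + Negative)

round(c(pos.by.neg, pos.by.sum, neg.by.sum), 3)
# 0.833 0.455 0.545
```

The last two ratios always sum to 1, so they mirror each other, while Positive / Negative has no upper bound and exceeds 1 whenever positive triangles outnumber negative ones.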

Threat Reminder

Important Threat to Validity: Our current pipeline makes a strong assumption: it does not measure the strength of communication in a positive triangle. For example, whether developers A and B post one comment or hundreds of comments in a shared JIRA issue for the same file, the two cases are treated identically: their communication link is recorded for the whole time window only as YES or NO.
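To make the threat concrete, here is a minimal sketch of how a YES/NO link discards communication strength. The table, its column names, and the comment counts are hypothetical and are not taken from the pipeline.

```r
library(data.table)

# Hypothetical developer pairs with their comment counts in one time window
pairs <- data.table(
  dev.a      = c("A", "A", "B"),
  dev.b      = c("B", "C", "C"),
  n.comments = c(1, 250, 0)
)

# The link is kept only as present/absent, so 1 and 250 comments look identical
pairs[, communicated := n.comments > 0]
pairs
```

Under this encoding, the pairs A-B and A-C both count as communicating, even though their comment volumes differ by two orders of magnitude.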

Current Available Data Limitation and Alternative Options

Unfortunately, the project motif count is not readily available in any of the .csv files, and obtaining it will require more time to make sense of Mitchel’s script. The File Participation Motif Count, however, is readily available.

A File Participation Motif Count was defined earlier in the project in order to correlate motif counts with file metrics. It represents how many of the motifs counted at the project level a given file participates in. For instance, taking the prior example of 50 positive triangle motifs in the 3-month window, if file banana.java has “file participation motif count = 30”, then banana.java appears in 30 of the 50 project positive triangles. Since a triangle contains exactly one file by definition (the other two points being the 2 developers), the remaining 20 triangles are counted as file participation of the other files in the project. The same idea applies to the “negative triangle file participation motif count”.
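The banana.java example can be sketched as follows. The triangle list and the other two file names are made up for illustration; the only assumption carried over from the text is that each triangle contains exactly one file.

```r
library(data.table)

# Hypothetical list of the 50 positive triangles in one window: one row per
# triangle, recording only the single file that triangle contains
triangles <- data.table(
  file = c(rep("banana.java", 30), rep("apple.java", 12), rep("cherry.java", 8))
)

# File participation motif count = number of triangles each file appears in
participation <- triangles[, .(motif.count = .N), by = file]
participation

# With exactly one file per triangle, the file counts add up to the
# project-level count
sum(participation$motif.count) == nrow(triangles)
```

If every file of a window is present in the data and each triangle is counted once, summing the file participation counts would therefore recover the project-level count. Whether both conditions hold for the available .csv files is not verified here.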

Below I present one tentative analysis, using the Cassandra data that Wolfgang had already calculated, which we could also perform on other projects.

Projects Time Series

#Load required analysis libraries
library(knitr)
library(ggplot2)
library(data.table)
library(ggthemes)
science_theme = theme(
    panel.background = element_blank(),
    panel.grid.major = element_line(size = 0.5, color = "grey"),
    axis.line = element_line(size = 0.7, color = "black"),
    legend.position = c(0.9, 0.2),
    text = element_text(size = 14))
#Set motif_time_series.zip path, (folders generated by codeface conway).

Cassandra

#Load each time window 
cassandra.004 <- fread("~/Desktop/motif_time_series/cassandra/004--1a73af-87cb1c/quality_analysis/triangle/jira/quality_data.csv")
cassandra.003 <- fread("~/Desktop/motif_time_series/cassandra/003--658433-1a73af/quality_analysis/triangle/jira/quality_data.csv")
cassandra.002 <- fread("~/Desktop/motif_time_series/cassandra/002--c5b024-658433/quality_analysis/triangle/jira/quality_data.csv")
cassandra.001 <- fread("~/Desktop/motif_time_series/cassandra/001--781018-c5b024/quality_analysis/triangle/jira/quality_data.csv")

Current File Metrics

These are the current metrics we have for a given file. Note that motif.count and motif.anti.count here are the positive and negative file participation counts explained previously.

kable(head(cassandra.004))

V1	entity	BugIssueCount	Churn	CountLineCode	motif.count	motif.anti.count	dev.count	motif.percent.diff	motif.ratio	bug.density	motif.count.norm	motif.anti.count.norm
1	bin.cqlsh.py	3	40	12	2	4	3	0.6666667	0.6666667	0.2307692	0.6666667	1.3333333
2	conf.cassandra-env.sh	2	126	6	0	2	3	2.0000000	1.0000000	0.2857143	0.0000000	0.6666667
3	doc.convert_yaml_to_rst.py	0	150	2	2	0	3	2.0000000	0.0000000	0.0000000	0.6666667	0.0000000
4	pylib.cqlshlib.copyutil.py	8	499	13	2	4	10	0.6666667	0.6666667	0.5714286	0.2000000	0.4000000
5	pylib.cqlshlib.cql3handling.py	2	128	1	4	16	5	1.2000000	0.8000000	1.0000000	0.8000000	3.2000000
6	pylib.cqlshlib.test.test_cqlsh_completion.py	0	24	3	2	0	2	2.0000000	0.0000000	0.0000000	1.0000000	0.0000000
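The derived columns are not defined in this report. The sketch below uses formulas inferred from the six rows shown above; they reproduce the displayed values, but they remain an assumption about how the pipeline computes these columns and are not its confirmed definitions.

```r
library(data.table)

# Raw columns of the six rows shown above (copied from the table)
files <- data.table(
  BugIssueCount    = c(3, 2, 0, 8, 2, 0),
  CountLineCode    = c(12, 6, 2, 13, 1, 3),
  motif.count      = c(2, 0, 2, 2, 4, 2),
  motif.anti.count = c(4, 2, 0, 4, 16, 0),
  dev.count        = c(3, 3, 3, 10, 5, 2)
)

# Assumed formulas, inferred from the displayed values only
files[, `:=`(
  motif.ratio           = motif.anti.count / (motif.count + motif.anti.count),
  motif.percent.diff    = abs(motif.count - motif.anti.count) /
                          ((motif.count + motif.anti.count) / 2),
  bug.density           = BugIssueCount / (CountLineCode + 1),
  motif.count.norm      = motif.count / dev.count,
  motif.anti.count.norm = motif.anti.count / dev.count
)]
files
```

Under these inferred formulas, motif.ratio is the per-file counterpart of Negative / (Positive + Negative), and the two .norm columns divide each count by dev.count.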

Triangle Positive File Participation Histogram for Window 004

This shows the triangle positive file participation distribution for time window 004, the last of the four windows (its commit range starts at the commit where window 003 ends).

ggplot(cassandra.004, aes(motif.count)) + geom_histogram(binwidth = 5) + science_theme + xlab("File Participation Motif Count") + ylab ("Frequency") + ggtitle ("Positive Triangle File Participation Histogram")

Triangle Negative File Participation Histogram for Window 004

This shows the triangle negative file participation distribution for time window 004, the last of the four windows (its commit range starts at the commit where window 003 ends).

ggplot(cassandra.004, aes(motif.anti.count)) + geom_histogram(binwidth = 5) + science_theme + xlab("File Participation Motif Count") + ylab ("Frequency") + ggtitle ("Negative Triangle File Participation Histogram")

Time Series

To generate a time series, I used the remaining 3 time windows (003, 002, 001) that Wolfgang had already made available and summarized the distribution of each window by its mean and standard deviation, as shown below for the 3 time series described in the Purpose section:

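#Time Series for 4 Windows of Positive/Negative (ratio of the per-file means)
#Note: the sd column is sd(motif.count), the spread of the raw positive counts, not of the ratio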
cassandra.positive.by.negative.motif.ts <- rbind(
  cassandra.001[,.(mean(motif.count)/mean(motif.anti.count),sd(motif.count))],
  cassandra.002[,.(mean(motif.count)/mean(motif.anti.count),sd(motif.count))],
  cassandra.003[,.(mean(motif.count)/mean(motif.anti.count),sd(motif.count))],
  cassandra.004[,.(mean(motif.count)/mean(motif.anti.count),sd(motif.count))]
)
colnames(cassandra.positive.by.negative.motif.ts) <- c("mean","sd")
cassandra.positive.by.negative.motif.ts$window <- rownames(cassandra.positive.by.negative.motif.ts)
cassandra.positive.by.negative.motif.ts$group <- "Pos/Neg"

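#Time Series for 4 Windows of Positive/(Positive+Negative)
#Note: the sd column is sd(motif.anti.count), the spread of the raw negative counts, not of the ratio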
cassandra.positive.by.sum.motif.ts <- rbind(
  cassandra.001[,.(mean(motif.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))],
  cassandra.002[,.(mean(motif.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))],
  cassandra.003[,.(mean(motif.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))],
  cassandra.004[,.(mean(motif.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))]
)
colnames(cassandra.positive.by.sum.motif.ts) <- c("mean","sd")
cassandra.positive.by.sum.motif.ts$window <- rownames(cassandra.positive.by.sum.motif.ts)
cassandra.positive.by.sum.motif.ts$group <- "Pos/(Pos+Neg)"

#Time Series for 4 Windows of Negative/(Positive+Negative)
cassandra.negative.by.sum.motif.ts <- rbind(
  cassandra.001[,.(mean(motif.anti.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))],
  cassandra.002[,.(mean(motif.anti.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))],
  cassandra.003[,.(mean(motif.anti.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))],
  cassandra.004[,.(mean(motif.anti.count)/(mean(motif.count)+mean(motif.anti.count)),sd(motif.anti.count))]
)
colnames(cassandra.negative.by.sum.motif.ts) <- c("mean","sd")
cassandra.negative.by.sum.motif.ts$window <- rownames(cassandra.negative.by.sum.motif.ts)
cassandra.negative.by.sum.motif.ts$group <- "Neg/(Pos+Neg)"
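The three blocks above repeat the same pattern for each window and ratio. As a possible simplification, the sketch below computes all three ratios in one pass over a named list of windows. The helper name motif.ratio.series and the toy tables are hypothetical; the sd column is left out because the blocks above take it from raw counts instead of from the ratios.

```r
library(data.table)

# Hypothetical helper: one row per (window, ratio) from a named list of tables
motif.ratio.series <- function(windows) {
  rbindlist(lapply(names(windows), function(w) {
    pos <- mean(windows[[w]]$motif.count)
    neg <- mean(windows[[w]]$motif.anti.count)
    data.table(
      window = w,
      group  = c("Pos/Neg", "Pos/(Pos+Neg)", "Neg/(Pos+Neg)"),
      mean   = c(pos / neg, pos / (pos + neg), neg / (pos + neg))
    )
  }))
}

# Toy stand-ins for two windows (made-up numbers, not Cassandra data)
toy <- list(
  "001" = data.table(motif.count = c(2, 4), motif.anti.count = c(4, 8)),
  "002" = data.table(motif.count = c(6, 6), motif.anti.count = c(2, 2))
)
toy.ts <- motif.ratio.series(toy)
toy.ts
```

With the real data, the call would be motif.ratio.series(list("001" = cassandra.001, "002" = cassandra.002, "003" = cassandra.003, "004" = cassandra.004)). Since both means in a window are taken over the same rows, each ratio of means equals the corresponding ratio of column sums, provided the columns contain no missing values.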

Purpose Time Series:

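library(directlabels)
#Stack the three series into one long table for plotting
cassandra.motif.ts <- rbind(cassandra.positive.by.negative.motif.ts,cassandra.positive.by.sum.motif.ts,cassandra.negative.by.sum.motif.ts)
#Note: ylim(0,1) drops any Pos/Neg point above 1 (ggplot2 warns about the removed rows)
p <- ggplot(cassandra.motif.ts, aes(x=window, y=mean,group=group,colour=group))+geom_line()+geom_point() + theme_bw() + ylim(0,1) + xlab("Window") + ylab("Ratio of File Participation Motif Count Means") + ggtitle("Triangle") + theme(legend.position=c(.5,0.9))
#Label each line directly with its group name
direct.label(p,"first.qp")

#Error bars left disabled; note that sd is computed on raw counts while mean holds ratios
#+geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd,group=group), width=.1)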

Motif Time Series Analysis

Carlos V. A. Silva

1/10/2017