Authors: Duzhin Fedor, Tan Joo Seng, Tan Siew Eng

Preparation and reading data

We have collected raw WhatsApp chats of student teams in four courses, two in business and two in mathematics. For each course, we have a directory with all chats exported as txt files and one xlsx file with anonymised student information. Below are some statistics.

Data processing

Here, we merge all the chats into one large table with one row per message. Below is the number of messages in each course.

## 
##    A    B    C    D 
## 7132 4583 3703 1226
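
The merge step described above can be sketched in Python as follows. This is only an illustration, not the code used for the analysis: the line format in `LINE_RE`, the function names and the demo chats are all assumptions, and real WhatsApp exports differ by phone locale.

```python
import re
from collections import Counter

# Assumed line format of an exported chat: "31/01/2019, 14:05 - Sender: text".
# Real exports vary by locale, so this pattern is an assumption.
LINE_RE = re.compile(r"^(\d{1,2}/\d{1,2}/\d{2,4}), (\d{1,2}:\d{2}) - ([^:]+): (.*)$")

def parse_chat(text, course, team):
    """Turn one exported chat into a list of message rows (dicts)."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            date, time, sender, body = m.groups()
            rows.append({"course": course, "team": team, "date": date,
                         "time": time, "sender": sender, "text": body})
        elif rows:
            # A line without a timestamp continues the previous message.
            rows[-1]["text"] += "\n" + line
    return rows

def merge_chats(chats):
    """chats: iterable of (course, team, raw_text); returns one flat table."""
    table = []
    for course, team, text in chats:
        table.extend(parse_chat(text, course, team))
    return table

# Hypothetical demo data, not taken from the study.
demo = [
    ("A", "A01", "01/02/2019, 09:00 - s1: hi\n01/02/2019, 09:01 - s2: hello\nsecond line"),
    ("B", "B01", "03/02/2019, 10:00 - s3: meeting at 5?"),
]
table = merge_chats(demo)
per_course = Counter(row["course"] for row in table)
print(dict(per_course))
```

Keeping the course and team labels on every row makes the later per-course, per-team and per-student summaries simple group-by operations on the same table.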

Data exploration


Main stats

By course

Below are some descriptive statistics for each course.

By team

Below are some descriptive statistics for each team.

By student

Below is a sample of descriptive statistics by student.

Visualizations

Number of messages by team

This is a bar chart of the number of messages per team coloured according to the course. We see that students in “MH2401” and “MH3110” wrote more messages.

Number of words per message

For each course, we looked at the mean number of words per message for each student in that course. Below is the box-and-whisker plot. It shows the quartiles of the distribution of the mean number of words per message, together with outliers. The main observation is that students in “FOM” wrote much longer messages than students in the other three courses in our study.

In fact, the median number of words in a message in FOM is larger than the largest number of words in any message in any other course. This probably reflects a unique style of the course FOM. Below is a random long message written by an FOM student.

## [1] " Summary: Gary Hamel believes in 2 things: 1. That there is a need for a new management system completely different from the one that is being utilized by everyone now a. The problem with current business models is that business requires the skills of a large group of people and managemnt helps to coordinate these people to developing goods and services, but as they align employee's interest with that of the company it suppresses the entrepreneurship of the people (current management practices) b. long as companies break away from the current business models and venture into new one they will be able to achieve more. Since china and India have now become costs leaders, this displaced the deficiency goal and now companies should focus on innovation c. They should be drivers of change rather than adapters of change d. Although he provides little guidance on these new business models, he does have a clear vision on the goals of these new business models and that is to: i. Ensure that all employees are constantly innovating ii. Engaging work environment to encourage employees to give their all iii. Increasing the pace of strategic renewal 2. management innovation will provide companies with the competitive edge a. sustainable source of competitive ad b. 
best guide to future of management is the internet as it provides real time connectivity; coordinate human efforts similar to the conventional means but it is less restrictive Problems with achieving his propositions: cent There has been evidence that most companies who diverge from the traditional way of management always tend to go back to the conventional ways when they face problems o Companies that do continue to make use of different business models are often able to do so because of other factors ( google is because of first mover advantage in web search and partnerships with many other companies) cent Complexity organizational integration cent Business innovation as a competitive ed is not easy- most of the time management innovations are because of environmental adaptations; or it may be because of entrepreneurs beliefs ( apple)( often pursue their visions uncompromisingly or autocratically) Questions: 1.why is it that companies that diverge from traditional buisness strategies always tend to go back to the traditional means. What exactly is the common problems that these companies face 2. What is it in traditional business models that are so essential in ensuring that business stays in the competition that businesses are unable to let go and start a new 3.how can companies integrate the essential parts of the traditional business models and new innovative ones to make their company successful 4. Hamel suggests that the internet is the best guide to future of management. How can companies make use of the internet in developing new business models"
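
The per-student means and quartiles discussed in this section can be computed roughly as in the Python sketch below. It is an illustration under assumptions: words are split on whitespace, the function names and demo messages are hypothetical, and the quartile convention (the default "exclusive" method of `statistics.quantiles`) may differ from the one behind the plot.

```python
from collections import defaultdict
from statistics import mean, quantiles

def mean_words_per_student(messages):
    """messages: iterable of (student_id, text) -> {student_id: mean word count}."""
    counts = defaultdict(list)
    for student, text in messages:
        # Whitespace tokenisation; an assumption about what counts as a word.
        counts[student].append(len(text.split()))
    return {s: mean(c) for s, c in counts.items()}

def quartiles(values):
    """Return (Q1, median, Q3) of a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)
    return q1, q2, q3

# Hypothetical demo messages, not taken from the study.
demo = [("s1", "ok"), ("s1", "see you at five"), ("s2", "done"),
        ("s3", "I will send the draft tonight"), ("s4", "yes sure")]
per_student = mean_words_per_student(demo)
print(per_student)
print(quartiles(list(per_student.values())))
```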

Message length per team

Below is the distribution of word count per message by team. Even though we see that students in FOM write longer messages than students in the other three courses, there is still high variance across different teams within FOM.

Number of messages by student

Below is the box-and-whisker plot of the number of messages per student. Students in MH3110 wrote more messages than students in the other three courses, but we don’t see anything as extreme as the difference in the word count per message. The simplest explanation for why students in MH2401 and MH3110 (math courses) wrote more messages than students in FOM and TP6102 (business courses) is that the math projects ran for longer.

Regressions

Here we construct various linear regression models to predict the project score of a team based on observable characteristics of WhatsApp chats.

Project score vs the mean number of messages

We see that there is an overall positive relationship between the mean number of messages per student and the project score. In the per-course regressions reported below, the slope is positive in courses B, C and D, but it is significant at the 5% level only in course D (p = 0.012). The simplest explanation is that the mean number of messages per student is indicative of the average effort students invest in the project, and hence the relationship is positive. However, there is a lot that WhatsApp chats cannot capture: face-to-face meetings, students’ individual strengths, communication via alternative channels, etc.

We will do a linear regression with the number of messages and the course as regressors.

## 
## Call:
## lm(formula = `project score` ~ `N messages per student` + course, 
##     data = teamStats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.1700  -3.7239  -0.9865   3.9398  12.4058 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              76.50805    2.23214  34.276   <2e-16 ***
## `N messages per student`  0.01799    0.01599   1.125    0.266    
## courseB                  -3.02895    2.53290  -1.196    0.238    
## courseC                  -2.86058    2.40097  -1.191    0.239    
## courseD                  -1.01034    2.87578  -0.351    0.727    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.492 on 49 degrees of freedom
## Multiple R-squared:  0.07894,    Adjusted R-squared:  0.003748 
## F-statistic:  1.05 on 4 and 49 DF,  p-value: 0.3913

There is a positive but insignificant relation between the number of messages per student and the project score (p = 0.266). Fitting a separate regression for each course gives a positive slope in three of the four courses (B, C and D), but only course D is significant at the 5% level (p = 0.012):

## Course A 
## 
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6195  -3.7181  -0.7734   5.9245  11.6873 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              79.071190   3.147331  25.123  2.1e-12 ***
## `N messages per student` -0.009809   0.026715  -0.367    0.719    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.586 on 13 degrees of freedom
## Multiple R-squared:  0.01026,    Adjusted R-squared:  -0.06587 
## F-statistic: 0.1348 on 1 and 13 DF,  p-value: 0.7194
## 
## 
## Course B 
## 
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.002  -6.376  -1.190   5.639  12.257 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              71.76358    3.67854  19.509 2.74e-09 ***
## `N messages per student`  0.04149    0.03785   1.096    0.299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.415 on 10 degrees of freedom
## Multiple R-squared:  0.1073, Adjusted R-squared:  0.018 
## F-statistic: 1.202 on 1 and 10 DF,  p-value: 0.2987
## 
## 
## Course C 
## 
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5317 -2.0706 -1.2109  0.5086 10.3767 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              72.73523    1.38802  52.402   <2e-16 ***
## `N messages per student`  0.03659    0.02098   1.744    0.102    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.842 on 15 degrees of freedom
## Multiple R-squared:  0.1686, Adjusted R-squared:  0.1131 
## F-statistic: 3.041 on 1 and 15 DF,  p-value: 0.1016
## 
## 
## Course D 
## 
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9120 -2.3107  0.1968  3.0944  5.0722 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               67.8459     2.7828  24.380 8.55e-09 ***
## `N messages per student`   0.3601     0.1106   3.257   0.0116 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.035 on 8 degrees of freedom
## Multiple R-squared:   0.57,  Adjusted R-squared:  0.5163 
## F-statistic: 10.61 on 1 and 8 DF,  p-value: 0.01158

We need to combine regression results into a table. This is easy, but we should check existing papers in JoLA for the format.
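
As a rough illustration of how the per-course fits could be collected into one table, here is a Python sketch using only the standard library. It is not the code behind the output above: `simple_ols`, the course labels and the numbers in `demo` are hypothetical, and p-values are left out because the standard library has no t distribution.

```python
from math import sqrt

def simple_ols(x, y):
    """Least-squares fit y = a + b*x; returns intercept, slope, slope SE, t value, R^2."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - my) ** 2 for yi in y)
    se = sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return {"n": n, "intercept": intercept, "slope": slope,
            "se": se, "t": slope / se, "r2": 1 - rss / tss}

# Hypothetical team-level data: course -> (messages per student, project score).
demo = {
    "X": ([10, 20, 30, 40], [70, 74, 75, 81]),
    "Y": ([5, 15, 25], [80, 78, 79]),
}

# One row per course, ready to be formatted as a results table.
results = {course: simple_ols(x, y) for course, (x, y) in demo.items()}
print(f"{'course':<8}{'n':>3}{'slope':>10}{'SE':>10}{'t':>8}{'R2':>8}")
for course, r in results.items():
    print(f"{course:<8}{r['n']:>3}{r['slope']:>10.4f}{r['se']:>10.4f}"
          f"{r['t']:>8.2f}{r['r2']:>8.3f}")
```

Running the same function once per course and keeping the rows together avoids copying numbers by hand from four separate summaries.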

Project score vs the mean length of a message

The relationship between project score and the mean word count in a message is negative for FOM, where students wrote much longer messages than students of the other three courses (in the per-course output below, the slopes for courses A and B are positive and those for C and D are negative, with C very close to zero). While these trends are not statistically significant at the 5% level, there is something here. Further investigation showed that FOM students tend to paste long citations into their WhatsApp chats, and teams who wrote more original messages were, at the same time, teams who got higher project scores but wrote shorter messages on average.

We will do a linear regression with the mean number of words per message and the course as regressors.

## 
## Call:
## lm(formula = `project score` ~ `mean words per message` + course, 
##     data = teamStats)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.0796  -3.4788  -0.6403   4.1851  12.9914 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               78.9430     1.7776  44.409   <2e-16 ***
## `mean words per message`  -0.0853     0.0669  -1.275    0.208    
## courseB                   -3.3639     2.5052  -1.343    0.186    
## courseC                   -3.5661     2.2921  -1.556    0.126    
## courseD                    1.6773     4.0670   0.412    0.682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.468 on 49 degrees of freedom
## Multiple R-squared:  0.08549,    Adjusted R-squared:  0.01083 
## F-statistic: 1.145 on 4 and 49 DF,  p-value: 0.3465

There is a negative but insignificant trend (p = 0.208). It could be explained by the fact that “FOM” is a special course where students wrote very long messages because they simply pasted long citations.

Below are different regressions for different courses:

## Course A 
## 
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.258  -2.198   1.387   4.178   7.368 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               68.4307     6.4270  10.647 8.65e-08 ***
## `mean words per message`   1.0697     0.6778   1.578    0.139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.985 on 13 degrees of freedom
## Multiple R-squared:  0.1608, Adjusted R-squared:  0.09625 
## F-statistic: 2.491 on 1 and 13 DF,  p-value: 0.1385
## 
## 
## Course B 
## 
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.105  -5.623  -1.421   6.089  12.259 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               68.8077     6.9903   9.843 1.84e-06 ***
## `mean words per message`   0.6482     0.7084   0.915    0.382    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.555 on 10 degrees of freedom
## Multiple R-squared:  0.07726,    Adjusted R-squared:  -0.01501 
## F-statistic: 0.8373 on 1 and 10 DF,  p-value: 0.3817
## 
## 
## Course C 
## 
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6170 -2.8328 -0.5894  0.3939 10.6554 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              74.98154    2.86702  26.153 6.28e-14 ***
## `mean words per message` -0.04551    0.26967  -0.169    0.868    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.209 on 15 degrees of freedom
## Multiple R-squared:  0.001895,   Adjusted R-squared:  -0.06465 
## F-statistic: 0.02848 on 1 and 15 DF,  p-value: 0.8682
## 
## 
## Course D 
## 
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8054 -2.8360  0.6479  4.0567  4.7849 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              82.11636    3.26412  25.157 6.67e-09 ***
## `mean words per message` -0.11234    0.05195  -2.162   0.0626 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.888 on 8 degrees of freedom
## Multiple R-squared:  0.3689, Adjusted R-squared:   0.29 
## F-statistic: 4.676 on 1 and 8 DF,  p-value: 0.06256

None of the per-course trends is significant at the 5% level; course D comes closest (p = 0.063).

Timelines

To be honest, not much can be extracted from timelines. Still, we can think about it.

Total messages in all chats

Here are timelines of all the chats on one plot, with colours representing different teams.

[Figures: four timeline plots]

Network analysis

For each team and each ordered pair of students A and B in that team, we count how many times B’s message directly followed A’s message. This is a proxy for the number of times B replied to A (whether B actually replied to A is hard to figure out without manually going through all the logs). Below is a table of the number of replies for one team.

##        replied
## wrote   18101 18102 18103 18104 18105
##   18101     0     1     1     3     2
##   18102     0     0     5     7     4
##   18103     2     4     0    12     6
##   18104     4     9     8     0    12
##   18105     1     2     9    12     0
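
The counting rule behind this table can be sketched in Python as below. This is a minimal illustration, not the authors' code: the function name and the short demo sequence are hypothetical, and skipping consecutive messages by the same student is an assumption suggested by the zero diagonal of the table.

```python
from collections import Counter

def reply_counts(senders):
    """Count how often one student's message directly follows another's.

    senders: sender IDs of one team chat in chronological order.
    Returns a Counter keyed by (wrote, replied).
    """
    counts = Counter()
    for wrote, replied in zip(senders, senders[1:]):
        # Assumption: a student following their own message is not a reply.
        if wrote != replied:
            counts[(wrote, replied)] += 1
    return counts

# Hypothetical message order, not taken from the study.
chat = ["18101", "18104", "18104", "18105", "18104", "18101"]
counts = reply_counts(chat)
for (wrote, replied), n in sorted(counts.items()):
    print(wrote, "->", replied, n)
```

Each key `(wrote, replied)` corresponds to one cell of the table, with `wrote` as the row and `replied` as the column.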

A

Here, we visualize each team as a directed graph in which the thickness of the arrow from A to B is proportional to the number of replies of B to A.

[Figures: reply graphs for the 15 teams in course A]

B

[Figures: reply graphs for the 12 teams in course B]

C

[Figures: reply graphs for the 17 teams in course C]

D

[Figures: reply graphs for the 10 teams in course D]

Centralization

Given a graph \(G\), we calculate the degree centrality of each vertex and then the degree centralization of the whole graph. The reason to choose degree centralization over betweenness, closeness, or eigenvector centrality is that it is the only weighted centrality measure that we know of for which the balanced tree is the graph of maximum centralization with the given number of edges (Fedor will explain it nicely for the paper with all the needed references and equations).
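
For orientation, the Python sketch below computes the classic Freeman degree centralization of an undirected, unweighted graph. It is not the weighted measure used in this analysis, whose exact definition is not given here; the function name and the two demo graphs are hypothetical. In this unweighted version a star scores 1 and any graph in which all vertices have the same degree scores 0.

```python
def degree_centralization(adj):
    """Freeman degree centralization of an undirected, unweighted graph.

    adj: dict mapping each vertex to the set of its neighbours.
    Returns sum(max_degree - degree_v) / ((n - 1) * (n - 2)).
    """
    n = len(adj)
    if n < 3:
        raise ValueError("need at least three vertices")
    degrees = [len(neigh) for neigh in adj.values()]
    top = max(degrees)
    # Normalise by the value attained by a star on n vertices.
    return sum(top - d for d in degrees) / ((n - 1) * (n - 2))

# Hypothetical four-person teams: one hub-and-spoke, one evenly connected.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
ring = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
star_score = degree_centralization(star)
ring_score = degree_centralization(ring)
print(star_score, ring_score)
```

A weighted variant would replace the neighbour count with the sum of reply counts on the edges of each vertex, and would need its own normalising constant.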

Our main conjecture is that teams with decentralized communication are more successful than teams with centralized communication. The main explanation is that a centralized communication pattern means that there is one leader in the team who is enthusiastic about the project and basically tells everyone what to do, while the rest of the students take less initiative. By contrast, a decentralized communication pattern means that students contribute more equally. This conjecture is partially supported by the data analysis below.

Regressions

Here is a scatterplot of centralization scores vs project scores by course. Two courses, B and D, display an obvious negative trend (teams that are highly centralized are likely to get a lower project score); in the per-course regressions below, only the trend for course D is significant at the 5% level (p = 0.025).

First, we do a single linear regression with the project score as the response variable and with centralization and the course as predictors.

## 
## Call:
## lm(formula = `project score` ~ centralization + course, data = teamStats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.796  -3.724  -1.126   4.743  12.711 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      80.348      2.492  32.243   <2e-16 ***
## centralization   -7.578      6.411  -1.182   0.2429    
## courseB          -2.837      2.552  -1.112   0.2717    
## courseC          -3.961      2.313  -1.713   0.0931 .  
## courseD          -2.179      2.648  -0.823   0.4145    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.483 on 49 degrees of freedom
## Multiple R-squared:  0.08133,    Adjusted R-squared:  0.006338 
## F-statistic: 1.085 on 4 and 49 DF,  p-value: 0.3745

Now we will do it for each course. The only course for which we have a statistically significant negative trend is FOM. We believe that this is explained by the small number of data points.

## Course A 
## 
## Call:
## lm(formula = `project score` ~ centralization, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3437  -4.3727  -0.2801   5.5297  10.7939 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      75.311      6.441  11.693 2.85e-08 ***
## centralization    9.918     21.316   0.465    0.649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.562 on 13 degrees of freedom
## Multiple R-squared:  0.01638,    Adjusted R-squared:  -0.05928 
## F-statistic: 0.2165 on 1 and 13 DF,  p-value: 0.6494
## 
## 
## Course B 
## 
## Call:
## lm(formula = `project score` ~ centralization, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3375  -3.6378   0.0675   3.1094  12.7289 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       93.48      10.85   8.617 6.11e-06 ***
## centralization   -52.05      29.57  -1.761    0.109    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.782 on 10 degrees of freedom
## Multiple R-squared:  0.2366, Adjusted R-squared:  0.1603 
## F-statistic: 3.099 on 1 and 10 DF,  p-value: 0.1088
## 
## 
## Course C 
## 
## Call:
## lm(formula = `project score` ~ centralization, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5419 -2.5549 -0.5494  0.4565 10.4693 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     74.5549     1.6507   45.16   <2e-16 ***
## centralization  -0.1039     5.2877   -0.02    0.985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.213 on 15 degrees of freedom
## Multiple R-squared:  2.573e-05,  Adjusted R-squared:  -0.06664 
## F-statistic: 0.000386 on 1 and 15 DF,  p-value: 0.9846
## 
## 
## Course D 
## 
## Call:
## lm(formula = `project score` ~ centralization, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.268 -2.580 -1.063  1.204  9.261 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      84.180      3.318  25.372 6.24e-09 ***
## centralization  -27.650     10.052  -2.751    0.025 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.411 on 8 degrees of freedom
## Multiple R-squared:  0.486,  Adjusted R-squared:  0.4218 
## F-statistic: 7.566 on 1 and 8 DF,  p-value: 0.02504

Vocabulary

Below are word clouds by course. They give some idea of what the project topic in each course is.

[Figures: four word clouds, one per course]

A

[Figures: 15 per-team plots for course A]

B

[Figures: 12 per-team plots for course B]

C

[Figures: 17 per-team plots for course C]

D

[Figures: 10 per-team plots for course D]