Main stats
By course
Below are some descriptive statistics for each course.
By team
Below are some descriptive statistics for each team.
By student
Below is a sample of descriptive statistics by student.
Authors: Duzhin Fedor, Tan Joo Seng, Tan Siew Eng
We have collected raw WhatsApp chats of student teams in four courses, two in business and two in mathematics. For each course, we have a directory with all chats exported as txt files and one xlsx file with anonymised student information. Below are some statistics.
Here, we merge all the chats that we have into one huge table. Below is the number of messages in each course.
##
## A B C D
## 7132 4583 3703 1226
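The merge and count can be sketched as follows. This is a minimal illustration on toy data, assuming each course's chats have already been parsed into a data frame; the column names `course`, `team`, `student` and `text` are hypothetical and not taken from the original code.

```r
# Hypothetical parsed chats: one small data frame per course.
# Column names (course, team, student, text) are assumptions for this sketch.
chatsA <- data.frame(course = "A", team = "A1",
                     student = c("s1", "s2", "s1"),
                     text = c("hi", "hello all", "meeting at 5?"),
                     stringsAsFactors = FALSE)
chatsB <- data.frame(course = "B", team = "B1",
                     student = c("s3", "s4"),
                     text = c("draft is ready", "thanks"),
                     stringsAsFactors = FALSE)

# Stack the per-course tables into one table of messages.
allChats <- do.call(rbind, list(chatsA, chatsB))

# Number of messages in each course.
messagesPerCourse <- table(allChats$course)
print(messagesPerCourse)
```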
This is a bar chart of the number of messages per team coloured according to the course. We see that students in “MH2401” and “MH3110” wrote more messages.
For each course, we looked at the mean number of words per message for each student in that course. Below is the whisker chart. It shows all the quartiles of the distribution of the mean number of words per message and outliers. The main observation is that students in “FOM” wrote much longer messages than students in the other 3 courses in our study.
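The per-student statistic behind this chart can be sketched as follows. The data are toy values, the column names are hypothetical, and counting words as whitespace-separated tokens is a simplifying assumption rather than the rule used in the original analysis.

```r
# Toy data; column names are assumptions for this sketch.
msgs <- data.frame(
  course  = c("A", "A", "A", "D", "D"),
  student = c("s1", "s1", "s2", "s3", "s3"),
  text    = c("ok", "see you at five", "fine by me",
              "here is the summary of chapter one", "done"),
  stringsAsFactors = FALSE
)

# Count words as whitespace-separated tokens (a simplification).
countWords <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}
msgs$words <- countWords(msgs$text)

# Mean number of words per message for each student.
perStudent <- aggregate(words ~ course + student, data = msgs, FUN = mean)

# Quartiles of the per-student means within each course;
# boxplot(words ~ course, data = perStudent) would draw the chart itself.
quartilesByCourse <- tapply(perStudent$words, perStudent$course, quantile)
```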
In fact, the median number of words in a message in FOM is larger than the largest number of words in any message in any other course. This probably reflects a unique style of the course FOM. Below is a random long message written by an FOM student.
## [1] " Summary: Gary Hamel believes in 2 things: 1. That there is a need for a new management system completely different from the one that is being utilized by everyone now a. The problem with current business models is that business requires the skills of a large group of people and managemnt helps to coordinate these people to developing goods and services, but as they align employee's interest with that of the company it suppresses the entrepreneurship of the people (current management practices) b. long as companies break away from the current business models and venture into new one they will be able to achieve more. Since china and India have now become costs leaders, this displaced the deficiency goal and now companies should focus on innovation c. They should be drivers of change rather than adapters of change d. Although he provides little guidance on these new business models, he does have a clear vision on the goals of these new business models and that is to: i. Ensure that all employees are constantly innovating ii. Engaging work environment to encourage employees to give their all iii. Increasing the pace of strategic renewal 2. management innovation will provide companies with the competitive edge a. sustainable source of competitive ad b. 
best guide to future of management is the internet as it provides real time connectivity; coordinate human efforts similar to the conventional means but it is less restrictive Problems with achieving his propositions: cent There has been evidence that most companies who diverge from the traditional way of management always tend to go back to the conventional ways when they face problems o Companies that do continue to make use of different business models are often able to do so because of other factors ( google is because of first mover advantage in web search and partnerships with many other companies) cent Complexity organizational integration cent Business innovation as a competitive ed is not easy- most of the time management innovations are because of environmental adaptations; or it may be because of entrepreneurs beliefs ( apple)( often pursue their visions uncompromisingly or autocratically) Questions: 1.why is it that companies that diverge from traditional buisness strategies always tend to go back to the traditional means. What exactly is the common problems that these companies face 2. What is it in traditional business models that are so essential in ensuring that business stays in the competition that businesses are unable to let go and start a new 3.how can companies integrate the essential parts of the traditional business models and new innovative ones to make their company successful 4. Hamel suggests that the internet is the best guide to future of management. How can companies make use of the internet in developing new business models"
Below is the distribution of word count per message by team. Even though we see that students in FOM write longer messages than students in the other three courses, there is still high variance across different teams within FOM.
Below is the whisker chart of the number of messages per student. Students in MH3110 wrote more messages than students in the other three courses, but we do not see anything as extreme as the difference in word count per message. The simplest explanation for why students in MH2401 and MH3110 (mathematics courses) wrote more messages than students in FOM and TP6102 (business courses) is that their projects took longer.
Here we construct various linear regression models to predict the project score of a team based on observable characteristics of its WhatsApp chat.
We see that there is an overall positive relationship between the mean number of messages per student and the project score. The relationship is strongest in the two business courses, FOM and TP6102, although the per-course regressions below reach significance at the 5% level for only one course (course D, p = 0.012), with course C at p = 0.10. The simplest explanation is that the mean number of messages per student reflects the average effort students invest in the project, hence the positive relationship. However, there is a lot that WhatsApp chats cannot capture: face-to-face meetings, students’ individual strengths, communication via alternative channels, etc.
We will do a linear regression with the number of messages and the course as regressors.
##
## Call:
## lm(formula = `project score` ~ `N messages per student` + course,
## data = teamStats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.1700 -3.7239 -0.9865 3.9398 12.4058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.50805 2.23214 34.276 <2e-16 ***
## `N messages per student` 0.01799 0.01599 1.125 0.266
## courseB -3.02895 2.53290 -1.196 0.238
## courseC -2.86058 2.40097 -1.191 0.239
## courseD -1.01034 2.87578 -0.351 0.727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.492 on 49 degrees of freedom
## Multiple R-squared: 0.07894, Adjusted R-squared: 0.003748
## F-statistic: 1.05 on 4 and 49 DF, p-value: 0.3913
There is a positive but statistically insignificant relation between the number of messages per student and the project score. If we fit a separate regression for each course, the relation is positive in three of the four courses but statistically significant at the 5% level in only one (course D):
## Course A
##
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6195 -3.7181 -0.7734 5.9245 11.6873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.071190 3.147331 25.123 2.1e-12 ***
## `N messages per student` -0.009809 0.026715 -0.367 0.719
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.586 on 13 degrees of freedom
## Multiple R-squared: 0.01026, Adjusted R-squared: -0.06587
## F-statistic: 0.1348 on 1 and 13 DF, p-value: 0.7194
##
##
## Course B
##
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.002 -6.376 -1.190 5.639 12.257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.76358 3.67854 19.509 2.74e-09 ***
## `N messages per student` 0.04149 0.03785 1.096 0.299
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.415 on 10 degrees of freedom
## Multiple R-squared: 0.1073, Adjusted R-squared: 0.018
## F-statistic: 1.202 on 1 and 10 DF, p-value: 0.2987
##
##
## Course C
##
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5317 -2.0706 -1.2109 0.5086 10.3767
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.73523 1.38802 52.402 <2e-16 ***
## `N messages per student` 0.03659 0.02098 1.744 0.102
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.842 on 15 degrees of freedom
## Multiple R-squared: 0.1686, Adjusted R-squared: 0.1131
## F-statistic: 3.041 on 1 and 15 DF, p-value: 0.1016
##
##
## Course D
##
## Call:
## lm(formula = `project score` ~ `N messages per student`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9120 -2.3107 0.1968 3.0944 5.0722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.8459 2.7828 24.380 8.55e-09 ***
## `N messages per student` 0.3601 0.1106 3.257 0.0116 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.035 on 8 degrees of freedom
## Multiple R-squared: 0.57, Adjusted R-squared: 0.5163
## F-statistic: 10.61 on 1 and 8 DF, p-value: 0.01158
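The pooled and per-course models above can be fitted along the following lines. This is a sketch on simulated data: the real `teamStats` table is not shown in this report, so the values below are made up and only the column names follow the `lm` calls in the output.

```r
set.seed(1)

# Simulated team-level data; values are invented for illustration only.
teamStats <- data.frame(
  course = rep(c("A", "B", "C", "D"), each = 10),
  `N messages per student` = runif(40, 10, 200),
  check.names = FALSE
)
teamStats$`project score` <- 75 +
  0.02 * teamStats$`N messages per student` + rnorm(40, sd = 5)

# Pooled model with course as a categorical regressor.
pooled <- lm(`project score` ~ `N messages per student` + course,
             data = teamStats)

# One simple regression per course.
perCourse <- lapply(split(teamStats, teamStats$course), function(df) {
  lm(`project score` ~ `N messages per student`, data = df)
})

# Slope and p-value of the messages coefficient in each course.
slopes <- sapply(perCourse, function(m) {
  coef(summary(m))[2, c("Estimate", "Pr(>|t|)")]
})
print(round(slopes, 4))
```

Collecting the slopes and p-values in one matrix like `slopes` is one way to start the combined regression table.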
We need to combine regression results into a table. This is easy, but we should check existing papers in JoLA for the format.
The relationship between project score and the mean word count per message is positive for two courses (A and B), essentially flat for course C, and negative for one course, FOM, where students wrote much longer messages than students in the other three courses. None of these trends is statistically significant at the 5% level, but the pattern is suggestive. Further investigation showed that FOM students tend to paste long citations into their WhatsApp chats; teams that wrote more original messages received higher project scores but wrote shorter messages on average.
We will do a linear regression with the mean number of words per message and the course as regressors.
##
## Call:
## lm(formula = `project score` ~ `mean words per message` + course,
## data = teamStats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.0796 -3.4788 -0.6403 4.1851 12.9914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.9430 1.7776 44.409 <2e-16 ***
## `mean words per message` -0.0853 0.0669 -1.275 0.208
## courseB -3.3639 2.5052 -1.343 0.186
## courseC -3.5661 2.2921 -1.556 0.126
## courseD 1.6773 4.0670 0.412 0.682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.468 on 49 degrees of freedom
## Multiple R-squared: 0.08549, Adjusted R-squared: 0.01083
## F-statistic: 1.145 on 4 and 49 DF, p-value: 0.3465
There is a negative but statistically insignificant trend. It could be explained by the fact that “FOM” is a special course in which students wrote very long messages because they simply pasted long citations.
Below are different regressions for different courses:
## Course A
##
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.258 -2.198 1.387 4.178 7.368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.4307 6.4270 10.647 8.65e-08 ***
## `mean words per message` 1.0697 0.6778 1.578 0.139
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.985 on 13 degrees of freedom
## Multiple R-squared: 0.1608, Adjusted R-squared: 0.09625
## F-statistic: 2.491 on 1 and 13 DF, p-value: 0.1385
##
##
## Course B
##
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.105 -5.623 -1.421 6.089 12.259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.8077 6.9903 9.843 1.84e-06 ***
## `mean words per message` 0.6482 0.7084 0.915 0.382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.555 on 10 degrees of freedom
## Multiple R-squared: 0.07726, Adjusted R-squared: -0.01501
## F-statistic: 0.8373 on 1 and 10 DF, p-value: 0.3817
##
##
## Course C
##
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6170 -2.8328 -0.5894 0.3939 10.6554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.98154 2.86702 26.153 6.28e-14 ***
## `mean words per message` -0.04551 0.26967 -0.169 0.868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.209 on 15 degrees of freedom
## Multiple R-squared: 0.001895, Adjusted R-squared: -0.06465
## F-statistic: 0.02848 on 1 and 15 DF, p-value: 0.8682
##
##
## Course D
##
## Call:
## lm(formula = `project score` ~ `mean words per message`, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8054 -2.8360 0.6479 4.0567 4.7849
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.11636 3.26412 25.157 6.67e-09 ***
## `mean words per message` -0.11234 0.05195 -2.162 0.0626 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.888 on 8 degrees of freedom
## Multiple R-squared: 0.3689, Adjusted R-squared: 0.29
## F-statistic: 4.676 on 1 and 8 DF, p-value: 0.06256
None of the trends is significant at the 5% level; course D comes closest (p = 0.063).
To be honest, not much can be extracted from timelines. Still, we can think about it.
Here are timelines of all the chats on one plot, with colours representing different teams.
For each team and each pair of students A and B in that team, we count how many times B’s message directly followed A’s message. This is a proxy for the number of times B replied to A (whether B actually replied to A is hard to determine without manually going through all the logs). Below is a table of the number of replies for one team.
## replied
## wrote 18101 18102 18103 18104 18105
## 18101 0 1 1 3 2
## 18102 0 0 5 7 4
## 18103 2 4 0 12 6
## 18104 4 9 8 0 12
## 18105 1 2 9 12 0
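A table of this shape can be built by pairing each message with the one that follows it. The sketch below uses a made-up sequence of senders; dropping consecutive messages by the same student is an assumption, chosen because the table above has a zero diagonal.

```r
# Toy chat in chronological order; student IDs are made up for this sketch.
senders <- c("18101", "18102", "18102", "18103", "18101", "18103")

# Pair each message's sender with the sender of the next message.
wrote   <- head(senders, -1)
replied <- tail(senders, -1)

# Drop consecutive messages by the same student (an assumption that
# matches the zero diagonal in the table above).
keep <- wrote != replied

# Cross-tabulate who wrote against who followed.
ids <- sort(unique(senders))
replies <- table(wrote   = factor(wrote[keep], levels = ids),
                 replied = factor(replied[keep], levels = ids))
print(replies)
```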
Here, we visualize each team as a directed graph and the thickness of the arrow from A to B is proportional to the number of replies of B to A.
Given a graph \(G\), we calculate degree centrality of each vertex and then we calculate degree centralization of the whole graph. The reason to choose degree centralization over betweenness, closeness, or eigenvector centrality is that it is the only weighted centrality measure that we know of for which the balanced tree is the graph of maximum centralization with the given number of edges (Fedor will explain it nicely for the paper with all the needed references and equations).
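As an illustration only, here is one way to compute a weighted degree centralization from a reply matrix. The normalization is an assumption and may differ from the one used for the results below: each vertex's weighted degree (replies sent plus replies received) is turned into a share of the total, and the Freeman-style sum of gaps to the largest share is divided by its maximum possible value, (n - 2) / 2, which is reached when every edge touches a single vertex.

```r
# Degree centralization of a weighted reply graph (one possible
# normalization; not necessarily the one used in this report).
degreeCentralization <- function(replies) {
  replies <- as.matrix(replies)
  n <- nrow(replies)
  stopifnot(n == ncol(replies), n > 2)
  diag(replies) <- 0
  # Weighted degree: replies sent plus replies received.
  strength <- rowSums(replies) + colSums(replies)
  share <- strength / sum(strength)
  # Sum of gaps to the most central vertex, divided by its maximum
  # (n - 2) / 2, attained when every edge touches one vertex.
  sum(max(share) - share) / ((n - 2) / 2)
}

# A star: student 1 exchanges replies with everyone, nobody else interacts.
star <- matrix(0, 4, 4)
star[1, 2:4] <- 1
star[2:4, 1] <- 1

# A fully balanced team: everyone replies to everyone equally often.
balanced <- matrix(1, 4, 4)
diag(balanced) <- 0

print(degreeCentralization(star))
print(degreeCentralization(balanced))
```

Under this normalization the star scores 1 and the balanced team scores 0, so higher values mean more centralized communication. The function assumes the matrix contains at least one reply.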
Our main conjecture is that teams with decentralized communication are more successful than teams with centralized communication. The main explanation is that a centralized communication pattern means that there is one leader in the team who is enthusiastic about the project and basically tells everyone what to do, while the rest of the students take less initiative. In contrast, a decentralized communication pattern means that students contribute equally. This conjecture is partially confirmed by the data analysis below.
Here is a scatterplot of centralization scores vs project scores by course. Two courses, B and D, display an obvious negative trend (highly centralized teams are likely to get a lower project score); in the per-course regressions below, the trend is statistically significant at the 5% level only for course D.
First, we fit a single linear regression with the project score as the response variable and centralization and the course as predictors.
##
## Call:
## lm(formula = `project score` ~ centralization + course, data = teamStats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.796 -3.724 -1.126 4.743 12.711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.348 2.492 32.243 <2e-16 ***
## centralization -7.578 6.411 -1.182 0.2429
## courseB -2.837 2.552 -1.112 0.2717
## courseC -3.961 2.313 -1.713 0.0931 .
## courseD -2.179 2.648 -0.823 0.4145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.483 on 49 degrees of freedom
## Multiple R-squared: 0.08133, Adjusted R-squared: 0.006338
## F-statistic: 1.085 on 4 and 49 DF, p-value: 0.3745
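Now we will do it for each course. The only course with a statistically significant negative trend is FOM (course D in the output below, p = 0.025). We believe that the lack of significance in the other courses is explained by the small number of data points: each course has only 10 to 17 teams.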
## Course A
##
## Call:
## lm(formula = `project score` ~ centralization, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.3437 -4.3727 -0.2801 5.5297 10.7939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 75.311 6.441 11.693 2.85e-08 ***
## centralization 9.918 21.316 0.465 0.649
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.562 on 13 degrees of freedom
## Multiple R-squared: 0.01638, Adjusted R-squared: -0.05928
## F-statistic: 0.2165 on 1 and 13 DF, p-value: 0.6494
##
##
## Course B
##
## Call:
## lm(formula = `project score` ~ centralization, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3375 -3.6378 0.0675 3.1094 12.7289
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 93.48 10.85 8.617 6.11e-06 ***
## centralization -52.05 29.57 -1.761 0.109
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.782 on 10 degrees of freedom
## Multiple R-squared: 0.2366, Adjusted R-squared: 0.1603
## F-statistic: 3.099 on 1 and 10 DF, p-value: 0.1088
##
##
## Course C
##
## Call:
## lm(formula = `project score` ~ centralization, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5419 -2.5549 -0.5494 0.4565 10.4693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 74.5549 1.6507 45.16 <2e-16 ***
## centralization -0.1039 5.2877 -0.02 0.985
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.213 on 15 degrees of freedom
## Multiple R-squared: 2.573e-05, Adjusted R-squared: -0.06664
## F-statistic: 0.000386 on 1 and 15 DF, p-value: 0.9846
##
##
## Course D
##
## Call:
## lm(formula = `project score` ~ centralization, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.268 -2.580 -1.063 1.204 9.261
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.180 3.318 25.372 6.24e-09 ***
## centralization -27.650 10.052 -2.751 0.025 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.411 on 8 degrees of freedom
## Multiple R-squared: 0.486, Adjusted R-squared: 0.4218
## F-statistic: 7.566 on 1 and 8 DF, p-value: 0.02504
Below are word clouds by course. They give some idea of what the project topic in each course is.
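The word frequencies that a word cloud displays can be computed in base R as sketched below. The messages and the stop list are toy examples, and the tokenization (letters only, split on whitespace) is an assumption rather than the exact preprocessing used for the clouds.

```r
# Toy messages; in practice the input would be all messages of one course.
texts <- c("Matrix rank and matrix inverse", "rank of the matrix?")

# Replace everything except letters with spaces, lower-case, split on whitespace.
words <- unlist(strsplit(tolower(gsub("[^A-Za-z]+", " ", texts)), "\\s+"))
words <- words[nchar(words) > 0]

# Drop a few common words (a tiny hypothetical stop list).
stopWords <- c("and", "of", "the")
words <- words[!words %in% stopWords]

# Word frequencies, most frequent first; these are the weights a word cloud shows.
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```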