For Project 4, you have two options:

You can do the Project below, or you can pick a language-related data set that interests you and do your own analysis.

In either case, the document does not need to be long: equivalent to 2-3 pages in Microsoft Word (but turn it in as an RMarkdown html file).

In your writeup, you need to do the following:

make a graph and describe what it shows
run a linear regression and discuss the results
report a correlation
use a logarithm

The writeup should use RMarkdown with blocks of text and blocks of code, like we have been doing for the problem set.

Below is the description for the experiment that is the “default” option for Project 4. If you are analyzing your own data set, you can ignore the below.

Background

If I were inviting a work colleague to come over to hang out with me and my dog in the park, I might say something like: “I heard you might be in the area this weekend. If so (and it’s not raining!), would you like to drop by and go to the park with me and Fogo?”

If my partner is in the other room and I want to see if he’s up for a dog walk with me, I might just text him this: 🎾.

In both cases, the meanings are similar and my goals are similar: I want to see if the person I’m texting will come to the park with me and my dog. But, in the first case, it takes about 40 words. In the second case, it takes one emoji.

What if I had just texted my colleague the tennis ball emoji? Would they have even known what I was talking about? Maybe they would have thought I wanted to play tennis. Or watch tennis. And they might have found my terseness rude!

But my partner knows that our dog loves the park and loves chasing his ball there and that we do it almost every day. Our shared knowledge means that we can be shorter and more efficient in our communication.

This property of language (the importance of shared context) affects the way languages change and evolve over time. This is exactly what you will explore in this project by playing a communication game in which you and a partner communicate about shapes.

Familiarize yourself with the experiment

The best way to get a sense of the experiment is to play with it!

With a partner, both go to this link, replacing the XXXXX with a codeword of your choice (it doesn’t matter what it is). It can be both of your names (like KyleJiyoung): http://euca-169-231-235-145.eucalyptus.cloud.aristotle.ucsb.edu:8888/tangrams_sequential/index.html?gameId=XXXXX

You will find yourself in an interactive game, in which there are a series of shapes and a chat box where you can chat with your partner.

Chat your name to your partner in the box, to make sure that you are both there and in the same room. (I recommend you be on Zoom together, at least during the initial setup.)

The goal of the game is for the Director to tell the Matcher which object to click (the one in black), by entering text into the chat box. The Matcher clicks a box and gets feedback on whether they clicked the correct box. For this game, you will always be either the Director or the Matcher, never both.

Note: everything you type in the chat box will be recorded and you will get to analyze it later! So please do not enter anything in the chat box you would not want to be seen by your instructors and everyone else in the class.

Basics

The experiment had 12 objects to describe, and you completed 6 Rounds of 12 trials each, for 12 * 6 = 72 trials in total. Your team saw each object exactly 6 times (if you completed the whole thing). What we are interested in is seeing how communication changes over time, and what that can tell us about language.

Intro questions and overview

First, we read in the data set as a data frame d.
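A minimal sketch of this step, assuming the data arrive as a CSV file. The three toy rows are made up for illustration and stand in for the real file; only the column names come from this document. Inline text is used so the sketch runs on its own.

```r
# Made-up rows standing in for the real file (hypothetical values)
csv_text <- "gameid,roundNum,intendedName,correct,char_count,word_count
team1,1,A,true,212,41
team1,2,B,false,180,35
team2,1,A,true,95,18"

d_toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Make sure correct is a logical column, whichever way it was read in
d_toy$correct <- as.logical(d_toy$correct)

str(d_toy)    # one row per trial, one column per variable
nrow(d_toy)   # number of trials in the toy data
```

With the real data, the same call would take a file path instead of `text =`, for example `d <- read.csv("tangrams.csv")`, where the file name is a placeholder.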

Here is what the variables in the data frame mean:

gameid: the ID for an individual set of players
roundNum: Each team sees 72 trials. roundNum is the trial number, from 1 to 72 (not to be confused with the 6 Rounds of 12 trials each described above).
occurenceNum: How many times the particular object has been seen. When occurrenceNum is 1, that means it’s the first time. If it’s 3, that means it’s the third time.
intendedName: what object was supposed to be guessed
contents: what was actually said by the matcher + director
correct: if the matcher clicked the correct thing
char_count: number of letters in the conversation
word_count: number of words in the conversation
jaccard_with_prev_occ: measure of similarity between this trial and the previous trial for the same object. So if this trial shows object A, how similar is this text to the LAST time the team saw object A.
jaccard_with_random: measure of similarity between this trial’s contents and the trial’s contents for the same object for a random team (that is, NOT this team).
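To make the two similarity columns more concrete, here is a sketch of a Jaccard similarity between two messages: the number of words they share divided by the number of distinct words across both. How the course data were actually tokenized is not stated here, so the lowercase word splitting below is an assumption, and the two example messages are invented.

```r
# Jaccard similarity between two utterances, treating each as a set of
# lowercase words. The tokenization rule is an assumption.
jaccard <- function(a, b) {
  words_a <- unique(strsplit(tolower(a), "[^a-z']+")[[1]])
  words_b <- unique(strsplit(tolower(b), "[^a-z']+")[[1]])
  words_a <- words_a[words_a != ""]
  words_b <- words_b[words_b != ""]
  all_words <- union(words_a, words_b)
  if (length(all_words) == 0) return(0)  # two empty messages
  length(intersect(words_a, words_b)) / length(all_words)
}

# Invented messages: a long first description and a later shorthand
early <- "the one that looks like a person kneeling with arms out"
late  <- "kneeling person"

jaccard(early, late)  # 2 shared words out of 11 distinct words
jaccard(late, late)   # identical messages give 1
```

A value near 0 means the two messages share almost no words; a value near 1 means they use nearly the same vocabulary.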

Analysis

Your job is to analyze this data!

You can think about the following questions, but don’t need to answer all of them. And you can answer some of your own! The document does not need to be long: equivalent to 2-3 pages in Microsoft Word (but turn it in as an RMarkdown html file).

How long are the text responses (contents column) in general? What distribution does it follow?
Do longer texts lead to more accurate answers?
How does accuracy change over the course of the experiment?
How does the length of text change over the course of the experiment?
How does textual similarity change over the course of the experiment?
What is the relationship between char_count and word_count?
Are some objects (intendedName column) easier or harder to guess than others? Do they generate texts of different lengths?

Be sure to write a few sentences as a concluding paragraph: what does the analysis tell us overall?

This is intentionally a bit open-ended: be creative and apply the statistical lessons we learned in class.

How long are the text responses (contents column) in general? What distribution does it follow?

The length of the text responses is strongly right-skewed, with a very long tail. By taking the logarithm of char_count, we get a histogram of the distribution of logged char_count. Logged char_count approximately follows a normal distribution, which suggests that char_count itself is closer to log-normal than to exponential.
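The effect of the log transform can be sketched with simulated data. The numbers below are not the class data: they are drawn from a log-normal distribution (an assumption chosen to mimic a long right tail), and `char_count` here is a stand-in vector, not the column of d.

```r
set.seed(1)
# Simulated stand-in for d$char_count (assumption: roughly log-normal)
char_count <- round(rlnorm(2000, meanlog = 4, sdlog = 0.8)) + 1

# Raw scale: the mean sits well above the median because of the long tail
summary(char_count)

# Log scale: mean and median are close, as for a symmetric distribution
summary(log(char_count))

# Bin counts for both versions; drop plot = FALSE to draw the histograms
h_raw <- hist(char_count, plot = FALSE)
h_log <- hist(log(char_count), plot = FALSE)
h_log$counts
```

Comparing the mean and median before and after the transform is a quick numerical check of the skew that the histogram shows visually.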

How does accuracy change over the course of the experiment?

From the plot above, we can see that, in general, accuracy increases over the course of the experiment. Toward the end of the experiment (approximately roundNum > 50), accuracy is close to 100% most of the time.
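One way a per-trial accuracy curve like this could be computed is to average `correct` within each value of `roundNum`. The sketch below uses a simulated toy data frame (20 invented teams whose accuracy rises over time), so the numbers are illustrative only; with the real data, `correct` may first need converting to TRUE/FALSE or 0/1.

```r
set.seed(2)
# Toy stand-in for d: 20 teams x 72 trials, accuracy rising over time
toy <- expand.grid(gameid = 1:20, roundNum = 1:72)
p_correct <- 0.75 + 0.24 * (toy$roundNum - 1) / 71
toy$correct <- rbinom(nrow(toy), size = 1, prob = p_correct)

# Proportion correct at each trial, averaged over teams
acc.byRound <- aggregate(correct ~ roundNum, data = toy, FUN = mean)
head(acc.byRound)

# Compare the first 12 trials with the last 12
early_acc <- mean(toy$correct[toy$roundNum <= 12])
late_acc  <- mean(toy$correct[toy$roundNum > 60])
c(early = early_acc, late = late_acc)
```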

How does the length of text change over the course of the experiment?

First, I will examine how the length of text, measured by character count, changes over the course of the experiment.

In general, over the course of the experiment, the length of text by character count decreases. One possible explanation is that, as participants become more familiar with the process, they are able to describe what they see precisely using less text.

Next, I will look at how the length of text by word count changes over the course of the experiment.

Based on the plot above, we can see that the length of text by word count also decreases over the course of the experiment. The same explanation likely applies: as participants become more familiar with the process, they need fewer words to describe what they see.

Next, I will fit a linear model to see if there’s a linear relationship between roundNum and text length.

## 
## Call:
## lm(formula = mean.length ~ roundNum, data = d.byRound.3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1256 -3.3648 -0.9747  2.4587 19.2793 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.81303    1.18753   22.58   <2e-16 ***
## roundNum    -0.39560    0.02827  -13.99   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.986 on 70 degrees of freedom
## Multiple R-squared:  0.7366, Adjusted R-squared:  0.7329 
## F-statistic: 195.8 on 1 and 70 DF,  p-value: < 2.2e-16

The intercept is 26.81 and the slope is -0.40, so the fitted mean length drops by about 0.4 per trial. The multiple R-squared is 0.74, which means that about 74% of the variance in mean length across trials is explained by the model. There is a significant negative linear relationship between roundNum and text length by word count (p < 2e-16).
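As a worked check, the coefficients from the summary above can be plugged into the fitted line. The second half of the sketch shows one way a by-trial table like d.byRound.3 might be built before calling lm(); the toy data and the name `d.byRound.toy` are invented for illustration and are not the data behind the output above.

```r
# Coefficients copied from the summary above
b0 <- 26.81303
b1 <- -0.39560

# Fitted mean length at the first, middle, and last trial
fitted_len <- b0 + b1 * c(1, 36, 72)
fitted_len  # the value for trial 72 is below zero

# Toy stand-in for d (simulated, not the class data)
set.seed(3)
toy <- expand.grid(gameid = 1:25, roundNum = 1:72)
toy$word_count <- pmax(1, round(30 - 0.3 * toy$roundNum +
                                rnorm(nrow(toy), sd = 3)))

# One row per trial: mean word count across teams
d.byRound.toy <- aggregate(word_count ~ roundNum, data = toy, FUN = mean)
names(d.byRound.toy)[2] <- "mean.length"

fit <- lm(mean.length ~ roundNum, data = d.byRound.toy)
coef(fit)
summary(fit)$r.squared
```

A fitted length below zero at trial 72 is impossible, which hints that the real decline levels off rather than continuing in a straight line. A model on the log scale might describe it better.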

How does textual similarity change over the course of the experiment?

First, I will analyze how textual similarity with the previous trial for the same object changes over the course of the experiment.

In general, over the course of the experiment, textual similarity with the previous trial for the same object increases. Interestingly, when roundNum is less than or equal to 12, textual similarity with the previous trial is zero. This is most likely because, at the beginning of the experiment, all of the objects are seen for the first time (there is no previous trial of the same object). Thus, it’s impossible to measure textual similarity with the previous trial for the same object.

Next, I will look at how textual similarity between this trial’s contents and the contents of a trial for the same object from a random team changes over the course of the experiment.

The dots on the plot are scattered. I will fit a linear model to see if there is a linear relationship between textual similarity and roundNum, and if there is a linear relationship, how much variance is captured by the model.

## 
## Call:
## lm(formula = mean.sim ~ roundNum, data = d.byRound.sim.2)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.045352 -0.018663 -0.003194  0.015944  0.088451 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0785436  0.0056189  13.978  < 2e-16 ***
## roundNum    -0.0008328  0.0001338  -6.226 3.12e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02359 on 70 degrees of freedom
## Multiple R-squared:  0.3564, Adjusted R-squared:  0.3472 
## F-statistic: 38.76 on 1 and 70 DF,  p-value: 3.119e-08

The intercept is 0.079 and the slope is -0.0008. The slope is small in absolute terms, but it adds up to a drop of roughly 0.06 across the 72 trials. There is a significant negative linear relationship between roundNum and textual similarity with a random team (p = 3.12e-08). The multiple R-squared is 0.3564, which means that only 35.64% of the variance is explained by the model. This is not a very high multiple R-squared value.

What is the relationship between char_count and word_count?

From the plot above, it seems like there is a linear relationship between char_count and word_count, but there is an outlier, and the dots on the left are very close to each other. This extreme outlier suggests that it might be good to take the logarithm of the values.

Next, I will fit a linear model on this data to see if there is a linear relationship between char_count and word_count.

## 
## Call:
## lm(formula = word_count ~ char_count, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -248.499   -5.351   -3.672    0.858  110.355 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.254324   0.285974   21.87   <2e-16 ***
## char_count  0.106790   0.001345   79.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.98 on 1957 degrees of freedom
## Multiple R-squared:  0.7632, Adjusted R-squared:  0.7631 
## F-statistic:  6308 on 1 and 1957 DF,  p-value: < 2.2e-16

## [1] 0.8736188

The intercept is 6.25 and the slope is 0.107. The R-squared value is 0.76, which means that 76% of the variance in word_count is explained by the model.

Also, this relationship is significant, with a correlation coefficient of r = 0.87. This is a very high correlation.
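The R-squared and the correlation above are two views of the same quantity: in a regression with a single predictor, R-squared equals the squared correlation. The sketch below checks this against the reported numbers and then on simulated data (the `x` and `y` vectors are invented, not the class data).

```r
# Values copied from the output above
r_squared  <- 0.7632
r_reported <- 0.8736188

sqrt(r_squared)  # compare with r_reported

# The same identity on simulated data
set.seed(4)
x <- rlnorm(500, meanlog = 4, sdlog = 0.8)
y <- 0.2 * x * exp(rnorm(500, sd = 0.3))
fit <- lm(y ~ x)

all.equal(cor(x, y)^2, summary(fit)$r.squared)
```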

Next, I will take the logarithm of the two variables.

Then, I will fit a linear model on the logged values.

## 
## Call:
## lm(formula = log.word.count ~ log.char.count, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.72057 -0.12075  0.03681  0.12350  1.64582 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.645818   0.018329  -89.79   <2e-16 ***
## log.char.count  0.999730   0.004996  200.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2706 on 1957 degrees of freedom
## Multiple R-squared:  0.9534, Adjusted R-squared:  0.9534 
## F-statistic: 4.004e+04 on 1 and 1957 DF,  p-value: < 2.2e-16

## [1] 0.9764209

Based on this, we can see that the intercept is -1.65 and the slope is 1.00. A slope of about 1 on the log-log scale means that word count is roughly proportional to character count.

The R-squared value is now 0.95, which means that 95% of the variance in logged word count can be explained by the model. The R-squared value of the previous linear model on the original values was 0.76. This suggests that, by taking logarithms of the values, we are able to explain more of the variance.

In addition, the correlation coefficient (r = 0.98) is higher than the previous correlation coefficient (r = 0.87). There is a stronger linear relationship between logged character count and logged word count than between the raw character count and word count.
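A sketch of the log-log fit on simulated data may help with interpreting the coefficients. The toy data frame is invented (about 5 characters per word, with multiplicative noise), and the back-transformation of the reported intercept assumes natural logarithms were used, which is the default for log() in R but is not confirmed by the output above.

```r
set.seed(5)
# Toy stand-in for d (simulated, not the class data)
toy <- data.frame(char_count = round(rlnorm(1000, 4, 0.8)) + 5)
toy$word_count <- pmax(1, round(toy$char_count / 5.2 *
                                exp(rnorm(1000, sd = 0.2))))

toy$log.char.count <- log(toy$char_count)
toy$log.word.count <- log(toy$word_count)

fit.log <- lm(log.word.count ~ log.char.count, data = toy)
coef(fit.log)
cor(toy$log.char.count, toy$log.word.count)

# With a slope near 1, exp(intercept) estimates words per character.
# Using the intercept reported above (assuming natural logs):
words_per_char <- exp(-1.645818)
words_per_char      # roughly 0.19 words per character
1 / words_per_char  # characters per word
```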

Are some objects (intendedName column) easier or harder to guess than others? Do they generate texts of different lengths?

From the bar chart above, we can see that, in general, C and I are easier to guess than the other items. B is harder to guess than all other items.

Do they generate texts of different lengths?

Yes! C and I, which are the two easiest items to guess, generate the shortest texts (in number of words). Interestingly, G, which is the second hardest item to guess, generates the longest texts.
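A per-object summary like this could be computed by averaging `correct` and `word_count` within each value of `intendedName`. In the sketch below the data are simulated: the object labels A through L, the difficulty levels, and the text lengths are all invented, with harder objects set up to produce longer texts.

```r
set.seed(6)
# Toy stand-in for d: 20 teams x 12 objects x 6 occurrences (simulated)
toy <- expand.grid(gameid = 1:20,
                   intendedName = LETTERS[1:12],
                   occurrenceNum = 1:6)

# Invented per-object probability of a correct click
p_obj <- seq(0.70, 0.98, length.out = 12)
names(p_obj) <- LETTERS[1:12]
p_trial <- p_obj[as.character(toy$intendedName)]

toy$correct    <- rbinom(nrow(toy), size = 1, prob = p_trial)
toy$word_count <- rpois(nrow(toy), lambda = 30 - 20 * p_trial)

# Mean accuracy and mean word count for each object
by_object <- aggregate(cbind(correct, word_count) ~ intendedName,
                       data = toy, FUN = mean)
by_object[order(by_object$correct), ]

# Do harder objects get longer descriptions?
r_obj <- cor(by_object$correct, by_object$word_count)
r_obj
```

Reporting the correlation between per-object accuracy and per-object length puts a number on the pattern described above, instead of relying only on the items at the extremes.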

Project 4: Communication Game Project