For Project 4, you have two options:
You can do the Project below, or you can pick a language-related data set that interests you and do your own analysis.
In either case, the document does not need to be long: equivalent to 2-3 pages in Microsoft Word (but turn it in as an RMarkdown html file).
In your writeup, you need to do the following:
The writeup should use RMarkdown with blocks of text and blocks of code, like we have been doing for the problem set.
Below is the description for the experiment that is the “default” option for Project 4. If you are analyzing your own data set, you can ignore the below.
If I were inviting a work colleague to come over to hang out with me and my dog in the park, I might say something like: “I heard you might be in the area this weekend. If so (and it’s not raining!), would you like to drop by and go to the park with me and Fogo?”
If my partner is in the other room and I want to see if he’s up for a dog walk with me, I might just text him this: 🎾.
In both cases, the meanings are similar and my goals are similar: I want to see if the person I’m texting will come to the park with me and my dog. But, in the first case, it takes about 40 words. In the second case, it takes one emoji.
What if I had just texted my colleague the tennis ball emoji? Would they have even known what I was talking about? Maybe they would have thought I wanted to play tennis. Or watch tennis. And they might have found my terseness rude!
But my partner knows that our dog loves the park and loves chasing his ball there and that we do it almost every day. Our shared knowledge means that we can be shorter and more efficient in our communication.
This property of language (the importance of shared context) affects the way languages change and evolve over time. This is exactly what you will explore in this project by playing a communication game in which you and a partner communicate about shapes.
The best way to get a sense of the experiment is to play with it!
With a partner, both go to this link, but replace the XXXXX with a codeword that you decide on (it doesn’t matter what it is). It can be both of your names (like KyleJiyoung): http://euca-169-231-235-145.eucalyptus.cloud.aristotle.ucsb.edu:8888/tangrams_sequential/index.html?gameId=XXXXX
You will find yourself in an interactive game, in which there are a series of shapes and a chat box where you can chat with your partner.
Chat your name to your partner in the box, to make sure that you are both there and in the same room. (I recommend you be on Zoom together, at least during the initial setup.)
The goal of the game is for the Director to tell the Matcher which object to click (the one in black) by entering text into the chat box. The Matcher clicks a box and gets feedback on whether they clicked the correct one. For this game, you will always be either the Director or the Matcher; your role does not change.
Note: everything you type in the chat box will be recorded and you will get to analyze it later! So please do not enter anything in the chat box that you would not want your instructors and everyone else in the class to see.
The basics of the experiment were that there were 12 objects to describe and you completed 6 Rounds, each consisting of 12 trials. So there were 12 * 6 = 72 trials total, over the 6 rounds. Your team saw each object exactly 6 times (if you completed the whole thing). What we are interested in is seeing how communication changes over time, and what that can tell us about language.
First, we read in the data set as a data frame d.
Here is what the variables in the data frame mean:
Your job is to analyze this data!
You can think about the following questions, but don’t need to answer all of them. And you can answer some of your own! The document does not need to be long: equivalent to 2-3 pages in Microsoft Word (but turn it in as an RMarkdown html file).
Be sure to write a conclusion paragraph of a few sentences: what does the analysis tell us overall?
This is intentionally a bit open-ended: be creative and apply the statistical lessons we learned in class.
The length of the text responses is strongly right-skewed: the distribution has a very long tail. After taking the logarithm of char_count, the histogram of logged char_count is approximately normal, which suggests that char_count is closer to a log-normal distribution than to a normal one.
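The effect of the log transformation can be sketched as follows. This is a minimal illustration, not the code behind the histogram in this writeup: `toy_char_count` is simulated data standing in for `d$char_count`, and the distribution parameters are arbitrary assumptions.

```r
# Simulated stand-in for d$char_count (hypothetical values, not the real data).
set.seed(1)
toy_char_count <- round(rlnorm(2000, meanlog = 4, sdlog = 0.8)) + 1

# Raw counts: a mean well above the median indicates a long right tail.
raw_mean   <- mean(toy_char_count)
raw_median <- median(toy_char_count)

# Logged counts: mean and median should be close if the shape is roughly normal.
log_count  <- log(toy_char_count)
log_mean   <- mean(log_count)
log_median <- median(log_count)

# Side-by-side histograms of the raw and logged values.
par(mfrow = c(1, 2))
hist(toy_char_count, main = "Raw", xlab = "char_count")
hist(log_count, main = "Logged", xlab = "log(char_count)")
```

With the real data, the same two `hist()` calls on `d$char_count` and `log(d$char_count)` would show whether the logged values look symmetric.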
From the plot above, we can see that, in general, accuracy increases over the course of the experiment. Toward the end of the experiment (approximately roundNum > 50), accuracy is close to 100% on most trials.
First, I will examine how the length of text, measured by character count, changes over the course of the experiment.
In general, the length of text by character count decreases over the course of the experiment. One possible explanation is that, as participants become more familiar with the process, they are able to describe what they see precisely using less text.
Next, I will look at how the length of text, measured by word count, changes over the course of the experiment.
Based on the plot above, we can see that the length of text by word count also decreases over the course of the experiment. The same explanation may apply: as participants become more familiar with the process, they are able to describe what they see precisely with less text.
Next, I will fit a linear model to see if there’s a linear relationship between roundNum and text length.
##
## Call:
## lm(formula = mean.length ~ roundNum, data = d.byRound.3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.1256 -3.3648 -0.9747  2.4587 19.2793
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.81303    1.18753   22.58   <2e-16 ***
## roundNum    -0.39560    0.02827  -13.99   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.986 on 70 degrees of freedom
## Multiple R-squared: 0.7366, Adjusted R-squared: 0.7329
## F-statistic: 195.8 on 1 and 70 DF, p-value: < 2.2e-16
The intercept is 26.81 and the slope is -0.40: on average, each additional trial is associated with about 0.40 fewer words. The multiple R-squared is 0.74, which means that about 74% of the variance in mean text length is explained by the model. There is a significant negative linear relationship between roundNum and text length by word count (p < 2e-16).
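For reference, the aggregate-then-fit step behind a model like this one can be sketched as below. This is only an illustration under stated assumptions: the data frame `toy` is simulated (20 hypothetical teams, 72 trials), and the intermediate name `by_round` is made up; only `roundNum`, `word_count`, and `mean.length` echo names that appear in this document.

```r
# Toy data (hypothetical): 20 teams x 72 trials, word counts shrinking over time.
set.seed(2)
toy <- expand.grid(team = 1:20, roundNum = 1:72)
toy$word_count <- pmax(1, round(27 - 0.4 * toy$roundNum + rnorm(nrow(toy), sd = 5)))

# Mean length per trial: one row per roundNum value.
by_round <- aggregate(word_count ~ roundNum, data = toy, FUN = mean)
names(by_round)[2] <- "mean.length"

# Simple linear regression of mean length on trial number.
fit <- lm(mean.length ~ roundNum, data = by_round)
summary(fit)

# Fitted mean length at the first and last trial.
predict(fit, newdata = data.frame(roundNum = c(1, 72)))
```

Averaging within each trial first means the model is fit to 72 points, which is why the residual degrees of freedom are 70 (72 minus the two estimated coefficients).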
First, I will analyze how textual similarity with the previous trial for the same object changes over the course of the experiment.
In general, over the course of the experiment, textual similarity with the previous trial for the same object increases. Interestingly, when roundNum is less than or equal to 12, textual similarity with the previous trial is zero. This is probably because, at the beginning of the experiment, every object is being seen for the first time, so there is no previous trial for the same object and the similarity cannot be measured.
Next, I will look at how textual similarity between this trial’s contents and a random team’s contents for the same object changes over the course of the experiment.
The dots on the plot are scattered. I will fit a linear model to see if there is a linear relationship between textual similarity and roundNum, and if there is a linear relationship, how much variance is captured by the model.
##
## Call:
## lm(formula = mean.sim ~ roundNum, data = d.byRound.sim.2)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.045352 -0.018663 -0.003194  0.015944  0.088451
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.0785436  0.0056189  13.978  < 2e-16 ***
## roundNum    -0.0008328  0.0001338  -6.226 3.12e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02359 on 70 degrees of freedom
## Multiple R-squared: 0.3564, Adjusted R-squared: 0.3472
## F-statistic: 38.76 on 1 and 70 DF, p-value: 3.119e-08
The intercept is 0.079 and the slope is -0.0008, which is small in absolute terms but significantly different from zero (p = 3.12e-08). There is therefore a significant negative linear relationship between roundNum and textual similarity: over the 72 trials, the fitted similarity drops by roughly 0.06. The multiple R-squared is 0.3564, so only 35.64% of the variance is explained by the model, which is not very high.
From the plot above, it seems like there is a linear relationship between char_count and word_count, but there is an outlier, and the dots on the left are very close to each other. This far outlier suggests that it might be good to take the logarithm of the values.
Next, I will fit a linear model on this data to see if there is a linear relationship between char_count and word_count.
##
## Call:
## lm(formula = word_count ~ char_count, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -248.499   -5.351   -3.672    0.858  110.355
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.254324   0.285974   21.87   <2e-16 ***
## char_count  0.106790   0.001345   79.42   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.98 on 1957 degrees of freedom
## Multiple R-squared: 0.7632, Adjusted R-squared: 0.7631
## F-statistic: 6308 on 1 and 1957 DF, p-value: < 2.2e-16
## [1] 0.8736188
The intercept is 6.25 and the slope is 0.107. The R-squared value is 0.76, which means that about 76% of the variance is explained by the model.
The relationship is also significant, with a correlation coefficient of r = 0.87. This is a very high correlation.
Next, I will take the logarithm of the two variables.
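The correlation coefficient and the R-squared value are two views of the same quantity: in a regression with a single predictor, R-squared equals the square of Pearson's r. A quick check using only the two numbers reported in the output above:

```r
# Values copied from the output above.
r         <- 0.8736188  # Pearson correlation between char_count and word_count
r_squared <- 0.7632     # Multiple R-squared from the model summary

# With a single predictor, R-squared is the square of r.
r^2

# Do the two reported numbers agree to four decimal places?
isTRUE(all.equal(round(r^2, 4), r_squared))
```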
Then, I will fit a linear model on the logged values.
##
## Call:
## lm(formula = log.word.count ~ log.char.count, data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.72057 -0.12075  0.03681  0.12350  1.64582
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -1.645818   0.018329  -89.79   <2e-16 ***
## log.char.count  0.999730   0.004996  200.09   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2706 on 1957 degrees of freedom
## Multiple R-squared: 0.9534, Adjusted R-squared: 0.9534
## F-statistic: 4.004e+04 on 1 and 1957 DF, p-value: < 2.2e-16
## [1] 0.9764209
Based on this, we can see that the intercept is -1.65 and the slope is 1.00.
The R-squared value is now 0.95, which means that 95% of the variance can be explained by the model. Recall that the R-squared value of the previous linear model on the original values was 0.76. This suggests that, by taking logarithms of the values, we are able to explain more of the variance.
In addition, the correlation coefficient (r = 0.98) is higher than the previous correlation coefficient (r = 0.87). There is a stronger linear relationship between logged character count and logged word count than between raw character count and raw word count.
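To interpret the logged model on the original scale, the coefficients can be back-transformed. The sketch below assumes natural logarithms were used (the default for `log()` in R; the writeup does not say which base was applied), and `predict_words` is a hypothetical helper name introduced only for illustration.

```r
# Coefficients reported in the log-log model output above.
intercept <- -1.645818
slope     <-  0.999730

# Back-transform, assuming natural logs:
# word_count is approximately exp(intercept) * char_count^slope.
multiplier <- exp(intercept)
predict_words <- function(char_count) multiplier * char_count^slope

multiplier        # words per character when the slope is about 1
1 / multiplier    # implied characters per word
predict_words(c(10, 50, 100))
```

Because the slope is almost exactly 1, word count is close to proportional to character count under this assumption, at roughly one word per five characters.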
From the bar chart above, we can see that, in general, C and I are easier to guess than other items. B is harder to guess than all other items.
Do different items generate texts of different lengths?
Yes! C and I, which are the two easiest items to guess, generate the shortest texts (in number of words). Interestingly, G, which is the second hardest item to guess, generates the longest texts.
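A per-item comparison like this one can be computed with a single aggregation. The sketch below is an illustration only: the data are simulated, and the column names `item` and `correct` are hypothetical stand-ins, since the variable list for the real data frame is not reproduced here.

```r
# Toy data (hypothetical column names and values, not the real data frame).
set.seed(3)
toy <- data.frame(
  item       = rep(LETTERS[1:12], each = 50),
  correct    = rbinom(600, size = 1, prob = 0.85),
  word_count = rpois(600, lambda = 12)
)

# Per-item accuracy (proportion correct) and mean text length in words.
by_item <- aggregate(cbind(correct, word_count) ~ item, data = toy, FUN = mean)
names(by_item) <- c("item", "accuracy", "mean.words")

# Items sorted from hardest to easiest to guess.
by_item[order(by_item$accuracy), ]

# Does lower accuracy go with longer descriptions across items?
cor(by_item$accuracy, by_item$mean.words)
```

With only 12 items, a correlation computed this way is a rough summary at best, so it is worth reading alongside the bar charts rather than on its own.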