Total: 25 points

Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above beside author:.
3. Before beginning, it is good practice to create a directory that contains your R scripts as well as any data you will need, and to set it as your working directory. The working directory can be set in the console with the setwd() function or via the RStudio menu (Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be written within the chunks provided for each question. These chunks begin with three back ticks and the letter r in curly brackets (```{r}) and end with three back ticks. You can add as much space as you need within the chunks but do not delete the back ticks or otherwise modify the chunks in any way or the file will cause errors when compiled.
5. When you have answered all of the questions, click the Knit button. This will create an HTML file in your working directory.
6. Upload the HTML file to Moodle.

Data description:
The lcf.csv dataset contains data from the Leerdercorpus Frans, a learner corpus of L2 French. The variables in the data frame are:

  • ID: anonymous ID corresponding to the learner who wrote the text
  • CEFR: the proficiency level of the text (B2, C1 or C2)
  • TOPIC: the writing topic
  • LING.NOUN: the average frequency of nouns related to “linguistics” (based on a corpus of linguistics articles written by native speakers)
  • RTTR: root type-token ratio (a measure of lexical diversity)
  • AWL: average word length

Hint:
While completing the assignment, it may be helpful to keep the following questions in mind:

  • What kinds of variables are involved in your hypothesis (integer, ordinal, categorical etc.) and how many?
  • Are data points in your data related such that you can associate them to each other in a meaningful way?
  • What is the statistic of the dependent variable in the statistical hypothesis?
  • What does the distribution of the data of your test statistic look like?
  • How big are the samples you collected?
  • What assumptions must be met before running a particular statistical test?

1 Load the data set (“lcf.csv”) into a dataframe called “lcf”.

setwd("~/Desktop/Satistics for linguistics/Assignment")
lcf <- read.csv("lcf.csv", header = TRUE, stringsAsFactors = TRUE)
attach(lcf)
str(lcf)
## 'data.frame':    169 obs. of  6 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CEFR     : Factor w/ 3 levels "B2","C1","C2": 2 2 2 2 2 2 3 2 2 2 ...
##  $ TOPIC    : Factor w/ 7 levels "delinquency",..: 2 6 2 6 2 2 2 2 2 7 ...
##  $ LING.NOUN: num  204 414 234 290 304 ...
##  $ RTTR     : num  7.99 9.06 8.25 9.04 8.64 ...
##  $ AWL      : num  4.36 4.84 4.55 4.82 4.34 ...
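
Note: attach() places the columns on the search path, where they can mask or be masked by other objects. A minimal sketch of an alternative is shown below, assuming a small made-up data frame called `toy` (hypothetical values, not taken from lcf.csv); the columns are reached with `$` or `with()` instead.

```r
# Hypothetical toy data frame standing in for lcf (values are made up)
toy <- data.frame(
  CEFR = factor(c("B2", "C1", "C1", "C2")),
  RTTR = c(9.1, 10.2, 9.8, 11.0)
)

# Columns can be accessed explicitly with $ ...
mean_rttr <- mean(toy$RTTR)

# ... or inside with(), which evaluates the expression within the data frame
medians <- with(toy, tapply(RTTR, CEFR, median))
```

The same pattern would apply to the real data, e.g. with(lcf, tapply(RTTR, CEFR, median)).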

3 You want to test whether lexical diversity (RTTR) increases with proficiency (CEFR).

3.1 Formulate hypotheses. 1 point

Answer:
H0: lexical diversity does not increase with proficiency (r ≤ 0)
H1: lexical diversity increases with proficiency (r > 0)

3.2 Calculate descriptive statistics and represent the data graphically. 2 points

SUMA <- tapply(RTTR, CEFR, summary)
SUM_T <- table(RTTR, CEFR)
tapply(RTTR, CEFR, median)
##       B2       C1       C2 
## 10.05773 10.13688 11.01353
tapply(RTTR,CEFR, IQR)
##       B2       C1       C2 
## 1.908718 2.094596 1.513173
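# plot the raw RTTR values per CEFR level rather than the summary objects
boxplot(RTTR ~ CEFR, xlab = "CEFR level", ylab = "RTTR")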

3.3 Test your hypothesis with analytical statistics. 2 points

# convert the factor levels B2 < C1 < C2 to the integer codes 1, 2, 3
CEFR_NUM <- as.numeric(CEFR)
cor.test(RTTR, CEFR_NUM)
## 
##  Pearson's product-moment correlation
## 
## data:  RTTR and CEFR_NUM
## t = 2.2579, df = 167, p-value = 0.02525
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02171548 0.31488988
## sample estimates:
##       cor 
## 0.1721113
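
Note: because CEFR is an ordinal variable and the hypothesis is directional, a rank-based, one-sided correlation test may be a closer match than the two-sided Pearson test above. A minimal sketch on simulated data follows; the `sim` data frame, its group sizes and its means are assumptions for illustration only, not the lcf data, so no result for the real data is implied.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated stand-in for the real data: three ordered levels, 20 texts each
sim <- data.frame(
  CEFR = factor(rep(c("B2", "C1", "C2"), each = 20),
                levels = c("B2", "C1", "C2"), ordered = TRUE),
  RTTR = c(rnorm(20, mean = 9.5), rnorm(20, mean = 10), rnorm(20, mean = 11))
)

# Kendall's tau copes with ordinal data and many ties;
# alternative = "greater" tests for a positive association only,
# exact = FALSE requests the normal approximation (needed with ties)
res <- cor.test(sim$RTTR, as.numeric(sim$CEFR),
                method = "kendall", alternative = "greater", exact = FALSE)

res$estimate  # Kendall's tau
res$p.value   # one-sided p-value
```

On the real data the call would use RTTR and CEFR_NUM in place of the simulated columns.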

3.4 Summarize the result(s) briefly. 3 points

Answer: A Pearson correlation test was carried out to test whether lexical diversity increases with proficiency level. The correlation is positive but weak (r = 0.17, t(167) = 2.26) and statistically significant, since p = 0.025 < α = 0.05. The null hypothesis is therefore rejected: lexical diversity increases with proficiency.

4 You want to test whether texts written on the topic of freedom were rated more proficient than texts written on the topic of language. In other words, whether there is an association between topic (TOPIC: language or freedom) and CEFR level (CEFR: B2, C1 or C2).

4.1 Formulate hypotheses. 1 point

Answer:
H0: texts on the topic of freedom were not rated more proficient than texts on the topic of language
H1: texts on the topic of freedom were rated more proficient than texts on the topic of language

4.2 Summarize the data numerically and represent the data graphically. 2 points

table(TOPIC,CEFR)
##               CEFR
## TOPIC          B2 C1 C2
##   delinquency  11 42 13
##   euthanasia    2 23  6
##   foreigners    1  3  0
##   freedom       7 13  4
##   language      4 13 13
##   nuclear       1  9  0
##   state_reform  0  3  1
bar_plot <- lcf[lcf$TOPIC %in% c('language', 'freedom'), c('TOPIC', 'CEFR')]
bar_plot$TOPIC <- factor(bar_plot$TOPIC, levels = c('language', 'freedom'))
# one colour per CEFR level (three bars per topic)
barplot(table(bar_plot$CEFR, bar_plot$TOPIC), beside = TRUE, legend = TRUE,
        col = c('lightblue', 'lightgreen', 'lightyellow'), main = 'CEFR level by topic',
        xlab = 'TOPIC', ylab = 'Frequency')

4.3 Test your hypothesis with analytical statistics. 1 point

data_lf <- lcf[lcf$TOPIC %in% c('language', 'freedom'), c('TOPIC', 'CEFR')]
data_lf$TOPIC <- factor(data_lf$TOPIC, levels = c('language', 'freedom'))
table_lf <- table(data_lf$CEFR, data_lf$TOPIC)
fisher.test(table_lf)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table_lf
## p-value = 0.07322
## alternative hypothesis: two.sided
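
Note: Fisher's exact test is often preferred over the chi-squared test when some expected counts are small. A minimal sketch of that check is given below, using the freedom and language counts copied from the table in 4.2; the object names `counts` and `expected` are illustrative.

```r
# Counts for the two topics, copied from the table in 4.2
counts <- matrix(c(7, 13, 4,
                   4, 13, 13),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(TOPIC = c("freedom", "language"),
                                 CEFR  = c("B2", "C1", "C2")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
round(expected, 2)

# A common rule of thumb asks for all expected counts to be at least 5;
# here the freedom/B2 cell falls below that (24 * 11 / 54 is about 4.89)
any(expected < 5)
```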

4.4 Summarize the result(s) briefly. 1 point

Answer: Fisher's exact test gave p = 0.073 > 0.05, so we fail to reject the null hypothesis. There is no significant association between topic and CEFR level, and therefore no evidence that texts on the topic of freedom were rated more proficient than texts on the topic of language.

5 You want to test whether lexical diversity (RTTR) is correlated with average word length (AWL). In other words, do learners who use a more diverse vocabulary in their texts also use longer words?

5.1 Formulate hypotheses. 1 point

Answer:
H0: lexical diversity does not correlate with average word length
H1: lexical diversity correlates with average word length

5.2 Represent the data graphically. 1 point

plot(RTTR,AWL)

5.3 Test your hypothesis with analytical statistics. 1 point

cor.test(RTTR,AWL)
## 
##  Pearson's product-moment correlation
## 
## data:  RTTR and AWL
## t = 5.3784, df = 167, p-value = 2.503e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2476492 0.5058614
## sample estimates:
##       cor 
## 0.3842442
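
Note: the t statistic and p-value in the output above follow from r and the degrees of freedom through t = r * sqrt(df) / sqrt(1 - r^2). A short sketch that recomputes them from the reported values (the names `t_stat` and `p_val` are illustrative):

```r
r  <- 0.3842442  # sample correlation reported above
df <- 167        # degrees of freedom reported above (n - 2)

# t statistic for testing whether the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# two-sided p-value from the t distribution
p_val <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

round(t_stat, 4)  # matches the reported t = 5.3784
p_val
```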

5.4 Summarize the result(s) briefly. 1 point

Answer: There is a significant, moderate positive correlation between lexical diversity and average word length (Pearson’s r = 0.38, t(167) = 5.38, p = 2.50e-07 < 0.05), so we reject the null hypothesis. Learners who use a more diverse vocabulary in their texts also tend to use longer words.