Total: 25 points
Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This
can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above beside
author:.
3. Before beginning, it is good practice to set a working directory that
contains your R scripts as well as any data you will need. This can be
done in the console directly with the setwd() function or
via the RStudio menu (Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be
written within the chunks provided for each question. These chunks begin
with three back ticks and the letter r in curly brackets
(```{r}) and end with three back ticks. You can add as much
space as you need within the chunks but do not delete the back ticks or
otherwise modify the chunks in any way or the file will cause errors
when compiled.
5. When you have answered all of the questions, click the
Knit button. This will create an HTML file in your working
directory.
6. Upload the HTML file to Moodle.
Data description:
The lcf.csv dataset contains data from the
Leerdercorpus Frans, a learner corpus of L2 French. The
variables in the data frame are:
ID: anonymous ID corresponding to the learner who wrote the text
CEFR: the proficiency level of the text (B2, C1 or C2)
TOPIC: the writing topic
LING.NOUN: the average frequency of nouns related to “linguistics” (based on a corpus of linguistics articles written by native speakers)
RTTR: root type-token ratio (a measure of lexical diversity)
AWL: average word length
Hint:
While completing the assignment, it may be helpful to keep the following
questions in mind:
setwd("~/Desktop/Satistics for linguistics/Assignment")
lcf <- read.csv("lcf.csv", header = TRUE, stringsAsFactors = TRUE)
attach(lcf)
str(lcf)
## 'data.frame': 169 obs. of 6 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ CEFR : Factor w/ 3 levels "B2","C1","C2": 2 2 2 2 2 2 3 2 2 2 ...
## $ TOPIC : Factor w/ 7 levels "delinquency",..: 2 6 2 6 2 2 2 2 2 7 ...
## $ LING.NOUN: num 204 414 234 290 304 ...
## $ RTTR : num 7.99 9.06 8.25 9.04 8.64 ...
## $ AWL : num 4.36 4.84 4.55 4.82 4.34 ...
Answer:
H0: lexical diversity does not increase with proficiency (r = 0)
H1: lexical diversity increases with proficiency (r ≠ 0)
SUMA <- tapply(RTTR, CEFR, summary)
SUM_T <- table(RTTR, CEFR)
tapply(RTTR,CEFR, median)
## B2 C1 C2
## 10.05773 10.13688 11.01353
tapply(RTTR,CEFR, IQR)
## B2 C1 C2
## 1.908718 2.094596 1.513173
boxplot(RTTR ~ CEFR)  # plot the raw RTTR values per CEFR level, not their summary statistics
# Code the CEFR factor as numbers following its level order (B2 = 1, C1 = 2, C2 = 3)
CEFR_NUM <- as.numeric(CEFR)
cor.test(RTTR, CEFR_NUM)
##
## Pearson's product-moment correlation
##
## data: RTTR and CEFR_NUM
## t = 2.2579, df = 167, p-value = 0.02525
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.02171548 0.31488988
## sample estimates:
## cor
## 0.1721113
Answer: A correlation test was carried out to check whether lexical diversity increases with proficiency level. The p-value (0.025) is below the significance level of 0.05, so the null hypothesis is rejected: lexical diversity does increase with proficiency, although the correlation is weak (r = 0.17).
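The figures in the cor.test() output can be reproduced by hand, which makes the link between r, the t statistic, the p-value and the confidence interval explicit. The following is a minimal sketch that uses only the values printed in that output (r = 0.1721113, df = 167); the object names are illustrative and not part of the assignment code.

```r
# Values copied from the cor.test() output
r  <- 0.1721113   # sample correlation
df <- 167         # degrees of freedom, n - 2
n  <- df + 2      # number of texts (169)

# t statistic for testing H0: r = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

# Share of variance in RTTR accounted for by the numeric CEFR coding
r_squared <- r^2

print(round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2], r2 = r_squared), 4))
```

Since r squared is about 0.03, the numeric CEFR coding accounts for only a small share of the variation in RTTR, so a significant p-value here should not be read as a strong relationship.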
Answer:
H0: texts on the topic of freedom were not rated more proficient than texts on the topic of language
H1: texts on the topic of freedom were rated more proficient than texts on the topic of language
table(TOPIC,CEFR)
## CEFR
## TOPIC B2 C1 C2
## delinquency 11 42 13
## euthanasia 2 23 6
## foreigners 1 3 0
## freedom 7 13 4
## language 4 13 13
## nuclear 1 9 0
## state_reform 0 3 1
bar_plot <- lcf[lcf$TOPIC %in% c('language', 'freedom'), c('TOPIC', 'CEFR')]
bar_plot$TOPIC <- factor(bar_plot$TOPIC, levels = c('language', 'freedom'))
# One colour per CEFR level (B2, C1, C2) so that the legend matches the bars
barplot(table(bar_plot$CEFR, bar_plot$TOPIC), beside = TRUE, legend = TRUE,
        col = c('lightblue', 'lightgreen', 'lightpink'),
        main = 'Bar Plot of CEFR by TOPIC',
        xlab = 'TOPIC', ylab = 'Frequency')
data_lf <- lcf[lcf$TOPIC %in% c('language', 'freedom'), c('TOPIC', 'CEFR')]
data_lf$TOPIC <- factor(data_lf$TOPIC, levels = c('language', 'freedom'))
table_lf <- table(data_lf$CEFR, data_lf$TOPIC)
fisher.test(table_lf)
##
## Fisher's Exact Test for Count Data
##
## data: table_lf
## p-value = 0.07322
## alternative hypothesis: two.sided
Answer: Fisher's exact test gave a p-value of 0.073, which is greater than 0.05, so we fail to reject the null hypothesis. There is no significant evidence that texts on the topic of freedom were rated more proficient than texts on the topic of language.
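As a cross-check, the same test can be run directly on the counts shown in the table(TOPIC, CEFR) output, without the data file. This is a minimal sketch: the matrix is typed in by hand from the freedom and language rows of that table, and the object names (counts, props, res) are illustrative.

```r
# Counts copied from the 'freedom' and 'language' rows of table(TOPIC, CEFR)
counts <- matrix(c(7, 13,  4,
                   4, 13, 13),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(TOPIC = c("freedom", "language"),
                                 CEFR  = c("B2", "C1", "C2")))

# Row proportions: share of each CEFR level within each topic
props <- prop.table(counts, margin = 1)
print(round(props, 2))

# Fisher's exact test on the 2 x 3 table
# (the 'alternative' argument only applies to 2 x 2 tables, so this test is two-sided)
res <- fisher.test(counts)
print(res$p.value)
```

The row proportions show the direction of any difference, which the p-value alone does not: the share of C2 texts is 4/24 for freedom and 13/30 for language.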