Total: 25 points

Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above beside author:.
3. Before beginning, it is good practice to create a directory that contains your R scripts as well as any data you will need, and to set it as your working directory. The working directory can be set in the console directly with the setwd() function or via the RStudio menu (Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be written within the chunks provided for each question. These chunks begin with three back ticks and the letter r in curly brackets (```{r}) and end with three back ticks. You can add as much space as you need within the chunks but do not delete the back ticks or otherwise modify the chunks in any way or the file will cause errors when compiled.
5. When you have answered all of the questions, click the Knit button. This will create an HTML file in your working directory.
6. Upload the HTML file to Moodle.

Data description:
The lcf.csv dataset contains data from the Leerdercorpus Frans, a learner corpus of L2 French. The variables in the data frame are:

ID: anonymous ID corresponding to the learner who wrote the text
CEFR: the proficiency level of the text (B2, C1 or C2)
TOPIC: the writing topic
LING.NOUN: the average frequency of nouns related to “linguistics” (based on a corpus of linguistics articles written by native speakers)
RTTR: root type-token ratio (a measure of lexical diversity)
AWL: average word length

Hint:
While completing the assignment, it may be helpful to keep the following questions in mind:

What kinds of variables are involved in your hypothesis (integer, ordinal, categorical etc.) and how many?
Are data points in your data related such that you can associate them to each other in a meaningful way?
What is the statistic of the dependent variable in the statistical hypothesis?
What does the distribution of the data of your test statistic look like?
How big are the samples you collected?
What assumptions must be met before running a particular statistical test?

1 Load the data set (“lcf.csv”) into a dataframe called “lcf”.

setwd("~/Desktop/Satistics for linguistics/Assignment")
lcf <- read.csv("lcf.csv", header = TRUE, stringsAsFactors = TRUE)
attach(lcf)
str(lcf)

## 'data.frame':    169 obs. of  6 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ CEFR     : Factor w/ 3 levels "B2","C1","C2": 2 2 2 2 2 2 3 2 2 2 ...
##  $ TOPIC    : Factor w/ 7 levels "delinquency",..: 2 6 2 6 2 2 2 2 2 7 ...
##  $ LING.NOUN: num  204 414 234 290 304 ...
##  $ RTTR     : num  7.99 9.06 8.25 9.04 8.64 ...
##  $ AWL      : num  4.36 4.84 4.55 4.82 4.34 ...

2 You want to test whether learners who wrote about linguistics topics (TOPIC = language) used more sophisticated linguistics-related nouns (LING.NOUN) in their texts than learners who wrote about freedom (TOPIC = freedom).

Note: for LING.NOUN, the lower the value, the more sophisticated the vocabulary. A text with a low value for LING.NOUN used more sophisticated nouns, whereas a text with a high value for LING.NOUN used less sophisticated nouns.

2.1 Formulate hypotheses. 1 point

Answer: H0: Learners who wrote about linguistics topics did not use more sophisticated linguistics-related nouns than their peers who wrote about freedom, i.e. the mean LING.NOUN for TOPIC = language is greater than or equal to the mean for TOPIC = freedom.

H1: Learners who wrote about linguistics topics used more sophisticated linguistics-related nouns than their peers who wrote about freedom, i.e. the mean LING.NOUN for TOPIC = language is lower than the mean for TOPIC = freedom.

2.2 Is your alternative hypothesis one-tailed or two-tailed? Explain. 1 point

Answer: The alternative hypothesis is one-tailed because it predicts a direction: texts about language are expected to contain more sophisticated nouns than texts about freedom (that is, lower LING.NOUN values), rather than merely a difference between the two groups.

2.3 Are the samples dependent or independent? Explain. 1 point

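Answer: The samples are independent. The language texts and the freedom texts form two separate groups, each text belongs to only one topic, and there is no meaningful way to pair an individual language text with an individual freedom text. The fact that all writers are learners does not make the samples dependent.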
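
Whether two groups share any writers can be checked by comparing the learner IDs in each group. The sketch below uses a small made-up data frame (`toy`, with invented IDs and topics) as a stand-in for lcf, so it only illustrates the idea and says nothing about the real data.

```r
# Hypothetical toy data standing in for lcf; the IDs and topics are made up.
toy <- data.frame(
  ID    = c(1, 2, 3, 4, 5, 6),
  TOPIC = c("language", "language", "freedom", "freedom", "language", "freedom")
)

ids_language <- toy$ID[toy$TOPIC == "language"]
ids_freedom  <- toy$ID[toy$TOPIC == "freedom"]

# IDs that occur in both groups; an empty result means that no learner
# contributes a text to both samples.
shared_ids <- intersect(ids_language, ids_freedom)
length(shared_ids) == 0
```

With the real data, the same comparison would use lcf$ID and lcf$TOPIC in place of the toy columns.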

2.4 Calculate descriptive statistics and represent the data graphically. 2 points

tab1 <- tapply(LING.NOUN, TOPIC, summary)
tapply(LING.NOUN, TOPIC, mean)

##  delinquency   euthanasia   foreigners      freedom     language      nuclear 
##     310.2887     258.8503     316.1601     334.4867     347.0069     358.3364 
## state_reform 
##     321.6552

tapply(LING.NOUN, TOPIC, sd)

##  delinquency   euthanasia   foreigners      freedom     language      nuclear 
##     67.14895     81.98270     36.40221     71.78894     81.46098     81.13195 
## state_reform 
##     91.21303

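# Plot the raw LING.NOUN values per topic; boxplot(tab1) would plot the six summary statistics instead
boxplot(LING.NOUN ~ TOPIC)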
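
Before choosing a test, the normality assumption of the t-test can be checked for each group. The sketch below is a minimal illustration with simulated values (group sizes and rough means taken from the tables in this assignment, everything else invented), and the rule of switching to a Wilcoxon test when normality is rejected is one common convention rather than the only defensible choice.

```r
set.seed(42)
# Simulated stand-ins for the two groups (assumption: not the real lcf values).
language <- rnorm(30, mean = 347, sd = 81)
freedom  <- rnorm(24, mean = 334, sd = 72)

# Shapiro-Wilk test per group; H0 is that the sample comes from a normal distribution.
sw_language <- shapiro.test(language)
sw_freedom  <- shapiro.test(freedom)

# TRUE when neither group departs significantly from normality at alpha = 0.05.
normal_enough <- sw_language$p.value > 0.05 && sw_freedom$p.value > 0.05

# Pick the test accordingly.
chosen_test <- if (normal_enough) "t.test" else "wilcox.test"
chosen_test
```

With the real data, the two vectors would be LING.NOUN[TOPIC == "language"] and LING.NOUN[TOPIC == "freedom"].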

2.5 Test your hypothesis with analytical statistics. 2 points

t.test(LING.NOUN[TOPIC=="language"], LING.NOUN[TOPIC=="freedom"])

## 
##  Welch Two Sample t-test
## 
## data:  LING.NOUN[TOPIC == "language"] and LING.NOUN[TOPIC == "freedom"]
## t = 0.59965, df = 51.472, p-value = 0.5514
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -29.38683  54.42710
## sample estimates:
## mean of x mean of y 
##  347.0069  334.4867
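
The call above runs a two-tailed test, while the hypothesis in 2.1 is directional. A one-tailed version can be requested with the alternative argument of t.test(); "less" is the direction that matches H1 here, because lower LING.NOUN means more sophisticated nouns. The sketch below uses simulated values (an assumption, not the real lcf data) and also shows how the one-tailed p-value relates to the two-tailed one. Applied to the reported output (t = 0.59965, p = 0.5514), the same relation gives 1 - 0.5514 / 2, about 0.72.

```r
set.seed(1)
# Simulated stand-ins for the two groups (assumption: not the real lcf values).
language <- rnorm(30, mean = 347, sd = 81)
freedom  <- rnorm(24, mean = 334, sd = 72)

# Welch test, two-tailed (the default) and one-tailed.
two_sided <- t.test(language, freedom)
one_sided <- t.test(language, freedom, alternative = "less")

# For alternative = "less": if t is positive the one-tailed p is 1 - p/2,
# otherwise it is p/2.
t_val   <- unname(two_sided$statistic)
derived <- if (t_val > 0) 1 - two_sided$p.value / 2 else two_sided$p.value / 2

all.equal(derived, one_sided$p.value)
```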

2.6 Summarize the result(s) briefly. 1 point

Answer:
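A Welch two-sample t-test found no significant difference in LING.NOUN between texts about language (mean = 347.01) and texts about freedom (mean = 334.49): t = 0.60, df = 51.47, p = 0.5514. Since p > 0.05, we fail to reject the null hypothesis. The test reported here is two-tailed, whereas the hypothesis is one-tailed; because the observed difference runs against the predicted direction (the language texts have the higher mean, i.e. the less sophisticated nouns), a one-tailed test would not be significant either.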

3 You want to test whether lexical diversity (RTTR) increases with proficiency (CEFR).

3.1 Formulate hypotheses. 1 point

Answer:
H0: Lexical diversity does not increase with proficiency (r ≤ 0).

H1: Lexical diversity increases with proficiency (r > 0).

3.2 Calculate descriptive statistics and represent the data graphically. 2 points

SUMA <- tapply(RTTR, CEFR, summary)
SUM_T <- table(RTTR, CEFR)
tapply(RTTR, CEFR, median)

##       B2       C1       C2 
## 10.05773 10.13688 11.01353

tapply(RTTR, CEFR, IQR)

##       B2       C1       C2 
## 1.908718 2.094596 1.513173

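# Plot the raw RTTR values per CEFR level; boxplot(SUMA) would plot the six summary statistics instead
boxplot(RTTR ~ CEFR)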

3.3 Test your hypothesis with analytical statistics. 2 points

# Convert the CEFR factor to its level codes (B2 = 1, C1 = 2, C2 = 3)
CEFR_NUM <- as.numeric(CEFR)
cor.test(RTTR, CEFR_NUM)

## 
##  Pearson's product-moment correlation
## 
## data:  RTTR and CEFR_NUM
## t = 2.2579, df = 167, p-value = 0.02525
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.02171548 0.31488988
## sample estimates:
##       cor 
## 0.1721113
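
Because CEFR is ordinal, a rank-based correlation is a common alternative to Pearson's r. The sketch below shows a one-tailed Spearman test on simulated data (the values, the sample size of 60 and the built-in upward trend are all assumptions made for illustration, not the real lcf data). The argument exact = FALSE is used because an exact p-value cannot be computed when there are ties, which are unavoidable with only three CEFR levels.

```r
set.seed(7)
# Simulated ordinal proficiency levels (assumption: not the real lcf values).
cefr <- factor(sample(c("B2", "C1", "C2"), 60, replace = TRUE),
               levels = c("B2", "C1", "C2"), ordered = TRUE)

# Simulated lexical diversity with a small upward trend across levels.
rttr <- rnorm(60, mean = 10, sd = 1.5) + 0.4 * as.numeric(cefr)

# One-tailed Spearman rank correlation; "greater" tests for a positive association.
res <- cor.test(rttr, as.numeric(cefr),
                method = "spearman", alternative = "greater", exact = FALSE)
res$estimate
res$p.value
```

With the real data, the call would use RTTR and CEFR_NUM; Kendall's tau is available in the same function via method = "kendall".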

3.4 Summarize the result(s) briefly. 3 points

Answer: A Pearson correlation test between RTTR and CEFR level (coded numerically as B2 = 1, C1 = 2, C2 = 3) showed a weak positive correlation: r = 0.17, t = 2.26, df = 167, p = 0.025. Since p < α = 0.05, the null hypothesis is rejected: lexical diversity increases with proficiency, although the relationship is weak. Note that CEFR is an ordinal variable, so treating its levels as equally spaced numbers is a simplification.

4 You want to test whether texts written on the topic of freedom were rated more proficient than texts written on the topic of language. In other words, whether there is an association between topic (TOPIC = language or TOPIC = freedom) and CEFR level (CEFR = B2, C1 or C2).

4.1 Formulate hypotheses. 1 point

Answer:
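H0: Texts on the topic of freedom were not rated more proficient than texts on the topic of language (there is no association between topic and CEFR level).

H1: Texts on the topic of freedom were rated more proficient than texts on the topic of language (there is an association between topic and CEFR level).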

4.2 Summarize the data numerically and represent the data graphically. 2 points

table(TOPIC,CEFR)

##               CEFR
## TOPIC          B2 C1 C2
##   delinquency  11 42 13
##   euthanasia    2 23  6
##   foreigners    1  3  0
##   freedom       7 13  4
##   language      4 13 13
##   nuclear       1  9  0
##   state_reform  0  3  1

bar_plot <- lcf[lcf$TOPIC %in% c('language', 'freedom'), c('TOPIC', 'CEFR')]
bar_plot$TOPIC <- factor(bar_plot$TOPIC, levels = c('language', 'freedom'))
# One bar per CEFR level within each topic, so three colours are needed
barplot(table(bar_plot$CEFR, bar_plot$TOPIC), beside = TRUE, legend = TRUE,
        col = c('lightblue', 'lightgreen', 'lightyellow'), main = 'CEFR level by TOPIC',
        xlab = 'TOPIC', ylab = 'Frequency')

4.3 Test your hypothesis with analytical statistics. 1 point

data_lf <- lcf[lcf$TOPIC %in% c('language', 'freedom'), c('TOPIC', 'CEFR')]
data_lf$TOPIC <- factor(data_lf$TOPIC, levels = c('language', 'freedom'))
table_lf <- table(data_lf$CEFR, data_lf$TOPIC)
fisher.test(table_lf)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  table_lf
## p-value = 0.07322
## alternative hypothesis: two.sided
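
To see which way the counts lean, it can help to look at row proportions next to the test. The sketch below rebuilds the language and freedom rows of the TOPIC by CEFR table from the counts printed in 4.2 (the object names `counts` and `row_props` are introduced here for illustration) and runs the same Fisher's exact test on them.

```r
# Counts copied from the table(TOPIC, CEFR) output for the two topics of interest.
counts <- matrix(
  c(4, 13, 13,
    7, 13,  4),
  nrow = 2, byrow = TRUE,
  dimnames = list(TOPIC = c("language", "freedom"),
                  CEFR  = c("B2", "C1", "C2"))
)

# Share of each CEFR level within each topic (each row sums to 1).
row_props <- prop.table(counts, margin = 1)
round(row_props, 2)

# Fisher's exact test on the 2 x 3 table; the orientation of the table
# (topics in rows or in columns) does not affect the p-value.
fisher_res <- fisher.test(counts)
fisher_res$p.value
```

The proportions show 13 of 30 language texts at C2 against 4 of 24 freedom texts, which is a descriptive pattern only; the test decides whether it is significant.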

4.4 Summarize the result(s) briefly. 1 point

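Answer: A Fisher's exact test on the 3 × 2 table of CEFR level by topic gave p = 0.073. Since p > 0.05, we fail to reject the null hypothesis: there is no significant association between topic and CEFR level, and no evidence that texts about freedom were rated more proficient than texts about language. Descriptively, the pattern runs the other way: 13 of the 30 language texts were rated C2, compared with 4 of the 24 freedom texts.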

5 You want to test whether lexical diversity (RTTR) is correlated with average word length (AWL). In other words, do learners who use a more diverse vocabulary in their texts also use longer words?

5.1 Formulate hypotheses. 1 point

Answer:
H0: Lexical diversity does not correlate with average word length (r = 0).

H1: Lexical diversity correlates with average word length (r ≠ 0).

5.2 Represent the data graphically. 1 point

plot(RTTR,AWL)

5.3 Test your hypothesis with analytical statistics. 1 point

cor.test(RTTR,AWL)

## 
##  Pearson's product-moment correlation
## 
## data:  RTTR and AWL
## t = 5.3784, df = 167, p-value = 2.503e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2476492 0.5058614
## sample estimates:
##       cor 
## 0.3842442
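
The numbers in this output are linked: the t statistic follows from r and the degrees of freedom, and r squared gives the proportion of variance the two variables share. The sketch below recomputes both from the reported values as a consistency check; the variable names are introduced here for illustration.

```r
r  <- 0.3842442   # sample correlation reported by cor.test(RTTR, AWL)
df <- 167         # degrees of freedom from the same output (n - 2)

# t statistic for a Pearson correlation: t = r * sqrt(df) / sqrt(1 - r^2)
t_from_r <- r * sqrt(df) / sqrt(1 - r^2)

# Two-tailed p-value from the t distribution with df degrees of freedom.
p_two_sided <- 2 * pt(abs(t_from_r), df = df, lower.tail = FALSE)

# Proportion of shared variance (coefficient of determination).
r_squared <- r^2

round(t_from_r, 4)
round(r_squared, 3)
```

The recomputed t agrees with the reported t = 5.3784, and r squared is roughly 0.15, so about 15% of the variance is shared between RTTR and AWL.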

5.4 Summarize the result(s) briefly. 1 point

Answer: A Pearson correlation test showed a significant, moderate positive correlation between lexical diversity and average word length: r = 0.38, t = 5.38, df = 167, p = 2.50e-07. Since p < 0.05, we reject the null hypothesis: learners who use a more diverse vocabulary also tend to use longer words.

LFIAL2260 Assignment 02

Gabriel Mejia Vargas

2023-12-13