Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above beside author:.
4. Before beginning, it is good practice to create a directory that contains your R scripts as well as any data you will need, and to set it as your working directory. The working directory can be set in the console with the setwd() function or via the RStudio menu (Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be written within the chunks provided for each question. These chunks begin with three back ticks and the letter r in curly brackets (```{r}) and end with three back ticks. You can add as much space as you need within the chunks but do not delete the back ticks or otherwise modify the chunks in any way or the file will cause errors when compiled.
5. When you have answered all of the questions, click the Knit button. This will create an HTML file in your working directory.
6. Upload the HTML file to Moodle.

Data description:
The icle.csv dataset contains the metadata and several lexical diversity measures for the International Corpus of Learner English. The metadata include:
- FILE: file number
- COUNTRY: country where the text was written
- L1: the native language of the learner
- SEX: the sex of the learner
- AGE: the age of the learner
- TYPE: the register
- CEFR: the proficiency level
- EXAM: whether the text was written under exam conditions
- CONDITIONS: whether the text was produced in a timed condition
- REFTOOLS: whether reference tools were used
The rest of the variables are lexical diversity variables (see Lu, 2012).

Note: the S stands for “sophisticated” in SWORDTYPES, SLEXTYPES, SWORDTOKENS and SLEXTOKENS.

1 Load the full ICLE dataset (“icle.csv”) into a dataframe called “icle”. 1 point

icle <- read.csv("icle.csv")

3 What is wrong with the TYPE variable? Replace any nonsensical values with NAs. 2 points

Note: To see NAs when you call table(), use the argument useNA = "always".

unique(icle$TYPE)
## [1] "Literary"      "Argumentative" "5"             "Other"
icle$TYPE[icle$TYPE == "5"] <- NA
table(icle$TYPE, useNA = "always")
## 
## Argumentative      Literary         Other          <NA> 
##           429            32             8             5

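Answer: TYPE should contain only register labels, but five texts have the value “5”, which is not a register. The code above sets these values to NA.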
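
A minimal sketch of the same cleanup on a toy vector. The values and the list of valid labels below are assumptions made for illustration and are not taken from icle.csv. Instead of naming the bad value, it treats anything outside a set of known register labels as missing.

```r
# Toy stand-in for icle$TYPE; these values are hypothetical.
type <- c("Literary", "Argumentative", "5", "Other", "Argumentative", "5")

# Assumed set of valid register labels.
valid <- c("Argumentative", "Literary", "Other")

# Anything that is not a known label becomes NA.
type_clean <- type
type_clean[!(type_clean %in% valid)] <- NA

# useNA = "always" keeps the NA count visible in the table.
type_counts <- table(type_clean, useNA = "always")
print(type_counts)
```

Checking against a list of valid labels also catches nonsensical values other than “5”, should any appear.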

4 What are the four least common L1s in the corpus? 2 points

sort(table(icle$L1))
## 
##          Albanian            Arabic           Chinese  Chinese-Mandarin 
##                 1                 1                 1                 1 
##              Urdu           Punjabi           Persian        Lithuanian 
##                 9                11                15                17 
## Chinese-Cantonese             Czech        Macedonian         Bulgarian 
##                18                19                19                20 
##             Dutch           Finnish            French            German 
##                20                20                20                20 
##             Greek         Hungarian           Italian          Japanese 
##                20                20                20                20 
##         Norwegian            Polish        Portuguese           Serbian 
##                20                20                20                20 
##           Spanish           Swedish            Tswana           Turkish 
##                20                20                20                20 
##           Russian 
##                22

Answer: The four least common L1s are Albanian, Arabic, Chinese and Chinese-Mandarin, each of which occurs only once.
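
The full sorted table can be trimmed to the least common categories with head(). The sketch below uses a toy vector with invented values rather than the real L1 column.

```r
# Toy stand-in for icle$L1; these values are hypothetical.
l1 <- c("German", "German", "German", "Urdu", "Urdu", "Albanian",
        "Arabic", "French", "French", "French")

# table() counts each L1; sort() orders the counts from smallest to largest.
l1_counts <- sort(table(l1))

# head() keeps the first n entries, i.e. the n least common L1s.
least_common <- head(l1_counts, 3)
print(least_common)
```

When several categories are tied at the cutoff, head() simply returns the first ones in the sorted order, so it is worth checking the counts before reporting the result.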

5 What is the average age of the writers whose L1 is German? 2 points

German<- icle[icle$L1 == "German", "AGE"]
mean(German)
## [1] 22.9

Answer: The average age of the writers whose L1 is German is 22.9 years.
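
The subsetting step can be sketched on a toy data frame. The column names mirror icle.csv, but the rows are invented. The sketch also shows na.rm = TRUE, which matters only if some ages are missing.

```r
# Toy data frame; column names mirror icle.csv but the rows are hypothetical.
toy <- data.frame(
  L1  = c("German", "German", "French", "German", "French"),
  AGE = c(21, 24, 30, NA, 26)
)

# Select AGE for the rows where L1 is German.
german_age <- toy$AGE[toy$L1 == "German"]

# na.rm = TRUE drops missing ages; without it a single NA makes the mean NA.
german_mean <- mean(german_age, na.rm = TRUE)
print(german_mean)
```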

6 What type of variable is SENTENCES? 1 point

class(icle$SENTENCES)
## [1] "integer"

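Answer: SENTENCES is stored as an integer. It is a numeric (quantitative) count variable measured on a ratio scale, not a nominal one: the values are ordered, differences between them are meaningful, and zero means no sentences.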

7 Calculate a measure of central tendency for the SENTENCES variable as well as a measure of dispersion. 2 points

mean(icle$SENTENCES)
## [1] 33.47679
sd(icle$SENTENCES)
## [1] 11.25359

Answer: Because SENTENCES is numeric, the mean is an appropriate measure of central tendency (33.47679 sentences) and the standard deviation is an appropriate measure of dispersion (11.25359 sentences).
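
For count variables with a few extreme values, the median and interquartile range are a common alternative pair of measures. The sketch below compares both pairs on invented sentence counts; it does not use the real data.

```r
# Hypothetical sentence counts, including one unusually long text.
sentences <- c(20L, 25L, 30L, 32L, 35L, 38L, 90L)

# Mean and standard deviation are pulled upward by the extreme value.
m <- mean(sentences)
s <- sd(sentences)

# Median and interquartile range are less sensitive to it.
med <- median(sentences)
iqr <- IQR(sentences)

print(c(mean = m, sd = s, median = med, IQR = iqr))
```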

8 The variable SWORDTOKENS is currently a measure of absolute frequency (the number of sophisticated words). Transform it to a measure of relative frequency. 1 point

total<-sum(icle$SWORDTOKENS)
icle$SWORDTOKENS/total
##   [1] 0.0019621654 0.0018363856 0.0016351379 0.0018112296 0.0032451197
##   [6] 0.0008049909 0.0042010465 0.0012577983 0.0021634132 0.0015596700
##  [11] 0.0011320185 0.0025910646 0.0016099819 0.0013332663 0.0015345140
##  [16] 0.0028929362 0.0007546790 0.0028677802 0.0013332663 0.0028174683
##  [21] 0.0017357617 0.0026665325 0.0022137251 0.0018866975 0.0024904407
##  [26] 0.0026162206 0.0026665325 0.0007546790 0.0026162206 0.0025407527
##  [31] 0.0014087342 0.0023646609 0.0037482391 0.0017609177 0.0021382572
##  [36] 0.0032451197 0.0014087342 0.0021382572 0.0014338901 0.0020376333
##  [41] 0.0022640370 0.0017357617 0.0011320185 0.0008804588 0.0011571745
##  [46] 0.0025155967 0.0035973033 0.0020627893 0.0017609177 0.0018112296
##  [51] 0.0016099819 0.0011320185 0.0018112296 0.0025910646 0.0017106058
##  [56] 0.0018363856 0.0019370095 0.0018363856 0.0035721473 0.0018866975
##  [61] 0.0024904407 0.0014590461 0.0027420004 0.0025155967 0.0034966794
##  [66] 0.0038237070 0.0026916885 0.0020376333 0.0022891930 0.0015345140
##  [71] 0.0028677802 0.0018363856 0.0005534313 0.0019621654 0.0022640370
##  [76] 0.0026413765 0.0022640370 0.0024149728 0.0025659086 0.0014590461
##  [81] 0.0022388811 0.0018363856 0.0009056148 0.0022640370 0.0018112296
##  [86] 0.0018363856 0.0016099819 0.0008301469 0.0022891930 0.0015848259
##  [91] 0.0006792111 0.0013332663 0.0013081103 0.0015093580 0.0019873214
##  [96] 0.0013332663 0.0017357617 0.0032199638 0.0030941839 0.0010313946
## [101] 0.0018363856 0.0020376333 0.0011320185 0.0026665325 0.0014590461
## [106] 0.0018363856 0.0034463675 0.0029684041 0.0028174683 0.0021382572
## [111] 0.0033205876 0.0017357617 0.0023898169 0.0026162206 0.0017609177
## [116] 0.0018615416 0.0043016704 0.0011571745 0.0022388811 0.0013835782
## [121] 0.0022891930 0.0021885691 0.0037482391 0.0013332663 0.0020627893
## [126] 0.0016602938 0.0016602938 0.0020879453 0.0014842021 0.0020124774
## [131] 0.0014338901 0.0017609177 0.0022137251 0.0016351379 0.0017609177
## [136] 0.0023898169 0.0019873214 0.0019370095 0.0019873214 0.0036727712
## [141] 0.0027671564 0.0041004226 0.0020124774 0.0025910646 0.0032954317
## [146] 0.0021885691 0.0024149728 0.0023898169 0.0024652848 0.0010062387
## [151] 0.0021885691 0.0011571745 0.0013584222 0.0013835782 0.0021885691
## [156] 0.0008804588 0.0022388811 0.0024904407 0.0018866975 0.0018615416
## [161] 0.0019873214 0.0016602938 0.0011823304 0.0026413765 0.0024149728
## [166] 0.0018112296 0.0011823304 0.0012829543 0.0018615416 0.0013332663
## [171] 0.0035973033 0.0016602938 0.0015345140 0.0029432481 0.0022891930
## [176] 0.0019118535 0.0022640370 0.0013332663 0.0010817066 0.0015345140
## [181] 0.0024904407 0.0017106058 0.0018615416 0.0024904407 0.0023143490
## [186] 0.0010313946 0.0024149728 0.0012577983 0.0021885691 0.0011823304
## [191] 0.0012326424 0.0015345140 0.0028677802 0.0026162206 0.0013584222
## [196] 0.0016099819 0.0024401288 0.0020124774 0.0018112296 0.0028174683
## [201] 0.0008804588 0.0018866975 0.0012577983 0.0009810827 0.0011823304
## [206] 0.0022891930 0.0029432481 0.0021382572 0.0021382572 0.0021382572
## [211] 0.0027168444 0.0023646609 0.0018112296 0.0026916885 0.0029180922
## [216] 0.0008301469 0.0016602938 0.0031948078 0.0019118535 0.0017609177
## [221] 0.0028174683 0.0021885691 0.0023898169 0.0026665325 0.0022388811
## [226] 0.0017357617 0.0019621654 0.0021382572 0.0030438720 0.0021131012
## [231] 0.0018112296 0.0027671564 0.0024149728 0.0031444959 0.0040249547
## [236] 0.0038740189 0.0017609177 0.0020879453 0.0038488630 0.0023646609
## [241] 0.0011320185 0.0018112296 0.0022891930 0.0013835782 0.0017106058
## [246] 0.0028929362 0.0013081103 0.0014338901 0.0017357617 0.0017357617
## [251] 0.0027923123 0.0021382572 0.0012829543 0.0015596700 0.0025155967
## [256] 0.0020627893 0.0050563494 0.0040249547 0.0015848259 0.0027923123
## [261] 0.0017860737 0.0025910646 0.0036979271 0.0023143490 0.0034715234
## [266] 0.0027168444 0.0031444959 0.0012074864 0.0022388811 0.0046538539
## [271] 0.0016099819 0.0023898169 0.0022640370 0.0022388811 0.0028677802
## [276] 0.0007546790 0.0021131012 0.0024149728 0.0028426243 0.0018363856
## [281] 0.0010817066 0.0030941839 0.0032702757 0.0015345140 0.0005534313
## [286] 0.0015848259 0.0020627893 0.0022388811 0.0014087342 0.0011823304
## [291] 0.0017860737 0.0016351379 0.0018363856 0.0012577983 0.0012577983
## [296] 0.0019621654 0.0020376333 0.0008553029 0.0015596700 0.0019370095
## [301] 0.0020879453 0.0016854498 0.0011320185 0.0023898169 0.0022891930
## [306] 0.0010565506 0.0015596700 0.0025659086 0.0013835782 0.0016854498
## [311] 0.0015596700 0.0014842021 0.0035218354 0.0013835782 0.0022388811
## [316] 0.0035469913 0.0020627893 0.0018363856 0.0034966794 0.0030690280
## [321] 0.0026162206 0.0006037432 0.0031193399 0.0012074864 0.0025155967
## [326] 0.0019621654 0.0020627893 0.0036727712 0.0029180922 0.0024904407
## [331] 0.0014338901 0.0023898169 0.0015848259 0.0013332663 0.0020879453
## [336] 0.0032451197 0.0033960555 0.0034966794 0.0018866975 0.0025407527
## [341] 0.0020376333 0.0013835782 0.0022891930 0.0016854498 0.0034463675
## [346] 0.0025407527 0.0018615416 0.0013332663 0.0019118535 0.0044274502
## [351] 0.0021634132 0.0023143490 0.0017357617 0.0020124774 0.0019118535
## [356] 0.0024904407 0.0033457436 0.0018363856 0.0022891930 0.0027168444
## [361] 0.0022137251 0.0024149728 0.0014590461 0.0012577983 0.0027923123
## [366] 0.0030690280 0.0015596700 0.0013081103 0.0013332663 0.0041004226
## [371] 0.0012577983 0.0022891930 0.0013332663 0.0022388811 0.0036224592
## [376] 0.0008049909 0.0030438720 0.0027671564 0.0017357617 0.0016854498
## [381] 0.0012829543 0.0017609177 0.0024652848 0.0032199638 0.0010313946
## [386] 0.0013081103 0.0028929362 0.0017106058 0.0016854498 0.0017860737
## [391] 0.0028174683 0.0016099819 0.0008301469 0.0012074864 0.0015093580
## [396] 0.0015093580 0.0020627893 0.0010817066 0.0026162206 0.0022891930
## [401] 0.0021634132 0.0018615416 0.0025659086 0.0018112296 0.0015345140
## [406] 0.0035973033 0.0022388811 0.0019873214 0.0010565506 0.0024904407
## [411] 0.0032451197 0.0041507346 0.0024149728 0.0031444959 0.0027923123
## [416] 0.0013332663 0.0023898169 0.0015345140 0.0018615416 0.0026916885
## [421] 0.0018866975 0.0033205876 0.0018112296 0.0019621654 0.0035218354
## [426] 0.0018363856 0.0022388811 0.0024904407 0.0019118535 0.0020124774
## [431] 0.0018615416 0.0024401288 0.0016351379 0.0022640370 0.0025910646
## [436] 0.0029684041 0.0017106058 0.0024149728 0.0029180922 0.0018615416
## [441] 0.0024149728 0.0028174683 0.0016602938 0.0029432481 0.0014087342
## [446] 0.0036979271 0.0021382572 0.0014338901 0.0021382572 0.0012829543
## [451] 0.0041004226 0.0026162206 0.0020627893 0.0024401288 0.0014590461
## [456] 0.0008553029 0.0017357617 0.0016351379 0.0011068625 0.0015848259
## [461] 0.0015596700 0.0021382572 0.0020124774 0.0011571745 0.0028174683
## [466] 0.0026413765 0.0015848259 0.0032451197 0.0009559267 0.0026916885
## [471] 0.0015093580 0.0015848259 0.0015093580 0.0011823304
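
The values above express each text's share of all sophisticated tokens in the corpus. Another reading of “relative frequency”, assumed here rather than stated in the question, normalises by text length: the number of sophisticated tokens divided by the same text's WORDTOKENS. A minimal sketch on invented counts, with hypothetical names for the new columns:

```r
# Toy data frame; column names mirror icle.csv but the counts are hypothetical.
toy <- data.frame(
  SWORDTOKENS = c(50, 30, 80),
  WORDTOKENS  = c(500, 600, 400)
)

# Share of each text's tokens that are sophisticated (per-text normalisation).
toy$SWORD_REL <- toy$SWORDTOKENS / toy$WORDTOKENS

# The same measure expressed per 1,000 tokens.
toy$SWORD_PER_1000 <- toy$SWORD_REL * 1000

print(toy)
```

Normalising by text length makes texts of different lengths comparable, which a division by the corpus total does not.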

9 In addition to lexical complexity, we are also interested in syntactic (grammatical) complexity. One way that this can be calculated is dividing the number of tokens (WORDTOKENS) by the number of sentences (SENTENCES) to get the average sentence length. Calculate the average sentence length and create a new variable in the data frame called ASL. 1 point

ASL <- icle$WORDTOKENS / icle$SENTENCES
icle$ASL <- ASL  # store the result as a new variable in the data frame
(ASL)
##   [1] 21.266667 15.724138 14.225000 51.100000 28.814815 25.450000 17.476190
##   [8] 24.812500 14.488889 17.090909 23.807692 17.425532 17.968750 17.222222
##  [15] 22.434783 22.648649 28.062500 20.896552 20.541667 24.576923 39.571429
##  [22] 28.040000 17.054054 23.482759 17.625000  9.800000 35.736842 20.458333
##  [29] 27.160000 19.250000 18.772727 23.666667 15.696970 32.105263 20.054054
##  [36] 16.228571 13.775000 19.062500 16.903226 29.720000 21.900000 15.340909
##  [43] 11.425926 22.086957 31.000000 24.684211 22.560976 19.612903 18.714286
##  [50] 20.966667 15.515152 10.375000 13.305085 16.477273 22.550000 21.942857
##  [57] 20.789474 22.000000 22.156250 19.628571 17.290323 16.394737 18.937500
##  [64] 19.458333 23.838710 18.040816 21.655172 29.583333 26.035714 17.370370
##  [71] 16.162162 27.038462 26.066667 20.958333 15.116667 15.755556 15.000000
##  [78] 17.261905 21.114286 17.758621 21.562500 16.861111 26.117647 19.718750
##  [85] 32.846154 20.911765 16.900000 17.108108 20.405405 16.846154 14.894737
##  [92] 29.333333 14.425532 22.043478 16.097561 17.931034 23.916667 10.783784
##  [99] 15.952381 18.571429 18.864865 19.884615 21.350000 19.888889 26.600000
## [106] 36.764706 24.727273 14.043478 20.636364 16.783784 18.218750 22.705882
## [113] 23.777778 17.395349 39.062500 20.064516 15.055556 19.642857 21.322581
## [120] 20.035714 17.868421 24.178571 17.260870 14.650000 19.173913 19.333333
## [127] 25.000000 24.172414 20.115385 19.843750 17.655172 23.500000 21.966667
## [134] 19.700000 17.843750 33.333333 20.531250 19.137931 22.387097 19.853659
## [141] 23.333333 25.896552 18.212121 13.361702 32.842105 25.166667 20.243243
## [148] 17.633333 15.625000 22.473684 26.600000 17.685714 31.312500 21.171429
## [155] 16.537037 34.888889 23.480000 20.387097 23.590909 26.137931 35.533333
## [162] 25.666667 13.888889 18.071429 18.600000 17.341463 13.195652 19.500000
## [169] 18.740741 20.269231 15.521739 22.173913 19.638889 11.901639 22.960000
## [176] 20.085714 15.309524 41.000000 13.750000 21.750000 12.761905 27.944444
## [183] 27.952381 14.177778 19.063830 17.969697 24.333333 39.727273 15.045455
## [190] 25.300000 20.615385 21.933333 27.863636 19.620690 14.250000 21.904762
## [197] 18.957447 13.800000 23.133333 14.595238 22.391304 22.450000 19.619048
## [204] 18.846154 20.680000 15.627907 13.229167 18.500000 20.931034 18.227273
## [211] 27.653846 19.812500 18.846154 19.972222 11.347826 15.205128 16.432432
## [218] 39.882353 20.142857 20.275862 19.515152 14.770833 18.363636 15.878049
## [225] 10.301587 19.960000 22.607143 18.461538 15.600000 18.025641 19.269231
## [232] 16.562500 21.161290 14.673469 16.413043 11.128571 19.343750 31.857143
## [239] 22.500000 32.000000  9.867925 18.448276 14.250000 19.360000 15.350000
## [246] 18.294118 18.600000 18.862069 13.866667 13.108696 16.789474 18.400000
## [253] 15.868421 23.888889 34.380952 16.755102 27.171429 20.131579 17.166667
## [260] 17.729730 22.322581 18.714286 17.068182 29.720000 23.333333 19.589744
## [267] 19.058824 19.280000 17.791667 20.257143 21.470588 45.600000 18.156250
## [274] 21.758621 20.210526 16.421053 19.687500 20.935484 19.081081 17.828571
## [281] 34.066667 12.979167 27.806452 15.875000 14.300000 19.750000 15.024390
## [288] 13.769231 21.657143 22.172414 19.000000 16.833333 11.358491 16.375000
## [295] 19.814815 21.111111 12.885246 21.750000 30.800000 15.629630 13.734694
## [302] 26.608696 14.200000 18.656250 21.972222 13.516129 25.176471 25.153846
## [309] 16.810811 33.578947 29.565217 18.878049 16.744681 23.250000 16.365854
## [316] 17.048780 28.153846 19.703704 16.720930 22.433333 12.981818 18.136364
## [323] 19.225000 39.687500 31.850000 34.650000 17.547619 26.756757 12.469388
## [330] 18.380952 17.413793 34.210526 15.055556 21.464286 18.718750 13.620000
## [337] 25.076923 20.750000 17.894737 28.333333 21.206897 27.565217 15.843750
## [344] 14.488372 20.297297 18.729730 13.254902 20.350000 17.156250 18.642857
## [351] 26.157895 35.304348 20.038462 21.437500 17.857143 14.227273 20.363636
## [358] 13.825000 19.243902 12.960784 26.428571 10.754098 20.882353 17.000000
## [365] 13.210526 14.490909 17.363636 15.581395 33.055556 21.848485 23.080000
## [372] 19.783784 22.086957 13.234043 18.027778 22.541667 18.975000 18.125000
## [379]  7.828125 15.875000 22.478261 15.923077 21.029412 19.066667 19.307692
## [386] 32.157895 27.583333 13.111111 25.000000 18.088235 12.196970 18.866667
## [393] 21.000000 26.478261 14.487805 14.536585 21.111111 24.476190 18.173913
## [400] 12.387755 25.400000 19.315789 25.000000 16.435897 16.250000 18.000000
## [407] 23.307692 16.285714 18.296296 19.307692 35.421053 21.914286 20.000000
## [414] 21.375000 18.281250 15.441176 26.066667 19.615385 22.821429 10.425926
## [421] 16.547170 14.345455 15.641026 20.640000 15.369565 17.153846 20.000000
## [428] 27.681818 20.461538 16.368421 23.269231 35.315789 16.931818 25.366667
## [435]  9.965517 22.250000 20.636364 16.243243 18.395349 15.857143 27.200000
## [442] 29.000000 16.421053 16.972973 14.000000 23.183673 20.727273 21.846154
## [449] 16.413793 14.658537 21.666667 16.292683 23.277778 15.894737 28.761905
## [456] 16.636364 25.714286 13.347826 15.666667 14.409091 27.478261 22.068966
## [463] 14.318182 19.769231 17.352941 16.000000 18.057143 16.365854 14.666667
## [470] 25.923077 19.222222 18.906250 37.166667 15.689655

10 Create two new variables, ASLc which is the centered values of ASL and ASLz which is the standardized values of ASL. 2 points

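ASLc <- ASL - mean(ASL)  # centered: each value minus the mean
ASLz <- ASLc / sd(ASL)   # standardized: centered value divided by the standard deviation

11 How many standard deviations is the ASL for production ITVE1003 away from the mean? 1 point

ITVE1003 <- icle[icle$FILE == "ITVE1003", ]
ITVE1003_ASL <- ITVE1003$WORDTOKENS / ITVE1003$SENTENCES
mean(ASL)
## [1] 20.38737
ITVE1003_ASL - mean(ASL)
## [1] 1.267805
ITVE1003_z <- (ITVE1003_ASL - mean(ASL)) / sd(ASL)

Answer: The ASL of production ITVE1003 is 1.267805 words above the mean of 20.38737. That difference is in words per sentence, not in standard deviations. The distance in standard deviations is ITVE1003_z, the difference divided by sd(ASL), which is the same as the ASLz value for this text.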
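
Centering and standardizing can be sketched on a small invented vector. Centering subtracts the mean from every value; standardizing then divides by the standard deviation, which gives each value's distance from the mean in standard deviations (its z-score). The variable names below are hypothetical.

```r
# Hypothetical average sentence lengths for five texts.
asl <- c(15, 18, 20, 22, 25)

# Centering subtracts the mean, so the centered values average to 0.
asl_c <- asl - mean(asl)

# Standardizing divides the centered values by the standard deviation.
asl_z <- asl_c / sd(asl)

# scale() performs both steps; as.numeric() drops its matrix attributes.
asl_z_scale <- as.numeric(scale(asl))

# z-score of the fourth text: how many SDs it lies from the mean.
print(asl_z[4])
```

scale() should be applied to the whole vector. Given a single value and a supplied center, it divides the centered value by its own magnitude and returns 1 or -1, which is not a z-score.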

12 Suppose that our research question is the following: “Does the average sentence length (ASL) increase with proficiency (CEFR)?”. Formulate a hypothesis and a null hypothesis. Which of the two variables is the dependent variable and which of the two is the independent variable? 2 points

Answer:
- H1: The average sentence length increases with proficiency.
- H0: The average sentence length does not increase with proficiency.
- Dependent variable: average sentence length (ASL)
- Independent variable: proficiency (CEFR)

13 Calculate the mean and standard deviation of ASL for each proficiency level. 1 point

B1 <- icle[icle$CEFR == "B1",]
B1_ASL<-B1$WORDTOKENS/B1$SENTENCES
mean(B1_ASL)
## [1] 18.2253
sd(B1_ASL)
## [1] 7.139761
B2 <- icle[icle$CEFR == "B2",]
B2_ASL<-B2$WORDTOKENS/B2$SENTENCES
mean(B2_ASL)
## [1] 20.39007
sd(B2_ASL)
## [1] 7.002471
C1 <- icle[icle$CEFR == "C1",]
C1_ASL<-C1$WORDTOKENS/C1$SENTENCES
mean(C1_ASL)
## [1] 20.49522
sd(C1_ASL)
## [1] 5.192608
C2 <- icle[icle$CEFR == "C2",]
C2_ASL<-C2$WORDTOKENS/C2$SENTENCES
mean(C2_ASL)
## [1] 21.06296
sd(C2_ASL)
## [1] 5.038336

Answer:
- B1: mean = 18.2253, standard deviation = 7.139761
- B2: mean = 20.39007, standard deviation = 7.002471
- C1: mean = 20.49522, standard deviation = 5.192608
- C2: mean = 21.06296, standard deviation = 5.038336
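
The per-level summaries can also be computed in one step with tapply(), which applies a function to a variable within each level of a grouping variable. A minimal sketch on an invented data frame (the column names mirror the assignment; the rows are hypothetical):

```r
# Toy data frame; the rows are hypothetical.
toy <- data.frame(
  CEFR = c("B1", "B1", "B2", "B2", "C1", "C1"),
  ASL  = c(16, 20, 19, 23, 18, 24)
)

# tapply() applies a function to ASL within each CEFR level.
asl_mean <- tapply(toy$ASL, toy$CEFR, mean)
asl_sd   <- tapply(toy$ASL, toy$CEFR, sd)

# Combine both summaries into one small table.
print(cbind(mean = asl_mean, sd = asl_sd))
```

This avoids repeating the same subsetting code for every level and keeps working if a level is added.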

14 Plot the data using a notched boxplot. 1 point

boxplot(B1_ASL,B2_ASL, C1_ASL, C2_ASL, notch=TRUE,names=c("B1", "B2","C1","C2")) 
text(1:4, c(mean(B1_ASL), mean(B2_ASL),mean(C1_ASL), mean(C2_ASL)) , c("+", "+","+","+"))
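
The same kind of plot can be drawn from a single data frame with the formula interface of boxplot(). The sketch below uses randomly generated values that are purely illustrative; the group sizes, means and spreads are assumptions.

```r
# Toy data frame; values are randomly generated and purely illustrative.
set.seed(1)
toy <- data.frame(
  CEFR = rep(c("B1", "B2", "C1", "C2"), each = 30),
  ASL  = c(rnorm(30, 18, 7), rnorm(30, 20, 7),
           rnorm(30, 20.5, 5), rnorm(30, 21, 5))
)

# Formula interface: one notched box per CEFR level.
bp <- boxplot(ASL ~ CEFR, data = toy, notch = TRUE,
              xlab = "CEFR level", ylab = "Average sentence length")

# Mark each group mean with a cross on top of its box.
group_means <- tapply(toy$ASL, toy$CEFR, mean)
points(seq_along(group_means), group_means, pch = 3)
```

If the notches of two boxes do not overlap, this is usually read as evidence that the two medians differ.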