Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This
can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above, beside author:.
3. Before beginning, it is good practice to create a directory that
contains your R scripts (this file) as well as any data you will need
(the “metadata_ICLEv2.csv” file), and to set it as your working
directory. The latter can be done in the console directly with the
setwd() function or via the RStudio menu
(Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be
written within the chunks provided for each question. These chunks begin
with three back ticks and the letter r in curly brackets
(```{r}) and end with three back ticks. You can add as much
space as you need within the chunks but do not delete the back ticks or
otherwise modify the chunks in any way or the file will cause errors
when compiled.
5. When you have answered all of the questions, click the
Knit button. This will create an HTML file in your working
directory.
6. Upload the HTML file to Moodle.
ICLEv2 <- read.delim("metadata_ICLEv2.csv", header = TRUE)
attach(ICLEv2)
str(ICLEv2)
## 'data.frame': 6085 obs. of 33 variables:
## $ file : chr "BGSU1001" "BGSU1002" "BGSU1003" "BGSU1004" ...
## $ corpus_version : chr "1" "1" "1" "1" ...
## $ subcorpus_code : chr "BG" "BG" "BG" "BG" ...
## $ subcorpus_name : chr "Bulgarian" "Bulgarian" "Bulgarian" "Bulgarian" ...
## $ subsubcorpus_code: chr "BGSU" "BGSU" "BGSU" "BGSU" ...
## $ title : chr "Some people say that in our modern world, dominated by science and technology and industrialisation, there is n"| __truncated__ "Most University degrees are theoretical and do not prepare us for the real life. Do you agree or disagree?" "Some people say that in our modern world, dominated by science and technology and industrialisation, there is n"| __truncated__ "Most University degrees are theoretical and do not prepare us for the real life. Do you agree or disagree?" ...
## $ tagged : chr "No" "No" "No" "No" ...
## $ type : chr "Argumentative" "Argumentative" "Argumentative" "Argumentative" ...
## $ length : int 500 502 779 522 580 577 580 525 373 325 ...
## $ conditions : chr "No Timing" "No Timing" "No Timing" "No Timing" ...
## $ reftools : chr "Yes" "Yes" "Yes" "Yes" ...
## $ exam : chr "No" "No" "No" "No" ...
## $ age : int 20 20 20 20 21 21 21 21 21 21 ...
## $ sex : chr "Female" "Female" "Female" "Female" ...
## $ country : chr "Bulgaria" "Bulgaria" "Bulgaria" "Bulgaria" ...
## $ llanguage : chr "Bulgarian" "Bulgarian" "Bulgarian" "Bulgarian" ...
## $ homelang1 : chr "Bulgarian" "Bulgarian" "Bulgarian" "Bulgarian" ...
## $ homelang2 : chr "None" "None" "None" "None" ...
## $ homelang3 : chr "None" "None" "None" "None" ...
## $ instit : chr "Code48" "Code48" "Code48" "Code48" ...
## $ schooleng : num 8 8 8 8 10 10 8 8 8 8 ...
## $ unieng : num 2 2 2 2 2 2 2 2 2 2 ...
## $ monthseng : num 0 0 0 0 0 0 0 0 0 0 ...
## $ olang1 : chr "Spanish" "Spanish" "German" "German" ...
## $ olang2 : chr "Russian" "Russian" "None" "None" ...
## $ olang3 : chr "None" "None" "None" "None" ...
## $ date : chr "13/06/96 00:00:00" "06/06/96 00:00:00" "06/06/96 00:00:00" "06/06/96 00:00:00" ...
## $ status : chr "Complete" "Complete" "Complete" "Complete" ...
## $ comments : chr "-" "-" "-" "-" ...
## $ active : int 1 1 1 1 1 1 1 1 1 1 ...
## $ interface1 : int 1 1 1 1 1 1 1 1 1 1 ...
## $ instit2 : chr "Bulgaria - Sofia University « St. Kliment Ohridski »" "Bulgaria - Sofia University « St. Kliment Ohridski »" "Bulgaria - Sofia University « St. Kliment Ohridski »" "Bulgaria - Sofia University « St. Kliment Ohridski »" ...
## $ title2 : chr "Some people say that in our modern world, dominated by science and technology and industrialisation, there is n"| __truncated__ "Most University degrees are theoretical and do not prepare us for the real life. Do you agree or disagree?" "Some people say that in our modern world, dominated by science and technology and industrialisation, there is n"| __truncated__ "Most University degrees are theoretical and do not prepare us for the real life. Do you agree or disagree?" ...
nrow(ICLEv2)
## [1] 6085
Answer: There are 6085 data points (rows) in ICLEv2
ncol(ICLEv2)
## [1] 33
Answer: There are 33 variables (columns) in ICLEv2
unique(conditions)
## [1] "No Timing" "Timed" "Unknown"
Answer: The type of variable stored under the conditions column is nominal
tail(length, 3)
## [1] 607 585 576
Answer: The last 3 values of the length variable are 607, 585 and 576
table(exam)
## exam
## No Unknown Yes
## 3738 371 1976
Answer: 1976 texts were written under exam conditions
table(country == "Italy")
##
## FALSE TRUE
## 5684 401
Answer: 401 texts come from Italy
mean(length)
## [1] 616.7675
Answer: The mean length of texts is 616.7675
sd(length)
## [1] 269.2484
Answer: The standard deviation of text length is 269.2484
Answer:
hist(length)
Answer: The histogram suggests that the data are not normally distributed. If so, the mean is not a reliable summary of the length variable, and the median would be a more robust choice.
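As a rough numerical check on a histogram reading like this one, the mean can be compared with the median. The sketch below uses a small made-up vector (`toy_length` is hypothetical, not the ICLEv2 data) to show how a few very long texts pull the mean above the median.

```r
# Hypothetical text lengths, invented for illustration only
toy_length <- c(300, 350, 400, 420, 450, 480, 500, 1500, 2200)

toy_mean   <- mean(toy_length)    # sensitive to the two very long texts
toy_median <- median(toy_length)  # middle value, robust to extremes

# A mean well above the median points to a right-skewed distribution
toy_mean > toy_median
```

On the real data, the same comparison would be `mean(length)` against `median(length)`.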
male <- table(sex == "Male")
prop.table(male)*100
##
## FALSE TRUE
## 76.7954 23.2046
Answer: 23.20% of the texts in the dataset were written by male learners
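A count or percentage like this can also be computed without an intermediate table, since R treats TRUE as 1 and FALSE as 0 in arithmetic. A minimal sketch on a made-up vector (`toy_sex` is hypothetical, not taken from ICLEv2):

```r
# Hypothetical values, invented for illustration only
toy_sex <- c("Female", "Male", "Female", "Female", "Male",
             "Female", "Female", "Female")

n_male   <- sum(toy_sex == "Male")          # number of TRUE values
pct_male <- mean(toy_sex == "Male") * 100   # share of TRUE values, in percent

round(pct_male, 2)
```

The same idiom would apply to any of the yes/no counts above, for example `sum(country == "Italy")`.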
table(age)
## age
## -1 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
## 238 13 135 828 1106 901 826 656 394 407 122 114 62 49 33 12
## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## 18 21 22 24 21 5 11 5 8 3 7 3 2 6 6 7
## 48 49 50 51 53 54 55 56 57 61 66 71
## 4 2 2 3 1 2 1 1 1 1 1 1
Answer: The age variable contains the negative value -1 (238 occurrences), which cannot be a valid age and presumably codes a missing value
Remember to remove the problematic values you discovered in Question 11.
age_cor <- age[age >= 0]
IQR(age_cor)
## [1] 3
Answer: The IQR for age is 3 years
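An alternative to dropping the invalid values is to recode them as NA, which keeps the vector the same length as the data frame has rows. The sketch below assumes that -1 marks a missing age, which the source does not confirm, and it runs on a made-up vector (`toy_age` is hypothetical).

```r
# Hypothetical ages; -1 is ASSUMED to code a missing value
toy_age <- c(-1, 19, 20, 20, 21, 22, 23, -1, 25)

toy_age_na <- toy_age
toy_age_na[toy_age_na < 0] <- NA   # recode invalid ages as missing

# na.rm = TRUE makes IQR() ignore the NA entries
toy_iqr <- IQR(toy_age_na, na.rm = TRUE)
toy_iqr
```

Subsetting, as done above with `age[age >= 0]`, gives the same IQR; recoding is mainly useful when the cleaned values need to stay lined up with the other columns.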
table(age_cor)
## age_cor
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 13 135 828 1106 901 826 656 394 407 122 114 62 49 33 12 18
## 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
## 21 22 24 21 5 11 5 8 3 7 3 2 6 6 7 4
## 49 50 51 53 54 55 56 57 61 66 71
## 2 2 3 1 2 1 1 1 1 1 1
Answer: The most frequent age (the mode) is 20 years, which occurs 1106 times.
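The mode can also be extracted from the frequency table programmatically rather than read off by eye. A minimal sketch, using a made-up vector (`toy_age` is hypothetical) rather than the ICLEv2 ages:

```r
# Hypothetical ages, invented for illustration only
toy_age <- c(19, 20, 20, 21, 20, 22, 19, 23)

age_counts <- table(toy_age)

# which.max() returns the position of the first largest count;
# names() recovers the age that count belongs to
toy_mode <- as.numeric(names(age_counts)[which.max(age_counts)])
toy_mode
```

Note that `which.max()` reports only the first maximum, so tied modes would need a separate check.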