Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above beside author:.
3. Before beginning, it is good practice to create a directory that contains your R scripts (this file) as well as any data you will need (the “metadata_ICLEv2.csv” file), and to set it as your working directory. The working directory can be set in the console directly with the setwd() function or via the RStudio menu (Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be written within the chunks provided for each question. These chunks begin with three back ticks and the letter r in curly brackets (```{r}) and end with three back ticks. You can add as much space as you need within the chunks but do not delete the back ticks or otherwise modify the chunks in any way or the file will cause errors when compiled.
5. When you have answered all of the questions, click the Knit button. This will create an HTML file in your working directory.
6. Upload the HTML file to Moodle.

1 Load the ICLE metadata file (“metadata_ICLEv2.csv”) into a dataframe called “ICLEv2”.

ICLEv2 <- read.delim("metadata_ICLEv2.csv", header = TRUE)
attach(ICLEv2)
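
As a side note, columns can also be reached without attach(), which avoids name clashes with base functions such as length(). The sketch below is only an illustration: `toy` is a hypothetical stand-in data frame, not the real ICLEv2 file, and its two columns are assumptions.

```r
# Hypothetical stand-in for ICLEv2 (the real file has 33 columns)
toy <- data.frame(
  country = c("Italy", "Spain", "Italy"),
  length  = c(607, 585, 576),
  stringsAsFactors = FALSE
)

# Two equivalent ways to reach a column without attach()
mean_dollar <- mean(toy$length)
mean_with   <- with(toy, mean(length))

# Compact overview of column names and types after loading
str(toy)
```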

3 How many data points are there in the ICLEv2 dataframe?

nrow(ICLEv2)
## [1] 6085

Answer: There are 6085 data points in the ICLEv2 dataframe

4 How many variables are there in the ICLEv2 dataframe?

ncol(ICLEv2)
## [1] 33

Answer: There are 33 variables in the ICLEv2 dataframe

5 What type of variable is stored under the “conditions” variable?

unique(conditions)
## [1] "No Timing" "Timed"     "Unknown"

Answer: The conditions column stores a nominal (categorical) variable: its values are unordered labels
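
The reasoning behind "nominal" can be checked in R itself. This is a minimal sketch on a hypothetical vector (`conditions_toy`) that mirrors the three labels printed above; it does not read the real data.

```r
# Hypothetical vector mirroring the three labels shown above
conditions_toy <- c("No Timing", "Timed", "Unknown", "Timed")

# Storage type of the vector: plain text labels
class(conditions_toy)

# Converting to a factor makes the categories explicit
cond_factor <- factor(conditions_toy)
levels(cond_factor)

# FALSE means the levels carry no ranking, i.e. the variable is nominal
is.ordered(cond_factor)
```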

6 What are the last 3 values of the “length” variable?

tail(length,3)
## [1] 607 585 576

Answer: The last 3 values of the length variable are 607, 585, and 576

7 How many texts were written under “exam” conditions?

table(exam)
## exam
##      No Unknown     Yes 
##    3738     371    1976

Answer: 1976 texts were written under exam conditions

8 How many texts come from Italy?

table(country == "Italy")
## 
## FALSE  TRUE 
##  5684   401

Answer: 401 texts come from Italy

9 What is the mean length of texts?

mean(length)
## [1] 616.7675

Answer: The mean length of texts is 616.7675

10 What is the standard deviation of text length?

sd(length)
## [1] 269.2484

Answer: The standard deviation of text length is 269.2484

11 Generate a graph to explore frequency distribution in the length variable.

Answer:

hist(length)

12 Can we use the mean to summarize the “length” variable?

Answer: The histogram suggests that the data are not normally distributed. If so, the mean is not a reliable summary of the length variable, and the median would be a more robust choice.
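
To illustrate why the mean can mislead for non-normal data, here is a small sketch with hypothetical lengths (`len_toy`); the values are invented for illustration and are not taken from ICLEv2.

```r
# Hypothetical text lengths with one very long text (not from ICLEv2)
len_toy <- c(400, 450, 500, 520, 550, 600, 2500)

# The mean is pulled upward by the single long text
mean_len <- mean(len_toy)

# The median stays near the bulk of the data
median_len <- median(len_toy)

c(mean = mean_len, median = median_len)
```

On the real data, comparing mean(length) with median(length) in the same way would show how far the two summaries diverge.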

13 What is the percentage of texts written by males in the dataset?

male <-table(sex == "Male")
prop.table(male)*100
## 
##   FALSE    TRUE 
## 76.7954 23.2046

Answer: 23.20% of the texts in the dataset were written by male learners
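
A possible shortcut for the same percentage: the mean of a logical vector is the proportion of TRUE values. The sketch uses a hypothetical vector (`sex_toy`), so the figure it prints is not the ICLEv2 result.

```r
# Hypothetical sex labels (not from ICLEv2): 2 of 8 are "Male"
sex_toy <- c("Male", "Female", "Female", "Female",
             "Male", "Female", "Female", "Female")

# TRUE counts as 1 and FALSE as 0, so mean() gives the proportion of males
pct_male <- round(mean(sex_toy == "Male") * 100, 2)
pct_male
```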

14 What is the problem with the “age” variable?

table(age)
## age
##   -1   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31 
##  238   13  135  828 1106  901  826  656  394  407  122  114   62   49   33   12 
##   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47 
##   18   21   22   24   21    5   11    5    8    3    7    3    2    6    6    7 
##   48   49   50   51   53   54   55   56   57   61   66   71 
##    4    2    2    3    1    2    1    1    1    1    1    1

Answer: The age variable contains the negative value -1 (238 occurrences), which is not a valid age and presumably codes missing data

15 What is the interquartile range for “age”?

Remember to remove the problematic values you discovered in Question 14.

age_cor <- age[age >=0]
IQR(age_cor)
## [1] 3

Answer: The IQR for age is 3 years
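
An alternative to filtering is to recode the invalid values as NA and let the summary functions skip them. This is a sketch on hypothetical ages (`age_toy`), assuming that negative values mark missing data.

```r
# Hypothetical ages with -1 used as a placeholder (not from ICLEv2)
age_toy <- c(-1, 19, 20, 20, 21, -1, 23)

# Recode invalid ages as NA instead of dropping them
age_clean <- age_toy
age_clean[age_clean < 0] <- NA

# IQR is the distance between the 75th and 25th percentiles
q <- quantile(age_clean, probs = c(0.25, 0.75), na.rm = TRUE)
iqr_manual  <- unname(q[2] - q[1])
iqr_builtin <- IQR(age_clean, na.rm = TRUE)

c(manual = iqr_manual, builtin = iqr_builtin)
```

Keeping the NA values preserves the original vector length, which matters if the ages are later compared row by row with other columns.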

16 What is the most frequent age? How many learners are that age?

table(age_cor)
## age_cor
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
##   13  135  828 1106  901  826  656  394  407  122  114   62   49   33   12   18 
##   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48 
##   21   22   24   21    5   11    5    8    3    7    3    2    6    6    7    4 
##   49   50   51   53   54   55   56   57   61   66   71 
##    2    2    3    1    2    1    1    1    1    1    1

Answer: The most frequent age is 20; a total of 1106 learners are that age.
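
Rather than reading the mode off the printed table, it can be extracted programmatically. The sketch below uses hypothetical ages (`age_toy`), not the ICLEv2 values.

```r
# Hypothetical ages (not from ICLEv2)
age_toy <- c(19, 20, 20, 21, 20, 22, 21)

# Frequency table of the ages
freq <- table(age_toy)

# which.max() returns the position of the largest count;
# names() gives the age label at that position (as a string)
mode_age <- names(freq)[which.max(freq)]
mode_n   <- as.integer(max(freq))

c(age = mode_age, n = mode_n)
```

Note that which.max() reports only the first maximum, so ties between ages would need a separate check.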