Instructions:
1. Rename this file by replacing “LASTNAME” with your last name. This can be done via the RStudio menu (File >> Rename).
2. Write your full name in the chunk above beside author:.
3. Before beginning, it is good practice to create a directory that contains your R scripts (this file) as well as any data you will need (the “metadata_ICLEv2.csv” file), and to set it as your working directory. The working directory can be set in the console with the setwd() function or via the RStudio menu (Session >> Set Working Directory).
4. Write R code to answer the questions below. The code should be written within the chunks provided for each question. These chunks begin with three back ticks and the letter r in curly brackets (```{r}) and end with three back ticks. You can add as much space as you need within the chunks, but do not delete the back ticks or otherwise modify the chunk delimiters, or the file will cause errors when compiled.
5. When you have answered all of the questions, click the Knit button. This will create an HTML file in your working directory.
6. Upload the HTML file to Moodle.

1 Load the ICLE metadata file (“metadata_ICLEv2.csv”) into a dataframe called “ICLEv2”.

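# read.delim() returns a data frame already, so no data.frame() conversion is needed
ICLEv2 <- read.delim("~/Desktop/Exercise/metadata_ICLEv2.csv")
# attach() makes the columns available by name (length, age, ...) in the code below
attach(ICLEv2)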
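
The call above uses read.delim(), which by default expects tab-separated fields, although the file carries a .csv extension. Which separator the real file uses is not shown here. The following is a minimal sketch on made-up inline data (csv_text and tsv_text are hypothetical stand-ins, not the ICLE metadata) of how the two readers behave, so that the right one can be chosen after inspecting the file.

```r
# Hypothetical two-row tables, one comma-separated and one tab-separated
csv_text <- "country,length\nItaly,569\nSpain,493"
tsv_text <- "country\tlength\nItaly\t569\nSpain\t493"

# Matching reader and separator: two columns each
from_csv <- read.csv(text = csv_text)
from_tsv <- read.delim(text = tsv_text)

# Mismatch: read.delim() finds no tabs, so every row ends up in one column
mismatch <- read.delim(text = csv_text)

ncol(from_csv)
ncol(from_tsv)
ncol(mismatch)
```

If a freshly loaded dataframe has a single column, a separator mismatch of this kind is a likely cause.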

3 How many data points are there in the ICLEv2 dataframe?

dim(ICLEv2)
## [1] 6085   33

Answer: 6085

4 How many variables are there in the ICLEv2 dataframe?

dim(ICLEv2)
## [1] 6085   33

Answer: 33
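
dim() returns both counts at once, rows first and columns second. As a small sketch on a made-up dataframe (toy is hypothetical), nrow() and ncol() return each count separately, which avoids having to remember the order.

```r
# Hypothetical dataframe with 4 rows (data points) and 2 columns (variables)
toy <- data.frame(id = 1:4, length = c(569, 493, 788, 607))

n_rows <- nrow(toy)  # number of data points
n_cols <- ncol(toy)  # number of variables

dim(toy)             # same two numbers: rows, then columns
```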

5 What type of variable is stored under the “conditions” variable?

table(conditions)
## conditions
## No Timing     Timed   Unknown 
##      3793      2051       241

Answer: “conditions” is a nominal (categorical) variable: its values are labels with no inherent order.
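
table() shows the categories but not how R stores the variable. The sketch below, on a made-up vector (conditions_toy is hypothetical), shows how class(), levels() and is.ordered() can be used to check the storage type and whether the categories are ordered.

```r
# Hypothetical values mirroring the three categories shown above
conditions_toy <- c("Timed", "No Timing", "Unknown", "Timed")

class(conditions_toy)   # character vectors hold plain labels

# Converting to a factor makes the set of categories explicit
conditions_factor <- factor(conditions_toy)
levels(conditions_factor)      # categories, sorted alphabetically
is.ordered(conditions_factor)  # FALSE: no order is implied, i.e. nominal
```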

6 What are the last 3 values of the “length” variable?

tail(length, 3)
## [1] 607 585 576

Answer: 607, 585, 576

7 How many texts were written under “exam” conditions?

table(exam)
## exam
##      No Unknown     Yes 
##    3738     371    1976

Answer: 1976 texts were written under exam conditions.

8 How many texts come from Italy?

table(country)
## country
##         Austria         Belgium        Botswana        Bulgaria China-Hong Kong 
##              70             473             161             302             800 
##  China-Mainland  Czech Republic         Finland         Germany           Italy 
##             179             241             391             302             401 
##           Japan     Netherlands          Norway           Other          Poland 
##             366             109             317              48             363 
##          Russia    South Africa           Spain          Sweden     Switzerland 
##             266             358             250             342              60 
##          Turkey         Unknown 
##             280               6

Answer: 401 texts come from Italy.

9 What is the mean length of texts?

mean(length)
## [1] 616.7675

Answer: The mean text length is 616.7675.

10 What is the standard deviation of text length?

sd(length)
## [1] 269.2484

Answer: The standard deviation of text length is 269.2484.

11 Generate a graph to explore frequency distribution in the length variable.

Answer:

hist(length)

12 Can we use the mean to summarize the “length” variable?

Answer: Not reliably. The distribution is not normal but strongly right-skewed: the maximum (4139) lies far above the third quartile (696), and the mean (616.8) is pulled above the median (554). The median is therefore a better summary of the “length” variable.

summary(length)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    92.0   472.0   554.0   616.8   696.0  4139.0
mean(length)
## [1] 616.7675
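
To illustrate why a few very long texts matter, here is a minimal sketch on made-up numbers (lengths_toy is hypothetical, not taken from the dataset). One extreme value moves the mean a long way, while the median and the interquartile range barely react.

```r
# Hypothetical text lengths: four typical values and one extreme value
lengths_toy <- c(450, 500, 550, 600, 4000)

toy_mean   <- mean(lengths_toy)    # 1220: pulled upwards by the extreme value
toy_median <- median(lengths_toy)  # 550: unaffected by how extreme it is
toy_iqr    <- IQR(lengths_toy)     # 100: spread of the middle half only

# A mean well above the median is a quick sign of right skew
toy_mean > toy_median
```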

13 What is the percentage of texts written by males in the dataset?

prop.table(table(sex))*100
## sex
##     Female       Male    Unknown 
## 76.2037798 23.2046015  0.5916187

Answer: 23.20% of the texts in the dataset were written by males.
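
The percentages can also be rounded in R rather than by hand. A small sketch on made-up data (sex_toy is hypothetical) wraps the same prop.table() call in round():

```r
# Hypothetical sample: three texts by female writers, one by a male writer
sex_toy <- c("Female", "Female", "Female", "Male")

# Proportions per category, converted to percentages, rounded to 2 decimals
pct <- round(prop.table(table(sex_toy)) * 100, 2)
pct

# A single category can be pulled out by name
pct[["Male"]]
```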

14 What is the problem with the “age” variable?

unique(age)
##  [1] 20 21 22 23 19 -1 28 27 25 18 38 30 26 29 41 24 35 40 34 42 48 36 31 37 44
## [26] 43 49 46 45 33 32 54 50 55 17 39 47 51 56 53 66 71 61 57

Answer: The variable contains the value -1, which is not a possible age; it presumably stands in for a missing value.

15 What is the interquartile range for “age”?

Remember to remove the problematic values you discovered in Question 14.

a<-age
b<-which(a==-1) # indices of the elements of a that are equal to -1
ag<-a[-c(b)]
quantile(ag)
##   0%  25%  50%  75% 100% 
##   17   20   21   23   71
IQR(ag)
## [1] 3

Answer: The interquartile range for age is 3.
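
An alternative to dropping elements by index is sketched below on made-up ages (age_toy and no_bad are hypothetical). It recodes -1 as NA and lets IQR() skip the missing values. It also shows a known pitfall of the negative-index pattern: when which() finds no match, indexing with the negated empty result returns an empty vector instead of the full one.

```r
# Hypothetical ages with -1 standing in for missing values
age_toy <- c(20, 21, -1, 23, 19, -1, 28)

# Recode the placeholder as NA instead of deleting elements
age_clean <- age_toy
age_clean[age_clean == -1] <- NA
n_missing <- sum(is.na(age_clean))

# na.rm = TRUE tells IQR() to ignore the NA entries
iqr_clean <- IQR(age_clean, na.rm = TRUE)

# Pitfall: with no -1 present, which() returns integer(0),
# and x[-integer(0)] selects nothing at all
no_bad  <- c(20, 21)
dropped <- no_bad[-which(no_bad == -1)]  # empty vector
kept    <- no_bad[no_bad != -1]          # logical indexing keeps both values
```

Logical indexing or NA recoding gives the same result as the index-based approach when -1 is present, and stays correct when it is not.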

16 What is the most frequent age? How many learners are that age?

which.max(table(ag))
## 20 
##  4
table(ag==20)
## 
## FALSE  TRUE 
##  4741  1106

Answer: The most frequent age is 20, and 1106 learners are that age.
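
In the which.max() output above, the first line is the age and the second line is its position in the frequency table, not its count, which is why a second call is needed. As a minimal sketch on made-up ages (ag_toy is hypothetical), both the most frequent value and its frequency can be read from a single table:

```r
# Hypothetical ages: 20 occurs three times, 21 twice, 22 once
ag_toy <- c(20, 21, 20, 22, 20, 21)

counts <- table(ag_toy)

# names() holds the ages as text; which.max() gives the position of the top count
mode_value <- as.numeric(names(counts)[which.max(counts)])
mode_count <- max(counts)

mode_value
mode_count
```

If several ages share the highest count, which.max() reports only the first of them.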