- "Data are values of qualitative or quantitative variables, belonging to a set of items (i.e., populations)." (wiki)
- Linguistics: a data-intensive discipline?
- The Decision is YOURS: English teacher or (Linguistic) Data Scientist
Shu-Kai Hsieh 謝舒凱
GIL, National Taiwan University


In general, corpus data science involves a chain of tasks.
.. are two sides of the same coin: it is pointless to study one without the other.
(You simply cannot design an experiment and interpret the results without
understanding what the data are telling you and what they do not
and – even more importantly – cannot tell you.)

## One way (easiest and fastest)
dataset <- read.csv("http://www.biostat.jhsph.edu/~rpeng/coursera/selfquiz/selfquiz-data.csv")
## You may want to store a local copy for later
download.file("http://www.biostat.jhsph.edu/~rpeng/coursera/selfquiz/selfquiz-data.csv",
              "selfquiz-data.csv")
dataset <- read.csv("selfquiz-data.csv")
## Column names
names(dataset) ## colnames(dataset) also works
## Row names
rownames(dataset)
## First six rows
head(dataset, 6) ## print(dataset[1:6, ])
## Number of rows
nrow(dataset)
## Last six rows (the default for tail())
tail(dataset)
## Number of missing values in Ozone
miss <- is.na(dataset[, "Ozone"]) ## A vector of TRUE/FALSE
sum(miss) ## 37
## Mean of Ozone, ignoring missing values
mean(dataset[, "Ozone"], na.rm = TRUE) ## 42.13
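The same `is.na()` idea extends to every column at once. The sketch below is illustrative only. It assumes the CSV has the same contents as R's built-in `airquality` data set (the same six columns) and uses that built-in copy so it runs without a network connection. The names `aq`, `na_per_column` and `complete_rows` are made up for this example.

```r
## Assumption: the built-in airquality data stands in for selfquiz-data.csv
aq <- airquality

## is.na() on a data frame gives a logical matrix;
## colSums() then counts the TRUE values in each column
na_per_column <- colSums(is.na(aq))
print(na_per_column)

## Keep only the rows that have no missing value in any column
complete_rows <- aq[complete.cases(aq), ]
nrow(complete_rows)
```

Counting per column first is a cheap way to see which variables will force `na.rm = TRUE` later on.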
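## Rows where Ozone is above 31 and Temp is above 90
subset(dataset, Ozone > 31 & Temp > 90)
## Mean of each of the six columns, ignoring missing values
m <- numeric(6)
for (i in 1:6) {
  m[i] <- mean(dataset[, i], na.rm = TRUE)
}
print(m)
## Standard deviation of each column, ignoring missing values
s <- apply(dataset, 2, sd, na.rm = TRUE)
print(s)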
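The explicit loop over columns can also be written without a loop. This is a minimal sketch, not the only way to do it. It again assumes the built-in `airquality` data stands in for the downloaded CSV, and `aq`, `m_loop`, `m_vec` and `s_vec` are hypothetical names used only here.

```r
## Assumption: the built-in airquality data stands in for selfquiz-data.csv
aq <- airquality

## Loop version, generalized to however many columns the data frame has
m_loop <- numeric(ncol(aq))
for (i in seq_len(ncol(aq))) {
  m_loop[i] <- mean(aq[, i], na.rm = TRUE)
}

## Vectorized versions
m_vec <- colMeans(aq, na.rm = TRUE)    ## column means as a named vector
s_vec <- sapply(aq, sd, na.rm = TRUE)  ## column sds, one column at a time

## Both routes give the same means
stopifnot(isTRUE(all.equal(m_loop, unname(m_vec))))
```

`sapply()` works on the data frame column by column, whereas `apply(dataset, 2, ...)` first converts the data frame to a matrix; for an all-numeric data frame the results agree.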
## Mean of Ozone within each month, ignoring missing values
tapply(dataset$Ozone, dataset$Month, mean, na.rm = TRUE)
##     5     6     7     8     9
## 23.62 29.44 59.12 59.96 31.45
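A grouped mean like the one above can also be returned as a data frame, which is often easier to merge or plot. The sketch below uses `aggregate()` with a formula. It assumes, as an illustration, that the built-in `airquality` data matches the CSV; `aq` and `by_month` are names invented for this example.

```r
## Assumption: the built-in airquality data stands in for selfquiz-data.csv
aq <- airquality

## The formula interface drops rows with a missing Ozone or Month
## before grouping (default na.action = na.omit)
by_month <- aggregate(Ozone ~ Month, data = aq, FUN = mean)
print(by_month)
```

The result has one row per month, with columns `Month` and `Ozone`, and holds the same values that `tapply()` returns as a named vector.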
## Draw five rows at random; the rows returned depend on the seed and on the R version
set.seed(1) ## Just so the answer is repeatable
dataset[sample(nrow(dataset), 5), ]
##     Ozone Solar.R Wind Temp Month Day
## 126    73     183  2.8   93     9   3
## 99    122     255  4.0   89     8   7
## 119    NA     153  5.7   88     8  27
## 83     NA     258  9.7   81     7  22
## 79     61     285  6.3   84     7  18
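To see what a seed buys you, the sketch below draws the same random sample twice. It is only an illustration: it assumes the built-in `airquality` data stands in for the CSV, and `aq`, `first_draw` and `second_draw` are hypothetical names.

```r
## Assumption: the built-in airquality data stands in for selfquiz-data.csv
aq <- airquality

## Resetting the seed before each call makes sample() return the same indices
set.seed(1)
first_draw <- sample(nrow(aq), 5)
set.seed(1)
second_draw <- sample(nrow(aq), 5)

identical(first_draw, second_draw)  ## TRUE: same seed, same draw

## Use the indices to pull the sampled rows
aq[first_draw, ]
```

Which five rows come back still depends on the R version, because the default sampling algorithm changed in R 3.6.0.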