Corpus Processing Methods II

An R-chitecture Introduction

Shu-Kai Hsieh 謝舒凱
GIL, National Taiwan University

Download

Introduction: Data, Data, Data!

  1. "Data are values of qualitative or quantitative variables, belonging to a set of items (i.e., populations)." (wiki)
  2. Linguistics: a data-intensive discipline?
  3. The Decision is YOURS: English teacher or (Linguistic) Data Scientist

English Teacher

alt text

Data Scientist

alt text

What's so special about CPM ? OR, What do you mean by processing data?

In general, Corpus data science involves a chain of works

  1. Pre-processing (cleaning, tokenizing, segmentation, etc)
  2. Data annotation (Semi-automatic) Labeling (POS tagging) and Management
  3. Exploratory Data Analysis (with workable knowledge of Statistics)
  4. Hypothesis testing
  5. Prediction, Statistical Modeling, etc
  6. Presentation and Web application (Demo: Shiny-LexicoR)

Empirical Method and Statistics

.. are two sides of the same coin: it is pointless to study one without the other. 
(You simply cannot design an experiment and interpret the results without 
understanding what the data are telling you and what they do not 
and – even more importantly – cannot tell you.)

Why R/R-chitecture?

alt text

R Self Warm-up Quiz ( from R. Peng)

## One way (easiest and fastest)
dataset <- read.csv("http://www.biostat.jhsph.edu/~rpeng/coursera/selfquiz/selfquiz-data.csv")

## You may want to store a local copy for later
download.file("http://www.biostat.jhsph.edu/~rpeng/coursera/selfquiz/selfquiz-data.csv", 
    "selfquiz-data.csv")
dataset <- read.csv("selfquiz-data.csv")

R Self Warm-up Quiz (level 1)

  • What are the column names of the data frame?
  • What are the row names of the data frame?
  • Extract the first 6 rows of the data frame and print them to the console.
  • How many observations (i.e. rows) are in this data frame?

Possible solutions

##
names(dataset)  ## colnames(dataset) also works
##
rownames(dataset)
##
head(dataset, 6)  ## print(dataset[1:6, ])
##
nrow(dataset)

R Self Warm-up Quiz (level 2)

  • Extract the last 6 rows of the data frame and print them to the console
  • How many missing values are in the "Ozone" column of this data frame?
  • What is the mean of the "Ozone"" column in this dataset? Exclude missing values (coded as NA) from this calculation.
  • Extract the subset of rows of the data frame where Ozone values are above 31 and Temp values are above 90.

Possible solutions

##
tail(dataset)
## 37
miss <- is.na(dataset[, "Ozone"])  ## A vector of TRUE/FALSE
sum(miss)
## 42.13
mean(dataset[, "Ozone"], na.rm = TRUE)
##
subset(dataset, Ozone > 31 & Temp > 90)

R Self Warm-up Quiz (level 3)

  • Use a for loop to create a vector of length 6 containing the mean of each column in the data frame (excluding all missing values).
  • Use the apply function to calculate the standard deviation of each column in the data frame (excluding all missing values).

Possible solutions

##
m <- numeric(6)
for (i in 1:6) {
    m[i] <- mean(dataset[, i], na.rm = TRUE)
}
##
s <- apply(dataset, 2, sd, na.rm = TRUE)
print(s)
print(m)

R Self Warm-up Quiz (level 4)

  • Calculate the mean of "Ozone" for each Month in the data frame and create a vector containing the monthly means (exclude all missing values).
  • Draw a random sample of 5 rows from the data frame.

Possible solutions

tapply(dataset$Ozone, dataset$Month, mean, na.rm = TRUE)
##     5     6     7     8     9 
## 23.62 29.44 59.12 59.96 31.45 

## set.seed(1) ## Just so the answer is repeatable
dataset[sample(nrow(dataset), 5), ]
##     Ozone Solar.R Wind Temp Month Day
## 126    73     183  2.8   93     9   3
## 99    122     255  4.0   89     8   7
## 119    NA     153  5.7   88     8  27
## 83     NA     258  9.7   81     7  22
## 79     61     285  6.3   84     7  18