2 Working with data

2.1 Dataset

We will work with datasets arranged in tables.

We denote by \(\mathbf{X}\) the table that contains the dataset.

The dataset has \(n\) rows and \(d\) columns, so \(\mathbf{X}\) is an \(n \times d\) table.


Each row represents one data example (equivalently: instance, object).

We denote by \(\mathbf{x}_{i.}\) the \(i\)th instance.

For example, the 2nd instance: \(\mathbf{x}_{2.} = (\, x_{21}, x_{22}, \dots , x_{2d} \,)^T\)

Note: by convention all vectors \(\mathbf{x}\) are column vectors (\(^T\) indicates the transpose).


Each column represents an attribute (equivalently: variable).

We denote by \(\mathbf{x}_{.j}\) the \(j\)th attribute.

For example, the 3rd attribute: \(\mathbf{x}_{.3} = (\, x_{13}, x_{23}, \dots , x_{n3} \,)^T\)
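As a small illustration of this notation, the sketch below builds a toy \(4 \times 3\) numeric table in R and extracts one instance and one attribute. The object name X and its values are made up for this example; they are not part of the survey data used later.

```r
# Toy dataset with n = 4 instances (rows) and d = 3 attributes (columns).
# The values are arbitrary and only serve to illustrate the notation.
X <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

dim(X)            # n and d

x_row2 <- X[2, ]  # 2nd instance, x_{2.}
x_col3 <- X[, 3]  # 3rd attribute, x_{.3}

length(x_row2)    # an instance has d elements
length(x_col3)    # an attribute has n elements
```

R returns both the row and the column as plain vectors; the distinction between row and column vectors only matters in the mathematical notation.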

2.2 R data frame

2.2.1 Working directory

Before reading data from a file on your disk, make sure that R's working directory is the one that contains the file. Use getwd() to check the current working directory.

getwd()
## [1] "/cloud/project"

To change the working directory, use setwd(). (I don’t need to change my working directory because R is already where I want. I will call setwd() just to show you how to use it.)

setwd("/cloud/project")
getwd()
## [1] "/cloud/project"
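If you change the working directory only temporarily, it can help to remember the old one and restore it afterwards. The following is a minimal sketch of that pattern; the use of tempdir() as the target directory and the variable names old_wd and has_file are assumptions made for illustration.

```r
# Remember the current working directory so it can be restored later.
old_wd <- getwd()

# Temporarily switch to R's per-session temporary directory.
setwd(tempdir())
getwd()

# Check whether a file is visible from the current working directory.
has_file <- file.exists("survey.csv")
has_file

# Go back to where we started.
setwd(old_wd)
```

Checking with file.exists() before reading is a cheap way to find out whether the working directory is the one you expect.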

2.2.2 Read csv file

To read tabular data from a csv file, or from another text file with column separators, we use the read.table() function. It has many arguments; use ?read.table to learn about them.

The most important ones are:

  • file: character string with the name of the file to read
  • header: logical (TRUE/FALSE), whether the file has a header line with column names
  • sep: character by which columns are separated, usually ","
  • colClasses: character vector of classes for each column, e.g. colClasses = c("character", "numeric", "numeric")

We use read.table() to read our data into an R object

my_data <- read.table("survey.csv", header=TRUE, sep=",")

my_data is a data.frame, an object for tabular data whose columns can have different classes (numeric, character, logical, factor).
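To make the arguments above concrete, here is a minimal self-contained sketch. It writes a tiny made-up csv file to a temporary location and reads it back with read.table(). The file contents and the names csv_path and toy are assumptions for illustration; this is not the survey data.

```r
# Write a small, made-up csv file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Time,Distance",
             "anna,10,2.5",
             "ben,20,4.0"),
           csv_path)

# Read it back, stating the class of each column explicitly.
toy <- read.table(csv_path, header = TRUE, sep = ",",
                  colClasses = c("character", "integer", "numeric"))
str(toy)

# Remove the temporary file again.
unlink(csv_path)
```

Setting colClasses is optional, but it makes the result independent of R's defaults. Since R 4.0.0 text columns are read as character by default; the Factor columns in the str() output later in this section correspond to stringsAsFactors = TRUE, the default in earlier versions.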

2.2.3 Contents of data frame

We can check the contents of my_data by printing a few top lines.

head(my_data)

The object is also listed in the environment pane (top right). Clicking its name calls the View() function and opens the data in the editor pane as a table, similar to Excel. Be careful: this is not a good idea if your data is big (many rows and columns). Opening it may consume too much of the computer's resources and cause your system to freeze, at least temporarily.

Our data are small so you can go ahead and try.
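For larger data, printing only a few rows is the safer way to look at the contents. The sketch below uses a made-up data frame (toy, with invented columns id and value) to show how the number of rows printed by head() can be controlled, and how tail() does the same for the last rows.

```r
# Made-up data frame with 20 rows, for illustration only.
toy <- data.frame(id = 1:20, value = (1:20)^2)

first_default <- head(toy)        # first 6 rows by default
first_three   <- head(toy, n = 3) # first 3 rows
last_three    <- tail(toy, n = 3) # last 3 rows

first_three
last_three
```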

2.2.4 Structure of data frame

Clicking the blue arrow in front of the name will expand the line and give you a brief description of its contents.

You can get the same result by using the str() function:

str(my_data)
## 'data.frame':    66 obs. of  10 variables:
##  $ Response      : int  136736 136735 136731 136728 136730 136742 136727 136748 136746 136729 ...
##  $ Filier        : Factor w/ 2 levels "EE","IG": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Q01_Transport : Factor w/ 5 levels "bike","car","moto",..: 4 1 1 1 3 3 3 5 3 1 ...
##  $ Q02_Time      : int  60 10 20 15 15 20 10 15 15 8 ...
##  $ Q03_Distance  : num  12 2 4 3.8 20 7 3 1.5 4 2.5 ...
##  $ Q04_Trips     : int  2 3 4 5 4 4 5 4 4 4 ...
##  $ Q05_Food      : Factor w/ 6 levels "equilibre","hit of the day",..: 3 4 4 5 1 4 4 5 4 1 ...
##  $ Q06_Vegetarian: int  0 0 0 0 0 0 0 1 0 0 ...
##  $ Q07_Mode      : Factor w/ 2 levels "Full time","Part time": 2 2 1 1 1 2 1 2 2 1 ...
##  $ Q08_Eiffel    : int  300 350 300 324 110 165 324 180 400 182 ...

To get just the column names, use the names() function:

names(my_data)
##  [1] "Response"       "Filier"         "Q01_Transport"  "Q02_Time"      
##  [5] "Q03_Distance"   "Q04_Trips"      "Q05_Food"       "Q06_Vegetarian"
##  [9] "Q07_Mode"       "Q08_Eiffel"
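If you only need the class of each column rather than the full str() output, one option is to apply class() to every column. This is a sketch on a made-up data frame (the name toy and its columns are assumptions for illustration).

```r
# Made-up data frame with one column of each common class.
toy <- data.frame(
  name     = c("anna", "ben", "carl"),
  time     = c(10L, 20L, 15L),
  distance = c(2.5, 4.0, 3.8),
  stringsAsFactors = FALSE   # keep text as character in every R version
)

# A data frame is a list of columns, so sapply() visits each column.
col_classes <- sapply(toy, class)
col_classes
```

The result is a named character vector, with one entry per column.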

2.2.5 Size of data frame

Useful functions for checking the size of the data are

dim(my_data)
## [1] 66 10
nrow(my_data)
## [1] 66
ncol(my_data)
## [1] 10

2.3 Subsetting R objects

To extract elements from objects we use square brackets: [ ] for vectors and [ , ] for 2-dimensional objects such as data frames.

2.3.1 Subsetting vectors

For example, for the vector

ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')
ten_letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

we extract the 2nd element by using its index

ten_letters[2]
## [1] "b"

To extract a sequence of elements we pass in an integer sequence

ten_letters[2:5]
## [1] "b" "c" "d" "e"

To extract several elements at arbitrary positions we pass in a vector of their indexes

ten_letters[c(2, 4, 6)]
## [1] "b" "d" "f"

To extract everything except one element we put the minus symbol - in front of its index

ten_letters[-3]
## [1] "a" "b" "d" "e" "f" "g" "h" "i" "j"

The 3rd letter “c” is missing from the result.


We can also use logical conditions for subsetting.

First check which letters are “larger” than the letter d (in alphabetical order).

ten_letters > 'd'
##  [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

We can use this to extract the letters larger than d as follows

ten_letters[ten_letters > 'd']
## [1] "e" "f" "g" "h" "i" "j"
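The same ideas apply to numeric vectors. The sketch below, using a made-up vector x, shows how to drop several elements at once, how to store a logical condition and reuse it, and how to combine two conditions with &. The which() function converts a logical vector into the positions of its TRUE values.

```r
# Made-up numeric vector, for illustration only.
x <- c(5, 12, 7, 20, 3)

# Negative indexes can be combined to drop several elements.
without_first_two <- x[-c(1, 2)]

# Store the logical condition, then use it for subsetting.
big <- x > 6
big_values    <- x[big]
big_positions <- which(big)   # indexes where the condition is TRUE

# Combine two conditions element by element with &.
middle <- x[x > 6 & x < 15]

without_first_two
big_values
big_positions
middle
```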

2.3.2 Subsetting data frame

A data frame is a 2-dimensional object, so each element is specified by its two coordinate indexes [i,j]: row i and column j.

head(my_data)

We extract a single element by using its indexes

my_data[2,3]
## [1] bike
## Levels: bike car moto transport public walk
my_data[3,2]
## [1] IG
## Levels: EE IG

To extract a full row or a column, we omit the other index.

To extract a row

my_data[3,]

To extract a column

my_data[,3]
##  [1] transport public bike             bike             bike            
##  [5] moto             moto             moto             walk            
##  [9] moto             bike             car              car             
## [13] car              transport public car              transport public
## [17] moto             transport public walk             transport public
## [21] walk             moto             walk             transport public
## [25] moto             car              moto             transport public
## [29] transport public transport public transport public car             
## [33] transport public transport public transport public transport public
## [37] car              transport public bike             transport public
## [41] moto             moto             transport public transport public
## [45] transport public moto             transport public transport public
## [49] transport public moto             transport public transport public
## [53] transport public car              transport public transport public
## [57] car              moto             transport public transport public
## [61] transport public transport public car              car             
## [65] transport public transport public
## Levels: bike car moto transport public walk

We can also extract columns by using their names

my_data$Q03_Distance
##  [1]  12.0   2.0   4.0   3.8  20.0   7.0   3.0   1.5   4.0   2.5   8.0
## [12]   3.0  12.0  70.0   6.3   2.0  14.0   2.5   0.1   8.0   0.0   3.0
## [23]   3.0  16.0  19.0  20.0   5.0   2.0   4.0  20.0   2.5   9.0   9.0
## [34]  21.0   4.0 177.0  13.0   9.0   2.0   7.5   6.0   2.0  10.0  35.0
## [45]  20.0   2.0   5.0   8.0 120.0   8.0  10.7   3.0  50.0  20.0   5.0
## [56]  75.0  30.0  10.0   5.0  80.0   2.3  10.0  10.0  20.0   1.5   5.0

or

my_data[,'Q03_Distance']
##  [1]  12.0   2.0   4.0   3.8  20.0   7.0   3.0   1.5   4.0   2.5   8.0
## [12]   3.0  12.0  70.0   6.3   2.0  14.0   2.5   0.1   8.0   0.0   3.0
## [23]   3.0  16.0  19.0  20.0   5.0   2.0   4.0  20.0   2.5   9.0   9.0
## [34]  21.0   4.0 177.0  13.0   9.0   2.0   7.5   6.0   2.0  10.0  35.0
## [45]  20.0   2.0   5.0   8.0 120.0   8.0  10.7   3.0  50.0  20.0   5.0
## [56]  75.0  30.0  10.0   5.0  80.0   2.3  10.0  10.0  20.0   1.5   5.0
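Besides $ and [, "name"], R offers two more bracket forms for selecting a column by name. The sketch below compares them on a made-up data frame (toy and its columns are assumptions for illustration): double brackets return the column as a vector, while single brackets with only a name return a one-column data frame.

```r
# Made-up data frame, for illustration only.
toy <- data.frame(time = c(10, 20, 15), distance = c(2.5, 4.0, 3.8))

by_dollar  <- toy$distance        # vector
by_comma   <- toy[, "distance"]   # vector
by_double  <- toy[["distance"]]   # vector
by_single  <- toy["distance"]     # data frame with one column

by_double
by_single
```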

As with vectors, we can extract multiple rows or columns using integer sequences or index vectors.

First 3 rows

my_data[1:3,]

Selected columns

my_data[,c(2,5,7)]

We can also use logical indexing.

Extract the rows with walk in the 3rd column.

my_data[my_data[,3]=='walk',]
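Logical indexing is often easier to read when the condition is stored under a name and columns are referred to by name instead of position. The following sketch uses a made-up data frame (toy, with invented columns transport and distance) rather than the survey data.

```r
# Made-up data frame, for illustration only.
toy <- data.frame(
  transport = c("bike", "walk", "car", "walk"),
  distance  = c(4.0, 0.5, 12.0, 1.5),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE value per row.
is_walk <- toy$transport == "walk"

walkers       <- toy[is_walk, ]                     # all columns
long_walkers  <- toy[is_walk & toy$distance > 1, ]  # two conditions
walk_distance <- toy[is_walk, "distance"]           # one column, as a vector

walkers
long_walkers
walk_distance
```

The condition goes in the row position, before the comma; leaving out the comma would select columns instead of rows.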

2.4 Updating data frame

A data frame is an object in R and can be manipulated as any other object.

You can use the assignment operator <- to change values within the data frame.

Update single element

my_data[1,4] <- 200
head(my_data)

Change name of column

names(my_data)[1] <- "ID"
head(my_data)

Drop column from data frame

my_data <- my_data[,-1]
head(my_data)
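The same three updates can also be written with column names instead of positions, which keeps the code valid if the column order changes. This is a sketch on a made-up data frame (toy and its columns are assumptions for illustration), not a replacement for the commands above.

```r
# Made-up data frame, for illustration only.
toy <- data.frame(id = 1:3, time = c(60, 10, 20), distance = c(12, 2, 4))

# Update a single element, addressing the column by name.
toy[1, "time"] <- 200

# Rename a column by looking up its current name.
names(toy)[names(toy) == "id"] <- "ID"

# Drop a column by assigning NULL to it.
toy$ID <- NULL

toy
```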

2.5 Subsetting quick reference

Command Description
my_data[3, 5] # element in 3rd row and 5th column (single element)
my_data[, 5] # 5th column (vector)
my_data[5] # 5th column (data frame)
my_data[3,] # 3rd row (data frame)
my_data[2:4,] # rows 2:4 (data frame)
my_data[, 5:8] # columns 5:8 (data frame)
my_data[2, 5:8] # 2nd row of columns 5:8 (data frame with one row)
my_data[, -3] # all data except 3rd column (data frame)
my_data[-(1:4),] # all data except first 4 rows (data frame)
my_data[,"Q01_Transport"] # column with name “Q01_Transport” (vector)
my_data$Q01_Transport # column with name “Q01_Transport” (vector)
my_data[5:nrow(my_data),] # all rows from 5 to the last (data frame)
my_data[my_data[,"Q01_Transport"]=="bike",] # all rows where Q01_Transport==“bike”
my_data[my_data[,2]=="bike" & my_data[,4]>10,] # all rows where Q01_Transport==“bike” and Q03_Distance>10 (columns 2 and 4 once the first column has been dropped as in 2.4)
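Whether a command returns a vector or a data frame can be checked with is.data.frame(). The sketch below does this for several forms from the table, using a made-up data frame (toy and its columns are assumptions for illustration). The drop = FALSE argument keeps the data frame structure even when a single column is selected.

```r
# Made-up data frame, for illustration only.
toy <- data.frame(a = 1:5, b = letters[1:5], c = (1:5) / 2,
                  stringsAsFactors = FALSE)

col_vec  <- toy[, 2]                # single column: vector
col_df   <- toy[2]                  # single column: data frame
row_df   <- toy[3, ]                # single row: data frame
part_row <- toy[2, 1:3]             # one row, several columns: data frame
kept_df  <- toy[, 2, drop = FALSE]  # single column kept as data frame

is.data.frame(col_vec)
is.data.frame(col_df)
is.data.frame(row_df)
is.data.frame(part_row)
is.data.frame(kept_df)
```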