We denote by \(\mathbf{X}\) the table that contains the dataset.
The dataset has \(n\) rows and \(d\) columns, so \(\mathbf{X}\) is an \(n \times d\) table.
Each row represents one data example (equivalently: instance, object).
We denote the \(i\)th instance by \(\mathbf{x}_{i.}\).
For example, the 2nd instance is \(\mathbf{x}_{2.} = (\, x_{21}, x_{22}, \dots , x_{2d} \,)^T\)
Note: by convention all vectors \(\mathbf{x}\) are column vectors (we use \(^T\) to indicate the transpose).
Each column represents an attribute (equivalently: variable).
We denote the \(j\)th attribute by \(\mathbf{x}_{.j}\).
For example, the 3rd attribute is \(\mathbf{x}_{.3} = (\, x_{13}, x_{23}, \dots , x_{n3} \,)^T\)
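The notation above can be mirrored in R with a small made-up table. This is only a sketch: the matrix `X` and its values are hypothetical, and R vectors carry no row or column orientation, so the transpose in the notation has no direct counterpart here.

```r
# Hypothetical toy dataset with n = 4 instances (rows) and d = 3 attributes (columns)
X <- matrix(1:12, nrow = 4, ncol = 3)   # values are filled column by column

dim(X)             # n and d: 4 3

x_2dot <- X[2, ]   # 2nd instance x_{2.}: the d values in row 2
x_dot3 <- X[, 3]   # 3rd attribute x_{.3}: the n values in column 3

x_2dot             # 2 6 10
x_dot3             # 9 10 11 12
```

An instance always has \(d\) values and an attribute always has \(n\) values, which is a quick way to check that you extracted the right thing.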
Before trying to read data from a file on your disk, you need to make sure that R's working directory is the one that contains the file.
getwd()
## [1] "/cloud/project"
To change the working directory, use setwd(). (I don’t need to change my working directory because R is already in the directory I want. I will call setwd() just to show you how to use it.)
setwd("/cloud/project")
getwd()
## [1] "/cloud/project"
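As a minimal sketch of why the working directory matters, the following uses a temporary directory and a made-up file name (`example_file.csv` is hypothetical) to show that a file can be found by its bare name only when it sits in the working directory.

```r
# Remember the current working directory so it can be restored at the end
old_wd <- getwd()

# Move to a temporary directory (it stands in for your own data folder)
setwd(tempdir())

# Hypothetical file name, used only for this illustration
file_name <- "example_file.csv"

found_before <- file.exists(file_name)   # the file has not been written yet
writeLines(c("a,b", "1,2"), file_name)   # create a tiny file in the working directory
found_after <- file.exists(file_name)    # now R can find it by its bare name

# Clean up and go back to where we started
file.remove(file_name)
setwd(old_wd)
```

Calling file.exists() before read.table() is a cheap way to find out whether a "cannot open file" error comes from the working directory or from a misspelled file name.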
To read tabular data from a csv file or another text file with column separators, we will use the read.table() function. The function has many arguments. You can use ?read.table to learn about them.
The most important ones are:
file: character, name of the file to read
header: logical (TRUE/FALSE), whether the file has a header line with column names
sep: character by which columns are separated, usually “,”
colClasses: character vector giving the class of each column, e.g. colClasses = c("character", "numeric", "numeric")
We use read.table() to read our data into an R object. The sep argument must match the separator used in the file (for example, sep=";" for a semicolon-separated file); the call below uses sep=",".
my_data <- read.table("survey.csv", header=TRUE, sep=",")
my_data is a data.frame, a special object for dealing with tabular data whose columns can have different classes (numeric/character/logical).
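To try the arguments without the survey file, here is a self-contained sketch that writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents, the column names and the object name `toy` are made up for illustration.

```r
# Write a tiny semicolon-separated file to a temporary location
# (contents and column names are hypothetical)
tmp_file <- tempfile(fileext = ".csv")
writeLines(c("name;height;weight",
             "Ann;1.70;60",
             "Bob;1.82;80"), tmp_file)

# header = TRUE : the first line holds the column names
# sep = ";"     : columns are separated by semicolons in this file
# colClasses    : one class per column, in column order
toy <- read.table(tmp_file, header = TRUE, sep = ";",
                  colClasses = c("character", "numeric", "numeric"))

str(toy)           # 2 observations of 3 variables
unlink(tmp_file)   # remove the temporary file
```

If sep does not match the file, R typically cannot split the lines and you end up with a single column, so checking the number of columns after reading is a useful habit.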
We can check the contents of my_data by printing a few top lines.
head(my_data)
The object is also in the environment pane (top right). Clicking its name will call the View() function and open the data in the editor pane in a table similar to Excel. Be careful: this is not a good idea if your data are big (many rows and columns). Opening them may consume too much of the computer's resources and cause your system to freeze (at least temporarily).
Our data are small so you can go ahead and try.
Clicking the blue arrow in front of the name will expand the line and give you a brief description of its contents.
You can get the same result by using the str function.
str(my_data)
## 'data.frame': 85 obs. of 10 variables:
## $ Response : int 136736 136735 136731 136728 136730 136742 136727 136748 136746 136729 ...
## $ Filier : Factor w/ 2 levels "EE","IG": 2 2 2 2 2 2 2 2 2 2 ...
## $ Q01_Transport : Factor w/ 5 levels "bike","car","moto",..: 4 1 1 1 3 3 3 5 3 1 ...
## $ Q02_Time : int 60 10 20 15 15 20 10 15 15 8 ...
## $ Q03_Distance : num 12 2 4 3.8 20 7 3 1.5 4 2.5 ...
## $ Q04_Trips : int 2 3 4 5 4 4 5 4 4 4 ...
## $ Q05_Food : Factor w/ 6 levels "equilibre","hit of the day",..: 3 4 4 5 1 4 4 5 4 1 ...
## $ Q06_Vegetarian: int 0 0 0 0 0 0 0 1 0 0 ...
## $ Q07_Mode : Factor w/ 2 levels "Full time","Part time": 2 2 1 1 1 2 1 2 2 1 ...
## $ Q08_Eiffel : int 300 350 300 324 110 165 324 180 400 182 ...
To get just the list of column names, use the names function.
names(my_data)
## [1] "Response" "Filier" "Q01_Transport" "Q02_Time"
## [5] "Q03_Distance" "Q04_Trips" "Q05_Food" "Q06_Vegetarian"
## [9] "Q07_Mode" "Q08_Eiffel"
Useful functions for checking the size of the data are
dim(my_data)
## [1] 85 10
nrow(my_data)
## [1] 85
ncol(my_data)
## [1] 10
To extract elements from objects (such as vectors or data frames) we use square brackets: [] for vectors and [,] for data frames.
For example for a vector
ten_letters <- c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j')
ten_letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
we extract the 2nd element by using its index
ten_letters[2]
## [1] "b"
To extract a sequence of elements we pass in an integer sequence
ten_letters[2:5]
## [1] "b" "c" "d" "e"
To extract several elements at arbitrary positions, we pass in a vector of their indexes
ten_letters[c(2, 4, 6)]
## [1] "b" "d" "f"
To extract everything except one element, we put the minus symbol - in front of its index
ten_letters[-3]
## [1] "a" "b" "d" "e" "f" "g" "h" "i" "j"
The 3rd letter “c” is missing from the result.
We can also use logical conditions for subsetting.
First, check which letters are “larger” than the letter d (in alphabetical order).
ten_letters > 'd'
## [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
We can use this to extract the letters larger than d as follows
ten_letters[ten_letters > 'd']
## [1] "e" "f" "g" "h" "i" "j"
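The same two steps work for numbers. The sketch below uses a made-up vector `values` and keeps the intermediate logical vector in a variable (`mask`, a name chosen here) so that both steps are visible.

```r
# Hypothetical numeric vector
values <- c(5, 12, 8, 20, 3)

# Step 1: the comparison returns one TRUE/FALSE per element
mask <- values > 7
mask             # FALSE TRUE TRUE TRUE FALSE

# Step 2: indexing with the logical vector keeps the elements where it is TRUE
values[mask]     # 12 8 20

# which() converts the logical vector into integer positions
which(mask)      # 2 3 4
```

The logical vector must have the same length as the vector being indexed; here both have 5 elements.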
A data frame is a 2-dimensional object, so each element is specified by its two coordinate indexes [i,j].
head(my_data)
We extract a single element by using its indexes
my_data[2,3]
## [1] bike
## Levels: bike car moto transport public walk
my_data[3,2]
## [1] IG
## Levels: EE IG
To extract a full row or a column, we omit the other index.
To extract a row
my_data[3,]
To extract a column
my_data[,3]
## [1] transport public bike bike bike
## [5] moto moto moto walk
## [9] moto bike car car
## [13] car transport public car transport public
## [17] moto transport public walk transport public
## [21] walk moto walk transport public
## [25] moto car moto transport public
## [29] transport public transport public transport public car
## [33] transport public transport public transport public transport public
## [37] car transport public bike transport public
## [41] moto moto transport public transport public
## [45] transport public moto transport public transport public
## [49] transport public moto transport public transport public
## [53] transport public car transport public transport public
## [57] car moto transport public transport public
## [61] transport public transport public car car
## [65] transport public transport public transport public transport public
## [69] moto transport public car moto
## [73] car transport public transport public car
## [77] walk transport public moto car
## [81] car transport public transport public moto
## [85] moto
## Levels: bike car moto transport public walk
We can also extract columns by using their names
my_data$Q03_Distance
## [1] 12.0 2.0 4.0 3.8 20.0 7.0 3.0 1.5 4.0 2.5 8.0
## [12] 3.0 12.0 70.0 6.3 2.0 14.0 2.5 0.1 8.0 0.0 3.0
## [23] 3.0 16.0 19.0 20.0 5.0 2.0 4.0 20.0 2.5 9.0 9.0
## [34] 21.0 4.0 177.0 13.0 9.0 2.0 7.5 6.0 2.0 10.0 35.0
## [45] 20.0 2.0 5.0 8.0 120.0 8.0 10.7 3.0 50.0 20.0 5.0
## [56] 75.0 30.0 10.0 5.0 80.0 2.3 10.0 10.0 20.0 1.5 5.0
## [67] 20.0 4.0 2.0 15.0 6.0 1.0 10.0 8.0 2.0 40.0 1.0
## [78] 9.3 16.0 9.0 40.0 5.0 10.5 6.4 7.0
or
my_data[,'Q03_Distance']
## [1] 12.0 2.0 4.0 3.8 20.0 7.0 3.0 1.5 4.0 2.5 8.0
## [12] 3.0 12.0 70.0 6.3 2.0 14.0 2.5 0.1 8.0 0.0 3.0
## [23] 3.0 16.0 19.0 20.0 5.0 2.0 4.0 20.0 2.5 9.0 9.0
## [34] 21.0 4.0 177.0 13.0 9.0 2.0 7.5 6.0 2.0 10.0 35.0
## [45] 20.0 2.0 5.0 8.0 120.0 8.0 10.7 3.0 50.0 20.0 5.0
## [56] 75.0 30.0 10.0 5.0 80.0 2.3 10.0 10.0 20.0 1.5 5.0
## [67] 20.0 4.0 2.0 15.0 6.0 1.0 10.0 8.0 2.0 40.0 1.0
## [78] 9.3 16.0 9.0 40.0 5.0 10.5 6.4 7.0
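As a small check that the different ways of extracting a column agree, the sketch below builds a hypothetical data frame `toy` (its column names and values are made up) and compares the results with identical().

```r
# Hypothetical toy data frame
toy <- data.frame(id = 1:3,
                  distance = c(2.5, 12, 7),
                  transport = c("bike", "car", "walk"))

by_dollar <- toy$distance        # by name with $
by_name   <- toy[, "distance"]   # by name inside [ , ]
by_index  <- toy[, 2]            # by position

identical(by_dollar, by_name)    # TRUE
identical(by_name, by_index)     # TRUE

# Several columns can be selected by passing a vector of names
two_columns <- toy[, c("id", "transport")]
two_columns
```

Selecting by name is usually safer than selecting by position, because the code keeps working if columns are later added, removed or reordered.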
As with vectors, we can extract multiple rows or columns using integer sequences or index vectors.
First 3 rows
my_data[1:3,]
Selected columns
my_data[,c(2,5,7)]
We can also use logical indexing.
Extract rows with walk in the 3rd column.
my_data[my_data[,3]=='walk',]
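To make the row filter easier to read, the sketch below splits it into two steps on a made-up data frame (`toy`, `is_walk` and `walkers` are names chosen for this illustration): first build the logical vector, then use it as the row index.

```r
# Hypothetical toy data frame
toy <- data.frame(transport = c("bike", "walk", "car", "walk"),
                  time = c(10, 25, 40, 5))

# Step 1: one TRUE/FALSE per row
is_walk <- toy$transport == "walk"
is_walk        # FALSE TRUE FALSE TRUE
sum(is_walk)   # number of matching rows: 2

# Step 2: use it as the row index; the empty column index keeps all columns
walkers <- toy[is_walk, ]
walkers
```

The comma after the row condition matters: without it, a logical vector inside [] is interpreted as a selection of columns rather than rows.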
A data frame is an object in R and can be manipulated as any other object.
You can use the assignment operator <- to change values within the data frame.
Update single element
my_data[1,4] <- 200
head(my_data)
Change name of column
names(my_data)[1] <- "ID"
head(my_data)
Drop column from data frame
my_data <- my_data[,-1]
head(my_data)
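The three modifications can be tried safely on a made-up data frame before touching real data. In the sketch below, `toy` and its columns are hypothetical, and the column is dropped by name instead of by position, which is an alternative to the negative index.

```r
# Hypothetical toy data frame
toy <- data.frame(Response = c(101, 102, 103),
                  Time = c(60, 10, 20),
                  Trips = c(2, 3, 4))

toy[1, 2] <- 200                   # update a single element (row 1, column 2)
names(toy)[1] <- "ID"              # rename the first column
toy <- toy[, names(toy) != "ID"]   # keep every column whose name is not "ID"

toy
```

Note that if only one column remained after dropping, [ , ] would return a plain vector; adding drop = FALSE inside the brackets keeps the result a data frame.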
| Command | Description |
|---|---|
| my_data[3, 5] | element in 3rd row and 5th column (single element) |
| my_data[, 5] | 5th column (vector) |
| my_data[5] | 5th column (data frame) |
| my_data[3,] | 3rd row (data frame) |
| my_data[2:4,] | rows 2:4 (data frame) |
| my_data[, 5:8] | columns 5:8 (data frame) |
| my_data[2, 5:8] | 2nd row of columns 5:8 (data frame with one row) |
| my_data[, -3] | all data except 3rd column (data frame) |
| my_data[-(1:4),] | all data except first 4 rows (data frame) |
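Whether an extraction returns a vector or a data frame can be checked with is.data.frame(). The sketch below does this on a made-up data frame (`toy` and the labels in `returns_df` are chosen for this illustration).

```r
# Hypothetical toy data frame with 4 rows and 3 columns
toy <- data.frame(a = 1:4,
                  b = c(2.5, 3.5, 4.5, 5.5),
                  c = c("w", "x", "y", "z"))

returns_df <- c(
  one_column_with_comma    = is.data.frame(toy[, 2]),               # FALSE: a vector
  one_column_without_comma = is.data.frame(toy[2]),                 # TRUE
  one_row                  = is.data.frame(toy[3, ]),               # TRUE
  one_row_some_columns     = is.data.frame(toy[2, 1:2]),            # TRUE
  one_column_drop_false    = is.data.frame(toy[, 2, drop = FALSE])  # TRUE
)
returns_df
```

By default, [ , ] simplifies the result to a vector only when a single column is left; a single row stays a data frame. Passing drop = FALSE switches the simplification off.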
| Command | Description |
|---|---|
| my_data[,"Q01_Transport"] | column with name “Q01_Transport” (vector) |
| my_data$Q01_Transport | column with name “Q01_Transport” (vector) |
| my_data[5:nrow(my_data),] | all rows from 5 to the last (data frame) |
| my_data[my_data[,"Q01_Transport"]=="bike",] | all rows where Q01_Transport==“bike” |
| my_data[my_data[,2]=="bike" & my_data[,4]>10,] | all rows where Q01_Transport==“bike” and Q03_Distance>10 |
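Conditions can be combined with & (and) and | (or). The following sketch uses a made-up data frame and refers to columns by name, which reads more clearly than column numbers; `toy`, `far_bikers` and `bike_or_walk` are hypothetical names.

```r
# Hypothetical toy data frame
toy <- data.frame(transport = c("bike", "bike", "car", "walk"),
                  distance = c(12, 3, 25, 1))

# Both conditions must hold: bike AND distance greater than 10
far_bikers <- toy[toy$transport == "bike" & toy$distance > 10, ]
far_bikers       # only the first row

# At least one condition must hold: bike OR walk
bike_or_walk <- toy[toy$transport == "bike" | toy$transport == "walk", ]
bike_or_walk     # rows 1, 2 and 4
```

Use the single-character operators & and | here: they compare element by element, whereas && and || are meant for single TRUE/FALSE values.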