Julius Schmid

In this class, you will learn how to:

Extract individual values from a data set
Change individual values within a data set

Let us start by downloading the deck.csv data file from Canvas and uploading it to RStudio Cloud. We import the data with the read.csv() function and set stringsAsFactors = FALSE so that strings are not converted to factors. This is exactly what we learned in the previous in-class activity. To see what the data set looks like, we print the first 10 rows using the head() function:

deck <- read.csv("deck.csv",stringsAsFactors=FALSE)
head(deck,10)

We observe that the deck data set contains data for a whole deck of French cards.

#Positive Numbers

R treats positive integers just like ij notation in linear algebra: deck[i,j] will return the value of deck that is in the ith row and the jth column.

Let us return the entry corresponding to the first row and the first column in the data frame. That is, we pick the first card and want to know its face:

deck[1, 1]

## [1] "king"

The first card in the deck is a king.

Now, we want additional information about the first card. We are also interested in its suit and value. We can access this data by returning the first row together with all three column indices of the data frame:

deck[1, c(1, 2, 3)]

Now we know that the first card is not only a king, but the king of spades, with a value of 13.

We save the drawn king of spades in a new variable named new, so that we can refer to it easily:

new <- deck[1, c(1, 2, 3)]

We can also return the same row twice without causing an error. We do that by listing the row index 1 twice, again using the c() function:

deck[c(1, 1), c(1, 2, 3)]
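One side effect of repeating a row is worth a quick look. The sketch below uses a small hypothetical data frame called cards (a stand-in for deck, not the real file) to show that R keeps the row names unique by adding a suffix to the repeated row.

```r
# Hypothetical three-card stand-in for the deck data frame
cards <- data.frame(
  face  = c("king", "queen", "jack"),
  suit  = c("spades", "spades", "spades"),
  value = c(13, 12, 11),
  stringsAsFactors = FALSE
)

twice <- cards[c(1, 1), c(1, 2, 3)]  # first row, returned two times

nrow(twice)      # 2
rownames(twice)  # "1" "1.1"
```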

Next, we print the first two rows and first two columns of the deck data frame. Since the output is two-dimensional, R returns it as a data frame:

deck[1:2, 1:2]

On the other hand, if a dimension can be dropped, R will drop it by default. For example, if we want to return the first two entries of the first column, R will return the result as a vector:

deck[1:2, 1]

## [1] "king"  "queen"

We can avoid this by setting the optional parameter drop = FALSE. In this case, the result will still be interpreted as a data frame:

deck[1:2, 1, drop = FALSE]
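The difference between the two calls is easiest to see by checking the type of each result. The following is a minimal sketch on a hypothetical four-card data frame called cards, which only stands in for deck; the variable names as_vector and as_frame are illustrative.

```r
# Hypothetical four-card stand-in for the deck data frame
cards <- data.frame(
  face  = c("king", "queen", "jack", "ten"),
  suit  = rep("spades", 4),
  value = c(13, 12, 11, 10),
  stringsAsFactors = FALSE
)

as_vector <- cards[1:2, 1]                # single column: dimension dropped
as_frame  <- cards[1:2, 1, drop = FALSE]  # single column: dimension kept

is.data.frame(as_vector)  # FALSE, it is a character vector
is.data.frame(as_frame)   # TRUE
dim(as_frame)             # 2 1
```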

#Negative Numbers

Note that we can also use negative indices. With negative indices, we do not specify which rows or columns we want to return. Instead, we specify which rows and columns we do NOT want to return, i.e. R will print all rows or columns except for the ones that have a negative index. In the following case, we print all rows except for rows 2 through 52. This is just another way to return only the first row:

deck[-(2:52), 1:3]

Note that the call below returns an error. R does not allow positive and negative indices to be mixed in the same index vector: we would be asking R to return the first entry of the first column and, at the same time, to leave out the first entry of the first column. This is a contradiction.

#deck[c(-1, 1), 1]

We leave the code as a comment to avoid the error message when running the code.
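If you would like to see the failure without stopping your script, one option is to wrap the call in tryCatch(). This is only a sketch on a hypothetical three-card data frame called cards; the names mixing_fails and kept are illustrative.

```r
# Hypothetical three-card stand-in for the deck data frame
cards <- data.frame(
  face  = c("king", "queen", "jack"),
  value = c(13, 12, 11),
  stringsAsFactors = FALSE
)

# tryCatch() converts the error into TRUE so the script keeps running
mixing_fails <- tryCatch(
  {
    cards[c(-1, 1), 1]
    FALSE
  },
  error = function(e) TRUE
)
mixing_fails  # TRUE

# Negative indices on their own work as expected
kept <- cards[-1, "face"]
kept  # "queen" "jack"
```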

#Zero

The following instruction creates an empty object.

deck[0, 0]

This is because indexing in R starts at 1. The index 0 does not refer to any row or column, so R selects nothing and returns a data frame with 0 columns and 0 rows.
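A short sketch of how the zero index behaves, again on a hypothetical stand-in data frame called cards rather than the real deck:

```r
# Hypothetical three-card stand-in for the deck data frame
cards <- data.frame(
  face  = c("king", "queen", "jack"),
  suit  = c("spades", "spades", "spades"),
  value = c(13, 12, 11),
  stringsAsFactors = FALSE
)

empty <- cards[0, 0]
dim(empty)       # 0 0

no_rows <- cards[0, ]
dim(no_rows)     # 0 3, the columns survive but no rows are selected
names(no_rows)   # "face" "suit" "value"

# A zero mixed with positive indices is simply ignored
first_face <- cards[c(0, 1), "face"]
first_face       # "king"
```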

#Blank Spaces

When we want every column of a row, or every row of a column, we can leave the space after or before the comma blank, respectively, instead of listing all the indices. Thus, another way to print the first row would be the following, leaving the space for column indices empty:

deck[1, ]

#Logical Values

We could also use logical values to indicate which data frame entries we want to return. To do this, we supply one logical value per column, for example, stating whether we want the corresponding column returned (TRUE) or not (FALSE). In the following code, we print the entries of the first row in columns 1 and 2, since the corresponding logical values are TRUE.

deck[1, c(TRUE, TRUE, FALSE)]

In a next step, we do the same for the rows. Since we have a total of 52 rows, we need to set 52 values to either TRUE (T) or FALSE (F). In the rows variable, only the first entry will be set to TRUE, all others to FALSE:

rows <- c(T, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F,
F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F,
F, F, F, F, F, F, F, F, F, F, F, F, F, F)

Now applying the row selection to the deck data frame, we get an additional method to return the first row:

deck[rows,]

We could also have set more values to TRUE, which would have returned more rows than just the first one.
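Typing 52 logical values by hand is error-prone. As a sketch of an alternative, the same vector can be generated with rep() or with a comparison; the name n_cards is an assumption introduced here for illustration.

```r
n_cards <- 52  # number of rows in the deck

# One TRUE followed by 51 FALSE values
rows <- c(TRUE, rep(FALSE, n_cards - 1))

length(rows)  # 52
sum(rows)     # 1, only one entry is TRUE
which(rows)   # 1, and it is the first one

# The same vector, produced by comparing each position with 1
identical(rows, seq_len(n_cards) == 1)  # TRUE
```

The vector can then be used as before, for example deck[rows, ]. Writing TRUE and FALSE in full is slightly safer than T and F, because T and F are ordinary variables that can be overwritten.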

#Names

The most intuitive way is often to extract columns by name, much as one selects columns in SQL. Here, we again print the first row:

deck[1, c("face", "suit", "value")]

You can use a blank space to tell R to extract every value in a dimension.

# the entire value column
deck[ , "value"]

##  [1] 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2
## [26]  1 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3
## [51]  2  1

Complete the following code to make a function that returns the first row of a data frame:

We have seen a lot of possibilities to do this above. Assuming that the input cards is a data frame, we set the row index to 1 and leave the space for the column indices empty, which means that all columns will be returned:

deal <- function(cards) {
  cards[1, ]
}

Let us apply the function to our deck data frame:

deal(deck)

It worked! Just like we would expect, the deal() function returns the first row of the deck data frame.

In the next step, we want to shuffle the deck and save the new order in the variable deck2. First, we need to create a random permutation of the numbers 1 through 52:

set.seed(0) 
shuffle <- sample(1:52, replace=FALSE)

In order to apply the shuffle permutation to the deck and actually shuffle cards instead of indices, we use the permutation as the row indices:

deck2 <- deck[shuffle, ]

Now, we return the first 10 rows to see if the shuffling worked:

head(deck2,10)

We see that the shuffling was successful.

We can also just switch the first two cards, leaving all other cards at their original places. This is done with the following code:

deck3 <- deck[c(2, 1, 3:52), ]

#How to Modify Order

Another way to create a permutation also uses the sample() function. As above, we set the first argument to the numbers 1 through 52. Above, we also set replace = FALSE, meaning that the same index cannot be drawn twice. Below, we set size = 52 instead and leave replace out. Because replace = FALSE is the default of sample(), the 52 draws are still made without replacement, so there are no duplicates and each index is used exactly once.

#First create a vector of 52 numbers in random order and store it in an object named random.
random <- sample(1:52, size = 52)
random

##  [1]  1 51 40  6 23 44 49 50 39 11 17 36 13 25 47 48 20 29 45 22 35 28 12 16 52
## [26] 34 21 46 42  9  7 19 18 32 43 10 26 15 33 30  3  2 24 31 27  8 41 38 37  5
## [51] 14  4

We get a different order than when we used the shuffle permutation. Now we do the same as above, i.e. we order the rows in the deck data frame according to our random permutation:

deck4 <- deck[random, ]
head(deck4)
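The two ways of calling sample() shown so far can be compared directly. In this sketch, the seed value 1 and the names perm_a and perm_b are arbitrary choices made for illustration; resetting the same seed before each call makes the two draws comparable.

```r
set.seed(1)
perm_a <- sample(1:52, replace = FALSE)  # replace stated, size left at its default

set.seed(1)
perm_b <- sample(1:52, size = 52)        # size stated, replace left at its default

identical(perm_a, perm_b)  # TRUE, both calls do the same thing
anyDuplicated(perm_b)      # 0, no index appears twice
```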

Exercise

Use the preceding ideas to write a shuffle function. shuffle should take a data frame and return a shuffled copy of the data frame.

We recreate our first approach that we used to set up deck2, using the replace = FALSE parameter as an input for the sample() function:

shuffle <- function(cards) {
  random <- sample(1:52, replace = FALSE)
  cards[random, ]
}

Let us pick the first card before shuffling them:

deal(deck)

This is the king of spades, with a value of 13.

Now, we shuffle the cards:

deck2 <- shuffle(deck)

We pick the first card again and hopefully get a different card than before the shuffling:

deal(deck2)

Indeed! This time we get the jack of clubs, with a value of 11.
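The shuffle() function above assumes exactly 52 rows. One possible generalization, sketched here under the hypothetical name shuffle_any(), derives the number of rows from the input so that it works for a data frame of any size. The data frame hand and the seed value 42 are also made up for illustration.

```r
# Variant that reads the number of rows from the data frame itself
shuffle_any <- function(cards) {
  random <- sample(seq_len(nrow(cards)))  # permutation of 1..nrow(cards)
  cards[random, , drop = FALSE]           # reorder rows, keep a data frame
}

# Hypothetical five-card hand
hand <- data.frame(
  face  = c("king", "queen", "jack", "ten", "nine"),
  value = 13:9,
  stringsAsFactors = FALSE
)

set.seed(42)
mixed <- shuffle_any(hand)

nrow(mixed) == nrow(hand)         # TRUE, no card lost or added
setequal(mixed$face, hand$face)   # TRUE, same cards in some order
```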

#Dollar Signs and Double Brackets

Two types of object in R obey an optional second system of notation. You can extract values from data frames and lists with the $ syntax. You will encounter the $ syntax again and again as an R programmer, so let’s examine how it works.

To select a column from a data frame, write the data frame’s name and the column name separated by a $. Notice that no quotes should go around the column name. If we want to return the value column of the deck data frame, we call deck$value:

deck$value

##  [1] 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3  2
## [26]  1 13 12 11 10  9  8  7  6  5  4  3  2  1 13 12 11 10  9  8  7  6  5  4  3
## [51]  2  1

The $ operator returns the column itself, so deck$value is a vector rather than a one-column data frame. We can calculate the mean of this vector by applying the mean() function:

mean(deck$value)

## [1] 7

The same way, we calculate the median by calling the median() function on the value column:

median(deck$value)

## [1] 7

For the column deck$value, mean and median are the same.

In the next step, we make a list, consisting of numbers, a logical value, and 3 strings.

lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))
lst

## $numbers
## [1] 1 2
## 
## $logical
## [1] TRUE
## 
## $strings
## [1] "a" "b" "c"

And then subset it:

lst[1]

## $numbers
## [1] 1 2

The result is a smaller list with one element. That element is the vector c(1, 2). This can be annoying because many R functions do not work with lists. For example, sum(lst[1]) will return an error. It would be horrible if once you stored a vector in a list, you could only ever get it back as a list:

#sum(lst[1]) will return an error.

When you use the $ notation, R will return the selected values as they are, with no list structure around them:

lst$numbers

## [1] 1 2

You can then immediately feed the results to a function:

sum(lst$numbers)

## [1] 3
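To compare the two kinds of result side by side without stopping a script, the failing call can be wrapped in tryCatch(). This is a sketch only; the name sum_fails is introduced here for illustration.

```r
lst <- list(numbers = c(1, 2), logical = TRUE, strings = c("a", "b", "c"))

# sum() on a list raises an error; tryCatch() converts it into TRUE
sum_fails <- tryCatch(
  {
    sum(lst[1])
    FALSE
  },
  error = function(e) TRUE
)
sum_fails           # TRUE

class(lst[1])       # "list"
class(lst$numbers)  # "numeric"
sum(lst$numbers)    # 3
```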

If the elements in your list do not have names (or you do not wish to use the names), you can use two brackets, instead of one, to subset the list. This notation will do the same thing as the $ notation:

lst[[1]]

## [1] 1 2

In other words, if you subset a list with single-bracket notation, R will return a smaller list. If you subset a list with double-bracket notation, R will return just the values that were inside an element of the list. You can combine this feature with any of R’s indexing methods:

lst["numbers"]

## $numbers
## [1] 1 2

lst[["numbers"]]

## [1] 1 2

When calling list names instead of indices, this works exactly the same. Using one single bracket, the output is a smaller list. Using two brackets, R returns the values inside the element of the list.
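Because a data frame is a list of columns, the same rule carries over to data frames. The sketch below uses a hypothetical stand-in called cards instead of the real deck; one_col_frame and values are illustrative names.

```r
# Hypothetical three-card stand-in for the deck data frame
cards <- data.frame(
  face  = c("king", "queen", "jack"),
  value = c(13, 12, 11),
  stringsAsFactors = FALSE
)

one_col_frame <- cards["value"]    # single brackets: a smaller data frame
values        <- cards[["value"]]  # double brackets: the column's values

is.data.frame(one_col_frame)    # TRUE
is.numeric(values)              # TRUE
identical(values, cards$value)  # TRUE, [[ ]] and $ return the same vector
```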

#Changing Values in Place

For the next considerations, we create a new vector vec, containing 6 zeros.

vec <- c(0, 0, 0, 0, 0, 0)
vec

## [1] 0 0 0 0 0 0

We can manipulate the vector by assigning a new value to one of its entries, using the entry's index. Let us set the first entry to 1000:

vec[1] <- 1000
vec

## [1] 1000    0    0    0    0    0

We can also change more than one value at the same time, again using indices. With the concatenate function, we assign new values to the entries 1, 3, and 5.

vec[c(1, 3, 5)] <- c(12, 3, 99)
vec

## [1] 12  0  3  0 99  0

We can also combine the vector manipulation with mathematical statements. In the code below, we increase the last three entries by 3:

vec[4:6] <- vec[4:6] + 3
vec

## [1]  12   0   3   3 102   3

Even though vec has a length of 6, we can assign a value to the 7th component, forcing the vector to grow by one component.

vec[7] <- 0
vec

## [1]  12   0   3   3 102   3   0
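The vector can also grow by more than one component at a time. As a small sketch with a fresh three-element vector, assigning to a position well past the end fills the skipped positions with NA:

```r
vec <- c(0, 0, 0)

vec[6] <- 1   # positions 4 and 5 do not exist yet
vec           # 0 0 0 NA NA 1
length(vec)   # 6
```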

Now, again, we import the deck.csv data file that we downloaded from Canvas, store the data in a variable deck2, add a new column which represents the initial card number in the deck, and print the first 10 rows, using the head() function with input parameter 10.

deck2 <- read.csv("deck.csv",stringsAsFactors=FALSE)
deck2$new <- 1:52
head(deck2,10)

Apparently, the new column was not as useful as we thought. Hence, we remove the column new from the data frame deck2 by setting the column to NULL. Again, we print the first 10 rows:

deck2$new <- NULL
head(deck2,10)

With the next call we print four specific rows of the data frame, namely the rows with the indices 13, 26, 39, and 52.

deck2[c(13, 26, 39, 52), ]

We see that we picked all aces. Now, let’s return only the value column for these aces:

deck2[c(13, 26, 39, 52), 3]

## [1] 1 1 1 1

Our output is a vector because we selected a single column, so R dropped the data frame structure by default. The values are all 1. Another way to return the third column is by calling its name, using the dollar sign:

deck2$value[c(13, 26, 39, 52)]

## [1] 1 1 1 1

The output stays the same. Now, we decide that aces should have the value 14 instead of 1. We apply what we have learned to change the value column for all aces, assigning the value 14 to each one of them.

deck2$value[c(13, 26, 39, 52)] <- c(14, 14, 14, 14)

Now all the aces have the value 14 and hence, they are the most valuable cards in the deck.
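The same replacement can be written more compactly. The sketch below works on a plain vector called value that stands in for deck2$value, and assumes, as in the text above, that the aces sit in rows 13, 26, 39, and 52. The name ace_rows is introduced here for illustration.

```r
# Stand-in for deck2$value: 13 down to 1, repeated for four suits
value <- rep(13:1, times = 4)

ace_rows <- seq(13, 52, by = 13)  # 13 26 39 52

value[ace_rows]        # 1 1 1 1
value[ace_rows] <- 14  # a single value is recycled to all four positions
value[ace_rows]        # 14 14 14 14
```

On the data frame itself, the equivalent assignment would be deck2$value[ace_rows] <- 14.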

Extraction

Exercise