This lesson will illustrate how to use the tidyverse grammar for (1) reducing the length of the dataframe to specific rows using the filter function in the dplyr package, and (2) reducing the width of the dataframe to specific columns using the select function in the dplyr package.
You don’t have to use the dplyr packages to do this, but this package makes it much easier to remember and read. ## Preliminaries Install the dplyr package if you haven’t already done so. If you have already done so, then erase or comment out the following code chunk
# install.packages('dplyr')
Load the dplyr package. You should do this any time you start a new R
session. Notice the warning about masking objects from other packages.
This means, for instance, that filter is a function in the
stats package, as well as the dplyr package. By loading the dplyr
package the default filter function will be the one from dplyr, rather
than the filter function from the stats package. You can still call the
filter function from the stats package by adding the package name before
the function and separating it with two colons:
stats::filter().
library(dplyr)
Save this file in a folder that contains the jan17Items.csv file. Alternatively, move that file to the folder in which this notebook file is saved on your machine.
Set the working directory to the folder where this notebook file and data are saved. You can do that in one of at least two ways:
setwd() command, orRead in the jan17Items data as j17i.
str(j17i)
'data.frame': 8899 obs. of 21 variables:
$ Time : chr "2017-01-26T21:18:00Z" "2017-01-26T20:30:00Z" "2017-01-26T20:30:00Z" "2017-01-26T20:30:00Z" ...
$ OperationType : chr "SALE" "SALE" "SALE" "SALE" ...
$ BarCode : chr "*" "*" "*" "*" ...
$ CashierName : chr "Nicholas Villines" "Carla Knotts" "Carla Knotts" "Carla Knotts" ...
$ LineItem : chr "Glass Mug" "Lamb Chops" "Salmon and Wheat Bran Salad" "Fountain Drink" ...
$ Department : chr "Beverage" "Entrees" "Entrees" "Beverage" ...
$ Category : chr "Glass Bottle" "Lamb Chops" "Salmon and Wheat Bran Salad" "Fountain" ...
$ CardholderName : chr NA NA NA NA ...
$ RegisterName : chr "RT149" "RT151" "RT151" "RT151" ...
$ StoreNumber : chr "AZ23501411" "AZ23501251" "AZ23501305" "AZ23501289" ...
$ TransactionNumber: chr "00Z67OS78157" "00XT6G2179417" "00XT6G2179417" "00XT6G2179417" ...
$ CustomerCode : chr "CWM11331L8O" "CXV10742CJW" "CXV10742CJW" "CXV10742CJW" ...
$ Cost : num 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 ...
$ Price : num 3.73 13.38 13.9 2.88 4.83 ...
$ Quantity : int 1 1 1 1 1 1 1 1 1 1 ...
$ Modifiers : num 0.01 0.01 0.01 0.01 0.01 2.36 2.36 1.08 1.08 0.01 ...
$ Subtotal : num 3.74 13.39 13.91 2.89 4.84 ...
$ Discounts : num 3 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 3.63 ...
$ NetTotal : num 0.74 13.42 13.94 2.92 4.87 ...
$ Tax : num NaN 1.06 1.09 0.23 0.38 1.13 1.13 1.23 1.04 NaN ...
$ TotalDue : num NaN 14.48 15.03 3.15 5.25 ...
You can filter rows of data to observations that meet a certain
condition using the filter() function from the dplyr
package along with the relational operator that pertains to your
situation.
highCost <- filter(j17i, Cost > 11)
dim(highCost)
[1] 14 21
Compare this to the code to filter rows using base R.
highCostBase <- j17i[j17i$Cost > 11,]
dim(highCostBase)
[1] 14 21
Base R requires you to type the name of the dataframe twice. It’s not that big of a deal, but when you start combining multiple preprocessing functions, the tidyverse grammar is easier to read because you don’t have to type out the name of the dataframe as often, and it’s typically less verbose in general.
You can also filter on multiple conditions. For instance, to reduce the observations only to those where cost is greater than 11 AND price is greater than 13 you would do this:
highCostAndPrice <- filter(j17i, Cost > 11 & Price > 13)
dim(highCostAndPrice)
[1] 0 21
If you want to be less exclusive and include observations where cost is greater than 11 OR price is greater than 13 then you can do this:
highCostOrPrice <- filter(j17i, Cost > 11 | Price > 13)
dim(highCostOrPrice)
[1] 3278 21
You can reduce the width of a dataframe by selecting only certain
columns using the select() function from the dplyr package.
Here’s how you would select onl the Cost column:
highCost_Cost <- select(highCost, Cost)
str(highCost_Cost)
'data.frame': 14 obs. of 1 variable:
$ Cost: num 19 19 189 189 189 ...
Notice that the highCost_Cost object is still a dataframe object. Compare this to the base R approach:
highCost_CostBase <- highCost[,c('Cost')]
str(highCost_CostBase)
num [1:14] 19 19 189 189 189 ...
Base R is only a little more verbose, but it’s definitely harder to
read. It takes more mental power to identify what’s going on compared to
the dplyr approach in which the select() function
explicitly indicates that you’re selecting certain columns.
To select multiple columns, such as the Cost and Price columns is a straightforward extension of selecting a single column:
highCost_CostPrice <- select(highCost, Cost, Price)
str(highCost_CostPrice)
'data.frame': 14 obs. of 2 variables:
$ Cost : num 19 19 189 189 189 ...
$ Price: num NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
There are some occasions when the filter and select functions from dplyr are harder than using base R, but you will not encounter those cases in this course.