Read in the ities.csv datafile as a dataframe object, df.
df <- read.csv('ities.csv')
Display the number of rows and columns in the dataset using an appropriate R function. Below the output, identify which numbers from the output correspond to the number of rows and columns.
dim(df)
## [1] 438151 13
There are 438,151 rows and 13 columns.
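As a small illustration of how the two numbers returned by dim() map to rows and columns, the sketch below uses a made-up data frame. The name toy and its values are hypothetical and are not taken from ities.csv.

```r
# Hypothetical toy data frame, used only to illustrate dim(); not the real data.
toy <- data.frame(Price = c(2.5, 4.0, 1.25), Quantity = c(1L, 2L, 1L))

dims   <- dim(toy)   # integer vector: rows first, then columns
n_rows <- dims[1]    # same value as nrow(toy)
n_cols <- dims[2]    # same value as ncol(toy)
```

The first element is always the row count and the second the column count, so nrow() and ncol() can be used when only one of the two is needed.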
Display the structure of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.
str(df)
## 'data.frame': 438151 obs. of 13 variables:
## $ Date : chr "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
## $ OperationType : chr "SALE" "SALE" "SALE" "SALE" ...
## $ CashierName : chr "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
## $ LineItem : chr "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
## $ Department : chr "Entrees" "Beverage" "Kabobs" "Salad" ...
## $ Category : chr "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
## $ RegisterName : chr "RT149" "RT149" "RT149" "RT149" ...
## $ StoreNumber : chr "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
## $ TransactionNumber: chr "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
## $ CustomerCode : chr "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
## $ Price : num 66.22 2.88 12.02 18.43 18.43 ...
## $ Quantity : int 1 1 2 1 1 1 1 1 1 1 ...
## $ TotalDue : num 66.22 2.88 24.04 18.43 18.43 ...
There are 10 character columns and 3 numeric columns (Price and TotalDue are num, Quantity is int). The StoreNumber column also contains several different values, which indicates that multiple stores are associated with this restaurant.
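The two observations above can also be checked with code instead of reading the str() output by eye. The following is a minimal sketch on a made-up stand-in for df. The data frame toy and its values are assumptions for illustration, and only the column names mirror the real data.

```r
# Hypothetical stand-in for df; column names mirror the real data, values are made up.
toy <- data.frame(
  StoreNumber = c("AZ23501305", "AZ23501289", "AZ23501305"),
  Department  = c("Entrees", "Beverage", "Salad"),
  Price       = c(66.22, 2.88, 18.43),
  Quantity    = c(1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Tally the columns by storage class (character, numeric, integer).
type_counts <- table(sapply(toy, class))

# Count the distinct store identifiers.
n_stores <- length(unique(toy$StoreNumber))
```

Running the same two expressions on df would give the class tally and the number of distinct stores for the full dataset.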
Display a summary of the columns in df. Below the output, comment on at least two columns for which the existing data type is not useful for the summary function, the format to which they should be changed, and why that change would be helpful.
summary(df)
## Date OperationType CashierName LineItem
## Length:438151 Length:438151 Length:438151 Length:438151
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Department Category RegisterName StoreNumber
## Length:438151 Length:438151 Length:438151 Length:438151
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TransactionNumber CustomerCode Price Quantity
## Length:438151 Length:438151 Min. :-5740.51 Min. : 1.000
## Class :character Class :character 1st Qu.: 4.50 1st Qu.: 1.000
## Mode :character Mode :character Median : 11.29 Median : 1.000
## Mean : 14.36 Mean : 1.177
## 3rd Qu.: 14.68 3rd Qu.: 1.000
## Max. :21449.97 Max. :815.000
## NA's :12
## TotalDue
## Min. :-5740.51
## 1st Qu.: 4.50
## Median : 11.80
## Mean : 15.26
## 3rd Qu.: 15.04
## Max. :21449.97
## NA's :12
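Character columns such as Department and OperationType should be changed to factors. For a character column, summary() reports only the length, class, and mode. For a factor it reports the number of rows in each level, which also makes the data easier to interpret in plots and other analyses.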
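To make the effect of a type change concrete, here is a hedged sketch on a made-up data frame. The name toy and its values are hypothetical. Parsing Date with the format string "%m/%d/%Y" is an assumption based on sample values such as "7/18/2016", and goes beyond the answer above.

```r
# Hypothetical toy data; not the real ities.csv contents.
toy <- data.frame(
  Date       = c("7/18/2016", "7/19/2016", "7/19/2016"),
  Department = c("Entrees", "Beverage", "Entrees"),
  stringsAsFactors = FALSE
)

# A factor lets summary() count the rows in each level.
toy$Department <- as.factor(toy$Department)
dept_summary   <- summary(toy$Department)

# A Date column lets summary() and range() report earliest and latest dates.
# The month/day/year format is assumed from values like "7/18/2016".
toy$Date   <- as.Date(toy$Date, format = "%m/%d/%Y")
date_range <- range(toy$Date)
```

After the conversion, summary() returns per-level counts for Department and a date range for Date instead of only length, class, and mode.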
Convert the values in Department and LineItem columns to lower case and save them as new columns, Department_lower and LineItem_lower. Display the first five rows of only those four columns, Department, Department_lower, LineItem, and LineItem_lower to verify that the case conversion worked.
# install.packages("dplyr")  # run once if dplyr is not already installed
library("dplyr")
df$LineItem_lower <- df$LineItem
df$Department_lower <- df$Department
df$LineItem_lower <- tolower(df$LineItem_lower)
df$Department_lower <- tolower(df$Department_lower)
df[1:5, c("Department", "Department_lower", "LineItem", "LineItem_lower")]
## Department Department_lower LineItem
## 1 Entrees entrees Salmon and Wheat Bran Salad
## 2 Beverage beverage Fountain Drink
## 3 Kabobs kabobs Beef and Squash Kabob
## 4 Salad salad Salmon and Wheat Bran Salad
## 5 Salad salad Salmon and Wheat Bran Salad
## LineItem_lower
## 1 salmon and wheat bran salad
## 2 fountain drink
## 3 beef and squash kabob
## 4 salmon and wheat bran salad
## 5 salmon and wheat bran salad
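The copy-then-convert steps above can also be written as a single assignment. The sketch below uses base R on a made-up data frame. The names toy and cols are hypothetical, and this is one possible shortcut rather than the required approach.

```r
# Hypothetical toy data; values copied from the sample rows for illustration.
toy <- data.frame(
  Department = c("Entrees", "Beverage"),
  LineItem   = c("Salmon and Wheat Bran Salad", "Fountain Drink"),
  stringsAsFactors = FALSE
)

# Lower-case both columns at once and store them under new *_lower names.
cols <- c("Department", "LineItem")
toy[paste0(cols, "_lower")] <- lapply(toy[cols], tolower)

# Select by name so the check does not depend on column positions.
check <- toy[, c("Department", "Department_lower", "LineItem", "LineItem_lower")]
```

Selecting columns by name rather than by index keeps the verification step correct even if columns are later added or reordered.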
Use the “plot” function on Department_lower, and then run that code chunk. You will get an error. Below the output, describe the reason for the error. Then make sure to comment out this code chunk by placing a hashtag/pound sign on the far left of the line of code. If you don’t comment out code that contains an error, then the markdown file will not be able to knit to an html file.
# plot(df$Department_lower)
The code chunk generated an error because Department_lower is stored as a character vector, which plot() cannot draw. The column needs to be converted to a factor so that plot() can count the rows in each level and display them as a bar chart.
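The difference between the two types can be reproduced safely by wrapping each call in tryCatch(), so the error is captured instead of stopping the knit. This is a minimal sketch. The vector dept_chr is made up, and pdf(NULL) is used only so that nothing is drawn or written to disk.

```r
# Open a null graphics device so no plot window or file is produced.
pdf(NULL)

# Hypothetical character vector standing in for df$Department_lower.
dept_chr <- c("entrees", "beverage", "entrees")

# Plotting the character vector fails; tryCatch() records that it did.
chr_result <- tryCatch(
  { suppressWarnings(plot(dept_chr)); "plotted" },
  error = function(e) "error"
)

# Plotting the same values as a factor succeeds.
fct_result <- tryCatch(
  { plot(as.factor(dept_chr)); "plotted" },
  error = function(e) "error"
)

invisible(dev.off())
```

Here chr_result records the failure and fct_result records the success, which matches the explanation above.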
Convert the datatype of Department_lower to a factor type. Do not create a new column, just convert it in place. Display the structure of the dataframe, df, to verify that the Department_lower column is a factor data type. Beneath the output, indicate how many levels there are in the Department_lower column.
df$Department_Lower <- as.factor(df$Department_lower)
class(df$Department_Lower)
## [1] "factor"
str(df)
## 'data.frame': 438151 obs. of 16 variables:
## $ Date : chr "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
## $ OperationType : chr "SALE" "SALE" "SALE" "SALE" ...
## $ CashierName : chr "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
## $ LineItem : chr "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
## $ Department : chr "Entrees" "Beverage" "Kabobs" "Salad" ...
## $ Category : chr "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
## $ RegisterName : chr "RT149" "RT149" "RT149" "RT149" ...
## $ StoreNumber : chr "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
## $ TransactionNumber: chr "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
## $ CustomerCode : chr "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
## $ Price : num 66.22 2.88 12.02 18.43 18.43 ...
## $ Quantity : int 1 1 2 1 1 1 1 1 1 1 ...
## $ TotalDue : num 66.22 2.88 24.04 18.43 18.43 ...
## $ LineItem_lower : chr "salmon and wheat bran salad" "fountain drink" "beef and squash kabob" "salmon and wheat bran salad" ...
## $ Department_lower : chr "entrees" "beverage" "kabobs" "salad" ...
## $ Department_Lower : Factor w/ 9 levels "beverage","catering",..: 3 1 6 7 7 4 6 3 3 1 ...
df$Department_Lower <- forcats::fct_infreq(df$Department_Lower)
summary(df$Department_Lower)
## entrees kabobs sides beverage general salad gift cards
## 152575 102053 97284 35746 27885 20870 731
## catering swag
## 651 356
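After converting the lower-case department values to a factor, the output shows that the column has 9 levels. Note that the code stores the factor in a new column, Department_Lower (capital L), rather than overwriting Department_lower, which is why str() reports 16 variables. After reordering the levels with fct_infreq(), the summary lists each department with its number of occurrences from most to least frequent.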
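For comparison, the following sketch converts a column in place and orders its levels by descending frequency using base R only. The data frame toy and its values are hypothetical. The base R reordering is offered as a rough analogue of forcats::fct_infreq(), not as a statement of how that function is implemented.

```r
# Hypothetical toy data; not the real department values.
toy <- data.frame(
  Department_lower = c("kabobs", "entrees", "sides", "entrees", "kabobs", "entrees"),
  stringsAsFactors = FALSE
)

# Overwrite the existing column instead of creating a new one.
toy$Department_lower <- as.factor(toy$Department_lower)

# Reorder the levels from most to least frequent.
freq <- sort(table(toy$Department_lower), decreasing = TRUE)
toy$Department_lower <- factor(toy$Department_lower, levels = names(freq))

n_levels    <- nlevels(toy$Department_lower)  # number of distinct departments
level_order <- levels(toy$Department_lower)   # most frequent first
```

Because the assignment reuses the same column name, the data frame keeps the same number of columns, and nlevels() reports the level count directly.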
Use the “plot” function on the Department_lower column to display a plot of that column from most frequent on the left to least frequent on the right. Below the output, identify the department that occurs most frequently, as well as the one that occurs least frequently.
It may be hard for you to read the names of all of the departments from the plot. You may have to add additional code to adjust the plot or to print out additional summary information so that you can identify the Departments that appear most/least frequently. Make sure that your comments are supported by the code that is displayed.
plot(df$Department_Lower,
     main = "Department Frequency",
     xlab = "Department",
     ylab = "Occurrences",
     col = "lightblue",
     las = 2,
     cex.names = 0.7,
     cex.axis = 0.7)
As shown in the plot above and in the earlier summary output, entrees is the most frequent department (152,575 rows) and swag is the least frequent (356 rows).
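When the axis labels are hard to read, the most and least frequent departments can be pulled out with code so that the comment is backed by output. This is a minimal sketch on a made-up factor. The names dept, counts, most_frequent, and least_frequent are hypothetical.

```r
# Hypothetical factor standing in for the department column.
dept <- factor(c("entrees", "kabobs", "entrees", "swag", "kabobs", "entrees"))

# Count the rows in each level.
counts <- table(dept)

# Names of the levels with the largest and smallest counts.
most_frequent  <- names(which.max(counts))
least_frequent <- names(which.min(counts))
```

Note that which.max() and which.min() return only the first match, so ties would need a separate check.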