HW3

Task #1

Read in the ities.csv datafile as a dataframe object, df.

df <- read.csv('ities.csv')

Task #2

Display the number of rows and columns in the dataset using an appropriate R function. Below the output, identify which numbers from the output correspond to the number of rows and columns.

dim(df)

## [1] 438151     13

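In the output, the first number (438151) is the number of rows and the second number (13) is the number of columns, so the dataset has 438,151 rows and 13 columns.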
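
The two values returned by dim() can also be pulled out one at a time. Below is a minimal sketch on a small made-up data frame; the name toy and its values are hypothetical and only stand in for df.

```r
# Hypothetical stand-in for df: 3 rows, 2 columns
toy <- data.frame(Price = c(66.22, 2.88, 12.02),
                  Quantity = c(1L, 1L, 2L))

dims <- dim(toy)    # integer vector: rows first, then columns
n_rows <- dims[1]   # same value as nrow(toy)
n_cols <- dims[2]   # same value as ncol(toy)

cat("rows:", n_rows, "columns:", n_cols, "\n")
```

Running the same lines with df in place of toy should give the two numbers reported above.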

Task #3

Display the structure of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.

str(df)

## 'data.frame':    438151 obs. of  13 variables:
##  $ Date             : chr  "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
##  $ OperationType    : chr  "SALE" "SALE" "SALE" "SALE" ...
##  $ CashierName      : chr  "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
##  $ LineItem         : chr  "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
##  $ Department       : chr  "Entrees" "Beverage" "Kabobs" "Salad" ...
##  $ Category         : chr  "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
##  $ RegisterName     : chr  "RT149" "RT149" "RT149" "RT149" ...
##  $ StoreNumber      : chr  "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
##  $ TransactionNumber: chr  "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
##  $ CustomerCode     : chr  "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
##  $ Price            : num  66.22 2.88 12.02 18.43 18.43 ...
##  $ Quantity         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ TotalDue         : num  66.22 2.88 24.04 18.43 18.43 ...

Ten columns are character type and three are numeric (Price and TotalDue are num; Quantity is int). Also, the StoreNumber column contains several different values, which is evidence that multiple stores are associated with this restaurant.
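
Both observations can be checked with code rather than by reading the str() output. This is a rough sketch on a made-up three-row data frame (toy is hypothetical, not the real data), counting columns by class and counting distinct store numbers.

```r
# Hypothetical stand-in for df with one column of each type
toy <- data.frame(
  StoreNumber = c("AZ23501305", "AZ23501289", "AZ23501305"),
  Price       = c(66.22, 2.88, 12.02),
  Quantity    = c(1L, 1L, 2L),
  stringsAsFactors = FALSE
)

# How many columns of each class are there?
type_counts <- table(sapply(toy, class))

# How many distinct stores appear in the data?
n_stores <- length(unique(toy$StoreNumber))

print(type_counts)
print(n_stores)
```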

Task #4

Display a summary of the columns in df. Below the output, comment on at least two columns for which the existing data type is not useful for the summary function, the format to which they should be changed, and why that change would be helpful.

summary(df)

##      Date           OperationType      CashierName          LineItem        
##  Length:438151      Length:438151      Length:438151      Length:438151     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   Department          Category         RegisterName       StoreNumber       
##  Length:438151      Length:438151      Length:438151      Length:438151     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  TransactionNumber  CustomerCode           Price             Quantity      
##  Length:438151      Length:438151      Min.   :-5740.51   Min.   :  1.000  
##  Class :character   Class :character   1st Qu.:    4.50   1st Qu.:  1.000  
##  Mode  :character   Mode  :character   Median :   11.29   Median :  1.000  
##                                        Mean   :   14.36   Mean   :  1.177  
##                                        3rd Qu.:   14.68   3rd Qu.:  1.000  
##                                        Max.   :21449.97   Max.   :815.000  
##                                        NA's   :12                          
##     TotalDue       
##  Min.   :-5740.51  
##  1st Qu.:    4.50  
##  Median :   11.80  
##  Mean   :   15.26  
##  3rd Qu.:   15.04  
##  Max.   :21449.97  
##  NA's   :12

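The following columns are not useful in their existing data type:
- Date
- Department

Date should be converted to the Date class (for example with as.Date and the format "%m/%d/%Y") so that summary reports the earliest, median, and latest dates instead of only a length and a class. Department should be converted to a factor so that summary shows the number of rows in each department, and so that the column can be used directly in plots and grouped analyses.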
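
A minimal sketch of these type conversions on made-up rows is shown here. It assumes the Date strings are month/day/year, as in the str() output above, and parses them with as.Date(); Department becomes a factor. The data frame toy is hypothetical.

```r
# Hypothetical stand-in for df
toy <- data.frame(
  Date       = c("7/18/2016", "7/19/2016", "8/1/2016"),
  Department = c("Entrees", "Beverage", "Entrees"),
  stringsAsFactors = FALSE
)

# Parse month/day/year text into real dates (assumed format)
toy$Date <- as.Date(toy$Date, format = "%m/%d/%Y")

# Turn the department labels into a factor
toy$Department <- as.factor(toy$Department)

# summary() now reports a date range and per-department counts
summary(toy)
```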

Task #5

Convert the values in Department and LineItem columns to lower case and save them as new columns, Department_lower and LineItem_lower. Display the first five rows of only those four columns, Department, Department_lower, LineItem, and LineItem_lower to verify that the case conversion worked.

# install.packages("dplyr")  # run once if dplyr is not already installed
library(dplyr)

# Create lower-case copies of the two columns
df$LineItem_lower <- tolower(df$LineItem)
df$Department_lower <- tolower(df$Department)
# Show the first five rows of the original and lower-case columns
df[1:5, c("Department", "Department_lower", "LineItem", "LineItem_lower")]

##   Department Department_lower                    LineItem
## 1    Entrees          entrees Salmon and Wheat Bran Salad
## 2   Beverage         beverage              Fountain Drink
## 3     Kabobs           kabobs       Beef and Squash Kabob
## 4      Salad            salad Salmon and Wheat Bran Salad
## 5      Salad            salad Salmon and Wheat Bran Salad
##                LineItem_lower
## 1 salmon and wheat bran salad
## 2              fountain drink
## 3       beef and squash kabob
## 4 salmon and wheat bran salad
## 5 salmon and wheat bran salad

Task #6

Use the “plot” function on Department_lower, and then run that code chunk. You will get an error. Below the output, describe the reason for the error. Then make sure and comment out this code chunk by placing a hashtag/pound sign on the far left of the line of code. If you don’t comment out code that contains an error, then the markdown file will not be able to knit to an html file.

# plot(df$Department_lower)

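The code chunk generated an error because Department_lower is stored as a character vector. plot() tries to treat its argument as numeric coordinates, and text values cannot be coerced to numbers, so there is nothing to draw. Converting the column to a factor lets plot() draw a bar chart of the level counts.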
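
The difference can be reproduced without stopping the knit by wrapping each call in tryCatch(). This is only a sketch on a made-up vector; dept_chr is hypothetical, and pdf(NULL) is used so that no plot window or file is created.

```r
pdf(NULL)  # null graphics device: nothing is drawn to screen or file

# Hypothetical stand-in for df$Department_lower
dept_chr <- c("entrees", "beverage", "entrees")

# plot() on a character vector fails; record the failure instead of stopping
chr_result <- tryCatch(
  { suppressWarnings(plot(dept_chr)); "ok" },
  error = function(e) "error"
)

# The same values as a factor plot without an error
fct_result <- tryCatch(
  { plot(as.factor(dept_chr)); "ok" },
  error = function(e) "error"
)

invisible(dev.off())
print(c(character = chr_result, factor = fct_result))
```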

Task #7

Convert the datatype of Department_lower to a factor type. Do not create a new column, just convert it in place. Display the structure of the dataframe, df, to verify that the Department_lower column is a factor data type. Beneath the output, indicate how many levels there are in the Department_lower column.

df$Department_Lower <- as.factor(df$Department_lower)
class(df$Department_Lower)

## [1] "factor"

str(df)

## 'data.frame':    438151 obs. of  16 variables:
##  $ Date             : chr  "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
##  $ OperationType    : chr  "SALE" "SALE" "SALE" "SALE" ...
##  $ CashierName      : chr  "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
##  $ LineItem         : chr  "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
##  $ Department       : chr  "Entrees" "Beverage" "Kabobs" "Salad" ...
##  $ Category         : chr  "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
##  $ RegisterName     : chr  "RT149" "RT149" "RT149" "RT149" ...
##  $ StoreNumber      : chr  "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
##  $ TransactionNumber: chr  "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
##  $ CustomerCode     : chr  "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
##  $ Price            : num  66.22 2.88 12.02 18.43 18.43 ...
##  $ Quantity         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ TotalDue         : num  66.22 2.88 24.04 18.43 18.43 ...
##  $ LineItem_lower   : chr  "salmon and wheat bran salad" "fountain drink" "beef and squash kabob" "salmon and wheat bran salad" ...
##  $ Department_lower : chr  "entrees" "beverage" "kabobs" "salad" ...
##  $ Department_Lower : Factor w/ 9 levels "beverage","catering",..: 3 1 6 7 7 4 6 3 3 1 ...

df$Department_Lower <- forcats::fct_infreq(df$Department_Lower)

summary(df$Department_Lower)

##    entrees     kabobs      sides   beverage    general      salad gift cards 
##     152575     102053      97284      35746      27885      20870        731 
##   catering       swag 
##        651        356

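After the conversion to a factor, the str() output shows that there are 9 levels. Note that the code stores the factor in a column named Department_Lower (capital L), so the original character column Department_lower is still present, which is why str() reports 16 variables. A summary of the factor, after its levels are reordered with forcats::fct_infreq, shows each department and its number of occurrences from most to least frequent.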
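
For comparison, here is a sketch of converting the column in place, so that no extra column is added. It uses base R (table() and sort()) as a stand-in for forcats::fct_infreq, and the data frame toy is made up for illustration.

```r
# Hypothetical stand-in for df
toy <- data.frame(
  Department_lower = c("kabobs", "entrees", "entrees",
                       "swag", "kabobs", "entrees"),
  stringsAsFactors = FALSE
)
n_cols_before <- ncol(toy)

# Level names ordered from most to least frequent
lvl <- names(sort(table(toy$Department_lower), decreasing = TRUE))

# Overwrite the existing column instead of creating a new one
toy$Department_lower <- factor(toy$Department_lower, levels = lvl)

str(toy)                       # still one column, now a factor
summary(toy$Department_lower)  # counts in level order
```

Because the assignment targets the same column name, the column count stays the same and only the type changes.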

Task #8

Use the “plot” function on the Department_lower column to display a plot of that column from most frequent on the left to least frequent on the right. Below the output, identify the department that occurs most frequently, as well as the one that occurs least frequently.

It may be hard for you to read the names of all of the departments from the plot. You may have to add additional code to adjust the plot or to print out additional summary information so that you can identify the Departments that appear most/least frequently. Make sure that your comments are supported by the code that is displayed.

plot(df$Department_Lower,
        main = "Department Frequency",
        xlab = "Department",
        ylab = "Occurrences",
        col = "lightblue",
        las = 2,
        cex.names = 0.7,
        cex.axis = 0.7)

As shown in the plot above and in the summary() counts from Task #7, entrees is the most frequent department (152,575 rows) and swag is the least frequent (356 rows).
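
If the axis labels are hard to read, the two extremes can be found from the counts directly. The sketch below copies the counts from the summary() output in Task #7 into a named vector (dept_counts is a name introduced here for illustration) rather than reading them from df.

```r
# Counts copied from the summary() output in Task #7
dept_counts <- c(entrees = 152575, kabobs = 102053, sides = 97284,
                 beverage = 35746, general = 27885, salad = 20870,
                 `gift cards` = 731, catering = 651, swag = 356)

# Sanity check: the counts add up to the number of rows in df
stopifnot(sum(dept_counts) == 438151)

most  <- names(which.max(dept_counts))  # department with the largest count
least <- names(which.min(dept_counts))  # department with the smallest count

cat("most frequent:", most, "\n")
cat("least frequent:", least, "\n")
```

With the real data, table(df$Department_Lower) could take the place of the hand-copied vector.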

Connie Brauer

2022-08-30