Read in the ities.csv datafile as a dataframe object, df.
df <- read.csv('ities.csv')
No descriptive answer is needed here.
(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.
nrow(df)
## [1] 438128
ncol(df)
## [1] 13
This dataframe has 438128 rows and 13 columns.
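As a compact alternative, dim() returns both counts in a single call. The sketch below uses a small made-up dataframe, toy, rather than the ities data, so its values are assumptions for illustration only.

```r
# Toy dataframe standing in for df (hypothetical values, not the ities data)
toy <- data.frame(id = 1:4, price = c(2.5, 3.0, 4.25, 1.0))

# dim() returns an integer vector: c(number of rows, number of columns)
dims <- dim(toy)
dims
```

The first element matches nrow(toy) and the second matches ncol(toy).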
(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.
summary(df)
## Date OperationType CashierName LineItem
## Length:438128 Length:438128 Length:438128 Length:438128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Department Category RegisterName StoreNumber
## Length:438128 Length:438128 Length:438128 Length:438128
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## TransactionNumber CustomerCode Price Quantity
## Length:438128 Length:438128 Min. :-5740.51 Min. : 1.000
## Class :character Class :character 1st Qu.: 4.50 1st Qu.: 1.000
## Mode :character Mode :character Median : 11.29 Median : 1.000
## Mean : 14.36 Mean : 1.177
## 3rd Qu.: 14.68 3rd Qu.: 1.000
## Max. :21449.97 Max. :815.000
## NA's :12
## TotalDue
## Min. :-5740.51
## 1st Qu.: 4.50
## Median : 11.80
## Mean : 15.25
## 3rd Qu.: 15.04
## Max. :21449.97
## NA's :12
str(df)
## 'data.frame': 438128 obs. of 13 variables:
## $ Date : chr "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
## $ OperationType : chr "SALE" "SALE" "SALE" "SALE" ...
## $ CashierName : chr "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
## $ LineItem : chr "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
## $ Department : chr "Entrees" "Beverage" "Kabobs" "Salad" ...
## $ Category : chr "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
## $ RegisterName : chr "RT149" "RT149" "RT149" "RT149" ...
## $ StoreNumber : chr "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
## $ TransactionNumber: chr "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
## $ CustomerCode : chr "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
## $ Price : num 66.22 2.88 12.02 18.43 18.43 ...
## $ Quantity : int 1 1 2 1 1 1 1 1 1 1 ...
## $ TotalDue : num 66.22 2.88 24.04 18.43 18.43 ...
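The structure output shows 438,128 observations of 13 variables: ten columns are stored as character, Price and TotalDue are numeric, and Quantity is integer. The summary shows that the middle half of the prices lies between a first quartile of $4.50 and a third quartile of $14.68. It also includes negative price values, which reflect money going out to purchase services such as “catering.”
## Task 4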
(6 points) Is every transaction summarized in one row of the dataframe? Include a code chunk with code that will display some kind of evidence (e.g., number of rows and number of unique transaction numbers) to support your conclusion. Below the code chunk, clearly indicate how the output of your code supports your decision.
The default summary lists each column and the type of data it holds.
# Summarize the length, class and mode of each column
summary.default(df)
## Length Class Mode
## Date 438128 -none- character
## OperationType 438128 -none- character
## CashierName 438128 -none- character
## LineItem 438128 -none- character
## Department 438128 -none- character
## Category 438128 -none- character
## RegisterName 438128 -none- character
## StoreNumber 438128 -none- character
## TransactionNumber 438128 -none- character
## CustomerCode 438128 -none- character
## Price 438128 -none- numeric
## Quantity 438128 -none- numeric
## TotalDue 438128 -none- numeric
length(unique(df))
## [1] 13
nrow(df)
## [1] 438128
ncol(df)
## [1] 13
nrow(df)*ncol(df)
## [1] 5695664
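This display summarizes the type of data in each column. The dataframe has 438,128 rows and 13 columns, so 5,695,664 is the number of cells (rows x columns), not the number of transactions. Likewise, length(unique(df)) returns 13 because it counts the columns of the dataframe rather than the distinct transaction numbers. The str() output in task 3 already shows that the first two rows share the TransactionNumber 002XIIC146121, so each row is a line item and a single transaction can span more than one row. Comparing nrow(df) with length(unique(df$TransactionNumber)) would give the exact counts.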
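The evidence the task asks for, the number of rows compared with the number of unique transaction numbers, could be produced along these lines. This is a minimal sketch on a made-up dataframe, toy; only the column name TransactionNumber is taken from the ities data, and the values are hypothetical.

```r
# Toy line-item data (hypothetical values); some rows share a transaction number
toy <- data.frame(
  TransactionNumber = c("T001", "T001", "T002", "T003", "T003", "T003"),
  LineItem = c("Salad", "Drink", "Kabob", "Salad", "Drink", "Kabob"),
  stringsAsFactors = FALSE
)

# Number of rows versus number of distinct transactions
n_rows <- nrow(toy)
n_transactions <- length(unique(toy$TransactionNumber))

# Rows per transaction; a count above 1 means the transaction spans several rows
rows_per_transaction <- table(toy$TransactionNumber)

n_rows                      # 6
n_transactions              # 3
max(rows_per_transaction)   # 3
```

If the two counts are equal, every transaction sits in one row. If the row count is larger, as in this toy example, at least one transaction is spread over several rows.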
(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.
summary(df[, c("Price", "Quantity", "TotalDue")])
## Price Quantity TotalDue
## Min. :-5740.51 Min. : 1.000 Min. :-5740.51
## 1st Qu.: 4.50 1st Qu.: 1.000 1st Qu.: 4.50
## Median : 11.29 Median : 1.000 Median : 11.80
## Mean : 14.36 Mean : 1.177 Mean : 15.25
## 3rd Qu.: 14.68 3rd Qu.: 1.000 3rd Qu.: 15.04
## Max. :21449.97 Max. :815.000 Max. :21449.97
## NA's :12 NA's :12
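Price ranges from -$5,740.51 to $21,449.97, but the middle half of the values lies between $4.50 and $14.68, and 12 values are missing. Quantity ranges from 1 to 815, but the median and third quartile are both 1, so at least three quarters of the rows have a quantity of 1. TotalDue closely tracks Price (median $11.80 versus $11.29) because most rows have a quantity of 1, and it also has 12 missing values.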
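Individual columns can also be summarized one at a time with the $ operator, and the counts behind the interpretation (missing values, negative values, share of rows with a quantity of 1) can be computed directly. The sketch below assumes a small made-up dataframe, toy, that reuses the ities column names with hypothetical values.

```r
# Toy data reusing the ities column names (values are hypothetical)
toy <- data.frame(
  Price = c(4.5, 11.29, NA, -20, 14.68),
  Quantity = c(1L, 1L, 2L, 1L, 1L),
  TotalDue = c(4.5, 11.29, NA, -20, 14.68)
)

# Summary of a single column
price_summary <- summary(toy$Price)
price_summary

# Count missing and negative prices explicitly
n_missing <- sum(is.na(toy$Price))
n_negative <- sum(toy$Price < 0, na.rm = TRUE)

# Share of rows where the quantity is exactly 1
share_qty_one <- mean(toy$Quantity == 1)
```

Counting the negative and missing values separately makes it easier to judge how much they affect the minimum and the mean.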
(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.
boxplot(log10(df$Price))
## Warning in boxplot(log10(df$Price)): NaNs produced
boxplot(log10(df$Quantity))
boxplot(log10(df$TotalDue))
## Warning in boxplot(log10(df$TotalDue)): NaNs produced
The boxplots for Price and TotalDue are similar, which reflects the
close match between their summaries in task 5. The NaN warnings for
these two columns occur because log10() is undefined for the negative
values seen in task 5. The boxplot for Quantity is also consistent
with task 5: it shows a wide range with significant outliers, but the
overwhelming majority of quantities are 1. This is why there is a solid
line all the way at the bottom, since log10(1) is 0.
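One way to avoid the NaN warnings is to keep only strictly positive, non-missing values before taking the logarithm. This is a minimal sketch under that assumption, using a made-up vector, toy_price; whether dropping the negative rows is appropriate for the ities data depends on what those rows represent.

```r
# Toy price vector (hypothetical values) with a missing, a negative and a zero entry
toy_price <- c(4.5, 11.29, NA, -20, 14.68, 0, 250)

# log10() gives NaN for negative values and -Inf for zero,
# so keep only strictly positive, non-missing prices
positive_price <- toy_price[!is.na(toy_price) & toy_price > 0]
log_price <- log10(positive_price)

# Record how many values were dropped so the filtering stays visible
n_dropped <- length(toy_price) - length(positive_price)

# plot = FALSE returns the boxplot statistics without drawing;
# remove that argument to draw the boxplot itself
bp <- boxplot(log_price, plot = FALSE)
bp$stats
```

Reporting n_dropped alongside the plot keeps the reader aware that the boxplot describes only the positive values.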