Read in the ities.csv datafile as a dataframe object, df.
df <- read.csv("ities.csv")
No descriptive answer is needed here.
(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.
dim(df)
## [1] 438151 13
The dataframe has 438151 rows and 13 columns.
(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.
str(df)
## 'data.frame': 438151 obs. of 13 variables:
## $ Date : chr "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
## $ OperationType : chr "SALE" "SALE" "SALE" "SALE" ...
## $ CashierName : chr "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
## $ LineItem : chr "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
## $ Department : chr "Entrees" "Beverage" "Kabobs" "Salad" ...
## $ Category : chr "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
## $ RegisterName : chr "RT149" "RT149" "RT149" "RT149" ...
## $ StoreNumber : chr "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
## $ TransactionNumber: chr "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
## $ CustomerCode : chr "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
## $ Price : num 66.22 2.88 12.02 18.43 18.43 ...
## $ Quantity : int 1 1 2 1 1 1 1 1 1 1 ...
## $ TotalDue : num 66.22 2.88 24.04 18.43 18.43 ...
The dataframe is long and narrow: 438151 observations across only 13 variables. Ten of the variables are stored as character strings (including Date, which R read as text rather than as a date), two are numeric (Price and TotalDue), and one is an integer (Quantity).
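The column types summarized above can also be tallied directly rather than read off the str() output. The sketch below is only an illustration: `toy` is a small stand-in data frame built from values visible in the output above, not the real ities.csv, and the `%m/%d/%Y` date format is an assumption based on strings such as "7/18/2016".

```r
# Stand-in data frame using a few values visible in the str() output;
# it is NOT the full ities.csv data
toy <- data.frame(
  Date     = c("7/18/2016", "7/18/2016"),
  LineItem = c("Fountain Drink", "Beef and Squash Kabob"),
  Price    = c(2.88, 12.02),
  Quantity = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Class of each column, then a count of columns per class
col_types   <- sapply(toy, class)
type_counts <- table(col_types)
print(type_counts)

# Date is stored as text; convert it assuming month/day/year order
toy$Date <- as.Date(toy$Date, format = "%m/%d/%Y")
print(class(toy$Date))
```

Running the same two `sapply()` and `table()` lines on df would give the per-type column counts for the full dataframe.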
(6 points) True or False: Every transaction is summarized in one row of the dataframe. Display at least one calculation in the code chunk below. Below the calculation(s), clearly indicate whether the statement is true or false and explain how the output of your calculation(s) supports your conclusion.
head(df)
## Date OperationType CashierName LineItem Department
## 1 7/18/2016 SALE Wallace Kuiper Salmon and Wheat Bran Salad Entrees
## 2 7/18/2016 SALE Wallace Kuiper Fountain Drink Beverage
## 3 7/18/2016 SALE Wallace Kuiper Beef and Squash Kabob Kabobs
## 4 7/18/2016 SALE Wallace Kuiper Salmon and Wheat Bran Salad Salad
## 5 7/18/2016 SALE Wallace Kuiper Salmon and Wheat Bran Salad Salad
## 6 7/18/2016 SALE Wallace Kuiper Beef and Broccoli Stir Fry general
## Category RegisterName StoreNumber TransactionNumber
## 1 Salmon and Wheat Bran Salad RT149 AZ23501305 002XIIC146121
## 2 Fountain RT149 AZ23501289 002XIIC146121
## 3 Beef RT149 AZ23501367 00PG9FL135736
## 4 general RT149 AZ23501633 00Z3B4R37335
## 5 general RT149 AZ23501633 00Z3B4R37335
## 6 general RT149 AZ23501640 006LUOW47310
## CustomerCode Price Quantity TotalDue
## 1 CWM11331L8O 66.22 1 66.22
## 2 CWM11331L8O 2.88 1 2.88
## 3 CWM11331L8O 12.02 2 24.04
## 4 CWM11331L8O 18.43 1 18.43
## 5 CWM11331L8O 18.43 1 18.43
## 6 CWM11331L8O 15.04 1 15.04
False. Even in just the first six rows returned by head(), transaction number 002XIIC146121 appears in both row one and row two, and transaction number 00Z3B4R37335 appears in both row four and row five. Each row therefore records a line item within a transaction, not a whole transaction, so the statement cannot be true for this dataframe.
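The same conclusion can be reached with an explicit calculation instead of visual inspection. The sketch below uses only the six TransactionNumber values printed by head(df); the variable names are my own.

```r
# TransactionNumber values for the six rows printed by head(df)
txn <- c("002XIIC146121", "002XIIC146121", "00PG9FL135736",
         "00Z3B4R37335",  "00Z3B4R37335",  "006LUOW47310")

n_rows <- length(txn)           # number of rows examined
n_txn  <- length(unique(txn))   # number of distinct transactions

# TRUE if at least one transaction number occurs on more than one row
any_repeated <- anyDuplicated(txn) > 0

print(c(rows = n_rows, transactions = n_txn))
print(table(txn))               # rows per transaction number
```

On the full data, comparing `length(unique(df$TransactionNumber))` with `nrow(df)` would be the analogous check: if the first number is smaller, some transactions span more than one row.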
(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.
summary(df$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -5740.51 4.50 11.29 14.36 14.68 21449.97 12
summary(df$Quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.177 1.000 815.000
summary(df$TotalDue)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -5740.51 4.50 11.80 15.26 15.04 21449.97 12
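Price ranges from -5740.51 to 21449.97, with a median of 11.29 and a mean of 14.36; 12 values are missing (NA), and both extremes lie far outside the quartiles (4.50 to 14.68). Quantity ranges from 1 to 815, with a median of 1 and a mean of 1.177; because the first and third quartiles are both 1, at least three quarters of the rows have a quantity of exactly 1. TotalDue ranges from -5740.51 to 21449.97, with a median of 11.80 and a mean of 15.26, and it also has 12 missing values; its slightly higher center compared with Price is consistent with some rows having a quantity greater than 1.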
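One way to see why TotalDue sits slightly above Price is to check whether TotalDue equals Price times Quantity. The sketch below does this only for the six rows printed by head(df); whether the relationship holds for every row of the full data is an assumption that this check does not establish.

```r
# Price, Quantity and TotalDue for the six rows printed by head(df)
first_rows <- data.frame(
  Price    = c(66.22, 2.88, 12.02, 18.43, 18.43, 15.04),
  Quantity = c(1L, 1L, 2L, 1L, 1L, 1L),
  TotalDue = c(66.22, 2.88, 24.04, 18.43, 18.43, 15.04)
)

# Recompute the total and compare with the recorded value,
# using all.equal() to tolerate floating-point rounding
computed <- first_rows$Price * first_rows$Quantity
matches  <- isTRUE(all.equal(computed, first_rows$TotalDue))
print(matches)
```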
(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.
options(warn = -1)
boxplot(log(df[, c("Price", "Quantity", "TotalDue")]))
At first glance, box plots one (Price) and three (TotalDue) look
virtually identical. It is only after looking at the summaries from task
five that we see differences in median, mean, and third quartile values
between the two variables. Box plot two (Quantity) has no actual “box”
to it since its first and third quartile values are identical (both 1,
which is 0 on the log scale). It is also interesting to note that the
log values for Quantity are all non-negative, which makes sense because
every quantity is at least 1. Much of the output from task five
(minimum, lower quartile, median, upper quartile, maximum) goes into the
creation of the box plots, here after the log transform. As an aside, I
only used options(warn = -1) to suppress the warning message that I kept
receiving when generating these box plots: Price and TotalDue contain
negative values, and the logarithm of a negative number is undefined (R
returns NaN for it).
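As an alternative to switching warnings off globally, the non-positive values can be set to NA before taking the log, so that no warning is raised in the first place. This is a minimal sketch, not the approach used above: `log_positive` is a hypothetical helper of my own, and the `toy` values are invented for illustration rather than taken from ities.csv.

```r
# Toy values invented for illustration; they are NOT rows from ities.csv
toy <- data.frame(
  Price    = c(66.22, 2.88, -5.00, NA, 12.02),
  Quantity = c(1L, 1L, 1L, 1L, 2L),
  TotalDue = c(66.22, 2.88, -5.00, NA, 24.04)
)

# Hypothetical helper: log of strictly positive values, NA otherwise
log_positive <- function(x) {
  x[!is.na(x) & x <= 0] <- NA
  log(x)
}

logged <- as.data.frame(lapply(toy, log_positive))

# How many values per column were excluded (missing or non-positive)
n_dropped <- colSums(is.na(logged))
print(n_dropped)

boxplot(logged, ylab = "log(value)")
```

Counting the excluded values makes it explicit how many observations are missing from each box plot, which the global options(warn = -1) setting hides.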