HW2

Task 1

Read in the ities.csv datafile as a dataframe object, df.

df<-read.csv("ities.csv")

No descriptive answer is needed here.

Task 2

(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.

dim(df)

## [1] 438151     13

The dataframe has 438151 rows and 13 columns.

Task 3

(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.

str(df)

## 'data.frame':    438151 obs. of  13 variables:
##  $ Date             : chr  "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
##  $ OperationType    : chr  "SALE" "SALE" "SALE" "SALE" ...
##  $ CashierName      : chr  "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
##  $ LineItem         : chr  "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
##  $ Department       : chr  "Entrees" "Beverage" "Kabobs" "Salad" ...
##  $ Category         : chr  "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
##  $ RegisterName     : chr  "RT149" "RT149" "RT149" "RT149" ...
##  $ StoreNumber      : chr  "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
##  $ TransactionNumber: chr  "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
##  $ CustomerCode     : chr  "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
##  $ Price            : num  66.22 2.88 12.02 18.43 18.43 ...
##  $ Quantity         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ TotalDue         : num  66.22 2.88 24.04 18.43 18.43 ...

The dataframe is long and narrow: 438,151 rows but only 13 columns. Ten of the columns are character (chr), two are numeric (Price and TotalDue), and one is integer (Quantity). Note that Date is stored as character text rather than as a date type.

Task 4

(6 points) True or False: Every transaction is summarized in one row of the dataframe. Display at least one calculation in the code chunk below. Below the calculation(s), clearly indicate whether the statement is true or false and explain how the output of your calculation(s) supports your conclusion.

head(df)

##        Date OperationType    CashierName                    LineItem Department
## 1 7/18/2016          SALE Wallace Kuiper Salmon and Wheat Bran Salad    Entrees
## 2 7/18/2016          SALE Wallace Kuiper              Fountain Drink   Beverage
## 3 7/18/2016          SALE Wallace Kuiper       Beef and Squash Kabob     Kabobs
## 4 7/18/2016          SALE Wallace Kuiper Salmon and Wheat Bran Salad      Salad
## 5 7/18/2016          SALE Wallace Kuiper Salmon and Wheat Bran Salad      Salad
## 6 7/18/2016          SALE Wallace Kuiper  Beef and Broccoli Stir Fry    general
##                      Category RegisterName StoreNumber TransactionNumber
## 1 Salmon and Wheat Bran Salad        RT149  AZ23501305     002XIIC146121
## 2                    Fountain        RT149  AZ23501289     002XIIC146121
## 3                        Beef        RT149  AZ23501367     00PG9FL135736
## 4                     general        RT149  AZ23501633      00Z3B4R37335
## 5                     general        RT149  AZ23501633      00Z3B4R37335
## 6                     general        RT149  AZ23501640      006LUOW47310
##   CustomerCode Price Quantity TotalDue
## 1  CWM11331L8O 66.22        1    66.22
## 2  CWM11331L8O  2.88        1     2.88
## 3  CWM11331L8O 12.02        2    24.04
## 4  CWM11331L8O 18.43        1    18.43
## 5  CWM11331L8O 18.43        1    18.43
## 6  CWM11331L8O 15.04        1    15.04

False. Even in the first six rows shown by the head function, the transaction number in row one (002XIIC146121) appears again in row two, and rows four and five share another transaction number (00Z3B4R37335). A single transaction can therefore span several rows, one per line item, so it is not true that every transaction is summarized in one row of this dataframe.
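A more direct check than scanning head(df) is to compare the number of rows with the number of distinct transaction numbers. The sketch below rebuilds the six rows printed above as a small stand-in data frame (toy is a hypothetical name used only for illustration); on the full data the same comparison would be run on df.

```r
# Toy stand-in for the first six rows of df (values copied from head(df) above).
toy <- data.frame(
  TransactionNumber = c("002XIIC146121", "002XIIC146121", "00PG9FL135736",
                        "00Z3B4R37335", "00Z3B4R37335", "006LUOW47310"),
  LineItem = c("Salmon and Wheat Bran Salad", "Fountain Drink",
               "Beef and Squash Kabob", "Salmon and Wheat Bran Salad",
               "Salmon and Wheat Bran Salad", "Beef and Broccoli Stir Fry"),
  stringsAsFactors = FALSE
)

n_rows <- nrow(toy)                                      # rows in the data frame
n_transactions <- length(unique(toy$TransactionNumber))  # distinct transactions

# If every transaction occupied exactly one row, these two counts would match.
n_rows == n_transactions

# Rows per transaction; any count above 1 means a transaction spans several rows.
rows_per_transaction <- table(toy$TransactionNumber)
rows_per_transaction[rows_per_transaction > 1]
```

If the two counts differ on df as they do here, at least one transaction occupies more than one row, which supports the answer of False.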

Task 5

(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.

summary(df$Price)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5740.51     4.50    11.29    14.36    14.68 21449.97       12

summary(df$Quantity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.177   1.000 815.000

summary(df$TotalDue)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5740.51     4.50    11.80    15.26    15.04 21449.97       12

Price values range from -5740.51 to 21449.97, with a mean of 14.36 and a median of 11.29; 12 values are missing (NA). Quantity values range from 1 to 815, with a mean of 1.177 and a median of 1; because the first and third quartiles are also 1, at least three quarters of the rows record a quantity of exactly 1. TotalDue values range from -5740.51 to 21449.97, with a mean of 15.26 and a median of 11.80; 12 values are missing (NA).
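The Price and TotalDue summaries are nearly the same, and the Quantity summary hints at why: when the quantity is 1, the total due for a row would equal its price. The sketch below checks this on the six rows printed by head(df) in Task 4 (toy is a hypothetical name; whether TotalDue equals Price times Quantity across the full data is an assumption that would need to be verified on df).

```r
# First six rows of Price, Quantity, and TotalDue, copied from head(df) in Task 4.
toy <- data.frame(
  Price    = c(66.22, 2.88, 12.02, 18.43, 18.43, 15.04),
  Quantity = c(1L, 1L, 2L, 1L, 1L, 1L),
  TotalDue = c(66.22, 2.88, 24.04, 18.43, 18.43, 15.04)
)

# Compare TotalDue with Price * Quantity, allowing for floating-point rounding.
matches <- abs(toy$TotalDue - toy$Price * toy$Quantity) < 1e-8
all(matches)

# Share of rows where a single unit was sold, so Price and TotalDue coincide.
mean(toy$Quantity == 1)
```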

Task 6

(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.

options(warn = -1)
boxplot(log(df[, c("Price", "Quantity", "TotalDue")]))

At first glance, box plots one (Price) and three (TotalDue) look virtually identical; only the summaries from Task 5 reveal the small differences in their medians, means, and third quartiles. Box plot two (Quantity) has no visible box because its first and third quartiles are identical (both 1, which is 0 on the log scale). All Quantity values are positive (the minimum is 1), which makes sense for a count of items sold, so none of its log values fall below zero. Much of the Task 5 output (minimum, lower quartile, median, upper quartile, maximum) corresponds to what a box plot draws, here on the log scale. As an aside, I used options(warn = -1) only to suppress the warning that R kept raising when generating these box plots: Price and TotalDue contain negative values, and the logarithm of a negative number is undefined, so log() returns NaN for them.
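An alternative to switching warnings off is to deal with the problem values explicitly before taking logs. The sketch below uses a small made-up data frame (toy and its values are hypothetical, chosen only to mimic the mix of positive, negative, and missing amounts seen in Task 5) and replaces non-positive values with NA so that log() never receives them.

```r
# Hypothetical values: positive amounts, one negative amount, one missing value.
toy <- data.frame(
  Price    = c(66.22, 2.88, 12.02, -5.00, NA),
  Quantity = c(1L, 1L, 2L, 1L, 1L),
  TotalDue = c(66.22, 2.88, 24.04, -5.00, NA)
)

cols <- c("Price", "Quantity", "TotalDue")

# Count the values whose logarithm is undefined or infinite (zero or negative).
sapply(toy[cols], function(x) sum(x <= 0, na.rm = TRUE))

# Count the missing values in each column.
sapply(toy[cols], function(x) sum(is.na(x)))

# Replace non-positive values with NA; log(NA) is NA and raises no warning.
positive_only <- lapply(toy[cols], function(x) ifelse(x > 0, x, NA))
logged <- lapply(positive_only, log)

# boxplot() accepts a list of vectors and ignores the NA entries.
boxplot(logged, ylab = "log value")
```

This makes it visible how many values are left out of the plot. If the goal is only to silence the warning, wrapping the single call in suppressWarnings() limits the effect to that call, whereas options(warn = -1) changes the setting for the rest of the session.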
