HW2

Task 1

Read in the ities.csv datafile as a dataframe object, df.

# Read the CSV file into a data frame
df <- read.csv('ities.csv')
# Drop every row that contains a missing (NA) value
df <- na.omit(df)

No descriptive answer is needed here.

Task 2

(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.

dim(df)

## [1] 438139     13

This dataframe has 438139 rows and 13 columns.

Task 3

(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.

str(df)

## 'data.frame':    438139 obs. of  13 variables:
##  $ Date             : chr  "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
##  $ OperationType    : chr  "SALE" "SALE" "SALE" "SALE" ...
##  $ CashierName      : chr  "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
##  $ LineItem         : chr  "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
##  $ Department       : chr  "Entrees" "Beverage" "Kabobs" "Salad" ...
##  $ Category         : chr  "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
##  $ RegisterName     : chr  "RT149" "RT149" "RT149" "RT149" ...
##  $ StoreNumber      : chr  "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
##  $ TransactionNumber: chr  "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
##  $ CustomerCode     : chr  "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
##  $ Price            : num  66.22 2.88 12.02 18.43 18.43 ...
##  $ Quantity         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ TotalDue         : num  66.22 2.88 24.04 18.43 18.43 ...
##  - attr(*, "na.action")= 'omit' Named int [1:12] 150 153 154 162 193 201 210 1565 1575 1578 ...
##   ..- attr(*, "names")= chr [1:12] "150" "153" "154" "162" ...

dim(df)

## [1] 438139     13

nrow(df)

## [1] 438139

ncol(df)

## [1] 13

names(df)

##  [1] "Date"              "OperationType"     "CashierName"      
##  [4] "LineItem"          "Department"        "Category"         
##  [7] "RegisterName"      "StoreNumber"       "TransactionNumber"
## [10] "CustomerCode"      "Price"             "Quantity"         
## [13] "TotalDue"

The dataframe has 438,139 rows and 13 columns. Ten of the columns, including Date, are stored as character strings; Price and TotalDue are numeric, and Quantity is an integer.
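Because str() shows Date stored as character, date arithmetic or sorting by date would need a conversion first. The sketch below is one possible way to do that. It assumes the values follow a month/day/year layout like "7/18/2016", and `toy_dates` is a made-up vector used only so the example runs on its own.

```r
# Hypothetical sample values in the same layout as the Date column
toy_dates <- c("7/18/2016", "7/19/2016", "12/1/2016")

# Parse the strings as Date objects (assumes month/day/year order)
parsed <- as.Date(toy_dates, format = "%m/%d/%Y")

# The result is of class "Date" rather than "character"
print(class(parsed))

# Any string that fails to parse becomes NA, so count those as a check
print(sum(is.na(parsed)))
```

On the real data the same call would be applied to df$Date, and the NA count would show whether any rows use a different date layout.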

Task 4

(6 points) True or False: Every transaction is summarized in one row of the dataframe. Display at least one calculation in the code chunk below. Below the calculation(s), clearly indicate whether the statement is true or false and explain how the output of your calculation(s) supports your conclusion.

length(unique(df$TransactionNumber))

## [1] 161056

False. Not every transaction is summarized in one row. There are 161,056 unique transaction numbers but 438,139 rows of data, so many transaction numbers appear on more than one row. Each row is a line item within a transaction, not a summary of the whole transaction.
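Counting rows per transaction number is another way to check this. The sketch below uses a small made-up data frame (`toy`, with hypothetical values) so that it runs on its own; the same calls should work on df$TransactionNumber.

```r
# Toy data standing in for df (hypothetical values, not from ities.csv)
toy <- data.frame(
  TransactionNumber = c("A1", "A1", "B2", "C3", "C3", "C3"),
  LineItem = c("Salad", "Drink", "Kabob", "Salad", "Drink", "Kabob"),
  stringsAsFactors = FALSE
)

# Number of rows recorded for each transaction number
rows_per_txn <- table(toy$TransactionNumber)

# Number of transactions that span more than one row
multi_row_txns <- sum(rows_per_txn > 1)

# TRUE if any transaction number appears on more than one row
has_repeats <- anyDuplicated(toy$TransactionNumber) > 0

print(rows_per_txn)
print(multi_row_txns)
print(has_repeats)
```

If every transaction were summarized in one row, `has_repeats` would be FALSE and the number of unique transaction numbers would equal the number of rows.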

Task 5

(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.

summary(df[,c('Price', 'Quantity' , 'TotalDue')])

##      Price             Quantity          TotalDue       
##  Min.   :-5740.51   Min.   :  1.000   Min.   :-5740.51  
##  1st Qu.:    4.50   1st Qu.:  1.000   1st Qu.:    4.50  
##  Median :   11.29   Median :  1.000   Median :   11.80  
##  Mean   :   14.36   Mean   :  1.177   Mean   :   15.26  
##  3rd Qu.:   14.68   3rd Qu.:  1.000   3rd Qu.:   15.04  
##  Max.   :21449.97   Max.   :815.000   Max.   :21449.97

In this summary I can see that the quantity purchased is rarely more than 1: the first quartile, median, and third quartile of Quantity are all 1, although there is an outlier with a max of 815. Because the quantity is usually 1, Price and TotalDue are extremely similar. Rows with missing values (NA) were already removed by na.omit() in Task 1. Another important observation is the wide spread in the data, with a min of -5,740.51 and a max of 21,449.97 for both Price and TotalDue.
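The points above could be backed up with a few direct counts. This is a minimal sketch on a made-up data frame (`toy`, with hypothetical values), not the real data; the column names match df so the same expressions should carry over.

```r
# Toy data with the same three columns as the summary (hypothetical values)
toy <- data.frame(
  Price    = c(4.50, 11.29, -5.00, 14.68, 2.88),
  Quantity = c(1L, 1L, 1L, 2L, 1L),
  TotalDue = c(4.50, 11.29, -5.00, 29.36, 2.88)
)

# Proportion of rows where exactly one item was purchased
share_qty_one <- mean(toy$Quantity == 1)

# Number of rows with a negative price
n_negative <- sum(toy$Price < 0)

# Number of rows where Price and TotalDue are identical
n_same <- sum(toy$Price == toy$TotalDue)

print(share_qty_one)
print(n_negative)
print(n_same)
```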

Task 6

(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.

boxplot(log(df[,c('Price', 'Quantity' , 'TotalDue')]))

## Warning in FUN(X[[i]], ...): NaNs produced

## Warning in FUN(X[[i]], ...): NaNs produced

The boxplot for Quantity confirms that the data is highly concentrated at one point, making the interquartile range impossible to distinguish. We also see some outliers above the upper quartile. The Price and TotalDue boxplots are almost identical, which matches the similar summaries in Task 5. Their interquartile ranges are small, indicating that prices are concentrated near the median. The boxplots also show why the mean is higher than the median in Task 5: the large values in the upper tail pull the mean up. The NaN warnings arise because Price and TotalDue contain negative values (the minimum in Task 5 is -5,740.51), and the log of a negative number is undefined.
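The "NaNs produced" warnings can be reproduced on a small scale. The sketch below uses a made-up vector (`toy_price`, with hypothetical values) to show what log() does with a negative number, and one possible way to keep only positive values before taking logs.

```r
# Hypothetical prices, including one negative value
toy_price <- c(-5.00, 2.88, 12.02, 66.22)

# log() of a negative number returns NaN and raises a warning;
# the warning is suppressed here only to keep the output clean
logged <- suppressWarnings(log(toy_price))
n_nan <- sum(is.nan(logged))

# Keep only strictly positive values, then take logs without warnings
positive_only <- toy_price[toy_price > 0]
logged_positive <- log(positive_only)

print(n_nan)
print(length(logged_positive))
```

Filtering first makes it explicit which rows are left out of the plot. Whether the negative rows should be excluded or examined separately depends on what they represent in the data.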
