HW2

Task 1

Read in the ities.csv datafile as a dataframe object, df.

df <- read.csv('ities.csv', stringsAsFactors = FALSE, header = TRUE)

No descriptive answer is needed here.

Task 2

(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.

dim(df)

## [1] 438151     13

This dataframe has 438151 rows and 13 columns.
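As a small sketch of how the two numbers returned by dim() map to rows and columns, the same counts can be pulled out one at a time with nrow() and ncol(). The toy dataframe below is a hypothetical stand-in, not data from ities.csv.

```r
# Toy stand-in for df (hypothetical values, not taken from ities.csv)
toy <- data.frame(
  LineItem = c("Fountain Drink", "Beef and Squash Kabob", "Fountain Drink"),
  Quantity = c(1L, 2L, 1L),
  stringsAsFactors = FALSE
)

n_rows <- nrow(toy)  # first element of dim(toy)
n_cols <- ncol(toy)  # second element of dim(toy)

cat("rows:", n_rows, "columns:", n_cols, "\n")
```

dim() always reports rows first and columns second, which is why the answer above reads 438151 rows and 13 columns.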

Task 3

(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.

str(df)

## 'data.frame':    438151 obs. of  13 variables:
##  $ Date             : chr  "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
##  $ OperationType    : chr  "SALE" "SALE" "SALE" "SALE" ...
##  $ CashierName      : chr  "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
##  $ LineItem         : chr  "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
##  $ Department       : chr  "Entrees" "Beverage" "Kabobs" "Salad" ...
##  $ Category         : chr  "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
##  $ RegisterName     : chr  "RT149" "RT149" "RT149" "RT149" ...
##  $ StoreNumber      : chr  "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
##  $ TransactionNumber: chr  "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
##  $ CustomerCode     : chr  "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
##  $ Price            : num  66.22 2.88 12.02 18.43 18.43 ...
##  $ Quantity         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ TotalDue         : num  66.22 2.88 24.04 18.43 18.43 ...

Ten of the variables (columns) are character (chr) type, two are numeric (num) and one is integer (int). The dataframe has 438151 observations of these 13 variables.
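The type counts can also be tallied in code instead of by reading the str() output. This is a minimal sketch on a hypothetical three-column toy dataframe. Note that class() reports "character", "numeric" and "integer" where str() abbreviates them to chr, num and int.

```r
# Toy stand-in for df (hypothetical values, not taken from ities.csv)
toy <- data.frame(
  Date     = c("7/18/2016", "7/18/2016"),
  Price    = c(66.22, 2.88),
  Quantity = c(1L, 1L),
  stringsAsFactors = FALSE
)

col_types   <- sapply(toy, class)  # one class name per column
type_counts <- table(col_types)    # how many columns of each class

print(type_counts)
```

Running the same two lines on df should reproduce the ten, two and one split described above, assuming the file is read in as shown in Task 1.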

Task 4

(6 points) True or False: Every transaction is summarized in one row of the dataframe. Display at least one calculation in the code chunk below. Below the calculation(s), clearly indicate whether the statement is true or false and explain how the output of your calculation(s) supports your conclusion.

head(df$TransactionNumber)

## [1] "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" 
## [5] "00Z3B4R37335"  "006LUOW47310"

False. Among the first six observations, transaction numbers 002XIIC146121 and 00Z3B4R37335 each appear twice, so a single transaction can span more than one row of the dataframe.
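head() only inspects six rows, so a check over the whole column is more convincing. The sketch below compares the row count with the number of distinct transaction numbers. The transaction IDs and line items are hypothetical, chosen only to illustrate the calculation.

```r
# Toy stand-in for df (hypothetical transaction numbers and line items)
toy <- data.frame(
  TransactionNumber = c("T001", "T001", "T002", "T003", "T003", "T004"),
  LineItem = c("Salad", "Fountain Drink", "Kabob", "Salad", "Salad", "Kabob"),
  stringsAsFactors = FALSE
)

n_rows         <- nrow(toy)
n_transactions <- length(unique(toy$TransactionNumber))

# Rows per transaction; any count above 1 means the transaction spans several rows
rows_per_txn <- table(toy$TransactionNumber)
n_multi      <- sum(rows_per_txn > 1)

cat("rows:", n_rows, "distinct transactions:", n_transactions,
    "multi-row transactions:", n_multi, "\n")
```

If every transaction were summarized in one row, the number of distinct transaction numbers would equal the row count and no transaction would have more than one row.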

Task 5

(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.

summary(df$Price)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5740.51     4.50    11.29    14.36    14.68 21449.97       12

summary(df$Quantity)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.177   1.000 815.000

summary(df$TotalDue)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5740.51     4.50    11.80    15.26    15.04 21449.97       12

The range of values in the Price column is -5740.51 to 21449.97. The median is 11.29 and the mean is 14.36. There are 12 NA (missing) values in the column.

The range of values in the Quantity column is 1 to 815. The median is 1 and the mean is 1.177. There are no NA values in the column.

The range of values in the TotalDue column is -5740.51 to 21449.97. The median is 11.8 and the mean is 15.26. There are 12 NA values in the column.
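The NA counts and the negative minimums reported by summary() can be confirmed directly. This is a minimal sketch on a hypothetical price vector, not the actual Price column.

```r
# Hypothetical prices: one missing value and one negative value
price <- c(4.50, 11.29, NA, -5.00, 14.68)

n_missing   <- sum(is.na(price))           # matches the NA's entry in summary()
price_range <- range(price, na.rm = TRUE)  # Min. and Max. with NAs ignored
n_negative  <- sum(price < 0, na.rm = TRUE)

cat("missing:", n_missing, "negative:", n_negative,
    "range:", price_range[1], "to", price_range[2], "\n")
```

The na.rm = TRUE argument is needed because range() and sum() would otherwise return NA as soon as one value is missing.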

Task 6

(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.

boxplot(log10(df$Price))

## Warning in boxplot(log10(df$Price)): NaNs produced

boxplot(log10(df$Quantity))

boxplot(log10(df$TotalDue))

## Warning in boxplot(log10(df$TotalDue)): NaNs produced

All three boxplots (Price, Quantity and TotalDue) show wide ranges of values with many outliers. Quantity is more tightly concentrated than Price and TotalDue: in Task 5 its 1st quartile, median and 3rd quartile are all 1, so on the log10 scale its box collapses to a line at 0. The outliers pull the Quantity mean (1.177) above the median (1). Price and TotalDue are also concentrated around their medians, but both have many outliers, consistent with the maximum of 21449.97 in Task 5. NaNs were produced for both Price and TotalDue because log10 is undefined for negative values such as the -5740.51 minimums seen in Task 5.
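The NaN warning can be reproduced on a small scale, and one way to avoid it is to take logs of the positive values only. The sketch below uses hypothetical values. Dropping the non-positive values is an assumption about how one might handle them, not part of the assignment.

```r
# Hypothetical prices: one negative value and one missing value
price <- c(-5.00, 4.50, 11.29, NA, 14.68)

# log10 of a negative number is NaN; a missing value stays NA
log_all <- suppressWarnings(log10(price))
n_nan   <- sum(is.nan(log_all))

# Keep only positive, non-missing values before taking logs
positive <- price[!is.na(price) & price > 0]
log_pos  <- log10(positive)

# A column whose quartiles are all 1 (like Quantity) sits at log10(1) = 0
log_qty <- log10(c(1, 1, 1))

cat("NaNs from negatives:", n_nan, "usable values:", length(log_pos), "\n")
```

Passing a filtered vector such as log_pos to boxplot() would draw the plot without the warning, at the cost of hiding the negative entries, so the number of dropped values is worth reporting alongside the plot.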
