Task 1

Read in the ities.csv datafile as a dataframe object, df.

# Read the CSV file into a data frame named df
df <- read.csv("ities.csv")

Task 2

(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.

# Get the number of rows
num_rows <- nrow(df)

# Print the number of rows
cat("Number of rows:", num_rows, "\n")
## Number of rows: 438128
# Get the number of columns
num_columns <- ncol(df)

# Print the number of columns
cat("Number of columns:", num_columns, "\n")
## Number of columns: 13
# Print the summary statement 
cat("This dataframe has", num_rows, "rows and", num_columns, "columns.\n")
## This dataframe has 438128 rows and 13 columns.

Summary

  • Number of rows: 438128
  • Number of columns: 13
  • This dataframe has 438128 rows and 13 columns.

Task 3

(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.

# Display the structure of the data frame
str(df)
## 'data.frame':    438128 obs. of  13 variables:
##  $ Date             : chr  "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
##  $ OperationType    : chr  "SALE" "SALE" "SALE" "SALE" ...
##  $ CashierName      : chr  "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
##  $ LineItem         : chr  "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
##  $ Department       : chr  "Entrees" "Beverage" "Kabobs" "Salad" ...
##  $ Category         : chr  "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
##  $ RegisterName     : chr  "RT149" "RT149" "RT149" "RT149" ...
##  $ StoreNumber      : chr  "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
##  $ TransactionNumber: chr  "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
##  $ CustomerCode     : chr  "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
##  $ Price            : num  66.22 2.88 12.02 18.43 18.43 ...
##  $ Quantity         : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ TotalDue         : num  66.22 2.88 24.04 18.43 18.43 ...

Summary

  • The data frame has 438128 observations and 13 variables.
  • The columns in the dataframe include character (chr), numeric (num), and integer (int) data types.

Task 4

(6 points) Is every transaction summarized in one row of the dataframe? Include a code chunk with code that will display some kind of evidence (e.g., number of rows and number of unique transaction numbers) to support your conclusion. Below the code chunk, clearly indicate how the output of your code supports your decision.

# The number of rows in the dataframe
num_rows <- nrow(df)

# The number of unique transaction numbers
num_unique_transactions <- length(unique(df$TransactionNumber))

# Print the results
cat("Number of rows in the dataframe:", num_rows, "\n")
## Number of rows in the dataframe: 438128
cat("Number of unique transaction numbers:", num_unique_transactions, "\n")
## Number of unique transaction numbers: 161053

Summary

  • Number of rows in the dataframe: 438128
  • Number of unique transaction numbers: 161053

Conclusion

  • The total number of unique transaction numbers is less than the number of rows in the dataframe. This indicates that multiple rows might correspond to the same transaction.

Task 5

(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.

# Summaries of Price, Quantity, and TotalDue columns
summary(df$Price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5740.51     4.50    11.29    14.36    14.68 21449.97       12
summary(df$Quantity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.177   1.000 815.000
summary(df$TotalDue)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -5740.51     4.50    11.80    15.25    15.04 21449.97       12

Interpretation

1. Price Column

  • The Price column shows a minimum value of -5740.51 and a maximum value of 21449.97.
  • The negative minimum value suggests returns or refunds.
  • The 1st quartile indicates that 25% of the prices are below 4.50.
  • The median indicates that 50% of the prices are below 11.29.
  • The 3rd quartile suggests that 75% of the prices are below 14.68.
  • The mean price is 14.36, which is slightly higher than the median, indicating that some unusually high prices are affecting the average.
  • There are 12 missing values (NA’s) in the Price column.

2. Quantity Column

  • The Quantity column has a minimum value of 1 and a maximum quantity of 815.
  • The 3rd quartile indicate that 75% of the transactions involve a quantity of 1 unit or less, while there are a few transactions with very high quantities.

3. TotalDue Column

  • The TotalDue column shows a wide range of values, from -5740.51 to 21449.97.
  • Similar to the Price column, the negative minimum value is due to returns or refunds.
  • The median shows that 50% of the total dues are below 11.80.
  • The mean total due is 15.25, which is higher than the median, suggesting that there are some high total due values.
  • There are 12 missing values (NA’s) in the TotalDue column.

Task 6

(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.

# Load library
library(ggplot2)

# Log transformation for each column
df$log_Price <- log(df$Price + 1) # Adding 1 to avoid log(0) issues
## Warning in log(df$Price + 1): NaNs produced
df$log_Quantity <- log(df$Quantity + 1) # Adding 1 to avoid log(0) issues
df$log_TotalDue <- log(df$TotalDue + 1) # Adding 1 to avoid log(0) issues
## Warning in log(df$TotalDue + 1): NaNs produced
# Boxplots for the log-transformed columns
par(mfrow=c(1,3)) # Arrange plots in one row, three columns

boxplot(df$log_Price, main="Log of Price", ylab="Log(Price)")
boxplot(df$log_Quantity, main="Log of Quantity", ylab="Log(Quantity)")
boxplot(df$log_TotalDue, main="Log of TotalDue", ylab="Log(TotalDue)")

# Numeric summary for log-transformed columns
summary(df$log_Price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -1.050   1.705   2.509   2.295   2.752   9.973     347
summary(df$log_Quantity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6931  0.6931  0.6931  0.7400  0.6931  6.7044
summary(df$log_TotalDue)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -1.050   1.705   2.556   2.355   2.775   9.973     353

Insights from Boxplots of Log-Transformed Data

1. Price

  • The boxplot of log-transformed price shows a positive skew with most values clustered towards the lower end. There are a few high outliers, indicating the presence of some extreme values.
  • In Task 5, the numeric summary for Price showed that the mean ($14.36) was higher than the median ($11.29), reflecting the influence of extreme high prices. The boxplot confirms this skewness and shows that the log transformation has brought the mean (2.295) closer to the median (2.509), reducing the impact of these extreme values.
  • The numeric summary of log transformed data indicates an increase in missing values (from 12 to 347) which could be due to issues with the log transformation, likely due to zeros or negative values in the original data.

2. Quantity

  • In the boxplot for log-transformed quantity most values are clustered around the lower end. The maximum value indicates a few high outliers.

  • The numeric summary for quantity in Task 5 showed a minimum of 1 and a maximum of 815. The log-transformed boxplot confirms that while most transactions involve low quantities, there are rare high quantities.

3. Total Due

  • The boxplot for log-transformed total due displays a significant spread with values ranging widely. The median (2.556) is higher than the mean (2.355), showing the impact of a few high values.

  • In Task 5, the numeric summary for total due showed that the mean ($15.25) was higher than the median ($11.80), indicating the influence of extreme high values.

  • The numeric summary of log transformed data indicates an increase in missing values (from 12 to 353) which could be due to issues with the log transformation, likely due to zeros or negative values in the original data.