Read in the ities.csv datafile as a dataframe object, df.
# Read the CSV file into a data frame named df
df <- read.csv("ities.csv")
(2 points) Display the count of rows and columns in the dataframe using an appropriate R function. Below the output, identify the count of rows and the count of columns.
# Get the number of rows
num_rows <- nrow(df)
# Print the number of rows
cat("Number of rows:", num_rows, "\n")
## Number of rows: 438128
# Get the number of columns
num_columns <- ncol(df)
# Print the number of columns
cat("Number of columns:", num_columns, "\n")
## Number of columns: 13
# Print the summary statement
cat("This dataframe has", num_rows, "rows and", num_columns, "columns.\n")
## This dataframe has 438128 rows and 13 columns.
(3 points) Use the appropriate R function to display the structure (i.e., number of rows, columns, column names, column data type, some values from each column) of the dataframe, df. Below the output, briefly summarize two main points about the dataframe structure.
# Display the structure of the data frame
str(df)
## 'data.frame': 438128 obs. of 13 variables:
## $ Date : chr "7/18/2016" "7/18/2016" "7/18/2016" "7/18/2016" ...
## $ OperationType : chr "SALE" "SALE" "SALE" "SALE" ...
## $ CashierName : chr "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" "Wallace Kuiper" ...
## $ LineItem : chr "Salmon and Wheat Bran Salad" "Fountain Drink" "Beef and Squash Kabob" "Salmon and Wheat Bran Salad" ...
## $ Department : chr "Entrees" "Beverage" "Kabobs" "Salad" ...
## $ Category : chr "Salmon and Wheat Bran Salad" "Fountain" "Beef" "general" ...
## $ RegisterName : chr "RT149" "RT149" "RT149" "RT149" ...
## $ StoreNumber : chr "AZ23501305" "AZ23501289" "AZ23501367" "AZ23501633" ...
## $ TransactionNumber: chr "002XIIC146121" "002XIIC146121" "00PG9FL135736" "00Z3B4R37335" ...
## $ CustomerCode : chr "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" "CWM11331L8O" ...
## $ Price : num 66.22 2.88 12.02 18.43 18.43 ...
## $ Quantity : int 1 1 2 1 1 1 1 1 1 1 ...
## $ TotalDue : num 66.22 2.88 24.04 18.43 18.43 ...
(6 points) Is every transaction summarized in one row of the dataframe? Include a code chunk with code that will display some kind of evidence (e.g., number of rows and number of unique transaction numbers) to support your conclusion. Below the code chunk, clearly indicate how the output of your code supports your decision.
# The number of rows in the dataframe
num_rows <- nrow(df)
# The number of unique transaction numbers
num_unique_transactions <- length(unique(df$TransactionNumber))
# Print the results
cat("Number of rows in the dataframe:", num_rows, "\n")
## Number of rows in the dataframe: 438128
cat("Number of unique transaction numbers:", num_unique_transactions, "\n")
## Number of unique transaction numbers: 161053
(3 points) Display the summaries of the Price, Quantity and TotalDue columns. Below the output, provide a brief interpretation of the output for each column.
# Summaries of Price, Quantity, and TotalDue columns
summary(df$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -5740.51 4.50 11.29 14.36 14.68 21449.97 12
summary(df$Quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.177 1.000 815.000
summary(df$TotalDue)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -5740.51 4.50 11.80 15.25 15.04 21449.97 12
(6 points) Display the boxplots of the log values for the Price, Quantity and TotalDue columns. Below the output, provide a brief description of three insights that you see in the boxplots. As part of your description, indicate how the output from task 5 relates to the boxplots in this task.
# Load library
library(ggplot2)
# Log transformation for each column
df$log_Price <- log(df$Price + 1) # Adding 1 to avoid log(0) issues
## Warning in log(df$Price + 1): NaNs produced
df$log_Quantity <- log(df$Quantity + 1) # Adding 1 to avoid log(0) issues
df$log_TotalDue <- log(df$TotalDue + 1) # Adding 1 to avoid log(0) issues
## Warning in log(df$TotalDue + 1): NaNs produced
# Boxplots for the log-transformed columns
par(mfrow=c(1,3)) # Arrange plots in one row, three columns
boxplot(df$log_Price, main="Log of Price", ylab="Log(Price)")
boxplot(df$log_Quantity, main="Log of Quantity", ylab="Log(Quantity)")
boxplot(df$log_TotalDue, main="Log of TotalDue", ylab="Log(TotalDue)")
# Numeric summary for log-transformed columns
summary(df$log_Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.050 1.705 2.509 2.295 2.752 9.973 347
summary(df$log_Quantity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6931 0.6931 0.6931 0.7400 0.6931 6.7044
summary(df$log_TotalDue)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -1.050 1.705 2.556 2.355 2.775 9.973 353
In the boxplot for log-transformed quantity most values are clustered around the lower end. The maximum value indicates a few high outliers.
The numeric summary for quantity in Task 5 showed a minimum of 1 and a maximum of 815. The log-transformed boxplot confirms that while most transactions involve low quantities, there are rare high quantities.
The boxplot for log-transformed total due displays a significant spread with values ranging widely. The median (2.556) is higher than the mean (2.355), showing the impact of a few high values.
In Task 5, the numeric summary for total due showed that the mean ($15.25) was higher than the median ($11.80), indicating the influence of extreme high values.
The numeric summary of log transformed data indicates an increase in missing values (from 12 to 353) which could be due to issues with the log transformation, likely due to zeros or negative values in the original data.