Mugil_task

INTRODUCTION

This code performs exploratory data analysis on a Bakery dataset and then applies the Apriori algorithm to find frequent itemsets and association rules.

First, the necessary libraries are loaded with the library() function. The arules package generates the frequent itemsets and association rules, arulesViz and ggplot2 are used for visualization (plotly is also loaded), and dplyr is used for grouping and summarizing the data.

library(arules)

## Warning: package 'arules' was built under R version 4.2.2

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

library(arulesViz)

## Warning: package 'arulesViz' was built under R version 4.2.2

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.2

library(plotly)

## Warning: package 'plotly' was built under R version 4.2.2

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.2.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:arules':
## 
##     intersect, recode, setdiff, setequal, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The Bakery.csv file is read into R.

data <- read.csv("C:\\Users\\mugil\\Desktop\\unsupervised  learning\\Bakery.csv")
data$Date <- as.Date(data$DateTime, format = "%Y-%m-%d %H:%M:%S")

The DateTime column is converted from a string to a Date and stored in a new Date column. The year, month, and day are then extracted from Date and stored as character strings in new columns.

# format() already returns character strings, and data$Date is already a Date
data$year <- format(data$Date, "%Y")
data$month <- format(data$Date, "%m")
data$day <- format(data$Date, "%d")
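
The conversion and extraction steps can be tried on a single value. The sketch below uses a made-up timestamp (an assumption, not a row from Bakery.csv) and also shows that a string that does not match the format yields NA rather than an error, which is one reason to check for missing values afterwards.

```r
# Hypothetical timestamp in the same "%Y-%m-%d %H:%M:%S" layout used above.
stamp <- "2016-10-30 09:58:11"

# as.Date() keeps only the calendar date and drops the time of day.
parsed <- as.Date(stamp, format = "%Y-%m-%d %H:%M:%S")

# format() returns character strings, so no extra as.character() is needed.
parts <- c(
  year = format(parsed, "%Y"),
  month = format(parsed, "%m"),
  day = format(parsed, "%d")
)
print(parts)

# A string that does not match the format is converted to NA.
mismatched <- as.Date("30/10/2016 09:58", format = "%Y-%m-%d %H:%M:%S")
print(is.na(mismatched))
```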

The expression sum(is.na(data)) counts the missing values across the whole data frame as a basic data quality check. The result of 0 shows that no values are missing.

sum(is.na(data))

## [1] 0
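
A single total does not show where missing values occur. As a sketch on a small made-up data frame (the values are hypothetical), colSums() gives the same check broken down by column:

```r
# Toy data frame with two deliberately missing values (hypothetical data).
toy <- data.frame(
  Items = c("Bread", NA, "Coffee"),
  Daypart = c("Morning", "Morning", NA),
  TransactionNo = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Total number of missing cells, as in the check above.
total_missing <- sum(is.na(toy))

# Missing cells per column, which points to the affected variables.
missing_per_column <- colSums(is.na(toy))

print(total_missing)
print(missing_per_column)
```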

A bar chart is created with ggplot2 to display the 10 most frequently sold items. The data are grouped by Items, the number of rows for each item is counted, and the items are sorted by descending count. The first 10 rows are then plotted, and the plot is customized with a title, text sizes, grid lines, and other theme settings.

#Top 10 items
data_n <- data %>% group_by(Items) %>% summarize(Count = n()) %>% arrange(desc(Count))

ggplot(data_n[1:10,], aes(x = Items, y = Count)) +
  geom_bar(stat = "identity") +
  labs(title = "Items Most Sold") +
  theme(plot.title = element_text(size = 20),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        plot.background = element_blank(),
        panel.grid.major = element_line(color = "gray", linetype = "dashed"),
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        axis.line = element_line(color = "black", linewidth = 0.5),
        axis.title = element_text(size = 16))
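
The count-and-rank step can also be sketched in base R. The item vector below is made up for illustration. The last step is an optional design choice: ggplot2 orders a character x-axis alphabetically, so converting Items to a factor whose levels follow the ranking keeps the bars in descending order of count.

```r
# Hypothetical item column: one entry per sold item.
toy_items <- c("Coffee", "Bread", "Coffee", "Tea", "Bread", "Coffee", "Cake")

# Count each item and sort by descending count.
item_counts <- sort(table(toy_items), decreasing = TRUE)

# Keep the two most frequent items (the report keeps ten).
top_items <- item_counts[seq_len(2)]
top_df <- data.frame(
  Items = names(top_items),
  Count = as.integer(top_items),
  stringsAsFactors = FALSE
)

# Fix the level order so that a plot follows the ranking, not the alphabet.
top_df$Items <- factor(top_df$Items, levels = top_df$Items)
print(top_df)
```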

Another bar chart is created using ggplot2 to display the number of unique transactions per month.

# Unique transactions per month
monthly_count <- data %>%
  group_by(month) %>%
  summarize(Unique_Transactions = n_distinct(TransactionNo))

ggplot(monthly_count, aes(x = month, y = Unique_Transactions)) +
  geom_col() +
  xlab("Month") +
  ylab("Unique Transactions") +
  ggtitle("Unique Transactions per Month") +
  theme(plot.title = element_text(hjust = 1))
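
Counting unique transactions is not the same as counting rows, because one transaction can span several rows (one per item). A minimal base R sketch on made-up data shows the difference; the toy columns mirror TransactionNo and month but the values are hypothetical.

```r
# Toy data: transaction 1 has three item rows, transaction 3 has two.
toy <- data.frame(
  TransactionNo = c(1, 1, 1, 2, 3, 3),
  month = c("01", "01", "01", "01", "02", "02"),
  stringsAsFactors = FALSE
)

# Rows (items sold) per month.
rows_per_month <- table(toy$month)

# Distinct transactions per month, the base R counterpart of n_distinct().
transactions_per_month <- tapply(
  toy$TransactionNo, toy$month,
  function(ids) length(unique(ids))
)

print(rows_per_month)
print(transactions_per_month)
```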

A bar chart is created using ggplot2 to display the number of unique transactions per year.

# Unique transactions per year
year_count <- data %>%
  group_by(year) %>%
  summarize(Unique_Transactions = n_distinct(TransactionNo))

ggplot(year_count, aes(x = year, y = Unique_Transactions)) +
  geom_col() +
  xlab("Year") +
  ylab("Unique Transactions") +
  ggtitle("Unique Transactions per Year") +
  theme(plot.title = element_text(hjust = 1))

A bar chart is created using ggplot2 to display the number of unique transactions per daypart (morning, afternoon, evening, and night).

# Unique transactions per daypart
daypart_count <- data %>%
  group_by(Daypart) %>%
  summarize(Unique_Transactions = n_distinct(TransactionNo))

ggplot(daypart_count, aes(x = Daypart, y = Unique_Transactions)) +
  geom_col() +
  xlab("Daypart") +
  ylab("Unique Transactions") +
  ggtitle("Unique Transactions per Daypart") +
  theme(plot.title = element_text(hjust = 1))

A bar chart is created using ggplot2 to display the number of unique transactions per day type (weekend or weekday).

# Unique transactions per day type
DayType_count <- data %>%
  group_by(DayType) %>%
  summarize(Unique_Transactions = n_distinct(TransactionNo))

ggplot(DayType_count, aes(x = DayType, y = Unique_Transactions)) +
  geom_col() +
  xlab("DayType") +
  ylab("Unique Transactions") +
  ggtitle("Unique Transactions per DayType") +
  theme(plot.title = element_text(hjust = 1))

A bar chart is created using ggplot2 to display the number of unique transactions per day of the month.

# Unique transactions per day of the month
Day_count <- data %>%
  group_by(day) %>%
  summarize(Unique_Transactions = n_distinct(TransactionNo))

ggplot(Day_count, aes(x = day, y = Unique_Transactions)) +
  geom_col() +
  xlab("Day") +
  ylab("Unique Transactions") +
  ggtitle("Unique Transactions per Day of the Month") +
  theme(plot.title = element_text(hjust = 1))

#######################

# Convert the data to a transaction object

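This line reads the "Bakery.csv" file directly into a transaction object. A transaction is a collection of items that were purchased together, such as all the items bought in a single visit to the bakery. The format = "basket" argument tells read.transactions() that each line of the file is one transaction, and sep = "," specifies that the items on a line are separated by commas. Note that Bakery.csv has one line per purchased item, with fields such as TransactionNo, Items, DateTime, Daypart, and DayType, so every line is read as its own transaction and every field value is treated as an item. This is why values such as Afternoon and Weekday appear as items in the results below. The "EOF within quoted string" warnings indicate that some lines contain an unmatched quote character.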

transactions <- read.transactions("C:\\Users\\mugil\\Desktop\\unsupervised  learning\\Bakery.csv", format = "basket", sep = ",")

## Warning in scan(text = l, what = "character", sep = sep, quote = quote, : EOF
## within quoted string
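
If the goal is one transaction per TransactionNo rather than one per CSV line, a common alternative is to group the item names by transaction number first and then coerce the resulting list. The following is only a sketch on a made-up data frame (toy and its values are hypothetical), not the method used in this report. The coercion step runs only when arules is installed.

```r
# Toy stand-in for the bakery data: one row per purchased item.
toy <- data.frame(
  TransactionNo = c(1, 1, 2, 3, 3, 3),
  Items = c("Bread", "Coffee", "Coffee", "Tea", "Cake", "Tea"),
  stringsAsFactors = FALSE
)

# Group item names by transaction number; unique() drops items that
# were bought more than once within the same transaction.
baskets <- lapply(split(toy$Items, toy$TransactionNo), unique)
print(baskets)

# Coerce the list to an arules transactions object when arules is available.
if (requireNamespace("arules", quietly = TRUE)) {
  toy_transactions <- methods::as(baskets, "transactions")
  arules::inspect(toy_transactions)
}
```

With this layout, Daypart and DayType values do not enter the baskets unless they are added deliberately.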

# Find frequent itemsets using the Apriori algorithm

This line uses the Apriori algorithm to find frequent itemsets in the transaction data. An itemset is a collection of one or more items that occur together in a transaction. The supp = 0.01 argument sets the minimum support threshold for an itemset to be considered frequent, where support is the proportion of transactions that contain the itemset. The minlen = 2 and maxlen = 15 arguments set the minimum and maximum number of items in an itemset, so single items are excluded. The target = "frequent itemsets" argument specifies that frequent itemsets, rather than rules, should be returned.

frequent_itemsets <- apriori(transactions, parameter = list(supp = 0.01, minlen = 2, maxlen = 15, target = "frequent itemsets"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen            target  ext
##      15 frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 205 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[19051 item(s), 20508 transaction(s)] done [0.04s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## sorting transactions ... done [0.00s].
## writing ... [72 set(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Print the frequent itemsets

This line prints the first 10 frequent itemsets found by the Apriori algorithm.

inspect(head(frequent_itemsets, n = 10))

##      items                  support    count
## [1]  {Toast, Weekday}       0.01116637 229  
## [2]  {Afternoon, Scone}     0.01043495 214  
## [3]  {Afternoon, Soup}      0.01584747 325  
## [4]  {Soup, Weekday}        0.01306807 268  
## [5]  {Afternoon, Alfajores} 0.01194656 245  
## [6]  {Alfajores, Weekday}   0.01131266 232  
## [7]  {Afternoon, Juice}     0.01092257 224  
## [8]  {Juice, Weekday}       0.01116637 229  
## [9]  {Afternoon, Muffin}    0.01004486 206  
## [10] {Muffin, Weekday}      0.01043495 214
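
The support values in this table follow directly from the counts. As a worked check using only numbers reported above (20,508 transactions from the apriori log, a minimum support of 0.01, and a count of 229 for {Toast, Weekday}):

```r
# Values taken from the apriori log and the inspect() output above.
n_transactions <- 20508
min_support <- 0.01
toast_weekday_count <- 229

# Support threshold expressed as a number of transactions
# (the log reports this as an absolute minimum support count of 205).
min_count <- min_support * n_transactions

# Support is the count divided by the total number of transactions.
toast_weekday_support <- toast_weekday_count / n_transactions

print(min_count)
print(round(toast_weekday_support, 8))
```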

Each row represents an itemset, here a combination of two or more items (because minlen = 2) that frequently occur together in a transaction. The 'items' column shows the items in each itemset, the 'support' column gives the proportion of transactions that contain the itemset, and the 'count' column gives the number of transactions that contain it. For example, the first row shows that the itemset {Toast, Weekday} has a count of 229, meaning it appears in 229 of the 20,508 transactions, and therefore a support of 0.01116637, or approximately 1.1% of all transactions.

The next line creates a plot of the frequent itemsets found by the Apriori algorithm.

plot(frequent_itemsets)

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

The result is a scatter plot in which each point represents a frequent itemset, positioned according to its support. Frequent itemsets carry only support and count as quality measures. Confidence is defined only for rules, so it does not appear in this plot.

# Generate a bar plot of the frequent itemsets, sorted by descending support
itemset_support <- sort(setNames(quality(frequent_itemsets)$support, labels(frequent_itemsets)), decreasing = TRUE)
barplot(itemset_support, las = 2, cex.names = 0.7, main = "Frequent Itemsets", xlab = "Itemsets", ylab = "Support")

The bar plot shows the support values of the frequent itemsets in descending order. The x-axis represents the itemsets, labeled by their items, and the y-axis represents the support values. This plot is useful for identifying the most common itemsets in the data.

# Find association rules using the Apriori algorithm

This line mines association rules from the transactions using the Apriori algorithm. Setting supp = 0.01 means that only rules whose items appear together in at least 1% of the transactions are considered. Setting conf = 0.5 means that only rules with a confidence of at least 50% are returned. The minlen = 2 and maxlen = 15 arguments bound the number of items in a rule, counting the left-hand and right-hand sides together. Finally, target = "rules" requests association rules rather than itemsets.

rules <- apriori(transactions, parameter = list(supp = 0.01, conf = 0.5, minlen = 2, maxlen = 15, target = "rules"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##      15  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 205 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[19051 item(s), 20508 transaction(s)] done [0.03s].
## sorting and recoding items ... [23 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [63 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

# Print the association rules

This line prints the first 10 association rules found by the Apriori algorithm. It uses the head() function to select the first 10 rules and the inspect() function to display the rules in a readable format.

inspect(head(rules, n = 10))

##      lhs            rhs         support    confidence coverage   lift     
## [1]  {Toast}     => {Weekday}   0.01116637 0.7201258  0.01550614 1.1549495
## [2]  {Scone}     => {Afternoon} 0.01043495 0.6544343  0.01594500 1.1609981
## [3]  {Soup}      => {Afternoon} 0.01584747 0.9502924  0.01667642 1.6858647
## [4]  {Soup}      => {Weekday}   0.01306807 0.7836257  0.01667642 1.2567918
## [5]  {Alfajores} => {Afternoon} 0.01194656 0.6639566  0.01799298 1.1778912
## [6]  {Alfajores} => {Weekday}   0.01131266 0.6287263  0.01799298 1.0083615
## [7]  {Juice}     => {Afternoon} 0.01092257 0.6070461  0.01799298 1.0769291
## [8]  {Juice}     => {Weekday}   0.01116637 0.6205962  0.01799298 0.9953224
## [9]  {Muffin}    => {Afternoon} 0.01004486 0.5567568  0.01804174 0.9877135
## [10] {Muffin}    => {Weekday}   0.01043495 0.5783784  0.01804174 0.9276127
##      count
## [1]  229  
## [2]  214  
## [3]  325  
## [4]  268  
## [5]  245  
## [6]  232  
## [7]  224  
## [8]  229  
## [9]  206  
## [10] 214
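
The columns of this table are related by simple ratios: coverage is the support of the left-hand side, confidence is the rule's support divided by its coverage, and lift is the confidence divided by the support of the right-hand side. The sketch below checks these relations for rule [3], {Soup} => {Afternoon}, using only values printed above. The support of {Afternoon} is not printed, so it is derived from the lift.

```r
# Values for rule [3] {Soup} => {Afternoon} from the output above.
n_transactions <- 20508
rule_count <- 325
coverage <- 0.01667642     # support of the left-hand side {Soup}
reported_lift <- 1.6858647

# Support of the rule: transactions containing both sides / all transactions.
rule_support <- rule_count / n_transactions

# Confidence: how often the right-hand side occurs when the left-hand side does.
confidence <- rule_support / coverage

# Lift = confidence / support(rhs), so support(rhs) = confidence / lift.
afternoon_support <- confidence / reported_lift

print(round(rule_support, 8))
print(round(confidence, 5))
print(round(afternoon_support, 3))
```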

The first rule in the output is {Toast} => {Weekday}. Weekday is a value of the DayType column rather than a product, so the rule says that about 72% of the transactions containing Toast were recorded on a weekday (confidence 0.72, lift 1.15). A practical way to interpret these rules is to look for rules with high confidence and a lift clearly above 1, and to use them to inform business decisions. For instance, {Soup} => {Afternoon} has a confidence of 0.95 and a lift of 1.69, and {Soup} => {Weekday} has a confidence of 0.78 and a lift of 1.26, so Soup is sold almost exclusively in the afternoon and somewhat more often on weekdays than items in general. The bakery can use this information to tailor its marketing or adjust its product offerings to attract more customers during these times.
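
Lift is a useful complement to support and confidence when ranking rules, because a lift near 1 means the left-hand side barely changes how often the right-hand side occurs. As a sketch, the ten rules printed above can be copied into a plain data frame and ranked by lift. The data frame name and layout are illustrative.

```r
# The ten rules from the inspect() output above, copied by hand.
rule_table <- data.frame(
  lhs = c("Toast", "Scone", "Soup", "Soup", "Alfajores",
          "Alfajores", "Juice", "Juice", "Muffin", "Muffin"),
  rhs = c("Weekday", "Afternoon", "Afternoon", "Weekday", "Afternoon",
          "Weekday", "Afternoon", "Weekday", "Afternoon", "Weekday"),
  confidence = c(0.7201258, 0.6544343, 0.9502924, 0.7836257, 0.6639566,
                 0.6287263, 0.6070461, 0.6205962, 0.5567568, 0.5783784),
  lift = c(1.1549495, 1.1609981, 1.6858647, 1.2567918, 1.1778912,
           1.0083615, 1.0769291, 0.9953224, 0.9877135, 0.9276127),
  stringsAsFactors = FALSE
)

# Rank the rules from highest to lowest lift.
ranked <- rule_table[order(rule_table$lift, decreasing = TRUE), ]

# Rules with lift below 1: the right-hand side is slightly less likely
# than average when the left-hand side is present.
below_one <- ranked[ranked$lift < 1, ]

print(head(ranked, 3))
print(below_one)
```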

This line plots the association rules found by the Apriori algorithm. The plot shows the support and confidence of each rule, with the support on the x-axis and the confidence on the y-axis.

plot(rules)

This line extracts the left-hand-side (LHS) and right-hand-side (RHS) items from the association rules and creates a data frame to store the pairs. It uses the labels() function to extract the item labels from the rules and the data.frame() function to create a data frame with two columns: lhs and rhs.

# Extract item pairs from association rules
item_pairs <- data.frame(lhs = labels(lhs(rules)), rhs = labels(rhs(rules)))
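
The labels in item_pairs keep the curly braces of the set notation. Assuming labels of the form shown by inspect() above, such as {Toast}, a small helper can strip the braces so the columns hold plain item names. The function strip_braces and the two sample rows are illustrative, not part of arules.

```r
# Two sample pairs in the same "{Item}" form as the rule labels above.
toy_pairs <- data.frame(
  lhs = c("{Toast}", "{Soup}"),
  rhs = c("{Weekday}", "{Afternoon}"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove the curly braces from a label.
strip_braces <- function(x) gsub("[{}]", "", x)

clean_pairs <- data.frame(
  lhs = strip_braces(toy_pairs$lhs),
  rhs = strip_braces(toy_pairs$rhs),
  stringsAsFactors = FALSE
)
print(clean_pairs)
```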

Summary

This analysis shows how exploratory data analysis and association rule mining can help a business understand its customers' purchasing patterns. With that understanding, the business can adjust its offerings to customer interests, improve customer satisfaction, and increase its profit.

Mugil_task_3.R

mugil

2023-03-17