Install a useful package called “pacman”. This is a manager to manage package. Like your HR perosonnel, an employee to manage other employees.

once you have pacman installed and loaded. You can install/load all other packages using a function “p_load” from “pacman”.

Feel free to use other ways of installing and loading packages.

install.packages("pacman") 
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into 㤼㸱C:/Users/ranai/Documents/R/win-library/3.6㤼㸲
(as 㤼㸱lib㤼㸲 is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.6/pacman_0.5.1.zip'
Content type 'application/zip' length 389783 bytes (380 KB)
downloaded 380 KB
package ‘pacman’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\ranai\AppData\Local\Temp\RtmpI93LAv\downloaded_packages
library(pacman)
#I spoke to Professor Ron in 5/4 9-11am CT office hours about the WARNING message below and showed him that Rtools is already installed on my computer (by following the link below). I don't know how to otherwise make this warning message go away.#

Install/ load required packages.

p_load(dplyr, ggplot2, tidyr, stringr, lubridate)

Read the data file as “df”

df <- read.csv("Live_Session_Mod_3.csv")# make sure to put the csv file in your working directory

Quickly inspect the structure of the various variables

str(df)
'data.frame':   145615 obs. of  21 variables:
 $ Time             : Factor w/ 74492 levels "2014-01-02T11:03:00Z",..: 22553 35327 28037 15513 59404 21678 50788 46834 6313 72359 ...
 $ OperationType    : Factor w/ 2 levels "RETURN","SALE": 2 2 2 2 2 2 2 2 2 2 ...
 $ BarCode          : Factor w/ 1 level "*": 1 1 1 1 1 1 1 1 1 1 ...
 $ CashierName      : Factor w/ 53 levels "Aimee Port","Alberta Lynch",..: 22 11 30 12 7 29 11 42 44 29 ...
 $ LineItem         : Factor w/ 28 levels "Aubergine and Chickpea Vindaloo",..: 20 2 28 24 7 24 2 21 23 28 ...
 $ Department       : Factor w/ 8 levels "Beverage","Catering",..: 3 3 8 3 6 3 3 8 8 8 ...
 $ Category         : Factor w/ 20 levels "Aubergine and Chickpea Vindaloo",..: 14 3 20 19 2 19 3 15 18 20 ...
 $ CardholderName   : Factor w/ 48790 levels "AABID FIENE",..: 38867 41457 8593 22456 23959 28233 2516 10529 36382 44814 ...
 $ RegisterName     : Factor w/ 12 levels "FT136","FT137",..: 8 8 8 8 10 8 10 10 6 10 ...
 $ StoreNumber      : Factor w/ 29 levels "AZ23501251","AZ23501258",..: 1 2 19 5 11 5 2 14 15 19 ...
 $ TransactionNumber: Factor w/ 78559 levels "00002QD152261",..: 44782 9936 50563 40892 68609 37351 4676 44674 76508 61003 ...
 $ CustomerCode     : Factor w/ 6022 levels "C00114815PM",..: 5449 5449 5449 5449 5449 5449 5449 5449 5449 5449 ...
 $ Cost             : num  0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 ...
 $ Price            : num  7.84 14.35 4.5 14.68 12.02 ...
 $ Quantity         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Modifiers        : num  0.01 0.01 2.36 0.01 1.08 0.01 0.01 2.36 0.01 0.01 ...
 $ Subtotal         : num  7.85 14.36 6.86 14.69 13.1 ...
 $ Discounts        : num  -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 ...
 $ NetTotal         : num  7.88 14.39 6.89 14.72 13.13 ...
 $ Tax              : num  0.62 1.13 0.54 1.16 1.04 1.16 2.27 0.41 0.22 0.35 ...
 $ TotalDue         : num  8.5 15.52 7.43 15.88 14.17 ...

Please reveiew the nature of the variables and make sure they all look good. For example, if the variables related to date, such as “Time” in our dataset, is saved as a factor, we won’t be able to use it meaningfully. Thus, we must convert into a date object. Professor Ron covered this idea at length in Live Session 2.

Once, we convert “Time” as a date object, we can extract the variables such as Year, Month, Day, WeekDay, Hour for analyzing the trend over time.

First check a few values in the df$Time

head(df$Time, 3)
[1] 2015-02-19T15:30:00Z 2015-08-04T11:50:00Z 2015-04-29T19:02:00Z
74492 Levels: 2014-01-02T11:03:00Z 2014-01-02T11:29:00Z 2014-01-02T11:33:00Z ... 2017-04-03T17:08:00Z

You will notice a problem in the way the Time variable is stored currently. 1) 144660 levels - presence of levels means it’s stored as factors.

Let’s check the class of df$Time

class(df$Time)
[1] "factor"

This won’t help. We need to extract variables such as “day”, “month”, “year” etc. from this for analysis. And, we can’t do that on a factor variable.

Let’s set this as a time variable using the “strptime” function from base R package. Lubridate also has many useful functions.

df$Time <- lubridate::ymd_hms(df$Time)

Let’s check a few values, now

head(df$Time, 3)
[1] "2015-02-19 15:30:00 UTC" "2015-08-04 11:50:00 UTC" "2015-04-29 19:02:00 UTC"

Now, we see them in a proper date format with everything nicely lined up

We are ready to extract the variables.

Let’s extract the variables such as year, month, weekday, day and hour

We will use the functions from the lubridate package for this purpose.

df <- df %>% mutate(Year = year(Time), Month = month(Time), Day = day(Time), 
                    Hour = hour(Time), WeekDay = wday(Time), Date = date(Time),
                    Week = week(Time), Quarter = quarter(Time))

# I hope you remember the "mutate" function from our Mod3 on Coursera. This is to create new columns. Please play with this code by changing names "Year" and "Month" to "Year_abcd" and "Month_abcd" so that you understand which part of the code is a function and which part is user defined names. "year", "month", "day", "hour"... are all functions. 
# Try playing with them using "df %>% mutate(Year = year_abcd(Time)" and R will wreak havoc on your the notebook. If you do "df %>% mutate(Year_abcd = year(Time)", R will be kind to run the code nicely. 

We added 8 new variables in the chunk above. Now the number of columns has increased to 29 from 21. Check in the “Environment” on the top right corner of your screen.

colSums(is.na(df))
             Time     OperationType           BarCode       CashierName          LineItem        Department 
                0                 0                 0                 0                 0                 0 
         Category    CardholderName      RegisterName       StoreNumber TransactionNumber      CustomerCode 
                0                 0                 0                 0                 0                 0 
             Cost             Price          Quantity         Modifiers          Subtotal         Discounts 
                0                 0                 0                 0                 0                 0 
         NetTotal               Tax          TotalDue              Year             Month               Day 
                0                 0                 0                 0                 0                 0 
             Hour           WeekDay              Date              Week           Quarter 
                0                 0                 0                 0                 0 

There seems to be no missing values. So we are good for now.

Let’s quickly check the first and the last date in the data

min(df$Date, na.rm = TRUE)
[1] "2014-01-02"
max(df$Date, na.rm = TRUE)
[1] "2017-04-03"

We see that the first mentioned date in the data is 2nd-Jan-2014 and 3rd-April-2017.

Generate Feature - Profitability

We create a variable called “Profitability” as (NetTotal - (Quantity X Cost))*100/ (Quantity X Cost).

Please note that the figures in the data seem to be odd. But we will just ignore that part right now.

df <- df %>% mutate(Profitability = (NetTotal - (Quantity*Cost))/ (Quantity*Cost))

There is possibility that if the quantity or cost is zero in one of the rows, the profitability may turn out to be undefined value. We can check if do suffer from that problem here.

sum(!is.finite(df$Profitability))
[1] 0

No bad values for Profitability. There may be a negative values based on the return products. But we won’t bother right now.

***************Business Decision, Analysis, and the Data*********************

Customer Analysis

To perform customer level analysis, we need to bring our data to the customer level. That means each row must correspond to a unique customer.

To perform customer level analysis, marketing folks use something called RFM analysis. Where, R stands for Recency, F stands for Frequency and M Stands for Monetary.

Check this Wiki entry for basic idea - https://en.wikipedia.org/wiki/RFM_(market_research). It’s an intutive idea.

Feature Generation for RFM Analysis

Christina wants to perform RFM analysis to find out customer segments. Thus, she must calculate the following variables for each customer: Recency - How recently has the customer made a purchase? Frequency - How frequently has the customer made purchases in the past? Monetary Value - How much average purchase ($) has the customer made in the past?

Note - there are various approaches to define Monetary - such as mean_purchase vs total_purchase. We will stick to mean_pruchase.

We may also want to keep other variables for Profitability analysis.

From LineItem within a Transaction to Customer Level Data

You need to prepare a customer level data to help Christina. Currently, the unit of observation is a unique LineItem within a Transaction. You will have to aggregate the data at the Customer Level for Customer Level Analysis.

This is a question for you for Mod 3 Assignment.

To help you, I am showing you the Transaction level aggregation. You may use the same code (with appropriate tweaking) for creating customer level data.

Creating Transaction Level Data

Remember, for aggregation, we use group_by function and the summarise function from “dplyr” package.

I will call the transaction level data as df_Transactions.

Note - This code may take a few minutes. Have faith in GOD.

df_Transactions <- df %>% group_by(TransactionNumber) %>% summarise(
                      UniqueLineItemsCount = n_distinct(LineItem), # For the count of UNIQUE LineItems in each transaction
                      TotalItemsCount = sum(Quantity), # For the count of LineItems in each transaction
                      TransactionAmount = sum(NetTotal),# Total Purchase ($) in each transaction
                      Profitability = sum(Profitability), # Total profitability of each transaction 
                      Date = Date[1], # Since the date of each transaction is the same, we will get just the first date
                      WeekDay = WeekDay[1],# same logic as above
                      Hour = Hour[1],                                                                                                            OperationType = OperationType[1],# whether Sale or Return
                      Avg_Discounts = mean(Discounts),# Average discount in the transaction
                      Items = paste(unique(LineItem), collapse = ', ')) # List of unique items in each transaction

Note the number of rows and the number of columns of the new dataframe “df_Transactions”. Do they make sense?

The number of rows is 78559. This means in the data, there are 78559 unique transactions. You only created 10 variables. The number of columns is 11. How come?

To check if there are any Missing Values in your dataframe “df_Transactions”

anyNA(df_Transactions)
[1] FALSE

If you want you can save this as an RDS object for later use.

saveRDS(df_Transactions, "df_Transactions.RDS")

or as a csv file

write.csv(df_Transactions, "df_Transactions.csv", row.names = FALSE)

RDS is an R specific format. This is optimized to use very little memory.

Compare the size of the df_Transactions.RDS and df_Transactions.csv files. The size is almost 1:5.

You may read an RDS datafile using a function called ’readRDS(filename.RDS)"

Check the nature of the variables

str(df_Transactions)
tibble [78,559 x 11] (S3: tbl_df/tbl/data.frame)
 $ TransactionNumber   : Factor w/ 78559 levels "00002QD152261",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ UniqueLineItemsCount: int [1:78559] 2 2 2 1 1 2 1 1 1 1 ...
 $ TotalItemsCount     : int [1:78559] 2 2 2 1 1 2 1 1 1 1 ...
 $ TransactionAmount   : num [1:78559] 21.7 24.1 30.9 15.1 17.1 ...
 $ Profitability       : num [1:78559] 195 217 279 136 154 ...
 $ Date                : Date[1:78559], format: "2014-05-13" "2015-07-14" "2014-05-14" ...
 $ WeekDay             : num [1:78559] 3 3 4 5 3 3 1 3 5 1 ...
 $ Hour                : int [1:78559] 12 15 12 20 12 15 17 11 18 17 ...
 $ OperationType       : Factor w/ 2 levels "RETURN","SALE": 2 2 2 2 2 2 2 2 2 2 ...
 $ Avg_Discounts       : num [1:78559] -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 ...
 $ Items               : chr [1:78559] "Coconut and Beef Vindaloo, Chicken and Onion Kabob" "Beef and Apple Burgers, Salmon and Wheat Bran Salad" "Chicken and Onion Kabob, Beef and Broccoli Stir Fry" "Soya and Basil Salad" ...

Understand the data better by looking at the distributions of a few variables.

summary(df_Transactions$TotalItemsCount)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.854   2.000  22.000 

The number of items in a transcation range between 1 and 22. The average number of items in a transaction is 1.85. Median is 1. This means half of the transactions have one items only. 50% orders have only one item.

Get the first six rows in the data

head(df_Transactions)

A few questions for you to consider: - What is the average/ median Transaction Amount? - How many Transactions sell more than 1 item? - What are the number of transactions across all days in a week? See the code below

df_Transactions %>% group_by(WeekDay) %>% summarise(total_transaction = n() )

Your Task - Create data for Customer Level Analysis

Remember to create the Recency variable, you will have to calculate how many days have passed since the customer made the last purchase. FOr that, we will need a reference date. You may set the reference date as the last date in the data.

Hint - you may get difference between two dates using “difftime()” function as Prof Ron showed in Mod2. You may also just substract two days to get the duration in days.

Please get the following variables in your customer level data and save it as “df_Customers”

Customer Level Data with the following variables:

  1. TotalItemsCount - Number of total items bought
  2. AvgItemCount – Avg number of items bought
  3. Recency – difference between the last date in the data and the last date on which the customer made a transaction
  4. Frequency – Total number of transactions by a customer
  5. Monetary – Average amount of purchase
  6. Profitability – Total profitability
  7. Discounts – Total discount
  8. UniqueItems – Number of unique items bought (indicating broad taste vs narrow taste)
  9. Items – the list of unique items bought by a given customer

Customer Level Data - You can fill your code here.

str(df)
'data.frame':   145615 obs. of  30 variables:
 $ Time             : POSIXct, format: "2015-02-19 15:30:00" "2015-08-04 11:50:00" "2015-04-29 19:02:00" ...
 $ OperationType    : Factor w/ 2 levels "RETURN","SALE": 2 2 2 2 2 2 2 2 2 2 ...
 $ BarCode          : Factor w/ 1 level "*": 1 1 1 1 1 1 1 1 1 1 ...
 $ CashierName      : Factor w/ 53 levels "Aimee Port","Alberta Lynch",..: 22 11 30 12 7 29 11 42 44 29 ...
 $ LineItem         : Factor w/ 28 levels "Aubergine and Chickpea Vindaloo",..: 20 2 28 24 7 24 2 21 23 28 ...
 $ Department       : Factor w/ 8 levels "Beverage","Catering",..: 3 3 8 3 6 3 3 8 8 8 ...
 $ Category         : Factor w/ 20 levels "Aubergine and Chickpea Vindaloo",..: 14 3 20 19 2 19 3 15 18 20 ...
 $ CardholderName   : Factor w/ 48790 levels "AABID FIENE",..: 38867 41457 8593 22456 23959 28233 2516 10529 36382 44814 ...
 $ RegisterName     : Factor w/ 12 levels "FT136","FT137",..: 8 8 8 8 10 8 10 10 6 10 ...
 $ StoreNumber      : Factor w/ 29 levels "AZ23501251","AZ23501258",..: 1 2 19 5 11 5 2 14 15 19 ...
 $ TransactionNumber: Factor w/ 78559 levels "00002QD152261",..: 44782 9936 50563 40892 68609 37351 4676 44674 76508 61003 ...
 $ CustomerCode     : Factor w/ 6022 levels "C00114815PM",..: 5449 5449 5449 5449 5449 5449 5449 5449 5449 5449 ...
 $ Cost             : num  0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 0.11 ...
 $ Price            : num  7.84 14.35 4.5 14.68 12.02 ...
 $ Quantity         : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Modifiers        : num  0.01 0.01 2.36 0.01 1.08 0.01 0.01 2.36 0.01 0.01 ...
 $ Subtotal         : num  7.85 14.36 6.86 14.69 13.1 ...
 $ Discounts        : num  -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 -0.03 ...
 $ NetTotal         : num  7.88 14.39 6.89 14.72 13.13 ...
 $ Tax              : num  0.62 1.13 0.54 1.16 1.04 1.16 2.27 0.41 0.22 0.35 ...
 $ TotalDue         : num  8.5 15.52 7.43 15.88 14.17 ...
 $ Year             : num  2015 2015 2015 2014 2016 ...
 $ Month            : num  2 8 4 11 7 2 3 1 4 2 ...
 $ Day              : int  19 4 29 4 19 7 15 19 23 27 ...
 $ Hour             : int  15 11 19 17 17 12 13 13 16 16 ...
 $ WeekDay          : num  5 3 4 3 3 7 3 3 4 2 ...
 $ Date             : Date, format: "2015-02-19" "2015-08-04" "2015-04-29" ...
 $ Week             : num  8 31 17 44 29 6 11 3 17 9 ...
 $ Quarter          : int  1 3 2 4 3 1 1 1 2 1 ...
 $ Profitability    : num  70.6 129.8 61.6 132.8 118.4 ...
df_Customers <- df %>% group_by(CardholderName) %>% summarise(TotalItemsCount = sum(Quantity), #Create df_Customers dataframe grouped by Cardholder Name, with new columns as follows... Number of total items bought
        AvgItemCount = mean(Quantity), #Avg number of items bought
        Recency = difftime(min(Time), max(Time)), #Difference between the last date in the data and the last date on which the customer made a transaction
        Frequency = sum(n_distinct(as.numeric(TransactionNumber))), #Total number of transactions by a customer 
        Monetary = mean(NetTotal), #Average amount of purchase
        Profitability = sum(Profitability), #Total profitability 
        Discounts = sum(Discounts), #Total discount
        UniqueItems = n_distinct(LineItem), #Number of unique items bought (indicating broad taste vs narrow taste)
        Items = paste(unique(LineItem), collapse = ', ')
        ) #The list of unique items bought by a given customer 

Show first six rows of df_Customers

head(df_Customers)

Product Strategy/ Management - Department and Category

Christina also mentioned that she can perform some analysis to help Sameera take decisions for product stratgey also. Essentially figuring out which products sell well, have higher profitability, can be co-promoted, cross-sold, etc.

This can be done at various levels such as Department, Catgeory or even LineItem.

I am creating Depatment Level and LineItem Level Data and your task is to create the Category Level Data

Department Level

df_Department <- df %>% group_by(Department) %>% summarise( 
                        ItemsCount = n_distinct(LineItem), # n_distinct gets the count of 
                        #distinct/ unique values
                        TransactionsCount = n_distinct(TransactionNumber),
                        TotalSales = sum(NetTotal), 
                        CustomerCount = n_distinct(CustomerCode),
                        Profitability = mean(Profitability),
                        Discounts = mean(Discounts),
                        Items = paste(unique(LineItem), collapse = ', '))

Check the number of rows and columns in “df_Department” using “dim” function. This gives the dimensions of the dataframe.

dim(df_Department)
[1] 8 8

8 rows means there are 8 unique departments and now eahc row corresponds to one department. Again, we created 7 columns, the dataframe has got 8?

A few questions for you to consider: - Which Department pulls the highest number of customers? - Which Department has highest Sales? - Which Department sells most on discount? - Which Department has highest number of transactions?

What if we were interested in assesing each LInItem for Porfitability or Popularity or Ability to pull diverse customers

LineItem Management

df_LineItem  <-  df %>% group_by(LineItem) %>% summarise( 
                        TransactionsCount = n_distinct(TransactionNumber),
                        TotalSales = sum(NetTotal), 
                        CustomerCount = n_distinct(CustomerCode),
                        Profitability = mean(Profitability),
                        Discounts = mean(Discounts))

A few questions for you to consider: - Which LineItem pulls the highest number of customers? - Which LineItem has highest Sales? - Which Lineitem sells most on discount? This is tricky, how will you get it? - Which Lineitem has highest number of transactions?

Your Task - Create the following variables at the Catgeory Level Data

  1. UniqueItemsCount - Number of unique items in the category
  2. TransactionsCount - Number of transactions in the category
  3. TotalSales - Total Sales in the category
  4. CustomerCount - Total number of customers in this category
  5. Profitability - Average profitability
  6. Discounts - Average discount
  7. Items - the list of unique items in the category

Write your code below for creating data at Category level

df_Category  <-  df %>% group_by(Category) %>% summarise( #Creating new df_Category dataframe grouped by Category with the following columns: 
        UniqueItemsCount = n_distinct(LineItem), #Number of unique items in the category
        TransactionsCount = n_distinct(TransactionNumber), #Number of transactions in the category
        TotalSales = sum(NetTotal), #Total Sales in the category 
        CustomerCount = n_distinct(CardholderName), #Total number of customers in this category 
        Profitability = mean(Profitability), #Average profitability 
        Discounts = mean(Discounts), #Average discount
        Items = paste(unique(LineItem), collapse = ', ') #The list of unique items in the category 
)

Show first six rows of df_Category

head(df_Category)

TimeSeries Data

I will create Daily TimeSeries. Your task is to create data agrregated at monthly level.

Daily

df_day <- df %>% group_by(Day,Month,Year) %>% summarise(TransactionsCount = n_distinct(TransactionNumber),
                   TotalSales = sum(NetTotal), 
                   CustomerCount = n_distinct(CustomerCode),
                   Profitability = mean(Profitability),
                   Discounts = mean(Discounts))%>%
                   mutate(Date = make_date(Year, Month, Day))# this is to get a column "Date" so that we can use it on X-axis to create a Line Chart

# I will show a line chart below. 
                   

There are 951 rows and 9 columns in df_day data.

A few questions you may consider: How have sales increased/ decreased over time? How has number of orders/ transcations increased/ decreased over time? How has profitability increased/ decreased over time?

Write your code below to create monthly data with the following variables:

  1. TransactionsCount - Number of transactions in each month
  2. TotalSales – Total Sales in each month
  3. CustomerCount – Total Customers in each month
  4. LineItemCount – Total LineItems sold each month
  5. UniqueLineItemCount - Number of Unique Items Sold each month
  6. CategoryCount – Total Category Count each month

Monthly

df_month <- df %>% group_by(Month) %>% summarise( #Creating df_month dataframe grouped by Month with the following columns: 
        TransactionsCount = n_distinct(TransactionNumber), #Number of transactions in each month 
        TotalSales = sum(NetTotal), #Total Sales in each month 
        CustomerCount = n_distinct(CustomerCode), #Total Customers in each month 
        LineItemCount = sum(as.numeric(LineItem)),  #Total LineItems sold each month 
        UniqueLineItemCount = n_distinct(LineItem), #Number of Unique Items Sold each month 
        CategoryCount = n_distinct(Category), #Total Category Count each month
)

Get first six rows of df_month

head(df_month)
---
title: "oconno25_Mod_3.RMD"
output: html_notebook
author": "Ranait OConnor" 
---
Install a useful package called "pacman". This is a manager to manage package. Like your HR perosonnel, an employee to manage other employees. 

once you have pacman installed and loaded. You can install/load all other packages using a function "p_load" from "pacman". 

Feel free to use other ways of installing and loading packages. 

```{r}
install.packages("pacman") 
library(pacman)
#I spoke to Professor Ron in 5/4 9-11am CT office hours about the WARNING message below and showed him that Rtools is already installed on my computer (by following the link below). I don't know how to otherwise make this warning message go away.#
```

Install/ load required packages.
```{r}
p_load(dplyr, ggplot2, tidyr, stringr, lubridate)
```

Read the data file as "df"
```{r}
df <- read.csv("Live_Session_Mod_3.csv")# make sure to put the csv file in your working directory
```


Quickly inspect the structure of the various variables
```{r}
str(df)
```

Please reveiew the nature of the variables and make sure they all look good. 
For example, if the variables related to date, such as "Time" in our dataset, is saved as a factor, we won't be able to use it meaningfully. Thus, we must convert into a date object. Professor Ron covered this idea at length in Live Session 2. 

Once, we convert "Time" as a date object, we can extract the variables such as Year, Month, Day, WeekDay, Hour for analyzing the trend over time. 

First check a few values in the df$Time
```{r}
head(df$Time, 3)
```
You will notice a problem in the way the Time variable is stored currently. 
1) 144660 levels - presence of levels means it's stored as factors.

Let's check the class of df$Time
```{r}
class(df$Time)
```
This won't help. We need to extract variables such as "day", "month", "year" etc. from this for analysis. And, we can't do that on a factor variable. 


Let's set this as a time variable using the "strptime" function from base R package. Lubridate also has many useful functions.

```{r}
df$Time <- lubridate::ymd_hms(df$Time)
```


Let's check a few values, now
```{r}
head(df$Time, 3)
```
Now, we see them in a proper date format with everything nicely lined up


We are ready to extract the variables.

Let's extract the variables such as year, month, weekday, day and hour

We will use the functions from the lubridate package for this purpose. 
```{r}
df <- df %>% mutate(Year = year(Time), Month = month(Time), Day = day(Time), 
                    Hour = hour(Time), WeekDay = wday(Time), Date = date(Time),
                    Week = week(Time), Quarter = quarter(Time))

# I hope you remember the "mutate" function from our Mod3 on Coursera. This is to create new columns. Please play with this code by changing names "Year" and "Month" to "Year_abcd" and "Month_abcd" so that you understand which part of the code is a function and which part is user defined names. "year", "month", "day", "hour"... are all functions. 
# Try playing with them using "df %>% mutate(Year = year_abcd(Time)" and R will wreak havoc on your the notebook. If you do "df %>% mutate(Year_abcd = year(Time)", R will be kind to run the code nicely. 
```

We added 8 new variables in the chunk above. Now the number of columns has increased to 29 from 21. Check in the "Environment" on the top right corner of your screen. 


```{r}
colSums(is.na(df))
```
There seems to be no missing values. So we are good for now. 



Let's quickly check the first and the last date in the data 
```{r}
min(df$Date, na.rm = TRUE)
max(df$Date, na.rm = TRUE)
```
We see that the first mentioned date in the data is 2nd-Jan-2014 and 3rd-April-2017. 

Generate Feature - Profitability 

We create a variable called "Profitability" as (NetTotal - (Quantity X Cost))*100/ (Quantity X Cost). 

Please note that the figures in the data seem to be odd. But we will just ignore that part right now. 
```{r}
df <- df %>% mutate(Profitability = (NetTotal - (Quantity*Cost))/ (Quantity*Cost))

```

There is possibility that if the quantity or cost is zero in one of the rows, the profitability may turn out to be undefined value. We can check if do suffer from that problem here. 
```{r}
sum(!is.finite(df$Profitability))
```
No bad values for Profitability. There may be a negative values based on the return products. But we won't bother right now. 

***************Business Decision, Analysis, and the Data*********************

# Customer Analysis
To perform customer level analysis, we need to bring our data to the customer level. That means each row must correspond to a unique customer.

To perform customer level analysis, marketing folks use something called RFM analysis. Where, R stands for Recency, F stands for Frequency and M Stands for Monetary.

Check this Wiki entry for basic idea - https://en.wikipedia.org/wiki/RFM_(market_research). It's an intutive idea. 


## Feature Generation for RFM Analysis 
Christina wants to perform RFM analysis to find out customer segments. Thus, she must calculate the following variables for each customer:
Recency - How recently has the customer made a purchase?
Frequency - How frequently has the customer made purchases in the past?
Monetary Value - How much average purchase ($) has the customer made in the past?

Note - there are various approaches to define Monetary - such as mean_purchase vs total_purchase. We will stick to mean_pruchase. 

We may also want to keep other variables for Profitability analysis. 


## From LineItem within a Transaction to Customer Level Data
You need to prepare a customer level data to help Christina. Currently, the unit of observation is a unique LineItem within a Transaction. You will have to aggregate the data at the Customer Level for Customer Level Analysis. 

This is a question for you for Mod 3 Assignment. 

To help you, I am showing you the Transaction level aggregation. You may use the same code (with appropriate tweaking) for creating customer level data.  


## Creating Transaction Level Data

Remember, for aggregation, we use group_by function and the summarise function from "dplyr" package.


I will call the transaction level data as df_Transactions. 

Note - This code may take a few minutes. Have faith in GOD. 
```{r}
df_Transactions <- df %>% group_by(TransactionNumber) %>% summarise(
                      UniqueLineItemsCount = n_distinct(LineItem), # For the count of UNIQUE LineItems in each transaction
                      TotalItemsCount = sum(Quantity), # For the count of LineItems in each transaction
                      TransactionAmount = sum(NetTotal),# Total Purchase ($) in each transaction
                      Profitability = sum(Profitability), # Total profitability of each transaction 
                      Date = Date[1], # Since the date of each transaction is the same, we will get just the first date
                      WeekDay = WeekDay[1],# same logic as above
                      Hour = Hour[1],                                                                                                            OperationType = OperationType[1],# whether Sale or Return
                      Avg_Discounts = mean(Discounts),# Average discount in the transaction
                      Items = paste(unique(LineItem), collapse = ', ')) # List of unique items in each transaction
```

Note the number of rows and the number of columns of the new dataframe "df_Transactions". Do they make sense? 

The number of rows is 78559. This means in the data, there are 78559 unique transactions. 
You only created 10 variables. The number of columns is 11. How come?


To check if there are any Missing Values in your dataframe "df_Transactions"
```{r}
anyNA(df_Transactions)
```

If you want you can save this as an RDS object for later use. 
```{r}
saveRDS(df_Transactions, "df_Transactions.RDS")
```

or as a csv file
```{r}
write.csv(df_Transactions, "df_Transactions.csv", row.names = FALSE)
```

RDS is an R specific format. This is optimized to use very little memory. 

Compare the size of the df_Transactions.RDS and df_Transactions.csv files. The size is almost 1:5. 

You may read an RDS datafile using a function called 'readRDS(filename.RDS)"




Check the nature of the variables
```{r}
str(df_Transactions)
```

Understand the data better by looking at the distributions of a few variables.
```{r}
summary(df_Transactions$TotalItemsCount)
```
The number of items in a transcation range between 1 and 22. The average number of items in a transaction is 1.85. Median is 1. This means half of the transactions have one items only. 50% orders have only one item. 




Get the first six rows in the data
```{r}
head(df_Transactions)
```

A few questions for you to consider:
 - What is the average/ median Transaction Amount?
 - How many Transactions sell more than 1 item?
 - What are the number of transactions across all days in a week? See the code below
```{r}
df_Transactions %>% group_by(WeekDay) %>% summarise(total_transaction = n() )
```
 
 
 
 
# Your Task - Create data for Customer Level Analysis 

Remember to create the Recency variable, you will have to calculate how many days have passed since the customer made the last purchase. FOr that, we will need a reference date. You may set the reference date as the last date in the data. 

Hint - you may get difference between two dates using "difftime()" function as Prof Ron showed in Mod2. You may also just substract two days to get the duration in days. 

Please get the following variables in your customer level data and save it as "df_Customers"

Customer Level Data with the following variables:

1.	TotalItemsCount - Number of total items bought
2.	AvgItemCount – Avg number of items bought
3.	Recency – difference between the last date in the data and the last date on which the    customer made a transaction 
4.	Frequency – Total number of transactions by a customer
5.	Monetary – Average amount of purchase
6.	Profitability – Total profitability
7.	Discounts – Total discount
8.	UniqueItems – Number of unique items bought (indicating broad taste vs narrow taste)
9.	Items – the list of unique items bought by a given customer


# Customer Level Data - You can fill your code here. 
```{r}
str(df)
df_Customers <- df %>% group_by(CardholderName) %>% summarise(TotalItemsCount = sum(Quantity), #Create df_Customers dataframe grouped by Cardholder Name, with new columns as follows... Number of total items bought
        AvgItemCount = mean(Quantity), #Avg number of items bought
        Recency = difftime(min(Time), max(Time)), #Difference between the last date in the data and the last date on which the customer made a transaction
        Frequency = sum(n_distinct(as.numeric(TransactionNumber))), #Total number of transactions by a customer 
        Monetary = mean(NetTotal), #Average amount of purchase
        Profitability = sum(Profitability), #Total profitability 
        Discounts = sum(Discounts), #Total discount
        UniqueItems = n_distinct(LineItem), #Number of unique items bought (indicating broad taste vs narrow taste)
        Items = paste(unique(LineItem), collapse = ', ')
        ) #The list of unique items bought by a given customer 

```

Show first six rows of df_Customers 

```{r}
head(df_Customers)
```


# Product Strategy/ Management - Department and Category 

Christina also mentioned that she can perform some analysis to help Sameera take decisions for product stratgey also. Essentially figuring out which products sell well, have higher profitability, can be co-promoted, cross-sold, etc. 

This can be done at various levels such as Department, Catgeory or even LineItem. 

I am creating Depatment Level and LineItem Level Data and your task is to create the Category Level Data

## Department Level 
```{r}
df_Department <- df %>% group_by(Department) %>% summarise( 
                        ItemsCount = n_distinct(LineItem), # n_distinct gets the count of 
                        #distinct/ unique values
                        TransactionsCount = n_distinct(TransactionNumber),
                        TotalSales = sum(NetTotal), 
                        CustomerCount = n_distinct(CustomerCode),
                        Profitability = mean(Profitability),
                        Discounts = mean(Discounts),
                        Items = paste(unique(LineItem), collapse = ', '))
```

Check the number of rows and columns in "df_Department" using "dim" function. This gives the dimensions of the dataframe. 

```{r}
dim(df_Department)
```
8 rows means there are 8 unique departments and now eahc row corresponds to one department. 
Again, we created 7 columns, the dataframe has got 8?

A few questions for you to consider:
 - Which Department pulls the highest number of customers?
 - Which Department has highest Sales?
 - Which Department sells most on discount? 
 - Which Department has highest number of transactions?

What if we were interested in assesing each LInItem for Porfitability or Popularity or Ability to pull diverse customers

## LineItem Management 
```{r}
df_LineItem  <-  df %>% group_by(LineItem) %>% summarise( 
                        TransactionsCount = n_distinct(TransactionNumber),
                        TotalSales = sum(NetTotal), 
                        CustomerCount = n_distinct(CustomerCode),
                        Profitability = mean(Profitability),
                        Discounts = mean(Discounts))
```

A few questions for you to consider:
 - Which LineItem pulls the highest number of customers?
 - Which LineItem has highest Sales?
 - Which Lineitem sells most on discount? This is tricky, how will you get it?
 - Which Lineitem has highest number of transactions?


## Your Task - Create the following variables at the Catgeory Level Data

1.	UniqueItemsCount - Number of unique items in the category
2.	TransactionsCount - Number of transactions in the category
3.	TotalSales - Total Sales in the category
4.	CustomerCount - Total number of customers in this category
5.	Profitability - Average profitability
6.	Discounts - Average discount
7.	Items - the list of unique items in the category


Write your code below for creating data at Category level
```{r}
df_Category  <-  df %>% group_by(Category) %>% summarise( #Creating new df_Category dataframe grouped by Category with the following columns: 
        UniqueItemsCount = n_distinct(LineItem), #Number of unique items in the category
        TransactionsCount = n_distinct(TransactionNumber), #Number of transactions in the category
        TotalSales = sum(NetTotal), #Total Sales in the category 
        CustomerCount = n_distinct(CardholderName), #Total number of customers in this category 
        Profitability = mean(Profitability), #Average profitability 
        Discounts = mean(Discounts), #Average discount
        Items = paste(unique(LineItem), collapse = ', ') #The list of unique items in the category 
)
```



Show first six rows of df_Category
```{r}
head(df_Category)
```



# TimeSeries Data 

I will create Daily TimeSeries. Your task is to create data agrregated at monthly level.

# Daily
```{r}
df_day <- df %>% group_by(Day,Month,Year) %>% summarise(TransactionsCount = n_distinct(TransactionNumber),
                   TotalSales = sum(NetTotal), 
                   CustomerCount = n_distinct(CustomerCode),
                   Profitability = mean(Profitability),
                   Discounts = mean(Discounts))%>%
                   mutate(Date = make_date(Year, Month, Day))# this is to get a column "Date" so that we can use it on X-axis to create a Line Chart

# I will show a line chart below. 
                   
```

There are 951 rows and 9 columns in df_day data.

A few questions you may consider:
How have sales increased/ decreased over time?
How has number of orders/ transcations increased/ decreased over time?
How has profitability increased/ decreased over time?



Write your code below to create monthly data with the following variables:

1.	TransactionsCount - Number of transactions in each month
2.	TotalSales – Total Sales in each month
3.	CustomerCount – Total Customers in each month
4.	LineItemCount – Total LineItems sold each month
5.	UniqueLineItemCount - Number of Unique Items Sold each month
6.	CategoryCount – Total Category Count each month

# Monthly
```{r}
df_month <- df %>% group_by(Month) %>% summarise( #Creating df_month dataframe grouped by Month with the following columns: 
        TransactionsCount = n_distinct(TransactionNumber), #Number of transactions in each month 
        TotalSales = sum(NetTotal), #Total Sales in each month 
        CustomerCount = n_distinct(CustomerCode), #Total Customers in each month 
        LineItemCount = sum(as.numeric(LineItem)),  #Total LineItems sold each month 
        UniqueLineItemCount = n_distinct(LineItem), #Number of Unique Items Sold each month 
        CategoryCount = n_distinct(Category), #Total Category Count each month
)

```


Get first six rows of df_month
```{r}
head(df_month)
```











