This is week 2 assignment where I did data analysis using R.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(knitr)
library(ggplot2)
library(readxl)

## Warning: package 'readxl' was built under R version 4.4.3

library(dplyr)

##Loaded the data from excel to R

week_2 <- read_excel("week_2.xlsx")
colnames(week_2)

## [1] "CustomerID" "Region"     "Product"    "Quantity"   "Price"     
## [6] "Date"

head(week_2)

## # A tibble: 6 × 6
##   CustomerID Region Product  Quantity Price Date      
##        <dbl> <chr>  <chr>       <dbl> <dbl> <chr>     
## 1       1001 North  Widget C        5    30 2024-01-01
## 2       1002 South  Widget C       10    30 2024-01-02
## 3       1003 East   Widget C       10    30 2024-01-03
## 4       1004 North  Widget C       10    30 2024-01-04
## 5       1005 North  Widget C        8    30 2024-01-05
## 6       1006 South  Widget A        9    20 2024-01-06

str(week_2)

## tibble [5,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ CustomerID: num [1:5000] 1001 1002 1003 1004 1005 ...
##  $ Region    : chr [1:5000] "North" "South" "East" "North" ...
##  $ Product   : chr [1:5000] "Widget C" "Widget C" "Widget C" "Widget C" ...
##  $ Quantity  : num [1:5000] 5 10 10 10 8 9 2 5 7 6 ...
##  $ Price     : num [1:5000] 30 30 30 30 30 20 30 30 20 20 ...
##  $ Date      : chr [1:5000] "2024-01-01" "2024-01-02" "2024-01-03" "2024-01-04" ...

The data has 6 columns namely:

1.CustomerID - which is numerical 2.Region - has characters 3. Product - has characters 4. Quantity - has numeric data 5. Price - has numeric data 6. Date - has characters instead of date type.

Therefore column no. 6 needs to be converted to the correct datatype

class(week_2$Date)

## [1] "character"

week_2$Date <- ymd(week_2$Date)
class(week_2$Date)

## [1] "Date"

range of date

summary(week_2$Date)

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2024-01-01" "2027-06-03" "2030-11-04" "2030-11-04" "2034-04-07" "2037-09-08"

The date range is from 2024-01-01 to 2037-09-08

Missing values

colSums(is.na(week_2))

## CustomerID     Region    Product   Quantity      Price       Date 
##          0          0          0          0          0          0

There are no missing values in any columns

Data Cleaning with dplyr

Removing missing values

week_2 <- na.omit(week_2)

Checking of columns with incorrect and unnecessary values

week2_unique <-lapply(week_2, unique)
sum(duplicated(week2_unique))

## [1] 0

There are no any unusual entries or duplicate rows.

Data Grouping and Summarizing

Grouping the data by Region and Product Sorting the summarized results in descending order of total revenue

grouped_wk2 <- week_2 %>% 
  group_by(Region, Product) %>%
  summarise(
    total_qty <-  sum(Quantity),
    total_rev <- sum(Quantity * Price),
    .groups = "drop"
  )
head(grouped_wk2)

## # A tibble: 6 × 4
##   Region Product  `total_qty <- sum(Quantity)` total_rev <- sum(Quantity * Pri…¹
##   <chr>  <chr>                           <dbl>                             <dbl>
## 1 East   Widget A                         2450                             49000
## 2 East   Widget B                         2290                             34350
## 3 East   Widget C                         2459                             73770
## 4 North  Widget A                         2345                             46900
## 5 North  Widget B                         2199                             32985
## 6 North  Widget C                         2349                             70470
## # ℹ abbreviated name: ¹`total_rev <- sum(Quantity * Price)`

Arrange in descending order for total revenue

grouped_wk2 <- grouped_wk2%>%
  arrange(Region,desc(`total_rev <- sum(Quantity * Price)`))
head(grouped_wk2)

## # A tibble: 6 × 4
##   Region Product  `total_qty <- sum(Quantity)` total_rev <- sum(Quantity * Pri…¹
##   <chr>  <chr>                           <dbl>                             <dbl>
## 1 East   Widget C                         2459                             73770
## 2 East   Widget A                         2450                             49000
## 3 East   Widget B                         2290                             34350
## 4 North  Widget C                         2349                             70470
## 5 North  Widget A                         2345                             46900
## 6 North  Widget B                         2199                             32985
## # ℹ abbreviated name: ¹`total_rev <- sum(Quantity * Price)`

sales with total revenue over 1000

filtered_group <- week_2 %>%
  mutate(Revenue <- (Quantity * Price))%>%
  filter(sum(Quantity * Price) > 1000) %>%
  ungroup()
head(filtered_group)

## # A tibble: 6 × 7
##   CustomerID Region Product  Quantity Price Date       Revenue <- (Quantity * …¹
##        <dbl> <chr>  <chr>       <dbl> <dbl> <date>                         <dbl>
## 1       1001 North  Widget C        5    30 2024-01-01                       150
## 2       1002 South  Widget C       10    30 2024-01-02                       300
## 3       1003 East   Widget C       10    30 2024-01-03                       300
## 4       1004 North  Widget C       10    30 2024-01-04                       300
## 5       1005 North  Widget C        8    30 2024-01-05                       240
## 6       1006 South  Widget A        9    20 2024-01-06                       180
## # ℹ abbreviated name: ¹`Revenue <- (Quantity * Price)`

Insights

Revenue contribution by Product

week_2 %>% 
  group_by(Product) %>% 
  summarize(Revenue = mean(Quantity * Price)) %>% 
  arrange(desc(Revenue))%>%
  ggplot( aes(x = Product, y = Revenue)) + geom_col()

Widget C had the highest sales.

Regional Sales Distribution

week_2 %>% 
  group_by(Region) %>% 
  summarize(Revenue = sum(Quantity * Price)) %>% 
  arrange(desc(Revenue))%>%
  ggplot( aes(x = "", y = Revenue, fill = Region)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start = 0)

The eastern region had the most revenue with the west having the least

Time-Based Trends

week_2 %>% 
  mutate(Year = format(Date, "%Y")) %>% 
  group_by(Year) %>% 
  summarize(Revenue = sum(Quantity * Price)) %>% 
  ggplot(aes(Year, Revenue, group=1)) + 
  geom_line() + 
  theme(axis.text.x = element_text(angle=90))

week_2 %>% 
  group_by(Region, Quantity) %>% 
  summarize(Revenue = sum(Quantity * Price), .groups="drop")

## # A tibble: 40 × 3
##    Region Quantity Revenue
##    <chr>     <dbl>   <dbl>
##  1 East          1    2810
##  2 East          2    5130
##  3 East          3    8805
##  4 East          4   10900
##  5 East          5   14050
##  6 East          6   18360
##  7 East          7   19320
##  8 East          8   21960
##  9 East          9   26235
## 10 East         10   29550
## # ℹ 30 more rows

cor_test <- cor.test(week_2$Quantity, week_2$Price)
print(cor_test)

## 
##  Pearson's product-moment correlation
## 
## data:  week_2$Quantity and week_2$Price
## t = -0.5776, df = 4998, p-value = 0.5636
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03588101  0.01955387
## sample estimates:
##          cor 
## -0.008169848

numeric_data <- week_2[, sapply(week_2, is.numeric)] 

cor_matrix <- cor(numeric_data, use = "complete.obs")  # Handles missing values
print(cor_matrix)

##              CustomerID     Quantity        Price
## CustomerID  1.000000000  0.002395234 -0.021227919
## Quantity    0.002395234  1.000000000 -0.008169848
## Price      -0.021227919 -0.008169848  1.000000000

Customer behavior

week_2 %>% 
  mutate(OrderValue = Quantity * Price) %>% 
  summarize(AOV = mean(OrderValue))

## # A tibble: 1 × 1
##     AOV
##   <dbl>
## 1  120.

Outlier

week_2 %>% 
  filter(Quantity > quantile(Quantity, 0.99))

## # A tibble: 0 × 6
## # ℹ 6 variables: CustomerID <dbl>, Region <chr>, Product <chr>, Quantity <dbl>,
## #   Price <dbl>, Date <date>

Addition of customer demographics will highly enrich the analysis as below: Gender:- It would be easy to determine which gender prefers which product and from which region. This could help in determine which target customers to focus on. Age:- Likewise, by analysisng the ages of the customer, it would have helped to know which age group are receptive to which product and from which region. Thus, customer demographics could help customise the marketing strategy.

Week_2

Okal

2025-06-07