This is week 2 assignment where I did data analysis using R.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ggplot2)
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
library(dplyr)

##Loaded the data from excel to R

week_2 <- read_excel("week_2.xlsx")
colnames(week_2)
## [1] "CustomerID" "Region"     "Product"    "Quantity"   "Price"     
## [6] "Date"
head(week_2)
## # A tibble: 6 × 6
##   CustomerID Region Product  Quantity Price Date      
##        <dbl> <chr>  <chr>       <dbl> <dbl> <chr>     
## 1       1001 North  Widget C        5    30 2024-01-01
## 2       1002 South  Widget C       10    30 2024-01-02
## 3       1003 East   Widget C       10    30 2024-01-03
## 4       1004 North  Widget C       10    30 2024-01-04
## 5       1005 North  Widget C        8    30 2024-01-05
## 6       1006 South  Widget A        9    20 2024-01-06
str(week_2)
## tibble [5,000 × 6] (S3: tbl_df/tbl/data.frame)
##  $ CustomerID: num [1:5000] 1001 1002 1003 1004 1005 ...
##  $ Region    : chr [1:5000] "North" "South" "East" "North" ...
##  $ Product   : chr [1:5000] "Widget C" "Widget C" "Widget C" "Widget C" ...
##  $ Quantity  : num [1:5000] 5 10 10 10 8 9 2 5 7 6 ...
##  $ Price     : num [1:5000] 30 30 30 30 30 20 30 30 20 20 ...
##  $ Date      : chr [1:5000] "2024-01-01" "2024-01-02" "2024-01-03" "2024-01-04" ...

The data has 6 columns namely:

1.CustomerID - which is numerical 2.Region - has characters 3. Product - has characters 4. Quantity - has numeric data 5. Price - has numeric data 6. Date - has characters instead of date type.

Therefore column no. 6 needs to be converted to the correct datatype

class(week_2$Date)
## [1] "character"
week_2$Date <- ymd(week_2$Date)
class(week_2$Date)
## [1] "Date"

range of date

summary(week_2$Date)
##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2024-01-01" "2027-06-03" "2030-11-04" "2030-11-04" "2034-04-07" "2037-09-08"

The date range is from 2024-01-01 to 2037-09-08

Missing values

colSums(is.na(week_2))
## CustomerID     Region    Product   Quantity      Price       Date 
##          0          0          0          0          0          0

There are no missing values in any columns

Data Cleaning with dplyr

Removing missing values

week_2 <- na.omit(week_2)

Checking of columns with incorrect and unnecessary values

week2_unique <-lapply(week_2, unique)
sum(duplicated(week2_unique))
## [1] 0

There are no any unusual entries or duplicate rows.

Data Grouping and Summarizing

Grouping the data by Region and Product Sorting the summarized results in descending order of total revenue

grouped_wk2 <- week_2 %>% 
  group_by(Region, Product) %>%
  summarise(
    total_qty <-  sum(Quantity),
    total_rev <- sum(Quantity * Price),
    .groups = "drop"
  )
head(grouped_wk2)
## # A tibble: 6 × 4
##   Region Product  `total_qty <- sum(Quantity)` total_rev <- sum(Quantity * Pri…¹
##   <chr>  <chr>                           <dbl>                             <dbl>
## 1 East   Widget A                         2450                             49000
## 2 East   Widget B                         2290                             34350
## 3 East   Widget C                         2459                             73770
## 4 North  Widget A                         2345                             46900
## 5 North  Widget B                         2199                             32985
## 6 North  Widget C                         2349                             70470
## # ℹ abbreviated name: ¹​`total_rev <- sum(Quantity * Price)`

Arrange in descending order for total revenue

grouped_wk2 <- grouped_wk2%>%
  arrange(Region,desc(`total_rev <- sum(Quantity * Price)`))
head(grouped_wk2)
## # A tibble: 6 × 4
##   Region Product  `total_qty <- sum(Quantity)` total_rev <- sum(Quantity * Pri…¹
##   <chr>  <chr>                           <dbl>                             <dbl>
## 1 East   Widget C                         2459                             73770
## 2 East   Widget A                         2450                             49000
## 3 East   Widget B                         2290                             34350
## 4 North  Widget C                         2349                             70470
## 5 North  Widget A                         2345                             46900
## 6 North  Widget B                         2199                             32985
## # ℹ abbreviated name: ¹​`total_rev <- sum(Quantity * Price)`

sales with total revenue over 1000

filtered_group <- week_2 %>%
  mutate(Revenue <- (Quantity * Price))%>%
  filter(sum(Quantity * Price) > 1000) %>%
  ungroup()
head(filtered_group)
## # A tibble: 6 × 7
##   CustomerID Region Product  Quantity Price Date       Revenue <- (Quantity * …¹
##        <dbl> <chr>  <chr>       <dbl> <dbl> <date>                         <dbl>
## 1       1001 North  Widget C        5    30 2024-01-01                       150
## 2       1002 South  Widget C       10    30 2024-01-02                       300
## 3       1003 East   Widget C       10    30 2024-01-03                       300
## 4       1004 North  Widget C       10    30 2024-01-04                       300
## 5       1005 North  Widget C        8    30 2024-01-05                       240
## 6       1006 South  Widget A        9    20 2024-01-06                       180
## # ℹ abbreviated name: ¹​`Revenue <- (Quantity * Price)`

Insights

Revenue contribution by Product

week_2 %>% 
  group_by(Product) %>% 
  summarize(Revenue = mean(Quantity * Price)) %>% 
  arrange(desc(Revenue))%>%
  ggplot( aes(x = Product, y = Revenue)) + geom_col()

Widget C had the highest sales.

Regional Sales Distribution

week_2 %>% 
  group_by(Region) %>% 
  summarize(Revenue = sum(Quantity * Price)) %>% 
  arrange(desc(Revenue))%>%
  ggplot( aes(x = "", y = Revenue, fill = Region)) +
  geom_bar(stat = "identity", width = 1)+
  coord_polar("y", start = 0)

The eastern region had the most revenue with the west having the least