This is week 2 assignment where I did data analysis using R.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)
library(ggplot2)
library(readxl)
## Warning: package 'readxl' was built under R version 4.4.3
library(dplyr)
##Loaded the data from excel to R
week_2 <- read_excel("week_2.xlsx")
colnames(week_2)
## [1] "CustomerID" "Region" "Product" "Quantity" "Price"
## [6] "Date"
head(week_2)
## # A tibble: 6 × 6
## CustomerID Region Product Quantity Price Date
## <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 1001 North Widget C 5 30 2024-01-01
## 2 1002 South Widget C 10 30 2024-01-02
## 3 1003 East Widget C 10 30 2024-01-03
## 4 1004 North Widget C 10 30 2024-01-04
## 5 1005 North Widget C 8 30 2024-01-05
## 6 1006 South Widget A 9 20 2024-01-06
str(week_2)
## tibble [5,000 × 6] (S3: tbl_df/tbl/data.frame)
## $ CustomerID: num [1:5000] 1001 1002 1003 1004 1005 ...
## $ Region : chr [1:5000] "North" "South" "East" "North" ...
## $ Product : chr [1:5000] "Widget C" "Widget C" "Widget C" "Widget C" ...
## $ Quantity : num [1:5000] 5 10 10 10 8 9 2 5 7 6 ...
## $ Price : num [1:5000] 30 30 30 30 30 20 30 30 20 20 ...
## $ Date : chr [1:5000] "2024-01-01" "2024-01-02" "2024-01-03" "2024-01-04" ...
1.CustomerID - which is numerical 2.Region - has characters 3. Product - has characters 4. Quantity - has numeric data 5. Price - has numeric data 6. Date - has characters instead of date type.
Therefore column no. 6 needs to be converted to the correct datatype
class(week_2$Date)
## [1] "character"
week_2$Date <- ymd(week_2$Date)
class(week_2$Date)
## [1] "Date"
summary(week_2$Date)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2024-01-01" "2027-06-03" "2030-11-04" "2030-11-04" "2034-04-07" "2037-09-08"
The date range is from 2024-01-01 to 2037-09-08
colSums(is.na(week_2))
## CustomerID Region Product Quantity Price Date
## 0 0 0 0 0 0
There are no missing values in any columns
Removing missing values
week_2 <- na.omit(week_2)
Checking of columns with incorrect and unnecessary values
week2_unique <-lapply(week_2, unique)
sum(duplicated(week2_unique))
## [1] 0
There are no any unusual entries or duplicate rows.
Grouping the data by Region and Product Sorting the summarized results in descending order of total revenue
grouped_wk2 <- week_2 %>%
group_by(Region, Product) %>%
summarise(
total_qty <- sum(Quantity),
total_rev <- sum(Quantity * Price),
.groups = "drop"
)
head(grouped_wk2)
## # A tibble: 6 × 4
## Region Product `total_qty <- sum(Quantity)` total_rev <- sum(Quantity * Pri…¹
## <chr> <chr> <dbl> <dbl>
## 1 East Widget A 2450 49000
## 2 East Widget B 2290 34350
## 3 East Widget C 2459 73770
## 4 North Widget A 2345 46900
## 5 North Widget B 2199 32985
## 6 North Widget C 2349 70470
## # ℹ abbreviated name: ¹`total_rev <- sum(Quantity * Price)`
Arrange in descending order for total revenue
grouped_wk2 <- grouped_wk2%>%
arrange(Region,desc(`total_rev <- sum(Quantity * Price)`))
head(grouped_wk2)
## # A tibble: 6 × 4
## Region Product `total_qty <- sum(Quantity)` total_rev <- sum(Quantity * Pri…¹
## <chr> <chr> <dbl> <dbl>
## 1 East Widget C 2459 73770
## 2 East Widget A 2450 49000
## 3 East Widget B 2290 34350
## 4 North Widget C 2349 70470
## 5 North Widget A 2345 46900
## 6 North Widget B 2199 32985
## # ℹ abbreviated name: ¹`total_rev <- sum(Quantity * Price)`
sales with total revenue over 1000
filtered_group <- week_2 %>%
mutate(Revenue <- (Quantity * Price))%>%
filter(sum(Quantity * Price) > 1000) %>%
ungroup()
head(filtered_group)
## # A tibble: 6 × 7
## CustomerID Region Product Quantity Price Date Revenue <- (Quantity * …¹
## <dbl> <chr> <chr> <dbl> <dbl> <date> <dbl>
## 1 1001 North Widget C 5 30 2024-01-01 150
## 2 1002 South Widget C 10 30 2024-01-02 300
## 3 1003 East Widget C 10 30 2024-01-03 300
## 4 1004 North Widget C 10 30 2024-01-04 300
## 5 1005 North Widget C 8 30 2024-01-05 240
## 6 1006 South Widget A 9 20 2024-01-06 180
## # ℹ abbreviated name: ¹`Revenue <- (Quantity * Price)`
Revenue contribution by Product
week_2 %>%
group_by(Product) %>%
summarize(Revenue = mean(Quantity * Price)) %>%
arrange(desc(Revenue))%>%
ggplot( aes(x = Product, y = Revenue)) + geom_col()
Widget C had the highest sales.
week_2 %>%
group_by(Region) %>%
summarize(Revenue = sum(Quantity * Price)) %>%
arrange(desc(Revenue))%>%
ggplot( aes(x = "", y = Revenue, fill = Region)) +
geom_bar(stat = "identity", width = 1)+
coord_polar("y", start = 0)
The eastern region had the most revenue with the west having the
least
week_2 %>%
mutate(Year = format(Date, "%Y")) %>%
group_by(Year) %>%
summarize(Revenue = sum(Quantity * Price)) %>%
ggplot(aes(Year, Revenue, group=1)) +
geom_line() +
theme(axis.text.x = element_text(angle=90))
week_2 %>%
group_by(Region, Quantity) %>%
summarize(Revenue = sum(Quantity * Price), .groups="drop")
## # A tibble: 40 × 3
## Region Quantity Revenue
## <chr> <dbl> <dbl>
## 1 East 1 2810
## 2 East 2 5130
## 3 East 3 8805
## 4 East 4 10900
## 5 East 5 14050
## 6 East 6 18360
## 7 East 7 19320
## 8 East 8 21960
## 9 East 9 26235
## 10 East 10 29550
## # ℹ 30 more rows
cor_test <- cor.test(week_2$Quantity, week_2$Price)
print(cor_test)
##
## Pearson's product-moment correlation
##
## data: week_2$Quantity and week_2$Price
## t = -0.5776, df = 4998, p-value = 0.5636
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03588101 0.01955387
## sample estimates:
## cor
## -0.008169848
numeric_data <- week_2[, sapply(week_2, is.numeric)]
cor_matrix <- cor(numeric_data, use = "complete.obs") # Handles missing values
print(cor_matrix)
## CustomerID Quantity Price
## CustomerID 1.000000000 0.002395234 -0.021227919
## Quantity 0.002395234 1.000000000 -0.008169848
## Price -0.021227919 -0.008169848 1.000000000
week_2 %>%
mutate(OrderValue = Quantity * Price) %>%
summarize(AOV = mean(OrderValue))
## # A tibble: 1 × 1
## AOV
## <dbl>
## 1 120.
Outlier
week_2 %>%
filter(Quantity > quantile(Quantity, 0.99))
## # A tibble: 0 × 6
## # ℹ 6 variables: CustomerID <dbl>, Region <chr>, Product <chr>, Quantity <dbl>,
## # Price <dbl>, Date <date>
Addition of customer demographics will highly enrich the analysis as below: Gender:- It would be easy to determine which gender prefers which product and from which region. This could help in determine which target customers to focus on. Age:- Likewise, by analysisng the ages of the customer, it would have helped to know which age group are receptive to which product and from which region. Thus, customer demographics could help customise the marketing strategy.