Data Preparation

First we load the libraries.

library(janitor)
## 
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(lubridate)

Next we load the dataset.

sales <- read_csv("~/Grad School/Advanced Data Mining/R Studio Data/homework/LaptopSalesJan.csv")
## Rows: 7952 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (5): Date, CustomerPostcode, StorePostcode, IntegratedWireless, Bundled...
## dbl (10): Configuration, RetailPrice, ScreenSize, BatteryLife, RAM, Processo...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
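
As the messages above suggest, spec() retrieves the full column specification that readr guessed, and show_col_types = FALSE silences the message. A minimal sketch (optional; re-reading the file is otherwise identical to the load above):

spec(sales)   # full column specification inferred by readr
# or, to quiet the message if we re-run the read:
# sales <- read_csv("~/Grad School/Advanced Data Mining/R Studio Data/homework/LaptopSalesJan.csv",
#                   show_col_types = FALSE)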

Let’s explore the dataset a bit. We can see there are no missing values, 7,952 rows, and 15 columns: 5 character (categorical) and 10 numeric.

#glimpse(sales)
skim(sales)
Data summary
Name sales
Number of rows 7952
Number of columns 15
_______________________
Column type frequency:
character 5
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Date 0 1 13 15 0 7299 0
CustomerPostcode 0 1 6 8 0 834 0
StorePostcode 0 1 6 8 0 15 0
IntegratedWireless 0 1 2 3 0 2 0
BundledApplications 0 1 2 3 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Configuration 0 1 207.21 117.82 1.0 77.00 209.50 315.00 368.00 ▅▁▆▁▇
RetailPrice 0 1 487.94 61.53 300.0 455.00 490.00 525.00 665.00 ▁▂▇▃▁
ScreenSize 0 1 15.00 0.00 15.0 15.00 15.00 15.00 15.00 ▁▁▇▁▁
BatteryLife 0 1 5.14 0.82 4.0 4.00 5.00 6.00 6.00 ▅▁▆▁▇
RAM 0 1 1.55 0.50 1.0 1.00 2.00 2.00 2.00 ▆▁▁▁▇
ProcessorSpeeds 0 1 1.76 0.25 1.5 1.50 2.00 2.00 2.00 ▇▁▁▁▇
HDSize 0 1 150.38 102.50 40.0 80.00 120.00 300.00 300.00 ▇▅▁▁▆
customerX 0 1 530867.80 4414.48 512253.0 529207.75 531150.50 533130.00 549065.00 ▁▁▇▂▁
storeX 0 1 530747.78 4159.50 517917.0 528924.00 529902.00 534057.00 541428.00 ▁▂▇▅▁
CustomerStoreDistance 0 1 3679.88 2068.91 0.0 2422.24 3382.46 4346.36 19892.14 ▇▃▁▁▁

Let’s get into the questions!

Exploratory Analysis

1. At what prices are the laptops selling (hint: use a histogram)? Use the chart to comment on the range of prices and the most common price range.

The retail prices range from $300 to just under $700. The distribution is roughly bell-shaped, with the most common prices falling between about $450 and $550.

# histogram of retail prices with 15 bins
ggplot(data = sales, mapping = aes(x = RetailPrice)) +
    geom_histogram(fill = "red", color = "white", bins = 15)
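
To put a number behind the “most common price range” comment, one option (a quick sketch, not required by the question) is to bin RetailPrice into $50 intervals and count sales per bin:

sales %>%
    count(price_bin = cut(RetailPrice, breaks = seq(300, 700, by = 50),
                          include.lowest = TRUE)) %>%
    arrange(desc(n))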

2. What were the dates when Acell had its minimum and maximum sales in terms of counts of laptops? Overall, which HDSize was most popular throughout January, and which HDSize was least popular of the lot?

Acell had its lowest sales day on January 14th, with $108,725 in retail sales, and its highest sales day on January 31st, with $143,975.

The 300 HDSize was the most popular throughout January, and the 40 was the least popular.

# parse the date-time strings and pull out the day of the month
sales$New_Date <- mdy_hm(sales$Date)
sales$Day <- day(sales$New_Date)

# total retail sales per day, sorted from lowest to highest
daily_sales <- sales %>%
    select(Date, Day, RetailPrice) %>%
    group_by(Day) %>%
    summarize(total_sales = sum(RetailPrice)) %>%
    arrange(total_sales)

ggplot(data = daily_sales, mapping = aes(x = Day, y = total_sales)) + geom_line()
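
Note that daily_sales above measures total retail dollars per day; since the question asks about counts of laptops, a quick count-based check (a sketch reusing the Day column created above) would be:

daily_counts <- sales %>%
    count(Day, name = "laptops_sold")            # laptops sold per day of January

daily_counts %>% slice_min(laptops_sold, n = 1)  # slowest day by unit count
daily_counts %>% slice_max(laptops_sold, n = 1)  # busiest day by unit count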

# total retail sales by hard-drive size
sales %>%
    select(HDSize, RetailPrice) %>%
    group_by(HDSize) %>%
    summarize(total_sales = sum(RetailPrice))
## # A tibble: 4 × 2
##   HDSize total_sales
##    <dbl>       <dbl>
## 1     40      649710
## 2     80      901140
## 3    120     1049880
## 4    300     1279335
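
The table above ranks HDSize by revenue; a count-based view of popularity (a sketch using the same column) is just as easy:

sales %>%
    count(HDSize, sort = TRUE)   # laptops sold per hard-drive size, most popular first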

3. Does retail price correlate with customer store distance? (Hint: use scatter plots.) Use the chart to additionally answer: are there any outliers in the retail price or customer store distance? Would you include customer store distance as a predictor of retail price?

Looking at the scatter plot, there seems to be a slight positive relationship between customer store distance and retail price: among sales above $500, a greater proportion of customers traveled more than 5,000 to reach the store than among sales below $500.

There are outliers in customer store distance, with a handful of customers traveling nearly 20,000 (the maximum in the data is 19,892) to reach the store. These outliers spend average to above-average amounts.

I would include customer store distance as a predictor of retail price.

ggplot(data = sales, mapping = aes(x = CustomerStoreDistance, y = RetailPrice)) + geom_point() 
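
As a numeric complement to the scatter plot, we can compute the Pearson correlation between the two variables (a sketch; how strong a correlation needs to be to matter is a judgment call):

cor(sales$CustomerStoreDistance, sales$RetailPrice)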

4. Do the median values of the distance traveled by the customers vary based on the day of the week? On which days do you see the most outliers? (Hint: use a boxplot.) Use the chart to additionally answer: are customers willing to travel more than 20K when the applications are bundled?

It does not seem like the median values of the distances traveled by customers vary much based on the day of the week. We do see the largest number of outliers on Sunday and Thursday.

While the median barely changes, there is an increase in extreme outliers, with customers traveling up to 20K, when applications are bundled.

# New_Date was already parsed above; re-running the parse is harmless
sales$New_Date <- mdy_hm(sales$Date)

# create a new column that gives the day of the week
sales$Weekday <- wday(sales$New_Date, label = TRUE, abbr = FALSE)

# boxplots of distance traveled, with weekdays ordered by median distance
ggplot(data = sales,
       mapping = aes(x = reorder(Weekday, CustomerStoreDistance, FUN = median),
                     y = CustomerStoreDistance)) +
    geom_boxplot() +
    labs(x = "Weekday")
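
A numeric check of the claim that the medians barely vary (a sketch reusing the Weekday column created above; the 10,000 cutoff for “long trips” is an arbitrary choice, not the boxplot’s IQR rule):

sales %>%
    group_by(Weekday) %>%
    summarize(median_distance = median(CustomerStoreDistance),
              trips_over_10k  = sum(CustomerStoreDistance > 10000))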

Customer Store Distance by Bundled Applications:

ggplot(data = sales, mapping = aes(x = BundledApplications, y = CustomerStoreDistance)) + geom_boxplot()
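
To address the “more than 20K” part directly, a sketch that reports the maximum distance and the number of long trips per bundling group (again using an arbitrary 10,000 cutoff):

sales %>%
    group_by(BundledApplications) %>%
    summarize(max_distance   = max(CustomerStoreDistance),
              trips_over_10k = sum(CustomerStoreDistance > 10000))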

5. Which store(s) attract customers who are willing to travel more? Create a new variable using the case_when command to bin the store distance variable into the following categories: “Under 5000”, “5001-10000”, “>10000”. (Hint: use a stacked bar chart.)

SW18 1NN, SW12 9HD, SE8 3JD, and E2 0RY were the stores that attracted the most customers traveling over 5,000, with SW18 1NN standing out as the store that attracted the most customers traveling over 10,000.

# bin customer store distance into the three requested categories
sales <- sales %>%
    mutate(
        store_dist = case_when(
            CustomerStoreDistance < 5000 ~ "Under 5000",
            CustomerStoreDistance >= 5000 & CustomerStoreDistance <= 10000 ~ "5000-10000",
            CustomerStoreDistance > 10000 ~ ">10000"
        ),
        # order the bins so the legend reads from nearest to farthest
        store_dist = factor(store_dist, levels = c("Under 5000", "5000-10000", ">10000"))
    )

# stacked bar chart of sales per store, colored by distance bin
ggplot(data = sales, mapping = aes(x = StorePostcode, fill = store_dist)) +
    geom_bar() +
    coord_flip()
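
Because the stores differ in total sales volume, a proportion-filled version of the same chart (a sketch; identical variables, only the bar position changes) can make the distance-bin comparison across stores easier:

ggplot(data = sales, mapping = aes(x = StorePostcode, fill = store_dist)) +
    geom_bar(position = "fill") +
    coord_flip() +
    labs(y = "proportion of sales")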