Data 101 Midterm

Author

Asher Scott

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
setwd("/Users/asherscott/Desktop/Data 110")
Groceries <- read.csv("SS.csv")
sum(is.na(Groceries))
[1] 0
summary(Groceries)
 Customer.ID             Age           Gender          Annual.Income..k.
 Length:5000        Min.   :18.00   Length:5000        Min.   : -3.00   
 Class :character   1st Qu.:30.00   Class :character   1st Qu.: 45.00   
 Mode  :character   Median :43.00   Mode  :character   Median : 59.00   
                    Mean   :43.17                      Mean   : 59.39   
                    3rd Qu.:56.00                      3rd Qu.: 73.00   
                    Max.   :69.00                      Max.   :123.00   
 Purchase.Amount.... Product.Category   Purchase.Date      Payment.Method    
 Min.   : 25.51      Length:5000        Length:5000        Length:5000       
 1st Qu.:166.54      Class :character   Class :character   Class :character  
 Median :199.07      Mode  :character   Mode  :character   Mode  :character  
 Mean   :199.55                                                              
 3rd Qu.:232.80                                                              
 Max.   :365.88                                                              
 Loyalty.Program.Member   Location        
 Length:5000            Length:5000       
 Class :character       Class :character  
 Mode  :character       Mode  :character  
                                          
                                          
                                          

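The summary above reports a minimum Annual Income of -3, which the sum(is.na()) check would not catch. As a minimal sketch, assuming negative incomes are data-entry errors (an assumption on my part, the data does not confirm it), a check like the following could count those rows and set them aside. The toy data frame and the names n_negative, bad_rows, and clean are made up for illustration; only the column names come from the dataset.

```r
# Hypothetical toy data that reuses the dataset's column names
toy <- data.frame(
  Customer.ID = c("C1", "C2", "C3", "C4"),
  Annual.Income..k. = c(45, -3, 59, 73)
)

# Count rows with a negative income
n_negative <- sum(toy$Annual.Income..k. < 0)

# Keep the suspicious rows for inspection rather than silently dropping them
bad_rows <- toy[toy$Annual.Income..k. < 0, ]

# Assumption: negative incomes are errors, so exclude them from the analysis
clean <- toy[toy$Annual.Income..k. >= 0, ]

print(n_negative)
print(bad_rows)
```

Whether to drop, correct, or keep such rows depends on how the data was collected, so this is only one possible way to handle it.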
For both Annual Income and Purchase Amount, the median and the mean were nearly identical, which suggests roughly symmetric distributions. The standard deviation for Annual Income (about 20, roughly a third of the mean) reflects the varying economic statuses of the shoppers.

mean(Groceries$Annual.Income..k.)
[1] 59.39
median(Groceries$Annual.Income..k.)
[1] 59
sd(Groceries$Annual.Income..k.)
[1] 20.0141
mean(Groceries$Purchase.Amount....)
[1] 199.5522
median(Groceries$Purchase.Amount....)
[1] 199.075
sd(Groceries$Purchase.Amount....)
[1] 48.65077
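
To judge whether a standard deviation is low or high, it can help to compare it to the mean. Here is a quick sketch that uses only the values printed above; the variable names are my own.

```r
# Values copied from the console output above
income_mean <- 59.39
income_sd <- 20.0141
purchase_mean <- 199.5522
purchase_sd <- 48.65077

# Coefficient of variation: standard deviation as a fraction of the mean
cv_income <- income_sd / income_mean
cv_purchase <- purchase_sd / purchase_mean

print(round(c(income = cv_income, purchase = cv_purchase), 3))
```

By this measure, income varies by roughly a third of its mean and purchase amount by roughly a quarter of its mean.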

One thing I noticed is that all the categories have roughly the same spending amounts. Within that narrow range, people spent the most on beauty and the least on books. This may be because beauty products tend to be expensive while books are generally cheap.

ggplot(Groceries, aes(x = Product.Category, y = Purchase.Amount...., fill = Product.Category)) +
    geom_boxplot() +
    scale_fill_manual(values = c("Beauty" = "orange", 
                                 "Books" = "blue", 
                                 "Clothing" = "green",
                                 "Electronics" = "yellow",
                                 "Home & Kitchen" = "purple",
                                 "Sports" = "red",
                                 "Toys" = "pink")) + 
    labs(title = "Purchase Power") +
    theme_minimal()  
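
A boxplot makes it hard to rank categories that sit close together, so a table of medians can back up the comparison. This is a rough sketch on made-up numbers, not the real results; the toy data frame and the names category_medians, highest, and lowest are hypothetical, and only the column names come from the dataset.

```r
# Hypothetical toy data that reuses the dataset's column names
toy <- data.frame(
  Product.Category = c("Beauty", "Beauty", "Beauty",
                       "Books", "Books", "Books",
                       "Toys", "Toys", "Toys"),
  Purchase.Amount.... = c(210, 190, 230, 180, 200, 170, 195, 205, 185)
)

# Median purchase amount per category
category_medians <- tapply(toy$Purchase.Amount...., toy$Product.Category, median)

# Categories with the highest and lowest median
highest <- names(which.max(category_medians))
lowest <- names(which.min(category_medians))

print(sort(category_medians, decreasing = TRUE))
print(c(highest = highest, lowest = lowest))
```

Running the same tapply() call on Groceries instead of toy would give the actual ranking.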

GRAPH 1

ggplot(Groceries, mapping = aes(x = Age, y = Purchase.Amount...., fill = Loyalty.Program.Member)) +
  geom_col() + 
  scale_fill_manual(values = c("Yes" = "green", "No" = "orange")) +  
  labs(title = "Purchase Amount by Age & Loyalty", 
       y = "Purchase Amount", 
       x = "Age") +  
  theme_minimal()

GRAPH 2

ggplot(Groceries, aes(x = Age, y = Purchase.Amount...., fill = Gender)) +
  geom_area(color = "black", linewidth = 0.7) +  
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Purchase power by Age and Gender",
  x = "Age",
  y = "Purchase Amount") +
  theme_dark()

GRAPH 3

ggplot(data = Groceries, mapping = aes(x = Purchase.Amount...., fill = Payment.Method)) +
  geom_histogram(bins = 30) +
  ggtitle("Purchase Amount by Payment Method") +
  # fill (not color) is the mapped aesthetic, so the manual scale must be a fill scale;
  # assumes these names match the values in Payment.Method
  scale_fill_manual(values = c("Cash" = "green", 
                               "Debit Card" = "yellow", 
                               "Credit Card" = "lightblue", 
                               "Online Payment" = "red"))

My key finding from the analysis was that age did not affect the purchase amount much. In graph 1 we can see that spending patterns are generally consistent across ages 18-69. The major differences are between genders and between loyalty members and non-members. According to graph 2, females spend noticeably more than their male and non-binary counterparts. Looking back at graph 1, the other distinguishable difference came from comparing loyalty program membership: it showed that non-members had almost twice the overall spending of loyalty members. However, I suspect this is because there are simply far more non-members than loyalty members. Finally, in graph 3 I compared the purchase amount to the payment method and found that the majority of people still used cash, while online payment was used the least.
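
One way to check the suspicion about group sizes is to put the number of purchases, the total spend, and the mean spend side by side for each loyalty group. The sketch below uses made-up numbers purely to show the idea; the toy data frame and the by_loyalty table are hypothetical, and only the column names come from the dataset.

```r
# Hypothetical toy data: four non-member purchases and two member purchases
toy <- data.frame(
  Loyalty.Program.Member = c("No", "No", "No", "No", "Yes", "Yes"),
  Purchase.Amount.... = c(100, 200, 150, 150, 140, 160)
)

# Count, total, and mean per group (groups come out in the same sorted order)
n_purchases <- table(toy$Loyalty.Program.Member)
total_spend <- tapply(toy$Purchase.Amount...., toy$Loyalty.Program.Member, sum)
mean_spend <- tapply(toy$Purchase.Amount...., toy$Loyalty.Program.Member, mean)

by_loyalty <- data.frame(
  member = names(total_spend),
  n_purchases = as.vector(n_purchases),
  total_spend = as.vector(total_spend),
  mean_spend = as.vector(mean_spend)
)

print(by_loyalty)
```

In the toy data, non-members have twice the total spend but the same mean spend as members, which is the pattern I suspect is behind graph 1. Running the same steps on Groceries would show whether that holds for the real data.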