Data607 - Assignment 1 Approach

Author

Sinem K Moschos

Introduction

This report uses the Pens and Printers Product Sales dataset published on Kaggle, which contains sales data for printers and office supply products. Source: https://www.kaggle.com/datasets/lorenzovzquez/pens-and-printers-product-sales

Dataset Summary

This dataset includes product sales records for pens, printers and related items: 15,000 rows and 8 columns. Each row represents a sales transaction, while the columns record the week, the sales method, a customer ID, the number of units sold, the revenue, the customer's years with the company, the number of site visits and the state.

Motivation for Selection

I chose this dataset because it directly relates to my business interests. I am currently working on a business plan connected to the printing industry, and I want to apply data and business thinking across different areas of my life and work.

Planned Approach

For this assignment, I plan to first make the dataset available online by storing it in a public GitHub repository. I will then load the dataset into R from a web link instead of a local file path so that the work is reproducible. After loading the data, I will select a smaller set of columns that are most relevant for analysis. If the dataset has a clear target or outcome variable, I will make sure to include it. I will also rename columns to make them easier to understand and replace any abbreviated or coded values with more meaningful labels. The final result will be a clean, well-organized data frame that can be used for further analysis.
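The renaming and relabeling steps in this plan can be sketched on a small made-up example. This is only an illustration under stated assumptions: the column method_code and its codes "E", "C" and "EC" are hypothetical and are not taken from the actual dataset.

```r
library(tidyverse)

# Hypothetical toy data: method_code and its values are invented for illustration
toy <- tibble(
  nb_sold     = c(10, 15, 11),
  method_code = c("E", "EC", "C")
)

toy_clean <- toy %>%
  # Give the abbreviated column a more descriptive name
  rename(units_sold = nb_sold) %>%
  # Replace each code with a readable label; unknown codes become NA
  mutate(
    sales_method = case_when(
      method_code == "E"  ~ "Email",
      method_code == "C"  ~ "Call",
      method_code == "EC" ~ "Email + Call",
      TRUE                ~ NA_character_
    )
  ) %>%
  # Keep only the cleaned columns
  select(units_sold, sales_method)

toy_clean
```

Mapping unknown codes to NA rather than leaving them unchanged makes any unexpected value easy to spot later.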

Potential Data Challenges

While working with this dataset, I expect to face a few challenges. These may include missing values, unclear column names, or categorical values that are stored as codes instead of readable labels.
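One possible way to check for missing values is to count the NA entries in each column. The sketch below uses a small stand-in tibble (the name toy is an assumption) holding the first four sales_method and revenue values from the report's glimpse() output, not the full dataset.

```r
library(tidyverse)

# Small stand-in for the real data: first four values shown by glimpse()
toy <- tibble(
  sales_method = c("Email", "Email + Call", "Call", "Email"),
  revenue      = c(NA, 225.47, 52.55, NA)
)

# Count missing values in every column
missing_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

missing_counts
```

The same summarise() call could be applied to the full data frame once it is loaded.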

Conclusions

Once the data has been properly loaded and cleaned, the next step would be to explore the data and look for patterns.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Raw CSV hosted in a public GitHub repository so the load is reproducible
url <- "https://raw.githubusercontent.com/sinemkilicdere/Data607/refs/heads/main/data/product_sales.csv"

df <- read_csv(
  file = url,
  show_col_types = FALSE
)

glimpse(df)
Rows: 15,000
Columns: 8
$ week              <dbl> 2, 6, 5, 4, 3, 6, 4, 1, 5, 5, 3, 2, 5, 2, 5, 4, 2, 6…
$ sales_method      <chr> "Email", "Email + Call", "Call", "Email", "Email", "…
$ customer_id       <chr> "2e72d641-95ac-497b-bbf8-4861764a7097", "3998a98d-70…
$ nb_sold           <dbl> 10, 15, 11, 11, 9, 13, 11, 10, 11, 11, 9, 9, 11, 10,…
$ revenue           <dbl> NA, 225.47, 52.55, NA, 90.49, 65.01, 113.38, 99.94, …
$ years_as_customer <dbl> 0, 1, 6, 3, 0, 10, 9, 1, 10, 7, 4, 2, 2, 1, 1, 2, 6,…
$ nb_site_visits    <dbl> 24, 28, 26, 25, 28, 24, 28, 22, 31, 23, 28, 23, 30, …
$ state             <chr> "Arizona", "Kansas", "Wisconsin", "Indiana", "Illino…
# Keep seven of the eight columns; customer_id is left out
sales_df <- df %>%
  select(
    week,
    sales_method,
    nb_sold,
    revenue,
    years_as_customer,
    nb_site_visits,
    state
  )

glimpse(sales_df)
Rows: 15,000
Columns: 7
$ week              <dbl> 2, 6, 5, 4, 3, 6, 4, 1, 5, 5, 3, 2, 5, 2, 5, 4, 2, 6…
$ sales_method      <chr> "Email", "Email + Call", "Call", "Email", "Email", "…
$ nb_sold           <dbl> 10, 15, 11, 11, 9, 13, 11, 10, 11, 11, 9, 9, 11, 10,…
$ revenue           <dbl> NA, 225.47, 52.55, NA, 90.49, 65.01, 113.38, 99.94, …
$ years_as_customer <dbl> 0, 1, 6, 3, 0, 10, 9, 1, 10, 7, 4, 2, 2, 1, 1, 2, 6,…
$ nb_site_visits    <dbl> 24, 28, 26, 25, 28, 24, 28, 22, 31, 23, 28, 23, 30, …
$ state             <chr> "Arizona", "Kansas", "Wisconsin", "Indiana", "Illino…
# Rename abbreviated columns to more descriptive names (new_name = old_name)
sales_df_clean <- sales_df %>%
  rename(
    units_sold = nb_sold,
    site_visits = nb_site_visits,
    customer_years = years_as_customer
  )

glimpse(sales_df_clean)
Rows: 15,000
Columns: 7
$ week           <dbl> 2, 6, 5, 4, 3, 6, 4, 1, 5, 5, 3, 2, 5, 2, 5, 4, 2, 6, 1…
$ sales_method   <chr> "Email", "Email + Call", "Call", "Email", "Email", "Cal…
$ units_sold     <dbl> 10, 15, 11, 11, 9, 13, 11, 10, 11, 11, 9, 9, 11, 10, 10…
$ revenue        <dbl> NA, 225.47, 52.55, NA, 90.49, 65.01, 113.38, 99.94, 108…
$ customer_years <dbl> 0, 1, 6, 3, 0, 10, 9, 1, 10, 7, 4, 2, 2, 1, 1, 2, 6, 0,…
$ site_visits    <dbl> 24, 28, 26, 25, 28, 24, 28, 22, 31, 23, 28, 23, 30, 28,…
$ state          <chr> "Arizona", "Kansas", "Wisconsin", "Indiana", "Illinois"…
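
As one possible next step toward exploring the cleaned data, revenue could be summarised by sales method. This is a minimal sketch, not part of the original analysis: the tibble toy_sales is a stand-in built from the first five rows shown in the glimpse() output above, and the summary column names are assumptions.

```r
library(tidyverse)

# Stand-in for sales_df_clean: first five rows from the glimpse() output
toy_sales <- tibble(
  sales_method = c("Email", "Email + Call", "Call", "Email", "Email"),
  units_sold   = c(10, 15, 11, 11, 9),
  revenue      = c(NA, 225.47, 52.55, NA, 90.49)
)

# Number of sales and average revenue per sales method, ignoring missing revenue
method_summary <- toy_sales %>%
  group_by(sales_method) %>%
  summarise(
    n_sales      = n(),
    mean_revenue = mean(revenue, na.rm = TRUE),
    .groups      = "drop"
  )

method_summary
```

Using na.rm = TRUE skips rows with missing revenue; whether to drop or impute those rows in the full analysis is a separate decision.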