Exploring Retail Customer Data: A Practical Data Science Project
A retail company has collected information from customers who made online or in-store purchases. The dataset includes customer names, demographic information, province, profession, phone number, email address, credit card information, purchase amount, preferred colour, IP address, visit time, weekday, and date.
Your task is to use dplyr functions from the tidyverse to explore the dataset and answer the questions below.
The dataset customer_data.csv required for this task is available in your designated Google Drive Folder.
1. Load the required packages and read the dataset. Use janitor::clean_names() to clean the column names.
4. Create a smaller dataset containing only customer name, gender, age, province, profession, phone, email, credit card type, spending amount, weekday, and date. Name the new dataset customer_small.
# A tibble: 30,000 × 10
customer_name gender age province profession phone email cc_type weekday
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
1 Dr. Ray Morton Male 38 MB Health Th… (987… sebv… Solo Sunday
2 Miss Claudia Ro… Female 51 SK Health Th… (356… acu@… Visa Tuesday
3 Miss Harry Meyer Female 51 NS Political… (539… zuz@… Switch Wednes…
4 Miss Edith Gilb… Female 55 YT Transport… (984… hans… Maestro Saturd…
5 Dr. Lura Murphy Female 20 ON Statistic… (902… webe… Diners… Friday
6 Mrs. Andre McGu… Male 50 AB Communica… (770… to@h… Laser Thursd…
7 Mr. Cody Thomps… Female 62 PE Special E… (837… nebu… Solo Thursd…
8 Mr. Cory Duncan Male 36 NU Urban Pla… (529… ruce… Diners… Friday
9 Miss Dorothy Wa… Female 36 MB Veterinar… (279… cakp… Diners… Wednes…
10 Dr. Rose Tate Female 53 NT Software … (410… nafi… Discov… Saturd…
# ℹ 29,990 more rows
# ℹ 1 more variable: date <chr>
5. Create the following new variables in customer:
# A tibble: 4 × 22
full_name price_CAD spending_group prefix first last email gender age
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Roy Stanley 31.9 Medium Dr. Roy Stanley apiu… Female 39
2 Lora Kennedy 53.3 Medium Mr. Lora Kennedy denn… Female 51
3 Nell Richards 76.3 Medium Mrs. Nell Richar… dob@… Female 18
4 Don McDaniel 62.7 Medium Mrs. Don McDani… naf@… Female 37
# ℹ 13 more variables: company <chr>, profession <chr>, phone <chr>,
# postal <chr>, province <chr>, cc_no <dbl>, cc_exp <chr>, cc_type <chr>,
# fav_color <chr>, ip <chr>, weekday <chr>, ampm <chr>, date <chr>
8. Display the 10 highest-spending customers. Sort the data from highest to lowest spending.
# A tibble: 10 × 22
full_name price_CAD spending_group prefix first last email gender age
<chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Gregory Brown 100 High Mrs. Greg… Brown hav@… Female 31
2 Cody Christen… 100 High Mrs. Cody Chri… get@… Male 28
3 Lizzie Dixon 100 High Miss Lizz… Dixon goh@… Female 38
4 Jessie Lawren… 100.0 High Mr. Jess… Lawr… vavo… Female 58
5 Christopher E… 100.0 High Mrs. Chri… Estr… ekit… Male 65
6 Dollie Edwards 100.0 High Dr. Doll… Edwa… uriv… Female 40
7 Edward Murray 100.0 High Mrs. Edwa… Murr… zeg@… Female 51
8 Timothy Bass 100.0 High Mr. Timo… Bass jo@d… Female 23
9 Iva Morton 100.0 High Miss Iva Mort… vumj… Male 48
10 Ruth West 100.0 High Miss Ruth West ara@… Female 26
# ℹ 13 more variables: company <chr>, profession <chr>, phone <chr>,
# postal <chr>, province <chr>, cc_no <dbl>, cc_exp <chr>, cc_type <chr>,
# fav_color <chr>, ip <chr>, weekday <chr>, ampm <chr>, date <chr>
9. Find the minimum, maximum, mean, and median age of customers.
14. Find the number of customers by credit card type. Sort the result from highest to lowest frequency.
customer %>%count(cc_type, sort =TRUE)
# A tibble: 17 × 2
cc_type n
<chr> <int>
1 Solo 1867
2 Visa Electron 1815
3 American Express 1810
4 Switch 1806
5 Laser 1798
6 China UnionPay 1788
7 Diners Club enRoute 1758
8 Maestro 1756
9 JCB 1751
10 Diners Club United States & Canada 1746
11 Diners Club International 1745
12 Mastercard 1741
13 Discover Card 1739
14 Diners Club Carte Blanche 1732
15 Bankcard 1728
16 Visa 1721
17 InstaPayment 1699
15. Extract email providers, such as gmail.com or yahoo.com, and find the top 5 most common email providers. (The email provider is the part of the email address that comes after the “@” symbol)