Introduction

This document demonstrates how to load a CSV file into R, explore and transform the data, and then save it to and reload it from an RDS file, using readr and dplyr as part of a modern data analysis workflow.

Exercises

Exercise 1: Load the Data

Load the most_used_beauty_cosmetics_products_extended.csv dataset into an R data frame. Use read_csv() from the readr package for this.

# Load readr for read_csv() and dplyr for glimpse(), used in Exercise 2
library(readr)
library(dplyr)
cosmetics <- read_csv("_data/most_used_beauty_cosmetics_products_extended.csv")
## Rows: 15000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Product_Name, Brand, Category, Usage_Frequency, Product_Size, Skin...
## dbl  (3): Price_USD, Rating, Number_of_Reviews
## lgl  (1): Cruelty_Free
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
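
The message above suggests specifying the column types explicitly. The following is a minimal sketch of that idea, assuming readr is installed. It writes a tiny stand-in CSV to a temporary file (three rows and four columns copied from the output shown in Exercise 2) so that it runs without the real dataset. Reading Number_of_Reviews as an integer is an assumption made for illustration; readr guessed double for that column.

```r
library(readr)

# Tiny stand-in file; the path is a temp file, not the real _data/ location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Product_Name,Price_USD,Number_of_Reviews,Cruelty_Free",
  "Ultra Face Mask,67.85,686,FALSE",
  "Ultra Lipstick,116.43,5483,FALSE",
  "Ultra Serum,90.84,5039,TRUE"
), csv_path)

# Declaring the types up front silences the column specification message
# and makes the import fail loudly if a column does not match.
demo <- read_csv(
  csv_path,
  col_types = cols(
    Price_USD = col_double(),
    Number_of_Reviews = col_integer(), # assumption: counts fit in an integer
    Cruelty_Free = col_logical(),
    .default = col_character()         # every other column stays character
  )
)

sapply(demo, class)
```

The same `col_types` argument could be passed in the read_csv() call for the full file, with the remaining character columns covered by `.default`.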

Exercise 2: Explore the Data

Display the structure of the cosmetics data frame, get a statistical summary, check the unique values of the Product_Size column, and take a compact look at every column with glimpse() from dplyr.

# Structure, summary statistics, unique sizes, then a compact column overview
str(cosmetics)
## spc_tbl_ [15,000 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Product_Name     : chr [1:15000] "Ultra Face Mask" "Ultra Lipstick" "Ultra Serum" "Divine Serum" ...
##  $ Brand            : chr [1:15000] "Drunk Elephant" "Laura Mercier" "Natasha Denona" "Ilia Beauty" ...
##  $ Category         : chr [1:15000] "Blush" "Makeup Remover" "Highlighter" "Face Mask" ...
##  $ Usage_Frequency  : chr [1:15000] "Weekly" "Occasional" "Daily" "Occasional" ...
##  $ Price_USD        : num [1:15000] 67.8 116.4 90.8 55.2 140.6 ...
##  $ Rating           : num [1:15000] 1.4 4.2 1.6 3.2 1.7 3.2 2.5 4.3 3.3 4.4 ...
##  $ Number_of_Reviews: num [1:15000] 686 5483 5039 6202 297 ...
##  $ Product_Size     : chr [1:15000] "30ml" "250ml" "100ml" "250ml" ...
##  $ Skin_Type        : chr [1:15000] "Sensitive" "Dry" "Sensitive" "Normal" ...
##  $ Gender_Target    : chr [1:15000] "Female" "Unisex" "Male" "Male" ...
##  $ Packaging_Type   : chr [1:15000] "Tube" "Bottle" "Compact" "Tube" ...
##  $ Main_Ingredient  : chr [1:15000] "Retinol" "Shea Butter" "Aloe Vera" "Glycerin" ...
##  $ Cruelty_Free     : logi [1:15000] FALSE FALSE TRUE TRUE FALSE TRUE ...
##  $ Country_of_Origin: chr [1:15000] "Australia" "UK" "Italy" "South Korea" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Product_Name = col_character(),
##   ..   Brand = col_character(),
##   ..   Category = col_character(),
##   ..   Usage_Frequency = col_character(),
##   ..   Price_USD = col_double(),
##   ..   Rating = col_double(),
##   ..   Number_of_Reviews = col_double(),
##   ..   Product_Size = col_character(),
##   ..   Skin_Type = col_character(),
##   ..   Gender_Target = col_character(),
##   ..   Packaging_Type = col_character(),
##   ..   Main_Ingredient = col_character(),
##   ..   Cruelty_Free = col_logical(),
##   ..   Country_of_Origin = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(cosmetics)
##  Product_Name          Brand             Category         Usage_Frequency   
##  Length:15000       Length:15000       Length:15000       Length:15000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    Price_USD          Rating      Number_of_Reviews Product_Size      
##  Min.   : 10.00   Min.   :1.000   Min.   :   52     Length:15000      
##  1st Qu.: 45.48   1st Qu.:2.000   1st Qu.: 2562     Class :character  
##  Median : 80.04   Median :3.000   Median : 5002     Mode  :character  
##  Mean   : 80.13   Mean   :3.002   Mean   : 5014                       
##  3rd Qu.:114.76   3rd Qu.:4.000   3rd Qu.: 7497                       
##  Max.   :149.99   Max.   :5.000   Max.   :10000                       
##   Skin_Type         Gender_Target      Packaging_Type     Main_Ingredient   
##  Length:15000       Length:15000       Length:15000       Length:15000      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  Cruelty_Free    Country_of_Origin 
##  Mode :logical   Length:15000      
##  FALSE:7592      Class :character  
##  TRUE :7408      Mode  :character  
##                                    
##                                    
## 
unique(cosmetics$Product_Size)
## [1] "30ml"  "250ml" "100ml" "150ml" "200ml" "50ml"
glimpse(cosmetics)
## Rows: 15,000
## Columns: 14
## $ Product_Name      <chr> "Ultra Face Mask", "Ultra Lipstick", "Ultra Serum", …
## $ Brand             <chr> "Drunk Elephant", "Laura Mercier", "Natasha Denona",…
## $ Category          <chr> "Blush", "Makeup Remover", "Highlighter", "Face Mask…
## $ Usage_Frequency   <chr> "Weekly", "Occasional", "Daily", "Occasional", "Occa…
## $ Price_USD         <dbl> 67.85, 116.43, 90.84, 55.17, 140.56, 135.82, 148.99,…
## $ Rating            <dbl> 1.4, 4.2, 1.6, 3.2, 1.7, 3.2, 2.5, 4.3, 3.3, 4.4, 4.…
## $ Number_of_Reviews <dbl> 686, 5483, 5039, 6202, 297, 9405, 2423, 8032, 2468, …
## $ Product_Size      <chr> "30ml", "250ml", "100ml", "250ml", "100ml", "150ml",…
## $ Skin_Type         <chr> "Sensitive", "Dry", "Sensitive", "Normal", "Oily", "…
## $ Gender_Target     <chr> "Female", "Unisex", "Male", "Male", "Female", "Femal…
## $ Packaging_Type    <chr> "Tube", "Bottle", "Compact", "Tube", "Compact", "Com…
## $ Main_Ingredient   <chr> "Retinol", "Shea Butter", "Aloe Vera", "Glycerin", "…
## $ Cruelty_Free      <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, T…
## $ Country_of_Origin <chr> "Australia", "UK", "Italy", "South Korea", "Germany"…
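
Beyond unique(), dplyr verbs can summarise a categorical column directly. The sketch below assumes dplyr is installed and uses a four-row stand-in data frame whose values are copied from the first rows of the glimpse() output above; the object names `sample_rows`, `size_counts`, and `price_by_size` are illustrative only.

```r
library(dplyr)

# First four rows of three columns, copied from the glimpse() output.
sample_rows <- data.frame(
  Category = c("Blush", "Makeup Remover", "Highlighter", "Face Mask"),
  Price_USD = c(67.85, 116.43, 90.84, 55.17),
  Product_Size = c("30ml", "250ml", "100ml", "250ml")
)

# How often does each size occur? Most frequent first.
size_counts <- sample_rows %>%
  count(Product_Size, sort = TRUE)

# Average price and number of products per size.
price_by_size <- sample_rows %>%
  group_by(Product_Size) %>%
  summarise(mean_price = mean(Price_USD), n_products = n())

size_counts
price_by_size
```

Replacing `sample_rows` with `cosmetics` would apply the same two summaries to the full dataset.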

Exercise 3: Data Transformation - Product Size

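Transform the Product_Size column into an ordered factor whose levels run from the smallest volume (30ml) to the largest (250ml).

# List the levels by volume; the default would sort them alphabetically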
cosmetics$Product_Size <- factor(cosmetics$Product_Size,
    levels = c("30ml", "50ml", "100ml", "150ml", "200ml", "250ml"),
    ordered = TRUE
)
# Display the new column structure
levels(cosmetics$Product_Size)
## [1] "30ml"  "50ml"  "100ml" "150ml" "200ml" "250ml"
str(cosmetics$Product_Size)
##  Ord.factor w/ 6 levels "30ml"<"50ml"<..: 1 6 3 6 3 4 6 5 5 4 ...
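
To illustrate why the ordering matters, here is a small base R sketch using the six size labels from Exercise 2. Deriving the level order from the numeric part of the label is an alternative to typing the levels by hand, not the approach used above, and the names `volume_ml` and `size_levels` are introduced only for this example.

```r
sizes <- c("30ml", "250ml", "100ml", "150ml", "200ml", "50ml")

# Derive the level order from the numeric part of each label.
volume_ml <- as.numeric(sub("ml$", "", sizes))
size_levels <- sizes[order(volume_ml)]

ordered_sizes <- factor(sizes, levels = size_levels, ordered = TRUE)

sort(sizes)                        # character sort puts "100ml" before "30ml"
as.character(sort(ordered_sizes))  # factor sort follows the volume order
ordered_sizes >= "150ml"           # comparisons are defined for ordered factors
max(ordered_sizes)                 # largest size according to the level order
```

With an ordered factor, filtering such as `Product_Size >= "150ml"` and plots grouped by size follow the volume order instead of the alphabetical one.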

Exercise 4: Save Data to RDS

Save the transformed cosmetics data frame to an RDS file named cosmetics.rds in the _rds directory.

# saveRDS() does not create directories, so _rds must already exist
saveRDS(cosmetics, "_rds/cosmetics.rds")
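
Because saveRDS() fails when the target directory is missing, it can help to create the directory first. A minimal base R sketch follows; it uses a temporary directory as a stand-in for the project root and a two-row `example_df` in place of the cosmetics data, both of which are assumptions made so the example is self-contained.

```r
example_df <- data.frame(Price_USD = c(67.85, 116.43))

# A temp directory stands in for the project root here.
out_dir <- file.path(tempdir(), "_rds")
if (!dir.exists(out_dir)) {
  dir.create(out_dir, recursive = TRUE)
}

out_file <- file.path(out_dir, "cosmetics.rds")
saveRDS(example_df, out_file)

file.exists(out_file)
```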

Exercise 5: Load Data from RDS

Load the cosmetics.rds file back into R, but name the new object cosmetics_from_rds.

# Read the saved object back under a new name
cosmetics_from_rds <- readRDS("_rds/cosmetics.rds")

Exercise 6: Compare Data Frames

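Use the identical() function to compare the original cosmetics data frame with cosmetics_from_rds and check whether the round trip through RDS preserved it exactly. identical() also compares attributes, so a FALSE result does not by itself mean that any values changed.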

# Compare the in-memory object with the reloaded one
identical(cosmetics, cosmetics_from_rds)
## [1] FALSE
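
The FALSE result deserves a closer look. One plausible explanation, stated here as an assumption rather than something verified against this dataset, is the "problems" external-pointer attribute that read_csv() attached to the data frame (visible in the str() output in Exercise 2), which may not come back unchanged from an RDS file. The base R sketch below shows the general effect with a stand-in attribute and a hypothetical helper, `columns_identical()`, that compares only the column names and column contents.

```r
# Hypothetical helper: TRUE when both objects have the same column names
# and every pair of columns is identical (values, types, factor levels).
columns_identical <- function(a, b) {
  identical(names(a), names(b)) &&
    all(vapply(
      names(a),
      function(col) identical(a[[col]], b[[col]]),
      logical(1)
    ))
}

original <- data.frame(
  Price_USD = c(67.85, 116.43),
  Product_Size = factor(c("30ml", "250ml"),
    levels = c("30ml", "250ml"), ordered = TRUE
  )
)

# A plain data frame survives the RDS round trip unchanged.
rds_path <- tempfile(fileext = ".rds")
saveRDS(original, rds_path)
reloaded <- readRDS(rds_path)
identical(original, reloaded)

# Stand-in for extra metadata that differs between the two objects.
with_metadata <- original
attr(with_metadata, "note") <- "extra metadata"

identical(original, with_metadata)          # attributes differ
columns_identical(original, with_metadata)  # column data is the same
```

Applied to the exercise, `columns_identical(cosmetics, cosmetics_from_rds)` would indicate whether the columns themselves, including the ordered Product_Size factor, survived the round trip.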

Exercise 7: Data Visualization - Price Distribution

Create a histogram of the Price_USD column from cosmetics_from_rds to visualize its distribution.

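# Base R histogram with default breaks
hist(cosmetics_from_rds$Price_USD)

# ggplot2 version with a fixed bin width and explicit labels
library(ggplot2)
ggplot(cosmetics_from_rds, aes(x = Price_USD)) +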
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Product Prices (USD)",
    x = "Price (USD)",
    y = "Frequency"
  )
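
When only binwidth is given, ggplot2 by default centres bins on multiples of the bin width, so with binwidth = 10 the bins span 5 to 15, 15 to 25, and so on. Since the summary in Exercise 2 shows prices running from 10.00 to 149.99, aligning the bin edges to multiples of 10 may be easier to read. The sketch below assumes ggplot2 is installed and uses seven prices copied from the glimpse() output as stand-in data; `boundary = 10` is the only change to the histogram layer.

```r
library(ggplot2)

# Seven prices copied from the glimpse() output in Exercise 2.
prices <- data.frame(
  Price_USD = c(67.85, 116.43, 90.84, 55.17, 140.56, 135.82, 148.99)
)

p <- ggplot(prices, aes(x = Price_USD)) +
  geom_histogram(
    binwidth = 10,
    boundary = 10, # put bin edges on multiples of 10
    fill = "blue",
    color = "black"
  ) +
  labs(
    title = "Distribution of Product Prices (USD)",
    x = "Price (USD)",
    y = "Frequency"
  )

# Inspect the computed bins without drawing the plot.
bins <- layer_data(p)
bins[, c("xmin", "xmax", "count")]
```

Printing `p` draws the histogram; the `bins` table is a quick way to confirm where the edges fall.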
