This document demonstrates how to load, explore, and transform data from
a CSV file in R, then save it as an RDS file and load it back, with a focus
on tidyverse tools such as readr and dplyr for modern data analysis workflows.
Load the most_used_beauty_cosmetics_products_extended.csv dataset
into an R data frame using read_csv() from the readr package.
# Load the tidyverse (readr, dplyr, ggplot2), then read the CSV into a tibble
library(tidyverse)
cosmetics <- read_csv("_data/most_used_beauty_cosmetics_products_extended.csv")
## Rows: 15000 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Product_Name, Brand, Category, Usage_Frequency, Product_Size, Skin...
## dbl (3): Price_USD, Rating, Number_of_Reviews
## lgl (1): Cruelty_Free
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
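The column-type message above goes away when the types are declared up front, which also guards against readr guessing a type differently on a later version of the file. A minimal sketch, assuming readr is installed: the two-row file written to a temporary path is a stand-in built from the first rows shown in this document, not the real dataset, and only three of the 14 columns are included.

```r
library(readr)

# Hypothetical stand-in file: three columns, two rows taken from the output above.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Product_Name,Price_USD,Cruelty_Free",
    "Ultra Face Mask,67.85,FALSE",
    "Ultra Lipstick,116.43,FALSE"
  ),
  csv_path
)

# Declare each column's type explicitly instead of relying on guessing.
toy <- read_csv(
  csv_path,
  col_types = cols(
    Product_Name = col_character(),
    Price_USD = col_double(),
    Cruelty_Free = col_logical()
  )
)

toy
```

Passing `show_col_types = FALSE` instead would silence the message while keeping the guessed types.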
Display the structure of the cosmetics data frame, get a
statistical summary, and check the unique values of the
Product_Size column. glimpse() from dplyr gives a more compact
view of the structure than str().
# Inspect the structure, summary statistics, and distinct product sizes
str(cosmetics)
## spc_tbl_ [15,000 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Product_Name : chr [1:15000] "Ultra Face Mask" "Ultra Lipstick" "Ultra Serum" "Divine Serum" ...
## $ Brand : chr [1:15000] "Drunk Elephant" "Laura Mercier" "Natasha Denona" "Ilia Beauty" ...
## $ Category : chr [1:15000] "Blush" "Makeup Remover" "Highlighter" "Face Mask" ...
## $ Usage_Frequency : chr [1:15000] "Weekly" "Occasional" "Daily" "Occasional" ...
## $ Price_USD : num [1:15000] 67.8 116.4 90.8 55.2 140.6 ...
## $ Rating : num [1:15000] 1.4 4.2 1.6 3.2 1.7 3.2 2.5 4.3 3.3 4.4 ...
## $ Number_of_Reviews: num [1:15000] 686 5483 5039 6202 297 ...
## $ Product_Size : chr [1:15000] "30ml" "250ml" "100ml" "250ml" ...
## $ Skin_Type : chr [1:15000] "Sensitive" "Dry" "Sensitive" "Normal" ...
## $ Gender_Target : chr [1:15000] "Female" "Unisex" "Male" "Male" ...
## $ Packaging_Type : chr [1:15000] "Tube" "Bottle" "Compact" "Tube" ...
## $ Main_Ingredient : chr [1:15000] "Retinol" "Shea Butter" "Aloe Vera" "Glycerin" ...
## $ Cruelty_Free : logi [1:15000] FALSE FALSE TRUE TRUE FALSE TRUE ...
## $ Country_of_Origin: chr [1:15000] "Australia" "UK" "Italy" "South Korea" ...
## - attr(*, "spec")=
## .. cols(
## .. Product_Name = col_character(),
## .. Brand = col_character(),
## .. Category = col_character(),
## .. Usage_Frequency = col_character(),
## .. Price_USD = col_double(),
## .. Rating = col_double(),
## .. Number_of_Reviews = col_double(),
## .. Product_Size = col_character(),
## .. Skin_Type = col_character(),
## .. Gender_Target = col_character(),
## .. Packaging_Type = col_character(),
## .. Main_Ingredient = col_character(),
## .. Cruelty_Free = col_logical(),
## .. Country_of_Origin = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(cosmetics)
## Product_Name Brand Category Usage_Frequency
## Length:15000 Length:15000 Length:15000 Length:15000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Price_USD Rating Number_of_Reviews Product_Size
## Min. : 10.00 Min. :1.000 Min. : 52 Length:15000
## 1st Qu.: 45.48 1st Qu.:2.000 1st Qu.: 2562 Class :character
## Median : 80.04 Median :3.000 Median : 5002 Mode :character
## Mean : 80.13 Mean :3.002 Mean : 5014
## 3rd Qu.:114.76 3rd Qu.:4.000 3rd Qu.: 7497
## Max. :149.99 Max. :5.000 Max. :10000
## Skin_Type Gender_Target Packaging_Type Main_Ingredient
## Length:15000 Length:15000 Length:15000 Length:15000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Cruelty_Free Country_of_Origin
## Mode :logical Length:15000
## FALSE:7592 Class :character
## TRUE :7408 Mode :character
##
##
##
unique(cosmetics$Product_Size)
## [1] "30ml" "250ml" "100ml" "150ml" "200ml" "50ml"
glimpse(cosmetics)
## Rows: 15,000
## Columns: 14
## $ Product_Name <chr> "Ultra Face Mask", "Ultra Lipstick", "Ultra Serum", …
## $ Brand <chr> "Drunk Elephant", "Laura Mercier", "Natasha Denona",…
## $ Category <chr> "Blush", "Makeup Remover", "Highlighter", "Face Mask…
## $ Usage_Frequency <chr> "Weekly", "Occasional", "Daily", "Occasional", "Occa…
## $ Price_USD <dbl> 67.85, 116.43, 90.84, 55.17, 140.56, 135.82, 148.99,…
## $ Rating <dbl> 1.4, 4.2, 1.6, 3.2, 1.7, 3.2, 2.5, 4.3, 3.3, 4.4, 4.…
## $ Number_of_Reviews <dbl> 686, 5483, 5039, 6202, 297, 9405, 2423, 8032, 2468, …
## $ Product_Size <chr> "30ml", "250ml", "100ml", "250ml", "100ml", "150ml",…
## $ Skin_Type <chr> "Sensitive", "Dry", "Sensitive", "Normal", "Oily", "…
## $ Gender_Target <chr> "Female", "Unisex", "Male", "Male", "Female", "Femal…
## $ Packaging_Type <chr> "Tube", "Bottle", "Compact", "Tube", "Compact", "Com…
## $ Main_Ingredient <chr> "Retinol", "Shea Butter", "Aloe Vera", "Glycerin", "…
## $ Cruelty_Free <lgl> FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, T…
## $ Country_of_Origin <chr> "Australia", "UK", "Italy", "South Korea", "Germany"…
Transform the Product_Size column into an ordered factor
with appropriate levels.
# Convert Product_Size to an ordered factor, from smallest to largest volume
cosmetics$Product_Size <- factor(cosmetics$Product_Size,
  levels = c("30ml", "50ml", "100ml", "150ml", "200ml", "250ml"),
  ordered = TRUE
)
# Display the new column structure
levels(cosmetics$Product_Size)
## [1] "30ml" "50ml" "100ml" "150ml" "200ml" "250ml"
str(cosmetics$Product_Size)
## Ord.factor w/ 6 levels "30ml"<"50ml"<..: 1 6 3 6 3 4 6 5 5 4 ...
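The practical effect of the ordering shows up when sorting or comparing sizes. The sketch below uses a small hand-picked sample of the sizes seen above rather than the full column, and it derives the level order from the numeric part of each label instead of typing it by hand; the variable names are illustrative.

```r
# A small sample of the size labels seen above, as plain strings.
sizes <- c("30ml", "250ml", "100ml", "50ml")

# Derive the level order from the numeric part of each label.
size_levels <- unique(sizes[order(as.numeric(sub("ml$", "", sizes)))])

sizes_ord <- factor(sizes, levels = size_levels, ordered = TRUE)

# Sorting now follows volume rather than alphabetical order.
sorted_sizes <- as.character(sort(sizes_ord))
sorted_sizes
## [1] "30ml"  "50ml"  "100ml" "250ml"

# Ordered factors also support comparisons against a level.
at_least_100 <- sizes_ord >= "100ml"
at_least_100
## [1] FALSE  TRUE  TRUE FALSE
```

Deriving the levels this way avoids typos in a hand-written vector, though it assumes every label ends in "ml".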
Save the transformed cosmetics data frame to an RDS file
named cosmetics.rds in the _rds directory.
# Serialize the data frame, including its column types and factor levels
saveRDS(cosmetics, "_rds/cosmetics.rds")
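saveRDS() does not create missing directories, so the call fails if the target folder is absent. A minimal sketch of guarding against that: the path under tempdir() and the toy data frame are placeholders so the example does not touch real project files.

```r
# Hypothetical location: a scratch directory standing in for the project's _rds folder.
rds_dir <- file.path(tempdir(), "_rds")

# Create the directory only if it is missing.
if (!dir.exists(rds_dir)) {
  dir.create(rds_dir, recursive = TRUE)
}

# Toy data frame standing in for `cosmetics`.
toy <- data.frame(Price_USD = c(67.85, 116.43))

rds_path <- file.path(rds_dir, "toy.rds")
saveRDS(toy, rds_path)

# Confirm the file was written.
file.exists(rds_path)
```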
Load the cosmetics.rds file back into R, but name the
new object cosmetics_from_rds.
# Read the serialized object back under a new name
cosmetics_from_rds <- readRDS("_rds/cosmetics.rds")
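Use the identical() function to compare the original
cosmetics data frame with cosmetics_from_rds
to check for data loss or alteration. Note that identical() is strict:
it compares attributes and classes as well as values, so a FALSE result
does not by itself mean that any data values changed.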
# Strict comparison of the original and the reloaded object
identical(cosmetics, cosmetics_from_rds)
## [1] FALSE
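To tell whether a FALSE result reflects the values or only metadata, the columns can be compared one at a time. One possible cause here (an assumption, not confirmed by this output) is the `problems` external pointer attribute visible in the str() output, which may not survive serialization. The sketch below does not reproduce that situation exactly. It uses a toy data frame and a made-up `note` attribute to show the general pattern.

```r
# Toy data frame standing in for `cosmetics` (values are illustrative).
original <- data.frame(
  Price_USD = c(67.85, 116.43, 90.84),
  Product_Size = factor(
    c("30ml", "250ml", "100ml"),
    levels = c("30ml", "100ml", "250ml"),
    ordered = TRUE
  )
)

# A copy that differs only in an extra, hypothetical attribute.
restored <- original
attr(restored, "note") <- "metadata only"

# The strict comparison fails because attributes are compared too.
strict_match <- identical(original, restored)
strict_match
## [1] FALSE

# Comparing column by column checks the data itself.
columns_match <- mapply(identical, original, restored)
all(columns_match)
## [1] TRUE
```

Applied to this document's objects, the same check would be `mapply(identical, cosmetics, cosmetics_from_rds)`. If every column matches, the difference lies in the data frame's attributes rather than its contents.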
Create a histogram of the Price_USD column from
cosmetics_from_rds to visualize its distribution.
# Base R histogram with default bins
hist(cosmetics_from_rds$Price_USD)
# ggplot2 histogram with 10-dollar bins
ggplot(cosmetics_from_rds, aes(x = Price_USD)) +
  geom_histogram(binwidth = 10, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Product Prices (USD)",
    x = "Price (USD)",
    y = "Frequency"
  )
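To see what a 10-dollar bin width means numerically, the bin counts can be computed without drawing anything. A minimal sketch with base R, using six illustrative prices taken from the summary output above rather than the full column; the break points from 10 to 150 are an assumption based on the observed minimum and maximum.

```r
# Illustrative prices (the quartiles and extremes reported by summary(), plus one extra value).
prices <- c(10, 12.5, 45.48, 80.04, 114.76, 149.99)

# Bin edges every 10 dollars across the observed price range.
breaks <- seq(10, 150, by = 10)

# plot = FALSE returns the binning without drawing the histogram.
h <- hist(prices, breaks = breaks, plot = FALSE)

# One count per bin; the first bin covers 10 to 20 dollars.
h$counts
```

geom_histogram() may place its bin edges differently by default, so the counts match the ggplot2 figure only if its `boundary` argument is set to line up with these breaks.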