This dataset simulates user registration records from a travel platform. It covers user demographics: gender, country, age group, traveler type, and signup date. The data supports hypotheses about how behavior differs across traveler types, which could in turn inform how a platform schedules events for its users. Our goal is to transform the dataset into a cleaner format better suited for analysis.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(lubridate)
# Load dataset
url <- "https://raw.githubusercontent.com/lher96/MSDS-Assignments/main/users.csv"
users <- read_csv(url)
## Rows: 2000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): user_gender, country, age_group, traveller_type
## dbl (1): user_id
## date (1): join_date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Clean column names
users <- clean_names(users)
# Preview structure
glimpse(users)
## Rows: 2,000
## Columns: 6
## $ user_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ user_gender <chr> "Female", "Male", "Female", "Male", "Other", "Male", "M…
## $ country <chr> "United Kingdom", "United Kingdom", "Mexico", "India", …
## $ age_group <chr> "35-44", "25-34", "25-34", "35-44", "25-34", "25-34", "…
## $ traveller_type <chr> "Solo", "Solo", "Family", "Family", "Solo", "Couple", "…
## $ join_date <date> 2024-09-29, 2023-11-29, 2022-04-03, 2023-12-02, 2021-1…
# Clean and transform relevant variables
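df_clean <- users %>%
  select(user_id, user_gender, country, age_group, traveller_type, join_date) %>%
  mutate(
    # Collapse every value other than "Male"/"Female" (including NA) into one label
    user_gender = case_when(
      user_gender == "Male" ~ "Male",
      user_gender == "Female" ~ "Female",
      TRUE ~ "Other / Prefer not to say"
    ),
    # Standardize the casing of text fields
    traveller_type = str_to_title(traveller_type),
    country = str_to_title(country),
    # join_date was already parsed as a Date by read_csv(), so year() applies directly
    join_year = year(join_date)
  )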
# Show cleaned data
head(df_clean)
## # A tibble: 6 × 7
## user_id user_gender country age_group traveller_type join_date join_year
## <dbl> <chr> <chr> <chr> <chr> <date> <dbl>
## 1 1 Female United… 35-44 Solo 2024-09-29 2024
## 2 2 Male United… 25-34 Solo 2023-11-29 2023
## 3 3 Female Mexico 25-34 Family 2022-04-03 2022
## 4 4 Male India 35-44 Family 2023-12-02 2023
## 5 5 Other / Prefer … Japan 25-34 Solo 2021-12-18 2021
## 6 6 Male Brazil 25-34 Couple 2025-06-27 2025
The user dataset is now cleaned: gender values are clearly labeled, text
fields use consistent casing, and a new join_year field has been extracted
from the join date. This version is ready for visualization and analysis.
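The properties claimed above can be spot-checked with a few assertions. The sketch below is only illustrative: it retypes the six rows printed by head(df_clean) into a stand-in tibble, and the names sample_clean, allowed_genders, and checks are assumptions introduced here, not part of the original analysis. The same checks should apply to the full df_clean.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Stand-in for df_clean: the six rows shown by head(df_clean), retyped by hand
sample_clean <- tribble(
  ~user_id, ~user_gender,                ~country,         ~age_group, ~traveller_type, ~join_date,
  1,        "Female",                    "United Kingdom", "35-44",    "Solo",          "2024-09-29",
  2,        "Male",                      "United Kingdom", "25-34",    "Solo",          "2023-11-29",
  3,        "Female",                    "Mexico",         "25-34",    "Family",        "2022-04-03",
  4,        "Male",                      "India",          "35-44",    "Family",        "2023-12-02",
  5,        "Other / Prefer not to say", "Japan",          "25-34",    "Solo",          "2021-12-18",
  6,        "Male",                      "Brazil",         "25-34",    "Couple",        "2025-06-27"
) %>%
  mutate(
    join_date = ymd(join_date),
    join_year = year(join_date)
  )

# The only labels the recoding step is expected to produce
allowed_genders <- c("Male", "Female", "Other / Prefer not to say")

checks <- c(
  gender_labels_ok  = all(sample_clean$user_gender %in% allowed_genders),
  no_missing_values = all(complete.cases(sample_clean)),
  join_year_matches = all(sample_clean$join_year == year(sample_clean$join_date)),
  user_id_unique    = !any(duplicated(sample_clean$user_id))
)

print(checks)
```

Any FALSE entry in checks points to the cleaning step that needs another look.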
Recommendations for next steps:

- Group users by age_group and traveller_type to identify trends.
- Analyze country-wise growth by plotting new users per year, and compare it
  with traveler type to plan excursions.
- Explore the gender distribution of solo vs. family travelers to help plan
  travel package deals.
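As a possible starting point for these next steps, the sketch below computes one summary table per recommendation. It is a minimal illustration under stated assumptions: sample_clean is a hand-typed stand-in holding the six rows printed by head(df_clean), and the names segment_counts, yearly_signups, and gender_share are hypothetical. Replacing sample_clean with df_clean should give the full-data versions.

```r
library(dplyr)
library(tibble)

# Stand-in for df_clean: the six rows shown by head(df_clean), retyped by hand
sample_clean <- tribble(
  ~user_id, ~user_gender,                ~country,         ~age_group, ~traveller_type, ~join_year,
  1,        "Female",                    "United Kingdom", "35-44",    "Solo",          2024,
  2,        "Male",                      "United Kingdom", "25-34",    "Solo",          2023,
  3,        "Female",                    "Mexico",         "25-34",    "Family",        2022,
  4,        "Male",                      "India",          "35-44",    "Family",        2023,
  5,        "Other / Prefer not to say", "Japan",          "25-34",    "Solo",          2021,
  6,        "Male",                      "Brazil",         "25-34",    "Couple",        2025
)

# 1. Number of users in each age group / traveller type combination
segment_counts <- sample_clean %>%
  count(age_group, traveller_type, name = "n_users")

# 2. New users per country per year, ready to feed into a line or bar plot
yearly_signups <- sample_clean %>%
  count(country, join_year, name = "new_users") %>%
  arrange(country, join_year)

# 3. Gender share within solo and family travelers
gender_share <- sample_clean %>%
  filter(traveller_type %in% c("Solo", "Family")) %>%
  count(traveller_type, user_gender, name = "n_users") %>%
  group_by(traveller_type) %>%
  mutate(share = n_users / sum(n_users)) %>%
  ungroup()

print(segment_counts)
print(yearly_signups)
print(gender_share)
```

Computing the share within each traveller type, rather than across all users, keeps the solo and family groups comparable even when their sizes differ.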
GitHub Repository: https://github.com/lher96/MSDS-Assignments/blob/main/rpubs.Rmd
RPubs Publication: https://rpubs.com/loudata/1340075