User Demographics Data Transformation

Introduction

This dataset simulates user registration records from a travel platform. Its 2,000 rows cover six fields: a user ID, gender, country, age group, travel type, and join date. Data like this can support hypotheses about how behavior differs across traveler types, for example to inform how a platform schedules events for its users. Our goal is to transform the dataset into a cleaner format better suited for analysis.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)

## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

library(lubridate)

# Load dataset 
url <- "https://raw.githubusercontent.com/lher96/MSDS-Assignments/main/users.csv"
users <- read_csv(url)

## Rows: 2000 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): user_gender, country, age_group, traveller_type
## dbl  (1): user_id
## date (1): join_date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Clean column names
users <- clean_names(users)

# Preview structure
glimpse(users)

## Rows: 2,000
## Columns: 6
## $ user_id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
## $ user_gender    <chr> "Female", "Male", "Female", "Male", "Other", "Male", "M…
## $ country        <chr> "United Kingdom", "United Kingdom", "Mexico", "India", …
## $ age_group      <chr> "35-44", "25-34", "25-34", "35-44", "25-34", "25-34", "…
## $ traveller_type <chr> "Solo", "Solo", "Family", "Family", "Solo", "Couple", "…
## $ join_date      <date> 2024-09-29, 2023-11-29, 2022-04-03, 2023-12-02, 2021-1…

Data Cleaning and Transformation

# Recode gender, standardize text casing, and derive the signup year
df_clean <- users %>%
  select(user_id, user_gender, country, age_group, traveller_type, join_date) %>%
  mutate(
    # Keep "Male" and "Female"; every other value, including NA, falls into one category
    user_gender = case_when(
      user_gender == "Male" ~ "Male",
      user_gender == "Female" ~ "Female",
      TRUE ~ "Other / Prefer not to say"
    ),
    traveller_type = str_to_title(traveller_type),
    country = str_to_title(country),
    # read_csv() already parsed join_date as a Date, so no ymd() call is needed
    join_year = year(join_date)
  )

# Show cleaned data
head(df_clean)

## # A tibble: 6 × 7
##   user_id user_gender      country age_group traveller_type join_date  join_year
##     <dbl> <chr>            <chr>   <chr>     <chr>          <date>         <dbl>
## 1       1 Female           United… 35-44     Solo           2024-09-29      2024
## 2       2 Male             United… 25-34     Solo           2023-11-29      2023
## 3       3 Female           Mexico  25-34     Family         2022-04-03      2022
## 4       4 Male             India   35-44     Family         2023-12-02      2023
## 5       5 Other / Prefer … Japan   25-34     Solo           2021-12-18      2021
## 6       6 Male             Brazil  25-34     Couple         2025-06-27      2025
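
One way to sanity-check the transformation is to run the same steps on a few hand-made rows that exercise each branch. The sketch below is illustrative only: `sample_users` and its values are hypothetical and are not taken from users.csv. It shows that missing gender values also land in the "Other / Prefer not to say" category, and that mixed-case text is normalized to title case.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Hypothetical rows (not from users.csv), chosen to exercise each branch
sample_users <- tibble(
  user_id = 1:4,
  user_gender = c("Female", "Male", "Other", NA),
  country = c("united kingdom", "MEXICO", "India", "japan"),
  traveller_type = c("solo", "FAMILY", "Couple", "solo"),
  join_date = as.Date(c("2024-09-29", "2023-11-29", "2022-04-03", "2021-12-18"))
)

# Same recoding, casing, and year extraction as in the cleaning step
sample_clean <- sample_users %>%
  mutate(
    user_gender = case_when(
      user_gender == "Male" ~ "Male",
      user_gender == "Female" ~ "Female",
      TRUE ~ "Other / Prefer not to say"
    ),
    traveller_type = str_to_title(traveller_type),
    country = str_to_title(country),
    join_year = year(join_date)
  )

# Every gender value should now be one of the three expected labels
expected_labels <- c("Male", "Female", "Other / Prefer not to say")
stopifnot(all(sample_clean$user_gender %in% expected_labels))

# Tabulate the recoded values to see how many rows fell into each category
count(sample_clean, user_gender)
```

Running `count(df_clean, user_gender)` on the real data would give the same kind of summary and make it easy to see how many records were folded into the combined category.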

Conclusions

The user dataset is now cleaned and organized: gender is recoded into three labeled categories (Male, Female, and Other / Prefer not to say), the country and traveller type fields use consistent title casing, and a new join_year field holds the year extracted from the join date. This version is ready for visualization and analysis.

Recommendations for next steps:

- Group users by age_group and traveller_type to identify trends.
- Analyze growth by country by plotting new users per year, and compare it across traveler types to plan excursions.
- Explore the gender distribution among solo and family travelers to help plan travel package deals.
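
The next steps suggested above could start from summaries like the ones sketched here. This is a minimal illustration, not part of the original analysis: `df_demo` is a hypothetical stand-in for `df_clean` with made-up rows, and the result names (`by_age_type`, `by_country_year`, `gender_share`) are assumptions chosen for readability.

```r
library(dplyr)

# Hypothetical stand-in for df_clean (made-up rows, same column names)
df_demo <- tibble(
  age_group = c("25-34", "25-34", "35-44", "35-44", "25-34", "18-24"),
  traveller_type = c("Solo", "Solo", "Family", "Family", "Couple", "Solo"),
  user_gender = c("Female", "Male", "Male", "Female",
                  "Other / Prefer not to say", "Female"),
  country = c("Mexico", "Mexico", "India", "India", "Japan", "Mexico"),
  join_year = c(2022, 2023, 2023, 2023, 2021, 2023)
)

# 1. Number of users in each age group and traveller type combination
by_age_type <- df_demo %>%
  count(age_group, traveller_type, name = "n_users")

# 2. New users per country and signup year (a starting point for growth plots)
by_country_year <- df_demo %>%
  count(country, join_year, name = "new_users")

# 3. Gender share within solo and family travellers
gender_share <- df_demo %>%
  filter(traveller_type %in% c("Solo", "Family")) %>%
  count(traveller_type, user_gender) %>%
  group_by(traveller_type) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
```

Replacing `df_demo` with `df_clean` would apply the same summaries to the full dataset; `by_country_year` could then be passed to ggplot2 for a line chart of new users per year.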

GitHub Repository: https://github.com/lher96/MSDS-Assignments/blob/main/rpubs.Rmd
RPubs Publication: https://rpubs.com/loudata/1340075

Luis Hernandez

2025-08-31
