For this section, I selected a Travel Expense Report table that lists daily costs for meals, hotels, and transport by city. The original dataset was extracted from a scanned image and manually reconstructed in R. It is considered untidy because it mixes raw data with subtotals and totals, uses section headers like “San Jose” and “Seattle” instead of a proper Location column, and stores dates as row labels rather than structured variables.
The original table was manually transcribed into R using tribble() from the tibble package to preserve its visual structure. This includes section headers, daily entries, and subtotal/total rows. Missing values are represented with NA.
library(tibble)
travel_expense <- tribble(
~Location, ~Meals, ~Hotels, ~Transport, ~Subtotals,
"San Jose", NA, NA, NA, NA,
"25-Aug-97", 37.74, 112.00, 45.00, NA,
"26-Aug-97", 27.28, 112.00, 45.00, NA,
"subtotals", 65.02, 224.00, 90.00, 379.02,
"Seattle", NA, NA, NA, NA,
"27-Aug-97", 96.25, 109.00, 36.00, NA,
"28-Aug-97", 35.00, 109.00, 36.00, NA,
"subtotals", 131.25, 218.00, 72.00, 421.25,
"Totals", 196.27, 442.00, 162.00, 800.27)
View(travel_expense)
The dataset is saved as a .csv file so the raw data is stored externally and later imported for analysis. This file will be uploaded to GitHub for reproducibility and remote access.
write.csv(travel_expense, "DATA607_Project2_travel_expense.csv", row.names = FALSE)
The CSV file is read directly from the GitHub repository using read_csv() from the readr package.
library(readr)
travel_raw <- read_csv("https://raw.githubusercontent.com/arutam-antunish/DATA607/refs/heads/main/DATA607_Project2_travel_expense.csv")
## Rows: 9 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Location
## dbl (4): Meals, Hotels, Transport, Subtotals
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(travel_raw)
First, we cleaned the raw data by adding a City column based on section headers, removing subtotal and total rows, and converting dates and amounts into proper formats.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# Adding a City column based on section headers
travel_clean <- travel_raw %>%
mutate(City = case_when(Location == "San Jose" ~ "San Jose",
Location == "Seattle" ~ "Seattle", TRUE ~ NA_character_)) %>% fill(City)
travel_clean <- travel_clean %>%
filter(!Location %in% c("San Jose", "Seattle", "subtotals", "Totals"))
# Renaming and convert columns
travel_clean <- travel_clean %>%
rename(Date = Location) %>%
mutate(Date = dmy(Date), Meals = as.numeric(Meals),
Hotels = as.numeric(Hotels), Transport = as.numeric(Transport))
After cleaning the data, we changed it from wide to long format using pivot_longer(). This makes each expense its own row, which helps us analyze and visualize the data more easily by city, date, or category.
travel_tidy <- travel_clean %>%
pivot_longer(cols = c(Meals, Hotels, Transport), names_to = "Category",
values_to = "Amount")
travel_tidy <- travel_tidy %>% select(-Subtotals)
View(travel_tidy)
With the data now tidy, we can explore spending patterns by city and category. We’ll calculate total and average expenses, and visualize the distribution of costs.
summary_by_city_category <- travel_tidy %>%
group_by(City, Category) %>%
summarise(Total = sum(Amount, na.rm = TRUE), Average = mean(Amount, na.rm = TRUE), .groups = "drop")
View(summary_by_city_category)
This bar plot shows total expenses by category for each city. It highlights which types of costs were most significant in San Jose vs. Seattle.
library(ggplot2)
ggplot(travel_tidy, aes(x = Category, y = Amount, fill = City)) +
geom_bar(stat = "summary", fun = "sum", position = "dodge") +
labs(title = "Total Travel Expenses by Category and City",
x = "Expense Category", y = "Total Amount") + theme_classic()
We calculated the average daily expense per entry for each city. This gives a general idea of how costly each location was during the trip.
avg_by_city <- travel_tidy %>% group_by(City) %>%
summarise(Average_Expense = mean(Amount, na.rm = TRUE))
View(avg_by_city)
This summary shows which expense categories (Meals, Hotels, Transport) accounted for the largest share of total spending.
total_by_category <- travel_tidy %>%
group_by(Category) %>%
summarise(Total_Spent = sum(Amount, na.rm = TRUE))
View(total_by_category)
We aggregated expenses by date to observe how total travel costs varied from day to day. This helps identify any spikes or patterns in spending.
daily_total <- travel_tidy %>%
group_by(Date) %>%
summarise(Daily_Total = sum(Amount, na.rm = TRUE))
View(daily_total)
This line plot shows how total travel costs changed day by day. It reveals spending peaks and patterns across the trip.
ggplot(daily_total, aes(x = Date, y = Daily_Total)) +
geom_line(color = "darkblue", linewidth = 1.2) +
geom_point(color = "darkred", size = 3) +
labs(title = "Daily Total Travel Costs",
x = "Date", y = "Total Cost") +
theme_classic()
In the first part of this Project we cleaned the dataset by adding a city column, removing subtotal and total rows, and converting dates and amounts to proper formats. Then we reshaped the data from wide to long format to make analysis easier. Finally, we analyzed average expenses by city, compared spending across categories, and tracked daily travel costs using summary tables and visualizations.