In this assignment, I worked with a Travel Expense Report showing daily spending for meals, hotels, and transport across two cities: San Jose and Seattle. The dataset is untidy because it uses section headers as city labels, mixes raw data with subtotals, and includes total rows that do not represent actual observations. The goal was to tidy the dataset, transform it into a structured format, and calculate the total and average expenses by city.
To start, I loaded the libraries used for organizing, transforming, and analyzing the data.
library(tidyverse)
I recreated the dataset in wide format to match the structure of the original report. This table represents the daily meal, hotel, and transport expenses for each city. It serves as the untidy data source that will be cleaned and transformed for analysis.
travel <- data.frame(
City = c("San Jose", "San Jose", "Seattle", "Seattle"),
Date = c("25-Aug-97", "26-Aug-97", "27-Aug-97", "28-Aug-97"),
Meals = c(37.74, 27.28, 96.25, 35.00),
Hotels = c(112.00, 112.00, 109.00, 109.00),
Transport = c(45.00, 45.00, 36.00, 36.00)
)
travel
## City Date Meals Hotels Transport
## 1 San Jose 25-Aug-97 37.74 112 45
## 2 San Jose 26-Aug-97 27.28 112 45
## 3 Seattle 27-Aug-97 96.25 109 36
## 4 Seattle 28-Aug-97 35.00 109 36
The dataset was already close to tidy, but I made sure all columns were in the correct format. I converted the Date column into a proper date format and verified that all numeric columns were recognized correctly. This step ensured the dataset was clean and ready for analysis.
travel <- travel %>%
mutate(Date = as.Date(Date, format = "%d-%b-%y"))
str(travel)
## 'data.frame': 4 obs. of 5 variables:
## $ City : chr "San Jose" "San Jose" "Seattle" "Seattle"
## $ Date : Date, format: "1997-08-25" "1997-08-26" ...
## $ Meals : num 37.7 27.3 96.2 35
## $ Hotels : num 112 112 109 109
## $ Transport: num 45 45 36 36
Next, I summarized the total spending for each city across all categories. I calculated the total amount spent on meals, hotels, and transport for San Jose and Seattle individually, and then computed the grand total across both trips.
city_totals <- travel %>%
group_by(City) %>%
summarise(
Meals = sum(Meals),
Hotels = sum(Hotels),
Transport = sum(Transport),
Total = sum(Meals + Hotels + Transport)
)
city_totals
## # A tibble: 2 × 5
## City Meals Hotels Transport Total
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 San Jose 65.0 224 90 379.
## 2 Seattle 131. 218 72 421.
grand_total <- sum(city_totals$Total)
grand_total
## [1] 800.27
To better compare the two locations, I created a bar chart showing total spending per city. This visualization highlights that Seattle had slightly higher overall expenses than San Jose, mainly due to higher meal costs.
ggplot(city_totals, aes(x = City, y = Total, fill = City)) +
geom_col() +
labs(title = "Total Travel Expenses by City", y = "Total ($)") +
theme_minimal()
After tidying and transforming the dataset, the total travel cost was $800.27, with Seattle being slightly more expensive overall than San Jose. This exercise demonstrated how an untidy report with subtotals can be cleaned into a structured dataset ready for analysis.