1. Introduction

In this assignment, I worked with a Travel Expense Report showing daily spending for meals, hotels, and transport across two cities: San Jose and Seattle. The dataset is untidy because it uses section headers as city labels, mixes raw data with subtotals, and includes total rows that do not represent actual observations. The goal was to tidy the dataset, transform it into a structured format, and calculate the total and average expenses by city.


2. Load Libraries

To start, I loaded the libraries used for organizing, transforming, and analyzing the data.

library(tidyverse)

3. Create the Dataset

I recreated the dataset in wide format to match the structure of the original report. This table represents the daily meal, hotel, and transport expenses for each city. It serves as the untidy data source that will be cleaned and transformed for analysis.

travel <- data.frame(
  City = c("San Jose", "San Jose", "Seattle", "Seattle"),
  Date = c("25-Aug-97", "26-Aug-97", "27-Aug-97", "28-Aug-97"),
  Meals = c(37.74, 27.28, 96.25, 35.00),
  Hotels = c(112.00, 112.00, 109.00, 109.00),
  Transport = c(45.00, 45.00, 36.00, 36.00)
)
travel
##       City      Date Meals Hotels Transport
## 1 San Jose 25-Aug-97 37.74    112        45
## 2 San Jose 26-Aug-97 27.28    112        45
## 3  Seattle 27-Aug-97 96.25    109        36
## 4  Seattle 28-Aug-97 35.00    109        36

4. Tidy the Data

The dataset was already close to tidy, but I made sure all columns were in the correct format. I converted the Date column into a proper date format and verified that all numeric columns were recognized correctly. This step ensured the dataset was clean and ready for analysis.

travel <- travel %>%
  mutate(Date = as.Date(Date, format = "%d-%b-%y"))

str(travel)
## 'data.frame':    4 obs. of  5 variables:
##  $ City     : chr  "San Jose" "San Jose" "Seattle" "Seattle"
##  $ Date     : Date, format: "1997-08-25" "1997-08-26" ...
##  $ Meals    : num  37.7 27.3 96.2 35
##  $ Hotels   : num  112 112 109 109
##  $ Transport: num  45 45 36 36

5. Transform and Summary

Next, I summarized the total spending for each city across all categories. I calculated the total amount spent on meals, hotels, and transport for San Jose and Seattle individually, and then computed the grand total across both trips.

city_totals <- travel %>%
  group_by(City) %>%
  summarise(
    Meals = sum(Meals),
    Hotels = sum(Hotels),
    Transport = sum(Transport),
    Total = sum(Meals + Hotels + Transport)
  )

city_totals
## # A tibble: 2 × 5
##   City     Meals Hotels Transport Total
##   <chr>    <dbl>  <dbl>     <dbl> <dbl>
## 1 San Jose  65.0    224        90  379.
## 2 Seattle  131.     218        72  421.
grand_total <- sum(city_totals$Total)
grand_total
## [1] 800.27

6. Visualization

To better compare the two locations, I created a bar chart showing total spending per city. This visualization highlights that Seattle had slightly higher overall expenses than San Jose, mainly due to higher meal costs.

ggplot(city_totals, aes(x = City, y = Total, fill = City)) +
  geom_col() +
  labs(title = "Total Travel Expenses by City", y = "Total ($)") +
  theme_minimal()

7. Conclusion

After tidying and transforming the dataset, the total travel cost was $800.27, with Seattle being slightly more expensive overall than San Jose. This exercise demonstrated how an untidy report with subtotals can be cleaned into a structured dataset ready for analysis.