Introduction

I choose three datasets from the discussion forum, which are airbooking_table, film_review_table and students_grades_table. They are all in a wide format and I am going to tidy and transform them into long format using tidyverse library.

1. Air Booking Table

# The first table is about air booking agencies.
ab <- read.csv('https://raw.githubusercontent.com/vincent-usny/607_project2/refs/heads/main/airbooking_table.csv')

ab <- ab %>%
  pivot_longer(
    cols = c(Winter, Spring, Summer, Fall),
    names_to = "Season",
    values_to = "Price"
  )

# compare avg_price by season among agencies
ab %>%
  group_by(Agency,Season) %>%
  summarise(avg_price = round(mean(Price)), .groups = "drop") %>%
  arrange(desc(avg_price))

## # A tibble: 12 × 3
##    Agency        Season avg_price
##    <chr>         <chr>      <dbl>
##  1 Booking.com   Summer       615
##  2 TravelOne     Summer       612
##  3 Google Travel Summer       600
##  4 TravelOne     Fall         488
##  5 Booking.com   Fall         480
##  6 Booking.com   Spring       478
##  7 TravelOne     Spring       475
##  8 Google Travel Fall         472
##  9 Google Travel Spring       470
## 10 Booking.com   Winter       425
## 11 TravelOne     Winter       425
## 12 Google Travel Winter       410

# compare the services' avg_price among agencies
ab %>%
  group_by(Agency, Service)%>%
  summarise(avg_price = mean(Price), .groups = "drop") %>%
  arrange(desc(avg_price))

## # A tibble: 6 × 3
##   Agency        Service avg_price
##   <chr>         <chr>       <dbl>
## 1 TravelOne     Hotel        588.
## 2 Booking.com   Hotel        582.
## 3 Google Travel Hotel        570 
## 4 Booking.com   Flight       416.
## 5 TravelOne     Flight       412.
## 6 Google Travel Flight       406.

ggplot(ab, aes(x = Season, y = Price, color = Service, group = Service)) + geom_line() + facet_wrap(~Agency)

2. Film Review Table

film <- read.csv('https://raw.githubusercontent.com/vincent-usny/607_project2/refs/heads/main/film_review_table.csv')

film <- film %>%
  pivot_longer(
    cols = c(Daniel,Eleana,Susan,Winnie,Aiden),
    names_to = "Reviewer",
    values_to = "Rating"
  )

# average rating per film
film %>%
  group_by(Film) %>%
  summarise(avg_rating = round(mean(Rating),2), .groups = "drop") %>%
  arrange(desc(avg_rating))

## # A tibble: 5 × 2
##   Film                     avg_rating
##   <chr>                         <dbl>
## 1 How to Train your Dragon        4.4
## 2 F1 the movie                    4  
## 3 Elio                            3.8
## 4 Mission Impossible              3.4
## 5 Superman                        2.6

ggplot(film, aes(x = Film, y = Reviewer, fill = Rating)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "blue")

# This tile plot shows each reviewer's rating toward each film.
# We can also see some reviewers tend to give higer ratings.

3. Students Grades Table

grades <- read.csv('https://raw.githubusercontent.com/vincent-usny/607_project2/refs/heads/main/students_grades_table.csv')

grades <- grades %>%
  pivot_longer(
    cols = c(Math, Science, English),
    names_to = "Subject",
    values_to = "Score"
  )

# compare each student's overall avg_score each year
grades %>%
  group_by(Student, Year) %>%
  summarise(avg_score = round(mean(Score),1), .groups = "drop") %>%
  arrange(desc(avg_score))

## # A tibble: 8 × 3
##   Student  Year avg_score
##   <chr>   <int>     <dbl>
## 1 Carol    2020      93.7
## 2 Carol    2019      91.7
## 3 Alice    2020      87  
## 4 Alice    2019      84.3
## 5 Bob      2020      77.7
## 6 Bob      2019      75  
## 7 David    2020      70  
## 8 David    2019      67.7

# rank each subject's avg_score among students 
grades %>%
  group_by(Student, Subject) %>%
  summarise(avg_score = round(mean(Score),1), .groups = "drop") %>%
  arrange(Subject,desc(avg_score))

## # A tibble: 12 × 3
##    Student Subject avg_score
##    <chr>   <chr>       <dbl>
##  1 Carol   English      96  
##  2 Alice   English      79.5
##  3 Bob     English      71  
##  4 David   English      66.5
##  5 Carol   Math         93  
##  6 Alice   Math         86.5
##  7 Bob     Math         76.5
##  8 David   Math         69  
##  9 Alice   Science      91  
## 10 Carol   Science      89  
## 11 Bob     Science      81.5
## 12 David   Science      71

ggplot(grades, aes(x = Year, y = Score, color = Subject, group = Subject)) + 
  geom_line() + 
  facet_wrap(~Student)

# we can see David's overall performance is the worst while Caro's is the best.
# Alice is the second and Bob is the third. It's quite clear.

Conclusion

These three wide datasets are successfully tidied and transformed into long formats. The data is analyzed in some levels while I think the line chart and the tile chart perform pretty well to describe the data.

Project2

Haoming Chen

2025-10-03

Introduction

1. Air Booking Table

2. Film Review Table

3. Students Grades Table

Conclusion