I choose three datasets from the discussion forum, which are airbooking_table, film_review_table and students_grades_table. They are all in a wide format and I am going to tidy and transform them into long format using tidyverse library.
# The first table is about air booking agencies.
ab <- read.csv('https://raw.githubusercontent.com/vincent-usny/607_project2/refs/heads/main/airbooking_table.csv')
ab <- ab %>%
pivot_longer(
cols = c(Winter, Spring, Summer, Fall),
names_to = "Season",
values_to = "Price"
)
# compare avg_price by season among agencies
ab %>%
group_by(Agency,Season) %>%
summarise(avg_price = round(mean(Price)), .groups = "drop") %>%
arrange(desc(avg_price))
## # A tibble: 12 × 3
## Agency Season avg_price
## <chr> <chr> <dbl>
## 1 Booking.com Summer 615
## 2 TravelOne Summer 612
## 3 Google Travel Summer 600
## 4 TravelOne Fall 488
## 5 Booking.com Fall 480
## 6 Booking.com Spring 478
## 7 TravelOne Spring 475
## 8 Google Travel Fall 472
## 9 Google Travel Spring 470
## 10 Booking.com Winter 425
## 11 TravelOne Winter 425
## 12 Google Travel Winter 410
# compare the services' avg_price among agencies
ab %>%
group_by(Agency, Service)%>%
summarise(avg_price = mean(Price), .groups = "drop") %>%
arrange(desc(avg_price))
## # A tibble: 6 × 3
## Agency Service avg_price
## <chr> <chr> <dbl>
## 1 TravelOne Hotel 588.
## 2 Booking.com Hotel 582.
## 3 Google Travel Hotel 570
## 4 Booking.com Flight 416.
## 5 TravelOne Flight 412.
## 6 Google Travel Flight 406.
ggplot(ab, aes(x = Season, y = Price, color = Service, group = Service)) + geom_line() + facet_wrap(~Agency)
film <- read.csv('https://raw.githubusercontent.com/vincent-usny/607_project2/refs/heads/main/film_review_table.csv')
film <- film %>%
pivot_longer(
cols = c(Daniel,Eleana,Susan,Winnie,Aiden),
names_to = "Reviewer",
values_to = "Rating"
)
# average rating per film
film %>%
group_by(Film) %>%
summarise(avg_rating = round(mean(Rating),2), .groups = "drop") %>%
arrange(desc(avg_rating))
## # A tibble: 5 × 2
## Film avg_rating
## <chr> <dbl>
## 1 How to Train your Dragon 4.4
## 2 F1 the movie 4
## 3 Elio 3.8
## 4 Mission Impossible 3.4
## 5 Superman 2.6
ggplot(film, aes(x = Film, y = Reviewer, fill = Rating)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "blue")
# This tile plot shows each reviewer's rating toward each film.
# We can also see some reviewers tend to give higer ratings.
grades <- read.csv('https://raw.githubusercontent.com/vincent-usny/607_project2/refs/heads/main/students_grades_table.csv')
grades <- grades %>%
pivot_longer(
cols = c(Math, Science, English),
names_to = "Subject",
values_to = "Score"
)
# compare each student's overall avg_score each year
grades %>%
group_by(Student, Year) %>%
summarise(avg_score = round(mean(Score),1), .groups = "drop") %>%
arrange(desc(avg_score))
## # A tibble: 8 × 3
## Student Year avg_score
## <chr> <int> <dbl>
## 1 Carol 2020 93.7
## 2 Carol 2019 91.7
## 3 Alice 2020 87
## 4 Alice 2019 84.3
## 5 Bob 2020 77.7
## 6 Bob 2019 75
## 7 David 2020 70
## 8 David 2019 67.7
# rank each subject's avg_score among students
grades %>%
group_by(Student, Subject) %>%
summarise(avg_score = round(mean(Score),1), .groups = "drop") %>%
arrange(Subject,desc(avg_score))
## # A tibble: 12 × 3
## Student Subject avg_score
## <chr> <chr> <dbl>
## 1 Carol English 96
## 2 Alice English 79.5
## 3 Bob English 71
## 4 David English 66.5
## 5 Carol Math 93
## 6 Alice Math 86.5
## 7 Bob Math 76.5
## 8 David Math 69
## 9 Alice Science 91
## 10 Carol Science 89
## 11 Bob Science 81.5
## 12 David Science 71
ggplot(grades, aes(x = Year, y = Score, color = Subject, group = Subject)) +
geom_line() +
facet_wrap(~Student)
# we can see David's overall performance is the worst while Caro's is the best.
# Alice is the second and Bob is the third. It's quite clear.
These three wide datasets are successfully tidied and transformed into long formats. The data is analyzed in some levels while I think the line chart and the tile chart perform pretty well to describe the data.