Your task is to:

  1. Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)

For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

  1. Please include in your homework submission, for each of the three chosen datasets: The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.

Import Tidyverse

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(knitr)

Load the three distinct data sets…

# sourced from Alice Ding's Discussion Board Entry
student_data <- read.csv("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")

# from Jacob Silver's Discussion Board link to a google document where I dounloaded the csv file and then uploade the file to my github repo.
nba_data <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nba_players.csv")

# my discussion board entry on nafta countries population from 1994 to 2000
nafta_population <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nafta_countries.csv")

Taking a look at the untidy data using the head() function

head(student_data)
##   id   name phone sex.and.age test.number term.1 term.2 term.3
## 1  1   Mike   134        m_12      test 1     76     84     87
## 2  2  Linda   270        f_13      test 1     88     90     73
## 3  3    Sam   210        m_11      test 1     78     74     80
## 4  4 Esther   617        f_12      test 1     68     75     74
## 5  5   Mary   114        f_14      test 1     65     67     64
## 6  1   Mike   134        m_12      test 2     85     80     90
head(nafta_population)
##   id       Country     X1994     X1995     X1996     X1997     X1998     X1999
## 1  1        Mexico  88314424  89969572  91586555  93183094  94767284  96334810
## 2  2 United States 263126000 266278000 269394000 272657000 275854000 279040000
## 3  3        Canada  29000663  29302311  29610218  29905948  30155173  30401286
##       X2000
## 1  97873442
## 2 282162411
## 3  30685730
head(nba_data)
##                      X1                       X2                          X3
## 1          Nikola Jokic                Team: DEN               Position: BIG
## 2                          Field goals made: 9.3 Field goal percentage: 63.2
## 3                       Free throw attempts: 6.1 Free throw percentage: 82.6
## 4                                  Assists: 10.0              Turnovers: 3.6
## 5 Giannis Antetokounmpo                Team: MIL               Position: BIG
## 6                         Field goals made: 11.1 Field goal percentage: 53.8
##                        X4                         X5
## 1                 Age: 28               Points: 24.5
## 2    3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2    Defensive rebounds: 9.4
## 4             Steals: 1.3                Blocks: 0.7
## 5                 Age: 28               Points: 31.3
## 6    3-pointers made: 0.8 3-pointer percentage: 27.0

Clean and Tidy

Reference: https://tidyr.tidyverse.org/

Tidying Student Data

In this code chunk, using the pivot_longer() function to pivot the wide data format to long format; selecting column names that start with term. to pivot, then creating a term column name to store the old column names and then creating another column name score to store the values of the old columns. After pivot_longer(), we need to separate the sex.and.age column to two distinct columns by using the separate() function. Finally, using select to arrange the column names.

student_tidy <- student_data |> 
    pivot_longer(cols = starts_with("term."), names_to = "term", values_to = "score") |> 
    separate(col = "sex.and.age", into = c("sex", "age"), sep = "_") |> 
    select(id, name, phone, sex, age, test.number, term, score) |> 
    mutate(term = if_else(term == "term.1", 1, 
                          if_else(term == "term.2", 2, 
                                  if_else(term == "term.3", 3, NA_real_)))) |>
    mutate(age = as.integer(age)) |> 
    arrange(id, term)

head(student_tidy)
## # A tibble: 6 × 8
##      id name  phone sex     age test.number  term score
##   <int> <chr> <int> <chr> <int> <chr>       <dbl> <int>
## 1     1 Mike    134 m        12 test 1          1    76
## 2     1 Mike    134 m        12 test 2          1    85
## 3     1 Mike    134 m        12 test 1          2    84
## 4     1 Mike    134 m        12 test 2          2    80
## 5     1 Mike    134 m        12 test 1          3    87
## 6     1 Mike    134 m        12 test 2          3    90

Tidying NAFTA Populations Data

Here, we are using pivot_longer() function to pivot the columns that start with X, then using mutate to remove the letter X in front of each year.

nafta_tidy <- nafta_population |> 
    pivot_longer(cols = starts_with("X"), names_to = "Year", values_to = "Population") |> 
    mutate(Year = str_remove_all(Year, "^X")) |>  
    mutate(Year = as.integer(Year))
kable(nafta_tidy)
id Country Year Population
1 Mexico 1994 88314424
1 Mexico 1995 89969572
1 Mexico 1996 91586555
1 Mexico 1997 93183094
1 Mexico 1998 94767284
1 Mexico 1999 96334810
1 Mexico 2000 97873442
2 United States 1994 263126000
2 United States 1995 266278000
2 United States 1996 269394000
2 United States 1997 272657000
2 United States 1998 275854000
2 United States 1999 279040000
2 United States 2000 282162411
3 Canada 1994 29000663
3 Canada 1995 29302311
3 Canada 1996 29610218
3 Canada 1997 29905948
3 Canada 1998 30155173
3 Canada 1999 30401286
3 Canada 2000 30685730

Tidying NBA Players Data

The NBA data is more complicated and thus requires more cleaning since we have columns that are formatted in a way that the values of a player’s statistics are right in the same line. Therefore, after using the pivot_longer() function, we use the separate() function to separate the statistic and its value. Now, we want to make each statistic its own column and have its value as its entry. At the same time filling the empty row in the name column with name of the player. The next step is to use pivot_wider() so that each column is a statistic, every row is an observation, and every cell is a single value. Thus, satisfying the constraints of tidy data.

nba_tidy <- nba_data %>% 
    pivot_longer(cols = c(X2:X5), names_to = "old_colnames", values_to = "value") |> 
    separate(value, c("statistic", "measure"), sep = ":") |>
    select(-old_colnames) |> 
    mutate(X1 = if_else(X1 == "", NA_character_, X1)) |> 
    fill(X1) |> 
    pivot_wider(names_from = statistic, values_from = measure) |> 
    rename("Player_Name" = X1)  |>
    arrange(Team, desc(Points)) |> 
    mutate(across(c(Age, Points, `Field goals made`, `Field goal percentage`, `Offensive rebounds`, `Defensive rebounds`, Assists, Turnovers, Steals, Blocks), as.numeric))

 
nba_tidy_sliced <- nba_tidy |>
    slice(1:10)  

kable(nba_tidy_sliced)
Player_Name Team Position Age Points Field goals made Field goal percentage 3-pointers made 3-pointer percentage Free throw attempts Free throw percentage Offensive rebounds Defensive rebounds Assists Turnovers Steals Blocks
Trae Young ATL GUARD 24 26.9 8.4 42.8 2.2 32.9 8.8 88.7 0.7 2.2 10.2 4.1 1.0 0.2
Dejounte Murray ATL GUARD 26 20.8 8.4 46.5 2.0 36.2 2.5 83.7 0.7 4.9 6.1 2.3 1.5 0.3
Clint Capela ATL BIG 28 12.1 5.5 62.9 0.0 0.0 1.8 61.3 4.2 7.0 0.8 0.8 0.7 1.2
Mikal Bridges BKN FORWARD 26 17.6 6.4 46.7 1.9 39.6 3.2 89.6 1.1 3.4 3.5 1.5 1.2 0.8
Spencer Dinwiddie BKN GUARD 29 17.5 5.8 44.8 2.5 39.6 4.0 82.5 0.3 2.8 5.3 1.7 0.8 0.3
Cameron Johnson BKN FORWARD 26 13.8 4.8 45.3 2.6 41.2 2.0 79.1 0.7 3.0 1.6 0.5 1.0 0.3
Nic Claxton BKN BIG 23 12.7 5.5 71.7 0.0 0.0 3.3 50.6 2.5 6.5 1.7 1.3 0.8 2.5
Al Horford BOS BIG 36 9.3 3.5 47.0 2.1 43.0 0.4 72.2 1.2 5.0 2.7 0.6 0.5 0.9
Robert Williams III BOS BIG 25 8.5 4.0 75.0 0.0 0.0 0.9 65.2 3.2 5.5 1.6 0.9 0.5 1.1
Jayson Tatum BOS FORWARD 24 30.4 9.9 46.1 3.3 35.5 8.5 86.4 1.1 7.6 4.6 2.9 1.0 0.8

Analysis and Visualization


Student Test Scores

For students data from Alice Ding’s discussion entry, she suggested to find the average test score for each student or the average test score overall and count the number male and female student, and the average age in the data.

avg_score_term <- student_tidy |> 
    group_by(name, term) |> 
    summarize(average_score = round(mean(score), 2)) |> 
    arrange(name, term)
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
avg_score_overall <- student_tidy |> 
    group_by(name) |> 
    summarize(average_score = round(mean(score), 2)) |> 
    arrange(name)

avg_score_term
## # A tibble: 15 × 3
## # Groups:   name [5]
##    name    term average_score
##    <chr>  <dbl>         <dbl>
##  1 Esther     1          69  
##  2 Esther     2          75  
##  3 Esther     3          76  
##  4 Linda      1          87.5
##  5 Linda      2          86  
##  6 Linda      3          83.5
##  7 Mary       1          66.5
##  8 Mary       2          68.5
##  9 Mary       3          63.5
## 10 Mike       1          80.5
## 11 Mike       2          82  
## 12 Mike       3          88.5
## 13 Sam        1          79  
## 14 Sam        2          80.5
## 15 Sam        3          80
avg_score_overall
## # A tibble: 5 × 2
##   name   average_score
##   <chr>          <dbl>
## 1 Esther          73.3
## 2 Linda           85.7
## 3 Mary            66.2
## 4 Mike            83.7
## 5 Sam             79.8
avg_score_term |> 
    ggplot(aes(x = name, y = average_score, fill = factor(term))) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_text(aes(label = round(average_score, 1)), vjust = -0.3, position = position_dodge(0.9)) +
    labs(x = "Name", y = "Average score", fill = "Term") + 
    ggtitle("Average Test Score per Term")

We observe that Esther and Mike’s average has shown improvement over the three terms. On the other hand, Linda and Mary’s average has declined over the three terms. Academic intervention might be needed in order to help Linda and Mary’s test scores.

student_tidy |> 
    summarize(average_age = mean(age))
## # A tibble: 1 × 1
##   average_age
##         <dbl>
## 1        12.4

The average age of the students in the data is 12.4

student_tidy |> 
    count(sex)
## # A tibble: 2 × 2
##   sex       n
##   <chr> <int>
## 1 f        18
## 2 m        12

In this data, there are 18 female and 12 male students.

NAFTA Countries Population

Here, we are adding a column for growth rate where it take the current and prior data in order to calculate the growth rate for that year. If there is not prior data, then it returns 0.

nafta_tidy <- nafta_tidy |> 
    group_by(Country) |> 
    mutate(growth_rate = round((Population - lag(Population))/lag(Population) * 100, 2)) |> 
    replace_na(list(growth_rate = 0))

nafta_tidy
## # A tibble: 21 × 5
## # Groups:   Country [3]
##       id Country        Year Population growth_rate
##    <int> <chr>         <int>      <int>       <dbl>
##  1     1 Mexico         1994   88314424        0   
##  2     1 Mexico         1995   89969572        1.87
##  3     1 Mexico         1996   91586555        1.8 
##  4     1 Mexico         1997   93183094        1.74
##  5     1 Mexico         1998   94767284        1.7 
##  6     1 Mexico         1999   96334810        1.65
##  7     1 Mexico         2000   97873442        1.6 
##  8     2 United States  1994  263126000        0   
##  9     2 United States  1995  266278000        1.2 
## 10     2 United States  1996  269394000        1.17
## # … with 11 more rows
nafta_tidy |> ggplot(aes(x = Year, y = growth_rate, color = Country)) +
    geom_line() +
    geom_point() +
    geom_text(aes(label = round(growth_rate, 2)), vjust = -0.4) +
    labs(title = "Population Growth Rate in Years 1994 - 2000", x = "Year", y = "Percent Increase") +
    theme_bw()

In the years 1994 - 2000, we can observe that United States has maintained a relatively constant growth rate of around 1.2 percent. While, Mexico and Canada’s growth have not been constant. Particularly Canada, we see a decline in growth rate in 1997 and continued decreasing until 2000 where the growth rate increased from 0.82 to 0.94. Moreover, Mexico’s growth has been consistently decreasing yearly meaning that more people are leaving the country and less are moving into the country or the birth to death is less than 1. i.e. more people are dying than being born.


NBA Players Statistics

In Jacob Silver’s entry, he suggested to calculate: Shooting percentage and answer the question are the best teams those with the most players in the top 100 regardless of placement? or is “one guy” at the top of the mountain enough?

It seems the data already comes with percent statistics such as field goal percentage and 3-pointers made. Now, we will try to answer the question of does the best teams necessarily have the best players?


Disclaimer:

top teams and players for this project does not reflect the actual ranking since we do not know how the NBA calculates overall ratings. For the sake of this project, we will be using weighted.mean() function based on surface level knowledge in basketball.

Here, we summarize and arrange the data to identity the top 10 performing teams according to their overall rating.

top_teams <- nba_tidy %>%
  group_by(Team) |> 
  summarize(overall_rating = weighted.mean(`Field goals made`,
                                           `3-pointers made`,
                                           w = Points / sum(Points))) %>%
  arrange(desc(overall_rating)) |> 
  top_n(10, overall_rating)
top_teams |> 
    ggplot(aes(x = Team, y = overall_rating, fill = Team)) +
    geom_col() +
    geom_text(aes(label = Team), hjust = 0.5, vjust = -0.5) +
    labs(title = "Top 10 Teams by Overall Rating", x = "Team", y = "Overall Rating") +
    theme(legend.position = "none")

Similarly, adding offensive and defensive ratings then arranging them in descending order to identity who the top players are based on their offensive_rating because the best players are the ones with the most points.

nba_off_def_ratings <- nba_tidy |>
    group_by(Player_Name) |> 
    mutate(offensive_rating = weighted.mean(`Field goals made`,
                                           `3-pointers made`,
                                           `Offensive rebounds`,
                                           w = Points / sum(Points)),
           defensive_rating = weighted.mean(`Defensive rebounds`, 
                                            Turnovers,
                                            Steals,
                                            Blocks,
                                            w = `Defensive rebounds`/ sum(`Defensive rebounds`))) |> 
    mutate(across(c(offensive_rating, defensive_rating), as.double)) |> 
    select(Player_Name, Team, Position, offensive_rating, defensive_rating) |> 
    arrange(desc(offensive_rating))
nba_off_def_ratings |> 
    head(15) |> 
    ggplot(aes(x = offensive_rating, y = defensive_rating, color = Team)) +
    geom_point() +
    labs(title = "Offense vs Defense Rating", x = "Offensive Rating", y = "Defensive Rating") +
    scale_x_continuous(limits = c(9,11.5)) +
    geom_text(aes(label = Player_Name), vjust = 1.5, hjust = 1, check_overlap = TRUE, nudge_y = 0.5) +
    theme(legend.position = "none")

top_players <- nba_off_def_ratings |> 
    head(100)

Next, we use inner_join() to join the two data frame so that we can count how many players of the top 100 players are in each team.

top_teams |> 
    inner_join(top_players, by = "Team") |> 
    group_by(Team) |> 
    summarize(top_players = n()) |> 
    ggplot(aes(x = Team, y = top_players, fill = Team)) +
    geom_col() +
    theme(legend.position = "none") 

We can state that basketball is definitely a team sports where no one person can always carry the team. We set the top 100 players based on their offensive rating and counted how many of them are in the top 10 teams. There are around at most 3-4 players that are in the top 100 rating and are in a top 10 team. Thus, the answer to the question does the best teams necessarily have the best players? No, the best teams does not mean they have the best players according to this data.