Your task is to:
For each of the three chosen datasets:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(knitr)
# sourced from Alice Ding's Discussion Board Entry
student_data <- read.csv("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")
# from Jacob Silver's Discussion Board link to a google document where I dounloaded the csv file and then uploade the file to my github repo.
nba_data <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nba_players.csv")
# my discussion board entry on nafta countries population from 1994 to 2000
nafta_population <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nafta_countries.csv")
head(student_data)
## id name phone sex.and.age test.number term.1 term.2 term.3
## 1 1 Mike 134 m_12 test 1 76 84 87
## 2 2 Linda 270 f_13 test 1 88 90 73
## 3 3 Sam 210 m_11 test 1 78 74 80
## 4 4 Esther 617 f_12 test 1 68 75 74
## 5 5 Mary 114 f_14 test 1 65 67 64
## 6 1 Mike 134 m_12 test 2 85 80 90
head(nafta_population)
## id Country X1994 X1995 X1996 X1997 X1998 X1999
## 1 1 Mexico 88314424 89969572 91586555 93183094 94767284 96334810
## 2 2 United States 263126000 266278000 269394000 272657000 275854000 279040000
## 3 3 Canada 29000663 29302311 29610218 29905948 30155173 30401286
## X2000
## 1 97873442
## 2 282162411
## 3 30685730
head(nba_data)
## X1 X2 X3
## 1 Nikola Jokic Team: DEN Position: BIG
## 2 Field goals made: 9.3 Field goal percentage: 63.2
## 3 Free throw attempts: 6.1 Free throw percentage: 82.6
## 4 Assists: 10.0 Turnovers: 3.6
## 5 Giannis Antetokounmpo Team: MIL Position: BIG
## 6 Field goals made: 11.1 Field goal percentage: 53.8
## X4 X5
## 1 Age: 28 Points: 24.5
## 2 3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2 Defensive rebounds: 9.4
## 4 Steals: 1.3 Blocks: 0.7
## 5 Age: 28 Points: 31.3
## 6 3-pointers made: 0.8 3-pointer percentage: 27.0
Reference: https://tidyr.tidyverse.org/
In this code chunk, using the pivot_longer() function to pivot the wide data format to long format; selecting column names that start with term. to pivot, then creating a term column name to store the old column names and then creating another column name score to store the values of the old columns. After pivot_longer(), we need to separate the sex.and.age column to two distinct columns by using the separate() function. Finally, using select to arrange the column names.
student_tidy <- student_data |>
pivot_longer(cols = starts_with("term."), names_to = "term", values_to = "score") |>
separate(col = "sex.and.age", into = c("sex", "age"), sep = "_") |>
select(id, name, phone, sex, age, test.number, term, score) |>
mutate(term = if_else(term == "term.1", 1,
if_else(term == "term.2", 2,
if_else(term == "term.3", 3, NA_real_)))) |>
mutate(age = as.integer(age)) |>
arrange(id, term)
head(student_tidy)
## # A tibble: 6 × 8
## id name phone sex age test.number term score
## <int> <chr> <int> <chr> <int> <chr> <dbl> <int>
## 1 1 Mike 134 m 12 test 1 1 76
## 2 1 Mike 134 m 12 test 2 1 85
## 3 1 Mike 134 m 12 test 1 2 84
## 4 1 Mike 134 m 12 test 2 2 80
## 5 1 Mike 134 m 12 test 1 3 87
## 6 1 Mike 134 m 12 test 2 3 90
Here, we are using pivot_longer() function to pivot the columns that start with X, then using mutate to remove the letter X in front of each year.
nafta_tidy <- nafta_population |>
pivot_longer(cols = starts_with("X"), names_to = "Year", values_to = "Population") |>
mutate(Year = str_remove_all(Year, "^X")) |>
mutate(Year = as.integer(Year))
kable(nafta_tidy)
| id | Country | Year | Population |
|---|---|---|---|
| 1 | Mexico | 1994 | 88314424 |
| 1 | Mexico | 1995 | 89969572 |
| 1 | Mexico | 1996 | 91586555 |
| 1 | Mexico | 1997 | 93183094 |
| 1 | Mexico | 1998 | 94767284 |
| 1 | Mexico | 1999 | 96334810 |
| 1 | Mexico | 2000 | 97873442 |
| 2 | United States | 1994 | 263126000 |
| 2 | United States | 1995 | 266278000 |
| 2 | United States | 1996 | 269394000 |
| 2 | United States | 1997 | 272657000 |
| 2 | United States | 1998 | 275854000 |
| 2 | United States | 1999 | 279040000 |
| 2 | United States | 2000 | 282162411 |
| 3 | Canada | 1994 | 29000663 |
| 3 | Canada | 1995 | 29302311 |
| 3 | Canada | 1996 | 29610218 |
| 3 | Canada | 1997 | 29905948 |
| 3 | Canada | 1998 | 30155173 |
| 3 | Canada | 1999 | 30401286 |
| 3 | Canada | 2000 | 30685730 |
The NBA data is more complicated and thus requires more cleaning since we have columns that are formatted in a way that the values of a player’s statistics are right in the same line. Therefore, after using the pivot_longer() function, we use the separate() function to separate the statistic and its value. Now, we want to make each statistic its own column and have its value as its entry. At the same time filling the empty row in the name column with name of the player. The next step is to use pivot_wider() so that each column is a statistic, every row is an observation, and every cell is a single value. Thus, satisfying the constraints of tidy data.
nba_tidy <- nba_data %>%
pivot_longer(cols = c(X2:X5), names_to = "old_colnames", values_to = "value") |>
separate(value, c("statistic", "measure"), sep = ":") |>
select(-old_colnames) |>
mutate(X1 = if_else(X1 == "", NA_character_, X1)) |>
fill(X1) |>
pivot_wider(names_from = statistic, values_from = measure) |>
rename("Player_Name" = X1) |>
arrange(Team, desc(Points)) |>
mutate(across(c(Age, Points, `Field goals made`, `Field goal percentage`, `Offensive rebounds`, `Defensive rebounds`, Assists, Turnovers, Steals, Blocks), as.numeric))
nba_tidy_sliced <- nba_tidy |>
slice(1:10)
kable(nba_tidy_sliced)
| Player_Name | Team | Position | Age | Points | Field goals made | Field goal percentage | 3-pointers made | 3-pointer percentage | Free throw attempts | Free throw percentage | Offensive rebounds | Defensive rebounds | Assists | Turnovers | Steals | Blocks |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Trae Young | ATL | GUARD | 24 | 26.9 | 8.4 | 42.8 | 2.2 | 32.9 | 8.8 | 88.7 | 0.7 | 2.2 | 10.2 | 4.1 | 1.0 | 0.2 |
| Dejounte Murray | ATL | GUARD | 26 | 20.8 | 8.4 | 46.5 | 2.0 | 36.2 | 2.5 | 83.7 | 0.7 | 4.9 | 6.1 | 2.3 | 1.5 | 0.3 |
| Clint Capela | ATL | BIG | 28 | 12.1 | 5.5 | 62.9 | 0.0 | 0.0 | 1.8 | 61.3 | 4.2 | 7.0 | 0.8 | 0.8 | 0.7 | 1.2 |
| Mikal Bridges | BKN | FORWARD | 26 | 17.6 | 6.4 | 46.7 | 1.9 | 39.6 | 3.2 | 89.6 | 1.1 | 3.4 | 3.5 | 1.5 | 1.2 | 0.8 |
| Spencer Dinwiddie | BKN | GUARD | 29 | 17.5 | 5.8 | 44.8 | 2.5 | 39.6 | 4.0 | 82.5 | 0.3 | 2.8 | 5.3 | 1.7 | 0.8 | 0.3 |
| Cameron Johnson | BKN | FORWARD | 26 | 13.8 | 4.8 | 45.3 | 2.6 | 41.2 | 2.0 | 79.1 | 0.7 | 3.0 | 1.6 | 0.5 | 1.0 | 0.3 |
| Nic Claxton | BKN | BIG | 23 | 12.7 | 5.5 | 71.7 | 0.0 | 0.0 | 3.3 | 50.6 | 2.5 | 6.5 | 1.7 | 1.3 | 0.8 | 2.5 |
| Al Horford | BOS | BIG | 36 | 9.3 | 3.5 | 47.0 | 2.1 | 43.0 | 0.4 | 72.2 | 1.2 | 5.0 | 2.7 | 0.6 | 0.5 | 0.9 |
| Robert Williams III | BOS | BIG | 25 | 8.5 | 4.0 | 75.0 | 0.0 | 0.0 | 0.9 | 65.2 | 3.2 | 5.5 | 1.6 | 0.9 | 0.5 | 1.1 |
| Jayson Tatum | BOS | FORWARD | 24 | 30.4 | 9.9 | 46.1 | 3.3 | 35.5 | 8.5 | 86.4 | 1.1 | 7.6 | 4.6 | 2.9 | 1.0 | 0.8 |
For students data from Alice Ding’s discussion entry, she suggested to find the average test score for each student or the average test score overall and count the number male and female student, and the average age in the data.
avg_score_term <- student_tidy |>
group_by(name, term) |>
summarize(average_score = round(mean(score), 2)) |>
arrange(name, term)
## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.
avg_score_overall <- student_tidy |>
group_by(name) |>
summarize(average_score = round(mean(score), 2)) |>
arrange(name)
avg_score_term
## # A tibble: 15 × 3
## # Groups: name [5]
## name term average_score
## <chr> <dbl> <dbl>
## 1 Esther 1 69
## 2 Esther 2 75
## 3 Esther 3 76
## 4 Linda 1 87.5
## 5 Linda 2 86
## 6 Linda 3 83.5
## 7 Mary 1 66.5
## 8 Mary 2 68.5
## 9 Mary 3 63.5
## 10 Mike 1 80.5
## 11 Mike 2 82
## 12 Mike 3 88.5
## 13 Sam 1 79
## 14 Sam 2 80.5
## 15 Sam 3 80
avg_score_overall
## # A tibble: 5 × 2
## name average_score
## <chr> <dbl>
## 1 Esther 73.3
## 2 Linda 85.7
## 3 Mary 66.2
## 4 Mike 83.7
## 5 Sam 79.8
avg_score_term |>
ggplot(aes(x = name, y = average_score, fill = factor(term))) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = round(average_score, 1)), vjust = -0.3, position = position_dodge(0.9)) +
labs(x = "Name", y = "Average score", fill = "Term") +
ggtitle("Average Test Score per Term")
We observe that Esther and Mike’s average has shown improvement over the three terms. On the other hand, Linda and Mary’s average has declined over the three terms. Academic intervention might be needed in order to help Linda and Mary’s test scores.
student_tidy |>
summarize(average_age = mean(age))
## # A tibble: 1 × 1
## average_age
## <dbl>
## 1 12.4
The average age of the students in the data is 12.4
student_tidy |>
count(sex)
## # A tibble: 2 × 2
## sex n
## <chr> <int>
## 1 f 18
## 2 m 12
In this data, there are 18 female and 12 male students.
Here, we are adding a column for growth rate where it take the current and prior data in order to calculate the growth rate for that year. If there is not prior data, then it returns 0.
nafta_tidy <- nafta_tidy |>
group_by(Country) |>
mutate(growth_rate = round((Population - lag(Population))/lag(Population) * 100, 2)) |>
replace_na(list(growth_rate = 0))
nafta_tidy
## # A tibble: 21 × 5
## # Groups: Country [3]
## id Country Year Population growth_rate
## <int> <chr> <int> <int> <dbl>
## 1 1 Mexico 1994 88314424 0
## 2 1 Mexico 1995 89969572 1.87
## 3 1 Mexico 1996 91586555 1.8
## 4 1 Mexico 1997 93183094 1.74
## 5 1 Mexico 1998 94767284 1.7
## 6 1 Mexico 1999 96334810 1.65
## 7 1 Mexico 2000 97873442 1.6
## 8 2 United States 1994 263126000 0
## 9 2 United States 1995 266278000 1.2
## 10 2 United States 1996 269394000 1.17
## # … with 11 more rows
nafta_tidy |> ggplot(aes(x = Year, y = growth_rate, color = Country)) +
geom_line() +
geom_point() +
geom_text(aes(label = round(growth_rate, 2)), vjust = -0.4) +
labs(title = "Population Growth Rate in Years 1994 - 2000", x = "Year", y = "Percent Increase") +
theme_bw()
In the years 1994 - 2000, we can observe that United States has maintained a relatively constant growth rate of around 1.2 percent. While, Mexico and Canada’s growth have not been constant. Particularly Canada, we see a decline in growth rate in 1997 and continued decreasing until 2000 where the growth rate increased from 0.82 to 0.94. Moreover, Mexico’s growth has been consistently decreasing yearly meaning that more people are leaving the country and less are moving into the country or the birth to death is less than 1. i.e. more people are dying than being born.
In Jacob Silver’s entry, he suggested to calculate: Shooting percentage and answer the question are the best teams those with the most players in the top 100 regardless of placement? or is “one guy” at the top of the mountain enough?
It seems the data already comes with percent statistics such as field goal percentage and 3-pointers made. Now, we will try to answer the question of does the best teams necessarily have the best players?
top teams and players for this project does not reflect the actual ranking since we do not know how the NBA calculates overall ratings. For the sake of this project, we will be using weighted.mean() function based on surface level knowledge in basketball.
Here, we summarize and arrange the data to identity the top 10 performing teams according to their overall rating.
top_teams <- nba_tidy %>%
group_by(Team) |>
summarize(overall_rating = weighted.mean(`Field goals made`,
`3-pointers made`,
w = Points / sum(Points))) %>%
arrange(desc(overall_rating)) |>
top_n(10, overall_rating)
top_teams |>
ggplot(aes(x = Team, y = overall_rating, fill = Team)) +
geom_col() +
geom_text(aes(label = Team), hjust = 0.5, vjust = -0.5) +
labs(title = "Top 10 Teams by Overall Rating", x = "Team", y = "Overall Rating") +
theme(legend.position = "none")
Similarly, adding offensive and defensive ratings then arranging them in descending order to identity who the top players are based on their offensive_rating because the best players are the ones with the most points.
nba_off_def_ratings <- nba_tidy |>
group_by(Player_Name) |>
mutate(offensive_rating = weighted.mean(`Field goals made`,
`3-pointers made`,
`Offensive rebounds`,
w = Points / sum(Points)),
defensive_rating = weighted.mean(`Defensive rebounds`,
Turnovers,
Steals,
Blocks,
w = `Defensive rebounds`/ sum(`Defensive rebounds`))) |>
mutate(across(c(offensive_rating, defensive_rating), as.double)) |>
select(Player_Name, Team, Position, offensive_rating, defensive_rating) |>
arrange(desc(offensive_rating))
nba_off_def_ratings |>
head(15) |>
ggplot(aes(x = offensive_rating, y = defensive_rating, color = Team)) +
geom_point() +
labs(title = "Offense vs Defense Rating", x = "Offensive Rating", y = "Defensive Rating") +
scale_x_continuous(limits = c(9,11.5)) +
geom_text(aes(label = Player_Name), vjust = 1.5, hjust = 1, check_overlap = TRUE, nudge_y = 0.5) +
theme(legend.position = "none")
top_players <- nba_off_def_ratings |>
head(100)
Next, we use inner_join() to join the two data frame so that we can count how many players of the top 100 players are in each team.
top_teams |>
inner_join(top_players, by = "Team") |>
group_by(Team) |>
summarize(top_players = n()) |>
ggplot(aes(x = Team, y = top_players, fill = Team)) +
geom_col() +
theme(legend.position = "none")
We can state that basketball is definitely a team sports where no one person can always carry the team. We set the top 100 players based on their offensive rating and counted how many of them are in the top 10 teams. There are around at most 3-4 players that are in the top 100 rating and are in a top 10 team. Thus, the answer to the question does the best teams necessarily have the best players? No, the best teams does not mean they have the best players according to this data.