Your task is to:

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. (You may use your own dataset; please don’t use my Sample Post dataset, since that was used in your Week 6 assignment!)

For each of the three chosen datasets:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!] Perform the analysis requested in the discussion item. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.

Please include in your homework submission, for each of the three chosen datasets: The URL to the .Rmd file in your GitHub repository, and The URL for your rpubs.com web page.

Import Tidyverse

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(knitr)

Load the three distinct data sets…

# sourced from Alice Ding's Discussion Board Entry
student_data <- read.csv("https://gist.githubusercontent.com/Kimmirikwa/b69d0ea134820ea52f8481991ffae93e/raw/4db7b1698035ee29885d10e1a59bd902716ae168/student_results.csv")

# from Jacob Silver's Discussion Board link to a google document where I dounloaded the csv file and then uploade the file to my github repo.
nba_data <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nba_players.csv")

# my discussion board entry on nafta countries population from 1994 to 2000
nafta_population <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/nafta_countries.csv")

Taking a look at the untidy data using the head() function

head(student_data)

##   id   name phone sex.and.age test.number term.1 term.2 term.3
## 1  1   Mike   134        m_12      test 1     76     84     87
## 2  2  Linda   270        f_13      test 1     88     90     73
## 3  3    Sam   210        m_11      test 1     78     74     80
## 4  4 Esther   617        f_12      test 1     68     75     74
## 5  5   Mary   114        f_14      test 1     65     67     64
## 6  1   Mike   134        m_12      test 2     85     80     90

head(nafta_population)

##   id       Country     X1994     X1995     X1996     X1997     X1998     X1999
## 1  1        Mexico  88314424  89969572  91586555  93183094  94767284  96334810
## 2  2 United States 263126000 266278000 269394000 272657000 275854000 279040000
## 3  3        Canada  29000663  29302311  29610218  29905948  30155173  30401286
##       X2000
## 1  97873442
## 2 282162411
## 3  30685730

head(nba_data)

##                      X1                       X2                          X3
## 1          Nikola Jokic                Team: DEN               Position: BIG
## 2                          Field goals made: 9.3 Field goal percentage: 63.2
## 3                       Free throw attempts: 6.1 Free throw percentage: 82.6
## 4                                  Assists: 10.0              Turnovers: 3.6
## 5 Giannis Antetokounmpo                Team: MIL               Position: BIG
## 6                         Field goals made: 11.1 Field goal percentage: 53.8
##                        X4                         X5
## 1                 Age: 28               Points: 24.5
## 2    3-pointers made: 0.8 3-pointer percentage: 40.2
## 3 Offensive rebounds: 2.2    Defensive rebounds: 9.4
## 4             Steals: 1.3                Blocks: 0.7
## 5                 Age: 28               Points: 31.3
## 6    3-pointers made: 0.8 3-pointer percentage: 27.0

Clean and Tidy

Constraints of Tidy data is where:
1. Every column is a variable
2. Every row is an observation
3. Every cell is a single value

Reference: https://tidyr.tidyverse.org/

Tidying Student Data

In this code chunk, using the pivot_longer() function to pivot the wide data format to long format; selecting column names that start with term. to pivot, then creating a term column name to store the old column names and then creating another column name score to store the values of the old columns. After pivot_longer(), we need to separate the sex.and.age column to two distinct columns by using the separate() function. Finally, using select to arrange the column names.

student_tidy <- student_data |> 
    pivot_longer(cols = starts_with("term."), names_to = "term", values_to = "score") |> 
    separate(col = "sex.and.age", into = c("sex", "age"), sep = "_") |> 
    select(id, name, phone, sex, age, test.number, term, score) |> 
    mutate(term = if_else(term == "term.1", 1, 
                          if_else(term == "term.2", 2, 
                                  if_else(term == "term.3", 3, NA_real_)))) |>
    mutate(age = as.integer(age)) |> 
    arrange(id, term)

head(student_tidy)

## # A tibble: 6 × 8
##      id name  phone sex     age test.number  term score
##   <int> <chr> <int> <chr> <int> <chr>       <dbl> <int>
## 1     1 Mike    134 m        12 test 1          1    76
## 2     1 Mike    134 m        12 test 2          1    85
## 3     1 Mike    134 m        12 test 1          2    84
## 4     1 Mike    134 m        12 test 2          2    80
## 5     1 Mike    134 m        12 test 1          3    87
## 6     1 Mike    134 m        12 test 2          3    90

Tidying NAFTA Populations Data

Here, we are using pivot_longer() function to pivot the columns that start with X, then using mutate to remove the letter X in front of each year.

nafta_tidy <- nafta_population |> 
    pivot_longer(cols = starts_with("X"), names_to = "Year", values_to = "Population") |> 
    mutate(Year = str_remove_all(Year, "^X")) |>  
    mutate(Year = as.integer(Year))
kable(nafta_tidy)

id	Country	Year	Population
1	Mexico	1994	88314424
1	Mexico	1995	89969572
1	Mexico	1996	91586555
1	Mexico	1997	93183094
1	Mexico	1998	94767284
1	Mexico	1999	96334810
1	Mexico	2000	97873442
2	United States	1994	263126000
2	United States	1995	266278000
2	United States	1996	269394000
2	United States	1997	272657000
2	United States	1998	275854000
2	United States	1999	279040000
2	United States	2000	282162411
3	Canada	1994	29000663
3	Canada	1995	29302311
3	Canada	1996	29610218
3	Canada	1997	29905948
3	Canada	1998	30155173
3	Canada	1999	30401286
3	Canada	2000	30685730

Tidying NBA Players Data

The NBA data is more complicated and thus requires more cleaning since we have columns that are formatted in a way that the values of a player’s statistics are right in the same line. Therefore, after using the pivot_longer() function, we use the separate() function to separate the statistic and its value. Now, we want to make each statistic its own column and have its value as its entry. At the same time filling the empty row in the name column with name of the player. The next step is to use pivot_wider() so that each column is a statistic, every row is an observation, and every cell is a single value. Thus, satisfying the constraints of tidy data.

nba_tidy <- nba_data %>% 
    pivot_longer(cols = c(X2:X5), names_to = "old_colnames", values_to = "value") |> 
    separate(value, c("statistic", "measure"), sep = ":") |>
    select(-old_colnames) |> 
    mutate(X1 = if_else(X1 == "", NA_character_, X1)) |> 
    fill(X1) |> 
    pivot_wider(names_from = statistic, values_from = measure) |> 
    rename("Player_Name" = X1)  |>
    arrange(Team, desc(Points)) |> 
    mutate(across(c(Age, Points, `Field goals made`, `Field goal percentage`, `Offensive rebounds`, `Defensive rebounds`, Assists, Turnovers, Steals, Blocks), as.numeric))

 
nba_tidy_sliced <- nba_tidy |>
    slice(1:10)  

kable(nba_tidy_sliced)

Player_Name	Team	Position	Age	Points	Field goals made	Field goal percentage	3-pointers made	3-pointer percentage	Free throw attempts	Free throw percentage	Offensive rebounds	Defensive rebounds	Assists	Turnovers	Steals	Blocks
Trae Young	ATL	GUARD	24	26.9	8.4	42.8	2.2	32.9	8.8	88.7	0.7	2.2	10.2	4.1	1.0	0.2
Dejounte Murray	ATL	GUARD	26	20.8	8.4	46.5	2.0	36.2	2.5	83.7	0.7	4.9	6.1	2.3	1.5	0.3
Clint Capela	ATL	BIG	28	12.1	5.5	62.9	0.0	0.0	1.8	61.3	4.2	7.0	0.8	0.8	0.7	1.2
Mikal Bridges	BKN	FORWARD	26	17.6	6.4	46.7	1.9	39.6	3.2	89.6	1.1	3.4	3.5	1.5	1.2	0.8
Spencer Dinwiddie	BKN	GUARD	29	17.5	5.8	44.8	2.5	39.6	4.0	82.5	0.3	2.8	5.3	1.7	0.8	0.3
Cameron Johnson	BKN	FORWARD	26	13.8	4.8	45.3	2.6	41.2	2.0	79.1	0.7	3.0	1.6	0.5	1.0	0.3
Nic Claxton	BKN	BIG	23	12.7	5.5	71.7	0.0	0.0	3.3	50.6	2.5	6.5	1.7	1.3	0.8	2.5
Al Horford	BOS	BIG	36	9.3	3.5	47.0	2.1	43.0	0.4	72.2	1.2	5.0	2.7	0.6	0.5	0.9
Robert Williams III	BOS	BIG	25	8.5	4.0	75.0	0.0	0.0	0.9	65.2	3.2	5.5	1.6	0.9	0.5	1.1
Jayson Tatum	BOS	FORWARD	24	30.4	9.9	46.1	3.3	35.5	8.5	86.4	1.1	7.6	4.6	2.9	1.0	0.8

Analysis and Visualization

Student Test Scores

For students data from Alice Ding’s discussion entry, she suggested to find the average test score for each student or the average test score overall and count the number male and female student, and the average age in the data.

avg_score_term <- student_tidy |> 
    group_by(name, term) |> 
    summarize(average_score = round(mean(score), 2)) |> 
    arrange(name, term)

## `summarise()` has grouped output by 'name'. You can override using the
## `.groups` argument.

avg_score_overall <- student_tidy |> 
    group_by(name) |> 
    summarize(average_score = round(mean(score), 2)) |> 
    arrange(name)

avg_score_term

## # A tibble: 15 × 3
## # Groups:   name [5]
##    name    term average_score
##    <chr>  <dbl>         <dbl>
##  1 Esther     1          69  
##  2 Esther     2          75  
##  3 Esther     3          76  
##  4 Linda      1          87.5
##  5 Linda      2          86  
##  6 Linda      3          83.5
##  7 Mary       1          66.5
##  8 Mary       2          68.5
##  9 Mary       3          63.5
## 10 Mike       1          80.5
## 11 Mike       2          82  
## 12 Mike       3          88.5
## 13 Sam        1          79  
## 14 Sam        2          80.5
## 15 Sam        3          80

avg_score_overall

## # A tibble: 5 × 2
##   name   average_score
##   <chr>          <dbl>
## 1 Esther          73.3
## 2 Linda           85.7
## 3 Mary            66.2
## 4 Mike            83.7
## 5 Sam             79.8

avg_score_term |> 
    ggplot(aes(x = name, y = average_score, fill = factor(term))) +
    geom_bar(stat = "identity", position = "dodge") +
    geom_text(aes(label = round(average_score, 1)), vjust = -0.3, position = position_dodge(0.9)) +
    labs(x = "Name", y = "Average score", fill = "Term") + 
    ggtitle("Average Test Score per Term")

We observe that Esther and Mike’s average has shown improvement over the three terms. On the other hand, Linda and Mary’s average has declined over the three terms. Academic intervention might be needed in order to help Linda and Mary’s test scores.

student_tidy |> 
    summarize(average_age = mean(age))

## # A tibble: 1 × 1
##   average_age
##         <dbl>
## 1        12.4

The average age of the students in the data is 12.4

student_tidy |> 
    count(sex)

## # A tibble: 2 × 2
##   sex       n
##   <chr> <int>
## 1 f        18
## 2 m        12

In this data, there are 18 female and 12 male students.

NAFTA Countries Population

Here, we are adding a column for growth rate where it take the current and prior data in order to calculate the growth rate for that year. If there is not prior data, then it returns 0.

nafta_tidy <- nafta_tidy |> 
    group_by(Country) |> 
    mutate(growth_rate = round((Population - lag(Population))/lag(Population) * 100, 2)) |> 
    replace_na(list(growth_rate = 0))

nafta_tidy

## # A tibble: 21 × 5
## # Groups:   Country [3]
##       id Country        Year Population growth_rate
##    <int> <chr>         <int>      <int>       <dbl>
##  1     1 Mexico         1994   88314424        0   
##  2     1 Mexico         1995   89969572        1.87
##  3     1 Mexico         1996   91586555        1.8 
##  4     1 Mexico         1997   93183094        1.74
##  5     1 Mexico         1998   94767284        1.7 
##  6     1 Mexico         1999   96334810        1.65
##  7     1 Mexico         2000   97873442        1.6 
##  8     2 United States  1994  263126000        0   
##  9     2 United States  1995  266278000        1.2 
## 10     2 United States  1996  269394000        1.17
## # … with 11 more rows

nafta_tidy |> ggplot(aes(x = Year, y = growth_rate, color = Country)) +
    geom_line() +
    geom_point() +
    geom_text(aes(label = round(growth_rate, 2)), vjust = -0.4) +
    labs(title = "Population Growth Rate in Years 1994 - 2000", x = "Year", y = "Percent Increase") +
    theme_bw()

In the years 1994 - 2000, we can observe that United States has maintained a relatively constant growth rate of around 1.2 percent. While, Mexico and Canada’s growth have not been constant. Particularly Canada, we see a decline in growth rate in 1997 and continued decreasing until 2000 where the growth rate increased from 0.82 to 0.94. Moreover, Mexico’s growth has been consistently decreasing yearly meaning that more people are leaving the country and less are moving into the country or the birth to death is less than 1. i.e. more people are dying than being born.

NBA Players Statistics

In Jacob Silver’s entry, he suggested to calculate: Shooting percentage and answer the question are the best teams those with the most players in the top 100 regardless of placement? or is “one guy” at the top of the mountain enough?

It seems the data already comes with percent statistics such as field goal percentage and 3-pointers made. Now, we will try to answer the question of does the best teams necessarily have the best players?

Disclaimer:

top teams and players for this project does not reflect the actual ranking since we do not know how the NBA calculates overall ratings. For the sake of this project, we will be using weighted.mean() function based on surface level knowledge in basketball.

Here, we summarize and arrange the data to identity the top 10 performing teams according to their overall rating.

top_teams <- nba_tidy %>%
  group_by(Team) |> 
  summarize(overall_rating = weighted.mean(`Field goals made`,
                                           `3-pointers made`,
                                           w = Points / sum(Points))) %>%
  arrange(desc(overall_rating)) |> 
  top_n(10, overall_rating)

top_teams |> 
    ggplot(aes(x = Team, y = overall_rating, fill = Team)) +
    geom_col() +
    geom_text(aes(label = Team), hjust = 0.5, vjust = -0.5) +
    labs(title = "Top 10 Teams by Overall Rating", x = "Team", y = "Overall Rating") +
    theme(legend.position = "none")

Similarly, adding offensive and defensive ratings then arranging them in descending order to identity who the top players are based on their offensive_rating because the best players are the ones with the most points.

nba_off_def_ratings <- nba_tidy |>
    group_by(Player_Name) |> 
    mutate(offensive_rating = weighted.mean(`Field goals made`,
                                           `3-pointers made`,
                                           `Offensive rebounds`,
                                           w = Points / sum(Points)),
           defensive_rating = weighted.mean(`Defensive rebounds`, 
                                            Turnovers,
                                            Steals,
                                            Blocks,
                                            w = `Defensive rebounds`/ sum(`Defensive rebounds`))) |> 
    mutate(across(c(offensive_rating, defensive_rating), as.double)) |> 
    select(Player_Name, Team, Position, offensive_rating, defensive_rating) |> 
    arrange(desc(offensive_rating))

nba_off_def_ratings |> 
    head(15) |> 
    ggplot(aes(x = offensive_rating, y = defensive_rating, color = Team)) +
    geom_point() +
    labs(title = "Offense vs Defense Rating", x = "Offensive Rating", y = "Defensive Rating") +
    scale_x_continuous(limits = c(9,11.5)) +
    geom_text(aes(label = Player_Name), vjust = 1.5, hjust = 1, check_overlap = TRUE, nudge_y = 0.5) +
    theme(legend.position = "none")

top_players <- nba_off_def_ratings |> 
    head(100)

Next, we use inner_join() to join the two data frame so that we can count how many players of the top 100 players are in each team.

top_teams |> 
    inner_join(top_players, by = "Team") |> 
    group_by(Team) |> 
    summarize(top_players = n()) |> 
    ggplot(aes(x = Team, y = top_players, fill = Team)) +
    geom_col() +
    theme(legend.position = "none")

We can state that basketball is definitely a team sports where no one person can always carry the team. We set the top 100 players based on their offensive rating and counted how many of them are in the top 10 teams. There are around at most 3-4 players that are in the top 100 rating and are in a top 10 team. Thus, the answer to the question does the best teams necessarily have the best players? No, the best teams does not mean they have the best players according to this data.

Project 2

Nick Climaco

2023-03-02