L10 Data Wrangling

# Load package(s)
library(ggplot2)
library(tidyverse)
library(dplyr)
library(knitr)

# Read in the datasets
mod_nba2014_15 <- read_delim(file = "data/mod_nba2014_15_advanced.txt", 
                             delim = "|")
NU_admission_data <- read_csv(file = "data/NU_admission_data.csv")

Above, I have loaded four packages: ggplot2, tidyverse, dplyr, and knitr. Loading tidyverse already attaches ggplot2 and dplyr, so the separate calls for those two are redundant but harmless. These packages help me build the plots for this exercise. After loading the packages, I read in two datasets, mod_nba2014_15 and NU_admission_data. The mod_nba2014_15 dataset contains performance statistics for professional basketball players in the National Basketball Association during the 2014-15 season. The NU_admission_data dataset contains admissions data for students who applied and were admitted to Northwestern University, organized by year.

Exercise 1

quartile_rank <- function(x = 0:99) {
  
  # Set quartile
  quart_breaks <- c(
    -Inf,
    quantile(x,
      probs = c(.25, .5, .75),
      na.rm = TRUE
    ),
    Inf
  )

  cut(x = x, breaks = quart_breaks, labels = FALSE)
}

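The function above assigns each value of a statistic to its quartile, returning 1 for the lowest quarter of values through 4 for the highest, with cut points at the 25th, 50th, and 75th percentiles. It does not filter the dataset on its own. The intention is to apply it after excluding players who played fewer than 10 games or fewer than 5 minutes a game, so that each remaining player is ranked only against other players with meaningful playing time.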
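As a rough sketch of that two-step idea (filter first, then rank), the chunk below uses a small made-up table. The column names games, minutes_per_game, and usage are assumptions for illustration and are not the real column names in mod_nba2014_15; the player rows are invented as well.

```r
# Same quartile helper as above, repeated so this sketch runs on its own
quartile_rank <- function(x = 0:99) {
  quart_breaks <- c(-Inf, quantile(x, probs = c(.25, .5, .75), na.rm = TRUE), Inf)
  cut(x = x, breaks = quart_breaks, labels = FALSE)
}

# Toy data: hypothetical column names and invented values
toy_players <- data.frame(
  player = paste("Player", 1:8),
  games = c(82, 5, 60, 71, 9, 44, 80, 33),
  minutes_per_game = c(36.1, 12.0, 4.2, 28.5, 20.3, 15.7, 31.9, 8.8),
  usage = c(31.3, 18.0, 12.5, 24.1, 20.2, 16.4, 27.8, 14.9)
)

# Step 1: drop players below the playing-time thresholds
eligible <- subset(toy_players, games >= 10 & minutes_per_game >= 5)

# Step 2: rank the remaining players by usage quartile (1 = lowest, 4 = highest)
eligible$usage_quartile <- quartile_rank(eligible$usage)

eligible[, c("player", "usage", "usage_quartile")]
```

Filtering before ranking matters here: the quartile cut points are computed from whatever vector is passed in, so ranking the unfiltered table would let low-minute players shift the breaks.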

# Pull James Harden's row from the full dataset
James_Harden <- subset(mod_nba2014_15, Player == "James Harden")

# Four of his statistics, entered by hand as percentages
harden_stats <- data.frame(group = c("Assist Rate", "True Shooting",
                                     "Usage Rate", "Rebound Rate"),
                           value = c(34.6, 60.5, 31.3, 2.8))

# Bar chart wrapped around a circle; bar length maps to the angle
ggplot(harden_stats, aes(x = group, y = value, fill = group)) +
  geom_col() +
  coord_polar(theta = "y") +
  ggtitle("James Harden \n (2015)") +
  ylim(c(0, 65)) +
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        legend.title = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  xlab("") +
  ylab("")

Above, I have created a polar bar chart (bars wrapped around a circle with coord_polar) from the statistics of NBA player James Harden: his Assist Rate, Rebound Rate, True Shooting, and Usage Rate during the 2014-15 NBA season. Each value is a percentage. Harden appears to have had a productive offensive season, with a true shooting percentage of 60.5. I know this isn’t the exact graph that we were asked to produce, and I apologize for that. I was having a hard time with it.
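The chart data above is typed in by hand. A possible alternative, sketched here under assumptions, is to reshape a one-row table of statistics into the two-column layout that the plot expects. The column names below are hypothetical and are not the real names in mod_nba2014_15; the values are the same four used in the chart.

```r
# Hypothetical one-row table; column names are assumptions for illustration
harden_wide <- data.frame(
  assist_rate = 34.6,
  true_shooting = 60.5,
  usage_rate = 31.3,
  rebound_rate = 2.8
)

# One row per statistic: the shape expected by aes(x = group, y = value)
harden_long <- data.frame(
  group = names(harden_wide),
  value = unlist(harden_wide, use.names = FALSE)
)

harden_long
```

With the real dataset, the same reshaping would start from the selected columns of the subset row instead of a hand-built table.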

Exercise 2

# Entering years 1999 through 2018, each repeated once per category;
# kept as character so the years are treated as discrete labels
Year <- as.character(rep(1999:2018, each = 3))

# Counts per year, in the order Applicants, Admitted Students, Matriculants.
# Stored as numbers (not quoted strings) so they plot on a continuous scale.
Students <- c(15460, 4999, 1953,
              14725, 4827, 1893,
              13987, 4780, 1952,
              14283, 4701, 2005,
              14139, 4687, 1941,
              15654, 4673, 1915,
              16235, 4807, 1952,
              18394, 5443, 2089,
              21934, 5876, 1981,
              24994, 6552, 2079,
              25369, 6887, 2128,
              27528, 6367, 2127,
              30905, 5554, 2107,
              32060, 4912, 2037,
              32796, 4598, 2040,
              33674, 4416, 2043,
              32106, 4181, 2018,
              35107, 3743, 1985,
              37259, 3442, 1903,
              40425, 3422, 1931)

# The three categories, repeated once for each of the 20 years
Label <- rep(c('Applicants', 'Admitted Students', 'Matriculants'), times = 20)

admissions.data <- data.frame(Year, Students, Label)

Above, I have created a data frame called admissions.data with three variables: Year (1999, 2000, 2001, …, 2018), Students (the number of students in the given category), and Label (whether the Students value counts Applicants, Admitted Students, or Matriculants). The data frame is in long format, with one row per year and category, and will be used to build a stacked bar plot that includes all three categories.
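Because the three vectors are built separately, it may be worth checking that they line up. The sketch below does this on three years copied from the values above (1999, 2008, and 2018), assuming Students is stored as numbers. The names admissions_subset, pair_counts, and wide are introduced here only for illustration.

```r
# Three years copied from the vectors above, kept short so the sketch runs on its own
admissions_subset <- data.frame(
  Year = rep(c("1999", "2008", "2018"), each = 3),
  Students = c(15460, 4999, 1953,
               24994, 6552, 2079,
               40425, 3422, 1931),
  Label = rep(c("Applicants", "Admitted Students", "Matriculants"), times = 3)
)

# Each Year/Label pair should appear exactly once
pair_counts <- table(admissions_subset$Year, admissions_subset$Label)
all(pair_counts == 1)

# Wide view: one row per year, one column per category
wide <- xtabs(Students ~ Year + Label, data = admissions_subset)
wide
```

The wide view makes it easy to confirm by eye that applicants exceed admitted students, which in turn exceed matriculants, in every year.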

# The `order` aesthetic is no longer supported by ggplot2, so it is dropped;
# stacking order follows the levels of Label
ggplot(admissions.data, aes(Year, Students, fill = Label)) +
  geom_col() +
  theme(axis.text.y = element_blank(),
        axis.text.x = element_text(angle = 90),
        axis.ticks = element_blank(),
        legend.title = element_blank()) +
  ylab("Applications") +
  xlab("Entering Year")

The bar chart above is built from the admissions.data data frame and is drawn as a stacked bar plot. Although it captures the data in a single view, I believe three separate bar charts, one per category (Applicants, Admitted Students, and Matriculants), would work better, because the categories are hard to compare against each other in a stacked bar plot. I know this isn’t the exact plot we were asked to produce, and I am sorry. I was having a hard time with it.
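One way the three separate bar charts could be drawn is with facets, so that each category gets its own panel and its own y-axis. This is only a sketch, not the plot the exercise asked for. It uses three years copied from the data above, and the names admissions_subset and facet_plot are introduced here for illustration.

```r
library(ggplot2)

# Three years copied from the data above, kept short so the sketch runs on its own
admissions_subset <- data.frame(
  Year = rep(c("1999", "2008", "2018"), each = 3),
  Students = c(15460, 4999, 1953,
               24994, 6552, 2079,
               40425, 3422, 1931),
  Label = rep(c("Applicants", "Admitted Students", "Matriculants"), times = 3)
)

# Fix the panel order instead of relying on alphabetical order
admissions_subset$Label <- factor(
  admissions_subset$Label,
  levels = c("Applicants", "Admitted Students", "Matriculants")
)

# One panel per category; free y-axes keep the smaller counts readable
facet_plot <- ggplot(admissions_subset, aes(x = Year, y = Students, fill = Label)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ Label, scales = "free_y") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(x = "Entering Year", y = "Students")

facet_plot
```

Free y-axes make each trend easier to read but hide the difference in magnitude between panels; dropping scales = "free_y" would put all three on a shared axis instead.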


Taehyung Kim

May 2, 2019
