1. Dataset

In our project, we use the following dataset. Source of the data: https://www.kaggle.com/mcamli/nba17-18/version/4

2. Data Cleaning

Reading the Data

First, we read the .csv file containing the stats of NBA players in the 2017-2018 season.

# Read the .csv file "nba.csv"
nba <- read.csv("nba.csv")

Next, we load the “dplyr” package for data cleaning.

# Load the dplyr package (install it first with install.packages("dplyr") if needed)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Dimensions and Structure of the Dataset

Now, we get the dimensions and structure of the data with the functions dim() and str().

dim(nba)
## [1] 540  27

We have 540 observations and 27 variables.

str(nba)
## 'data.frame':    540 obs. of  27 variables:
##  $ rk    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ player: Factor w/ 540 levels "Aaron Brooks\\brookaa01",..: 13 421 468 38 35 78 319 227 290 489 ...
##  $ pos   : Factor w/ 7 levels "C","PF","PG",..: 7 2 1 1 7 1 1 1 3 5 ...
##  $ age   : int  24 27 24 20 32 29 32 19 25 36 ...
##  $ tm    : Factor w/ 31 levels "ATL","BOS","BRK",..: 21 3 21 16 22 18 27 3 2 19 ...
##  $ g     : int  75 70 76 69 53 21 75 72 18 22 ...
##  $ pm    : int  1134 1359 2487 1368 682 49 2509 1441 107 273 ...
##  $ per   : num  9 8.2 20.6 15.7 5.8 6 25 17.5 2.6 8.7 ...
##  $ ts.   : num  0.567 0.525 0.63 0.57 0.516 0.34 0.57 0.636 0.366 0.514 ...
##  $ X3par : num  0.759 0.8 0.003 0.021 0.432 0 0.068 0.038 0.5 0.132 ...
##  $ ftr   : num  0.158 0.164 0.402 0.526 0.16 0.4 0.296 0.37 0.409 0.231 ...
##  $ orb.  : num  2.5 3.1 16.6 9.7 0.6 7 10.8 10.5 4.1 8.2 ...
##  $ drb.  : num  8.9 17 13.9 21.6 10.1 28.6 17.3 18.2 7.1 10.4 ...
##  $ trb.  : num  5.6 10 15.3 15.6 5.3 17.7 14 14.3 5.6 9.3 ...
##  $ ast.  : num  3.4 6 5.5 11 6.2 8.2 11.3 5.4 15.2 4.6 ...
##  $ stl.  : num  1.7 1.2 1.8 1.2 0.3 2 0.9 0.9 1.4 1.9 ...
##  $ blk.  : num  0.6 1.6 2.8 2.5 1.1 1.8 3 4.6 1.6 0.9 ...
##  $ tov.  : num  7.4 13.3 13.2 13.6 10.8 5.4 6.8 15.1 25.7 15.9 ...
##  $ usg.  : num  12.7 14.4 16.7 15.9 12.5 16.8 29.1 16.3 14.6 18.9 ...
##  $ ows   : num  1.3 -0.1 6.7 2.3 -0.1 -0.1 7.4 2.7 -0.2 -0.2 ...
##  $ dws   : num  1 1.1 3 1.9 0.2 0.1 3.5 1.5 0.1 0.2 ...
##  $ ws    : num  2.2 1 9.7 4.2 0.1 0 10.9 4.2 -0.1 0.1 ...
##  $ ws.48 : num  0.094 0.036 0.187 0.148 0.009 -0.013 0.209 0.141 -0.039 0.017 ...
##  $ obpm  : num  -0.5 -2 2.2 -1.6 -4.1 -7 3 -1.3 -6.7 -4 ...
##  $ dbpm  : num  -1.7 -0.2 1.1 1.8 -1.8 0.1 0.3 1.4 0.3 -1.3 ...
##  $ bpm   : num  -2.2 -2.2 3.3 0.2 -5.8 -6.9 3.3 0.2 -6.4 -5.2 ...
##  $ vorp  : num  -0.1 -0.1 3.3 0.8 -0.7 -0.1 3.3 0.8 -0.1 -0.2 ...

Head and Tail

At this step, we look at the head and the tail of the data set, which by default show its first and last 6 rows.

head(nba)
##   rk                   player pos age  tm  g   pm  per   ts. X3par   ftr
## 1  1  Alex Abrines\\abrinal01  SG  24 OKC 75 1134  9.0 0.567 0.759 0.158
## 2  2      Quincy Acy\\acyqu01  PF  27 BRK 70 1359  8.2 0.525 0.800 0.164
## 3  3  Steven Adams\\adamsst01   C  24 OKC 76 2487 20.6 0.630 0.003 0.402
## 4  4   Bam Adebayo\\adebaba01   C  20 MIA 69 1368 15.7 0.570 0.021 0.526
## 5  5 Arron Afflalo\\afflaar01  SG  32 ORL 53  682  5.8 0.516 0.432 0.160
## 6  6  Cole Aldrich\\aldrico01   C  29 MIN 21   49  6.0 0.340 0.000 0.400
##   orb. drb. trb. ast. stl. blk. tov. usg.  ows dws  ws  ws.48 obpm dbpm
## 1  2.5  8.9  5.6  3.4  1.7  0.6  7.4 12.7  1.3 1.0 2.2  0.094 -0.5 -1.7
## 2  3.1 17.0 10.0  6.0  1.2  1.6 13.3 14.4 -0.1 1.1 1.0  0.036 -2.0 -0.2
## 3 16.6 13.9 15.3  5.5  1.8  2.8 13.2 16.7  6.7 3.0 9.7  0.187  2.2  1.1
## 4  9.7 21.6 15.6 11.0  1.2  2.5 13.6 15.9  2.3 1.9 4.2  0.148 -1.6  1.8
## 5  0.6 10.1  5.3  6.2  0.3  1.1 10.8 12.5 -0.1 0.2 0.1  0.009 -4.1 -1.8
## 6  7.0 28.6 17.7  8.2  2.0  1.8  5.4 16.8 -0.1 0.1 0.0 -0.013 -7.0  0.1
##    bpm vorp
## 1 -2.2 -0.1
## 2 -2.2 -0.1
## 3  3.3  3.3
## 4  0.2  0.8
## 5 -5.8 -0.7
## 6 -6.9 -0.1
tail(nba)
##      rk                    player pos age  tm  g   pm  per   ts. X3par
## 535 535 Thaddeus Young\\youngth01  PF  29 IND 81 2607 14.8 0.528 0.209
## 536 536    Cody Zeller\\zelleco01   C  25 CHO 33  627 15.9 0.602 0.019
## 537 537   Tyler Zeller\\zellety01   C  28 TOT 66 1109 16.0 0.598 0.084
## 538 538    Paul Zipser\\zipsepa01  SF  23 CHI 54  824  5.2 0.445 0.470
## 539 539     Ante Zizic\\zizican01   C  21 CLE 32  214 24.2 0.746 0.000
## 540 540    Ivica Zubac\\zubaciv01   C  20 LAL 43  410 15.3 0.557 0.008
##       ftr orb. drb. trb. ast. stl. blk. tov. usg.  ows dws   ws  ws.48
## 535 0.106  8.0 14.1 11.1  8.5  2.6  1.2 10.4 17.3  2.3 3.2  5.5  0.101
## 536 0.545 11.4 19.3 15.3  7.4  1.1  2.8 14.6 15.7  1.2 0.7  1.9  0.145
## 537 0.237 11.0 19.4 15.2  6.7  0.7  2.5 11.3 16.4  2.0 0.9  2.9  0.126
## 538 0.107  1.6 16.0  8.5  8.0  1.2  1.6 14.9 15.2 -1.2 0.6 -0.6 -0.034
## 539 0.433 12.8 18.6 15.7  3.8  0.5  5.2 12.1 18.8  0.9 0.2  1.0  0.231
## 540 0.418 11.8 20.1 16.0  8.8  0.9  3.0 15.3 17.6  0.5 0.5  1.0  0.118
##     obpm dbpm  bpm vorp
## 535  0.1  1.4  1.5  2.3
## 536 -0.6  1.3  0.7  0.4
## 537 -1.1 -0.5 -1.6  0.1
## 538 -5.5 -0.3 -5.9 -0.8
## 539  1.3 -1.2  0.1  0.1
## 540 -2.7  0.5 -2.2  0.0
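The player column in the output above combines a display name and an ID, separated by a backslash (printed as \\). A minimal sketch of splitting the two, assuming a single backslash separator and using two values copied from head(nba). The names player_name and player_id are illustrative and not part of the original analysis:

```r
# Two values copied from the head(nba) output; the real column has 540 entries.
# In R source, "\\" is a single backslash character.
player <- c("Alex Abrines\\abrinal01", "Quincy Acy\\acyqu01")

# Drop the backslash and everything after it to keep the display name
player_name <- sub("\\\\.*$", "", player)

# Drop everything up to and including the backslash to keep the ID
player_id <- sub("^.*\\\\", "", player)

print(player_name)
print(player_id)
```

Applied to the full data, the same two calls could be run on as.character(nba$player), since read.csv() stored the column as a factor here.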

Removing NA Values

At this step, we remove all rows with missing values by using the complete.cases() function. We call the new data frame “nba1”.

nba1 <- nba[complete.cases(nba),]
dim(nba1)
## [1] 537  27

Now we have 537 observations and 27 variables, so 3 rows with missing values were removed.
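To see which columns hold the missing values before dropping rows, one option is to count NAs per column. A small sketch on a made-up data frame (toy and its values are hypothetical, not taken from nba.csv):

```r
# Hypothetical toy data with two missing values in different rows
toy <- data.frame(
  rk  = 1:5,
  per = c(9.0, NA, 20.6, 15.7, 5.8),
  ts  = c(0.567, 0.525, NA, 0.570, 0.516)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))

# Keep only rows with no NA in any column (same approach as nba1)
toy_clean <- toy[complete.cases(toy), ]

# Number of rows removed by the filter
rows_dropped <- nrow(toy) - nrow(toy_clean)

print(na_per_column)
print(rows_dropped)
```

On the real data, colSums(is.na(nba)) would show where the 3 incomplete rows come from.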

3. Exploring Data

At this step, since we were assigned to create a data visualisation using R code, we will use the ggplot2 package to visualize the data through scatterplots and histograms. In addition, we will make the plots interactive using the package “plotly”. We therefore load the packages “ggplot2” and “plotly”.

# Load the readr, ggplot2 and plotly packages
library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
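A histogram of a single variable is a quick way to check its distribution before plotting pairs of variables. The following is a sketch only, using the first ten per values shown in the str(nba) output. The bin width of 5 and the object names are assumptions chosen for illustration:

```r
library(ggplot2)

# First ten per values from the str(nba) output above
toy <- data.frame(per = c(9.0, 8.2, 20.6, 15.7, 5.8, 6.0, 25.0, 17.5, 2.6, 8.7))

# Histogram of player efficiency rating with 5-unit bins
h <- ggplot(toy, aes(x = per)) +
  geom_histogram(binwidth = 5, fill = "grey60", color = "white") +
  xlab("Player Efficiency Rating") +
  ylab("Number of players") +
  theme_minimal(base_size = 12)
h
```

Replacing toy with the cleaned data frame would show the full distribution, where an extreme value appears as an isolated bar far from the rest.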

Create a scatterplot

Map variables in the data onto the X and Y axes

# map values in data to X and Y axes
ggplot(nba1, aes(per, age))

Change the axes labels and theme

# Change the theme
ggplot(nba1, aes(x = per, y = age)) +
  xlab("Player Efficiency Rating") + 
  ylab("Age of Player at the start of February 1st of that season") +
  theme_minimal(base_size = 12)

p1 <- ggplot(nba1, aes(x = per, y = age)) +
  xlab("Player Efficiency Rating") + 
  ylab("Age of Player at the start of February 1st of that season") +
  theme_minimal(base_size = 12)
p1 + geom_point()

What is going on?

The one point far off to the right is the player with rank (rk) 344, whose player efficiency rating of 133.8 is much higher than any other. The next highest rating is 41.9, and the player with rank 466 has a rating of 39.8. Remove rank 344 and replot:

nba2 <- nba1[nba1$rk != 344, ]
p2 <- ggplot(nba2, aes(x = per, y = age)) +
  xlab("Player Efficiency Rating") + 
  ylab("Age of Player at the start of February 1st of that season") +
  theme_minimal(base_size = 12)
p2 + geom_point()
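Rather than reading the outlier off the plot, the row with the largest rating can be located in code. A minimal sketch on a toy data frame: the per values for ranks 1, 3 and 7 come from the output above, ranks 344 and 466 use the ratings quoted in the text, and the object names are illustrative:

```r
# Toy subset: rk and per only
toy <- data.frame(
  rk  = c(1, 3, 7, 344, 466),
  per = c(9.0, 20.6, 25.0, 133.8, 39.8)
)

# Sort by rating, highest first, to inspect the top values
top <- toy[order(toy$per, decreasing = TRUE), ]
print(head(top, 3))

# Rank of the player with the highest rating
outlier_rk <- top$rk[1]

# Drop that player, as done for nba2
toy2 <- toy[toy$rk != outlier_rk, ]
```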

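Next, restrict the axes with xlim() and ylim() to zoom in on players with a rating between 0 and 10. Note that xlim() drops the points outside that range (400 rows here, as the warning shows), and the upper y limit of 1200 is far above any player's age.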

p3 <- p2 + xlim(0,10) + ylim(0,1200)
p3 + geom_point()
## Warning: Removed 400 rows containing missing values (geom_point).

Add a smoother in red with a confidence interval

p4 <- p3 + geom_point() + geom_smooth(color = "red")
p4
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 400 rows containing non-finite values (stat_smooth).
## Warning: Removed 400 rows containing missing values (geom_point).
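The warnings above arise because xlim() and ylim() treat out-of-range points as missing, so the smoother is fitted only to the remaining rows. If the goal is only to zoom, coord_cartesian() changes the visible window without discarding data. A sketch on a toy data frame (values copied from the first rows of nba; the object names are illustrative):

```r
library(ggplot2)

# First five per and age values from the head(nba) output
toy <- data.frame(
  per = c(9.0, 8.2, 20.6, 15.7, 5.8),
  age = c(24, 27, 24, 20, 32)
)

base <- ggplot(toy, aes(x = per, y = age)) + geom_point()

# xlim() sets scale limits: points with per outside [0, 10] are removed
p_dropped <- base + xlim(0, 10)

# coord_cartesian() only zooms the view: all rows stay in the data
p_zoomed <- base + coord_cartesian(xlim = c(0, 10))

# Rows that survive the xlim() version
kept_with_xlim <- sum(toy$per >= 0 & toy$per <= 10)
print(kept_with_xlim)
```

With coord_cartesian(), a geom_smooth() layer would still be computed from every row, which matters when the fitted curve is meant to describe the whole data set.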

What is the linear equation of that linear regression model?

For a line of the form y = mx + b, we use the command lm(y ~ x), which fits a model that uses the predictor variable x to predict y. Look at the rows (Intercept) and per. The column Estimate gives the values you need for the linear model: the (Intercept) estimate is b and the per estimate is m. The column Pr(>|t|) gives the p-value for each coefficient, which describes whether the predictor is useful to the model.

cor(nba2$age, nba2$per)
## [1] 0.04143486
fit1 <- lm(age ~ per, data = nba2)
summary(fit1)
## 
## Call:
## lm(formula = age ~ per, data = nba2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3269 -3.0965 -0.9954  2.8241 14.9393 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.79461    0.37370  69.024   <2e-16 ***
## per          0.02464    0.02571   0.958    0.338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.178 on 534 degrees of freedom
## Multiple R-squared:  0.001717,   Adjusted R-squared:  -0.0001526 
## F-statistic: 0.9184 on 1 and 534 DF,  p-value: 0.3383

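The fitted line is age = 25.79 + 0.0246 × per. The p-value of 2e-16 belongs to the intercept, not to the predictor. The p-value for per is 0.338 > 0.05, so we cannot reject the null hypothesis that the slope is zero. Together with the correlation of 0.041 and an R-squared of 0.0017, this means the data show no evidence of a linear relationship between age and player efficiency rating.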
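The slope, intercept and slope p-value can also be read from the fitted model in code instead of from the printed summary. A minimal sketch, fitted on the first ten age and per values from the str(nba) output, so its numbers will differ from those of fit1. The object names are illustrative:

```r
# First ten per and age values from the str(nba) output
toy <- data.frame(
  per = c(9.0, 8.2, 20.6, 15.7, 5.8, 6.0, 25.0, 17.5, 2.6, 8.7),
  age = c(24, 27, 24, 20, 32, 29, 32, 19, 25, 36)
)

fit <- lm(age ~ per, data = toy)

# Coefficient table: one row per term, columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

intercept <- coefs["(Intercept)", "Estimate"]
slope     <- coefs["per", "Estimate"]
slope_p   <- coefs["per", "Pr(>|t|)"]   # p-value for the predictor, not the intercept

cat(sprintf("age = %.2f + %.3f * per (slope p-value = %.3f)\n",
            intercept, slope, slope_p))
```

Indexing the table by the row name "per" avoids accidentally reading the intercept's p-value.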

p3 +
  geom_point(mapping = aes(per, age, size = g), color = "red") +
  xlim(0, 10) + ylim(0, 1200) +
  ggtitle("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY",
          subtitle = "Sizes of circles are proportional to number of games")
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 400 rows containing missing values (geom_point).

Finally, add some interactivity to the plot with plotly

p <- ggplot(nba2, aes(x = per, y = age, size = g, text = paste("rk:", rk))) +
  theme_minimal(base_size = 12) +
  geom_point(alpha = 0.5, color = "red") +
  xlim(0, 10) + ylim(0, 1200) +
  ggtitle("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY",
          subtitle = "Sizes of circles are proportional to number of games")
p <- ggplotly(p)
p
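The text aesthetic in the code above is not drawn by ggplot2 itself. ggplotly() picks it up for the hover tooltip. A sketch of restricting the tooltip to that label through the tooltip argument of ggplotly(), using the first five rows of head(nba); p_toy and p_int are illustrative names:

```r
library(ggplot2)
library(plotly)

# First five rows of head(nba): rank, rating, age and games
toy <- data.frame(
  rk  = 1:5,
  per = c(9.0, 8.2, 20.6, 15.7, 5.8),
  age = c(24, 27, 24, 20, 32),
  g   = c(75, 70, 76, 69, 53)
)

p_toy <- ggplot(toy, aes(x = per, y = age, size = g, text = paste("rk:", rk))) +
  geom_point(alpha = 0.5, color = "red") +
  theme_minimal(base_size = 12)

# Show only the custom text label when hovering over a point
p_int <- ggplotly(p_toy, tooltip = "text")
p_int
```

Leaving tooltip at its default would list every mapped aesthetic in the hover box.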

Write a short essay

a- The topic is the NBA. The data contains stats of NBA players in the 2017-2018 season. The dataset we used is from the following database: https://www.kaggle.com/mcamli/nba17-18/version/4. In this database, there are 540 observations and 27 variables:

Rk = Rank
Player = Name of player
Pos = Position
Age = Age of Player at the start of February 1st of that season
Tm = Team
G = Games
MP = Minutes Played Per Game
PER = Player Efficiency Rating
TS% = True Shooting %
3PAr = 3-Point Attempt Rate
FTr = Free Throw Rate
ORB% = Offensive Rebound Percentage
DRB% = Defensive Rebound Percentage
TRB% = Total Rebound Percentage
AST% = Assist Percentage
STL% = Steal Percentage
BLK% = Block Percentage
TOV% = Turnover Percentage
USG% = Usage Percentage
OWS = Offensive Win Shares
DWS = Defensive Win Shares
WS = Win Shares
WS/48 = Win Shares Per 48 Minutes
OBPM = Offensive Box Plus/Minus
DBPM = Defensive Box Plus/Minus
BPM = Box Plus/Minus
VORP = Value Over Replacement

In this project, we are interested in the relationship between the age of players and the player efficiency rating in the NBA (a sport I like). We loaded the “dplyr” package and cleaned the data by dropping the 3 rows with missing values using complete.cases().

b- Through scatterplots and histograms, we were surprised by the very high value for rank 344, which is an outlier. This player played only 1 game and is 24 years old. My expectation was that any outlier would be a player with many games to his credit.