# Loading the required packages
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

1. Dataset

In our project, we will use the following dataset. Source of the data: https://www.kaggle.com/mcamli/nba17-18/version/4

2. Data Cleaning

Reading the Data

First of all, we are going to read the .csv file containing the stats of NBA players for the 2017-2018 season.

# Reading the .csv file "nba.csv"
nba <- read.csv("nba.csv")
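
The str() output in the next step shows text columns such as player stored as factors, which is what read.csv() did by default before R 4.0. As a minimal sketch, assuming you would rather keep those columns as plain character strings, stringsAsFactors = FALSE can be passed explicitly. The three-row inline CSV is a hypothetical stand-in for nba.csv (adapted from the first rows of the data) so that the example runs on its own.

```r
# Hypothetical three-row stand-in for nba.csv, adapted from the first rows of the data
csv_text <- "rk,player,pos,age
1,Alex Abrines,SG,24
2,Quincy Acy,PF,27
3,Steven Adams,C,24"

# stringsAsFactors = FALSE keeps text columns as character in any R version
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(toy)
```

With the real file, the same argument would go into read.csv("nba.csv", stringsAsFactors = FALSE).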

Dimensions and Structure of the Dataset

Now, we are going to get the dimensions and structure of the data with the functions dim() and str().

dim(nba)
## [1] 540  27

We have 540 observations and 27 variables.

str(nba)
## 'data.frame':    540 obs. of  27 variables:
##  $ rk    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ player: Factor w/ 540 levels "Aaron Brooks\\brookaa01",..: 13 421 468 38 35 78 319 227 290 489 ...
##  $ pos   : Factor w/ 7 levels "C","PF","PG",..: 7 2 1 1 7 1 1 1 3 5 ...
##  $ age   : int  24 27 24 20 32 29 32 19 25 36 ...
##  $ tm    : Factor w/ 31 levels "ATL","BOS","BRK",..: 21 3 21 16 22 18 27 3 2 19 ...
##  $ g     : int  75 70 76 69 53 21 75 72 18 22 ...
##  $ pm    : int  1134 1359 2487 1368 682 49 2509 1441 107 273 ...
##  $ per   : num  9 8.2 20.6 15.7 5.8 6 25 17.5 2.6 8.7 ...
##  $ ts.   : num  0.567 0.525 0.63 0.57 0.516 0.34 0.57 0.636 0.366 0.514 ...
##  $ X3par : num  0.759 0.8 0.003 0.021 0.432 0 0.068 0.038 0.5 0.132 ...
##  $ ftr   : num  0.158 0.164 0.402 0.526 0.16 0.4 0.296 0.37 0.409 0.231 ...
##  $ orb.  : num  2.5 3.1 16.6 9.7 0.6 7 10.8 10.5 4.1 8.2 ...
##  $ drb.  : num  8.9 17 13.9 21.6 10.1 28.6 17.3 18.2 7.1 10.4 ...
##  $ trb.  : num  5.6 10 15.3 15.6 5.3 17.7 14 14.3 5.6 9.3 ...
##  $ ast.  : num  3.4 6 5.5 11 6.2 8.2 11.3 5.4 15.2 4.6 ...
##  $ stl.  : num  1.7 1.2 1.8 1.2 0.3 2 0.9 0.9 1.4 1.9 ...
##  $ blk.  : num  0.6 1.6 2.8 2.5 1.1 1.8 3 4.6 1.6 0.9 ...
##  $ tov.  : num  7.4 13.3 13.2 13.6 10.8 5.4 6.8 15.1 25.7 15.9 ...
##  $ usg.  : num  12.7 14.4 16.7 15.9 12.5 16.8 29.1 16.3 14.6 18.9 ...
##  $ ows   : num  1.3 -0.1 6.7 2.3 -0.1 -0.1 7.4 2.7 -0.2 -0.2 ...
##  $ dws   : num  1 1.1 3 1.9 0.2 0.1 3.5 1.5 0.1 0.2 ...
##  $ ws    : num  2.2 1 9.7 4.2 0.1 0 10.9 4.2 -0.1 0.1 ...
##  $ ws.48 : num  0.094 0.036 0.187 0.148 0.009 -0.013 0.209 0.141 -0.039 0.017 ...
##  $ obpm  : num  -0.5 -2 2.2 -1.6 -4.1 -7 3 -1.3 -6.7 -4 ...
##  $ dbpm  : num  -1.7 -0.2 1.1 1.8 -1.8 0.1 0.3 1.4 0.3 -1.3 ...
##  $ bpm   : num  -2.2 -2.2 3.3 0.2 -5.8 -6.9 3.3 0.2 -6.4 -5.2 ...
##  $ vorp  : num  -0.1 -0.1 3.3 0.8 -0.7 -0.1 3.3 0.8 -0.1 -0.2 ...

Head and Tail

At this step, we will look at the head and the tail of the data set, which by default show its first and last 6 observations (rows).

head(nba)
##   rk                   player pos age  tm  g   pm  per   ts. X3par   ftr
## 1  1  Alex Abrines\\abrinal01  SG  24 OKC 75 1134  9.0 0.567 0.759 0.158
## 2  2      Quincy Acy\\acyqu01  PF  27 BRK 70 1359  8.2 0.525 0.800 0.164
## 3  3  Steven Adams\\adamsst01   C  24 OKC 76 2487 20.6 0.630 0.003 0.402
## 4  4   Bam Adebayo\\adebaba01   C  20 MIA 69 1368 15.7 0.570 0.021 0.526
## 5  5 Arron Afflalo\\afflaar01  SG  32 ORL 53  682  5.8 0.516 0.432 0.160
## 6  6  Cole Aldrich\\aldrico01   C  29 MIN 21   49  6.0 0.340 0.000 0.400
##   orb. drb. trb. ast. stl. blk. tov. usg.  ows dws  ws  ws.48 obpm dbpm
## 1  2.5  8.9  5.6  3.4  1.7  0.6  7.4 12.7  1.3 1.0 2.2  0.094 -0.5 -1.7
## 2  3.1 17.0 10.0  6.0  1.2  1.6 13.3 14.4 -0.1 1.1 1.0  0.036 -2.0 -0.2
## 3 16.6 13.9 15.3  5.5  1.8  2.8 13.2 16.7  6.7 3.0 9.7  0.187  2.2  1.1
## 4  9.7 21.6 15.6 11.0  1.2  2.5 13.6 15.9  2.3 1.9 4.2  0.148 -1.6  1.8
## 5  0.6 10.1  5.3  6.2  0.3  1.1 10.8 12.5 -0.1 0.2 0.1  0.009 -4.1 -1.8
## 6  7.0 28.6 17.7  8.2  2.0  1.8  5.4 16.8 -0.1 0.1 0.0 -0.013 -7.0  0.1
##    bpm vorp
## 1 -2.2 -0.1
## 2 -2.2 -0.1
## 3  3.3  3.3
## 4  0.2  0.8
## 5 -5.8 -0.7
## 6 -6.9 -0.1
tail(nba)
##      rk                    player pos age  tm  g   pm  per   ts. X3par
## 535 535 Thaddeus Young\\youngth01  PF  29 IND 81 2607 14.8 0.528 0.209
## 536 536    Cody Zeller\\zelleco01   C  25 CHO 33  627 15.9 0.602 0.019
## 537 537   Tyler Zeller\\zellety01   C  28 TOT 66 1109 16.0 0.598 0.084
## 538 538    Paul Zipser\\zipsepa01  SF  23 CHI 54  824  5.2 0.445 0.470
## 539 539     Ante Zizic\\zizican01   C  21 CLE 32  214 24.2 0.746 0.000
## 540 540    Ivica Zubac\\zubaciv01   C  20 LAL 43  410 15.3 0.557 0.008
##       ftr orb. drb. trb. ast. stl. blk. tov. usg.  ows dws   ws  ws.48
## 535 0.106  8.0 14.1 11.1  8.5  2.6  1.2 10.4 17.3  2.3 3.2  5.5  0.101
## 536 0.545 11.4 19.3 15.3  7.4  1.1  2.8 14.6 15.7  1.2 0.7  1.9  0.145
## 537 0.237 11.0 19.4 15.2  6.7  0.7  2.5 11.3 16.4  2.0 0.9  2.9  0.126
## 538 0.107  1.6 16.0  8.5  8.0  1.2  1.6 14.9 15.2 -1.2 0.6 -0.6 -0.034
## 539 0.433 12.8 18.6 15.7  3.8  0.5  5.2 12.1 18.8  0.9 0.2  1.0  0.231
## 540 0.418 11.8 20.1 16.0  8.8  0.9  3.0 15.3 17.6  0.5 0.5  1.0  0.118
##     obpm dbpm  bpm vorp
## 535  0.1  1.4  1.5  2.3
## 536 -0.6  1.3  0.7  0.4
## 537 -1.1 -0.5 -1.6  0.1
## 538 -5.5 -0.3 -5.9 -0.8
## 539  1.3 -1.2  0.1  0.1
## 540 -2.7  0.5 -2.2  0.0
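
The player column combines the player's name with an identifier after a backslash. As a rough sketch, assuming only the name is wanted, the identifier could be stripped with sub(). The two-element vector and the name player_name are illustrative; they are not part of the original analysis.

```r
# Illustrative values copied from the head() output ("\\" is one backslash in R source)
player <- c("Alex Abrines\\abrinal01", "Quincy Acy\\acyqu01")

# Remove everything from the first backslash to the end of the string
player_name <- sub("\\\\.*$", "", player)
print(player_name)
```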

Removing NA Values

At this step, we are going to remove all rows with missing values by using the complete.cases() function. We will call the new data frame “nba1”.

nba1 <- nba[complete.cases(nba),]
dim(nba1)
## [1] 537  27

Now, we have 537 observations and 27 variables, so 3 rows with missing values were removed.
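
Before dropping rows, it can help to see which columns contain the missing values. The following is a small sketch on a made-up data frame (toy is hypothetical, not the NBA data) that counts the NA values per column and then applies the same complete.cases() filter.

```r
# Made-up data frame with two missing values, standing in for nba
toy <- data.frame(
  rk  = 1:4,
  per = c(9.0, NA, 20.6, 15.7),
  ts  = c(0.567, 0.525, NA, 0.570)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows without any missing value
toy_clean <- toy[complete.cases(toy), ]
print(dim(toy_clean))
```

The same colSums(is.na(nba)) call on the real data would show where the 3 incomplete rows come from.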

3. Exploring Data

At this step, since we were assigned to create a data visualisation using R code, we will use the ggplot2 package to visualize the data through scatterplots and histograms. In addition, we will make the plots interactive using the package “plotly”.

Create a scatterplot

Map variables in the data onto the X and Y axes

# map values in data to X and Y axes
ggplot(nba1, aes(per, age))

Change the axes labels and theme

# Change the theme
ggplot(nba1, aes(x = per, y = age)) +
  xlab("Player Efficiency Rating") + 
  ylab("Age of Player at the start of February 1st of that season") +
  theme_minimal(base_size = 12)

p1 <- ggplot(nba1, aes(x = per, y = age)) +
  xlab("Player Efficiency Rating") + 
  ylab("Age of Player at the start of February 1st of that season") +
  theme_minimal(base_size = 12)
p1 + geom_point()

What is going on?

The one point far off to the right is the player with rank 344, who had a much higher player efficiency rating (133.8) than anyone else. The next highest rating was 41.9, and the player with rank 466 had a rating of 39.8. Remove rank 344 and replot:

# rk is an integer column, so compare it with the number 344
nba2 <- nba1[nba1$rk != 344,]
p2 <- ggplot(nba2, aes(x = per, y = age)) +
  xlab("Player Efficiency Rating") + 
  ylab("Age of Player at the start of February 1st of that season") +
  theme_minimal(base_size = 12)
p2 + geom_point()
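
One way to spot such a point without reading it off the plot is to sort the data by the rating and look at the top rows. This is a small sketch on a made-up data frame: toy is hypothetical and reuses only the rank and rating values quoted above, plus the first row of the data.

```r
# Made-up data frame reusing the ranks and ratings mentioned in the text
toy <- data.frame(rk = c(1, 344, 466), per = c(9.0, 133.8, 39.8))

# Sort by rating, highest first, to make the extreme value visible
toy_sorted <- toy[order(-toy$per), ]
print(toy_sorted)

# Drop the extreme row by its rank, as done for nba2
toy_no_outlier <- toy[toy$rk != 344, ]
print(max(toy_no_outlier$per))
```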

Now the scatterplot appears to show a correlation

p3 <- p2 + xlim(0,50) + ylim(0,60)
p3 + geom_point()
## Warning: Removed 19 rows containing missing values (geom_point).
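
The warning appears because xlim() and ylim() turn values outside the limits into missing values before drawing. The rows removed here are probably players with a negative rating, although that is an assumption, since the data is not inspected for it. A base R sketch that mimics this behaviour on made-up ratings:

```r
# Made-up ratings; the negative one stands in for the kind of row that gets dropped
per <- c(-2.1, 9.0, 20.6, 41.9)
x_limits <- c(0, 50)

# Values outside the limits become NA, which is what scale limits do by default
per_censored <- ifelse(per < x_limits[1] | per > x_limits[2], NA, per)
n_dropped <- sum(is.na(per_censored))
print(n_dropped)
```

If the goal is only to zoom in, coord_cartesian(xlim = c(0, 50), ylim = c(0, 60)) changes the visible range without discarding rows, so a smoother would still be fitted to all the data.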

Add a smoother in red with a confidence interval

p4 <- p3 + geom_point() + geom_smooth(color = "red")
p4
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).

What is the linear equation of that linear regression model?

In the form y = mx + b, we use the command lm(y ~ x), meaning: use the predictor variable x in the model to predict y. Look at the rows for (Intercept) and per. The column Estimate gives the values you need for your linear model: the (Intercept) estimate is b and the per estimate is m. The column Pr(>|t|) gives the p-value of each coefficient, which indicates whether the predictor is useful to the model.

cor(nba2$age, nba2$per)
## [1] 0.04143486

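The correlation coefficient is about 0.04, which is very close to 0. That means there is, at most, a very weak linear relationship between the age of the player and the player efficiency rating. (The 0.05 threshold applies to p-values, not to correlation coefficients.)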
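
To judge whether a correlation this small differs from zero, it can be converted into a t statistic with n - 2 degrees of freedom. The sketch below uses the correlation printed above and assumes n = 536, the number of rows left in nba2 (537 minus the removed rank).

```r
r <- 0.04143486  # correlation printed by cor()
n <- 536         # assumed number of rows in nba2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(t_stat, 3))
print(round(p_value, 3))
```

Running cor.test(nba2$age, nba2$per) on the data performs the same test directly.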

fit1 <- lm(age ~ per, data = nba2)
summary(fit1)
## 
## Call:
## lm(formula = age ~ per, data = nba2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3269 -3.0965 -0.9954  2.8241 14.9393 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 25.79461    0.37370  69.024   <2e-16 ***
## per          0.02464    0.02571   0.958    0.338    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.178 on 534 degrees of freedom
## Multiple R-squared:  0.001717,   Adjusted R-squared:  -0.0001526 
## F-statistic: 0.9184 on 1 and 534 DF,  p-value: 0.3383

What does the output mean?

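The model has the equation: age of the player = 0.02(per) + 25.79. For each additional point of player efficiency rating, the predicted age of the player increases by about 0.02 years. The p-value for per (0.338) is well above 0.05, so this slope is not statistically distinguishable from zero.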

The Adjusted R-squared, which is -0.0001526, is essentially zero. That means the model explains practically none of the variation in the age of the players.
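
As a quick check of these numbers, the sketch below plugs a few ratings into the fitted equation and recomputes both R-squared values from the correlation. The coefficients and the correlation are copied from the output above. The ratings 10, 20 and 30 are arbitrary illustration values, and n = 536 is assumed from the 534 residual degrees of freedom.

```r
intercept <- 25.79461  # (Intercept) estimate from summary(fit1)
slope     <- 0.02464   # per estimate from summary(fit1)

# Predicted age for three arbitrary ratings
per_values <- c(10, 20, 30)
predicted_age <- intercept + slope * per_values
print(round(predicted_age, 2))

# With one predictor, R-squared is the squared correlation
r <- 0.04143486
n <- 536
r_squared <- r^2

# Adjusted R-squared penalises for the single predictor in the model
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)
print(signif(r_squared, 4))
print(signif(adj_r_squared, 4))
```

Going from a rating of 10 to a rating of 30 changes the predicted age by only about half a year, which shows how flat the fitted line is.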

p3 +
  geom_point(mapping = aes(per, age, size = g), color = "red") + xlim(0,50) + ylim(0,60) + 
ggtitle ("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY", subtitle = "Sizes of circles are proportional to number of games")
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 19 rows containing missing values (geom_point).

Finally, add some interactivity to the plot with plotly

p <- ggplot(nba2, aes(x = per, y = age, size = g, text = paste("rk:", rk))) + theme_minimal(base_size = 12) +
     geom_point(alpha = 0.5, color = "red") + xlim(0,50) + ylim(0,60) +
  ggtitle ("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY", subtitle = "Sizes of circles are proportional to number of games")
p <- ggplotly(p)
p

Analysis

a- The topic is the NBA. The data contains stats of NBA players for the 2017-2018 season. The dataset we used is from the following source: https://www.kaggle.com/mcamli/nba17-18/version/4.

In this dataset, there are 540 observations and 27 variables: Rk = Rank, Player = Name of player, Pos = Position, Age = Age of Player at the start of February 1st of that season, Tm = Team, G = Games, MP = Minutes Played (column pm; the values shown above are season totals rather than per-game figures), PER = Player Efficiency Rating, TS% = True Shooting %, 3PAr = 3-Point Attempt Rate, FTr = Free Throw Rate, ORB% = Offensive Rebound Percentage, DRB% = Defensive Rebound Percentage, TRB% = Total Rebound Percentage, AST% = Assist Percentage, STL% = Steal Percentage, BLK% = Block Percentage, TOV% = Turnover Percentage, USG% = Usage Percentage, OWS = Offensive Win Shares, DWS = Defensive Win Shares, WS = Win Shares, WS/48 = Win Shares Per 48 Minutes, OBPM = Offensive Box Plus/Minus, DBPM = Defensive Box Plus/Minus, BPM = Box Plus/Minus, VORP = Value Over Replacement Player.

In this project, we are interested in the relationship between the age of players and the player efficiency rating in the NBA. My goal was to figure out whether the efficiency rating of NBA players is linked to the age of the athletes.

I chose this topic and dataset because I wanted to better understand the relationship between the age of players and the player efficiency rating in the NBA. In particular, I wanted to know whether the efficiency rating is strongly determined by the age of the players. I was also interested in finding out whether the number of games could influence the efficiency rating.

b- Overall, my idea that the age of players and the player efficiency rating would be correlated was only weakly supported. I was surprised to see how weak the relationship between those variables is (a correlation of about 0.04 and a slope that is not statistically significant). In addition, there was an outlier with the following characteristics: age = 24, close to the mean of 25, and the highest efficiency rating, 133.8, which could distort the apparent spread of the values. The size of this gap is clear from the second-highest efficiency rating, which was 41.9 at age 25. However, looking at the scatterplot, the fitted trend is almost linear with a slightly positive slope. Even after adding the number of games as the size of the points, there is quite a bit of variation and the pattern is not entirely linear. I think this dataset was great in terms of data quality. Finally, I’m not sure, but I recommend testing other variables in a multiple regression to check whether they could explain the efficiency of the players.