# Loading the required packages (dplyr, readr, ggplot2, plotly)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
In our project, we will use the following dataset. Source of the data: https://www.kaggle.com/mcamli/nba17-18/version/4
First of all, we are going to read the .csv file containing the stats of NBA players in the 2017-2018 season.
# Reading the .csv file "nba.csv"
nba <- read.csv("nba.csv")
Now, we are going to get the dimensions and structure of the data with the functions dim() and str().
dim(nba)
## [1] 540 27
We have 540 observations and 27 variables.
str(nba)
## 'data.frame': 540 obs. of 27 variables:
## $ rk : int 1 2 3 4 5 6 7 8 9 10 ...
## $ player: Factor w/ 540 levels "Aaron Brooks\\brookaa01",..: 13 421 468 38 35 78 319 227 290 489 ...
## $ pos : Factor w/ 7 levels "C","PF","PG",..: 7 2 1 1 7 1 1 1 3 5 ...
## $ age : int 24 27 24 20 32 29 32 19 25 36 ...
## $ tm : Factor w/ 31 levels "ATL","BOS","BRK",..: 21 3 21 16 22 18 27 3 2 19 ...
## $ g : int 75 70 76 69 53 21 75 72 18 22 ...
## $ pm : int 1134 1359 2487 1368 682 49 2509 1441 107 273 ...
## $ per : num 9 8.2 20.6 15.7 5.8 6 25 17.5 2.6 8.7 ...
## $ ts. : num 0.567 0.525 0.63 0.57 0.516 0.34 0.57 0.636 0.366 0.514 ...
## $ X3par : num 0.759 0.8 0.003 0.021 0.432 0 0.068 0.038 0.5 0.132 ...
## $ ftr : num 0.158 0.164 0.402 0.526 0.16 0.4 0.296 0.37 0.409 0.231 ...
## $ orb. : num 2.5 3.1 16.6 9.7 0.6 7 10.8 10.5 4.1 8.2 ...
## $ drb. : num 8.9 17 13.9 21.6 10.1 28.6 17.3 18.2 7.1 10.4 ...
## $ trb. : num 5.6 10 15.3 15.6 5.3 17.7 14 14.3 5.6 9.3 ...
## $ ast. : num 3.4 6 5.5 11 6.2 8.2 11.3 5.4 15.2 4.6 ...
## $ stl. : num 1.7 1.2 1.8 1.2 0.3 2 0.9 0.9 1.4 1.9 ...
## $ blk. : num 0.6 1.6 2.8 2.5 1.1 1.8 3 4.6 1.6 0.9 ...
## $ tov. : num 7.4 13.3 13.2 13.6 10.8 5.4 6.8 15.1 25.7 15.9 ...
## $ usg. : num 12.7 14.4 16.7 15.9 12.5 16.8 29.1 16.3 14.6 18.9 ...
## $ ows : num 1.3 -0.1 6.7 2.3 -0.1 -0.1 7.4 2.7 -0.2 -0.2 ...
## $ dws : num 1 1.1 3 1.9 0.2 0.1 3.5 1.5 0.1 0.2 ...
## $ ws : num 2.2 1 9.7 4.2 0.1 0 10.9 4.2 -0.1 0.1 ...
## $ ws.48 : num 0.094 0.036 0.187 0.148 0.009 -0.013 0.209 0.141 -0.039 0.017 ...
## $ obpm : num -0.5 -2 2.2 -1.6 -4.1 -7 3 -1.3 -6.7 -4 ...
## $ dbpm : num -1.7 -0.2 1.1 1.8 -1.8 0.1 0.3 1.4 0.3 -1.3 ...
## $ bpm : num -2.2 -2.2 3.3 0.2 -5.8 -6.9 3.3 0.2 -6.4 -5.2 ...
## $ vorp : num -0.1 -0.1 3.3 0.8 -0.7 -0.1 3.3 0.8 -0.1 -0.2 ...
At this step, we will look at the head and the tail of the data set, which by default show its first and last 6 observations (rows).
head(nba)
## rk player pos age tm g pm per ts. X3par ftr
## 1 1 Alex Abrines\\abrinal01 SG 24 OKC 75 1134 9.0 0.567 0.759 0.158
## 2 2 Quincy Acy\\acyqu01 PF 27 BRK 70 1359 8.2 0.525 0.800 0.164
## 3 3 Steven Adams\\adamsst01 C 24 OKC 76 2487 20.6 0.630 0.003 0.402
## 4 4 Bam Adebayo\\adebaba01 C 20 MIA 69 1368 15.7 0.570 0.021 0.526
## 5 5 Arron Afflalo\\afflaar01 SG 32 ORL 53 682 5.8 0.516 0.432 0.160
## 6 6 Cole Aldrich\\aldrico01 C 29 MIN 21 49 6.0 0.340 0.000 0.400
## orb. drb. trb. ast. stl. blk. tov. usg. ows dws ws ws.48 obpm dbpm
## 1 2.5 8.9 5.6 3.4 1.7 0.6 7.4 12.7 1.3 1.0 2.2 0.094 -0.5 -1.7
## 2 3.1 17.0 10.0 6.0 1.2 1.6 13.3 14.4 -0.1 1.1 1.0 0.036 -2.0 -0.2
## 3 16.6 13.9 15.3 5.5 1.8 2.8 13.2 16.7 6.7 3.0 9.7 0.187 2.2 1.1
## 4 9.7 21.6 15.6 11.0 1.2 2.5 13.6 15.9 2.3 1.9 4.2 0.148 -1.6 1.8
## 5 0.6 10.1 5.3 6.2 0.3 1.1 10.8 12.5 -0.1 0.2 0.1 0.009 -4.1 -1.8
## 6 7.0 28.6 17.7 8.2 2.0 1.8 5.4 16.8 -0.1 0.1 0.0 -0.013 -7.0 0.1
## bpm vorp
## 1 -2.2 -0.1
## 2 -2.2 -0.1
## 3 3.3 3.3
## 4 0.2 0.8
## 5 -5.8 -0.7
## 6 -6.9 -0.1
tail(nba)
## rk player pos age tm g pm per ts. X3par
## 535 535 Thaddeus Young\\youngth01 PF 29 IND 81 2607 14.8 0.528 0.209
## 536 536 Cody Zeller\\zelleco01 C 25 CHO 33 627 15.9 0.602 0.019
## 537 537 Tyler Zeller\\zellety01 C 28 TOT 66 1109 16.0 0.598 0.084
## 538 538 Paul Zipser\\zipsepa01 SF 23 CHI 54 824 5.2 0.445 0.470
## 539 539 Ante Zizic\\zizican01 C 21 CLE 32 214 24.2 0.746 0.000
## 540 540 Ivica Zubac\\zubaciv01 C 20 LAL 43 410 15.3 0.557 0.008
## ftr orb. drb. trb. ast. stl. blk. tov. usg. ows dws ws ws.48
## 535 0.106 8.0 14.1 11.1 8.5 2.6 1.2 10.4 17.3 2.3 3.2 5.5 0.101
## 536 0.545 11.4 19.3 15.3 7.4 1.1 2.8 14.6 15.7 1.2 0.7 1.9 0.145
## 537 0.237 11.0 19.4 15.2 6.7 0.7 2.5 11.3 16.4 2.0 0.9 2.9 0.126
## 538 0.107 1.6 16.0 8.5 8.0 1.2 1.6 14.9 15.2 -1.2 0.6 -0.6 -0.034
## 539 0.433 12.8 18.6 15.7 3.8 0.5 5.2 12.1 18.8 0.9 0.2 1.0 0.231
## 540 0.418 11.8 20.1 16.0 8.8 0.9 3.0 15.3 17.6 0.5 0.5 1.0 0.118
## obpm dbpm bpm vorp
## 535 0.1 1.4 1.5 2.3
## 536 -0.6 1.3 0.7 0.4
## 537 -1.1 -0.5 -1.6 0.1
## 538 -5.5 -0.3 -5.9 -0.8
## 539 1.3 -1.2 0.1 0.1
## 540 -2.7 0.5 -2.2 0.0
At this step, we are going to remove all rows with missing values by using the complete.cases() function. We will call the new data frame "nba1".
nba1 <- nba[complete.cases(nba),]
dim(nba1)
## [1] 537 27
Now we have 537 observations and 27 variables, so 3 rows with missing values were removed.
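Before dropping rows, it can help to see where the missing values are. The following is a minimal sketch on a small made-up data frame; the object `toy` and its values are assumptions for illustration and are not taken from the NBA data. The same calls should work on `nba`.

```r
# Hypothetical toy data frame standing in for nba (values are made up)
toy <- data.frame(
  rk  = 1:5,
  per = c(9.0, NA, 20.6, 15.7, NA),
  ts. = c(0.567, 0.525, NA, 0.570, 0.516)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that complete.cases() would drop (at least one NA in the row)
dropped <- toy[!complete.cases(toy), ]
print(dropped)

# Rows that are kept, the same rows as toy[complete.cases(toy), ]
kept <- na.omit(toy)
print(nrow(kept))
```

Looking at the dropped rows first shows which players are lost and which variable caused it, before the data frame is reduced.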
At this step, since we were assigned to create a data visualisation using R code, we will use the ggplot2 package to visualize the data through scatterplots. In addition, we will make the plots interactive using the package "plotly".
# map values in data to X and Y axes
ggplot(nba1, aes(per, age))
# Change the theme
ggplot(nba1, aes(x = per, y = age)) +
xlab("Player Efficiency Rating") +
ylab("Age of Player at the start of February 1st of that season") +
theme_minimal(base_size = 12)
p1 <- ggplot(nba1, aes(x = per, y = age)) +
xlab("Player Efficiency Rating") +
ylab("Age of Player at the start of February 1st of that season") +
theme_minimal(base_size = 12)
p1 + geom_point()
The single point far off to the right is the player ranked 344, whose player efficiency rating (133.8) is much higher than any other. The next highest ratings are 41.9 and 39.8, the latter belonging to the player ranked 466. We remove rank 344 and replot:
nba2 <- nba1[nba1$rk != 344, ]  # rk is an integer column, so compare with a number
p2 <- ggplot(nba2, aes(x = per, y = age)) +
xlab("Player Efficiency Rating") +
ylab("Age of Player at the start of February 1st of that season") +
theme_minimal(base_size = 12)
p2 + geom_point()
p3 <- p2 + xlim(0,50) + ylim(0,60)
p3 + geom_point()
## Warning: Removed 19 rows containing missing values (geom_point).
p4 <- p3 + geom_point() + geom_smooth(color = "red")
p4
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 19 rows containing non-finite values (stat_smooth).
## Warning: Removed 19 rows containing missing values (geom_point).
A linear model has the form y = mx + b. In R, the command lm(y ~ x) fits this model, using the predictor variable x to predict y. In the summary output, look at the rows (Intercept) and per. The Estimate column gives the values you need in your linear model: the intercept b and the slope m. The Pr(>|t|) column indicates whether the predictor is useful to the model; values below 0.05 are conventionally taken as evidence that it is.
cor(nba2$age, nba2$per)
## [1] 0.04143486
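The correlation coefficient is about 0.04, which is very close to 0. This indicates a very weak, almost nonexistent, positive linear relationship between the age of the player and the player efficiency rating. Note that the 0.05 threshold applies to p-values, not to the correlation coefficient itself.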
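To separate the size of a correlation from its statistical significance, base R offers cor.test(), which returns both the coefficient and a p-value. The sketch below uses simulated data (an assumption for illustration, not the NBA data) in which x and y are generated independently.

```r
# Simulated data (assumption: NOT the NBA data); fixed seed for reproducibility
set.seed(1)
x <- rnorm(200)
y <- rnorm(200)        # generated independently of x

r  <- cor(x, y)        # Pearson correlation coefficient, always between -1 and 1
ct <- cor.test(x, y)   # t-test of the null hypothesis that the true correlation is 0

print(r)               # strength and direction of the linear relationship
print(ct$p.value)      # this value, not r, is what is compared with 0.05
print(ct$conf.int)     # 95% confidence interval for the correlation
```

For the data in this report the equivalent call would be cor.test(nba2$age, nba2$per). In a regression with a single predictor, its p-value coincides with the p-value of the slope, which is 0.338 in the summary(fit1) output.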
fit1 <- lm(age ~ per, data = nba2)
summary(fit1)
##
## Call:
## lm(formula = age ~ per, data = nba2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3269 -3.0965 -0.9954 2.8241 14.9393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.79461 0.37370 69.024 <2e-16 ***
## per 0.02464 0.02571 0.958 0.338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.178 on 534 degrees of freedom
## Multiple R-squared: 0.001717, Adjusted R-squared: -0.0001526
## F-statistic: 0.9184 on 1 and 534 DF, p-value: 0.3383
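The model has the equation: age of the player = 0.02(per) + 25.79. For each additional point of player efficiency rating, the predicted age of the player increases by about 0.02 years. However, the p-value for per (0.338) is well above 0.05, so this slope is not statistically distinguishable from zero.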
The adjusted R-squared of -0.0001526 is essentially zero, which means that the model explains virtually none of the variation in the age of the players.
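The quantities discussed above can also be pulled out of the fitted model directly instead of being read off the printed summary. This is a minimal sketch on simulated data (an assumption, not the NBA data) in which y does not depend on x, so the slope and R-squared should come out close to zero. The last lines recompute the adjusted R-squared from its standard formula, 1 - (1 - R^2)(n - 1)/(n - p - 1) with p = 1 predictor.

```r
# Simulated data (assumption: NOT the NBA data); fixed seed for reproducibility
set.seed(42)
n <- 100
x <- runif(n, 0, 30)
y <- 26 + rnorm(n, sd = 4)     # y is generated without any dependence on x

fit <- lm(y ~ x)
s   <- summary(fit)

print(coef(fit))                          # intercept b and slope m
print(s$coefficients["x", "Pr(>|t|)"])    # p-value for the slope
print(s$r.squared)                        # share of variance explained
print(s$adj.r.squared)                    # can be slightly negative when R^2 is near 0

# Adjusted R-squared recomputed by hand for one predictor (p = 1)
adj_by_hand <- 1 - (1 - s$r.squared) * (n - 1) / (n - 1 - 1)
print(adj_by_hand)
```

Applying the same formula to the output above (R-squared 0.001717, n = 536) gives about -0.00015, which is why the adjusted value is negative even though R-squared itself cannot be.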
p3 +
geom_point(mapping = aes(per, age, size = g), color = "red") + xlim(0,50) + ylim(0,60) +
ggtitle ("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY", subtitle = "Sizes of circles are proportional to number of games")
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 19 rows containing missing values (geom_point).
p <- ggplot(nba2, aes(x = per, y = age, size = g, text = paste("rk:", rk))) + theme_minimal(base_size = 12) +
geom_point(alpha = 0.5, color = "red") + xlim(0,50) + ylim(0,60) +
ggtitle ("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY", subtitle = "Sizes of circles are proportional to number of games")
p <- ggplotly(p)
p
a- The topic is the NBA. The data contain the stats of NBA players in the 2017-2018 season. The dataset we used comes from the following source: https://www.kaggle.com/mcamli/nba17-18/version/4.
In this dataset, there are 540 observations and 27 variables: Rk = Rank, Player = Name of player, Pos = Position, Age = Age of Player at the start of February 1st of that season, Tm = Team, G = Games, MP = Minutes Played (stored as pm in the file; the values are season totals), PER = Player Efficiency Rating, TS% = True Shooting %, 3PAr = 3-Point Attempt Rate, FTr = Free Throw Rate, ORB% = Offensive Rebound Percentage, DRB% = Defensive Rebound Percentage, TRB% = Total Rebound Percentage, AST% = Assist Percentage, STL% = Steal Percentage, BLK% = Block Percentage, TOV% = Turnover Percentage, USG% = Usage Percentage, OWS = Offensive Win Shares, DWS = Defensive Win Shares, WS = Win Shares, WS/48 = Win Shares Per 48 Minutes, OBPM = Offensive Box Plus/Minus, DBPM = Defensive Box Plus/Minus, BPM = Box Plus/Minus, VORP = Value Over Replacement.
In this project, we are interested in the relationship between the age of players and the player efficiency rating in the NBA. My goal was to find out whether the efficiency rating of NBA players is linked to the age of the athletes.
I chose this topic and dataset because I wanted to better understand the relationship between the age of players and the player efficiency rating in the NBA. In particular, I wanted to know whether the efficiency rating is strongly determined by the age of the players. I was also interested in finding out whether the number of games could influence the efficiency rating.
b- Overall, my expectation that the age of players and the player efficiency rating would be correlated was only weakly supported. I was surprised to see how weak the relationship between those variables is (a correlation of about 0.04, with a slope that is not statistically significant). In addition, I think one outlier, a 24-year-old player (close to the mean age of about 25) with the highest efficiency rating of 133.8, could distort the spread of the values. The size of this gap is clear when compared with the second-highest efficiency rating, which was 41.9 for a 25-year-old player. Looking at the scatterplot, the trend is roughly linear with a slightly positive slope. Even after adding the number of games as the point size, there is quite a bit of variation and the pattern is not entirely linear. I think this data set was of good quality. Finally, although I am not certain, I recommend testing other variables in a multiple regression to verify whether they could explain the efficiency of the players.
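The multiple regression suggested above could be sketched as follows. This is only an illustration on simulated data: the data frame `sim`, its values, and the choice of predictors are assumptions, not results from the NBA dataset. Since the aim is to explain efficiency, per is the response here (the earlier model used age as the response). In the real file the usage column is named usg. with a trailing dot.

```r
# Simulated data (assumption: NOT the NBA data); fixed seed for reproducibility
set.seed(7)
n <- 150
sim <- data.frame(
  age = round(runif(n, 19, 38)),   # made-up ages
  g   = round(runif(n, 1, 82)),    # made-up number of games
  usg = runif(n, 10, 30)           # made-up usage percentage
)
# Made-up relationship: per depends on usg only, plus random noise
sim$per <- 5 + 0.5 * sim$usg + rnorm(n, sd = 3)

# Multiple regression: efficiency explained by several predictors at once
fit_multi <- lm(per ~ age + g + usg, data = sim)
print(summary(fit_multi)$coefficients)

# F-test comparing the age-only model with the larger model
fit_age <- lm(per ~ age, data = sim)
print(anova(fit_age, fit_multi))
```

The coefficient table shows, for each predictor, its estimated effect with the others held fixed, and the anova() comparison indicates whether the extra predictors improve on the age-only model.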