In our project, we use the following dataset. Source of the data: https://www.kaggle.com/mcamli/nba17-18/version/4
First, we read the .csv file containing the stats of NBA players in the 2017-2018 season.
#Reading the .csv file "nba.csv"
nba <- read.csv("nba.csv")
Next, we load the “dplyr” package, which we need to clean our data.
#Loading the dplyr package
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Now we get the dimensions and structure of the data with the functions dim() and str().
dim(nba)
## [1] 540 27
We have 540 observations and 27 variables.
str(nba)
## 'data.frame': 540 obs. of 27 variables:
## $ rk : int 1 2 3 4 5 6 7 8 9 10 ...
## $ player: Factor w/ 540 levels "Aaron Brooks\\brookaa01",..: 13 421 468 38 35 78 319 227 290 489 ...
## $ pos : Factor w/ 7 levels "C","PF","PG",..: 7 2 1 1 7 1 1 1 3 5 ...
## $ age : int 24 27 24 20 32 29 32 19 25 36 ...
## $ tm : Factor w/ 31 levels "ATL","BOS","BRK",..: 21 3 21 16 22 18 27 3 2 19 ...
## $ g : int 75 70 76 69 53 21 75 72 18 22 ...
## $ pm : int 1134 1359 2487 1368 682 49 2509 1441 107 273 ...
## $ per : num 9 8.2 20.6 15.7 5.8 6 25 17.5 2.6 8.7 ...
## $ ts. : num 0.567 0.525 0.63 0.57 0.516 0.34 0.57 0.636 0.366 0.514 ...
## $ X3par : num 0.759 0.8 0.003 0.021 0.432 0 0.068 0.038 0.5 0.132 ...
## $ ftr : num 0.158 0.164 0.402 0.526 0.16 0.4 0.296 0.37 0.409 0.231 ...
## $ orb. : num 2.5 3.1 16.6 9.7 0.6 7 10.8 10.5 4.1 8.2 ...
## $ drb. : num 8.9 17 13.9 21.6 10.1 28.6 17.3 18.2 7.1 10.4 ...
## $ trb. : num 5.6 10 15.3 15.6 5.3 17.7 14 14.3 5.6 9.3 ...
## $ ast. : num 3.4 6 5.5 11 6.2 8.2 11.3 5.4 15.2 4.6 ...
## $ stl. : num 1.7 1.2 1.8 1.2 0.3 2 0.9 0.9 1.4 1.9 ...
## $ blk. : num 0.6 1.6 2.8 2.5 1.1 1.8 3 4.6 1.6 0.9 ...
## $ tov. : num 7.4 13.3 13.2 13.6 10.8 5.4 6.8 15.1 25.7 15.9 ...
## $ usg. : num 12.7 14.4 16.7 15.9 12.5 16.8 29.1 16.3 14.6 18.9 ...
## $ ows : num 1.3 -0.1 6.7 2.3 -0.1 -0.1 7.4 2.7 -0.2 -0.2 ...
## $ dws : num 1 1.1 3 1.9 0.2 0.1 3.5 1.5 0.1 0.2 ...
## $ ws : num 2.2 1 9.7 4.2 0.1 0 10.9 4.2 -0.1 0.1 ...
## $ ws.48 : num 0.094 0.036 0.187 0.148 0.009 -0.013 0.209 0.141 -0.039 0.017 ...
## $ obpm : num -0.5 -2 2.2 -1.6 -4.1 -7 3 -1.3 -6.7 -4 ...
## $ dbpm : num -1.7 -0.2 1.1 1.8 -1.8 0.1 0.3 1.4 0.3 -1.3 ...
## $ bpm : num -2.2 -2.2 3.3 0.2 -5.8 -6.9 3.3 0.2 -6.4 -5.2 ...
## $ vorp : num -0.1 -0.1 3.3 0.8 -0.7 -0.1 3.3 0.8 -0.1 -0.2 ...
At this step, we look at the head and the tail of the data set, which by default show its first and last 6 rows.
head(nba)
## rk player pos age tm g pm per ts. X3par ftr
## 1 1 Alex Abrines\\abrinal01 SG 24 OKC 75 1134 9.0 0.567 0.759 0.158
## 2 2 Quincy Acy\\acyqu01 PF 27 BRK 70 1359 8.2 0.525 0.800 0.164
## 3 3 Steven Adams\\adamsst01 C 24 OKC 76 2487 20.6 0.630 0.003 0.402
## 4 4 Bam Adebayo\\adebaba01 C 20 MIA 69 1368 15.7 0.570 0.021 0.526
## 5 5 Arron Afflalo\\afflaar01 SG 32 ORL 53 682 5.8 0.516 0.432 0.160
## 6 6 Cole Aldrich\\aldrico01 C 29 MIN 21 49 6.0 0.340 0.000 0.400
## orb. drb. trb. ast. stl. blk. tov. usg. ows dws ws ws.48 obpm dbpm
## 1 2.5 8.9 5.6 3.4 1.7 0.6 7.4 12.7 1.3 1.0 2.2 0.094 -0.5 -1.7
## 2 3.1 17.0 10.0 6.0 1.2 1.6 13.3 14.4 -0.1 1.1 1.0 0.036 -2.0 -0.2
## 3 16.6 13.9 15.3 5.5 1.8 2.8 13.2 16.7 6.7 3.0 9.7 0.187 2.2 1.1
## 4 9.7 21.6 15.6 11.0 1.2 2.5 13.6 15.9 2.3 1.9 4.2 0.148 -1.6 1.8
## 5 0.6 10.1 5.3 6.2 0.3 1.1 10.8 12.5 -0.1 0.2 0.1 0.009 -4.1 -1.8
## 6 7.0 28.6 17.7 8.2 2.0 1.8 5.4 16.8 -0.1 0.1 0.0 -0.013 -7.0 0.1
## bpm vorp
## 1 -2.2 -0.1
## 2 -2.2 -0.1
## 3 3.3 3.3
## 4 0.2 0.8
## 5 -5.8 -0.7
## 6 -6.9 -0.1
tail(nba)
## rk player pos age tm g pm per ts. X3par
## 535 535 Thaddeus Young\\youngth01 PF 29 IND 81 2607 14.8 0.528 0.209
## 536 536 Cody Zeller\\zelleco01 C 25 CHO 33 627 15.9 0.602 0.019
## 537 537 Tyler Zeller\\zellety01 C 28 TOT 66 1109 16.0 0.598 0.084
## 538 538 Paul Zipser\\zipsepa01 SF 23 CHI 54 824 5.2 0.445 0.470
## 539 539 Ante Zizic\\zizican01 C 21 CLE 32 214 24.2 0.746 0.000
## 540 540 Ivica Zubac\\zubaciv01 C 20 LAL 43 410 15.3 0.557 0.008
## ftr orb. drb. trb. ast. stl. blk. tov. usg. ows dws ws ws.48
## 535 0.106 8.0 14.1 11.1 8.5 2.6 1.2 10.4 17.3 2.3 3.2 5.5 0.101
## 536 0.545 11.4 19.3 15.3 7.4 1.1 2.8 14.6 15.7 1.2 0.7 1.9 0.145
## 537 0.237 11.0 19.4 15.2 6.7 0.7 2.5 11.3 16.4 2.0 0.9 2.9 0.126
## 538 0.107 1.6 16.0 8.5 8.0 1.2 1.6 14.9 15.2 -1.2 0.6 -0.6 -0.034
## 539 0.433 12.8 18.6 15.7 3.8 0.5 5.2 12.1 18.8 0.9 0.2 1.0 0.231
## 540 0.418 11.8 20.1 16.0 8.8 0.9 3.0 15.3 17.6 0.5 0.5 1.0 0.118
## obpm dbpm bpm vorp
## 535 0.1 1.4 1.5 2.3
## 536 -0.6 1.3 0.7 0.4
## 537 -1.1 -0.5 -1.6 0.1
## 538 -5.5 -0.3 -5.9 -0.8
## 539 1.3 -1.2 0.1 0.1
## 540 -2.7 0.5 -2.2 0.0
At this step, we remove every row that contains a missing value by using the complete.cases() function. We call the new data frame “nba1”.
nba1 <- nba[complete.cases(nba),]
dim(nba1)
## [1] 537 27
Now we have 537 observations and 27 variables, so 3 rows with missing values were removed.
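Before dropping rows, it can help to check how many rows are incomplete and which columns hold the gaps. The following is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from the NBA data); it only illustrates how complete.cases() behaves.

```r
# Hypothetical toy data frame; the values are invented for illustration only
toy <- data.frame(
  rk  = 1:5,
  age = c(24, 27, NA, 20, 32),
  per = c(9.0, 8.2, 20.6, NA, 5.8)
)

# TRUE for rows that have no missing value in any column
keep <- complete.cases(toy)

# Number of rows that would be dropped
n_dropped <- sum(!keep)

# Missing values per column, to see where the gaps are
na_per_column <- colSums(is.na(toy))

# Keep only the complete rows, as done for nba1 above
toy_clean <- toy[keep, ]
```

Running the same checks on nba before creating nba1 would show which variables are responsible for the rows that get removed.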
At this step, since we were assigned to create a data visualisation using R code, we use the ggplot2 package to visualize the data through scatterplots and histograms. In addition, we make the plots interactive using the package “plotly”. We therefore load the packages “ggplot2” and “plotly”.
#Loading the ggplot2 and plotly packages
library(readr)
## Warning: package 'readr' was built under R version 3.6.1
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 3.6.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# map values in data to X and Y axes
ggplot(nba1, aes(per, age))
# Change the theme
ggplot(nba1, aes(x = per, y = age)) +
xlab("Player Efficiency Rating") +
ylab("Age of Player at the start of February 1st of that season") +
theme_minimal(base_size = 12)
p1 <- ggplot(nba1, aes(x = per, y = age)) +
xlab("Player Efficiency Rating") +
ylab("Age of Player at the start of February 1st of that season") +
theme_minimal(base_size = 12)
p1 + geom_point()
The one point far off to the right is the player ranked 344, who had a much higher player efficiency rating (133.8) than anyone else. The next highest ratings were 41.9 and 39.8, the latter belonging to the player ranked 466. We remove rank 344 and replot:
nba2 <- nba1[nba1$rk != 344,]
p2 <- ggplot(nba2, aes(x = per, y = age)) +
xlab("Player Efficiency Rating") +
ylab("Age of Player at the start of February 1st of that season") +
theme_minimal(base_size = 12)
p2 + geom_point()
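An extreme point like the one removed above can also be located programmatically instead of being read off the plot. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical); the same calls could be applied to nba1.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  rk  = c(1, 2, 3, 4, 5),
  per = c(9.0, 8.2, 120.0, 15.7, 25.0),
  g   = c(75, 70, 1, 69, 75)
)

# Sort by efficiency rating, highest first, and inspect the top rows
top <- toy[order(toy$per, decreasing = TRUE), ]
head(top, 3)

# Rank of the player with the most extreme rating
outlier_rk <- toy$rk[which.max(toy$per)]

# Drop that player, mirroring the nba2 step above
toy_trimmed <- toy[toy$rk != outlier_rk, ]
```

Looking at the games column of the top rows also shows whether an extreme rating comes from a very small number of games.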
p3 <- p2 + xlim(0,10) + ylim(0,1200)
p3 + geom_point()
## Warning: Removed 400 rows containing missing values (geom_point).
p4 <- p3 + geom_point() + geom_smooth(color = "red")
p4
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 400 rows containing non-finite values (stat_smooth).
## Warning: Removed 400 rows containing missing values (geom_point).
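The "Removed 400 rows" warnings appear because xlim() and ylim() discard every observation outside the given range before drawing, and geom_smooth() is then fitted only to the rows that remain. If the goal is only to zoom in, coord_cartesian() is an alternative that keeps all rows. The sketch below uses a made-up data frame (`toy` is hypothetical) and is meant as an illustration, not as a replacement for the plots in this report.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  per = c(5, 8, 12, 20, 30),
  age = c(22, 25, 28, 31, 35)
)

# Number of points outside the x-range 0..10; xlim(0, 10) would drop these
n_outside <- sum(toy$per < 0 | toy$per > 10)

# coord_cartesian() changes only the visible window and keeps every row
p_zoom <- ggplot(toy, aes(x = per, y = age)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 10))
```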
For a line of the form y = mx + b, we use the command lm(y ~ x), which fits a model that predicts y from the predictor variable x. Look at the rows for (Intercept) and per. The Estimate column gives the values of b and m needed for the linear model. The Pr(>|t|) column indicates whether each term is statistically significant, that is, whether the predictor is useful to the model.
cor(nba2$age, nba2$per)
## [1] 0.04143486
fit1 <- lm(age ~ per, data = nba2)
summary(fit1)
##
## Call:
## lm(formula = age ~ per, data = nba2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3269 -3.0965 -0.9954 2.8241 14.9393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.79461 0.37370 69.024 <2e-16 ***
## per 0.02464 0.02571 0.958 0.338
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.178 on 534 degrees of freedom
## Multiple R-squared: 0.001717, Adjusted R-squared: -0.0001526
## F-statistic: 0.9184 on 1 and 534 DF, p-value: 0.3383
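The p-value for the slope of per is 0.338 > 0.05, so we cannot reject the null hypothesis that the slope is zero (a small p-value, typically ≤ 0.05, would have indicated strong evidence against the null hypothesis). The very small p-value (< 2e-16) belongs to the intercept, not to the predictor. Together with the correlation of 0.041 and the R-squared of 0.0017, this means there is no evidence of a linear relationship between age and player efficiency rating.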
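To avoid mixing up the intercept row and the predictor row of the summary table, the p-values can be pulled out by name. This is a minimal sketch on simulated data (`toy`, `x` and `y` are hypothetical names); for fit1 the predictor row would be "per".

```r
# Simulated toy data; not related to the NBA dataset
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 + 0.5 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)

# Coefficient table as a matrix: one row per term, one column per statistic
coefs <- coef(summary(fit))

# p-value of the slope (the predictor), which is the one that matters here
slope_p <- coefs["x", "Pr(>|t|)"]

# p-value of the intercept, which only tests whether b differs from zero
intercept_p <- coefs["(Intercept)", "Pr(>|t|)"]

# In simple linear regression, R-squared equals the squared correlation
r_squared <- summary(fit)$r.squared
```

Reading the slope's p-value by row name makes it explicit which hypothesis is being tested.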
p3 +
geom_point(mapping = aes(per, age, size = g), color = "red") + xlim(0,10) + ylim(0,1200) +
ggtitle ("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY", subtitle = "Sizes of circles are proportional to number of games")
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Scale for 'y' is already present. Adding another scale for 'y', which
## will replace the existing scale.
## Warning: Removed 400 rows containing missing values (geom_point).
p <- ggplot(nba2, aes(x = per, y = age, size = g, text = paste("rk:", rk))) + theme_minimal(base_size = 12) +
geom_point(alpha = 0.5, color = "red") + xlim(0,10) + ylim(0,1200) +
ggtitle ("AGE OF PLAYERS VERSUS PLAYER RATE EFFICIENCY", subtitle = "Sizes of circles are proportional to number of games")
p <- ggplotly(p)
p
Write a short essay
a- The topic is the NBA. The data contains stats of NBA players in the 2017-2018 season. The dataset we used comes from the following database: https://www.kaggle.com/mcamli/nba17-18/version/4. In this database, there are 540 observations and 27 variables:
Rk = Rank
Player = Name of player
Pos = Position
Age = Age of Player at the start of February 1st of that season
Tm = Team
G = Games
MP = Minutes Played Per Game
PER = Player Efficiency Rating
TS% = True Shooting %
3PAr = 3-Point Attempt Rate
FTr = Free Throw Rate
ORB% = Offensive Rebound Percentage
DRB% = Defensive Rebound Percentage
TRB% = Total Rebound Percentage
AST% = Assist Percentage
STL% = Steal Percentage
BLK% = Block Percentage
TOV% = Turnover Percentage
USG% = Usage Percentage
OWS = Offensive Win Shares
DWS = Defensive Win Shares
WS = Win Shares
WS/48 = Win Shares Per 48 Minutes
OBPM = Offensive Box Plus/Minus
DBPM = Defensive Box Plus/Minus
BPM = Box Plus/Minus
VORP = Value Over Replacement
In this project, we are interested in the relationship between the age of players and the player efficiency rating in the NBA (a sport I like). We loaded the “dplyr” package for data cleaning and dropped the 3 rows that contain missing values using complete.cases().
b- Through scatterplots and histograms, we were surprised by the very high value for rank 344, which is an outlier. This player played only 1 game and is 24 years old. My expectation had been that any outlier would be a player with many games to his credit.