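# stringsAsFactors = TRUE keeps text columns as factors, which the summaries below rely on (no longer the default since R 4.0.0)
nba <- read.csv("C:\\Users\\vigne\\Desktop\\NBA_player_of_the_week.csv", stringsAsFactors = TRUE)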
dim(nba)
## [1] 1340 17
The dim(nba) function returns the number of rows and columns in the dataset: here, 1340 rows and 17 columns.
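As a minimal sketch of what dim() reports, the toy data frame below (made-up values, not taken from the dataset) has three rows and two columns. nrow() and ncol() return the same two numbers separately.

```r
# Toy data frame with made-up values, used only for illustration
toy <- data.frame(
  Player = c("A", "B", "C"),
  Age    = c(24, 27, 31)
)

d <- dim(toy)   # integer vector: c(rows, columns)
d
nrow(toy)       # same as d[1]
ncol(toy)       # same as d[2]
```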
View(nba)
The View(nba) function (note the capital V) opens the entire dataset in a spreadsheet-style viewer.
summary(nba)
## Player Team Conference Date
## LeBron James : 63 Los Angeles Lakers : 87 :501 Jan 12, 2003: 3
## Kobe Bryant : 33 Houston Rockets : 68 East:420 Apr 1, 2007 : 2
## Kevin Durant : 26 San Antonio Spurs : 68 West:419 Apr 1, 2013 : 2
## Michael Jordan: 25 Cleveland Cavaliers: 62 Apr 1, 2019 : 2
## James Harden : 24 Miami Heat : 59 Apr 10, 2005: 2
## Allen Iverson : 23 Boston Celtics : 57 Apr 10, 2017: 2
## (Other) :1146 (Other) :939 (Other) :1327
## Position Height Weight Age Draft.Year
## G :200 6'9 :158 Min. :150.0 Min. :19.00 Min. :1965
## C :188 6'7 :148 1st Qu.:205.0 1st Qu.:24.00 1st Qu.:1987
## SG :177 6'8 :146 Median :220.0 Median :26.00 Median :1998
## PF :157 6'6 :135 Mean :224.6 Mean :26.74 Mean :1996
## F :154 6'3 :134 3rd Qu.:250.0 3rd Qu.:29.00 3rd Qu.:2005
## SF :153 6'11 :121 Max. :325.0 Max. :40.00 Max. :2018
## (Other):311 (Other):498
## Seasons.in.league Season Season.short
## Min. : 0.00 2002-2003: 47 Min. :1980
## 1st Qu.: 3.00 2003-2004: 46 1st Qu.:1994
## Median : 5.00 2004-2005: 46 Median :2005
## Mean : 5.74 2005-2006: 46 Mean :2003
## 3rd Qu.: 8.00 2006-2007: 46 3rd Qu.:2013
## Max. :17.00 2007-2008: 46 Max. :2020
## (Other) :1063
## Pre.draft.Team Real_value
## St. Vincent St. Mary High School (Ohio): 63 Min. :0.5000
## Georgetown : 51 1st Qu.:0.5000
## North Carolina : 47 Median :0.5000
## UCLA : 42 Mean :0.6869
## Wake Forest : 40 3rd Qu.:1.0000
## Texas : 37 Max. :1.0000
## (Other) :1060
## Height.CM Weight.KG Last.Season
## Min. :175.0 Min. : 68.0 Min. :0.00000
## 1st Qu.:193.0 1st Qu.: 93.0 1st Qu.:0.00000
## Median :201.0 Median : 99.0 Median :0.00000
## Mean :201.1 Mean :101.4 Mean :0.02388
## 3rd Qu.:208.0 3rd Qu.:113.0 3rd Qu.:0.00000
## Max. :229.0 Max. :147.0 Max. :1.00000
##
For each numeric variable, the summary function gives the minimum, first quartile, median, mean, third quartile and maximum. For each factor, it gives the number of rows in the most frequent levels.
Taking the Age column as an example, the minimum age is 19, the mean age is 26.74 and the maximum age is 40.
For the Position column, the output gives the number of rows per position, for example G: 200, SG: 177 and SF: 153. Height, Draft.Year and the other columns can be read the same way. Each row is one Player of the Week award, so these are counts of awards rather than of distinct players.
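The difference between the two kinds of summary can be sketched on a small example. The data frame below is hypothetical (the values are made up); it only shows that summary() returns quantiles and the mean for a numeric column and level counts for a factor.

```r
# Toy data frame with made-up values, used only for illustration
toy <- data.frame(
  Position = factor(c("G", "G", "SG", "SF", "G", "SF")),
  Age      = c(19, 24, 26, 26, 29, 40)
)

age_summary     <- summary(toy$Age)       # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
position_counts <- summary(toy$Position)  # number of rows per factor level

age_summary
position_counts
```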
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(nba)
## nba
##
## 17 Variables 1340 Observations
## --------------------------------------------------------------------------------
## Player
## n missing distinct
## 1340 0 333
##
## lowest : Aaron McKie Adrian Dantley Al Harrington Al Horford Al Jefferson
## highest: World B. Free Xavier McDaniel Yao Ming Zach Randolph Zydrunas Ilgauskas
## --------------------------------------------------------------------------------
## Team
## n missing distinct
## 1340 0 37
##
## lowest : Atlanta Hawks Boston Celtics Brooklyn Nets Charlotte Bobcats Charlotte Hornets
## highest: Seattle SuperSonics Toronto Raptors Utah Jazz Washington Bullets Washington Wizards
## --------------------------------------------------------------------------------
## Conference
## n missing distinct
## 1340 0 3
##
## Value East West
## Frequency 501 420 419
## Proportion 0.374 0.313 0.313
## --------------------------------------------------------------------------------
## Date
## n missing distinct
## 1340 0 916
##
## lowest : Apr 1, 1984 Apr 1, 1985 Apr 1, 1990 Apr 1, 2001 Apr 1, 2007
## highest: Oct 28, 1979 Oct 28, 2019 Oct 29, 2018 Oct 30, 2017 Oct 31, 2016
## --------------------------------------------------------------------------------
## Position
## n missing distinct
## 1340 0 11
##
## lowest : C F F-C FC G , highest: GF PF PG SF SG
##
## Value C F F-C FC G G-F GF PF PG SF SG
## Frequency 188 154 9 88 200 3 70 157 141 153 177
## Proportion 0.140 0.115 0.007 0.066 0.149 0.002 0.052 0.117 0.105 0.114 0.132
## --------------------------------------------------------------------------------
## Height
## n missing distinct
## 1340 0 21
##
## lowest : 5'10 5'11 5'9 6'0 6'1 , highest: 7'1 7'2 7'3 7'4 7'6
## --------------------------------------------------------------------------------
## Weight
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 72 0.997 224.6 34.38 175 185
## .25 .50 .75 .90 .95
## 205 220 250 260 265
##
## lowest : 150 163 165 168 170, highest: 285 289 307 310 325
## --------------------------------------------------------------------------------
## Age
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 22 0.991 26.74 3.794 22 23
## .25 .50 .75 .90 .95
## 24 26 29 31 33
##
## lowest : 19 20 21 22 23, highest: 36 37 38 39 40
## --------------------------------------------------------------------------------
## Draft.Year
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 52 0.998 1996 12.83 1977 1979
## .25 .50 .75 .90 .95
## 1987 1998 2005 2010 2012
##
## lowest : 1965 1968 1969 1970 1971, highest: 2014 2015 2016 2017 2018
## --------------------------------------------------------------------------------
## Seasons.in.league
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 18 0.991 5.74 3.668 1 2
## .25 .50 .75 .90 .95
## 3 5 8 10 12
##
## lowest : 0 1 2 3 4, highest: 13 14 15 16 17
##
## Value 0 1 2 3 4 5 6 7 8 9 10
## Frequency 35 63 119 149 166 164 155 123 108 87 51
## Proportion 0.026 0.047 0.089 0.111 0.124 0.122 0.116 0.092 0.081 0.065 0.038
##
## Value 11 12 13 14 15 16 17
## Frequency 37 37 15 13 6 10 2
## Proportion 0.028 0.028 0.011 0.010 0.004 0.007 0.001
## --------------------------------------------------------------------------------
## Season
## n missing distinct
## 1340 0 41
##
## lowest : 1979-1980 1980-1981 1981-1982 1982-1983 1983-1984
## highest: 2015-2016 2016-2017 2017-2018 2018-2019 2019-2020
## --------------------------------------------------------------------------------
## Season.short
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 41 0.999 2003 13.07 1982 1985
## .25 .50 .75 .90 .95
## 1994 2005 2013 2017 2019
##
## lowest : 1980 1981 1982 1983 1984, highest: 2016 2017 2018 2019 2020
## --------------------------------------------------------------------------------
## Pre.draft.Team
## n missing distinct
## 1340 0 169
##
## lowest : Alabama Albany State (GA) Alcorn State Alief Elsik High School (Texas) Alpella B.K. (Turkey)
## highest: Western Carolina Wichita State Wisconsin Xavier Zalgiris (Lithuania)
## --------------------------------------------------------------------------------
## Real_value
## n missing distinct Info Mean Gmd
## 1340 0 2 0.702 0.6869 0.2343
##
## Value 0.5 1.0
## Frequency 839 501
## Proportion 0.626 0.374
## --------------------------------------------------------------------------------
## Height.CM
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 21 0.992 201.1 10.58 185 188
## .25 .50 .75 .90 .95
## 193 201 208 213 216
##
## lowest : 175 178 180 183 185, highest: 216 218 221 224 229
## --------------------------------------------------------------------------------
## Weight.KG
## n missing distinct Info Mean Gmd .05 .10
## 1340 0 48 0.997 101.4 15.64 79 83
## .25 .50 .75 .90 .95
## 93 99 113 118 120
##
## lowest : 68 74 76 77 78, highest: 129 131 139 140 147
## --------------------------------------------------------------------------------
## Last.Season
## n missing distinct Info Sum Mean Gmd
## 1340 0 2 0.07 32 0.02388 0.04666
##
## --------------------------------------------------------------------------------
The describe function in the Hmisc package gives broader descriptive statistics for the nba dataset.
It also reports the number of missing values for each variable. The output shows no missing (NA) values in any column, although Conference has 501 blank entries, which are counted as a separate level rather than as missing.
Taking Height.CM as an example, the output lists the five lowest heights (175 to 185 cm), the five highest heights (216 to 229 cm), the mean height (201.1 cm) and the quantiles. For variables with few distinct values, such as Position and Seasons.in.league, it also prints the frequency of each value.
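The missing-value check can also be done in base R. The sketch below uses a hypothetical toy data frame to show that is.na() counts NA values but not blank strings, which is why blank entries in a column like Conference need a separate check.

```r
# Toy data frame with made-up values: one NA and two blank strings
toy <- data.frame(
  Conference = c("East", "", "West", ""),
  Height.CM  = c(198, NA, 211, 185)
)

missing_per_column <- colSums(is.na(toy))    # NA count for each column
blank_conference   <- sum(toy$Conference == "")  # blanks are not NA

missing_per_column
blank_conference
```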
Similarly, other packages such as pastecs and psych can also be used for descriptive statistics of a dataset.
library(pastecs)
stat.desc(nba)
## Player Team Conference Date Position Height Weight Age
## nbr.val NA NA NA NA NA NA 1.340000e+03 1.340000e+03
## nbr.null NA NA NA NA NA NA 0.000000e+00 0.000000e+00
## nbr.na NA NA NA NA NA NA 0.000000e+00 0.000000e+00
## min NA NA NA NA NA NA 1.500000e+02 1.900000e+01
## max NA NA NA NA NA NA 3.250000e+02 4.000000e+01
## range NA NA NA NA NA NA 1.750000e+02 2.100000e+01
## sum NA NA NA NA NA NA 3.009200e+05 3.582900e+04
## median NA NA NA NA NA NA 2.200000e+02 2.600000e+01
## mean NA NA NA NA NA NA 2.245672e+02 2.673806e+01
## SE.mean NA NA NA NA NA NA 8.413614e-01 9.289958e-02
## CI.mean NA NA NA NA NA NA 1.650530e+00 1.822446e-01
## var NA NA NA NA NA NA 9.485713e+02 1.156464e+01
## std.dev NA NA NA NA NA NA 3.079888e+01 3.400683e+00
## coef.var NA NA NA NA NA NA 1.371478e-01 1.271851e-01
## Draft.Year Seasons.in.league Season Season.short Pre.draft.Team
## nbr.val 1.340000e+03 1340.0000000 NA 1.340000e+03 NA
## nbr.null 0.000000e+00 35.0000000 NA 0.000000e+00 NA
## nbr.na 0.000000e+00 0.0000000 NA 0.000000e+00 NA
## min 1.965000e+03 0.0000000 NA 1.980000e+03 NA
## max 2.018000e+03 17.0000000 NA 2.020000e+03 NA
## range 5.300000e+01 17.0000000 NA 4.000000e+01 NA
## sum 2.675025e+06 7692.0000000 NA 2.684230e+06 NA
## median 1.998000e+03 5.0000000 NA 2.005000e+03 NA
## mean 1.996287e+03 5.7402985 NA 2.003157e+03 NA
## SE.mean 3.074238e-01 0.0899694 NA 3.133410e-01 NA
## CI.mean 6.030846e-01 0.1764963 NA 6.146927e-01 NA
## var 1.266426e+02 10.8466198 NA 1.315647e+02 NA
## std.dev 1.125356e+01 3.2934207 NA 1.147016e+01 NA
## coef.var 5.637243e-03 0.5737368 NA 5.726044e-03 NA
## Real_value Height.CM Weight.KG Last.Season
## nbr.val 1.340000e+03 1.340000e+03 1.340000e+03 1.340000e+03
## nbr.null 0.000000e+00 0.000000e+00 0.000000e+00 1.308000e+03
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min 5.000000e-01 1.750000e+02 6.800000e+01 0.000000e+00
## max 1.000000e+00 2.290000e+02 1.470000e+02 1.000000e+00
## range 5.000000e-01 5.400000e+01 7.900000e+01 1.000000e+00
## sum 9.205000e+02 2.694360e+05 1.358550e+05 3.200000e+01
## median 5.000000e-01 2.010000e+02 9.900000e+01 0.000000e+00
## mean 6.869403e-01 2.010716e+02 1.013843e+02 2.388060e-02
## SE.mean 6.611116e-03 2.559134e-01 3.827575e-01 4.172379e-03
## CI.mean 1.296927e-02 5.020349e-01 7.508697e-01 8.185112e-03
## var 5.856718e-02 8.775887e+01 1.963145e+02 2.332772e-02
## std.dev 2.420066e-01 9.367970e+00 1.401123e+01 1.527342e-01
## coef.var 3.522964e-01 4.659021e-02 1.381991e-01 6.395743e+00
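The stat.desc function in the pastecs package gives the range, variance, standard deviation, standard error of the mean and coefficient of variation of each numeric variable. Non-numeric columns are shown as NA.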
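Three of these rows can be reproduced by hand, which helps when reading the table. The sketch below uses a made-up vector and assumes the default 95% confidence level, with CI.mean read as the half-width of the interval around the mean. For Weight in the output above, 30.80 / sqrt(1340) is about 0.84 and 30.80 / 224.57 is about 0.137, which agrees with the reported SE.mean and coef.var.

```r
# Made-up weights, used only for illustration
x <- c(150, 205, 220, 250, 325)
n <- length(x)

se_mean  <- sd(x) / sqrt(n)                  # standard error of the mean
ci_mean  <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% CI (assumed level)
coef_var <- sd(x) / mean(x)                  # coefficient of variation

c(SE.mean = se_mean, CI.mean = ci_mean, coef.var = coef_var)
```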
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# describe.by() is deprecated; with no grouping variable, psych::describe() produces the same table
psych::describe(nba)
## vars n mean sd median trimmed mad min max
## Player* 1 1340 163.93 93.25 170.0 163.64 117.13 1.0 333
## Team* 2 1340 18.57 10.38 18.0 18.53 13.34 1.0 37
## Conference* 3 1340 1.94 0.83 2.0 1.92 1.48 1.0 3
## Date* 4 1340 464.77 264.61 466.5 465.42 340.26 1.0 916
## Position* 5 1340 6.26 3.52 7.0 6.32 4.45 1.0 11
## Height* 6 1340 11.23 3.86 12.0 11.35 4.45 1.0 21
## Weight 7 1340 224.57 30.80 220.0 224.00 32.62 150.0 325
## Age 8 1340 26.74 3.40 26.0 26.54 2.97 19.0 40
## Draft.Year 9 1340 1996.29 11.25 1998.0 1996.79 13.34 1965.0 2018
## Seasons.in.league 10 1340 5.74 3.29 5.0 5.51 2.97 0.0 17
## Season* 11 1340 24.16 11.47 26.0 24.81 13.34 1.0 41
## Season.short 12 1340 2003.16 11.47 2005.0 2003.81 13.34 1980.0 2020
## Pre.draft.Team* 13 1340 86.83 48.68 84.0 87.27 65.23 1.0 169
## Real_value 14 1340 0.69 0.24 0.5 0.67 0.00 0.5 1
## Height.CM 15 1340 201.07 9.37 201.0 201.37 10.38 175.0 229
## Weight.KG 16 1340 101.38 14.01 99.0 101.11 14.83 68.0 147
## Last.Season 17 1340 0.02 0.15 0.0 0.00 0.00 0.0 1
## range skew kurtosis se
## Player* 332.0 0.03 -1.07 2.55
## Team* 36.0 0.05 -1.19 0.28
## Conference* 2.0 0.11 -1.53 0.02
## Date* 915.0 -0.02 -1.20 7.23
## Position* 10.0 -0.18 -1.40 0.10
## Height* 20.0 -0.26 -0.83 0.11
## Weight 175.0 0.38 0.50 0.84
## Age 21.0 0.57 0.28 0.09
## Draft.Year 53.0 -0.37 -0.81 0.31
## Seasons.in.league 17.0 0.64 0.22 0.09
## Season* 40.0 -0.43 -0.95 0.31
## Season.short 40.0 -0.43 -0.95 0.31
## Pre.draft.Team* 168.0 0.03 -1.26 1.33
## Real_value 0.5 0.52 -1.73 0.01
## Height.CM 54.0 -0.22 -0.36 0.26
## Weight.KG 79.0 0.38 0.50 0.38
## Last.Season 1.0 6.23 36.84 0.00
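The psych package gives more detailed descriptive statistics, including the skewness and kurtosis of every variable. Variables marked with * are non-numeric and were converted to numeric codes, so their statistics are not meaningful.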
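To make the skew and kurtosis columns concrete, the sketch below computes the plain moment-based versions on made-up vectors. The helper names moment_skew and moment_kurtosis are hypothetical, and psych may apply a small-sample adjustment by default, so its figures can differ slightly from these.

```r
# Moment-based skewness: third central moment scaled by variance^1.5
moment_skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Moment-based excess kurtosis: fourth central moment over variance^2, minus 3
moment_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

symmetric    <- c(1, 2, 3, 4, 5)     # made-up, no skew
right_tailed <- c(1, 1, 1, 2, 10)    # made-up, long right tail

moment_skew(symmetric)
moment_skew(right_tailed)
moment_kurtosis(symmetric)
```

A positive skew, as reported for Age and Seasons.in.league, means the right tail is longer than the left.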
mydata<-table(nba$Draft.Year,nba$Age)
margin.table(mydata,1)
##
## 1965 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982
## 1 1 7 1 4 3 2 11 9 26 20 30 23 10 26 19
## 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
## 16 62 47 11 32 8 23 12 21 27 18 33 27 93 44 55
## 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014
## 37 8 45 23 119 38 37 26 36 37 62 26 42 28 20 11
## 2015 2016 2017 2018
## 7 9 4 3
margin.table(mydata,2)
##
## 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
## 3 14 29 66 113 145 176 141 142 145 94 84 62 47 33 17 13 5 5 3
## 39 40
## 1 2
The frequency distribution table cross-tabulates draft year and age. The row totals show that the 2003 draft class accounts for the most rows (119), and the column totals show that 25 is the most common age (176 rows).
The plyr package can also be used to build a frequency distribution table.
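Reading the largest margin off by eye is error-prone with this many columns, so it can be picked out with which.max(). The sketch below uses a hypothetical toy data frame with made-up values.

```r
# Toy data frame with made-up values, used only for illustration
toy <- data.frame(
  Draft.Year = c(2003, 2003, 2003, 1996, 1996, 1984),
  Age        = c(21, 25, 25, 25, 30, 24)
)

tab     <- table(toy$Draft.Year, toy$Age)
by_year <- margin.table(tab, 1)   # row totals: rows per draft year
by_age  <- margin.table(tab, 2)   # column totals: rows per age

top_year <- names(by_year)[which.max(by_year)]  # most frequent draft year
top_age  <- names(by_age)[which.max(by_age)]    # most frequent age

top_year
top_age
```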
library(plyr)
## Warning: package 'plyr' was built under R version 3.6.3
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
fdt <- count(nba, "Position")  # frequency of each position; count(nba) alone only tallies identical rows
class(fdt)
## [1] "data.frame"
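A frequency table for a single column can also be built as a data frame in base R. The sketch below uses a made-up Position column; the column name freq is chosen here only to mirror the name plyr uses for its counts.

```r
# Toy data frame with made-up values, used only for illustration
toy <- data.frame(Position = c("G", "SG", "G", "SF", "G", "SG"))

# table() counts each level; as.data.frame() turns the counts into a data frame
freq <- as.data.frame(table(Position = toy$Position), responseName = "freq")
freq <- freq[order(-freq$freq), ]   # most frequent position first

freq
```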
barplot(table(nba$Height.CM), xlab = "Height (cm)", ylab = "Number of rows", main = "Frequency bar plot of player height")
The above is a frequency bar plot of player height. Tabulating Height.CM first gives one bar per distinct height; passing the raw column would instead draw one bar for each of the 1340 rows.
library(ggplot2)
boxplot(nba$Height.CM)
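The above boxplot shows the spread of Height.CM; any value more than 1.5 times the interquartile range beyond the box would be drawn as an outlier. With quartiles of 193 and 208 cm and a range of 175 to 229 cm, no height appears to fall outside those limits. Outliers matter because they pull the mean more than the median; here the two nearly coincide (mean 201.1 cm, median 201 cm).
Most of the players are around 200 cm tall.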
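The values a boxplot would flag can be listed without drawing it, using boxplot.stats(). The sketch below runs on a made-up vector of heights, so the flagged values are not from the dataset.

```r
# Made-up heights in cm, with one low and one high extreme
heights <- c(175, 193, 196, 198, 201, 203, 206, 208, 229)

stats <- boxplot.stats(heights)   # default coef = 1.5 (the usual whisker rule)

stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values beyond the whiskers, i.e. the points drawn as outliers
```

Running boxplot.stats(nba$Height.CM)$out on the real column would show directly whether any heights are treated as outliers.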
library(ggplot2)
boxplot(nba$Draft.Year)
The above plot summarises the distribution of draft years rather than a count for each year. The median draft year is 1998, and the middle half of the rows falls between 1987 and 2005.
library(ggplot2)
ggplot(nba, aes(Height.CM, Position)) + geom_jitter()  # column names inside aes() are looked up in nba
The above jitter plot shows that most of the players are between about 190 and 210 cm tall, with a few points at the extremes. It also suggests that height is related to playing position.
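The link between height and position can be checked numerically by comparing the median height of each position. The sketch below uses a hypothetical toy data frame with made-up values.

```r
# Toy data frame with made-up values, used only for illustration
toy <- data.frame(
  Position  = c("G", "G", "G", "C", "C", "C"),
  Height.CM = c(185, 190, 193, 208, 213, 216)
)

# Median height within each position
median_by_position <- tapply(toy$Height.CM, toy$Position, median)
median_by_position
```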
library(ggplot2)
ggplot(nba, aes(Draft.Year, fill = Team)) + geom_bar()
## Warning: position_stack requires non-overlapping x intervals
The above graph shows how many rows fall in each draft year, with each bar split by the Team column.
library(ggplot2)
ggplot(nba, aes(Height.CM, fill = factor(Age))) + geom_histogram()  # fill needs a discrete variable to split the bars
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cor.test(nba$Seasons.in.league,nba$Age)
##
## Pearson's product-moment correlation
##
## data: nba$Seasons.in.league and nba$Age
## t = 68.364, df = 1338, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8692112 0.8931019
## sample estimates:
## cor
## 0.8817206
scatter.smooth(nba$Seasons.in.league,nba$Age)
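ggplot(nba, aes(Seasons.in.league, Age)) + geom_point()
The test shows a strong positive correlation between a player's age and the number of seasons in the league (r = 0.88, 95% confidence interval 0.87 to 0.89, p < 2.2e-16). The scatter plots show the same upward trend.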
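As a small worked example of what cor.test() estimates, the sketch below computes Pearson's r by hand on made-up vectors and compares it with cor() and cor.test(). The values are hypothetical and are not taken from the dataset.

```r
# Made-up values, used only for illustration
seasons <- c(0, 2, 4, 6, 9, 12)
age     <- c(20, 23, 24, 27, 30, 34)

# Pearson's r: co-deviation divided by the product of the deviations' norms
dx <- seasons - mean(seasons)
dy <- age - mean(age)
r_manual  <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
r_builtin <- cor(seasons, age)

result <- cor.test(seasons, age)
result$estimate   # same r as above
result$conf.int   # 95% confidence interval for r
result$p.value    # p-value for the null hypothesis r = 0
```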