Importing the dataset

nba<-read.csv("C:\\Users\\vigne\\Desktop\\NBA_player_of_the_week.csv")
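
The per-category counts in the summary output later in this report suggest that text columns were read as factors, which was the read.csv default before R 4.0. A minimal sketch of requesting that behaviour explicitly, using a made-up three-row extract in place of the real file (the rows and player names are hypothetical):

```r
# Hypothetical three-row extract standing in for the real CSV file.
csv_text <- "Player,Position,Age
Player A,G,25
Player B,C,28
Player C,G,31"

# stringsAsFactors = TRUE converts text columns to factors on any R version
# (the default changed to FALSE in R 4.0.0).
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(toy)
levels(toy$Position)
```

With a real file, the same argument can be added to the read.csv call that takes the file path.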

Exploration of dataset

dim(nba)
## [1] 1340   17

The dim(nba) function returns the number of rows and columns in the dataset: here, 1340 rows and 17 columns.

View(nba)

The View(nba) function (note the capital V) opens the entire dataset in a spreadsheet-style viewer.

DESCRIPTIVE STATISTICS

summary(nba)
##             Player                      Team     Conference           Date     
##  LeBron James  :  63   Los Angeles Lakers : 87       :501   Jan 12, 2003:   3  
##  Kobe Bryant   :  33   Houston Rockets    : 68   East:420   Apr 1, 2007 :   2  
##  Kevin Durant  :  26   San Antonio Spurs  : 68   West:419   Apr 1, 2013 :   2  
##  Michael Jordan:  25   Cleveland Cavaliers: 62              Apr 1, 2019 :   2  
##  James Harden  :  24   Miami Heat         : 59              Apr 10, 2005:   2  
##  Allen Iverson :  23   Boston Celtics     : 57              Apr 10, 2017:   2  
##  (Other)       :1146   (Other)            :939              (Other)     :1327  
##     Position       Height        Weight           Age          Draft.Year  
##  G      :200   6'9    :158   Min.   :150.0   Min.   :19.00   Min.   :1965  
##  C      :188   6'7    :148   1st Qu.:205.0   1st Qu.:24.00   1st Qu.:1987  
##  SG     :177   6'8    :146   Median :220.0   Median :26.00   Median :1998  
##  PF     :157   6'6    :135   Mean   :224.6   Mean   :26.74   Mean   :1996  
##  F      :154   6'3    :134   3rd Qu.:250.0   3rd Qu.:29.00   3rd Qu.:2005  
##  SF     :153   6'11   :121   Max.   :325.0   Max.   :40.00   Max.   :2018  
##  (Other):311   (Other):498                                                 
##  Seasons.in.league       Season      Season.short 
##  Min.   : 0.00     2002-2003:  47   Min.   :1980  
##  1st Qu.: 3.00     2003-2004:  46   1st Qu.:1994  
##  Median : 5.00     2004-2005:  46   Median :2005  
##  Mean   : 5.74     2005-2006:  46   Mean   :2003  
##  3rd Qu.: 8.00     2006-2007:  46   3rd Qu.:2013  
##  Max.   :17.00     2007-2008:  46   Max.   :2020  
##                    (Other)  :1063                 
##                                  Pre.draft.Team   Real_value    
##  St. Vincent St. Mary High School (Ohio):  63   Min.   :0.5000  
##  Georgetown                             :  51   1st Qu.:0.5000  
##  North Carolina                         :  47   Median :0.5000  
##  UCLA                                   :  42   Mean   :0.6869  
##  Wake Forest                            :  40   3rd Qu.:1.0000  
##  Texas                                  :  37   Max.   :1.0000  
##  (Other)                                :1060                   
##    Height.CM       Weight.KG      Last.Season     
##  Min.   :175.0   Min.   : 68.0   Min.   :0.00000  
##  1st Qu.:193.0   1st Qu.: 93.0   1st Qu.:0.00000  
##  Median :201.0   Median : 99.0   Median :0.00000  
##  Mean   :201.1   Mean   :101.4   Mean   :0.02388  
##  3rd Qu.:208.0   3rd Qu.:113.0   3rd Qu.:0.00000  
##  Max.   :229.0   Max.   :147.0   Max.   :1.00000  
## 

The summary function gives the minimum, first quartile, median, mean, third quartile and maximum of every numeric variable, and the counts of the most frequent levels of every categorical variable.

Taking the Age column as an example: the minimum age is 19, the mean age is 26.74 and the maximum age is 40.

For a categorical column such as Position, the output lists how many rows fall in each category, for example G: 200, SG: 177 and SF: 153. The Height and Team columns can be read in the same way.
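
To rank the categories of a single column instead of reading them off the summary, the counts can be tabulated and sorted. A small sketch on a made-up vector of positions (the values are illustrative and not taken from the dataset):

```r
# Made-up positions for illustration only.
position <- factor(c("G", "SG", "G", "C", "SF", "G", "C"))

# table() counts each level; sort() puts the most frequent first.
counts <- sort(table(position), decreasing = TRUE)
counts

# Share of each position, rounded to three decimals.
round(prop.table(counts), 3)
```

On the real data the same two calls would be applied to nba$Position.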

IMPORTING PACKAGES FOR DESCRIPTIVE STATISTICS

HMISC

library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
describe(nba)
## nba 
## 
##  17  Variables      1340  Observations
## --------------------------------------------------------------------------------
## Player 
##        n  missing distinct 
##     1340        0      333 
## 
## lowest : Aaron McKie        Adrian Dantley     Al Harrington      Al Horford         Al Jefferson      
## highest: World B. Free      Xavier McDaniel    Yao Ming           Zach Randolph      Zydrunas Ilgauskas
## --------------------------------------------------------------------------------
## Team 
##        n  missing distinct 
##     1340        0       37 
## 
## lowest : Atlanta Hawks       Boston Celtics      Brooklyn Nets       Charlotte Bobcats   Charlotte Hornets  
## highest: Seattle SuperSonics Toronto Raptors     Utah Jazz           Washington Bullets  Washington Wizards 
## --------------------------------------------------------------------------------
## Conference 
##        n  missing distinct 
##     1340        0        3 
##                             
## Value             East  West
## Frequency    501   420   419
## Proportion 0.374 0.313 0.313
## --------------------------------------------------------------------------------
## Date 
##        n  missing distinct 
##     1340        0      916 
## 
## lowest : Apr 1, 1984  Apr 1, 1985  Apr 1, 1990  Apr 1, 2001  Apr 1, 2007 
## highest: Oct 28, 1979 Oct 28, 2019 Oct 29, 2018 Oct 30, 2017 Oct 31, 2016
## --------------------------------------------------------------------------------
## Position 
##        n  missing distinct 
##     1340        0       11 
## 
## lowest : C   F   F-C FC  G  , highest: GF  PF  PG  SF  SG 
##                                                                             
## Value          C     F   F-C    FC     G   G-F    GF    PF    PG    SF    SG
## Frequency    188   154     9    88   200     3    70   157   141   153   177
## Proportion 0.140 0.115 0.007 0.066 0.149 0.002 0.052 0.117 0.105 0.114 0.132
## --------------------------------------------------------------------------------
## Height 
##        n  missing distinct 
##     1340        0       21 
## 
## lowest : 5'10 5'11 5'9  6'0  6'1 , highest: 7'1  7'2  7'3  7'4  7'6 
## --------------------------------------------------------------------------------
## Weight 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       72    0.997    224.6    34.38      175      185 
##      .25      .50      .75      .90      .95 
##      205      220      250      260      265 
## 
## lowest : 150 163 165 168 170, highest: 285 289 307 310 325
## --------------------------------------------------------------------------------
## Age 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       22    0.991    26.74    3.794       22       23 
##      .25      .50      .75      .90      .95 
##       24       26       29       31       33 
## 
## lowest : 19 20 21 22 23, highest: 36 37 38 39 40
## --------------------------------------------------------------------------------
## Draft.Year 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       52    0.998     1996    12.83     1977     1979 
##      .25      .50      .75      .90      .95 
##     1987     1998     2005     2010     2012 
## 
## lowest : 1965 1968 1969 1970 1971, highest: 2014 2015 2016 2017 2018
## --------------------------------------------------------------------------------
## Seasons.in.league 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       18    0.991     5.74    3.668        1        2 
##      .25      .50      .75      .90      .95 
##        3        5        8       10       12 
## 
## lowest :  0  1  2  3  4, highest: 13 14 15 16 17
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency     35    63   119   149   166   164   155   123   108    87    51
## Proportion 0.026 0.047 0.089 0.111 0.124 0.122 0.116 0.092 0.081 0.065 0.038
##                                                     
## Value         11    12    13    14    15    16    17
## Frequency     37    37    15    13     6    10     2
## Proportion 0.028 0.028 0.011 0.010 0.004 0.007 0.001
## --------------------------------------------------------------------------------
## Season 
##        n  missing distinct 
##     1340        0       41 
## 
## lowest : 1979-1980 1980-1981 1981-1982 1982-1983 1983-1984
## highest: 2015-2016 2016-2017 2017-2018 2018-2019 2019-2020
## --------------------------------------------------------------------------------
## Season.short 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       41    0.999     2003    13.07     1982     1985 
##      .25      .50      .75      .90      .95 
##     1994     2005     2013     2017     2019 
## 
## lowest : 1980 1981 1982 1983 1984, highest: 2016 2017 2018 2019 2020
## --------------------------------------------------------------------------------
## Pre.draft.Team 
##        n  missing distinct 
##     1340        0      169 
## 
## lowest : Alabama                         Albany State (GA)               Alcorn State                    Alief Elsik High School (Texas) Alpella B.K. (Turkey)          
## highest: Western Carolina                Wichita State                   Wisconsin                       Xavier                          Zalgiris (Lithuania)           
## --------------------------------------------------------------------------------
## Real_value 
##        n  missing distinct     Info     Mean      Gmd 
##     1340        0        2    0.702   0.6869   0.2343 
##                       
## Value        0.5   1.0
## Frequency    839   501
## Proportion 0.626 0.374
## --------------------------------------------------------------------------------
## Height.CM 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       21    0.992    201.1    10.58      185      188 
##      .25      .50      .75      .90      .95 
##      193      201      208      213      216 
## 
## lowest : 175 178 180 183 185, highest: 216 218 221 224 229
## --------------------------------------------------------------------------------
## Weight.KG 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##     1340        0       48    0.997    101.4    15.64       79       83 
##      .25      .50      .75      .90      .95 
##       93       99      113      118      120 
## 
## lowest :  68  74  76  77  78, highest: 129 131 139 140 147
## --------------------------------------------------------------------------------
## Last.Season 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##     1340        0        2     0.07       32  0.02388  0.04666 
## 
## --------------------------------------------------------------------------------

The describe function in the Hmisc package gives a more detailed descriptive summary of the nba dataset.

It reports the number of missing values for each variable. This output shows that there are no missing values in the dataset.
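
The missing-value check can also be done directly in base R. A minimal sketch on a made-up data frame with two deliberately missing entries (the column values are hypothetical):

```r
# Made-up data frame with two deliberately missing entries.
toy <- data.frame(
  Age    = c(25, NA, 31),
  Weight = c(220, 205, NA),
  Team   = c("A", "B", "C")
)

# is.na() marks each missing cell; colSums() totals them per column.
missing_per_column <- colSums(is.na(toy))
missing_per_column

# TRUE only when the whole data frame is complete.
no_missing <- !anyNA(toy)
no_missing
```

Applied to nba, a result of all zeros from colSums(is.na(nba)) would agree with the "missing 0" entries printed by describe.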

Taking the variable Height.CM as an example, we can read off the five lowest distinct heights (175 to 185 cm), the five highest (216 to 229 cm), the mean height (201.1 cm) and a set of quantiles. The text column Height only gets counts and its extreme values sorted as text. For variables with few distinct values, such as Position, describe also prints the number and proportion of rows in each category.

Similarly, other packages such as pastecs and psych can also be used for descriptive statistics of a dataset.

PASTECS

library(pastecs)
stat.desc(nba)
##          Player Team Conference Date Position Height       Weight          Age
## nbr.val      NA   NA         NA   NA       NA     NA 1.340000e+03 1.340000e+03
## nbr.null     NA   NA         NA   NA       NA     NA 0.000000e+00 0.000000e+00
## nbr.na       NA   NA         NA   NA       NA     NA 0.000000e+00 0.000000e+00
## min          NA   NA         NA   NA       NA     NA 1.500000e+02 1.900000e+01
## max          NA   NA         NA   NA       NA     NA 3.250000e+02 4.000000e+01
## range        NA   NA         NA   NA       NA     NA 1.750000e+02 2.100000e+01
## sum          NA   NA         NA   NA       NA     NA 3.009200e+05 3.582900e+04
## median       NA   NA         NA   NA       NA     NA 2.200000e+02 2.600000e+01
## mean         NA   NA         NA   NA       NA     NA 2.245672e+02 2.673806e+01
## SE.mean      NA   NA         NA   NA       NA     NA 8.413614e-01 9.289958e-02
## CI.mean      NA   NA         NA   NA       NA     NA 1.650530e+00 1.822446e-01
## var          NA   NA         NA   NA       NA     NA 9.485713e+02 1.156464e+01
## std.dev      NA   NA         NA   NA       NA     NA 3.079888e+01 3.400683e+00
## coef.var     NA   NA         NA   NA       NA     NA 1.371478e-01 1.271851e-01
##            Draft.Year Seasons.in.league Season Season.short Pre.draft.Team
## nbr.val  1.340000e+03      1340.0000000     NA 1.340000e+03             NA
## nbr.null 0.000000e+00        35.0000000     NA 0.000000e+00             NA
## nbr.na   0.000000e+00         0.0000000     NA 0.000000e+00             NA
## min      1.965000e+03         0.0000000     NA 1.980000e+03             NA
## max      2.018000e+03        17.0000000     NA 2.020000e+03             NA
## range    5.300000e+01        17.0000000     NA 4.000000e+01             NA
## sum      2.675025e+06      7692.0000000     NA 2.684230e+06             NA
## median   1.998000e+03         5.0000000     NA 2.005000e+03             NA
## mean     1.996287e+03         5.7402985     NA 2.003157e+03             NA
## SE.mean  3.074238e-01         0.0899694     NA 3.133410e-01             NA
## CI.mean  6.030846e-01         0.1764963     NA 6.146927e-01             NA
## var      1.266426e+02        10.8466198     NA 1.315647e+02             NA
## std.dev  1.125356e+01         3.2934207     NA 1.147016e+01             NA
## coef.var 5.637243e-03         0.5737368     NA 5.726044e-03             NA
##            Real_value    Height.CM    Weight.KG  Last.Season
## nbr.val  1.340000e+03 1.340000e+03 1.340000e+03 1.340000e+03
## nbr.null 0.000000e+00 0.000000e+00 0.000000e+00 1.308000e+03
## nbr.na   0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min      5.000000e-01 1.750000e+02 6.800000e+01 0.000000e+00
## max      1.000000e+00 2.290000e+02 1.470000e+02 1.000000e+00
## range    5.000000e-01 5.400000e+01 7.900000e+01 1.000000e+00
## sum      9.205000e+02 2.694360e+05 1.358550e+05 3.200000e+01
## median   5.000000e-01 2.010000e+02 9.900000e+01 0.000000e+00
## mean     6.869403e-01 2.010716e+02 1.013843e+02 2.388060e-02
## SE.mean  6.611116e-03 2.559134e-01 3.827575e-01 4.172379e-03
## CI.mean  1.296927e-02 5.020349e-01 7.508697e-01 8.185112e-03
## var      5.856718e-02 8.775887e+01 1.963145e+02 2.332772e-02
## std.dev  2.420066e-01 9.367970e+00 1.401123e+01 1.527342e-01
## coef.var 3.522964e-01 4.659021e-02 1.381991e-01 6.395743e+00

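The stat.desc function in the pastecs package reports the range, variance, standard deviation, standard error and coefficient of variation of each numeric variable. Non-numeric columns such as Player and Team are filled with NA.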
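
The derived rows of this table can be checked by hand. The sketch below assumes the usual textbook definitions (standard error = sd / sqrt(n), CI.mean = half-width of the 95% t interval, coef.var = sd / mean); that assumption is tested here only against the printed Weight column, not against the pastecs source.

```r
# Values copied from the Weight column of the stat.desc() output above.
n      <- 1340
mean_w <- 224.5672
sd_w   <- 30.79888

se_mean  <- sd_w / sqrt(n)                   # standard error of the mean
ci_mean  <- qt(0.975, df = n - 1) * se_mean  # half-width of the 95% interval
coef_var <- sd_w / mean_w                    # coefficient of variation

round(c(SE.mean = se_mean, CI.mean = ci_mean, coef.var = coef_var), 4)
```

The three results agree with the SE.mean, CI.mean and coef.var rows printed for Weight, so the mean weight can be reported as roughly 224.6 plus or minus 1.65 pounds at the 95% level.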

PSYCH

library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
psych::describe(nba)
##                   vars    n    mean     sd median trimmed    mad    min  max
## Player*              1 1340  163.93  93.25  170.0  163.64 117.13    1.0  333
## Team*                2 1340   18.57  10.38   18.0   18.53  13.34    1.0   37
## Conference*          3 1340    1.94   0.83    2.0    1.92   1.48    1.0    3
## Date*                4 1340  464.77 264.61  466.5  465.42 340.26    1.0  916
## Position*            5 1340    6.26   3.52    7.0    6.32   4.45    1.0   11
## Height*              6 1340   11.23   3.86   12.0   11.35   4.45    1.0   21
## Weight               7 1340  224.57  30.80  220.0  224.00  32.62  150.0  325
## Age                  8 1340   26.74   3.40   26.0   26.54   2.97   19.0   40
## Draft.Year           9 1340 1996.29  11.25 1998.0 1996.79  13.34 1965.0 2018
## Seasons.in.league   10 1340    5.74   3.29    5.0    5.51   2.97    0.0   17
## Season*             11 1340   24.16  11.47   26.0   24.81  13.34    1.0   41
## Season.short        12 1340 2003.16  11.47 2005.0 2003.81  13.34 1980.0 2020
## Pre.draft.Team*     13 1340   86.83  48.68   84.0   87.27  65.23    1.0  169
## Real_value          14 1340    0.69   0.24    0.5    0.67   0.00    0.5    1
## Height.CM           15 1340  201.07   9.37  201.0  201.37  10.38  175.0  229
## Weight.KG           16 1340  101.38  14.01   99.0  101.11  14.83   68.0  147
## Last.Season         17 1340    0.02   0.15    0.0    0.00   0.00    0.0    1
##                   range  skew kurtosis   se
## Player*           332.0  0.03    -1.07 2.55
## Team*              36.0  0.05    -1.19 0.28
## Conference*         2.0  0.11    -1.53 0.02
## Date*             915.0 -0.02    -1.20 7.23
## Position*          10.0 -0.18    -1.40 0.10
## Height*            20.0 -0.26    -0.83 0.11
## Weight            175.0  0.38     0.50 0.84
## Age                21.0  0.57     0.28 0.09
## Draft.Year         53.0 -0.37    -0.81 0.31
## Seasons.in.league  17.0  0.64     0.22 0.09
## Season*            40.0 -0.43    -0.95 0.31
## Season.short       40.0 -0.43    -0.95 0.31
## Pre.draft.Team*   168.0  0.03    -1.26 1.33
## Real_value          0.5  0.52    -1.73 0.01
## Height.CM          54.0 -0.22    -0.36 0.26
## Weight.KG          79.0  0.38     0.50 0.38
## Last.Season         1.0  6.23    36.84 0.00

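The psych package can be used for more advanced descriptive statistics: in addition to the mean and standard deviation, it reports the trimmed mean, the median absolute deviation, the skewness and the kurtosis of every variable. Variables marked with * are categorical and were converted to integer codes, so their statistics are not meaningful.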
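
To make the skew and kurtosis columns concrete, here is a small sketch of moment-based definitions in base R. The helper names skewness and excess_kurtosis are hypothetical, the input vectors are made up, and psych may apply a different small-sample correction depending on its type argument, so this is illustrative and not a reimplementation of the package.

```r
# Moment-based skewness and excess kurtosis (illustrative definitions).
skewness <- function(x) {
  n <- length(x)
  sum((x - mean(x))^3) / (n * sd(x)^3)
}
excess_kurtosis <- function(x) {
  n <- length(x)
  sum((x - mean(x))^4) / (n * sd(x)^4) - 3
}

symmetric    <- c(1, 2, 3, 4, 5)    # made-up values
right_tailed <- c(1, 2, 3, 4, 20)   # made-up values with one large value

skewness(symmetric)       # zero for a symmetric sample
skewness(right_tailed)    # positive: long right tail
excess_kurtosis(right_tailed)
```

Read this way, the skew of 6.23 reported for Last.Season reflects a variable that is almost always 0 with a few 1s, while the values near zero for Height.CM indicate a roughly symmetric distribution.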

FREQUENCY DISTRIBUTION TABLE

mydata<-table(nba$Draft.Year,nba$Age)
margin.table(mydata,1)
## 
## 1965 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 
##    1    1    7    1    4    3    2   11    9   26   20   30   23   10   26   19 
## 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 
##   16   62   47   11   32    8   23   12   21   27   18   33   27   93   44   55 
## 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 
##   37    8   45   23  119   38   37   26   36   37   62   26   42   28   20   11 
## 2015 2016 2017 2018 
##    7    9    4    3
margin.table(mydata,2)
## 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38 
##   3  14  29  66 113 145 176 141 142 145  94  84  62  47  33  17  13   5   5   3 
##  39  40 
##   1   2

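The table cross-tabulates draft year (rows) against age (columns), and margin.table sums it along each dimension. The first margin shows that 2003 is the most common draft year, with 119 rows, followed by 1996 with 93. The second shows that 25 is the most common age, with 176 rows. Each row is a weekly award and not a distinct player (there are 333 distinct players in 1340 rows), so a player who won repeatedly is counted once per award.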
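
The same steps can be tried on a tiny example where the margins are easy to verify by eye. The six rows below are made up for illustration:

```r
# Made-up draft years and ages for six award rows.
draft_year <- c(2003, 2003, 1996, 2003, 1996, 2009)
age        <- c(22, 25, 27, 25, 30, 24)

two_way <- table(draft_year, age)

# Margin 1 sums over ages (one count per draft year);
# margin 2 sums over draft years (one count per age).
by_year <- margin.table(two_way, 1)
by_age  <- margin.table(two_way, 2)

# Name of the most frequent category in each margin.
names(which.max(by_year))
names(which.max(by_age))
```

Using which.max avoids scanning a long printed margin for the largest count.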

The count function in the plyr package can also be used to build a frequency distribution table.

library(plyr)
## Warning: package 'plyr' was built under R version 3.6.3
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize
fdt<-count(nba,"Height.CM")
class(fdt)
## [1] "data.frame"
barplot(fdt$freq,names.arg = fdt$Height.CM,xlab = "height (cm)",ylab = "number of rows",main = "frequency barplot for players height")

The above is a frequency bar plot of player height. Each bar corresponds to one of the 21 distinct heights, and its length is the number of rows (weekly awards) recorded at that height.

VISUALIZATION

library(ggplot2)
boxplot(nba$Height.CM)

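The box plot shows the median, the quartiles and any points lying more than 1.5 times the interquartile range beyond the box, which are drawn individually as outliers. Outliers matter because they pull the mean more strongly than the median.

The median height is 201 cm, and the middle half of the rows fall between 193 and 208 cm.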

library(ggplot2)
boxplot(nba$Draft.Year)

The box plot summarises the distribution of draft years and does not give a count for each year. The median draft year is 1998, and the middle half of the rows fall between 1987 and 2005.
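
Whether a box plot will draw any outlier points can be estimated from the quartiles alone. The sketch below uses the quartiles printed by summary(nba); boxplot() itself works from hinges, which can differ slightly from these quartiles, so the result is an approximation.

```r
# Quartiles and extremes copied from the summary(nba) output above.
stats <- data.frame(
  variable = c("Height.CM", "Draft.Year", "Weight"),
  min      = c(175, 1965, 150),
  q1       = c(193, 1987, 205),
  q3       = c(208, 2005, 250),
  max      = c(229, 2018, 325)
)

iqr <- stats$q3 - stats$q1
stats$lower_fence <- stats$q1 - 1.5 * iqr
stats$upper_fence <- stats$q3 + 1.5 * iqr

# TRUE when at least one observation lies outside the fences.
stats$has_outliers <- stats$min < stats$lower_fence |
                      stats$max > stats$upper_fence
stats
```

Under this approximation the fences for Height.CM are 170.5 and 230.5 cm and those for Draft.Year are 1960 and 2032, so neither plot is expected to show outlier points, whereas Weight (maximum 325 against an upper fence of 317.5) would.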

library(ggplot2)
ggplot(nba,aes(Height.CM,Position))+geom_jitter()

The jitter plot above shows that most of the points lie between 190 and 210 cm, with a smaller number at the extremes. It also suggests that height is related to playing position.
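
The visual impression that height varies with position can be backed by a per-position summary. A minimal sketch on made-up heights (the numbers are hypothetical and only illustrate the call):

```r
# Made-up heights (cm) and positions, for illustration only.
toy <- data.frame(
  Height.CM = c(188, 191, 198, 203, 211, 216),
  Position  = c("G", "G", "F", "F", "C", "C")
)

# Median height within each position.
median_by_position <- tapply(toy$Height.CM, toy$Position, median)
sort(median_by_position)
```

On the real data, tapply(nba$Height.CM, nba$Position, median) would give one median per position code.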

library(ggplot2)
ggplot(nba,aes(Draft.Year,fill=Team))+geom_bar()
## Warning: position_stack requires non-overlapping x intervals

The stacked bar chart above shows, for each draft year, how many rows belong to players drafted in that year, with each bar split by the Team column.

library(ggplot2)
ggplot(nba,aes(Height.CM,fill=factor(Age)))+geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
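
The message above appears because no bin width was chosen. A minimal base-R sketch of what an explicit 5 cm bin width does, using made-up heights (the values and the 5 cm width are assumptions for illustration):

```r
# Made-up heights in cm, for illustration only.
height_cm <- c(178, 185, 188, 191, 193, 196, 198, 201, 201, 203,
               206, 206, 208, 211, 213, 216, 221)

# Bins of 5 cm covering the full range of the data.
breaks <- seq(175, 225, by = 5)
bins   <- hist(height_cm, breaks = breaks, plot = FALSE)

# One row per bin: lower edge, upper edge and number of values.
data.frame(from = head(bins$breaks, -1),
           to   = tail(bins$breaks, -1),
           n    = bins$counts)
```

In ggplot2 the corresponding choice is made by passing binwidth = 5 to geom_histogram(), which also silences the message.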

CORRELATION

cor.test(nba$Seasons.in.league,nba$Age)
## 
##  Pearson's product-moment correlation
## 
## data:  nba$Seasons.in.league and nba$Age
## t = 68.364, df = 1338, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8692112 0.8931019
## sample estimates:
##       cor 
## 0.8817206
scatter.smooth(nba$Seasons.in.league,nba$Age)

ggplot(nba,aes(Seasons.in.league,Age))+geom_point()

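The test shows a strong positive linear relationship between a player's age and the number of seasons in the league: the correlation is 0.88 (95% confidence interval 0.869 to 0.893) and the p-value is below 2.2e-16, so the correlation is significantly different from zero.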
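
The figures printed by cor.test can be reproduced from the correlation and the degrees of freedom. The sketch below assumes the standard formulas for Pearson's test (a t statistic on n - 2 degrees of freedom and a confidence interval from the Fisher z-transformation) and takes its inputs from the output above.

```r
# Values copied from the cor.test() output above.
r  <- 0.8817206
df <- 1338          # n - 2, so n = 1340
n  <- df + 2

# t statistic for testing that the true correlation is zero.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation.
z    <- atanh(r)
half <- qnorm(0.975) / sqrt(n - 3)
ci   <- tanh(c(z - half, z + half))

round(t_stat, 3)
round(ci, 4)
round(r^2, 3)   # share of variance explained by a linear fit
```

The t statistic and interval match the printed values, and r squared of about 0.78 means that a straight line in seasons played accounts for roughly 78% of the variance in age.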