Learn Histogram

====================================================================================


What to do first?

Check your current working directory and, if needed, change it to the folder that holds your data.

getwd()
## [1] "C:/Users/mbarnwa/Dropbox/documents/Office/Learn/Data visualization, udacity/201402-histogram-tutorial"
setwd("C:/Users/mbarnwa/Dropbox/documents/Office/Learn/Data visualization, udacity/201402-histogram-tutorial")
list.files()
## [1] "__MACOSX"                 "LearnHistogram.html"     
## [3] "LearnHistogram.rmd"       "nba-players-histograms.R"
## [5] "nba-players.csv"          "rsconnect"

After listing the files in your working directory, read the file you want to analyze.

nba <- read.csv("nba-players.csv")

Once the file is read, get to know the input data a little: look at the column names, the dimensions of the data, and the first few rows.

names(nba)
##  [1] "Name"            "Age"             "Team"           
##  [4] "POS"             "X."              "X2013salary"    
##  [7] "Ht_inches"       "Wt"              "Exp"            
## [10] "First_year"      "DOB"             "School"         
## [13] "City"            "State_Territory" "Country"        
## [16] "Race"            "HS_Only"
dim(nba)
## [1] 528  17
head(nba)
##                 Name Age          Team POS X. X2013salary Ht_inches  Wt
## 1        Gee, Alonzo  26     Cavaliers   F 33  $3,250,000        78 219
## 2    Wallace, Gerald  31       Celtics   F 45 $10,105,855        79 220
## 3       Williams, Mo  30 Trail Blazers   G 25  $2,652,000        73 195
## 4  Gladness, Mickell  27         Magic   C 40    $762,195        83 220
## 5 Jefferson, Richard  33          Jazz   F 44 $11,046,000        79 230
## 6      Hill, Solomon  22        Pacers   F  9  $1,246,680        79 220
##   Exp First_year        DOB      School              City State_Territory
## 1   4       2009  5/29/1987     Alabama Riviera Beach, FL         Florida
## 2  12       2001  7/23/1982     Alabama     Sylacauga, AL         Alabama
## 3  10       2003 12/19/1982     Alabama       Jackson, MS     Mississippi
## 4   2       2011  7/26/1986 Alabama A&M    Birmingham, AL         Alabama
## 5  12       2001  6/21/1980     Arizona   Los Angeles, CA      California
## 6   0       2013  3/18/1991     Arizona   Los Angeles, CA      California
##   Country  Race HS_Only
## 1      US Black      No
## 2      US Black      No
## 3      US Black      No
## 4      US Black      No
## 5      US Black      No
## 6      US Black      No
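Beyond names(), dim() and head(), a few other base R functions help with a first look at column types and values. The following is a minimal sketch on a small made-up data frame; `toy` and its values are illustrative only and are not taken from nba-players.csv.

```r
# Toy data frame standing in for the real file (values are made up for illustration)
toy <- data.frame(
  Name      = c("Player A", "Player B", "Player C", "Player D"),
  POS       = c("G", "F", "C", "G"),
  Ht_inches = c(74, 79, 84, 75),
  stringsAsFactors = FALSE
)

str(toy)                   # column types and a preview of each column
summary(toy$Ht_inches)     # min, quartiles, mean and max of a numeric column
pos_counts <- table(toy$POS)   # how many rows fall in each category
pos_counts
na_counts <- colSums(is.na(toy))   # number of missing values per column
na_counts
```

On the real data, the same calls would be made on `nba`. Since the salary values shown above contain `$` and commas, str() would be expected to report that column as text rather than as numbers.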

We are here to learn about histograms.

One misconception I had was that a bar chart and a histogram are the same thing, which I now know is not true. A bar chart is used for categorical data, and the main focus is on the height of the bars; all bars have the same width. As an example, we will make a bar plot of the heights of all the players on the team ‘Warriors’.

warriors <- subset(nba, Team=='Warriors')
# ordering the dataframe 'warriors' by height of the players in increasing order
warriors_o <- warriors[order(warriors$Ht_inches), ]
dim(warriors_o)
## [1] 17 17
head(warriors_o)
##                 Name Age     Team POS X. X2013salary Ht_inches  Wt Exp
## 91       Curry, Seth  23 Warriors   G  3    $490,180        74 185   0
## 114   Douglas, Toney  27 Warriors   G  0  $1,600,000        74 185   4
## 70    Curry, Stephen  25 Warriors   G 30  $9,887,640        75 185   4
## 320 Nedović, Nemanja  22 Warriors   G  8  $1,056,720        75 192   0
## 366   Bazemore, Kent  24 Warriors   G 20    $788,872        77 201   1
## 14   Iguodala, Andre  29 Warriors G/F  9 $12,868,632        78 207   9
##     First_year       DOB        School            City State_Territory
## 91        2013 8/23/1990          Duke   Charlotte, NC  North Carolina
## 114       2009 3/16/1986 Florida State       Tampa, FL         Florida
## 70        2009 3/14/1988      Davidson       Akron, OH            Ohio
## 320       2013 6/16/1991           n/a      Nova Varos             n/a
## 366       2012  7/1/1989  Old Dominion     Kelford, NC  North Carolina
## 14        2004 1/28/1984       Arizona Springfield, IL        Illinois
##        Country  Race HS_Only
## 91          US Black      No
## 114         US Black      No
## 70          US Mixed      No
## 320 Yugoslavia White      No
## 366         US Black      No
## 14          US Black      No
print(warriors_o$Name)
##  [1] Curry, Seth        Douglas, Toney     Curry, Stephen    
##  [4] Nedović, Nemanja   Bazemore, Kent     Iguodala, Andre   
##  [7] Green, Draymond    Thompson, Klay     Barnes, Harrison  
## [10] Alexander, Joe     Lee, David         Speights, Marreese
## [13] O'Neal, Jermaine   Ezeli, Festus      Dedmon, Dewayne   
## [16] Bogut, Andrew      Kuzmić, Ognjen    
## 528 Levels: ?lyasova, Ersan A??k, Ömer Acy, Quincy ... Zeller, Tyler
# barplot(warriors_o$Ht_inches, warriors_o$Name)
barplot(warriors_o$Ht_inches, names.arg=warriors_o$Name,  border=NA,
        las=1, main="Heights of Golden State Warriors", xlab="Names", ylab="Heights",
        col = c("lightblue", "mistyrose", "lightcyan",
                "lavender", "cornsilk"))

Next, see how the average height varies with the position the players play.

unique(nba$POS)
## [1] F   G   C   F/C G/F
## Levels: C F F/C G G/F
avgHeights <- aggregate(Ht_inches~POS, data = nba, FUN = mean)
dim(avgHeights)
## [1] 5 2
avgHeights_o <- avgHeights[order(avgHeights$Ht_inches),]
barplot(avgHeights_o$Ht_inches, names.arg = avgHeights_o$POS, 
        main = "Average heights of players across positions",
        xlab = "Positions", ylab = "Heights", las = 1, border = NA)
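The grouping step above can be tried on a few made-up numbers to see what aggregate() returns. This sketch also shows tapply() as one possible alternative that yields a named vector, which barplot() labels automatically. The `toy` data and the variable names are illustrative, not part of the NBA file.

```r
# Made-up heights for three positions (illustrative values only)
toy <- data.frame(
  POS       = c("G", "G", "F", "F", "C", "C"),
  Ht_inches = c(74, 76, 79, 81, 83, 85)
)

# One mean per position, returned as a data frame, then sorted by height
avg   <- aggregate(Ht_inches ~ POS, data = toy, FUN = mean)
avg_o <- avg[order(avg$Ht_inches), ]
avg_o

# Alternative: tapply() returns a named vector, so no names.arg is needed
avg_vec <- sort(tapply(toy$Ht_inches, toy$POS, mean))
barplot(avg_vec, main = "Average height by position (toy data)",
        xlab = "Positions", ylab = "Heights", las = 1, border = NA)
```

Both approaches give the same means; which one to use is mostly a matter of whether a data frame or a named vector is more convenient for the next step.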

A histogram is used to understand the distribution of a numeric variable. Here the bars need not all have the same width, and the point of interest is not the height of a bar but its area.
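To see why area matters, here is a small sketch with deliberately unequal bin widths. The `heights` vector and the break points are made up for illustration. When the bins are not equally wide, hist() plots densities by default, so each bar's area (density times width) is the share of values in that bin, and the areas add up to 1.

```r
# Made-up heights in inches (illustrative values only)
heights <- c(70, 72, 73, 75, 76, 77, 78, 79, 80, 84)

# Unequal bins: two wide bins at the ends, narrow bins in the middle
h <- hist(heights, breaks = c(68, 74, 76, 78, 80, 86), plot = FALSE)

widths <- diff(h$breaks)         # width of each bin
h$counts                         # number of values in each bin
h$density                        # bar heights on the density scale
total_area <- sum(h$density * widths)
total_area                       # total area of all bars

# Draw the histogram computed above; with unequal bins the y-axis is density
plot(h, main = "Unequal bin widths (toy data)", xlab = "inches")
```

The first bin holds more values than any middle bin, yet its bar is shorter because it is three times as wide. Comparing bar heights alone would be misleading here; comparing areas is not.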

Say we want to see how the heights of basketball players vary in the NBA:

hist(nba$Ht_inches)

par(mfrow = c(1,3), las = 1) # las: numeric in {0,1,2,3}; the style of axis labels.
range(nba$Ht_inches)
## [1] 69 87
hist(nba$Ht_inches, main = "NBA players heights", xlab = "inches", 
     breaks = seq(65, 90, 1) )
hist(nba$Ht_inches, main = "NBA players heights", xlab = "inches", 
     breaks = seq(65, 90, 2) )


hist(nba$Ht_inches, main = "NBA players heights", xlab = "inches", 
     breaks = seq(65, 90, 5) )
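The three calls above differ only in bin width. A quick way to check what each choice does, without drawing anything, is to call hist() with plot = FALSE and inspect the result. The sketch below uses simulated values drawn from the 69 to 87 inch range reported by range(); it is not the NBA data, and the variable names are illustrative.

```r
set.seed(1)
# Simulated heights within the observed range (not the real NBA data)
heights <- sample(69:87, 200, replace = TRUE)

bin_widths <- c(1, 2, 5)
n_bins  <- integer(length(bin_widths))
n_total <- integer(length(bin_widths))

for (k in seq_along(bin_widths)) {
  # Note: seq(65, 90, 2) stops at 89, which still covers the maximum of 87
  h <- hist(heights, breaks = seq(65, 90, bin_widths[k]), plot = FALSE)
  n_bins[k]  <- length(h$counts)   # how many bars the histogram has
  n_total[k] <- sum(h$counts)      # how many values were counted
}

data.frame(bin_widths, n_bins, n_total)
```

The total count stays the same in every case; only the number of bars changes. Narrow bins show more detail but more noise, while wide bins smooth the shape and can hide features.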

In the earlier bar plot we looked at how the average height varies with POS, the position of the player in the game, but that didn’t give us much information.

Now we will look at the distribution of heights for each position.

plot.new()
positions <- unique(nba$POS)
positions
## [1] F   G   C   F/C G/F
## Levels: C F F/C G G/F
par(mfrow=c(2,3), las=1)

for (i in seq_along(positions)) {
        # Rows for all players at the current position
        posHeights <- subset(nba, POS == positions[i])
        h <- hist(posHeights$Ht_inches,
                  main = paste0("Heights for position ", positions[i]),
                  breaks = seq(65, 90, 1), xlab = "inches", border = "#ffffff",
                  col = "#999999", lwd = 0.4)

        # White vertical lines at every break, up to the tallest bar
        maxFreq <- max(h$counts)
        segments(h$breaks, rep(0, length(h$breaks)), h$breaks, maxFreq, col = "white")

        # Mark the median (blue) and the mean (red) height
        heightMedian <- median(posHeights$Ht_inches)
        heightMean <- mean(posHeights$Ht_inches)
        lines(c(heightMedian, heightMedian), c(-1, maxFreq), col = "blue", lwd = 2)
        lines(c(heightMean, heightMean), c(-1, maxFreq), col = "red", lwd = 2)
}
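The blue and red lines are easier to read with a legend. The following is a minimal sketch for a single panel, using made-up heights rather than the NBA data; abline(v = ...) is used here as a shorter alternative to lines() for drawing a full-height vertical line.

```r
# Made-up heights in inches (illustrative values only)
heights <- c(72, 74, 75, 75, 76, 78, 84)

heightMedian <- median(heights)
heightMean   <- mean(heights)

hist(heights, breaks = seq(65, 90, 1), main = "Heights (toy data)",
     xlab = "inches", border = "#ffffff", col = "#999999")

# Vertical reference lines for the two summary statistics
abline(v = heightMedian, col = "blue", lwd = 2)
abline(v = heightMean,   col = "red",  lwd = 2)

# Legend colours must be listed in the same order as the labels
legend("topright", legend = c("Mean", "Median"),
       col = c("red", "blue"), lwd = 2)
```

In this toy sample the single tall value pulls the mean to the right of the median, which is the kind of skew the two lines are meant to reveal.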

Summary:

We will end this tutorial with a few takeaways:

  1. Histograms and bar plots are different.

  2. Bar plots are used for categorical data.

  3. Histograms are used to show the distribution of a numeric variable.

  4. For bar plots, only the length of the bar carries information; for histograms it is the area, so both the height and the width matter.