Check your current working directory and, if needed, change it to the directory that contains the data.
getwd()
## [1] "C:/Users/mbarnwa/Dropbox/documents/Office/Learn/Data visualization, udacity/201402-histogram-tutorial"
setwd("C:/Users/mbarnwa/Dropbox/documents/Office/Learn/Data visualization, udacity/201402-histogram-tutorial")
list.files()
## [1] "__MACOSX" "LearnHistogram.html"
## [3] "LearnHistogram.rmd" "nba-players-histograms.R"
## [5] "nba-players.csv" "rsconnect"
After listing the files in your working directory, read in the file you want to analyse.
nba <- read.csv("nba-players.csv")
Once the file is read, get to know the data a little: look at the column names, the dimensions of the data, and the first few rows.
names(nba)
## [1] "Name" "Age" "Team"
## [4] "POS" "X." "X2013salary"
## [7] "Ht_inches" "Wt" "Exp"
## [10] "First_year" "DOB" "School"
## [13] "City" "State_Territory" "Country"
## [16] "Race" "HS_Only"
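The column names `X.` and `X2013salary` look odd because `read.csv()` by default (`check.names = TRUE`) passes the header through `make.names()`, which prepends an `X` to names that do not start with a letter or dot and replaces invalid characters with a dot. A minimal sketch, assuming the raw headers were something like `#` and `2013salary` (the header line of the CSV is not shown in this tutorial, so those two strings are a guess):

```r
# Assumed raw headers; the real header line of nba-players.csv is not shown here.
raw_headers <- c("Name", "#", "2013salary")

# make.names() is what read.csv() applies when check.names = TRUE (the default).
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Passing `check.names = FALSE` to `read.csv()` would keep the headers as they are in the file, at the cost of having to quote such names with backticks.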
dim(nba)
## [1] 528 17
head(nba)
## Name Age Team POS X. X2013salary Ht_inches Wt
## 1 Gee, Alonzo 26 Cavaliers F 33 $3,250,000 78 219
## 2 Wallace, Gerald 31 Celtics F 45 $10,105,855 79 220
## 3 Williams, Mo 30 Trail Blazers G 25 $2,652,000 73 195
## 4 Gladness, Mickell 27 Magic C 40 $762,195 83 220
## 5 Jefferson, Richard 33 Jazz F 44 $11,046,000 79 230
## 6 Hill, Solomon 22 Pacers F 9 $1,246,680 79 220
## Exp First_year DOB School City State_Territory
## 1 4 2009 5/29/1987 Alabama Riviera Beach, FL Florida
## 2 12 2001 7/23/1982 Alabama Sylacauga, AL Alabama
## 3 10 2003 12/19/1982 Alabama Jackson, MS Mississippi
## 4 2 2011 7/26/1986 Alabama A&M Birmingham, AL Alabama
## 5 12 2001 6/21/1980 Arizona Los Angeles, CA California
## 6 0 2013 3/18/1991 Arizona Los Angeles, CA California
## Country Race HS_Only
## 1 US Black No
## 2 US Black No
## 3 US Black No
## 4 US Black No
## 5 US Black No
## 6 US Black No
One misconception I had was that a bar chart and a histogram are the same thing, which I now know is not true. A bar chart is used for categorical data, and the focus is on the height of the bars; the bars all have the same width.
A bar plot
We will make a bar plot of the heights of all the players on the team ‘Warriors’.
warriors <- subset(nba, Team=='Warriors')
# ordering the data frame 'warriors' by height of the players in increasing order
warriors_o <- warriors[order(warriors$Ht_inches), ]
dim(warriors_o)
## [1] 17 17
head(warriors_o)
## Name Age Team POS X. X2013salary Ht_inches Wt Exp
## 91 Curry, Seth 23 Warriors G 3 $490,180 74 185 0
## 114 Douglas, Toney 27 Warriors G 0 $1,600,000 74 185 4
## 70 Curry, Stephen 25 Warriors G 30 $9,887,640 75 185 4
## 320 Nedović, Nemanja 22 Warriors G 8 $1,056,720 75 192 0
## 366 Bazemore, Kent 24 Warriors G 20 $788,872 77 201 1
## 14 Iguodala, Andre 29 Warriors G/F 9 $12,868,632 78 207 9
## First_year DOB School City State_Territory
## 91 2013 8/23/1990 Duke Charlotte, NC North Carolina
## 114 2009 3/16/1986 Florida State Tampa, FL Florida
## 70 2009 3/14/1988 Davidson Akron, OH Ohio
## 320 2013 6/16/1991 n/a Nova Varos n/a
## 366 2012 7/1/1989 Old Dominion Kelford, NC North Carolina
## 14 2004 1/28/1984 Arizona Springfield, IL Illinois
## Country Race HS_Only
## 91 US Black No
## 114 US Black No
## 70 US Mixed No
## 320 Yugoslavia White No
## 366 US Black No
## 14 US Black No
print(warriors_o$Name)
## [1] Curry, Seth Douglas, Toney Curry, Stephen
## [4] Nedović, Nemanja Bazemore, Kent Iguodala, Andre
## [7] Green, Draymond Thompson, Klay Barnes, Harrison
## [10] Alexander, Joe Lee, David Speights, Marreese
## [13] O'Neal, Jermaine Ezeli, Festus Dedmon, Dewayne
## [16] Bogut, Andrew Kuzmić, Ognjen
## 528 Levels: İlyasova, Ersan Aşık, Ömer Acy, Quincy ... Zeller, Tyler
# barplot(warriors_o$Ht_inches, warriors_o$Name)
barplot(warriors_o$Ht_inches, names.arg = warriors_o$Name, border = NA,
        las = 1, main = "Heights of Golden State Warriors", xlab = "Names", ylab = "Heights",
        col = c("lightblue", "mistyrose", "lightcyan",
                "lavender", "cornsilk"))
Next, see how the average height varies with the position a player plays.
unique(nba$POS)
## [1] F G C F/C G/F
## Levels: C F F/C G G/F
avgHeights <- aggregate(Ht_inches~POS, data = nba, FUN = mean)
dim(avgHeights)
## [1] 5 2
avgHeights_o <- avgHeights[order(avgHeights$Ht_inches),]
barplot(avgHeights_o$Ht_inches, names.arg = avgHeights_o$POS,
        main = "Average heights of players across positions",
        xlab = "Positions", ylab = "Heights", las = 1, border = NA)
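To make the `aggregate()` step easier to follow without the CSV file, here is a small sketch on a made-up data frame (`toy`; the heights are invented, only the column names follow the tutorial). It computes one mean per position with the formula interface and, for comparison, the same means with `tapply()`.

```r
# Hypothetical data: column names follow the tutorial, heights are invented.
toy <- data.frame(
  POS       = c("G", "G", "F", "F", "C"),
  Ht_inches = c(74, 76, 79, 81, 84)
)

# Formula interface: Ht_inches grouped by POS, summarised with mean().
avg   <- aggregate(Ht_inches ~ POS, data = toy, FUN = mean)
avg_o <- avg[order(avg$Ht_inches), ]   # shortest position first

# tapply() returns the same means as a named vector instead of a data frame.
avg_vec <- tapply(toy$Ht_inches, toy$POS, mean)

print(avg_o)
print(avg_vec)
```

Either form gives one number per category, which is exactly what a bar plot needs: one bar height per position.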
A histogram is used to understand the distribution of a numeric variable. The bars need not all have the same width, and the quantity of interest is the area of each bar rather than its height alone.
Say we want to see how the heights of basketball players vary across the NBA.
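The point about area can be checked numerically. The sketch below uses a made-up vector of heights (`heights`, invented for illustration) and deliberately unequal bins. With `plot = FALSE`, `hist()` only returns the computed object, so we can multiply each bar's density by its width and confirm that the areas are the proportions of observations in each bin.

```r
# Hypothetical heights in inches, invented for illustration.
heights <- c(70, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 81, 82, 84, 86)

# Deliberately unequal bins: three bins 5 inches wide and one 10 inches wide.
h <- hist(heights, breaks = c(65, 70, 75, 80, 90), plot = FALSE)

widths <- diff(h$breaks)        # width of each bar
areas  <- h$density * widths    # area of each bar

print(h$counts)
print(areas)
print(sum(areas))
```

When the breaks are not equally spaced, `hist()` plots densities rather than counts by default, precisely so that the area of each bar stays proportional to the number of observations in it.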
hist(nba$Ht_inches)
par(mfrow = c(1,3), las = 1) # las: numeric in {0,1,2,3}; the style of axis labels.
range(nba$Ht_inches)
## [1] 69 87
hist(nba$Ht_inches, main = "NBA players heights", xlab = "inches",
     breaks = seq(65, 90, 1))
hist(nba$Ht_inches, main = "NBA players heights", xlab = "inches",
     breaks = seq(65, 90, 2))
hist(nba$Ht_inches, main = "NBA players heights", xlab = "inches",
     breaks = seq(65, 90, 5))
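The three panels above differ only in bin width. As a rough sketch of what changes underneath, the block below bins a made-up vector of heights (`heights`, invented for illustration) with the same three `breaks` sequences and counts the resulting bins. The total count stays the same; only how it is divided up changes.

```r
# Hypothetical heights in inches, invented for illustration.
heights <- c(70, 72, 73, 74, 75, 75, 76, 77, 78, 79, 80, 81, 82, 84, 86)

bin_widths <- c(1, 2, 5)
n_bins <- sapply(bin_widths, function(w) {
  h <- hist(heights, breaks = seq(65, 90, by = w), plot = FALSE)
  stopifnot(sum(h$counts) == length(heights))  # every value lands in some bin
  length(h$counts)
})
names(n_bins) <- paste0("width_", bin_widths)
print(n_bins)

# With a step of 2 the sequence stops at 89, not 90.
print(max(seq(65, 90, by = 2)))
```

One detail worth noticing: `seq(65, 90, 2)` ends at 89, so the width-2 histogram only works because no height exceeds 89 (the tallest player in the data is 87 inches). `hist()` stops with an error if the breaks do not span the range of the data.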
In the earlier bar plot we looked at how the average height varies with POS, the position a player plays, but an average alone didn’t give us much information.
Now we will look at the distribution of heights within each position.
plot.new()
positions <- unique(nba$POS)
positions
## [1] F G C F/C G/F
## Levels: C F F/C G G/F
par(mfrow=c(2,3), las=1)
for (i in seq_along(positions)) {
  # posHeights <- nba[nba$POS==positions[i], "Ht_inches"]
  posHeights <- subset(nba, POS == positions[i])
  # legend("bottomleft", c("Mean", "Median"), col=c("red", "blue"), lwd=10)
  h <- hist(posHeights$Ht_inches,
            main = paste0("Heights for position ", positions[i]),
            breaks = seq(65, 90, 1), xlab = 'inches', border = "#ffffff",
            col = "#999999", lwd = 0.4)
  maxFreq <- max(h$counts)
  # white vertical lines at every break, to separate the bars visually
  segments(h$breaks, rep(0, length(h$breaks)), h$breaks, maxFreq, col = "white")
  heightMedian <- median(posHeights$Ht_inches)
  heightMean <- mean(posHeights$Ht_inches)
  # median in blue, mean in red
  lines(c(heightMedian, heightMedian), c(-1, maxFreq), col = "blue", lwd = 2)
  lines(c(heightMean, heightMean), c(-1, maxFreq), col = "red", lwd = 2)
}
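The loop above can also be written with `split()` and `abline()`. This is only an alternative sketch, not the approach used in this tutorial, and it runs on a made-up data frame (`toy`, with invented heights) so that it does not need the CSV file. `pdf(NULL)` sends the plots to a null device, which lets the code run without a screen.

```r
# Hypothetical data: column names follow the tutorial, heights are invented.
toy <- data.frame(
  POS       = c("G", "G", "G", "F", "F", "F", "C", "C"),
  Ht_inches = c(73, 75, 80, 78, 79, 83, 83, 85)
)

# One numeric vector of heights per position.
by_pos <- split(toy$Ht_inches, toy$POS)

pdf(NULL)  # null device: nothing is written to disk
par(mfrow = c(1, length(by_pos)), las = 1)
for (pos in names(by_pos)) {
  hist(by_pos[[pos]], breaks = seq(65, 90, 1), xlab = "inches",
       main = paste0("Heights for position ", pos),
       col = "#999999", border = "#ffffff")
  abline(v = median(by_pos[[pos]]), col = "blue", lwd = 2)  # median
  abline(v = mean(by_pos[[pos]]),   col = "red",  lwd = 2)  # mean
}
invisible(dev.off())

# The same two summaries as a small table: rows are statistics, columns are positions.
centers <- sapply(by_pos, function(h) c(mean = mean(h), median = median(h)))
print(centers)
```

`abline(v = ...)` draws a vertical line across the whole panel, so there is no need to compute the tallest bar first. Where the mean and median differ, as for position G in the toy data, the distribution is skewed.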
We will end this tutorial with a few takeaways:
Histograms and bar plots are different.
A bar plot is used for categorical data.
A histogram is used to show the distribution of a numeric variable.
For a bar plot, only the length of the bar carries information; for a histogram, it is the area, so both the height and the width matter.