Statistical Learning
James Scott (UT-Austin)
Categorical: the answer to a multiple-choice question
Ordinal: categorical, where the answers have an ordering but not a magnitude
Numerical: numbers, whether integer or real-valued
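A minimal sketch of how these three types might be represented in R. All values and variable names below are invented for illustration and are not part of the course data.

```r
# Hypothetical survey responses, invented purely for illustration

# Categorical: unordered labels
major <- factor(c("biology", "economics", "biology", "history"))

# Ordinal: the levels carry an order, but no magnitude
satisfaction <- factor(c("low", "high", "medium", "high"),
                       levels = c("low", "medium", "high"),
                       ordered = TRUE)

# Numerical: integer or real-valued
temps <- c(58.7, 63.0, 67.3, 81.3)

table(major)            # counting is the natural summary for categories
satisfaction < "high"   # order comparisons are defined for ordered factors
mean(temps)             # arithmetic is meaningful only for numerical data
```

Note that R will refuse `major < "economics"` for an unordered factor (it returns NA with a warning), which matches the distinction drawn above.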
Let's load some data on temperatures in Rapid City and San Diego:
# Use the correct path name
citytemps = read.csv('data/citytemps.csv')
summary(citytemps)
Year Month Day Temp.SanDiego
Min. :1995 Min. : 1.000 Min. : 1.00 Min. :45.10
1st Qu.:1999 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:58.70
Median :2003 Median : 7.000 Median :16.00 Median :63.00
Mean :2003 Mean : 6.492 Mean :15.71 Mean :63.08
3rd Qu.:2007 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:67.30
Max. :2011 Max. :12.000 Max. :31.00 Max. :81.30
Temp.RapidCity
Min. :-19.00
1st Qu.: 33.30
Median : 47.60
Mean : 47.28
3rd Qu.: 63.95
Max. : 91.90
A simple histogram:
# Default binning
hist(citytemps$Temp.SanDiego)
# Ask for roughly 30 bins
hist(citytemps$Temp.SanDiego, breaks=30)
# Overlay the sample mean as a thick vertical red line
muSanDiego = mean(citytemps$Temp.SanDiego)
hist(citytemps$Temp.SanDiego, breaks=30)
abline(v=muSanDiego, col='red', lwd=5)
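When `breaks` is a single number, R treats it as a suggestion and rounds to "pretty" breakpoints, so you may not get exactly 30 bins. The sketch below checks what R actually chose. It uses simulated values as a stand-in for the San Diego column, so the numbers will not match the real data.

```r
set.seed(1)
# Simulated stand-in for citytemps$Temp.SanDiego (NOT the real data)
x <- rnorm(1000, mean = 63, sd = 6)

# plot = FALSE returns the histogram object without drawing it
h <- hist(x, breaks = 30, plot = FALSE)

length(h$counts)   # number of bins actually used; need not be exactly 30
h$breaks           # the breakpoints R chose
sum(h$counts)      # every observation lands in exactly one bin
```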
We can make a multi-frame plot.
par(mfrow=c(2,1))
hist(citytemps$Temp.SanDiego)
hist(citytemps$Temp.RapidCity)
R's syntax here is somewhat hacky. We'll see something better when we learn ggplot!
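Part of what makes this feel hacky is that `par()` changes global graphics settings, so every later plot stays in the 2-by-1 layout until you undo it. One common pattern, sketched here with simulated stand-in data rather than the real temperatures, is to save the old settings and restore them afterwards.

```r
set.seed(1)
# Simulated stand-ins for the two temperature columns (NOT the real data)
san_diego  <- rnorm(1000, mean = 63, sd = 6)
rapid_city <- rnorm(1000, mean = 47, sd = 20)

# par() invisibly returns the previous values of the settings it changes
op <- par(mfrow = c(2, 1))
hist(san_diego)
hist(rapid_city)

# Restore the saved settings so later plots get a single panel again
par(op)
hist(san_diego)
```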
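Core principle: plots should be truthful about magnitude! Here the two panels do not share bins or axis limits, so bar heights and positions cannot be compared between cities.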
Let's fix these:
# custom breakpoints
mybreaks = seq(-20, 92, by=2)
par(mfrow=c(2,1))
hist(citytemps$Temp.SanDiego, breaks=mybreaks,
     xlim=c(-20,100), ylim=c(0, 760))
hist(citytemps$Temp.RapidCity, breaks=mybreaks,
     xlim=c(-20,100), ylim=c(0, 760))
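The limits above are hard-coded for this data set. As a rough sketch of an alternative, the shared breakpoints and y-axis limit can be computed from the data. The code uses simulated stand-in values and hypothetical variable names (`shared_breaks`, `shared_ylim`), so the resulting limits will differ from the ones above.

```r
set.seed(1)
# Simulated stand-ins for the two temperature columns (NOT the real data)
san_diego  <- rnorm(1000, mean = 63, sd = 6)
rapid_city <- rnorm(1000, mean = 47, sd = 20)

# Shared 2-degree bins spanning both series, snapped to even numbers
both <- c(san_diego, rapid_city)
shared_breaks <- seq(2 * floor(min(both) / 2),
                     2 * ceiling(max(both) / 2), by = 2)

# Count without plotting to find the tallest bar in either panel
h_sd <- hist(san_diego,  breaks = shared_breaks, plot = FALSE)
h_rc <- hist(rapid_city, breaks = shared_breaks, plot = FALSE)
shared_ylim <- c(0, max(h_sd$counts, h_rc$counts))

# Draw both panels with identical bins and axes
op <- par(mfrow = c(2, 1))
plot(h_sd, xlim = range(shared_breaks), ylim = shared_ylim,
     main = "San Diego (simulated)")
plot(h_rc, xlim = range(shared_breaks), ylim = shared_ylim,
     main = "Rapid City (simulated)")
par(op)
```

Computing the limits this way keeps the two panels comparable even if the data change, at the cost of a few extra lines.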