lecture2

Today's To-Dos

Review last module's homework
Exploratory data analysis
ggplot2
bigvis
devtools
This module's homework

Last Week's Homework

Let's walk through it
Gather your Data

suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
inc <- read.csv("https://raw.githubusercontent.com/charleyferrari/CUNY_DATA608/master/lecture1/Data/inc5000_data.csv", header= TRUE)
head(inc,2)

  Rank                  Name Growth_Rate   Revenue
1    1                  Fuhu      421.48 117900000
2    2 FederalConference.com      248.31  49600000
                      Industry Employees       City State
1 Consumer Products & Services       104 El Segundo    CA
2          Government Services        51   Dumfries    VA

Last Week's Homework

Investigate

summary(inc[,c(3:6,8)])

  Growth_Rate         Revenue                                  Industry   
 Min.   :  0.340   Min.   :2.000e+06   IT Services                 : 733  
 1st Qu.:  0.770   1st Qu.:5.100e+06   Business Products & Services: 482  
 Median :  1.420   Median :1.090e+07   Advertising & Marketing     : 471  
 Mean   :  4.612   Mean   :4.822e+07   Health                      : 355  
 3rd Qu.:  3.290   3rd Qu.:2.860e+07   Software                    : 342  
 Max.   :421.480   Max.   :1.010e+10   Financial Services          : 260  
                                       (Other)                     :2358  
   Employees           State     
 Min.   :    1.0   CA     : 701  
 1st Qu.:   25.0   TX     : 387  
 Median :   53.0   NY     : 311  
 Mean   :  232.7   VA     : 283  
 3rd Qu.:  132.0   FL     : 282  
 Max.   :66803.0   IL     : 273  
 NA's   :12        (Other):2764

Last Week's Homework

For this analysis, remove rows with missing (NA) values

all_inc <- inc[complete.cases(inc), ]
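
To see what complete.cases() does, here is a small sketch on a made-up data frame. The `toy` data and its values are assumptions for illustration only, not part of the Inc. 5000 file.

```r
# Hypothetical toy data: one missing Employees value, as in the real file
toy <- data.frame(
  Name      = c("A", "B", "C"),
  Employees = c(10, NA, 30)
)

# complete.cases() returns TRUE for rows with no NA in any column
keep <- complete.cases(toy)

# Logical indexing keeps only the complete rows
toy_complete <- toy[keep, ]
```

Note that this drops a row if any column is missing, so it can discard more than intended on wide data sets.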

Last Week's Homework

Get counts by State

cnt <- ddply(all_inc, .(State), summarize, cnt = length(State))
p3 <- ggplot(cnt, aes(x=State, y=cnt)) + geom_bar(stat='identity') 
p3

plot of chunk p3
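
The same per-state count can be sketched in base R with table(). The `state` vector below is a made-up stand-in for all_inc$State, so the numbers are illustrative only.

```r
# Hypothetical state column standing in for all_inc$State
state <- factor(c("CA", "NY", "CA", "TX", "CA", "NY"))

# table() counts each level; as.data.frame() gives one row per state
cnt_base <- as.data.frame(table(State = state), responseName = "cnt")
cnt_base
```

The result has the same two columns (State and cnt) that the ggplot call above expects.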

Last Week's Homework

To switch to horizontal bars, use coord_flip()
Line or scatter plots work well for tabular, quantitative data

p4 <- ggplot(cnt, aes(x=State, y=cnt)) + geom_bar(stat='identity')  
p4 + coord_flip()

plot of chunk p4

Last Week's Homework

Sort the bars with reorder()

p_states <- ggplot(cnt, aes(x=reorder(State,cnt), y=cnt)) + geom_bar(stat='identity')  
p_states + coord_flip()

plot of chunk p_states
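
A short sketch of what reorder() returns, using the three state counts from the summary output earlier. The `toy_cnt` data frame is a hypothetical stand-in for `cnt`.

```r
# Small stand-in for the cnt data frame (counts taken from the summary above)
toy_cnt <- data.frame(
  State = factor(c("CA", "NY", "TX")),
  cnt   = c(701, 311, 387)
)

# reorder() returns a factor whose levels are sorted by the second argument
ordered_state <- reorder(toy_cnt$State, toy_cnt$cnt)
levels(ordered_state)

# Negate the values to sort in descending order instead
descending_state <- reorder(toy_cnt$State, -toy_cnt$cnt)
levels(descending_state)
```

ggplot2 lays out a discrete axis in level order, so changing the level order is what changes the bar order.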

Last Week's Homework

New York is the #3 State, so let's dig in

ny <- subset(all_inc, State == 'NY')
p5 <- ggplot(ny, aes(x=Industry, y=Employees)) + geom_point() 
p5 + coord_flip()

plot of chunk unnamed-chunk-3

Last Week's Homework

Serious outlier issue: how do we handle it?
Do we include the outliers, make a note (annotate), or ignore them?
Do we care more about the mean or median?
If we care more about the median, outliers are distractions
One option: 'Winsorize' the data (clip extreme values)

winsor <- function(x, bot, top)  { return(min(top, max(x, bot))) }
ny$clip_employ <- sapply(ny$Employees, winsor, bot=0, top =2500)
p5 <- ggplot(ny, aes(x=Industry, y=clip_employ))
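
As a sketch, the same clipping can be written in vectorized form with pmin() and pmax(), which avoids the sapply() call. The `clip` helper and the `employees` values are hypothetical. Winsorizing usually clips at chosen percentiles, while the slide uses fixed bounds, and this sketch follows the slide.

```r
# Vectorized alternative to winsor(): clip every element to [bot, top]
clip <- function(x, bot, top) pmin(top, pmax(x, bot))

# Hypothetical employee counts with one extreme value
employees <- c(20, 35, 50, 80, 32000)
clipped <- clip(employees, bot = 0, top = 2500)

mean(employees)    # pulled far upward by the outlier
mean(clipped)      # much less affected after clipping
median(employees)  # the median is the same before and after
median(clipped)
```

This also illustrates the mean versus median question: clipping changes the mean a lot but leaves the median alone.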

Last Week's Homework

p5 + geom_point() + coord_flip()

plot of chunk p6

Last Week's Homework

A relative of the scatter plot is the box plot

p5 + geom_boxplot() + coord_flip(ylim=c(0,2500))

plot of chunk unnamed-chunk-5
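
By default a box plot draws whiskers out to the most extreme points within 1.5 times the IQR of the box and plots anything beyond that individually. The sketch below uses base R's boxplot.stats() to show that rule on made-up numbers. Its hinges can differ slightly from the quartiles ggplot2 computes, so treat it as an approximation of what geom_boxplot() flags.

```r
# Hypothetical employee counts with one extreme value
x <- c(10, 20, 30, 40, 50, 1000)

# boxplot.stats() returns the five-number summary and the flagged points
bs <- boxplot.stats(x, coef = 1.5)
bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values beyond 1.5 * IQR from the hinges
```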

Last Week's Homework - Marking Outliers

p5 + geom_boxplot() + coord_flip(ylim=c(0,2500)) +
annotate('text', label= c('outliers','3,000','10,000','32,000'),
x = c(18,16,5,2), y=c(2300,2400,2400,2400), size=c(4,3,3,3))

plot of chunk unnamed-chunk-6

Last Week's Homework

There are other ways to show variance
But first we need summary statistics for each industry

ny_ave <- ny %>%
  group_by(Industry) %>%
  summarize(mean = mean(Employees),
            sd = sd(Employees),
            median = median(clip_employ),
            lower = quantile(clip_employ)[2],
            upper = quantile(clip_employ)[4])

head(ny_ave,2)

# A tibble: 2 × 6
                      Industry      mean         sd median lower  upper
                        <fctr>     <dbl>      <dbl>  <dbl> <dbl>  <dbl>
1      Advertising & Marketing   58.4386   62.22971   38.0  21.0  65.00
2 Business Products & Services 1492.4615 6240.70574   70.5  30.5 332.75
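
In the summarize step, quantile(x)[2] and quantile(x)[4] pick out the 25% and 75% quantiles by position. A small sketch on made-up data shows the more explicit probs form and a base R per-group version. The `toy` data frame is hypothetical.

```r
# Hypothetical clipped employee counts for two industries
toy <- data.frame(
  Industry    = c("A", "A", "A", "A", "A", "B", "B", "B"),
  clip_employ = c(1, 2, 3, 4, 5, 10, 20, 30)
)

# Positional and explicit forms give the same 25% quantile
a <- toy$clip_employ[toy$Industry == "A"]
quantile(a)[2]
quantile(a, probs = 0.25)

# Base R per-group quartiles, comparable to the group_by/summarize step
lower <- tapply(toy$clip_employ, toy$Industry, quantile, probs = 0.25)
upper <- tapply(toy$clip_employ, toy$Industry, quantile, probs = 0.75)
```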

Last Week's Homework - Point Ranges

p6 <- ggplot(ny_ave, aes(x=Industry, y=median)) + geom_point()
p6 <- p6 + geom_pointrange(ymin=ny_ave$lower, ymax=ny_ave$upper) 
p6 + ylim(c(0,750)) + coord_flip()

plot of chunk unnamed-chunk-8
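
One alternative way to write this plot, sketched on a made-up summary table: map ymin and ymax inside aes() so they come from the plot's data, and set the limits on the coordinate system. ylim() sets scale limits, and mapped values outside them are replaced with NA, while limits passed to coord_flip() only zoom the view. The `toy_ave` data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical summary table standing in for ny_ave
toy_ave <- data.frame(
  Industry = c("A", "B", "C"),
  median   = c(40, 70, 120),
  lower    = c(20, 30, 60),
  upper    = c(65, 330, 900)
)

# ymin/ymax are mapped in aes(), so they are looked up in toy_ave
p_zoom <- ggplot(toy_ave,
                 aes(x = Industry, y = median, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  # ylim inside coord_flip() zooms the view without discarding rows
  coord_flip(ylim = c(0, 750))
```

With this form, the range for industry C is drawn up to the edge of the panel rather than being dropped.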

Last Week's Homework - Error Bars

p7 <- ggplot(ny_ave, aes(x=Industry, y=median)) + geom_bar(stat='identity')
p7 <- p7 + geom_errorbar(ymin=ny_ave$lower, ymax=ny_ave$upper, width=.1) 
p7 + ylim(c(0,750)) + coord_flip()

plot of chunk unnamed-chunk-9

Last Week's Homework - Error Bars

ny_ave$i <- reorder(ny_ave$Industry, ny_ave$median)
p8 <- ggplot(ny_ave, aes(x=i, y=median)) + geom_bar(stat='identity',fill='coral')
p8 <- p8 + geom_errorbar(ymin=ny_ave$lower, ymax=ny_ave$upper, width=.1, color='blue') 
p8 + ylim(c(0,750)) + coord_flip() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())

plot of chunk unnamed-chunk-10

Last Week's Homework - Investors Care About Money

all_inc$rev_per_employ <- all_inc$Revenue / all_inc$Employees
p9 <- ggplot(all_inc, aes(x=Industry, y=rev_per_employ))
p9 + geom_boxplot() + coord_flip()

plot of chunk unnamed-chunk-11

Last Week's Homework - Revenue Per Employee

all_inc$rev_per_employ <- all_inc$Revenue / all_inc$Employees
p10 <- ggplot(all_inc, aes(x=Industry, y=rev_per_employ))
p10 + geom_boxplot() + coord_flip()

plot of chunk unnamed-chunk-12

Last Week's Homework - Likely Outcomes and Distributions

p11 <- ggplot(all_inc,aes(x=rev_per_employ))
p11 <- p11 + geom_density() + facet_wrap(~ Industry) 
p11 + scale_x_log10(breaks=c(10000, 100000, 1000000, 10000000))

plot of chunk unnamed-chunk-13

Exploratory Data Analysis

A great way to test your visualizations: do you find them useful?
We basically just did it!
Always use it to understand your data set

ggplot2

One of the most popular visualization frameworks for R
Developed by Hadley Wickham
Easy to learn, supports lots of features
Being ported to other languages
We will focus on its design patterns (the grammar of graphics) throughout the semester

bigvis

Also written by Hadley Wickham
Geared towards larger data sets
Not on CRAN; install it from GitHub (for example with devtools)
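
The general idea behind plotting large data sets is to bin and summarize the observations first, then plot the much smaller summary. The base R sketch below illustrates that idea only. It does not use the bigvis API, and the simulated data and the bin width of 0.5 are assumptions.

```r
set.seed(1)

# Simulated stand-in for a large numeric column
x <- rnorm(1e6)

# Assign each value to a fixed-width bin (labelled by its lower edge)
width <- 0.5
bin <- floor(x / width) * width

# Condense: one row per bin instead of one row per observation
condensed <- as.data.frame(table(bin = bin), responseName = "count",
                           stringsAsFactors = FALSE)
condensed$bin <- as.numeric(condensed$bin)
```

A million points collapse to a few dozen rows, which any plotting layer can draw quickly.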

This Week's Homework

We will be working with the set of all NYC tax lot data
Go to http://www.nyc.gov/html/dcp/html/bytes/applbyte.shtml#pluto
Download the PLUTO data set
The data is in separate files for each borough; you will need to combine them
Think about “Moving the code to the data”
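
One possible way to combine per-borough files, sketched with small temporary CSVs. The file names and the Borough and Lot columns are hypothetical stand-ins, not the actual PLUTO layout.

```r
# Hypothetical stand-ins for the per-borough PLUTO files
dir <- tempfile("pluto_")
dir.create(dir)
write.csv(data.frame(Borough = "BK", Lot = 1:2),
          file.path(dir, "BK.csv"), row.names = FALSE)
write.csv(data.frame(Borough = "MN", Lot = 1:3),
          file.path(dir, "MN.csv"), row.names = FALSE)

# Read every CSV in the folder and stack the rows into one data frame
files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
pluto <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
```

rbind() expects every file to have the same columns, so check the headers before stacking the real files.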

That's it

This presentation will be on the GitHub page for reference
Good luck! Any questions?