Source file ⇒ 2017-midterm_review-pt2.Rmd

Minzhi Zhang Q1, 12, 14, 15, 17, 18, 20, 21, 22, 25

1

We already did #1 last time but here is some clarification about lazy evaluation:

With lazy evaluation, an argument of a function call isn’t evaluated unless it is actually used inside the function.

Example:

f <- function(x){
  
  10
}
a <- f(x=print("hi"))

Nothing is printed here, because x is never used in the body of f. If the body does use x, the argument is evaluated and “hi” is printed:

f <- function(x){
  10
  x
}
a <- f(x=print("hi"))
## [1] "hi"

Another example:

g <- function(i) i + 3

vec <- c()
for (i in 1:2){
  vec[i] <- g(i)
}
vec[1]
## [1] 4
vec[2]
## [1] 5

Here i is used in the body of g, so it is evaluated on every call and vec[1] = 4 and vec[2] = 5.

Another example:

Note: We need a list ls to store our functions.

g <- function(i) function(x) x + i
  

ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 12
ls[[2]](10)
## [1] 12

Here i is an argument of g, but the body of g never uses it directly: it only appears inside the inner function, function(x) x + i. That inner function has not been called when ls[[i]] <- g(i) runs, so lazy evaluation leaves the argument unevaluated. It is evaluated only when ls[[1]](10) is first called, after the loop has finished and i is 2, so both calls return 12.
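One way to see when the argument is actually evaluated is to change i between creating the function and calling it. This is only a sketch; the names h, first and second and the values 1, 100 and 5 are made up for illustration.

```r
g <- function(i) function(x) x + i

i <- 1
h <- g(i)        # the argument is stored unevaluated; nothing is computed yet
i <- 100
first <- h(10)   # the argument is evaluated now, using the current global i = 100
i <- 5
second <- h(10)  # the argument was already evaluated, so it keeps the value 100
first
## [1] 110
second
## [1] 110
```

The value 1 that i had when g(i) was called is never seen by h.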

g <- function(i) {
  i
  function(x) x + i
}

ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 11
ls[[2]](10)
## [1] 12

Now i is evaluated in the body of g each time g(i) is called, while the loop variable still has that iteration's value, so each stored function keeps its own i.
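A bare i in the body works, but base R's force() states the intent more clearly. A sketch of the same loop using it (g_forced and fs are hypothetical names):

```r
g_forced <- function(i) {
  force(i)           # evaluate the argument now, before the loop variable changes
  function(x) x + i
}

fs <- list()
for (i in 1:2) {
  fs[[i]] <- g_forced(i)
}
fs[[1]](10)
## [1] 11
fs[[2]](10)
## [1] 12
```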

Extra question for you:

What is the output of the following chunk?

i=1
a <- c(1, 2)
g <- function(x){x+a[i]}
f <- function(a) {
  b <- list()
  for (i in 1:length(a)){
    b[[i]] <- function(x) g(x)
  }
  print(b[[2]](10))
}
c <- f(a)

12

For every combination of location, sex and month there are several incomes. Income is a categorical variable with levels poverty, low, mid and high. Taking the max and min of income within each combination of location, sex and month therefore gives the alphabetically last and first levels.
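A small check of what max and min do with text, assuming income is stored as a character vector (the vector below is made up from the four level names):

```r
income <- c("poverty", "low", "mid", "high")
min(income)   # alphabetically first
## [1] "high"
max(income)   # alphabetically last
## [1] "poverty"
```

So the "max" income here is poverty, which is alphabetical order and not the order of income size.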

14

What is your question here?

Note there are clearly many running times so there must be many cases.

This brings up an interesting question that Cuong Luu raised in class about the density plots from homework 6: why don’t the stacked density plots add up to the single density plot? To see what is going on, it helps to look at histograms of counts. The density curve for each client group is renormalized so that each one has area 1.
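The renormalization can be checked numerically with base R's density(). This is a sketch on simulated data, not the Trips data: registered, casual and area_under are hypothetical names, and the group sizes 800 and 200 are made up.

```r
set.seed(1)
# Hypothetical start times (in hours) for two client groups of different sizes
registered <- rnorm(800, mean = 9, sd = 2)
casual <- rnorm(200, mean = 15, sd = 3)

# Approximate the area under a kernel density estimate with a Riemann sum
area_under <- function(d) sum(d$y) * (d$x[2] - d$x[1])

area_registered <- area_under(density(registered))
area_casual <- area_under(density(casual))
area_all <- area_under(density(c(registered, casual)))

# Each area is close to 1, so stacking the two group curves gives a total
# area close to 2, while the single density has area close to 1.
c(area_registered, area_casual, area_all)
```

In recent versions of ggplot2, mapping y to the computed count, as in geom_density(aes(y = after_stat(count))), scales each curve by its group size, which brings the stacked plot much closer to the histogram of counts.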

Loading Data

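library(dplyr)    # for %>% and mutate(), used in the plots below
library(ggplot2)  # for ggplot() and the geoms used below
Stations <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")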
## Reading data with read.csv()
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds" 
Trips <- readRDS(gzcon(url(data_site)))
str(Trips)
## 'data.frame':    10000 obs. of  7 variables:
##  $ duration: chr  "0h 9m 15s" "0h 47m 21s" "2h 46m 22s" "0h 15m 15s" ...
##  $ sdate   : POSIXct, format: "2014-11-06 16:26:00" "2014-10-12 11:30:00" ...
##  $ sstation: chr  "15th & L St NW" "3rd & D St SE" "10th & E St NW" "4th & M St SW" ...
##  $ edate   : POSIXct, format: "2014-11-06 16:35:00" "2014-10-12 12:17:00" ...
##  $ estation: chr  "15th & L St NW" "Jefferson Dr & 14th St SW" "10th & E St NW" "5th & K St NW" ...
##  $ bikeno  : chr  "W00169" "W01482" "W21346" "W00647" ...
##  $ client  : chr  "Registered" "Registered" "Casual" "Casual" ...

not filled by client (single density plot)

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(x=time_of_day)) +
  geom_density() +
  facet_wrap(~day_of_week) +
  theme( legend.position = "none")

overlaying fills by client

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(time_of_day)) +
  geom_density(aes(fill=client), alpha=.3, col=NA) +
  facet_wrap(~day_of_week)

stacking fills of client

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(time_of_day)) +
  geom_density(aes(fill=client), position=position_stack(), alpha=.3, col=NA) +
  facet_wrap(~day_of_week)

histogram not filled by client (single histogram)

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(x=time_of_day)) +
  geom_histogram(binwidth=5) +
  facet_wrap(~day_of_week) +
  theme( legend.position = "none")

histogram dodged fills by client

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(time_of_day)) +
  geom_histogram(aes(fill=client), position="dodge",alpha=.3, col=NA,binwidth=5) +
  facet_wrap(~day_of_week)

histogram stacking fills of client

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(time_of_day)) +
  geom_histogram(aes(fill=client), position=position_stack(), alpha=.3, col=NA,binwidth=5) +
  facet_wrap(~day_of_week)

15

What is your question?

Note: for each box plot you need to have at least two cases to have a nonzero interquartile range.

data=c(1)
quantile(data)
##   0%  25%  50%  75% 100% 
##    1    1    1    1    1
data=c(1,2)
quantile(data)
##   0%  25%  50%  75% 100% 
## 1.00 1.25 1.50 1.75 2.00
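The interquartile range is the 75% quantile minus the 25% quantile, which base R computes with IQR(). A quick check on the same made-up data, plus a case with two equal values:

```r
IQR(c(1))      # one case: the box has zero height
## [1] 0
IQR(c(1, 2))   # two different cases: 1.75 - 1.25
## [1] 0.5
IQR(c(2, 2))   # two equal cases still give zero
## [1] 0
```

So two cases are necessary for a nonzero interquartile range, but they also have to differ.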

17

Here are the details (with made-up data):

party <-  data.frame(c("DON SAMUELS","JACKIE CHERRYHOMES","BETSY HODGES", "CAM WINTON"), c("DFL","LIBERTARIAN","INDEPENDENT", "DFL"))
names(party) <- c("Candidate","Party")
party

Candidate           Party
DON SAMUELS         DFL
JACKIE CHERRYHOMES  LIBERTARIAN
BETSY HODGES        INDEPENDENT
CAM WINTON          DFL

The candidates table:

       Precinct  Ward  Candidate
7762   P-08      W-2   DON SAMUELS
78920  P-08      W-7   JACKIE CHERRYHOMES
25409  P-03      W-9   BETSY HODGES
31526  P-04      W-11  CAM WINTON

ballot_party <- candidates %>% 
  left_join(party) 
## Joining, by = "Candidate"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factor and character vector, coercing into character vector
ballot_party

Precinct  Ward  Candidate           Party
P-08      W-2   DON SAMUELS         DFL
P-08      W-7   JACKIE CHERRYHOMES  LIBERTARIAN
P-03      W-9   BETSY HODGES        INDEPENDENT
P-04      W-11  CAM WINTON          DFL

ballot_party %>% 
  group_by(Precinct,Party) %>%
  summarize(count=n())
## Source: local data frame [4 x 3]
## Groups: Precinct [?]
## 
##   Precinct       Party count
##      <chr>      <fctr> <int>
## 1     P-03 INDEPENDENT     1
## 2     P-04         DFL     1
## 3     P-08         DFL     1
## 4     P-08 LIBERTARIAN     1

18

Did this already

20

did this already

21

did this already

22

did this already

25

did this already

Leon Gutierrez

14, 15, more specifically, how do we know the number of cases?

16, inner joins, and could you please go over left and right joins?

14

Did this already

15

Did this already

16

MISPRINT: replace by = c("First", "variable2") with by=c("FIRST"="variable1")
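In a dplyr join, by = c("FIRST" = "variable1") matches the column FIRST in the left table with the column variable1 in the right table. A minimal sketch with made-up tables (left_tbl, right_tbl and their contents are hypothetical; only the two column names come from the correction above):

```r
library(dplyr)

# The key column has a different name in each table
left_tbl <- data.frame(FIRST = c("a", "b", "c"), score = c(1, 2, 3),
                       stringsAsFactors = FALSE)
right_tbl <- data.frame(variable1 = c("a", "b", "d"), level = c("x", "y", "z"),
                        stringsAsFactors = FALSE)

# Match FIRST (left table) against variable1 (right table)
joined <- left_tbl %>%
  inner_join(right_tbl, by = c("FIRST" = "variable1"))
joined
```

The result keeps the left table's name for the key column, so joined has columns FIRST, score and level, and only the rows for "a" and "b".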

Let's review the different joins:
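The join types differ only in which unmatched rows they keep. A sketch with two small made-up tables (students, clubs and their contents are hypothetical), using the dplyr join functions:

```r
library(dplyr)

students <- data.frame(sid = c("S1", "S2", "S3"), grade = c("A", "B", "C"),
                       stringsAsFactors = FALSE)
clubs <- data.frame(sid = c("S2", "S3", "S4"), club = c("chess", "band", "art"),
                    stringsAsFactors = FALSE)

# inner join: only sids found in both tables (S2, S3)
inner <- inner_join(students, clubs, by = "sid")

# left join: every row of students; club is NA for S1
left <- left_join(students, clubs, by = "sid")

# right join: every row of clubs; grade is NA for S4
right <- right_join(students, clubs, by = "sid")

# full join: every sid from either table (S1 to S4)
full <- full_join(students, clubs, by = "sid")
```

The left and right joins are mirror images: left_join(students, clubs) keeps the same rows as right_join(clubs, students), with the columns in a different order.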

Extra question for you:

The data tables Grades and Courses give grade-related and course-related information for a school.

Grades <- read.csv("http://tiny.cc/mosaic/grades.csv")

      sid     grade  sessionID
2197  S31680  B      session3518
259   S31242  B      session2897
4188  S32127  A      session2002
3880  S32058  A-     session2952

Courses <- read.csv("http://tiny.cc/mosaic/courses.csv")

      sessionID    dept  level  sem     enroll  iid
640   session2568  J     100    FA2002  15      inst223
76    session1940  d     100    FA2000  16      inst409
1218  session3242  m     200    SP2004  30      inst476

The key that links the two tables is sessionID: it is the primary key of Courses and a foreign key in Grades.

Write out the commands to find the average class size for each student (sid). The first 6 lines of output should be:

sid     ave_enroll
S31185  29.00000
S31188  27.55556
S31191  29.13333
S31194  19.50000
S31197  24.42857
S31200  26.50000

Let's review gather and spread

library(tidyr)

We make a data table of country populations in different years.

a <- c("Afghanistan","Brazil", "China")
b <- c(745,37737,212258)
c <- c(266,80488,213766)
df_wide <- data.frame(a,b,c)
names(df_wide) <- c("countries","1999", "2000")
df_wide

countries    1999    2000
Afghanistan  745     266
Brazil       37737   80488
China        212258  213766

We wish to gather this table into tidy form.

df_narrow <- df_wide %>% 
  gather( key = year, value = population, `1999`, `2000`)
df_narrow

countries    year  population
Afghanistan  1999  745
Brazil       1999  37737
China        1999  212258
Afghanistan  2000  266
Brazil       2000  80488
China        2000  213766

To convert back to wide form we use spread:

df_wide1 <- df_narrow %>% 
  spread( key = year, value = population)
df_wide1

countries    1999    2000
Afghanistan  745     266
Brazil       37737   80488
China        212258  213766
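Newer versions of tidyr (1.0.0 and later) also provide pivot_longer() and pivot_wider(), which supersede gather() and spread(). A sketch of the same round trip with the same made-up table; df_wide_again is a hypothetical name:

```r
library(tidyr)

df_wide <- data.frame(
  countries = c("Afghanistan", "Brazil", "China"),
  `1999` = c(745, 37737, 212258),
  `2000` = c(266, 80488, 213766),
  check.names = FALSE   # keep "1999" and "2000" as the column names
)

# wide -> narrow: one row per country and year
df_narrow <- pivot_longer(df_wide, cols = c(`1999`, `2000`),
                          names_to = "year", values_to = "population")

# narrow -> wide: one column per year again
df_wide_again <- pivot_wider(df_narrow, names_from = year,
                             values_from = population)
```

The rows of df_narrow come out country by country (Afghanistan 1999, Afghanistan 2000, Brazil 1999, and so on), which is a different row order from gather, but the content is the same.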

Extra question for you:

Seohyeong Jeong 3, 10, 24

3

did already

10

What is your question?

BabyNames %>%
  group_by(name) %>%
  summarise(tot = sum(count)) %>%
  mutate(rank = rank(desc(tot))) %>%
  filter(name == "Fernando")

name      tot    rank
Fernando  90785  663

24

What is your question?