We already did #1 last time, but here is some clarification about lazy evaluation:
With lazy evaluation, an argument of a function call isn't evaluated unless it is actually used inside the function.
Example:
f <- function(x){
  10
}
a <- f(x=print("hi"))
We don't print out "hi". This is because `x` isn't used in the definition of the function `f`.
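In fact, because the argument is never evaluated, even an argument that would throw an error is harmless here:

a <- f(x = stop("this is never evaluated"))  # no error, stop() is never run
a
## [1] 10

If, on the other hand, `x` is used in the body, the argument does get evaluated and "hi" is printed: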
f <- function(x){
  10
  x
}
a <- f(x=print("hi"))
## [1] "hi"
Another example:
g <- function(i) i + 3
vec <- c()
for (i in 1:2){
  vec[i] <- g(i)
}
vec[1]
## [1] 4
vec[2]
## [1] 5
Here `i` is used in `g(i)`, so `vec[1]` is 4 and `vec[2]` is 5.
Another example:
Note: we need a list `ls` to store our functions.
g <- function(i) function(x) x + i
ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 12
ls[[2]](10)
## [1] 12
Here `i` is used in the definition of `g`, but it appears only inside the function that `g` returns, i.e. `function(x) x + i`. That inner function is never called when you run `ls[[i]] <- g(i)`, so by lazy evaluation the argument `i` is not evaluated at that point. It is only looked up when `ls[[1]](10)` and `ls[[2]](10)` are finally called, and by then the loop variable `i` is 2, which is why both calls return 12.
g <- function(i) {
  i
  function(x) x + i
}
ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 11
ls[[2]](10)
## [1] 12
Now lazy evaluation no longer causes the problem: `i` is evaluated inside the body of `g` before the inner function is returned, so each closure captures the value `i` had when `g(i)` was called.
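An equivalent and slightly more idiomatic way to do this is base R's `force()`, which simply evaluates its argument right away:

g <- function(i) {
  force(i)             # evaluate i now instead of lazily later
  function(x) x + i
}
ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)  # 11
ls[[2]](10)  # 12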
What is the output of the following chunk?
i=1
a <- c(1, 2)
g <- function(x){x+a[i]}
f <- function(a) {
  b <- list()
  for (i in 1:length(a)){
    b[[i]] <- function(x) g(x)
  }
  print(b[[2]](10))
}
c <- f(a)
For every location, sex, and month there is an income variable, which is categorical with levels poverty, low, mid, and high. When we take the max and min of income for a particular location, sex, and month, "max" and "min" mean the alphabetical max and min.
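A quick illustration of what `max()` and `min()` do with character data (hypothetical values):

income <- c("poverty", "low", "mid", "high")
max(income)  # "poverty" -- last alphabetically
min(income)  # "high"    -- first alphabetically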
What is your question here?
Note that there are clearly many running times, so there must be many cases.
This raises an interesting question, asked by Cuong Luu in class, about the density plots from homework 6. Question: why don't the stacked density plots add up to the single density plot? To see what is going on, it is helpful to look at histograms of counts. The density plot for each client group is renormalized to have area 1, so stacking the two group densities gives total area 2 rather than 1; the single density is instead the proportion-weighted mixture of the two group densities.
Stations <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")
## Reading data with read.csv()
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"
Trips <- readRDS(gzcon(url(data_site)))
str(Trips)
## 'data.frame': 10000 obs. of 7 variables:
## $ duration: chr "0h 9m 15s" "0h 47m 21s" "2h 46m 22s" "0h 15m 15s" ...
## $ sdate : POSIXct, format: "2014-11-06 16:26:00" "2014-10-12 11:30:00" ...
## $ sstation: chr "15th & L St NW" "3rd & D St SE" "10th & E St NW" "4th & M St SW" ...
## $ edate : POSIXct, format: "2014-11-06 16:35:00" "2014-10-12 12:17:00" ...
## $ estation: chr "15th & L St NW" "Jefferson Dr & 14th St SW" "10th & E St NW" "5th & K St NW" ...
## $ bikeno : chr "W00169" "W01482" "W21346" "W00647" ...
## $ client : chr "Registered" "Registered" "Casual" "Casual" ...
### density plot not filled by `client` (single density plot)
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(x=time_of_day)) +
geom_density() +
facet_wrap(~day_of_week) +
theme( legend.position = "none")
### density plots filled by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_density(aes(fill=client), alpha=.3, col=NA) +
facet_wrap(~day_of_week)
### stacked density plots filled by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_density(aes(fill=client), position=position_stack(), alpha=.3, col=NA) +
facet_wrap(~day_of_week)
### histogram not filled by `client` (single histogram)
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(x=time_of_day)) +
geom_histogram(binwidth=5) +
facet_wrap(~day_of_week) +
theme( legend.position = "none")
### histogram with dodged fills by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_histogram(aes(fill=client), position="dodge", alpha=.3, col=NA, binwidth=5) +
facet_wrap(~day_of_week)
### histogram with stacked fills by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_histogram(aes(fill=client), position=position_stack(), alpha=.3, col=NA, binwidth=5) +
facet_wrap(~day_of_week)
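Finally, one way to make the stacked picture add up to the single one (a sketch, assuming a recent ggplot2 with `after_stat()` and the same packages loaded as above) is to stack count-scaled densities instead of densities: each group's curve is weighted by its number of trips, so the stacked total roughly matches the overall count-scaled density.

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60,
         day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(time_of_day)) +
  geom_density(aes(y = after_stat(count), fill = client),
               position = position_stack(), alpha = .3, col = NA) +
  facet_wrap(~day_of_week)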
What is your question?
Note: for each box plot you need at least two distinct values to get a nonzero interquartile range.
data=c(1)
quantile(data)
## 0% 25% 50% 75% 100%
## 1 1 1 1 1
data=c(1,2)
quantile(data)
## 0% 25% 50% 75% 100%
## 1.00 1.25 1.50 1.75 2.00
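The interquartile range follows directly from these quantiles:

IQR(c(1))     # 0    -- a degenerate box of zero height
IQR(c(1, 2))  # 0.5  -- a box with nonzero height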
Here are the details (with made up data):
party <- data.frame(c("DON SAMUELS", "JACKIE CHERRYHOMES", "BETSY HODGES", "CAM WINTON"),
                    c("DFL", "LIBERTARIAN", "INDEPENDENT", "DFL"))
names(party) <- c("Candidate","Party")
party
Candidate | Party |
---|---|
DON SAMUELS | DFL |
JACKIE CHERRYHOMES | LIBERTARIAN |
BETSY HODGES | INDEPENDENT |
CAM WINTON | DFL |
A few rows of the `candidates` table look like this:

 | Precinct | Ward | Candidate |
---|---|---|---|
7762 | P-08 | W-2 | DON SAMUELS |
78920 | P-08 | W-7 | JACKIE CHERRYHOMES |
25409 | P-03 | W-9 | BETSY HODGES |
31526 | P-04 | W-11 | CAM WINTON |
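The join code below needs a data frame called `candidates` with these columns. If you want to run it on its own, a minimal stand-in matching the rows shown above would be:

candidates <- data.frame(
  Precinct  = c("P-08", "P-08", "P-03", "P-04"),
  Ward      = c("W-2", "W-7", "W-9", "W-11"),
  Candidate = c("DON SAMUELS", "JACKIE CHERRYHOMES", "BETSY HODGES", "CAM WINTON")
)
# (in older versions of R, data.frame() turned these columns into factors,
#  which is why left_join() warns below about coercing factor to character)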
ballet_party <- candidates %>%
left_join(party)
## Joining, by = "Candidate"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factor and character vector, coercing into character vector
ballet_party
Precinct | Ward | Candidate | Party |
---|---|---|---|
P-08 | W-2 | DON SAMUELS | DFL |
P-08 | W-7 | JACKIE CHERRYHOMES | LIBERTARIAN |
P-03 | W-9 | BETSY HODGES | INDEPENDENT |
P-04 | W-11 | CAM WINTON | DFL |
ballet_party %>%
group_by(Precinct,Party) %>%
summarize(count=n())
## Source: local data frame [4 x 3]
## Groups: Precinct [?]
##
## Precinct Party count
## <chr> <fctr> <int>
## 1 P-03 INDEPENDENT 1
## 2 P-04 DFL 1
## 3 P-08 DFL 1
## 4 P-08 LIBERTARIAN 1
Did this already.
Did this already.
Did this already.
Did this already.
Did this already.
14, 15: more specifically, how do we know the number of cases?
16: inner joins, and could you please go over left and right joins?
Did this already
Did this already
MISPRINT: replace `by = c("First", "variable2")` with `by = c("FIRST" = "variable1")`.
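Since the misprint concerns the `by =` argument, here is a toy sketch of joining on key columns with different names (made-up tables; `id_x` and `id_y` are illustrative names only), which also shows how left, right, and inner joins differ:

library(dplyr)

x <- data.frame(id_x = c(1, 2, 3), score = c(10, 20, 30))
y <- data.frame(id_y = c(2, 3, 4), grade = c("B", "A", "C"))

left_join(x, y, by = c("id_x" = "id_y"))   # all rows of x; id 1 gets NA for grade
right_join(x, y, by = c("id_x" = "id_y"))  # all rows of y; id 4 gets NA for score
inner_join(x, y, by = c("id_x" = "id_y"))  # only ids 2 and 3, present in both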
Let's review the different joins.
The data tables `Grades` and `Courses` give grade- and course-related information for a school.
Grades <- read.csv("http://tiny.cc/mosaic/grades.csv")
 | sid | grade | sessionID |
---|---|---|---|
2197 | S31680 | B | session3518 |
259 | S31242 | B | session2897 |
4188 | S32127 | A | session2002 |
3880 | S32058 | A- | session2952 |
Courses <- read.csv("http://tiny.cc/mosaic/courses.csv")
 | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|
640 | session2568 | J | 100 | FA2002 | 15 | inst223 |
76 | session1940 | d | 100 | FA2000 | 16 | inst409 |
1218 | session3242 | m | 200 | SP2004 | 30 | inst476 |
The key linking the two tables is `sessionID` (the primary key in `Courses`, a foreign key in `Grades`).
Write out the commands to find the average class size (average enrollment) for each student (`sid`). The first 6 lines of output should be:
sid | ave_enroll |
---|---|
S31185 | 29.00000 |
S31188 | 27.55556 |
S31191 | 29.13333 |
S31194 | 19.50000 |
S31197 | 24.42857 |
S31200 | 26.50000 |
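One way to produce this output (a sketch, not the only one): join `Grades` to `Courses` on `sessionID`, then average the enrollment over each student's sessions.

Grades %>%
  left_join(Courses, by = "sessionID") %>%   # bring in enroll for each grade record
  group_by(sid) %>%                          # one group per student
  summarise(ave_enroll = mean(enroll, na.rm = TRUE)) %>%
  head()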
Let's review `gather` and `spread`.
library(tidyr)
We make a data table of country populations in different years.
a <- c("Afghanistan","Brazil", "China")
b <- c(745,37737,212258)
c <- c(266,80488,213766)
df_wide <- data.frame(a,b,c)
names(df_wide) <- c("countries","1999", "2000")
df_wide
countries | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 266 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
We wish to gather this table into tidy form.
df_narrow <- df_wide %>%
gather( key = year, value = population, `1999`, `2000`)
df_narrow
countries | year | population |
---|---|---|
Afghanistan | 1999 | 745 |
Brazil | 1999 | 37737 |
China | 1999 | 212258 |
Afghanistan | 2000 | 266 |
Brazil | 2000 | 80488 |
China | 2000 | 213766 |
To convert back to wide form we use `spread`:
df_wide1 <- df_narrow %>%
spread( key = year, value = population)
df_wide1
countries | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 266 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
Did this already.
What is your question?
BabyNames %>%
group_by(name) %>%
summarise(tot = sum(count)) %>%
mutate(rank = rank(desc(tot))) %>%
filter(name == "Fernando")
name | tot | rank |
---|---|---|
Fernando | 90785 | 663 |
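As a reminder, `rank(desc(tot))` gives rank 1 to the largest total (`desc()` is the dplyr helper already used above). A tiny check with made-up numbers:

x <- c(10, 30, 20)
rank(desc(x))
## [1] 3 1 2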
What is your question?