We already did #1 last time, but here is some clarification about lazy evaluation:
With lazy evaluation, an argument of a function call isn't evaluated unless it is actually used inside the function.
Example:
f <- function(x){
  10
}
a <- f(x=print("hi"))
We don't print out "hi". This is because `x` isn't used in the definition of the function `f`.
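In fact, because the argument is never evaluated, even an argument that would throw an error is harmless here:

a <- f(x = stop("this is never evaluated"))  # no error, stop() is never run
a
## [1] 10

If, on the other hand, `x` is used in the body, the argument does get evaluated and "hi" is printed: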
f <- function(x){
  10
  x
}
a <- f(x=print("hi"))
## [1] "hi"
Another example:
g <- function(i) i + 3
vec <- c()
for (i in 1:2){
  vec[i] <- g(i)
}
vec[1]
## [1] 4
vec[2]
## [1] 5
Here `i` is used in `g(i)`, so `vec[1]` is 4 and `vec[2]` is 5.
Another example:
Note: we need a list `ls` to store our functions.
g <- function(i) function(x) x + i
ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 12
ls[[2]](10)
## [1] 12
Here `i` is used in the definition of `g`, but it appears only inside the function that `g` returns, i.e. `function(x) x + i`. That inner function is never called when you run `ls[[i]] <- g(i)`, so by lazy evaluation the argument `i` is not evaluated at that point. It is only looked up when `ls[[1]](10)` and `ls[[2]](10)` are finally called, and by then the loop variable `i` is 2, which is why both calls return 12.
g <- function(i) {
  i
  function(x) x + i
}
ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)
## [1] 11
ls[[2]](10)
## [1] 12
Now lazy evaluation no longer causes the problem: `i` is evaluated inside the body of `g` before the inner function is returned, so each closure captures the value `i` had when `g(i)` was called.
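An equivalent and slightly more idiomatic way to do this is base R's `force()`, which simply evaluates its argument right away:

g <- function(i) {
  force(i)             # evaluate i now instead of lazily later
  function(x) x + i
}
ls <- list()
for (i in 1:2){
  ls[[i]] <- g(i)
}
ls[[1]](10)  # 11
ls[[2]](10)  # 12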
What is the output of the following chunk?
i=1
a <- c(1, 2)
g <- function(x){x+a[i]}
f <- function(a) {
  b <- list()
  for (i in 1:length(a)){
    b[[i]] <- function(x) g(x)
  }
  print(b[[2]](10))
}
c <- f(a)
For every location, sex, and month there is an income variable, which is categorical with levels poverty, low, mid, and high. When we take the max and min of income for a particular location, sex, and month, "max" and "min" mean the alphabetical max and min.
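A quick illustration of what `max()` and `min()` do with character data (hypothetical values):

income <- c("poverty", "low", "mid", "high")
max(income)  # "poverty" -- last alphabetically
min(income)  # "high"    -- first alphabetically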
What is your question here?
Note that there are clearly many running times, so there must be many cases.
This raises an interesting question, asked by Cuong Luu in class, about the density plots from homework 6. Question: why don't the stacked density plots add up to the single density plot? To see what is going on, it is helpful to look at histograms of counts. The density plot for each client group is renormalized to have area 1, so stacking the two group densities gives total area 2 rather than 1; the single density is instead the proportion-weighted mixture of the two group densities.
Stations <- mosaic::read.file("http://tiny.cc/dcf/DC-Stations.csv")
## Reading data with read.csv()
data_site <- "http://tiny.cc/dcf/2014-Q4-Trips-History-Data-Small.rds"
Trips <- readRDS(gzcon(url(data_site)))
str(Trips)
## 'data.frame': 10000 obs. of 7 variables:
## $ duration: chr "0h 9m 15s" "0h 47m 21s" "2h 46m 22s" "0h 15m 15s" ...
## $ sdate : POSIXct, format: "2014-11-06 16:26:00" "2014-10-12 11:30:00" ...
## $ sstation: chr "15th & L St NW" "3rd & D St SE" "10th & E St NW" "4th & M St SW" ...
## $ edate : POSIXct, format: "2014-11-06 16:35:00" "2014-10-12 12:17:00" ...
## $ estation: chr "15th & L St NW" "Jefferson Dr & 14th St SW" "10th & E St NW" "5th & K St NW" ...
## $ bikeno : chr "W00169" "W01482" "W21346" "W00647" ...
## $ client : chr "Registered" "Registered" "Casual" "Casual" ...
### density plot not filled by `client` (single density plot)
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(x=time_of_day)) +
geom_density() +
facet_wrap(~day_of_week) +
theme( legend.position = "none")
### density plots filled by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_density(aes(fill=client), alpha=.3, col=NA) +
facet_wrap(~day_of_week)
### stacked density plots filled by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_density(aes(fill=client), position=position_stack(), alpha=.3, col=NA) +
facet_wrap(~day_of_week)
### histogram not filled by `client` (single histogram)
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(x=time_of_day)) +
geom_histogram(binwidth=5) +
facet_wrap(~day_of_week) +
theme( legend.position = "none")
### histogram with dodged fills by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_histogram(aes(fill=client), position="dodge", alpha=.3, col=NA, binwidth=5) +
facet_wrap(~day_of_week)
### histogram with stacked fills by `client`
Trips %>%
mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60, day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
ggplot(aes(time_of_day)) +
geom_histogram(aes(fill=client), position=position_stack(), alpha=.3, col=NA, binwidth=5) +
facet_wrap(~day_of_week)
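Finally, one way to make the stacked picture add up to the single one (a sketch, assuming a recent ggplot2 with `after_stat()` and the same packages loaded as above) is to stack count-scaled densities instead of densities: each group's curve is weighted by its number of trips, so the stacked total roughly matches the overall count-scaled density.

Trips %>%
  mutate(time_of_day = lubridate::hour(sdate) + lubridate::minute(sdate)/60,
         day_of_week = lubridate::wday(sdate, label=TRUE)) %>%
  ggplot(aes(time_of_day)) +
  geom_density(aes(y = after_stat(count), fill = client),
               position = position_stack(), alpha = .3, col = NA) +
  facet_wrap(~day_of_week)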
What is your question?
Note: for each box plot you need at least two distinct values to get a nonzero interquartile range.
data=c(1)
quantile(data)
## 0% 25% 50% 75% 100%
## 1 1 1 1 1
data=c(1,2)
quantile(data)
## 0% 25% 50% 75% 100%
## 1.00 1.25 1.50 1.75 2.00
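The interquartile range follows directly from these quantiles:

IQR(c(1))     # 0    -- a degenerate box of zero height
IQR(c(1, 2))  # 0.5  -- a box with nonzero height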
Here are the details (with made up data):
party <- data.frame(c("DON SAMUELS", "JACKIE CHERRYHOMES", "BETSY HODGES", "CAM WINTON"),
                    c("DFL", "LIBERTARIAN", "INDEPENDENT", "DFL"))
names(party) <- c("Candidate","Party")
party
Candidate | Party |
---|---|
DON SAMUELS | DFL |
JACKIE CHERRYHOMES | LIBERTARIAN |
BETSY HODGES | INDEPENDENT |
CAM WINTON | DFL |
A few rows of the `candidates` table look like this:

 | Precinct | Ward | Candidate |
---|---|---|---|
7762 | P-08 | W-2 | DON SAMUELS |
78920 | P-08 | W-7 | JACKIE CHERRYHOMES |
25409 | P-03 | W-9 | BETSY HODGES |
31526 | P-04 | W-11 | CAM WINTON |
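The join code below needs a data frame called `candidates` with these columns. If you want to run it on its own, a minimal stand-in matching the rows shown above would be:

candidates <- data.frame(
  Precinct  = c("P-08", "P-08", "P-03", "P-04"),
  Ward      = c("W-2", "W-7", "W-9", "W-11"),
  Candidate = c("DON SAMUELS", "JACKIE CHERRYHOMES", "BETSY HODGES", "CAM WINTON")
)
# (in older versions of R, data.frame() turned these columns into factors,
#  which is why left_join() warns below about coercing factor to character)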
ballet_party <- candidates %>%
left_join(party)
## Joining, by = "Candidate"
## Warning in left_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factor and character vector, coercing into character vector
ballet_party
Precinct | Ward | Candidate | Party |
---|---|---|---|
P-08 | W-2 | DON SAMUELS | DFL |
P-08 | W-7 | JACKIE CHERRYHOMES | LIBERTARIAN |
P-03 | W-9 | BETSY HODGES | INDEPENDENT |
P-04 | W-11 | CAM WINTON | DFL |
ballet_party %>%
group_by(Precinct,Party) %>%
summarize(count=n())
## Source: local data frame [4 x 3]
## Groups: Precinct [?]
##
## Precinct Party count
## <chr> <fctr> <int>
## 1 P-03 INDEPENDENT 1
## 2 P-04 DFL 1
## 3 P-08 DFL 1
## 4 P-08 LIBERTARIAN 1
Did this already.
Did this already.
Did this already.
Did this already.
Did this already.
14, 15: more specifically, how do we know the number of cases?
16: inner joins, and could you please go over left and right joins?
Did this already
Did this already
MISPRINT: replace `by = c("First", "variable2")` with `by = c("FIRST" = "variable1")`.
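Since the misprint concerns the `by =` argument, here is a toy sketch of joining on key columns with different names (made-up tables; `id_x` and `id_y` are illustrative names only), which also shows how left, right, and inner joins differ:

library(dplyr)

x <- data.frame(id_x = c(1, 2, 3), score = c(10, 20, 30))
y <- data.frame(id_y = c(2, 3, 4), grade = c("B", "A", "C"))

left_join(x, y, by = c("id_x" = "id_y"))   # all rows of x; id 1 gets NA for grade
right_join(x, y, by = c("id_x" = "id_y"))  # all rows of y; id 4 gets NA for score
inner_join(x, y, by = c("id_x" = "id_y"))  # only ids 2 and 3, present in both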
Let's review the different joins.
The data tables `Grades` and `Courses` give grade- and course-related information for a school.
Grades <- read.csv("http://tiny.cc/mosaic/grades.csv")
 | sid | grade | sessionID |
---|---|---|---|
2197 | S31680 | B | session3518 |
259 | S31242 | B | session2897 |
4188 | S32127 | A | session2002 |
3880 | S32058 | A- | session2952 |
Courses <- read.csv("http://tiny.cc/mosaic/courses.csv")
 | sessionID | dept | level | sem | enroll | iid |
---|---|---|---|---|---|---|
640 | session2568 | J | 100 | FA2002 | 15 | inst223 |
76 | session1940 | d | 100 | FA2000 | 16 | inst409 |
1218 | session3242 | m | 200 | SP2004 | 30 | inst476 |
The key linking the two tables is `sessionID` (the primary key in `Courses`, a foreign key in `Grades`).
Write out the commands to find the average class size (average enrollment) for each student (`sid`). The first 6 lines of output should be:
sid | ave_enroll |
---|---|
S31185 | 29.00000 |
S31188 | 27.55556 |
S31191 | 29.13333 |
S31194 | 19.50000 |
S31197 | 24.42857 |
S31200 | 26.50000 |
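One way to produce this output (a sketch, not the only one): join `Grades` to `Courses` on `sessionID`, then average the enrollment over each student's sessions.

Grades %>%
  left_join(Courses, by = "sessionID") %>%   # bring in enroll for each grade record
  group_by(sid) %>%                          # one group per student
  summarise(ave_enroll = mean(enroll, na.rm = TRUE)) %>%
  head()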
Let's review `gather` and `spread`.
library(tidyr)
We make a data table of country populations in different years.
a <- c("Afghanistan","Brazil", "China")
b <- c(745,37737,212258)
c <- c(266,80488,213766)
df_wide <- data.frame(a,b,c)
names(df_wide) <- c("countries","1999", "2000")
df_wide
countries | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 266 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
We wish to gather this table into tidy form.
df_narrow <- df_wide %>%
gather( key = year, value = population, `1999`, `2000`)
df_narrow
countries | year | population |
---|---|---|
Afghanistan | 1999 | 745 |
Brazil | 1999 | 37737 |
China | 1999 | 212258 |
Afghanistan | 2000 | 266 |
Brazil | 2000 | 80488 |
China | 2000 | 213766 |
To convert back to wide form we use `spread`:
df_wide1 <- df_narrow %>%
spread( key = year, value = population)
df_wide1
countries | 1999 | 2000 |
---|---|---|
Afghanistan | 745 | 266 |
Brazil | 37737 | 80488 |
China | 212258 | 213766 |
Did this already.
What is your question?
BabyNames %>%
group_by(name) %>%
summarise(tot = sum(count)) %>%
mutate(rank = rank(desc(tot))) %>%
filter(name == "Fernando")
name | tot | rank |
---|---|---|
Fernando | 90785 | 663 |
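As a reminder, `rank(desc(tot))` gives rank 1 to the largest total (`desc()` is the dplyr helper already used above). A tiny check with made-up numbers:

x <- c(10, 30, 20)
rank(desc(x))
## [1] 3 1 2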
What is your question?