Source file ⇒ lec23.Rmd
Could we please review the different components of graphics, especially the definitions and examples of glyphs, guides and scales? Thanks!
Best resource for the grammar of graphics: ggplot
glyph (or geom) = the basic graphical unit; it determines the type of plot you will produce.
aesthetic = a visual property of the glyph (position, shape, color).
scale = the relationship between a variable and the aesthetic to which it is mapped, for example:
Age -> x
Height -> y
Sex -> color
guide = an indication for the human viewer of the scale (axis labels, tick marks, a legend).
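To make the four terms concrete, here is a minimal sketch that maps Age, Height and Sex exactly as listed above. The data frame `people`, its values, and the chosen colors are invented for illustration; only the mapping itself comes from the notes.

```r
library(ggplot2)

# Hypothetical data: the values are invented purely for illustration.
people <- data.frame(
  Age    = c(5, 8, 11, 14, 6, 9, 12, 15),
  Height = c(108, 127, 144, 160, 114, 132, 150, 168),
  Sex    = c("F", "F", "F", "F", "M", "M", "M", "M")
)

p <- ggplot(people, aes(x = Age, y = Height, color = Sex)) +    # aesthetics
  geom_point() +                                                # glyph (geom)
  scale_color_manual(values = c(F = "purple", M = "orange")) +  # scale
  labs(x = "Age (years)", y = "Height (cm)", color = "Sex")     # guide labels

p
```

The axes and the color legend that ggplot draws automatically are the guides; the `scale_*` functions control how data values are translated into positions and colors.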
Consider the following graphic:
# Geoms: geom_point, geom_smooth, geom_boxplot
# Points - x, y, color
# Lines - x, y, color
# Boxplot - x, y, group
# x - Education
# y - Wage
# color - Sex
# group - Education
# Education - quantitative
# Wage - quantitative
# Sex - qualitative
# Guides: labels on the x and y axes showing the scales for education and wage
# Guides: key on the side showing which color belongs to which sex
ggplot and the CPS85 data table. (Hint: the font size of the plot title is 20.)
CPS85 %>%
  ggplot(aes(x = educ, y = wage)) +
  geom_boxplot(aes(group = educ)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = "lm", aes(color = sex)) +
  ylim(0, 15) +
  labs(title = "Wage vs. Education in CPS85") +
  theme(plot.title = element_text(size = 20))
Consider this plot made with geom_boxplot()
…
Taken collectively in this case, since the plot needs summary information about the data table, namely the 5th, 50th and 95th percentiles.
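As a small sketch of what "summary information" means here, the percentiles named above can be computed with base R's `quantile()`. The `wage` vector below is invented for illustration and is not taken from any data table in these notes.

```r
# Hypothetical wage values, invented for illustration only.
wage <- c(3.5, 5.0, 6.2, 7.8, 9.0, 10.5, 12.0, 15.0, 18.5, 24.0)

# One summary value per percentile, computed from all cases together.
pcts <- quantile(wage, probs = c(0.05, 0.50, 0.95))
pcts
```

No single case determines these numbers, which is why such a glyph is built from the cases collectively.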
Solution
Three variables
(20,30], (30,40] and so on
You have been given a newspaper report of the political party affiliations of each candidate, like this:
DFL: Betsy Hodges, Mark Andrew, Bob Fine, and Don Samuels
Independent: Cam Winton
Libertarian: Christopher Zimmerman and Christopher Clark
Given this information in this form along with data in this format:
 | Precinct | Ward | Candidate |
---|---|---|---|
7762 | P-08 | W-2 | DON SAMUELS |
78920 | P-08 | W-7 | JACKIE CHERRYHOMES |
25409 | P-03 | W-9 | BETSY HODGES |
31526 | P-04 | W-11 | CAM WINTON |
Describe in words how you would go about calculating the vote for each party in each precinct.
Solution
Candidate | Party |
---|---|
BETSY HODGES | DFL |
CAM WINTON | Independent |
df <- data.frame(
  Candidate = c("BETSY HODGES", "MARK ANDREWS", "BOB FINE", "DON SAMUELS",
                "CAM WINSTON", "CHRISTOPHER ZIMMERMAN", "CHRISTOPHER CLARK"),
  Party = c("DFL", "DFL", "DFL", "DFL", "I", "L", "L")
)
df
Candidate | Party |
---|---|
BETSY HODGES | DFL |
MARK ANDREWS | DFL |
BOB FINE | DFL |
DON SAMUELS | DFL |
CAM WINSTON | I |
CHRISTOPHER ZIMMERMAN | L |
CHRISTOPHER CLARK | L |
ballet <- Minneapolis2013 %>%
  select(Precinct, Ward, First) %>%
  rename(Candidate = First) %>%
  filter(Candidate != "undervote")
head(ballet)
Precinct | Ward | Candidate |
---|---|---|
P-10 | W-7 | BETSY HODGES |
P-06 | W-10 | BOB FINE |
P-09 | W-10 | KURTIS W. HANNA |
P-05 | W-13 | BETSY HODGES |
P-01 | W-5 | DON SAMUELS |
P-05 | W-1 | BETSY HODGES |
merged_df <- ballet %>% inner_join(df)
## Joining by: "Candidate"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
head(merged_df)
Precinct | Ward | Candidate | Party |
---|---|---|---|
P-10 | W-7 | BETSY HODGES | DFL |
P-06 | W-10 | BOB FINE | DFL |
P-05 | W-13 | BETSY HODGES | DFL |
P-01 | W-5 | DON SAMUELS | DFL |
P-05 | W-1 | BETSY HODGES | DFL |
P-01 | W-12 | BETSY HODGES | DFL |
merged_df %>%
  group_by(Precinct, Party) %>%
  summarize(count = n()) %>%
  head()
Precinct | Party | count |
---|---|---|
P-01 | DFL | 3790 |
P-01 | L | 27 |
P-01C | DFL | 220 |
P-02 | DFL | 4943 |
P-02 | L | 24 |
P-02D | DFL | 407 |
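An inner join silently drops any ballot whose candidate name has no exact match in the lookup table, so spellings matter (compare CAM WINTON in the ballot data with CAM WINSTON in the lookup table). A minimal sketch of how one might check for this, using `anti_join()` on hypothetical miniature tables named `ballots` and `parties`:

```r
library(dplyr)

# Hypothetical miniature versions of the ballot and lookup tables.
ballots <- data.frame(
  Precinct  = c("P-08", "P-03", "P-04"),
  Candidate = c("DON SAMUELS", "BETSY HODGES", "CAM WINTON")
)
parties <- data.frame(
  Candidate = c("DON SAMUELS", "BETSY HODGES", "CAM WINSTON"),
  Party     = c("DFL", "DFL", "I")
)

# Ballots whose candidate has no row in the lookup table.
unmatched <- ballots %>%
  anti_join(parties, by = "Candidate") %>%
  count(Candidate)

unmatched
```

Any candidate listed in `unmatched` would be missing from the party totals after an inner join, so it is worth running a check like this before summarizing.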
The data table BodyTypes gives the responses of 24,117 women and 35,829 men who are members of OkCupid to a question about their “body type”. The numbers are the counts of people listing that body_type.
body_type | f | m |
---|---|---|
average | 5620 | 9032 |
fit | 4431 | 8280 |
curvy | 3811 | 113 |
(none given) | 2703 | 2593 |
thin | 2469 | 2242 |
athletic | 2309 | 9510 |
full figured | 870 | 139 |
a little extra | 821 | 1808 |
skinny | 601 | 1176 |
overweight | 145 | 299 |
jacked | 129 | 292 |
rather not say | 106 | 92 |
used up | 102 | 253 |
Write a wrangling statement to find the proportion of women with each body type. Call this variable fprop. Similarly, calculate mprop, the proportion of men with each body type, and ratio, the ratio of the women’s proportion to the men’s. Finally, create a data table arranged in order from most unbalanced to most balanced. (Hint for the last step: pmax() will compare two variables, case by case, returning the maximum quantity of the two for each case.)
Solution
n_women <- 24117
n_men <- 35829
BodyTypes %>%
  mutate(fprop = f / n_women, mprop = m / n_men, ratio = fprop / mprop) %>%
  arrange(desc(pmax(fprop / mprop, mprop / fprop)))
body_type | f | m | fprop | mprop | ratio |
---|---|---|---|---|---|
curvy | 3811 | 113 | 0.1580213 | 0.0031539 | 50.1039435 |
full figured | 870 | 139 | 0.0360741 | 0.0038795 | 9.2985634 |
athletic | 2309 | 9510 | 0.0957416 | 0.2654274 | 0.3607072 |
rather not say | 106 | 92 | 0.0043952 | 0.0025678 | 1.7117071 |
used up | 102 | 253 | 0.0042294 | 0.0070613 | 0.5989507 |
thin | 2469 | 2242 | 0.1023759 | 0.0625750 | 1.6360512 |
(none given) | 2703 | 2593 | 0.1120786 | 0.0723715 | 1.5486559 |
jacked | 129 | 292 | 0.0053489 | 0.0081498 | 0.6563240 |
a little extra | 821 | 1808 | 0.0340424 | 0.0504619 | 0.6746152 |
overweight | 145 | 299 | 0.0060124 | 0.0083452 | 0.7204573 |
skinny | 601 | 1176 | 0.0249202 | 0.0328226 | 0.7592391 |
fit | 4431 | 8280 | 0.1837293 | 0.2310977 | 0.7950287 |
average | 5620 | 9032 | 0.2330306 | 0.2520863 | 0.9244082 |
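The totals 24,117 and 35,829 are simply the column sums of f and m, so a variant of the solution can compute them inside `mutate()` instead of hard-coding them. The sketch below rebuilds the table from the counts shown above; the name `result` is only for illustration.

```r
library(dplyr)

# Counts copied from the BodyTypes table above.
BodyTypes <- data.frame(
  body_type = c("average", "fit", "curvy", "(none given)", "thin",
                "athletic", "full figured", "a little extra", "skinny",
                "overweight", "jacked", "rather not say", "used up"),
  f = c(5620, 4431, 3811, 2703, 2469, 2309, 870, 821, 601, 145, 129, 106, 102),
  m = c(9032, 8280, 113, 2593, 2242, 9510, 139, 1808, 1176, 299, 292, 92, 253)
)

result <- BodyTypes %>%
  mutate(
    fprop = f / sum(f),   # sum(f) is the total number of women
    mprop = m / sum(m),   # sum(m) is the total number of men
    ratio = fprop / mprop
  ) %>%
  # pmax(ratio, 1 / ratio) is large whenever either sex dominates.
  arrange(desc(pmax(ratio, 1 / ratio)))

head(result, 3)
```

Using `pmax(ratio, 1 / ratio)` treats a ratio of 0.36 (men dominate) as being as unbalanced as a ratio of about 2.77 (women dominate), which is why athletic lands third in the ordering.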
Suppose the following tables are loaded on your computer. You can call them TableA and TableB, respectively.
Year | Algeria | Brazil | Columbia |
---|---|---|---|
2000 | 7 | 12 | 16 |
2001 | 9 | 14 | 18 |
Country | Abbreviation |
---|---|
Algeria | DZ |
Brazil | BR |
Columbia | CO |
Write code to produce the following table.
Abbreviation | Avg |
---|---|
DZ | 8 |
BR | 13 |
CO | 17 |
Solution
step 1
TableA %>%
  gather(key = Country, value = Value, -Year)
Year | Country | Value |
---|---|---|
2000 | Algeria | 7 |
2001 | Algeria | 9 |
2000 | Brazil | 12 |
2001 | Brazil | 14 |
2000 | Columbia | 16 |
2001 | Columbia | 18 |
step 2
TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value))
Country | Avg |
---|---|
Algeria | 8 |
Brazil | 13 |
Columbia | 17 |
step 3
TableC <- TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = "Country")
TableC
Country | Avg | Abbreviation |
---|---|---|
Algeria | 8 | DZ |
Brazil | 13 | BR |
Columbia | 17 | CO |
step 4
TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = "Country") %>%
  select(Abbreviation, Avg)
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
Abbreviation | Avg |
---|---|
DZ | 8 |
BR | 13 |
CO | 17 |
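The same four steps can also be written with `pivot_longer()`, which newer versions of tidyr recommend in place of `gather()`. This is a sketch rather than the solution used in class; the tables are rebuilt from the values shown above and `result` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# TableA and TableB rebuilt from the values shown above.
TableA <- data.frame(
  Year     = c(2000, 2001),
  Algeria  = c(7, 9),
  Brazil   = c(12, 14),
  Columbia = c(16, 18)
)
TableB <- data.frame(
  Country      = c("Algeria", "Brazil", "Columbia"),
  Abbreviation = c("DZ", "BR", "CO")
)

result <- TableA %>%
  # Wide to narrow: one row per Year and Country.
  pivot_longer(cols = -Year, names_to = "Country", values_to = "Value") %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = "Country") %>%
  select(Abbreviation, Avg)

result
```

Because both Country columns are character vectors here, the join runs without the factor coercion warning.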
Please go over 7, 11, 13 and problem 16 part 3.
What is your question?
Write a function testLoop which takes the single argument n and returns the first n − 1 (assume n is larger than 3) values of the sequence \(\{x_j\}_{j \ge 0}\): that is, the values of \(x_0, x_1, x_2, \ldots, x_{n-2}\).
tmp <- as.data.frame(cbind(1:499,testLoop(500)))
colnames(tmp) <- c("n","result")
Try to use ggplot to plot the result like below:
Solution a)
testLoop <- function(n) {
  xVec <- rep(NA, n - 1)
  xVec[1] <- 1
  xVec[2] <- 2
  for (j in 3:(n - 1)) {
    xVec[j] <- xVec[j - 1] + 2 / xVec[j - 1]
  }
  return(xVec)
}
ggplot(data = tmp, aes(x = n, y = result)) +
  geom_point() +
  geom_smooth(se = FALSE)
Write a function fib that takes a positive integer n and returns the nth Fibonacci number. (In R we start counting from 1, not 0.) The Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, and so on. The first two Fibonacci numbers are 0 and 1, respectively. Starting from the third, each Fibonacci number is the sum of the previous two. In other words, fib(3) = fib(2) + fib(1), fib(4) = fib(3) + fib(2), etc.
Solution
fib <- function(n) {
  if (n == 1) {
    return(0)
  } else if (n == 2) {
    return(1)
  } else {
    count <- 3
    prev1 <- 1
    prev2 <- 0
    while (count <= n) {
      temp <- prev2
      prev2 <- prev1
      prev1 <- prev1 + temp
      count <- count + 1
    }
    return(prev1)
  }
}
fib(1)
## [1] 0
fib(2)
## [1] 1
fib(3)
## [1] 1
fib(4)
## [1] 2
fib(5)
## [1] 3
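If the whole sequence is wanted rather than a single value, one possible variant fills a preallocated vector in a single pass. The function name `fib_all` is hypothetical and was not part of the original problem; it uses the same 1-based convention as fib.

```r
# Hypothetical helper: returns the first n Fibonacci numbers (n >= 1).
fib_all <- function(n) {
  out <- numeric(n)          # preallocate the result vector
  out[1] <- 0
  if (n >= 2) out[2] <- 1
  if (n >= 3) {
    for (i in 3:n) {
      out[i] <- out[i - 1] + out[i - 2]
    }
  }
  out
}

fib_all(7)
## [1] 0 1 1 2 3 5 8
```

The guard `if (n >= 3)` matters because `3:n` counts downward when n is less than 3.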
What is your question here?
Could we please go over Q12 from the Review Sheet?
Sorry, this question wasn’t well posed. Here is what I should have asked:
Suppose we have a data frame called data like below:
transect | fish_abund |
---|---|
3 | 19 |
1 | 4 |
1 | 13 |
1 | 18 |
1 | 0 |
1 | 5 |
3 | 4 |
3 | 6 |
3 | 18 |
Write the code to get the result like below:
genre | Sum_fish_abund |
---|---|
Light | 40 |
Strong | 47 |
Solution
data %>%
  mutate(genre = ifelse(transect == "1", "Light", "Strong")) %>%
  group_by(genre) %>%
  summarise(Sum_fish_abund = sum(fish_abund))
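An alternative sketch, in the spirit of the party lookup table earlier, keeps the transect-to-genre mapping in its own small table and joins it on. The table name `genres` is hypothetical; the fish counts are copied from the data shown above.

```r
library(dplyr)

# Data copied from the table above.
data <- data.frame(
  transect   = c(3, 1, 1, 1, 1, 1, 3, 3, 3),
  fish_abund = c(19, 4, 13, 18, 0, 5, 4, 6, 18)
)

# Hypothetical lookup table: one row per transect.
genres <- data.frame(
  transect = c(1, 3),
  genre    = c("Light", "Strong")
)

totals <- data %>%
  left_join(genres, by = "transect") %>%
  group_by(genre) %>%
  summarise(Sum_fish_abund = sum(fish_abund))

totals
```

With only two groups, `ifelse()` is shorter; the lookup table becomes more convenient once there are many transects to classify.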
Please go over problem 11 part a and problem 13. Thank you!
Yes, that was a misprint; see above.
Happy to go over it in detail.
Can we go over #9 why you used n() for finding proportion of female?
Correct, misprint. Thanks! See solution above.
Can we just do a quick review of vector expressions vs for loops and when to use for vs vector? vector expressions are kind of confusing. Thanks!
(do you mean vectorized calculation?)
Which do you think is better?
for loop:
x <- 1:5
y <- 6:10
z <- c()
for (i in seq_along(x)) {
  z <- c(z, x[i] + y[i])
}
z
## [1] 7 9 11 13 15
vectorized calculation:
x <- 1:5
y <- 6:10
z <- x + y
z
## [1] 7 9 11 13 15
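When a loop really is needed (for example, when each step depends on the previous one, as in the sequence and Fibonacci problems), a common habit is to preallocate the result instead of growing it with `c()`. A minimal sketch, with `z_loop` and `z_vec` as illustrative names:

```r
x <- 1:5
y <- 6:10

# Loop version with a preallocated result vector.
z_loop <- integer(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no explicit loop.
z_vec <- x + y

identical(z_loop, z_vec)
## [1] TRUE
```

For element-by-element arithmetic like this, the vectorized form is shorter and usually faster; the loop is kept for cases where vectorization is not available.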
Also see number 17 and 22
I feel that for questions 7 and 8 we are not given enough information. How similar will the exam questions be to this problem set?
Can we go over question 15, 16b, and 22(1)?
Can we please go over problems 9, 11, 12, 13, 16(2), 17(part 1)?
And for problem 20(c): can we use sapply instead?