Source file ⇒ lec23.Rmd

show_answers = TRUE

Today (Midterm review)

Lingjie Qiao

Could we please review the different components of graphics, especially the definitions and examples of glyphs, guides and scales? Thanks!

best resource for grammar of graphics: ggplot

glyph or geom = the basic graphical unit that represents the data (a point, a line, a box); the geom you choose determines the type of plot you produce.

aesthetic = a visual property of the glyph (position, shape, color).

scale = the relationship between a variable and the aesthetic to which it is mapped.

Age -> x
Height -> y
Sex -> color

guide = an indication of the scale for the human viewer (for example, axis tick labels or a color legend).
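The four terms can be seen together in one small sketch. The data frame `people` and its values are invented for illustration, and the purple/orange colors are an arbitrary choice; only the mappings Age -> x, Height -> y, Sex -> color come from the notes above.

```r
library(ggplot2)

# Hypothetical data, invented purely for illustration
people <- data.frame(
  Age    = c(21, 25, 30, 34, 40, 45),
  Height = c(160, 175, 168, 180, 158, 172),
  Sex    = c("F", "M", "F", "M", "F", "M")
)

p <- ggplot(people, aes(x = Age, y = Height, color = Sex)) +  # aesthetics: variable -> visual property
  geom_point() +                                              # glyph: one point per case
  scale_color_manual(values = c(F = "purple", M = "orange"))  # scale: which color stands for which level

p  # the axes and the color legend are the guides, drawn automatically
```

Swapping `geom_point()` for another geom changes the glyph while the aesthetics, scales, and guides stay the same.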

Example

Consider the following graphic:

  1. What are the glyphs/geoms in this graph?
# geom_point, geom_smooth, geom_boxplot
  2. What are the aesthetics of each glyph?
# Points - x, y, color
# Lines - x, y, color
# Boxplot - x, y, group
  3. Which variables are being mapped to each aesthetic?
# x - Education
# y - Wage
# color - Sex
# group - Education
  4. Are the variables qualitative or quantitative?
# Education - quantitative
# Wage - quantitative
# Sex - qualitative
  5. What are the guides on this graph?
# labels on the x and y axes showing the scales for education and wage
# key on the side showing which color belongs to which sex
  6. Reconstruct this graphic using ggplot and the CPS85 data table. (Hint: the font size of the plot title is 20)
library(mosaicData)  # provides the CPS85 data table
library(dplyr)       # provides %>%
library(ggplot2)

CPS85 %>%
  ggplot(aes(x = educ, y = wage)) +
  geom_boxplot(aes(group = educ)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = "lm", aes(color = sex)) +
  ylim(0, 15) +
  labs(title = "Wage vs. Education in CPS85") +
  theme(plot.title = element_text(size = 20))

Jerome Rufin

number 6: part 1 and part 2d

Consider this plot made with geom_boxplot()

  1. Do the glyphs represent individual cases or cases taken collectively (a.k.a. a “statistic”)?

The glyphs represent cases taken collectively: each box-and-whisker glyph summarizes a whole group of cases through statistics such as the median (50th percentile) and the 25th and 75th percentiles.

  2. What do the glyph-ready data look like? That is,
    1. What are the variables?
    2. Are they quantitative or categorical?
    3. What are the levels for any categorical variable?
    4. Is it likely that there are 20 or fewer cases?

Solution

Three variables

  • running time — quantitative
  • Age — categorical with levels (20,30], (30,40] and so on
  • sex — categorical with levels F and M
  • Each box-and-whisker glyph involves at least two cases (and typically many more than that). There are 12 box-and-whisker glyphs, so there are at least 24 cases; it is therefore not likely that there are 20 or fewer.
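To make "cases taken collectively" concrete, here is a minimal sketch in base R. The data frame `runs` is simulated and entirely hypothetical (it is not the running data from the problem); it only shows how many individual cases collapse into one row of quartiles per box.

```r
# Hypothetical running times, simulated for illustration only
set.seed(1)
runs <- data.frame(
  time = rnorm(120, mean = 90, sd = 10),
  age  = rep(c("(20,30]", "(30,40]", "(40,50]"), each = 40),
  sex  = rep(c("F", "M"), times = 60)
)

# One row per box: the quartiles that a box plot draws as the box and its middle line
box_stats <- aggregate(
  time ~ age + sex, data = runs,
  FUN = function(v) quantile(v, probs = c(0.25, 0.5, 0.75))
)
box_stats
nrow(box_stats)  # 6 boxes summarizing 120 individual cases
```

With `geom_boxplot()` you pass the individual cases (here `runs`) and ggplot computes these statistics for you.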

number 8: inner_join (confused about solutions)

You have been given a newspaper report of the political party affiliations of each candidate, like this:

DFL: Betsy Hodges, Mark Andrew, Bob Fine, and Don Samuels

Independent: Cam Winton

Libertarian: Christopher Zimmerman and Christopher Clark

Given this information in this form along with data in this format:

       Precinct  Ward  Candidate
7762   P-08      W-2   DON SAMUELS
78920  P-08      W-7   JACKIE CHERRYHOMES
25409  P-03      W-9   BETSY HODGES
31526  P-04      W-11  CAM WINTON

Describe in words how you would go about calculating the vote for each party in each precinct.

Solution

  • First: Create a data table translating between candidate and party
Candidate Party
BETSY HODGES DFL
CAM WINTON Independent
# Candidate names must match the ballot data exactly, or the join will drop those ballots
df <- data.frame(
  Candidate = c("BETSY HODGES", "MARK ANDREW", "BOB FINE", "DON SAMUELS",
                "CAM WINTON", "CHRISTOPHER ZIMMERMAN", "CHRISTOPHER CLARK"),
  Party = c("DFL", "DFL", "DFL", "DFL", "I", "L", "L")
)
df
Candidate Party
BETSY HODGES DFL
MARK ANDREW DFL
BOB FINE DFL
DON SAMUELS DFL
CAM WINTON I
CHRISTOPHER ZIMMERMAN L
CHRISTOPHER CLARK L
  • Second: Join this with the ballot data
ballot <- Minneapolis2013 %>%
  select(Precinct, Ward, First) %>%
  rename(Candidate = First) %>%
  filter(Candidate != "undervote")
head(ballot)
Precinct Ward Candidate
P-10 W-7 BETSY HODGES
P-06 W-10 BOB FINE
P-09 W-10 KURTIS W. HANNA
P-05 W-13 BETSY HODGES
P-01 W-5 DON SAMUELS
P-05 W-1 BETSY HODGES
merged_df <- ballot %>% inner_join(df)
## Joining by: "Candidate"
## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector
head(merged_df)
Precinct Ward Candidate Party
P-10 W-7 BETSY HODGES DFL
P-06 W-10 BOB FINE DFL
P-05 W-13 BETSY HODGES DFL
P-01 W-5 DON SAMUELS DFL
P-05 W-1 BETSY HODGES DFL
P-01 W-12 BETSY HODGES DFL
  • Third: group by precinct and party, summarising with a count of the number of ballots in each group.
merged_df %>% 
  group_by(Precinct, Party) %>%
  summarize(count=n()) %>%
  head()
Precinct Party count
P-01 DFL 3790
P-01 L 27
P-01C DFL 220
P-02 DFL 4943
P-02 L 24
P-02D DFL 407
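Because `inner_join()` keeps only the rows that find a match, a ballot whose candidate is missing from (or misspelled in) the lookup table silently disappears from the counts. A minimal sketch on toy tables: `ballots` and `parties` are made-up stand-ins modeled on the rows in the question, not the real Minneapolis2013 data, and `anti_join()` is used here as one possible way to see what was dropped.

```r
library(dplyr)

# Toy ballots, modeled on the rows shown in the question
ballots <- data.frame(
  Precinct  = c("P-08", "P-08", "P-03", "P-04"),
  Candidate = c("DON SAMUELS", "JACKIE CHERRYHOMES", "BETSY HODGES", "CAM WINTON")
)
# Toy lookup table built from the newspaper report (only three candidates here)
parties <- data.frame(
  Candidate = c("DON SAMUELS", "BETSY HODGES", "CAM WINTON"),
  Party     = c("DFL", "DFL", "Independent")
)

matched   <- inner_join(ballots, parties, by = "Candidate")  # ballots with a known party
unmatched <- anti_join(ballots, parties, by = "Candidate")   # ballots the inner join dropped

party_votes <- matched %>%
  group_by(Precinct, Party) %>%
  summarise(count = n(), .groups = "drop")

unmatched
party_votes
```

If `unmatched` contains candidates that do appear in the newspaper report, the names in the lookup table probably do not match the ballot spelling exactly.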

number 9

The data table BodyTypes gives the responses of 24,117 women and 35,829 men who are members of OkCupid to a question about their “body type”. The numbers are the counts of people listing that body_type.

body_type        f     m
average          5620  9032
fit              4431  8280
curvy            3811  113
(none given)     2703  2593
thin             2469  2242
athletic         2309  9510
full figured     870   139
a little extra   821   1808
skinny           601   1176
overweight       145   299
jacked           129   292
rather not say   106   92
used up          102   253

Write a wrangling statement to find the proportion of women with each body type. Call this variable fprop. Similarly, calculate mprop, the proportion of men with each body type, and ratio, the ratio of the women’s proportion to the men’s. Finally, create a data table arranged in order from most unbalanced to most balanced. (Hint for the last step: pmax() will compare two variables, case by case, returning the maximum quantity of the two for each case.)

Solution

n_women <- 24117
n_men <- 35829

BodyTypes %>%
  mutate(fprop = f / n_women, mprop = m / n_men, ratio = fprop / mprop) %>%
  arrange(desc(pmax(ratio, 1 / ratio)))
body_type f m fprop mprop ratio
curvy 3811 113 0.1580213 0.0031539 50.1039435
full figured 870 139 0.0360741 0.0038795 9.2985634
athletic 2309 9510 0.0957416 0.2654274 0.3607072
rather not say 106 92 0.0043952 0.0025678 1.7117071
used up 102 253 0.0042294 0.0070613 0.5989507
thin 2469 2242 0.1023759 0.0625750 1.6360512
(none given) 2703 2593 0.1120786 0.0723715 1.5486559
jacked 129 292 0.0053489 0.0081498 0.6563240
a little extra 821 1808 0.0340424 0.0504619 0.6746152
overweight 145 299 0.0060124 0.0083452 0.7204573
skinny 601 1176 0.0249202 0.0328226 0.7592391
fit 4431 8280 0.1837293 0.2310977 0.7950287
average 5620 9032 0.2330306 0.2520863 0.9244082
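The totals 24117 and 35829 are exactly the column sums of f and m, so they can also be computed from the table instead of being typed in. The sketch below rebuilds BodyTypes from the counts listed in the problem (the `data.frame()` call is a reconstruction for illustration, not how the table was originally loaded) and uses `sum()` inside `mutate()`.

```r
library(dplyr)

# Counts copied from the table in the problem statement
BodyTypes <- data.frame(
  body_type = c("average", "fit", "curvy", "(none given)", "thin", "athletic",
                "full figured", "a little extra", "skinny", "overweight",
                "jacked", "rather not say", "used up"),
  f = c(5620, 4431, 3811, 2703, 2469, 2309, 870, 821, 601, 145, 129, 106, 102),
  m = c(9032, 8280, 113, 2593, 2242, 9510, 139, 1808, 1176, 299, 292, 92, 253)
)

result <- BodyTypes %>%
  mutate(fprop = f / sum(f),          # sum(f) is the total number of women
         mprop = m / sum(m),          # sum(m) is the total number of men
         ratio = fprop / mprop) %>%
  arrange(desc(pmax(ratio, 1 / ratio)))  # most unbalanced first, in either direction

head(result, 3)
```

`pmax(ratio, 1 / ratio)` treats a ratio of 0.5 and a ratio of 2 as equally unbalanced, which is why athletic (ratio about 0.36) ranks above rather not say (ratio about 1.71).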

number 14: does summarise(Avg = mean(Value)) work?

Suppose the following tables are loaded on your computer. You can call them TableA and TableB, respectively.

TableA:

Year Algeria Brazil Columbia
2000 7 12 16
2001 9 14 18

TableB:

Country Abbreviation
Algeria DZ
Brazil BR
Columbia CO

Write code to produce the following table.

Abbreviation Avg
DZ 8
BR 13
CO 17

Solution

step 1

TableA %>%
  gather(key = Country, value = Value, -Year)
Year Country Value
2000 Algeria 7
2001 Algeria 9
2000 Brazil 12
2001 Brazil 14
2000 Columbia 16
2001 Columbia 18

step 2

TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value))
Country Avg
Algeria 8
Brazil 13
Columbia 17

step 3

TableC <- TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = c("Country"= "Country"))

TableC
Country Avg Abbreviation
Algeria 8 DZ
Brazil 13 BR
Columbia 17 CO

step 4

TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = c("Country"= "Country")) %>%
  select(Abbreviation, Avg)
## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector
Abbreviation Avg
DZ 8
BR 13
CO 17
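For reference, the same pipeline can be written with `pivot_longer()`, the newer tidyr function that covers the job of `gather()`. This is a sketch, not the required exam answer; TableA and TableB are rebuilt by hand here so that the block runs on its own.

```r
library(dplyr)
library(tidyr)

# The two tables from the problem, typed in by hand
TableA <- data.frame(Year = c(2000, 2001), Algeria = c(7, 9),
                     Brazil = c(12, 14), Columbia = c(16, 18))
TableB <- data.frame(Country = c("Algeria", "Brazil", "Columbia"),
                     Abbreviation = c("DZ", "BR", "CO"))

TableC <- TableA %>%
  pivot_longer(-Year, names_to = "Country", values_to = "Value") %>%  # wide -> narrow
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%       # one average per country
  left_join(TableB, by = "Country") %>%  # attach the abbreviation
  select(Abbreviation, Avg)

TableC
```

So yes, `summarise(Avg = mean(Value))` works, provided the data are first put in narrow form and grouped by Country.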

Lu Zhang

please go over 7, 11, 13 and problem 16 part 3

problem 7

What is your question?

problem 11

  1. Suppose \(x_0 = 1\) and \(x_1 = 2\) and \[ x_j = x_{j-1} + \frac{2}{x_{j-1}} \text{ for } j = 2, 3, \ldots \] (The original statement contained a misprint; the recursion is taken to start at \(j = 2\), which matches the solution below, since \(x_0\) and \(x_1\) are given.)

Write a function testLoop which takes the single argument n and returns the first n - 1 values of the sequence \(\{x_j\}_{j \geq 0}\) (assume n is larger than 3), that is, the values \(x_0, x_1, x_2, \ldots, x_{n-2}\).

  2. Assume we have already built the data frame below:
tmp <- as.data.frame(cbind(1:499,testLoop(500)))
colnames(tmp) <- c("n","result")

Try to use ggplot to plot the result like below:

Solution a)

testLoop <- function(n) {
  xVec <- rep(NA, n - 1)   # storage for x_0, ..., x_{n-2}
  xVec[1] <- 1             # x_0 (R indexes from 1)
  xVec[2] <- 2             # x_1
  for (j in 3:(n - 1)) {
    xVec[j] <- xVec[j - 1] + 2 / xVec[j - 1]
  }
  return(xVec)
}

Solution b)

ggplot(data = tmp, aes(x = n, y = result)) +
  geom_point() +
  geom_smooth(se = FALSE)
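A quick sanity check of the recursion can catch off-by-one errors between the mathematical index j (starting at 0) and R's index (starting at 1). The block below repeats the testLoop solution so that it runs on its own; the variable name `first_values` is just for illustration.

```r
# Same recursion as the solution: x_0 = 1, x_1 = 2, x_j = x_{j-1} + 2 / x_{j-1}
testLoop <- function(n) {
  xVec <- rep(NA, n - 1)
  xVec[1] <- 1
  xVec[2] <- 2
  for (j in 3:(n - 1)) {
    xVec[j] <- xVec[j - 1] + 2 / xVec[j - 1]
  }
  xVec
}

first_values <- testLoop(5)   # x_0, x_1, x_2, x_3
first_values                  # 1, 2, 3, then 3 + 2/3
length(testLoop(500))         # 499, matching the 1:499 column used to build tmp
```

By hand: \(x_2 = 2 + 2/2 = 3\) and \(x_3 = 3 + 2/3\), which agrees with the third and fourth entries.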

problem 13

Write a function fib that takes a positive integer n and returns the nth Fibonacci number. (In R we start counting from 1, not 0.) The Fibonacci numbers are 0, 1, 1, 2, 3, 5, 8, and so on. The first two Fibonacci numbers are 0 and 1, respectively. Starting from the third, each Fibonacci number is equal to the sum of the previous two. In other words, fib(3) = fib(2) + fib(1), fib(4) = fib(3) + fib(2), etc.

Solution

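fib <- function(n) {
  if (n == 1) {
    return(0)
  } else if (n == 2) {
    return(1)
  } else {
    count <- 3     # position of the Fibonacci number computed next
    prev1 <- 1     # most recent Fibonacci number, fib(count - 1)
    prev2 <- 0     # the one before it, fib(count - 2)
    while (count <= n) {
      temp <- prev2
      prev2 <- prev1
      prev1 <- prev1 + temp   # fib(count) = fib(count - 1) + fib(count - 2)
      count <- count + 1
    }
    return(prev1)
  }
}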
fib(1)
## [1] 0
fib(2)
## [1] 1
fib(3)
## [1] 1
fib(4)
## [1] 2
fib(5)
## [1] 3
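An alternative way to organize the same calculation is to store the whole sequence in a preallocated vector and fill it in from left to right. The function name `fib_seq` is hypothetical (it is not part of the review sheet); it returns the first n Fibonacci numbers rather than only the nth.

```r
# Hypothetical alternative: the first n Fibonacci numbers in a preallocated vector
fib_seq <- function(n) {
  out <- numeric(n)               # all zeros, so out[1] is already 0
  if (n >= 2) out[2] <- 1
  if (n >= 3) {
    for (j in 3:n) out[j] <- out[j - 1] + out[j - 2]
  }
  out
}

fib_seq(7)
## [1] 0 1 1 2 3 5 8
```

The nth Fibonacci number is then the last element, `fib_seq(n)[n]`.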

problem 16 part c

What is your question here?

MALAYANDI (Andy) PALANIAPPAN

Could we please go over Q12 from the Review Sheet?

Sorry, this question was not well posed. Here is what I should have asked:

Suppose we have a data frame called data, shown below:

transect fish_abund
3 19
1 4
1 13
1 18
1 0
1 5
3 4
3 6
3 18

Write code to produce the following result:

genre Sum_fish_abund
Light 40
Strong 47

Solution

data %>%
  mutate(genre = ifelse(transect == 1, "Light", "Strong")) %>%  # transect 1 is Light, otherwise Strong
  group_by(genre) %>%
  summarise(Sum_fish_abund = sum(fish_abund))
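To try this yourself, the data frame can be typed in from the table in the question. The sketch below does that (the `data.frame()` call is a hand reconstruction for illustration) and stores the answer in `result` so that the two sums can be checked against 40 and 47.

```r
library(dplyr)

# Values copied from the table in the question
data <- data.frame(
  transect   = c(3, 1, 1, 1, 1, 1, 3, 3, 3),
  fish_abund = c(19, 4, 13, 18, 0, 5, 4, 6, 18)
)

result <- data %>%
  mutate(genre = ifelse(transect == 1, "Light", "Strong")) %>%  # new grouping variable
  group_by(genre) %>%
  summarise(Sum_fish_abund = sum(fish_abund))                   # one total per genre

result
```

By hand: transect 1 gives 4 + 13 + 18 + 0 + 5 = 40 and transect 3 gives 19 + 4 + 6 + 18 = 47.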

NOOSHA BRONTE RAZAVIAN

Please go over problem 11 part a and problem 13. Thank you!

11a

Yes, that was a misprint; see problem 11 above.

13

Happy to go over in detail

TAE WOOK HA

Can we go over #9 why you used n() for finding proportion of female?

Correct, misprint. Thanks! See solution above.

Sid Masih

Can we just do a quick review of vector expressions vs for loops and when to use for vs vector? vector expressions are kind of confusing. Thanks!

(do you mean vectorized calculation?)

Which do you think is better?

for loop:

x <- 1:5
y <- 6:10
z <- c()
for (i in c(1:5)) {
  z <- c(z, x[i] + y[i])
}
z
## [1]  7  9 11 13 15

vectorized calculation

x <- 1:5
y <- 6:10
z <- x + y
z
## [1]  7  9 11 13 15
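If a loop really is needed, it is usually better to preallocate the result and fill it in by position than to grow it with `c()` on every pass. A minimal sketch comparing that style with the vectorized expression; the names `z_loop` and `z_vec` are only for illustration.

```r
x <- 1:5
y <- 6:10

# Loop version with a preallocated result, avoiding repeated c() calls
z_loop <- integer(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no explicit loop
z_vec <- x + y

identical(z_loop, z_vec)
## [1] TRUE
```

Both give the same answer; the vectorized form is shorter and is generally the preferred choice in R whenever each element can be computed independently of the others. A recursion such as testLoop or fib, where each value depends on the previous one, is a case where a loop is the natural tool.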

Also see number 17 and 22

RONG HUANG

problems 4(2),4(3)

I feel for question 7,8, we are not provided enough information. How similar will the exam questions compare to this problem set?

YISHAN HAN

Can we go over question 15, 16b, and 22(1)?

ROWENA JIEYI XIA

Can we please go over problems 9, 11, 12, 13, 16(2), 17(part 1)?

And for problem 20(c): can we use sapply instead?