Today (Midterm review)

Lingjie Qiao

Could we please review the different components of graphics, especially the definitions and examples of glyphs, guides and scales? Thanks!

best resource for grammar of graphics: ggplot

glyph or geom = the basic graphical unit drawn for the data (a point, a line, a box); the geom you choose determines the type of plot you produce.

aesthetic= a visual property of the glyph (position, shape, color).

scale = the relationship between a variable and the aesthetic to which it is mapped.

Age -> x
Height -> y
Sex -> color

guide = An indication for the human viewer of the scale
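The four terms can be seen together in one small plot. This is only a sketch: the `people` data frame and the colors are invented for illustration, using the Age, Height and Sex mapping listed above.

```r
library(ggplot2)

# Hypothetical data, invented only to illustrate the vocabulary
people <- data.frame(
  Age    = c(5, 8, 11, 14, 6, 9, 12, 15),
  Height = c(108, 127, 144, 160, 114, 133, 149, 170),
  Sex    = c("F", "F", "F", "F", "M", "M", "M", "M")
)

p <- ggplot(people, aes(x = Age, y = Height, color = Sex)) +  # aesthetics: x, y, color
  geom_point() +                                              # glyph: one point per case
  scale_color_manual(values = c(F = "purple", M = "orange"))  # scale: Sex level -> color
p  # the axes and the color legend are the guides, drawn automatically
```

Changing the values in `scale_color_manual()` changes the scale, and the legend (the guide) updates to match.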

Example

Consider the following graphic:

What are the glyphs/geoms in this graph?

# geom_point, geom_smooth, geom_boxplot

What are the aesthetics of each glyph?

# Points - x, y, color
# Lines - x, y, color
# Boxplot - x, y, group

Which variables are being mapped to each aesthetic?

# x - Education
# y - Wage
# color - Sex
# group - Education

Are the variables qualitative or quantitative?

# Education - quantitative
# Wage - quantitative
# Sex - qualitative

What are the guides on this graph?

# labels and tick marks on the x and y axes showing the scales for education and wage
# key (legend) on the side showing which color belongs to which sex

Reconstruct this graphic using ggplot and the CPS85 data table. (Hint: the font size of the plot title is 20)

CPS85 %>%
  ggplot(aes(x = educ, y = wage)) +
  geom_boxplot(aes(group = educ)) +
  geom_point(aes(color = sex)) +
  geom_smooth(method = "lm", aes(color = sex)) +
  ylim(0, 15) +
  labs(title = "Wage vs. Education in CPS85") +
  theme(plot.title = element_text(size = 20))

Jerome Rufin

number 6: part 1 and part 2d

Consider this plot made with geom_boxplot() …

Do the glyphs represent individual cases or cases taken collectively (a.k.a. a “statistic”)?

Cases taken collectively: each box-and-whisker glyph summarises a whole group of cases through statistics such as the median (50th percentile) and the quartiles (25th and 75th percentiles).

What do the glyph-ready data look like? That is,
1. What are the variables?
2. Are they quantitative or categorical?
3. What are the levels for any categorical variable?
4. Is it likely that there are 20 or fewer cases?

Solution

Three variables

running time — quantitative
Age — categorical with levels (20,30], (30,40] and so on
sex — categorical with levels F and M
Each box-and-whisker glyph involves at least two cases (and typically many more than that). There are 12 box-and-whisker glyphs, so there are at least 24 cases, which means it is not likely that there are 20 or fewer.
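To make "cases taken collectively" concrete, here is a rough sketch of the summary table behind such a plot. The `runners` data frame is simulated and its six age bins are an assumption; only the variable types and the count of 12 glyphs come from the solution above.

```r
library(dplyr)

set.seed(1)
# Hypothetical stand-in for the running data: 6 age bins x 2 sexes = 12 groups
runners <- data.frame(
  time = rnorm(240, mean = 90, sd = 10),
  age  = rep(c("(20,30]", "(30,40]", "(40,50]", "(50,60]", "(60,70]", "(70,80]"),
             each = 40),
  sex  = rep(c("F", "M"), times = 120)
)

# One row per box-and-whisker glyph, each summarising many cases
box_stats <- runners %>%
  group_by(age, sex) %>%
  summarise(
    n      = n(),
    q25    = quantile(time, 0.25),
    median = median(time),
    q75    = quantile(time, 0.75),
    .groups = "drop"
  )
box_stats
```

The summary has 12 rows, one per glyph, while the data table it was computed from has many more cases.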

number 8: inner_join (confused about solutions)

You have been given a newspaper report of the political party affiliations of each candidate, like this:

DFL Betsy Hodges, Mark Andrew, Bob Fine, and Don Samuels

Independent Cam Winton

Libertarian Christopher Zimmerman and Christopher Clark.

Given this information in this form along with data in this format:

	Precinct	Ward	Candidate
7762	P-08	W-2	DON SAMUELS
78920	P-08	W-7	JACKIE CHERRYHOMES
25409	P-03	W-9	BETSY HODGES
31526	P-04	W-11	CAM WINTON

Describe in words how you would go about calculating the vote for each party in each precinct.

Solution

First: Create a data table translating between candidate and party

Candidate	Party
BETSY HODGES	DFL
CAM WINTON	Independent

# Spellings must match the Candidate column of the ballot data exactly
df <- data.frame(
  Candidate = c("BETSY HODGES", "MARK ANDREW", "BOB FINE", "DON SAMUELS",
                "CAM WINTON", "CHRISTOPHER ZIMMERMAN", "CHRISTOPHER CLARK"),
  Party = c("DFL", "DFL", "DFL", "DFL", "I", "L", "L")
)
df

Candidate	Party
BETSY HODGES	DFL
MARK ANDREW	DFL
BOB FINE	DFL
DON SAMUELS	DFL
CAM WINTON	I
CHRISTOPHER ZIMMERMAN	L
CHRISTOPHER CLARK	L

Second: Join this with the ballot data

ballet <- Minneapolis2013 %>%
  select(Precinct, Ward, First) %>%
  rename(Candidate = First) %>% 
  filter(Candidate != "undervote")
head(ballet)

Precinct	Ward	Candidate
P-10	W-7	BETSY HODGES
P-06	W-10	BOB FINE
P-09	W-10	KURTIS W. HANNA
P-05	W-13	BETSY HODGES
P-01	W-5	DON SAMUELS
P-05	W-1	BETSY HODGES

merged_df <- ballet %>% inner_join(df)

## Joining by: "Candidate"

## Warning in inner_join_impl(x, y, by$x, by$y): joining character vector and
## factor, coercing into character vector

head(merged_df)

Precinct	Ward	Candidate	Party
P-10	W-7	BETSY HODGES	DFL
P-06	W-10	BOB FINE	DFL
P-05	W-13	BETSY HODGES	DFL
P-01	W-5	DON SAMUELS	DFL
P-05	W-1	BETSY HODGES	DFL
P-01	W-12	BETSY HODGES	DFL
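inner_join() keeps only the rows whose Candidate appears in both tables, which is why KURTIS W. HANNA no longer shows up in merged_df. A minimal sketch on two toy tables (the names `ballots` and `parties` are made up here), with anti_join() used to list the ballots that found no match:

```r
library(dplyr)

# Toy versions of the two tables; rows taken from the printed examples
ballots <- data.frame(
  Precinct  = c("P-10", "P-06", "P-09", "P-05"),
  Candidate = c("BETSY HODGES", "BOB FINE", "KURTIS W. HANNA", "BETSY HODGES"),
  stringsAsFactors = FALSE
)
parties <- data.frame(
  Candidate = c("BETSY HODGES", "BOB FINE"),
  Party     = c("DFL", "DFL"),
  stringsAsFactors = FALSE
)

# Rows with a party: ballots for candidates that appear in both tables
matched <- inner_join(ballots, parties, by = "Candidate")

# Rows that inner_join dropped: ballots for candidates missing from parties
unmatched <- anti_join(ballots, parties, by = "Candidate")

matched
unmatched
```

Checking the unmatched rows is a cheap way to catch a misspelled name in the lookup table, since a misspelling drops that candidate's ballots without any error.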

Third: group by precinct and party, summarising with a count of the number of ballots in each group.

merged_df %>% 
  group_by(Precinct, Party) %>%
  summarize(count=n()) %>%
  head()

Precinct	Party	count
P-01	DFL	3790
P-01	L	27
P-01C	DFL	220
P-02	DFL	4943
P-02	L	24
P-02D	DFL	407

number 9

The data table BodyTypes gives the responses of 24,117 women and 35,829 men who are members of OkCupid to a question about their “body type”. The numbers are the counts of people listing that body_type.

body_type	f	m
average	5620	9032
fit	4431	8280
curvy	3811	113
(none given)	2703	2593
thin	2469	2242
athletic	2309	9510
full figured	870	139
a little extra	821	1808
skinny	601	1176
overweight	145	299
jacked	129	292
rather not say	106	92
used up	102	253

Write a wrangling statement to find the proportion of women with each body type. Call this variable fprop. Similarly, calculate mprop, the proportion of men with each body type, and ratio, the ratio of the women’s proportion to the men’s. Finally, create a data table arranged in order from most unbalanced to most balanced. (Hint for the last step: pmax() will compare two variables, case by case, returning the maximum quantity of the two for each case.)

Solution

n_women <- 24117
n_men <- 35829

BodyTypes %>%
  mutate(fprop = f / n_women, mprop = m / n_men, ratio = fprop / mprop) %>%
  arrange(desc(pmax(fprop / mprop, mprop / fprop)))

body_type	f	m	fprop	mprop	ratio
curvy	3811	113	0.1580213	0.0031539	50.1039435
full figured	870	139	0.0360741	0.0038795	9.2985634
athletic	2309	9510	0.0957416	0.2654274	0.3607072
rather not say	106	92	0.0043952	0.0025678	1.7117071
used up	102	253	0.0042294	0.0070613	0.5989507
thin	2469	2242	0.1023759	0.0625750	1.6360512
(none given)	2703	2593	0.1120786	0.0723715	1.5486559
jacked	129	292	0.0053489	0.0081498	0.6563240
a little extra	821	1808	0.0340424	0.0504619	0.6746152
overweight	145	299	0.0060124	0.0083452	0.7204573
skinny	601	1176	0.0249202	0.0328226	0.7592391
fit	4431	8280	0.1837293	0.2310977	0.7950287
average	5620	9032	0.2330306	0.2520863	0.9244082
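The pmax() step is what makes a ratio of 0.36 count as more unbalanced than a ratio of 1.7. A small sketch using three of the ratios from the table above:

```r
# Three ratios copied from the output table
ratio <- c(curvy = 50.1039435, athletic = 0.3607072, average = 0.9244082)

# Compare each ratio with its reciprocal, case by case, and keep the larger,
# so imbalance in either direction is measured on the same scale
imbalance <- pmax(ratio, 1 / ratio)
imbalance

# Position of each body type when ordered from most to least unbalanced
order(imbalance, decreasing = TRUE)
```

For athletic, 1 / 0.3607 is about 2.77, so it ranks above every body type whose ratio lies between 1 / 2.77 and 2.77.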

number 14: does summarise(Avg = mean(Value)) work?

Suppose the following tables are loaded on your computer. You can call them TableA and TableB, respectively.

Year	Algeria	Brazil	Columbia
2000	7	12	16
2001	9	14	18

Country	Abbreviation
Algeria	DZ
Brazil	BR
Columbia	CO

Write code to produce the following table.

Abbreviation	Avg
DZ	8
BR	13
CO	17

Solution

step 1

TableA %>%
  gather(key = Country, value = Value, -Year)

Year	Country	Value
2000	Algeria	7
2001	Algeria	9
2000	Brazil	12
2001	Brazil	14
2000	Columbia	16
2001	Columbia	18

step 2

TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value))

Country	Avg
Algeria	8
Brazil	13
Columbia	17

step 3

TableC <- TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = c("Country"= "Country"))

TableC

Country	Avg	Abbreviation
Algeria	8	DZ
Brazil	13	BR
Columbia	17	CO

step 4

TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = c("Country"= "Country")) %>%
  select(Abbreviation, Avg)

## Warning in left_join_impl(x, y, by$x, by$y): joining factor and character
## vector, coercing into character vector

Abbreviation	Avg
DZ	8
BR	13
CO	17
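For a version that can be run from scratch, here is a sketch that builds the two tables by hand and uses pivot_longer(), which tidyr now recommends in place of gather(). The swap to pivot_longer() is a substitution made here, not part of the review-sheet solution.

```r
library(dplyr)
library(tidyr)

# The two tables from the problem statement, typed in by hand
TableA <- data.frame(
  Year     = c(2000, 2001),
  Algeria  = c(7, 9),
  Brazil   = c(12, 14),
  Columbia = c(16, 18)
)
TableB <- data.frame(
  Country      = c("Algeria", "Brazil", "Columbia"),
  Abbreviation = c("DZ", "BR", "CO"),
  stringsAsFactors = FALSE
)

result <- TableA %>%
  # wide -> narrow: one row per Year and Country
  pivot_longer(cols = -Year, names_to = "Country", values_to = "Value") %>%
  group_by(Country) %>%
  summarise(Avg = mean(Value)) %>%
  left_join(TableB, by = "Country") %>%
  select(Abbreviation, Avg)
result
```

So yes, summarise(Avg = mean(Value)) works once the data are in narrow form and grouped by Country.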

Lu Zhang

please go over 7, 11, 13 and problem 16 part 3

problem 7

What is your question?

problem 11

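Suppose \(x_0 = 1\) and \(x_1 = 2\) and \[ x_j = x_{j-1} + 2/x_{j-1} \text{ for } j = 2, 3, \ldots \] (MISPRINT HERE: the review sheet starts the recurrence at \(j = 1\), but \(x_1\) is already given, so it should start at \(j = 2\).)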

Write a function testLoop which takes the single argument n and returns the first \(n - 1\) (assume n is larger than 3) values of the sequence \(\{x_j\}_{j \geq 0}\): that means the values of \(x_0, x_1, x_2, \ldots, x_{n-2}\).

Assume we have already built a dataframe below:

tmp <- as.data.frame(cbind(1:499,testLoop(500)))
colnames(tmp) <- c("n","result")

Try to use ggplot to plot the result like below:

Solution a)

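testLoop <- function(n) {
  # Holds x_0, ..., x_{n-2}; R indexes from 1, so xVec[j] stores x_{j-1}
  xVec <- rep(NA, n - 1)
  xVec[1] <- 1
  xVec[2] <- 2
  # Assumes n > 3, so that 3:(n - 1) counts upward
  for (j in 3:(n - 1)) {
    xVec[j] <- xVec[j - 1] + 2 / xVec[j - 1]
  }
  return(xVec)
}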

ggplot(data = tmp, aes(x = n, y = result)) +
  geom_point() +
  geom_smooth(se = FALSE)
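The solution relies on n being larger than 3, because 3:(n - 1) counts downward when n is 3 or less. One possible guarded variant is sketched below; the name testLoopSafe is made up for this note and is not part of the review sheet.

```r
# Hypothetical variant of testLoop that also behaves for small n
testLoopSafe <- function(n) {
  stopifnot(n >= 2)          # need at least one value to return
  len <- n - 1               # number of values: x_0, ..., x_{n-2}
  xVec <- numeric(len)
  xVec[1] <- 1               # x_0
  if (len >= 2) xVec[2] <- 2 # x_1
  if (len >= 3) {
    for (j in 3:len) {
      xVec[j] <- xVec[j - 1] + 2 / xVec[j - 1]
    }
  }
  xVec
}

testLoopSafe(5)  # x_0, x_1, x_2 = 2 + 2/2, x_3 = 3 + 2/3
testLoopSafe(3)  # just x_0 and x_1
```

The first few values are easy to check by hand: \(x_2 = 2 + 2/2 = 3\) and \(x_3 = 3 + 2/3\).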

problem 13

Write a function fib that takes a positive integer n and returns the nth Fibonacci number. (In R we start counting from 1, not 0.) The Fibonacci numbers are as follows: 0, 1, 1, 2, 3, 5, 8. The first two Fibonacci numbers are 0 and 1, respectively. Starting from the third one, each Fibonacci number is equal to the sum of the previous two. In other words, fib(3) = fib(2) + fib(1) and fib(4) = fib(3) + fib(2), etc.

Solution

fib <- function(n){
  if (n==1) {
    return(0)
  } else if (n==2) {
    return(1) 
  } else {
    count = 3
    prev1 = 1
    prev2 = 0
    while (count<=n) {
      temp = prev2
      prev2 = prev1
      prev1 = prev1 + temp
      count = count + 1
    }
    return(prev1)
  }
}
fib(1)

## [1] 0

fib(2)

## [1] 1

fib(3)

## [1] 1

fib(4)

## [1] 2

fib(5)

## [1] 3
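The same idea can be written with a for loop, and sapply() then checks all seven listed values in one call. This is a sketch of an alternative, not the official solution; the name fib_for is made up here.

```r
# Hypothetical for-loop variant of fib
fib_for <- function(n) {
  if (n == 1) return(0)
  a <- 0  # Fibonacci number two positions back
  b <- 1  # most recent Fibonacci number
  # n - 2 updates take us from the 2nd number to the nth
  for (i in seq_len(n - 2)) {
    tmp <- a + b
    a <- b
    b <- tmp
  }
  b
}

# Apply the function to each of 1, ..., 7 and collect the results
sapply(1:7, fib_for)
```

Using seq_len(n - 2) means the loop body is skipped entirely when n is 2, so no separate branch is needed for that case.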

problem 16 part c

What is your question here?

MALAYANDI (Andy) PALANIAPPAN

Could we please go over Q12 from the Review Sheet?

Sorry, this question isn’t good. Here is what I should have asked:

Suppose we have a dataframe called data like below:

transect	fish_abund
3	19
1	4
1	13
1	18
1	0
1	5
3	4
3	6
3	18

Write the code to get the result like below:

genre	Sum_fish_abund
Light	40
Strong	47

Solution

data %>%
  mutate(genre = ifelse(transect == 1, "Light", "Strong")) %>%
  group_by(genre) %>%
  summarise(Sum_fish_abund = sum(fish_abund))
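To try this out, the table can be typed in by hand. The sketch below rebuilds `data` from the values printed above and cross-checks the two sums with base R's tapply(); the cross-check is an addition for illustration.

```r
library(dplyr)

# The table from the problem statement
data <- data.frame(
  transect   = c(3, 1, 1, 1, 1, 1, 3, 3, 3),
  fish_abund = c(19, 4, 13, 18, 0, 5, 4, 6, 18)
)

result <- data %>%
  mutate(genre = ifelse(transect == 1, "Light", "Strong")) %>%
  group_by(genre) %>%
  summarise(Sum_fish_abund = sum(fish_abund))
result

# Cross-check: sum fish_abund within each transect directly
check <- tapply(data$fish_abund, data$transect, sum)
check
```

Transect 1 gives 4 + 13 + 18 + 0 + 5 = 40 and transect 3 gives 19 + 4 + 6 + 18 = 47, matching the target table.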

NOOSHA BRONTE RAZAVIAN

Please go over problem 11 part a and problem 13. Thank you!

11a

Yes, misprint see above

13

Happy to go over in detail

TAE WOOK HA

Can we go over #9 why you used n() for finding proportion of female?

Correct, misprint. Thanks! See solution above.

Sid Masih

Can we just do a quick review of vector expressions vs for loops and when to use for vs vector? vector expressions are kind of confusing. Thanks!

(do you mean vectorized calculation?)

Which do you think is better?

for loop:

x <- 1:5
y <- 6:10
z <- c()
for (i in c(1:5)) {
  z <- c(z, x[i] + y[i])
}
z

## [1]  7  9 11 13 15

vectorized calculation

x <- 1:5
y <- 6:10
z <- x + y
z

## [1]  7  9 11 13 15
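A rough rule of thumb: when the same operation is applied to every element independently, the vectorized form is shorter and usually faster; a loop is more natural when each value depends on the one before it, as in problem 11. The sketch below shows both situations, with the loop results preallocated rather than grown with c().

```r
x <- 1:5
y <- 6:10

# Loop version with a preallocated result vector
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: `+` already works element by element
z_vec <- x + y
all(z_loop == z_vec)

# Each running total depends on the previous one, so a loop fits naturally,
# although this particular recurrence also has a built-in: cumsum()
running <- numeric(length(x))
running[1] <- x[1]
for (i in 2:length(x)) {
  running[i] <- running[i - 1] + x[i]
}
all(running == cumsum(x))
```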

Also see number 17 and 22

RONG HUANG

problems 4(2),4(3)

I feel that for questions 7 and 8 we are not provided enough information. How similar will the exam questions be to this problem set?

YISHAN HAN

Can we go over question 15, 16b, and 22(1)?

ROWENA JIEYI XIA

Can we please go over problems 9, 11, 12, 13, 16(2), 17(part 1)?

And for problem 20(c): can we use sapply instead?
