Source file ⇒ midterm_soln.Rmd

Announcement

Here are review problems for Midterm 1 in class on Wednesday March 16. This material covers the first 14 chapters of the book (excluding chapter 13) and data camp courses Introduction to R, Intermediate R (chapters 1-4), Data Visualization with ggplot2 (1) (chapters 1-4).

If you can do these problems you should be fine on the exam. If you want more practice problems, I would redo the i-clicker questions on b-courses next.

I suggest you write out your solutions long hand with pencil and paper since this is what you will be doing on the exam.

These questions are due for extra credit Saturday night March 12 at 10pm. For extra credit upload a picture of your solutions to b-courses. It doesn’t have to be a great picture. You will be graded entirely for effort. An honest attempt will get 100%. I will give you solutions Saturday night at 10pm.

We will review as a class on Monday March 14. To make an efficient review, I will ask you to post which review questions you are confused about on a discussion board on b-courses.

You are allowed a single-sided 8.5x11in cheat sheet on the exam. The cheat sheet can only include syntax for R commands from the following allowable resources:

Anything you get from the lecture notes or textbook or data camp
Anything you get from the R code book (for example ?left_join())
ggplot2 help.
ggplot2 cheat sheet

Good luck!!

1.

This wrangling sequence on BabyNames will produce a short table as output.

BabyNames %>%
  group_by(name) %>%
  summarise(tot = sum(count)) %>%
  mutate(rank = rank(desc(tot))) %>%
  filter(name == "Fernando")

Solution

The first two statements compute the total count of each name, so each case in the result is a name. The third statement adds a variable rank which contains the ranked place among the names from most popular to least popular. Last, the single case for “Fernando” is extracted. In effect, the command finds the total count and popularity rank for the name “Fernando”.
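To see the sequence at work, here is a minimal sketch on a made-up miniature table. ToyNames and its counts are invented for illustration and are not the real BabyNames data; only the pipeline itself comes from the problem.

```r
library(dplyr)

# Hypothetical stand-in for BabyNames (invented counts)
ToyNames <- data.frame(
  name  = c("Ana", "Ana", "Fernando", "Fernando", "Zed"),
  sex   = c("F", "F", "M", "M", "M"),
  count = c(50, 70, 30, 40, 5),
  year  = c(2000, 2001, 2000, 2001, 2001)
)

result <- ToyNames %>%
  group_by(name) %>%
  summarise(tot = sum(count)) %>%      # one row per name: Ana 120, Fernando 70, Zed 5
  mutate(rank = rank(desc(tot))) %>%   # rank 1 = largest total
  filter(name == "Fernando")           # keep the single case of interest

result
```

In this toy table Fernando has a total of 70 and is the second most popular of the three names.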

2.

The WorkoutLog data table lists the duration and activity type of each day’s workout for the members of a rowing crew team. The variables are:

number the team member’s jersey number
activity giving the kind of activity, e.g. weight, sprints, etc.
date when the activity occurred
duration of the activity in minutes

Consider the following data wrangling sequence:

WorkoutLog %>%
  group_by(jersey) %>%
  summarise(tot = sum(duration, na.rm = TRUE)) %>%
  filter(activity == "sprints")

The sequence generates an error: Error: unknown column ‘jersey’. What went wrong? How can it be fixed?

Solution

There is no variable named jersey. To fix it, change jersey to number.
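Once jersey is replaced by number, a second problem would likely surface: summarise() keeps only number and tot, so a filter on activity placed after it has nothing to work with. A minimal sketch of one way to repair both issues follows, assuming the intent is to total sprint minutes per team member. The miniature WorkoutLog here is invented for illustration.

```r
library(dplyr)

# Hypothetical miniature WorkoutLog (values invented for illustration)
WorkoutLog <- data.frame(
  number   = c(7, 7, 12, 12),
  activity = c("sprints", "weight", "sprints", "sprints"),
  date     = as.Date(c("2016-03-01", "2016-03-02", "2016-03-01", "2016-03-03")),
  duration = c(20, 45, NA, 30)
)

sprint_totals <- WorkoutLog %>%
  filter(activity == "sprints") %>%             # filter while activity still exists
  group_by(number) %>%                          # number, not jersey
  summarise(tot = sum(duration, na.rm = TRUE))  # NA durations are ignored

sprint_totals
```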

3.

A wrangling sequence in this form,

InputData %>%
  group_by(???) %>%
  summarise(high = max(income), low = min(income))

produces an output that starts this way:

location	sex	month	low	high
DC	F	Aug	mid	poverty
MN	M	Nov	high	poverty
CA	M	Oct	high	low

What variable or variables are in the place marked by ??? ?
Describe the income variable in as much detail as the information provided allows.

Solution

The grouping variables are location, sex, and month. The output of a summarise() operation always contains the grouping variables and any variables created within the summarise(). income is a categorical variable with levels high, low, mid, poverty. When operating on a categorical variable, min() and max() examine the levels alphabetically, returning respectively the first and last in alphabetical order.
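The alphabetical behaviour of min() and max() can be checked directly. This sketch assumes income is stored as a character vector; in base R a plain (unordered) factor would raise an error in min() and max() rather than compare levels. The values are invented.

```r
# Hypothetical income values stored as character strings
income <- c("mid", "poverty", "high", "low", "poverty")

lowest  <- min(income)   # first in alphabetical order
highest <- max(income)   # last in alphabetical order

lowest
highest
```

Here min() gives "high" and max() gives "poverty", which is why the low and high columns in the output look mislabeled at first glance.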

4.

Recall the BabyNames data that tells how many babies of each sex were given each name, like this:

name	sex	count	year
Francis	F	654	1920
Francis	M	429	2012

In the entire data table, there are almost 2 million rows of this sort covering 134 years from 1880 onward.

For each of the following, separately, write the wrangling statements to create a new data table

containing only the years on or after 2000.
containing only those year/names with at least 20 babies for boys and girls combined.
Extract out only those names which appear in at least 120 years out of the 134 year time span.

Solution

Part 1.

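BabyNames %>%
  filter(year >= 2000) %>%
  head(2)

The sample output below was generated with the threshold set to 2012 rather than 2000, so it shows rows from 2012:

name	sex	count	year
Sophia	F	22245	2012
Emma	F	20871	2012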

Part 2.

BabyNames %>%
  group_by(name, year) %>%
  summarise(babies = sum(count)) %>%
  filter(babies >= 20) %>%
  head(2)

name	year	babies
Aadan	2008	22
Aadan	2009	23

Part 3.

BabyNames %>%
  group_by(name) %>%
  summarise(year_count = n_distinct(year)) %>%
  filter(year_count >= 120) %>% 
  head(2)

name	year_count
Aaron	134
Abbie	134

5

Consider this plot made with geom_density()

What do the glyph-ready data look like? That is,

What are the variables?
Are they quantitative or categorical?
What are the levels for any categorical variable?
Is it likely that there are 10 or fewer cases?

Solution

Just two variables

running time — quantitative
sex — categorical with levels F and M
There are many cases.

Here are the ggplot commands to make the graph, in case you are curious.

ggplot(mosaicData::TenMileRace, aes(x = net)) +
  geom_density(alpha=0.5, color=NA, aes(fill = sex)) +
  xlab("Running time (s)") +
  scale_fill_grey() +
  theme(panel.background = element_rect(fill='white', colour='black'))

6

Consider this plot made with geom_boxplot() …

Do the glyphs represent individual cases or cases taken collectively (a.k.a. a “statistic”)?
What do the glyph-ready data look like? That is,
1. What are the variables?
2. Are they quantitative or categorical?
3. What are the levels for any categorical variable?
4. Is it likely that there are 20 or fewer cases?

Solution

Three variables

running time — quantitative
Age — categorical with levels (20,30], (30,40] and so on
sex — categorical with levels F and M
Each box-and-whisker glyph involves at least two cases (and typically many more than that). There are 12 box-and-whisker glyphs, so there are at least 24 cases.

7

The data table gives the individual first-choice ballots for the Minneapolis 2013 mayoral election.

Precinct	Ward	First
P-10	W-7	BETSY HODGES
P-06	W-10	BOB FINE
P-09	W-10	KURTIS W. HANNA
P-05	W-13	BETSY HODGES

The long, capitalized names for candidates are driving you crazy. For all 80101 cases, you want to shorten them to the first name and last initial, e.g. “Betsy H”, “Bob F”, “Kurtis H”. You can do this using the inner_join() verb and an auxiliary data table like this:

variable1	variable2
BETSY HODGES	.
CAM WINTON	.
DON SAMUELS	.
	.
	.

Fill in values for variable2 in a way appropriate for the task.
What are the appropriate arguments to inner_join()?

Solution

The auxiliary table will translate between the names as given in the ballot data and the names as you would like to see them.

variable1	variable2
BETSY HODGES	Betsy H
CAM WINTON	Cam W
DON SAMUELS	Don S

The names variable1 and variable2 might be made more descriptive, e.g. full_name and short_name.

inner_join() takes the ballot data table and the auxiliary data table as its first two inputs. You’ll also need to specify which variables are used to match cases: by = c("First" = "variable1"), which matches First in the ballot data against the full names in variable1. Note: In a statement like Ballots %>% inner_join(Aux, by = c("First" = "variable1")), Ballots is the first argument to inner_join() and Aux is the second argument.
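A minimal sketch of the join on a few rows, matching First in the ballots against the full-name column variable1 of the auxiliary table. The table names Ballots and Aux follow the note above; the three ballot rows are a hand-typed subset for illustration, not the full data.

```r
library(dplyr)

# A few ballots, typed in by hand for illustration
Ballots <- data.frame(
  Precinct = c("P-10", "P-06", "P-05"),
  Ward     = c("W-7", "W-10", "W-13"),
  First    = c("BETSY HODGES", "BOB FINE", "BETSY HODGES")
)

# Auxiliary translation table: full name -> short name
Aux <- data.frame(
  variable1 = c("BETSY HODGES", "BOB FINE"),
  variable2 = c("Betsy H", "Bob F")
)

# The named vector says: First (left table) matches variable1 (right table)
Short <- Ballots %>%
  inner_join(Aux, by = c("First" = "variable1"))

Short
```

Each ballot picks up its short name in variable2. Because this is an inner join, any ballot whose candidate is missing from Aux would be dropped, so the auxiliary table needs one row per candidate.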

8

You have been given a newspaper report of the political party affiliations of each candidate, like this:

DFL candidates Betsy Hodges, Mark Andrew, Bob Fine, and Don Samuels faced Independent Cam Winton and Libertarian candidates Christopher Zimmerman and Christopher Clark.

Given this information in this form along with data in this format:

	Precinct	Ward	Candidate
7762	P-08	W-2	DON SAMUELS
78920	P-08	W-7	JACKIE CHERRYHOMES
25409	P-03	W-9	BETSY HODGES
31526	P-04	W-11	CAM WINTON

Describe in words how you would go about calculating the vote for each party in each precinct.

Solution

First: Create a data table translating between candidate and party

Candidate	Party
BETSY HODGES	DFL
CAM WINTON	Independent

Second: Join this with the ballot data
Third: group by precinct and party, summarising with a count of the number of ballots in each group.
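The three steps can be sketched in code as follows. The question only asks for a description in words, so this is an illustration rather than a required answer; the table names Ballots, Parties and PartyVotes and the handful of rows are assumptions made for the example.

```r
library(dplyr)

# A few hand-typed ballots (illustrative only)
Ballots <- data.frame(
  Precinct  = c("P-08", "P-08", "P-03", "P-03"),
  Ward      = c("W-2", "W-2", "W-9", "W-9"),
  Candidate = c("DON SAMUELS", "BETSY HODGES", "BETSY HODGES", "CAM WINTON")
)

# Step 1: translation table built from the newspaper report
Parties <- data.frame(
  Candidate = c("BETSY HODGES", "DON SAMUELS", "CAM WINTON"),
  Party     = c("DFL", "DFL", "Independent")
)

PartyVotes <- Ballots %>%
  inner_join(Parties, by = "Candidate") %>%   # Step 2: attach the party
  group_by(Precinct, Party) %>%               # Step 3: one group per precinct/party
  summarise(votes = n(), .groups = "drop")    # count ballots in each group

PartyVotes
```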

9

HAS MISPRINT

The data table BodyTypes gives the responses of 24,117 women and 35,829 men who are members of OkCupid to a question about their “body type”. The numbers are the counts of people listing that body_type.

body_type	f	m
average	5620	9032
fit	4431	8280
curvy	3811	113
(none given)	2703	2593
thin	2469	2242
athletic	2309	9510
full figured	870	139
a little extra	821	1808
skinny	601	1176
overweight	145	299
jacked	129	292
rather not say	106	92
used up	102	253

Write a wrangling statement to find the proportion of women with each body type. Call this variable fprop. Similarly, calculate mprop, the proportion of men with each body type, and ratio, the ratio of the women’s proportion to the men’s. Finally, create a data table arranged in order from most unbalanced to most balanced. (Hint for the last step: pmax() will compare two variables, case by case, returning the maximum quantity of the two for each case.)

Solution

n_women <- 24117
n_men <- 35829

BodyTypes %>%
  mutate(fprop = f / n_women, mprop = m / n_men, ratio = fprop / mprop) %>%
  arrange(desc(pmax(fprop / mprop, mprop / fprop)))

body_type	f	m	fprop	mprop	ratio
curvy	3811	113	0.1580213	0.0031539	50.1039435
full figured	870	139	0.0360741	0.0038795	9.2985634
athletic	2309	9510	0.0957416	0.2654274	0.3607072
rather not say	106	92	0.0043952	0.0025678	1.7117071
used up	102	253	0.0042294	0.0070613	0.5989507
thin	2469	2242	0.1023759	0.0625750	1.6360512
(none given)	2703	2593	0.1120786	0.0723715	1.5486559
jacked	129	292	0.0053489	0.0081498	0.6563240
a little extra	821	1808	0.0340424	0.0504619	0.6746152
overweight	145	299	0.0060124	0.0083452	0.7204573
skinny	601	1176	0.0249202	0.0328226	0.7592391
fit	4431	8280	0.1837293	0.2310977	0.7950287
average	5620	9032	0.2330306	0.2520863	0.9244082

Another way to accomplish the last step, if you’re happy using logarithms, is arrange(desc(abs(log(fprop/mprop)))).
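The two orderings agree because pmax(r, 1/r) equals exp(abs(log(r))) for any positive ratio r, and exp() is increasing. A small base R check on a few rounded ratios from the table above:

```r
# A few ratios taken (rounded) from the table above
ratio <- c(50.10, 0.36, 1.71, 0.92)

# Ordering by "how far from 1 in either direction", computed two ways
by_pmax <- order(pmax(ratio, 1 / ratio), decreasing = TRUE)
by_log  <- order(abs(log(ratio)), decreasing = TRUE)

same_order <- identical(by_pmax, by_log)
same_order
```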

10

The following was written to find the top 3 candidates in the Minneapolis 2013 mayoral election. There are a few mistakes or omissions in the statement. Circle them and indicate how to fix them.

Result <-
  Minneapolis2013 %>%
  group_by(First)
  summarise(total_votes == sum(First)) %>%
  arrange(total_votes) %>%
  head(3)

Solution

Line 3, group_by(First) should be group_by(First) %>% (missing %>% in the original)
Line 4, total_votes == sum(First) should be total_votes = n() (two mistakes)
Line 5, arrange(total_votes) should be arrange(desc(total_votes))
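Putting the three fixes together gives the statement below. The tiny Minneapolis2013 table is made up so the sketch can run on its own; the candidate labels A to D are placeholders, not real candidates.

```r
library(dplyr)

# Hypothetical ballots: each row is one first-choice vote
Minneapolis2013 <- data.frame(
  First = c("A", "B", "A", "C", "A", "B", "D", "C", "C", "C")
)

Result <-
  Minneapolis2013 %>%
  group_by(First) %>%                  # fix 1: pipe into the next verb
  summarise(total_votes = n()) %>%     # fix 2: = instead of ==, n() instead of sum(First)
  arrange(desc(total_votes)) %>%       # fix 3: largest totals first
  head(3)

Result
```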

11

Suppose \(x_0 = 1\) and \(x_1 = 2\) and \[ x_j = x_{j-1} + \frac{2}{x_{j-1}} \text{ for } j = 2, 3, \ldots \]

Write a function testLoop which takes the single argument n and returns the first n − 1 (assume n is larger than 3) values of the sequence \(\{x_j\}_{j \geq 0}\): that is, the values \(x_0, x_1, x_2, \ldots, x_{n-2}\).

Assume we have already built a dataframe below:

tmp <- as.data.frame(cbind(1:499,testLoop(500)))
colnames(tmp) <- c("n","result")

Try to use ggplot to plot the result like below:

Solution

testLoop <- function(n) {
  xVec <- rep(NA, n - 1)
  xVec[1] <- 1
  xVec[2] <- 2
  for (j in 3:(n - 1)) {
    xVec[j] <- xVec[j - 1] + 2 / xVec[j - 1]
  }
  return(xVec)
}

ggplot(data = tmp,aes(x = n, y = result))+geom_point()+geom_smooth(se = FALSE)

12

Suppose we have a dataframe called data like below:

site    transect    fish_abund    dN15_SG
B       3           19            15.22
A       1           4             10.35
A       2           12            9.06
B       1           13            14.72
A       2           5             8.99
B       1           18            14.17
B       2           25            15.66
C       1           0             11.42
A       1           5             11.35
C       3           4             10.83
C       2           4             9.97
A       3           6             8.58
C       2           8             11.74
B       3           18            16.53

Write the code to get the result like below:

genre	Sum_dN15_SG
Light	40
Strong	47

Solution

data <- read.csv("/Users/Adam/Desktop/stat133lectures_hw_lab/exam/practice/dat.csv")
data %>%
  filter(transect %in% c("1","3")) %>%
  mutate(genre = ifelse(transect == "1", "Light","Strong")) %>%
  group_by(genre) %>%
  summarise(Sum_dN15_SG = sum(fish_abund))

13

Write a function fib that takes a positive integer n and returns the nth Fibonacci number. (In R we start counting from 1, not 0.) The Fibonacci numbers are as follows: 0, 1, 1, 2, 3, 5, 8. The first two Fibonacci numbers are 0 and 1, respectively. Starting from the third one, each Fibonacci number is equal to the sum of the previous two. In other words, fib(3) = fib(2) + fib(1) and fib(4) = fib(3) + fib(2), etc.

Solution

fib <- function(n){
  if (n==1) {
    return(0)
  } else if (n==2) {
    return(1) 
  } else {
    count = 3
    prev1 = 1
    prev2 = 0
    while (count<=n) {
      temp = prev2
      prev2 = prev1
      prev1 = prev1 + temp
      count = count + 1
    }
    return(prev1)
  }
}
fib(1)

## [1] 0

fib(2)

## [1] 1

fib(3)

## [1] 1

fib(4)

## [1] 2

fib(5)

## [1] 3

14

Suppose the following tables are loaded on your computer. You can call them TableA and TableB, respectively.

Year	Algeria	Brazil	Columbia
2000	7	12	16
2001	9	14	18

Country	Abbreviation
Algeria	DZ
Brazil	BR
Columbia	CO

Write code to produce the following table.

Abbreviation	Avg
DZ	8
BR	13
CO	17

Solution

TableC <- TableA %>%
  gather(key = Country, value = Value, -Year) %>%
  group_by(Country) %>%
  summarize(Avg = mean(Value)) %>%
  left_join(TableB, by = c("Country"= "Country")) %>%
  select(Abbreviation, Avg)

TableC
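In current versions of tidyr, gather() is superseded by pivot_longer(). The sketch below is an equivalent pipeline under that newer interface, not the solution expected on the exam; TableA and TableB are rebuilt by hand from the tables shown in the question so the code runs on its own.

```r
library(dplyr)
library(tidyr)

# The two tables from the question, typed in by hand
TableA <- data.frame(
  Year     = c(2000, 2001),
  Algeria  = c(7, 9),
  Brazil   = c(12, 14),
  Columbia = c(16, 18)
)
TableB <- data.frame(
  Country      = c("Algeria", "Brazil", "Columbia"),
  Abbreviation = c("DZ", "BR", "CO")
)

TableC <- TableA %>%
  pivot_longer(cols = -Year, names_to = "Country", values_to = "Value") %>%  # wide -> narrow
  group_by(Country) %>%
  summarise(Avg = mean(Value)) %>%          # average over the two years
  left_join(TableB, by = "Country") %>%     # attach the abbreviation
  select(Abbreviation, Avg)

TableC
```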

15

Using a while and/or a for loop and a conditional expression, convert Mat, a matrix of zeros, to an identity matrix.

Mat

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    0
## [2,]    0    0    0    0    0
## [3,]    0    0    0    0    0
## [4,]    0    0    0    0    0
## [5,]    0    0    0    0    0

After your code runs Mat should look like this.

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1

Solution

Mat = matrix(rep(0, times = 25), nrow = 5)
Mat

for (i in 1:5) {
  for (j in 1:5) {
    if (i==j){
      Mat[i, j] =1
    }
  }
}
Mat

16

Given the vector ‘asc = 1:50’, answer the following questions:
1. Use ‘asc’ and a for loop to generate a new vector called ‘desc’ which is the reverse of ‘asc’.
2. Use ‘asc’, a for loop and a while loop to update ‘asc’ so that it contains the values c(1:25,25:1).
3. Use ‘asc’, a for loop and an if/else statement to update ‘asc’ to c(1,1,2,2,3,3,…,24,24,25,25).

Solution

asc = 1:50
#1.
desc=numeric(50)
for (i in 1:length(asc)){
  desc[i] = asc[51-i]
}
#2.
for (i in 1:length(asc)){
  while (asc[i]>25){
    asc[i] = 51-asc[i]
  }
}
#3. (reset asc first, since part 2 overwrote it)
asc = 1:50
for (i in 1:length(asc)){
  if(asc[i] %% 2 == 0){
    asc[i] = asc[i]/2
  }else if(asc[i]%%2 ==1){
    asc[i] = (asc[i]+1)/2
  }
}

17

Split the character string University of California into a vector of single characters and count the number of times i appears in it.

Solutions

split_chars <- function(n) {
  chars <- unlist(strsplit(n,""))
  return(chars)
}

char <- split_chars("University of California")
char

##  [1] "U" "n" "i" "v" "e" "r" "s" "i" "t" "y" " " "o" "f" " " "C" "a" "l"
## [18] "i" "f" "o" "r" "n" "i" "a"

count <- 0
for (i in 1:length(char)) {
  if (char[i] == "i") {
    count <- count + 1 
  }
}
count

## [1] 4
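The loop can also be replaced by a vectorized comparison, since char == "i" yields a logical vector whose TRUE values sum() counts. A short base R sketch (the names chars and n_i are chosen here for illustration):

```r
# Split into single characters, then count the matches in one step
chars <- strsplit("University of California", "")[[1]]
n_i <- sum(chars == "i")   # TRUE counts as 1, FALSE as 0
n_i
```

The comparison is case-sensitive, so an uppercase I would not be counted.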

18

Change the data from wide to narrow format using the gather function. Gather Ozone, Solar.R, Wind and Temp into one variable called type, and create another column called value to store their values.

# built-in dataset in R
head(airquality)

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Solutions

narrow_airquality <- airquality %>% gather(type, value, Ozone:Temp)
head(narrow_airquality)

##   Month Day  type value
## 1     5   1 Ozone    41
## 2     5   2 Ozone    36
## 3     5   3 Ozone    12
## 4     5   4 Ozone    18
## 5     5   5 Ozone    NA
## 6     5   6 Ozone    28

19

Suppose you have a data frame, data, as given below:

##   V1 V2    V3 V4
## 1  a  1 alpha 10
## 2  a  2  beta 20
## 3  b  1 gamma 30
## 4  b  2 alpha 40
## 5  c  1  beta 50
## 6  c  2 gamma 60

Assuming that the tidyr and dplyr libraries are already loaded, write down the output of the following code. The final result is enough for full credit, but partial credit will be given for writing out and labelling intermediate steps.

data %>%
  filter(V1 == "a") %>% # Step 1
  select(V2, V4) %>% # Step 2
  gather(key = Apple, value = Banana, V2, V4) %>% # Step 3
  mutate(Apple = Banana) # Step 4

Solution

##   Apple Banana
## 1     1      1
## 2     2      2
## 3    10     10
## 4    20     20
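For the intermediate steps, it can help to run the pipeline one verb at a time. The sketch below rebuilds the data frame by hand and stores each step under a name (step1 to step4, chosen here for illustration).

```r
library(dplyr)
library(tidyr)

# The data frame from the question, typed in by hand
data <- data.frame(
  V1 = c("a", "a", "b", "b", "c", "c"),
  V2 = c(1, 2, 1, 2, 1, 2),
  V3 = c("alpha", "beta", "gamma", "alpha", "beta", "gamma"),
  V4 = c(10, 20, 30, 40, 50, 60)
)

step1 <- data %>% filter(V1 == "a")   # keeps the two rows with V1 == "a"
step2 <- step1 %>% select(V2, V4)     # V2 = 1, 2 and V4 = 10, 20
step3 <- step2 %>% gather(key = Apple, value = Banana, V2, V4)
# step3: Apple holds the old column names, Banana holds the values
step4 <- step3 %>% mutate(Apple = Banana)   # Apple is overwritten by Banana

step3
step4
```

After step 3, Apple contains "V2", "V2", "V4", "V4"; step 4 replaces those labels with the numbers in Banana, which gives the two identical columns in the solution.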

20

Suppose you have a data frame, data, as given below.

(a) Write a function called fix_missing_99 that takes one argument: x, a numeric vector. The function should replace every component of x equal to -99 with NA.
(b) Write a loop that replaces every -99 in data with NA. For full credit, your code must use the function in part (a) and it should continue to work without modification if additional columns are added to the data frame.
(c) Write down an appropriate call from the apply family of functionals to perform the same task as in part (b).

##    a  b c   d   e f
## 1  1  6 1   5 -99 1
## 2 10  4 4 -99   9 3
## 3  7  9 5   4   1 4
## 4  2  9 3   8   6 8
## 5  1 10 5   9   8 6
## 6  6  2 1   3   8 5

Solution

Part (a)

fix_missing_99 <- function(x) {
  x[x == -99] <- NA
  return(x)
}

Part (b)

for (name in names(data)) {
  data[[name]] <- fix_missing_99(data[[name]])
}

Part (c)

data[] <- lapply(data, fix_missing_99)
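As a quick check that parts (b) and (c) do the same job, the sketch below runs both on a small data frame. The three-column data frame is a cut-down, hand-typed version of the one in the question, and the names looped and applied are chosen here for illustration.

```r
fix_missing_99 <- function(x) {
  x[x == -99] <- NA
  x
}

# Cut-down version of the data frame in the question
data <- data.frame(a = c(1, 10, 7), d = c(5, -99, 4), e = c(-99, 9, 1))

# Part (b): loop over column names, passing each column as a vector
looped <- data
for (name in names(looped)) {
  looped[[name]] <- fix_missing_99(looped[[name]])
}

# Part (c): lapply over columns; the [] keeps the data frame structure
applied <- data
applied[] <- lapply(applied, fix_missing_99)

all.equal(looped, applied)
```

Writing data[] on the left-hand side matters: without the brackets, data would be replaced by the plain list that lapply() returns.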

21

Given the plot below,

assuming ‘ggplot2’ is already loaded and the first 6 rows of the ‘diamonds’ dataset are:

## Source: local data frame [6 x 10]
## 
##   carat       cut  color clarity depth table price     x     y     z
##   (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1  0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good      J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48

Which ‘ggplot’ command would you use to generate this graph?

Solution

diamonds %>%
  ggplot(aes(x = carat, y = price)) +
  geom_point(aes(col = cut)) +
  geom_smooth(se = FALSE) +
  facet_grid(. ~ cut) +
  labs(title = "carat vs price") +
  theme(plot.title = element_text(size = 20))

22

temp <-  list(c(3,7,9,6,-1),c(6,9,12,13,5), c(4,8,3,-1,-3), c(1,4,7,2,-2),c(5,7,9,4,2))
temp

## [[1]]
## [1]  3  7  9  6 -1
## 
## [[2]]
## [1]  6  9 12 13  5
## 
## [[3]]
## [1]  4  8  3 -1 -3
## 
## [[4]]
## [1]  1  4  7  2 -2
## 
## [[5]]
## [1] 5 7 9 4 2

a) Create a function that returns all values below zero. Call the function belowZero.
b) Apply belowZero over temp using sapply(). Call the result freezingS.

Solution a)

belowZero <- function(vec){
    return(vec[vec<0]) 
}

Solution b)

freezingS <- temp %>% sapply(belowZero)
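One detail worth noticing: the vectors returned by belowZero have different lengths here, so sapply() cannot simplify them into a vector or matrix and hands back a list instead. A base R sketch that checks this, written without the pipe so it needs no packages:

```r
temp <- list(c(3, 7, 9, 6, -1), c(6, 9, 12, 13, 5), c(4, 8, 3, -1, -3),
             c(1, 4, 7, 2, -2), c(5, 7, 9, 4, 2))

belowZero <- function(vec) {
  vec[vec < 0]   # keep only the negative entries
}

freezingS <- sapply(temp, belowZero)

is.list(freezingS)   # the result stays a list because lengths differ
lengths(freezingS)   # number of below-zero values per element
```

Elements 2 and 5 contain no negative values, so their entries in freezingS are empty numeric vectors.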

midterm review solution

stat 133
