Here are review problems for Midterm 1 in class on Wednesday March 16. This material covers the first 14 chapters of the book (excluding chapter 13) and data camp courses Introduction to R, Intermediate R (chapters 1-4), Data Visualization with ggplot2 (1) (chapters 1-4).
If you can do these problems you should be fine on the exam. If you want more practice problems, I would redo the i-clicker questions on b-courses next.
I suggest you write out your solutions long hand with pencil and paper since this is what you will be doing on the exam.
These questions are due for extra credit Saturday night March 12 at 10pm. For extra credit upload a picture of your solutions to b-courses. It doesn’t have to be a great picture. You will be graded entirely for effort. An honest attempt will get 100%. I will give you solutions Saturday night at 10pm.
We will review as a class on Monday March 14. To make the review efficient, I will ask you to post which review questions you are confused about on a discussion board on b-courses.
You are allowed a single-sided 8.5x11in cheat sheet on the exam. The cheat sheet can only include syntax for R commands from the following allowable resources:
Anything you get from the lecture notes or textbook or data camp
Anything you get from the R code book (for example ?left_join())
Good luck!!
This wrangling sequence on BabyNames will produce a short table as output. Explain what each line of the code does.
BabyNames %>%
group_by(name) %>%
summarise(tot = sum(count)) %>%
mutate(rank = rank(desc(tot))) %>%
filter(name == "Fernando")
The WorkoutLog data table lists the duration and activity type of each day’s workout for the members of a rowing crew team. The variables are:
number: the team member’s jersey number
activity: the kind of activity, e.g. weights, sprints, etc.
date: when the activity occurred
duration: the duration of the activity in minutes
Consider the following data wrangling sequence:
WorkoutLog %>%
group_by(jersey) %>%
summarise(tot = sum(duration, na.rm = TRUE)) %>%
filter(activity == "sprints")
The sequence generates an error: Error: unknown column ‘jersey’. What went wrong? How can it be fixed?
A wrangling sequence in this form,
InputData %>%
group_by(???) %>%
summarise(high = max(income), low = min(income))
produces an output that starts this way:
location | sex | month | low | high |
---|---|---|---|---|
DC | F | Aug | mid | poverty |
MN | M | Nov | high | poverty |
CA | M | Oct | high | low |
What variable or variables belong in the place marked by ??? in the group_by() call?
Describe the income variable in as much detail as the information provided allows.
Recall the BabyNames data that tells how many babies of each sex were given each name, like this:
name | sex | count | year |
---|---|---|---|
Francis | F | 654 | 1920 |
Francis | M | 429 | 2012 |
In the entire data table, there are almost 2 million rows of this sort covering 134 years from 1880 onward.
For each of the following, separately, write the wrangling statements to create a new data table
Consider this plot made with geom_density()
What do the glyph-ready data look like? That is,
Consider this plot made with geom_boxplot()
…
The data table gives the individual first-choice ballots for the Minneapolis 2013 mayoral election.
Precinct | Ward | First |
---|---|---|
P-10 | W-7 | BETSY HODGES |
P-06 | W-10 | BOB FINE |
P-09 | W-10 | KURTIS W. HANNA |
P-05 | W-13 | BETSY HODGES |
The long, capitalized names for candidates are driving you crazy. You want to shorten them, for all 80101 cases, to the first name and last initial, e.g. “Betsy H”, “Bob F”, “Kurtis H”. You can do this using the inner_join() verb and an auxiliary data table like this:
variable1 | variable2 |
---|---|
BETSY HODGES | . |
CAM WINTON | . |
DON SAMUELS | . |
. | . |
. | . |
Variable2 in a way appropriate for the task.
inner_join()
You have been given a newspaper report of the political party affiliations of each candidate, like this:
DFL candidates Betsy Hodges, Mark Andrew, Bob Fine, and Don Samuels faced Independent Cam Winton and Libertarian candidates Christopher Zimmerman and Christopher Clark.
Given this information in this form along with data in this format:
  | Precinct | Ward | Candidate |
---|---|---|---|
7762 | P-08 | W-2 | DON SAMUELS |
78920 | P-08 | W-7 | JACKIE CHERRYHOMES |
25409 | P-03 | W-9 | BETSY HODGES |
31526 | P-04 | W-11 | CAM WINTON |
Describe in words how you would go about calculating the vote for each party in each precinct.
The data table BodyTypes gives the responses of 24,117 women and 35,829 men who are members of OkCupid to a question about their “body type”. The numbers are the counts of people listing that body_type.
body_type | f | m |
---|---|---|
average | 5620 | 9032 |
fit | 4431 | 8280 |
curvy | 3811 | 113 |
(none given) | 2703 | 2593 |
thin | 2469 | 2242 |
athletic | 2309 | 9510 |
full figured | 870 | 139 |
a little extra | 821 | 1808 |
skinny | 601 | 1176 |
overweight | 145 | 299 |
jacked | 129 | 292 |
rather not say | 106 | 92 |
used up | 102 | 253 |
Write a wrangling statement to find the proportion of women with each body type. Call this variable fprop. Similarly, calculate mprop, the proportion of men with each body type, and ratio, the ratio of the women’s proportion to the men’s. Finally, create a data table arranged in order from most unbalanced to most balanced. (Hint for the last step: pmax() will compare two variables, case by case, returning the maximum quantity of the two for each case.)
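To illustrate the hint, here is a small sketch of how pmax() behaves. The vectors a and b are made-up toy values, not taken from BodyTypes. Taking the larger of a ratio and its reciprocal gives a number that is at least 1 and grows as the two groups become more unbalanced, in either direction.

```r
# Toy values (hypothetical, not from BodyTypes)
a <- c(0.5, 2.0, 1.0)   # three example ratios
b <- 1 / a              # their reciprocals: 2.0, 0.5, 1.0

# pmax() compares a and b case by case and keeps the larger value
bigger <- pmax(a, b)
print(bigger)
```

Note that max(a, b) would instead return a single number, the largest value across both vectors.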
The following was written to find the top 3 candidates in the Minneapolis 2013 mayoral election. There are a few mistakes or omissions in the statement. Circle them and indicate how to fix them.
Result <-
Minneapolis2013 %>%
group_by(First)
summarise(total_votes == sum(First)) %>%
arrange(total_votes) %>%
head(3)
Suppose \(x_0 = 0\) and \(x_1 = 2\) and \[
x_j = x_{j-1} + 2 \text{ for } j = 1, 2, \ldots
\] Write a function testLoop which takes the single argument n and returns the first \(n - 1\) values (assume n is larger than 3) of the sequence \(\{x_j\}_{j \geq 0}\), that is, the values \(x_0, x_1, x_2, \ldots, x_{n-2}\).
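One possible way to write testLoop is sketched below, assuming a plain for loop is acceptable. Because R indexes from 1, position j + 1 of the vector holds \(x_j\). This is only a sketch, not the official solution.

```r
testLoop <- function(n) {
  x <- numeric(n - 1)       # room for x_0, ..., x_{n-2}
  x[1] <- 0                 # x_0 = 0 is stored in position 1
  for (j in 2:(n - 1)) {
    x[j] <- x[j - 1] + 2    # each term is the previous term plus 2
  }
  x
}

testLoop(5)   # x_0, x_1, x_2, x_3
```

The loop range 2:(n - 1) relies on n being larger than 3, as the problem assumes.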
Assume we have already built a dataframe below:
tmp <- as.data.frame(cbind(1:499,testLoop(500)))
colnames(tmp) <- c("n","result")
Use ggplot to plot the result as shown below:
Suppose we have a dataframe called data like below:
site transect fish_abund dN15_SG
B 3 19 15.22
A 1 4 10.35
A 2 12 9.06
B 1 13 14.72
A 2 5 8.99
B 1 18 14.17
B 2 25 15.66
C 1 0 11.42
A 1 5 11.35
C 3 4 10.83
C 2 4 9.97
A 3 6 8.58
C 2 8 11.74
B 3 18 16.53
Write the code to get the result like below:
genre | Sum_dN15_SG |
---|---|
Light | 40 |
Strong | 47 |
Write a function fib that takes a positive integer n and returns the nth Fibonacci number. (In R we start counting from 1 not 0.) Fibonacci numbers are as follows: 0, 1, 1, 2, 3, 5, 8. The first two Fibonacci numbers are 0 and 1, respectively. Starting from the third one, each Fibonacci number is equal to the sum of the previous two. In other words fib(3) = fib(2) + fib(1) and fib(4) = fib(3) + fib(2) etc.
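A minimal iterative sketch of fib follows, using the 1-based convention above (fib(1) is 0 and fib(2) is 1). The helper names prev, curr and nxt are illustrative only, and a recursive version would work equally well.

```r
fib <- function(n) {
  if (n == 1) return(0)     # first Fibonacci number
  if (n == 2) return(1)     # second Fibonacci number
  prev <- 0
  curr <- 1
  for (i in 3:n) {
    nxt <- prev + curr      # sum of the previous two numbers
    prev <- curr
    curr <- nxt
  }
  curr
}

sapply(1:7, fib)
```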
Suppose the following tables are loaded to your computer. You can call them TableA and TableB, respectively.
Year | Algeria | Brazil | Columbia |
---|---|---|---|
2000 | 7 | 12 | 16 |
2001 | 9 | 14 | 18 |
Country | Abbreviation |
---|---|
Algeria | DZ |
Brazil | BR |
Columbia | CO |
Write code to produce the following table.
Abbreviation | Avg |
---|---|
DZ | 8 |
BR | 13 |
CO | 17 |
Using a while and/or a for loop and a conditional expression, convert Mat, a matrix of zeros, to an identity matrix.
Mat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 0
## [2,] 0 0 0 0 0
## [3,] 0 0 0 0 0
## [4,] 0 0 0 0 0
## [5,] 0 0 0 0 0
After your code runs, Mat should look like this.
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
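The idea can be sketched with two nested for loops and an if statement, assuming Mat starts as the 5 x 5 matrix of zeros printed above. This is one of several valid approaches; a while loop over the diagonal would also work.

```r
Mat <- matrix(0, nrow = 5, ncol = 5)   # matrix of zeros, as in the problem

for (i in 1:nrow(Mat)) {
  for (j in 1:ncol(Mat)) {
    if (i == j) {          # only diagonal entries are changed
      Mat[i, j] <- 1
    }
  }
}

Mat
```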
Given vector ‘asc=1:50’, answer the following questions:
1. Use ‘asc’ and a for loop to generate a new vector called ‘desc’ which is the reverse of ‘asc’.
2. Use ‘asc’, a for loop and a while loop to update ‘asc’ so that it contains the values c(1:25,25:1).
3. Use ‘asc’, a for loop and an if/else statement to update ‘asc’ to c(1,1,2,2,3,3,...,24,24,25,25).
Split the character string "University of California" into a vector of single characters and count the number of times the letter i appears in it.
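As a hedged illustration of the technique on a different string, the sketch below uses base R's strsplit() with an empty split pattern. The string "Berkeley" and the names s, chars and n_e are made up for this example.

```r
s <- "Berkeley"                          # hypothetical example string

# strsplit() returns a list, so [[1]] extracts the character vector
chars <- strsplit(s, split = "")[[1]]

# Comparing to "e" gives TRUE/FALSE; sum() counts the TRUEs
n_e <- sum(chars == "e")

chars
n_e
```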
Change the data from wide to narrow format using the gather function. Group Ozone, Solar.R, Wind, Temp into one variable called type and create another column called value to store their values.
# built-in dataset in R
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Suppose you have a data frame, data, as given below:
## V1 V2 V3 V4
## 1 a 1 alpha 10
## 2 a 2 beta 20
## 3 b 1 gamma 30
## 4 b 2 alpha 40
## 5 c 1 beta 50
## 6 c 2 gamma 60
Assuming that the tidyr and dplyr libraries are already loaded, write down the output of the following code. The final result is enough for full credit, but partial credit will be given for writing out and labelling intermediate steps.
data %>%
filter(V1 == "a") %>% # Step 1
select(V2, V4) %>% # Step 2
gather(key = Apple, value = Banana, V2, V4) %>% # Step 3
mutate(Apple = Banana) # Step 4
Suppose you have a data frame, data, as given below.
(a) Write a function fix_missing_99 that takes one argument: x, a numeric vector. The function should replace every component of x equal to -99 with NA.
(b) Replace every -99 in data with NA. For full credit, your code must use the function in part (a) and it should continue to work without modification if additional columns are added to the data frame.
(c) Use the apply family of functionals to perform the same task as in part (b).
## a b c d e f
## 1 1 6 1 5 -99 1
## 2 10 4 4 -99 9 3
## 3 7 9 5 4 1 4
## 4 2 9 3 8 6 8
## 5 1 10 5 9 8 6
## 6 6 2 1 3 8 5
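A minimal sketch of a function that replaces -99 with NA, applied column by column with lapply(), is shown below. The data frame is rebuilt by hand from the printed table, and this is only one possible approach. Assigning into data[] keeps the result a data frame, and the code does not depend on how many columns there are.

```r
# Data frame rebuilt from the printed table above
data <- data.frame(
  a = c(1, 10, 7, 2, 1, 6),
  b = c(6, 4, 9, 9, 10, 2),
  c = c(1, 4, 5, 3, 5, 1),
  d = c(5, -99, 4, 8, 9, 3),
  e = c(-99, 9, 1, 6, 8, 8),
  f = c(1, 3, 4, 8, 6, 5)
)

# Replace every -99 in a numeric vector with NA
fix_missing_99 <- function(x) {
  x[x == -99] <- NA
  x
}

# lapply() calls the function on each column in turn
data[] <- lapply(data, fix_missing_99)
data
```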
Consider the plot below. Assume ‘ggplot2’ is already loaded and that the first 6 rows of the ‘diamonds’ dataset are:
## Source: local data frame [6 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
What ‘ggplot’ command would you use to generate this graph?
temp <- list(c(3,7,9,6,-1),c(6,9,12,13,5), c(4,8,3,-1,-3), c(1,4,7,2,-2),c(5,7,9,4,2))
temp
## [[1]]
## [1] 3 7 9 6 -1
##
## [[2]]
## [1] 6 9 12 13 5
##
## [[3]]
## [1] 4 8 3 -1 -3
##
## [[4]]
## [1] 1 4 7 2 -2
##
## [[5]]
## [1] 5 7 9 4 2
Create a function that returns all values below zero. Call the function belowZero.
Apply belowZero over temp using sapply(). Call the result freezingS.
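The two steps can be sketched as follows, assuming logical subsetting is an acceptable way to pick out the negative values. Because the vectors in temp contain different numbers of negative values, sapply() cannot simplify the result to a vector or matrix and returns a list instead.

```r
temp <- list(c(3, 7, 9, 6, -1), c(6, 9, 12, 13, 5), c(4, 8, 3, -1, -3),
             c(1, 4, 7, 2, -2), c(5, 7, 9, 4, 2))

# Keep only the elements of x that are below zero
belowZero <- function(x) {
  x[x < 0]
}

# Apply belowZero to each vector in the list
freezingS <- sapply(temp, belowZero)
freezingS
```

Vectors with no negative values produce an empty numeric vector, numeric(0), in the corresponding position.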