Question 1

# built-in dataset in R
head(airquality)
##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Convert the data above from wide to narrow format using the gather function. Gather Ozone, Solar.R, Wind, and Temp into one variable called type, and create another column called value to store their values. Your output should look like this:

##   Month Day  type value
## 1     5   1 Ozone    41
## 2     5   2 Ozone    36
## 3     5   3 Ozone    12
## 4     5   4 Ozone    18
## 5     5   5 Ozone    NA
## 6     5   6 Ozone    28
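
One way to produce this shape, sketched under the assumption that the tidyr package is installed: name the new key and value columns in gather() and list the four measurement columns to collect. The object name narrow is illustrative only.

```r
library(tidyr)

# Gather the four measurement columns into key/value pairs;
# Month and Day are not listed, so they stay as identifier columns.
narrow <- gather(airquality, key = "type", value = "value",
                 Ozone, Solar.R, Wind, Temp)

head(narrow)
```

Newer tidyr releases recommend pivot_longer() for this kind of reshaping, but gather() still works and is what the question asks for.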

Question 2

Suppose you have a data frame, data, as given below:

##   V1 V2    V3 V4
## 1  a  1 alpha 10
## 2  a  2  beta 20
## 3  b  1 gamma 30
## 4  b  2 alpha 40
## 5  c  1  beta 50
## 6  c  2 gamma 60

Assuming that the tidyr and dplyr libraries are already loaded, write down the output of the following code. The final result is enough for full credit, but partial credit will be given for writing out and labelling intermediate steps.

What I think this code will do, without running it:

  1. filter(V1 == "a") keeps only the rows where V1 is "a". Every row with a different value of V1 is dropped, so the other columns lose those rows as well.
  2. select(V2, V4) keeps V2 and V4 as columns and drops all others.
  3. gather(key = Apple, value = Banana, V2, V4) makes a key column titled Apple, holding the source column names V2 and V4, and a value column titled Banana, populated only with the values from V2 and V4.
  4. mutate(Apple = Banana) overwrites the contents of Apple with those of Banana, so both columns hold the same values.
data %>%
  filter(V1 == "a") %>% # Step 1
  select(V2, V4) %>% # Step 2
  gather(key = Apple, value = Banana, V2, V4) %>% # Step 3
  mutate(Apple = Banana) # Step 4
##   Apple Banana
## 1     1      1
## 2     2      2
## 3    10     10
## 4    20     20
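
Since the question gives partial credit for intermediate steps, the pipeline can also be unrolled one verb at a time. The following is a sketch only: the data frame is rebuilt from its printed form in the question, and the names step1 to step4 are illustrative. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Rebuild the example data frame from its printed form in the question
data <- data.frame(
  V1 = c("a", "a", "b", "b", "c", "c"),
  V2 = c(1, 2, 1, 2, 1, 2),
  V3 = c("alpha", "beta", "gamma", "alpha", "beta", "gamma"),
  V4 = c(10, 20, 30, 40, 50, 60)
)

step1 <- filter(data, V1 == "a")        # keeps rows 1 and 2 only
step2 <- select(step1, V2, V4)          # keeps columns V2 and V4 only
step3 <- gather(step2, key = Apple, value = Banana, V2, V4)  # Apple holds the names "V2"/"V4"
step4 <- mutate(step3, Apple = Banana)  # Apple is overwritten with Banana's values

step3
step4
```

After Step 3, Apple contains the column names V2, V2, V4, V4 and Banana contains 1, 2, 10, 20. Step 4 then copies Banana into Apple, which matches the final output shown above.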

Question 3

Suppose you have a data frame, data, as given below.

  1. Write a function called fix_missing_99 that takes one argument: x, a numeric vector. The function should replace every component of x equal to -99 with NA.
  2. Write a loop that replaces every -99 in data with NA. For full credit, your code must use the function from part 1, and it should continue to work without modification if additional columns are added to the data frame.
  3. Write down an appropriate call from the apply family of functionals to perform the same task as in part 2.
fix_missing_99 <- function(x) {
  # Replace every element equal to -99 with NA; leave the rest unchanged
  x[x == -99] <- NA
  x
}
fix_missing_99(c(5, 4, 7, -99))
## [1]  5  4  7 NA
# The data frame from the question, rebuilt from its printed form
data <- data.frame(
  a = c(1, 10, 7, 2, 1, 6),
  b = c(6, 4, 9, 9, 10, 2),
  c = c(1, 4, 5, 3, 5, 1),
  d = c(5, -99, 4, 8, 9, 3),
  e = c(-99, 9, 1, 6, 8, 8),
  f = c(1, 3, 4, 8, 6, 5)
)
data
##    a  b c   d   e f
## 1  1  6 1   5 -99 1
## 2 10  4 4 -99   9 3
## 3  7  9 5   4   1 4
## 4  2  9 3   8   6 8
## 5  1 10 5   9   8 6
## 6  6  2 1   3   8 5
# Part 2: loop over the columns; seq_along() adapts if columns are added
fixed <- data
for (i in seq_along(fixed)) {
  fixed[[i]] <- fix_missing_99(fixed[[i]])
}
fixed
##    a  b c  d  e f
## 1  1  6 1  5 NA 1
## 2 10  4 4 NA  9 3
## 3  7  9 5  4  1 4
## 4  2  9 3  8  6 8
## 5  1 10 5  9  8 6
## 6  6  2 1  3  8 5
# Part 3: lapply() applies the function to each column; data[] keeps the data frame structure
data[] <- lapply(data, fix_missing_99)
data
##    a  b c  d  e f
## 1  1  6 1  5 NA 1
## 2 10  4 4 NA  9 3
## 3  7  9 5  4  1 4
## 4  2  9 3  8  6 8
## 5  1 10 5  9  8 6
## 6  6  2 1  3  8 5

Question 4

Assume that ‘ggplot2’ is already loaded. The first 6 rows of the ‘diamonds’ dataset are:

## Source: local data frame [6 x 10]
## 
##   carat       cut  color clarity depth table price     x     y     z
##   (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1  0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2  0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3  0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4  0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5  0.31      Good      J     SI2  63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48

Which ‘ggplot’ command would you use to generate the graph given below?