To complete this assignment, follow these steps:
  1. Download the homework1.Rmd file from the Canvas website.

  2. Open homework1.Rmd in RStudio.

  3. Replace the “Your Name Here” text in the author: field with your own name, and add a date.

  4. Supply your solutions to the homework by editing homework1.Rmd.

  5. When you have completed the homework and have checked that your code both runs in the Console and knits correctly when you click Knit HTML, rename the R Markdown file to homework1_YourNameHere.Rmd, and submit on Canvas (YourNameHere should be changed to your own name).

Homework tips:
  1. Recall the following useful RStudio hotkeys.
| Keystroke | Description |
|---|---|
| <tab> | Autocompletes commands and filenames, and lists arguments for functions. |
| <up> | Cycles through previous commands in the console prompt. |
| <ctrl-up> | Lists the history of previous commands matching an unfinished one. |
| <ctrl-enter> | Runs the current line from the source window in the Console. Good for trying out ideas from a source file. |
| <ESC> | Aborts an unfinished command and gets you out of the + prompt. |

Note: Shown above are the Windows/Linux keys. For Mac OS X, the <ctrl> key should be substituted with the <command> (⌘) key.

  2. Instead of sending code line-by-line with <ctrl-enter>, you can send entire code chunks, and even run all of the code chunks in your .Rmd file. Look under the menu of the Source panel.

  3. Run your code in the Console and Knit HTML frequently to check for errors.

  4. You may find it easier to solve a problem by interacting only with the Console at first.

Homework 1 outline

This homework gets you to create a “Cheat Sheet” that you can refer back to over the course of the semester.

Problem 1: Simple Boolean operations

Tip: Note that each of the code blocks in this Problem contains the expression eval = FALSE. This tells R Markdown to display the code contained in the block, but not to evaluate it. To check that your answer makes sense, be sure to try it out in the console with various choices of values for the variable x.

(a) Checking equality.

Given a variable x, write a Boolean expression that returns TRUE if the variable x is equal to “dog”.

# Insert your Boolean expression here
# Boolean expression: TRUE if x equals "dog", FALSE otherwise
# test for TRUE
x <- "dog"
x == "dog"
# test for FALSE
x <- "cat"
x == "dog"
(b) Checking inequality.

Given a variable x, write a Boolean expression that returns TRUE if the variable x is not NA (i.e., is not missing).

# Boolean expression: TRUE if x is not NA, FALSE otherwise
# test for TRUE
x <- "cat"
!is.na(x)
# test for FALSE
x <- NA
!is.na(x)
(c) Checking if a number is in a given range.

Given a (possibly negative) number x, write a Boolean expression that returns TRUE if and only if x is smaller than -52 or bigger than 12.

# Boolean expression: TRUE if x < -52 or x > 12
# (the parentheses keep x < -52 from being mistyped as the assignment x<-52)
# test for FALSE
x <- -32
(x < -52) | (x > 12)
# test for TRUE
x <- -100
(x < -52) | (x > 12)
(d) A more complicated example.

Given an integer number x, write a Boolean expression that returns TRUE if and only if x is an odd number between 12 and 32 or 58 and 72.

# vector of the odd numbers in each range, built from two concatenated sequences
v <- c(seq(13, 32, 2), seq(59, 72, 2))
# test for TRUE
x <- 13
is.element(x, v)
# test for FALSE
x <- 1
is.element(x, v)
# one-line Boolean expressions
is.element(x, c(seq(13, 32, 2), seq(59, 72, 2)))
x %in% c(seq(13, 32, 2), seq(59, 72, 2))

# alternate solution using the %% operator; parentheses make the grouping explicit
((x >= 12 & x <= 32) | (x >= 58 & x <= 72)) & (x %% 2 == 1)

Tip: Recall the modulus operator %% that we saw in Lecture 1. For integers x and y, x %% y is the remainder of x divided by y.
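As a quick illustration of the tip above, here is a minimal sketch of %% on a few values. The helper name is.odd is hypothetical and not part of the assignment.

```r
# Remainder after integer division
7 %% 2     # 1, so 7 is odd
10 %% 2    # 0, so 10 is even
# In R the result takes the sign of the divisor, so negative odd numbers also give 1
-3 %% 2    # 1

# Hypothetical helper: TRUE for odd integers, vectorized over x
is.odd <- function(x) x %% 2 == 1
is.odd(c(12, 13, -3))
```

Because is.odd() is vectorized, it can be combined with range checks using & and | in the same expression.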

Problem 2: Vector Boolean operations

(a) R has two kinds of Boolean operators implemented, single (&, |) and double (&&, ||).

One of these operators takes advantage of something called lazy evaluation while the other does not. They also don’t behave the same way when applied to vectors.

Read the help file (help("||")) and construct some examples to help figure out how the two behave.

To help you get started, try out the following two examples in your console:

# Example:  The variable y.prob2a is never defined.  
# (Do not define it!)
# What happens when you run this code?
x.prob2a <- 5
# Error (object 'y.prob2a' not found): | always evaluates both operands
(x.prob2a < 10) | (y.prob2a > 2)
# TRUE: the left operand already decides the result, so || never evaluates the right operand
(x.prob2a < 10) || (y.prob2a > 2)
# Define vectors
x.prob2a.vec <- c(TRUE, FALSE, FALSE)
y.prob2a.vec <- c(TRUE, TRUE, FALSE)

# Apply various Boolean operations to see what happens
# & compares corresponding elements and returns a vector
x.prob2a.vec & y.prob2a.vec
# && expects length-one operands: older R used only the first element of each vector
# (with a warning in R 4.2.x); R 4.3.0 and later raises an error
x.prob2a.vec && y.prob2a.vec
# | compares corresponding elements and returns a vector
x.prob2a.vec | y.prob2a.vec
# || treats vectors the same way as &&: first elements only in older R, an error in R 4.3.0 and later
x.prob2a.vec || y.prob2a.vec

Can you explain what’s happening? Write up a brief explanation below.

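The single operators | and & are vectorized: they evaluate both operands in full, compare corresponding elements, and return a vector. The double operators || and && are meant for single TRUE/FALSE values and evaluate lazily: the right operand is evaluated only when the left operand does not already determine the result, which is why the undefined y.prob2a causes no error with ||. Applied to vectors longer than one element, they used only the first element of each vector in older versions of R; R 4.3.0 and later raises an error instead.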
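The difference in evaluation can be made visible with a small sketch. The helper noisy() is hypothetical; it only prints a message so that you can see whether the right-hand operand was evaluated.

```r
# Hypothetical helper that announces when it is evaluated
noisy <- function(value) {
  message("right-hand side evaluated")
  value
}

# || stops as soon as the left operand decides the answer: no message appears
TRUE || noisy(FALSE)
# | always evaluates both operands: the message appears
TRUE | noisy(FALSE)

# & and | work element by element on vectors
a <- c(TRUE, FALSE, FALSE)
b <- c(TRUE, TRUE, FALSE)
a & b
a | b

# To collapse a vector comparison to one value, use all() or any() rather than && or ||
all(a & b)
any(a | b)
```

A common pattern is to use && and || inside if() conditions, where a single TRUE or FALSE is required, and & and | when filtering vectors or data frame rows.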

(b) Using all()

Two people were asked to give their preferences between two options: [Facebook, Twitter], [Firefox, Chrome], [Mac, PC], [Summer, Winter]. Their results are given below.

choices1 <- c("Twitter", "Chrome", "Mac", "Summer")
choices2 <- c("Facebook", "Chrome", "PC", "Summer")

Use the all() function to determine if the two people have identical preferences. (Your code should output a single Boolean value, either TRUE or FALSE)

# Compares choices1 to choices2 element by element and returns TRUE if and only if every pair matches
all(choices1 == choices2)
## [1] FALSE
(c) Using any()

Use the any() function to determine if the two people have any preferences in common. (Your code should output a single Boolean value, either TRUE or FALSE)

# returns true if any choice in a vector matches a corresponding element in another vector

any(choices1 == choices2)
## [1] TRUE
# test below to confirm that elements in vector must line up in corresponding location (it is not a combination comparison)
choices3 <- c("Chrome", "Twitter", "x", "y")
any(choices1 == choices3)
## [1] FALSE
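The choices3 test above shows that == compares position by position. If the goal were instead to ask whether two vectors share any value regardless of position, one possible sketch uses %in% or intersect(), both from base R:

```r
choices1 <- c("Twitter", "Chrome", "Mac", "Summer")
choices3 <- c("Chrome", "Twitter", "x", "y")

# Position-by-position comparison: no slot matches
any(choices1 == choices3)
# Set-style comparison: is any element of choices1 found anywhere in choices3?
any(choices1 %in% choices3)
# The shared values themselves, regardless of position
intersect(choices1, choices3)
```

For the survey question in this problem the positional comparison is the right one, because each position corresponds to a specific pair of options.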
(d) Missing values.

Let age be the vector defined below.

age <- c(37, 21, 92, NA, 45, NA, NA, 18)

Write a Boolean expression that checks whether each entry of age is missing (recall missing values are denoted by NA). Your expression should return a Boolean vector having the same length as age.

# is.na() is vectorized, so no loop is needed: TRUE = NA, FALSE = valid value
v <- is.na(age)
# display resulting Boolean vector
v
## [1] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
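It can be tempting to test for missing values with ==. The short sketch below shows why is.na() is needed instead, and one way to count the missing entries.

```r
age <- c(37, 21, 92, NA, 45, NA, NA, 18)

# Comparing with == propagates NA instead of answering TRUE or FALSE
age == NA
# is.na() is the reliable test and is already vectorized
is.na(age)
# Counting missing entries: each TRUE counts as 1 in sum()
sum(is.na(age))
```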

Problem 3: Referencing vector elements

(a) which() practice

Write code that returns the indexes of age that are missing.

# Using boolean vector v from previous question, we can identify which indexes are missing
which(v)
## [1] 4 6 7
(b) Getting non-missing values

Write code that uses negative indexes and your solution from (a) to return only the values of age that are not missing. (i.e., your code should result in a vector with elements: 37, 21, 92, 45, 18)

# negative indexes drop the positions returned by which(v), removing the NAs from age
age[-which(v)]
## [1] 37 21 92 45 18
(c) A more direct way of getting non-missing values

Using the negation operator ! and the is.na() function, write an expression that returns only the values of age that are not missing.

# second alternative way to remove elements
age[!is.na(age)]
## [1] 37 21 92 45 18
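The two approaches in (b) and (c) agree for age, but they differ when a vector has no missing values at all. A minimal sketch, using a hypothetical vector named complete:

```r
complete <- c(5, 10, 15)   # hypothetical vector with no missing values

# which() finds nothing, so the negative index is empty and nothing is selected
complete[-which(is.na(complete))]
# Logical indexing keeps every element, as intended
complete[!is.na(complete)]
```

This is one reason the logical form with ! and is.na() is usually the safer default.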
(d) More which() practice

For the next three problems we’ll go back to the cars data set.

# vectors to hold distance and speed columns from the cars data set
speed <- cars$speed
dist <- cars$dist

Write code to figure out which cars had a stopping distance of 15 feet or more.

# indexes of the cars whose stopping distance is greater than or equal to 15
which(dist >= 15)
##  [1]  4  5  7  8  9 10 11 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
(e) which.min, which.max practice

Use the which.min() function to figure out which car had the shortest stopping distance. (Your code should return the car’s index.)

# which.min() returns the index of the smallest value in the vector dist
which.min(dist)
## [1] 1
(f) More practice

Use the which.max() function to figure out the speed of the car that had the longest stopping distance. (Your code should return the car’s speed.)

# return the speed of the car that had the longest stopping distance
speed[which.max(dist)]
## [1] 24
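Note that which.min() and which.max() report only the first position when several elements tie. The sketch below uses a small hypothetical vector d to show how to recover all tied positions.

```r
d <- c(4, 2, 9, 2, 9)   # hypothetical distances with a tied minimum and a tied maximum

which.min(d)          # index of the first minimum
which(d == min(d))    # indexes of all minima
which.max(d)          # index of the first maximum
which(d == max(d))    # indexes of all maxima
```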

Problem 4: Data frame basics

(a) Importing data.

In Lecture 2 we saw how to use the read.table() function to import the survey data. Now we’ll use a different function. Use the read.csv() function to import the survey data into a variable called survey.

# read.csv() can read straight from a URL, so no download or working directory change is needed
# sep = "," and header = TRUE are already the read.csv() defaults and are shown only for clarity
survey.url <- "http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data.csv"
survey <- read.csv(survey.url, sep = ",", header = TRUE)

# alternate approach: wrap the import in a function
import.csv <- function(filename) {
    return(read.csv(filename, sep = ",", header = TRUE))
}
# call function to import
survey <- import.csv(survey.url)

Tip: The data file is located at http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data.csv. Do not download the file. Import the data directly using the URL.

(b) $ notation

Use the $ operator to select the TVhours column from the survey data

# extracts the TVhours column as a vector
survey$TVhours
##  [1]  0.0  3.0 30.0  6.0 20.0 15.0  6.0  8.0 10.0  0.0  4.0  0.0  0.0  2.0
## [15] 24.0 10.0 25.0  0.0  0.0 10.0  5.0 35.0  2.0  2.0  5.5 20.0  0.0  0.0
## [29] 10.0  5.0 40.0 10.0  0.0 20.0  0.0  6.0  2.0 10.0 10.0  7.0
(c) [,] notation

Repeat part (b) using [,] notation. That is, select the TVhours column from the survey data by using the name “TVhours” instead of the column number.

# selects the column by name and returns TVhours as a vector
survey[,"TVhours"]
##  [1]  0.0  3.0 30.0  6.0 20.0 15.0  6.0  8.0 10.0  0.0  4.0  0.0  0.0  2.0
## [15] 24.0 10.0 25.0  0.0  0.0 10.0  5.0 35.0  2.0  2.0  5.5 20.0  0.0  0.0
## [29] 10.0  5.0 40.0 10.0  0.0 20.0  0.0  6.0  2.0 10.0 10.0  7.0
(d) [[]] notation

Repeat part (c) with [[]] notation.

# determine which column is TVhours
which(colnames(survey) == "TVhours")
## [1] 5
# [[ ]] returns the column as a vector; select it by name, or equivalently by position with survey[[5]]
survey[["TVhours"]]
##  [1]  0.0  3.0 30.0  6.0 20.0 15.0  6.0  8.0 10.0  0.0  4.0  0.0  0.0  2.0
## [15] 24.0 10.0 25.0  0.0  0.0 10.0  5.0 35.0  2.0  2.0  5.5 20.0  0.0  0.0
## [29] 10.0  5.0 40.0 10.0  0.0 20.0  0.0  6.0  2.0 10.0 10.0  7.0
(e) [] notation

Repeat part (d), but this time using single brackets ([ ]) notation.

(Observe that this returns a new single-column data frame, not just a vector.)

# single brackets return a one-column data frame, header included; survey[5] selects the same column by position
survey["TVhours"]
##    TVhours
## 1      0.0
## 2      3.0
## 3     30.0
## 4      6.0
## 5     20.0
## 6     15.0
## 7      6.0
## 8      8.0
## 9     10.0
## 10     0.0
## 11     4.0
## 12     0.0
## 13     0.0
## 14     2.0
## 15    24.0
## 16    10.0
## 17    25.0
## 18     0.0
## 19     0.0
## 20    10.0
## 21     5.0
## 22    35.0
## 23     2.0
## 24     2.0
## 25     5.5
## 26    20.0
## 27     0.0
## 28     0.0
## 29    10.0
## 30     5.0
## 31    40.0
## 32    10.0
## 33     0.0
## 34    20.0
## 35     0.0
## 36     6.0
## 37     2.0
## 38    10.0
## 39    10.0
## 40     7.0
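To summarize parts (b) through (e), the sketch below checks what kind of object each notation returns. The data frame mini is a hypothetical miniature standing in for the survey data.

```r
# Hypothetical miniature of the survey data
mini <- data.frame(Program = c("MISM", "PPM", "MISM"),
                   TVhours = c(0, 3, 30))

class(mini$TVhours)         # "numeric": a plain vector
class(mini[, "TVhours"])    # "numeric": a plain vector
class(mini[["TVhours"]])    # "numeric": a plain vector
class(mini["TVhours"])      # "data.frame": a one-column data frame

# drop = FALSE keeps the data frame structure with [,] notation
class(mini[, "TVhours", drop = FALSE])
```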
(f) subset() practice

Use the subset() function to select all the survey data on Program and OperatingSystem for respondents whose Rexperience is “Never used” or who watched 5 or more hours of TV last week.

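# keep respondents with 5 or more TV hours or who never used R; return only the Program and OperatingSystem columns
# ï..Program is the Program column as it was named on import; the prefix is likely a byte-order mark
# at the start of the file, and importing with fileEncoding = "UTF-8-BOM" would typically give plain Program
sub.Pgm.OS <- subset(survey, TVhours >= 5 | Rexperience == "Never used", select = c(ï..Program, OperatingSystem))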
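For comparison, the same selection can be written with logical indexing and [,] notation. The sketch below uses a hypothetical miniature data frame mini with made-up rows, and assumes the first column is named Program.

```r
# Hypothetical miniature of the survey data (values are made up for illustration)
mini <- data.frame(
  Program = c("MISM", "PPM", "Other", "MISM"),
  OperatingSystem = c("Windows", "Mac OS X", "Windows", "Mac OS X"),
  Rexperience = c("Never used", "Basic competence", "Installed on machine", "Never used"),
  TVhours = c(0, 10, 2, 6)
)

# subset(): column names can be used directly in the condition
by.subset <- subset(mini, TVhours >= 5 | Rexperience == "Never used",
                    select = c(Program, OperatingSystem))

# The same rows and columns with logical indexing
keep <- mini$TVhours >= 5 | mini$Rexperience == "Never used"
by.index <- mini[keep, c("Program", "OperatingSystem")]

# Check that the two approaches agree
all.equal(by.subset, by.index)
```

One difference to keep in mind: subset() drops rows where the condition is NA, while logical indexing returns a row of NAs for each of them.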

Problem 5: Data summaries and inline code practice.

(a) Bar graph

Create a bar graph of respondents’ Rexperience.

# summary of R experience for the entire survey (not the subset from Problem 4(f))
# count respondents in each Rexperience category for the bar plot
count <- table(survey$Rexperience)
# bar graph with a title; no x label is needed because the bars are already labeled by category
barplot(count, main = "R experience", ylab = "# survey respondents")

(b) Inline code practice

Replace all occurrences of ???? in the paragraph below with an inline code chunk supplying the appropriate information.

# Created variables to make inline code easier to interpret
# number of participants in the survey
num.Respondents <- nrow(survey)
# count of MISM students
MISM <- length(which(survey$ï..Program == "MISM"))
# percentage of all survey respondents who use Mac OS X
OS <- length(which(survey$OperatingSystem == "Mac OS X")) / num.Respondents * 100
# percentage of MISM students with Basic competence in R
# (the denominator is the number of MISM students, not all respondents)
R.basic <- nrow(subset(survey, Rexperience == "Basic competence" & ï..Program == "MISM")) / MISM * 100

Inline code with variables:

Of the 40 survey respondents, 22 were from the MISM program. We found that 45% of all students in the class use the Mac OS X operating system. About 31.8% of MISM students (7 of 22) report having Basic competence in R.

Inline code with direct code input:

Of the 40 survey respondents, 22 were from the MISM program. We found that 45% of all students in the class use the Mac OS X operating system. About 31.8% of MISM students (7 of 22) report having Basic competence in R.
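The counts and percentages used in the paragraph above can also be computed directly from logical vectors. The sketch below uses a hypothetical miniature data frame mini, with made-up rows and a first column assumed to be named Program.

```r
# Hypothetical miniature of the survey data (values are made up for illustration)
mini <- data.frame(
  Program = c("MISM", "PPM", "MISM", "Other"),
  OperatingSystem = c("Mac OS X", "Windows", "Windows", "Mac OS X")
)

# sum() of a logical vector counts the TRUE values
sum(mini$Program == "MISM")
# mean() of a logical vector gives the proportion of TRUE values
100 * mean(mini$OperatingSystem == "Mac OS X")
# table() gives the counts for every category at once
table(mini$Program)
```

If a column contains missing values, sum() and mean() return NA unless na.rm = TRUE is supplied, whereas length(which(...)) silently ignores the NAs.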