Source file ⇒ stat133_mtreviewsheet.Rmd

Getting Started

In Chapter 1, we first talked about tidy data. Tidy does not simply mean neat; data is tidy when it follows these rules:

  1. Each row is called a case. Every case refers to a unique thing, of the same kind as the other cases, that the data is trying to describe.
  2. Each column is called a variable and contains the same type of value for each case.
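
As a rough illustration of the two rules above, here is a minimal sketch using a made-up data frame. The `students` table and its column names are hypothetical and are not part of the course data.

```r
# Hypothetical toy data: each row is one case (a student),
# each column is one variable holding a single type of value.
students <- data.frame(
  name  = c("Ana", "Ben", "Caro"),
  year  = c(1L, 2L, 2L),
  score = c(88.5, 92.0, 79.5),
  stringsAsFactors = FALSE
)

# Rule 1: one row per case, and no case appears twice
nrow(students)
anyDuplicated(students$name) == 0

# Rule 2: every column holds the same type of value for each case
col_types <- sapply(students, class)
col_types
```

If a column mixed, say, numbers and free-text notes, R would coerce it to character, which is usually a hint that the table is not tidy yet.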

Variables

In data, a variable is a known, measured value. There are two types of variables: categorical and quantitative.

Categorical variables describe groups or categories that a case could fall into. For instance, in the data set Babynames, sex is a categorical variable with two levels, F and M, which represent female and male. Another example of a categorical variable is name, which has 92,600 different levels, or unique values.

Quantitative variables take numeric values. We can usually place these values on a number line and order them.
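
The distinction can be sketched with the built-in mtcars data. This is only an illustration: converting am to a factor and the name `transmission` are choices made here, not something the review sheet prescribes.

```r
# am is stored as 0/1 in mtcars; turning it into a factor makes its
# categorical nature explicit (0 = automatic, 1 = manual).
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

levels(transmission)    # the categories a case can fall into
nlevels(transmission)   # how many levels there are
table(transmission)     # number of cases in each level

# hp is quantitative: its values are numbers that can be ordered
is.numeric(mtcars$hp)
range(mtcars$hp)        # smallest and largest horsepower
head(sort(mtcars$hp))   # the lowest few values, in order
```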

Cases

Each case has one or more variables to describe the data that was collected. For instance, in the data set mtcars each case represents a unique model of a car. Each case has many variables including hp (horsepower), disp (engine displacement), am (Auto/Manual Transmission), as well as other variables that help describe the car.
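
To see cases and variables side by side, one option is to inspect mtcars directly with base R. The choice of "Mazda RX4" as the example case is arbitrary.

```r
# Each row of mtcars is one case: a car model, stored in the row names
n_cases <- nrow(mtcars)
n_cases
head(rownames(mtcars))    # the first few car models

# Each column is one variable recorded for every case
names(mtcars)

# A few of the variables recorded for a single case
one_car <- mtcars["Mazda RX4", c("hp", "disp", "am")]
one_car
```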

Data Verbs & Chaining

Sometimes when we are dealing with a data set, we may need to change, pick, or remove certain pieces of data. We use data verbs to do this. The process of cleaning and arranging data is called data wrangling.

It is important to note that there are multiple ways to invoke several functions together. Piping (also called chaining) uses the symbol %>% to pass the result of one function to the next, while nesting places function calls inside one another. The first example demonstrates both ways of combining functions.

Let’s use the mtcars data set to explain data verbs:

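library(dplyr)  # provides %>%, group_by(), tally(), and summarise()

# pipe method: each step passes its result to the next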
mtcars %>% 
  group_by(cyl) %>%
  tally()
## Source: local data frame [3 x 2]
## 
##     cyl     n
##   (dbl) (int)
## 1     4    11
## 2     6     7
## 3     8    14
# nesting method: the innermost call runs first
tally(group_by(mtcars,cyl))
## Source: local data frame [3 x 2]
## 
##     cyl     n
##   (dbl) (int)
## 1     4    11
## 2     6     7
## 3     8    14

Both forms return the same result. However, the pipe method reads top to bottom in the order the steps are applied, which makes its syntax much easier to follow.
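
One way to confirm that the two forms above are equivalent is to store both results and compare them. This is a small sketch that assumes dplyr is installed; the names `piped` and `nested` are introduced here for illustration.

```r
library(dplyr)

# x %>% f(y) is evaluated as f(x, y), so these two expressions
# describe the same computation.
piped <- mtcars %>%
  group_by(cyl) %>%
  tally()

nested <- tally(group_by(mtcars, cyl))

# Compare the results as plain data frames
same_result <- identical(as.data.frame(piped), as.data.frame(nested))
same_result
```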

# summarise() collapses the data to one row: the total weight of all cars
mtcars %>%
  summarise(totalwt = sum(wt))
##   totalwt
## 1 102.952
# with group_by(), the same summary is computed once per cyl group
mtcars %>%
  group_by(cyl) %>%
  summarise(totwt_per_cyl = sum(wt))
## Source: local data frame [3 x 2]
## 
##     cyl totwt_per_cyl
##   (dbl)         (dbl)
## 1     4        25.143
## 2     6        21.820
## 3     8        55.989
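
The earlier description of data verbs mentions changing, picking, and removing data. A possible sketch of those three actions uses the dplyr verbs mutate(), select(), and filter(). These verbs are not named in the text above, so pairing them with "change, pick, remove" is an assumption, and `hp_per_wt` is a made-up variable.

```r
library(dplyr)

# change: add a new variable derived from existing ones
with_ratio <- mtcars %>%
  mutate(hp_per_wt = hp / wt)

# pick: keep only the variables of interest
picked <- mtcars %>%
  select(mpg, cyl, hp)

# remove: drop the cases that do not meet a condition
four_cyl <- mtcars %>%
  filter(cyl == 4)

names(picked)
nrow(four_cyl)
```

Each verb takes a data frame and returns a new one, which is why the verbs chain together so easily with %>%.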