Source file ⇒ stat133_mtreviewsheet.Rmd
In Chapter 1, we first talked about Tidy Data. While tidy does not mean neat, we have the following rules to see if data is tidy:
In data, a variable is a known, measured value recorded for each case. There are two types of variables: categorical and quantitative.
Categorical Variables describe groups or categories that a case could fall into. For instance, in the data set BabyNames, sex is a categorical variable that has two levels, F and M, which represent female and male. Another example of a categorical variable is the variable name. There are 92,600 different levels, or unique values, for name.
Quantitative Variables take numeric values. We can usually put these values on a number line and order them.
Each case has one or more variables to describe the data that was collected. For instance, in the data set mtcars each case represents a unique model of a car. Each case has many variables including hp (horsepower), disp (engine displacement), am (Auto/Manual Transmission), as well as other variables that help describe the car.
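To make the two variable types concrete, here is a small sketch that inspects columns of mtcars using only base R. The names `transmission` and `n_levels` are made up for this example, and converting `am` to a factor is an assumption about how you would want to treat it (it is stored as the numbers 0 and 1).

```r
# mtcars ships with R; each row (case) is one car model
data(mtcars)

# hp is quantitative: numeric values that can be ordered on a number line
class(mtcars$hp)
range(mtcars$hp)

# am is stored as 0/1 but acts as a categorical variable;
# per ?mtcars, 0 = automatic and 1 = manual
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# number of levels (unique categories) and how many cases fall in each
n_levels <- nlevels(transmission)
table(transmission)
```

A variable stored as numbers is not automatically quantitative. What matters is whether the values measure an amount or only label a group.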
Sometimes when we are dealing with a data set, we may need to change, pick, or remove certain pieces of data. We use data verbs to clean up data. This process of cleaning and arranging data is called data wrangling.
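As a rough illustration of picking, removing, and changing data, the sketch below applies three dplyr verbs to mtcars one step at a time. It assumes the dplyr package is installed. Mapping "pick", "remove", and "change" to `select()`, `filter()`, and `mutate()` is my reading of the sentence above, and the names `picked`, `kept`, `cleaned`, and `wt_lb` are made up for this example.

```r
library(dplyr)

# pick: keep only the columns we care about
picked <- select(mtcars, hp, wt, am)

# remove: drop the cases we do not want (here, cars with 100 hp or less)
kept <- filter(picked, hp > 100)

# change: add a new column; wt is recorded in units of 1000 lbs
cleaned <- mutate(kept, wt_lb = wt * 1000)

head(cleaned)
```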
It is important to note that there are multiple ways to invoke several functions in sequence. Piping (or chaining) uses the symbol %>% to pass the result of one function into the next, while nesting places function calls inside one another. In the first example, I will demonstrate both ways of combining functions.
Let’s use the mtcars set to explain data verbs:
# pipe method
library(dplyr)  # provides %>%, group_by(), and tally()
mtcars %>%
  group_by(cyl) %>%
  tally()
## Source: local data frame [3 x 2]
##
## cyl n
## (dbl) (int)
## 1 4 11
## 2 6 7
## 3 8 14
# nesting method
tally(group_by(mtcars, cyl))
## Source: local data frame [3 x 2]
##
## cyl n
## (dbl) (int)
## 1 4 11
## 2 6 7
## 3 8 14
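Both methods produce the same result; they differ only in how the code is written. The pipe method reads top to bottom in the order the steps are carried out, while the nesting method must be read from the inside out, so the pipe syntax is much easier to follow.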
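The difference in readability grows as more steps are added. As a sketch, assuming the dplyr package is installed, here is the same three-step computation written both ways. The object names `piped` and `nested` and the choice to keep only manual cars (`am == 1`) are made up for illustration.

```r
library(dplyr)

# pipe method: keep manual cars, group by cylinders, then count
piped <- mtcars %>%
  filter(am == 1) %>%
  group_by(cyl) %>%
  tally()

# nesting method: the same steps, written from the inside out
nested <- tally(group_by(filter(mtcars, am == 1), cyl))

# compare the two results as plain data frames
identical(as.data.frame(piped), as.data.frame(nested))
```

In the nested version the first step to run, `filter()`, sits deepest in the expression, which is why long nested calls get hard to read.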
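summarise() reduces many cases to a single summary value, for example with the sum function or the mean function (R's function for averages; there is no avg function).
mtcars %>%
  summarise(totalwt = sum(wt))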
## totalwt
## 1 102.952
mtcars %>%
  group_by(cyl) %>%
  summarise(totwt_per_cyl = sum(wt))
## Source: local data frame [3 x 2]
##
## cyl totwt_per_cyl
## (dbl) (dbl)
## 1 4 25.143
## 2 6 21.820
## 3 8 55.989
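A grouped summary can compute several values at once. The sketch below, which assumes dplyr is installed, adds a count and an average to the per-cylinder total. The column names `n_cars`, `mean_wt`, and `tot_wt` and the object name `wt_by_cyl` are made up for this example.

```r
library(dplyr)

# one row per cyl group, with three summaries of each group
wt_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cars  = n(),       # number of cases in the group
    mean_wt = mean(wt),  # average weight (mean is R's averaging function)
    tot_wt  = sum(wt)    # total weight, as in the example above
  )

wt_by_cyl
```

Within each group, `mean_wt` times `n_cars` equals `tot_wt`, which is a handy sanity check on the output.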