Harold Nelson
April 7, 2016
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
Most of the time, the data we receive is not exactly in the form we want for analysis.
Always think about what your “unit of analysis” is. It may be an individual person. It may be a cohort of people who began an activity ot the same time. It may be an academic class.
However you define this, your data must wind up in a dataframe in “tidy” shape. This means that each row of your dataframe represents one unit of analysis. and each column of your dataframe has information on a single attribute.
Primarily we will be looking at the package dplyr. There are some duplicate names between the packages plyr and dplyr. We will need both, but to make sure that we are referring to the functions in dplyr where there are duplicates, be sure to load plyr first.
You first step when you begin to work with a dataframe is to modify it with tbl_df. This changes the default printing method, which would overflow your screen. Here is the syntax.
diamonds = tbl_df(diamonds)
diamonds
## Source: local data frame [53,940 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## .. ... ... ... ... ... ... ... ... ... ...
You can also use the function glimpse()
glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat (dbl) 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut (fctr) Ideal, Premium, Good, Premium, Good, Very Good, Very ...
## $ color (fctr) E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J,...
## $ clarity (fctr) SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, S...
## $ depth (dbl) 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table (dbl) 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price (int) 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x (dbl) 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y (dbl) 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z (dbl) 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
A relatively small core set of well-designed verbs.
Read the notes on filter()
Uisng the diamonds dataset, use filter to produce a dataframe restricted to ideal cut diamonds which weigh more than 1 carat.
d1 = filter(diamonds,cut=="Ideal",carat > 1)
d1
## Source: local data frame [5,662 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 1.01 Ideal I I1 61.5 57 2844 6.45 6.46 3.97
## 2 1.02 Ideal H SI2 61.6 55 2856 6.49 6.43 3.98
## 3 1.02 Ideal I I1 61.7 56 2872 6.44 6.49 3.99
## 4 1.02 Ideal J SI2 60.3 54 2879 6.53 6.50 3.93
## 5 1.01 Ideal I I1 61.5 57 2896 6.46 6.45 3.97
## 6 1.02 Ideal I I1 61.7 56 2925 6.49 6.44 3.99
## 7 1.14 Ideal J SI1 60.2 57 3045 6.81 6.71 4.07
## 8 1.02 Ideal H SI2 58.8 57 3142 6.61 6.55 3.87
## 9 1.06 Ideal I SI2 62.8 55 3146 6.51 6.46 4.07
## 10 1.02 Ideal I VS2 62.8 57 3148 6.45 6.39 4.03
## .. ... ... ... ... ... ... ... ... ... ...
Read the notes on arrange().
Then use arrange() to produce a copy of the diamonds dataframe sorted by price in ascending order.
Next produce a copy of the diamonds dataframe sorted by cut in ascending order, color in descending order, and then price in ascending order.
d1 = arrange(diamonds,price)
d1
## Source: local data frame [53,940 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## .. ... ... ... ... ... ... ... ... ... ...
d2 = arrange(diamonds,cut,desc(color),price)
d2
## Source: local data frame [53,940 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.30 Fair J VS2 64.8 58 416 4.24 4.16 2.72
## 2 0.34 Fair J SI1 64.5 57 497 4.38 4.36 2.82
## 3 0.50 Fair J VS1 66.8 57 949 4.89 4.84 3.25
## 4 0.51 Fair J VS2 65.3 55 996 4.97 4.96 3.24
## 5 0.52 Fair J VS2 65.3 56 1030 5.05 5.00 3.28
## 6 0.70 Fair J I1 64.7 59 1066 5.59 5.50 3.59
## 7 0.62 Fair J SI1 65.4 59 1101 5.39 5.28 3.49
## 8 0.61 Fair J VS2 58.2 56 1174 5.74 5.49 3.27
## 9 0.85 Fair J I1 65.5 59 1334 5.95 5.82 3.86
## 10 0.71 Fair J SI2 66.1 58 1362 5.53 5.43 3.62
## .. ... ... ... ... ... ... ... ... ... ...
Read the notes on select().
Then produce a copy of the diamonds dataframe which contains only carat and price.
Next produce a copy of the diamonds dataframe that drops the variables table, x, y, z.
d1 = select(diamonds,carat,price)
d1
## Source: local data frame [53,940 x 2]
##
## carat price
## (dbl) (int)
## 1 0.23 326
## 2 0.21 326
## 3 0.23 327
## 4 0.29 334
## 5 0.31 335
## 6 0.24 336
## 7 0.24 336
## 8 0.26 337
## 9 0.22 337
## 10 0.23 338
## .. ... ...
d2 = select(diamonds,-table,-(x:z))
str(d2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 6 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
Read the notes on rename().
Then creat a copy of the diamonds dataframe in which the variable carat has been renamed to weight,
d1 = rename(diamonds,weight=carat)
str(d1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ weight : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Read the notes on distinct() Then find the distinct values of cut in the diamonds dataframe.
distinct(select(diamonds,cut))
## Source: local data frame [5 x 1]
##
## cut
## (fctr)
## 1 Ideal
## 2 Premium
## 3 Good
## 4 Very Good
## 5 Fair
Read the notes on mutate() Then create a copy of the diamonds dataframe with a new variable ppc computed as the price per carat.
d1 = mutate(diamonds,ppc=price/carat)
str(d1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 11 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ ppc : num 1417 1552 1422 1152 1081 ...
Read the notes on summarize() and group_by(), then get a table of the count and average weight of the diamonds by cut.
dbc = group_by(diamonds,cut)
summarize(dbc,count=n(),avgweight = mean(carat))
## Source: local data frame [5 x 3]
##
## cut count avgweight
## (fctr) (int) (dbl)
## 1 Fair 1610 1.0461366
## 2 Good 4906 0.8491847
## 3 Very Good 12082 0.8063814
## 4 Premium 13791 0.8919549
## 5 Ideal 21551 0.7028370
Read the notes on chaining. Then use this notation to find the combination of cut and color which has the highest average price per carat.
TBD - Keep working