CSC 360 Module 4 Lecture Notes

Harold Nelson

April 7, 2016

library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(magrittr)

Tools for Restructuring Data

Most of the time, the data we receive is not exactly in the form we want for analysis.

Always think about what your “unit of analysis” is. It may be an individual person. It may be a cohort of people who began an activity ot the same time. It may be an academic class.

However you define this, your data must wind up in a dataframe in “tidy” shape. This means that each row of your dataframe represents one unit of analysis. and each column of your dataframe has information on a single attribute.

Primarily we will be looking at the package dplyr. There are some duplicate names between the packages plyr and dplyr. We will need both, but to make sure that we are referring to the functions in dplyr where there are duplicates, be sure to load plyr first.

Using tbl_df

You first step when you begin to work with a dataframe is to modify it with tbl_df. This changes the default printing method, which would overflow your screen. Here is the syntax.

diamonds = tbl_df(diamonds)
diamonds
## Source: local data frame [53,940 x 10]
## 
##    carat       cut  color clarity depth table price     x     y     z
##    (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2   0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3   0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4   0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5   0.31      Good      J     SI2  63.3    58   335  4.34  4.35  2.75
## 6   0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48
## 7   0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98  2.47
## 8   0.26 Very Good      H     SI1  61.9    55   337  4.07  4.11  2.53
## 9   0.22      Fair      E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good      H     VS1  59.4    61   338  4.00  4.05  2.39
## ..   ...       ...    ...     ...   ...   ...   ...   ...   ...   ...

You can also use the function glimpse()

glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat   (dbl) 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     (fctr) Ideal, Premium, Good, Premium, Good, Very Good, Very ...
## $ color   (fctr) E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J,...
## $ clarity (fctr) SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, S...
## $ depth   (dbl) 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   (dbl) 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   (int) 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       (dbl) 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       (dbl) 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       (dbl) 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

The Verbs of dplyr

A relatively small core set of well-designed verbs.

filter()

Read the notes on filter()

Uisng the diamonds dataset, use filter to produce a dataframe restricted to ideal cut diamonds which weigh more than 1 carat.

Answer for filter() exercise.

d1 = filter(diamonds,cut=="Ideal",carat > 1)
d1
## Source: local data frame [5,662 x 10]
## 
##    carat    cut  color clarity depth table price     x     y     z
##    (dbl) (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   1.01  Ideal      I      I1  61.5    57  2844  6.45  6.46  3.97
## 2   1.02  Ideal      H     SI2  61.6    55  2856  6.49  6.43  3.98
## 3   1.02  Ideal      I      I1  61.7    56  2872  6.44  6.49  3.99
## 4   1.02  Ideal      J     SI2  60.3    54  2879  6.53  6.50  3.93
## 5   1.01  Ideal      I      I1  61.5    57  2896  6.46  6.45  3.97
## 6   1.02  Ideal      I      I1  61.7    56  2925  6.49  6.44  3.99
## 7   1.14  Ideal      J     SI1  60.2    57  3045  6.81  6.71  4.07
## 8   1.02  Ideal      H     SI2  58.8    57  3142  6.61  6.55  3.87
## 9   1.06  Ideal      I     SI2  62.8    55  3146  6.51  6.46  4.07
## 10  1.02  Ideal      I     VS2  62.8    57  3148  6.45  6.39  4.03
## ..   ...    ...    ...     ...   ...   ...   ...   ...   ...   ...

arrange()

Read the notes on arrange().

Then use arrange() to produce a copy of the diamonds dataframe sorted by price in ascending order.

Next produce a copy of the diamonds dataframe sorted by cut in ascending order, color in descending order, and then price in ascending order.

Answers to arrange() exercises

d1 = arrange(diamonds,price)
d1
## Source: local data frame [53,940 x 10]
## 
##    carat       cut  color clarity depth table price     x     y     z
##    (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.23     Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2   0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3   0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4   0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5   0.31      Good      J     SI2  63.3    58   335  4.34  4.35  2.75
## 6   0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48
## 7   0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98  2.47
## 8   0.26 Very Good      H     SI1  61.9    55   337  4.07  4.11  2.53
## 9   0.22      Fair      E     VS2  65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good      H     VS1  59.4    61   338  4.00  4.05  2.39
## ..   ...       ...    ...     ...   ...   ...   ...   ...   ...   ...
d2 = arrange(diamonds,cut,desc(color),price)
d2
## Source: local data frame [53,940 x 10]
## 
##    carat    cut  color clarity depth table price     x     y     z
##    (dbl) (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.30   Fair      J     VS2  64.8    58   416  4.24  4.16  2.72
## 2   0.34   Fair      J     SI1  64.5    57   497  4.38  4.36  2.82
## 3   0.50   Fair      J     VS1  66.8    57   949  4.89  4.84  3.25
## 4   0.51   Fair      J     VS2  65.3    55   996  4.97  4.96  3.24
## 5   0.52   Fair      J     VS2  65.3    56  1030  5.05  5.00  3.28
## 6   0.70   Fair      J      I1  64.7    59  1066  5.59  5.50  3.59
## 7   0.62   Fair      J     SI1  65.4    59  1101  5.39  5.28  3.49
## 8   0.61   Fair      J     VS2  58.2    56  1174  5.74  5.49  3.27
## 9   0.85   Fair      J      I1  65.5    59  1334  5.95  5.82  3.86
## 10  0.71   Fair      J     SI2  66.1    58  1362  5.53  5.43  3.62
## ..   ...    ...    ...     ...   ...   ...   ...   ...   ...   ...

select()

Read the notes on select().

Then produce a copy of the diamonds dataframe which contains only carat and price.

Next produce a copy of the diamonds dataframe that drops the variables table, x, y, z.

Answers to select exercises

d1 = select(diamonds,carat,price)
d1
## Source: local data frame [53,940 x 2]
## 
##    carat price
##    (dbl) (int)
## 1   0.23   326
## 2   0.21   326
## 3   0.23   327
## 4   0.29   334
## 5   0.31   335
## 6   0.24   336
## 7   0.24   336
## 8   0.26   337
## 9   0.22   337
## 10  0.23   338
## ..   ...   ...
d2 = select(diamonds,-table,-(x:z))
str(d2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  6 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...

rename()

Read the notes on rename().

Then creat a copy of the diamonds dataframe in which the variable carat has been renamed to weight,

Answer to rename() exercise.

d1 = rename(diamonds,weight=carat)
str(d1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ weight : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

distinct()

Read the notes on distinct() Then find the distinct values of cut in the diamonds dataframe.

Answer to distinct() exercise

distinct(select(diamonds,cut))
## Source: local data frame [5 x 1]
## 
##         cut
##      (fctr)
## 1     Ideal
## 2   Premium
## 3      Good
## 4 Very Good
## 5      Fair

mutate()

Read the notes on mutate() Then create a copy of the diamonds dataframe with a new variable ppc computed as the price per carat.

Answer to mutate() exercise.

d1 = mutate(diamonds,ppc=price/carat)
str(d1)
## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  11 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ ppc    : num  1417 1552 1422 1152 1081 ...

summarize() and group_by()

Read the notes on summarize() and group_by(), then get a table of the count and average weight of the diamonds by cut.

Answer to the summarize() and group_by()

dbc = group_by(diamonds,cut)

summarize(dbc,count=n(),avgweight = mean(carat))
## Source: local data frame [5 x 3]
## 
##         cut count avgweight
##      (fctr) (int)     (dbl)
## 1      Fair  1610 1.0461366
## 2      Good  4906 0.8491847
## 3 Very Good 12082 0.8063814
## 4   Premium 13791 0.8919549
## 5     Ideal 21551 0.7028370

Chaining or Pipelines

Read the notes on chaining. Then use this notation to find the combination of cut and color which has the highest average price per carat.

Answer to chaining exercise.

TBD - Keep working