** Please click all the tabs (in sequence) to get the entire set of information in these pages. **

** To download code, see the instructions in Session 2: https://rpubs.com/hkb/DAX-Session2 **

Objectives

Creating / Importing a Data Set

some.vector = c(1,2,4)
some.vector
[1] 1 2 4

We’ll use the data.frame function which is of the form

data.frame(col1name = col1values, col2name = col2values, … )

# build an example data frame
df.eg <- data.frame(name = c("Alex", "Bob", "Charlie", "David", "Emma", "Farah", "Ganesh"), id = letters[1:7], grade = 7+sample(2,7, replace=TRUE), PE.score = sample(50,7), history.score = sample(29, 7), english.score = 50 + sample(2,7, replace=TRUE)) 

print(df.eg)

Anatomy of a data set

  • Rows: Alex, Bob, Charlie, David, Emma (people)
  • Columns: name, id, grade, PE.score, english.score = attributes about people
  • Cells: values of those attributes

One way to think of that is that Columns are Variables, while Cells are Values. So, “name” is a Variable that takes on different values (Alex, Bob, etc.). Similarly, grade is a Variable, with values 4, 5, etc.

Now, the test scores are an interesting animal. They could be thought of as Variables (English score, etc.) and the values being the numbers.

But in some cases one might think of “Subject” as the variable, and “English” and “PE” being values. In this view, we would think of Subject as Variable, and Subject.Name and Subject.Score as additional variables, with appropriate values under each.

The existing view that we have is called a “wide form” view of the data - it has a column for each variable.

The long form view of the data is shown below, and is obtained using the “gather” function. The long form view is very helpful if you want to do some analytics or visualization against a series of variables (here, PE.score and english.score).

data_long <- gather(df.eg, Subject, Score, PE.score:english.score, factor_key=TRUE)
data_long

Digging Into dplyr

Here’s a great site: https://dplyr.tidyverse.org/

  • select() picks variables based on their names
  • filter() picks cases based on their values
  • arrange() changes the ordering of the rows
  • rename() renames existing variables
  • mutate() adds new variables (usually, functions of existing variables)
  • group_by() implicitly breaks the data set into groups
  • summarise() reduces multiple values down to a single summary
select(data_long, name, Score)
data_long %>% select(name) %>% distinct()
filter(data_long, Score > 25)
data_long %>% filter(Score > 25)

Now suppose we want to report all grade 9 students, plus students with scores > 25. So, the conditions are Score > 25 OR grade = 9.

data_long %>% filter(Score > 25 | grade == 9)

Now suppose we want to report only grade 9 students with scores > 25. So, the conditions are Score > 25 AND grade = 9.

ARRANGE

data_long %>% arrange(desc(Score), grade)
data_long %>% rename(student.id = id)

to add a new column, we’ll use mutate.

df.eg <- df.eg %>% mutate(total.score = PE.score + history.score + english.score)
df.eg
data_long %>% group_by(name) %>% summarise(sum(Score))
data_long %>% group_by(name) %>% mutate(total.score = sum(Score))
NA
LS0tCnRpdGxlOiAiU2Vzc2lvbiA4IgphdXRob3I6ICJIZW1hbnQgQmhhcmdhdmEiCmRhdGU6ICIwOC8wNy8yMDIwIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgoqKiBQbGVhc2UgY2xpY2sgYWxsIHRoZSB0YWJzIChpbiBzZXF1ZW5jZSkgdG8gZ2V0IHRoZSBlbnRpcmUgc2V0IG9mIGluZm9ybWF0aW9uIGluIHRoZXNlIHBhZ2VzLiAqKgoKKiogVG8gZG93bmxvYWQgY29kZSwgc2VlIHRoZSBpbnN0cnVjdGlvbnMgaW4gU2Vzc2lvbiAyOiBodHRwczovL3JwdWJzLmNvbS9oa2IvREFYLVNlc3Npb24yICoqCgoKYGBge3Igc2V0dXAsZWNobz1GQUxTRX0Ka25pdHI6Om9wdHNfY2h1bmskc2V0KGVjaG8gPSBUUlVFLCB3YXJuaW5nPUZBTFNFLCBtZXNzYWdlPUZBTFNFKQpvcHRpb25zKHNjaXBlbj0xMDAwMDAwMCkKb3B0aW9ucyhkaWdpdHM9MykKYGBgCgpgYGB7ciBwYWNrYWdlcyxlY2hvPUZBTFNFfQojIGluc3RhbGwucGFja2FnZXMoImtuaXRyIikKbGlicmFyeShrbml0cikKCmxpYnJhcnkoZHBseXIpCmxpYnJhcnkodGlkeXZlcnNlKQpsaWJyYXJ5KGdncGxvdDIpCmxpYnJhcnkoZ3JpZEV4dHJhKQpsaWJyYXJ5KGdncmVwZWwpIApgYGAKCiMjIE9iamVjdGl2ZXMKCiogRXhlcmNpc2U6IENyZWF0aW5nIC8gSW1wb3J0aW5nIGEgRGF0YSBTZXQKCiogQW5hdG9teSBvZiBhIGRhdGEgc2V0CgogIC0gUm93cywgY29sdW1ucywgY2VsbHMKICAtIFZhcmlhYmxlcywgdmFsdWVzCiAgLSBMb25nIGZvcm0sIHdpZGUgZm9ybQoKKiBHZXR0aW5nIEZhbWlsaWFyIHdpdGggZHBseXIgcGFja2FnZQoKICAtIFRoZSAlPiUgbm90YXRpb24KICAtIHNlbGVjdAogIC0gZmlsdGVyCiAgLSBtdXRhdGUKICAtIHJlbmFtZQogIC0gZ3JvdXBfYnkgCiAgLSBzdW1tYXJpemUKCiMjIyBDcmVhdGluZyAvIEltcG9ydGluZyBhIERhdGEgU2V0CgpgYGB7ciB2ZWN0b3J9CnNvbWUudmVjdG9yID0gYygxLDIsNCkKc29tZS52ZWN0b3IKYGBgCgpXZSdsbCB1c2UgdGhlIGRhdGEuZnJhbWUgZnVuY3Rpb24gd2hpY2ggaXMgb2YgdGhlIGZvcm0KCmRhdGEuZnJhbWUoY29sMW5hbWUgPSBjb2wxdmFsdWVzLCBjb2wybmFtZSA9IGNvbDJ2YWx1ZXMsIC4uLiApCgpgYGB7ciBjcmVhdGV9CiMgYnVpbGQgYW4gZXhhbXBsZSBkYXRhIGZyYW1lCmRmLmVnIDwtIGRhdGEuZnJhbWUobmFtZSA9IGMoIkFsZXgiLCAiQm9iIiwgIkNoYXJsaWUiLCAiRGF2aWQiLCAiRW1tYSIsICJGYXJhaCIsICJHYW5lc2giKSwgaWQgPSBsZXR0ZXJzWzE6N10sIGdyYWRlID0gNytzYW1wbGUoMiw3LCByZXBsYWNlPVRSVUUpLCBQRS5zY29yZSA9IHNhbXBsZSg1MCw3KSwgaGlzdG9yeS5zY29yZSA9IHNhbXBsZSgyOSwgNyksIGVuZ2xpc2guc2NvcmUgPSA1MCArIHNhbXBsZSgyLDcsIHJlcGxhY2U9VFJVRSkpIAoKcHJpbnQoZGYuZWcpCmBgYCAKCiMjIyBBbmF0b215IG9mIGEgZGF0YSBzZXQKCiogUm93czogQWxleCwgQm9iLCBDaGFybGllLCBEYXZpZCwgRW1tYSAocGVvcGxlKQoqIENvbHVtbnM6IG5hbWUsIGlkLCBncmFkZSwgUEUuc2NvcmUsIGVuZ2xpc2guc2NvcmUgPSBhdHRyaWJ1dGVzIGFib3V0IHBlb3BsZQoqIENlbGxzOiB2YWx1ZXMgb2YgdGhvc2UgYXR0cmlidXRlcwoKT25lIHdheSB0byB0aGluayBvZiB0aGF0IGlzIHRoYXQgQ29sdW1ucyBhcmUgVmFyaWFibGVzLCB3aGlsZSBDZWxscyBhcmUgVmFsdWVzLiBTbywgIm5hbWUiIGlzIGEgVmFyaWFibGUgdGhhdCB0YWtlcyBvbiBkaWZmZXJlbnQgdmFsdWVzIChBbGV4LCBCb2IsIGV0Yy4pLiBTaW1pbGFybHksIGdyYWRlIGlzIGEgVmFyaWFibGUsIHdpdGggdmFsdWVzIDQsIDUsIGV0Yy4gCgpOb3csIHRoZSB0ZXN0IHNjb3JlcyBhcmUgYW4gaW50ZXJlc3RpbmcgYW5pbWFsLiBUaGV5IGNvdWxkIGJlIHRob3VnaHQgb2YgYXMgVmFyaWFibGVzIChFbmdsaXNoIHNjb3JlLCBldGMuKSBhbmQgdGhlIHZhbHVlcyBiZWluZyB0aGUgbnVtYmVycy4gCgpCdXQgaW4gc29tZSBjYXNlcyBvbmUgbWlnaHQgdGhpbmsgb2YgIlN1YmplY3QiIGFzIHRoZSB2YXJpYWJsZSwgYW5kICJFbmdsaXNoIiBhbmQgIlBFIiBiZWluZyB2YWx1ZXMuIEluIHRoaXMgdmlldywgd2Ugd291bGQgdGhpbmsgb2YgU3ViamVjdCBhcyBWYXJpYWJsZSwgYW5kIFN1YmplY3QuTmFtZSBhbmQgU3ViamVjdC5TY29yZSBhcyBhZGRpdGlvbmFsIHZhcmlhYmxlcywgd2l0aCBhcHByb3ByaWF0ZSB2YWx1ZXMgdW5kZXIgZWFjaC4gCgpUaGUgZXhpc3RpbmcgdmlldyB0aGF0IHdlIGhhdmUgaXMgY2FsbGVkIGEgIndpZGUgZm9ybSIgdmlldyBvZiB0aGUgZGF0YSAtIGl0IGhhcyBhIGNvbHVtbiBmb3IgZWFjaCB2YXJpYWJsZS4gCgpUaGUgbG9uZyBmb3JtIHZpZXcgb2YgdGhlIGRhdGEgaXMgc2hvd24gYmVsb3csIGFuZCBpcyBvYnRhaW5lZCB1c2luZyB0aGUgImdhdGhlciIgZnVuY3Rpb24uIFRoZSBsb25nIGZvcm0gdmlldyBpcyB2ZXJ5IGhlbHBmdWwgaWYgeW91IHdhbnQgdG8gZG8gc29tZSBhbmFseXRpY3Mgb3IgdmlzdWFsaXphdGlvbiBhZ2FpbnN0IGEgc2VyaWVzIG9mIHZhcmlhYmxlcyAoaGVyZSwgUEUuc2NvcmUgYW5kIGVuZ2xpc2guc2NvcmUpLiAKCmBgYHtyIGxvbmcuZm9ybX0KZGF0YV9sb25nIDwtIGdhdGhlcihkZi5lZywgU3ViamVjdCwgU2NvcmUsIFBFLnNjb3JlOmVuZ2xpc2guc2NvcmUsIGZhY3Rvcl9rZXk9VFJVRSkKZGF0YV9sb25nCmBgYAoKIyMjIERpZ2dpbmcgSW50byBkcGx5cgoKSGVyZSdzIGEgZ3JlYXQgc2l0ZTogaHR0cHM6Ly9kcGx5ci50aWR5dmVyc2Uub3JnLwoKKiBzZWxlY3QoKSBwaWNrcyB2YXJpYWJsZXMgYmFzZWQgb24gdGhlaXIgbmFtZXMKKiBmaWx0ZXIoKSBwaWNrcyBjYXNlcyBiYXNlZCBvbiB0aGVpciB2YWx1ZXMKKiBhcnJhbmdlKCkgY2hhbmdlcyB0aGUgb3JkZXJpbmcgb2YgdGhlIHJvd3MKKiByZW5hbWUoKSByZW5hbWVzIGV4aXN0aW5nIHZhcmlhYmxlcyAKKiBtdXRhdGUoKSBhZGRzIG5ldyB2YXJpYWJsZXMgKHVzdWFsbHksIGZ1bmN0aW9ucyBvZiBleGlzdGluZyB2YXJpYWJsZXMpCiogZ3JvdXBfYnkoKSBpbXBsaWNpdGx5IGJyZWFrcyB0aGUgZGF0YSBzZXQgaW50byBncm91cHMKKiBzdW1tYXJpc2UoKSByZWR1Y2VzIG11bHRpcGxlIHZhbHVlcyBkb3duIHRvIGEgc2luZ2xlIHN1bW1hcnkKCmBgYHtyIHNlbGVjdH0Kc2VsZWN0KGRhdGFfbG9uZywgbmFtZSwgU2NvcmUpCmBgYAoKYGBge3Igc2VsZWN0LmRpc3RpbmN0fQpkYXRhX2xvbmcgJT4lIHNlbGVjdChuYW1lKSAlPiUgZGlzdGluY3QoKQpgYGAgCgpgYGB7ciBmaWx0ZXIxfQpmaWx0ZXIoZGF0YV9sb25nLCBTY29yZSA+IDI1KQpkYXRhX2xvbmcgJT4lIGZpbHRlcihTY29yZSA+IDI1KQpgYGAKCk5vdyBzdXBwb3NlIHdlIHdhbnQgdG8gcmVwb3J0IGFsbCBncmFkZSA5IHN0dWRlbnRzLCBwbHVzIHN0dWRlbnRzIHdpdGggc2NvcmVzID4gMjUuIFNvLCB0aGUgY29uZGl0aW9ucyBhcmUgU2NvcmUgPiAyNSBPUiBncmFkZSA9IDkuIAoKYGBge3IgZmlsdGVyMn0KZGF0YV9sb25nICU+JSBmaWx0ZXIoU2NvcmUgPiAyNSB8IGdyYWRlID09IDkpCmBgYAoKTm93IHN1cHBvc2Ugd2Ugd2FudCB0byByZXBvcnQgb25seSBncmFkZSA5IHN0dWRlbnRzIHdpdGggc2NvcmVzID4gMjUuIFNvLCB0aGUgY29uZGl0aW9ucyBhcmUgU2NvcmUgPiAyNSBBTkQgZ3JhZGUgPSA5LiAKCiMjIyMgQVJSQU5HRQoKYGBge3IgYXJyYW5nZS5TY29yZX0KZGF0YV9sb25nICU+JSBhcnJhbmdlKGRlc2MoU2NvcmUpLCBncmFkZSkKYGBgCgpgYGB7ciByZW5hbWV9CmRhdGFfbG9uZyAlPiUgcmVuYW1lKHN0dWRlbnQuaWQgPSBpZCkKYGBgCgp0byBhZGQgYSBuZXcgY29sdW1uLCB3ZSdsbCB1c2UgbXV0YXRlLiAKCmBgYHtyIG11dGF0ZX0KZGYuZWcgPC0gZGYuZWcgJT4lIG11dGF0ZSh0b3RhbC5zY29yZSA9IFBFLnNjb3JlICsgaGlzdG9yeS5zY29yZSArIGVuZ2xpc2guc2NvcmUpCmRmLmVnCmBgYAoKCmBgYHtyIGdyb3VwLnN1bW1hcmlzZX0KZGF0YV9sb25nICU+JSBncm91cF9ieShuYW1lKSAlPiUgc3VtbWFyaXNlKHN1bShTY29yZSkpCmRhdGFfbG9uZyAlPiUgZ3JvdXBfYnkobmFtZSkgJT4lIG11dGF0ZSh0b3RhbC5zY29yZSA9IHN1bShTY29yZSkpCgpgYGAgCg==