Data Exploration

Reading data

tidy data

Data is easiest to work with when it’s what we call tidy. This means it’s in a rectangular format, with observations in the rows and variables in the columns. It shouldn’t have any information stored in formatting (e.g., cell color), nor should any of it depend on the order of the rows or columns.
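
As a minimal sketch of what this looks like in practice, the base R snippet below starts from a "wide" table, where each group has its own column, and stacks it into tidy form with one row per observation. The names wide and tidy and the values are illustrative assumptions, not one of the course data files.

```r
# Hypothetical untidy layout: the grouping variable is hidden in the column names.
wide <- data.frame(Trt = c(3.5, 3.7, 3.8),
                   Con = c(3.6, 3.2, 2.9))

# stack() puts all values in one column and the old column names in another.
tidy <- stack(wide)
names(tidy) <- c("Response", "Group")

# One row per observation, one column per variable.
tidy
```

Once the group is a column of its own, it can be used directly as a variable in summaries and plots.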

variable names

Variable names need to start with a letter and should be alphanumeric, with only . and _ used as separators.
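
One way to check a set of column headers against this rule is base R's make.names(), which returns the valid name R would substitute for each input. The headers below are made up for illustration.

```r
# Hypothetical column headers as they might appear in a spreadsheet.
raw_names <- c("Body Weight", "2nd visit", "WBC_count", "age.years")

# make.names() shows what R would turn each one into to make it a valid name.
clean_names <- make.names(raw_names)

# Names that are already valid come back unchanged.
data.frame(raw_names, clean_names, unchanged = raw_names == clean_names)
```

Headers that come back changed are the ones worth renaming in the spreadsheet before reading it in.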

You can have an extra row of column names above the variable names with more human-readable information, and skip it when reading the file by using skip=1 (to skip one row).
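
The effect of skipping a row can be sketched without an Excel file. The snippet below uses base R's read.csv() on an inline string as a stand-in for the spreadsheet. The file contents and the name d_skip are assumptions for illustration.

```r
# Hypothetical file contents: a descriptive header row above the real variable names.
raw <- "Dog identifier,Body weight (kg)
ID,Weight
1,40
2,30
3,29.5"

# skip = 1 drops the descriptive row, so the second row supplies the names.
d_skip <- read.csv(text = raw, skip = 1)
str(d_skip)
```

Without skip = 1, the descriptive row would be taken as the variable names and the real names would end up as the first row of data.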

missing data

Blank…

R code to do this

na="NA"
skip=1

library(readxl)  # provides read_excel()
d <- read_excel("data01-dog.xlsx", na="na")
str(d)

## Classes 'tbl_df', 'tbl' and 'data.frame':    85 obs. of  13 variables:
##  $ ID         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Treatment  : chr  "Control" "Control" "Control" "Control" ...
##  $ Age        : num  10 11 10 6 6 8 10 12 9 12 ...
##  $ Breed      : chr  "Am Bulldog" "Dalmatian" "GR" "Vizsla" ...
##  $ Weight     : num  40 30 29.5 27.5 40 25.9 12.6 33 39.5 28 ...
##  $ Splenectomy: POSIXct, format: "2009-03-26" "2004-12-16" ...
##  $ Stage      : chr  "II" "II" "II" "III" ...
##  $ Hemoabdomen: chr  "y" "y" "y" "y" ...
##  $ Hematocrit : num  36.1 18.9 42.9 27.4 34 42 24 22.7 17.2 28.9 ...
##  $ WBC        : num  23.7 13.6 13.3 12.2 31.1 ...
##  $ Platelets  : num  215 NA 31 60 73 90 68 97 131 204 ...
##  $ Transfusion: chr  "y" "y" "n" "y" ...
##  $ Survival   : num  1017 188 52 59 60 ...
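
To illustrate what the na argument does, here is a small sketch using base R's read.csv(), where the analogous argument is called na.strings. The inline data and the name d_na are assumptions. The snippet treats both the text "na" and empty cells as missing.

```r
# Hypothetical file contents with two kinds of missing-value markers.
raw <- "ID,Platelets,Transfusion
1,215,y
2,na,y
3,31,
4,60,n"

# Treat both the text "na" and empty cells as missing.
d_na <- read.csv(text = raw, na.strings = c("na", ""))

is.na(d_na$Platelets)     # the "na" entry is now a true missing value
is.na(d_na$Transfusion)   # so is the empty cell
```

Because "na" is recognized as missing, Platelets is read as a number rather than as text.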

accessing variables within data sets

Here’s a basic data frame

library(readr)  # provides read_csv()
d <- read_csv("data03-ex1.csv")
d

## # A tibble: 6 x 2
##   Group Response
##   <chr>    <dbl>
## 1 Trt        3.5
## 2 Trt        3.7
## 3 Trt        3.8
## 4 Con        3.6
## 5 Con        3.2
## 6 Con        2.9

We can’t access the variables in it directly:

Group

## Error in eval(expr, envir, enclos): object 'Group' not found

Three options (poor, okay, best)

1. (Poor) Use `attach`. Has pitfalls, use only if you know what they are.

attach(d)
Group

## [1] "Trt" "Trt" "Trt" "Con" "Con" "Con"

Use detach to un-attach.

detach(d)

2. Use `$`. This is fine, but often leads to lots of tedious error-prone typing.

d$Group

## [1] "Trt" "Trt" "Trt" "Con" "Con" "Con"

3. Use functions where you specify what data set the variable is in.

More generally, you can use with() for functions that don’t let you specify the data set.

with(d, Group)

## [1] "Trt" "Trt" "Trt" "Con" "Con" "Con"
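
One of the pitfalls of attach can be sketched in a few lines: the attached variables are a snapshot, so later changes to the data frame are not seen, while $ and with() always use the current data. The data frame d_demo is a made-up stand-in for the example above.

```r
# Hypothetical copy of a small two-variable data frame.
d_demo <- data.frame(Group = c("Trt", "Trt", "Con", "Con"),
                     Response = c(3.5, 3.7, 3.6, 3.2))

# Pitfall of attach(): the attached copy is a snapshot.
attach(d_demo)
d_demo$Response <- d_demo$Response * 10   # change the data frame afterwards
stale <- Response                         # still the values from before the change
detach(d_demo)

# $ and with() always look at the current data frame.
fresh_dollar <- d_demo$Response
fresh_with <- with(d_demo, Response)

stale
fresh_with
```

Here stale holds the original values and fresh_with holds the rescaled ones, which is the kind of silent mismatch that makes attach risky.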

Exploring variables

The mosaic package standardizes many R functions so that they follow the format below. Replace the words in capital letters with the specific function, variables, and data set you want to use.

GOAL(Y ~ X, data = MYDATA)

Sometimes we’ll add a grouping variable:

GOAL(Y ~ X | G, data = MYDATA)
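
The formula-plus-data pattern also appears in base R. As a rough sketch that does not use mosaic, aggregate() accepts the same kind of formula and data arguments. The data frame d_demo re-types the six rows of the small Group/Response example above. Reading GOAL as aggregate with FUN = mean is an illustrative assumption, not how mosaic itself is implemented.

```r
# The six rows of the small Group/Response example, typed in by hand.
d_demo <- data.frame(Group = c("Trt", "Trt", "Trt", "Con", "Con", "Con"),
                     Response = c(3.5, 3.7, 3.8, 3.6, 3.2, 2.9))

# Y = Response, X = Group, MYDATA = d_demo: mean of Response within each Group.
group_means <- aggregate(Response ~ Group, data = d_demo, FUN = mean)
group_means

# Same formula and data, different goal: standard deviation within each Group.
group_sds <- aggregate(Response ~ Group, data = d_demo, FUN = sd)
group_sds
```

Only the function changes between the two calls. The formula and the data argument stay the same, which is the point of the template.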

R isn’t picky about spaces, but it’s good form to put spaces around operators and after commas; this makes your scripts easier to read later.

Here are some numerical examples.

library(mosaic)  # provides the formula interface used below
d <- read_excel("data01-dog.xlsx", na="na")

mean( ~ Weight, data=d)

## [1] 28.06733

sd( ~ Weight, data=d)

## [1] 11.50567

sd( ~ Weight | Treatment, data=d)

##        A        B  Control 
## 11.53648 11.76857 11.15189

median( ~ Survival | Treatment, data=d)

##       A       B Control 
##     135     239     145

tally(Treatment ~ Transfusion, data=d)

##          Transfusion
## Treatment  n  y <NA>
##   A       17  8    0
##   B       10 12    1
##   Control 11 25    1

When a variable has missing values, summaries such as mean, sd, and median return NA unless you add na.rm=TRUE to drop the missing values first.
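
As a small illustration of how missing values affect a summary, the base R snippet below uses the first four Platelets values shown in the str() output earlier, one of which is NA. The vector name platelets is an assumption for the example.

```r
# First four Platelets values from the str() output, including one NA.
platelets <- c(215, NA, 31, 60)

mean(platelets)                 # NA: one missing value makes the mean missing
mean(platelets, na.rm = TRUE)   # 102: mean of the three observed values
sum(is.na(platelets))           # 1: how many values were dropped
```

Checking how many values were dropped is a useful habit, since na.rm=TRUE removes them without any warning.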

And here are some graphical examples.

gf_point(Survival ~ WBC, data=d)

## Warning: Removed 4 rows containing missing values (geom_point).

gf_dotplot( ~ Survival, data=d)

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

gf_histogram(~ Survival, data=d)

gf_histogram(~ Survival | Treatment, data=d)

gf_boxplot(Survival ~ Treatment, data=d)

gf_violin(Survival ~ Treatment, data=d)

gf_density(~ Survival, fill=~Treatment, data=d)

gf_bar( ~ Transfusion, data=d)

gf_bar(~ Transfusion | Treatment, data=d)

gf_bar(~ Transfusion, fill = ~ Treatment, data=d)

For more details on mosaic, see the book A Student’s Guide to R, available in the Google Drive or online at https://github.com/ProjectMOSAIC/LittleBooks or http://mosaic-web.org/.

Data Exploration

Aaron Rendahl, PhD

Fall 2019
