Data is easy to work with when it’s what we call tidy. This means it’s in a rectangular format, with observations in the rows and variables in the columns. It shouldn’t have any information stored in formatting (e.g., cell color), nor should any of it depend on the order of the rows or columns.
Variable names need to start with a letter and should be alphanumeric, with only . and _ as separators.
You can have an extra row of more human-readable column labels above the variable names, and skip it when reading the file by using skip=1 (to skip one row).
Blank…
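The relevant read_excel arguments are na (for example na="NA") and skip (for example skip=1). In this file, missing values are marked with "na":
library(readxl)
d <- read_excel("data01-dog.xlsx", na="na")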
str(d)
## Classes 'tbl_df', 'tbl' and 'data.frame': 85 obs. of 13 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Treatment : chr "Control" "Control" "Control" "Control" ...
## $ Age : num 10 11 10 6 6 8 10 12 9 12 ...
## $ Breed : chr "Am Bulldog" "Dalmatian" "GR" "Vizsla" ...
## $ Weight : num 40 30 29.5 27.5 40 25.9 12.6 33 39.5 28 ...
## $ Splenectomy: POSIXct, format: "2009-03-26" "2004-12-16" ...
## $ Stage : chr "II" "II" "II" "III" ...
## $ Hemoabdomen: chr "y" "y" "y" "y" ...
## $ Hematocrit : num 36.1 18.9 42.9 27.4 34 42 24 22.7 17.2 28.9 ...
## $ WBC : num 23.7 13.6 13.3 12.2 31.1 ...
## $ Platelets : num 215 NA 31 60 73 90 68 97 131 204 ...
## $ Transfusion: chr "y" "y" "n" "y" ...
## $ Survival : num 1017 188 52 59 60 ...
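The skip and missing-value options described above can be tried without an Excel file. The following is a minimal sketch, assuming a made-up two-column file held in a string and base R's read.csv in place of read_excel (so the missing-value argument is called na.strings rather than na). The column labels and values are hypothetical.

```r
# Hypothetical file contents: a human-readable label row, then the real
# variable names, then the data. "na" marks a missing value.
csv_text <- "Dog identifier,Body weight (kg)
ID,Weight
1,40
2,na
3,29.5"

# skip = 1 drops the label row; na.strings plays the role of read_excel's na
d_demo <- read.csv(text = csv_text, skip = 1, na.strings = "na")
str(d_demo)

# Names that are already valid R names come back from make.names() unchanged
names_ok <- make.names(names(d_demo)) == names(d_demo)
names_ok
```

If a name had contained a space or started with a digit, make.names would have altered it, which is a quick way to spot column names that break the naming rule.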
Here’s a basic data frame:
library(readr)
d <- read_csv("data03-ex1.csv")
d
## # A tibble: 6 x 2
## Group Response
## <chr> <dbl>
## 1 Trt 3.5
## 2 Trt 3.7
## 3 Trt 3.8
## 4 Con 3.6
## 5 Con 3.2
## 6 Con 2.9
We can’t access the variables in it directly:
Group
## Error in eval(expr, envir, enclos): object 'Group' not found
There are three options (poor, okay, best):
attach. This has pitfalls; use it only if you know what they are.
attach(d)
Group
## [1] "Trt" "Trt" "Trt" "Con" "Con" "Con"
Use detach to un-attach.
detach(d)
$. This is fine, but it often leads to lots of tedious, error-prone typing.
d$Group
## [1] "Trt" "Trt" "Trt" "Con" "Con" "Con"
More generally, you can use with, which evaluates an expression inside the data frame; this works for functions that don’t let you name the data set themselves.
with(d, Group)
## [1] "Trt" "Trt" "Trt" "Con" "Con" "Con"
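To make the attach pitfall concrete, here is a minimal sketch using a small made-up data frame (d_demo and its values are illustrative, not the course data). It shows one common pitfall: attach works on a copy, so later changes to the data frame are not seen through the attached names, while $ and with always read the current data.

```r
# Small stand-in for the data frame shown above (same columns, fewer rows)
d_demo <- data.frame(Group = c("Trt", "Trt", "Con"),
                     Response = c(3.5, 3.7, 3.6))

# attach() copies the columns onto the search path
attach(d_demo)
d_demo$Response <- d_demo$Response * 10  # modify the data frame afterwards
attached_copy <- Response                # still the old values: the copy is stale
detach(d_demo)

# $ always reads the current data frame
dollar_value <- d_demo$Response

# with() also evaluates inside the current data frame
with_value <- with(d_demo, mean(Response))

attached_copy
dollar_value
with_value
```

Other pitfalls exist as well (for example, attached names can be masked by objects in your workspace), so this sketch covers only one of them.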
The mosaic package standardizes many R functions to the format below, where you replace the capitalized words with the specific function, variables, and data set you want to use.
GOAL(Y ~ X, data = MYDATA)
Sometimes we’ll add a grouping variable:
GOAL(Y ~ X | G, data = MYDATA)
Here are some examples.
R isn’t picky about spaces, but it’s good form to put spaces around operators and after commas; this makes your scripts easier to read later.
Here are some numerical examples.
library(mosaic)
d <- read_excel("data01-dog.xlsx", na="na")
mean( ~ Weight, data=d)
## [1] 28.06733
sd( ~ Weight, data=d)
## [1] 11.50567
sd( ~ Weight | Treatment, data=d)
## A B Control
## 11.53648 11.76857 11.15189
median( ~ Survival | Treatment, data=d)
## A B Control
## 135 239 145
tally(Treatment ~ Transfusion, data=d)
##          Transfusion
## Treatment  n  y <NA>
##   A       17  8    0
##   B       10 12    1
##   Control 11 25    1
Note on missing data: when a variable contains missing values, add na.rm=TRUE to the call so they are dropped before the summary is computed.
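To see what na.rm=TRUE does, here is a minimal sketch using the first four Platelets values from the str output earlier, placed in a small stand-in data frame (platelets_demo is a made-up name). The base R calls run anywhere; the formula version is only run if mosaic is installed.

```r
# First four Platelets values from the str() output above
platelets_demo <- data.frame(Platelets = c(215, NA, 31, 60))

# Without na.rm, one missing value makes the whole summary missing
mean_with_na <- mean(platelets_demo$Platelets)
mean_with_na   # NA

# With na.rm = TRUE, the missing value is dropped first
mean_no_na <- mean(platelets_demo$Platelets, na.rm = TRUE)
mean_no_na     # 102

# Same idea with the mosaic formula interface, if the package is installed
if (requireNamespace("mosaic", quietly = TRUE)) {
  library(mosaic)
  print(mean(~ Platelets, data = platelets_demo, na.rm = TRUE))
}
```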
And here are some graphical examples.
gf_point(Survival ~ WBC, data=d)
## Warning: Removed 4 rows containing missing values (geom_point).
gf_dotplot( ~ Survival, data=d)
## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
gf_histogram(~ Survival, data=d)
gf_histogram(~ Survival | Treatment, data=d)
gf_boxplot(Survival ~ Treatment, data=d)
gf_violin(Survival ~ Treatment, data=d)
gf_density(~ Survival, fill=~Treatment, data=d)
gf_bar( ~ Transfusion, data=d)
gf_bar(~ Transfusion | Treatment, data=d)
gf_bar(~ Transfusion, fill = ~ Treatment, data=d)
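The dot plot message above suggests choosing a binwidth rather than relying on the default number of bins. The following is a minimal sketch of setting it explicitly, assuming the ggformula package is installed and using the built-in mtcars data as a stand-in; the value binwidth = 2 is an arbitrary illustrative choice, not a recommendation for the dog data.

```r
if (requireNamespace("ggformula", quietly = TRUE)) {
  library(ggformula)
  # mtcars is a stand-in data set; binwidth = 2 is an arbitrary choice
  p_hist <- gf_histogram(~ mpg, data = mtcars, binwidth = 2)
  print(p_hist)
}
```

A sensible binwidth depends on the scale of the variable, so it is worth trying a few values and comparing the plots.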
For more details on mosaic, see the book A Student’s Guide to R, available in the Google Drive or online at https://github.com/ProjectMOSAIC/LittleBooks or http://mosaic-web.org/.