Can view in editor (but not edit) by clicking on it in environment; can also click on arrow to see structure.
Other helpful commands:
d <- read_csv("data04-ex1.csv")
d
## # A tibble: 6 x 2
## Group Response
## <dbl> <dbl>
## 1 1 3.5
## 2 1 3.7
## 3 1 3.8
## 4 2 3.6
## 5 2 3.2
## 6 2 2.9
str(d)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 2 variables:
## $ Group : num 1 1 1 2 2 2
## $ Response: num 3.5 3.7 3.8 3.6 3.2 2.9
## - attr(*, "spec")=
## .. cols(
## .. Group = col_double(),
## .. Response = col_double()
## .. )
inspect(d)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 Group numeric 1.0 1.000 1.50 2.000 2.0 1.50 0.5477226 6 0
## 2 Response numeric 2.9 3.275 3.55 3.675 3.8 3.45 0.3391165 6 0
head(d)
## # A tibble: 6 x 2
## Group Response
## <dbl> <dbl>
## 1 1 3.5
## 2 1 3.7
## 3 1 3.8
## 4 2 3.6
## 5 2 3.2
## 6 2 2.9
tail(d)
## # A tibble: 6 x 2
## Group Response
## <dbl> <dbl>
## 1 1 3.5
## 2 1 3.7
## 3 1 3.8
## 4 2 3.6
## 5 2 3.2
## 6 2 2.9
A factor is a categorical variable; basically, a text variable where the levels are defined and given an order.
Here I’ve coded Group as a numeric, 1 or 2; I need to make Group a factor!
Here I assign them to new variable names for demonstration; I could use the same name as well.
d$Group_v2 <- factor(d$Group)
d$Group_v3 <- factor(d$Group, levels=c(1,2), labels=c("Trt", "Con"))
inspect(d)
##
## categorical variables:
## name class levels n missing
## 1 Group_v2 factor 2 6 0
## 2 Group_v3 factor 2 6 0
## distribution
## 1 1 (50%), 2 (50%)
## 2 Trt (50%), Con (50%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 Group numeric 1.0 1.000 1.50 2.000 2.0 1.50 0.5477226 6 0
## 2 Response numeric 2.9 3.275 3.55 3.675 3.8 3.45 0.3391165 6 0
A better way is to use functions that specify the data set, rather than the $.
Back to the original version of the data:
d <- read_csv("data04-ex1.csv")
d
## # A tibble: 6 x 2
## Group Response
## <dbl> <dbl>
## 1 1 3.5
## 2 1 3.7
## 3 1 3.8
## 4 2 3.6
## 5 2 3.2
## 6 2 2.9
mutate adds (or changes) variables in the data set, using existing variables
d <- mutate(d,
Group_v2=factor(Group),
Group_v3=factor(Group, levels=c(1,2), labels=c("Trt", "Con"))
)
inspect(d)
##
## categorical variables:
## name class levels n missing
## 1 Group_v2 factor 2 6 0
## 2 Group_v3 factor 2 6 0
## distribution
## 1 1 (50%), 2 (50%)
## 2 Trt (50%), Con (50%)
##
## quantitative variables:
## name class min Q1 median Q3 max mean sd n missing
## 1 Group numeric 1.0 1.000 1.50 2.000 2.0 1.50 0.5477226 6 0
## 2 Response numeric 2.9 3.275 3.55 3.675 3.8 3.45 0.3391165 6 0
could log transform (log, log10, log2)
d <- mutate(d, logResponse=log2(Response))
d
## # A tibble: 6 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 1.81
## 2 1 3.7 1 Trt 1.89
## 3 1 3.8 1 Trt 1.93
## 4 2 3.6 2 Con 1.85
## 5 2 3.2 2 Con 1.68
## 6 2 2.9 2 Con 1.54
select(d, Group, Response)
## # A tibble: 6 x 2
## Group Response
## <dbl> <dbl>
## 1 1 3.5
## 2 1 3.7
## 3 1 3.8
## 4 2 3.6
## 5 2 3.2
## 6 2 2.9
select(d, -Response, -Group)
## # A tibble: 6 x 3
## Group_v2 Group_v3 logResponse
## <fct> <fct> <dbl>
## 1 1 Trt 1.81
## 2 1 Trt 1.89
## 3 1 Trt 1.93
## 4 2 Con 1.85
## 5 2 Con 1.68
## 6 2 Con 1.54
==, !=, %in%<, <=, >, >=& |filter(d, Group == 1)
## # A tibble: 3 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 1.81
## 2 1 3.7 1 Trt 1.89
## 3 1 3.8 1 Trt 1.93
filter(d, Group != 1)
## # A tibble: 3 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 2 3.6 2 Con 1.85
## 2 2 3.2 2 Con 1.68
## 3 2 2.9 2 Con 1.54
filter(d, Group %in% c(1, 2))
## # A tibble: 6 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 1.81
## 2 1 3.7 1 Trt 1.89
## 3 1 3.8 1 Trt 1.93
## 4 2 3.6 2 Con 1.85
## 5 2 3.2 2 Con 1.68
## 6 2 2.9 2 Con 1.54
rename(d, Treatment = "Group")
## # A tibble: 6 x 5
## Treatment Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 1.81
## 2 1 3.7 1 Trt 1.89
## 3 1 3.8 1 Trt 1.93
## 4 2 3.6 2 Con 1.85
## 5 2 3.2 2 Con 1.68
## 6 2 2.9 2 Con 1.54
arrange(d, Response)
## # A tibble: 6 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 2 2.9 2 Con 1.54
## 2 2 3.2 2 Con 1.68
## 3 1 3.5 1 Trt 1.81
## 4 2 3.6 2 Con 1.85
## 5 1 3.7 1 Trt 1.89
## 6 1 3.8 1 Trt 1.93
arrange(d, Group, Response)
## # A tibble: 6 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 1.81
## 2 1 3.7 1 Trt 1.89
## 3 1 3.8 1 Trt 1.93
## 4 2 2.9 2 Con 1.54
## 5 2 3.2 2 Con 1.68
## 6 2 3.6 2 Con 1.85
These two commands are equivalent.
mutate(d, logResponse=log10(Response))
## # A tibble: 6 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 0.544
## 2 1 3.7 1 Trt 0.568
## 3 1 3.8 1 Trt 0.580
## 4 2 3.6 2 Con 0.556
## 5 2 3.2 2 Con 0.505
## 6 2 2.9 2 Con 0.462
d %>% mutate(logResponse=log10(Response))
## # A tibble: 6 x 5
## Group Response Group_v2 Group_v3 logResponse
## <dbl> <dbl> <fct> <fct> <dbl>
## 1 1 3.5 1 Trt 0.544
## 2 1 3.7 1 Trt 0.568
## 3 1 3.8 1 Trt 0.580
## 4 2 3.6 2 Con 0.556
## 5 2 3.2 2 Con 0.505
## 6 2 2.9 2 Con 0.462
This lets us chain together verbs for easier writing and reading.
d %>% mutate(logResponse=log10(Response)) %>%
select(Group, logResponse) %>%
filter(Group==1) %>%
rename(Treatment="Group")
## # A tibble: 3 x 2
## Treatment logResponse
## <dbl> <dbl>
## 1 1 0.544
## 2 1 0.568
## 3 1 0.580
d %>% group_by(Group) %>%
summarize(m=mean(Response),
s=sd(Response))
## # A tibble: 2 x 3
## Group m s
## <dbl> <dbl> <dbl>
## 1 1 3.67 0.153
## 2 2 3.23 0.351
Can use these to plot mean +/- SD
dx <- d %>% mutate(Group=factor(Group)) %>%
group_by(Group) %>%
summarize(m = mean(Response),
s = sd(Response)) %>%
mutate(lwr = m - s, upr = m + s)
gf_pointrange(m + lwr + upr ~ Group, data=dx)
Or with error bars: piping lets us add points for mean (at larger size), error bars, and all data points to same plot
gf_point(m ~ Group, data=dx, ylab="mean (SD)", size=3) %>%
gf_errorbar(lwr + upr ~ Group, data=dx, width=0.5) %>%
gf_point(Response ~ Group, data=d)
Some more generic data; this has four columns, representing two variables (a and b) measured at two times (1 and 2)
d <- read_csv("data04-ex2.csv")
d
## # A tibble: 10 x 5
## id a_1 a_2 b_1 b_2
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0 7 6 4
## 2 2 7 1 6 6
## 3 3 -1 0 7 10
## 4 4 3 3 5 7
## 5 5 8 0 5 7
## 6 6 1 3 2 6
## 7 7 2 1 7 0
## 8 8 1 -4 7 -3
## 9 9 2 4 1 -2
## 10 10 3 2 0 3
Takes a selection of variables (either by inclusion or exclusion) and “stacks” the values of those variables, with the result of two columns, one with the name of the variable it came from and the other with the value.
Here “variable” is the new variable name for the name of the variable, and “value” is the new variable for the values; you can change these. The other parameters are the selection of the variables to “gather”; first by exclusion (everything but id), then by inclusion; they have the same result.
gather(d, "variable", "value", -id)
gather(d, "variable", "value", a_1, a_2, b_1, b_2)
## # A tibble: 40 x 3
## id variable value
## <dbl> <chr> <dbl>
## 1 1 a_1 0
## 2 2 a_1 7
## 3 3 a_1 -1
## 4 4 a_1 3
## 5 5 a_1 8
## 6 6 a_1 1
## 7 7 a_1 2
## 8 8 a_1 1
## 9 9 a_1 2
## 10 10 a_1 3
## # … with 30 more rows
So now we need to separate the “variable” variable into the two parts, a/b and 1/2. This separate verb does that, here I separate them into new variables called “var” and “time”. By default, it separates by non-alphanumeric.
dx <- gather(d, "variable", "value", -id) %>%
separate(variable, c("var", "time"))
dx
## # A tibble: 40 x 4
## id var time value
## <dbl> <chr> <chr> <dbl>
## 1 1 a 1 0
## 2 2 a 1 7
## 3 3 a 1 -1
## 4 4 a 1 3
## 5 5 a 1 8
## 6 6 a 1 1
## 7 7 a 1 2
## 8 8 a 1 1
## 9 9 a 1 2
## 10 10 a 1 3
## # … with 30 more rows
Also see use of the bins parameter; binwidth may also be of interest.
gf_histogram(~value | var ~ time, data=dx, bins=5)
Can reverse the “gather” procedure by “spreading”, like this
dx2 <- dx %>% spread(var, value)
dx2
## # A tibble: 20 x 4
## id time a b
## <dbl> <chr> <dbl> <dbl>
## 1 1 1 0 6
## 2 1 2 7 4
## 3 2 1 7 6
## 4 2 2 1 6
## 5 3 1 -1 7
## 6 3 2 0 10
## 7 4 1 3 5
## 8 4 2 3 7
## 9 5 1 8 5
## 10 5 2 0 7
## 11 6 1 1 2
## 12 6 2 3 6
## 13 7 1 2 7
## 14 7 2 1 0
## 15 8 1 1 7
## 16 8 2 -4 -3
## 17 9 1 2 1
## 18 9 2 4 -2
## 19 10 1 3 0
## 20 10 2 2 3
gf_point(a ~ b, color = ~time, data=dx2, title="By color")