** To download code, see the instructions in Session 2: https://rpubs.com/hkb/DAX-Session2 **
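Let’s load the student scores data set we created previously:
library(ggplot2) # ggplot() and the geoms used below
library(dplyr)   # %>%, select(), distinct(), group_by(), summarise()
# setwd("/cloud/project")
load("/cloud/project/DAX/dfeg.Rdata")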
head(df.eg)
head(data_long)
There are two data frames: df.eg (wide form) and data_long (long form). Let’s rename them to something more memorable: df.wide and df.long.
df.wide <- df.eg
df.long <- data_long
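To make the wide/long distinction concrete, here is a minimal sketch on a made-up data frame. The names, grades, scores, and the subject columns `Math` and `Art` are all invented for illustration and are not the contents of dfeg.Rdata. It uses base R’s `stack()`, which is just one way to reshape and not necessarily how data_long was built in the earlier session.

```r
# Toy wide-form data: one row per student (all values are made up)
toy.wide <- data.frame(
  name  = c("Ann", "Bob"),
  grade = c(8, 9),
  Math  = c(90, 70),
  Art   = c(80, 60)
)

# stack() puts the score columns on top of each other:
# 'values' holds the scores, 'ind' records which column each came from
stacked <- stack(toy.wide[, c("Math", "Art")])

# Long form: one row per student-subject pair.
# name and grade are recycled to match the stacked scores.
toy.long <- data.frame(
  name    = toy.wide$name,
  grade   = toy.wide$grade,
  Subject = stacked$ind,
  Score   = stacked$values
)

toy.wide
toy.long
```

The wide form has one row per student; the long form repeats each student once per subject, which is the shape ggplot generally works best with.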
plot(as.factor(df.wide$name), df.wide$grade) # , type = "b"
plot(as.factor(df.wide$name), df.wide$total.score) # , type = "b"
Now let’s do the same thing (and fancier things) with ggplot, which has the structure ggplot(data, aes(x, y, additional dimensions)) + other details …, where aes() maps columns of the data to aesthetics of the plot.
ggplot(df.long, aes(x=name, y=grade)) + geom_point() # + expand_limits(y=0)
OK, plotting who is in which class is not terribly useful. (It might be more useful to plot how many students are in each class … we’ll come to that later.) More useful might be to plot their test scores.
ggplot(df.long, aes(x=name, y=Score)) + geom_point() + expand_limits(y=0)
Hmm … each student has multiple scores, and they are all bunched up together against the name. How can we make this better? What should we separate the scores out on? And how?
ggplot(df.long, aes(x=name, y=Score, color=Subject)) + geom_point() + expand_limits(y=0)
ggplot(df.long, aes(x=Subject, y=Score, color=name)) + geom_point() + theme_classic() + geom_line(aes(group=name)) # wouldn't need this last group command if Subject were numeric
# ggplot(df.long, aes(x=as.numeric(Subject), y=Score, color=name, group=name)) + geom_point() + geom_line()
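The role of the group aesthetic is easier to see on a small example. The sketch below uses made-up data with the same column names as df.long (the values are invented). Because Subject is discrete, ggplot would by default treat every Subject/name combination as its own group, leaving geom_line() with only one point per group; mapping group = name tells it which points belong on the same line.

```r
library(ggplot2)

# Toy long-form data (made-up values), same column names as df.long
toy.long <- data.frame(
  name    = rep(c("Ann", "Bob"), each = 3),
  Subject = rep(c("Art", "Math", "Music"), times = 2),
  Score   = c(80, 90, 75, 60, 70, 85)
)

# data + aesthetic mapping, then layers added with +
p <- ggplot(toy.long, aes(x = Subject, y = Score, color = name)) +
  geom_point() +
  geom_line(aes(group = name)) # one line per student

p
```

Dropping the group mapping from geom_line() is a quick way to check the difference for yourself.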
We mentioned earlier that plotting students against grades was not terribly useful, but that it could be useful to see how many students are in each class. There are a number of ways to do this.
table(df.long$grade)
8 9
12 9
Anyone see something odd with this report?
Yes: the counts are too high, because we ran the table function on the long form data set df.long, in which each student (who’s in one grade) appears multiple times. That’s not useful. We need to do this operation on a data set where each student/grade pair appears only once.
table(df.wide$grade)
8 9
4 3
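To see why the long form inflates the counts, here is a minimal sketch on made-up data (the names, grades, and subjects are invented). table() simply counts rows, so a student with three rows is counted three times; dropping the repeated name/grade pairs first, here with base R’s unique(), gives one count per student.

```r
# Toy long-form data (made up): 2 students x 3 subjects = 6 rows
toy.long <- data.frame(
  name    = rep(c("Ann", "Bob"), each = 3),
  grade   = rep(c(8, 9), each = 3),
  Subject = rep(c("Art", "Math", "Music"), times = 2)
)

# Counts rows, so each student is counted once per subject
rows.per.grade <- table(toy.long$grade)

# Drop repeated name/grade pairs first, then count students
toy.students <- unique(toy.long[, c("name", "grade")])
students.per.grade <- table(toy.students$grade)

rows.per.grade
students.per.grade
```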
Here’s another way to get the same result, using df.long and distinct():
df.long.grade <- df.long %>% select(name, grade) %>% distinct()
table(df.long.grade$grade)
8 9
4 3
And even better, with only dplyr:
df.long %>% select(name, grade) %>% distinct() %>% group_by(grade) %>%
summarise(n=n())
Now the result is in the best form for plotting: it is a data frame with named columns, so we can assign x=grade and y=n and draw the plot. That’s an exercise for you!
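If you want a starting point, here is one possible sketch on made-up data (the names and grades are invented, and toy.counts is a hypothetical name). It uses dplyr’s count() as a shorthand for group_by() followed by summarise(n=n()), and geom_col() is just one reasonable choice of geom for counts.

```r
library(dplyr)
library(ggplot2)

# Toy long-form data (made up): each student appears twice
toy.long <- data.frame(
  name  = rep(c("Ann", "Bob", "Cal"), each = 2),
  grade = rep(c(8, 8, 9), each = 2)
)

# One row per grade, with the number of distinct students in column n
toy.counts <- toy.long %>%
  distinct(name, grade) %>%
  count(grade)

# grade is numeric here, so convert it to a factor for a discrete x axis
p <- ggplot(toy.counts, aes(x = as.factor(grade), y = n)) +
  geom_col()

p
```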
ggplot(df.long, aes(x=Subject, y=name, size=Score, color=as.factor(grade))) + geom_point() + theme_classic()