Homework #5 Visualize a Quantitative Variable with ggplot2

Huiting Ma

This homework will focus on the following parts:

Data Import for Gapminder Data

First, the Gapminder data can be imported from the URL given below. Before running the code, make sure that you are working in the correct directory on your computer.

# setwd("C:/Users/user/Desktop/UBC/STAT545")
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(gdURL)

Now, let us check whether the dataset has been imported correctly.

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
library(ggplot2)
library(lattice)
library(plyr)
library(xtable)

Visualize a Quantitative Variable Using Gapminder Data

Now, let us start our journey. While analyzing the dataset in the last homework, I noticed that there are only two countries in Oceania. I would like to drop these observations and focus my analysis on the other continents.

Using Gapminder Data

iDat <- droplevels(subset(gDat, continent != "Oceania"))

Let us check whether we did it!

table(iDat$continent)
## 
##   Africa Americas     Asia   Europe 
##      624      300      396      360

Yes! Oceania has been dropped successfully.
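
The droplevels() call matters here because subset() on its own removes the rows but keeps "Oceania" as an empty factor level, which would still appear in tables and plot facets. A minimal sketch of the difference, using a small made-up data frame (toy is hypothetical and only stands in for gDat):

```r
# Hypothetical toy data standing in for gDat
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Oceania", "Asia")),
  lifeExp   = c(50, 60, 70, 65)
)

# subset() removes the Oceania rows but keeps the now-unused level
kept <- subset(toy, continent != "Oceania")
levels(kept$continent)

# droplevels() discards factor levels that no longer occur in the data
dropped <- droplevels(kept)
levels(dropped$continent)
```

Comparing the two levels() calls shows why the table above lists only four continents.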

First, let us review JB's solution, which uses the lattice package.

stripplot(lifeExp ~ as.factor(year) | continent, iDat,
          jitter.data = TRUE, grid = "h", type = c("p", "a"),
          main = "How is Life Expectancy Changing over Time on Different Continents")

Now, let us try the ggplot2 package.

ggplot(iDat, aes(x = year, y = lifeExp, color = year)) + geom_point() + facet_wrap(~ continent) +
  ggtitle("How is Life Expectancy Changing over Time on Different Continents")

Comparing the two plots, I find the figure produced by ggplot2 much nicer than the one produced by lattice.
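
The two plots are not quite like for like: the lattice call jitters the points and overlays the mean for each year (type = c("p", "a")), while the ggplot2 call above does neither. The following is a sketch of a closer ggplot2 equivalent, not the original solution. It assumes ggplot2 3.3.0 or later, where stat_summary() takes the fun argument (older versions use fun.y), and it uses a small made-up data frame, toy, in place of iDat:

```r
library(ggplot2)

# Hypothetical stand-in for iDat; swap in iDat to reproduce the real figure
toy <- data.frame(
  continent = rep(c("Africa", "Asia"), each = 6),
  year      = rep(c(1952, 1957, 1962), times = 4),
  lifeExp   = c(38, 40, 42, 41, 43, 45, 45, 48, 50, 47, 51, 54)
)

p <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_jitter(width = 0.5, height = 0) +      # jitter horizontally only
  stat_summary(fun = mean, geom = "line") +   # mean per year, like type = "a"
  facet_wrap(~ continent) +
  ggtitle("Life Expectancy over Time by Continent")
p
```

Setting height = 0 keeps the life expectancy values exact, so only the horizontal position is perturbed.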

Data Import for New Data

After that, let us try a new dataset: the HIV data. First, we need to load the JM package (installing it beforehand if necessary) to get the dataset.

library(JM)
## Loading required package: MASS Loading required package: nlme Loading
## required package: splines Loading required package: survival

Now, we can take a closer look at the dataset.

str(aids)
## 'data.frame':    1405 obs. of  12 variables:
##  $ patient: Factor w/ 467 levels "1","2","3","4",..: 1 1 1 2 2 2 2 3 3 3 ...
##  $ Time   : num  17 17 17 19 19 ...
##  $ death  : int  0 0 0 0 0 0 0 1 1 1 ...
##  $ CD4    : num  10.68 8.43 9.43 6.32 8.12 ...
##  $ obstime: int  0 6 12 0 6 12 18 0 2 6 ...
##  $ drug   : Factor w/ 2 levels "ddC","ddI": 1 1 1 2 2 2 2 2 2 2 ...
##  $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 1 1 1 ...
##  $ prevOI : Factor w/ 2 levels "noAIDS","AIDS": 2 2 2 1 1 1 1 2 2 2 ...
##  $ AZT    : Factor w/ 2 levels "intolerance",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ start  : int  0 6 12 0 6 12 18 0 2 6 ...
##  $ stop   : num  6 12 17 6 12 ...
##  $ event  : num  0 0 0 0 0 0 0 0 0 1 ...

This dataset contains 12 variables and 1405 observations. However, today I will focus on the first seven variables: patient, Time, death, CD4, obstime, drug, and gender.

Now, I will get the variables that I am going to use from the original dataset.

aidsnew <- arrange(aids[, 1:7], drug)

Visualize a Quantitative Variable Using New Data

First, we can try some stripplots.

ggplot(aidsnew, aes(x = drug, y = CD4)) + geom_jitter(color = "darkblue") + 
  ggtitle("Plot of CD4 vs. Drug")

ggplot(aidsnew, aes(x = gender, y = CD4)) + geom_jitter() + 
  ggtitle("Plot of CD4 vs. Gender")

It is interesting to find that there is much more data available for male patients than for female patients. This makes sense, since the number of male patients with HIV is much higher than the number of female patients with HIV.
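
Each patient contributes several rows to this dataset, so the number of points in the plot reflects observations rather than patients. A sketch of how both counts could be obtained, using a hypothetical toy data frame with the same column names as aidsnew:

```r
# Hypothetical long-format data: one row per visit, as in aidsnew
toy <- data.frame(
  patient = factor(c(1, 1, 1, 2, 2, 3)),
  gender  = factor(c("male", "male", "male", "male", "male", "female"))
)

# Rows (observations) per gender
obs_per_gender <- table(toy$gender)

# Distinct patients per gender: keep one row per patient before counting
one_row_each <- toy[!duplicated(toy$patient), ]
patients_per_gender <- table(one_row_each$gender)

obs_per_gender
patients_per_gender
```

Replacing toy with aidsnew would give the corresponding counts for the real data.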

We drew the two plots above separately. Now, we can combine them in one plot.

ggplot(aidsnew, aes(x = gender, y = CD4)) + geom_jitter() + 
  facet_wrap(~ drug) + ggtitle("Plot of CD4 vs. Gender Based on Drug")

Then, let us plot the density of the quantitative variable CD4 for each drug.

ggplot(aidsnew, aes(x = CD4, color = drug)) + geom_density() + 
  ggtitle("Plot of Density Function of CD4")

Based on the above plot, both drugs appear to have a similar distribution of CD4.
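
A visual impression like this can be backed up with per-group summary statistics. The following sketch uses aggregate() from base R on a hypothetical toy data frame with made-up values; applying the same calls to aidsnew would summarize the real data:

```r
# Hypothetical toy data with the same column names as aidsnew
toy <- data.frame(
  drug = factor(c("ddC", "ddC", "ddC", "ddI", "ddI", "ddI")),
  CD4  = c(4, 6, 8, 5, 6, 10)
)

# Median and mean of CD4 within each drug group
median_by_drug <- aggregate(CD4 ~ drug, data = toy, FUN = median)
mean_by_drug   <- aggregate(CD4 ~ drug, data = toy, FUN = mean)

median_by_drug
mean_by_drug
```

Close medians alongside differing means would point to a difference in the tails rather than in the centre of the distributions.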

ggplot(aidsnew, aes(x = obstime, y = CD4, color = drug)) + 
  geom_jitter() + ggtitle("Plot of Time of Observation vs. CD4 Based on Drug")

ggplot(aidsnew, aes(x = obstime, y = CD4, color = gender)) + 
  geom_jitter() + ggtitle("Plot of Time of Observation vs. CD4 Based on Gender")

From the two graphs above, the time of observation seems to have no obvious influence on CD4. However, the number of observations drops dramatically as the time of observation increases. This suggests that many patients dropped out or died later in the study.
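
The drop in the number of observations can be checked directly by counting rows at each observation time. A minimal sketch on a hypothetical toy data frame follows; table(aidsnew$obstime) would give the counts for the real data:

```r
# Hypothetical long-format data: one row per patient visit
toy <- data.frame(
  patient = factor(c(1, 1, 1, 2, 2, 3)),
  obstime = c(0, 6, 12, 0, 6, 0)
)

# Number of observations recorded at each observation time
n_by_time <- table(toy$obstime)
n_by_time
```

Because each patient appears at most once per observation time, these counts also give the number of patients still being observed at that time.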