ggplot2
Huiting Ma
This homework will focus on the following parts:
First, the Gapminder data can be imported from here. Then, make sure that you are working in the correct directory on your computer.
# setwd("C:/Users/user/Desktop/UBC/STAT545")
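gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
# stringsAsFactors = TRUE keeps country and continent as factors (the default
# before R 4.0), which matches the str() output shown below
gDat <- read.delim(gdURL, stringsAsFactors = TRUE)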
Now, let us check whether the dataset was imported correctly.
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
library(ggplot2)
library(lattice)
library(plyr)
library(xtable)
Now, let us start our journey. While analyzing this dataset in the last homework, I noticed that there are only two measly countries in Oceania. I would like to drop these observations and focus my analysis on the other continents.
iDat <- droplevels(subset(gDat, continent != "Oceania"))
Let us check whether we did it!
table(iDat$continent)
##
## Africa Americas Asia Europe
## 624 300 396 360
Yes! Oceania has been dropped successfully.
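To see why `droplevels()` is wrapped around `subset()`, here is a minimal sketch on a small made-up data frame (the `toy` data and its values are hypothetical, not taken from Gapminder). It shows that `subset()` alone removes the rows but keeps the empty factor level.

```r
# Hypothetical toy data standing in for gDat
toy <- data.frame(
  continent = factor(c("Africa", "Asia", "Oceania", "Asia")),
  value = 1:4
)

# subset() removes the Oceania row but keeps "Oceania" as an (empty) level
kept <- subset(toy, continent != "Oceania")
levels(kept$continent)

# droplevels() removes factor levels that no longer occur in the data
cleaned <- droplevels(kept)
levels(cleaned$continent)
```

Without `droplevels()`, `table()` would still list Oceania with a count of 0, and a faceted plot could show an empty panel for it.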
First, let us review the solution from JB, which uses lattice.
stripplot(lifeExp ~ as.factor(year) | continent, iDat,
          jitter.data = TRUE, grid = "h", type = c("p", "a"),
          main = "How is Life Expectancy Changing over Time on Different Continents")
Now, let us try ggplot2.
ggplot(iDat, aes(x = year, y = lifeExp, color = year)) + geom_point() + facet_wrap(~ continent) +
  ggtitle("How is Life Expectancy Changing over Time on Different Continents")
Comparing the two plots, I find the figure drawn with ggplot2 much nicer than the one drawn with lattice.
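The two figures are not quite like for like: the lattice call jitters the points and, through `type = c("p", "a")`, joins the average at each year with a line, while the ggplot2 call draws unjittered points only. A hedged sketch of a closer ggplot2 counterpart follows. It runs on a small made-up data frame (`toy`, hypothetical values) so that it is self-contained; with the real data one would use `iDat` instead.

```r
library(ggplot2)

# Hypothetical toy data standing in for iDat (values are made up)
set.seed(1)
toy <- data.frame(
  continent = rep(c("Africa", "Asia"), each = 30),
  year      = rep(rep(c(1952, 1957, 1962), each = 10), times = 2),
  lifeExp   = c(rnorm(30, mean = 45, sd = 5), rnorm(30, mean = 55, sd = 5))
)

# Mean life expectancy per continent and year, computed up front
toy_means <- aggregate(lifeExp ~ continent + year, data = toy, FUN = mean)

p <- ggplot(toy, aes(x = year, y = lifeExp)) +
  geom_jitter(width = 1, height = 0, alpha = 0.6) +  # jitter horizontally only
  geom_line(data = toy_means, colour = "blue") +     # line through the yearly means
  facet_wrap(~ continent) +
  ggtitle("Life expectancy over time, with per-year means")
print(p)
```

Computing the means with `aggregate()` and passing them to `geom_line()` is one of several options; `stat_summary()` could do the same inside the plot call.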
After that, let us try a new dataset: the HIV data.
First, we need to install and load the JM package to get our dataset.
library(JM)
## Loading required package: MASS Loading required package: nlme Loading
## required package: splines Loading required package: survival
Now, we can have a close look at the dataset.
str(aids)
## 'data.frame': 1405 obs. of 12 variables:
## $ patient: Factor w/ 467 levels "1","2","3","4",..: 1 1 1 2 2 2 2 3 3 3 ...
## $ Time : num 17 17 17 19 19 ...
## $ death : int 0 0 0 0 0 0 0 1 1 1 ...
## $ CD4 : num 10.68 8.43 9.43 6.32 8.12 ...
## $ obstime: int 0 6 12 0 6 12 18 0 2 6 ...
## $ drug : Factor w/ 2 levels "ddC","ddI": 1 1 1 2 2 2 2 2 2 2 ...
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 2 2 2 2 1 1 1 ...
## $ prevOI : Factor w/ 2 levels "noAIDS","AIDS": 2 2 2 1 1 1 1 2 2 2 ...
## $ AZT : Factor w/ 2 levels "intolerance",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ start : int 0 6 12 0 6 12 18 0 2 6 ...
## $ stop : num 6 12 17 6 12 ...
## $ event : num 0 0 0 0 0 0 0 0 0 1 ...
This dataset contains 12 variables and 1405 observations. However, today I will focus on the following:
patient: the identifier for each patient
Time: the time to death or censoring (either the patient dropped out or death was not observed during the study)
death: 1 means that death was observed during the study
CD4: the CD4 cell count for the patient (in general, the higher the CD4 count, the healthier the patient)
obstime: the time at which the CD4 count was measured
drug: ddC means zalcitabine, ddI means didanosine
gender: the gender of the corresponding patient
Now, I will extract the variables that I am going to use from the original dataset.
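# Keep the seven variables described above (columns 1 to 7), sorted by drug
aidsnew <- arrange(aids[, c("patient", "Time", "death", "CD4", "obstime", "drug", "gender")], drug)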
First, we can try some stripplots.
ggplot(aidsnew, aes(x = drug, y = CD4)) + geom_jitter(color = "darkblue") +
ggtitle("Plot of CD4 v.s. Drug")
ggplot(aidsnew, aes(x = gender, y = CD4)) + geom_jitter() +
ggtitle("Plot of CD4 v.s. Gender")
It is interesting to find that there is much more data available for male patients than for female patients. This is plausible if the number of male patients with HIV is much higher than the number of female patients with HIV.
We drew the two plots above separately. Now, we can put them together in one plot.
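Because each patient contributes several rows (one per CD4 measurement), the number of points in the plot is not the number of patients. A minimal sketch of how one might count distinct patients per gender is given here; the helper `count_subjects` and the toy data are hypothetical and made up for illustration, and only base R is assumed.

```r
# Hypothetical helper: count distinct subjects per group in long-format data
count_subjects <- function(data, id, group) {
  one_row_each <- unique(data[, c(id, group)])  # one row per subject
  table(one_row_each[[group]])
}

# Hypothetical toy data: patient 1 has three visits, patient 2 two, patient 3 one
toy <- data.frame(
  patient = c(1, 1, 1, 2, 2, 3),
  gender  = c("male", "male", "male", "female", "female", "male")
)

rows_per_gender     <- table(toy$gender)                        # counts measurements
patients_per_gender <- count_subjects(toy, "patient", "gender") # counts patients
rows_per_gender
patients_per_gender
```

With the real data, the corresponding call would be `count_subjects(aidsnew, "patient", "gender")`.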
ggplot(aidsnew, aes(x = gender, y = CD4)) + geom_jitter() +
facet_wrap(~ drug) + ggtitle("Plot of CD4 v.s. Gender Based on Drug")
Then, let us plot the density of the quantitative variable CD4 for each drug.
ggplot(aidsnew, aes(x = CD4, color = drug)) + geom_density() +
  ggtitle("Plot of Density Function of CD4")
Based on the plot above, the distribution of CD4 looks similar for both drugs.
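A visual impression of "similar distributions" can be backed up with a few numbers. The sketch below compares quartiles within each group; `quartiles_by_group` is a hypothetical helper and the toy values are made up, so the numbers say nothing about the real `aids` data.

```r
# Hypothetical helper: quartiles of a numeric column within each group
quartiles_by_group <- function(data, value, group) {
  sapply(split(data[[value]], data[[group]]),
         quantile, probs = c(0.25, 0.5, 0.75))
}

# Hypothetical toy data (made-up CD4 values)
toy <- data.frame(
  drug = rep(c("ddC", "ddI"), each = 5),
  CD4  = c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6)
)

q <- quartiles_by_group(toy, "CD4", "drug")
q  # rows are the 25%, 50% and 75% quantiles, columns are the drugs
```

With the real data, the call would be `quartiles_by_group(aidsnew, "CD4", "drug")`.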
ggplot(aidsnew, aes(x = obstime, y = CD4, color = drug)) +
geom_jitter() + ggtitle("Plot of Time of Observation v.s. CD4 Based on Drug")
ggplot(aidsnew, aes(x = obstime, y = CD4, color = gender)) +
geom_jitter() + ggtitle("Plot of Time of Observation v.s. CD4 Based on Gender")
From the two graphs above, the time of observation does not appear to have a strong influence on CD4. However, the number of observations drops dramatically as the time of observation increases, which suggests that many patients dropped out or died later in the study.
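Both statements can be checked numerically by tabulating, for each observation time, how many measurements exist and what their average is. The following is a sketch under stated assumptions: `summarise_by_time` is a hypothetical helper, and the toy data are made up purely to show the shape of the result.

```r
# Hypothetical helper: number of measurements and mean value at each time point
summarise_by_time <- function(data, time, value) {
  n <- tapply(data[[value]], data[[time]], length)
  m <- tapply(data[[value]], data[[time]], mean)
  data.frame(
    time       = as.numeric(names(n)),
    n_obs      = as.vector(n),
    mean_value = as.vector(m)
  )
}

# Hypothetical toy data: fewer measurements at later times
toy <- data.frame(
  obstime = c(0, 0, 0, 6, 6, 12),
  CD4     = c(10, 8, 6, 9, 7, 8)
)

res <- summarise_by_time(toy, "obstime", "CD4")
res
```

With the real data, the call would be `summarise_by_time(aidsnew, "obstime", "CD4")`. A falling `n_obs` column would reflect dropout or death, and the `mean_value` column gives a rough view of the trend in CD4. Note that patients who remain in the study at later times may differ from those who left, so the means at late times should be read with care.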