Problem Set 1

Note: Grading is based both on your graphs and verbal explanations. Follow all best practices as discussed in class, including choosing appropriate parameters for all graphs. Do not expect the assignment questions to spell out precisely how the graphs should be drawn. Sometimes guidance will be provided, but the absense of guidance does not mean that all choices are ok.

Read Graphical Data Analysis with R, Ch. 3

1. Education and Income

[8 points]

Data: ex0525 in the Sleuth3 package

Draw multiple horizontal boxplots of Income2005, by Educ (education level). Order the boxplots from lowest educational level at the bottom to highest educational level at the top. (Hint: reorder factor levels) What do you observe?

library(ggplot2)
data(ex0525, package = "Sleuth3")
#horizontal box plot by coord_flip
ggplot(ex0525, aes(reorder(Educ, Income2005), Income2005, group = Educ))+
geom_boxplot(outlier.color = 'red') + coord_flip() + ggtitle("Box Plot of Income vs Eudcation Level")+
ylab("Income in 2005") + xlab("Eduacation Level") + theme(plot.title = element_text(hjust = 0.5))

Average income of a better educated person is higher than a less educated one. And the maximum income is expected higher of those who have higher level of education. It means in general, a higher education level leads to higher income.Note that education level of 12 and 13-15 have outliers which cover whisker part after upper quartile(Q3). Also, group 16 and >16 have wide range of outliers (for >16, up to 700,000)

Draw histograms, faceted by Educ, for the same data. Describe one insight that was not visible in the boxplots.

ex0525_new <- ex0525
ex0525_new$Educ <- factor(ex0525_new$Educ, levels = c("<12","12","13-15","16",">16"))
ggplot(ex0525_new, aes(Income2005)) + geom_histogram(aes(y=stat(count)/sum(count)),binwidth = 15000, fill = "blue", colour ="black") + xlim(c(0,150000))+ xlab("Income2005") + ylab("Scaled Count")+ ggtitle("Histogram of Income respect to Eudcation Level") + theme(plot.title = element_text(hjust = 0.5))+
facet_grid(Educ~., scales = "free")

Plot overlapping density curves of the same data, one curve per education level, on a single set of axes. Each curve should be a different color. What additional information do you learn?

ggplot(ex0525_new, aes(x = Income2005, colour = Educ, fill= Educ)) + geom_density(alpha = 0.2)+geom_line(stat = "density")+
  expand_limits(y=0) +scale_fill_discrete(name='Education Level by Year',labels=c("<12","12","13-15","16",">16"))+
  scale_color_discrete(name='Education Level by Year',labels=c("<12","12","13-15","16",">16")) + 
  scale_x_continuous(labels = scales::comma, limits = c(0,300000))+
 ggtitle("Overlapping Density Curve of Income respect to Eudcation Level") + theme(plot.title = element_text(hjust = 0.5))

From C, it is possible to learn group 13-15 has major distribution overlapped with group 12. Unlike group <12 and 12, which shows approximate narrow- symmterical distribution, Group 16 and >16 have tendency of wide right-skewed distribution shows that higher education would have more degree of freedom to have wide range of outcome(also higher).

2. Boundaries

[4 points]

Find or create a small dataset (< 100 observations) for which right open and right closed histograms for the same parameters are not identical. Display the full dataset (that is, show the numbers) and the plots of the two forms.

# create data based on range
x1 <- sample(1:5,10, replace = TRUE)
x2 <- sample(6:10,10,replace = TRUE)
x3 <- sample(11:15, 10, replace = TRUE)
x4 <- sample(16:20, 10, replace = TRUE)
x5 <- c(12.5,12.5,17.5,17.5,17.5)
#concat each vector and make it to a simple dataframe
x_f <- c(x1,x2,x3,x4,x5)
simple_df <- data.frame(x_f)

#show data in dataframe
print(simple_df)

##     x_f
## 1   4.0
## 2   3.0
## 3   3.0
## 4   3.0
## 5   1.0
## 6   5.0
## 7   1.0
## 8   3.0
## 9   5.0
## 10  1.0
## 11  8.0
## 12  7.0
## 13  7.0
## 14 10.0
## 15  9.0
## 16  7.0
## 17  8.0
## 18 10.0
## 19 10.0
## 20  9.0
## 21 13.0
## 22 11.0
## 23 13.0
## 24 14.0
## 25 15.0
## 26 14.0
## 27 13.0
## 28 13.0
## 29 12.0
## 30 15.0
## 31 17.0
## 32 17.0
## 33 20.0
## 34 18.0
## 35 16.0
## 36 18.0
## 37 20.0
## 38 17.0
## 39 16.0
## 40 18.0
## 41 12.5
## 42 12.5
## 43 17.5
## 44 17.5
## 45 17.5

g1<- ggplot(simple_df, aes(x_f)) + stat_bin(binwidth = 5, closed = c("right"), fill = "white", colour = "black") + ylim(c(0,15))+
  ggtitle("right closed histogram(left open)")+
  theme(plot.title = element_text(hjust = 0.5))
g2 <- ggplot(simple_df, aes(x_f)) + stat_bin(binwidth = 5, closed = c("left"), fill = "white", colour = "black") + ylim(c(0,15)) +
  ggtitle("right open histogram(left closed)")+
  theme(plot.title = element_text(hjust = 0.5))

library(ggpubr)
ggarrange(g1, g2)

Adjust parameters–the same for both–so that the right open and right closed versions become identical. Explain your strategy.

g3<- ggplot(simple_df, aes(x_f)) + stat_bin(binwidth = 5, center = 0.2, closed = c("right"), fill = "white", colour = "black") + ylim(c(0,15))+
  ggtitle("right closed histogram(left open) - modified")+
  theme(plot.title = element_text(hjust = 0.5))
g4 <- ggplot(simple_df, aes(x_f)) + stat_bin(binwidth = 5, center = 0.2, closed = c("left"), fill = "white", colour = "black") + ylim(c(0,15)) +
  ggtitle("right open histogram(left closed) - modified")+
  theme(plot.title = element_text(hjust = 0.5))

library(ggpubr)
ggarrange(g3, g4)

It shows from (a) that right-open histogram and right-closed histogram differs because there are values which are exactly same as bin boundaries. To make both histograms same, “center” parameter used to shift first bin boundary from 0 -> 0.2; This enables for both histogram to include all values correspond to bin boundaries in desired bins.

3. Beavers

[8 points]

Use QQ (quantile-quantile) plots with theoretical normal lines to compare temp for the built-in beaver1 and beaver2 datasets. Which appears to be more normally distributed?

qqnorm(beaver1$temp, main = "Normal Q-Q Plot beaver1-temp")
qqline(beaver1$temp, col="red")

qqnorm(beaver2$temp, main = "Normal Q-Q Plot beaver2-temp")
qqline(beaver2$temp, col="red")

#From qqplot we see that beaver1 temp is more normally distributed

Draw density histograms with density curves and theoretical normal curves overlaid. Does the data appear to be normally distributed?

temp1 = data.frame(beaver1$temp)
temp2 = data.frame(beaver2$temp)
temp1_mean = mean(beaver1$temp)
temp2_mean = mean(beaver2$temp)
temp1_sd = sd(beaver1$temp)
temp2_sd = sd(beaver2$temp)
ggplot(temp1, aes(x = beaver1$temp)) + 
geom_histogram(aes( y = ..density..),bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) +
geom_density(color = "#3D6480") + stat_function(fun = dnorm, args = list(mean = temp1_mean, sd = temp1_sd))  + ggtitle("density histogram with theoretical normal curves - beaver1")+ theme(plot.title = element_text(hjust = 0.5))

ggplot(temp2, aes(x = beaver2$temp)) + 
geom_histogram(aes( y = ..density..),bins = 20, colour = "#80593D", fill = "#9FC29F", boundary = 0) +
geom_density(color = "#3D6480") + stat_function(fun = dnorm, args = list(mean = temp2_mean, sd = temp2_sd)) + ggtitle("density histogram with theoretical normal curves - beaver2")+ theme(plot.title = element_text(hjust = 0.5))

#From histograms we can find that beaver1 temp is more normally distributed

Perform the Shapiro-Wilk test for normality using the shapiro.test() function and interpret the results.

#Shapiro test for beaver1
shapiro.test(rnorm(100, mean = temp2_mean, sd = temp2_sd))

## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100, mean = temp2_mean, sd = temp2_sd)
## W = 0.99428, p-value = 0.952

#Shapiro test for beaver2
shapiro.test(rnorm(100, mean = temp2_mean, sd = temp2_sd))

## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(100, mean = temp2_mean, sd = temp2_sd)
## W = 0.99563, p-value = 0.9881

Did all of the methods for testing for normality (a, b, and c) produce the same results? Briefly explain.

All of the methods for testing for normality (a, b, and c) produce the same results that beaver1.temp is more normally distributed than beaver2.temp. In question a, qqplot of beaver1.temp is more likely to be a line. In question b, the dense curve of beaver1.temp is more likely to be the dense curve of normal distribution. In question c, #W and p-value of temp1 are both greater than temp2’s, so beaver1 temp is more normally distributed.

[4 points]

Draw two histograms of the number of deaths attributed to coronary artery disease among doctors in the breslow dataset (boot package), one for smokers and one for non-smokers. Hint: read the help file ?breslow to understand the data.

data(breslow, package = "boot")
breslow_c <-breslow
breslow_c <- transform(breslow_c, d_per = (y/n)*100)
breslow_c$smoke[breslow_c$smoke == 0] <- "non smoke"
breslow_c$smoke[breslow_c$smoke == 1] <- "smoke"

non_smoke <- breslow[breslow$smoke == 0, ]
smoke <- breslow[breslow$smoke == 1,]

non_smoke <- transform(non_smoke, d_per = (y/n)*100)
smoke <- transform(smoke, d_per = (y/n)*100)



ggplot(non_smoke, aes(age, y)) + geom_bar(stat = 'identity', fill ="blue", colour = "black") + 
  ylab("number of death by coronary disease")  + ggtitle("Death by Coronary Disease for Non-smoker respect to Age")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(smoke, aes(age, y)) + geom_bar(stat = 'identity', fill ="red", colour = "black") + 
  ylab("number of death by coronary disease") + ggtitle("Death by Coronary Disease for Smoker respect to Age")+
  theme(plot.title = element_text(hjust = 0.5))

It is also examined further by student to display percentage of death(number of death by coronary diseases/number of people in group) coronary diseases between each group(smoke vs non-smoke). You can see that from age 40-70 group, smoke group has higher percentage of death by coronary disease. But in age group 80, surprisingly, non-smoke group has higher percentage of death by coronary diease.

ggplot(breslow_c, aes(age, d_per,fill = smoke)) + geom_bar(stat='identity', position = 'dodge')  + 
  ylab("percentage of death by coronary disease(y/n, %)") + ggtitle("% Comparison of Death by Coronary Diesase between Smoker .vs Non-smoker")+
  theme(plot.title = element_text(hjust = 0.5))

5. Nutrition

[6 points]

Data: NutritionStudy in the Lock5withR package

Create a new categorical variable that represents ages in 10-year groups: 0-10, 11-20, 21-30, etc. Choose one of the continuous variables in the dataset and create a ridgeline plot (ggridges package) showing the distribution of the chosen variable by age.

library(ggridges)
data(NutritionStudy, package ="Lock5withR")

NStudy <- NutritionStudy
NStudy <- transform(NStudy, n_Age = Age)

max(NStudy$Age) # 83

## [1] 83

min(NStudy$Age) # 19

## [1] 19

NStudy$n_Age[NStudy$n_Age >= 11 & NStudy$n_Age <=20] <- "11-20"
NStudy$n_Age[NStudy$n_Age >= 21 & NStudy$n_Age <=30] <- "21-30"
NStudy$n_Age[NStudy$n_Age >= 31 & NStudy$n_Age <=40] <- "31-40"
NStudy$n_Age[NStudy$n_Age >= 41 & NStudy$n_Age <=50] <- "41-50"
NStudy$n_Age[NStudy$n_Age >= 51 & NStudy$n_Age <=60] <- "51-60"
NStudy$n_Age[NStudy$n_Age >= 61 & NStudy$n_Age <=70] <- "61-70"
NStudy$n_Age[NStudy$n_Age >= 71 & NStudy$n_Age <=80] <- "71-80"
NStudy$n_Age[NStudy$n_Age >= 81 & NStudy$n_Age <=90] <- "81-90"

ggplot(NStudy, aes(Cholesterol, n_Age, fill = n_Age)) + geom_density_ridges() +
  scale_fill_discrete(name='Age in Category') + ylab("Age in Category") + 
  ggtitle("Ridgeline Plot of Cholesterol by Age Group")+
  theme(plot.title = element_text(hjust = 0.5))

Display the same data as in part a) using boxplots.

ggplot(NStudy, aes(n_Age, Cholesterol, group = n_Age)) + geom_boxplot(outlier.color = "blue") + coord_flip() +
  xlab("Age in Category") +  
  ggtitle("Box Plot of Cholesterol by Age Group")+
  theme(plot.title = element_text(hjust = 0.5))

Compare a) and b). Which do you think is more effective for this data and why?

It will differ based on the purpose of experiment and data usuage. Ridgeline graph is more effective to find the trend of data such as multimodality and probability density. Boxplot is more effective to find statistical parameter such as median, range of Q1-Q4 quartile and outliers.

Problem Set 1

Kyungjin Kim. Gaoge Liu

01/27/2021

1. Education and Income

2. Boundaries

3. Beavers

5. Nutrition