Grammar_of_graphics_only_in_class_Exercises_&_Statistical_graphics

Grammar of graphics

In-class Exercises:

1.Find out what each code chunk (indicated by ‘##’) in the R script does and provide comments.

# Base graphics: set up an empty plot for the women data (type='n' draws the axes but no points)
plot(women, type='n')
# Add a single point for the first row of women
points(women[1,])

# lattice: plot weight (y) against height (x) for the women data, keeping only the row whose row name is 1 and drawing it as a point
lattice::xyplot(weight ~ height, 
  data=women,
  subset=row.names(women)==1, type='p')

# ggplot2: plot weight (y) against height (x), using only the first row of women, drawn as a point
library(ggplot2)
ggplot(data=women[1,], aes(height, weight))+
  geom_point()
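
One difference between the chunks is worth noting: the base-graphics version sets the axis ranges from all of `women` before drawing the first row, while the ggplot2 version only sees one row, so its axes are set by that single point. A minimal sketch of a closer ggplot2 equivalent, assuming `geom_blank()` is an acceptable stand-in for `type='n'` (the object name `p_first` is made up for illustration):

```r
library(ggplot2)

# geom_blank() draws nothing but still trains the scales on all rows of women,
# which mimics plot(women, type = 'n') in base graphics.
p_first <- ggplot(women, aes(height, weight)) +
  geom_blank() +
  # Add only the first observation, as points(women[1, ]) does.
  geom_point(data = women[1, ])

p_first
```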

2.The data set is concerned with grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. After deleting pupils with missing values, the number of pupils is 2,287 and the number of schools is 131. Class size ranges from 4 to 35. The response variables are the scores on a language test and on an arithmetic test. The research interest is in how the two test scores depend on the pupil’s intelligence (verbal IQ) and on the number of pupils in a school class.

The class size is categorized into small, medium, and large with roughly equal number of observations in each category. The verbal IQ is categorized into low, middle and high with roughly equal number of observations in each category. Reproduce the plot below.

Column 1: School ID
Column 2: Pupil ID
Column 3: Verbal IQ score
Column 4: The number of pupils in a class
Column 5: Language test score
Column 6: Arithmetic test score

# Read the data file and look at the first rows
dta <- read.table("C:/Users/boss/Desktop/data_management/langMathDutch.txt", header = TRUE)
head(dta)

  school pupil  IQV size lang arith
1      1 17001 15.0   29   46    24
2      1 17002 14.5   29   45    19
3      1 17003  9.5   29   33    24
4      1 17004 11.0   29   46    26
5      1 17005  8.0   29   20     9
6      1 17006  9.5   29   30    13

names(dta)

[1] "school" "pupil"  "IQV"    "size"   "lang"   "arith"

library(ggplot2)
library(tidyverse)
library(lattice)
library(gridExtra)

# Add two ordered factor columns: size cut into "Small", "Medium", "Large" and IQV cut into "Low", "Middle", "High", using the 33rd and 66th percentiles as cut points

dta <- dta %>% mutate(sizef=cut(size, 
                      breaks=quantile(size, 
                                      probs=c(0, .33, .66, 1)),
                      labels=c("Small", "Medium", "Large"), 
                      ordered_result=TRUE, 
                      include.lowest=TRUE))
                      
dta <- dta %>% mutate(IQV_f=cut(IQV, 
                      breaks=quantile(IQV, 
                                      probs=c(0, .33,.66, 1)),
                      labels=c("Low", "Middle", "High"), 
                      ordered_result=TRUE, 
                      include.lowest=TRUE))

# Check that the two new factor columns were added
head(dta)

  school pupil  IQV size lang arith sizef IQV_f
1      1 17001 15.0   29   46    24 Large  High
2      1 17002 14.5   29   45    19 Large  High
3      1 17003  9.5   29   33    24 Large   Low
4      1 17004 11.0   29   46    26 Large   Low
5      1 17005  8.0   29   20     9 Large   Low
6      1 17006  9.5   29   30    13 Large   Low
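
The tercile split above can be checked on a small example. The sketch below uses a made-up vector `x` (not the pupil data) and exact thirds (`1/3`, `2/3`) rather than `.33` and `.66`; whether exact thirds reproduce the target plot is an assumption.

```r
# Hypothetical scores, used only to illustrate the technique.
x <- 1:30

# Cut points at the minimum, the two tercile boundaries, and the maximum.
brks <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value in the first bin;
# ordered_result = TRUE returns an ordered factor (Low < Middle < High).
x_f <- cut(x, breaks = brks,
           labels = c("Low", "Middle", "High"),
           include.lowest = TRUE, ordered_result = TRUE)

# Each category should hold roughly one third of the observations.
table(x_f)
```

Running `table()` on `dta$sizef` and `dta$IQV_f` in the same way would show how even the split is for the real data, where ties at the cut points can make the groups unequal.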

# Draw a grid of panels (class size by verbal IQ) so the subgroups can be compared

p1 <- ggplot(dta, 
             aes(lang, arith)) +
  stat_smooth(method="lm", 
              formula=y ~ x) +
  geom_point(shape=20) +
  facet_grid(sizef ~ IQV_f) +
  labs(x="Language score", y="Arithmetic score") +
  theme_bw() 

p1

# Apart from the colours and the title position, which I could not adjust, the plot is close to what the instructor asked for

3.Use the USPersonalExpenditure{datasets} for this problem. This data set consists of United States personal expenditures (in billions of dollars) in the categories: food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955 and 1960.

Plot the US personal expenditure data in the style of the third plot on the “Time Use” case study in the course web page. You might want to transform the dollar amounts to log base 10 unit first.

# Load packages
library(ggplot2)
library(tidyverse)
library(lattice)
library(grid)
library(gridExtra)
# Load the data
require(stats) 
USPersonalExpenditure

                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64

dta <- USPersonalExpenditure

# Inspect the structure of the data
str(dta)

 num [1:5, 1:5] 22.2 10.5 3.53 1.04 0.341 44.5 15.5 5.76 1.98 0.974 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "Food and Tobacco" "Household Operation" "Medical and Health" "Personal Care" ...
  ..$ : chr [1:5] "1940" "1945" "1950" "1955" ...

# Rename the columns and rows
colnames(dta) <- c("1940y","1945y","1950y","1955y","1960y")
rownames(dta) <- c("F&T","HO","M&H","PC","PE")
# Look at the data again
head(dta)

     1940y  1945y 1950y 1955y 1960y
F&T 22.200 44.500 59.60  73.2 86.80
HO  10.500 15.500 29.00  36.5 46.20
M&H  3.530  5.760  9.71  14.0 21.10
PC   1.040  1.980  2.45   3.4  5.40
PE   0.341  0.974  1.80   2.6  3.64

# Reshape the data from wide to long format
library(reshape2)
dta1 <- melt(dta)
# Look at the reshaped data
head(dta1)

  Var1  Var2  value
1  F&T 1940y 22.200
2   HO 1940y 10.500
3  M&H 1940y  3.530
4   PC 1940y  1.040
5   PE 1940y  0.341
6  F&T 1945y 44.500

# Rename the columns
colnames(dta1) <- c("activity", "year", "expenditure")
# Take log10 of expenditure and round to 2 decimal places
logexpenditure <- round(log10(dta1$expenditure),2)
head(logexpenditure)

[1]  1.35  1.02  0.55  0.02 -0.47  1.65

# Draw the plot
qplot(logexpenditure, activity, data = dta1) +
  geom_segment(aes(xend = 0, yend = activity)) +
  geom_vline(xintercept = 0, colour = "grey50") +
  facet_wrap(~ year, nrow = 1)
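
A base-R alternative to the `melt()` step, sketched under the assumption that keeping the log-transformed values inside the data frame is preferable to a free-standing vector. `as.data.frame(as.table(...))` is standard R; the names `long` and `log_exp` are made up here.

```r
# USPersonalExpenditure is a 5 x 5 matrix shipped with R (package datasets).
long <- as.data.frame(as.table(USPersonalExpenditure),
                      stringsAsFactors = FALSE)
names(long) <- c("activity", "year", "expenditure")

# Store the log10 values as a column so they stay aligned with their rows.
long$log_exp <- round(log10(long$expenditure), 2)

head(long)
```

With the values stored as a column, the plotting call can refer to `log_exp` through the `data` argument instead of relying on a vector in the global environment.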

4.A sample of 158 children with autism spectrum disorder were recruited. Social development was assessed using the Vineland Adaptive Behavior Interview survey form, a parent-reported measure of socialization. It is a combined score that included assessment of interpersonal relationships, play/leisure time activities, and coping skills. Initial language development was assessed using the Sequenced Inventory of Communication Development (SICD) scale. These assessments were repeated on these children when they were 3, 5, 9, 13 years of age.

Source: West, B.T., Welch, K.B., & Galecki, A.T. (2002). Linear Mixed Models: Practical Guide Using Statistical Software. p. 220-271.

Data: autism{WWGbook}

Column 1: Age (in years)
Column 2: Vineland Socialization Age Equivalent score
Column 3: Sequenced Inventory of Communication Development Expressive Group (1 = Low, 2 = Medium, 3 = High)
Column 4: Child ID

Replicate the two plots above using ggplot2.

# Load packages
library(ggplot2)
library(tidyverse)
library(lattice)
library(grid)
library(gridExtra)
library(hrbrthemes)
# Load the data
pacman::p_load(WWGbook)
data("autism", package="WWGbook")
dta <- WWGbook::autism
# Inspect the data
head(dta)

  age vsae sicdegp childid
1   2    6       3       1
2   3    7       3       1
3   5   18       3       1
4   9   25       3       1
5  13   27       3       1
6   2   17       3       3

# Recode sicdegp as a factor with labels L, M, H, in that order
dta$sic <- factor(dta$sicdegp, levels = c(1,2,3), labels = c("L", "M", "H"))
# Look at the data again
head(dta)

  age vsae sicdegp childid sic
1   2    6       3       1   H
2   3    7       3       1   H
3   5   18       3       1   H
4   9   25       3       1   H
5  13   27       3       1   H
6   2   17       3       3   H

# Drop rows with missing values
dta1 <- na.omit(dta)
# Put age on the log10 scale (rounded to 2 decimals) and keep it in the data frame
dta1$logage <- round(log10(dta1$age), 2)
# Draw the plot: one grey path per child and one fitted line per panel
p0 <- ggplot(data=dta1, 
             aes(x=logage, 
                 y=vsae))+
      geom_point()+
      geom_path(aes(group=childid), color="grey")+
      labs(x='Age (log10 years)', 
      y='VSAE score', 
      title = NULL)+
      stat_smooth(formula=y ~ x,
              method='lm', 
              se=FALSE)+
      facet_grid(. ~ sic)
  
p0

# For the second plot, first shift age by 2 (age - 2), as in the target figure
p1 <- dta1 %>% mutate(age2=age-2) %>% 
               group_by(sic, age2) %>%
               dplyr::summarise(n=n(), m_p=mean(vsae),se_p=sd(vsae)/sqrt(n)) %>%
               ggplot() + 
               aes(age2, m_p, group=sic, shape=sic) +
# Then draw the error bars (mean +/- 1 SE), lines and points
  geom_errorbar(aes(ymin=m_p - se_p,
                    ymax=m_p + se_p),
                width=.2, size=.3) +
  geom_line(aes(linetype=sic), show.legend=TRUE)+
  geom_point(size=rel(3),show.legend=TRUE) +
  scale_shape(guide=guide_legend(title=NULL)) +
  labs(x="Age (in years - 2)", y="VSAE score") +
  theme_ipsum() +
  theme(legend.position=c(.1, .8))

p1

# I have not yet figured out how to change the legend title to "Group", but the plot is otherwise close to the target
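
On the legend title mentioned in the comment above: in ggplot2, giving the `shape` and `linetype` aesthetics the same title lets the two legends merge into one. A minimal sketch with a made-up summary table (`toy`; the numbers are invented for illustration and are not the autism results):

```r
library(ggplot2)

# Hypothetical group means, for illustration only.
toy <- data.frame(
  age2 = rep(c(0, 1, 3), times = 2),
  m_p  = c(5, 8, 12, 7, 12, 20),
  sic  = factor(rep(c("L", "H"), each = 3), levels = c("L", "H"))
)

p_legend <- ggplot(toy, aes(age2, m_p, group = sic)) +
  geom_line(aes(linetype = sic)) +
  geom_point(aes(shape = sic), size = 3) +
  # The same title on both aesthetics gives a single legend headed "Group".
  labs(shape = "Group", linetype = "Group",
       x = "Age (in years - 2)", y = "VSAE score")

p_legend
```

In the chain for `p1`, `scale_shape(guide=guide_legend(title=NULL))` clears only the shape title, so it would presumably need to be dropped for the `labs()` titles to take effect.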

5.Use the diabetes dataset to generate a plot similar to the one below and interpret the plot.

# Load packages
library(ggplot2)
library(tidyverse)
library(lattice)
library(grid)
library(gridExtra)
library(hrbrthemes)
library(reshape2)
library(ggalluvial)

# Read the data
dta <- read.csv("C:/Users/boss/Desktop/data_management/diabetes_mell.csv")
# Inspect the data
head(dta)

   SEQN RIAGENDR RIDRETH1 DIQ010 BMXBMI  gender     race diabetes           BMI
1 51624        1        3      2  32.22   Males    White       No    Overweight
2 51626        1        4      2  22.00   Males    Black       No Normal weight
3 51627        1        4      2  18.22   Males    Black       No Normal weight
4 51628        2        4      1  42.39 Females    Black      Yes    Overweight
5 51629        1        1      2  32.61   Males Hispanic       No    Overweight
6 51630        2        3      2  30.57 Females    White       No    Overweight

# Tabulate the counts for each combination of race, gender, diabetes and BMI, then draw the alluvial plot
dta_v3 <- data.frame(xtabs(data = dta, ~ race + gender + diabetes + BMI))

head(dta_v3)

      race  gender diabetes           BMI Freq
1    Black Females       No Normal weight  347
2 Hispanic Females       No Normal weight  712
3    White Females       No Normal weight  998
4    Black   Males       No Normal weight  429
5 Hispanic   Males       No Normal weight  706
6    White   Males       No Normal weight  873

p <- ggplot(dta_v3, 
       aes(axis1=race,
           axis2=gender, 
           axis3=diabetes, 
           y=Freq)) +
  scale_x_discrete(limits=c("race", 
                            "gender", 
                            "diabetes"), 
                   expand=c(.1, .01)) +
  labs(x='', 
       y='No. of individuals') +
  geom_alluvium(aes(fill=BMI)) +
  geom_stratum() + 
  geom_text(stat="stratum", 
            infer.label=TRUE) +
  scale_fill_manual(values=c('skyblue','hotpink'))+
  theme_minimal() +
  ggtitle("Diabetes in overall population in US 2009-2010") +
  labs(subtitle = "stratified by race, gender and diabetes mellitus") + 
  theme(legend.position = "bottom")
  
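# Apart from the colours, the plot roughly matches the target
# The plot shows that, in every racial group and for both genders, relatively few overweight people have diabetes
# However, some White individuals have diabetes even though they are not overweight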
p
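
The interpretation above can be checked numerically by turning counts into within-group proportions. The sketch below uses a made-up table (`toy`, with invented frequencies), since only the first six rows of `dta_v3` are shown; `prop.table()` with `margin = 1` gives row proportions.

```r
# Hypothetical counts, for illustration only (not the real survey figures).
toy <- data.frame(
  BMI      = c("Normal weight", "Normal weight", "Overweight", "Overweight"),
  diabetes = c("No", "Yes", "No", "Yes"),
  Freq     = c(900, 100, 800, 200)
)

# Cross-tabulate the counts: rows are BMI categories, columns diabetes status.
tab <- xtabs(Freq ~ BMI + diabetes, data = toy)

# Share of each BMI category with and without diabetes (each row sums to 1).
pt <- prop.table(tab, margin = 1)
pt
```

Applying the same two calls to `dta_v3` would give the actual proportions behind the widths of the flows in the alluvial plot.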

6.Find out what each code chunk (indicated by ‘##’) in the R script does and provide comments.

# Load ggplot2 and open its help page
# library(ggplot2)
# ?ggplot2

# Install and load formatR
# install.packages('formatR')
# library(formatR)

# Install and load gapminder
# install.packages('gapminder')
# library(gapminder)

# Load the gapminder data set and inspect its structure
# data(gapminder)
# str(gapminder)

# Store the data under the name gap
# gap <- gapminder

# Draw an empty plot (background and axes only, no geom yet)
# ggplot(data = gap, aes(x = lifeExp))

# Add a histogram of lifeExp
# ggplot(data = gap, aes(x = lifeExp)) + geom_histogram()

# Fill the histogram bars blue with black outlines, use 10 bins, add a title and
# x- and y-axis labels, and apply theme_classic() to set the overall appearance
# ggplot(data = gap, aes(x = lifeExp)) +
#   geom_histogram(fill = 'blue', color = 'black', bins = 10) +
#   ggtitle('Life expectancy for the gap dataset') +
#   xlab('Life expectancy (years)') + ylab('Frequency') + theme_classic()

# Switch to box plots of lifeExp by continent, with each box filled by continent,
# add a title and x- and y-axis labels, and apply theme_minimal()
# ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) +
#   geom_boxplot() + ggtitle('Boxplots for lifeExp by continent') +
#   xlab('Continent') + ylab('Life expectancy (years)') + theme_minimal() #+
#   guides(fill = FALSE)

# What happens if you un-hashtag `guides(fill = FALSE)` and the plus sign in
# lines 68 and 69 above?

# The colour legend for the continents on the right no longer appears

# Switch to a scatter plot with a title and x- and y-axis labels; theme_classic()
# sets the non-data elements of the plot, and the second theme() call adjusts the
# legend position and the size, alignment and angle of the text
# ggplot(data = gap, aes(x = lifeExp, y = gdpPercap, color = continent,
#                        shape = continent)) +
#   geom_point(size = 5, alpha = 0.5) +
#   ggtitle('Scatterplot of life expectancy by gdpPercap') +
#   theme_classic() + xlab('Life expectancy (years)') +
#   ylab('gdpPercap (USD)') +
#   theme(legend.position = 'top',
#         plot.title = element_text(hjust = 0.5, size = 20),
#         legend.title = element_text(size = 10),
#         legend.text = element_text(size = 5),
#         axis.text.x = element_text(angle = 45, hjust = 1))

# In the ggplot code above, what are the arguments inside of our second
# 'theme' argument doing?
# According to what I found online, theme() adjusts the non-data elements of the
# plot; I tried deleting that call and re-running, and did not notice any effect
# Also, this R Markdown file would not knit, so the code was forcibly commented out with # marks
# The End
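
For the question about the second `theme()` call: each argument targets one non-data element of the plot. A minimal sketch on the built-in `mtcars` data (chosen here so the example runs without `gapminder`; the plot itself is made up) with comments on what each argument does:

```r
library(ggplot2)

p_theme <- ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
  geom_point() +
  ggtitle("Fuel economy by weight") +
  theme_classic() +
  theme(
    legend.position = "top",                             # legend above the panel
    plot.title   = element_text(hjust = 0.5, size = 20), # centred, larger title
    legend.title = element_text(size = 10),              # legend title font size
    legend.text  = element_text(size = 5),               # legend label font size
    axis.text.x  = element_text(angle = 45, hjust = 1)   # slanted x tick labels
  )

p_theme
```

Because a later `theme()` call overrides matching settings from `theme_classic()`, removing it should move the legend back to the right and left-align the title.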

Statistical graphics

Exercises:

1.The distribution of personal disposable income in Taiwan in 2015 has a story to tell.

Revise the following plot to enhance that message.

dta <- read.csv("C:/Users/boss/Desktop/data_management/income_tw.csv", header = TRUE)

head(dta)

              Income  Count
1  160,000 and under 807160
2 160,000 to 179,999 301650
3 180,000 to 199,999 313992
4 200,000 to 219,999 329290
5 220,000 to 239,999 369583
6 240,000 to 259,999 452671

names(dta) <- c("Income", "Count")




p <- ggplot(data = dta, aes(x = Income, y = Count)) + geom_col() + 
    coord_flip() + labs(title = "Distribution of disposable personal income in Taiwan in 2015", 
    y = "Number of persons")

p

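# Apart from the lowest bracket, which has the most people, the most common disposable-income bracket is 340,000 to 359,999; however, I do not yet know how to put the Income categories in order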
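
On ordering `Income`: ggplot2 sorts a character column alphabetically, so one common approach is to convert it to a factor whose levels follow the order of the rows in the file. A minimal sketch using the first three rows printed above (it assumes the file lists the brackets from lowest to highest; `toy` is a stand-in for `dta`):

```r
library(ggplot2)

toy <- data.frame(
  Income = c("160,000 and under", "160,000 to 179,999", "180,000 to 199,999"),
  Count  = c(807160, 301650, 313992)
)

# unique() preserves first-appearance order, so the factor levels follow
# the row order of the file instead of alphabetical order.
toy$Income <- factor(toy$Income, levels = unique(toy$Income))

p_income <- ggplot(toy, aes(x = Income, y = Count)) +
  geom_col() +
  coord_flip() +
  labs(y = "Number of persons")

p_income
```

The same `factor(..., levels = unique(...))` line applied to `dta$Income` before plotting should keep the brackets in file order on the flipped axis.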

2.Comment on how the graphs presented in this link violate the principles for effective graphics and how would you revise them.

3.Sarah Leo at the Economist magazine published a data set to accompany the story about how scientific publishing is dominated by men. The plot on the left panel below is the original graph that appeared in the article.

Help her find a better plot.


2020-04-27