Grammar of graphics

In-class exercises:

1. Find out what each code chunk (indicated by ‘##’) in the R script does and provide comments.

2. The data set concerns grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. After deleting pupils with missing values, the number of pupils is 2,287 and the number of schools is 131. Class size ranges from 4 to 35. The response variables are the scores on a language test and on an arithmetic test. The research interest is in how the two test scores depend on the pupil’s intelligence (verbal IQ) and on the number of pupils in a school class.

Class size is categorized into small, medium, and large, with roughly equal numbers of observations in each category. Verbal IQ is categorized into low, middle, and high, again with roughly equal numbers of observations in each category. Reproduce the plot below.

Column 1: School ID
Column 2: Pupil ID
Column 3: Verbal IQ score
Column 4: The number of pupils in a class
Column 5: Language test score
Column 6: Arithmetic test score

  school pupil  IQV size lang arith
1      1 17001 15.0   29   46    24
2      1 17002 14.5   29   45    19
3      1 17003  9.5   29   33    24
4      1 17004 11.0   29   46    26
5      1 17005  8.0   29   20     9
6      1 17006  9.5   29   30    13
[1] "school" "pupil"  "IQV"    "size"   "lang"   "arith" 
  school pupil  IQV size lang arith sizef IQV_f
1      1 17001 15.0   29   46    24 Large  High
2      1 17002 14.5   29   45    19 Large  High
3      1 17003  9.5   29   33    24 Large   Low
4      1 17004 11.0   29   46    26 Large   Low
5      1 17005  8.0   29   20     9 Large   Low
6      1 17006  9.5   29   30    13 Large   Low
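
One way to form three roughly equal-sized categories is to cut each variable at its tertiles. The sketch below is only an illustration: `toy` is made-up data standing in for the real file, and `cut3` is a hypothetical helper. The column names `sizef` and `IQV_f` follow the output above.

```r
# Made-up stand-in for the real data (an assumption, not the course file)
toy <- data.frame(
  size = rep(4:35, length.out = 300),
  IQV  = rep(seq(4, 18, by = 0.5), length.out = 300)
)

# Hypothetical helper: cut x at its tertiles into three ordered categories
cut3 <- function(x, labels) {
  brk <- quantile(x, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
  cut(x, breaks = brk, labels = labels,
      include.lowest = TRUE, ordered_result = TRUE)
}

toy$sizef <- cut3(toy$size, c("Small", "Medium", "Large"))
toy$IQV_f <- cut3(toy$IQV, c("Low", "Middle", "High"))

# Check that the three groups are of comparable size
table(toy$sizef)
table(toy$IQV_f)
```

If a variable has many tied values, two tertile breaks can coincide; `cut()` then stops with a "'breaks' are not unique" error and the cut points have to be chosen by hand.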

3. Use USPersonalExpenditure{datasets} for this problem. This data set consists of United States personal expenditures (in billions of dollars) in five categories: food and tobacco, household operation, medical and health, personal care, and private education, for the years 1940, 1945, 1950, 1955 and 1960.

Plot the US personal expenditure data in the style of the third plot in the “Time Use” case study on the course web page. You might want to transform the dollar amounts to log base 10 units first.

                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
 num [1:5, 1:5] 22.2 10.5 3.53 1.04 0.341 44.5 15.5 5.76 1.98 0.974 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:5] "Food and Tobacco" "Household Operation" "Medical and Health" "Personal Care" ...
  ..$ : chr [1:5] "1940" "1945" "1950" "1955" ...
     1940y  1945y 1950y 1955y 1960y
F&T 22.200 44.500 59.60  73.2 86.80
HO  10.500 15.500 29.00  36.5 46.20
M&H  3.530  5.760  9.71  14.0 21.10
PC   1.040  1.980  2.45   3.4  5.40
PE   0.341  0.974  1.80   2.6  3.64
  Var1  Var2  value
1  F&T 1940y 22.200
2   HO 1940y 10.500
3  M&H 1940y  3.530
4   PC 1940y  1.040
5   PE 1940y  0.341
6  F&T 1945y 44.500
[1]  1.35  1.02  0.55  0.02 -0.47  1.65
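
The data preparation behind the output above can be sketched in base R as follows. The short row labels and the `y` suffix on the years are taken from the output; the object names `dta` and `long` are placeholders. The `Var1`, `Var2`, `value` layout looks like what `reshape2::melt()` returns, so `as.data.frame(as.table(...))` is used here only as a base-R stand-in for whatever the original script did.

```r
# Built-in 5 x 5 matrix: categories in rows, years in columns
dta <- USPersonalExpenditure

# Shorten the labels as in the output above
dimnames(dta) <- list(c("F&T", "HO", "M&H", "PC", "PE"),
                      paste0(colnames(dta), "y"))

# Wide matrix -> long data frame with one row per category-year pair
long <- as.data.frame(as.table(dta), responseName = "value")

# Log base 10 of the dollar amounts
long$log10_value <- log10(long$value)

head(long)
round(head(long$log10_value), 2)
```

The long format has one expenditure value per row, which is the shape ggplot2 expects when category and year are mapped to aesthetics or facets.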

4. A sample of 158 children with autism spectrum disorder was recruited. Social development was assessed using the Vineland Adaptive Behavior Interview survey form, a parent-reported measure of socialization. It is a combined score that includes assessments of interpersonal relationships, play/leisure time activities, and coping skills. Initial language development was assessed using the Sequenced Inventory of Communication Development (SICD) scale. These assessments were repeated when the children were 3, 5, 9, and 13 years of age.

Source: West, B.T., Welch, K.B., & Galecki, A.T. (2002). Linear Mixed Models: A Practical Guide Using Statistical Software. pp. 220-271.

Data: autism{WWGbook}

Column 1: Age (in years)
Column 2: Vineland Socialization Age Equivalent score
Column 3: Sequenced Inventory of Communication Development Expressive Group (1 = Low, 2 = Medium, 3 = High)
Column 4: Child ID

Replicate the two plots above using ggplot2.

  age vsae sicdegp childid
1   2    6       3       1
2   3    7       3       1
3   5   18       3       1
4   9   25       3       1
5  13   27       3       1
6   2   17       3       3
  age vsae sicdegp childid sic
1   2    6       3       1   H
2   3    7       3       1   H
3   5   18       3       1   H
4   9   25       3       1   H
5  13   27       3       1   H
6   2   17       3       3   H
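
The recoding shown in the second output, and one possible plot of individual trajectories, can be sketched as below. Only the six rows printed above are typed in, under the placeholder name `autism6`; with the WWGbook package installed, the full `autism` data frame would be used instead. The target plots are not reproduced on this page, so the layout (one line per child, one panel per SICD group) is a guess rather than the intended answer.

```r
library(ggplot2)

# The six rows printed above (not the full data set)
autism6 <- data.frame(
  age     = c(2, 3, 5, 9, 13, 2),
  vsae    = c(6, 7, 18, 25, 27, 17),
  sicdegp = c(3, 3, 3, 3, 3, 3),
  childid = c(1, 1, 1, 1, 1, 3)
)

# Recode the SICD group (1 = Low, 2 = Medium, 3 = High) as a labelled factor
autism6$sic <- factor(autism6$sicdegp, levels = 1:3, labels = c("L", "M", "H"))

# One trajectory per child, one panel per SICD group
p <- ggplot(autism6, aes(x = age, y = vsae, group = childid)) +
  geom_line(alpha = 0.5) +
  geom_point() +
  facet_wrap(~ sic, drop = FALSE) +
  labs(x = "Age (years)", y = "VSAE score")
```

Printing `p` draws the figure. Giving `levels = 1:3` explicitly keeps all three groups in the factor even when a subset contains only one of them.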

5. Use the diabetes data set to generate a plot similar to the one below, and interpret the plot.

   SEQN RIAGENDR RIDRETH1 DIQ010 BMXBMI  gender     race diabetes           BMI
1 51624        1        3      2  32.22   Males    White       No    Overweight
2 51626        1        4      2  22.00   Males    Black       No Normal weight
3 51627        1        4      2  18.22   Males    Black       No Normal weight
4 51628        2        4      1  42.39 Females    Black      Yes    Overweight
5 51629        1        1      2  32.61   Males Hispanic       No    Overweight
6 51630        2        3      2  30.57 Females    White       No    Overweight
      race  gender diabetes           BMI Freq
1    Black Females       No Normal weight  347
2 Hispanic Females       No Normal weight  712
3    White Females       No Normal weight  998
4    Black   Males       No Normal weight  429
5 Hispanic   Males       No Normal weight  706
6    White   Males       No Normal weight  873
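
The second output above is a frequency table over the four categorical variables. A minimal base-R sketch of that step follows, using only the six rows printed in the first output under the placeholder name `diab6`, so the counts it produces are not those of the full data set.

```r
# The six rows printed above, categorical columns only
diab6 <- data.frame(
  race     = c("White", "Black", "Black", "Black", "Hispanic", "White"),
  gender   = c("Males", "Males", "Males", "Females", "Males", "Females"),
  diabetes = c("No", "No", "No", "Yes", "No", "No"),
  BMI      = c("Overweight", "Normal weight", "Normal weight",
               "Overweight", "Overweight", "Overweight")
)

# Cross-classify, then flatten to one row per combination of levels
counts <- as.data.frame(table(diab6[c("race", "gender", "diabetes", "BMI")]))

head(counts)
```

Each row of `counts` is one cell of the four-way table, including cells with a count of zero, which is a convenient input for bar charts or mosaic-style plots.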

6. Find out what each code chunk (indicated by ‘##’) in the R script does and provide comments.

# Load ggplot2 and open its help page
library(ggplot2)
?ggplot2

# install.packages('formatR')
library(formatR)

# Install (once, if needed) and load gapminder
# install.packages('gapminder')
library(gapminder)

# Load the gapminder data set and inspect its structure
data(gapminder)
str(gapminder)

# Assign the data to gap
gap <- gapminder

# Draw the empty plot frame: lifeExp is mapped to the x axis, but no layer is added yet
ggplot(data = gap, aes(x = lifeExp))

# Add a histogram of the data
ggplot(data = gap, aes(x = lifeExp)) +
  geom_histogram()

# Fill the bars blue with black outlines (10 bins), add a title and x- and y-axis
# labels, and apply theme_classic() (axis lines, no gridlines)
ggplot(data = gap, aes(x = lifeExp)) +
  geom_histogram(fill = 'blue', color = 'black', bins = 10) +
  ggtitle('Life expectancy for the gap dataset') +
  xlab('Life expectancy (years)') +
  ylab('Frequency') +
  theme_classic()


# Switch to box plots of lifeExp by continent, coloured through aes(fill = continent);
# add a title and x- and y-axis labels, and apply theme_minimal() (no panel background or border)
ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  ggtitle('Boxplots for lifeExp by continent') +
  xlab('Continent') +
  ylab('Life expectancy (years)') +
  theme_minimal() # +
  # guides(fill = FALSE)

# What happens if you un-hashtag `guides(fill = FALSE)` and the plus sign in
# the last two lines of the chunk above?

# The colour-key legend for the continents on the right no longer appears.

# Switch to a scatter plot with colour and shape mapped to continent; add a title and
# x- and y-axis labels, apply theme_classic() for the non-data elements, and use a
# second theme() call to adjust the legend position, text sizes, and so on
ggplot(data = gap, aes(x = lifeExp, y = gdpPercap, color = continent, shape = continent)) +
  geom_point(size = 5, alpha = 0.5) +
  ggtitle('Scatterplot of life expectancy by gdpPercap') +
  theme_classic() +
  xlab('Life expectancy (years)') +
  ylab('gdpPercap (USD)') +
  theme(legend.position = 'top',
        plot.title = element_text(hjust = 0.5, size = 20),
        legend.title = element_text(size = 10),
        legend.text = element_text(size = 5),
        axis.text.x = element_text(angle = 45, hjust = 1))

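# In the ggplot code above, what are the arguments inside our second
# theme() call doing?
# theme() adjusts the non-data elements of the plot. Here legend.position = 'top'
# moves the legend above the panel, plot.title centers the title and sets its size
# to 20, legend.title and legend.text set the legend font sizes to 10 and 5, and
# axis.text.x rotates the x-axis tick labels by 45 degrees. Deleting the call and
# rerunning changes only these details, not the plotted data.
# Also, this R Markdown file would not knit with the code active, so the code was
# commented out with '#' when knitting.
# The End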
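
On the `guides(fill = FALSE)` question above: in recent ggplot2 releases this form is deprecated and triggers a warning, and `guides(fill = "none")` is the current spelling. The sketch below shows the equivalent call. It uses the built-in `iris` data instead of gapminder purely so that it runs without extra packages; the object names `p` and `p_no_legend` are placeholders.

```r
library(ggplot2)

# Box plots filled by group; a fill legend is drawn by default
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  theme_minimal()

# Same plot with the fill legend suppressed
p_no_legend <- p + guides(fill = "none")
```

Here the legend repeats what the x axis already shows, so dropping it frees horizontal space without losing information. `theme(legend.position = "none")` is an alternative that hides all legends at once.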

Statistical graphics

Exercises:

1. The distribution of personal disposable income in Taiwan in 2015 has a story to tell.

Revise the following plot to enhance that message.

              Income  Count
1  160,000 and under 807160
2 160,000 to 179,999 301650
3 180,000 to 199,999 313992
4 200,000 to 219,999 329290
5 220,000 to 239,999 369583
6 240,000 to 259,999 452671
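
As a starting point for the revision, the sketch below redraws the six brackets printed above as a horizontal bar chart with the brackets kept in income order. Only the rows shown above are used, so this is not the full distribution; `inc` and `p` are placeholder names, and the layout is one option among many rather than the intended answer.

```r
library(ggplot2)

# The six rows printed above (the full table has more brackets)
inc <- data.frame(
  Income = c("160,000 and under", "160,000 to 179,999", "180,000 to 199,999",
             "200,000 to 219,999", "220,000 to 239,999", "240,000 to 259,999"),
  Count  = c(807160, 301650, 313992, 329290, 369583, 452671)
)

# Fix the bracket order so it follows income rather than alphabetical sorting
inc$Income <- factor(inc$Income, levels = inc$Income)

# Horizontal bars leave room for the long bracket labels
p <- ggplot(inc, aes(x = Income, y = Count)) +
  geom_col(fill = "grey40") +
  coord_flip() +
  labs(x = "Income bracket", y = "Count")
```

Note that the first bracket is open-ended while the others are 20,000 wide, so its bar is not directly comparable with the rest; a revised plot may want to make that visible.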

2. Comment on how the graphs presented in this link violate the principles for effective graphics, and on how you would revise them.

3. Sarah Leo at The Economist published a data set to accompany the story about how scientific publishing is dominated by men. The plot in the left panel below is the original graph that appeared in the article.

Help her find a better plot.