DataM: HW Exercise 0330 1-5

HW exercise 1.

Select at random one school per county in the data set Caschool{Ecdat} and draw a scatter diagram of average math score mathscr against average reading score readscr for the sampled data set. Make sure your results are reproducible (e.g., the same random sample will be drawn each time).

[Solution and Answer]

Load in the dataset

library(Ecdat)
data(Caschool)
head(Caschool)

str(Caschool)

'data.frame':   420 obs. of  17 variables:
 $ distcod : int  75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
 $ county  : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
 $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
 $ grspan  : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
 $ enrltot : int  195 240 1550 243 1335 137 195 888 379 2247 ...
 $ teachers: num  10.9 11.1 82.9 14 71.5 ...
 $ calwpct : num  0.51 15.42 55.03 36.48 33.11 ...
 $ mealpct : num  2.04 47.92 76.32 77.05 78.43 ...
 $ computer: int  67 101 169 85 171 25 28 66 35 0 ...
 $ testscr : num  691 661 644 648 641 ...
 $ compstu : num  0.344 0.421 0.109 0.35 0.128 ...
 $ expnstu : num  6385 5099 5502 7102 5236 ...
 $ str     : num  17.9 21.5 18.7 17.4 18.7 ...
 $ avginc  : num  22.69 9.82 8.98 8.98 9.08 ...
 $ elpct   : num  0 4.58 30 0 13.86 ...
 $ readscr : num  692 660 636 652 642 ...
 $ mathscr : num  690 662 651 644 640 ...

Each row of the data frame holds the data for one school.

Stratified sampling (one school per county)

library(sampling)
set.seed(1234)    # fix the random seed so the same sample is drawn each time
sample_result <- strata(Caschool, stratanames = 'county',
                        size = rep(1, length(levels(Caschool$county))), # 1
                        method="srswor")  # 2
head(sample_result)

# 1: select one school per county
# 2: simple random sampling without replacement
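
As a cross-check on the stratified draw, the same idea (one row per group, reproducible under a fixed seed) can be sketched in base R without the sampling package. The toy data frame `toy` and the helper `pick_one_per_group()` are hypothetical names used only for illustration; they are not part of Ecdat or sampling.

```r
# Hypothetical toy data: 3 counties with several schools each
toy <- data.frame(
  county  = factor(c("A", "A", "A", "B", "B", "C")),
  readscr = c(650, 660, 655, 640, 645, 670),
  mathscr = c(652, 658, 650, 642, 641, 675)
)

# Hypothetical helper: draw one row per level of `group`
pick_one_per_group <- function(df, group, seed) {
  set.seed(seed)  # fix the random number stream
  # row indices per group; drop = TRUE skips unused factor levels
  idx <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
  # sample.int() avoids the sample(x) pitfall when a group has one row
  chosen <- vapply(idx, function(i) i[sample.int(length(i), 1)], integer(1))
  df[chosen, ]
}

s1 <- pick_one_per_group(toy, "county", seed = 1234)
s2 <- pick_one_per_group(toy, "county", seed = 1234)
identical(s1, s2)  # the same seed reproduces the same sample
```

Setting the seed inside the helper keeps each call reproducible on its own, at the cost of resetting the global random number stream.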

Draw the scatter plot using the sampled dataset

dta <- Caschool[sample_result$ID_unit,]
# math score (y) against reading score (x), as the exercise asks
lattice::xyplot(mathscr ~ readscr, groups=county, data=dta, type=c('p', 'g'), auto.key=list(columns=5))

HW exercise 2.

Find 133 class-level 95%-confidence intervals for language test score means of the nlschools{MASS} data set by using the tidy approach. The tail end of the data object should look as follows:

[Solution and Answer]

Load in the dataset

library(MASS)
dta <- nlschools
head(nlschools)

str(nlschools)

'data.frame':   2287 obs. of  6 variables:
 $ lang : int  46 45 33 46 20 30 30 57 36 36 ...
 $ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
 $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
 $ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
 $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Compute mean and 95% CI for language test score

The expected output shown in the question statement appears to be incorrect: its language_mean column actually contains the class means of IQ. Both means are therefore reported here.

library(dplyr)
library(tidyr)

nlschools %>% group_by(class) %>%
  summarise(IQ_mean = mean(IQ),
            language_mean = mean(lang),
            language_lb = gmodels::ci(lang, confidence=0.95)[2],
            language_ub = gmodels::ci(lang, confidence=0.95)[3]) %>%
  mutate(classID = row_number()) %>%   # one row per class after summarise()
  select(classID, IQ_mean, language_mean, language_lb, language_ub) %>% 
  tail(., 3)
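
For a numeric vector, the interval from gmodels::ci is, as far as I understand it, the usual t-based interval: mean ± t(0.975, n-1) × s/√n. The base-R sketch below spells out that calculation and compares it with t.test(). The helper name `mean_ci` is hypothetical, and the scores are simply the first ten `lang` values shown in the str() output above.

```r
# Hypothetical helper: t-based confidence interval for a mean
mean_ci <- function(x, level = 0.95) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)                        # standard error of the mean
  tq <- qt(1 - (1 - level) / 2, df = n - 1)    # two-sided t quantile
  c(mean = mean(x), lb = mean(x) - tq * se, ub = mean(x) + tq * se)
}

scores <- c(46, 45, 33, 46, 20, 30, 30, 57, 36, 36)
res <- mean_ci(scores)
res

# The bounds should agree with the interval reported by t.test()
all.equal(unname(res[c("lb", "ub")]), as.numeric(t.test(scores)$conf.int))
```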

HW exercise 3.

Use the Prestige{car} data set for this problem.

Find the median prestige score for each of the three types of occupation, respectively.
Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.

[Solution and Answer]

Load in the dataset and check its structure

library(car)
head(Prestige)

str(Prestige)

'data.frame':   102 obs. of  6 variables:
 $ education: num  13.1 12.3 12.8 11.4 14.6 ...
 $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
 $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
 $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
 $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
 $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...

Find the median prestige score for each of the three types of occupation, respectively.

median_types <- tapply(Prestige$prestige, Prestige$type, median)
median_types

  bc prof   wc 
35.9 68.4 41.5

Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.

lst <- Prestige %>% split(., Prestige$type)   # rows with a missing type are dropped
f_make_level <- function(df) {
  # look up this type's median by name rather than by factor code
  med <- median_types[as.character(df$type[1])]
  df$level_prestige <- factor(ifelse(df$prestige >= med, 'High', 'Low'))
  return(df)
}
lst <- lapply(lst, f_make_level)
dta3 <- rbind(lst[[1]], lst[[2]], lst[[3]])
head(dta3)

str(dta3)

'data.frame':   98 obs. of  7 variables:
 $ education     : num  9.45 9.93 9.47 10.93 7.74 ...
 $ income        : int  3485 2370 8895 8891 3116 3930 7869 3000 3472 3582 ...
 $ women         : num  76.14 3.69 0 1.65 52 ...
 $ prestige      : num  34.9 23.3 43.5 51.6 29.7 20.2 54.9 20.8 17.3 20.1 ...
 $ census        : int  3135 5145 6111 6112 6121 6123 6141 6162 6191 6193 ...
 $ type          : Factor w/ 3 levels "bc","prof","wc": 1 1 1 1 1 1 1 1 1 1 ...
 $ level_prestige: Factor w/ 2 levels "High","Low": 2 2 1 1 2 2 1 2 2 2 ...
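
An alternative to the split/lapply/rbind pattern, offered only as a sketch, is ave(), which returns each row's group median aligned with the original rows. The toy data frame below is made up for illustration. With the real data, rows with a missing type (which split() drops in the solution above) should be removed beforehand.

```r
# Hypothetical toy data with three occupation types
toy <- data.frame(
  type     = factor(c("bc", "bc", "bc", "prof", "prof", "prof", "wc", "wc")),
  prestige = c(20, 35, 50, 60, 70, 80, 40, 45)
)

# Group median repeated for every row of its group
grp_median <- ave(toy$prestige, toy$type, FUN = median)

# High if at or above the median of the row's own type, Low otherwise
toy$level_prestige <- factor(ifelse(toy$prestige >= grp_median, "High", "Low"))
toy
```

This keeps the original row order, whereas the split-based approach returns the rows sorted by type.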

Draw a scatter plot of `income` against `education` for each type, with points colored by level of prestige score.

library(lattice)
for (i in levels(dta3$type)) {
  print(xyplot(income ~ education, groups=level_prestige, data=dta3 %>% filter(type == i), type=c("p","g","r"), auto.key=list(columns=2)))
}

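Summarize the scatter plots on a common scale

library(ggplot2)
# education on the x-axis, income on the y-axis, with a least-squares line per group
ggplot(dta3, aes(education, income, color = level_prestige)) +
  geom_point() + geom_smooth(method = 'lm', se = FALSE) +
  facet_grid(. ~ type)

ggplot(dta3, aes(education, income, color = level_prestige)) +
  geom_point() + geom_smooth(method = 'lm', se = FALSE) +
  facet_grid(level_prestige ~ type)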

Compute the correlation coefficient between `income` and `education` for each type

dta3 %>% group_by(type) %>%
  summarise(R = cor(income, education))

Compute the correlation coefficient between `income` and `education` for each combination of type and low/high level of prestige score.

dta3 %>% group_by(type, level_prestige) %>%
  summarise(R = cor(income, education))

[Conclusion]

For type bc, there is a moderate correlation between income and education. However, once the data are split by low/high level of prestige score, only the correlation in the high-level group remains; there is no correlation between income and education for bc occupations with a low level.
For type prof, there is a slight correlation between income and education. However, once the data are split by low/high level of prestige score, only the correlation in the low-level group remains; there is no correlation between income and education for prof occupations with a high level.
For type wc, there is a very slight correlation between income and education. Once the data are split by low/high level of prestige score, the correlation disappears: neither the low-level nor the high-level group shows a correlation between income and education.

HW exercise 4.

Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.

[Solution and Answer]

Load in the datasets and check their structures

Nobel_countries <- read.table('../data/list_by_countries.txt', header = TRUE)
Nobel_winners <- read.table('../data/list_by_winners.txt', header = TRUE)
str(Nobel_countries)

'data.frame':   8 obs. of  2 variables:
 $ Country: Factor w/ 7 levels "Canada","China",..: 3 6 6 7 1 2 4 5
 $ Year   : int  2014 1950 2017 2016 2013 2012 2015 2011

str(Nobel_winners)

'data.frame':   7 obs. of  3 variables:
 $ Name  : Factor w/ 7 levels "Alice  Munro",..: 6 2 4 3 1 5 7
 $ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1
 $ Year  : int  2014 1950 2017 2016 2013 2012 1938

Nobel_countries

Nobel_winners

There are 8 observations and 2 variables in Nobel_countries.
There are 7 observations and 3 variables in Nobel_winners.

1-1. Mutating joins: `inner_join{dplyr}`

tf_inner <- inner_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_inner

The two datasets are joined on their common variable, `Year`. All rows of Nobel_countries that have a matching `Year` in Nobel_winners are returned, together with all columns of both datasets.

1-2. Mutating joins: `left_join{dplyr}`

tf_left <- left_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_left

The two datasets are joined on their common variable, `Year`. All rows of Nobel_countries (the left dataset) are returned, together with all columns of both datasets. Rows of Nobel_countries with no match in Nobel_winners have missing values in the new columns.

1-3. Mutating joins: `right_join{dplyr}`

tf_right <- right_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_right

The two datasets are joined on their common variable, `Year`. All rows of Nobel_winners (the right dataset) are returned, together with all columns of both datasets. Rows of Nobel_winners with no match in Nobel_countries have missing values in the new columns.

1-4. Mutating joins: `full_join{dplyr}`

tf_full <- full_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_full

The two datasets are joined on their common variable, `Year`. All rows of both datasets are returned. Where a row has no match in the other dataset, the columns from that dataset are filled with missing values.

2-1. Filtering joins: `semi_join{dplyr}`

tf_semi <- semi_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_semi

The two datasets are semi-joined on their common variable, `Year`. All rows of Nobel_countries that have a matching `Year` in Nobel_winners are returned. Only the columns of Nobel_countries are kept.

2-2. Filtering joins: `anti_join{dplyr}`

tf_anti <- anti_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_anti

The two datasets are anti-joined on their common variable, `Year`. All rows of Nobel_countries that have no matching `Year` in Nobel_winners are returned. Only the columns of Nobel_countries are kept.

3. Nesting joins: `nest_join{dplyr}`

tf_nest <- nest_join(Nobel_countries, Nobel_winners)

Joining, by = "Year"

tf_nest

All rows and columns from Nobel_countries are returned. A list column of tibbles is added. Each tibble contains all the rows from Nobel_winners that match that row of Nobel_countries. When there is no match, the list column is a 0-row tibble with the same column names and types as Nobel_winners.
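
The effect of reversing the input order is easiest to see for the joins that keep all rows of one side. The sketch below uses base R merge() as a stand-in for dplyr's left_join (a swapped-in technique, not the dplyr API), on hypothetical miniature tables in which only the `Year` values matter. Unlike dplyr, merge() sorts the result by the key and places the key column first.

```r
# Hypothetical miniature tables; names and countries are placeholders
countries <- data.frame(Country = c("X", "Y", "Z"), Year = c(2014, 1950, 2015))
winners   <- data.frame(Name    = c("P", "Q", "R"), Year = c(2014, 1950, 1938))

# all.x = TRUE keeps every row of the first argument (analogous to left_join)
left_cw <- merge(countries, winners, by = "Year", all.x = TRUE)
left_wc <- merge(winners, countries, by = "Year", all.x = TRUE)

left_cw   # 2015 is kept, with NA for Name
left_wc   # 1938 is kept, with NA for Country
```

Swapping the arguments changes which table's unmatched rows survive, which is the same reason left_join and right_join trade places when the inputs are reversed.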

HW exercise 5.

Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?

[Solution and Answer]

Load in the data set and rename the columns.

library(datasets)
dta_SAT <- read.table('http://www.amstat.org/publications/jse/datasets/sat.dat.txt', row.names=1)
head(dta_SAT)

colnames(dta_SAT) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")
str(dta_SAT)

'data.frame':   50 obs. of  7 variables:
 $ Spending: num  4.41 8.96 4.78 4.46 4.99 ...
 $ PTR     : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
 $ Salary  : num  31.1 48 32.2 28.9 41.1 ...
 $ PE      : int  8 47 27 6 45 29 81 68 48 65 ...
 $ Verbal  : int  491 445 448 482 417 462 431 429 420 406 ...
 $ Math    : int  538 489 496 523 485 518 477 468 469 448 ...
 $ SAT     : int  1029 934 944 1005 902 980 908 897 889 854 ...

Create a new variable `Division`.

dta_SAT$Division <- state.division

Visualize the association between `SAT` and `Salary` for each division.

lattice::xyplot(SAT ~ Salary, groups=Division, data=dta_SAT, type=c('r', 'g'), auto.key=list(columns=3))

Find the slope estimate for regressing `SAT` onto `Salary` for each division.

df_slope <- split(dta_SAT, dta_SAT$Division) %>% 
  sapply(., function(x) coef(lm(x$SAT ~ x$Salary))[2]) %>%
  as.data.frame()
df_slope

The association between Salary and SAT differs across the 9 divisions.
Students’ SAT scores were negatively associated with teachers’ salaries in five divisions: New England, East South Central, West South Central, West North Central, and Mountain.
Students’ SAT scores were positively associated with teachers’ salaries in four divisions: Middle Atlantic, South Atlantic, East North Central, and Pacific.
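
The number of negative slopes can also be counted programmatically instead of being read off the table. The following is a minimal sketch on made-up data: the data frame `toy` and its three divisions are hypothetical and are not the SAT data.

```r
# Hypothetical toy data: 3 divisions with 4 states each
toy <- data.frame(
  Division = factor(rep(c("D1", "D2", "D3"), each = 4)),
  Salary   = c(30, 32, 34, 36,  40, 42, 44, 46,  28, 30, 32, 34),
  SAT      = c(1000, 990, 985, 970,  900, 905, 915, 920,  1010, 1000, 1005, 990)
)

# Fit SAT ~ Salary within each division and keep the Salary coefficient
slopes <- sapply(split(toy, toy$Division),
                 function(d) unname(coef(lm(SAT ~ Salary, data = d))["Salary"]))
slopes

# Count the divisions with a negative slope
sum(slopes < 0)
```

Passing `data = d` to lm() labels the coefficient `Salary` rather than `x$Salary`, so it can be selected by name.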
