HW exercise 1.

Select at random one school per county in the data set Caschool{Ecdat} and draw a scatter diagram of average math score mathscr against average reading score readscr for the sampled data set. Make sure your results are reproducible (e.g., the same random sample will be drawn each time).

[Solution and Answer]

Load in the dataset

'data.frame':   420 obs. of  17 variables:
 $ distcod : int  75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
 $ county  : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
 $ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
 $ grspan  : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
 $ enrltot : int  195 240 1550 243 1335 137 195 888 379 2247 ...
 $ teachers: num  10.9 11.1 82.9 14 71.5 ...
 $ calwpct : num  0.51 15.42 55.03 36.48 33.11 ...
 $ mealpct : num  2.04 47.92 76.32 77.05 78.43 ...
 $ computer: int  67 101 169 85 171 25 28 66 35 0 ...
 $ testscr : num  691 661 644 648 641 ...
 $ compstu : num  0.344 0.421 0.109 0.35 0.128 ...
 $ expnstu : num  6385 5099 5502 7102 5236 ...
 $ str     : num  17.9 21.5 18.7 17.4 18.7 ...
 $ avginc  : num  22.69 9.82 8.98 8.98 9.08 ...
 $ elpct   : num  0 4.58 30 0 13.86 ...
 $ readscr : num  692 660 636 652 642 ...
 $ mathscr : num  690 662 651 644 640 ...

Each row of the data frame holds the data for one school.

Hierarchical sampling
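
One way to draw a reproducible one-per-county sample is sketched below. It is a minimal illustration and not necessarily the code behind the original answer: the small `schools` data frame is a made-up stand-in for Caschool (only the column names county, readscr, and mathscr come from the structure above), the helper names `pick_one` and `draw_sample` are hypothetical, and the seed value is arbitrary.

```r
# Made-up stand-in for Caschool; with Ecdat installed one could instead run
# data(Caschool, package = "Ecdat") and use Caschool in place of `schools`.
schools <- data.frame(
  county  = c("Alameda", "Butte", "Butte", "Butte", "Other", "Other"),
  readscr = c(692, 660, 636, 652, 642, 606),
  mathscr = c(690, 662, 651, 644, 640, 605)
)

# Draw one row index from a vector of candidate indices.
# sample.int(length(i), 1) is used instead of sample(i, 1) because sample()
# applied to a single number n draws from 1:n rather than returning n.
pick_one <- function(i) i[sample.int(length(i), 1)]

draw_sample <- function(data, seed) {
  set.seed(seed)  # fixing the seed makes the draw reproducible
  idx <- vapply(split(seq_len(nrow(data)), data$county), pick_one, integer(1))
  data[idx, ]
}

sampled <- draw_sample(schools, seed = 2024)

# Scatter diagram of average math score against average reading score
plot(sampled$readscr, sampled$mathscr,
     xlab = "Average reading score (readscr)",
     ylab = "Average math score (mathscr)")
```

Calling `draw_sample()` again with the same seed returns the same rows, which is what the exercise means by reproducible.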

HW exercise 2.

Find 133 class-level 95%-confidence intervals for language test score means of the nlschools{MASS} data set by using the tidy approach. The tail end of the data object should look as follows:

[Solution and Answer]

Load in the dataset

'data.frame':   2287 obs. of  6 variables:
 $ lang : int  46 45 33 46 20 30 30 57 36 36 ...
 $ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
 $ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
 $ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
 $ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Compute mean and 95% CI for language test score

The expected output shown in the question statement appears to be incorrect: its language_mean column actually contains the class means of IQ. I therefore display both the language and the IQ means.
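
A tidy version of this computation might look like the sketch below. It assumes a t-based interval (mean ± t quantile × standard error) computed separately within each class; the column names other than language_mean (`n_pupils`, `iq_mean`, `se`, `lower`, `upper`) are my own choices, not those of the original answer.

```r
library(dplyr)

# Load the data without attaching MASS, so MASS::select does not mask dplyr's
data(nlschools, package = "MASS")

class_ci <- nlschools %>%
  group_by(class) %>%
  summarise(
    n_pupils      = n(),                     # class size
    language_mean = mean(lang),              # mean language test score
    iq_mean       = mean(IQ),                # mean IQ, shown for comparison
    se            = sd(lang) / sqrt(n_pupils),
    lower         = language_mean - qt(0.975, df = n_pupils - 1) * se,
    upper         = language_mean + qt(0.975, df = n_pupils - 1) * se
  )

tail(class_ci)
```

Keeping `language_mean` and `iq_mean` side by side makes it easy to check which of the two the expected output actually reports.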


HW exercise 3.

Use the Prestige{car} data set for this problem.

  1. Find the median prestige score for each of the three types of occupation, respectively.

  2. Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.

[Solution and Answer]

  • Load in the dataset and check its structure
'data.frame':   102 obs. of  6 variables:
 $ education: num  13.1 12.3 12.8 11.4 14.6 ...
 $ income   : int  12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
 $ women    : num  11.16 4.02 15.7 9.11 11.68 ...
 $ prestige : num  68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
 $ census   : int  1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
 $ type     : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
  1. Find the median prestige score for each of the three types of occupation, respectively.
  bc prof   wc 
35.9 68.4 41.5 
  2. Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.
'data.frame':   98 obs. of  7 variables:
 $ education     : num  9.45 9.93 9.47 10.93 7.74 ...
 $ income        : int  3485 2370 8895 8891 3116 3930 7869 3000 3472 3582 ...
 $ women         : num  76.14 3.69 0 1.65 52 ...
 $ prestige      : num  34.9 23.3 43.5 51.6 29.7 20.2 54.9 20.8 17.3 20.1 ...
 $ census        : int  3135 5145 6111 6112 6121 6123 6141 6162 6191 6193 ...
 $ type          : Factor w/ 3 levels "bc","prof","wc": 1 1 1 1 1 1 1 1 1 1 ...
 $ level_prestige: Factor w/ 2 levels "High","Low": 2 2 1 1 2 2 1 2 2 2 ...
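
The two steps above (median per type, then a High/Low label relative to that median) can be sketched in base R as follows. The `occ` data frame is made-up toy data standing in for Prestige, and treating scores strictly above the median as "High" is an assumption; the original answer may have drawn the line differently.

```r
# Made-up toy data standing in for Prestige (values are hypothetical)
occ <- data.frame(
  type     = factor(rep(c("bc", "prof", "wc"), each = 4)),
  prestige = c(20, 30, 40, 50,  60, 65, 70, 80,  35, 40, 45, 50)
)

# Median prestige score per occupation type
med <- tapply(occ$prestige, occ$type, median)

# ave() repeats each type's median on the rows of that type, so every
# occupation is compared with the median of its own type.
# Assumption: scores strictly above the median count as "High".
type_median <- ave(occ$prestige, occ$type, FUN = median)
occ$level_prestige <- factor(ifelse(occ$prestige > type_median, "High", "Low"))

med
table(occ$type, occ$level_prestige)
```

With the real data, rows whose type is missing would need to be dropped first; the structure above shows 98 of the original 102 rows, which is consistent with that.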

Compute the correlation coefficient of income and education for each type.

Compute the correlation coefficient of income and education for each combination of type and prestige level (low or high).
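
A minimal sketch of the grouped correlations is given below. The data are simulated purely for illustration (they are not Prestige values), and `split()` plus `sapply()` is only one of several ways to compute a statistic per group.

```r
# Simulated stand-in data; not the Prestige values
set.seed(1)
occ <- data.frame(
  type           = rep(c("bc", "prof", "wc"), each = 10),
  level_prestige = rep(c("High", "Low"), times = 15),
  education      = runif(30, min = 6, max = 16)
)
occ$income <- 1000 + 500 * occ$education + rnorm(30, sd = 1500)

# Pearson correlation of income and education within one subset
cor_ie <- function(d) cor(d$income, d$education)

# One coefficient per occupation type
by_type <- sapply(split(occ, occ$type), cor_ie)

# One coefficient per type x prestige-level cell (names like "bc.High")
by_cell <- sapply(split(occ, list(occ$type, occ$level_prestige)), cor_ie)

by_type
by_cell
```

Comparing `by_type` with `by_cell` shows whether a correlation seen within a type survives once the type is split by prestige level.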

[Conclusion]

  1. For type bc, there is a moderate correlation between income and education. However, when the data are split by low/high prestige level, the correlation persists only in the high-prestige group; there is no correlation between income and education for bc occupations with low prestige.

  2. For type prof, there is a slight correlation between income and education. However, when the data are split by low/high prestige level, the correlation persists only in the low-prestige group; there is no correlation between income and education for prof occupations with high prestige.

  3. For type wc, there is a very slight correlation between income and education. When the data are split by low/high prestige level, the correlation disappears: neither the low- nor the high-prestige group shows a correlation between income and education for wc occupations.


HW exercise 4.

Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.

[Solution and Answer]

Load in the datasets and check their structures

'data.frame':   8 obs. of  2 variables:
 $ Country: Factor w/ 7 levels "Canada","China",..: 3 6 6 7 1 2 4 5
 $ Year   : int  2014 1950 2017 2016 2013 2012 2015 2011
'data.frame':   7 obs. of  3 variables:
 $ Name  : Factor w/ 7 levels "Alice  Munro",..: 6 2 4 3 1 5 7
 $ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1
 $ Year  : int  2014 1950 2017 2016 2013 2012 1938
  • There are 8 observations and 2 variables in Nobel_countries.
  • There are 7 observations and 3 variables in Nobel_winners.

1-1. Mutating joins: inner_join{dplyr}

Joining, by = "Year"

The two datasets are joined by their common variable, Year. All rows of Nobel_countries with a matching Year in Nobel_winners are returned, along with all columns of both datasets.

1-2. Mutating joins: left_join{dplyr}

Joining, by = "Year"

The two datasets are joined by their common variable, Year. All rows of Nobel_countries (the left dataset) are returned, along with all columns of both datasets. Rows of Nobel_countries with no match in Nobel_winners have missing values in the new columns.

1-3. Mutating joins: right_join{dplyr}

Joining, by = "Year"

The two datasets are joined by their common variable, Year. All rows of Nobel_winners (the right dataset) are returned, along with all columns of both datasets. Rows of Nobel_winners with no match in Nobel_countries have missing values in the columns that come from Nobel_countries.

1-4. Mutating joins: full_join{dplyr}

Joining, by = "Year"

The two datasets are joined by their common variable, Year. All rows of both datasets are returned; where a row has no match in the other dataset, the columns from that dataset are filled with missing values.
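
The four mutating joins can be compared side by side with the sketch below. The Year values are copied from the structures above, but the Country and Name labels are placeholders rather than the real entries, and the shortened table names `countries` and `winners` are mine.

```r
library(dplyr)

# Year values come from the structures above; the labels are placeholders
countries <- data.frame(
  Country = paste0("country_", 1:8),
  Year    = c(2014, 1950, 2017, 2016, 2013, 2012, 2015, 2011)
)
winners <- data.frame(
  Name = paste0("winner_", 1:7),
  Year = c(2014, 1950, 2017, 2016, 2013, 2012, 1938)
)

inner <- inner_join(countries, winners, by = "Year")  # matched years only
left  <- left_join(countries, winners, by = "Year")   # every countries row
right <- right_join(countries, winners, by = "Year")  # every winners row
full  <- full_join(countries, winners, by = "Year")   # every row of both

sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```

Six years appear in both tables, 2015 and 2011 appear only in the countries table, and 1938 appears only in the winners table, so the row counts are 6, 8, 7, and 9 respectively.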

2-1. Filtering joins: semi_join{dplyr}

Joining, by = "Year"

The two datasets are semi-joined by their common variable, Year. All rows of Nobel_countries with a matching Year in Nobel_winners are returned. Only the columns of Nobel_countries are kept.

2-2. Filtering joins: anti_join{dplyr}

Joining, by = "Year"

The two datasets are anti-joined by their common variable, Year. All rows of Nobel_countries with no matching Year in Nobel_winners are returned. Only the columns of Nobel_countries are kept.
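
The filtering joins keep or drop rows of the first table without adding columns, as the following sketch illustrates. As before, the Year values are taken from the structures above while the Country and Name labels are placeholders.

```r
library(dplyr)

# Year values come from the structures above; the labels are placeholders
countries <- data.frame(
  Country = paste0("country_", 1:8),
  Year    = c(2014, 1950, 2017, 2016, 2013, 2012, 2015, 2011)
)
winners <- data.frame(
  Name = paste0("winner_", 1:7),
  Year = c(2014, 1950, 2017, 2016, 2013, 2012, 1938)
)

# Rows of countries whose Year also occurs in winners
semi <- semi_join(countries, winners, by = "Year")

# Rows of countries whose Year does not occur in winners
anti <- anti_join(countries, winners, by = "Year")

names(semi)  # only the columns of countries
anti$Year    # the years without a matching winner
```

Together `semi` and `anti` partition the rows of the first table: six rows have a match and the remaining two (2015 and 2011) do not.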

3. Nesting joins: nest_join{dplyr}

Joining, by = "Year"

All rows and columns from Nobel_countries are returned. A list column of tibbles is added. Each tibble contains all the rows from Nobel_winners that match that row of Nobel_countries. When there is no match, the list column is a 0-row tibble with the same column names and types as Nobel_winners.
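
A small sketch of the nesting join follows, again with the Year values from the structures above and placeholder labels. The list column is named explicitly via the `name` argument so that the example does not rely on the default naming.

```r
library(dplyr)

# Year values come from the structures above; the labels are placeholders
countries <- data.frame(
  Country = paste0("country_", 1:8),
  Year    = c(2014, 1950, 2017, 2016, 2013, 2012, 2015, 2011)
)
winners <- data.frame(
  Name = paste0("winner_", 1:7),
  Year = c(2014, 1950, 2017, 2016, 2013, 2012, 1938)
)

# Each row of countries gets a nested table of its matching winners rows
nested <- nest_join(countries, winners, by = "Year", name = "winners")

# Number of matching winners rows per countries row (0 where nothing matches)
matches <- vapply(nested$winners, nrow, integer(1))
matches
```

Rows with a count of zero are the same rows that an anti join would return, while the rows with a positive count are those kept by a semi join.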


HW exercise 5.

Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?

[Solution and Answer]

Load in the data set and rename the columns.

'data.frame':   50 obs. of  7 variables:
 $ Spending: num  4.41 8.96 4.78 4.46 4.99 ...
 $ PTR     : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
 $ Salary  : num  31.1 48 32.2 28.9 41.1 ...
 $ PE      : int  8 47 27 6 45 29 81 68 48 65 ...
 $ Verbal  : int  491 445 448 482 417 462 431 429 420 406 ...
 $ Math    : int  538 489 496 523 485 518 477 468 469 448 ...
 $ SAT     : int  1029 934 944 1005 902 980 908 897 889 854 ...


Create a new variable Division.
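
One hedged way to attach the division is to match on state name instead of relying on row order. The sketch assumes that the row names of the SAT data are state names, which the structure above does not show; the three-row `sat` frame and its Salary values are made up.

```r
# Made-up three-row stand-in; row names are assumed to be state names
sat <- data.frame(
  Salary    = c(30, 40, 35),
  row.names = c("Texas", "Maine", "Ohio")
)

# state.name and state.division (datasets package) are parallel vectors,
# so match() finds each state's position and picks the matching division.
sat$Division <- state.division[match(rownames(sat), state.name)]

sat
```

If the 50 rows are already in the same alphabetical order as `state.name`, assigning `state.division` directly gives the same result; matching by name simply guards against a different ordering.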

Visualize the association between SAT and salary for each division.

Find the slope estimate for regressing SAT onto salary for each division.
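
The per-division slopes can be obtained by fitting one simple regression within each division, as sketched below. Only `state.division` is real here: the Salary and SAT values are simulated for illustration, so the resulting signs say nothing about the actual data.

```r
# Simulated stand-in for the SAT data; only state.division is real
set.seed(42)
sat <- data.frame(
  Division = state.division,
  Salary   = runif(50, min = 25, max = 50)
)
sat$SAT <- 1000 - 2 * sat$Salary + rnorm(50, sd = 40)

# Slope of the regression of SAT onto Salary within one division
slope_of <- function(d) coef(lm(SAT ~ Salary, data = d))[["Salary"]]

slopes <- sapply(split(sat, sat$Division), slope_of)

slopes
sum(slopes < 0)  # number of divisions with a negative slope
```

The last line answers the "how many are negative" question directly from the vector of slope estimates.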

  • The association between Salary and SAT differs across the 9 divisions.
  • Students’ SAT scores were negatively associated with teachers’ salary in five divisions: New England, East South Central, West South Central, West North Central, and Mountain.
  • Students’ SAT scores were positively associated with teachers’ salary in four divisions: Middle Atlantic, South Atlantic, East North Central, and Pacific.