DataM: HW Exercise 0330 1-5
HW exercise 1.
Select at random one school per county in the data set Caschool{Ecdat} and draw a scatter diagram of average math score mathscr against average reading score readscr for the sampled data set. Make sure your results are reproducible (e.g., the same random sample will be drawn each time).
[Solution and Answer]
Load in the dataset
'data.frame': 420 obs. of 17 variables:
$ distcod : int 75119 61499 61549 61457 61523 62042 68536 63834 62331 67306 ...
$ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
$ district: Factor w/ 409 levels "Ackerman Elementary",..: 362 214 367 132 270 53 152 383 263 94 ...
$ grspan : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
$ enrltot : int 195 240 1550 243 1335 137 195 888 379 2247 ...
$ teachers: num 10.9 11.1 82.9 14 71.5 ...
$ calwpct : num 0.51 15.42 55.03 36.48 33.11 ...
$ mealpct : num 2.04 47.92 76.32 77.05 78.43 ...
$ computer: int 67 101 169 85 171 25 28 66 35 0 ...
$ testscr : num 691 661 644 648 641 ...
$ compstu : num 0.344 0.421 0.109 0.35 0.128 ...
$ expnstu : num 6385 5099 5502 7102 5236 ...
$ str : num 17.9 21.5 18.7 17.4 18.7 ...
$ avginc : num 22.69 9.82 8.98 8.98 9.08 ...
$ elpct : num 0 4.58 30 0 13.86 ...
$ readscr : num 692 660 636 652 642 ...
$ mathscr : num 690 662 651 644 640 ...
Each row in the data frame holds the data for one school.
Stratified sampling
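Drawing one school per county means sampling one row within each level of county. As a cross-check on the package-based call, here is a minimal base-R sketch of the same idea. It assumes the Ecdat package is installed, and the names rows_by_county, picked and dta_base are illustrative only. Even with the same seed it will generally pick different schools than strata() does, because the two approaches consume random numbers differently.

```r
# Base-R sketch: one randomly chosen row per county (assumes Ecdat is installed)
data(Caschool, package = "Ecdat")

set.seed(1234)  # same seed => same sample on every run

# Row indices grouped by county; factor() drops any unused levels
rows_by_county <- split(seq_len(nrow(Caschool)), factor(Caschool$county))

# Index into idx rather than calling sample(idx, 1): for a county with a
# single row, sample(idx, 1) would sample from 1:idx instead of returning idx
picked <- vapply(rows_by_county,
                 function(idx) idx[sample.int(length(idx), 1)],
                 integer(1))

dta_base <- Caschool[picked, ]
nrow(dta_base)  # one row per county
```

Re-running the whole block, including set.seed(), returns the same rows each time, which is the reproducibility the exercise asks for.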
library(sampling)
set.seed(1234) # fix the random seed so the same sample is drawn each time
sample_result <- strata(Caschool, stratanames = 'county',
                        size = rep(1, length(levels(Caschool$county))), # draw one unit per county
                        method="srswor") # simple random sampling without replacement
head(sample_result)
Draw the scatter plot using the sampled dataset
dta <- Caschool[sample_result$ID_unit,]
# math score (y) against reading score (x), as the exercise asks
lattice::xyplot(mathscr ~ readscr, groups=county, data=dta, type=c('p', 'g'), auto.key=list(columns=5))
HW exercise 2.
Find 133 class-level 95%-confidence intervals for language test score means of the nlschools{MASS} data set by using the tidy approach. The tail end of the data object should look as follows:
[Solution and Answer]
Load in the dataset
'data.frame': 2287 obs. of 6 variables:
$ lang : int 46 45 33 46 20 30 30 57 36 36 ...
$ IQ : num 15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
$ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
$ GS : int 29 29 29 29 29 29 29 29 29 29 ...
$ SES : int 23 10 15 23 10 10 23 10 13 15 ...
$ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
Compute mean and 95% CI for language test score
The expected output shown in the question statement appears to be incorrect: its language_mean column actually contains the class means of IQ. I therefore display both.
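For reference, the interval can also be computed by hand. The sketch below assumes the usual t-based interval, mean ± t(0.975, n − 1) × sd / √n, which to my understanding is what gmodels::ci reports for a numeric vector. The column names n_obs, std_err and t_crit are illustrative and not part of the exercise.

```r
library(dplyr)
data(nlschools, package = "MASS")

# Per-class mean of lang with a t-based 95% confidence interval
ci_by_class <- nlschools %>%
  group_by(class) %>%
  summarise(
    n_obs         = n(),                        # pupils in the class
    language_mean = mean(lang),
    std_err       = sd(lang) / sqrt(n_obs),     # standard error of the mean
    t_crit        = qt(0.975, df = n_obs - 1),  # two-sided 95% critical value
    language_lb   = language_mean - t_crit * std_err,
    language_ub   = language_mean + t_crit * std_err
  )

tail(ci_by_class, 3)
```

Spelling the formula out makes it easy to switch the confidence level: only the 0.975 in qt() needs to change.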
library(dplyr)
library(tidyr)
nlschools %>% group_by(class) %>%
summarise(IQ_mean = mean(IQ),
language_mean = mean(lang),
language_lb = gmodels::ci(lang, confidence=0.95)[2],
language_ub = gmodels::ci(lang, confidence=0.95)[3]) %>%
mutate(classID = row_number()) %>%
select(classID, IQ_mean, language_mean, language_lb, language_ub) %>%
tail(., 3)
HW exercise 3.
Use the Prestige{car} data set for this problem.
Find the median prestige score for each of the three types of occupation, respectively.
Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.
[Solution and Answer]
- Load in the dataset and check its structure
'data.frame': 102 obs. of 6 variables:
$ education: num 13.1 12.3 12.8 11.4 14.6 ...
$ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
$ women : num 11.16 4.02 15.7 9.11 11.68 ...
$ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
$ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
$ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
- Find the median prestige score for each of the three types of occupation, respectively.
bc prof wc
35.9 68.4 41.5
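One way to obtain these medians is sketched below. It assumes the carData package is installed (current versions of car take the Prestige data from there), and the object name median_types is chosen to match the code used in the next step.

```r
# Median prestige score per occupation type (assumes carData is installed)
data(Prestige, package = "carData")

# tapply() groups prestige by type; occupations with a missing type are
# left out because NA is not a factor level
median_types <- with(Prestige, tapply(prestige, type, median))
median_types
```

The result is a named vector, so it can be indexed by type when the High and Low labels are assigned.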
- Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.
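# median_types: per-type median prestige scores (bc, prof, wc) from the previous step
lst <- Prestige %>% split(., Prestige$type) # occupations with a missing type are dropped
f_make_level <- function(df) {
  # 'High' if prestige is at or above the median of the occupation's own type, else 'Low'
  df$level_prestige <- factor(c('Low', 'High')[(df$prestige >= median_types[df$type[1]])*1 + 1])
  return(df)
}
lst <- lapply(lst, f_make_level)
dta3 <- rbind(lst[[1]], lst[[2]], lst[[3]])
head(dta3)
'data.frame': 98 obs. of 7 variables: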
$ education : num 9.45 9.93 9.47 10.93 7.74 ...
$ income : int 3485 2370 8895 8891 3116 3930 7869 3000 3472 3582 ...
$ women : num 76.14 3.69 0 1.65 52 ...
$ prestige : num 34.9 23.3 43.5 51.6 29.7 20.2 54.9 20.8 17.3 20.1 ...
$ census : int 3135 5145 6111 6112 6121 6123 6141 6162 6191 6193 ...
$ type : Factor w/ 3 levels "bc","prof","wc": 1 1 1 1 1 1 1 1 1 1 ...
$ level_prestige: Factor w/ 2 levels "High","Low": 2 2 1 1 2 2 1 2 2 2 ...
Draw a scatter plot of income against education for each type, colored by level of prestige score.
library(lattice)
for (i in levels(dta3$type)) {
print(xyplot(income ~ education, groups=level_prestige, data=dta3 %>% filter(type == i), type=c("p","g","r"), auto.key=list(columns=2)))
}
Summarize the scatter plots on a common scale
library(ggplot2)
qplot(income, education, data=dta3, geom = c('point', 'abline'),
      col=level_prestige, facets = . ~ type)
qplot(income, education, data=dta3, geom = c('point', 'abline'),
      col=level_prestige, facets = level_prestige ~ type)
Compute correlation coefficient of income and education for each type
Compute the correlation coefficient of income and education within each type, separately for the low and high levels of prestige score.
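The two sets of correlations can be computed as sketched below. This is a base-R sketch assuming carData is installed. It rebuilds the High/Low factor with ave() instead of the split/rbind route, so it is an equivalent alternative rather than the code that produced the conclusions that follow. The names dta_cor, cor_by_type and cor_by_cell are illustrative.

```r
data(Prestige, package = "carData")

# Keep occupations with a known type and label them High/Low relative to
# the median prestige of their own type
dta_cor <- subset(Prestige, !is.na(type))
type_median <- ave(dta_cor$prestige, dta_cor$type, FUN = median)
dta_cor$level_prestige <- factor(ifelse(dta_cor$prestige >= type_median,
                                        "High", "Low"))

# Pearson correlation of income and education within one subset
cor_income_education <- function(d) cor(d$income, d$education)

# One correlation per type, then one per type x prestige-level cell
cor_by_type <- sapply(split(dta_cor, dta_cor$type), cor_income_education)
cor_by_cell <- sapply(split(dta_cor,
                            list(dta_cor$type, dta_cor$level_prestige)),
                      cor_income_education)

round(cor_by_type, 2)
round(cor_by_cell, 2)
```

The cell names combine both factors (for example bc.High), which makes it easy to compare each type's overall correlation with its two sub-group correlations.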
[Conclusion]
In type bc, there is a moderate correlation between income and education. However, when the data are split by low/high level of prestige score, only the correlation in the high-level group remains; there is no correlation between income and education in type bc at the low level.
In type prof, there is a slight correlation between income and education. However, when the data are split by low/high level of prestige score, only the correlation in the low-level group remains; there is no correlation between income and education in type prof at the high level.
In type wc, there is a very slight correlation between income and education. When the data are split by low/high level of prestige score, the correlation disappears: neither the low-level nor the high-level group shows a correlation between income and education in type wc.
HW exercise 4.
Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.
[Solution and Answer]
Load in the datasets and check their structures
Nobel_countries <- read.table('../data/list_by_countries.txt', header = TRUE)
Nobel_winners <- read.table('../data/list_by_winners.txt', header = TRUE)
str(Nobel_countries)
'data.frame': 8 obs. of 2 variables:
$ Country: Factor w/ 7 levels "Canada","China",..: 3 6 6 7 1 2 4 5
$ Year : int 2014 1950 2017 2016 2013 2012 2015 2011
'data.frame': 7 obs. of 3 variables:
$ Name : Factor w/ 7 levels "Alice Munro",..: 6 2 4 3 1 5 7
$ Gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 1 2 1
$ Year : int 2014 1950 2017 2016 2013 2012 1938
- There are 8 observations and 2 variables in Nobel_countries.
- There are 7 observations and 3 variables in Nobel_winners.
1-1. Mutating joins: inner_join{dplyr}
Joining, by = "Year"
The two datasets are joined on their common variable, Year. All rows of Nobel_countries that have a matching Year in Nobel_winners are returned, together with all columns of both datasets.
1-2. Mutating joins: left_join{dplyr}
Joining, by = "Year"
The two datasets are joined on their common variable, Year. All rows of Nobel_countries (the left dataset) are returned, together with all columns of both datasets. Rows of Nobel_countries with no match in Nobel_winners have missing values in the new columns.
1-3. Mutating joins: right_join{dplyr}
Joining, by = "Year"
The two datasets are joined on their common variable, Year. All rows of Nobel_winners (the right dataset) are returned, together with all columns of both datasets. Rows of Nobel_winners with no match in Nobel_countries have missing values in the new columns.
1-4. Mutating joins: full_join{dplyr}
Joining, by = "Year"
The two datasets are joined on their common variable, Year. All rows of both datasets are returned. Where a row has no match in the other dataset, the columns from that dataset are filled with missing values.
2-1. Filtering joins: semi_join{dplyr}
Joining, by = "Year"
The two datasets are semi-joined on their common variable, Year. All rows of Nobel_countries that have a matching Year in Nobel_winners are returned. Only the columns of Nobel_countries are kept.
2-2. Filtering joins: anti_join{dplyr}
Joining, by = "Year"
The two datasets are anti-joined on their common variable, Year. All rows of Nobel_countries that have no matching Year in Nobel_winners are returned. Only the columns of Nobel_countries are kept.
3. Nesting joins: nest_join{dplyr}
Joining, by = "Year"
All rows and columns from Nobel_countries are returned. A list column of tibbles is added. Each tibble contains all the rows from Nobel_winners that match that row of Nobel_countries. When there is no match, the list column is a 0-row tibble with the same column names and types as Nobel_winners.
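The row counts implied by the descriptions above can be checked with a small stand-in. In the sketch below, the Year values are copied from the str() output of the two datasets. The Country and Name columns are placeholders, not the real entries, and the object names countries, winners and join_rows are illustrative.

```r
library(dplyr)

# Stand-ins: real Year values, placeholder Country and Name values
countries <- data.frame(
  Year    = c(2014, 1950, 2017, 2016, 2013, 2012, 2015, 2011),
  Country = paste0("country_", 1:8)
)
winners <- data.frame(
  Year = c(2014, 1950, 2017, 2016, 2013, 2012, 1938),
  Name = paste0("winner_", 1:7)
)

# Number of rows returned by each join with countries as the first input
join_rows <- c(
  inner = nrow(inner_join(countries, winners, by = "Year")),  # 6 shared years
  left  = nrow(left_join(countries, winners, by = "Year")),   # all 8 countries rows
  right = nrow(right_join(countries, winners, by = "Year")),  # all 7 winners rows
  full  = nrow(full_join(countries, winners, by = "Year")),   # 6 shared + 2 + 1 = 9
  semi  = nrow(semi_join(countries, winners, by = "Year")),   # 6, countries columns only
  anti  = nrow(anti_join(countries, winners, by = "Year"))    # 2 (2015 and 2011)
)
join_rows

# nest_join keeps all 8 rows and adds a list column of matching winners rows
nested <- nest_join(countries, winners, by = "Year")
nrow(nested)
```

Swapping the two inputs changes which dataset plays the role of the left one, so left_join and right_join trade row counts, and semi_join and anti_join filter the other dataset.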
HW exercise 5.
Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?
[Solution and Answer]
Load in the data set and rename the columns.
library(datasets)
dta_SAT <- read.table('http://www.amstat.org/publications/jse/datasets/sat.dat.txt', row.names=1)
head(dta_SAT)
'data.frame': 50 obs. of 7 variables:
$ Spending: num 4.41 8.96 4.78 4.46 4.99 ...
$ PTR : num 17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
$ Salary : num 31.1 48 32.2 28.9 41.1 ...
$ PE : int 8 47 27 6 45 29 81 68 48 65 ...
$ Verbal : int 491 445 448 482 417 462 431 429 420 406 ...
$ Math : int 538 489 496 523 485 518 477 468 469 448 ...
$ SAT : int 1029 934 944 1005 902 980 908 897 889 854 ...
Create a new variable Division.
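A minimal sketch of this step follows. Because the SAT file is not bundled with R, it uses the built-in state.x77 table as a stand-in; that substitution and the name dta_demo are assumptions made for illustration. The same match() lookup would apply to dta_SAT if its row names are the state names.

```r
# Stand-in data: state.x77 ships with R and has state names as row names
dta_demo <- as.data.frame(state.x77)

# state.name and state.division are parallel vectors in the datasets
# package, so matching on the state name looks up each row's division
dta_demo$Division <- state.division[match(rownames(dta_demo), state.name)]

table(dta_demo$Division)
```

Matching by name rather than relying on row order keeps the lookup correct even if the rows are not sorted alphabetically by state.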
Visualize the association between SAT and salary for each division.
Find the slope estimate for regressing SAT onto salary for each division.
df_slope <- split(dta_SAT, dta_SAT$Division) %>%
sapply(., function(x) coef(lm(x$SAT ~ x$Salary))[2]) %>%
as.data.frame()
df_slope
- The associations between Salary and SAT differ across the 9 divisions.
- Students’ SAT scores were negatively associated with teachers’ salary in five divisions: New England, East South Central, West South Central, West North Central, and Mountain.
- Students’ SAT scores were positively associated with teachers’ salary in four divisions: Middle Atlantic, South Atlantic, East North Central, and Pacific.
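Counting the negative slopes can be done in code rather than by eye. The sketch below again uses built-in stand-in data: Income plays the role of Salary and HS Grad the role of SAT, both taken from state.x77. That substitution is an assumption for illustration, so the count it prints says nothing about the SAT data.

```r
# Stand-in data (an assumption): x stands for Salary, y for SAT
dta_demo <- data.frame(
  x        = state.x77[, "Income"],
  y        = state.x77[, "HS Grad"],
  Division = state.division
)

# One simple regression per division; keep only the slope on x
slopes <- sapply(split(dta_demo, dta_demo$Division),
                 function(d) coef(lm(y ~ x, data = d))[["x"]])

slopes
sum(slopes < 0)  # number of divisions with a negative slope
```

Passing data = d and extracting the coefficient by name with [["x"]] leaves the result named by division only, which makes the table of slopes easier to read.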