In-class exercise

Q1.

Summarize the backpain{HSAUR3} data set in the following format:

driver suburban case control total
    no       no    ?       ?     ?
    no      yes    ?       ?     ?
   yes       no    ?       ?     ?
   yes      yes    ?       ?     ?

library(tidyverse)
library(Ecdat)
library(MASS)
library(magrittr)
library(car)
library(lattice)

pacman::p_load(HSAUR3)
data("backpain", package="HSAUR3")
head(backpain)

# Count cases and controls per driver x suburban cell, then add the row total
backpain %>%
  count(driver, suburban, status) %>%
  spread(status, n) %>%
  mutate(total = case + control) %>%
  as.data.frame()

Q2.

Merge the two data sets: state.x77{datasets} and USArrests{datasets} and compute all pair-wise correlations for numerical variables. Is there anything interesting to report?

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20
## Alaska            365   6315        1.5    69.31   11.3    66.7   152
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65
## California      21198   5114        1.1    71.71   10.3    62.6    20
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166
##              Area
## Alabama     50708
## Alaska     566432
## Arizona    113417
## Arkansas    51945
## California 156361
## Colorado   103766

head(USArrests)

merge(state.x77,USArrests) %$% cor(.)

##                 Murder  Population      Income Illiteracy    Life Exp
## Murder      1.00000000  0.32869864 -0.09833907  0.7997904 -0.74059658
## Population  0.32869864  1.00000000 -0.02186730  0.1245989  0.04728461
## Income     -0.09833907 -0.02186730  1.00000000 -0.4579606  0.22182349
## Illiteracy  0.79979040  0.12459889 -0.45796056  1.0000000 -0.74652671
## Life Exp   -0.74059658  0.04728461  0.22182349 -0.7465267  1.00000000
## HS Grad    -0.59204539 -0.39784300  0.66120542 -0.6765483  0.65625800
## Frost      -0.53457903 -0.36245690  0.35878256 -0.6710699  0.40934054
## Area        0.36543567 -0.09454207  0.56785382  0.1844932 -0.21059347
## Assault     0.81449426  0.41126839  0.28340572  0.5064039 -0.60478595
## UrbanPop   -0.11103769  0.56955715  0.09148709 -0.3512532  0.24842664
## Rape        0.72359299  0.55697741  0.02312758  0.4970228 -0.35938747
##               HS Grad      Frost        Area    Assault    UrbanPop
## Murder     -0.5920454 -0.5345790  0.36543567  0.8144943 -0.11103769
## Population -0.3978430 -0.3624569 -0.09454207  0.4112684  0.56955715
## Income      0.6612054  0.3587826  0.56785382  0.2834057  0.09148709
## Illiteracy -0.6765483 -0.6710699  0.18449317  0.5064039 -0.35125322
## Life Exp    0.6562580  0.4093405 -0.21059347 -0.6047860  0.24842664
## HS Grad     1.0000000  0.6023133  0.35935037 -0.3651881 -0.12557750
## Frost       0.6023133  1.0000000  0.11733688 -0.2795162  0.19173187
## Area        0.3593504  0.1173369  1.00000000  0.5419884 -0.07080678
## Assault    -0.3651881 -0.2795162  0.54198844  1.0000000  0.13822857
## UrbanPop   -0.1255775  0.1917319 -0.07080678  0.1382286  1.00000000
## Rape       -0.4139066 -0.4625038  0.50318069  0.6872273  0.23961214
##                   Rape
## Murder      0.72359299
## Population  0.55697741
## Income      0.02312758
## Illiteracy  0.49702278
## Life Exp   -0.35938747
## HS Grad    -0.41390656
## Frost      -0.46250383
## Area        0.50318069
## Assault     0.68722731
## UrbanPop    0.23961214
## Rape        1.00000000
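
Called without `by`, `merge()` joins on every column name the two inputs share, which here is `Murder`, so the rows behind the matrix above are matched on coinciding murder figures rather than on state. A minimal sketch of a state-level merge, assuming the row names (the state names) are the intended key; the suffixes `.x77` and `.arrests` and the object names are illustrative labels, not names from the original:

```r
# Merge on row names (the state names) instead of the shared Murder column.
x77 <- as.data.frame(state.x77)              # state.x77 is a matrix
states <- merge(x77, USArrests, by = "row.names",
                suffixes = c(".x77", ".arrests"))

# Move the key back into the row names and drop the key column.
rownames(states) <- states$Row.names
states$Row.names <- NULL

# Pairwise correlations of all numeric columns, rounded for readability.
state_cor <- round(cor(states), 2)
dim(states)
```

Keeping both murder columns (`Murder.x77` and `Murder.arrests`) makes it possible to check how closely the two sources agree.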

Q3.

Supply comments to each code chunk in the following survey rmarkdown file and preview it as an R notebook or knit to html.

https://rpubs.com/PL_W/c5ie3

Q4.

The data set Vocab{car} gives observations on gender, education and vocabulary, from respondents to U.S. General Social Surveys, 1972-2004. Summarize the relationship between education and vocabulary over the years by gender.

head(Vocab)

Vocab %>% xyplot(vocabulary ~ education | factor(year), groups = sex, data = ., type = c("p","g","r"), auto.key = list(columns = 2))

Q5.

The ‘MASS’ library has these two data sets: ‘Animals’ and ‘mammals’. Merge the two files and remove duplicated observations using ‘duplicated’.

head(Animals)

Animals$species <- row.names(Animals)
head(mammals)

mammals$species <- row.names(mammals)
merge(Animals, mammals)

sum(duplicated(merge(Animals, mammals)))

## [1] 0
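
With default arguments `merge()` performs an inner join on all shared columns, so it returns only the rows present in both tables and leaves nothing for `duplicated()` to remove. One reading of the task, sketched here as an assumption rather than the intended solution, is to stack the two data sets with `rbind()` and then drop repeated rows. The names `a`, `m`, `stacked`, and `combined` are illustrative.

```r
library(MASS)   # provides Animals and mammals

# Work on copies so the package data sets stay untouched.
a <- Animals
m <- mammals
a$species <- row.names(a)
m$species <- row.names(m)

# Stack the two tables; both have the columns body, brain, species.
stacked <- rbind(a, m)
row.names(stacked) <- NULL

# Keep the first occurrence of each (body, brain, species) row.
combined <- stacked[!duplicated(stacked), ]
c(stacked = nrow(stacked), combined = nrow(combined))
```

Rows count as duplicates only when body, brain, and species all match exactly; calling `duplicated()` on `stacked$species` alone would instead deduplicate by name.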

Q6.

Convert the data set probe words from long to wide format as described.

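dta <- read.table("probeL.txt", header = TRUE)
head(dta)

# Prefix each position with "Pos_" to build the wide column names,
# then spread the response times across those columns
dta %>% mutate(tmp = "Pos") %>%
  unite(vn, tmp, Position) %>%
  spread(vn, Response_Time) %>%
  as.data.frame() %>% head(5)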

Exercises

Q1.

Select at random one school per county in the data set Caschool{Ecdat} and draw a scatter diagram of average math score mathscr against average reading score readscr for the sampled data set. Make sure your results are reproducible (e.g., the same random sample will be drawn each time).

set.seed(1234)
head(Caschool)

# One randomly chosen school per county; math score (y) against reading score (x)
Caschool %>% group_by(county) %>% sample_n(1) %>%
  xyplot(mathscr ~ readscr, type=c("p","g","r"), data = .)

Q2.

Find 133 class-level 95%-confidence intervals for language test score means of the nlschools{MASS} data set by using the tidy approach. The tail end of the data object should look as follows:

classID language_mean language_lb language_ub
  131       11.273         ...         ...
  132       10.550         ...         ...
  133       10.643         ...         ...

head(nlschools)

nlschools %>% 
  mutate(classID = factor(class, labels = seq_along(levels(class)))) %>%
  group_by(classID) %>% 
  summarize(language_mean = mean(lang), 
            # A 95% CI for the mean uses the standard error sd/sqrt(n), not the raw sd
            language_lb = language_mean - 1.96*sd(lang)/sqrt(n()), 
            language_ub = language_mean + 1.96*sd(lang)/sqrt(n())) %>%
  tail(3)

Q3.

Use the Prestige{car} data set for this problem.

Find the median prestige score for each of the three types of occupation, respectively.
Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.

head(Prestige)

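Prestige %>% 
  group_by(type) %>% 
  # Occupations exactly at their type's median are assigned to "High"
  # (an arbitrary choice) so that none is left unclassified
  mutate(ptmed = median(prestige),
         ptlev = case_when(prestige >= ptmed ~ "High",
                           prestige < ptmed ~ "Low")) %>%
  xyplot(income ~ education | type, groups = ptlev, data = ., type = c("g","p","r"))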

## Warning: Factor `type` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## Warning: Factor `type` contains implicit NA, consider using
## `forcats::fct_explicit_na`

Q4.

Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.

dtac <- read.table("nobel_countries.txt", header = TRUE); head(dtac)

dtaw <- read.table("nobel_winners.txt", header = TRUE); head(dtaw)

full_join(dtac, dtaw) %>% arrange(desc(Year))

## Joining, by = "Year"

Sorting by Year in descending order puts the most recent year first.

Q5.

Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?

fL <- "http://www.amstat.org/publications/jse/datasets/sat.dat.txt"
dta <- read.table(fL, row.names=1)
names(dta) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")
dta$Region <- state.division
head(dta)

dta %>% xyplot(SAT ~ Salary | Region, type = c("g","r","p"), data = .)

Divisions with negative slopes (five in total): West North Central, Mountain, New England, East South Central, and West South Central.
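
The panels show the fitted lines but not the slope values themselves. A sketch of one way to tabulate them, assuming a data frame with the columns `SAT`, `Salary`, and `Region` as constructed above; the helper name `division_slopes` is hypothetical:

```r
# Fit SAT ~ Salary separately within each level of Region and
# return the Salary coefficient (slope) of every fit.
division_slopes <- function(d) {
  groups <- split(d, d$Region, drop = TRUE)   # skip empty factor levels
  fits <- lapply(groups, function(g) lm(SAT ~ Salary, data = g))
  vapply(fits, function(f) unname(coef(f)["Salary"]), numeric(1))
}

# Usage with the data object built above (not run here):
# slopes <- division_slopes(dta)
# sum(slopes < 0)   # number of divisions with a negative slope
```

Counting the negative entries of the returned vector gives a numeric check on what is read off the plot.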

Q6.

The HELP (Health Evaluation and Linkage to Primary Care) study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin or cocaine as their first or second drug of choice, resided in proximity to the primary care clinic to which they would be referred or were homeless. Subjects were interviewed at baseline during their detoxification stay and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.

https://rpubs.com/PL_W/c5e6

Class5

0411
