Data wrangling exercise

In-class exercise 1

Summarize the backpain{HSAUR3} data into the following format:

driver  suburban  case  control  total
no      no        ?     ?        ?
no      yes       ?     ?        ?
yes     no        ?     ?        ?
yes     yes       ?     ?        ?

You should provide comments for each code chunk.

pacman::p_load(HSAUR3)
data("backpain", package="HSAUR3")
dta <- HSAUR3::backpain

Inspect the data.

names(dta)

[1] "ID"       "status"   "driver"   "suburban"

Load tidyverse.

library("tidyverse")

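Reshape the data into the required format. group_by() sets the grouping variables and is usually combined with summarise() to aggregate within each group.

spread(): turns rows into columns.

is.na(): tests whether values are missing.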

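# Caveat: after spread(), sum(is.na(case)) counts reshaped rows whose case cell is empty,
# and n() counts reshaped rows, so these columns are not the case/control counts
# (in the output below, case + control does not add up to total).
dta <- dta %>% group_by(driver, suburban) %>% tidyr::spread(key = 'status', value = 'status') %>%
  summarize(case = sum(is.na(case)),
            control = sum(is.na(control)),
            total = n()) %>%
  as.data.frame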

head(dta)

  driver suburban case control total
1     no       no   38      17    64
2     no      yes    5       4    11
3    yes       no   43      44   107
4    yes      yes   37      58   158
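A more direct way to get the requested layout is to count the status labels within each driver by suburban group. The sketch below uses a small made-up data frame (`toy`, hypothetical values) with the same column names as backpain, so the counts it prints say nothing about the real data. With HSAUR3 installed, replacing `toy` by `HSAUR3::backpain` should apply the same steps.

```r
library(dplyr)

# Hypothetical stand-in for HSAUR3::backpain (same column names, invented rows).
toy <- data.frame(
  status   = c("case", "control", "case", "control", "case", "control"),
  driver   = c("yes",  "yes",     "no",   "no",      "yes",  "no"),
  suburban = c("no",   "no",      "yes",  "no",      "yes",  "yes")
)

# Count cases and controls directly instead of reshaping first.
out <- toy %>%
  group_by(driver, suburban) %>%
  summarise(case    = sum(status == "case"),
            control = sum(status == "control"),
            total   = n(),
            .groups = "drop") %>%
  as.data.frame()

out
```

Because `total` is the group size, `case + control` equals `total` in every row, which is a quick sanity check on the summary.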

In-class exercise 2

Merge the two data sets: state.x77{datasets} and USArrests{datasets} and compute all pair-wise correlations for numerical variables. Is there anything interesting to report?

Assign state.x77 and USArrests to separate objects.

dta1 <- as.data.frame(datasets::state.x77)
dta2 <- datasets::USArrests
head(dta1, 3)

        Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama       3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska         365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona       2212   4530        1.8    70.55    7.8    58.1    15 113417

tail(dta2, 3)

              Murder Assault UrbanPop Rape
West Virginia    5.7      81       39  9.3
Wisconsin        2.6      53       66 10.8
Wyoming          6.8     161       60 15.6

Check which variables each data set contains.

names(dta1)

[1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"    
[6] "HS Grad"    "Frost"      "Area"

names(dta2)

[1] "Murder"   "Assault"  "UrbanPop" "Rape"

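Compute the correlation matrix of each data set separately. The chunk that created dta1r and dta2r is missing from the source; presumably it was:

dta1r <- cor(dta1)
dta2r <- cor(dta2)

Inspect each correlation matrix.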

head(dta1r)

            Population     Income Illiteracy    Life Exp     Murder     HS Grad
Population  1.00000000  0.2082276  0.1076224 -0.06805195  0.3436428 -0.09848975
Income      0.20822756  1.0000000 -0.4370752  0.34025534 -0.2300776  0.61993232
Illiteracy  0.10762237 -0.4370752  1.0000000 -0.58847793  0.7029752 -0.65718861
Life Exp   -0.06805195  0.3402553 -0.5884779  1.00000000 -0.7808458  0.58221620
Murder      0.34364275 -0.2300776  0.7029752 -0.78084575  1.0000000 -0.48797102
HS Grad    -0.09848975  0.6199323 -0.6571886  0.58221620 -0.4879710  1.00000000
                Frost        Area
Population -0.3321525  0.02254384
Income      0.2262822  0.36331544
Illiteracy -0.6719470  0.07726113
Life Exp    0.2620680 -0.10733194
Murder     -0.5388834  0.22839021
HS Grad     0.3667797  0.33354187

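The correlations are computed over all states in state.x77, not only the three states printed earlier. The pairs with |r| > 0.7 are Illiteracy and Murder (r = 0.70) and Murder and Life Exp (r = -0.78).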

head(dta2r)

             Murder   Assault   UrbanPop      Rape
Murder   1.00000000 0.8018733 0.06957262 0.5635788
Assault  0.80187331 1.0000000 0.25887170 0.6652412
UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
Rape     0.56357883 0.6652412 0.41134124 1.0000000

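Likewise, over all states in USArrests the only pair with r > 0.7 is Murder and Assault (r = 0.80).

In short, Murder is strongly correlated with Illiteracy and Life Exp in state.x77, and with Assault in USArrests.

If the two data sets are merged directly with all = TRUE: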

dtaa <- merge(dta1, dta2, all = TRUE)

head(dtaa)

  Murder Population Income Illiteracy Life Exp HS Grad Frost  Area Assault
1    0.8         NA     NA         NA       NA      NA    NA    NA      45
2    1.4        637   5087        0.8    72.78    50.3   186 69273      NA
3    1.7        681   4167        0.5    72.08    53.3   172 75955      NA
4    2.1         NA     NA         NA       NA      NA    NA    NA      83
5    2.1         NA     NA         NA       NA      NA    NA    NA      57
6    2.2         NA     NA         NA       NA      NA    NA    NA      56
  UrbanPop Rape
1       44  7.3
2       NA   NA
3       NA   NA
4       51  7.8
5       56  9.5
6       57 11.3

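many cells come out as NA. merge() joins on the only shared column name, Murder, and the state names live in the row names, so rows pair up only when the two Murder values happen to be equal.

However, without all = TRUE: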

dtaa <- merge(dta1, dta2)

head(dtaa)

  Murder Population Income Illiteracy Life Exp HS Grad Frost   Area Assault
1    2.7       1058   3694        0.7    70.39    54.7   161  30920      72
2    3.3       5814   4755        1.1    71.83    58.5   103   7826     110
3    3.3        812   4281        0.7    71.23    57.6   174   9027     110
4    4.3       3559   4864        0.6    71.72    63.5    32  66570     102
5    5.3        813   4119        0.6    71.87    59.5   126  82677      46
6    6.8       2541   4884        0.7    72.06    63.9   166 103766     161
  UrbanPop Rape
1       66 14.9
2       77 11.1
3       77 11.1
4       62 16.5
5       83 20.2
6       60 15.6

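This runs without NAs, but the result is not correct: rows are matched on equal Murder values rather than on state, so unrelated states are paired and most states are dropped.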
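Since both data sets carry the state names as row names, one way to pair each state with itself is to merge on the row names. The sketch below does that and then lists the variable pairs with |r| > 0.7. The suffixes `.x77` and `.arrests` and the object names are choices made here for illustration, and the 0.7 cut-off simply follows the threshold used in the notes above.

```r
x77 <- as.data.frame(datasets::state.x77)
arr <- datasets::USArrests

# Merge on row names so that each state is matched with itself.
# Both tables have a Murder column, so suffixes keep the two apart.
merged <- merge(x77, arr, by = "row.names", suffixes = c(".x77", ".arrests"))
names(merged)[1] <- "State"

# All pair-wise correlations among the numeric variables.
r <- cor(merged[sapply(merged, is.numeric)])

# Keep each pair once (lower triangle) and list the strong ones.
r_lower <- r
r_lower[upper.tri(r_lower, diag = TRUE)] <- NA
strong <- subset(as.data.frame(as.table(r_lower)),
                 !is.na(Freq) & abs(Freq) > 0.7)
strong[order(-abs(strong$Freq)), ]
```

Comparing `Murder.x77` with `Murder.arrests` in this matrix is a natural first thing to look at, because the two columns measure the same quantity in different years.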

In-class exercise 3

Supply comments to each code chunk in the following survey rmarkdown file and preview it as an R notebook or knit to html.

https://rpubs.com/Onevoice/In-class_exercises_3

In-class exercise 4

The data set Vocab{car} gives observations on gender, education and vocabulary, from respondents to U.S. General Social Surveys, 1972-2004. Summarize the relationship between education and vocabulary over the years by gender.

Load Vocab (accessed here through the carData package) and look at the first rows.

head(carData::Vocab)

         year    sex education vocabulary
19740001 1974   Male        14          9
19740002 1974   Male        16          9
19740003 1974 Female        10          9
19740004 1974 Female        10          5
19740005 1974 Female        12          8
19740006 1974   Male        16          8

Assign the data to dta and, using the lattice package, subset individual years and draw a comparison plot for each.

dta<- carData::Vocab
pacman::p_load(lattice)
dta1974 <- subset(dta, dta$year=="1974")
xyplot(vocabulary ~ education, groups=sex, data=dta1974, type=c("p", "g"), auto.key=list(columns=2))

dta1984 <- subset(dta, dta$year=="1984")
xyplot(vocabulary ~ education, groups=sex, data=dta1984, type=c("p", "g"), auto.key=list(columns=2))

dta1994 <- subset(dta, dta$year=="1994")
xyplot(vocabulary ~ education, groups=sex, data=dta1994, type=c("p", "g"), auto.key=list(columns=2))

dta2004 <- subset(dta, dta$year=="2004")
xyplot(vocabulary ~ education, groups=sex, data=dta2004, type=c("p", "g"), auto.key=list(columns=2))

Plot all years together for comparison.

lattice::xyplot(vocabulary ~ education | year, groups = sex, type = c("p", "g", "r"), data =dta , auto.key = list(columns = 2), xlab = "Education", ylab = "Vocabulary")

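In every panel vocabulary rises with education for both sexes; the panels were also read as showing vocabulary increasing over the years.

Next, split the data by sex and tabulate the per-year regression coefficients.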

malec <- subset(dta, dta$sex=="Male")
lapply(split(malec, malec$year), function(x) coef(lm(x$vocabulary ~ x$education)))

$`1974`
(Intercept) x$education 
  1.5318434   0.3713183 

$`1976`
(Intercept) x$education 
  1.6342960   0.3555403 

$`1978`
(Intercept) x$education 
  0.9762161   0.3963762 

$`1982`
(Intercept) x$education 
  0.9730291   0.3832637 

$`1984`
(Intercept) x$education 
   1.678465    0.337124 

$`1987`
(Intercept) x$education 
  0.8103651   0.3818373 

$`1988`
(Intercept) x$education 
  1.0459936   0.3592442 

$`1989`
(Intercept) x$education 
  1.0596176   0.3708525 

$`1990`
(Intercept) x$education 
  1.7000935   0.3377029 

$`1991`
(Intercept) x$education 
  1.2504604   0.3683962 

$`1993`
(Intercept) x$education 
  1.6384884   0.3221049 

$`1994`
(Intercept) x$education 
  1.8684770   0.3146151 

$`1996`
(Intercept) x$education 
  0.8221711   0.3770325 

$`1998`
(Intercept) x$education 
  1.5199973   0.3314754 

$`2000`
(Intercept) x$education 
  1.1203888   0.3558918 

$`2004`
(Intercept) x$education 
  1.4259424   0.3411153 

$`2006`
(Intercept) x$education 
  2.1383454   0.2952926 

$`2008`
(Intercept) x$education 
  1.4212286   0.3277987 

$`2010`
(Intercept) x$education 
  1.7996389   0.3135749 

$`2012`
(Intercept) x$education 
  1.7303105   0.3061534 

$`2014`
(Intercept) x$education 
  1.4804789   0.3262112 

$`2016`
(Intercept) x$education 
  1.8562367   0.3031146

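For men, the education slope drifts down from about 0.37 in 1974 to about 0.30 in 2016 while the intercept rises from about 1.5 to about 1.9, so the vocabulary gain per extra year of education weakens over time.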

femalec <- subset(dta, dta$sex=="Female")
lapply(split(femalec, femalec$year), function(x) coef(lm(x$vocabulary ~ x$education)))

$`1974`
(Intercept) x$education 
  1.5652579   0.3816095 

$`1976`
(Intercept) x$education 
  1.7021281   0.3824002 

$`1978`
(Intercept) x$education 
  1.3006416   0.4002707 

$`1982`
(Intercept) x$education 
  0.9829602   0.3949758 

$`1984`
(Intercept) x$education 
  1.4536872   0.3728698 

$`1987`
(Intercept) x$education 
  0.9647931   0.3843508 

$`1988`
(Intercept) x$education 
  1.1634561   0.3763999 

$`1989`
(Intercept) x$education 
  1.0682600   0.3863606 

$`1990`
(Intercept) x$education 
  0.4594812   0.4346902 

$`1991`
(Intercept) x$education 
  1.1543766   0.3875821 

$`1993`
(Intercept) x$education 
  1.7388287   0.3286325 

$`1994`
(Intercept) x$education 
  1.6453365   0.3422146 

$`1996`
(Intercept) x$education 
  1.1482811   0.3727178 

$`1998`
(Intercept) x$education 
  1.4472751   0.3592843 

$`2000`
(Intercept) x$education 
  1.9276040   0.3155532 

$`2004`
(Intercept) x$education 
   2.104150    0.304056 

$`2006`
(Intercept) x$education 
  2.7777171   0.2535376 

$`2008`
(Intercept) x$education 
  2.6074315   0.2553971 

$`2010`
(Intercept) x$education 
  1.3520300   0.3468821 

$`2012`
(Intercept) x$education 
  1.7535298   0.3080832 

$`2014`
(Intercept) x$education 
  2.3445239   0.2663464 

$`2016`
(Intercept) x$education 
  2.0055919   0.2928955

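For women, the slope falls from about 0.38 in 1974 to about 0.29 in 2016, with lows near 0.25 in 2006 and 2008, while the intercept rises. This is the same pattern as for men: the education effect weakens over time, which is not the same as vocabulary itself declining.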
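The two long lists of coefficients are easier to compare when they sit in one table with a row per year and sex. The sketch below shows one way to collect them, using simulated data (`toy`, hypothetical) that only borrows the column names of Vocab; the helper name `fit_one` is also made up. With carData installed, passing `carData::Vocab` in place of `toy` should give the real table.

```r
# Hypothetical stand-in with the same column names as carData::Vocab.
set.seed(1)
toy <- expand.grid(id = 1:30, year = c(1974, 2016), sex = c("Female", "Male"))
toy$education  <- sample(8:20, nrow(toy), replace = TRUE)
toy$vocabulary <- 1 + 0.35 * toy$education + rnorm(nrow(toy))

# Fit vocabulary ~ education in one year-by-sex cell and return one row.
fit_one <- function(d) {
  b <- coef(lm(vocabulary ~ education, data = d))
  data.frame(year = d$year[1], sex = d$sex[1],
             intercept = b[[1]], slope = b[[2]])
}

# Split by year and sex, fit each cell, and stack the results.
cells <- split(toy, list(toy$year, toy$sex), drop = TRUE)
coefs <- do.call(rbind, lapply(cells, fit_one))
rownames(coefs) <- NULL
coefs[order(coefs$sex, coefs$year), ]
```

Plotting `slope` against `year` with one line per sex would then show any trend in the education effect at a glance.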

In-class exercise 5

The ‘MASS’ library has these two data sets: ‘Animals’ and ‘mammals’. Merge the two files and remove duplicated observations using ‘duplicated’.

Load the Animals and mammals data and look at each.

library(MASS)
dta1 <- Animals
dta2 <- mammals

head(dta1)

                    body brain
Mountain beaver     1.35   8.1
Cow               465.00 423.0
Grey wolf          36.33 119.5
Goat               27.66 115.0
Guinea pig          1.04   5.5
Dipliodocus     11700.00  50.0

head(dta2)

                   body brain
Arctic fox        3.385  44.5
Owl monkey        0.480  15.5
Mountain beaver   1.350   8.1
Cow             465.000 423.0
Grey wolf        36.330 119.5
Goat             27.660 115.0

Merge the two data sets and flag duplicated observations.

dta3 <- merge(dta1, dta2, all = TRUE)
duplicated(dta3$body)

 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE

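Check the result. duplicated() only flags repeated body values; dta3 itself is not modified, and merge() has already collapsed rows whose body and brain both match while dropping the animal names.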

head(dta3)

   body brain
1 0.005  0.14
2 0.010  0.25
3 0.023  0.30
4 0.023  0.40
5 0.048  0.33
6 0.060  1.00
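To keep the animal names and actually drop the repeats, one option is to stack the two data sets and filter with `duplicated()`. This is a sketch under the assumption that "duplicated observation" means the same animal name appearing in both sets; the column name `name` and the object names are introduced here for illustration.

```r
# Move the row names into a column first: rbind() would otherwise
# rename repeated row names to keep them unique.
a <- data.frame(name = rownames(MASS::Animals), MASS::Animals, row.names = NULL)
m <- data.frame(name = rownames(MASS::mammals), MASS::mammals, row.names = NULL)

both <- rbind(a, m)

# duplicated() marks the second and later occurrences; negate to keep the first.
uniq <- both[!duplicated(both$name), ]

nrow(both)
nrow(uniq)
head(uniq)
```

If duplicates should instead be judged on the measurements, `duplicated(both[c("body", "brain")])` applies the same idea to the two numeric columns.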

In-class exercise 6

Convert the data set probe words from long to wide format as described.

Read the data.

dta <- read.table("D:Program Files/R/R-3.6.3/bin/probeL.txt", header = TRUE, stringsAsFactors = FALSE, fill = TRUE)
head(dta)

   ID Response_Time Position
1 S01            51        1
2 S01            36        2
3 S01            50        3
4 S01            35        4
5 S01            42        5
6 S02            27        1

Load tidyverse.

library(tidyverse)

Use mutate() to relabel Position as Pos_1 to Pos_5.

dta %>% mutate(Position = case_when(Position == 1 ~ "Pos_1", Position == 2 ~ "Pos_2", Position == 3 ~ "Pos_3", Position == 4 ~ "Pos_4", Position == 5 ~ "Pos_5")) %>% dplyr::select(Response_Time, Position) %>% head

  Response_Time Position
1            51    Pos_1
2            36    Pos_2
3            50    Pos_3
4            35    Pos_4
5            42    Pos_5
6            27    Pos_1

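This relabels Position, but the data are still in long format and ID has been dropped, so a pivot step is still needed to reach the wide layout.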
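The long-to-wide step itself can be sketched with `tidyr::pivot_wider()`. The data frame below is a toy: the S01 values and the first S02 value are copied from the printed head(), while the remaining S02 values are invented for illustration.

```r
library(tidyr)

# Toy long-format data; S02 values after the first are hypothetical.
probe_long <- data.frame(
  ID            = rep(c("S01", "S02"), each = 5),
  Response_Time = c(51, 36, 50, 35, 42, 27, 20, 26, 17, 27),
  Position      = rep(1:5, times = 2)
)

# One row per subject, one column per position.
probe_wide <- pivot_wider(
  probe_long,
  id_cols      = ID,
  names_from   = Position,
  values_from  = Response_Time,
  names_prefix = "Pos_"
)

probe_wide
```

`names_prefix` builds the Pos_1 to Pos_5 column names directly, so the separate relabelling with `case_when()` is not needed for this route.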

Exercise 1

Select at random one school per county in the data set Caschool{Ecdat} and draw a scatter diagram of average math score mathscr against average reading score readscr for the sampled data set. Make sure your results are reproducible (e.g., the same random sample will be drawn each time).

Load Ecdat and use the Caschool data.

pacman::p_load(Ecdat)
dta <- Ecdat::Caschool
head(dta)

  distcod  county                        district grspan enrltot teachers
1   75119 Alameda              Sunol Glen Unified  KK-08     195    10.90
2   61499   Butte            Manzanita Elementary  KK-08     240    11.15
3   61549   Butte     Thermalito Union Elementary  KK-08    1550    82.90
4   61457   Butte Golden Feather Union Elementary  KK-08     243    14.00
  calwpct mealpct computer testscr   compstu  expnstu      str avginc     elpct
1  0.5102  2.0408       67   690.8 0.3435898 6384.911 17.88991 22.690  0.000000
2 15.4167 47.9167      101   661.2 0.4208333 5099.381 21.52466  9.824  4.583333
3 55.0323 76.3226      169   643.6 0.1090323 5501.955 18.69723  8.978 30.000002
4 36.4754 77.0492       85   647.7 0.3497942 7101.831 17.35714  8.978  0.000000
  readscr mathscr
1   691.6   690.0
2   660.5   661.9
3   636.3   650.9
4   651.9   643.5
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

Load tidyverse.

library(tidyverse)

Set the seed so that the random sample is reproducible.

set.seed(129456)

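Plot. Note that the seed is set but no sampling step follows, so the plot and the column means below use all districts. With a two-column data frame, plot() puts the first column (mathscr) on the x axis, so the axis labels are set accordingly. The tee pipe %T>% comes from magrittr, which has to be attached.

library(magrittr)  # provides the tee pipe %T>%
dta %>% dplyr::transmute(mathscr, readscr) %T>%
  plot(., xlab="mathscr (average math score)", ylab="readscr (average reading score)", pch='.') %>% colMeans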

 mathscr  readscr 
653.3426 654.9705
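The sampling step the exercise asks for can be sketched in base R as follows. The helper `sample_one_per_group()` is a hypothetical name introduced here, and the toy data frame contains only the four rows of Caschool printed above, so it is a stand-in rather than the real data. With Ecdat installed, `sample_one_per_group(Ecdat::Caschool, "county", seed = 129456)` should draw one row per county.

```r
# Draw one row per level of `group`; the seed makes the draw repeatable.
sample_one_per_group <- function(df, group, seed) {
  set.seed(seed)
  rows_by_group <- split(seq_len(nrow(df)), df[[group]], drop = TRUE)
  # sample.int() avoids the sample() pitfall when a group has a single row
  picked <- vapply(rows_by_group,
                   function(i) i[sample.int(length(i), 1)],
                   integer(1))
  df[sort(unname(picked)), , drop = FALSE]
}

# Toy stand-in: the four Caschool rows shown in the head() output.
toy <- data.frame(
  county  = c("Alameda", "Butte", "Butte", "Butte"),
  readscr = c(691.6, 660.5, 636.3, 651.9),
  mathscr = c(690.0, 661.9, 650.9, 643.5)
)

s1 <- sample_one_per_group(toy, "county", seed = 129456)
s2 <- sample_one_per_group(toy, "county", seed = 129456)
identical(s1, s2)  # same seed, same sample

# Math score (y) against reading score (x) for the sampled rows.
plot(mathscr ~ readscr, data = s1,
     xlab = "Average reading score", ylab = "Average math score")
```

Setting the seed inside the helper ties reproducibility to the call itself, so re-running other chunks in between does not change which schools are drawn.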

Exercise 2

Find 133 class-level 95%-confidence intervals for language test score means of the nlschools{MASS} data set by using the tidy approach. The tail end of the data object should look as follows:

classID  language_mean  language_lb  language_ub
131      11.273         …            …
132      10.550         …            …
133      10.643         …            …

Read the data.

pacman::p_load(MASS)
dta <- MASS::nlschools
head(dta)

  lang   IQ class GS SES COMB
1   46 15.0   180 29  23    0
2   45 14.5   180 29  10    0
3   33  9.5   180 29  15    0
4   46 11.0   180 29  23    0
5   20  8.0   180 29  10    0
6   30  9.5   180 29  10    0

Load tidyverse.

library(tidyverse)

Load tibble and print the first rows with knitr::kable().

library(tibble)
data(nlschools, package="MASS")
knitr::kable(head(nlschools))

lang	IQ	class	GS	SES
46	15.0	180	29	23
45	14.5	180	29	10
33	9.5	180	29	15
46	11.0	180	29	23
20	8.0	180	29	10
30	9.5	180	29	10

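Compute the mean, standard error and 95% confidence interval of lang. Note that the second group_by(lang) replaces the grouping by class, so the result below has one row per distinct lang score (47 rows) rather than one row per class (133 rows).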

dtac <- dta  %>% group_by(class)   %>% 
  group_by(lang) %>%
  summarise(language_mean = mean(lang, na.rm = TRUE),
            sd.lang = sd(lang, na.rm = TRUE),
            n.lang = n()) %>%
  mutate(se.lang = sd.lang / sqrt(n.lang),
         language_lb = language_mean - qt(1 - (0.05 / 2), n.lang - 1) * se.lang,
         language_ub = language_mean + qt(1 - (0.05 / 2), n.lang - 1) * se.lang)

dtac

# A tibble: 47 x 7
    lang language_mean sd.lang n.lang se.lang language_lb language_ub
   <int>         <dbl>   <dbl>  <int>   <dbl>       <dbl>       <dbl>
 1     9             9     NaN      1     NaN         NaN         NaN
 2    11            11     NaN      1     NaN         NaN         NaN
 3    14            14       0      2       0          14          14
 4    15            15       0      4       0          15          15
 5    16            16       0      3       0          16          16
 6    17            17       0      7       0          17          17
 7    18            18       0      7       0          18          18
 8    19            19       0      6       0          19          19
 9    20            20       0     13       0          20          20
10    21            21       0     18       0          21          21
# ... with 37 more rows
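Grouping by class alone gives one interval per class. The sketch below is one way to write that with dplyr, assuming the usual t-based interval (mean plus or minus the t quantile times the standard error); the names `class_ci`, `n_pupils` and `se` are introduced here, and `classID` is simply the row position of each class level.

```r
library(dplyr)

class_ci <- MASS::nlschools %>%
  group_by(class) %>%                                   # one group per class
  summarise(language_mean = mean(lang),
            n_pupils      = n(),
            se            = sd(lang) / sqrt(n_pupils),
            .groups       = "drop") %>%
  mutate(classID     = row_number(),
         language_lb = language_mean - qt(0.975, n_pupils - 1) * se,
         language_ub = language_mean + qt(0.975, n_pupils - 1) * se) %>%
  dplyr::select(classID, language_mean, language_lb, language_ub)

tail(class_ci, 3)
```

`dplyr::select()` is written with its namespace because MASS also exports a `select()`, which can mask the dplyr one when MASS is attached.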

Exercise 3

Use the Prestige{car} data set for this problem. Find the median prestige score for each of the three types of occupation, respectively. Use the median score in each type of occupation to define two levels of prestige: High and low, for each occupation, respectively. Summarize the relationship between income and education for each category generated from crossing the factor prestige with the type of occupation.

Read the data.

library(car)
pacman::p_load(car)
class(Prestige)

[1] "data.frame"

head(Prestige)

                    education income women prestige census type
gov.administrators      13.11  12351 11.16     68.8   1113 prof
general.managers        12.26  25879  4.02     69.1   1130 prof
accountants             12.77   9271 15.70     63.4   1171 prof
purchasing.officers     11.42   8865  9.11     56.8   1175 prof
chemists                14.62   8403 11.68     73.5   2111 prof
physicists              15.64  11030  5.13     77.6   2113 prof

names(Prestige)

[1] "education" "income"    "women"     "prestige"  "census"    "type"

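Use aggregate() to compute the mean prestige for each type. The exercise asks for the median; FUN=median would give that instead.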

dta <- Prestige
dta <- aggregate((prestige~ type) , data=dta, FUN=mean)
head(dta)

  type prestige
1   bc 35.52727
2 prof 67.84839
3   wc 42.24348

Use quantile() to get the deciles of these three group means.

dta1<- quantile(dta$prestige, probs=seq(from=0, to=1, by=.1))
dta1

      0%      10%      20%      30%      40%      50%      60%      70% 
35.52727 36.87051 38.21375 39.55700 40.90024 42.24348 47.36446 52.48544 
     80%      90%     100% 
57.60642 62.72741 67.84839

Then cut the group means into Low and High at a threshold of 50.

dta1 <- with(dta, cut(prestige, ordered=T, breaks=c(0, 50, 100), labels=c("Low",  "High")))
with(dta, table(dta1))

dta1
 Low High 
   2    1

dta1

[1] Low  High Low 
Levels: Low < High

Finally, aggregate by the new Low/High factor.

dtap <- aggregate(cbind(prestige, type) ~ dta1, data=dta, FUN=mean)
dtap

  dta1 prestige type
1  Low 38.88538    2
2 High 67.84839    2

Load tidyverse.

library(tidyverse)

Draw a scatter plot of income against education.

plot(x=Prestige$education, y=Prestige$income, main="income to education",xlab="educations",ylab="incomes")

require(lattice)

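Draw scatter plots with regression lines, conditioned on the Low/High factor. Note that dta1 holds only three values (one per type) rather than one per occupation, so it does not line up with the rows of Prestige.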

Prestige$prestige <- as.factor(Prestige$prestige) 

xyplot(income ~ education | dta1, data=Prestige, type=c("g","p","r"), auto.key=list(columns=2))
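What the exercise describes is a High/Low flag per occupation, defined relative to the median prestige of that occupation's own type. A minimal sketch of that step follows. The helper `flag_by_group_median()` is a hypothetical name, and `toy` is an invented data frame that only borrows the column names of Prestige; with carData installed, the same calls on `carData::Prestige` (after dropping rows with a missing type) should produce the real categories.

```r
# Label each score High or Low relative to the median of its own group.
flag_by_group_median <- function(score, group) {
  med <- ave(score, group, FUN = median)  # each row receives its group's median
  factor(ifelse(score > med, "High", "Low"), levels = c("Low", "High"))
}

# Hypothetical stand-in with the column names of carData::Prestige.
toy <- data.frame(
  type      = rep(c("bc", "prof", "wc"), each = 4),
  prestige  = c(20, 30, 40, 50,  60, 65, 75, 80,  35, 40, 45, 50),
  education = c(7, 8, 9, 10,  13, 14, 15, 16,  10, 11, 11, 12),
  income    = c(3000, 4000, 5000, 6500,  8000, 9000, 12000, 15000,
                4000, 4500, 5500, 6000)
)

toy$level <- flag_by_group_median(toy$prestige, toy$type)
with(toy, table(type, level))

# One income-on-education slope per type-by-level cell.
cells <- split(toy, list(toy$type, toy$level), drop = TRUE)
sapply(cells, function(d) coef(lm(income ~ education, data = d))[["education"]])
```

Because `level` has one value per row, it can also serve as a conditioning variable, for example `lattice::xyplot(income ~ education | type + level, data = toy)`.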

Exercise 4

Reverse the order of input to the series of dplyr::*_join examples using data from the Nobel laureates in literature and explain the resulting output.

Load tidyverse.

library(tidyverse)

Read the files.

nbl_c <- read.table("C:/Users/boss/Desktop/nobel_countries.txt", h = T)
nbl_w <- read.table("C:/Users/boss/Desktop/nobel_winners.txt", h = T)

Merge the two tables with merge().

merge(nbl_c, nbl_w)

  Year Country              Name Gender
1 1950      UK Bertrand  Russell   Male
2 2012   China            Mo Yan   Male
3 2013  Canada      Alice  Munro Female
4 2014  France   Patrick Modiano   Male
5 2016      US        Bob  Dylan   Male
6 2017      UK    Kazuo Ishiguro   Male

Some rows are missing; all = TRUE keeps every row from both tables.

merge(nbl_c, nbl_w, all = TRUE)

  Year Country              Name Gender
1 1938    <NA>        Pearl Buck Female
2 1950      UK Bertrand  Russell   Male
3 2011  Sweden              <NA>   <NA>
4 2012   China            Mo Yan   Male
5 2013  Canada      Alice  Munro Female
6 2014  France   Patrick Modiano   Male
7 2015  Russia              <NA>   <NA>
8 2016      US        Bob  Dylan   Male
9 2017      UK    Kazuo Ishiguro   Male

inner_join(): rows of nbl_c that have a match in nbl_w, with the columns of both tables.

inner_join(nbl_c, nbl_w)

  Country Year              Name Gender
1  France 2014   Patrick Modiano   Male
2      UK 1950 Bertrand  Russell   Male
3      UK 2017    Kazuo Ishiguro   Male
4      US 2016        Bob  Dylan   Male
5  Canada 2013      Alice  Munro Female
6   China 2012            Mo Yan   Male

semi_join(): rows of nbl_c that have a match in nbl_w, keeping only the columns of nbl_c (the first argument).

semi_join(nbl_c, nbl_w)

  Country Year
1  France 2014
2      UK 1950
3      UK 2017
4      US 2016
5  Canada 2013
6   China 2012

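left_join(): nbl_c is the first argument and nbl_w the second; every row of nbl_c is kept, and columns from nbl_w are NA where there is no match.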

left_join(nbl_c, nbl_w)

  Country Year              Name Gender
1  France 2014   Patrick Modiano   Male
2      UK 1950 Bertrand  Russell   Male
3      UK 2017    Kazuo Ishiguro   Male
4      US 2016        Bob  Dylan   Male
5  Canada 2013      Alice  Munro Female
6   China 2012            Mo Yan   Male
7  Russia 2015              <NA>   <NA>
8  Sweden 2011              <NA>   <NA>

anti_join(): rows that occur only in nbl_c.

anti_join(nbl_c, nbl_w)

  Country Year
1  Russia 2015
2  Sweden 2011

full_join(): all rows from both tables, shared or not; values without a match are NA.

full_join(nbl_c, nbl_w)

  Country Year              Name Gender
1  France 2014   Patrick Modiano   Male
2      UK 1950 Bertrand  Russell   Male
3      UK 2017    Kazuo Ishiguro   Male
4      US 2016        Bob  Dylan   Male
5  Canada 2013      Alice  Munro Female
6   China 2012            Mo Yan   Male
7  Russia 2015              <NA>   <NA>
8  Sweden 2011              <NA>   <NA>
9    <NA> 1938        Pearl Buck Female
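All the calls above pass nbl_c first. To see what reversing the inputs does, the sketch below rebuilds the two tables from the printed output (the column order of nbl_w and the spacing inside the names are assumptions, since the input files are not shown) and runs the order-sensitive joins with nbl_w first.

```r
library(dplyr)

# Tables reconstructed from the printed join output above.
nbl_c <- data.frame(
  Country = c("France", "UK", "UK", "US", "Canada", "China", "Russia", "Sweden"),
  Year    = c(2014, 1950, 2017, 2016, 2013, 2012, 2015, 2011)
)
nbl_w <- data.frame(
  Name   = c("Patrick Modiano", "Bertrand Russell", "Kazuo Ishiguro",
             "Bob Dylan", "Alice Munro", "Mo Yan", "Pearl Buck"),
  Gender = c("Male", "Male", "Male", "Male", "Female", "Male", "Female"),
  Year   = c(2014, 1950, 2017, 2016, 2013, 2012, 1938)
)

# Reversed order: nbl_w is now the table whose rows are kept or filtered.
left_w <- left_join(nbl_w, nbl_c, by = "Year")  # all winners; Country is NA for 1938
semi_w <- semi_join(nbl_w, nbl_c, by = "Year")  # winners with a matching year, winner columns only
anti_w <- anti_join(nbl_w, nbl_c, by = "Year")  # winners without a matching year

left_w
semi_w
anti_w
```

inner_join() and full_join() return the same set of rows in either order (only the column order changes), whereas left_join(), semi_join() and anti_join() depend on which table comes first: with nbl_w first, the unmatched row is Pearl Buck instead of the Russia and Sweden rows.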

Exercise 5

Augment the data object in the ‘SAT’ lecture note with state.division{datasets}. For each of the 9 divisions, find the slope estimate for regressing average SAT scores onto average teacher’s salary. How many of them are of negative signs?

Read the data (the chunk that reads the file is not shown in the source).

              V2   V3     V4 V5  V6  V7   V8
Alabama    4.405 17.2 31.144  8 491 538 1029
Alaska     8.963 17.6 47.951 47 445 489  934
Arizona    4.778 19.3 32.175 27 448 496  944
Arkansas   4.459 17.1 28.934  6 482 523 1005
California 4.992 24.0 41.078 45 417 485  902
Colorado   5.443 18.4 34.571 29 462 518  980

Load tidyverse.

library(tidyverse)

Rename the columns.

names(dta) <- c("Spending", "PTR", "Salary", "PE", "Verbal", "Math", "SAT")
str(dta)

'data.frame':   50 obs. of  7 variables:
 $ Spending: num  4.41 8.96 4.78 4.46 4.99 ...
 $ PTR     : num  17.2 17.6 19.3 17.1 24 18.4 14.4 16.6 19.1 16.3 ...
 $ Salary  : num  31.1 48 32.2 28.9 41.1 ...
 $ PE      : int  8 47 27 6 45 29 81 68 48 65 ...
 $ Verbal  : int  491 445 448 482 417 462 431 429 420 406 ...
 $ Math    : int  538 489 496 523 485 518 477 468 469 448 ...
 $ SAT     : int  1029 934 944 1005 902 980 908 897 889 854 ...

Add the nine divisions.

dta$divisions <- state.division
with(dta, table(divisions))

divisions
       New England    Middle Atlantic     South Atlantic East South Central 
                 6                  3                  8                  4 
West South Central East North Central West North Central           Mountain 
                 4                  5                  7                  8 
           Pacific 
                 5

Draw a regression line for each of the nine divisions.

lattice::xyplot(SAT ~ Salary, groups=divisions, data=dta, type=c('r', 'g'), auto.key=list(columns=3))

Compute the correlation between SAT score and teacher salary within each division.

dta %>%
  group_by(divisions) %>%
  dplyr::summarize(r=cor(SAT, Salary))

# A tibble: 9 x 2
  divisions                r
  <fct>                <dbl>
1 New England        -0.0830
2 Middle Atlantic     0.662 
3 South Atlantic      0.489 
4 East South Central -0.372 
5 West South Central -0.884 
6 East North Central  0.524 
7 West North Central -0.206 
8 Mountain           -0.729 
9 Pacific             0.0649

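Five of the correlations are negative. In a simple regression the slope has the same sign as the correlation, so five of the nine slope estimates are negative as well.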
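The exercise asks for slope estimates rather than correlations. The SAT file is not included here, so the sketch below illustrates the per-division regression on a stand-in: Life.Exp regressed on Income from the built-in state.x77, which is an arbitrary choice made only to have runnable data. With the SAT data frame, the same pattern with `lm(SAT ~ Salary, data = d)` and `dta$divisions` would give the nine slopes.

```r
# Stand-in data: state.x77 plus division (column names become Life.Exp, HS.Grad).
st <- data.frame(state.x77, division = state.division)
by_div <- split(st, st$division)

# Slope of the simple regression within each division.
slopes <- sapply(by_div, function(d) coef(lm(Life.Exp ~ Income, data = d))[["Income"]])

# Correlation within each division, for comparison.
rs <- sapply(by_div, function(d) cor(d$Life.Exp, d$Income))

cbind(slope = slopes, r = rs)
all(sign(slopes) == sign(rs))  # slope and correlation share their sign
sum(slopes < 0)                # number of negative slopes
```

The sign agreement follows from the identity slope = r * sd(y) / sd(x), where both standard deviations are positive.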

Exercise 6

The HELP (Health Evaluation and Linkage to Primary Care) study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. Eligible subjects were adults, who spoke Spanish or English, reported alcohol, heroin or cocaine as their first or second drug of choice, resided in proximity to the primary care clinic to which they would be referred or were homeless. Subjects were interviewed at baseline during their detoxification stay and follow-up interviews were undertaken every 6 months for 2 years. A variety of continuous, count, discrete, and survival time predictors and outcomes were collected at each of these five occasions.

The following R script is used to manage the data file at the initial stage of investigation. Provide comments on what each line of the script is meant to achieve.

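## ----echo=FALSE,eval=TRUE------------------------------------------------
options(continue=" ")  # set the prompt shown on continuation lines

## ------------------------------------------------------------------------
options(digits=3)  # print 3 significant digits
options(width=72)  # narrow output
ds = read.csv("http://www.amherst.edu/~nhorton/r2/datasets/help.csv")  # read the HELP data
library(dplyr)  # load dplyr
# Keep only these variables and store them as newds. The author reported an error on this
# line and skipped it; one possible cause is MASS::select() masking dplyr::select(),
# in which case calling dplyr::select() explicitly would resolve it.
newds = select(ds, cesd, female, i1, i2, id, treat, f1a, f1b, f1c, f1d, f1e, f1f, f1g, f1h, f1i, f1j, f1k, f1l, f1m, f1n, f1o, f1p, f1q, f1r, f1s, f1t)

## ------------------------------------------------------------------------
names(newds)       # list the variable names in newds
str(newds[,1:10])  # structure of the first 10 variables

## ------------------------------------------------------------------------
summary(newds[,1:10])  # descriptive summary of the first 10 variables

## ------------------------------------------------------------------------
head(newds, n=3)  # first three rows

## ------------------------------------------------------------------------
comment(newds) = "HELP baseline dataset"  # attach a descriptive comment to newds (this does not rename it)
comment(newds)                            # show the comment
save(ds, file="savedfile")                # save ds in R's binary format

## ------------------------------------------------------------------------
write.csv(ds, file="ds.csv")  # export ds to a CSV file

## ------------------------------------------------------------------------
library(foreign)  # load foreign
write.foreign(newds, "file.dat", "file.sas", package="SAS")  # write newds as a data file plus a SAS script that reads it

## ------------------------------------------------------------------------
with(newds, cesd[1:10])        # first 10 values of cesd
with(newds, head(cesd, 10))    # the same, using head()

## ------------------------------------------------------------------------
with(newds, cesd[cesd > 56])  # cesd values greater than 56

## ------------------------------------------------------------------------
library(dplyr)  # load dplyr
filter(newds, cesd > 56) %>% select(id, cesd)  # keep rows with cesd > 56, then only the id and cesd columns

## ------------------------------------------------------------------------
with(newds, sort(cesd)[1:4])   # the four smallest cesd values
with(newds, which.min(cesd))   # position of the observation with the smallest cesd

## ------------------------------------------------------------------------
library(mosaic)  # load mosaic
tally(~ is.na(f1g), data=newds)  # count missing and non-missing values of f1g
favstats(~ f1g, data=newds)      # summary statistics for f1g (quartiles, mean, sd, n, number missing)

## ------------------------------------------------------------------------
# reverse code f1d, f1h, f1l and f1p
cesditems = with(newds, cbind(f1a, f1b, f1c, (3 - f1d), f1e, f1f, f1g, (3 - f1h), f1i, f1j, f1k, (3 - f1l), f1m, f1n, f1o, (3 - f1p), f1q, f1r, f1s, f1t))
nmisscesd = apply(is.na(cesditems), 1, sum)  # number of missing items per subject
ncesditems = cesditems                       # working copy of the item matrix
ncesditems[is.na(cesditems)] = 0             # replace missing items with 0
newcesd = apply(ncesditems, 1, sum)          # sum of the observed items
imputemeancesd = 20/(20-nmisscesd)*newcesd   # rescale the sum to the full 20 items

## ------------------------------------------------------------------------
data.frame(newcesd, newds$cesd, nmisscesd, imputemeancesd)[nmisscesd>0,]  # compare the scores for subjects with at least one missing item

## ----createdrink,message=FALSE-------------------------------------------
library(dplyr)   # load dplyr
library(memisc)  # load memisc, which provides cases()
# add one new variable, drinkstat, with levels abstinent, moderate and highrisk
newds = mutate(newds, drinkstat= cases( "abstinent" = i1==0, "moderate" = (i1>0 & i1<=1 & i2<=3 & female==1) | (i1>0 & i1<=2 & i2<=4 & female==0), "highrisk" = ((i1>1 | i2>3) & female==1) | ((i1>2 | i2>4) & female==0)))

## ----echo=FALSE----------------------------------------------------------
library(mosaic)  # load mosaic

## ----echo=FALSE----------------------------------------------------------
detach(package:memisc)  # detach memisc from the search path
detach(package:MASS)    # detach MASS from the search path

## ------------------------------------------------------------------------
library(dplyr)  # load dplyr
tmpds = select(newds, i1, i2, female, drinkstat)  # keep these four columns as tmpds
tmpds[365:370,]                                   # show rows 365 to 370

## ------------------------------------------------------------------------
library(dplyr)  # load dplyr
filter(tmpds, drinkstat=="moderate" & female==1)  # women classified as moderate drinkers

## ----message=FALSE-------------------------------------------------------
library(gmodels)  # load gmodels
with(tmpds, CrossTable(drinkstat))  # one-way frequency table of drinkstat

## ------------------------------------------------------------------------
with(tmpds, CrossTable(drinkstat, female, prop.t=FALSE, prop.c=FALSE, prop.chisq=FALSE))  # drinkstat by female, without table, column or chi-square proportions

## ------------------------------------------------------------------------
newds = transform(newds, gender=factor(female, c(0,1), c("Male","Female")))  # add a factor gender labelled Male/Female from female
tally(~ female + gender, margin=FALSE, data=newds)  # cross-tabulate female and gender to check the recoding

## ------------------------------------------------------------------------
library(dplyr)  # load dplyr
newds = arrange(ds, cesd, i1)        # sort ds by cesd, then by i1, in ascending order
newds[1:5, c("cesd", "i1", "id")]    # first five rows of cesd, i1 and id

## ------------------------------------------------------------------------
library(dplyr)  # load dplyr
females = filter(ds, female==1)  # rows with female == 1
with(females, mean(cesd))        # mean cesd among women
# an alternative approach
mean(ds$cesd[ds$female==1])      # the same mean, using base indexing

## ------------------------------------------------------------------------
with(ds, tapply(cesd, female, mean))  # mean cesd for each value of female
library(mosaic)  # load mosaic
mean(cesd ~ female, data=ds)          # the same, using mosaic's formula interface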
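The imputemeancesd line in the script pro-rates the total: the sum of the observed items is scaled by 20/(20 - number missing), which is the same as filling each missing item with the subject's own mean item score. A minimal sketch of that arithmetic on a toy item matrix follows; the helper name `prorate()` and the four-item matrix are hypothetical, chosen so the numbers can be checked by hand.

```r
# Pro-rated total: scale the observed sum up to the full number of items.
prorate <- function(items) {
  k       <- ncol(items)                      # number of items (20 for the CESD)
  n_miss  <- rowSums(is.na(items))            # missing items per subject
  partial <- rowSums(items, na.rm = TRUE)     # sum of the observed items
  k / (k - n_miss) * partial
}

# Toy example with 4 items: subject 1 misses one item, subject 2 none.
items <- matrix(c(1, 2, 3, NA,
                  2, 2, 2, 2), nrow = 2, byrow = TRUE)

prorate(items)
# Subject 1: observed sum 6 over 3 items, mean 2, so the pro-rated total is 4 * 2 = 8.
# Subject 2: nothing missing, so the total stays 8.
```

A subject with every item missing would produce a division by zero, so in practice such rows need separate handling.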

2020-04-13
