1 Homework 1

Chatterjee and Hadi (Regression by Examples, 2006) provided a link to the right to work data from “Industrial and Labor Relations” set on their web page.

Question: Display the relationship between Income and Taxes.

The origin: These right-to-work laws have caused a wave of concern throughout the labor movement. A question of interest here is: What are the effects of these laws on the cost of living for a four-person family living on an intermediate budget in the United States?

  • COL: Cost of living for a four-person family
  • PD: Population density (person per square mile)
  • URate: State unionization rate in 1978
  • POP: Population in 1975
  • Taxes: Property taxes in 1972
  • Income: Per capita income in 1974
  • RTWL: Indicator variable (1 if there is right-to-work laws in the state and 0 otherwise)

1.1 Input data

# load data
dat1 <- read.csv("P005.txt", sep="\t")

# only show City with Taxes and Income column
DT::datatable(dat1[,c(1,6,7)], rownames=FALSE, options=list(pageLength=5))

1.2 Plot

with(dat1, plot(Taxes, Income, 
                xlab="Property taxes in 1972",
                ylab="Per capita income in 1974",
              # xlim=c(23, 51),
              # ylim=c(840, 1110),
                type='n', bty='n',
                main='Property taxes and Per capita income'))
grid()
# add y=Income ~ x=Taxes regression line 
abline(lm(Income ~ Taxes, data=dat1))
# change the point to city name
with(dat1, text(Taxes, Income, 
                labels = City, 
                cex=.5))

  • 物業稅(Property taxes)越高的,平均每人所得(Per capita income)只有些許變化?
  • Washington 跟 San Francisco 好像次等公民?

1.3 Correlation

with(dat1, cor(Taxes, Income))
## [1] 0.0560718
  • low level correlation

2 Homework 2

The pdf version of the paper by Hsu, et al (2020). Health inequality: a longitudinal study on geographic variations in lung cancer incidence and mortality in Taiwan is available on-line.

Question: Extract the values in Table 1 of the article to replicate Figure 2 in the paper.

2.1 Input PDF file

pacman::p_load(tabulizer)

# extract tables from pdf
tbl1 <- extract_tables("Page3.pdf", pages = 1)

tbl1 <- as.data.frame(tbl1[[1]])

colnames(tbl1) <- c("Area",
                    "Age_S_Incidence",
                    "IncidenceGrowthRate",
                    "Age_S_Mortality",
                    "Age_S_MortalityGrowth")

tbl1final <- tbl1[-c(1,2,3,7,12,16,18,25,27,29),]

# add back the remove text
tbl1final[3,1] <- paste(tbl1final[3,1], "City", sep=" ")
tbl1final[c(7, 10, 11, 17, 18, 19),1] <- paste(tbl1final[c(7, 10, 11, 17, 18, 19), 1], "County", sep=" ")

# 負號修正 gsub grep is base function for Pattern Matching and Replacement
myfile[[1]]tbl1final[,5] <- gsub("−", "-", tbl1final[,5])

# reset row names
rownames(tbl1final) <- NULL

# write.csv(tbl1final, file="tbl1final.csv", row.names = FALSE)

負號會辨識成 <U+2212> extract_tables 擷取的編碼可能會導致後續knit / run code會失敗!(MacOS會不會就沒這個問題? 暗示該換MacbookPro?)

tbl1final <- read.csv("tbl1final.csv")
pacman::p_load(dplyr)
# change the chr type to factor type
tbl1final <- tbl1final |>
  mutate(Area = as.factor(Area))

2.2 Table

tbl1final |> knitr::kable()
Area Age_S_Incidence IncidenceGrowthRate Age_S_Mortality Age_S_MortalityGrowth
Taipei City 33.36 10.83 20.3 -14.31
Keelung City 35.33 -11.08 27.6 -25.63
New Taipei City 39.29 12.68 25.2 -9.60
Yilan County 39.63 2.60 30.2 -9.16
Taoyuan City 34.83 12.22 24.8 -10.21
Hsinchu City 34.64 11.21 21.0 17.38
Hsinchu County 30.60 23.13 20.7 -2.94
Miaoli County 28.74 5.62 23.0 -11.93
Taichung City 33.69 6.03 24.6 -7.26
Changhua County 40.28 17.24 29.9 -5.53
Nantou County 31.43 22.83 25.4 3.73
Yunlin County 35.66 3.00 32.4 -1.04
Chiayi City 38.47 25.81 24.5 -6.45
Chiayi County 35.04 -0.96 32.1 -6.36
Tainan City 34.76 2.30 26.0 -3.64
Kaohsiung City 36.67 7.21 26.7 -6.66
Pingtung County 30.62 8.44 24.9 -1.96
Hualien County 28.73 8.22 21.7 -9.92
Taitung County 37.87 15.75 28.4 0.67

2.3 Figure 2-A

with(tbl1final, plot(Age_S_Incidence, Age_S_Mortality, 
                xlab="Incidence (per 100,000)",
                ylab="Mortality (per 100,000)",
                xlim=c(25, 45),
                ylim=c(15, 35),
                pch=20, bty="n", col="lightblue",
                main="Age-standardized incidence and mortality (2014)", axes = FALSE))
axis(1, at = seq(25,45, by=1))
axis(2, at = seq(15,35, by=2))
grid()
# add y=Income ~ x=Taxes regression line 
abline(h=mean(tbl1final$Age_S_Mortality), v=mean(tbl1final$Age_S_Incidence), col="gray")

# change the point to city name
with(tbl1final, text(Age_S_Incidence, Age_S_Mortality, 
                labels = Area, 
                adj = c(1,0), 
                cex=0.6))

The means of the incidence / mortality (and their growth rates) for all administrative areas were set to the cut-off point separating high/low incidence (mortality and their growth rate)

2.4 Figure 2-B

with(tbl1final, plot(IncidenceGrowthRate, Age_S_MortalityGrowth, 
                xlab="Incidence growth rate (%)",
                ylab="Mortality growth rate (%)",
                #xlim=c(-20, 30),
                #ylim=c(-30, 20),
                pch=20, bty="n", col="lightblue",
                main="Age-standardized incidence and mortality growth rates (2005–2014)", axes = FALSE))
axis(1, at = seq(-12,28, by=5))
axis(2, at = seq(-30,20, by=5))
grid()
# add y=Income ~ x=Taxes regression line 
abline(h=mean(tbl1final$Age_S_MortalityGrowth), v=mean(tbl1final$IncidenceGrowthRate), col="gray")

# change the point to city name
with(tbl1final, text(IncidenceGrowthRate, Age_S_MortalityGrowth, 
                labels = Area, 
                adj = c(1,0), 
                cex=0.6))

The distance from the incidence / mortality (growth rate) point for each administrative area to the origin (mean) on the four-quadrant scatter plots was calculated

3 Homework 3

Reproducible data and code are available for Leongomez, J.D. et al.(2020).
Self-reported Health is Related to Body Height and Waist Circumference in Rural Indigenous and Urbanised Latin-American Populations. Scientific Report, 10, 4391.

Question: Summarize the mean of height, waist, and weight in each study sample by gender.

3.1 Input data

# load data
dat3 <- read.csv("Full_data.csv")

tibble::glimpse(dat3)
## Rows: 477
## Columns: 15
## $ ID          <chr> "F001", "F003", "F004", "F005", "F009", "F010", "F011", "F~
## $ Country     <chr> "Colombia", "Colombia", "Colombia", "Colombia", "Colombia"~
## $ Population  <chr> "Urban", "Urban", "Urban", "Urban", "Urban", "Urban", "Urb~
## $ Sex         <chr> "Female", "Female", "Female", "Female", "Female", "Female"~
## $ Age         <int> 23, 24, 19, 19, 18, 18, 21, 22, 19, 18, 24, 18, 19, 21, 19~
## $ Waist       <dbl> 67.33333, 97.50000, 81.13333, 70.30000, 66.73333, 82.53333~
## $ Hip         <dbl> 90.43333, 107.50000, 106.06667, 96.06667, 91.46667, 102.06~
## $ Height      <dbl> 157.8000, 164.6000, 164.7667, 161.0667, 162.1667, 160.9667~
## $ Weight      <dbl> 48.80000, 71.43333, 73.90000, 56.16667, 54.80000, 72.73333~
## $ Fat         <dbl> 29.96667, 42.56667, 43.50000, 34.23333, 32.36667, 45.96667~
## $ VisceralFat <dbl> 3.000000, 5.000000, 5.000000, 3.666667, 3.000000, 5.000000~
## $ BMI         <dbl> 19.70000, 26.23333, 27.10000, 21.66667, 20.90000, 28.10000~
## $ Muscle      <dbl> 24.66667, 26.36667, 23.73333, 25.66667, 26.23333, 22.23333~
## $ Health      <dbl> 75.00, 50.00, 43.75, 50.00, 68.75, 62.50, 75.00, 87.50, 93~
## $ Sample      <chr> "Colombia-Urban", "Colombia-Urban", "Colombia-Urban", "Col~
# change the chr type to factor type
dat3[,c(1,2,3,4,15)] <- lapply(dat3[,c(1,2,3,4,15)],as.factor)

3.2 summary for mean

#aggregate(cbind(Height) ~ Sex, data=dat3, FUN=mean)

tbLdat3 <- aggregate(cbind(dat3$Height, dat3$Waist, dat3$Weight), list(dat3$Sample,dat3$Sex), data=dat3, FUN=mean)

# give name
colnames(tbLdat3) <- c("Sample", "Gender", "mean Height", "mean Waist", "mean Weight")

tbLdat3|> knitr::kable()
Sample Gender mean Height mean Waist mean Weight
Colombia-Urban Female 158.8906 71.79928 57.84819
Mexico-Indigenous Female 146.1667 87.00417 54.24583
Mexico-Urban Female 157.6667 87.79000 65.48000
Colombia-Urban Male 172.2061 78.16925 68.17627
Mexico-Indigenous Male 159.9077 88.61026 65.91282
Mexico-Urban Male 172.0933 84.51000 71.03000
  • Only the Mexico-Urban male’s mean waist lower than female

4 Homework 4

The following zip file contains one subject’s laser-event potentials (LEP) data for 4 separate conditions (different level of stimulus intensity), each in a plain text file (1w.dat, 2w.dat, 3w.dat and 4w.dat). The rows are time points from -100 to 800 ms sampled at 2 ms per record. The columns are channel IDs.

Question: Input all the files into R for graphical exploration.

4.1 Input data

  • fail by download way due to “error 1 in extracting from zip file in r”
unzip("Subject1.zip")

# file list
temp = list.files(file.path(getwd(), "Subject1/"), pattern="*.dat", full.names = TRUE)

# grap the column name
dat4colname <- scan("./Subject1/1w.dat", skip=1, sep="\t", nlines = 1, what = "character")
dat4colname <- gsub(" ", "", dat4colname)
dat4colname
##  [1] "[F7]"  "[FT7]" "[T7]"  "[TP7]" "[P7]"  "[Fp1]" "[F3]"  "[FC3]" "[C3]" 
## [10] "[CP3]" "[P3]"  "[O1]"  "[Fz]"  "[FCz]" "[Cz]"  "[CPz]" "[Pz]"  "[Oz]" 
## [19] "[Fp2]" "[F4]"  "[FC4]" "[C4]"  "[CP4]" "[P4]"  "[O2]"  "[F8]"  "[FT8]"
## [28] "[T8]"  "[TP8]" "[P8]"  ""
# input list 
myfile <- lapply(temp, read.table, skip = 2, sep = "\t", 
                 header=FALSE, na.strings = c("", "NA"), fill = TRUE)

# link5 <- paste0("Subject1/", 1:4, sep ="w.dat")

# remove NA column
myfile[[1]]$V31<-NULL
myfile[[2]]$V31<-NULL
myfile[[3]]$V31<-NULL
myfile[[4]]$V31<-NULL

dat4colnamenew <- dat4colname[-31]

# add column name into each data frame
dat4 <- lapply(myfile, setNames, nm=dat4colnamenew)

# combine into one df
dat4_df <- rbind(dat4[[1]], dat4[[2]], dat4[[3]], dat4[[4]])

# add a column to indicate condition 
dat4_df$condition<-rep(paste0(1:4, "w"), each=nrow(dat4[[1]]))

# add a column to indicate time
dat4_df$time<-rep(seq(-100, 800, 2), 4)

pacman::p_load(reshape)

# wide to long format
dat4_melt <- melt(dat4_df, id = c("time", "condition"))

str(dat4_melt)
## 'data.frame':    54120 obs. of  4 variables:
##  $ time     : num  -100 -98 -96 -94 -92 -90 -88 -86 -84 -82 ...
##  $ condition: chr  "1w" "1w" "1w" "1w" ...
##  $ variable : Factor w/ 30 levels "[F7]","[FT7]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value    : num  -0.9733 -0.7079 -0.3732 -0.0225 0.3523 ...

4.2 lattice

pacman::p_load(lattice)

xyplot(value ~ time | variable, 
       groups = condition,
       data = dat4_melt, cex=.5,
       type = c("p","g", "r"),
       pch = 20,
       xlab = "Time",
       ylab = "LFP",
       layout = c(6,5),
       auto.key = list(space = "right"))

4.3 ggplot

value ~ time in differ condition in one plot

pacman::p_load(ggplot2, colorspace)

ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
  geom_point(alpha = 0.5, size= 0.8) +
  xlab("Time") + 
  ylab("LFP") +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  theme_minimal() +
  theme(legend.title = element_blank())

value ~ time in differ condition separated

ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
  geom_point(size = .8, alpha=0.5) +
  facet_wrap( ~ condition, nrow = 1) +
  xlab("Time") + 
  ylab("LFP") +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  theme_minimal() +
  theme(legend.title = element_blank())

value ~ time in differ condition facet by differ position

ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
  geom_point(data = subset(dat4_melt, variable =condition), 
            aes(colour = condition), size = .2) +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  facet_wrap(~variable, nrow = 5) +
  xlab("Time") + 
  ylab("LFP") +
  theme_minimal() +
  theme(legend.title = element_blank())

5 Homework 5

Reaction times of schizophrenics and others
The ASCII (plain text) file schiz.asc contains response times (in milliseconds) for 11 non-schizophrenics and 6 schizophrenics (30 measurements for each person).

Question: Summarize and compare descriptive statistics of the measurements from the two groups.

Bayesian Data Analysis

5.1 Input data

fL <- "http://www.stat.columbia.edu/~gelman/book/data/schiz.asc"
dat5<- read.table(fL, skip = 4, col.names = paste0("T", 101:130))
# skip = integer: the number of lines of the data file to skip before beginning to read data.
# dat5 <- read.table("schiz.asc", skip = 4, col.names = paste0("T", 101:130))

# Create Group and ID 
dat5$Group <- c(rep("Non-schiz", 11), rep("Schiz", 6))
dat5$ID <- 1:nrow(dat5)

head(dat5)
##   T101 T102 T103 T104 T105 T106 T107 T108 T109 T110 T111 T112 T113 T114 T115
## 1  312  272  350  286  268  328  298  356  292  308  296  372  396  402  280
## 2  354  346  384  342  302  312  322  376  306  402  320  298  308  414  304
## 3  256  284  320  274  324  268  370  430  314  312  362  256  342  388  302
## 4  260  294  306  292  264  290  272  268  344  362  330  280  354  320  334
## 5  204  272  250  260  314  308  246  236  208  268  272  264  308  236  238
## 6  590  312  286  310  778  364  318  316  316  298  344  262  274  330  312
##   T116 T117 T118 T119 T120 T121 T122 T123 T124 T125 T126 T127 T128 T129 T130
## 1  330  254  282  350  328  332  308  292  258  340  242  306  328  294  272
## 2  422  388  422  426  338  332  426  478  372  392  374  430  388  354  368
## 3  366  298  396  274  226  328  274  258  220  236  272  322  284  274  356
## 4  276  418  288  338  350  350  324  286  322  280  256  218  256  220  356
## 5  350  272  252  252  236  306  238  350  206  260  280  274  318  268  210
## 6  310  376  326  346  334  282  292  282  300  290  302  300  306  294  444
##       Group ID
## 1 Non-schiz  1
## 2 Non-schiz  2
## 3 Non-schiz  3
## 4 Non-schiz  4
## 5 Non-schiz  5
## 6 Non-schiz  6

5.2 Reshape to long format

dat5L <- reshape(dat5, varying=list(1:30), idvar="ID", 
                 v.names="RT", timevar="Test", 
                 times=c(1:30), direction="long")

5.3 Descriptive

# mean and sd by group
aggregate(RT ~ Group, data=dat5L, FUN = mean) |> knitr::kable()
Group RT
Non-schiz 310.1697
Schiz 506.8667
aggregate(RT ~ Group, data=dat5L, FUN = sd) |> knitr::kable()
Group RT
Non-schiz 64.8805
Schiz 262.8473
  • the reaction time of schizophrenics is longer than non-schizophrenics group
  • wide variation in schizophrenics group
# mean by subject
DT::datatable(aggregate(RT ~ ID, data=dat5L, FUN =mean), rownames=FALSE, options=list(pageLength=11))
# aggregate(RT ~ ID, data=dat5L, FUN =mean) |> knitr::kable()

# sd by subject
DT::datatable(aggregate(RT ~ ID, data=dat5L, FUN =sd), rownames=FALSE, options=list(pageLength=11))
# aggregate(RT ~ ID, data=dat5L, FUN =sd) |> knitr::kable()

# Order subject by response means
x <- aggregate(RT ~ as.factor(ID), data=dat5L, FUN =mean)
x[order(x$RT),] |> t() |> knitr::kable()
9 5 8 12 4 3 11 1 10 6 7 2 13 17 16 15 14
as.factor(ID) 9 5 8 12 4 3 11 1 10 6 7 2 13 17 16 15 14
RT 257.7333 265.2000 268.4667 303.5333 303.6000 306.2000 309.3333 311.0667 325.6667 339.8000 358.1333 366.6667 419.0667 488.5333 594.3333 599.8667 635.8667

5.4 Plot

with(dat5L, boxplot(RT ~ Group,
                    horizontal = TRUE,
                    frame = FALSE,
                    ylab = "Reaction time",
                    col = "aliceblue",
                    varwidth = TRUE
                    ))

5.5 Linear models

aov(RT ~ Group + as.integer(Test), data = dat5L) |> anova()
## Analysis of Variance Table
## 
## Response: RT
##                   Df   Sum Sq Mean Sq  F value Pr(>F)    
## Group              1  4506212 4506212 166.2613 <2e-16 ***
## as.integer(Test)   1    10482   10482   0.3867 0.5343    
## Residuals        507 13741318   27103                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(RT ~ Group + as.integer(Test), data = dat5L)
## 
## Call:
## lm(formula = RT ~ Group + as.integer(Test), data = dat5L)
## 
## Coefficients:
##      (Intercept)        GroupSchiz  as.integer(Test)  
##         318.2882          196.6970           -0.5238

The reaction time will increase 196.697 for subject as schizophrenics.

5.6 Reference

Source: Belin, T., & Rubin, D. (1995). The analysis of repeated-measures data on schizophrenic reaction times using mixture models. Statistics in Medicine 14(8), 747-768.

6 Google scholar

Rookie scholar…

pacman::p_load(scholar)
dat6 <- get_citation_history("s6G3jt4AAAAJ")
plot(dat6, 
     xlab="Year", ylab="Citations", 
     type='h', lwd=2, 
     xaxt="n", bty='n', 
     xlim=c(2019, 2021))
axis(side=1, 
     at=seq(2019, 2021, by=1), 
     cex.axis=0.7)
abline(h=seq(0, 400, by=100), 
       lty=2, col="darkgray")