1 Homework 1

Chatterjee and Hadi (Regression by Examples, 2006) provided a link to the right to work data from “Industrial and Labor Relations” set on their web page.

Question: Display the relationship between Income and Taxes.

The origin: These right-to-work laws have caused a wave of concern throughout the labor movement. A question of interest here is: What are the effects of these laws on the cost of living for a four-person family living on an intermediate budget in the United States?

COL: Cost of living for a four-person family
PD: Population density (person per square mile)
URate: State unionization rate in 1978
POP: Population in 1975
Taxes: Property taxes in 1972
Income: Per capita income in 1974
RTWL: Indicator variable (1 if there is right-to-work laws in the state and 0 otherwise)

1.1 Input data

# load data
dat1 <- read.csv("P005.txt", sep="\t")

# only show City with Taxes and Income column
DT::datatable(dat1[,c(1,6,7)], rownames=FALSE, options=list(pageLength=5))

1.2 Plot

with(dat1, plot(Taxes, Income, 
                xlab="Property taxes in 1972",
                ylab="Per capita income in 1974",
              # xlim=c(23, 51),
              # ylim=c(840, 1110),
                type='n', bty='n',
                main='Property taxes and Per capita income'))
grid()
# add y=Income ~ x=Taxes regression line 
abline(lm(Income ~ Taxes, data=dat1))
# change the point to city name
with(dat1, text(Taxes, Income, 
                labels = City, 
                cex=.5))

物業稅(Property taxes)越高的，平均每人所得(Per capita income)只有些許變化？
Washington 跟 San Francisco 好像次等公民？

1.3 Correlation

with(dat1, cor(Taxes, Income))

## [1] 0.0560718

low level correlation

2 Homework 2

The pdf version of the paper by Hsu, et al (2020). Health inequality: a longitudinal study on geographic variations in lung cancer incidence and mortality in Taiwan is available on-line.

Question: Extract the values in Table 1 of the article to replicate Figure 2 in the paper.

2.1 Input PDF file

pacman::p_load(tabulizer)

# extract tables from pdf
tbl1 <- extract_tables("Page3.pdf", pages = 1)

tbl1 <- as.data.frame(tbl1[[1]])

colnames(tbl1) <- c("Area",
                    "Age_S_Incidence",
                    "IncidenceGrowthRate",
                    "Age_S_Mortality",
                    "Age_S_MortalityGrowth")

tbl1final <- tbl1[-c(1,2,3,7,12,16,18,25,27,29),]

# add back the remove text
tbl1final[3,1] <- paste(tbl1final[3,1], "City", sep=" ")
tbl1final[c(7, 10, 11, 17, 18, 19),1] <- paste(tbl1final[c(7, 10, 11, 17, 18, 19), 1], "County", sep=" ")

# 負號修正 gsub grep is base function for Pattern Matching and Replacement
myfile[[1]]tbl1final[,5] <- gsub("−", "-", tbl1final[,5])

# reset row names
rownames(tbl1final) <- NULL

# write.csv(tbl1final, file="tbl1final.csv", row.names = FALSE)

負號會辨識成 <U+2212> extract_tables 擷取的編碼可能會導致後續knit / run code會失敗！(MacOS會不會就沒這個問題? 暗示該換MacbookPro?)

tbl1final <- read.csv("tbl1final.csv")

pacman::p_load(dplyr)
# change the chr type to factor type
tbl1final <- tbl1final |>
  mutate(Area = as.factor(Area))

2.2 Table

tbl1final |> knitr::kable()

Area	Age_S_Incidence	IncidenceGrowthRate	Age_S_Mortality	Age_S_MortalityGrowth
Taipei City	33.36	10.83	20.3	-14.31
Keelung City	35.33	-11.08	27.6	-25.63
New Taipei City	39.29	12.68	25.2	-9.60
Yilan County	39.63	2.60	30.2	-9.16
Taoyuan City	34.83	12.22	24.8	-10.21
Hsinchu City	34.64	11.21	21.0	17.38
Hsinchu County	30.60	23.13	20.7	-2.94
Miaoli County	28.74	5.62	23.0	-11.93
Taichung City	33.69	6.03	24.6	-7.26
Changhua County	40.28	17.24	29.9	-5.53
Nantou County	31.43	22.83	25.4	3.73
Yunlin County	35.66	3.00	32.4	-1.04
Chiayi City	38.47	25.81	24.5	-6.45
Chiayi County	35.04	-0.96	32.1	-6.36
Tainan City	34.76	2.30	26.0	-3.64
Kaohsiung City	36.67	7.21	26.7	-6.66
Pingtung County	30.62	8.44	24.9	-1.96
Hualien County	28.73	8.22	21.7	-9.92
Taitung County	37.87	15.75	28.4	0.67

2.3 Figure 2-A

with(tbl1final, plot(Age_S_Incidence, Age_S_Mortality, 
                xlab="Incidence (per 100,000)",
                ylab="Mortality (per 100,000)",
                xlim=c(25, 45),
                ylim=c(15, 35),
                pch=20, bty="n", col="lightblue",
                main="Age-standardized incidence and mortality (2014)", axes = FALSE))
axis(1, at = seq(25,45, by=1))
axis(2, at = seq(15,35, by=2))
grid()
# add y=Income ~ x=Taxes regression line 
abline(h=mean(tbl1final$Age_S_Mortality), v=mean(tbl1final$Age_S_Incidence), col="gray")

# change the point to city name
with(tbl1final, text(Age_S_Incidence, Age_S_Mortality, 
                labels = Area, 
                adj = c(1,0), 
                cex=0.6))

The means of the incidence / mortality (and their growth rates) for all administrative areas were set to the cut-off point separating high/low incidence (mortality and their growth rate)

2.4 Figure 2-B

with(tbl1final, plot(IncidenceGrowthRate, Age_S_MortalityGrowth, 
                xlab="Incidence growth rate (%)",
                ylab="Mortality growth rate (%)",
                #xlim=c(-20, 30),
                #ylim=c(-30, 20),
                pch=20, bty="n", col="lightblue",
                main="Age-standardized incidence and mortality growth rates (2005–2014)", axes = FALSE))
axis(1, at = seq(-12,28, by=5))
axis(2, at = seq(-30,20, by=5))
grid()
# add y=Income ~ x=Taxes regression line 
abline(h=mean(tbl1final$Age_S_MortalityGrowth), v=mean(tbl1final$IncidenceGrowthRate), col="gray")

# change the point to city name
with(tbl1final, text(IncidenceGrowthRate, Age_S_MortalityGrowth, 
                labels = Area, 
                adj = c(1,0), 
                cex=0.6))

The distance from the incidence / mortality (growth rate) point for each administrative area to the origin (mean) on the four-quadrant scatter plots was calculated

3 Homework 3

Reproducible data and code are available for Leongomez, J.D. et al.(2020).
Self-reported Health is Related to Body Height and Waist Circumference in Rural Indigenous and Urbanised Latin-American Populations. Scientific Report, 10, 4391.

Question: Summarize the mean of height, waist, and weight in each study sample by gender.

3.1 Input data

# load data
dat3 <- read.csv("Full_data.csv")

tibble::glimpse(dat3)

## Rows: 477
## Columns: 15
## $ ID          <chr> "F001", "F003", "F004", "F005", "F009", "F010", "F011", "F~
## $ Country     <chr> "Colombia", "Colombia", "Colombia", "Colombia", "Colombia"~
## $ Population  <chr> "Urban", "Urban", "Urban", "Urban", "Urban", "Urban", "Urb~
## $ Sex         <chr> "Female", "Female", "Female", "Female", "Female", "Female"~
## $ Age         <int> 23, 24, 19, 19, 18, 18, 21, 22, 19, 18, 24, 18, 19, 21, 19~
## $ Waist       <dbl> 67.33333, 97.50000, 81.13333, 70.30000, 66.73333, 82.53333~
## $ Hip         <dbl> 90.43333, 107.50000, 106.06667, 96.06667, 91.46667, 102.06~
## $ Height      <dbl> 157.8000, 164.6000, 164.7667, 161.0667, 162.1667, 160.9667~
## $ Weight      <dbl> 48.80000, 71.43333, 73.90000, 56.16667, 54.80000, 72.73333~
## $ Fat         <dbl> 29.96667, 42.56667, 43.50000, 34.23333, 32.36667, 45.96667~
## $ VisceralFat <dbl> 3.000000, 5.000000, 5.000000, 3.666667, 3.000000, 5.000000~
## $ BMI         <dbl> 19.70000, 26.23333, 27.10000, 21.66667, 20.90000, 28.10000~
## $ Muscle      <dbl> 24.66667, 26.36667, 23.73333, 25.66667, 26.23333, 22.23333~
## $ Health      <dbl> 75.00, 50.00, 43.75, 50.00, 68.75, 62.50, 75.00, 87.50, 93~
## $ Sample      <chr> "Colombia-Urban", "Colombia-Urban", "Colombia-Urban", "Col~

# change the chr type to factor type
dat3[,c(1,2,3,4,15)] <- lapply(dat3[,c(1,2,3,4,15)],as.factor)

3.2 summary for mean

#aggregate(cbind(Height) ~ Sex, data=dat3, FUN=mean)

tbLdat3 <- aggregate(cbind(dat3$Height, dat3$Waist, dat3$Weight), list(dat3$Sample,dat3$Sex), data=dat3, FUN=mean)

# give name
colnames(tbLdat3) <- c("Sample", "Gender", "mean Height", "mean Waist", "mean Weight")

tbLdat3|> knitr::kable()

Sample	Gender	mean Height	mean Waist	mean Weight
Colombia-Urban	Female	158.8906	71.79928	57.84819
Mexico-Indigenous	Female	146.1667	87.00417	54.24583
Mexico-Urban	Female	157.6667	87.79000	65.48000
Colombia-Urban	Male	172.2061	78.16925	68.17627
Mexico-Indigenous	Male	159.9077	88.61026	65.91282
Mexico-Urban	Male	172.0933	84.51000	71.03000

Only the Mexico-Urban male’s mean waist lower than female

4 Homework 4

The following zip file contains one subject’s laser-event potentials (LEP) data for 4 separate conditions (different level of stimulus intensity), each in a plain text file (1w.dat, 2w.dat, 3w.dat and 4w.dat). The rows are time points from -100 to 800 ms sampled at 2 ms per record. The columns are channel IDs.

Question: Input all the files into R for graphical exploration.

4.1 Input data

fail by download way due to “error 1 in extracting from zip file in r”

unzip("Subject1.zip")

# file list
temp = list.files(file.path(getwd(), "Subject1/"), pattern="*.dat", full.names = TRUE)

# grap the column name
dat4colname <- scan("./Subject1/1w.dat", skip=1, sep="\t", nlines = 1, what = "character")
dat4colname <- gsub(" ", "", dat4colname)
dat4colname

##  [1] "[F7]"  "[FT7]" "[T7]"  "[TP7]" "[P7]"  "[Fp1]" "[F3]"  "[FC3]" "[C3]" 
## [10] "[CP3]" "[P3]"  "[O1]"  "[Fz]"  "[FCz]" "[Cz]"  "[CPz]" "[Pz]"  "[Oz]" 
## [19] "[Fp2]" "[F4]"  "[FC4]" "[C4]"  "[CP4]" "[P4]"  "[O2]"  "[F8]"  "[FT8]"
## [28] "[T8]"  "[TP8]" "[P8]"  ""

# input list 
myfile <- lapply(temp, read.table, skip = 2, sep = "\t", 
                 header=FALSE, na.strings = c("", "NA"), fill = TRUE)

# link5 <- paste0("Subject1/", 1:4, sep ="w.dat")

# remove NA column
myfile[[1]]$V31<-NULL
myfile[[2]]$V31<-NULL
myfile[[3]]$V31<-NULL
myfile[[4]]$V31<-NULL

dat4colnamenew <- dat4colname[-31]

# add column name into each data frame
dat4 <- lapply(myfile, setNames, nm=dat4colnamenew)

# combine into one df
dat4_df <- rbind(dat4[[1]], dat4[[2]], dat4[[3]], dat4[[4]])

# add a column to indicate condition 
dat4_df$condition<-rep(paste0(1:4, "w"), each=nrow(dat4[[1]]))

# add a column to indicate time
dat4_df$time<-rep(seq(-100, 800, 2), 4)

pacman::p_load(reshape)

# wide to long format
dat4_melt <- melt(dat4_df, id = c("time", "condition"))

str(dat4_melt)

## 'data.frame':    54120 obs. of  4 variables:
##  $ time     : num  -100 -98 -96 -94 -92 -90 -88 -86 -84 -82 ...
##  $ condition: chr  "1w" "1w" "1w" "1w" ...
##  $ variable : Factor w/ 30 levels "[F7]","[FT7]",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ value    : num  -0.9733 -0.7079 -0.3732 -0.0225 0.3523 ...

4.2 lattice

pacman::p_load(lattice)

xyplot(value ~ time | variable, 
       groups = condition,
       data = dat4_melt, cex=.5,
       type = c("p","g", "r"),
       pch = 20,
       xlab = "Time",
       ylab = "LFP",
       layout = c(6,5),
       auto.key = list(space = "right"))

4.3 ggplot

value ~ time in differ condition in one plot

pacman::p_load(ggplot2, colorspace)

ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
  geom_point(alpha = 0.5, size= 0.8) +
  xlab("Time") + 
  ylab("LFP") +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  theme_minimal() +
  theme(legend.title = element_blank())

value ~ time in differ condition separated

ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
  geom_point(size = .8, alpha=0.5) +
  facet_wrap( ~ condition, nrow = 1) +
  xlab("Time") + 
  ylab("LFP") +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  theme_minimal() +
  theme(legend.title = element_blank())

value ~ time in differ condition facet by differ position

ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
  geom_point(data = subset(dat4_melt, variable =condition), 
            aes(colour = condition), size = .2) +
  scale_colour_discrete_qualitative(palette = "Dark 3") +
  facet_wrap(~variable, nrow = 5) +
  xlab("Time") + 
  ylab("LFP") +
  theme_minimal() +
  theme(legend.title = element_blank())

5 Homework 5

Reaction times of schizophrenics and others
The ASCII (plain text) file schiz.asc contains response times (in milliseconds) for 11 non-schizophrenics and 6 schizophrenics (30 measurements for each person).

Question: Summarize and compare descriptive statistics of the measurements from the two groups.

Bayesian Data Analysis

5.1 Input data

fL <- "http://www.stat.columbia.edu/~gelman/book/data/schiz.asc"
dat5<- read.table(fL, skip = 4, col.names = paste0("T", 101:130))
# skip = integer: the number of lines of the data file to skip before beginning to read data.
# dat5 <- read.table("schiz.asc", skip = 4, col.names = paste0("T", 101:130))

# Create Group and ID 
dat5$Group <- c(rep("Non-schiz", 11), rep("Schiz", 6))
dat5$ID <- 1:nrow(dat5)

head(dat5)

##   T101 T102 T103 T104 T105 T106 T107 T108 T109 T110 T111 T112 T113 T114 T115
## 1  312  272  350  286  268  328  298  356  292  308  296  372  396  402  280
## 2  354  346  384  342  302  312  322  376  306  402  320  298  308  414  304
## 3  256  284  320  274  324  268  370  430  314  312  362  256  342  388  302
## 4  260  294  306  292  264  290  272  268  344  362  330  280  354  320  334
## 5  204  272  250  260  314  308  246  236  208  268  272  264  308  236  238
## 6  590  312  286  310  778  364  318  316  316  298  344  262  274  330  312
##   T116 T117 T118 T119 T120 T121 T122 T123 T124 T125 T126 T127 T128 T129 T130
## 1  330  254  282  350  328  332  308  292  258  340  242  306  328  294  272
## 2  422  388  422  426  338  332  426  478  372  392  374  430  388  354  368
## 3  366  298  396  274  226  328  274  258  220  236  272  322  284  274  356
## 4  276  418  288  338  350  350  324  286  322  280  256  218  256  220  356
## 5  350  272  252  252  236  306  238  350  206  260  280  274  318  268  210
## 6  310  376  326  346  334  282  292  282  300  290  302  300  306  294  444
##       Group ID
## 1 Non-schiz  1
## 2 Non-schiz  2
## 3 Non-schiz  3
## 4 Non-schiz  4
## 5 Non-schiz  5
## 6 Non-schiz  6

5.2 Reshape to long format

dat5L <- reshape(dat5, varying=list(1:30), idvar="ID", 
                 v.names="RT", timevar="Test", 
                 times=c(1:30), direction="long")

5.3 Descriptive

# mean and sd by group
aggregate(RT ~ Group, data=dat5L, FUN = mean) |> knitr::kable()

Group	RT
Non-schiz	310.1697
Schiz	506.8667

aggregate(RT ~ Group, data=dat5L, FUN = sd) |> knitr::kable()

Group	RT
Non-schiz	64.8805
Schiz	262.8473

the reaction time of schizophrenics is longer than non-schizophrenics group
wide variation in schizophrenics group

# mean by subject
DT::datatable(aggregate(RT ~ ID, data=dat5L, FUN =mean), rownames=FALSE, options=list(pageLength=11))

# aggregate(RT ~ ID, data=dat5L, FUN =mean) |> knitr::kable()

# sd by subject
DT::datatable(aggregate(RT ~ ID, data=dat5L, FUN =sd), rownames=FALSE, options=list(pageLength=11))

# aggregate(RT ~ ID, data=dat5L, FUN =sd) |> knitr::kable()

# Order subject by response means
x <- aggregate(RT ~ as.factor(ID), data=dat5L, FUN =mean)
x[order(x$RT),] |> t() |> knitr::kable()

	9	5	8	12	4	3	11	1	10	6	7	2	13	17	16	15	14
as.factor(ID)	9	5	8	12	4	3	11	1	10	6	7	2	13	17	16	15	14
RT	257.7333	265.2000	268.4667	303.5333	303.6000	306.2000	309.3333	311.0667	325.6667	339.8000	358.1333	366.6667	419.0667	488.5333	594.3333	599.8667	635.8667

5.4 Plot

with(dat5L, boxplot(RT ~ Group,
                    horizontal = TRUE,
                    frame = FALSE,
                    ylab = "Reaction time",
                    col = "aliceblue",
                    varwidth = TRUE
                    ))

5.5 Linear models

aov(RT ~ Group + as.integer(Test), data = dat5L) |> anova()

## Analysis of Variance Table
## 
## Response: RT
##                   Df   Sum Sq Mean Sq  F value Pr(>F)    
## Group              1  4506212 4506212 166.2613 <2e-16 ***
## as.integer(Test)   1    10482   10482   0.3867 0.5343    
## Residuals        507 13741318   27103                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lm(RT ~ Group + as.integer(Test), data = dat5L)

## 
## Call:
## lm(formula = RT ~ Group + as.integer(Test), data = dat5L)
## 
## Coefficients:
##      (Intercept)        GroupSchiz  as.integer(Test)  
##         318.2882          196.6970           -0.5238

The reaction time will increase 196.697 for subject as schizophrenics.

5.6 Reference

Source: Belin, T., & Rubin, D. (1995). The analysis of repeated-measures data on schizophrenic reaction times using mixture models. Statistics in Medicine 14(8), 747-768.

6 Google scholar

Rookie scholar…

pacman::p_load(scholar)
dat6 <- get_citation_history("s6G3jt4AAAAJ")
plot(dat6, 
     xlab="Year", ylab="Citations", 
     type='h', lwd=2, 
     xaxt="n", bty='n', 
     xlim=c(2019, 2021))
axis(side=1, 
     at=seq(2019, 2021, by=1), 
     cex.axis=0.7)
abline(h=seq(0, 400, by=100), 
       lty=2, col="darkgray")

20211025 Homework

Chang-Chun Chen

2021-10-30