Chatterjee and Hadi (Regression by Examples, 2006) provided a link to the right to work data from “Industrial and Labor Relations” set on their web page.
Question: Display the relationship between Income and Taxes.
The origin: These right-to-work laws have caused a wave of concern throughout the labor movement. A question of interest here is: What are the effects of these laws on the cost of living for a four-person family living on an intermediate budget in the United States?
# load data
<- read.csv("P005.txt", sep="\t")
dat1
# only show City with Taxes and Income column
::datatable(dat1[,c(1,6,7)], rownames=FALSE, options=list(pageLength=5)) DT
with(dat1, plot(Taxes, Income,
xlab="Property taxes in 1972",
ylab="Per capita income in 1974",
# xlim=c(23, 51),
# ylim=c(840, 1110),
type='n', bty='n',
main='Property taxes and Per capita income'))
grid()
# add y=Income ~ x=Taxes regression line
abline(lm(Income ~ Taxes, data=dat1))
# change the point to city name
with(dat1, text(Taxes, Income,
labels = City,
cex=.5))
with(dat1, cor(Taxes, Income))
## [1] 0.0560718
The pdf version of the paper by Hsu, et al (2020). Health inequality: a longitudinal study on geographic variations in lung cancer incidence and mortality in Taiwan is available on-line.
Question: Extract the values in Table 1 of the article to replicate Figure 2 in the paper.
::p_load(tabulizer)
pacman
# extract tables from pdf
<- extract_tables("Page3.pdf", pages = 1)
tbl1
<- as.data.frame(tbl1[[1]])
tbl1
colnames(tbl1) <- c("Area",
"Age_S_Incidence",
"IncidenceGrowthRate",
"Age_S_Mortality",
"Age_S_MortalityGrowth")
<- tbl1[-c(1,2,3,7,12,16,18,25,27,29),]
tbl1final
# add back the remove text
3,1] <- paste(tbl1final[3,1], "City", sep=" ")
tbl1final[c(7, 10, 11, 17, 18, 19),1] <- paste(tbl1final[c(7, 10, 11, 17, 18, 19), 1], "County", sep=" ")
tbl1final[
# 負號修正 gsub grep is base function for Pattern Matching and Replacement
1]]tbl1final[,5] <- gsub("−", "-", tbl1final[,5])
myfile[[
# reset row names
rownames(tbl1final) <- NULL
# write.csv(tbl1final, file="tbl1final.csv", row.names = FALSE)
負號會辨識成 <U+2212> extract_tables 擷取的編碼可能會導致後續knit / run code會失敗!(MacOS會不會就沒這個問題? 暗示該換MacbookPro?)
<- read.csv("tbl1final.csv") tbl1final
::p_load(dplyr)
pacman# change the chr type to factor type
<- tbl1final |>
tbl1final mutate(Area = as.factor(Area))
|> knitr::kable() tbl1final
Area | Age_S_Incidence | IncidenceGrowthRate | Age_S_Mortality | Age_S_MortalityGrowth |
---|---|---|---|---|
Taipei City | 33.36 | 10.83 | 20.3 | -14.31 |
Keelung City | 35.33 | -11.08 | 27.6 | -25.63 |
New Taipei City | 39.29 | 12.68 | 25.2 | -9.60 |
Yilan County | 39.63 | 2.60 | 30.2 | -9.16 |
Taoyuan City | 34.83 | 12.22 | 24.8 | -10.21 |
Hsinchu City | 34.64 | 11.21 | 21.0 | 17.38 |
Hsinchu County | 30.60 | 23.13 | 20.7 | -2.94 |
Miaoli County | 28.74 | 5.62 | 23.0 | -11.93 |
Taichung City | 33.69 | 6.03 | 24.6 | -7.26 |
Changhua County | 40.28 | 17.24 | 29.9 | -5.53 |
Nantou County | 31.43 | 22.83 | 25.4 | 3.73 |
Yunlin County | 35.66 | 3.00 | 32.4 | -1.04 |
Chiayi City | 38.47 | 25.81 | 24.5 | -6.45 |
Chiayi County | 35.04 | -0.96 | 32.1 | -6.36 |
Tainan City | 34.76 | 2.30 | 26.0 | -3.64 |
Kaohsiung City | 36.67 | 7.21 | 26.7 | -6.66 |
Pingtung County | 30.62 | 8.44 | 24.9 | -1.96 |
Hualien County | 28.73 | 8.22 | 21.7 | -9.92 |
Taitung County | 37.87 | 15.75 | 28.4 | 0.67 |
with(tbl1final, plot(Age_S_Incidence, Age_S_Mortality,
xlab="Incidence (per 100,000)",
ylab="Mortality (per 100,000)",
xlim=c(25, 45),
ylim=c(15, 35),
pch=20, bty="n", col="lightblue",
main="Age-standardized incidence and mortality (2014)", axes = FALSE))
axis(1, at = seq(25,45, by=1))
axis(2, at = seq(15,35, by=2))
grid()
# add y=Income ~ x=Taxes regression line
abline(h=mean(tbl1final$Age_S_Mortality), v=mean(tbl1final$Age_S_Incidence), col="gray")
# change the point to city name
with(tbl1final, text(Age_S_Incidence, Age_S_Mortality,
labels = Area,
adj = c(1,0),
cex=0.6))
The means of the incidence / mortality (and their growth rates) for all administrative areas were set to the cut-off point separating high/low incidence (mortality and their growth rate)
with(tbl1final, plot(IncidenceGrowthRate, Age_S_MortalityGrowth,
xlab="Incidence growth rate (%)",
ylab="Mortality growth rate (%)",
#xlim=c(-20, 30),
#ylim=c(-30, 20),
pch=20, bty="n", col="lightblue",
main="Age-standardized incidence and mortality growth rates (2005–2014)", axes = FALSE))
axis(1, at = seq(-12,28, by=5))
axis(2, at = seq(-30,20, by=5))
grid()
# add y=Income ~ x=Taxes regression line
abline(h=mean(tbl1final$Age_S_MortalityGrowth), v=mean(tbl1final$IncidenceGrowthRate), col="gray")
# change the point to city name
with(tbl1final, text(IncidenceGrowthRate, Age_S_MortalityGrowth,
labels = Area,
adj = c(1,0),
cex=0.6))
The distance from the incidence / mortality (growth rate) point for each administrative area to the origin (mean) on the four-quadrant scatter plots was calculated
Reproducible data and code are available for Leongomez, J.D. et al.(2020).
Self-reported Health is Related to Body Height and Waist Circumference in Rural Indigenous and Urbanised Latin-American Populations. Scientific Report, 10, 4391.
Question: Summarize the mean of height, waist, and weight in each study sample by gender.
# load data
<- read.csv("Full_data.csv")
dat3
::glimpse(dat3) tibble
## Rows: 477
## Columns: 15
## $ ID <chr> "F001", "F003", "F004", "F005", "F009", "F010", "F011", "F~
## $ Country <chr> "Colombia", "Colombia", "Colombia", "Colombia", "Colombia"~
## $ Population <chr> "Urban", "Urban", "Urban", "Urban", "Urban", "Urban", "Urb~
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "Female"~
## $ Age <int> 23, 24, 19, 19, 18, 18, 21, 22, 19, 18, 24, 18, 19, 21, 19~
## $ Waist <dbl> 67.33333, 97.50000, 81.13333, 70.30000, 66.73333, 82.53333~
## $ Hip <dbl> 90.43333, 107.50000, 106.06667, 96.06667, 91.46667, 102.06~
## $ Height <dbl> 157.8000, 164.6000, 164.7667, 161.0667, 162.1667, 160.9667~
## $ Weight <dbl> 48.80000, 71.43333, 73.90000, 56.16667, 54.80000, 72.73333~
## $ Fat <dbl> 29.96667, 42.56667, 43.50000, 34.23333, 32.36667, 45.96667~
## $ VisceralFat <dbl> 3.000000, 5.000000, 5.000000, 3.666667, 3.000000, 5.000000~
## $ BMI <dbl> 19.70000, 26.23333, 27.10000, 21.66667, 20.90000, 28.10000~
## $ Muscle <dbl> 24.66667, 26.36667, 23.73333, 25.66667, 26.23333, 22.23333~
## $ Health <dbl> 75.00, 50.00, 43.75, 50.00, 68.75, 62.50, 75.00, 87.50, 93~
## $ Sample <chr> "Colombia-Urban", "Colombia-Urban", "Colombia-Urban", "Col~
# change the chr type to factor type
c(1,2,3,4,15)] <- lapply(dat3[,c(1,2,3,4,15)],as.factor) dat3[,
#aggregate(cbind(Height) ~ Sex, data=dat3, FUN=mean)
<- aggregate(cbind(dat3$Height, dat3$Waist, dat3$Weight), list(dat3$Sample,dat3$Sex), data=dat3, FUN=mean)
tbLdat3
# give name
colnames(tbLdat3) <- c("Sample", "Gender", "mean Height", "mean Waist", "mean Weight")
|> knitr::kable() tbLdat3
Sample | Gender | mean Height | mean Waist | mean Weight |
---|---|---|---|---|
Colombia-Urban | Female | 158.8906 | 71.79928 | 57.84819 |
Mexico-Indigenous | Female | 146.1667 | 87.00417 | 54.24583 |
Mexico-Urban | Female | 157.6667 | 87.79000 | 65.48000 |
Colombia-Urban | Male | 172.2061 | 78.16925 | 68.17627 |
Mexico-Indigenous | Male | 159.9077 | 88.61026 | 65.91282 |
Mexico-Urban | Male | 172.0933 | 84.51000 | 71.03000 |
The following zip file contains one subject’s laser-event potentials (LEP) data for 4 separate conditions (different level of stimulus intensity), each in a plain text file (1w.dat, 2w.dat, 3w.dat and 4w.dat). The rows are time points from -100 to 800 ms sampled at 2 ms per record. The columns are channel IDs.
Question: Input all the files into R for graphical exploration.
unzip("Subject1.zip")
# file list
= list.files(file.path(getwd(), "Subject1/"), pattern="*.dat", full.names = TRUE)
temp
# grap the column name
<- scan("./Subject1/1w.dat", skip=1, sep="\t", nlines = 1, what = "character")
dat4colname <- gsub(" ", "", dat4colname)
dat4colname dat4colname
## [1] "[F7]" "[FT7]" "[T7]" "[TP7]" "[P7]" "[Fp1]" "[F3]" "[FC3]" "[C3]"
## [10] "[CP3]" "[P3]" "[O1]" "[Fz]" "[FCz]" "[Cz]" "[CPz]" "[Pz]" "[Oz]"
## [19] "[Fp2]" "[F4]" "[FC4]" "[C4]" "[CP4]" "[P4]" "[O2]" "[F8]" "[FT8]"
## [28] "[T8]" "[TP8]" "[P8]" ""
# input list
<- lapply(temp, read.table, skip = 2, sep = "\t",
myfile header=FALSE, na.strings = c("", "NA"), fill = TRUE)
# link5 <- paste0("Subject1/", 1:4, sep ="w.dat")
# remove NA column
1]]$V31<-NULL
myfile[[2]]$V31<-NULL
myfile[[3]]$V31<-NULL
myfile[[4]]$V31<-NULL
myfile[[
<- dat4colname[-31]
dat4colnamenew
# add column name into each data frame
<- lapply(myfile, setNames, nm=dat4colnamenew)
dat4
# combine into one df
<- rbind(dat4[[1]], dat4[[2]], dat4[[3]], dat4[[4]])
dat4_df
# add a column to indicate condition
$condition<-rep(paste0(1:4, "w"), each=nrow(dat4[[1]]))
dat4_df
# add a column to indicate time
$time<-rep(seq(-100, 800, 2), 4)
dat4_df
::p_load(reshape)
pacman
# wide to long format
<- melt(dat4_df, id = c("time", "condition"))
dat4_melt
str(dat4_melt)
## 'data.frame': 54120 obs. of 4 variables:
## $ time : num -100 -98 -96 -94 -92 -90 -88 -86 -84 -82 ...
## $ condition: chr "1w" "1w" "1w" "1w" ...
## $ variable : Factor w/ 30 levels "[F7]","[FT7]",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ value : num -0.9733 -0.7079 -0.3732 -0.0225 0.3523 ...
::p_load(lattice)
pacman
xyplot(value ~ time | variable,
groups = condition,
data = dat4_melt, cex=.5,
type = c("p","g", "r"),
pch = 20,
xlab = "Time",
ylab = "LFP",
layout = c(6,5),
auto.key = list(space = "right"))
value ~ time in differ condition in one plot
::p_load(ggplot2, colorspace)
pacman
ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
geom_point(alpha = 0.5, size= 0.8) +
xlab("Time") +
ylab("LFP") +
scale_colour_discrete_qualitative(palette = "Dark 3") +
theme_minimal() +
theme(legend.title = element_blank())
value ~ time in differ condition separated
ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
geom_point(size = .8, alpha=0.5) +
facet_wrap( ~ condition, nrow = 1) +
xlab("Time") +
ylab("LFP") +
scale_colour_discrete_qualitative(palette = "Dark 3") +
theme_minimal() +
theme(legend.title = element_blank())
value ~ time in differ condition facet by differ position
ggplot(dat4_melt, aes(x = time, y = value, colour = condition)) +
geom_point(data = subset(dat4_melt, variable =condition),
aes(colour = condition), size = .2) +
scale_colour_discrete_qualitative(palette = "Dark 3") +
facet_wrap(~variable, nrow = 5) +
xlab("Time") +
ylab("LFP") +
theme_minimal() +
theme(legend.title = element_blank())
Reaction times of schizophrenics and others
The ASCII (plain text) file schiz.asc contains response times (in milliseconds) for 11 non-schizophrenics and 6 schizophrenics (30 measurements for each person).
Question: Summarize and compare descriptive statistics of the measurements from the two groups.
<- "http://www.stat.columbia.edu/~gelman/book/data/schiz.asc"
fL <- read.table(fL, skip = 4, col.names = paste0("T", 101:130))
dat5# skip = integer: the number of lines of the data file to skip before beginning to read data.
# dat5 <- read.table("schiz.asc", skip = 4, col.names = paste0("T", 101:130))
# Create Group and ID
$Group <- c(rep("Non-schiz", 11), rep("Schiz", 6))
dat5$ID <- 1:nrow(dat5)
dat5
head(dat5)
## T101 T102 T103 T104 T105 T106 T107 T108 T109 T110 T111 T112 T113 T114 T115
## 1 312 272 350 286 268 328 298 356 292 308 296 372 396 402 280
## 2 354 346 384 342 302 312 322 376 306 402 320 298 308 414 304
## 3 256 284 320 274 324 268 370 430 314 312 362 256 342 388 302
## 4 260 294 306 292 264 290 272 268 344 362 330 280 354 320 334
## 5 204 272 250 260 314 308 246 236 208 268 272 264 308 236 238
## 6 590 312 286 310 778 364 318 316 316 298 344 262 274 330 312
## T116 T117 T118 T119 T120 T121 T122 T123 T124 T125 T126 T127 T128 T129 T130
## 1 330 254 282 350 328 332 308 292 258 340 242 306 328 294 272
## 2 422 388 422 426 338 332 426 478 372 392 374 430 388 354 368
## 3 366 298 396 274 226 328 274 258 220 236 272 322 284 274 356
## 4 276 418 288 338 350 350 324 286 322 280 256 218 256 220 356
## 5 350 272 252 252 236 306 238 350 206 260 280 274 318 268 210
## 6 310 376 326 346 334 282 292 282 300 290 302 300 306 294 444
## Group ID
## 1 Non-schiz 1
## 2 Non-schiz 2
## 3 Non-schiz 3
## 4 Non-schiz 4
## 5 Non-schiz 5
## 6 Non-schiz 6
<- reshape(dat5, varying=list(1:30), idvar="ID",
dat5L v.names="RT", timevar="Test",
times=c(1:30), direction="long")
# mean and sd by group
aggregate(RT ~ Group, data=dat5L, FUN = mean) |> knitr::kable()
Group | RT |
---|---|
Non-schiz | 310.1697 |
Schiz | 506.8667 |
aggregate(RT ~ Group, data=dat5L, FUN = sd) |> knitr::kable()
Group | RT |
---|---|
Non-schiz | 64.8805 |
Schiz | 262.8473 |
# mean by subject
::datatable(aggregate(RT ~ ID, data=dat5L, FUN =mean), rownames=FALSE, options=list(pageLength=11)) DT
# aggregate(RT ~ ID, data=dat5L, FUN =mean) |> knitr::kable()
# sd by subject
::datatable(aggregate(RT ~ ID, data=dat5L, FUN =sd), rownames=FALSE, options=list(pageLength=11)) DT
# aggregate(RT ~ ID, data=dat5L, FUN =sd) |> knitr::kable()
# Order subject by response means
<- aggregate(RT ~ as.factor(ID), data=dat5L, FUN =mean)
x order(x$RT),] |> t() |> knitr::kable() x[
9 | 5 | 8 | 12 | 4 | 3 | 11 | 1 | 10 | 6 | 7 | 2 | 13 | 17 | 16 | 15 | 14 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
as.factor(ID) | 9 | 5 | 8 | 12 | 4 | 3 | 11 | 1 | 10 | 6 | 7 | 2 | 13 | 17 | 16 | 15 | 14 |
RT | 257.7333 | 265.2000 | 268.4667 | 303.5333 | 303.6000 | 306.2000 | 309.3333 | 311.0667 | 325.6667 | 339.8000 | 358.1333 | 366.6667 | 419.0667 | 488.5333 | 594.3333 | 599.8667 | 635.8667 |
with(dat5L, boxplot(RT ~ Group,
horizontal = TRUE,
frame = FALSE,
ylab = "Reaction time",
col = "aliceblue",
varwidth = TRUE
))
aov(RT ~ Group + as.integer(Test), data = dat5L) |> anova()
## Analysis of Variance Table
##
## Response: RT
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 1 4506212 4506212 166.2613 <2e-16 ***
## as.integer(Test) 1 10482 10482 0.3867 0.5343
## Residuals 507 13741318 27103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm(RT ~ Group + as.integer(Test), data = dat5L)
##
## Call:
## lm(formula = RT ~ Group + as.integer(Test), data = dat5L)
##
## Coefficients:
## (Intercept) GroupSchiz as.integer(Test)
## 318.2882 196.6970 -0.5238
The reaction time will increase 196.697 for subject as schizophrenics.
Source: Belin, T., & Rubin, D. (1995). The analysis of repeated-measures data on schizophrenic reaction times using mixture models. Statistics in Medicine 14(8), 747-768.
Rookie scholar…
::p_load(scholar)
pacman<- get_citation_history("s6G3jt4AAAAJ")
dat6 plot(dat6,
xlab="Year", ylab="Citations",
type='h', lwd=2,
xaxt="n", bty='n',
xlim=c(2019, 2021))
axis(side=1,
at=seq(2019, 2021, by=1),
cex.axis=0.7)
abline(h=seq(0, 400, by=100),
lty=2, col="darkgray")