DataM: HW Exercise 0323

HW exercise 1.

Here is a copy of the student roster in CSV format from NCKU for a course I taught. Display the number of students from each major.

student_roster <- read.table('../data/student_roster.txt', header = TRUE, sep=',')
head(student_roster)

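student_roster <- student_roster[-1,] # drop the first row, which is not needed here
# The column 系.年.班 holds department.year.class; its first three characters give the major
student_roster$major <- substr(student_roster$系.年.班, 1, 3)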
table(student_roster$major)


教育所 心理所 心理系 
     4      7      4
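
The counting step can be tried on a small made-up roster. This is a minimal sketch: the column name `dept_year_class` and all of its values are hypothetical and are not taken from the NCKU file.

```r
# Toy roster; the column name and values are made up for illustration.
roster <- data.frame(
  dept_year_class = c("PSY1A", "PSY2B", "EDU1A", "PSY1A"),
  stringsAsFactors = FALSE
)

# The first three characters identify the major.
roster$major <- substr(roster$dept_year_class, 1, 3)

# Count students per major, largest group first.
major_counts <- sort(table(roster$major), decreasing = TRUE)
print(major_counts)
```

Sorting the table is optional; it only makes the largest group easier to spot.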

HW exercise 2.

Chatterjee and Hadi (Regression Analysis by Example, 2006) provided a link to the right-to-work data set on their web page. Display the relationship between Income and Taxes.

[Solution and Answer]

Load in the data and check its structure.

dta2 <- read.table('../data/right_to_work.txt', header=TRUE, sep='\t')
head(dta2)

str(dta2)

'data.frame':   38 obs. of  8 variables:
 $ City  : Factor w/ 38 levels "Atlanta","Austin",..: 1 2 3 4 5 6 7 9 8 10 ...
 $ COL   : int  169 143 339 173 99 363 253 117 294 291 ...
 $ PD    : int  414 239 43 951 255 1257 834 162 229 1886 ...
 $ URate : num  13.6 11 23.7 21 16 24.4 39.2 31.5 18.2 31.5 ...
 $ Pop   : int  1790128 396891 349874 2147850 411725 3914071 1326848 162304 164145 7015251 ...
 $ Taxes : int  5128 4303 4166 5001 3965 4928 4471 4813 4839 5408 ...
 $ Income: int  2961 1711 2122 4654 1620 5634 7213 5535 7224 6113 ...
 $ RTWL  : int  1 1 0 0 1 0 0 0 1 0 ...

Compute the correlation coefficient between Income and Taxes.

cor(dta2$Income, dta2$Taxes)

[1] 0.0560718

The correlation is close to zero (r = 0.056), so Income and Taxes show essentially no linear relationship in the pooled data.

plot(dta2$Income, dta2$Taxes, pch=19)

plot(dta2$Income, dta2$Taxes, col=factor(dta2$RTWL), pch=19)

dta2_RTWL1 <- dta2[dta2$RTWL == 1,]
cor(dta2_RTWL1$Income, dta2_RTWL1$Taxes)

[1] 0.2340646

dta2_RTWL0 <- dta2[dta2$RTWL == 0,]
cor(dta2_RTWL0$Income, dta2_RTWL0$Taxes)

[1] -0.2015487

Splitting the data into two groups by the binary variable RTWL reveals a weak correlation between Income and Taxes within each group, with opposite signs.

In the RTWL=1 group, Income is weakly positively correlated with Taxes (r = 0.234).
In the RTWL=0 group, Income is weakly negatively correlated with Taxes (r = -0.202).
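
A pooled correlation near zero can hide within-group correlations of opposite sign. The sketch below illustrates this on a deliberately extreme toy data set (the values are made up and only the column names follow the right-to-work data), and computes the group-wise correlations in one call with split() and sapply().

```r
# Toy data: perfectly positive relation in one group, perfectly negative in the other.
toy <- data.frame(
  Income = c(1:5, 1:5),
  Taxes  = c(1:5, 5:1),
  RTWL   = rep(c(1, 0), each = 5)
)

# Correlation ignoring the grouping variable.
pooled <- cor(toy$Income, toy$Taxes)

# Correlation within each level of RTWL; the result is named by group ("0", "1").
by_group <- sapply(split(toy, toy$RTWL), function(d) cor(d$Income, d$Taxes))

print(pooled)
print(by_group)
```

The same split() and sapply() pattern could replace the two manual subsets, assuming the grouping column is complete.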

HW exercise 3.

Download the data file for the junior school project and read it into your current R session. Assign the data set to a data frame object called jsp.

[Solution and Answer]

Load in the dataset and check its structure

jsp <- read.table('../data/junior_school_project.txt', header=TRUE, sep='\t')
head(jsp)

str(jsp)

'data.frame':   3236 obs. of  9 variables:
 $ school : Factor w/ 49 levels "S1","S10","S11",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ class  : Factor w/ 4 levels "C1","C2","C3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ sex    : Factor w/ 2 levels "B","G": 2 2 2 1 1 1 1 1 1 1 ...
 $ soc    : int  9 9 9 2 2 2 2 2 9 9 ...
 $ ravens : int  23 23 23 15 15 22 22 22 14 14 ...
 $ pupil  : Factor w/ 1192 levels "P1","P10","P100",..: 1 1 1 413 413 512 512 512 612 612 ...
 $ english: int  72 80 39 7 17 88 89 83 12 25 ...
 $ math   : int  23 24 23 14 11 36 32 39 24 26 ...
 $ year   : int  0 1 2 0 1 0 1 2 0 1 ...

Re-name the variable sex as Gender.

colnames(jsp)[colnames(jsp) == 'sex'] <- 'Gender'

Re-label the values of the social class variable using the (long character strings) descriptive terms to produce the following plot.

jsp$soc <- factor(jsp$soc, labels=c('I', 'II', 'III_Oman', 'III_man', 'IV', 'V', 'VI_Unemp_L', 'VII_emp_NC', 'VIII_Miss_dad'))
boxplot(math ~ soc, data=jsp)
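
Calling factor() with only labels relies on every code from 1 to 9 being present in the data, because the labels are matched to the sorted unique values. A slightly more defensive variant, sketched here on made-up codes, also states the levels explicitly so each code always maps to the same label.

```r
# Toy codes, not taken from the jsp file; only the labels come from the text above.
soc_codes <- c(2, 9, 1, 2)
soc_labels <- c('I', 'II', 'III_Oman', 'III_man', 'IV', 'V',
                'VI_Unemp_L', 'VII_emp_NC', 'VIII_Miss_dad')

# Explicit levels keep the code-to-label mapping fixed even when some codes are absent.
soc <- factor(soc_codes, levels = 1:9, labels = soc_labels)
print(table(soc))
```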

Save the edited jsp data object as a comma-separated-value file and in an R data format to a data folder, then read each back into your R session separately.

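write.csv(jsp, 'jsp_edited.csv', row.names = FALSE)   # CSV copy, without the row-index column
jsp_edited <- read.csv('./jsp_edited.csv')
save(jsp, file = 'jsp_edited.rda')                    # R data format copy
load('./jsp_edited.rda')                              # restores the object named jsp
head(jsp_edited)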
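
One practical difference between the two storage formats is what happens to factor columns such as the relabelled social class. The sketch below uses a made-up data frame and temporary files (all names are hypothetical) to compare a CSV round trip with an R data round trip.

```r
# Toy data frame with a factor whose level order is not alphabetical.
toy <- data.frame(
  score = c(10, 20, 30),
  grp   = factor(c("mid", "high", "low"), levels = c("low", "mid", "high"))
)

csv_path <- file.path(tempdir(), "toy.csv")
rda_path <- file.path(tempdir(), "toy.rda")

# CSV round trip: plain text, so the factor level order is not stored.
write.csv(toy, csv_path, row.names = FALSE)
back_csv <- read.csv(csv_path)

# R data round trip: load() into a separate environment to avoid overwriting toy.
save(toy, file = rda_path)
restored <- new.env()
load(rda_path, envir = restored)
back_rda <- restored$toy

print(identical(back_rda, toy))
print(identical(levels(back_csv$grp), levels(toy$grp)))
```

If the level order matters for later plots, the R data format is the safer choice, or the factor can be rebuilt after reading the CSV.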

HW exercise 4.

The following zip file contains one subject’s laser-evoked potentials (LEP) data for 4 separate conditions (different levels of stimulus intensity), each in a plain text file (1w.dat, 2w.dat, 3w.dat and 4w.dat). The rows are time points from -100 to 800 ms sampled at 2 ms per record. The columns are channel IDs. Input all the files into R for graphical exploration.

[Solution and Answer]

Load in the datasets

library(data.table)
LEP <- list()
for (i in 1:4) {
  LEP[[i]] <- fread(paste0('../data/Subject1/', i,'w.dat'))
  LEP[[i]]$V31 <- NULL
}

Since I am not familiar with LEP data, I start by examining the pairwise correlations between columns (channels).

library(corrplot)

corrplot 0.84 loaded

corrplot(cor(LEP[[1]]))

corrplot(cor(LEP[[2]]))

corrplot(cor(LEP[[3]]))

corrplot(cor(LEP[[4]]))

Fp1 and Fp2 stand out. In the fourth condition, most pairwise correlations between channels are strongly positive, yet most correlations between Fp2 and the other channels are weaker, and most of them are even negative.
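
Channels that behave differently can also be located numerically rather than by eye. A minimal sketch on made-up data follows (the channel names ch1 to ch4 are hypothetical): it averages each channel's correlation with all other channels and sorts the result.

```r
# Toy signals: ch1-ch3 move together, ch4 moves in the opposite direction.
base <- 1:10
wiggle <- rep(c(0, 1), times = 5)
channels <- data.frame(
  ch1 = base,
  ch2 = base + wiggle,
  ch3 = 2 * base - wiggle,
  ch4 = -base
)

r <- cor(channels)
diag(r) <- NA                        # ignore each channel's correlation with itself
mean_r <- rowMeans(r, na.rm = TRUE)  # mean correlation with the other channels

# Channels with the lowest mean correlation come first.
print(sort(mean_r))
```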

Take a closer look at Fp1 and Fp2.

LEP_df <- rbind(LEP[[1]], LEP[[2]], LEP[[3]], LEP[[4]])
colnames(LEP_df)[colnames(LEP_df) == '[     Fp1]'] <- 'Fp1'
colnames(LEP_df)[colnames(LEP_df) == '[     Fp2]'] <- 'Fp2'
LEP_df$condition <- rep(paste0(1:4, 'w'), each=nrow(LEP[[1]]))
LEP_df$Time <- rep(seq(-100,800,2), 4)
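
The Time and condition columns assume that every file has one row per time point from -100 to 800 ms in 2 ms steps. The following sketch only checks that arithmetic and builds the two index columns on their own; the object names are hypothetical.

```r
# Time axis implied by the exercise statement: -100 to 800 ms in 2 ms steps.
time_ms <- seq(-100, 800, by = 2)
n_per_condition <- length(time_ms)

# One block of time points per condition, stacked in the order 1w, 2w, 3w, 4w.
conditions <- paste0(1:4, "w")
design <- data.frame(
  condition = rep(conditions, each = n_per_condition),
  Time      = rep(time_ms, times = length(conditions))
)

print(n_per_condition)
print(table(design$condition))
```

Comparing n_per_condition with nrow() of each loaded file before stacking would catch a file with missing or extra records.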

library(ggplot2)
qplot(x=Time, y=Fp1, col=condition, data=LEP_df, geom='line')

qplot(x=Time, y=Fp2, col=condition, data=LEP_df, geom='line')

Both Fp1 and Fp2 show different patterns in different conditions.

HW exercise 5.

The ASCII (plain text) file schiz.asc contains response times (in milliseconds) for 11 non-schizophrenics and 6 schizophrenics (30 measurements for each person). Summarize and compare descriptive statistics of the measurements from the two groups. Source: Belin, T., & Rubin, D. (1995). The analysis of repeated-measures data on schizophrenic reaction times using mixture models. Statistics in Medicine 14(8), 747-768.

[Solution and Answer]

Load in the data set and check its structure

ASCII <- rbind(read.table('../data/ASCII_schiz.txt', header = FALSE),
              read.table('../data/ASCII_non.txt', header = FALSE))
head(ASCII)

str(ASCII)

'data.frame':   17 obs. of  30 variables:
 $ V1 : int  312 354 256 260 204 590 308 244 232 318 ...
 $ V2 : int  272 346 284 294 272 312 364 240 262 324 ...
 $ V3 : int  350 384 320 306 250 286 374 278 230 282 ...
 $ V4 : int  286 342 274 292 260 310 278 262 222 364 ...
 $ V5 : int  268 302 324 264 314 778 366 266 210 286 ...
 $ V6 : int  328 312 268 290 308 364 310 254 284 342 ...
 $ V7 : int  298 322 370 272 246 318 358 240 232 306 ...
 $ V8 : int  356 376 430 268 236 316 380 244 228 302 ...
 $ V9 : int  292 306 314 344 208 316 294 226 264 280 ...
 $ V10: int  308 402 312 362 268 298 334 266 246 306 ...
 $ V11: int  296 320 362 330 272 344 302 294 264 256 ...
 $ V12: int  372 298 256 280 264 262 250 250 316 334 ...
 $ V13: int  396 308 342 354 308 274 542 284 260 332 ...
 $ V14: int  402 414 388 320 236 330 340 260 266 336 ...
 $ V15: int  280 304 302 334 238 312 352 418 304 360 ...
 $ V16: int  330 422 366 276 350 310 322 280 268 344 ...
 $ V17: int  254 388 298 418 272 376 372 294 384 480 ...
 $ V18: int  282 422 396 288 252 326 348 216 234 310 ...
 $ V19: int  350 426 274 338 252 346 460 308 308 336 ...
 $ V20: int  328 338 226 350 236 334 322 324 266 314 ...
 $ V21: int  332 332 328 350 306 282 374 264 294 392 ...
 $ V22: int  308 426 274 324 238 292 370 232 254 284 ...
 $ V23: int  292 478 258 286 350 282 334 294 222 292 ...
 $ V24: int  258 372 220 322 206 300 360 236 262 280 ...
 $ V25: int  340 392 236 280 260 290 318 226 278 320 ...
 $ V26: int  242 374 272 256 280 302 356 234 290 322 ...
 $ V27: int  306 430 322 218 274 300 338 274 208 286 ...
 $ V28: int  328 388 284 256 318 306 346 258 232 406 ...
 $ V29: int  294 354 274 220 268 294 462 208 206 352 ...
 $ V30: int  272 368 356 356 210 444 510 380 206 324 ...

Create the variables needed for the analysis: a subject ID and a group label Schiz.

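# Rows 1-11 are labelled non-schizophrenic and rows 12-17 schizophrenic; this must match the row order produced by rbind()
ASCII$Schiz <- c(rep('Non-schiz', 11), rep('Schiz', 6))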
ASCII$ID <- 1:nrow(ASCII)
head(ASCII)

Reshape the dataset into a long format

# ASCII is a data.frame, so call the reshape2 method explicitly instead of the data.table generic
ASCII_long <- reshape2::melt(ASCII, id.vars=c('Schiz', 'ID'))

colnames(ASCII_long)

[1] "Schiz"    "ID"       "variable" "value"

colnames(ASCII_long)[3:4] <- c("Time", "RT")
head(ASCII_long)

Display each individual's mean RT over the 30 measurements.

ASCII_ID <- aggregate(RT ~ ID, mean, data=ASCII_long)
ASCII_ID$Schiz <- ASCII$Schiz
ASCII_ID

boxplot(RT ~ Schiz, data=ASCII_ID)

The boxplot shows that the non-schizophrenic group has lower mean response times.
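
Since the exercise asks for descriptive statistics, a table of summaries per group is a natural companion to the boxplot. This is a minimal sketch on made-up response times (the values are not from schiz.asc); the same call could be pointed at the long-format data.

```r
# Toy response times in ms for two groups.
toy <- data.frame(
  Schiz = rep(c("Non-schiz", "Schiz"), each = 4),
  RT    = c(300, 310, 290, 300, 400, 600, 350, 450)
)

# One column per group, one row per statistic.
desc <- sapply(split(toy$RT, toy$Schiz), function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x),
    median = median(x), min = min(x), max = max(x))
})

print(round(desc, 1))
```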

Conduct a one-between, one-within analysis of variance.

model <- aov(RT ~ Schiz + Time + Error(ID / Time), data=ASCII_long)
summary(model)


Error: ID
      Df  Sum Sq Mean Sq
Schiz  1 3503780 3503780

Error: ID:Time
     Df  Sum Sq Mean Sq
Time 29 1232036   42484

Error: Within
           Df   Sum Sq Mean Sq F value   Pr(>F)    
Schiz       1  1042372 1042372  38.392 1.31e-09 ***
Time       29   288981    9965   0.367    0.999    
Residuals 449 12190843   27151                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

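Since \(p<\alpha=.05\) for Schiz in the Within stratum (F = 38.39, p = 1.31e-09), we reject the null hypothesis of no group difference. The main effect of Schiz is significant: response times are higher in the schizophrenic group than in the non-schizophrenic group. The effect of Time is not significant (p = 0.999).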
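
Because ID was created with 1:nrow(ASCII), it is stored as an integer, and aov() then treats it inside Error() as a numeric covariate rather than as a subject identifier. That is why Schiz appears both in the ID stratum and in the Within stratum. A common alternative, sketched here on simulated data only (all values, sample sizes and effect sizes are made up), converts ID to a factor so that the group effect is tested against between-subject variation.

```r
set.seed(1)
n_subj <- 6   # hypothetical number of subjects
n_rep  <- 5   # hypothetical number of repeated measurements

# One row per subject and time point; ID and Time are both factors.
toy <- expand.grid(Time = factor(1:n_rep), ID = factor(1:n_subj))
toy$Schiz <- factor(ifelse(as.integer(toy$ID) <= 3, "Non-schiz", "Schiz"))

# Simulated RT: group shift + subject-specific offset + measurement noise.
subj_effect <- rnorm(n_subj, sd = 20)
toy$RT <- 300 + 80 * (toy$Schiz == "Schiz") +
  subj_effect[as.integer(toy$ID)] + rnorm(nrow(toy), sd = 15)

# With a factor ID, Schiz is tested in the ID stratum against subject-level residuals.
fit <- aov(RT ~ Schiz + Time + Error(ID / Time), data = toy)
print(summary(fit))
```

In this specification the ID stratum has one degree of freedom for Schiz and (number of subjects - 2) residual degrees of freedom, so the test reflects the number of people rather than the number of measurements.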