DataM: HW Exercise 0323
HW exercise 1.
Here is a copy of the student roster in CSV format from NCKU for a course I taught. Display the number of students from each major.
student_roster <- read.table('../data/student_roster.txt', header = TRUE, sep=',')
head(student_roster)
student_roster <- student_roster[-1,] # the first row is not needed here
student_roster$major <- substr(student_roster$系.年.班, 1, 3)
table(student_roster$major)
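教育所 心理所 心理系
4 7 4
(教育所: Graduate Institute of Education; 心理所: Graduate Institute of Psychology; 心理系: Department of Psychology.)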
HW exercise 2.
Chatterjee and Hadi (Regression by Examples, 2006) provided a link to the right to work data set on their web page. Display the relationship between Income and Taxes.
[Solution and Answer]
- Load in the data and check its structure.
'data.frame': 38 obs. of 8 variables:
$ City : Factor w/ 38 levels "Atlanta","Austin",..: 1 2 3 4 5 6 7 9 8 10 ...
$ COL : int 169 143 339 173 99 363 253 117 294 291 ...
$ PD : int 414 239 43 951 255 1257 834 162 229 1886 ...
$ URate : num 13.6 11 23.7 21 16 24.4 39.2 31.5 18.2 31.5 ...
$ Pop : int 1790128 396891 349874 2147850 411725 3914071 1326848 162304 164145 7015251 ...
$ Taxes : int 5128 4303 4166 5001 3965 4928 4471 4813 4839 5408 ...
$ Income: int 2961 1711 2122 4654 1620 5634 7213 5535 7224 6113 ...
$ RTWL : int 1 1 0 0 1 0 0 0 1 0 ...
- Compute the correlation coefficient between Income and Taxes.
[1] 0.0560718
Income and Taxes do not have a strong linear relationship.
[1] 0.2340646
[1] -0.2015487
Splitting the data by the binary variable RTWL reveals a weak correlation between Income and Taxes within each group, with opposite signs in the two groups.
- In the RTWL=1 group, Income is slightly positively correlated with Taxes.
- In the RTWL=0 group, Income is slightly negatively correlated with Taxes.
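The code behind the three correlations is not shown. A minimal sketch of how the overall and per-group coefficients could be computed is given here. The data frame `dta` and all of its values are made up for illustration; only the column names Income, Taxes, and RTWL come from the structure printed above.

```r
# Hypothetical stand-in for the right-to-work data; the values are made up.
set.seed(323)
dta <- data.frame(
  Income = round(runif(38, 1500, 7500)),
  Taxes  = round(runif(38, 3900, 5500)),
  RTWL   = rep(c(0, 1), times = c(28, 10))
)

# Overall Pearson correlation between Income and Taxes
r_all <- cor(dta$Income, dta$Taxes)

# Correlation within each RTWL group: split the rows, then apply cor()
r_by_group <- sapply(split(dta, dta$RTWL),
                     function(d) cor(d$Income, d$Taxes))

r_all
r_by_group
```

The result of `sapply()` is a named vector with one coefficient per level of RTWL, which makes it easy to see whether the sign differs between groups.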
HW exercise 3.
Download the data file from the junior school project and read it into your current R session. Assign the data set to a data frame object called jsp.
[Solution and Answer]
- Load in the dataset and check its structure
'data.frame': 3236 obs. of 9 variables:
$ school : Factor w/ 49 levels "S1","S10","S11",..: 1 1 1 1 1 1 1 1 1 1 ...
$ class : Factor w/ 4 levels "C1","C2","C3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "B","G": 2 2 2 1 1 1 1 1 1 1 ...
$ soc : int 9 9 9 2 2 2 2 2 9 9 ...
$ ravens : int 23 23 23 15 15 22 22 22 14 14 ...
$ pupil : Factor w/ 1192 levels "P1","P10","P100",..: 1 1 1 413 413 512 512 512 612 612 ...
$ english: int 72 80 39 7 17 88 89 83 12 25 ...
$ math : int 23 24 23 14 11 36 32 39 24 26 ...
$ year : int 0 1 2 0 1 0 1 2 0 1 ...
- Rename the variable sex as Gender.
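The renaming code is not shown. One common base R way to do it is sketched here on a toy data frame (`jsp_demo` and its values are hypothetical; only the column name sex comes from the structure above).

```r
# Toy data frame that reuses the column name from str(jsp); values are made up.
jsp_demo <- data.frame(sex = c("G", "B", "B"), math = c(23, 14, 36))

# Replace the name "sex" with "Gender", leaving all other names untouched
names(jsp_demo)[names(jsp_demo) == "sex"] <- "Gender"

names(jsp_demo)
```

Matching on the name rather than on the column position keeps the code correct even if the column order changes.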
- Re-label the values of the social class variable using the (long character strings) descriptive terms to produce the following plot.
jsp$soc <- factor(jsp$soc, labels=c('I', 'II', 'III_Oman', 'III_man', 'IV', 'V', 'VI_Unemp_L', 'VII_emp_NC', 'VIII_Miss_dad'))
boxplot(math ~ soc, data=jsp)
- Save the edited jsp data object out as a comma-separated-value file and in an R data format to a data folder and read them back into your R session, separately.
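write.csv(jsp, './jsp_edited.csv', row.names = FALSE) # CSV copy, written without the row-index column
jsp_edited <- read.csv('./jsp_edited.csv') # read the CSV copy back in
save(jsp, file = './jsp_edited.rda') # R data format copy
load('./jsp_edited.rda') # restores the object jsp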
head(jsp_edited)
HW exercise 4.
The following zip file contains one subject’s laser-evoked potentials (LEP) data for 4 separate conditions (different levels of stimulus intensity), each in a plain text file (1w.dat, 2w.dat, 3w.dat and 4w.dat). The rows are time points from -100 to 800 ms sampled at 2 ms per record. The columns are channel IDs. Input all the files into R for graphical exploration.
[Solution and Answer]
- Load in the datasets
library(data.table)
LEP <- list()
for (i in 1:4) {
LEP[[i]] <- fread(paste0('../data/Subject1/', i,'w.dat'))
LEP[[i]]$V31 <- NULL
}
- Since I am not familiar with LEP data, I start by looking at the pairwise correlations between columns (channels).
corrplot 0.84 loaded
Fp1 and Fp2 stand out. In the fourth condition, most pairwise correlations between channels are strongly positive, yet the correlations between Fp2 and the other channels are weaker, and many of them are even negative.
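The call that produced the correlation plot is not shown. The sketch here uses base R on synthetic data to illustrate the idea: compute the channel-by-channel correlation matrix, then summarize how each channel relates to the rest. The data frame `lep_demo`, its channel set, and its values are all made up, and Fp2 is deliberately constructed to run opposite to the other channels.

```r
# Synthetic stand-in for one condition: 451 time points x 4 channels (made-up values).
set.seed(1)
signal <- sin(seq(0, 4 * pi, length.out = 451))
lep_demo <- data.frame(
  Fp2 = -signal + rnorm(451, sd = 0.3),  # constructed to oppose the other channels
  Fz  =  signal + rnorm(451, sd = 0.3),
  Cz  =  signal + rnorm(451, sd = 0.3),
  Pz  =  signal + rnorm(451, sd = 0.3)
)

# Pairwise Pearson correlations between all channels
cors <- cor(lep_demo)

# Mean correlation of each channel with the others (diagonal excluded)
diag(cors) <- NA
mean_cor <- rowMeans(cors, na.rm = TRUE)
round(mean_cor, 2)

# If the corrplot package is installed, the matrix can be drawn with:
# corrplot::corrplot(cor(lep_demo))
```

A channel whose mean correlation with the others is negative while the rest are positive is the kind of pattern described for Fp2 above.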
- Take a closer look at Fp1 and Fp2.
LEP_df <- rbind(LEP[[1]], LEP[[2]], LEP[[3]], LEP[[4]])
colnames(LEP_df)[colnames(LEP_df) == '[ Fp1]'] <- 'Fp1'
colnames(LEP_df)[colnames(LEP_df) == '[ Fp2]'] <- 'Fp2'
LEP_df$condition <- rep(paste0(1:4, 'w'), each=nrow(LEP[[1]]))
LEP_df$Time <- rep(seq(-100,800,2), 4)
library(ggplot2)
qplot(x=Time, y=Fp1, col=condition, data=LEP_df, geom='line')
Both Fp1 and Fp2 show different patterns in different conditions.
HW exercise 5.
The ASCII (plain text) file schiz.asc contains response times (in milliseconds) for 11 non-schizophrenics and 6 schizophrenics (30 measurements for each person). Summarize and compare descriptive statistics of the measurements from the two groups. Source: Belin, T., & Rubin, D. (1995). The analysis of repeated-measures data on schizophrenic reaction times using mixture models. Statistics in Medicine 14(8), 747-768.
[Solution and Answer]
- Load in the data set and check its structure
ASCII <- rbind(read.table('../data/ASCII_schiz.txt', header = FALSE),
read.table('../data/ASCII_non.txt', header = FALSE))
head(ASCII)
'data.frame': 17 obs. of 30 variables:
$ V1 : int 312 354 256 260 204 590 308 244 232 318 ...
$ V2 : int 272 346 284 294 272 312 364 240 262 324 ...
$ V3 : int 350 384 320 306 250 286 374 278 230 282 ...
$ V4 : int 286 342 274 292 260 310 278 262 222 364 ...
$ V5 : int 268 302 324 264 314 778 366 266 210 286 ...
$ V6 : int 328 312 268 290 308 364 310 254 284 342 ...
$ V7 : int 298 322 370 272 246 318 358 240 232 306 ...
$ V8 : int 356 376 430 268 236 316 380 244 228 302 ...
$ V9 : int 292 306 314 344 208 316 294 226 264 280 ...
$ V10: int 308 402 312 362 268 298 334 266 246 306 ...
$ V11: int 296 320 362 330 272 344 302 294 264 256 ...
$ V12: int 372 298 256 280 264 262 250 250 316 334 ...
$ V13: int 396 308 342 354 308 274 542 284 260 332 ...
$ V14: int 402 414 388 320 236 330 340 260 266 336 ...
$ V15: int 280 304 302 334 238 312 352 418 304 360 ...
$ V16: int 330 422 366 276 350 310 322 280 268 344 ...
$ V17: int 254 388 298 418 272 376 372 294 384 480 ...
$ V18: int 282 422 396 288 252 326 348 216 234 310 ...
$ V19: int 350 426 274 338 252 346 460 308 308 336 ...
$ V20: int 328 338 226 350 236 334 322 324 266 314 ...
$ V21: int 332 332 328 350 306 282 374 264 294 392 ...
$ V22: int 308 426 274 324 238 292 370 232 254 284 ...
$ V23: int 292 478 258 286 350 282 334 294 222 292 ...
$ V24: int 258 372 220 322 206 300 360 236 262 280 ...
$ V25: int 340 392 236 280 260 290 318 226 278 320 ...
$ V26: int 242 374 272 256 280 302 356 234 290 322 ...
$ V27: int 306 430 322 218 274 300 338 274 208 286 ...
$ V28: int 328 388 284 256 318 306 346 258 232 406 ...
$ V29: int 294 354 274 220 268 294 462 208 206 352 ...
$ V30: int 272 368 356 356 210 444 510 380 206 324 ...
- Create the needed variables: ID and the group label Schiz.
- Reshape the dataset into a long format
Warning in melt(ASCII, id = c("Schiz", "ID")): The melt generic in
data.table has been passed a data.frame and will attempt to redirect to the
relevant reshape2 method; please note that reshape2 is deprecated, and this
redirection is now deprecated as well. To continue using melt methods from
reshape2 while both libraries are attached, e.g. melt.list, you can prepend
the namespace like reshape2::melt(ASCII). In the next version, this warning
will become an error.
[1] "Schiz" "ID" "variable" "value"
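The code for these two steps is not shown, and the warning indicates that a `melt()` call was used. As an alternative that needs no extra packages, the sketch here adds the two variables and reshapes with base R `reshape()`. The data are synthetic, the name `wide` is hypothetical, and the assumption that the first 6 rows are the schizophrenic participants follows only from the order of the `rbind()` call above.

```r
# Synthetic stand-in: 17 persons x 30 response times (made-up values).
set.seed(1)
wide <- as.data.frame(matrix(round(rnorm(17 * 30, 350, 50)), nrow = 17))  # columns V1..V30

# ID as a factor, and the group label (assumed order: 6 schizophrenic, then 11 not)
wide$ID    <- factor(seq_len(nrow(wide)))
wide$Schiz <- factor(rep(c("Yes", "No"), times = c(6, 11)))

# Wide to long: one row per person per measurement occasion
long <- reshape(wide,
                direction = "long",
                varying   = paste0("V", 1:30),
                v.names   = "value",
                timevar   = "Time",
                times     = 1:30,
                idvar     = "ID")

dim(long)
names(long)
```

Storing ID as a factor at this stage matters later, because `aov()` treats a numeric ID as a covariate rather than as a subject identifier.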
- Display each individual's average RT over the 30 measurements.
The non-schizophrenic group clearly has lower average RTs.
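The summary code is not shown. A minimal sketch of per-person and per-group descriptive statistics follows, working directly on a wide matrix. The matrix `rt`, the group sizes by row, and the 150 ms shift are made up so that the two groups differ in the same direction as described above.

```r
# Synthetic stand-in: 17 persons x 30 response times in ms (made-up values).
set.seed(2)
rt <- matrix(round(rnorm(17 * 30, mean = 350, sd = 50)), nrow = 17)
rt[1:6, ] <- rt[1:6, ] + 150  # made-up shift for the first 6 rows
group <- rep(c("schizophrenic", "non-schizophrenic"), times = c(6, 11))

# Average RT per person over the 30 measurements
person_mean <- rowMeans(rt)

# Group-level mean and standard deviation of the person averages
group_mean <- tapply(person_mean, group, mean)
group_sd   <- tapply(person_mean, group, sd)

round(rbind(mean = group_mean, sd = group_sd), 1)
```

Comparing the spread as well as the mean is worthwhile here, since the two groups may differ in variability and not only in location.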
- Conduct a one-between, one-within analysis of variance (Schiz between subjects, Time within subjects)
Error: ID
Df Sum Sq Mean Sq
Schiz 1 3503780 3503780
Error: ID:Time
Df Sum Sq Mean Sq
Time 29 1232036 42484
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
Schiz 1 1042372 1042372 38.392 1.31e-09 ***
Time 29 288981 9965 0.367 0.999
Residuals 449 12190843 27151
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since \(p < \alpha = .05\), we reject the null hypothesis: the main effect of Schiz is significant. Response times of participants with schizophrenia are longer than those of non-schizophrenic participants.
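In the table above, the ID stratum contains only Schiz (1 df) with no residual line, and the F test for Schiz appears in the Within stratum. That pattern is what `aov()` tends to produce when the subject identifier is stored as a number rather than a factor, though this is only a guess because the call is not shown. The sketch here, on made-up data, shows the specification with ID and Time as factors, in which Schiz is tested against between-person variation. All names and values are hypothetical.

```r
# Synthetic long-format data: 17 persons x 30 occasions (made-up values).
set.seed(3)
n_id <- 17
n_time <- 30
long <- expand.grid(Time = factor(seq_len(n_time)), ID = factor(seq_len(n_id)))

# Assumed grouping: persons 1-6 schizophrenic, 7-17 not
long$Schiz <- factor(ifelse(as.integer(long$ID) <= 6, "Yes", "No"))

# Response = grand mean + group shift + person effect + noise
person_effect <- rnorm(n_id, sd = 40)
long$value <- 320 + 150 * (long$Schiz == "Yes") +
  person_effect[as.integer(long$ID)] + rnorm(nrow(long), sd = 60)

# One-between (Schiz), one-within (Time) ANOVA with persons as the error stratum
fit <- aov(value ~ Schiz * Time + Error(ID / Time), data = long)
summary(fit)
```

With this specification the ID stratum has 15 residual degrees of freedom (17 persons minus 2 groups), so the Schiz effect is compared against variation between persons rather than against the pooled within-person residual.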