Assignment 1

You are going to analyze a real data set. The data come from a large study examining EEG correlates of genetic predisposition to alcoholism. They contain measurements from 64 electrodes placed on the subjects' scalps, sampled at 256 Hz (3.9-msec epoch) for 1 second.

You can check the description of the data file at the following link: https://archive.ics.uci.edu/ml/machine-learning-databases/eeg-mld/eeg.data.html.

Use the letter of your lab group to obtain the seed that determines which users you have to read.

Users to Load

Groups with A+ read 10 users, groups with A read 5 users, and the remaining groups read 2 users. To find out which users, execute the following code with your own lab letter and seed.

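# myseed (the seed obtained from your lab letter) and result2 (the vector of
# available user archives) must already be defined before running this chunk.
set.seed(myseed)
nAplus <- 10
nA <- 5
nBC <- 2

myGroupn <- nAplus # Write here the correct one
usersToRead <- sample(1:length(result2), myGroupn, replace = FALSE)
# You need to load the following users
# result2[usersToRead]
data("eegdata")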

My users are

result2[usersToRead]

##  [1] "co2a0000424.tar.gz" "co2a0000371.tar.gz" "co2c0000364.tar.gz"
##  [4] "co2c0000354.tar.gz" "co2a0000405.tar.gz" "co2a0000412.tar.gz"
##  [7] "co2c0000339.tar.gz" "co2a0000416.tar.gz" "co2c0000342.tar.gz"
## [10] "co2a0000443.tar.gz"

How to Fill your DATA FRAME

The data frame has seven columns:

1. First column: “alch” if the user is alcoholic, and “nonAl” if the user is a control
2. Second column: user identifier
3. Third column: paradigm (S1 obj / S2 nomatch / S2 match)
4. Fourth column: replication number or trial (you can see there are many samples)
5. Fifth column: channel; there are 64 channels, identified by name or by number (0 to 63)
6. Sixth column: time; values run from 0 to 255 or from 1 to 256
7. Seventh column: microvolts

## 'data.frame':    0 obs. of  7 variables:
##  $ UserType   : chr 
##  $ UserId     : int 
##  $ Paradigm   : chr 
##  $ Replication: int 
##  $ Channel    : int 
##  $ Time       : int 
##  $ Microvolts : num

You can use the df data frame to fill it up with the correct data.
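As a minimal sketch, an empty data frame with this structure could be created as follows. The column names and types are taken from the listing above; the construction itself is only one possible way to do it.

```r
# Empty data frame with the seven required columns and their types.
df <- data.frame(
  UserType    = character(),
  UserId      = integer(),
  Paradigm    = character(),
  Replication = integer(),
  Channel     = integer(),
  Time        = integer(),
  Microvolts  = numeric(),
  stringsAsFactors = FALSE
)

# Inspect the structure: 0 observations of 7 variables.
str(df)
```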

Question to Solve

Problem 1

Load the data into the data frame and then save it with the name myDF.Rda. Use your own function and explain how it works.

Save the complete data frame

library(eegkit)  # provides geteegdata()
eegs1 <- geteegdata(indir = "C:/Users/Carlos/Documents/Descargas Chrome/Estadisitca/Assigment 1/Users/", cond = "S1", filename = "eegtrainS1")

## subject: co2a0000364 
## subject: co2a0000371 
## subject: co2a0000405 
## subject: co2a0000412 
## subject: co2a0000416 
## subject: co2a0000424 
## subject: co2a0000443 
## subject: co2c0000339 
## subject: co2c0000342 
## subject: co2c0000354

eegS2m=geteegdata(indir="C:/Users/Carlos/Documents/Descargas Chrome/Estadisitca/Assigment 1/Users/", cond="S2m",filename="eegtrainS2m")

## subject: co2a0000364 
## subject: co2a0000371 
## subject: co2a0000405 
## subject: co2a0000412 
## subject: co2a0000416 
## subject: co2a0000424 
## subject: co2a0000443 
## subject: co2c0000339 
## subject: co2c0000342 
## subject: co2c0000354

eegS2n=geteegdata(indir="C:/Users/Carlos/Documents/Descargas Chrome/Estadisitca/Assigment 1/Users/", cond="S2n",filename="eegtrainS2n")

## subject: co2a0000364 
## subject: co2a0000371 
## subject: co2a0000405 
## subject: co2a0000412 
## subject: co2a0000416 
## subject: co2a0000424 
## subject: co2a0000443 
## subject: co2c0000339 
## subject: co2c0000342 
## subject: co2c0000354

library(stringr)  # provides str_replace_all()

df <- rbind(eegs1, eegS2m, eegS2n)

# Formatting data frame
colnames(df) <- c('UserId','UserType','Paradigm','Replication','Channel','Time','Microvolts')
df$Paradigm <- str_replace_all(df$Paradigm, c("S1" = "S1obj", "S2n" = "S2nomatch", "S2m" = "S2match"))
df$UserType <- str_replace_all(df$UserType, c("a" = "Al", "c" = "nonAl"))
# Map channel names to the codes 0..63 by exact lookup. Chained pattern replacement
# rewrites substrings first (e.g. "FC2" -> "F6" -> "24"), which merges distinct channels.
channelNames <- c("AF1","AF2","AF7","AF8","AFZ","C1","C2","C3","C4","C5","C6","CP1","CP2","CP3","CP4","CP5","CP6","CPZ","CZ","F1","F2","F3","F4","F5","F6","F7","F8","FC1","FC2","FC3","FC4","FC5","FC6","FCZ","FP1","FP2","FPZ","FT7","FT8","FZ","nd","O1","O2","OZ","P1","P2","P3","P4","P5","P6","P7","P8","PO1","PO2","PO7","PO8","POZ","PZ","T7","T8","TP7","TP8","X","Y")
df$Channel <- match(as.character(df$Channel), channelNames) - 1L
# Put the columns in the order required by the assignment
df <- df[, c('UserType','UserId','Paradigm','Replication','Channel','Time','Microvolts')]

save(df, file = "myDF.Rda")

Problem 2

How many rows are in your data frame? 14,893,056
- How many of them are from alcoholic users? 10,174,464
- How many of them are from non-alcoholic users? 4,718,592
- Is there any missing or erroneous data? No

Compute the mean, median, range, standard deviation, quartiles and IQR of the Microvolts column

meanVoltage <- mean(df$Microvolts)
medianVoltage <- median(df$Microvolts)
rangeVoltage<-range(df$Microvolts)
sdVoltage<-sd(df$Microvolts)
qVoltage<-quantile(df$Microvolts)
IQRvoltage<-IQR(df$Microvolts)
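The row counts and the missing-data check reported above can be obtained along these lines. This is a sketch on a small hypothetical data frame (`toy`, invented here) that only mimics the `UserType` and `Microvolts` columns of `df`; the same calls would be applied to `df` itself.

```r
# Hypothetical toy data standing in for df (one missing value on purpose).
toy <- data.frame(
  UserType   = c("Al", "Al", "nonAl", "Al", "nonAl"),
  Microvolts = c(-1.2, 3.4, NA, 0.0, 2.5),
  stringsAsFactors = FALSE
)

totalRows   <- nrow(toy)                        # total number of rows
rowsByType  <- table(toy$UserType)              # rows per user type
missingRows <- sum(is.na(toy$Microvolts))       # missing measurements
nonFinite   <- sum(!is.finite(toy$Microvolts))  # NA, NaN or +/-Inf values

totalRows
rowsByType
missingRows
```

Checking `is.finite()` as well as `is.na()` also catches infinite values, which would otherwise distort the mean and the standard deviation.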

Plot a histogram and a box plot of the same column

hist(df$Microvolts,breaks =1000 ,main = "Histogram of Microvolts measures",xlab="Microvolts", xlim = c(-30,30),ylim = c(0,1500000))

# With horizontal = TRUE the values run along the x-axis, so the label goes in xlab
boxplot(df$Microvolts, main = "Box Plot of Microvolts measures", xlab = "Microvolts", ylim = c(-30,30), horizontal = TRUE)

* What information can you obtain from these two plots? Explain it briefly.

Problem 3 (for A and A+)

Compute the mean, median, range, standard deviation, quartiles and IQR of the Microvolts column by type of user: alcoholic and non-alcoholic

#Alcoholic
# Filter on UserType only: a bare `Microvolts` condition would drop rows whose value is exactly 0
microVoltsAl <- subset(df, UserType == 'Al')
meanVoltsAl<-mean(microVoltsAl$Microvolts)
medianVoltsAl <- median(microVoltsAl$Microvolts)
rangeVoltsAl<-range(microVoltsAl$Microvolts)
sdVoltsAl<-sd(microVoltsAl$Microvolts)
qVoltsAl<-quantile(microVoltsAl$Microvolts)
IQRvoltsAl<-IQR(microVoltsAl$Microvolts)

#Non Alcoholic
# Filter on UserType only: a bare `Microvolts` condition would drop rows whose value is exactly 0
microVoltsNal <- subset(df, UserType == 'nonAl')
meanVoltsNal<-mean(microVoltsNal$Microvolts)
medianVoltsNal <- median(microVoltsNal$Microvolts)
rangeVoltsNal<-range(microVoltsNal$Microvolts)
sdVoltsNal<-sd(microVoltsNal$Microvolts)
qVoltsNal<-quantile(microVoltsNal$Microvolts)
IQRvoltNal<-IQR(microVoltsNal$Microvolts)
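The per-group statistics can also be computed in one pass instead of one variable per statistic. The following is a sketch on hypothetical toy data; `summaryStats` and `toy` are names introduced here for illustration and are not part of the original solution.

```r
# Helper (hypothetical) returning all requested statistics for one numeric vector.
summaryStats <- function(x) {
  q <- quantile(x)  # 0%, 25%, 50%, 75%, 100%
  c(mean = mean(x), median = median(x), min = min(x), max = max(x),
    sd = sd(x), q1 = unname(q[2]), q3 = unname(q[4]), iqr = IQR(x))
}

# Toy data standing in for df.
toy <- data.frame(
  UserType   = rep(c("Al", "nonAl"), each = 4),
  Microvolts = c(1, 2, 3, 4, 10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# One column per user type, one row per statistic.
statsByType <- sapply(split(toy$Microvolts, toy$UserType), summaryStats)
statsByType
```

On the real data the call would be `sapply(split(df$Microvolts, df$UserType), summaryStats)`.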

Plot a histogram and a box plot of the same column by type of user: alcoholic and non-alcoholic

#Histograms
hist(microVoltsAl$Microvolts,breaks =1000,main = "Histogram of Microvolts measures (Alcoholic)",xlab="Microvolts",xlim = c(-30,30),ylim = c(0,1500000))

hist(microVoltsNal$Microvolts,breaks =1000,main = "Histogram of Microvolts measures (Non-Alcoholic)",xlab="Microvolts", xlim = c(-30,30),ylim = c(0,1500000))

#Box Plots
# With horizontal = TRUE the values run along the x-axis, so the label goes in xlab
boxplot(microVoltsAl$Microvolts, main = "Box Plot of Microvolts measures (Alcoholic)", xlab = "Microvolts", ylim = c(-30,30), horizontal = TRUE)

boxplot(microVoltsNal$Microvolts, main = "Box Plot of Microvolts measures (Non-Alcoholic)", xlab = "Microvolts", ylim = c(-30,30), horizontal = TRUE)

* What information can you obtain from these plots? Explain it briefly.

Problem 4 (for A+)

Compute the mean, median, range, standard deviation, quartiles and IQR of the Microvolts column by type of user (UserType: alcoholic and non-alcoholic) and by paradigm (Paradigm: S1obj, S2nomatch and S2match)

#Alcoholic S1
microVoltsAlS1 <- subset(df, UserType == 'Al' & Paradigm == 'S1obj')
meanVoltsAlS1 <-mean(microVoltsAlS1$Microvolts)
medianVoltsAlS1<-median(microVoltsAlS1$Microvolts)
rangeVoltsAlS1<-range(microVoltsAlS1$Microvolts)
sdVoltsAlS1<-sd(microVoltsAlS1$Microvolts)
qVoltsAlS1<-quantile(microVoltsAlS1$Microvolts)
IQRvoltsAlS1<-IQR(microVoltsAlS1$Microvolts)

#Alcoholic S2no match
microVoltsAlS2nomatch <- subset(df, UserType == 'Al' & Paradigm == 'S2nomatch')
meanVoltsAlS2nomatch<-mean(microVoltsAlS2nomatch$Microvolts)
medianVoltsAlS2nomatch<-median(microVoltsAlS2nomatch$Microvolts)
rangeVoltsAlS2nomatch<-range(microVoltsAlS2nomatch$Microvolts)
sdVoltsAlS2nomatch<-sd(microVoltsAlS2nomatch$Microvolts)
qVoltsAlS2nomatch<-quantile(microVoltsAlS2nomatch$Microvolts)
IQRvoltsAlS2nomatch<-IQR(microVoltsAlS2nomatch$Microvolts)

#Alcoholic S2 match
microVoltsAlS2match <- subset(df, UserType == 'Al' & Paradigm == 'S2match')
meanVoltsAlS2match<-mean(microVoltsAlS2match$Microvolts)
medianVoltsAlS2match<-median(microVoltsAlS2match$Microvolts)
rangeVoltsAlS2match<-range(microVoltsAlS2match$Microvolts)
sdVoltsAlS2match<-sd(microVoltsAlS2match$Microvolts)
qVoltsAlS2match<-quantile(microVoltsAlS2match$Microvolts)
IQRvoltsAlS2match<-IQR(microVoltsAlS2match$Microvolts)

#Non-Alcoholic S1
microVoltsNalS1 <- subset(df, UserType == 'nonAl' & Paradigm == 'S1obj')
meanVoltsNalS1<-mean(microVoltsNalS1$Microvolts)
medianVoltsNalS1<-median(microVoltsNalS1$Microvolts)
rangeVoltsNalS1<-range(microVoltsNalS1$Microvolts)
sdVoltsNalS1<-sd(microVoltsNalS1$Microvolts)
qVoltsNalS1<-quantile(microVoltsNalS1$Microvolts)
IQRvoltsNalS1<-IQR(microVoltsNalS1$Microvolts)

#Non-Alcoholic S2no match
microVoltsNalS2nomatch <- subset(df, UserType == 'nonAl' & Paradigm == 'S2nomatch')
meanVoltsNalS2nomatch<-mean(microVoltsNalS2nomatch$Microvolts)
medianVoltsNalS2nomatch<-median(microVoltsNalS2nomatch$Microvolts)
rangeVoltsNalS2nomatch<-range(microVoltsNalS2nomatch$Microvolts)
sdVoltsNalS2nomatch<-sd(microVoltsNalS2nomatch$Microvolts)
qVoltsNalS2nomatch<-quantile(microVoltsNalS2nomatch$Microvolts)
IQRvoltsNalS2nomatch<-IQR(microVoltsNalS2nomatch$Microvolts)

#Non-Alcoholic S2 match
microVoltsNalS2match <- subset(df, UserType == 'nonAl' & Paradigm == 'S2match')
meanVoltsNalS2match<-mean(microVoltsNalS2match$Microvolts)
medianVoltsNalS2match<-median(microVoltsNalS2match$Microvolts)
rangeVoltsNalS2match<-range(microVoltsNalS2match$Microvolts)
sdVoltsNalS2match<-sd(microVoltsNalS2match$Microvolts)
qVoltsNalS2match<-quantile(microVoltsNalS2match$Microvolts)
IQRvoltsNalS2match<-IQR(microVoltsNalS2match$Microvolts)

Plot a histogram and a box plot of the same column by type of user (alcoholic and non-alcoholic) and by paradigm (S1obj, S2nomatch and S2match)

#Histograms (alcoholic users; titles match the plotted subset)
hist(microVoltsAlS1$Microvolts, breaks = 1000, main = "Histogram of Microvolts measures S1 (Alcoholic)", xlab = "Microvolts", xlim = c(-30,30), ylim = c(0,1500000))

hist(microVoltsAlS2nomatch$Microvolts, breaks = 1000, main = "Histogram of Microvolts measures S2 no match (Alcoholic)", xlab = "Microvolts", xlim = c(-30,30), ylim = c(0,1500000))

hist(microVoltsAlS2match$Microvolts, breaks = 1000, main = "Histogram of Microvolts measures S2 match (Alcoholic)", xlab = "Microvolts", xlim = c(-30,30), ylim = c(0,1500000))

#Box Plots (non-alcoholic users; with horizontal = TRUE the value label goes in xlab)
boxplot(microVoltsNalS1$Microvolts, main = "Box Plot of Microvolts measures S1 (Non-Alcoholic)", xlab = "Microvolts", ylim = c(-30,30), horizontal = TRUE)

boxplot(microVoltsNalS2nomatch$Microvolts, main = "Box Plot of Microvolts measures S2 no match (Non-Alcoholic)", xlab = "Microvolts", ylim = c(-30,30), horizontal = TRUE)

boxplot(microVoltsNalS2match$Microvolts, main = "Box Plot of Microvolts measures S2 match (Non-Alcoholic)", xlab = "Microvolts", ylim = c(-30,30), horizontal = TRUE)
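The plots above cover only part of the six user-type and paradigm combinations. A possible way to draw a histogram and a box plot for every combination is to split the column by both factors and loop over the groups. This is a sketch on randomly generated toy data (`toy` is hypothetical), not the plots of the real measurements.

```r
set.seed(1)  # toy data only
toy <- data.frame(
  UserType   = rep(c("Al", "nonAl"), each = 300),
  Paradigm   = rep(c("S1obj", "S2nomatch", "S2match"), times = 200),
  Microvolts = rnorm(600, sd = 5),
  stringsAsFactors = FALSE
)

# One numeric vector per UserType x Paradigm combination,
# named e.g. "Al.S1obj", "nonAl.S2match".
groups <- split(toy$Microvolts, list(toy$UserType, toy$Paradigm))

# Six histograms on one page, with a shared x-range for easier comparison.
op <- par(mfrow = c(2, 3))
for (g in names(groups)) {
  hist(groups[[g]], breaks = 30, main = paste("Histogram:", g),
       xlab = "Microvolts", xlim = c(-30, 30))
}
par(op)

# All six box plots side by side in a single figure.
boxplot(groups, horizontal = TRUE, xlab = "Microvolts", las = 1,
        main = "Microvolts by user type and paradigm")
```

Drawing the six box plots in one figure makes the medians and spreads directly comparable, which separate figures with independent axes do not.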

* What information can you obtain from these plots? Explain it briefly.

Problem 5

Let’s check for data within these ranges

## (-100,-10]  (-7.777778,-5.555556) (-5.555556,-3.333333) (-3.333333,-1.111111) (-1.111111,1.111111) (1.111111,3.333333) (3.333333,5.555556) (5.555556,7.777778) (7.777778,10.000000) (10,100]


Compute the probability (relative frequency) of each range. In other words, find the observations (Microvolts) that fall within these ranges and report the results.
- Report the number of observations, the frequency and the cumulative frequency.

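# myRange is not defined in this chunk; the bounds printed below match seq(-10, 10, length.out = 10)
row1<- c(-100,myRange[1])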
row2<- c(myRange[2],myRange[3])
row3<- c(myRange[3],myRange[4])
row4<- c(myRange[4],myRange[5])
row5<- c(myRange[5],myRange[6])
row6<- c(myRange[6],myRange[7])
row7<- c(myRange[7],myRange[8])
row8<- c(myRange[8],myRange[9])
row9<- c(myRange[9],myRange[10])
row10<- c(myRange[10],100)
mymatrix<-rbind(row1,row2,row3,row4,row5,row6,row7,row8,row9,row10)
mymatrix

##              [,1]       [,2]
## row1  -100.000000 -10.000000
## row2    -7.777778  -5.555556
## row3    -5.555556  -3.333333
## row4    -3.333333  -1.111111
## row5    -1.111111   1.111111
## row6     1.111111   3.333333
## row7     3.333333   5.555556
## row8     5.555556   7.777778
## row9     7.777778  10.000000
## row10   10.000000 100.000000

# Count the observations in each range [lower, upper) with a vectorised comparison
# instead of looping over every row, then derive relative and cumulative frequencies.
freq <- sapply(1:10, function(i) sum(df$Microvolts >= mymatrix[i,1] & df$Microvolts < mymatrix[i,2]))
relFreq <- freq / nrow(df)
resulTable <- cbind(freq, relFreq, cumsum(freq), cumsum(relFreq))
rownames(resulTable) <- c("(-100,-10]","(-7.777778,-5.555556)","(-5.555556,-3.333333)","(-3.333333,-1.111111)","(-1.111111,1.111111)","(1.111111,3.333333)","(3.333333,5.555556)","(5.555556,7.777778)","(7.777778,10.000000)","(10,100]")
colnames(resulTable) <- c("Frequency","Rela.Freq","Cum.Freq.","Cum.Rel.Freq")
resulTable

##                       Frequency  Rela.Freq Cum.Freq. Cum.Rel.Freq
## (-100,-10]              1610118 0.10811200   1610118    0.1081120
## (-7.777778,-5.555556)   1068359 0.07173538   2678477    0.1798474
## (-5.555556,-3.333333)   1528443 0.10262790   4206920    0.2824753
## (-3.333333,-1.111111)   2050032 0.13765019   6256952    0.4201255
## (-1.111111,1.111111)    2379477 0.15977090   8636429    0.5798964
## (1.111111,3.333333)     1889516 0.12687228  10525945    0.7067686
## (3.333333,5.555556)     1287186 0.08642860  11813131    0.7931972
## (5.555556,7.777778)      818997 0.05499187  12632128    0.8481891
## (7.777778,10.000000)     518065 0.03478567  13150193    0.8829748
## (10,100]                 996131 0.06688560  14146324    0.9498604
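An alternative way to build such a table is `cut()` followed by `table()`. The sketch below assumes that contiguous bins are acceptable, which differs from the ranges listed above: those leave the interval between -10 and -7.777778 uncovered, which is probably why the last cumulative relative frequency in the table is about 0.95 rather than 1. The vector `x` is random toy data standing in for `df$Microvolts`.

```r
set.seed(42)
x <- rnorm(1000, mean = 0, sd = 6)  # hypothetical stand-in for df$Microvolts

# Contiguous break points: (-100,-10], nine equal bins up to 10, then (10,100].
breaks <- c(-100, seq(-10, 10, length.out = 10), 100)

# Assign each observation to its bin; right = TRUE gives intervals of the form (a, b].
bins <- cut(x, breaks = breaks, right = TRUE)
freq <- as.vector(table(bins))

freqTable <- data.frame(
  Range        = levels(bins),
  Frequency    = freq,
  Rel.Freq     = freq / length(x),
  Cum.Freq     = cumsum(freq),
  Cum.Rel.Freq = cumsum(freq) / length(x)
)
freqTable
```

With contiguous bins that cover every observation, the cumulative relative frequency of the last row equals 1, which is a quick sanity check for the table.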

Problem 6 (for A and A+)

Repeat Problem 5, but this time take into account only the following channels

##  [1]  3 13 24 28 35 45 46 48 57 64

##       Frequency Rela.Freq Cum.Freq. Cum.Rel.Freq
##  [1,]    232704  0.015625    232704     0.015625
##  [2,]    232704  0.015625    465408     0.031250
##  [3,]    465408  0.031250    930816     0.062500
##  [4,]         0  0.000000    930816     0.062500
##  [5,]    232704  0.015625   1163520     0.078125
##  [6,]    232704  0.015625   1396224     0.093750
##  [7,]    232704  0.015625   1628928     0.109375
##  [8,]    232704  0.015625   1861632     0.125000
##  [9,]    232704  0.015625   2094336     0.140625
## [10,]         0  0.000000   2094336     0.140625
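One way to restrict the range table to the selected channels is to filter the rows first and then count per range. This sketch makes two assumptions that the source does not confirm: the listed channel numbers are 1-based (they go up to 64) while the `Channel` column is coded 0 to 63, so 1 is subtracted; and `countInRanges`, `toy` and `bounds` are hypothetical names and data used only for illustration.

```r
# Count observations per range [lower, upper) and derive the four table columns.
countInRanges <- function(x, bounds) {
  freq <- apply(bounds, 1, function(b) sum(x >= b[1] & x < b[2]))
  rel  <- freq / length(x)
  cbind(Frequency = freq, Rela.Freq = rel,
        Cum.Freq. = cumsum(freq), Cum.Rel.Freq = cumsum(rel))
}

# Toy data standing in for df.
toy <- data.frame(
  Channel    = c(2, 2, 12, 5, 63, 63),
  Microvolts = c(-12, 0.5, 0.7, 3, 11, -0.2)
)

myChannels <- c(3, 13, 24, 28, 35, 45, 46, 48, 57, 64)

# Assumption: listed channels are 1-based, Channel codes are 0-based.
selected <- toy[toy$Channel %in% (myChannels - 1), ]

# A reduced set of ranges, just for the example.
bounds <- rbind(c(-100, -10), c(-1.111111, 1.111111), c(10, 100))
res <- countInRanges(selected$Microvolts, bounds)
res
```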

Problem 7 (for A and A+)

Repeat Problem 6, but this time take into account the type of user (UserType: alcoholic and non-alcoholic)

arrayAlnonAl <- c("Al","nonAl")
# Count the rows of each user type with a vectorised comparison instead of a row-by-row loop
freq <- sapply(arrayAlnonAl, function(g) sum(df$UserType == g))
relFreq <- freq / nrow(df)
resulTableAlNonAl <- cbind(freq, relFreq, cumsum(freq), cumsum(relFreq))
colnames(resulTableAlNonAl) <- c("Frequency","Rela.Freq","Cum.Freq.","Cum.Rel.Freq")
rownames(resulTableAlNonAl) <- c("Alcoholic", "Non- Alcoholic")
resulTableAlNonAl

##                Frequency Rela.Freq Cum.Freq. Cum.Rel.Freq
## Alcoholic       10174464 0.6831683  10174464    0.6831683
## Non- Alcoholic   4718592 0.3168317  14893056    1.0000000

Repeat Problem 6, but this time take into account the paradigm (Paradigm: S1obj, S2nomatch and S2match)

arrayStype <- c("S1obj","S2nomatch","S2match")
# Count the rows of each paradigm with a vectorised comparison instead of a row-by-row loop
freq <- sapply(arrayStype, function(p) sum(df$Paradigm == p))
relFreq <- freq / nrow(df)
resulTableStype <- cbind(freq, relFreq, cumsum(freq), cumsum(relFreq))
colnames(resulTableStype) <- c("Frequency","Rela.Freq","Cum.Freq.","Cum.Rel.Freq")
rownames(resulTableStype) <- c("S1 obj", "S2 no match", "S2 match")
resulTableStype

##             Frequency Rela.Freq Cum.Freq. Cum.Rel.Freq
## S1 obj        7405568 0.4972497   7405568    0.4972497
## S2 no match   3768320 0.2530253  11173888    0.7502750
## S2 match      3719168 0.2497250  14893056    1.0000000

Problem 8 (for A+)

Repeat Problem 7, joining the results (taking both factors into account)

Work in Progress
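A possible starting point for this problem is a two-way contingency table of user type against paradigm. The sketch below uses a tiny hypothetical data frame (`toy`) in place of `df`; it only illustrates the technique and does not report results for the real data.

```r
# Toy data standing in for df.
toy <- data.frame(
  UserType = c("Al", "Al", "Al", "nonAl", "nonAl", "Al"),
  Paradigm = c("S1obj", "S2match", "S1obj", "S2nomatch", "S1obj", "S2nomatch"),
  stringsAsFactors = FALSE
)

# Joint absolute frequencies: one cell per UserType x Paradigm combination.
jointFreq <- table(toy$UserType, toy$Paradigm)

# Joint relative frequencies (all cells sum to 1).
jointRel <- prop.table(jointFreq)

# Same counts with row and column totals added.
withMargins <- addmargins(jointFreq)

jointFreq
jointRel
withMargins
```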

Problem 9 (for A+)

Select one of the users you have and, for the same channels as in Problem 7, compute the correlation against Replication

Represent these results with a suitable plot

Work In Progress
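One possible reading of this problem is the correlation between Microvolts and Replication, computed separately for each selected channel of a single user. The sketch below follows that reading on random toy data; `toy` and the three channel codes are hypothetical, and on the real data the rows would first be filtered by `UserId` and by the selected channels.

```r
set.seed(7)  # toy data only

# Hypothetical measurements of ONE user: 20 replications, 3 channels, 10 time points.
toy <- expand.grid(Replication = 0:19, Channel = c(2, 12, 23), Time = 0:9)
toy$Microvolts <- rnorm(nrow(toy))

# Pearson correlation between Microvolts and Replication, per channel.
byChannel <- split(toy, toy$Channel)
corByChannel <- sapply(byChannel, function(d) cor(d$Microvolts, d$Replication))
corByChannel

# A bar plot shows the sign and size of each channel's correlation at a glance.
barplot(corByChannel, xlab = "Channel", ylab = "Correlation with Replication",
        ylim = c(-1, 1))
```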

Problem 10 (for A and A+)

Is there any way to check for different brain activity between alcoholic users and non-alcoholic users?

To make the comparison easier to understand, we can draw a plot that shows the differences between the two user types (alcoholic and non-alcoholic). We can use the Time and Microvolts data: Time shows when each user reacts, and Microvolts shows what kind of reaction the user had. We compute the quotient between the microvolts of each measurement and the time at which it was taken, and plot each quotient as a point. We assign one color to all experiments from alcoholic users and another to those from non-alcoholic users. In that way, we can easily compare the reaction times and the voltage reactions of the two types of users.

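# Colour each point by user type (red = alcoholic, blue = non-alcoholic).
# Rows with Time == 0 give a non-finite quotient and are left out of the plot.
plot(df$Microvolts / df$Time, main = "Brain activity between alcoholic users and non-alcoholic users", xlab = "Time", ylab = "Voltage Range", xlim = c(0,255), ylim = c(-100,100), col = ifelse(df$UserType == "Al", "red", "blue"))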
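A different technique from the quotient plot, shown here only as a sketch, is to average the voltage at each time point for each user type and draw the two mean curves together. The data frame `toy` is random and hypothetical, so the curves carry no meaning about the real subjects.

```r
set.seed(3)  # toy data only
toy <- data.frame(
  UserType   = rep(c("Al", "nonAl"), each = 500),
  Time       = rep(0:9, times = 100),
  Microvolts = rnorm(1000),
  stringsAsFactors = FALSE
)

# Mean voltage per time point (rows) and user type (columns).
meanByTime <- tapply(toy$Microvolts, list(toy$Time, toy$UserType), mean)

# One line per user type, with a legend.
matplot(as.numeric(rownames(meanByTime)), meanByTime, type = "l", lty = 1,
        col = c("red", "blue"), xlab = "Time (sample)", ylab = "Mean microvolts",
        main = "Mean voltage over time by user type")
legend("topright", legend = colnames(meanByTime), col = c("red", "blue"), lty = 1)
```

Averaging over trials and channels reduces the millions of raw points to two curves, which keeps the figure readable.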

Problem 11 (for A++)

Is there any way to check for different brain activity between alcoholic users and non-alcoholic users, and for the different paradigms?

From our point of view, there are two ways to present this information, and both again rely on plots. The first is a single plot with six data series, each drawn in a different color and built in exactly the same way as in Problem 10. The only difference is that this plot has six series: three paradigm types (S1, S2 no match, S2 match) and, for each one, two user types (alcoholic and non-alcoholic). The second is a set of three plots, one per paradigm (S1, S2 no match, S2 match), where each plot has two data series, one for alcoholic users and one for non-alcoholic users. Both series are defined by the time and microvolts coordinates. At the end we can compare the three resulting plots.
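The second option, one plot per paradigm with two series each, could be sketched as follows. The series here are mean voltages per time point, which is an assumption made for this example; `toy` and its `Trial` column are hypothetical random data.

```r
set.seed(5)  # toy data only
toy <- expand.grid(
  Time = 0:9, Trial = 1:20,
  UserType = c("Al", "nonAl"),
  Paradigm = c("S1obj", "S2nomatch", "S2match"),
  stringsAsFactors = FALSE
)
toy$Microvolts <- rnorm(nrow(toy))

# Mean voltage for every Time x UserType x Paradigm combination.
meanCurves <- aggregate(Microvolts ~ Time + UserType + Paradigm, data = toy, FUN = mean)

# Three panels side by side, one per paradigm, two lines per panel.
op <- par(mfrow = c(1, 3))
for (p in unique(meanCurves$Paradigm)) {
  panel <- meanCurves[meanCurves$Paradigm == p, ]
  al    <- panel[panel$UserType == "Al", ]
  nal   <- panel[panel$UserType == "nonAl", ]
  plot(al$Time, al$Microvolts, type = "l", col = "red",
       ylim = range(panel$Microvolts), main = p,
       xlab = "Time (sample)", ylab = "Mean microvolts")
  lines(nal$Time, nal$Microvolts, col = "blue")
}
par(op)
```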


Carlos Martín & Albert Sanz

January 20, 2017