library(dplyr)
library(ggplot2)

HW exercise 1.

Fifty male and fifty female students fill out the same questionnaire in weekly intervals starting five weeks before an important examination to measure state anxiety. The research interests are:

whether there are gender difference in state anxiety
individual differences in state anxiety

Explore the answers to both questions with plots involving confidence intervals or error bars for the means.

Source: Von Eye, A., & Schuster C. (1998). Regression Analysis for Social Sciences. San Diego: Academic Press.

Column 1: Anxiety score 5 weeks before exam for female
Column 2: Anxiety score 4 weeks before exam for female
Column 3: Anxiety score 3 weeks before exam for female
Column 4: Anxiety score 2 weeks before exam for female
Column 5: Anxiety score 1 weeks before exam for female
Column 6: Anxiety score 5 weeks before exam for male
Column 7: Anxiety score 4 weeks before exam for male
Column 8: Anxiety score 3 weeks before exam for male
Column 9: Anxiety score 2 weeks before exam for male
Column 10: Anxiety score 1 weeks before exam for male

Load in the data set and check its structure.

dta1 <- read.table('../data/data_hw0420_1.txt', header = TRUE)
head(dta1)

str(dta1)

'data.frame':   50 obs. of  10 variables:
 $ f1: int  13 26 13 22 18 32 16 18 14 20 ...
 $ f2: int  17 31 17 24 19 31 16 22 17 19 ...
 $ f3: int  18 33 24 26 19 30 21 25 23 23 ...
 $ f4: int  20 38 29 27 22 31 27 29 21 25 ...
 $ f5: int  24 42 32 29 30 32 30 35 25 28 ...
 $ m1: int  6 4 17 19 12 11 14 9 12 11 ...
 $ m2: int  14 11 25 22 21 16 23 18 16 13 ...
 $ m3: int  22 14 26 26 21 20 26 20 23 17 ...
 $ m4: int  20 12 29 30 23 19 29 20 26 14 ...
 $ m5: int  24 23 38 34 24 22 33 24 32 20 ...

Trun the data set into a long form and do some variable transformation.

[Note]: Column 1, `f1, is anxiety score 5 weeks before exam for female. fk is anxiety score k weeks before exam for female. mk is anxiety score k weeks before exam for male. Thus, we should use 6 to minus column index to indicate correct no. of week before the exam.

dta1_long <- dta1 %>% stack() %>% 
  mutate(id = c(rep(1:50, 5), rep(51:100, 5)),
         gender = substr(ind, 1, 1),
         weeks = 6 - as.numeric(substr(ind, 2, 2))) # Note
str(dta1_long)

'data.frame':   500 obs. of  5 variables:
 $ values: int  13 26 13 22 18 32 16 18 14 20 ...
 $ ind   : Factor w/ 10 levels "f1","f2","f3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ id    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ gender: chr  "f" "f" "f" "f" ...
 $ weeks : num  5 5 5 5 5 5 5 5 5 5 ...

Q1: Data visualization for gender difference

Use `qplot` to draw boxplots to quickly get an insight.

qplot(x = factor(weeks), y = values, col = gender, geom = 'boxplot', data = dta1_long,
      xlab = 'No. of week before the exam', ylab = 'Anxiety score')

It seems that there is gender difference in each no. of week before the exam.

Draw the plot with mean dots and error bars as the question asks.

Group the data by gender and weeks and compute the mean score and the standard error for each group (gender x week).
Use ggplot to plot with weeks as x and group_MEAN as y.
Add data points on the plot. Assign differen colors for different genders.
Add error bars on the plot. Assign the width and the size of the bars.
Name the axises.
Change gender label and assign the color for different gender

dta1_long %>% group_by(gender, weeks) %>%                                     #1
  summarise(group_MEAN = mean(values), group_SE = sd(values) / sqrt(n())) %>% #1
  ggplot(mapping = aes(x = weeks, y = group_MEAN)) +                          #2
  geom_point(aes(color = gender), size = 3) +                                 #3
  geom_errorbar(aes(ymax = group_MEAN + group_SE,                             #4
                    ymin = group_MEAN - group_SE), width=.1, size=.2) +       
  xlab('No. of week before the exam') + ylab('Mean anxiety score')    +       #5
  scale_color_manual(labels = c('Female', 'Male'),                            #6
                     values = c('dodgerblue1', 'hotpink1'))

It seems that there is gender difference in each no. of week before the exam. The gender difference gets smaller as the no. of week before exam decreases.

Q2: Data visualization for individual difference

Create some slope-related variables in the data set

slope_id: the slople of linear model with weeks as x and anxiety score as y.
slope_id_std: standardized slope_id
slope_id_special: Whether slope_id_std is far from others (|slope_id_std| > the third quantile) or not.

slope_ids <- split(dta1_long, dta1_long$id) %>% 
  sapply(., function(x) coefficients(lm(x$values ~ x$weeks))[2])
dta1_long <- dta1_long %>% 
  mutate(slope_id = c(rep(slope_ids[1:50], 5),rep(slope_ids[51:100], 5)),
         slope_id_std = robustHD::standardize(slope_id),
         slope_id_special = abs(slope_id_std) > quantile(abs(slope_id_std))[4])
summary(dta1_long)

     values        ind            id            gender         
 Min.   : 0   f1     : 50   Min.   :  1.00   Length:500        
 1st Qu.:17   f2     : 50   1st Qu.: 25.75   Class :character  
 Median :23   f3     : 50   Median : 50.50   Mode  :character  
 Mean   :22   f4     : 50   Mean   : 50.50                     
 3rd Qu.:27   f5     : 50   3rd Qu.: 75.25                     
 Max.   :42   m1     : 50   Max.   :100.00                     
              (Other):200                                      
     weeks      slope_id       slope_id_std      slope_id_special
 Min.   :1   Min.   :-5.800   Min.   :-2.00191   Mode :logical   
 1st Qu.:2   1st Qu.:-4.100   1st Qu.:-0.73911   FALSE:375       
 Median :3   Median :-3.200   Median :-0.07057   TRUE :125       
 Mean   :3   Mean   :-3.105   Mean   : 0.00000                   
 3rd Qu.:4   3rd Qu.:-2.200   3rd Qu.: 0.67225                   
 Max.   :5   Max.   : 0.700   Max.   : 2.82644

Plot data with regression line for each id. Use different colors for major slopes and minor slopes.

g <- ggplot(data = dta1_long, mapping = aes(x = weeks, y = values)) +
  geom_smooth(mapping = aes(group = id, color = slope_id_special), lwd=.3,
              se = FALSE, method = 'lm', alpha = 0.4, formula = y~x) +
  scale_color_manual(labels = c('Majority', 'Minority'),                            
                     values = c('dodgerblue1', 'hotpink1')) +
  xlab('No. of week before the exam') + ylab('Mean anxiety score') +
  theme(legend.position = 'bottom') + labs(fill = 'Slope')
g

The association of weeks and anxiety score is positive in a few participants. In most of participants, the slopes of linear models of weeks as x and anxiety score as y do not differ too much.

Since we have already found that gender effect is significant, we should group gender when plotting individual difference.

Group the data by gender and weeks and compute the mean score and the standard error for each group (gender x week).
Use ggplot to plot with weeks as x and group_MEAN as y.
Add data points on the plot. Assign differen colors for different genders.
Add error bars on the plot. Assign the width and the size of the bars.
Name the axises.
Change gender label and assign the color for different gender

g + facet_grid(. ~ gender)

It seems that there is no obvious individual difference in neither male data. While in female data, there seems to be individual difference.

Test for gender difference and individual difference

model_gender <- aov(values ~ id*gender*weeks + Error(id / gender), data = dta1_long)
summary(model_gender)


Error: id
   Df Sum Sq Mean Sq
id  1   3429    3429

Error: id:gender
       Df Sum Sq Mean Sq
gender  1   1215    1215

Error: Within
                 Df Sum Sq Mean Sq F value   Pr(>F)    
gender            1    206     206   7.853  0.00527 ** 
weeks             1   9641    9641 366.932  < 2e-16 ***
id:weeks          1    478     478  18.182 2.41e-05 ***
gender:weeks      1    184     184   7.000  0.00841 ** 
id:gender:weeks   1     20      20   0.754  0.38579    
Residuals       492  12927      26                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When the time effect (weeks) get controlled. The main effect of gender is still statistically significant ($p<.01$). There is a significant difference in anxiety score between females and males.

As for the interaction term, the interaction of gender and weeks is significant ($p<.01$), indicating that the gender differences differ among different no. of week before exam. The interaction of id and weeks is significant as well ($p<.001$), indicating that the time effect (weeks) differ among different participants.

By ANOVA results and plots about gender difference, we can know that:

Females have much higher anxiety scores than males.
The gender difference gets smaller as the no. of week before exam decreases. That is, as the exam time approaches, the anxiety degrees that males increase are more than girls’.

By ANOVA results and plots about individual difference, we can know that:

There is a slight individual difference in anxiety score.
There is individual differences in the time effect on anxiety score.

HW exercise 3.

The dataset consists of a sample of 14 primary school children between 8 and 12 years old. The children were asked to respond on 8 emotions and coping strategies scales for each of 6 situations: fail to fulfill assingments in class, not allowed to play with other children, forbidden to do something by the teacher, victim of bullying, too much school work, forbidden to do something by the mother.

Plot the data in some meaningful ways. You may have to manipulate data into a different format first.

Column 1: Unpleasant (Annoy)
Column 2: Sad
Column 3: Afraid
Column 4: Angry
Column 5: Approach coping
Column 6: Avoidant coping
Column 7: Social support seeking
Column 8: Emotional reaction, especially agression
Column 9: Situation ID
Column 10: Children ID

Source: Roeder, I., Boekaerts, M., & Kroonenberg, P. M. (2002). The stress and coping questionnaire for children (School version and Asthma version): Construction, factor structure, and psychometric properties. Psychological Reports, 91, 29-36.

[Solution and Answer]

Load in the data set and check the data structure

dta3 <- read.table('../data/data_hw0420_3.txt', header = TRUE)
head(dta3)

str(dta3)

'data.frame':   84 obs. of  10 variables:
 $ annoy    : int  4 4 2 4 4 4 3 3 3 4 ...
 $ sad      : int  2 4 2 3 2 3 2 1 1 4 ...
 $ afraid   : int  2 4 2 4 1 1 2 1 1 2 ...
 $ angry    : int  2 2 2 4 1 4 2 2 2 1 ...
 $ approach : num  1 4 2.67 4 1 2.33 2 1.33 1 1.67 ...
 $ avoid    : num  2 3 3 1.5 2.75 2.5 1 4 1 4 ...
 $ support  : num  1 1.25 1 3.25 1.25 1 1.5 2.75 1.33 3.5 ...
 $ agressive: num  2.5 1.5 2.33 1 1.5 3.67 1 2 1.67 2.5 ...
 $ situation: Factor w/ 6 levels "Bully","Fail",..: 2 4 5 1 6 3 2 4 5 1 ...
 $ sbj      : Factor w/ 14 levels "S135","S137",..: 6 6 6 6 6 6 4 4 4 4 ...

summary(dta3)

     annoy            sad           afraid          angry      
 Min.   :1.000   Min.   :1.00   Min.   :1.000   Min.   :1.000  
 1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.000   1st Qu.:1.000  
 Median :3.000   Median :2.00   Median :1.000   Median :2.000  
 Mean   :2.762   Mean   :1.81   Mean   :1.405   Mean   :2.131  
 3rd Qu.:4.000   3rd Qu.:2.00   3rd Qu.:2.000   3rd Qu.:2.250  
 Max.   :4.000   Max.   :4.00   Max.   :4.000   Max.   :9.000  
                                                               
    approach         avoid          support        agressive    
 Min.   :1.000   Min.   :1.000   Min.   :0.000   Min.   :1.000  
 1st Qu.:1.670   1st Qu.:1.750   1st Qu.:1.330   1st Qu.:1.000  
 Median :2.000   Median :2.500   Median :2.000   Median :1.500  
 Mean   :2.236   Mean   :2.398   Mean   :1.959   Mean   :1.542  
 3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.373   3rd Qu.:1.670  
 Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :4.000  
                                                                
  situation       sbj    
 Bully :14   S135   : 6  
 Fail  :14   S137   : 6  
 MomNo :14   S139   : 6  
 NoPart:14   S17    : 6  
 TeacNo:14   S185   : 6  
 Work  :14   S2     : 6  
             (Other):48

There is something weird in angry, which has the maximum of 9 while all maximums of other scale scores are 4. Let’s check it.

table(dta3$angry)


 1  2  3  4  9 
25 38 11  9  1

Only one data point is 9 in angry. Actually, only this data point have angry score exceeding 4. Thus, I believe that this is a wrong data point. Let’s remove it.

dta3$angry[dta3$angry == 9] <- NA

Lollipop plot 1 - y: scale scores, facet: situations

Data transformation

Standardize the scores for each emotion scale.
Stack the data set.
Group the data set by situation and variable (scales). Compute the mean of standardized scores for 6 situations and 8 scales.

my_std <- function(v) {          # deal with NA
  i <- is.na(v); v[i] <- mean(v, na.rm = TRUE)  
  u <- robustHD::standardize(v); u[i] <- NA
  return(u)
}

dta3_long1 <- dta3 %>% select_if(is.numeric) %>% 
  apply(., 2, my_std) %>% as.data.frame() %>% 
  mutate(situation = dta3$situation) %>% reshape2::melt()

dta3_long1_MEAN <- dta3_long1 %>% group_by(situation, variable) %>%
  summarise(value = mean(value, na.rm = TRUE))

See the basic info of the transformed data set.

summary(dta3_long1_MEAN)

  situation     variable      value           
 Bully :8   annoy   : 6   Min.   :-0.8214336  
 Fail  :8   sad     : 6   1st Qu.:-0.2096925  
 MomNo :8   afraid  : 6   Median :-0.0235430  
 NoPart:8   angry   : 6   Mean   :-0.0002002  
 TeacNo:8   approach: 6   3rd Qu.: 0.2692385  
 Work  :8   avoid   : 6   Max.   : 0.6997865  
            (Other) :12

Randomly pick one row for each situation to check the transformed data set.

dta3_long1_MEAN %>% with(., split(., situation)) %>%
  lapply(., function(df) df[sample(1:6, 1),]) %>% bind_rows()

Looks great.

Plot

qplot(x = value, y = situation, data = dta3_long1_MEAN) +
  facet_wrap(. ~ variable, nrow = 2) +
  scale_x_continuous(limits = c(-1, 1)) + xlab('Standardized score') +
  geom_segment(aes(xend = 0, yend = situation)) +
  geom_vline(xintercept = 0, color = 'grey55')

This plot can help us to see what kind of situation tend to trigger the given emotion more or less than other situations. Take two examples:

Children have much higher scores of “sad” in situation of being a victim of bullying than any other situations.
Children have much lower scores of “avoid” in the situation of failing to fulfill assingments in class than any other situations.

Box plot 1

Box plot 1-1: Boxt plot of standardized emotion scores

qplot(x = situation, y = value, geom = 'boxplot', data = dta3_long1_MEAN) +
  xlab('Emotion') + ylab('Standardized score') +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45))

This plot can tell us there is a larger/smaller variation among 8 emotions in what kind of situation.

Box plot 1-2: Box plot of standardized emotion scores with faceting situations

qplot(x = situation, y = value, col = variable, geom = 'boxplot', data = dta3_long1) +
  xlab('Situation') + ylab('Standardized score') +
  facet_wrap(. ~ variable, nrow = 2) +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45))

This plot can tell us what kind of situation has a larger/smaller variation in one given emotion. For example, compared to other situation, situation of being a bully victim has larger variation in agressive emotion. That is, when being a bully victim, the difference of tne degree of agressive emotion among chidren is relation bigger than that of other emotions.

Finding

From these 2 plots, we can find that, for example, the situation “bully” has relative smaller variation among 8 emotions than other situations while it has relative larger variation among 14 children in angry emotion (the median even equals 0). That is, when being a bully victim, the difference among 8 emoitons is not very big while childern differ in expressing angry emotion.

Lollipop plot 2 - y: situations, facet: scale scores

In this plot, we exchange y and facet variable to see different comparsions.

Data transformation

Stack the data set.
Standardize the scores for each situation.
Group the data set by variable (scales) and situation. Compute the mean of standardized scores for 6 situations and 8 scales.

dta3_long2 <- dta3 %>% dplyr::select(-sbj) %>% reshape2::melt() %>%       #1
  with(., split(., situation)) %>%                                        #2
  lapply(., function(df) mutate_at(df, 'value', my_std)) %>%  bind_rows() #2

dta3_long2_MEAN <- dta3_long2 %>% 
  group_by(variable, situation) %>%            #3
  summarise(value = mean(value, na.rm = TRUE)) #3

See the basic info of the transformed data set.

summary(dta3_long2_MEAN)

     variable   situation     value           
 annoy   : 6   Bully :8   Min.   :-1.1294166  
 sad     : 6   Fail  :8   1st Qu.:-0.3436152  
 afraid  : 6   MomNo :8   Median :-0.1051407  
 angry   : 6   NoPart:8   Mean   : 0.0002451  
 approach: 6   TeacNo:8   3rd Qu.: 0.3728867  
 avoid   : 6   Work  :8   Max.   : 0.9547457  
 (Other) :12

Randomly pick one row for each emotion to check the transformed data set.

dta3_long2_MEAN %>% with(., split(., variable)) %>%
  lapply(., function(df) df[sample(1:6, 1),]) %>% bind_rows()

Looks great.

Plot

qplot(x = value, y = variable, data = dta3_long2_MEAN) +  #1
  facet_wrap(. ~ situation, nrow = 2) +
  scale_x_continuous(limits = c(-1.2, 1.2)) + xlab('Standardized score') +
  geom_segment(aes(xend = 0, yend = variable)) +     #2
  geom_vline(xintercept = 0, color = 'grey55')     #3

This plot can help us to see what kind of emotion tend to be triggered more or less than other emotion in the given situation. Take two examples:

Children have much higher scores of “annoy” than other emotions in the situation of being a victim of bullying.
Children have much lower scores of “afraid” than other emotions in the situation of being forbidden to do something by the mother.

Box plot 2

Box plot 2-1: Boxt plot of standardized emotion scores

qplot(x = variable, y = value, geom = 'boxplot', data = dta3_long2_MEAN) +
  xlab('Emotion') + ylab('Standardized score') +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45))

This plot can tell us what kind of emotion has a larger/smaller variation among six stiuations.

Box plot 2-2: Box plot of standardized emotion scores with faceting situations

qplot(x = variable, y = value, col = situation, geom = 'boxplot', data = dta3_long2) +
  xlab('Emotion') + ylab('Standardized score') +
  facet_wrap(. ~ situation, nrow = 2) +
  theme(legend.position = 'none', axis.text.x = element_text(angle = 45))

This plot can tell us what kind of emotion has a larger/smaller variation in one given stiuation.

Finding

From these plots, we can find that, for example, the emotion “annoy” has relative smaller variation among 6 situations than other emotions while it has relative larger variation among 14 children when being the same situation.

Correlation plot for scales

dta3 %>% select_if(is.numeric) %>% na.omit() %>% cor() %>%
  ggcorrplot::ggcorrplot(., type = 'lower', lab=TRUE, hc.order = TRUE, hc.method = 'centroid',
                         colors = c('steelblue3', 'white', 'deeppink3'),
                         sig.level=.2, insig = 'pch', pch.col = 'grey40')

When being in the same situation, there is a medium positive correlation ($.2 < r <. 8$) between these emotions:

annoy and sad
annoy and angry
avoid and sad
afraid and sad
afraid and approach
angry and aggressive

HW exercise 4 (Need help).

Revise the plot in the contrast detection example so that the plot is faceted by subject × spatial frequency condition.

The data sets are from a psychophysical experiment where five participants detected sine wave gratings at different contrasts and spatial frequencies. At each trial, a binary response (grating left or right) which is either correct or incorrect, is to be scored. Each row in the data frame represents observation from a single trial. The basic plot puts contrast on the x-axis and the proportion of correct responses on the y-axis. The contrast values in the experiment were sampled logarithmically. The aim of the experiment was to see whether and how human visual sensitivity to contrast changes depending on the spatial scale of the information (loosely, whether the pattern is coarse or fine). The researchers also want to compare how the performance shifts as a function of contrast within each subject. But the plot below shows the averaged responses from all subjects, ignoring spatial frequencies.

Column 1: Subject ID
Column 2: Contrast
Column 3: Spatial frequency
Column 4: Target side
Column 5: Response side
Column 6: Unique trial ID (not used)

Target output 2

Codes for original plot

Load in the package

# assume that the subject data files are downloaded into 
# the data folder in the current working directory
library(pacman)

# tidyr stuff and psyphy package for model fitting
p_load(tidyverse, psyphy)

Load in the data sets together and check its structure

fls <- list.files(path = '../data', pattern = 'data_hw0420_4_')
fL <- paste0('../data/', fls)
dta4 <- lapply(fL, function(fl) read.table(fl, sep = ',', header = TRUE)) %>%
  bind_rows()
head(dta4)

str(dta4)

'data.frame':   7000 obs. of  6 variables:
 $ subject    : chr  "S1" "S1" "S1" "S1" ...
 $ contrast   : num  0.0695 0.0131 0.0695 0.0695 0.3679 ...
 $ sf         : num  0.5 40 4.47 40 13.37 ...
 $ target_side: Factor w/ 2 levels "left","right": 2 2 1 1 1 1 2 2 1 2 ...
 $ response   : Factor w/ 2 levels "left","right": 2 1 1 2 1 2 2 1 2 2 ...
 $ unique_id  : chr  "544ee9ff-2569-4f38-b04e-7e4d0a0be4d2" "b27fe910-e3ba-48fb-b168-5afb1f115d8f" "72c9d6ce-0a90-4d4b-a199-03435c15291b" "48b5bbb2-e6ee-4848-b77e-839ed5320c01" ...

Data transformation and visualization

Score the correct responses.

dta4$correct <- ifelse(dta4$target_side == dta4$response, 1, 0)
summary(dta4)

   subject             contrast              sf         target_side 
 Length:7000        Min.   :0.002479   Min.   : 0.500   left :3500  
 Class :character   1st Qu.:0.005704   1st Qu.: 1.495   right:3500  
 Mode  :character   Median :0.030197   Median : 4.472               
                    Mean   :0.092678   Mean   :11.968               
                    3rd Qu.:0.159880   3rd Qu.:13.375               
                    Max.   :0.367879   Max.   :40.000               
  response     unique_id            correct      
 left :3488   Length:7000        Min.   :0.0000  
 right:3512   Class :character   1st Qu.:1.0000  
              Mode  :character   Median :1.0000  
                                 Mean   :0.7697  
                                 3rd Qu.:1.0000  
                                 Max.   :1.0000

Bootstrapped CIs for proportion of correct responses.

p <- ggplot(dta4, aes(contrast, correct)) +
 stat_summary(fun.data = 'mean_cl_boot', color = "dodgerblue") +
 scale_x_log10() +
 scale_y_continuous(limits = c(0, 1)) +
 labs(x = "Contrast in log unit", y = "Proportion of correct responses")

Links for Binomial Family for m-alternative Forced-choice. Get document info.

?mafc

Fit a detection model in which the chance of correct response is $.5$.

m0 <- glm(correct ~ log10(contrast), data = dta4, 
          family = binomial(mafc.probit(2)))

Generate predicted responses.

xval <- with(dta4, rep(seq(min(contrast), max(contrast), len = 100), 10))
yval <- predict(m0, data.frame(contrast = xval), type = "response")
m0_pred <- data.frame(xval, yval)

Augument to the observed plot.

p <- p + 
 geom_line(data = m0_pred, aes(x = xval, y = yval)) 
print(p)

The example codes seem to have something wrong: they tried to bootstrapped CIs for correct, a binary variable, instead of the proportion of correct responses. Thus, I tried figure out the solution to fix it in the following faceted plots.

Revise the plot to make it faceted by subject × spatial frequency condition.

Check the distribution of `sf` to group

table(dta4$subject)


  S1   S2   S3   S4   S5 
1400 1400 1400 1400 1400

summary(dta4$sf)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.500   1.495   4.472  11.968  13.375  40.000

Data transformation

Group sf by the quantiles. Name the variable with group labels sf_group.
Split the data set twice: Split by subject first and then split by sf_group.
Do such tasks for each data frame in the double list:

Fit a detection model in which the chance of correct response is $.5$.
Generate predicted responses.
Return a data frame with predicted values.

Bind rows of each data frame in the double list to turn the data set into a data frame.

dta4_val <- dta4 %>% 
  mutate(sf_group = cut(sf, breaks = quantile(sf), include.lowest = TRUE,
                        labels = paste0(c('Low', 'Mid-low', 'Mid-High', 'High'), ' freq'))) %>%
  with(., split(., subject)) %>% 
  lapply(., function(df) split(df ,df$sf_group)) %>%
  lapply(., function(l) {
    lapply(l, function(df) {
      m0 <- glm(correct ~ log10(contrast), data = df,family = binomial(mafc.probit(2)))
      xval <- with(df, seq(min(contrast), max(contrast), len = 1000))
      yval <- predict(m0, data.frame(contrast = xval), type = 'response')
      y <- predict(m0, data.frame(contrast = df$contrast), type = 'response')
      return(full_join(data.frame(contrast = df$contrast, y, idx = 1:length(y)),
                       data.frame(xval, yval, idx = 1:length(yval))))
    })
  }) %>% 
  lapply(., function(l) bind_rows(l, .id = 'sf_group')) %>% bind_rows(., .id = 'subject')
summary(dta4_val)

   subject            sf_group            contrast           y        
 Length:20000       Length:20000       Min.   :0.002   Min.   :0.500  
 Class :character   Class :character   1st Qu.:0.006   1st Qu.:0.502  
 Mode  :character   Mode  :character   Median :0.030   Median :0.858  
                                       Mean   :0.093   Mean   :0.769  
                                       3rd Qu.:0.160   3rd Qu.:1.000  
                                       Max.   :0.368   Max.   :1.000  
                                       NA's   :13000   NA's   :13000  
      idx              xval               yval       
 Min.   :   1.0   Min.   :0.002479   Min.   :0.5000  
 1st Qu.: 250.8   1st Qu.:0.093829   1st Qu.:0.9911  
 Median : 500.5   Median :0.185179   Median :0.9999  
 Mean   : 500.5   Mean   :0.185179   Mean   :0.9473  
 3rd Qu.: 750.2   3rd Qu.:0.276529   3rd Qu.:1.0000  
 Max.   :1000.0   Max.   :0.367879   Max.   :1.0000

Plot

Dot: The real data in the data set (contrast) and its predicted values (y) based on the model (m0), that is, the proportion of correct responses.
Curve: The synthetic data (xval) and its predicted values (yval) based on the model (m0).

dta4_val %>% na.omit() %>% ggplot(aes(x = contrast, y = y)) +
  facet_wrap(subject ~ sf_group) + 
  scale_x_log10() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Contrast in log unit", y = "Proportion of correct responses") +
  stat_summary(fun.data = mean_cl_boot, color = "dodgerblue", size = .25) +
  geom_line(data = dta4_val, aes(x = xval, y = yval)) +
  theme_bw()

No error bar appears beacause every predicted value in the given contrast is the same.

Try to make error bars (make variation in predicted values): Generate several times of synthetic data

Change: Generate synthetic data with fewer elements but for several times.

dta4_val2 <- dta4 %>% 
  mutate(sf_group = cut(sf, breaks = quantile(sf), include.lowest = TRUE,
                        labels = paste0(c('Low', 'Mid-low', 'Mid-High', 'High'), ' freq'))) %>%
  with(., split(., subject)) %>% 
  lapply(., function(df) split(df ,df$sf_group)) %>%
  lapply(., function(l) {
    lapply(l, function(df) {
      m0 <- glm(correct ~ log10(contrast), data = df,family = binomial(mafc.probit(2)))
      y <- predict(m0, data.frame(contrast = df$contrast), type = 'response')
      df_xval <- df_yval <- data.frame(rep(NA, 7))
      for (k in 1:30) {     # Generate synthetic data for 30 times
        xval_new <- with(df, seq(min(contrast), max(contrast), len = 7))
        df_xval[paste0('xval_', k)] <- xval_new
        df_yval[paste0('yval_', k)] <- 
          predict(m0, data.frame(contrast = xval_new), type = 'response')
      }
      df_val <- cbind(stack(df_xval[,-1]), stack(df_yval[,-1]))[,c(1, 3)]
      colnames(df_val) <- c('xval', 'yval')
      return(full_join(data.frame(contrast = df$contrast, y, idx = 1:length(y)),
                       df_val %>% mutate(idx = 1:210)))
    })
  }) %>% 
  lapply(., function(l) bind_rows(l, .id = 'sf_group')) %>% bind_rows(., .id = 'subject')
summary(dta4_val2)

   subject            sf_group            contrast              y         
 Length:7000        Length:7000        Min.   :0.002479   Min.   :0.5000  
 Class :character   Class :character   1st Qu.:0.005704   1st Qu.:0.5018  
 Mode  :character   Mode  :character   Median :0.030197   Median :0.8577  
                                       Mean   :0.092678   Mean   :0.7693  
                                       3rd Qu.:0.159880   3rd Qu.:0.9998  
                                       Max.   :0.367879   Max.   :1.0000  
                                                                          
      idx             xval             yval       
 Min.   :  1.0   Min.   :0.0025   Min.   :0.5000  
 1st Qu.: 88.0   1st Qu.:0.0634   1st Qu.:0.9838  
 Median :175.5   Median :0.1852   Median :0.9999  
 Mean   :196.5   Mean   :0.1852   Mean   :0.9094  
 3rd Qu.:263.0   3rd Qu.:0.3070   3rd Qu.:1.0000  
 Max.   :560.0   Max.   :0.3679   Max.   :1.0000  
                 NA's   :2800     NA's   :2800

Plot again

Dot and bar: The synthetic data (xval) and its predicted values (yval) based on the model (m0) with boosted error bar among data coming from 30 generations.
Curve: The real data in the data set (contrast) and its predicted values (y) based on the model (m0), that is, the proportion of correct responses.

dta4_val %>% na.omit() %>% ggplot(aes(x = xval, y = yval)) +
  stat_summary(fun.data = mean_cl_boot, color = "dodgerblue", size = .25) +
  facet_wrap(subject ~ sf_group) + 
  scale_x_log10() +
  scale_y_continuous(limits = c(0, 1)) +
  labs(x = "Contrast in log unit", y = "Proportion of correct responses") +
  geom_line(data = dta4_val, aes(x = contrast, y = y)) +
  theme_bw()

However, there is still no error bar because generating synthetic data with fewer elements but for several times still makes data point unique from each other. We need to figure out the approach that can make the variation exists in the given contrast.

HW exercise 5 (Need help).

Use the Cushings{MASS} data set to generate a plot similar to the following one:

Target output 2

Load in the data set and check the data structure

data(Cushings, package = 'MASS')
head(Cushings)

str(Cushings)

'data.frame':   27 obs. of  3 variables:
 $ Tetrahydrocortisone: num  3.1 3 1.9 3.8 4.1 1.9 8.3 3.8 3.9 7.8 ...
 $ Pregnanetriol      : num  11.7 1.3 0.1 0.04 1.1 0.4 1 0.2 0.6 1.2 ...
 $ Type               : Factor w/ 4 levels "a","b","c","u": 1 1 1 1 1 1 2 2 2 2 ...

[Note] By using ?Cushings, we can know the meaning of each level in Cushings$Type:

a: Adenoma
b: Bilateral Hyperplasia
c: Carcinoma
u: Unknown

Data transformation

Create some variabels (positions and texts) for adding group labels.

dta5 <- Cushings %>%
  mutate(x0 = c(rep(3, table(Cushings$Type)['a']),  rep(10, table(Cushings$Type)['b']), 
                rep(15, table(Cushings$Type)['c']), rep(24, table(Cushings$Type)['u'])),
         y0 = c(rep(11, table(Cushings$Type)['a']), rep(2, table(Cushings$Type)['b']),
                rep(7, table(Cushings$Type)['c'] ), rep(.95, table(Cushings$Type)['u'] )),
         Type_lab = c('Adenoma',   rep('', table(Cushings$Type)['a'] - 1),
                      'Bilateral Hyperplasia', rep('', table(Cushings$Type)['b'] - 1),
                      'Carcinoma', rep('', table(Cushings$Type)['c'] - 1),
                      'Unknown',   rep('', table(Cushings$Type)['u'] - 1)))
dta5 %>% dplyr::select(x0, y0, Type_lab) %>% str()

'data.frame':   27 obs. of  3 variables:
 $ x0      : num  3 3 3 3 3 3 10 10 10 10 ...
 $ y0      : num  11 11 11 11 11 11 2 2 2 2 ...
 $ Type_lab: chr  "Adenoma" "" "" "" ...

Plot

g5 <- ggplot(mapping = aes(x = Tetrahydrocortisone, y = Pregnanetriol, label = Type_lab),
             data = dta5) +
  # Point
  geom_point(color = 'black', size = 2, shape = 16) +    # point frame
  geom_point(aes(color = Type), size = 2, shape = 20) +  # point interior
  scale_color_manual(values = c('hotpink3', 'chartreuse4', 'dodgerblue1', 'orange1')) +
  
  # Axises
  xlab('Tetrahydrocortisone (mg/24 hours)') +
  ylab('Pregnanetriol (mg/24 hours)') + # there is a clerical error here in the original plot
  scale_x_continuous(breaks = seq(0, 60, 10), limits = c(0, 60)) +  # rescale x-axis
  scale_y_continuous(breaks = seq(0, 12, 2)) +  # rescale y-axis
  
  # Title
  ggtitle('Cushing\'s syndrome', ) +
  
  # Layout
  theme(legend.position = 'none',             # remove legend
        panel.background = element_blank(),   # remove background color
        panel.grid.major.x = element_blank(), # remove vertical grid lines
        panel.grid.major.y = element_line(color = 'grey80', size = .3),
        axis.ticks = element_blank(),         # remove ticks of two axises
        axis.text = element_text(size = 10),                  # -
        axis.text.x = element_text(hjust = 1),                # text font and
        axis.text.y = element_text(angle = 90),               # label position
        axis.title = element_text(size = 11.5),               # setting
        axis.title.x = element_text(margin = margin(t = 15)), #
        axis.title.y = element_text(margin = margin(r = 15)), # -
        plot.title = element_text(size = 11, face = 'bold', hjust = 1, margin = margin(b = 25)))

g5 + geom_text(aes(x0, y0, group = Type, color = Type), size = 3)  # Group labels

DataM: Homework Exercise 0420 - Grammar of graphics

DataM: Homework Exercise 0420 - Grammar of graphics

HW exercise 1.

Load in the data set and check its structure.

Trun the data set into a long form and do some variable transformation.

Q1: Data visualization for gender difference

Use qplot to draw boxplots to quickly get an insight.

Draw the plot with mean dots and error bars as the question asks.

Q2: Data visualization for individual difference

Create some slope-related variables in the data set

Plot data with regression line for each id. Use different colors for major slopes and minor slopes.

Since we have already found that gender effect is significant, we should group gender when plotting individual difference.

Test for gender difference and individual difference

HW exercise 3.

Load in the data set and check the data structure

Lollipop plot 1 - y: scale scores, facet: situations

Data transformation

Plot

Box plot 1

Box plot 1-1: Boxt plot of standardized emotion scores

Box plot 1-2: Box plot of standardized emotion scores with faceting situations

Finding

Lollipop plot 2 - y: situations, facet: scale scores

Data transformation

Plot

Box plot 2

Box plot 2-1: Boxt plot of standardized emotion scores

Box plot 2-2: Box plot of standardized emotion scores with faceting situations

Finding

Correlation plot for scales

HW exercise 4 (Need help).

Codes for original plot

Load in the package

Load in the data sets together and check its structure

Data transformation and visualization

Revise the plot to make it faceted by subject × spatial frequency condition.

Check the distribution of sf to group

Data transformation

Plot

Try to make error bars (make variation in predicted values): Generate several times of synthetic data

Plot again

HW exercise 5 (Need help).

Load in the data set and check the data structure

Data transformation

Plot

Use `qplot` to draw boxplots to quickly get an insight.

Check the distribution of `sf` to group