DataM: Homework Exercise 0420 - Grammar of graphics
HW exercise 1.
Fifty male and fifty female students fill out the same questionnaire in weekly intervals starting five weeks before an important examination to measure state anxiety. The research interests are:
- whether there are gender difference in state anxiety
- individual differences in state anxiety
Explore the answers to both questions with plots involving confidence intervals or error bars for the means.
Source: Von Eye, A., & Schuster C. (1998). Regression Analysis for Social Sciences. San Diego: Academic Press.
- Column 1: Anxiety score 5 weeks before exam for female
- Column 2: Anxiety score 4 weeks before exam for female
- Column 3: Anxiety score 3 weeks before exam for female
- Column 4: Anxiety score 2 weeks before exam for female
- Column 5: Anxiety score 1 weeks before exam for female
- Column 6: Anxiety score 5 weeks before exam for male
- Column 7: Anxiety score 4 weeks before exam for male
- Column 8: Anxiety score 3 weeks before exam for male
- Column 9: Anxiety score 2 weeks before exam for male
- Column 10: Anxiety score 1 weeks before exam for male
Load in the data set and check its structure.
'data.frame': 50 obs. of 10 variables:
$ f1: int 13 26 13 22 18 32 16 18 14 20 ...
$ f2: int 17 31 17 24 19 31 16 22 17 19 ...
$ f3: int 18 33 24 26 19 30 21 25 23 23 ...
$ f4: int 20 38 29 27 22 31 27 29 21 25 ...
$ f5: int 24 42 32 29 30 32 30 35 25 28 ...
$ m1: int 6 4 17 19 12 11 14 9 12 11 ...
$ m2: int 14 11 25 22 21 16 23 18 16 13 ...
$ m3: int 22 14 26 26 21 20 26 20 23 17 ...
$ m4: int 20 12 29 30 23 19 29 20 26 14 ...
$ m5: int 24 23 38 34 24 22 33 24 32 20 ...
Trun the data set into a long form and do some variable transformation.
[Note]: Column 1, `f1, is anxiety score 5 weeks before exam for female. fk is anxiety score k weeks before exam for female. mk is anxiety score k weeks before exam for male. Thus, we should use 6 to minus column index to indicate correct no. of week before the exam.
dta1_long <- dta1 %>% stack() %>%
mutate(id = c(rep(1:50, 5), rep(51:100, 5)),
gender = substr(ind, 1, 1),
weeks = 6 - as.numeric(substr(ind, 2, 2))) # Note
str(dta1_long)'data.frame': 500 obs. of 5 variables:
$ values: int 13 26 13 22 18 32 16 18 14 20 ...
$ ind : Factor w/ 10 levels "f1","f2","f3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ id : int 1 2 3 4 5 6 7 8 9 10 ...
$ gender: chr "f" "f" "f" "f" ...
$ weeks : num 5 5 5 5 5 5 5 5 5 5 ...
Q1: Data visualization for gender difference
Use qplot to draw boxplots to quickly get an insight.
qplot(x = factor(weeks), y = values, col = gender, geom = 'boxplot', data = dta1_long,
xlab = 'No. of week before the exam', ylab = 'Anxiety score')It seems that there is gender difference in each no. of week before the exam.
Draw the plot with mean dots and error bars as the question asks.
- Group the data by
genderandweeksand compute the mean score and the standard error for each group (gender x week). - Use
ggplotto plot withweeksas x andgroup_MEANas y. - Add data points on the plot. Assign differen colors for different genders.
- Add error bars on the plot. Assign the width and the size of the bars.
- Name the axises.
- Change gender label and assign the color for different gender
dta1_long %>% group_by(gender, weeks) %>% #1
summarise(group_MEAN = mean(values), group_SE = sd(values) / sqrt(n())) %>% #1
ggplot(mapping = aes(x = weeks, y = group_MEAN)) + #2
geom_point(aes(color = gender), size = 3) + #3
geom_errorbar(aes(ymax = group_MEAN + group_SE, #4
ymin = group_MEAN - group_SE), width=.1, size=.2) +
xlab('No. of week before the exam') + ylab('Mean anxiety score') + #5
scale_color_manual(labels = c('Female', 'Male'), #6
values = c('dodgerblue1', 'hotpink1'))It seems that there is gender difference in each no. of week before the exam. The gender difference gets smaller as the no. of week before exam decreases.
Q2: Data visualization for individual difference
Plot data with regression line for each id. Use different colors for major slopes and minor slopes.
g <- ggplot(data = dta1_long, mapping = aes(x = weeks, y = values)) +
geom_smooth(mapping = aes(group = id, color = slope_id_special), lwd=.3,
se = FALSE, method = 'lm', alpha = 0.4, formula = y~x) +
scale_color_manual(labels = c('Majority', 'Minority'),
values = c('dodgerblue1', 'hotpink1')) +
xlab('No. of week before the exam') + ylab('Mean anxiety score') +
theme(legend.position = 'bottom') + labs(fill = 'Slope')
gThe association of weeks and anxiety score is positive in a few participants. In most of participants, the slopes of linear models of weeks as x and anxiety score as y do not differ too much.
Since we have already found that gender effect is significant, we should group gender when plotting individual difference.
- Group the data by
genderandweeksand compute the mean score and the standard error for each group (gender x week). - Use
ggplotto plot withweeksas x andgroup_MEANas y. - Add data points on the plot. Assign differen colors for different genders.
- Add error bars on the plot. Assign the width and the size of the bars.
- Name the axises.
- Change gender label and assign the color for different gender
It seems that there is no obvious individual difference in neither male data. While in female data, there seems to be individual difference.
Test for gender difference and individual difference
model_gender <- aov(values ~ id*gender*weeks + Error(id / gender), data = dta1_long)
summary(model_gender)
Error: id
Df Sum Sq Mean Sq
id 1 3429 3429
Error: id:gender
Df Sum Sq Mean Sq
gender 1 1215 1215
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
gender 1 206 206 7.853 0.00527 **
weeks 1 9641 9641 366.932 < 2e-16 ***
id:weeks 1 478 478 18.182 2.41e-05 ***
gender:weeks 1 184 184 7.000 0.00841 **
id:gender:weeks 1 20 20 0.754 0.38579
Residuals 492 12927 26
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
When the time effect (weeks) get controlled. The main effect of gender is still statistically significant (\(p<.01\)). There is a significant difference in anxiety score between females and males.
As for the interaction term, the interaction of gender and weeks is significant (\(p<.01\)), indicating that the gender differences differ among different no. of week before exam. The interaction of id and weeks is significant as well (\(p<.001\)), indicating that the time effect (weeks) differ among different participants.
By ANOVA results and plots about gender difference, we can know that:
- Females have much higher anxiety scores than males.
- The gender difference gets smaller as the no. of week before exam decreases. That is, as the exam time approaches, the anxiety degrees that males increase are more than girls’.
By ANOVA results and plots about individual difference, we can know that:
- There is a slight individual difference in anxiety score.
- There is individual differences in the time effect on anxiety score.
HW exercise 3.
The dataset consists of a sample of 14 primary school children between 8 and 12 years old. The children were asked to respond on 8 emotions and coping strategies scales for each of 6 situations: fail to fulfill assingments in class, not allowed to play with other children, forbidden to do something by the teacher, victim of bullying, too much school work, forbidden to do something by the mother.
Plot the data in some meaningful ways. You may have to manipulate data into a different format first.
- Column 1: Unpleasant (Annoy)
- Column 2: Sad
- Column 3: Afraid
- Column 4: Angry
- Column 5: Approach coping
- Column 6: Avoidant coping
- Column 7: Social support seeking
- Column 8: Emotional reaction, especially agression
- Column 9: Situation ID
- Column 10: Children ID
Source: Roeder, I., Boekaerts, M., & Kroonenberg, P. M. (2002). The stress and coping questionnaire for children (School version and Asthma version): Construction, factor structure, and psychometric properties. Psychological Reports, 91, 29-36.
[Solution and Answer]
Load in the data set and check the data structure
'data.frame': 84 obs. of 10 variables:
$ annoy : int 4 4 2 4 4 4 3 3 3 4 ...
$ sad : int 2 4 2 3 2 3 2 1 1 4 ...
$ afraid : int 2 4 2 4 1 1 2 1 1 2 ...
$ angry : int 2 2 2 4 1 4 2 2 2 1 ...
$ approach : num 1 4 2.67 4 1 2.33 2 1.33 1 1.67 ...
$ avoid : num 2 3 3 1.5 2.75 2.5 1 4 1 4 ...
$ support : num 1 1.25 1 3.25 1.25 1 1.5 2.75 1.33 3.5 ...
$ agressive: num 2.5 1.5 2.33 1 1.5 3.67 1 2 1.67 2.5 ...
$ situation: Factor w/ 6 levels "Bully","Fail",..: 2 4 5 1 6 3 2 4 5 1 ...
$ sbj : Factor w/ 14 levels "S135","S137",..: 6 6 6 6 6 6 4 4 4 4 ...
annoy sad afraid angry
Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000
1st Qu.:2.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000
Median :3.000 Median :2.00 Median :1.000 Median :2.000
Mean :2.762 Mean :1.81 Mean :1.405 Mean :2.131
3rd Qu.:4.000 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:2.250
Max. :4.000 Max. :4.00 Max. :4.000 Max. :9.000
approach avoid support agressive
Min. :1.000 Min. :1.000 Min. :0.000 Min. :1.000
1st Qu.:1.670 1st Qu.:1.750 1st Qu.:1.330 1st Qu.:1.000
Median :2.000 Median :2.500 Median :2.000 Median :1.500
Mean :2.236 Mean :2.398 Mean :1.959 Mean :1.542
3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.373 3rd Qu.:1.670
Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
situation sbj
Bully :14 S135 : 6
Fail :14 S137 : 6
MomNo :14 S139 : 6
NoPart:14 S17 : 6
TeacNo:14 S185 : 6
Work :14 S2 : 6
(Other):48
There is something weird in angry, which has the maximum of 9 while all maximums of other scale scores are 4. Let’s check it.
1 2 3 4 9
25 38 11 9 1
Only one data point is 9 in angry. Actually, only this data point have angry score exceeding 4. Thus, I believe that this is a wrong data point. Let’s remove it.
Lollipop plot 1 - y: scale scores, facet: situations
Data transformation
- Standardize the scores for each emotion scale.
- Stack the data set.
- Group the data set by
situationandvariable (scales). Compute the mean of standardized scores for 6 situations and 8 scales.
my_std <- function(v) { # deal with NA
i <- is.na(v); v[i] <- mean(v, na.rm = TRUE)
u <- robustHD::standardize(v); u[i] <- NA
return(u)
}
dta3_long1 <- dta3 %>% select_if(is.numeric) %>%
apply(., 2, my_std) %>% as.data.frame() %>%
mutate(situation = dta3$situation) %>% reshape2::melt()
dta3_long1_MEAN <- dta3_long1 %>% group_by(situation, variable) %>%
summarise(value = mean(value, na.rm = TRUE))- See the basic info of the transformed data set.
situation variable value
Bully :8 annoy : 6 Min. :-0.8214336
Fail :8 sad : 6 1st Qu.:-0.2096925
MomNo :8 afraid : 6 Median :-0.0235430
NoPart:8 angry : 6 Mean :-0.0002002
TeacNo:8 approach: 6 3rd Qu.: 0.2692385
Work :8 avoid : 6 Max. : 0.6997865
(Other) :12
- Randomly pick one row for each situation to check the transformed data set.
dta3_long1_MEAN %>% with(., split(., situation)) %>%
lapply(., function(df) df[sample(1:6, 1),]) %>% bind_rows()Looks great.
Plot
qplot(x = value, y = situation, data = dta3_long1_MEAN) +
facet_wrap(. ~ variable, nrow = 2) +
scale_x_continuous(limits = c(-1, 1)) + xlab('Standardized score') +
geom_segment(aes(xend = 0, yend = situation)) +
geom_vline(xintercept = 0, color = 'grey55')This plot can help us to see what kind of situation tend to trigger the given emotion more or less than other situations. Take two examples:
- Children have much higher scores of “sad” in situation of being a victim of bullying than any other situations.
- Children have much lower scores of “avoid” in the situation of failing to fulfill assingments in class than any other situations.
Box plot 1
Box plot 1-1: Boxt plot of standardized emotion scores
qplot(x = situation, y = value, geom = 'boxplot', data = dta3_long1_MEAN) +
xlab('Emotion') + ylab('Standardized score') +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45))This plot can tell us there is a larger/smaller variation among 8 emotions in what kind of situation.
Box plot 1-2: Box plot of standardized emotion scores with faceting situations
qplot(x = situation, y = value, col = variable, geom = 'boxplot', data = dta3_long1) +
xlab('Situation') + ylab('Standardized score') +
facet_wrap(. ~ variable, nrow = 2) +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45))This plot can tell us what kind of situation has a larger/smaller variation in one given emotion. For example, compared to other situation, situation of being a bully victim has larger variation in agressive emotion. That is, when being a bully victim, the difference of tne degree of agressive emotion among chidren is relation bigger than that of other emotions.
Finding
From these 2 plots, we can find that, for example, the situation “bully” has relative smaller variation among 8 emotions than other situations while it has relative larger variation among 14 children in angry emotion (the median even equals 0). That is, when being a bully victim, the difference among 8 emoitons is not very big while childern differ in expressing angry emotion.
Lollipop plot 2 - y: situations, facet: scale scores
In this plot, we exchange y and facet variable to see different comparsions.
Data transformation
- Stack the data set.
- Standardize the scores for each situation.
- Group the data set by
variable (scales)andsituation. Compute the mean of standardized scores for 6 situations and 8 scales.
dta3_long2 <- dta3 %>% dplyr::select(-sbj) %>% reshape2::melt() %>% #1
with(., split(., situation)) %>% #2
lapply(., function(df) mutate_at(df, 'value', my_std)) %>% bind_rows() #2
dta3_long2_MEAN <- dta3_long2 %>%
group_by(variable, situation) %>% #3
summarise(value = mean(value, na.rm = TRUE)) #3- See the basic info of the transformed data set.
variable situation value
annoy : 6 Bully :8 Min. :-1.1294166
sad : 6 Fail :8 1st Qu.:-0.3436152
afraid : 6 MomNo :8 Median :-0.1051407
angry : 6 NoPart:8 Mean : 0.0002451
approach: 6 TeacNo:8 3rd Qu.: 0.3728867
avoid : 6 Work :8 Max. : 0.9547457
(Other) :12
- Randomly pick one row for each emotion to check the transformed data set.
dta3_long2_MEAN %>% with(., split(., variable)) %>%
lapply(., function(df) df[sample(1:6, 1),]) %>% bind_rows()Looks great.
Plot
qplot(x = value, y = variable, data = dta3_long2_MEAN) + #1
facet_wrap(. ~ situation, nrow = 2) +
scale_x_continuous(limits = c(-1.2, 1.2)) + xlab('Standardized score') +
geom_segment(aes(xend = 0, yend = variable)) + #2
geom_vline(xintercept = 0, color = 'grey55') #3This plot can help us to see what kind of emotion tend to be triggered more or less than other emotion in the given situation. Take two examples:
- Children have much higher scores of “annoy” than other emotions in the situation of being a victim of bullying.
- Children have much lower scores of “afraid” than other emotions in the situation of being forbidden to do something by the mother.
Box plot 2
Box plot 2-1: Boxt plot of standardized emotion scores
qplot(x = variable, y = value, geom = 'boxplot', data = dta3_long2_MEAN) +
xlab('Emotion') + ylab('Standardized score') +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45))This plot can tell us what kind of emotion has a larger/smaller variation among six stiuations.
Box plot 2-2: Box plot of standardized emotion scores with faceting situations
qplot(x = variable, y = value, col = situation, geom = 'boxplot', data = dta3_long2) +
xlab('Emotion') + ylab('Standardized score') +
facet_wrap(. ~ situation, nrow = 2) +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45))This plot can tell us what kind of emotion has a larger/smaller variation in one given stiuation.
Finding
From these plots, we can find that, for example, the emotion “annoy” has relative smaller variation among 6 situations than other emotions while it has relative larger variation among 14 children when being the same situation.
Correlation plot for scales
dta3 %>% select_if(is.numeric) %>% na.omit() %>% cor() %>%
ggcorrplot::ggcorrplot(., type = 'lower', lab=TRUE, hc.order = TRUE, hc.method = 'centroid',
colors = c('steelblue3', 'white', 'deeppink3'),
sig.level=.2, insig = 'pch', pch.col = 'grey40') When being in the same situation, there is a medium positive correlation (\(.2 < r <. 8\)) between these emotions:
- annoy and sad
- annoy and angry
- avoid and sad
- afraid and sad
- afraid and approach
- angry and aggressive
HW exercise 4 (Need help).
Revise the plot in the contrast detection example so that the plot is faceted by subject × spatial frequency condition.
The data sets are from a psychophysical experiment where five participants detected sine wave gratings at different contrasts and spatial frequencies. At each trial, a binary response (grating left or right) which is either correct or incorrect, is to be scored. Each row in the data frame represents observation from a single trial. The basic plot puts contrast on the x-axis and the proportion of correct responses on the y-axis. The contrast values in the experiment were sampled logarithmically. The aim of the experiment was to see whether and how human visual sensitivity to contrast changes depending on the spatial scale of the information (loosely, whether the pattern is coarse or fine). The researchers also want to compare how the performance shifts as a function of contrast within each subject. But the plot below shows the averaged responses from all subjects, ignoring spatial frequencies.
- Column 1: Subject ID
- Column 2: Contrast
- Column 3: Spatial frequency
- Column 4: Target side
- Column 5: Response side
- Column 6: Unique trial ID (not used)
Target output 2
Codes for original plot
Load in the package
Load in the data sets together and check its structure
fls <- list.files(path = '../data', pattern = 'data_hw0420_4_')
fL <- paste0('../data/', fls)
dta4 <- lapply(fL, function(fl) read.table(fl, sep = ',', header = TRUE)) %>%
bind_rows()
head(dta4)'data.frame': 7000 obs. of 6 variables:
$ subject : chr "S1" "S1" "S1" "S1" ...
$ contrast : num 0.0695 0.0131 0.0695 0.0695 0.3679 ...
$ sf : num 0.5 40 4.47 40 13.37 ...
$ target_side: Factor w/ 2 levels "left","right": 2 2 1 1 1 1 2 2 1 2 ...
$ response : Factor w/ 2 levels "left","right": 2 1 1 2 1 2 2 1 2 2 ...
$ unique_id : chr "544ee9ff-2569-4f38-b04e-7e4d0a0be4d2" "b27fe910-e3ba-48fb-b168-5afb1f115d8f" "72c9d6ce-0a90-4d4b-a199-03435c15291b" "48b5bbb2-e6ee-4848-b77e-839ed5320c01" ...
Data transformation and visualization
- Score the correct responses.
subject contrast sf target_side
Length:7000 Min. :0.002479 Min. : 0.500 left :3500
Class :character 1st Qu.:0.005704 1st Qu.: 1.495 right:3500
Mode :character Median :0.030197 Median : 4.472
Mean :0.092678 Mean :11.968
3rd Qu.:0.159880 3rd Qu.:13.375
Max. :0.367879 Max. :40.000
response unique_id correct
left :3488 Length:7000 Min. :0.0000
right:3512 Class :character 1st Qu.:1.0000
Mode :character Median :1.0000
Mean :0.7697
3rd Qu.:1.0000
Max. :1.0000
- Bootstrapped CIs for proportion of correct responses.
p <- ggplot(dta4, aes(contrast, correct)) +
stat_summary(fun.data = 'mean_cl_boot', color = "dodgerblue") +
scale_x_log10() +
scale_y_continuous(limits = c(0, 1)) +
labs(x = "Contrast in log unit", y = "Proportion of correct responses")- Links for Binomial Family for m-alternative Forced-choice. Get document info.
- Fit a detection model in which the chance of correct response is \(.5\).
- Generate predicted responses.
xval <- with(dta4, rep(seq(min(contrast), max(contrast), len = 100), 10))
yval <- predict(m0, data.frame(contrast = xval), type = "response")
m0_pred <- data.frame(xval, yval)- Augument to the observed plot.
The example codes seem to have something wrong: they tried to bootstrapped CIs for correct, a binary variable, instead of the proportion of correct responses. Thus, I tried figure out the solution to fix it in the following faceted plots.
Revise the plot to make it faceted by subject × spatial frequency condition.
Check the distribution of sf to group
S1 S2 S3 S4 S5
1400 1400 1400 1400 1400
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.500 1.495 4.472 11.968 13.375 40.000
Data transformation
- Group
sfby the quantiles. Name the variable with group labelssf_group. - Split the data set twice: Split by
subjectfirst and then split bysf_group. - Do such tasks for each data frame in the double list:
- Fit a detection model in which the chance of correct response is \(.5\).
- Generate predicted responses.
- Return a data frame with predicted values.
- Bind rows of each data frame in the double list to turn the data set into a data frame.
dta4_val <- dta4 %>%
mutate(sf_group = cut(sf, breaks = quantile(sf), include.lowest = TRUE,
labels = paste0(c('Low', 'Mid-low', 'Mid-High', 'High'), ' freq'))) %>%
with(., split(., subject)) %>%
lapply(., function(df) split(df ,df$sf_group)) %>%
lapply(., function(l) {
lapply(l, function(df) {
m0 <- glm(correct ~ log10(contrast), data = df,family = binomial(mafc.probit(2)))
xval <- with(df, seq(min(contrast), max(contrast), len = 1000))
yval <- predict(m0, data.frame(contrast = xval), type = 'response')
y <- predict(m0, data.frame(contrast = df$contrast), type = 'response')
return(full_join(data.frame(contrast = df$contrast, y, idx = 1:length(y)),
data.frame(xval, yval, idx = 1:length(yval))))
})
}) %>%
lapply(., function(l) bind_rows(l, .id = 'sf_group')) %>% bind_rows(., .id = 'subject')
summary(dta4_val) subject sf_group contrast y
Length:20000 Length:20000 Min. :0.002 Min. :0.500
Class :character Class :character 1st Qu.:0.006 1st Qu.:0.502
Mode :character Mode :character Median :0.030 Median :0.858
Mean :0.093 Mean :0.769
3rd Qu.:0.160 3rd Qu.:1.000
Max. :0.368 Max. :1.000
NA's :13000 NA's :13000
idx xval yval
Min. : 1.0 Min. :0.002479 Min. :0.5000
1st Qu.: 250.8 1st Qu.:0.093829 1st Qu.:0.9911
Median : 500.5 Median :0.185179 Median :0.9999
Mean : 500.5 Mean :0.185179 Mean :0.9473
3rd Qu.: 750.2 3rd Qu.:0.276529 3rd Qu.:1.0000
Max. :1000.0 Max. :0.367879 Max. :1.0000
Plot
- Dot: The real data in the data set (
contrast) and its predicted values (y) based on the model (m0), that is, the proportion of correct responses. - Curve: The synthetic data (
xval) and its predicted values (yval) based on the model (m0).
dta4_val %>% na.omit() %>% ggplot(aes(x = contrast, y = y)) +
facet_wrap(subject ~ sf_group) +
scale_x_log10() +
scale_y_continuous(limits = c(0, 1)) +
labs(x = "Contrast in log unit", y = "Proportion of correct responses") +
stat_summary(fun.data = mean_cl_boot, color = "dodgerblue", size = .25) +
geom_line(data = dta4_val, aes(x = xval, y = yval)) +
theme_bw()No error bar appears beacause every predicted value in the given contrast is the same.
Try to make error bars (make variation in predicted values): Generate several times of synthetic data
Change: Generate synthetic data with fewer elements but for several times.
dta4_val2 <- dta4 %>%
mutate(sf_group = cut(sf, breaks = quantile(sf), include.lowest = TRUE,
labels = paste0(c('Low', 'Mid-low', 'Mid-High', 'High'), ' freq'))) %>%
with(., split(., subject)) %>%
lapply(., function(df) split(df ,df$sf_group)) %>%
lapply(., function(l) {
lapply(l, function(df) {
m0 <- glm(correct ~ log10(contrast), data = df,family = binomial(mafc.probit(2)))
y <- predict(m0, data.frame(contrast = df$contrast), type = 'response')
df_xval <- df_yval <- data.frame(rep(NA, 7))
for (k in 1:30) { # Generate synthetic data for 30 times
xval_new <- with(df, seq(min(contrast), max(contrast), len = 7))
df_xval[paste0('xval_', k)] <- xval_new
df_yval[paste0('yval_', k)] <-
predict(m0, data.frame(contrast = xval_new), type = 'response')
}
df_val <- cbind(stack(df_xval[,-1]), stack(df_yval[,-1]))[,c(1, 3)]
colnames(df_val) <- c('xval', 'yval')
return(full_join(data.frame(contrast = df$contrast, y, idx = 1:length(y)),
df_val %>% mutate(idx = 1:210)))
})
}) %>%
lapply(., function(l) bind_rows(l, .id = 'sf_group')) %>% bind_rows(., .id = 'subject')
summary(dta4_val2) subject sf_group contrast y
Length:7000 Length:7000 Min. :0.002479 Min. :0.5000
Class :character Class :character 1st Qu.:0.005704 1st Qu.:0.5018
Mode :character Mode :character Median :0.030197 Median :0.8577
Mean :0.092678 Mean :0.7693
3rd Qu.:0.159880 3rd Qu.:0.9998
Max. :0.367879 Max. :1.0000
idx xval yval
Min. : 1.0 Min. :0.0025 Min. :0.5000
1st Qu.: 88.0 1st Qu.:0.0634 1st Qu.:0.9838
Median :175.5 Median :0.1852 Median :0.9999
Mean :196.5 Mean :0.1852 Mean :0.9094
3rd Qu.:263.0 3rd Qu.:0.3070 3rd Qu.:1.0000
Max. :560.0 Max. :0.3679 Max. :1.0000
NA's :2800 NA's :2800
Plot again
- Dot and bar: The synthetic data (
xval) and its predicted values (yval) based on the model (m0) with boosted error bar among data coming from 30 generations. - Curve: The real data in the data set (
contrast) and its predicted values (y) based on the model (m0), that is, the proportion of correct responses.
dta4_val %>% na.omit() %>% ggplot(aes(x = xval, y = yval)) +
stat_summary(fun.data = mean_cl_boot, color = "dodgerblue", size = .25) +
facet_wrap(subject ~ sf_group) +
scale_x_log10() +
scale_y_continuous(limits = c(0, 1)) +
labs(x = "Contrast in log unit", y = "Proportion of correct responses") +
geom_line(data = dta4_val, aes(x = contrast, y = y)) +
theme_bw()However, there is still no error bar because generating synthetic data with fewer elements but for several times still makes data point unique from each other. We need to figure out the approach that can make the variation exists in the given contrast.
HW exercise 5 (Need help).
Use the Cushings{MASS} data set to generate a plot similar to the following one:
Target output 2
Load in the data set and check the data structure
'data.frame': 27 obs. of 3 variables:
$ Tetrahydrocortisone: num 3.1 3 1.9 3.8 4.1 1.9 8.3 3.8 3.9 7.8 ...
$ Pregnanetriol : num 11.7 1.3 0.1 0.04 1.1 0.4 1 0.2 0.6 1.2 ...
$ Type : Factor w/ 4 levels "a","b","c","u": 1 1 1 1 1 1 2 2 2 2 ...
[Note] By using ?Cushings, we can know the meaning of each level in Cushings$Type:
- a: Adenoma
- b: Bilateral Hyperplasia
- c: Carcinoma
- u: Unknown
Data transformation
Create some variabels (positions and texts) for adding group labels.
dta5 <- Cushings %>%
mutate(x0 = c(rep(3, table(Cushings$Type)['a']), rep(10, table(Cushings$Type)['b']),
rep(15, table(Cushings$Type)['c']), rep(24, table(Cushings$Type)['u'])),
y0 = c(rep(11, table(Cushings$Type)['a']), rep(2, table(Cushings$Type)['b']),
rep(7, table(Cushings$Type)['c'] ), rep(.95, table(Cushings$Type)['u'] )),
Type_lab = c('Adenoma', rep('', table(Cushings$Type)['a'] - 1),
'Bilateral Hyperplasia', rep('', table(Cushings$Type)['b'] - 1),
'Carcinoma', rep('', table(Cushings$Type)['c'] - 1),
'Unknown', rep('', table(Cushings$Type)['u'] - 1)))
dta5 %>% dplyr::select(x0, y0, Type_lab) %>% str()'data.frame': 27 obs. of 3 variables:
$ x0 : num 3 3 3 3 3 3 10 10 10 10 ...
$ y0 : num 11 11 11 11 11 11 2 2 2 2 ...
$ Type_lab: chr "Adenoma" "" "" "" ...
Plot
g5 <- ggplot(mapping = aes(x = Tetrahydrocortisone, y = Pregnanetriol, label = Type_lab),
data = dta5) +
# Point
geom_point(color = 'black', size = 2, shape = 16) + # point frame
geom_point(aes(color = Type), size = 2, shape = 20) + # point interior
scale_color_manual(values = c('hotpink3', 'chartreuse4', 'dodgerblue1', 'orange1')) +
# Axises
xlab('Tetrahydrocortisone (mg/24 hours)') +
ylab('Pregnanetriol (mg/24 hours)') + # there is a clerical error here in the original plot
scale_x_continuous(breaks = seq(0, 60, 10), limits = c(0, 60)) + # rescale x-axis
scale_y_continuous(breaks = seq(0, 12, 2)) + # rescale y-axis
# Title
ggtitle('Cushing\'s syndrome', ) +
# Layout
theme(legend.position = 'none', # remove legend
panel.background = element_blank(), # remove background color
panel.grid.major.x = element_blank(), # remove vertical grid lines
panel.grid.major.y = element_line(color = 'grey80', size = .3),
axis.ticks = element_blank(), # remove ticks of two axises
axis.text = element_text(size = 10), # -
axis.text.x = element_text(hjust = 1), # text font and
axis.text.y = element_text(angle = 90), # label position
axis.title = element_text(size = 11.5), # setting
axis.title.x = element_text(margin = margin(t = 15)), #
axis.title.y = element_text(margin = margin(r = 15)), # -
plot.title = element_text(size = 11, face = 'bold', hjust = 1, margin = margin(b = 25)))
g5 + geom_text(aes(x0, y0, group = Type, color = Type), size = 3) # Group labels