This is a take-home exam. The exam is open notes and open book (in short, you can use any source of information you like as long as you work on the exam by yourself). There are 20 questions total, split into 4 parts. The maximum score is 120 points. Please adhere to the honor code. Submit the midterm as a PDF on the canvas ‘midterm’ assignment by Friday, February 17th, 8pm.
The late submission policy is:
For questions that require written responses, please make sure to
show any relevant tables, summaries (e.g. from lm() or
anova()), or visualizations. Some of the code chunks have
existing code that you can use to build your code around.
When asked to report results, please do so like you would in a
scientific article (see examples from lectures, as well as in
‘Reporting Results.pdf’ on Canvas under
Files > papers).
\clearpage commands where they are.
This makes sure that each question is printed on a separate page in the
pdf.eval=F, make sure to set
these to eval=T before knitting the final version.If you have any questions about the midterm, please post them on Ed Discussion addressed to the staff only. We will answer your question and may choose to share both your question and our answer with the rest of the group.
We believe in intraindividual change! - Learn while doing!
Honor Code
The Honor Code is the University’s statement on academic integrity written by students in 1921. It articulates University expectations of students and faculty in establishing and maintaining the highest standards in academic work:
The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code.
While the faculty alone has the right and obligation to set academic requirements, the students and faculty will work together to establish optimal conditions for honorable academic work.
Feel free to add additional libraries here.
library("knitr") # for knitting
library("broom") # for tidying model fits
library("GGally") # for ggpairs plot
library("emmeans") # for estimating marginal means
library("effectsize") # for effect size measures
library("corrr")
library("patchwork")
library("tidyverse") # for everything else
options(scipen=999)
A student investigated how effective different methods are for eliminating bacteria. She tested four different methods: 1) washing her hands with water only, 2) with regular soap, 3) with antibacterial soap (ABS), or 4) using an antibacterial spray (AS). She suspected that the number of bacteria on her hands might vary considerably from day to day. To account for this, she generated random numbers to determine on what day she would use which treatment. After each treatment, she placed her right hand on a sterile media plate to collect the bacteria. She incubated the plate for 2 days and then measured the size of the bacteria colonies by counting the number of organisms on the plate. She replicated this procedure 9 times for each of the four treatments.
Note: For statistical analysis purposes, we make the assumption that the individual measurements are independent from each other.
Question 1 (4 points): Let’s start by setting your
ggplot theme to theme_classic(), and by disabling the pesky
dplyr summarise warnings.
library(ggplot2); theme_set(theme_classic())
options(dplyr.summarise.inform = FALSE)
Question 2 (5 points): In the code chunk below, read
in the data bacterial_soap.txt from the data/
folder. (Hint: the data are in a tab separated format so the
read_tsv() function may be useful) Rename the “Bacterial
Counts” variable to “bacterial_counts”, and the “Clean Method” variable
to “clean_method”. Mutate() the clean_method variable into
a factor(), where 1 is coded as “water”, 2 as “soap”, 3 as
“antibacterial_soap”, and 4 as “alcohol_spray”.
df.soap = read_tsv("data/bacterial_soap.txt")
## Rows: 36 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (2): Bacterial Counts, Clean Method
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df.soap <- df.soap %>%
rename("bacterial_counts" = "Bacterial Counts",
"clean_method" = "Clean Method") %>%
mutate(clean_method = replace(clean_method, clean_method == 1, "water")) %>%
mutate(clean_method = replace(clean_method, clean_method == 2, "soap")) %>%
mutate(clean_method = replace(clean_method, clean_method == 3, "antibacterial_soap")) %>%
mutate(clean_method = replace(clean_method, clean_method == 4, "alcohol_spray")) %>%
mutate(clean_method = as.factor(clean_method))
Question 3 (8 points): Make a plot showing the bacterial counts for each method. The plot should show each individual data point as well as the means with bootstrapped confidence intervals for each condition. Your plot should be publication-ready (is readable, has a title, axis labels, etc).
ggplot(data=df.soap, aes(x=clean_method, y=bacterial_counts)) +
geom_point(alpha = 0.3,
position = position_jitter(width=0.1,
height=0)) +
stat_summary(fun.data = "mean_cl_boot",
shape = 21,
fill = "white",
size = 0.75) +
labs(title = "Efficacy of Different Cleaning Methods in Eliminating Bacteria",
x = "Cleaning Method Treatment",
y = "Bacteria Remaining After Treatment") +
scale_x_discrete(labels=c("alcohol_spray" = "Alcohol Spray",
"antibacterial_soap" = "Antibacterial Soap",
"soap" = "Soap",
"water" = "Water"))
Question 4 (10 points): Run the appropriate statistical model to test the hypothesis that the average size of bacterial colonies differs across the four cleaning methods (i.e. as compared to a null hypothesis according to which there are no differences in mean bacterial counts). Report your finding below.
#Running a one-way ANOVA
test1 <- lm(bacterial_counts ~ 1 + clean_method, data = df.soap)
anova(test1)
#Running a follow-up test
summary(test1)
##
## Call:
## lm(formula = bacterial_counts ~ 1 + clean_method, data = df.soap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.333 -23.472 -0.333 16.806 101.667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.22 12.51 3.375 0.00195 **
## clean_methodantibacterial_soap 43.11 17.69 2.437 0.02057 *
## clean_methodsoap 63.11 17.69 3.567 0.00116 **
## clean_methodwater 72.89 17.69 4.120 0.00025 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.53 on 32 degrees of freedom
## Multiple R-squared: 0.3849, Adjusted R-squared: 0.3272
## F-statistic: 6.675 on 3 and 32 DF, p-value: 0.001257
Your answer: The average size of bacterial colonies differed significantly as a function of the cleaning method (i.e., whether the cleaning method was using an alcohol spray, antibacterial soap, regular soap, or just water), F(3,32) = 6.67, p = .001.
Follow-up tests reveal that using anti-bacterial soap resulted in bacteria colonies that were 43.11 units larger than if the alcohol spray was used, b = 43.11, SE = 17.69, p = .021. Using regular soap resulted in bacteria colonies that were 63.11 units larger than if alcohol spray was used, b = 63.11, SE = 17.69, p = .001. Finally, using just water resulted in bacteria colonies that were 72.89 units larger than if alcohol spray was used, b = 72.89, SE = 17.69, p < .001. Overall, the different methods explained a total of 38.5% of the variance in bacterial counts.
Question 5 (3 points): (Fun) The CDC recommends that you wash your hands with soap and water for at least how many seconds?
Your answer: 20 seconds.
In this exercise, we are interested in seeing what affects life satisfaction. We have a (fake) data set with the following variables:
| variable | description |
|---|---|
id |
participant id |
age |
age in years |
kids |
number of kids |
jobsatis |
job satisfaction (1 = not at all, 7 = very much) |
marsatis |
marital satisfaction (1 = not at all, 7 = very much) |
lifsatis |
life satisfaction (1 = not at all, 7 = very much) |
Question 6 (3 points): Read in the
satisfaction.csv data set from the data/
folder. This is a comma delimited file. Use one of the methods we’ve
learned in class for taking a quick look at the data (without
visualization).
df.satis = read_csv("data/satisfaction.csv")
## Rows: 62 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): id, age, kids, jobsatis, marsatis, lifsatis
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df.satis)
## id age kids jobsatis
## Min. : 1.00 Min. :21.36 Min. :0.000 Min. :1.000
## 1st Qu.:16.25 1st Qu.:34.18 1st Qu.:0.000 1st Qu.:3.000
## Median :31.50 Median :40.03 Median :1.000 Median :4.000
## Mean :31.50 Mean :40.78 Mean :1.694 Mean :3.516
## 3rd Qu.:46.75 3rd Qu.:45.82 3rd Qu.:3.000 3rd Qu.:4.000
## Max. :62.00 Max. :68.43 Max. :8.000 Max. :6.000
## marsatis lifsatis
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000
## Median :3.000 Median :3.000
## Mean :2.968 Mean :3.274
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :7.000 Max. :7.000
Question 7 (3 points): We want to test the
hypothesis that life satisfaction is higher when people have kids.
First, for this Question 7, we need to create a new logical variable in
the dataframe that captures, for each observation, whether or not they
have children (with TRUE and FALSE as values).
Call this variable: any_kids. Using the
table() function, describe how many participants are in
each group.
df.satis2 <- df.satis %>%
mutate(any_kids = ifelse(kids > 0, TRUE, FALSE))
table(df.satis2$any_kids)
##
## FALSE TRUE
## 30 32
Question 8 (7 points): Next, let’s visualize our
hypothesized relationship. Make a plot that shows the means and
confidence intervals for life satisfaction (lifsatis)
separately for people with children and people without children. Also
show the individual data points as jittered points on the same plot.
Briefly describe what you learn from this plot (1-2 sentences). You
should use code we learned in class.
ggplot(data = df.satis2, aes(x = any_kids, y = lifsatis)) +
geom_point(alpha = 0.3,
position = position_jitter(width = 0.1,
height = 0)) +
stat_summary(fun.data = "mean_cl_boot",
shape = 21,
fill = "white",
size = 0.75)
> Your answer: Based on the means and confidence
intervals, the data suggests that people with children experience higher
life satisfaction than people without children.
Question 9 (12 points): Test the following
hypothesis: Levels of life satisfaction differ between individuals with
and without kids (use any_kids as the predictor in the
model). Note this this is not a directional hypothesis.
anova() function).Print the main results of each step as you go along.
Finally, write a short paragraph to report and interpret the results (as might appear in a paper). Report the means and standard deviation of each group (the group with kids and the group without kids), the appropriate test statistic, the p-value, with clear indication of whether the finding was “in line with” or “contrary to” the hypothesis.
# formulate and fit models
## Compact Model
compact <- lm(lifsatis ~ 1, data = df.satis2)
summary(compact)
##
## Call:
## lm(formula = lifsatis ~ 1, data = df.satis2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2742 -1.2742 -0.2742 0.7258 3.7258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.274 0.194 16.88 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.528 on 61 degrees of freedom
## Augmented Model
augmented <- lm(lifsatis ~ 1 + any_kids, data = df.satis2)
summary(augmented)
##
## Call:
## lm(formula = lifsatis ~ 1 + any_kids, data = df.satis2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7813 -0.7812 0.2188 1.2188 3.2667
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7333 0.2639 10.358 0.00000000000000554 ***
## any_kidsTRUE 1.0479 0.3673 2.853 0.00593 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.445 on 60 degrees of freedom
## Multiple R-squared: 0.1195, Adjusted R-squared: 0.1048
## F-statistic: 8.14 on 1 and 60 DF, p-value: 0.005934
# calculate sums of squared errors
df.satis2 %>%
summarize(resid_comp = sum((lifsatis - mean(lifsatis))^2),
resid_aug = sum((lifsatis - predict(augmented))^2))
# calculate PRE
PRE <- 1 - (125.3354/142.3387)
print(PRE)
## [1] 0.1194566
# F-test based on PRE
Ftest <- (0.1194566/(2-1))/((1-0.1194566)/(62-2))
print(Ftest)
## [1] 8.139742
# p-value (hint: probability from a F distribution)
pvalue <- 1 - pf(q = 8.139742, df1 = 1, df2 = 60)
print(pvalue)
## [1] 0.0059338
# F-test using anova()
anova(compact,augmented)
# calculate the means and standard deviation in each group (used for reporting below)
df.satis2 %>%
group_by(any_kids) %>%
summarize(mean_ls = mean(lifsatis),
sd_ls = sd(lifsatis))
Your answer: There is a significant difference in life satisfaction between people who have kids (M = 3.78, SD = 1.58) and people who don’t (M = 2.73, SD = 1.28), F(1,60) = 8.14, p = .006. The finding is in line with the hypothesis.
Question 10 (7 points): Delving into the model specifics, please describe what the parameters in the augmented model mean (i.e. the intercept and the coefficient(s)).
Your answer: The intercept refers to the average life satisfaction of people who do not have any kids. The coefficient “any_kidsTRUE” refers to the difference of +1.05 in life satisfaction for people who have kids, compared to people who don’t have any kids.
Question 11 (4 points): Now, lets look at some other
variables. Visualize the correlations between jobsatis,
marsatis, lifsatis, age and
kids. (You may want to use one of the following packages to
do so: corrr, GGally,
corrplot).
ggpairs(df.satis2, columns = c(4,5,6,2,3), progress = F)
Question 12 (7 points): Let’s investigate whether
marital satisfaction (marsatis) explains any additional
variance in the data, controlling for whether or not a person has kids.
Run a linear model predicting life satisfaction from
any_kids and marital satisfaction (marsatis
should be treated as a continuous variable). Report the results of this
model including a short interpretation of the model parameters (variance
explained, intercept, coefficients) (2-3 sentences).
test2 <- lm(lifsatis ~ 1 + marsatis + any_kids, data=df.satis2)
summary(test2)
##
## Call:
## lm(formula = lifsatis ~ 1 + marsatis + any_kids, data = df.satis2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9288 -0.9193 -0.0956 0.6899 3.4044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7104 0.4486 3.813 0.000330 ***
## marsatis 0.2951 0.1073 2.749 0.007918 **
## any_kidsTRUE 1.3331 0.3638 3.664 0.000533 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.372 on 59 degrees of freedom
## Multiple R-squared: 0.2194, Adjusted R-squared: 0.193
## F-statistic: 8.294 on 2 and 59 DF, p-value: 0.0006698
Your answer: Based on the multiple r-squared, the predictor variables in this model explains 21.9% of the variance in life satisfaction. The intercept represents the average life satisfaction of people with no kids and who have a marital satisfaction is 0 (if that’s even possible). The coefficient “marsatis” indicates that when marital satisfaction goes up by 1 unit, life satisfaction increases by 0.295 units, after controlling for the kids people have (b = 0.295, SE = 0.107, p = 0.008). The coefficient “any_kidsTRUE” refers to the difference of +1.333 in life satisfaction for people who have kids, compared to people who don’t have any kids, after controlling for marital satisfaction (b = 1.333, SE = 0.364, p < .001).
Question 13 (2 points): (Fun) Life satisfaction can be increased in a variety of ways. What is the prime number that, when you see it unexpectedly in daily life, increases you life satisfaction the most?
Your answer: Probably 7, reminds me of the good times playing blackjack.
Osborne and Sudick (1972) measured the verbal ability of 204 students
when they were in Grades 1, 2, 4, and 6 using the Wechsler Intelligence
Scale for Children (WISC). We are interested in examining whether the
individual differences in ability at Grade 2 (verb2) and
Grade 4 (verb4) are related to differences in ability at
Grade 6 (`verb6’).
Read in the wiscraw.csv data set from the internet.
#set filepath for data file
filepath <- "https://raw.githubusercontent.com/LRI-2/Data/main/wisc3raw.csv"
#read in the .csv file using the url() function
df.wisc <- read.csv(file = url(filepath),
header = TRUE)
Question 14 (3 points): Select the variables of
interest: the id variable (id), Grade 1 Verbal Ability
(verb1) and Grade 6 Verbal Ability (verb6).
Compute a new variable called verbDthat is a measure of the
Change in Verbal Ability from Grade 1 to Grade 6 (in a formal
equation format, \(verbD_i = verb6_i -
verb1_i\)). Obtain the means and standard deviations of the 3
verb variables, and the correlations among them (i.e., the information
that would go in a descriptives table, “Table 1”).
df.wisc <- df.wisc %>%
select(id, verb1, verb6) %>%
mutate(verbD = verb6 - verb1)
df.wisc %>%
summarize(mean_verb1 = mean(verb1),
sd_verb1 = sd(verb1),
mean_verb6 = mean(verb6),
sd_verb6 = sd(verb6),
mean_verbD = mean(verbD),
sd_verbD = sd(verbD))
df.wisc %>%
select(-id) %>%
correlate() %>%
shave() %>%
fashion()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
Question 15 (10 points): Fit two simple regression
models. In the first model, the Auto-Regression Model, regress the
outcome variable, Grade 6 Verbal Ability (verb6), on the
predictor variable Grade 1 Verbal Ability (verb1). In the
second model, the Difference Score Model, regress the outcome variable,
Change in Verbal Ability (verbD), on the predictor
variable Grade 1 Verbal Ability (verb1). Separately for
each model, (a) look at the summary lm output, (b) make a scatter plot
of the data with the regression line, and (c) write 1 or 2 sentences
interpreting the model parameters and the R-squared value (as might be
reported in a paper).
# Auto-regression Model
test3 <- lm(verb6 ~ 1 + verb1, data=df.wisc)
summary(test3)
##
## Call:
## lm(formula = verb6 ~ 1 + verb1, data = df.wisc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2459 -5.8651 0.1781 4.9048 27.9976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.22485 1.99608 10.13 <0.0000000000000002 ***
## verb1 1.20117 0.09773 12.29 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.087 on 202 degrees of freedom
## Multiple R-squared: 0.4279, Adjusted R-squared: 0.425
## F-statistic: 151.1 on 1 and 202 DF, p-value: < 0.00000000000000022
ggplot(data=df.wisc, aes(x=verb1, y=verb6)) +
geom_point() +
geom_smooth(method="lm", color="black")
## `geom_smooth()` using formula = 'y ~ x'
Your answer: The model explains 42.8% of the variance in grade 6 verbal ability. A significant relationship can be observed between grade 1 and grade 6 verbal ability, F(1, 202) = 151.1, p < .001. An increase in 1 unit on grade 1 verbal ability is associated with a 1.20 unit increase in grade 6 verbal ability, b = 1.20, SE = 0.10, p < .001.
# Difference Score Model
test4 <- lm(verbD ~ 1 + verb1, data=df.wisc)
summary(test4)
##
## Call:
## lm(formula = verbD ~ 1 + verb1, data = df.wisc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2459 -5.8651 0.1781 4.9048 27.9976
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.22485 1.99608 10.132 <0.0000000000000002 ***
## verb1 0.20117 0.09773 2.058 0.0408 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.087 on 202 degrees of freedom
## Multiple R-squared: 0.02054, Adjusted R-squared: 0.0157
## F-statistic: 4.237 on 1 and 202 DF, p-value: 0.04083
ggplot(data=df.wisc, aes(x=verb1, y=verbD)) +
geom_point() +
geom_smooth(method="lm", color="black")
## `geom_smooth()` using formula = 'y ~ x'
Your answer: The model explains 2.1% of the variance in change in verbal ability. A significant relationship can be observed between grade 1 and change in verbal ability, F(1, 202) = 4.237, p = .041. An increase in 1 unit on grade 1 verbal ability is associated with a 0.20 unit increase in change in verbal ability, b = 0.20, SE = 0.10, p = .041.
Question 16 (5 points): These two models were
constructed using the same two variables (verb1 and
verb6), and both models are used to study change. But, in
one model we predicted the Grade 6 Verbal Ability scores and in the
other we predicted the Change in Verbal Ability scores.
Which model is examining “Change in Interindividual Differences”, and which model is predicting “Interindividual Differences in Change”?
Your answer: The first model examines change in interindividual differences while the second predicts interindividual differences in change.
From a statistical significance perspective, which model seems better? Why? From a developmental theory (i.e., within-person change or growth mindset) perspective, which model seems better? Why?
Your answer: From a statistical significance perspective, the first model is better because the p-value is much smaller. But from a developmental theory perspective, the second model is better because it measures the differences between individuals in the amount of change that has taken place. In other words, the model results tell us that higher verbal ability at grade 1 is associated with a higher growth in verbal ability over time.
The Affect
Motivation and Interpersonal Behavior (AMIB) Study is an experience
sampling study that obtained daily reports of positive and negative
affect (posaff and negaff) from 190 persons
each evening for 8 consecutive days. Participants also completed a
baseline assessment on entry to the study that included the Big Five
Inventory (BFI), which assesses five dimensions of personality: openness
(bfi_o), conscientiousness (bfi_c),
extraversion (bfi_e), agreeableness (bfi_a),
and neuroticism (bfi_n). Theoretically, personality shapes
individuals’ emotional experience of the world. We are interested in
whether only neuroticism (bfi_n) is related to differences
in individuals’ typical level of negative affect (the typical claim), or
if all five dimensions of personality are related to differences in
negative affect.
Examining this research question requires a bit of data management, visualization, model fitting, and model interpretation.
Read in the two data files.
#set filepath for person file
filepath1 <- "https://raw.githubusercontent.com/LRI-2/Data/main/ILD/AMIBshare_persons_2019_0501.csv"
#read in the .csv file using the url() function
AMIB_persons <- read.csv(file=url(filepath1),header=TRUE)
#set filepath for daily file
filepath2 <- "https://raw.githubusercontent.com/LRI-2/Data/main/ILD/AMIBshare_daily_2019_0501.csv"
#read in the .csv file using the url() function
AMIB_daily <- read.csv(file=url(filepath2),header=TRUE)
Question 17 (8 points): Using the daily file …
Select the id and negaff variables. Summarize the daily negaff data for
each individual by calculating the average for scores. Similar to how
accuracy scores are calculated, this new variables should be calculated
as the mean negaff score for each person, and can be named
proto_negaff. Plot histogram of this new variable and merge
together with the personality variables from the AMIB_persons data
(bfi_o, bfi_c, bfi_e,
bfi_a, bfi_n), by id.
AMIB_daily2 <- AMIB_daily %>%
select(id, negaff) %>%
drop_na() %>%
group_by(id) %>%
summarize(n=n(),
proto_negaff = sum(negaff)/n)
ggplot(data=AMIB_daily2, aes(x=proto_negaff)) +
geom_histogram(color="black", fill="white")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
AMIB_combined <- AMIB_daily2 %>%
left_join(AMIB_persons,
by=c("id"="id")) %>%
select(id, proto_negaff, bfi_o, bfi_c, bfi_e, bfi_a, bfi_n)
Question 18 (12 points): Now fit some models. First,
create a simple linear regression model to examine the association
between individuals’ typical level of negative affect
(proto_negaff is the outcome variable) and neuroticism
(bfi_n). Then, build a multiple regression model where all
five dimensions of personality are included as predictors. Was the added
complexity worth it? Report that result. Then, report, visualize, and
interpret the parameters from the better model with respect to the main
research question and hypothesis.
test5 <- lm(proto_negaff ~ 1 + bfi_n, data = AMIB_combined)
summary(test5)
##
## Call:
## lm(formula = proto_negaff ~ 1 + bfi_n, data = AMIB_combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.33162 -0.52631 -0.04822 0.40195 2.23335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.66795 0.16385 10.180 < 0.0000000000000002 ***
## bfi_n 0.27179 0.05234 5.192 0.000000536 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6878 on 188 degrees of freedom
## Multiple R-squared: 0.1254, Adjusted R-squared: 0.1208
## F-statistic: 26.96 on 1 and 188 DF, p-value: 0.0000005364
test6 <- lm(proto_negaff ~ 1 + bfi_n + bfi_o + bfi_c + bfi_e + bfi_a, data = AMIB_combined)
summary(test6)
##
## Call:
## lm(formula = proto_negaff ~ 1 + bfi_n + bfi_o + bfi_c + bfi_e +
## bfi_a, data = AMIB_combined)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.25262 -0.52054 -0.07075 0.34583 2.39784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.80740 0.42062 6.674 0.000000000283 ***
## bfi_n 0.24738 0.05258 4.704 0.000004987415 ***
## bfi_o -0.04878 0.05171 -0.943 0.3468
## bfi_c -0.08679 0.05826 -1.490 0.1380
## bfi_e -0.02692 0.05045 -0.534 0.5943
## bfi_a -0.13148 0.05651 -2.327 0.0211 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6772 on 184 degrees of freedom
## Multiple R-squared: 0.1703, Adjusted R-squared: 0.1478
## F-statistic: 7.554 on 5 and 184 DF, p-value: 0.000001784
anova(test5, test6)
plot1 <- ggplot(data=AMIB_combined, aes(x=bfi_n, y=proto_negaff)) +
geom_point() +
geom_smooth(method="lm", color="black")
plot2 <- ggplot(data=AMIB_combined, aes(x=bfi_o, y=proto_negaff)) +
geom_point() +
geom_smooth(method="lm", color="black")
plot3 <- ggplot(data=AMIB_combined, aes(x=bfi_c, y=proto_negaff)) +
geom_point() +
geom_smooth(method="lm", color="black")
plot4 <- ggplot(data=AMIB_combined, aes(x=bfi_e, y=proto_negaff)) +
geom_point() +
geom_smooth(method="lm", color="black")
plot5 <- ggplot(data=AMIB_combined, aes(x=bfi_a, y=proto_negaff)) +
geom_point() +
geom_smooth(method="lm", color="black")
patchwork <- (plot1 | plot2 | plot3 | plot4 | plot5)
patchwork
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Your answer: The added complexity was worth it. The better model with all personality variables included explains 17.0% of the variance in negative affect. Consistent with the hypothesis, a significant positive relationship can be observed between neuroticism and negative affect after controlling for all other personality variables, in which an increase in 1 unit on neuroticism is associated with a 0.25 unit increase in negative affect, b = 0.25, SE = 0.05, p < .001. A 1 unit increase in agreeableness is also significantly associated with a 0.13 unit decrease in negative affect after controlling for other personality variables, b = -0.13, SE = 0.06, p = .021. For the remaining personality predictors, after controlling for all other personaltiy variables, no significant relationship was observed between openness and negative affect, b = -0.05, SE = 0.05, p = .347, conscientiousness and negative affect, b = -0.09, SE = 0.06, p = .138, and extraversion and negative affect, b = -0.03, SE = 0.05, p = .594. The intercept represents the mean negative affect when all other predictors in the model are 0 (which may not make sense). Overall, the findings show partial support for the hypothesis that other personality variables other than neuroticism can predict negative affect. Agreeableness predicts lower negative affect while openness to experience, conscientiousness, and extraversion do not.
Question 19 (2 points): (Fun) If there was a sixth dimension - What would you name it?
Your answer: Sarcasm
Question 20 (5 points): Looking back over your code, how many points (0 to 5) should you get for coding style?
Your answer: 5!
Congratulations! - Thank you for going on this journey into the unknown with us! Keep pushing!
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Singapore.utf8 LC_CTYPE=English_Singapore.utf8
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Singapore.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] forcats_1.0.0 stringr_1.5.0 dplyr_1.0.10 purrr_1.0.1
## [5] readr_2.1.3 tidyr_1.3.0 tibble_3.1.8 tidyverse_1.3.2
## [9] patchwork_1.1.2 corrr_0.4.4 effectsize_0.8.3 emmeans_1.8.4-1
## [13] GGally_2.1.2 ggplot2_3.4.0 broom_1.0.3 knitr_1.42
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-160 fs_1.6.0 lubridate_1.9.1
## [4] bit64_4.0.5 insight_0.18.8 RColorBrewer_1.1-3
## [7] httr_1.4.4 tools_4.2.2 backports_1.4.1
## [10] bslib_0.4.2 utf8_1.2.2 R6_2.5.1
## [13] rpart_4.1.19 mgcv_1.8-41 Hmisc_4.8-0
## [16] DBI_1.1.3 colorspace_2.1-0 nnet_7.3-18
## [19] withr_2.5.0 gridExtra_2.3 tidyselect_1.2.0
## [22] bit_4.0.5 compiler_4.2.2 cli_3.6.0
## [25] rvest_1.0.3 htmlTable_2.4.1 xml2_1.3.3
## [28] labeling_0.4.2 bayestestR_0.13.0 sass_0.4.5
## [31] checkmate_2.1.0 scales_1.2.1 mvtnorm_1.1-3
## [34] digest_0.6.31 foreign_0.8-83 rmarkdown_2.20
## [37] jpeg_0.1-10 base64enc_0.1-3 pkgconfig_2.0.3
## [40] htmltools_0.5.4 highr_0.10 dbplyr_2.3.0
## [43] fastmap_1.1.0 htmlwidgets_1.6.1 rlang_1.0.6
## [46] readxl_1.4.1 rstudioapi_0.14 farver_2.1.1
## [49] jquerylib_0.1.4 generics_0.1.3 jsonlite_1.8.4
## [52] vroom_1.6.1 googlesheets4_1.0.1 magrittr_2.0.3
## [55] Formula_1.2-4 interp_1.1-3 parameters_0.20.2
## [58] Matrix_1.5-1 Rcpp_1.0.10 munsell_0.5.0
## [61] fansi_1.0.4 lifecycle_1.0.3 stringi_1.7.12
## [64] yaml_2.3.7 plyr_1.8.8 grid_4.2.2
## [67] parallel_4.2.2 crayon_1.5.2 deldir_1.0-6
## [70] lattice_0.20-45 haven_2.5.1 splines_4.2.2
## [73] hms_1.1.2 pillar_1.8.1 estimability_1.4.1
## [76] reprex_2.0.2 glue_1.6.2 evaluate_0.20
## [79] latticeExtra_0.6-30 data.table_1.14.6 modelr_0.1.10
## [82] vctrs_0.5.2 png_0.1-8 tzdb_0.3.0
## [85] cellranger_1.1.0 gtable_0.3.1 reshape_0.8.9
## [88] assertthat_0.2.1 datawizard_0.6.5 cachem_1.0.6
## [91] xfun_0.36 survival_3.4-0 googledrive_2.0.0
## [94] gargle_1.2.1 cluster_2.1.4 timechange_0.2.0
## [97] ellipsis_0.3.2