Introduction

This is a take-home exam. The exam is open notes and open book (in short, you can use any source of information you like as long as you work on the exam by yourself). There are 20 questions total, split into 4 parts. The maximum score is 120 points. Please adhere to the honor code. Submit the midterm as a PDF on the canvas ‘midterm’ assignment by Friday, February 17th, 8pm.

The late submission policy is:

We will subtract 2% from your points for each hour that the midterm is submitted late.
No points will be granted if you submit later than 8pm on Sunday, February 19th.

For questions that require written responses, please make sure to show any relevant tables, summaries (e.g. from lm() or anova()), or visualizations. Some of the code chunks have existing code that you can use to build your code around.

When asked to report results, please do so like you would in a scientific article (see examples from lectures, as well as in ‘Reporting Results.pdf’ on Canvas under Files > papers).

Please leave the \clearpage commands where they are. This makes sure that each question is printed on a separate page in the pdf.
Some code chunks are set to eval=F, make sure to set these to eval=T before knitting the final version.
We note for each question how many points you can get.
Good coding style matters! We will add or subtract up to 5 points for style.

If you have any questions about the midterm, please post them on Ed Discussion addressed to the staff only. We will answer your question and may choose to share both your question and our answer with the rest of the group.

We believe in intraindividual change! - Learn while doing!

Honor Code

The Honor Code is the University’s statement on academic integrity written by students in 1921. It articulates University expectations of students and faculty in establishing and maintaining the highest standards in academic work:

The Honor Code is an undertaking of the students, individually and collectively:

that they will not give or receive aid in examinations; that they will not give or receive unpermitted aid in class work, in the preparation of reports, or in any other work that is to be used by the instructor as the basis of grading;
that they will do their share and take an active part in seeing to it that others as well as themselves uphold the spirit and letter of the Honor Code.

The faculty on its part manifests its confidence in the honor of its students by refraining from proctoring examinations and from taking unusual and unreasonable precautions to prevent the forms of dishonesty mentioned above. The faculty will also avoid, as far as practicable, academic procedures that create temptations to violate the Honor Code.
While the faculty alone has the right and obligation to set academic requirements, the students and faculty will work together to establish optimal conditions for honorable academic work.

Load libraries

Feel free to add additional libraries here.

library("knitr")      # for knitting
library("broom")      # for tidying model fits
library("GGally")     # for ggpairs plot
library("emmeans")    # for estimating marginal means 
library("effectsize") # for effect size measures
library("corrr")
library("patchwork")
library("tidyverse")  # for everything else
options(scipen=999)

Part 1: Wash your hands!

A student investigated how effective different methods are for eliminating bacteria. She tested four different methods: 1) washing her hands with water only, 2) with regular soap, 3) with antibacterial soap (ABS), or 4) using an antibacterial spray (AS). She suspected that the number of bacteria on her hands might vary considerably from day to day. To account for this, she generated random numbers to determine on what day she would use which treatment. After each treatment, she placed her right hand on a sterile media plate to collect the bacteria. She incubated the plate for 2 days and then measured the size of the bacteria colonies by counting the number of organisms on the plate. She replicated this procedure 9 times for each of the four treatments.

Note: For statistical analysis purposes, we make the assumption that the individual measurements are independent from each other.

Question 1 (4 points): Let’s start by setting your ggplot theme to theme_classic(), and by disabling the pesky dplyr summarise warnings.

library(ggplot2); theme_set(theme_classic())
options(dplyr.summarise.inform = FALSE)

Question 2 (5 points): In the code chunk below, read in the data bacterial_soap.txt from the data/ folder. (Hint: the data are in a tab separated format so the read_tsv() function may be useful) Rename the “Bacterial Counts” variable to “bacterial_counts”, and the “Clean Method” variable to “clean_method”. Mutate() the clean_method variable into a factor(), where 1 is coded as “water”, 2 as “soap”, 3 as “antibacterial_soap”, and 4 as “alcohol_spray”.

df.soap = read_tsv("data/bacterial_soap.txt")

## Rows: 36 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## dbl (2): Bacterial Counts, Clean Method
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df.soap <- df.soap %>% 
  rename("bacterial_counts" = "Bacterial Counts",
         "clean_method" = "Clean Method") %>% 
  mutate(clean_method = replace(clean_method, clean_method == 1, "water")) %>% 
  mutate(clean_method = replace(clean_method, clean_method == 2, "soap")) %>% 
  mutate(clean_method = replace(clean_method, clean_method == 3, "antibacterial_soap")) %>%
  mutate(clean_method = replace(clean_method, clean_method == 4, "alcohol_spray")) %>%   
  mutate(clean_method = as.factor(clean_method))

Question 3 (8 points): Make a plot showing the bacterial counts for each method. The plot should show each individual data point as well as the means with bootstrapped confidence intervals for each condition. Your plot should be publication-ready (is readable, has a title, axis labels, etc).

ggplot(data=df.soap, aes(x=clean_method, y=bacterial_counts)) +
  geom_point(alpha = 0.3,
             position = position_jitter(width=0.1,
                                        height=0)) +
  stat_summary(fun.data = "mean_cl_boot",
               shape = 21,
               fill = "white",
               size = 0.75) +
  labs(title = "Efficacy of Different Cleaning Methods in Eliminating Bacteria",
       x = "Cleaning Method Treatment",
       y = "Bacteria Remaining After Treatment") +
  scale_x_discrete(labels=c("alcohol_spray" = "Alcohol Spray",
                            "antibacterial_soap" = "Antibacterial Soap",
                            "soap" = "Soap",
                            "water" = "Water"))

Question 4 (10 points): Run the appropriate statistical model to test the hypothesis that the average size of bacterial colonies differs across the four cleaning methods (i.e. as compared to a null hypothesis according to which there are no differences in mean bacterial counts). Report your finding below.

#Running a one-way ANOVA
test1 <- lm(bacterial_counts ~ 1 + clean_method, data = df.soap)
anova(test1)

#Running a follow-up test
summary(test1)

## 
## Call:
## lm(formula = bacterial_counts ~ 1 + clean_method, data = df.soap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.333 -23.472  -0.333  16.806 101.667 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       42.22      12.51   3.375  0.00195 ** 
## clean_methodantibacterial_soap    43.11      17.69   2.437  0.02057 *  
## clean_methodsoap                  63.11      17.69   3.567  0.00116 ** 
## clean_methodwater                 72.89      17.69   4.120  0.00025 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.53 on 32 degrees of freedom
## Multiple R-squared:  0.3849, Adjusted R-squared:  0.3272 
## F-statistic: 6.675 on 3 and 32 DF,  p-value: 0.001257

Your answer: The average size of bacterial colonies differed significantly as a function of the cleaning method (i.e., whether the cleaning method was using an alcohol spray, antibacterial soap, regular soap, or just water), F(3,32) = 6.67, p = .001.

Follow-up tests reveal that using anti-bacterial soap resulted in bacteria colonies that were 43.11 units larger than if the alcohol spray was used, b = 43.11, SE = 17.69, p = .021. Using regular soap resulted in bacteria colonies that were 63.11 units larger than if alcohol spray was used, b = 63.11, SE = 17.69, p = .001. Finally, using just water resulted in bacteria colonies that were 72.89 units larger than if alcohol spray was used, b = 72.89, SE = 17.69, p < .001. Overall, the different methods explained a total of 38.5% of the variance in bacterial counts.

Question 5 (3 points): (Fun) The CDC recommends that you wash your hands with soap and water for at least how many seconds?

Your answer: 20 seconds.

Part 2: Life satisfaction

In this exercise, we are interested in seeing what affects life satisfaction. We have a (fake) data set with the following variables:

Variables in the satisfaction data set.
variable	description
`id`	participant id
`age`	age in years
`kids`	number of kids
`jobsatis`	job satisfaction (1 = not at all, 7 = very much)
`marsatis`	marital satisfaction (1 = not at all, 7 = very much)
`lifsatis`	life satisfaction (1 = not at all, 7 = very much)

Question 6 (3 points): Read in the satisfaction.csv data set from the data/ folder. This is a comma delimited file. Use one of the methods we’ve learned in class for taking a quick look at the data (without visualization).

df.satis = read_csv("data/satisfaction.csv")

## Rows: 62 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (6): id, age, kids, jobsatis, marsatis, lifsatis
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

summary(df.satis)

##        id             age             kids          jobsatis    
##  Min.   : 1.00   Min.   :21.36   Min.   :0.000   Min.   :1.000  
##  1st Qu.:16.25   1st Qu.:34.18   1st Qu.:0.000   1st Qu.:3.000  
##  Median :31.50   Median :40.03   Median :1.000   Median :4.000  
##  Mean   :31.50   Mean   :40.78   Mean   :1.694   Mean   :3.516  
##  3rd Qu.:46.75   3rd Qu.:45.82   3rd Qu.:3.000   3rd Qu.:4.000  
##  Max.   :62.00   Max.   :68.43   Max.   :8.000   Max.   :6.000  
##     marsatis        lifsatis    
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:2.000  
##  Median :3.000   Median :3.000  
##  Mean   :2.968   Mean   :3.274  
##  3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :7.000   Max.   :7.000

Question 7 (3 points): We want to test the hypothesis that life satisfaction is higher when people have kids. First, for this Question 7, we need to create a new logical variable in the dataframe that captures, for each observation, whether or not they have children (with TRUE and FALSE as values). Call this variable: any_kids. Using the table() function, describe how many participants are in each group.

df.satis2 <- df.satis %>% 
  mutate(any_kids = ifelse(kids > 0, TRUE, FALSE))

table(df.satis2$any_kids)

## 
## FALSE  TRUE 
##    30    32

Question 8 (7 points): Next, let’s visualize our hypothesized relationship. Make a plot that shows the means and confidence intervals for life satisfaction (lifsatis) separately for people with children and people without children. Also show the individual data points as jittered points on the same plot. Briefly describe what you learn from this plot (1-2 sentences). You should use code we learned in class.

ggplot(data = df.satis2, aes(x = any_kids, y = lifsatis)) +
  geom_point(alpha = 0.3,
             position = position_jitter(width = 0.1,
                                        height = 0)) +
  stat_summary(fun.data = "mean_cl_boot",
               shape = 21,
               fill = "white",
               size = 0.75)

> Your answer: Based on the means and confidence intervals, the data suggests that people with children experience higher life satisfaction than people without children.

Question 9 (12 points): Test the following hypothesis: Levels of life satisfaction differ between individuals with and without kids (use any_kids as the predictor in the model). Note this this is not a directional hypothesis.

Formulate this hypothesis as a comparison between a compact and an augmented model. Calculate “by hand” the sum of squared errors for each model, and then, the proportional reduction in error (PRE). (Hint: the equations are in the lecture slides).
Calculate “by hand” the F-statistic for this particular PRE, and then obtain the p-value for this statistic (assuming a two-sided hypothesis test – since we had an undirected hypothesis).
Double check your calculation of the F-statistic (and p-value) by running an F-test that compares the two models (using the anova() function).

Print the main results of each step as you go along.

Finally, write a short paragraph to report and interpret the results (as might appear in a paper). Report the means and standard deviation of each group (the group with kids and the group without kids), the appropriate test statistic, the p-value, with clear indication of whether the finding was “in line with” or “contrary to” the hypothesis.

# formulate and fit models 
## Compact Model
compact <- lm(lifsatis ~ 1, data = df.satis2)
summary(compact)

## 
## Call:
## lm(formula = lifsatis ~ 1, data = df.satis2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2742 -1.2742 -0.2742  0.7258  3.7258 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    3.274      0.194   16.88 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.528 on 61 degrees of freedom

## Augmented Model
augmented <- lm(lifsatis ~ 1 + any_kids, data = df.satis2)
summary(augmented)

## 
## Call:
## lm(formula = lifsatis ~ 1 + any_kids, data = df.satis2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7813 -0.7812  0.2188  1.2188  3.2667 
## 
## Coefficients:
##              Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    2.7333     0.2639  10.358 0.00000000000000554 ***
## any_kidsTRUE   1.0479     0.3673   2.853             0.00593 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.445 on 60 degrees of freedom
## Multiple R-squared:  0.1195, Adjusted R-squared:  0.1048 
## F-statistic:  8.14 on 1 and 60 DF,  p-value: 0.005934

# calculate sums of squared errors
df.satis2 %>%
  summarize(resid_comp = sum((lifsatis - mean(lifsatis))^2),
            resid_aug = sum((lifsatis - predict(augmented))^2))

# calculate PRE 
PRE <- 1 - (125.3354/142.3387)
print(PRE)

## [1] 0.1194566

# F-test based on PRE 
Ftest <- (0.1194566/(2-1))/((1-0.1194566)/(62-2))
print(Ftest)

## [1] 8.139742

# p-value (hint: probability from a F distribution)
pvalue <- 1 - pf(q = 8.139742, df1 = 1, df2 = 60)
print(pvalue)

## [1] 0.0059338

# F-test using anova()
anova(compact,augmented)

# calculate the means and standard deviation in each group (used for reporting below)
df.satis2 %>% 
  group_by(any_kids) %>% 
  summarize(mean_ls = mean(lifsatis),
            sd_ls = sd(lifsatis))

Your answer: There is a significant difference in life satisfaction between people who have kids (M = 3.78, SD = 1.58) and people who don’t (M = 2.73, SD = 1.28), F(1,60) = 8.14, p = .006. The finding is in line with the hypothesis.

Question 10 (7 points): Delving into the model specifics, please describe what the parameters in the augmented model mean (i.e. the intercept and the coefficient(s)).

Your answer: The intercept refers to the average life satisfaction of people who do not have any kids. The coefficient “any_kidsTRUE” refers to the difference of +1.05 in life satisfaction for people who have kids, compared to people who don’t have any kids.

Question 11 (4 points): Now, lets look at some other variables. Visualize the correlations between jobsatis, marsatis, lifsatis, age and kids. (You may want to use one of the following packages to do so: corrr, GGally, corrplot).

ggpairs(df.satis2, columns = c(4,5,6,2,3), progress = F)

Question 12 (7 points): Let’s investigate whether marital satisfaction (marsatis) explains any additional variance in the data, controlling for whether or not a person has kids. Run a linear model predicting life satisfaction from any_kids and marital satisfaction (marsatis should be treated as a continuous variable). Report the results of this model including a short interpretation of the model parameters (variance explained, intercept, coefficients) (2-3 sentences).

test2 <- lm(lifsatis ~ 1 + marsatis + any_kids, data=df.satis2)
summary(test2)

## 
## Call:
## lm(formula = lifsatis ~ 1 + marsatis + any_kids, data = df.satis2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9288 -0.9193 -0.0956  0.6899  3.4044 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.7104     0.4486   3.813 0.000330 ***
## marsatis       0.2951     0.1073   2.749 0.007918 ** 
## any_kidsTRUE   1.3331     0.3638   3.664 0.000533 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.372 on 59 degrees of freedom
## Multiple R-squared:  0.2194, Adjusted R-squared:  0.193 
## F-statistic: 8.294 on 2 and 59 DF,  p-value: 0.0006698

Your answer: Based on the multiple r-squared, the predictor variables in this model explains 21.9% of the variance in life satisfaction. The intercept represents the average life satisfaction of people with no kids and who have a marital satisfaction is 0 (if that’s even possible). The coefficient “marsatis” indicates that when marital satisfaction goes up by 1 unit, life satisfaction increases by 0.295 units, after controlling for the kids people have (b = 0.295, SE = 0.107, p = 0.008). The coefficient “any_kidsTRUE” refers to the difference of +1.333 in life satisfaction for people who have kids, compared to people who don’t have any kids, after controlling for marital satisfaction (b = 1.333, SE = 0.364, p < .001).

Question 13 (2 points): (Fun) Life satisfaction can be increased in a variety of ways. What is the prime number that, when you see it unexpectedly in daily life, increases you life satisfaction the most?

Your answer: Probably 7, reminds me of the good times playing blackjack.

Part 3: Verbal ability over time

Osborne and Sudick (1972) measured the verbal ability of 204 students when they were in Grades 1, 2, 4, and 6 using the Wechsler Intelligence Scale for Children (WISC). We are interested in examining whether the individual differences in ability at Grade 2 (verb2) and Grade 4 (verb4) are related to differences in ability at Grade 6 (`verb6’).

Read in the wiscraw.csv data set from the internet.

#set filepath for data file
filepath <- "https://raw.githubusercontent.com/LRI-2/Data/main/wisc3raw.csv"
#read in the .csv file using the url() function
df.wisc <- read.csv(file = url(filepath), 
                    header = TRUE)

Question 14 (3 points): Select the variables of interest: the id variable (id), Grade 1 Verbal Ability (verb1) and Grade 6 Verbal Ability (verb6). Compute a new variable called verbDthat is a measure of the Change in Verbal Ability from Grade 1 to Grade 6 (in a formal equation format, \(verbD_i = verb6_i - verb1_i\)). Obtain the means and standard deviations of the 3 verb variables, and the correlations among them (i.e., the information that would go in a descriptives table, “Table 1”).

df.wisc <- df.wisc %>% 
  select(id, verb1, verb6) %>% 
  mutate(verbD = verb6 - verb1)

df.wisc %>% 
  summarize(mean_verb1 = mean(verb1),
            sd_verb1 = sd(verb1),
            mean_verb6 = mean(verb6),
            sd_verb6 = sd(verb6),
            mean_verbD = mean(verbD),
            sd_verbD = sd(verbD))

df.wisc %>% 
  select(-id) %>% 
  correlate() %>% 
  shave() %>% 
  fashion()

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'

Question 15 (10 points): Fit two simple regression models. In the first model, the Auto-Regression Model, regress the outcome variable, Grade 6 Verbal Ability (verb6), on the predictor variable Grade 1 Verbal Ability (verb1). In the second model, the Difference Score Model, regress the outcome variable, Change in Verbal Ability (verbD), on the predictor variable Grade 1 Verbal Ability (verb1). Separately for each model, (a) look at the summary lm output, (b) make a scatter plot of the data with the regression line, and (c) write 1 or 2 sentences interpreting the model parameters and the R-squared value (as might be reported in a paper).

# Auto-regression Model
test3 <- lm(verb6 ~ 1 + verb1, data=df.wisc)
summary(test3)

## 
## Call:
## lm(formula = verb6 ~ 1 + verb1, data = df.wisc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2459  -5.8651   0.1781   4.9048  27.9976 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 20.22485    1.99608   10.13 <0.0000000000000002 ***
## verb1        1.20117    0.09773   12.29 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.087 on 202 degrees of freedom
## Multiple R-squared:  0.4279, Adjusted R-squared:  0.425 
## F-statistic: 151.1 on 1 and 202 DF,  p-value: < 0.00000000000000022

ggplot(data=df.wisc, aes(x=verb1, y=verb6)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

## `geom_smooth()` using formula = 'y ~ x'

Your answer: The model explains 42.8% of the variance in grade 6 verbal ability. A significant relationship can be observed between grade 1 and grade 6 verbal ability, F(1, 202) = 151.1, p < .001. An increase in 1 unit on grade 1 verbal ability is associated with a 1.20 unit increase in grade 6 verbal ability, b = 1.20, SE = 0.10, p < .001.

# Difference Score Model
test4 <- lm(verbD ~ 1 + verb1, data=df.wisc)
summary(test4)

## 
## Call:
## lm(formula = verbD ~ 1 + verb1, data = df.wisc)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.2459  -5.8651   0.1781   4.9048  27.9976 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 20.22485    1.99608  10.132 <0.0000000000000002 ***
## verb1        0.20117    0.09773   2.058              0.0408 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.087 on 202 degrees of freedom
## Multiple R-squared:  0.02054,    Adjusted R-squared:  0.0157 
## F-statistic: 4.237 on 1 and 202 DF,  p-value: 0.04083

ggplot(data=df.wisc, aes(x=verb1, y=verbD)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

## `geom_smooth()` using formula = 'y ~ x'

Your answer: The model explains 2.1% of the variance in change in verbal ability. A significant relationship can be observed between grade 1 and change in verbal ability, F(1, 202) = 4.237, p = .041. An increase in 1 unit on grade 1 verbal ability is associated with a 0.20 unit increase in change in verbal ability, b = 0.20, SE = 0.10, p = .041.

Question 16 (5 points): These two models were constructed using the same two variables (verb1 and verb6), and both models are used to study change. But, in one model we predicted the Grade 6 Verbal Ability scores and in the other we predicted the Change in Verbal Ability scores.

Which model is examining “Change in Interindividual Differences”, and which model is predicting “Interindividual Differences in Change”?

Your answer: The first model examines change in interindividual differences while the second predicts interindividual differences in change.

From a statistical significance perspective, which model seems better? Why? From a developmental theory (i.e., within-person change or growth mindset) perspective, which model seems better? Why?

Your answer: From a statistical significance perspective, the first model is better because the p-value is much smaller. But from a developmental theory perspective, the second model is better because it measures the differences between individuals in the amount of change that has taken place. In other words, the model results tell us that higher verbal ability at grade 1 is associated with a higher growth in verbal ability over time.

Part 4: Many Many Groups

The Affect Motivation and Interpersonal Behavior (AMIB) Study is an experience sampling study that obtained daily reports of positive and negative affect (posaff and negaff) from 190 persons each evening for 8 consecutive days. Participants also completed a baseline assessment on entry to the study that included the Big Five Inventory (BFI), which assesses five dimensions of personality: openness (bfi_o), conscientiousness (bfi_c), extraversion (bfi_e), agreeableness (bfi_a), and neuroticism (bfi_n). Theoretically, personality shapes individuals’ emotional experience of the world. We are interested in whether only neuroticism (bfi_n) is related to differences in individuals’ typical level of negative affect (the typical claim), or if all five dimensions of personality are related to differences in negative affect.

Examining this research question requires a bit of data management, visualization, model fitting, and model interpretation.

Read in the two data files.

#set filepath for person file
filepath1 <- "https://raw.githubusercontent.com/LRI-2/Data/main/ILD/AMIBshare_persons_2019_0501.csv"
#read in the .csv file using the url() function
AMIB_persons <- read.csv(file=url(filepath1),header=TRUE)

#set filepath for daily file
filepath2 <- "https://raw.githubusercontent.com/LRI-2/Data/main/ILD/AMIBshare_daily_2019_0501.csv"
#read in the .csv file using the url() function
AMIB_daily <- read.csv(file=url(filepath2),header=TRUE)

Question 17 (8 points): Using the daily file … Select the id and negaff variables. Summarize the daily negaff data for each individual by calculating the average for scores. Similar to how accuracy scores are calculated, this new variables should be calculated as the mean negaff score for each person, and can be named proto_negaff. Plot histogram of this new variable and merge together with the personality variables from the AMIB_persons data (bfi_o, bfi_c, bfi_e, bfi_a, bfi_n), by id.

AMIB_daily2 <- AMIB_daily %>% 
  select(id, negaff) %>%
  drop_na() %>% 
  group_by(id) %>% 
  summarize(n=n(),
            proto_negaff = sum(negaff)/n)

ggplot(data=AMIB_daily2, aes(x=proto_negaff)) +
  geom_histogram(color="black", fill="white")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

AMIB_combined <- AMIB_daily2 %>% 
  left_join(AMIB_persons,
            by=c("id"="id")) %>% 
  select(id, proto_negaff, bfi_o, bfi_c, bfi_e, bfi_a, bfi_n)

Question 18 (12 points): Now fit some models. First, create a simple linear regression model to examine the association between individuals’ typical level of negative affect (proto_negaff is the outcome variable) and neuroticism (bfi_n). Then, build a multiple regression model where all five dimensions of personality are included as predictors. Was the added complexity worth it? Report that result. Then, report, visualize, and interpret the parameters from the better model with respect to the main research question and hypothesis.

test5 <- lm(proto_negaff ~ 1 + bfi_n, data = AMIB_combined)
summary(test5)

## 
## Call:
## lm(formula = proto_negaff ~ 1 + bfi_n, data = AMIB_combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33162 -0.52631 -0.04822  0.40195  2.23335 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  1.66795    0.16385  10.180 < 0.0000000000000002 ***
## bfi_n        0.27179    0.05234   5.192          0.000000536 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6878 on 188 degrees of freedom
## Multiple R-squared:  0.1254, Adjusted R-squared:  0.1208 
## F-statistic: 26.96 on 1 and 188 DF,  p-value: 0.0000005364

test6 <- lm(proto_negaff ~ 1 + bfi_n + bfi_o + bfi_c + bfi_e + bfi_a, data = AMIB_combined)
summary(test6)

## 
## Call:
## lm(formula = proto_negaff ~ 1 + bfi_n + bfi_o + bfi_c + bfi_e + 
##     bfi_a, data = AMIB_combined)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.25262 -0.52054 -0.07075  0.34583  2.39784 
## 
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)    
## (Intercept)  2.80740    0.42062   6.674 0.000000000283 ***
## bfi_n        0.24738    0.05258   4.704 0.000004987415 ***
## bfi_o       -0.04878    0.05171  -0.943         0.3468    
## bfi_c       -0.08679    0.05826  -1.490         0.1380    
## bfi_e       -0.02692    0.05045  -0.534         0.5943    
## bfi_a       -0.13148    0.05651  -2.327         0.0211 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6772 on 184 degrees of freedom
## Multiple R-squared:  0.1703, Adjusted R-squared:  0.1478 
## F-statistic: 7.554 on 5 and 184 DF,  p-value: 0.000001784

anova(test5, test6)

plot1 <- ggplot(data=AMIB_combined, aes(x=bfi_n, y=proto_negaff)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

plot2 <- ggplot(data=AMIB_combined, aes(x=bfi_o, y=proto_negaff)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

plot3 <- ggplot(data=AMIB_combined, aes(x=bfi_c, y=proto_negaff)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

plot4 <- ggplot(data=AMIB_combined, aes(x=bfi_e, y=proto_negaff)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

plot5 <- ggplot(data=AMIB_combined, aes(x=bfi_a, y=proto_negaff)) +
  geom_point() +
  geom_smooth(method="lm", color="black")

patchwork <- (plot1 | plot2 | plot3 | plot4 | plot5)
patchwork

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

Your answer: The added complexity was worth it. The better model with all personality variables included explains 17.0% of the variance in negative affect. Consistent with the hypothesis, a significant positive relationship can be observed between neuroticism and negative affect after controlling for all other personality variables, in which an increase in 1 unit on neuroticism is associated with a 0.25 unit increase in negative affect, b = 0.25, SE = 0.05, p < .001. A 1 unit increase in agreeableness is also significantly associated with a 0.13 unit decrease in negative affect after controlling for other personality variables, b = -0.13, SE = 0.06, p = .021. For the remaining personality predictors, after controlling for all other personaltiy variables, no significant relationship was observed between openness and negative affect, b = -0.05, SE = 0.05, p = .347, conscientiousness and negative affect, b = -0.09, SE = 0.06, p = .138, and extraversion and negative affect, b = -0.03, SE = 0.05, p = .594. The intercept represents the mean negative affect when all other predictors in the model are 0 (which may not make sense). Overall, the findings show partial support for the hypothesis that other personality variables other than neuroticism can predict negative affect. Agreeableness predicts lower negative affect while openness to experience, conscientiousness, and extraversion do not.

Question 19 (2 points): (Fun) If there was a sixth dimension - What would you name it?

Your answer: Sarcasm

Question 20 (5 points): Looking back over your code, how many points (0 to 5) should you get for coding style?

Your answer: 5!

Congratulations! - Thank you for going on this journey into the unknown with us! Keep pushing!

Session info

sessionInfo()

## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_Singapore.utf8  LC_CTYPE=English_Singapore.utf8   
## [3] LC_MONETARY=English_Singapore.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Singapore.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] forcats_1.0.0    stringr_1.5.0    dplyr_1.0.10     purrr_1.0.1     
##  [5] readr_2.1.3      tidyr_1.3.0      tibble_3.1.8     tidyverse_1.3.2 
##  [9] patchwork_1.1.2  corrr_0.4.4      effectsize_0.8.3 emmeans_1.8.4-1 
## [13] GGally_2.1.2     ggplot2_3.4.0    broom_1.0.3      knitr_1.42      
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-160        fs_1.6.0            lubridate_1.9.1    
##  [4] bit64_4.0.5         insight_0.18.8      RColorBrewer_1.1-3 
##  [7] httr_1.4.4          tools_4.2.2         backports_1.4.1    
## [10] bslib_0.4.2         utf8_1.2.2          R6_2.5.1           
## [13] rpart_4.1.19        mgcv_1.8-41         Hmisc_4.8-0        
## [16] DBI_1.1.3           colorspace_2.1-0    nnet_7.3-18        
## [19] withr_2.5.0         gridExtra_2.3       tidyselect_1.2.0   
## [22] bit_4.0.5           compiler_4.2.2      cli_3.6.0          
## [25] rvest_1.0.3         htmlTable_2.4.1     xml2_1.3.3         
## [28] labeling_0.4.2      bayestestR_0.13.0   sass_0.4.5         
## [31] checkmate_2.1.0     scales_1.2.1        mvtnorm_1.1-3      
## [34] digest_0.6.31       foreign_0.8-83      rmarkdown_2.20     
## [37] jpeg_0.1-10         base64enc_0.1-3     pkgconfig_2.0.3    
## [40] htmltools_0.5.4     highr_0.10          dbplyr_2.3.0       
## [43] fastmap_1.1.0       htmlwidgets_1.6.1   rlang_1.0.6        
## [46] readxl_1.4.1        rstudioapi_0.14     farver_2.1.1       
## [49] jquerylib_0.1.4     generics_0.1.3      jsonlite_1.8.4     
## [52] vroom_1.6.1         googlesheets4_1.0.1 magrittr_2.0.3     
## [55] Formula_1.2-4       interp_1.1-3        parameters_0.20.2  
## [58] Matrix_1.5-1        Rcpp_1.0.10         munsell_0.5.0      
## [61] fansi_1.0.4         lifecycle_1.0.3     stringi_1.7.12     
## [64] yaml_2.3.7          plyr_1.8.8          grid_4.2.2         
## [67] parallel_4.2.2      crayon_1.5.2        deldir_1.0-6       
## [70] lattice_0.20-45     haven_2.5.1         splines_4.2.2      
## [73] hms_1.1.2           pillar_1.8.1        estimability_1.4.1 
## [76] reprex_2.0.2        glue_1.6.2          evaluate_0.20      
## [79] latticeExtra_0.6-30 data.table_1.14.6   modelr_0.1.10      
## [82] vctrs_0.5.2         png_0.1-8           tzdb_0.3.0         
## [85] cellranger_1.1.0    gtable_0.3.1        reshape_0.8.9      
## [88] assertthat_0.2.1    datawizard_0.6.5    cachem_1.0.6       
## [91] xfun_0.36           survival_3.4-0      googledrive_2.0.0  
## [94] gargle_1.2.1        cluster_2.1.4       timechange_0.2.0   
## [97] ellipsis_0.3.2

Psych 252 Midterm: An Exploration of Possibilites!

Cedric Lim Chun Hui