The project presented below was modified from Boslaugh and Watters (2008). Its purpose is to demonstrate the author’s ability to conduct analyses, replicate the work of others, and create a written summary using R and R Markdown.
# Libraries used
library(ggplot2)
library(carData)
The study included 10 participants who recorded the number of cups of coffee they drank per day and their score on an intelligence quotient (IQ) test. The purpose of the study was to examine whether coffee consumption and IQ were correlated.
# Hard coding data
coffee <- c(2, 1, 1, 1, 0, 0, 1, 2, 2, 3)
iq <- c(123, 112, 102, 98, 79, 87, 102, 120, 120, 145)
# Creating data frame with data
coffee_iq <- data.frame(coffee, iq)
# Viewing the first six rows in the data frame
head(coffee_iq)
## coffee iq
## 1 2 123
## 2 1 112
## 3 1 102
## 4 1 98
## 5 0 79
## 6 0 87
The data were first explored with a simple scatter plot. In the graph below, cups of coffee are plotted on the x-axis and IQ scores are plotted on the y-axis.
ggplot(coffee_iq, aes(x = coffee, y = iq)) +
  geom_point() +
  xlab("Coffee") + ylab("IQ Score") +
  theme_classic()
Based on the plot, there appears to be a positive correlation between cups of coffee consumed and IQ score. To provide a more precise measure of this relationship, a correlation coefficient was calculated.
# Calculating the correlation coefficient
round(cor(coffee_iq$coffee, coffee_iq$iq), 2)
## [1] 0.98
The correlation coefficient was 0.98, indicating a strong positive linear relationship between the two variables.
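As a supplementary check that was not part of the original analysis, the coefficient can be paired with a significance test and a confidence interval. The sketch below uses base R's `cor.test()`; the object name `coffee_iq_test` is an assumption introduced for illustration, and the data are re-entered so the snippet runs on its own.

```r
# Data re-entered from the study so the snippet is self-contained
coffee <- c(2, 1, 1, 1, 0, 0, 1, 2, 2, 3)
iq <- c(123, 112, 102, 98, 79, 87, 102, 120, 120, 145)

# Pearson correlation test from the stats package (loaded by default)
coffee_iq_test <- cor.test(coffee, iq, method = "pearson")

# Point estimate, 95% confidence interval, and p-value
coffee_iq_test$estimate
coffee_iq_test$conf.int
coffee_iq_test$p.value
```

With only 10 observations, the confidence interval is likely to be more informative than the point estimate alone.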
A linear model using IQ as the response variable and coffee as the predictor was constructed.
# Creating a linear model with the lm() function
coffee_iq_model <- lm(iq ~ coffee, data = coffee_iq)
# Printing a summary of the model
summary(coffee_iq_model)
##
## Call:
## lm(formula = iq ~ coffee, data = coffee_iq)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8519 -2.6790 -0.8519 1.9506 9.1481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.025 2.495 33.28 7.26e-10 ***
## coffee 19.827 1.578 12.56 1.51e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.491 on 8 degrees of freedom
## Multiple R-squared: 0.9518, Adjusted R-squared: 0.9457
## F-statistic: 157.9 on 1 and 8 DF, p-value: 1.509e-06
The summary of the model provides a wealth of information. The multiple R-squared value describes the proportion of variability in the response that the model accounts for. With a value of 0.9518, the model accounts for approximately 95% of the variability in IQ scores. Because the model fits the observed data well, it is appropriate to develop an equation to predict future values.
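To make the R-squared value concrete, it can be recomputed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. This is a minimal sketch; the names `fit`, `ss_res`, `ss_tot`, and `r_squared` are assumptions introduced for illustration.

```r
# Data re-entered from the study so the snippet is self-contained
coffee <- c(2, 1, 1, 1, 0, 0, 1, 2, 2, 3)
iq <- c(123, 112, 102, 98, 79, 87, 102, 120, 120, 145)

# Same model as in the report, refit here under a hypothetical name
fit <- lm(iq ~ coffee)

# Variability left unexplained by the model
ss_res <- sum(residuals(fit)^2)
# Total variability of IQ around its mean
ss_tot <- sum((iq - mean(iq))^2)

# Proportion of variability accounted for by the model
r_squared <- 1 - ss_res / ss_tot
r_squared

# Should agree with the value reported by summary()
all.equal(r_squared, summary(fit)$r.squared)
```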
The equation takes the general form:
\[ y = mx + b \]
where y represents the response variable (IQ score), m represents the slope, x represents cups of coffee consumed, and b represents the y-intercept. The values for m and b appear in the summary output: the slope (m) is 19.827 and the y-intercept (b) is 83.025.
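Rather than reading the slope and intercept off the printed summary, they can be pulled from the fitted model with `coef()`. The sketch below also recomputes them from the closed-form least-squares expressions as a cross-check. The names `fit`, `m`, `b`, `m_manual`, and `b_manual` are assumptions introduced for illustration.

```r
# Data re-entered from the study so the snippet is self-contained
coffee <- c(2, 1, 1, 1, 0, 0, 1, 2, 2, 3)
iq <- c(123, 112, 102, 98, 79, 87, 102, 120, 120, 145)

fit <- lm(iq ~ coffee)

# Extract the coefficients by name
b <- coef(fit)[["(Intercept)"]]
m <- coef(fit)[["coffee"]]

# Closed-form simple linear regression estimates:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
m_manual <- cov(coffee, iq) / var(coffee)
b_manual <- mean(iq) - m_manual * mean(coffee)

round(c(slope = m, intercept = b), 3)
```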
Therefore, the prediction equation takes the form:
\[ y = 19.827 \times (\text{cups of coffee}) + 83.025 \]
Using the equation displayed above, a person’s IQ can be predicted from the number of cups of coffee they drink per day. Note: because the observed data cover only 0 to 3 cups, predictions should be limited to that range to avoid extrapolation.
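The restriction on extrapolation could be enforced in code. The following is a minimal sketch, not part of the original analysis: `predict_iq_in_range()` and its `min_cups` and `max_cups` arguments are hypothetical names, and the coefficients are the rounded values from the model summary.

```r
# Hypothetical helper: applies the prediction equation only within
# the range of coffee consumption observed in the data (0 to 3 cups)
predict_iq_in_range <- function(cups, min_cups = 0, max_cups = 3) {
  if (any(cups < min_cups | cups > max_cups)) {
    stop("cups must be between ", min_cups, " and ", max_cups,
         " to avoid extrapolation")
  }
  # Rounded slope and intercept taken from the model summary
  19.827 * cups + 83.025
}

# Within range: returns a prediction
predict_iq_in_range(2)

# Out of range: the call below would stop with an error
# predict_iq_in_range(5)
```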
Predictions for average IQ based on cups of coffee consumed are presented below:
# Range of cups of coffee consumed
cups <- c(0, 1, 2, 3)
# Function to calculate IQ based on prediction equation
predict_iq <- function(x) {
  19.827 * x + 83.025
}
# Data frame with cups of coffee and predicted values (named column for readable output)
prediction_df <- data.frame(cups, predicted_iq = predict_iq(cups))
# Printing the data frame
prediction_df
## cups predicted_iq
## 1 0 83.025
## 2 1 102.852
## 3 2 122.679
## 4 3 142.506
Based on the information in the table, one can predict a person’s IQ from the number of cups of coffee they drink per day. For example, if a person consumes 2 cups of coffee per day, the model predicts an IQ of approximately 123.
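The same predictions can be obtained directly from the fitted model with `predict()`, which avoids retyping rounded coefficients and can also return prediction intervals. This is a sketch of an alternative approach, not the method used above; `new_cups` and `predictions` are names introduced for illustration.

```r
# Data and model re-created so the snippet is self-contained
coffee_iq <- data.frame(
  coffee = c(2, 1, 1, 1, 0, 0, 1, 2, 2, 3),
  iq = c(123, 112, 102, 98, 79, 87, 102, 120, 120, 145)
)
coffee_iq_model <- lm(iq ~ coffee, data = coffee_iq)

# New data must use the same predictor name as the model formula
new_cups <- data.frame(coffee = 0:3)

# Point predictions with 95% prediction intervals
predictions <- predict(coffee_iq_model, newdata = new_cups,
                       interval = "prediction", level = 0.95)

# Combine inputs and predictions (columns: coffee, fit, lwr, upr)
cbind(new_cups, predictions)
```

The `fit` column should match the table above up to rounding, while `lwr` and `upr` indicate how much an individual score might vary around each prediction.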
To fully utilize the model, its limitations must be acknowledged. The primary limitation of the analysis is the sample size. Specifically, with only 10 participants, it is unclear how well the results will generalize to the broader population. Although the generality of the results may be limited, the current study could serve as a pilot study for a larger analysis conducted in the future. For instance, it could reveal potential barriers related to data collection. Based on the results of this brief analysis, a larger replication should be explored.