Task 1: A student was interested in whether there was a positive relationship between the time spent doing an essay and the mark received. He got 45 of his friends and timed how long they spent writing an essay (hours) and the percentage they got in the essay (essay). He also translated these grades into their degree classifications (grade): in the UK, a student can get a first-class mark (the best), an upper-second-class mark, a lower second, a third, a pass or a fail (the worst). Using the data in the file EssayMarks.sav, find out what the relationship was between the time spent doing an essay and the eventual mark in terms of percentage and degree class (draw a scatterplot too).
*You do not need to do the assumption tests (histograms) or the bootstrapping; just get the correlations and basic regressions. Do get the scatterplots and plot the regression line. You can use the videos provided and I’ll review in class.
essaymarks <- read.csv("https://blue.butler.edu/~rpadgett/ps310/Data/essaymarks.csv")
head(essaymarks)
## essay hours grade
## 1 61.67550 10.630337 2
## 2 69.54501 7.285226 1
## 3 48.22930 5.052048 4
## 4 70.67865 2.886614 1
## 5 59.89962 9.545012 3
## 6 61.16202 11.310838 2
# Scatterplot
plot(essaymarks$hours, essaymarks$essay, main="Hours vs. Essay Percentage",
xlab="Hours spent on essay", ylab="Essay Percentage", pch=19, col="blue")
abline(lm(essaymarks$essay ~ essaymarks$hours), col="red") #Regression line
#Correlation
correlation <- cor(essaymarks$hours, essaymarks$essay)
cat("Correlation between hours and essay percentage:", round(correlation, 2), "\n")
## Correlation between hours and essay percentage: 0.27
#Regression
model <- lm(essay ~ hours, data = essaymarks)
summary(model)
##
## Call:
## lm(formula = essay ~ hours, data = essaymarks)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.0407 -4.1877 -0.7728 4.7592 16.7620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.9296 3.1970 18.120 <2e-16 ***
## hours 0.6612 0.3644 1.814 0.0766 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.588 on 43 degrees of freedom
## Multiple R-squared: 0.07112, Adjusted R-squared: 0.04952
## F-statistic: 3.292 on 1 and 43 DF, p-value: 0.07659
Correlation Analysis: The correlation between hours spent on an essay and the essay percentage is positive but weak (r = 0.27): students who spend more time on their essays tend to score slightly higher. So, while time invested might help, there are likely other factors at play that determine essay scores.
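The task also asks whether the relationship is significant and how hours relate to degree class. A minimal sketch of one way to get both is below. The `toy` data frame is hypothetical (it is not the EssayMarks data), and treating `grade` as ordinal with 1 as the best class is an assumption based on the first rows printed above.

```r
# Hypothetical toy data; column names mirror essaymarks, values are made up
toy <- data.frame(
  hours = c(2, 4, 5, 7, 8, 10, 11, 12),
  essay = c(55, 62, 58, 60, 66, 64, 70, 63),
  grade = c(3, 2, 3, 2, 2, 2, 1, 2)  # assumed coding: 1 = first class, higher = worse
)

# Pearson correlation with a significance test and a 95% confidence interval
pearson <- cor.test(toy$hours, toy$essay, method = "pearson")
pearson$estimate
pearson$p.value

# Degree class is ordinal, so a rank-based correlation is one reasonable choice.
# exact = FALSE avoids the warning about exact p-values when ranks are tied.
spearman <- cor.test(toy$hours, toy$grade, method = "spearman", exact = FALSE)
spearman$estimate
```

Because a lower `grade` code means a better class under this assumed coding, a negative Spearman coefficient would point in the same direction as a positive correlation with the percentage.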
Regression Analysis:
- The slope is 0.66, so each additional hour spent on the essay predicts an increase of about 0.66 percentage points. Hours explain only about 7% of the variance in essay marks (R-squared = 0.071), so the number of hours alone doesn’t guarantee a high score.
- The relationship falls just short of conventional statistical significance (p = 0.077, above the .05 cutoff). This means we can’t be confident that the relationship observed in our data would also appear in the larger student population. It’s a hint that time might matter, but it’s not definitive.
- The scatterplot shows a wide spread of essay scores at any given number of hours spent. Although there’s a slight upward trend, there’s a lot of variability. Some students spent many hours and received average scores, while others spent less time and scored high.
Summary: Overall, while there seems to be a slight trend for more hours to go with better essay scores, the relationship isn’t strong or statistically significant. Other factors beyond time spent are likely influencing essay grades.
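The correlation and the regression reported above are two views of the same relationship: with a single predictor, R-squared is the squared Pearson correlation (0.07112 is roughly 0.267 squared) and the slope’s p-value matches the correlation’s p-value. The sketch below checks this on hypothetical toy data, not the EssayMarks file.

```r
# Hypothetical toy data; names mirror the essaymarks columns
toy <- data.frame(
  hours = c(2, 4, 5, 7, 8, 10, 11, 12),
  essay = c(55, 62, 58, 60, 66, 64, 70, 63)
)

fit <- lm(essay ~ hours, data = toy)
fit_summary <- summary(fit)
r <- cor(toy$hours, toy$essay)

# With one predictor, R-squared is the squared Pearson correlation
r2_matches <- all.equal(fit_summary$r.squared, r^2)
r2_matches

# The slope's t test and the correlation test give the same p-value
slope_p <- coef(fit_summary)["hours", "Pr(>|t|)"]
cor_p <- cor.test(toy$hours, toy$essay)$p.value
p_matches <- all.equal(slope_p, cor_p)
p_matches
```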
Task 6: In Chapter 4 (Task 6) we looked at data from people who had been forced to marry goats and dogs and measured their life satisfaction and, also, how much they like animals (Goat or Dog.sav). Is there a significant correlation between life satisfaction and the type of animal to which a person was married? *Again here, you just need to get the Pearson correlations and report if they are significant. You should note that these are point-biserial correlations (rpbis).
Note: I tried to do the bootstrapped confidence interval, but I don’t know if I went about it the right way. The compute_correlation function is something I found online, so yeah.
library(boot)
Goat_or_Dog <- read.csv("https://blue.butler.edu/~rpadgett/ps310/Data/Goat_or_Dog.csv")
#Correlation
compute_correlation <- function(data, indices) {
sample_data <- data[indices, ]
return(cor(sample_data$life_satisfaction, sample_data$wife))
}
bootstrap_results <- boot(data = Goat_or_Dog, statistic = compute_correlation, R = 1000)
#Bootstrapped confidence intervals
confidence_intervals <- boot.ci(bootstrap_results, type = "bca")
bootstrap_results
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = Goat_or_Dog, statistic = compute_correlation, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.6304468 0.006599903 0.111855
confidence_intervals
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = bootstrap_results, type = "bca")
##
## Intervals :
## Level BCa
## 95% ( 0.2794, 0.7841 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
Analysis: The results revealed a positive point-biserial correlation between the type of animal spouse and life satisfaction (r = 0.63). In essence, there seems to be a connection between the type of animal someone is married to and how satisfied they feel in life. I used bootstrapping, which repeatedly resamples the data, to get a more accurate sense of how precise the correlation is. Bootstrapping gave a 95% BCa confidence interval of 0.28 to 0.78. The interval doesn’t include zero, which is a good indicator that the finding isn’t just random. Assuming dog has the higher code in the wife variable, being married to a dog is associated with higher life satisfaction compared to being married to a goat. But realistically, just don’t marry an animal. It’s weird, and I don’t care what the data says.
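The task asks for the Pearson (point-biserial) correlation and its significance, which `cor.test()` gives directly without bootstrapping. The sketch below uses hypothetical toy data. The column names mirror Goat_or_Dog, but the values and the 0/1 coding of `wife` are assumptions for illustration.

```r
# Hypothetical toy data; the 0/1 coding of wife is an assumption
toy <- data.frame(
  wife = c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1),
  life_satisfaction = c(30, 42, 35, 50, 38, 45, 55, 48, 62, 70, 58, 52)
)

# Point-biserial correlation: Pearson's r where one variable is dichotomous
r_pb <- cor.test(toy$life_satisfaction, toy$wife)
r_pb$estimate
r_pb$p.value

# It carries the same p-value as an equal-variance independent-samples t test
t_res <- t.test(life_satisfaction ~ wife, data = toy, var.equal = TRUE)
p_matches <- all.equal(r_pb$p.value, t_res$p.value)
p_matches
```

The sign of the coefficient only says which group was given the higher code, so checking how `wife` is coded in the real file is needed before saying which animal goes with higher satisfaction.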
Run the regression (using IQ to predict social competence) and add y’ (unstandardized predicted scores) and y-y’ (unstandardized residuals). To get these values you need to store the results of the lm() function in an object; the values are then stored in that object’s fitted.values (y’) and residuals (y-y’) components, respectively.
Now run a correlation matrix between all 4 variables (x, y, y’ and y-y’). To do this, put the 4 variables together in a new data frame and use the names() function to name the 4 variables x, y, y’ and y-y’.
socialComp <- read.csv("https://blue.butler.edu/~rpadgett/ps310/Data/socialComp.csv")
#regression
model <- lm(SocComp ~ IQ, data = socialComp)
#Unstandardized predicted scores (y') and residuals (y-y')
y_prime <- model$fitted.values
y_minus_y_prime <- model$residuals
#New dataframe with the 4 variables
new_data <- data.frame(x = socialComp$IQ, y = socialComp$SocComp, y_prime = y_prime, y_minus_y_prime = y_minus_y_prime)
names(new_data) <- c("x", "y", "y'", "y-y'")
# Correlation matrix
cor_matrix <- cor(new_data)
print(cor_matrix)
## x y y' y-y'
## x 1.000000e+00 0.6492791 1.000000e+00 -4.420499e-17
## y 6.492791e-01 1.0000000 6.492791e-01 7.605502e-01
## y' 1.000000e+00 0.6492791 1.000000e+00 6.686895e-17
## y-y' -4.420499e-17 0.7605502 6.686895e-17 1.000000e+00
plot(new_data)
Explain the meaning of each correlation in the table and why the value
is the value that it is. You only need to explain the upper
triangle of the matrix, as it is symmetric. The hard one there is the
.761 value… What did I tell you that was?
x with y: 0.6492791 - This represents the correlation between IQ (x) and social competence (y). A value of about 0.65 suggests a moderate positive relationship: as IQ scores increase, social competence also generally increases.
x with y’: 1.000000e+00 - The predicted values (y’) are computed directly from the IQ scores (x) in the linear regression, so the two are perfectly correlated. This makes sense since y’ is a linear transformation of x (intercept plus slope times x).
x with y-y’: −4.420499e−17 - This is effectively zero; the tiny value is just floating-point rounding error. Least-squares residuals are uncorrelated with the predictor by construction, so IQ (x) does not correlate with the residuals.
y with y’: 0.6492791 - It’s the same as the correlation between x and y, because y’ is a perfect linear function of x with a positive slope. The observed values of social competence (y) therefore have the same moderate positive correlation with their predicted values (y’) as they do with IQ.
y with y-y’: 0.7605502 - This is the correlation between the actual scores and the residuals. It equals the square root of 1 − r², where r is the correlation between x and y: sqrt(1 − 0.6492791²) = 0.7605502. Squaring it gives about 0.578, the proportion of variance in social competence that IQ does not explain. So it measures how much of y is left over after the prediction; it is not a sign of a pattern in the errors that the model missed.
y’ with y-y’: 6.686895e−17 - Also effectively zero. Since y’ is a linear function of x and the residuals are uncorrelated with x, the predicted scores are uncorrelated with the residuals too.
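The link between the 0.76 value and the x-y correlation can be checked numerically. This is a minimal sketch on simulated data (hypothetical values, not the socialComp file), followed by the same calculation using the correlation reported in the matrix above.

```r
set.seed(1)  # hypothetical simulated data, not the socialComp file
x <- rnorm(50, mean = 100, sd = 15)
y <- 0.5 * x + rnorm(50, sd = 10)

fit <- lm(y ~ x)
r_xy <- cor(x, y)

# Correlation between the observed scores and the residuals
r_y_res <- cor(y, resid(fit))

# It matches sqrt(1 - r^2), the square root of the unexplained variance proportion
identity_holds <- all.equal(r_y_res, sqrt(1 - r_xy^2))
identity_holds

# Same calculation with the x-y correlation reported in the matrix (about 0.7605)
reported <- sqrt(1 - 0.6492791^2)
reported
```

The identity follows because y is the sum of y’ and the residual, and those two parts are uncorrelated, so the covariance of y with the residual is just the residual variance.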