library(readxl)
library(ggpubr)
## Loading required package: ggplot2
DatasetA <- read_excel("/Users/alexiaprudencio/Desktop/Applied Analytics 1/Assingment 4/DatasetA.xlsx")

Research Question: What is the relationship between how much students study (hours) and their exam score (percentage)?

  1. Descriptive Statistics
mean(DatasetA$StudyHours)
## [1] 6.135609
sd(DatasetA$StudyHours)
## [1] 1.369224
mean(DatasetA$ExamScore) 
## [1] 90.06906
sd(DatasetA$ExamScore)
## [1] 6.795224
  2. Histograms & Visually Check Normality
hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightgreen",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “StudyHours” appears approximately normally distributed. The histogram looks symmetrical, with most of the data concentrated in the middle, around 5-7 hours. The shape resembles a bell curve: it is neither excessively flat nor excessively tall.

hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "lightpink",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “ExamScore” does not appear normally distributed. The data looks negatively skewed: most of the data is on the right, representing higher scores, and the tail extends to the left. The histogram does not form a bell curve; it looks somewhat irregular, with a large spike at the very end.
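
A numeric check can back up the visual impression of skew. The sketch below computes a simple moment-based sample skewness (negative values indicate a left tail). The function name `sample_skewness` and both example vectors are made up for illustration; they are not taken from DatasetA.

```r
# Moment-based sample skewness: mean cubed deviation divided by the cubed
# (population-style) standard deviation. Negative = tail extends to the left.
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / (mean(d^2)^1.5)
}

# Made-up scores with a pile-up at 100 and a tail toward low scores
left_skewed <- c(60, 75, 85, 90, 95, 98, 100, 100, 100)

# Made-up symmetric values for comparison
symmetric <- c(4, 5, 6, 7, 8)

skew_left <- sample_skewness(left_skewed)
skew_sym  <- sample_skewness(symmetric)
print(c(left_skewed = skew_left, symmetric = skew_sym))
```

Applied to the real data, the call would be `sample_skewness(DatasetA$ExamScore)`.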

  3. Statistically Test Normality
shapiro.test(DatasetA$StudyHours) 
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349
shapiro.test(DatasetA$ExamScore)
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465

Shapiro-Wilk Output Interpretation: The Shapiro-Wilk p-value for the StudyHours normality test is greater than .05 (p = 0.9349), so there is no evidence against normality and the variable is treated as normally distributed. The Shapiro-Wilk p-value for the ExamScore normality test is less than .05 (p = 0.006465), so the variable is not normally distributed. Because one of the variables (ExamScore) is not normally distributed, I will use a Spearman correlation instead of a Pearson correlation.
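
The decision rule used here (Pearson only if every variable passes the normality test, Spearman otherwise) can be written as a small function. This is a minimal sketch; `pick_method` is a hypothetical helper name, and the p-values are the ones reported in the Shapiro-Wilk output above.

```r
# Hypothetical helper: choose the correlation method from Shapiro-Wilk p-values.
# Pearson is chosen only when ALL p-values exceed alpha.
pick_method <- function(p_values, alpha = 0.05) {
  if (all(p_values > alpha)) "pearson" else "spearman"
}

# p-values copied from the Shapiro-Wilk output above
reported_p <- c(StudyHours = 0.9349, ExamScore = 0.006465)

method <- pick_method(reported_p)
print(method)
```

The result could then be passed on as `cor.test(..., method = method)`.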

  4. Test Hypotheses - Conduct Correlation Test
cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman", exact = FALSE)
## 
##  Spearman's rank correlation rho
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9008825

The Spearman correlation test was selected because the variable “ExamScore” was not normally distributed according to the histogram and the Shapiro-Wilk test (p < .05). The p-value is < 2.2e-16, which is far below .05. This means the result is statistically significant and the alternative hypothesis is supported. The rho value is 0.90. The correlation is positive, which means that as study hours increase, exam scores also increase. The correlation is greater than 0.50, which means the relationship is strong. Note: The argument “exact = FALSE” was added to the code because the data contained ties (duplicate values), which prevent R from computing an exact p-value; with this argument, R uses an approximation instead.
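
As a sanity check, the reported S statistic and p-value can be reproduced from rho alone. The sketch below assumes n = 100 (implied by the 98 degrees of freedom used when reporting the result), the standard identity S = (n^3 - n)(1 - rho) / 6, and the large-sample t approximation that the `cor.test` documentation describes for `exact = FALSE`.

```r
n   <- 100         # assumption: sample size implied by df = n - 2 = 98
rho <- 0.9008825   # copied from the cor.test output above

# S statistic from rho
S <- (n^3 - n) * (1 - rho) / 6

# Approximate two-sided p-value via the t distribution with n - 2 df
t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

print(round(S))          # compare with the S value in the output above
print(p_val < 2.2e-16)   # compare with "p-value < 2.2e-16"
```

Because rho is printed with only seven digits, S is compared after rounding.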

  5. Scatterplot to Visualize the Relationship
ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)

The line of best fit slopes upward toward the top right. This means the direction of the relationship is positive: as study hours increase, exam scores increase. The dots closely hug the line, which indicates a strong relationship between the variables. The dots form a straight-line pattern, which means the relationship is linear. There appear to be no significant outliers. While there is a cluster of data points at the top right, likely students who scored 100, they follow the general trend of the line.

  6. Report the Results

StudyHours (M = 6.14, SD = 1.37) was correlated with ExamScore (M = 90.07, SD = 6.80), ρ(98) = .90, p < .001. The relationship was positive and strong. As StudyHours increased, ExamScore increased.
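
The rounding conventions used when reporting (two decimals, no leading zero on the correlation, and "p < .001" for very small p-values) can be applied with a short helper. This is only a sketch; `fmt_r` and `fmt_p` are hypothetical names, and 2.2e-16 stands in for the p-value because the output only gives it as an upper bound.

```r
# Correlation to two decimals without a leading zero, e.g. 0.90 -> ".90"
fmt_r <- function(r) sub("^(-?)0", "\\1", sprintf("%.2f", r))

# p-value: "p < .001" when very small, otherwise three decimals, no leading zero
fmt_p <- function(p) {
  if (p < .001) "p < .001" else paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
}

# Values copied from the cor.test output; df = 98 as reported
report <- sprintf("rho(%d) = %s, %s", 98L, fmt_r(0.9008825), fmt_p(2.2e-16))
print(report)
```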