PUBLISHED LINK:
https://rpubs.com/Haileab/1393296
library(readxl)
# Load the readxl package so R can read Excel files (.xlsx).
library(ggpubr)
## Loading required package: ggplot2
# Load the ggpubr package, which simplifies publication-style plots (such as scatter plots with regression lines).
DatasetA <- read_excel("DatasetA.xlsx")
# Read the Excel file DatasetA.xlsx and store it in the object DatasetA.
mean(DatasetA$StudyHours)
## [1] 6.135609
sd(DatasetA$StudyHours)
## [1] 1.369224
# mean(DatasetA$StudyHours) gives the average study hours: M = 6.14 hours.
# sd(DatasetA$StudyHours) gives the spread of study hours around the mean: SD = 1.37 hours.
mean(DatasetA$ExamScore)
## [1] 90.06906
sd(DatasetA$ExamScore)
## [1] 6.795224
# mean(DatasetA$ExamScore) gives the average exam score: M = 90.07.
# sd(DatasetA$ExamScore) gives the spread of exam scores around the mean: SD = 6.80.
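# A minimal sketch, not part of the original analysis: the mean() and sd() calls
# above could be wrapped in one helper so both variables are summarised the same way.
# `describe_var` and `hours_example` are hypothetical names used only for illustration.
describe_var <- function(x) {
  x <- x[!is.na(x)]  # drop missing values before summarising
  c(n = length(x), mean = mean(x), sd = sd(x))
}
hours_example <- c(5, 6, 7, 6, 8)  # made-up values, not taken from DatasetA
round(describe_var(hours_example), 2)
# With the real data, one would call describe_var(DatasetA$StudyHours)
# and describe_var(DatasetA$ExamScore) instead.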
hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "orange",
     border = "black")
# Histogram of StudyHours.
# Study hours appear approximately normally distributed: the histogram is symmetric and bell-shaped, with most values centered around 6 hours.
hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "grey",
     border = "white")
# Histogram of ExamScore.
# Exam scores are clustered at the higher end, with most students scoring above 85.
# Because of this clustering, the histogram alone does not clearly show a symmetric bell curve; the Shapiro-Wilk test below checks normality formally.
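# A minimal sketch, not part of the original analysis: a normal Q-Q plot is a common
# companion to a histogram when judging normality by eye. Points that follow the
# reference line suggest an approximately normal variable.
# `scores_example` is simulated, hypothetical data, not DatasetA.
set.seed(1)
scores_example <- rnorm(100, mean = 90, sd = 7)  # simulated scores for illustration
qqnorm(scores_example, main = "Normal Q-Q plot (simulated example)")
qqline(scores_example, col = "red")  # reference line for a normal distribution
# With the real data, one would pass DatasetA$ExamScore to qqnorm() and qqline().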
shapiro.test(DatasetA$StudyHours)
##
## Shapiro-Wilk normality test
##
## data: DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349
# The Shapiro-Wilk test checks whether a variable is consistent with a normal distribution. The null hypothesis is that the data are drawn from a normally distributed population.
# The Shapiro-Wilk p-value for StudyHours is greater than .05 (p = .935), so we fail to reject the null hypothesis and treat StudyHours as normally distributed.
shapiro.test(DatasetA$ExamScore)
##
## Shapiro-Wilk normality test
##
## data: DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465
# The same Shapiro-Wilk test is applied to ExamScore. The null hypothesis is again that the data are drawn from a normally distributed population.
# The Shapiro-Wilk p-value for ExamScore is less than .05 (p = .006), so we reject the null hypothesis; ExamScore is not normally distributed.
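# A minimal sketch of the decision rule applied here, assuming the conventional .05
# cutoff: Pearson is used only when both variables pass the Shapiro-Wilk test,
# otherwise Spearman. `choose_cor_method` is a hypothetical helper written for
# illustration; it is not a function from any package.
choose_cor_method <- function(p_x, p_y, alpha = 0.05) {
  # p_x and p_y are Shapiro-Wilk p-values for the two variables
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}
# Using the two p-values reported above (StudyHours, ExamScore):
choose_cor_method(0.9349, 0.006465)  # returns "spearman"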
cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")
## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9008825
# The Spearman correlation was selected because ExamScore was not normally distributed according to the Shapiro-Wilk test (StudyHours was).
# The p-value is below .001 (reported as < 2.2e-16), which is below .05. The result is statistically significant, so the alternative hypothesis (rho is not equal to 0) is supported.
# The rho value is 0.90.
# The correlation is positive, which means that as study hours increase, exam scores increase.
# The rho value is greater than 0.50, which means the relationship is strong.
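# A minimal sketch, not part of the original analysis: the "Cannot compute exact
# p-value with ties" warning appears because tied values rule out the exact test.
# Passing exact = FALSE (a documented cor.test argument) requests the asymptotic
# p-value directly. `x_example` and `y_example` are made-up vectors, not DatasetA.
x_example <- c(2, 4, 4, 5, 6, 7, 7, 8)          # contains ties on purpose
y_example <- c(70, 78, 80, 83, 85, 90, 91, 95)
res <- cor.test(x_example, y_example, method = "spearman", exact = FALSE)
unname(res$estimate)  # Spearman's rho
res$p.value           # asymptotic p-value
# With the real data, the equivalent call would be:
# cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman", exact = FALSE)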
ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)
# The line of best fit points toward the top right, so the direction of the relationship is positive: as study hours increase, exam scores increase.
# The dots closely hug the line, which indicates a strong relationship between the variables.
# The dots form a straight-line pattern, which indicates the relationship is linear.
# There is possibly one outlier (the individual who studied 2.0 hours and scored 82.0). However, the point lies near the line of best fit, so it does not appear to affect the relationship between the independent and dependent variables.