library(readxl)
library(ggpubr)
## Loading required package: ggplot2
DatasetA <- read_excel("C:/Users/varun/Downloads/DatasetA.xlsx")
mean(DatasetA$StudyHours)
sd(DatasetA$StudyHours)
mean(DatasetA$ExamScore)
## [1] 90.06906
sd(DatasetA$ExamScore)
## [1] 6.795224
hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)
hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "pink",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)
The variable “StudyHours” appears normally distributed. The histogram is roughly symmetrical, with most values clustered in the middle and tapering off on both sides in a bell-shaped curve.
The variable “ExamScore” does not appear normally distributed. The histogram is negatively skewed: most values sit on the right, with a longer tail to the left.
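The visual reading of symmetry and skew can be backed by a number. The sketch below computes a moment-based sample skewness in base R. `sample_skewness` is a hypothetical helper written for illustration (not a function from any package), and the two vectors are simulated stand-ins, not values from DatasetA.

```r
# Moment-based sample skewness: third central moment divided by
# the second central moment raised to the power 1.5.
# `sample_skewness` is a hypothetical helper, not a package function.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Simulated data for illustration only (assumption: not from DatasetA).
set.seed(1)
symmetric_values   <- rnorm(500, mean = 50, sd = 10)  # roughly bell-shaped
left_skewed_values <- 100 - rexp(500, rate = 0.1)     # long tail to the left

sample_skewness(symmetric_values)    # expected to be close to 0
sample_skewness(left_skewed_values)  # expected to be clearly negative
```

A value near 0 is consistent with a symmetrical histogram, while a clearly negative value matches a negatively skewed one. The same helper could be applied to `DatasetA$StudyHours` and `DatasetA$ExamScore`.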
shapiro.test(DatasetA$StudyHours)
##
## Shapiro-Wilk normality test
##
## data: DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349
shapiro.test(DatasetA$ExamScore)
##
## Shapiro-Wilk normality test
##
## data: DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465
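The Shapiro-Wilk p-value for the StudyHours normality test is greater than .05 (p = .93), so we fail to reject normality and treat the data as normal. On its own, this variable would be suitable for a Pearson correlation.
The Shapiro-Wilk p-value for the ExamScore normality test is less than .05 (p = .006), so the data is NOT normal. Because the Pearson test assumes both variables are normally distributed, we use the Spearman correlation.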
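The decision rule used here (Pearson only when both variables pass Shapiro-Wilk, otherwise Spearman) can be written as a small function. This is a minimal sketch: `choose_cor_method` is a hypothetical helper, the 0.05 cutoff is the one used in this report, and the two input vectors are synthetic quantiles chosen to be deterministic, not values from DatasetA.

```r
# Pick a correlation method from two Shapiro-Wilk tests.
# `choose_cor_method` is a hypothetical helper for illustration.
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  # Pearson only if neither test rejects normality; otherwise Spearman.
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}

# Synthetic inputs (assumption: illustration only).
normal_like <- qnorm(ppoints(100))  # evenly spaced normal quantiles
skewed      <- qexp(ppoints(100))   # evenly spaced exponential quantiles

choose_cor_method(normal_like, normal_like)  # both pass, so Pearson
choose_cor_method(normal_like, skewed)       # one fails, so Spearman
```

The returned string can be passed straight to the `method` argument of `cor.test()`.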
cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")
## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9008825
The Spearman correlation test was selected because ExamScore failed the Shapiro-Wilk normality test.
The p-value (probability value) is less than 2.2e-16, which is below .05. This means the result is statistically significant: we reject the null hypothesis that rho equals 0, and the alternative hypothesis is supported.
The rho-value is 0.9008825.
The correlation is positive, which means as StudyHours increases, ExamScore increases.
The absolute value of rho (0.90) falls between 0.50 and 1.00, which means the relationship is strong.
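To make the interpretation above concrete, the sketch below shows that Spearman's rho is the Pearson correlation of the ranks, which is why it does not depend on the normality of the raw values. It also encodes the sign and strength reading as a function. The vectors `hours` and `scores` are made-up values, and `describe_rho` is a hypothetical helper; only the 0.50 cutoff for "strong" comes from this report.

```r
# Made-up data for illustration only (assumption: not from DatasetA).
hours  <- c(2, 4, 4, 7, 9, 12)        # note the tie at 4
scores <- c(55, 70, 68, 80, 93, 99)

# Spearman's rho equals Pearson's r computed on the ranks.
rho_builtin  <- cor(hours, scores, method = "spearman")
rho_by_ranks <- cor(rank(hours), rank(scores))
all.equal(rho_builtin, rho_by_ranks)

# With ties, an exact p-value cannot be computed; exact = FALSE asks
# for the approximation explicitly and avoids the warning.
result <- cor.test(hours, scores, method = "spearman", exact = FALSE)

# `describe_rho` is a hypothetical helper: direction from the sign,
# "strong" when |rho| is at least 0.50 (the cutoff used in this report).
describe_rho <- function(rho, strong_cutoff = 0.50) {
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "none"
  strength  <- if (abs(rho) >= strong_cutoff) "strong" else "below the strong range"
  paste(direction, strength, sep = ", ")
}

describe_rho(0.9008825)  # "positive, strong"
```

Passing `exact = FALSE` to the `cor.test()` call on DatasetA would likewise avoid the "Cannot compute exact p-value with ties" warning.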
ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)
The line of best fit slopes upward toward the top right. This means the direction of the relationship is positive: as StudyHours increases, ExamScore increases.
The dots closely hug the line. This means there is a strong relationship between the variables.
The dots form a straight-line pattern. This means the data is linear.
There is possibly no outlier: no point appears to affect the relationship between the independent and dependent variables.
library(rmarkdown)