library(readxl)
library(ggpubr)
## Loading required package: ggplot2
============================== DATASET A: STUDY HOURS & EXAM SCORE ==============================
DatasetA <- read_excel("/Users/atharvapitke/Documents/Analytics/Assignments/DatasetA.xlsx")
DatasetB <- read_excel("/Users/atharvapitke/Documents/Analytics/Assignments/DatasetB.xlsx")
Independent Variable: StudyHours
mean(DatasetA$StudyHours)
## [1] 6.135609
sd(DatasetA$StudyHours)
## [1] 1.369224
Dependent Variable: ExamScore
mean(DatasetA$ExamScore)
## [1] 90.06906
sd(DatasetA$ExamScore)
## [1] 6.795224
Histogram for StudyHours IV
hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)
The variable “StudyHours” appears approximately normally distributed. The histogram has a bell-curve shape and is approximately symmetrical, with most of the data in the middle.
Histogram for ExamScore DV
hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)
The variable “ExamScore” does not appear normally distributed. The data looks negatively skewed (most of the data is on the right side, with a tail to the left), and the histogram does not form a proper bell curve.
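The visual reading of skew in the histograms can be backed up with a number. The sketch below is a minimal, base-R illustration of sample skewness (third central moment divided by the cubed standard deviation, both computed with n in the denominator). The function name `sample_skewness` and the example vectors are hypothetical and are not part of the original analysis; applying it to `DatasetA$ExamScore` would be one way to check the "negatively skewed" reading.

```r
# Sample skewness: positive values indicate a longer right tail
# (positive skew), negative values a longer left tail (negative skew),
# and values near 0 an approximately symmetrical distribution.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical example vectors (not taken from DatasetA or DatasetB)
right_tailed <- c(1, 1, 2, 2, 3, 10)     # most data on the left
left_tailed  <- -right_tailed            # most data on the right
symmetric    <- c(1, 2, 3, 4, 5)

sample_skewness(right_tailed)   # greater than 0
sample_skewness(left_tailed)    # less than 0
sample_skewness(symmetric)      # 0
```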
Normality Test for StudyHours
shapiro.test(DatasetA$StudyHours)
##
## Shapiro-Wilk normality test
##
## data: DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349
The Shapiro-Wilk p-value for the StudyHours normality test is greater than .05 (p = .93), so the data can be treated as normally distributed.
Normality Test for ExamScore
shapiro.test(DatasetA$ExamScore)
##
## Shapiro-Wilk normality test
##
## data: DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465
The Shapiro-Wilk p-value for the ExamScore normality test is lower than .05 (p = .006), so the data is not normally distributed.
cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")
## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9008825
The Spearman correlation test was selected because one of the variables (ExamScore) was not normally distributed according to the histograms and the Shapiro-Wilk tests. The p-value (probability value) is less than 2.2e-16, which is below .05. This means the results are statistically significant, and the alternative hypothesis is supported. The rho-value is 0.9008825. The correlation is positive, which means as StudyHours increases, ExamScore increases. The correlation value is greater than 0.50, which means the relationship is strong.
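The choice between a Pearson and a Spearman correlation based on the Shapiro-Wilk results can be sketched as a small helper. This is only an illustration of the decision rule, assuming Pearson is used when both variables pass the normality test at the .05 level and Spearman otherwise. The function name `choose_cor_method` and the example vectors are hypothetical.

```r
# Return "pearson" only when BOTH variables pass Shapiro-Wilk at the
# given alpha level; otherwise fall back to the rank-based "spearman".
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}

# Hypothetical, deterministic example vectors (not the assignment data):
# evenly spaced quantiles of a normal and of an exponential distribution.
normal_like <- qnorm(ppoints(100))
skewed      <- qexp(ppoints(100))

choose_cor_method(normal_like, normal_like)  # both pass the test
choose_cor_method(normal_like, skewed)       # one variable fails the test
```

The returned string can be passed straight to `cor.test(x, y, method = ...)`.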
ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)
The line of best fit points toward the top right. This means the direction of the data is positive: as StudyHours increases, ExamScore increases. The dots closely hug the line, which means there is a strong relationship between the variables. The dots form a straight-line pattern, which means the data is linear. No extreme outliers stand out. Although a few points fall above or below the line, they are close to the line of best fit and do not appear to seriously affect the relationship between the independent variable (StudyHours) and the dependent variable (ExamScore).
StudyHours (M = 6.14, SD = 1.37) was correlated with ExamScore (M = 90.07, SD = 6.80), ρ (rho) = 0.90, p < .001. The relationship was positive and strong. As StudyHours increased, ExamScore increased.
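The reporting sentence above can also be assembled from the test object rather than typed by hand, which avoids copying errors. The sketch below is one possible approach, not the method used in this report. The helper names `format_p` and `report_spearman` and the example vectors are hypothetical. It passes `exact = FALSE` to `cor.test()` so that R uses the approximate p-value directly and does not warn about ties.

```r
# Format a p-value in the usual reporting style: "p < .001" for very
# small values, otherwise three decimals without the leading zero.
format_p <- function(p) {
  if (p < .001) {
    "p < .001"
  } else {
    paste0("p = ", sub("^0", "", formatC(p, format = "f", digits = 3)))
  }
}

# Run a Spearman correlation and return a short reporting string.
report_spearman <- function(x, y) {
  res <- cor.test(x, y, method = "spearman", exact = FALSE)
  sprintf("rho = %.2f, %s", unname(res$estimate), format_p(res$p.value))
}

# Hypothetical example vectors (not the assignment data)
x <- 1:10
y <- c(2, 1, 3, 4, 5, 6, 7, 8, 10, 9)
report_spearman(x, y)
```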
============================== DATASET B: SCREEN TIME & SLEEPING HOURS ==============================
Independent Variable: ScreenTime
mean(DatasetB$ScreenTime)
## [1] 5.063296
sd(DatasetB$ScreenTime)
## [1] 2.056833
Dependent Variable: SleepingHours
mean(DatasetB$SleepingHours)
## [1] 6.938459
sd(DatasetB$SleepingHours)
## [1] 1.351332
Histogram for ScreenTime IV
hist(DatasetB$ScreenTime,
     main = "ScreenTime",
     breaks = 20,
     col = "lightpink",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)
The variable “ScreenTime” does not appear normally distributed. The data looks positively skewed (most of the data is on the left side, with a tail to the right), and the histogram does not form a proper bell curve.
Histogram for SleepingHours DV
hist(DatasetB$SleepingHours,
     main = "SleepingHours",
     breaks = 20,
     col = "green",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)
The variable “SleepingHours” appears normally distributed. The data looks symmetrical (most data is in the middle). The data also appears to have a proper bell curve.
Normality Test for ScreenTime
shapiro.test(DatasetB$ScreenTime)
##
## Shapiro-Wilk normality test
##
## data: DatasetB$ScreenTime
## W = 0.90278, p-value = 1.914e-06
The Shapiro-Wilk p-value for the ScreenTime normality test is lower than .05 (p = 1.914e-06), so the data is not normally distributed.
Normality Test for SleepingHours
shapiro.test(DatasetB$SleepingHours)
##
## Shapiro-Wilk normality test
##
## data: DatasetB$SleepingHours
## W = 0.98467, p-value = 0.3004
The Shapiro-Wilk p-value for the SleepingHours normality test is greater than .05 (p = .30), so the data can be treated as normally distributed.
cor.test(DatasetB$ScreenTime, DatasetB$SleepingHours, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: DatasetB$ScreenTime and DatasetB$SleepingHours
## S = 259052, p-value = 3.521e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.5544674
The Spearman correlation test was selected because one of the variables (ScreenTime) was not normally distributed according to the histograms and the Shapiro-Wilk tests. The p-value (probability value) is 3.521e-09, which is below .05. This means the results are statistically significant, and the alternative hypothesis is supported. The rho-value is -0.55. The correlation is negative, which means as ScreenTime increases, SleepingHours decreases. The absolute value of rho (0.55) is greater than 0.50, which means the relationship is strong.
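The direction and strength reading can be written as a small rule. Strength is judged on the absolute value of rho, so a negative coefficient is compared against 0.50 in the same way as a positive one. In the sketch below, only the 0.50 cutoff for "strong" comes from this report. The 0.30 and 0.10 cutoffs and the function name `describe_correlation` are assumptions added for illustration.

```r
# Describe a correlation coefficient by direction (sign) and
# strength (absolute size).
describe_correlation <- function(rho) {
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "no"
  size <- abs(rho)
  strength <- if (size > 0.50) {
    "strong"        # cutoff used in this report
  } else if (size > 0.30) {
    "moderate"      # assumed cutoff, not from the report
  } else if (size > 0.10) {
    "weak"          # assumed cutoff, not from the report
  } else {
    "negligible"
  }
  paste(direction, strength)
}

describe_correlation(0.9008825)    # "positive strong"
describe_correlation(-0.5544674)   # "negative strong"
```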
ggscatter(
  DatasetB,
  x = "ScreenTime",
  y = "SleepingHours",
  add = "reg.line",
  xlab = "ScreenTime",
  ylab = "SleepingHours"
)
The line of best fit points downward from left to right. This means the direction of the data is negative: as ScreenTime increases, SleepingHours decreases. The dots closely hug the line, which means there is a strong relationship between the variables. The dots form a straight-line pattern, which means the data is linear. No extreme outliers are apparent. Although some points lie slightly above or below the line, they are close to the line of best fit and do not seem to seriously affect the relationship between the independent variable (ScreenTime) and the dependent variable (SleepingHours).
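Checking the scatterplot for a straight-line pattern matters because Spearman's rho measures whether the relationship is monotonic (consistently increasing or decreasing), not whether it is linear. The sketch below uses made-up data, not the assignment datasets, to show a curved relationship that still gives a perfect Spearman correlation while the Pearson correlation falls short of 1.

```r
# Hypothetical data: y always increases with x, but along a curve.
x <- 1:20
y <- exp(x / 4)

pearson_r    <- cor(x, y, method = "pearson")    # sensitive to linearity
spearman_rho <- cor(x, y, method = "spearman")   # depends on rank order only

pearson_r      # noticeably below 1 because the pattern is curved
spearman_rho   # 1, because the rank order is perfectly preserved
```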
ScreenTime (M = 5.06, SD = 2.06) was correlated with SleepingHours (M = 6.94, SD = 1.35), ρ (rho) = -0.55, p < .001. The relationship was negative and strong. As ScreenTime increased, SleepingHours decreased.