Load the Installed Packages

library(readxl)
library(ggpubr)

## Loading required package: ggplot2

============================== DATASET A: STUDY HOURS & EXAM SCORE ==============================

Import & Name Datasets

DatasetA <- read_excel("/Users/atharvapitke/Documents/Analytics/Assignments/DatasetA.xlsx")
DatasetB <- read_excel("/Users/atharvapitke/Documents/Analytics/Assignments/DatasetB.xlsx")

Descriptive Statistics for Dataset A

Independent Variable: StudyHours

mean(DatasetA$StudyHours)

## [1] 6.135609

sd(DatasetA$StudyHours)

## [1] 1.369224

Dependent Variable: ExamScore

mean(DatasetA$ExamScore)

## [1] 90.06906

sd(DatasetA$ExamScore)

## [1] 6.795224

Histograms and Visual Normality Check for Dataset A

Histogram for StudyHours IV

hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

Histogram for ExamScore DV

hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

Statistical Test of Normality for Dataset A

Normality Test for StudyHours

shapiro.test(DatasetA$StudyHours)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349

Normality Test for ExamScore

shapiro.test(DatasetA$ExamScore)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465

Correlation Test for Dataset A

cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")

## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9008825

The Spearman correlation test was selected because the Shapiro-Wilk test indicated that ExamScore was not normally distributed (p = 0.006465), although StudyHours did not depart significantly from normality (p = 0.9349). The p-value (probability value) is less than 2.2e-16, which is below .05. This means the results are statistically significant, and the alternative hypothesis is supported. The rho value is 0.9008825. The correlation is positive, which means that as StudyHours increases, ExamScore increases. The correlation value is greater than 0.50, which means the relationship is strong.
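The choice between a Pearson and a Spearman test described above can be sketched as a small helper. This is only an illustration: `choose_cor_method` is a hypothetical name, the .05 cutoff mirrors the one used in this report, and the data are simulated stand-ins rather than DatasetA.

```r
# Hypothetical helper: pick the correlation method from two Shapiro-Wilk tests.
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  # Pearson only when neither test rejects normality; otherwise Spearman.
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}

set.seed(1)
hours <- rnorm(100, mean = 6, sd = 1.4)   # simulated, roughly normal
score <- 100 - rexp(100, rate = 1 / 8)    # simulated, strongly skewed

method <- choose_cor_method(hours, score)
# exact = FALSE requests the asymptotic p-value, which avoids the
# "Cannot compute exact p-value with ties" warning for rank-based tests.
result <- cor.test(hours, score, method = method, exact = FALSE)
print(method)
print(result$estimate)
```

One non-normal variable is enough for the helper to return "spearman", which matches the situation in Dataset A.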

Scatterplot for Dataset A

ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)

The line of best fit points toward the top right. This means the direction of the data is positive. As StudyHours increases, ExamScore increases. The dots closely hug the line. This means there is a strong relationship between the variables. The dots form a straight-line pattern. This means the data is linear. No extreme outliers stand out. Even though a few of the points fall above or below the line, they are near the line of best fit and do not appear to seriously affect the relationship between the independent variable (StudyHours) and the dependent variable (ExamScore).
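The visual judgment about outliers could be backed by a numeric check on the residuals of a straight-line fit. The sketch below is an illustration under stated assumptions: the data frame `sim` is simulated and is not DatasetA, and the cutoff of 3 for the absolute standardized residual is a common convention, not something taken from this report.

```r
set.seed(42)
# Simulated stand-in for DatasetA (not the real data)
sim <- data.frame(StudyHours = rnorm(100, mean = 6, sd = 1.4))
sim$ExamScore <- 60 + 5 * sim$StudyHours + rnorm(100, sd = 3)

# Least-squares line of ExamScore on StudyHours
fit <- lm(ExamScore ~ StudyHours, data = sim)

# Standardized residuals: vertical distance from the line in SD units
std_res <- rstandard(fit)

# Flag points far from the line; the cutoff of 3 is an assumed convention
outliers <- which(abs(std_res) > 3)
cat("Points flagged as extreme:", length(outliers), "\n")
```

If no points are flagged, that supports the reading that extreme outliers are not driving the relationship.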

Results for DatasetA

StudyHours (M = 6.135609, SD = 1.369224) was correlated with ExamScore (M = 90.06906, SD = 6.795224), ρ (rho) = 0.9008825, p < 2.2e-16. The relationship was positive and strong. As StudyHours increased, ExamScore increased.

============================== DATASET B: SCREEN TIME & SLEEPING HOURS ==============================

Descriptive Statistics for Dataset B

Independent Variable: ScreenTime

mean(DatasetB$ScreenTime)

## [1] 5.063296

sd(DatasetB$ScreenTime)

## [1] 2.056833

Dependent Variable: SleepingHours

mean(DatasetB$SleepingHours)

## [1] 6.938459

sd(DatasetB$SleepingHours)

## [1] 1.351332

Histograms and Visual Normality Check for Dataset B

Histogram for ScreenTime IV

hist(DatasetB$ScreenTime,
     main = "ScreenTime",
     breaks = 20,
     col = "lightpink",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

Histogram for SleepingHours DV

hist(DatasetB$SleepingHours,
     main = "SleepingHours",
     breaks = 20,
     col = "green",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

Statistical Test of Normality for Dataset B

Normality Test for ScreenTime

shapiro.test(DatasetB$ScreenTime)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$ScreenTime
## W = 0.90278, p-value = 1.914e-06

Normality Test for SleepingHours

shapiro.test(DatasetB$SleepingHours)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$SleepingHours
## W = 0.98467, p-value = 0.3004

Correlation Test for Dataset B

cor.test(DatasetB$ScreenTime, DatasetB$SleepingHours, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  DatasetB$ScreenTime and DatasetB$SleepingHours
## S = 259052, p-value = 3.521e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5544674

The Spearman correlation test was selected because the Shapiro-Wilk test indicated that ScreenTime was not normally distributed (p = 1.914e-06), although SleepingHours did not depart significantly from normality (p = 0.3004). The p-value (probability value) is 3.521e-09, which is below .05. This means the results are statistically significant, and the alternative hypothesis is supported. The rho value is -0.55. The correlation is negative, which means that as ScreenTime increases, SleepingHours decreases. The absolute value of the correlation (0.55) is greater than 0.50, which means the relationship is strong.
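Because the sign of rho gives the direction and its absolute value gives the strength, the interpretation rule can be written as a short helper. `describe_correlation` is a hypothetical name; the 0.50 cutoff for "strong" follows this report, while the 0.30 cutoff for "moderate" is an assumption added for illustration.

```r
# Hypothetical helper: describe direction and strength of a correlation.
describe_correlation <- function(rho) {
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "none"
  size <- abs(rho)  # strength depends on magnitude, not sign
  # 0.50 follows the report; 0.30 is an assumed cutoff
  strength <- if (size > 0.50) "strong" else if (size > 0.30) "moderate" else "weak"
  paste(direction, strength)
}

describe_correlation(-0.5544674)  # "negative strong"
describe_correlation(0.9008825)   # "positive strong"
```

Comparing `abs(rho)` to the cutoff avoids the trap of treating -0.55 as "greater than -0.50", which is false as a signed comparison.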

Scatterplot for Dataset B

ggscatter(
  DatasetB,
  x = "ScreenTime",
  y = "SleepingHours",
  add = "reg.line",
  xlab = "ScreenTime",
  ylab = "SleepingHours"
)

The line of best fit points downward from left to right. This means the direction of the data is negative. As ScreenTime increases, SleepingHours decreases. The dots closely hug the line. This means there is a strong relationship between the variables. The dots form a straight-line pattern. This means the data is linear. No extreme outliers are apparent. Although some of the points lie slightly above or below the line, they are close to the line of best fit and do not seem to have a serious effect on the relationship between the independent variable (ScreenTime) and the dependent variable (SleepingHours).

Results for DatasetB

ScreenTime (M = 5.063296, SD = 2.056833) was correlated with SleepingHours (M = 6.938459, SD = 1.351332), ρ (rho) = -0.5544674, p = 3.521e-09. The relationship was negative and strong. As ScreenTime increased, SleepingHours decreased.
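A results line like the one above is often reported with rounded values. The sketch below shows one possible way to build such a string. The helper names `format_p` and `report_spearman` are hypothetical, and the rounding rules (two decimals for rho, "p < .001" for very small p-values, no leading zero) are an assumed reporting convention, not a requirement stated in this report.

```r
# Hypothetical helper: format a p-value without a leading zero.
format_p <- function(p) {
  if (p < 0.001) {
    "p < .001"
  } else {
    paste0("p = ", sub("^0", "", sprintf("%.3f", p)))
  }
}

# Hypothetical helper: combine rho and p into one reporting fragment.
report_spearman <- function(rho, p) {
  paste0("rho = ", sprintf("%.2f", rho), ", ", format_p(p))
}

report_spearman(-0.5544674, 3.521e-09)  # "rho = -0.55, p < .001"
```

The inputs here are the rho and p-value printed by `cor.test` for Dataset B.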

Assignment4

Atharva Pitke

2026-02-04