Assignment4

library(readxl)
library(ggpubr)

## Loading required package: ggplot2

DatasetA <- read_excel("/Users/ishujha/Downloads/AppliedAnalytics/DatasetA.xlsx")
DatasetB <- read_excel("/Users/ishujha/Downloads/AppliedAnalytics/DatasetB.xlsx")

mean(DatasetA$StudyHours)

## [1] 6.135609

StudyHours is the IV.

sd(DatasetA$StudyHours)

## [1] 1.369224

mean(DatasetA$ExamScore)

## [1] 90.06906

ExamScore is the DV.

sd(DatasetA$ExamScore)

## [1] 6.795224

hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “StudyHours” appears normally distributed. The data looks symmetrical (most data is in the middle). The data also appears to have a proper bell curve. The variable “ExamScore” does not appear to be normally distributed. The data looks slightly negatively skewed (most data is on the right). The data appears to be too flat.
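A visual judgment of skew can be backed by a number. The sketch below is a minimal illustration, assuming a moment-based skewness formula; `sample_skewness` is a hypothetical helper (not a base R function) and the vectors are made-up values, not the assignment data. A negative result indicates a longer left tail (most data on the right), a positive result a longer right tail, and a value near zero a roughly symmetrical shape.

```r
# Hypothetical helper: moment-based sample skewness using base R only.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Made-up values for illustration (not the assignment data).
symmetric_values    <- c(1, 2, 3, 4, 5)    # evenly spread around the mean
right_tailed_values <- c(1, 1, 1, 2, 10)   # one large value pulls the tail right
left_tailed_values  <- -right_tailed_values # mirror image, tail on the left

print(sample_skewness(symmetric_values))
print(sample_skewness(right_tailed_values))
print(sample_skewness(left_tailed_values))
```

The same helper could be applied to a column such as `DatasetA$ExamScore` to check the sign of the skew seen in the histogram.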

shapiro.test(DatasetA$StudyHours)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349

shapiro.test(DatasetA$ExamScore)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465

The Shapiro-Wilk p-value for the StudyHours normality test is greater than .05 (p = .93), so the data is treated as normal. The Shapiro-Wilk p-value for the ExamScore normality test is less than .05 (p = .006), so the data is not normal. Because ExamScore fails the normality test (p < .05), we will use the Spearman correlation.
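The decision rule used here (Pearson only when neither variable fails the Shapiro-Wilk test, otherwise Spearman) can be sketched as a small function. This is an illustration under stated assumptions: `choose_cor_method` is a hypothetical helper, the .05 cutoff is the one used in this report, and the data are simulated rather than taken from DatasetA.

```r
# Hypothetical helper: choose a correlation method from Shapiro-Wilk p-values.
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  # Pearson only if neither test rejects normality; otherwise Spearman.
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}

# Simulated data (not the assignment data): one normal and one heavily skewed variable.
set.seed(1)
normal_var <- rnorm(100, mean = 6, sd = 1.4)
skewed_var <- rexp(100, rate = 1)

method <- choose_cor_method(normal_var, skewed_var)
print(method)
```

The returned string can be passed straight to the `method` argument of `cor.test()`.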

cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")

## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9008825

The Spearman Correlation test was selected because ExamScore was not normally distributed according to the histogram and the Shapiro-Wilk test. The p-value (probability value) is less than 2.2e-16, which is below .05. This means the results are statistically significant. The alternative hypothesis is supported. The rho-value is 0.9008825. The correlation is positive, which means as StudyHours increases, ExamScore increases. The correlation value is greater than 0.50, which means the relationship is strong.
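The warning "Cannot compute exact p-value with ties" in the output above appears because some values repeat, so R cannot use the exact Spearman distribution and reports an approximate p-value instead. To my understanding, passing `exact = FALSE` to `cor.test()` requests that approximation directly and avoids the warning without changing rho. A minimal sketch with made-up tied values (not the assignment data):

```r
# Made-up data containing tied values (not the assignment data).
x <- c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8)
y <- c(2, 3, 3, 5, 4, 6, 7, 7, 9, 10)

# exact = FALSE asks for the approximate p-value, so no ties warning is raised.
result <- cor.test(x, y, method = "spearman", exact = FALSE)

print(result$estimate)  # Spearman's rho
print(result$p.value)   # approximate p-value
```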

library("ggpubr")
ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)

The line of best fit is pointing to the top right. This means the direction of the data is positive. As StudyHours increases, ExamScore increases. The dots closely hug the line. This means there is a strong relationship between the variables. The dots form a straight-line pattern. This means the data is linear. There are no outliers.

What is the relationship between how much students study (hours) and their exam score (percentage)? The independent variable StudyHours (M = 6.135609, SD = 1.369224) was correlated with the dependent variable ExamScore (M = 90.06906, SD = 6.795224), ρ = 0.9008825, p < 2.2e-16. The relationship was positive and strong. As StudyHours increased, ExamScore increased.
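The descriptive part of a write-up like this can be built from the computed values rather than typed by hand. The sketch below is only an illustration: `format_descriptives` is a hypothetical helper, and rounding to two decimals is an assumed reporting convention, not something this assignment specifies.

```r
# Hypothetical helper; two-decimal rounding is an assumed convention.
format_descriptives <- function(name, m, s) {
  sprintf("%s (M = %.2f, SD = %.2f)", name, m, s)
}

# Values copied from the output reported earlier in this document.
study_text <- format_descriptives("StudyHours", 6.135609, 1.369224)
exam_text  <- format_descriptives("ExamScore", 90.06906, 6.795224)

print(study_text)
print(exam_text)
```

In practice the numbers would come from `mean()` and `sd()` on the dataset columns instead of being typed in.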

Means and standard deviations of StudyHours: Mean - 6.135609, SD - 1.369224

Means and standard deviations of ExamScore: Mean - 90.06906, SD - 6.795224

Correlation coefficient (r or ρ): 0.9008825

p-value: < 2.2e-16

Strength and direction of the relationship: Strength is strong and direction is positive.

mean(DatasetB$ScreenTime)

## [1] 5.063296

ScreenTime is the IV.

sd(DatasetB$ScreenTime)

## [1] 2.056833

mean(DatasetB$SleepingHours)

## [1] 6.938459

SleepingHours is the DV.

sd(DatasetB$SleepingHours)

## [1] 1.351332

hist(DatasetB$ScreenTime,
     main = "ScreenTime",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

hist(DatasetB$SleepingHours,
     main = "SleepingHours",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “ScreenTime” does not appear to be normally distributed. The data is positively skewed (most data is on the left). The data does not appear to have a proper bell curve. The variable “SleepingHours” appears to be normally distributed. The data looks symmetrical (most data is in the middle). The data appears to have a proper bell curve.

shapiro.test(DatasetB$ScreenTime)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$ScreenTime
## W = 0.90278, p-value = 1.914e-06

shapiro.test(DatasetB$SleepingHours)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$SleepingHours
## W = 0.98467, p-value = 0.3004

The Shapiro-Wilk p-value for the ScreenTime normality test is less than .05 (p = .000001914), so the data is not normal. The Shapiro-Wilk p-value for the SleepingHours normality test is greater than .05 (p = .30), so the data is treated as normal. Because ScreenTime fails the normality test (p < .05), we will use the Spearman correlation.

cor.test(DatasetB$ScreenTime, DatasetB$SleepingHours, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  DatasetB$ScreenTime and DatasetB$SleepingHours
## S = 259052, p-value = 3.521e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5544674

The Spearman Correlation test was selected because ScreenTime was not normally distributed according to the histogram and the Shapiro-Wilk test. The p-value (probability value) is 3.521e-09, which is below .05. This means the results are statistically significant. The alternative hypothesis is supported. The rho-value is -0.5544674. The correlation is negative, which means as ScreenTime increases, SleepingHours decreases. The absolute value of the correlation (0.55) is greater than 0.50, which means the relationship is strong.
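Strength is judged on the size of rho regardless of its sign, while direction comes from the sign. A minimal sketch of that rule follows; `describe_correlation` is a hypothetical helper, the 0.50 cutoff for "strong" is the one used in this report, and the 0.30 cutoff for "moderate" is an assumption added only for illustration.

```r
# Hypothetical helper: label a correlation by strength (absolute value) and direction (sign).
# The 0.50 cutoff follows this report; the 0.30 cutoff is an assumed value.
describe_correlation <- function(rho) {
  strength <- if (abs(rho) > 0.50) {
    "strong"
  } else if (abs(rho) > 0.30) {
    "moderate"
  } else {
    "weak"
  }
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "none"
  paste(strength, direction)
}

# rho values copied from the two cor.test() outputs in this document.
print(describe_correlation(0.9008825))
print(describe_correlation(-0.5544674))
```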

library("ggpubr")
ggscatter(
  DatasetB,
  x = "ScreenTime",
  y = "SleepingHours",
  add = "reg.line",
  xlab = "ScreenTime",
  ylab = "SleepingHours"
)

The line of best fit is pointing to the bottom right. This means the direction of the data is negative. As ScreenTime increases, SleepingHours decreases. The dots closely hug the line. This means there is a strong relationship between the variables. The dots form a curved pattern. This means the data is non-linear. There are no outliers.
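A curved pattern is one more reason a rank-based correlation fits this kind of data: Spearman's rho measures whether one variable consistently rises or falls with the other, not whether the points follow a straight line. The sketch below uses a made-up decreasing curve (not DatasetB) to show how the two coefficients can differ.

```r
# Made-up curved, strictly decreasing relationship (not the assignment data).
x <- 1:20
y <- exp(-0.3 * x)

# Pearson measures straight-line association; Spearman measures monotonic association.
pearson_r    <- cor(x, y, method = "pearson")
spearman_rho <- cor(x, y, method = "spearman")

print(pearson_r)
print(spearman_rho)
```

Because `y` always decreases as `x` increases, the ranks are perfectly reversed and Spearman's rho reaches its limit, while Pearson's r is weaker because the curve is not a straight line.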

What is the relationship between how much a person uses their phone (hours) and how much they sleep (hours)? The independent variable ScreenTime (M = 5.063296, SD = 2.056833) was correlated with the dependent variable SleepingHours (M = 6.938459, SD = 1.351332), ρ = -0.5544674, p = 3.521e-09. The relationship was negative and strong. As ScreenTime increased, SleepingHours decreased.

Means and standard deviations of ScreenTime: Mean - 5.063296, SD - 2.056833

Means and standard deviations of SleepingHours: Mean - 6.938459, SD - 1.351332

Correlation coefficient (r or ρ): -0.5544674

p-value: 3.521e-09

Strength and direction of the relationship: Strength is strong and direction is negative.

Ishu Jha

2026-02-04