Assignment-4

install.packages("readxl")
install.packages("ggpubr")

library(readxl)
library(ggpubr)

## Loading required package: ggplot2

Loading Dataset A

DatasetA <- read_excel("C:/Users/vamsh/Downloads/DatasetA.xlsx")

In Dataset A, the independent variable is Study Hours and the dependent variable is Exam Score.

Descriptive Statistics:

mean(DatasetA$StudyHours)

## [1] 6.135609

sd(DatasetA$StudyHours)

## [1] 1.369224

For Study Hours, Mean = 6.136, SD = 1.369

mean(DatasetA$ExamScore)

## [1] 90.06906

sd(DatasetA$ExamScore)

## [1] 6.795224

For Exam Score, Mean = 90.069, SD = 6.795
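
The four separate calls above could also be collapsed into one helper that returns the mean and SD for every column. The following is only a sketch: `demo` is a simulated, hypothetical stand-in for DatasetA (the real values come from the Excel file), and `describe` is an illustrative name, not a function from any package.

```r
# Hypothetical stand-in for DatasetA; the real data come from DatasetA.xlsx.
set.seed(1)
demo <- data.frame(
  StudyHours = rnorm(100, mean = 6, sd = 1.4),
  ExamScore  = rnorm(100, mean = 90, sd = 7)
)

# Mean and SD for every column, rounded to three decimals.
describe <- function(df) {
  round(sapply(df, function(col) c(Mean = mean(col), SD = sd(col))), 3)
}

desc <- describe(demo)
print(desc)
```

Calling `describe(DatasetA)` on the real data should reproduce the values reported above, provided every column is numeric.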

Checking Normality

hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “Study Hours” appears normally distributed. The histogram is roughly symmetrical, with most values in the middle, and it has a bell-curve shape.

hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “Exam Score” does not appear normally distributed. The histogram shows some skewness and does not form a proper bell curve.
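
A normal Q-Q plot is a common visual companion to the histogram, because skewness shows up as points bending away from a straight reference line. The sketch below uses a simulated, deliberately skewed sample (`scores` is hypothetical, not the assignment data) and only base R functions.

```r
# Hypothetical right-skewed sample, used only to illustrate the plot.
set.seed(2)
scores <- rexp(100)

# Points that curve away from the red reference line suggest non-normality.
qqnorm(scores, main = "Normal Q-Q plot (simulated data)")
qqline(scores, col = "red")
```

Replacing `scores` with `DatasetA$ExamScore` would give the equivalent check for the real variable.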

Statistical Normality Checking:

shapiro.test(DatasetA$StudyHours)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349

shapiro.test(DatasetA$ExamScore)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465

The p-value for Study Hours (p = .935) is greater than .05, so the test shows no significant departure from normality. The p-value for Exam Score (p = .006) is less than .05, indicating the data are not normally distributed.
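
The same decision rule (p greater than .05 means no evidence against normality) can be applied to every variable in one step. This is a minimal sketch on simulated data: `demo`, `check_normality`, and the `alpha` argument are illustrative assumptions, while `shapiro.test` is the base R function already used above.

```r
# Hypothetical data: one normal column and one right-skewed column.
set.seed(3)
demo <- data.frame(
  symmetric = rnorm(100),
  skewed    = rexp(100)
)

# Run Shapiro-Wilk on each column and flag whether p exceeds alpha.
check_normality <- function(df, alpha = 0.05) {
  p <- sapply(df, function(col) shapiro.test(col)$p.value)
  data.frame(variable = names(p),
             p_value = signif(p, 4),
             looks_normal = p > alpha,
             row.names = NULL)
}

result <- check_normality(demo)
print(result)
```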

CORRELATION ANALYSIS:

A Spearman correlation was selected for Dataset A because at least one variable (Exam Score) is not normally distributed according to the histograms and Shapiro-Wilk tests.

cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")

## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9008825

The p-value for the Spearman correlation is below .05, so the result is statistically significant and the alternative hypothesis is supported. The rho value is 0.90, so the correlation is positive: as Study Hours increases, Exam Score also increases. The correlation is strong because rho is between 0.5 and 1.0.
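
The warning in the output above appears because tied values prevent an exact p-value, so R falls back to an approximation. Passing `exact = FALSE` requests the approximation directly and avoids the warning. The sketch below shows this on simulated data and adds a small labelling helper. The data, the `strength` function, and the 0.3 cutoff for "moderate" are assumptions for illustration; only the 0.5 cutoff for "strong" comes from this report.

```r
# Hypothetical data; rounding creates ties, as in the real dataset.
set.seed(4)
hours <- round(rnorm(100, 6, 1.4), 1)
score <- round(70 + 3 * hours + rnorm(100, 0, 2))

# exact = FALSE asks for the asymptotic p-value, so no ties warning is raised.
res <- cor.test(hours, score, method = "spearman", exact = FALSE)
rho <- unname(res$estimate)

# Label the strength from the absolute value of the coefficient.
# The 0.3 boundary is an assumed convention, not taken from the report.
strength <- function(r) {
  a <- abs(r)
  if (a >= 0.5) "strong" else if (a >= 0.3) "moderate" else "weak"
}

print(strength(rho))
```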

SCATTERPLOT FOR DATASET A

ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "Study Hours",
  ylab = "Exam Score"
)

The line of best fit slopes upward to the right, so the direction of the relationship is positive: as Study Hours increases, Exam Score increases. The dots hug the line, indicating a strong relationship between the variables. The dots form a straight-line pattern, meaning the relationship is linear. There might be an outlier, but all the dots are close to the line, so it does not affect the relationship between the variables.
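
One way to back up the claim that a possible outlier does not change the result is to recompute the correlation without the most extreme point and compare. The sketch below does this on simulated data with one injected outlier. All names and values are hypothetical, and flagging the largest residual is just one simple choice among several.

```r
# Hypothetical linear data with a single injected outlier at position 1.
set.seed(5)
x <- rnorm(100, 6, 1.4)
y <- 70 + 3 * x + rnorm(100, 0, 2)
y[1] <- y[1] + 15

# Flag the point with the largest absolute residual from the fitted line.
fit <- lm(y ~ x)
suspect <- unname(which.max(abs(resid(fit))))

# Compare Spearman's rho with and without the flagged point.
rho_all  <- cor(x, y, method = "spearman")
rho_drop <- cor(x[-suspect], y[-suspect], method = "spearman")
print(c(all = rho_all, without_suspect = rho_drop))
```

If the two values are nearly identical, the flagged point has little influence on the rank correlation.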

Reporting Results for Dataset A: StudyHours (M = 6.14, SD = 1.37) was significantly correlated with ExamScore (M = 90.07, SD = 6.80), ρ(98) = .90, p < .001. The relationship was positive and strong. As StudyHours increased, ExamScore increased.
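
The number in parentheses after ρ is the degrees of freedom, n − 2, so ρ(98) implies 100 paired observations. The helper below is a hypothetical sketch of how the reporting string could be assembled from the test output. `format_spearman` is an illustrative name, and n = 100 is inferred from the reported degrees of freedom rather than read from the data.

```r
# Build a report string such as "rho(98) = .90, p < .001".
format_spearman <- function(rho, p, n) {
  # Drop the leading zero, as is conventional for values bounded by 1.
  rho_txt <- sub("^(-?)0\\.", "\\1.", sprintf("%.2f", rho))
  p_txt <- if (p < 0.001) {
    "p < .001"
  } else {
    sub("0\\.", ".", sprintf("p = %.3f", p))
  }
  sprintf("rho(%d) = %s, %s", n - 2, rho_txt, p_txt)
}

# Values copied from the Dataset A output; n is assumed to be 100.
format_spearman(0.9008825, 2.2e-16, 100)
```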

Loading Dataset B

DatasetB <- read_excel("C:/Users/vamsh/Downloads/DatasetB.xlsx")

Descriptive Statistics:

mean(DatasetB$ScreenTime)

## [1] 5.063296

sd(DatasetB$ScreenTime)

## [1] 2.056833

For Screen Time, Mean = 5.063, SD = 2.057

mean(DatasetB$SleepingHours)

## [1] 6.938459

sd(DatasetB$SleepingHours)

## [1] 1.351332

For Sleeping Hours, Mean = 6.938, SD = 1.351

Checking Normality:

hist(DatasetB$ScreenTime,
     main = "ScreenTime",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “Screen Time” does not appear normally distributed. The histogram shows some skewness and does not form a proper bell curve.

hist(DatasetB$SleepingHours,
     main = "SleepingHours",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “Sleeping Hours” appears normally distributed. The histogram is roughly symmetrical, with most values in the middle, and it has a bell-curve shape.

Statistical Normality Checking:

shapiro.test(DatasetB$ScreenTime)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$ScreenTime
## W = 0.90278, p-value = 1.914e-06

shapiro.test(DatasetB$SleepingHours)

## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$SleepingHours
## W = 0.98467, p-value = 0.3004

The p-value for Screen Time (p < .001) is less than .05, indicating the data are not normally distributed. The p-value for Sleeping Hours (p = .300) is greater than .05, so the test shows no significant departure from normality.

CORRELATION ANALYSIS:

A Spearman correlation was selected for Dataset B because at least one variable (Screen Time) is not normally distributed according to the histograms and Shapiro-Wilk tests.

cor.test(DatasetB$ScreenTime, DatasetB$SleepingHours, method = "spearman")

## 
##  Spearman's rank correlation rho
## 
## data:  DatasetB$ScreenTime and DatasetB$SleepingHours
## S = 259052, p-value = 3.521e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5544674

The p-value for the Spearman correlation is below .05, so the result is statistically significant and the alternative hypothesis is supported. The rho value is -0.55, so the correlation is negative: as Screen Time increases, Sleeping Hours decreases. The correlation is strong because the absolute value of rho (0.55) is between 0.5 and 1.0.
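
Spearman's rho is robust to the skew in Screen Time because it is simply a Pearson correlation computed on the ranks of the values rather than on the values themselves. The sketch below checks this equivalence on simulated data. `screen` and `sleep` are hypothetical vectors, not the contents of DatasetB.

```r
# Hypothetical data: a skewed predictor and an outcome that falls as it rises.
set.seed(6)
screen <- rexp(100, rate = 0.2)
sleep  <- 8 - 0.3 * screen + rnorm(100, 0, 1)

# Built-in Spearman coefficient versus Pearson correlation on the ranks.
rho_builtin <- cor(screen, sleep, method = "spearman")
rho_manual  <- cor(rank(screen), rank(sleep))

# The two agree; the negative sign reflects the decreasing relationship.
print(all.equal(rho_builtin, rho_manual))
```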

Scatter Plot for Dataset B:

ggscatter(
  DatasetB,
  x = "ScreenTime",
  y = "SleepingHours",
  add = "reg.line",
  xlab = "Screen Time",
  ylab = "Sleeping Hours"
)

The line of best fit slopes downward, indicating a negative relationship: as Screen Time increases, Sleeping Hours decreases. The dots cluster loosely around the line, still indicating a strong relationship between the variables. The pattern is linear, with no extreme outliers, so outliers do not distort the relationship between the variables.

Reporting the Results for Dataset B: ScreenTime (M = 5.06, SD = 2.06) was significantly correlated with SleepingHours (M = 6.94, SD = 1.35), ρ(98) = −.55, p < .001. The relationship was negative and strong. As ScreenTime increased, SleepingHours decreased.

Hema Vamsinath Reddy

2026-02-04