Load packages

library(readxl)
library(ggpubr)
## Loading required package: ggplot2

Importing data sets

DatasetA <- read_excel("C:/Users/DELL/Documents/Applied Analytics/Assignment4/DatasetA.xlsx")
DatasetB <- read_excel("C:/Users/DELL/Documents/Applied Analytics/Assignment4/DatasetB.xlsx")

Dataset A -

Calculate the Descriptive Statistics for DatasetA IV - StudyHours

mean(DatasetA$StudyHours)
## [1] 6.135609
sd(DatasetA$StudyHours)
## [1] 1.369224

Calculate the Descriptive Statistics for DatasetA DV - ExamScore

mean(DatasetA$ExamScore)
## [1] 90.06906
sd(DatasetA$ExamScore)
## [1] 6.795224
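
The mean and standard deviation calls above can be collected into one table. This is a minimal sketch, assuming a hypothetical helper named `describe_columns` (not part of the original analysis) and a small made-up data frame in place of the assignment data.

```r
# Hypothetical helper: mean (M) and standard deviation (SD) for chosen columns.
describe_columns <- function(df, cols) {
  data.frame(
    variable = cols,
    M  = vapply(cols, function(col) mean(df[[col]], na.rm = TRUE),
                numeric(1), USE.NAMES = FALSE),
    SD = vapply(cols, function(col) sd(df[[col]], na.rm = TRUE),
                numeric(1), USE.NAMES = FALSE)
  )
}

# Illustrative values only, not the assignment data
demo <- data.frame(hours = c(2, 4, 6, 8), score = c(50, 60, 70, 80))
out <- describe_columns(demo, c("hours", "score"))
print(out)
```

With the assignment data, the equivalent call would be `describe_columns(DatasetA, c("StudyHours", "ExamScore"))`.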

Create Histogram for DatasetA-IV

hist(DatasetA$StudyHours,
     main = "StudyHours",
     breaks = 20,
     col = "lightblue",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “StudyHours” appears approximately normally distributed. The histogram has a bell-curve shape and is approximately symmetrical, with most of the data in the middle and no pronounced skew.

Create Histogram for DatasetA-DV

hist(DatasetA$ExamScore,
     main = "ExamScore",
     breaks = 20,
     col = "lightcoral",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “ExamScore” does not appear normally distributed. The histogram appears negatively skewed, and its shape is slightly tall (peaked).
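
The visual reading of skew can be cross-checked with a number. This is a minimal sketch, assuming a hypothetical helper `sample_skewness` (not part of the original analysis) that computes the moment-based skewness coefficient. Positive values indicate a longer right tail, negative values a longer left tail, and values near zero an approximately symmetrical shape. The vectors are made up for illustration.

```r
# Hypothetical helper: moment-based sample skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  m2 <- mean((x - m)^2)  # second central moment
  m3 <- mean((x - m)^3)  # third central moment
  m3 / m2^1.5
}

# Illustrative vectors, not the assignment data
right_tailed <- c(1, 1, 2, 2, 3, 10)
left_tailed  <- -right_tailed
symmetric    <- c(1, 2, 3, 4, 5)

sample_skewness(right_tailed)  # positive
sample_skewness(left_tailed)   # negative
sample_skewness(symmetric)     # 0
```

For the assignment data, the call would be `sample_skewness(DatasetA$ExamScore)`.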

Statistically Test Normality for DatasetA - IV

shapiro.test(DatasetA$StudyHours)
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349

The Shapiro–Wilk p-value for the StudyHours normality test is greater than .05 (p = .93), so the null hypothesis of normality is not rejected and the data are treated as normal.

Statistically Test Normality for DatasetA - DV

shapiro.test(DatasetA$ExamScore)
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465

The Shapiro–Wilk p-value for the ExamScore normality test is less than .05 (p = .006), so the data are not normally distributed.
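
The two normality checks above follow the same decision rule (p greater than .05 means the variable is treated as normal), so they can be tabulated in one pass. This is a minimal sketch, assuming a hypothetical helper `normality_table` and deterministic demo columns built from theoretical quantiles. Neither is part of the original analysis.

```r
# Hypothetical helper: Shapiro-Wilk W and p-value for every column of a data frame.
normality_table <- function(df, alpha = 0.05) {
  rows <- lapply(names(df), function(col) {
    res <- shapiro.test(df[[col]])
    data.frame(
      variable = col,
      W = unname(res$statistic),
      p_value = res$p.value,
      treat_as_normal = res$p.value > alpha
    )
  })
  do.call(rbind, rows)
}

# Illustrative columns, not the assignment data
demo <- data.frame(
  bell   = qnorm(ppoints(50)),  # quantiles of a normal distribution
  skewed = qexp(ppoints(50))    # quantiles of a right-skewed distribution
)
result <- normality_table(demo)
print(result)
```

With the assignment data, the call would be `normality_table(DatasetA[, c("StudyHours", "ExamScore")])`.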

Conduct Correlation Test (Test Hypotheses) for DatasetA

cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman")
## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9008825

The Spearman correlation test was selected because at least one variable was not normally distributed: ExamScore departed from normality according to its histogram and the Shapiro–Wilk test (p = .006), while StudyHours appeared normal (p = .93). The p-value is less than 2.2e-16, which is below .05, so the result is statistically significant and the alternative hypothesis is supported. Because the data contain ties, R reports an approximate rather than an exact p-value. The rho value is 0.9008825. The correlation is positive, which means that as hours of study increase, exam scores increase. The correlation value is greater than 0.50, which means the relationship is strong.
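
The choice of test can be written as an explicit rule: use Pearson only when both variables pass the Shapiro–Wilk check, and Spearman otherwise. This is a minimal sketch of that rule, assuming a hypothetical helper `choose_cor_method` and made-up demo vectors. Passing `exact = FALSE` to `cor.test` requests the approximate p-value directly, which avoids the ties warning shown in the output above.

```r
# Hypothetical helper: choose the correlation method from two normality checks.
choose_cor_method <- function(x, y, alpha = 0.05) {
  both_normal <- shapiro.test(x)$p.value > alpha &&
    shapiro.test(y)$p.value > alpha
  if (both_normal) "pearson" else "spearman"
}

# Illustrative vectors, not the assignment data
x <- qnorm(ppoints(50))                  # approximately normal
y <- exp(x + 0.5 * sin(seq_along(x)))    # right-skewed, rises with x

method <- choose_cor_method(x, y)
res <- cor.test(x, y, method = method, exact = FALSE)
print(method)
print(res$estimate)
```

This is one common convention rather than the only defensible one. Some analysts also weigh sample size and the scatterplot before choosing a method.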

Create a Scatterplot to Visualize the Relationship for DatasetA

ggscatter(
  DatasetA,
  x = "StudyHours",
  y = "ExamScore",
  add = "reg.line",
  xlab = "StudyHours",
  ylab = "ExamScore"
)

The line of best fit points toward the top right, so the direction of the relationship is positive: as study hours increase, exam scores increase. The dots closely hug the line, which indicates a strong relationship between the variables. The dots form a straight-line pattern, which indicates the relationship is linear. No extreme outliers are evident. Although some points sit slightly above or below the line, they remain close to the line of best fit and do not appear to have any serious effect on the relationship between the independent variable (StudyHours) and the dependent variable (ExamScore).

Report for DatasetA

Study Hours (M = 6.14, SD = 1.37) was correlated with Exam Score (M = 90.07, SD = 6.80), ρ (rho) = .90, p < .001. The relationship was positive and strong. As study hours increased, exam scores increased.
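
A report sentence of this form can be assembled from the computed values so that rounding stays consistent. This is a minimal sketch, assuming hypothetical helpers `format_p` and `report_correlation` (not part of the original analysis). Descriptives are rounded to two decimals and very small p-values are reported as `p < .001`. The p-value passed in the example is the bound printed by `cor.test`, not an exact value.

```r
# Hypothetical helper: report small p-values as a bound, others to three decimals.
format_p <- function(p) {
  if (p < .001) "p < .001" else sub("0\\.", ".", sprintf("p = %.3f", p))
}

# Hypothetical helper: build the report sentence from descriptives and test results.
report_correlation <- function(iv, dv, m_iv, sd_iv, m_dv, sd_dv, rho, p) {
  sprintf(
    "%s (M = %.2f, SD = %.2f) was correlated with %s (M = %.2f, SD = %.2f), rho = %.2f, %s.",
    iv, m_iv, sd_iv, dv, m_dv, sd_dv, rho, format_p(p)
  )
}

# Values copied from the DatasetA output above
sentence <- report_correlation(
  "Study Hours", "Exam Score",
  6.135609, 1.369224, 90.06906, 6.795224,
  rho = 0.9008825, p = 2.2e-16
)
print(sentence)
```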

Dataset B -

Calculate the Descriptive Statistics for DatasetB IV - ScreenTime

mean(DatasetB$ScreenTime)
## [1] 5.063296
sd(DatasetB$ScreenTime)
## [1] 2.056833

Calculate the Descriptive Statistics for DatasetB DV - SleepingHours

mean(DatasetB$SleepingHours)
## [1] 6.938459
sd(DatasetB$SleepingHours)
## [1] 1.351332

Create Histogram for DatasetB-IV

hist(DatasetB$ScreenTime,
     main = "ScreenTime",
     breaks = 20,
     col = "lightgreen",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “ScreenTime” does not appear normally distributed. The histogram appears positively skewed, with most of the data on the left. The shape of the histogram is slightly tall (peaked).

Create Histogram for DatasetB-DV

hist(DatasetB$SleepingHours,
     main = "SleepingHours",
     breaks = 20,
     col = "orange",
     border = "white",
     cex.main = 1,
     cex.axis = 1,
     cex.lab = 1)

The variable “SleepingHours” appears approximately normally distributed. The histogram is approximately symmetrical, with most of the data in the center, and its shape resembles a bell curve.

Statistically Test Normality for DatasetB - IV

shapiro.test(DatasetB$ScreenTime) 
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$ScreenTime
## W = 0.90278, p-value = 1.914e-06

The Shapiro–Wilk p-value for the ScreenTime normality test is less than .05 (p = .000002, i.e. p < .001), so the data are not normally distributed.

Statistically Test Normality for DatasetB - DV

shapiro.test(DatasetB$SleepingHours)
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$SleepingHours
## W = 0.98467, p-value = 0.3004

The Shapiro–Wilk p-value for the SleepingHours normality test is greater than .05 (p = .30), so the null hypothesis of normality is not rejected and the data are treated as normal.

Conduct Correlation Test (Test Hypotheses) for DatasetB

cor.test(DatasetB$ScreenTime, DatasetB$SleepingHours, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  DatasetB$ScreenTime and DatasetB$SleepingHours
## S = 259052, p-value = 3.521e-09
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.5544674

The Spearman correlation test was selected because at least one variable was not normally distributed: ScreenTime departed from normality according to its histogram and the Shapiro–Wilk test (p < .001), while SleepingHours appeared normal (p = .30). The p-value is 3.521e-09, which is below .05, so the result is statistically significant and the alternative hypothesis is supported. The rho value is -0.5544674. The correlation is negative, which means that as screen time increases, hours of sleep decrease. The absolute value of rho (0.55) is greater than 0.50, which means the relationship is strong.
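
The interpretation of rho uses two pieces of information: its sign gives the direction and its absolute value gives the strength. This is a minimal sketch, assuming a hypothetical helper `describe_correlation` that applies the 0.50 cutoff used in this report. Other cutoffs are in common use, so the threshold is left as a parameter.

```r
# Hypothetical helper: direction from the sign of rho, strength from |rho|.
describe_correlation <- function(rho, strong_cutoff = 0.50) {
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "none"
  strength <- if (abs(rho) > strong_cutoff) "strong" else "not strong"
  c(direction = direction, strength = strength)
}

# rho values copied from the cor.test outputs in this report
describe_correlation(-0.5544674)  # negative, strong
describe_correlation(0.9008825)   # positive, strong
```

Comparing `abs(rho)` rather than `rho` itself matters for negative correlations, where a raw comparison against 0.50 would always fail.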

Create a Scatterplot to Visualize the Relationship for DatasetB

ggscatter(
  DatasetB,
  x = "ScreenTime",
  y = "SleepingHours",
  add = "reg.line",
  xlab = "ScreenTime",
  ylab = "SleepingHours"
)

The line of best fit points downward from left to right, so the direction of the relationship is negative: as screen time increases, sleeping hours decrease. The dots closely hug the line, which indicates a strong relationship between the variables. The dots form a straight-line pattern, which indicates the relationship is linear. The possible outliers (extremely large screen time and extremely small sleeping hours) are still fairly near the line of best fit and do not appear to have a pronounced effect on the overall relationship between the independent variable (ScreenTime) and the dependent variable (SleepingHours).
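
Possible outliers can also be flagged numerically rather than by eye. This is a minimal sketch, assuming a hypothetical helper `flag_outliers` that wraps base R's `boxplot.stats`, which returns values lying more than 1.5 times the interquartile range beyond the hinges. This is one common screening rule, not the method used in the original analysis, and the demo vector is made up.

```r
# Hypothetical helper: values beyond 1.5 * IQR from the hinges (boxplot rule).
flag_outliers <- function(x) {
  boxplot.stats(x)$out
}

# Illustrative vector, not the assignment data
screen_demo <- c(1:10, 50)
flag_outliers(screen_demo)  # returns 50
flag_outliers(1:10)         # returns an empty vector
```

For the assignment data, the calls would be `flag_outliers(DatasetB$ScreenTime)` and `flag_outliers(DatasetB$SleepingHours)`.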

Report for DatasetB

Screen Time (M = 5.06, SD = 2.06) was correlated with Sleeping Hours (M = 6.94, SD = 1.35), ρ (rho) = -.55, p < .001. The relationship was negative and strong. As screen time increased, sleeping hours decreased.