title: "Assignment 1 Q1"
author: "Karim Pare"
date: "2026-02-04"
output: html_document

install.packages("readxl")
install.packages("ggpubr")

library(readxl) 
library(ggpubr)
## Loading required package: ggplot2
DatasetA<-read_excel("/Users/karim/Desktop/DatasetA.xlsx")
DatasetB<-read_excel("/Users/karim/Desktop/DatasetB.xlsx")

Running the stats

1: Descriptive statistics The mean and standard deviation of each variable are computed first. In Dataset A, study hours average 6.14 (SD = 1.37) and exam scores average 90.07 (SD = 6.80). In Dataset B, screen time averages 5.06 (SD = 2.06) and sleeping hours average 6.94 (SD = 1.35).

mean(DatasetA$StudyHours); sd(DatasetA$StudyHours)
## [1] 6.135609
## [1] 1.369224
mean(DatasetA$ExamScore); sd(DatasetA$ExamScore)
## [1] 90.06906
## [1] 6.795224
mean(DatasetB$ScreenTime); sd(DatasetB$ScreenTime) 
## [1] 5.063296
## [1] 2.056833
mean(DatasetB$SleepingHours); sd(DatasetB$SleepingHours) 
## [1] 6.938459
## [1] 1.351332
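
The four pairs of `mean()` and `sd()` calls above could be condensed into a single helper. The following is a minimal sketch, assuming a hypothetical data frame `demo` filled with simulated values that stand in for the Excel files; `describe_columns` is an illustrative name, not a function from any package.

```r
# Simulated stand-in for DatasetA (hypothetical values, not the assignment data)
set.seed(1)
demo <- data.frame(
  StudyHours = rnorm(100, mean = 6, sd = 1.4),
  ExamScore  = rnorm(100, mean = 90, sd = 7)
)

# Return the mean and standard deviation of every numeric column,
# one row per variable
describe_columns <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  t(sapply(numeric_cols, function(x) c(mean = mean(x), sd = sd(x))))
}

stats <- describe_columns(demo)
print(round(stats, 2))
```

Calling `describe_columns(DatasetA)` or `describe_columns(DatasetB)` on the real data should give the same numbers as the individual calls, in one table per dataset.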

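2: Normality tests

In Dataset A, the Shapiro-Wilk p-value for study hours is greater than .05 (p = .93), so the data do not depart significantly from normality. The p-value for exam score is below .05 (p = .006), so the data are not normally distributed. In Dataset B, the p-value for screen time is below .05 (p < .001), so the data are not normally distributed, while the p-value for sleeping hours is greater than .05 (p = .30), so the data are treated as normal.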

shapiro.test(DatasetA$StudyHours) 
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$StudyHours
## W = 0.99388, p-value = 0.9349
shapiro.test(DatasetA$ExamScore) 
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetA$ExamScore
## W = 0.96286, p-value = 0.006465
shapiro.test(DatasetB$ScreenTime) 
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$ScreenTime
## W = 0.90278, p-value = 1.914e-06
shapiro.test(DatasetB$SleepingHours)
## 
##  Shapiro-Wilk normality test
## 
## data:  DatasetB$SleepingHours
## W = 0.98467, p-value = 0.3004
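
Reading four separate test printouts invites transcription mistakes, so it can help to collect the results in one table. The sketch below is one possible approach, assuming a hypothetical data frame `demo` with one simulated normal column and one simulated skewed column; the column names and the `normality` table are illustrative only.

```r
# Simulated stand-in data (hypothetical, not the assignment data)
set.seed(42)
demo <- data.frame(
  Symmetric = rnorm(100, mean = 7, sd = 1.3),  # drawn from a normal distribution
  Skewed    = rexp(100, rate = 0.2)            # drawn from a right-skewed distribution
)

# Run Shapiro-Wilk on each column and collect W, the p-value,
# and the decision at the .05 level
normality <- do.call(rbind, lapply(names(demo), function(v) {
  res <- shapiro.test(demo[[v]])
  data.frame(
    variable         = v,
    W                = unname(res$statistic),
    p_value          = res$p.value,
    reject_normality = res$p.value < 0.05
  )
}))

print(normality)
```

A `reject_normality` value of `TRUE` corresponds to the "p below .05, not normally distributed" wording used in the write-up.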

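3: Histograms

The variable "exam score" appears roughly normally distributed in its histogram. The data look symmetrical (most values are in the middle) and follow a bell-shaped curve. However, the Shapiro-Wilk test for exam score (p = .006) indicates a departure from normality, so this visual impression should be treated with caution. The variable "study hours" also appears normally distributed. The data look symmetrical (most values are in the middle) and follow a bell-shaped curve, which agrees with its Shapiro-Wilk result (p = .93).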

Histograms for Dataset A

hist(DatasetA$StudyHours, main="Histogram of Study Hours", col="lightblue", breaks=20)

hist(DatasetA$ExamScore, main="Histogram of Exam Scores", col="lightgreen", breaks=20)

Histograms for Dataset B

hist(DatasetB$ScreenTime, main="Histogram of Screen Time", col="pink", breaks=20) 

hist(DatasetB$SleepingHours, main="Histogram of Sleeping Hours", col="lightyellow", breaks=20) 

4 and 5: Correlation and Scatterplots

The p-value for the Pearson correlation between study hours and exam score is below .05 (p < .001). This means the result is statistically significant, and the alternate hypothesis is supported. The correlation value is 0.90 (95% CI 0.86 to 0.93). The correlation is positive, which means that as study hours increase, exam scores increase. The correlation value is greater than 0.50, which means the relationship is strong.

cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## t = 20.959, df = 98, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8606509 0.9346369
## sample estimates:
##      cor 
## 0.904214
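
As a check on the printout above, the t statistic and confidence interval can be recomputed by hand from the reported correlation and degrees of freedom. This is a sketch using only the two numbers shown in the output, with the standard formulas t = r * sqrt(df) / sqrt(1 - r^2) and the Fisher z-transformation for the interval; the variable names are illustrative.

```r
r  <- 0.904214  # sample correlation reported by cor.test()
df <- 98        # degrees of freedom reported by cor.test(), equal to n - 2

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_two_sided <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval via the Fisher z-transformation
n  <- df + 2
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The recomputed values agree with the `t = 20.959` and the interval of roughly 0.861 to 0.935 in the output.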

Spearman used because p < .05 in Shapiro tests

The Spearman correlation test was also run because exam score departed from normality according to the Shapiro-Wilk test (p = .006); study hours did not (p = .93). The p-value (probability value) is below .05 (p < .001). This means the result is statistically significant, and the alternate hypothesis is supported. Because the data contain ties, R reports an approximate rather than an exact p-value. The rho-value is 0.90. The correlation is positive, which means that as study hours increase, exam scores increase. The rho-value is greater than 0.50, which means the relationship is strong.

cor.test(DatasetA$StudyHours, DatasetA$ExamScore, method = "spearman") 
## Warning in cor.test.default(DatasetA$StudyHours, DatasetA$ExamScore, method =
## "spearman"): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  DatasetA$StudyHours and DatasetA$ExamScore
## S = 16518, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9008825
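
To make the Spearman output easier to interpret: rho is the Pearson correlation of the ranked data, and the `S` statistic follows from rho as S = (n^3 - n)(1 - rho) / 6. The sketch below illustrates both points on simulated, hypothetical data with ties (not the assignment data), and passes `exact = FALSE` so that the approximate p-value is requested up front instead of triggering the ties warning.

```r
# Simulated stand-in data; rounding deliberately creates tied values
set.seed(7)
x <- round(rnorm(100, mean = 6, sd = 1.4), 1)
y <- round(90 + 4 * (x - 6) + rnorm(100, sd = 3))

# Spearman's rho computed directly and as Pearson's r on average ranks
rho_direct <- cor(x, y, method = "spearman")
rho_ranks  <- cor(rank(x), rank(y))

# exact = FALSE asks for the approximate p-value, which is what R
# falls back to when ties are present
res <- cor.test(x, y, method = "spearman", exact = FALSE)

# Relationship between rho and the S statistic
n <- length(x)
S_from_rho <- (n^3 - n) * (1 - rho_direct) / 6

# The same relationship applied to the values reported in the write-up
S_reported <- (100^3 - 100) * (1 - 0.9008825) / 6

print(c(rho_direct = rho_direct, rho_ranks = rho_ranks))
print(c(S_test = unname(res$statistic), S_from_rho = S_from_rho))
print(round(S_reported))
```

Applied to the reported rho of 0.9008825 with n = 100, the formula gives a value that rounds to the `S = 16518` shown in the output.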

Visualizing the relationships

For Dataset A, the line of best fit points to the top right. This means the direction of the relationship is positive: as study hours increase, exam scores increase. The dots closely hug the line, which indicates a strong relationship between the variables and agrees with the correlation of 0.90. The dots form a straight-line pattern, which means the relationship is linear.

ggscatter(DatasetA, x = "StudyHours", y = "ExamScore", add = "reg.line", 
          xlab = "Study Hours", ylab = "Exam Score") 

ggscatter(DatasetB, x = "ScreenTime", y = "SleepingHours", add = "reg.line", 
          xlab = "Screen Time", ylab = "Sleeping Hours")