# ===================================================
# PEARSON CORRELATION & SPEARMAN CORRELATION OVERVIEW
# ===================================================

# PURPOSE
# Used to test whether two continuous variables are related, and how strongly.

# ==========
# HYPOTHESES
# ==========

# .................................................................
# QUESTION
# What are the null and alternative hypotheses for your research?
# H0: There is no relationship between time spent and number of drinks purchased.
# H1: There is a relationship between time spent and number of drinks purchased.
# .................................................................

# ======================
# IMPORT EXCEL FILE CODE
# ======================


#install.packages("readxl")

# LOAD THE PACKAGE

library(readxl)

# IMPORT THE EXCEL FILE INTO R STUDIO

A5RQ1 <- read_excel("C:\\Users\\leena\\Desktop\\SLU\\Sem 3 Fall 1\\Week 5\\A5RQ1.xlsx")
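
# OPTIONAL CHECK (a minimal sketch, not part of the original assignment)
# Confirms that the imported data contains the columns used later in this script.
# The helper name check_columns() is hypothetical; it uses base R only.

check_columns <- function(data, required = c("Minutes", "Drinks")) {
  # Collect the names of any required columns that are missing
  missing_cols <- setdiff(required, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Only runs when the data set has been imported above
if (exists("A5RQ1")) check_columns(A5RQ1)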

# ======================
# DESCRIPTIVE STATISTICS
# ======================

# Calculate the mean, median, SD, and sample size for each variable.

# INSTALL THE REQUIRED PACKAGE

#install.packages("psych")

# LOAD THE PACKAGE

library(psych)

# CALCULATE THE DESCRIPTIVE DATA

describe(A5RQ1[, c("Minutes", "Drinks")])
##         vars   n  mean    sd median trimmed   mad min   max range skew kurtosis
## Minutes    1 461 29.89 18.63   24.4   26.99 15.12  10 154.2 144.2 1.79     5.20
## Drinks     2 461  3.00  1.95    3.0    2.75  1.48   0  17.0  17.0 1.78     6.46
##           se
## Minutes 0.87
## Drinks  0.09
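
# OPTIONAL CROSS-CHECK (a minimal sketch using base R only)
# Recomputes the sample size, mean, SD, and median without the psych package.
# The helper name basic_stats() is hypothetical. Missing values are dropped,
# which is an assumption about how they should be handled.

basic_stats <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}

# Only runs when the data set has been imported above
if (exists("A5RQ1")) {
  sapply(A5RQ1[, c("Minutes", "Drinks")], basic_stats)
}
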
# =========================
# VISUALLY DISPLAY THE DATA
# =========================

# CREATE A SCATTERPLOT

# PURPOSE
# A scatterplot visually shows the relationship between two continuous variables.

# INSTALL THE REQUIRED PACKAGES

#install.packages("ggplot2")
#install.packages("ggpubr")

# LOAD THE PACKAGES

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(ggpubr)

# CREATE THE SCATTERPLOT

ggscatter(A5RQ1, x = "Minutes", y = "Drinks",
          add = "reg.line",
          conf.int = TRUE,
          cor.coef = TRUE,
          cor.method = "spearman",
          xlab = "Minutes", ylab = "Drinks")

# ........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
# Is the relationship positive (line pointing up), negative (line pointing down), or is there no relationship (line is flat)?
# Ans: The relationship is positive since the line is pointing up.
# ........................................................


# ===============================================
# CHECK THE NORMALITY OF THE CONTINUOUS VARIABLES
# ===============================================

# OVERVIEW
# Two methods will be used to check the normality of the continuous variables.
# First, you will create histograms to visually inspect the normality of the variables.
# Next, you will conduct a test called the Shapiro-Wilk test to inspect the normality of the variables.
# It is important to know whether or not the data is normal to determine which inferential test should be used.


# CREATE A HISTOGRAM FOR EACH CONTINUOUS VARIABLE
# A histogram is used to visually check if the data is normally distributed.

# The dataset is A5RQ1 (the Excel file name without the .xlsx).
# Variable 1 (V1) is Minutes.
# Variable 2 (V2) is Drinks.

hist(A5RQ1$Minutes,
     main = "Histogram of Minutes",
     xlab = "Minutes",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)

hist(A5RQ1$Drinks,
     main = "Histogram of Drinks",
     xlab = "Drinks",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)

# ........................................................
# QUESTION
# Answer the questions below as comments within the R script:
# Q1) Check the SKEWNESS of the VARIABLE 1 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# Ans: Variable 1 (Minutes) is not symmetrical; it is positively skewed.
# Q2) Check the KURTOSIS of the VARIABLE 1 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# Ans: In our opinion, the histogram of Variable 1 (Minutes) is too tall.
# Q3) Check the SKEWNESS of the VARIABLE 2 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# Ans: Variable 2 (Drinks) is not symmetrical; it is positively skewed.
# Q4) Check the KURTOSIS of the VARIABLE 2 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# Ans: In our opinion, the histogram of Variable 2 (Drinks) is too tall.
# ........................................................
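
# OPTIONAL NUMERIC CHECK (a minimal sketch, not part of the original assignment)
# The visual answers above can be compared with moment-based estimates.
# skew > 0 suggests a positive skew; excess kurtosis > 0 suggests a peak that is too tall.
# The helper name shape_stats() is hypothetical. Its values may differ slightly from the
# skew and kurtosis columns of describe(), which may use a different sample-size adjustment.

shape_stats <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)  # second central moment
  c(skew = mean(d^3) / m2^1.5,
    excess_kurtosis = mean(d^4) / m2^2 - 3)
}

# Only runs when the data set has been imported above
if (exists("A5RQ1")) {
  sapply(A5RQ1[, c("Minutes", "Drinks")], shape_stats)
}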

# PURPOSE
# Use a statistical test to check the normality of the continuous variables.
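# The Shapiro-Wilk test checks the null hypothesis that the data come from a normal distribution.
# It is sensitive to departures in both skewness and kurtosis; a p-value below .05 indicates non-normal data.

# CONDUCT THE SHAPIRO-WILK TEST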

shapiro.test(A5RQ1$Minutes)
## 
##  Shapiro-Wilk normality test
## 
## data:  A5RQ1$Minutes
## W = 0.84706, p-value < 2.2e-16
shapiro.test(A5RQ1$Drinks)
## 
##  Shapiro-Wilk normality test
## 
## data:  A5RQ1$Drinks
## W = 0.85487, p-value < 2.2e-16
# .........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
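# Was the data normally distributed for Variable 1?
# Ans: No. The Shapiro-Wilk p-value for Minutes is below .05 (p < 2.2e-16), so the data is not normally distributed.
# Was the data normally distributed for Variable 2?
# Ans: No. The Shapiro-Wilk p-value for Drinks is below .05 (p < 2.2e-16), so the data is not normally distributed.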
# .........................................................
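
# OPTIONAL DECISION HELPER (a minimal sketch, not part of the original assignment)
# Encodes the rule this script appears to follow: use the Pearson correlation only when
# both variables pass the Shapiro-Wilk test, otherwise use the Spearman correlation.
# The helper name choose_cor_method() and the alpha = 0.05 cutoff are assumptions.

choose_cor_method <- function(x, y, alpha = 0.05) {
  # p > alpha means the test found no evidence against normality
  x_normal <- shapiro.test(x)$p.value > alpha
  y_normal <- shapiro.test(y)$p.value > alpha
  if (x_normal && y_normal) "pearson" else "spearman"
}

# Only runs when the data set has been imported above
if (exists("A5RQ1")) choose_cor_method(A5RQ1$Minutes, A5RQ1$Drinks)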

# Since neither variable is normally distributed, the Spearman correlation test is used.

# ================================================
# PEARSON CORRELATION OR SPEARMAN CORRELATION TEST
# ================================================

# PURPOSE
# Check whether the two continuous variables are related (correlated).


cor.test(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman")
## Warning in cor.test.default(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman"):
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  A5RQ1$Minutes and A5RQ1$Drinks
## S = 1305608, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9200417
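
# ILLUSTRATION (a minimal sketch with made-up numbers, not the A5RQ1 data)
# Spearman's rho is the Pearson correlation computed on the ranks of the data.
# Setting exact = FALSE requests the approximate p-value directly, which avoids
# the "Cannot compute exact p-value with ties" warning shown above.

toy_minutes <- c(10, 15, 15, 22, 30, 45, 60)  # hypothetical values with a tie
toy_drinks  <- c(1, 2, 2, 2, 3, 5, 4)         # hypothetical values with ties

rho_builtin  <- cor(toy_minutes, toy_drinks, method = "spearman")
rho_on_ranks <- cor(rank(toy_minutes), rank(toy_drinks), method = "pearson")
all.equal(rho_builtin, rho_on_ranks)  # the two computations should agree

toy_test <- cor.test(toy_minutes, toy_drinks, method = "spearman", exact = FALSE)
toy_test$estimate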


# ===============================================
# EFFECT SIZE FOR SPEARMAN CORRELATION
# ===============================================

# If results were statistically significant, then determine how the variables are related and how strong the relationship is.

# 1) REVIEW THE CORRECT CORRELATION TEST
# The test is statistically significant (p < 2.2e-16), so we reject H0.
# The rho value is 0.92, which indicates a strong positive relationship: as the number of minutes spent in the cafe increases, the number of drinks purchased also tends to increase.
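
# OPTIONAL HELPER (a minimal sketch, not part of the original assignment)
# Converts a correlation coefficient into a verbal effect-size label and a direction.
# The cutoffs (.10 small, .30 medium, .50 large) follow Cohen's commonly cited guidelines.
# Using them here is an assumption; other courses or fields may use different cutoffs.
# The helper name describe_rho() is hypothetical.

describe_rho <- function(rho) {
  size <- abs(rho)
  strength <- if (size >= 0.50) {
    "large"
  } else if (size >= 0.30) {
    "medium"
  } else if (size >= 0.10) {
    "small"
  } else {
    "negligible"
  }
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "none"
  c(strength = strength, direction = direction)
}

describe_rho(0.9200417)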