assignment5(1)

# ===================================================
# PEARSON CORRELATION & SPEARMAN CORRELATION OVERVIEW
# ===================================================

# PURPOSE
# Used to test the relationship between two continuous variables.

# ==========
# HYPOTHESES
# ==========

# NULL HYPOTHESIS
# There is no relationship between Variables A and B.

# ALTERNATE HYPOTHESIS
# There is a relationship between Variables A and B.

# DIRECTIONAL ALTERNATE HYPOTHESES
# As Variable A increases, Variable B increases.
# As Variable A increases, Variable B decreases.

# .................................................................
# QUESTION
# What are the null and alternate hypotheses for your research?
# H0: There is no relationship between time spent (minutes) in the shop and number of drinks purchased.
# H1: There is a relationship between time spent (minutes) in the shop and number of drinks purchased.
# .................................................................

# ======================
# IMPORT EXCEL FILE CODE
# ======================

# PURPOSE OF THIS CODE
# Imports your Excel dataset automatically into R Studio.
# You need to import your dataset every time you want to analyze your data in R Studio.

# INSTALL REQUIRED PACKAGE
# The package only needs to be installed once.
# The code for this task is provided below. Remove the hashtag below to convert the note into code.

# install.packages("readxl")

# LOAD THE PACKAGE
# Load the package at the start of every R session before you use it.
# The code for this task is provided below.

library(readxl)

# IMPORT THE EXCEL FILE INTO R STUDIO
# Download the Excel file from OneDrive and save it to your desktop.
# Right-click the Excel file and click “Copy as path” from the menu.
# In R Studio, replace the example path below with your actual path.
# Replace each backslash \ with a forward slash / or double it \\:
# ✘ WRONG   "C:\Users\Joseph\Desktop\mydata.xlsx"
# ✔ CORRECT "C:/Users/Joseph/Desktop/mydata.xlsx"
# ✔ CORRECT "C:\\Users\\Joseph\\Desktop\\mydata.xlsx"
# Replace "dataset" with the name of your excel data (without the .xlsx)

# The code for this task is provided below.
# Edit the path so that it points to your own copy of the Excel file.

A5RQ1 <- read_excel("D:\\applied analytics\\A5RQ1.xlsx")
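
# OPTIONAL EXAMPLE: CONVERTING A WINDOWS PATH
# The path rules above can be sketched in code. This is only an illustration:
# "example_path", "fixed_path", and "raw_path" are hypothetical names, and the
# path is the example path from the notes, not a real file.

example_path <- "C:\\Users\\Joseph\\Desktop\\mydata.xlsx"   # doubled backslashes
fixed_path <- gsub("\\", "/", example_path, fixed = TRUE)   # swap each \ for /
print(fixed_path)

# In R 4.0.0 or later, a raw string accepts the copied path without edits.
raw_path <- r"(C:\Users\Joseph\Desktop\mydata.xlsx)"
identical(raw_path, example_path)                           # both hold the same path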

# ======================
# DESCRIPTIVE STATISTICS
# ======================

# Calculate the mean, median, SD, and sample size for each variable.

# INSTALL THE REQUIRED PACKAGE
# Remove the hashtag in front of the code below to install the package once. 
# After installing the package, put the hashtag in front of the code again.

# install.packages("psych")

# LOAD THE PACKAGE
# Always reload the package you want to use. 

library(psych)

# CALCULATE THE DESCRIPTIVE DATA
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.

describe(A5RQ1[, c("Minutes", "Drinks")])

##         vars   n  mean    sd median trimmed   mad min   max range skew kurtosis
## Minutes    1 461 29.89 18.63   24.4   26.99 15.12  10 154.2 144.2 1.79     5.20
## Drinks     2 461  3.00  1.95    3.0    2.75  1.48   0  17.0  17.0 1.78     6.46
##           se
## Minutes 0.87
## Drinks  0.09
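
# OPTIONAL CHECK: WHERE THE "se" COLUMN COMES FROM
# The standard error in the describe() output is the SD divided by the square
# root of the sample size. A minimal sketch using the values printed above
# (the object names are hypothetical):

n_obs <- 461
se_minutes <- 18.63 / sqrt(n_obs)
se_drinks <- 1.95 / sqrt(n_obs)
round(c(Minutes = se_minutes, Drinks = se_drinks), 2)   # matches 0.87 and 0.09 above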

# =========================
# VISUALLY DISPLAY THE DATA
# =========================

# CREATE A SCATTERPLOT

# PURPOSE
# A scatterplot visually shows the relationship between two continuous variables.

# INSTALL THE REQUIRED PACKAGES
# Remove the hashtags in front of the code below to install the packages once.
# After installing the packages, put the hashtags in front of the code again.

# install.packages("ggplot2")
# install.packages("ggpubr")

# LOAD THE PACKAGES
# Always reload the packages you want to use.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(ggpubr)

# CREATE THE SCATTERPLOT
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
# Replace "pearson" with "spearman" if you are using the spearman correlation.

ggscatter(A5RQ1, x = "Minutes", y = "Drinks",
          add = "reg.line",
          conf.int = TRUE,
          cor.coef = TRUE,
          cor.method = "spearman",
          xlab = "Minutes", ylab = "Drinks")

# ........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
# Is the relationship positive (line pointing up), negative (line pointing down), or is there no relationship (line is flat)?
# The relationship is positive (the line points up).
# ........................................................


# ===============================================
# CHECK THE NORMALITY OF THE CONTINUOUS VARIABLES
# ===============================================

# OVERVIEW
# Two methods will be used to check the normality of the continuous variables.
# First, you will create histograms to visually inspect the normality of the variables.
# Next, you will conduct a test called the Shapiro-Wilk test to inspect the normality of the variables.
# It is important to know whether or not the data is normal to determine which inferential test should be used.


# CREATE A HISTOGRAM FOR EACH CONTINUOUS VARIABLE
# A histogram is used to visually check if the data is normally distributed.

# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.

hist(A5RQ1$Minutes,
     main = "Histogram of Minutes",
     xlab = "Minutes",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)

hist(A5RQ1$Drinks,
     main = "Histogram of Drinks",
     xlab = "Drinks",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)

# ........................................................
# QUESTION
# Answer the questions below as comments within the R script:
# Q1) Check the SKEWNESS of the VARIABLE 1 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
#     Visually: positively skewed
# Q2) Check the KURTOSIS of the VARIABLE 1 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
#     Visually: too tall
# Q3) Check the SKEWNESS of the VARIABLE 2 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
#     Visually: positively skewed
# Q4) Check the KURTOSIS of the VARIABLE 2 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
#     Visually: too tall
# ........................................................
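
# OPTIONAL SKETCH: PUTTING A NUMBER ON SKEWNESS
# The visual answers above can be compared with a numeric skewness value.
# The helper below is a hypothetical, simple moment-based version; the "skew"
# column from describe() may use a slightly different sample-size adjustment.
# Values above 0 suggest positive skew, values below 0 suggest negative skew.

simple_skew <- function(x) {
  x <- x[!is.na(x)]                      # drop missing values
  deviations <- x - mean(x)              # distance of each value from the mean
  mean(deviations^3) / mean(deviations^2)^1.5
}

simple_skew(c(1, 2, 3, 4, 5))            # symmetric toy data: 0
simple_skew(c(1, 1, 1, 2, 10))           # toy data with a long right tail: above 0

# To apply it to this dataset: simple_skew(A5RQ1$Minutes)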

# PURPOSE
# Use a statistical test to check the normality of the continuous variables.
# The Shapiro-Wilk test checks whether the overall shape of the data, including its skewness and kurtosis, departs from a normal distribution.
# The test asks: "Is this variable the SAME as normal data (null hypothesis) or DIFFERENT from normal data (alternate hypothesis)?"
# For this test, if p is GREATER than .05 (p > .05), the data is NORMAL.
# If p is LESS than .05 (p < .05), the data is NOT normal.
# RESULTS FOR THIS DATASET
# Minutes: p < .001 (≈ 9.96e-21) → NOT normal.
# Drinks : p < .001 (≈ 3.21e-20) → NOT normal.

# CONDUCT THE SHAPIRO-WILK TEST
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.

shapiro.test(A5RQ1$Minutes)

## 
##  Shapiro-Wilk normality test
## 
## data:  A5RQ1$Minutes
## W = 0.84706, p-value < 2.2e-16

shapiro.test(A5RQ1$Drinks)

## 
##  Shapiro-Wilk normality test
## 
## data:  A5RQ1$Drinks
## W = 0.85487, p-value < 2.2e-16
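
# OPTIONAL SKETCH: READING THE EXACT P-VALUE
# The printed output stops at "< 2.2e-16". The object returned by shapiro.test()
# stores the p-value in its "p.value" element. The example below uses simulated,
# strongly right-skewed data (not the assignment data); "toy_skewed" and
# "toy_result" are hypothetical names.

set.seed(123)
toy_skewed <- rexp(200)                  # simulated right-skewed values
toy_result <- shapiro.test(toy_skewed)
toy_result$p.value                       # the p-value as a plain number
toy_result$p.value < .05                 # TRUE would mean NOT normal

# For this dataset: shapiro.test(A5RQ1$Minutes)$p.value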

# .........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
# Was the data normally distributed for Variable 1? NO.
# Was the data normally distributed for Variable 2? NO.
# .........................................................

# If the data is normal for both variables, continue with the Pearson Correlation test.
# If one or both variables are NOT normal, change to the Spearman Correlation test.
# Neither variable was normal in this dataset, so the Spearman Correlation is used below.

# ================================================
# PEARSON CORRELATION OR SPEARMAN CORRELATION TEST
# ================================================

# PURPOSE
# Check whether there is a relationship between the two continuous variables.

# CONDUCT THE PEARSON CORRELATION OR SPEARMAN CORRELATION
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
# Replace "pearson" with "spearman" if you are using the spearman correlation.

cor.test(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman")

## Warning in cor.test.default(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman"):
## Cannot compute exact p-value with ties

## 
##  Spearman's rank correlation rho
## 
## data:  A5RQ1$Minutes and A5RQ1$Drinks
## S = 1305608, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.9200417
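
# OPTIONAL CHECK: HOW RHO RELATES TO THE OUTPUT ABOVE
# Spearman's rho is the Pearson correlation of the ranked values, and the S
# statistic that R prints is tied to rho by rho = 1 - 6 * S / (n^3 - n).
# A minimal sketch; the toy vectors and object names are hypothetical.

toy_x <- c(10, 12, 15, 15, 20, 31, 44)
toy_y <- c(1, 1, 2, 3, 3, 4, 6)
cor(rank(toy_x), rank(toy_y))                 # Pearson correlation of the ranks
cor(toy_x, toy_y, method = "spearman")        # same value

# Using S and n from the output above:
n_obs <- 461
rho_from_S <- 1 - 6 * 1305608 / (n_obs^3 - n_obs)
round(rho_from_S, 2)                          # 0.92, in line with the printed rho

# Adding exact = FALSE to cor.test() asks for the approximate p-value directly,
# which should avoid the "Cannot compute exact p-value with ties" warning.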

# DETERMINE STATISTICAL SIGNIFICANCE
# If results were statistically significant (p < .05), continue to effect size section below.
# If results were NOT statistically significant (p > .05), skip to reporting section below.
# NOTE: Getting results that are not statistically significant does NOT mean you switch to Spearman Correlation.
# The Spearman Correlation is chosen because the data is not normally distributed, not because of the significance of the outcome.

# ==============================================
# EFFECT SIZE FOR PEARSON & SPEARMAN CORRELATION
# ==============================================

# If results were statistically significant, then determine how the variables are related and how strong the relationship is.

# 1) REVIEW THE CORRECT CORRELATION TEST
#    • For Pearson correlation, find "sample estimates: cor" in your output (when you calculated the Pearson Correlation earlier).
#    • For Spearman correlation, find "sample estimates: rho" in your output (when you calculated the Spearman Correlation earlier).
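
# OPTIONAL SKETCH: LABELING THE STRENGTH OF A CORRELATION
# The cutoffs below (.10 weak, .30 moderate, .50 strong) follow a commonly
# cited rule of thumb and are an assumption here, not part of this assignment;
# use the cutoffs from your course materials if they differ. The function name
# "label_correlation" is hypothetical.

label_correlation <- function(r) {
  size <- abs(r)                         # strength ignores the sign
  strength <- if (size >= .50) {
    "strong"
  } else if (size >= .30) {
    "moderate"
  } else if (size >= .10) {
    "weak"
  } else {
    "negligible"
  }
  direction <- if (r > 0) "positive" else if (r < 0) "negative" else "no direction"
  paste(strength, direction)
}

label_correlation(0.92)                  # "strong positive"
label_correlation(-0.25)                 # "weak negative"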

# REPORT THE RESULTS
# 1) The name of the inferential test used: Spearman Correlation.
# 2) The names of the two variables analyzed: time spent in the shop (minutes) and number of drinks purchased.
# 3) The total sample size: n = 461.
# 4) Whether the inferential test results were statistically significant: yes, p < 2.2e-16 (p < .001).
# 5) The mean and SD for each variable: Minutes M = 29.89, SD = 18.63; Drinks M = 3.00, SD = 1.95.
# 6) The direction and size of the correlation: positive, rho = 0.9200417.
# 7) Degrees of freedom: n - 2 = 459.
# 8) The rho-value (sample estimates): 0.92.
# 9) The p-value: p < 2.2e-16, reported as p < .001.

# ........................................................

#  Summary
# 
# A Spearman Correlation revealed a statistically significant, strong, positive relationship between time spent in the shop (minutes) and number of drinks purchased, rho(459) = .92, p < .001, n = 461.
# As the time spent in the shop increases, the number of drinks purchased also tends to increase.
# The average time spent was 29.89 minutes (SD = 18.63), and the average number of drinks purchased was 3.00 (SD = 1.95).

TEAM12

2025-09-23