This analysis is for RESEARCH SCENARIO 1 from Assignment 5. It tests whether there is a relationship between the time spent in the shop (minutes) and the number of drinks purchased.
# IMPORT EXCEL FILE CODE
# PURPOSE OF THIS CODE
# Imports your Excel dataset automatically into R Studio.
# You need to import your dataset every time you want to analyze your data in R Studio.
# INSTALL REQUIRED PACKAGE
# install.packages("readxl")
# LOAD THE PACKAGE
library(readxl)
# IMPORT THE EXCEL FILE INTO R STUDIO
dataset <- read_excel("//apporto.com/dfs/SLU/Users/minhoku_slu/Desktop/A5RQ1.xlsx")
# ======================
# DESCRIPTIVE STATISTICS
# ======================
# Calculate the mean, median, SD, and sample size for each variable.
# INSTALL THE REQUIRED PACKAGE
# install.packages("psych")
# LOAD THE PACKAGE
library(psych)
# CALCULATE THE DESCRIPTIVE DATA
describe(dataset[, c("Minutes", "Drinks")])
##         vars   n  mean    sd median trimmed   mad min   max range skew kurtosis
## Minutes    1 461 29.89 18.63   24.4   26.99 15.12  10 154.2 144.2 1.79     5.20
## Drinks     2 461  3.00  1.95    3.0    2.75  1.48   0  17.0  17.0 1.78     6.46
##           se
## Minutes 0.87
## Drinks  0.09
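# OPTIONAL: CROSS-CHECK THE DESCRIPTIVE DATA WITH BASE R
# A minimal sketch, not part of the original assignment code. It shows how the
# sample size, mean, SD, and median reported by describe() can be reproduced
# with base R functions. The toy vector below is made up for illustration and
# is NOT taken from A5RQ1.xlsx.
toy_times <- c(12, 18, 24, 30, 41)
toy_summary <- c(n      = length(toy_times),
                 mean   = mean(toy_times),
                 sd     = sd(toy_times),
                 median = median(toy_times))
print(round(toy_summary, 2))
# To apply this to the real data, replace toy_times with dataset$Minutes
# or dataset$Drinks.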
# ===============================================
# CHECK THE NORMALITY OF THE CONTINUOUS VARIABLES
# ===============================================
# OVERVIEW
# Two methods will be used to check the normality of the continuous variables.
# First, you will create histograms to visually inspect the normality of the variables.
# Next, you will conduct a test called the Shapiro-Wilk test to inspect the normality of the variables.
# It is important to know whether or not the data is normal to determine which inferential test should be used.
# CREATE A HISTOGRAM FOR EACH CONTINUOUS VARIABLE
# A histogram is used to visually check if the data is normally distributed.
hist(dataset$Minutes,
     main = "Histogram of Minutes",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)
hist(dataset$Drinks,
     main = "Histogram of Drinks",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)
# ........................................................
# Q1) Check the SKEWNESS of the VARIABLE 1 (Minutes) histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# Answer 1: The histogram is POSITIVELY SKEWED. The majority of the data is clustered on the left side (shorter times), with a long "tail" of data points stretching out to the right (longer times).
# Q2) Check the KURTOSIS of the VARIABLE 1 (Minutes) histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# Answer 2: The histogram looks TOO TALL. It has a very sharp, high peak and is much thinner than a proper bell curve.
# Q3) Check the SKEWNESS of the VARIABLE 2 (Drinks) histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# Answer 3: The histogram is POSITIVELY SKEWED. Just like the first histogram, the data is heavily clustered on the left (low number of drinks), with a long tail stretching to the right (high number of drinks).
# Q4) Check the KURTOSIS of the VARIABLE 2 (Drinks) histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# Answer 4: The histogram looks TOO TALL. It has an extremely sharp and high peak and does not look like a proper bell curve at all.
# ........................................................
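# OPTIONAL: PUT A NUMBER ON SKEWNESS AND KURTOSIS
# A minimal sketch, not part of the original assignment code. The visual
# judgments above can be backed up with simple moment-based formulas:
# skewness = m3 / m2^1.5 and excess kurtosis = m4 / m2^2 - 3, where m2, m3, and
# m4 are the central moments. The function names are hypothetical helpers, and
# the results may differ slightly from describe(), which may use a different
# estimator variant.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / m2^1.5
}
moment_excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m2 <- mean((x - mean(x))^2)   # second central moment
  m4 <- mean((x - mean(x))^4)   # fourth central moment
  m4 / m2^2 - 3
}
# Toy values (made up, NOT from A5RQ1.xlsx) with a long right tail.
toy_right_skewed <- c(1, 1, 2, 2, 3, 10)
moment_skewness(toy_right_skewed)         # positive value = positive skew
moment_excess_kurtosis(toy_right_skewed)  # above 0 = taller than a bell curve
# To apply this to the real data: moment_skewness(dataset$Minutes)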
# PURPOSE
# Use a statistical test to check the normality of the continuous variables.
# CONDUCT THE SHAPIRO-WILK TEST
shapiro.test(dataset$Minutes)
##
## Shapiro-Wilk normality test
##
## data: dataset$Minutes
## W = 0.84706, p-value < 2.2e-16
shapiro.test(dataset$Drinks)
##
## Shapiro-Wilk normality test
##
## data: dataset$Drinks
## W = 0.85487, p-value < 2.2e-16
# .........................................................
# Was the data normally distributed for Variable 1 (Minutes)?
# No, the data was NOT normally distributed for Variable 1 (Minutes).
# The Shapiro-Wilk test (shapiro.test) result shows a p-value < 2.2e-16 (<0.05).
# Was the data normally distributed for Variable 2 (Drinks)?
# No, the data was NOT normally distributed for Variable 2 (Drinks).
# The Shapiro-Wilk test (shapiro.test) result shows a p-value < 2.2e-16 (<0.05).
# .........................................................
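# OPTIONAL: CHECK NORMALITY WITH A Q-Q PLOT
# A minimal sketch, not part of the original assignment steps. With a large
# sample, the Shapiro-Wilk test can flag even small departures from normality,
# so a normal Q-Q plot is a useful companion check: points that bend away from
# the reference line indicate non-normal data. The toy vector below is
# generated from exponential quantiles purely for illustration and is NOT taken
# from A5RQ1.xlsx.
toy_skewed <- qexp(ppoints(200))   # deterministic right-skewed toy values
qqnorm(toy_skewed, main = "Normal Q-Q Plot (toy right-skewed data)")
qqline(toy_skewed, col = "red")
toy_shapiro <- shapiro.test(toy_skewed)
print(toy_shapiro)
# To apply this to the real data:
# qqnorm(dataset$Minutes); qqline(dataset$Minutes, col = "red")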
# =========================
# VISUALLY DISPLAY THE DATA
# =========================
# CREATE A SCATTERPLOT
# PURPOSE
# A scatterplot visually shows the relationship between two continuous variables.
# INSTALL THE REQUIRED PACKAGES
# install.packages("ggplot2")
# install.packages("ggpubr")
# LOAD THE PACKAGES
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(ggpubr)
# CREATE THE SCATTERPLOT
ggscatter(dataset, x = "Minutes", y = "Drinks",
          add = "reg.line",
          conf.int = TRUE,
          cor.coef = TRUE,
          cor.method = "spearman",
          xlab = "Variable Minutes", ylab = "Variable Drinks")
# ........................................................
# Is the relationship positive (line pointing up), negative (line pointing down), or is there no relationship (line is flat)?
# The relationship is positive.
# ........................................................
# ================================================
# SPEARMAN CORRELATION TEST (not normally distributed)
# ================================================
# PURPOSE
# Check whether there is a relationship between two continuous variables.
# Spearman is used instead of Pearson because neither variable is normally distributed.
# CONDUCT THE SPEARMAN CORRELATION
cor.test(dataset$Minutes, dataset$Drinks, method = "spearman")
## Warning in cor.test.default(dataset$Minutes, dataset$Drinks, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: dataset$Minutes and dataset$Drinks
## S = 1305608, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9200417
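# OPTIONAL: HOW SPEARMAN'S RHO IS COMPUTED
# A minimal sketch, not part of the original assignment code. Spearman's rho is
# the Pearson correlation of the ranked values, with tied values sharing their
# average rank. The toy vectors below are made up for illustration and are NOT
# taken from A5RQ1.xlsx.
toy_x <- c(10, 15, 15, 22, 30, 48)
toy_y <- c(1, 2, 2, 3, 3, 6)
rho_spearman <- cor(toy_x, toy_y, method = "spearman")
rho_on_ranks <- cor(rank(toy_x), rank(toy_y))   # Pearson on the ranks
all.equal(rho_spearman, rho_on_ranks)
# Setting exact = FALSE asks for the approximate p-value up front, which avoids
# the "Cannot compute exact p-value with ties" warning when the data has ties.
toy_test <- cor.test(toy_x, toy_y, method = "spearman", exact = FALSE)
print(toy_test)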
# DETERMINE STATISTICAL SIGNIFICANCE
# The p-value is < 2.2e-16 (< 0.05), so the correlation is statistically significant.
# ===============================================
# EFFECT SIZE FOR SPEARMAN CORRELATION
# ===============================================
# If results were statistically significant, then determine how the variables are related and how strong the relationship is.
# 1) REVIEW THE CORRECT CORRELATION TEST
# rho = 0.9200417
# ........................................................
# 2) WRITE THE REPORT
# Q1) What is the direction of the effect?
# A correlation of 0.92 is positive. As Minutes increases, Drinks increases.
#
# Q2) What is the size of the effect?
# A correlation of 0.92 is a strong relationship.
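# OPTIONAL: LABEL THE DIRECTION AND SIZE OF A CORRELATION
# A minimal sketch, not part of the original assignment code. The helper name
# describe_correlation and the cutoffs (0.1 weak, 0.3 moderate, 0.5 strong) are
# assumptions based on a common rule of thumb. Replace them with the cutoffs
# from your course materials if they differ.
describe_correlation <- function(rho, weak = 0.1, moderate = 0.3, strong = 0.5) {
  direction <- if (rho > 0) "positive" else if (rho < 0) "negative" else "none"
  size <- abs(rho)
  strength <- if (size >= strong) {
    "strong"
  } else if (size >= moderate) {
    "moderate"
  } else if (size >= weak) {
    "weak"
  } else {
    "negligible"
  }
  paste(direction, strength, sep = ", ")
}
# rho reported by cor.test() above
describe_correlation(0.9200417)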