# ===================================================
# PEARSON CORRELATION & SPEARMAN CORRELATION OVERVIEW
# ===================================================
# PURPOSE
# Used to test the relationship between two continuous variables.
# ==========
# HYPOTHESES
# ==========
# NULL HYPOTHESIS
# There is no relationship between Variables A and B.
# ALTERNATE HYPOTHESIS
# There is a relationship between Variables A and B.
# DIRECTIONAL ALTERNATE HYPOTHESES
# As Variable A increases, Variable B increases.
# As Variable A increases, Variable B decreases.
# .................................................................
# QUESTION
# What are the null and alternate hypotheses for your research?
# H0: There is no relationship between the time spent and the number of drinks ordered.
# H1: There is a relationship between time spent and the number of drinks ordered.
# .................................................................
# ======================
# IMPORT EXCEL FILE CODE
# ======================
# PURPOSE OF THIS CODE
# Imports your Excel dataset automatically into R Studio.
# You need to import your dataset every time you want to analyze your data in R Studio.
# INSTALL REQUIRED PACKAGE
# The package only needs to be installed once.
# The code for this task is provided below. Remove the hashtag below to convert the note into code.
# install.packages("readxl")
# LOAD THE PACKAGE
# You must always reload the package you want to use.
# The code for this task is provided below. Remove the hashtag below to convert the note into code.
library(readxl)
# IMPORT THE EXCEL FILE INTO R STUDIO
# Download the Excel file from One Drive and save it to your desktop.
# Right-click the Excel file and click “Copy as path” from the menu.
# In R Studio, replace the example path below with your actual path.
# Replace backslashes \ with forward slashes / or double them \\:
# ✘ WRONG "C:\Users\Joseph\Desktop\mydata.xlsx"
# ✔ CORRECT "C:/Users/Joseph/Desktop/mydata.xlsx"
# ✔ CORRECT "C:\\Users\\Joseph\\Desktop\\mydata.xlsx"
# Replace "dataset" with the name of your excel data (without the .xlsx)
# An example of the code for this task is provided below.
# You can edit the code below and remove the hashtag to use the code below.
# Import the file once; the object name A5RQ1 is used in all of the code below.
A5RQ1 <- read_excel("C:/Users/Luqman ullah/Downloads/A5RQ1.xlsx")
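# OPTIONAL: BUILD THE PATH WITHOUT EDITING SLASHES BY HAND
# A minimal sketch of the slash rule described above, not part of the original assignment.
# It assumes R 4.0.0 or later, where a raw string r"(...)" keeps backslashes exactly as pasted.
# The Joseph path is the placeholder from the example above; example_path and fixed_path are illustrative names.
example_path <- r"(C:\Users\Joseph\Desktop\mydata.xlsx)"
# Swap every backslash for a forward slash; fixed = TRUE treats "\\" as one literal backslash.
fixed_path <- gsub("\\", "/", example_path, fixed = TRUE)
print(fixed_path)
# The resulting string can be passed to read_excel() in place of a hand-edited path.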
# ======================
# DESCRIPTIVE STATISTICS
# ======================
# Calculate the mean, median, SD, and sample size for each variable.
# INSTALL THE REQUIRED PACKAGE
# Remove the hashtag in front of the code below to install the package once.
# After installing the package, put the hashtag in front of the code again.
#install.packages("psych")
# LOAD THE PACKAGE
# Always reload the package you want to use.
library(psych)
# CALCULATE THE DESCRIPTIVE DATA
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
describe(A5RQ1[, c("Minutes", "Drinks")])
##         vars   n  mean    sd median trimmed   mad min   max range skew kurtosis
## Minutes    1 461 29.89 18.63   24.4   26.99 15.12  10 154.2 144.2 1.79     5.20
## Drinks     2 461  3.00  1.95    3.0    2.75  1.48   0  17.0  17.0 1.78     6.46
##           se
## Minutes 0.87
## Drinks  0.09
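# OPTIONAL: CROSS-CHECK THE DESCRIPTIVES WITH BASE R
# A small sketch, not part of the original assignment, showing where the n, mean, SD, and median come from.
# basic_stats is a hypothetical helper name; it drops missing values before calculating.
basic_stats <- function(x) {
  x <- x[!is.na(x)]
  c(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}
# Run the check only if the dataset has already been imported.
if (exists("A5RQ1")) {
  print(round(sapply(A5RQ1[, c("Minutes", "Drinks")], basic_stats), 2))
}
# The values should agree with the n, mean, sd, and median columns that describe() reports.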
# =========================
# VISUALLY DISPLAY THE DATA
# =========================
# CREATE A SCATTERPLOT
# PURPOSE
# A scatterplot visually shows the relationship between two continuous variables.
# INSTALL THE REQUIRED PACKAGES
# Remove the hashtags in front of the code below to install the package once.
# After installing the packages, put the hashtag in front of the code again.
#install.packages("ggplot2")
#install.packages("ggpubr")
# LOAD THE PACKAGE
# Always reload the package you want to use.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(ggpubr)
# CREATE THE SCATTERPLOT
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
# Replace "pearson" with "spearman" if you are using the Spearman correlation.
ggscatter(A5RQ1, x = "Minutes", y = "Drinks",
          add = "reg.line",
          conf.int = TRUE,
          cor.coef = TRUE,
          cor.method = "pearson",
          xlab = "Variable Minutes", ylab = "Variable Drinks")
# ........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
# Is the relationship positive (line pointing up), negative (line pointing down), or is there no relationship (line is flat)?
# answer: There is a positive relationship between the time spent (minutes) and number of drinks purchased.
# ===============================================
# CHECK THE NORMALITY OF THE CONTINUOUS VARIABLES
# ===============================================
# OVERVIEW
# Two methods will be used to check the normality of the continuous variables.
# First, you will create histograms to visually inspect the normality of the variables.
# Next, you will conduct a test called the Shapiro-Wilk test to inspect the normality of the variables.
# It is important to know whether or not the data is normal to determine which inferential test should be used.
# CREATE A HISTOGRAM FOR EACH CONTINUOUS VARIABLE
# A histogram is used to visually check if the data is normally distributed.
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
hist(A5RQ1$Minutes,
     main = "Histogram of Minutes",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)
hist(A5RQ1$Drinks,
     main = "Histogram of Drinks",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)
# ........................................................
# QUESTION
# Answer the questions below as comments within the R script:
# Q1) Check the SKEWNESS of the VARIABLE 1 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# answer: The histogram is positively skewed; it is not symmetrical.
# Q2) Check the KURTOSIS of the VARIABLE 1 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# answer: it does not have a proper bell curve.
# Q3) Check the SKEWNESS of the VARIABLE 2 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# answer: The histogram is positively skewed; it is not symmetrical.
# Q4) Check the KURTOSIS of the VARIABLE 2 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# answer: It is not shaped like a proper bell curve.
# ........................................................
# PURPOSE
# Use a statistical test to check the normality of the continuous variables.
# The Shapiro-Wilk test checks for overall departure from a normal distribution, so it is sensitive to both skewness and kurtosis.
# The test asks: "Is this variable the SAME as normal data (null hypothesis) or DIFFERENT from normal data (alternate hypothesis)?"
# If p is GREATER than .05 (p > .05), there is no significant departure from normality, so treat the data as NORMAL.
# If p is LESS than .05 (p < .05), the data is NOT normal.
# CONDUCT THE SHAPIRO-WILK TEST
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
shapiro.test(A5RQ1$Minutes)
##
## Shapiro-Wilk normality test
##
## data: A5RQ1$Minutes
## W = 0.84706, p-value < 2.2e-16
shapiro.test(A5RQ1$Drinks)
##
## Shapiro-Wilk normality test
##
## data: A5RQ1$Drinks
## W = 0.85487, p-value < 2.2e-16
# .........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
# Was the data normally distributed for Variable 1?
# No, the data is not normally distributed, because p < 2.2e-16 is less than .05.
# Was the data normally distributed for Variable 2?
# No, the data is not normally distributed, because p < 2.2e-16 is less than .05.
# If the data is normal for both variables, continue with the Pearson Correlation test.
# If one or both variables are NOT normal, change to the Spearman Correlation test.
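# OPTIONAL: APPLY THE DECISION RULE IN CODE
# A minimal sketch of the rule above, not part of the original assignment.
# choose_cor_method is a hypothetical helper name, and alpha = .05 is the cutoff used in this script.
choose_cor_method <- function(x, y, alpha = 0.05) {
  p_x <- shapiro.test(x)$p.value
  p_y <- shapiro.test(y)$p.value
  # Pearson only when neither variable departs significantly from normality.
  if (p_x > alpha && p_y > alpha) "pearson" else "spearman"
}
# Run the check only if the dataset has already been imported.
if (exists("A5RQ1")) {
  print(choose_cor_method(A5RQ1$Minutes, A5RQ1$Drinks))
}
# The returned string can be passed to the method argument of cor.test().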
# ================================================
# PEARSON CORRELATION OR SPEARMAN CORRELATION TEST
# ================================================
# PURPOSE
# Check whether there is a statistically significant relationship between the two continuous variables.
# CONDUCT THE PEARSON CORRELATION OR SPEARMAN CORRELATION
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
# Replace "pearson" with "spearman" if you are using the Spearman correlation.
cor.test(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman")
## Warning in cor.test.default(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman"):
## Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: A5RQ1$Minutes and A5RQ1$Drinks
## S = 1305608, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.9200417
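# OPTIONAL: AVOID THE TIES WARNING
# The warning above means some values are tied in rank, so R falls back from an exact p-value to an approximation.
# A minimal sketch, not part of the original assignment: setting exact = FALSE asks for that approximation directly.
# toy_x and toy_y are made-up vectors with ties, used only to keep this example self-contained.
toy_x <- c(1, 2, 2, 3, 4, 5, 5, 6)
toy_y <- c(2, 1, 3, 3, 5, 4, 6, 6)
toy_result <- cor.test(toy_x, toy_y, method = "spearman", exact = FALSE)
print(toy_result$estimate)
# For the real data the equivalent call would be:
# cor.test(A5RQ1$Minutes, A5RQ1$Drinks, method = "spearman", exact = FALSE)
# The estimate (rho) is not affected by this argument.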
# DETERMINE STATISTICAL SIGNIFICANCE
# If results were statistically significant (p < .05), continue to effect size section below.
# If results were NOT statistically significant (p > .05), skip to reporting section below.
# NOTE: Getting results that are not statistically significant does NOT mean you switch to Spearman Correlation.
# The Spearman Correlation is only for data that is not normally distributed; the choice does not depend on whether the results are significant.
# ===============================================
# EFFECT SIZE FOR PEARSON & SPEARMAN CORRELATION
# ===============================================
# If results were statistically significant, then determine how the variables are related and how strong the relationship is.
# 1) REVIEW THE CORRECT CORRELATION TEST
# • For Pearson correlation, find "sample estimates: cor" in your output (when you calculated the Pearson Correlation earlier).
# • For Spearman correlation, find "sample estimates: rho" in your output (when you calculated the Spearman Correlation earlier).
# ........................................................
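# OPTIONAL: LABEL THE DIRECTION AND STRENGTH IN CODE
# A minimal sketch, not part of the original assignment. describe_correlation is a hypothetical helper name.
# ASSUMPTION: the cutoffs below (.10, .30, .50) follow Cohen's commonly cited guidelines; use your course's cutoffs if they differ.
describe_correlation <- function(r) {
  direction <- if (r > 0) "positive" else if (r < 0) "negative" else "no direction"
  size <- abs(r)
  strength <- if (size >= 0.50) {
    "strong"
  } else if (size >= 0.30) {
    "moderate"
  } else if (size >= 0.10) {
    "weak"
  } else {
    "negligible"
  }
  paste(strength, direction)
}
# rho copied from the Spearman output earlier in this script.
print(describe_correlation(0.9200417))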