# ===================================================
# PEARSON CORRELATION & SPEARMAN CORRELATION OVERVIEW
# ===================================================

# PURPOSE
# Used to test the relationship between two continuous variables.

# ==========
# HYPOTHESES
# ==========

# NULL HYPOTHESIS
# There is no relationship between Variables A and B.

# ALTERNATE HYPOTHESIS
# There is a relationship between Variables A and B.

# DIRECTIONAL ALTERNATE HYPOTHESES
# As Variable A increases, Variable B increases.
# As Variable A increases, Variable B decreases.

# .................................................................
# QUESTION
# What are the null and alternate hypotheses for your research?
# H0: There is no relationship between the number of laptops sold and the number of antivirus products sold.
# H1: As the number of laptops sold increases, the number of antivirus products sold also increases.
# .................................................................
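# EXAMPLE (SKETCH)
# The H1 above is directional, so a one-sided test is one possible way to match it.
# This is only an illustration on simulated data: demo_laptop and demo_antivirus are
# made-up names, not columns from the assignment file. It relies on the "alternative"
# argument of cor.test(); the main analysis later in this script uses the two-sided default.

set.seed(1)
demo_laptop    <- rnorm(50, mean = 40, sd = 12)
demo_antivirus <- demo_laptop + rnorm(50, mean = 10, sd = 5)

# alternative = "greater" tests H1: the true correlation is greater than 0
demo_one_sided <- cor.test(demo_laptop, demo_antivirus,
                           method = "pearson", alternative = "greater")
demo_one_sided$alternative
demo_one_sided$p.value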

# ======================
# IMPORT EXCEL FILE CODE
# ======================

# PURPOSE OF THIS CODE
# Imports your Excel dataset automatically into R Studio.
# You need to import your dataset every time you want to analyze your data in R Studio.

# INSTALL REQUIRED PACKAGE
# The package only needs to be installed once.
# The code for this task is provided below. Remove the hashtag below to convert the note into code.

#install.packages("readxl")

# LOAD THE PACKAGE
# You must always reload the package you want to use. 
# The code for this task is provided below. Remove the hashtag below to convert the note into code.

library(readxl)
# IMPORT THE EXCEL FILE INTO R STUDIO
# Download the Excel file from OneDrive and save it to your desktop.
# Right-click the Excel file and click “Copy as path” from the menu.
# In R Studio, replace the example path below with your actual path.
# Replace backslashes \ with forward slashes / or double them \\:
# ✘ WRONG   "C:\Users\Joseph\Desktop\mydata.xlsx"
# ✔ CORRECT "C:/Users/Joseph/Desktop/mydata.xlsx"
# ✔ CORRECT "C:\\Users\\Joseph\\Desktop\\mydata.xlsx"
# Replace "dataset" with the name of your excel data (without the .xlsx)
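#
# EXAMPLE (SKETCH)
# A small illustration of the slash rule above. It assumes R 4.0 or later, which supports
# the raw-string syntax r"(...)". The path is the made-up example path from these notes,
# not a real file, and the demo_ names are hypothetical.

demo_win_path <- r"(C:\Users\Joseph\Desktop\mydata.xlsx)"      # raw string keeps the single backslashes
demo_fixed    <- gsub("\\", "/", demo_win_path, fixed = TRUE)  # swap each backslash for a forward slash
demo_fixed

# file.exists() returns TRUE only if a file is really stored at that location,
# so it can be used to check a path before calling read_excel()
file.exists(demo_fixed)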

# An example of the code for this task is provided below.
# You can edit the code below and remove the hashtag to use the code below.

# Import the file once. The object name A5RQ2 is used in the rest of this script.
A5RQ2 <- read_excel("C:/Users/User/Downloads/A5RQ2.xlsx")

# ======================
# DESCRIPTIVE STATISTICS
# ======================

# Calculate the mean, median, SD, and sample size for each variable.

# INSTALL THE REQUIRED PACKAGE
# Remove the hashtag in front of the code below to install the package once. 
# After installing the package, put the hashtag in front of the code again.

#install.packages("psych")

# LOAD THE PACKAGE
# Always reload the package you want to use. 

library(psych)
# CALCULATE THE DESCRIPTIVE DATA
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.

describe(A5RQ2[, c("Antivirus", "Laptop")])
##           vars   n  mean    sd median trimmed   mad min max range  skew
## Antivirus    1 122 50.18 13.36     49   49.92 12.60  15  83    68  0.15
## Laptop       2 122 40.02 12.30     39   39.93 11.86   8  68    60 -0.01
##           kurtosis   se
## Antivirus    -0.14 1.21
## Laptop       -0.32 1.11
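# EXAMPLE (SKETCH)
# For comparison, the four summary values named above (n, mean, median, SD) can also be
# computed with base R alone. demo_data is simulated; its column names only mimic the
# assignment data and its numbers will not match the output above.

set.seed(2)
demo_data <- data.frame(Antivirus = rnorm(30, mean = 50, sd = 13),
                        Laptop    = rnorm(30, mean = 40, sd = 12))

demo_summary <- sapply(demo_data, function(v) {
  c(n      = sum(!is.na(v)),           # sample size (non-missing values)
    mean   = mean(v, na.rm = TRUE),
    median = median(v, na.rm = TRUE),
    sd     = sd(v, na.rm = TRUE))
})
round(demo_summary, 2)                 # one column per variable, one row per statistic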
# =========================
# VISUALLY DISPLAY THE DATA
# =========================

# CREATE A SCATTERPLOT

# PURPOSE
# A scatterplot visually shows the relationship between two continuous variables.

# INSTALL THE REQUIRED PACKAGES
# Remove the hashtags in front of the code below to install the package once. 
# After installing the packages, put the hashtag in front of the code again.

#install.packages("ggplot2")
#install.packages("ggpubr")

# LOAD THE PACKAGE
# Always reload the package you want to use. 

library(ggplot2)
library(ggpubr)
# CREATE THE SCATTERPLOT
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
# Replace "pearson" with "spearman" if you are using the spearman correlation.

ggscatter(A5RQ2, x = "Antivirus", y = "Laptop",
          add = "reg.line",
          conf.int = TRUE,
          cor.coef = TRUE,
          cor.method = "pearson",
          xlab = "Antivirus", ylab = "Laptop")
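
# EXAMPLE (SKETCH)
# A base R version of the same idea (points plus a fitted regression line), shown on
# simulated data. demo_x and demo_y are made-up variables. This is an alternative
# illustration using plot(), lm(), and abline(), not a replacement for ggscatter().

set.seed(3)
demo_x <- rnorm(60, mean = 50, sd = 13)
demo_y <- 0.8 * demo_x + rnorm(60, mean = 0, sd = 6)

plot(demo_x, demo_y,
     xlab = "X (simulated)", ylab = "Y (simulated)",
     main = "Scatterplot with regression line")
demo_fit <- lm(demo_y ~ demo_x)     # simple linear regression of Y on X
abline(demo_fit, col = "blue")      # draw the fitted line on the existing plot

# A positive slope means the line points up; a negative slope means it points down
coef(demo_fit)[["demo_x"]]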

# ........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
# Is the relationship positive (line pointing up), negative (line pointing down), or is there no relationship (line is flat)?
# There is a positive relationship between antivirus sales and laptop sales (the line points up).
# ........................................................

# ===============================================
# CHECK THE NORMALITY OF THE CONTINUOUS VARIABLES
# ===============================================

# OVERVIEW
# Two methods will be used to check the normality of the continuous variables.
# First, you will create histograms to visually inspect the normality of the variables.
# Next, you will conduct a test called the Shapiro-Wilk test to inspect the normality of the variables.
# It is important to know whether or not the data is normal to determine which inferential test should be used.


# CREATE A HISTOGRAM FOR EACH CONTINUOUS VARIABLE
# A histogram is used to visually check if the data is normally distributed.

# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.

hist(A5RQ2$Antivirus,
     main = "Histogram of Antivirus",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue",
     border = "black",
     breaks = 20)

hist(A5RQ2$Laptop,
     main = "Histogram of Laptop",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightgreen",
     border = "black",
     breaks = 20)
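
# EXAMPLE (SKETCH)
# One way to make the visual check easier is to overlay a normal curve that has the same
# mean and SD as the data. The histogram must use densities (freq = FALSE) so the curve
# and the bars share one scale. demo_values is simulated, not assignment data.

set.seed(4)
demo_values <- rnorm(120, mean = 50, sd = 13)

hist(demo_values,
     freq = FALSE,                  # plot densities instead of counts
     breaks = 20,
     col = "lightblue",
     border = "black",
     main = "Histogram with normal curve (simulated)",
     xlab = "Value")
curve(dnorm(x, mean = mean(demo_values), sd = sd(demo_values)),
      add = TRUE, lwd = 2)          # normal curve drawn on top of the histogram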

# ........................................................
# QUESTION
# Answer the questions below as comments within the R script:
# Q1) Check the SKEWNESS of the VARIABLE 1 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# Q2) Check the KURTOSIS of the VARIABLE 1 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?
# Q3) Check the SKEWNESS of the VARIABLE 2 histogram. In your opinion, does the histogram look symmetrical, positively skewed, or negatively skewed?
# Q4) Check the KURTOSIS of the VARIABLE 2 histogram. In your opinion, does the histogram look too flat, too tall, or does it have a proper bell curve?

# ANSWER
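# 1. In my opinion, the Antivirus histogram looks roughly symmetrical, with a very slight positive skew (skew = 0.15 in the describe() output).
# 2. The Antivirus histogram is close to a proper bell curve, only slightly flatter than normal (kurtosis = -0.14).
# 3. In my opinion, the Laptop histogram looks symmetrical (skew = -0.01).
# 4. The Laptop histogram is also close to a proper bell curve, slightly flatter than normal (kurtosis = -0.32).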
# ........................................................
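
# EXAMPLE (SKETCH)
# The visual judgement above can be backed up with numbers. The helper functions below
# are hypothetical names written for this illustration. They use the simple moment
# formulas, so the values may differ slightly from the skew and kurtosis columns of
# describe(), which may apply a different sample adjustment.

demo_skew <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5         # 0 = symmetrical, > 0 = positive skew, < 0 = negative skew
}

demo_kurt <- function(v) {
  v <- v[!is.na(v)]
  d <- v - mean(v)
  mean(d^4) / mean(d^2)^2 - 3       # 0 = normal bell curve, < 0 = flatter, > 0 = taller
}

demo_skew(c(1, 2, 3, 4, 5))         # symmetrical data gives 0
demo_skew(c(1, 1, 1, 2, 10))        # long right tail gives a positive value
demo_kurt(c(1, 2, 3, 4, 5))         # evenly spread data gives a negative value (flat)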

# PURPOSE
# Use a statistical test to check the normality of the continuous variables.
# The Shapiro-Wilk test checks whether the overall shape of the data, including its skewness and kurtosis, departs from a normal distribution.
# The test is checking "Is this variable the SAME as normal data (null hypothesis) or DIFFERENT from normal data (alternate hypothesis)?"
# For this test, if p is GREATER than .05 (p > .05), there is no evidence against normality, so the data is treated as NORMAL.
# If p is LESS than .05 (p < .05), the data is NOT normal.
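
# EXAMPLE (SKETCH)
# To see the decision rule in action, the sketch below runs the test on simulated data
# that is deliberately skewed (drawn from an exponential distribution). demo_skewed is
# a made-up variable used only for this illustration.

set.seed(5)
demo_skewed <- rexp(200)                  # strongly right-skewed data
demo_sw <- shapiro.test(demo_skewed)
demo_sw$p.value                           # the p-value is stored in the test result
demo_sw$p.value < .05                     # TRUE here, so this data is NOT normal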

# CONDUCT THE SHAPIRO-WILK TEST
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.

shapiro.test(A5RQ2$Antivirus)
## 
##  Shapiro-Wilk normality test
## 
## data:  A5RQ2$Antivirus
## W = 0.99419, p-value = 0.8981
shapiro.test(A5RQ2$Laptop)
## 
##  Shapiro-Wilk normality test
## 
## data:  A5RQ2$Laptop
## W = 0.99362, p-value = 0.8559
# .........................................................
# QUESTION
# Answer the questions below as a comment within the R script:
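# Was the data normally distributed for Variable 1?
# Yes. Antivirus was normally distributed (W = 0.99, p = .90), because p > .05.
# Was the data normally distributed for Variable 2?
# Yes. Laptop was normally distributed (W = 0.99, p = .86), because p > .05.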
# .........................................................

# If the data is normal for both variables, continue with the Pearson Correlation test.
# If one or both of the variables are NOT normal, change to the Spearman Correlation test.
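
# EXAMPLE (SKETCH)
# The rule above can be written as a small helper. demo_choose_method() is a hypothetical
# function name, and the data is simulated. The .05 cut-off is the one used in these notes.

demo_choose_method <- function(x, y, alpha = .05) {
  both_normal <- shapiro.test(x)$p.value > alpha &&
                 shapiro.test(y)$p.value > alpha
  if (both_normal) "pearson" else "spearman"
}

set.seed(6)
demo_a <- rexp(100)                       # clearly non-normal variable
demo_b <- demo_a + rnorm(100)
demo_method <- demo_choose_method(demo_a, demo_b)
demo_method                               # "spearman", because demo_a is not normal
cor.test(demo_a, demo_b, method = demo_method)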

# ================================================
# PEARSON CORRELATION OR SPEARMAN CORRELATION TEST
# ================================================

# PURPOSE
# Check whether two continuous variables are related, and measure the direction and strength of that relationship.

# CONDUCT THE PEARSON CORRELATION OR SPEARMAN CORRELATION
# Replace "dataset" with the name of your excel data (without the .xlsx)
# Replace "V1" with the R code name for your first variable.
# Replace "V2" with the R code name for your second variable.
# Replace "pearson" with "spearman" if you are using the spearman correlation.

cor.test(A5RQ2$Antivirus, A5RQ2$Laptop, method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  A5RQ2$Antivirus and A5RQ2$Laptop
## t = 25.16, df = 120, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8830253 0.9412249
## sample estimates:
##       cor 
## 0.9168679
mean(A5RQ2$Antivirus); sd(A5RQ2$Antivirus)
## [1] 50.18033
## [1] 13.35777
mean(A5RQ2$Laptop); sd(A5RQ2$Laptop)
## [1] 40.01639
## [1] 12.29525
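# EXAMPLE (SKETCH)
# As a cross-check, the t value in the Pearson output can be recovered from r and df
# with t = r * sqrt(df) / sqrt(1 - r^2). The two numbers below are copied from the
# output above; the demo_ names are hypothetical.

demo_r  <- 0.9168679                      # "cor" in the output above
demo_df <- 120                            # "df" in the output above
demo_t  <- demo_r * sqrt(demo_df) / sqrt(1 - demo_r^2)
round(demo_t, 2)                          # agrees with t = 25.16 in the output
demo_df + 2                               # df = n - 2, so n = 122 as reported by describe()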
# DETERMINE STATISTICAL SIGNIFICANCE
# If results were statistically significant (p < .05), continue to effect size section below.
# If results were NOT statistically significant (p > .05), skip to reporting section below.
# NOTE: Getting results that are not statistically significant does NOT mean you switch to Spearman Correlation.
# The Spearman Correlation is for data that is not normally distributed; the choice does not depend on whether the results are significant.

# ===============================================
# EFFECT SIZE FOR PEARSON & SPEARMAN CORRELATION
# ===============================================

# If results were statistically significant, then determine how the variables are related and how strong the relationship is.

# 1) REVIEW THE CORRECT CORRELATION TEST
#    • For Pearson correlation, find "sample estimates: cor" in your output (when you calculated the Pearson Correlation earlier).
#    • For Spearman correlation, find "sample estimates: rho" in your output (when you calculated the Spearman Correlation earlier).

# ........................................................

# 2) WRITE THE REPORT
#    Answer the questions below as a comment within the R script:
#    Q1) What is the direction of the effect?
#        "Direction" explains the relationship between the variables.
#        A positive (+) correlation means as Variable X increases, Variable Y increases.
#        A negative (-) correlation means as Variable X increases, Variable Y decreases.
#        Examples:
#            A correlation of 0.90 is positive. As X increases, Y increases.
#            A correlation of -0.90 is negative. As X increases, Y decreases.
#
#    Q2) What is the size of the effect?
#        "Size" explains how strongly the variables are connected to each other.
#        Ignoring the sign, it ranges from 0 (no relationship) to 1.00 (perfect relationship).
#        ± 0.00 to 0.09 = no relationship
#        ± 0.10 to 0.29 = weak
#        ± 0.30 to 0.49 = moderate
#        ± 0.50 to 1.00 = strong
#        Examples:
#            A correlation of 0.90 is a strong relationship.
#            A correlation of 0.15 is a weak relationship.
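
# EXAMPLE (SKETCH)
# The direction and size rules above can be written as two small helpers. The function
# names are hypothetical. The coefficient is rounded to two decimals first, because the
# cut-offs in the table are given to two decimals.

demo_direction <- function(r) {
  if (r > 0) "positive" else if (r < 0) "negative" else "none"
}

demo_effect_size <- function(r) {
  size <- round(abs(r), 2)                # size ignores the sign
  if (size < 0.10) {
    "no relationship"
  } else if (size < 0.30) {
    "weak"
  } else if (size < 0.50) {
    "moderate"
  } else {
    "strong"
  }
}

demo_direction(0.90);  demo_effect_size(0.90)     # "positive", "strong"
demo_direction(-0.45); demo_effect_size(-0.45)    # "negative", "moderate"
demo_effect_size(0.15)                            # "weak"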


# ========================================================
#     >> WRITTEN REPORT FOR PEARSON CORRELATION <<
# ========================================================

# Write a paragraph summarizing your findings.

# ........................................................

# 1) REVIEW YOUR OUTPUT
#    Collect the information below from your output:
#    1) The name of the inferential test used (Pearson Correlation)
#    2) The names of the two variables you analyzed (their proper names, not their R code names).
#    3) The total sample size (labeled as "n").
#    4) Whether the inferential test results were statistically significant (p < .05) or not (p > .05)
#    5) The mean and SD for each variable (rounded to two places after the decimal)
#    6) The direction and size of the correlation.
#    7) Degrees of freedom (labeled as "df")
#    8) r-value (labeled as "sample estimates: cor" in output)
#    9) p-value (exact value rounded to two places after the decimal)

# ........................................................

# 2) WRITE YOUR FINAL REPORT
#    An example report is provided below. You should copy the paragraph and just edit/ replace words with your information.
#    This is not considered plagiarizing because science has a specific format for reporting information.
#
#    EXAMPLE
#    A Pearson correlation was conducted to examine the relationship between 
#    job satisfaction and employee performance (n = 300). 
#    There was a statistically significant correlation between 
#    job satisfaction (M = 8.21, SD = 0.2) and employee performance (M = 4.2, SD = 0.02). 
#    The correlation was positive and strong, r(298) = 0.65, p < .05.
#    As job satisfaction increases, employee performance also increases.
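
# EXAMPLE (SKETCH)
# The statistics in the last sentence of the report can be pulled from the cor.test()
# result instead of being retyped. The data below is simulated and the demo_ names are
# hypothetical. Reporting very small p-values as "p < .01" is an assumption made for this
# sketch (it avoids printing "p = 0.00"); follow your instructor's rule if it differs.

set.seed(7)
demo_x <- rnorm(40, mean = 8, sd = 1)
demo_y <- 0.5 * demo_x + rnorm(40, mean = 0, sd = 0.5)
demo_res <- cor.test(demo_x, demo_y, method = "pearson")

demo_p      <- demo_res$p.value
demo_p_text <- if (demo_p < .01) "p < .01" else sprintf("p = %.2f", demo_p)

demo_sentence <- sprintf("r(%d) = %.2f, %s",
                         as.integer(demo_res$parameter),   # degrees of freedom (n - 2)
                         demo_res$estimate,                # correlation coefficient r
                         demo_p_text)
demo_sentence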


# ========================================================
#     >> WRITTEN REPORT FOR SPEARMAN CORRELATION <<
# ========================================================

# Write a paragraph summarizing your findings.

# ........................................................

# 1) REVIEW YOUR OUTPUT
#    Collect the information below from your output:
#    1) The name of the inferential test used (Spearman Correlation)
#    2) The names of the two variables you analyzed (their proper names, not their R code names).
#    3) The total sample size (labeled as "n").
#    4) Whether the inferential test results were statistically significant (p < .05) or not (p > .05)
#    5) The mean and SD for each variable (rounded to two places after the decimal)
#    6) The direction and size of the correlation.
#    7) Degrees of freedom (not printed in the Spearman output; calculate as n - 2)
#    8) rho-value (labeled as "sample estimates: rho" in output, and labeled as ρ in paragraph)
#    9) p-value (exact value rounded to two places after the decimal)

# ........................................................

# 2) WRITE YOUR FINAL REPORT
#    An example report is provided below. You should copy the paragraph and just edit/ replace words with your information.
#    This is not considered plagiarizing because science has a specific format for reporting information.
#
#    EXAMPLE
#    A Spearman correlation was conducted to assess the relationship between 
#    stress levels and sleep quality (n = 75).
#    There was a statistically significant correlation between 
#    stress (M = 6.31, SD = 1.21) and sleep quality (M = 4.12, SD = 0.91).
#    The correlation was negative and moderate, ρ(73) = -0.45, p = .02.
#    As stress level increases, sleep quality decreases.
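
# EXAMPLE (SKETCH)
# For Spearman, cor.test() stores the coefficient under the name "rho" and does not
# return degrees of freedom, so the sketch below computes df as n - 2, which matches
# the example above (n = 75, df = 73). The data is simulated and the demo_ names are
# hypothetical.

set.seed(8)
demo_stress <- rexp(30)
demo_sleep  <- -demo_stress + rnorm(30, mean = 0, sd = 0.5)
demo_sp <- cor.test(demo_stress, demo_sleep, method = "spearman")

names(demo_sp$estimate)                   # "rho"
is.null(demo_sp$parameter)                # TRUE: the Spearman result has no df
demo_n  <- length(demo_stress)
demo_sp_df <- as.integer(demo_n - 2)      # df worked out by hand
sprintf("rho(%d) = %.2f", demo_sp_df, demo_sp$estimate)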


# ========================================================
#           >> CREATING AN R MARKDOWN FILE <<
# ========================================================

# You need to present your R code, output, and charts in one document.
# R Studio supports a file format called R Markdown that organizes this information into one HTML document.

# Here is a YouTube video detailing how to use R Markdown: https://www.youtube.com/watch?v=DNS7i2m4sB0

# .........................................................

# CREATE A NEW R MARKDOWN FILE
# In R Studio, click on "File" in the top-left corner > "New File" > "R Markdown"
# A dialogue box will appear.
# For title, put Assignment 1 (or 2, 3, 4, etc).
# For author, put Team 1 (or 2, 3, 4, whatever your Team number is).
# For format, select HTML. Other formats require complex plug-ins/ downloads.
# Click "OK"
# A new R Markdown file will appear in R Studio (top-left section).

# ..........................................................

# R MARKDOWN EXAMPLE & CODING
# A new R Markdown file is automatically populated with an example of how to use R Markdown.
# This example is helpful because it demonstrates how R Markdown code is slightly different from R Script code.
# Specifically, comments do not need to start with a hashtag.
# Sections of code should start with ```{r} and end with ```
# To view how this will look as an HTML document, click the small button with a blue ball that says "Knit" in the R Markdown file.
# This button creates an HTML document of your code, output, and charts.
# To save your R Markdown file, click the floppy disk icon (blue square) at the top of your R Markdown file.
# Save your R Markdown file to your desktop.

# ...........................................................

# EDITING YOUR R MARKDOWN FILE
# In R Markdown, leave the title, output, and date section and delete everything else.
# Copy and paste the R code from your R script file into your R Markdown file.
# Edit your code so that it is clean. You can save the R script file for your own notes. 
# However, the R Markdown file is the "final product" you are submitting to Dr. Saffaf. It is similar to what you would submit
# to your boss at a real job. Remove the text Dr. Saffaf provided as directions to you, and instead provide notes that would be useful
# to someone looking at your document trying to follow along with what you did. For example, the entire first section labeled
# "IMPORT EXCEL FILE" could be condensed to just a few lines of comments and code:
#
#     Import excel file to R Studio
#     Previously installed readxl package
#     ```{r}
#     library(readxl)
#     dataset <- read_excel("C:/Users/Joseph/Desktop/dataset.xlsx")
#     ```