Clear Data

rm(list = ls())      # Remove all objects from your environment
gc()                 # Release unused memory
##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 526490 28.2    1169547 62.5         NA   669420 35.8
## Vcells 970770  7.5    8388608 64.0      16384  1851931 14.2
cat("\f")            # Clear the console
graphics.off()       # Clear all graphs

Part 1)

Do a few Google searches and tell us what correlation is (5 lines max).

Correlation is a standardized measure between -1 and 1 that indicates the strength and direction of the linear relationship between two variables.
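
As a minimal sketch of this definition, the Pearson correlation can be computed by hand and compared with base R's `cor()`. The vectors `x` and `y` below are made-up toy values, not taken from the pizza data.

```r
# Toy data (hypothetical values, for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviations of each observation from its mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation: sum of cross-deviations divided by
# the square root of the product of the sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Base R computes the same quantity (Pearson is the default method)
r_builtin <- cor(x, y)

print(r_manual)
print(all.equal(r_manual, r_builtin))
```

Because the numerator and denominator are in the same units, the result is unit-free and always falls between -1 and 1.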

Part 2)

Do a few Google searches and tell us what covariance is (5 lines max).

Covariance measures the degree to which two random variables vary together. It can take any value in \((-\infty, +\infty)\): when it is positive, the variables tend to increase or decrease together; when it is negative, one tends to increase while the other decreases.

Unlike correlation, the magnitude of covariance depends on the scales (units) of the variables.
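
The scale dependence can be illustrated with a small sketch. The values below are hypothetical; the point is only that changing the units of one variable (dollars to cents) rescales the covariance but leaves the correlation unchanged.

```r
# Hypothetical prices (in dollars) and quantities
price_usd <- c(10, 12, 15, 18, 20)
qty       <- c(5, 4, 4, 2, 1)

# Same prices expressed in cents
price_cents <- price_usd * 100

# Covariance scales with the units of the inputs
cov_usd   <- cov(price_usd, qty)
cov_cents <- cov(price_cents, qty)

# Correlation is standardized, so the unit change has no effect
cor_usd   <- cor(price_usd, qty)
cor_cents <- cor(price_cents, qty)

print(c(cov_usd = cov_usd, cov_cents = cov_cents))
print(c(cor_usd = cor_usd, cor_cents = cor_cents))
```

Here the covariance in cents is 100 times the covariance in dollars, which is why a covariance value on its own says little about how strong a relationship is.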

Part 3)

Try merging any datasets that interest you based on the data dictionary (pay attention to the unique keys), and create a meaningful dataset (one that has an interesting y (outcome) and an interesting x (independent variable)).

library(readr)
setwd("~/Desktop/Data Analysis/Discussion 12 - Merging Data & correlation and covariance")

orders  <-  read_csv("order_details.csv")
## Rows: 48620 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): pizza_id
## dbl (3): order_details_id, order_id, quantity
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pizzas  <-  read.csv("pizzas.csv")
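# Merge data: full outer join on pizza_id.
# all = TRUE keeps unmatched rows from both tables, so pizzas with no
# orders remain in the result with NA in the order columns.
pizza_merged <- merge(pizzas, orders, by = "pizza_id", all = TRUE)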
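
Since the prompt stresses unique keys, one way to sanity-check a merge is sketched below on two tiny made-up tables. The data frames `pizzas_toy` and `orders_toy` and their values are hypothetical stand-ins; only the column names mirror the real files.

```r
# Hypothetical miniature lookup table: one row per pizza_id
pizzas_toy <- data.frame(
  pizza_id = c("bbq_s", "bbq_m", "veg_s"),
  price    = c(12.75, 16.75, 12.00),
  stringsAsFactors = FALSE
)

# Hypothetical miniature order lines: pizza_id may repeat
orders_toy <- data.frame(
  order_details_id = 1:4,
  pizza_id         = c("bbq_s", "bbq_s", "bbq_m", "bbq_m"),
  quantity         = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# The key should be unique in the lookup table; duplicates there
# would multiply order rows in the merged result
key_is_unique <- !any(duplicated(pizzas_toy$pizza_id))

# Full outer join, as in the merge above
merged_toy <- merge(pizzas_toy, orders_toy, by = "pizza_id", all = TRUE)

# Rows with NA in the order columns are pizzas that were never ordered
never_ordered <- merged_toy[is.na(merged_toy$order_details_id), "pizza_id"]

print(key_is_unique)
print(nrow(merged_toy))
print(never_ordered)
```

Comparing the merged row count with the row count of the orders table is a quick way to see how many unmatched rows the outer join added.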

Part 4)

Create a summary statistics table of the merged dataset.

#install.packages("stargazer")
library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
stargazer(pizza_merged, 
          type = "text", 
          title = "Summary")
## 
## Summary
## ==========================================================
## Statistic          N       Mean     St. Dev.   Min   Max  
## ----------------------------------------------------------
## price            48,625   16.494     3.622    9.750 35.950
## order_details_id 48,620 24,310.500 14,035.530   1   48,620
## order_id         48,620 10,701.480 6,180.120    1   21,350
## quantity         48,620   1.020      0.143      1     4   
## ----------------------------------------------------------
summary(pizza_merged)
##    pizza_id         pizza_type_id          size               price      
##  Length:48625       Length:48625       Length:48625       Min.   : 9.75  
##  Class :character   Class :character   Class :character   1st Qu.:12.75  
##  Mode  :character   Mode  :character   Mode  :character   Median :16.50  
##                                                           Mean   :16.49  
##                                                           3rd Qu.:20.25  
##                                                           Max.   :35.95  
##                                                                          
##  order_details_id    order_id        quantity   
##  Min.   :    1    Min.   :    1   Min.   :1.00  
##  1st Qu.:12156    1st Qu.: 5337   1st Qu.:1.00  
##  Median :24310    Median :10682   Median :1.00  
##  Mean   :24310    Mean   :10701   Mean   :1.02  
##  3rd Qu.:36465    3rd Qu.:16100   3rd Qu.:1.00  
##  Max.   :48620    Max.   :21350   Max.   :4.00  
##  NA's   :5        NA's   :5       NA's   :5

Part 5)

Pick any two quantitative variables from the dataset that interest you. Run a correlation (which measures the strength of the linear relationship) between the two variables, and run the covariance between the two variables. Interpret.

# Create a sales field: revenue per order line (price * quantity)
pizza_merged$sales <- pizza_merged$price * pizza_merged$quantity

# Correlation and covariance, using only rows without missing values
correlation <- cor(pizza_merged$quantity, 
                   pizza_merged$sales, 
                   use = "complete.obs")

covariance <- cov(pizza_merged$quantity, 
                  pizza_merged$sales, 
                  use = "complete.obs")

#Print
print(correlation)
## [1] 0.5419262
print(covariance)
## [1] 0.3440633

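The correlation of 0.54 and the covariance of 0.34 are both positive, so sales and quantity move in the same direction: as the quantity of pizzas sold increases, so do sales. The correlation is the more informative of the two, since it shows a moderately strong linear relationship. The covariance confirms the direction, but its magnitude depends on the units of the two variables and cannot be read as a measure of strength. Note that sales is computed as price times quantity, so some positive relationship is expected by construction.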
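
The link between the two statistics can be sketched as follows: dividing the covariance by the product of the two standard deviations gives the correlation. The order lines below are invented for illustration and are not the pizza data.

```r
# Hypothetical order lines (made-up values)
quantity <- c(1, 1, 2, 1, 3, 1)
price    <- c(12.5, 20.25, 16.0, 9.75, 12.0, 18.5)
sales    <- price * quantity

# Covariance is expressed in the units of quantity times the units of sales
cov_qs <- cov(quantity, sales)

# Standardizing by both standard deviations removes the units
cor_from_cov <- cov_qs / (sd(quantity) * sd(sales))

# Should agree with the built-in correlation
cor_direct <- cor(quantity, sales)

print(cor_from_cov)
print(all.equal(cor_from_cov, cor_direct))
```

The same conversion applies to the values reported above, which is why the two statistics always share the same sign.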