
Loading Libraries:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the Dataset:

Online_Retail <- read.csv("C:/Users/laasy/Documents/Fall 2023/Intro to Statistics in R/Datasets for Final Project/OnlineRetail.csv")

Creating a binary variable called “Popular”, set to 1 when a product’s total quantity across all of its rows exceeds 100 and to 0 otherwise:

data <- Online_Retail %>% 
  group_by(StockCode) %>% 
  mutate(Popular = ifelse(sum(Quantity) > 100, 1, 0)) %>% 
  ungroup()
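
To make the grouping step concrete, here is a minimal sketch on a made-up table. The `toy` data frame and its values are invented for illustration and are not taken from OnlineRetail.csv. It shows that the flag is computed once per StockCode and then repeated on every row of that product.

```r
library(dplyr)

# Hypothetical toy data: three products, one or two invoice lines each
toy <- data.frame(
  StockCode = c("A", "A", "B", "B", "C"),
  Quantity  = c(60, 50, 10, 20, 101)
)

toy_flagged <- toy %>%
  group_by(StockCode) %>%
  # Inside a group, sum(Quantity) is the product's total across all of its rows
  mutate(Popular = ifelse(sum(Quantity) > 100, 1, 0)) %>%
  ungroup()

print(toy_flagged)
```

Product A totals 110 and product C totals 101, so all of their rows are flagged 1, while product B totals 30 and is flagged 0.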

Logistic Regression Model:

# Creating a logistic regression model with explanatory variables
logistic_model <- glm(Popular ~ Quantity + UnitPrice + Country, data = data, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# The summary of the logistic regression model
summary(logistic_model)

## 
## Call:
## glm(formula = Popular ~ Quantity + UnitPrice + Country, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  4.125e+00  2.313e-01  17.834  < 2e-16 ***
## Quantity                     8.871e-04  1.317e-04   6.733 1.66e-11 ***
## UnitPrice                   -6.769e-04  6.629e-05 -10.212  < 2e-16 ***
## CountryAustria               5.443e-02  4.719e-01   0.115 0.908170    
## CountryBahrain              -1.243e+00  1.053e+00  -1.181 0.237712    
## CountryBelgium              -2.060e-01  2.810e-01  -0.733 0.463449    
## CountryBrazil                8.434e+00  5.741e+01   0.147 0.883197    
## CountryCanada               -1.592e+00  3.893e-01  -4.090 4.31e-05 ***
## CountryChannel Islands       2.889e-01  4.073e-01   0.709 0.478129    
## CountryCyprus               -9.974e-01  3.060e-01  -3.260 0.001116 ** 
## CountryCzech Republic       -7.731e-01  1.043e+00  -0.741 0.458585    
## CountryDenmark               7.160e-01  6.240e-01   1.147 0.251236    
## CountryEIRE                 -8.152e-01  2.390e-01  -3.412 0.000646 ***
## CountryEuropean Community   -1.167e+00  6.357e-01  -1.836 0.066330 .  
## CountryFinland              -9.263e-01  3.033e-01  -3.054 0.002259 ** 
## CountryFrance                1.668e-01  2.496e-01   0.668 0.504063    
## CountryGermany              -1.467e-02  2.453e-01  -0.060 0.952301    
## CountryGreece                1.455e-01  7.486e-01   0.194 0.845913    
## CountryHong Kong            -7.528e-01  3.974e-01  -1.894 0.058195 .  
## CountryIceland              -3.397e-01  5.560e-01  -0.611 0.541250    
## CountryIsrael               -9.679e-01  3.746e-01  -2.584 0.009768 ** 
## CountryItaly                -9.883e-02  3.552e-01  -0.278 0.780848    
## CountryJapan                 8.025e-02  5.063e-01   0.159 0.874050    
## CountryLebanon              -3.448e-01  1.037e+00  -0.332 0.739580    
## CountryLithuania            -1.336e+00  7.641e-01  -1.749 0.080300 .  
## CountryMalta                -2.327e+00  3.438e-01  -6.768 1.30e-11 ***
## CountryNetherlands           3.887e-01  3.091e-01   1.258 0.208550    
## CountryNorway               -4.284e-01  3.048e-01  -1.406 0.159861    
## CountryPoland                7.597e-02  5.064e-01   0.150 0.880757    
## CountryPortugal             -3.685e-02  3.069e-01  -0.120 0.904404    
## CountryRSA                   8.438e+00  4.264e+01   0.198 0.843128    
## CountrySaudi Arabia         -1.933e+00  1.079e+00  -1.791 0.073281 .  
## CountrySingapore            -5.149e-01  4.533e-01  -1.136 0.256020    
## CountrySpain                -6.679e-01  2.586e-01  -2.583 0.009805 ** 
## CountrySweden                1.458e-01  4.715e-01   0.309 0.757123    
## CountrySwitzerland          -4.729e-02  2.903e-01  -0.163 0.870602    
## CountryUnited Arab Emirates -6.390e-01  7.541e-01  -0.847 0.396756    
## CountryUnited Kingdom       -9.594e-01  2.314e-01  -4.146 3.38e-05 ***
## CountryUnspecified          -7.004e-01  3.567e-01  -1.964 0.049574 *  
## CountryUSA                   4.377e-01  6.247e-01   0.701 0.483531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 177687  on 541908  degrees of freedom
## Residual deviance: 176580  on 541869  degrees of freedom
## AIC: 176660
## 
## Number of Fisher Scoring iterations: 11

Interpretation of the coefficients and other key information:

Coefficients: The table lists the estimated coefficients for the intercept and the explanatory variables: ‘Quantity’, ‘UnitPrice’, and each non-reference level of ‘Country’.
Estimate: The estimated coefficient for each term, expressed on the log-odds scale.
Std. Error: The standard error of each estimate, which measures its sampling uncertainty.
z value: The estimate divided by its standard error, that is, how many standard errors the estimate lies from zero. It is the test statistic for the null hypothesis that the coefficient is zero.
Pr(>|z|): The two-sided p-value for that test. Lower p-values indicate stronger evidence against the null hypothesis that the coefficient is zero.
Significance codes: A quick visual key to the p-value ranges. ‘***’ marks p < 0.001, ‘**’ marks p < 0.01, ‘*’ marks p < 0.05, and ‘.’ marks p < 0.1; no symbol means p ≥ 0.1.
Null deviance: The deviance of the model that contains only the intercept (177687 on 541908 degrees of freedom).
Residual deviance: The deviance after adding the explanatory variables (176580 on 541869 degrees of freedom). The drop of 1107 for 39 additional parameters measures the improvement in model fit.
AIC: The Akaike Information Criterion (AIC) balances goodness of fit against the number of parameters. It is meaningful only when comparing models fitted to the same data, where the model with the lower AIC is preferred.
Number of Fisher Scoring iterations: The number of iterations the fitting algorithm needed to converge (11 here).
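
As a rough check on how these columns relate to one another, the sketch below recomputes the z value and p-value for ‘Quantity’ from the printed estimate and standard error, and turns the drop in deviance into a likelihood-ratio test. The inputs are copied from the rounded summary output, so small differences from the printed table are expected, and the chi-squared comparison assumes the usual large-sample approximation.

```r
# Values copied from the summary output (rounded as printed)
estimate  <- 8.871e-04
std_error <- 1.317e-04

# z value: the estimate divided by its standard error
z_value <- estimate / std_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z_value))

# Likelihood-ratio test: drop in deviance against the drop in degrees of freedom
deviance_drop <- 177687 - 176580   # null deviance minus residual deviance
df_drop       <- 541908 - 541869   # number of slope parameters added
lr_p_value    <- pchisq(deviance_drop, df = df_drop, lower.tail = FALSE)

cat("z =", round(z_value, 2), " p =", signif(p_value, 3), "\n")
cat("Deviance drop =", deviance_drop, "on", df_drop, "df, p =", signif(lr_p_value, 3), "\n")
```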

Interpretation of the individual coefficients:

Intercept (4.125e+00): The intercept is the log-odds of an item being popular when ‘Quantity’ and ‘UnitPrice’ are both zero and ‘Country’ is at its reference level. It serves as the baseline value for the log-odds of popularity.
Quantity (8.871e-04): For a one-unit increase in ‘Quantity’, with the other variables held constant, the log-odds of an item being popular increase by 8.871e-04 (or 0.0008871). Larger quantities are therefore associated with a higher probability of the item being popular.
UnitPrice (-6.769e-04): For a one-unit increase in the unit price, with the other variables held constant, the log-odds of an item being popular decrease by 6.769e-04 (or 0.0006769). Higher unit prices are therefore associated with a lower probability of the item being popular.
Country Coefficients: Each country coefficient is the difference in log-odds of an item being popular between that country and the reference country. The reference country is the one level of ‘Country’ that does not appear in the table; by default R uses the first factor level in alphabetical order. Positive coefficients indicate higher log-odds than the reference country, and negative coefficients indicate lower log-odds.

For example, the coefficient for ‘CountryUnited Kingdom’ (-9.594e-01) implies that the log-odds of an item being popular are 0.9594 lower for rows from the United Kingdom than for the reference country, with all other variables held constant.
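
Log-odds are hard to read directly, so a common follow-up is to exponentiate them into odds ratios. The sketch below does this for three coefficients copied from the summary output; the vector name `log_odds` and the label `UnitedKingdom` are chosen here for illustration. With the fitted model in memory, `exp(coef(logistic_model))` would give the same quantities for every term.

```r
# Log-odds coefficients copied from the summary output
log_odds <- c(
  Quantity      = 8.871e-04,
  UnitPrice     = -6.769e-04,
  UnitedKingdom = -9.594e-01
)

# exp() turns a change in log-odds into a multiplicative change in the odds
odds_ratios <- exp(log_odds)

# Percentage change in the odds per one-unit increase
# (or, for a country, relative to the reference country)
percent_change <- (odds_ratios - 1) * 100

print(round(odds_ratios, 4))
print(round(percent_change, 2))
```

An odds ratio below 1, such as the value of about 0.38 for the United Kingdom, means lower odds of being popular than in the reference country; an odds ratio just above 1, as for ‘Quantity’, means a very small increase in the odds per additional unit.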

The standard errors associated with each coefficient provide a measure of the uncertainty in the estimated coefficients. Smaller standard errors suggest more precise estimates, while larger standard errors indicate more uncertainty in the coefficient estimates.

Building a confidence interval for the Quantity coefficient:

# Values copied from the model summary
coefficient <- 8.871e-04  # Coefficient for Quantity
std_error <- 1.317e-04    # Standard error for the coefficient of Quantity
n <- nrow(data)           # Total number of observations

# Setting the confidence level (e.g., 95% confidence level)
confidence_level <- 0.95

# Calculating the margin of error with the upper-tail critical value,
# so that the margin is positive (with n this large, the t quantile is
# practically equal to the normal quantile of about 1.96)
margin_of_error <- qt(1 - (1 - confidence_level) / 2, df = n - 2) * std_error

# Calculating the confidence interval
lower_bound <- coefficient - margin_of_error
upper_bound <- coefficient + margin_of_error

# Printing the confidence interval
cat("The 95% confidence interval for the coefficient of 'Quantity' is [", lower_bound, ", ", upper_bound, "].\n")

## The 95% confidence interval for the coefficient of 'Quantity' is [ 0.0006289722 ,  0.001145228 ].

The output gives the 95% confidence interval for the coefficient of ‘Quantity’ as [0.0006289722, 0.001145228]. We can be 95% confident that the true value of the coefficient lies within this range.

Since both bounds of the confidence interval are positive, the interval excludes zero, so the association between ‘Quantity’ and the probability of an item being popular is statistically significant at the 5% level. Specifically, for a one-unit increase in ‘Quantity’, the log-odds of an item being popular are expected to increase by between 0.0006289722 and 0.001145228. This indicates a positive association between quantity and popularity in the online retail data.
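
The same kind of interval can be taken straight from a fitted model instead of retyping rounded values. The sketch below is a minimal illustration on simulated stand-in data (the variables `x`, `y`, and `fit` are hypothetical, not the retail model); it builds the Wald interval by hand and compares it with `confint.default()`, which uses the same normal-quantile formula.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the retail data)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x))
fit <- glm(y ~ x, family = "binomial")

# Pull the estimate and standard error from the coefficient table
coef_table <- coef(summary(fit))
estimate   <- coef_table["x", "Estimate"]
std_error  <- coef_table["x", "Std. Error"]

# Wald interval: estimate +/- normal quantile * standard error
z_crit    <- qnorm(0.975)
manual_ci <- c(estimate - z_crit * std_error, estimate + z_crit * std_error)

# confint.default() computes the same Wald interval directly
builtin_ci <- confint.default(fit, parm = "x", level = 0.95)

print(manual_ci)
print(builtin_ci)
```

Reading the values from the model object avoids the rounding introduced by copying numbers from the printed summary.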

Checking the linearity between Unit Price and Popular, and deciding whether a transformation is necessary:

# Scatter plot to assess linearity
ggplot(data, aes(x = UnitPrice, y = Popular)) +
  geom_point() +
  ggtitle("Scatter Plot of 'Unit Price' vs 'Popular'")

# Add a trend line to the scatter plot to visually assess linearity
ggplot(data, aes(x = UnitPrice, y = Popular)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  ggtitle("Scatter Plot of 'Unit Price' vs 'Popular' with Trend Line")

## `geom_smooth()` using formula = 'y ~ x'

The trend line has a negative slope, which suggests a negative relationship between ‘UnitPrice’ and ‘Popular’: as the unit price increases, the proportion of popular items decreases. Note that geom_smooth(method = "lm") always draws a straight line, so the straightness of the line is not by itself evidence of linearity, and the linearity that logistic regression assumes is between ‘UnitPrice’ and the log-odds of ‘Popular’. The plot gives no indication that a transformation of ‘UnitPrice’ is needed, so the variable is kept in its current form.
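
One way to look at linearity on the log-odds scale is to bin the predictor and compute the empirical log-odds in each bin. The sketch below is an illustration on simulated stand-in data (the `sim` data frame, its price range, and its coefficients are invented, not taken from the retail data), and the choice of ten bins and the 0.5 continuity correction are assumptions rather than part of the original analysis.

```r
library(dplyr)

set.seed(42)

# Simulated stand-in data (hypothetical): popularity falls as price rises
sim <- data.frame(UnitPrice = runif(5000, min = 0, max = 20))
sim$Popular <- rbinom(nrow(sim), size = 1, prob = plogis(2 - 0.15 * sim$UnitPrice))

# Bin the predictor into deciles and compute the empirical log-odds per bin
binned <- sim %>%
  mutate(bin = ntile(UnitPrice, 10)) %>%
  group_by(bin) %>%
  summarise(
    mean_price = mean(UnitPrice),
    # Add 0.5 to both counts so bins with all 0s or all 1s stay finite
    emp_logit  = log((sum(Popular) + 0.5) / (sum(1 - Popular) + 0.5)),
    .groups = "drop"
  )

print(binned)
```

Plotting `emp_logit` against `mean_price` and seeing a roughly straight pattern would support keeping the predictor untransformed; clear curvature would suggest trying a transformation such as a logarithm.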

Week_10 Data Dive GLM’s

2023-10-30