R Markdown


Online_Retail <- read.csv('C:/Users/laasy/Documents/Fall 2023/Intro to Statistics in R/Datasets for Final Project/OnlineRetail.csv')
summary(Online_Retail)
##   InvoiceNo          StockCode         Description           Quantity        
##  Length:541909      Length:541909      Length:541909      Min.   :-80995.00  
##  Class :character   Class :character   Class :character   1st Qu.:     1.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :     3.00  
##                                                           Mean   :     9.55  
##                                                           3rd Qu.:    10.00  
##                                                           Max.   : 80995.00  
##                                                                              
##  InvoiceDate          UnitPrice           CustomerID       Country         
##  Length:541909      Min.   :-11062.06   Min.   :12346    Length:541909     
##  Class :character   1st Qu.:     1.25   1st Qu.:13953    Class :character  
##  Mode  :character   Median :     2.08   Median :15152    Mode  :character  
##                     Mean   :     4.61   Mean   :15288                      
##                     3rd Qu.:     4.13   3rd Qu.:16791                      
##                     Max.   : 38970.00   Max.   :18287                      
##                                         NA's   :135080
# Checking the unique countries and their counts
table(Online_Retail$Country)
## 
##            Australia              Austria              Bahrain 
##                 1259                  401                   19 
##              Belgium               Brazil               Canada 
##                 2069                   32                  151 
##      Channel Islands               Cyprus       Czech Republic 
##                  758                  622                   30 
##              Denmark                 EIRE   European Community 
##                  389                 8196                   61 
##              Finland               France              Germany 
##                  695                 8557                 9495 
##               Greece            Hong Kong              Iceland 
##                  146                  288                  182 
##               Israel                Italy                Japan 
##                  297                  803                  358 
##              Lebanon            Lithuania                Malta 
##                   45                   35                  127 
##          Netherlands               Norway               Poland 
##                 2371                 1086                  341 
##             Portugal                  RSA         Saudi Arabia 
##                 1519                   58                   10 
##            Singapore                Spain               Sweden 
##                  229                 2533                  462 
##          Switzerland United Arab Emirates       United Kingdom 
##                 2002                   68               495478 
##          Unspecified                  USA 
##                  446                  291
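# Consolidating countries into fewer categories
Online_Retail$Country[Online_Retail$Country %in% c('Germany', 'France', 'Spain', 'Italy', 'Netherlands', 'Belgium')] <- 'Western Europe'
# Note: the dataset labels Ireland as 'EIRE' (see the table above), so 'Ireland' matches no rows.
# Only 'United Kingdom' is relabeled here; EIRE remains a separate level in the results below.
Online_Retail$Country[Online_Retail$Country %in% c('United Kingdom', 'Ireland')] <- 'UK and Ireland'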

# Running ANOVA
model <- lm(UnitPrice ~ Country, data = Online_Retail)
anova_result <- anova(model)

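# Summarizing results
# Note: summary() on an ANOVA table summarizes each column of the table (Min., Median, Max., ...);
# printing anova_result directly would show the usual ANOVA table instead.
summary(anova_result)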
##        Df             Sum Sq             Mean Sq         F value     
##  Min.   :    32   Min.   :3.001e+06   Min.   : 9357   Min.   :10.02  
##  1st Qu.:135493   1st Qu.:1.270e+09   1st Qu.:30464   1st Qu.:10.02  
##  Median :270954   Median :2.537e+09   Median :51571   Median :10.02  
##  Mean   :270954   Mean   :2.537e+09   Mean   :51571   Mean   :10.02  
##  3rd Qu.:406415   3rd Qu.:3.804e+09   3rd Qu.:72678   3rd Qu.:10.02  
##  Max.   :541876   Max.   :5.071e+09   Max.   :93784   Max.   :10.02  
##                                                       NA's   :1      
##      Pr(>F) 
##  Min.   :0  
##  1st Qu.:0  
##  Median :0  
##  Mean   :0  
##  3rd Qu.:0  
##  Max.   :0  
##  NA's   :1

The output comes from a one-way ANOVA of ‘UnitPrice’ on ‘Country’, which tests whether the mean unit price differs across the consolidated country groups.

Because ‘summary’ was applied to the ANOVA table instead of printing it, the output is a column-by-column summary of that table: the degrees of freedom (Df), the sum of squares (Sum Sq), the mean squares (Mean Sq), the F-value, and the p-value (Pr(>F)). The table has only two rows, ‘Country’ and ‘Residuals’, so the ‘Min.’ and ‘Max.’ entries are the actual table values, while the quartiles, median, and mean are interpolations between them with no statistical meaning.

Here are the main points from the output:

  1. ‘Country’ has 32 degrees of freedom (the ‘Min.’ of Df), which corresponds to 33 country groups after consolidation.
  2. The residuals have 541,876 degrees of freedom (the ‘Max.’ of Df).
  3. The ‘Mean’ Df of 270,954 is just the average of those two values and is not a count of orders.
  4. The sum of squares for ‘Country’ is 3.001e+06.
  5. The residual sum of squares is 5.071e+09.
  6. The mean square is 93,784 for ‘Country’ and 9,357 for the residuals.
  7. The F-value is approximately 10.02, the ratio of the two mean squares. The single NA arises because the ‘Residuals’ row has no F-value or p-value.
  8. The p-value for the test is displayed as 0.

A p-value displayed as 0 is not exactly zero; it is too small to show at this precision. It is strong evidence against the null hypothesis of equal mean unit prices, indicating a statistically significant difference in mean unit price across the country groups in the dataset.
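Since ‘summary’ of an ANOVA table only summarizes its columns, it can be easier to print the table itself and pull out individual entries. The following is a minimal sketch on made-up data; the ‘toy’ data frame, its group labels, and its values are hypothetical and are not taken from OnlineRetail.csv.

```r
# Toy data (hypothetical): three groups with different mean prices
set.seed(1)
toy <- data.frame(
  Country   = rep(c("A", "B", "C"), each = 20),
  UnitPrice = c(rnorm(20, mean = 2), rnorm(20, mean = 3), rnorm(20, mean = 5))
)

toy_model <- lm(UnitPrice ~ Country, data = toy)
toy_anova <- anova(toy_model)

# Printing the object shows one row per term plus a Residuals row
print(toy_anova)

# Individual entries can be extracted by row and column name
df_country   <- toy_anova["Country", "Df"]
df_residuals <- toy_anova["Residuals", "Df"]
f_value      <- toy_anova["Country", "F value"]
p_value      <- toy_anova["Country", "Pr(>F)"]
```

With the real data, the same pattern would apply to ‘anova_result’, where the rows are ‘Country’ and ‘Residuals’.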

For stakeholders interested in this data, this means that average unit prices differ by country group by more than chance alone would explain. With more than 540,000 rows, however, even small differences reach statistical significance, and the test shows an association, not a causal effect of country on price. Businesses in online retail may still want to consider customer location when setting pricing and marketing strategies, but should first check how large the differences between countries are in practice.
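One rough way to gauge practical importance is the share of total variation in ‘UnitPrice’ attributed to ‘Country’ (eta squared). The sketch below uses the rounded sums of squares and degrees of freedom shown in the output above, assuming the smaller Sum Sq and Df belong to the ‘Country’ row and the larger ones to ‘Residuals’; the variable names are illustrative.

```r
# Values read off the summary(anova_result) output (rounded as displayed)
ss_country   <- 3.001e6   # Sum Sq, assumed to be the Country row
ss_residuals <- 5.071e9   # Sum Sq, assumed to be the Residuals row
df_country   <- 32
df_residuals <- 541876

# Share of total variation attributed to Country (eta squared)
eta_squared <- ss_country / (ss_country + ss_residuals)

# Recompute the F statistic as a consistency check against the reported 10.02
f_value <- (ss_country / df_country) / (ss_residuals / df_residuals)

print(round(eta_squared, 5))
print(round(f_value, 2))
```

Under these assumptions, ‘Country’ accounts for well under 1% of the variation in unit price, which is consistent with a statistically significant but practically small effect.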

Diagnostic plots

# Checking diagnostic plots for the ANOVA model
par(mfrow = c(2, 2))
plot(model, which = 1)  # residuals vs fitted
plot(model, which = 2)  # normal Q-Q
plot(model, which = 3)  # scale-location

For a linear regression model, the ‘summary’ function reports the coefficients with their standard errors, t-values, and p-values, and the ‘anova’ function tests the significance of the model terms. The diagnostic plots help identify problems with the model, such as heteroscedasticity or non-linearity.

In a simple regression of ‘UnitPrice’ on ‘Quantity’, the intercept represents the expected unit price when the quantity is zero. The coefficient for ‘Quantity’ is the expected change in unit price for a one-unit increase in quantity. A positive coefficient indicates that unit price tends to rise with quantity, and a negative coefficient indicates that it tends to fall.

If the coefficient for ‘Quantity’ is statistically significant, the quantity purchased is associated with a change in unit price in the direction given by the coefficient's sign. This information can inform pricing strategies, such as quantity-based discounts or incentives, and can support revenue forecasting and inventory management. If the diagnostic plots reveal problems, the model may need to be adjusted before its results are relied on.
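The simple regression described above could be fitted along the following lines. This is a minimal sketch on made-up data: the ‘toy’ data frame and its built-in negative slope are hypothetical. With the real data, the call would presumably be lm(UnitPrice ~ Quantity, data = Online_Retail).

```r
# Toy data (hypothetical): unit price falls slightly as quantity rises
set.seed(42)
toy <- data.frame(Quantity = rep(1:25, times = 4))
toy$UnitPrice <- 5 - 0.1 * toy$Quantity + rnorm(nrow(toy), sd = 0.5)

simple_model <- lm(UnitPrice ~ Quantity, data = toy)

# Coefficient table: estimate, standard error, t value, p-value
coef_table <- summary(simple_model)$coefficients
print(coef_table)

# The sign of the slope gives the direction of the association
slope   <- coef_table["Quantity", "Estimate"]
slope_p <- coef_table["Quantity", "Pr(>|t|)"]
```

Here the slope is negative by construction, so a significant p-value would indicate that unit price decreases as quantity increases.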

To build a more comprehensive regression model, let’s include the ‘Country’ variable from the ANOVA analysis as an additional categorical predictor in the linear regression model. We will also evaluate if the interaction between ‘Quantity’ and ‘Country’ has a significant impact on the ‘UnitPrice’ variable.

In this updated model, ‘Quantity’ and ‘Country’ are included as predictors, and the interaction term ‘Quantity:Country’ assesses whether the effect of ‘Quantity’ on ‘UnitPrice’ varies across countries. The ‘summary’ function reports the coefficients, standard errors, t-values, and p-values for each predictor and interaction term.

A significant interaction would mean that the relationship between ‘Quantity’ and ‘UnitPrice’ depends on the country from which the order originates. This can show how market dynamics in different countries relate to pricing and customer behavior, and can help businesses tailor their pricing and marketing strategies accordingly.
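To illustrate what the interaction term captures, here is a minimal sketch on made-up data with two countries whose slopes differ by construction. The ‘toy’ data frame and its values are hypothetical. The sketch also shows that Quantity * Country is shorthand for the longer formula.

```r
# Toy data (hypothetical): true slope is -0.1 in country A and -0.5 in country B
set.seed(7)
toy <- data.frame(
  Country  = rep(c("A", "B"), each = 30),
  Quantity = rep(1:30, times = 2)
)
true_slope <- ifelse(toy$Country == "A", -0.1, -0.5)
toy$UnitPrice <- 20 + true_slope * toy$Quantity + rnorm(nrow(toy), sd = 0.3)

# Quantity * Country expands to Quantity + Country + Quantity:Country
m_long  <- lm(UnitPrice ~ Quantity + Country + Quantity:Country, data = toy)
m_short <- lm(UnitPrice ~ Quantity * Country, data = toy)
same_fit <- isTRUE(all.equal(coef(m_long), coef(m_short)))

# The Quantity coefficient is the slope for the reference level (A);
# the slope for B is that coefficient plus the interaction coefficient
slope_a <- coef(m_long)[["Quantity"]]
slope_b <- slope_a + coef(m_long)[["Quantity:CountryB"]]

# Fitting country B on its own gives the same slope
fit_b <- lm(UnitPrice ~ Quantity, data = subset(toy, Country == "B"))
slope_b_separate <- coef(fit_b)[["Quantity"]]
```

In other words, a model with a full interaction fits a separate line per country, and each interaction coefficient measures how far that country's slope is from the reference country's slope.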

# Building a linear regression model with interaction term
model <- lm(UnitPrice ~ Quantity + Country + Quantity:Country, data = Online_Retail)

# Displaying the summary of the linear regression model
summary(model)
## 
## Call:
## lm(formula = UnitPrice ~ Quantity + Country + Quantity:Country, 
##     data = Online_Retail)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11067     -3     -2      0  38965 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            3.953068   3.296847   1.199   0.2305    
## Quantity                              -0.011024   0.027913  -0.395   0.6929    
## CountryAustria                         0.947377   6.431200   0.147   0.8829    
## CountryBahrain                         1.140084  24.721683   0.046   0.9632    
## CountryBrazil                          2.714822  28.685850   0.095   0.9246    
## CountryCanada                          2.804696   9.077109   0.309   0.7573    
## CountryChannel Islands                 1.843719   5.196338   0.355   0.7227    
## CountryCyprus                          3.161142   5.364488   0.589   0.5557    
## CountryCzech Republic                  0.334179  23.747993   0.014   0.9888    
## CountryDenmark                        -0.125403   7.009881  -0.018   0.9857    
## CountryEIRE                            2.713661   3.496079   0.776   0.4376    
## CountryEuropean Community              4.365212  20.140760   0.217   0.8284    
## CountryFinland                         3.290004   5.614682   0.586   0.5579    
## CountryGreece                          4.494560  14.062155   0.320   0.7493    
## CountryHong Kong                      77.291110   8.631445   8.955  < 2e-16 ***
## CountryIceland                        -0.849683   9.421930  -0.090   0.9281    
## CountryIsrael                          1.204422   8.294058   0.145   0.8845    
## CountryJapan                          -1.538290   6.413837  -0.240   0.8105    
## CountryLebanon                         5.474389  32.543321   0.168   0.8664    
## CountryLithuania                      -0.091695  34.742855  -0.003   0.9979    
## CountryMalta                           3.506056  12.145079   0.289   0.7728    
## CountryNorway                          4.624795   4.976201   0.929   0.3527    
## CountryPoland                          2.659671   8.294321   0.321   0.7485    
## CountryPortugal                       10.126084   4.693351   2.158   0.0310 *  
## CountryRSA                             5.160678  26.786074   0.193   0.8472    
## CountrySaudi Arabia                   -0.277510  52.275984  -0.005   0.9958    
## CountrySingapore                     179.250495   8.919577  20.096  < 2e-16 ***
## CountrySweden                          0.992968   6.195314   0.160   0.8727    
## CountrySwitzerland                     0.436706   4.304162   0.101   0.9192    
## CountryUK and Ireland                  0.583088   3.299713   0.177   0.8597    
## CountryUnited Arab Emirates            1.391060  18.316786   0.076   0.9395    
## CountryUnspecified                    -0.353560   6.842331  -0.052   0.9588    
## CountryUSA                            -1.711249   6.672098  -0.256   0.7976    
## CountryWestern Europe                  0.838466   3.361268   0.249   0.8030    
## Quantity:CountryAustria               -0.043577   0.224122  -0.194   0.8458    
## Quantity:CountryBahrain               -0.028207   0.759945  -0.037   0.9704    
## Quantity:CountryBrazil                -0.187775   2.049371  -0.092   0.9270    
## Quantity:CountryCanada                -0.028731   0.171444  -0.168   0.8669    
## Quantity:CountryChannel Islands       -0.058120   0.158219  -0.367   0.7134    
## Quantity:CountryCyprus                -0.068914   0.169146  -0.407   0.6837    
## Quantity:CountryCzech Republic        -0.057333   0.787701  -0.073   0.9420    
## Quantity:CountryDenmark               -0.016091   0.181339  -0.089   0.9293    
## Quantity:CountryEIRE                  -0.032396   0.038465  -0.842   0.3997    
## Quantity:CountryEuropean Community    -0.418282   1.907338  -0.219   0.8264    
## Quantity:CountryFinland               -0.105898   0.177022  -0.598   0.5497    
## Quantity:CountryGreece                -0.323207   1.040197  -0.311   0.7560    
## Quantity:CountryHong Kong             -2.328423   0.338230  -6.884 5.82e-12 ***
## Quantity:CountryIceland               -0.022990   0.382259  -0.060   0.9520    
## Quantity:CountryIsrael                -0.092982   0.351865  -0.264   0.7916    
## Quantity:CountryJapan                  0.009056   0.040170   0.225   0.8216    
## Quantity:CountryLebanon               -0.459949   3.379620  -0.136   0.8917    
## Quantity:CountryLithuania             -0.043743   1.636365  -0.027   0.9787    
## Quantity:CountryMalta                 -0.286962   1.068054  -0.269   0.7882    
## Quantity:CountryNorway                -0.133752   0.132630  -1.008   0.3132    
## Quantity:CountryPoland                -0.216919   0.516258  -0.420   0.6744    
## Quantity:CountryPortugal              -0.504965   0.211786  -2.384   0.0171 *  
## Quantity:CountryRSA                   -0.785844   3.848015  -0.204   0.8382    
## Quantity:CountrySaudi Arabia          -0.157584   5.635750  -0.028   0.9777    
## Quantity:CountrySingapore             -3.207304   0.232553 -13.792  < 2e-16 ***
## Quantity:CountrySweden                -0.002396   0.044727  -0.054   0.9573    
## Quantity:CountrySwitzerland           -0.054092   0.117417  -0.461   0.6450    
## Quantity:CountryUK and Ireland         0.010590   0.027920   0.379   0.7045    
## Quantity:CountryUnited Arab Emirates  -0.124934   0.947551  -0.132   0.8951    
## Quantity:CountryUnspecified           -0.110604   0.523767  -0.211   0.8328    
## Quantity:CountryUSA                    0.003877   0.346358   0.011   0.9911    
## Quantity:CountryWestern Europe        -0.014720   0.031108  -0.473   0.6361    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.71 on 541843 degrees of freedom
## Multiple R-squared:  0.001069,   Adjusted R-squared:  0.0009489 
## F-statistic: 8.918 on 65 and 541843 DF,  p-value: < 2.2e-16

The output above provides a summary of the multiple linear regression model with ‘UnitPrice’ as the response variable and ‘Quantity,’ ‘Country,’ and their interaction term as predictors. Here’s how to interpret the main components of the summary:

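Residuals: The residuals are the differences between the observed and predicted values of the dependent variable. The minimum, first quartile (1Q), median, third quartile (3Q), and maximum values of the residuals are displayed. Here the quartiles lie between -3 and 0, while the extremes are -11,067 and 38,965, which points to a small number of very large outliers.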

Coefficients: The table displays the estimates, standard errors, t-values, and p-values for each coefficient in the model. Because ‘Country’ is categorical, each country coefficient is a difference from the reference level, which is Australia (the first level alphabetically and the one with no row of its own). The intercept is the expected ‘UnitPrice’ in Australia when ‘Quantity’ is zero.

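Interpretation of Coefficients: For instance, the coefficient for ‘Quantity’ is -0.011024. Because the model contains interaction terms, this is the slope for the reference country only: in Australia, a one-unit increase in ‘Quantity’ is associated with an estimated decrease of 0.011024 in ‘UnitPrice.’ Its p-value of 0.6929 means this slope is not statistically distinguishable from zero.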

Interaction Terms: The model includes interaction terms between ‘Quantity’ and ‘Country.’ Each interaction coefficient is the difference between that country's ‘Quantity’ slope and the slope for the reference country. For instance, ‘Quantity:CountryHong Kong’ has a coefficient of -2.328423 (p = 5.82e-12), so the estimated slope for Hong Kong is the ‘Quantity’ coefficient plus this value, a much steeper decrease in ‘UnitPrice’ with increasing ‘Quantity’ than in the reference country.
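The country-specific slopes can be worked out directly from the coefficient table. The sketch below copies three interaction coefficients from the output above, and the variable names are illustrative. It assumes Australia is the reference level, since it is the only country without its own coefficient.

```r
# Coefficients copied from the summary(model) output above
base_slope <- -0.011024   # Quantity: slope for the reference country
interaction <- c(
  "Hong Kong" = -2.328423,
  "Portugal"  = -0.504965,
  "Singapore" = -3.207304
)

# Country-specific slope = reference slope + interaction coefficient
country_slopes <- base_slope + interaction
print(country_slopes)
```

This gives estimated slopes of about -2.34 for Hong Kong, -0.52 for Portugal, and -3.22 for Singapore, compared with about -0.01 for the reference country.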

Significance of Coefficients: The p-value associated with each coefficient tests whether that coefficient differs from zero. Coefficients with p-values below the chosen significance level (e.g., 0.05) are considered statistically significant. For instance, ‘Quantity:CountryPortugal’ has a p-value of 0.0171, suggesting that the ‘Quantity’ slope in Portugal differs significantly from the slope in the reference country. Most of the other country and interaction coefficients are not significant.

Model Fit: The ‘Multiple R-squared’ value of 0.001069 means that the predictors explain about 0.1% of the variance in the dependent variable. The ‘F-statistic’ (8.918 on 65 and 541,843 degrees of freedom) tests whether the model fits better than an intercept-only model. Its extremely low p-value (< 2.2e-16) indicates that it does, although with this many observations even a very weak relationship is statistically significant.
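The reported fit statistics are linked by standard formulas, which can be checked with a few lines of arithmetic. The sketch below uses the rounded values printed in the output, so small discrepancies in the last digits are expected; the variable names are illustrative.

```r
# Values copied from the summary(model) output (rounded as displayed)
r_squared <- 0.001069   # Multiple R-squared
n_coef    <- 65         # numerator DF of the F-statistic
df_resid  <- 541843     # residual degrees of freedom
n_obs     <- df_resid + n_coef + 1   # number of rows used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# The overall F-statistic can be recovered from R-squared
f_statistic <- (r_squared / n_coef) / ((1 - r_squared) / df_resid)

print(n_obs)
print(signif(adj_r_squared, 3))
print(round(f_statistic, 2))
```

The results should agree closely with the reported Adjusted R-squared of 0.0009489 and F-statistic of 8.918, and the recovered row count matches the 541,909 rows in the dataset.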

Overall, the model finds a few statistically significant associations with ‘UnitPrice,’ mainly for Hong Kong, Singapore, and Portugal. However, the very low R-squared value shows that the model explains only a small proportion of the variance in ‘UnitPrice.’ Other variables or more complex interactions would be needed to improve the model’s predictive power. The extreme values of ‘Quantity’ and ‘UnitPrice’ in the initial summary, including negative ones, may also be distorting the fit.