Online_Retail <- read.csv('C:/Users/laasy/Documents/Fall 2023/Intro to Statistics in R/Datasets for Final Project/OnlineRetail.csv')
summary(Online_Retail)
## InvoiceNo StockCode Description Quantity
## Length:541909 Length:541909 Length:541909 Min. :-80995.00
## Class :character Class :character Class :character 1st Qu.: 1.00
## Mode :character Mode :character Mode :character Median : 3.00
## Mean : 9.55
## 3rd Qu.: 10.00
## Max. : 80995.00
##
## InvoiceDate UnitPrice CustomerID Country
## Length:541909 Min. :-11062.06 Min. :12346 Length:541909
## Class :character 1st Qu.: 1.25 1st Qu.:13953 Class :character
## Mode :character Median : 2.08 Median :15152 Mode :character
## Mean : 4.61 Mean :15288
## 3rd Qu.: 4.13 3rd Qu.:16791
## Max. : 38970.00 Max. :18287
## NA's :135080
# Checking the unique countries and their counts
table(Online_Retail$Country)
##
## Australia Austria Bahrain
## 1259 401 19
## Belgium Brazil Canada
## 2069 32 151
## Channel Islands Cyprus Czech Republic
## 758 622 30
## Denmark EIRE European Community
## 389 8196 61
## Finland France Germany
## 695 8557 9495
## Greece Hong Kong Iceland
## 146 288 182
## Israel Italy Japan
## 297 803 358
## Lebanon Lithuania Malta
## 45 35 127
## Netherlands Norway Poland
## 2371 1086 341
## Portugal RSA Saudi Arabia
## 1519 58 10
## Singapore Spain Sweden
## 229 2533 462
## Switzerland United Arab Emirates United Kingdom
## 2002 68 495478
## Unspecified USA
## 446 291
# consolidate into fewer categories
Online_Retail$Country[Online_Retail$Country %in% c('Germany', 'France', 'Spain', 'Italy', 'Netherlands', 'Belgium')] <- 'Western Europe'
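# Note: the data labels Ireland as 'EIRE' (see the table above), so 'Ireland' matches no rows here and EIRE stays a separate level
Online_Retail$Country[Online_Retail$Country %in% c('United Kingdom', 'Ireland')] <- 'UK and Ireland'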
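The consolidation step relies on `%in%` matching labels exactly, so a label that does not occur in the data is silently ignored. The sketch below uses a small invented vector (not the real dataset) to show one way to catch such a mismatch with `setdiff` before recoding. Treating 'EIRE' as the label for Ireland follows the country table above. Folding it into the UK group is an assumption about the intended grouping.

```r
# Toy stand-in for Online_Retail$Country (illustrative values only)
country <- c("United Kingdom", "EIRE", "France", "EIRE", "Germany")

# Labels we intend to merge; 'Ireland' mirrors the recoding step above
uk_ireland <- c("United Kingdom", "Ireland")

# Requested labels that never occur in the data
unmatched <- setdiff(uk_ireland, unique(country))
print(unmatched)

# Recode using the label the data actually uses (assumed intent)
country[country %in% c("United Kingdom", "EIRE")] <- "UK and Ireland"
print(table(country))
```

An empty result from `setdiff` would indicate that every requested label exists in the data.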
# Running ANOVA
model <- lm(UnitPrice ~ Country, data = Online_Retail)
anova_result <- anova(model)
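# Summarizing results (note: summary() on an anova table summarizes each column; print anova_result to see the table itself)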
summary(anova_result)
## Df Sum Sq Mean Sq F value
## Min. : 32 Min. :3.001e+06 Min. : 9357 Min. :10.02
## 1st Qu.:135493 1st Qu.:1.270e+09 1st Qu.:30464 1st Qu.:10.02
## Median :270954 Median :2.537e+09 Median :51571 Median :10.02
## Mean :270954 Mean :2.537e+09 Mean :51571 Mean :10.02
## 3rd Qu.:406415 3rd Qu.:3.804e+09 3rd Qu.:72678 3rd Qu.:10.02
## Max. :541876 Max. :5.071e+09 Max. :93784 Max. :10.02
## NA's :1
## Pr(>F)
## Min. :0
## 1st Qu.:0
## Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
## NA's :1
The table of counts shows how many rows each country contributes, and the model tests whether mean ‘UnitPrice’ differs across the country groups formed by the consolidation step.
Because ‘summary’ was applied to the ANOVA table rather than printing it, the output above reports column-wise statistics of that table (minimum, quartiles, maximum) instead of the usual layout with degrees of freedom (Df), sum of squares (Sum Sq), mean squares (Mean Sq), F-value, and p-value (Pr(>F)). The table has only two rows, so its entries can still be read off: ‘Country’ has 32 Df, a Sum Sq of about 3.001e+06, and a Mean Sq of 93784, while the residuals have 541876 Df, a Sum Sq of about 5.071e+09, and a Mean Sq of 9357. The single F-value is 10.02.
Here is the main point from the output:
The p-value is displayed as 0, which in practice means it is vanishingly small rather than exactly zero. This is strong evidence against the null hypothesis that mean unit price is the same across all country groups.
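The ANOVA table can also be printed directly, which shows one row per term and makes the F ratio easier to verify. The following is a minimal sketch on invented data (the `toy` data frame and its `price` and `group` columns are hypothetical). It also computes eta squared, the share of total variation attributed to the grouping, as one possible measure of effect size.

```r
set.seed(1)
# Toy data: three hypothetical groups with different mean prices
toy <- data.frame(
  price = c(rnorm(30, mean = 2), rnorm(30, mean = 3), rnorm(30, mean = 5)),
  group = rep(c("A", "B", "C"), each = 30)
)

fit <- lm(price ~ group, data = toy)
tab <- anova(fit)

# Printing the table shows one row per term plus a Residuals row
print(tab)

# The F value is the ratio of the two mean squares
f_manual <- tab["group", "Mean Sq"] / tab["Residuals", "Mean Sq"]
print(f_manual)

# Eta squared: between-group sum of squares over the total sum of squares
eta_sq <- tab["group", "Sum Sq"] / sum(tab[["Sum Sq"]])
print(eta_sq)
```

Applying the same ratio to the sums of squares reported for the retail data gives roughly 3.001e+06 / (3.001e+06 + 5.071e+09), or about 0.0006, so the country grouping accounts for a very small share of the variation in unit price.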
For stakeholders interested in this data, this implies that mean unit price differs by the country from which an order originates. With more than 540,000 rows, even small differences reach statistical significance, so the size of the differences matters as much as the p-value. Businesses operating in online retail may therefore want to consider the geographical location of their customers when setting pricing and marketing strategies, and to tailor those strategies to the preferences and purchasing power of customers in different countries.
# Checking diagnostic plots
par(mfrow = c(2, 2))
plot(model, which = 1)
plot(model, which = 2)
plot(model, which = 3)
The ‘summary’ function provides detailed information about a linear regression model, including the coefficients, their standard errors, t-values, and p-values. The ‘anova’ function runs an ANOVA test to assess the significance of the overall model. The diagnostic plots help identify issues with the model, such as heteroscedasticity or nonlinearity.
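To illustrate what the diagnostic plots are meant to reveal, the sketch below fits a model to invented data whose noise grows with the predictor, so it is heteroscedastic by construction. The variables `x` and `y` are hypothetical. The correlation at the end is an informal numeric companion to the scale-location plot, not a formal test.

```r
set.seed(2)
# Toy data: the noise standard deviation increases with x
x <- runif(200, min = 1, max = 10)
y <- 1 + 0.5 * x + rnorm(200, sd = 0.3 * x)
fit <- lm(y ~ x)

pdf(NULL)                # null device; drop this line in an interactive session
par(mfrow = c(1, 3))
plot(fit, which = 1:3)   # residuals vs fitted, normal Q-Q, scale-location
dev.off()

# Informal check: do absolute residuals grow with the fitted values?
spread_trend <- cor(fitted(fit), abs(resid(fit)))
print(spread_trend)
```

A clearly positive value here corresponds to the fan shape that heteroscedastic residuals produce in the residuals-versus-fitted plot.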
In a regression of ‘UnitPrice’ on ‘Quantity’, the intercept represents the expected unit price when the quantity is zero. The coefficient for ‘Quantity’ is the change in expected unit price for a one-unit increase in quantity, with any other variables in the model held constant. A positive coefficient indicates that unit price tends to rise with quantity, and a negative coefficient that it tends to fall.
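The reading of the intercept and slope can be checked numerically. This is a minimal sketch on invented data, assuming a simple model with ‘Quantity’ as the only predictor. The `toy` data frame and the numbers used to generate it are hypothetical.

```r
set.seed(3)
# Toy data: price falls by about 0.05 per extra unit ordered (invented numbers)
toy <- data.frame(Quantity = rep(1:20, times = 5))
toy$UnitPrice <- 4 - 0.05 * toy$Quantity + rnorm(nrow(toy), sd = 0.2)

fit <- lm(UnitPrice ~ Quantity, data = toy)
b <- coef(fit)
print(b)

# Intercept = predicted UnitPrice at Quantity 0
# Slope     = difference between predictions at Quantity 1 and Quantity 0
pred <- predict(fit, newdata = data.frame(Quantity = c(0, 1)))
print(unname(pred[2] - pred[1]))
```

The difference between the two predictions equals the ‘Quantity’ coefficient, which is what "change per one-unit increase" means in practice.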
To build a more comprehensive regression model, let’s include the ‘Country’ variable from the ANOVA analysis as an additional categorical predictor in the linear regression model. We will also evaluate if the interaction between ‘Quantity’ and ‘Country’ has a significant impact on the ‘UnitPrice’ variable.
In this updated model, ‘Quantity’ and ‘Country’ are included as predictors, and an interaction term ‘Quantity:Country’ is also added to assess whether the effect of ‘Quantity’ on ‘UnitPrice’ varies across different countries. The ‘summary’ function provides details about the coefficients, standard errors, t-values, and p-values, helping to understand the significance of each variable and the interaction term in predicting ‘UnitPrice’.
Including the interaction lets us examine whether the relationship between ‘Quantity’ and ‘UnitPrice’ depends on the country from which the order originates. This can indicate how market conditions in different countries affect pricing and customer behavior, and help businesses tailor their pricing and marketing strategies accordingly.
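In R's formula syntax, `Quantity * Country` is shorthand for both main effects plus their interaction. The sketch below, on invented data with hypothetical country labels "A" to "C", confirms that the two spellings give the same coefficients. It also shows one way to test the interaction as a whole by comparing nested models with `anova`.

```r
set.seed(4)
# Toy data with no real interaction built in
toy <- data.frame(
  Quantity = rpois(120, lambda = 5),
  Country  = rep(c("A", "B", "C"), each = 40)
)
toy$UnitPrice <- 3 + rnorm(120)

fit_long  <- lm(UnitPrice ~ Quantity + Country + Quantity:Country, data = toy)
fit_short <- lm(UnitPrice ~ Quantity * Country, data = toy)

# Both formulas describe the same model
print(all.equal(coef(fit_long), coef(fit_short)))

# Joint F test of all interaction terms: additive model vs interaction model
fit_additive <- lm(UnitPrice ~ Quantity + Country, data = toy)
print(anova(fit_additive, fit_long))
```

The nested comparison gives a single p-value for the interaction as a whole, which can be easier to interpret than many per-country t-tests.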
# Building a linear regression model with interaction term
model <- lm(UnitPrice ~ Quantity + Country + Quantity:Country, data = Online_Retail)
# Displaying the summary of the linear regression model
summary(model)
##
## Call:
## lm(formula = UnitPrice ~ Quantity + Country + Quantity:Country,
## data = Online_Retail)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11067 -3 -2 0 38965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.953068 3.296847 1.199 0.2305
## Quantity -0.011024 0.027913 -0.395 0.6929
## CountryAustria 0.947377 6.431200 0.147 0.8829
## CountryBahrain 1.140084 24.721683 0.046 0.9632
## CountryBrazil 2.714822 28.685850 0.095 0.9246
## CountryCanada 2.804696 9.077109 0.309 0.7573
## CountryChannel Islands 1.843719 5.196338 0.355 0.7227
## CountryCyprus 3.161142 5.364488 0.589 0.5557
## CountryCzech Republic 0.334179 23.747993 0.014 0.9888
## CountryDenmark -0.125403 7.009881 -0.018 0.9857
## CountryEIRE 2.713661 3.496079 0.776 0.4376
## CountryEuropean Community 4.365212 20.140760 0.217 0.8284
## CountryFinland 3.290004 5.614682 0.586 0.5579
## CountryGreece 4.494560 14.062155 0.320 0.7493
## CountryHong Kong 77.291110 8.631445 8.955 < 2e-16 ***
## CountryIceland -0.849683 9.421930 -0.090 0.9281
## CountryIsrael 1.204422 8.294058 0.145 0.8845
## CountryJapan -1.538290 6.413837 -0.240 0.8105
## CountryLebanon 5.474389 32.543321 0.168 0.8664
## CountryLithuania -0.091695 34.742855 -0.003 0.9979
## CountryMalta 3.506056 12.145079 0.289 0.7728
## CountryNorway 4.624795 4.976201 0.929 0.3527
## CountryPoland 2.659671 8.294321 0.321 0.7485
## CountryPortugal 10.126084 4.693351 2.158 0.0310 *
## CountryRSA 5.160678 26.786074 0.193 0.8472
## CountrySaudi Arabia -0.277510 52.275984 -0.005 0.9958
## CountrySingapore 179.250495 8.919577 20.096 < 2e-16 ***
## CountrySweden 0.992968 6.195314 0.160 0.8727
## CountrySwitzerland 0.436706 4.304162 0.101 0.9192
## CountryUK and Ireland 0.583088 3.299713 0.177 0.8597
## CountryUnited Arab Emirates 1.391060 18.316786 0.076 0.9395
## CountryUnspecified -0.353560 6.842331 -0.052 0.9588
## CountryUSA -1.711249 6.672098 -0.256 0.7976
## CountryWestern Europe 0.838466 3.361268 0.249 0.8030
## Quantity:CountryAustria -0.043577 0.224122 -0.194 0.8458
## Quantity:CountryBahrain -0.028207 0.759945 -0.037 0.9704
## Quantity:CountryBrazil -0.187775 2.049371 -0.092 0.9270
## Quantity:CountryCanada -0.028731 0.171444 -0.168 0.8669
## Quantity:CountryChannel Islands -0.058120 0.158219 -0.367 0.7134
## Quantity:CountryCyprus -0.068914 0.169146 -0.407 0.6837
## Quantity:CountryCzech Republic -0.057333 0.787701 -0.073 0.9420
## Quantity:CountryDenmark -0.016091 0.181339 -0.089 0.9293
## Quantity:CountryEIRE -0.032396 0.038465 -0.842 0.3997
## Quantity:CountryEuropean Community -0.418282 1.907338 -0.219 0.8264
## Quantity:CountryFinland -0.105898 0.177022 -0.598 0.5497
## Quantity:CountryGreece -0.323207 1.040197 -0.311 0.7560
## Quantity:CountryHong Kong -2.328423 0.338230 -6.884 5.82e-12 ***
## Quantity:CountryIceland -0.022990 0.382259 -0.060 0.9520
## Quantity:CountryIsrael -0.092982 0.351865 -0.264 0.7916
## Quantity:CountryJapan 0.009056 0.040170 0.225 0.8216
## Quantity:CountryLebanon -0.459949 3.379620 -0.136 0.8917
## Quantity:CountryLithuania -0.043743 1.636365 -0.027 0.9787
## Quantity:CountryMalta -0.286962 1.068054 -0.269 0.7882
## Quantity:CountryNorway -0.133752 0.132630 -1.008 0.3132
## Quantity:CountryPoland -0.216919 0.516258 -0.420 0.6744
## Quantity:CountryPortugal -0.504965 0.211786 -2.384 0.0171 *
## Quantity:CountryRSA -0.785844 3.848015 -0.204 0.8382
## Quantity:CountrySaudi Arabia -0.157584 5.635750 -0.028 0.9777
## Quantity:CountrySingapore -3.207304 0.232553 -13.792 < 2e-16 ***
## Quantity:CountrySweden -0.002396 0.044727 -0.054 0.9573
## Quantity:CountrySwitzerland -0.054092 0.117417 -0.461 0.6450
## Quantity:CountryUK and Ireland 0.010590 0.027920 0.379 0.7045
## Quantity:CountryUnited Arab Emirates -0.124934 0.947551 -0.132 0.8951
## Quantity:CountryUnspecified -0.110604 0.523767 -0.211 0.8328
## Quantity:CountryUSA 0.003877 0.346358 0.011 0.9911
## Quantity:CountryWestern Europe -0.014720 0.031108 -0.473 0.6361
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 96.71 on 541843 degrees of freedom
## Multiple R-squared: 0.001069, Adjusted R-squared: 0.0009489
## F-statistic: 8.918 on 65 and 541843 DF, p-value: < 2.2e-16
The output above provides a summary of the multiple linear regression model with ‘UnitPrice’ as the response variable and ‘Quantity,’ ‘Country,’ and their interaction term as predictors. Here’s how to interpret the main components of the summary:
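Residuals: The residuals represent the differences between the observed and predicted values of the dependent variable. The minimum, first quartile (1Q), median, third quartile (3Q), and maximum values of the residuals are displayed. Here the quartiles lie between -3 and 0 while the extremes reach -11067 and 38965, which points to a small number of very large outliers.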
Coefficients: The table displays the estimate, standard error, t-value, and p-value for each coefficient in the model. Because ‘Country’ is categorical, each ‘Country’ coefficient is a difference from the baseline level, which is Australia here (the one country group without its own row).
Interpretation of Coefficients: The coefficient for ‘Quantity’ is -0.011024. Because the model includes interactions, this is the slope for the baseline country only: in Australia, a one-unit increase in ‘Quantity’ is associated with an average decrease of 0.011024 in ‘UnitPrice’. With a p-value of 0.6929, this slope is not statistically distinguishable from zero.
Interaction Terms: Each interaction coefficient is the difference between a country's ‘Quantity’ slope and the baseline slope. For instance, ‘Quantity:CountryHong Kong’ has a coefficient of -2.328423, so the estimated slope for Hong Kong is -0.011024 + (-2.328423) = -2.339447, a much steeper decrease in ‘UnitPrice’ with increasing ‘Quantity’ than in the baseline.
Significance of Coefficients: The p-value associated with each coefficient indicates whether that coefficient differs significantly from zero. Coefficients with p-values below the chosen significance level (e.g., 0.05) are considered statistically significant. For instance, ‘Quantity:CountryPortugal’ has a p-value of 0.0171, suggesting that the interaction between ‘Quantity’ and Portugal is statistically significant. Among the interaction terms, only those for Hong Kong, Portugal, and Singapore fall below 0.05.
Model Fit: The ‘Multiple R-squared’ value of 0.001069 is the proportion of variance in the dependent variable explained by the predictors, so the model accounts for about 0.1% of the variance in ‘UnitPrice’. The ‘F-statistic’ tests the overall significance of the model, and the extremely low p-value (< 2.2e-16) indicates that the model is statistically significant.
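Country-specific slopes can be recovered by adding each interaction coefficient to the baseline ‘Quantity’ coefficient. The short sketch below does this for the three countries with significant interaction terms, using estimates copied from the coefficient table above. The variable names are illustrative.

```r
# Estimates copied from the regression output above
base_slope <- -0.011024   # 'Quantity': slope for the baseline country

# Interaction estimates: difference from the baseline slope
interaction <- c(
  "Hong Kong" = -2.328423,
  "Singapore" = -3.207304,
  "Portugal"  = -0.504965
)

# Estimated change in UnitPrice per extra unit, by country
country_slope <- base_slope + interaction
print(round(country_slope, 6))
```

These are point estimates only. Their uncertainty depends on the standard errors of both terms and on their covariance, which this sketch does not compute.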
Overall, the model suggests that the variables and their interactions have some statistically significant associations with the ‘UnitPrice.’ However, the low R-squared value indicates that the model explains only a small proportion of the variance in the ‘UnitPrice.’ This suggests that other variables or complex interactions might need to be considered to improve the model’s predictive power.
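The fit statistics in the summary are linked by standard formulas, which also explains how a model can be highly significant while explaining almost nothing: with this many rows, a tiny R-squared still yields a large F. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported R-squared and degrees of freedom. Small discrepancies from the printed values are expected because the inputs are rounded.

```r
# Values copied from the regression summary above
r2 <- 0.001069    # Multiple R-squared (rounded in the output)
n  <- 541909      # number of rows
p  <- 65          # number of predictors (numerator DF)

df_resid <- n - p - 1   # residual degrees of freedom

# Adjusted R-squared penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic expressed through R-squared
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(df_resid)
print(adj_r2)
print(f_stat)
```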