---
title: "Descriptive and Predictive Analysis on Online Retail Dataset"
author: "Shubhpreet Kaur & Pragati Gupta"
date: "2025-04-23"
output:
  word_document: default
  html_document: default
  pdf_document: default
---
{r}
library(dplyr)
library(tidyr)
library(ggplot2)
library(GGally)
library(gridExtra)
#1. Loading the dataset
{r}
library(readr)
online_retail <- read_csv("C:/Users/HP/OneDrive/Desktop/CA 3 Project of R/online retail.zip")
View(online_retail)
There are 1,000 rows and 13 columns in the dataset.
#2. What is the structure of the dataset?
{r}
str(online_retail)
This dataset contains detailed transactional data from an e-commerce platform, with 1,000 rows and 13 columns.
#3. Are there any missing values?
{r}
colSums(is.na(online_retail))

Yes. There are 201 missing values in review_score and 103 in gender.
#4. How can the missing values be removed?
{r}
# Drop every row that is missing either review_score or gender
online_retail <- online_retail %>%
  drop_na(review_score, gender)
colSums(is.na(online_retail))
Now there are no missing values in the dataset.
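Dropping rows also discards every other field recorded for those orders, so it can be worth checking how many rows the step removes. The following is a minimal sketch on a small made-up data frame; `toy` and its values are hypothetical and are not taken from the dataset, and base R's `complete.cases()` stands in for the `drop_na()` step used above.

```r
# Hypothetical toy data standing in for online_retail
toy <- data.frame(
  review_score = c(5, NA, 3, 4, NA),
  gender       = c("M", "F", NA, "F", "M")
)

rows_before <- nrow(toy)

# Keep only rows where both columns are present
toy_clean <- toy[complete.cases(toy[, c("review_score", "gender")]), ]

# Number of rows lost by removing incomplete records
rows_dropped <- rows_before - nrow(toy_clean)
rows_dropped
```

Comparing `rows_dropped` with the original row count gives a quick sense of how much data the cleaning step costs.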
#5. Are there any duplicate rows?
{r}
sum(duplicated(online_retail))

There are no duplicate rows in the dataset.
#6. What is the distribution of transactions by city?
{r}
table(online_retail$city)

Every city has a single transaction.
#7. What is the average price of products by category?
{r}
aggregate(price ~ category_name, data = online_retail, mean)

Electronics has the highest average price, while Home & Living has the lowest.
#8. Which payment mode is preferred by customers?
{r}
table(online_retail$payment_method)

Most customers pay by cash on delivery.
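Raw counts show which method leads, but shares make the size of the lead easier to judge. A minimal sketch, assuming a made-up vector of payment methods (only "Cash on Delivery" is named in the text; the other labels and all counts are hypothetical):

```r
# Hypothetical payment methods; the real column is online_retail$payment_method
payment <- c("Cash on Delivery", "Credit Card", "Cash on Delivery",
             "UPI", "Cash on Delivery", "Credit Card")

# Frequency table, most common method first
counts <- sort(table(payment), decreasing = TRUE)

# Convert counts to proportions of all transactions
shares <- round(prop.table(counts), 2)

names(counts)[1]   # most frequent payment method
shares
```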
#9. Which product category has the highest sales quantity?
{r}
online_retail %>%
  group_by(category_name) %>%
  summarise(total_quantity = sum(quantity)) %>%
  arrange(desc(total_quantity))

Electronics has the highest total quantity sold.
#10. What is the average review score for each product category?
{r}
online_retail %>%
  group_by(category_name) %>%
  summarise(avg_review_score = mean(review_score))

The Sports & Outdoors category has the highest average review score.
#11. What is the gender distribution of customers?
{r}
table(online_retail$gender)

There are more male customers than female customers.
#12. Which city has the highest number of orders?
{r}
online_retail %>%
  group_by(city) %>%
  summarise(total_orders = n()) %>%
  arrange(desc(total_orders))

Each city has only two orders.
#13. What is the distribution of review scores?
{r}
table(online_retail$review_score)
Most customers gave a review score of 5, indicating high customer satisfaction.
#14. What is the total revenue generated from each product category?
{r}
online_retail %>%
  mutate(revenue = price * quantity) %>%
  group_by(category_name) %>%
  summarise(total_revenue = sum(revenue)) %>%
  arrange(desc(total_revenue))

The Electronics category generates the highest revenue.
#15. Which payment method generates the highest revenue?
{r}
online_retail %>%
  mutate(revenue = price * quantity) %>%
  group_by(payment_method) %>%
  summarise(total_revenue = sum(revenue)) %>%
  arrange(desc(total_revenue))

Cash on delivery generates the highest revenue.
#16. Which customer has placed the highest number of orders?
{r}
online_retail %>%
  group_by(customer_id) %>%
  summarise(total_orders = n()) %>%
  arrange(desc(total_orders))

Every customer placed exactly one order, so no single customer stands out.
#17. Which product is sold the most?
{r}
aggregate(quantity ~ product_name, data = online_retail, sum)

Laptops were purchased the most, while novels were purchased the least.
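`aggregate()` returns the totals in alphabetical order of product name, so the best and worst sellers have to be picked out by eye. Sorting the result makes them explicit. A minimal sketch on made-up sales (the quantities and the "Shoes" product are hypothetical):

```r
# Hypothetical sales; quantities and the "Shoes" product are made up
sales <- data.frame(
  product_name = c("Laptop", "Novel", "Laptop", "Shoes", "Novel", "Laptop"),
  quantity     = c(3, 1, 4, 3, 1, 5)
)

# Total quantity per product, then sort with the largest total first
totals <- aggregate(quantity ~ product_name, data = sales, FUN = sum)
totals <- totals[order(-totals$quantity), ]

head(totals, 1)   # best seller
tail(totals, 1)   # weakest seller
```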
#18. Creating a new column
{r}
online_retail <- online_retail %>%
  mutate(TotalPrice = quantity * price)

A new column named TotalPrice is created, holding quantity multiplied by price for each row.
#19. Bar Chart: What is the count of orders in each product category?
{r}
category_counts <- online_retail %>%
  count(category_name)

ggplot(category_counts, aes(x = reorder(category_name, -n), y = n)) +
  geom_col(fill = "steelblue") +
  ggtitle("Orders per Product Category") +
  xlab("Category") +
  ylab("Order Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart shows that Electronics is the most popular product category, followed by Sports & Outdoors and Fashion. These categories have the highest order counts, indicating strong customer preference. Home & Living and Books & Stationery have slightly lower order volumes, suggesting moderate demand. This insight can help prioritize marketing and inventory planning.
#20. What is the gender distribution of customers in the dataset?
{r}
ggplot(online_retail, aes(x = gender)) +
  geom_bar(fill = "mediumseagreen") +
  ggtitle("Customer Gender Distribution") +
  xlab("Gender") +
  ylab("Number of Customers") +
  theme_minimal()

The bar for “Male” is slightly higher than the bar for “Female,” indicating that the dataset contains more male customers.
#21. Scatter Plot: Is there a relationship between age and review_score?
{r}
ggplot(online_retail, aes(x = age, y = review_score)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  labs(title = "Age vs Review Score", x = "Age", y = "Review Score")
The scatter plot of age versus review score shows that most customers give extreme ratings (1 or 5), with fewer ratings in the middle. There is no clear correlation between age and review score, suggesting that age does not significantly influence customer satisfaction. The data appears to be polarized, with many customers either very satisfied or very dissatisfied. This pattern may indicate that factors other than age influence review scores more strongly.
#22. Pair Plot: What relationships exist among age, quantity, and review_score?
{r}
GGally::ggpairs(online_retail[, c("age", "quantity", "review_score")])
The ggpairs plot shows very weak correlations among age, quantity, and review_score, all close to zero. This indicates no significant linear relationship between any of the variables. The review scores are mostly polarized at 1 and 5, with little variation in between. Overall, these variables do not strongly influence each other, suggesting other factors may impact review behavior.
#23. Histogram: What is the distribution of product prices?
{r}
ggplot(online_retail, aes(x = price)) +
  geom_histogram(binwidth = 5, fill = "orange", color = "black") +
  labs(title = "Distribution of Product Prices", x = "Price", y = "Count")
The histogram shows that product prices are distributed fairly evenly across the range, with some noticeable spikes. There is no strong skew, indicating a wide variety of product pricing. A few price ranges, especially near ₹500, appear more popular. Overall, the product prices are well spread out without heavy concentration in a specific range.
#24. Box Plot: How do review scores differ across gender?
{r}
ggplot(online_retail, aes(x = gender, y = review_score, fill = gender)) +
  geom_boxplot() +
  labs(title = "Review Score by Gender", x = "Gender", y = "Review Score") +
  theme_minimal()
The boxplot shows that both male and female customers generally gave high review scores, with medians around 4. The spread of scores is similar for both genders, ranging mostly between 3 and 5. There are no significant outliers, and the distribution appears symmetric. Overall, both genders seem equally satisfied based on their review scores.
#25. How is the quantity purchased distributed, and how does it vary by review score?
{r}
# Box plot 1: overall distribution of quantity
plot1 <- ggplot(online_retail, aes(x = "", y = quantity)) +
  geom_boxplot(fill = "lightblue") +
  ggtitle("Distribution of Quantity Purchased") +
  ylab("Quantity") +
  xlab("") +
  theme_minimal()

# Box plot 2: quantity grouped by review score
plot2 <- ggplot(online_retail, aes(x = as.factor(review_score), y = quantity)) +
  geom_boxplot(fill = "salmon") +
  ggtitle("Quantity vs Review Score") +
  xlab("Review Score") +
  ylab("Quantity") +
  theme_minimal()

# Show both plots side by side
grid.arrange(plot1, plot2, ncol = 2)
The first box plot shows that the quantity purchased typically ranges between 2 and 4 units, with a median of 3, suggesting most customers buy small quantities. The second plot indicates that the quantity purchased remains relatively consistent across different review scores, showing no significant relationship between the quantity bought and the customer satisfaction level. Overall, customer reviews don’t seem to be influenced by the size of their order.
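The medians and spreads read off the box plots can also be computed directly, which avoids judging them by eye. A minimal sketch with made-up values (the data frame `toy` is hypothetical and only mirrors the two column names used above):

```r
# Hypothetical values; the real columns are quantity and review_score
toy <- data.frame(
  review_score = c(1, 1, 1, 5, 5, 5, 3, 3, 3),
  quantity     = c(2, 3, 4, 2, 3, 4, 1, 3, 5)
)

# Median and interquartile range of quantity within each review score
med <- tapply(toy$quantity, toy$review_score, median)
iqr <- tapply(toy$quantity, toy$review_score, IQR)

med
iqr
```

Similar medians across the groups would support the reading that order size and review score are unrelated.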
#26. Is there a correlation between the customer’s age and the quantity of products they purchase?
{r}
# cor.test() drops incomplete pairs itself, so no `use` argument is needed
cor.test(online_retail$age, online_retail$quantity)
The Pearson correlation coefficient between age and quantity is approximately -0.052, indicating a very weak negative correlation. The p-value is 0.164, which is greater than 0.05, meaning the result is not statistically significant. Therefore, we cannot conclude that there’s a meaningful relationship between a customer’s age and the number of products they purchase. The confidence interval also includes zero, reinforcing this finding.
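The p-value reported by `cor.test()` follows from the correlation coefficient and the sample size through a t statistic with n - 2 degrees of freedom. A minimal sketch on simulated data (the values below are randomly generated and are not the retail dataset) that rebuilds both numbers by hand:

```r
set.seed(1)                      # simulated data, not the retail dataset
n        <- 200
age      <- sample(18:70, n, replace = TRUE)
quantity <- sample(1:5,  n, replace = TRUE)

res <- cor.test(age, quantity)   # Pearson correlation by default

# Rebuild the test statistic and two-sided p-value from r and n
r        <- unname(res$estimate)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_manual), df = n - 2)

c(t = t_manual, p = p_manual)
res$conf.int                     # confidence interval for r
```

Because the statistic scales with the square root of n, the same small correlation can be non-significant in a small sample and significant in a large one.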
#27. Is there a significant correlation between product price and review score?
{r}
# cor.test() drops incomplete pairs itself, so no `use` argument is needed
cor.test(online_retail$price, online_retail$review_score)
The Pearson correlation coefficient between product price and review score is 0.083, indicating a very weak positive correlation. The p-value is 0.026, which is less than 0.05, suggesting that this relationship is statistically significant. Although the correlation is weak, the result implies that as product price increases, review scores tend to slightly increase. However, the effect size is minimal and may not have strong practical significance.
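One way to make the gap between statistical and practical significance concrete is to square the correlation, which gives the share of variance the two variables have in common. A short worked calculation using the coefficient reported above:

```r
r <- 0.083                    # correlation between price and review score reported above
r_squared <- r^2              # shared variance as a proportion
pct_shared <- round(100 * r_squared, 2)
pct_shared                    # shared variance as a percentage
```

Under this reading, price accounts for well under 1% of the variation in review scores, which is why the effect is described as minimal.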
#28. Can a customer’s age predict the quantity of products they purchase?
{r}
model <- lm(quantity ~ age, data = online_retail)
summary(model)
The regression model shows that age is not a significant predictor of the quantity of products purchased, as the p-value is 0.164 (greater than 0.05). The coefficient for age is -0.004, indicating a very small negative relationship, but it is not statistically meaningful. The R-squared value is 0.0027, meaning the model explains less than 1% of the variance in quantity. Thus, customer age does not effectively predict purchase quantity in this dataset.
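With a single predictor, the regression carries the same information as the Pearson correlation test: the slope's p-value matches the correlation p-value, and R-squared is the squared correlation. This is why the p-value of 0.164 appears in both analyses. A minimal sketch on simulated data (randomly generated, not the retail dataset) that checks both identities:

```r
set.seed(42)                     # simulated data, not the retail dataset
n        <- 150
age      <- sample(18:70, n, replace = TRUE)
quantity <- sample(1:5,  n, replace = TRUE)

fit <- summary(lm(quantity ~ age))
ct  <- cor.test(age, quantity)

# Pull the slope's p-value and the R-squared out of the model summary
slope_p   <- fit$coefficients["age", "Pr(>|t|)"]
r_squared <- fit$r.squared

all.equal(slope_p, ct$p.value)                # slope p-value vs correlation p-value
all.equal(r_squared, unname(ct$estimate)^2)   # R-squared vs squared correlation
```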
#29. How do age, gender, and product price together influence the quantity of products purchased?
{r}
model <- lm(quantity ~ age + gender + price, data = online_retail)
summary(model)
The multiple regression model suggests that age, gender, and product price do not significantly influence the quantity of products purchased. All p-values are greater than 0.05, indicating none of the predictors are statistically significant. The R-squared value is only 0.0032, meaning the model explains less than 1% of the variance in quantity. Overall, this model has very weak explanatory power and does not provide useful predictions for product quantity based on these variables.
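The p-values and R-squared quoted above can be pulled out of the model summary programmatically rather than read from the printed output. A minimal sketch on simulated data (the data frame `toy` is randomly generated and only reuses the column names from this analysis):

```r
set.seed(7)                      # simulated data, not the retail dataset
n <- 120
toy <- data.frame(
  age      = sample(18:70, n, replace = TRUE),
  gender   = sample(c("Female", "Male"), n, replace = TRUE),
  price    = round(runif(n, 10, 500), 2),
  quantity = sample(1:5, n, replace = TRUE)
)

fit <- summary(lm(quantity ~ age + gender + price, data = toy))

coef_table <- fit$coefficients            # estimate, std. error, t value, p-value
p_values   <- coef_table[-1, "Pr(>|t|)"]  # drop the intercept row

p_values
fit$r.squared
fit$adj.r.squared                         # penalises the extra predictors
```

Adjusted R-squared is the more useful figure when comparing this model with the single-predictor one, since plain R-squared never decreases when predictors are added.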
#30. Is there a significant difference in quantity purchased across different product categories?
{r}
anova_result <- aov(quantity ~ category_name, data = online_retail)
summary(anova_result)
The ANOVA test evaluates whether the mean quantity purchased differs across product categories. The p-value is 0.398, which is much greater than 0.05, indicating that the difference in mean quantities across categories is not statistically significant. The F-value of 1.017 also supports this conclusion. Thus, we cannot conclude that product category has a significant effect on the quantity of products purchased.
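The F-value and p-value are linked: the p-value is the upper-tail probability of the F distribution at the observed F, given the between-group and residual degrees of freedom. A minimal sketch on simulated data (randomly generated quantities assigned to three of the category names, not the retail dataset):

```r
set.seed(3)                      # simulated data, not the retail dataset
toy <- data.frame(
  category_name = rep(c("Electronics", "Fashion", "Books & Stationery"), each = 40),
  quantity      = sample(1:5, 120, replace = TRUE)
)

fit <- aov(quantity ~ category_name, data = toy)
tab <- summary(fit)[[1]]         # ANOVA table as a data frame

f_value <- tab[1, "F value"]
p_value <- tab[1, "Pr(>F)"]

# Recompute the p-value from the F distribution's upper tail
p_manual <- pf(f_value, tab[1, "Df"], tab[2, "Df"], lower.tail = FALSE)

c(F = f_value, p = p_value, p_manual = p_manual)
```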