Project Overview This project analyzes an e-commerce dataset containing 300 transactions with customer details, product information, and transaction records. The goal is to uncover insights through data cleaning, exploratory analysis, statistical tests, and visualization. # Set a working directory.
spc_tbl_ [300 × 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ TransactionID : num [1:300] 1 2 3 4 5 6 7 8 9 10 ...
$ CustomerID : num [1:300] 207 253 110 256 274 52 191 165 18 169 ...
$ ProductID : num [1:300] 14 15 19 1 2 15 10 5 5 17 ...
$ Quantity : num [1:300] 5 1 5 3 5 2 1 5 4 2 ...
$ PaymentMethod : chr [1:300] "UPI" "Cash on Delivery" "Cash on Delivery" "Debit Card" ...
$ TransactionDate : chr [1:300] "12/28/2023" "4/17/2023" "1/17/2023" "10/23/2023" ...
$ ProductCategory : chr [1:300] "Clothing" "Electronics" "Clothing" "Home Appliances" ...
$ Price : num [1:300] 33.4 389 145.9 215 325.5 ...
$ Rating : num [1:300] 4.2 2.2 1.7 2.7 3.4 2.2 3.7 1.4 1.4 1.9 ...
$ TotalAmount : num [1:300] 167 389 729 645 1627 ...
$ Age : num [1:300] 50 19 37 50 24 50 40 37 25 53 ...
$ Gender : chr [1:300] "Male" "Female" "Female" "Male" ...
$ Location : chr [1:300] "Houston" "Chicago" "Houston" "Los Angeles" ...
$ MembershipStatus: chr [1:300] "Basic" "Premium" "Basic" "Premium" ...
- attr(*, "spec")=
.. cols(
.. TransactionID = col_double(),
.. CustomerID = col_double(),
.. ProductID = col_double(),
.. Quantity = col_double(),
.. PaymentMethod = col_character(),
.. TransactionDate = col_character(),
.. ProductCategory = col_character(),
.. Price = col_double(),
.. Rating = col_double(),
.. TotalAmount = col_double(),
.. Age = col_double(),
.. Gender = col_character(),
.. Location = col_character(),
.. MembershipStatus = col_character()
.. )
- attr(*, "problems")=<externalptr>
TransactionID CustomerID ProductID Quantity PaymentMethod TransactionDate ProductCategory Price
0 0 0 0 0 0 0 0
Rating TotalAmount Age Gender Location MembershipStatus
0 0 8 13 0 0
[1] 50 19 37 50 24 50 40 37 25 53 65 62 42 42 42 42 42 42 42 42 34 24 25 52 53 29 24 24 44 19 65 22 52 24 43 44 47 21 42 50 43 23 25
[44] 27 18 35 42 53 34 49 63 27 49 18 62 43 41 41 52 44 31 43 37 50 26 24 36 24 28 62 61 50 37 52 38 25 25 65 49 44 63 59 24 41 56 56
[87] 49 63 46 60 28 46 22 52 61 62 36 42 47 57 54 43 47 32 40 27 39 26 43 20 24 58 59 19 28 59 42 18 42 22 38 65 56 50 40 52 31 24 34
[130] 33 41 38 31 40 43 58 23 59 53 62 24 32 24 63 42 22 25 45 35 21 30 44 37 46 29 62 37 42 21 42 44 55 33 56 40 32 63 39 39 24 52 60
[173] 48 53 35 63 30 33 44 54 58 40 44 22 41 22 39 24 50 63 59 20 43 40 19 63 42 58 56 52 64 56 41 49 54 35 52 26 51 63 42 25 31 46 45
[216] 36 65 41 43 35 38 42 54 39 59 36 42 57 19 55 32 57 59 27 48 56 44 43 43 51 57 52 33 27 42 50 19 47 46 60 29 41 40 40 42 60 49 24
[259] 57 25 34 49 46 42 40 27 29 49 28 60 41 63 23 58 45 59 27 31 38 29 50 65 59 27 24 24 55 53 26 28 40 32 35 59 30 57 59 31 63 23
` ## handling missing value for gender # For Gender,replace with “male” category
[1] "Male" "Female" "Female" "Male" "Male" "Male" "Male" "Male" "Female" "Female" "Male" "Male" "Male" "Female"
[15] "Male" "Female" "Male" "Male" "Male" "Female" "Female" "Male" "Male" "Male" "Female" "Male" "Male" "Male"
[29] "Female" "Male" "Female" "Female" "Male" "Male" "Female" "Male" "Male" "Male" "Male" "Female" "Male" "Male"
[43] "Female" "Female" "Female" "Female" NA NA NA NA NA NA NA NA NA NA
[57] NA NA NA "Female" "Male" "Female" "Female" "Male" "Female" "Female" "Female" "Male" "Female" "Female"
[71] "Female" "Female" "Female" "Female" "Female" "Female" "Male" "Female" "Female" "Female" "Female" "Male" "Male" "Male"
[85] "Male" "Female" "Male" "Female" "Female" "Female" "Female" "Female" "Female" "Male" "Male" "Female" "Male" "Female"
[99] "Male" "Male" "Male" "Male" "Female" "Female" "Female" "Male" "Male" "Male" "Female" "Female" "Female" "Male"
[113] "Male" "Male" "Female" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Male" "Female"
[127] "Male" "Male" "Female" "Female" "Male" "Male" "Female" "Female" "Male" "Male" "Male" "Male" "Male" "Male"
[141] "Male" "Male" "Male" "Female" "Male" "Female" "Male" "Male" "Female" "Male" "Male" "Female" "Female" "Female"
[155] "Female" "Female" "Male" "Male" "Male" "Male" "Male" "Female" "Male" "Male" "Male" "Female" "Male" "Male"
[169] "Male" "Female" "Male" "Male" "Female" "Female" "Female" "Male" "Male" "Male" "Male" "Female" "Male" "Male"
[183] "Female" "Male" "Male" "Female" "Male" "Male" "Female" "Female" "Male" "Female" "Female" "Female" "Female" "Female"
[197] "Female" "Female" "Female" "Male" "Female" "Female" "Female" "Male" "Male" "Female" "Male" "Female" "Female" "Female"
[211] "Female" "Female" "Male" "Female" "Female" "Female" "Female" "Male" "Female" "Female" "Male" "Female" "Male" "Male"
[225] "Male" "Female" "Female" "Female" "Male" "Male" "Female" "Female" "Male" "Female" "Female" "Male" "Male" "Male"
[239] "Male" "Male" "Female" "Female" "Female" "Female" "Male" "Male" "Male" "Male" "Female" "Male" "Female" "Male"
[253] "Female" "Male" "Male" "Female" "Male" "Male" "Male" "Male" "Male" "Male" "Female" "Male" "Female" "Male"
[267] "Female" "Male" "Male" "Female" "Male" "Male" "Female" "Female" "Male" "Male" "Male" "Male" "Male" "Female"
[281] "Male" "Male" "Male" "Male" "Male" "Male" "Female" "Male" "Male" "Male" "Male" "Female" "Female" "Female"
[295] "Male" "Male" "Male" "Male" "Female" "Female"
TransactionID CustomerID ProductID Quantity PaymentMethod TransactionDate ProductCategory Price
0 0 0 0 0 0 0 0
Rating TotalAmount Age Gender Location MembershipStatus
0 0 0 13 0 0
• Detect and handle outliers in numerical variables (e.g., Age, Price, TotalAmount) using methods like the IQR rule or Z-score.
Variable: TransactionID - Outliers found: 0
Variable: CustomerID - Outliers found: 0
Variable: ProductID - Outliers found: 0
Variable: Quantity - Outliers found: 0
Variable: Price - Outliers found: 0
Variable: Rating - Outliers found: 0
Variable: TotalAmount - Outliers found: 4
Variable: Age - Outliers found: 0
[1] "Number of outliers removed: 4"
Perform univariate and bivariate analysis to understand the distribution of variables and relationships between them # Summary statistics
TransactionID CustomerID ProductID Quantity PaymentMethod TransactionDate ProductCategory
Min. : 1.00 Min. : 4.00 Min. : 1.00 Min. :1.000 Length:296 Length:296 Length:296
1st Qu.: 74.75 1st Qu.: 76.75 1st Qu.: 6.00 1st Qu.:2.000 Class :character Class :character Class :character
Median :149.50 Median :156.50 Median :11.00 Median :3.000 Mode :character Mode :character Mode :character
Mean :149.80 Mean :154.93 Mean :10.64 Mean :3.101
3rd Qu.:224.25 3rd Qu.:230.25 3rd Qu.:15.25 3rd Qu.:4.000
Max. :300.00 Max. :299.00 Max. :20.00 Max. :5.000
Price Rating TotalAmount Age Gender Location MembershipStatus
Min. : 33.36 Min. :1.400 Min. : 33.36 Min. :18.00 Length:296 Length:296 Length:296
1st Qu.:145.89 1st Qu.:2.100 1st Qu.: 348.95 1st Qu.:31.00 Class :character Class :character Class :character
Median :215.02 Median :2.600 Median : 650.96 Median :42.00 Mode :character Mode :character Mode :character
Mean :250.65 Mean :2.932 Mean : 782.45 Mean :41.81
3rd Qu.:341.40 3rd Qu.:3.700 3rd Qu.:1087.23 3rd Qu.:52.00
Max. :466.49 Max. :5.000 Max. :2252.85 Max. :65.00
Basic Gold Premium
Cash on Delivery 24 34 21
Credit Card 22 16 25
Debit Card 22 20 23
UPI 35 23 31
• Build a regression model to predict TotalAmount based on variables like Age, Price, Quantity, and Rating.
Call:
lm(formula = TotalAmount ~ Age + Price + Rating + Quantity, data = ecommerce_cleaned)
Residuals:
Min 1Q Median 3Q Max
-493.37 -98.03 8.40 99.71 447.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -725.30851 52.37741 -13.848 <2e-16 ***
Age 0.08560 0.74111 0.116 0.908
Price 3.14353 0.08184 38.412 <2e-16 ***
Rating -8.46592 9.18348 -0.922 0.357
Quantity 238.95715 6.71914 35.564 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 166.5 on 291 degrees of freedom
Multiple R-squared: 0.9076, Adjusted R-squared: 0.9063
F-statistic: 714.4 on 4 and 291 DF, p-value: < 2.2e-16
Age Price Rating Quantity
1.003959 1.006389 1.015030 1.011156
Analyze the correlation between numerical variables (e.g., Price vs. TotalAmount, Age vs. TotalAmount).
Hypothesis:H0:There is significant relationship between the variables H1:There is no significant Relationship in atleast one of the variables # Select numeric variables
Quantity Price Rating TotalAmount Age
Quantity 1.00000000 0.02973474 -0.09877651 0.65943058 0.02206473
Price 0.02973474 1.00000000 -0.06313723 0.70665706 -0.04423694
Rating -0.09877651 -0.06313723 1.00000000 -0.12278622 0.03821019
TotalAmount 0.65943058 0.70665706 -0.12278622 1.00000000 -0.01488728
Age 0.02206473 -0.04423694 0.03821019 -0.01488728 1.00000000
##pearson Rank
Age Price Quantity Rating TotalAmount
Age 1.00000000 -0.04423694 0.02206473 0.03821019 -0.01488728
Price -0.04423694 1.00000000 0.02973474 -0.06313723 0.70665706
Quantity 0.02206473 0.02973474 1.00000000 -0.09877651 0.65943058
Rating 0.03821019 -0.06313723 -0.09877651 1.00000000 -0.12278622
TotalAmount -0.01488728 0.70665706 0.65943058 -0.12278622 1.00000000
Age Price Quantity Rating TotalAmount
Age 1.000000000 -0.042935951 0.01352312 0.026838550 -0.009588703
Price -0.042935951 1.000000000 0.01763169 -0.001029652 0.685290962
Quantity 0.013523121 0.017631694 1.00000000 -0.066746308 0.689802071
Rating 0.026838550 -0.001029652 -0.06674631 1.000000000 -0.057914324
TotalAmount -0.009588703 0.685290962 0.68980207 -0.057914324 1.000000000
• Perform a t-test to compare the average TotalAmount spent by male and female customers. • Perform a chi-square test to check the association between ProductCategory and PaymentMethod. • Perform ANOVA to compare the average TotalAmount across different MembershipStatus levels.
Welch Two Sample t-test
data: TotalAmount by Gender
t = -2.4726, df = 274.83, p-value = 0.01402
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
-285.1822 -32.3603
sample estimates:
mean in group Female mean in group Male
696.0371 854.8084
Pearson's Chi-squared test
data: table(ecommerce_cleaned$ProductCategory, ecommerce_cleaned$PaymentMethod)
X-squared = 2.2394, df = 6, p-value = 0.8964
Df Sum Sq Mean Sq F value Pr(>F)
MembershipStatus 2 451973 225986 0.762 0.467
Residuals 293 86840140 296383
[1] "No significant difference found in Total Amount across Membership Status levels."
This project provided a comprehensive analysis of the e-commerce dataset, revealing insights into customer behavior, product performance, and spending patterns. Key findings include: - The average age of customers is around 35 years, with a significant portion being in the 25-34 age range. - The most popular product categories are Electronics and Fashion, with Electronics having the highest average spending. - Payment methods are predominantly Credit Card and PayPal, with Credit Card being the most common.