E-commerce, or electronic commerce, is the buying and selling of goods or services over the internet. The concept first emerged in 1979, when Michael Aldrich, a British entrepreneur, connected a modified domestic TV to a transaction-processing computer via a telephone line, making it possible to shop from home. However, e-commerce in the form we know today began to develop in the 1990s with the advent of the commercial internet.
Looking ahead, e-commerce is expected to continue to grow with the adoption of new technologies such as artificial intelligence (AI), virtual reality (VR), and blockchain, which will increase efficiency and provide a more personalized and secure shopping experience.
# Input Data and Check Data
ecommerce <- read.csv("E-commerce Customer Behavior - Sheet1.csv")
The data has been loaded. Let's get started.
head(ecommerce)
## Customer.ID Gender Age City Membership.Type Total.Spend
## 1 101 Female 29 New York Gold 1120.20
## 2 102 Male 34 Los Angeles Silver 780.50
## 3 103 Female 43 Chicago Bronze 510.75
## 4 104 Male 30 San Francisco Gold 1480.30
## 5 105 Male 27 Miami Silver 720.40
## 6 106 Female 37 Houston Bronze 440.80
## Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1 14 4.6 TRUE 25
## 2 11 4.1 FALSE 18
## 3 9 3.4 TRUE 42
## 4 19 4.7 FALSE 12
## 5 13 4.0 TRUE 55
## 6 8 3.1 FALSE 22
## Satisfaction.Level
## 1 Satisfied
## 2 Neutral
## 3 Unsatisfied
## 4 Satisfied
## 5 Unsatisfied
## 6 Neutral
tail(ecommerce)
## Customer.ID Gender Age City Membership.Type Total.Spend
## 345 445 Male 28 San Francisco Gold 1480.10
## 346 446 Male 32 Miami Silver 660.30
## 347 447 Female 36 Houston Bronze 470.50
## 348 448 Female 30 New York Gold 1190.80
## 349 449 Male 34 Los Angeles Silver 780.20
## 350 450 Female 43 Chicago Bronze 515.75
## Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 345 21 4.9 FALSE 13
## 346 10 3.8 TRUE 42
## 347 8 3.0 FALSE 27
## 348 16 4.5 TRUE 28
## 349 11 4.2 FALSE 21
## 350 10 3.3 TRUE 49
## Satisfaction.Level
## 345 Satisfied
## 346 Unsatisfied
## 347 Neutral
## 348 Satisfied
## 349 Neutral
## 350 Unsatisfied
dim(ecommerce)
## [1] 350 11
names(ecommerce)
## [1] "Customer.ID" "Gender"
## [3] "Age" "City"
## [5] "Membership.Type" "Total.Spend"
## [7] "Items.Purchased" "Average.Rating"
## [9] "Discount.Applied" "Days.Since.Last.Purchase"
## [11] "Satisfaction.Level"
From our inspection, we can see:
- The ecommerce data contains 350 rows and 11 columns.
- The columns are named "Customer.ID", "Gender", "Age", "City", "Membership.Type", "Total.Spend", "Items.Purchased", "Average.Rating", "Discount.Applied", "Days.Since.Last.Purchase", and "Satisfaction.Level".

Check the data type of each column:
str(ecommerce)
## 'data.frame': 350 obs. of 11 variables:
## $ Customer.ID : int 101 102 103 104 105 106 107 108 109 110 ...
## $ Gender : chr "Female" "Male" "Female" "Male" ...
## $ Age : int 29 34 43 30 27 37 31 35 41 28 ...
## $ City : chr "New York" "Los Angeles" "Chicago" "San Francisco" ...
## $ Membership.Type : chr "Gold" "Silver" "Bronze" "Gold" ...
## $ Total.Spend : num 1120 780 511 1480 720 ...
## $ Items.Purchased : int 14 11 9 19 13 8 15 12 10 21 ...
## $ Average.Rating : num 4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
## $ Discount.Applied : logi TRUE FALSE TRUE FALSE TRUE FALSE ...
## $ Days.Since.Last.Purchase: int 25 18 42 12 55 22 28 14 40 9 ...
## $ Satisfaction.Level : chr "Satisfied" "Neutral" "Unsatisfied" "Satisfied" ...
From this result, we find that some columns do not have the correct data type. We need to convert them to the correct type (data coercion).
# Inspect Data & Cleaning Data
ecommerce$age <- as.numeric(ecommerce$Age)
ecommerce$Gender <- as.factor(ecommerce$Gender)
ecommerce$City <- as.factor(ecommerce$City)
ecommerce$Membership.Type <- as.factor(ecommerce$Membership.Type)
ecommerce$Satisfaction.Level <- as.factor(ecommerce$Satisfaction.Level)
str(ecommerce)
## 'data.frame': 350 obs. of 12 variables:
## $ Customer.ID : int 101 102 103 104 105 106 107 108 109 110 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 1 2 ...
## $ Age : int 29 34 43 30 27 37 31 35 41 28 ...
## $ City : Factor w/ 6 levels "Chicago","Houston",..: 5 3 1 6 4 2 5 3 1 6 ...
## $ Membership.Type : Factor w/ 3 levels "Bronze","Gold",..: 2 3 1 2 3 1 2 3 1 2 ...
## $ Total.Spend : num 1120 780 511 1480 720 ...
## $ Items.Purchased : int 14 11 9 19 13 8 15 12 10 21 ...
## $ Average.Rating : num 4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
## $ Discount.Applied : logi TRUE FALSE TRUE FALSE TRUE FALSE ...
## $ Days.Since.Last.Purchase: int 25 18 42 12 55 22 28 14 40 9 ...
## $ Satisfaction.Level : Factor w/ 4 levels "","Neutral","Satisfied",..: 3 2 4 3 4 2 3 2 4 3 ...
## $ age : num 29 34 43 30 27 37 31 35 41 28 ...
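The conversions above can also be written in one step. The following is a minimal sketch, assuming the intent was to convert `Age` itself rather than add a second column; `demo` and `factor_cols` are illustrative names, and the six rows are copied from the `head()` output shown earlier.

```r
# Six rows copied from the head() output; `demo` is an illustrative name.
demo <- data.frame(
  Gender = c("Female", "Male", "Female", "Male", "Male", "Female"),
  Age = c(29L, 34L, 43L, 30L, 27L, 37L),
  City = c("New York", "Los Angeles", "Chicago",
           "San Francisco", "Miami", "Houston"),
  Membership.Type = c("Gold", "Silver", "Bronze", "Gold", "Silver", "Bronze"),
  Satisfaction.Level = c("Satisfied", "Neutral", "Unsatisfied",
                         "Satisfied", "Unsatisfied", "Neutral"),
  stringsAsFactors = FALSE
)

# Convert every categorical column to a factor in a single assignment.
factor_cols <- c("Gender", "City", "Membership.Type", "Satisfaction.Level")
demo[factor_cols] <- lapply(demo[factor_cols], as.factor)

# Overwrite Age in place (capital A), so no extra column is created.
demo$Age <- as.numeric(demo$Age)

str(demo)
```

Because column names in R are case-sensitive, assigning to `demo$Age` keeps the column count unchanged, whereas assigning to `demo$age` would append a new column.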
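The categorical columns now have the desired data type. Note that the first conversion assigned its result to a new lowercase age column instead of overwriting Age, which is why str() now reports 12 variables and Age is still an integer. Next, check whether there is any missing data: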
colSums(is.na(ecommerce))
## Customer.ID Gender Age
## 0 0 0
## City Membership.Type Total.Spend
## 0 0 0
## Items.Purchased Average.Rating Discount.Applied
## 0 0 0
## Days.Since.Last.Purchase Satisfaction.Level age
## 0 0 0
anyNA(ecommerce)
## [1] FALSE
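`is.na()` only detects `NA`, so blank strings pass this check unnoticed, and the `str()` output above shows that Satisfaction.Level has a blank (`""`) level. The sketch below shows one way to count and recode such blanks; the vector `satisfaction` is a small illustrative example, not the real column.

```r
# Illustrative vector in the style of Satisfaction.Level, with one blank entry.
satisfaction <- c("Satisfied", "Neutral", "", "Unsatisfied", "Satisfied")

# is.na() does not flag the blank entry.
n_na_before <- sum(is.na(satisfaction))

# Count blank entries explicitly.
n_blank <- sum(satisfaction == "")

# Recode blanks as NA, then build the factor so no blank level is created.
satisfaction[satisfaction == ""] <- NA
satisfaction <- factor(satisfaction)

n_na_after <- sum(is.na(satisfaction))
levels(satisfaction)
```

Alternatively, passing `na.strings = c("", "NA")` to `read.csv()` treats blank fields as missing at load time.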
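No NA values were found. Keep in mind, however, that Satisfaction.Level contains a blank level, which is.na() does not count as missing. Next, let's subset the data.
We will subset by dropping only the Discount.Applied column, since we do not need that information.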
head(ecommerce)
## Customer.ID Gender Age City Membership.Type Total.Spend
## 1 101 Female 29 New York Gold 1120.20
## 2 102 Male 34 Los Angeles Silver 780.50
## 3 103 Female 43 Chicago Bronze 510.75
## 4 104 Male 30 San Francisco Gold 1480.30
## 5 105 Male 27 Miami Silver 720.40
## 6 106 Female 37 Houston Bronze 440.80
## Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1 14 4.6 TRUE 25
## 2 11 4.1 FALSE 18
## 3 9 3.4 TRUE 42
## 4 19 4.7 FALSE 12
## 5 13 4.0 TRUE 55
## 6 8 3.1 FALSE 22
## Satisfaction.Level age
## 1 Satisfied 29
## 2 Neutral 34
## 3 Unsatisfied 43
## 4 Satisfied 30
## 5 Unsatisfied 27
## 6 Neutral 37
# Deleting Column 'Discount.Applied'
ecommerce <- ecommerce[, !names(ecommerce) %in% "Discount.Applied"]
# Displays the first six rows of the dataset after column deletion
head(ecommerce)
## Customer.ID Gender Age City Membership.Type Total.Spend
## 1 101 Female 29 New York Gold 1120.20
## 2 102 Male 34 Los Angeles Silver 780.50
## 3 103 Female 43 Chicago Bronze 510.75
## 4 104 Male 30 San Francisco Gold 1480.30
## 5 105 Male 27 Miami Silver 720.40
## 6 106 Female 37 Houston Bronze 440.80
## Items.Purchased Average.Rating Days.Since.Last.Purchase Satisfaction.Level
## 1 14 4.6 25 Satisfied
## 2 11 4.1 18 Neutral
## 3 9 3.4 42 Unsatisfied
## 4 19 4.7 12 Satisfied
## 5 13 4.0 55 Unsatisfied
## 6 8 3.1 22 Neutral
## age
## 1 29
## 2 34
## 3 43
## 4 30
## 5 27
## 6 37
With Discount.Applied removed, the ecommerce data is ready to be analyzed.
summary(ecommerce)
## Customer.ID Gender Age City Membership.Type
## Min. :101.0 Female:175 Min. :26.0 Chicago :58 Bronze:116
## 1st Qu.:188.2 Male :175 1st Qu.:30.0 Houston :58 Gold :117
## Median :275.5 Median :32.5 Los Angeles :59 Silver:117
## Mean :275.5 Mean :33.6 Miami :58
## 3rd Qu.:362.8 3rd Qu.:37.0 New York :59
## Max. :450.0 Max. :43.0 San Francisco:58
## Total.Spend Items.Purchased Average.Rating Days.Since.Last.Purchase
## Min. : 410.8 Min. : 7.0 Min. :3.000 Min. : 9.00
## 1st Qu.: 502.0 1st Qu.: 9.0 1st Qu.:3.500 1st Qu.:15.00
## Median : 775.2 Median :12.0 Median :4.100 Median :23.00
## Mean : 845.4 Mean :12.6 Mean :4.019 Mean :26.59
## 3rd Qu.:1160.6 3rd Qu.:15.0 3rd Qu.:4.500 3rd Qu.:38.00
## Max. :1520.1 Max. :21.0 Max. :4.900 Max. :63.00
## Satisfaction.Level age
## : 2 Min. :26.0
## Neutral :107 1st Qu.:30.0
## Satisfied :125 Median :32.5
## Unsatisfied:116 Mean :33.6
## 3rd Qu.:37.0
## Max. :43.0
Summary:
- Los Angeles and New York have the most customers (59 each); the other four cities have 58 each.
- The data contains equal numbers of female and male customers (175 each).
- The youngest customer is 26 years old, and the oldest is 43.
- Gold and Silver are the most common membership types (117 customers each), just ahead of Bronze (116).
- Total spend per customer ranges from $410.8 to $1520.1.
- The number of items purchased ranges from 7 to 21.
- Average ratings range from 3.00 to 4.90.
- Days since last purchase range from 9 to 63.
- Satisfied is the most common satisfaction level, with 125 customers; 2 records have a blank satisfaction level.

Check for outliers in total spend by city:
aggregate(Total.Spend ~ City + Items.Purchased, data = ecommerce, FUN = sum)
## City Items.Purchased Total.Spend
## 1 Houston 7 10849.5
## 2 Houston 8 15070.4
## 3 Chicago 9 16856.2
## 4 Chicago 10 12137.0
## 5 Miami 10 15426.9
## 6 Los Angeles 11 21336.4
## 7 Miami 11 690.3
## 8 Los Angeles 12 19551.6
## 9 Miami 12 6205.4
## 10 Los Angeles 13 6636.0
## 11 Miami 13 16989.6
## 12 Miami 14 730.4
## 13 New York 14 11582.9
## 14 New York 15 27404.4
## 15 New York 16 28539.2
## 16 New York 17 1210.6
## 17 San Francisco 18 12341.8
## 18 San Francisco 19 8693.1
## 19 San Francisco 20 27819.5
## 20 San Francisco 21 35812.4
aggregate(Total.Spend ~ City, ecommerce, mean)
## City Total.Spend
## 1 Chicago 499.8828
## 2 Houston 446.8948
## 3 Los Angeles 805.4915
## 4 Miami 690.3897
## 5 New York 1165.0356
## 6 San Francisco 1459.7724
aggregate(Total.Spend ~ City, ecommerce, var)
## City Total.Spend
## 1 Chicago 233.4142
## 2 Houston 314.5924
## 3 Los Angeles 295.4784
## 4 Miami 373.4490
## 5 New York 605.7437
## 6 San Francisco 1784.3171
aggregate(Total.Spend ~ City, ecommerce, sd)
## City Total.Spend
## 1 Chicago 15.27790
## 2 Houston 17.73675
## 3 Los Angeles 17.18948
## 4 Miami 19.32483
## 5 New York 24.61186
## 6 San Francisco 42.24118
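The three `aggregate()` calls above can be combined into one. This is a sketch under the assumption that a single wide table is wanted; `demo` and `spend_stats` are illustrative names, and the twelve rows are copied from the `head()` and `tail()` output shown earlier, so the numbers will not match the full-data results.

```r
# Twelve rows copied from the head() and tail() output; `demo` is illustrative.
demo <- data.frame(
  City = c("New York", "Los Angeles", "Chicago", "San Francisco", "Miami",
           "Houston", "San Francisco", "Miami", "Houston", "New York",
           "Los Angeles", "Chicago"),
  Total.Spend = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80,
                  1480.10, 660.30, 470.50, 1190.80, 780.20, 515.75)
)

# One pass per city: the function returns a named vector of three statistics.
spend_stats <- aggregate(
  Total.Spend ~ City, data = demo,
  FUN = function(x) c(mean = mean(x), var = var(x), sd = sd(x))
)

# aggregate() stores the three statistics as a matrix column; flatten it
# into ordinary columns (Total.Spend.mean, Total.Spend.var, Total.Spend.sd).
spend_stats <- do.call(data.frame, spend_stats)
spend_stats
```

The standard deviation is the square root of the variance, which is why the `sd` output above for Chicago (15.27790) is the square root of its `var` output (233.4142).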
boxplot(ecommerce$Total.Spend)
Summary (values taken from the summary() output above):
- Median customer spend is about 775.
- The middle 50% of customers spend between about 502 and 1161.
- Total spend ranges from about 411 to 1520.
- There are no outliers, so no individual customer's spend is extreme
relative to the rest of the data.
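A boxplot flags a point as an outlier when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. Using the quartiles from the `summary()` output above, the IQR is 1160.6 - 502.0 = 658.6, so the fences sit at 502.0 - 987.9 = -485.9 and 1160.6 + 987.9 = 2148.5; the observed range of 410.8 to 1520.1 lies inside them. The sketch below applies the same rule in code to the twelve Total.Spend values visible in the `head()` and `tail()` output; `spend` and the fence variables are illustrative names.

```r
# Twelve Total.Spend values copied from the head() and tail() output.
spend <- c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80,
           1480.10, 660.30, 470.50, 1190.80, 780.20, 515.75)

# Quartiles and interquartile range.
q <- quantile(spend, c(0.25, 0.75))
iqr <- IQR(spend)

# The 1.5 * IQR fences.
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers <- spend[spend < lower_fence | spend > upper_fence]
length(outliers)

# boxplot.stats() applies a closely related rule (based on hinges)
# and is what boxplot() uses to decide which points to draw separately.
boxplot.stats(spend)$out
```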
# Group data by city and add up total expenses
total_spend_by_city <- aggregate(Total.Spend ~ City, data = ecommerce, sum)
# Find the city with the lowest total spending
city_with_lowest_spend <- total_spend_by_city[which.min(total_spend_by_city$Total.Spend), ]
# Displays results
city_with_lowest_spend
## City Total.Spend
## 2 Houston 25919.9
Answer: Houston has the lowest total spend of all the cities.
# Group data by membership and add up total spend
total_spend_by_membership <- aggregate(Total.Spend ~ Membership.Type, data = ecommerce, sum)
# Find the membership with the highest total spend
membership_with_highest_spend <- total_spend_by_membership[which.max(total_spend_by_membership$Total.Spend),]
# Display result
total_spend_by_membership
## Membership.Type Total.Spend
## 1 Bronze 54913.1
## 2 Gold 153403.9
## 3 Silver 87566.6
Answer: Gold is the membership type with the highest total spend.
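To put the totals in context, they can be ranked and expressed as shares and per-customer averages. The sketch below rebuilds the table from the totals printed above and the customer counts from the `summary()` output; `totals`, `ranked`, `Share`, and `Spend.Per.Customer` are illustrative names.

```r
# Totals from the aggregate() output; customer counts from summary().
totals <- data.frame(
  Membership.Type = c("Bronze", "Gold", "Silver"),
  Total.Spend = c(54913.1, 153403.9, 87566.6),
  Customers = c(116, 117, 117)
)

# Share of overall spend contributed by each membership type.
totals$Share <- totals$Total.Spend / sum(totals$Total.Spend)

# Average spend per customer within each membership type.
totals$Spend.Per.Customer <- totals$Total.Spend / totals$Customers

# Rank from highest to lowest total spend.
ranked <- totals[order(totals$Total.Spend, decreasing = TRUE), ]
ranked
```

Since the three groups are almost the same size, the gap in totals comes from spend per customer rather than from the number of members.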
# Calculate the Pearson correlation coefficient
correlation <- cor(ecommerce$Items.Purchased, ecommerce$Total.Spend, method = "pearson")
# Display the correlation result
correlation
## [1] 0.9724248
Answer: The correlation value of 0.9724248 indicates that there is a very strong and positive relationship between the number of items purchased (Items.Purchased) and total expenditure (Total.Spend). A correlation close to 1 indicates that as the number of items purchased increases, total spending also tends to increase linearly.
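`cor()` returns only the coefficient. If a significance test and confidence interval are also wanted, `cor.test()` provides them. The sketch below uses the twelve rows visible in the `head()` and `tail()` output, so its coefficient will differ from the full-data value of 0.9724248; `items`, `spend`, `r`, and `result` are illustrative names.

```r
# Twelve (Items.Purchased, Total.Spend) pairs from the head() and tail() output.
items <- c(14, 11, 9, 19, 13, 8, 21, 10, 8, 16, 11, 10)
spend <- c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80,
           1480.10, 660.30, 470.50, 1190.80, 780.20, 515.75)

# Pearson correlation coefficient.
r <- cor(items, spend, method = "pearson")

# Same coefficient, plus a t-test of H0: correlation = 0 and a 95% interval.
result <- cor.test(items, spend, method = "pearson")
result$estimate
result$p.value
result$conf.int

# Squared correlation: share of variance in spend explained by a linear fit.
r^2
```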
# Calculate the frequency of each level of satisfaction
satisfaction_frequency <- table(ecommerce$Satisfaction.Level)
# Show the satisfaction frequency
satisfaction_frequency
##
## Neutral Satisfied Unsatisfied
## 2 107 125 116
# Create a bar plot for the distribution of satisfaction levels
barplot(satisfaction_frequency,
main = "Distribution of Satisfaction Levels",
xlab = "Level of Satisfaction",
ylab = "Number of Customers",
col = "lightblue",
border = "blue")
Answer: Based on the bar plot of the distribution of customer
satisfaction levels, Satisfied is the largest group (125 customers),
followed by Unsatisfied (116) and Neutral (107), so customers are
spread fairly evenly across the three levels. Two records have a blank
satisfaction level and appear as an unlabeled bar. The data contains no
category below Unsatisfied, such as "very dissatisfied".
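Percentages make the three groups easier to compare than raw counts. The sketch below starts from the counts printed by `table()` above and assumes the two blank records should be excluded from the plot; `counts`, `valid`, and `shares` are illustrative names.

```r
# Counts copied from the table() output; "Blank" stands for the unlabeled level.
counts <- c(Blank = 2, Neutral = 107, Satisfied = 125, Unsatisfied = 116)

# Drop the blank level before computing shares (an assumption of this sketch).
valid <- counts[names(counts) != "Blank"]

# Percentage of customers in each satisfaction level.
shares <- round(100 * prop.table(valid), 1)
shares

# Plot the levels from most to least frequent.
barplot(sort(valid, decreasing = TRUE),
        main = "Distribution of Satisfaction Levels",
        xlab = "Level of Satisfaction",
        ylab = "Number of Customers",
        col = "lightblue",
        border = "blue")
```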
# Convert the column type
ecommerce$Satisfaction.Level <- as.factor(ecommerce$Satisfaction.Level)
ecommerce$Age <- as.numeric(ecommerce$Age)
# Check the structure to confirm that the data types have changed
str(ecommerce)
## 'data.frame': 350 obs. of 11 variables:
## $ Customer.ID : int 101 102 103 104 105 106 107 108 109 110 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 1 2 ...
## $ Age : num 29 34 43 30 27 37 31 35 41 28 ...
## $ City : Factor w/ 6 levels "Chicago","Houston",..: 5 3 1 6 4 2 5 3 1 6 ...
## $ Membership.Type : Factor w/ 3 levels "Bronze","Gold",..: 2 3 1 2 3 1 2 3 1 2 ...
## $ Total.Spend : num 1120 780 511 1480 720 ...
## $ Items.Purchased : int 14 11 9 19 13 8 15 12 10 21 ...
## $ Average.Rating : num 4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
## $ Days.Since.Last.Purchase: int 25 18 42 12 55 22 28 14 40 9 ...
## $ Satisfaction.Level : Factor w/ 4 levels "","Neutral","Satisfied",..: 3 2 4 3 4 2 3 2 4 3 ...
## $ age : num 29 34 43 30 27 37 31 35 41 28 ...
# Run a one-way ANOVA to test whether mean age differs across satisfaction levels
anova_model <- aov(Age ~ Satisfaction.Level, data = ecommerce)
# Showing ANOVA results
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Satisfaction.Level 3 2356 785.2 45.85 <2e-16 ***
## Residuals 346 5925 17.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Answer: The ANOVA results show a statistically significant difference in mean age across customer satisfaction levels (F(3, 346) = 45.85, p < 2e-16). Because the p-value is far below 0.001, we can conclude that mean customer age differs between at least two satisfaction categories. Note that the 3 degrees of freedom reflect four groups, because the blank satisfaction level is treated as its own category.
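A significant ANOVA says that at least one group mean differs, but not which ones. A common follow-up is Tukey's HSD, sketched below on the twelve rows visible in the `head()` and `tail()` output (four customers per satisfaction level); `demo`, `fit`, and `tukey` are illustrative names, and results on this small sample will not match the full data.

```r
# Twelve (Age, Satisfaction.Level) pairs from the head() and tail() output.
demo <- data.frame(
  Age = c(29, 34, 43, 30, 27, 37, 28, 32, 36, 30, 34, 43),
  Satisfaction.Level = factor(c(
    "Satisfied", "Neutral", "Unsatisfied", "Satisfied", "Unsatisfied",
    "Neutral", "Satisfied", "Unsatisfied", "Neutral", "Satisfied",
    "Neutral", "Unsatisfied"
  ))
)

# One-way ANOVA: does mean age differ across satisfaction levels?
fit <- aov(Age ~ Satisfaction.Level, data = demo)
summary(fit)

# Tukey's HSD: pairwise differences in mean age with adjusted p-values.
tukey <- TukeyHSD(fit)
tukey
```

With three levels there are three pairwise comparisons; each row reports the difference in mean age, a 95% confidence interval, and an adjusted p-value.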
This e-commerce dataset consists of 350 observations and 11 variables covering customer identification, demographics, and shopping behavior. Satisfied customers form the largest group (125), followed by Unsatisfied (116) and Neutral (107), with 2 records missing a satisfaction level. The analysis shows a very strong positive relationship between the number of items purchased and total expenditure, with a correlation coefficient of 0.972: customers who buy more items tend to spend proportionally more. In addition, the ANOVA results show that mean customer age differs significantly across satisfaction levels (F(3, 346) = 45.85, p < 2e-16).