1 A Short History of E-commerce

E-commerce, or electronic commerce, is the buying and selling of goods or services over the internet. The concept first emerged in 1979, when Michael Aldrich, a British entrepreneur, built a system that connected a television to a computer over a telephone line, making it possible to shop from home. E-commerce in the form we are familiar with today began to develop in the 1990s with the arrival of the commercial internet.

1.1 E-commerce in the Future

Looking ahead, e-commerce is expected to continue to grow with the adoption of new technologies such as artificial intelligence (AI), virtual reality (VR), and blockchain, which will increase efficiency and provide a more personalized and secure shopping experience.

2 Data Input

# Input data and check data
ecommerce <- read.csv("E-commerce Customer Behavior - Sheet1.csv")

The data is loaded. Let's get started.
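
As a hedged aside, the import step can also handle text columns and blank cells at read time. The sketch below is not the call used in this report: it reads a small made-up extract (values borrowed from the inspection output, with one blank satisfaction cell added for illustration) and assumes that blank cells should count as missing.

```r
# Hypothetical three-row extract in the same layout as the CSV.
# The blank Satisfaction.Level on the second row is made up for illustration.
csv_text <- "Customer.ID,Gender,Age,Satisfaction.Level
101,Female,29,Satisfied
102,Male,34,
103,Female,43,Unsatisfied"

# na.strings = "" turns empty cells into NA;
# stringsAsFactors = TRUE reads text columns as factors straight away.
demo <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)

str(demo)
colSums(is.na(demo))
```

With these two arguments the later coercion step becomes shorter and blank cells show up in the missing-value check.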

2.1 Data Inspection

head(ecommerce)
##   Customer.ID Gender Age          City Membership.Type Total.Spend
## 1         101 Female  29      New York            Gold     1120.20
## 2         102   Male  34   Los Angeles          Silver      780.50
## 3         103 Female  43       Chicago          Bronze      510.75
## 4         104   Male  30 San Francisco            Gold     1480.30
## 5         105   Male  27         Miami          Silver      720.40
## 6         106 Female  37       Houston          Bronze      440.80
##   Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1              14            4.6             TRUE                       25
## 2              11            4.1            FALSE                       18
## 3               9            3.4             TRUE                       42
## 4              19            4.7            FALSE                       12
## 5              13            4.0             TRUE                       55
## 6               8            3.1            FALSE                       22
##   Satisfaction.Level
## 1          Satisfied
## 2            Neutral
## 3        Unsatisfied
## 4          Satisfied
## 5        Unsatisfied
## 6            Neutral
tail(ecommerce)
##     Customer.ID Gender Age          City Membership.Type Total.Spend
## 345         445   Male  28 San Francisco            Gold     1480.10
## 346         446   Male  32         Miami          Silver      660.30
## 347         447 Female  36       Houston          Bronze      470.50
## 348         448 Female  30      New York            Gold     1190.80
## 349         449   Male  34   Los Angeles          Silver      780.20
## 350         450 Female  43       Chicago          Bronze      515.75
##     Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 345              21            4.9            FALSE                       13
## 346              10            3.8             TRUE                       42
## 347               8            3.0            FALSE                       27
## 348              16            4.5             TRUE                       28
## 349              11            4.2            FALSE                       21
## 350              10            3.3             TRUE                       49
##     Satisfaction.Level
## 345          Satisfied
## 346        Unsatisfied
## 347            Neutral
## 348          Satisfied
## 349            Neutral
## 350        Unsatisfied
dim(ecommerce)
## [1] 350  11
names(ecommerce)
##  [1] "Customer.ID"              "Gender"                  
##  [3] "Age"                      "City"                    
##  [5] "Membership.Type"          "Total.Spend"             
##  [7] "Items.Purchased"          "Average.Rating"          
##  [9] "Discount.Applied"         "Days.Since.Last.Purchase"
## [11] "Satisfaction.Level"

From our inspection, we can see that:
- the ecommerce data contains 350 rows and 11 columns;
- the columns are named “Customer.ID”, “Gender”, “Age”, “City”, “Membership.Type”, “Total.Spend”, “Items.Purchased”, “Average.Rating”, “Discount.Applied”, “Days.Since.Last.Purchase”, and “Satisfaction.Level”.

2.2 Data Cleansing & Coercion

Check the data type of each column.

str(ecommerce)
## 'data.frame':    350 obs. of  11 variables:
##  $ Customer.ID             : int  101 102 103 104 105 106 107 108 109 110 ...
##  $ Gender                  : chr  "Female" "Male" "Female" "Male" ...
##  $ Age                     : int  29 34 43 30 27 37 31 35 41 28 ...
##  $ City                    : chr  "New York" "Los Angeles" "Chicago" "San Francisco" ...
##  $ Membership.Type         : chr  "Gold" "Silver" "Bronze" "Gold" ...
##  $ Total.Spend             : num  1120 780 511 1480 720 ...
##  $ Items.Purchased         : int  14 11 9 19 13 8 15 12 10 21 ...
##  $ Average.Rating          : num  4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
##  $ Discount.Applied        : logi  TRUE FALSE TRUE FALSE TRUE FALSE ...
##  $ Days.Since.Last.Purchase: int  25 18 42 12 55 22 28 14 40 9 ...
##  $ Satisfaction.Level      : chr  "Satisfied" "Neutral" "Unsatisfied" "Satisfied" ...

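From this result, we can see that some columns do not have the correct data type: Gender, City, Membership.Type, and Satisfaction.Level are stored as character but represent categories. We need to convert them to the correct type (data coercion).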

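# Inspect data & clean data
# Note: R names are case-sensitive, so this line adds a new lowercase `age`
# column (12 variables below) instead of converting `Age` in place
ecommerce$age <- as.numeric(ecommerce$Age)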
ecommerce$Gender <- as.factor(ecommerce$Gender)
ecommerce$City <- as.factor(ecommerce$City)
ecommerce$Membership.Type <- as.factor(ecommerce$Membership.Type)
ecommerce$Satisfaction.Level <- as.factor(ecommerce$Satisfaction.Level)
str(ecommerce)
## 'data.frame':    350 obs. of  12 variables:
##  $ Customer.ID             : int  101 102 103 104 105 106 107 108 109 110 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 1 2 ...
##  $ Age                     : int  29 34 43 30 27 37 31 35 41 28 ...
##  $ City                    : Factor w/ 6 levels "Chicago","Houston",..: 5 3 1 6 4 2 5 3 1 6 ...
##  $ Membership.Type         : Factor w/ 3 levels "Bronze","Gold",..: 2 3 1 2 3 1 2 3 1 2 ...
##  $ Total.Spend             : num  1120 780 511 1480 720 ...
##  $ Items.Purchased         : int  14 11 9 19 13 8 15 12 10 21 ...
##  $ Average.Rating          : num  4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
##  $ Discount.Applied        : logi  TRUE FALSE TRUE FALSE TRUE FALSE ...
##  $ Days.Since.Last.Purchase: int  25 18 42 12 55 22 28 14 40 9 ...
##  $ Satisfaction.Level      : Factor w/ 4 levels "","Neutral","Satisfied",..: 3 2 4 3 4 2 3 2 4 3 ...
##  $ age                     : num  29 34 43 30 27 37 31 35 41 28 ...
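
The coercion step can also be written so that no extra column appears. The following is a minimal sketch on a toy data frame built from the first rows of the inspection output, not the report's own code; the helper vector `factor_cols` is a name introduced here for illustration.

```r
# Toy data frame using the first rows shown by head(); not the full dataset.
demo <- data.frame(
  Age    = c(29L, 34L, 43L),
  Gender = c("Female", "Male", "Female"),
  City   = c("New York", "Los Angeles", "Chicago")
)

# Assigning to the existing name (same capitalisation) converts in place.
demo$Age <- as.numeric(demo$Age)

# Convert several character columns to factors in one step.
factor_cols <- c("Gender", "City")  # hypothetical helper vector
demo[factor_cols] <- lapply(demo[factor_cols], as.factor)

str(demo)
ncol(demo)  # still 3: no extra column was created
```

Because the assignment targets `Age` rather than `age`, the column count stays the same after coercion.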

The character columns are now factors, and a numeric copy of Age is stored in the new age column. Next, check whether there is any missing data.

colSums(is.na(ecommerce))
##              Customer.ID                   Gender                      Age 
##                        0                        0                        0 
##                     City          Membership.Type              Total.Spend 
##                        0                        0                        0 
##          Items.Purchased           Average.Rating         Discount.Applied 
##                        0                        0                        0 
## Days.Since.Last.Purchase       Satisfaction.Level                      age 
##                        0                        0                        0
anyNA(ecommerce)
## [1] FALSE

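There are no NA values. Note, however, that is.na() does not flag empty strings: Satisfaction.Level has a blank level "" in the str() output above, so a few rows have no recorded satisfaction level.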
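
Blank strings are not NA, so they pass the check above unnoticed. A minimal sketch of how they could be detected and recoded, using a toy factor rather than the real column, and assuming that blanks should be treated as missing:

```r
# Toy factor mimicking Satisfaction.Level, including one blank entry.
satisfaction <- factor(c("Satisfied", "Neutral", "", "Unsatisfied"))

sum(is.na(satisfaction))   # is.na() does not flag the empty string
sum(satisfaction == "")    # an explicit comparison does

# Recode the blank level as NA and drop the now-unused level.
satisfaction[satisfaction == ""] <- NA
satisfaction <- droplevels(satisfaction)

levels(satisfaction)
sum(is.na(satisfaction))
```

Whether to recode, drop, or keep such rows is an analysis decision; this report keeps them as they are.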

Next, we subset the data by dropping only the Discount.Applied column, since we do not need that information.

head(ecommerce)
##   Customer.ID Gender Age          City Membership.Type Total.Spend
## 1         101 Female  29      New York            Gold     1120.20
## 2         102   Male  34   Los Angeles          Silver      780.50
## 3         103 Female  43       Chicago          Bronze      510.75
## 4         104   Male  30 San Francisco            Gold     1480.30
## 5         105   Male  27         Miami          Silver      720.40
## 6         106 Female  37       Houston          Bronze      440.80
##   Items.Purchased Average.Rating Discount.Applied Days.Since.Last.Purchase
## 1              14            4.6             TRUE                       25
## 2              11            4.1            FALSE                       18
## 3               9            3.4             TRUE                       42
## 4              19            4.7            FALSE                       12
## 5              13            4.0             TRUE                       55
## 6               8            3.1            FALSE                       22
##   Satisfaction.Level age
## 1          Satisfied  29
## 2            Neutral  34
## 3        Unsatisfied  43
## 4          Satisfied  30
## 5        Unsatisfied  27
## 6            Neutral  37
# Deleting Column 'Discount.Applied'
ecommerce <- ecommerce[, !names(ecommerce) %in% "Discount.Applied"]

# Displays the first six rows of the dataset after column deletion
head(ecommerce)
##   Customer.ID Gender Age          City Membership.Type Total.Spend
## 1         101 Female  29      New York            Gold     1120.20
## 2         102   Male  34   Los Angeles          Silver      780.50
## 3         103 Female  43       Chicago          Bronze      510.75
## 4         104   Male  30 San Francisco            Gold     1480.30
## 5         105   Male  27         Miami          Silver      720.40
## 6         106 Female  37       Houston          Bronze      440.80
##   Items.Purchased Average.Rating Days.Since.Last.Purchase Satisfaction.Level
## 1              14            4.6                       25          Satisfied
## 2              11            4.1                       18            Neutral
## 3               9            3.4                       42        Unsatisfied
## 4              19            4.7                       12          Satisfied
## 5              13            4.0                       55        Unsatisfied
## 6               8            3.1                       22            Neutral
##   age
## 1  29
## 2  34
## 3  43
## 4  30
## 5  27
## 6  37

With Discount.Applied removed, the ecommerce data is ready to be analyzed!
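
For comparison, base R offers a few equivalent ways to drop a column. The sketch below uses a three-row toy data frame (values from the first rows shown earlier) and is only meant to illustrate the alternatives; the object names `a`, `b`, and `c3` are introduced here.

```r
# Toy data frame; values taken from the first rows of the dataset.
demo <- data.frame(
  Customer.ID      = 101:103,
  Total.Spend      = c(1120.20, 780.50, 510.75),
  Discount.Applied = c(TRUE, FALSE, TRUE)
)

# Option 1: assign NULL to the column.
a <- demo
a$Discount.Applied <- NULL

# Option 2: subset() with a negative selection.
b <- subset(demo, select = -Discount.Applied)

# Option 3: the %in% approach; drop = FALSE keeps a data frame
# even when only one column would remain.
c3 <- demo[, !names(demo) %in% "Discount.Applied", drop = FALSE]

names(a)
names(b)
names(c3)
```

All three leave the same two columns; the `%in%` form extends most easily to a vector of several column names.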

3 Data Explanation

summary(ecommerce)
##   Customer.ID       Gender         Age                  City    Membership.Type
##  Min.   :101.0   Female:175   Min.   :26.0   Chicago      :58   Bronze:116     
##  1st Qu.:188.2   Male  :175   1st Qu.:30.0   Houston      :58   Gold  :117     
##  Median :275.5                Median :32.5   Los Angeles  :59   Silver:117     
##  Mean   :275.5                Mean   :33.6   Miami        :58                  
##  3rd Qu.:362.8                3rd Qu.:37.0   New York     :59                  
##  Max.   :450.0                Max.   :43.0   San Francisco:58                  
##   Total.Spend     Items.Purchased Average.Rating  Days.Since.Last.Purchase
##  Min.   : 410.8   Min.   : 7.0    Min.   :3.000   Min.   : 9.00           
##  1st Qu.: 502.0   1st Qu.: 9.0    1st Qu.:3.500   1st Qu.:15.00           
##  Median : 775.2   Median :12.0    Median :4.100   Median :23.00           
##  Mean   : 845.4   Mean   :12.6    Mean   :4.019   Mean   :26.59           
##  3rd Qu.:1160.6   3rd Qu.:15.0    3rd Qu.:4.500   3rd Qu.:38.00           
##  Max.   :1520.1   Max.   :21.0    Max.   :4.900   Max.   :63.00           
##    Satisfaction.Level      age      
##             :  2      Min.   :26.0  
##  Neutral    :107      1st Qu.:30.0  
##  Satisfied  :125      Median :32.5  
##  Unsatisfied:116      Mean   :33.6  
##                       3rd Qu.:37.0  
##                       Max.   :43.0

Summary:
- Los Angeles and New York have the most customers (59 each); the other four cities have 58 each.
- Women and men are equally represented, with 175 customers each.
- The youngest customer is 26 years old and the oldest is 43.
- Gold and Silver are the most common memberships (117 each), just ahead of Bronze (116).
- Total spend ranges from $410.8 to $1520.1.
- The number of items purchased ranges from 7 to 21.
- The average rating ranges from 3.00 to 4.90.
- Days since last purchase ranges from 9 to 63, with a median of 23.
- Satisfied is the most common satisfaction level (125 customers); 2 customers have a blank satisfaction level.

Check for outliers in total spend by city.

aggregate(Total.Spend ~ City + Items.Purchased, data = ecommerce, FUN = sum)
##             City Items.Purchased Total.Spend
## 1        Houston               7     10849.5
## 2        Houston               8     15070.4
## 3        Chicago               9     16856.2
## 4        Chicago              10     12137.0
## 5          Miami              10     15426.9
## 6    Los Angeles              11     21336.4
## 7          Miami              11       690.3
## 8    Los Angeles              12     19551.6
## 9          Miami              12      6205.4
## 10   Los Angeles              13      6636.0
## 11         Miami              13     16989.6
## 12         Miami              14       730.4
## 13      New York              14     11582.9
## 14      New York              15     27404.4
## 15      New York              16     28539.2
## 16      New York              17      1210.6
## 17 San Francisco              18     12341.8
## 18 San Francisco              19      8693.1
## 19 San Francisco              20     27819.5
## 20 San Francisco              21     35812.4
aggregate(Total.Spend ~ City, ecommerce, mean)
##            City Total.Spend
## 1       Chicago    499.8828
## 2       Houston    446.8948
## 3   Los Angeles    805.4915
## 4         Miami    690.3897
## 5      New York   1165.0356
## 6 San Francisco   1459.7724
aggregate(Total.Spend ~ City, ecommerce, var)
##            City Total.Spend
## 1       Chicago    233.4142
## 2       Houston    314.5924
## 3   Los Angeles    295.4784
## 4         Miami    373.4490
## 5      New York    605.7437
## 6 San Francisco   1784.3171
aggregate(Total.Spend ~ City, ecommerce, sd)
##            City Total.Spend
## 1       Chicago    15.27790
## 2       Houston    17.73675
## 3   Los Angeles    17.18948
## 4         Miami    19.32483
## 5      New York    24.61186
## 6 San Francisco    42.24118
boxplot(ecommerce$Total.Spend)

Summary:
- The median customer spend is about 775 (775.2 in the summary() output).
- The middle 50% of customers spend roughly between 500 and 1160.
- Total spend ranges from about 410 to 1520.
- No points fall outside the whiskers, so the boxplot shows no outliers in total spend.
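
The "no outliers" reading can be checked numerically with the usual 1.5 × IQR rule that boxplot() applies by default. The sketch below plugs in the quartiles printed by summary(ecommerce); boxplot() itself uses hinges, which can differ slightly from these quartiles, so treat this as an approximation.

```r
# Quartiles and extremes of Total.Spend as printed by summary(ecommerce).
q1 <- 502.0
q3 <- 1160.6
spend_min <- 410.8
spend_max <- 1520.1

iqr <- q3 - q1                  # interquartile range
lower_fence <- q1 - 1.5 * iqr   # points below this would be flagged
upper_fence <- q3 + 1.5 * iqr   # points above this would be flagged

c(iqr = iqr, lower = lower_fence, upper = upper_fence)

# TRUE means every observation lies inside the fences.
spend_min >= lower_fence && spend_max <= upper_fence
```

On the real data, `boxplot.stats(ecommerce$Total.Spend)$out` returns the flagged values directly.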

4 Data Manipulation & Transformation

  1. Which city has the lowest total spend?
# Group data by city and add up total expenses
total_spend_by_city <- aggregate(Total.Spend ~ City, data = ecommerce, sum)

# Find the city with the lowest total spending
city_with_lowest_spend <- total_spend_by_city[which.min(total_spend_by_city$Total.Spend), ]

# Displays results
city_with_lowest_spend
##      City Total.Spend
## 2 Houston     25919.9

Answer: Houston has the lowest total spend of all the cities (25919.9).
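
A variation on the same idea is to sort the aggregated totals, which shows the full ranking rather than only the minimum. This is a sketch on twelve rows taken from the head() and tail() output shown earlier, so its totals are not the full-dataset totals.

```r
# Twelve rows from the head() and tail() output; not the full dataset.
demo <- data.frame(
  City = c("New York", "Los Angeles", "Chicago", "San Francisco", "Miami", "Houston",
           "San Francisco", "Miami", "Houston", "New York", "Los Angeles", "Chicago"),
  Total.Spend = c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80,
                  1480.10, 660.30, 470.50, 1190.80, 780.20, 515.75)
)

# Sum spend per city, then sort ascending so the lowest comes first.
totals <- aggregate(Total.Spend ~ City, data = demo, FUN = sum)
ranked <- totals[order(totals$Total.Spend), ]

ranked
as.character(ranked$City[1])  # city with the lowest total in this sample
```

Sorting with `order(-totals$Total.Spend)` instead would put the highest-spending city first.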

  2. Which membership type has the highest total spend?
# Group data by membership and add up total spend
total_spend_by_membership <- aggregate(Total.Spend ~ Membership.Type, data = ecommerce, sum)

# Find the membership with the highest total spend
membership_with_highest_spend <- total_spend_by_membership[which.max(total_spend_by_membership$Total.Spend),]

# Display the totals for every membership type
total_spend_by_membership
##   Membership.Type Total.Spend
## 1          Bronze     54913.1
## 2            Gold    153403.9
## 3          Silver     87566.6

Answer: Gold is the membership type with the highest total spend (153403.9), ahead of Silver (87566.6) and Bronze (54913.1).

  3. What is the relationship between the number of items purchased (Items.Purchased) and total spending (Total.Spend)?
# Calculate the Pearson correlation coefficient
correlation <- cor(ecommerce$Items.Purchased, ecommerce$Total.Spend, method = "pearson")

# Display the correlation result
correlation
## [1] 0.9724248

Answer: The correlation value of 0.9724248 indicates that there is a very strong and positive relationship between the number of items purchased (Items.Purchased) and total expenditure (Total.Spend). A correlation close to 1 indicates that as the number of items purchased increases, total spending also tends to increase linearly.
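
To make the coefficient less of a black box, the Pearson formula can be computed by hand and compared with cor(). The sketch uses twelve rows from the head() and tail() output, so the value it produces is for that small sample only, not the 0.972 reported for the full dataset.

```r
# Twelve rows from the head() and tail() output; not the full dataset.
items <- c(14, 11, 9, 19, 13, 8, 21, 10, 8, 16, 11, 10)
spend <- c(1120.20, 780.50, 510.75, 1480.30, 720.40, 440.80,
           1480.10, 660.30, 470.50, 1190.80, 780.20, 515.75)

# Pearson r: co-variation of the two variables divided by the
# product of their individual variations.
dx <- items - mean(items)
dy <- spend - mean(spend)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
all.equal(r_manual, cor(items, spend, method = "pearson"))
```

`cor.test(items, spend)` additionally reports a confidence interval and p-value for the coefficient.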

  4. How is the distribution of customer satisfaction levels?
# Calculate the frequency of each level of satisfaction
satisfaction_frequency <- table(ecommerce$Satisfaction.Level)

# Show the satisfaction frequency
satisfaction_frequency
## 
##                 Neutral   Satisfied Unsatisfied 
##           2         107         125         116
# Create a bar plot for the distribution of satisfaction levels
barplot(satisfaction_frequency, 
        main = "Distribution of Satisfaction Levels", 
        xlab = "Level of Satisfaction", 
        ylab = "Number of Customers", 
        col = "lightblue", 
        border = "blue")

Answer: Based on the frequency table and bar plot, Satisfied is the most common level (125 customers), followed by Unsatisfied (116) and Neutral (107). Two customers have a blank satisfaction level, which appears as an unlabeled bar. The three labeled levels are fairly balanced, so dissatisfied customers make up about a third of the data.
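
Percentages often make the distribution easier to compare than raw counts. A small sketch that re-enters the counts from the frequency table above by hand; the label `Blank` for the empty level is introduced here for readability.

```r
# Counts copied from the frequency table; "Blank" labels the empty level.
counts <- c(Blank = 2, Neutral = 107, Satisfied = 125, Unsatisfied = 116)

# Share of each level as a percentage of all customers.
shares <- round(100 * counts / sum(counts), 1)

shares
names(which.max(counts))  # most common level
```

On the real table, `prop.table(satisfaction_frequency)` gives the same proportions without retyping the counts.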

  5. Are there differences in satisfaction levels based on age?
# Convert the column type
ecommerce$Satisfaction.Level <- as.factor(ecommerce$Satisfaction.Level)
ecommerce$Age <- as.numeric(ecommerce$Age)

# Check the structure to confirm that the data types have changed
str(ecommerce)
## 'data.frame':    350 obs. of  11 variables:
##  $ Customer.ID             : int  101 102 103 104 105 106 107 108 109 110 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 1 2 1 2 ...
##  $ Age                     : num  29 34 43 30 27 37 31 35 41 28 ...
##  $ City                    : Factor w/ 6 levels "Chicago","Houston",..: 5 3 1 6 4 2 5 3 1 6 ...
##  $ Membership.Type         : Factor w/ 3 levels "Bronze","Gold",..: 2 3 1 2 3 1 2 3 1 2 ...
##  $ Total.Spend             : num  1120 780 511 1480 720 ...
##  $ Items.Purchased         : int  14 11 9 19 13 8 15 12 10 21 ...
##  $ Average.Rating          : num  4.6 4.1 3.4 4.7 4 3.1 4.5 4.2 3.6 4.8 ...
##  $ Days.Since.Last.Purchase: int  25 18 42 12 55 22 28 14 40 9 ...
##  $ Satisfaction.Level      : Factor w/ 4 levels "","Neutral","Satisfied",..: 3 2 4 3 4 2 3 2 4 3 ...
##  $ age                     : num  29 34 43 30 27 37 31 35 41 28 ...
# Run a one-way ANOVA to test whether mean age differs across satisfaction levels
anova_model <- aov(Age ~ Satisfaction.Level, data = ecommerce)

# Showing ANOVA results
summary(anova_model)
##                     Df Sum Sq Mean Sq F value Pr(>F)    
## Satisfaction.Level   3   2356   785.2   45.85 <2e-16 ***
## Residuals          346   5925    17.1                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Answer: The ANOVA shows a significant difference in mean age across satisfaction levels (F(3, 346) = 45.85, p < 2e-16). The 3 degrees of freedom reflect four groups, because the blank Satisfaction.Level entries are treated as a level of their own. With a p-value far below 0.001, the difference is highly statistically significant, although the test alone does not say which groups differ from each other.
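
One possible follow-up, sketched here on toy data rather than the full dataset, is to exclude the blank level before fitting and then run Tukey's HSD to see which pairs of groups differ. The ages below come from the head() and tail() output; the extra row with a blank level is made up for illustration, and dropping blank rows is an assumption, not something this report does.

```r
# Toy data: ages from head()/tail(), plus one made-up row with a blank level.
demo <- data.frame(
  Age = c(29, 30, 28, 30, 34, 37, 36, 34, 43, 27, 32, 43, 35),
  Satisfaction.Level = factor(c(rep("Satisfied", 4), rep("Neutral", 4),
                                rep("Unsatisfied", 4), ""))
)

# Remove rows with a blank level and drop the unused factor level.
clean <- droplevels(demo[demo$Satisfaction.Level != "", ])

# One-way ANOVA on the three labeled groups.
fit <- aov(Age ~ Satisfaction.Level, data = clean)
summary(fit)

# Pairwise comparisons between satisfaction levels.
TukeyHSD(fit)
```

With three groups instead of four, the factor has 2 degrees of freedom rather than 3.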

5 Explanatory Text

This e-commerce dataset consists of 350 observations and 11 variables covering customer identification, demographics, and shopping behavior. Satisfied is the most common satisfaction level (125 customers), followed by Unsatisfied (116) and Neutral (107), with 2 blank entries. The analysis shows a very strong positive relationship between the number of items purchased and total spend, with a correlation coefficient of 0.972. In addition, the ANOVA results show a significant difference in mean age across satisfaction levels (F(3, 346) = 45.85, p < 2e-16).