Capstone Project: Analyzing Customer Behavior and Revenue Trends in a Telecom Company

Project Overview

This project focuses on understanding customer behavior and revenue trends using descriptive statistics, hypothesis testing, and linear regression. Students will analyze telecom company data to identify factors influencing customer spending, contract types, and service usage patterns

Project Objectives

Data Understanding: Explore the dataset and define key business questions.

Data Cleaning & Preparation: Handle missing values and transform data for analysis.

Descriptive Statistics & Exploratory Data Analysis (EDA): Summarize the dataset using statistical measures.

Hypothesis Testing: Test key assumptions about customer behavior and revenue.

Regression Analysis: Build and interpret linear regression models to explain customer spending trends.

Insights & Recommendations: Provide data-driven business recommendations*

Dataset

Use the Telco Customer Churn dataset (available on Kaggle) or generate a synthetic dataset with the following columns:

CustomerID – Unique customer identifier

Gender – Male/Female

SeniorCitizen – 1 for senior citizens, 0 otherwise

Partner – Whether the customer has a partner (Yes/No)

Dependents – Whether the customer has dependents (Yes/No)

Tenure – Number of months the customer has been with the company

Contract – Type of contract (Month-to-month, One-year, Two-year)

PaymentMethod – Payment method (e.g., Electronic check, Credit card)

MonthlyCharges – Monthly bill amount ($)

TotalCharges – Total amount billed to the customer ($)

#Project Workflow & Deliverables……………………………. Understanding the Business Context & Data Define key questions: Do contract types influence customer spending? Are senior citizens more likely to spend less? What factors impact total customer spending?*

1. Load the dataset and inspect the structure.

setwd("~/R training")
Telco_data<-read.csv("Telco_data.csv")
head(Telco_data)
  customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 7590-VHVEG Female             0     Yes         No      1           No
2 5575-GNVDE   Male             0      No         No     34          Yes
3 3668-QPYBK   Male             0      No         No      2          Yes
4 7795-CFOCW   Male             0      No         No     45           No
5 9237-HQITU Female             0      No         No      2          Yes
6 9305-CDSKC Female             0      No         No      8          Yes
     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1 No phone service             DSL             No          Yes               No
2               No             DSL            Yes           No              Yes
3               No             DSL            Yes          Yes               No
4 No phone service             DSL            Yes           No              Yes
5               No     Fiber optic             No           No               No
6              Yes     Fiber optic             No           No              Yes
  TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
1          No          No              No Month-to-month              Yes
2          No          No              No       One year               No
3          No          No              No Month-to-month              Yes
4         Yes          No              No       One year               No
5          No          No              No Month-to-month              Yes
6          No         Yes             Yes Month-to-month              Yes
              PaymentMethod MonthlyCharges TotalCharges Churn
1          Electronic check          29.85        29.85    No
2              Mailed check          56.95      1889.50    No
3              Mailed check          53.85       108.15   Yes
4 Bank transfer (automatic)          42.30      1840.75    No
5          Electronic check          70.70       151.65   Yes
6          Electronic check          99.65       820.50   Yes
str(Telco_data)
'data.frame':   7043 obs. of  21 variables:
 $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
 $ gender          : chr  "Female" "Male" "Male" "Male" ...
 $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : chr  "Yes" "No" "No" "No" ...
 $ Dependents      : chr  "No" "No" "No" "No" ...
 $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
 $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
 $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
 $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
 $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
 $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
 $ TechSupport     : chr  "No" "No" "No" "Yes" ...
 $ StreamingTV     : chr  "No" "No" "No" "No" ...
 $ StreamingMovies : chr  "No" "No" "No" "No" ...
 $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
 $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
 $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
 $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...

2. Data Cleaning & Preparation…………………………

Finding count of missing values in TotalCharges from the Telco data

missing_values<-sum(is.na(Telco_data$TotalCharges)) # sum of missing values
missing_values
[1] 11
View(missing_values)#view the missing values

Loading the Mice package and library for missing values handling

library(mice)
library(tidyverse)
md.pattern(Telco_data) #View the distribution of missing data

     customerID gender SeniorCitizen Partner Dependents tenure PhoneService
7032          1      1             1       1          1      1            1
11            1      1             1       1          1      1            1
              0      0             0       0          0      0            0
     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
7032             1               1              1            1                1
11               1               1              1            1                1
                 0               0              0            0                0
     TechSupport StreamingTV StreamingMovies Contract PaperlessBilling
7032           1           1               1        1                1
11             1           1               1        1                1
               0           0               0        0                0
     PaymentMethod MonthlyCharges Churn TotalCharges   
7032             1              1     1            1  0
11               1              1     1            0  1
                 0              0     0           11 11

Loading the VIM package for visualizationof missing data

library(VIM)
missing_values_plot<-aggr(Telco_data, col=c("red"), 
                          numbers=TRUE, 
                          sortvars=TRUE, 
                          labels=names(Telco_data),
                          cex.axis=.7,
                          gap=3,
                          ylab=c("Histogram of missing data", "Pattern"))#Display pattern and Ditribution of missing values in the telco data

Selecting data imputaion method to use from the mice library

methods(mice) #checking for methods to use in imputation of missing values. In this case i am going to use the method of random forests
 [1] mice.impute.2l.bin              mice.impute.2l.lmer            
 [3] mice.impute.2l.norm             mice.impute.2l.pan             
 [5] mice.impute.2lonly.mean         mice.impute.2lonly.norm        
 [7] mice.impute.2lonly.pmm          mice.impute.cart               
 [9] mice.impute.jomoImpute          mice.impute.lasso.logreg       
[11] mice.impute.lasso.norm          mice.impute.lasso.select.logreg
[13] mice.impute.lasso.select.norm   mice.impute.lda                
[15] mice.impute.logreg              mice.impute.logreg.boot        
[17] mice.impute.mean                mice.impute.midastouch         
[19] mice.impute.mnar.logreg         mice.impute.mnar.norm          
[21] mice.impute.mpmm                mice.impute.norm               
[23] mice.impute.norm.boot           mice.impute.norm.nob           
[25] mice.impute.norm.predict        mice.impute.panImpute          
[27] mice.impute.passive             mice.impute.pmm                
[29] mice.impute.polr                mice.impute.polyreg            
[31] mice.impute.quadratic           mice.impute.rf                 
[33] mice.impute.ri                  mice.impute.sample             
[35] mice.mids                       mice.theme                     
see '?methods' for accessing help and source code
imputed_data<-mice(Telco_data, m=5, method='rf')

 iter imp variable
  1   1  TotalCharges
  1   2  TotalCharges
  1   3  TotalCharges
  1   4  TotalCharges
  1   5  TotalCharges
  2   1  TotalCharges
  2   2  TotalCharges
  2   3  TotalCharges
  2   4  TotalCharges
  2   5  TotalCharges
  3   1  TotalCharges
  3   2  TotalCharges
  3   3  TotalCharges
  3   4  TotalCharges
  3   5  TotalCharges
  4   1  TotalCharges
  4   2  TotalCharges
  4   3  TotalCharges
  4   4  TotalCharges
  4   5  TotalCharges
  5   1  TotalCharges
  5   2  TotalCharges
  5   3  TotalCharges
  5   4  TotalCharges
  5   5  TotalCharges
summary(imputed_data)
Class: mids
Number of multiple imputations:  5 
Imputation methods:
      customerID           gender    SeniorCitizen          Partner 
              ""               ""               ""               "" 
      Dependents           tenure     PhoneService    MultipleLines 
              ""               ""               ""               "" 
 InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
              ""               ""               ""               "" 
     TechSupport      StreamingTV  StreamingMovies         Contract 
              ""               ""               ""               "" 
PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
              ""               ""               ""             "rf" 
           Churn 
              "" 
PredictorMatrix:
              customerID gender SeniorCitizen Partner Dependents tenure
customerID             0      0             1       0          0      1
gender                 0      0             1       0          0      1
SeniorCitizen          0      0             0       0          0      1
Partner                0      0             1       0          0      1
Dependents             0      0             1       0          0      1
tenure                 0      0             1       0          0      0
              PhoneService MultipleLines InternetService OnlineSecurity
customerID               0             0               0              0
gender                   0             0               0              0
SeniorCitizen            0             0               0              0
Partner                  0             0               0              0
Dependents               0             0               0              0
tenure                   0             0               0              0
              OnlineBackup DeviceProtection TechSupport StreamingTV
customerID               0                0           0           0
gender                   0                0           0           0
SeniorCitizen            0                0           0           0
Partner                  0                0           0           0
Dependents               0                0           0           0
tenure                   0                0           0           0
              StreamingMovies Contract PaperlessBilling PaymentMethod
customerID                  0        0                0             0
gender                      0        0                0             0
SeniorCitizen               0        0                0             0
Partner                     0        0                0             0
Dependents                  0        0                0             0
tenure                      0        0                0             0
              MonthlyCharges TotalCharges Churn
customerID                 1            1     0
gender                     1            1     0
SeniorCitizen              1            1     0
Partner                    1            1     0
Dependents                 1            1     0
tenure                     1            1     0
Number of logged events:  17 
  it im dep     meth           out
1  0  0     constant    customerID
2  0  0     constant        gender
3  0  0     constant       Partner
4  0  0     constant    Dependents
5  0  0     constant  PhoneService
6  0  0     constant MultipleLines
imputed_data$imp$TotalCharges
           1       2       3       4       5
489   595.05  399.45 1777.60 1924.10  554.05
754    19.90 1893.50  393.45  542.40   70.75
937  1240.25 6241.35 3205.60   20.25 1879.25
1083   70.45   19.45   70.80  226.95  369.05
1341  169.45 2773.90 1028.90 3505.10 4719.75
3332 4055.50  535.05 1433.80  289.30  168.65
3827   19.20  193.80 1734.50   94.50 2444.25
4381   20.05 7348.80  295.65  201.10  330.80
5219 1240.15 3231.05  696.80  332.65  318.90
6671   50.15 2391.15 4855.35 1060.60  112.75
6755   45.55  918.70  993.15  270.70  318.90
finished_imputed_data<-complete(imputed_data,1)
head(finished_imputed_data)
  customerID gender SeniorCitizen Partner Dependents tenure PhoneService
1 7590-VHVEG Female             0     Yes         No      1           No
2 5575-GNVDE   Male             0      No         No     34          Yes
3 3668-QPYBK   Male             0      No         No      2          Yes
4 7795-CFOCW   Male             0      No         No     45           No
5 9237-HQITU Female             0      No         No      2          Yes
6 9305-CDSKC Female             0      No         No      8          Yes
     MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection
1 No phone service             DSL             No          Yes               No
2               No             DSL            Yes           No              Yes
3               No             DSL            Yes          Yes               No
4 No phone service             DSL            Yes           No              Yes
5               No     Fiber optic             No           No               No
6              Yes     Fiber optic             No           No              Yes
  TechSupport StreamingTV StreamingMovies       Contract PaperlessBilling
1          No          No              No Month-to-month              Yes
2          No          No              No       One year               No
3          No          No              No Month-to-month              Yes
4         Yes          No              No       One year               No
5          No          No              No Month-to-month              Yes
6          No         Yes             Yes Month-to-month              Yes
              PaymentMethod MonthlyCharges TotalCharges Churn
1          Electronic check          29.85        29.85    No
2              Mailed check          56.95      1889.50    No
3              Mailed check          53.85       108.15   Yes
4 Bank transfer (automatic)          42.30      1840.75    No
5          Electronic check          70.70       151.65   Yes
6          Electronic check          99.65       820.50   Yes
sapply(finished_imputed_data, function(x)sum(is.na(x))) # Checking for each column for missing values and we are good to go
      customerID           gender    SeniorCitizen          Partner 
               0                0                0                0 
      Dependents           tenure     PhoneService    MultipleLines 
               0                0                0                0 
 InternetService   OnlineSecurity     OnlineBackup DeviceProtection 
               0                0                0                0 
     TechSupport      StreamingTV  StreamingMovies         Contract 
               0                0                0                0 
PaperlessBilling    PaymentMethod   MonthlyCharges     TotalCharges 
               0                0                0                0 
           Churn 
               0 
sum(is.na(finished_imputed_data)) #checking again if missing values are present
[1] 0

Convert categorical variables (e.g., Contract, Gender) into numerical

str(finished_imputed_data$gender, finished_imputed_data$Contract) #View the structure of the data
 chr [1:7043] "Female" "Male" "Male" "Male" "Female" "Female" "Male" ...
finished_imputed_data1 <- finished_imputed_data %>%
  mutate(
    gender = as.numeric(factor(gender)),
    Contract = as.numeric(factor(Contract))
  )

str(finished_imputed_data1$Contract)
 num [1:7043] 1 2 1 2 1 1 1 1 1 2 ...
str(finished_imputed_data1$gender) #View the structure of the data ie gender and contract
 num [1:7043] 1 2 2 2 1 1 2 1 1 2 ...

Checking for outliers in the data in MonthlyCharges and TotalCharges

boxplot(finished_imputed_data$MonthlyCharges, col='red', main = "income Distribution")

*checking for outliers using a plot of the boxplot

boxplot(finished_imputed_data$TotalCharges, col='blue', main = "income Distribution")

3. Exploratory Data Analysis (EDA)………………………

Visualize customer distribution across different Contract types.

glimpse(finished_imputed_data$Contract)
 chr [1:7043] "Month-to-month" "One year" "Month-to-month" "One year" ...
View(finished_imputed_data)
contract_counts<-table(finished_imputed_data$Contract)
contract_counts

Month-to-month       One year       Two year 
          3875           1473           1695 

Barplot

barplot(contract_counts, 
        col = "skyblue", 
        main = "Customer Distribution by Contract Type",
        ylab = "Number of Customers",
        xlab = "Contract Type")

Piechart

contract_counts <- table(finished_imputed_data$Contract)
pct <- round(100 * contract_counts / sum(contract_counts), 1)
pie_labels <- paste(names(contract_counts), "-", pct, "%")
pie(contract_counts,
    labels = pie_labels,
    col = c("tomato", "lightgreen", "lightblue"),
    main = "Contract Type Distribution")

Using ggplot2 to plot a bar plot

library(ggplot2)
ggplot(finished_imputed_data, aes(x = Contract, fill = Contract)) +
  geom_bar() +
  labs(title = "Customer Distribution by Contract Type",
       x = "Contract Type", y = "Number of Customers") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

The same for horizontal bar

ggplot(finished_imputed_data, aes(x = Contract, fill = Contract)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Customer Distribution by Contract Type (Horizontal)",
       x = "Contract Type", y = "Number of Customers") +
  theme_classic()

Compare MonthlyCharges and TotalCharges for different customer groups. Comparison by Gender MonthlyCharges vs Gender

ggplot(finished_imputed_data, aes(x = gender, y = MonthlyCharges, fill = gender)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Gender", y = "Monthly Charges") +
  theme_minimal()

TotalCharges vs Gender

ggplot(finished_imputed_data, aes(x = gender, y = TotalCharges, fill = gender)) +
  geom_boxplot() +
  labs(title = "Total Charges by Gender", y = "Total Charges") +
  theme_minimal()

Comparison by InternetService MonthlyCharges vs InternetService

ggplot(finished_imputed_data, aes(x = InternetService, y = MonthlyCharges, fill = InternetService)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Internet Service", y = "Monthly Charges") +
  theme_minimal()

TotalCharges vs InternetService

ggplot(finished_imputed_data, aes(x = InternetService, y = TotalCharges, fill = InternetService)) +
  geom_boxplot() +
  labs(title = "Total Charges by Internet Service", y = "Total Charges") +
  theme_minimal()

Comparison by MultipleLines MonthlyCharges vs MultipleLines

ggplot(finished_imputed_data, aes(x = MultipleLines, y = MonthlyCharges, fill = MultipleLines)) +
  geom_boxplot() +
  labs(title = "Monthly Charges by Multiple Lines", y = "Monthly Charges") +
  theme_minimal()

TotalCharges vs MultipleLines

ggplot(finished_imputed_data, aes(x = MultipleLines, y = TotalCharges, fill = MultipleLines)) +
  geom_boxplot() +
  labs(title = "Total Charges by Multiple Lines", y = "Total Charges") +
  theme_minimal()

Check correlations between numerical variables

names(finished_imputed_data)
 [1] "customerID"       "gender"           "SeniorCitizen"    "Partner"         
 [5] "Dependents"       "tenure"           "PhoneService"     "MultipleLines"   
 [9] "InternetService"  "OnlineSecurity"   "OnlineBackup"     "DeviceProtection"
[13] "TechSupport"      "StreamingTV"      "StreamingMovies"  "Contract"        
[17] "PaperlessBilling" "PaymentMethod"    "MonthlyCharges"   "TotalCharges"    
[21] "Churn"           
str(finished_imputed_data)
'data.frame':   7043 obs. of  21 variables:
 $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
 $ gender          : chr  "Female" "Male" "Male" "Male" ...
 $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Partner         : chr  "Yes" "No" "No" "No" ...
 $ Dependents      : chr  "No" "No" "No" "No" ...
 $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
 $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
 $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
 $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
 $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
 $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
 $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
 $ TechSupport     : chr  "No" "No" "No" "Yes" ...
 $ StreamingTV     : chr  "No" "No" "No" "No" ...
 $ StreamingMovies : chr  "No" "No" "No" "No" ...
 $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
 $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
 $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
 $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
 $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
 $ Churn           : chr  "No" "No" "Yes" "No" ...
str(finished_imputed_data$TotalCharges) #checking again for the structure of Total charges
 num [1:7043] 29.9 1889.5 108.2 1840.8 151.7 ...
numeric_vars <- finished_imputed_data[, c("tenure", "MonthlyCharges", "TotalCharges")] #select numerical variable columns from the data

Correlation matrix of the observations

cor_matrix <- cor(numeric_vars, use = "complete.obs")  # handles NAs
print(cor_matrix) # Compute correlation matrix for the variables
                  tenure MonthlyCharges TotalCharges
tenure         1.0000000      0.2478999    0.8257214
MonthlyCharges 0.2478999      1.0000000    0.6508208
TotalCharges   0.8257214      0.6508208    1.0000000
#Insights from correlation matrix
#Tenure and TotalCharges are highly correlated (makes sense: longer stay → higher total)
#MonthlyCharges and TotalCharges are moderately correlated
#Tenure and MonthlyCharges have a weak correlation

Plot of the correlations matrix

library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", 
         title = "Correlation Matrix", mar = c(0,0,1,0)) #Plot correlation matrix

Use boxplots and histograms to understand spending behavior. Histograms of spending

#for monthly charges
hist(finished_imputed_data$MonthlyCharges,
     col = "skyblue",
     main = "Distribution of Monthly Charges",
     xlab = "Monthly Charges",
     breaks = 30) #Histogram of mnothly spending

#for total charges
hist(finished_imputed_data$TotalCharges,
     col = "lightgreen",
     main = "Distribution of Total Charges",
     xlab = "Total Charges",
     breaks = 30) #Histogram of Total charges

Boxplots to Spot Spending Patterns (Outliers & Group Differences) Monthly Charges by Churn

boxplot(MonthlyCharges ~ Churn, data = finished_imputed_data,
        col = c("lightblue", "tomato"),
        main = "Monthly Charges by Churn Status",
        xlab = "Churn", ylab = "Monthly Charges")

Monthly charges by internet service

boxplot(MonthlyCharges ~ InternetService, data = finished_imputed_data,
        col = c("plum", "lightblue", "salmon"),
        main = "Monthly Charges by Internet Service",
        xlab = "Internet Service", ylab = "Monthly Charges")

##Total Charges by Contract type

boxplot(TotalCharges ~ Contract, data = finished_imputed_data,
        col = c("lightyellow", "orange", "lightgreen"),
        main = "Total Charges by Contract Type",
        xlab = "Contract Type", ylab = "Total Charges")

4. Hypothesis Testing…………………………………. Perform hypothesis tests to validate business assumptions. Hypothesis 1: Do customers with long-term contracts have higher total spending?

Null Hypothesis (H₀): No difference in TotalCharges among contract types. Alternative Hypothesis (H₁): Customers with longer contracts have significantly higher TotalCharges.

finished_imputed_data$Contract <- as.factor(finished_imputed_data$Contract) #Change contract variable into factor
anova_model <- aov(TotalCharges ~ Contract, data = finished_imputed_data) ##We use one way ANOVA since has more than 2 categories
summary(anova_model) ##### We reject the null hypothesis.There is a statistically significant difference in TotalCharges across contract types. This supports the assumption that customers with long-term contracts tend to spend more over time.
              Df    Sum Sq  Mean Sq F value Pr(>F)    
Contract       2 7.520e+09 3.76e+09   923.9 <2e-16 ***
Residuals   7040 2.865e+10 4.07e+06                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(anova_model) #To see which contract types differ
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = TotalCharges ~ Contract, data = finished_imputed_data)

$Contract
                             diff       lwr       upr p adj
One year-Month-to-month 1664.2102 1519.4561 1808.9644     0
Two year-Month-to-month 2341.3880 2203.6736 2479.1025     0
Two year-One year        677.1778  508.7248  845.6309     0
#Interpretations
#All pairwise comparisons (Month-to-month vs One year, Month-to-month vs Two year, One year vs Two year) show statistically significant differences in TotalCharges.
#Customers with longer-term contracts (One year, Two year) generally have higher total spending than those with a Month-to-month contract. 
#The Two year contract has the highest spending, followed by the One year contract, and the Month-to-month contract has the lowest total charges.

Hypothesis 2: Is there a significant difference in spending between male and female customers?

Null Hypothesis (H₀): No difference in MonthlyCharges between male and female customers

Alternative Hypothesis (H₁): Males and females have different spending behaviors Test to use: Independent t-test

t_test_result <- t.test(MonthlyCharges ~ gender, data = finished_imputed_data)
t_test_result

    Welch Two Sample t-test

data:  MonthlyCharges by gender
t = 1.2227, df = 7038.9, p-value = 0.2215
alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
95 percent confidence interval:
 -0.5289426  2.2824640
sample estimates:
mean in group Female   mean in group Male 
            65.20424             64.32748 
#Since the p-value (0.2215) is greater than 0.05, we fail to reject the null hypothesis. This means that there is no statistically significant difference in MonthlyCharges between male and female customers in your dataset.
#The confidence interval further supports this conclusion, as it contains zero, suggesting that the true difference in means could be zero.

5. Regression Analysis………………………………….

Building a Linear Regression Model to Predict MonthlyCharges

Define MonthlyCharges as the dependent variable

Use Tenure, Contract, and Payment Method as independent variables

linear_model <- lm(MonthlyCharges ~ tenure + Contract + PaymentMethod, data = finished_imputed_data)
summary(linear_model)

Call:
lm(formula = MonthlyCharges ~ tenure + Contract + PaymentMethod, 
    data = finished_imputed_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-72.154 -20.211   3.787  19.016  76.913 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           54.445739   0.894149  60.891   <2e-16 ***
tenure                                 0.551490   0.018054  30.547   <2e-16 ***
ContractOne year                     -11.885419   0.912707 -13.022   <2e-16 ***
ContractTwo year                     -22.775408   1.047304 -21.747   <2e-16 ***
PaymentMethodCredit card (automatic)   0.006174   0.935446   0.007    0.995    
PaymentMethodElectronic check         11.288247   0.896915  12.586   <2e-16 ***
PaymentMethodMailed check            -14.685846   0.973991 -15.078   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.89 on 7036 degrees of freedom
Multiple R-squared:  0.2605,    Adjusted R-squared:  0.2599 
F-statistic: 413.1 on 6 and 7036 DF,  p-value: < 2.2e-16

Tenure has a positive and significant effect on MonthlyCharges. As tenure increases, customers tend to pay higher monthly charges.

Customers with longer contracts (one or two years) tend to spend less per month compared to month-to-month contract customers.

Payment method plays a significant role in spending behavior:

Electronic check users tend to spend more per month.

Mailed check users tend to spend less per month.

Credit card (automatic) users don’t show a significant difference in spending compared to other methods.

Model Explanation: The model explains about 26.05% of the variance in MonthlyCharges, which is a moderate level of explanatory power. More features could be added or other techniques (like regularization) could be explored to improve predictive accuracy.