#Part A: Marketing Case

# Load the Segment data (make sure SegmentData.xlsx is in the working directory)
segment_data <- read_excel("SegmentData.xlsx")

# Ensure proper types for categorical variables
segment_data <- segment_data %>%
  mutate(
    gender    = factor(gender),
    ownHome   = factor(ownHome),
    subscribe = factor(subscribe),
    Segment   = factor(Segment)
  )

str(segment_data)
## tibble [300 × 7] (S3: tbl_df/tbl/data.frame)
##  $ age      : num [1:300] 42 38 39 40 32 41 45 37 38 43 ...
##  $ gender   : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 2 1 2 ...
##  $ income   : num [1:300] 64437 65608 45426 63066 45927 ...
##  $ kids     : num [1:300] 1 2 1 0 5 3 1 2 2 4 ...
##  $ ownHome  : Factor w/ 2 levels "ownNo","ownYes": 1 1 1 1 2 2 1 2 2 2 ...
##  $ subscribe: Factor w/ 2 levels "subNo","subYes": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Segment  : Factor w/ 4 levels "Moving up","Suburb mix",..: 2 2 2 2 2 2 2 2 2 2 ...

A.1 – Explorative Data Analysis (EDA)

A.1.1 Summary statistics table for all variables

summary(segment_data)
##       age           gender        income            kids         ownHome   
##  Min.   :18.00   Female:149   Min.   : 10557   Min.   :0.000   ownNo :162  
##  1st Qu.:33.00   Male  :151   1st Qu.: 41275   1st Qu.:0.000   ownYes:138  
##  Median :39.00                Median : 53186   Median :1.000               
##  Mean   :40.56                Mean   : 52071   Mean   :1.283               
##  3rd Qu.:48.00                3rd Qu.: 64236   3rd Qu.:2.000               
##  Max.   :73.00                Max.   :114615   Max.   :7.000               
##   subscribe         Segment   
##  subNo :265   Moving up : 70  
##  subYes: 35   Suburb mix:100  
##               Travelers : 80  
##               Urban hip : 50  
##                               
## 

Write-up (A.1.1):
Comment briefly on the sample: typical ages, incomes, numbers of children, and any obvious differences between segments.

The dataset provides a snapshot of 300 survey respondents from Company XYZ’s customer base, divided into four demographic segments: Suburb mix (the largest group at 100 respondents), Travelers (80), Moving up (70), and Urban hip (the smallest at 50). Overall, the sample appears balanced in terms of gender, with males slightly outnumbering females (50.3% vs. 49.7%), suggesting no strong gender bias in the survey.

Age: Respondents range from young adults (18) to seniors (73), with a mean age of about 41 and a median of 39. The distribution is slightly right-skewed (mean > median) toward older individuals. The interquartile range (IQR) is from 33 (25th percentile) to 48 (75th percentile). So, the middle 50% of ages are spread between 33 and 48 years old. This means that half of the respondents are middle-aged, which may reflect which age group is the prime target for Company XYZ’s marketing.

Income: Income ranges rom USD 10,557 to USD 114,615. Mean is USD 52,071, while median is USD 53,187. This distribution is also right-skewed, suggesting income inequality in the sample, where a smaller group of high-income outliers pulls the mean upward. The IQR is USD 41,275 (25th percentile) to USD 64,236 (75th percentile). So, the middle 50% of respondents earn between USD 41,275 and USD 64,236.

Kids: The number of children is low on average (mean 1.28, median 1), with most respondents having 0–2 kids (IQR). The maximum of 7 is an outlier, but overall, this points to smaller family sizes.

OwnHome: A slight majority (54%) of respondents do not own their homes.

Subscribe:The subscription rate within the sample is very low wih only 35 customers (11.7%) currently subscribing to the offered serevices. This may be a potential challenge for Company XYZ, and may require further investigation why people do not have incentive to subscribe.

Segment: There is an uneven distribution across segments, where Suburb mix (100) is the most represented. It may indicate that this segment is company’s core customer group, or group of interest for this study. On the other hand, Urban hip segment is the smallest group.

In short: Most respondents in the sample are middle-aged (33–48 year old), earn roughly USD 41,000 to USD 64,000, have 0 to 2 kids, are a bit more likely to rent than own, and almost 90% do not subscribe to the service.


A.1.2 Histograms and Boxplots

# Age histogram
ggplot(segment_data, aes(x = age)) +
  geom_histogram(binwidth = 5,
                 boundary = 0,
                 closed = "left",
                 color = "white",
                 fill = "steelblue") +
  labs(title = "Age Distribution",
       x = "Age (years)",
       y = "Count") +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold")
  )

#Income histogram
ggplot(segment_data, aes(x = income)) +
  geom_histogram(binwidth = 5000,
                 boundary = 0,
                 closed = "left",
                 color = "white",
                 fill = "steelblue") +
  labs(title = "Income Distribution",
       x = "Income",
       y = "Count") +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold")
  )

#Number of children histogram
ggplot(segment_data, aes(x = kids)) +
  geom_histogram(binwidth = 1,
                 boundary = -0.5,
                 closed = "left",
                 color = "white",
                 fill = "steelblue") +
  scale_x_continuous(breaks = 0:max(segment_data$kids, na.rm = TRUE)) +
  labs(title = "Number of Children",
       x = "Number of children",
       y = "Count") +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold")
  )

#Aggregate income boxplot 
ggplot(segment_data, aes(y = income)) +
  geom_boxplot(
    color = "black",
    fill  = "steelblue",
    width = 0.3
  ) +
  labs(
    title = "Income Distribution (All Segments)",
    x     = "",
    y     = "Income"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold")
  )

#Boxplot of income by segment
ggplot(segment_data, aes(x = Segment, y = income)) +
  geom_boxplot(
    color = "black",
    fill  = "steelblue"
  ) +
  labs(
    title = "Income Distribution by Segment",
    x     = "Segment",
    y     = "Income"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Write-up (A.1.2):
Comment on outliers and differences in centrality (mean vs. median) based on the histograms and boxplots.

Age Histogram: Mean = 40.56, Median = 39, so the center of the distribution is right around those two values. There are no visible outliers on the histogram. However, if we use the standard Q3+1.5xIQR Rule to find the outliers on the left hand side of the distribution, we find out that any value beyond 48+1.5x15=70.5 is considered an outlier.

Income Histogram: Median (USD 53,186) > Mean (USD 52,071), so the center of the distribution is at the median. There are also visible outliers above USD 90,000 in the right-tail of the distribution, and the highest is USD 114,615.

Kids (“Number of Children”) Histogram: A heavily right-skewed disitribution. A few families with 5, 6, or 7 children appear as small bars on the far right of the histogram. The mean (1.28) is pulled higher than the median (1) because of these large values. The true center of the distribution is best described by the median: 1 child.

Income Distribution (All Segments) Box Plot: There are several outliers, mostly on the high end. IQR is within [USD 41,275, USD 64,236], so outliers then will lie when Income is < USD 18,807.5 or Income > USD 86,703.5.The aggregate boxplot shows dots above ~ USD 90,000, confirming high-income skew.

Income Distribution By Segments Box Plot:

Moving up: The box is reasonably symmetric in age, but for income the median is pulled toward the lower half of the box and a long upper whisker plus a few high outlier dots reveal right-skew: median is preferred for this segment.

Suburb mix: The box is almost perfectly symmetric, the median sits dead centre, and both whiskers are of similar length with virtually no outliers: distribution is close to symmetric, so mean and median are essentially interchangeable.

Travelers: The box is short and positioned high on the plot; the median is near the top of the box while a much longer lower whisker extends downward and a handful of points sit well below the main group: clear left-skew; the median (~ USD 68,000) is therefore considerably more representative than the lower mean (~ USD 66,000).

Urban hip: The median is in the middle but an extremely long upper whisker and a couple of outlier points stretching upward: strongly right-skewed; the median (~ USD 21,000) far better captures the typical income of this young, low-earning group.


A.1.3 Distributional assumptions

Write-up (A.1.3):
Do Income and Age appear approximately normally distributed? What distribution might be appropriate for the variable kids (number of children)?

Age: Mildly right-skewed (upper whisker longer, mean 40.9 > median 39) but fairly close to normal – so can be considered approximately normally distributed for most practical purposes.

Income: It doesn’t appear to be approximately normally distributed. Income distribution is mostly right-skewed overall, and especially skewness in Moving up, Travelers (left-skewed) and Urban hip (very right-skewed). Normality assumption does not hold.

‘Kids’: Poisson distribution.


A.2 – Confidence Intervals

A.2.1 90% CIs for mean Income (overall, by segment, by gender)

#90% CI for Income (overall)
t.test(segment_data$income, conf.level = 0.90)
## 
##  One Sample t-test
## 
## data:  segment_data$income
## t = 45.724, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  50192.24 53950.21
## sample estimates:
## mean of x 
##  52071.23
#90% CI for Income – by segment
income_segment <- segment_data %>%
  group_by(Segment) %>%
  summarise(
    n    = sum(!is.na(income)),
    mean = mean(income, na.rm = TRUE),
    lower = t.test(income, conf.level = 0.90)$conf.int[1],
    upper = t.test(income, conf.level = 0.90)$conf.int[2]
  )

income_segment
## # A tibble: 4 × 5
##   Segment        n   mean  lower  upper
##   <fct>      <int>  <dbl>  <dbl>  <dbl>
## 1 Moving up     70 53004. 51219. 54790.
## 2 Suburb mix   100 55677. 53969. 57385.
## 3 Travelers     80 65953. 62030. 69877.
## 4 Urban hip     50 21343. 20170. 22515.
#90% CI for Income – by gender
income_gender <- segment_data %>%
  group_by(gender) %>%
  summarise(
    n    = sum(!is.na(income)),
    mean = mean(income, na.rm = TRUE),
    lower = t.test(income, conf.level = 0.90)$conf.int[1],
    upper = t.test(income, conf.level = 0.90)$conf.int[2]
  )

income_gender
## # A tibble: 2 × 5
##   gender     n   mean  lower  upper
##   <fct>  <int>  <dbl>  <dbl>  <dbl>
## 1 Female   149 54046. 51689. 56404.
## 2 Male     151 50122. 47204. 53041.

Write-up (A.2.1):
Interpret the confidence intervals and comment on differences between overall, segments, and genders.

We are 90% confident that: The overall population mean income of all customers lies between USD 50,192 and USD 53,950 (point estimate is USD 52,071).

By segment, we are 90% confident that the true average incomes are:

Moving up: [USD 51,219; USD 54,790]

Suburb mix: [USD 53,969; USD 57,385]

Travelers: [USD 62,030; USD 69,877] (clearly the wealthiest segment)

Urban hip: [USD 20,170; USD 22,515] (dramatically lower than all others)

By gender, there is almost no difference:

Females: [USD 51,689; USD 56,404]

Males: [USD 47,204; USD 53,041]


A.2.2 90% CI for subscriber proportion

#Overall 90% CI for proportion of subscribers
prop_overall <- segment_data %>%
  summarise(
    n           = n(),
    subscribers = sum(subscribe == "subYes"),
    p_hat       = subscribers / n,
    lower       = prop.test(subscribers, n, conf.level = 0.90)$conf.int[1],
    upper       = prop.test(subscribers, n, conf.level = 0.90)$conf.int[2]
  )

prop_overall
## # A tibble: 1 × 5
##       n subscribers p_hat  lower upper
##   <int>       <int> <dbl>  <dbl> <dbl>
## 1   300          35 0.117 0.0881 0.152
#Proportion of subscribers by segment (with 90% CIs)
prop_segment <- segment_data %>%
  group_by(Segment) %>%
  summarise(
    n           = n(),
    subscribers = sum(subscribe == "subYes"),
    p_hat       = subscribers / n,
    lower       = prop.test(subscribers, n, conf.level = 0.90)$conf.int[1],
    upper       = prop.test(subscribers, n, conf.level = 0.90)$conf.int[2],
    .groups = "drop"
  )

prop_segment
## # A tibble: 4 × 6
##   Segment        n subscribers p_hat   lower  upper
##   <fct>      <int>       <int> <dbl>   <dbl>  <dbl>
## 1 Moving up     70          15 0.214 0.139   0.313 
## 2 Suburb mix   100          10 0.1   0.0569  0.166 
## 3 Travelers     80           2 0.025 0.00534 0.0816
## 4 Urban hip     50           8 0.16  0.0851  0.274
#Segment with highest subscription proportion
best_segment <- prop_segment %>%
  slice_max(p_hat, n = 1)

best_segment
## # A tibble: 1 × 6
##   Segment       n subscribers p_hat lower upper
##   <fct>     <int>       <int> <dbl> <dbl> <dbl>
## 1 Moving up    70          15 0.214 0.139 0.313

Write-up (A.2.2):
Interpret the overall CI for the subscription proportion and the CI for the segment with the highest proportion.

We are 90% confident that the true population proportion of subscribers lies between 8.8% and 15.2% (point estimate 11.7%).

The segment with the highest subscription rate is Moving up (observed 21.4%).

We are 90% confident that the true subscription rate in the Moving up segment is between 13.9% and 31.3%.


A.2.3 90% CI for σ of Age

age <- segment_data$age

#Sample size and sample SD
n <- sum(!is.na(age))
s <- sd(age, na.rm = TRUE)

#Confidence level
alpha <- 0.10   # for 90% CI

#Chi-square critical values
chi_low  <- qchisq(1 - alpha/2, df = n - 1)  # upper quantile
chi_high <- qchisq(alpha/2,     df = n - 1)  # lower quantile

#CI for variance
lower_var <- (n - 1) * s^2 / chi_low
upper_var <- (n - 1) * s^2 / chi_high

#CI for standard deviation
lower_sd <- sqrt(lower_var)
upper_sd <- sqrt(upper_var)

c(n = n,
  sample_sd = s,
  lower_sd_90 = lower_sd,
  upper_sd_90 = upper_sd)
##           n   sample_sd lower_sd_90 upper_sd_90 
##   300.00000    12.68318    11.88817    13.60262

Write-up (A.2.3):
Report and interpret the CI for the population standard deviation of Age.
Discuss whether the assumptions (normality, independence) are reasonable.

We are 90% confident that the true population standard deviation of age lies between 11.9 and 13.6 years (sample SD = 12.68 years).

Are the assumptions met? No, because the confidence interval for σ assumes that age data is normally distributed. Since age is only mildly right-skewed, the assumption is not fully satisfied. The interval, however, is still acceptable for practical purposes but not perfectly valid.


A.3 – Sample Size and Margin of Error

z <- qnorm(1 - 0.10/2)  #for 90% CI
p <- 0.5                #conservative choice

ME1 <- 0.02
ME2 <- 0.01

n_ME1 <- ceiling(z^2 * p * (1 - p) / ME1^2)
n_ME2 <- ceiling(z^2 * p * (1 - p) / ME2^2)

c(ME_2pct = n_ME1,
  ME_1pct = n_ME2)
## ME_2pct ME_1pct 
##    1691    6764

Write-up (A.3.1):
Compare the required sample sizes for ME = 2% and ME = 1%. Is the cost of improving the ME worth it?

To estimate any population proportion with a ±2% margin of error: need n = 1,691 respondents. To improve precision to ±1% margin of error: need n = 6,764 respondents.

Comparison: Going from ±2% to ±1% precision requires exactly four times as many respondents (6,764 ÷ 1,691 ≈ 4).

Whether this is worth it depends on budget: quadrupling the sample size is usually very expensive, so ±2% is often considered sufficient for marketing decisions, whereas ±1% is only justified for high-stakes campaigns.


A.4 – CI for Difference in Means

traveler_income <- segment_data %>%
  filter(Segment == "Travelers") %>%
  pull(income)

urban_income <- segment_data %>%
  filter(Segment == "Urban hip") %>%
  pull(income)

#Two-sample 90% CI for difference in means
tt_income <- t.test(traveler_income,
                    urban_income,
                    conf.level = 0.90)

tt_income
## 
##  Welch Two Sample t-test
## 
## data:  traveler_income and urban_income
## t = 18.141, df = 92.356, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##  40524.9 48696.6
## sample estimates:
## mean of x mean of y 
##  65953.49  21342.74

Write-up (A.4.1):
Interpret the CI for Mean_Travelers − Mean_UrbanHip. Does it suggest a meaningful difference?

We are 90% confident that the true difference in average income (Travelers − Urban hip) lies between USD 40,524 and USD 48,697 per year.

Interpretation: The interval is entirely positive and far from zero, so there is extremely strong evidence of a large and meaningful income difference. Travelers earn, on average, around USD 44,000 more per year than Urban hip customers. —

Part B: Global Challenge Case (Life Expectancy)

B.1.1 – EDA

led_data <- read_excel("LED.xlsx", sheet = "Values")
summary(led_data)
##    Country            Developed         HIV/AIDS      Adult Mortality 
##  Length:156         Min.   :0.0000   Min.   : 0.100   Min.   : 19.93  
##  Class :character   1st Qu.:0.0000   1st Qu.: 0.100   1st Qu.: 76.43  
##  Mode  :character   Median :0.0000   Median : 0.100   Median :144.27  
##                     Mean   :0.1806   Mean   : 1.756   Mean   :162.61  
##                     3rd Qu.:0.0000   3rd Qu.: 1.097   3rd Qu.:218.03  
##                     Max.   :1.0000   Max.   :24.554   Max.   :554.47  
##                                      NA's   :1        NA's   :1       
##    Schooling      Income composition of resources Life Expectancy
##  Min.   : 3.992   Min.   :0.2946                  Min.   :53.30  
##  1st Qu.: 9.883   1st Qu.:0.4934                  1st Qu.:68.95  
##  Median :12.433   Median :0.6931                  Median :74.50  
##  Mean   :12.066   Mean   :0.6538                  Mean   :73.59  
##  3rd Qu.:14.159   3rd Qu.:0.7842                  3rd Qu.:78.95  
##  Max.   :20.086   Max.   :0.9303                  Max.   :85.30  
##  NA's   :1        NA's   :1                       NA's   :1      
##      Income       Alcohol Consumption Infants death     Smooking        
##  Min.   :   628   Min.   : 0.003      Min.   : 1.60   Length:156        
##  1st Qu.:  4160   1st Qu.: 2.195      1st Qu.: 5.70   Class :character  
##  Median : 13200   Median : 5.540      Median :13.40   Mode  :character  
##  Mean   : 19077   Mean   : 5.790      Mean   :20.95                     
##  3rd Qu.: 28200   3rd Qu.: 8.935      3rd Qu.:32.50                     
##  Max.   :116000   Max.   :20.500      Max.   :81.00                     
##  NA's   :1        NA's   :1           NA's   :1                         
##     Measles            BMI Men        BMI Women       Population       
##  Min.   :    0.00   Min.   :20.30   Min.   :20.60   Min.   :9.790e+04  
##  1st Qu.:    0.01   1st Qu.:22.60   1st Qu.:23.70   1st Qu.:2.800e+06  
##  Median :    0.44   Median :25.40   Median :25.90   Median :9.540e+06  
##  Mean   :  345.18   Mean   :24.87   Mean   :25.63   Mean   :4.286e+07  
##  3rd Qu.:   43.70   3rd Qu.:26.70   3rd Qu.:27.10   3rd Qu.:3.045e+07  
##  Max.   :17100.00   Max.   :31.00   Max.   :34.30   Max.   :1.440e+09  
##  NA's   :1          NA's   :1       NA's   :1       NA's   :1          
##  Polio Vaccine   Maternal Death    CO2 Emission     Blood Preasure Men
##  Min.   :0.420   Min.   :   2.0   Min.   : 0.0467   Min.   :123.0     
##  1st Qu.:0.850   1st Qu.:  12.0   1st Qu.: 0.6315   1st Qu.:129.0     
##  Median :0.930   Median :  52.0   Median : 2.5300   Median :131.0     
##  Mean   :0.885   Mean   : 154.3   Mean   : 4.4837   Mean   :131.2     
##  3rd Qu.:0.970   3rd Qu.: 190.5   3rd Qu.: 5.9700   3rd Qu.:134.0     
##  Max.   :0.990   Max.   :1140.0   Max.   :38.0000   Max.   :139.0     
##  NA's   :1       NA's   :1        NA's   :1         NA's   :1         
##  Blood presure Women
##  Min.   :117.0      
##  1st Qu.:123.5      
##  Median :126.0      
##  Mean   :126.8      
##  3rd Qu.:130.0      
##  Max.   :136.0      
##  NA's   :1
#Outlier table for numeric variables
num_data <- led_data[ , sapply(led_data, is.numeric)]

outlier_fun <- function(x) {
  x <- x[!is.na(x)]
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR
  upper <- Q3 + 1.5 * IQR
  n_out <- sum(x < lower | x > upper)
  c(Q1 = Q1, Q3 = Q3, IQR = IQR,
    lower_bound = lower, upper_bound = upper,
    n_outliers = n_out)
}

outlier_table <- t(sapply(num_data, outlier_fun))
outlier_table
##                                       Q1.25%       Q3.75%      IQR.75%
## Developed                       0.000000e+00 0.000000e+00 0.000000e+00
## HIV/AIDS                        1.000000e-01 1.096667e+00 9.966667e-01
## Adult Mortality                 7.643333e+01 2.180333e+02 1.416000e+02
## Schooling                       9.883333e+00 1.415889e+01 4.275556e+00
## Income composition of resources 4.934333e-01 7.841778e-01 2.907444e-01
## Life Expectancy                 6.895000e+01 7.895000e+01 1.000000e+01
## Income                          4.160000e+03 2.820000e+04 2.404000e+04
## Alcohol Consumption             2.195000e+00 8.935000e+00 6.740000e+00
## Infants death                   5.700000e+00 3.250000e+01 2.680000e+01
## Measles                         1.000000e-02 4.370000e+01 4.369000e+01
## BMI Men                         2.260000e+01 2.670000e+01 4.100000e+00
## BMI Women                       2.370000e+01 2.710000e+01 3.400000e+00
## Population                      2.800000e+06 3.045000e+07 2.765000e+07
## Polio Vaccine                   8.500000e-01 9.700000e-01 1.200000e-01
## Maternal Death                  1.200000e+01 1.905000e+02 1.785000e+02
## CO2 Emission                    6.315000e-01 5.970000e+00 5.338500e+00
## Blood Preasure Men              1.290000e+02 1.340000e+02 5.000000e+00
## Blood presure Women             1.235000e+02 1.300000e+02 6.500000e+00
##                                 lower_bound.25% upper_bound.75% n_outliers
## Developed                          0.000000e+00    0.000000e+00         29
## HIV/AIDS                          -1.395000e+00    2.591667e+00         23
## Adult Mortality                   -1.359667e+02    4.304333e+02          3
## Schooling                          3.470000e+00    2.057222e+01          0
## Income composition of resources    5.731667e-02    1.220294e+00          0
## Life Expectancy                    5.395000e+01    9.395000e+01          1
## Income                            -3.190000e+04    6.426000e+04          7
## Alcohol Consumption               -7.915000e+00    1.904500e+01          1
## Infants death                     -3.450000e+01    7.270000e+01          3
## Measles                           -6.552500e+01    1.092350e+02         29
## BMI Men                            1.645000e+01    3.285000e+01          0
## BMI Women                          1.860000e+01    3.220000e+01          2
## Population                        -3.867500e+07    7.192500e+07         14
## Polio Vaccine                      6.700000e-01    1.150000e+00          9
## Maternal Death                    -2.557500e+02    4.582500e+02         17
## CO2 Emission                      -7.376250e+00    1.397775e+01         11
## Blood Preasure Men                 1.215000e+02    1.415000e+02          0
## Blood presure Women                1.137500e+02    1.397500e+02          0
#Correlation matrix for numeric variables
num_only <- led_data[ , sapply(led_data, is.numeric)]

cor_matrix <- cor(num_only, use = "complete.obs")
round(cor_matrix, 2)
##                                 Developed HIV/AIDS Adult Mortality Schooling
## Developed                            1.00    -0.18           -0.40      0.56
## HIV/AIDS                            -0.18     1.00            0.77     -0.26
## Adult Mortality                     -0.40     0.77            1.00     -0.61
## Schooling                            0.56    -0.26           -0.61      1.00
## Income composition of resources      0.57    -0.40           -0.73      0.93
## Life Expectancy                      0.52    -0.51           -0.80      0.80
## Income                               0.60    -0.26           -0.55      0.65
## Alcohol Consumption                  0.55    -0.03           -0.19      0.56
## Infants death                       -0.43     0.45            0.72     -0.82
## Measles                             -0.10     0.07            0.13     -0.20
## BMI Men                              0.35    -0.38           -0.63      0.72
## BMI Women                           -0.01    -0.16           -0.37      0.48
## Population                          -0.07    -0.05           -0.05     -0.08
## Polio Vaccine                        0.24    -0.20           -0.38      0.45
## Maternal Death                      -0.31     0.40            0.62     -0.71
## CO2 Emission                         0.23    -0.20           -0.37      0.47
## Blood Preasure Men                   0.14     0.24            0.25      0.04
## Blood presure Women                 -0.25     0.42            0.54     -0.41
##                                 Income composition of resources Life Expectancy
## Developed                                                  0.57            0.52
## HIV/AIDS                                                  -0.40           -0.51
## Adult Mortality                                           -0.73           -0.80
## Schooling                                                  0.93            0.80
## Income composition of resources                            1.00            0.89
## Life Expectancy                                            0.89            1.00
## Income                                                     0.76            0.70
## Alcohol Consumption                                        0.49            0.37
## Infants death                                             -0.88           -0.87
## Measles                                                   -0.21           -0.19
## BMI Men                                                    0.78            0.64
## BMI Women                                                  0.51            0.36
## Population                                                -0.06            0.00
## Polio Vaccine                                              0.47            0.54
## Maternal Death                                            -0.79           -0.74
## CO2 Emission                                               0.58            0.47
## Blood Preasure Men                                        -0.04           -0.17
## Blood presure Women                                       -0.50           -0.56
##                                 Income Alcohol Consumption Infants death
## Developed                         0.60                0.55         -0.43
## HIV/AIDS                         -0.26               -0.03          0.45
## Adult Mortality                  -0.55               -0.19          0.72
## Schooling                         0.65                0.56         -0.82
## Income composition of resources   0.76                0.49         -0.88
## Life Expectancy                   0.70                0.37         -0.87
## Income                            1.00                0.36         -0.60
## Alcohol Consumption               0.36                1.00         -0.39
## Infants death                    -0.60               -0.39          1.00
## Measles                          -0.14               -0.03          0.24
## BMI Men                           0.55                0.29         -0.70
## BMI Women                         0.27                0.01         -0.47
## Population                       -0.06               -0.01          0.01
## Polio Vaccine                     0.37                0.21         -0.59
## Maternal Death                   -0.49               -0.28          0.87
## CO2 Emission                      0.76                0.16         -0.46
## Blood Preasure Men               -0.01                0.36          0.08
## Blood presure Women              -0.39               -0.06          0.48
##                                 Measles BMI Men BMI Women Population
## Developed                         -0.10    0.35     -0.01      -0.07
## HIV/AIDS                           0.07   -0.38     -0.16      -0.05
## Adult Mortality                    0.13   -0.63     -0.37      -0.05
## Schooling                         -0.20    0.72      0.48      -0.08
## Income composition of resources   -0.21    0.78      0.51      -0.06
## Life Expectancy                   -0.19    0.64      0.36       0.00
## Income                            -0.14    0.55      0.27      -0.06
## Alcohol Consumption               -0.03    0.29      0.01      -0.01
## Infants death                      0.24   -0.70     -0.47       0.01
## Measles                            1.00   -0.28     -0.27       0.60
## BMI Men                           -0.28    1.00      0.85      -0.19
## BMI Women                         -0.27    0.85      1.00      -0.21
## Population                         0.60   -0.19     -0.21       1.00
## Polio Vaccine                     -0.21    0.29      0.19       0.02
## Maternal Death                     0.26   -0.67     -0.50      -0.01
## CO2 Emission                      -0.12    0.51      0.37      -0.03
## Blood Preasure Men                -0.11    0.00     -0.06      -0.22
## Blood presure Women                0.08   -0.37     -0.24      -0.10
##                                 Polio Vaccine Maternal Death CO2 Emission
## Developed                                0.24          -0.31         0.23
## HIV/AIDS                                -0.20           0.40        -0.20
## Adult Mortality                         -0.38           0.62        -0.37
## Schooling                                0.45          -0.71         0.47
## Income composition of resources          0.47          -0.79         0.58
## Life Expectancy                          0.54          -0.74         0.47
## Income                                   0.37          -0.49         0.76
## Alcohol Consumption                      0.21          -0.28         0.16
## Infants death                           -0.59           0.87        -0.46
## Measles                                 -0.21           0.26        -0.12
## BMI Men                                  0.29          -0.67         0.51
## BMI Women                                0.19          -0.50         0.37
## Population                               0.02          -0.01        -0.03
## Polio Vaccine                            1.00          -0.54         0.30
## Maternal Death                          -0.54           1.00        -0.41
## CO2 Emission                             0.30          -0.41         1.00
## Blood Preasure Men                      -0.01           0.13        -0.01
## Blood presure Women                     -0.19           0.44        -0.22
##                                 Blood Preasure Men Blood presure Women
## Developed                                     0.14               -0.25
## HIV/AIDS                                      0.24                0.42
## Adult Mortality                               0.25                0.54
## Schooling                                     0.04               -0.41
## Income composition of resources              -0.04               -0.50
## Life Expectancy                              -0.17               -0.56
## Income                                       -0.01               -0.39
## Alcohol Consumption                           0.36               -0.06
## Infants death                                 0.08                0.48
## Measles                                      -0.11                0.08
## BMI Men                                       0.00               -0.37
## BMI Women                                    -0.06               -0.24
## Population                                   -0.22               -0.10
## Polio Vaccine                                -0.01               -0.19
## Maternal Death                                0.13                0.44
## CO2 Emission                                 -0.01               -0.22
## Blood Preasure Men                            1.00                0.75
## Blood presure Women                           0.75                1.00

Write-up (B.1.1):
Comment on outliers and on key patterns in the correlation matrix (especially correlations with Life Expectancy, and multicollinearity).

DATA SUMMARY The data reveal global disparities in health and economy, with Life Expectancy averaging 70 years (range 53–83). Developing countries predominate (mean Developed=0.18). Right-skew is evident in risks like HIV/AIDS (mean 7.64 > median 0.4), Adult Mortality (218 > 163), and Population (28.6M > 4.3M), indicating tails toward worse outcomes; Income also skews right (mean USD 19,000 > median USD 13,000), signaling inequality.

OUTLIERS Outliers were identified using the IQR method (values < Q1 - 1.5IQR or > Q3 + 1.5IQR) and are provided in the table. Note: developed has 29 “upper” outliers, but as a binary variable, this is misleading—treat it as categorical, no true outliers.

CORRELATIONS: There were noticeable patterns. Strong positive correlations among positive factors: e.g., Life Expectancy with Income Composition of Resources (the breakdown of a nation’s total income into different resources and how efficiently government allocate and use resources) (0.89), Schooling (0.80), Income (0.70), etc. These variables tend to increase together with Life Expectancy. A high positive correlation (e.g., 0.89) means that once you know a country’s Income Composition, you can predict LE reasonably well.

Strong negative correlations with risks: Life Expectancy with Adult Mortality (-0.80), HIV/AIDS (-0.51), Infants Death (-0.87), Maternal Death (-0.74). As mortality, infant deaths, maternal deaths, and HIV prevalence increase, life expectancy falls.Strong negative values mean the variables move in opposite directions in a predictable pattern.

Moderate: Alcohol Consumption (0.37), Polio Vaccine (0.47), CO2 Emission (0.41). There is some association, but it’s weaker. These often act as development proxies: richer countries drink more alcohol, vaccinate more, and produce more emissions — and also have higher LE.

Weak or near-zero correlations: LE & Smoking: 0.12 LE & Population: –0.06 There is almost no linear relationship. Population size does not predict life expectancy. Smoking varies across countries in non-linear ways.

The matrix reveals clear patterns: Life Expectancy has a strong positive relationship with economic/social development (e.g., Schooling, Income) and negatively with mortality/HIV risks.

Potential multicollinearity issues: high correlations between predictors like Schooling and Income Composition (0.93), Adult Mortality and HIV/AIDS (0.77), BMI Men/Women (0.99—near perfect), Infants Death and Maternal Death (0.87). High correlations among predictors matter because they make it statistically difficult for the model to separate each variable’s independent effect on life expectancy.


B.1.2 – Specification and Estimation

First MLR with five independent variables

m1 <- lm(`Life Expectancy` ~ Developed +
           `HIV/AIDS` +
           `Income composition of resources` +
           `Alcohol Consumption` +
           `Polio Vaccine`,
         data = led_data)

summary(m1)
## 
## Call:
## lm(formula = `Life Expectancy` ~ Developed + `HIV/AIDS` + `Income composition of resources` + 
##     `Alcohol Consumption` + `Polio Vaccine`, data = led_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5478 -1.5350  0.0748  1.7350  7.8782 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       46.76097    1.82686  25.596  < 2e-16 ***
## Developed                          1.16910    0.77613   1.506    0.134    
## `HIV/AIDS`                        -0.27841    0.05995  -4.644 7.45e-06 ***
## `Income composition of resources` 30.63928    2.06136  14.864  < 2e-16 ***
## `Alcohol Consumption`             -0.10250    0.07047  -1.454    0.148    
## `Polio Vaccine`                    8.66321    2.13606   4.056 8.03e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.844 on 149 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.8349, Adjusted R-squared:  0.8293 
## F-statistic: 150.7 on 5 and 149 DF,  p-value: < 2.2e-16

Write-up (B.1.2 – first model):
For each variable, comment on (i) significance (one-sided alternatives), (ii) sign (is it as expected?), and (iii) magnitude.

The first MLR model explains 83.5% of variation in Life Expectancy (adj R²=83.2%), with overall high significance (F=150.7, p<2e-16).

Developed: (i) Not significant (one-sided p=0.134 > 0.05). (ii) Positive sign as expected (developed countries have higher LE). (iii) Magnitude small: +1.17 years for developed status.

HIV/AIDS: (i) Highly significant (one-sided p=7.45e-06 <0.001). (ii) Negative sign as expected (higher HIV prevalence reduces LE). (iii) Moderate magnitude: each unit increase reduces LE by 0.28 years.

Income composition of resources: (i) Highly significant (one-sided p<2e-16 <0.001). (ii) Positive as expected (better resource use boosts LE). (iii) Large magnitude: each 0.1 unit increase adds ~3.06 years.

Alcohol Consumption: (i) Not significant (one-sided p=0.148 >0.05, since unexpected positive t). (ii) Negative but close to zero, as expected for health risk. (iii) Small magnitude: negligible -0.10 years per unit.

Polio Vaccine: (i) Highly significant (one-sided p=4.02e-05 <0.001). (ii) Positive as expected (better immunization increases LE). (iii) Large magnitude: each 0.1 coverage increase adds ~0.87 years.


Best single predictor (highest R²) among all variables

vars <- c("Developed", "HIV/AIDS", "Adult Mortality", "Schooling",
          "Income composition of resources", "Income", "Alcohol Consumption",
          "Infants death", "Smooking", "Measles", "BMI Men", "BMI Women",
          "Population", "Polio Vaccine", "Maternal Death", "CO2 Emission",
          "Blood Preasure Men", "Blood presure Women")

r2_list <- sapply(vars, function(v) {
  summary(lm(led_data$`Life Expectancy` ~ led_data[[v]]))$r.squared
})

best_idx <- which.max(r2_list)
best_var <- vars[best_idx]
best_r2  <- r2_list[best_idx]

best_var
## [1] "Income composition of resources"
best_r2
## Income composition of resources 
##                       0.7848303

Write-up (B.1.2 – WHO questions):
Which variable is most effective in increasing LE (based on the 5-variable MLR)?

  • The most effective variable for increasing LE in the 5-variable MLR is Income composition of resources, with the largest positive coefficient (30.64) and high significance—improving resource allocation yields the biggest LE gains.

Which variable has the best predictive power in terms of R² (from the simple regressions)?

  • The variable with the best predictive power (highest R² from simple regressions) is Income composition of resources (R²=0.785), explaining ~79% of LE variation alone.

Should WHO be worried about LE in non-developed vs. developed countries?

  • Yes, absolutely. Even after controlling for HIV/AIDS, income composition, alcohol, and polio vaccination, being a developed country adds approximately 1.2 (the intercept) years of life expectancy (though marginally insignificant in this small model).The WHO should continue to prioritize health system strengthening, education, and economic equity in developing nations.

Full MLR with all 18 independent variables

#Ensure Smooking is numeric
led_data$Smooking <- as.numeric(led_data$Smooking)

m_full <- lm(`Life Expectancy` ~ . - Country, data = led_data)
summary(m_full)
## 
## Call:
## lm(formula = `Life Expectancy` ~ . - Country, data = led_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2954 -1.0584  0.0381  1.0803  4.7854 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        9.950e+01  1.026e+01   9.699 1.67e-15 ***
## Developed                          3.643e-03  8.175e-01   0.004 0.996454    
## `HIV/AIDS`                         1.568e-01  9.528e-02   1.645 0.103533    
## `Adult Mortality`                 -2.732e-02  5.669e-03  -4.820 6.04e-06 ***
## Schooling                          2.960e-02  2.367e-01   0.125 0.900784    
## `Income composition of resources`  2.042e+01  7.140e+00   2.859 0.005315 ** 
## Income                             5.722e-06  2.963e-05   0.193 0.847314    
## `Alcohol Consumption`             -2.102e-02  7.501e-02  -0.280 0.779950    
## `Infants death`                   -1.451e-01  3.901e-02  -3.719 0.000354 ***
## Smooking                           7.151e-01  2.943e+00   0.243 0.808563    
## Measles                            3.000e-05  1.479e-04   0.203 0.839757    
## `BMI Men`                         -2.679e-01  3.266e-01  -0.820 0.414310    
## `BMI Women`                       -1.783e-01  2.339e-01  -0.762 0.447857    
## Population                        -1.529e-09  1.519e-09  -1.007 0.316798    
## `Polio Vaccine`                   -1.469e-01  3.156e+00  -0.047 0.962977    
## `Maternal Death`                   4.278e-03  2.415e-03   1.771 0.080025 .  
## `CO2 Emission`                    -5.178e-02  6.163e-02  -0.840 0.403082    
## `Blood Preasure Men`              -1.045e-01  1.305e-01  -0.801 0.425550    
## `Blood presure Women`             -6.358e-02  1.136e-01  -0.560 0.577172    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.112 on 87 degrees of freedom
##   (50 observations deleted due to missingness)
## Multiple R-squared:  0.9057, Adjusted R-squared:  0.8862 
## F-statistic: 46.42 on 18 and 87 DF,  p-value: < 2.2e-16
coefs_full <- summary(m_full)$coefficients
pvals <- coefs_full[-1, 4]  # exclude intercept

#Number of significant variables at 5% level
sum(pvals < 0.05)
## [1] 3
#Names of significant variables
names(pvals)[pvals < 0.05]
## [1] "`Adult Mortality`"                 "`Income composition of resources`"
## [3] "`Infants death`"
#R-squared comparison
summary(m1)$r.squared
## [1] 0.8348584
summary(m1)$adj.r.squared
## [1] 0.8293167
summary(m_full)$r.squared
## [1] 0.9056998
summary(m_full)$adj.r.squared
## [1] 0.8861895

Write-up (B.1.2 – full model):
Test the overall significance (F-test) of the full model.

  • The full model with all 18 predictors is highly significant (F = 46.42 on 18 and 87 df, p-value < 2.2e-16). We strongly reject the null hypothesis that all coefficients (except the intercept) are zero. The set of variables jointly explains a very large portion of the variation in Life Expectancy.

How many variables are significant at 5%? Did any variables from the first model drop out?

  • Only 3 variables are statistically significant at the 5% level in the full model: Adult Mortality, Income composition of resources, and Infants death. From the original 5-variable model, Developed, HIV/AIDS, Alcohol Consumption, and Polio Vaccine are no longer significant at 5% once the other predictors are added — they “dropped out” in terms of statistical significance.

Compare R² and adjusted R² between the first and full models. What do you conclude?

  • The full model explains an additional ~7% of the variation in Life Expectancy (R² rises from 83.5% (5 var model) to 90.6% (full model)) and the adjusted R² also improves markedly (82.9% → 88.6%). This indicates that the added variables contribute genuine explanatory power and are not simply inflating R² due to overfitting.

B.1.3 – Predictions

#Predicted LE for an "average" country
avg_vals <- led_data %>%
  summarise(
    Developed                        = mean(Developed, na.rm = TRUE),
    `HIV/AIDS`                       = mean(`HIV/AIDS`, na.rm = TRUE),
    `Income composition of resources` = mean(`Income composition of resources`, na.rm = TRUE),
    `Alcohol Consumption`            = mean(`Alcohol Consumption`, na.rm = TRUE),
    `Polio Vaccine`                  = mean(`Polio Vaccine`, na.rm = TRUE)
  )

avg_vals
## # A tibble: 1 × 5
##   Developed `HIV/AIDS` `Income composition of resources` `Alcohol Consumption`
##       <dbl>      <dbl>                             <dbl>                 <dbl>
## 1     0.181       1.76                             0.654                  5.79
## # ℹ 1 more variable: `Polio Vaccine` <dbl>
pred_LE_avg <- predict(m1, newdata = avg_vals)
pred_LE_avg
##        1 
## 73.58903
#Country with highest observed LE
idx_max_LE <- which.max(led_data$`Life Expectancy`)
led_data[idx_max_LE, c("Country", "Life Expectancy")]
## # A tibble: 1 × 2
##   Country   `Life Expectancy`
##   <chr>                 <dbl>
## 1 Singapore              85.3
#Predicted LE for that country
pred_LE_max_country <- predict(m1, newdata = led_data[idx_max_LE, ])
pred_LE_max_country
##        1 
## 82.45459

Write-up (B.1.3):
- Report the predicted LE for an “average” country.
- Identify the country with the highest observed LE and its observed value. - Report the predicted LE for that country and comment on the difference vs observed.

The predicted life expectancy for an “average” country (based on mean values of the predictors: Developed=0.181, HIV/AIDS=1.76, Income composition=0.654, Alcohol Consumption=5.79, Polio Vaccine=0.884) is 73.6 years.

The country with the highest observed life expectancy is Singapore, with an observed value of 85.3 years. The predicted life expectancy for Singapore is 82.5 years.

Comment on difference: - The model under-predicts by about 2.8 years (observed 85.3 vs predicted 82.5), a reasonable residual given the model’s R²~83%—it captures broad trends but misses country-specific factors.


Part C: Credit Risk Case – Logistic Regression

C.1 – Load Credit Data and Split Train/Test

credit_data <- read_excel("CreditData.xlsx")

#Rename dependent variable
str(credit_data)
## tibble [30,000 × 18] (S3: tbl_df/tbl/data.frame)
##  $ LIMIT_BAL      : num [1:30000] -1.137 -0.366 -0.597 -0.905 -0.905 ...
##  $ SEX            : num [1:30000] 0.81 0.81 0.81 0.81 -1.23 ...
##  $ EDUCATION      : num [1:30000] 0.212 0.212 0.212 0.212 0.212 ...
##  $ MARRIAGE       : num [1:30000] -1.069 0.849 0.849 -1.069 -1.069 ...
##  $ PAY_0          : num [1:30000] 1.7945 -0.875 0.0149 0.0149 -0.875 ...
##  $ PAY_2          : num [1:30000] 1.782 1.782 0.112 0.112 0.112 ...
##  $ PAY_3          : num [1:30000] -0.697 0.139 0.139 0.139 -0.697 ...
##  $ PAY_4          : num [1:30000] -0.667 0.189 0.189 0.189 0.189 ...
##  $ PAY_5          : num [1:30000] -1.53 0.235 0.235 0.235 0.235 ...
##  $ PAY_6          : num [1:30000] -1.486 1.992 0.253 0.253 0.253 ...
##  $ BILL_AMT1      : num [1:30000] -0.6425 -0.6592 -0.2986 -0.0575 -0.5786 ...
##  $ PAY_AMT1       : num [1:30000] -0.342 -0.342 -0.25 -0.221 -0.221 ...
##  $ PAY_AMT2       : num [1:30000] -0.227 -0.214 -0.192 -0.169 1.335 ...
##  $ PAY_AMT3       : num [1:30000] -0.297 -0.24 -0.24 -0.229 0.271 ...
##  $ PAY_AMT4       : num [1:30000] -0.308 -0.244 -0.244 -0.238 0.266 ...
##  $ PAY_AMT5       : num [1:30000] -0.314 -0.314 -0.249 -0.244 -0.269 ...
##  $ PAY_AMT6       : num [1:30000] -0.2934 -0.1809 -0.0121 -0.2371 -0.2552 ...
##  $ default_payment: num [1:30000] 1 1 0 0 0 0 0 0 0 0 ...
names(credit_data)
##  [1] "LIMIT_BAL"       "SEX"             "EDUCATION"       "MARRIAGE"       
##  [5] "PAY_0"           "PAY_2"           "PAY_3"           "PAY_4"          
##  [9] "PAY_5"           "PAY_6"           "BILL_AMT1"       "PAY_AMT1"       
## [13] "PAY_AMT2"        "PAY_AMT3"        "PAY_AMT4"        "PAY_AMT5"       
## [17] "PAY_AMT6"        "default_payment"
colnames(credit_data)[ncol(credit_data)] <- "DefaultPayment"
table(credit_data$DefaultPayment)
## 
##     0     1 
## 23364  6636
set.seed(123) 

n <- nrow(credit_data)
train_idx <- sample(1:n, size = 0.7 * n)

train_data <- credit_data[train_idx, ]
test_data  <- credit_data[-train_idx, ]

#Counts of defaults/non-defaults
cat("Whole sample:\n")
## Whole sample:
table(credit_data$DefaultPayment)
## 
##     0     1 
## 23364  6636
prop.table(table(credit_data$DefaultPayment))
## 
##      0      1 
## 0.7788 0.2212
cat("\nTraining data:\n")
## 
## Training data:
table(train_data$DefaultPayment)
## 
##     0     1 
## 16321  4679
prop.table(table(train_data$DefaultPayment))
## 
##         0         1 
## 0.7771905 0.2228095
cat("\nTest data:\n")
## 
## Test data:
table(test_data$DefaultPayment)
## 
##    0    1 
## 7043 1957
prop.table(table(test_data$DefaultPayment))
## 
##         0         1 
## 0.7825556 0.2174444

Write-up (C.1):
Report the counts and proportions of defaults and non-defaults in the whole sample, training data, and test data.

Non Default: 23,364 Default: 6,636 Training Data (70%): 16,321 Test Data (30%): 4,679


C.2 – Logistic Regression and Classification Performance

#Logistic regression with all 17 predictors
lr_model <- glm(
  DefaultPayment ~ .,
  data   = train_data,
  family = binomial(link = "logit"),
  control = list(maxit = 160)   # K = 160 iterations
)

summary(lr_model)
## 
## Call:
## glm(formula = DefaultPayment ~ ., family = binomial(link = "logit"), 
##     data = train_data, control = list(maxit = 160))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.46881    0.01999 -73.490  < 2e-16 ***
## LIMIT_BAL   -0.10240    0.02429  -4.216 2.49e-05 ***
## SEX         -0.06912    0.01776  -3.892 9.94e-05 ***
## EDUCATION   -0.07958    0.01929  -4.125 3.71e-05 ***
## MARRIAGE    -0.11230    0.01840  -6.104 1.03e-09 ***
## PAY_0        0.66689    0.02382  27.991  < 2e-16 ***
## PAY_2        0.09895    0.02873   3.443 0.000574 ***
## PAY_3        0.07695    0.03233   2.380 0.017293 *  
## PAY_4        0.06111    0.03502   1.745 0.080970 .  
## PAY_5        0.01067    0.03653   0.292 0.770275    
## PAY_6        0.02822    0.03024   0.933 0.350719    
## BILL_AMT1   -0.10170    0.02376  -4.280 1.87e-05 ***
## PAY_AMT1    -0.18523    0.04272  -4.336 1.45e-05 ***
## PAY_AMT2    -0.21420    0.05354  -4.001 6.32e-05 ***
## PAY_AMT3    -0.02501    0.02935  -0.852 0.394167    
## PAY_AMT4    -0.08643    0.03200  -2.700 0.006924 ** 
## PAY_AMT5    -0.07022    0.02871  -2.446 0.014430 *  
## PAY_AMT6    -0.07338    0.02890  -2.540 0.011100 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 22279  on 20999  degrees of freedom
## Residual deviance: 19497  on 20982  degrees of freedom
## AIC: 19533
## 
## Number of Fisher Scoring iterations: 5
#Predicted probabilities and classes on test data
test_prob <- predict(lr_model, newdata = test_data, type = "response")
test_pred_class <- ifelse(test_prob >= 0.5, 1, 0)
#Confusion matrix and metrics in base R
true <- factor(test_data$DefaultPayment, levels = c(0, 1))
pred <- factor(test_pred_class,          levels = c(0, 1))

cm <- table(Predicted = pred, Actual = true)
cm
##          Actual
## Predicted    0    1
##         0 6826 1479
##         1  217  478
TN <- cm["0", "0"]
FP <- cm["1", "0"]
FN <- cm["0", "1"]
TP <- cm["1", "1"]

total <- TN + FP + FN + TP

accuracy    <- (TP + TN) / total
sensitivity <- TP / (TP + FN)         # true positive rate (default correctly flagged)
specificity <- TN / (TN + FP)         # true negative rate (non-default correctly flagged)
prevalence  <- (TP + FN) / total      # proportion of defaults in test set
precision   <- TP / (TP + FP)         # positive predictive value
F1          <- 2 * precision * sensitivity / (precision + sensitivity)

metrics <- c(
  Accuracy    = accuracy,
  Sensitivity = sensitivity,
  Specificity = specificity,
  Prevalence  = prevalence,
  Precision   = precision,
  F1_score    = F1
)

round(metrics, 3)
##    Accuracy Sensitivity Specificity  Prevalence   Precision    F1_score 
##       0.812       0.244       0.969       0.217       0.688       0.360

Write-up (C.2):
Overall accuracy of the model:

- The model achieves an accuracy of 81.2%, correctly classifying 81.2% of the test cases. This is reasonably good.

Sensitivity and specificity (how well it detects defaulters vs non-defaulters)
- Sensitivity is low at 24.4%, meaning the model only flags 24.4% of actual defaulters correctly, thus missing misses most risky customers (false negatives), which is problematic for risk management.

Prevalence of default in the test data
- The prevalence of default is 21.7%.

F1-score (balance of precision and recall for defaults)
- The F1-score is 0.360, indicating a poor balance between precision (68.8%, meaning 68.8% of flagged defaults are true) and recall (sensitivity 24.4%). The low F1 highlights the model’s weakness in reliably identifying defaults without too many errors, due to low recall despite decent precision

From a bank’s perspective, what would be an acceptable general prediction accuracy? What is the minimum accuracy you would require before using the LR model in practice?

  • From a bank’s perspective, an acceptable prediction accuracy is typically around 80–90%, since credit institutions need reasonably reliable models before making lending decisions. As a minimum threshold, the model should achieve at least 80% accuracy and maintain adequate sensitivity to detect risky borrowers. If a logistic regression model falls below this level—or if its sensitivity is too low—it would not be considered ready for practical use.