#Part A: Marketing Case
# Load the Segment data (make sure SegmentData.xlsx is in the working directory)
segment_data <- read_excel("SegmentData.xlsx")
# Ensure proper types for categorical variables
segment_data <- segment_data %>%
mutate(
gender = factor(gender),
ownHome = factor(ownHome),
subscribe = factor(subscribe),
Segment = factor(Segment)
)
str(segment_data)
## tibble [300 × 7] (S3: tbl_df/tbl/data.frame)
## $ age : num [1:300] 42 38 39 40 32 41 45 37 38 43 ...
## $ gender : Factor w/ 2 levels "Female","Male": 2 1 1 1 1 1 2 2 1 2 ...
## $ income : num [1:300] 64437 65608 45426 63066 45927 ...
## $ kids : num [1:300] 1 2 1 0 5 3 1 2 2 4 ...
## $ ownHome : Factor w/ 2 levels "ownNo","ownYes": 1 1 1 1 2 2 1 2 2 2 ...
## $ subscribe: Factor w/ 2 levels "subNo","subYes": 1 1 1 1 1 1 1 1 1 2 ...
## $ Segment : Factor w/ 4 levels "Moving up","Suburb mix",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(segment_data)
## age gender income kids ownHome
## Min. :18.00 Female:149 Min. : 10557 Min. :0.000 ownNo :162
## 1st Qu.:33.00 Male :151 1st Qu.: 41275 1st Qu.:0.000 ownYes:138
## Median :39.00 Median : 53186 Median :1.000
## Mean :40.56 Mean : 52071 Mean :1.283
## 3rd Qu.:48.00 3rd Qu.: 64236 3rd Qu.:2.000
## Max. :73.00 Max. :114615 Max. :7.000
## subscribe Segment
## subNo :265 Moving up : 70
## subYes: 35 Suburb mix:100
## Travelers : 80
## Urban hip : 50
##
##
Write-up (A.1.1):
Comment briefly on the sample: typical ages, incomes, numbers of
children, and any obvious differences between segments.
The dataset provides a snapshot of 300 survey respondents from Company XYZ’s customer base, divided into four demographic segments: Suburb mix (the largest group at 100 respondents), Travelers (80), Moving up (70), and Urban hip (the smallest at 50). Overall, the sample appears balanced in terms of gender, with males slightly outnumbering females (50.3% vs. 49.7%), suggesting no strong gender bias in the survey.
Age: Respondents range from young adults (18) to seniors (73), with a mean age of about 41 and a median of 39. The distribution is slightly right-skewed (mean > median) toward older individuals. The interquartile range (IQR) is from 33 (25th percentile) to 48 (75th percentile). So, the middle 50% of ages are spread between 33 and 48 years old. This means that half of the respondents are middle-aged, which may reflect which age group is the prime target for Company XYZ’s marketing.
Income: Income ranges rom USD 10,557 to USD 114,615. Mean is USD 52,071, while median is USD 53,187. This distribution is also right-skewed, suggesting income inequality in the sample, where a smaller group of high-income outliers pulls the mean upward. The IQR is USD 41,275 (25th percentile) to USD 64,236 (75th percentile). So, the middle 50% of respondents earn between USD 41,275 and USD 64,236.
Kids: The number of children is low on average (mean 1.28, median 1), with most respondents having 0–2 kids (IQR). The maximum of 7 is an outlier, but overall, this points to smaller family sizes.
OwnHome: A slight majority (54%) of respondents do not own their homes.
Subscribe:The subscription rate within the sample is very low wih only 35 customers (11.7%) currently subscribing to the offered serevices. This may be a potential challenge for Company XYZ, and may require further investigation why people do not have incentive to subscribe.
Segment: There is an uneven distribution across segments, where Suburb mix (100) is the most represented. It may indicate that this segment is company’s core customer group, or group of interest for this study. On the other hand, Urban hip segment is the smallest group.
In short: Most respondents in the sample are middle-aged (33–48 year old), earn roughly USD 41,000 to USD 64,000, have 0 to 2 kids, are a bit more likely to rent than own, and almost 90% do not subscribe to the service.
# Age histogram
ggplot(segment_data, aes(x = age)) +
geom_histogram(binwidth = 5,
boundary = 0,
closed = "left",
color = "white",
fill = "steelblue") +
labs(title = "Age Distribution",
x = "Age (years)",
y = "Count") +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(face = "bold")
)
#Income histogram
ggplot(segment_data, aes(x = income)) +
geom_histogram(binwidth = 5000,
boundary = 0,
closed = "left",
color = "white",
fill = "steelblue") +
labs(title = "Income Distribution",
x = "Income",
y = "Count") +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(face = "bold")
)
#Number of children histogram
ggplot(segment_data, aes(x = kids)) +
geom_histogram(binwidth = 1,
boundary = -0.5,
closed = "left",
color = "white",
fill = "steelblue") +
scale_x_continuous(breaks = 0:max(segment_data$kids, na.rm = TRUE)) +
labs(title = "Number of Children",
x = "Number of children",
y = "Count") +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(face = "bold")
)
#Aggregate income boxplot
ggplot(segment_data, aes(y = income)) +
geom_boxplot(
color = "black",
fill = "steelblue",
width = 0.3
) +
labs(
title = "Income Distribution (All Segments)",
x = "",
y = "Income"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(face = "bold")
)
#Boxplot of income by segment
ggplot(segment_data, aes(x = Segment, y = income)) +
geom_boxplot(
color = "black",
fill = "steelblue"
) +
labs(
title = "Income Distribution by Segment",
x = "Segment",
y = "Income"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
axis.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
Write-up (A.1.2):
Comment on outliers and differences in centrality (mean vs. median)
based on the histograms and boxplots.
Age Histogram: Mean = 40.56, Median = 39, so the center of the distribution is right around those two values. There are no visible outliers on the histogram. However, if we use the standard Q3+1.5xIQR Rule to find the outliers on the left hand side of the distribution, we find out that any value beyond 48+1.5x15=70.5 is considered an outlier.
Income Histogram: Median (USD 53,186) > Mean (USD 52,071), so the center of the distribution is at the median. There are also visible outliers above USD 90,000 in the right-tail of the distribution, and the highest is USD 114,615.
Kids (“Number of Children”) Histogram: A heavily right-skewed disitribution. A few families with 5, 6, or 7 children appear as small bars on the far right of the histogram. The mean (1.28) is pulled higher than the median (1) because of these large values. The true center of the distribution is best described by the median: 1 child.
Income Distribution (All Segments) Box Plot: There are several outliers, mostly on the high end. IQR is within [USD 41,275, USD 64,236], so outliers then will lie when Income is < USD 18,807.5 or Income > USD 86,703.5.The aggregate boxplot shows dots above ~ USD 90,000, confirming high-income skew.
Income Distribution By Segments Box Plot:
Moving up: The box is reasonably symmetric in age, but for income the median is pulled toward the lower half of the box and a long upper whisker plus a few high outlier dots reveal right-skew: median is preferred for this segment.
Suburb mix: The box is almost perfectly symmetric, the median sits dead centre, and both whiskers are of similar length with virtually no outliers: distribution is close to symmetric, so mean and median are essentially interchangeable.
Travelers: The box is short and positioned high on the plot; the median is near the top of the box while a much longer lower whisker extends downward and a handful of points sit well below the main group: clear left-skew; the median (~ USD 68,000) is therefore considerably more representative than the lower mean (~ USD 66,000).
Urban hip: The median is in the middle but an extremely long upper whisker and a couple of outlier points stretching upward: strongly right-skewed; the median (~ USD 21,000) far better captures the typical income of this young, low-earning group.
Write-up (A.1.3):
Do Income and Age appear approximately normally distributed? What
distribution might be appropriate for the variable kids
(number of children)?
Age: Mildly right-skewed (upper whisker longer, mean 40.9 > median 39) but fairly close to normal – so can be considered approximately normally distributed for most practical purposes.
Income: It doesn’t appear to be approximately normally distributed. Income distribution is mostly right-skewed overall, and especially skewness in Moving up, Travelers (left-skewed) and Urban hip (very right-skewed). Normality assumption does not hold.
‘Kids’: Poisson distribution.
#90% CI for Income (overall)
t.test(segment_data$income, conf.level = 0.90)
##
## One Sample t-test
##
## data: segment_data$income
## t = 45.724, df = 299, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 50192.24 53950.21
## sample estimates:
## mean of x
## 52071.23
#90% CI for Income – by segment
income_segment <- segment_data %>%
group_by(Segment) %>%
summarise(
n = sum(!is.na(income)),
mean = mean(income, na.rm = TRUE),
lower = t.test(income, conf.level = 0.90)$conf.int[1],
upper = t.test(income, conf.level = 0.90)$conf.int[2]
)
income_segment
## # A tibble: 4 × 5
## Segment n mean lower upper
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Moving up 70 53004. 51219. 54790.
## 2 Suburb mix 100 55677. 53969. 57385.
## 3 Travelers 80 65953. 62030. 69877.
## 4 Urban hip 50 21343. 20170. 22515.
#90% CI for Income – by gender
income_gender <- segment_data %>%
group_by(gender) %>%
summarise(
n = sum(!is.na(income)),
mean = mean(income, na.rm = TRUE),
lower = t.test(income, conf.level = 0.90)$conf.int[1],
upper = t.test(income, conf.level = 0.90)$conf.int[2]
)
income_gender
## # A tibble: 2 × 5
## gender n mean lower upper
## <fct> <int> <dbl> <dbl> <dbl>
## 1 Female 149 54046. 51689. 56404.
## 2 Male 151 50122. 47204. 53041.
Write-up (A.2.1):
Interpret the confidence intervals and comment on differences between
overall, segments, and genders.
We are 90% confident that: The overall population mean income of all customers lies between USD 50,192 and USD 53,950 (point estimate is USD 52,071).
By segment, we are 90% confident that the true average incomes are:
Moving up: [USD 51,219; USD 54,790]
Suburb mix: [USD 53,969; USD 57,385]
Travelers: [USD 62,030; USD 69,877] (clearly the wealthiest segment)
Urban hip: [USD 20,170; USD 22,515] (dramatically lower than all others)
By gender, there is almost no difference:
Females: [USD 51,689; USD 56,404]
Males: [USD 47,204; USD 53,041]
#Overall 90% CI for proportion of subscribers
prop_overall <- segment_data %>%
summarise(
n = n(),
subscribers = sum(subscribe == "subYes"),
p_hat = subscribers / n,
lower = prop.test(subscribers, n, conf.level = 0.90)$conf.int[1],
upper = prop.test(subscribers, n, conf.level = 0.90)$conf.int[2]
)
prop_overall
## # A tibble: 1 × 5
## n subscribers p_hat lower upper
## <int> <int> <dbl> <dbl> <dbl>
## 1 300 35 0.117 0.0881 0.152
#Proportion of subscribers by segment (with 90% CIs)
prop_segment <- segment_data %>%
group_by(Segment) %>%
summarise(
n = n(),
subscribers = sum(subscribe == "subYes"),
p_hat = subscribers / n,
lower = prop.test(subscribers, n, conf.level = 0.90)$conf.int[1],
upper = prop.test(subscribers, n, conf.level = 0.90)$conf.int[2],
.groups = "drop"
)
prop_segment
## # A tibble: 4 × 6
## Segment n subscribers p_hat lower upper
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 Moving up 70 15 0.214 0.139 0.313
## 2 Suburb mix 100 10 0.1 0.0569 0.166
## 3 Travelers 80 2 0.025 0.00534 0.0816
## 4 Urban hip 50 8 0.16 0.0851 0.274
#Segment with highest subscription proportion
best_segment <- prop_segment %>%
slice_max(p_hat, n = 1)
best_segment
## # A tibble: 1 × 6
## Segment n subscribers p_hat lower upper
## <fct> <int> <int> <dbl> <dbl> <dbl>
## 1 Moving up 70 15 0.214 0.139 0.313
Write-up (A.2.2):
Interpret the overall CI for the subscription proportion and the CI for
the segment with the highest proportion.
We are 90% confident that the true population proportion of subscribers lies between 8.8% and 15.2% (point estimate 11.7%).
The segment with the highest subscription rate is Moving up (observed 21.4%).
We are 90% confident that the true subscription rate in the Moving up segment is between 13.9% and 31.3%.
age <- segment_data$age
#Sample size and sample SD
n <- sum(!is.na(age))
s <- sd(age, na.rm = TRUE)
#Confidence level
alpha <- 0.10 # for 90% CI
#Chi-square critical values
chi_low <- qchisq(1 - alpha/2, df = n - 1) # upper quantile
chi_high <- qchisq(alpha/2, df = n - 1) # lower quantile
#CI for variance
lower_var <- (n - 1) * s^2 / chi_low
upper_var <- (n - 1) * s^2 / chi_high
#CI for standard deviation
lower_sd <- sqrt(lower_var)
upper_sd <- sqrt(upper_var)
c(n = n,
sample_sd = s,
lower_sd_90 = lower_sd,
upper_sd_90 = upper_sd)
## n sample_sd lower_sd_90 upper_sd_90
## 300.00000 12.68318 11.88817 13.60262
Write-up (A.2.3):
Report and interpret the CI for the population standard deviation of
Age.
Discuss whether the assumptions (normality, independence) are
reasonable.
We are 90% confident that the true population standard deviation of age lies between 11.9 and 13.6 years (sample SD = 12.68 years).
Are the assumptions met? No, because the confidence interval for σ assumes that age data is normally distributed. Since age is only mildly right-skewed, the assumption is not fully satisfied. The interval, however, is still acceptable for practical purposes but not perfectly valid.
z <- qnorm(1 - 0.10/2) #for 90% CI
p <- 0.5 #conservative choice
ME1 <- 0.02
ME2 <- 0.01
n_ME1 <- ceiling(z^2 * p * (1 - p) / ME1^2)
n_ME2 <- ceiling(z^2 * p * (1 - p) / ME2^2)
c(ME_2pct = n_ME1,
ME_1pct = n_ME2)
## ME_2pct ME_1pct
## 1691 6764
Write-up (A.3.1):
Compare the required sample sizes for ME = 2% and ME = 1%. Is the cost
of improving the ME worth it?
To estimate any population proportion with a ±2% margin of error: need n = 1,691 respondents. To improve precision to ±1% margin of error: need n = 6,764 respondents.
Comparison: Going from ±2% to ±1% precision requires exactly four times as many respondents (6,764 ÷ 1,691 ≈ 4).
Whether this is worth it depends on budget: quadrupling the sample size is usually very expensive, so ±2% is often considered sufficient for marketing decisions, whereas ±1% is only justified for high-stakes campaigns.
traveler_income <- segment_data %>%
filter(Segment == "Travelers") %>%
pull(income)
urban_income <- segment_data %>%
filter(Segment == "Urban hip") %>%
pull(income)
#Two-sample 90% CI for difference in means
tt_income <- t.test(traveler_income,
urban_income,
conf.level = 0.90)
tt_income
##
## Welch Two Sample t-test
##
## data: traveler_income and urban_income
## t = 18.141, df = 92.356, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## 40524.9 48696.6
## sample estimates:
## mean of x mean of y
## 65953.49 21342.74
Write-up (A.4.1):
Interpret the CI for Mean_Travelers − Mean_UrbanHip. Does it suggest a
meaningful difference?
We are 90% confident that the true difference in average income (Travelers − Urban hip) lies between USD 40,524 and USD 48,697 per year.
Interpretation: The interval is entirely positive and far from zero, so there is extremely strong evidence of a large and meaningful income difference. Travelers earn, on average, around USD 44,000 more per year than Urban hip customers. —
led_data <- read_excel("LED.xlsx", sheet = "Values")
summary(led_data)
## Country Developed HIV/AIDS Adult Mortality
## Length:156 Min. :0.0000 Min. : 0.100 Min. : 19.93
## Class :character 1st Qu.:0.0000 1st Qu.: 0.100 1st Qu.: 76.43
## Mode :character Median :0.0000 Median : 0.100 Median :144.27
## Mean :0.1806 Mean : 1.756 Mean :162.61
## 3rd Qu.:0.0000 3rd Qu.: 1.097 3rd Qu.:218.03
## Max. :1.0000 Max. :24.554 Max. :554.47
## NA's :1 NA's :1
## Schooling Income composition of resources Life Expectancy
## Min. : 3.992 Min. :0.2946 Min. :53.30
## 1st Qu.: 9.883 1st Qu.:0.4934 1st Qu.:68.95
## Median :12.433 Median :0.6931 Median :74.50
## Mean :12.066 Mean :0.6538 Mean :73.59
## 3rd Qu.:14.159 3rd Qu.:0.7842 3rd Qu.:78.95
## Max. :20.086 Max. :0.9303 Max. :85.30
## NA's :1 NA's :1 NA's :1
## Income Alcohol Consumption Infants death Smooking
## Min. : 628 Min. : 0.003 Min. : 1.60 Length:156
## 1st Qu.: 4160 1st Qu.: 2.195 1st Qu.: 5.70 Class :character
## Median : 13200 Median : 5.540 Median :13.40 Mode :character
## Mean : 19077 Mean : 5.790 Mean :20.95
## 3rd Qu.: 28200 3rd Qu.: 8.935 3rd Qu.:32.50
## Max. :116000 Max. :20.500 Max. :81.00
## NA's :1 NA's :1 NA's :1
## Measles BMI Men BMI Women Population
## Min. : 0.00 Min. :20.30 Min. :20.60 Min. :9.790e+04
## 1st Qu.: 0.01 1st Qu.:22.60 1st Qu.:23.70 1st Qu.:2.800e+06
## Median : 0.44 Median :25.40 Median :25.90 Median :9.540e+06
## Mean : 345.18 Mean :24.87 Mean :25.63 Mean :4.286e+07
## 3rd Qu.: 43.70 3rd Qu.:26.70 3rd Qu.:27.10 3rd Qu.:3.045e+07
## Max. :17100.00 Max. :31.00 Max. :34.30 Max. :1.440e+09
## NA's :1 NA's :1 NA's :1 NA's :1
## Polio Vaccine Maternal Death CO2 Emission Blood Preasure Men
## Min. :0.420 Min. : 2.0 Min. : 0.0467 Min. :123.0
## 1st Qu.:0.850 1st Qu.: 12.0 1st Qu.: 0.6315 1st Qu.:129.0
## Median :0.930 Median : 52.0 Median : 2.5300 Median :131.0
## Mean :0.885 Mean : 154.3 Mean : 4.4837 Mean :131.2
## 3rd Qu.:0.970 3rd Qu.: 190.5 3rd Qu.: 5.9700 3rd Qu.:134.0
## Max. :0.990 Max. :1140.0 Max. :38.0000 Max. :139.0
## NA's :1 NA's :1 NA's :1 NA's :1
## Blood presure Women
## Min. :117.0
## 1st Qu.:123.5
## Median :126.0
## Mean :126.8
## 3rd Qu.:130.0
## Max. :136.0
## NA's :1
#Outlier table for numeric variables
num_data <- led_data[ , sapply(led_data, is.numeric)]
outlier_fun <- function(x) {
x <- x[!is.na(x)]
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
n_out <- sum(x < lower | x > upper)
c(Q1 = Q1, Q3 = Q3, IQR = IQR,
lower_bound = lower, upper_bound = upper,
n_outliers = n_out)
}
outlier_table <- t(sapply(num_data, outlier_fun))
outlier_table
## Q1.25% Q3.75% IQR.75%
## Developed 0.000000e+00 0.000000e+00 0.000000e+00
## HIV/AIDS 1.000000e-01 1.096667e+00 9.966667e-01
## Adult Mortality 7.643333e+01 2.180333e+02 1.416000e+02
## Schooling 9.883333e+00 1.415889e+01 4.275556e+00
## Income composition of resources 4.934333e-01 7.841778e-01 2.907444e-01
## Life Expectancy 6.895000e+01 7.895000e+01 1.000000e+01
## Income 4.160000e+03 2.820000e+04 2.404000e+04
## Alcohol Consumption 2.195000e+00 8.935000e+00 6.740000e+00
## Infants death 5.700000e+00 3.250000e+01 2.680000e+01
## Measles 1.000000e-02 4.370000e+01 4.369000e+01
## BMI Men 2.260000e+01 2.670000e+01 4.100000e+00
## BMI Women 2.370000e+01 2.710000e+01 3.400000e+00
## Population 2.800000e+06 3.045000e+07 2.765000e+07
## Polio Vaccine 8.500000e-01 9.700000e-01 1.200000e-01
## Maternal Death 1.200000e+01 1.905000e+02 1.785000e+02
## CO2 Emission 6.315000e-01 5.970000e+00 5.338500e+00
## Blood Preasure Men 1.290000e+02 1.340000e+02 5.000000e+00
## Blood presure Women 1.235000e+02 1.300000e+02 6.500000e+00
## lower_bound.25% upper_bound.75% n_outliers
## Developed 0.000000e+00 0.000000e+00 29
## HIV/AIDS -1.395000e+00 2.591667e+00 23
## Adult Mortality -1.359667e+02 4.304333e+02 3
## Schooling 3.470000e+00 2.057222e+01 0
## Income composition of resources 5.731667e-02 1.220294e+00 0
## Life Expectancy 5.395000e+01 9.395000e+01 1
## Income -3.190000e+04 6.426000e+04 7
## Alcohol Consumption -7.915000e+00 1.904500e+01 1
## Infants death -3.450000e+01 7.270000e+01 3
## Measles -6.552500e+01 1.092350e+02 29
## BMI Men 1.645000e+01 3.285000e+01 0
## BMI Women 1.860000e+01 3.220000e+01 2
## Population -3.867500e+07 7.192500e+07 14
## Polio Vaccine 6.700000e-01 1.150000e+00 9
## Maternal Death -2.557500e+02 4.582500e+02 17
## CO2 Emission -7.376250e+00 1.397775e+01 11
## Blood Preasure Men 1.215000e+02 1.415000e+02 0
## Blood presure Women 1.137500e+02 1.397500e+02 0
#Correlation matrix for numeric variables
num_only <- led_data[ , sapply(led_data, is.numeric)]
cor_matrix <- cor(num_only, use = "complete.obs")
round(cor_matrix, 2)
## Developed HIV/AIDS Adult Mortality Schooling
## Developed 1.00 -0.18 -0.40 0.56
## HIV/AIDS -0.18 1.00 0.77 -0.26
## Adult Mortality -0.40 0.77 1.00 -0.61
## Schooling 0.56 -0.26 -0.61 1.00
## Income composition of resources 0.57 -0.40 -0.73 0.93
## Life Expectancy 0.52 -0.51 -0.80 0.80
## Income 0.60 -0.26 -0.55 0.65
## Alcohol Consumption 0.55 -0.03 -0.19 0.56
## Infants death -0.43 0.45 0.72 -0.82
## Measles -0.10 0.07 0.13 -0.20
## BMI Men 0.35 -0.38 -0.63 0.72
## BMI Women -0.01 -0.16 -0.37 0.48
## Population -0.07 -0.05 -0.05 -0.08
## Polio Vaccine 0.24 -0.20 -0.38 0.45
## Maternal Death -0.31 0.40 0.62 -0.71
## CO2 Emission 0.23 -0.20 -0.37 0.47
## Blood Preasure Men 0.14 0.24 0.25 0.04
## Blood presure Women -0.25 0.42 0.54 -0.41
## Income composition of resources Life Expectancy
## Developed 0.57 0.52
## HIV/AIDS -0.40 -0.51
## Adult Mortality -0.73 -0.80
## Schooling 0.93 0.80
## Income composition of resources 1.00 0.89
## Life Expectancy 0.89 1.00
## Income 0.76 0.70
## Alcohol Consumption 0.49 0.37
## Infants death -0.88 -0.87
## Measles -0.21 -0.19
## BMI Men 0.78 0.64
## BMI Women 0.51 0.36
## Population -0.06 0.00
## Polio Vaccine 0.47 0.54
## Maternal Death -0.79 -0.74
## CO2 Emission 0.58 0.47
## Blood Preasure Men -0.04 -0.17
## Blood presure Women -0.50 -0.56
## Income Alcohol Consumption Infants death
## Developed 0.60 0.55 -0.43
## HIV/AIDS -0.26 -0.03 0.45
## Adult Mortality -0.55 -0.19 0.72
## Schooling 0.65 0.56 -0.82
## Income composition of resources 0.76 0.49 -0.88
## Life Expectancy 0.70 0.37 -0.87
## Income 1.00 0.36 -0.60
## Alcohol Consumption 0.36 1.00 -0.39
## Infants death -0.60 -0.39 1.00
## Measles -0.14 -0.03 0.24
## BMI Men 0.55 0.29 -0.70
## BMI Women 0.27 0.01 -0.47
## Population -0.06 -0.01 0.01
## Polio Vaccine 0.37 0.21 -0.59
## Maternal Death -0.49 -0.28 0.87
## CO2 Emission 0.76 0.16 -0.46
## Blood Preasure Men -0.01 0.36 0.08
## Blood presure Women -0.39 -0.06 0.48
## Measles BMI Men BMI Women Population
## Developed -0.10 0.35 -0.01 -0.07
## HIV/AIDS 0.07 -0.38 -0.16 -0.05
## Adult Mortality 0.13 -0.63 -0.37 -0.05
## Schooling -0.20 0.72 0.48 -0.08
## Income composition of resources -0.21 0.78 0.51 -0.06
## Life Expectancy -0.19 0.64 0.36 0.00
## Income -0.14 0.55 0.27 -0.06
## Alcohol Consumption -0.03 0.29 0.01 -0.01
## Infants death 0.24 -0.70 -0.47 0.01
## Measles 1.00 -0.28 -0.27 0.60
## BMI Men -0.28 1.00 0.85 -0.19
## BMI Women -0.27 0.85 1.00 -0.21
## Population 0.60 -0.19 -0.21 1.00
## Polio Vaccine -0.21 0.29 0.19 0.02
## Maternal Death 0.26 -0.67 -0.50 -0.01
## CO2 Emission -0.12 0.51 0.37 -0.03
## Blood Preasure Men -0.11 0.00 -0.06 -0.22
## Blood presure Women 0.08 -0.37 -0.24 -0.10
## Polio Vaccine Maternal Death CO2 Emission
## Developed 0.24 -0.31 0.23
## HIV/AIDS -0.20 0.40 -0.20
## Adult Mortality -0.38 0.62 -0.37
## Schooling 0.45 -0.71 0.47
## Income composition of resources 0.47 -0.79 0.58
## Life Expectancy 0.54 -0.74 0.47
## Income 0.37 -0.49 0.76
## Alcohol Consumption 0.21 -0.28 0.16
## Infants death -0.59 0.87 -0.46
## Measles -0.21 0.26 -0.12
## BMI Men 0.29 -0.67 0.51
## BMI Women 0.19 -0.50 0.37
## Population 0.02 -0.01 -0.03
## Polio Vaccine 1.00 -0.54 0.30
## Maternal Death -0.54 1.00 -0.41
## CO2 Emission 0.30 -0.41 1.00
## Blood Preasure Men -0.01 0.13 -0.01
## Blood presure Women -0.19 0.44 -0.22
## Blood Preasure Men Blood presure Women
## Developed 0.14 -0.25
## HIV/AIDS 0.24 0.42
## Adult Mortality 0.25 0.54
## Schooling 0.04 -0.41
## Income composition of resources -0.04 -0.50
## Life Expectancy -0.17 -0.56
## Income -0.01 -0.39
## Alcohol Consumption 0.36 -0.06
## Infants death 0.08 0.48
## Measles -0.11 0.08
## BMI Men 0.00 -0.37
## BMI Women -0.06 -0.24
## Population -0.22 -0.10
## Polio Vaccine -0.01 -0.19
## Maternal Death 0.13 0.44
## CO2 Emission -0.01 -0.22
## Blood Preasure Men 1.00 0.75
## Blood presure Women 0.75 1.00
Write-up (B.1.1):
Comment on outliers and on key patterns in the correlation matrix
(especially correlations with Life Expectancy, and
multicollinearity).
DATA SUMMARY The data reveal global disparities in health and economy, with Life Expectancy averaging 70 years (range 53–83). Developing countries predominate (mean Developed=0.18). Right-skew is evident in risks like HIV/AIDS (mean 7.64 > median 0.4), Adult Mortality (218 > 163), and Population (28.6M > 4.3M), indicating tails toward worse outcomes; Income also skews right (mean USD 19,000 > median USD 13,000), signaling inequality.
OUTLIERS Outliers were identified using the IQR method (values < Q1 - 1.5IQR or > Q3 + 1.5IQR) and are provided in the table. Note: developed has 29 “upper” outliers, but as a binary variable, this is misleading—treat it as categorical, no true outliers.
CORRELATIONS: There were noticeable patterns. Strong positive correlations among positive factors: e.g., Life Expectancy with Income Composition of Resources (the breakdown of a nation’s total income into different resources and how efficiently government allocate and use resources) (0.89), Schooling (0.80), Income (0.70), etc. These variables tend to increase together with Life Expectancy. A high positive correlation (e.g., 0.89) means that once you know a country’s Income Composition, you can predict LE reasonably well.
Strong negative correlations with risks: Life Expectancy with Adult Mortality (-0.80), HIV/AIDS (-0.51), Infants Death (-0.87), Maternal Death (-0.74). As mortality, infant deaths, maternal deaths, and HIV prevalence increase, life expectancy falls.Strong negative values mean the variables move in opposite directions in a predictable pattern.
Moderate: Alcohol Consumption (0.37), Polio Vaccine (0.47), CO2 Emission (0.41). There is some association, but it’s weaker. These often act as development proxies: richer countries drink more alcohol, vaccinate more, and produce more emissions — and also have higher LE.
Weak or near-zero correlations: LE & Smoking: 0.12 LE & Population: –0.06 There is almost no linear relationship. Population size does not predict life expectancy. Smoking varies across countries in non-linear ways.
The matrix reveals clear patterns: Life Expectancy has a strong positive relationship with economic/social development (e.g., Schooling, Income) and negatively with mortality/HIV risks.
Potential multicollinearity issues: high correlations between predictors like Schooling and Income Composition (0.93), Adult Mortality and HIV/AIDS (0.77), BMI Men/Women (0.99—near perfect), Infants Death and Maternal Death (0.87). High correlations among predictors matter because they make it statistically difficult for the model to separate each variable’s independent effect on life expectancy.
m1 <- lm(`Life Expectancy` ~ Developed +
`HIV/AIDS` +
`Income composition of resources` +
`Alcohol Consumption` +
`Polio Vaccine`,
data = led_data)
summary(m1)
##
## Call:
## lm(formula = `Life Expectancy` ~ Developed + `HIV/AIDS` + `Income composition of resources` +
## `Alcohol Consumption` + `Polio Vaccine`, data = led_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5478 -1.5350 0.0748 1.7350 7.8782
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.76097 1.82686 25.596 < 2e-16 ***
## Developed 1.16910 0.77613 1.506 0.134
## `HIV/AIDS` -0.27841 0.05995 -4.644 7.45e-06 ***
## `Income composition of resources` 30.63928 2.06136 14.864 < 2e-16 ***
## `Alcohol Consumption` -0.10250 0.07047 -1.454 0.148
## `Polio Vaccine` 8.66321 2.13606 4.056 8.03e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.844 on 149 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8349, Adjusted R-squared: 0.8293
## F-statistic: 150.7 on 5 and 149 DF, p-value: < 2.2e-16
Write-up (B.1.2 – first model):
For each variable, comment on (i) significance (one-sided alternatives),
(ii) sign (is it as expected?), and (iii) magnitude.
The first MLR model explains 83.5% of variation in Life Expectancy (adj R²=83.2%), with overall high significance (F=150.7, p<2e-16).
Developed: (i) Not significant (one-sided p=0.134 > 0.05). (ii) Positive sign as expected (developed countries have higher LE). (iii) Magnitude small: +1.17 years for developed status.
HIV/AIDS: (i) Highly significant (one-sided p=7.45e-06 <0.001). (ii) Negative sign as expected (higher HIV prevalence reduces LE). (iii) Moderate magnitude: each unit increase reduces LE by 0.28 years.
Income composition of resources: (i) Highly significant (one-sided p<2e-16 <0.001). (ii) Positive as expected (better resource use boosts LE). (iii) Large magnitude: each 0.1 unit increase adds ~3.06 years.
Alcohol Consumption: (i) Not significant (one-sided p=0.148 >0.05, since unexpected positive t). (ii) Negative but close to zero, as expected for health risk. (iii) Small magnitude: negligible -0.10 years per unit.
Polio Vaccine: (i) Highly significant (one-sided p=4.02e-05 <0.001). (ii) Positive as expected (better immunization increases LE). (iii) Large magnitude: each 0.1 coverage increase adds ~0.87 years.
vars <- c("Developed", "HIV/AIDS", "Adult Mortality", "Schooling",
"Income composition of resources", "Income", "Alcohol Consumption",
"Infants death", "Smooking", "Measles", "BMI Men", "BMI Women",
"Population", "Polio Vaccine", "Maternal Death", "CO2 Emission",
"Blood Preasure Men", "Blood presure Women")
r2_list <- sapply(vars, function(v) {
summary(lm(led_data$`Life Expectancy` ~ led_data[[v]]))$r.squared
})
best_idx <- which.max(r2_list)
best_var <- vars[best_idx]
best_r2 <- r2_list[best_idx]
best_var
## [1] "Income composition of resources"
best_r2
## Income composition of resources
## 0.7848303
Write-up (B.1.2 – WHO questions):
Which variable is most effective in increasing LE (based on the
5-variable MLR)?
Which variable has the best predictive power in terms of R² (from the simple regressions)?
Should WHO be worried about LE in non-developed vs. developed countries?
#Ensure Smooking is numeric
led_data$Smooking <- as.numeric(led_data$Smooking)
m_full <- lm(`Life Expectancy` ~ . - Country, data = led_data)
summary(m_full)
##
## Call:
## lm(formula = `Life Expectancy` ~ . - Country, data = led_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2954 -1.0584 0.0381 1.0803 4.7854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.950e+01 1.026e+01 9.699 1.67e-15 ***
## Developed 3.643e-03 8.175e-01 0.004 0.996454
## `HIV/AIDS` 1.568e-01 9.528e-02 1.645 0.103533
## `Adult Mortality` -2.732e-02 5.669e-03 -4.820 6.04e-06 ***
## Schooling 2.960e-02 2.367e-01 0.125 0.900784
## `Income composition of resources` 2.042e+01 7.140e+00 2.859 0.005315 **
## Income 5.722e-06 2.963e-05 0.193 0.847314
## `Alcohol Consumption` -2.102e-02 7.501e-02 -0.280 0.779950
## `Infants death` -1.451e-01 3.901e-02 -3.719 0.000354 ***
## Smooking 7.151e-01 2.943e+00 0.243 0.808563
## Measles 3.000e-05 1.479e-04 0.203 0.839757
## `BMI Men` -2.679e-01 3.266e-01 -0.820 0.414310
## `BMI Women` -1.783e-01 2.339e-01 -0.762 0.447857
## Population -1.529e-09 1.519e-09 -1.007 0.316798
## `Polio Vaccine` -1.469e-01 3.156e+00 -0.047 0.962977
## `Maternal Death` 4.278e-03 2.415e-03 1.771 0.080025 .
## `CO2 Emission` -5.178e-02 6.163e-02 -0.840 0.403082
## `Blood Preasure Men` -1.045e-01 1.305e-01 -0.801 0.425550
## `Blood presure Women` -6.358e-02 1.136e-01 -0.560 0.577172
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.112 on 87 degrees of freedom
## (50 observations deleted due to missingness)
## Multiple R-squared: 0.9057, Adjusted R-squared: 0.8862
## F-statistic: 46.42 on 18 and 87 DF, p-value: < 2.2e-16
coefs_full <- summary(m_full)$coefficients
pvals <- coefs_full[-1, 4] # exclude intercept
#Number of significant variables at 5% level
sum(pvals < 0.05)
## [1] 3
#Names of significant variables
names(pvals)[pvals < 0.05]
## [1] "`Adult Mortality`" "`Income composition of resources`"
## [3] "`Infants death`"
#R-squared comparison
summary(m1)$r.squared
## [1] 0.8348584
summary(m1)$adj.r.squared
## [1] 0.8293167
summary(m_full)$r.squared
## [1] 0.9056998
summary(m_full)$adj.r.squared
## [1] 0.8861895
Write-up (B.1.2 – full model):
Test the overall significance (F-test) of the full model.
How many variables are significant at 5%? Did any variables from the first model drop out?
Compare R² and adjusted R² between the first and full models. What do you conclude?
#Predicted LE for an "average" country
avg_vals <- led_data %>%
summarise(
Developed = mean(Developed, na.rm = TRUE),
`HIV/AIDS` = mean(`HIV/AIDS`, na.rm = TRUE),
`Income composition of resources` = mean(`Income composition of resources`, na.rm = TRUE),
`Alcohol Consumption` = mean(`Alcohol Consumption`, na.rm = TRUE),
`Polio Vaccine` = mean(`Polio Vaccine`, na.rm = TRUE)
)
avg_vals
## # A tibble: 1 × 5
## Developed `HIV/AIDS` `Income composition of resources` `Alcohol Consumption`
## <dbl> <dbl> <dbl> <dbl>
## 1 0.181 1.76 0.654 5.79
## # ℹ 1 more variable: `Polio Vaccine` <dbl>
pred_LE_avg <- predict(m1, newdata = avg_vals)
pred_LE_avg
## 1
## 73.58903
#Country with highest observed LE
idx_max_LE <- which.max(led_data$`Life Expectancy`)
led_data[idx_max_LE, c("Country", "Life Expectancy")]
## # A tibble: 1 × 2
## Country `Life Expectancy`
## <chr> <dbl>
## 1 Singapore 85.3
#Predicted LE for that country
pred_LE_max_country <- predict(m1, newdata = led_data[idx_max_LE, ])
pred_LE_max_country
## 1
## 82.45459
Write-up (B.1.3):
- Report the predicted LE for an “average” country.
- Identify the country with the highest observed LE and its observed
value. - Report the predicted LE for that country and comment on the
difference vs observed.
The predicted life expectancy for an “average” country (based on mean values of the predictors: Developed=0.181, HIV/AIDS=1.76, Income composition=0.654, Alcohol Consumption=5.79, Polio Vaccine=0.884) is 73.6 years.
The country with the highest observed life expectancy is Singapore, with an observed value of 85.3 years. The predicted life expectancy for Singapore is 82.5 years.
Comment on difference: - The model under-predicts by about 2.8 years (observed 85.3 vs predicted 82.5), a reasonable residual given the model’s R²~83%—it captures broad trends but misses country-specific factors.
credit_data <- read_excel("CreditData.xlsx")
#Rename dependent variable
str(credit_data)
## tibble [30,000 × 18] (S3: tbl_df/tbl/data.frame)
## $ LIMIT_BAL : num [1:30000] -1.137 -0.366 -0.597 -0.905 -0.905 ...
## $ SEX : num [1:30000] 0.81 0.81 0.81 0.81 -1.23 ...
## $ EDUCATION : num [1:30000] 0.212 0.212 0.212 0.212 0.212 ...
## $ MARRIAGE : num [1:30000] -1.069 0.849 0.849 -1.069 -1.069 ...
## $ PAY_0 : num [1:30000] 1.7945 -0.875 0.0149 0.0149 -0.875 ...
## $ PAY_2 : num [1:30000] 1.782 1.782 0.112 0.112 0.112 ...
## $ PAY_3 : num [1:30000] -0.697 0.139 0.139 0.139 -0.697 ...
## $ PAY_4 : num [1:30000] -0.667 0.189 0.189 0.189 0.189 ...
## $ PAY_5 : num [1:30000] -1.53 0.235 0.235 0.235 0.235 ...
## $ PAY_6 : num [1:30000] -1.486 1.992 0.253 0.253 0.253 ...
## $ BILL_AMT1 : num [1:30000] -0.6425 -0.6592 -0.2986 -0.0575 -0.5786 ...
## $ PAY_AMT1 : num [1:30000] -0.342 -0.342 -0.25 -0.221 -0.221 ...
## $ PAY_AMT2 : num [1:30000] -0.227 -0.214 -0.192 -0.169 1.335 ...
## $ PAY_AMT3 : num [1:30000] -0.297 -0.24 -0.24 -0.229 0.271 ...
## $ PAY_AMT4 : num [1:30000] -0.308 -0.244 -0.244 -0.238 0.266 ...
## $ PAY_AMT5 : num [1:30000] -0.314 -0.314 -0.249 -0.244 -0.269 ...
## $ PAY_AMT6 : num [1:30000] -0.2934 -0.1809 -0.0121 -0.2371 -0.2552 ...
## $ default_payment: num [1:30000] 1 1 0 0 0 0 0 0 0 0 ...
names(credit_data)
## [1] "LIMIT_BAL" "SEX" "EDUCATION" "MARRIAGE"
## [5] "PAY_0" "PAY_2" "PAY_3" "PAY_4"
## [9] "PAY_5" "PAY_6" "BILL_AMT1" "PAY_AMT1"
## [13] "PAY_AMT2" "PAY_AMT3" "PAY_AMT4" "PAY_AMT5"
## [17] "PAY_AMT6" "default_payment"
colnames(credit_data)[ncol(credit_data)] <- "DefaultPayment"
table(credit_data$DefaultPayment)
##
## 0 1
## 23364 6636
set.seed(123)
n <- nrow(credit_data)
train_idx <- sample(1:n, size = 0.7 * n)
train_data <- credit_data[train_idx, ]
test_data <- credit_data[-train_idx, ]
#Counts of defaults/non-defaults
cat("Whole sample:\n")
## Whole sample:
table(credit_data$DefaultPayment)
##
## 0 1
## 23364 6636
prop.table(table(credit_data$DefaultPayment))
##
## 0 1
## 0.7788 0.2212
cat("\nTraining data:\n")
##
## Training data:
table(train_data$DefaultPayment)
##
## 0 1
## 16321 4679
prop.table(table(train_data$DefaultPayment))
##
## 0 1
## 0.7771905 0.2228095
cat("\nTest data:\n")
##
## Test data:
table(test_data$DefaultPayment)
##
## 0 1
## 7043 1957
prop.table(table(test_data$DefaultPayment))
##
## 0 1
## 0.7825556 0.2174444
Write-up (C.1):
Report the counts and proportions of defaults and non-defaults in the
whole sample, training data, and test data.
Non Default: 23,364 Default: 6,636 Training Data (70%): 16,321 Test Data (30%): 4,679
#Logistic regression with all 17 predictors
lr_model <- glm(
DefaultPayment ~ .,
data = train_data,
family = binomial(link = "logit"),
control = list(maxit = 160) # K = 160 iterations
)
summary(lr_model)
##
## Call:
## glm(formula = DefaultPayment ~ ., family = binomial(link = "logit"),
## data = train_data, control = list(maxit = 160))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.46881 0.01999 -73.490 < 2e-16 ***
## LIMIT_BAL -0.10240 0.02429 -4.216 2.49e-05 ***
## SEX -0.06912 0.01776 -3.892 9.94e-05 ***
## EDUCATION -0.07958 0.01929 -4.125 3.71e-05 ***
## MARRIAGE -0.11230 0.01840 -6.104 1.03e-09 ***
## PAY_0 0.66689 0.02382 27.991 < 2e-16 ***
## PAY_2 0.09895 0.02873 3.443 0.000574 ***
## PAY_3 0.07695 0.03233 2.380 0.017293 *
## PAY_4 0.06111 0.03502 1.745 0.080970 .
## PAY_5 0.01067 0.03653 0.292 0.770275
## PAY_6 0.02822 0.03024 0.933 0.350719
## BILL_AMT1 -0.10170 0.02376 -4.280 1.87e-05 ***
## PAY_AMT1 -0.18523 0.04272 -4.336 1.45e-05 ***
## PAY_AMT2 -0.21420 0.05354 -4.001 6.32e-05 ***
## PAY_AMT3 -0.02501 0.02935 -0.852 0.394167
## PAY_AMT4 -0.08643 0.03200 -2.700 0.006924 **
## PAY_AMT5 -0.07022 0.02871 -2.446 0.014430 *
## PAY_AMT6 -0.07338 0.02890 -2.540 0.011100 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 22279 on 20999 degrees of freedom
## Residual deviance: 19497 on 20982 degrees of freedom
## AIC: 19533
##
## Number of Fisher Scoring iterations: 5
#Predicted probabilities and classes on test data
test_prob <- predict(lr_model, newdata = test_data, type = "response")
test_pred_class <- ifelse(test_prob >= 0.5, 1, 0)
#Confusion matrix and metrics in base R
true <- factor(test_data$DefaultPayment, levels = c(0, 1))
pred <- factor(test_pred_class, levels = c(0, 1))
cm <- table(Predicted = pred, Actual = true)
cm
## Actual
## Predicted 0 1
## 0 6826 1479
## 1 217 478
TN <- cm["0", "0"]
FP <- cm["1", "0"]
FN <- cm["0", "1"]
TP <- cm["1", "1"]
total <- TN + FP + FN + TP
accuracy <- (TP + TN) / total
sensitivity <- TP / (TP + FN) # true positive rate (default correctly flagged)
specificity <- TN / (TN + FP) # true negative rate (non-default correctly flagged)
prevalence <- (TP + FN) / total # proportion of defaults in test set
precision <- TP / (TP + FP) # positive predictive value
F1 <- 2 * precision * sensitivity / (precision + sensitivity)
metrics <- c(
Accuracy = accuracy,
Sensitivity = sensitivity,
Specificity = specificity,
Prevalence = prevalence,
Precision = precision,
F1_score = F1
)
round(metrics, 3)
## Accuracy Sensitivity Specificity Prevalence Precision F1_score
## 0.812 0.244 0.969 0.217 0.688 0.360
Write-up (C.2):
Overall accuracy of the model:
- The model achieves an accuracy of 81.2%, correctly classifying 81.2% of the test cases. This is reasonably good.
Sensitivity and specificity (how well it detects defaulters vs
non-defaulters)
- Sensitivity is low at 24.4%, meaning the model only flags 24.4% of
actual defaulters correctly, thus missing misses most risky customers
(false negatives), which is problematic for risk management.
Prevalence of default in the test data
- The prevalence of default is 21.7%.
F1-score (balance of precision and recall for defaults)
- The F1-score is 0.360, indicating a poor balance between precision
(68.8%, meaning 68.8% of flagged defaults are true) and recall
(sensitivity 24.4%). The low F1 highlights the model’s weakness in
reliably identifying defaults without too many errors, due to low recall
despite decent precision
From a bank’s perspective, what would be an acceptable general prediction accuracy? What is the minimum accuracy you would require before using the LR model in practice?