Nathania Karina Anggraini - 5003251181
A researcher collected a monthly household expenditures from 12 respondents. They wanted to find the ordinary mean and the 10% trimmed mean of the collected data.
trimmed_mean <- function(x, alpha = 0.1) {
# validation
if (alpha < 0 || alpha > 0.5 ) {
stop("alpha must be between 0 and 0.5")
}
# 1. compute n, k, and sort the data
n <- length(x)
k <- floor(n*alpha)
sorted_data <- sort(x)
# 2. find the sum of the data manually from (k+1) until (n-k)
sum_x <- 0
for (i in (k+1):(n-k)) {
sum_x <- sum_x + x[i]
}
# 3. return the value of the trimmed_mean
return(sum_x/(n-2*k))
}
# Collected Data
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)
mean_alpha_0 <- trimmed_mean(x, 0)
cat("The ordinary mean of data (", x, ") is ", mean_alpha_0, "\n")
The ordinary mean of data ( 850 920 1100 1250 1300 1400 1500 1550 1600 1750 2100 8500 ) is 1985
mean_alpha_0.1 <- trimmed_mean(x, 0.1)
cat("The 10% trimmed mean of data (", x, ") is ", mean_alpha_0.1, "\n")
The 10% trimmed mean of data ( 850 920 1100 1250 1300 1400 1500 1550 1600 1750 2100 8500 ) is 1447
The Surabaya City Government conducts a household welfare survey across all 31 kecamatan (sub-districts). Three variables are recorded as district-level averages: monthly income, monthly expenditure, and household size. Each kecamatan is weighted by its population, reflecting how many households that kecamatan represents.
With the weighted multivariate descriptive statistics function, find the diagonal weight matrix, weighted mean vector, weighted covariance, and the weighted standard deviation vector.
weighted_multi_desc <- function(X, w) {
# 1. define n, p, and n by 1 vector
n <- nrow(X)
p <- ncol(X)
vector1 <- matrix(1, nrow = n, ncol = 1)
# 2. Diagonal weight matrix (W)
W <- diag(w)
# 3. Total weight (n_w)
n_w <- 0
for (i in 1:n) {
n_w <- n_w + w[i]
}
# 4. Weighted mean vector (X_bar_w)
x_bar_w <- (1/n_w)* t(X) %*% W %*% vector1
# 5. Deviation matrix (D)
D <- X - vector1 %*% t(x_bar_w)
# 6. Weighted covariance matrix (S_w)
S_w <- (1 / (n_w - 1)) * t(D) %*% W %*% D
# 7. Weighted standard deviation vector (Std_w)
Std_w <- sqrt(diag(S_w))
#8. return W, x_bar_w, S_w, Std_w
return(list(
W = W,
x_bar_w = x_bar_w,
S_w = S_w,
Std_w = Std_w
))
}
# Collected data from 31 kecamatan in Surabaya
data <- read.csv("https://raw.githubusercontent.com/nia1408/data_quiz/main/data_quiz.csv", header = TRUE, sep = ",")
X <- as.matrix(data[, c("Income", "Expenditure", "HH_Size")])
w <- data$Weight
print(data)
Kecamatan Income Expenditure HH_Size Weight
1 Asemrowo 3.1 2.5 4.2 5
2 Benowo 3.8 3.0 4.0 7
3 Bubutan 4.2 3.4 3.5 10
4 Bulak 3.5 2.8 4.1 5
5 Dukuh Pakis 6.5 5.2 3.0 6
6 Gayungan 5.8 4.6 3.1 4
7 Genteng 5.5 4.4 2.8 6
8 Gubeng 5.9 4.7 2.9 13
9 Gunung Anyar 4.8 3.8 3.6 6
10 Jambangan 4.3 3.4 3.8 5
11 Karangpilang 4.5 3.6 3.7 7
12 Kenjeran 3.0 2.4 4.5 18
13 Krembangan 3.4 2.7 4.3 11
14 Lakarsantri 4.0 3.2 3.9 6
15 Mulyorejo 5.2 4.1 3.2 9
16 Pabean Cantikan 3.3 2.6 4.4 7
17 Pakal 3.6 2.9 4.1 6
18 Rungkut 5.0 4.0 3.3 12
19 Sambikerep 4.1 3.3 3.8 7
20 Sawahan 3.7 3.0 4.2 20
21 Semampir 2.8 2.3 4.7 18
22 Simokerto 3.2 2.6 4.3 9
23 Sukolilo 5.6 4.5 3.0 11
24 Sukomanunggal 4.6 3.7 3.5 10
25 Tambaksari 3.5 2.8 4.4 23
26 Tandes 4.0 3.2 3.9 9
27 Tegalsari 4.1 3.3 3.7 10
28 Tenggilis Mejoyo 5.3 4.2 3.1 6
29 Wiyung 4.7 3.7 3.6 7
30 Wonocolo 4.9 3.9 3.4 8
31 Wonokromo 4.4 3.5 3.8 16
Apply the data to the multivariate descriptive function
answer <- weighted_multi_desc(X,w)
print(answer$x_bar_w)
[,1]
Income 4.186869
Expenditure 3.350505
HH_Size 3.832997
print(answer$S_w)
Income Expenditure HH_Size
Income 0.8703337 0.6840776 -0.4882477
Expenditure 0.6840776 0.5384541 -0.3843073
HH_Size -0.4882477 -0.3843073 0.2926238
print(answer$Std_w)
Income Expenditure HH_Size
0.9329168 0.7337943 0.5409471
By visualizing using histogram we could see that most expenditure values are concentrated between 850 and 2100, while the value 8500 is far from the rest, makes it an outlier
hist(x,
breaks = 15,
col = "steelblue",
border = "white",
main = "Histogram of Expenditure",
xlab = "Expenditure",
freq = TRUE
)
In this case, boxplot is important because it clearly identifies 8500 as an outlier.
boxplot(x,
main = "Boxplot of Expenditure",
ylab = "Expenditure",
col = "#2196F3")
The density plot shows the distribution of the expenditures by shape smoothly. It does not only show the outliers (8500), but also the mean and the trimmed mean.
library(ggplot2)
data_frame_x <- data.frame(x = x)
ggplot(data_frame_x, aes(x = x)) +
geom_density(fill = "hotpink3", alpha = 0.5) +
geom_vline(aes(xintercept = mean_alpha_0),
color = "blue", linetype = "dashed", size = 1) +
geom_vline(aes(xintercept = mean_alpha_0.1,),
color = "red", linetype = "dashed", size = 1 ) +
labs(title = "Density Plot of Expenditure",
x = "Expenditure", y = "Density")
Visualizing with scatter plot shows a positive relationship between income and expenditure. It indicates that districts with higher income tend to have higher expenditure.
ggplot(data, aes(x = Income, y = Expenditure)) +
geom_point(color = "darkgreen", size = 2) +
labs(title = "Income vs Expenditure")
This type of plot show’s the regression line of Income and Expenditure. The regression line confirms the upward trend where the higher income, the higher the expenditure and this supports the covariance results too.
ggplot(data, aes(x = Income, y = Expenditure)) +
geom_point(color = "blue", size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "red", size = 0.5) +
labs(title = "Income vs Expenditure with Regression line") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"))
The bubble plot adds population weights, this shows that some districts contribute more significantly to the overall statistics. Larger bubble represents districts with larger population, tells the use of weighted statistics rather than basic statistics.
ggplot(data, aes(x = Income, y = Expenditure, size = Weight)) +
geom_point(alpha = 0.5, color = "steelblue") +
labs(title="Income vs Expenditure (Weighted by Population)")
The correlation heatmap summarize the relationships numerically. This makes it easier to identify strong and weak correlations between data.
library(corrplot)
cor_matrix <- cor(data[, c("Income", "Expenditure", "HH_Size")])
corrplot(cor_matrix,
method = "color",
addCoef.col = "white")
Using pair plot provides a detailed view of relationships among all variables. This help suggest potential associations between household size and both income and the expenditure.
library(GGally)
ggpairs(data[, c("Income","Expenditure","HH_Size")],
title = "Pair Plot Household Welfare survey",
)
QQ plot is a supporting plot in this case. This plot helps to indicate whether the income variable follows a normal distribution.
ggplot(data, aes(sample = Income)) +
stat_qq() +
stat_qq_line(color = "blue") +
labs(title = "Q-Q Plot Income",
x = "Theoretical Quantiles",
y = "Observed/Sample Quantiles")