Name: Violetta Ammara Irianti Agung | Student Number: 5003251126 | Class: Statistics K

Question 1

a) Write an R function called trimmed_mean(x, alpha) that computes the trimmed mean, can sort the data, remove k observations from each end, and return the mean of the remaining values.

Trimmed Mean Function:

trimmed_mean <- function(x, alpha) {
  
  n <- length(x)
  k <- floor(n*alpha)
  
  x_sorted <- sort(x)
  x_trimmed <- x_sorted[(k+1):(n-k)]
  
  total <- 0
  for(i in 1:length(x_trimmed)){
    total <- total+x_trimmed[i]
  }
  
  result <- total/(n-2*k)
  return(result)
  
}

b) Compute the ordinary mean (alpha = 0) and the 10% trimmed mean (alpha = 0.10) of the data using the trimmed_mean function that has been created.

Set Up the Data:

x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

Ordinary Mean (\(\alpha = 0\)):

ordinary_mean <- trimmed_mean(x, alpha = 0)
cat("Ordinary Mean:", ordinary_mean, "\n")
## Ordinary Mean: 1985

10% Trimmed Mean (\(\alpha =0.10\)):

mean_trimmed <- trimmed_mean(x, alpha = 0.10)
cat("10% Trimmed Mean:", mean_trimmed, "\n")
## 10% Trimmed Mean: 1447

Boxplot Visualization to Detect Outlier:

boxplot(x,
        main = "Boxplot of Monthly Expenditure",
        ylab = "Expenditure (thousand IDR)",
        col = "lightblue")

Barplot Visualization to Compare Means:

means <- c(ordinary_mean, mean_trimmed)
names(means) <- c("Ordinary Mean", "10% Trimmed Mean")

barplot(means,
        main = "Comparison of Means",
        ylab = "Value",
        col = "lightgreen")

Interpretation of Question 1:

The ordinary mean is significantly affected by the extreme value 8500, which acts as an outlier. The 10% trimmed mean removes extreme values from both ends, making it more robust. The boxplot clearly shows 8500 as an outlier, far from the rest of the data. Therefore, the trimmed mean provides a more reliable measure of central tendency for this dataset.

Conclusion of Question 1:

The trimmed mean is more robust than the ordinary mean when outliers are present. The dataset clearly contains an outlier (8500), making trimmed mean more appropriate.

Question 2

a) Write an R function weighted_multi_desc(X,w) that uses matrix operations to return the diagonal weight matrix W, the weighted mean vector xbar_w, the weighted covariance S_w, and the weighted standard deviation vector s_w.

Weighted Multivariate Descriptive Function:

weighted_multi_desc <- function(X, w) {
  
  n <- nrow(X)
  p <- ncol(X)
  one <- matrix(1, nrow = n, ncol = 1)
  
  W <- diag(w)
  n_w <- as.numeric(t(w) %*% one)
  
  xbar_w <- (1/n_w)*t(X) %*% W %*% one
  D <- X - one %*% t(xbar_w)
  
  S_w <- (1/(n_w - 1))* t(D) %*% W %*% D
  s_w <- sqrt(diag(S_w))
  
  return(list(W = W,
              xbar_w = as.numeric(xbar_w),
              S_w = S_w,
              s_w = s_w))
  
}

b) Apply the function to the data and show the weighted mean vector, weighted covariance matrix, weighted standard deviation vector.

Set Up the Data:

data <- read.csv("data_quiz.csv")

X <- as.matrix(data[, c("Income", "Expenditure", "HH_Size")])
w <- data$Weight

result <- weighted_multi_desc(X, w)

Weighted Mean Vector:

names(result$xbar_w) <- colnames(X)
print(result$xbar_w)
##      Income Expenditure     HH_Size 
##    4.186869    3.350505    3.832997

Weighted Covariance Matrix:

print(result$S_w)
##                 Income Expenditure    HH_Size
## Income       0.8703337   0.6840776 -0.4882477
## Expenditure  0.6840776   0.5384541 -0.3843073
## HH_Size     -0.4882477  -0.3843073  0.2926238

Weighted Standard Deviation Vector:

names(result$s_w) <- colnames(X)
print(result$s_w)
##      Income Expenditure     HH_Size 
##   0.9329168   0.7337943   0.5409471

Pairwise Scatter Plot Matrix Visualization:

pairs(X,
      main = "Scatter Plot Matrix",
      pch = 19,
      col = "darkblue")

Correlation Matrix:

cor_matrix <- cor(X)
print(cor_matrix)
##                 Income Expenditure    HH_Size
## Income       1.0000000   0.9992739 -0.9594547
## Expenditure  0.9992739   1.0000000 -0.9606524
## HH_Size     -0.9594547  -0.9606524  1.0000000

Heatmap of Correlation Matrix Visualization:

heatmap(cor_matrix,
        main = "Correlation Heatmap")

Interpretation of Question 2:

The weighted mean vector reflects the average values of income, expenditure, and household size, adjusted by population weights, making it more representative. The weighted covariance matrix shows relationships between variables where positive covariance indicates variables increase together. Based on the scatter plot, income and expenditure show a positive relationship (higher income → higher spending) and household size may show weaker relationships with the other variables. The correlation matrix and heatmap confirms that there is strong positive correlation between income and expenditure. While the weighted standard deviation indicates variability such as higher values mean more variation across districts.

Conclusion of Question 2:

Weighted multivariate statistics are essential when observations have different importance. Visualizations helps to understand relationships between variables.