Load ggplot2

library(ggplot2)

1. Data 1

A researcher collected the following monthly household expenditures (in thousands of IDR) from 12 respondents:

x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)
data1 <- data.frame(Value = x)

Plot using built-in function:

boxplot(x,
  main = "Boxplot of Monthly Household Expenditures",
  ylab = "Monthly expenditures (thousand IDR)",
  col  = "skyblue")
grid()

ggplot2 function:

ggplot(data1, aes(x = "", y = Value)) +
  geom_boxplot(fill = "skyblue",
               outlier.color = "tomato",
               outlier.stroke = 2) +
  labs(title = "Boxplot of Monthly Household Expenditures",
       y = "Monthly expenditures (thousand IDR)") +
  theme_light()

Outlier detection & summary:

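# Tukey fences: values more than 1.5 * IQR beyond the quartiles count as outliers
iqr_val <- IQR(x)
upper_fence <- quantile(x, 0.75) + (1.5 * iqr_val)
lower_fence <- quantile(x, 0.25) - (1.5 * iqr_val)

# Return the values outside [low, high] as one comma-separated string
get_outliers <- function(data, low, high) {
  outs <- data[data < low | data > high]

  if (length(outs) > 0) {
    return(paste(outs, collapse = ", "))
  } else {
    return("none.")
  }
}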

outs_data1 <- get_outliers(x, lower_fence, upper_fence)

cat(
  " ==============================\n",
  "         DATA SUMMARY\n",
  "==============================\n",
  "Minimum value       :", min(x), "\n",
  "Maximum value       :", max(x), "\n",
  "1st quartile        :", quantile(x, 0.25), "\n",
  "Median              :", quantile(x, 0.5), "\n",
  "3rd quartile        :", quantile(x, 0.75), "\n",
  "Mean                :", mean(x), "\n",
  "Lower fence         :", lower_fence, "\n",
  "Upper fence         :", upper_fence, "\n",
  "Outlier             :", outs_data1, "\n",
  "=============================="
)
##  ==============================
##           DATA SUMMARY
##  ==============================
##  Minimum value       : 850 
##  Maximum value       : 8500 
##  1st quartile        : 1212.5 
##  Median              : 1450 
##  3rd quartile        : 1637.5 
##  Mean                : 1985 
##  Lower fence         : 575 
##  Upper fence         : 2275 
##  Outlier             : 8500 
##  ==============================
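
As a cross-check on the hand-computed fences, base R's boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses when drawing whiskers. The sketch below is only a sanity check, not part of the original analysis. Note that boxplot.stats() measures the box with Tukey's hinges rather than quantile(), so its fences can differ slightly from the ones above, although for this vector it should flag the same single value.

```r
# Expenditure vector from Data 1 (thousand IDR)
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

# coef = 1.5 is the default multiplier for the whisker length
bp <- boxplot.stats(x, coef = 1.5)

bp$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out    # observations lying beyond the whiskers
```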

Interpretation

The upper outlier (8,500, well above the upper fence of 2,275) pulls the mean up to 1,985, which exceeds both the median (1,450) and the third quartile (1,637.5), so the mean is a poor summary of typical expenditure here. A mean above the median is consistent with a positively skewed distribution.
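
One way to see how much a single value drives the mean is to recompute both statistics without it. The sketch below drops the maximum purely for illustration (the name x_trimmed is introduced here and is not part of the original analysis); it is not a recommendation to delete the observation. The mean falls from 1985 to roughly 1393, while the median only moves from 1450 to 1400.

```r
# Expenditure vector from Data 1 (thousand IDR)
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

# Illustrative only: the same data with the largest value removed
x_trimmed <- x[x != max(x)]

comparison <- data.frame(
  statistic   = c("mean", "median"),
  all_values  = c(mean(x), median(x)),
  without_max = c(mean(x_trimmed), median(x_trimmed))
)
print(comparison)
```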

2. Data 2

data2 <- read.csv("data.csv", header = TRUE, sep = ',')

A. Scatter Plot

Income vs Expenditure

plot(data2$Income, data2$Expenditure,
     col = "darkblue",
     pch = 19,
     main = "Income vs Expenditure\n(million IDR)",
     xlab = "Income",
     ylab = "Expenditure")
abline(lm(Expenditure ~ Income, data = data2), col = "magenta", lwd = 2)

ggplot(data2, aes(x = Income, y = Expenditure)) +
  geom_point(color = "darkblue", size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = "magenta") +
  theme_light() +
  labs(title = "Income vs Expenditure (million IDR)",
       x = "Income",
       y = "Expenditure")
## `geom_smooth()` using formula = 'y ~ x'

Interpretation: The scatter plot shows a positive, approximately linear relationship between monthly income and expenditure: as a household’s income increases, its expenditure tends to increase as well.
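
The visual impression can be backed with two numbers: the Pearson correlation and the slope of the fitted line. Since data.csv is not reproduced here, the sketch below uses a small hypothetical data frame (toy, with invented values) that only mimics the column names of data2; replace it with data2 to get the actual figures.

```r
# Hypothetical stand-in for data2: values are invented for illustration only
toy <- data.frame(
  Income      = c(3.0, 3.5, 4.2, 5.0, 6.1, 7.4),
  Expenditure = c(2.6, 2.9, 3.4, 3.8, 4.5, 5.1)
)

# Pearson correlation: the sign gives the direction, the magnitude the strength
r <- cor(toy$Income, toy$Expenditure)

# Slope of the least-squares line: expected change in expenditure
# for a one-unit increase in income
fit   <- lm(Expenditure ~ Income, data = toy)
slope <- unname(coef(fit)["Income"])

cat("Pearson r:", round(r, 3), "\n")
cat("Slope    :", round(slope, 3), "\n")
```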

B. Histogram

Histogram of Monthly Income

hist(x = data2$Income,
     main = "Histogram of Monthly Income",
     xlab = "Income (million IDR)",
     ylab = "Count",
     breaks = 14,
     col = "coral"
     )

ggplot(data2, aes(x = Income)) +
  geom_histogram(bins = 14, fill = "coral") +
  theme_light() +
  labs(title = "Histogram of Monthly Income",
       x = "Income (million IDR)",
       y = "Count")

Interpretation: The histogram is right-skewed: most income observations fall in the lower-to-middle range (3–5 million IDR), while a small number of high-income households form a long right tail.
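
Skewness read off a histogram can be checked numerically. Below is a minimal sketch of a moment-based skewness coefficient; sample_skewness() is a helper defined here, not a base R function. Because data.csv is not shown, it is demonstrated on the Data 1 vector, which is known to be right-skewed; passing data2$Income instead would test the claim above.

```r
# Moment-based sample skewness: positive for a long right tail,
# negative for a long left tail, zero for a symmetric sample
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  centred <- v - mean(v)
  mean(centred^3) / mean(centred^2)^1.5
}

# Expenditure vector from Data 1 (thousand IDR)
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

sample_skewness(x)      # positive here, driven by the value 8500
mean(x) > median(x)     # a quick informal check of right skew
```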

Histogram of Expenditure

hist(x = data2$Expenditure,
     main = "Histogram of Monthly Expenditure",
     xlab = "Expenditure (million IDR)",
     ylab = "Count",
     breaks = 15,
     col = "seagreen"
     )

ggplot(data2, aes(x = Expenditure)) +
  geom_histogram(bins = 15, fill = "seagreen") +
  theme_light() +
  labs(title = "Histogram of Monthly Expenditure",
       x = "Expenditure (million IDR)",
       y = "Count")

Interpretation: The histogram is slightly right-skewed: most expenditure observations fall in the lower-to-middle range (2.5–4 million IDR), with the highest frequency occurring at approximately 3.5 million IDR.

Histogram of HH_Size

hist(x = data2$HH_Size,
     main = "Histogram of Household Size",
     xlab = "HH Size (persons)",
     ylab = "Count",
     breaks = 10,
     ylim = c(0,5),
     col = "lightblue"
     )

ggplot(data2, aes(x = HH_Size)) +
  geom_histogram(bins = 15, fill = "lightblue") +
  theme_light() +
  labs(title = "Histogram of Household Size",
       x = "HH size (persons)",
       y = "Count")

Interpretation: Household size in this district is roughly symmetric and bell-shaped, resembling a normal distribution. Most households have 3 to 4 members, which points to a fairly consistent family structure in the district.
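
Household size is a count, so a frequency table can complement the histogram, whose appearance depends on how the bins happen to line up with the integers. The sketch below uses a hypothetical vector (hh_size, invented for illustration and not taken from data.csv); with the real data, data2$HH_Size would take its place.

```r
# Hypothetical household sizes: invented for illustration only
hh_size <- c(2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 6)

counts <- table(hh_size)       # one count per distinct household size
shares <- prop.table(counts)   # the same counts as proportions

print(counts)
cat("Share of households with 3 to 4 members:",
    round(sum(shares[c("3", "4")]), 2), "\n")
```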

C. Boxplot

Comparison of Income and Expenditure

boxplot(data2$Income,
        data2$Expenditure,
        names = c("Income", "Expenditure"),
        main = "Income and Expenditure Comparison",
        ylab = "Value (million IDR)",
        col = c("purple", "violet"))
grid()

ggplot(data2) +
  geom_boxplot(aes(x = "Income", y = Income), fill = "purple") +
  geom_boxplot(aes(x = "Expenditure", y = Expenditure), fill = "violet") +
  labs(title = "Boxplot of Data 2", 
       y = "Value (million IDR)", 
       x = "") +
  theme_minimal()

Interpretation: The median income exceeds the median expenditure, indicating that a typical household in this district earns more than it spends. The income box is also wider, so earnings vary more across households than spending does.
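
Both parts of this reading (higher median, wider spread) correspond to statistics that can be tabulated directly. The sketch below computes the median and IQR for each column of a hypothetical data frame (toy, with invented values that only reuse the column names of data2); substituting data2 would give the actual numbers behind the boxplots.

```r
# Hypothetical stand-in for data2: values are invented for illustration only
toy <- data.frame(
  Income      = c(3.0, 3.5, 4.2, 5.0, 6.1, 7.4),
  Expenditure = c(2.6, 2.9, 3.4, 3.8, 4.5, 5.1)
)

# One column per variable, one row per statistic
spread <- sapply(toy[c("Income", "Expenditure")],
                 function(v) c(median = median(v), IQR = IQR(v)))
print(spread)
```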

Boxplot of HH_Size

boxplot(data2$HH_Size,
        main = "Boxplot of Household Size",
        ylab = "Household size (persons)",
        col = "pink2")
grid()

ggplot(data2) +
  geom_boxplot(aes(x = "", y = HH_Size), fill = "pink2") +
  labs(title = "Boxplot of Household Size", 
       y = "HH Size (persons)", 
       x = "") +
  theme_minimal()

Interpretation: The boxplot shows a symmetric distribution of household size in this district, meaning there is no skew toward either very small or very large households.
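
Symmetry in a boxplot means the median sits midway between the quartiles, which can be expressed as a quartile-based (Bowley) skewness coefficient. The sketch below defines quartile_skewness() as a helper introduced here, not a base R function, and runs it on hypothetical vectors invented for illustration; data2$HH_Size would be the real input. Values near 0 suggest a symmetric box, while positive values suggest right skew.

```r
# Quartile (Bowley) skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)
quartile_skewness <- function(v) {
  q <- quantile(v, c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  (q[3] + q[1] - 2 * q[2]) / (q[3] - q[1])
}

# Hypothetical household sizes: invented for illustration only
hh_symmetric <- c(2, 3, 3, 4, 4, 4, 5, 5, 6)
hh_skewed    <- c(1, 1, 2, 2, 3, 5, 9)

quartile_skewness(hh_symmetric)  # median centred between the quartiles
quartile_skewness(hh_skewed)     # median closer to the lower quartile
```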