library(ggplot2)
A researcher collected the following monthly household expenditures (in thousands of IDR) from 12 respondents:
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)
data1 <- data.frame(Value = x)
Plot using built-in function:
boxplot(x,
main = "Boxplot of Monthly Household Expenditures",
ylab = "Monthly expenditures (thousand IDR)",
col = "skyblue")
grid()
ggplot2 function:
ggplot(data1, aes(x = "", y = Value)) +
geom_boxplot(fill = "skyblue",
outlier.color = "tomato",
outlier.stroke = 2) +
labs(title = "Boxplot of Monthly Household Expenditures",
y = "Monthly expenditures (thousand IDR)") +
theme_light()
Outlier detection & summary:
# Tukey's rule: flag values lying more than 1.5 * IQR beyond the quartiles
iqr_val <- IQR(x)
upper_fence <- quantile(x, 0.75) + 1.5 * iqr_val
lower_fence <- quantile(x, 0.25) - 1.5 * iqr_val
# Return the values outside [low, high] as one comma-separated string
get_outliers <- function(data, low, high) {
  outs <- data[data < low | data > high]
  if (length(outs) > 0) {
    return(paste(outs, collapse = ", "))
  } else {
    return("none")
  }
}
outs_data1 <- get_outliers(x, lower_fence, upper_fence)
cat(
" ==============================\n",
" DATA SUMMARY\n",
"==============================\n",
"Minimum value :", min(x), "\n",
"Maximum value :", max(x), "\n",
"1st quartile :", quantile(x, 0.25), "\n",
"Median :", quantile(x, 0.5), "\n",
"3rd quartile :", quantile(x, 0.75), "\n",
"Mean :", mean(x), "\n",
"Lower fence :", lower_fence, "\n",
"Upper fence :", upper_fence, "\n",
"Outlier :", outs_data1, "\n",
"=============================="
)
## ==============================
## DATA SUMMARY
## ==============================
## Minimum value : 850
## Maximum value : 8500
## 1st quartile : 1212.5
## Median : 1450
## 3rd quartile : 1637.5
## Mean : 1985
## Lower fence : 575
## Upper fence : 2275
## Outlier : 8500
## ==============================
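As a cross-check on the manual fences, base R's `boxplot.stats()` reports the points that `boxplot()` itself draws beyond the whiskers. It builds its fences from Tukey's hinges (`fivenum()`) rather than from `quantile()`, so the fence values can differ slightly from the hand-computed ones even when the flagged points agree. A minimal sketch on the same vector; the names `hinge_lower`, `hinge_upper`, `q_lower`, `q_upper` and `manual_out` are introduced here only for illustration:

```r
# Same expenditure data as in the report (thousand IDR)
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

# Points that boxplot() would draw as outliers
bs <- boxplot.stats(x)
bs$out

# Fences implied by Tukey's hinges, which boxplot.stats() uses internally
hinges <- fivenum(x)[c(2, 4)]
hinge_spread <- diff(hinges)
hinge_lower <- hinges[1] - 1.5 * hinge_spread
hinge_upper <- hinges[2] + 1.5 * hinge_spread

# Fences from quantile() and IQR(), as computed by hand in the report
q_lower <- unname(quantile(x, 0.25) - 1.5 * IQR(x))
q_upper <- unname(quantile(x, 0.75) + 1.5 * IQR(x))

# Both rules should flag the same observation for this dataset
manual_out <- x[x < q_lower | x > q_upper]
identical(bs$out, manual_out)
```

For this dataset the hinge-based fences are 425 and 2425 while the quantile-based fences are 575 and 2275; both rules flag only 8500.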
The single upper outlier in dataset x (8500) pulls the mean (1985) well
above the median (1450), so the mean overstates the expenditure of a
typical respondent. A mean higher than the median is consistent with a
positively skewed distribution.
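To illustrate how strongly one observation drives the mean, the sketch below recomputes both statistics with the flagged value set aside. This is for comparison only, not a recommendation to delete the respondent; `x_without` is a name introduced here for illustration.

```r
# Same expenditure data as in the report (thousand IDR)
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

# Set aside the single flagged outlier, for comparison only
x_without <- x[x != 8500]

# Mean and median with all 12 respondents
with_outlier <- c(mean = mean(x), median = median(x))

# Mean and median with the remaining 11 respondents
without_outlier <- c(mean = mean(x_without), median = median(x_without))

rbind(with_outlier, without_outlier)
```

The mean falls from 1985 to roughly 1393, while the median only moves from 1450 to 1400, which is why the median is the more robust summary here.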
data2 <- read.csv("data.csv", header = TRUE, sep = ',')
Income vs Expenditure
plot(data2$Income,data2$Expenditure,
col = "darkblue",
pch = 19,
main = "Income vs Expenditure\n(million IDR)",
xlab = "Income",
ylab = "Expenditure")
abline(lm(Expenditure ~ Income, data = data2), col = "magenta", lwd = 2)
ggplot(data2, aes(x = Income, y = Expenditure)) +
geom_point(color = "darkblue", size = 2) +
geom_smooth(method = "lm", se = FALSE, color = "magenta") +
theme_light() +
labs(title = "Income vs Expenditure (million IDR)",
x = "Income",
y = "Expenditure")
## `geom_smooth()` using formula = 'y ~ x'
Interpretation: The scatter plot shows a positive, roughly linear relationship between monthly income and expenditure: as a household's income increases, its expenditure tends to increase as well.
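The visual impression of a positive linear relationship can be quantified with Pearson's correlation and the slope of the fitted line. A minimal sketch follows. Because data.csv is not reproduced here, the data frame `demo` is simulated as a hypothetical stand-in for data2; its values and the coefficients used to generate it are assumptions, not results from the survey.

```r
set.seed(1)

# Hypothetical stand-in for data2: simulated values, NOT taken from data.csv
demo <- data.frame(Income = runif(30, min = 2, max = 10))
demo$Expenditure <- 0.5 + 0.6 * demo$Income + rnorm(30, sd = 0.4)

# Pearson correlation: strength and direction of the linear association
r <- cor(demo$Income, demo$Expenditure)

# Slope of the least-squares line: expected change in expenditure
# per one-unit increase in income
fit <- lm(Expenditure ~ Income, data = demo)
slope <- unname(coef(fit)["Income"])

c(correlation = r, slope = slope)
```

With the real data, the same two calls would be `cor(data2$Income, data2$Expenditure)` and `lm(Expenditure ~ Income, data = data2)`.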
hist(x = data2$Income,
main = "Histogram of Monthly Income",
xlab = "Income (million IDR)",
ylab = "Count",
breaks = 14,
col = "coral"
)
ggplot(data2, aes(x = Income)) +
geom_histogram(bins = 14, fill = "coral") +
theme_light() +
labs(title = "Histogram of Monthly Income",
x = "Income (million IDR)",
y = "Count")
Interpretation: The income distribution is right-skewed: most observations are concentrated in the lower to middle range (3–5 million IDR), while a small number of high-income households stretch the tail toward the right.
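A skew read off a histogram can be backed by a number. The sketch below computes the moment-based skewness (third central moment divided by the 1.5 power of the second central moment); `sample_skewness` is a helper written for this illustration, not a base R function. It is demonstrated on the expenditure vector x from the first part, and could be applied to `data2$Income` in the same way.

```r
# Moment-based skewness: > 0 suggests a longer right tail,
# < 0 a longer left tail, 0 a symmetric sample
sample_skewness <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^1.5
}

# Expenditure data from the first part (thousand IDR)
x <- c(850, 920, 1100, 1250, 1300, 1400, 1500, 1550, 1600, 1750, 2100, 8500)

skew_x <- sample_skewness(x)                 # positive: right-skewed
skew_sym <- sample_skewness(c(1, 2, 3, 4, 5))  # symmetric sample gives 0

c(x = skew_x, symmetric = skew_sym)
```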
hist(x = data2$Expenditure,
main = "Histogram of Monthly Expenditure",
xlab = "Expenditure (million IDR)",
ylab = "Count",
breaks = 15,
col = "seagreen"
)
ggplot(data2, aes(x = Expenditure)) +
geom_histogram(bins = 15, fill = "seagreen") +
theme_light() +
labs(title = "Histogram of Monthly Expenditure",
x = "Expenditure (million IDR)",
y = "Count")
Interpretation: The expenditure distribution is slightly right-skewed: most observations are concentrated in the lower to middle range (2.5–4 million IDR), with the highest frequency occurring at approximately 3.5 million IDR.
hist(x = data2$HH_Size,
main = "Histogram of Household Size",
xlab = "HH Size (persons)",
ylab = "Count",
breaks = 10,
ylim = c(0,5),
col = "lightblue"
)
ggplot(data2, aes(x = HH_Size)) +
geom_histogram(bins = 15, fill = "lightblue") +
theme_light() +
labs(title = "Histogram of Household Size",
x = "HH size (persons)",
y = "Count")
Interpretation: Household size in this district is approximately normally distributed. The histogram shows that most households have 3 to 4 members, which suggests a fairly consistent family structure in the district.
boxplot(data2$Income,
data2$Expenditure,
names = c("Income", "Expenditure"),
main = "Income and Expenditure Comparison",
ylab = "Value (million IDR)",
col = c("purple", "violet"))
grid()
ggplot(data2) +
geom_boxplot(aes(x = "Income", y = Income), fill = "purple") +
geom_boxplot(aes(x = "Expenditure", y = Expenditure), fill = "violet") +
labs(title = "Income and Expenditure Comparison",
y = "Value (million IDR)",
x = "") +
theme_minimal()
Interpretation: The median income exceeds the median expenditure, indicating that a typical household in this district earns more than it spends. In addition, the income boxplot has the wider spread, suggesting varied earning capacities but relatively uniform spending patterns.
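The centre and spread compared visually in the boxplots can also be tabulated. A minimal sketch using `median()` and `IQR()` per column follows. The data frame `demo` is simulated as a hypothetical stand-in for data2 (its values are not from data.csv), and `summary_tbl` is just an illustrative name.

```r
set.seed(2)

# Hypothetical stand-in for data2: simulated values, NOT taken from data.csv
demo <- data.frame(
  Income      = rlnorm(40, meanlog = log(5), sdlog = 0.4),
  Expenditure = rnorm(40, mean = 3.5, sd = 0.5)
)

# One column per variable, one row per statistic
summary_tbl <- sapply(
  demo[c("Income", "Expenditure")],
  function(v) c(median = median(v), IQR = IQR(v))
)

round(summary_tbl, 2)
```

A larger IQR for one variable corresponds to a taller box in the plot; with the real data the same call would take `data2[c("Income", "Expenditure")]`.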
boxplot(data2$HH_Size,
main = "Boxplot of Household Size",
ylab = "Household size (persons)",
col = "pink2")
grid()
ggplot(data2) +
geom_boxplot(aes(x = "", y = HH_Size), fill = "pink2") +
labs(title = "Boxplot of Household Size",
y = "HH Size (persons)",
x = "") +
theme_minimal()
Interpretation: The boxplot shows a roughly symmetrical distribution of household size in this district, meaning the data are not skewed toward either very small or very large households.