Income and Spending Dynamics: A Case Study Using FIES 2023 Data
Understanding household financial behavior is important for formulating policies on poverty, inequality, and sustainable development. In this case study, we investigate the relationship between total income, regular salary, and total expenditure of Filipino households based on data from the 2023 Family Income and Expenditure Survey (FIES) of the Philippine Statistics Authority (PSA). By examining how income and salary levels relate to household consumption, this study gives an understanding of what drives consumption behavior among various income segments. You can download the file by clicking this link.
Data importing and sampling
We examine four primary economic indicators from Region 3: total income (TOINC), regular salary (REG_SAL), total expenditure (TOTEX), and urbanity (URB). Urbanity takes the values 0 and 1, where 0 represents urban areas and 1 represents rural areas. Using scatter plot visualizations and statistical modeling, we examine to what extent household spending is linked to income, and whether earnings from salaries alone are adequate to cover a family's expenses. Such relationships are crucial for assessing income adequacy, consumption patterns, and the financial burden faced by urban and rural families.
Furthermore, the following libraries were utilized to conduct the analysis.
library(tidyverse) # for data wrangling, importing, visualizations, and data cleaning
library(gridExtra) # scatter plot arrangement
library(lmtest) # testing equality of variance
library(car) # for the Durbin-Watson test
library(nortest) # for normality tests
library(pander) # for clean tables
The dataset is then imported, and a seed is initialized to maintain reproducibility of results.
set.seed(600) # for reproducibility
data <- read.csv("E:\\PHL-PSA-FIES-2023-V2-PUF\\FIES PUF 2023 Volume2 Household Summary.CSV")
dataset <- data %>% filter(W_REGN == 3) %>% filter(REG_SAL != 0) %>%
select(REG_SAL, TOINC, TOTEX, URB) %>% drop_na()
Once the relevant variables have been filtered and selected, a random sample of 200 households is drawn. The column names are then renamed to enhance clarity and interpretability.
# 0 = urban, 1 = rural
dataset <- dataset %>% mutate(
URB = factor(URB, levels = c(1, 2), labels = c(0, 1))
) %>% sample_n(200)
colnames(dataset) <- c("Regular_salary",
"Total_income",
"Total_expenditure",
"Urbanity")
head(dataset, n = 10) # showing the first 10 rows
Regular_salary Total_income Total_expenditure Urbanity
1 84000 219553.0 189823.0 1
2 54000 173288.0 173471.0 1
3 168200 515977.5 381693.5 1
4 107800 182520.0 189936.0 1
5 85800 342683.0 329026.0 1
6 131000 184200.0 186667.0 0
7 208152 251352.0 147526.0 0
8 70200 315510.0 255098.0 1
9 409000 428840.0 355670.0 0
10 357190 430040.0 322179.0 0
From the result, we see that Regular_salary is stored as an integer, while Total_income and Total_expenditure are stored as doubles, which means these three variables are continuous. In contrast, the Urbanity variable is binary with values 0 and 1, where 0 denotes urban areas and 1 denotes rural areas in Region 3, so Urbanity is treated as a categorical (factor) variable.
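To confirm these types directly, a quick structural check can be run; this is a minimal optional sketch using dplyr's glimpse(), which is loaded with the tidyverse.
# quick check of column types after renaming and factor conversion
glimpse(dataset)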
Exploratory Data Analysis (EDA)
As part of our exploratory data analysis, we start by checking whether each continuous variable is normally distributed. This is crucial since most parametric statistical procedures rest on the assumption of normality. We then investigate relationships between variables to identify possible patterns, trends, or associations in the data, and we calculate correlation coefficients to quantify the strength and direction of the linear associations between the continuous variables. Together, these steps give us better insight into the structure of the dataset and inform subsequent modeling and analysis choices.
summary(dataset) %>% pander()
| Regular_salary | Total_income | Total_expenditure | Urbanity |
|---|---|---|---|
| Min. : 9900 | Min. : 96642 | Min. : 84499 | 0:136 |
| 1st Qu.: 107190 | 1st Qu.: 211820 | 1st Qu.: 189034 | 1: 64 |
| Median : 170930 | Median : 322373 | Median : 273972 | NA |
| Mean : 253498 | Mean : 414754 | Mean : 318083 | NA |
| 3rd Qu.: 310804 | 3rd Qu.: 510789 | 3rd Qu.: 368542 | NA |
| Max. :2157600 | Max. :2446870 | Max. :1393510 | NA |
From the findings, the Urbanity variable indicates that 68% (136 out of 200) of the population sample are from urban locations, while the other 32% (64 out of 200) are from rural places. This suggests a higher concentration of urban families in the data, which could affect earnings and consumption patterns given variations in job access, commodities, and services.
In terms of overall spending, the amounts vary from a low of PHP 84,499 to a high of PHP 1,393,510, averaging PHP 318,083. The large range is evidence of great fluctuation in household spending, most probably attributable to differences in lifestyle, household size, and economic means.
As for overall income, the sample has a minimum of PHP 96,642 and a maximum of PHP 2,446,870, with a mean income of PHP 414,754. The large difference between highest and lowest income indicates economic inequality among the population, which may be of interest in association with urbanity or other demographic variables.
For the regular salary variable, the values range from a low of PHP 9,900 to a high of PHP 2,157,600, with a mean of PHP 253,498. The comparatively lower mean relative to total income suggests that many households depend on income sources other than regular salaries, such as informal labor, remittances, or earnings from businesses.
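To make this point concrete, the share of total income accounted for by regular salary can be computed directly from the sample; this is a small illustrative sketch that was not part of the original analysis.
# illustrative check: what fraction of total income comes from regular salary?
dataset %>%
  mutate(salary_share = Regular_salary / Total_income) %>%
  summarise(mean_share = mean(salary_share),
            median_share = median(salary_share))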
Next, we test for normality using the Lilliefors-corrected Kolmogorov-Smirnov (KS) test. In this section we wish to determine whether the variables Total_expenditure, Total_income, and Regular_salary are normally distributed in the sample.
spend <- lillie.test(dataset$Total_expenditure)
income <- lillie.test(dataset$Total_income)
salary <- lillie.test(dataset$Regular_salary)
res1 <- data.frame(
Variable = c("Total Expenditure", "Total Income", "Regular Salary"),
Statistic = c(spend$statistic, income$statistic, salary$statistic),
P_Value = c(spend$p.value, income$p.value, salary$p.value)
)
pander(res1)
| Variable | Statistic | P_Value |
|---|---|---|
| Total Expenditure | 0.1521 | 3.363e-12 |
| Total Income | 0.1773 | 7.368e-17 |
| Regular Salary | 0.1926 | 4.51e-20 |
According to the results, none of the continuous variables is normally distributed, as evidenced by the very small p-values of the normality tests. Although the Kolmogorov-Smirnov (KS) statistics themselves are modest, the significant results (p-values well below the typical alpha level of 0.05) indicate that the observed data deviate considerably from a normal distribution.
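As a complementary visual check that is not part of the original output, histograms of the three variables can make the skewness suggested by the KS results visible; a minimal sketch.
# visual check of the distributions flagged by the Lilliefors test
dataset %>%
  pivot_longer(cols = c(Regular_salary, Total_income, Total_expenditure),
               names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "black") +
  facet_wrap(~ variable, scales = "free_x") +
  theme_minimal()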
Up next, we wish to determine the relationship of the variables with each other. We can show this by correlation and with the use of scatter plots.
plot1 <- ggplot(dataset, aes(x = Total_income, y = Total_expenditure)) +
geom_point(color = "steelblue") +
labs(
title = "Total income VS Total expenditure",
x = "Total Income",
y = "Total expenditure"
) +
theme_minimal()
plot2 <- ggplot(dataset, aes(x = Total_expenditure, y = Regular_salary)) +
geom_point(color = "darkgreen") +
labs(
title = "Total expenditure VS Regular Salary",
x = "Total Expenditure",
y = "Regular Salary"
) +
theme_minimal()
plot3 <- ggplot(dataset, aes(x = Regular_salary, y = Total_income)) +
geom_point(color = "gold") +
labs(
title = "Regular salary VS Total income",
x = "Regular Salary",
y = "Total Income"
) +
theme_minimal()
plot4 <- ggplot(dataset, aes(x = Urbanity)) +
geom_bar(fill = "maroon", color = "black") +
geom_text(stat = "count", aes(label = after_stat(count)),
color = "black",
hjust = 0.5,
size = 5) +
labs(
title = "Frequency of Urbanized (0) VS Rural (1)",
x = "Urbanity",
y = "Frequency"
) +
theme_minimal()
grid.arrange(plot1, plot2, plot3, plot4, ncol = 2)
Looking at the graphs, we notice that several pairs of variables display a linear trend, most notably Total Income vs. Total Expenditure and Regular Salary vs. Total Income. These trends indicate positive correlation, where movement in one variable tends to be accompanied by movement in the other. A few outliers can be seen on the plots; further analysis will determine whether these values should be kept or removed based on how much they affect the model.
Conversely, the linearity between Total Expenditure and Regular Salary is not apparent, showing that salary itself might not be able to directly account for differences in household expenditure, perhaps owing to the impact of other sources of income or different habits of consumption.
Finally, because categorical variables such as Urbanity cannot be represented meaningfully in scatter plots, they are best represented using bar charts to determine their distribution. In this instance, the bar graph findings are in line with the previous summary, substantiating the previous conclusions regarding the sample composition.
Lastly, for our exploratory data analysis, we conduct correlation tests to determine the association between the continuous variables.
ct1 <- cor.test(dataset$Regular_salary, dataset$Total_income)
ct2 <- cor.test(dataset$Regular_salary, dataset$Total_expenditure)
ct3 <- cor.test(dataset$Total_income, dataset$Total_expenditure)
res2 <- data.frame(
Variable = c("Regular_salary vs Total_income",
"Regular_salary vs Total_expenditure",
"Total_income vs Total_expenditure"),
Pearson.R = c(ct1$estimate, ct2$estimate, ct3$estimate),
p.value = c(ct1$p.value, ct2$p.value, ct3$p.value)
)
pander(res2)
| Variable | Pearson.R | p.value |
|---|---|---|
| Regular_salary vs Total_income | 0.7865 | 2.646e-43 |
| Regular_salary vs Total_expenditure | 0.7551 | 3.707e-38 |
| Total_income vs Total_expenditure | 0.9304 | 2.964e-88 |
A strong and statistically significant association was found between regular salary and total income, r(198) = .79, p < .001. Similarly, regular salary and total expenditure were also strongly correlated, r(198) = .76, p < .001. Lastly, total income and total expenditure demonstrated a very strong correlation, r(198) = .93, p < .001, indicating a close linear relationship between these two financial indicators.
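Because the normality tests above indicated skewed distributions, a rank-based correlation can serve as a robustness check alongside Pearson's r; this brief sketch was not part of the original analysis.
# Spearman rank correlations as a robustness check for the skewed variables
dataset %>%
  select(Regular_salary, Total_income, Total_expenditure) %>%
  cor(method = "spearman") %>%
  pander()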
Modelling
Here, we discuss multiple linear regression, a statistical procedure for determining the relationship between two or more independent variables and a continuous dependent variable. This is a commonly used technique in healthcare, marketing, and the social sciences for predicting or explaining outcomes on the basis of multiple predictors. We will explore its key ideas, assumptions, model result interpretation, and typical uses of multiple linear regression in applied research and analysis.
model <- lm(Total_income ~ Urbanity + Regular_salary + Total_expenditure,
data = dataset)
summary(model) %>% pander()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -106697 | 17693 | -6.03 | 8.003e-09 |
| Urbanity1 | 64434 | 17259 | 3.733 | 0.0002476 |
| Regular_salary | 0.2472 | 0.04582 | 5.394 | 1.974e-07 |
| Total_expenditure | 1.378 | 0.06269 | 21.98 | 8.937e-55 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 200 | 110703 | 0.8899 | 0.8882 |
| F.Statistic | df1 | df2 | p.value |
|---|---|---|---|
| 528 | 3 | 196 | < 1e-05 |
A multiple linear regression was conducted to examine the effect of urbanity, regular salary, and total expenditure on total income. The overall model was statistically significant, F(3, 196) = 528, p < .001, indicating that the predictors collectively explain a significant portion of the variance in total income.
Urbanity was a significant predictor, b = 64,434, p < .001. Because the dummy Urbanity1 indicates rural households (Urbanity = 1, with urban households as the reference level), this suggests that, holding the other variables constant, rural households in the sample have an estimated PHP 64,434 higher total income than urban households.
Regular salary also significantly predicted total income, b = 0.2472, p < .001, indicating that for every PHP 1 increase in regular salary, total income increases by approximately PHP 0.25.
Similarly, total expenditure was a significant predictor, b = 1.378, p < .001, meaning that each additional PHP 1 in expenditure is associated with a PHP 1.38 increase in total income.
The model explained a substantial proportion of variance in total income, R² = .89, indicating that approximately 88.99% of the variability in total income was accounted for by the predictors.
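For completeness, 95% confidence intervals for these coefficient estimates can be obtained with confint(); a short optional sketch.
# 95% confidence intervals for the regression coefficients
confint(model) %>% pander()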
Assumptions
In this section, we examine the model assumptions: linearity of the relationship between the residuals and fitted values, equality of variance (homoscedasticity), normality of residuals, independence of residuals, and, finally, the absence of multicollinearity.
resid1 <- model$residuals
pred1 <- model$fitted.values
assumptions1 <- data.frame(resid1, pred1)
# linearity
ggplot(assumptions1, aes(x = pred1, y = resid1)) +
geom_point(color = "blue") +
labs(
title = "Fitted value VS Residuals",
x = "Fitted values",
y = "Residuals"
) +
theme_minimal()
lillie.test(resid1)
Lilliefors (Kolmogorov-Smirnov) normality test
data: resid1
D = 0.14021, p-value = 2.864e-10
Taken together, these diagnostics indicate violations of three key regression assumptions: linearity, homoscedasticity, and normality. The funnel-shaped pattern of the residuals points to unequal variance, and the KS (Lilliefors) test returned a p-value below 0.05, meaning the residuals are not normally distributed. These violations affect the model's validity and interpretation. To correct them, transformation of the dependent and independent variables may be required to improve compliance with the assumptions.
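For reference, the standard base-R diagnostic panel shows the same problems at a glance (residuals vs. fitted, Q-Q, scale-location, and leverage plots); a minimal optional sketch.
# base R diagnostic panel for the untransformed model (optional check)
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))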
Transformation of the variables
As the linearity, homoscedasticity, and normality assumptions for the initial model were not fulfilled, variable transformation is required to enhance model validity. One of the widely used methods to resolve these violations is the use of a logarithmic transformation, which can linearize relationships among variables and stabilize variance. In this instance, the transformation is specifically applicable because of the funnel-shaped form of the residual plot and the significant p-value of the KS test. By log transforming the variables, we seek to minimize skewness, deal with outliers, and satisfy the assumptions necessary for sound regression analysis. Thus, the log transformation will be used for the respective variables in the model.
model2 <- lm(log(Total_income) ~ Urbanity + log(Regular_salary) +
log(Total_expenditure),
data = dataset)
summary(model2) %>% pander()
| | Estimate | Std. Error | t value | Pr(>|t|) |
|---|---|---|---|---|
| (Intercept) | -1.004 | 0.3291 | -3.049 | 0.002611 |
| Urbanity1 | 0.09635 | 0.02908 | 3.313 | 0.001099 |
| log(Regular_salary) | 0.1386 | 0.02175 | 6.371 | 1.308e-09 |
| log(Total_expenditure) | 0.9602 | 0.03468 | 27.68 | 1.186e-69 |
| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 200 | 0.1826 | 0.9093 | 0.9079 |
| F.Statistic | df1 | df2 | p.value |
|---|---|---|---|
| 655.1 | 3 | 196 | < 1e-05 |
According to the model that has been transformed, the findings are that all independent variables are still statistically significant predictors of total income. Even if the parameter estimates are somewhat different from the original model, the transformed model is still strong. Interestingly, the model attained a stronger coefficient of determination (R² = 0.9093), implying that about 90.93% of total income variation is accounted for by the predictors—an improvement in the fit of the model over the original.
Moreover, the transformed model produced a larger F-statistic (F = 655.1, p < .00001), further indicating its overall statistical significance and improved predictive ability.
As for individual predictors, the regression results indicate that all variables contribute significantly to the prediction of total income. Because the outcome and the continuous predictors are log-transformed, the coefficients are interpreted as elasticities: holding the other variables constant, a 1% increase in regular salary is associated with roughly a 0.14% increase in total income, while a 1% increase in total expenditure is associated with roughly a 0.96% increase. The Urbanity coefficient of 0.09635 implies that rural households (Urbanity = 1) have total incomes about 10% higher than urban households with the same salary and expenditure, since exp(0.09635) - 1 ≈ 0.10. The intercept of -1.004 is the predicted log of total income when all predictors equal zero (i.e., a salary and expenditure of PHP 1), so it has no direct practical interpretation.
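The percentage interpretation of the Urbanity dummy can be computed directly from the stored coefficients; a small sketch, assuming the coefficient name "Urbanity1" as it appears in the summary table.
# convert the log-scale dummy coefficient to an approximate percentage effect
(exp(coef(model2)["Urbanity1"]) - 1) * 100 # about 10.1% higher income for rural households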
Assumptions
In order to further test the validity and reliability of our new transformed model, we will determine if it meets the assumptions required, as was the case with the original untransformed model.
pred2 <- model2$fitted.values
resid2 <- model2$residuals
assumptions2 <- data.frame(pred2, resid2)
# assumptions
# linearity
ggplot(assumptions2, aes(x = pred2, y = resid2)) +
geom_point(color = "darkgreen") +
labs(
title = "Predicted value VS Residuals",
x = "Predicted values",
y = "Residuals"
) +
theme_minimal()
As can be seen from the plot above, the residuals do not show any funneling or obvious trend. This indicates that the linearity assumption is fairly satisfied in the transformed model. It also gives an early insight into homoscedasticity. Nevertheless, for this to be confirmed, a standard statistical test will be used afterwards.
# independence of errors
durbinWatsonTest(resid2) %>% pander()
2.058
Second, we examine the independence of error terms, an important assumption in regression analysis. To check for it, the Durbin-Watson test was used. The Durbin-Watson statistic lies between 0 and 4, and its value near 2 means residuals are uncorrelated (i.e., independent). A value much less than 2 implies positive autocorrelation, and a value greater than 2 implies negative autocorrelation. From this analysis, it was found that the test gave a reading of 2.058, very close to 2. This implies that the residuals have no serious autocorrelation and hence confirm the assumption that the error terms are relatively independent. This enhances the strength of the model’s reliability since non-independence of residuals would lead to distortion of inference and model validity.
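As an aside, applying car's durbinWatsonTest() to the fitted lm object instead of the raw residual vector also reports a bootstrap-based p-value alongside the statistic; a brief optional sketch.
# Durbin-Watson test on the fitted model object, which also reports a p-value
durbinWatsonTest(model2)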
# normality of errors
plot(model2, which = 2)
A Q–Q plot was created in order to assess the normality of the residuals in the transformed regression model. Although most of the points lie on the 45-degree reference line, deviations are evident at the lower and upper tails of the distribution. Observations 181 and 188, in particular, stand out as outliers above the line at the upper tail, indicating positive skewness. These discrepancies suggest that the residuals may not be perfectly normally distributed. Still, the departures from normality are minor, and the distribution is close enough to normal to be acceptable for regression analysis, particularly given the relatively large sample size.
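To follow up on the flagged observations (181 and 188), their influence on the fitted model can be gauged with Cook's distance; a minimal sketch using the common 4/n rule of thumb, which is an assumption rather than part of the original analysis.
# influence check for the potential outliers seen in the Q-Q plot
cooks_d <- cooks.distance(model2)
which(cooks_d > 4 / nrow(dataset)) # observations exceeding the 4/n rule of thumb
plot(model2, which = 4) # Cook's distance plot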
# Equal variance
bptest(model2)
Breusch-Pagan Test:
BP statistic = 7.9276
p-value = 0.05
A Breusch–Pagan test was also run to assess the homoscedasticity assumption of the regression model. The result was BP(3) = 7.93, p = .05, so the assumption of constant residual variance is not rejected at the standard significance level. Still, because the p-value sits right at the borderline, the result should be interpreted with care, and additional steps to safeguard the robustness of the findings are advisable.
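Given this borderline result, one cautious option is to re-estimate the coefficient standard errors with a heteroskedasticity-consistent (HC) covariance matrix; the sketch below assumes the sandwich package is installed, which is not among the libraries loaded earlier.
library(sandwich) # assumed installed; provides heteroskedasticity-consistent covariance estimators
# coefficient tests with robust (HC3) standard errors via lmtest::coeftest
coeftest(model2, vcov. = vcovHC(model2, type = "HC3"))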
# multicollinearity
vif(model2) %>% pander()
| Urbanity | log(Regular_salary) | log(Total_expenditure) |
|---|---|---|
| 1.104 | 1.921 | 1.957 |
To check for multicollinearity among the independent variables, Variance Inflation Factor (VIF) values were examined. Urbanity recorded a VIF of 1.104, Regular Salary a VIF of 1.921, and Total Expenditure a VIF of 1.957. All values are well below the widely accepted cut-off of 10, confirming that multicollinearity is not an issue in this model. The low VIF values indicate that the independent variables are not highly correlated with one another, which makes it easier to interpret their separate effects on the dependent variable.
Conclusion
In summary, Urbanity, Regular Salary, and Total Expenditure all proved to be statistically significant predictors of a household's total income in Region 3. This implies that as a household's regular salary goes up, its total income follows suit, a logical but noteworthy validation. Likewise, as total expenditure grows, so does total income, which might indicate the higher earning capacity that enables increased spending, and perhaps vice versa. Interestingly, after controlling for salary and expenditure, rural households in the sample show a somewhat higher estimated total income than urban ones. Overall, these results imply that Region 3 is a viable alternative to Metro Manila, with the prospect of improved financial conditions and a generally better quality of living.