This study aims to uncover combinations of socio-economic characteristics associated with high income in New York. The data are drawn from the American Community Survey (ACS) Public Use Microdata Sample and cover a five-year period from 2019 to 2023. The variables included in the dataset allow for the analysis of relationships among individuals’ socio-economic characteristics. The analysis will be conducted using the Apriori algorithm, a method of association rule mining. Applying the Apriori algorithm will enable the identification of combinations of characteristics that are commonly associated with high income in New York.
The data come from the ACS Public Use Microdata Sample (PUMS):
https://www.census.gov/programs-surveys/acs/microdata/access.2023.html#list-tab-735824205.
The dataset covers a five-year period from 2019 to 2023. The dataset can be accessed here:
https://www2.census.gov/programs-surveys/acs/data/pums/2023/5-Year/
The csv_pny.zip archive contains the person-level records for New York. The dataset includes various socio-economic characteristics of the respondents. After preliminary preprocessing of the raw data, the following variables were selected for analysis:
Income (adjusted for inflation)
Age
Sex
Level of education
Hours worked
Marital status
In this study, the method applied is the Apriori algorithm, which is a classic algorithm for discovering associations and frequent patterns in data. The most common use of this algorithm is market basket analysis based on transactional data. However, it can also be applied to the analysis of patterns in many fields, including economics.
It is necessary to introduce three important statistics before diving into how the Apriori algorithm works. The first statistic is support. Support measures how frequently an itemset or a rule \(X\) occurs in the data:
\[ \text{Support}(X) = \frac{\text{Count}(X)}{\text{Number of observations}} \]
The second one is confidence. Confidence measures the likelihood of the occurrence of \(Y\) given \(X\):
\[ \text{Confidence}(X \implies Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \]
The last statistic is lift. Lift measures the strength of association between \(X\) and \(Y\). For instance, if lift = 3, then \(Y\) is 3 times more likely to occur given \(X\) than it is overall:
\[ \text{Lift}(X \implies Y) = \frac{\text{Confidence}(X \implies Y)}{\text{Support}(Y)} = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \cdot \text{Support}(Y)} \]
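To make the three definitions concrete, the following minimal sketch computes them by hand in base R. The five transactions and the helper function `support()` are made up for illustration and are not part of the study's data.

```r
# Toy transactions (hypothetical, for illustration only)
transactions <- list(
  c("High_Educ", "Overtime", "High_Income"),
  c("High_Educ", "Overtime"),
  c("High_Educ", "High_Income"),
  c("Overtime"),
  c("High_Educ", "Overtime", "High_Income")
)

# Support: share of transactions that contain every item in `items`
support <- function(items, transactions) {
  mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
}

X <- c("High_Educ", "Overtime")   # left-hand side of the rule
Y <- "High_Income"                # right-hand side of the rule

supp_xy    <- support(c(X, Y), transactions)        # Support(X union Y)
confidence <- supp_xy / support(X, transactions)    # Confidence(X => Y)
lift       <- confidence / support(Y, transactions) # Lift(X => Y)

print(c(support = supp_xy, confidence = confidence, lift = lift))
```

In this toy example the rule holds in 2 of 5 transactions (support 0.4) and in 2 of the 3 transactions that contain \(X\) (confidence 2/3). Since \(Y\) alone occurs in 3 of 5 transactions, the lift is only slightly above 1.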
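The Apriori algorithm works as follows. It first finds all single items whose support reaches the minimum threshold. It then repeatedly combines frequent itemsets of size \(k\) into candidate itemsets of size \(k + 1\), discards every candidate that contains an infrequent subset (no superset of an infrequent itemset can be frequent), and keeps the candidates whose support reaches the threshold. Finally, rules are generated from the frequent itemsets and retained if they reach the minimum confidence.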
By doing this, the Apriori algorithm allows us to identify which itemsets are frequently associated with the outcome of interest. For this reason, the analysis of characteristics associated with income will be conducted using this procedure.
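The level-wise search at the core of Apriori can be sketched in base R as shown below. This is a simplified illustration of the frequent-itemset phase only (no rule generation), and it is not the optimized implementation used by the arules package. The function name `apriori_sketch()` and the toy transactions are assumptions made for this example.

```r
# Simplified level-wise frequent itemset search (illustrative sketch only).
# `transactions`: list of character vectors; `min_support`: share in (0, 1].
apriori_sketch <- function(transactions, min_support) {
  support <- function(items) {
    mean(vapply(transactions, function(t) all(items %in% t), logical(1)))
  }

  # Level 1: frequent single items
  items <- sort(unique(unlist(transactions)))
  current <- Filter(function(s) support(s) >= min_support, as.list(items))
  frequent <- list()

  while (length(current) > 0) {
    frequent <- c(frequent, current)
    k <- length(current[[1]])
    keys <- vapply(current, paste, character(1), collapse = ",")

    # Join step: merge pairs of frequent k-itemsets into (k + 1)-candidates
    candidates <- list()
    for (i in seq_along(current)) {
      for (j in seq_along(current)) {
        if (i < j) {
          cand <- sort(union(current[[i]], current[[j]]))
          if (length(cand) == k + 1) {
            candidates[[paste(cand, collapse = ",")]] <- cand
          }
        }
      }
    }

    # Prune step: every k-subset of a candidate must itself be frequent
    candidates <- Filter(function(cand) {
      subsets <- combn(cand, k, simplify = FALSE)
      all(vapply(subsets, function(s) paste(s, collapse = ",") %in% keys,
                 logical(1)))
    }, candidates)

    # Count step: keep candidates that reach the minimum support
    current <- unname(Filter(function(cand) support(cand) >= min_support,
                             candidates))
  }
  frequent
}

# Hypothetical toy data
toy <- list(
  c("High_Educ", "Overtime", "High_Income"),
  c("High_Educ", "Overtime"),
  c("High_Educ", "High_Income"),
  c("Overtime"),
  c("High_Educ", "Overtime", "High_Income")
)
frequent <- apriori_sketch(toy, min_support = 0.5)
print(vapply(frequent, paste, character(1), collapse = ", "))
```

With a minimum support of 0.5, the pair {High_Income, Overtime} is infrequent (support 0.4), so the three-item candidate is pruned without its support ever being counted. This pruning is what keeps the search tractable on large datasets.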
library(arules)
library(arulesViz)
library(ggplot2)
library(knitr)
#Loading the data
data <- read.csv("psam_p36.csv")
#Let's see how many rows and columns this dataset has
dim(data)
## [1] 973112 286
The dataset is very large, consisting of nearly 1 000 000 observations and 286 columns. We will extract the columns that are crucial for the analysis and remove rows containing NA values.
data <- data[, c("PINCP", "ADJINC", "AGEP", "SCHL", "WKHP", "SEX", "MAR", "ESR")]
#Removing rows with NA values
data <- na.omit(data)
dim(data)
## [1] 503753 8
Removing rows with NA values reduced the number of observations to 503 753.
Next, let’s keep only individuals who have an income greater than 0 and who are employed (ESR codes 1, 2, 4, and 5, i.e. civilian or Armed Forces, either at work or with a job but not at work).
data <- subset(data, PINCP>0)
data <- data[data$ESR %in% c(1, 2, 4, 5), ]
As mentioned before, the dataset consists of observations from the years 2019 to 2023. This means that incomes reported in different years are not directly comparable because of factors such as inflation.
The variable ADJINC makes it possible to rescale income from different years and express it in 2023 dollars. According to the US Census Bureau instructions, it is enough to multiply income by ADJINC/1000000.
data$INC <- as.numeric(data$PINCP) * as.numeric(data$ADJINC)/1000000
kable(head(data))
| | PINCP | ADJINC | AGEP | SCHL | WKHP | SEX | MAR | ESR | INC |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 190 | 1207712 | 18 | 19 | 12 | 1 | 5 | 1 | 229.4653 |
| 9 | 5010 | 1207712 | 19 | 19 | 5 | 2 | 5 | 1 | 6050.6371 |
| 20 | 2400 | 1207712 | 19 | 19 | 40 | 2 | 5 | 1 | 2898.5088 |
| 26 | 36500 | 1207712 | 66 | 23 | 60 | 1 | 5 | 1 | 44081.4880 |
| 35 | 5000 | 1207712 | 19 | 16 | 40 | 1 | 5 | 1 | 6038.5600 |
| 44 | 24400 | 1207712 | 25 | 16 | 46 | 2 | 5 | 1 | 29468.1728 |
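As a quick sanity check of the adjustment, the first row of the table can be reproduced by hand. The variable names below are introduced only for this illustration.

```r
# Values taken from the first row of the table
pincp  <- 190       # reported income (PINCP)
adjinc <- 1207712   # adjustment factor (ADJINC), six implied decimal places

# Income expressed in 2023 dollars
inc_2023 <- pincp * adjinc / 1e6
print(inc_2023)
```

The result, 229.46528, agrees with the INC value of 229.4653 shown in the table after rounding.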
Association rule mining operates on categorical items rather than on numeric values. Therefore, the original variables described above need to be transformed into categories.
The primary goal of this study is to identify combinations of socio-economic characteristics associated with high income. This raises the question: how should high income be defined? Top 10%? Top 20%? Top 30%? The choice is subjective. In this study, income is classified as high if it is equal to or greater than the third quartile (0.75 quantile) of the filtered sample. In other words, an individual’s income is considered high if it is at least as large as that of 75% of the employed individuals in the data.
inc <- data$INC
quantile_inc <- quantile(inc, 0.75)
data$INCOME <- ifelse(data$INC>=quantile_inc, "High_Income", "No_High_Income")
data$INCOME <- as.factor(data$INCOME)
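The effect of the quantile-based threshold can be illustrated on simulated data. The sketch below uses a made-up log-normal income vector (the distribution and its parameters are assumptions, not estimates from the PUMS data) and shows that roughly a quarter of the observations end up labeled as high income.

```r
set.seed(1)

# Hypothetical incomes, for illustration only
toy_income <- round(rlnorm(1000, meanlog = 10.5, sdlog = 0.8))

# Third quartile as the cutoff, as in the study
threshold <- quantile(toy_income, 0.75)
label <- ifelse(toy_income >= threshold, "High_Income", "No_High_Income")

# Share of each class
print(prop.table(table(label)))
```

Because the cutoff is defined relative to the sample, the high-income class always contains about 25% of the observations, whatever the income distribution looks like.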
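Next, the remaining variables are transformed into categories. Under the PUMS coding, SEX = 1 denotes male, MAR = 1 denotes married, and SCHL values of 20 or above correspond to an associate’s degree or higher. Hours worked per week (WKHP) are split into part-time (below 35), full-time (35 to 40), and overtime (above 40).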
data$AGE <- ifelse(data$AGEP >= 15 & data$AGEP <= 20, "Age_15_20",
            ifelse(data$AGEP >= 21 & data$AGEP <= 30, "Age_21_30",
            ifelse(data$AGEP >= 31 & data$AGEP <= 40, "Age_31_40",
            ifelse(data$AGEP >= 41 & data$AGEP <= 50, "Age_41_50",
            ifelse(data$AGEP >= 51 & data$AGEP <= 60, "Age_51_60",
            ifelse(data$AGEP > 60, "Age_60_plus", NA))))))
data$AGE <- as.factor(data$AGE)
data$SEX <- ifelse(data$SEX == 1, "Male", "Female")
data$SEX <- as.factor(data$SEX)
data$EDUC <- ifelse(data$SCHL >= 20, "High_Educ", "No_High_Educ")
data$EDUC <- as.factor(data$EDUC)
data$HOURS <- ifelse(data$WKHP < 35, "Part_Time",
              ifelse(data$WKHP >= 35 & data$WKHP <= 40, "Full_Time",
              ifelse(data$WKHP > 40, "Overtime", NA)))
data$HOURS <- as.factor(data$HOURS)
data$MARRIED <- ifelse(data$MAR == 1, "Married", "Not_Married")
data$MARRIED <- as.factor(data$MARRIED)
data <- data[, c("AGE", "SEX", "EDUC", "HOURS", "MARRIED", "INCOME")]
kable(head(data), caption = "First 6 rows of the dataset")
| | AGE | SEX | EDUC | HOURS | MARRIED | INCOME |
|---|---|---|---|---|---|---|
| 1 | Age_15_20 | Male | No_High_Educ | Part_Time | Not_Married | No_High_Income |
| 9 | Age_15_20 | Female | No_High_Educ | Part_Time | Not_Married | No_High_Income |
| 20 | Age_15_20 | Female | No_High_Educ | Full_Time | Not_Married | No_High_Income |
| 26 | Age_60_plus | Male | High_Educ | Overtime | Not_Married | No_High_Income |
| 35 | Age_15_20 | Male | No_High_Educ | Full_Time | Not_Married | No_High_Income |
| 44 | Age_21_30 | Female | No_High_Educ | Overtime | Not_Married | No_High_Income |
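The nested ifelse() calls used for the age groups could also be written more compactly with cut(). The sketch below is an alternative, not the code used in this study. It assumes integer ages of 15 or above and runs on a small made-up vector.

```r
# Hypothetical ages chosen to hit the bin boundaries
ages <- c(15, 20, 21, 30, 31, 45, 60, 61, 80)

age_group <- cut(
  ages,
  # Right-closed intervals: (14,20], (20,30], ..., (60,Inf)
  breaks = c(14, 20, 30, 40, 50, 60, Inf),
  labels = c("Age_15_20", "Age_21_30", "Age_31_40",
             "Age_41_50", "Age_51_60", "Age_60_plus")
)

print(data.frame(ages, age_group))
```

cut() returns a factor directly, so no separate as.factor() call is needed, and the bin boundaries are stated once instead of being repeated in each condition.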
Since the dataset described above has already been processed, the next step is to transform it into a transaction-type dataset.
data <- as(data, "transactions")
summary(data)
## transactions as itemMatrix in sparse format with
## 453789 rows (elements/itemsets/transactions) and
## 17 columns (items) and a density of 0.3529412
##
## most frequent items:
## INCOME=No_High_Income EDUC=High_Educ HOURS=Full_Time
## 340179 252028 246514
## MARRIED=Married SEX=Male (Other)
## 238395 228650 1416968
##
## element (itemset/transaction) length distribution:
## sizes
## 6
## 453789
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 6 6 6 6 6
##
## includes extended item information - examples:
## labels variables levels
## 1 AGE=Age_15_20 AGE Age_15_20
## 2 AGE=Age_21_30 AGE Age_21_30
## 3 AGE=Age_31_40 AGE Age_31_40
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 9
## 3 20
inspect(data[1:5])
## items transactionID
## [1] {AGE=Age_15_20,
## SEX=Male,
## EDUC=No_High_Educ,
## HOURS=Part_Time,
## MARRIED=Not_Married,
## INCOME=No_High_Income} 1
## [2] {AGE=Age_15_20,
## SEX=Female,
## EDUC=No_High_Educ,
## HOURS=Part_Time,
## MARRIED=Not_Married,
## INCOME=No_High_Income} 9
## [3] {AGE=Age_15_20,
## SEX=Female,
## EDUC=No_High_Educ,
## HOURS=Full_Time,
## MARRIED=Not_Married,
## INCOME=No_High_Income} 20
## [4] {AGE=Age_60_plus,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Not_Married,
## INCOME=No_High_Income} 26
## [5] {AGE=Age_15_20,
## SEX=Male,
## EDUC=No_High_Educ,
## HOURS=Full_Time,
## MARRIED=Not_Married,
## INCOME=No_High_Income} 35
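Conceptually, the conversion turns every row into a set of "variable=level" items, one per column. The base R sketch below mimics this labeling on a made-up two-row data frame. It only illustrates the idea and does not reproduce the sparse matrix representation that arules uses internally.

```r
# Hypothetical two-row data frame with factor columns
toy <- data.frame(
  SEX   = c("Male", "Female"),
  HOURS = c("Overtime", "Part_Time"),
  stringsAsFactors = TRUE
)

# Build the "variable=level" item labels for each row
items <- lapply(seq_len(nrow(toy)), function(i) {
  paste(names(toy), vapply(toy[i, ], as.character, character(1)), sep = "=")
})
print(items)

# Each transaction in the study holds 6 of the 17 possible items
density <- 6 / 17
print(density)
```

Because every row contributes exactly one item per variable, each transaction has length 6, which is why the summary reports a single transaction size and a density of 6/17 (about 0.353).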
As the final step before applying the Apriori algorithm, let’s create an item frequency plot to see which characteristics occur most frequently.
itemFrequencyPlot(data, topN = 10, col = "skyblue", xlab = "Characteristics", ylab = "Frequency", main = "Characteristics Frequency")
The most frequent item is, unsurprisingly, not having high income: by construction, about 75% of individuals fall below the third-quartile threshold. It is followed by having higher education and working full-time hours.
Now that the data are prepared, let’s apply the Apriori algorithm. The minimum support, minimum confidence, and minimum rule length are set to 0.005, 0.6, and 2, respectively. The right-hand side of every rule is restricted to INCOME=High_Income, so only rules that predict high income are generated.
set.seed(123)
high_income_rules <- apriori(data, parameter = list(support = 0.005, confidence = 0.6, minlen = 2), appearance = list(rhs = "INCOME=High_Income", default = "lhs"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.005 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2268
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[17 item(s), 453789 transaction(s)] done [0.22s].
## sorting and recoding items ... [17 item(s)] done [0.02s].
## creating transaction tree ... done [0.26s].
## checking subsets of size 1 2 3 4 5 6 done [0.04s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object ... done [0.05s].
high_income_rules
## set of 20 rules
The algorithm identified exactly 20 rules that meet the specified requirements for support, confidence, and itemset length.
A summary of these rules is presented below:
summary(high_income_rules)
## set of 20 rules
##
## rule length distribution (lhs + rhs):sizes
## 4 5 6
## 5 10 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 4.75 5.00 5.00 5.25 6.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.005549 Min. :0.6005 Min. :0.00909 Min. :2.399
## 1st Qu.:0.011281 1st Qu.:0.6413 1st Qu.:0.01583 1st Qu.:2.562
## Median :0.014442 Median :0.6715 Median :0.02245 Median :2.682
## Mean :0.019683 Mean :0.6711 Mean :0.02985 Mean :2.680
## 3rd Qu.:0.021792 3rd Qu.:0.7025 3rd Qu.:0.03359 3rd Qu.:2.806
## Max. :0.060912 Max. :0.7462 Max. :0.09336 Max. :2.981
## count
## Min. : 2518
## 1st Qu.: 5119
## Median : 6554
## Mean : 8932
## 3rd Qu.: 9889
## Max. :27641
##
## mining info:
## data ntransactions support confidence
## data 453789 0.005 0.6
## call
## apriori(data = data, parameter = list(support = 0.005, confidence = 0.6, minlen = 2), appearance = list(rhs = "INCOME=High_Income", default = "lhs"))
It is now time to inspect the rules that were discovered.
inspect(high_income_rules)
## lhs rhs support confidence coverage lift count
## [1] {AGE=Age_60_plus,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.011730121 0.6613244 0.017737318 2.641508 5323
## [2] {AGE=Age_41_50,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.022378242 0.6509198 0.034379414 2.599949 10155
## [3] {AGE=Age_51_60,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.021595940 0.6479339 0.033330469 2.588023 9800
## [4] {SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.055677859 0.6215039 0.089585689 2.482454 25266
## [5] {EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.060911569 0.6524182 0.093362774 2.605934 27641
## [6] {AGE=Age_60_plus,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.008217476 0.7008081 0.011725714 2.799217 3729
## [7] {AGE=Age_60_plus,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.009054869 0.7091819 0.012768049 2.832664 4109
## [8] {AGE=Age_41_50,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.014310616 0.6938034 0.020626326 2.771238 6494
## [9] {AGE=Age_41_50,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.017488304 0.6908078 0.025315730 2.759273 7936
## [10] {AGE=Age_31_40,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.014572852 0.6005267 0.024266785 2.398666 6613
## [11] {AGE=Age_31_40,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.015626205 0.6191391 0.025238602 2.473009 7091
## [12] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.013883104 0.7076266 0.019619250 2.826452 6300
## [13] {AGE=Age_51_60,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.016734650 0.6816264 0.024551058 2.722600 7594
## [14] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## MARRIED=Married} => {INCOME=High_Income} 0.023145118 0.6052556 0.038240239 2.417554 10503
## [15] {SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.041971048 0.6985000 0.060087397 2.789998 19046
## [16] {AGE=Age_60_plus,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.006952570 0.7462157 0.009317106 2.980587 3155
## [17] {AGE=Age_41_50,
## SEX=Female,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.005548834 0.6104242 0.009090128 2.438199 2518
## [18] {AGE=Age_41_50,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.011939470 0.7358414 0.016225603 2.939149 5418
## [19] {AGE=Age_31_40,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.010315367 0.6505907 0.015855387 2.598635 4681
## [20] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.011602309 0.7369821 0.015742999 2.943705 5265
In general, higher education, working overtime, and being married are strongly associated with high income: higher education appears in all 20 rules, overtime in 19, and being married in 12.
It will be useful to sort the rules by confidence and lift to identify the strongest associations.
inspect(sort(high_income_rules, by = "confidence", decreasing = TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {AGE=Age_60_plus,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.006952570 0.7462157 0.009317106 2.980587 3155
## [2] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.011602309 0.7369821 0.015742999 2.943705 5265
## [3] {AGE=Age_41_50,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.011939470 0.7358414 0.016225603 2.939149 5418
## [4] {AGE=Age_60_plus,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.009054869 0.7091819 0.012768049 2.832664 4109
## [5] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.013883104 0.7076266 0.019619250 2.826452 6300
inspect(sort(high_income_rules, by = "lift", decreasing = TRUE)[1:5])
## lhs rhs support confidence coverage lift count
## [1] {AGE=Age_60_plus,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.006952570 0.7462157 0.009317106 2.980587 3155
## [2] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.011602309 0.7369821 0.015742999 2.943705 5265
## [3] {AGE=Age_41_50,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.011939470 0.7358414 0.016225603 2.939149 5418
## [4] {AGE=Age_60_plus,
## EDUC=High_Educ,
## HOURS=Overtime,
## MARRIED=Married} => {INCOME=High_Income} 0.009054869 0.7091819 0.012768049 2.832664 4109
## [5] {AGE=Age_51_60,
## SEX=Male,
## EDUC=High_Educ,
## HOURS=Overtime} => {INCOME=High_Income} 0.013883104 0.7076266 0.019619250 2.826452 6300
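Both orderings yield the same list. This is expected: every rule has the same right-hand side, so lift equals confidence divided by the constant Support(High_Income) (about 0.25), and ranking by one measure is equivalent to ranking by the other. Overall, the strongest rule in terms of confidence and lift is: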
{Age_60_plus, Male, High_Educ, Overtime, Married} -> {High_Income}
The confidence of this rule is approximately 0.75, meaning that about 75% of individuals with this combination of characteristics are high-income earners. The lift is approximately 2.98, indicating that people with these characteristics are nearly three times as likely to have high income as an individual in the sample overall.
Other rules with high confidence and lift are also shown above.
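The quality measures of the strongest rule can be recomputed from the counts printed earlier. The sketch below takes the rule count and coverage from the inspect() output and derives the number of high-income individuals from the item counts in summary(data). Since the coverage is a rounded value, the results match the reported figures only approximately.

```r
n          <- 453789        # total number of transactions
count_rule <- 3155          # transactions matching both LHS and RHS
coverage   <- 0.009317106   # Support(LHS), as printed by inspect()
n_high     <- n - 340179    # High_Income = all minus No_High_Income

support_rule <- count_rule / n
confidence   <- support_rule / coverage
lift         <- confidence / (n_high / n)

print(round(c(support = support_rule, confidence = confidence, lift = lift), 4))
```

The denominator n_high / n is the same for all 20 rules, which is why sorting by lift and sorting by confidence produce identical rankings here.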
It is now time to create some plots to visualize the rules uncovered by the Apriori algorithm.
Let’s start with a scatter plot, where support is on the x-axis and confidence is on the y-axis. The lift of each rule is represented using a color gradient.
plot(high_income_rules, engine = "ggplot2", main = "Scatter plot for discovered rules") +
  theme_minimal() +
  geom_point(size = 4) +
  scale_color_gradient(low = "blue", high = "red")
The main drawback of the strongest rules is their relatively low support: the top rule covers only about 0.7% of the sample (3,155 individuals). These combinations of characteristics are therefore uncommon in the data, but when they do occur, the association with high income is strong.
The second plot in this section is a parallel coordinates plot, which represents the rules as lines connecting the characteristics involved.
plot(high_income_rules, method = "paracoord", control = list(reorder = TRUE),
     main = "Parallel coordinates for discovered rules")
The primary goal of this study was to identify combinations of socio-economic characteristics associated with high income. To achieve this, the Apriori algorithm, a method of association rule mining, was applied. The data were obtained from the ACS Public Use Microdata Sample and cover observations from 2019 to 2023.
The analysis identified 20 combinations of characteristics associated with high income. The strongest rules highlighted characteristics such as advanced age, being male, having higher education, working overtime, and being married. Because even the rarest rule is supported by more than 2,500 individuals, these associations are unlikely to be artifacts of small cell sizes, although they describe co-occurrence rather than causation.
These results can serve as a valuable starting point for further analysis, helping to identify key characteristics that should be considered when studying the determinants of individual income.