Introduction

This study aims to uncover combinations of socio-economic characteristics associated with high income in New York. The data are drawn from the American Community Survey (ACS) Public Use Microdata Sample and cover a five-year period from 2019 to 2023. The variables included in the dataset allow for the analysis of relationships among individuals’ socio-economic characteristics. The analysis will be conducted using the Apriori algorithm, a method of association rule mining. Applying the Apriori algorithm will enable the identification of combinations of characteristics that are commonly associated with high income in New York.

Chapter 1. Source and Description of the Dataset

The data come from the ACS Public Use Microdata Sample (PUMS):

https://www.census.gov/programs-surveys/acs/microdata/access.2023.html#list-tab-735824205.

The dataset covers a five-year period from 2019 to 2023. The dataset can be accessed here:

https://www2.census.gov/programs-surveys/acs/data/pums/2023/5-Year/

The csv_pny.zip archive contains the person-level data for New York (the file psam_p36.csv used below). The dataset includes various socio-economic characteristics of the respondents. After preliminary preprocessing of the raw data, the following variables were selected for analysis:

Income (adjusted for inflation)

Age

Sex

Level of education

Hours worked

Marital status

Chapter 2. Methodology - Apriori Algorithm

In this study, the method applied is the Apriori algorithm, which is a classic algorithm for discovering associations and frequent patterns in data. The most common use of this algorithm is market basket analysis based on transactional data. However, it can also be applied to the analysis of patterns in many fields, including economics.

It is necessary to introduce three important statistics before diving into how the Apriori algorithm works. The first statistic is support. Support measures how frequently an itemset or a rule \(X\) occurs in the data:

\[ \text{Support}(X) = \frac{\text{Count}(X)}{\text{Number of observations}} \]

The second one is confidence. Confidence measures the likelihood of the occurrence of \(Y\) given \(X\):

\[ \text{Confidence}(X \implies Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} \]

The last statistic is lift. Lift measures the strength of association between \(X\) and \(Y\). For instance, if lift = 3, then \(Y\) is 3 times more likely to occur given \(X\) than it is overall:

\[ \text{Lift}(X \implies Y) = \frac{\text{Confidence}(X \implies Y)}{\text{Support}(Y)} = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \cdot \text{Support}(Y)} \]
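
To make the three statistics concrete, the sketch below computes them by hand for a single rule, {Overtime} => {High_Income}, on a made-up table of ten people. The data and the variable names (`toy`, `supp_x`, and so on) are assumptions chosen for illustration and are not taken from the ACS sample.

```r
# Hypothetical toy data: one row per person, TRUE if the characteristic applies
toy <- data.frame(
  Overtime    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  High_Income = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

supp_x  <- mean(toy$Overtime)                    # Support(X): 4 of 10 rows
supp_y  <- mean(toy$High_Income)                 # Support(Y): 4 of 10 rows
supp_xy <- mean(toy$Overtime & toy$High_Income)  # X and Y together: 3 of 10 rows

confidence <- supp_xy / supp_x                   # share of overtime workers with high income
lift       <- confidence / supp_y                # confidence relative to the base rate of Y

print(round(c(support = supp_xy, confidence = confidence, lift = lift), 3))
```

In this toy table, three of the four overtime workers have high income, so confidence is 0.75, and since only 40% of all rows have high income, the lift is 0.75 / 0.4 = 1.875.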

The Apriori algorithm works as follows.

  1. Minimum thresholds for support and confidence (and optionally the minimum length of the left-hand side) are specified.
  2. The algorithm iteratively generates candidate itemsets and keeps only those whose support exceeds the minimum threshold, using the Apriori principle that all subsets of a frequent itemset must also be frequent.
  3. From the resulting frequent itemsets, all possible association rules are generated.
  4. The rules are filtered to keep only those that meet the minimum confidence threshold.

By doing this, the Apriori algorithm allows us to identify which itemsets are frequently associated with the outcome of interest. For this reason, the analysis of characteristics associated with income will be conducted using this procedure.
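
The four steps can be sketched in a few lines of base R. This is a simplified illustration on hypothetical transactions, not the implementation used by the arules package: the helper names `support_of` and `frequent_itemsets` are invented here, and rules are generated only for one fixed right-hand side.

```r
# Hypothetical toy transactions (not drawn from the ACS data)
transactions <- list(
  c("High_Educ", "Overtime", "High_Income"),
  c("High_Educ", "Overtime", "High_Income"),
  c("High_Educ", "Full_Time"),
  c("Overtime", "High_Income"),
  c("High_Educ", "Overtime"),
  c("Full_Time")
)

# Support: share of transactions containing every item of the itemset
support_of <- function(itemset, trans) {
  mean(vapply(trans, function(tr) all(itemset %in% tr), logical(1)))
}

# Step 2: level-wise search for frequent itemsets
frequent_itemsets <- function(trans, min_support) {
  candidates <- as.list(sort(unique(unlist(trans))))
  frequent <- list()
  while (length(candidates) > 0) {
    supp <- vapply(candidates, support_of, numeric(1), trans = trans)
    keep <- candidates[supp >= min_support]
    frequent <- c(frequent, keep)
    if (length(keep) < 2) break
    k <- length(keep[[1]]) + 1
    keep_keys <- vapply(keep, paste, character(1), collapse = ",")
    joined <- list()
    for (i in seq_len(length(keep) - 1)) {
      for (j in (i + 1):length(keep)) {
        u <- sort(union(keep[[i]], keep[[j]]))
        if (length(u) != k) next
        # Apriori principle: drop candidates that have an infrequent (k-1)-subset
        subsets  <- combn(u, k - 1, simplify = FALSE)
        sub_keys <- vapply(subsets, paste, character(1), collapse = ",")
        if (all(sub_keys %in% keep_keys)) joined[[paste(u, collapse = ",")]] <- u
      }
    }
    candidates <- unname(joined)
  }
  frequent
}

# Step 1: thresholds (illustrative values)
freq <- frequent_itemsets(transactions, min_support = 0.3)

# Steps 3 and 4: build rules with a fixed right-hand side, then filter by confidence
rhs <- "High_Income"
with_rhs <- Filter(function(s) rhs %in% s && length(s) >= 2, freq)
rules <- do.call(rbind, lapply(with_rhs, function(s) {
  lhs <- setdiff(s, rhs)
  data.frame(
    lhs        = paste(lhs, collapse = ", "),
    support    = support_of(s, transactions),
    confidence = support_of(s, transactions) / support_of(lhs, transactions)
  )
}))
rules <- rules[rules$confidence >= 0.6, ]
print(rules)
```

The pruning inside the double loop is what keeps the search tractable: a three-item candidate is only counted against the data if all of its two-item subsets were already found to be frequent.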

Chapter 3. Data Preprocessing. Transforming Numeric Data into Tabular Format

3.1. Loading the data. Preliminary Data Preprocessing

library(arules)
library(arulesViz)
library(ggplot2)
library(knitr)

#Loading the data
data <- read.csv("psam_p36.csv")

#Let's see how many rows and columns this dataset has
dim(data)
## [1] 973112    286

The dataset is very large, consisting of nearly 1 000 000 observations and 286 columns. We will extract the columns that are crucial for the analysis and remove rows containing NA values.

data <- data[, c("PINCP", "ADJINC", "AGEP", "SCHL", "WKHP", "SEX", "MAR", "ESR")]

#Removing rows with NA values
data <- na.omit(data)
dim(data)
## [1] 503753      8

Removing rows with NA values reduced the number of observations to 503 753.

Next, let’s keep only individuals with an income greater than 0 who are employed (ESR codes 1, 2, 4 and 5).

data <- subset(data, PINCP>0)
data <- data[data$ESR %in% c(1, 2, 4, 5), ]

As mentioned before, the dataset consists of observations from the years 2019–2023. This means that income from different years cannot be fully compared due to factors such as inflation.

For this reason, the dataset provides the variable ADJINC, which makes it possible to rescale income from different years and express it in 2023 dollars. According to the US Census Bureau instructions, it is sufficient to multiply income by ADJINC/1000000.

data$INC <- as.numeric(data$PINCP) * as.numeric(data$ADJINC)/1000000
kable(head(data))
     PINCP   ADJINC  AGEP  SCHL  WKHP  SEX  MAR  ESR         INC
1      190  1207712    18    19    12    1    5    1    229.4653
9     5010  1207712    19    19     5    2    5    1   6050.6371
20    2400  1207712    19    19    40    2    5    1   2898.5088
26   36500  1207712    66    23    60    1    5    1  44081.4880
35    5000  1207712    19    16    40    1    5    1   6038.5600
44   24400  1207712    25    16    46    2    5    1  29468.1728
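
The adjustment can be verified by hand on a few of the rows printed above. The sketch below recomputes INC from the PINCP and ADJINC values shown in the table; the vector names are illustrative.

```r
# PINCP values of rows 1, 9 and 26 from the table above
pincp  <- c(190, 5010, 36500)
adjinc <- 1207712                  # ADJINC value shared by these rows

# Income expressed in 2023 dollars: PINCP * ADJINC / 1000000
inc <- pincp * adjinc / 1000000
print(round(inc, 4))
```

The results agree with the INC column: 229.4653, 6050.6371 and 44081.4880.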

3.2. Transformation of Numeric Data Into Tabular Format

Association rule mining works with categorical items, so the numeric variables described above need to be converted into a table of labelled categories.

The primary goal of this study is to identify combinations of socio-economic characteristics associated with high income. This raises the question: how should high income be defined? Top 10%? Top 20%? Top 30%? The choice is subjective. In this study, income is classified as high if it is equal to or greater than the third quartile (0.75 quantile). This means that an individual’s income is considered high if it falls in the top 25% of incomes in the filtered sample.

inc <- data$INC
quantile_inc <- quantile(inc, 0.75) 
data$INCOME <- ifelse(data$INC>=quantile_inc, "High_Income", "No_High_Income")
data$INCOME <- as.factor(data$INCOME)
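
As a sanity check on this definition, the same thresholding can be applied to simulated incomes, where roughly a quarter of the observations should end up in the high-income class. The sketch below uses log-normally distributed values as a stand-in; the distribution and its parameters are assumptions made for illustration only.

```r
set.seed(1)
# Simulated incomes (hypothetical, not ACS data)
sim_inc <- rlnorm(1000, meanlog = 10.5, sdlog = 0.8)

# Same rule as in the study: high income means at or above the 0.75 quantile
threshold    <- quantile(sim_inc, 0.75)
income_class <- ifelse(sim_inc >= threshold, "High_Income", "No_High_Income")

# Share of observations in each class
print(prop.table(table(income_class)))
```

With real data that contain ties at the threshold, the share can slightly exceed 25% because the comparison uses `>=`.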

Time for the transformation of other variables.

data$AGE <- ifelse(data$AGEP >= 15 & data$AGEP <= 20, "Age_15_20",
            ifelse(data$AGEP >= 21 & data$AGEP <= 30, "Age_21_30",
            ifelse(data$AGEP >= 31 & data$AGEP <= 40, "Age_31_40",
            ifelse(data$AGEP >= 41 & data$AGEP <= 50, "Age_41_50",
            ifelse(data$AGEP >= 51 & data$AGEP <= 60, "Age_51_60",
            ifelse(data$AGEP > 60, "Age_60_plus", NA))))))
data$AGE <- as.factor(data$AGE)

data$SEX <- ifelse(data$SEX == 1, "Male", "Female")
data$SEX <- as.factor(data$SEX)

data$EDUC <- ifelse(data$SCHL >= 20, "High_Educ", "No_High_Educ")
data$EDUC <- as.factor(data$EDUC)

data$HOURS <- ifelse(data$WKHP < 35, "Part_Time",
              ifelse(data$WKHP >= 35 & data$WKHP <= 40, "Full_Time",
              ifelse(data$WKHP > 40, "Overtime", NA)))
data$HOURS <- as.factor(data$HOURS)

data$MARRIED <- ifelse(data$MAR == 1, "Married", "Not_Married")
data$MARRIED <- as.factor(data$MARRIED)

data <- data[, c("AGE", "SEX", "EDUC", "HOURS", "MARRIED", "INCOME")]
kable(head(data), caption = "First 6 rows of the dataset")
First 6 rows of the dataset

     AGE          SEX     EDUC          HOURS      MARRIED      INCOME
1    Age_15_20    Male    No_High_Educ  Part_Time  Not_Married  No_High_Income
9    Age_15_20    Female  No_High_Educ  Part_Time  Not_Married  No_High_Income
20   Age_15_20    Female  No_High_Educ  Full_Time  Not_Married  No_High_Income
26   Age_60_plus  Male    High_Educ     Overtime   Not_Married  No_High_Income
35   Age_15_20    Male    No_High_Educ  Full_Time  Not_Married  No_High_Income
44   Age_21_30    Female  No_High_Educ  Overtime   Not_Married  No_High_Income
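
The nested ifelse() calls used for AGE and HOURS could alternatively be written with cut(), which maps a numeric vector onto labelled intervals in one step. The sketch below is an equivalent formulation under the assumption that ages and hours are whole numbers, as they are in the ACS variables AGEP and WKHP; the input vectors are hypothetical.

```r
# Hypothetical input values
ages  <- c(17, 20, 21, 45, 60, 61, 78)
hours <- c(10, 34, 35, 40, 41, 60)

# Intervals are right-closed: (14,20], (20,30], ..., (60,Inf)
age_group <- cut(
  ages,
  breaks = c(14, 20, 30, 40, 50, 60, Inf),
  labels = c("Age_15_20", "Age_21_30", "Age_31_40",
             "Age_41_50", "Age_51_60", "Age_60_plus")
)

# (-Inf,34] part time, (34,40] full time, (40,Inf) overtime
hours_group <- cut(
  hours,
  breaks = c(-Inf, 34, 40, Inf),
  labels = c("Part_Time", "Full_Time", "Overtime")
)

print(data.frame(ages, age_group))
print(data.frame(hours, hours_group))
```

cut() returns a factor directly, so no separate as.factor() call is needed, and values outside the breaks (for example ages below 15) become NA, as in the ifelse() version.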

Since the dataset described above has already been processed, the next step is to transform it into a transaction-type dataset.

data <- as(data, "transactions")
summary(data)
## transactions as itemMatrix in sparse format with
##  453789 rows (elements/itemsets/transactions) and
##  17 columns (items) and a density of 0.3529412 
## 
## most frequent items:
## INCOME=No_High_Income        EDUC=High_Educ       HOURS=Full_Time 
##                340179                252028                246514 
##       MARRIED=Married              SEX=Male               (Other) 
##                238395                228650               1416968 
## 
## element (itemset/transaction) length distribution:
## sizes
##      6 
## 453789 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6 
## 
## includes extended item information - examples:
##          labels variables    levels
## 1 AGE=Age_15_20       AGE Age_15_20
## 2 AGE=Age_21_30       AGE Age_21_30
## 3 AGE=Age_31_40       AGE Age_31_40
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             9
## 3            20
inspect(data[1:5])
##     items                   transactionID
## [1] {AGE=Age_15_20,                      
##      SEX=Male,                           
##      EDUC=No_High_Educ,                  
##      HOURS=Part_Time,                    
##      MARRIED=Not_Married,                
##      INCOME=No_High_Income}            1 
## [2] {AGE=Age_15_20,                      
##      SEX=Female,                         
##      EDUC=No_High_Educ,                  
##      HOURS=Part_Time,                    
##      MARRIED=Not_Married,                
##      INCOME=No_High_Income}            9 
## [3] {AGE=Age_15_20,                      
##      SEX=Female,                         
##      EDUC=No_High_Educ,                  
##      HOURS=Full_Time,                    
##      MARRIED=Not_Married,                
##      INCOME=No_High_Income}            20
## [4] {AGE=Age_60_plus,                    
##      SEX=Male,                           
##      EDUC=High_Educ,                     
##      HOURS=Overtime,                     
##      MARRIED=Not_Married,                
##      INCOME=No_High_Income}            26
## [5] {AGE=Age_15_20,                      
##      SEX=Male,                           
##      EDUC=No_High_Educ,                  
##      HOURS=Full_Time,                    
##      MARRIED=Not_Married,                
##      INCOME=No_High_Income}            35
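
The item count and density reported by summary() follow directly from the coding of the variables: every category becomes one item, and every person has exactly one category per variable. A short sketch of the arithmetic, with the category counts read off the transformation code above:

```r
# Number of categories created for each variable
levels_per_variable <- c(AGE = 6, SEX = 2, EDUC = 2, HOURS = 3, MARRIED = 2, INCOME = 2)

n_items           <- sum(levels_per_variable)     # columns of the item matrix
items_per_person  <- length(levels_per_variable)  # one item per variable
density           <- items_per_person / n_items   # share of filled cells

print(n_items)
print(round(density, 7))
```

This reproduces the 17 items and the density of 0.3529412 shown in the summary, and explains why every transaction has length 6.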

As the final step before applying the Apriori algorithm, let’s create an item frequency plot to see which characteristics occur most frequently.

itemFrequencyPlot(data, topN = 10, col = "skyblue", xlab = "Characteristics", ylab = "Frequency", main = "Characteristics Frequency")

The characteristics that occur most frequently are, unsurprisingly, not having high income, as well as having higher education and working full-time hours.

Chapter 4. Apriori Algorithm

4.1. Applying, Summarizing and Inspecting the Algorithm

Now, since everything is clear and prepared, let’s apply the Apriori algorithm. The minimum support, minimum confidence and minimum rule length are set to 0.005, 0.6 and 2, respectively, and the right-hand side of every rule is restricted to INCOME=High_Income.

set.seed(123)
high_income_rules <- apriori(data, parameter = list(support = 0.005, confidence = 0.6, minlen = 2), appearance = list(rhs = "INCOME=High_Income", default = "lhs"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.005      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 2268 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[17 item(s), 453789 transaction(s)] done [0.22s].
## sorting and recoding items ... [17 item(s)] done [0.02s].
## creating transaction tree ... done [0.26s].
## checking subsets of size 1 2 3 4 5 6 done [0.04s].
## writing ... [20 rule(s)] done [0.00s].
## creating S4 object  ... done [0.05s].
high_income_rules
## set of 20 rules

The algorithm identified exactly 20 rules that meet the specified requirements for support, confidence, and itemset length.

A summary of these rules is presented below:

summary(high_income_rules)
## set of 20 rules
## 
## rule length distribution (lhs + rhs):sizes
##  4  5  6 
##  5 10  5 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    4.75    5.00    5.00    5.25    6.00 
## 
## summary of quality measures:
##     support           confidence        coverage            lift      
##  Min.   :0.005549   Min.   :0.6005   Min.   :0.00909   Min.   :2.399  
##  1st Qu.:0.011281   1st Qu.:0.6413   1st Qu.:0.01583   1st Qu.:2.562  
##  Median :0.014442   Median :0.6715   Median :0.02245   Median :2.682  
##  Mean   :0.019683   Mean   :0.6711   Mean   :0.02985   Mean   :2.680  
##  3rd Qu.:0.021792   3rd Qu.:0.7025   3rd Qu.:0.03359   3rd Qu.:2.806  
##  Max.   :0.060912   Max.   :0.7462   Max.   :0.09336   Max.   :2.981  
##      count      
##  Min.   : 2518  
##  1st Qu.: 5119  
##  Median : 6554  
##  Mean   : 8932  
##  3rd Qu.: 9889  
##  Max.   :27641  
## 
## mining info:
##  data ntransactions support confidence
##  data        453789   0.005        0.6
##                                                                                                                                                   call
##  apriori(data = data, parameter = list(support = 0.005, confidence = 0.6, minlen = 2), appearance = list(rhs = "INCOME=High_Income", default = "lhs"))

It is now time to inspect the rules that were discovered.

inspect(high_income_rules)
##      lhs                   rhs                      support confidence    coverage     lift count
## [1]  {AGE=Age_60_plus,                                                                           
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.011730121  0.6613244 0.017737318 2.641508  5323
## [2]  {AGE=Age_41_50,                                                                             
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.022378242  0.6509198 0.034379414 2.599949 10155
## [3]  {AGE=Age_51_60,                                                                             
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.021595940  0.6479339 0.033330469 2.588023  9800
## [4]  {SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.055677859  0.6215039 0.089585689 2.482454 25266
## [5]  {EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.060911569  0.6524182 0.093362774 2.605934 27641
## [6]  {AGE=Age_60_plus,                                                                           
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.008217476  0.7008081 0.011725714 2.799217  3729
## [7]  {AGE=Age_60_plus,                                                                           
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.009054869  0.7091819 0.012768049 2.832664  4109
## [8]  {AGE=Age_41_50,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.014310616  0.6938034 0.020626326 2.771238  6494
## [9]  {AGE=Age_41_50,                                                                             
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.017488304  0.6908078 0.025315730 2.759273  7936
## [10] {AGE=Age_31_40,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.014572852  0.6005267 0.024266785 2.398666  6613
## [11] {AGE=Age_31_40,                                                                             
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.015626205  0.6191391 0.025238602 2.473009  7091
## [12] {AGE=Age_51_60,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime}   => {INCOME=High_Income} 0.013883104  0.7076266 0.019619250 2.826452  6300
## [13] {AGE=Age_51_60,                                                                             
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.016734650  0.6816264 0.024551058 2.722600  7594
## [14] {AGE=Age_51_60,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.023145118  0.6052556 0.038240239 2.417554 10503
## [15] {SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.041971048  0.6985000 0.060087397 2.789998 19046
## [16] {AGE=Age_60_plus,                                                                           
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.006952570  0.7462157 0.009317106 2.980587  3155
## [17] {AGE=Age_41_50,                                                                             
##       SEX=Female,                                                                                
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.005548834  0.6104242 0.009090128 2.438199  2518
## [18] {AGE=Age_41_50,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.011939470  0.7358414 0.016225603 2.939149  5418
## [19] {AGE=Age_31_40,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.010315367  0.6505907 0.015855387 2.598635  4681
## [20] {AGE=Age_51_60,                                                                             
##       SEX=Male,                                                                                  
##       EDUC=High_Educ,                                                                            
##       HOURS=Overtime,                                                                            
##       MARRIED=Married}  => {INCOME=High_Income} 0.011602309  0.7369821 0.015742999 2.943705  5265

In general, higher education, working overtime and being married are strongly associated with high income: higher education appears in all 20 rules, overtime in 19 of them, and being married in 12.

It will be useful to sort the rules by confidence and lift to identify the strongest associations.

inspect(sort(high_income_rules, by = "confidence", decreasing = TRUE)[1:5])
##     lhs                   rhs                      support confidence    coverage     lift count
## [1] {AGE=Age_60_plus,                                                                           
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.006952570  0.7462157 0.009317106 2.980587  3155
## [2] {AGE=Age_51_60,                                                                             
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.011602309  0.7369821 0.015742999 2.943705  5265
## [3] {AGE=Age_41_50,                                                                             
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.011939470  0.7358414 0.016225603 2.939149  5418
## [4] {AGE=Age_60_plus,                                                                           
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.009054869  0.7091819 0.012768049 2.832664  4109
## [5] {AGE=Age_51_60,                                                                             
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime}   => {INCOME=High_Income} 0.013883104  0.7076266 0.019619250 2.826452  6300
inspect(sort(high_income_rules, by = "lift", decreasing = TRUE)[1:5])
##     lhs                   rhs                      support confidence    coverage     lift count
## [1] {AGE=Age_60_plus,                                                                           
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.006952570  0.7462157 0.009317106 2.980587  3155
## [2] {AGE=Age_51_60,                                                                             
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.011602309  0.7369821 0.015742999 2.943705  5265
## [3] {AGE=Age_41_50,                                                                             
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.011939470  0.7358414 0.016225603 2.939149  5418
## [4] {AGE=Age_60_plus,                                                                           
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime,                                                                            
##      MARRIED=Married}  => {INCOME=High_Income} 0.009054869  0.7091819 0.012768049 2.832664  4109
## [5] {AGE=Age_51_60,                                                                             
##      SEX=Male,                                                                                  
##      EDUC=High_Educ,                                                                            
##      HOURS=Overtime}   => {INCOME=High_Income} 0.013883104  0.7076266 0.019619250 2.826452  6300

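Both orderings return the same five rules. This is expected: every rule has the same right-hand side, so lift is simply confidence divided by the constant Support(INCOME=High_Income). Overall, the strongest rule in terms of confidence and lift is: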

{Age_60_plus, Male, High_Educ, Overtime, Married} -> {High_Income}

The confidence of this rule is approximately 0.75, meaning that about 75% of individuals with this combination of characteristics are high-income earners. The lift is approximately 2.98, indicating that people with these characteristics are almost three times as likely to have high income as individuals in the sample overall.

Other rules with high confidence and lift are also shown above.
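
The lift of the strongest rule can be reproduced from figures already reported above. The sketch below takes the transaction counts from the summary of the transaction data and the confidence from the inspect() output; the variable names are illustrative.

```r
n_total    <- 453789       # number of transactions
n_not_high <- 340179       # count of INCOME=No_High_Income

# Support of the right-hand side INCOME=High_Income
support_rhs <- (n_total - n_not_high) / n_total

# Confidence of the strongest rule, as printed by inspect()
confidence_top <- 0.7462157

# Lift = Confidence / Support(rhs)
lift_top <- confidence_top / support_rhs

print(round(support_rhs, 4))
print(round(lift_top, 3))
```

The base rate of high income in the filtered sample is close to 0.25, as intended by the third-quartile definition, and dividing the confidence by it gives a lift of about 2.98, in line with the value reported for the rule.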

4.2. Visualizations of the Rules

It is now time to create some plots to visualize the rules uncovered by the Apriori algorithm.

Let’s start with a scatter plot, where support is on the x-axis and confidence is on the y-axis. The lift of each rule is represented using a color gradient.

plot(high_income_rules, engine = "ggplot2", main = "Scatter plot for discovered rules")+
  theme_minimal()+
  geom_point(size=4)+
  scale_color_gradient(low = "blue", high = "red")

The main drawback of the strongest rules is that their support is relatively low: the top rule, for example, covers only about 0.7% of the sample (3,155 individuals). These combinations of characteristics are therefore uncommon in the data, but when they do occur, high income is much more likely than in the sample overall.

The second plot in this section is a parallel coordinates plot, which represents the rules as lines connecting the characteristics involved.

plot(high_income_rules, method="paracoord", control=list(reorder=TRUE),
     main = "Parallel coordinates for discovered rules")

Summary and Conclusions

The primary goal of this study was to identify combinations of socio-economic characteristics associated with high income. To achieve this, the Apriori algorithm, a method of association rule mining, was applied. The data were obtained from the ACS Public Use Microdata Sample and cover observations from 2019 to 2023.

The analysis identified 20 rules linking combinations of characteristics to high income. The strongest rules highlighted characteristics such as being aged over 40, being male, having higher education, working overtime, and being married. Because the sample is large (453,789 individuals after filtering) and each rule is supported by at least 2,518 observations, these findings are unlikely to be artefacts of small group sizes.

These results can serve as a valuable starting point for further analysis, helping to identify key characteristics that should be considered when studying the determinants of individual income.