library(arules)
library(arulesViz)
library(arulesCBA)
library(plot.matrix)

Association Rule - Introduction / Explanation of method

This paper analyses a survey using Association Rules, an unsupervised learning method. The method is used to explore and interpret transactional datasets, that is, records of the items included in each transaction. Association Rules algorithms search for the most frequent sets of items and derive general rules that can be interpreted and used to predict which items will appear in future transactions. The method is most often used in market basket analysis to find items that are bought together. In this paper, I use it to find relations between answers in a county's survey.
The first part presents the preparation of the dataset. The next part analyses selected items from the survey using Association Rules.

Dataset / Preparing to use an Association Rule

The dataset used in this paper was downloaded from Data.gov, an American administration website that hosts many datasets useful for data scientists (https://catalog.data.gov/dataset/2018-constituent-satisfaction-survey-results). It contains the answers of citizens of Arlington County. As stated on Data.gov: “The County uses the survey results as an additional tool to measure its performance and enable more effective management of community services”. The survey was first conducted in 2004. The survey analysed here is the most recent one, from 2018.

To apply Association Rules, I downloaded the data from the website, deleted the columns that were not needed, shortened the names of the remaining columns, stripped the numeric prefix from each value and prepended the column name to each value. The last step is necessary to interpret the results properly: when the data are read as transactions, each item has to contain both the question and the answer. These steps are presented below.

## [1] 1610  158
Colname <- c("Police_Services","Fire_Emergency","Service_Emergency_Preparedness","Parks_Recreation",
             "Street_Maitenance","Art_Culture","Water_Service","Library_Service","Customer_Services","Communication","Preserve_Service","Human_Services",
             "Quality_of_Service","Overage_Image","Growth_Management","Quality_of_Life","Public_School","Tax_Value_Received","Public_Engagement_Opportunities",
             "Perception_of_Transparency_Decision","Physical_Accessibility","Overall_Inclusiveness","Budget_for_Changes","Public_School_Budget","Culture_Budget",
             "Economic_Development_Budget","Health_Services_Budget","Housing_Budget","Library_Budget","Parks_Recreation_Budget","Public_Safety_Budget",
             "Public_Safety_Budget","Public_Works_Budget","Metro_Budget","Transportation_Budget","Safe_in_Daytime","Safe_in_Night","Overall_Feeling_in_County",
             "Overall_Street_Maintenance","Overall_Quality_Police","Traffic_Safety","Park","Ease_Traveling_within_Arlington","Ease_Traveling_outside_Arlington",
             "Transportation.Accommodation_for_Disabilities","Public_Transit","Metro_Bus_Service","Metro_Rail_Service","Restaurant_Services","Home_Ownership",
             "Home_Description","English_Primary_Language_in_Home","Household_Income","Gender")
colnames(Survey2018_CSV) <- Colname

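# Clean the data: strip the numeric prefix from each answer and prepend the question name
for (i in seq_len(nrow(Survey2018_CSV))) {
  for (j in seq_len(ncol(Survey2018_CSV))) {
    value <- Survey2018_CSV[i, j]
    # Missing answers and cells with at most one character are left unchanged
    if (!is.na(value) && nchar(value) > 1) {
      # Drop the first four characters and build "<column name> : <answer>"
      Survey2018_CSV[i, j] <- paste(colnames(Survey2018_CSV)[j], ":", substr(value, 5, nchar(value)))
    }
  }
}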
write.csv(Survey2018_CSV, file = 'Survey2018.csv', row.names = FALSE)

Association Rules - Analysis of selected items from the survey

After preparation, the data can be read as transactions. The number of transactions and the number of items are shown below. In addition, the 30 most frequent items in the dataset are presented in a bar plot. The most frequent answers were English to the question about the primary language at home, very safe to the question about safety at night in the county, and don't know to the question about transportation accommodation for people with disabilities. The plot shows that all 30 items have a relatively high frequency and that the frequency does not drop rapidly.

## Warning in asMethod(object): removing duplicated items in transactions
## [1] "Number of transactions: 1611"
## [1] "Number of items: 355"
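The reading step described above can be sketched roughly as follows. This is a minimal illustration on a three-line toy file, not the author's actual call: the file contents, the `toy` object and the chosen arguments (`format = "basket"`, `sep = ","`) are assumptions.

```r
library(arules)

# Hypothetical stand-in for 'Survey2018.csv': one respondent per line,
# one "question : answer" item per comma-separated field
toy_file <- tempfile(fileext = ".csv")
writeLines(c(
  "Gender : Female,Safe_in_Night : Very Safe",
  "Gender : Male,Safe_in_Night : Very Safe",
  "Gender : Female,Safe_in_Night : Safe"
), toy_file)

# Read each line as one transaction (basket format)
toy <- read.transactions(toy_file, format = "basket", sep = ",")

# dim() returns the number of transactions and the number of distinct items
dim(toy)

# Bar plot of the most frequent items (relative frequency)
itemFrequencyPlot(toy, topN = 3, type = "relative")
```

With the real file, `topN = 30` would give a plot like the one discussed above.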

To improve the results of the Association Rules algorithm, I decided to drop the items with a low frequency. The plot of the 200 most frequent items confirms that item frequency does not drop rapidly, so there is no clear cut-off point. Based on the plots, I decided to cut off the items below a frequency of 0.3.

## [1] "Number of items after delete items with low frequency: 62"
head(sort(itemFrequency(Survey2018)),20)
##                                        Library_Budget : No Change                       Household_Income : Between 80000 and 149000                                       Safe_in_Daytime : No Change 
##                                                         0.3010552                                                         0.3016760                                                         0.3047796 
##                      Public_Engagement_Opportunities : Don't Know                    Public_Safety_Budget : Slightly Reduce Service                                         Communication : Satisfied 
##                                                         0.3072626                                                         0.3072626                                                         0.3109870 
##                                        Housing_Budget : No Change                                     Metro_Bus_Service : Satisfied         Public_Works_Budget : Increase Taxes to Maintain Services 
##                                                         0.3122284                                                         0.3140906                                                         0.3153321 
##                                    Metro_Bus_Service : Don't Know                                Growth_Management : Very Satisfied                                Physical_Accessibility : Satisfied 
##                                                         0.3159528                                                         0.3165736                                                         0.3165736 
##                                         Public_School : Satisfied                                    Quality_of_Service : Satisfied                                          Metro_Budget : No Change 
##                                                         0.3240223                                                         0.3351955                                                         0.3358163 
##                                     Street_Maitenance : Satisfied                                 Transportation_Budget : No Change                                       Police_Services : Satisfied 
##                                                         0.3414029                                                         0.3414029                                                         0.3476102 
##                                         Water_Service : Satisfied Economic_Development_Budget : Increase Taxes to Maintain Services 
##                                                         0.3476102                                                         0.3482309
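The frequency cut-off can be sketched as below. This is a minimal example on toy transactions with hypothetical items A, B and C. The 0.3 threshold is the one stated in the text, while the exact filtering code used for the survey is not shown in the source.

```r
library(arules)

# Toy transactions with hypothetical items (not survey data)
toy <- as(list(
  c("A", "B"),
  c("A", "C"),
  c("A", "B"),
  c("A")
), "transactions")

# Relative frequency (support) of every single item
freq <- itemFrequency(toy)

# Keep only the item columns whose frequency reaches the cut-off
toy_frequent <- toy[, freq >= 0.3]
itemLabels(toy_frequent)
```

Here C occurs in one of four transactions (0.25), so it is dropped, while A and B are kept.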

Dissimilarity Index - Jaccard Index

On the matrix plot, we can easily see that most items are highly dissimilar.

Some of the similar items:
Ease Traveling outside Arlington: Satisfied - Ease Traveling within Arlington: Satisfied
Police Services: Very Satisfied - Traffic Safety: Satisfied
Police Services: Very Satisfied - Fire Emergency: Very Satisfied
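A dissimilarity matrix of this kind can be computed with `dissimilarity()` from arules. The sketch below uses toy transactions with hypothetical items, and base R `image()` is an assumed stand-in for the matrix plot used in the paper.

```r
library(arules)

# Toy transactions with hypothetical items (not survey data)
toy <- as(list(
  c("A", "B"),
  c("A", "B", "C"),
  c("A"),
  c("C")
), "transactions")

# Jaccard dissimilarity between items:
# 1 - (transactions containing both items) / (transactions containing either item)
d <- dissimilarity(toy, method = "jaccard", which = "items")
round(as.matrix(d), 2)

# Simple heat map of the matrix; values near 1 mean highly dissimilar items
image(as.matrix(d), main = "Jaccard dissimilarity between items")
```

For example, A and B occur together in two transactions and at least one of them occurs in three, so their dissimilarity is 1 - 2/3.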

Because the items are mostly dissimilar, the support of the rules created in the further analysis will be relatively small. Support is the number of transactions in which the items from the LHS and the RHS occur together, divided by the total number of transactions. If items are dissimilar, only a small number of transactions contain both of them. A larger number of items on the left-hand side of the rule (“if”) reduces this statistic even further.
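The definition of support, together with the related confidence and lift measures, can be worked through by hand in base R. The data frame below is a toy example with hypothetical items A (LHS) and B (RHS), not survey data.

```r
# Rows are respondents, columns are items (TRUE = the respondent gave that answer)
toy <- data.frame(
  A = c(TRUE, TRUE, TRUE, FALSE, FALSE),
  B = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

# Share of transactions containing the LHS, the RHS, and both together
support_lhs  <- mean(toy$A)
support_rhs  <- mean(toy$B)
support_rule <- mean(toy$A & toy$B)   # support of the rule {A} => {B}

# Confidence: share of LHS transactions that also contain the RHS
confidence <- support_rule / support_lhs

# Lift: confidence relative to the overall frequency of the RHS
lift <- confidence / support_rhs

c(support = support_rule, confidence = confidence, lift = lift)
```

Adding a second item to the LHS can only keep `support_rule` the same or lower it, because fewer transactions satisfy both conditions at once.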

Below I present four short analyses of potential dependencies between items. For each item examined, I use the apriori algorithm to create rules and then keep only the rules that are not redundant [no more general rule with the same or higher confidence exists ‘*’], significant [a Fisher test of whether the LHS and the RHS of the rule are statistically dependent ‘*’] and maximal [selecting only maximal itemsets ‘*’]. The apriori algorithm creates rules based on the frequency of items in itemsets.

‘*’ - Notes from the R documentation.

For each analysis, the table of cleaned rules, a plot showing the connections between items in the rules, and the dendrogram of the rules are presented below.
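The rule creation and cleaning steps can be sketched as follows. The example uses the `Groceries` data shipped with arules rather than the survey. The support and confidence thresholds and the chosen RHS item are assumptions for illustration, since the values used in the paper are not stated.

```r
library(arules)
data("Groceries")  # example transactions shipped with arules, not the survey data

# Mine rules whose right-hand side is restricted to a single item
rules <- apriori(
  Groceries,
  parameter  = list(support = 0.01, confidence = 0.3, minlen = 2),
  appearance = list(default = "lhs", rhs = "whole milk"),
  control    = list(verbose = FALSE)
)

# Keep rules that are non-redundant, significant (Fisher's exact test) and maximal
keep <- !is.redundant(rules) &
  is.significant(rules, Groceries) &
  is.maximal(rules)
cleaned <- rules[keep]

# Number of rules before and after cleaning
c(before = length(rules), after = length(cleaned))
inspect(head(sort(cleaned, by = "lift"), 5))
```

For the survey, the `rhs` value would be an item label such as the ones shown in the item frequency output above.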

What do people who are satisfied with the quality of service in Arlington think about the county? (Item Frequency - 43%)

People who are satisfied with the quality of service are usually satisfied with each particular service as well, but the most frequent rules involve satisfaction with Human_Services, Street_Maitenance, Growth_Management and Restaurant_Services. Satisfaction with the quality of service is mostly connected with services provided by the administration that engage the community, such as Human_Services. The dendrogram shows that rules with Growth_Management at the satisfied level are grouped together, that rules with Preserve_Service are similar to them, and that rules with Water_Service, Physical_Accessibility and Public_School differ most from the other rules.

## [1] "Number of rules for rhs=Quality_of_Service : Satisfied: 33"
## [1] "Number after clearing rules for rhs=Quality_of_Service : Satisfied: 18"

How do people who think that Arlington is very safe at night answer? (Item Frequency - 77%)

Unlike in the previous analysis, restricting the RHS to Safe_in_Night : Very Safe produces many more rules. Each rule is presented on a scatter plot of confidence against support.

This number of rules is too large to interpret rule by rule using the dendrogram. However, the dendrogram clearly shows a split of the rules into two groups. Based on this split, two tables and two plots were prepared, one for each cluster of rules. In the first cluster, the rules mainly link feeling very safe at night in the county with feeling very safe overall. Other features of this cluster: people speak English at home as their primary language and earn more than $150,000 per year. The second cluster has more varied rules, suggesting that people who feel very safe are very satisfied with communication, traffic safety and police services. They also do not want to change the public budget or the budget for the safety of the county.
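Splitting a large rule set into two clusters from a dendrogram can be sketched as below. The example again uses the `Groceries` data shipped with arules. The thresholds, the Jaccard dissimilarity between rules, the `ward.D2` linkage and the base R scatter plot are all assumptions, since the paper does not state how its plots and clusters were produced.

```r
library(arules)
data("Groceries")  # example transactions shipped with arules, not the survey data

rules <- apriori(
  Groceries,
  parameter  = list(support = 0.01, confidence = 0.3, minlen = 2),
  appearance = list(default = "lhs", rhs = "whole milk"),
  control    = list(verbose = FALSE)
)

# Scatter plot of confidence against support, one point per rule
q <- quality(rules)
plot(q$support, q$confidence, xlab = "support", ylab = "confidence")

# Hierarchical clustering of rules based on the items they share
d  <- dissimilarity(rules, method = "jaccard")
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE, main = "Dendrogram of rules")

# Cut the tree into two clusters and inspect each one separately
cluster <- cutree(hc, k = 2)
table(cluster)
inspect(head(sort(rules[cluster == 1], by = "confidence"), 3))
inspect(head(sort(rules[cluster == 2], by = "confidence"), 3))
```

Inspecting the clusters separately keeps each table short enough to read, which is the purpose of the split described above.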

## [1] "Number of rules for rhs=Safe_in_Night : Very Safe: 7601"
## [1] "Number after clearing rules for rhs=Safe_in_Night : Very Safe: 412"
## [1] "Percantage of rules after clearing: 5.42%"

Characteristics of people who earn more than $150,000 (Item Frequency - 46%)

People who earn more than $150,000 use English as their first language at home, usually form a childless household that owns the home in which it lives, and answer that they are very satisfied with the quality of life. Interestingly, the item Tax_Value_Received: Very Satisfied appears in many rules. This may suggest that Arlington has a tax system that works well for wealthy people and helps them to earn more money.

## [1] "Number of rules for rhs=Household_Income : 150000 and Above: 206"
## [1] "Number after clearing rules for rhs=Household_Income : 150000 and Above: 62"

What do people who want to increase taxes to maintain services think about the county? (Item Frequency - %)

People who want to increase taxes to maintain services in the economic development budget also want to increase taxes for the public works budget. One rule suggests that women want to increase the taxes. Other rules suggest that the people who want to increase taxes are those who feel very safe at night and who are satisfied with the quality of the police. Lift, the ratio of how often the LHS and the RHS occur together to how often they would occur together if they were independent, is close to 2 for each rule.

## [1] "Number of rules for rhs=Economic_Development_Budget : Increase Taxes to Maintain Services: 20"
## [1] "Number after clearing rules for rhs=Economic_Development_Budget : Increase Taxes to Maintain Services: 8"

Conclusion

This paper has presented an analysis of survey results using Association Rules. This tool might help the Arlington County administration to identify aspects of policy to improve and to find what has the biggest influence on the aspects of citizens' lives that interest them. Based on the above analysis, the administration might decide to maintain the police budget and to provide traffic safety and public transportation so that citizens continue to perceive the county as very safe at night. It might also consider family policy, because wealthy people are very satisfied with the value received for the taxes they pay while living without children. The small support of each rule was explained by the plot of the dissimilarity matrix (Jaccard Index).