NY State Total Income And Tax Liability
This project explores “Total Income And Tax Liability By Place Of Residence” dataset from NY State. Dataset was obtained from US Government open data website (data.gov): https://catalog.data.gov/dataset/total-income-and-tax-liability-by-place-of-residence-beginning-tax-year-1999. The data includes tax liability information for full-year resident, full-year nonresident, and part-year resident returns for 1999 through 2014 tax years. The data includes income, deductions, dependent exemptions, and tax liability and information is provided by income and county. The project will attempt to determine whether there is any impactful difference between resident and non-resident tax lability. This data is relevant to current tax laws and policies and it might expose a pattern of tax loopholes or indications for a need of a policy change.
Reading in the data:
df <- read.csv("https://raw.githubusercontent.com/che10vek/Data608/master/Total_Income_And_Tax_Liability_By_Place_Of_Residence__Beginning_Tax_Year_1999.csv", header= TRUE)
And lets preview this data:
knitr::kable(head(df),format = "markdown")
Tax.Year | Resident.Type | Place.of.Residence | Country | State | County | Disclosure | Number.of.All.Returns | NY.AGI.of.All.Returns..in.thousands… | Tax.Liability.of.All.Returns..in.thousands…. | Number.of.Taxable.Returns | NY.AGI.of.Taxable.Returns..in.thousands… | Tax.Liability.of.Taxable.Returns..in.thousands…. | Number.of.Nontaxable.Returns | NY.AGI.of.Nontaxable.Returns..in.thousands… | Tax.Liability.of.Nontaxable.Returns..in.thousands…. | Average.NY.AGI.of.All.Returns | Average.Tax.of.All.Returns | Average.NY.AGI.of.Taxable.Returns | Average.Tax.of.Taxable.Returns | Average.NY.AGI.of.Nontaxable.Returns | County.Sort.Order |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2011 | Full-Year Resident | New York City - Bronx | United States | New York | Bronx | NA | 588674 | 17626369 | 433593 | 299820 | 14719525 | 598802 | 288854 | 2906845 | -165209 | 29942 | 737 | 49095 | 1997 | 10063 | 1 |
2011 | Full-Year Resident | New York City - Kings | United States | New York | Kings | NA | 1078520 | 48145626 | 1989456 | 636200 | 44304145 | 2239988 | 442320 | 3841481 | -250532 | 44640 | 1845 | 69639 | 3521 | 8685 | 2 |
2011 | Full-Year Resident | New York City - Manhattan | United States | New York | Manhattan | NA | 808917 | 129181676 | 9465286 | 575612 | 132401862 | 9552543 | 233305 | -3220186 | -87257 | 159697 | 11701 | 230019 | 16595 | -13802 | 3 |
2011 | Full-Year Resident | New York City - Queens | United States | New York | Queens | NA | 1028443 | 41248512 | 1512781 | 627276 | 37521694 | 1694491 | 401167 | 3726818 | -181709 | 40108 | 1471 | 59817 | 2701 | 9290 | 4 |
2011 | Full-Year Resident | New York City - Richmond | United States | New York | Richmond | NA | 204587 | 11442144 | 494959 | 140940 | 11001652 | 518404 | 63647 | 440492 | -23445 | 55928 | 2419 | 78059 | 3678 | 6921 | 5 |
2011 | Full-Year Resident | Total, New York City | United States | New York | Total, New York City | NA | 3709141 | 247644328 | 13896076 | 2279848 | 239948878 | 14604228 | 1429293 | 7695450 | -708153 | 66766 | 3746 | 105248 | 6406 | 5384 | 6 |
summary(df)
## Tax.Year Resident.Type Place.of.Residence
## Min. :1999 Full-Year Nonresident: 760 Washington : 32
## 1st Qu.:2003 Full-Year Resident :1072 Delaware : 31
## Median :2007 Grand Total, Total : 16 Wyoming : 20
## Mean :2007 Part-Year Resident : 16 Albany : 16
## 3rd Qu.:2011 All Other Countries +++: 16
## Max. :2014 Allegany : 16
## (Other) :1733
## Country State
## United States :1668 New York :1072
## All : 48 Not Applicable : 144
## All Other Countries +++: 16 All - Excluding New York: 32
## Canada : 16 All : 20
## Residence Unknown ++++ : 16 California : 16
## Germany : 15 Colorado : 16
## (Other) : 85 (Other) : 564
## County Disclosure Number.of.All.Returns
## All : 600 Mode:logical Min. : 52
## Not Applicable: 144 NA's:1864 1st Qu.: 3807
## Albany : 16 Median : 22008
## Allegany : 16 Mean : 311807
## Bronx : 16 3rd Qu.: 67982
## Broome : 16 Max. :10273274
## (Other) :1056
## NY.AGI.of.All.Returns..in.thousands...
## Min. :3.740e+03
## 1st Qu.:7.028e+05
## Median :1.536e+06
## Mean :2.472e+07
## 3rd Qu.:6.244e+06
## Max. :1.109e+09
##
## Tax.Liability.of.All.Returns..in.thousands.... Number.of.Taxable.Returns
## Min. : 147 Min. : 31
## 1st Qu.: 10494 1st Qu.: 2585
## Median : 35656 Median : 14784
## Mean : 986693 Mean : 214842
## 3rd Qu.: 140765 3rd Qu.: 49587
## Max. :41911953 Max. :7027102
##
## NY.AGI.of.Taxable.Returns..in.thousands...
## Min. :3.591e+03
## 1st Qu.:6.265e+05
## Median :1.355e+06
## Mean :2.300e+07
## 3rd Qu.:5.451e+06
## Max. :1.022e+09
##
## Tax.Liability.of.Taxable.Returns..in.thousands....
## Min. : 147
## 1st Qu.: 11244
## Median : 38915
## Mean : 1040505
## 3rd Qu.: 153005
## Max. :43342022
## NA's :34
## Number.of.Nontaxable.Returns NY.AGI.of.Nontaxable.Returns..in.thousands...
## Min. : 21 Min. :-3220186
## 1st Qu.: 1062 1st Qu.: 60758
## Median : 7309 Median : 139968
## Mean : 97017 Mean : 1718891
## 3rd Qu.: 19581 3rd Qu.: 575504
## Max. :3258100 Max. :94923870
##
## Tax.Liability.of.Nontaxable.Returns..in.thousands....
## Min. :-1458748.0
## 1st Qu.: -7536.8
## Median : -2637.5
## Mean : -75963.5
## 3rd Qu.: -11.8
## Max. : 0.0
## NA's :1272
## Average.NY.AGI.of.All.Returns Average.Tax.of.All.Returns
## Min. : 6627 Min. : 122
## 1st Qu.: 38492 1st Qu.: 1325
## Median : 57676 Median : 2024
## Mean : 267333 Mean : 2800
## 3rd Qu.: 335807 3rd Qu.: 3230
## Max. :32714136 Max. :32720
##
## Average.NY.AGI.of.Taxable.Returns Average.Tax.of.Taxable.Returns
## Min. : 32554 Min. : 695
## 1st Qu.: 53920 1st Qu.: 2166
## Median : 78996 Median : 2938
## Mean : 303776 Mean : 4096
## 3rd Qu.: 384735 3rd Qu.: 4623
## Max. :47070012 Max. :45523
## NA's :34
## Average.NY.AGI.of.Nontaxable.Returns County.Sort.Order
## Min. :-188756 Min. : 1.00
## 1st Qu.: 8083 1st Qu.: 30.00
## Median : 9976 Median : 59.00
## Mean : 198480 Mean : 82.33
## 3rd Qu.: 163643 3rd Qu.:128.00
## Max. :9855074 Max. :299.00
##
Certain rows include “Grand Total” information, I will remove those to avoid double counting, I will accomplish that by subseting the data and filtering out those rows.
df1 <- df %>% filter((Resident.Type != "Grand Total, Total"))
df1 <- df1 %>% filter(Place.of.Residence != "Total, New York City")
df1 <- df1 %>% filter(Place.of.Residence != "Total, Counties Outside of New York City")
df1 <- df1 %>% filter(Place.of.Residence != "Grand Total, Full-Year Resident")
df1 <- df1 %>% filter(Place.of.Residence != "Grand Total, Full-Year Nonresident")
summary(df1)
## Tax.Year Resident.Type Place.of.Residence
## Min. :1999 Full-Year Nonresident: 744 Washington : 32
## 1st Qu.:2003 Full-Year Resident :1024 Delaware : 31
## Median :2007 Grand Total, Total : 0 Wyoming : 20
## Mean :2007 Part-Year Resident : 16 Albany : 16
## 3rd Qu.:2011 All Other Countries +++: 16
## Max. :2014 Allegany : 16
## (Other) :1653
## Country State
## United States :1620 New York :1024
## All : 16 Not Applicable : 144
## All Other Countries +++: 16 All - Excluding New York: 16
## Canada : 16 California : 16
## Residence Unknown ++++ : 16 Colorado : 16
## Germany : 15 Connecticut : 16
## (Other) : 85 (Other) : 552
## County Disclosure Number.of.All.Returns
## All :600 Mode:logical Min. : 52
## Not Applicable:144 NA's:1784 1st Qu.: 3381
## Albany : 16 Median : 20633
## Allegany : 16 Mean : 84070
## Bronx : 16 3rd Qu.: 48991
## Broome : 16 Max. :1138249
## (Other) :976
## NY.AGI.of.All.Returns..in.thousands...
## Min. : 3740
## 1st Qu.: 686220
## Median : 1440429
## Mean : 7092545
## 3rd Qu.: 4532468
## Max. :177432814
##
## Tax.Liability.of.All.Returns..in.thousands.... Number.of.Taxable.Returns
## Min. : 147 Min. : 31
## 1st Qu.: 9857 1st Qu.: 2378
## Median : 32686 Median : 13865
## Mean : 270107 Mean : 58264
## 3rd Qu.: 117261 3rd Qu.: 34264
## Max. :12317382 Max. :690968
##
## NY.AGI.of.Taxable.Returns..in.thousands...
## Min. : 3591
## 1st Qu.: 601187
## Median : 1266724
## Mean : 6556208
## 3rd Qu.: 3858014
## Max. :179562053
##
## Tax.Liability.of.Taxable.Returns..in.thousands....
## Min. : 147
## 1st Qu.: 10517
## Median : 35410
## Mean : 283542
## 3rd Qu.: 124265
## Max. :12415536
## NA's :34
## Number.of.Nontaxable.Returns NY.AGI.of.Nontaxable.Returns..in.thousands...
## Min. : 21.0 Min. :-3220186
## 1st Qu.: 965.8 1st Qu.: 58820
## Median : 6925.5 Median : 130467
## Mean : 25859.4 Mean : 537065
## 3rd Qu.: 14625.2 3rd Qu.: 451021
## Max. :449793.0 Max. :13304043
##
## Tax.Liability.of.Nontaxable.Returns..in.thousands....
## Min. :-269143
## 1st Qu.: -5922
## Median : -1926
## Mean : -10474
## 3rd Qu.: -7
## Max. : 0
## NA's :1248
## Average.NY.AGI.of.All.Returns Average.Tax.of.All.Returns
## Min. : 6627 Min. : 122
## 1st Qu.: 38158 1st Qu.: 1308
## Median : 55342 Median : 1958
## Mean : 273961 Mean : 2765
## 3rd Qu.: 353794 3rd Qu.: 3182
## Max. :32714136 Max. :32720
##
## Average.NY.AGI.of.Taxable.Returns Average.Tax.of.Taxable.Returns
## Min. : 32554 Min. : 695
## 1st Qu.: 53464 1st Qu.: 2137
## Median : 76072 Median : 2866
## Mean : 311003 Mean : 4048
## 3rd Qu.: 398284 3rd Qu.: 4438
## Max. :47070012 Max. :45523
## NA's :34
## Average.NY.AGI.of.Nontaxable.Returns County.Sort.Order
## Min. :-188756 Min. : 1.00
## 1st Qu.: 8094 1st Qu.: 29.00
## Median : 9976 Median : 57.00
## Mean : 204307 Mean : 80.19
## 3rd Qu.: 167031 3rd Qu.:125.25
## Max. :9855074 Max. :250.00
##
Based on the summry table above - the data looks very clean - there is only 1 significant column with NAs. Tax.Liability.of.Nontaxable.Returns column includes NAs due to the fact that detailed breakdown information is not avaiilable by county for prior years, that information is only available for several of the more recent years. I will use Mice to impute using random forest imputation since I believe the data will be a good candidate for this method.
df1_imputed <- mice(df1, m = 1, method = "rf", print = F) %>% mice::complete()
Let’s begin by creating a visualizaiton for Average NY AGI for all returns and seeing what we can learn from that data.
inc1 <- df1_imputed %>%
group_by(Tax.Year) %>%
summarize(AGI = mean(Average.NY.AGI.of.All.Returns))
inc1<-arrange(inc1, desc(Tax.Year))
ggplot(inc1, aes(x = Tax.Year, y = AGI)) +
geom_bar(stat = "identity", fill = 'light blue')+
labs(title="Average AGI of NY Returns by Year (Thousands)", x = "", y = "") +
geom_text(aes(label=round(AGI,0)), size = 2, color = "black") +
theme_bw()
From the above visualizaiton we can observe the Adjusted Gross Income trends over the years. The effect of the recent economic crisis is very apparent from this plot.
Next, we will explore the Average Tax of all returns.
inc2 <- df1_imputed %>%
group_by(Tax.Year) %>%
summarize(Tax_Liability = mean(Average.Tax.of.All.Returns))
inc2<-arrange(inc2, desc(Tax.Year))
ggplot(inc2, aes(x = Tax.Year, y = Tax_Liability)) +
geom_bar(stat = "identity", fill = 'light blue')+
labs(title="Average Tax Liability for All NY Returns by Year (Thousands)", x = "", y = "") +
geom_text(aes(label=round(Tax_Liability,0)), size = 2, color = "black") +
theme_bw()
It is intersting to observe that the average tax doesn’t really follow the same trends and patterns we are observing in the AGI data.
Next, I will create an interactive visualization app which will allow the user to interact with the data and view the details by place of residence.
Link to the App: