West 3rd and MacDougal St

Abstract:

A short (100 to 150 words) synopsis of what the research explored, its findings, and their significance. Note: this is the last section you write, after you’ve completed writing the rest of the paper.

Introduction:

We want to build a model that predicts assessed tax values and compare its predictions to the actual assessed values. If the richer co-ops have larger R values than the mid and poor co-ops, then we’ve shown that they are under-assessed.

It could also be that there is systemic bias(?). For example, NYC uses statistical methods to produce assessed property values for tax purposes. Those methods may tend to lump poor, mid and rich co-ops together, so that poor buildings pay a disproportionately large share of taxes and rich buildings pay a disproportionately small share. Basically, we could ask whether the statistical methods used by NYC DOF lump all buildings into the mid rich building category.

Some apartments have a low property value because the maintenance is so high: maybe there is a large underlying mortgage on the building itself and the interest on it makes up a big proportion of the maintenance, or it’s a building with a large staff. If the sale price of individual units alone is used to produce assessed tax values for whole buildings, then buildings with heavily leveraged underlying mortgages or with large staffs would be underpaying taxes.

Why is this important? Tax inequities drive people to game the system to benefit themselves…

Goal: Identify a problem or phenomenon and frame it as a function of a number of factors

The introduction should familiarize your reader with what you are trying to show, as well as the reasons for your research and what value you believe that it has.

What problem/issue/phenomenon have I chosen to explore - my research question, theory and hypothesis.

Why is it meaningful?

What prior research has been done on the topic: explain originality of my research (note just a few paragraphs on relevant areas of research, not a detailed literature review)

What research methods do I intend to use, what models, variables and data sets will be used to test hypos and theory?

Have I made sure there are data sets available (proposal stage)?

Note: I need to be clear and specific, show sufficient scope and complexity in the issue and the proposed analysis, and confirm the availability and statistical validity of the data sets and models.

PHASE2 -> flesh out introduction, literature review, hypothesis, data and variables, and statistical methods, so everything except discussion/conclusion and abstract. Should be a 10-12 page paper at this point.

Final project presentation should be 10-15 minutes, assuming an audience of peers and non-technical management. Send recorded video to professor.

Good: The research question is clearly stated, can be answered by the data, and the context of the problem clearly explained.

OK: The research question is unclear and/or not supported by the data.

Bad: Research question is ambiguous, unclear, or not stated.

Intro: tell the reader why this matters. Hook the reader, explain the background, and give them a clear sense of what I’m doing and why.

The problem: Why do property taxes matter in NYC? What’s at stake when they’re unfair? Who is impacted and who benefits?

The suspicion: Is there reason to believe some buildings are getting lower assessments unfairly? Can this be detected with data?

The research angle: Why is this question novel or worth exploring? Is there precedent in other cities or fields? Inequality in taxation.

The stakes: Could this help bring transparency to property taxation? Could this lead to better policy, tools for residents, or media coverage?

The hook: a mini-anecdote or striking statistic, e.g. “In 2022, a luxury building on Central Park West was assessed at half the rate of nearby co-ops”

Literature Review:

In a literature review, you examine prior studies into the causes and factors associated with the phenomenon you want to explore or effect that you want to measure in the study. By reviewing these studies, you see what data sets and models researchers used, and compare and contrast their findings with what phenomena that you’re exploring. A good lit review answers the question: so what makes your study different or more interesting than the current body of knowledge?

use Google Scholar

use Elicit

Literature Review (Domain Dive)

Are there any learnings from cities with similar demographics?

Google Scholar -> search

List of 10 articles generated by ChatGPT

getliner.com

elicit.com

connectedpapers.com

Take the best articles I find, look at the papers they reference, and then look through the papers that cite the ones I like. (Is that a feature of Google Scholar?) (Yes. I can generate my own citation tree? What’s the purpose of a citation tree?)

Domain Dive: NYC Dept of Finance. What are class 2 properties?

“we use statistical modeling to calculate the typical income and expenses for properties similar to yours in size, location, age, and number of units. The process varies depending upon whether your property has more or less than 10 units.”

Research Question:

Richer co-op buildings may pay lower property taxes than expected because they tend to have sophisticated owners who can appeal their assessed values and win reductions.

Goal: Develop a hypothesis or research question that proposes a relationship to variables inherent to that problem or phenomenon

This describes what problem you’re seeking to examine, phenomenon you want to explore or effect that you want to measure in the study.

Hypothesis Statement
What we’re trying to do is model property tax assessments for co-ops in NYC. By comparing the output of the model to actual assessments, we can look for properties with a large difference between expected and actual taxes. Our theory is that some particularly rich buildings will be paying lower taxes than we’d expect, due to the sophistication of owners who were able to appeal and reduce their assessed property taxes.

Null Hypothesis
Assessed property taxes are fair and there is no advantage afforded to a few socioeconomically sophisticated buildings. We would confirm that by showing that residuals are random, normally distributed, and not moderated by a building being identified as high-end. Another way to look at it: if the null does not hold, we would expect the rich buildings to have a higher variability in assessed taxes than the not-rich buildings.

Data and Variables:

This section describes what data you intend to use, how they were acquired, and how they represent the variables you have chosen.

Goal: Explore and identify data sets to create variables that represent these factors, as well as statistical and or machine learning methods to model measure and/or predict this phenomenon

We’ve found NYC property tax data from Open Data NYC. Discuss what I’ve found.

I’m a little concerned that I need to do individual building valuations and that there are multiple buildings on any block/lot combination, with no way to primary-key individual buildings across the publicly available NYC data.

I also need to do a better domain dive of what NYC discloses of their methodology

What we don’t have is a good sophistication flag. I tried pulling educational attainment from the US Census Bureau at the zip code level, but my zip code in NYC has tens of thousands of buildings (citation needed). GH suggested I look at how granular that data is, maybe by voting precinct. I think the sophistication flag is price per square foot for the average apartment in that building: less than $1200 is low, between $1200 and $1300 is mid, and above $1300/sqft is high. I could use a decision tree to come up with better cutoffs.
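A minimal Python sketch of the provisional flag described above. The cutoffs are the ones from the note; the function name and the choice to treat exactly $1200 and $1300 as "mid" are assumptions, and the cutoffs themselves are not validated.

```python
# Provisional cutoffs from the note above (dollars per square foot).
LOW_MAX = 1200.0   # below this: "low"
HIGH_MIN = 1300.0  # above this: "high"

def sophistication_tier(price_per_sqft: float) -> str:
    """Map a building's average price per square foot to a tier label."""
    if price_per_sqft < LOW_MAX:
        return "low"
    if price_per_sqft <= HIGH_MIN:
        return "mid"
    return "high"

# Made-up prices, one per tier.
for ppsf in (950.0, 1250.0, 1800.0):
    print(ppsf, sophistication_tier(ppsf))
```

The mid band is only $100 wide, so most buildings would likely land in "low" or "high"; that is one reason to revisit the cutoffs with data.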

Could possibly do some data visualization with an NYC map.

There are 71k 4+ residence units (over how many years?) in the Manhattan zip code of 10019 alone. Zip code doesn’t give us the granularity or resolution we need to view wealth differences between individual buildings.

We’re also having an issue with block and lot number not being granular enough. On any given lot in a block there are 1-10 buildings so we need a building ID code that’s consistent across datasets

DATA EXPLORATION

Three values:

Model’s estimated value

The actual value as provided in our data

The formulaic model

We went to the data to look at this. Not sure if the TotVal column is the assessed tax. I can narrow it down to look at my building and compare.

Describe each dataset and where it comes from, with proper citation. Find a list from NYC DOF with the variable name descriptions. Narrow the number of records to Taxtype=2. Can narrow the database to Manhattan only, and that could be enough for the project.

But could even initially narrow the database to a particular zipcode in the beginning so that I can build an end to end model with the smallest amount of data and then build up from there.
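The narrowing steps above could be sketched as follows. This is a Python illustration on a tiny inline sample, not the real extract: the column names BORO and TAXCLASS follow the sample record in these notes, BORO "1" is assumed to mean Manhattan, and every row except the first is made up.

```python
import csv
import io

# Inline stand-in for the DOF file. Only the first data row mirrors the
# sample record in these notes; the others are invented for illustration.
SAMPLE = """BBLE,BORO,BLOCK,LOT,TAXCLASS,AVTOT
1000163859,1,16,3859,2,159381
1000170001,1,17,1,4,900000
3000010001,3,1,1,2,50000
"""

def filter_rows(text, boro="1", tax_class="2"):
    """Keep rows for one borough and one tax class (exact string match)."""
    reader = csv.DictReader(io.StringIO(text))
    return [r for r in reader if r["BORO"] == boro and r["TAXCLASS"] == tax_class]

rows = filter_rows(SAMPLE)
print(len(rows), [r["BBLE"] for r in rows])
```

The exact match on "2" would drop any sub-class codes if the data contains them, so the distinct TAXCLASS values are worth listing before filtering.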

I also want to use Census data to get average income per zip code to predict value based on zip code… except NYC is so dense. You have a $1 pizza shop next to a $20-a-person Italian restaurant next to a $200-a-person Italian restaurant. If you charged all three an average tax based on earnings potential, knowing only the average income of the street, the $1 pizza shop would be disadvantaged and the $200-a-person Italian restaurant would be advantaged.

Also, check the warnings and issues from read_csv() to see if there is anything that would give us pause and make us want to find a different route of processing. For example, if there is a parsing issue with a specific delimiter…
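One kind of parsing issue is a row that splits into the wrong number of fields, for example an unquoted comma inside an owner name. The sketch below checks for that in Python with the standard csv module; it is a stand-in for inspecting the read_csv() warnings, the helper name is hypothetical, and the second data row is made up.

```python
import csv
import io

def find_ragged_rows(text, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            problems.append((reader.line_num, len(row)))
    return problems

# Row 2 quotes the comma in the owner name; row 3 (invented) does not.
SAMPLE = 'OWNER,BLDGCL,AVTOT\n"CHEN, QI TOM",R4,159381\nSMITH, JANE,D8,1000\n'
print(find_ragged_rows(SAMPLE))
```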

Which columns are of importance to us? The basic values we’ll use in our regression. Also: run missing-value statistics on the columns.
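A rough Python sketch of the missing-value statistics. It assumes empty strings and the literal "NA" both count as missing, which matches how the sample record in these notes prints; the second row is made up.

```python
MISSING = {"", "NA"}  # assumption: both spellings mean "no value"

def missing_share(rows, columns):
    """Fraction of rows in which each column is missing."""
    total = len(rows)
    if total == 0:
        return {}
    return {
        col: sum(1 for r in rows if r.get(col, "NA") in MISSING) / total
        for col in columns
    }

rows = [
    {"FULLVAL": "354180", "AVTOT": "159381", "POSTCODE": "NA"},
    {"FULLVAL": "500000", "AVTOT": "", "POSTCODE": "10019"},  # invented row
]
print(missing_share(rows, ["FULLVAL", "AVTOT", "POSTCODE"]))
```

Zeros may also be placeholders: the sample record has "0" for LTFRONT, LTDEPTH, BLDFRONT and BLDDEPTH, so a separate count of zeros per column may be worth running.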

Steps for next time I look this up: the basic regression I want to run; which columns I want to focus on; pick one of the literature review links.

Data Display

Good: Includes appropriate, well-labeled, accurate displays (graphs and tables) of the data.

OK: Includes appropriate, accurate displays of the data.

Bad: Includes appropriate but no accurate displays of the data.

BLDGCL: what do these values mean (R0, D8, C6, R4)? I need to pull up a data dictionary.

I should be putting my summary of work here and this more detailed work in another document

Use LTFRONT and LTDEPTH to get the lot footprint; do the same with BLDFRONT and BLDDEPTH for the building footprint, then multiply by STORIES to get the square footage of the building.
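As a sketch of that arithmetic in Python: it assumes the dimensions are in feet, the footprint is rectangular, and every story covers the full footprint, none of which is confirmed by a data dictionary yet. A dimension of 0 is treated as unknown, as in the sample record.

```python
def lot_area(lt_front, lt_depth):
    """Lot footprint (front x depth), or None when a dimension is 0 or negative."""
    if lt_front <= 0 or lt_depth <= 0:
        return None
    return lt_front * lt_depth

def building_sqft(bld_front, bld_depth, stories):
    """Rough gross square footage: building footprint times number of stories."""
    if bld_front <= 0 or bld_depth <= 0 or stories <= 0:
        return None
    return bld_front * bld_depth * stories

print(lot_area(25, 100))          # made-up lot dimensions
print(building_sqft(25, 80, 5))   # made-up building dimensions
print(building_sqft(0, 0, 31))    # dimensions as in the sample record
```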

FULLVAL, AVLAND, AVTOT -> go through and do individual calculations to see how these are related.
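A first calculation of this kind, in Python, using the values from the sample record in these notes. For that one record AVTOT divided by FULLVAL comes out to 0.45; whether that ratio holds across the data set, and what AVTOT2 represents, still needs to be checked against the DOF documentation.

```python
# Values copied from the sample record (BBLE 1000163859, year 2018/19).
record = {"FULLVAL": 354180, "AVLAND": 3310, "AVTOT": 159381, "AVTOT2": 148953}

ratio = record["AVTOT"] / record["FULLVAL"]      # assessed total vs. full value
ratio2 = record["AVTOT2"] / record["FULLVAL"]    # second total vs. full value
land_share = record["AVLAND"] / record["AVTOT"]  # land portion of assessed total

print(round(ratio, 4), round(ratio2, 4), round(land_share, 4))
```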

Issue with one data set: it only runs up to the 2018-2019 plan year, so later years need to be grabbed from another, similar data set.

Can I calculate an assessed value per square foot -> compare that to buildings in the same block-lot, and then what about adjacent lots?
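One way to sketch the comparison in Python: compute assessed value per square foot, then express each building relative to the median of its block. The building records below are made up, and grouping by (borough, block) is an assumption about what "nearby" should mean.

```python
from collections import defaultdict
from statistics import median

# Invented buildings: assessed total (avtot) and square footage (sqft).
buildings = [
    {"id": "A", "boro": "1", "block": "16", "avtot": 900_000, "sqft": 10_000},
    {"id": "B", "boro": "1", "block": "16", "avtot": 1_100_000, "sqft": 10_000},
    {"id": "C", "boro": "1", "block": "16", "avtot": 500_000, "sqft": 10_000},
    {"id": "D", "boro": "1", "block": "17", "avtot": 400_000, "sqft": 8_000},
]

def av_per_sqft_vs_block(rows):
    """Ratio of each building's assessed value per sqft to its block median."""
    by_block = defaultdict(list)
    for r in rows:
        by_block[(r["boro"], r["block"])].append(r["avtot"] / r["sqft"])
    medians = {key: median(vals) for key, vals in by_block.items()}
    return {
        r["id"]: (r["avtot"] / r["sqft"]) / medians[(r["boro"], r["block"])]
        for r in rows
    }

relative = av_per_sqft_vs_block(buildings)
print(relative)
```

A ratio well below 1 marks a building assessed lightly relative to its neighbors; a block with a single building always scores exactly 1, so sparse blocks carry no signal.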

List of data I’m looking at: NYC Open Data, NYC DOF.

I have a database of appeals information. Can I compare that to the price per square foot and see if wealthy buildings are more likely to appeal? Also, only successful appeals are in the database; if I could find a database with unsuccessful appeals, that could be interesting.

Basically there are three mechanisms we are looking at for inequity:

1) rich people are better able to appeal;

2) part of the statistical calculation for tax assessments used by NYC DOF penalizes poorer buildings and subsidizes richer buildings by tending to assume an average value; or

3) the price per square foot is distorted, either by an underlying building mortgage, which shouldn’t affect the taxable value but would affect the sale value, or by high maintenance due to luxury amenities and staff. Given two equal units, if one building has higher maintenance, its price is lower.

But is that true? Does the price accurately capture the value of having a doorman? Sigh, this also means price per square foot isn’t a great indicator of richness…

I need to start over, grab more datasets, do basic data exploration, and evaluate missingness and imputation strategies.

No matter what we’re only looking at tax class 2 buildings. Maybe that can be part of the domain dive section: where we discuss how the DOF assesses tax values.

There’s also the PLUTO database, and that has a building ID, but is that consistent across other databases?

BBLE “1000163859”
BORO “1”
BLOCK “16”
LOT “3859”
EASEMENT NA
OWNER “CHEN, QI TOM”
BLDGCL “R4”
TAXCLASS “2”
LTFRONT “0”
LTDEPTH “0”
EXT NA
STORIES “31”
FULLVAL “354180”
AVLAND “3310”
AVTOT “159381”
EXLAND “3310”
EXTOT “159381”
EXCD1 “6800”
STADDR “1 RIVER TERRACE”
POSTCODE NA
EXMPTCL NA
BLDFRONT “0”
BLDDEPTH “0”
AVLAND2 “3310”
AVTOT2 “148953”
EXLAND2 “3310”
EXTOT2 “148953”
EXCD2 NA
PERIOD “FINAL”
YEAR “2018/19”
VALTYPE “AC-TR”
Borough NA
Latitude NA
Longitude NA
Community Board NA
Council District NA
Census Tract NA
BIN NA
NTA NA
New Georeferenced Column NA
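In the record above, BBLE “1000163859” lines up with BORO “1”, BLOCK “16” and LOT “3859”, which suggests a layout of 1 borough digit, 5 block digits and 4 lot digits. The Python sketch below splits and rebuilds the key on that inferred layout; treating any trailing characters as an easement code is an assumption.

```python
def parse_bble(bble: str):
    """Split a BBLE string into (boro, block, lot, easement)."""
    if len(bble) < 10 or not bble[:10].isdigit():
        raise ValueError(f"unexpected BBLE: {bble!r}")
    boro = int(bble[0])
    block = int(bble[1:6])
    lot = int(bble[6:10])
    easement = bble[10:] or None  # assumed: anything after 10 digits
    return boro, block, lot, easement

def make_bbl(boro: int, block: int, lot: int) -> str:
    """Rebuild the 10-digit key from its parts, zero-padding block and lot."""
    return f"{boro}{block:05d}{lot:04d}"

print(parse_bble("1000163859"))
print(make_bbl(1, 16, 3859))
```

A key built this way identifies a lot, not a building, so it does not by itself solve the multiple-buildings-per-lot problem noted earlier.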

Where can I find the NYC DOF glossary? Is AVTOT Average Total value or Assessed Value Total? Is FULLVAL market value? Is EXTOT exempt value? Is AVTOT2 assessed value total after exemptions? Need to use summary(), look for missing values, use skimr::skim(), and look for outliers and ranges.
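A rough Python counterpart to the range and outlier check mentioned above, using Tukey fences (1.5 times the interquartile range beyond the quartiles). The values are made up, and the fence multiplier is a common default rather than something tuned for assessment data.

```python
from statistics import median, quantiles

def summarize(values, k=1.5):
    """Range, median, and values outside the Tukey fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {
        "min": min(values),
        "max": max(values),
        "median": median(values),
        "outliers": [v for v in values if v < low or v > high],
    }

fullval = [100, 110, 120, 130, 140, 150, 160, 170, 5000]  # invented values
summary = summarize(fullval)
print(summary)
```

Property values tend to be strongly right-skewed, so running the same check on log values may flag fewer legitimate luxury buildings as outliers.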

Use Census data to come up with average salary per zipcode to help determine Richness by zipcode…

Statistical Methods:

This section describes the methods you used to analyze the data.

Goal: design and implement a model or set of models to measure or explore these relationships

Ridge Regression

Try linear regression -> I may know what my features are because of the domain dive but if not I could try lasso regression. For a nonlinear model maybe I could do random forest. How would the residuals from a linear vs nonlinear model be a diagnostic in and of itself?
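As a starting point, here is a Python sketch of plain ordinary least squares on a single feature, with residuals defined as actual minus predicted. It is deliberately simpler than the ridge, lasso or random forest options named above, and the square footage and assessed values are made up.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals(xs, ys, intercept, slope):
    """Actual minus predicted for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

sqft = [1000, 2000, 3000, 4000]                 # invented feature
assessed = [50_000, 100_000, 150_000, 170_000]  # invented target

intercept, slope = fit_line(sqft, assessed)
res = residuals(sqft, assessed, intercept, slope)
print(intercept, slope, res)
```

With this sign convention, a negative residual means the building is assessed below what the model expects.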

Use a train/test split and k-fold cross-validation.
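A minimal Python sketch of how k-fold index splitting works, written by hand so the mechanics are visible. The function name and seed are illustrative; a library routine would normally do this.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal slices
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(10, 5))
for train, test in splits:
    print(sorted(test), len(train))
```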

Residual Analysis

Analyze residuals across building types.

Visualize anomalies - what we’re expecting to see is greater residuals for wealthy and poor buildings w
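The comparison across building types could be sketched in Python as a per-tier summary of residuals. The tier labels and residual values are made up, and residuals are assumed to be actual minus predicted.

```python
from collections import defaultdict
from statistics import mean, pstdev

def residual_summary(rows):
    """rows: iterable of (tier, residual). Returns {tier: (n, mean, sd)}."""
    by_tier = defaultdict(list)
    for tier, res in rows:
        by_tier[tier].append(res)
    return {t: (len(v), mean(v), pstdev(v)) for t, v in by_tier.items()}

rows = [  # invented residuals
    ("low", 2.0), ("low", -2.0),
    ("mid", 1.0), ("mid", -1.0),
    ("high", -8.0), ("high", -12.0),
]
by_tier = residual_summary(rows)
for tier, (n, avg, sd) in by_tier.items():
    print(tier, n, avg, sd)
```

Under that sign convention, a clearly negative mean residual for the high tier would be consistent with the under-assessment theory, while a mean near zero with a larger spread would point to higher variability instead.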

Statistical Validity

How sound are the research design and methods?

Is the sample of observations selected for the test reflective of the population? Are the values of the independent variables not dependent on each other? Are there significant confounding or exogenous factors influencing the dependent variable that need to be controlled for in the model?

Internal Validity

Do the data sets and variables accurately represent the phenomena being explored?

External validity

Can the results of the study be generalized?

Limitations

Potential concerns: if we train the model on the historically available data and unfairness is already present in that data (richer buildings not paying their fair share), then the model would learn to predict relatively low property tax assessments for the higher-end buildings, and their residuals would look normal. A nonlinear model may capture the unfairness anomaly as an expected part of the model. If we use a linear model, and the actual tax assessment methodology is linear, then we will avoid this problem. If we can’t use differences in the residuals to identify the tax anomaly, then we may have to find other aspects of the model to look for taxation anomalies. [Later, go through and standardize the nomenclature]

Model Selection

How did the type of relationships among the variables or the end result influence which statistical or machine learning model was most appropriate?

consider the scikit-learn algorithm cheat-sheet OR https://www.analyticssteps.com/blogs/5-statistical-data-analysis-techniques-statistical-modelling-machine-learning

Main classifications are: regression / classification / clustering / dimensionality reduction

Model Fit

Discuss my work in relationship to overfitting and underfitting

Need to discuss feature importance and partial dependence plots. What other visuals can I interpret?

Data Analysis

Good: The appropriate statistical test(s) was used for the data and interpretation was clear.

OK: The appropriate statistical test(s) was used but interpretation was not fully clear or well-articulated.

Bad: The incorrect statistical test was used and/or not justified for the data as presented.

Discussion of Results:

In this section, you describe the results of your statistical analyses, their significance, and how they compare or contrast with those from other studies.

Goal: interpret, develop and formulate/articulate conclusions from the results

One concern is that the model might learn the systemic bias, so maybe that’s another way to look at this. If the model shows even residuals across poor, mid and rich buildings… I guess that would indicate the unfairness is more systemic than a matter of sophisticated owners being able to appeal.

Also, this is easy to see with linear regression, but if we have some less interpretable non-linear model then I’m not sure what conclusions I’ll be able to draw.

Conclusion:

This section summarizes your final thoughts on your findings and what they show, as well as disclose limitations to the study and suggest future avenues for research.

Good: Conclusion includes a clear answer to the statistical question that is consistent with the data analysis and the method of data collection.

OK: Conclusion includes an answer to the statistical question that is consistent with the data but not with the data collection method.

Bad: Conclusion does not include an answer to the statistical question that is consistent with the data analysis.

Final Presentation

Include a link to the YouTube presentation ***

Good: Speaker articulates the nature of the research and shows the study’s findings clearly, concisely and succinctly

OK: Speaker hits most of the highlights

Bad: Speaker does not clearly and concisely explain the nature of the research project nor its findings.

Write an article for The Co-operator ***

Include a link to the data repository (GitHub or elsewhere), or to where on the NYC websites the data can be found; either way, it should already be processed. Maybe I can have links to other RPubs documents where the data scrubbing is visible ***

Simple website where you enter your building’s Block and Lot and are able to see a Tax Anomaly-ometer where:

Green - fine/under

Yellow - fine

Orange - overpaying

Red - definitely appeal

Maybe the website can also produce a small report that a board member can bring back to their board to discuss.

Share as a marketing tool for Daisy, our property management company.
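The color scale above could be sketched in Python as a mapping from the ratio of actual to expected assessment. The cutoff values here are placeholders invented for illustration; real ones would have to come from the distribution of model residuals.

```python
def anomaly_color(actual, expected, cutoffs=(1.00, 1.10, 1.25)):
    """Map actual/expected assessment to a Tax Anomaly-ometer color.

    cutoffs are hypothetical upper bounds for green, yellow and orange.
    """
    if expected <= 0:
        raise ValueError("expected assessment must be positive")
    ratio = actual / expected
    green_max, yellow_max, orange_max = cutoffs
    if ratio <= green_max:
        return "green"   # fine/under
    if ratio <= yellow_max:
        return "yellow"  # fine
    if ratio <= orange_max:
        return "orange"  # overpaying
    return "red"         # definitely appeal

# Made-up assessments against an expected value of 100.
for actual in (90, 105, 120, 150):
    print(actual, anomaly_color(actual, 100))
```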

CUNY MSDS Capstone Project

Taxation Equity for NYC Co-ops

PK O’Flaherty

Stage 2

2025-04-23