Part I (Why I chose this dataset)

The dataset that interests me most comes from the Independent Budget Office (IBO) of New York City and is available through NYC Open Data. The IBO offers a nonpartisan perspective on the city’s budget and tax revenues. This dataset, titled “Annual tax revenue from major tax sources, from FY 1980 - FY 2020,” presents four decades of revenue data in CSV format, broken down by tax source. I am particularly interested in using regression analysis to relate tax revenue sources to economic events such as the financial crisis, demographic changes, and major shifts in government policy, in order to uncover sociological patterns in tax revenue among NYC residents. This analysis could provide valuable insights into the resilience and adaptability of New York City’s tax structure, as well as the complex relationships between economic conditions and policy decisions.

Introduction

The goal of this first analysis is to visualize the share of different tax sources during the past four decades.

Variable Descriptions

  • PRT: Property Tax is the tax on real property, such as homes and commercial buildings.
  • PIT: Personal Income Tax is the tax on the income of individuals.
  • SAT: General Sales Tax is the tax on consumer purchases of goods and services.
  • COT: General Corporation Tax is the tax on the profits of S-corporations.
  • FCT: Financial Corporation Tax is a tax on financial corporations, savings and loan associations, trust companies, and state banks.
  • UBT: Unincorporated Business Tax is a tax on businesses that are not structured as corporations. This includes sole proprietorships, partnerships, and limited liability companies.
  • MRT: Mortgage Recording Tax is a fee paid when a mortgage is recorded in New York City. The tax is based on the amount of the mortgage and the property’s location.
  • CRT: Commercial Rent Tax is a tax on the annual base rent of commercial properties in Manhattan, New York City.
  • RPT: Conveyance of Real Property Tax is a tax levied on the sale or transfer of real property in New York City.
  • OTT: Other taxes include business income taxes, the Metropolitan Commuter Transportation Mobility Tax (MCTMT), and the Pass-Through Entity Tax (PTET).
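
# Load the packages used below: dplyr and tidyr for reshaping,
# ggplot2 for plotting, maxLik for the estimation in Part III
library(dplyr)
library(tidyr)
library(ggplot2)
library(maxLik)

setwd("C:/Users/Yung Cho/Documents/DATA712/Data")
tax_data <- read.csv("NYC_Independent_Budget_Office__IBO__Tax_Revenue_FY_1980_-_FY_2020_20250228.csv")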

# Remove the Total column
tax_data <- tax_data %>%
        select(-"Total.Taxes")
# Rename the columns
colnames(tax_data) <- c("Year", "PRT", "PIT", "SAT", "COT", "FCT", "UBT", "MRT", "CRT", "RPT", "OTT")

# Strip dollar signs and commas so the revenue columns can be converted to numeric
tax_data <- tax_data %>%
  mutate(across(PRT:OTT, ~ as.numeric(gsub("[$,]", "", .))))

str(tax_data)
## 'data.frame':    41 obs. of  11 variables:
##  $ Year: int  2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 ...
##  $ PRT : num  2.98e+10 2.79e+10 2.64e+10 2.47e+10 2.32e+10 ...
##  $ PIT : num  1.38e+10 1.36e+10 1.34e+10 1.13e+10 1.14e+10 ...
##  $ SAT : num  7.39e+09 7.84e+09 7.46e+09 7.03e+09 7.17e+09 ...
##  $ COT : num  5.17e+09 4.73e+09 4.10e+09 4.05e+09 3.63e+09 ...
##  $ FCT : num  82902210 -1282663 394858132 435658184 689535134 ...
##  $ UBT : num  2.05e+09 2.12e+09 2.27e+09 2.08e+09 2.11e+09 ...
##  $ MRT : num  9.75e+08 1.10e+09 1.05e+09 1.12e+09 1.23e+09 ...
##  $ CRT : num  9.43e+08 9.95e+08 9.19e+08 9.21e+08 8.37e+08 ...
##  $ RPT : num  1.14e+09 1.56e+09 1.43e+09 1.42e+09 1.79e+09 ...
##  $ OTT : num  1.75e+09 1.71e+09 1.66e+09 1.67e+09 1.59e+09 ...
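
The currency-cleaning step can be checked on a small vector before trusting it on the full table. This is a minimal sketch: the strings in raw are made-up examples of how the values might be formatted in the CSV, not rows copied from the IBO file.

```r
# Hypothetical currency strings (illustrative only, not taken from the IBO file)
raw <- c("$1,234,567", "-$1,282,663", "$82,902,210")

# Remove every dollar sign and comma, then convert the remainder to numeric
clean <- as.numeric(gsub("[$,]", "", raw))

print(clean)
```

Negative amounts keep their sign because only the dollar sign and commas are removed.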

Part II (Pivot Long View)

In this example, the pivot_longer function demonstrates a helpful technique. It converts a wide-format data frame of column totals, with one row and ten columns, into a long-format structure with ten rows and two columns: the tax type and its total sum. The advantage of the long format is that ggplot2 can map one column to the x-axis and another to the y-axis, which makes it easier to create graphs that compare the performance of each variable.
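
The change in shape can be seen on a small table. This is a sketch with made-up numbers; toy_wide and toy_long are hypothetical names used only for illustration.

```r
library(tidyr)

# Hypothetical one-row table of totals (illustrative values, not IBO figures)
toy_wide <- data.frame(PRT = 10, PIT = 6, SAT = 4)

# Stack the three columns into name/value pairs
toy_long <- pivot_longer(toy_wide,
                         cols = PRT:SAT,
                         names_to = "Tax_Type",
                         values_to = "Total_Sum")

dim(toy_wide)  # one row, three columns
dim(toy_long)  # three rows, two columns
```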

# Calculate the total sum for each column
total_sum <- tax_data %>%
        summarise(across(PRT:OTT, ~ sum(.x, na.rm = TRUE)))

# Convert the total_sum to a long format
total_sum_long <- total_sum %>%
        pivot_longer(cols = everything(), names_to = "Tax_Type", values_to = "Total_Sum")

# Create the bar chart
ggplot(total_sum_long, aes(x = Tax_Type, y = Total_Sum)) +
        geom_bar(stat = "identity", fill = "blue") +
        scale_y_continuous(labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +  # 1e-9 converts dollars to billions
        labs(title = "Total Sum of Tax Revenue by Type (1980-2020)",
             x = "Tax Type",
             y = "Total Sum in Billions") +
        theme_minimal()
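
Since the stated goal is to look at the share of each tax source, the totals can also be expressed as percentages of the combined total. A minimal sketch, assuming a long table shaped like total_sum_long; the figures in toy_totals are made up for illustration, and Share_Pct is a hypothetical column name.

```r
library(dplyr)

# Hypothetical totals for three tax types (illustrative values, not IBO data)
toy_totals <- data.frame(
  Tax_Type  = c("PRT", "PIT", "SAT"),
  Total_Sum = c(500, 300, 200)
)

# Share of each tax type as a percentage of the combined total
toy_shares <- toy_totals %>%
  mutate(Share_Pct = 100 * Total_Sum / sum(Total_Sum))

print(toy_shares)
```

The same mutate() step could be applied to total_sum_long, with Share_Pct mapped to the y-axis of the bar chart.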

# Pivot the data to a long format for easier plotting
tax_data_long_extended <- tax_data %>%
        pivot_longer(
                cols = PRT:OTT,  # Specify the columns to pivot
                names_to = "Tax_Type",  # Name of the new column for tax types
                values_to = "Revenue"  # Name of the new column for revenue values
        ) %>%
        filter(Year >= 1980 & Year <= 2020)

# Create a faceted plot to visualize the revenue for each tax type
ggplot(tax_data_long_extended, aes(x = Year, y = Revenue, color = Tax_Type)) +
        geom_line() +
        facet_wrap(~ Tax_Type, scales = "free_y", ncol = 2) +  # Adjust the number of columns
        scale_x_continuous(breaks = seq(1980, 2020, by = 5)) +
        scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
        labs(title = "Tax Revenue by Type (1980-2020)",
             x = "Year",
             y = "Revenue (in Millions)") +
        theme_minimal() +
        theme(
                legend.position = "none",
                strip.text = element_text(size = 12),  # Increase facet label size
                axis.text = element_text(size = 10),   # Increase axis text size
                axis.title = element_text(size = 12),  # Increase axis title size
                plot.title = element_text(size = 14)   # Increase plot title size
        )

Part III (Linear Regression via MLE)

data("ChickWeight")

str(ChickWeight)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame':   578 obs. of  4 variables:
##  $ weight: num  42 51 59 64 76 93 106 125 149 171 ...
##  $ Time  : num  0 2 4 6 8 10 12 14 16 18 ...
##  $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ Diet  : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "formula")=Class 'formula'  language weight ~ Time | Chick
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "outer")=Class 'formula'  language ~Diet
##   .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv> 
##  - attr(*, "labels")=List of 2
##   ..$ x: chr "Time"
##   ..$ y: chr "Body weight"
##  - attr(*, "units")=List of 2
##   ..$ x: chr "(days)"
##   ..$ y: chr "(gm)"
# Transform the dataset to long format
ChickWeight_long <- ChickWeight %>%
  pivot_longer(cols = c(weight, Time),
               names_to = "variable",
               values_to = "value")
head(ChickWeight_long)
## # A tibble: 6 × 4
##   Chick Diet  variable value
##   <ord> <fct> <chr>    <dbl>
## 1 1     1     weight      42
## 2 1     1     Time         0
## 3 1     1     weight      51
## 4 1     1     Time         2
## 5 1     1     weight      59
## 6 1     1     Time         4
# Transform the dataset to wide format
ChickWeight_wide <- ChickWeight_long %>%
  pivot_wider(names_from = variable,
              values_from = value)
head(ChickWeight_wide)
## # A tibble: 6 × 4
##   Chick Diet  weight     Time      
##   <ord> <fct> <list>     <list>    
## 1 1     1     <dbl [12]> <dbl [12]>
## 2 2     1     <dbl [12]> <dbl [12]>
## 3 3     1     <dbl [12]> <dbl [12]>
## 4 4     1     <dbl [12]> <dbl [12]>
## 5 5     1     <dbl [12]> <dbl [12]>
## 6 6     1     <dbl [12]> <dbl [12]>
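
The list columns in this output arise because Chick and Diet alone do not identify a single observation once weight and Time have been stacked, so pivot_wider collects all twelve values per chick into one cell. One way to make the round trip lossless is to add a row identifier before pivoting. This is a sketch, and obs_id, cw, cw_long, and cw_wide are hypothetical names that do not appear in the original analysis.

```r
library(dplyr)
library(tidyr)

data("ChickWeight")

# Work on a plain data frame and tag each observation with a row identifier
cw <- as.data.frame(ChickWeight) %>%
  mutate(obs_id = row_number())

# Long format: two rows (weight, Time) per observation
cw_long <- cw %>%
  pivot_longer(cols = c(weight, Time),
               names_to = "variable",
               values_to = "value")

# Wide format: obs_id keeps each observation on its own row
cw_wide <- cw_long %>%
  pivot_wider(names_from = variable,
              values_from = value)

head(cw_wide)
```

With obs_id among the identifying columns, weight and Time come back as ordinary numeric columns with one row per original observation.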
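# Linear regression using MLE
# param = (sigma, beta1, beta2); the function returns the normal log-likelihood
mle_model <- function(param) {
  beta  <- param[-1]   # regression coefficients
  sigma <- param[1]    # residual standard deviation
  y <- as.vector(ChickWeight$weight)
  # NOTE: the column is named "Diet"; ChickWeight$diet is NULL,
  # so this design matrix contains only the constant 1
  x <- cbind(1, ChickWeight$diet)
  mu <- x %*% beta
  sum(dnorm(y, mu, sigma, log = TRUE))
}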

# Run MLE
mle_result <- maxLik(logLik = mle_model, start = c(sigma = 1, beta1 = 1, beta2 = 1))
summary(mle_result)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 51 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -3282.421 
## 3  free parameters
## Estimates:
##       Estimate Std. error t value Pr(> t)
## sigma    70.83        Inf       0       1
## beta1   116.46        Inf       0       1
## beta2   127.17        Inf       0       1
## --------------------------------------------
# Checking the results
summary(lm(weight ~ Diet, data = ChickWeight))
## 
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -103.95  -53.65  -13.64   40.38  230.05 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  102.645      4.674  21.961  < 2e-16 ***
## Diet2         19.971      7.867   2.538   0.0114 *  
## Diet3         40.305      7.867   5.123 4.11e-07 ***
## Diet4         32.617      7.910   4.123 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.33 on 574 degrees of freedom
## Multiple R-squared:  0.05348,    Adjusted R-squared:  0.04853 
## F-statistic: 10.81 on 3 and 574 DF,  p-value: 6.433e-07

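The MLE output reports infinite standard errors (Inf) and t values of 0, which suggests that the optimization did not produce a usable Hessian, so the variance of the estimates was not properly estimated. The point estimates also cannot be compared directly with the lm() coefficients. The likelihood function refers to ChickWeight$diet, but the column is named Diet, so the design matrix never includes the diet variable. In addition, lm() treats Diet as a four-level factor and estimates an intercept plus three dummy coefficients, whereas the MLE has only two beta parameters. The log-likelihood of -3282.421 is not by itself a measure of fit; it is only informative when compared with the log-likelihood of another model fitted to the same data.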
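
A possible correction is sketched below under a few assumptions. The design matrix is built with model.matrix() so that Diet enters as a factor, as in the lm() call, and sigma is estimated on the log scale so the optimizer cannot step to non-positive values. The names X, loglik_fn, start, and fit are illustrative and are not part of the original analysis, and the starting values are an arbitrary choice.

```r
library(maxLik)

data("ChickWeight")

y <- ChickWeight$weight
# Intercept plus dummy variables for Diet2, Diet3, Diet4 (same coding as lm)
X <- model.matrix(~ Diet, data = ChickWeight)

# Normal log-likelihood; param = (log_sigma, four regression coefficients)
loglik_fn <- function(param) {
  sigma <- exp(param[1])            # back-transform so sigma is always positive
  beta  <- param[-1]
  mu    <- as.vector(X %*% beta)    # fitted mean for every observation
  sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Start from the overall mean and standard deviation of weight
start <- c(log_sigma = log(sd(y)), b0 = mean(y), b2 = 0, b3 = 0, b4 = 0)

fit <- maxLik(logLik = loglik_fn, start = start)
summary(fit)

# Put the point estimates next to the lm() coefficients
cbind(MLE = coef(fit)[-1], OLS = coef(lm(weight ~ Diet, data = ChickWeight)))
```

If the optimizer converges, the beta estimates should be close to the lm() coefficients. The implied sigma, exp(log_sigma), should come out slightly below the residual standard error because the maximum likelihood estimate divides the residual sum of squares by n rather than by n - 4.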