The data set that interests me most comes from the Independent Budget Office (IBO) of New York City and is available through NYC Open Data. The IBO offers a nonpartisan perspective on the city’s budget and tax revenues. The dataset, titled “Annual tax revenue from major tax sources, from FY 1980 - FY 2020,” covers four decades in CSV format and breaks revenue down by tax source. I am particularly interested in regressing revenue from each tax source on economic events, such as the financial crisis, demographic changes, and major shifts in government policy, to see how tax revenue among NYC residents relates to broader social and economic patterns. This analysis could provide insight into the resilience and adaptability of New York City’s tax structure, as well as the relationships between economic conditions and policy decisions.
The goal of this first analysis is to visualize the share of different tax sources during the past four decades.
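library(dplyr)    # %>%, select(), mutate(), summarise(), across()
library(tidyr)    # pivot_longer(), pivot_wider()
library(ggplot2)  # plotting
library(maxLik)   # maxLik() for maximum likelihood estimation
setwd("C:/Users/Yung Cho/Documents/DATA712/Data")
tax_data <- read.csv("NYC_Independent_Budget_Office__IBO__Tax_Revenue_FY_1980_-_FY_2020_20250228.csv")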
# Remove the Total column
tax_data <- tax_data %>%
  select(-"Total.Taxes")
# Rename the columns
colnames(tax_data) <- c("Year", "PRT", "PIT", "SAT", "COT", "FCT", "UBT", "MRT", "CRT", "RPT", "OTT")
# Strip dollar signs and commas so the revenue columns can be converted to numeric
tax_data <- tax_data %>%
  mutate(across(PRT:OTT, ~ as.numeric(gsub("[$,]", "", .))))
str(tax_data)
## 'data.frame': 41 obs. of 11 variables:
## $ Year: int 2020 2019 2018 2017 2016 2015 2014 2013 2012 2011 ...
## $ PRT : num 2.98e+10 2.79e+10 2.64e+10 2.47e+10 2.32e+10 ...
## $ PIT : num 1.38e+10 1.36e+10 1.34e+10 1.13e+10 1.14e+10 ...
## $ SAT : num 7.39e+09 7.84e+09 7.46e+09 7.03e+09 7.17e+09 ...
## $ COT : num 5.17e+09 4.73e+09 4.10e+09 4.05e+09 3.63e+09 ...
## $ FCT : num 82902210 -1282663 394858132 435658184 689535134 ...
## $ UBT : num 2.05e+09 2.12e+09 2.27e+09 2.08e+09 2.11e+09 ...
## $ MRT : num 9.75e+08 1.10e+09 1.05e+09 1.12e+09 1.23e+09 ...
## $ CRT : num 9.43e+08 9.95e+08 9.19e+08 9.21e+08 8.37e+08 ...
## $ RPT : num 1.14e+09 1.56e+09 1.43e+09 1.42e+09 1.79e+09 ...
## $ OTT : num 1.75e+09 1.71e+09 1.66e+09 1.67e+09 1.59e+09 ...
In this example, pivot_longer reshapes the column totals. A wide data frame with one row and ten columns (one total per tax type) becomes a long data frame with ten rows and two columns, Tax_Type and Total_Sum. ggplot2 expects one observation per row, so the long format lets a single aesthetic mapping draw one bar per tax type and makes it easy to compare the tax sources side by side.
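To make the reshaping concrete, here is a toy sketch with made-up totals for three tax types. The numbers and the name toy_totals are hypothetical and are not taken from the IBO data.

```r
library(tidyr)

# Hypothetical one-row summary: one made-up total per tax type
toy_totals <- data.frame(PRT = 300, PIT = 150, SAT = 50)

# Wide (1 row x 3 columns) to long (3 rows x 2 columns)
toy_long <- pivot_longer(
  toy_totals,
  cols = everything(),
  names_to = "Tax_Type",
  values_to = "Total_Sum"
)

dim(toy_totals)
dim(toy_long)
```

Each former column name becomes a value in Tax_Type, and each former cell becomes a value in Total_Sum.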
# Calculate the total sum for each column
total_sum <- tax_data %>%
  summarise(across(PRT:OTT, ~ sum(.x, na.rm = TRUE)))
# Convert the total_sum to a long format
total_sum_long <- total_sum %>%
  pivot_longer(cols = everything(), names_to = "Tax_Type", values_to = "Total_Sum")
# Create the bar chart (revenue is stored in dollars, so scale by 1e-9 to label in billions)
ggplot(total_sum_long, aes(x = Tax_Type, y = Total_Sum)) +
  geom_col(fill = "blue") +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-9, suffix = "B")) +
  labs(title = "Total Sum of Tax Revenue by Type (1980-2020)",
       x = "Tax Type",
       y = "Total Sum in Billions") +
  theme_minimal()
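Since the stated goal is to compare the share of each tax source, the long-format totals can also be converted to proportions. The sketch below uses made-up totals rather than the IBO figures; toy_long, toy_shares, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical long-format totals, shaped like total_sum_long
toy_long <- data.frame(
  Tax_Type = c("PRT", "PIT", "SAT"),
  Total_Sum = c(300, 150, 50)
)

# Share of each tax type in the overall total, largest first
toy_shares <- toy_long %>%
  mutate(Share = Total_Sum / sum(Total_Sum)) %>%
  arrange(desc(Share))

toy_shares
```

The Share column could then be mapped to the y axis in place of Total_Sum, for example with scales::percent labels.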
# Pivot the data to a long format for easier plotting
tax_data_long_extended <- tax_data %>%
  pivot_longer(
    cols = PRT:OTT,          # Specify the columns to pivot
    names_to = "Tax_Type",   # Name of the new column for tax types
    values_to = "Revenue"    # Name of the new column for revenue values
  ) %>%
  filter(Year >= 1980 & Year <= 2020)
# Create a faceted plot to visualize the revenue for each tax type
ggplot(tax_data_long_extended, aes(x = Year, y = Revenue, color = Tax_Type)) +
  geom_line() +
  facet_wrap(~ Tax_Type, scales = "free_y", ncol = 2) +  # Adjust the number of columns
  scale_x_continuous(breaks = seq(1980, 2020, by = 5)) +
  scale_y_continuous(labels = scales::dollar_format(scale = 1e-6, suffix = "M")) +
  labs(title = "Tax Revenue by Type (1980-2020)",
       x = "Year",
       y = "Revenue (in Millions)") +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(size = 12),  # Increase facet label size
    axis.text = element_text(size = 10),   # Increase axis text size
    axis.title = element_text(size = 12),  # Increase axis title size
    plot.title = element_text(size = 14)   # Increase plot title size
  )
# Part III (Linear Regression via MLE)
data("ChickWeight")
str(ChickWeight)
## Classes 'nfnGroupedData', 'nfGroupedData', 'groupedData' and 'data.frame': 578 obs. of 4 variables:
## $ weight: num 42 51 59 64 76 93 106 125 149 171 ...
## $ Time : num 0 2 4 6 8 10 12 14 16 18 ...
## $ Chick : Ord.factor w/ 50 levels "18"<"16"<"15"<..: 15 15 15 15 15 15 15 15 15 15 ...
## $ Diet : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "formula")=Class 'formula' language weight ~ Time | Chick
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "outer")=Class 'formula' language ~Diet
## .. ..- attr(*, ".Environment")=<environment: R_EmptyEnv>
## - attr(*, "labels")=List of 2
## ..$ x: chr "Time"
## ..$ y: chr "Body weight"
## - attr(*, "units")=List of 2
## ..$ x: chr "(days)"
## ..$ y: chr "(gm)"
# Transform the dataset to long format
ChickWeight_long <- ChickWeight %>%
  pivot_longer(cols = c(weight, Time),
               names_to = "variable",
               values_to = "value")
head(ChickWeight_long)
## # A tibble: 6 × 4
## Chick Diet variable value
## <ord> <fct> <chr> <dbl>
## 1 1 1 weight 42
## 2 1 1 Time 0
## 3 1 1 weight 51
## 4 1 1 Time 2
## 5 1 1 weight 59
## 6 1 1 Time 4
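# Transform the dataset back to wide format.
# Chick and Diet alone do not uniquely identify a row, so pivot_wider()
# collapses each chick's values into list-columns.
ChickWeight_wide <- ChickWeight_long %>%
  pivot_wider(names_from = variable,
              values_from = value)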
head(ChickWeight_wide)
## # A tibble: 6 × 4
## Chick Diet weight Time
## <ord> <fct> <list> <list>
## 1 1 1 <dbl [12]> <dbl [12]>
## 2 2 1 <dbl [12]> <dbl [12]>
## 3 3 1 <dbl [12]> <dbl [12]>
## 4 4 1 <dbl [12]> <dbl [12]>
## 5 5 1 <dbl [12]> <dbl [12]>
## 6 6 1 <dbl [12]> <dbl [12]>
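The list-columns above arise because Chick and Diet do not uniquely identify a row once weight and Time are stacked. One possible fix, sketched here under the assumption that a temporary row identifier is acceptable, is to add that identifier before pivoting so the round trip restores one row per observation. The column obs_id and the names chick_long and chick_wide are hypothetical.

```r
library(dplyr)
library(tidyr)

# Add a row identifier so each observation stays distinguishable after stacking
chick_long <- as.data.frame(ChickWeight) %>%
  mutate(obs_id = row_number()) %>%
  pivot_longer(cols = c(weight, Time),
               names_to = "variable",
               values_to = "value")

# Chick, Diet, and obs_id together identify each row, so no list-columns appear
chick_wide <- chick_long %>%
  pivot_wider(names_from = variable,
              values_from = value) %>%
  select(-obs_id)

dim(chick_wide)
```

With the identifier in place, weight and Time come back as ordinary numeric columns with one row per original observation.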
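# Linear regression using MLE
# param[1] is the residual standard deviation (sigma); the remaining entries are coefficients
mle_model <- function(param) {
  beta <- param[-1]
  sigma <- param[1]
  y <- as.vector(ChickWeight$weight)
  # Caution: the column is named `Diet`; `ChickWeight$diet` is NULL,
  # so x is a 1 x 1 matrix holding only the intercept
  x <- cbind(1, ChickWeight$diet)
  mu <- x %*% beta
  # Gaussian log-likelihood
  sum(dnorm(y, mu, sigma, log = TRUE))
}
# Run MLE
mle_result <- maxLik(logLik = mle_model, start = c(sigma = 1, beta1 = 1, beta2 = 1))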
summary(mle_result)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 51 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -3282.421
## 3 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## sigma 70.83 Inf 0 1
## beta1 116.46 Inf 0 1
## beta2 127.17 Inf 0 1
## --------------------------------------------
# Checking the results
summary(lm(weight ~ Diet, data = ChickWeight))
##
## Call:
## lm(formula = weight ~ Diet, data = ChickWeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -103.95 -53.65 -13.64 40.38 230.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.645 4.674 21.961 < 2e-16 ***
## Diet2 19.971 7.867 2.538 0.0114 *
## Diet3 40.305 7.867 5.123 4.11e-07 ***
## Diet4 32.617 7.910 4.123 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.33 on 574 degrees of freedom
## Multiple R-squared: 0.05348, Adjusted R-squared: 0.04853
## F-statistic: 10.81 on 3 and 574 DF, p-value: 6.433e-07
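The likelihood above refers to ChickWeight$diet, but the column is named Diet, so its design matrix holds only the intercept and its three estimates are not directly comparable to the four lm coefficients. A minimal sketch of a likelihood built on the same design matrix as lm(weight ~ Diet) follows. It swaps in base R's optim() for maxLik() to avoid extra dependencies, and it optimises log(sigma) so the standard deviation stays positive. The names X, neg_loglik, fit, beta_hat, and sigma_hat are illustrative and not part of the original analysis.

```r
# Design matrix with an intercept and three dummy columns for Diet 2, 3, 4
X <- model.matrix(~ Diet, data = ChickWeight)
y <- ChickWeight$weight

# Negative Gaussian log-likelihood; param = (log(sigma), beta_0, ..., beta_3)
neg_loglik <- function(param) {
  sigma <- exp(param[1])
  beta <- param[-1]
  mu <- as.vector(X %*% beta)
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}

# Start from the overall mean and standard deviation, with zero diet effects
start <- c(log(sd(y)), mean(y), 0, 0, 0)
fit <- optim(start, neg_loglik, method = "BFGS",
             control = list(maxit = 1000, reltol = 1e-10))

beta_hat <- setNames(fit$par[-1], colnames(X))
sigma_hat <- exp(fit$par[1])

# Compare with ordinary least squares
ols <- coef(lm(weight ~ Diet, data = ChickWeight))
round(cbind(MLE = beta_hat, OLS = ols), 3)
sigma_hat
```

Under Gaussian errors the maximum likelihood coefficients coincide with the least squares ones, so beta_hat should agree with the lm coefficients up to optimiser tolerance. The MLE of sigma divides the residual sum of squares by n rather than n - p, so sigma_hat is expected to be slightly smaller than lm's residual standard error.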