I am using NYC Open Data to analyze motor vehicles in New York City. The platform provides publicly accessible datasets collected by agencies such as the Department of Transportation (DOT), the NYPD, and the DMV. Relevant datasets include vehicle registrations, traffic counts, crash reports, and parking violations, which together offer insight into vehicle patterns and transportation trends in the city.
The datasets are typically available in formats such as CSV, JSON, and Excel, which makes them easy to download and analyze.
My research will focus on questions such as:
1. How do accident rates and traffic violations relate to vehicle registrations?
2. Which borough experiences the most accidents?
The key variables I am interested in include:
- Vehicle type
- Registration numbers
- Borough location
- Year
- Accident frequency
- Violation types
The analyses I have in mind include:
- Summary statistics to track trends
- Time series analysis to examine changes over time
- Mapping to explore borough-level differences
- Statistical models to examine relationships between variables
This research aims to better understand motor vehicle usage and its broader impact on transportation in NYC.
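As a rough illustration of the summary-statistics step, the sketch below counts crashes per borough. The data are simulated, and the column names (borough, year) are hypothetical placeholders rather than the actual NYC Open Data schema.

```r
library(dplyr)

# Simulated stand-in for a crash dataset; the values are random and the
# column names are assumptions, not the real NYC Open Data fields
set.seed(42)
crashes <- data.frame(
  borough = sample(
    c("Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"),
    size = 200, replace = TRUE
  ),
  year = sample(2020:2022, size = 200, replace = TRUE)
)

# Count crashes per borough and sort from most to fewest
crashes_by_borough <- crashes %>%
  count(borough, name = "n_crashes") %>%
  arrange(desc(n_crashes))

print(crashes_by_borough)
```

With real data, the same count() and arrange() pattern would answer the second research question directly, once the borough column has been cleaned.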
pivot_longer() and pivot_wider() in R
The pivot_longer() and pivot_wider() functions are part of the tidyr package in the tidyverse ecosystem. pivot_longer() converts data from a wide format to a long format, whereas pivot_wider() converts data from a long format to a wide format. These functions are particularly useful when you need to reshape your data for different types of analysis or visualization.
Let’s create a simple example using simulated data to demonstrate moving from the “wide” form to the “long” form and back again.
# Load the necessary libraries
library(tidyr)
library(dplyr)

# Create a simple data frame in wide format
wide_data <- data.frame(
  id = 1:3,
  year_2020 = c(5, 6, 7),
  year_2021 = c(8, 9, 10)
)

# Display the wide data
print("Wide Data:")
print(wide_data)

# Convert the wide data to long data using pivot_longer()
long_data <- wide_data %>%
  pivot_longer(cols = starts_with("year"),
               names_to = "year",
               values_to = "value")

# Display the long data
print("Long Data:")
print(long_data)

# Convert the long data back to wide data using pivot_wider()
wide_data_again <- long_data %>%
  pivot_wider(names_from = year,
              values_from = value)

# Display the wide data again
print("Wide Data Again:")
print(wide_data_again)
Wide Data: We start with a simple data frame, wide_data, that has three columns: id, year_2020, and year_2021. Each row represents a different id and its values for the years 2020 and 2021.
Convert Wide to Long: We use the pivot_longer() function to convert the wide data to long data. The cols argument specifies which columns to pivot (in this case, columns starting with “year”). The names_to argument specifies the name of the new column that will contain the former column names (in this case, “year”). The values_to argument specifies the name of the new column that will contain the values (in this case, “value”). The resulting long_data has three columns: id, year, and value.
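In the long data, the year column holds strings such as “year_2020” rather than numbers. A minimal sketch of one way to clean this up during the pivot, using the names_prefix and names_transform arguments of pivot_longer() (the name long_clean is introduced here only for illustration):

```r
library(tidyr)

# Same wide data as in the example above
wide_data <- data.frame(
  id = 1:3,
  year_2020 = c(5, 6, 7),
  year_2021 = c(8, 9, 10)
)

# Strip the "year_" prefix and store the year as an integer
long_clean <- pivot_longer(
  wide_data,
  cols = starts_with("year_"),
  names_to = "year",
  names_prefix = "year_",
  names_transform = list(year = as.integer),
  values_to = "value"
)

print(long_clean)
```

An integer year column is usually easier to work with when plotting or modeling over time.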
Convert Long to Wide: We use the pivot_wider() function to convert the long data back to wide data. The names_from argument specifies which column’s values will become the new column names (in this case, “year”). The values_from argument specifies which column’s values will fill the new columns (in this case, “value”). The resulting wide_data_again has the same structure as the original wide_data.
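The claim that the round trip recovers the original data can be checked rather than eyeballed. The sketch below rebuilds the example and compares the result with the starting data frame; round_trip is a name introduced here, and the conversion with as.data.frame() is there so that only column names and values are compared.

```r
library(tidyr)
library(dplyr)

wide_data <- data.frame(
  id = 1:3,
  year_2020 = c(5, 6, 7),
  year_2021 = c(8, 9, 10)
)

# Wide -> long -> wide in one pipeline
round_trip <- wide_data %>%
  pivot_longer(cols = starts_with("year"),
               names_to = "year",
               values_to = "value") %>%
  pivot_wider(names_from = year,
              values_from = value)

# Compare column names and values with the original data frame
print(names(round_trip))
print(all.equal(as.data.frame(round_trip), wide_data))
```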
By using pivot_longer() and pivot_wider(), we can reshape our data for different types of analysis or visualization. This flexibility is one of the strengths of the tidyverse ecosystem.
In this analysis, we use the mtcars dataset to explore the relationship between miles per gallon (mpg) and horsepower (hp). We will use Maximum Likelihood Estimation (MLE) to estimate the parameters of a linear regression model and compare the results with those obtained from Ordinary Least Squares (OLS) regression.
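For reference, the quantity being maximized can be written out. Assuming independent normal errors with standard deviation sigma, and writing y_i for mpg, x_i for hp, beta_1 for the intercept, and beta_2 for the slope (matching the parameter names used in the code), the log-likelihood is:

```latex
\ell(\sigma, \beta_1, \beta_2)
  = -\frac{n}{2}\log\!\left(2\pi\sigma^{2}\right)
    - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left(y_i - \beta_1 - \beta_2 x_i\right)^{2}
```

For any fixed sigma, maximizing this expression over beta_1 and beta_2 is the same as minimizing the sum of squared residuals, which is why the MLE and OLS coefficient estimates should coincide.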
# Load necessary libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(maxLik)
## Warning: package 'maxLik' was built under R version 4.3.2
## Loading required package: miscTools
##
## Please cite the 'maxLik' package as:
## Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
##
## If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
## https://r-forge.r-project.org/projects/maxlik/
# Load the mtcars dataset
data(mtcars)
dplyr::glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
# Define the log-likelihood function
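ols_log_likelihood <- function(param) {
  sigma <- param[1]            # residual standard deviation
  beta <- param[-1]            # intercept and slope
  y <- as.vector(mtcars$mpg)   # response
  x <- cbind(1, mtcars$hp)     # design matrix with an intercept column
  mu <- x %*% beta             # fitted means
  sum(dnorm(y, mu, sigma, log = TRUE))  # Gaussian log-likelihood
}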
# Perform Maximum Likelihood Estimation
mle_ols <- maxLik(logLik = ols_log_likelihood, start = c(sigma = 1, beta1 = 1, beta2 = 1))
## Warning in dnorm(y, mu, sigma, log = TRUE): NaNs produced
summary(mle_ols)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 21 iterations
## Return code 2: successive function values within tolerance limit (tol)
## Log-Likelihood: -87.61931
## 3 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## sigma 3.740297 0.466780 8.013 1.12e-15 ***
## beta1 30.098857 1.553704 19.372 < 2e-16 ***
## beta2 -0.068228 0.009653 -7.068 1.57e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
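The NaN warnings from dnorm() appear when the optimizer tries a non-positive value for sigma. One common way to avoid them, sketched here as an alternative and not as the method used above, is to optimize over log(sigma) so that the standard deviation stays positive for any real input. The function name log_sigma_log_likelihood is introduced for illustration, and the sketch only evaluates the function at the estimates reported in the summary; it does not rerun maxLik().

```r
data(mtcars)

# dnorm() returns NaN (with a warning) when the standard deviation is negative
nan_value <- suppressWarnings(dnorm(0, mean = 0, sd = -1, log = TRUE))
print(nan_value)

# Same Gaussian log-likelihood, but the first parameter is log(sigma)
log_sigma_log_likelihood <- function(param) {
  sigma <- exp(param[1])          # always positive
  beta <- param[-1]               # intercept and slope
  x <- cbind(1, mtcars$hp)        # design matrix with an intercept column
  mu <- as.vector(x %*% beta)     # fitted means
  sum(dnorm(mtcars$mpg, mean = mu, sd = sigma, log = TRUE))
}

# Evaluate at the estimates reported in the maxLik summary
ll_at_mle <- log_sigma_log_likelihood(c(log(3.740297), 30.098857, -0.068228))
print(ll_at_mle)
```

If this function were passed to maxLik() in place of ols_log_likelihood, the first estimate would be on the log scale and would need exp() applied to recover sigma.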
# Perform OLS regression
ols_model <- lm(mpg ~ hp, data = mtcars)
summary(ols_model)
##
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7121 -2.1122 -0.8854 1.5819 8.2360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.09886 1.63392 18.421 < 2e-16 ***
## hp -0.06823 0.01012 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
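In this analysis, we used both MLE and OLS to estimate the parameters of a linear regression model relating mpg to hp in the mtcars dataset. The coefficient estimates agree: both methods give an intercept of about 30.099 and a slope of about -0.068. The scale estimates differ slightly (3.740 from MLE versus a residual standard error of 3.863 from OLS) because MLE divides the residual sum of squares by n, while OLS divides by the residual degrees of freedom, n - 2. The MLE standard errors are correspondingly somewhat smaller. The NaN warnings during estimation arise when the optimizer tries a non-positive value of sigma; they did not prevent convergence here.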
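The two scale estimates in the output (3.740 from maxLik() and 3.863 from lm()) can be reconciled from the OLS fit alone. The sketch below computes both from the same residual sum of squares and also retrieves the log-likelihood of the lm() fit for comparison with the value reported by maxLik(). The variable names are introduced here for illustration.

```r
data(mtcars)

ols_model <- lm(mpg ~ hp, data = mtcars)

n <- nrow(mtcars)
rss <- sum(residuals(ols_model)^2)   # residual sum of squares

sigma_ols <- sqrt(rss / (n - 2))     # residual standard error reported by lm()
sigma_mle <- sqrt(rss / n)           # maximum likelihood estimate of sigma

# Log-likelihood of the fitted model, evaluated at the ML estimate of sigma
ols_loglik <- as.numeric(logLik(ols_model))

print(c(sigma_ols = sigma_ols, sigma_mle = sigma_mle, loglik = ols_loglik))
```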