Part 1

Data Description

I am using NYC Open Data to analyze motor vehicle activity in New York City. This platform provides publicly accessible datasets collected by government agencies such as the NYC Department of Transportation (DOT), the NYPD, and the New York State DMV. Relevant datasets include vehicle registrations, traffic counts, crash reports, and parking violations, which together offer insight into vehicle patterns and transportation trends in the city.

Basic Information

The datasets are publicly accessible on the NYC Open Data platform. They are typically available in formats such as CSV, JSON, and Excel, which makes them easy to download and analyze.

Research Questions

My research will focus on questions such as:

  1. How do accident rates and traffic violations relate to vehicle registrations?

  2. Which borough experiences the most accidents?

Key Variables

The key variables I am interested in include:

  - Vehicle type
  - Registration numbers
  - Borough location
  - Year
  - Accident frequency
  - Violation types

Planned Analyses

The analyses I have in mind include:

  - Summary statistics to track trends
  - Time series analysis to examine changes over time
  - Mapping to explore borough-level differences
  - Statistical models to examine relationships between variables

This research aims to better understand motor vehicle usage and its broader impact on transportation in NYC.
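As a rough illustration of the borough-level summary statistics mentioned above, the sketch below counts crashes per borough in a tiny made-up data frame. The `crashes` object, its column names, and its values are hypothetical placeholders, not real NYC Open Data records; an actual analysis would load a downloaded CSV instead.

```r
# Hypothetical toy data: one row per crash report (values are invented).
crashes <- data.frame(
  borough = c("Brooklyn", "Queens", "Brooklyn", "Bronx", "Queens", "Brooklyn"),
  year    = c(2020, 2020, 2021, 2021, 2021, 2021)
)

# Count crashes per borough and sort from most to fewest.
crash_counts <- sort(table(crashes$borough), decreasing = TRUE)
print(crash_counts)

# Borough with the most crashes in the toy data.
top_borough <- names(crash_counts)[1]
print(top_borough)
```

The same pattern should carry over to the real data once the borough column has been identified and cleaned.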

Part 2

Using pivot_longer() and pivot_wider() in R

The pivot_longer() and pivot_wider() functions are part of the tidyr package in the tidyverse ecosystem. The pivot_longer() function is used to convert data from a wide format to a long format, whereas the pivot_wider() function is used to convert data from a long format to a wide format. These functions are particularly useful when you need to reshape your data for different types of analysis or visualization.

Example

Let’s work through a simple example with simulated data: we reshape a data frame from “wide” to “long” form and then back again.

Simulated Data

# Load the necessary libraries
library(tidyr)
library(dplyr)

# Create a simple data frame in wide format
wide_data <- data.frame(
  id = 1:3,
  year_2020 = c(5, 6, 7),
  year_2021 = c(8, 9, 10)
)

# Display the wide data
print("Wide Data:")
print(wide_data)

Convert Wide Data to Long Data

# Convert the wide data to long data using pivot_longer()
long_data <- wide_data %>%
  pivot_longer(cols = starts_with("year"), 
               names_to = "year", 
               values_to = "value")

# Display the long data
print("Long Data:")
print(long_data)

Convert Long Data Back to Wide Data

# Convert the long data back to wide data using pivot_wider()
wide_data_again <- long_data %>%
  pivot_wider(names_from = year, 
              values_from = value)

# Display the wide data again
print("Wide Data Again:")
print(wide_data_again)

Explanation

  1. Wide Data: We start with a simple data frame wide_data that has three columns: id, year_2020, and year_2021. Each row represents a different id and the values for the years 2020 and 2021.

  2. Convert Wide to Long: We use the pivot_longer() function to convert the wide data to long data. The cols argument specifies which columns to pivot (in this case, columns starting with “year”). The names_to argument specifies the name of the new column that will contain the former column names (in this case, “year”). The values_to argument specifies the name of the new column that will contain the values (in this case, “value”). The resulting long_data has three columns (id, year, and value) and six rows, one per combination of id and year. Note that the year column holds the original column names as strings (“year_2020” and “year_2021”), not numeric years.

  3. Convert Long to Wide: We use the pivot_wider() function to convert the long data back to wide data. The names_from argument specifies which column’s values will become the new column names (in this case, “year”). The values_from argument specifies which column’s values will fill the new columns (in this case, “value”). The resulting wide_data_again has the same columns and values as the original wide_data.
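One way to check the round trip described in the steps above is to compare the result with the original data frame. The sketch below repeats the example and assumes that the pivot functions may return a tibble, so it converts back to a base data frame before comparing; the name `round_trip_ok` is introduced here only for illustration.

```r
library(tidyr)

wide_data <- data.frame(
  id = 1:3,
  year_2020 = c(5, 6, 7),
  year_2021 = c(8, 9, 10)
)

# Wide -> long -> wide, as in the example.
long_data <- wide_data %>%
  pivot_longer(cols = starts_with("year"),
               names_to = "year",
               values_to = "value")

wide_data_again <- long_data %>%
  pivot_wider(names_from = year,
              values_from = value)

# Compare as plain data frames so that only columns and values matter.
round_trip_ok <- isTRUE(all.equal(as.data.frame(wide_data_again), wide_data))
print(round_trip_ok)
```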

By using pivot_longer() and pivot_wider(), we can reshape our data for different types of analysis or visualization. This flexibility is one of the strengths of the tidyverse ecosystem.
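In the example, the year column of the long data holds strings such as “year_2020”. If numeric years are more convenient for plotting or time series work, a variation like the following may help. It is a sketch that relies on the names_prefix and names_transform arguments of pivot_longer(), which are available in recent tidyr versions; the object name `long_int_year` is introduced here only for illustration.

```r
library(tidyr)

wide_data <- data.frame(
  id = 1:3,
  year_2020 = c(5, 6, 7),
  year_2021 = c(8, 9, 10)
)

# Strip the "year_" prefix from the column names and store the
# remainder as an integer year.
long_int_year <- wide_data %>%
  pivot_longer(
    cols = starts_with("year"),
    names_to = "year",
    names_prefix = "year_",
    names_transform = list(year = as.integer),
    values_to = "value"
  )

print(long_int_year)
```

With this variation, pivoting back with pivot_wider() would produce columns named 2020 and 2021 rather than year_2020 and year_2021, unless a names_prefix is supplied there as well.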

Part 3

Introduction

In this analysis, we use the mtcars dataset to explore the relationship between miles per gallon (mpg) and horsepower (hp). We will use Maximum Likelihood Estimation (MLE) to estimate the parameters of a linear regression model and compare the results with those obtained from Ordinary Least Squares (OLS) regression.

Load and Inspect Data

# Load necessary libraries
library(dplyr)
library(maxLik)
# Load the mtcars dataset
data(mtcars)
dplyr::glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Define Log-Likelihood Function

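# Log-likelihood of the normal linear model mpg = beta1 + beta2 * hp + error.
# param = c(sigma, beta1, beta2); sigma is the error standard deviation.
# Note: sigma is not constrained to be positive, so dnorm() returns NaN
# (with a warning) whenever the optimizer tries sigma < 0.
ols_log_likelihood <- function(param) {
  sigma <- param[1]             # error standard deviation
  beta <- param[-1]             # intercept and slope
  y <- as.vector(mtcars$mpg)    # response
  x <- cbind(1, mtcars$hp)      # design matrix with an intercept column
  mu <- x %*% beta              # fitted means
  sum(dnorm(y, mu, sigma, log = TRUE))
}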
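For reference, the quantity this function computes can be written out explicitly. The notation below (index i, sample size n, and the symbols β₁, β₂, σ matching the parameter names beta1, beta2, sigma) is introduced here for illustration and assumes independent, normally distributed errors:

```latex
\mathrm{mpg}_i = \beta_1 + \beta_2\,\mathrm{hp}_i + \varepsilon_i,
\qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1, \dots, n

\ell(\sigma, \beta_1, \beta_2)
  = -\frac{n}{2}\,\log\!\left(2\pi\sigma^2\right)
    - \frac{1}{2\sigma^2} \sum_{i=1}^{n}
      \left(\mathrm{mpg}_i - \beta_1 - \beta_2\,\mathrm{hp}_i\right)^2
```

For any fixed σ, maximizing this expression over β₁ and β₂ is the same as minimizing the sum of squared residuals, which is why the MLE coefficients are expected to coincide with the OLS ones.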

Perform Maximum Likelihood Estimation

# Perform Maximum Likelihood Estimation
mle_ols <- maxLik(logLik = ols_log_likelihood, start = c(sigma = 1, beta1 = 1, beta2 = 1))
## Warning in dnorm(y, mu, sigma, log = TRUE): NaNs produced
summary(mle_ols)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 21 iterations
## Return code 2: successive function values within tolerance limit (tol)
## Log-Likelihood: -87.61931 
## 3  free parameters
## Estimates:
##        Estimate Std. error t value  Pr(> t)    
## sigma  3.740297   0.466780   8.013 1.12e-15 ***
## beta1 30.098857   1.553704  19.372  < 2e-16 ***
## beta2 -0.068228   0.009653  -7.068 1.57e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
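The "NaNs produced" warnings most likely arise because the optimizer occasionally tries a negative value of sigma, for which dnorm() is undefined. A common workaround is to optimize over log(sigma) instead, so that every candidate value maps to a positive standard deviation. The sketch below illustrates the idea with base R's optim() rather than maxLik(), purely to keep it free of extra packages; the names `neg_log_lik`, `start`, `fit_optim`, and `estimates`, the starting values, and the parscale setting are all choices made for this illustration.

```r
# Negative log-likelihood with sigma written as exp(log_sigma), so any
# real-valued parameter vector corresponds to a valid, positive sigma.
neg_log_lik <- function(param) {
  sigma <- exp(param[1])                      # always > 0
  mu    <- param[2] + param[3] * mtcars$hp    # intercept + slope * hp
  -sum(dnorm(mtcars$mpg, mean = mu, sd = sigma, log = TRUE))
}

# Data-driven starting values: overall spread and mean of mpg, zero slope.
start <- c(log_sigma = log(sd(mtcars$mpg)),
           beta1     = mean(mtcars$mpg),
           beta2     = 0)

# optim() minimizes, hence the sign flip above. parscale tells optim()
# that the slope lives on a much smaller scale than the other parameters.
fit_optim <- optim(
  par     = start,
  fn      = neg_log_lik,
  method  = "BFGS",
  control = list(parscale = c(1, 1, 0.01), maxit = 500)
)

# Back-transform log_sigma to sigma for reporting.
estimates <- c(sigma = exp(fit_optim$par[["log_sigma"]]),
               fit_optim$par[c("beta1", "beta2")])
print(estimates)
```

The same reparameterization could presumably be used with maxLik() by returning the log-likelihood without the sign flip; the estimate of sigma is then recovered as the exponential of the fitted log_sigma.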

Perform OLS Regression

# Perform OLS regression
ols_model <- lm(mpg ~ hp, data = mtcars)
summary(ols_model)
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Conclusion

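In this analysis, we estimated a linear regression of mpg on hp in the mtcars dataset using both MLE and OLS. The two methods give the same coefficient estimates to the displayed precision (intercept of about 30.099 and slope of about -0.0682). The estimates of the error standard deviation differ slightly (3.740 from MLE versus a residual standard error of 3.863 from OLS) because the MLE divides the residual sum of squares by n = 32, whereas OLS divides by the residual degrees of freedom, n - 2 = 30. The reported standard errors and p-values also differ somewhat between the two summaries. Overall, the comparison shows that MLE under normally distributed errors recovers the OLS coefficients for a linear regression model.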
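The relationship between the two estimates of the error standard deviation can be checked directly from the OLS fit. The following sketch uses only base R; the object names (`fit`, `rss`, `sigma_mle`, `sigma_ols`, `loglik_at_mle`) are introduced here for illustration.

```r
# Refit the OLS model and extract the residual sum of squares.
fit <- lm(mpg ~ hp, data = mtcars)
n   <- nrow(mtcars)
rss <- sum(residuals(fit)^2)

# MLE of sigma divides by n; the OLS residual standard error divides
# by the residual degrees of freedom (n minus 2 coefficients).
sigma_mle <- sqrt(rss / n)
sigma_ols <- sqrt(rss / (n - 2))

# Maximized log-likelihood of the normal linear model, as reported by logLik().
loglik_at_mle <- as.numeric(logLik(fit))

print(c(sigma_mle = sigma_mle, sigma_ols = sigma_ols, logLik = loglik_at_mle))
```

The values should line up with the sigma estimate and log-likelihood in the maxLik summary and with the residual standard error in the lm() summary, respectively.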