Noise Complaints vs Temp

This dataset was pulled from Kaggle.com and includes the noise complaints for NYC in 2016.

The temperature data for NYC was retrieved from http://w2.weather.gov/climate/xmacis.php?wfo=okx

The question I want to investigate is, “Does temperature affect the number of noise complaints?”

I think a linear model is appropriate because people tend to become more active as the temperature rises. This could lead to more social gatherings and, in turn, more noise complaints.

Read in data

library(tidyverse)  # loads dplyr, stringr, and ggplot2, which the code below relies on
url <- "https://raw.githubusercontent.com/smithchad17/Class605/master/party_in_nyc.csv"
nyc_noise_complaints <- read.csv(file = url, header = TRUE, stringsAsFactors = FALSE)

Create data frame from temperature data for NYC retrieved from http://w2.weather.gov/climate/xmacis.php?wfo=okx

# Monthly average temperature (F), January through December
nyc_temp <- data.frame(
  month = 1:12,
  avg_temp = c(34.5, 37.7, 48.9, 53.3, 62.8, 72.3, 78.7, 79.2, 71.8, 58.8, 49.8, 38.3)
)
                     
nyc_temp
##    month avg_temp
## 1      1     34.5
## 2      2     37.7
## 3      3     48.9
## 4      4     53.3
## 5      5     62.8
## 6      6     72.3
## 7      7     78.7
## 8      8     79.2
## 9      9     71.8
## 10    10     58.8
## 11    11     49.8
## 12    12     38.3

Tidy the data

For the NYC noise complaints, I selected the column with the incident dates and extracted the month from each date. I then grouped the rows by month and counted the number of incidents in each month.

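# Keep only the complaint timestamp column and rename it
nyc_noise <- nyc_noise_complaints %>%
  select(Created.Date)
colnames(nyc_noise) <- c("month")

# The leading one or two digits of each timestamp are taken as the month
nyc_noise$month <- str_extract(nyc_noise$month, "^\\d{1,2}")

# Count the complaints in each month
nyc_noise <- nyc_noise %>%
  group_by(month) %>%
  count(month)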

nyc_noise
## # A tibble: 12 x 2
## # Groups:   month [12]
##    month     n
##    <chr> <int>
##  1     1 12171
##  2    10 19332
##  3    11 14146
##  4    12 15182
##  5     2 10977
##  6     3 13880
##  7     4 17718
##  8     5 25192
##  9     6 25933
## 10     7 24502
## 11     8 20833
## 12     9 25000
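
Because str_extract returns text, the months above sort as strings (1, 10, 11, 12, 2, ...) rather than in calendar order. If calendar order is wanted, one option is to convert the extracted month to an integer before counting. The sketch below assumes the timestamps begin with the month, as in month/day/year; sample_dates is a made-up four-row stand-in for the real file, not data taken from it.

```r
library(dplyr)
library(stringr)

# Hypothetical timestamps in month/day/year form (an assumption about the CSV layout)
sample_dates <- data.frame(
  Created.Date = c("1/15/2016 22:10", "10/3/2016 01:45", "10/9/2016 23:30", "2/7/2016 00:05"),
  stringsAsFactors = FALSE
)

# Extract the leading digits, convert them to integer, then count rows per month
monthly_counts <- sample_dates %>%
  mutate(month = as.integer(str_extract(Created.Date, "^\\d{1,2}"))) %>%
  count(month)

# Rows now appear in numeric month order
monthly_counts
```

An integer month column also matches the type of the month column in nyc_temp, so the later merge does not have to reconcile a character key with a numeric one.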

Merge data frames

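# Join the monthly complaint counts to the monthly temperatures
nyc <- merge(nyc_noise, nyc_temp, by="month")
# Keep only the two columns used in the model
nyc <- nyc %>%
  select(n, avg_temp)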

nyc
##        n avg_temp
## 1  12171     34.5
## 2  19332     58.8
## 3  14146     49.8
## 4  15182     38.3
## 5  10977     37.7
## 6  13880     48.9
## 7  17718     53.3
## 8  25192     62.8
## 9  25933     72.3
## 10 24502     78.7
## 11 20833     79.2
## 12 25000     71.8
print("Summary of Incidents")
## [1] "Summary of Incidents"
summary(nyc$n)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10977   14080   18525   18739   24627   25933
print("Summary of Average Temperature")
## [1] "Summary of Average Temperature"
summary(nyc$avg_temp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34.50   46.25   56.05   57.17   71.92   79.20

Plot

The plot compares the number of noise complaints with the average temperature in F. The blue line is the least-squares line from the linear model, and the shaded area is the 95% confidence interval for that line.

Several points fall outside the shaded area. The band reflects uncertainty in the fitted line rather than the spread of individual months, so points outside it are not necessarily outliers.

ggplot(data = nyc, aes(x = avg_temp, y = n)) +
      xlab("Temp") +
      ylab("Noise Complaints") +
      geom_point() +
      geom_smooth(method = "lm")
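
To judge how far an individual month can be expected to stray from the line, one option is to compare the confidence band with a prediction interval. The sketch below assumes the merged values printed earlier are re-entered by hand; nyc_demo and demo_lm are illustrative names, not objects from the analysis.

```r
# Values re-entered by hand from the merged table printed earlier
nyc_demo <- data.frame(
  n = c(12171, 19332, 14146, 15182, 10977, 13880,
        17718, 25192, 25933, 24502, 20833, 25000),
  avg_temp = c(34.5, 58.8, 49.8, 38.3, 37.7, 48.9,
               53.3, 62.8, 72.3, 78.7, 79.2, 71.8)
)

# Same kind of straight-line fit that geom_smooth(method = "lm") draws
demo_lm <- lm(n ~ avg_temp, data = nyc_demo)

# 95% interval for the fitted mean (corresponds to the shaded band)
conf_band <- predict(demo_lm, newdata = nyc_demo,
                     interval = "confidence", level = 0.95)

# 95% interval for an individual month's count (wider than the band)
pred_band <- predict(demo_lm, newdata = nyc_demo,
                     interval = "prediction", level = 0.95)

# Number of months whose observed count lies inside each interval
inside_conf <- sum(nyc_demo$n >= conf_band[, "lwr"] & nyc_demo$n <= conf_band[, "upr"])
inside_pred <- sum(nyc_demo$n >= pred_band[, "lwr"] & nyc_demo$n <= pred_band[, "upr"])
c(confidence = inside_conf, prediction = inside_pred)
```

A month is a more plausible outlier candidate when it falls outside the prediction interval, not merely outside the confidence band.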

Summary of model

nyc_lm <- lm(nyc$n ~ nyc$avg_temp)
summary(nyc_lm)
## 
## Call:
## lm(formula = nyc$n ~ nyc$avg_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4585.3 -1979.1   127.3  1911.2  4747.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1399.44    3028.03   0.462 0.653857    
## nyc$avg_temp   303.27      51.16   5.928 0.000145 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2713 on 10 degrees of freedom
## Multiple R-squared:  0.7785, Adjusted R-squared:  0.7563 
## F-statistic: 35.14 on 1 and 10 DF,  p-value: 0.0001455
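
Read as an equation, the coefficient table gives roughly complaints = 1399.44 + 303.27 * avg_temp, or about 303 more complaints per month for each additional degree F. A minimal sketch of using the fit for a prediction follows. It assumes the merged values are re-entered by hand, uses the illustrative names nyc_demo and demo_lm, and picks 60 F purely as an example input.

```r
# Values re-entered by hand from the merged table printed earlier
nyc_demo <- data.frame(
  n = c(12171, 19332, 14146, 15182, 10977, 13880,
        17718, 25192, 25933, 24502, 20833, 25000),
  avg_temp = c(34.5, 58.8, 49.8, 38.3, 37.7, 48.9,
               53.3, 62.8, 72.3, 78.7, 79.2, 71.8)
)

# Formula-plus-data form of the same model
demo_lm <- lm(n ~ avg_temp, data = nyc_demo)
coef(demo_lm)

# Predicted monthly complaints at a hypothetical average temperature of 60 F
pred_60 <- predict(demo_lm, newdata = data.frame(avg_temp = 60))

# The same prediction computed by hand from the intercept and slope
by_hand <- coef(demo_lm)[["(Intercept)"]] + coef(demo_lm)[["avg_temp"]] * 60

# With one predictor, the squared correlation equals Multiple R-squared
r_squared <- cor(nyc_demo$n, nyc_demo$avg_temp)^2
```

The formula-with-data form is used here because predict() matches new data by column name, which a formula written as nyc$n ~ nyc$avg_temp does not allow.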

Residual Analysis

The histogram of the residuals looks roughly normal, although with only 12 cases the shape is hard to judge; more data would give a clearer picture. With so few cases, a single unusual month also has a large effect on the fit. Unusual months could be due to festivals or other events that do not depend on the temperature.

nyc_res <- nyc_lm$residuals

hist(nyc_res, breaks = 11)
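
With only 12 residuals, the shape of a histogram depends heavily on the number of bins, so a normal Q-Q plot and a Shapiro-Wilk test are common complementary checks. The sketch below assumes the merged values are re-entered by hand; nyc_demo, demo_lm, and demo_res are illustrative names.

```r
# Values re-entered by hand from the merged table printed earlier
nyc_demo <- data.frame(
  n = c(12171, 19332, 14146, 15182, 10977, 13880,
        17718, 25192, 25933, 24502, 20833, 25000),
  avg_temp = c(34.5, 58.8, 49.8, 38.3, 37.7, 48.9,
               53.3, 62.8, 72.3, 78.7, 79.2, 71.8)
)

demo_lm <- lm(n ~ avg_temp, data = nyc_demo)
demo_res <- residuals(demo_lm)

# Points close to the reference line are consistent with normal residuals
qqnorm(demo_res)
qqline(demo_res)

# Shapiro-Wilk test: a small p-value would suggest a departure from normality
sw <- shapiro.test(demo_res)
sw$p.value
```

With so few points, neither check can establish normality; they can only flag strong departures from it.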