Noise Complaints vs Temp
This dataset was pulled from Kaggle.com and includes the noise complaints for NYC in 2016.
The temperature data for NYC was retrieved from http://w2.weather.gov/climate/xmacis.php?wfo=okx
The question I want to investigate is, “Does temperature affect the number of noise complaints?”
I think a linear model is a reasonable starting point because people tend to become more active as the temperature rises. This could lead to more social gatherings and, in turn, more noise complaints.
Read in data
url <- "https://raw.githubusercontent.com/smithchad17/Class605/master/party_in_nyc.csv"
nyc_noise_complaints <- read.csv(file = url, header = TRUE, stringsAsFactors = FALSE)
Create data frame from temperature data for NYC retrieved from http://w2.weather.gov/climate/xmacis.php?wfo=okx
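# Monthly average temperatures in degrees F, January through December
t <- c(34.5, 37.7, 48.9, 53.3, 62.8, 72.3, 78.7, 79.2, 71.8, 58.8, 49.8, 38.3)
# Month numbers matching the temperatures above
n <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
y <- c("month", "avg_temp")
nyc_temp <- data.frame(n, t, stringsAsFactors = FALSE)
colnames(nyc_temp) <- y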
nyc_temp
## month avg_temp
## 1 1 34.5
## 2 2 37.7
## 3 3 48.9
## 4 4 53.3
## 5 5 62.8
## 6 6 72.3
## 7 7 78.7
## 8 8 79.2
## 9 9 71.8
## 10 10 58.8
## 11 11 49.8
## 12 12 38.3
Tidy the data
For the NYC noise complaint data, I selected the column with the incident dates and extracted the month from each date. I then grouped the rows by month and counted the number of incidents in each month.
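library(dplyr)    # provides %>%, select(), group_by(), count()
library(stringr)  # provides str_extract()

# Keep only the incident date column and rename it
nyc_noise <- nyc_noise_complaints %>%
  select(Created.Date)
colnames(nyc_noise) <- c("month")
# The leading one or two digits of the date string are the month
nyc_noise$month <- str_extract(nyc_noise$month, "^\\d{1,2}")
# Count incidents per month
nyc_noise <- nyc_noise %>%
  group_by(month) %>%
  count(month)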
nyc_noise
## # A tibble: 12 x 2
## # Groups: month [12]
## month n
## <chr> <int>
## 1 1 12171
## 2 10 19332
## 3 11 14146
## 4 12 15182
## 5 2 10977
## 6 3 13880
## 7 4 17718
## 8 5 25192
## 9 6 25933
## 10 7 24502
## 11 8 20833
## 12 9 25000
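The month column above is a character vector, which is why the rows sort as 1, 10, 11, 12, 2, and so on. As an alternative, here is a minimal sketch that parses the dates and extracts the month as an integer. It assumes the timestamps are in month/day/year order, as the regex above implies. The sample values in `created` are hypothetical and are not taken from the dataset.

```r
# Hypothetical timestamps; the month/day/year layout is an assumption.
created <- c("1/15/2016 23:10", "1/16/2016 01:05", "10/31/2016 22:45")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time of day is ignored.
parsed <- as.Date(created, format = "%m/%d/%Y")

# Extract the month as an integer so it sorts in calendar order.
month_num <- as.integer(format(parsed, "%m"))

# Count incidents per month.
monthly_counts <- table(month_num)
monthly_counts
```

With integer months, later steps such as sorting or merging keep the rows in calendar order.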
Merge data frames
nyc <- merge(nyc_noise, nyc_temp, by="month")
nyc <- nyc %>%
select(n, avg_temp)
nyc
## n avg_temp
## 1 12171 34.5
## 2 19332 58.8
## 3 14146 49.8
## 4 15182 38.3
## 5 10977 37.7
## 6 13880 48.9
## 7 17718 53.3
## 8 25192 62.8
## 9 25933 72.3
## 10 24502 78.7
## 11 20833 79.2
## 12 25000 71.8
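The merged rows above are correctly paired but appear in character sort order (January, October, November, ...), because `month` is a string in one data frame and a number in the other. The sketch below shows one way to get calendar order: convert the key to an integer before merging. It uses three rows copied from the tables above, and the names `noise`, `temp`, and `merged` are illustrative only.

```r
# Three months taken from the tables above (January, October, February).
noise <- data.frame(month = c("1", "10", "2"),
                    n = c(12171L, 19332L, 10977L),
                    stringsAsFactors = FALSE)
temp <- data.frame(month = c(1, 2, 10),
                   avg_temp = c(34.5, 37.7, 58.8))

# Convert the character key to an integer so both keys are numeric.
noise$month <- as.integer(noise$month)

# merge() sorts on the key by default, so rows come out in calendar order.
merged <- merge(noise, temp, by = "month")
merged
```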
print("Summary of Incidents")
## [1] "Summary of Incidents"
summary(nyc$n)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10977 14080 18525 18739 24627 25933
print("Summary of Average Temperature")
## [1] "Summary of Average Temperature"
summary(nyc$avg_temp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.50 46.25 56.05 57.17 71.92 79.20
Plot
The plot compares the number of monthly noise complaints with the average monthly temperature in °F. The blue line is the least-squares fit from the linear model, and the shaded band is the 95% confidence interval for that fitted line.
Several points fall outside the shaded band. That alone does not make them outliers: the band describes the uncertainty in the fitted mean, not the range in which individual months are expected to fall.
library(ggplot2)

# Scatter plot of complaints against temperature with a fitted regression line
ggplot(data = nyc, aes(x = avg_temp, y = n)) +
  xlab("Temp") +
  ylab("Noise Complaints") +
  geom_point() +
  geom_smooth(method = "lm")
Summary of model
nyc_lm <- lm(nyc$n ~ nyc$avg_temp)
summary(nyc_lm)
##
## Call:
## lm(formula = nyc$n ~ nyc$avg_temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4585.3 -1979.1 127.3 1911.2 4747.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1399.44 3028.03 0.462 0.653857
## nyc$avg_temp 303.27 51.16 5.928 0.000145 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2713 on 10 degrees of freedom
## Multiple R-squared: 0.7785, Adjusted R-squared: 0.7563
## F-statistic: 35.14 on 1 and 10 DF, p-value: 0.0001455
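The slope estimate means that each additional degree Fahrenheit is associated with roughly 303 more complaints per month. For a month averaging 60 °F, the fitted line gives about 1399.44 + 303.27 × 60 ≈ 19,596 complaints. The sketch below shows how the same model could be used for prediction. It refits the model with the `data` argument (a change from the call above, needed so `predict()` accepts new values) and rebuilds `nyc` from the merged table printed earlier. The 60 °F month and the names `fit`, `new_month`, `conf_int`, and `pred_int` are illustrative assumptions.

```r
# Data copied from the merged table printed earlier in this document.
nyc <- data.frame(
  n = c(12171, 19332, 14146, 15182, 10977, 13880,
        17718, 25192, 25933, 24502, 20833, 25000),
  avg_temp = c(34.5, 58.8, 49.8, 38.3, 37.7, 48.9,
               53.3, 62.8, 72.3, 78.7, 79.2, 71.8)
)

# Same model, written with the data argument so predict() can use newdata.
fit <- lm(n ~ avg_temp, data = nyc)

# Hypothetical month with an average temperature of 60 degrees F.
new_month <- data.frame(avg_temp = 60)

# Interval for the mean number of complaints at 60 degrees F.
conf_int <- predict(fit, newdata = new_month, interval = "confidence")

# Wider interval for the complaint count of a single month at 60 degrees F.
pred_int <- predict(fit, newdata = new_month, interval = "prediction")

conf_int
pred_int
```

The prediction interval is the one to use when judging whether an individual month is unusual. The confidence interval only describes uncertainty in the fitted line.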
Residual Analysis
The histogram of the residuals looks roughly symmetric around zero, which is consistent with normally distributed residuals. With only 12 cases, however, the shape is hard to judge, and more data would give a clearer picture. A small sample also means that a single unusual month has a large effect on the fit. Unusually high or low complaint counts could be due to festivals or other events that do not depend on the temperature.
nyc_res <- nyc_lm$residuals
hist(nyc_res, breaks = 11)
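Because a histogram of 12 values is hard to read, a normal Q-Q plot and a Shapiro-Wilk test can serve as additional checks. The following is a minimal sketch and not part of the original analysis. It rebuilds the data from the merged table printed earlier, and the names `fit`, `res`, and `sw` are illustrative.

```r
# Data copied from the merged table printed earlier in this document.
nyc <- data.frame(
  n = c(12171, 19332, 14146, 15182, 10977, 13880,
        17718, 25192, 25933, 24502, 20833, 25000),
  avg_temp = c(34.5, 58.8, 49.8, 38.3, 37.7, 48.9,
               53.3, 62.8, 72.3, 78.7, 79.2, 71.8)
)
fit <- lm(n ~ avg_temp, data = nyc)
res <- residuals(fit)

# Normal Q-Q plot: points close to the reference line suggest
# approximately normal residuals.
qqnorm(res)
qqline(res)

# Shapiro-Wilk normality test. With only 12 residuals the test has
# little power, so the p-value is a rough guide at best.
sw <- shapiro.test(res)
sw
```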