1. This dataset was sourced from Kaggle and focuses on pollution
levels across various locations in New York. It was collected by Sahar,
a senior data scientist based in the South, who visited different areas
of the city during multiple time periods to assess the severity of
pollution. The dataset is useful to marketers because it highlights the
often-overlooked presence of pollution in cities, including one of the
best-known cities in the world. Using the 99 observations, I will
identify differences in pollution levels across neighborhoods in New
York and offer insight into what might be considered “normal” pollution
levels across the years. This analysis may also help assess whether the
city presents a strong market for non-polluting products.
2.
# Load necessary libraries
library(ggplot2)
# Read the data (despite ".csv" in the name, the file is an .xlsx workbook)
data <- readxl::read_excel("Air_1Quality.csv.xlsx")
# Convert Start_Date to Date format
data$Start_Date <- as.Date(data$Start_Date)
# Convert Date to numeric year
data$Year <- as.numeric(format(data$Start_Date, "%Y"))
# Remove missing values
clean_data <- na.omit(data[, c("Year", "Data_Value")])
# Linear regression
model <- lm(Data_Value ~ Year, data = clean_data)
# Summary of regression
summary(model)
##
## Call:
## lm(formula = Data_Value ~ Year, data = clean_data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -11.0915  -3.5471   0.1441   3.3872  11.3491
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1775.3919   235.3339   7.544 2.46e-11 ***
## Year          -0.8706     0.1168  -7.452 3.84e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.687 on 97 degrees of freedom
## Multiple R-squared: 0.3641, Adjusted R-squared: 0.3575
## F-statistic: 55.53 on 1 and 97 DF, p-value: 3.844e-11
# Scatter plot with regression line
ggplot(clean_data, aes(x = Year, y = Data_Value)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkred") +
  labs(title = "Regression of NO2 Levels Over Time",
       x = "Year", y = "NO2 Level (ppb)")
## `geom_smooth()` using formula = 'y ~ x'

3. The linear regression shows a clear downward trend: nitrogen
dioxide (NO₂) levels across various parts of New York have been
steadily decreasing over the past decade. For example, levels dropped
from approximately 28 ppb in 2010 to around 16 ppb in 2021. The slope of
-0.87 means the fitted NO₂ level falls by about 0.87 ppb per year, and
the p-value (3.84e-11) is far below 0.05, so the decline is
statistically significant and unlikely to be due to chance. Statistical
significance does not show that the decline is intentional, and with an
R-squared of 0.36, year explains only about 36% of the variation in NO₂
levels.
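As a rough check on these numbers, the sketch below recomputes the fitted NO₂ levels for 2010 and 2021 and a 95% confidence interval for the slope, using only the rounded coefficients printed in the regression summary. The variable names are illustrative, and it assumes both years fall within the range of the data. The fitted values will not necessarily match the approximate levels quoted above, which may be observed readings rather than points on the regression line.

```r
# Coefficients copied from the summary(model) output (rounded as printed)
intercept <- 1775.3919
slope     <- -0.8706
slope_se  <- 0.1168
resid_df  <- 97

# Fitted NO2 level for a given year: intercept + slope * year
years      <- c(2010, 2021)
fitted_no2 <- intercept + slope * years
names(fitted_no2) <- years
print(round(fitted_no2, 1))   # about 25.5 ppb (2010) and 15.9 ppb (2021)

# 95% confidence interval for the slope: estimate +/- t critical value * SE
t_crit   <- qt(0.975, df = resid_df)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
print(round(slope_ci, 2))     # about -1.10 to -0.64 ppb per year
```

With the model object available, `predict(model, newdata = data.frame(Year = c(2010, 2021)))` and `confint(model)` should give the same quantities without the rounding in the printed coefficients.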
This trend presents a unique opportunity for environmentally
conscious brands. As awareness of environmental and health impacts
grows, these brands can align their messaging with public concern for
cleaner air and well-being. Building on the improvements already made in
New York, brands can play an active role in advocating for better air
quality, strengthening their brand image while promoting products that
resonate with the city's socially and environmentally aware consumers.
Note: I used ChatGPT to help create the regression analysis and the
linear model.