Question 1
a.
The data set has 153 observations and 6 variables. Four of the
variables are quantitative (Ozone, Solar.R, Wind, Temp). The other two
(Month, Day) are stored as numbers but act as categorical variables.
library(tidyverse)  # assumed setup: provides %>%, glimpse(), mutate(), pivot_longer(), ggplot()
data <- read.csv("HW1data.csv")
glimpse(data)
b.
Ozone is missing 37 values. Solar.R is missing 7 values.
colSums(is.na(data))
  Ozone Solar.R    Wind    Temp   Month     Day
     37       7       0       0       0       0
c.
There are 111 complete cases. There are 35 observations missing Ozone
only. There are 5 observations missing Solar.R only. There are 2
observations missing both Ozone and Solar.R.
# aggr() is from the VIM package
aggr(data, numbers = TRUE, prop = FALSE, sortVar = TRUE)

Variables sorted by number of missings:
Variable Count
Ozone 37
Solar.R 7
Wind 0
Temp 0
Month 0
Day 0
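The pattern counts in part c can also be tabulated directly, without reading them off the plot. The following is a minimal base R sketch on a small made-up data frame (`toy` and its values are hypothetical, not the homework data); the same labeling logic should carry over to the real data.

```r
# Toy data frame (hypothetical values, not the homework data).
toy <- data.frame(
  Ozone   = c(41, NA, 12, NA, 18, NA),
  Solar.R = c(190, 118, NA, NA, 313, 299),
  Wind    = c(7.4, 8.0, 12.6, 11.5, 14.3, 14.9)
)

# Label each row by which of the two variables are missing.
pattern <- ifelse(is.na(toy$Ozone) & is.na(toy$Solar.R), "both",
           ifelse(is.na(toy$Ozone), "Ozone only",
           ifelse(is.na(toy$Solar.R), "Solar.R only", "complete")))

# Count the rows in each missingness pattern.
pattern_counts <- table(pattern)
print(pattern_counts)

# Cross-check: rows labeled "complete" should match complete.cases().
n_complete <- sum(complete.cases(toy))
```

The four counts should always add up to the number of rows, which is a quick consistency check on any answer of this kind.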
d.
The proportion of missing Ozone values is substantially higher in the
≤60 band (0.50) and substantially lower in the 61–70 band (0.08), with
moderate variation across the remaining temperature ranges. This
suggests that missingness depends on temperature, which is consistent
with the values being Missing At Random (MAR): the probability that an
Ozone value is missing is related to the observed temperature. The
largest swings in proportion occur in bands with smaller sample sizes,
but meaningful differences remain between the larger bands (0.33 for
71–80 versus 0.19 for 81-90).
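temp_p <- data %>%
  # right = TRUE closes each band on the right, e.g. (60, 70], so "90+" means Temp > 90
  mutate(temp_band = cut(Temp, breaks = c(0, 60, 70, 80, 90, Inf),
                         labels = c("<=60", "61–70", "71–80", "81-90", "90+"),
                         right = TRUE)) %>%
  group_by(temp_band) %>%
  summarise(proportion = sum(is.na(Ozone)) / n(),  # share of missing Ozone in the band
            band_total = n(),                      # observations in the band
            na_total   = sum(is.na(Ozone)))        # missing Ozone values in the band
kable(temp_p)  # kable() is from the knitr package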
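| temp_band | proportion | band_total | na_total |
|:----------|-----------:|-----------:|---------:|
| <=60      |  0.5000000 |          8 |        4 |
| 61–70     |  0.0800000 |         25 |        2 |
| 71–80     |  0.3269231 |         52 |       17 |
| 81-90     |  0.1851852 |         54 |       10 |
| 90+       |  0.2857143 |         14 |        4 |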
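To see how much the small bands limit what the proportions in part d can tell us, one option is to put an interval around each one. This is only a sketch: it uses the band counts reported in part d and base R's `binom.test()` for an exact 95% interval. Using an interval here is my own addition, not something the question asks for.

```r
# Band counts as reported in part d.
bands      <- c("<=60", "61-70", "71-80", "81-90", "90+")
band_total <- c(8, 25, 52, 54, 14)
na_total   <- c(4, 2, 17, 10, 4)

# Exact (Clopper-Pearson) 95% interval for each band's missing proportion.
ci <- t(mapply(function(x, n) as.numeric(binom.test(x, n)$conf.int),
               na_total, band_total))

summary_tbl <- data.frame(
  band       = bands,
  proportion = na_total / band_total,
  lower      = ci[, 1],
  upper      = ci[, 2]
)

# Wider intervals mean the proportion is less precisely estimated.
summary_tbl$width <- summary_tbl$upper - summary_tbl$lower
print(summary_tbl)
```

The smaller bands should come out with wider intervals, which is the reason for treating the 0.50 in the ≤60 band with some caution.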
e.
new <- data %>%
  mutate(Ozone2 = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))
plot_df_long <- new %>%
  select(Ozone, Ozone2) %>%
  pivot_longer(everything(),
               names_to = "Type",
               values_to = "Value")
Den.plt <- ggplot(plot_df_long, aes(x = Value, color = Type)) +
  stat_density(geom = "line", linewidth = 1.2, position = "identity", na.rm = TRUE) +
  scale_color_manual(values = c("Ozone" = "blue",
                                "Ozone2" = "red")) +
  labs(
    title = "Density: Original vs Mean-Imputed Ozone",
    subtitle = "Comparison of Distributions",
    x = "Ozone",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")
  )
ggplotly(Den.plt)
Comparing the density curves of the original and mean-imputed Ozone
values, the mean-imputed curve shows a clear spike at the mean Ozone
value, because all 37 missing values were replaced with that single
number.
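The spike has a numeric counterpart: mean imputation leaves the mean unchanged but shrinks the spread. A minimal sketch on a made-up vector (the values in `x` are hypothetical, not the Ozone column):

```r
# Hypothetical values with two missing entries.
x <- c(10, 20, NA, 40, NA, 50)

# Replace each missing value with the mean of the observed values.
x_imp <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The mean is preserved by construction.
mean_before <- mean(x, na.rm = TRUE)
mean_after  <- mean(x_imp)

# The standard deviation drops, because the imputed points add no spread.
sd_before <- sd(x, na.rm = TRUE)
sd_after  <- sd(x_imp)

print(c(mean_before = mean_before, mean_after = mean_after))
print(c(sd_before = sd_before, sd_after = sd_after))
```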
f.
This plot shows that the imputed Ozone values (red and green, taken
from the first two imputed data sets) tend to follow the same
distribution as the observed values (grey).
mice_data <- mice(data, seed = 123)  # mice() is from the mice package
# Keep only the rows whose Ozone was originally missing, from imputed data sets 1 and 2
mice_1 <- filter(complete(mice_data, 1), is.na(data$Ozone))
mice_2 <- filter(complete(mice_data, 2), is.na(data$Ozone))
ggplot() +
  geom_point(data = data, aes(x = Temp, y = Ozone), color = "grey") +                 # observed
  geom_point(data = mice_1, aes(x = Temp, y = Ozone), color = "red", alpha = 0.5) +   # imputation 1
  geom_point(data = mice_2, aes(x = Temp, y = Ozone), color = "green", alpha = 0.5)   # imputation 2

g.
The pair of variables with the strongest positive linear relationship
is Temp and Ozone, with a correlation of 0.655. This relationship is
clearly reflected in their scatterplot, which shows an upward trend
with most points clustered together, indicating that higher
temperatures are associated with higher ozone levels. The strongest
negative linear relationship is between Wind and Ozone, with a
correlation of −0.576. Their scatterplot reflects this as well, with
points forming a clear decreasing pattern, indicating that higher wind
speeds are associated with lower ozone levels.
fmice <- complete(mice_data, 1)
ggpairs(fmice)

h.
The largest and darkest blue circle is between Temp and Ozone, and
the largest and darkest red circle is between Wind and Ozone, supporting
our correlation findings above.
cor_mat <- cor(fmice)
corrplot.mixed(cor_mat, upper = "circle", order = 'AOE')

i.
The Euclidean distance between these two observations is 1.
data_i <- fmice[1:2,c(5,6)]
ggplot() +
geom_point(data = data_i, aes(x = Month, y = Day))

as.matrix(dist(data_i))
  1 2
1 0 1
2 1 0
j.
The Manhattan distance is also 1. The Euclidean distance is a
straight-line measure, whereas the Manhattan distance is measured along
horizontal and vertical paths. Because these two points differ in only
one coordinate, the straight line between them is itself a single
vertical path, so the two distances are the same.
as.matrix(dist(data_i, method = "manhattan"))
  1 2
1 0 1
2 1 0
k.
The Euclidean distance between the two observations is 1.414214,
while the Manhattan distance is 2. The Euclidean distance is shorter
because it measures the straight line distance between the two points.
In contrast, the Manhattan distance measures distance along horizontal
and vertical paths, meaning it cannot move diagonally, and in this case
the points are separated diagonally.
data_i <- fmice[c(1,33),c(5,6)]
ggplot() +
geom_point(data = data_i, aes(x = Month, y = Day))

as.matrix(dist(data_i))
          1       33
1  0.000000 1.414214
33 1.414214 0.000000
as.matrix(dist(data_i, method = "manhattan"))
   1 33
1  0  2
33 2  0
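The two distance measures in parts i through k can be written out by hand to show where the numbers come from. In this sketch the coordinates are assumed values chosen to reproduce the reported distances (the actual Month and Day values are not printed above), and `euclidean()` and `manhattan()` are helper names I made up for illustration.

```r
# Assumed (Month, Day) coordinates, chosen to match the reported distances.
p1 <- c(Month = 5, Day = 1)
p2 <- c(Month = 5, Day = 2)   # differs from p1 in one coordinate
p3 <- c(Month = 6, Day = 2)   # differs from p1 in both coordinates

# Straight-line distance: square root of the summed squared differences.
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# Grid distance: sum of the absolute differences.
manhattan <- function(a, b) sum(abs(a - b))

print(c(euclidean = euclidean(p1, p2), manhattan = manhattan(p1, p2)))
print(c(euclidean = euclidean(p1, p3), manhattan = manhattan(p1, p3)))

# dist() should agree with the hand-written versions.
d_euc <- as.numeric(dist(rbind(p1, p3)))
d_man <- as.numeric(dist(rbind(p1, p3), method = "manhattan"))
```

When only one coordinate differs, the squared difference and the absolute difference reduce to the same single term, which is why the two measures agree in part j and diverge in part k.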
l.
s_data <- mutate(fmice, across(where(is.numeric), ~scale(.) %>% as.vector))
The first observation has an Ozone value of 41, which standardizes to
0.0106743. This means the original value lies 0.0106743 standard
deviations above the mean Ozone value. We can check this by hand using
the mean of Ozone, 40.6601307, and its standard deviation, 31.839888.
The distance from the mean is 41 - 40.6601307 = 0.3398693. Dividing
this by the standard deviation gives 0.3398693 / 31.839888 = 0.0106743,
which matches the standardized score.
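The hand check in part l can be reproduced in a few lines. The first half uses the mean and standard deviation reported above; the second half uses a small made-up vector (`v` is hypothetical) to show that `scale()` applies the same formula.

```r
# Values reported in part l.
x_obs  <- 41          # Ozone value of observation 1
x_mean <- 40.6601307  # mean of Ozone
x_sd   <- 31.839888   # standard deviation of Ozone

# z-score: distance from the mean, in units of standard deviations.
z <- (x_obs - x_mean) / x_sd
print(z)

# scale() centers by the mean and divides by the standard deviation.
v <- c(2, 4, 6)                      # hypothetical values
z_manual <- (v - mean(v)) / sd(v)
z_scale  <- as.vector(scale(v))
print(rbind(z_manual, z_scale))
```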
m.
The nearest neighbors to observation 1 are observations 2, 10, and 7,
with observation 2 being the closest. The nearest neighbors to
observation 2 are observations 1, 10, and 38, with observation 1 being
the closest.
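d_z <- dist(s_data)  # Euclidean distances on the standardized data
# Call dbscan's kNN() explicitly: VIM (used for aggr()) also exports a kNN(), which does imputation
result <- dbscan::kNN(x = d_z, k = 3, search = "dist")
result$id[1:4, ]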
      1  2  3
[1,]  2 10  7
[2,]  1 10 38
[3,] 38 10  2
[4,]  7 13 14
result$dist[1:4, ]
             1        2        3
[1,] 0.9882166 1.422317 1.544303
[2,] 0.9882166 1.488389 1.491222
[3,] 1.5613885 1.561498 1.564420
[4,] 0.9701069 1.322249 1.379994
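To make the id and dist tables in part m easier to read, here is a base R sketch of what a nearest-neighbor lookup over a distance matrix does. It uses made-up points (`pts` is hypothetical) and is only meant to illustrate the idea, not to show how the package function is implemented.

```r
# Hypothetical points on a line, not the homework data.
pts <- data.frame(x = c(0, 1, 5, 6), y = c(0, 0, 0, 0))

# Full symmetric matrix of pairwise Euclidean distances.
d <- as.matrix(dist(pts))

# A point should not count as its own neighbor.
diag(d) <- Inf

k <- 2

# For each row, the indices of the k smallest distances...
nn_id <- t(apply(d, 1, function(row) order(row)[1:k]))

# ...and the distances themselves, in the same order.
nn_dist <- t(apply(d, 1, function(row) sort(row)[1:k]))

print(nn_id)
print(nn_dist)
```

Each row of `nn_id` lists that observation's neighbors from closest to farthest, which is how the id table above is read: row 1 gives the neighbors of observation 1, and so on.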
Question 2
My top preference would be analyzing economic/financial data,
especially looking at the relationship between money supply, inflation,
consumer prices, wages, etc.
My second preference would be working with mental health or survey
data, for example examining relationships between medication use,
habits, and well being outcomes.
My third preference would be sports analytics; I would be fine with
anything.
