Question 1

a.

The data set has 153 observations and 6 variables. Four variables are quantitative (Ozone, Solar.R, Wind, Temp). The other two (Month, Day) are stored as numbers but act as categorical date labels.

library(dplyr)  # provides glimpse(), %>%, mutate(), and filter() used below
data <- read.csv("HW1data.csv")
glimpse(data)

b.

Ozone is missing 37 values. Solar.R is missing 7 values.

colSums(is.na(data))
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 

c.

There are 111 complete cases. There are 35 observations missing Ozone only. There are 5 observations missing Solar.R only. There are 2 observations missing both Ozone and Solar.R.

aggr(data, numbers = TRUE, prop = FALSE, sortVar = TRUE)


 Variables sorted by number of missings: 
 Variable Count
    Ozone    37
  Solar.R     7
     Wind     0
     Temp     0
    Month     0
      Day     0
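
The per-variable counts printed above do not show the pattern counts quoted in the answer, so a base R cross-tabulation is one way to cross-check them. This is a minimal sketch that assumes HW1data.csv has the same contents as R's built-in airquality data (same dimensions and missing counts); `df`, `pattern_tab`, and `n_complete` are illustrative names.

```r
# Assumption: HW1data.csv matches the built-in airquality data set
df <- airquality

# Cross-tabulate missingness of the two variables that contain NAs
pattern_tab <- table(ozone_missing   = is.na(df$Ozone),
                     solar_r_missing = is.na(df$Solar.R))
print(pattern_tab)

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(df))
print(n_complete)
```

The off-diagonal cells give the "missing one variable only" counts, and the TRUE/TRUE cell gives the rows missing both.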

d.

The proportion of missing Ozone values varies across temperature bands. It is highest in the 60-and-below band (4 of 8, 0.50) and lowest in the 61 to 70 band (2 of 25, 0.08), and it ranges from about 0.19 to 0.33 in the remaining bands. This suggests that missingness depends on temperature, which is consistent with the values being Missing At Random (MAR) rather than Missing Completely At Random: the probability that an Ozone value is missing is related to the observed temperature. The data alone cannot rule out that missingness also depends on the unobserved Ozone values themselves. The largest swings in proportion occur in bands with small sample sizes, but the two largest bands still differ noticeably (0.33 for 71 to 80 versus 0.19 for 81 to 90).

# Bands are right-closed: (0,60], (60,70], (70,80], (80,90], (90,Inf)
temp_p <- data %>%
  mutate(temp_band = cut(Temp,
                         breaks = c(0, 60, 70, 80, 90, Inf),
                         labels = c("<=60", "61-70", "71-80", "81-90", ">90"),
                         right = TRUE)) %>%
  group_by(temp_band) %>%
  summarise(proportion = sum(is.na(Ozone)) / n(),
            band_total = n(),
            na_total = sum(is.na(Ozone)))

kable(temp_p)
temp_band  proportion  band_total  na_total
<=60        0.5000000           8         4
61-70       0.0800000          25         2
71-80       0.3269231          52        17
81-90       0.1851852          54        10
>90         0.2857143          14         4
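
The banded proportions can also be checked more formally. The sketch below uses Fisher's exact test, which is a technique added here for illustration and not part of the original answer, and it assumes HW1data.csv matches R's built-in airquality data. A small p-value would argue against missingness being unrelated to Temp, but it could not by itself separate MAR from missingness that depends on the unobserved Ozone values. `temp_band`, `band_by_missing`, and `test_result` are illustrative names.

```r
# Assumption: HW1data.csv matches the built-in airquality data set
df <- airquality

# Same right-closed temperature bands as in the summary table
temp_band <- cut(df$Temp,
                 breaks = c(0, 60, 70, 80, 90, Inf),
                 labels = c("<=60", "61-70", "71-80", "81-90", ">90"))

# Contingency table: temperature band by Ozone missingness
band_by_missing <- table(temp_band, ozone_missing = is.na(df$Ozone))
print(band_by_missing)

# Fisher's exact test avoids the small expected counts that would
# make a chi-squared approximation unreliable in the sparse bands
test_result <- fisher.test(band_by_missing)
print(test_result$p.value)
```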

e.

new <- data %>% mutate(Ozone2 = ifelse(is.na(Ozone), mean(Ozone, na.rm = TRUE), Ozone))

plot_df_long <- new %>%
  select(Ozone, Ozone2) %>%
  pivot_longer(everything(),
               names_to = "Type",
               values_to = "Value")

Den.plt <- ggplot(plot_df_long, aes(x = Value, color = Type)) +
  stat_density(geom = "line", linewidth = 1.2, position = "identity", na.rm = TRUE) +
  scale_color_manual(values = c("Ozone" = "blue",
                                "Ozone2" = "red")) +
  labs(
    title = "Density: Original vs Mean-Imputed Ozone",
    subtitle = "Comparison of Distributions",
    x = "Ozone",
    y = "Density"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.margin = margin(t = 35, r = 20, b = 30, l = 30, unit = "pt")
  )

ggplotly(Den.plt)

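Compared with the density of the original Ozone values, the density of the mean-imputed values shows a clear spike at the mean Ozone value, because every missing value was replaced by that single number. As a side effect, mean imputation artificially reduces the variance of the variable.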
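
The effect of mean imputation on the spread can be shown numerically as well as visually. This is a small sketch, assuming HW1data.csv matches R's built-in airquality data; `ozone` and `ozone_imputed` are illustrative names.

```r
# Assumption: HW1data.csv matches the built-in airquality data set
ozone <- airquality$Ozone

# Replace each missing value with the mean of the observed values
ozone_imputed <- ifelse(is.na(ozone), mean(ozone, na.rm = TRUE), ozone)

# The mean is unchanged, but the standard deviation shrinks because
# the imputed points all sit exactly at the mean
summary_stats <- c(mean_observed = mean(ozone, na.rm = TRUE),
                   mean_imputed  = mean(ozone_imputed),
                   sd_observed   = sd(ozone, na.rm = TRUE),
                   sd_imputed    = sd(ozone_imputed))
print(summary_stats)
```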

f.

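This scatterplot of Ozone against Temp shows that the imputed values (red for imputation 1, green for imputation 2) fall within the cloud of observed values (grey) and tend to follow the same distribution as the non-missing ones.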

mice_data <- mice(data, seed = 123)

mice_1 <- filter(complete(mice_data, 1), is.na(data$Ozone))
mice_2 <- filter(complete(mice_data, 2), is.na(data$Ozone))

ggplot() +
  geom_point(data = data, aes(x = Temp, y = Ozone), color = "grey") +
  geom_point(data = mice_1, aes(x = Temp, y = Ozone), color = "red", alpha = 0.5) +
  geom_point(data = mice_2, aes(x = Temp, y = Ozone), color = "green", alpha = 0.5)

g.

In the first imputed data set, the variables with the strongest positive linear relationship are Temp and Ozone, with a correlation of 0.655. Their scatterplot reflects this with a clear upward trend and most points clustered together, indicating that higher temperatures are associated with higher ozone levels. The strongest negative linear relationship is between Wind and Ozone, with a correlation of −0.576. Their scatterplot shows a clear decreasing pattern, indicating that higher wind speeds are associated with lower ozone levels.

fmice <- complete(mice_data, 1)
ggpairs(fmice)

h.

The largest and darkest blue circle is between Temp and Ozone, and the largest and darkest red circle is between Wind and Ozone, supporting our correlation findings above.

cor_mat <- cor(fmice)
corrplot.mixed(cor_mat, upper = "circle", order = 'AOE')

i.

The Euclidean distance between observations 1 and 2, measured on Month and Day, is 1.

data_i <- fmice[1:2,c(5,6)]

ggplot() +
  geom_point(data = data_i, aes(x = Month, y = Day))

as.matrix(dist(data_i))
  1 2
1 0 1
2 1 0

j.

The Manhattan distance is also 1. The Euclidean distance is a straight-line measure, whereas the Manhattan distance measures along horizontal and vertical paths. Since the two points are separated by a single vertical step (they differ only in Day), the two distances are the same.

as.matrix(dist(data_i, method = "manhattan"))
  1 2
1 0 1
2 1 0

k.

The Euclidean distance between observations 1 and 33 is 1.414214, while the Manhattan distance is 2. The Euclidean distance is shorter because it measures the straight-line distance between the two points. In contrast, the Manhattan distance measures along horizontal and vertical paths, so it cannot move diagonally, and in this case the points are separated diagonally (one unit in Month and one unit in Day).

data_i <- fmice[c(1,33),c(5,6)]

ggplot() +
  geom_point(data = data_i, aes(x = Month, y = Day))

as.matrix(dist(data_i))
          1       33
1  0.000000 1.414214
33 1.414214 0.000000
as.matrix(dist(data_i, method = "manhattan"))
   1 33
1  0  2
33 2  0
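
Both distances can be reproduced by hand from their definitions. The sketch below uses two hypothetical points that differ by one unit in each coordinate, which is what the distances above imply; the specific coordinates and the names `p_a`, `p_b`, and `diff_ab` are assumptions for illustration.

```r
# Hypothetical coordinates: one unit apart in Month and one unit apart in Day
p_a <- c(Month = 5, Day = 1)
p_b <- c(Month = 6, Day = 2)

diff_ab <- p_b - p_a

# Euclidean: square root of the sum of squared coordinate differences
euclidean <- sqrt(sum(diff_ab^2))

# Manhattan: sum of absolute coordinate differences
manhattan <- sum(abs(diff_ab))

print(c(euclidean = euclidean, manhattan = manhattan))

# Cross-check against dist() on the same two points
pts <- rbind(p_a, p_b)
print(dist(pts))
print(dist(pts, method = "manhattan"))
```

With a unit step in each direction the Euclidean distance is the square root of 2 and the Manhattan distance is 2, matching the output above.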

l.

s_data <- mutate(fmice, across(where(is.numeric), ~scale(.) %>% as.vector))

The first observation has an Ozone value of 41, which standardizes to 0.0106743. This means the original value lies 0.0106743 standard deviations above the mean Ozone value. We can check this using the mean of Ozone, 40.6601307, and its standard deviation, 31.839888. The distance from the mean is 41 − 40.6601307 = 0.3398693. Dividing this by the standard deviation gives 0.3398693 / 31.839888 = 0.0106743, which matches the standardized score.
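
The hand calculation above can be written out directly. This sketch plugs in the mean and standard deviation quoted in the text, then confirms on a hypothetical toy vector that scale() applies the same formula; `z_manual`, `toy`, and the other names are illustrative.

```r
# Values quoted in the text for the first observation
x          <- 41
ozone_mean <- 40.6601307
ozone_sd   <- 31.839888

# z-score: distance from the mean in units of standard deviation
z_manual <- (x - ozone_mean) / ozone_sd
print(round(z_manual, 7))

# scale() centers by the mean and divides by the sample standard deviation;
# check this on a hypothetical toy vector
toy          <- c(10, 20, 30, 40)
z_toy        <- as.vector(scale(toy))
z_toy_manual <- (toy - mean(toy)) / sd(toy)
print(all.equal(z_toy, z_toy_manual))
```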

m.

The three nearest neighbors of observation 1 are observations 2, 10, and 7, with observation 2 being the closest (distance 0.988). The three nearest neighbors of observation 2 are observations 1, 10, and 38, with observation 1 being the closest (the same distance, 0.988).

d_z <- dist(s_data)
result <- kNN(x = d_z, k = 3, search = "dist")
result$id[1:4, ]   
      1  2  3
[1,]  2 10  7
[2,]  1 10 38
[3,] 38 10  2
[4,]  7 13 14
result$dist[1:4, ] 
             1        2        3
[1,] 0.9882166 1.422317 1.544303
[2,] 0.9882166 1.488389 1.491222
[3,] 1.5613885 1.561498 1.564420
[4,] 0.9701069 1.322249 1.379994
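
The neighbor lists above come from a precomputed distance object, so the same idea can be sketched in base R by sorting each row of the distance matrix. The example below uses hypothetical one-dimensional toy points, not the homework data; `toy`, `dist_mat`, `nn_id`, and `nn_dist` are illustrative names.

```r
# Hypothetical toy data: five points on a line
toy <- data.frame(x = c(0, 1, 3, 7, 8))

# Full pairwise Euclidean distance matrix
dist_mat <- as.matrix(dist(toy))

# A point should not count as its own neighbor
diag(dist_mat) <- Inf

k <- 2

# For each row, the indices and distances of the k smallest entries
nn_id   <- t(apply(dist_mat, 1, function(row) order(row)[1:k]))
nn_dist <- t(apply(dist_mat, 1, function(row) sort(row)[1:k]))

print(nn_id)
print(nn_dist)
```

Each row of `nn_id` lists that point's neighbors from nearest to farthest, in the same layout as `result$id`.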

Question 2

My top preference would be analyzing economic/financial data, especially looking at the relationship between money supply, inflation, consumer prices, wages, etc.

My second preference would be working with mental health or survey data, for example examining relationships between medication use, habits, and well being outcomes.

My third preference would be sports analytics; I would be fine with anything.

