Critiquing Models and Analysis

Analytical Issues: Explored class imbalance and outliers in datasets. Applied statistical techniques like weighted averages, IQR, and boxplots. Emphasized the importance of context-specific definitions.

Ethical Considerations: Addressed gender-based class imbalance in nurse data. Discussed the ethical responsibility in handling missing data. Raised awareness of potential biases and ethical implications in statistical modeling.

Epistemological Insights: Introduced weighted averages and considered how the choice of weights shapes what the resulting average can claim to represent. Emphasized a critical mindset and nuanced approach in statistical modeling. Acknowledged the epistemological considerations inherent in data analysis.

Overall Reflection: Encouraged transparent reporting and documentation of analyses. Recognized the ethical and epistemological dimensions in statistical modeling. Aligned the data dive with the broader goal of identifying analytical, ethical, and epistemological issues in statistical models.

# Load libraries
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)

# Class Imbalance

# Create a data frame of *made-up* nurses
# (arbitrary seed so the simulated values are reproducible)
set.seed(1)
nurses <- tibble(
  hair_length = abs(round(c(rnorm(90, 14, 3), rnorm(10, 3, 3)))),
  gender = c(rep("f", 90), rep("m", 10))
)

# Calculate global weighted average: each male nurse counts 9 times,
# so both genders carry the same total weight (10 * 9 = 90 * 1)
nurses <- nurses %>%
  mutate(gender_weight = ifelse(gender == "m", 9, 1))

avg_length_w <- weighted.mean(nurses$hair_length, nurses$gender_weight)

# Plot data
nurses %>%
  ggplot() +
  geom_jitter(mapping = aes(x = hair_length, y = gender), 
              shape = 1, size = 3, width = 0, height = 0.1) +
  # Unweighted mean, drawn as an orange tick on its own "average" row
  geom_point(data = summarize(nurses, avg_hair = mean(hair_length)),
             mapping = aes(x = avg_hair, y = "average"),
             shape = "|", size = 12, color = "orange") +
  geom_vline(mapping = aes(xintercept = avg_length_w,
                           color = "Weighted Average Hair Length (9-1)")) +
  labs(title = "Made-Up Data for Nurse Hair Length", 
       colour = "",
       x = "hair length", y = "gender") +
  theme_hc()

# Summary for Class Imbalance:
# - Illustrates class imbalance in the context of gender distribution among nurses.
# - Introduces the concept of a weighted average to address imbalance.
# - Discusses the ethical and epistemological considerations in choosing weights.
# - Aligns with the purpose of identifying analytical, ethical, and epistemological issues.
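
The effect of the 9-to-1 weights is easier to see with fixed numbers than with random draws. The sketch below uses a hypothetical, deterministic stand-in for the nurses data (every woman at 14, every man at 3, matching the means passed to `rnorm()` above) and base R only. Under those assumptions, the weighted average lands exactly halfway between the two group means, because each gender ends up with the same total weight.

```r
# Hypothetical, deterministic stand-in for the simulated nurses data
hair_length <- c(rep(14, 90), rep(3, 10))
gender      <- c(rep("f", 90), rep("m", 10))
weights     <- ifelse(gender == "m", 9, 1)

group_means <- tapply(hair_length, gender, mean)    # f = 14, m = 3
unweighted  <- mean(hair_length)                    # (90*14 + 10*3) / 100 = 12.9
weighted    <- weighted.mean(hair_length, weights)  # (90*14 + 90*3) / 180 = 8.5

# Each gender now carries the same total weight
tapply(weights, gender, sum)                        # f = 90, m = 90

# The weighted average equals the plain average of the two group means
all.equal(weighted, mean(group_means))              # TRUE
```

Whether 8.5 or 12.9 is the "right" summary depends on the question being asked: the unweighted figure describes the typical nurse in this sample, while the weighted one describes a population in which both genders are equally represented.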

# Outliers

# Detect outliers using Interquartile Range (IQR)
mpg %>%
  ggplot() +
  geom_boxplot(mapping = aes(x = cty, y = "")) +
  labs(title = "City Mileage from mpg data",
       x = "city mileage", y = "") +
  theme_classic()

# Summary for Outliers:
# - Illustrates the use of boxplots and the concept of Interquartile Range (IQR) to detect outliers.
# - Highlights the need for context-specific definitions of outliers.
# - Discusses ethical considerations related to labeling data points as outliers.
# - Aligns with the purpose of identifying analytical, ethical, and epistemological issues.
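
The boxplot flags points visually; the same rule can be applied numerically. The sketch below assumes the conventional 1.5 × IQR fences. The helper `iqr_outliers()` and its `k` argument are hypothetical names introduced for illustration, and the input vector is made up.

```r
# Return the values lying more than k * IQR beyond the first or third quartile
iqr_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Made-up mileage values: Q1 = 12, Q3 = 16, IQR = 4, fences at 6 and 22
x <- c(9, 11, 12, 13, 14, 15, 16, 18, 35)
iqr_outliers(x)  # 35
```

With ggplot2 loaded, `iqr_outliers(mpg$cty)` would apply the same rule to the city mileage column. Changing `k` makes the definition of "outlier" stricter or looser, which is one concrete place where a context-specific choice enters the analysis.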

# Missing Data

# Types of Missing Data
foods <- tibble(
  food = c("asparagus", "celery", "chicken", "oatmeal"),
  group = c("veggie", "veggie", "meat", "grain"),
  calories = c(100, NA, 300, 50),
  survey_year = c(2019, 2020, 2022, 2023),
  survey_is_tasty = c("yes", "yes", "yes", "yes"),
)

foods
## # A tibble: 4 × 5
##   food      group  calories survey_year survey_is_tasty
##   <chr>     <chr>     <dbl>       <dbl> <chr>          
## 1 asparagus veggie      100        2019 yes            
## 2 celery    veggie       NA        2020 yes            
## 3 chicken   meat        300        2022 yes            
## 4 oatmeal   grain        50        2023 yes
# Summary for Types of Missing Data:
# - Introduces explicit and implicit missing values along with empty groups.
# - Highlights the ethical responsibility of researchers to document and communicate missing data.
# - Aligns with the purpose of identifying analytical, ethical, and epistemological issues.
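
The three kinds of missingness can each be checked with a short query. The sketch below rebuilds the made-up foods table in base R so it runs without packages. It assumes the survey was meant to run every year from 2019 to 2023, and it adds a hypothetical "fruit" level to illustrate an empty group; neither assumption comes from the data itself.

```r
# Same made-up foods table as above, rebuilt with base R
foods <- data.frame(
  food        = c("asparagus", "celery", "chicken", "oatmeal"),
  group       = c("veggie", "veggie", "meat", "grain"),
  calories    = c(100, NA, 300, 50),
  survey_year = c(2019, 2020, 2022, 2023)
)

# Explicit missing values: NA cells that are visible in the table
explicit_na <- colSums(is.na(foods))
explicit_na

# Implicit missing values: survey years absent from the table altogether
# (assumes one survey per year from 2019 to 2023)
missing_years <- setdiff(2019:2023, foods$survey_year)
missing_years  # 2021

# Empty groups: categories that could occur but have zero rows
# ("fruit" is a hypothetical level added for illustration)
group_counts <- table(factor(foods$group,
                             levels = c("veggie", "meat", "grain", "fruit")))
group_counts
```

Only the first check works without outside knowledge. Implicit gaps and empty groups can be found only once someone states what the complete set of years or categories should be, which is why documenting those expectations matters.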

# ... (Dealing with missing data sections)

# Overall Reflection

# Summary for Overall Reflection:
# - Emphasizes the importance of a critical mindset and a nuanced approach to statistical modeling.
# - Encourages transparent reporting and documentation of statistical analyses.
# - Acknowledges the ethical and epistemological considerations inherent in data analysis.
# - Aligns with the purpose of identifying analytical, ethical, and epistemological issues.

# Exercise with airquality dataset

# Load the airquality dataset
data(airquality)

# Explore the structure of the dataset
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...
# Check for missing values in the dataset
missing_values <- sum(is.na(airquality))
cat("Number of missing values in the airquality dataset:", missing_values, "\n")
## Number of missing values in the airquality dataset: 44
# Plot the relationship between Ozone and Solar.R
airquality %>%
  ggplot(aes(x = Ozone, y = Solar.R)) +
  geom_point() +
  labs(title = "Relationship between Ozone and Solar.R",
       x = "Ozone",
       y = "Solar.R")
## Warning: Removed 42 rows containing missing values (`geom_point()`).
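
The dataset has 44 missing values in total, yet the scatter plot drops only 42 rows. The sketch below, in base R, shows one way to reconcile the two numbers: count missing cells per column, count rows with at least one missing cell, and count the rows where both plotted variables are missing.

```r
# Missing cells per column; only Ozone and Solar.R have any
na_per_column <- colSums(is.na(airquality))
na_per_column

# Rows with at least one missing cell
rows_with_na <- sum(!complete.cases(airquality))

# Rows where Ozone and Solar.R are both missing are counted twice in the cell total
rows_both_na <- sum(is.na(airquality$Ozone) & is.na(airquality$Solar.R))

# Cell total minus the double-counted rows gives the number of incomplete rows
sum(na_per_column) - rows_both_na == rows_with_na  # TRUE
```

Reporting both figures avoids ambiguity: "44 missing values" describes cells, while "42 rows removed" describes observations.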

# Exercise with airquality dataset (Continued)

# Plot the distribution of Ozone levels across different months
# (Month is stored as an integer, so convert it to a factor to get one box per month)
airquality %>%
  ggplot(aes(x = factor(Month), y = Ozone)) +
  geom_boxplot() +
  labs(title = "Distribution of Ozone Levels by Month",
       x = "Month",
       y = "Ozone Levels")
## Warning: Removed 37 rows containing non-finite values (`stat_boxplot()`).

# Explore the relationship between Temperature and Wind
airquality %>%
  ggplot(aes(x = Temp, y = Wind)) +
  geom_point() +
  labs(title = "Relationship between Temperature and Wind",
       x = "Temperature",
       y = "Wind Speed")
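
A scatter plot shows the shape of a relationship but not its strength. As a minimal sketch, the Pearson correlation can be computed alongside the plot. The choice of Pearson here is an assumption; a rank-based measure may suit the data better if the relationship is not roughly linear.

```r
# Pearson correlation between temperature and wind speed
# (neither column has missing values in airquality, so no NA handling is needed)
temp_wind_cor <- cor(airquality$Temp, airquality$Wind)
temp_wind_cor

# cor.test() reports a confidence interval and p-value for the same estimate
cor.test(airquality$Temp, airquality$Wind)
```

A negative value indicates that warmer days in this dataset tended to be less windy. It does not by itself say anything about cause.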

# Handling missing data: drop rows where Ozone is missing
airquality_filtered <- airquality %>%
  filter(!is.na(Ozone))

# Calculate the average Ozone level for each month with available data
avg_ozone_by_month <- airquality_filtered %>%
  group_by(Month) %>%
  summarize(avg_ozone = mean(Ozone))

# Plot the average Ozone levels by month
avg_ozone_by_month %>%
  ggplot(aes(x = Month, y = avg_ozone)) +
  geom_col(fill = "#3182bd") +
  labs(title = "Average Ozone Levels by Month",
       x = "Month",
       y = "Average Ozone Levels")
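
A monthly average is only as reliable as the number of days behind it. The sketch below, in base R, pairs each monthly average with its count of non-missing Ozone readings. The column names `n_obs` and `avg_ozone` are illustrative choices, and `na.rm = TRUE` is used as an alternative to filtering the rows first.

```r
# Number of days with a recorded Ozone value in each month
ozone_n <- tapply(!is.na(airquality$Ozone), airquality$Month, sum)

# Monthly mean, ignoring missing readings (same result as filtering first)
ozone_avg <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)

ozone_summary <- data.frame(
  Month     = as.integer(names(ozone_n)),
  n_obs     = as.vector(ozone_n),
  avg_ozone = as.vector(ozone_avg)
)
ozone_summary
```

Showing `n_obs` next to each average lets a reader judge which months rest on few readings before comparing the bars.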

In this continued exploration of the airquality dataset, boxplots show how Ozone levels are distributed within each recorded month, and a scatter plot shows the relationship between Temperature and Wind. To handle missing data, rows with a missing Ozone value were dropped before averaging, so each monthly average reflects only the days on which Ozone was actually recorded. These analyses are a starting point for the further questions raised by the exercise prompt.

Conclusion

In conclusion, this week’s data dive aimed to prompt careful consideration of the analytical, ethical, and epistemological dimensions of statistical models. Through practical examples, we examined class imbalance, outlier detection, and missing-data strategies, and reflected on statistical modeling overall. The exercises provided hands-on experience and underscored three points: choosing a methodology requires judgment, handling data carries ethical responsibility, and statistical analysis calls for a critical mindset.