1 Data Cleaning and Exploration with SQL

In this section, we performed initial data exploration and cleaning using SQL through Python’s sqlite3 interface. The goals were to identify and resolve missing values, duplicates, and relational inconsistencies.

Key cleaning steps:

  • Identified and removed rows with missing or invalid demographic data
  • Checked for duplicate patient entries
  • Ensured all foreign key relationships were intact between visits, patients, and clinics

The cleaned dataset was exported as cleaned_healthcare_data.csv for downstream analysis.

-- Check for missing values
SELECT * FROM patients
WHERE PatientID IS NULL OR PatientName IS NULL OR Gender IS NULL OR Age IS NULL OR Location IS NULL;

-- Check for duplicates
SELECT PatientID, Gender, Age, COUNT(*) AS Count
FROM patients
GROUP BY PatientID, Gender, Age
HAVING COUNT(*) > 1;

-- Join all tables into one dataset
SELECT 
    v.VisitID,
    v.VisitDate,
    v.ReasonForVisit,
    v.Duration,
    v.PatientID,
    p.PatientName,
    p.Gender,
    p.Age,
    p.Location AS PatientLocation,
    c.ClinicID,
    c.ClinicName,
    c.Location AS ClinicLocation,
    c.Specialization
FROM visits v
JOIN patients p ON v.PatientID = p.PatientID
JOIN clinics c ON v.ClinicID = c.ClinicID;
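
For reference, the following is a minimal Python sketch of how these queries can be run through sqlite3 and the joined result exported with pandas; the database filename (healthcare.db) is an assumption, and the dropna step simply mirrors the removal of incomplete demographic rows described above.

import sqlite3
import pandas as pd

# Database filename is an assumption; use the actual SQLite file from the lab.
conn = sqlite3.connect("healthcare.db")

# Re-run the join query shown above and load the result into a DataFrame.
query = """
SELECT
    v.VisitID, v.VisitDate, v.ReasonForVisit, v.Duration, v.PatientID,
    p.PatientName, p.Gender, p.Age, p.Location AS PatientLocation,
    c.ClinicID, c.ClinicName, c.Location AS ClinicLocation, c.Specialization
FROM visits v
JOIN patients p ON v.PatientID = p.PatientID
JOIN clinics c ON v.ClinicID = c.ClinicID;
"""
cleaned = pd.read_sql_query(query, conn)
conn.close()

# Drop any rows still missing demographic fields, per the cleaning steps above.
cleaned = cleaned.dropna(subset=["PatientName", "Gender", "Age", "PatientLocation"])

# Export for the R and scikit-learn analyses that follow.
cleaned.to_csv("cleaned_healthcare_data.csv", index=False)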

2 Statistical Analysis with R

After exporting the cleaned dataset, we used R to explore trends and distributions in patient demographics and clinic utilization.

2.1 Descriptive Statistics and Visualizations

2.1.1 Gender Distribution

library(tidyverse)  # loads readr, dplyr, and ggplot2 used below

health_data <- read_csv("C:/Users/cruzl/OneDrive/Desktop/School/Current/DATS 200/Lab #4/cleaned_healthcare_data.csv")

health_data %>%
  count(Gender) %>%
  ggplot(aes(x = Gender, y = n, fill = Gender)) +
  geom_col() +
  labs(title = "Gender Distribution of Patients", x = "Gender", y = "Count")

2.1.2 Age Distribution

ggplot(health_data, aes(x = Age)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
  labs(title = "Age Distribution of Patients", x = "Age", y = "Number of Patients")

2.1.3 Visit Counts by Specialization

health_data %>%
  count(Specialization) %>%
  ggplot(aes(x = reorder(Specialization, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Visit Counts by Clinic Specialization", x = "Specialization", y = "Visit Count")

2.2 Hypothesis Testing

To assess whether differences in visit behavior were statistically significant, we conducted hypothesis tests using a Welch two-sample t-test and a one-way ANOVA.

2.2.1 T-Test: Visit Duration by Gender

health_data %>%
  group_by(Gender) %>%
  summarise(
    MeanDuration = mean(Duration),
    SD = sd(Duration),
    Count = n()
  )
## # A tibble: 2 × 4
##   Gender MeanDuration    SD Count
##   <chr>         <dbl> <dbl> <int>
## 1 Female         92.7  48.7   203
## 2 Male           94.2  48.4   297
t.test(Duration ~ Gender, data = health_data)
## 
##  Welch Two Sample t-test
## 
## data:  Duration by Gender
## t = -0.33848, df = 431.97, p-value = 0.7352
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
##  -10.191992   7.197316
## sample estimates:
## mean in group Female   mean in group Male 
##             92.69458             94.19192

With p = 0.735, the Welch t-test shows no statistically significant difference in mean visit duration between female and male patients.

2.2.2 ANOVA: Visit Duration by Specialization

anova_result <- aov(Duration ~ Specialization, data = health_data)
summary(anova_result)
##                 Df  Sum Sq Mean Sq F value Pr(>F)
## Specialization   4   11332    2833   1.208  0.306
## Residuals      495 1160763    2345
tukey_result <- TukeyHSD(anova_result)
tukey_result
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = Duration ~ Specialization, data = health_data)
## 
## $Specialization
##                               diff       lwr       upr     p adj
## General-Cardiology       9.0452775 -12.11734 30.207896 0.7684151
## Neurology-Cardiology     2.6180190 -15.97632 21.212356 0.9953144
## Orthopedics-Cardiology  -7.4354226 -28.46784 13.596994 0.8694891
## Pediatrics-Cardiology    0.7407162 -18.67284 20.154268 0.9999731
## Neurology-General       -6.4272585 -24.94473 12.090214 0.8769232
## Orthopedics-General    -16.4807001 -37.44519  4.483793 0.1998892
## Pediatrics-General      -8.3045613 -27.64450 11.035382 0.7653527
## Orthopedics-Neurology  -10.0534416 -28.42197  8.315089 0.5639448
## Pediatrics-Neurology    -1.8773028 -18.36749 14.612887 0.9979486
## Pediatrics-Orthopedics   8.1761388 -11.02124 27.373522 0.7707517

Neither the ANOVA (p = 0.306) nor any Tukey pairwise comparison is statistically significant, so mean visit duration does not differ detectably across clinic specializations.

3 Predictive Modeling with Python

To explore how machine learning can support healthcare forecasting, we built predictive models using Python and scikit-learn. Below is a summary of each model’s setup and findings, followed by an illustrative code sketch for each.

3.1 Predicting Visit Duration

  • Target: Duration
  • Features: Age, Gender, ReasonForVisit, Specialization, PatientLocation
  • Models: Linear Regression, Decision Tree Regressor
  • Performance: R² = -0.11 (Linear), R² = -1.14 (Tree)
  • Insight: Visit duration likely depends on clinical/operational details not captured in this dataset.
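
A minimal sketch of this setup, assuming the column names from the exported dataset; the one-hot encoding, train/test split, and model defaults are illustrative rather than the exact pipeline used.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

df = pd.read_csv("cleaned_healthcare_data.csv")

features = ["Age", "Gender", "ReasonForVisit", "Specialization", "PatientLocation"]
X, y = df[features], df["Duration"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# One-hot encode the categorical features; Age passes through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"),
      ["Gender", "ReasonForVisit", "Specialization", "PatientLocation"])],
    remainder="passthrough",
)

for name, model in [("Linear", LinearRegression()),
                    ("Tree", DecisionTreeRegressor(random_state=42))]:
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    print(name, "R^2:", round(r2_score(y_test, pipe.predict(X_test)), 2))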

3.2 Predicting Visit Frequency per Patient

  • Target: VisitCount
  • Features: Age, Gender, PatientLocation
  • Model: Linear Regression
  • Performance: R² = -0.02
  • Insight: Basic demographics do not predict healthcare usage frequency well.
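
A minimal sketch, assuming VisitCount is obtained by counting visits per patient in the exported dataset; the encoding and split details are illustrative.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

df = pd.read_csv("cleaned_healthcare_data.csv")

# One row per patient with their total number of visits.
per_patient = (
    df.groupby(["PatientID", "Gender", "Age", "PatientLocation"])
      .size()
      .reset_index(name="VisitCount")
)

# One-hot encode the categorical demographics; Age stays numeric.
X = pd.get_dummies(per_patient[["Age", "Gender", "PatientLocation"]], drop_first=True)
y = per_patient["VisitCount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2:", round(r2_score(y_test, model.predict(X_test)), 2))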

3.3 Classifying Reason for Visit

  • Target: ReasonForVisit
  • Model: Random Forest Classifier
  • Accuracy: ~30%
  • Insight: Classification was most accurate for common visit types (e.g., Checkups). Performance could be improved with richer patient context (e.g., medical history).
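
A minimal sketch of the classification setup; the feature set (demographics plus clinic specialization) and the forest size are assumptions, since the report does not list them.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("cleaned_healthcare_data.csv")

# Assumed feature set: demographics plus clinic specialization.
X = pd.get_dummies(df[["Age", "Gender", "PatientLocation", "Specialization"]], drop_first=True)
y = df["ReasonForVisit"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", round(accuracy_score(y_test, y_pred), 2))
print(classification_report(y_test, y_pred))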

4 Conclusion and Recommendations

The combined use of SQL, R, and Python enabled a full-cycle analysis of healthcare usage data. While some patterns were evident (e.g., gender and age distributions), predictive modeling was limited by the dataset’s scope.

4.1 Key Recommendations

  1. Expand Data Capture: Include medical conditions, provider notes, and outcome data to improve predictive modeling.
  2. Monitor Clinic Trends: Use monthly and specialization-based visit patterns to guide staffing and outreach.
  3. Target Preventive Services: Increase outreach for routine care in under-served regions.
  4. Use Machine Learning Cautiously: Current models have limited utility without richer features, but could support early-warning or triage systems with improvement.

End of report.