In this section, we performed initial data exploration and cleaning
using SQL through Python’s sqlite3 interface. The goals
were to identify and resolve missing values, duplicates, and relational
inconsistencies.
Key cleaning steps: - Identified and removed rows with missing or invalid demographic data - Checked for duplicate patient entries - Ensured all foreign key relationships were intact between visits, patients, and clinics
The cleaned dataset was exported as
cleaned_healthcare_data.csv for downstream analysis.
-- Check for missing values
SELECT * FROM patients
WHERE PatientID IS NULL OR PatientName IS NULL OR Gender IS NULL OR Age IS NULL OR Location IS NULL;
-- Check for duplicates
SELECT PatientID, Gender, Age, COUNT(*) as Count
FROM patients
GROUP BY PatientID, Gender, Age
HAVING COUNT(*) > 1;
-- Join all tables into one dataset
SELECT
v.VisitID,
v.VisitDate,
v.ReasonForVisit,
v.Duration,
v.PatientID,
p.PatientName,
p.Gender,
p.Age,
p.Location AS PatientLocation,
c.ClinicID,
c.ClinicName,
c.Location AS ClinicLocation,
c.Specialization
FROM visits v
JOIN patients p ON v.PatientID = p.PatientID
JOIN clinics c ON v.ClinicID = c.ClinicID;
After exporting the cleaned dataset, we used R to explore trends and distributions in patient demographics and clinic utilization.
health_data <- read_csv("C:/Users/cruzl/OneDrive/Desktop/School/Current/DATS 200/Lab #4/cleaned_healthcare_data.csv")
health_data %>%
count(Gender) %>%
ggplot(aes(x = Gender, y = n, fill = Gender)) +
geom_col() +
labs(title = "Gender Distribution of Patients", x = "Gender", y = "Count")
ggplot(health_data, aes(x = Age)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "white") +
labs(title = "Age Distribution of Patients", x = "Age", y = "Number of Patients")
health_data %>%
count(Specialization) %>%
ggplot(aes(x = reorder(Specialization, n), y = n)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(title = "Visit Counts by Clinic Specialization", x = "Specialization", y = "Visit Count")
health_data %>%
mutate(Month = floor_date(as.Date(VisitDate), "month")) %>%
count(Month) %>%
ggplot(aes(x = Month, y = n)) +
geom_line(color = "purple", linewidth = 1.2) +
geom_point(size = 2) +
labs(title = "Monthly Visit Trends", x = "Month", y = "Number of Visits")
To assess statistically significant patterns in visit behavior, we conducted hypothesis testing using t-tests and ANOVA.
health_data %>%
group_by(Gender) %>%
summarise(
MeanDuration = mean(Duration),
SD = sd(Duration),
Count = n()
)
## # A tibble: 2 × 4
## Gender MeanDuration SD Count
## <chr> <dbl> <dbl> <int>
## 1 Female 92.7 48.7 203
## 2 Male 94.2 48.4 297
t.test(Duration ~ Gender, data = health_data)
##
## Welch Two Sample t-test
##
## data: Duration by Gender
## t = -0.33848, df = 431.97, p-value = 0.7352
## alternative hypothesis: true difference in means between group Female and group Male is not equal to 0
## 95 percent confidence interval:
## -10.191992 7.197316
## sample estimates:
## mean in group Female mean in group Male
## 92.69458 94.19192
anova_result <- aov(Duration ~ Specialization, data = health_data)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Specialization 4 11332 2833 1.208 0.306
## Residuals 495 1160763 2345
tukey_result <- TukeyHSD(anova_result)
tukey_result
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Duration ~ Specialization, data = health_data)
##
## $Specialization
## diff lwr upr p adj
## General-Cardiology 9.0452775 -12.11734 30.207896 0.7684151
## Neurology-Cardiology 2.6180190 -15.97632 21.212356 0.9953144
## Orthopedics-Cardiology -7.4354226 -28.46784 13.596994 0.8694891
## Pediatrics-Cardiology 0.7407162 -18.67284 20.154268 0.9999731
## Neurology-General -6.4272585 -24.94473 12.090214 0.8769232
## Orthopedics-General -16.4807001 -37.44519 4.483793 0.1998892
## Pediatrics-General -8.3045613 -27.64450 11.035382 0.7653527
## Orthopedics-Neurology -10.0534416 -28.42197 8.315089 0.5639448
## Pediatrics-Neurology -1.8773028 -18.36749 14.612887 0.9979486
## Pediatrics-Orthopedics 8.1761388 -11.02124 27.373522 0.7707517
To explore how machine learning can support healthcare forecasting,
we built predictive models using Python and scikit-learn.
Below is a summary of each model’s setup and findings.
DurationVisitCountReasonForVisitThe combined use of SQL, R, and Python enabled a full-cycle analysis of healthcare usage data. While some patterns were evident (e.g., gender and age distributions), predictive modeling was limited by the dataset’s scope.
End of report.