Introduction

This project was made for the course Statistička obrada podataka (Statistical Data Processing). It was made by the team “Boksači”.

Loading the required packages

library(dplyr)
library(readr)
library(knitr)
library(ggplot2)
library(kableExtra) 
diabetes <- read_csv("diabetes_dataset00.csv")

First, let us show some basic information about the data.

head(diabetes)
## # A tibble: 6 × 34
##   Target                       `Genetic Markers` Autoantibodies `Family History`
##   <chr>                        <chr>             <chr>          <chr>           
## 1 Steroid-Induced Diabetes     Positive          Negative       No              
## 2 Neonatal Diabetes Mellitus … Positive          Negative       No              
## 3 Prediabetic                  Positive          Positive       Yes             
## 4 Type 1 Diabetes              Negative          Positive       No              
## 5 Wolfram Syndrome             Negative          Negative       Yes             
## 6 LADA                         Positive          Negative       Yes             
## # ℹ 30 more variables: `Environmental Factors` <chr>, `Insulin Levels` <dbl>,
## #   Age <dbl>, BMI <dbl>, `Physical Activity` <chr>, `Dietary Habits` <chr>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <chr>,
## #   `Socioeconomic Factors` <chr>, `Smoking Status` <chr>,
## #   `Alcohol Consumption` <chr>, `Glucose Tolerance Test` <chr>,
## #   `History of PCOS` <chr>, `Previous Gestational Diabetes` <chr>, …
cat("Dataset dimensions:", dim(diabetes))
## Dataset dimensions: 70000 34
cat("\nNumber of rows:", nrow(diabetes))
## 
## Number of rows: 70000
cat("\nNumber of columns:", ncol(diabetes))
## 
## Number of columns: 34
diabetes
## # A tibble: 70,000 × 34
##    Target                      `Genetic Markers` Autoantibodies `Family History`
##    <chr>                       <chr>             <chr>          <chr>           
##  1 Steroid-Induced Diabetes    Positive          Negative       No              
##  2 Neonatal Diabetes Mellitus… Positive          Negative       No              
##  3 Prediabetic                 Positive          Positive       Yes             
##  4 Type 1 Diabetes             Negative          Positive       No              
##  5 Wolfram Syndrome            Negative          Negative       Yes             
##  6 LADA                        Positive          Negative       Yes             
##  7 Type 2 Diabetes             Negative          Negative       No              
##  8 Wolcott-Rallison Syndrome   Positive          Negative       No              
##  9 Secondary Diabetes          Negative          Positive       No              
## 10 Secondary Diabetes          Positive          Negative       Yes             
## # ℹ 69,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <chr>, `Insulin Levels` <dbl>,
## #   Age <dbl>, BMI <dbl>, `Physical Activity` <chr>, `Dietary Habits` <chr>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <chr>,
## #   `Socioeconomic Factors` <chr>, `Smoking Status` <chr>,
## #   `Alcohol Consumption` <chr>, `Glucose Tolerance Test` <chr>, …
| Column name | Description |
|---|---|
| Target | Type of diabetes the patient has |
| Genetic Markers | Specific DNA sequence that “can” be used to predict diabetes |
| Autoantibodies | Presence of antibodies that the body produces against its own cells |
| Family History | Whether the patient has a family history of diabetes |
| Environmental Factors | Environmental factors that can influence the development of diabetes, e.g. air pollution |
| Insulin Levels | Insulin level in the body, which is key for assessing pancreatic function and managing diabetes |
| Age | Age of the person |
| BMI | Body mass index |
| Physical Activity | Level of physical activity |
| Dietary Habits | Dietary habits |
| Blood Pressure | Blood pressure |
| Cholesterol Levels | Cholesterol level in the blood |
| Waist Circumference | Waist circumference |
| Blood Glucose Levels | Glucose level in the blood |
| Ethnicity | Ethnicity of the person |
| Socioeconomic Factors | Socioeconomic factors |
| Smoking Status | Smoking status |
| Alcohol Consumption | Alcohol consumption |
| Glucose Tolerance Test | Glucose tolerance test, used to diagnose diabetes or prediabetes |
| History of PCOS | Whether the woman has had a diagnosis or symptoms of polycystic ovary syndrome (PCOS) |
| Previous Gestational Diabetes | History of gestational diabetes |
| Pregnancy History | Pregnancy history |
| Weight Gain During Pregnancy | Weight gain during pregnancy |
| Pancreatic Health | Health of the pancreas |
| Pulmonary Function | Lung function |
| Cystic Fibrosis Diagnosis | Diagnosis of cystic fibrosis |
| Steroid Use History | History of steroid use |
| Genetic Testing | Whether there is a genetic predisposition for diabetes |
| Neurological Assessments | Neurological assessments |
| Liver Function Tests | Liver function test |
| Digestive Enzyme Levels | Level of digestive enzymes |
| Urine Test | Urine test |
| Birth Weight | Body weight at birth |
| Early Onset Symptoms | Symptoms that appear in the early stage of diabetes |

Descriptive analysis of the diabetes table: a brief overview of the table

glimpse(diabetes)
## Rows: 70,000
## Columns: 34
## $ Target                          <chr> "Steroid-Induced Diabetes", "Neonatal …
## $ `Genetic Markers`               <chr> "Positive", "Positive", "Positive", "N…
## $ Autoantibodies                  <chr> "Negative", "Negative", "Positive", "P…
## $ `Family History`                <chr> "No", "No", "Yes", "No", "Yes", "Yes",…
## $ `Environmental Factors`         <chr> "Present", "Present", "Present", "Pres…
## $ `Insulin Levels`                <dbl> 40, 13, 27, 8, 17, 17, 29, 10, 47, 21,…
## $ Age                             <dbl> 44, 1, 36, 7, 10, 41, 30, 3, 47, 72, 6…
## $ BMI                             <dbl> 38, 17, 24, 16, 17, 26, 31, 18, 25, 24…
## $ `Physical Activity`             <chr> "High", "High", "High", "Low", "High",…
## $ `Dietary Habits`                <chr> "Healthy", "Healthy", "Unhealthy", "Un…
## $ `Blood Pressure`                <dbl> 124, 73, 121, 100, 103, 127, 115, 80, …
## $ `Cholesterol Levels`            <dbl> 201, 121, 185, 151, 146, 208, 237, 157…
## $ `Waist Circumference`           <dbl> 50, 24, 36, 29, 33, 32, 43, 29, 40, 36…
## $ `Blood Glucose Levels`          <dbl> 168, 178, 105, 121, 289, 142, 186, 206…
## $ Ethnicity                       <chr> "Low Risk", "Low Risk", "Low Risk", "L…
## $ `Socioeconomic Factors`         <chr> "Medium", "High", "Medium", "High", "L…
## $ `Smoking Status`                <chr> "Smoker", "Non-Smoker", "Smoker", "Smo…
## $ `Alcohol Consumption`           <chr> "High", "Moderate", "High", "Moderate"…
## $ `Glucose Tolerance Test`        <chr> "Normal", "Normal", "Abnormal", "Abnor…
## $ `History of PCOS`               <chr> "No", "Yes", "Yes", "No", "No", "No", …
## $ `Previous Gestational Diabetes` <chr> "No", "No", "No", "Yes", "Yes", "No", …
## $ `Pregnancy History`             <chr> "Normal", "Normal", "Normal", "Normal"…
## $ `Weight Gain During Pregnancy`  <dbl> 18, 8, 15, 12, 2, 11, 15, 4, 30, 33, 3…
## $ `Pancreatic Health`             <dbl> 36, 26, 56, 49, 10, 40, 62, 13, 91, 86…
## $ `Pulmonary Function`            <dbl> 76, 60, 80, 89, 41, 85, 64, 44, 71, 69…
## $ `Cystic Fibrosis Diagnosis`     <chr> "No", "Yes", "Yes", "Yes", "No", "Yes"…
## $ `Steroid Use History`           <chr> "No", "No", "No", "No", "No", "No", "Y…
## $ `Genetic Testing`               <chr> "Positive", "Negative", "Negative", "P…
## $ `Neurological Assessments`      <dbl> 3, 1, 1, 2, 1, 2, 3, 1, 3, 2, 3, 1, 1,…
## $ `Liver Function Tests`          <chr> "Normal", "Normal", "Abnormal", "Abnor…
## $ `Digestive Enzyme Levels`       <dbl> 56, 28, 55, 60, 24, 52, 96, 29, 74, 42…
## $ `Urine Test`                    <chr> "Ketones Present", "Glucose Present", …
## $ `Birth Weight`                  <dbl> 2629, 1881, 3622, 3542, 1770, 3835, 44…
## $ `Early Onset Symptoms`          <chr> "No", "Yes", "Yes", "No", "No", "Yes",…

Because the data also contain categorical columns, for which summary() does not give its usual numeric output, we convert the categorical columns to factors.

categorical_vars <- c(
  "Target", "Genetic Markers", "Autoantibodies", 
  "Family History", "Environmental Factors", 
  "Physical Activity", "Dietary Habits", "Ethnicity", 
  "Socioeconomic Factors", "Smoking Status", 
  "Alcohol Consumption", "Glucose Tolerance Test", 
  "History of PCOS", "Previous Gestational Diabetes", 
  "Pregnancy History", "Cystic Fibrosis Diagnosis", 
  "Steroid Use History", "Genetic Testing", 
  "Liver Function Tests", "Urine Test", 
  "Early Onset Symptoms"
)

# Convert the categorical columns to factors
diabetes[categorical_vars] <- lapply(diabetes[categorical_vars], as.factor)

s <- summary(diabetes)
s %>%
  kable() %>%
  kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover", "condensed"))
Numeric columns

| Column | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| Insulin Levels | 5.00 | 13.00 | 19.00 | 21.61 | 28.00 | 49.00 |
| Age | 0.00 | 14.00 | 31.00 | 32.02 | 49.00 | 79.00 |
| BMI | 12.00 | 20.00 | 25.00 | 24.78 | 29.00 | 39.00 |
| Blood Pressure | 60.0 | 99.0 | 113.0 | 111.3 | 125.0 | 149.0 |
| Cholesterol Levels | 100.0 | 163.0 | 191.0 | 194.9 | 225.0 | 299.0 |
| Waist Circumference | 20.00 | 30.00 | 34.00 | 35.05 | 39.00 | 54.00 |
| Blood Glucose Levels | 80.0 | 121.0 | 152.0 | 160.7 | 194.0 | 299.0 |
| Weight Gain During Pregnancy | 0.0 | 7.0 | 16.0 | 15.5 | 22.0 | 39.0 |
| Pancreatic Health | 10.00 | 32.00 | 46.00 | 47.56 | 64.00 | 99.00 |
| Pulmonary Function | 30.00 | 63.00 | 72.00 | 70.26 | 79.00 | 89.00 |
| Neurological Assessments | 1.000 | 1.000 | 2.000 | 1.804 | 2.000 | 3.000 |
| Digestive Enzyme Levels | 10.00 | 31.00 | 48.00 | 46.42 | 61.00 | 99.00 |
| Birth Weight | 1500 | 2629 | 3103 | 3097 | 3656 | 4499 |

Categorical columns (count per level)

| Column | Counts |
|---|---|
| Target | MODY: 5553, Secondary Diabetes: 5479, Cystic Fibrosis-Related Diabetes (CFRD): 5464, Type 1 Diabetes: 5446, Neonatal Diabetes Mellitus (NDM): 5408, Wolcott-Rallison Syndrome: 5400, (Other): 37250 |
| Genetic Markers | Negative: 34899, Positive: 35101 |
| Autoantibodies | Negative: 35058, Positive: 34942 |
| Family History | No: 34832, Yes: 35168 |
| Environmental Factors | Absent: 35088, Present: 34912 |
| Physical Activity | High: 23225, Low: 23348, Moderate: 23427 |
| Dietary Habits | Healthy: 35020, Unhealthy: 34980 |
| Ethnicity | High Risk: 34982, Low Risk: 35018 |
| Socioeconomic Factors | High: 23304, Low: 23283, Medium: 23413 |
| Smoking Status | Non-Smoker: 34955, Smoker: 35045 |
| Alcohol Consumption | High: 23246, Low: 23411, Moderate: 23343 |
| Glucose Tolerance Test | Abnormal: 35278, Normal: 34722 |
| History of PCOS | No: 35101, Yes: 34899 |
| Previous Gestational Diabetes | No: 34965, Yes: 35035 |
| Pregnancy History | Complications: 34730, Normal: 35270 |
| Cystic Fibrosis Diagnosis | No: 35135, Yes: 34865 |
| Steroid Use History | No: 35142, Yes: 34858 |
| Genetic Testing | Negative: 34685, Positive: 35315 |
| Liver Function Tests | Abnormal: 34981, Normal: 35019 |
| Urine Test | Glucose Present: 17422, Ketones Present: 17422, Normal: 17528, Protein Present: 17628 |
| Early Onset Symptoms | No: 35059, Yes: 34941 |
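The factor conversion used before summary() can also be written with dplyr. The following is only a sketch on a made-up four-row table: `toy` and `toy_categorical` are hypothetical names and the values are invented, not taken from the diabetes data.

```r
library(dplyr)

# Hypothetical toy data standing in for a few columns of the diabetes table
toy <- tibble(
  Target = c("LADA", "MODY", "LADA", "Prediabetic"),
  `Family History` = c("Yes", "No", "No", "Yes"),
  Age = c(41, 23, 38, 36)
)

toy_categorical <- c("Target", "Family History")

# Convert only the listed columns to factors; numeric columns are left untouched
toy <- toy %>%
  mutate(across(all_of(toy_categorical), as.factor))

# Factor columns are summarised as counts per level, numeric ones as quantiles
summary(toy)
```

This does the same job as the lapply() call, but it stays inside a pipe and fails with an error if a listed column name does not exist.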

1. Project question

“Can the patient's age be estimated from the given parameters?”

We look at the effect of a single independent variable X on a dependent variable Y and show it graphically with a scatter plot. For now we pick a few X variables from which we think something can be concluded about the patients' age, i.e. variables that we think have a logical connection with age.

Some of the variables we will plot, and the reasons for choosing them, are:

  • BMI (body mass index): body mass and body composition often change with age
  • Blood Pressure: blood pressure usually rises over the years as the blood vessels age
  • Physical Activity: it is natural for older people to be less physically active
  • Cholesterol Levels: cholesterol levels can rise with age
  • Blood Glucose Levels: as people age, insulin sensitivity often declines, so glucose levels could be higher in older patients
  • Waist Circumference:

Scatter plot: BMI vs Age

head(diabetes)
## # A tibble: 6 × 34
##   Target                       `Genetic Markers` Autoantibodies `Family History`
##   <fct>                        <fct>             <fct>          <fct>           
## 1 Steroid-Induced Diabetes     Positive          Negative       No              
## 2 Neonatal Diabetes Mellitus … Positive          Negative       No              
## 3 Prediabetic                  Positive          Positive       Yes             
## 4 Type 1 Diabetes              Negative          Positive       No              
## 5 Wolfram Syndrome             Negative          Negative       Yes             
## 6 LADA                         Positive          Negative       Yes             
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## #   Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## #   `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## #   `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>,
## #   `History of PCOS` <fct>, `Previous Gestational Diabetes` <fct>, …
nrow(diabetes)
## [1] 70000
# BMI
diab_red <- sample_n(diabetes, 5000) # reduced sample, because the full set has too many points to plot


ggplot(diab_red, aes(x = BMI, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  labs(title = "Scatter Plot: BMI vs Age",
       x = "BMI",
       y = "Age (Years)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

fit.BMI <- lm(Age ~ BMI, data = diab_red)

# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.BMI)

# the normality of the residuals does not appear to be badly violated
hist(reziduali)

qqnorm(reziduali, main = "Q-Q Plot of Residuals")
qqline(reziduali, col = "red", lwd = 2)
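
The subsample is drawn without a fixed seed, so every run of the report works on different rows, and the normality check is purely visual. The sketch below shows one possible way to make the draw repeatable and to add a numeric check. It runs on simulated data: `sim`, `sim_red`, `fit_sim`, the seed value and the coefficients used to generate the data are all assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, chosen here only for illustration

# Simulated stand-in for the full table (not the real diabetes data)
n <- 20000
sim <- data.frame(BMI = runif(n, 12, 39))
sim$Age <- 1.5 * sim$BMI + rnorm(n, sd = 10)

# Drawing the subsample after set.seed() gives the same rows on every run
idx <- sample(nrow(sim), 5000)
sim_red <- sim[idx, ]

fit_sim <- lm(Age ~ BMI, data = sim_red)

# Shapiro-Wilk test of the residuals; shapiro.test() accepts 3 to 5000 values
sw <- shapiro.test(residuals(fit_sim))
sw$p.value
```

With several thousand observations the test tends to flag even small departures from normality, so it is best read together with the histogram and the Q-Q plot rather than instead of them.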

Scatter plot: Blood Pressure vs Age

diab_red
## # A tibble: 5,000 × 34
##    Target                      `Genetic Markers` Autoantibodies `Family History`
##    <fct>                       <fct>             <fct>          <fct>           
##  1 Type 3c Diabetes (Pancreat… Negative          Negative       Yes             
##  2 Wolfram Syndrome            Negative          Negative       No              
##  3 Gestational Diabetes        Negative          Positive       Yes             
##  4 Prediabetic                 Positive          Positive       Yes             
##  5 Wolfram Syndrome            Positive          Negative       Yes             
##  6 Gestational Diabetes        Negative          Negative       Yes             
##  7 Type 3c Diabetes (Pancreat… Positive          Positive       No              
##  8 Prediabetic                 Positive          Positive       Yes             
##  9 Gestational Diabetes        Positive          Negative       No              
## 10 Type 2 Diabetes             Positive          Negative       Yes             
## # ℹ 4,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## #   Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## #   `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## #   `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>, …
ggplot(diab_red, aes(x = `Blood Pressure`, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  labs(title = "Scatter Plot: Blood Pressure vs Age",
       x = "Blood Pressure (mmHg)",
       y = "Age (Years)") +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  # "lm" stands for linear model
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

# we see a slight trend: blood pressure increases as age increases

# the regressor here is Blood Pressure (the original call reused BMI by mistake)
fit.bp <- lm(Age ~ `Blood Pressure`, data = diab_red)

# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.bp)

# the normality of the residuals does not appear to be badly violated
hist(reziduali)

qqnorm(reziduali, main = "Q-Q Plot of Residuals")
qqline(reziduali, col = "red", lwd = 2)

Box plot: Target vs Age

ggplot(diab_red, aes(x = `Target`, y = Age)) +
  geom_boxplot(color = "blue", alpha = 0.6) +
  labs(title = "Boxplot: Age by Type of Diabetes",
       x = "Type of Diabetes",
       y = "Age (Years)") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Box plot: Physical Activity vs Age

ggplot(diab_red, aes(x = `Physical Activity`, y = Age)) +
  geom_boxplot(color = "blue", alpha = 0.6) +
  labs(title = "Boxplot: Age by Physical Activity Levels",
       x = "Physical Activity (Category)",
       y = "Age (Years)") +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

This box plot tells us very little about the relationship between physical activity and age.
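
A box plot that shows no visible difference can be backed up with group means and a one-way ANOVA. This is a sketch on simulated data in which age is generated independently of the activity level. The names `toy`, `activity`, `age` and `fit_aov` are hypothetical, and nothing here is computed from the diabetes table.

```r
set.seed(1)  # arbitrary seed for a repeatable illustration

# Simulated data: the activity level is assigned independently of age
toy <- data.frame(
  activity = factor(sample(c("High", "Low", "Moderate"), 600, replace = TRUE)),
  age = round(runif(600, 0, 79))
)

# Mean age within each activity level
group_means <- tapply(toy$age, toy$activity, mean)
group_means

# One-way ANOVA: H0 says that all group means are equal
fit_aov <- aov(age ~ activity, data = toy)
p_value <- summary(fit_aov)[[1]][["Pr(>F)"]][1]
p_value
```

A large p-value here would agree with the impression from the box plot that the categories do not separate the ages.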

Scatter plot: Cholesterol Levels vs Age

ggplot(diab_red, aes(x = `Cholesterol Levels`, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  labs(title = "Scatter Plot: Cholesterol Levels vs Age",
       x = "Cholesterol Levels (mg/dL)",
       y = "Age (Years)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

fit.ch <-lm(Age ~ `Cholesterol Levels`, data = diab_red)

# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.ch)

# the normality of the residuals does not appear to be badly violated
hist(reziduali)

qqnorm(reziduali, main = "Q-Q Plot of Residuals")
qqline(reziduali, col = "red", lwd = 2)

Scatter plot: Blood Glucose Levels vs Age

ggplot(diab_red, aes(x = `Blood Glucose Levels`, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  labs(title = "Scatter Plot: Blood Glucose Level vs Age",
       x = "Blood Glucose Level (mg/dL)",
       y = "Age (Years)") +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

We see no linear relationship at all, so this is a very poor regressor.
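
The absence of a linear relationship can also be stated numerically with the Pearson correlation and its confidence interval. This sketch uses two independently simulated vectors: `x` and `y` are hypothetical stand-ins whose ranges only loosely mimic glucose and age, and they are not the real columns.

```r
set.seed(42)  # arbitrary seed for a repeatable illustration

x <- runif(1000, 80, 299)  # stand-in for a glucose-like predictor
y <- runif(1000, 0, 79)    # stand-in for age, generated independently of x

# Pearson correlation coefficient
r <- cor(x, y)

# Test of H0: correlation = 0, with a 95% confidence interval
ct <- cor.test(x, y)
r
ct$conf.int
```

A coefficient near zero with an interval that covers zero is what a flat regression line in the scatter plot corresponds to.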

Scatter plot: Waist Circumference vs Age

ggplot(diab_red, aes(x = `Waist Circumference`, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  labs(title = "Scatter Plot: Waist Circumference vs Age",
       x = "Waist Circumference (cm)",
       y = "Age (Years)") +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

fit.ws <-lm(Age ~ `Waist Circumference`, data = diab_red)

# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.ws)

hist(reziduali)

qqnorm(reziduali, main = "Q-Q Plot of Residuals")
qqline(reziduali, col = "red", lwd = 2)

We see that the normality of the residuals is not badly violated.

From here on we use Target, BMI, Blood Pressure, Cholesterol Levels and Waist Circumference as regressors, because their plots in the initial analysis showed the clearest linear relationship with age.

Next we check whether any of the regressors are correlated with each other, i.e. how strong the relationship is between pairs of regressors.

Correlation coefficients

Regressors: Target, BMI, Blood Pressure, Cholesterol Levels, Waist Circumference

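# Target is a factor, so cbind() passes it to cor() as integer level codes;
# its correlations are not meaningful for a nominal variable
korelacija <- cor(cbind(diab_red$Target,
                        diab_red$BMI,
                        diab_red$`Blood Pressure`,
                        diab_red$`Cholesterol Levels`,
                        diab_red$`Waist Circumference`))

# Add names to the columns and rows
colnames(korelacija) <- rownames(korelacija) <- c("Target", 
                                                  "BMI", 
                                                  "Blood Pressure", 
                                                  "Cholesterol Levels", 
                                                  "Waist Circumference")

# Display the correlation matrix discussed below
round(korelacija, 3)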

From this table we draw the following conclusions:

BMI and Blood Pressure: a correlation of 0.642 indicates a moderately strong positive relationship.

BMI and Cholesterol Levels: a correlation of 0.595 is just below 0.6 and also indicates a positive relationship.

BMI and Waist Circumference: a correlation of 0.619 is likewise moderately strong.
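
Pairwise correlations between regressors can be complemented with variance inflation factors, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The report does not compute VIFs, so this is an additional check and not part of the original analysis. The sketch below computes them in base R on simulated data; `toy`, `bmi`, `bp`, `chol` and `vif_manual` are hypothetical names and the generating coefficients are invented.

```r
set.seed(7)  # arbitrary seed for a repeatable illustration

# Simulated predictors that are correlated with each other through bmi
n <- 2000
bmi <- rnorm(n, 25, 5)
bp <- 60 + 2 * bmi + rnorm(n, sd = 12)
chol <- 100 + 3.5 * bmi + rnorm(n, sd = 25)
toy <- data.frame(bmi, bp, chol)

# VIF for each predictor: regress it on the others and use 1 / (1 - R^2)
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others. Values well above 1 mean that its coefficient is estimated less precisely because of the overlap.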

Now we carry out the multiple linear regression.

Using a categorical variable with more than two categories as integer values in a regression is not recommended for nominal variables. We therefore do not use the Target variable. Its box plot is already shown above, and example 1.4 explains how the type of diabetes can indicate a patient's age very well.
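
For reference, when a nominal variable is passed to lm() as a factor, R expands it into 0/1 indicator columns instead of treating the level codes as numbers. This is a sketch on invented data: `toy`, `type`, the levels A, B and C and all values are hypothetical.

```r
# Invented data: three groups with clearly different ages
toy <- data.frame(
  age = c(5, 7, 3, 40, 45, 38, 60, 66, 58),
  type = factor(rep(c("A", "B", "C"), each = 3))
)

# model.matrix() shows the indicator columns that lm() builds for a factor
mm <- model.matrix(age ~ type, data = toy)
head(mm)

# The intercept is the mean of the reference level A;
# typeB and typeC are differences of the group means from A
fit_type <- lm(age ~ type, data = toy)
coef(fit_type)
```

Each level gets its own coefficient relative to the reference level, so no ordering or spacing between the categories is assumed.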

# Multiple regression with Age as the dependent variable
fit.multi <- lm(Age ~ BMI + `Blood Pressure` + `Cholesterol Levels` + `Waist Circumference`, data = diab_red)

# Show the regression results
summary(fit.multi)
## 
## Call:
## lm(formula = Age ~ BMI + `Blood Pressure` + `Cholesterol Levels` + 
##     `Waist Circumference`, data = diab_red)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.569  -7.657  -0.803   6.673  40.661 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -67.791263   0.958346  -70.74   <2e-16 ***
## BMI                     0.568551   0.037845   15.02   <2e-16 ***
## `Blood Pressure`        0.313808   0.013010   24.12   <2e-16 ***
## `Cholesterol Levels`    0.128689   0.005417   23.75   <2e-16 ***
## `Waist Circumference`   0.735261   0.037388   19.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.43 on 4995 degrees of freedom
## Multiple R-squared:  0.7081, Adjusted R-squared:  0.7079 
## F-statistic:  3030 on 4 and 4995 DF,  p-value: < 2.2e-16

We observe very large residuals (an error of 40 years in some estimates is too much for age). Still, the median residual is close to zero, which is desirable.

All coefficients are statistically significant (p-values < 0.001), which means that every predictor contributes to explaining age.

For each coefficient the null hypothesis is that beta equals zero, and these p-values tell us that we can reject that hypothesis.

The model explains 70.81% of the variance of the dependent variable Age (Multiple R-squared = 0.7081), which indicates that the model explains age well on the basis of the listed predictors.

The difference between R-squared and adjusted R-squared is very small (0.7081 vs. 0.7079), which suggests that the included variables genuinely contribute to explaining the variance (there is practically no penalty).
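For reference, the adjusted R-squared reported by `summary()` follows from the ordinary R-squared, the sample size n and the number of regressors p. With the values from the output above (n = 5000 and p = 4, which gives the 4995 residual degrees of freedom), the arithmetic can be checked by hand:

```latex
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
          = 1 - (1 - 0.7081)\,\frac{4999}{4995}
          \approx 0.7079
```

The correction factor 4999/4995 is almost exactly 1 because the sample is large relative to the number of regressors, which is why the two values nearly coincide.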

The F-statistic tests whether all slope coefficients in the model are equal to zero. F = 3030, with p-value < 2.2e-16, tells us that this assumption is practically impossible, so we reject H0.

residuals <- residuals(fit.multi)

hist(residuals, breaks = 30, col = "lightblue", main = "Histogram of the residuals of our model",
     xlab = "Residuals")

# the residuals look approximately normally distributed, which is important for the validity of the t-tests

qqnorm(residuals, main = "Q-Q Plot reziduala")
qqline(residuals, col = "red", lwd = 2)

We now predict the age of one anonymous person from their data, using a 95% confidence interval.

diab_red
## # A tibble: 5,000 × 34
##    Target                      `Genetic Markers` Autoantibodies `Family History`
##    <fct>                       <fct>             <fct>          <fct>           
##  1 Type 3c Diabetes (Pancreat… Negative          Negative       Yes             
##  2 Wolfram Syndrome            Negative          Negative       No              
##  3 Gestational Diabetes        Negative          Positive       Yes             
##  4 Prediabetic                 Positive          Positive       Yes             
##  5 Wolfram Syndrome            Positive          Negative       Yes             
##  6 Gestational Diabetes        Negative          Negative       Yes             
##  7 Type 3c Diabetes (Pancreat… Positive          Positive       No              
##  8 Prediabetic                 Positive          Positive       Yes             
##  9 Gestational Diabetes        Positive          Negative       No              
## 10 Type 2 Diabetes             Positive          Negative       Yes             
## # ℹ 4,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## #   Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## #   `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## #   `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>, …
names(diab_red)
##  [1] "Target"                        "Genetic Markers"              
##  [3] "Autoantibodies"                "Family History"               
##  [5] "Environmental Factors"         "Insulin Levels"               
##  [7] "Age"                           "BMI"                          
##  [9] "Physical Activity"             "Dietary Habits"               
## [11] "Blood Pressure"                "Cholesterol Levels"           
## [13] "Waist Circumference"           "Blood Glucose Levels"         
## [15] "Ethnicity"                     "Socioeconomic Factors"        
## [17] "Smoking Status"                "Alcohol Consumption"          
## [19] "Glucose Tolerance Test"        "History of PCOS"              
## [21] "Previous Gestational Diabetes" "Pregnancy History"            
## [23] "Weight Gain During Pregnancy"  "Pancreatic Health"            
## [25] "Pulmonary Function"            "Cystic Fibrosis Diagnosis"    
## [27] "Steroid Use History"           "Genetic Testing"              
## [29] "Neurological Assessments"      "Liver Function Tests"         
## [31] "Digestive Enzyme Levels"       "Urine Test"                   
## [33] "Birth Weight"                  "Early Onset Symptoms"
new_data <- data.frame(BMI = 25,`Waist Circumference` = 85, `Blood Pressure` = 80, `Cholesterol Levels` = 22)



new_data
##   BMI Waist.Circumference Blood.Pressure Cholesterol.Levels
## 1  25                  85             80                 22
colnames(new_data)[colnames(new_data) == "Blood.Pressure"] <- "Blood Pressure"
colnames(new_data)[colnames(new_data) == "Cholesterol.Levels"] <- "Cholesterol Levels"
colnames(new_data)[colnames(new_data) == "Waist.Circumference"] <- "Waist Circumference"


new_data
##   BMI Waist Circumference Blood Pressure Cholesterol Levels
## 1  25                  85             80                 22
pred <- predict(fit.multi, newdata = new_data, interval = "confidence", level = 0.95)

# Display the prediction result
pred
##        fit      lwr      upr
## 1 36.85548 32.03873 41.67223

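In this case the predicted age is 36.86 years. Lower bound of the 95% confidence interval: 32.04 years. Upper bound of the 95% confidence interval: 41.67 years. Because predict() was called with interval = "confidence", this interval describes the mean age of people with these predictor values; an interval for a single person would be a prediction interval, which is wider.

Conclusion:

Our 95% confidence interval is [32.04, 41.67].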
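The difference between a confidence interval for the mean response and a prediction interval for one individual can be illustrated as follows. This is a minimal sketch on simulated data, not on `diab_red`; the data frame `sim`, the model `fit` and the values in `new_obs` are illustrative assumptions. It also shows that `check.names = FALSE` keeps column names with spaces, so no renaming step is needed.

```r
# Simulated stand-in for the model data (hypothetical, not diab_red)
set.seed(42)
n <- 200
sim <- data.frame(BMI = rnorm(n, 25, 4), check.names = FALSE)
sim$`Blood Pressure` <- 80 + 1.5 * (sim$BMI - 25) + rnorm(n, sd = 6)
sim$Age <- 10 + 0.8 * sim$BMI + 0.2 * sim$`Blood Pressure` + rnorm(n, sd = 8)

fit <- lm(Age ~ BMI + `Blood Pressure`, data = sim)

# check.names = FALSE keeps the space in the column name,
# so the names already match the terms of the model
new_obs <- data.frame(BMI = 25, `Blood Pressure` = 80, check.names = FALSE)

conf_int <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(conf_int)  # interval for the mean age at these predictor values
print(pred_int)  # interval for the age of one new individual (wider)
```

Both calls return the same fitted value; only the bounds differ, because the prediction interval also accounts for the residual variance of a single observation.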

2. Project question

“Are there differences in the forms of the disease between smokers and non-smokers?”

We use the χ² test of independence to examine whether two categorical variables are associated. Here we want to determine whether there is a statistically significant association between the type of disease (Target) and smoking status (Smoking Status).

The χ² test is based on the null hypothesis (H₀) that the variables are independent, meaning that the distribution of one variable (e.g. type of disease) does not depend on the value of the other (e.g. smoking status).

Hypothesis H0: There is no significant association between the form of the disease (Target) and smoking status (Smoking Status); the variables are independent.

Hypothesis H1: There is a significant association between the form of the disease (Target) and smoking status (Smoking Status); the variables are dependent.

cat("Distinct disease types (Target):", paste(unique(as.character(diabetes$Target)), collapse = ", "), "\n")
## Distinct disease types (Target): Steroid-Induced Diabetes, Neonatal Diabetes Mellitus (NDM), Prediabetic, Type 1 Diabetes, Wolfram Syndrome, LADA, Type 2 Diabetes, Wolcott-Rallison Syndrome, Secondary Diabetes, Type 3c Diabetes (Pancreatogenic Diabetes), Gestational Diabetes, Cystic Fibrosis-Related Diabetes (CFRD), MODY
cat("\n")
cat("Smoking status categories (Smoking Status):", paste(unique(as.character(diabetes$`Smoking Status`)), collapse = ", "), "\n")
## Smoking status categories (Smoking Status): Smoker, Non-Smoker
cat("\n")
target_smoking_table <- table(diabetes$Target, diabetes$`Smoking Status`)
kable(target_smoking_table, caption = "Observed frequencies of disease types by smoking status")
Observed frequencies of disease types by smoking status

|                                            | Non-Smoker | Smoker |
|:-------------------------------------------|-----------:|-------:|
| Cystic Fibrosis-Related Diabetes (CFRD)    |       2765 |   2699 |
| Gestational Diabetes                       |       2650 |   2694 |
| LADA                                       |       2635 |   2588 |
| MODY                                       |       2799 |   2754 |
| Neonatal Diabetes Mellitus (NDM)           |       2626 |   2782 |
| Prediabetic                                |       2709 |   2667 |
| Secondary Diabetes                         |       2714 |   2765 |
| Steroid-Induced Diabetes                   |       2607 |   2668 |
| Type 1 Diabetes                            |       2725 |   2721 |
| Type 2 Diabetes                            |       2691 |   2706 |
| Type 3c Diabetes (Pancreatogenic Diabetes) |       2616 |   2704 |
| Wolcott-Rallison Syndrome                  |       2762 |   2638 |
| Wolfram Syndrome                           |       2656 |   2659 |
# Visualization: bar plot
ggplot(diabetes, aes(x = `Smoking Status`, fill = Target)) +
  geom_bar(position = "dodge") +
  labs(title = "Differences in disease types between smokers and non-smokers", 
       x = "Smoking status", 
       y = "Number of patients", 
       fill = "Disease type") +
  theme_minimal()

# Visualization: mosaic plot
mosaicplot(target_smoking_table, main = "Mosaic plot: association between smoking status and disease type", color = TRUE, las = 2)

smoking_vs_target <- chisq.test(table(diabetes$Target, diabetes$`Smoking Status`))

cat("χ² test results:\n")
## χ² test results:
print(smoking_vs_target)
## 
##  Pearson's Chi-squared test
## 
## data:  table(diabetes$Target, diabetes$`Smoking Status`)
## X-squared = 12.189, df = 12, p-value = 0.4306

The plots show no large differences: the frequencies are roughly equal across disease types and smoking statuses.

contingency_table <- table(diabetes$Target, diabetes$`Smoking Status`)

# Chi-squared test
chi_squared_test <- chisq.test(contingency_table)
expected_frequencies <- chi_squared_test$expected

# Display the expected frequencies
#kable(expected_frequencies, caption = "Expected frequencies of disease types by smoking status")

# Build a frequency table that also lists the expected frequencies
smoking_levels <- levels(diabetes$`Smoking Status`)
target_levels <- levels(diabetes$Target)

# Create a data frame with the combined values
combined_table <- data.frame(
  Smoking_Status = rep(smoking_levels, each = length(target_levels)),
  Target = rep(target_levels, times = length(smoking_levels)),
  Stvarne_Frekvencije = as.vector(contingency_table),
  Očekivane_Frekvencije = as.vector(expected_frequencies)
)

# Display the table with kable
kable(combined_table, caption = "Observed and expected frequencies for each combination of the variables")
Observed and expected frequencies for each combination of the variables

| Smoking_Status | Target                                     | Stvarne_Frekvencije | Očekivane_Frekvencije |
|:---------------|:-------------------------------------------|--------------------:|----------------------:|
| Non-Smoker     | Cystic Fibrosis-Related Diabetes (CFRD)    |                2765 |              2728.487 |
| Non-Smoker     | Gestational Diabetes                       |                2650 |              2668.565 |
| Non-Smoker     | LADA                                       |                2635 |              2608.142 |
| Non-Smoker     | MODY                                       |                2799 |              2772.930 |
| Non-Smoker     | Neonatal Diabetes Mellitus (NDM)           |                2626 |              2700.523 |
| Non-Smoker     | Prediabetic                                |                2709 |              2684.544 |
| Non-Smoker     | Secondary Diabetes                         |                2714 |              2735.978 |
| Non-Smoker     | Steroid-Induced Diabetes                   |                2607 |              2634.109 |
| Non-Smoker     | Type 1 Diabetes                            |                2725 |              2719.499 |
| Non-Smoker     | Type 2 Diabetes                            |                2691 |              2695.030 |
| Non-Smoker     | Type 3c Diabetes (Pancreatogenic Diabetes) |                2616 |              2656.580 |
| Non-Smoker     | Wolcott-Rallison Syndrome                  |                2762 |              2696.529 |
| Non-Smoker     | Wolfram Syndrome                           |                2656 |              2654.083 |
| Smoker         | Cystic Fibrosis-Related Diabetes (CFRD)    |                2699 |              2735.513 |
| Smoker         | Gestational Diabetes                       |                2694 |              2675.435 |
| Smoker         | LADA                                       |                2588 |              2614.858 |
| Smoker         | MODY                                       |                2754 |              2780.070 |
| Smoker         | Neonatal Diabetes Mellitus (NDM)           |                2782 |              2707.477 |
| Smoker         | Prediabetic                                |                2667 |              2691.456 |
| Smoker         | Secondary Diabetes                         |                2765 |              2743.022 |
| Smoker         | Steroid-Induced Diabetes                   |                2668 |              2640.891 |
| Smoker         | Type 1 Diabetes                            |                2721 |              2726.501 |
| Smoker         | Type 2 Diabetes                            |                2706 |              2701.970 |
| Smoker         | Type 3c Diabetes (Pancreatogenic Diabetes) |                2704 |              2663.420 |
| Smoker         | Wolcott-Rallison Syndrome                  |                2638 |              2703.471 |
| Smoker         | Wolfram Syndrome                           |                2659 |              2660.917 |
df <- (nrow(contingency_table) - 1) * (ncol(contingency_table) - 1)

# Degrees of freedom: df = (number of rows - 1) * (number of columns - 1)
# Critical value of the χ² test
alpha <- 0.05
critical_value <- qchisq(1 - alpha, df)

cat("Degrees of freedom:", df, "\n")
## Degrees of freedom: 12
cat("Critical value of the χ² test:", critical_value, "\n")
## Critical value of the χ² test: 21.02607
cat("χ² test statistic:", chi_squared_test$statistic, "\n")
## χ² test statistic: 12.18902
cat("P-value:", chi_squared_test$p.value, "\n")
## P-value: 0.4306211

Conclusion:

The p-value (0.4306) is greater than 0.05, and the test statistic (12.19) is below the critical value (21.03), so we cannot reject the null hypothesis (H0). There is not enough evidence to conclude that Target (type of disease) and Smoking Status are associated at the 5% significance level.
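To make the test statistic less of a black box, it can be recomputed by hand from the observed counts. The sketch below copies the counts from the frequency table above into a matrix (the object names are illustrative assumptions) and also computes Cramér's V, a commonly used effect size for contingency tables that is not part of the original analysis.

```r
# Observed counts copied from the frequency table above
# (rows: disease types in table order; columns: Non-Smoker, Smoker)
observed <- matrix(
  c(2765, 2699,
    2650, 2694,
    2635, 2588,
    2799, 2754,
    2626, 2782,
    2709, 2667,
    2714, 2765,
    2607, 2668,
    2725, 2721,
    2691, 2706,
    2616, 2704,
    2762, 2638,
    2656, 2659),
  ncol = 2, byrow = TRUE,
  dimnames = list(NULL, c("Non-Smoker", "Smoker"))
)

# Expected count under independence: row total * column total / grand total
n <- sum(observed)
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson statistic and its p-value
chi_sq <- sum((observed - expected)^2 / expected)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cramer's V as an effect size (0 = no association, 1 = perfect association)
cramers_v <- sqrt(chi_sq / (n * (min(dim(observed)) - 1)))

cat("chi-squared:", chi_sq, " df:", dof, " p-value:", p_value, "\n")
cat("Cramer's V:", cramers_v, "\n")
```

With a sample this large, an effect size is a useful complement to the p-value, because it describes the strength of the association rather than only its statistical significance.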

3. Project question

In this part we examine whether the level of pancreatic health differs between the forms of diabetes.

Target is a categorical variable, while pancreatic health is a numeric variable.

We consider the ANOVA test because we are comparing the mean of a numeric variable across several groups defined by one categorical variable. ANOVA lets us determine whether there are statistically significant differences between the group means.

Hypothesis H0: There is no significant difference in pancreatic health between the forms of diabetes.

Hypothesis H1: There is a significant difference in pancreatic health between the forms of diabetes.

# copy the data so that we do not modify the original dataset
diabetes_copy2 = data.frame(diabetes)

names(diabetes_copy2)
##  [1] "Target"                        "Genetic.Markers"              
##  [3] "Autoantibodies"                "Family.History"               
##  [5] "Environmental.Factors"         "Insulin.Levels"               
##  [7] "Age"                           "BMI"                          
##  [9] "Physical.Activity"             "Dietary.Habits"               
## [11] "Blood.Pressure"                "Cholesterol.Levels"           
## [13] "Waist.Circumference"           "Blood.Glucose.Levels"         
## [15] "Ethnicity"                     "Socioeconomic.Factors"        
## [17] "Smoking.Status"                "Alcohol.Consumption"          
## [19] "Glucose.Tolerance.Test"        "History.of.PCOS"              
## [21] "Previous.Gestational.Diabetes" "Pregnancy.History"            
## [23] "Weight.Gain.During.Pregnancy"  "Pancreatic.Health"            
## [25] "Pulmonary.Function"            "Cystic.Fibrosis.Diagnosis"    
## [27] "Steroid.Use.History"           "Genetic.Testing"              
## [29] "Neurological.Assessments"      "Liver.Function.Tests"         
## [31] "Digestive.Enzyme.Levels"       "Urine.Test"                   
## [33] "Birth.Weight"                  "Early.Onset.Symptoms"
levels(factor(diabetes_copy2$`Pancreatic.Health`))
##  [1] "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
## [16] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
## [31] "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54"
## [46] "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69"
## [61] "70" "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [76] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
levels(factor(diabetes_copy2$Target))
##  [1] "Cystic Fibrosis-Related Diabetes (CFRD)"   
##  [2] "Gestational Diabetes"                      
##  [3] "LADA"                                      
##  [4] "MODY"                                      
##  [5] "Neonatal Diabetes Mellitus (NDM)"          
##  [6] "Prediabetic"                               
##  [7] "Secondary Diabetes"                        
##  [8] "Steroid-Induced Diabetes"                  
##  [9] "Type 1 Diabetes"                           
## [10] "Type 2 Diabetes"                           
## [11] "Type 3c Diabetes (Pancreatogenic Diabetes)"
## [12] "Wolcott-Rallison Syndrome"                 
## [13] "Wolfram Syndrome"
#diabetes$`Target`
class(diabetes$`Target`)
## [1] "factor"
#diabetes$`Pancreatic Health`
class(diabetes$`Pancreatic Health`)
## [1] "numeric"

ANOVA assumes homogeneity of variances and normality, so we must check our data before running it. We check normality for each diabetes group using the Lilliefors variant of the Kolmogorov-Smirnov test.

require(nortest)
## Loading required package: nortest
lillie.test(diabetes_copy2$Pancreatic.Health)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  diabetes_copy2$Pancreatic.Health
## D = 0.068414, p-value < 2.2e-16
# List of diabetes types
diabetes_types <- unique(diabetes_copy2$Target)

# Run the Lilliefors test and draw a histogram for each diabetes type
for (diabetes_type in diabetes_types) {
  # Select the data for the given diabetes type
  data_subset <- diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target == diabetes_type]
  
  # Lilliefors test
  lillie_result <- lillie.test(data_subset)
  print(paste("Lillie test result for", diabetes_type, ":", lillie_result$p.value))
  # Draw the histogram
  hist(data_subset, 
       main = paste("Pancreatic Health for", diabetes_type),  # title with the diabetes type
       xlab = "Pancreatic Health",  # x-axis label
       col = "lightblue",  # fill colour (optional)
       border = "black")  # border colour (optional)
}
## [1] "Lillie test result for Steroid-Induced Diabetes : 1.0461707480236e-73"

## [1] "Lillie test result for Neonatal Diabetes Mellitus (NDM) : 1.00768949161913e-86"

## [1] "Lillie test result for Prediabetic : 1.53545208640135e-73"

## [1] "Lillie test result for Type 1 Diabetes : 4.26370855086212e-72"

## [1] "Lillie test result for Wolfram Syndrome : 4.54340602830596e-81"

## [1] "Lillie test result for LADA : 9.58723988923272e-74"

## [1] "Lillie test result for Type 2 Diabetes : 7.5148662012579e-59"

## [1] "Lillie test result for Wolcott-Rallison Syndrome : 2.99834984600884e-88"

## [1] "Lillie test result for Secondary Diabetes : 1.60476108964642e-66"

## [1] "Lillie test result for Type 3c Diabetes (Pancreatogenic Diabetes) : 7.25012119118581e-73"

## [1] "Lillie test result for Gestational Diabetes : 6.22617556022835e-68"

## [1] "Lillie test result for Cystic Fibrosis-Related Diabetes (CFRD) : 1.02958198804219e-75"

## [1] "Lillie test result for MODY : 2.21551404977202e-74"

# test the homogeneity of variances across the groups
bartlett.test(diabetes_copy2$Pancreatic.Health ~ diabetes_copy2$Target)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  diabetes_copy2$Pancreatic.Health by diabetes_copy2$Target
## Bartlett's K-squared = 14096, df = 12, p-value < 2.2e-16
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Wolfram Syndrome'])
## [1] 75.46892
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Gestational Diabetes'])
## [1] 205.5873
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='LADA'])
## [1] 135.6495
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='MODY'])
## [1] 203.6217
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Neonatal Diabetes Mellitus (NDM)'])
## [1] 75.23981
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Prediabetic'])
## [1] 209.5374
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Secondary Diabetes'])
## [1] 529.6986
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Steroid-Induced Diabetes'])
## [1] 299.7283
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Type 1 Diabetes'])
## [1] 206.4028
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Type 2 Diabetes'])
## [1] 531.1273
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Type 3c Diabetes (Pancreatogenic Diabetes)'])
## [1] 299.9355
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Wolcott-Rallison Syndrome'])
## [1] 75.20566

We see that the data are not normally distributed. We also tested the homogeneity of variances across the groups.
The data satisfy neither the homogeneity nor the normality assumption (all p-values are below 0.05). We therefore conclude that the ANOVA test is not appropriate here.
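The group variances can also be computed in a single call instead of one `var()` call per group, which guarantees that every group appears exactly once. The following is a minimal sketch on simulated data; the data frame `sim` and its three groups are illustrative assumptions, not the diabetes data.

```r
# Simulated stand-in with three groups (hypothetical, not the diabetes data)
set.seed(3)
sim <- data.frame(
  Target = factor(rep(c("Type 1", "Type 2", "LADA"), each = 50)),
  Pancreatic.Health = c(rnorm(50, 40, 14), rnorm(50, 50, 23), rnorm(50, 45, 12))
)

# Variance of every group in one call, so no group is skipped or repeated
group_var <- tapply(sim$Pancreatic.Health, sim$Target, var)
print(round(group_var, 1))

# Ratio of the largest to the smallest group variance
var_ratio <- max(group_var) / min(group_var)
print(var_ratio)
```

In the output above, the largest reported variance (531.1 for Type 2 Diabetes) is about seven times the smallest (75.2 for Wolcott-Rallison Syndrome), which is consistent with the result of the Bartlett test.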

We therefore use a nonparametric test, the Kruskal-Wallis test, which is the nonparametric counterpart of one-way ANOVA.

# copy the data so that we do not modify the original dataset
diabetes_copy3 = data.frame(diabetes)

Histogram of pancreatic health

# 1. Arithmetic mean
mean_pancreas <- mean(diabetes_copy3$`Pancreatic.Health`)
mean_pancreas
## [1] 47.56424
# 2. Trimmed mean (e.g. 20%)
mean_trimmed_pancreas <- mean(diabetes_copy3$`Pancreatic.Health`, trim = 0.2)
mean_trimmed_pancreas
## [1] 46.94283
# 3. Median
median_pancreas <- median(diabetes_copy3$`Pancreatic.Health`)
median_pancreas
## [1] 46
# 4. Quartiles (Q1 and Q3)
quantiles_pancreas <- quantile(diabetes_copy3$`Pancreatic.Health`, probs = c(0.25, 0.75))
quantiles_pancreas
## 25% 75% 
##  32  64
# 5. Mode (most frequent value), using the modeest package
require(modeest)
## Loading required package: modeest
## Warning: package 'modeest' was built under R version 4.4.2
mode_pancreas <- mfv(diabetes_copy3$`Pancreatic.Health`)
mode_pancreas
## [1] 37
# 6. Histogram
hist(diabetes_copy3$`Pancreatic.Health`,
     main = "Histogram: Pancreatic Health",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue")

# Add an orange vertical line of width 4 at the mean
abline(v = mean(diabetes_copy3$`Pancreatic.Health`),
       col = "orange", lwd = 4)

# 7. Summary (min, Q1, median, mean, Q3, max)
summary(diabetes_copy3$`Pancreatic.Health`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.00   32.00   46.00   47.56   64.00   99.00

We now draw a box plot of all the data.

# Boxplot: Pancreatic.Health ~ Target
ggplot(data = diabetes_copy3, aes(x = Target, y = Pancreatic.Health)) +
  geom_boxplot(
    fill = "lightblue",        
    outlier.colour = "red"     
  ) +
 
  labs(
    title = "Boxplot of Pancreatic.Health by Diabetes Type",
    x = "Type of Diabetes",
    y = "Pancreatic.Health"
  ) +
  coord_flip() +
  theme_minimal()

The plot suggests that the type of diabetes is related to pancreatic health, but we cannot confirm this without running the Kruskal-Wallis test.

Running the Kruskal-Wallis test:

kruskal.test(Pancreatic.Health ~ Target, data = diabetes_copy3)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  Pancreatic.Health by Target
## Kruskal-Wallis chi-squared = 30882, df = 12, p-value < 2.2e-16

Conclusion:

Based on the Kruskal-Wallis test we reject the null hypothesis (H0), because the p-value is below 0.05 (p-value < 2.2e-16). We reject the hypothesis that pancreatic health does not differ between the forms of diabetes and accept the alternative hypothesis (H1) that there is a significant difference. In other words, the pancreatic health of at least one type of diabetes differs significantly from that of the others.
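The Kruskal-Wallis test only says that at least one group differs, not which ones. One common follow-up, which is not part of the original analysis, is a set of pairwise Wilcoxon rank-sum tests with a multiplicity correction. The following is a minimal sketch on simulated data with three hypothetical groups; on the real data with 13 diabetes types it would involve 78 pairwise comparisons.

```r
# Simulated stand-in with three groups (hypothetical, not the diabetes data)
set.seed(7)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 60)),
  score = c(rnorm(60, mean = 40, sd = 8),
            rnorm(60, mean = 40, sd = 15),
            rnorm(60, mean = 60, sd = 8))
)

# Omnibus test: is at least one group different?
kw <- kruskal.test(score ~ group, data = sim)

# Post-hoc: pairwise Wilcoxon rank-sum tests, Holm-adjusted for multiplicity
posthoc <- pairwise.wilcox.test(sim$score, sim$group, p.adjust.method = "holm")

print(kw)
print(posthoc$p.value)
```

The Holm adjustment is chosen here because it controls the family-wise error rate while being less conservative than the Bonferroni correction; other methods accepted by `p.adjust` could be used instead.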

4. Project question

“Does the level of physical activity affect the patient's blood insulin level?”

Hypothesis H0: There is no significant difference in blood insulin level between groups with different levels of physical activity.

Hypothesis H1: There is a difference in blood insulin level between groups with different levels of physical activity.

#diabetes$`Physical Activity`
class(diabetes$`Physical Activity`)
## [1] "factor"
#diabetes$`Insulin Levels`
class(diabetes$`Insulin Levels`)
## [1] "numeric"
diabetes[c(1,6,9)]
## # A tibble: 70,000 × 3
##    Target                           `Insulin Levels` `Physical Activity`
##    <fct>                                       <dbl> <fct>              
##  1 Steroid-Induced Diabetes                       40 High               
##  2 Neonatal Diabetes Mellitus (NDM)               13 High               
##  3 Prediabetic                                    27 High               
##  4 Type 1 Diabetes                                 8 Low                
##  5 Wolfram Syndrome                               17 High               
##  6 LADA                                           17 Moderate           
##  7 Type 2 Diabetes                                29 Moderate           
##  8 Wolcott-Rallison Syndrome                      10 Low                
##  9 Secondary Diabetes                             47 High               
## 10 Secondary Diabetes                             21 Low                
## # ℹ 69,990 more rows

Description and overview of the selected attributes

  • Physical activity and insulin level

Physical activity

  • is a categorical variable

  • takes the values “Low”, “Moderate” or “High”

# Mode
require(modeest)
# The function below returns the most frequent physical activity value
mfv(diabetes$`Physical Activity`)
## [1] Moderate
## Levels: High Low Moderate
# Check whether any values fall outside the expected categories
table(diabetes$`Physical Activity`)
## 
##     High      Low Moderate 
##    23225    23348    23427
summary(diabetes$`Physical Activity`)
##     High      Low Moderate 
##    23225    23348    23427

Insulin level

  • is a numeric variable
v = mean(diabetes$`Insulin Levels`)

# Trimmed arithmetic mean (20%)
mean(diabetes$`Insulin Levels`, trim=0.2)
## [1] 19.99071
median(diabetes$`Insulin Levels`)
## [1] 19
quantile(diabetes$`Insulin Levels`, probs = c(0.25,0.75))
## 25% 75% 
##  13  28
# Mode
require(modeest)
mfv(diabetes$`Insulin Levels`)
## [1] 13
# Draw a histogram showing the distribution of insulin levels by value and frequency
h = hist(diabetes$`Insulin Levels`,
         #prob=TRUE,
         main="Insulin Levels Table",
         xlab="Insulin Level",
         ylab='Frequency',
         col="lightblue"
         )

abline(v = mean(diabetes$`Insulin Levels`), col = "orange", lwd = 4)

summary(diabetes$`Insulin Levels`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   13.00   19.00   21.61   28.00   49.00

Based on the histogram we can see that most people have lower insulin levels (below about 20) and that fewer people have an insulin level above 35.

Comparison of grouped data:

# Group the data first and then build the histograms:

h1 = hist(diabetes[diabetes$`Physical Activity` == c("Low"),]$`Insulin Levels`,
         plot=FALSE)
h2 = hist(diabetes[diabetes$`Physical Activity` == c("Moderate"),]$`Insulin Levels`,
         plot=FALSE)
h3 = hist(diabetes[diabetes$`Physical Activity` == c("High"),]$`Insulin Levels`,
         plot=FALSE)

data <- t(cbind(h1$counts,h2$counts,h3$counts))
data
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]  785  834 1465 2029 2111 1942 2036 1662 1217  1152  1289  1210   926   700
## [2,]  893  865 1378 2029 2016 1912 1966 1578 1290  1228  1216  1262  1004   679
## [3,]  814  787 1455 1972 2021 2001 2059 1653 1222  1236  1212  1209   954   704
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,]   647   550   499   489   437   396   380   381   211
## [2,]   697   521   533   487   413   440   409   397   214
## [3,]   702   506   505   443   397   411   406   382   174
barplot(data, beside = TRUE, col = c("orange", "green", "lightblue"), xlab = "Insulin Level", ylab = "Frequency")
legend("topright", c("low", "moderate", "high"), fill = c("orange", "green", "lightblue"))

As the histogram shows, the frequencies of insulin levels do not differ much between the physical activity groups.

#boxplot(`Insulin Levels` ~ `Physical Activity`, data=diabetes)
boxplot(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)

The boxplot gives another graphical view of the distribution of insulin levels by physical activity.

The plots so far suggest that there is no dependence between physical activity and blood insulin level.

Running ANOVA

The assumptions of ANOVA are:

  • independence of the observations in the samples

  • normally distributed data

  • homogeneity of variances across populations

require(nortest)

We run the Lilliefors test to check whether the data we are studying are normally distributed.

# check the distribution of the whole `Insulin Levels` attribute
lillie.test(diabetes$`Insulin Levels`)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  diabetes$`Insulin Levels`
## D = 0.11445, p-value < 2.2e-16
# check the distribution of `Insulin Levels` for `Physical Activity` == "Low"
lillie.test(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"])
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"]
## D = 0.11711, p-value < 2.2e-16
# check the distribution of `Insulin Levels` for `Physical Activity` == "Moderate"
lillie.test(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"])
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"]
## D = 0.11076, p-value < 2.2e-16
# check the distribution of `Insulin Levels` for `Physical Activity` == "High"
lillie.test(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"])
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"]
## D = 0.11548, p-value < 2.2e-16
# Plot the data
hist(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"])

hist(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"])

hist(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"])

# Test the homogeneity of variances across the groups with Bartlett's test
bartlett.test(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  diabetes$`Insulin Levels` by diabetes$`Physical Activity`
## Bartlett's K-squared = 6.4479, df = 2, p-value = 0.0398
var(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"])
## [1] 115.882
var(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"])
## [1] 118.4482
var(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"])
## [1] 114.6527
a = aov(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
summary(a)
##                                 Df  Sum Sq Mean Sq F value Pr(>F)
## diabetes$`Physical Activity`     2     346   173.1   1.488  0.226
## Residuals                    69997 8142960   116.3

The data do not meet the requirements for ANOVA: they satisfy neither the normality nor the homogeneity assumption (the p-values of both tests were below 0.05), so we do not rely on the ANOVA result above.

We therefore run the Kruskal-Wallis test, which does not require normally distributed data.

# Kruskal-Wallis
kruskal.test(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  diabetes$`Insulin Levels` by diabetes$`Physical Activity`
## Kruskal-Wallis chi-squared = 2.0217, df = 2, p-value = 0.3639

Conclusion:

The p-value from the Kruskal-Wallis test is 0.3639, which is greater than 0.05, so we cannot reject the null hypothesis (H0): there is no statistically significant difference in blood insulin level between the physical activity groups. The plots already suggested this, and the test now confirms it numerically.
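The reported test result can be double-checked, and complemented with an effect size, using only the numbers printed by `kruskal.test()`. The sketch below recomputes the p-value from the chi-squared approximation and computes epsilon-squared, one commonly used effect size for the Kruskal-Wallis test; the effect size is an addition for illustration and is not part of the original analysis.

```r
# Values reported by kruskal.test() above
H <- 2.0217   # Kruskal-Wallis chi-squared
n <- 70000    # total number of observations
k <- 3        # number of physical activity groups

# Epsilon-squared: H / (n - 1); ranges from 0 to 1
epsilon_sq <- H / (n - 1)

# p-value recomputed from the chi-squared approximation with k - 1 df
p_value <- pchisq(H, df = k - 1, lower.tail = FALSE)

cat("epsilon-squared:", epsilon_sq, "\n")
cat("p-value:", p_value, "\n")
```

An epsilon-squared this close to zero means that group membership accounts for practically none of the variation in the ranks, which agrees with the non-significant test.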

Research question

Is there a difference in the age at diagnosis between type 1 and type 2 diabetes?

sum(is.na(diabetes$Target)) # no NA values
## [1] 0
sum(is.na(diabetes$Autoantibodies)) # no NA values
## [1] 0
diabetes
## # A tibble: 70,000 × 34
##    Target                      `Genetic Markers` Autoantibodies `Family History`
##    <fct>                       <fct>             <fct>          <fct>           
##  1 Steroid-Induced Diabetes    Positive          Negative       No              
##  2 Neonatal Diabetes Mellitus… Positive          Negative       No              
##  3 Prediabetic                 Positive          Positive       Yes             
##  4 Type 1 Diabetes             Negative          Positive       No              
##  5 Wolfram Syndrome            Negative          Negative       Yes             
##  6 LADA                        Positive          Negative       Yes             
##  7 Type 2 Diabetes             Negative          Negative       No              
##  8 Wolcott-Rallison Syndrome   Positive          Negative       No              
##  9 Secondary Diabetes          Negative          Positive       No              
## 10 Secondary Diabetes          Positive          Negative       Yes             
## # ℹ 69,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## #   Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## #   `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## #   `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>, …
diabetes$Age <- as.integer(diabetes$Age) # convert Age to integer (open question: is this conversion needed?)
table(diabetes$Target)
## 
##    Cystic Fibrosis-Related Diabetes (CFRD) 
##                                       5464 
##                       Gestational Diabetes 
##                                       5344 
##                                       LADA 
##                                       5223 
##                                       MODY 
##                                       5553 
##           Neonatal Diabetes Mellitus (NDM) 
##                                       5408 
##                                Prediabetic 
##                                       5376 
##                         Secondary Diabetes 
##                                       5479 
##                   Steroid-Induced Diabetes 
##                                       5275 
##                            Type 1 Diabetes 
##                                       5446 
##                            Type 2 Diabetes 
##                                       5397 
## Type 3c Diabetes (Pancreatogenic Diabetes) 
##                                       5320 
##                  Wolcott-Rallison Syndrome 
##                                       5400 
##                           Wolfram Syndrome 
##                                       5315
# diabetes


# keep only the rows recorded as type 1 or type 2 diabetes
type_1_and_2 <- diabetes %>% filter(Target %in% c("Type 1 Diabetes", "Type 2 Diabetes"))

# sanity check: inspect the filtered rows by drawing 10 of them at random
random_rows <- type_1_and_2[sample(1:nrow(type_1_and_2), size = 10, replace = FALSE), ]
print(random_rows)
## # A tibble: 10 × 34
##    Target          `Genetic Markers` Autoantibodies `Family History`
##    <fct>           <fct>             <fct>          <fct>           
##  1 Type 1 Diabetes Positive          Positive       No              
##  2 Type 2 Diabetes Positive          Positive       Yes             
##  3 Type 1 Diabetes Positive          Negative       No              
##  4 Type 2 Diabetes Positive          Negative       Yes             
##  5 Type 1 Diabetes Negative          Negative       Yes             
##  6 Type 1 Diabetes Negative          Positive       Yes             
##  7 Type 1 Diabetes Positive          Positive       No              
##  8 Type 2 Diabetes Positive          Positive       Yes             
##  9 Type 2 Diabetes Negative          Positive       No              
## 10 Type 2 Diabetes Negative          Negative       No              
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## #   Age <int>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## #   `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## #   `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## #   `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## #   `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>,
## #   `History of PCOS` <fct>, `Previous Gestational Diabetes` <fct>, …
# split the filtered data into type 1 and type 2 subsets for the age histograms
type_1 = type_1_and_2[type_1_and_2$Target=="Type 1 Diabetes",]
type_2 = type_1_and_2[type_1_and_2$Target=="Type 2 Diabetes",]

The histograms show that age is not normally distributed.

# histogram bin boundaries: one-year-wide bins covering the observed age range
cat('The mean age at diagnosis of type 1 diabetes is', mean(type_1$Age), '\n')
## The mean age at diagnosis of type 1 diabetes is 17.065
hist(type_1$Age, 
     breaks = seq(min(type_1$Age) - 1, max(type_1$Age) + 1, 1),  
     main = 'Type 1 diabetes: age histogram',
     xlab = 'Age (years)',
     col = 'skyblue',
     border = 'black')

summary(type_1$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   11.00   17.00   17.07   23.00   29.00
cat('The mean age at diagnosis of type 2 diabetes is', mean(type_2$Age), '\n')
## The mean age at diagnosis of type 2 diabetes is 54.60978
hist(type_2$Age, 
     breaks = seq(min(type_2$Age) - 1, max(type_2$Age) + 1, 1),  
     main = 'Type 2 diabetes: age histogram',
     xlab = 'Age (years)',
     col = 'darkblue',
     border = 'black')

We observe that the age data are not normally distributed; they look approximately uniform.
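The visual impression of uniformity could also be checked numerically. The sketch below is one possible approach, not part of the original analysis: a chi-squared goodness-of-fit test against equal probabilities for every whole-year age. The ages are simulated, and the name `age` and the sample size are hypothetical. Only the range 5 to 29 is taken from `summary(type_1$Age)`.

```r
set.seed(42)

# Hypothetical ages: integers drawn uniformly from 5..29,
# the range reported by summary(type_1$Age)
age <- sample(5:29, size = 5000, replace = TRUE)

# Count observations per age; factor() keeps ages that happen to have zero count
counts <- table(factor(age, levels = 5:29))

# Goodness-of-fit test: H0 is that all 25 ages are equally likely
gof <- chisq.test(counts)

gof$parameter   # degrees of freedom: 25 categories - 1 = 24
gof$p.value     # a large p-value means no evidence against uniformity
```

On the real data the same call would be `chisq.test(table(type_1$Age))`, with the usual caveat that on several thousand rows even small departures from uniformity can produce a small p-value.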

We draw a boxplot so that the difference is easier to see.

boxplot(type_1$Age, type_2$Age, 
        names = c('Type 1 age','Type 2 age'),
        main='Boxplot of age at diabetes diagnosis',
        ylab="Age (years)"
        )

Conclusion:

The plots clearly show that type 1 diabetes is diagnosed much earlier than type 2 diabetes (mean age of about 17.1 years versus 54.6 years).
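The conclusion above rests on the plots alone. Since the ages are not normally distributed, one way to back it with a formal test is the Wilcoxon rank-sum (Mann-Whitney) test. The following is a sketch under stated assumptions: `age_t1` and `age_t2` are simulated stand-ins for `type_1$Age` and `type_2$Age`, the range for type 1 comes from `summary(type_1$Age)`, and the range for type 2 is assumed for illustration only.

```r
set.seed(7)

# Hypothetical stand-ins for type_1$Age and type_2$Age
age_t1 <- sample(5:29,  size = 300, replace = TRUE)  # range from summary(type_1$Age)
age_t2 <- sample(40:70, size = 300, replace = TRUE)  # assumed range, not taken from the data

# H0: both groups come from the same distribution (no shift in location)
wt <- wilcox.test(age_t1, age_t2, alternative = "two.sided")

wt$statistic  # W counts the pairs in which a type 1 age exceeds a type 2 age
wt$p.value    # a value below 0.05 would lead us to reject H0
```

On the actual data the call would be `wilcox.test(type_1$Age, type_2$Age)`. Because the two simulated samples do not overlap at all, W is 0 here; real data with overlapping ages would give a larger W.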