Introduction
This project was created for the course Statistička obrada podataka (Statistical Data Processing) by the team “Boksači”.
Loading the required packages
library(dplyr)
library(readr)
library(knitr)
library(ggplot2)
library(kableExtra)
diabetes <- read_csv("diabetes_dataset00.csv")
Let us first look at the basic properties of the data.
head(diabetes)
## # A tibble: 6 × 34
## Target `Genetic Markers` Autoantibodies `Family History`
## <chr> <chr> <chr> <chr>
## 1 Steroid-Induced Diabetes Positive Negative No
## 2 Neonatal Diabetes Mellitus … Positive Negative No
## 3 Prediabetic Positive Positive Yes
## 4 Type 1 Diabetes Negative Positive No
## 5 Wolfram Syndrome Negative Negative Yes
## 6 LADA Positive Negative Yes
## # ℹ 30 more variables: `Environmental Factors` <chr>, `Insulin Levels` <dbl>,
## # Age <dbl>, BMI <dbl>, `Physical Activity` <chr>, `Dietary Habits` <chr>,
## # `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## # `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <chr>,
## # `Socioeconomic Factors` <chr>, `Smoking Status` <chr>,
## # `Alcohol Consumption` <chr>, `Glucose Tolerance Test` <chr>,
## # `History of PCOS` <chr>, `Previous Gestational Diabetes` <chr>, …
cat("Dataset dimensions:", dim(diabetes))
## Dataset dimensions: 70000 34
cat("\nNumber of rows:", nrow(diabetes))
##
## Number of rows: 70000
cat("\nNumber of columns:", ncol(diabetes))
##
## Number of columns: 34
diabetes
## # A tibble: 70,000 × 34
## Target `Genetic Markers` Autoantibodies `Family History`
## <chr> <chr> <chr> <chr>
## 1 Steroid-Induced Diabetes Positive Negative No
## 2 Neonatal Diabetes Mellitus… Positive Negative No
## 3 Prediabetic Positive Positive Yes
## 4 Type 1 Diabetes Negative Positive No
## 5 Wolfram Syndrome Negative Negative Yes
## 6 LADA Positive Negative Yes
## 7 Type 2 Diabetes Negative Negative No
## 8 Wolcott-Rallison Syndrome Positive Negative No
## 9 Secondary Diabetes Negative Positive No
## 10 Secondary Diabetes Positive Negative Yes
## # ℹ 69,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <chr>, `Insulin Levels` <dbl>,
## # Age <dbl>, BMI <dbl>, `Physical Activity` <chr>, `Dietary Habits` <chr>,
## # `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## # `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <chr>,
## # `Socioeconomic Factors` <chr>, `Smoking Status` <chr>,
## # `Alcohol Consumption` <chr>, `Glucose Tolerance Test` <chr>, …
| Column name | Description |
|---|---|
| Target | Type of diabetes the patient has |
| Genetic Markers | Specific DNA sequence that “can” be used to predict diabetes |
| Autoantibodies | Presence of antibodies that the body produces against its own cells |
| Family History | Whether the patient has a family history of diabetes |
| Environmental Factors | Environmental factors that can affect the development of diabetes, e.g. air pollution |
| Insulin Levels | Insulin level in the body, key for assessing pancreatic function and managing diabetes |
| Age | Age of the person |
| BMI | Body mass index (BMI) |
| Physical Activity | Level of physical activity |
| Dietary Habits | Dietary habits |
| Blood Pressure | Blood pressure |
| Cholesterol Levels | Cholesterol level in the blood |
| Waist Circumference | Waist circumference |
| Blood Glucose Levels | Glucose level in the blood |
| Ethnicity | Ethnicity of the person |
| Socioeconomic Factors | Socioeconomic factors |
| Smoking Status | Smoking status |
| Alcohol Consumption | Alcohol consumption |
| Glucose Tolerance Test | Glucose tolerance test, used to diagnose diabetes or prediabetes |
| History of PCOS | Whether the woman has had a diagnosis or symptoms of polycystic ovary syndrome (PCOS) |
| Previous Gestational Diabetes | History of gestational diabetes |
| Pregnancy History | Pregnancy history |
| Weight Gain During Pregnancy | Weight gain during pregnancy |
| Pancreatic Health | Pancreatic health |
| Pulmonary Function | Lung function |
| Cystic Fibrosis Diagnosis | Cystic fibrosis diagnosis |
| Steroid Use History | History of steroid use |
| Genetic Testing | Whether genetic predispositions to diabetes exist |
| Neurological Assessments | Neurological assessments |
| Liver Function Tests | Liver function test |
| Digestive Enzyme Levels | Level of digestive enzymes |
| Urine Test | Urine test |
| Birth Weight | Body weight at birth |
| Early Onset Symptoms | Symptoms that appear in the early stage of diabetes |
Descriptive analysis of the diabetes table: a brief overview
glimpse(diabetes)
## Rows: 70,000
## Columns: 34
## $ Target <chr> "Steroid-Induced Diabetes", "Neonatal …
## $ `Genetic Markers` <chr> "Positive", "Positive", "Positive", "N…
## $ Autoantibodies <chr> "Negative", "Negative", "Positive", "P…
## $ `Family History` <chr> "No", "No", "Yes", "No", "Yes", "Yes",…
## $ `Environmental Factors` <chr> "Present", "Present", "Present", "Pres…
## $ `Insulin Levels` <dbl> 40, 13, 27, 8, 17, 17, 29, 10, 47, 21,…
## $ Age <dbl> 44, 1, 36, 7, 10, 41, 30, 3, 47, 72, 6…
## $ BMI <dbl> 38, 17, 24, 16, 17, 26, 31, 18, 25, 24…
## $ `Physical Activity` <chr> "High", "High", "High", "Low", "High",…
## $ `Dietary Habits` <chr> "Healthy", "Healthy", "Unhealthy", "Un…
## $ `Blood Pressure` <dbl> 124, 73, 121, 100, 103, 127, 115, 80, …
## $ `Cholesterol Levels` <dbl> 201, 121, 185, 151, 146, 208, 237, 157…
## $ `Waist Circumference` <dbl> 50, 24, 36, 29, 33, 32, 43, 29, 40, 36…
## $ `Blood Glucose Levels` <dbl> 168, 178, 105, 121, 289, 142, 186, 206…
## $ Ethnicity <chr> "Low Risk", "Low Risk", "Low Risk", "L…
## $ `Socioeconomic Factors` <chr> "Medium", "High", "Medium", "High", "L…
## $ `Smoking Status` <chr> "Smoker", "Non-Smoker", "Smoker", "Smo…
## $ `Alcohol Consumption` <chr> "High", "Moderate", "High", "Moderate"…
## $ `Glucose Tolerance Test` <chr> "Normal", "Normal", "Abnormal", "Abnor…
## $ `History of PCOS` <chr> "No", "Yes", "Yes", "No", "No", "No", …
## $ `Previous Gestational Diabetes` <chr> "No", "No", "No", "Yes", "Yes", "No", …
## $ `Pregnancy History` <chr> "Normal", "Normal", "Normal", "Normal"…
## $ `Weight Gain During Pregnancy` <dbl> 18, 8, 15, 12, 2, 11, 15, 4, 30, 33, 3…
## $ `Pancreatic Health` <dbl> 36, 26, 56, 49, 10, 40, 62, 13, 91, 86…
## $ `Pulmonary Function` <dbl> 76, 60, 80, 89, 41, 85, 64, 44, 71, 69…
## $ `Cystic Fibrosis Diagnosis` <chr> "No", "Yes", "Yes", "Yes", "No", "Yes"…
## $ `Steroid Use History` <chr> "No", "No", "No", "No", "No", "No", "Y…
## $ `Genetic Testing` <chr> "Positive", "Negative", "Negative", "P…
## $ `Neurological Assessments` <dbl> 3, 1, 1, 2, 1, 2, 3, 1, 3, 2, 3, 1, 1,…
## $ `Liver Function Tests` <chr> "Normal", "Normal", "Abnormal", "Abnor…
## $ `Digestive Enzyme Levels` <dbl> 56, 28, 55, 60, 24, 52, 96, 29, 74, 42…
## $ `Urine Test` <chr> "Ketones Present", "Glucose Present", …
## $ `Birth Weight` <dbl> 2629, 1881, 3622, 3542, 1770, 3835, 44…
## $ `Early Onset Symptoms` <chr> "No", "Yes", "Yes", "No", "No", "Yes",…
Since the data also contain categorical columns, for which summary() does not give its usual numeric answer, we first convert the categorical columns to factors.
categorical_vars <- c(
"Target", "Genetic Markers", "Autoantibodies",
"Family History", "Environmental Factors",
"Physical Activity", "Dietary Habits", "Ethnicity",
"Socioeconomic Factors", "Smoking Status",
"Alcohol Consumption", "Glucose Tolerance Test",
"History of PCOS", "Previous Gestational Diabetes",
"Pregnancy History", "Cystic Fibrosis Diagnosis",
"Steroid Use History", "Genetic Testing",
"Liver Function Tests", "Urine Test",
"Early Onset Symptoms"
)
# Convert to factors
diabetes[categorical_vars] <- lapply(diabetes[categorical_vars], as.factor)
s <- summary(diabetes)
s %>%
kable() %>%
kable_styling(full_width = FALSE, position = "left", bootstrap_options = c("striped", "hover", "condensed"))
| Target | Genetic Markers | Autoantibodies | Family History | Environmental Factors | Insulin Levels | Age | BMI | Physical Activity | Dietary Habits | Blood Pressure | Cholesterol Levels | Waist Circumference | Blood Glucose Levels | Ethnicity | Socioeconomic Factors | Smoking Status | Alcohol Consumption | Glucose Tolerance Test | History of PCOS | Previous Gestational Diabetes | Pregnancy History | Weight Gain During Pregnancy | Pancreatic Health | Pulmonary Function | Cystic Fibrosis Diagnosis | Steroid Use History | Genetic Testing | Neurological Assessments | Liver Function Tests | Digestive Enzyme Levels | Urine Test | Birth Weight | Early Onset Symptoms | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MODY : 5553 | Negative:34899 | Negative:35058 | No :34832 | Absent :35088 | Min. : 5.00 | Min. : 0.00 | Min. :12.00 | High :23225 | Healthy :35020 | Min. : 60.0 | Min. :100.0 | Min. :20.00 | Min. : 80.0 | High Risk:34982 | High :23304 | Non-Smoker:34955 | High :23246 | Abnormal:35278 | No :35101 | No :34965 | Complications:34730 | Min. : 0.0 | Min. :10.00 | Min. :30.00 | No :35135 | No :35142 | Negative:34685 | Min. :1.000 | Abnormal:34981 | Min. :10.00 | Glucose Present:17422 | Min. :1500 | No :35059 | |
| Secondary Diabetes : 5479 | Positive:35101 | Positive:34942 | Yes:35168 | Present:34912 | 1st Qu.:13.00 | 1st Qu.:14.00 | 1st Qu.:20.00 | Low :23348 | Unhealthy:34980 | 1st Qu.: 99.0 | 1st Qu.:163.0 | 1st Qu.:30.00 | 1st Qu.:121.0 | Low Risk :35018 | Low :23283 | Smoker :35045 | Low :23411 | Normal :34722 | Yes:34899 | Yes:35035 | Normal :35270 | 1st Qu.: 7.0 | 1st Qu.:32.00 | 1st Qu.:63.00 | Yes:34865 | Yes:34858 | Positive:35315 | 1st Qu.:1.000 | Normal :35019 | 1st Qu.:31.00 | Ketones Present:17422 | 1st Qu.:2629 | Yes:34941 | |
| Cystic Fibrosis-Related Diabetes (CFRD): 5464 | NA | NA | NA | NA | Median :19.00 | Median :31.00 | Median :25.00 | Moderate:23427 | NA | Median :113.0 | Median :191.0 | Median :34.00 | Median :152.0 | NA | Medium:23413 | NA | Moderate:23343 | NA | NA | NA | NA | Median :16.0 | Median :46.00 | Median :72.00 | NA | NA | NA | Median :2.000 | NA | Median :48.00 | Normal :17528 | Median :3103 | NA | |
| Type 1 Diabetes : 5446 | NA | NA | NA | NA | Mean :21.61 | Mean :32.02 | Mean :24.78 | NA | NA | Mean :111.3 | Mean :194.9 | Mean :35.05 | Mean :160.7 | NA | NA | NA | NA | NA | NA | NA | NA | Mean :15.5 | Mean :47.56 | Mean :70.26 | NA | NA | NA | Mean :1.804 | NA | Mean :46.42 | Protein Present:17628 | Mean :3097 | NA | |
| Neonatal Diabetes Mellitus (NDM) : 5408 | NA | NA | NA | NA | 3rd Qu.:28.00 | 3rd Qu.:49.00 | 3rd Qu.:29.00 | NA | NA | 3rd Qu.:125.0 | 3rd Qu.:225.0 | 3rd Qu.:39.00 | 3rd Qu.:194.0 | NA | NA | NA | NA | NA | NA | NA | NA | 3rd Qu.:22.0 | 3rd Qu.:64.00 | 3rd Qu.:79.00 | NA | NA | NA | 3rd Qu.:2.000 | NA | 3rd Qu.:61.00 | NA | 3rd Qu.:3656 | NA | |
| Wolcott-Rallison Syndrome : 5400 | NA | NA | NA | NA | Max. :49.00 | Max. :79.00 | Max. :39.00 | NA | NA | Max. :149.0 | Max. :299.0 | Max. :54.00 | Max. :299.0 | NA | NA | NA | NA | NA | NA | NA | NA | Max. :39.0 | Max. :99.00 | Max. :89.00 | NA | NA | NA | Max. :3.000 | NA | Max. :99.00 | NA | Max. :4499 | NA | |
| (Other) :37250 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
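The factor conversion above can also be written without listing the columns by hand. The following is a minimal sketch on a hypothetical toy tibble (the data and the names toy and toy_fct are assumptions, not part of the project); it converts every character column in one step with dplyr.

```r
library(dplyr)

# Hypothetical toy data standing in for the diabetes table
toy <- tibble(
  `Smoking Status` = c("Smoker", "Non-Smoker", "Smoker"),
  `Family History` = c("Yes", "No", "Yes"),
  Age = c(44, 1, 36)
)

# Convert every character column to a factor; numeric columns are untouched
toy_fct <- toy %>% mutate(across(where(is.character), as.factor))

# Factor columns are now summarised as level counts, numeric ones as quantiles
summary(toy_fct)
```

This only works as intended when every character column really is categorical, which is why an explicit list such as categorical_vars can be the safer choice.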
We examine the effect of a single independent variable X on a dependent variable Y and show it graphically with a scatter plot. For now we pick a few X variables from which we think something could be concluded about the patients' age, i.e. those we believe have a logical connection with age.
Some of the variables (and the reasons) that we will plot are the following.
Scatter plot: BMI vs Age
head(diabetes)
## # A tibble: 6 × 34
## Target `Genetic Markers` Autoantibodies `Family History`
## <fct> <fct> <fct> <fct>
## 1 Steroid-Induced Diabetes Positive Negative No
## 2 Neonatal Diabetes Mellitus … Positive Negative No
## 3 Prediabetic Positive Positive Yes
## 4 Type 1 Diabetes Negative Positive No
## 5 Wolfram Syndrome Negative Negative Yes
## 6 LADA Positive Negative Yes
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## # Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## # `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## # `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## # `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## # `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>,
## # `History of PCOS` <fct>, `Previous Gestational Diabetes` <fct>, …
nrow(diabetes)
## [1] 70000
# BMI
diab_red <- sample_n(diabetes, 5000) # reduced random sample, because the full set has too many points to plot
ggplot(diab_red, aes(x = BMI, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  labs(title = "Scatter Plot: BMI vs Age",
       x = "BMI",
       y = "Age (Years)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
fit.BMI <- lm(Age ~ BMI, data = diab_red)
# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.BMI)
# the normality of the residuals is not violated too badly
hist(reziduali)
qqnorm(reziduali, main = "Q-Q plot of residuals")
qqline(reziduali, col = "red", lwd = 2)
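The visual check of the residuals can be complemented by a numeric one. The sketch below uses simulated stand-in data (not the diabetes table) and base R only; the seed, the variable names and the simulated model are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data: y depends linearly on x with normal noise
x <- runif(500, 12, 39)
y <- 2 * x + rnorm(500, sd = 10)
fit <- lm(y ~ x)
res <- residuals(fit)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
# (shapiro.test accepts between 3 and 5000 observations)
sw <- shapiro.test(res)
sw$p.value

# Sample skewness as a rough summary of asymmetry (0 for a symmetric distribution)
skew <- mean((res - mean(res))^3) / sd(res)^3
skew
```

With samples as large as 5000, formal tests tend to flag even small departures from normality, so the histogram and Q-Q plot remain the main tools here.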
Scatter plot: Blood Pressure vs Age
diab_red
## # A tibble: 5,000 × 34
## Target `Genetic Markers` Autoantibodies `Family History`
## <fct> <fct> <fct> <fct>
## 1 Type 3c Diabetes (Pancreat… Negative Negative Yes
## 2 Wolfram Syndrome Negative Negative No
## 3 Gestational Diabetes Negative Positive Yes
## 4 Prediabetic Positive Positive Yes
## 5 Wolfram Syndrome Positive Negative Yes
## 6 Gestational Diabetes Negative Negative Yes
## 7 Type 3c Diabetes (Pancreat… Positive Positive No
## 8 Prediabetic Positive Positive Yes
## 9 Gestational Diabetes Positive Negative No
## 10 Type 2 Diabetes Positive Negative Yes
## # ℹ 4,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## # Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## # `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## # `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## # `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## # `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>, …
ggplot(diab_red, aes(x = `Blood Pressure`, y = Age)) +
  geom_point(color = "blue", alpha = 0.6, size = 2) +
  labs(title = "Scatter Plot: Blood Pressure vs Age",
       x = "Blood Pressure (mmHg)",
       y = "Age (Years)") +
  geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
  # "lm" stands for linear model
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# we see a slight trend: blood pressure increases with age
fit.bp <- lm(Age ~ `Blood Pressure`, data = diab_red)
# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.bp)
# the normality of the residuals is not violated too badly
hist(reziduali)
qqnorm(reziduali, main = "Q-Q plot of residuals")
qqline(reziduali, col = "red", lwd = 2)
Box plot: Target vs Age
ggplot(diab_red, aes(x = `Target`, y = Age)) +
geom_boxplot(color = "blue", alpha = 0.6) +
labs(title = "Boxplot: Age by Type of Diabetes",
x = "Type of Diabetes",
y = "Age (Years)") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))
Box plot: Physical Activity vs Age
ggplot(diab_red, aes(x = `Physical Activity`, y = Age)) +
geom_boxplot(color = "blue", alpha = 0.6) +
labs(title = "Boxplot: Age by Physical Activity Levels",
x = "Physical Activity (Category)",
y = "Age (Years)") +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))
This box plot tells us very little about the relationship between physical activity and age.
Scatter plot: Cholesterol Levels vs Age
ggplot(diab_red, aes(x = `Cholesterol Levels`, y = Age)) +
geom_point(color = "blue", alpha = 0.6, size = 2) +
geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
labs(title = "Scatter Plot: Cholesterol Levels vs Age",
x = "Cholesterol Levels (mg/dL)",
y = "Age (Years)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
fit.ch <- lm(Age ~ `Cholesterol Levels`, data = diab_red)
# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.ch)
# the normality of the residuals is not violated too badly
hist(reziduali)
qqnorm(reziduali, main = "Q-Q plot of residuals")
qqline(reziduali, col = "red", lwd = 2)
Scatter plot: Blood Glucose Levels vs Age
ggplot(diab_red, aes(x = `Blood Glucose Levels`, y = Age)) +
geom_point(color = "blue", alpha = 0.6, size = 2) +
labs(title = "Scatter Plot: Blood Glucose Level vs Age",
x = "Blood Glucose Level (mg/dL)",
y = "Age (Years)") +
geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
We see no linear relationship here, so blood glucose level is a very poor regressor for age.
Scatter plot: Waist Circumference vs Age
ggplot(diab_red, aes(x = `Waist Circumference`, y = Age)) +
geom_point(color = "blue", alpha = 0.6, size = 2) +
labs(title = "Scatter Plot: Waist Circumference vs Age",
x = "Waist Circumference (cm)",
y = "Age (Years)") +
geom_smooth(method = "lm", color = "red", se = TRUE) + # adds the regression line
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
fit.ws <- lm(Age ~ `Waist Circumference`, data = diab_red)
# check the normality of the residuals, an important assumption
reziduali <- residuals(fit.ws)
hist(reziduali)
qqnorm(reziduali, main = "Q-Q plot of residuals")
qqline(reziduali, col = "red", lwd = 2)
We see that the normality of the residuals is not violated too badly.
From here on we use Target, BMI, Blood Pressure, Cholesterol Levels and Waist Circumference as regressors, because their plots in the initial analysis showed the clearest linear relationship with age.
Next we check whether any of the regressors are correlated with each other, i.e. how strong the relationship is between pairs of regressors.
Correlation coefficients
Regressors: Target, BMI, Blood Pressure, Cholesterol Levels, Waist Circumference
# Target is a factor, so it enters the correlation as its integer level codes
korelacija <- cor(cbind(as.integer(diab_red$Target),
                        diab_red$BMI,
                        diab_red$`Blood Pressure`,
                        diab_red$`Cholesterol Levels`,
                        diab_red$`Waist Circumference`),
                  use = "complete.obs") # complete observations only
# Add names to the columns and rows
colnames(korelacija) <- rownames(korelacija) <- c("Target",
                                                  "BMI",
                                                  "Blood Pressure",
                                                  "Cholesterol Levels",
                                                  "Waist Circumference")
From this table we draw the following conclusions:
BMI and Blood Pressure: a correlation of 0.642 indicates a moderately strong positive association.
BMI and Cholesterol Levels: a correlation of 0.595 is close to the threshold and also indicates an association.
BMI and Waist Circumference: a correlation of 0.619 also indicates a moderately strong association.
We now carry out a multiple linear regression.
Using a categorical variable with more than two categories as integer values in a regression is not recommended for nominal variables, because the integer codes impose an order and spacing that the categories do not have. We therefore do not use the Target variable. Its box plot is shown above, and example 1.4 explains how the type of diabetes can indicate a patient's age very well.
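If a nominal variable were to be included after all, the usual route is to pass it as a factor and let lm() build indicator (dummy) columns. The sketch below illustrates the difference on hypothetical toy data; the data frame toy and the level names A, B, C are assumptions for illustration only.

```r
# Hypothetical toy data: a nominal factor with three levels
toy <- data.frame(
  Age  = c(5, 7, 6, 40, 45, 42, 60, 65, 63),
  Type = factor(rep(c("A", "B", "C"), each = 3))
)

# lm() expands a factor into indicator (dummy) columns automatically
fit_fct <- lm(Age ~ Type, data = toy)
coef(fit_fct)  # intercept = mean of level A; TypeB, TypeC = differences from A

# The design matrix shows the 0/1 columns that were actually used
head(model.matrix(fit_fct))

# Coding the levels as 1, 2, 3 forces a single slope across unordered categories
fit_int <- lm(Age ~ as.integer(Type), data = toy)
length(coef(fit_int))  # one intercept and one slope
```

The factor model estimates one mean per category, while the integer model assumes the categories are equally spaced along a line, which has no meaning for diabetes types.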
# Multiple regression with Age as the dependent variable
fit.multi <- lm(Age ~ BMI + `Blood Pressure` + `Cholesterol Levels` + `Waist Circumference`, data = diab_red)
# Show the regression results
summary(fit.multi)
##
## Call:
## lm(formula = Age ~ BMI + `Blood Pressure` + `Cholesterol Levels` +
## `Waist Circumference`, data = diab_red)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.569 -7.657 -0.803 6.673 40.661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.791263 0.958346 -70.74 <2e-16 ***
## BMI 0.568551 0.037845 15.02 <2e-16 ***
## `Blood Pressure` 0.313808 0.013010 24.12 <2e-16 ***
## `Cholesterol Levels` 0.128689 0.005417 23.75 <2e-16 ***
## `Waist Circumference` 0.735261 0.037388 19.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.43 on 4995 degrees of freedom
## Multiple R-squared: 0.7081, Adjusted R-squared: 0.7079
## F-statistic: 3030 on 4 and 4995 DF, p-value: < 2.2e-16
The residuals are large: the extremes are about ±40 years, which is a lot for an estimate of age. The median residual (-0.803) is nevertheless close to zero, which is desirable.
All coefficients are statistically significant (p-values < 0.001), so each predictor contributes to explaining age given the other predictors in the model.
For each coefficient the null hypothesis is that beta equals zero, and these p-values let us reject that hypothesis.
The model explains 70.81% of the variance of the dependent variable Age (Multiple R-squared: 0.7081), which indicates a good ability to explain age from these predictors.
The difference between R-squared and adjusted R-squared (0.7081 vs 0.7079) is very small, so the penalty for the number of predictors is negligible.
The F statistic tests whether all slope coefficients in the model are jointly equal to zero. F = 3030 with p-value < 2.2e-16 tells us that this assumption is practically untenable, so we reject H0.
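As a quick consistency check, adjusted R-squared and the overall F statistic can be recomputed from the Multiple R-squared printed by summary(fit.multi). This is a sketch using the rounded values from that output, so the results agree with the printed 0.7079 and 3030 only up to rounding; the variable names are ours.

```r
# Values read from the summary(fit.multi) output
r2 <- 0.7081  # Multiple R-squared (rounded)
n  <- 5000    # observations in diab_red
p  <- 4       # number of predictors

# Adjusted R-squared penalises each added predictor
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic for H0: all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(adj_r2, 4)
round(f_stat)
```

With n = 5000 and only four predictors the factor (n - 1) / (n - p - 1) is almost exactly 1, which is why the two R-squared values barely differ.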
res_multi <- residuals(fit.multi) # separate name, so the residuals() function is not shadowed
hist(res_multi, breaks = 30, col = "lightblue", main = "Histogram of the residuals of our model",
     xlab = "Residuals")
# the residuals look approximately normal, which matters for the validity of the t-tests
qqnorm(res_multi, main = "Q-Q plot of residuals")
qqline(res_multi, col = "red", lwd = 2)
We now use the data of one anonymous person to predict their age, with a 95% confidence interval.
names(diab_red)
## [1] "Target" "Genetic Markers"
## [3] "Autoantibodies" "Family History"
## [5] "Environmental Factors" "Insulin Levels"
## [7] "Age" "BMI"
## [9] "Physical Activity" "Dietary Habits"
## [11] "Blood Pressure" "Cholesterol Levels"
## [13] "Waist Circumference" "Blood Glucose Levels"
## [15] "Ethnicity" "Socioeconomic Factors"
## [17] "Smoking Status" "Alcohol Consumption"
## [19] "Glucose Tolerance Test" "History of PCOS"
## [21] "Previous Gestational Diabetes" "Pregnancy History"
## [23] "Weight Gain During Pregnancy" "Pancreatic Health"
## [25] "Pulmonary Function" "Cystic Fibrosis Diagnosis"
## [27] "Steroid Use History" "Genetic Testing"
## [29] "Neurological Assessments" "Liver Function Tests"
## [31] "Digestive Enzyme Levels" "Urine Test"
## [33] "Birth Weight" "Early Onset Symptoms"
# check.names = FALSE keeps the column names with spaces, so they match the model terms
new_data <- data.frame(BMI = 25, `Waist Circumference` = 85, `Blood Pressure` = 80, `Cholesterol Levels` = 22, check.names = FALSE)
new_data
## BMI Waist Circumference Blood Pressure Cholesterol Levels
## 1 25 85 80 22
pred <- predict(fit.multi, newdata = new_data, interval = "confidence", level = 0.95)
# Show the prediction result
pred
## fit lwr upr
## 1 36.85548 32.03873 41.67223
In this case the predicted age is 36.86 years, with a 95% confidence interval of [32.04, 41.67] years for the mean predicted age.
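The interval returned with `interval = "confidence"` describes the uncertainty of the mean response at the given predictor values, not the spread of individual observations. The sketch below contrasts it with a prediction interval. It uses the built-in `mtcars` data as a stand-in, and the names `fit_demo`, `new_car`, `conf_int` and `pred_int` are hypothetical, not part of this report.

```r
# Illustrative only: mtcars stands in for the diabetes data.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)

# Interval for the mean response at these predictor values
conf_int <- predict(fit_demo, newdata = new_car,
                    interval = "confidence", level = 0.95)

# Interval for a single new observation (also accounts for residual variance)
pred_int <- predict(fit_demo, newdata = new_car,
                    interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both calls return the same point estimate; the prediction interval is wider because it also covers the scatter of individual observations around the mean.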
We use the χ² test of independence to check whether two categorical variables are associated. Here we want to determine whether there is a statistically significant association between disease type (Target) and smoking status (Smoking Status).
The χ² test is built on the null hypothesis (H₀) that the variables are independent, meaning that the distribution of one variable (e.g. disease type) is the same at every level of the other (e.g. smoking status).
Hypothesis H0: There is no significant association between disease type (Target) and smoking status (Smoking Status); the variables are independent. Hypothesis H1: There is a significant association between disease type (Target) and smoking status (Smoking Status); the variables are dependent.
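To make the mechanics of the test concrete, here is a minimal sketch that computes the Pearson χ² statistic by hand and compares it with `chisq.test()`. The 3 x 2 table of counts is made up for illustration, and the names `observed`, `expected`, `chi_sq`, `df_demo` and `reference` are hypothetical.

```r
# Hypothetical 3 x 2 table of counts (not taken from the diabetes data)
observed <- matrix(c(30, 20,
                     25, 25,
                     22, 28),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(group = c("A", "B", "C"),
                                   status = c("Non-Smoker", "Smoker")))

# Expected count under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over all cells of (O - E)^2 / E
chi_sq <- sum((observed - expected)^2 / expected)
df_demo <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df = df_demo, lower.tail = FALSE)

# Reference result from the built-in test
reference <- chisq.test(observed)

cat("By hand:     ", chi_sq, "p =", p_value, "\n")
cat("chisq.test():", unname(reference$statistic), "p =", reference$p.value, "\n")
```

The same two ingredients, observed and expected counts, are what the report compares for the real 13 x 2 table.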
cat("Distinct disease types (Target):", paste(unique(as.character(diabetes$Target)), collapse = ", "), "\n")
## Distinct disease types (Target): Steroid-Induced Diabetes, Neonatal Diabetes Mellitus (NDM), Prediabetic, Type 1 Diabetes, Wolfram Syndrome, LADA, Type 2 Diabetes, Wolcott-Rallison Syndrome, Secondary Diabetes, Type 3c Diabetes (Pancreatogenic Diabetes), Gestational Diabetes, Cystic Fibrosis-Related Diabetes (CFRD), MODY
cat("\n")
cat("Smoking status categories (Smoking Status):", paste(unique(as.character(diabetes$`Smoking Status`)), collapse = ", "), "\n")
## Smoking status categories (Smoking Status): Smoker, Non-Smoker
cat("\n")
target_smoking_table <- table(diabetes$Target, diabetes$`Smoking Status`)
kable(target_smoking_table, caption = "Observed frequencies of disease types by smoking status")
| | Non-Smoker | Smoker |
|---|---|---|
| Cystic Fibrosis-Related Diabetes (CFRD) | 2765 | 2699 |
| Gestational Diabetes | 2650 | 2694 |
| LADA | 2635 | 2588 |
| MODY | 2799 | 2754 |
| Neonatal Diabetes Mellitus (NDM) | 2626 | 2782 |
| Prediabetic | 2709 | 2667 |
| Secondary Diabetes | 2714 | 2765 |
| Steroid-Induced Diabetes | 2607 | 2668 |
| Type 1 Diabetes | 2725 | 2721 |
| Type 2 Diabetes | 2691 | 2706 |
| Type 3c Diabetes (Pancreatogenic Diabetes) | 2616 | 2704 |
| Wolcott-Rallison Syndrome | 2762 | 2638 |
| Wolfram Syndrome | 2656 | 2659 |
# Visualization: bar plot
ggplot(diabetes, aes(x = `Smoking Status`, fill = Target)) +
  geom_bar(position = "dodge") +
  labs(title = "Disease types among smokers and non-smokers",
       x = "Smoking status",
       y = "Number of patients",
       fill = "Disease type") +
  theme_minimal()
# Visualization: mosaic plot
mosaicplot(target_smoking_table, main = "Mosaic plot: smoking status vs. disease type", color = TRUE, las = 2)
smoking_vs_target <- chisq.test(table(diabetes$Target, diabetes$`Smoking Status`))
cat("χ² test results:\n")
## χ² test results:
print(smoking_vs_target)
##
## Pearson's Chi-squared test
##
## data: table(diabetes$Target, diabetes$`Smoking Status`)
## X-squared = 12.189, df = 12, p-value = 0.4306
The plots show no large differences: the frequencies are roughly equal across disease types and smoking statuses.
contingency_table <- table(diabetes$Target, diabetes$`Smoking Status`)
# Chi-squared test
chi_squared_test <- chisq.test(contingency_table)
expected_frequencies <- chi_squared_test$expected
# Display the expected frequencies
#kable(expected_frequencies, caption = "Expected frequencies of disease types by smoking status")
# Print the frequency table together with the expected frequencies
smoking_levels <- levels(diabetes$`Smoking Status`)
target_levels <- levels(diabetes$Target)
# Build a data frame with observed and expected counts side by side
# (as.vector() reads the table column by column, so Target varies fastest)
combined_table <- data.frame(
  Smoking_Status = rep(smoking_levels, each = length(target_levels)),
  Target = rep(target_levels, times = length(smoking_levels)),
  Stvarne_Frekvencije = as.vector(contingency_table),
  Očekivane_Frekvencije = as.vector(expected_frequencies)
)
# Display the table with kable
kable(combined_table, caption = "Observed and expected frequencies for each combination of the variables")
| Smoking_Status | Target | Stvarne_Frekvencije | Očekivane_Frekvencije |
|---|---|---|---|
| Non-Smoker | Cystic Fibrosis-Related Diabetes (CFRD) | 2765 | 2728.487 |
| Non-Smoker | Gestational Diabetes | 2650 | 2668.565 |
| Non-Smoker | LADA | 2635 | 2608.142 |
| Non-Smoker | MODY | 2799 | 2772.930 |
| Non-Smoker | Neonatal Diabetes Mellitus (NDM) | 2626 | 2700.523 |
| Non-Smoker | Prediabetic | 2709 | 2684.544 |
| Non-Smoker | Secondary Diabetes | 2714 | 2735.978 |
| Non-Smoker | Steroid-Induced Diabetes | 2607 | 2634.109 |
| Non-Smoker | Type 1 Diabetes | 2725 | 2719.499 |
| Non-Smoker | Type 2 Diabetes | 2691 | 2695.030 |
| Non-Smoker | Type 3c Diabetes (Pancreatogenic Diabetes) | 2616 | 2656.580 |
| Non-Smoker | Wolcott-Rallison Syndrome | 2762 | 2696.529 |
| Non-Smoker | Wolfram Syndrome | 2656 | 2654.083 |
| Smoker | Cystic Fibrosis-Related Diabetes (CFRD) | 2699 | 2735.513 |
| Smoker | Gestational Diabetes | 2694 | 2675.435 |
| Smoker | LADA | 2588 | 2614.858 |
| Smoker | MODY | 2754 | 2780.070 |
| Smoker | Neonatal Diabetes Mellitus (NDM) | 2782 | 2707.477 |
| Smoker | Prediabetic | 2667 | 2691.456 |
| Smoker | Secondary Diabetes | 2765 | 2743.022 |
| Smoker | Steroid-Induced Diabetes | 2668 | 2640.891 |
| Smoker | Type 1 Diabetes | 2721 | 2726.501 |
| Smoker | Type 2 Diabetes | 2706 | 2701.970 |
| Smoker | Type 3c Diabetes (Pancreatogenic Diabetes) | 2704 | 2663.420 |
| Smoker | Wolcott-Rallison Syndrome | 2638 | 2703.471 |
| Smoker | Wolfram Syndrome | 2659 | 2660.917 |
df <- (nrow(contingency_table) - 1) * (ncol(contingency_table) - 1)
# Degrees of freedom: df = (number of rows - 1) * (number of columns - 1)
# Critical value of the χ² test
alpha <- 0.05
critical_value <- qchisq(1 - alpha, df)
cat("Degrees of freedom:", df, "\n")
## Degrees of freedom: 12
cat("Critical value of the χ² test:", critical_value, "\n")
## Critical value of the χ² test: 21.02607
cat("χ² test statistic:", chi_squared_test$statistic, "\n")
## χ² test statistic: 12.18902
cat("p-value:", chi_squared_test$p.value, "\n")
## p-value: 0.4306211
The p-value (0.4306) is greater than 0.05, and the test statistic (12.19) is below the critical value (21.03), so we cannot reject the null hypothesis (H0). There is not enough evidence to conclude that Target (disease type) and Smoking Status are associated at the 5% significance level.
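The critical-value rule and the p-value rule always lead to the same decision, because both are read off the same χ² distribution. A minimal sketch, using only the statistic and degrees of freedom printed in the test output above (the variable names are hypothetical):

```r
# Values copied from the test output above
chi_sq_stat <- 12.18902
df_test <- 12
alpha <- 0.05

# Upper 5% point of the chi-squared distribution with 12 degrees of freedom
critical_value <- qchisq(1 - alpha, df = df_test)

# Probability of a statistic at least this large if H0 is true
p_value <- pchisq(chi_sq_stat, df = df_test, lower.tail = FALSE)

reject_by_critical <- chi_sq_stat > critical_value
reject_by_p <- p_value < alpha

cat("Critical value:", critical_value, "\n")
cat("p-value:", p_value, "\n")
cat("Same decision from both rules:", reject_by_critical == reject_by_p, "\n")
```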
Target is a categorical variable, while the pancreatic health level is a numeric variable.
We consider ANOVA because we are comparing the mean of a numeric variable across several groups defined by one categorical variable. ANOVA lets us analyse the effect of a factor on a single numeric variable, which is what we need to determine whether the groups differ significantly.
Hypothesis H0: There is no significant difference in pancreatic health level between the forms of diabetes. Hypothesis H1: There is a significant difference in pancreatic health level between the forms of diabetes.
# copy the data so that the original dataset is not modified
diabetes_copy2 = data.frame(diabetes)
names(diabetes_copy2)
## [1] "Target" "Genetic.Markers"
## [3] "Autoantibodies" "Family.History"
## [5] "Environmental.Factors" "Insulin.Levels"
## [7] "Age" "BMI"
## [9] "Physical.Activity" "Dietary.Habits"
## [11] "Blood.Pressure" "Cholesterol.Levels"
## [13] "Waist.Circumference" "Blood.Glucose.Levels"
## [15] "Ethnicity" "Socioeconomic.Factors"
## [17] "Smoking.Status" "Alcohol.Consumption"
## [19] "Glucose.Tolerance.Test" "History.of.PCOS"
## [21] "Previous.Gestational.Diabetes" "Pregnancy.History"
## [23] "Weight.Gain.During.Pregnancy" "Pancreatic.Health"
## [25] "Pulmonary.Function" "Cystic.Fibrosis.Diagnosis"
## [27] "Steroid.Use.History" "Genetic.Testing"
## [29] "Neurological.Assessments" "Liver.Function.Tests"
## [31] "Digestive.Enzyme.Levels" "Urine.Test"
## [33] "Birth.Weight" "Early.Onset.Symptoms"
levels(factor(diabetes_copy2$`Pancreatic.Health`))
## [1] "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "23" "24"
## [16] "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39"
## [31] "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54"
## [46] "55" "56" "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69"
## [61] "70" "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [76] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98" "99"
levels(factor(diabetes_copy2$Target))
## [1] "Cystic Fibrosis-Related Diabetes (CFRD)"
## [2] "Gestational Diabetes"
## [3] "LADA"
## [4] "MODY"
## [5] "Neonatal Diabetes Mellitus (NDM)"
## [6] "Prediabetic"
## [7] "Secondary Diabetes"
## [8] "Steroid-Induced Diabetes"
## [9] "Type 1 Diabetes"
## [10] "Type 2 Diabetes"
## [11] "Type 3c Diabetes (Pancreatogenic Diabetes)"
## [12] "Wolcott-Rallison Syndrome"
## [13] "Wolfram Syndrome"
#diabetes$`Target`
class(diabetes$`Target`)
## [1] "factor"
#diabetes$`Pancreatic Health`
class(diabetes$`Pancreatic Health`)
## [1] "numeric"
ANOVA assumes homogeneity of variances and normality, so we must check both before running it. We check the normality of the data for each diabetes group using the Lilliefors variant of the Kolmogorov-Smirnov test.
require(nortest)
## Loading required package: nortest
lillie.test(diabetes_copy2$Pancreatic.Health)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: diabetes_copy2$Pancreatic.Health
## D = 0.068414, p-value < 2.2e-16
# List of diabetes types
diabetes_types <- unique(diabetes_copy2$Target)
# Run the Lilliefors test and draw a histogram for each diabetes type
for (diabetes_type in diabetes_types) {
  # Select the values for this diabetes type
  data_subset <- diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target == diabetes_type]
  # Lilliefors test
  lillie_result <- lillie.test(data_subset)
  print(paste("Lillie test result for", diabetes_type, ":", lillie_result$p.value))
  # Draw the histogram
  hist(data_subset,
       main = paste("Pancreatic Health for", diabetes_type), # title includes the diabetes type
       xlab = "Pancreatic Health", # x-axis label
       col = "lightblue", # bar fill colour (optional)
       border = "black") # bar border colour (optional)
}
## [1] "Lillie test result for Steroid-Induced Diabetes : 1.0461707480236e-73"
## [1] "Lillie test result for Neonatal Diabetes Mellitus (NDM) : 1.00768949161913e-86"
## [1] "Lillie test result for Prediabetic : 1.53545208640135e-73"
## [1] "Lillie test result for Type 1 Diabetes : 4.26370855086212e-72"
## [1] "Lillie test result for Wolfram Syndrome : 4.54340602830596e-81"
## [1] "Lillie test result for LADA : 9.58723988923272e-74"
## [1] "Lillie test result for Type 2 Diabetes : 7.5148662012579e-59"
## [1] "Lillie test result for Wolcott-Rallison Syndrome : 2.99834984600884e-88"
## [1] "Lillie test result for Secondary Diabetes : 1.60476108964642e-66"
## [1] "Lillie test result for Type 3c Diabetes (Pancreatogenic Diabetes) : 7.25012119118581e-73"
## [1] "Lillie test result for Gestational Diabetes : 6.22617556022835e-68"
## [1] "Lillie test result for Cystic Fibrosis-Related Diabetes (CFRD) : 1.02958198804219e-75"
## [1] "Lillie test result for MODY : 2.21551404977202e-74"
# test the homogeneity of variances across groups
bartlett.test(diabetes_copy2$Pancreatic.Health ~ diabetes_copy2$Target)
##
## Bartlett test of homogeneity of variances
##
## data: diabetes_copy2$Pancreatic.Health by diabetes_copy2$Target
## Bartlett's K-squared = 14096, df = 12, p-value < 2.2e-16
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Gestational Diabetes'])
## [1] 205.5873
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='LADA'])
## [1] 135.6495
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='MODY'])
## [1] 203.6217
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Neonatal Diabetes Mellitus (NDM)'])
## [1] 75.23981
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Prediabetic'])
## [1] 209.5374
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Secondary Diabetes'])
## [1] 529.6986
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Steroid-Induced Diabetes'])
## [1] 299.7283
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Type 1 Diabetes'])
## [1] 206.4028
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Type 2 Diabetes'])
## [1] 531.1273
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Type 3c Diabetes (Pancreatogenic Diabetes)'])
## [1] 299.9355
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Wolcott-Rallison Syndrome'])
## [1] 75.20566
var(diabetes_copy2$Pancreatic.Health[diabetes_copy2$Target=='Wolfram Syndrome'])
## [1] 75.46892
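Writing one `var()` call per group makes it easy to miss a level; the calls above, for instance, do not include Cystic Fibrosis-Related Diabetes (CFRD). A minimal sketch of computing every group variance in a single call with `tapply()` follows. The data frame `demo` is simulated and its values are made up; only the column names mirror `diabetes_copy2`.

```r
# Simulated stand-in for diabetes_copy2 (values are made up)
set.seed(1)
demo <- data.frame(
  Target = factor(rep(c("Type 1 Diabetes", "Type 2 Diabetes", "LADA"), each = 50)),
  Pancreatic.Health = c(rnorm(50, mean = 40, sd = 9),
                        rnorm(50, mean = 60, sd = 23),
                        rnorm(50, mean = 45, sd = 12))
)

# One variance per level of Target, returned as a named vector
group_var <- tapply(demo$Pancreatic.Health, demo$Target, var)
print(round(group_var, 2))
```

On the real data the equivalent call would be `tapply(diabetes_copy2$Pancreatic.Health, diabetes_copy2$Target, var)`, which covers all 13 levels at once.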
We see that the data are not normally distributed. We also tested the homogeneity of variances across groups.
The data satisfy neither the homogeneity condition nor normality (all p-values obtained are below 0.05), so we cannot use ANOVA.
We therefore use a non-parametric test, the Kruskal-Wallis test, which is the non-parametric counterpart of one-way ANOVA.
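The Kruskal-Wallis test replaces the observations with their ranks in the pooled sample and compares the rank sums of the groups. A minimal sketch of the statistic computed by hand, checked against `kruskal.test()`: the measurements are made up and chosen without ties (with ties, `kruskal.test()` applies an additional correction), and all variable names are hypothetical.

```r
# Made-up measurements for three groups, chosen so that no two values are equal
values <- c(2.1, 3.4, 1.9, 4.0, 5.2, 6.1, 4.8, 7.3, 8.0, 9.5, 7.7, 6.6)
groups <- factor(rep(c("A", "B", "C"), each = 4))

n_total <- length(values)
ranks <- rank(values)                       # ranks over the pooled sample
rank_sums <- tapply(ranks, groups, sum)     # R_j: rank sum of group j
group_n <- tapply(ranks, groups, length)    # n_j: size of group j

# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)
h_stat <- 12 / (n_total * (n_total + 1)) * sum(rank_sums^2 / group_n) -
  3 * (n_total + 1)

# Under H0, H is approximately chi-squared with (number of groups - 1) df
p_value <- pchisq(h_stat, df = nlevels(groups) - 1, lower.tail = FALSE)

reference <- kruskal.test(values ~ groups)
cat("By hand:       ", h_stat, "p =", p_value, "\n")
cat("kruskal.test():", unname(reference$statistic), "p =", reference$p.value, "\n")
```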
# copy the data so that the original dataset is not modified
diabetes_copy3 = data.frame(diabetes)
Histogram of pancreatic health
# 1. Arithmetic mean
mean_pancreas <- mean(diabetes_copy3$`Pancreatic.Health`)
mean_pancreas
## [1] 47.56424
# 2. Trimmed mean (here 20%)
mean_trimmed_pancreas <- mean(diabetes_copy3$`Pancreatic.Health`, trim = 0.2)
mean_trimmed_pancreas
## [1] 46.94283
# 3. Median
median_pancreas <- median(diabetes_copy3$`Pancreatic.Health`)
median_pancreas
## [1] 46
# 4. Quartiles (Q1 and Q3)
quantiles_pancreas <- quantile(diabetes_copy3$`Pancreatic.Health`, probs = c(0.25, 0.75))
quantiles_pancreas
## 25% 75%
## 32 64
# 5. Mode (most frequent value), from the modeest package
require(modeest)
## Loading required package: modeest
## Warning: package 'modeest' was built under R version 4.4.2
mode_pancreas <- mfv(diabetes_copy3$`Pancreatic.Health`)
mode_pancreas
## [1] 37
# 6. Histogram
hist(diabetes_copy3$`Pancreatic.Health`,
     main = "Histogram: Pancreatic Health",
     xlab = "Value",
     ylab = "Frequency",
     col = "lightblue")
# Add a vertical orange line of width 4 at the mean
abline(v = mean(diabetes_copy3$`Pancreatic.Health`),
       col = "orange", lwd = 4)
# 7. Summary (min, Q1, median, mean, Q3, max)
summary(diabetes_copy3$`Pancreatic.Health`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.00 32.00 46.00 47.56 64.00 99.00
We draw a box plot of the data for all diabetes types.
# Boxplot: Pancreatic.Health ~ Target
ggplot(data = diabetes_copy3, aes(x = Target, y = Pancreatic.Health)) +
geom_boxplot(
fill = "lightblue",
outlier.colour = "red"
) +
labs(
title = "Boxplot of Pancreatic.Health by Diabetes Type",
x = "Type of Diabetes",
y = "Pancreatic.Health"
) +
coord_flip() +
theme_minimal()
The plot suggests that the type of diabetes is related to pancreatic health, but we cannot confirm this without running the Kruskal-Wallis test.
Running the Kruskal-Wallis test:
kruskal.test(Pancreatic.Health ~ Target, data = diabetes_copy3)
##
## Kruskal-Wallis rank sum test
##
## data: Pancreatic.Health by Target
## Kruskal-Wallis chi-squared = 30882, df = 12, p-value < 2.2e-16
From the Kruskal-Wallis result we reject the null hypothesis (H0), because the p-value is below 0.05 (p-value < 2.2e-16). We reject the hypothesis that pancreatic health does not differ between forms of diabetes and accept the alternative hypothesis (H1) that it does. In other words, at least one type of diabetes has a pancreatic health distribution that differs significantly from the others.
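The Kruskal-Wallis test only says that at least one group differs, not which ones. One common follow-up, which this report does not perform, is a set of pairwise Wilcoxon rank-sum tests with a correction for multiple comparisons. The sketch below is an illustration under assumptions: it uses the built-in `airquality` data (Ozone by Month) in place of Pancreatic.Health by Target, and the Holm correction is one possible choice.

```r
# airquality ships with R; Ozone by Month stands in for
# Pancreatic.Health by Target
kw <- kruskal.test(Ozone ~ Month, data = airquality)

# Pairwise Wilcoxon rank-sum tests, Holm-adjusted for multiple testing.
# suppressWarnings() hides the notes about ties preventing exact p-values.
posthoc <- suppressWarnings(
  pairwise.wilcox.test(airquality$Ozone, airquality$Month,
                       p.adjust.method = "holm")
)

print(kw$p.value)
print(posthoc$p.value)   # lower-triangular matrix of adjusted p-values
```

Each entry of the matrix is the adjusted p-value for one pair of groups; small entries point to the pairs that drive the overall result.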
Hypothesis H0: There is no significant difference in blood insulin level between the physical activity groups.
Hypothesis H1: There is a difference in blood insulin level between the physical activity groups.
#diabetes$`Physical Activity`
class(diabetes$`Physical Activity`)
## [1] "factor"
#diabetes$`Insulin Levels`
class(diabetes$`Insulin Levels`)
## [1] "numeric"
diabetes[c(1,6,9)]
## # A tibble: 70,000 × 3
## Target `Insulin Levels` `Physical Activity`
## <fct> <dbl> <fct>
## 1 Steroid-Induced Diabetes 40 High
## 2 Neonatal Diabetes Mellitus (NDM) 13 High
## 3 Prediabetic 27 High
## 4 Type 1 Diabetes 8 Low
## 5 Wolfram Syndrome 17 High
## 6 LADA 17 Moderate
## 7 Type 2 Diabetes 29 Moderate
## 8 Wolcott-Rallison Syndrome 10 Low
## 9 Secondary Diabetes 47 High
## 10 Secondary Diabetes 21 Low
## # ℹ 69,990 more rows
Description and overview of the selected attributes
Physical activity
is a categorical variable
takes the values “Low”, “Moderate” or “High”
# Mode
require(modeest)
# The call below returns the most frequent physical activity level
mfv(diabetes$`Physical Activity`)
## [1] Moderate
## Levels: High Low Moderate
# Check whether any values fall outside the expected categories
table(diabetes$`Physical Activity`)
##
## High Low Moderate
## 23225 23348 23427
summary(diabetes$`Physical Activity`)
## High Low Moderate
## 23225 23348 23427
Insulin level
# Arithmetic mean (stored in v, not printed; summary() below reports it)
v = mean(diabetes$`Insulin Levels`)
# Trimmed mean (20%)
mean(diabetes$`Insulin Levels`, trim=0.2)
## [1] 19.99071
median(diabetes$`Insulin Levels`)
## [1] 19
quantile(diabetes$`Insulin Levels`, probs = c(0.25,0.75))
## 25% 75%
## 13 28
# Mode
require(modeest)
mfv(diabetes$`Insulin Levels`)
## [1] 13
# Histogram of insulin level values and their frequencies
h = hist(diabetes$`Insulin Levels`,
         #prob=TRUE,
         main="Histogram: Insulin Levels",
         xlab="Insulin Level",
         ylab='Frequency',
         col="lightblue"
)
abline(v = mean(diabetes$`Insulin Levels`), col = "orange", lwd = 4)
summary(diabetes$`Insulin Levels`)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 13.00 19.00 21.61 28.00 49.00
The histogram shows that most people have lower insulin levels (roughly 5 to 20) and that fewer people have an insulin level above 35.
Comparison of the grouped data:
# Histogram counts computed separately for each physical activity group
h1 = hist(diabetes[diabetes$`Physical Activity` == c("Low"),]$`Insulin Levels`,
plot=FALSE)
h2 = hist(diabetes[diabetes$`Physical Activity` == c("Moderate"),]$`Insulin Levels`,
plot=FALSE)
h3 = hist(diabetes[diabetes$`Physical Activity` == c("High"),]$`Insulin Levels`,
plot=FALSE)
data <- t(cbind(h1$counts,h2$counts,h3$counts))
data
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 785 834 1465 2029 2111 1942 2036 1662 1217 1152 1289 1210 926 700
## [2,] 893 865 1378 2029 2016 1912 1966 1578 1290 1228 1216 1262 1004 679
## [3,] 814 787 1455 1972 2021 2001 2059 1653 1222 1236 1212 1209 954 704
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23]
## [1,] 647 550 499 489 437 396 380 381 211
## [2,] 697 521 533 487 413 440 409 397 214
## [3,] 702 506 505 443 397 411 406 382 174
barplot(data, beside = TRUE, col = c("orange", "green", "lightblue"), xlab = "Insulin Level", ylab = "Frequency")
legend("topright",c("low","moderate","high"),fill = c("orange", "green", "lightblue"))
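Stacking `hist()` counts with `cbind()` only works when every group ends up with the same bins, which is not guaranteed when each call picks its own breaks. A minimal sketch of one way to make this robust is to pass a shared `breaks` vector to every call. The data frame `demo` is simulated with made-up values, and the bin width of 2 is an assumption.

```r
# Simulated stand-in for the insulin data (values are made up)
set.seed(42)
demo <- data.frame(
  activity = factor(sample(c("Low", "Moderate", "High"), 600, replace = TRUE),
                    levels = c("Low", "Moderate", "High")),
  insulin = round(runif(600, min = 5, max = 49))
)

# One set of bin edges covering the whole range, reused for every group
common_breaks <- seq(floor(min(demo$insulin)),
                     ceiling(max(demo$insulin)) + 2, by = 2)

# Matrix of counts: one row per bin, one column per activity level
counts <- sapply(levels(demo$activity), function(lv) {
  hist(demo$insulin[demo$activity == lv],
       breaks = common_breaks, plot = FALSE)$counts
})

# barplot() expects one row per group, hence the transpose
barplot(t(counts), beside = TRUE,
        col = c("orange", "green", "lightblue"),
        names.arg = head(common_breaks, -1),
        xlab = "Insulin level (lower bin edge)", ylab = "Frequency")
legend("topright", legend = levels(demo$activity),
       fill = c("orange", "green", "lightblue"))
```

With shared breaks the bars in each cluster are guaranteed to describe the same insulin range, and the x-axis can carry the bin edges as labels.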
As the grouped bar chart shows, the frequency distribution of insulin level is very similar across the three physical activity groups.
#boxplot(`Insulin Levels` ~ `Physical Activity`, data=diabetes)
boxplot(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
The box plot gives another graphical view of the distribution of insulin level within each physical activity group.
The plots so far suggest that physical activity and blood insulin level are not related.
Running ANOVA
The assumptions of ANOVA are:
independence of the individual observations in the samples
normally distributed data
homogeneity of variances across populations
require(nortest)
We run the Lilliefors test to check whether the data under study are normally distributed.
# check the distribution of the whole `Insulin Levels` attribute
lillie.test(diabetes$`Insulin Levels`)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: diabetes$`Insulin Levels`
## D = 0.11445, p-value < 2.2e-16
# check the distribution of `Insulin Levels` for `Physical Activity` == "Low"
lillie.test(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"])
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"]
## D = 0.11711, p-value < 2.2e-16
# check the distribution of `Insulin Levels` for `Physical Activity` == "Moderate"
lillie.test(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"])
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"]
## D = 0.11076, p-value < 2.2e-16
# check the distribution of `Insulin Levels` for `Physical Activity` == "High"
lillie.test(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"])
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"]
## D = 0.11548, p-value < 2.2e-16
# Graphical view of the data
hist(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"])
hist(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"])
hist(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"])
# Test the homogeneity of variances with Bartlett's test
bartlett.test(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
##
## Bartlett test of homogeneity of variances
##
## data: diabetes$`Insulin Levels` by diabetes$`Physical Activity`
## Bartlett's K-squared = 6.4479, df = 2, p-value = 0.0398
var(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Low"])
## [1] 115.882
var(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "Moderate"])
## [1] 118.4482
var(diabetes$`Insulin Levels`[diabetes$`Physical Activity` == "High"])
## [1] 114.6527
a = aov(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
summary(a)
## Df Sum Sq Mean Sq F value Pr(>F)
## diabetes$`Physical Activity` 2 346 173.1 1.488 0.226
## Residuals 69997 8142960 116.3
The data do not meet the requirements for ANOVA: they satisfy neither normality nor homogeneity of variances (the p-values of both checks were below 0.05).
We therefore run the Kruskal-Wallis test, which does not require normally distributed data.
# Kruskal-Wallis
kruskal.test(diabetes$`Insulin Levels` ~ diabetes$`Physical Activity`)
##
## Kruskal-Wallis rank sum test
##
## data: diabetes$`Insulin Levels` by diabetes$`Physical Activity`
## Kruskal-Wallis chi-squared = 2.0217, df = 2, p-value = 0.3639
The Kruskal-Wallis p-value is 0.3639, which is greater than 0.05, so we cannot reject the null hypothesis (H0): there is no statistically significant difference in blood insulin level between the physical activity groups. The plots already suggested this, and the test result is consistent with them.
Is there a difference in the age at diagnosis between type 1 and type 2 diabetes?
sum(is.na(diabetes$Target)) # no NA values
## [1] 0
sum(is.na(diabetes$Autoantibodies)) # no NA values
## [1] 0
diabetes
## # A tibble: 70,000 × 34
## Target `Genetic Markers` Autoantibodies `Family History`
## <fct> <fct> <fct> <fct>
## 1 Steroid-Induced Diabetes Positive Negative No
## 2 Neonatal Diabetes Mellitus… Positive Negative No
## 3 Prediabetic Positive Positive Yes
## 4 Type 1 Diabetes Negative Positive No
## 5 Wolfram Syndrome Negative Negative Yes
## 6 LADA Positive Negative Yes
## 7 Type 2 Diabetes Negative Negative No
## 8 Wolcott-Rallison Syndrome Positive Negative No
## 9 Secondary Diabetes Negative Positive No
## 10 Secondary Diabetes Positive Negative Yes
## # ℹ 69,990 more rows
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## # Age <dbl>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## # `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## # `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## # `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## # `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>, …
diabetes$Age <- as.integer(diabetes$Age) # open question: is converting Age to integer needed?
table(diabetes$Target)
##
## Cystic Fibrosis-Related Diabetes (CFRD)
## 5464
## Gestational Diabetes
## 5344
## LADA
## 5223
## MODY
## 5553
## Neonatal Diabetes Mellitus (NDM)
## 5408
## Prediabetic
## 5376
## Secondary Diabetes
## 5479
## Steroid-Induced Diabetes
## 5275
## Type 1 Diabetes
## 5446
## Type 2 Diabetes
## 5397
## Type 3c Diabetes (Pancreatogenic Diabetes)
## 5320
## Wolcott-Rallison Syndrome
## 5400
## Wolfram Syndrome
## 5315
# diabetes
# keep only the rows for type 1 or type 2 diabetes
type_1_and_2 <- diabetes %>% filter(Target %in% c("Type 1 Diabetes", "Type 2 Diabetes"))
# sanity check: inspect 10 randomly chosen filtered rows
random_rows <- type_1_and_2[sample(1:nrow(type_1_and_2), size = 10, replace = FALSE), ]
print(random_rows)
## # A tibble: 10 × 34
## Target `Genetic Markers` Autoantibodies `Family History`
## <fct> <fct> <fct> <fct>
## 1 Type 1 Diabetes Positive Positive No
## 2 Type 2 Diabetes Positive Positive Yes
## 3 Type 1 Diabetes Positive Negative No
## 4 Type 2 Diabetes Positive Negative Yes
## 5 Type 1 Diabetes Negative Negative Yes
## 6 Type 1 Diabetes Negative Positive Yes
## 7 Type 1 Diabetes Positive Positive No
## 8 Type 2 Diabetes Positive Positive Yes
## 9 Type 2 Diabetes Negative Positive No
## 10 Type 2 Diabetes Negative Negative No
## # ℹ 30 more variables: `Environmental Factors` <fct>, `Insulin Levels` <dbl>,
## # Age <int>, BMI <dbl>, `Physical Activity` <fct>, `Dietary Habits` <fct>,
## # `Blood Pressure` <dbl>, `Cholesterol Levels` <dbl>,
## # `Waist Circumference` <dbl>, `Blood Glucose Levels` <dbl>, Ethnicity <fct>,
## # `Socioeconomic Factors` <fct>, `Smoking Status` <fct>,
## # `Alcohol Consumption` <fct>, `Glucose Tolerance Test` <fct>,
## # `History of PCOS` <fct>, `Previous Gestational Diabetes` <fct>, …
# split the filtered rows into type 1 and type 2 subsets
type_1 = type_1_and_2[type_1_and_2$Target=="Type 1 Diabetes",]
type_2 = type_1_and_2[type_1_and_2$Target=="Type 2 Diabetes",]
The histograms show that age is not normally distributed.
# histogram with one-year-wide bins
cat('The mean age at diagnosis of type 1 diabetes is ', mean(type_1$Age),'\n')
## The mean age at diagnosis of type 1 diabetes is 17.065
hist(type_1$Age,
     breaks = seq(min(type_1$Age) - 1, max(type_1$Age) + 1, 1),
     main = 'TYPE 1 DIABETES: age histogram',
     xlab = 'Age (years)',
     col = 'skyblue',
     border = 'black')
summary(type_1$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 11.00 17.00 17.07 23.00 29.00
cat('The mean age at diagnosis of type 2 diabetes is ', mean(type_2$Age), '\n')
## The mean age at diagnosis of type 2 diabetes is 54.60978
hist(type_2$Age,
     breaks = seq(min(type_2$Age) - 1, max(type_2$Age) + 1, 1),
     main = 'TYPE 2 DIABETES: age histogram',
     xlab = 'Age (years)',
     col = 'darkblue',
     border = 'black')
We see that the age data are not normally distributed but roughly uniform.
We draw a box plot to make the difference easier to see.
boxplot(type_1$Age, type_2$Age,
        names = c('Type 1 age','Type 2 age'),
        main='Box plot of age at diabetes diagnosis',
        ylab="Age (years)"
)
## Conclusion:
The plots clearly show that type 1 diabetes is diagnosed much earlier than type 2 diabetes: the mean age is about 17 years for type 1 and about 55 years for type 2.
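The conclusion above rests on plots alone. Since age is not normally distributed, one way to back it with a test, in line with the non-parametric approach used elsewhere in this report, would be a Wilcoxon rank-sum test. The sketch below is an illustration on simulated ages, not a result from the dataset: the 5 to 29 range loosely follows the type 1 summary above, while the 40 to 70 range for type 2 is purely an assumption.

```r
# Simulated ages (made up); only the type 1 range follows the summary above
set.seed(123)
age_type_1 <- sample(5:29, 200, replace = TRUE)
age_type_2 <- sample(40:70, 200, replace = TRUE)   # assumed range

# Two-sided Wilcoxon rank-sum test; exact = FALSE requests the normal
# approximation, since integer ages contain many ties
age_test <- wilcox.test(age_type_1, age_type_2, exact = FALSE)
print(age_test$p.value)
```

On the real data the corresponding call would be `wilcox.test(type_1$Age, type_2$Age, exact = FALSE)`.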