Affiliation: University of Chittagong

Loading the assigned dataset into an R notebook:

Assigned dataset = diabetes.csv

This dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The aim is to forecast, based on diagnostic measurements, whether a patient is afflicted with diabetes. Stringent criteria were applied in selecting instances from a larger database, ensuring that all patients in this dataset are females, at least 21 years old, and of Pima Indian heritage. All columns in this dataset consist of numerical values.
Pregnancies: Number of occurrences of pregnancy
Glucose: Plasma glucose concentration two hours after an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1). This serves as the target column, where ‘0’ indicates no disease, and ‘1’ indicates the presence of disease.
1. The assigned dataset “diabetes” was downloaded and moved into the working directory of RStudio.
2. A notebook file was then created in RStudio, and the dataset was loaded into a variable.
Diabetes = read.csv("diabetes.csv")
print(head(Diabetes))

3. Some basic information about the “Diabetes” dataset

Total number of columns and their names


column_names <- colnames(Diabetes)
num_columns <- length(column_names)

# column names and the number of columns
cat("Column names of Diabetes are", column_names, "\n","\n")
Column names of Diabetes are Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome 
 
cat("Total number of columns is", num_columns)
Total number of columns is 9

a) Finding out the numerical columns


numerical_columns <- sapply(Diabetes, is.numeric) #extraction of the numerical columns

numerical_column_names <- names(numerical_columns)[numerical_columns]

print(numerical_column_names)
[1] "Pregnancies"              "Glucose"                  "BloodPressure"            "SkinThickness"           
[5] "Insulin"                  "BMI"                      "DiabetesPedigreeFunction" "Age"                     
[9] "Outcome"                 
All 9 columns of the dataset are numerical.

b) Finding out the character columns

character_columns <- sapply(Diabetes, is.character) # Extraction of the character columns

character_column_names <- names(character_columns)[character_columns]

print(character_column_names)
character(0)

So, there are no character columns in the dataset.

c) Target variable of the dataset is “Outcome”.

4. No column is stored as a categorical (factor or character) type; the only categorical variable, “Outcome”, is encoded numerically as 0 or 1.

5.a) Scatter plot of Pregnancies and BloodPressure

Diabetes <- na.omit(Diabetes)   # data_cleaning
print(Diabetes)
variety_color = as.numeric(factor(Diabetes$Outcome))

plot(
  x = Diabetes$Pregnancies,
  y = Diabetes$BloodPressure,
  col = variety_color,
  pch = 19,
  cex = 1.3,
  
  xlab = "Pregnancies",
  ylab = "Blood Pressure",
  main = "Scatter Plot of Pregnancies vs Blood Pressure"
)

# A legend with more descriptive labels
legend(
  "topright",
  legend = c("No Diabetes", "Diabetes"),  # Descriptive labels
  col= 1:2,
  pch = 19,
  title = "Outcome"
)

This scatter plot visualizes the relationship between Pregnancies and Blood Pressure.
Each point is colored based on the “Outcome” (0 or 1), representing different classes. The legend indicates which color corresponds to each class.

5.b) Histogram of Insulin column

hist(
  Diabetes$Insulin,
  breaks = 20,  # Increase the number of bins for better resolution
  main = "Histogram plot of Insulin Levels",
  
  col.main = "#008080",
  cex.main = 1.5,    # Adjust the size of the title
  cex.lab = 1.3,
  col.lab = c("Maroon"),
  font.main = 1,      # Use plain (non-bold) font for the title
  xlab = "Insulin",
  col = adjustcolor("purple", alpha = 0.7),  # Adjust alpha for transparency
  border = "white",  # White borders for better separation
  xlim = c(min(Diabetes$Insulin, na.rm = TRUE), max(Diabetes$Insulin, na.rm = TRUE))
)

# Add axis labels
axis(1, at = pretty(Diabetes$Insulin, n = 10), labels = TRUE, col.axis = "darkgray")
axis(2, labels = TRUE, col.axis = "darkgray")

# Addition of a grid for better readability
grid()

This histogram plot illustrates the distribution of the “Insulin” feature in the “Diabetes” dataset.
The x-axis represents different ranges or bins of insulin values, and the y-axis shows the frequency of observations within each bin.
The color “purple” is used to fill the bars for visual distinction.

5.c) Boxplot of SkinThickness column

boxplot(
  Diabetes$SkinThickness,
  ylab = "Skin Thickness",
  col = "#48b5c4",
  pch = 19,
  border = "black",
  main = "Boxplot of Skin Thickness"
)

This boxplot visualizes the distribution of “SkinThickness” values.
The box represents the interquartile range (IQR), the line inside the box is the median, and the whiskers extend to the most extreme values lying within 1.5 times the IQR of the box; points beyond the whiskers are drawn individually as outliers. Together these summarize the central tendency and spread of the data.
There is one outlier in the boxplot, at a SkinThickness value close to 100.

6. Using ggplot2 library functions to plot

a) Scatterplot of all combinations of two features: Pregnancies and Glucose



library(ggplot2)
Warning: package ‘ggplot2’ was built under R version 4.3.2
# Scatter plot using ggplot2
ggplot(Diabetes, aes(x = Pregnancies, y = Glucose, color = factor(Outcome))) +
  geom_point(size=2.8, alpha = 1) +
  labs(
    x = "Pregnancies",
    y = "Glucose",
    title = "Scatter Plot of Pregnancies and Glucose",
    caption = "Source: Iskulghar",
    color = "Outcome"
  ) +
  theme_minimal() + 
  theme(
    text = element_text(color = "darkblue", size = 14),  # Adjust font color and size
    legend.position = "top",  # Place legends at the top
    axis.text.x = element_text(color = "#FF7F7F", size = 14),
    axis.text.y = element_text(color = "#C4A484", size = 14),
    plot.title = element_text(color= "maroon", hjust = 0.5),
  
    legend.title = element_text(color = "darkgreen")
  )

This scatter plot shows “Glucose” against “Pregnancies”, with points colored based on the “Outcome” class (0 or 1).
It includes proper axis labels, title, and a caption. Font colors and sizes are adjusted for readability, and the legend is positioned at the top.

b) Boxplot of all columns, following all instructions

library(reshape2)
Warning: package ‘reshape2’ was built under R version 4.3.2
# Melting data into long format
melted_data <- melt(Diabetes)
No id variables; using all as measure variables
# Boxplot of all columns
ggplot(melted_data, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(
    x = "Columns",
    y = "Values",
    title = "Boxplot of All Columns",
    caption = "Source: Iskulghar"
  ) +
  theme_bw() +
  theme(
    text = element_text(color = "darkblue", size = 10.5),
    legend.position = "top",
    legend.title = element_text(color = "#FF7F7F", size = 13),  # Legend title color and size
    plot.title = element_text(color= "maroon", hjust = 0.5)
  ) +
  facet_wrap(~variable, scales = "free")  # One panel per column, each with its own axis scales

This boxplot visualizes the distribution of values for all columns in the “Diabetes” dataset.
The data is melted into long format for easier plotting. Proper axis labels, title, and a caption are provided.
Font colors and sizes are adjusted, and the legend is positioned at the top for clarity.



---
title: "Data Science Project"
author: "Shoaib Ibne Hamid"
output: html_notebook
---
**Affiliation**: University of Chittagong
  

### Loading the assigned dataset into an R notebook:
#### Assigned dataset = diabetes.csv
##### This dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases. The aim is to forecast, based on diagnostic measurements, whether a patient is afflicted with diabetes. Stringent criteria were applied in selecting instances from a larger database, ensuring that all patients in this dataset are females, at least 21 years old, and of Pima Indian heritage. All columns in this dataset consist of numerical values.
##### Pregnancies: Number of occurrences of pregnancy
##### Glucose: Plasma glucose concentration two hours after an oral glucose tolerance test
##### BloodPressure:  Diastolic blood pressure (mm Hg)
##### SkinThickness: Triceps skin fold thickness (mm)
##### Insulin: 2-Hour serum insulin (mu U/ml)
##### BMI: Body mass index (weight in kg/(height in m)^2)
##### DiabetesPedigreeFunction: Diabetes pedigree function
##### Age: Age (years)
##### Outcome:  Class variable (0 or 1). This serves as the target column, where '0' indicates no disease, and '1' indicates the presence of disease.




##### 1. The assigned dataset "diabetes" was downloaded and moved into the working directory of RStudio.

##### 2. A notebook file was then created in RStudio, and the dataset was loaded into a variable.


<style>
div.teal pre { background-color:#90ee90 ; }
div.teal pre.r { background-color:#90ee90; }
</style>

<div class="teal">
```{r}
Diabetes = read.csv("diabetes.csv")
print(head(Diabetes))

```
</div>

### 3. Some basic information about the "Diabetes" dataset


#### Total number of columns and their names
<style>
div.offee pre { background-color: #99e7f0; }
div.offee pre.r { background-color: #99e7f0; }
</style>

<div class="offee">
```{r}

column_names <- colnames(Diabetes)
num_columns <- length(column_names)

# column names and the number of columns
cat("Column names of Diabetes are", column_names, "\n","\n")
cat("Total number of columns is", num_columns)
```
</div>

### a) Finding out the numerical columns

<style>
div.lyell pre { background-color: #fffdaf; }
div.lyell pre.r { background-color: #fffdaf; }
</style>

<div class="lyell">
```{r}

numerical_columns <- sapply(Diabetes, is.numeric) #extraction of the numerical columns

numerical_column_names <- names(numerical_columns)[numerical_columns]

print(numerical_column_names)

```
</div>
##### All 9 columns of the dataset are numerical.

### b) Finding out the character columns


```{r}
character_columns <- sapply(Diabetes, is.character) # Extraction of the character columns

character_column_names <- names(character_columns)[character_columns]

print(character_column_names)
```
#### So, there are no character columns in the dataset.
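
##### The two checks above can also be folded into a single summary of column types. The sketch below is only an illustration on a small made-up data frame (`toy_types` and its values are hypothetical, not part of the assignment data); it uses `vapply()` to collect the storage class of every column and `table()` to count columns per class.

```r
# Toy data frame with mixed column types (hypothetical values)
toy_types <- data.frame(
  Age = c(50L, 31L),
  BMI = c(33.6, 26.6),
  Group = c("a", "b"),
  stringsAsFactors = FALSE
)

# Storage class of each column; vapply() guarantees a character result
col_classes <- vapply(toy_types, function(col) class(col)[1], character(1))
print(col_classes)

# Number of columns per class
print(table(col_classes))
```

##### Applied to the real data frame, a summary like this would be expected to list only numeric classes, in line with the results above.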



#### c) Target variable of the dataset is "Outcome".




#### 4. No column is stored as a categorical (factor or character) type; the only categorical variable, "Outcome", is encoded numerically as 0 or 1.
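
##### Because the class label is stored as 0/1, it can be convenient to recode it as a factor before tabulating or plotting. The following is a minimal sketch on a made-up vector (`outcome` and its values are hypothetical); the labels "No Diabetes" and "Diabetes" follow the legend used in the scatter plot of this notebook.

```r
# Toy stand-in for the Outcome column (hypothetical values, not the real data)
outcome <- c(0, 1, 0, 0, 1, 1, 0)

# Recode the 0/1 labels as a factor with descriptive level names
outcome_factor <- factor(
  outcome,
  levels = c(0, 1),
  labels = c("No Diabetes", "Diabetes")
)

# Count the observations in each class and express them as proportions
class_counts <- table(outcome_factor)
class_share <- prop.table(class_counts)

print(class_counts)
print(round(class_share, 2))
```

##### The same pattern should carry over to `Diabetes$Outcome`, where the proportions would show how balanced the two classes are.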


### 5.a) Scatter plot of Pregnancies and BloodPressure

<style>
div.glue pre { background-color:  #CBC3E3; }
div.glue pre.r { background-color:  #CBC3E3; }
</style>

<div class="glue">
```{r}
Diabetes <- na.omit(Diabetes)   # data_cleaning
print(Diabetes)
variety_color = as.numeric(factor(Diabetes$Outcome))

plot(
  x = Diabetes$Pregnancies,
  y = Diabetes$BloodPressure,
  col = variety_color,
  pch = 19,
  cex = 1.3,
  
  xlab = "Pregnancies",
  ylab = "Blood Pressure",
  main = "Scatter Plot of Pregnancies vs Blood Pressure"
)

# A legend with more descriptive labels
legend(
  "topright",
  legend = c("No Diabetes", "Diabetes"),  # Descriptive labels
  col= 1:2,
  pch = 19,
  title = "Outcome"
)
```
</div>
##### This scatter plot visualizes the relationship between Pregnancies and Blood Pressure.
##### Each point is colored based on the "Outcome" (0 or 1), representing different classes. The legend indicates which color corresponds to each class.
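
##### A note on the cleaning step: `na.omit()` only drops rows that contain `NA`. In commonly distributed copies of this dataset, unrecorded measurements are reported to be stored as 0 rather than `NA`, so the call may leave the data unchanged. That is an assumption to verify against the actual file. The sketch below uses a made-up data frame (`toy`, its values, and the `zero_as_missing` column list are all hypothetical) to show one way of recoding such zeros before dropping incomplete rows.

```r
# Toy data frame (hypothetical values) in which 0 stands for "not measured"
toy <- data.frame(
  Glucose = c(148, 0, 183, 89),
  Insulin = c(0, 94, 0, 168),
  Outcome = c(1, 0, 1, 0)
)

# Columns where a value of 0 is physiologically implausible (assumed list)
zero_as_missing <- c("Glucose", "Insulin")

# Replace zeros with NA in those columns only; Outcome keeps its real zeros
toy_clean <- toy
for (col in zero_as_missing) {
  toy_clean[[col]][toy_clean[[col]] == 0] <- NA
}

# Count missing values per column, then drop incomplete rows
print(colSums(is.na(toy_clean)))
toy_complete <- na.omit(toy_clean)
print(nrow(toy_complete))
```

##### Whether dropping rows or imputing values is the better choice depends on how many rows are affected, which is worth checking before changing the analysis.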




### 5.b) Histogram of Insulin column
<style>
div.cof pre { background-color: #f9f09a; }
div.cof pre.r { background-color: #f9f09a; }
</style>

<div class="cof">
```{r}
hist(
  Diabetes$Insulin,
  breaks = 20,  # Increase the number of bins for better resolution
  main = "Histogram plot of Insulin Levels",
  
  col.main = "#008080",
  cex.main = 1.5,    # Adjust the size of the title
  cex.lab = 1.3,
  col.lab = c("Maroon"),
  font.main = 1,      # Use plain (non-bold) font for the title
  xlab = "Insulin",
  col = adjustcolor("purple", alpha = 0.7),  # Adjust alpha for transparency
  border = "white",  # White borders for better separation
  xlim = c(min(Diabetes$Insulin, na.rm = TRUE), max(Diabetes$Insulin, na.rm = TRUE))
)

# Add axis labels
axis(1, at = pretty(Diabetes$Insulin, n = 10), labels = TRUE, col.axis = "darkgray")
axis(2, labels = TRUE, col.axis = "darkgray")

# Addition of a grid for better readability
grid()



```
</div>


#####  This histogram plot illustrates the distribution of the "Insulin" feature in the "Diabetes" dataset.
##### The x-axis represents different ranges or bins of insulin values, and the y-axis shows the frequency of observations within each bin.
##### The color "purple" is used to fill the bars for visual distinction.
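
##### To make the bin and frequency description concrete, `hist()` can be asked to return its bins without drawing anything. The sketch below runs on made-up values (`insulin` is hypothetical) and sets the break points explicitly; when `breaks` is a single number such as 20, R treats it only as a suggestion and chooses "pretty" break points itself.

```r
# Toy insulin-like values (hypothetical, not taken from the dataset)
insulin <- c(0, 0, 15, 40, 88, 94, 120, 168, 230, 480)

# Compute the histogram without drawing it; breaks are set explicitly here
h <- hist(insulin, breaks = seq(0, 500, by = 100), plot = FALSE)

# Each count is the number of values falling in the corresponding bin
bins <- data.frame(
  from = head(h$breaks, -1),
  to = tail(h$breaks, -1),
  count = h$counts
)
print(bins)
```

##### The `count` column corresponds to the bar heights on the y-axis, and `from`/`to` give the edges of each bar on the x-axis.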




### 5.c) Boxplot of SkinThickness column
<style>
div.coffe pre { background-color: #B6FFBB; }
div.coffe pre.r { background-color: #B6FFBB; }
</style>

<div class="coffe">
```{r}
boxplot(
  Diabetes$SkinThickness,
  ylab = "Skin Thickness",
  col = "#48b5c4",
  pch = 19,
  border = "black",
  main = "Boxplot of Skin Thickness"
)


        
```
</div>
##### This boxplot visualizes the distribution of "SkinThickness" values.
##### The box represents the interquartile range (IQR), the line inside the box is the median, and the whiskers extend to the most extreme values lying within 1.5 times the IQR of the box; points beyond the whiskers are drawn individually as outliers. Together these summarize the central tendency and spread of the data.
##### There is one outlier in the boxplot, at a SkinThickness value close to 100.
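
##### The numbers behind a boxplot can be inspected with `boxplot.stats()`, which applies the same default 1.5 times IQR rule as `boxplot()`. The following sketch uses made-up values (`skin` is hypothetical) chosen so that one point lies well above the rest.

```r
# Toy skin-fold values (hypothetical); 99 is far from the rest
skin <- c(10, 18, 20, 23, 27, 29, 32, 35, 40, 99)

# boxplot.stats() returns the five numbers drawn by boxplot() and the outliers
s <- boxplot.stats(skin)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
print(s$stats)
# out: points lying beyond 1.5 * IQR from the hinges
print(s$out)
```

##### Here the upper whisker stops at 40 rather than at the maximum, and 99 is reported separately as an outlier, which mirrors what the boxplot draws.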




# 6. Using ggplot2 library functions to plot

### a) Scatterplot of all combinations of two features: Pregnancies and Glucose
<style>
div.sorange pre { background-color: #ffbe90; }
div.sorange pre.r { background-color: #ffbe90; }
</style>

<div class="sorange">
```{r}


library(ggplot2)

# Scatter plot using ggplot2
ggplot(Diabetes, aes(x = Pregnancies, y = Glucose, color = factor(Outcome))) +
  geom_point(size=2.8, alpha = 1) +
  labs(
    x = "Pregnancies",
    y = "Glucose",
    title = "Scatter Plot of Pregnancies and Glucose",
    caption = "Source: Iskulghar",
    color = "Outcome"
  ) +
  theme_minimal() + 
  theme(
    text = element_text(color = "darkblue", size = 14),  # Adjust font color and size
    legend.position = "top",  # Place legends at the top
    axis.text.x = element_text(color = "#FF7F7F", size = 14),
    axis.text.y = element_text(color = "#C4A484", size = 14),
    plot.title = element_text(color= "maroon", hjust = 0.5),
  
    legend.title = element_text(color = "darkgreen")
  )
  
```
</div>
##### This scatter plot shows "Glucose" against "Pregnancies", with points colored based on the "Outcome" class (0 or 1).
##### It includes proper axis labels, title, and a caption. Font colors and sizes are adjusted for readability, and the legend is positioned at the top.

### b) Boxplot of all columns, following all instructions
<style>
div.coffee pre { background-color: #f5c2c2; }
div.coffee pre.r { background-color: #f5c2c2; }
</style>

<div class="coffee">
```{r}
library(reshape2)


# Melting data into long format
melted_data <- melt(Diabetes)
# Boxplot of all columns
ggplot(melted_data, aes(x = variable, y = value, fill = variable)) +
  geom_boxplot() +
  labs(
    x = "Columns",
    y = "Values",
    title = "Boxplot of All Columns",
    caption = "Source: Iskulghar"
  ) +
  theme_bw() +
  theme(
    text = element_text(color = "darkblue", size = 10.5),
    legend.position = "top",
    legend.title = element_text(color = "#FF7F7F", size = 13),  # Legend title color and size
    plot.title = element_text(color= "maroon", hjust = 0.5)
  ) +
  facet_wrap(~variable, scales = "free")  # One panel per column, each with its own axis scales

```
</div>
##### This boxplot visualizes the distribution of values for all columns in the "Diabetes" dataset.
##### The data is melted into long format for easier plotting. Proper axis labels, title, and a caption are provided.
##### Font colors and sizes are adjusted, and the legend is positioned at the top for clarity.
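
##### To illustrate what "long format" means, the sketch below reshapes a small made-up data frame (`wide` and its values are hypothetical). It uses base R `stack()` as a stand-in for `reshape2::melt()`, so it is not the call used in this notebook; the columns are renamed to `value` and `variable` to match the names referenced in the ggplot call, although their order differs from what `melt()` returns.

```r
# Toy wide data frame (hypothetical values)
wide <- data.frame(
  Glucose = c(148, 85),
  BMI = c(33.6, 26.6)
)

# stack() is a base R stand-in for reshape2::melt() when there are no id columns
long <- stack(wide)

# Rename to the column names used in the ggplot call
names(long) <- c("value", "variable")
print(long)
```

##### Each original column becomes a block of rows labelled by `variable`, which is what lets `facet_wrap(~variable)` draw one panel per column.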
