Content
- Load the diabetes dataset into our workspace.
The resulting data frame, named data in R, has 768 rows and 9 columns (768 x 9).
library(tidyverse) # provides read_csv() and glimpse()
data <- read_csv("diabetes.csv")
We can also check the dimensions of the data frame, the names and types of the variables, and the first few observations by passing the data set to the glimpse() function, as seen below:
glimpse(data)
## Rows: 768
## Columns: 9
## $ Pregnancies <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <dbl> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
- All variables have type “dbl,” the abbreviation R uses for double-precision floating-point numbers, that is, numeric values.
- However, ‘Outcome’ takes only the values 0 and 1, so it is better treated as categorical.
- The dataset has one target (response, or dependent) variable, named ‘Outcome.’
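Since ‘Outcome’ is stored as a number but behaves like a class label, one option is to recode it as a factor. The snippet below is a minimal sketch rather than part of the original analysis: toy is a hypothetical data frame built from the first five values shown in the glimpse() output, so that the example runs without diabetes.csv.

```r
# Toy data frame (hypothetical name) using the first five Glucose and
# Outcome values from the glimpse() output above.
toy <- data.frame(
  Glucose = c(148, 85, 183, 89, 137),
  Outcome = c(1, 0, 1, 0, 1)
)

# Recode the 0/1 target as a factor so R treats it as categorical.
toy$Outcome <- factor(toy$Outcome, levels = c(0, 1))

str(toy$Outcome)    # structure now reports a factor with two levels
table(toy$Outcome)  # number of observations in each class
```

On the full data set the same idea would read data$Outcome <- factor(data$Outcome, levels = c(0, 1)). No level labels are attached here because the variable description only states that the classes are 0 and 1.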
We have 768 observations of 9 variables: eight numerical predictors and the target, ‘Outcome’, which is categorical. The meaning of each variable is as follows:
Table 1: Variables and descriptions
| Variable | Description |
|---|---|
| Outcome | Class variable (0 or 1) |
| Pregnancies | Number of times pregnant |
| Glucose | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
| BloodPressure | Diastolic blood pressure (mm Hg) |
| SkinThickness | Triceps skin fold thickness (mm) |
| Insulin | 2-Hour serum insulin (mu U/ml) |
| BMI | Body mass index (weight in kg/(height in m)^2) |
| DiabetesPedigreeFunction | Diabetes pedigree function |
| Age | Age (years) |
Since there are no categorical variables in the dataset apart from the target variable, I am not removing any variable from the dataset. However, to remove variables we could use the following code:
data$x <- NULL        # remove a single variable; x is a placeholder name
data[, c(1, 3:5, 9)]  # return only the desired columns, selected by position (assign the result to keep it)
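Selecting columns by position breaks silently if the column order changes, so selecting by name may be safer. The following is a small sketch in base R, assuming a hypothetical toy data frame built from the first three rows of a few columns in the glimpse() output; the object names toy, without_age and kept are illustrative only.

```r
# Toy data frame (hypothetical name) with values taken from the
# glimpse() output shown earlier.
toy <- data.frame(
  Pregnancies = c(6, 1, 8),
  Glucose     = c(148, 85, 183),
  Age         = c(50, 31, 32),
  Outcome     = c(1, 0, 1)
)

# Drop one column by name, returning a new data frame.
without_age <- toy[, setdiff(names(toy), "Age")]

# Keep a chosen set of columns by name.
kept <- toy[, c("Glucose", "Outcome")]

names(without_age)
names(kept)
```

Both calls leave toy itself unchanged, which makes it easy to compare the reduced data with the original. With the tidyverse loaded, dplyr's select() offers a similar name-based approach.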