This data set contains medical and demographic data on 100,000 individuals, along with whether each individual has diabetes. It has 9 variables. Five are categorical: the patient's gender, whether the patient has hypertension, whether the patient has heart disease, the patient's smoking history, and whether the patient has diabetes. Four are numeric: the patient's age, BMI, HbA1c level, and blood glucose level.
This data set was intended to support building predictive models and machine learning algorithms that predict whether a patient has diabetes. We will build a simple logistic regression model that uses blood glucose level to predict the probability of a patient having diabetes. This analysis is practical: if a non-zero relationship between blood glucose level and the probability of diabetes is found and measured, the resulting model could be a valuable tool for identifying individuals at risk of developing diabetes based on their glucose levels.
A histogram of blood glucose levels for individuals with and without diabetes shows a clear difference between the two distributions. None of the patients with glucose levels below 100 have diabetes. At the other end, all of the patients with glucose levels above 200 have diabetes. These results give us strong evidence that we will find meaningful parameters in our logistic regression analysis. One issue is that the blood glucose levels are strongly right-skewed. Because of this, we will apply a log transformation to the blood glucose levels to reduce the skew and bring the variable closer to a normal distribution.
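To illustrate why a log transformation helps with right-skewed data, the following minimal sketch computes the sample skewness of a set of values before and after taking logs. The readings are made up for illustration and are not drawn from the data set, `skewness` is a hypothetical helper, and the natural log is assumed.

```python
import math

def skewness(values):
    """Return the moment-based sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Made-up, right-skewed glucose-like readings (NOT taken from the data set).
glucose = [80, 85, 90, 100, 126, 130, 140, 145, 155, 160, 200, 260, 300]

# Natural log transformation (assumed to match the transformation in the text).
log_glucose = [math.log(g) for g in glucose]

raw_skew = skewness(glucose)
log_skew = skewness(log_glucose)
print(f"skewness before log: {raw_skew:.3f}")
print(f"skewness after log:  {log_skew:.3f}")
```

A positive skewness indicates a long right tail. Because the log compresses large values more than small ones, the skewness of the transformed values is smaller than that of the raw values.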
The following graph shows the blood glucose levels after the log transformation.
The following are our estimates of the intercept and slope for our simple logistic regression model, which predicts the probability of diabetes from the log of glucose level. Our odds ratio is 338.0158, meaning that each one-unit increase in log glucose multiplies the odds of the patient having diabetes by about 338. Note that the odds ratio describes a multiplicative change in the odds, not a percentage change in the probability itself.
| | Estimate | Std. Error | z value | Pr(>\|z\|) | 2.5 % | 97.5 % | odds.ratio |
|---|---|---|---|---|---|---|---|
| (Intercept) | -31.752761 | 0.3078209 | -103.1534 | 0 | -32.358849 | -31.152135 | 0.0000 |
| bgltran | 5.823093 | 0.0598451 | 97.3028 | 0 | 5.706304 | 5.940908 | 338.0158 |
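As a check on how the table is read, the sketch below recomputes the odds ratio from the `bgltran` coefficient and converts the fitted log-odds into probabilities. The coefficients are copied from the table; the function name `predicted_probability` and the example glucose values are hypothetical, and the sketch assumes `bgltran` is the natural log of blood glucose.

```python
import math

# Coefficients copied from the table above (log-odds scale).
INTERCEPT = -31.752761
SLOPE = 5.823093  # coefficient on bgltran, assumed to be ln(blood glucose)

def predicted_probability(glucose):
    """Fitted probability of diabetes at a raw (untransformed) glucose level."""
    log_odds = INTERCEPT + SLOPE * math.log(glucose)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Exponentiating the slope gives the odds ratio per one-unit increase in log glucose.
odds_ratio = math.exp(SLOPE)

# Exponentiating the 2.5 % and 97.5 % limits gives an interval for the odds ratio.
or_low, or_high = math.exp(5.706304), math.exp(5.940908)

print(f"odds ratio: {odds_ratio:.1f} (95% CI {or_low:.1f} to {or_high:.1f})")

# Illustrative glucose values only; they are not specific patients from the data.
for g in (100, 150, 200, 250, 300):
    print(f"glucose {g}: P(diabetes) = {predicted_probability(g):.3f}")
```

Working on the probability scale like this makes the distinction concrete: the odds change by a constant factor per unit of log glucose, while the change in probability depends on where on the curve the patient sits.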
Some important points to note in the S-curve graphs: at a log glucose level of about 5, the probability of the individual having diabetes begins to increase dramatically. The rate of change of the probability peaks at a log glucose level of about 5.4, where the fitted probability is 0.5, and decreases after that. The steepness of the graph can be attributed both to the nature of the relationship and to the nature of the log transformation of glucose levels.

Probability curve and rate of change of probability
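The location of the peak can be recovered from the fitted coefficients. For a logistic model with fitted probability p, the slope of the curve with respect to log glucose is the coefficient times p(1 - p), which is largest where p = 0.5, that is, where the log-odds are zero. The sketch below is a minimal illustration using the coefficients from the table; the function names are hypothetical.

```python
import math

# Coefficients copied from the regression table (log-odds scale).
INTERCEPT = -31.752761
SLOPE = 5.823093

def probability(log_glucose):
    """Fitted probability of diabetes at a given log glucose level."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + SLOPE * log_glucose)))

def rate_of_change(log_glucose):
    """Derivative of the fitted probability with respect to log glucose."""
    p = probability(log_glucose)
    return SLOPE * p * (1.0 - p)

# The derivative is largest where the log-odds equal zero (p = 0.5).
peak_log_glucose = -INTERCEPT / SLOPE
peak_rate = rate_of_change(peak_log_glucose)

print(f"peak at log glucose = {peak_log_glucose:.3f}, rate = {peak_rate:.3f}")
for x in (5.0, 5.2, 5.4, 5.6, 5.8):
    print(f"log glucose {x}: p = {probability(x):.3f}, dp/dx = {rate_of_change(x):.3f}")
```

The peak falls at the intercept divided by the slope with the sign flipped, and the maximum rate equals one quarter of the slope coefficient.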