02/19/2023

Background Information

Diabetes is one of the most common chronic diseases in the United States, affecting the lives of millions of people each year and putting a substantial strain on the economy. It is a chronic condition in which people lose the capacity to manage the glucose levels in their blood efficiently, leading to serious health problems, reduced quality of life, and shorter life expectancy. During digestion, foods are broken down into sugars and other nutrients. Insulin, a hormone produced by the pancreas, enables the cells of the body to use glucose for energy. In a person with diabetes, the body either does not produce enough insulin or cannot use the insulin it produces effectively.

Complications such as heart disease, vision loss, lower-limb amputation, and kidney disease are typical for people in advanced stages of diabetes. Losing weight, eating healthily, being active, and receiving medical treatment can mitigate the harms of the disease for many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, which makes predictive models for diabetes risk important tools in public health.

We will be using the following data set: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset. Detailed information about it can be found on the website above. It is derived from the annual Behavioral Risk Factor Surveillance System (BRFSS) telephone survey conducted by the CDC. We will look at some of the indicators in the survey to see whether they are correlated with instances of the disease. Armed with that information, we will be able to develop indicators that help predict an individual's chances of developing the condition.

Setting Up Our Data

First we will “read in” the data set and see what it looks like.

##   Diabetes_012 HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack
## 1            0      1        1         1  40      1      0                    0
## 2            0      0        0         0  25      1      0                    0
## 3            0      1        1         1  28      0      0                    0
## 4            0      1        0         1  27      0      0                    0
## 5            0      1        1         1  24      0      0                    0
## 6            0      1        1         1  25      1      0                    0
##   PhysActivity Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost
## 1            0      0       1                 0             1           0
## 2            1      0       0                 0             0           1
## 3            0      1       0                 0             1           1
## 4            1      1       1                 0             1           0
## 5            1      1       1                 0             1           0
## 6            1      1       1                 0             1           0
##   GenHlth MentHlth PhysHlth DiffWalk Sex Age Education Income
## 1       5       18       15        1   0   9         4      3
## 2       3        0        0        0   0   7         6      1
## 3       5       30       30        1   0   9         4      8
## 4       2        0        0        0   0  11         3      6
## 5       2        3        0        0   0  11         5      4
## 6       2        0        2        0   1  10         6      8
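A listing like the one above could be produced with `read.csv()` and `head()`. The sketch below is only an illustration: the slides do not show the original code, so the temporary file is a stand-in holding three of the rows and a few of the columns printed above. In a real run the path would point at the CSV downloaded from Kaggle (the file name is an assumption and depends on the download).

```r
# Stand-in for the downloaded Kaggle CSV (assumption: the real file would be
# read the same way, just from its actual path on disk).
sample_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Diabetes_012,HighBP,HighChol,BMI,Smoker,Sex,Age",
  "0,1,1,40,1,0,9",
  "0,0,0,25,1,0,7",
  "0,1,1,28,0,0,9"
), sample_csv)

# Read the file into a data frame and inspect it.
diabetes <- read.csv(sample_csv)
head(diabetes)  # first rows (up to six)
dim(diabetes)   # number of rows and columns
```

With the full data set, `dim()` is a quick way to confirm how many columns need to be narrowed down before exploring.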

Data Set Up Continued

As is evident from our attempt to display the top rows of the data set, we have too many columns to show at once. The online documentation confirms that the data set has 22 columns. It is a good idea to select a subset of them to look at first, starting with the ones we think are most likely to give us insights.

##   Diabetes_012 HighBP HighChol BMI Smoker HvyAlcoholConsump Sex Age
## 1            0      1        1  40      1                 0   0   9
## 2            0      0        0  25      1                 0   0   7
## 3            0      1        1  28      0                 0   0   9
## 4            0      1        0  27      0                 0   0  11
## 5            0      1        1  24      0                 0   0  11
## 6            0      1        1  25      1                 0   1  10
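The narrowed-down table above can be obtained by selecting columns by name. This is a minimal base R sketch on a hand-typed stand-in data frame (three rows copied from the output above, with only some of the dropped columns included); the object names `diabetes`, `keep`, and `diabetes_small` are illustrative, not taken from the original analysis.

```r
# Stand-in data frame with more columns than we want to keep.
diabetes <- data.frame(
  Diabetes_012 = c(0, 0, 0), HighBP = c(1, 0, 1), HighChol = c(1, 0, 1),
  CholCheck = c(1, 0, 1), BMI = c(40, 25, 28), Smoker = c(1, 1, 0),
  Stroke = c(0, 0, 0), HvyAlcoholConsump = c(0, 0, 0),
  Sex = c(0, 0, 0), Age = c(9, 7, 9)
)

# The eight columns shown in the reduced table above.
keep <- c("Diabetes_012", "HighBP", "HighChol", "BMI",
          "Smoker", "HvyAlcoholConsump", "Sex", "Age")

# Select columns by name; the order of `keep` sets the column order.
diabetes_small <- diabetes[, keep]
head(diabetes_small)
```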

Exploring BMI and Diabetes

In the following example we explore the relationship, if any, between BMI and diabetes. We can clearly observe that the number of instances of diabetes is highest above the trend line, specifically between BMI 24 and 36. A BMI of 40 or above is classified as severe obesity. The number of instances of diabetes in people with a BMI higher than 40 is lower in this survey, possibly because a BMI that high is rarer.

## `geom_smooth()` using formula = 'y ~ x'
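The counting step behind a plot like this can be sketched in base R. The data below are hypothetical toy values, not survey results, and the coding of `Diabetes_012` (0 = no diabetes, 1 = prediabetes, 2 = diabetes) is taken from the data set's documentation. The plot itself appears to have been drawn with ggplot2, judging by the `geom_smooth()` message; this sketch covers only the tabulation.

```r
# Hypothetical toy data (NOT survey results).
toy <- data.frame(
  Diabetes_012 = c(2, 2, 2, 0, 2, 0, 2, 1, 0, 2),
  BMI          = c(27, 31, 27, 22, 45, 25, 35, 30, 41, 24)
)

# Keep only respondents coded as having diabetes.
cases <- toy[toy$Diabetes_012 == 2, ]

# Number of diabetes cases at each BMI value.
cases_per_bmi <- table(cases$BMI)
cases_per_bmi

# Share of diabetes cases that fall inside the BMI band 24 to 36.
in_band <- cases$BMI >= 24 & cases$BMI <= 36
share_in_band <- mean(in_band)
share_in_band
```

Changing the filter to `Diabetes_012 == 1` would give the same tabulation for prediabetes.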

Exploring BMI and Prediabetes

## `geom_smooth()` using formula = 'y ~ x'

The population of people with prediabetes is significantly larger. The plot on this slide further confirms what we saw on the previous slide: BMI values between 24 and 38 show a high correlation with prediabetes.

Both slides display a similar pattern: there is a band of BMI values, between 24 and 38, where we find the bulk of individuals with prediabetes and diabetes.

Heavy Alcohol Consumption and Diabetes

We do not see any indication that heavy alcohol consumption (as defined by the data set) is correlated with a high number of instances of diabetes.

Tobacco Consumption and Diabetes

Tobacco use also does not show any strong correlation with the number of diabetes instances in the population surveyed.

High Blood Pressure and Diabetes

Individuals with high blood pressure have almost three times as many instances of diabetes as individuals without high blood pressure. This indicator shows a strong correlation. Correlation does not necessarily imply causation, and further investigation is needed, but it is clear that high blood pressure and diabetes tend to occur together.
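A comparison of this kind can be sketched with a cross-tabulation. The values below are hypothetical toy data, so the resulting ratio is not the survey's figure; the coding `Diabetes_012 == 2` for diabetes follows the data set's documentation. The sketch reports both raw counts and within-group rates, since the two can differ when the groups are of unequal size.

```r
# Hypothetical toy data (NOT survey results).
toy <- data.frame(
  HighBP       = c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  Diabetes_012 = c(2, 2, 0, 2, 0, 2, 0, 0, 0, 0)
)

has_diabetes <- toy$Diabetes_012 == 2

# Raw counts: rows are HighBP (0/1), columns are diabetes (FALSE/TRUE).
counts <- table(HighBP = toy$HighBP, diabetes = has_diabetes)
counts

# Share of each HighBP group that has diabetes.
rates <- tapply(has_diabetes, toy$HighBP, mean)
rates

# How many times higher the rate is in the high blood pressure group.
rate_ratio <- rates[["1"]] / rates[["0"]]
rate_ratio
```

The same pattern applies to the other 0/1 indicators in the data set, such as `HighChol`, `Smoker`, `HvyAlcoholConsump`, and `Sex`, by swapping the column name.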

High Cholesterol and Diabetes

In this case, there are almost twice as many instances of diabetes in the population with high cholesterol. That is another strong relationship that needs to be investigated further.

Gender and Diabetes

The data show that women are about 5% more likely to present with diabetes. Further research is needed to confirm this finding. If it holds true, I would be interested in finding out the reason behind this relationship.

Age and Diabetes

We can see a good correlation between age and diabetes instances up until the age of 70. In the last three age categories, the number of instances goes down with each consecutive category. That is an interesting observation and worth looking into. As far as our goal of determining health indicators for diabetes is concerned, we see a dramatic (almost exponential) increase with age above the 30-year mark.
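One way to look into the drop in the oldest categories is to put the number of cases next to the number of respondents in each age category. The sketch below uses hypothetical toy data, not survey results. In the data set's documentation, `Age` is a 13-level category code (1 = 18 to 24, 13 = 80 or older) rather than an age in years.

```r
# Hypothetical toy data (NOT survey results); Age is a category code.
toy <- data.frame(
  Age          = c(9, 9, 9, 9, 11, 11, 11, 13, 13, 5, 5, 5),
  Diabetes_012 = c(2, 0, 0, 2, 2, 2, 0, 2, 0, 0, 0, 2)
)

has_diabetes <- toy$Diabetes_012 == 2

# One row per age category: group size, number of cases, and case rate.
by_age <- data.frame(
  Age         = sort(unique(toy$Age)),
  respondents = as.vector(table(toy$Age)),
  cases       = as.vector(tapply(has_diabetes, toy$Age, sum))
)
by_age$rate <- by_age$cases / by_age$respondents
by_age
```

In this toy example the oldest category has fewer cases than the one before it simply because it has fewer respondents. Whether the same holds in the survey would need to be checked against the real data.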

Conclusions

  • BMI values between roughly 24 and 37 show the greatest number of diabetes and prediabetes cases. We need to perform a more in-depth analysis to establish clear correlations and more precise values.

  • Of the 6 other factors we looked at, 3 indicate a high correlation: high cholesterol, high blood pressure, and age.

  • In the case of gender, we saw a skewed distribution between males and females, with about 5% more of the cases falling in the female group.

The overall result of these findings is that, after a little more research, we can devise a model that estimates an individual's chances of developing diabetes. With tools of that nature, healthcare professionals are better equipped to advise and help their patients.
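As a rough illustration of what such a model could look like, the sketch below fits a logistic regression with base R's `glm()`. Logistic regression is only one possible choice and is not something the slides commit to. The data are simulated, and the coefficients used to generate them are arbitrary assumptions, so the fitted numbers say nothing about the survey.

```r
set.seed(1)
n <- 500

# Simulated predictors (NOT survey data); Age is a 1-13 category code.
toy <- data.frame(
  HighBP   = rbinom(n, 1, 0.4),
  HighChol = rbinom(n, 1, 0.4),
  BMI      = round(rnorm(n, mean = 28, sd = 6)),
  Age      = sample(1:13, n, replace = TRUE)
)

# Simulated 0/1 outcome; the coefficients here are arbitrary assumptions.
lin <- -6 + 0.9 * toy$HighBP + 0.6 * toy$HighChol +
  0.08 * toy$BMI + 0.15 * toy$Age
toy$diabetes <- rbinom(n, 1, plogis(lin))

# Fit a logistic regression of the outcome on the four indicators.
fit <- glm(diabetes ~ HighBP + HighChol + BMI + Age,
           data = toy, family = binomial)
summary(fit)$coefficients

# Estimated probability for one hypothetical individual.
new_person <- data.frame(HighBP = 1, HighChol = 1, BMI = 32, Age = 10)
risk <- predict(fit, newdata = new_person, type = "response")
risk
```

With the real data, the outcome would first need to be recoded to 0/1 (for example, `Diabetes_012 == 2`), and the model would need to be validated on held-out respondents before being used to advise anyone.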