DATA 606 Data Project Proposal

Data Preparation

# Attach libraries
library(psych)  # describe() for summary statistics
library(dplyr)  # count() for grouped tallies

# Read the data
diabetes_data <- read.csv("https://raw.githubusercontent.com/L-Velasco/Spring17_IS606/master/Final/diabetes.csv", stringsAsFactors = FALSE)

# Consider only complete cases:
# remove observations with missing hemoglobin, weight, height, waist or hip data
required_vars <- c("glyhb", "weight", "height", "waist", "hip")
diabetes_cases <- diabetes_data[complete.cases(diabetes_data[, required_vars]), ]

# Add another column to data frame, Waist-Hip Ratio (WHR) by calculating waist/hip
diabetes_cases$WHR <- diabetes_cases$waist / diabetes_cases$hip

# Add another column to data frame, Body Mass Index (BMI)
# Convert weight (pounds) to kilograms and height (inches) to meters,
# then calculate weight / height^2
diabetes_cases$BMI <- (diabetes_cases$weight * 0.453592) / ((diabetes_cases$height / 12) * 0.3048)^2

# Size of the filtered data (rows, columns)
dim(diabetes_cases)
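
As a quick sanity check on the unit conversion in the BMI calculation above, the sketch below compares the metric formula with the commonly quoted imperial shortcut (703 x pounds / inches^2). The helper names bmi_metric and bmi_imperial and the sample weights and heights are illustrative assumptions, not part of the original analysis or the data set.

```r
# Hypothetical helper: BMI from weight in pounds and height in inches,
# using the same conversion factors as the data preparation code
bmi_metric <- function(weight_lb, height_in) {
  weight_kg <- weight_lb * 0.453592        # pounds to kilograms
  height_m  <- (height_in / 12) * 0.3048   # inches to feet to meters
  weight_kg / height_m^2
}

# Hypothetical helper: the imperial shortcut, 703 * lb / in^2
bmi_imperial <- function(weight_lb, height_in) {
  703 * weight_lb / height_in^2
}

# Illustrative values only, not taken from the data set
w <- c(120, 180, 250)
h <- c(60, 68, 72)

# Largest absolute disagreement between the two formulas
max(abs(bmi_metric(w, h) - bmi_imperial(w, h)))
```

The two formulas should agree up to the rounding in the 703 constant, so a large discrepancy would point to a unit error in the conversion.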

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

What is the relationship between Waist-Hip Ratio (WHR) and Body Mass Index (BMI), and which of the two measures is the stronger predictor of Type 2 Diabetes risk?
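
One possible way to approach this question, sketched here under stated assumptions, is to look at the correlation between the two measures and then compare how much variance in the response each one explains on its own. The data frame toy below is synthetic, generated only so the example runs by itself; its coefficients are arbitrary and say nothing about the real data. Comparing R-squared from separate simple regressions is one illustrative choice, not necessarily the method the final project will use.

```r
# Synthetic stand-in data (NOT the diabetes data set); values are arbitrary
set.seed(606)
n <- 200
toy <- data.frame(
  WHR = rnorm(n, mean = 0.88, sd = 0.07),
  BMI = rnorm(n, mean = 28.8, sd = 6.6)
)
toy$glyhb <- 5.5 + 2 * (toy$WHR - 0.88) + 0.03 * (toy$BMI - 28.8) + rnorm(n)

# Relationship between the two explanatory measures
cor(toy$WHR, toy$BMI)

# Separate simple linear regressions of the response on each measure
fit_whr <- lm(glyhb ~ WHR, data = toy)
fit_bmi <- lm(glyhb ~ BMI, data = toy)

# Share of variance in the response explained by each measure alone
r2 <- c(WHR = summary(fit_whr)$r.squared, BMI = summary(fit_bmi)$r.squared)
r2
```

With the real data, diabetes_cases would take the place of toy. Because this is an observational study, a higher R-squared would indicate a stronger association, not a causal effect.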

Cases

What are the cases, and how many are there?

Each case represents an African American resident of central Virginia who was screened for diabetes. There are 403 observations in the given data set.

For this project, only the 382 complete cases will be included, that is, those with no missing hemoglobin, weight, height, waist or hip values.
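
The step from all observations to complete cases can be illustrated with a small made-up data frame. The values in toy are invented for illustration; only the five column names are taken from the data preparation code.

```r
# Made-up rows for illustration only; not taken from the diabetes data set
toy <- data.frame(
  glyhb  = c(4.3, NA, 5.6, 7.1),
  weight = c(150, 180, NA, 200),
  height = c(64, 70, 66, 68),
  waist  = c(30, 36, 34, 40),
  hip    = c(38, 40, 39, 44)
)

# Variables that must all be present for a case to be kept
required <- c("glyhb", "weight", "height", "waist", "hip")

# TRUE for rows with no missing value in any required variable
keep <- complete.cases(toy[, required])
toy_cases <- toy[keep, ]

# How many rows were kept and how many were dropped
c(total = nrow(toy), complete = sum(keep), dropped = sum(!keep))
```

Applied to the real data, the same tally would show how many of the 403 observations are dropped to arrive at the 382 complete cases.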

Data collection

Describe the method of data collection.

A total of 1,046 subjects were interviewed in a study of the prevalence of obesity and diabetes among African Americans in central Virginia. The 403 subjects included in this data set are the ones who were actually screened for diabetes.

Type of study

What type of study is this (observational/experiment)?

This is an observational study.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

These data are courtesy of Dr John Schorling, Department of Medicine, University of Virginia School of Medicine, and were obtained from http://biostat.mc.vanderbilt.edu/DataSets

The Vanderbilt website cites the following for further information about the study: Willems JP, Saunders JT, Hunt DE, Schorling JB: Prevalence of coronary heart disease risk factors among rural blacks: A community-based study. Southern Medical Journal 90:814-820; 1997.

and

Schorling JB, Roach J, Siegel M, Baturka N, Hunt DE, Guterbock TM, Stewart HL: A trial of church-based smoking cessation interventions for rural African Americans. Preventive Medicine 26:92-101; 1997.

Response

What is the response variable, and what type is it (numerical/categorical)?

The response variable is glycosylated hemoglobin (glyhb), which is numerical.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

The explanatory variables are Waist-Hip Ratio (WHR) and Body Mass Index (BMI), both calculated from the recorded measurements. Both are numerical.

Relevant summary statistics

Provide summary statistics relevant to your research question. For example, if you are comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

# Describe statistics for the variables of interest
describe(diabetes_cases$WHR)

##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 382 0.88 0.07   0.88    0.88 0.07 0.68 1.14  0.46 0.38     0.56  0

describe(diabetes_cases$BMI)

##    vars   n mean   sd median trimmed  mad  min   max range skew kurtosis
## X1    1 382 28.8 6.64  27.81   28.27 6.07 15.2 55.79 40.58 0.81      0.8
##      se
## X1 0.34

describe(diabetes_cases$glyhb)

##    vars   n mean   sd median trimmed  mad  min   max range skew kurtosis
## X1    1 382 5.58 2.21   4.84    5.11 0.83 2.68 16.11 13.43 2.26     5.24
##      se
## X1 0.11

# Histograms of the variables of interest
hist(diabetes_cases$WHR)

hist(diabetes_cases$BMI)

hist(diabetes_cases$glyhb)

# Observations by location and gender
count(diabetes_cases, location, gender)

## Source: local data frame [4 x 3]
## Groups: location [?]
## 
##     location gender     n
##        <chr>  <chr> <int>
## 1 Buckingham female   107
## 2 Buckingham   male    80
## 3     Louisa female   115
## 4     Louisa   male    80