install.packages("ggplot2")

Dataset and Assignment Overview

Scenario: You are a health analytics consultant who has been asked to perform an exploratory data analysis regarding diabetes. You are given a dataset containing information for 100 patients randomly sampled from all patients at a specific hospital. The dataset includes the following:

suppressWarnings(suppressMessages(library(ggplot2)))
suppressWarnings(suppressMessages(library(dplyr)))

### Fill in the path to the dataset
# Read the CSV file into a data frame
diabetes_data <- read.csv("C:/Users/r98cb/Downloads/diabetes_data.csv")

# Attach the data frame so its columns can be referenced by name below
attach(diabetes_data)

Assignment (31 points)

Part 1 (16 points)

Answer the following questions by filling in the blanks with the appropriate answers.

Question 1 (2 points)

The population in this scenario is all patients at the specific hospital.

The sample in this scenario is 100 randomly sampled patients.

Question 2 (4 points)

One piece of information you are interested in is the average age of patients at the hospital. You calculate the average age for the 100 patients in the sample. Identify the population parameter and sample statistic in this scenario, including the appropriate symbols for these.

Hint: Below are the different population parameters and statistics with their corresponding symbols.

  • Population Mean: \(\mu\)
  • Sample Mean: \(\bar{x}\)
  • Population Variance: \(\sigma^2\)
  • Sample Variance: \(s^2\)
  • Population Standard Deviation: \(\sigma\)
  • Sample Standard Deviation: \(s\)
  • Population Proportion: \(p\)
  • Sample Proportion: \(\hat{p}\)

The population parameter in this scenario is population mean and the corresponding symbol is \(\mu\). The sample statistic in this scenario is sample mean and the corresponding symbol is \(\bar{x}\).
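The difference between a parameter and a statistic can be illustrated with a small simulation. The sketch below assumes a made-up population of 5,000 patient ages; the `population_ages` vector and its distribution are hypothetical and are not the hospital's data.

```r
# Hypothetical population of patient ages (simulated, not the hospital data)
set.seed(42)
population_ages <- round(rnorm(5000, mean = 47, sd = 15))

# Population parameter: the mean of every value in the population (mu)
mu <- mean(population_ages)

# Sample statistic: the mean of a random sample of 100 patients (x-bar)
sample_ages <- sample(population_ages, size = 100)
x_bar <- mean(sample_ages)

cat("Population mean (mu):", round(mu, 2), "\n")
cat("Sample mean (x-bar):", round(x_bar, 2), "\n")
```

In practice only the sample mean is observable, and it serves as an estimate of the unknown population mean.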

Question 3 (1 point)

In which sampling method would patients be divided into different age groups, with a proportional number of patients then selected from each age group? Proportional Stratified Sampling
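A minimal sketch of proportional stratified sampling in base R follows. The `patients` data frame, the age-group labels, and the 10% sampling fraction are all assumptions made for illustration.

```r
# Hypothetical patient list with an age-group label (simulated for illustration)
set.seed(1)
patients <- data.frame(
  id = 1:1000,
  age_group = sample(c("18-39", "40-64", "65+"), size = 1000,
                     replace = TRUE, prob = c(0.3, 0.5, 0.2))
)

sampling_fraction <- 0.10  # assumed: take 10% of each stratum

# Split row indices by stratum, then sample the same fraction within each one
rows_by_group <- split(seq_len(nrow(patients)), patients$age_group)
sampled_rows <- unlist(lapply(rows_by_group, function(rows) {
  n_take <- round(length(rows) * sampling_fraction)
  rows[sample.int(length(rows), n_take)]
}))
stratified_sample <- patients[sampled_rows, ]

# The sample's age-group shares should closely match the population's
print(prop.table(table(patients$age_group)))
print(prop.table(table(stratified_sample$age_group)))
```

Because each stratum contributes the same fraction of its members, the age-group proportions in the sample stay close to those in the full patient list.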

Question 4 (1 point)

Suppose that this sample was collected by assigning a unique number to each patient at the hospital and then using a random number generator to select 100 of these numbers. This would be an example of simple random sampling.
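The procedure described here could be sketched in R as follows. The hospital's total patient count is not given in the scenario, so `n_patients` is a hypothetical value.

```r
# Assumed total number of patients at the hospital (hypothetical value)
n_patients <- 2500

# Assign each patient a unique number
patient_ids <- seq_len(n_patients)

# Randomly select 100 of those numbers without replacement,
# so every patient has the same chance of being chosen
set.seed(123)
selected_ids <- sample(patient_ids, size = 100, replace = FALSE)

print(head(sort(selected_ids)))
```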

Question 5 (4 points)

For each variable listed below, determine whether it is qualitative or quantitative:

  • Gender: qualitative
  • Age: quantitative
  • Smoking_Status: qualitative
  • Fasting_Blood_Sugar: quantitative

Question 6 (4 points)

For each variable listed below, determine whether it is discrete or continuous. If it is qualitative, just put NA in the line provided.

  • Gender: NA
  • Number_of_Visits: discrete
  • Fasting_Blood_Sugar: continuous
  • HbA1c: continuous
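One rough way to check these classifications in R is to inspect how each column is stored. The sketch below uses a hypothetical three-row `toy_data` frame that borrows the assignment's column names; the values are made up. Storage type is only a heuristic, since a categorical variable can also be coded with numbers.

```r
# Hypothetical rows using the assignment's column names (values are made up)
toy_data <- data.frame(
  Gender = c("Female", "Male", "Female"),
  Number_of_Visits = c(2L, 5L, 1L),
  Fasting_Blood_Sugar = c(88.4, 102.7, 95.1),
  stringsAsFactors = FALSE
)

# Character or factor columns are treated as qualitative; numeric as quantitative
variable_kind <- sapply(toy_data, function(column) {
  if (is.numeric(column)) "quantitative" else "qualitative"
})
print(variable_kind)

# Among quantitative columns, whole-number counts suggest a discrete variable
is_whole <- sapply(toy_data[variable_kind == "quantitative"], function(column) {
  all(column == round(column))
})
print(is_whole)
```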

Part 2 (15 points)

Question 1 (2 points)

For the following variables, compute and display an appropriate measure of central tendency.

Exercise_Frequency

table(Exercise_Frequency)
## Exercise_Frequency
##         None Occasionally       Rarely    Regularly 
##           28           24           28           20
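For a qualitative variable the usual measure of central tendency is the mode, which can be read from the table or extracted in code. The sketch below rebuilds a vector with the same category counts as the table above (the `exercise` vector is a stand-in, not the original column) and returns every category tied for the highest count.

```r
# Rebuild a vector with the same category counts as the table above
exercise <- rep(c("None", "Occasionally", "Rarely", "Regularly"),
                times = c(28, 24, 28, 20))

counts <- table(exercise)

# The mode is every category whose count equals the maximum count
modes <- names(counts)[counts == max(counts)]
print(modes)
## [1] "None"   "Rarely"
```

With these counts, None and Rarely are tied at 28, so the variable has two modes.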

HbA1c

mean_value <- mean(HbA1c)
print(paste("The mean of HbA1c is:", mean_value))
## [1] "The mean of HbA1c is: 5.65"
median_value <- median(HbA1c)
print(paste("The median of HbA1c is:", median_value))
## [1] "The median of HbA1c is: 5.7"

Question 2 (2 points)

For the following variables, compute and display an appropriate measure of variability.

Diabetes_Type

table(Diabetes_Type)
## Diabetes_Type
##   None Type 1 Type 2 
##     23     49     28

Age

mean_value <- mean(Age)
print(paste("The average Age:", mean_value))
## [1] "The average Age: 46.57"
median_value <- median(Age)
print(paste("The median Age is:", median_value))
## [1] "The median Age is: 45.5"
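The mean and median describe the center of Age rather than its spread. Measures of variability for a quantitative variable could be computed along the following lines. The `ages` vector is a hypothetical stand-in, so the values it produces are not the dataset's results.

```r
# Hypothetical ages standing in for the Age column
ages <- c(30, 40, 50, 60, 70)

# Standard deviation and variance measure spread around the mean
age_sd <- sd(ages)
age_var <- var(ages)

# Interquartile range measures the spread of the middle 50% of values
age_iqr <- IQR(ages)

# Range gives the minimum and maximum
age_range <- range(ages)

print(paste("The standard deviation of Age is:", round(age_sd, 2)))
print(paste("The IQR of Age is:", age_iqr))
print(paste("Age ranges from", age_range[1], "to", age_range[2]))
```

With the data frame attached as in this assignment, the same calls could be applied directly to `Age`.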

Complete the questions below by filling in the blanks with the appropriate answers or by providing the necessary R code.

Question 3 (4 points)

Below is a histogram of Fasting_Blood_Sugar.

# Use a separate plotting data frame so the full diabetes_data is not overwritten
fbs_data <- data.frame(Fasting_Blood_Sugar)
ggplot(fbs_data, aes(x = Fasting_Blood_Sugar)) +
  geom_histogram(binwidth = 5, fill = "blue", color = "black") +
  labs(title = "Histogram of Fasting Blood Sugar",
       x = "Fasting Blood Sugar (mg/dL)",
       y = "Frequency")

Using this histogram, what would you say is the skewness of the distribution? left-skewed

Based on the skew, would you expect the mean to be less than, greater than, or approximately equal to the median? less than the median, because the longer left tail pulls the mean downward (the two values may still be close)

Compute and display the mean and median for Fasting_Blood_Sugar below.

mean_value <- mean(Fasting_Blood_Sugar)
print(paste("The mean value for Fasting blood sugar:", mean_value))
## [1] "The mean value for Fasting blood sugar: 88.954"
median_value <- median(Fasting_Blood_Sugar)
print(paste("The median value for Fasting blood sugar is:", median_value))
## [1] "The median value for Fasting blood sugar is: 90.25"
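The comparison between mean and median can be turned into a quick check of skew direction. The sketch below plugs in the two values reported in the output above; the sign rule is a rule of thumb rather than a formal test of skewness, and the `skew_hint` labels are illustrative.

```r
# Values reported in the output above
mean_fbs <- 88.954
median_fbs <- 90.25

# A negative difference means the mean sits below the median
difference <- mean_fbs - median_fbs

skew_hint <- if (difference < 0) {
  "left-skewed (mean < median)"
} else if (difference > 0) {
  "right-skewed (mean > median)"
} else {
  "roughly symmetric (mean = median)"
}

print(paste("Mean minus median:", round(difference, 3)))
print(paste("Suggested shape:", skew_hint))
```

Here the mean falls about 1.3 mg/dL below the median, which agrees with the left skew read from the histogram.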

Question 4 (1 point)

Below is a histogram of Number_of_Visits.

# Use a separate plotting data frame so the full diabetes_data is not overwritten
visits_data <- data.frame(Number_of_Visits)
ggplot(visits_data, aes(x = Number_of_Visits)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Histogram of Number of Visits",
       x = "Number of Visits",
       y = "Frequency")

Based on the histogram, would you say the distribution is unimodal, bimodal, or multimodal? unimodal

Question 5 (4 points)

Which visualizations would be appropriate for the following variables? Just provide one answer for each of the following.

  • Smoking_Status: Bar Chart
  • Fasting_Blood_Sugar: Histogram
  • Diabetes_Type: Bar Chart
  • Number_of_Visits: Histogram

Question 6 (2 points)

Produce a visualization for the following variables:

HbA1c

# Use a separate plotting data frame so the full diabetes_data is not overwritten
hba1c_data <- data.frame(HbA1c)
ggplot(hba1c_data, aes(x = HbA1c)) +
  geom_histogram(binwidth = 0.25, fill = "blue", color = "black") + 
  labs(title = "Histogram for HbA1c", x = "HbA1c Values", y = "Frequency") +
  theme_minimal()

Exercise_Frequency

# The column with categorical values is named 'Exercise_Frequency'

# Count occurrences of each category
Exercise_Frequency_Count <- table(Exercise_Frequency)

# Convert to a data frame and give the columns descriptive names
Exercise_Frequency_data <- as.data.frame(Exercise_Frequency_Count)
names(Exercise_Frequency_data) <- c("Category", "Frequency")

# Plot the bar chart (geom_col() draws bars from precomputed counts)
ggplot(Exercise_Frequency_data, aes(x = Category, y = Frequency)) +
  geom_col(fill = "blue") +
  labs(title = "Exercise Frequency in Diabetes Data", x = "Category", y = "Frequency") +
  theme_minimal()