Libraries and Data Source

data <-  source("http://www.openintro.org/stat/data/cdc.R")
data <- data$value
library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('moments')
dim(data)
## [1] 20000     9
str(data)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

Descriptive Statistics Analysis

Make a scatterplot of weight versus desired weight. Describe the relationship between these two variables.

ggplot(data, aes(x = weight, y = wtdesire)) +
  geom_point(
    color = "steelblue",  
    size = 3,             
    alpha = 0.7           
  ) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +  
  theme_minimal() +  
  labs(
    title = "Scatterplot of Weight vs. Desired Weight",
    x = "Weight",
    y = "Desired Weight"
  ) +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5),  
    axis.title = element_text(size = 14),               
    axis.text = element_text(size = 12),                
    panel.grid.major = element_line(color = "lightgray", linetype = "dotted")  
  )

correlation_coeff <- cor(data$weight, data$wtdesire)
cat("Correlation coefficient:", correlation_coeff, "\n")
## Correlation coefficient: 0.8000521

A correlation coefficient of approximately 0.800 suggests a strong positive linear relationship between weight and desired weight (wtdesire).

Let’s consider a new variable: the difference between desired weight (wtdesire) and current weight (weight).

#Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called 'wdiff'.
data$wdiff <- data$wtdesire - data$weight
head(data[,'wdiff'], 10)
##  [1]   0 -10   0  -8 -20   0  -9 -10 -20 -10

The variable wdiff represents the difference between desired weight (wtdesire) and current weight (weight) for each individual in your dataset. The type of data that wdiff represents is a numeric variable, as it contains numerical values that can be positive, negative, or zero.

Here’s what different values of wdiff typically mean:

  1. wdiff = 0: This means that the person’s current weight (weight) is exactly the same as their desired weight (wtdesire). In other words, there is no difference between their current weight and the weight they desire.

  2. wdiff > 0: A positive wdiff indicates that the person’s current weight is greater than their desired weight. This suggests that the person would like to lose weight to reach their desired weight.

  3. wdiff < 0: A negative wdiff indicates that the person’s current weight is less than their desired weight. This implies that the person would like to gain weight to reach their desired weight.

Describe the distribution of wdiff in terms of its center, shape, and spread. What does this tell us about how people feel about their current weight?

# Center
mean_wdiff <- mean(data$wdiff)
median_wdiff <- median(data$wdiff)

# Shape - Histogram
hist(data$wdiff, main = "Distribution of 'wdiff'", xlab = "wdiff", col = "lightblue", border = "black")
abline(v = mean_wdiff, col = "red", lwd = 2)  # Add mean as a vertical line
abline(v = median_wdiff, col = "blue", lwd = 2)  # Add median as a vertical line

# Shape - Skewness
skewness <- moments::skewness(data$wdiff)

# Shape - Q-Q Plot (assuming 'wdiff' follows a normal distribution)
qqnorm(data$wdiff)
qqline(data$wdiff)

# Spread
sd_wdiff <- sd(data$wdiff)
iqr_wdiff <- IQR(data$wdiff)

# Spread - Box Plot
boxplot(data$wdiff, main = "Box Plot of 'wdiff'", col = "lightgreen", border = "black")

# Print summary statistics
cat("Mean: ", mean_wdiff, "\n")
## Mean:  -14.5891
cat("Median: ", median_wdiff, "\n")
## Median:  -10
cat("Skewness: ", skewness, "\n")
## Skewness:  -1.447632
cat("Standard Deviation: ", sd_wdiff, "\n")
## Standard Deviation:  24.04586
cat("Interquartile Range (IQR): ", iqr_wdiff, "\n")
## Interquartile Range (IQR):  21

Based on the statistics for the variable wdiff, here’s what we can infer about how people feel about their current weight:

  1. Mean: The mean value of wdiff is approximately -14.5891. This negative mean suggests that, on average, people in the dataset desire to lose weight.

  2. Median: The median value of wdiff is -10. The median is another measure of central tendency and is less sensitive to extreme values than the mean. The fact that the median is also negative confirms that a significant portion of the individuals in the dataset want to lose weight.

  3. Skewness: The skewness value is approximately -1.448. A negative skewness indicates that the distribution is skewed to the left. In the context of weight preferences, this suggests that there is a longer tail on the left side of the distribution, indicating that a larger proportion of individuals want to lose weight rather than gain it. The skewness value being negative further supports the notion of a left-skewed distribution.

  4. Standard Deviation: The standard deviation is approximately 24.046. This measure quantifies the amount of variation or spread in the data. A higher standard deviation indicates greater variability in people’s weight preferences. In this case, the relatively high standard deviation suggests that individuals’ feelings about their current weight vary widely.

  5. Interquartile Range (IQR): The interquartile range (IQR) is 21. The IQR represents the spread of the middle 50% of the data, which is a measure of the spread around the median. In this context, the IQR of 21 suggests that the majority of individuals in the dataset have weight preferences that fall within a relatively narrow range around the median value, which is negative, indicating a preference to lose weight.

In summary, based on these statistics, it appears that a significant portion of the individuals in the dataset have a desire to lose weight. The negative mean, median, and skewness values all point to this conclusion. However, it’s also important to note that there is variability in these preferences, as indicated by the relatively high standard deviation, suggesting that not everyone feels the same way about their current weight.

Using numerical summaries and a side-by-side box plot, determine if men tend to view their weight differently than women.

summary_stats <- data %>%
  group_by(gender) %>%
  summarize(
    Mean = mean(wdiff),
    Median = median(wdiff),
    SD = sd(wdiff),
    Min = min(wdiff),
    Max = max(wdiff),
    skewness = moments::skewness(wdiff),
    iqr_wdiff = IQR(data$wdiff)
  )

print(summary_stats)
## # A tibble: 2 × 8
##   gender  Mean Median    SD   Min   Max skewness iqr_wdiff
##   <fct>  <dbl>  <int> <dbl> <int> <int>    <dbl>     <dbl>
## 1 m      -10.7     -5  23.5  -300   500   -0.597        21
## 2 f      -18.2    -10  24.0  -300    83   -2.26         21
ggplot(data, aes(x = gender, y = wdiff, fill = gender)) +
  geom_boxplot() +
  labs(
    title = "Comparison of Weight Difference by Gender",
    x = "Gender",
    y = "Weight Difference (wdiff)"
  ) +
  scale_fill_manual(values = c("m" = "lightblue", "f" = "lightpink")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, hjust = 0.5),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 10),
    legend.position = "None"
  )

Based on these statistics:

  1. Women, on average (mean), tend to have a more negative wdiff (indicating a desire to lose more weight) compared to men.

  2. The median wdiff for women is also more negative than for men.

  3. The standard deviation of wdiff is similar between men and women, indicating similar levels of variability.

  4. The skewness values for both genders are negative, suggesting leftward skewness, meaning that more individuals in both groups desire to lose weight than gain it.

In summary, women, on average, tend to view their weight more negatively (desire to lose more weight) than men based on the wdiff variable. The skewness values also suggest that both groups have a leftward skew, indicating a stronger preference to lose weight.However, it’s important to note that these are summary statistics, and individual variations may exist within each gender group.

Find the mean and standard deviation of weight and determine what proportion of the weights are within one standard deviation of the mean.

# Calculate the mean and standard deviation of 'weight'
mean_weight <- mean(data$weight)
sd_weight <- sd(data$weight)

# Calculate the lower and upper bounds for one standard deviation
lower_bound <- mean_weight - sd_weight
upper_bound <- mean_weight + sd_weight

# Calculate the proportion of weights within one standard deviation of the mean
within_one_sd <- sum(data$weight >= lower_bound & data$weight <= upper_bound) / nrow(data)

# Print the results
cat("Mean weight:", mean_weight, "\n")
## Mean weight: 169.683
cat("Standard deviation of weight:", sd_weight, "\n")
## Standard deviation of weight: 40.08097
cat("Proportion within one standard deviation of the mean:", within_one_sd, "\n")
## Proportion within one standard deviation of the mean: 0.7076