data <- source("http://www.openintro.org/stat/data/cdc.R")
data <- data$value
library('tidyverse')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library('moments')
dim(data)
## [1] 20000 9
str(data)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
ggplot(data, aes(x = weight, y = wtdesire)) +
geom_point(
color = "steelblue",
size = 3,
alpha = 0.7
) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
theme_minimal() +
labs(
title = "Scatterplot of Weight vs. Desired Weight",
x = "Weight",
y = "Desired Weight"
) +
theme(
plot.title = element_text(size = 16, hjust = 0.5),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.major = element_line(color = "lightgray", linetype = "dotted")
)
correlation_coeff <- cor(data$weight, data$wtdesire)
cat("Correlation coefficient:", correlation_coeff, "\n")
## Correlation coefficient: 0.8000521
A correlation coefficient of approximately 0.800 suggests a strong
positive linear relationship between weight and desired weight
(wtdesire).
wtdesire) and current weight
(weight).#Create this new variable by subtracting the two columns in the data frame and assigning them to a new object called 'wdiff'.
data$wdiff <- data$wtdesire - data$weight
head(data[,'wdiff'], 10)
## [1] 0 -10 0 -8 -20 0 -9 -10 -20 -10
The variable wdiff represents the difference between
desired weight (wtdesire) and current weight
(weight) for each individual in your dataset. The type of
data that wdiff represents is a numeric variable, as it
contains numerical values that can be positive, negative, or zero.
Here’s what different values of wdiff typically
mean:
wdiff = 0: This means that the person’s current weight
(weight) is exactly the same as their desired weight
(wtdesire). In other words, there is no difference between
their current weight and the weight they desire.
wdiff > 0: A positive wdiff indicates that the
person’s current weight is greater than their desired weight. This
suggests that the person would like to lose weight to reach their
desired weight.
wdiff < 0: A negative wdiff indicates that the
person’s current weight is less than their desired weight. This implies
that the person would like to gain weight to reach their desired
weight.
wdiff in terms of
its center, shape, and spread. What does this tell us about how people
feel about their current weight?# Center
mean_wdiff <- mean(data$wdiff)
median_wdiff <- median(data$wdiff)
# Shape - Histogram
hist(data$wdiff, main = "Distribution of 'wdiff'", xlab = "wdiff", col = "lightblue", border = "black")
abline(v = mean_wdiff, col = "red", lwd = 2) # Add mean as a vertical line
abline(v = median_wdiff, col = "blue", lwd = 2) # Add median as a vertical line
# Shape - Skewness
skewness <- moments::skewness(data$wdiff)
# Shape - Q-Q Plot (assuming 'wdiff' follows a normal distribution)
qqnorm(data$wdiff)
qqline(data$wdiff)
# Spread
sd_wdiff <- sd(data$wdiff)
iqr_wdiff <- IQR(data$wdiff)
# Spread - Box Plot
boxplot(data$wdiff, main = "Box Plot of 'wdiff'", col = "lightgreen", border = "black")
# Print summary statistics
cat("Mean: ", mean_wdiff, "\n")
## Mean: -14.5891
cat("Median: ", median_wdiff, "\n")
## Median: -10
cat("Skewness: ", skewness, "\n")
## Skewness: -1.447632
cat("Standard Deviation: ", sd_wdiff, "\n")
## Standard Deviation: 24.04586
cat("Interquartile Range (IQR): ", iqr_wdiff, "\n")
## Interquartile Range (IQR): 21
Based on the statistics for the variable wdiff, here’s
what we can infer about how people feel about their current weight:
Mean: The mean value of wdiff is approximately
-14.5891. This negative mean suggests that, on average, people in the
dataset desire to lose weight.
Median: The median value of wdiff is -10. The median
is another measure of central tendency and is less sensitive to extreme
values than the mean. The fact that the median is also negative confirms
that a significant portion of the individuals in the dataset want to
lose weight.
Skewness: The skewness value is approximately -1.448. A negative skewness indicates that the distribution is skewed to the left. In the context of weight preferences, this suggests that there is a longer tail on the left side of the distribution, indicating that a larger proportion of individuals want to lose weight rather than gain it. The skewness value being negative further supports the notion of a left-skewed distribution.
Standard Deviation: The standard deviation is approximately 24.046. This measure quantifies the amount of variation or spread in the data. A higher standard deviation indicates greater variability in people’s weight preferences. In this case, the relatively high standard deviation suggests that individuals’ feelings about their current weight vary widely.
Interquartile Range (IQR): The interquartile range (IQR) is 21. The IQR represents the spread of the middle 50% of the data, which is a measure of the spread around the median. In this context, the IQR of 21 suggests that the majority of individuals in the dataset have weight preferences that fall within a relatively narrow range around the median value, which is negative, indicating a preference to lose weight.
In summary, based on these statistics, it appears that a significant portion of the individuals in the dataset have a desire to lose weight. The negative mean, median, and skewness values all point to this conclusion. However, it’s also important to note that there is variability in these preferences, as indicated by the relatively high standard deviation, suggesting that not everyone feels the same way about their current weight.
summary_stats <- data %>%
group_by(gender) %>%
summarize(
Mean = mean(wdiff),
Median = median(wdiff),
SD = sd(wdiff),
Min = min(wdiff),
Max = max(wdiff),
skewness = moments::skewness(wdiff),
iqr_wdiff = IQR(data$wdiff)
)
print(summary_stats)
## # A tibble: 2 × 8
## gender Mean Median SD Min Max skewness iqr_wdiff
## <fct> <dbl> <int> <dbl> <int> <int> <dbl> <dbl>
## 1 m -10.7 -5 23.5 -300 500 -0.597 21
## 2 f -18.2 -10 24.0 -300 83 -2.26 21
ggplot(data, aes(x = gender, y = wdiff, fill = gender)) +
geom_boxplot() +
labs(
title = "Comparison of Weight Difference by Gender",
x = "Gender",
y = "Weight Difference (wdiff)"
) +
scale_fill_manual(values = c("m" = "lightblue", "f" = "lightpink")) +
theme_minimal() +
theme(
plot.title = element_text(size = 16, hjust = 0.5),
axis.title = element_text(size = 10),
axis.text = element_text(size = 10),
legend.position = "None"
)
Based on these statistics:
Women, on average (mean), tend to have a more
negative wdiff (indicating a desire to lose more weight)
compared to men.
The median wdiff for women is also more negative
than for men.
The standard deviation of wdiff is similar between
men and women, indicating similar levels of variability.
The skewness values for both genders are negative, suggesting leftward skewness, meaning that more individuals in both groups desire to lose weight than gain it.
In summary, women, on average, tend to view their weight more
negatively (desire to lose more weight) than men based on the
wdiff variable. The skewness values also suggest that both
groups have a leftward skew, indicating a stronger preference to lose
weight.However, it’s important to note that these are summary
statistics, and individual variations may exist within each gender
group.
weight
and determine what proportion of the weights are within one standard
deviation of the mean.# Calculate the mean and standard deviation of 'weight'
mean_weight <- mean(data$weight)
sd_weight <- sd(data$weight)
# Calculate the lower and upper bounds for one standard deviation
lower_bound <- mean_weight - sd_weight
upper_bound <- mean_weight + sd_weight
# Calculate the proportion of weights within one standard deviation of the mean
within_one_sd <- sum(data$weight >= lower_bound & data$weight <= upper_bound) / nrow(data)
# Print the results
cat("Mean weight:", mean_weight, "\n")
## Mean weight: 169.683
cat("Standard deviation of weight:", sd_weight, "\n")
## Standard deviation of weight: 40.08097
cat("Proportion within one standard deviation of the mean:", within_one_sd, "\n")
## Proportion within one standard deviation of the mean: 0.7076