Week 8 Data Dive

# Load necessary libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Load the dataset
obesity<- read.csv("C:\\Users\\saisr\\Downloads\\statistics using R\\estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition\\obesity.csv")
# View the first few rows of the dataset
head(obesity)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

# Check the structure of the dataset
str(obesity)

## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

# Summary statistics to understand the data
summary(obesity)

##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad       
##  Length:2111        Length:2111       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##
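
The character columns above are summarized only as Length/Class/Mode, which says nothing about their levels. A minimal sketch of one way to get per-level counts, assuming a small toy data frame (`toy`, a hypothetical stand-in built from a few columns of the first six rows shown earlier) rather than the full dataset:

```r
# Toy data frame mirroring a few columns of the obesity data
# (values copied from the head() output; `toy` is a hypothetical name).
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male", "Male"),
  Age    = c(21, 21, 23, 27, 22, 29),
  SMOKE  = c("no", "yes", "no", "no", "no", "no"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; numeric columns are left alone.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# summary() now reports counts per level instead of Length/Class/Mode.
summary(toy)
```

The same pattern should apply to the full `obesity` data frame, although whether to convert every text column is a judgment call.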

Selecting the response variable

In this analysis, we will focus on the Body Mass Index (BMI) as our response variable. BMI is a widely used measure to classify individuals based on body weight relative to height, and it is an important indicator of health and obesity levels.
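
As a quick illustration of the definition, BMI is weight in kilograms divided by the square of height in meters. The sketch below applies it to the first row of the dataset (64 kg, 1.62 m); the helper name `bmi` is an assumption made for this example and does not appear elsewhere in the analysis.

```r
# BMI = weight (kg) / height (m)^2
# `bmi` is a hypothetical helper used only for this illustration.
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# First row of the dataset: 64 kg, 1.62 m
first_row_bmi <- bmi(64, 1.62)
round(first_row_bmi, 2)   # 24.39

# Sanity check: adult BMI values far outside roughly 10 to 80
# usually point to a unit mix-up (for example cm versus m).
stopifnot(first_row_bmi > 10, first_row_bmi < 80)
```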

Deriving BMI

# Height is already recorded in meters (e.g. 1.62), so no unit conversion is needed.
# Calculate BMI as weight (kg) divided by height (m) squared
obesity <- obesity %>%
  mutate(BMI = Weight / Height^2)

# View the first few rows with the new BMI column
head(obesity)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad      BMI
## 1       Normal_Weight 24.38653
## 2       Normal_Weight 24.23823
## 3       Normal_Weight 23.76543
## 4  Overweight_Level_I 26.85185
## 5 Overweight_Level_II 28.34238
## 6       Normal_Weight 20.19509

Visualizing the Response Variable

# Plotting the distribution of BMI
ggplot(obesity, aes(x = BMI)) +
    geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
    labs(title = "Distribution of Body Mass Index (BMI)",
         x = "BMI",
         y = "Frequency") +
    theme_minimal()

Insights

The histogram has two peaks: the first near a BMI of 25 and the second near 35. Part of the sample therefore sits at a relatively lower BMI, while another part tends toward a higher BMI.
The tallest bar is around a BMI of 30, so the most common values are concentrated in that range.
The histogram tails off slightly to the right, which suggests the distribution is mildly right-skewed.

Choosing the Categorical Variable

Gender is a suitable categorical variable. It is available in the dataset and is commonly examined in obesity studies because health outcomes can differ between men and women.

Formulating the Null Hypothesis

Null Hypothesis (H0): The mean BMI is the same across gender groups.
Alternative Hypothesis (H1): The mean BMI differs between gender groups.
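
Because Gender has only two levels in this dataset, a one-way ANOVA on it is equivalent to a pooled-variance two-sample t-test: the F statistic equals the squared t statistic. The sketch below checks that identity on simulated data; the group means, standard deviation, and sample sizes are assumptions chosen for illustration, not values from the obesity dataset.

```r
set.seed(1)  # simulated data, not the obesity dataset

# Two hypothetical groups with slightly different means
group <- factor(rep(c("Female", "Male"), each = 100))
y <- c(rnorm(100, mean = 30, sd = 8), rnorm(100, mean = 29, sd = 8))

# One-way ANOVA F statistic for the grouping factor
f_value <- summary(aov(y ~ group))[[1]][["F value"]][1]

# Pooled-variance (equal-variance) two-sample t statistic
t_value <- unname(t.test(y ~ group, var.equal = TRUE)$statistic)

# With two groups, F equals t squared
all.equal(f_value, t_value^2)
```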

ANOVA test

# Load necessary libraries
library(dplyr)

# Convert Gender to a factor
obesity$Gender <- as.factor(obesity$Gender)

# Conduct ANOVA test
anova_result <- aov(BMI ~ Gender, data = obesity)
summary(anova_result)

##               Df    Sum Sq   Mean Sq F value Pr(>F)  
## Gender         1 3.809e+02 3.809e+02   5.949 0.0148 *
## Residuals   2109 1.350e+05 6.403e+01                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insights

The degrees of freedom for Gender is 1 because Gender is a categorical variable with two levels (number of levels minus one).
The sum of squares for Gender is 380.9, with BMI measured in kg/m². It represents the variation in BMI explained by Gender. The residual sum of squares is about 135,000, so Gender accounts for only about 0.3% of the total variation.
The mean square for Gender is 380.9, calculated by dividing the sum of squares by its degrees of freedom (380.9 / 1).
The F-value is 5.949, the ratio of the mean square for Gender to the mean square for the residuals (380.9 / 64.03). This statistic tells us whether the variation between the gender groups is large relative to the variation within them.
The p-value for this test is 0.0148, which is below the commonly used significance level of 0.05. The difference in mean BMI between gender groups is therefore statistically significant. This is an association rather than proof that gender causes the difference, and the share of variation it explains is small.
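
Statistical significance says little about the size of the effect. One common effect-size measure is eta squared, the share of total variation attributed to the grouping factor. A minimal sketch that recovers it from the F-value and degrees of freedom reported in the ANOVA table (quantities that do not depend on the units of BMI):

```r
# Values read from the ANOVA table
f_value   <- 5.949
df_gender <- 1
df_resid  <- 2109

# Eta squared: share of total variation attributed to the grouping factor.
# Algebraically equal to SS_between / (SS_between + SS_residual).
eta_sq <- (f_value * df_gender) / (f_value * df_gender + df_resid)
round(eta_sq, 4)   # 0.0028
```

By this measure Gender accounts for roughly 0.3% of the variation in BMI, which is small even though the test is significant.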

Summary

The ANOVA test gives an F-statistic of 5.949 with a p-value of 0.0148. Since the p-value is less than 0.05, we reject the null hypothesis at the 5% level and conclude that the mean BMI differs between gender groups.

Implications

There is enough evidence to conclude that mean BMI differs between gender groups, so gender is worth taking into account when designing weight management interventions. Gender-specific strategies might be useful, but the significance test alone does not show that the difference is large.

Choosing a continuous variable to build a regression model

A suitable continuous variable is Age. BMI can change as individuals progress through different life stages, so a roughly linear relationship between age and BMI is a reasonable first approximation.

Building a regression model

# Load necessary libraries
library(ggplot2)
library(dplyr)

# Create a linear regression model
lm_model <- lm(BMI ~ Age, data = obesity)

# Summary of the model
summary(lm_model)

## 
## Call:
## lm(formula = BMI ~ Age, data = obesity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1485  -5.8100  -1.2853   5.4043  22.1154 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22.20607    0.66982   33.15   <2e-16 ***
## Age          0.30824    0.02666   11.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.771 on 2109 degrees of freedom
## Multiple R-squared:  0.05962,    Adjusted R-squared:  0.05917 
## F-statistic: 133.7 on 1 and 2109 DF,  p-value: < 2.2e-16

Insights

Residuals: The Min, 1Q (first quartile), Median, 3Q (third quartile), and Max values summarize the distribution of the residuals, with BMI measured in kg/m².
The residuals are spread widely, from about -18.1 to 22.1 BMI units, so some individual predictions are far off.
Intercept (22.21): the predicted BMI when Age is zero. This is an extrapolation well outside the data (the youngest participant is 14), so it only serves as the starting point of the regression line.
Age (0.308): for every one-year increase in Age, BMI is predicted to increase by about 0.31 units. This is the slope of the regression line.
Both the Intercept and Age coefficients are statistically significant, with p-values < 2e-16, so the linear association between Age and BMI is very unlikely to be due to chance.
Residual Standard Error (7.771): the typical amount by which observed BMI deviates from predicted BMI is about 7.8 units. That is large relative to the 5-unit width of a standard BMI category (for example, 25 to 30 for overweight), so the model's predictions are imprecise.
R-squared (0.05962): only about 5.96% of the variation in BMI is explained by Age. In other words, Age is a weak predictor of BMI in this model.
Adjusted R-squared (0.05917): this version of R-squared accounts for the number of predictors in the model. It is nearly identical to the regular R-squared because there is only one predictor (Age), and it confirms that other factors likely contribute more to BMI variation.
F-statistic (133.7) and p-value (< 2.2e-16): the F-statistic tests the overall significance of the model. With a single predictor it is the same test as the t-test on the Age coefficient, and it shows that Age has a statistically significant relationship with BMI.
However, while the relationship is statistically significant, Age explains only a small part of the variation in BMI, so its practical value as a sole predictor is limited.
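
The t value, the F-statistic, and R-squared in a one-predictor regression are tied together by simple arithmetic. The sketch below reproduces those links from the printed t value, F-statistic, and residual degrees of freedom; small discrepancies are expected because the printed values are rounded.

```r
t_age    <- 11.56    # t value for the Age coefficient
f_stat   <- 133.7    # overall F-statistic
df_resid <- 2109     # residual degrees of freedom

# With one predictor, the overall F-statistic is the squared t value of the slope
# (equal up to rounding in the printed output).
t_age^2                       # 133.6336

# R-squared can be recovered from F and the degrees of freedom.
r_squared <- f_stat / (f_stat + df_resid)
round(r_squared, 4)           # 0.0596
```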

Scatter Plot with Regression Line

# Scatter plot with regression line
ggplot(obesity, aes(x = Age, y = BMI)) +
  geom_point(color = "blue", alpha = 0.5) +  # scatter points
  geom_smooth(method = "lm", color = "red") +  # regression line
  labs(title = "Scatter Plot of Age vs BMI with Regression Line",
       x = "Age",
       y = "BMI") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

Insights

The blue points represent individual observations (Age and BMI).
The red line is the linear regression line showing the predicted BMI values based on Age.

Residual Plot

A residual plot helps to check the assumption that the residuals are randomly distributed. If there is a pattern in the residuals, it indicates that the model might not be appropriate.

# Residual plot
model <- lm(BMI ~ Age, data = obesity)
ggplot(obesity, aes(x = Age, y = residuals(model))) +
  geom_point(color = "blue", alpha = 0.5) +  # residuals scatter
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +  # zero line
  labs(title = "Residual Plot of Age vs Residuals",
       x = "Age",
       y = "Residuals") +
  theme_minimal()

Insights

The horizontal dashed red line marks zero; the residuals should scatter evenly around it if the model fits well.
A pattern in the residuals signals a problem: a curved pattern suggests non-linearity, while a funnel shape suggests non-constant variance (heteroscedasticity).
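
A funnel shape can be hard to judge by eye. As a rough numeric companion to the plot, the sketch below simulates data whose noise grows with the predictor and checks whether the absolute residuals trend upward with the fitted values. The simulated coefficients and the name `spread_trend` are assumptions for illustration only; this is an informal check, not a formal test.

```r
set.seed(42)  # simulated data, not the obesity dataset

# Noise whose spread grows with x produces a funnel-shaped residual plot
x <- runif(500, min = 15, max = 60)
y <- 20 + 0.3 * x + rnorm(500, sd = 0.2 * x)

fit <- lm(y ~ x)

# If the spread is constant, |residuals| should be unrelated to the fitted values.
spread_trend <- cor(abs(residuals(fit)), fitted(fit))
spread_trend  # clearly positive here, reflecting the funnel
```

Applied to the BMI model, a value near zero would be consistent with roughly constant spread, while a clearly positive or negative value would be worth a closer look.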

Q-Q Plot for Residuals

# Q-Q plot
qqnorm(residuals(model))
qqline(residuals(model), col = "red")

Insights

The points should lie approximately along the red line if the residuals are normally distributed.
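
The visual check can be paired with a formal test. A minimal sketch using the Shapiro-Wilk test from base R on simulated, deliberately skewed residuals (assumed data, not the residuals of the BMI model):

```r
set.seed(7)  # simulated residuals, not the obesity model

# Right-skewed "residuals" for illustration (centered exponential draws)
skewed_resid <- rexp(500) - 1

# Shapiro-Wilk test: the null hypothesis is that the sample is normally distributed.
sw <- shapiro.test(skewed_resid)
sw$p.value  # very small here, so normality is rejected
```

With a sample as large as 2,111 observations the test tends to flag even minor departures from normality, so it is best read alongside the Q-Q plot rather than instead of it.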

Actual vs Predicted Plot

# Actual vs Predicted Plot
obesity$Predicted_BMI <- predict(model)

ggplot(obesity, aes(x = BMI, y = Predicted_BMI)) +
  geom_point(color = "blue", alpha = 0.5) +  # actual vs predicted points
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +  
  labs(title = "Actual vs Predicted BMI",
       x = "Actual BMI",
       y = "Predicted BMI") +
  theme_minimal()

Insights

The dashed red line represents a perfect prediction. The closer the points are to this line, the better the model’s predictions.
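
How close the points are to the line can also be summarized numerically. The sketch below, on simulated data with assumed coefficients, computes the root mean squared error and shows that the squared correlation between actual and predicted values equals the model's R-squared.

```r
set.seed(3)  # simulated data with illustrative coefficients, not fitted results
age <- runif(300, 14, 61)
bmi <- 22 + 0.3 * age + rnorm(300, sd = 8)

fit  <- lm(bmi ~ age)
pred <- predict(fit)

# Root mean squared error: typical size of a prediction error
rmse <- sqrt(mean((bmi - pred)^2))

# For a linear model with an intercept, the squared correlation between
# actual and predicted values equals R-squared.
r2_from_cor <- cor(bmi, pred)^2
all.equal(r2_from_cor, summary(fit)$r.squared)
```

A low R-squared therefore shows up in this plot as a cloud of points that is wide relative to the narrow range of predicted values.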

Interpreting the Coefficients

Call: lm(formula = BMI ~ Age, data = obesity)

Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.20607 0.66982 33.15 < 2e-16
Age 0.30824 0.02666 11.56 < 2e-16

Residual standard error: 7.771 on 2109 degrees of freedom
Multiple R-squared: 0.05962, Adjusted R-squared: 0.05917
F-statistic: 133.7 on 1 and 2109 DF, p-value: < 2.2e-16

Interpretation

Intercept (22.21): This value is the expected BMI, in kg/m², when age is zero. It is not meaningful in a practical sense because no participant is close to age zero, but it anchors the regression line.
Coefficient for Age (0.308): For every one-year increase in age, BMI is expected to increase by approximately 0.31 units. This is a positive relationship in which older individuals tend to have a higher BMI, although Age explains only about 6% of the variation.

Recommendations

Target Age Groups for Health Initiatives: BMI tends to increase with Age in this sample, so health initiatives promoting healthy eating and physical activity could give extra attention to older populations. Because Age explains only about 6% of the variation in BMI, age should not be the only targeting criterion.
Include Additional Predictors: To improve the model's explanatory power, consider adding other variables from the dataset, such as Gender, family_history_with_overweight, and FAF (physical activity frequency).
Monitor BMI Changes: Healthcare providers should monitor BMI in patients as they age, identifying potential health risks associated with higher BMI.
Public Health Campaigns: Use the findings to design campaigns that address obesity in relation to aging and emphasize the importance of maintaining a healthy weight.
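
To illustrate the recommendation to include additional predictors, the sketch below compares an Age-only model with a larger model on simulated data. The data frame `sim`, its effect sizes, and the choice of predictors are assumptions for illustration; results on the real dataset would have to be checked separately.

```r
set.seed(11)  # simulated data with hypothetical effect sizes
n <- 400
sim <- data.frame(
  Age    = runif(n, 14, 61),
  Gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  FAF    = runif(n, 0, 3)   # physical activity frequency, 0 to 3 as in the dataset
)
sim$BMI <- 24 + 0.3 * sim$Age - 2 * sim$FAF + rnorm(n, sd = 6)

fit_age  <- lm(BMI ~ Age, data = sim)
fit_full <- lm(BMI ~ Age + Gender + FAF, data = sim)

# Adjusted R-squared penalizes extra predictors, so it is the fairer comparison
adj_age  <- summary(fit_age)$adj.r.squared
adj_full <- summary(fit_full)$adj.r.squared
c(age_only = adj_age, full = adj_full)

# F-test for whether the added predictors improve the fit
anova(fit_age, fit_full)
```

Comparing adjusted R-squared rather than plain R-squared avoids rewarding a model merely for having more terms.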

Saisree Mucharla

2024-10-19
