data_path <- "C:/Users/shanata/Downloads/smoking_driking_dataset_Ver01.csv"
data <- read.csv(data_path)

Variable Combination sets

Case 1:

Explanatory variable: Age and sex of a person Response Variable: Weight of a person

We are interested in knowing how a person's weight measures up to the average weight for people of the same age and sex. I have created two columns: the mean weight for each age and sex group, and each person's deviation from that group mean.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
f<- table(data$sex,data$weight,data$age,data$DRK_YN)
data <- data |> 
  group_by(sex,age) |>
  mutate(weight_dev = weight - mean(weight),  # deviation
         weight_avg = mean(weight)) |>       # group average
  arrange(sex,age)

data |>
  select(age,sex,weight,weight_dev,weight_avg)
## # A tibble: 991,346 × 5
## # Groups:   sex, age [28]
##      age sex    weight weight_dev weight_avg
##    <int> <chr>   <int>      <dbl>      <dbl>
##  1    20 Female     55      0.898       54.1
##  2    20 Female     45     -9.10        54.1
##  3    20 Female     55      0.898       54.1
##  4    20 Female     55      0.898       54.1
##  5    20 Female     50     -4.10        54.1
##  6    20 Female     50     -4.10        54.1
##  7    20 Female     50     -4.10        54.1
##  8    20 Female     60      5.90        54.1
##  9    20 Female     50     -4.10        54.1
## 10    20 Female     55      0.898       54.1
## # ℹ 991,336 more rows

Here I have found how far each person's weight deviates from the mean weight of people of the same sex and age.
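The deviation above is in raw kilograms. To express it relative to the spread within each group, one option is to divide by the group standard deviation. This is a minimal sketch on made-up toy data, using base R's ave() so it runs without extra packages; the toy data frame and the weight_sd and weight_z column names are assumptions, not part of the original analysis.

```r
# Toy data standing in for the real dataset (values are made up for illustration).
toy <- data.frame(
  sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  age    = c(20, 20, 20, 20, 20, 20),
  weight = c(50, 55, 60, 65, 70, 75)
)

# Group mean and standard deviation of weight within each sex/age cell.
toy$weight_avg <- ave(toy$weight, toy$sex, toy$age, FUN = mean)
toy$weight_sd  <- ave(toy$weight, toy$sex, toy$age, FUN = sd)

# Standardised deviation: how many group SDs a person is from the group mean.
toy$weight_z <- (toy$weight - toy$weight_avg) / toy$weight_sd

print(toy)
```

A standardised deviation makes groups with different spreads comparable, which the raw weight_dev column does not.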

data <- ungroup(data)
Visualizing the weight distribution and the deviation from the group mean
ggplot(data, aes(x = factor(age), y = weight, fill = sex)) +
  geom_boxplot() +
  labs(x = "Age", y = "Weight") +
  theme_minimal()

ggplot(data, aes(x = age, y = weight_dev)) +
  geom_point() +
  labs(x = "Age", y = "Weight Deviation") +
  theme_minimal()

Covariance calculation:

Relationship between weight and age for each Sex

plot_data <- 
  data |>
    filter(sex=="Female")

plot_data |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = weight)) +
  labs(x = "Age distribution", y = "Weight",
       title = "Weight as age Increases",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data$age, 
                                  plot_data$weight), 2))) +
  theme_minimal()

Findings for Female:

1) A covariance of -2.61 between weight and age for females indicates a negative linear relationship within the female subgroup: on average, weight tends to decrease slightly as age increases.

2) Covariance depends on the units of both variables, so its magnitude alone does not tell us how strong the relationship is. The correlation coefficient is needed for that.

Verifying by calculating Pearson correlation:

round(cor(plot_data$age, plot_data$weight), 2)
## [1] -0.02

The correlation between weight and age is -0.02, which indicates that there is essentially no linear relationship between the two for females.
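The reason covariance cannot be read as strength is that it changes with the units of measurement, while correlation does not. The following sketch illustrates this on simulated data (the numbers are generated for illustration and are not taken from the dataset).

```r
# Simulated example (not the real data): weight in kg versus age in years.
set.seed(1)
age    <- sample(20:85, 200, replace = TRUE)
weight <- 60 - 0.05 * age + rnorm(200, sd = 10)

cov_kg <- cov(age, weight)          # covariance with weight in kilograms
cov_g  <- cov(age, weight * 1000)   # same data, weight expressed in grams

# Correlation is unchanged by the change of units.
cor_kg <- cor(age, weight)
cor_g  <- cor(age, weight * 1000)

# Correlation is covariance divided by the product of the standard deviations.
cor_manual <- cov_kg / (sd(age) * sd(weight))

print(c(cov_kg = cov_kg, cov_g = cov_g, cor_kg = cor_kg, cor_g = cor_g))
```

Switching from kilograms to grams multiplies the covariance by 1000 but leaves the correlation untouched, which is why the correlation is the better measure of strength.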

Plot Vs Value:

From the plot and its covariance I concluded that the relationship is negative, but neither shows how strong it is. The correlation coefficient shows that the relationship is practically negligible.

plot_data1 <- 
  data |>
    filter(sex=="Male")

plot_data1 |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = weight)) +
  labs(x = "Age distribution", y = "Weight",
       title = "Weight as age Increases",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data1$age, 
                                  plot_data1$weight), 2))) +
  theme_minimal()

Findings for Male:

1) A negative covariance value (-46.79) implies an inverse relationship between weight and age for males: as age increases, weight tends to decrease.

2) The covariance is larger in magnitude than for females, but covariance is measured in the units of both variables, so its size cannot be read directly as the strength of the relationship. The direction is clear: younger males in this dataset tend to weigh more than older males. The strength is checked with the correlation.

Verifying by calculating Pearson correlation:

round(cor(plot_data1$age, plot_data1$weight), 2)
## [1] -0.3

The value of -0.3 indicates a weak-to-moderate negative relationship. On average, older males in this dataset tend to have somewhat lower weights, but the relationship is not strong enough to make definitive predictions about weight based solely on age.
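One way to see why a correlation of -0.3 supports only weak predictions is to square it: in a simple linear regression with one predictor, the squared correlation equals the share of variance explained. The sketch below checks this identity on simulated data (the generating parameters are arbitrary assumptions) and then applies it to the reported value.

```r
# Simulated data with a modest negative slope (parameters chosen arbitrarily).
set.seed(2)
age    <- sample(20:85, 500, replace = TRUE)
weight <- 80 - 0.2 * age + rnorm(500, sd = 11)

r   <- cor(age, weight)
fit <- lm(weight ~ age)

# R-squared of the one-predictor regression equals the squared correlation.
r_squared <- summary(fit)$r.squared
print(c(r = r, r_squared = r_squared))

# For the reported r = -0.3, the share of variance explained is 0.09.
reported_share <- (-0.3)^2
print(reported_share)
```

With only about 9% of the variance in weight accounted for by age, most of the variation among males comes from other sources.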

Further Question:

So does it mean that all older male adults will weigh less than young male adults?

No. A correlation of -0.3 describes an average tendency, not every individual, and it does not mean that age causes changes in weight. Other factors, such as lifestyle, genetics, and health, may also be at play.

Plot Vs Value:

From the plot and its covariance I concluded that the relationship is negative, but neither shows how strong it is. The correlation coefficient shows that the relationship is weak to moderate.

Confidence interval:

confidence_interval1 <- t.test(data$weight)$conf.int
print(confidence_interval1)
## [1] 63.25942 63.30868
## attr(,"conf.level")
## [1] 0.95
confidence_interval2 <- t.test(plot_data$weight)$conf.int
print(confidence_interval2)
## [1] 55.51253 55.56354
## attr(,"conf.level")
## [1] 0.95
confidence_interval3 <- t.test(plot_data1$weight)$conf.int
print(confidence_interval3)
## [1] 70.09508 70.15562
## attr(,"conf.level")
## [1] 0.95

Findings:

1) General population: we are 95% confident that the true mean weight lies between 63.26 and 63.31 kg, that is, approximately 63 kg.

2) For Female: we are 95% confident that the true mean weight lies between 55.51 and 55.56 kg, that is, approximately 55.5 kg.

3) For Male: we are 95% confident that the true mean weight lies between 70.10 and 70.16 kg, that is, approximately 70 kg.

In each case, the 95% confidence level means that if we were to take many random samples from the same population and calculate a confidence interval for each sample, approximately 95% of those intervals would contain the true population mean. The intervals are very narrow because the sample is large (991,346 rows in total).
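The interval that t.test() reports can be reproduced by hand as the sample mean plus or minus a t quantile times the standard error. The sketch below does this on simulated weights (the mean and standard deviation used to simulate are assumptions for illustration only).

```r
# Simulated weights; the parameters are illustrative, not estimates from the data.
set.seed(3)
x <- rnorm(1000, mean = 63, sd = 12)

n      <- length(x)
se     <- sd(x) / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)      # two-sided 95% critical value

# Manual 95% confidence interval for the mean.
ci_manual <- mean(x) + c(-1, 1) * t_crit * se

# The same interval as computed by t.test().
ci_ttest <- as.numeric(t.test(x)$conf.int)

print(rbind(ci_manual, ci_ttest))
```

Because the standard error shrinks with the square root of the sample size, a dataset with close to a million rows produces intervals only a few hundredths of a kilogram wide.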

Case 2:

Explanatory Variable: Age and Sex Response Variable: Cholesterol

We are interested in knowing how many people have cholesterol above and below the median for their age and sex group.

data <-
  data |>
    group_by(age,sex) |>
    mutate(chole_median = median(tot_chole),
            chole_half = ifelse(tot_chole >= median(tot_chole),
                                 "upper half",
                                 "lower half")) |>
    ungroup()
data %>%
  ggplot() +
  geom_point(mapping = aes(x = age, 
                           y = tot_chole,
                           colour = chole_half)) +
  scale_x_log10() +
  scale_colour_brewer(palette = "Dark2") +
  labs(title = "Age VS Cholesterol")

Findings:

  1. Here I have created a new column that divides cholesterol into two halves based on the median of each age and sex group. With the Dark2 palette, the orange points are in the upper half and the green points are in the lower half. The plot also shows the outliers: a few values differ markedly from the rest.

  2. There is also a large concentration of outliers around ages 50-70. They may be errors, or they may reflect lifestyle changes centered around that age group.
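The plot shows the two halves but does not count them. A sketch of how the counts per group could be tabulated, using made-up toy values and base R (the toy data frame is an assumption; chole_median and chole_half mirror the columns created above):

```r
# Toy data: two age groups of one sex, with made-up cholesterol values.
toy <- data.frame(
  age       = rep(c(40, 50), each = 4),
  sex       = rep("Female", 8),
  tot_chole = c(180, 190, 200, 210, 170, 195, 195, 220)
)

# Median within each age/sex group, then the same >= split used above.
toy$chole_median <- ave(toy$tot_chole, toy$age, toy$sex, FUN = median)
toy$chole_half   <- ifelse(toy$tot_chole >= toy$chole_median,
                           "upper half", "lower half")

# Number of people in each half, per age group.
counts <- table(toy$age, toy$chole_half)
print(counts)
```

Because the split uses `>=`, values equal to the median fall into the upper half, so the two halves are not always the same size (the age 50 group in this toy example splits 1 to 3).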

Covariance between two variables:

Relationship between cholesterol and age for each Sex

plot_data2 <- 
  data |>
    filter(sex=="Female")

plot_data2 |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = tot_chole)) +
  labs(x = "Age distribution", y = "Cholesterol",
       title = "Cholesterol as age Increases for Female",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data2$age, 
                                  plot_data2$tot_chole), 2))) +
  theme_minimal()

plot_data3 <- 
  data |>
    filter(sex=="Male")

plot_data3 |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = tot_chole)) +
  labs(x = "Age distribution", y = "Cholesterol",
       title = "Cholesterol as age Increases for Male",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data3$age, 
                                  plot_data3$tot_chole), 2))) +
  theme_minimal()

Findings:

  1. For females, the covariance between age and cholesterol is 66.55, indicating a positive linear relationship.

  2. For males, the covariance between age and cholesterol is -48.89, indicating a negative linear relationship.

Correlation:

round(cor(plot_data2$age, plot_data2$tot_chole), 2)
## [1] 0.12
round(cor(plot_data3$age, plot_data3$tot_chole), 2)
## [1] -0.09

Both correlations are small in magnitude (0.12 and -0.09), indicating at most a very weak linear relationship between age and cholesterol for either sex.

Confidence interval:

confidence_interval1 <- t.test(data$tot_chole)$conf.int
print(confidence_interval1)
## [1] 195.4809 195.6331
## attr(,"conf.level")
## [1] 0.95
confidence_interval2 <- t.test(plot_data2$tot_chole)$conf.int
print(confidence_interval2)
## [1] 196.3700 196.5906
## attr(,"conf.level")
## [1] 0.95
confidence_interval3 <- t.test(plot_data3$tot_chole)$conf.int
print(confidence_interval3)
## [1] 194.6365 194.8467
## attr(,"conf.level")
## [1] 0.95

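The mean total cholesterol is close to 195 for the whole sample and for each sex. The female (196.37 to 196.59) and male (194.64 to 194.85) intervals do not overlap, so the difference of roughly 1.7 is statistically detectable, but it is small relative to the mean.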
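Comparing two separate intervals is an informal check. A more direct option is a two-sample t-test, which gives a confidence interval for the difference in means itself. The sketch below uses simulated data; the group means and the standard deviation of 38 are assumptions chosen for illustration, not values from the dataset.

```r
# Simulated cholesterol values for two groups (parameters are assumptions).
set.seed(4)
toy <- data.frame(
  sex       = rep(c("Female", "Male"), each = 500),
  tot_chole = c(rnorm(500, mean = 196.5, sd = 38),
                rnorm(500, mean = 194.7, sd = 38))
)

# Welch two-sample t-test (the default for t.test with a formula).
res <- t.test(tot_chole ~ sex, data = toy)

# Group means and the 95% confidence interval for their difference.
diff_ci <- res$conf.int
print(res$estimate)
print(diff_ci)
```

If the interval for the difference excludes zero, the two means differ at the 5% level; how much the difference matters in practice is a separate question.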

Case 3:

Explanatory Variable: Age and Sex Response Variable: systolic blood Pressure

We are interested in knowing how many people have systolic blood pressure above and below the median for their age and sex group.

data <-
  data |>
    group_by(age,sex) |>
    mutate(SBP_median = median(SBP),
            SBP_half = ifelse(SBP >= median(SBP),
                                 "upper half",
                                 "lower half")) |>
    ungroup()
data %>%
  ggplot() +
  geom_point(mapping = aes(x = age, 
                           y = SBP,
                           colour = SBP_half)) +
  scale_x_log10() +
  scale_colour_brewer(palette = "Dark2") +
  labs(title = "Age VS Systolic Blood Pressure")

Findings:

  1. Here I have created a new column that divides the systolic blood pressure into two halves based on the median of each age and sex group. With the Dark2 palette, the orange points are in the upper half and the green points are in the lower half. The plot also shows the outliers: a few values differ markedly from the rest.

  2. There is also a large concentration of outliers around ages 50-70. They may be errors, or they may reflect lifestyle changes centered around that age group.
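The outliers above are judged by eye. One common way to flag them explicitly is the 1.5 × IQR rule, sketched here on a small made-up vector of readings (the values and the cut-off multiplier are assumptions; other rules may suit this data better).

```r
# Made-up systolic blood pressure readings with one extreme value.
sbp <- c(110, 115, 118, 120, 122, 125, 128, 130, 135, 200)

# Quartiles and interquartile range.
q   <- unname(quantile(sbp, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

# Values outside the fences are flagged as outliers.
outliers <- sbp[sbp < lower_fence | sbp > upper_fence]
print(outliers)
```

Applying the same rule within each age and sex group would show whether the 50-70 range really contains more flagged values than other ages.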

Covariance between two variables:

Relationship between systolic blood pressure and age for each Sex

plot_data4 <- 
  data |>
    filter(sex=="Female")

plot_data4 |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = SBP)) +
  labs(x = "Age distribution", y = "Systolic Blood Pressure",
       title = "Systolic Blood Pressure as age Increases for Female",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data4$age, 
                                  plot_data4$SBP), 2))) +
  theme_minimal()

plot_data5 <- 
  data |>
    filter(sex=="Male")

plot_data5 |>
  ggplot() +
  geom_point(mapping = aes(x = age, y = SBP)) +
  labs(x = "Age distribution", y = "Systolic Blood Pressure",
       title = "Systolic Blood Pressure as age Increases for Male",
       subtitle = paste("Covariance:", 
                        round(cov(plot_data5$age, 
                                  plot_data5$SBP), 2))) +
  theme_minimal()

Findings:

  1. For females, the covariance between age and systolic blood pressure is 87.13, indicating a positive linear relationship.

  2. For males, the covariance between age and systolic blood pressure is 31.87, indicating a positive linear relationship.

Correlation:

round(cor(plot_data4$age, plot_data4$SBP), 2)
## [1] 0.4
round(cor(plot_data5$age, plot_data5$SBP), 2)
## [1] 0.17

The correlations are 0.4 for females and 0.17 for males: a moderate positive relationship between age and systolic blood pressure for females and a weak one for males.
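The covariance and correlation are computed separately for each sex in every case, which invites copy-and-paste slips. A sketch of computing both per group in one pass, on simulated data with base R (the toy data frame and the summary_by_sex name are assumptions for illustration):

```r
# Simulated data: SBP rising with age (parameters are arbitrary assumptions).
set.seed(5)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 100),
  age = sample(20:85, 200, replace = TRUE)
)
toy$SBP <- 110 + 0.3 * toy$age + rnorm(200, sd = 12)

# Split by sex, compute covariance and correlation in each piece, then stack.
summary_by_sex <- do.call(rbind, lapply(split(toy, toy$sex), function(d) {
  data.frame(
    sex         = d$sex[1],
    covariance  = round(cov(d$age, d$SBP), 2),
    correlation = round(cor(d$age, d$SBP), 2)
  )
}))

print(summary_by_sex)
```

Keeping both statistics side by side in one table makes it harder to read the size of a covariance as if it were the strength of the relationship.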

Confidence interval:

# 95% confidence interval for mean SBP: whole sample
confidence_interval1 <- t.test(data$SBP)$conf.int
print(confidence_interval1)
# 95% confidence interval for mean SBP: females
confidence_interval2 <- t.test(plot_data4$SBP)$conf.int
print(confidence_interval2)
# 95% confidence interval for mean SBP: males
confidence_interval3 <- t.test(plot_data5$SBP)$conf.int
print(confidence_interval3)

These intervals give the 95% confidence range for the mean systolic blood pressure overall, for females, and for males.