data_path <- "C:/Users/shanata/Downloads/smoking_driking_dataset_Ver01.csv"
data <- read.csv(data_path)
Explanatory variables: age and sex of a person. Response variable: weight of a person.
We are interested in how a person’s weight compares with the average weight for people of the same age and sex. To do this, I created two columns: the mean weight for each age and sex group (weight_avg) and each person’s deviation from that group mean (weight_dev).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
f<- table(data$sex,data$weight,data$age,data$DRK_YN)
data <- data |>
group_by(sex,age) |>
mutate(weight_dev = weight - mean(weight), # deviation
weight_avg = mean(weight)) |> # group average
arrange(sex,age)
data |>
select(age,sex,weight,weight_dev,weight_avg)
## # A tibble: 991,346 × 5
## # Groups: sex, age [28]
## age sex weight weight_dev weight_avg
## <int> <chr> <int> <dbl> <dbl>
## 1 20 Female 55 0.898 54.1
## 2 20 Female 45 -9.10 54.1
## 3 20 Female 55 0.898 54.1
## 4 20 Female 55 0.898 54.1
## 5 20 Female 50 -4.10 54.1
## 6 20 Female 50 -4.10 54.1
## 7 20 Female 50 -4.10 54.1
## 8 20 Female 60 5.90 54.1
## 9 20 Female 50 -4.10 54.1
## 10 20 Female 55 0.898 54.1
## # ℹ 991,336 more rows
Here I have computed how far each person’s weight deviates from the mean weight of people of the same sex and age.
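To make the group-wise calculation concrete, here is a minimal sketch on a made-up six-row data frame. The `toy` data and its values are illustrative assumptions, not rows from the dataset. It uses base R's `ave()` to reproduce the per-group mean and deviation, and adds a standardized deviation (z-score) as one possible way to compare people across groups; the z-score column is an addition for illustration and is not part of the analysis above.

```r
# Toy data: hypothetical values, not taken from the dataset
toy <- data.frame(
  sex    = c("Female", "Female", "Female", "Male", "Male", "Male"),
  age    = c(20, 20, 20, 20, 20, 20),
  weight = c(50, 55, 60, 65, 70, 75)
)

# Group mean of weight within each sex/age combination
toy$weight_avg <- ave(toy$weight, toy$sex, toy$age, FUN = mean)

# Deviation of each person from their group mean
toy$weight_dev <- toy$weight - toy$weight_avg

# Group standard deviation and the resulting z-score
toy$weight_sd <- ave(toy$weight, toy$sex, toy$age, FUN = sd)
toy$weight_z  <- toy$weight_dev / toy$weight_sd

print(toy)
```

A z-score of 1 means the person is one group standard deviation heavier than the average for their age and sex, which is comparable across groups in a way the raw deviation is not.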
data <- ungroup(data)
ggplot(data, aes(x = factor(age), y = weight, fill = sex)) +
geom_boxplot() +
labs(x = "Age", y = "Weight") +
theme_minimal()
ggplot(data, aes(x = age, y = weight_dev)) +
geom_point() +
labs(x = "Age", y = "Weight Deviation") +
theme_minimal()
Relationship between weight and age for each Sex
plot_data <-
data |>
filter(sex=="Female")
plot_data |>
ggplot() +
geom_point(mapping = aes(x = age, y = weight)) +
labs(x = "Age distribution", y = "Weight",
title = "Weight as age Increases",
subtitle = paste("Covariance:",
round(cov(plot_data$age,
plot_data$weight), 2))) +
theme_minimal()
#### Findings for Female:
1) A covariance of -2.61 between weight and age for females indicates a negative direction of association within the female subgroup: on average, older females in this group tend to weigh slightly less. The sign gives the direction only. Because covariance depends on the units of both variables, its size does not tell us how strong the relationship is.
round(cor(plot_data$age, plot_data$weight), 2)
## [1] -0.02
The correlation between weight and age is -0.02, which indicates an extremely weak linear relationship. It is practically negligible, so we cannot conclude that weight changes with age for females.
From the plot and the covariance, I concluded that the relationship is negative, but neither showed how strong it is. The correlation coefficient settles this: the relationship is practically negligible.
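The difference between covariance and correlation can be illustrated with a short sketch. The ages and weights below are hypothetical values chosen for illustration. Expressing the same weights in grams instead of kilograms multiplies the covariance by 1000 but leaves the correlation unchanged, which is why the correlation is the quantity to read for strength.

```r
# Hypothetical ages (years) and weights (kg); values are illustrative only
age       <- c(20, 30, 40, 50, 60)
weight_kg <- c(62, 60, 61, 58, 57)

# Same weights expressed in grams
weight_g <- weight_kg * 1000

cov_kg <- cov(age, weight_kg)   # covariance in year * kg
cov_g  <- cov(age, weight_g)    # covariance in year * g (1000 times larger)

cor_kg <- cor(age, weight_kg)   # unit-free
cor_g  <- cor(age, weight_g)    # identical to cor_kg

# Correlation is covariance divided by both standard deviations
cor_manual <- cov_kg / (sd(age) * sd(weight_kg))

print(c(cov_kg = cov_kg, cov_g = cov_g))
print(c(cor_kg = cor_kg, cor_g = cor_g, cor_manual = cor_manual))
```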
plot_data1 <-
data |>
filter(sex=="Male")
plot_data1 |>
ggplot() +
geom_point(mapping = aes(x = age, y = weight)) +
labs(x = "Age distribution", y = "Weight",
title = "Weight as age Increases",
subtitle = paste("Covariance:",
round(cov(plot_data1$age,
plot_data1$weight), 2))) +
theme_minimal()
1) A negative covariance (-46.79) implies an inverse relationship between weight and age for males: as age increases, weight tends to decrease, and vice versa.
2) The magnitude of the covariance (-46.79) does not by itself measure the strength of the relationship, because covariance depends on the units and spread of both variables. It is larger in magnitude than the female value (-2.61), which hints at a more pronounced inverse trend among males, but the correlation coefficient is needed to judge strength.
round(cor(plot_data1$age, plot_data1$weight), 2)
## [1] -0.3
The value of -0.3 indicates a relatively weak negative relationship. On average, older males in this dataset tend to have somewhat lower weights, but the relationship is not strong enough to predict an individual’s weight from age alone.
So does this mean that every older male adult weighs less than every younger male adult?
No. A correlation describes an average tendency, not a rule for individuals, and it does not imply that age causes changes in weight. Other factors, such as lifestyle, genetics, and health, may also be at play.
From the plot and the covariance, I concluded that the relationship is negative, but neither showed how strong it is. The correlation coefficient shows that it is relatively weak.
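One way to put a number on "weak" is the squared correlation, which for a simple linear fit equals the proportion of variance in the response that the predictor accounts for. The sketch below applies this to the reported value of -0.3 and then checks the identity on hypothetical data (the `x` and `y` values are illustrative assumptions, not taken from the dataset).

```r
# Correlation reported above for the male subgroup
r <- -0.3
r_squared <- r^2   # proportion of variance shared linearly: 0.09

# Check the identity R^2 = r^2 on hypothetical data (values are illustrative)
x <- c(20, 30, 40, 50, 60, 70)
y <- c(72, 74, 69, 71, 66, 68)
fit <- lm(y ~ x)
r2_from_lm  <- summary(fit)$r.squared
r2_from_cor <- cor(x, y)^2

print(r_squared)
print(c(r2_from_lm = r2_from_lm, r2_from_cor = r2_from_cor))
```

Under this reading, a correlation of -0.3 means that a straight-line fit on age would account for roughly 9% of the variation in male weight, leaving most of it to other factors.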
confidence_interval1 <- t.test(data$weight)$conf.int
print(confidence_interval1)
## [1] 63.25942 63.30868
## attr(,"conf.level")
## [1] 0.95
confidence_interval2 <- t.test(plot_data$weight)$conf.int
print(confidence_interval2)
## [1] 55.51253 55.56354
## attr(,"conf.level")
## [1] 0.95
confidence_interval3 <- t.test(plot_data1$weight)$conf.int
print(confidence_interval3)
## [1] 70.09508 70.15562
## attr(,"conf.level")
## [1] 0.95
1) General population: the mean weight is estimated at approximately 63.3 kg (95% confidence interval 63.26 to 63.31).
2) Female: the mean weight is estimated at approximately 55.5 kg (95% confidence interval 55.51 to 55.56).
3) Male: the mean weight is estimated at approximately 70.1 kg (95% confidence interval 70.10 to 70.16).
In each case, the 95% confidence level means that if we were to take many random samples from the same population and calculate a confidence interval for each sample, approximately 95% of those intervals would contain the true population mean.
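The interval returned by `t.test()` can be reproduced by hand, which shows where its width comes from. The sketch below uses a small hypothetical sample of weights (illustrative values only) and compares the manual calculation with `t.test()`.

```r
# Hypothetical sample of weights in kg (illustrative values only)
w <- c(55, 60, 65, 70, 75, 80, 62, 68)

n       <- length(w)
w_mean  <- mean(w)
std_err <- sd(w) / sqrt(n)          # standard error of the mean
t_crit  <- qt(0.975, df = n - 1)    # two-sided 95% critical value

ci_manual <- c(w_mean - t_crit * std_err, w_mean + t_crit * std_err)
ci_ttest  <- as.numeric(t.test(w)$conf.int)

print(ci_manual)
print(ci_ttest)
```

Because the standard error shrinks with the square root of the sample size, a dataset with 991,346 rows produces the very narrow intervals seen above.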
Explanatory variables: age and sex. Response variable: cholesterol.
We are interested in knowing how many people have cholesterol above and below the median for their age and sex group.
data <-
data |>
group_by(age,sex) |>
mutate(chole_median = median(tot_chole),
chole_half = ifelse(tot_chole >= median(tot_chole),
"upper half",
"lower half")) |>
ungroup()
data %>%
ggplot() +
geom_point(mapping = aes(x = age,
y = tot_chole,
colour = chole_half)) +
scale_x_log10() +
scale_colour_brewer(palette = "Dark2") +
labs(title = "Age VS Cholesterol")
Here I have created a new column that divides cholesterol into two halves based on the median value within each age and sex group. With the Dark2 palette, the orange points are in the upper half and the green points are in the lower half. The plot also shows the distribution of outliers: a few values differ markedly from the rest.
We can also see a large number of outliers between ages 50 and 70. They may be measurement or recording errors, or they may reflect lifestyle changes common in that age range.
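Since the stated goal is to know how many people fall above and below the median, the split can also be tabulated directly. The following is a minimal base R sketch on hypothetical readings (the `toy` data frame and its values are assumptions for illustration); it applies the same `>=` rule as the code above.

```r
# Hypothetical cholesterol readings for two age groups (illustrative values)
toy <- data.frame(
  age       = c(40, 40, 40, 40, 50, 50, 50, 50, 50),
  tot_chole = c(180, 190, 200, 210, 170, 195, 195, 220, 240)
)

# Median within each age group, repeated on every row of that group
toy$chole_median <- ave(toy$tot_chole, toy$age, FUN = median)

# Same rule as above: values equal to the median count as "upper half"
toy$chole_half <- ifelse(toy$tot_chole >= toy$chole_median,
                         "upper half", "lower half")

# Number of people in each half, per age group
counts <- table(toy$age, toy$chole_half)
print(counts)
```

In the age 50 group, the two values tied with the median are assigned to the upper half, so the halves are not of equal size. The same effect is likely in the real data wherever many people share the median value.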
Relationship between cholesterol and age for each Sex
plot_data2 <-
data |>
filter(sex=="Female")
plot_data2 |>
ggplot() +
geom_point(mapping = aes(x = age, y = tot_chole)) +
labs(x = "Age distribution", y = "Cholesterol",
title = "Cholesterol as age increases for females",
subtitle = paste("Covariance:",
round(cov(plot_data2$age,
plot_data2$tot_chole), 2))) +
theme_minimal()
plot_data3 <-
data |>
filter(sex=="Male")
plot_data3 |>
ggplot() +
geom_point(mapping = aes(x = age, y = tot_chole)) +
labs(x = "Age distribution", y = "Cholesterol",
title = "Cholesterol as age increases for males",
subtitle = paste("Covariance:",
round(cov(plot_data3$age,
plot_data3$tot_chole), 2))) +
theme_minimal()
For females, the covariance between age and cholesterol is 66.55, indicating a positive direction of association.
For males, the covariance between age and cholesterol is -48.89, indicating a negative direction of association.
round(cor(plot_data2$age, plot_data2$tot_chole), 2)
## [1] 0.12
round(cor(plot_data3$age, plot_data3$tot_chole), 2)
## [1] -0.09
Both correlation values are very small (0.12 for females and -0.09 for males), indicating at most a very weak linear relationship between age and cholesterol.
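"Small" and "statistically significant" are separate questions, and `cor.test()` helps keep them apart. The sketch below uses simulated data with a deliberately tiny true association (the simulation settings are assumptions for illustration, not properties of the dataset). With a large sample, even a negligible correlation produces a very small p-value.

```r
set.seed(1)  # fixed seed so the simulation is reproducible

# Simulated data with a deliberately tiny true association (illustrative assumption)
n <- 100000
x <- rnorm(n)
y <- 0.05 * x + rnorm(n)

res <- cor.test(x, y)   # Pearson correlation test

print(res$estimate)     # sample correlation, close to 0.05
print(res$p.value)      # very small because n is large
print(res$conf.int)     # narrow interval around the small estimate
```

For a dataset of this size, the magnitude of the correlation is therefore a better guide to practical importance than the p-value.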
confidence_interval1 <- t.test(data$tot_chole)$conf.int
print(confidence_interval1)
## [1] 195.4809 195.6331
## attr(,"conf.level")
## [1] 0.95
confidence_interval2 <- t.test(plot_data2$tot_chole)$conf.int
print(confidence_interval2)
## [1] 196.3700 196.5906
## attr(,"conf.level")
## [1] 0.95
confidence_interval3 <- t.test(plot_data3$tot_chole)$conf.int
print(confidence_interval3)
## [1] 194.6365 194.8467
## attr(,"conf.level")
## [1] 0.95
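The mean cholesterol is estimated at around 195 overall, about 196.5 for females and about 194.7 for males. The female and male intervals do not overlap, so the difference is statistically detectable, but at under 2 units it is small in practical terms.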
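Comparing two separate confidence intervals is an informal check. A more direct approach is a two-sample t-test, which gives a confidence interval for the difference in means. The sketch below uses hypothetical values for the two groups (illustrative only).

```r
# Hypothetical cholesterol values for two groups (illustrative only)
female <- c(190, 198, 205, 196, 201, 194, 199, 197)
male   <- c(188, 192, 199, 195, 190, 196, 193, 191)

# Welch two-sample t-test (R's default: unequal variances)
res <- t.test(female, male)

diff_means <- mean(female) - mean(male)
print(diff_means)
print(res$conf.int)   # 95% CI for the difference in means
print(res$p.value)
```

On the real data, the analogous call would presumably be `t.test(tot_chole ~ sex, data = data)`, assuming `sex` has exactly two levels.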
Explanatory variables: age and sex. Response variable: systolic blood pressure.
We are interested in knowing how many people have systolic blood pressure above and below the median for their age and sex group.
data <-
data |>
group_by(age,sex) |>
mutate(SBP_median = median(SBP),
SBP_half = ifelse(SBP >= median(SBP),
"upper half",
"lower half")) |>
ungroup()
data %>%
ggplot() +
geom_point(mapping = aes(x = age,
y = SBP,
colour = SBP_half)) +
scale_x_log10() +
scale_colour_brewer(palette = "Dark2") +
labs(title = "Age VS Systolic Blood Pressure")
Here I have created a new column that divides systolic blood pressure into two halves based on the median value within each age and sex group. With the Dark2 palette, the orange points are in the upper half and the green points are in the lower half. The plot also shows the distribution of outliers: a few values differ markedly from the rest.
We can also see a large number of outliers between ages 50 and 70. They may be measurement or recording errors, or they may reflect lifestyle changes common in that age range.
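Outliers can also be flagged numerically rather than by eye. The sketch below applies the common 1.5 × IQR rule, the same convention used for boxplot whiskers, to hypothetical readings (the `sbp` values are illustrative assumptions). It is one possible definition of an outlier, not the one used to describe the plot above.

```r
# Hypothetical systolic blood pressure readings (illustrative values)
sbp <- c(110, 115, 118, 120, 122, 125, 128, 130, 135, 210)

q     <- quantile(sbp, c(0.25, 0.75))   # first and third quartiles
iqr   <- q[2] - q[1]                    # interquartile range
lower <- q[1] - 1.5 * iqr               # lower fence
upper <- q[2] + 1.5 * iqr               # upper fence

# Readings outside the fences are flagged as outliers
outliers <- sbp[sbp < lower | sbp > upper]
print(outliers)
```

Applying the rule within each age group, rather than to the whole column, would make it possible to check whether the 50 to 70 range really contains more flagged values than other ages.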
Relationship between systolic blood pressure and age for each Sex
plot_data4 <-
data |>
filter(sex=="Female")
plot_data4 |>
ggplot() +
geom_point(mapping = aes(x = age, y = SBP)) +
labs(x = "Age distribution", y = "Systolic Blood Pressure",
title = "Systolic Blood Pressure as age increases for females",
subtitle = paste("Covariance:",
round(cov(plot_data4$age,
plot_data4$SBP), 2))) +
theme_minimal()
plot_data5 <-
data |>
filter(sex=="Male")
plot_data5 |>
ggplot() +
geom_point(mapping = aes(x = age, y = SBP)) +
labs(x = "Age distribution", y = "Systolic Blood Pressure",
title = "Systolic Blood Pressure as age increases for males",
subtitle = paste("Covariance:",
round(cov(plot_data5$age,
plot_data5$SBP), 2))) +
theme_minimal()
For females, the covariance between age and systolic blood pressure is 87.13, indicating a positive direction of association.
For males, the covariance between age and systolic blood pressure is 31.87, also indicating a positive direction of association.
round(cor(plot_data4$age, plot_data4$SBP), 2)
## [1] 0.4
round(cor(plot_data5$age, plot_data5$SBP), 2)
## [1] 0.17
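The correlation is moderate for females (0.4) and weak for males (0.17). Systolic blood pressure therefore tends to rise with age in both groups, and more clearly among females.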
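The two correlations can be computed in one step by splitting the data by sex instead of building a filtered data frame for each group. The sketch below shows this on hypothetical data (the `toy` data frame and its values are assumptions for illustration).

```r
# Hypothetical ages and SBP readings by sex (illustrative values only)
toy <- data.frame(
  sex = rep(c("Female", "Male"), each = 5),
  age = rep(c(20, 30, 40, 50, 60), times = 2),
  SBP = c(105, 112, 118, 127, 135,   # rises steadily with age
          120, 126, 119, 128, 123)   # rises only loosely with age
)

# Split by sex and compute the age/SBP correlation within each group
cor_by_sex <- sapply(split(toy, toy$sex),
                     function(d) cor(d$age, d$SBP))
print(round(cor_by_sex, 2))
```

The same pattern extends to any other grouping variable or response column without creating additional `plot_data` objects.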
# 95% confidence interval for mean SBP: full sample
confidence_interval1 <- t.test(data$SBP)$conf.int
print(confidence_interval1)
# 95% confidence interval for mean SBP: females
confidence_interval2 <- t.test(plot_data4$SBP)$conf.int
print(confidence_interval2)
# 95% confidence interval for mean SBP: males
confidence_interval3 <- t.test(plot_data5$SBP)$conf.int
print(confidence_interval3)
These three intervals estimate the mean systolic blood pressure for the full sample, for females, and for males, respectively.