This entire sample is subject to non-response bias; all the collected data comes from people who where willing to answer the phone and finish a very long survey.
It is not a great representation of the general population, but the sample size is useful to find statistical significance to even small differences.
For computing capabilities, we will refer to the data collected in the state of California only.
df = brfss2013
rm(brfss2013) # get rid of long name
df = df[20000:40000,] # trimming data, no CA rows deleted
df = tbl_df(df)
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## [1] 11518 330
Due to the fact that this has been collected retrospectively we can only imply CORRELATIONS not CAUSALITY.
Research quesion 1: What are the statistical characteristics of the estimaged age-gender specific maximum oxygen consumption? Does it follow a normal distribution?
variable: $maxvo2_
(continuous) in ml/min/kg
Research quesion 2: Does a correlation exist between the hours slept each night and reported days with poor physical health or poor mental health in the last 30 days?
explanatory variable: sleptim1
in hours
response variables: physhlth
and menthlth
in days
all of these variables are continuous
Research quesion 3: Does a correlation exist between the body mass index (BMI) of the individual answering the survey and having a high blood cholesterol or having a heart attack?
explanatory variable: _bmi5
(continuous) computed by researchers in kg/m^2
response variables: toldhi2
and cvdinfr4
both response variables are discrete
####Research quesion 1: What are the statistical characteristics of the estimaged age-gender specific maximum oxygen consumption? Does it follow a normal distribution?
The last two digits of this variable are implied decimal places, before doing anything this corrects to the units: ml/min/kg
We will now get some descritptive statistics on the computed measurement.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5.55 23.70 29.75 30.31 36.90 50.10 10
## [1] 8.697629
Next some exploratory graphs
maxvo2 = ggplot(df, aes(x=maxvo2_)) + theme_bw() +
geom_histogram(binwidth=2, alpha=0.5, aes(y=..density..), fill="lightblue", col="grey") +
stat_function(fun=dnorm, args=list(mean=30.31, sd=8.7), col="red") +
geom_rug() +
labs(title="Distribution of VO2 max in California Population",
x="VO2 max (ml O2/min/kg)",
y="Frequency")
maxvo2
## Warning: Removed 10 rows containing non-finite values (stat_bin).
The red superimposed curve is a normal distribution with the same
mean and standard deviation that this sample has. Nevertheless, this
is not the most accurate way to determine if `maxvo2` has a normal
distribution.
A another visual method to evaluate if this sample follow a sample distribution is a normal probability plot using the qqnorm
function
The sample looks like it does not follow a normal distribution, it is seem to be bimodal as evidenced from the ‘S’ shape of the QQ plot.
Regardless, we can use the Shapiro-Wilk normality test
to assess this more thoroughly. The null hypothesis (Ho) is that the data follow a normal distribution, and we will reject the Ho if the p-value < 0.05
.
##
## Shapiro-Wilk normality test
##
## data: sample(df$maxvo2_, 5000)
## W = 0.98845, p-value < 2.2e-16
Unfortunately, the shapiro.test function only allows for samples
with sizes between 3 and 5000 observations, therefore we take a
random sample.
Conclusion p < 2.2e-16
, we can assume that this sample does not follow a normal distribution.
####Research quesion 2: Does a correlation exist between the hours slept each night and reported days with poor physical health or poor mental health in the last 30 days?
Turning all the values into numeric
df$sleptim1 = as.numeric(df$sleptim1)
df$physhlth = as.numeric(df$physhlth)
df$menthlth = as.numeric(df$menthlth)
Descriptive statistics
stats = summarize(df, "sleptim1", mean(sleptim1, na.rm=T), sd(sleptim1, na.rm=T))
colnames(stats) = c("Variable", "mean", "sd")
temp = summarize(df, "physhlth", mean(physhlth, na.rm=T), sd(physhlth, na.rm=T))
colnames(temp) = c("Variable", "mean", "sd")
stats = rbind(stats, temp)
temp = summarize(df, "menthlth", mean(physhlth, na.rm=T), sd(menthlth, na.rm=T))
colnames(temp) = c("Variable", "mean", "sd")
stats = rbind(stats, temp)
stats
## # A tibble: 3 x 3
## Variable mean sd
## <chr> <dbl> <dbl>
## 1 sleptim1 7.06 1.42
## 2 physhlth 4.23 8.68
## 3 menthlth 4.23 7.85
Lets test the correlation between hours slept each night and reported days with poor physical health in the last 30 days
ggplot(df, aes(x=sleptim1, y=physhlth)) +
theme_bw() +
geom_jitter(alpha=0.25) +
geom_smooth() +
labs(x="Hours slept / night", y="Poor physical health (days / month)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 145 rows containing non-finite values (stat_smooth).
## Warning: Removed 145 rows containing missing values (geom_point).
## [1] -0.04374059
There appears to be no clear correlation in the exploratory graph. Also the coefficient of correlation r=-0.04
is not significant.
Now, lets test the correlation between hours slept each night and reported days with poor mental health in the last 30 days
ggplot(df, aes(x=sleptim1, y=menthlth)) +
theme_bw() +
geom_jitter(alpha=0.25) +
geom_smooth() +
labs(x="Hours slept / night", y="Poor mental health (days / month)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 162 rows containing non-finite values (stat_smooth).
## Warning: Removed 162 rows containing missing values (geom_point).
## [1] -0.1152792
Same thing, no visual correlation or significant coefficient can be deducted r=-0.11
.
Conclusion: Even though there is no linear relationship between the hours slept and both dependent variables, there is a dip in the lowess regression line plotted which leads to infer that people on both extremes of the spectrum might have a higher number of days with poor physical or mental health.
####Research quesion 3: Does a correlation exist between the body mass index (BMI) of the individual answering the survey and having a high blood cholesterol or having a heart attack?
The last two digits of this variable are implied decimal places, before doing anything this corrects to the units: kg/m^2
Descriptive statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.53 23.03 25.97 26.99 29.76 75.05 849
## [1] 5.725637
##
## Yes No
## 3782 5684
##
## Yes No
## 483 11009
Exploratory graph for high cholesterol
ggplot(df, aes(x=toldhi2, y=X_bmi5, fill=toldhi2)) +
theme_bw() +
geom_violin() +
labs(title="Violin Plot",
x="Ever told you have high cholesterol?",
y="BMI (kg/m^2)")
## Warning: Removed 849 rows containing non-finite values (stat_ydensity).
Statistical analysis for high cholesterol
temp = filter(df, toldhi2=="Yes" | toldhi2=="No") %>%
select(X_bmi5, toldhi2)
chisq.test(temp$X_bmi5, temp$toldhi2)
## Warning in chisq.test(temp$X_bmi5, temp$toldhi2): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: temp$X_bmi5 and temp$toldhi2
## X-squared = 1555, df = 1368, p-value = 0.0002935
p=2.9e-4
therefore we can reject the Ho that states that there is no correlation between having a higher BMI and suffering hypercholesterolemia.
Exploratory graph for heart attack
ggplot(df, aes(x=cvdinfr4, y=X_bmi5, fill=cvdinfr4)) +
theme_bw() +
geom_violin() +
labs(title="Violin Plot",
x="Have you ever had a heart attack?",
y="BMI (kg/m^2)")
## Warning: Removed 849 rows containing non-finite values (stat_ydensity).
Statistical analysis for heart attack
temp = filter(df, cvdinfr4=="Yes" | cvdinfr4=="No") %>%
select(X_bmi5, cvdinfr4)
chisq.test(temp$X_bmi5, temp$cvdinfr4)
## Warning in chisq.test(temp$X_bmi5, temp$cvdinfr4): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: temp$X_bmi5 and temp$cvdinfr4
## X-squared = 1864.4, df = 1449, p-value = 6.473e-13
p=6.47e-13
therefore we can reject the Ho that states that there is no correlation between having a higher BMI and having a heart attack.
Conclusion: Having a higher BMI correlates statistically with having hypercholesterolemia or having a heart attack.