The first part of this is intended to take no more than 90 minutes to complete. For each question, show the R code using this markdown file, and provide a written description of what you did and what you found. You may consult any resources (your notes, previous problem sets, books, websites, etc.), but no other people during this part of the test.
The test will use the following data:
data <- read.table("exam2.csv",header=T)
Use appropriate t-tests AND Bayes factor tests to answer the following questions:
* Is the mean of x1 different from 0?
* Is the mean of x2 different from 0?
* Is the mean of x2 different from x1, assuming they are independent groups?
* Is the mean of x2 different from x1, assuming they are two measures of the same group of people (e.g., before and after a medicine)?
* Do men have a different x2 score than women?
library(car)
## Warning: package 'car' was built under R version 4.0.2
## Loading required package: carData
par(mfrow = c(1,2))
qqPlot(data$x1,main = 'x1 Checking for Normality')
## [1] 39 30
qqPlot(data$x2,main = 'x2 Checking for Normality')
## [1] 39 34
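# As a numeric complement to the Q-Q plots, a Shapiro-Wilk test could also be run on each
# variable (an optional quick sketch using the same data object loaded above):
shapiro.test(data$x1)
shapiro.test(data$x2)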
library(BayesFactor)
## Warning: package 'BayesFactor' was built under R version 4.0.2
## Loading required package: coda
## Warning: package 'coda' was built under R version 4.0.2
## Loading required package: Matrix
## ************
## Welcome to BayesFactor 0.9.12-4.2. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************
#1)
t.test(data$x1, mu = 0, alternative = "two.sided")
##
## One Sample t-test
##
## data: data$x1
## t = 0.90552, df = 49, p-value = 0.3696
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.2818193 0.7441024
## sample estimates:
## mean of x
## 0.2311416
#The mean of x1 is not statistically different from 0 (t = 0.906, df = 49, p = 0.370).
ttestBF(data$x1, mu = 0) # BF = 0.23 (about 4.4:1 in favor of the null), substantial evidence that the mean does not differ from 0
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.226535 ±0%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS
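# For reporting, the numeric Bayes factor can be pulled out of the BayesFactor object with
# extractBF() (a small sketch; the test above is simply stored in an object first):
bf_x1 = ttestBF(data$x1, mu = 0)   # same one-sample Bayes factor test as above
extractBF(bf_x1)$bf                # numeric BF for the alternative (about 0.23 here)
1/extractBF(bf_x1)$bf              # evidence in favor of the null (about 4.4)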
#2)
t.test(data$x2, mu = 0, alternative = "two.sided")
##
## One Sample t-test
##
## data: data$x2
## t = 2.6864, df = 49, p-value = 0.009833
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.1723599 1.1958402
## sample estimates:
## mean of x
## 0.6841
#The mean of x2 is statistically different from 0 (t = 2.686, df = 49, p = 0.0098).
ttestBF(data$x2, mu = 0) # BF = 3.81, substantial evidence for the alternative hypothesis that the mean differs from 0
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 3.8119 ±0%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS
#3)
t.test(data$x1,data$x2, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: data$x1 and data$x2
## t = -1.2563, df = 97.999, p-value = 0.212
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1684792 0.2625623
## sample estimates:
## mean of x mean of y
## 0.2311416 0.6841000
#Treated as independent groups, the mean of x1 is not statistically different from the mean of x2 (t = -1.256, df = 98, p = 0.212).
ttestBF(data$x1,data$x2) # BF = 0.42 (about 2.4:1 in favor of the null), weak evidence that the two means do not differ
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.4243157 ±0%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
#4)
t.test(data$x1,data$x2, alternative = "two.sided", paired = T)
##
## Paired t-test
##
## data: data$x1 and data$x2
## t = -5.4564, df = 49, p-value = 1.591e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6197826 -0.2861342
## sample estimates:
## mean of the differences
## -0.4529584
#In a paired comparison, the mean of x1 is statistically lower than the mean of x2 (t = -5.456, df = 49, p = 1.59e-06).
ttestBF(data$x1,data$x2,paired = T) #BF = 10608.79, decisive evidence that the two means differ when the measures are paired
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 10608.79 ±0%
##
## Against denominator:
## Null, mu = 0
## ---
## Bayes factor type: BFoneSample, JZS
#5)
# table(data$x2, data$gender)
women = data$x2[data$gender == 'F']
men = data$x2[data$gender == 'M']
t.test(men,women,alternative = "two.sided") # t = 0.675, df = 40.1, p = 0.503: no significant difference in mean x2 between men and women
##
## Welch Two Sample t-test
##
## data: men and women
## t = 0.67523, df = 40.114, p-value = 0.5034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.7076365 1.4177938
## sample estimates:
## mean of x mean of y
## 0.8758425 0.5207639
ttestBF(men,women) #BF = 0.34 (about 2.9:1 in favor of the null), favoring the hypothesis that mean x2 does not differ by gender
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.3447744 ±0.02%
##
## Against denominator:
## Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
# length(data$x2[data$gender == 'M'])
# length(data$x2[data$gender == 'F'])
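# The same gender comparison can be written with the formula interface, which avoids the manual
# subsetting (a sketch; gender is converted to a factor to be safe, since the ttestBF formula
# interface expects a two-level grouping factor):
data2 = data
data2$gender = factor(data2$gender)
t.test(x2 ~ gender, data = data2)             # same Welch test as above (the sign of t flips because groups are ordered F, M)
ttestBF(formula = x2 ~ gender, data = data2)  # corresponding Bayes factor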
Form tables showing the relationship between (1) gender and favorite TV show, and (2) gender and favorite super hero character. Use a chi-squared test and the corresponding Bayes factor test to determine whether either TV show or super hero preference depends on gender. Briefly describe your conclusions.
library(BayesFactor)
tv_gender = table(data$tvshow,data$gender)
chisq.test(tv_gender)
## Warning in chisq.test(tv_gender): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tv_gender
## X-squared = 21.08, df = 3, p-value = 0.0001013
contingencyTableBF(tv_gender, sampleType="indepMulti",fixedMargin = 'cols')
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 2095.778 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
#Both the Bayes factor (BF = 2095.78) and the chi-squared test (X-squared = 21.08, df = 3, p = 0.0001) strongly favor the alternative hypothesis: TV-show preference is not independent of gender.
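# Because R warned that the chi-squared approximation may be incorrect (some expected cell
# counts are small), a Monte Carlo p-value is a reasonable robustness check (optional sketch):
chisq.test(tv_gender)$expected                              # inspect the expected counts behind the warning
chisq.test(tv_gender, simulate.p.value = TRUE, B = 10000)   # simulated p-value, no approximation needed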
hero_gender = table(data$hero,data$gender)
chisq.test(hero_gender)
## Warning in chisq.test(hero_gender): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: hero_gender
## X-squared = 2.4247, df = 3, p-value = 0.4891
contingencyTableBF(hero_gender, sampleType="indepMulti",fixedMargin = 'cols')
## Bayes factor analysis
## --------------
## [1] Non-indep. (a=1) : 0.2835535 ±0%
##
## Against denominator:
## Null, independence, a = 1
## ---
## Bayes factor type: BFcontingencyTable, independent multinomial
#On the other hand, there is no evidence of gender-related preference for super heroes (X-squared = 2.42, df = 3, p = 0.49; BF = 0.28, which favors independence by about 3.5:1).
Is x1 significantly correlated with x2, x3, or x4 (test using Pearson's correlation)? For which pair should you use a non-parametric test? Explain why, and how you know. For the pair you identify, test this correlation with both a parametric and non-parametric test.
library(car)
par(mfrow = c(2,2))
qqPlot(data$x1,main = 'x1 Checking for Normality')
## [1] 39 30
qqPlot(data$x2,main = 'x2 Checking for Normality')
## [1] 39 34
qqPlot(data$x3,main = 'x3 Checking for Normality')
## [1] 50 1
qqPlot(data$x4,main = 'x4 Checking for Normality')
## [1] 49 50
# The Q-Q plots show that x4 is not normally distributed, so the pair involving x4 calls for a non-parametric test
#1)
cor.test(data$x1,data$x2, method="pearson")
##
## Pearson's product-moment correlation
##
## data: data$x1 and data$x2
## t = 20.423, df = 48, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9079842 0.9697273
## sample estimates:
## cor
## 0.9469933
# t = 20.42, df = 48, p < 2.2e-16, r = 0.947: strong evidence of a positive correlation between x1 and x2
#2)
cor.test(data$x1,data$x3, method="pearson")
##
## Pearson's product-moment correlation
##
## data: data$x1 and data$x3
## t = -0.20588, df = 48, p-value = 0.8378
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3055253 0.2507170
## sample estimates:
## cor
## -0.0297036
# t = -0.206, df = 48, p = 0.838, r = -0.030: no evidence of a correlation between x1 and x3
#3)
cor.test(data$x1,data$x4, method="spearman")
##
## Spearman's rank correlation rho
##
## data: data$x1 and data$x4
## S = 16028, p-value = 0.1075
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.2303481
# Spearman's rank-order test is used here because x4 fails the normality check.
#S = 16028, p = 0.108, rho = 0.23: a small positive rank correlation that is not statistically significant
cor.test(data$x1,data$x4, method="pearson")
##
## Pearson's product-moment correlation
##
## data: data$x1 and data$x4
## t = -0.45013, df = 48, p-value = 0.6546
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3370982 0.2174378
## sample estimates:
## cor
## -0.06483387
# t = -0.45, df = 48, p = 0.655, r = -0.065: as with x1 and x3, the parametric test also shows no evidence of a correlation between x1 and x4
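# For a quick overview alongside the pairwise tests, the full correlation matrix can be computed
# in one call (sketch; the column names are the same ones used above):
round(cor(data[, c("x1","x2","x3","x4")], method = "pearson"), 3)    # Pearson correlations
round(cor(data[, c("x1","x2","x3","x4")], method = "spearman"), 3)   # rank-based, robust to the non-normal x4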
Create a linear regression model predicting x1 by x2, x3, x4, and gender. Show the fitted coefficient values.
linear_model = lm(x1~x2+x3+x4+gender,data=data)
summary(linear_model)
##
## Call:
## lm(formula = x1 ~ x2 + x3 + x4 + gender, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40989 -0.36441 0.05034 0.33158 1.51932
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.4549277 0.1288637 -3.530 0.000971 ***
## x2 0.9467475 0.0480819 19.690 < 2e-16 ***
## x3 -0.0185110 0.0582653 -0.318 0.752180
## x4 0.0005956 0.0036616 0.163 0.871516
## genderM 0.0999731 0.1772316 0.564 0.575500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6025 on 45 degrees of freedom
## Multiple R-squared: 0.8977, Adjusted R-squared: 0.8886
## F-statistic: 98.7 on 4 and 45 DF, p-value: < 2.2e-16
linear_model$fitted.values
## 1 2 3 4 5 6
## -1.434838285 -0.521969178 -0.103248514 -2.952759980 -0.685019721 -0.052594289
## 7 8 9 10 11 12
## 3.030823343 -0.557876513 0.468535961 -1.968465278 0.918351255 2.233391935
## 13 14 15 16 17 18
## -0.913285685 -0.635925111 1.694591827 0.578167295 0.121493713 -2.221428186
## 19 20 21 22 23 24
## -0.003952669 1.719513379 1.000730311 0.923908646 1.256107774 1.790989500
## 25 26 27 28 29 30
## -2.731278965 -2.368939608 -0.103093533 2.834613953 1.797034682 -2.940312941
## 31 32 33 34 35 36
## 0.532178326 0.497921027 1.954198646 3.474207520 0.657715295 1.704091361
## 37 38 39 40 41 42
## -1.948223432 0.172777124 4.639726421 -0.922112494 -1.287324768 0.944273758
## 43 44 45 46 47 48
## 1.306127312 2.401492427 -0.074808561 -0.493253478 -0.236536814 0.488051195
## 49 50
## 0.121623029 -2.548309311
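# The fitted coefficient values themselves (as opposed to the fitted/predicted values listed
# above) can be displayed directly (a short sketch):
coef(linear_model)      # fitted coefficient estimates
confint(linear_model)   # 95% confidence intervals for the coefficients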
For the model you created, which predictors are significantly different from 0, other than the intercept term? Write a sentence for each predictor describing whether the t-test associated with the predictor shows it is different from 0. Include the t value, degrees of freedom, and the resulting p-value.
#From the regression summary, x2 is the only predictor whose coefficient is significantly different from 0: t = 19.69, df = 45, p < 2e-16.
#x3: t = -0.32, df = 45, p = 0.752, so its coefficient is not different from 0.
#x4: t = 0.16, df = 45, p = 0.872, not different from 0.
#genderM: t = 0.56, df = 45, p = 0.576, not different from 0.
#The tests below address a related but different question, comparing each variable's values with x1 directly: x2 vs x1 (t = -1.26, df = 98.0, p = 0.212, n.s.), x3 vs x1 (t = -1.49, df = 97.9, p = 0.138, n.s.), x4 vs x1 (W = 475, p = 9.3e-08, clearly different), and x1 by gender (W = 263, p = 0.364, n.s.).
t.test(data$x1,data$x2, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: data$x1 and data$x2
## t = -1.2563, df = 97.999, p-value = 0.212
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.1684792 0.2625623
## sample estimates:
## mean of x mean of y
## 0.2311416 0.6841000
t.test(data$x1,data$x3, alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: data$x1 and data$x3
## t = -1.4947, df = 97.93, p-value = 0.1382
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.2730931 0.1792065
## sample estimates:
## mean of x mean of y
## 0.2311416 0.7780849
wilcox.test(data$x1,data$x4)
##
## Wilcoxon rank sum test with continuity correction
##
## data: data$x1 and data$x4
## W = 475, p-value = 9.332e-08
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(data$x1~data$gender)
##
## Wilcoxon rank sum exact test
##
## data: data$x1 by data$gender
## W = 263, p-value = 0.3635
## alternative hypothesis: true location shift is not equal to 0
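# The t value, degrees of freedom, and p-value for each coefficient can also be read straight
# from the regression table, rather than from separate two-sample tests (sketch):
summary(linear_model)$coefficients   # Estimate, Std. Error, t value, Pr(>|t|) for each predictor
df.residual(linear_model)            # residual df for these t-tests (45 here)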
For the significant predictors in the model, describe in words how you would interpret the coefficient. In addition, describe the coefficient for gender, and discuss specifically what that coefficient represents.
#the only significant predictor is the variable x2
# The linear model predicts x1 by the equation -->
#x1 = -0.45 + 0.946 * x2 - 0.018 * x3 + 0.0005 * x4 + 0.0999 * gender (where male = 1 and female = 0)
#Holding x3, x4, and gender constant, each one-unit increase in x2 is associated with a 0.946-unit increase in predicted x1.
#For example, with x3 = 0, x4 = 0, and gender = F: x2 = 1 gives x1 = -0.45 + 0.946*1 = 0.496; x2 = 2 gives 1.442; x2 = 3 gives 2.388. So an n-unit increase in x2 raises the prediction by 0.946*n, a much larger effect than the near-zero x3 and x4 coefficients.
#The genderM coefficient (0.0999) is the expected difference in x1 between a male and a female with the same x2, x3, and x4: the prediction is 0.0999 higher for males and unchanged for females (0.0999 * 0 = 0). This coefficient, however, is not significantly different from 0.
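# A worked prediction makes the gender coefficient concrete: for two hypothetical people with
# identical x2, x3, and x4 (the values below are made up for illustration), the predictions
# differ only by the genderM coefficient (a sketch):
new_people = data.frame(x2 = 1, x3 = 0, x4 = 0, gender = c("F","M"))  # hypothetical covariate values
predict(linear_model, newdata = new_people)        # two predictions, identical except for gender
diff(predict(linear_model, newdata = new_people))  # equals the genderM coefficient, about 0.0999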
Use an AIC criterion to reduce the model to the simplest, most descriptive model. You may do this with a stepwise regression function or obtain the AIC values in some other way and select the model manually. For this simpler model, examine the multiple R^2 and compare it to the largest model. Discuss how you would interpret the difference between the models in R^2.
library(MASS)
## Warning: package 'MASS' was built under R version 4.0.3
AIC_selected_model = stepAIC(linear_model,direction = 'both')
## Start: AIC=-45.94
## x1 ~ x2 + x3 + x4 + gender
##
## Df Sum of Sq RSS AIC
## - x4 1 0.010 16.343 -47.911
## - x3 1 0.037 16.370 -47.828
## - gender 1 0.115 16.449 -47.588
## <none> 16.333 -45.940
## - x2 1 140.725 157.058 65.230
##
## Step: AIC=-47.91
## x1 ~ x2 + x3 + gender
##
## Df Sum of Sq RSS AIC
## - x3 1 0.028 16.371 -49.827
## - gender 1 0.106 16.449 -49.587
## <none> 16.343 -47.911
## + x4 1 0.010 16.333 -45.940
## - x2 1 140.855 157.198 63.274
##
## Step: AIC=-49.83
## x1 ~ x2 + gender
##
## Df Sum of Sq RSS AIC
## - gender 1 0.104 16.475 -51.509
## <none> 16.371 -49.827
## + x3 1 0.028 16.343 -47.911
## + x4 1 0.000 16.370 -47.828
## - x2 1 140.986 157.356 61.325
##
## Step: AIC=-51.51
## x1 ~ x2
##
## Df Sum of Sq RSS AIC
## <none> 16.475 -51.509
## + gender 1 0.104 16.371 -49.827
## + x3 1 0.026 16.449 -49.587
## + x4 1 0.007 16.468 -49.530
## - x2 1 143.159 159.634 60.043
cat(' rsquared AIC model = ',summary(AIC_selected_model)$r.squared,'\n'
,'adjusted r squared AIC model = ',summary(AIC_selected_model)$adj.r.squared,'\n','rsquared largest model = ',summary(linear_model)$r.squared
,'\n','adjusted r squared largest model = ',summary(linear_model)$adj.r.squared)
## rsquared AIC model = 0.8967962
## adjusted r squared AIC model = 0.8946462
## rsquared largest model = 0.8976822
## adjusted r squared largest model = 0.8885873
#It is wise to report both the multiple and the adjusted R squared, since the adjusted version penalizes additional predictors. Looking at the multiple R squared, the largest model accounts for 89.77% of the variation in x1, while in the reduced model x2 alone accounts for 89.68%: dropping x3, x4, and gender costs almost nothing in explained variance. The adjusted R squared, which accounts for the number of predictors, is actually about 0.6 percentage points higher for the reduced model (0.8946 vs 0.8886). In conclusion, the reduced model, which also has the lowest AIC (-51.51), is the better of the two for predicting x1, even though larger models usually fit the training data at least as well.
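# The drop in R squared can also be tested formally: a nested-model F-test asks whether x3, x4,
# and gender jointly add explanatory power beyond x2 alone (optional sketch):
anova(AIC_selected_model, linear_model)   # F-test comparing the reduced and full models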
Make a plot of x4 by x3. Make a model predicting x4 with x3, and examine the residuals. Choose a transform of x4 that makes the residuals appear normally distributed, and create a new model predicting this transformed variable. Compare the R^2 values of these two models and use this to determine whether the transformation was justified.
library(rcompanion)
## Warning: package 'rcompanion' was built under R version 4.0.3
#summary(lm(x4~x3,data = data))
tranformed_x4 = transformTukey(data$x4,plotit=T)
##
## lambda W Shapiro.p.value
## 403 0.05 0.9776 0.4568
##
## if (lambda > 0){TRANS = x ^ lambda}
## if (lambda == 0){TRANS = log(x)}
## if (lambda < 0){TRANS = -1 * x ^ lambda}
summary(lm(tranformed_x4~x3,data = data))$r.squared
## [1] 0.7528011
summary(lm(x4~x3,data = data))$r.squared
## [1] 0.3478208
#The transformation appears well justified: the multiple R squared increases from 0.348 to 0.753 when predicting the transformed x4
par(mfrow=c(1,2))
plot(data$x3,data$x4,xlab = 'x3', ylab = 'x4')
plot(data$x3,lm(tranformed_x4~x3,data = data)$fitted.values, xlab = 'x3', ylab = 'Transformed x4')
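# Since transformTukey reported lambda = 0.05 (> 0), the transform it applied is simply
# x4^0.05, which can be reproduced by hand as a check (sketch; assumes x4 is strictly positive,
# as a positive power transform requires):
manual_x4 = data$x4^0.05                         # Tukey ladder transform: lambda > 0 means x^lambda
all.equal(as.numeric(tranformed_x4), manual_x4)  # should agree with the transformTukey result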
Create a polynomial regression predicting x5 based on x1. Use either an AIC, BIC, or F-test to select the order of the polynomial (number of predictors) you want to use to predict the data. Discuss which polynomial model you would select, and why.
library(car)
qqPlot(data$x5,main = 'x5 normality qqplot')
## [1] 39 4
lmodel = lm(x5~x1,data=data)
poly1 = lm(x5~x1 + I(x1^2),data=data)
poly2 = lm(x5~x1 + I(x1^2) + I(x1^3),data=data)
poly3 = lm(x5~x1 + I(x1^2) + I(x1^3) + I(x1^4),data=data)
poly4 = lm(x5~x1 + I(x1^2) + I(x1^3) + I(x1^4) + I(x1^5),data=data)
poly5 = lm(x5~x1 + I(x1^2) + I(x1^3) + I(x1^4) + I(x1^5) +I (x1^6),data=data)
extractAIC(lmodel)
## [1] 2.0000 244.6147
extractAIC(poly1)
## [1] 3.0000 240.5827
extractAIC(poly2)
## [1] 4.00000 74.97184
extractAIC(poly3)
## [1] 5.00000 76.09727
extractAIC(poly4)
## [1] 6.00000 77.97617
extractAIC(poly5)
## [1] 7.00000 78.10213
#I would select the model poly2 (the cubic polynomial, x1 + x1^2 + x1^3) because it has the lowest AIC (74.97); adding higher-order terms only increases the AIC
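# As a cross-check on the AIC values, nested F-tests between successive polynomial orders can be
# obtained in a single call (sketch; each model adds one higher-order term to the previous one):
anova(lmodel, poly1, poly2, poly3, poly4, poly5)   # sequential F-tests between the nested fits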
Use the model to predict the value of x5 for x1 values = c(-5,-1,2,4)
predict(poly2,list(x1 = c(-5,-1,2,4)))
## 1 2 3 4
## -119.459714 3.898447 -8.135844 22.107415
You must work alone on this exam. Do not ask for or give assistance to other members of the class. It is due by Monday at midnight. For this portion of the exam, create a professional-looking report that answers these questions. You can include R code as markdown, but do not include excessive output; provide your answers to a non-technical manager, and back them up with stats/code/results.
Imagine you are consulting with a medical group that is recording a number of physiological sensors on their patients using a mobile device. Your ultimate goal is to create methods for displaying and cleaning data, and representing it with a simpler model that will be uploaded to your company servers and used to track your customers over time.
Unless otherwise specified, demonstrate the following for each of the four data sets separately. Your best approach will be to create a function that works on a single patient, and then apply it to each patient data set separately.
library(readr)
## Warning: package 'readr' was built under R version 4.0.2
library(unikn)
## Warning: package 'unikn' was built under R version 4.0.3
A2_1_MOPP = read_csv("C:/Users/Andrea/Desktop/PSY5210/R files/A2_1_MOPP.csv")
B1_2_BDU = read_csv("C:/Users/Andrea/Desktop/PSY5210/R files/B1_2_BDU.csv")
C2_2_MOPP = read_csv("C:/Users/Andrea/Desktop/PSY5210/R files/C2_2_MOPP.csv")
D2_2_MOPPnew = read_csv("C:/Users/Andrea/Desktop/PSY5210/R files/D2_2_MOPPnew.csv")
#plots ECG.HR, EDR.BR, Belt.BR, CoreTemp, and Temp all on the same time axis.
#Be aware of missing data, and data that are out of a reasonable range (i.e., temp=0), and convert these to NA values.
best_function = function(data_set){
  # Extract each signal, dropping NAs and implausible readings (e.g., temp = 0, HR = 0)
  temperature = data_set[,2][data_set[,2]>20 & !is.na(data_set[,2])]
  HR = data_set[,8][data_set[,8]>0 & !is.na(data_set[,8])]
  BR = data_set[,9][data_set[,9]>0 & !is.na(data_set[,9])]
  BELT_BR = data_set[,10][data_set[,10]>0 & !is.na(data_set[,10])]
  coretemp = data_set[,15][data_set[,15]>0 & !is.na(data_set[,15])]
  # Descriptive statistics shown along the top of the plot
  mean_hr = round(mean(HR),1)
  max_hr = round(max(HR),1)
  mean_temp = round(mean(temperature),1)
  max_temp = round(max(temperature),1)
  median_BR = round(median(BR),1)
  # Empty canvas, titled with the name of the data set that was passed in
  plot(NA, ylim = c(0,max(HR)*1.3), xlim = c(0,length(HR)), xlab = 'Time(5s)', ylab = '',
       main = deparse(substitute(data_set)))
  points(HR, type = 'l', col = 'grey')                    # raw HR
  lines(filter(HR,c(1,1,1,1)/4), col = "blue", lwd = 2)   # 4-point moving average of HR (stats::filter)
  points(BR, type = 'l', col = 'brown', lwd = 2)
  points(BELT_BR, type = 'l', col = 'yellow', lwd = 2)
  points(coretemp, type = 'l', col = 'green', lwd = 2)
  points(temperature, type = 'l', col = 'purple', lwd = 2)
  # Legend colors match the plotted traces
  legend('top', legend = c('HR','BR','BELT BR',' Core Temp',' Temp'),
         col = c('blue','brown','yellow','green','purple'), lty = 1, cex = 0.8, horiz = T, bty = 'n')
  # Horizontal gridlines every 15 units
  segments(-100,seq(0, max(HR)*1.1, by=15),1500,seq(0, max(HR)*1.1, by=15),lty = 3)
  # Annotate the descriptive statistics across the top of the plot (unikn::uline)
  uline(labels = c("mean HR =",mean_hr, "max HR = ", max_hr,"mean Temp C° = ",mean_temp,"max Temp C° = ",max_temp,"median BR = ", median_BR) ,x = c(length(HR)/10,length(HR)/6,length(HR)/10,length(HR)/6,length(HR)/2.3,length(HR)/1.85,length(HR)/2.3,length(HR)/1.85,length(HR)/1.2,length(HR)/1.1), y = c(max(HR)*1.12,max(HR)*1.12,max(HR)*1.19,max(HR)*1.19,max(HR)*1.12,max(HR)*1.12,max(HR)*1.19,max(HR)*1.19,max(HR)*1.165,max(HR)*1.165), cex = 0.75,col_bg = c('blue','blue','blue','blue','purple','purple','purple','purple','yellow','yellow'))
}
best_function (A2_1_MOPP)
# This plot lets you visualize trends in HR, BR, Belt BR, temperature, and core temperature over time, with some descriptive statistics printed along the top. The gridlines help read off values. For HR, the grey trace is the raw signal and the blue trace is a 4-point moving average that smooths the data and makes the trend easier to see. The function also filters out NAs and implausible values (such as zero or near-zero readings) before plotting.
best_function (B1_2_BDU)
# This second plot works, but the traces overlap. This may be because data collection was stopped about halfway through and then restarted, which would also explain why the time variable stops and then restarts. Taken at face value, the record would describe a subject doing heavy aerobic exercise, with heart rate peaking at 221 bpm (which is unlikely) and a median breathing rate of 50.8 (the median is used because the distribution is not normal). If that were really the case, we would also expect higher mean and maximum temperatures than we observe.
best_function (C2_2_MOPP)
best_function (D2_2_MOPPnew)
# These last two subjects look very similar, although D2_2_MOPPnew shows higher overall HR and temperature values.
#Overall, plotting all of these variables in a single panel may not be the wisest choice: temperature and core temperature change so little relative to the HR scale that they appear as flat lines and are hard to interpret. If this were for a research setting, I would also suggest filtering noise from the other three signals, as is commonly recommended in the literature.
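# The prompt asks for out-of-range values to be converted to NA; best_function instead drops
# them, which shortens each series. A small helper along these lines could be used instead
# (a sketch; the column index and cutoff in the example are hypothetical, mirroring best_function):
clean_series = function(x, lower){
  x = as.numeric(unlist(x))        # tolerate a tibble column as input
  x[is.na(x) | x <= lower] = NA    # implausible readings (e.g., temp = 0) become NA
  x
}
# e.g., HR_clean = clean_series(A2_1_MOPP[,8], lower = 0)
# Plotting HR_clean keeps the time axis aligned and simply leaves gaps where the data were bad.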
A2_1_MOPP$real_time = seq(1,length(A2_1_MOPP$Time)*5,by=5)
loess_span_10 = loess(`ECG HR`~real_time, data = A2_1_MOPP, span=0.10)
loess_span_30 = loess(`ECG HR`~real_time, data = A2_1_MOPP, span=0.30)
loess_span_65 = loess(`ECG HR`~real_time, data = A2_1_MOPP, span=0.65)
loess_predict_10 = predict(loess_span_10)
loess_predict_30 = predict(loess_span_30)
loess_predict_65 = predict(loess_span_65)
# first loess graph span = 10%
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.2
theme_set(theme_classic())
ggplot(A2_1_MOPP, aes(x=real_time,y=`ECG HR`)) +
geom_line() +
labs(title="ECG HR with Loess regression line",
subtitle="10% smoothing span",
caption="Subject: A2_1_MOPP",
x='Time(s)',
y="bpm")+
geom_smooth(span = 0.1)
# 2nd loess plot with span = 30% and 65%
plot(A2_1_MOPP$real_time,A2_1_MOPP$`ECG HR`,type = 'l', main = 'ECG HR with larger Loess smoothing span Subject: A2_1_MOPP', xlab = 'Time(s)', ylab = 'ECG HR',col = 'darkgrey')
lines(A2_1_MOPP$real_time,loess_predict_30, col="blue",lwd = 3)
lines(A2_1_MOPP$real_time,loess_predict_65, col="green",lwd = 3)
legend('bottomright',legend = c('30% span','65% span'),col = c('blue','green'), lty = 1, lwd = 3)
#In the first plot, the 10% span smooths the data while keeping every major bend in the series; a lot of noise is filtered out, but the sharp ECG HR peaks are flattened, which may matter if we need to report them. So a 10% span fits the data reasonably well, with some minor drawbacks. The second plot shows 30% and 65% spans: at 0.3 we only capture the rough ups and downs, and at 0.65 almost all of the structure in the data is lost.
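# One simple way to quantify how much each span smooths away is to compare the spread of the
# residuals from the three loess fits above (a quick sketch):
sapply(list(span_10 = loess_span_10, span_30 = loess_span_30, span_65 = loess_span_65),
       function(m) sd(residuals(m)))   # larger span -> smoother curve -> larger residual SD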
BR_A2 = c(A2_1_MOPP$`EDR BR`, A2_1_MOPP$`Belt BR`)
BR_B1 = c(B1_2_BDU$`EDR BR`, B1_2_BDU$`Belt BR`)
time_predictor_A2 = rep(1:length(A2_1_MOPP$`EDR BR`),2)
time_predictor_B1 = rep(1:length(B1_2_BDU$`EDR BR`),2)
type_A2 = as.factor(rep(c('ecg','belt'),each=length(A2_1_MOPP$`EDR BR`)))
type_B1 = as.factor(rep(c('ecg','belt'),each=length(B1_2_BDU$`EDR BR`)))
df_A2 = data.frame(time = time_predictor_A2,type = type_A2,BRrate = BR_A2)
df_B1 = data.frame(time = time_predictor_B1,type = type_B1,BRrate = BR_B1)
poly1_A2 = lm(BRrate~time+I(time^2)+type,data = df_A2)
poly2_A2 = lm(BRrate~time+I(time^2)+I(time^3)+type,data = df_A2)
poly3_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+type,data = df_A2)
poly4_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+type,data = df_A2)
poly5_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+type,data = df_A2)
poly6_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+type,data = df_A2)
poly7_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+type,data = df_A2)
poly8_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+type,data = df_A2)
poly9_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+type,data = df_A2)
poly10_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+type,data = df_A2)
poly11_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+type,data = df_A2)
poly12_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+type,data = df_A2)
poly13_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+type,data = df_A2)
poly14_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+I(time^15)+type,data = df_A2)
poly15_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+I(time^15)+I(time^16)+type,data = df_A2)
poly16_A2 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+I(time^15)+I(time^16)+I(time^17)+type,data = df_A2)
extractAIC(poly1_A2)
## [1] 4.000 3600.168
extractAIC(poly2_A2)
## [1] 5.000 3597.911
extractAIC(poly3_A2)
## [1] 6.000 3599.907
extractAIC(poly4_A2)
## [1] 7.000 3601.839
extractAIC(poly5_A2)
## [1] 8.000 3468.746
extractAIC(poly6_A2)
## [1] 9.000 3468.921
extractAIC(poly7_A2)
## [1] 10.000 3443.102
extractAIC(poly8_A2)
## [1] 11.000 3354.447
extractAIC(poly9_A2)
## [1] 12.000 3346.762
extractAIC(poly10_A2)
## [1] 13.00 3348.62
extractAIC(poly11_A2)
## [1] 14.000 3335.522
extractAIC(poly12_A2)
## [1] 14.000 3335.522
extractAIC(poly13_A2)
## [1] 15.000 3107.627
extractAIC(poly14_A2)
## [1] 15.000 3107.627
extractAIC(poly15_A2)
## [1] 16.000 3053.123
extractAIC(poly16_A2)
## [1] 16.000 3053.123
poly1_B1 = lm(BRrate~time+I(time^2)+type,data = df_B1)
poly2_B1 = lm(BRrate~time+I(time^2)+I(time^3)+type,data = df_B1)
poly3_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+type,data = df_B1)
poly4_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+type,data = df_B1)
poly5_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+type,data = df_B1)
poly6_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+type,data = df_B1)
poly7_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+type,data = df_B1)
poly8_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+type,data = df_B1)
poly9_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+type,data = df_B1)
poly10_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+type,data = df_B1)
poly11_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+type,data = df_B1)
poly12_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+type,data = df_B1)
poly13_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+type,data = df_B1)
poly14_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+I(time^15)+type,data = df_B1)
poly15_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+I(time^15)+I(time^16)+type,data = df_B1)
poly16_B1 = lm(BRrate~time+I(time^2)+I(time^3)+I(time^4)+I(time^5)+I(time^6)+I(time^7)+I(time^8)+I(time^9)+I(time^10)+I(time^11)+I(time^12)+I(time^13)+I(time^14)+I(time^15)+I(time^16)+I(time^17)+type,data = df_B1)
extractAIC(poly1_B1)
## [1] 4.000 9137.523
extractAIC(poly2_B1)
## [1] 5.000 9120.276
extractAIC(poly3_B1)
## [1] 6.000 9108.001
extractAIC(poly4_B1)
## [1] 7.000 9109.952
extractAIC(poly5_B1)
## [1] 8.000 9109.804
extractAIC(poly6_B1)
## [1] 9.000 9078.308
extractAIC(poly7_B1)
## [1] 10.000 9038.351
extractAIC(poly8_B1)
## [1] 11.000 9030.223
extractAIC(poly9_B1)
## [1] 12.000 9025.488
extractAIC(poly10_B1)
## [1] 13.000 8996.828
extractAIC(poly11_B1)
## [1] 14.000 8991.383
extractAIC(poly12_B1)
## [1] 14.000 8991.383
extractAIC(poly13_B1)
## [1] 15.000 8993.207
extractAIC(poly14_B1)
## [1] 15.000 8993.207
extractAIC(poly15_B1)
## [1] 16.000 8993.629
extractAIC(poly16_B1)
## [1] 16.000 8993.629
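# The long list of hand-written polynomial models above can also be generated in a loop;
# poly(..., raw = TRUE) fits the same raw power terms, so the AICs should match those printed
# above (a sketch for df_A2; the same idea applies to df_B1):
orders = 2:17   # highest power in each model, matching poly1_A2 ... poly16_A2
aic_A2 = sapply(orders, function(k){
  fit = lm(BRrate ~ poly(time, k, raw = TRUE) + type, data = df_A2)
  extractAIC(fit)[2]
})
data.frame(order = orders, AIC = aic_A2)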
par(mfrow=c(1,2))
plot(x = 1:(length(df_A2$time)/2), y = df_A2$BRrate[0:(length(df_A2$time)/2)], bty = 'n', pch = '', ylab = 'BR', xlab = 'Time(s)',main='BR by ECG and Belt records with 16th order poly')
points(x = 1:(length(df_A2$time)/2),df_A2$BRrate[0:(length(df_A2$time)/2)], col="#697EFF",pch = 17,)
points(x = 1:(length(df_A2$time)/2),df_A2$BRrate[-(1:(length(df_A2$time)/2))], col="#FD6161",pch = 15,)
lines(x = 1:(length(df_A2$time)/2), poly15_A2$fitted.values[0:(length(df_A2$time)/2)],col='blue', lwd=5)
lines(x = 1:(length(df_A2$time)/2), poly15_A2$fitted.values[-(1:(length(df_A2$time)/2))],col='red', lwd=5)
legend('topleft', legend = c('ecg','belt','poly-ecg','poly-belt'), pch = c(17,15,NA,NA), lty = c(NA,NA,1,1),col = c('#697EFF','#FD6161','blue','red'),bty = "n", lwd = 2, cex = 1.2)
text(535,21,labels = 'Subject:A2')
plot(x = 1:(length(df_B1$time)/2), y = df_B1$BRrate[0:(length(df_B1$time)/2)], bty = 'n', pch = '', ylab = 'BR', xlab = 'Time(s)',main='BR by ECG and Belt records with 12th order poly')
points(x = 1:(length(df_B1$time)/2),df_B1$BRrate[0:(length(df_B1$time)/2)], col="#697EFF",pch = 17,)
points(x = 1:(length(df_B1$time)/2),df_B1$BRrate[-(1:(length(df_B1$time)/2))], col="#FD6161",pch = 15,)
lines(x = 1:(length(df_B1$time)/2), poly11_B1$fitted.values[0:(length(df_B1$time)/2)],col='blue', lwd=5)
lines(x = 1:(length(df_B1$time)/2), poly11_B1$fitted.values[-(1:(length(df_B1$time)/2))],col='red', lwd=5)
#legend('bottom', legend = c('ecg','belt','poly3ecg','poly3belt'), pch = c(17,15,NA,NA), lty = c(NA,NA,1,1),col = c('#697EFF','#FD6161','red','blue'),bty = "n", lwd = 2, cex = 1.2,)
text(1000,6,labels = 'Subject:B1')
summary(poly15_A2)
##
## Call:
## lm(formula = BRrate ~ time + I(time^2) + I(time^3) + I(time^4) +
## I(time^5) + I(time^6) + I(time^7) + I(time^8) + I(time^9) +
## I(time^10) + I(time^11) + I(time^12) + I(time^13) + I(time^14) +
## I(time^15) + I(time^16) + type, data = df_A2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0131 -2.5297 -0.5846 1.8988 12.0227
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.246e+01 1.709e+00 7.292 5.62e-13 ***
## time 3.170e+00 3.537e-01 8.962 < 2e-16 ***
## I(time^2) -2.591e-01 2.439e-02 -10.626 < 2e-16 ***
## I(time^3) 9.485e-03 8.008e-04 11.844 < 2e-16 ***
## I(time^4) -1.864e-04 1.494e-05 -12.471 < 2e-16 ***
## I(time^5) 2.210e-06 1.748e-07 12.646 < 2e-16 ***
## I(time^6) -1.698e-08 1.356e-09 -12.520 < 2e-16 ***
## I(time^7) 8.818e-11 7.224e-12 12.207 < 2e-16 ***
## I(time^8) -3.157e-13 2.679e-14 -11.781 < 2e-16 ***
## I(time^9) 7.793e-16 6.904e-17 11.288 < 2e-16 ***
## I(time^10) -1.296e-18 1.205e-19 -10.759 < 2e-16 ***
## I(time^11) 1.356e-21 1.328e-22 10.213 < 2e-16 ***
## I(time^12) -7.266e-25 7.519e-26 -9.664 < 2e-16 ***
## I(time^13) NA NA NA NA
## I(time^14) 1.658e-31 1.932e-32 8.584 < 2e-16 ***
## I(time^15) NA NA NA NA
## I(time^16) -3.747e-38 4.959e-39 -7.556 8.37e-14 ***
## typeecg 1.262e+00 2.096e-01 6.021 2.32e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.606 on 1168 degrees of freedom
## Multiple R-squared: 0.4579, Adjusted R-squared: 0.451
## F-statistic: 65.78 on 15 and 1168 DF, p-value: < 2.2e-16
summary(poly11_B1)
##
## Call:
## lm(formula = BRrate ~ time + I(time^2) + I(time^3) + I(time^4) +
## I(time^5) + I(time^6) + I(time^7) + I(time^8) + I(time^9) +
## I(time^10) + I(time^11) + I(time^12) + type, data = df_B1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.7067 -2.9864 -0.2894 3.2038 18.3369
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.819e+00 1.500e+00 1.213 0.225429
## time 1.069e+00 1.112e-01 9.614 < 2e-16 ***
## I(time^2) -2.122e-02 2.727e-03 -7.781 1.03e-14 ***
## I(time^3) 2.073e-04 3.153e-05 6.574 5.91e-11 ***
## I(time^4) -1.172e-06 2.049e-07 -5.721 1.18e-08 ***
## I(time^5) 4.185e-09 8.231e-10 5.084 3.97e-07 ***
## I(time^6) -9.883e-12 2.158e-12 -4.580 4.88e-06 ***
## I(time^7) 1.580e-14 3.797e-15 4.161 3.27e-05 ***
## I(time^8) -1.717e-17 4.516e-18 -3.802 0.000147 ***
## I(time^9) 1.249e-20 3.583e-21 3.486 0.000499 ***
## I(time^10) -5.819e-24 1.816e-24 -3.205 0.001369 **
## I(time^11) 1.569e-27 5.314e-28 2.952 0.003188 **
## I(time^12) -1.861e-31 6.834e-32 -2.723 0.006508 **
## typeecg 2.887e+01 2.227e-01 129.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.666 on 2574 degrees of freedom
## Multiple R-squared: 0.8691, Adjusted R-squared: 0.8684
## F-statistic: 1314 on 13 and 2574 DF, p-value: < 2.2e-16
# For subject A2, a 16th-order polynomial gives the best fit of those compared, judging by the AIC values, and the sensor type variable is a significant predictor of BR (p = 2.3e-09). The same holds for subject B1, where a 12th-order polynomial was selected and type is again strongly significant (p < 2e-16). However, the A2 model explains much less variance (multiple R-squared = 0.458) than the B1 model (R-squared = 0.869); the difference is largely because in B1 the two sensors record very different BR levels, so the type term alone accounts for a large share of the variance.
library(BayesFactor)
library(Bolstad)
## Warning: package 'Bolstad' was built under R version 4.0.3
linear_model_D2 = lm(BRrate~time+I(time^2)+type,data = df_A2)
SubjectD2 = D2_2_MOPPnew
SubjectD2$Time= (1:length(SubjectD2$Time))
SubjectD2$Time_squared = SubjectD2$Time * SubjectD2$Time
SubjectD2$Time_cubic = SubjectD2$Time * SubjectD2$Time * SubjectD2$Time
SubjectD2$Interaction = SubjectD2$`EDR BR`*SubjectD2$`ECG HR`
# modelBF_D2 = regressionBF(`Belt BR`~ `EDR BR`+ Temp + CoreTemp + `ECG HR` + Motion + `Body Pos` + Time + Time_squared + Time_cubic + Interaction, data = SubjectD2)
#
# plot(head(modelBF_D2))
# plot(head(modelBF_D2)/max(modelBF_D2)) # comparing it to the other best models
# The best model selected by the Bayesian regression (regressionBF) for predicting the belt-sensor breathing rate includes `EDR BR`, Temp, Time, Time_squared, Time_cubic, and the Interaction term. The interaction between EDR BR and ECG HR is a useful predictor, along with EDR BR and Time. This model is about 1e+207 times better than the intercept-only model, and I prefer this approach over the others because it evaluates all possible predictor combinations and is less prone to human error.
best_model = bayes.lm (`Belt BR`~`EDR BR`+ Temp + Time + Time_squared + Time_cubic + Interaction,data = SubjectD2)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
summary(best_model)
##
## Call:
## bayes.lm(formula = `Belt BR` ~ `EDR BR` + Temp + Time + Time_squared +
## Time_cubic + Interaction, data = SubjectD2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1790 -1.1535 -0.1132 0.9804 7.7195
##
## Coefficients:
## Posterior Mean Std. Error t value
## (Intercept) 2.110e+01 3.397e-03 6.213e+03
## `EDR BR` 5.884e-01 1.501e-03 3.920e+02
## Temp -1.797e-01 1.383e-01 -1.299e+00
## Time 2.453e-02 8.428e-05 2.910e+02
## Time_squared -4.806e-05 1.572e-10 -3.058e+05
## Time_cubic 2.636e-08 3.638e-17 7.245e+08
## Interaction 7.353e-04 5.690e-08 1.292e+04
## ---
##
## Note: No prior given (Using flat prior).
plot(SubjectD2$Time, SubjectD2$`Belt BR`, pch = 1, cex = 1.2, col = 'black', main = 'Belt BR and Fitted Values Subject D2', xlab = 'Time(5s)', ylab = 'Belt BR')
points(SubjectD2$Time, SubjectD2$`Belt BR`,col = 'orange',cex = 1, pch = 16)
points(SubjectD2$Time,best_model$fitted.values,col = 'red', pch = 1, cex = 1.2)
points(SubjectD2$Time,best_model$fitted.values,col = 'blue', pch = 16, cex = 1)
legend(660, 32, legend = c('Values','Fitted Values'), pch = c(16,16), col = c('orange','blue'))
ols_model = lm(`Belt BR`~`EDR BR`+ Temp + Time + Time_squared + Time_cubic + Interaction,data = SubjectD2) # ordinary least-squares version of the same model (named ols_model to avoid masking the lm function)
summary(ols_model)$r.squared
## [1] 0.7377614
#This model accounts for about 73.8% of the total variability in Belt BR (multiple R-squared = 0.738).
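# Sanity check: with a flat prior, bayes.lm should essentially reproduce the ordinary
# least-squares fit, so the two sets of fitted values should fall on the identity line (sketch):
plot(ols_model$fitted.values, best_model$fitted.values,
     xlab = 'OLS fitted Belt BR', ylab = 'bayes.lm fitted Belt BR',
     main = 'Flat-prior Bayesian fit vs OLS fit')
abline(0, 1, col = 'red')   # identity line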