mydata<-read.table("./Sleep_health_and_lifestyle_dataset.csv", sep=",", header= TRUE, dec=".")
head(mydata)
## Person.ID Gender Age Occupation Sleep.Duration
## 1 1 Male 27 Software Engineer 6.1
## 2 2 Male 28 Doctor 6.2
## 3 3 Male 28 Doctor 6.2
## 4 4 Male 28 Sales Representative 5.9
## 5 5 Male 28 Sales Representative 5.9
## 6 6 Male 28 Software Engineer 5.9
## Quality.of.Sleep Physical.Activity.Level Stress.Level BMI.Category
## 1 6 42 6 Overweight
## 2 6 60 8 Normal
## 3 6 60 8 Normal
## 4 4 30 8 Obese
## 5 4 30 8 Obese
## 6 4 30 8 Obese
## Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 1 126/83 77 4200 None
## 2 125/80 75 10000 None
## 3 125/80 75 10000 None
## 4 140/90 85 3000 Sleep Apnea
## 5 140/90 85 3000 Sleep Apnea
## 6 140/90 85 3000 Insomnia
Explanation of data:
Person ID: An identifier for each individual.
Gender: The gender of the person (Male/Female).
Age: The age of the person in years.
Occupation: The occupation or profession of the person.
Sleep Duration: The number of hours the person sleeps per day.
Quality of Sleep: A subjective rating of the quality of sleep, ranging from 1 to 10.
Physical Activity Level: The number of minutes the person engages in physical activity daily.
Stress Level: A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
Heart Rate: The resting heart rate of the person in beats per minute.
Daily Steps: The number of steps the person takes per day.
Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).
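The variable types and coding described above can be verified directly in R; a minimal sketch (using the column names shown by head() above):
str(mydata) #variable types (integer, numeric, character)
table(mydata$Sleep.Disorder) #counts of None / Insomnia / Sleep Apnea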
#install.packages("tidyr") #deleting not availables
library(tidyr)
mydata2 <- drop_na(mydata)
mydata$Daily.Steps.New <- mydata$Daily.Steps + 5000 #creating new variable Daily.Steps.New
head(mydata, 3)
## Person.ID Gender Age Occupation Sleep.Duration
## 1 1 Male 27 Software Engineer 6.1
## 2 2 Male 28 Doctor 6.2
## 3 3 Male 28 Doctor 6.2
## Quality.of.Sleep Physical.Activity.Level Stress.Level BMI.Category
## 1 6 42 6 Overweight
## 2 6 60 8 Normal
## 3 6 60 8 Normal
## Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 1 126/83 77 4200 None
## 2 125/80 75 10000 None
## 3 125/80 75 10000 None
## Daily.Steps.New
## 1 9200
## 2 15000
## 3 15000
colnames(mydata) <- c("Person.ID", "Gender", "Age", "Occupation", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Body.Mass.Index", "Blood.Pressure", "Heart.Rate", "Daily.Steps", "Sleep.Disorder", "Daily.Steps.New")
#Renaming variable BMI.Category to Body.Mass.Index
head(mydata)
## Person.ID Gender Age Occupation Sleep.Duration
## 1 1 Male 27 Software Engineer 6.1
## 2 2 Male 28 Doctor 6.2
## 3 3 Male 28 Doctor 6.2
## 4 4 Male 28 Sales Representative 5.9
## 5 5 Male 28 Sales Representative 5.9
## 6 6 Male 28 Software Engineer 5.9
## Quality.of.Sleep Physical.Activity.Level Stress.Level
## 1 6 42 6
## 2 6 60 8
## 3 6 60 8
## 4 4 30 8
## 5 4 30 8
## 6 4 30 8
## Body.Mass.Index Blood.Pressure Heart.Rate Daily.Steps
## 1 Overweight 126/83 77 4200
## 2 Normal 125/80 75 10000
## 3 Normal 125/80 75 10000
## 4 Obese 140/90 85 3000
## 5 Obese 140/90 85 3000
## 6 Obese 140/90 85 3000
## Sleep.Disorder Daily.Steps.New
## 1 None 9200
## 2 None 15000
## 3 None 15000
## 4 Sleep Apnea 8000
## 5 Sleep Apnea 8000
## 6 Insomnia 8000
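An alternative to re-specifying the entire colnames() vector is to rename just the one column by name, which is less error-prone if the column order ever changes; an equivalent sketch (shown for comparison, not to be run again):
names(mydata)[names(mydata) == "BMI.Category"] <- "Body.Mass.Index" #rename a single column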
mydata3 <- mydata2[mydata2$Sleep.Duration >= 7 & mydata2$Sleep.Duration <= 8, ]
#creating new data frame with sleep duration between 7 and 8 hours
summary(mydata2[ , c(-1, -2, -4, -9, -13)]) #Descriptive Statistics
## Age Sleep.Duration Quality.of.Sleep
## Min. :27.00 Min. :5.800 Min. :4.000
## 1st Qu.:35.25 1st Qu.:6.400 1st Qu.:6.000
## Median :43.00 Median :7.200 Median :7.000
## Mean :42.18 Mean :7.132 Mean :7.313
## 3rd Qu.:50.00 3rd Qu.:7.800 3rd Qu.:8.000
## Max. :59.00 Max. :8.500 Max. :9.000
## Physical.Activity.Level Stress.Level Blood.Pressure
## Min. :30.00 Min. :3.000 Length:374
## 1st Qu.:45.00 1st Qu.:4.000 Class :character
## Median :60.00 Median :5.000 Mode :character
## Mean :59.17 Mean :5.385
## 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :90.00 Max. :8.000
## Heart.Rate Daily.Steps
## Min. :65.00 Min. : 3000
## 1st Qu.:68.00 1st Qu.: 5600
## Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
Explanation of results:
Age: Max.: 59
The oldest participant in this study was 59 years old.
Sleep duration: Mean: 7.132
The average sleep duration was 7.132 hours per day.
Stress level: 1st Qu.: 4.000
25% of participants reported a stress level of 4 or less, and 75% reported a stress level above 4.
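These figures can also be reproduced directly from the cleaned data; a minimal sketch:
max(mydata2$Age) #maximum age, 59 in the summary above
mean(mydata2$Sleep.Duration) #average sleep duration, 7.132
quantile(mydata2$Stress.Level, 0.25) #first quartile of stress level, 4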
hist(mydata2$Sleep.Duration,
main = "Distribution of Sleep duration",
xlab = "Sleep duration",
ylab = "Frequency",
breaks = seq(from = 0, to = 10, by =1))
The majority of participants slept between 7 and 8 hours per day.
boxplot(mydata2 [ , c(-1, -2, -4, -5, -6, -8, -9, -10, -12, -13)])
Age: Min.: 27 The minimum age of the participants was 27 years.
Max.: 59 The maximum age of the participants was 59 years.
Median: 43 Half of the participants were 43 years old or younger; the other half were older than 43 years.
1st Qu.: 35.25 25% of the participants were 35.25 years old or younger; 75% were older than 35.25 years.
3rd Qu.: 50 75% of the participants were 50 years old or younger; 25% were older than 50 years.
head(mydata2[order(-mydata2$Heart.Rate), ]) #Descending order by Heart rate
## Person.ID Gender Age Occupation Sleep.Duration
## 277 277 Male 49 Doctor 8.1
## 278 278 Male 49 Doctor 8.1
## 4 4 Male 28 Sales Representative 5.9
## 5 5 Male 28 Sales Representative 5.9
## 6 6 Male 28 Software Engineer 5.9
## 94 94 Male 35 Lawyer 7.4
## Quality.of.Sleep Physical.Activity.Level Stress.Level
## 277 9 85 3
## 278 9 85 3
## 4 4 30 8
## 5 4 30 8
## 6 4 30 8
## 94 7 60 5
## BMI.Category Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 277 Obese 139/91 86 3700 Sleep Apnea
## 278 Obese 139/91 86 3700 Sleep Apnea
## 4 Obese 140/90 85 3000 Sleep Apnea
## 5 Obese 140/90 85 3000 Sleep Apnea
## 6 Obese 140/90 85 3000 Insomnia
## 94 Obese 135/88 84 3300 Sleep Apnea
boxplot(mydata2[ , 12],
xlab = "Daily steps")
boxplot(mydata2 [ , c( 5, 6, 8 )])
library(car)
## Loading required package: carData
scatterplot(y= mydata2$Sleep.Duration,
x= mydata2$Stress.Level,
ylab= "Sleep duration in hours",
xlab= "Stress level",
smooth = FALSE,
boxplot= FALSE)
Participants with longer sleep duration experienced lower stress levels.
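To quantify this negative relationship, the Pearson correlation coefficient can be computed as well; a sketch (the value itself is not reported in the output above, but it is expected to be negative):
cor(mydata2$Stress.Level, mydata2$Sleep.Duration) #Pearson correlation, expected < 0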
mydatanew<-read.table("./Body mass.csv", sep=";", header= TRUE, dec=",")
head(mydatanew)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
summary(mydatanew[ , c(-1)]) #Descriptive statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
Min.: 49.70 The minimum body mass of a ninth grader was 49.70 kg.
Median: 62.80 Half of the ninth graders had a body mass of 62.80 kg or less; the other half had a body mass above 62.80 kg.
Mean: 62.88 The average body mass of the ninth graders was 62.88 kg.
library(ggplot2)
ggplot(mydatanew, aes(x=Mass)) +
geom_histogram(binwidth = 5,colour="black",fill="aquamarine") +
ylab("Frequency")
Hypothesis testing:
H0: Mu = 59.5   H1: Mu ≠ 59.5
t.test(mydatanew$Mass,
mu=59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydatanew$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
Estimate of the arithmetic mean: Y_hat = 62.88
p-value = 0.000234
We reject H0 at p < 0.001 and accept H1. The average weight in the 2021/2022 school year is different from what it was in the 2018/2019 school year; since the sample mean (62.88 kg) is higher than 59.5 kg, the weight increased.
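The reported test statistic can be reproduced by hand from the one-sample t-test formula t = (xbar - mu0) / (s / sqrt(n)); a minimal sketch:
xbar <- mean(mydatanew$Mass) #62.876 in the output above
s <- sd(mydatanew$Mass) #sample standard deviation
n <- nrow(mydatanew) #df = 49 implies n = 50
(xbar - 59.5) / (s / sqrt(n)) #should reproduce t = 3.9711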
library(effectsize)
cohens_d(mydatanew$Mass, mu = 59.5) #Calculating the effect size
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009") #Interpretation of the effect size
## [1] "medium"
## (Rules: sawilowsky2009)
The increase in weight corresponds to a medium effect size.
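For a one-sample comparison, Cohen's d is simply the mean difference divided by the sample standard deviation, so the reported value can be checked by hand; a sketch:
(mean(mydatanew$Mass) - 59.5) / sd(mydatanew$Mass) #should be about 0.56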
library(readxl)
mydata5 <- read_xlsx("./Apartments.xlsx")
mydata5 <- as.data.frame(mydata5)
head(mydata5)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Changing categorical variables into factors.
mydata5$ParkingFactor <- factor(mydata5$Parking,
levels= c(0,1),
labels= c("No", "Yes"))
mydata5$BalconyFactor <- factor(mydata5$Balcony,
levels= c(0,1),
labels= c("No", "Yes"))
Test the hypothesis that the average apartment price equals 1900 eur:
H0: Mu_Price = 1900 eur
H1: Mu_Price ≠ 1900 eur
t.test(x = mydata5$Price, mu = 1900, alternative = "two.sided") #Hypothesis testing
##
## One Sample t-test
##
## data: mydata5$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
p-value = 0.005
We reject H0 at p = 0.005 and accept H1. The average price of an apartment is different from 1900 eur; the average price estimated from our sample is 2018.94 eur.
Regression function: Price = f(Age)
fit1 <- lm(Price ~ Age, data = mydata5)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Price = 2185.46 - 8.98 * Age
H0: Beta1 = 0
H1: Beta1 not equal to zero.
p = 0.034
Reject H0.
b1= -8.98
If the age of the apartment increases by one year, then on average the price per square meter decreases by 8.98 eur (p = 0.034), assuming everything else stays the same.
Coefficient of determination: R2 = 0.0530
5.3% of the variability of price can be explained by the linear effect of age.
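The fitted equation can be used to predict the price for a given age; a sketch for a hypothetical 10-year-old apartment:
predict(fit1, newdata = data.frame(Age = 10)) #about 2185.46 - 8.98 * 10 = 2095.7 eur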
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata5[ , c(1, 3)])) #Correlation matrix
## Age Price
## Age 1.00 -0.23
## Price -0.23 1.00
##
## n= 85
##
##
## P
## Age Price
## Age 0.034
## Price 0.034
There is a weak negative linear relationship between price and age (r = -0.23, p = 0.034).
library(car)
scatterplotMatrix(mydata5[ , c(3,1,2)],
smooth=FALSE)
There is no problem with multicollinearity, because the panel showing the relationship between Age and Distance looks like a horizontal line, i.e. there is no visible association between the two explanatory variables.
The multiple regression function: Price = f(Age, Distance)
fit2 <- lm(Price ~ Age + Distance, data = mydata5)
vif(fit2) #Multicolinearity
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
The VIF is greater than 1 but well below 5: there is some correlation among the independent variables, but it is not severe enough to be problematic.
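The VIF can also be derived from an auxiliary regression of one explanatory variable on the other, since VIF_j = 1 / (1 - R2_j); a sketch:
r2_age <- summary(lm(Age ~ Distance, data = mydata5))$r.squared #R2 of the auxiliary regression
1 / (1 - r2_age) #should reproduce the VIF for Age, about 1.0018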
mydata5$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydata5$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(mydata5$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of Standardized residuals")
shapiro.test(mydata5$StdResid) #Distribution of standardized residuals
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.95303, p-value = 0.003645
H0: Variable is normally distributed.
H1: Variable is not normally distributed.
We reject the null hypothesis at p = 0.004 and accept H1: the standardized residuals are not normally distributed, which is a violation of the normality assumption. However, since the sample size is larger than 30, we can rely on the central limit theorem and proceed with the analysis.
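As an additional visual check of normality, a normal Q-Q plot of the standardized residuals can be drawn; a sketch:
qqnorm(mydata5$StdResid, main = "Q-Q plot of standardized residuals")
qqline(mydata5$StdResid) #points far from the line indicate departures from normality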
hist(mydata5$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata5[order(-mydata5$StdResid),], 5)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 33 2 11 2790 1 0 Yes No
## 2 18 1 2800 1 0 Yes No
## 61 18 1 2800 1 1 Yes Yes
## 58 8 2 2820 1 0 Yes No
## StdResid CooksD
## 38 2.577 0.320
## 33 2.051 0.069
## 2 1.783 0.030
## 61 1.783 0.030
## 58 1.655 0.037
head(mydata5[order(-mydata5$CooksD),], 5)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 55 43 37 1740 0 0 No No
## 33 2 11 2790 1 0 Yes No
## 53 7 2 1760 0 1 No Yes
## 22 37 3 2540 1 1 Yes Yes
## StdResid CooksD
## 38 2.577 0.320
## 55 1.445 0.104
## 33 2.051 0.069
## 53 -2.152 0.066
## 22 1.576 0.061
We remove unit ID 38, because its Cook's distance (0.320) is much larger than that of all the other units.
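As a cross-check, one common (though not the only) rule of thumb flags observations whose Cook's distance exceeds 4/n; a minimal sketch using the column computed above:
which(mydata5$CooksD > 4 / nrow(mydata5)) #unit 38 clearly exceeds the threshold (0.320 vs 4/85, about 0.047)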
mydata5 <- mydata5[-38, ] #Removing unit ID38
fit2 <- lm(Price ~ Age + Distance, data = mydata5)
mydata5$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata5$StdResid, x = mydata5$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
#Checking for potential heteroskedasticity
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
p-value = 0.087
We cannot reject the null hypothesis at the 5% significance level: there is no evidence of heteroskedasticity, so we can treat the variance of the residuals as constant.
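A similar check is available in the already loaded car package through the non-constant variance score test; a sketch (expected to lead to a comparable conclusion):
ncvTest(fit2) #score test of constant variance against the fitted values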
hist(mydata5$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of Standardized residuals")
shapiro.test(mydata5$StdResid) #Distribution of standardized residuals
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.94879, p-value = 0.002187
H0: Variable is normally distributed.
H1: Variable is not normally distributed.
We reject the null hypothesis at p = 0.002 and accept H1: the standardized residuals are still not normally distributed, which is a violation of the normality assumption. However, since the sample size is larger than 30, we can rely on the central limit theorem and proceed with the analysis.
library(readxl)
mydata6 <- read_xlsx("./Apartments.xlsx")
mydata6 <- as.data.frame(mydata6)
head(mydata6)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
fit2 <- lm(Price ~ Age + Distance, data=mydata6)
summary(fit2) #Results of regression model
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
H0: Beta1 = 0
H1: Beta1 != 0
p = 0.016
Explanation of b1= -7.93
If the age of the apartment increases by 1 year, the price of the apartment decreases on average by 7.93 eur (p = 0.016), assuming that all other explanatory variables included in the model are held constant.
H0: Beta2 = 0
H1: Beta2 != 0
p < 0.001
Explanation of b2= -20.67
If the distance from the city center increases by 1 kilometer, the price of the apartment decreases on average by 20.67 eur (p < 0.001), assuming that all other explanatory variables included in the model are held constant.
Multiple R²: 0.4396
43.96% of the variability of the apartment price is explained by the linear effect of Age and Distance.
Test of significance of regression:
H0: Ro2 = 0
H1: Ro2 > 0
F = 32.16, p < 0.001
We reject the null hypothesis at p < 0.001. The population coefficient of determination is greater than 0, meaning that there is a linear relationship between price and the explanatory variables included in the model.
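The t tests on the individual coefficients can be complemented with confidence intervals; intervals that do not contain 0 correspond to the significant coefficients above. A sketch:
confint(fit2, level = 0.95) #95% confidence intervals for the intercept, Age and Distance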
The linear regression function Price = f(Age, Distance, Parking and Balcony)
mydata6$ParkingF <- factor(mydata6$Parking,
levels = c(0,1),
labels = c("No parking", "Has parking"))
mydata6$BalconyF <- factor(mydata6$Balcony,
levels = c(0,1),
labels = c("No balcony", "Has balcony"))
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata6)
anova(fit2, fit3) #Comparison of two regression models that differ in the number of expl. variables
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Model 1 is better
H1: Model 2 is better
p = 0.010
We reject the null hypothesis at p = 0.010. Model 2 (fit3, with Parking and Balcony) is better, because it explains significantly more variability in price.
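The same conclusion can be cross-checked with criteria not used in the text, for example the adjusted R² values from the two model summaries or an information criterion; a sketch:
summary(fit2)$adj.r.squared #0.4259 in the summary above
summary(fit3)$adj.r.squared #0.4754 in the summary below
AIC(fit2, fit3) #lower AIC indicates the preferred model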
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFHas parking 196.168 62.868 3.120 0.00251 **
## BalconyFHas balcony 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
H0: Beta3 = 0
H1: Beta3 != 0
p = 0.003
Explanation of b3 = 196.17: If the apartment has parking, the price is on average 196.17 eur higher than if the apartment does not have parking, assuming all the other explanatory variables included in the model are held constant (p = 0.003).
H0: Beta4 = 0
H1: Beta4 != 0
p = 0.974
We cannot reject H0: we cannot say that having a balcony has a statistically significant effect on the price of the apartment, assuming all the other explanatory variables included in the model are held constant (p = 0.974).
Test of significance of regression
H0: Ro3 = 0
H1: Ro3 > 0
F = 20.03, p < 0.001
We reject the null hypothesis at p < 0.001. The population coefficient of determination is greater than 0, meaning that there is a linear relationship between price and the explanatory variables included in the model.
mydata6$Fitted <- fitted.values(fit3)
mydata6$Residuals <- residuals(fit3)
head(mydata6[colnames(mydata6) %in% c("Price", "Fitted", "Residuals")])
## Price Fitted Residuals
## 1 1640 1750.741 -110.74150
## 2 2800 2357.411 442.58893
## 3 1660 1748.807 -88.80674
## 4 1850 1589.921 260.07897
## 5 1640 2052.576 -412.57579
## 6 1770 1896.691 -126.69107
Calculating the residual for apartment ID 2:
e = Y - Y(fitted) = 2800 - 2357.41 = 442.59
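The same residual can be extracted directly from the columns added above; a minimal sketch:
mydata6$Price[2] - mydata6$Fitted[2] #2800 - 2357.411 = 442.589
mydata6$Residuals[2] #the same value, stored via residuals(fit3)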