mydata<-read.table("./Sleep_health_and_lifestyle_dataset.csv", sep=",", header= TRUE, dec=".")
head(mydata)
## Person.ID Gender Age Occupation Sleep.Duration
## 1 1 Male 27 Software Engineer 6.1
## 2 2 Male 28 Doctor 6.2
## 3 3 Male 28 Doctor 6.2
## 4 4 Male 28 Sales Representative 5.9
## 5 5 Male 28 Sales Representative 5.9
## 6 6 Male 28 Software Engineer 5.9
## Quality.of.Sleep Physical.Activity.Level Stress.Level BMI.Category
## 1 6 42 6 Overweight
## 2 6 60 8 Normal
## 3 6 60 8 Normal
## 4 4 30 8 Obese
## 5 4 30 8 Obese
## 6 4 30 8 Obese
## Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 1 126/83 77 4200 None
## 2 125/80 75 10000 None
## 3 125/80 75 10000 None
## 4 140/90 85 3000 Sleep Apnea
## 5 140/90 85 3000 Sleep Apnea
## 6 140/90 85 3000 Insomnia
Explanation of data:
Person ID: An identifier for each individual.
Gender: The gender of the person (Male/Female).
Age: The age of the person in years.
Occupation: The occupation or profession of the person.
Sleep Duration: The number of hours the person sleeps per day.
Quality of Sleep: A subjective rating of the quality of sleep, ranging from 1 to 10.
Physical Activity Level: The number of minutes the person engages in physical activity daily.
Stress Level: A subjective rating of the stress level experienced by the person, ranging from 1 to 10.
BMI Category: The BMI category of the person (e.g., Underweight, Normal, Overweight).
Blood Pressure (systolic/diastolic): The blood pressure measurement of the person, indicated as systolic pressure over diastolic pressure.
Heart Rate: The resting heart rate of the person in beats per minute.
Daily Steps: The number of steps the person takes per day.
Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).
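The variable types and coding described above can be verified directly in R; a minimal sketch (using the column names shown by head() above):
str(mydata) #variable types (integer, numeric, character)
table(mydata$Sleep.Disorder) #counts of None / Insomnia / Sleep Apnea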
#install.packages("tidyr") #deleting not availables
library(tidyr)
mydata2 <- drop_na(mydata)
mydata$Daily.Steps.New <- mydata$Daily.Steps + 5000 #creating new variable Daily.Steps.New
head(mydata, 3)
## Person.ID Gender Age Occupation Sleep.Duration
## 1 1 Male 27 Software Engineer 6.1
## 2 2 Male 28 Doctor 6.2
## 3 3 Male 28 Doctor 6.2
## Quality.of.Sleep Physical.Activity.Level Stress.Level BMI.Category
## 1 6 42 6 Overweight
## 2 6 60 8 Normal
## 3 6 60 8 Normal
## Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 1 126/83 77 4200 None
## 2 125/80 75 10000 None
## 3 125/80 75 10000 None
## Daily.Steps.New
## 1 9200
## 2 15000
## 3 15000
colnames(mydata) <- c("Person.ID", "Gender", "Age", "Occupation", "Sleep.Duration", "Quality.of.Sleep", "Physical.Activity.Level", "Stress.Level", "Body.Mass.Index", "Blood.Pressure", "Heart.Rate", "Daily.Steps", "Sleep.Disorder", "Daily.Steps.New")
#Renaming variable BMI.Category to Body.Mass.Index
head(mydata)
## Person.ID Gender Age Occupation Sleep.Duration
## 1 1 Male 27 Software Engineer 6.1
## 2 2 Male 28 Doctor 6.2
## 3 3 Male 28 Doctor 6.2
## 4 4 Male 28 Sales Representative 5.9
## 5 5 Male 28 Sales Representative 5.9
## 6 6 Male 28 Software Engineer 5.9
## Quality.of.Sleep Physical.Activity.Level Stress.Level
## 1 6 42 6
## 2 6 60 8
## 3 6 60 8
## 4 4 30 8
## 5 4 30 8
## 6 4 30 8
## Body.Mass.Index Blood.Pressure Heart.Rate Daily.Steps
## 1 Overweight 126/83 77 4200
## 2 Normal 125/80 75 10000
## 3 Normal 125/80 75 10000
## 4 Obese 140/90 85 3000
## 5 Obese 140/90 85 3000
## 6 Obese 140/90 85 3000
## Sleep.Disorder Daily.Steps.New
## 1 None 9200
## 2 None 15000
## 3 None 15000
## 4 Sleep Apnea 8000
## 5 Sleep Apnea 8000
## 6 Insomnia 8000
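An alternative to re-specifying the entire colnames() vector is to rename just the one column by name, which is less error-prone if the column order ever changes; an equivalent sketch (shown for comparison, not to be run again):
names(mydata)[names(mydata) == "BMI.Category"] <- "Body.Mass.Index" #rename a single column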
mydata3 <- mydata2[mydata2$Sleep.Duration >= 7 & mydata2$Sleep.Duration <= 8, ]
#creating new data frame with sleep duration between 7 and 8 hours
summary(mydata2[ , c(-1, -2, -4, -9, -13)]) #Descriptive Statistics
## Age Sleep.Duration Quality.of.Sleep
## Min. :27.00 Min. :5.800 Min. :4.000
## 1st Qu.:35.25 1st Qu.:6.400 1st Qu.:6.000
## Median :43.00 Median :7.200 Median :7.000
## Mean :42.18 Mean :7.132 Mean :7.313
## 3rd Qu.:50.00 3rd Qu.:7.800 3rd Qu.:8.000
## Max. :59.00 Max. :8.500 Max. :9.000
## Physical.Activity.Level Stress.Level Blood.Pressure
## Min. :30.00 Min. :3.000 Length:374
## 1st Qu.:45.00 1st Qu.:4.000 Class :character
## Median :60.00 Median :5.000 Mode :character
## Mean :59.17 Mean :5.385
## 3rd Qu.:75.00 3rd Qu.:7.000
## Max. :90.00 Max. :8.000
## Heart.Rate Daily.Steps
## Min. :65.00 Min. : 3000
## 1st Qu.:68.00 1st Qu.: 5600
## Median :70.00 Median : 7000
## Mean :70.17 Mean : 6817
## 3rd Qu.:72.00 3rd Qu.: 8000
## Max. :86.00 Max. :10000
Explanation of results:
Age: Max.: 59
The oldest participant in this study was 59 years old.
Sleep duration: Mean: 7.132
The average sleep duration was 7.132 hours per day.
Stress level: 1st Qu.: 4.000
25% of participants reported a stress level of 4 or less, and 75% reported a stress level above 4.
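These figures can also be reproduced directly from the cleaned data; a minimal sketch:
max(mydata2$Age) #maximum age, 59 in the summary above
mean(mydata2$Sleep.Duration) #average sleep duration, 7.132
quantile(mydata2$Stress.Level, 0.25) #first quartile of stress level, 4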
hist(mydata2$Sleep.Duration,
main = "Distribution of Sleep duration",
xlab = "Sleep duration",
ylab = "Frequency",
breaks = seq(from = 0, to = 10, by =1))
The majority of participants slept between 7 and 8 hours per day.
boxplot(mydata2 [ , c(-1, -2, -4, -5, -6, -8, -9, -10, -12, -13)])
Age: Min.: 27 The minimum age of the participants was 27 years.
Max.: 59 The maximum age of the participants was 59 years.
Median: 43 Half of the participants were 43 years old or younger; the other half were older than 43 years.
1st Qu.: 35.25 25% of the participants were 35.25 years old or younger; 75% were older than 35.25 years.
3rd Qu.: 50 75% of the participants were 50 years old or younger; 25% were older than 50 years.
head(mydata2[order(-mydata2$Heart.Rate), ]) #Descending order by Heart rate
## Person.ID Gender Age Occupation Sleep.Duration
## 277 277 Male 49 Doctor 8.1
## 278 278 Male 49 Doctor 8.1
## 4 4 Male 28 Sales Representative 5.9
## 5 5 Male 28 Sales Representative 5.9
## 6 6 Male 28 Software Engineer 5.9
## 94 94 Male 35 Lawyer 7.4
## Quality.of.Sleep Physical.Activity.Level Stress.Level
## 277 9 85 3
## 278 9 85 3
## 4 4 30 8
## 5 4 30 8
## 6 4 30 8
## 94 7 60 5
## BMI.Category Blood.Pressure Heart.Rate Daily.Steps Sleep.Disorder
## 277 Obese 139/91 86 3700 Sleep Apnea
## 278 Obese 139/91 86 3700 Sleep Apnea
## 4 Obese 140/90 85 3000 Sleep Apnea
## 5 Obese 140/90 85 3000 Sleep Apnea
## 6 Obese 140/90 85 3000 Insomnia
## 94 Obese 135/88 84 3300 Sleep Apnea
boxplot(mydata2[ , 12],
xlab = "Daily steps")
boxplot(mydata2 [ , c( 5, 6, 8 )])
library(car)
## Loading required package: carData
scatterplot(y= mydata2$Sleep.Duration,
x= mydata2$Stress.Level,
ylab= "Sleep duration in hours",
xlab= "Stress level",
smooth = FALSE,
boxplot= FALSE)
Participants with longer sleep duration experienced lower stress levels.
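To quantify this negative relationship, the Pearson correlation coefficient can be computed as well; a sketch (the value itself is not reported in the output above, but it is expected to be negative):
cor(mydata2$Stress.Level, mydata2$Sleep.Duration) #Pearson correlation, expected < 0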
mydatanew<-read.table("./Body mass.csv", sep=";", header= TRUE, dec=",")
head(mydatanew)
## ID Mass
## 1 1 62.1
## 2 2 64.5
## 3 3 56.5
## 4 4 53.4
## 5 5 61.3
## 6 6 62.2
summary(mydatanew[ , c(-1)]) #Descriptive statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.70 60.23 62.80 62.88 64.50 83.20
Min.: 49.70 The minimum body mass of a ninth grader was 49.70 kg.
Median: 62.80 Half of the ninth graders had a body mass of 62.80 kg or less; the other half had a body mass above 62.80 kg.
Mean: 62.88 The average body mass of the ninth graders was 62.88 kg.
library(ggplot2)
ggplot(mydatanew, aes(x=Mass)) +
geom_histogram(binwidth = 5,colour="black",fill="aquamarine") +
ylab("Frequency")
Hypothesis testing:
H0: Mu = 59.5   H1: Mu ≠ 59.5
t.test(mydatanew$Mass,
mu=59.5,
alternative = "two.sided")
##
## One Sample t-test
##
## data: mydatanew$Mass
## t = 3.9711, df = 49, p-value = 0.000234
## alternative hypothesis: true mean is not equal to 59.5
## 95 percent confidence interval:
## 61.16758 64.58442
## sample estimates:
## mean of x
## 62.876
Estimate of the arithmetic mean: Y_hat = 62.88
p-value = 0.000234
We reject H0 at p < 0.001 and accept H1. The average weight in the 2021/2022 school year is different from what it was in the 2018/2019 school year; since the sample mean (62.88 kg) is higher than 59.5 kg, the weight increased.
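The reported test statistic can be reproduced by hand from the one-sample t-test formula t = (xbar - mu0) / (s / sqrt(n)); a minimal sketch:
xbar <- mean(mydatanew$Mass) #62.876 in the output above
s <- sd(mydatanew$Mass) #sample standard deviation
n <- nrow(mydatanew) #df = 49 implies n = 50
(xbar - 59.5) / (s / sqrt(n)) #should reproduce t = 3.9711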
library(effectsize)
cohens_d(mydatanew$Mass, mu = 59.5) #Calculating the effect size
## Cohen's d | 95% CI
## ------------------------
## 0.56 | [0.26, 0.86]
##
## - Deviation from a difference of 59.5.
interpret_cohens_d(0.56, rules = "sawilowsky2009") #Interpretation of the effect size
## [1] "medium"
## (Rules: sawilowsky2009)
The increase in weight corresponds to a medium effect size.
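For a one-sample comparison, Cohen's d is simply the mean difference divided by the sample standard deviation, so the reported value can be checked by hand; a sketch:
(mean(mydatanew$Mass) - 59.5) / sd(mydatanew$Mass) #should be about 0.56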
library(readxl)
mydata5 <- read_xlsx("./Apartments.xlsx")
mydata5 <- as.data.frame(mydata5)
head(mydata5)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
Description:
Changing categorical variables into factors.
mydata5$ParkingFactor <- factor(mydata5$Parking,
levels= c(0,1),
labels= c("No", "Yes"))
mydata5$BalconyFactor <- factor(mydata5$Balcony,
levels= c(0,1),
labels= c("No", "Yes"))
Test the hypothesis that the average apartment price equals 1900 eur:
H0: Mu_Price = 1900 eur
H1: Mu_Price ≠ 1900 eur
t.test(x = mydata5$Price, mu = 1900, alternative = "two.sided") #Hypothesis testing
##
## One Sample t-test
##
## data: mydata5$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
p-value = 0.005
We reject H0 at p = 0.005 and accept H1. The average price of an apartment is different from 1900 eur; the average price estimated from our sample is 2018.94 eur.
Regression function: Price = f(Age)
fit1 <- lm(Price ~ Age, data = mydata5)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = mydata5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <2e-16 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Price = 2185.46 - 8.98 * Age
H0: Beta1 = 0
H1: Beta1 not equal to zero.
p = 0.034
Reject H0.
b1= -8.98
If the age of the apartment increases by one year, then on average the price per square meter decreases by 8.98 eur (p = 0.034), assuming everything else stays the same.
Coefficient of determination: R2 = 0.0530
5.3% of the variability of price can be explained by the linear effect of age.
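The fitted equation can be used to predict the price for a given age; a sketch for a hypothetical 10-year-old apartment:
predict(fit1, newdata = data.frame(Age = 10)) #about 2185.46 - 8.98 * 10 = 2095.7 eur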
library(Hmisc)
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
rcorr(as.matrix(mydata5[ , c(1, 3)])) #Correlation matrix
## Age Price
## Age 1.00 -0.23
## Price -0.23 1.00
##
## n= 85
##
##
## P
## Age Price
## Age 0.034
## Price 0.034
There is a weak negative linear relationship between price and age (r = -0.23, p = 0.034).
library(car)
scatterplotMatrix(mydata5[ , c(3,1,2)],
smooth=FALSE)
There is no problem with multicollinearity, because the panel showing the relationship between Age and Distance looks like a horizontal line, i.e. there is no visible association between the two explanatory variables.
The multiple regression function: Price = f(Age, Distance)
fit2 <- lm(Price ~ Age + Distance, data = mydata5)
vif(fit2) #Multicolinearity
## Age Distance
## 1.001845 1.001845
mean(vif(fit2))
## [1] 1.001845
The VIF is greater than 1 but well below 5: there is some correlation among the independent variables, but it is not severe enough to be problematic.
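The VIF can also be derived from an auxiliary regression of one explanatory variable on the other, since VIF_j = 1 / (1 - R2_j); a sketch:
r2_age <- summary(lm(Age ~ Distance, data = mydata5))$r.squared #R2 of the auxiliary regression
1 / (1 - r2_age) #should reproduce the VIF for Age, about 1.0018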
mydata5$StdResid <- round(rstandard(fit2), 3) #Standardized residuals
mydata5$CooksD <- round(cooks.distance(fit2), 3) #Cooks distances
hist(mydata5$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of Standardized residuals")
shapiro.test(mydata5$StdResid) #Distribution of standardized residuals
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.95303, p-value = 0.003645
H0: Variable is normally distributed.
H1: Variable is not normally distributed.
We reject the null hypothesis at p = 0.004 and accept H1: the standardized residuals are not normally distributed, which is a violation of the normality assumption. However, since the sample size is larger than 30, we can rely on the central limit theorem and proceed with the analysis.
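As an additional visual check of normality, a normal Q-Q plot of the standardized residuals can be drawn; a sketch:
qqnorm(mydata5$StdResid, main = "Q-Q plot of standardized residuals")
qqline(mydata5$StdResid) #points far from the line indicate departures from normality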
hist(mydata5$CooksD,
xlab = "Cooks distance",
ylab = "Frequency",
main = "Histogram of Cooks distances")
head(mydata5[order(-mydata5$StdResid),], 5)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 33 2 11 2790 1 0 Yes No
## 2 18 1 2800 1 0 Yes No
## 61 18 1 2800 1 1 Yes Yes
## 58 8 2 2820 1 0 Yes No
## StdResid CooksD
## 38 2.577 0.320
## 33 2.051 0.069
## 2 1.783 0.030
## 61 1.783 0.030
## 58 1.655 0.037
head(mydata5[order(-mydata5$CooksD),], 5)
## Age Distance Price Parking Balcony ParkingFactor BalconyFactor
## 38 5 45 2180 1 1 Yes Yes
## 55 43 37 1740 0 0 No No
## 33 2 11 2790 1 0 Yes No
## 53 7 2 1760 0 1 No Yes
## 22 37 3 2540 1 1 Yes Yes
## StdResid CooksD
## 38 2.577 0.320
## 55 1.445 0.104
## 33 2.051 0.069
## 53 -2.152 0.066
## 22 1.576 0.061
We remove unit ID 38, because its Cook's distance (0.320) is much larger than that of all the other units.
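As a cross-check, one common (though not the only) rule of thumb flags observations whose Cook's distance exceeds 4/n; a minimal sketch using the column computed above:
which(mydata5$CooksD > 4 / nrow(mydata5)) #unit 38 clearly exceeds the threshold (0.320 vs 4/85, about 0.047)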
mydata5 <- mydata5[-38, ] #Removing unit ID38
fit2 <- lm(Price ~ Age + Distance, data = mydata5)
mydata5$StdFitted <- scale(fit2$fitted.values)
library(car)
scatterplot(y = mydata5$StdResid, x = mydata5$StdFitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE)
#Checking for potential heteroskedasticity
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## -----------------------------
## DF = 1
## Chi2 = 2.927455
## Prob > Chi2 = 0.08708469
p-value = 0.087
We cannot reject the null hypothesis at the 5% significance level: there is no evidence of heteroskedasticity, so we can treat the variance of the residuals as constant.
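A similar check is available in the already loaded car package through the non-constant variance score test; a sketch (expected to lead to a comparable conclusion):
ncvTest(fit2) #score test of constant variance against the fitted values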
hist(mydata5$StdResid,
xlab = "Standardized residuals",
ylab = "Frequency",
main = "Histogram of Standardized residuals")
shapiro.test(mydata5$StdResid) #Distribution of standardized residuals
##
## Shapiro-Wilk normality test
##
## data: mydata5$StdResid
## W = 0.94879, p-value = 0.002187
H0: Variable is normally distributed.
H1: Variable is not normally distributed.
We reject the null hypothesis at p = 0.002 and accept H1: the standardized residuals are still not normally distributed, which is a violation of the normality assumption. However, since the sample size is larger than 30, we can rely on the central limit theorem and proceed with the analysis.
library(readxl)
mydata6 <- read_xlsx("./Apartments.xlsx")
mydata6 <- as.data.frame(mydata6)
head(mydata6)
## Age Distance Price Parking Balcony
## 1 7 28 1640 0 1
## 2 18 1 2800 1 0
## 3 7 28 1660 0 0
## 4 28 29 1850 0 1
## 5 18 18 1640 1 1
## 6 28 12 1770 0 1
fit2 <- lm(Price ~ Age + Distance, data=mydata6)
summary(fit2) #Results of regression model
##
## Call:
## lm(formula = Price ~ Age + Distance, data = mydata6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 2e-16 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 6.18e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 4.896e-11
H0: Beta1 = 0
H1: Beta1 != 0
p = 0.016
Explanation of b1= -7.93
If the age of the apartment increases by 1 year, the price of the apartment decreases on average by 7.93 eur (p = 0.016), assuming that all other explanatory variables included in the model are held constant.
H0: Beta2 = 0
H1: Beta2 != 0
p < 0.001
Explanation of b2= -20.67
If the distance from the city center increases by 1 kilometer, the price of the apartment decreases on average by 20.67 eur (p < 0.001), assuming that all other explanatory variables included in the model are held constant.
Multiple R²: 0.4396
43.96% of the variability of the apartment price is explained by the linear effect of Age and Distance.
Test of significance of regression:
H0: Ro2 = 0
H1: Ro2 > 0
F = 32.16, p < 0.001
We reject the null hypothesis at p < 0.001. The population coefficient of determination is greater than 0, meaning that there is a linear relationship between price and the explanatory variables included in the model.
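The t tests on the individual coefficients can be complemented with confidence intervals; intervals that do not contain 0 correspond to the significant coefficients above. A sketch:
confint(fit2, level = 0.95) #95% confidence intervals for the intercept, Age and Distance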
The linear regression function Price = f(Age, Distance, Parking and Balcony)
mydata6$ParkingF <- factor(mydata6$Parking,
levels = c(0,1),
labels = c("No parking", "Has parking"))
mydata6$BalconyF <- factor(mydata6$Balcony,
levels = c(0,1),
labels = c("No balcony", "Has balcony"))
fit3 <- lm(Price ~ Age + Distance + ParkingF + BalconyF, data = mydata6)
anova(fit2, fit3) #Comparison of two regression models that differ in the number of expl. variables
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance
## Model 2: Price ~ Age + Distance + ParkingF + BalconyF
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 82 6720983
## 2 80 5991088 2 729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: Model 1 is better
H1: Model 2 is better
p = 0.010
We reject the null hypothesis at p = 0.010. Model 2 (fit3, with Parking and Balcony) is better, because it explains significantly more variability in price.
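The same conclusion can be cross-checked with criteria not used in the text, for example the adjusted R² values from the two model summaries or an information criterion; a sketch:
summary(fit2)$adj.r.squared #0.4259 in the summary above
summary(fit3)$adj.r.squared #0.4754 in the summary below
AIC(fit2, fit3) #lower AIC indicates the preferred model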
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + ParkingF + BalconyF, data = mydata6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 2e-16 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 5.28e-09 ***
## ParkingFHas parking 196.168 62.868 3.120 0.00251 **
## BalconyFHas balcony 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 1.849e-11
H0: Beta3 = 0
H1: Beta3 != 0
p = 0.003
Explanation of b3 = 196.17: If the apartment has parking, the price is on average 196.17 eur higher than if the apartment does not have parking, assuming all the other explanatory variables included in the model are held constant (p = 0.003).
H0: Beta4 = 0
H1: Beta4 != 0
p = 0.974
We cannot reject H0: we cannot say that having a balcony has a statistically significant effect on the price of the apartment, assuming all the other explanatory variables included in the model are held constant (p = 0.974).
Test of significance of regression
H0: Ro3 = 0
H1: Ro3 > 0
F = 20.03, p < 0.001
We reject the null hypothesis at p < 0.001. The population coefficient of determination is greater than 0, meaning that there is a linear relationship between price and the explanatory variables included in the model.
mydata6$Fitted <- fitted.values(fit3)
mydata6$Residuals <- residuals(fit3)
head(mydata6[colnames(mydata6) %in% c("Price", "Fitted", "Residuals")])
## Price Fitted Residuals
## 1 1640 1750.741 -110.74150
## 2 2800 2357.411 442.58893
## 3 1660 1748.807 -88.80674
## 4 1850 1589.921 260.07897
## 5 1640 2052.576 -412.57579
## 6 1770 1896.691 -126.69107
Calculating the residual for apartment ID 2:
e = Y - Y(fitted) = 2800 - 2357.41 = 442.59
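The same residual can be extracted directly from the columns added above; a minimal sketch:
mydata6$Price[2] - mydata6$Fitted[2] #2800 - 2357.411 = 442.589
mydata6$Residuals[2] #the same value, stored via residuals(fit3)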