Linear regression

In this project we were asked to build the best linear model for the sleep75 data. As the task required, we focused on the relationship between average hourly wage and minutes slept, with or without naps. We also checked how these variables depend on sex and age. First we wanted to look at the data in general. We used the kableExtra library to make our tables look nicer :)

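# Preview only: first six rows; columns male, hrwage, sleep, slpnaps, clerical, construc, earns74, age
data <- sleep75[1:6, c(16, 33, 21, 22, 4, 5, 7, 1)]
data <- na.omit(data)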

data %>%
  mutate(sleepInHours = sleep /60) %>%
  relocate(sleepInHours,.before=sleep)%>%
  relocate(sleep,.after=age)%>%
  mutate(sleepNapsInHours = slpnaps /60) %>%
  relocate(sleepNapsInHours,.after=sleepInHours)%>%
  relocate(slpnaps,.after=sleep)%>%
  relocate(age,.after=sleepNapsInHours)%>%
  kbl()%>%
  kable_classic("hover",full_width=F) %>%
  kable_material_dark()%>%
  row_spec(0,angle=-20)%>%
  add_header_above(c(" ","Data set"=9),bold = TRUE)
Data set
male     hrwage  sleepInHours  sleepNapsInHours  age  clerical  construc  earns74  sleep  slpnaps
   1   7.070004      51.88333          52.71667   32         0         0        0   3113     3163
   1   1.429999      48.66667          48.66667   31         0         0     9500   2920     2920
   1  20.529997      44.50000          46.00000   44         0         0    42500   2670     2760
   0   9.619998      51.38333          51.38333   30         0         0    42500   3083     3083
   1   2.750000      57.46667          58.21667   64         0         0     2500   3448     3493
   1  19.249998      67.71667          67.96667   41         0         0        0   4063     4078

We created two new columns with the sleep time in hours, because that is more natural to read (sleep and slpnaps are recorded in minutes per week). Now that we have had a glimpse of the data, some general descriptive information is also useful: minimum and maximum values, mean, median, and quartiles. For that we used the summary() function.

First, the mean of the male variable is 0.83, which means that five of the six rows in this preview are men, so men outnumber women here. The people in these rows are between 30 and 64 years old. The minimum sleep is 2670 minutes and the maximum is 4063 minutes. The average hourly wage is between 1.430 and 20.530. Its mean is 10.108, above the median of 8.345, which suggests a right-skewed rather than a normal distribution; a mean on its own says nothing about normality.

summary(data)
##       male            hrwage           sleep         slpnaps        clerical
##  Min.   :0.0000   Min.   : 1.430   Min.   :2670   Min.   :2760   Min.   :0  
##  1st Qu.:1.0000   1st Qu.: 3.830   1st Qu.:2961   1st Qu.:2961   1st Qu.:0  
##  Median :1.0000   Median : 8.345   Median :3098   Median :3123   Median :0  
##  Mean   :0.8333   Mean   :10.108   Mean   :3216   Mean   :3250   Mean   :0  
##  3rd Qu.:1.0000   3rd Qu.:16.842   3rd Qu.:3364   3rd Qu.:3410   3rd Qu.:0  
##  Max.   :1.0000   Max.   :20.530   Max.   :4063   Max.   :4078   Max.   :0  
##     construc    earns74           age       
##  Min.   :0   Min.   :    0   Min.   :30.00  
##  1st Qu.:0   1st Qu.:  625   1st Qu.:31.25  
##  Median :0   Median : 6000   Median :36.50  
##  Mean   :0   Mean   :16167   Mean   :40.33  
##  3rd Qu.:0   3rd Qu.:34250   3rd Qu.:43.25  
##  Max.   :0   Max.   :42500   Max.   :64.00
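A summary reports the mean and the median, but it cannot by itself show that a variable is normally distributed. The sketch below is one possible quick check, using only the six hrwage values printed in the preview table above, so it says nothing about the full sleep75 data. The variable names are illustrative, and with six observations the Shapiro-Wilk test has very little power.

```r
# Hourly wages of the six preview rows, copied from the table above
hrwage <- c(7.070004, 1.429999, 20.529997, 9.619998, 2.750000, 19.249998)

wage_mean   <- mean(hrwage)
wage_median <- median(hrwage)

# A mean noticeably above the median hints at right skew
right_skewed <- wage_mean > wage_median

# Shapiro-Wilk is a formal normality test (base R, stats package)
sw <- shapiro.test(hrwage)

print(c(mean = wage_mean, median = wage_median))
print(sw$p.value)
```

On the full data, a histogram or a normal Q-Q plot of hrwage would probably be more informative than either number.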

Let’s create a simple plot of sleep time including naps against average hourly wage. The points are densely clustered: many people sleep between 3000 and 4000 minutes and earn a very low hourly wage. Let’s compare it with the plot for sleep time without naps.

ggplot(data = sleep75, aes(slpnaps, hrwage)) +
  geom_jitter(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE, color = "deeppink4") +
  xlab("Sleep with naps") +
  ylab("Hourly wage") +
  labs(title = "Linear model for sleep with naps and hourly wage") +
  theme(plot.title = element_text(size = 15, color = "deeppink4"))

From the plot below, colored by sex, we can conclude that men mostly earn more: women have a lower average hourly wage. Men and women sleep approximately the same amount, yet men still have higher wages.

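# Sleep (with naps) explained by hourly wage and sex, fitted on the full data used in the plot
model_1 <- lm(slpnaps ~ hrwage + male, data = sleep75)
library(ggiraphExtra)
# male is coded 0 = female, 1 = male, so the labels follow that order
sex <- factor(sleep75$male, levels = c(0, 1), labels = c("female", "male"))
ggplot(data = sleep75, aes(hrwage, slpnaps, color = sex)) +
  geom_point() +
  geom_smooth(method = "lm")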
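Because the color aesthetic splits the points by sex, geom_smooth(method = "lm") draws a separate regression line for each group. That corresponds to a model in which the wage slope may differ by sex, not to the additive model_1. The sketch below shows one way the two could be compared. compare_sex_models is a hypothetical helper, not part of the original analysis, and it assumes a data frame with slpnaps, hrwage, and male columns and that sleep75 comes from the wooldridge package.

```r
# Hypothetical helper: compares the additive model with one that lets the
# wage slope differ by sex (what separate per-group smooth lines imply).
compare_sex_models <- function(df) {
  additive    <- lm(slpnaps ~ hrwage + male, data = df)
  interaction <- lm(slpnaps ~ hrwage * male, data = df)
  # F-test: does adding the hrwage:male term improve the fit?
  anova(additive, interaction)
}

# Usage, assuming the wooldridge package (which provides sleep75) is installed
if (requireNamespace("wooldridge", quietly = TRUE)) {
  data("sleep75", package = "wooldridge")
  print(compare_sex_models(sleep75))
}
```

A small p-value in the second row of the table would suggest that the two lines really do have different slopes; a large one would support keeping the simpler additive model.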

# Sleep (with naps) explained by hourly wage and age; named model_2 so it does not overwrite model_1
model_2 <- lm(slpnaps ~ hrwage + age, data = sleep75)

ggplot(data = sleep75, aes(hrwage, slpnaps, color = age)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_sqrt()

From the second plot we can say that there is no strong relationship between age and either sleep with naps or hourly wage. On average they are about the same across ages.
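The visual impression can be backed up with correlation coefficients. The sketch below is a minimal example under stated assumptions: age_correlations is a hypothetical helper, and it is applied here only to the six preview rows from the table above, which are far too few for real conclusions. On the full data the same call would take sleep75 instead.

```r
# Hypothetical helper: Pearson correlations of age with sleep and with wage,
# using only rows where all three values are present
age_correlations <- function(df) {
  complete <- na.omit(df[, c("age", "slpnaps", "hrwage")])
  c(
    age_sleep = cor(complete$age, complete$slpnaps),
    age_wage  = cor(complete$age, complete$hrwage)
  )
}

# The six preview rows, copied from the table above
preview <- data.frame(
  age     = c(32, 31, 44, 30, 64, 41),
  slpnaps = c(3163, 2920, 2760, 3083, 3493, 4078),
  hrwage  = c(7.070004, 1.429999, 20.529997, 9.619998, 2.750000, 19.249998)
)

print(age_correlations(preview))
```

Values close to zero would be consistent with the reading of the plot; cor.test() from base R could add a confidence interval if one is needed.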