Data

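# sleep75 ships with the wooldridge package
library(wooldridge)
data("sleep75")
sleepData <- sleep75
# Replace missing values with 0 (note: a missing wage is then treated as a wage of 0)
sleepData[is.na(sleepData)] <- 0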

Basic Plots

We will create some basic plots to show the distributions of the variables and the relationships between them.

From this plot we can read that sleep time including naps is not correlated with how much a person is paid.
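The plotting code is not shown in this report. A minimal sketch of how such a scatter plot could be drawn in base R follows. It assumes the plotted columns are `slpnaps` and `hrwage`; the data frame `toy` and the helper `plot_sleep_vs_wage` are hypothetical names, and the data are simulated stand-ins rather than the real sample.

```r
# Simulated stand-in data with sleep75-style column names (assumption:
# the report plotted slpnaps against hrwage).
set.seed(42)
toy <- data.frame(
  hrwage  = runif(100, min = 1, max = 20),      # hourly wage
  slpnaps = rnorm(100, mean = 3300, sd = 450)   # minutes of sleep incl. naps per week
)

# Scatter plot of sleep (incl. naps) against hourly wage with a least-squares line.
plot_sleep_vs_wage <- function(df) {
  plot(df$hrwage, df$slpnaps,
       xlab = "Hourly wage", ylab = "Sleep incl. naps (min/week)",
       main = "Sleep vs. hourly wage")
  fit <- lm(slpnaps ~ hrwage, data = df)
  abline(fit, col = "red")  # a nearly flat line suggests little linear association
  invisible(fit)
}

fit <- plot_sleep_vs_wage(toy)
```

Calling `plot_sleep_vs_wage(sleepData)` instead should give the corresponding plot for the real data.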

From this plot we can read that the number of hours someone works is largely uncorrelated with how much they sleep. Only in extreme cases can we observe that someone who works much more per week sleeps much less. The reverse does not hold, because people who do not work at all vary a lot in the time they sleep.

From this boxplot we can observe that black people earned less on average than white people.
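A grouped boxplot like this one can be produced with the formula interface of `boxplot()`. The sketch below is only an illustration on simulated stand-in data: it assumes the grouping column is the 0/1 indicator `black` and the plotted variable is `hrwage`, and the group labels are made up for the example.

```r
# Simulated stand-in data; proportions and wages are arbitrary.
set.seed(1)
toy <- data.frame(
  black  = rbinom(200, 1, 0.2),     # 0/1 group indicator
  hrwage = rexp(200, rate = 1 / 5)  # hourly wage
)

# Turn the indicator into a labelled factor so the plot axis is readable.
toy$group <- factor(toy$black, levels = c(0, 1), labels = c("not black", "black"))

boxplot(hrwage ~ group, data = toy,
        xlab = "Group", ylab = "Hourly wage",
        main = "Hourly wage by group")

# The boxplot shows medians; group means can be checked separately.
group_means <- tapply(toy$hrwage, toy$group, mean)
print(group_means)
```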

From this plot we can see the distribution of hourly wage; most values lie between 0 and 10.
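As a rough sketch, a histogram of this kind and the share of observations below a threshold can be obtained as follows. The wages here are simulated (log-normal, an arbitrary choice), so the printed share says nothing about the real sample; with the real data one would pass `sleepData$hrwage` instead.

```r
# Simulated stand-in for the hourly wage column.
set.seed(2)
hrwage <- rlnorm(500, meanlog = 1.5, sdlog = 0.6)

# Histogram of hourly wage; hist() also returns the bin counts.
h <- hist(hrwage, breaks = 30,
          main = "Distribution of hourly wage", xlab = "Hourly wage")

# Share of observations with an hourly wage below 10.
share_below_10 <- mean(hrwage < 10)
print(share_below_10)
```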

From this plot we can observe how many construction workers and clerical workers there are, and whether they are union members. Most of the workers are not union members, and there are about 4 times as many construction workers as clerical workers.
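The counts behind such a bar chart can be checked with a cross-tabulation. The sketch below assumes the 0/1 indicator columns `construc`, `clerical` and `union`; the data are simulated with arbitrary probabilities, so the resulting counts are not those of the report.

```r
# Simulated stand-in data; occupation shares and union rate are arbitrary.
set.seed(3)
n <- 300
occ <- sample(c("construction", "clerical", "other"), n, replace = TRUE,
              prob = c(0.1, 0.2, 0.7))
toy <- data.frame(
  construc = as.integer(occ == "construction"),
  clerical = as.integer(occ == "clerical"),
  union    = rbinom(n, 1, 0.2)
)

# Rebuild a single occupation factor from the two indicator columns.
occupation <- factor(
  ifelse(toy$construc == 1, "construction",
         ifelse(toy$clerical == 1, "clerical", "other")),
  levels = c("construction", "clerical", "other")
)
membership <- factor(toy$union, levels = c(0, 1),
                     labels = c("non-union", "union"))

# Cross-tabulate occupation against union membership.
counts <- table(occupation, membership)
print(counts)

# Grouped bars: one group per occupation, one bar per membership status.
barplot(t(counts[c("construction", "clerical"), ]), beside = TRUE,
        legend.text = TRUE, ylab = "Number of workers")
```

Reading the ratio of construction to clerical workers directly from `rowSums(counts)` is less error-prone than estimating it from bar heights.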

We can see that unmarried people have a greater spread in hours of sleep per week, and a few individuals sleep less than 20 hours. Married people, on the other hand, usually sleep more, and a much higher percentage of married than unmarried people sleep slightly less than 60 hours a week.

Correlation coefficients - Pearson’s linear correlation

The correlation coefficient between log hourly wage and sleep is about -0.058. Because the coefficient is close to zero, the linear relationship between the two variables is weak. The negative sign means that sleep tends to decrease slightly as log hourly wage increases.

## [1] -0.05782671
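The call that produced this value is not shown. A minimal sketch of how a Pearson coefficient and its significance test can be computed is given below; it assumes the columns are `lhrwage` and `sleep`, and runs on simulated stand-in data, so its result differs from the value printed above.

```r
# Simulated stand-in data with a weak negative relationship (arbitrary numbers).
set.seed(4)
toy <- data.frame(lhrwage = rnorm(200, mean = 1.4, sd = 0.6))
toy$sleep <- 3300 - 30 * toy$lhrwage + rnorm(200, sd = 440)

# Pearson's linear correlation coefficient.
r <- cor(toy$lhrwage, toy$sleep, method = "pearson")
print(r)

# cor.test() adds a p-value and a confidence interval for the coefficient.
pearson_test <- cor.test(toy$lhrwage, toy$sleep, method = "pearson")
print(pearson_test$p.value)
print(pearson_test$conf.int)
```

A coefficient near zero with a confidence interval that includes zero would support the reading that the linear relationship is weak.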

Rank correlation
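No output is shown for this section. As a hedged sketch, a Spearman rank correlation for the same kind of variable pair could be computed as below; the column names `lhrwage` and `sleep` are an assumption and the data are simulated.

```r
# Simulated stand-in data (arbitrary numbers).
set.seed(8)
toy <- data.frame(lhrwage = rnorm(200, mean = 1.4, sd = 0.6))
toy$sleep <- 3300 - 30 * toy$lhrwage + rnorm(200, sd = 440)

# Spearman's rho: the Pearson correlation of the ranks, so it captures
# monotone (not only linear) association and is less sensitive to outliers.
rho <- cor(toy$lhrwage, toy$sleep, method = "spearman")
print(rho)

# The same number obtained explicitly from the ranks.
rho_from_ranks <- cor(rank(toy$lhrwage), rank(toy$sleep))

# Significance test for the rank correlation.
spearman_test <- cor.test(toy$lhrwage, toy$sleep,
                          method = "spearman", exact = FALSE)
print(spearman_test$p.value)
```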

Linear regression model

The intercept is -1.16071, the expected hourly wage when all predictors are zero; it is not significantly different from zero (p = 0.2239). For each additional year of education, the hourly wage is expected to increase by 0.24322 units, holding the other predictors fixed. For each additional year of age, the hourly wage is expected to increase by 0.02333 units, but this effect is only marginally significant (p = 0.0707). Males have, on average, an hourly wage 1.72391 units higher than females. Education and gender are highly significant (p < 0.001); age is not significant at the 5% level. The residual standard error is 3.739, the typical size of the error in the predicted hourly wage. The multiple R-squared is 0.08059, so only about 8.06% of the variance in the hourly wage is explained by the predictors. The F-statistic is 20.51 with a p-value of 9.492e-13, so the model as a whole is highly significant.

## 
## Call:
## lm(formula = sleepData$hrwage ~ sleepData$educ + sleepData$age + 
##     sleepData$male, data = sleepData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1912 -2.2867 -0.1266  1.6524 29.7388 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.16071    0.95348  -1.217   0.2239    
## sleepData$educ  0.24322    0.05251   4.632 4.32e-06 ***
## sleepData$age   0.02333    0.01289   1.810   0.0707 .  
## sleepData$male  1.72391    0.28441   6.061 2.20e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.739 on 702 degrees of freedom
## Multiple R-squared:  0.08059,    Adjusted R-squared:  0.07666 
## F-statistic: 20.51 on 3 and 702 DF,  p-value: 9.492e-13
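The call above writes every term as `sleepData$...` even though `data = sleepData` is supplied. That works for fitting, but bare column names are the more idiomatic form and are needed for `predict()` to use new data. A sketch on simulated stand-in data (coefficients chosen arbitrarily, the name `toy` is hypothetical):

```r
# Simulated stand-in data with sleep75-style column names.
set.seed(5)
n <- 300
toy <- data.frame(
  educ = sample(8:17, n, replace = TRUE),   # years of education
  age  = sample(23:65, n, replace = TRUE),  # age in years
  male = rbinom(n, 1, 0.55)                 # 0/1 indicator
)
toy$hrwage <- -1 + 0.25 * toy$educ + 0.02 * toy$age + 1.7 * toy$male +
  rnorm(n, sd = 3.7)

# Bare column names in the formula; the data frame is given once via `data`.
model1 <- lm(hrwage ~ educ + age + male, data = toy)
print(summary(model1))

# 95% confidence intervals for the coefficients.
print(confint(model1))

# Prediction for a new observation works because the formula uses bare names.
pred <- predict(model1, newdata = data.frame(educ = 12, age = 40, male = 1))
print(pred)
```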

The intercept is 3285.21, the expected sleep duration (in minutes per week) when all predictors are zero. All six predictors are 0/1 indicators, so each coefficient is the expected difference in sleep between the two groups, holding the other predictors fixed. smsa = 1 is associated with a decrease of 51.63 (p = 0.1435), inlf = 1 with a decrease of 36.10 (p = 0.3534), construc = 1 with an increase of 133.05 (p = 0.2393), clerical = 1 with an increase of 78.44 (p = 0.1202), black = 1 with a decrease of 72.55 (p = 0.3485), and south = 1 with an increase of 77.49 (p = 0.0833). None of these effects is statistically significant at the 5% level; only south approaches significance. The model has very limited explanatory power: only about 1.57% of the variability in sleep duration is explained by the predictors (adjusted R-squared 0.007269). The F-statistic of 1.86 with a p-value of 0.08517 indicates that the overall fit is not statistically significant at the 5% level.

## 
## Call:
## lm(formula = sleepData$sleep ~ sleepData$smsa + sleepData$inlf + 
##     sleepData$construc + sleepData$clerical + sleepData$black + 
##     sleepData$south, data = sleepData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2548.51  -252.62     5.52   262.26  1367.46 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3285.21      38.40  85.560   <2e-16 ***
## sleepData$smsa       -51.63      35.26  -1.464   0.1435    
## sleepData$inlf       -36.10      38.87  -0.929   0.3534    
## sleepData$construc   133.05     112.97   1.178   0.2393    
## sleepData$clerical    78.44      50.41   1.556   0.1202    
## sleepData$black      -72.55      77.33  -0.938   0.3485    
## sleepData$south       77.49      44.68   1.734   0.0833 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 442.8 on 699 degrees of freedom
## Multiple R-squared:  0.01572,    Adjusted R-squared:  0.007269 
## F-statistic:  1.86 on 6 and 699 DF,  p-value: 0.08517

Nonlinear regression model

## [1] 1402.67

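Most of the responses lie fairly close to the prediction line, and we can observe a few outliers. To sum up, the prediction model is only roughly accurate: it explains about 13.6% of the variance in totwrk (R-squared 0.1361), and neither the linear nor the quadratic term is individually significant.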

## 
## Call:
## lm(formula = totwrk ~ sleep + I(sleep^2), data = sleep1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2272.1  -561.0    82.7   568.3  3181.2 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.756e+03  9.071e+02   4.141 4.03e-05 ***
## sleep       -1.809e-01  5.746e-01  -0.315    0.753    
## I(sleep^2)  -9.565e-05  9.143e-05  -1.046    0.296    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 882.4 on 526 degrees of freedom
## Multiple R-squared:  0.1361, Adjusted R-squared:  0.1328 
## F-statistic: 41.43 on 2 and 526 DF,  p-value: < 2.2e-16
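The model above is a quadratic (polynomial) regression, which is still fitted with `lm()`. The sketch below shows the same formula on simulated stand-in data and how the fitted curve can be drawn; how the subset `sleep1` was built is not shown in the report, so the data frame `toy` here is purely hypothetical.

```r
# Simulated stand-in data (arbitrary coefficients and noise).
set.seed(6)
n <- 300
toy <- data.frame(sleep = rnorm(n, mean = 3300, sd = 450))
toy$totwrk <- 3700 - 0.2 * toy$sleep - 1e-4 * toy$sleep^2 + rnorm(n, sd = 880)

# I() protects the square so that ^ means arithmetic, not formula crossing.
quad <- lm(totwrk ~ sleep + I(sleep^2), data = toy)

# Equivalent fit written with raw polynomial terms.
quad_poly <- lm(totwrk ~ poly(sleep, 2, raw = TRUE), data = toy)

# Draw the data and the fitted curve over a grid of sleep values.
plot(toy$sleep, toy$totwrk, xlab = "sleep", ylab = "totwrk")
grid <- data.frame(sleep = seq(min(toy$sleep), max(toy$sleep), length.out = 100))
lines(grid$sleep, predict(quad, newdata = grid), col = "red", lwd = 2)

# Compare against the purely linear model with an F-test.
linear <- lm(totwrk ~ sleep, data = toy)
print(anova(linear, quad))
```

The `anova()` comparison is one way to judge whether the quadratic term adds anything over a straight line.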

In the first plot (top left, Residuals vs Fitted) the fitted values are concentrated between 1500 and 2250, and the residuals range from about -2000 to 1500; some outliers can be observed. In the second plot (top right, Normal Q-Q) the points lie close to the reference line, which suggests that the residuals are approximately normally distributed. In the third plot (bottom left, Scale-Location) the results are again concentrated in one area, similarly to the first plot. In the last plot (bottom right, Residuals vs Leverage) almost all results are concentrated on the left side of the plot, with standardized residuals between about -3.0 and 2.0.

par(mfrow = c(2,2))
plot(model3)

Correlation plot

From this correlation plot we can see that the correlation is strong where the squares are clearly red or blue. leis1, leis2 and leis3 are strongly related to totwrk and worknrm, and totwrk is strongly related to worknrm. Blue means that the two variables vary in the same direction: if one increases, the other increases as well, and if one decreases, so does the other. Red means the opposite: if one variable increases, the other decreases, and vice versa.

## corrplot 0.92 loaded
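The code for the correlation plot is not shown. A sketch of one way to build it: compute the correlation matrix with `cor()` and pass it to `corrplot::corrplot()`. The data are simulated stand-ins with sleep75-style column names, and the base R `heatmap()` fallback is only there so the example runs without the corrplot package.

```r
# Simulated stand-in data: worknrm tracks totwrk, leis1 moves against it.
set.seed(7)
n <- 200
totwrk <- rnorm(n, mean = 2100, sd = 900)
toy <- data.frame(
  totwrk  = totwrk,
  worknrm = totwrk + rnorm(n, sd = 100),
  leis1   = 5000 - totwrk + rnorm(n, sd = 300),
  age     = sample(23:65, n, replace = TRUE)
)

# Pairwise Pearson correlations between all numeric columns.
corr <- cor(toy)
print(round(corr, 2))

if (requireNamespace("corrplot", quietly = TRUE)) {
  # Squares sized and coloured by correlation (blue positive, red negative
  # in corrplot's default palette).
  corrplot::corrplot(corr, method = "square")
} else {
  # Base R fallback when corrplot is not installed.
  heatmap(corr, symm = TRUE)
}
```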

