Data
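library(wooldridge)  # sleep75 ships with the wooldridge package
data("sleep75")
sleepData <- sleep75
# Replace every missing value with 0 (note: this affects all later statistics)
sleepData[is.na(sleepData)] <- 0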
Basic Plots
We will create some basic plots to show the distributions of and
relationships between the variables.
From this plot we can read that sleep time including naps does not
correlate with how much a person is paid.

From this plot we can read that the number of hours someone works
is not correlated with how much they sleep. Only in extreme cases can
we observe that someone who works much more per week sleeps much less.
The reverse does not hold, because people who do not work at all vary
a lot in how long they sleep.

From this boxplot we can observe that black people earned less
money on average than white people.

From this plot we can learn the distribution of the hourly wage, and
we can see that most values lie between 0 and 10.

From this plot we can observe how many construction workers and
clerical workers there are and whether they are union members. We can
see that most of the workers are not union members, and there are about
4 times more construction workers than clerical workers.

We can see that people who are not married have a greater spread in
the hours of sleep per week, and a few individuals sleep even less than
20 hours. On the other hand, married people usually sleep more, and a
much higher percentage of married than unmarried people sleep slightly
less than 60 hours a week.
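Plots of the kinds described in this section (scatter plot, boxplot, histogram) could be produced in base R roughly as follows. This is only a sketch: the data frame `demo` is simulated, and its column names are assumptions chosen to mirror the dataset, so the shapes will not match the figures discussed above.

```r
# Simulated stand-in data; column names are assumptions mirroring sleep75.
set.seed(5)
n <- 200
demo <- data.frame(
  totwrk = pmax(0, rnorm(n, mean = 2100, sd = 900)),
  black  = rbinom(n, 1, 0.1),
  hrwage = rexp(n, rate = 1 / 5)
)
demo$sleep <- 3600 - 0.15 * demo$totwrk + rnorm(n, sd = 400)

par(mfrow = c(1, 3))
# Scatter plot: relationship between two numeric variables.
plot(demo$totwrk, demo$sleep,
     xlab = "totwrk", ylab = "sleep", main = "Sleep vs. work")
# Boxplot: distribution of a numeric variable within each group.
bp <- boxplot(hrwage ~ black, data = demo,
              xlab = "black", ylab = "hrwage", main = "Wage by group")
# Histogram: distribution of a single numeric variable.
h <- hist(demo$hrwage, breaks = 20,
          xlab = "hrwage", main = "Hourly wage")
par(mfrow = c(1, 1))
```

With the real data, `demo` would simply be replaced by `sleepData`.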

Correlation coefficients - Pearson’s linear correlation
The correlation coefficient between log hourly wage and sleep is
-0.058. Since the coefficient is close to zero, it suggests a very weak
linear relationship between the two variables. The negative sign
indicates a negative correlation, meaning that as log hourly wage
increases, sleep tends to decrease slightly.
## [1] -0.05782671
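To make the definition concrete, the sketch below computes Pearson's coefficient by hand (covariance divided by the product of the standard deviations) and compares it with `cor()`. The data are simulated and the names `demo`, `lhrwage` and `sleep` are assumptions that only mirror the dataset, so the value will differ from the one reported above.

```r
# Simulated stand-in data; column names are assumptions mirroring sleep75.
set.seed(1)
n <- 200
demo <- data.frame(lhrwage = rnorm(n, mean = 1.5, sd = 0.6))
demo$sleep <- 3300 - 40 * demo$lhrwage + rnorm(n, sd = 440)

# Pearson's r: covariance scaled by both standard deviations.
pearson_r <- function(x, y) {
  cov(x, y) / (sd(x) * sd(y))
}

r_manual  <- pearson_r(demo$lhrwage, demo$sleep)
r_builtin <- cor(demo$lhrwage, demo$sleep, method = "pearson")

# cor.test() adds a significance test and a confidence interval for r.
r_test <- cor.test(demo$lhrwage, demo$sleep)
print(c(manual = r_manual, builtin = r_builtin, p_value = r_test$p.value))
```

`cor.test()` is useful here because a coefficient near zero may or may not be distinguishable from zero, depending on the sample size.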
Rank correlation
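A rank correlation such as Spearman's rho is Pearson's coefficient computed on the ranks of the observations, which makes it less sensitive to skewed variables and outliers. A minimal sketch on simulated data (the vectors `x` and `y` are hypothetical, not columns of the dataset):

```r
# Simulated, skewed stand-in data (hypothetical, not from sleep75).
set.seed(2)
x <- rexp(150)
y <- 3300 - 30 * log1p(x) + rnorm(150, sd = 400)

# Spearman's rho is Pearson's r computed on the ranks.
rho_manual  <- cor(rank(x), rank(y))
rho_builtin <- cor(x, y, method = "spearman")

# Kendall's tau is another rank-based alternative.
tau <- cor(x, y, method = "kendall")
print(c(rho_manual = rho_manual, rho_builtin = rho_builtin, tau = tau))
```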
Linear regression model
The intercept is -1.16071, the expected hourly wage when all
predictor variables are zero; it is not significantly different from
zero (p = 0.2239). For each additional year of education, the hourly
wage is expected to increase by 0.24322 units, holding the other
predictors fixed. For each additional year of age, the hourly wage is
expected to increase by 0.02333 units. Males have, on average, a higher
hourly wage by 1.72391 units compared to females. Education and gender
are highly significant (p < 0.001), while age is significant only at
the 10% level (p = 0.0707). The residual standard error is 3.739,
representing the typical size of the error in the predicted hourly
wage. The multiple R-squared value is 0.08059, indicating that
approximately 8.06% of the variance in the hourly wage can be explained
by the predictors in the model. The F-statistic is 20.51 with a very low
p-value (9.492e-13), indicating that the model as a whole is highly
significant.
##
## Call:
## lm(formula = sleepData$hrwage ~ sleepData$educ + sleepData$age +
## sleepData$male, data = sleepData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.1912 -2.2867 -0.1266  1.6524 29.7388
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    -1.16071    0.95348  -1.217   0.2239
## sleepData$educ  0.24322    0.05251   4.632 4.32e-06 ***
## sleepData$age   0.02333    0.01289   1.810   0.0707 .
## sleepData$male  1.72391    0.28441   6.061 2.20e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.739 on 702 degrees of freedom
## Multiple R-squared: 0.08059, Adjusted R-squared: 0.07666
## F-statistic: 20.51 on 3 and 702 DF, p-value: 9.492e-13
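The call above repeats `sleepData$` in every term even though `data = sleepData` is supplied. A more idiomatic form uses bare column names, which also lets `predict()` work on new observations. The sketch below shows this on simulated data; `demo`, its coefficients and `new_person` are hypothetical and only mirror the variable names used in the report.

```r
# Simulated stand-in data; column names are assumptions mirroring sleep75.
set.seed(3)
n <- 300
demo <- data.frame(
  educ = sample(8:17, n, replace = TRUE),
  age  = sample(23:65, n, replace = TRUE),
  male = rbinom(n, 1, 0.55)
)
demo$hrwage <- 0.25 * demo$educ + 0.02 * demo$age + 1.7 * demo$male +
  rnorm(n, sd = 3.5)

# Bare column names in the formula; `data =` tells lm() where to find them.
model <- lm(hrwage ~ educ + age + male, data = demo)
coefs <- summary(model)$coefficients  # estimates, std. errors, t and p values
print(round(coefs, 4))

# Because the formula uses bare names, predict() accepts new observations.
new_person <- data.frame(educ = 12, age = 40, male = 1)
pred <- predict(model, newdata = new_person)
print(pred)
```

The prediction is just the intercept plus each coefficient times the corresponding value in `new_person`, which is the same reading applied to the coefficients in the text.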
The intercept is 3285.21, which represents the expected sleep
duration when all predictors are zero. None of the six predictors is
statistically significant at the 5% level. A one-unit increase in the
smsa variable is associated with a decrease in sleep duration of 51.63
units (p = 0.1435), and a one-unit increase in inlf with a decrease of
36.10 units (p = 0.3534). A one-unit increase in construc is associated
with an increase in sleep duration of 133.05 units (p = 0.2393), and a
one-unit increase in clerical with an increase of 78.44 units
(p = 0.1202). A one-unit increase in black is associated with a decrease
of 72.55 units (p = 0.3485). A one-unit increase in south is associated
with an increase of 77.49 units; this is the only effect that is
significant at the 10% level (p = 0.0833). The overall model has very
limited explanatory power: only about 1.57% of the variability in sleep
duration can be explained by the predictors included in the model, and
the adjusted R-squared is 0.007269. The F-statistic of 1.86 with a
p-value of 0.08517 means that the model as a whole is not statistically
significant at the 5% level.
##
## Call:
## lm(formula = sleepData$sleep ~ sleepData$smsa + sleepData$inlf +
## sleepData$construc + sleepData$clerical + sleepData$black +
## sleepData$south, data = sleepData)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2548.51  -252.62     5.52   262.26  1367.46
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         3285.21      38.40  85.560   <2e-16 ***
## sleepData$smsa       -51.63      35.26  -1.464   0.1435
## sleepData$inlf       -36.10      38.87  -0.929   0.3534
## sleepData$construc   133.05     112.97   1.178   0.2393
## sleepData$clerical    78.44      50.41   1.556   0.1202
## sleepData$black      -72.55      77.33  -0.938   0.3485
## sleepData$south       77.49      44.68   1.734   0.0833 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 442.8 on 699 degrees of freedom
## Multiple R-squared: 0.01572, Adjusted R-squared: 0.007269
## F-statistic: 1.86 on 6 and 699 DF, p-value: 0.08517
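The F-statistic, its p-value and the adjusted R-squared in this summary are all tied to the multiple R-squared and the degrees of freedom. As a small worked check, the sketch below recomputes them from the printed values; because the inputs are rounded, the results agree with the summary only up to rounding. The variable names are illustrative.

```r
# Values copied from the summary output above.
r2       <- 0.01572  # multiple R-squared
k        <- 6        # number of predictors (numerator df)
df_resid <- 699      # residual degrees of freedom (n - k - 1)

# Overall F-statistic: explained variance per predictor over
# unexplained variance per residual degree of freedom.
f_stat <- (r2 / k) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_val <- pf(f_stat, df1 = k, df2 = df_resid, lower.tail = FALSE)

# Adjusted R-squared penalises the number of predictors; n - 1 = df_resid + k.
adj_r2 <- 1 - (1 - r2) * (df_resid + k) / df_resid

print(c(F = f_stat, p = p_val, adj_r2 = adj_r2))
```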
Nonlinear regression model
## [1] 1402.67
Most of the responses lie quite close to the prediction line, and
we can observe a few outliers. To sum up, the prediction model is more
or less accurate.
##
## Call:
## lm(formula = totwrk ~ sleep + I(sleep^2), data = sleep1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2272.1  -561.0    82.7   568.3  3181.2
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.756e+03  9.071e+02   4.141 4.03e-05 ***
## sleep       -1.809e-01  5.746e-01  -0.315    0.753
## I(sleep^2)  -9.565e-05  9.143e-05  -1.046    0.296
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 882.4 on 526 degrees of freedom
## Multiple R-squared: 0.1361, Adjusted R-squared: 0.1328
## F-statistic: 41.43 on 2 and 526 DF, p-value: < 2.2e-16
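Because this model is quadratic in `sleep`, the effect of one additional unit of sleep on `totwrk` is not constant: it is the derivative b1 + 2 * b2 * sleep. The sketch below plugs in the rounded coefficients from the summary above. The function names and the evaluation points 2500, 3000 and 3500 are illustrative assumptions, not values taken from the data.

```r
# Rounded coefficients copied from the summary output above.
b0 <- 3.756e+03   # intercept
b1 <- -1.809e-01  # coefficient on sleep
b2 <- -9.565e-05  # coefficient on sleep^2

# Fitted value of totwrk for a given amount of sleep.
predict_totwrk <- function(sleep) b0 + b1 * sleep + b2 * sleep^2

# Slope of the fitted curve: change in totwrk per extra unit of sleep.
marginal_effect <- function(sleep) b1 + 2 * b2 * sleep

sleep_grid <- c(2500, 3000, 3500)  # arbitrary illustrative values
result <- data.frame(
  sleep  = sleep_grid,
  fitted = predict_totwrk(sleep_grid),
  slope  = marginal_effect(sleep_grid)
)
print(result)
```

Since both b1 and b2 are negative, the slope becomes more negative as `sleep` grows, which is what distinguishes this fit from a straight line.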
In the first plot (top left, residuals vs. fitted values) we can
observe that the fitted values accumulate between 1500 and 2250, and
most residuals range from about -2000 up to 1500. Some outliers can be
observed. In the second plot (top right, the normal Q-Q plot) the points
follow the reference line closely, so the standardized residuals are
close to normally distributed. In the third plot (bottom left,
scale-location) the results are accumulated in one area, similarly to
the first plot. The last plot (bottom right, residuals vs. leverage)
looks quite different: the results are accumulated on the left side of
the plot, with standardized residuals between about -3.0 and 2.0.
par(mfrow = c(2,2))
plot(model3)

Correlation plot
From this correlation plot we can see that there is a strong
correlation where the red and blue squares are most intense. leis1,
leis2 and leis3 are strongly related to totwrk and worknrm, and, for
example, totwrk is strongly related to worknrm. Blue means that the two
variables under consideration vary in the same direction: if one
variable increases, the other one increases, and if one decreases, the
other one decreases as well. Red means the opposite: if one variable
increases, the other decreases, and vice versa.
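A plot of this kind is drawn from a correlation matrix. The sketch below builds such a matrix with `cor()` on simulated data and, if the corrplot package is installed, draws it with squares. The data frame `demo` and the way its columns are generated are assumptions meant only to mimic the positive and negative relationships described above.

```r
# Simulated stand-in data; column names are assumptions mirroring sleep75.
set.seed(4)
n <- 250
totwrk <- rnorm(n, mean = 2100, sd = 900)
demo <- data.frame(
  totwrk  = totwrk,
  worknrm = totwrk + rnorm(n, sd = 100),         # moves with totwrk
  leis1   = 5000 - totwrk + rnorm(n, sd = 300)   # moves against totwrk
)

# Pairwise Pearson correlations between all columns.
corr_mat <- cor(demo)
print(round(corr_mat, 2))

# Draw the matrix only if corrplot is available.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_mat, method = "square")
}
```

In the printed matrix, the positive entry for `totwrk` and `worknrm` corresponds to a blue square and the negative entry for `totwrk` and `leis1` to a red one.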
## corrplot 0.92 loaded

