Zack Seath (s3843672)
Last updated: 28 May, 2023
The data set for this investigation was sourced from Kaggle, an open data repository for machine learning and data science.
The data were collected in the 1994 wave of the Canadian Survey of Labour and Income Dynamics (SLID), conducted in Ontario.
The data set consists of 7425 observations with 5 variables.
Those variables include the following:
Wages: Composite hourly wages from all jobs in Canadian Dollars.
Education: Amount of schooling in Years.
Age: in Years.
Sex: Factor Variable with levels of Male and Female.
Language: Factor Variable with levels of English, French or Other.
# Import the data
Slid <- read.csv("SLID.csv", header = TRUE)
# Check the data
names(Slid)
# Drop the 1st column as it is just the index
Slid <- Slid[-c(1)]
# Check summary statistics
summary(Slid)
# Remove observations with a missing value in any variable
# (na.omit drops every row containing an NA, not only rows with missing wages)
Slid2 <- na.omit(Slid)
# Make sure the two factor variables are factors
Slid2$sex <- as.factor(Slid2$sex)
Slid2$language <- as.factor(Slid2$language)
# Final Check of Summary Statistics
summary(Slid2)

| Variable | N | Mean | Std. Dev. | Min | Pctl. 25 | Pctl. 75 | Max |
|---|---|---|---|---|---|---|---|
| wages | 3987 | 16 | 7.9 | 2.3 | 9.2 | 20 | 50 |
| education | 3987 | 13 | 3 | 0 | 12 | 15 | 20 |
| age | 3987 | 37 | 12 | 16 | 28 | 46 | 69 |
| sex | 3987 | | | | | | |
| … Female | 2001 | 50% | | | | | |
| … Male | 1986 | 50% | | | | | |
| language | 3987 | | | | | | |
| … English | 3244 | 81% | | | | | |
| … French | 259 | 6% | | | | | |
| … Other | 484 | 12% | | | | | |
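Because `na.omit()` works on whole rows, it can help to see which variables carry the missing values before dropping anything. The sketch below uses a small made-up data frame (`toy`, with invented values but the same column names as the data set) rather than the real file, so the counts it produces are illustrative only.

```r
# Toy stand-in for the survey data; all values are invented for illustration
toy <- data.frame(
  wages     = c(10.5, NA, 17.8, NA, 14.0),
  education = c(15.0, 13.2, NA, 14.0, 8.0),
  age       = c(40, 19, 49, 46, 71),
  sex       = c("Male", "Male", "Male", "Male", "Female"),
  language  = c("English", "English", "Other", NA, "English")
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# na.omit() drops a row if ANY of its columns is NA
toy_complete <- na.omit(toy)
print(nrow(toy_complete))

# Dropping only the rows with missing wages keeps more observations
toy_wages_only <- toy[!is.na(toy$wages), ]
print(nrow(toy_wages_only))
```

Comparing the two row counts shows how many observations are lost to missing values in variables other than wages.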
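# Scatter plot of wages against education, with the fitted regression line
# (plot() and abline() draw as a side effect, so their results are not assigned)
plot(Slid2$education, Slid2$wages, xlab = "Education", ylab = "Wages")
abline(lm(wages ~ education, data = Slid2))
\[H_0: \beta_1 = 0\]
\[H_A: \beta_1 \ne 0\]
Here \(\beta_1\) is the slope of the regression of wages on education.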
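In simple linear regression, testing whether the slope is zero is equivalent to testing whether the Pearson correlation is zero, so both tests give the same p-value. A minimal sketch of this on simulated stand-in data might look like the following; the data frame `toy` and the values used to generate it are assumptions for illustration, not the survey data.

```r
# Simulated stand-in data; the generating values are arbitrary assumptions
set.seed(1)
n <- 200
education <- round(runif(n, min = 8, max = 20))
wages <- 5 + 0.8 * education + rnorm(n, sd = 7)
toy <- data.frame(wages = wages, education = education)

# Fit the simple linear regression using the formula interface with data =
fit <- lm(wages ~ education, data = toy)

# p-value of the t-test on the slope
coefs <- summary(fit)$coefficients
slope_p <- coefs["education", "Pr(>|t|)"]

# p-value of the test on the Pearson correlation
cor_p <- cor.test(toy$wages, toy$education)$p.value

# The two p-values agree up to numerical tolerance
print(all.equal(slope_p, cor_p))

# 95% confidence interval for the slope
slope_ci <- confint(fit, "education", level = 0.95)
print(slope_ci)
```

Writing the formula as `wages ~ education` with `data = toy` keeps the coefficient labelled `education` rather than `toy$education`, which makes the coefficient table easier to index.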
##
## Call:
## lm(formula = Slid2$wages ~ Slid2$education, data = Slid2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.688  -5.822  -1.039   4.148  34.190
##
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      4.97169    0.53429   9.305   <2e-16 ***
## Slid2$education  0.79231    0.03906  20.284   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.492 on 3985 degrees of freedom
## Multiple R-squared: 0.09359, Adjusted R-squared: 0.09336
## F-statistic: 411.4 on 1 and 3985 DF, p-value: < 2.2e-16
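The figures in the regression output can be checked by hand. The following worked sketch uses only the estimates printed above; the 12-year education value is an illustrative choice, not a figure from the analysis.

\[\widehat{\text{wages}} = 4.97169 + 0.79231 \times \text{education}\]
\[t = \frac{0.79231}{0.03906} \approx 20.28, \qquad t^2 \approx 411.4 = F\]
\[\widehat{\text{wages}}(12) = 4.97169 + 0.79231 \times 12 \approx 14.48\]

With a single predictor the F-statistic is the square of the slope's t-statistic, which is why both lead to the same p-value. The Multiple R-squared of 0.09359 indicates that education accounts for roughly 9.4% of the variation in wages in this sample.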
Singh, U. (2023) Survey of Labour and Income Dynamics. Kaggle. Viewed 27 May 2023. https://www.kaggle.com/datasets/utkarshx27/survey-of-labour-and-income-dynamics?resource=download
Fox, J. (2016) Applied Regression Analysis and Generalized Linear Models, Third Edition. Sage.
Fox, J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.