The analysis begins with a data exploration phase aimed at building a foundational understanding of the dataset. The dataset is imported into the R environment with a function such as read.csv() or read.table(). Once the data are loaded, the structure is inspected with str(data), which reports the type of each variable (e.g., numeric, character, factor), the number of observations, and the general layout of the dataset. Understanding the data structure is a critical first step before any form of analysis can proceed.
data <- read.csv("D:\\Abir\\RMD & QMD Assignment Files\\MA334-SP-7_2412507.csv")
str(data)
## 'data.frame': 1181 obs. of 12 variables:
## $ age : int 29 45 39 30 42 47 62 57 21 69 ...
## $ educ : int 4 3 2 3 3 3 2 2 1 0 ...
## $ gender : int 1 1 1 0 0 1 1 0 0 1 ...
## $ hrswork: int 40 45 40 45 60 45 40 48 40 40 ...
## $ insure : int 1 1 1 1 1 1 1 1 1 0 ...
## $ metro : int 1 1 1 1 0 1 1 1 1 1 ...
## $ nchild : int 2 3 1 0 3 0 1 0 0 0 ...
## $ union : int 0 0 0 0 1 0 0 1 0 0 ...
## $ wage : num 25.9 14.4 17.2 17.1 18.3 ...
## $ race : chr "White" "White" "White" "White" ...
## $ marital: int 1 2 1 0 1 1 1 1 0 2 ...
## $ region : chr "south" "south" "midwest" "northeast" ...
summary(data)
## age educ gender hrswork
## Min. :17.00 Min. :0.000 Min. :0.000 Min. : 0.00
## 1st Qu.:32.00 1st Qu.:0.000 1st Qu.:0.000 1st Qu.:40.00
## Median :43.00 Median :2.000 Median :0.000 Median :40.00
## Mean :42.61 Mean :1.751 Mean :0.442 Mean :41.61
## 3rd Qu.:52.00 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:42.00
## Max. :77.00 Max. :5.000 Max. :1.000 Max. :80.00
## insure metro nchild union
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :1.0000 Median :0.0000 Median :0.0000
## Mean :0.8256 Mean :0.8239 Mean :0.8061 Mean :0.1372
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:2.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :9.0000 Max. :1.0000
## wage race marital region
## Min. : 2.50 Length:1181 Min. :0.0000 Length:1181
## 1st Qu.:13.00 Class :character 1st Qu.:0.0000 Class :character
## Median :18.75 Mode :character Median :1.0000 Mode :character
## Mean :22.77 Mean :0.8476
## 3rd Qu.:28.84 3rd Qu.:1.0000
## Max. :99.00 Max. :2.0000
To get an overview of the distribution and central tendencies of the variables, the summary(data) function is employed. This provides key statistical measures such as minimum, maximum, mean, median, and quartiles for numeric variables. These summary statistics are invaluable for identifying potential outliers or skewness in the data.
# Number of observations and variables
cat("Rows:", nrow(data), "| Columns:", ncol(data), "\n")
## Rows: 1181 | Columns: 12
A vital part of exploration includes identifying the number of variables and observations, often achieved with dim(data) or the nrow() and ncol() functions. Furthermore, the standard deviations of the numeric variables are calculated by applying sd() over the numeric columns with sapply(), to understand the spread of the data. Standard deviation provides insight into the variability of data points around the mean, highlighting variables that may require transformation or normalization.
# Descriptive statistics
sapply(data[sapply(data, is.numeric)], sd)
## age educ gender hrswork insure metro nchild
## 12.4022932 1.4937188 0.4968348 8.9269236 0.3796383 0.3810852 1.1010766
## union wage marital
## 0.3441745 14.1602340 0.6110325
hist(data$wage, main="Histogram of Wage", col="skyblue", xlab="Wage")
hist(data$age, main="Histogram of Age", col="lightgreen", xlab="Age")
# Boxplot by gender
boxplot(wage ~ gender, data=data, main="Wage by Gender", xlab="Gender", ylab="Wage")
To visually assess the distribution of key variables, histograms are generated for wage and age using hist(data$wage) and hist(data$age). These plots help to identify skewed distributions and other irregularities. A boxplot is created to explore the wage distribution by gender (boxplot(wage ~ gender)), offering a straightforward visual representation of median wages, interquartile ranges, and potential outliers for each gender.
# Correlation matrix for numeric variables
numeric_data <- data[sapply(data, is.numeric)]
cor_matrix <- cor(numeric_data)
print(round(cor_matrix, 2))
## age educ gender hrswork insure metro nchild union wage marital
## age 1.00 0.01 0.02 0.06 0.14 0.02 -0.05 0.05 0.21 0.39
## educ 0.01 1.00 0.11 0.12 0.23 0.13 -0.02 0.02 0.43 0.04
## gender 0.02 0.11 1.00 -0.18 0.01 0.05 -0.02 0.04 -0.14 0.00
## hrswork 0.06 0.12 -0.18 1.00 0.17 -0.01 0.07 -0.01 0.09 0.05
## insure 0.14 0.23 0.01 0.17 1.00 0.02 0.06 0.08 0.23 0.08
## metro 0.02 0.13 0.05 -0.01 0.02 1.00 -0.02 0.06 0.13 -0.05
## nchild -0.05 -0.02 -0.02 0.07 0.06 -0.02 1.00 0.03 0.02 0.17
## union 0.05 0.02 0.04 -0.01 0.08 0.06 0.03 1.00 0.05 0.03
## wage 0.21 0.43 -0.14 0.09 0.23 0.13 0.02 0.05 1.00 0.15
## marital 0.39 0.04 0.00 0.05 0.08 -0.05 0.17 0.03 0.15 1.00
Finally, a correlation matrix is computed using the cor() function for all numeric variables. This matrix helps to identify linear relationships between pairs of variables, which is particularly useful when considering predictors for regression models. Strong correlations between independent variables can signal multicollinearity issues in subsequent models.
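Pairwise correlations only catch two-variable redundancy; variance inflation factors give a more direct multicollinearity check for a fitted model. A minimal sketch, assuming the car package is installed (it is not used elsewhere in this report):
# Variance inflation factors for a model using the numeric predictors of wage
# (assumes install.packages("car") has been run)
library(car)
vif(lm(wage ~ age + educ + hrswork + nchild, data = data))
# Rule of thumb: VIF values above about 5-10 signal problematic collinearity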
This section delves into probabilistic reasoning using the dataset. The first task involves calculating the probability of at least one person being uninsured in a random sample of five individuals. Assuming independence, the number of uninsured people in the sample follows a Binomial(5, p) distribution, where p is the sample proportion uninsured, so P(at least one uninsured) = 1 - (1 - p)^5; the same answer can also be obtained by simulation. The resulting probability provides insight into public health coverage and the distribution of insurance status in the population.
# 1+ not insured
not_insured <- sum(data$insure == 0)/nrow(data)
prob_1plus_not_insured <- 1 - (1 - not_insured)^5
prob_1plus_not_insured
## [1] 0.6164927
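As a cross-check on the complement-rule calculation above, the same probability can be read off R's binomial distribution functions, or approximated by simulation under the same independence assumption; a small sketch (the seed is arbitrary):
# Closed form via the binomial CDF: P(X >= 1) where X ~ Binomial(5, p)
pbinom(0, size = 5, prob = not_insured, lower.tail = FALSE)
# Monte Carlo check under the same independence assumption
set.seed(42)  # arbitrary seed for reproducibility
mean(rbinom(10000, size = 5, prob = not_insured) >= 1)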
married <- data[data$marital == 1, ]
p_children_given_married <- sum(married$nchild >= 1) / nrow(married)
p_children_given_married
## [1] 0.6002805
nchild_dist <- table(data$nchild) / nrow(data)
nchild_dist
##
## 0 1 2 3 4 5
## 0.5605419136 0.1803556308 0.1837425910 0.0550381033 0.0127011008 0.0059271804
## 6 9
## 0.0008467401 0.0008467401
Conditional probability is computed to find the likelihood of an individual having one or more children given that they are married. This involves filtering the dataset to married individuals (marital == 1) and calculating the proportion of those with one or more children: sum(married$nchild >= 1) / nrow(married). This offers a glimpse into the demographic distribution and familial patterns within the population.
# Mean and variance
nchild_mean <- sum(as.numeric(names(nchild_dist)) * nchild_dist)
nchild_var <- sum((as.numeric(names(nchild_dist)) - nchild_mean)^2 * nchild_dist)
nchild_mean
## [1] 0.8060965
nchild_var
## [1] 1.211343
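These weighted calculations can be sanity-checked against R's built-in functions; note that var() uses the sample (n - 1) denominator, so it must be rescaled to match the population-style variance computed from the proportions:
# mean() agrees exactly with the weighted mean above
mean(data$nchild)
# var() divides by n - 1; rescale by (n - 1)/n to match the weighted variance
var(data$nchild) * (nrow(data) - 1) / nrow(data)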
The distribution of the nchild variable is evaluated by calculating the proportion of observations at each unique number of children. The frequency table is normalized to reflect proportions. Additionally, the mean and variance of nchild are computed from this distribution to assess central tendency and spread. These statistics are fundamental to understanding fertility patterns and the potential socioeconomic factors associated with them.
# P(nchild >= 3)
sum(nchild_dist[as.numeric(names(nchild_dist)) >= 3])
## [1] 0.07535986
To further explore the distribution, the probability of an individual having three or more children is estimated. This is calculated as the proportion of records where nchild >= 3. Such tail probabilities are useful in understanding extreme cases and may influence policy decisions or further study into the characteristics of larger families.
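The same tail probability can be computed in one step from the raw data, bypassing the intermediate frequency table:
# Equivalent direct computation: proportion of individuals with 3+ children
mean(data$nchild >= 3)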
Inferential statistics are used to derive population-level insights. The mean wage and its 95% confidence interval are calculated for individuals with exactly two children, here via a large-sample normal interval (mean ± 1.96 × SE); the t.test() function offers an exact t-based equivalent. The same approach is attempted for individuals with five or more children; in such cases, however, sample size becomes a critical consideration: if the subgroup is too small, the confidence interval may be unreliable or even inestimable.
# Subset for 2 children
two_kids <- subset(data, nchild == 2)
mean_2kids <- mean(two_kids$wage)
se_2kids <- sd(two_kids$wage)/sqrt(nrow(two_kids))
ci_low <- mean_2kids - 1.96 * se_2kids
ci_high <- mean_2kids + 1.96 * se_2kids
mean_2kids; ci_low; ci_high
## [1] 23.43355
## [1] 21.59181
## [1] 25.27529
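The interval above uses the normal critical value 1.96; the t.test() function mentioned earlier returns the exact t-based interval in one call, and with over 200 observations in this subgroup the two agree closely:
# Exact t-based 95% confidence interval for the same subgroup
t.test(two_kids$wage)$conf.int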
more_than_5 <- subset(data, nchild >= 5)
if (nrow(more_than_5) >= 2) {
mean_5plus <- mean(more_than_5$wage)
se_5plus <- sd(more_than_5$wage)/sqrt(nrow(more_than_5))
ci5_low <- mean_5plus - 1.96 * se_5plus
ci5_high <- mean_5plus + 1.96 * se_5plus
mean_5plus; ci5_low; ci5_high  # inside a braced block only the final value (ci5_high) is auto-printed
} else {
cat("Not enough data to calculate CI for 5+ children group.\n")
}
## [1] 17.0116
tbl <- table(data$insure, data$gender)
tbl
##
## 0 1
## 0 117 89
## 1 542 433
chisq.test(tbl)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tbl
## X-squared = 0.0574, df = 1, p-value = 0.8107
To assess relationships between categorical variables, a contingency table is constructed to explore the association between insurance status (insure) and gender. The table() function summarises the counts, and a Chi-squared test (chisq.test()) tests the hypothesis of independence. A p-value below 0.05 would suggest a statistically significant association between the two variables; here the p-value is 0.81, so the test gives no evidence of a gender-based disparity in insurance coverage.
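Raw counts can obscure the comparison of interest; converting the table to column proportions shows the uninsured and insured shares within each gender directly (gender is coded 0/1, as in the structure listing above):
# Insurance status shares within each gender (columns sum to 1)
prop.table(tbl, margin = 2)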
data$logwage <- log(data$wage)
Before modeling begins, a new variable logwage is created by applying the natural logarithm transformation to the wage variable. This transformation often stabilizes variance and brings skewed distributions closer to normality, which is desirable for linear regression assumptions.
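A quick side-by-side look at the raw and transformed distributions makes the effect of the log transformation visible; a minimal sketch using base graphics:
# Compare raw and log-transformed wage distributions
par(mfrow = c(1, 2))  # two panels side by side
hist(data$wage, main = "Wage", xlab = "Wage")
hist(data$logwage, main = "log(Wage)", xlab = "log(Wage)")
par(mfrow = c(1, 1))  # reset the plotting layout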
# Young <35
young <- subset(data, age < 35)
old <- subset(data, age >= 35)
The dataset is split into two subsets: “young” (age < 35) and “old” (age >= 35). This bifurcation allows for age-stratified analysis, enabling the examination of different trends across life stages.
# Models
lm_young <- lm(logwage ~ age, data=young)
lm_old <- lm(logwage ~ age, data=old)
summary(lm_young)
##
## Call:
## lm(formula = logwage ~ age, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63005 -0.32110 -0.01201 0.31821 1.49042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.594555 0.173214 9.206 < 2e-16 ***
## age 0.041382 0.006074 6.813 3.85e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared: 0.1104, Adjusted R-squared: 0.108
## F-statistic: 46.41 on 1 and 374 DF, p-value: 3.846e-11
summary(lm_old)
##
## Call:
## lm(formula = logwage ~ age, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.91172 -0.39124 -0.04711 0.39679 1.54456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0795566 0.1157775 26.599 <2e-16 ***
## age -0.0005273 0.0023115 -0.228 0.82
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared: 6.479e-05, Adjusted R-squared: -0.00118
## F-statistic: 0.05203 on 1 and 803 DF, p-value: 0.8196
For both groups, simple linear regression models are built using lm(logwage ~ age). The model summaries include key statistical indicators such as coefficients, standard errors, t-values, p-values, and R-squared values. These outputs are instrumental in interpreting the strength and direction of the relationship between age and log-transformed wage (Wickham and Grolemund, 2023): age is a strong positive predictor for the under-35 group (R-squared ≈ 0.11) but essentially unrelated to logwage in the 35-and-over group.
# Plots
plot(young$age, young$logwage, main="Young: log(wage) vs age", pch=19)
abline(lm_young, col="blue")
plot(old$age, old$logwage, main="Old: log(wage) vs age", pch=19)
abline(lm_old, col="red")
Scatter plots with regression lines are generated using plot() and abline(), or ggplot2 for enhanced visuals. These plots visually depict how logwage varies with age in each age group, helping to identify trends and any potential deviations from linearity (Kerns, 2021).
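Since ggplot2 is mentioned as an alternative, a minimal sketch of the equivalent plot for the young group (assuming the ggplot2 package is installed):
library(ggplot2)
ggplot(young, aes(x = age, y = logwage)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +  # least-squares line with confidence band
  labs(title = "Young: log(wage) vs age", x = "Age", y = "log(wage)")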
data$gender <- as.factor(data$gender)
data$marital <- as.factor(data$marital)
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)
data$insure <- as.factor(data$insure)
data$metro <- as.factor(data$metro)
data$union <- as.factor(data$union)
Multiple linear regression is performed to account for several predictors simultaneously. Categorical variables such as gender, marital status, race, region, insurance, metro status, and union membership are converted into factors using the as.factor() function. Proper encoding of categorical variables is essential for model interpretability and accuracy (Field, Miles and Field, 2022).
# Recreate young/old with updated factors
young <- subset(data, age < 35)
old <- subset(data, age >= 35)
After the transformation, the “young” and “old” subsets are recreated so that they inherit the factor variables. Full multiple linear regression models are then built for each group using the formula lm(logwage ~ . - wage - logwage), which includes all remaining variables as predictors of logwage. (The - logwage term is redundant, since R excludes the response from the . shorthand automatically, but - wage is essential to avoid regressing logwage on its own untransformed version.)
# Full models
full_young <- lm(logwage ~ . -wage -logwage, data=young)
full_old <- lm(logwage ~ . -wage -logwage, data=old)
summary(full_young)
##
## Call:
## lm(formula = logwage ~ . - wage - logwage, data = young)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.36327 -0.26460 -0.01495 0.25087 1.30268
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.829281 0.209628 8.726 < 2e-16 ***
## age 0.028609 0.006349 4.506 8.94e-06 ***
## educ 0.120281 0.017682 6.802 4.30e-11 ***
## gender1 -0.191012 0.048228 -3.961 9.01e-05 ***
## hrswork -0.003271 0.002372 -1.379 0.1687
## insure1 0.223328 0.053246 4.194 3.45e-05 ***
## metro1 0.011471 0.058263 0.197 0.8440
## nchild -0.029353 0.027431 -1.070 0.2853
## union1 0.159148 0.073383 2.169 0.0308 *
## raceBlack -0.169375 0.119340 -1.419 0.1567
## raceWhite -0.100931 0.089301 -1.130 0.2591
## marital1 0.067049 0.056590 1.185 0.2369
## marital2 0.076138 0.109491 0.695 0.4873
## regionnortheast 0.116110 0.067130 1.730 0.0846 .
## regionsouth 0.012578 0.059081 0.213 0.8315
## regionwest 0.049319 0.065182 0.757 0.4498
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.423 on 360 degrees of freedom
## Multiple R-squared: 0.3391, Adjusted R-squared: 0.3116
## F-statistic: 12.32 on 15 and 360 DF, p-value: < 2.2e-16
summary(full_old)
##
## Call:
## lm(formula = logwage ~ . - wage - logwage, data = old)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.86706 -0.31416 0.01805 0.33028 1.30901
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2617481 0.1800956 12.559 < 2e-16 ***
## age -0.0000705 0.0021630 -0.033 0.97401
## educ 0.1548712 0.0119607 12.948 < 2e-16 ***
## gender1 -0.1775684 0.0357558 -4.966 8.37e-07 ***
## hrswork 0.0015607 0.0021517 0.725 0.46846
## insure1 0.2394203 0.0534284 4.481 8.52e-06 ***
## metro1 0.1437671 0.0472333 3.044 0.00241 **
## nchild -0.0234799 0.0173146 -1.356 0.17546
## union1 0.0443485 0.0489137 0.907 0.36486
## raceBlack -0.0001487 0.1018183 -0.001 0.99883
## raceWhite 0.0882837 0.0832936 1.060 0.28951
## marital1 0.1001091 0.0539010 1.857 0.06364 .
## marital2 0.1143115 0.0643458 1.777 0.07603 .
## regionnortheast 0.0554189 0.0533591 1.039 0.29931
## regionsouth 0.0461424 0.0466314 0.990 0.32272
## regionwest 0.1329489 0.0506362 2.626 0.00882 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.49 on 789 degrees of freedom
## Multiple R-squared: 0.277, Adjusted R-squared: 0.2632
## F-statistic: 20.15 on 15 and 789 DF, p-value: < 2.2e-16
Model summaries provide an in-depth view of the contribution of each predictor. R-squared values are compared with those from the simple regression models. While multiple regression generally improves model fit by reducing residual error, it also increases the risk of overfitting, especially when the number of predictors is large relative to the sample size.
# Compare to simple models
cat("Young R2:", summary(lm_young)$r.squared, "\n")
## Young R2: 0.1103928
cat("Old R2:", summary(lm_old)$r.squared, "\n")
## Old R2: 6.479289e-05
Although the multiple regression models yield higher R-squared values (0.34 versus 0.11 for the young group; 0.28 versus effectively zero for the old group), this can be misleading. Overfitting occurs when the model captures noise rather than the true underlying relationships (James et al., 2021). Therefore, model diagnostics such as adjusted R-squared, AIC, or cross-validation scores should be considered in further analysis to assess model generalisability.
# Model refinement note
cat("Full models include all predictors, which may lead to overfitting. A reduced model is often more interpretable and generalisable.\n")
## Full models include all predictors, which may lead to overfitting. A reduced model is often more interpretable and generalisable.
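One concrete way to act on this note is to compare models on AIC and let a backward stepwise search drop uninformative predictors; a sketch using base R's AIC() and step() (the selected model depends on the data, so the result is a starting point for refinement rather than a definitive specification):
# Compare simple and full models on AIC (lower is better)
AIC(lm_young, full_young)
AIC(lm_old, full_old)
# Backward stepwise selection starting from the full models
reduced_young <- step(full_young, direction = "backward", trace = 0)
reduced_old <- step(full_old, direction = "backward", trace = 0)
summary(reduced_young)$adj.r.squared
summary(reduced_old)$adj.r.squared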
This comprehensive analysis navigated the stages of statistical investigation from data exploration to regression modeling. Summary statistics and visualizations offered foundational insights, while probability assessments provided demographic inferences (Peng, 2023). Point estimates and hypothesis testing revealed relationships among categorical and continuous variables. Finally, regression models, both simple and multiple, enabled predictive insights, albeit with a cautionary note on overfitting. This structured approach lays a robust groundwork for deeper, domain-specific interpretation or policy recommendations.
Field, A., Miles, J. and Field, Z., 2022. Discovering Statistics Using R. 2nd ed. London: SAGE Publications.
James, G., Witten, D., Hastie, T. and Tibshirani, R., 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. New York: Springer.
Kerns, G.J., 2021. Introduction to Probability and Statistics Using R. 2nd ed.
Peng, R.D., 2023. R Programming for Data Science. 2nd ed. Baltimore: Leanpub.
Wickham, H. and Grolemund, G., 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol: O’Reilly Media.