1. Data Exploration

Load data

The analysis begins with a comprehensive data exploration phase aimed at acquiring a foundational understanding of the dataset. This involves importing the dataset into the R environment using appropriate functions such as read.csv() or read.table(). Once the data is loaded, its structural properties are inspected using the str(data) function. This command allows the analyst to determine the types of variables (e.g., numeric, character, factor), the number of observations, and the general layout of the dataset. Understanding the data structure is a critical first step before any form of analysis can proceed.

data <- read.csv("D:\\Abir\\RMD & QMD Assignment Files\\MA334-SP-7_2412507.csv")

Basic structure

str(data)
## 'data.frame':    1181 obs. of  12 variables:
##  $ age    : int  29 45 39 30 42 47 62 57 21 69 ...
##  $ educ   : int  4 3 2 3 3 3 2 2 1 0 ...
##  $ gender : int  1 1 1 0 0 1 1 0 0 1 ...
##  $ hrswork: int  40 45 40 45 60 45 40 48 40 40 ...
##  $ insure : int  1 1 1 1 1 1 1 1 1 0 ...
##  $ metro  : int  1 1 1 1 0 1 1 1 1 1 ...
##  $ nchild : int  2 3 1 0 3 0 1 0 0 0 ...
##  $ union  : int  0 0 0 0 1 0 0 1 0 0 ...
##  $ wage   : num  25.9 14.4 17.2 17.1 18.3 ...
##  $ race   : chr  "White" "White" "White" "White" ...
##  $ marital: int  1 2 1 0 1 1 1 1 0 2 ...
##  $ region : chr  "south" "south" "midwest" "northeast" ...
summary(data)
##       age             educ           gender         hrswork     
##  Min.   :17.00   Min.   :0.000   Min.   :0.000   Min.   : 0.00  
##  1st Qu.:32.00   1st Qu.:0.000   1st Qu.:0.000   1st Qu.:40.00  
##  Median :43.00   Median :2.000   Median :0.000   Median :40.00  
##  Mean   :42.61   Mean   :1.751   Mean   :0.442   Mean   :41.61  
##  3rd Qu.:52.00   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.:42.00  
##  Max.   :77.00   Max.   :5.000   Max.   :1.000   Max.   :80.00  
##      insure           metro            nchild           union       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :1.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.8256   Mean   :0.8239   Mean   :0.8061   Mean   :0.1372  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :9.0000   Max.   :1.0000  
##       wage           race              marital          region         
##  Min.   : 2.50   Length:1181        Min.   :0.0000   Length:1181       
##  1st Qu.:13.00   Class :character   1st Qu.:0.0000   Class :character  
##  Median :18.75   Mode  :character   Median :1.0000   Mode  :character  
##  Mean   :22.77                      Mean   :0.8476                     
##  3rd Qu.:28.84                      3rd Qu.:1.0000                     
##  Max.   :99.00                      Max.   :2.0000

To get an overview of the distribution and central tendencies of the variables, the summary(data) function is employed. This provides key statistical measures such as minimum, maximum, mean, median, and quartiles for numeric variables. These summary statistics are invaluable for identifying potential outliers or skewness in the data. # Number of observations and variables

cat("Rows:", nrow(data), "| Columns:", ncol(data), "\n")
## Rows: 1181 | Columns: 12

A vital part of exploration includes identifying the number of variables and observations, often achieved with dim(data) or ncol() and nrow() functions. Furthermore, the standard deviations for all numeric variables are calculated using sapply(data, sd) or similar functions to understand the spread of the data. Standard deviation provides insights into the variability of data points around the mean, highlighting variables that may require transformation or normalization. # Descriptive statistics

sapply(data[sapply(data, is.numeric)], sd)
##        age       educ     gender    hrswork     insure      metro     nchild 
## 12.4022932  1.4937188  0.4968348  8.9269236  0.3796383  0.3810852  1.1010766 
##      union       wage    marital 
##  0.3441745 14.1602340  0.6110325

Histograms for numeric variables

hist(data$wage, main="Histogram of Wage", col="skyblue", xlab="Wage")

hist(data$age, main="Histogram of Age", col="lightgreen", xlab="Age")

# Boxplot by gender

boxplot(wage ~ gender, data=data, main="Wage by Gender", xlab="Gender", ylab="Wage")

To visually assess the distribution of key variables, histograms are generated for wage and age using functions like hist(data\(wage) and hist(data\)age). These plots help to identify skewed distributions and other irregularities. A boxplot is created to explore wage distribution by gender (boxplot(wage ~ gender)), offering a straightforward visual representation of median wages, interquartile ranges, and potential outliers for each gender. # Correlation matrix for numeric variables

numeric_data <- data[sapply(data, is.numeric)]
cor_matrix <- cor(numeric_data)
print(round(cor_matrix, 2))
##           age  educ gender hrswork insure metro nchild union  wage marital
## age      1.00  0.01   0.02    0.06   0.14  0.02  -0.05  0.05  0.21    0.39
## educ     0.01  1.00   0.11    0.12   0.23  0.13  -0.02  0.02  0.43    0.04
## gender   0.02  0.11   1.00   -0.18   0.01  0.05  -0.02  0.04 -0.14    0.00
## hrswork  0.06  0.12  -0.18    1.00   0.17 -0.01   0.07 -0.01  0.09    0.05
## insure   0.14  0.23   0.01    0.17   1.00  0.02   0.06  0.08  0.23    0.08
## metro    0.02  0.13   0.05   -0.01   0.02  1.00  -0.02  0.06  0.13   -0.05
## nchild  -0.05 -0.02  -0.02    0.07   0.06 -0.02   1.00  0.03  0.02    0.17
## union    0.05  0.02   0.04   -0.01   0.08  0.06   0.03  1.00  0.05    0.03
## wage     0.21  0.43  -0.14    0.09   0.23  0.13   0.02  0.05  1.00    0.15
## marital  0.39  0.04   0.00    0.05   0.08 -0.05   0.17  0.03  0.15    1.00

Finally, a correlation matrix is computed using the cor() function for all numeric variables. This matrix helps to identify linear relationships between pairs of variables, which is particularly useful when considering predictors for regression models. Strong correlations between independent variables can signal multicollinearity issues in subsequent models.

2. Probability and Distributions

This section delves into probabilistic reasoning using the dataset. The first task involves calculating the probability of at least one person being uninsured in a random sample of five individuals. Assuming independence, this is done using the binomial theorem or simulation-based approaches. The probability provides insights into public health coverage and the distribution of insurance status in the population ## 1+ not insured

not_insured <- sum(data$insure == 0)/nrow(data)
prob_1plus_not_insured <- 1 - (1 - not_insured)^5
prob_1plus_not_insured
## [1] 0.6164927

P(nchild >=1 | married)

married <- data[data$marital == 1, ]
p_children_given_married <- sum(married$nchild >= 1) / nrow(married)
p_children_given_married
## [1] 0.6002805

Distribution of nchild

nchild_dist <- table(data$nchild) / nrow(data)
nchild_dist
## 
##            0            1            2            3            4            5 
## 0.5605419136 0.1803556308 0.1837425910 0.0550381033 0.0127011008 0.0059271804 
##            6            9 
## 0.0008467401 0.0008467401

Conditional probability is computed to find the likelihood of an individual having one or more children given that they are married. This involves filtering the dataset to include only married individuals and calculating the proportion of those with one or more children (sum(nchild > 0 & marital == ‘married’) / sum(marital == ‘married’)). This offers a glimpse into the demographic distribution and familial patterns within the population. # Mean and variance

nchild_mean <- sum(as.numeric(names(nchild_dist)) * nchild_dist)
nchild_var <- sum((as.numeric(names(nchild_dist)) - nchild_mean)^2 * nchild_dist)
nchild_mean
## [1] 0.8060965
nchild_var
## [1] 1.211343

The distribution of the nchild variable is evaluated by calculating the proportion of observations corresponding to each unique value of children. The frequency table generated is normalized to reflect proportions. Additionally, the mean and variance of nchild are computed to assess central tendency and spread. These statistics are fundamental in understanding fertility patterns and the potential socioeconomic factors associated with them. # P(nchild >= 3)

sum(nchild_dist[as.numeric(names(nchild_dist)) >= 3])
## [1] 0.07535986

To further explore the distribution, the probability of an individual having three or more children is estimated. This is calculated as the proportion of records where nchild >= 3. Such tail probabilities are useful in understanding extreme cases and may influence policy decisions or further study into the characteristics of larger families.

3. Point Estimates & Hypothesis Tests

Inferential statistics are used to derive population-level insights. The mean wage and its 95% confidence interval are calculated for individuals with exactly two children using the t.test() function. This provides an estimate of the expected wage within a specified level of confidence. The same approach is attempted for individuals with five or more children; however, in such cases, sample size becomes a critical consideration. If the subgroup is too small, the confidence interval may be unreliable or even inestimable. ## Subset for 2 children

two_kids <- subset(data, nchild == 2)
mean_2kids <- mean(two_kids$wage)
se_2kids <- sd(two_kids$wage)/sqrt(nrow(two_kids))
ci_low <- mean_2kids - 1.96 * se_2kids
ci_high <- mean_2kids + 1.96 * se_2kids
mean_2kids; ci_low; ci_high
## [1] 23.43355
## [1] 21.59181
## [1] 25.27529

For 5+ children

more_than_5 <- subset(data, nchild >= 5)
if (nrow(more_than_5) >= 2) {
  mean_5plus <- mean(more_than_5$wage)
  se_5plus <- sd(more_than_5$wage)/sqrt(nrow(more_than_5))
  ci5_low <- mean_5plus - 1.96 * se_5plus
  ci5_high <- mean_5plus + 1.96 * se_5plus
  mean_5plus; ci5_low; ci5_high
} else {
  cat("Not enough data to calculate CI for 5+ children group.\n")
}
## [1] 17.0116

Contingency table and chi-squared test

tbl <- table(data$insure, data$gender)
tbl
##    
##       0   1
##   0 117  89
##   1 542 433
chisq.test(tbl)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tbl
## X-squared = 0.0574, df = 1, p-value = 0.8107

To assess relationships between categorical variables, a contingency table is constructed to explore the association between insurance status (insure) and gender. The table() function is used to summarize the counts, and a Chi-squared test (chisq.test()) is performed to test the hypothesis of independence. If the p-value is below 0.05, it suggests a statistically significant association between the two variables. This test is useful for understanding potential gender-based disparities in insurance coverage.

4. Simple Linear Regression

Create log(wage)

data$logwage <- log(data$wage)

Before modeling begins, a new variable logwage is created by applying the natural logarithm transformation to the wage variable. This transformation often stabilizes variance and brings skewed distributions closer to normality, which is desirable for linear regression assumptions. # Young <35

young <- subset(data, age < 35)
old <- subset(data, age >= 35)

The dataset is split into two subsets: “young” (age < 35) and “old” (age >= 35). This bifurcation allows for age-stratified analysis, enabling the examination of different trends across life stages. # Models

lm_young <- lm(logwage ~ age, data=young)
lm_old <- lm(logwage ~ age, data=old)

summary(lm_young)
## 
## Call:
## lm(formula = logwage ~ age, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.63005 -0.32110 -0.01201  0.31821  1.49042 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.594555   0.173214   9.206  < 2e-16 ***
## age         0.041382   0.006074   6.813 3.85e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4816 on 374 degrees of freedom
## Multiple R-squared:  0.1104, Adjusted R-squared:  0.108 
## F-statistic: 46.41 on 1 and 374 DF,  p-value: 3.846e-11
summary(lm_old)
## 
## Call:
## lm(formula = logwage ~ age, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.91172 -0.39124 -0.04711  0.39679  1.54456 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.0795566  0.1157775  26.599   <2e-16 ***
## age         -0.0005273  0.0023115  -0.228     0.82    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5712 on 803 degrees of freedom
## Multiple R-squared:  6.479e-05,  Adjusted R-squared:  -0.00118 
## F-statistic: 0.05203 on 1 and 803 DF,  p-value: 0.8196

For both groups, simple linear regression models are built using lm(logwage ~ age). The model summaries include key statistical indicators like coefficients, standard errors, t-values, p-values, and R-squared values. These outputs are instrumental in interpreting the strength and direction of the relationship between age and log-transformed wage (Wickham and Grolemund, 2023). # Plots

plot(young$age, young$logwage, main="Young: log(wage) vs age", pch=19)
abline(lm_young, col="blue")

plot(old$age, old$logwage, main="Old: log(wage) vs age", pch=19)
abline(lm_old, col="red")

Scatter plots with regression lines are generated using plot() and abline() or ggplot2 for enhanced visuals. These plots visually depict how logwage varies with age in each age group, helping to identify trends and any potential deviations from linearity (Kerns, 2021).

5. Multiple Linear Regression

Convert to factor variables where needed

data$gender <- as.factor(data$gender)
data$marital <- as.factor(data$marital)
data$race <- as.factor(data$race)
data$region <- as.factor(data$region)
data$insure <- as.factor(data$insure)
data$metro <- as.factor(data$metro)
data$union <- as.factor(data$union)

Multiple linear regression is performed to account for several predictors simultaneously. Categorical variables such as gender, marital status, race, region, insurance, metro status, and union membership are converted into factors using the as.factor() function. Proper encoding of categorical variables is essential for model interpretability and accuracy (Field, Miles and Field, 2022). # Recreate young/old with updated factors

young <- subset(data, age < 35)
old <- subset(data, age >= 35)

After the transformation, the “young” and “old” datasets are updated to include these factor variables. Full multiple linear regression models are then built for each group using the formula lm(logwage ~ . - wage - logwage). This approach includes all remaining variables as potential predictors of logwage. # Full models

full_young <- lm(logwage ~ . -wage -logwage, data=young)
full_old <- lm(logwage ~ . -wage -logwage, data=old)

summary(full_young)
## 
## Call:
## lm(formula = logwage ~ . - wage - logwage, data = young)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.36327 -0.26460 -0.01495  0.25087  1.30268 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.829281   0.209628   8.726  < 2e-16 ***
## age              0.028609   0.006349   4.506 8.94e-06 ***
## educ             0.120281   0.017682   6.802 4.30e-11 ***
## gender1         -0.191012   0.048228  -3.961 9.01e-05 ***
## hrswork         -0.003271   0.002372  -1.379   0.1687    
## insure1          0.223328   0.053246   4.194 3.45e-05 ***
## metro1           0.011471   0.058263   0.197   0.8440    
## nchild          -0.029353   0.027431  -1.070   0.2853    
## union1           0.159148   0.073383   2.169   0.0308 *  
## raceBlack       -0.169375   0.119340  -1.419   0.1567    
## raceWhite       -0.100931   0.089301  -1.130   0.2591    
## marital1         0.067049   0.056590   1.185   0.2369    
## marital2         0.076138   0.109491   0.695   0.4873    
## regionnortheast  0.116110   0.067130   1.730   0.0846 .  
## regionsouth      0.012578   0.059081   0.213   0.8315    
## regionwest       0.049319   0.065182   0.757   0.4498    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.423 on 360 degrees of freedom
## Multiple R-squared:  0.3391, Adjusted R-squared:  0.3116 
## F-statistic: 12.32 on 15 and 360 DF,  p-value: < 2.2e-16
summary(full_old)
## 
## Call:
## lm(formula = logwage ~ . - wage - logwage, data = old)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.86706 -0.31416  0.01805  0.33028  1.30901 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.2617481  0.1800956  12.559  < 2e-16 ***
## age             -0.0000705  0.0021630  -0.033  0.97401    
## educ             0.1548712  0.0119607  12.948  < 2e-16 ***
## gender1         -0.1775684  0.0357558  -4.966 8.37e-07 ***
## hrswork          0.0015607  0.0021517   0.725  0.46846    
## insure1          0.2394203  0.0534284   4.481 8.52e-06 ***
## metro1           0.1437671  0.0472333   3.044  0.00241 ** 
## nchild          -0.0234799  0.0173146  -1.356  0.17546    
## union1           0.0443485  0.0489137   0.907  0.36486    
## raceBlack       -0.0001487  0.1018183  -0.001  0.99883    
## raceWhite        0.0882837  0.0832936   1.060  0.28951    
## marital1         0.1001091  0.0539010   1.857  0.06364 .  
## marital2         0.1143115  0.0643458   1.777  0.07603 .  
## regionnortheast  0.0554189  0.0533591   1.039  0.29931    
## regionsouth      0.0461424  0.0466314   0.990  0.32272    
## regionwest       0.1329489  0.0506362   2.626  0.00882 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.49 on 789 degrees of freedom
## Multiple R-squared:  0.277,  Adjusted R-squared:  0.2632 
## F-statistic: 20.15 on 15 and 789 DF,  p-value: < 2.2e-16

Model summaries provide an in-depth view of the contributions of each predictor. R-squared values are compared with those from the simple regression models. While multiple regression generally improves model fit by reducing residual error, it also increases the risk of overfitting, especially when the number of predictors is large relative to sample size. # Compare to simple models

cat("Young R2:", summary(lm_young)$r.squared, "\n")
## Young R2: 0.1103928
cat("Old R2:", summary(lm_old)$r.squared, "\n")
## Old R2: 6.479289e-05

Although multiple regression models may yield higher R-squared values, this can be misleading. Overfitting occurs when the model captures noise rather than the true underlying relationships (James et al., 2021). Therefore, model diagnostics such as adjusted R-squared, AIC, or cross-validation scores should be considered in further analysis to assess model generalizability. # Model refinement note

cat("Full models include all predictors, which may lead to overfitting. A reduced model is often more interpretable and generalisable.\n")
## Full models include all predictors, which may lead to overfitting. A reduced model is often more interpretable and generalisable.

Conclusion

This comprehensive analysis navigated through various stages of statistical investigation, from data exploration to advanced regression modeling. Summary statistics and visualizations offered foundational insights, while probability assessments provided demographic inferences(Peng, 2023). Point estimates and hypothesis testing revealed relationships among categorical and continuous variables. Finally, regression models—both simple and multiple—enabled predictive insights, albeit with a cautionary note on overfitting. This structured approach lays a robust groundwork for deeper, domain-specific interpretation or policy recommendation.

References

Field, A., Miles, J. and Field, Z., 2022. Discovering Statistics Using R. 2nd ed. London: SAGE Publications.

James, G., Witten, D., Hastie, T. and Tibshirani, R., 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. New York: Springer.

Kerns, G.J., 2021. Introduction to Probability and Statistics Using R. 2nd ed.

Peng, R.D., 2023. R Programming for Data Science. 2nd ed. Baltimore: Leanpub.

Wickham, H. and Grolemund, G., 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol: O’Reilly Media.