Week 8 Data Dive

Load the dataset

# Load the dataset
adult <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_income_data.csv")
summary(adult)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:16281       Min.   :  13492   Length:16281      
##  1st Qu.:28.00   Class :character   1st Qu.: 116736   Class :character  
##  Median :37.00   Mode  :character   Median : 177831   Mode  :character  
##  Mean   :38.77                      Mean   : 189436                     
##  3rd Qu.:48.00                      3rd Qu.: 238384                     
##  Max.   :90.00                      Max.   :1490400                     
##      edunum      maritalstatus       occupation        relationship      
##  Min.   : 1.00   Length:16281       Length:16281       Length:16281      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.07                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capitalgain     capitalloss    
##  Length:16281       Length:16281       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1082   Mean   :  87.9  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :3770.0  
##   hoursperweek   nativecountry         income         
##  Min.   : 1.00   Length:16281       Length:16281      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.39                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

Loading required libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ✔ readr     2.1.4

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Response Variable

Age is often an important factor in many sociodemographic analyses, and it’s a continuous variable that can provide valuable insights into the dataset. The initial choice of age as the response variable will help us understand its relationships with other factors in the dataset.

response_variable <- adult$age

Explanatory Variable

Education is a categorical variable that may be associated with age. We might expect individuals with higher education levels to be older, on average, than those with lower education levels. This choice lets us test whether mean age differs significantly across education levels.

explanatory_variable <- adult$education

Devise Null Hypothesis

Null Hypothesis: Mean age is equal across all education levels. The alternative is that at least one education level has a different mean age.

ANOVA Test

anova_result <- aov(response_variable ~ explanatory_variable, data=adult)

summary(anova_result)

##                         Df  Sum Sq Mean Sq F value Pr(>F)    
## explanatory_variable    15  196524   13102   72.83 <2e-16 ***
## Residuals            16265 2925980     180                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (< 2e-16) is far below 0.05, so we reject the null hypothesis.

F value: The F-statistic is the ratio of the mean square explained by “explanatory_variable” (variation between education levels) to the residual mean square (variation within education levels). It is the test statistic for the analysis of variance. In this case, the F-statistic is approximately 72.83.

Pr(>F): This is the p-value associated with the F-statistic. It is the probability of obtaining an F-statistic at least as large as the one observed, assuming that the null hypothesis (equal mean age across education levels) is true.

Here the F-statistic is very high (approximately 72.83) and the associated p-value is extremely low (< 2e-16), so mean age differs significantly across education levels. The test does not say which levels differ or by how much.
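As a quick check, the F value and its p-value can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA table. This is a minimal sketch: the inputs are the rounded figures from the table, so the result should match the reported F only to about two decimal places.

```r
# Values copied from the printed ANOVA table (rounded by summary())
ss_between  <- 196524    # Sum Sq for explanatory_variable
df_between  <- 15        # 16 education levels - 1
ss_residual <- 2925980   # Sum Sq for Residuals
df_residual <- 16265

# Mean squares: sum of squares divided by degrees of freedom
ms_between  <- ss_between / df_between
ms_residual <- ss_residual / df_residual

# F is the ratio of the two mean squares
f_value <- ms_between / ms_residual

# Upper-tail probability of the F distribution gives Pr(>F)
p_value <- pf(f_value, df_between, df_residual, lower.tail = FALSE)

round(f_value, 2)
p_value < 2e-16
```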

Another Continuous Variable

Looking at the data, the ‘hours per week’ column seems like a potential continuous predictor of age.

ggplot(adult, aes(x=hoursperweek, y=age)) + 
  geom_point() +
  geom_smooth(method='lm', color= "red")

## `geom_smooth()` using formula = 'y ~ x'

The fitted line slopes slightly upward, but the points are widely scattered around it, so any linear relationship between age and hoursperweek is weak.

another_variable <- adult$hoursperweek

Linear Regression Model

Building a linear regression model using the selected continuous variable to predict the response variable (age).

lm_model <- lm(response_variable ~ another_variable, data=adult)

summary(lm_model)

## 
## Call:
## lm(formula = response_variable ~ another_variable, data = adult)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.155 -11.102  -1.734   8.838  54.088 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.313238   0.366623  96.320   <2e-16 ***
## another_variable  0.085517   0.008672   9.861   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.81 on 16279 degrees of freedom
## Multiple R-squared:  0.005938,   Adjusted R-squared:  0.005877 
## F-statistic: 97.24 on 1 and 16279 DF,  p-value: < 2.2e-16

The intercept of 35.313238 is the estimated age when “another_variable” (hours per week) is zero, about 35.31 years. No one in the data works zero hours (the minimum is 1), so the intercept is an extrapolation.

The F-statistic is very high, indicating that the model as a whole is significant, even though the effect size is small.

The coefficient for “another_variable” (hours per week) is 0.085517. It is positive, so each additional hour worked per week is associated with an increase of about 0.086 years in the estimated age.

In summary, the model suggests that hours per week has a statistically significant, but small, positive association with age. However, the model explains only about 0.6% of the variability in age (R-squared = 0.005938), so factors not included in the model account for most of it.
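To make the slope concrete, the fitted line can be evaluated by hand at two nearby values of hours per week. The sketch below copies the coefficients from the summary output; predict_age is a hypothetical helper introduced here for illustration and is not part of the original analysis.

```r
# Coefficients copied from the summary(lm_model) output
intercept <- 35.313238
slope     <- 0.085517   # change in estimated age per extra hour per week

# Hypothetical helper: estimated age at a given hours-per-week value
predict_age <- function(hours) intercept + slope * hours

age_40 <- predict_age(40)
age_41 <- predict_age(41)

age_40
age_41 - age_40   # one extra hour changes the estimate by exactly the slope
```

With the fitted model object available, predict(lm_model, newdata = ...) would give the same estimates without copying coefficients by hand.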

par(mfrow = c(2, 2))  
plot(lm_model, which = 1)
plot(lm_model, which = 2)
plot(lm_model, which = 3)
plot(lm_model, which = 4)

The diagnostic plots do not indicate any major issues with the model assumptions or fit. The linear model seems reasonably well-specified for the hoursperweek predictor.

f_test <- summary(lm_model)
cat("Overall model F-test:\n")

## Overall model F-test:

# fstatistic holds only value, numdf and dendf, so compute the p-value with pf()
f_stat <- f_test$fstatistic
p_value <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE)
cat("F-statistic =", f_stat["value"], ", p-value =", format.pval(p_value), "\n")

## F-statistic = 97.24177 , p-value = < 2.22e-16

F-Statistic: The F-statistic is a test statistic that measures the overall significance of the linear regression model. It is calculated by comparing the explained variance (variance due to the regression model) to the unexplained variance (residual variance).

- In the output, the F-statistic is approximately 97.24.

P-Value: The p-value associated with the F-statistic tells you whether the regression model, as a whole, is statistically significant. A low p-value indicates that the model is significant, while a high p-value suggests that the model is not statistically significant.

- In the output, the p-value is below 2.22e-16, in line with the "p-value: < 2.2e-16" reported by summary(lm_model).

Since the p-value is far below 0.05, the model as a whole is statistically significant. Its predictive power is still weak, because R-squared is only about 0.006, so it is worth considering other predictors or models.
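The fstatistic element returned by summary() on an lm fit is a named vector of three numbers (the F value and its two degrees of freedom), so indexing a fourth element yields NA. A minimal sketch on R's built-in mtcars data, used here only as a stand-in for the adult dataset, shows this and how the p-value can be computed with pf():

```r
# mtcars ships with base R and is used here only as a stand-in dataset
fit <- lm(mpg ~ wt, data = mtcars)
f_stat <- summary(fit)$fstatistic

names(f_stat)   # "value" "numdf" "dendf"
f_stat[4]       # NA, because there is no fourth element

# Upper-tail probability of the F distribution is the overall p-value
p_value <- pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"], lower.tail = FALSE)
unname(p_value)
```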

Adding education as a second predictor

# Build a linear regression model with two predictors
lm_model_2 <- lm(response_variable ~ another_variable + education, data=adult)

# Summary of the model
summary(lm_model_2)

## 
## Call:
## lm(formula = response_variable ~ another_variable + education, 
##     data = adult)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.657 -11.036  -1.946   8.683  56.111 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            36.795955   0.702175  52.403  < 2e-16 ***
## another_variable        0.056299   0.008563   6.575 5.02e-11 ***
## education 11th         -6.962949   0.822044  -8.470  < 2e-16 ***
## education 12th         -6.051325   1.093105  -5.536 3.14e-08 ***
## education 1st-4th       8.657561   1.632600   5.303 1.15e-07 ***
## education 5th-6th       5.919083   1.188820   4.979 6.46e-07 ***
## education 7th-8th      12.834199   0.987066  13.002  < 2e-16 ***
## education 9th           1.572880   1.065491   1.476    0.140    
## education Assoc-acdm   -0.560405   0.854998  -0.655    0.512    
## education Assoc-voc    -0.296960   0.812097  -0.366    0.715    
## education Bachelors    -0.293791   0.680292  -0.432    0.666    
## education Doctorate     7.769842   1.179207   6.589 4.56e-11 ***
## education HS-grad       0.179948   0.654655   0.275    0.783    
## education Masters       4.705028   0.767105   6.133 8.80e-10 ***
## education Preschool     2.465159   2.449619   1.006    0.314    
## education Prof-school   6.564208   1.047821   6.265 3.83e-10 ***
## education Some-college -3.469566   0.666191  -5.208 1.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.4 on 16264 degrees of freedom
## Multiple R-squared:  0.06542,    Adjusted R-squared:  0.0645 
## F-statistic: 71.16 on 16 and 16264 DF,  p-value: < 2.2e-16

Coefficients: This section presents the coefficients of the linear regression model, along with their estimates, standard errors, t-values, and p-values for each predictor.

Intercept: The intercept is 36.795955. It is the estimated “response_variable” (age) for the reference category of “education” when “another_variable” is zero.

another_variable: The coefficient for “another_variable” is 0.056299. On average, each one-unit increase in “another_variable” raises the estimated “response_variable” by about 0.0563, holding education fixed.

Education Levels: The coefficients for the education levels give the change in the estimated “response_variable” relative to the reference category. The reference is the one level missing from the table, 10th, because R uses the first factor level in alphabetical order as the baseline. For example, “education 11th” has a coefficient of -6.962949, so individuals with an “11th” education have an estimated “response_variable” that is 6.962949 units lower on average than the reference category at the same hours per week. The p-value for each coefficient indicates whether that difference from the reference is statistically significant.

Residual standard error: The residual standard error (13.4) estimates the standard deviation of the residuals. It measures how well the model fits the data; smaller values indicate a better fit.

Multiple R-squared and Adjusted R-squared: These statistics measure the proportion of the variance in the “response_variable” explained by the model. In this case, the model explains approximately 6.54% of the variance, which suggests that the model has limited explanatory power.
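The adjusted value can be reproduced from the multiple R-squared with the standard adjustment, 1 - (1 - R^2)(n - 1)/(n - p - 1). This is a rough check using the rounded figures printed in the summary, with n = 16281 rows and p = 16 estimated slopes (one for another_variable plus 15 education dummies):

```r
r_squared <- 0.06542   # Multiple R-squared from the summary output
n <- 16281             # number of observations
p <- 16                # slopes: another_variable + 15 education dummies

# Adjusted R-squared penalizes the fit for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```

Because n is so much larger than p, the penalty is tiny and the two statistics are almost identical here.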

F-statistic: The F-statistic tests whether the regression model, as a whole, is statistically significant. The high F-statistic (71.16) and extremely low p-value (“< 2.2e-16”) indicate that the model is statistically significant.

Interpretation:

The model, as a whole, is statistically significant, but it explains only a small portion of the variance in the “response_variable.”

The coefficients for “another_variable” and the various education levels show how each variable is associated with the “response_variable” while holding the other predictor in the model fixed.

The p-values associated with each coefficient indicate whether they are statistically significant. Some education levels, like “education 11th” and “education Prof-school,” are significant, while others (9th, Assoc-acdm, Assoc-voc, Bachelors, HS-grad, and Preschool) are not significant at the 0.05 level.

The interpretation of the coefficients for education levels depends on the reference category, which does not appear in the output because it is absorbed into the intercept (here 10th, the one level missing from the table). For example, the “education 11th” coefficient is the difference in the “response_variable” between individuals with “11th” education and the reference group.
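R treats the first level of a factor (alphabetical order for character data) as the reference category. A small sketch with a hypothetical vector of education labels, not the full dataset, shows how to check the baseline and how relevel() could switch it, for example to HS-grad:

```r
# Hypothetical subset of education labels, for illustration only
education <- factor(c("HS-grad", "11th", "10th", "Bachelors", "10th", "Masters"))

# The first level is the reference category used by lm()
levels(education)[1]

# relevel() moves the chosen level to the front, making it the new baseline
education_hs <- relevel(education, ref = "HS-grad")
levels(education_hs)[1]
```

Changing the baseline does not change the model fit, only which group the other coefficients are compared against.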

Overall, the model explains a small portion of the variation in the “response_variable,” and additional factors not included in the model may influence the outcome.

Diagnostic plots

par(mfrow = c(2, 2))  
plot(lm_model_2, which = 1)
plot(lm_model_2, which = 2)
plot(lm_model_2, which = 3)
plot(lm_model_2, which = 4)

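This model has no interaction term: hoursperweek (“another_variable”) and education enter only as separate main effects, so the output cannot show whether the relationship between hours worked and age changes with education. The “another_variable” coefficient is still significant and positive, but its magnitude is lower now (0.056299 versus 0.085517 in the single-predictor model), which suggests that part of the association between hours per week and age is shared with education. Diagnostic plots look okay. No major issues.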
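An interaction between hours per week and education is specified with * (or :) in the model formula, whereas + adds main effects only. The sketch below uses simulated data with made-up values, not the adult dataset, purely to show the syntax and one way the two models could be compared:

```r
set.seed(42)
n <- 300

# Simulated stand-in data; values are made up for illustration
sim <- data.frame(
  hoursperweek = runif(n, min = 20, max = 60),
  education    = factor(sample(c("HS-grad", "Bachelors", "Masters"), n, replace = TRUE))
)
sim$age <- 30 + 0.1 * sim$hoursperweek + rnorm(n, sd = 10)

# Main effects only, the same structure as lm_model_2
additive_model <- lm(age ~ hoursperweek + education, data = sim)

# Main effects plus hoursperweek:education interaction terms
interaction_model <- lm(age ~ hoursperweek * education, data = sim)

# F-test for whether the interaction terms improve the fit
anova(additive_model, interaction_model)
```

On the real data the analogous call would be lm(age ~ hoursperweek * education, data = adult), which adds one interaction coefficient per non-reference education level.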