# Load the dataset
adult <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_income_data.csv")
summary(adult)
## age workclass fnlwgt education
## Min. :17.00 Length:16281 Min. : 13492 Length:16281
## 1st Qu.:28.00 Class :character 1st Qu.: 116736 Class :character
## Median :37.00 Mode :character Median : 177831 Mode :character
## Mean :38.77 Mean : 189436
## 3rd Qu.:48.00 3rd Qu.: 238384
## Max. :90.00 Max. :1490400
## edunum maritalstatus occupation relationship
## Min. : 1.00 Length:16281 Length:16281 Length:16281
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.07
## 3rd Qu.:12.00
## Max. :16.00
## race sex capitalgain capitalloss
## Length:16281 Length:16281 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1082 Mean : 87.9
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :3770.0
## hoursperweek nativecountry income
## Min. : 1.00 Length:16281 Length:16281
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.39
## 3rd Qu.:45.00
## Max. :99.00
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.0
## ✔ readr 2.1.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Age is often an important factor in many sociodemographic analyses, and it’s a continuous variable that can provide valuable insights into the dataset. The initial choice of age as the response variable will help us understand its relationships with other factors in the dataset.
response_variable <- adult$age
Education is a categorical variable that might influence age. We can expect that individuals with higher education levels might, on average, be older than those with lower education levels. This choice allows us to investigate if there’s a significant difference in age based on education.
explanatory_variable <- adult$education
Null Hypothesis: The means of age are equal across different education levels.
anova_result <- aov(response_variable ~ explanatory_variable, data=adult)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## explanatory_variable 15 196524 13102 72.83 <2e-16 ***
## Residuals 16265 2925980 180
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is below 0.05, so we reject the null hypothesis that mean age is equal across education levels.
F value: The F-statistic is the ratio of the mean square explained by the “explanatory_variable” (between-group variation) to the residual mean square (within-group variation). It is the test statistic for the analysis of variance. In this case, the F-statistic is approximately 72.83.
Pr(>F): This is the p-value associated with the F-statistic. It is the probability of obtaining an F-statistic at least as large as the one observed, assuming that the null hypothesis (no effect of the “explanatory_variable”) is true.
The F-statistic tests whether the mean of the “response_variable” differs across levels of the “explanatory_variable.” Here the F-statistic is large (approximately 72.83) and the p-value is extremely small (< 2e-16), so mean age differs significantly across at least some education levels. The test does not identify which levels differ.
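The F value in the table can be reproduced by hand from the Df and Sum Sq columns. A minimal sketch, using the rounded values printed above (the variable names are illustrative and not part of the original analysis); the result should come out at about 72.83:

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_between <- 196524    # Sum Sq for explanatory_variable
df_between <- 15        # Df for explanatory_variable
ss_within  <- 2925980   # Sum Sq for Residuals
df_within  <- 16265     # Df for Residuals

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# The F value is the ratio of the two mean squares
f_value <- ms_between / ms_within

# Pr(>F) is the upper-tail probability of the F distribution
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(f_value, 2))
print(p_value < 2e-16)
```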
Looking at the data, the ‘hours per week’ column seems like a potential continuous predictor of age.
ggplot(adult, aes(x = hoursperweek, y = age)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red")
## `geom_smooth()` using formula = 'y ~ x'
The fitted line has a slight positive slope, suggesting a weak linear
relationship between age and hoursperweek.
another_variable <- adult$hoursperweek
Building a linear regression model using the selected continuous variable to predict the response variable (age).
lm_model <- lm(response_variable ~ another_variable, data=adult)
summary(lm_model)
##
## Call:
## lm(formula = response_variable ~ another_variable, data = adult)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.155 -11.102 -1.734 8.838 54.088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.313238 0.366623 96.320 <2e-16 ***
## another_variable 0.085517 0.008672 9.861 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.81 on 16279 degrees of freedom
## Multiple R-squared: 0.005938, Adjusted R-squared: 0.005877
## F-statistic: 97.24 on 1 and 16279 DF, p-value: < 2.2e-16
The intercept of 35.313238 is the estimated age when “another_variable” (hours per week) is zero. Zero lies outside the observed range of 1 to 99 hours, so the intercept has little practical meaning on its own.
The F-statistic (97.24) is large, so the model as a whole is statistically significant, even though the effect size is small.
The coefficient for “another_variable” (hours per week) is 0.085517. It is positive, so each additional hour worked per week is associated with an increase of about 0.086 years in estimated age.
In summary, hours per week has a statistically significant but small positive association with age. The model explains less than 1% of the variability in age (R-squared = 0.005938), so other factors not included in the model account for most of it.
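To make the size of the slope concrete, the fitted line can be evaluated by hand. This is a small sketch that copies the two coefficients from the summary output; `predict_age` is a hypothetical helper, not a function from the original analysis:

```r
# Coefficients copied from summary(lm_model)
intercept <- 35.313238
slope     <- 0.085517

# Hypothetical helper: estimated age for a given number of weekly hours
predict_age <- function(hours) intercept + slope * hours

age_40 <- predict_age(40)
age_50 <- predict_age(50)

print(age_40)            # estimated age at 40 hours per week
print(age_50 - age_40)   # change in estimated age for 10 extra hours
```

Ten extra hours per week shift the estimated age by less than one year, which is consistent with the very low R-squared.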
par(mfrow = c(2, 2))
plot(lm_model, which = 1)
plot(lm_model, which = 2)
plot(lm_model, which = 3)
plot(lm_model, which = 4)
The diagnostic plots do not indicate any major issues with the model
assumptions or fit. The linear model seems reasonably well-specified for
the hoursperweek predictor.
f_test <- summary(lm_model)
# fstatistic holds three elements: the F value, numerator df, denominator df
fstat <- f_test$fstatistic
# The p-value is not stored, so compute it from the F distribution
p_value <- pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
cat("Overall model F-test:\n")
## Overall model F-test:
cat("F-statistic =", fstat[1], ", p-value =", format.pval(p_value), "\n")
## F-statistic = 97.24177 , p-value = < 2.22e-16
F-Statistic: The F-statistic is a test statistic that measures the overall significance of the linear regression model. It is calculated by comparing the explained variance (variance due to the regression model) to the unexplained variance (residual variance).
- In the output, the F-statistic is approximately 97.24.
P-Value: The p-value associated with the F-statistic tells you whether the regression model, as a whole, is statistically significant. A low p-value indicates that the model is significant, while a high p-value suggests that the model is not statistically significant.
- The `fstatistic` element of the summary holds only the F value and its two degrees of freedom, so indexing a fourth element returns "NA". The p-value has to be computed with pf(), and the result agrees with the p-value reported by summary(lm_model) (< 2.2e-16).
Since the p-value is far below 0.05, the model as a whole is statistically significant. With an R-squared of 0.005938, however, it has very little predictive power, so it is worth reconsidering the choice of predictors or exploring other models.
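With a single predictor, the overall F-test and the t-test on the slope are the same test, so the F-statistic equals the squared t value of the slope. A quick check against the numbers reported by summary(lm_model), with illustrative variable names:

```r
# Values copied from summary(lm_model), rounded as printed
t_slope    <- 9.861   # t value for another_variable
f_reported <- 97.24   # overall F-statistic on 1 and 16279 DF

# In simple linear regression, F = t^2
f_from_t <- t_slope^2

print(round(f_from_t, 2))
print(abs(f_from_t - f_reported) < 0.01)
```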
# Build a linear regression model with two predictors
lm_model_2 <- lm(response_variable ~ another_variable + education, data=adult)
# Summary of the model
summary(lm_model_2)
##
## Call:
## lm(formula = response_variable ~ another_variable + education,
## data = adult)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.657 -11.036 -1.946 8.683 56.111
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.795955 0.702175 52.403 < 2e-16 ***
## another_variable 0.056299 0.008563 6.575 5.02e-11 ***
## education 11th -6.962949 0.822044 -8.470 < 2e-16 ***
## education 12th -6.051325 1.093105 -5.536 3.14e-08 ***
## education 1st-4th 8.657561 1.632600 5.303 1.15e-07 ***
## education 5th-6th 5.919083 1.188820 4.979 6.46e-07 ***
## education 7th-8th 12.834199 0.987066 13.002 < 2e-16 ***
## education 9th 1.572880 1.065491 1.476 0.140
## education Assoc-acdm -0.560405 0.854998 -0.655 0.512
## education Assoc-voc -0.296960 0.812097 -0.366 0.715
## education Bachelors -0.293791 0.680292 -0.432 0.666
## education Doctorate 7.769842 1.179207 6.589 4.56e-11 ***
## education HS-grad 0.179948 0.654655 0.275 0.783
## education Masters 4.705028 0.767105 6.133 8.80e-10 ***
## education Preschool 2.465159 2.449619 1.006 0.314
## education Prof-school 6.564208 1.047821 6.265 3.83e-10 ***
## education Some-college -3.469566 0.666191 -5.208 1.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.4 on 16264 degrees of freedom
## Multiple R-squared: 0.06542, Adjusted R-squared: 0.0645
## F-statistic: 71.16 on 16 and 16264 DF, p-value: < 2.2e-16
Coefficients: This section presents the coefficients of the linear regression model, along with their estimates, standard errors, t-values, and p-values for each predictor.
Intercept: The intercept is 36.795955. It is the estimated age for the reference category of “education” when “another_variable” (hours per week) is zero.
another_variable: The coefficient for “another_variable” is 0.056299. Holding education level fixed, each additional hour worked per week is associated with an increase of about 0.0563 years in estimated age.
Education Levels: Each education coefficient is the difference in estimated age from the reference category, which is the one level that has no row in the table. R uses the first factor level in alphabetical order as the reference, which here appears to be “10th”. For example, “education 11th” has a coefficient of -6.962949, so individuals at the “11th” level are estimated to be about 6.96 years younger on average than the reference group at the same hours per week. The p-value of each coefficient indicates whether that difference from the reference is statistically significant.
Residual standard error: The residual standard error (13.4) estimates the standard deviation of the residuals, in years of age. Smaller values indicate a closer fit.
Multiple R-squared and Adjusted R-squared: These statistics measure the proportion of the variance in the “response_variable” explained by the model. In this case, the model explains approximately 6.54% of the variance, which suggests that the model has limited explanatory power.
F-statistic: The F-statistic tests whether the regression model, as a whole, is statistically significant. The high F-statistic (71.16) and extremely low p-value (“< 2.2e-16”) indicate that the model is statistically significant.
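The adjusted R-squared can be recovered from the multiple R-squared, the sample size, and the number of estimated slopes. A minimal sketch using the values printed above; `n` and `p` are inferred from the degrees of freedom in the output (16264 residual DF and 16 model DF):

```r
# Values taken from summary(lm_model_2)
r_squared <- 0.06542   # Multiple R-squared
n <- 16281             # observations (16264 residual DF + 16 slopes + 1 intercept)
p <- 16                # estimated slopes (1 for hours, 15 for education levels)

# Adjusted R-squared penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(adj_r_squared, 4))
```

With a sample this large the penalty is tiny, which is why the two values differ only in the fourth decimal place.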
Interpretation:
The model, as a whole, is statistically significant, but it explains only a small portion of the variance in the “response_variable.”
The coefficients for “another_variable” and the various education levels provide insights into how these variables influence the “response_variable” while controlling for other factors in the model.
The p-values associated with each coefficient indicate whether they are statistically significant. Some education levels, like “education 11th” and “education Prof-school,” are significant, while others may not be.
The interpretation of the coefficients for education levels should consider the reference category, which is not explicitly shown in the output. For example, “education 11th” is compared to the reference category, and the coefficient represents the difference in the “response_variable” for individuals with “11th” education compared to the reference group.
Overall, the model explains a small portion of the variation in the “response_variable,” and additional factors not included in the model may influence the outcome.
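The role of the reference category is easiest to see by computing a few fitted values by hand. The sketch below copies coefficients from the summary output and assumes 40 hours per week; the variable names are illustrative only:

```r
# Coefficients copied from summary(lm_model_2)
intercept   <- 36.795955   # reference education level, zero hours
b_hours     <- 0.056299    # another_variable (hours per week)
b_11th      <- -6.962949   # education 11th
b_doctorate <- 7.769842    # education Doctorate

hours <- 40  # assumed value for illustration

# Reference level: no education coefficient is added
age_reference <- intercept + b_hours * hours

# Other levels: add that level's coefficient to the reference estimate
age_11th      <- age_reference + b_11th
age_doctorate <- age_reference + b_doctorate

print(c(reference = age_reference, `11th` = age_11th, Doctorate = age_doctorate))
```

Each education coefficient shifts the estimate up or down from the reference level by a fixed amount, regardless of the hours worked.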
par(mfrow = c(2, 2))
plot(lm_model_2, which = 1)
plot(lm_model_2, which = 2)
plot(lm_model_2, which = 3)
plot(lm_model_2, which = 4)
This model adds education as a main effect and contains no interaction
term. The hoursperweek coefficient is still significant and positive,
but its magnitude is lower now (0.0563 compared with 0.0855), so part of
the association between hoursperweek and age is shared with education.
Diagnostic plots look okay. No major issues.
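For reference, an interaction between hours per week and education number has to be requested explicitly in the model formula with `*` or `:`. A minimal sketch on simulated data (the data frame `sim` and the coefficients used to generate it are made up for illustration; only the column names mirror the dataset):

```r
set.seed(1)
n <- 200

# Simulated stand-in data, not the adult dataset
sim <- data.frame(
  hoursperweek = runif(n, 10, 60),
  edunum       = sample(1:16, n, replace = TRUE)
)
sim$age <- 30 + 0.05 * sim$hoursperweek + 0.5 * sim$edunum +
  0.01 * sim$hoursperweek * sim$edunum + rnorm(n, sd = 5)

# a * b expands to a + b + a:b, so the interaction gets its own coefficient
lm_interaction <- lm(age ~ hoursperweek * edunum, data = sim)

coef_names <- names(coef(lm_interaction))
print(coef_names)
```

A model fitted this way reports a separate `hoursperweek:edunum` row, and only that row speaks to whether the slope for hours changes with education.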