R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Loading required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readr)

# Load the dataset
adult <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_income_data.csv")

Building Three Sets of Variable Combinations:

# Set 1: Hours per Week, Education Num (continuous)
set1 <- adult[c("hoursperweek", "edunum")]

# Set 2: Occupation, Education (categorical)
set2 <- adult[c("occupation", "education")]

# Set 3: Age, Capital Gain (continuous)
set3 <- adult[c("age", "capitalgain")]

Set 1 (Hours per week vs Education Number):

# Create a new column: log(x + 1) transform of hours per week
set1$normalized.hoursperweek <- log(set1$hoursperweek + 1)

# Select a Response Variable and Explanatory Variables for Each Set:
response_var1 <- set1$edunum
explanatory_vars1 <- set1[c("hoursperweek", "normalized.hoursperweek")]

# Plot a Visualization for Each Response-Explanatory Relationship:
# Example for set 1: Scatterplot
plot(response_var1 ~ explanatory_vars1$normalized.hoursperweek, 
     main = "Scatterplot of Education Num vs Normalized Hours per Week", 
     xlab = "Normalized Hours per Week", 
     ylab = "Education Num")

In the above scatterplot of Education Num vs. Normalized Hours per Week, we observe a scattered pattern with no clear linear relationship. Education num values are spread across a wide range of normalized hours per week and there appear to be no clear outliers in the plot. The data points are fairly evenly distributed without any extreme values.
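The "normalized" column plotted above is a log(x + 1) transform. It compresses large values but does not rescale the data to a fixed range. A minimal sketch on made-up hours values (the `hours` vector is hypothetical, not taken from the dataset) shows that base R's `log1p()` computes the same quantity and that the transform preserves the ordering of observations:

```r
# Hypothetical hours-per-week values, for illustration only
hours <- c(0, 20, 40, 60, 99)

# log(x + 1), the same formula used for normalized.hoursperweek
log_hours <- log(hours + 1)

# log1p() computes log(1 + x) directly
same_as_log1p <- isTRUE(all.equal(log_hours, log1p(hours)))

# The transform is monotone, so the ranking of observations is unchanged
same_order <- identical(order(hours), order(log_hours))

print(round(log_hours, 3))
print(c(same_as_log1p = same_as_log1p, same_order = same_order))
```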

# Correlation coefficient for set 1
correlation_coefficient_set1 <- cor(response_var1, explanatory_vars1$normalized.hoursperweek)
print(correlation_coefficient_set1)
## [1] 0.120708

The calculated correlation coefficient (r ≈ 0.12) is small and positive, indicating a weak linear relationship between Education Num and Normalized Hours per Week. This value is consistent with the scatterplot, which showed a scattered distribution of data points with no clear linear pattern.
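A point estimate of r can be supplemented with a confidence interval and a significance test using `cor.test()`. The following is a sketch on simulated data (the vectors `x` and `y` are assumptions for illustration, not columns of the adult dataset). It also computes a Spearman rank correlation, which is less sensitive to skew and extreme values:

```r
# Simulated data, for illustration only (not the adult dataset)
set.seed(42)
x <- rnorm(500)
y <- 0.1 * x + rnorm(500)  # weak linear dependence by construction

# Pearson correlation with a 95% confidence interval and a test of r = 0
pearson <- cor.test(x, y)
print(pearson$estimate)
print(pearson$conf.int)

# Spearman rank correlation as a robustness check
spearman <- cor(x, y, method = "spearman")
print(spearman)
```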

# Confidence interval for set 1
confidence_interval_set1 <- t.test(response_var1)$conf.int
print(confidence_interval_set1)
## [1] 10.03347 10.11235
## attr(,"conf.level")
## [1] 0.95

This interval suggests that we can be 95% confident that the true population mean of Education Num lies between 10.03347 and 10.11235. Note that it describes the mean of Education Num alone, not its relationship with hours per week.
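The interval returned by `t.test()` is the sample mean plus or minus a t critical value times the standard error. As a sketch, the calculation can be reproduced by hand on a hypothetical sample (`edu` is simulated here, not read from the dataset):

```r
# Hypothetical education-number values, for illustration only
set.seed(1)
edu <- sample(1:16, size = 200, replace = TRUE)

n <- length(edu)
se <- sd(edu) / sqrt(n)           # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value

# mean +/- t * SE
manual_ci <- mean(edu) + c(-1, 1) * t_crit * se

# Same interval from t.test()
builtin_ci <- as.numeric(t.test(edu)$conf.int)

print(manual_ci)
print(builtin_ci)
```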

Insights:

The scatterplot indicates that there is no strong linear relationship between Education Num and normalized hours per week: working hours vary widely at every education level. The absence of clear outliers suggests that this pattern is relatively consistent across the dataset, with no extreme values skewing the analysis.

Set 2 (Education Category vs Occupation):

set2$education_category <- cut(adult$edunum, breaks = c(0, 8, 12, 16), labels = c("Low", "Medium", "High"))

response_var2 <- set2$education_category
explanatory_var2 <- set2$occupation

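# Stacked bars: one bar per occupation, segments are education categories
barplot(table(response_var2, explanatory_var2), 
        main = "Distribution of Education Category by Occupation",
        xlab = "Occupation",
        ylab = "Frequency",
        legend.text = TRUE, las = 2)  # las = 2 draws axis labels perpendicular to the axis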

In the above barplot of Education Category by Occupation, we observe how different education categories are distributed among various occupations.
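The education categories in this barplot come from `cut()`, which by default builds right-closed intervals. With breaks `c(0, 8, 12, 16)` the bins are (0, 8], (8, 12], and (12, 16]. A small sketch with hand-picked boundary values (the `edunum_demo` vector is hypothetical) makes the bin edges explicit:

```r
# Hand-picked values on and around the bin edges, for illustration only
edunum_demo <- c(1, 8, 9, 12, 13, 16)

category <- cut(edunum_demo,
                breaks = c(0, 8, 12, 16),
                labels = c("Low", "Medium", "High"))

# 8 falls in "Low" and 12 in "Medium" because intervals are closed on the right
print(data.frame(edunum_demo, category))
```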

Correlation Coefficient

As the variables are categorical, we cannot calculate a Pearson correlation coefficient. We are not assessing correlation but rather the distribution of one categorical variable across another. The interpretation therefore focuses on the patterns and proportions observed in the barplot, such as which occupations tend to have a higher proportion of individuals in the “High” education category and which have more in the “Low” category.
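Although a correlation coefficient does not apply, association between two categorical variables can still be quantified. One common option, shown here only as a sketch on made-up counts (the table `tab` and its labels are hypothetical), is a chi-squared test of independence together with Cramér's V as an effect size between 0 and 1:

```r
# Made-up contingency table, for illustration only
tab <- matrix(c(30, 50, 20,
                10, 40, 50),
              nrow = 2, byrow = TRUE,
              dimnames = list(occupation = c("A", "B"),
                              education_category = c("Low", "Medium", "High")))

# Chi-squared test of independence
chi <- chisq.test(tab)
print(chi$statistic)
print(chi$p.value)

# Cramér's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
n <- sum(tab)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))
print(cramers_v)
```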

Confidence Interval:

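For categorical data like Education Category, a confidence interval for the mean is not meaningful as it is with numerical data. The categorical analogue would be an interval for the proportion of individuals in each category.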
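A confidence interval for a proportion is the usual categorical counterpart of an interval for a mean. As a sketch with made-up counts (`high_count` and `total` are hypothetical, not taken from the dataset), base R offers both an exact interval and a score-based one:

```r
# Made-up counts: 120 of 400 hypothetical people in the "High" category
high_count <- 120
total <- 400

# Exact (Clopper-Pearson) 95% interval for the proportion
exact_ci <- binom.test(high_count, total)$conf.int

# Wilson score interval with continuity correction, via prop.test()
score_ci <- prop.test(high_count, total)$conf.int

print(exact_ci)
print(score_ci)
```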

Insights:

The barplot provides a clear picture of the distribution of education categories (Low, Medium, High) across different occupations. Some occupations, such as “Exec-managerial” and “Prof-specialty,” have a higher proportion of individuals in the “High” education category, suggesting that these occupations often require higher levels of education. Other occupations, like “Handlers-cleaners,” have a larger proportion of individuals in the “Low” education category, indicating that they may require lower levels of formal education. This analysis allows us to understand the educational diversity within different occupations and how certain occupations tend to attract individuals with specific education levels.
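Statements about proportions are easier to check from column proportions than from raw counts, because occupations differ in size. A minimal sketch on a tiny made-up sample (the `toy` data frame and its occupation labels are hypothetical) uses `prop.table()` to compute the share of each education category within each occupation:

```r
# Tiny made-up sample, for illustration only
toy <- data.frame(
  occupation = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  education_category = factor(
    c("Low", "Low", "Medium", "High", "High", "High", "High", "Medium", "Low", "High"),
    levels = c("Low", "Medium", "High")
  )
)

counts <- table(toy$education_category, toy$occupation)

# margin = 2: proportions within each occupation, so each column sums to 1
props <- prop.table(counts, margin = 2)
print(round(props, 2))
```

Passing `props` instead of the raw counts to `barplot()` would give bars of equal height, which makes the category shares directly comparable across occupations.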

Set 3 (Age vs. Scaled Capital Gain):

set3$scaled.capitalgain <- set3$capitalgain / 1000

response_var3 <- set3$age
explanatory_vars3 <- set3[c("capitalgain", "scaled.capitalgain")]

plot(response_var3 ~ explanatory_vars3$scaled.capitalgain, 
     main = "Scatterplot of Age vs. Scaled Capital Gain", 
     xlab = "Scaled Capital Gain (in thousands)", 
     ylab = "Age")

In the above scatterplot of Age vs. Scaled Capital Gain, we observe a scattered pattern with no clear linear relationship. There appear to be no clear outliers in the plot.
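A visual judgment about outliers can be cross-checked numerically. One common convention (not the only one) is Tukey's rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies it to hypothetical values (`gain_demo` is made up and is not the dataset's capital gain column):

```r
# Hypothetical scaled capital gain values (in thousands), for illustration only
gain_demo <- c(0, 0, 0, 2, 3, 5, 99)

q <- quantile(gain_demo, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Tukey fences: 1.5 * IQR below Q1 and above Q3
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

flagged <- gain_demo[gain_demo < lower_fence | gain_demo > upper_fence]
print(c(lower = lower_fence, upper = upper_fence))
print(flagged)
```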

# Correlation coefficient

correlation_coefficient_set3 <- cor(response_var3, explanatory_vars3$scaled.capitalgain)

print(correlation_coefficient_set3)
## [1] 0.07637734

The calculated correlation coefficient (r ≈ 0.08) is close to zero, indicating a weak or negligible linear relationship between Age and Scaled Capital Gain. This value is consistent with the scatterplot, which showed a scattered distribution of data points with no clear linear pattern.

# Confidence interval for age
confidence_interval_set3 <- t.test(response_var3)$conf.int
print(confidence_interval_set3)
## [1] 38.55471 38.98021
## attr(,"conf.level")
## [1] 0.95

This interval suggests that we can be 95% confident that the true population mean of Age lies between 38.55471 and 38.98021.

Insights:

The scatterplot indicates that there is no strong linear relationship between age and scaled capital gain. Individuals of all ages show a wide range of scaled capital gain, and the absence of clear outliers suggests that this pattern is relatively consistent across the dataset.

Conclusion:

The sets we created provide insights into different aspects of the dataset. Sets 1 and 3 show weak or negligible correlations between variables, while Set 2 focuses on the distribution of categorical variables within categories. These insights help us understand relationships and patterns within the data.
