R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

# Loading required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Load the dataset
adult <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_income_data.csv")

QUESTION 1:

THE COLUMNS THAT ARE INITIALLY UNCLEAR UNTIL READING THE DOCUMENTATION ARE THE FOLLOWING:

  1. edunum: This column contains numerical values representing the level of education. Without reading the documentation, it is unclear what each numerical value corresponds to in terms of education level.
  2. capitalgain: This column represents capital gains. It is unclear what this column measures, what values are considered significant, and how it relates to other variables.
  3. nativecountry: This column contains country names. It is unclear whether the data is encoded using standard country codes or if there are any specific conventions followed.
  4. fnlwgt: This column has an opaque name. It is unclear what fnlwgt stands for and how it relates to the other columns.

AFTER READING THE DOCUMENTATION, I LEARNED THE FOLLOWING:

  1. In this dataset, edunum is encoded with numerical values to represent the level of education, with higher numbers typically indicating higher education levels. This encoding is likely chosen for simplicity and easier data analysis.

  2. The nativecountry column is encoded with the names of countries for clarity and ease of understanding.
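One way to see what each edunum code stands for is to list the distinct label/code pairs. The sketch below uses a small hand-made data frame rather than the real file; the `education` column name and the example pairs are assumptions for illustration.

```r
# Hypothetical toy data: the "education" column name and these label/code
# pairs are assumptions for illustration, not read from the actual file.
toy <- data.frame(
  education = c("HS-grad", "Bachelors", "Masters", "HS-grad", "Doctorate", "Bachelors"),
  edunum    = c(9, 13, 14, 9, 16, 13)
)

# One row per distinct (label, code) pair, ordered by the numeric code
edu_map <- unique(toy[, c("education", "edunum")])
edu_map <- edu_map[order(edu_map$edunum), ]
print(edu_map)

# Sanity check: each label should map to exactly one code
stopifnot(!any(duplicated(edu_map$education)))
```

Running the same two lines on `adult` (if it has a matching text column) would give the full lookup table for the encoding.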

QUESTION 2

THE COLUMNS THAT ARE STILL UNCLEAR EVEN AFTER READING THE DOCUMENTATION ARE THE FOLLOWING:

1. After reading the documentation, I understood that fnlwgt stands for final weight, but I did not understand its importance or its effect on the other columns.

2. Another unclear element is the distribution of values in capital gain. The documentation explains what capital gain is, but it does not provide information about the typical range or distribution of values. This lack of context makes it challenging to interpret the significance of specific values in this column.
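Since the documentation gives no typical range, the range can be read off the data directly. The following is a minimal sketch on made-up values; the vector is hypothetical and only assumes that most records have a capital gain of zero.

```r
# Hypothetical toy values, chosen only to mimic a column that is mostly zero
capitalgain <- c(0, 0, 0, 0, 0, 0, 0, 2174, 5178, 99999)

# Minimum, quartiles, mean, and maximum in one call
print(summary(capitalgain))

# Share of records with no capital gain at all
zero_share <- mean(capitalgain == 0)
print(zero_share)

# Upper percentiles show where the non-zero values sit
upper <- quantile(capitalgain, probs = c(0.5, 0.9, 0.99))
print(upper)
```

Replacing the toy vector with `adult$capitalgain` would report the same quantities for the real column.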

QUESTION 3

VISUALIZATIONS AND CONCLUSIONS:

result <- aggregate(adult$fnlwgt, by = list(adult$income), mean)
result
##   Group.1        x
## 1  <=50K. 189440.6
## 2   >50K. 189419.8
# Create the bar chart
barplot(result$x, names.arg = result$Group.1, xlab = "income", ylab = "average final weight",
        col = rainbow(4), main = "Income vs average final weight", border = "black")

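From the visualization, I concluded that the mean final weight is almost identical in the two income groups (about 189,441 for <=50K and 189,420 for >50K), so fnlwgt appears to have little or no relationship with the target variable income.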
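Equal means do not by themselves rule out differences in spread or shape, so a fuller check might compare the whole distribution per group. The sketch below runs on simulated data (the values are randomly generated, not taken from the dataset) and shows per-group summaries, a boxplot, and a two-sample t-test.

```r
# Simulated stand-in data: values are randomly generated for illustration only
set.seed(1)
toy <- data.frame(
  income = factor(rep(c("<=50K", ">50K"), each = 50)),
  fnlwgt = round(runif(100, min = 20000, max = 400000))
)

# Summary statistics of fnlwgt within each income group
by_group <- tapply(toy$fnlwgt, toy$income, summary)
print(by_group)

# Side-by-side boxplots compare medians and spread, not just means
boxplot(fnlwgt ~ income, data = toy,
        xlab = "income", ylab = "final weight",
        main = "Final weight by income group (simulated)")

# Welch two-sample t-test on the group means
tt <- t.test(fnlwgt ~ income, data = toy)
print(tt$p.value)
```

On the real data, similar boxes and a large p-value would support the conclusion that fnlwgt carries little information about income.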

ggplot(adult, aes(x = capitalgain)) +
  geom_histogram(binwidth = 1000, fill = "orange", color = "black") +
  labs(title = "Distribution of Capital Gains",
       x = "Capital Gain",
       y = "Frequency") +
  annotate("text", x = 25000, y = 1000, label = "Issue: Lack of context", color = "blue", size = 5)

From this visualization, I observed that one significant issue is the potential for bias or misinterpretation due to the lack of context or explanation for the capitalgain column. To reduce these negative consequences, we can:

  1. Thoroughly understand the data set’s documentation to interpret the data correctly.
  2. Perform data preprocessing, if necessary, to transform the capitalgain column into a more interpretable format, such as categorizing it into ranges or percentiles.
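The binning step could look roughly like the sketch below. The toy values, the break points, and the labels are all illustrative assumptions and would need to be chosen from the real distribution.

```r
# Hypothetical toy values, chosen only to mimic a column that is mostly zero
capitalgain <- c(0, 0, 0, 0, 0, 0, 0, 2174, 5178, 99999)

# Fixed ranges: the break points and labels are illustrative assumptions
gain_range <- cut(
  capitalgain,
  breaks = c(-Inf, 0, 5000, 20000, Inf),
  labels = c("none", "low", "medium", "high")
)
print(table(gain_range))

# Percentile rank: fraction of records at or below each value
gain_pct <- ecdf(capitalgain)(capitalgain)
print(gain_pct)
```

Percentile ranks from `ecdf()` are used here instead of quantile-based breaks because a column dominated by zeros can yield duplicated quantiles, which `cut()` rejects as non-unique breaks.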
