# Loading required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Load the dataset
adult <- read.csv("C:/Users/RAKESH REDDY/OneDrive/Desktop/adult_income_data.csv")
THE COLUMNS THAT WERE INITIALLY UNCLEAR UNTIL READING THE DOCUMENTATION ARE THE FOLLOWING:
AFTER READING THE DOCUMENTATION, WE LEARNED THE FOLLOWING:
In this dataset, edunum is encoded with numerical values to represent the level of education, with higher numbers typically indicating higher education levels. This encoding is likely chosen for simplicity and easier data analysis.
The nativecountry column is encoded with the names of countries for clarity and ease of understanding.
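One way to check how edunum relates to the education level is to list each distinct label next to its numeric code. The sketch below uses a small made-up data frame instead of the real file, and the column name `education` is an assumption (only `edunum` is named above), so adjust it to whatever the CSV actually contains.

```r
# Toy stand-in for the adult data; the real file is not reproduced here.
# ASSUMPTION: a text column called `education` sits alongside `edunum`.
toy <- data.frame(
  education = c("HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors"),
  edunum    = c(9, 13, 14, 9, 13)
)

# One row per distinct (education, edunum) pair, ordered by the numeric code
mapping <- unique(toy[, c("education", "edunum")])
mapping <- mapping[order(mapping$edunum), ]
print(mapping)

# TRUE when every education label maps to exactly one edunum value
codes_per_label <- tapply(toy$edunum, toy$education, function(v) length(unique(v)))
one_to_one <- all(codes_per_label == 1)
print(one_to_one)
```

If `one_to_one` comes out TRUE on the real data, edunum is simply an ordered numeric version of the education label.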
THE COLUMNS THAT ARE STILL UNCLEAR EVEN AFTER READING THE DOCUMENTATION ARE THE FOLLOWING:
1. After reading the documentation, I understood that fnlwgt stands for final weight, but I did not understand its importance or its effect on the other columns.
2. Another unclear element is the distribution of values in capital gain. The documentation explains what capital gain is, but it does not provide information about the typical range or distribution of values. This lack of context makes it challenging to interpret the significance of specific values in this column.
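Since the documentation does not describe the range of capital gain, a few summary statistics can supply that context directly from the data. This is a minimal sketch on made-up values; with the real data, `capitalgain` would be `adult$capitalgain`.

```r
# Toy stand-in values (made up); replace with adult$capitalgain for the real data.
capitalgain <- c(0, 0, 0, 0, 0, 0, 0, 2174, 5013, 99999)

# Minimum, quartiles, mean and maximum in one call
print(summary(capitalgain))

# Share of rows with no capital gain at all
share_zero <- mean(capitalgain == 0)
print(share_zero)

# Upper quantiles show how far the tail extends
upper <- quantile(capitalgain, probs = c(0.5, 0.9, 0.99))
print(upper)
```

The share of zeros and the upper quantiles together indicate whether a given value is typical or an outlier, which is the context the documentation leaves out.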
VISUALIZATIONS AND CONCLUSIONS:
result <- aggregate(adult$fnlwgt,by=list(adult$income), mean)
result
## Group.1 x
## 1 <=50K. 189440.6
## 2 >50K. 189419.8
# Create the bar chart
barplot(result$x, names.arg=result$Group.1, xlab="income", ylab="average final weight", col=rainbow(4),
main="Income vs average final weight",border="black")
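From the visualization, I understood that the average final weight is almost identical in the two income groups (189440.6 versus 189419.8, a difference of roughly 0.01%). This suggests that final weight has little or no relationship with the target variable income, although a comparison of means alone does not rule out differences in the shape of the two distributions.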
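The comparison of means can be extended with the median and spread per group, plus the gap between the means expressed as a fraction of the overall mean. The sketch below runs on a made-up data frame that only mimics the `fnlwgt` and `income` columns; it is one possible check, not the analysis used above.

```r
# Toy stand-in for adult[, c("fnlwgt", "income")]; values are made up.
toy <- data.frame(
  fnlwgt = c(120000, 180000, 250000, 130000, 190000, 240000),
  income = c("<=50K.", "<=50K.", "<=50K.", ">50K.", ">50K.", ">50K.")
)

# Mean, median and standard deviation of fnlwgt within each income group
stats <- aggregate(fnlwgt ~ income, data = toy,
                   FUN = function(v) c(mean = mean(v), median = median(v), sd = sd(v)))
print(stats)

# Gap between the two group means as a fraction of the overall mean
group_means <- tapply(toy$fnlwgt, toy$income, mean)
rel_gap <- unname(abs(diff(group_means))) / mean(toy$fnlwgt)
print(rel_gap)
```

A relative gap close to zero, together with similar medians and standard deviations, would support the reading that fnlwgt carries little information about income.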
ggplot(adult, aes(x = capitalgain)) +
geom_histogram(binwidth = 1000, fill = "orange", color = "black") +
labs(title = "Distribution of Capital Gains",
x = "Capital Gain",
y = "Frequency") +
annotate("text", x = 25000, y = 1000, label = "Issue: Lack of context", color = "blue", size = 5)
From this visualization, I observed that one significant issue is the
potential for bias or misinterpretation due to the lack of context or
explanation for the capital-gain column. To reduce these negative
consequences, we can: