This data set, which I am currently working on, covers the income levels of people in the United States and the factors that contribute to the variation in those income levels. Several other fields in the data set either depend on a citizen's income level or simply provide more information about that citizen. The classes in this data set are imbalanced.
The aim of this project is to build a predictive model that predicts the income level of US citizens from factors such as race, education, class_of_worker, tax_filer_status, etc. The income levels are binned at below 50K and above 50K to present a binary classification problem. The goal field of this data, however, was drawn from the “total person income” field rather than the “adjusted gross income” field and may, therefore, behave differently than the original ADULT goal field.
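As a rough illustration of the binning described above, here is a minimal sketch in plain Python. The income values are made up for the example, and the choice to place an income of exactly 50K in the lower bin is my assumption, since the source does not say which side the boundary falls on.

```python
from collections import Counter

THRESHOLD = 50_000  # the 50K cut-off described in the text


def income_level(total_person_income):
    """Return 1 for an income above the threshold, 0 otherwise.

    Assumption: an income of exactly 50K is placed in the lower bin.
    """
    return 1 if total_person_income > THRESHOLD else 0


# Hypothetical incomes, not taken from the data set.
incomes = [12_000, 48_500, 50_000, 75_250, 0, 31_000]

labels = [income_level(x) for x in incomes]

# Counting the labels gives a quick view of the class imbalance.
counts = Counter(labels)
minority_share = counts[1] / len(labels)
```

A small `minority_share` is the usual sign that accuracy alone will be a misleading metric for the classifier.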
This data set is multivariate. It contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment-related variables. The instance weight indicates the number of people in the population that each record represents due to stratified sampling. More information on the data set variables is given below.
Original Owner
United States Department of Commerce
Donor
Terran Lane and Ronny Kohavi
Data Mining and Visualization
Silicon Graphics.
terran@ecn.purdue.edu, ronnyk@sgi.com
Date Donated: March 7, 2000
This data set contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment related variables.
The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions, this field must be used. This attribute should not be used in the classifiers.
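The two roles of the instance weight can be sketched as follows: it is used when estimating population-level figures, and it is dropped before the records are handed to a classifier. The records, the field names `instance_weight` and `income_level`, and both helper functions are hypothetical and only serve the illustration.

```python
# Hypothetical records; the weights and values are made up.
records = [
    {"age": 38, "instance_weight": 1700.0, "income_level": 0},
    {"age": 52, "instance_weight": 900.0, "income_level": 1},
    {"age": 27, "instance_weight": 2400.0, "income_level": 0},
]


def weighted_share(rows, target="income_level", weight="instance_weight"):
    """Estimate the population share of target == 1 using the weights."""
    total = sum(r[weight] for r in rows)
    positive = sum(r[weight] for r in rows if r[target] == 1)
    return positive / total


def feature_rows(rows, exclude=("instance_weight", "income_level")):
    """Drop the weight and the target so neither reaches the classifier."""
    return [{k: v for k, v in r.items() if k not in exclude} for r in rows]


population_share = weighted_share(records)
features = feature_rows(records)
```

Here the unweighted share of the upper bin would be one record in three, while the weighted estimate is lower, because that record represents fewer people.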
More information detailing the meaning of the attributes can be found in the Census Bureau’s documentation. To make use of the data descriptions at this site, the following mappings to the Census Bureau’s internal database column names will be needed:
age
class of worker
industry code
occupation code
adjusted gross income
education
wage per hour
enrolled in edu inst last wk
marital status
major industry code
major occupation code
race
hispanic origin
sex
member of a labor union
reason for unemployment
full or part time employment stat
capital gains
capital losses
dividends from stocks
federal income tax liability
tax filer status
region of previous residence
state of previous residence
detailed household and family stat
detailed household summary in household
instance weight
migration code-change in msa
migration code-change in reg
migration code-move within reg
live in this house 1 year ago
migration prev res in sunbelt
num persons worked for employer
family members under 18
total person earnings
country of birth father
country of birth mother
country of birth self
citizenship
total person income
own business or self employed
taxable income amount
fill inc questionnaire for veteran’s admin
veterans benefits
weeks worked in year
The following are the conclusions I drew after studying this data set closely and examining visualizations of the data:
1. The data set contains citizens aged 0 to 90.
2. The largest share of the population falls in the [24,45] age group.
3. Citizens with a salary below 50K outnumber those with a salary above 50K.
4. The median age of citizens with a salary below 50K is under 40 years, while the median age of those with a salary above 50K is over 40 years.
5. At every age, citizens of the “white” race are the most numerous.
6. Most of the citizens are married with a living spouse.
7. There are 147 unborn citizen records in the database.
8. Exactly 615 entries have a wage per hour of 0.
9. There is a weak positive correlation between a citizen's age and the wage per hour that the citizen earns.
10. Most students in the data set are children.
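A correlation like the one in point 9 can be checked with the Pearson coefficient. The sketch below is one way to compute it in plain Python. The `pearson` helper and the sample values are hypothetical, so the number it produces says nothing about the real data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


# Hypothetical ages and hourly wages, not taken from the data set.
ages = [22, 35, 41, 58, 63]
wages = [0, 12, 0, 15, 9]

r = pearson(ages, wages)
```

A value of `r` close to 0 indicates a weak linear relationship, and the sign gives its direction.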
Hypothesis 1:
Dependent Variable: income_level
Independent Variables:
Age
Major_industry_code
Sex
full_parttime_employment_stat
Country_self
Citizenship
Veterans_benefits
Weeks_worked_in_year
Year
Class_of_worker
Industry_code
Occupation_code
education
Wage_per_hour
Enrolled_in_edu_inst_lastwk
Marital_status
major_occupation_code
Race
Hispanic_origin
Member_of_labor_union
Capital_gains
capital_losses
dividend_from_Stocks
Tax_filer_status
Region_of_previous_residence
state_of_previous_residence
D_household_family_stat
D_household_summary
Migration_msa
migration_reg
Migration_sunbelt
Migration_within_reg
Live_1_year_ago
num_person_Worked_employer
Family_members_under_18
Country_father
Country_mother
business_or_self_employed
fill_questionnaire_veteran_admin
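To show how a hypothesis of this shape turns into a model, here is a minimal sketch of a logistic regression fitted by gradient descent in plain Python. The write-up does not say which model was actually used, so logistic regression is my assumption. The two scaled features, the six rows, and the helpers `sigmoid`, `fit_logistic`, and `predict` are all hypothetical.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 when the predicted probability is at least 0.5."""
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5 else 0


# Hypothetical, pre-scaled rows: [Age / 100, Weeks_worked_in_year / 52]
X = [[0.20, 0.10], [0.25, 0.50], [0.30, 0.40],
     [0.45, 1.00], [0.50, 1.00], [0.60, 0.90]]
y = [0, 0, 0, 1, 1, 1]  # income_level: 0 = below 50K, 1 = above 50K

w, b = fit_logistic(X, y)
predictions = [predict(w, b, x) for x in X]
```

On the real data, the categorical variables in the list would need to be encoded as numbers first, and the class imbalance would need to be handled, for example through class weights or resampling.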
The following are the significant factors with respect to the dependent variable, income_level (Best Fit Model):
Data Extraction System for the Census Bureau. The United States Census Bureau Web Site. The UCI KDD Archive, Information and Computer Science, University of California, Irvine, Irvine, CA 92697-3425