Analysis Report:

1. Introduction

The data set I am working on describes the income levels of people in the United States and the factors that contribute to the variation in those income levels. It also contains several other variables that either depend on a citizen’s income level or simply provide more information about that citizen. The data set is imbalanced.

The aim of this project is to build a predictive model that predicts the income level of US citizens from factors such as race, education, class_of_worker, tax_filer_status, etc. The income levels are binned at below 50K and above 50K to present a binary classification problem. The goal field of this data, however, was drawn from the “total person income” field rather than the “adjusted gross income” field and may, therefore, behave differently than the original ADULT goal field.
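The binning step described above can be sketched in a few lines of Python. This is only an illustration: the function name `income_level`, the sample incomes, and the choice to place an income of exactly 50K in the upper class are assumptions, not details taken from the data set documentation.

```python
THRESHOLD = 50_000  # the 50K cut-off named in the text


def income_level(total_person_income: float) -> int:
    """Return 1 for the 'above 50K' class and 0 for the 'below 50K' class."""
    # Assumption: an income of exactly 50K is placed in the upper class.
    return 1 if total_person_income >= THRESHOLD else 0


# Made-up incomes, for illustration only.
incomes = [12_000, 49_999, 50_000, 87_500]
labels = [income_level(x) for x in incomes]
print(labels)  # [0, 0, 1, 1]
```

Turning the continuous income into a 0/1 label is what makes the task a binary classification problem rather than a regression problem.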

2. Overview of the Study

This data set is a multivariate data set. It contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment related variables. The instance weight indicates the number of people in the population that each record represents due to stratified sampling. More information on the data set variables can be found in the Census Bureau’s documentation.

3. Data

Data Type

multivariate

Abstract

This data set contains weighted census data extracted from the 1994 and 1995 current population surveys conducted by the U.S. Census Bureau. The data contains demographic and employment related variables.

Sources

Original Owner

U.S. Census Bureau

United States Department of Commerce

Donor

Terran Lane and Ronny Kohavi

Data Mining and Visualization

Silicon Graphics.

terran@ecn.purdue.edu, ronnyk@sgi.com

Date Donated: March 7, 2000

Data Characteristics

This data set contains weighted census data extracted from the 1994 and 1995 Current Population Surveys conducted by the U.S. Census Bureau. The data contains 41 demographic and employment related variables.

The instance weight indicates the number of people in the population that each record represents due to stratified sampling. To do real analysis and derive conclusions about the population, this field must be used. It should not, however, be used as an attribute in the classifiers.
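The two roles of the instance weight can be illustrated with a minimal sketch: the weight enters population-level estimates, and it is removed before the records are handed to a classifier. The three records, the helper names `weighted_share` and `feature_columns`, and the key names are hypothetical and chosen only for this example.

```python
# Hypothetical records; the weights are invented for illustration.
records = [
    {"age": 34, "income_level": 0, "instance_weight": 1700.0},
    {"age": 51, "income_level": 1, "instance_weight": 900.0},
    {"age": 8, "income_level": 0, "instance_weight": 2400.0},
]


def weighted_share(rows, label):
    """Estimate the population share of one income class using the weights."""
    total = sum(r["instance_weight"] for r in rows)
    matching = sum(r["instance_weight"] for r in rows if r["income_level"] == label)
    return matching / total


def feature_columns(rows, excluded=("instance_weight", "income_level")):
    """Drop the weight and the target so that neither is used as a predictor."""
    return [{k: v for k, v in r.items() if k not in excluded} for r in rows]


print(weighted_share(records, 1))  # weighted share of the above-50K class
print(feature_columns(records))    # only the predictor columns remain
```

An unweighted count would treat every record as one person, whereas the weighted share treats each record as the number of people it stands for.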

More information detailing the meaning of the attributes can be found in the Census Bureau’s documentation. To make use of the data descriptions at this site, the following mappings to the Census Bureau’s internal database column names will be needed:

age
class of worker
industry code
occupation code
adjusted gross income
education
wage per hour
enrolled in edu inst last wk
marital status
major industry code
major occupation code
race
hispanic origin
sex
member of a labor union
reason for unemployment
full or part time employment stat
capital gains
capital losses
dividends from stocks
federal income tax liability
tax filer status
region of previous residence
state of previous residence
detailed household and family stat
detailed household summary in household
instance weight
migration code-change in msa
migration code-change in reg
migration code-move within reg
live in this house 1 year ago
migration prev res in sunbelt
num persons worked for employer
family members under 18
total person earnings
country of birth father
country of birth mother
country of birth self
citizenship
total person income
own business or self employed
taxable income amount
fill inc questionnaire for veteran’s admin
veterans benefits
weeks worked in year

4. Conclusion:

I drew the following conclusions after studying this data set closely and observing the patterns that stand out in its visualizations:

1. The data set contains citizens from age 0 to age 90.

2. A large share of the population falls in the [24, 45] age group.

3. More citizens have an income below 50K than above 50K.

4. The median age of citizens with an income below 50K is under 40 years, while the median age of those with an income above 50K is over 40 years.

5. At every age, citizens of the “white” race are the most numerous.

6. Most of the citizens are married and have a living spouse.

7. There are 147 unborn citizen records in the data set.

8. There are exactly 615 entries with wage per hour = 0.

9. There is a weak, positive correlation between the age of a citizen and the wage per hour that the citizen earns.

10. Most students in the data set are children.
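Two of the conclusions above rest on simple statistics: the median age within each income group and the correlation between age and wage per hour. The sketch below shows one way to compute both with the Python standard library. The sample values are invented and do not reproduce the figures reported above, and the helper names `pearson` and `median_age_by_income` are hypothetical.

```python
from math import sqrt
from statistics import median


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def median_age_by_income(ages, levels):
    """Group ages by income level (0 or 1) and take the median of each group."""
    groups = {}
    for age, level in zip(ages, levels):
        groups.setdefault(level, []).append(age)
    return {level: median(values) for level, values in groups.items()}


# Invented sample, for illustration only.
ages = [22, 35, 41, 48, 57, 63]
wages = [0, 900, 0, 1200, 700, 0]
levels = [0, 0, 0, 1, 1, 1]

print(median_age_by_income(ages, levels))
print(pearson(ages, wages))
```

A coefficient close to 0 with a positive sign is what "weak, positive correlation" means; values near 1 would indicate a strong linear relationship.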

Hypothesis 1:

Dependent Variable: income_level

Independent Variables: Age

Major_industry_code

Sex

full_parttime_employment_stat

Country_self

Citizenship

Veterans_benefits

Weeks_worked_in_year

Year

Class_of_worker

Industry_code

Occupation_code

education

Wage_per_hour

Enrolled_in_edu_inst_lastwk

Marital_status

major_occupation_code

Race

Hispanic_origin

Member_of_labor_union

Capital_gains

capital_losses

dividend_from_Stocks

Tax_filer_status

Region_of_previous_residence

state_of_previous_residence

D_household_family_stat

D_household_summary

Migration_msa

migration_reg

Migration_sunbelt

Migration_within_reg

Live_1_year_ago

num_person_Worked_employer

Family_members_under_18

Country_father

Country_mother

business_or_self_employed

fill_questionnaire_veteran_admin
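A hypothesis with a binary dependent variable and a set of independent variables is commonly tested with logistic regression. The report does not state which model was fitted, so the sketch below is only an assumption about the general approach: a small logistic regression trained by gradient descent on invented data with two predictors. The names `fit_logistic` and `predict_proba`, the learning rate, and the number of epochs are all hypothetical.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict_proba(b, w, x):
    """Probability of the above-50K class for one row of predictors."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit intercept b and weights w by batch gradient descent on the log-loss."""
    n, p = len(X), len(X[0])
    b, w = 0.0, [0.0] * p
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            err = predict_proba(b, w, xi) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
    return b, w


# Invented rows: [age in decades, fraction of the year worked].
X = [[2.0, 0.2], [2.5, 0.5], [3.0, 1.0], [4.5, 1.0], [5.0, 1.0], [5.5, 0.9]]
y = [0, 0, 0, 1, 1, 1]  # income_level: 0 = below 50K, 1 = above 50K

b, w = fit_logistic(X, y)
print(b, w)
```

In a fitted model of this kind, a positive weight means that larger values of the predictor raise the estimated probability of the above-50K class.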

5. Result:

The following factors are significant with respect to the dependent variable income_level (Best Fit Model):

  1. Age
  2. major_industry_code Business and repair services
  3. major_industry_code Communications
  4. major_industry_code Education
  5. major_industry_code Finance insurance and real estate
  6. major_industry_code Hospital services
  7. major_industry_code Manufacturing-durable goods
  8. major_industry_code Manufacturing-nondurable goods
  9. major_industry_code Medical except hospital
  10. major_industry_code Other professional services
  11. major_industry_code Public administration
  12. major_industry_code Transportation
  13. major_industry_code Utilities and sanitary services
  14. sex Male
  15. full_parttime_employment_stat Unemployed full-time
  16. country_self Holand-Netherlands
  17. country_self Hungary
  18. country_self Iran
  19. country_self Japan
  20. country_self Taiwan
  21. country_self Thailand
  22. citizenship Native- Born abroad of American Parent(s)
  23. Veterans_benefits
  24. Weeks_worked_in_year
  25. class_of_worker Local government
  26. class_of_worker Never worked
  27. class_of_worker Not in universe
  28. class_of_worker Private
  29. class_of_worker Self-employed-not incorporated
  30. class_of_worker State government
  31. class_of_worker Without pay
  32. industry_code
  33. occupation_code
  34. education Bachelors degree(BA AB BS)
  35. education Children
  36. education Doctorate degree(PhD EdD)
  37. education Masters degree(MA MS MEng MEd MSW MBA)
  38. education Prof school degree (MD DDS DVM LLB JD)
  39. marital_status Never married
  40. major_occupation_code Armed Forces
  41. major_occupation_code Executive admin and managerial
  42. major_occupation_code Farming forestry and fishing
  43. major_occupation_code Handlers equip cleaners etc
  44. major_occupation_code Machine operators assmblrs & inspctr
  45. major_occupation_code Other service
  46. major_occupation_code Precision production craft & repair
  47. major_occupation_code Professional specialty
  48. major_occupation_code Protective services
  49. major_occupation_code Sales
  50. major_occupation_code Technicians and related support
  51. major_occupation_code Transportation and material moving
  52. capital_gains
  53. capital_losses
  54. dividend_from_Stocks
  55. country_self China
  56. country_mother Scotland
  57. num_person_Worked_employer
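Many entries in the list above combine a variable name with one of its categories (for example "sex Male"). Names of this form typically arise when a categorical variable is dummy coded, with one reference category left out. That this report's model used such coding is an assumption based only on the naming pattern. The helper `dummy_code` and the sample values below are hypothetical.

```python
def dummy_code(values, variable, reference=None):
    """Turn one categorical column into 0/1 indicator columns.

    The reference level gets no column of its own; by default it is the
    first level in sorted order (an assumption made for this sketch).
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]
    columns = {}
    for level in levels:
        if level == reference:
            continue  # the reference level is absorbed into the intercept
        columns[f"{variable} {level}"] = [1 if v == level else 0 for v in values]
    return columns


# Invented sample column.
sex = ["Male", "Female", "Female", "Male"]
coded = dummy_code(sex, "sex")
print(coded)  # {'sex Male': [1, 0, 0, 1]}
```

Under this coding, a significant "sex Male" term would mean that being male changes the predicted outcome relative to the omitted reference category, here "Female".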