2024-04-23

AI Aliens Proposal

Team Members

Member 1       Member 2      Member 3          Member 4
Archit Chawla  Robby Connor  Vamshi Kanisetty  Hemanth Kona

Project Description

ISSUE: Identify a model with protected columns, and compare and contrast how different H2O models perform.

Protected Columns

  • age | age
  • sex | sex, gender, male, female
  • race | race, white, black, african american, african-american, asian, hispanic, latino, ethnicity
  • marital | marital, married, single, divorced, widowed
  • educational | educational, education, degree, college, university
  • income | income, earnings, salary, wage, poverty
  • employment | employment, full-time, full time, part-time, part time, job, occupation, unemployment
  • religion | religion, faith, belief, denomination
  • disability | disability, disabled, handicap, impairment
  • household | household, family, parent, child, dependent
  • language | language, linguistic, english, bilingual, multilingual
  • immigration | immigration, migrant, immigrant, citizenship, naturalization
  • sexual | sexual, orientation, gay, lesbian, bisexual, transgender, queer, lgbt
  • veteran | veteran, military, service, army, navy, air force, marines, coast guard
  • socioeconomic | socioeconomic, social, economic, poverty, wealth, class
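
The category-to-keyword list above can be applied to a dataset's column names with a simple lookup. The following is a minimal sketch, not the team's actual implementation: the function name `find_protected_columns` is hypothetical, the dictionary holds only a few of the categories listed above, and whole-token matching is an assumption made so that the keyword "age" does not fire on names such as "language" or "wage". Multi-word keywords such as "air force" would need phrase matching, which this sketch does not attempt.

```python
import re

# Category -> keywords, abbreviated from the Protected Columns list above.
PROTECTED_KEYWORDS = {
    "age": ["age"],
    "sex": ["sex", "gender", "male", "female"],
    "race": ["race", "white", "black", "asian", "hispanic", "latino", "ethnicity"],
    "employment": ["employment", "job", "occupation", "unemployment"],
}


def find_protected_columns(columns, keywords=PROTECTED_KEYWORDS):
    """Return {column_name: [matching categories]} for columns that hit a keyword.

    Hypothetical helper: matches on whole tokens of the column name, not substrings.
    """
    hits = {}
    for col in columns:
        # Split on non-alphanumerics so "marital_status" -> {"marital", "status"}
        tokens = set(re.split(r"[^a-z0-9]+", col.lower()))
        matched = [cat for cat, words in keywords.items() if tokens & set(words)]
        if matched:
            hits[col] = matched
    return hits


if __name__ == "__main__":
    # Column names taken from the blog_authorship_corpus sample in this proposal
    print(find_protected_columns(["text", "date", "gender", "age", "horoscope", "job"]))
```

With the six columns of blog_authorship_corpus, this flags gender, age, and job as protected.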

Using Hugging Face to Extract DataFrames and Search for Protected Columns

Data & Data Exploration

  • The dataset extracted from Hugging Face is blog_authorship_corpus
  • Hugging Face datasets are generally clean and have a clearly defined label column
    text              date        gender  age  horoscope  job
    yeah sorry writ…  2023-02-20  female  17   Libra      Student
    yeah today ok l…  2020-02-20  female  17   Libra      Student
    yay tuesdayno l…  2019-02-20  female  17   Libra      Student
    rar               2018-02-20  female  17   Libra      Student
    thought okso im…  2018-02-20  female  17   Libra      Student
  • After tokenization, the dataset will meet the requirement of at least 20 columns
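
The proposal does not say how tokenization produces the extra columns. One common reading is a bag-of-words encoding, where each frequent token becomes its own count column. The sketch below illustrates that reading only; the function name `bag_of_words` and the `top_n` parameter are hypothetical, and the text is assumed to be already cleaned and whitespace-separated.

```python
from collections import Counter


def bag_of_words(texts, top_n=20):
    """Turn cleaned texts into rows of token counts over the top_n most frequent tokens.

    Hypothetical helper: returns (vocab, rows), where vocab names the new columns
    and each row holds the per-document counts in the same order.
    """
    tokenized = [t.split() for t in texts]
    totals = Counter(tok for toks in tokenized for tok in toks)
    # Sort by frequency (descending), then alphabetically, so column order is deterministic
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [tok for tok, _ in ranked[:top_n]]
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows


if __name__ == "__main__":
    vocab, rows = bag_of_words(["yeah sorry yeah", "yeah today ok"], top_n=3)
    print(vocab)
    print(rows)
```

With `top_n=20`, the text column alone would expand to 20 count columns, provided the corpus contains at least 20 distinct tokens.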

Description of Variables

  • For blog_authorship_corpus, minimal preprocessing steps were taken
    • Removed punctuation, removed numbers, removed English stopwords, stripped whitespace
    • There were no nulls present in the dataset
    • Our target variable (the column the models predict) is job

Column_Name  Data_Type  Column_Type
text         text       Feature
date         date       Feature
gender       text       Feature
age          int64      Feature
horoscope    text       Feature
job          text       Target
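
The four preprocessing steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the team's pipeline: the function name `clean_text` is hypothetical, the stopword set is a tiny placeholder for a full English stopword list, and lowercasing is an added assumption, made because the sample rows appear in lowercase.

```python
import re
import string

# Tiny illustrative stopword set (an assumption; a real run would use a full English list)
STOPWORDS = {"a", "an", "the", "is", "it", "i", "so", "to", "and", "of", "in"}


def clean_text(text, stopwords=STOPWORDS):
    """Apply the four listed steps to one text value (hypothetical helper)."""
    text = text.lower()  # assumption: not one of the listed steps
    # 1. Remove punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Remove numbers
    text = re.sub(r"\d+", "", text)
    # 3. Remove stopwords
    tokens = [tok for tok in text.split() if tok not in stopwords]
    # 4. Strip whitespace: split() plus join() collapses runs and trims both ends
    return " ".join(tokens)


if __name__ == "__main__":
    print(clean_text("Yeah, sorry I am writing so late... 2 days!"))
```

Because punctuation is removed before stopwords are filtered, contractions lose their apostrophes first (for example, "i'm" becomes "im") and would pass the filter unless the stopword list contains the stripped form.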