2024-04-23

AI Aliens Proposal

Team Members

  • Archit Chawla
  • Robby Connor

Project Description

ISSUE: Identify a dataset with protected columns, then compare and contrast how different H2O models perform on it.

Protected Columns

  • age | age
  • sex | sex, gender, male, female
  • race | race, white, black, african american, african-american, asian, hispanic, latino, ethnicity
  • marital | marital, married, single, divorced, widowed
  • educational | educational, education, degree, college, university
  • income | income, earnings, salary, wage, poverty
  • employment | employment, full-time, full time, part-time, part time, job, occupation, unemployment
  • religion | religion, faith, belief, denomination
  • disability | disability, disabled, handicap, impairment
  • household | household, family, parent, child, dependent
  • language | language, linguistic, english, bilingual, multilingual
  • immigration | immigration, migrant, immigrant, citizenship, naturalization
  • sexual | sexual, orientation, gay, lesbian, bisexual, transgender, queer, lgbt
  • veteran | veteran, military, service, army, navy, air force, marines, coast guard
  • socioeconomic | socioeconomic, social, economic, poverty, wealth, class
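The keyword lists above can be applied to a dataset's column names with a simple substring match. The following is a minimal sketch in base R, not the team's actual search code: the helper name find_protected_columns is hypothetical, and only three of the categories are included to keep it short.

```r
# Hypothetical helper: flag column names that contain any protected keyword.
# Only a subset of the categories listed above is included here.
protected_keywords <- list(
  age    = c("age"),
  sex    = c("sex", "gender", "male", "female"),
  income = c("income", "earnings", "salary", "wage", "poverty")
)

find_protected_columns <- function(column_names, keywords = protected_keywords) {
  lowered <- tolower(column_names)
  hits <- lapply(keywords, function(terms) {
    # fixed = TRUE treats each keyword as a literal substring, not a regex
    matched <- vapply(
      lowered,
      function(name) any(vapply(terms, grepl, logical(1), x = name, fixed = TRUE)),
      logical(1)
    )
    column_names[matched]
  })
  # Drop categories with no matching column
  hits[lengths(hits) > 0]
}

# Raw columns of blog_authorship_corpus
cols <- c("text", "date", "gender", "age", "horoscope", "job")
find_protected_columns(cols)
```

Because this matches substrings, a keyword such as "age" would also flag a column named "message" or "wage", so any hits are candidates to review by hand rather than a final answer.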

Peer Comments

  • Utilize a Pipeline for streamlined modeling - Team 4
  • Consider other modeling options besides a fairness model
  • Use F1 as an evaluation metric
    • F1 is the harmonic mean of precision and recall, so it gives a more balanced view of model performance than either one alone
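As an illustration of the F1 suggestion, here is a small base R sketch that computes per-class F1 from actual and predicted labels and then averages it (macro F1). The function name f1_per_class and the toy labels are made up for this example and are not taken from the project's results.

```r
# Per-class F1 = 2 * precision * recall / (precision + recall)
f1_per_class <- function(actual, predicted) {
  classes <- sort(unique(c(actual, predicted)))
  vapply(classes, function(cls) {
    tp <- sum(predicted == cls & actual == cls)  # true positives
    fp <- sum(predicted == cls & actual != cls)  # false positives
    fn <- sum(predicted != cls & actual == cls)  # false negatives
    precision <- if (tp + fp == 0) 0 else tp / (tp + fp)
    recall    <- if (tp + fn == 0) 0 else tp / (tp + fn)
    if (precision + recall == 0) 0 else 2 * precision * recall / (precision + recall)
  }, numeric(1))
}

# Toy labels for illustration only
actual    <- c("Student", "Student", "Arts", "Arts", "Law")
predicted <- c("Student", "Arts",    "Arts", "Arts", "Student")

scores   <- f1_per_class(actual, predicted)
macro_f1 <- mean(scores)  # unweighted average across classes
```

Macro averaging weights every class equally, which matters when some job categories are much rarer than others.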

Using Hugging Face to Extract Dataframes and Search for Protected Columns

Data & Data Exploration

  • The dataset extracted from Hugging Face is blog_authorship_corpus
  • Hugging Face datasets are generally clean and have a clearly defined label column
  • Data Files
    text | date | gender | age | horoscope | job | text_length | sentiment_score | month | day | year | day_of_week | gender_female | gender_male | age_group | age_group_0_18 | age_group_19_25 | age_group_26_35 | age_group_36_50 | age_group_50plus | horoscope_factors | avg_sentiment_score_all | max_text_length_all | sentiment_trend
    goin on n | 2001-02-20 | female | 16 | Pisces | Student | 1203 | -1 | 2 | 20 | 2001 | Tue | 1 | 0 | 0-18 | 1 | 0 | 0 | 0 | 0 | 0 | 4.8504 | 12663 | NA
    i write re | 2001-02-20 | female | 16 | Pisces | Student | 41 | -1 | 2 | 20 | 2001 | Tue | 1 | 0 | 0-18 | 1 | 0 | 0 | 0 | 0 | 0 | 4.8504 | 12663 | NA
    some idiot | 2001-02-20 | female | 16 | Pisces | Student | 346 | 2 | 2 | 20 | 2001 | Tue | 1 | 0 | 0-18 | 1 | 0 | 0 | 0 | 0 | 0 | 4.8504 | 12663 | NA
    am i prett | 2001-02-20 | female | 16 | Pisces | Student | 32 | 1 | 2 | 20 | 2001 | Tue | 1 | 0 | 0-18 | 1 | 0 | 0 | 0 | 0 | 0 | 4.8504 | 12663 | NA
    1 you attr | 2001-02-20 | female | 16 | Pisces | Student | 530 | 5 | 2 | 20 | 2001 | Tue | 1 | 0 | 0-18 | 1 | 0 | 0 | 0 | 0 | 0 | 4.8504 | 12663 | NA

Description of Variables

  • For blog_authorship_corpus, minimal preprocessing steps were taken
    • Removed punctuation, removed numbers, removed English stopwords, stripped whitespace
    • There were no nulls present in the dataset
    • Created dummy categories for other categorical variables
    • Our target (response) variable is job

Column_Name   Data_Type   Column_Type
text          text        Feature
date          date        Feature
gender        text        Feature
age           int64       Feature
horoscope     text        Feature
job           text        Target
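The four text-cleaning steps listed above could look roughly like this in base R. This is a sketch under assumptions, not the team's preprocessing script: the function name clean_text is hypothetical, the stopword list is a tiny stand-in for a full English list, and lowercasing is added only so stopword matching is case-insensitive.

```r
# A small stand-in stopword list; the full English list used in the project is not shown.
toy_stopwords <- c("i", "am", "the", "a", "on", "you", "is")

clean_text <- function(text, stopwords = toy_stopwords) {
  text <- tolower(text)                    # assumption: lowercase before matching stopwords
  text <- gsub("[[:punct:]]", " ", text)   # remove punctuation
  text <- gsub("[[:digit:]]", " ", text)   # remove numbers
  # remove stopwords as whole words only
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  text <- gsub(pattern, " ", text, perl = TRUE)
  # collapse runs of whitespace, then strip leading/trailing whitespace
  trimws(gsub("[[:space:]]+", " ", text))
}

clean_text("1 you attract the BEST, don't you?!")
# "attract best don t"
```

Replacing punctuation with a space rather than deleting it keeps words apart, at the cost of splitting contractions such as "don't" into two tokens.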

Modeling

library(h2o)
h2o.init()  # start or connect to a local H2O cluster
processed_data <- readr::read_csv('processed_data.csv')
blog_authorship_data <- as.h2o(processed_data)  # convert to an H2OFrame
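The sections that follow use three fitted models (m1, m2, m3) and a held-out frame (test_h2o) whose definitions are not shown. One way they might be produced is sketched below. The wrapper name fit_h2o_models, the 80/20 split, the seed, and the default hyperparameters are all assumptions for illustration, not the settings the team used.

```r
# Sketch only: split ratio, seed, and default hyperparameters are assumptions.
# Requires the h2o package and a running cluster (h2o.init()).
fit_h2o_models <- function(frame, response = "job", train_ratio = 0.8, seed = 42) {
  # The response must be categorical for multinomial classification
  frame[[response]] <- h2o::h2o.asfactor(frame[[response]])
  predictors <- setdiff(names(frame), response)

  # Random train/test split of the H2OFrame
  splits    <- h2o::h2o.splitFrame(frame, ratios = train_ratio, seed = seed)
  train_h2o <- splits[[1]]
  test_h2o  <- splits[[2]]

  list(
    test_h2o = test_h2o,
    # m1: multinomial GLM
    m1 = h2o::h2o.glm(x = predictors, y = response, training_frame = train_h2o,
                      family = "multinomial"),
    # m2: Naive Bayes
    m2 = h2o::h2o.naiveBayes(x = predictors, y = response, training_frame = train_h2o),
    # m3: gradient boosting machine
    m3 = h2o::h2o.gbm(x = predictors, y = response, training_frame = train_h2o,
                      seed = seed)
  )
}
```

With a cluster running, fits <- fit_h2o_models(blog_authorship_data) would return the three models and the test frame in one list, for example fits$m1 and fits$test_h2o.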

Accuracy of the GLM model

hugging_face_predictions <- h2o.predict(m1, test_h2o)
perf_glm <- h2o.performance(m1, newdata = test_h2o)

Accuracy of the Naive Bayes model

hugging_face_predictions <- h2o.predict(m2, test_h2o)
perf_nb <- h2o.performance(m2, newdata = test_h2o)

Accuracy of the GBM model

hugging_face_predictions <- h2o.predict(m3, test_h2o)
perf_gbm <- h2o.performance(m3, newdata = test_h2o)

Confusion Matrix

conf_matrix <- as.data.frame.matrix(h2o.confusionMatrix(perf_nb))
 
#conf_matrix %>%
  #kbl() %>%
  #kable_material_dark()

Because the target has many classes (job categories), the confusion matrix is difficult to read, so it is excluded
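When the full matrix is too large to display, a compact summary can still be reported. The base R sketch below computes overall accuracy and lists the most frequent misclassifications. The function name summarise_confusion and the toy counts are made up for illustration, and it assumes a plain square matrix of counts, so any totals or error-rate rows and columns would need to be removed first.

```r
# Summarise a large confusion matrix: overall accuracy plus the most frequent mistakes.
summarise_confusion <- function(cm, top_n = 3) {
  # cm: square numeric matrix, rows = actual class, columns = predicted class
  accuracy <- sum(diag(cm)) / sum(cm)
  errors <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
  names(errors) <- c("actual", "predicted", "count")
  # keep only off-diagonal cells that actually occurred
  errors <- errors[errors$actual != errors$predicted & errors$count > 0, ]
  errors <- errors[order(-errors$count), ]
  list(accuracy = accuracy, top_errors = head(errors, top_n))
}

# Toy counts for illustration only
cm <- matrix(c(50,  3, 0,
                5, 40, 2,
                1,  0, 9),
             nrow = 3, byrow = TRUE,
             dimnames = list(actual    = c("Student", "Arts", "Law"),
                             predicted = c("Student", "Arts", "Law")))

summary_nb <- summarise_confusion(cm)
summary_nb$accuracy    # 0.9
summary_nb$top_errors  # most common actual/predicted mix-ups first
```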

Key Takeaways

  • Our accuracy was 37.36% for the GLM, 94.99% for the NB, and 99.23% for the GBM
  • The GBM most likely overfitted, so we believe our best model for predicting occupation from blog posts is the NB
  • We were not able to run the fairness model as anticipated