For this assignment, we will be using the census adult dataset from UCI ML repository. The Adult dataset was extracted by Barry Becker from the 1994 US Census Database. Each row in the dataset has de-identified dempgraphic information of an individual worker and their income. The income is a categorical variable with two levels: <50K and >50K. The goal of this assignment is to create a binary classifier to predict whether a person makes more than 50K based on the other attributes in the dataset.
Please see the description of dataset and its attributes here: https://archive.ics.uci.edu/ml/datasets/Adult Then go to data folder at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/ and download adult.data . This would be the dataset you will use to answer the following questions.
1. Download the dataset and store it in a dataframe in R. The dataset does not have header, you should add the headers manually to your dataframe based on the list of attributes provided in https://archive.ics.uci.edu/ml/datasets/Adult. Also please note that some entries have extra white space. So to read the data properly, use the option strip.white=TRUE in read.csv function.
# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/adult (1).data", header = F, sep = ",", na.strings = "?", strip.white=TRUE)
# Manually assign the header names
colnames(df) <- c("age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income")
head(df)## age workclass fnlwgt education education-num marital-status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital-gain capital-loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours-per-week native-country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
## [1] 32561 15
2. Explore the overall structure of the dataset using the str() function. Get a summary statistics of each variable. How many categorical and numeric variables you have in your data? Is there any missing values?
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : chr "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education-num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital-status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : chr "White" "White" "White" "Black" ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ capital-gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital-loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours-per-week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native-country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
## age workclass fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education-num marital-status occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital-gain capital-loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hours-per-week native-country income
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
## [1] 4262
There are 9 categorical variables in the dataset i.e workclass, education, marital-status, occupation, relationship, race, sex, native-country, and income.
There are 6 numerical variables i.e age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week
Dataset have 4262 missing data.
3. Get the frequency table of the “income” variable to see how many observations you have in each category of the income variable. Is the data balanced? Do we have equal number of samples in each classof income?
##
## <=50K >50K
## 24720 7841
we do not have an equal number of samples in each classof income. So the data is not balanced.
4. Explore the data in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income.
• To explore the relationship between numerical features and “income” variable, you can use side by side box plot and t.test
• To explore the relationship between categorical features and “income” variable, you can use frequency table and chisquare test (note that chisquare test might throw a warning if there are cells whose expected counts in the frequency table is less 5. This warning means the p-values reported from chisquare test may be incorrect due to low counts and are not reliable. You can ignore the warning for this assignment).
library(dplyr)
library(tidyr)
library(ggplot2)
# Create the boxplot
GenderPlot_age = ggplot(df, aes(x = income, y = age)) + geom_boxplot()
GenderPlot_age##
## Welch Two Sample t-test
##
## data: age by income
## t = -50.264, df = 17411, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.757250 -7.174955
## sample estimates:
## mean in group <=50K mean in group >50K
## 36.78374 44.24984
<=50K median is lower age than >50k.
##
## Welch Two Sample t-test
##
## data: age by income
## t = -50.264, df = 17411, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.757250 -7.174955
## sample estimates:
## mean in group <=50K mean in group >50K
## 36.78374 44.24984
For this problem you are going to use corona_nlp_train.csv dataset, a collection of tweets pulled from Twitter and manually labeled as being “extremely positive”, “positive”, “neutral”, “negative”, and “extremely negative”. The dataset is from this Kaggle project (https://www.kaggle.com/kerneler/starter-covid-19-nlp-text-d3a3baa6- e/data ). I have attached the data to this assignment spec and you can directly download it from canvas.
1. (1pt) Read the data and store in in the dataframe. Take a look at the structure of data and its variables. We will be working with only two variables: OriginalTweet and Sentiment. Original tweet is a text and Sentiment is a categorical variable with five levels: “extremely positive”, “positive”, “neutral”,“negative”, and “extremely negative”.
# Read the data
df_tweets <- read.csv('/Users/subhalaxmirout/CSC 532 - ML/Corona_NLP_train.csv', header = T, encoding="utf-8", stringsAsFactors= FALSE)
# Shows first 6 rows of the tweets data
head(df_tweets)## UserName ScreenName Location TweetAt
## 1 3799 48751 London 16-03-2020
## 2 3800 48752 UK 16-03-2020
## 3 3801 48753 Vagabonds 16-03-2020
## 4 3802 48754 16-03-2020
## 5 3803 48755 16-03-2020
## 6 3804 48756 ÜT: 36.319708,-82.363649 16-03-2020
## OriginalTweet
## 1 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
## 2 advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order
## 3 Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P
## 4 My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \n\n\nStay calm, stay safe.\n\n\n\n\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
## 5 Me, ready to go at supermarket during the #COVID19 outbreak.\n\n\n\n\n\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\n\n\n\n\n\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n
## 6 As news of the region\u0092s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU
## Sentiment
## 1 Neutral
## 2 Positive
## 3 Positive
## 4 Positive
## 5 Extremely Negative
## 6 Positive
## [1] 41157 6
## 'data.frame': 41157 obs. of 6 variables:
## $ UserName : int 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 ...
## $ ScreenName : int 48751 48752 48753 48754 48755 48756 48757 48758 48759 48760 ...
## $ Location : chr "London" "UK" "Vagabonds" "" ...
## $ TweetAt : chr "16-03-2020" "16-03-2020" "16-03-2020" "16-03-2020" ...
## $ OriginalTweet: chr "@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8" "advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neigh"| __truncated__ "Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak htt"| __truncated__ "My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD F"| __truncated__ ...
## $ Sentiment : chr "Neutral" "Positive" "Positive" "Positive" ...
##
## Extremely Negative Extremely Positive Negative Neutral
## 5481 6624 9917 7713
## Positive
## 11422
## [1] "Neutral" "Positive" "Extremely Negative"
## [4] "Negative" "Extremely Positive"
2. Randomize the order of the rows.
3. (1pt) Convert sentiment into a factor variable with three levels: “positive, “neutral”, and “negative”. You can do this by labeling all “positive” and “extremely positive” tweets as “positive” and all “negative” and “extremely negative” tweets as “negative”. Now take the “summary” of sentiment to see how many observations/tweets you have for each label.
library(tidyverse)
# created age group
df_tweets <- df_tweets %>%
mutate(Sentiment = case_when(Sentiment == "Extremely Positive" | Sentiment == "Positive"~ "Positive",
Sentiment == "Extremely Negative" | Sentiment == "Negative"~ "Negative",
Sentiment == "Neutral" ~ "Neutral"
))##
## Negative Neutral Positive
## 15398 7713 18046
## Negative 15398
## Neutral 7713
## Positive 18046
Create a text corpus from OriginalTweet variable. Then clean the corpus, that is convert all tweets to lowercase, stem and remove stop words, punctuations, and additional white spaces.
library(tm)
library(SnowballC)
#corpus <- iconv(df_tweets$Sentiment, to = "utf-8")
# Create corpus variable
tweet_corpus<-VCorpus(VectorSource(df_tweets$Sentiment))
# Stem document
sms_corpus_clean<-tm_map(tweet_corpus, stemDocument)
# convert to lower case
tweet_corpus_clean<-tm_map(tweet_corpus,content_transformer(tolower))
# Remove punctuation
tweet_corpus_clean <- tm_map(tweet_corpus_clean, removePunctuation)
# Remove whitespace
tweet_corpus_clean<-tm_map(tweet_corpus_clean, stripWhitespace)5. Create separate wordclouds for “positive” and “negative” tweets (set max.words=100 to only show the 100 most frequent words) Is there any visible difference between the frequent words in “positive” vs “negative” tweets?
library(wordcloud)
library(RColorBrewer)
positive <-subset(df_tweets, Sentiment == "Positive")
negative <-subset(df_tweets, Sentiment == "Negative")
pal <- brewer.pal(8,"Dark2")
wordcloud(positive$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))wordcloud(negative$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))For postive wordcloud most frequant words are Coronavirus, covid 19, supermarket, grocery, food, help, online, socialdistancing etc.
For postive wordcloud most frequant words are covid 19, panic, lockdown, pandemic, stop, crisis, empty etc.
6. Create a document-term matrix from the cleaned corpus. Then split the data into train and test sets. Use 80% of samples (roughly 32925 rows ) for training and the rest for testing.
# create document-term matrix
tweet_dtm<-tm::DocumentTermMatrix(tweet_corpus_clean)
# Split data in to train and test
split <- round(dim(tweet_dtm)[1] * 0.8)
tweet_dtm_train<-tweet_dtm[1:split-1, ]
tweet_dtm_test<-tweet_dtm[split:dim(tweet_dtm)[1], ]
cat("Length of train data : ", dim(tweet_dtm_train)[1], "\n")## Length of train data : 32925
## Length of test data : 8232
# The next two commands extract the labels from the raw data:
tweet_train_labels <- df_tweets[1:split-1, ]$Sentiment
tweet_test_labels <- df_tweets[split:dim(tweet_dtm)[1], ]$Sentiment7. Remove the words that appear less than 100 times in the training data. Convert frequencies in the document-term matrix to binary yes/no features.
# Create variable to get words atleast 100
tweet_freq_words<-findFreqTerms(tweet_dtm_train, 100)
tweet_freq_words_test<-findFreqTerms(tweet_dtm_test, 100)
# Apply on train and test set
tweet_dtm_freq_train<-tweet_dtm_train[ , tweet_freq_words]
tweet_dtm_freq_test<-tweet_dtm_test[ , tweet_freq_words_test]
# Function to convert frequancy to Yes or no
convert_counts<-function(x) {
x <-ifelse(x > 0, "Yes", "No")
}
tweet_train<-apply(tweet_dtm_freq_train, MARGIN = 2, convert_counts)
tweet_test<-apply(tweet_dtm_freq_test, MARGIN = 2, convert_counts)
# Display first 6 rows of train and test set
head(tweet_train)## Terms
## Docs negative neutral positive
## 1 "Yes" "No" "No"
## 2 "No" "No" "Yes"
## 3 "No" "No" "Yes"
## 4 "Yes" "No" "No"
## 5 "No" "No" "Yes"
## 6 "No" "Yes" "No"
## Terms
## Docs negative neutral positive
## 32926 "No" "Yes" "No"
## 32927 "Yes" "No" "No"
## 32928 "Yes" "No" "No"
## 32929 "Yes" "No" "No"
## 32930 "No" "No" "Yes"
## 32931 "No" "No" "Yes"
Train a Naïve Bayes classifier on the training data and evaluate its performance on the test data. Be patient, training and testing will take a while to run. Answer the following questions:
library(e1071)
library(gmodels)
# Create model using train data
tweet_classifier<-naiveBayes(tweet_train, tweet_train_labels)
# Test the model using test data
tweet_test_pred<-predict(tweet_classifier, tweet_test)
# model evaluation
CrossTable(tweet_test_pred, tweet_test_labels, prop.chisq= FALSE, prop.t = FALSE, dnn= c('predicted', 'actual'))##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 8232
##
##
## | actual
## predicted | Negative | Neutral | Positive | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## Negative | 3168 | 0 | 0 | 3168 |
## | 1.000 | 0.000 | 0.000 | 0.385 |
## | 1.000 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## Neutral | 0 | 1504 | 0 | 1504 |
## | 0.000 | 1.000 | 0.000 | 0.183 |
## | 0.000 | 1.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|
## Positive | 0 | 0 | 3560 | 3560 |
## | 0.000 | 0.000 | 1.000 | 0.432 |
## | 0.000 | 0.000 | 1.000 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 3168 | 1504 | 3560 | 8232 |
## | 0.385 | 0.183 | 0.432 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
Overall accuracy of the model is 100% which is quite unlikely. The crosstab shows that the Bayes classifier made 0 mistakes, corresponding to an error rate of 0%.
Accuracy of the model in each category is 100%.