Problem 1: Applying k-Nearest Neighbors to predict income

For this assignment, we will be using the census adult dataset from the UCI ML repository. The Adult dataset was extracted by Barry Becker from the 1994 US Census database. Each row in the dataset has de-identified demographic information about an individual worker and their income. Income is a categorical variable with two levels: <=50K and >50K. The goal of this assignment is to create a binary classifier that predicts whether a person makes more than 50K based on the other attributes in the dataset.

Please see the description of the dataset and its attributes here: https://archive.ics.uci.edu/ml/datasets/Adult. Then go to the data folder at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/ and download adult.data. This is the dataset you will use to answer the following questions.

1. Download the dataset and store it in a dataframe in R. The dataset does not have a header, so you should add the headers manually to your dataframe based on the list of attributes provided at https://archive.ics.uci.edu/ml/datasets/Adult. Also note that some entries have extra white space; to read the data properly, use the option strip.white=TRUE in the read.csv function.

# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/adult (1).data", header = F, sep = ",", na.strings = "?", strip.white=TRUE)

# Manually assign the header names
colnames(df) <- c("age", "workclass", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex", "capital-gain", "capital-loss", "hours-per-week", "native-country", "income")

head(df)
##   age        workclass fnlwgt education education-num     marital-status
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital-gain capital-loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hours-per-week native-country income
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K
## 6             40  United-States  <=50K
dim(df)
## [1] 32561    15

2. Explore the overall structure of the dataset using the str() function. Get summary statistics for each variable. How many categorical and numeric variables do you have in your data? Are there any missing values?

# Structure of the dataset
str(df)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : chr  "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : chr  "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education-num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital-status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation    : chr  "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship  : chr  "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race          : chr  "White" "White" "White" "Black" ...
##  $ sex           : chr  "Male" "Male" "Male" "Male" ...
##  $ capital-gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital-loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours-per-week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native-country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
# Summary statistics 
summary(df)
##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education-num   marital-status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital-gain    capital-loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours-per-week  native-country        income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00
# Total number of missing (NA) values, if any exist
sum(is.na(df))
## [1] 4262

There are 9 categorical variables in the dataset: workclass, education, marital-status, occupation, relationship, race, sex, native-country, and income.

There are 6 numerical variables: age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week.

The dataset has 4262 missing values.
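
To see which variables contain the missing values, a per-column count helps. In this dataset the "?" entries fall in workclass, occupation, and native-country (1836 + 1843 + 583 = 4262, matching the total above):

# Count missing values in each column
colSums(is.na(df))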

3. Get the frequency table of the “income” variable to see how many observations you have in each category of the income variable. Is the data balanced? Do we have an equal number of samples in each class of income?

table(df$income)
## 
## <=50K  >50K 
## 24720  7841

We do not have an equal number of samples in each class of income (24720 vs. 7841), so the data is not balanced.
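
The imbalance is easier to see as proportions; a one-line check:

# Class proportions: roughly 76% <=50K vs. 24% >50K
prop.table(table(df$income))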

4. Explore the data in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income?

• To explore the relationship between numerical features and the “income” variable, you can use side-by-side box plots and t.test.

• To explore the relationship between categorical features and the “income” variable, you can use a frequency table and a chi-square test (note that the chi-square test might throw a warning if there are cells whose expected counts in the frequency table are less than 5. This warning means the p-values reported from the chi-square test may be incorrect due to low counts and are not reliable. You can ignore the warning for this assignment). A sketch of the categorical case appears after the numeric example below.

library(dplyr)
library(tidyr)
library(ggplot2)

# Side-by-side boxplot of age by income group
age_income_plot <- ggplot(df, aes(x = income, y = age)) + geom_boxplot()
age_income_plot

# t-test between age and income
t.test(age~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  age by income
## t = -50.264, df = 17411, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.757250 -7.174955
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            36.78374            44.24984

The boxplot shows that the median age of the <=50K group is lower than that of the >50K group, and the t-test confirms that the difference in mean age (36.8 vs. 44.2) is highly significant (p < 2.2e-16), so age appears useful for predicting income.

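The plot and test above cover the numeric case. For a categorical feature, a minimal sketch of the frequency-table and chi-square approach, using workclass as an example (the other categorical columns work the same way):

# Frequency table of workclass by income
workclass_income <- table(df$workclass, df$income)
workclass_income

# Chi-square test of independence between workclass and income
chisq.test(workclass_income)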

Problem 2: Applying Naïve Bayes classifier to sentiment classification of COVID tweets

For this problem you are going to use the corona_nlp_train.csv dataset, a collection of tweets pulled from Twitter and manually labeled as “extremely positive”, “positive”, “neutral”, “negative”, or “extremely negative”. The dataset is from this Kaggle project (https://www.kaggle.com/kerneler/starter-covid-19-nlp-text-d3a3baa6-e/data). I have attached the data to this assignment spec and you can download it directly from Canvas.

1. (1pt) Read the data and store it in a dataframe. Take a look at the structure of the data and its variables. We will be working with only two variables: OriginalTweet and Sentiment. OriginalTweet is text and Sentiment is a categorical variable with five levels: “extremely positive”, “positive”, “neutral”, “negative”, and “extremely negative”.

# Read the data
df_tweets <- read.csv('/Users/subhalaxmirout/CSC 532 - ML/Corona_NLP_train.csv', header = T, encoding="utf-8", stringsAsFactors= FALSE)

# Shows first 6 rows of the tweets data
head(df_tweets)
##   UserName ScreenName                 Location    TweetAt
## 1     3799      48751                   London 16-03-2020
## 2     3800      48752                       UK 16-03-2020
## 3     3801      48753                Vagabonds 16-03-2020
## 4     3802      48754                          16-03-2020
## 5     3803      48755                          16-03-2020
## 6     3804      48756 ÜT: 36.319708,-82.363649 16-03-2020
##                                                                                                                                                                                                                                                                                                                        OriginalTweet
## 1                                                                                                                                                                                                                    @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
## 2                                                                                      advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order
## 3                                                                                                                                                                                                Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P
## 4  My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \n\n\nStay calm, stay safe.\n\n\n\n\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
## 5 Me, ready to go at supermarket during the #COVID19 outbreak.\n\n\n\n\n\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\n\n\n\n\n\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n
## 6                                                                     As news of the region\u0092s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU
##            Sentiment
## 1            Neutral
## 2           Positive
## 3           Positive
## 4           Positive
## 5 Extremely Negative
## 6           Positive
# Get the number of rows and columns
dim(df_tweets)
## [1] 41157     6
# Structure of the dataset
str(df_tweets)
## 'data.frame':    41157 obs. of  6 variables:
##  $ UserName     : int  3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 ...
##  $ ScreenName   : int  48751 48752 48753 48754 48755 48756 48757 48758 48759 48760 ...
##  $ Location     : chr  "London" "UK" "Vagabonds" "" ...
##  $ TweetAt      : chr  "16-03-2020" "16-03-2020" "16-03-2020" "16-03-2020" ...
##  $ OriginalTweet: chr  "@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8" "advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neigh"| __truncated__ "Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak htt"| __truncated__ "My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD F"| __truncated__ ...
##  $ Sentiment    : chr  "Neutral" "Positive" "Positive" "Positive" ...
# Get the sentiment type count
table(df_tweets$Sentiment)
## 
## Extremely Negative Extremely Positive           Negative            Neutral 
##               5481               6624               9917               7713 
##           Positive 
##              11422
unique(df_tweets$Sentiment)
## [1] "Neutral"            "Positive"           "Extremely Negative"
## [4] "Negative"           "Extremely Positive"

2. Randomize the order of the rows.

# Shuffle the rows; setting a seed first makes the shuffle reproducible (123 is an arbitrary choice)
set.seed(123)
df_tweets <- df_tweets[sample(nrow(df_tweets)), ]

3. (1pt) Convert Sentiment into a factor variable with three levels: “positive”, “neutral”, and “negative”. You can do this by labeling all “positive” and “extremely positive” tweets as “positive” and all “negative” and “extremely negative” tweets as “negative”. Now take the summary of Sentiment to see how many observations/tweets you have for each label.

library(tidyverse)

# Collapse the five sentiment levels into three and convert to a factor
df_tweets <- df_tweets %>%
  mutate(Sentiment = factor(case_when(Sentiment == "Extremely Positive" | Sentiment == "Positive" ~ "Positive",
                                      Sentiment == "Extremely Negative" | Sentiment == "Negative" ~ "Negative",
                                      Sentiment == "Neutral" ~ "Neutral")))
# Get the sentiment type count
table(df_tweets$Sentiment)
## 
## Negative  Neutral Positive 
##    15398     7713    18046
# Print the results
cat("Negative", table(df_tweets$Sentiment)[1], "\n")
## Negative 15398
cat("Neutral", table(df_tweets$Sentiment)[2], "\n")
## Neutral 7713
cat("Positive", table(df_tweets$Sentiment)[3], "\n")
## Positive 18046
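
Because Sentiment is now a factor, the summary() the question asks for gives the same per-level counts as the table above:

# Summary of a factor gives the count per level
summary(df_tweets$Sentiment)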

4. Create a text corpus from the OriginalTweet variable. Then clean the corpus: convert all tweets to lowercase, stem the words, and remove stop words, punctuation, and extra whitespace.

library(tm)
library(SnowballC)

# Create a corpus from the tweet text
tweet_corpus <- VCorpus(VectorSource(df_tweets$OriginalTweet))
# Convert to lower case
tweet_corpus_clean <- tm_map(tweet_corpus, content_transformer(tolower))
# Remove stop words
tweet_corpus_clean <- tm_map(tweet_corpus_clean, removeWords, stopwords("english"))
# Remove punctuation
tweet_corpus_clean <- tm_map(tweet_corpus_clean, removePunctuation)
# Stem the words
tweet_corpus_clean <- tm_map(tweet_corpus_clean, stemDocument)
# Strip extra whitespace
tweet_corpus_clean <- tm_map(tweet_corpus_clean, stripWhitespace)
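
To verify the cleaning worked, it helps to compare a few documents before and after; a quick check (the indices are arbitrary):

# Inspect the first two tweets before and after cleaning
lapply(tweet_corpus[1:2], as.character)
lapply(tweet_corpus_clean[1:2], as.character)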

5. Create separate wordclouds for “positive” and “negative” tweets (set max.words=100 to only show the 100 most frequent words). Is there any visible difference between the frequent words in “positive” vs. “negative” tweets?

library(wordcloud)
library(RColorBrewer)

positive <-subset(df_tweets, Sentiment == "Positive")
negative <-subset(df_tweets, Sentiment == "Negative")
pal <- brewer.pal(8,"Dark2")
wordcloud(positive$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))

wordcloud(negative$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))

In the positive wordcloud, the most frequent words are coronavirus, covid 19, supermarket, grocery, food, help, online, socialdistancing, etc.

In the negative wordcloud, the most frequent words are covid 19, panic, lockdown, pandemic, stop, crisis, empty, etc.

6. Create a document-term matrix from the cleaned corpus. Then split the data into train and test sets. Use 80% of the samples (roughly 32925 rows) for training and the rest for testing.

# create document-term matrix
tweet_dtm<-tm::DocumentTermMatrix(tweet_corpus_clean)
# Split the data into train (first 80%) and test (remaining 20%) sets
split <- round(nrow(tweet_dtm) * 0.8)
tweet_dtm_train <- tweet_dtm[1:(split - 1), ]
tweet_dtm_test <- tweet_dtm[split:nrow(tweet_dtm), ]

cat("Length of train data : ", dim(tweet_dtm_train)[1], "\n")
## Length of train data :  32925
cat("Length of test data :", dim(tweet_dtm_test)[1], "\n")
## Length of test data : 8232
# The next two commands extract the labels from the raw data:
tweet_train_labels <- df_tweets[1:(split - 1), ]$Sentiment
tweet_test_labels <- df_tweets[split:nrow(df_tweets), ]$Sentiment
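
Because the split is positional, it is worth confirming that the shuffled labels have similar distributions in the train and test sets; a quick sanity check on the label vectors just created:

# Compare class proportions in the train and test labels
prop.table(table(tweet_train_labels))
prop.table(table(tweet_test_labels))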

7. Remove the words that appear fewer than 100 times in the training data. Convert the frequencies in the document-term matrix to binary yes/no features.

# Keep only terms that appear at least 100 times in the training data
tweet_freq_words <- findFreqTerms(tweet_dtm_train, 100)

# Restrict both train and test sets to the training vocabulary so the features align
tweet_dtm_freq_train <- tweet_dtm_train[ , tweet_freq_words]
tweet_dtm_freq_test <- tweet_dtm_test[ , tweet_freq_words]

# Function to convert a term frequency to "Yes"/"No"
convert_counts <- function(x) {
  ifelse(x > 0, "Yes", "No")
}

tweet_train<-apply(tweet_dtm_freq_train, MARGIN = 2, convert_counts)
tweet_test<-apply(tweet_dtm_freq_test, MARGIN = 2, convert_counts)

# Display first 6 rows of train and test set
head(tweet_train)
##     Terms
## Docs negative neutral positive
##    1 "Yes"    "No"    "No"    
##    2 "No"     "No"    "Yes"   
##    3 "No"     "No"    "Yes"   
##    4 "Yes"    "No"    "No"    
##    5 "No"     "No"    "Yes"   
##    6 "No"     "Yes"   "No"
head(tweet_test)
##        Terms
## Docs    negative neutral positive
##   32926 "No"     "Yes"   "No"    
##   32927 "Yes"    "No"    "No"    
##   32928 "Yes"    "No"    "No"    
##   32929 "Yes"    "No"    "No"    
##   32930 "No"     "No"    "Yes"   
##   32931 "No"     "No"    "Yes"

8. Train a Naïve Bayes classifier on the training data and evaluate its performance on the test data. Be patient; training and testing will take a while to run. Answer the following questions:

  • What is the overall accuracy of the model? (the percentage of correct predictions)
  • What is the accuracy of the model in each category (negative, positive, neutral)?

library(e1071)
library(gmodels)

# Create model using train data
tweet_classifier <- naiveBayes(tweet_train, tweet_train_labels)
# Test the model using test data
tweet_test_pred <- predict(tweet_classifier, tweet_test)
# Model evaluation
CrossTable(tweet_test_pred, tweet_test_labels, prop.chisq = FALSE, prop.t = FALSE, dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  8232 
## 
##  
##              | actual 
##    predicted |  Negative |   Neutral |  Positive | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##     Negative |      3168 |         0 |         0 |      3168 | 
##              |     1.000 |     0.000 |     0.000 |     0.385 | 
##              |     1.000 |     0.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##      Neutral |         0 |      1504 |         0 |      1504 | 
##              |     0.000 |     1.000 |     0.000 |     0.183 | 
##              |     0.000 |     1.000 |     0.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
##     Positive |         0 |         0 |      3560 |      3560 | 
##              |     0.000 |     0.000 |     1.000 |     0.432 | 
##              |     0.000 |     0.000 |     1.000 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |      3168 |      1504 |      3560 |      8232 | 
##              |     0.385 |     0.183 |     0.432 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
  • What is the overall accuracy of the model? (the percentage of correct predictions)

The crosstab reports an overall accuracy of 100% (zero errors out of 8232 test tweets), which is not a plausible result for real tweet text. The head(tweet_train) output above shows why: the only terms in the document-term matrix are "negative", "neutral", and "positive", meaning the labels themselves leaked into the features (this is what happens when the corpus is built from the Sentiment column instead of OriginalTweet). With the corpus built from OriginalTweet as in step 4, accuracy will be well below 100%.

  • What is the accuracy of the model in each category (negative, positive, neutral)?

The per-category accuracy shown is likewise 100% for every class. These numbers reflect the label leakage rather than real predictive performance and should be recomputed once the corpus comes from the tweet text.
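
Rather than reading the rates off the crosstab, the overall and per-class accuracies can be computed directly from the confusion matrix; a minimal sketch using the prediction and label vectors above:

# Confusion matrix: rows = predicted, columns = actual
conf <- table(tweet_test_pred, tweet_test_labels)

# Overall accuracy: correct predictions / total predictions
sum(diag(conf)) / sum(conf)

# Per-class accuracy (recall): correct predictions per class / actual class counts
diag(conf) / colSums(conf)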