Problem 1: Applying k-Nearest Neighbors to predict income

For this assignment, we will be using the census adult dataset from UCI ML repository. The Adult dataset was extracted by Barry Becker from the 1994 US Census Database. Each row in the dataset has de-identified dempgraphic information of an individual worker and their income. The income is a categorical variable with two levels: <50K and >50K. The goal of this assignment is to create a binary classifier to predict whether a person makes more than 50K based on the other attributes in the dataset.

Please see the description of dataset and its attributes here: https://archive.ics.uci.edu/ml/datasets/Adult Then go to data folder at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/ and download adult.data . This would be the dataset you will use to answer the following questions.

1. Download the dataset and store it in a dataframe in R. The dataset does not have header, you should add the headers manually to your dataframe based on the list of attributes provided in https://archive.ics.uci.edu/ml/datasets/Adult. Also please note that some entries have extra white space. So to read the data properly, use the option strip.white=TRUE in read.csv function.

# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/adult (1).data", header = F, sep = ",", na.strings = "?", strip.white=TRUE)

# Manually assign the header names
colnames(df) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")

head(df)
##   age        workclass fnlwgt education education_num     marital_status
## 1  39        State-gov  77516 Bachelors            13      Never-married
## 2  50 Self-emp-not-inc  83311 Bachelors            13 Married-civ-spouse
## 3  38          Private 215646   HS-grad             9           Divorced
## 4  53          Private 234721      11th             7 Married-civ-spouse
## 5  28          Private 338409 Bachelors            13 Married-civ-spouse
## 6  37          Private 284582   Masters            14 Married-civ-spouse
##          occupation  relationship  race    sex capital_gain capital_loss
## 1      Adm-clerical Not-in-family White   Male         2174            0
## 2   Exec-managerial       Husband White   Male            0            0
## 3 Handlers-cleaners Not-in-family White   Male            0            0
## 4 Handlers-cleaners       Husband Black   Male            0            0
## 5    Prof-specialty          Wife Black Female            0            0
## 6   Exec-managerial          Wife White Female            0            0
##   hours_per_week native_country income
## 1             40  United-States  <=50K
## 2             13  United-States  <=50K
## 3             40  United-States  <=50K
## 4             40  United-States  <=50K
## 5             40           Cuba  <=50K
## 6             40  United-States  <=50K
dim(df)
## [1] 32561    15

2. Explore the overall structure of the dataset using the str() function. Get a summary statistics of each variable. How many categorical and numeric variables you have in your data? Is there any missing values?

# Structure of the dataset
str(df)
## 'data.frame':    32561 obs. of  15 variables:
##  $ age           : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : chr  "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
##  $ education     : chr  "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: chr  "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation    : chr  "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship  : chr  "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race          : chr  "White" "White" "White" "Black" ...
##  $ sex           : chr  "Male" "Male" "Male" "Male" ...
##  $ capital_gain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capital_loss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: int  40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: chr  "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr  "<=50K" "<=50K" "<=50K" "<=50K" ...
# Summary statistics 
summary(df)
##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education_num   marital_status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital_gain    capital_loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours_per_week  native_country        income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00
# Sum of the NA if exist
sum(is.na(df))
## [1] 4262

There are 9 categorical variables in the dataset i.e workclass, education, marital-status, occupation, relationship, race, sex, native-country, and income.

There are 6 numerical variables i.e age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week

Dataset have 4262 missing data.

3. Get the frequency table of the “income” variable to see how many observations you have in each category of the income variable. Is the data balanced? Do we have equal number of samples in each classof income?

table(df$income)
## 
## <=50K  >50K 
## 24720  7841

we do not have an equal number of samples in each classof income. So the data is not balanced.

4. Explore the data in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income.

• To explore the relationship between numerical features and “income” variable, you can use side by side box plot and t.test

• To explore the relationship between categorical features and “income” variable, you can use frequency table and chisquare test (note that chisquare test might throw a warning if there are cells whose expected counts in the frequency table is less 5. This warning means the p-values reported from chisquare test may be incorrect due to low counts and are not reliable. You can ignore the warning for this assignment).

library(dplyr)
library(tidyr)
library(ggplot2)

# Create the boxplot
GenderPlot_age = ggplot(df, aes(x = income, y = age)) + geom_boxplot() 
GenderPlot_age

# t-test between age and income
t.test(age~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  age by income
## t = -50.264, df = 17411, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.757250 -7.174955
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            36.78374            44.24984

<=50K median is lower age than >50k.

GenderPlot_fnlwgt = ggplot(df, aes(x = income, y = fnlwgt)) + geom_boxplot() 
GenderPlot_fnlwgt

# t-test between fnlwgt and income
t.test(fnlwgt~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  fnlwgt by income
## t = 1.7412, df = 13615, p-value = 0.08167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -293.7029 4965.4332
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            190340.9            188005.0
GenderPlot_educationnum = ggplot(df, aes(x = income, y = education_num)) + geom_boxplot() 
GenderPlot_educationnum

# t-test between education_num and income
t.test(education_num~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  education_num by income
## t = -64.896, df = 13421, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.077502 -1.955682
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            9.595065           11.611657
GenderPlot_capital_gain = ggplot(df, aes(x = income, y = capital_gain)) + geom_boxplot() 
GenderPlot_capital_gain

# t-test between capital_gain and income
t.test(capital_gain~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  capital_gain by income
## t = -23.427, df = 7861.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4180.166 -3534.614
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            148.7525           4006.1425
GenderPlot_capital_loss = ggplot(df, aes(x = income, y = capital_loss)) + geom_boxplot() 
GenderPlot_capital_loss

# t-test between capital_loss and income
t.test(capital_loss~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  capital_loss by income
## t = -20.238, df = 9231.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -155.5985 -128.1187
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            53.14292           195.00153
GenderPlot_hours_per_week = ggplot(df, aes(x = income, y = hours_per_week)) + geom_boxplot() 
GenderPlot_hours_per_week

# t-test between hours_per_week and income
t.test(hours_per_week~income,alternative="two.sided", data=df)
## 
##  Welch Two Sample t-test
## 
## data:  hours_per_week by income
## t = -45.123, df = 14570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -6.920943 -6.344690
## sample estimates:
## mean in group <=50K  mean in group >50K 
##            38.84021            45.47303

Test for categorical variables:

Workclass

table_workclass <- table(df$workclass, df$income)
table_workclass
##                   
##                    <=50K  >50K
##   Federal-gov        589   371
##   Local-gov         1476   617
##   Never-worked         7     0
##   Private          17733  4963
##   Self-emp-inc       494   622
##   Self-emp-not-inc  1817   724
##   State-gov          945   353
##   Without-pay         14     0
mosaicplot(table_workclass, ylab= "income", xlab="workclass", main = "workclass vs income", shade=TRUE)

chisq.test(table_workclass)
## 
##  Pearson's Chi-squared test
## 
## data:  table_workclass
## X-squared = 827.72, df = 7, p-value < 2.2e-16

education

table_education <- table(df$education, df$income)
table_education
##               
##                <=50K >50K
##   10th           871   62
##   11th          1115   60
##   12th           400   33
##   1st-4th        162    6
##   5th-6th        317   16
##   7th-8th        606   40
##   9th            487   27
##   Assoc-acdm     802  265
##   Assoc-voc     1021  361
##   Bachelors     3134 2221
##   Doctorate      107  306
##   HS-grad       8826 1675
##   Masters        764  959
##   Preschool       51    0
##   Prof-school    153  423
##   Some-college  5904 1387
mosaicplot(table_education, ylab= "income", xlab="education", main = "education vs income", shade=TRUE)

chisq.test(table_education)
## 
##  Pearson's Chi-squared test
## 
## data:  table_education
## X-squared = 4429.7, df = 15, p-value < 2.2e-16

marital_status

table_marital_status <- table(df$marital_status, df$income)
table_marital_status
##                        
##                         <=50K  >50K
##   Divorced               3980   463
##   Married-AF-spouse        13    10
##   Married-civ-spouse     8284  6692
##   Married-spouse-absent   384    34
##   Never-married         10192   491
##   Separated               959    66
##   Widowed                 908    85
mosaicplot(table_marital_status, ylab= "income", xlab="marital_status", main = "marital_status vs income", shade=TRUE)

chisq.test(table_marital_status)
## 
##  Pearson's Chi-squared test
## 
## data:  table_marital_status
## X-squared = 6517.7, df = 6, p-value < 2.2e-16

occupation

table_occupation <- table(df$occupation, df$income)
table_occupation
##                    
##                     <=50K >50K
##   Adm-clerical       3263  507
##   Armed-Forces          8    1
##   Craft-repair       3170  929
##   Exec-managerial    2098 1968
##   Farming-fishing     879  115
##   Handlers-cleaners  1284   86
##   Machine-op-inspct  1752  250
##   Other-service      3158  137
##   Priv-house-serv     148    1
##   Prof-specialty     2281 1859
##   Protective-serv     438  211
##   Sales              2667  983
##   Tech-support        645  283
##   Transport-moving   1277  320
mosaicplot(table_occupation, ylab= "income", xlab="occupation", main = "occupation vs income", shade=TRUE)

chisq.test(table_occupation)
## 
##  Pearson's Chi-squared test
## 
## data:  table_occupation
## X-squared = 3744.9, df = 13, p-value < 2.2e-16

relationship

table_relationship <- table(df$relationship, df$income)
table_relationship
##                 
##                  <=50K >50K
##   Husband         7275 5918
##   Not-in-family   7449  856
##   Other-relative   944   37
##   Own-child       5001   67
##   Unmarried       3228  218
##   Wife             823  745
mosaicplot(table_relationship, ylab= "income", xlab="relationship", main = "relationship vs income", shade=TRUE)

chisq.test(table_relationship)
## 
##  Pearson's Chi-squared test
## 
## data:  table_relationship
## X-squared = 6699.1, df = 5, p-value < 2.2e-16

race

table_race <- table(df$race, df$income)
table_race
##                     
##                      <=50K  >50K
##   Amer-Indian-Eskimo   275    36
##   Asian-Pac-Islander   763   276
##   Black               2737   387
##   Other                246    25
##   White              20699  7117
mosaicplot(table_race, ylab= "income", xlab="race", main = "race vs income", shade=TRUE)

chisq.test(table_race)
## 
##  Pearson's Chi-squared test
## 
## data:  table_race
## X-squared = 330.92, df = 4, p-value < 2.2e-16

Sex

table_sex <- table(df$sex, df$income)
table_sex
##         
##          <=50K  >50K
##   Female  9592  1179
##   Male   15128  6662
mosaicplot(table_sex, ylab= "income", xlab="sex", main = "Sex vs income", shade=TRUE)

chisq.test(table_sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_sex
## X-squared = 1517.8, df = 1, p-value < 2.2e-16

native_country

table_native_country <- table(df$native_country, df$income)
table_native_country
##                             
##                              <=50K  >50K
##   Cambodia                      12     7
##   Canada                        82    39
##   China                         55    20
##   Columbia                      57     2
##   Cuba                          70    25
##   Dominican-Republic            68     2
##   Ecuador                       24     4
##   El-Salvador                   97     9
##   England                       60    30
##   France                        17    12
##   Germany                       93    44
##   Greece                        21     8
##   Guatemala                     61     3
##   Haiti                         40     4
##   Holand-Netherlands             1     0
##   Honduras                      12     1
##   Hong                          14     6
##   Hungary                       10     3
##   India                         60    40
##   Iran                          25    18
##   Ireland                       19     5
##   Italy                         48    25
##   Jamaica                       71    10
##   Japan                         38    24
##   Laos                          16     2
##   Mexico                       610    33
##   Nicaragua                     32     2
##   Outlying-US(Guam-USVI-etc)    14     0
##   Peru                          29     2
##   Philippines                  137    61
##   Poland                        48    12
##   Portugal                      33     4
##   Puerto-Rico                  102    12
##   Scotland                       9     3
##   South                         64    16
##   Taiwan                        31    20
##   Thailand                      15     3
##   Trinadad&Tobago               17     2
##   United-States              21999  7171
##   Vietnam                       62     5
##   Yugoslavia                    10     6
mosaicplot(table_native_country, ylab= "income", xlab="native_country", main = "native_country vs income", shade=TRUE)

chisq.test(table_native_country)
## 
##  Pearson's Chi-squared test
## 
## data:  table_native_country
## X-squared = 317.09, df = 40, p-value < 2.2e-16
    1. age
    1. education
    1. marital-status
    1. occupation
    1. hours-per-week
    1. Sex
    1. workclass
    1. native country

These attributes have been selected as they are likely to have a strong correlation with the target variable (income), have low correlation with each other, are easily interpretable, and do not require complex computations. By using these attributes, we can build a model that is both accurate and efficient, while also being interpretable and understandable. Age can be a good indicator of experience and seniority, both of which are often correlated with higher income. Education is a good predictor of income as higher levels of education often lead to better job opportunities and higher salaries. Marital status and occupation can also provide insights into a person’s financial situation and their earning potential. Finally, hours-per-week can be a good indicator of full-time versus part-time employment, with full-time employees typically earning more.

Remove remaining columns from the dataset.

drop <- c("fnlwgt", "education_num", "relationship", "race","capital_gain" ,"capital_loss")
df_new = df[,!(names(df) %in% drop)]

5. An initial data exploration shows that the missing values in the dataset are denoted by “?” not NA. Change all the “?” characters in the dataframe to NA

# Replace all "?" characters with NA
df_new[df_new == "?"] <- NA

6. Use the command colSums(is.na() to get the number of missing values in each column of your dataframe. Which columns have missing values?

colSums(is.na(df_new))
##            age      workclass      education marital_status     occupation 
##              0           1836              0              0           1843 
##            sex hours_per_week native_country         income 
##              0              0            583              0

workclass, occupation and native_country has NA values.

7. There are several ways we can deal with missing values. The easiest approach is to remove all the rows with missing values. However, if a large number of rows have missing values removing them will result in loss of information and may affect the classifier performance. If a large number of rows have missing values, then it is typically better to replace missing data with some values. This is called data imputation. Several methods for missing data imputation exist. The most naïve method (which we will use here) is to replace the missing values with mean of the column (for a numerical column) or mode/majority value of the column (for a categorical column). We will use a more advanced data imputation method in a later module. For now, replace the missing values in a numerical column with the mean of the column and the missing values in a categorical column with the mode/majority of the column. After imputation, use colSums(is.na() to make sure that your dataframe no longer has missing values.

mode_workclass <- names(which.max(table(df_new$workclass)))
df_new$workclass[is.na(df_new$workclass)] <- mode_workclass

mode_occupation <- names(which.max(table(df_new$occupation)))
df_new$occupation[is.na(df_new$occupation)] <- mode_occupation

mode_native_country <- names(which.max(table(df_new$native_country)))
df_new$native_country[is.na(df_new$native_country)] <- mode_native_country

# Check for missing values
colSums(is.na(df_new))
##            age      workclass      education marital_status     occupation 
##              0              0              0              0              0 
##            sex hours_per_week native_country         income 
##              0              0              0              0

8. Set the seed of the random number generator to a fixed integer, say 1, so that I can reproduce your work:

set.seed(1)

9. Randomize the order of the rows in the dataset.

df_new = df_new[sample(nrow(df_new),replace =FALSE),]

10. This dataset has several categorical variables. With the exception of few models ( such as Naiive Bayes and tree-based models) most machine learning models require numeric features and cannot work directly with categorical data. One way to deal with categorical variables is to assign numeric indices to each level. However, this imposes an artificial ordering on an unordered categorical variable. For example, suppose that we have a categorical variable primary color with three levels: “red”,”blue”,”green”. If we convert “red” to 0 , “blue” to 1 and “green” to 2 then we are telling our model that red < blue< green which is not correct. A better way to encode an unordered categorical variable is to do one-hot-encoding. In one hot-encoding we create a dummy binary variable for each level of a categorical variable. For example we can represent the primary color variable by three binary dummy variables, one for each color (red, blue, and green) .If the color is red, then the variable red takes value 1 while blue and green both take the value zero.

library(mltools)
library(data.table)

 # Convert all columns to factor
df_copy <- as.data.frame(unclass(df_new), stringsAsFactors = TRUE)
df_final <- as.data.table(df_copy)
df_final_new <- one_hot(df_final, cols = c("workclass", "education", "marital_status", "occupation", "sex", "native_country"), dropUnusedLevels = T)
df_final_new <- as.data.frame(df_final_new)
head(df_final_new)
##   age workclass_Federal-gov workclass_Local-gov workclass_Never-worked
## 1  51                     0                   0                      0
## 2  38                     0                   0                      0
## 3  30                     0                   0                      0
## 4  38                     0                   0                      0
## 5  21                     0                   0                      0
## 6  34                     0                   0                      0
##   workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc
## 1                 1                      0                          0
## 2                 1                      0                          0
## 3                 1                      0                          0
## 4                 1                      0                          0
## 5                 1                      0                          0
## 6                 1                      0                          0
##   workclass_State-gov workclass_Without-pay education_10th education_11th
## 1                   0                     0              0              0
## 2                   0                     0              0              0
## 3                   0                     0              0              0
## 4                   0                     0              0              0
## 5                   0                     0              0              0
## 6                   0                     0              0              0
##   education_12th education_1st-4th education_5th-6th education_7th-8th
## 1              0                 0                 0                 0
## 2              0                 0                 0                 0
## 3              0                 0                 0                 0
## 4              0                 0                 0                 0
## 5              0                 0                 0                 0
## 6              0                 0                 0                 0
##   education_9th education_Assoc-acdm education_Assoc-voc education_Bachelors
## 1             0                    0                   0                   0
## 2             0                    0                   0                   0
## 3             0                    1                   0                   0
## 4             0                    0                   0                   1
## 5             0                    0                   0                   0
## 6             0                    0                   0                   0
##   education_Doctorate education_HS-grad education_Masters education_Preschool
## 1                   0                 1                 0                   0
## 2                   0                 1                 0                   0
## 3                   0                 0                 0                   0
## 4                   0                 0                 0                   0
## 5                   0                 0                 0                   0
## 6                   0                 0                 0                   0
##   education_Prof-school education_Some-college marital_status_Divorced
## 1                     0                      0                       0
## 2                     0                      0                       0
## 3                     0                      0                       0
## 4                     0                      0                       0
## 5                     0                      1                       0
## 6                     1                      0                       0
##   marital_status_Married-AF-spouse marital_status_Married-civ-spouse
## 1                                0                                 1
## 2                                0                                 1
## 3                                0                                 0
## 4                                0                                 1
## 5                                0                                 0
## 6                                0                                 0
##   marital_status_Married-spouse-absent marital_status_Never-married
## 1                                    0                            0
## 2                                    0                            0
## 3                                    0                            1
## 4                                    0                            0
## 5                                    0                            1
## 6                                    0                            1
##   marital_status_Separated marital_status_Widowed occupation_Adm-clerical
## 1                        0                      0                       0
## 2                        0                      0                       0
## 3                        0                      0                       0
## 4                        0                      0                       0
## 5                        0                      0                       0
## 6                        0                      0                       0
##   occupation_Armed-Forces occupation_Craft-repair occupation_Exec-managerial
## 1                       0                       0                          0
## 2                       0                       0                          0
## 3                       0                       0                          0
## 4                       0                       0                          0
## 5                       0                       0                          0
## 6                       0                       0                          1
##   occupation_Farming-fishing occupation_Handlers-cleaners
## 1                          0                            0
## 2                          0                            0
## 3                          0                            0
## 4                          0                            0
## 5                          0                            0
## 6                          0                            0
##   occupation_Machine-op-inspct occupation_Other-service
## 1                            0                        0
## 2                            1                        0
## 3                            0                        0
## 4                            0                        0
## 5                            0                        0
## 6                            0                        0
##   occupation_Priv-house-serv occupation_Prof-specialty
## 1                          0                         0
## 2                          0                         0
## 3                          0                         0
## 4                          0                         0
## 5                          0                         0
## 6                          0                         0
##   occupation_Protective-serv occupation_Sales occupation_Tech-support
## 1                          0                0                       1
## 2                          0                0                       0
## 3                          0                0                       1
## 4                          0                0                       1
## 5                          0                1                       0
## 6                          0                0                       0
##   occupation_Transport-moving sex_Female sex_Male hours_per_week
## 1                           0          0        1             40
## 2                           0          0        1             40
## 3                           0          1        0             40
## 4                           0          0        1             40
## 5                           0          1        0             20
## 6                           0          0        1             50
##   native_country_Cambodia native_country_Canada native_country_China
## 1                       0                     0                    0
## 2                       0                     0                    0
## 3                       0                     0                    0
## 4                       0                     0                    0
## 5                       0                     0                    0
## 6                       0                     0                    0
##   native_country_Columbia native_country_Cuba native_country_Dominican-Republic
## 1                       0                   0                                 0
## 2                       0                   0                                 0
## 3                       0                   0                                 0
## 4                       0                   0                                 0
## 5                       0                   0                                 0
## 6                       0                   0                                 0
##   native_country_Ecuador native_country_El-Salvador native_country_England
## 1                      0                          0                      0
## 2                      0                          0                      0
## 3                      0                          0                      0
## 4                      0                          0                      0
## 5                      0                          0                      0
## 6                      0                          0                      0
##   native_country_France native_country_Germany native_country_Greece
## 1                     0                      0                     0
## 2                     0                      0                     0
## 3                     0                      0                     0
## 4                     0                      0                     0
## 5                     0                      0                     0
## 6                     0                      0                     0
##   native_country_Guatemala native_country_Haiti
## 1                        0                    0
## 2                        0                    0
## 3                        0                    0
## 4                        0                    0
## 5                        0                    0
## 6                        0                    0
##   native_country_Holand-Netherlands native_country_Honduras native_country_Hong
## 1                                 0                       0                   0
## 2                                 0                       0                   0
## 3                                 0                       0                   0
## 4                                 0                       0                   0
## 5                                 0                       0                   0
## 6                                 0                       0                   0
##   native_country_Hungary native_country_India native_country_Iran
## 1                      0                    0                   0
## 2                      0                    0                   0
## 3                      0                    0                   0
## 4                      0                    0                   0
## 5                      0                    0                   0
## 6                      0                    0                   0
##   native_country_Ireland native_country_Italy native_country_Jamaica
## 1                      0                    0                      0
## 2                      0                    0                      0
## 3                      0                    0                      0
## 4                      0                    0                      0
## 5                      0                    0                      0
## 6                      0                    0                      0
##   native_country_Japan native_country_Laos native_country_Mexico
## 1                    0                   0                     0
## 2                    0                   0                     0
## 3                    0                   0                     0
## 4                    0                   0                     0
## 5                    0                   0                     0
## 6                    0                   0                     0
##   native_country_Nicaragua native_country_Outlying-US(Guam-USVI-etc)
## 1                        0                                         0
## 2                        0                                         0
## 3                        0                                         0
## 4                        0                                         0
## 5                        0                                         0
## 6                        0                                         0
##   native_country_Peru native_country_Philippines native_country_Poland
## 1                   0                          0                     0
## 2                   0                          0                     0
## 3                   0                          0                     0
## 4                   0                          0                     0
## 5                   0                          0                     0
## 6                   0                          0                     0
##   native_country_Portugal native_country_Puerto-Rico native_country_Scotland
## 1                       0                          0                       0
## 2                       0                          0                       0
## 3                       0                          0                       0
## 4                       0                          0                       0
## 5                       0                          0                       0
## 6                       0                          0                       0
##   native_country_South native_country_Taiwan native_country_Thailand
## 1                    0                     0                       0
## 2                    0                     0                       0
## 3                    0                     0                       0
## 4                    0                     0                       0
## 5                    0                     0                       0
## 6                    0                     0                       0
##   native_country_Trinadad&Tobago native_country_United-States
## 1                              0                            1
## 2                              0                            1
## 3                              0                            1
## 4                              0                            1
## 5                              0                            1
## 6                              0                            1
##   native_country_Vietnam native_country_Yugoslavia income
## 1                      0                         0  <=50K
## 2                      0                         0  <=50K
## 3                      0                         0  <=50K
## 4                      0                         0   >50K
## 5                      0                         0  <=50K
## 6                      0                         0  <=50K

11. Scale all numeric features using Min-Max scaling

library(dplyr)
library(caret)

# Select only the numeric features for scaling
numeric_atrri <- sapply(df_final_new, is.numeric)
numeric_df <- df_final_new[, numeric_atrri]

#select categorical 
cat_variable <- sapply(df_final_new, is.factor)
income<-df_final_new[, cat_variable]

# Perform Min-Max scaling
min_max_scaler <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}

numeric_df <- as.data.frame(lapply(numeric_df, min_max_scaler))

# Combine the scaled numeric features with the categorical features
data_scaled <- cbind(numeric_df, income)

12. Use 5-fold cross validation with KNN to predict the “income” variable and report the cross- validation error. ( You can find an example in slides 51-53 of module 4 lecture notes).

library(caret)
# Split the data into a training set and a testing set
split1 <- createDataPartition(data_scaled$income, p = 0.8, list = FALSE)
training_data <- data_scaled[split1, ]
testing_data <- data_scaled[-split1, ]

dim(training_data)
## [1] 26049    91
dim(testing_data)
## [1] 6512   91

13. Tune K (the number of nearest neighbors) by trying out different values (starting from k=1 to k=sqrt(n) where n is the number of observations in the dataset (for example k=1,5,10,20 50,100, sqrt(n) ). Draw a plot of cross validation errors for different values of K. Which value of K seems to perform the best on this data set? (You can find an example in slides 54-55 of module 4 lecture notes) Note: This might a long time to run on your machine, be patient ( It took about 30 minutes on my machine to run 5-fold cross validation for 6 different K values)

# Split the dataset into a training set and a test set
set.seed(100)
split2 <- createDataPartition(data_scaled$income, p = 0.8, list = FALSE)
training_set <- data_scaled[split2, ]
test_set <- data_scaled[-split2, ]

# Tune the number of nearest neighbors (K)
set.seed(100)
control <- trainControl(method = "cv", number = 5)
grid <- expand.grid(k = c(1,5,10,20, 50,100, sqrt(nrow(training_set))))
knn_model <- train(income ~ ., data = training_set, method = "knn", tuneGrid = grid, trControl = control)

# Plot the cross-validation errors for different values of K
plot(knn_model)

14. Use 5-fold cross validation with KNN to predict the income variable and report the average false positive rate (FPR) and false negative rate (FNR) of the classifier. . FPR is the proportion of negative instances classified as positive by the classifier. Similarly, FNR is the proportion of positive instances classified as negative by the classifier.

It does not matter which class you designate as positive or negative. For instance, you can designate income>50K as positive and income<50K as negative

# Predict the outcomes on the test data
predictions <- predict(knn_model, newdata = test_set)

# Calculate the false positive rate (FPR) and false negative rate (FNR)
confusion_matrix <- confusionMatrix(predictions, test_set$income)
fpr <- confusion_matrix$table[2,1] / sum(confusion_matrix$table[2,])
fnr <- confusion_matrix$table[1,2] / sum(confusion_matrix$table[,2])

# Print the FPR and FNR
cat("False Positive Rate:", fpr, "\n")
## False Positive Rate: 0.3423358
cat("False Negative Rate:", fnr)
## False Negative Rate: 0.4253827

15. Consider a majority classifier which always predicts income <50K. Without writing any code, explain what would be the training error of this classifier? ( Note the training error of this majority classifier is simply the proportion of all examples with income>50K because they are all misclassified by this majority classifier). Compare this with the cross validation error of KNN you computed in question 13. Does KNN do better than this majority classifier?

Therefore, we cannot directly compare the training error of the majority classifier with the cross-validation error of KNN. However, in general, a classifier that performs better than the majority classifier would have a lower error rate than the proportion of the minority class in the dataset. Therefore, if the cross-validation error of KNN is lower than the proportion of the minority class in the dataset, then KNN would be performing better than the majority classifier. In general, KNN can potentially do better than the majority classifier if it is able to correctly classify a significant proportion of positive examples. However, the majority classifier will always have a training error equal to the percentage of positive examples in the training set, which may be lower than the cross-validation error of KNN. In this case, it can be concluded that the majority classifier will have a lower training error, but the cross-validation error of KNN may be lower if the KNN model is able to generalize well to unseen data.

16. Explain what is the False Positive Rate and False Negative Rate of the majority classifier and how does it compare to the average FPR and FNR of KNN classifier you computed in question 14. You don’t need to write any code to compute FPR and FNR of the majority classifier. You can just compute it based on the definition of FNR and FPR

False Positive Rate (FPR) and False Negative Rate (FNR) are two measures of a binary classifier’s performance. FPR is defined as the proportion of negative examples (income < 50K) that are misclassified as positive (income > 50K), while FNR is defined as the proportion of positive examples (income > 50K) that are misclassified as negative (income < 50K).In comparison, the average FPR and FNR of the KNN classifier from question 14 will depend on the specific choice of K and the data. However, in general, it is expected that the KNN classifier will have a lower FNR and a higher FPR compared to the majority classifier. This is because the majority classifier always predicts the same class regardless of the input, whereas the KNN classifier takes into account the neighbors of each point and may make different predictions based on the data.

Problem 2: Applying Naïve Bayes classifier to sentiment classification of COVID tweets

For this problem you are going to use corona_nlp_train.csv dataset, a collection of tweets pulled from Twitter and manually labeled as being “extremely positive”, “positive”, “neutral”, “negative”, and “extremely negative”. The dataset is from this Kaggle project (https://www.kaggle.com/kerneler/starter-covid-19-nlp-text-d3a3baa6- e/data ). I have attached the data to this assignment spec and you can directly download it from canvas.

1. (1pt) Read the data and store in in the dataframe. Take a look at the structure of data and its variables. We will be working with only two variables: OriginalTweet and Sentiment. Original tweet is a text and Sentiment is a categorical variable with five levels: “extremely positive”, “positive”, “neutral”,“negative”, and “extremely negative”.

# Read the data
df_tweets <- read.csv('/Users/subhalaxmirout/CSC 532 - ML/Corona_NLP_train.csv', header = T, encoding="utf-8")

# Shows first 6 rows of the tweets data
head(df_tweets)
##   UserName ScreenName                 Location    TweetAt
## 1     3799      48751                   London 16-03-2020
## 2     3800      48752                       UK 16-03-2020
## 3     3801      48753                Vagabonds 16-03-2020
## 4     3802      48754                          16-03-2020
## 5     3803      48755                          16-03-2020
## 6     3804      48756 ÜT: 36.319708,-82.363649 16-03-2020
##                                                                                                                                                                                                                                                                                                                        OriginalTweet
## 1                                                                                                                                                                                                                    @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
## 2                                                                                      advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order
## 3                                                                                                                                                                                                Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P
## 4  My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \n\n\nStay calm, stay safe.\n\n\n\n\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
## 5 Me, ready to go at supermarket during the #COVID19 outbreak.\n\n\n\n\n\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\n\n\n\n\n\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n
## 6                                                                     As news of the region\u0092s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU
##            Sentiment
## 1            Neutral
## 2           Positive
## 3           Positive
## 4           Positive
## 5 Extremely Negative
## 6           Positive
# Get the number of rows and column
dim(df_tweets)
## [1] 41157     6
# Structure of the dataset
str(df_tweets)
## 'data.frame':    41157 obs. of  6 variables:
##  $ UserName     : int  3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 ...
##  $ ScreenName   : int  48751 48752 48753 48754 48755 48756 48757 48758 48759 48760 ...
##  $ Location     : chr  "London" "UK" "Vagabonds" "" ...
##  $ TweetAt      : chr  "16-03-2020" "16-03-2020" "16-03-2020" "16-03-2020" ...
##  $ OriginalTweet: chr  "@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8" "advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neigh"| __truncated__ "Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak htt"| __truncated__ "My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD F"| __truncated__ ...
##  $ Sentiment    : chr  "Neutral" "Positive" "Positive" "Positive" ...
# Get the sentiment type count
table(df_tweets$Sentiment)
## 
## Extremely Negative Extremely Positive           Negative            Neutral 
##               5481               6624               9917               7713 
##           Positive 
##              11422
unique(df_tweets$Sentiment)
## [1] "Neutral"            "Positive"           "Extremely Negative"
## [4] "Negative"           "Extremely Positive"

2. Randomize the order of the rows.

# Extract only the OriginalTweet and Sentiment columns
df_tweets <- df_tweets[, c("OriginalTweet", "Sentiment")]
df_tweets <- df_tweets[sample(nrow(df_tweets)), ]

3. (1pt) Convert sentiment into a factor variable with three levels: “positive, “neutral”, and “negative”. You can do this by labeling all “positive” and “extremely positive” tweets as “positive” and all “negative” and “extremely negative” tweets as “negative”. Now take the “summary” of sentiment to see how many observations/tweets you have for each label.

library(tidyverse)

df_tweets <- df_tweets %>%
  mutate(Sentiment = case_when(Sentiment == "Extremely Positive" | Sentiment == "Positive"~ "Positive",
                               Sentiment == "Extremely Negative" | Sentiment == "Negative"~ "Negative",
                               Sentiment == "Neutral" ~ "Neutral"

                                 ))

df_tweets$Sentiment <- as.factor(df_tweets$Sentiment)
str(df_tweets)
## 'data.frame':    41157 obs. of  2 variables:
##  $ OriginalTweet: chr  "Amid the ongoing COVID 19 pandemic a number of Haliburton County residents have been laid off of work and local"| __truncated__ "It is critical we start preparing now for what an election will look like under #COVID19. We need a comprehensi"| __truncated__ "touted an alcohol free hand sanitizer spray Turns out it s NOT considered effective against CDC recommends hand"| __truncated__ "Retailers, consumer brands &amp; hospitality chains are stepping up to the plate to support those who need it m"| __truncated__ ...
##  $ Sentiment    : Factor w/ 3 levels "Negative","Neutral",..: 3 3 3 1 2 2 2 2 1 3 ...
# Get the sentiment type count
table(df_tweets$Sentiment)
## 
## Negative  Neutral Positive 
##    15398     7713    18046
# Print the results
cat("Negative", table(df_tweets$Sentiment)[1], "\n")
## Negative 15398
cat("Neutral", table(df_tweets$Sentiment)[2], "\n")
## Neutral 7713
cat("Positive", table(df_tweets$Sentiment)[3], "\n")
## Positive 18046

Create a text corpus from OriginalTweet variable. Then clean the corpus, that is convert all tweets to lowercase, stem and remove stop words, punctuations, and additional white spaces.

library(tm)
library(SnowballC)

#corpus <- iconv(df_tweets$Sentiment, to = "utf-8")
# Create corpus variable
tweet_corpus<-VCorpus(VectorSource(df_tweets$OriginalTweet))
#Remove  punctation
tweet_corpus <- tm_map(tweet_corpus, removePunctuation)
# remove numbers
tweet_corpus <- tm_map(tweet_corpus, removeNumbers)
# remove stop words
tweet_corpus <- tm_map(tweet_corpus, removeWords, stopwords("english"))
# remove white space
tweet_corpus <- tm_map(tweet_corpus, stripWhitespace)
#convert all tweets to lowercase
tweet_corpus<-tm_map(tweet_corpus,content_transformer(tolower))
# Stem document
tweet_corpus <- tm_map(tweet_corpus, stemDocument)

tweet_corpus
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 41157

5. Create separate wordclouds for “positive” and “negative” tweets (set max.words=100 to only show the 100 most frequent words) Is there any visible difference between the frequent words in “positive” vs “negative” tweets?

library(wordcloud)
library(RColorBrewer)

positive <-subset(df_tweets, Sentiment == "Positive")
negative <-subset(df_tweets, Sentiment == "Negative")
pal <- brewer.pal(8,"Dark2")
wordcloud(positive$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))

wordcloud(negative$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))

For postive wordcloud most frequant words are Coronavirus, covid 19, supermarket, grocery, food, help, online, socialdistancing etc.

For postive wordcloud most frequant words are covid 19, panic, lockdown, pandemic, stop, crisis, empty etc.

6. Create a document-term matrix from the cleaned corpus. Then split the data into train and test sets. Use 80% of samples (roughly 32925 rows ) for training and the rest for testing.

library(tm)
library(caret)
# create document-term matrix
tweet_dtm<-tm::DocumentTermMatrix(tweet_corpus)
# Split data in to train and test
split <- round(dim(tweet_dtm)[1] * 0.8)
tweet_dtm_train<-tweet_dtm[1:split-1, ]
tweet_dtm_test<-tweet_dtm[split:dim(tweet_dtm)[1], ]

cat("Length of train data : ", dim(tweet_dtm_train)[1], "\n")
## Length of train data :  32925
cat("Length of test data :", dim(tweet_dtm_test)[1], "\n")
## Length of test data : 8232
# The next two commands extract the labels from the raw data:
tweet_train_labels <- df_tweets[1:split-1, ]$Sentiment
tweet_test_labels <- df_tweets[split:dim(tweet_dtm)[1], ]$Sentiment

7. Remove the words that appear less than 100 times in the training data. Convert frequencies in the document-term matrix to binary yes/no features.

# Create variable to get words atleast 100
tweet_freq_words_train<-findFreqTerms(tweet_dtm_train, lowfreq = 100)
tweet_freq_words_test<-findFreqTerms(tweet_dtm_test, lowfreq = 100)

# Apply on train and test set
tweet_dtm_freq_train<-tweet_dtm_train[ , tweet_freq_words_train]
tweet_dtm_freq_test<-tweet_dtm_test[ , tweet_freq_words_test]

# Function to convert frequancy to Yes or no
convert_counts<-function(x) {
x <-ifelse(x > 0, "Yes", "No")
}

tweet_train<-apply(tweet_dtm_freq_train, MARGIN = 2, convert_counts)
tweet_test<-apply(tweet_dtm_freq_test, MARGIN = 2, convert_counts)

Train a Naïve Bayes classifier on the training data and evaluate its performance on the test data. Be patient, training and testing will take a while to run. Answer the following questions:

  • What is the overall accuracy of the model? ( the percentage of correct predictions)
  • What is the accuracy of the model in each category (negative, positive, neutral) ?
  • library(e1071)
    library(gmodels)
    library(Matrix)
    # Create model using train data
    tweet_classifier<-naiveBayes(tweet_train, tweet_train_labels)
    # Test the model using test data
    tweet_test_pred<-predict(tweet_classifier, tweet_test)
    # model evaluation
    CrossTable(tweet_test_pred, tweet_test_labels, prop.chisq= FALSE, prop.t = FALSE, dnn= c('predicted', 'actual'))
    ## 
    ##  
    ##    Cell Contents
    ## |-------------------------|
    ## |                       N |
    ## |           N / Row Total |
    ## |           N / Col Total |
    ## |-------------------------|
    ## 
    ##  
    ## Total Observations in Table:  8232 
    ## 
    ##  
    ##              | actual 
    ##    predicted |  Negative |   Neutral |  Positive | Row Total | 
    ## -------------|-----------|-----------|-----------|-----------|
    ##     Negative |      1767 |       313 |       718 |      2798 | 
    ##              |     0.632 |     0.112 |     0.257 |     0.340 | 
    ##              |     0.569 |     0.201 |     0.201 |           | 
    ## -------------|-----------|-----------|-----------|-----------|
    ##      Neutral |       549 |       908 |       594 |      2051 | 
    ##              |     0.268 |     0.443 |     0.290 |     0.249 | 
    ##              |     0.177 |     0.582 |     0.167 |           | 
    ## -------------|-----------|-----------|-----------|-----------|
    ##     Positive |       790 |       340 |      2253 |      3383 | 
    ##              |     0.234 |     0.101 |     0.666 |     0.411 | 
    ##              |     0.254 |     0.218 |     0.632 |           | 
    ## -------------|-----------|-----------|-----------|-----------|
    ## Column Total |      3106 |      1561 |      3565 |      8232 | 
    ##              |     0.377 |     0.190 |     0.433 |           | 
    ## -------------|-----------|-----------|-----------|-----------|
    ## 
    ## 
  • What is the overall accuracy of the model? ( the percentage of correct predictions)
  • # Get the overall accuracy of the model
    accuracy_nb <- mean(tweet_test_pred == tweet_test_labels)
    cat("The overall accuracy of the model is:", round(accuracy_nb,3) ,"\n")
    ## The overall accuracy of the model is: 0.599
  • What is the accuracy of the model in each category (negative, positive, neutral) ?
  • # Get the accuracy of the model for each category
    sentiment_levels <- levels(tweet_test_labels)
    
    for (i in 1:length(sentiment_levels)) {
      sentiment_level <- sentiment_levels[i]
      accuracy_level <- mean(tweet_test_pred[tweet_test_labels == tweet_test_pred] == tweet_test_labels[tweet_test_labels == sentiment_level])
      cat(sentiment_level, "accuracy:", round(accuracy_level,3), "\n")
    }
    ## Negative accuracy: 0.359 
    ## Neutral accuracy: 0.184 
    ## Positive accuracy: 0.457