For this assignment, we will be using the census adult dataset from UCI ML repository. The Adult dataset was extracted by Barry Becker from the 1994 US Census Database. Each row in the dataset has de-identified dempgraphic information of an individual worker and their income. The income is a categorical variable with two levels: <50K and >50K. The goal of this assignment is to create a binary classifier to predict whether a person makes more than 50K based on the other attributes in the dataset.
Please see the description of dataset and its attributes here: https://archive.ics.uci.edu/ml/datasets/Adult Then go to data folder at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/ and download adult.data . This would be the dataset you will use to answer the following questions.
1. Download the dataset and store it in a dataframe in R. The dataset does not have header, you should add the headers manually to your dataframe based on the list of attributes provided in https://archive.ics.uci.edu/ml/datasets/Adult. Also please note that some entries have extra white space. So to read the data properly, use the option strip.white=TRUE in read.csv function.
# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/adult (1).data", header = F, sep = ",", na.strings = "?", strip.white=TRUE)
# Manually assign the header names
colnames(df) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
head(df)## age workclass fnlwgt education education_num marital_status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours_per_week native_country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
## [1] 32561 15
2. Explore the overall structure of the dataset using the str() function. Get a summary statistics of each variable. How many categorical and numeric variables you have in your data? Is there any missing values?
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : chr "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : chr "White" "White" "White" "Black" ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
## age workclass fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education_num marital_status occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital_gain capital_loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hours_per_week native_country income
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
## [1] 4262
There are 9 categorical variables in the dataset i.e workclass, education, marital-status, occupation, relationship, race, sex, native-country, and income.
There are 6 numerical variables i.e age, fnlwgt, education-num, capital-gain, capital-loss, and hours-per-week
Dataset have 4262 missing data.
3. Get the frequency table of the “income” variable to see how many observations you have in each category of the income variable. Is the data balanced? Do we have equal number of samples in each classof income?
##
## <=50K >50K
## 24720 7841
we do not have an equal number of samples in each classof income. So the data is not balanced.
4. Explore the data in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income.
• To explore the relationship between numerical features and “income” variable, you can use side by side box plot and t.test
• To explore the relationship between categorical features and “income” variable, you can use frequency table and chisquare test (note that chisquare test might throw a warning if there are cells whose expected counts in the frequency table is less 5. This warning means the p-values reported from chisquare test may be incorrect due to low counts and are not reliable. You can ignore the warning for this assignment).
library(dplyr)
library(tidyr)
library(ggplot2)
# Create the boxplot
GenderPlot_age = ggplot(df, aes(x = income, y = age)) + geom_boxplot()
GenderPlot_age##
## Welch Two Sample t-test
##
## data: age by income
## t = -50.264, df = 17411, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.757250 -7.174955
## sample estimates:
## mean in group <=50K mean in group >50K
## 36.78374 44.24984
<=50K median is lower age than >50k.
##
## Welch Two Sample t-test
##
## data: fnlwgt by income
## t = 1.7412, df = 13615, p-value = 0.08167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -293.7029 4965.4332
## sample estimates:
## mean in group <=50K mean in group >50K
## 190340.9 188005.0
GenderPlot_educationnum = ggplot(df, aes(x = income, y = education_num)) + geom_boxplot()
GenderPlot_educationnum# t-test between education_num and income
t.test(education_num~income,alternative="two.sided", data=df)##
## Welch Two Sample t-test
##
## data: education_num by income
## t = -64.896, df = 13421, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.077502 -1.955682
## sample estimates:
## mean in group <=50K mean in group >50K
## 9.595065 11.611657
GenderPlot_capital_gain = ggplot(df, aes(x = income, y = capital_gain)) + geom_boxplot()
GenderPlot_capital_gain# t-test between capital_gain and income
t.test(capital_gain~income,alternative="two.sided", data=df)##
## Welch Two Sample t-test
##
## data: capital_gain by income
## t = -23.427, df = 7861.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4180.166 -3534.614
## sample estimates:
## mean in group <=50K mean in group >50K
## 148.7525 4006.1425
GenderPlot_capital_loss = ggplot(df, aes(x = income, y = capital_loss)) + geom_boxplot()
GenderPlot_capital_loss# t-test between capital_loss and income
t.test(capital_loss~income,alternative="two.sided", data=df)##
## Welch Two Sample t-test
##
## data: capital_loss by income
## t = -20.238, df = 9231.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -155.5985 -128.1187
## sample estimates:
## mean in group <=50K mean in group >50K
## 53.14292 195.00153
GenderPlot_hours_per_week = ggplot(df, aes(x = income, y = hours_per_week)) + geom_boxplot()
GenderPlot_hours_per_week# t-test between hours_per_week and income
t.test(hours_per_week~income,alternative="two.sided", data=df)##
## Welch Two Sample t-test
##
## data: hours_per_week by income
## t = -45.123, df = 14570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.920943 -6.344690
## sample estimates:
## mean in group <=50K mean in group >50K
## 38.84021 45.47303
Test for categorical variables:
Workclass
##
## <=50K >50K
## Federal-gov 589 371
## Local-gov 1476 617
## Never-worked 7 0
## Private 17733 4963
## Self-emp-inc 494 622
## Self-emp-not-inc 1817 724
## State-gov 945 353
## Without-pay 14 0
mosaicplot(table_workclass, ylab= "income", xlab="workclass", main = "workclass vs income", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_workclass
## X-squared = 827.72, df = 7, p-value < 2.2e-16
education
##
## <=50K >50K
## 10th 871 62
## 11th 1115 60
## 12th 400 33
## 1st-4th 162 6
## 5th-6th 317 16
## 7th-8th 606 40
## 9th 487 27
## Assoc-acdm 802 265
## Assoc-voc 1021 361
## Bachelors 3134 2221
## Doctorate 107 306
## HS-grad 8826 1675
## Masters 764 959
## Preschool 51 0
## Prof-school 153 423
## Some-college 5904 1387
mosaicplot(table_education, ylab= "income", xlab="education", main = "education vs income", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_education
## X-squared = 4429.7, df = 15, p-value < 2.2e-16
marital_status
##
## <=50K >50K
## Divorced 3980 463
## Married-AF-spouse 13 10
## Married-civ-spouse 8284 6692
## Married-spouse-absent 384 34
## Never-married 10192 491
## Separated 959 66
## Widowed 908 85
mosaicplot(table_marital_status, ylab= "income", xlab="marital_status", main = "marital_status vs income", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_marital_status
## X-squared = 6517.7, df = 6, p-value < 2.2e-16
occupation
##
## <=50K >50K
## Adm-clerical 3263 507
## Armed-Forces 8 1
## Craft-repair 3170 929
## Exec-managerial 2098 1968
## Farming-fishing 879 115
## Handlers-cleaners 1284 86
## Machine-op-inspct 1752 250
## Other-service 3158 137
## Priv-house-serv 148 1
## Prof-specialty 2281 1859
## Protective-serv 438 211
## Sales 2667 983
## Tech-support 645 283
## Transport-moving 1277 320
mosaicplot(table_occupation, ylab= "income", xlab="occupation", main = "occupation vs income", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_occupation
## X-squared = 3744.9, df = 13, p-value < 2.2e-16
relationship
##
## <=50K >50K
## Husband 7275 5918
## Not-in-family 7449 856
## Other-relative 944 37
## Own-child 5001 67
## Unmarried 3228 218
## Wife 823 745
mosaicplot(table_relationship, ylab= "income", xlab="relationship", main = "relationship vs income", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_relationship
## X-squared = 6699.1, df = 5, p-value < 2.2e-16
race
##
## <=50K >50K
## Amer-Indian-Eskimo 275 36
## Asian-Pac-Islander 763 276
## Black 2737 387
## Other 246 25
## White 20699 7117
##
## Pearson's Chi-squared test
##
## data: table_race
## X-squared = 330.92, df = 4, p-value < 2.2e-16
Sex
##
## <=50K >50K
## Female 9592 1179
## Male 15128 6662
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_sex
## X-squared = 1517.8, df = 1, p-value < 2.2e-16
native_country
##
## <=50K >50K
## Cambodia 12 7
## Canada 82 39
## China 55 20
## Columbia 57 2
## Cuba 70 25
## Dominican-Republic 68 2
## Ecuador 24 4
## El-Salvador 97 9
## England 60 30
## France 17 12
## Germany 93 44
## Greece 21 8
## Guatemala 61 3
## Haiti 40 4
## Holand-Netherlands 1 0
## Honduras 12 1
## Hong 14 6
## Hungary 10 3
## India 60 40
## Iran 25 18
## Ireland 19 5
## Italy 48 25
## Jamaica 71 10
## Japan 38 24
## Laos 16 2
## Mexico 610 33
## Nicaragua 32 2
## Outlying-US(Guam-USVI-etc) 14 0
## Peru 29 2
## Philippines 137 61
## Poland 48 12
## Portugal 33 4
## Puerto-Rico 102 12
## Scotland 9 3
## South 64 16
## Taiwan 31 20
## Thailand 15 3
## Trinadad&Tobago 17 2
## United-States 21999 7171
## Vietnam 62 5
## Yugoslavia 10 6
mosaicplot(table_native_country, ylab= "income", xlab="native_country", main = "native_country vs income", shade=TRUE)##
## Pearson's Chi-squared test
##
## data: table_native_country
## X-squared = 317.09, df = 40, p-value < 2.2e-16
These attributes have been selected as they are likely to have a strong correlation with the target variable (income), have low correlation with each other, are easily interpretable, and do not require complex computations. By using these attributes, we can build a model that is both accurate and efficient, while also being interpretable and understandable. Age can be a good indicator of experience and seniority, both of which are often correlated with higher income. Education is a good predictor of income as higher levels of education often lead to better job opportunities and higher salaries. Marital status and occupation can also provide insights into a person’s financial situation and their earning potential. Finally, hours-per-week can be a good indicator of full-time versus part-time employment, with full-time employees typically earning more.
Remove remaining columns from the dataset.
drop <- c("fnlwgt", "education_num", "relationship", "race","capital_gain" ,"capital_loss")
df_new = df[,!(names(df) %in% drop)]5. An initial data exploration shows that the missing values in the dataset are denoted by “?” not NA. Change all the “?” characters in the dataframe to NA
6. Use the command colSums(is.na(
## age workclass education marital_status occupation
## 0 1836 0 0 1843
## sex hours_per_week native_country income
## 0 0 583 0
workclass, occupation and native_country has NA values.
7. There are several ways we can deal with missing values. The easiest approach is to remove all the rows with missing values. However, if a large number of rows have missing values removing them will result in loss of information and may affect the classifier performance. If a large number of rows have missing values, then it is typically better to replace missing data with some values. This is called data imputation. Several methods for missing data imputation exist. The most naïve method (which we will use here) is to replace the missing values with mean of the column (for a numerical column) or mode/majority value of the column (for a categorical column). We will use a more advanced data imputation method in a later module. For now, replace the missing values in a numerical column with the mean of the column and the missing values in a categorical column with the mode/majority of the column. After imputation, use colSums(is.na(
mode_workclass <- names(which.max(table(df_new$workclass)))
df_new$workclass[is.na(df_new$workclass)] <- mode_workclass
mode_occupation <- names(which.max(table(df_new$occupation)))
df_new$occupation[is.na(df_new$occupation)] <- mode_occupation
mode_native_country <- names(which.max(table(df_new$native_country)))
df_new$native_country[is.na(df_new$native_country)] <- mode_native_country
# Check for missing values
colSums(is.na(df_new))## age workclass education marital_status occupation
## 0 0 0 0 0
## sex hours_per_week native_country income
## 0 0 0 0
8. Set the seed of the random number generator to a fixed integer, say 1, so that I can reproduce your work:
9. Randomize the order of the rows in the dataset.
10. This dataset has several categorical variables. With the exception of few models ( such as Naiive Bayes and tree-based models) most machine learning models require numeric features and cannot work directly with categorical data. One way to deal with categorical variables is to assign numeric indices to each level. However, this imposes an artificial ordering on an unordered categorical variable. For example, suppose that we have a categorical variable primary color with three levels: “red”,”blue”,”green”. If we convert “red” to 0 , “blue” to 1 and “green” to 2 then we are telling our model that red < blue< green which is not correct. A better way to encode an unordered categorical variable is to do one-hot-encoding. In one hot-encoding we create a dummy binary variable for each level of a categorical variable. For example we can represent the primary color variable by three binary dummy variables, one for each color (red, blue, and green) .If the color is red, then the variable red takes value 1 while blue and green both take the value zero.
library(mltools)
library(data.table)
# Convert all columns to factor
df_copy <- as.data.frame(unclass(df_new), stringsAsFactors = TRUE)
df_final <- as.data.table(df_copy)
df_final_new <- one_hot(df_final, cols = c("workclass", "education", "marital_status", "occupation", "sex", "native_country"), dropUnusedLevels = T)
df_final_new <- as.data.frame(df_final_new)
head(df_final_new)## age workclass_Federal-gov workclass_Local-gov workclass_Never-worked
## 1 51 0 0 0
## 2 38 0 0 0
## 3 30 0 0 0
## 4 38 0 0 0
## 5 21 0 0 0
## 6 34 0 0 0
## workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## workclass_State-gov workclass_Without-pay education_10th education_11th
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## education_12th education_1st-4th education_5th-6th education_7th-8th
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## education_9th education_Assoc-acdm education_Assoc-voc education_Bachelors
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 1
## 5 0 0 0 0
## 6 0 0 0 0
## education_Doctorate education_HS-grad education_Masters education_Preschool
## 1 0 1 0 0
## 2 0 1 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## education_Prof-school education_Some-college marital_status_Divorced
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 1 0
## 6 1 0 0
## marital_status_Married-AF-spouse marital_status_Married-civ-spouse
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 1
## 5 0 0
## 6 0 0
## marital_status_Married-spouse-absent marital_status_Never-married
## 1 0 0
## 2 0 0
## 3 0 1
## 4 0 0
## 5 0 1
## 6 0 1
## marital_status_Separated marital_status_Widowed occupation_Adm-clerical
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## occupation_Armed-Forces occupation_Craft-repair occupation_Exec-managerial
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 1
## occupation_Farming-fishing occupation_Handlers-cleaners
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## occupation_Machine-op-inspct occupation_Other-service
## 1 0 0
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## occupation_Priv-house-serv occupation_Prof-specialty
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## occupation_Protective-serv occupation_Sales occupation_Tech-support
## 1 0 0 1
## 2 0 0 0
## 3 0 0 1
## 4 0 0 1
## 5 0 1 0
## 6 0 0 0
## occupation_Transport-moving sex_Female sex_Male hours_per_week
## 1 0 0 1 40
## 2 0 0 1 40
## 3 0 1 0 40
## 4 0 0 1 40
## 5 0 1 0 20
## 6 0 0 1 50
## native_country_Cambodia native_country_Canada native_country_China
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Columbia native_country_Cuba native_country_Dominican-Republic
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Ecuador native_country_El-Salvador native_country_England
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_France native_country_Germany native_country_Greece
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Guatemala native_country_Haiti
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## native_country_Holand-Netherlands native_country_Honduras native_country_Hong
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Hungary native_country_India native_country_Iran
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Ireland native_country_Italy native_country_Jamaica
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Japan native_country_Laos native_country_Mexico
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Nicaragua native_country_Outlying-US(Guam-USVI-etc)
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## native_country_Peru native_country_Philippines native_country_Poland
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Portugal native_country_Puerto-Rico native_country_Scotland
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_South native_country_Taiwan native_country_Thailand
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Trinadad&Tobago native_country_United-States
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
## native_country_Vietnam native_country_Yugoslavia income
## 1 0 0 <=50K
## 2 0 0 <=50K
## 3 0 0 <=50K
## 4 0 0 >50K
## 5 0 0 <=50K
## 6 0 0 <=50K
11. Scale all numeric features using Min-Max scaling
library(dplyr)
library(caret)
# Select only the numeric features for scaling
numeric_atrri <- sapply(df_final_new, is.numeric)
numeric_df <- df_final_new[, numeric_atrri]
#select categorical
cat_variable <- sapply(df_final_new, is.factor)
income<-df_final_new[, cat_variable]
# Perform Min-Max scaling
min_max_scaler <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
numeric_df <- as.data.frame(lapply(numeric_df, min_max_scaler))
# Combine the scaled numeric features with the categorical features
data_scaled <- cbind(numeric_df, income)12. Use 5-fold cross validation with KNN to predict the “income” variable and report the cross- validation error. ( You can find an example in slides 51-53 of module 4 lecture notes).
library(caret)
# Split the data into a training set and a testing set
split1 <- createDataPartition(data_scaled$income, p = 0.8, list = FALSE)
training_data <- data_scaled[split1, ]
testing_data <- data_scaled[-split1, ]
dim(training_data)## [1] 26049 91
## [1] 6512 91
13. Tune K (the number of nearest neighbors) by trying out different values (starting from k=1 to k=sqrt(n) where n is the number of observations in the dataset (for example k=1,5,10,20 50,100, sqrt(n) ). Draw a plot of cross validation errors for different values of K. Which value of K seems to perform the best on this data set? (You can find an example in slides 54-55 of module 4 lecture notes) Note: This might a long time to run on your machine, be patient ( It took about 30 minutes on my machine to run 5-fold cross validation for 6 different K values)
# Split the dataset into a training set and a test set
set.seed(100)
split2 <- createDataPartition(data_scaled$income, p = 0.8, list = FALSE)
training_set <- data_scaled[split2, ]
test_set <- data_scaled[-split2, ]
# Tune the number of nearest neighbors (K)
set.seed(100)
control <- trainControl(method = "cv", number = 5)
grid <- expand.grid(k = c(1,5,10,20, 50,100, sqrt(nrow(training_set))))
knn_model <- train(income ~ ., data = training_set, method = "knn", tuneGrid = grid, trControl = control)
# Plot the cross-validation errors for different values of K
plot(knn_model)14. Use 5-fold cross validation with KNN to predict the income variable and report the average false positive rate (FPR) and false negative rate (FNR) of the classifier. . FPR is the proportion of negative instances classified as positive by the classifier. Similarly, FNR is the proportion of positive instances classified as negative by the classifier.
It does not matter which class you designate as positive or negative. For instance, you can designate income>50K as positive and income<50K as negative
# Predict the outcomes on the test data
predictions <- predict(knn_model, newdata = test_set)
# Calculate the false positive rate (FPR) and false negative rate (FNR)
confusion_matrix <- confusionMatrix(predictions, test_set$income)
fpr <- confusion_matrix$table[2,1] / sum(confusion_matrix$table[2,])
fnr <- confusion_matrix$table[1,2] / sum(confusion_matrix$table[,2])
# Print the FPR and FNR
cat("False Positive Rate:", fpr, "\n")## False Positive Rate: 0.3423358
## False Negative Rate: 0.4253827
15. Consider a majority classifier which always predicts income <50K. Without writing any code, explain what would be the training error of this classifier? ( Note the training error of this majority classifier is simply the proportion of all examples with income>50K because they are all misclassified by this majority classifier). Compare this with the cross validation error of KNN you computed in question 13. Does KNN do better than this majority classifier?
Therefore, we cannot directly compare the training error of the majority classifier with the cross-validation error of KNN. However, in general, a classifier that performs better than the majority classifier would have a lower error rate than the proportion of the minority class in the dataset. Therefore, if the cross-validation error of KNN is lower than the proportion of the minority class in the dataset, then KNN would be performing better than the majority classifier. In general, KNN can potentially do better than the majority classifier if it is able to correctly classify a significant proportion of positive examples. However, the majority classifier will always have a training error equal to the percentage of positive examples in the training set, which may be lower than the cross-validation error of KNN. In this case, it can be concluded that the majority classifier will have a lower training error, but the cross-validation error of KNN may be lower if the KNN model is able to generalize well to unseen data.
16. Explain what is the False Positive Rate and False Negative Rate of the majority classifier and how does it compare to the average FPR and FNR of KNN classifier you computed in question 14. You don’t need to write any code to compute FPR and FNR of the majority classifier. You can just compute it based on the definition of FNR and FPR
False Positive Rate (FPR) and False Negative Rate (FNR) are two measures of a binary classifier’s performance. FPR is defined as the proportion of negative examples (income < 50K) that are misclassified as positive (income > 50K), while FNR is defined as the proportion of positive examples (income > 50K) that are misclassified as negative (income < 50K).In comparison, the average FPR and FNR of the KNN classifier from question 14 will depend on the specific choice of K and the data. However, in general, it is expected that the KNN classifier will have a lower FNR and a higher FPR compared to the majority classifier. This is because the majority classifier always predicts the same class regardless of the input, whereas the KNN classifier takes into account the neighbors of each point and may make different predictions based on the data.
For this problem you are going to use corona_nlp_train.csv dataset, a collection of tweets pulled from Twitter and manually labeled as being “extremely positive”, “positive”, “neutral”, “negative”, and “extremely negative”. The dataset is from this Kaggle project (https://www.kaggle.com/kerneler/starter-covid-19-nlp-text-d3a3baa6- e/data ). I have attached the data to this assignment spec and you can directly download it from canvas.
1. (1pt) Read the data and store in in the dataframe. Take a look at the structure of data and its variables. We will be working with only two variables: OriginalTweet and Sentiment. Original tweet is a text and Sentiment is a categorical variable with five levels: “extremely positive”, “positive”, “neutral”,“negative”, and “extremely negative”.
# Read the data
df_tweets <- read.csv('/Users/subhalaxmirout/CSC 532 - ML/Corona_NLP_train.csv', header = T, encoding="utf-8")
# Shows first 6 rows of the tweets data
head(df_tweets)## UserName ScreenName Location TweetAt
## 1 3799 48751 London 16-03-2020
## 2 3800 48752 UK 16-03-2020
## 3 3801 48753 Vagabonds 16-03-2020
## 4 3802 48754 16-03-2020
## 5 3803 48755 16-03-2020
## 6 3804 48756 ÜT: 36.319708,-82.363649 16-03-2020
## OriginalTweet
## 1 @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8
## 2 advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neighbours schools employer chemist GP set up online shopping accounts if poss adequate supplies of regular meds but not over order
## 3 Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak https://t.co/bInCA9Vp8P
## 4 My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \n\n\nStay calm, stay safe.\n\n\n\n\n\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j
## 5 Me, ready to go at supermarket during the #COVID19 outbreak.\n\n\n\n\n\nNot because I'm paranoid, but because my food stock is litteraly empty. The #coronavirus is a serious thing, but please, don't panic. It causes shortage...\n\n\n\n\n\n#CoronavirusFrance #restezchezvous #StayAtHome #confinement https://t.co/usmuaLq72n
## 6 As news of the region\u0092s first confirmed COVID-19 case came out of Sullivan County last week, people flocked to area stores to purchase cleaning supplies, hand sanitizer, food, toilet paper and other goods, @Tim_Dodson reports https://t.co/cfXch7a2lU
## Sentiment
## 1 Neutral
## 2 Positive
## 3 Positive
## 4 Positive
## 5 Extremely Negative
## 6 Positive
## [1] 41157 6
## 'data.frame': 41157 obs. of 6 variables:
## $ UserName : int 3799 3800 3801 3802 3803 3804 3805 3806 3807 3808 ...
## $ ScreenName : int 48751 48752 48753 48754 48755 48756 48757 48758 48759 48760 ...
## $ Location : chr "London" "UK" "Vagabonds" "" ...
## $ TweetAt : chr "16-03-2020" "16-03-2020" "16-03-2020" "16-03-2020" ...
## $ OriginalTweet: chr "@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/iFz9FAn2Pa and https://t.co/xX6ghGFzCC and https://t.co/I2NlzdxNo8" "advice Talk to your neighbours family to exchange phone numbers create contact list with phone numbers of neigh"| __truncated__ "Coronavirus Australia: Woolworths to give elderly, disabled dedicated shopping hours amid COVID-19 outbreak htt"| __truncated__ "My food stock is not the only one which is empty...\n\n\n\n\n\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD F"| __truncated__ ...
## $ Sentiment : chr "Neutral" "Positive" "Positive" "Positive" ...
##
## Extremely Negative Extremely Positive Negative Neutral
## 5481 6624 9917 7713
## Positive
## 11422
## [1] "Neutral" "Positive" "Extremely Negative"
## [4] "Negative" "Extremely Positive"
2. Randomize the order of the rows.
# Extract only the OriginalTweet and Sentiment columns
df_tweets <- df_tweets[, c("OriginalTweet", "Sentiment")]
df_tweets <- df_tweets[sample(nrow(df_tweets)), ]3. (1pt) Convert sentiment into a factor variable with three levels: “positive, “neutral”, and “negative”. You can do this by labeling all “positive” and “extremely positive” tweets as “positive” and all “negative” and “extremely negative” tweets as “negative”. Now take the “summary” of sentiment to see how many observations/tweets you have for each label.
library(tidyverse)
df_tweets <- df_tweets %>%
mutate(Sentiment = case_when(Sentiment == "Extremely Positive" | Sentiment == "Positive"~ "Positive",
Sentiment == "Extremely Negative" | Sentiment == "Negative"~ "Negative",
Sentiment == "Neutral" ~ "Neutral"
))
df_tweets$Sentiment <- as.factor(df_tweets$Sentiment)
str(df_tweets)## 'data.frame': 41157 obs. of 2 variables:
## $ OriginalTweet: chr "Amid the ongoing COVID 19 pandemic a number of Haliburton County residents have been laid off of work and local"| __truncated__ "It is critical we start preparing now for what an election will look like under #COVID19. We need a comprehensi"| __truncated__ "touted an alcohol free hand sanitizer spray Turns out it s NOT considered effective against CDC recommends hand"| __truncated__ "Retailers, consumer brands & hospitality chains are stepping up to the plate to support those who need it m"| __truncated__ ...
## $ Sentiment : Factor w/ 3 levels "Negative","Neutral",..: 3 3 3 1 2 2 2 2 1 3 ...
##
## Negative Neutral Positive
## 15398 7713 18046
## Negative 15398
## Neutral 7713
## Positive 18046
Create a text corpus from OriginalTweet variable. Then clean the corpus, that is convert all tweets to lowercase, stem and remove stop words, punctuations, and additional white spaces.
library(tm)
library(SnowballC)
#corpus <- iconv(df_tweets$Sentiment, to = "utf-8")
# Create corpus variable
tweet_corpus<-VCorpus(VectorSource(df_tweets$OriginalTweet))
#Remove punctation
tweet_corpus <- tm_map(tweet_corpus, removePunctuation)
# remove numbers
tweet_corpus <- tm_map(tweet_corpus, removeNumbers)
# remove stop words
tweet_corpus <- tm_map(tweet_corpus, removeWords, stopwords("english"))
# remove white space
tweet_corpus <- tm_map(tweet_corpus, stripWhitespace)
#convert all tweets to lowercase
tweet_corpus<-tm_map(tweet_corpus,content_transformer(tolower))
# Stem document
tweet_corpus <- tm_map(tweet_corpus, stemDocument)
tweet_corpus## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 41157
5. Create separate wordclouds for “positive” and “negative” tweets (set max.words=100 to only show the 100 most frequent words) Is there any visible difference between the frequent words in “positive” vs “negative” tweets?
library(wordcloud)
library(RColorBrewer)
positive <-subset(df_tweets, Sentiment == "Positive")
negative <-subset(df_tweets, Sentiment == "Negative")
pal <- brewer.pal(8,"Dark2")
wordcloud(positive$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))wordcloud(negative$OriginalTweet, max.words= 100, scale = c(3, 0.5), colors=pal, vfont=c("sans serif","plain"))For postive wordcloud most frequant words are Coronavirus, covid 19, supermarket, grocery, food, help, online, socialdistancing etc.
For postive wordcloud most frequant words are covid 19, panic, lockdown, pandemic, stop, crisis, empty etc.
6. Create a document-term matrix from the cleaned corpus. Then split the data into train and test sets. Use 80% of samples (roughly 32925 rows ) for training and the rest for testing.
library(tm)
library(caret)
# create document-term matrix
tweet_dtm<-tm::DocumentTermMatrix(tweet_corpus)
# Split data in to train and test
split <- round(dim(tweet_dtm)[1] * 0.8)
tweet_dtm_train<-tweet_dtm[1:split-1, ]
tweet_dtm_test<-tweet_dtm[split:dim(tweet_dtm)[1], ]
cat("Length of train data : ", dim(tweet_dtm_train)[1], "\n")## Length of train data : 32925
## Length of test data : 8232
# The next two commands extract the labels from the raw data:
tweet_train_labels <- df_tweets[1:split-1, ]$Sentiment
tweet_test_labels <- df_tweets[split:dim(tweet_dtm)[1], ]$Sentiment7. Remove the words that appear less than 100 times in the training data. Convert frequencies in the document-term matrix to binary yes/no features.
# Create variable to get words atleast 100
tweet_freq_words_train<-findFreqTerms(tweet_dtm_train, lowfreq = 100)
tweet_freq_words_test<-findFreqTerms(tweet_dtm_test, lowfreq = 100)
# Apply on train and test set
tweet_dtm_freq_train<-tweet_dtm_train[ , tweet_freq_words_train]
tweet_dtm_freq_test<-tweet_dtm_test[ , tweet_freq_words_test]
# Function to convert frequancy to Yes or no
convert_counts<-function(x) {
x <-ifelse(x > 0, "Yes", "No")
}
tweet_train<-apply(tweet_dtm_freq_train, MARGIN = 2, convert_counts)
tweet_test<-apply(tweet_dtm_freq_test, MARGIN = 2, convert_counts)Train a Naïve Bayes classifier on the training data and evaluate its performance on the test data. Be patient, training and testing will take a while to run. Answer the following questions:
library(e1071)
library(gmodels)
library(Matrix)
# Create model using train data
tweet_classifier<-naiveBayes(tweet_train, tweet_train_labels)
# Test the model using test data
tweet_test_pred<-predict(tweet_classifier, tweet_test)
# model evaluation
CrossTable(tweet_test_pred, tweet_test_labels, prop.chisq= FALSE, prop.t = FALSE, dnn= c('predicted', 'actual'))##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 8232
##
##
## | actual
## predicted | Negative | Neutral | Positive | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## Negative | 1767 | 313 | 718 | 2798 |
## | 0.632 | 0.112 | 0.257 | 0.340 |
## | 0.569 | 0.201 | 0.201 | |
## -------------|-----------|-----------|-----------|-----------|
## Neutral | 549 | 908 | 594 | 2051 |
## | 0.268 | 0.443 | 0.290 | 0.249 |
## | 0.177 | 0.582 | 0.167 | |
## -------------|-----------|-----------|-----------|-----------|
## Positive | 790 | 340 | 2253 | 3383 |
## | 0.234 | 0.101 | 0.666 | 0.411 |
## | 0.254 | 0.218 | 0.632 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 3106 | 1561 | 3565 | 8232 |
## | 0.377 | 0.190 | 0.433 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
# Get the overall accuracy of the model
accuracy_nb <- mean(tweet_test_pred == tweet_test_labels)
cat("The overall accuracy of the model is:", round(accuracy_nb,3) ,"\n")## The overall accuracy of the model is: 0.599
# Get the accuracy of the model for each category
sentiment_levels <- levels(tweet_test_labels)
for (i in 1:length(sentiment_levels)) {
sentiment_level <- sentiment_levels[i]
accuracy_level <- mean(tweet_test_pred[tweet_test_labels == tweet_test_pred] == tweet_test_labels[tweet_test_labels == sentiment_level])
cat(sentiment_level, "accuracy:", round(accuracy_level,3), "\n")
}## Negative accuracy: 0.359
## Neutral accuracy: 0.184
## Positive accuracy: 0.457