For this assignment, we will be using the Adult census dataset from the UCI ML repository. The Adult dataset was extracted by Barry Becker from the 1994 US Census database. Each row in the dataset contains de-identified demographic information about an individual worker and their income. Income is a categorical variable with two levels: <=50K and >50K. The goal of this assignment is to create a binary classifier that predicts whether a person makes more than 50K based on the other attributes in the dataset.
Please see the description of the dataset and its attributes here: https://archive.ics.uci.edu/ml/datasets/Adult Then go to the data folder at https://archive.ics.uci.edu/ml/machine-learning-databases/adult/ and download adult.data. This is the dataset you will use to answer the following questions.
1. Download the dataset and store it in a dataframe in R. The dataset does not have a header row, so you should add the headers manually to your dataframe based on the list of attributes provided at https://archive.ics.uci.edu/ml/datasets/Adult. Also note that some entries have extra white space, so to read the data properly, use the option strip.white=TRUE in the read.csv function.
# Load the file into a data frame
df <- read.csv("/Users/subhalaxmirout/CSC 532 - ML/adult (1).data", header = FALSE, sep = ",", na.strings = "?", strip.white = TRUE)
# Manually assign the header names
colnames(df) <- c("age", "workclass", "fnlwgt", "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "income")
head(df)
## age workclass fnlwgt education education_num marital_status
## 1 39 State-gov 77516 Bachelors 13 Never-married
## 2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse
## 3 38 Private 215646 HS-grad 9 Divorced
## 4 53 Private 234721 11th 7 Married-civ-spouse
## 5 28 Private 338409 Bachelors 13 Married-civ-spouse
## 6 37 Private 284582 Masters 14 Married-civ-spouse
## occupation relationship race sex capital_gain capital_loss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hours_per_week native_country income
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
dim(df)
## [1] 32561 15
2. Explore the overall structure of the dataset using the str() function. Get summary statistics for each variable. How many categorical and numeric variables do you have in your data? Are there any missing values?
str(df)
## 'data.frame': 32561 obs. of 15 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
## $ education : chr "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num : int 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : chr "White" "White" "White" "Black" ...
## $ sex : chr "Male" "Male" "Male" "Male" ...
## $ capital_gain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capital_loss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: int 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: chr "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
summary(df)
## age workclass fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education_num marital_status occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital_gain capital_loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hours_per_week native_country income
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
sum(is.na(df))
## [1] 4262
There are 9 categorical variables in the dataset: workclass, education, marital_status, occupation, relationship, race, sex, native_country, and income.
There are 6 numeric variables: age, fnlwgt, education_num, capital_gain, capital_loss, and hours_per_week.
The dataset has 4,262 missing values.
3. Get the frequency table of the "income" variable to see how many observations you have in each category of the income variable. Is the data balanced? Do we have an equal number of samples in each class of income?
table(df$income)
##
## <=50K >50K
## 24720 7841
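To quantify the imbalance, prop.table converts the counts above into class proportions; a minimal one-line check:
# About 76% of rows are <=50K and about 24% are >50K
prop.table(table(df$income))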
We do not have an equal number of samples in each class of income, so the data is not balanced.
4. Explore the data in order to investigate the association between income and the other features. Which of the other features seem most likely to be useful in predicting income?
• To explore the relationship between numerical features and the "income" variable, you can use side-by-side box plots and t.test.
• To explore the relationship between categorical features and the "income" variable, you can use a frequency table and a chi-square test (note that the chi-square test might throw a warning if there are cells whose expected counts in the frequency table are less than 5. This warning means the p-values reported from the chi-square test may be incorrect due to low counts and are not reliable. You can ignore the warning for this assignment).
library(dplyr)
library(tidyr)
library(ggplot2)
# Boxplot of age by income
boxplot_age <- ggplot(df, aes(x = income, y = age)) + geom_boxplot()
boxplot_age
# t-test between age and income
t.test(age ~ income, alternative = "two.sided", data = df)
##
## Welch Two Sample t-test
##
## data: age by income
## t = -50.264, df = 17411, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.757250 -7.174955
## sample estimates:
## mean in group <=50K mean in group >50K
## 36.78374 44.24984
The median age of the <=50K group is lower than the median age of the >50K group.
# t-test between fnlwgt and income
t.test(fnlwgt ~ income, alternative = "two.sided", data = df)
##
## Welch Two Sample t-test
##
## data: fnlwgt by income
## t = 1.7412, df = 13615, p-value = 0.08167
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -293.7029 4965.4332
## sample estimates:
## mean in group <=50K mean in group >50K
## 190340.9 188005.0
The p-value for fnlwgt (0.082) is above 0.05, so fnlwgt does not show a significant association with income.
boxplot_educationnum <- ggplot(df, aes(x = income, y = education_num)) + geom_boxplot()
boxplot_educationnum
# t-test between education_num and income
t.test(education_num ~ income, alternative = "two.sided", data = df)
##
## Welch Two Sample t-test
##
## data: education_num by income
## t = -64.896, df = 13421, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.077502 -1.955682
## sample estimates:
## mean in group <=50K mean in group >50K
## 9.595065 11.611657
boxplot_capital_gain <- ggplot(df, aes(x = income, y = capital_gain)) + geom_boxplot()
boxplot_capital_gain
# t-test between capital_gain and income
t.test(capital_gain ~ income, alternative = "two.sided", data = df)
##
## Welch Two Sample t-test
##
## data: capital_gain by income
## t = -23.427, df = 7861.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -4180.166 -3534.614
## sample estimates:
## mean in group <=50K mean in group >50K
## 148.7525 4006.1425
boxplot_capital_loss <- ggplot(df, aes(x = income, y = capital_loss)) + geom_boxplot()
boxplot_capital_loss
# t-test between capital_loss and income
t.test(capital_loss ~ income, alternative = "two.sided", data = df)
##
## Welch Two Sample t-test
##
## data: capital_loss by income
## t = -20.238, df = 9231.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -155.5985 -128.1187
## sample estimates:
## mean in group <=50K mean in group >50K
## 53.14292 195.00153
boxplot_hours_per_week <- ggplot(df, aes(x = income, y = hours_per_week)) + geom_boxplot()
boxplot_hours_per_week
# t-test between hours_per_week and income
t.test(hours_per_week ~ income, alternative = "two.sided", data = df)
##
## Welch Two Sample t-test
##
## data: hours_per_week by income
## t = -45.123, df = 14570, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -6.920943 -6.344690
## sample estimates:
## mean in group <=50K mean in group >50K
## 38.84021 45.47303
Test for categorical variables:
workclass
table_workclass <- table(df$workclass, df$income)
table_workclass
##
## <=50K >50K
## Federal-gov 589 371
## Local-gov 1476 617
## Never-worked 7 0
## Private 17733 4963
## Self-emp-inc 494 622
## Self-emp-not-inc 1817 724
## State-gov 945 353
## Without-pay 14 0
mosaicplot(table_workclass, ylab = "income", xlab = "workclass", main = "workclass vs income", shade = TRUE)
chisq.test(table_workclass)
##
## Pearson's Chi-squared test
##
## data: table_workclass
## X-squared = 827.72, df = 7, p-value < 2.2e-16
education
table_education <- table(df$education, df$income)
table_education
##
## <=50K >50K
## 10th 871 62
## 11th 1115 60
## 12th 400 33
## 1st-4th 162 6
## 5th-6th 317 16
## 7th-8th 606 40
## 9th 487 27
## Assoc-acdm 802 265
## Assoc-voc 1021 361
## Bachelors 3134 2221
## Doctorate 107 306
## HS-grad 8826 1675
## Masters 764 959
## Preschool 51 0
## Prof-school 153 423
## Some-college 5904 1387
mosaicplot(table_education, ylab = "income", xlab = "education", main = "education vs income", shade = TRUE)
chisq.test(table_education)
##
## Pearson's Chi-squared test
##
## data: table_education
## X-squared = 4429.7, df = 15, p-value < 2.2e-16
marital_status
table_marital_status <- table(df$marital_status, df$income)
table_marital_status
##
## <=50K >50K
## Divorced 3980 463
## Married-AF-spouse 13 10
## Married-civ-spouse 8284 6692
## Married-spouse-absent 384 34
## Never-married 10192 491
## Separated 959 66
## Widowed 908 85
mosaicplot(table_marital_status, ylab = "income", xlab = "marital_status", main = "marital_status vs income", shade = TRUE)
chisq.test(table_marital_status)
##
## Pearson's Chi-squared test
##
## data: table_marital_status
## X-squared = 6517.7, df = 6, p-value < 2.2e-16
occupation
table_occupation <- table(df$occupation, df$income)
table_occupation
##
## <=50K >50K
## Adm-clerical 3263 507
## Armed-Forces 8 1
## Craft-repair 3170 929
## Exec-managerial 2098 1968
## Farming-fishing 879 115
## Handlers-cleaners 1284 86
## Machine-op-inspct 1752 250
## Other-service 3158 137
## Priv-house-serv 148 1
## Prof-specialty 2281 1859
## Protective-serv 438 211
## Sales 2667 983
## Tech-support 645 283
## Transport-moving 1277 320
mosaicplot(table_occupation, ylab = "income", xlab = "occupation", main = "occupation vs income", shade = TRUE)
chisq.test(table_occupation)
##
## Pearson's Chi-squared test
##
## data: table_occupation
## X-squared = 3744.9, df = 13, p-value < 2.2e-16
relationship
table_relationship <- table(df$relationship, df$income)
table_relationship
##
## <=50K >50K
## Husband 7275 5918
## Not-in-family 7449 856
## Other-relative 944 37
## Own-child 5001 67
## Unmarried 3228 218
## Wife 823 745
mosaicplot(table_relationship, ylab = "income", xlab = "relationship", main = "relationship vs income", shade = TRUE)
chisq.test(table_relationship)
##
## Pearson's Chi-squared test
##
## data: table_relationship
## X-squared = 6699.1, df = 5, p-value < 2.2e-16
race
table_race <- table(df$race, df$income)
table_race
##
## <=50K >50K
## Amer-Indian-Eskimo 275 36
## Asian-Pac-Islander 763 276
## Black 2737 387
## Other 246 25
## White 20699 7117
chisq.test(table_race)
##
## Pearson's Chi-squared test
##
## data: table_race
## X-squared = 330.92, df = 4, p-value < 2.2e-16
sex
table_sex <- table(df$sex, df$income)
table_sex
##
## <=50K >50K
## Female 9592 1179
## Male 15128 6662
chisq.test(table_sex)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_sex
## X-squared = 1517.8, df = 1, p-value < 2.2e-16
native_country
table_native_country <- table(df$native_country, df$income)
table_native_country
##
## <=50K >50K
## Cambodia 12 7
## Canada 82 39
## China 55 20
## Columbia 57 2
## Cuba 70 25
## Dominican-Republic 68 2
## Ecuador 24 4
## El-Salvador 97 9
## England 60 30
## France 17 12
## Germany 93 44
## Greece 21 8
## Guatemala 61 3
## Haiti 40 4
## Holand-Netherlands 1 0
## Honduras 12 1
## Hong 14 6
## Hungary 10 3
## India 60 40
## Iran 25 18
## Ireland 19 5
## Italy 48 25
## Jamaica 71 10
## Japan 38 24
## Laos 16 2
## Mexico 610 33
## Nicaragua 32 2
## Outlying-US(Guam-USVI-etc) 14 0
## Peru 29 2
## Philippines 137 61
## Poland 48 12
## Portugal 33 4
## Puerto-Rico 102 12
## Scotland 9 3
## South 64 16
## Taiwan 31 20
## Thailand 15 3
## Trinadad&Tobago 17 2
## United-States 21999 7171
## Vietnam 62 5
## Yugoslavia 10 6
mosaicplot(table_native_country, ylab = "income", xlab = "native_country", main = "native_country vs income", shade = TRUE)
chisq.test(table_native_country)
##
## Pearson's Chi-squared test
##
## data: table_native_country
## X-squared = 317.09, df = 40, p-value < 2.2e-16
Based on this exploration, the features retained for modeling are age, workclass, education, marital_status, occupation, sex, hours_per_week, and native_country. These attributes are likely to have a strong association with the target variable (income), have low correlation with each other, are easily interpretable, and do not require complex computations. Using them, we can build a model that is accurate and efficient while remaining interpretable. Age can be a good indicator of experience and seniority, both of which are often correlated with higher income. Education is a good predictor of income, as higher levels of education often lead to better job opportunities and higher salaries. Marital status and occupation can also provide insights into a person's financial situation and earning potential. Finally, hours_per_week can distinguish full-time from part-time employment, with full-time employees typically earning more.
Remove the unselected columns from the dataset.
drop <- c("fnlwgt", "education_num", "relationship", "race", "capital_gain", "capital_loss")
df_new <- df[, !(names(df) %in% drop)]
5. An initial data exploration shows that the missing values in the dataset are denoted by "?" not NA. Change all the "?" characters in the dataframe to NA.
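Because the file was read with na.strings = "?" in read.csv (question 1), the "?" entries were already converted to NA at load time. If that option had been omitted, a minimal sketch of the conversion would be:
# Only needed if na.strings = "?" was not used when reading the file
df_new[df_new == "?"] <- NA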
6. Use the command colSums(is.na(df_new)) to get the number of missing values in each column of the dataframe. Which columns have missing values?
colSums(is.na(df_new))
## age workclass education marital_status occupation
## 0 1836 0 0 1843
## sex hours_per_week native_country income
## 0 0 583 0
The workclass, occupation, and native_country columns have NA values (1,836, 1,843, and 583, respectively).
7. There are several ways we can deal with missing values. The easiest approach is to remove all the rows with missing values. However, if a large number of rows have missing values, removing them will result in loss of information and may affect the classifier performance. If a large number of rows have missing values, then it is typically better to replace missing data with some values. This is called data imputation. Several methods for missing data imputation exist. The most naïve method (which we will use here) is to replace the missing values with the mean of the column (for a numerical column) or the mode/majority value of the column (for a categorical column). We will use a more advanced data imputation method in a later module. For now, replace the missing values in a numerical column with the mean of the column and the missing values in a categorical column with the mode/majority of the column. After imputation, use colSums(is.na(df_new)) to verify that there are no more missing values.
# Impute each categorical column with its mode (most frequent level)
mode_workclass <- names(which.max(table(df_new$workclass)))
df_new$workclass[is.na(df_new$workclass)] <- mode_workclass
mode_occupation <- names(which.max(table(df_new$occupation)))
df_new$occupation[is.na(df_new$occupation)] <- mode_occupation
mode_native_country <- names(which.max(table(df_new$native_country)))
df_new$native_country[is.na(df_new$native_country)] <- mode_native_country
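Only the three categorical columns above contain NAs, so no numeric column needs imputation here. For completeness, a sketch of the mean imputation the question describes, using age purely as an illustration:
# Illustrative only: mean-impute a numeric column (age has no NAs in this dataset)
df_new$age[is.na(df_new$age)] <- mean(df_new$age, na.rm = TRUE)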
# Check for missing values
colSums(is.na(df_new))
## age workclass education marital_status occupation
## 0 0 0 0 0
## sex hours_per_week native_country income
## 0 0 0 0
8. Set the seed of the random number generator to a fixed integer, say 1, so that I can reproduce your work:
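A single call fixes the seed (the value 1 is the one suggested in the question):
# Fix the RNG seed for reproducibility
set.seed(1)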
9. Randomize the order of the rows in the dataset.
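A minimal sketch using base R's sample() to shuffle the row indices of df_new:
# Randomize the row order by sampling all row indices without replacement
df_new <- df_new[sample(nrow(df_new)), ]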
10. This dataset has several categorical variables. With the exception of a few models (such as Naïve Bayes and tree-based models), most machine learning models require numeric features and cannot work directly with categorical data. One way to deal with categorical variables is to assign numeric indices to each level. However, this imposes an artificial ordering on an unordered categorical variable. For example, suppose that we have a categorical variable primary color with three levels: "red", "blue", "green". If we convert "red" to 0, "blue" to 1, and "green" to 2, then we are telling our model that red < blue < green, which is not correct. A better way to encode an unordered categorical variable is one-hot encoding. In one-hot encoding we create a dummy binary variable for each level of a categorical variable. For example, we can represent the primary color variable by three binary dummy variables, one for each color (red, blue, and green). If the color is red, then the variable red takes value 1 while blue and green both take the value zero.
library(mltools)
library(data.table)
# Convert character columns to factors
df_copy <- as.data.frame(unclass(df_new), stringsAsFactors = TRUE)
df_final <- as.data.table(df_copy)
df_final_new <- one_hot(df_final, cols = c("workclass", "education", "marital_status", "occupation", "sex", "native_country"), dropUnusedLevels = TRUE)
df_final_new <- as.data.frame(df_final_new)
head(df_final_new)
## age workclass_Federal-gov workclass_Local-gov workclass_Never-worked
## 1 51 0 0 0
## 2 38 0 0 0
## 3 30 0 0 0
## 4 38 0 0 0
## 5 21 0 0 0
## 6 34 0 0 0
## workclass_Private workclass_Self-emp-inc workclass_Self-emp-not-inc
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 0 0
## 6 1 0 0
## workclass_State-gov workclass_Without-pay education_10th education_11th
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## education_12th education_1st-4th education_5th-6th education_7th-8th
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## education_9th education_Assoc-acdm education_Assoc-voc education_Bachelors
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 1 0 0
## 4 0 0 0 1
## 5 0 0 0 0
## 6 0 0 0 0
## education_Doctorate education_HS-grad education_Masters education_Preschool
## 1 0 1 0 0
## 2 0 1 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## education_Prof-school education_Some-college marital_status_Divorced
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 1 0
## 6 1 0 0
## marital_status_Married-AF-spouse marital_status_Married-civ-spouse
## 1 0 1
## 2 0 1
## 3 0 0
## 4 0 1
## 5 0 0
## 6 0 0
## marital_status_Married-spouse-absent marital_status_Never-married
## 1 0 0
## 2 0 0
## 3 0 1
## 4 0 0
## 5 0 1
## 6 0 1
## marital_status_Separated marital_status_Widowed occupation_Adm-clerical
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## occupation_Armed-Forces occupation_Craft-repair occupation_Exec-managerial
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 1
## occupation_Farming-fishing occupation_Handlers-cleaners
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## occupation_Machine-op-inspct occupation_Other-service
## 1 0 0
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## occupation_Priv-house-serv occupation_Prof-specialty
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## occupation_Protective-serv occupation_Sales occupation_Tech-support
## 1 0 0 1
## 2 0 0 0
## 3 0 0 1
## 4 0 0 1
## 5 0 1 0
## 6 0 0 0
## occupation_Transport-moving sex_Female sex_Male hours_per_week
## 1 0 0 1 40
## 2 0 0 1 40
## 3 0 1 0 40
## 4 0 0 1 40
## 5 0 1 0 20
## 6 0 0 1 50
## native_country_Cambodia native_country_Canada native_country_China
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Columbia native_country_Cuba native_country_Dominican-Republic
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Ecuador native_country_El-Salvador native_country_England
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_France native_country_Germany native_country_Greece
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Guatemala native_country_Haiti
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## native_country_Holand-Netherlands native_country_Honduras native_country_Hong
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Hungary native_country_India native_country_Iran
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Ireland native_country_Italy native_country_Jamaica
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Japan native_country_Laos native_country_Mexico
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Nicaragua native_country_Outlying-US(Guam-USVI-etc)
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## native_country_Peru native_country_Philippines native_country_Poland
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Portugal native_country_Puerto-Rico native_country_Scotland
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_South native_country_Taiwan native_country_Thailand
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## native_country_Trinadad&Tobago native_country_United-States
## 1 0 1
## 2 0 1
## 3 0 1
## 4 0 1
## 5 0 1
## 6 0 1
## native_country_Vietnam native_country_Yugoslavia income
## 1 0 0 <=50K
## 2 0 0 <=50K
## 3 0 0 <=50K
## 4 0 0 >50K
## 5 0 0 <=50K
## 6 0 0 <=50K
11. Scale all numeric features using Min-Max scaling
library(dplyr)
library(caret)
# Select only the numeric features for scaling
numeric_attrs <- sapply(df_final_new, is.numeric)
numeric_df <- df_final_new[, numeric_attrs]
# Select the categorical column (income is the only factor left after one-hot encoding)
cat_variable <- sapply(df_final_new, is.factor)
income <- df_final_new[, cat_variable]
# Perform Min-Max scaling
min_max_scaler <- function(x) {
return((x - min(x)) / (max(x) - min(x)))
}
numeric_df <- as.data.frame(lapply(numeric_df, min_max_scaler))
# Combine the scaled numeric features with the categorical features
data_scaled <- cbind(numeric_df, income)
12. Use 5-fold cross validation with KNN to predict the "income" variable and report the cross-validation error. (You can find an example in slides 51-53 of module 4 lecture notes.)
library(caret)
# Split the data into a training set and a testing set
split1 <- createDataPartition(data_scaled$income, p = 0.8, list = FALSE)
training_data <- data_scaled[split1, ]
testing_data <- data_scaled[-split1, ]
dim(training_data)
## [1] 26049 91
dim(testing_data)
## [1] 6512 91
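The partition above only splits the data; the 5-fold cross-validation itself is not shown. A minimal sketch using caret's trainControl (the object names knn_cv and cv_error are illustrative); the cross-validation error is one minus the mean CV accuracy of the selected model:
set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)
knn_cv <- train(income ~ ., data = training_data, method = "knn", trControl = ctrl)
# Cross-validation error = 1 - accuracy of the best-performing k
cv_error <- 1 - max(knn_cv$results$Accuracy)
cv_error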
13. Tune K (the number of nearest neighbors) by trying out different values, starting from k=1 to k=sqrt(n), where n is the number of observations in the dataset (for example k=1, 5, 10, 20, 50, 100, sqrt(n)). Draw a plot of cross-validation errors for different values of K. Which value of K seems to perform the best on this dataset? (You can find an example in slides 54-55 of module 4 lecture notes.) Note: This might take a long time to run on your machine, so be patient (it took about 30 minutes on my machine to run 5-fold cross validation for 6 different K values).
# Split the dataset into a training set and a test set
set.seed(100)
split2 <- createDataPartition(data_scaled$income, p = 0.8, list = FALSE)
training_set <- data_scaled[split2, ]
test_set <- data_scaled[-split2, ]
# Tune the number of nearest neighbors (K)
set.seed(100)
control <- trainControl(method = "cv", number = 5)
# sqrt(n) must be rounded down to a whole number to be a valid k
grid <- expand.grid(k = c(1, 5, 10, 20, 50, 100, floor(sqrt(nrow(training_set)))))
knn_model <- train(income ~ ., data = training_set, method = "knn", tuneGrid = grid, trControl = control)
# Plot the cross-validation errors for different values of K
plot(knn_model)
14. Use 5-fold cross validation with KNN to predict the income variable and report the average false positive rate (FPR) and false negative rate (FNR) of the classifier. FPR is the proportion of negative instances classified as positive by the classifier. Similarly, FNR is the proportion of positive instances classified as negative by the classifier.
It does not matter which class you designate as positive or negative. For instance, you can designate income>50K as positive and income<=50K as negative.
# Predict the outcomes on the test data
predictions <- predict(knn_model, newdata = test_set)
# Calculate the false positive rate (FPR) and false negative rate (FNR)
confusion_matrix <- confusionMatrix(predictions, test_set$income)
# FPR = FP / (FP + TN): actual negatives (<=50K) that were predicted >50K
fpr <- confusion_matrix$table[2,1] / sum(confusion_matrix$table[,1])
# FNR = FN / (FN + TP): actual positives (>50K) that were predicted <=50K
fnr <- confusion_matrix$table[1,2] / sum(confusion_matrix$table[,2])
# Print the FPR and FNR
cat("False Positive Rate:", fpr, "\n")## False Positive Rate: 0.3423358
## False Negative Rate: 0.4253827
15. Consider a majority classifier which always predicts income<=50K. Without writing any code, explain what the training error of this classifier would be. (Note: the training error of this majority classifier is simply the proportion of all examples with income>50K, because they are all misclassified by this majority classifier.) Compare this with the cross-validation error of KNN you computed in question 13. Does KNN do better than this majority classifier?
The majority classifier misclassifies exactly the examples with income>50K, so its training error is the proportion of such examples: 7841/32561 ≈ 0.241, i.e., about 24.1%. This gives a direct baseline for comparison: KNN does better than the majority classifier whenever its cross-validation error from question 13 is below 24.1% (equivalently, whenever its cross-validation accuracy is above 75.9%). If the tuned KNN's cross-validation error falls below this baseline, KNN does better than the majority classifier; the baseline itself requires no model at all, so any useful classifier should beat it.
16. Explain what the False Positive Rate and False Negative Rate of the majority classifier are, and how they compare to the average FPR and FNR of the KNN classifier you computed in question 14. You don't need to write any code to compute the FPR and FNR of the majority classifier; you can compute them directly from the definitions of FPR and FNR.
False Positive Rate (FPR) and False Negative Rate (FNR) are two measures of a binary classifier's performance. FPR is the proportion of negative examples (income<=50K) that are misclassified as positive (income>50K), while FNR is the proportion of positive examples (income>50K) that are misclassified as negative. The majority classifier never predicts the positive class, so it produces no false positives (FPR = 0) but misclassifies every positive example (FNR = 1). Compared with the KNN classifier from question 14 (reported FPR ≈ 0.34, FNR ≈ 0.43), the majority classifier has a lower FPR and a much higher FNR. Its FPR of 0 is achieved trivially, by never predicting positive at all, whereas KNN correctly identifies more than half of the high-income individuals at the cost of some false positives.