Q.1 Use the given data (“exampleData.rData”) to do the following:
load("exampleData.rData")
head(custdata)
head(hhdata)
head(medianincome)
a. Determine all the missing values in the data (find the # of missing values if any for each attribute). Show the results clearly indicating the missing values.
Finding the missing values by column for custdata
a <- data.frame(colSums(is.na(custdata)))
colnames(a)[colnames(a) == 'colSums.is.na.custdata..'] <- 'num of NAs by column'
a
Here, we can see that housing.type, recent.move and num.vehicles all have 56 missing values
Finding the missing values by column for hhdata
colSums(is.na(hhdata))
household_id cust_id income gp
0 0 0 0
Finding the missing values by column for medianincome
colSums(is.na(medianincome))
State Median.Income
0 0
b. Three variables/features are all missing exactly 56 values suggests that it’s the same 56 customers in each case. Because the missing data represents a fairly small fraction of the dataset, it’s probably safe just to drop these customers from your analysis. Drop these 56 customers from the data set. Show the results.
We know that housing.type, recent.move and num.vehicles all have 56 missing values
cleaned <- custdata[!is.na(custdata$recent.move),]
colSums(is.na(cleaned))
state.of.res custid sex is.employed income marital.stat health.ins housing.type
0 0 0 280 0 0 0 0
recent.move num.vehicles age is.employed.fix1 age.normalized Median.Income income.norm gp
0 0 0 0 0 0 0 0
income.lt.30K age.range Income
0 0 309
From above we can see that housing.type, recent.move and num.vehicles all have 0 missing values.
c.Use the original data and find out how many values are missing in the “is.employed” and replace the NAs with “missing”. This transformation will turn the variable type to string. You can change it back with the as.factor() function. (Create a new column named “is.employed.fixed”.
sum(is.na(cleaned$is.employed))
[1] 280
There are 280 missing values in the is.employed column
Creating a new column is_employed.fix
cleaned$is_employed.fix <- cleaned$is.employed
cleaned$is_employed.fix[is.na(cleaned$is.employed)]<- "missing"
cleaned[c("is.employed","is_employed.fix")]
From the above, we can see that we created a new column is_employed.fix from is.employed where we replaced all the missing values with “missing”
d. Find out the # of missing values in “income” and replace the missing values by zeroes. (create a new column called Income.fix.zeroes).
sum(is.na(cleaned$Income))
[1] 309
cleaned$Income.fix.zeroes <- cleaned$Income
cleaned$Income.fix.zeroes[is.na(cleaned$Income)]<- 0
cleaned[c("Income","Income.fix.zeroes")]
d.2 Find out the # of missing values in “income” and replace the missing values by the mean value of the income. (create a new column called Income.fix.mean).
sum(is.na(cleaned$Income))
[1] 309
cleaned$Income.fix.mean <- cleaned$Income
cleaned$Income.fix.mean[is.na(cleaned$Income)]<- mean(cleaned$income)
cleaned[c("Income","Income.fix.mean")]
a. Use the “medianincome” dataframe and merge this with the “custdata” dataframe on the “state.of.res” feature in “custdata” and “State” feature in “Median.Income”.
new_df <- merge(medianincome, custdata, by.medianincome = "State", by.custdata = "state.of.res")
new_df
b. use the summary() method to summarize 3 features from the dataframe created in a. “state.of.res”, “income” and ‘Median.Income”.
summary(new_df$state.of.res)
Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida
12 2 12 13 114 10 18 4 31
Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana
26 2 3 52 27 9 6 13 17
Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska
10 15 30 43 18 16 23 3 6
Nevada New Hampshire New Jersey New Mexico New York North Carolina North Dakota Ohio Oklahoma
4 7 40 4 94 25 3 59 7
Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont
8 63 2 16 1 16 51 5 1
Virginia Washington West Virginia Wisconsin Wyoming
22 11 7 15 4
summary(new_df$income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-8700 14600 35000 53505 67000 615000
summary(new_df$Median.Income)
Min. 1st Qu. Median Mean 3rd Qu. Max.
37427 44819 50118 50919 55534 68187
c. Normalize the “income” by “Median.Income” and create a new column “income.normed” and display it using summary() method.
Using the min max normalization function
new_df$income.normed <- (new_df$income - min(new_df$Median.Income)) / (max(new_df$Median.Income) - min(new_df$Median.Income))
new_df[c("income","income.normed")]
summary(new_df$income.normed)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.4996 -0.7421 -0.0789 0.5227 0.9614 18.7768
d. Convert the income feature to categorical with anyone having <20K income as TRUE and otherwise FALSE.
new_df$Income.lessthan.20K[new_df$income < 20000]<- "TRUE"
new_df$Income.lessthan.20K[new_df$income >= 20000]<- "FALSE"
new_df[c("income","Income.lessthan.20K")]
e. Convert the “age” into range with the following breaks 0, 25, 65 and Inf.
#This is already given to us
summary(new_df$age.range)
[0,25] (25,65] (65,Inf]
56 732 212
new_df$Age_Range[new_df$age <= 25]<- "[0,25]"
new_df$Age_Range[new_df$age > 25 & new_df$age <= 65 ]<- "(25,65]"
new_df$Age_Range[new_df$age > 65]<- "(65,inf]"
new_df[c("age","Age_Range")]