## The number of columns in the data is : 84
## The number of rows in the data is : 2930
The diagram below shows that the Pool Quality variable has the most missing values. Pool Quality, Misc. Feature, Type of Alley and Fence Quality are missing together in a total of 715 cases. The same group, with the addition of the Fireplace variable, is missing together in a further 639 cases.
The column names were checked and 3 variables showed an anomaly. The first was the column named “…47”, the second “…66” and the third “…73”. Before proceeding, the values in these columns were checked, and it was seen that the “…66” and “…47” columns were filled with 0’s. The “…73” column was renamed to “ThreeSsnPorch” (3 Season Porch Area in Square Feet) and kept in the data, as it contained other values and not only 0’s.
The “Order” & “P IdentificationNumber” columns were also removed as they are irrelevant to the analysis. “SalePrice” will also be renamed to “Sale.Price” for consistency with the other Sale variables.
## The number of columns with complete data is : 23
## The number of columns with missing data is : 57
I further visualised this to make it easier to see the intensity of the missing values. From the diagram below, I can see that 8.6% of the data is missing.
One of the columns that didn’t have NA’s was the sale price column. The histogram below shows positive skewness and an evident peak, and it doesn’t follow a normal distribution. I will keep this in mind: before modelling, the sale price variable needs to be log-transformed. The values are also very large. Logging will fix the distribution to a reasonable extent.
names(house_data)[names(house_data) == "SalePrice"] <- "Sale.Price"
summary(house_data$Sale.Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12400 129100 160500 180705 214275 785500
I realised that in some cases NA means “not available” (the feature is absent) rather than missing, so I applied that to the categories that appeared as missing, starting with Pool QC from the missing-data diagram displayed above.
I also noticed that deductions can be made from related variables, e.g. Fireplace. I checked whether the number of houses with NA in the fireplace quality equals the number of houses with zero fireplaces, since if a house has no fireplace then no quality can be determined.
I will replace the fireplace quality with None.
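The consistency check described above can be sketched on a small toy data frame. The `toy` data, its values and the helper names `inconsistent` and `no_fp` are assumptions made for illustration; only the column names and the “NoFireplace” label come from the report.

```r
# Toy data standing in for house_data (values are illustrative only).
toy <- data.frame(
  Fireplaces   = c(0, 1, 0, 2, 0),
  Fireplace.Qu = c(NA, "Gd", NA, "TA", NA),
  stringsAsFactors = FALSE
)

# Rows with a fireplace but no recorded quality would be genuinely missing
# values rather than "no fireplace", so they are listed separately.
inconsistent <- which(is.na(toy$Fireplace.Qu) & toy$Fireplaces > 0)

# Relabel NA only where the fireplace count confirms there is no fireplace.
no_fp <- is.na(toy$Fireplace.Qu) & toy$Fireplaces == 0
toy$Fireplace.Qu[no_fp] <- "NoFireplace"
```

Comparing row by row, rather than comparing two totals, shows exactly which rows disagree when the counts do not match.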
The next group that can be handled is the Lot group which has 4 variables as displayed below
head(house_data %>% select(starts_with("Lot")))
Lot Frontage is left-skewed, so the mode of its non-NA values will be used to replace the NA’s. I also used the Lot Shape to get the Lot Area: I observed that the smaller the Lot Area, the more likely the lot is classified as regular. I used the mode of each category to avoid bias when replacing the NA’s.
For the Lot Shape, I checked the minimum and maximum of each category and worked from those. I used a nested if-else to create boundaries for assigning the Lot Shape, as it seemed the fairest way to distribute the shape of the lot.
aggregate(Lot.Area ~ Lot.Shape, data = house_data, FUN=min)
aggregate(Lot.Area ~ Lot.Shape, data = house_data, FUN=max)
head(house_data %>% select(starts_with("Garage")))
| Variable Name | Number of Missings |
|---|---|
| Garage Area | 1 |
| Garage Year Built | 159 |
| Garage Type | 157 |
| Garage Finish | 159 |
| Garage Cars | 1 |
| Garage Quality | 159 |
| Garage Condition | 159 |
Four of the variables have the same number of missing values (159), so they will be checked together. I noticed again that, just like the fireplace variables, some of the garage variables are related.
If a Garage Type is NA, there is also no finish, the quality cannot be determined and the condition is not available. I will also simply give the garage year built the same year the house was built.
When I counted the rows where Garage Type and the related variables are all NA, the result was only 157: 2 rows were left out because they do not share the same all-NA “No Garage” pattern as the others.
length(which(is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish) &
is.na(house_data$Garage.Cond) & is.na(house_data$Garage.Qual)))
## [1] 157
(house_data[!is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish),
c('Garage.Cars', 'Garage.Area', 'Garage.Type', 'Garage.Cond',
'Garage.Qual', 'Garage.Finish')])
which(!is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish))
## [1] 1357 2237
The 2 houses that were not part of the 157 are in rows 1357 and 2237. Row 1357 appears to have a garage and row 2237 does not. I will therefore begin with the row that actually has a garage and replace its NA’s with the mode.
For the second row, 2237, the garage type is “Detchd” (detached), but the other garage variables for that row indicate that there is no garage, so the row needs to be adjusted. I will change the garage type to NA and the garage cars and area to 0. All remaining NA’s will then be changed to “NoGarage”.
head(house_data %>% select(starts_with("Bsmt")))
| Variable Name | Number of Missings |
|---|---|
| Bsmt.Full.Bath | 2 |
| Bsmt.Half.Bath | 2 |
| Bsmt.Exposure | 212 |
| Bsmt.Qual | 200 |
| Bsmt.Cond | 214 |
| BsmtFin.Type.1 | 231 |
| BsmtFin.Type.2 | 211 |
| BsmtFin.SF.1 | 156 |
| BsmtFin.SF.2 | 133 |
| Bsmt.Unf.SF | 137 |
I checked the number of rows where the basement variables are NA together and got 80 under that condition, with 451 cases being the opposite. The same reasoning as for the garage applies to the basement variables: some NA’s actually mean “No Basement” and shouldn’t be treated as missing values.
MS.Zoning & Mason
If there’s no Area, then there’s no Type. As with the previous related variables, I checked whether the missing values mostly fall on the same rows, and then used the relationship to decide whether to assign “None” or a category. There are 32 cases where the Mason group has NA’s in the same row.
For the rest of the variables, I used the mode imputation.
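Base R’s `mode()` reports an object’s storage type rather than its most frequent value, so the mode imputation presumably relies on a custom or package-provided function. A minimal sketch of such a helper follows; the name `stat_mode` and the example vector `zoning` are hypothetical and not taken from the report.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)          # nothing left to count
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]   # first value with the highest count
}

# Illustrative vector with two missing entries.
zoning <- c("RL", "RM", NA, "RL", "FV", NA)
zoning[is.na(zoning)] <- stat_mode(zoning)
```

Returning `NA` for an all-missing vector avoids the `max()` warnings that appear when a frequency table is empty.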
Converting to Factors & Label Encoding using “Revalue” as opposed to Nested-IfElse and CaseWhen.
I have encoded some columns and display below what one of the encoded variables now looks like. I checked for ordinality to decide whether to convert a column to a factor or encode it.
head(house_data$Fireplace.Qu)
## [1] 4 0 0 3 3 4
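As an alternative illustration of ordinal label encoding, a named vector can be used directly as a lookup table in base R. This is a sketch under assumptions: the `qual` vector is made up, and the lookup approach is swapped in for `revalue` purely to show the idea. Only the `Rank_1` mapping mirrors the report.

```r
# Ordinal mapping copied from the report's Rank_1 definition.
Rank_1 <- c(None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

# Illustrative quality labels (not real data).
qual <- c("Gd", "None", "TA", "Ex", "Po")

# Index the named vector by label; unname() drops the label names.
encoded <- unname(Rank_1[qual])

# Labels missing from the mapping would come back as NA, so list them.
unmapped <- qual[is.na(encoded)]
```

Listing `unmapped` makes it obvious when a column contains a label the mapping does not cover, rather than silently producing NA’s.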
Since the sale price is the response variable, I checked its correlation with the numerical variables.
The variable with the highest correlation with sale price was Overall Quality.
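A ranking of predictors by their correlation with the response can be sketched as follows. The data are simulated, so the variable names, coefficients and resulting ordering are assumptions for illustration and say nothing about the real housing data.

```r
set.seed(42)
n <- 200

# Simulated predictors and response (illustrative only).
overall_qual <- sample(1:10, n, replace = TRUE)
gr_liv_area  <- rnorm(n, mean = 1500, sd = 400)
noise_var    <- rnorm(n)
sale_price   <- 20000 * overall_qual + 50 * gr_liv_area + rnorm(n, sd = 20000)

toy <- data.frame(overall_qual, gr_liv_area, noise_var, sale_price)

# Correlation of every column with the response, response itself dropped.
cors   <- cor(toy)[, "sale_price"]
ranked <- sort(cors[names(cors) != "sale_price"], decreasing = TRUE)
```

Sorting the single response column of the correlation matrix gives a quick shortlist of candidate predictors.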
After checking the sale price, I noticed that the histogram was heavily skewed. I’ll use a Q-Q plot for further analysis and to determine how to transform it so there’s less skew.
The points in the Q-Q plot don’t follow the line, so the distribution is not normal.
I’ll transform the sale price using a log, redo the Q-Q plot and check for changes.
The skewness and non-normality are now largely corrected.
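The effect of the log transform on skewness can be checked numerically as well as visually. The sketch below uses simulated right-skewed prices and a hand-written moment-based `skewness` helper; both are assumptions for illustration, not the report’s data or code.

```r
# Moment-based sample skewness (hypothetical helper).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Simulated right-skewed "prices" (log-normal), illustrative only.
price <- exp(rnorm(1000, mean = 12, sd = 0.4))

raw_skew <- skewness(price)          # clearly positive for log-normal data
log_skew <- skewness(log10(price))   # close to zero after the transform
```

A skewness near zero after the transform agrees with the Q-Q plot points moving onto the reference line.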
cleanData <- house_data[,c("MS.SubClass","Neighborhood","Overall.Qual","Gr.Liv.Area",
"Exter.Qual","Kitchen.Qual","Garage.Cars","X1st.Flr.SF","Bsmt.Qual",
"Full.Bath","Total.Bsmt.SF","Year.Built","Fireplace.Qu","Sale.Price")]
I selected the most important variables from the correlation matrix, along with 2 categorical variables I believed would be important because no other variable is related to them. They may prove useful. I have performed all necessary encoding and imputations.
set.seed(007)
split <- initial_split(cleanData, prop = 0.7,strata = "Sale.Price")
house_train <- training(split)
house_test <- testing(split)
blueprint_sale <- recipe(Sale.Price ~ ., data = house_train[,-1]) %>%
  step_log(Sale.Price, base = 10) %>%
  step_nzv(all_nominal()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
ctcontrol <- trainControl(
  method = "repeatedcv",
  number = 7,
  repeats = 3
)
exp_grid_train <- expand.grid(k = seq(5, 15, by = 1))
Other trainControl methods: boot, boot632, optimism_boot, boot_all, cv, repeatedcv, LOOCV, LGOCV, oob, adaptive_cv, adaptive_boot or adaptive_LGOCV.
Grid Search - The default in caret. It suits optimising the parameters of a model, especially where there are multiple parameters and we want to choose the most effective combination.
Random Search - This type of search is also regarded as “hit-or-miss”. It tries multiple random combinations of the parameters of a model and then selects the best one.
Metric Used - RMSE (Root Mean Squared Error)
knn_model_a1 <- train(blueprint_sale,
data = house_train, method = "knn",
trControl = ctcontrol, tuneGrid = exp_grid_train,
metric = "RMSE")
prepare <- prep(blueprint_sale, training = house_train)
baked_train <- bake(prepare, new_data = house_train)
baked_test <- bake(prepare, new_data = house_test)
head(baked_train)
head(baked_test)
I also tested Rsquared as the metric and didn’t note any large difference between it and RMSE (k = 7).
ggplot(knn_model_a1)
pdvalues <- predict(knn_model_a1,house_test)
house_test$predicted <- pdvalues
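# Note: predict() returns values on the log10 scale (the recipe logs Sale.Price),
# while Sale.Price is still in dollars here, so this residual mixes the two scales.
house_test <- house_test %>% mutate(residual = predicted-Sale.Price)
# Back-transform the predictions from log10 to dollars.
house_test$predicted <- 10^house_test$predicted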
summary(house_test$residual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -750994 -213595 -160395 -181534 -129095 -34695
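Residuals are only interpretable when the predictions and the observed prices are on the same scale. The sketch below back-transforms log10 predictions to dollars before subtracting and then computes the RMSE; the three prices and predictions are made-up numbers, not model output.

```r
# Illustrative observed prices and log10-scale predictions (made-up values).
actual   <- c(150000, 200000, 120000)
pred_log <- log10(c(140000, 210000, 125000))

# Back-transform first, then subtract, so both terms are in dollars.
pred     <- 10^pred_log
residual <- pred - actual

# Root mean squared error on the dollar scale.
rmse <- sqrt(mean(residual^2))
```

If the subtraction is done before the back-transform, each residual is roughly the negated sale price, which would explain a summary in which every value is large and negative.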
global_data_clean <- global_economic_Data %>% mutate_all(imputeTS::na_interpolation)
Computing the GDP per capita (in dollars) in order to control for relative magnitude: GDP / population.
global_data_clean <- global_data_clean %>% mutate(gdpercap = gdp/population)
After analysing GDP per capita for the continents of the world from 1960 to 2016, it was seen that the continent with the highest GDP per capita overall was Oceania, followed by the Americas, with Europe being the continent with the lowest GDP per capita.
The GDP per capita started high in 1960 and plummeted around 1985. It stayed constant at that low level and rose again in the year 2000. It became unstable after that year, with rises and falls before another decrease around 2012. 2016, the final year, ended with an even steeper decline.
I grouped the regions with their continents - Africa, Americas, Asia, Europe and Oceania. Below, there’s a diagram that shows the infant mortality rate by regions. The Region with the highest mortality rate for children is the Western part of Africa with the Eastern part of Africa following right behind.
Overall, it seems that Africa has the highest infant mortality rate, followed by Asia and then the Americas. Oceania is the continent with the lowest infant mortality rates across the 56 years examined (1960-2016).
The relationships between the variables are shown below. I will ignore the negatives and focus on the positive numbers because they are significant to the interpretation of the relationship between the variables and are the primary concern.
In conclusion, the variables mentioned above all show a high level of correlation. They depend strongly on one another to increase or decrease.
Equation 1: \(2x + 3y = 4\)
Equation 2: \(x - 2y = 3\)
Step 1: make \(x\) the subject of equation 2, \(x = 3 + 2y\), and substitute into equation 1: \(2(3 + 2y) + 3y = 4\)
Step 2: \(6 + 4y + 3y = 4\)
Step 3: \(4y + 3y = 4 - 6\)
Step 4: \(7y = -2\)
Step 5: \(y = -\frac{2}{7}\)
Solve for \(x\): \(x - 2 \times \left(-\frac{2}{7}\right) = 3\), so \(x = 3 - \frac{4}{7} = \frac{17}{7}\)
Determinant: because the determinant is non-zero, the matrix is invertible, so m1 has an inverse. Also, when the determinant of a matrix is negative (especially a 2x2), it means that the orientation of the column vectors isn’t standard: in this case they have a clockwise rather than a counter-clockwise orientation.
m1 <- rbind(c(2,3),c(1,-2))
m2 <- rbind(4,3)
det(m1)
## [1] -7
To solve in R, I dropped the x and y so that only the coefficients and constants remain, and then built the matrices using rbind.
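Written out, dropping the variables amounts to putting the system in matrix form. The symbols \(A\), \(\mathbf{x}\) and \(\mathbf{b}\) are notation introduced here for illustration, with \(A\) corresponding to m1 and \(\mathbf{b}\) to m2:

\[
A\mathbf{x} = \mathbf{b}:
\qquad
\begin{pmatrix} 2 & 3 \\ 1 & -2 \end{pmatrix}
\begin{pmatrix} x \\ y \end{pmatrix}
=
\begin{pmatrix} 4 \\ 3 \end{pmatrix},
\qquad
\det A = (2)(-2) - (3)(1) = -7 .
\]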
After Solving with R:
solve(m1,m2)
## [,1]
## [1,] 2.4285714
## [2,] -0.2857143
Further interpretation of the determinant: because m1 is invertible, the coefficient matrix m1 (holding the coefficients of x and y) can be inverted and multiplied by the constants vector m2. The answer is the same whichever way it is solved.
solve(m1)%*%m2
## [,1]
## [1,] 2.4285714
## [2,] -0.2857143
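The inverse route can also be checked by hand with the standard formula for the inverse of a 2x2 matrix. Here \(A\) stands for m1 and \(\mathbf{b}\) for m2; this notation is an assumption made for the illustration:

\[
A^{-1} = \frac{1}{\det A}\begin{pmatrix} -2 & -3 \\ -1 & 2 \end{pmatrix}
       = -\frac{1}{7}\begin{pmatrix} -2 & -3 \\ -1 & 2 \end{pmatrix},
\qquad
A^{-1}\mathbf{b}
= -\frac{1}{7}\begin{pmatrix} (-2)(4) + (-3)(3) \\ (-1)(4) + (2)(3) \end{pmatrix}
= -\frac{1}{7}\begin{pmatrix} -17 \\ 2 \end{pmatrix}
= \begin{pmatrix} \tfrac{17}{7} \\ -\tfrac{2}{7} \end{pmatrix}.
\]

Since \(\tfrac{17}{7} \approx 2.4286\) and \(-\tfrac{2}{7} \approx -0.2857\), this agrees with both R results.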
Nullspace: the null space contains only the zero vector, so the columns of the matrix are linearly independent.
B <- rbind(c(1,3), c(2,-6))
nullspace(B)
## NULL
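The same conclusion can be cross-checked in base R without an add-on package, using the determinant and the rank-nullity theorem. This is a sketch rather than the report’s method; the names `det_B`, `rank_B` and `null_dim` are introduced here for illustration.

```r
B <- rbind(c(1, 3), c(2, -6))

# A non-zero determinant means B is invertible.
det_B <- det(B)

# Numerical rank from a QR decomposition.
rank_B <- qr(B)$rank

# Rank-nullity: dimension of the null space = columns minus rank.
null_dim <- ncol(B) - rank_B
```

A null-space dimension of zero means only the zero vector is mapped to zero, which matches the `NULL` result above.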
A · B
r_matrix <- m1%*%B
r_matrix
## [,1] [,2]
## [1,] 8 -12
## [2,] -3 15
Rank of Matrix
cat("The rank of the multiplied matrix is: ", Rank(r_matrix))
## The rank of the multiplied matrix is: 2
Rank(r_matrix) == Rank(m1)
## [1] TRUE
Eigenvalues and eigenvectors. After testing the eigenvectors, I got negative values.
V <- eigen(m1)
class(V)
## [1] "eigen"
V$values
## [1] -2.645751 2.645751
V$vectors
## [,1] [,2]
## [1,] -0.5424768 0.9776088
## [2,] 0.8400708 0.2104307
V$vectors%*%m1
## [,1] [,2]
## [1,] -0.1073448 -3.582648
## [2,] 1.8905723 2.099351
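One common way to test an eigen decomposition is to confirm the defining relation, that the matrix times each eigenvector equals that eigenvector scaled by its eigenvalue. The sketch below does this for m1; the names `lhs`, `rhs` and `max_err` are introduced here for illustration.

```r
m1 <- rbind(c(2, 3), c(1, -2))
V  <- eigen(m1)

# Left side: m1 applied to every eigenvector (the columns of V$vectors).
lhs <- m1 %*% V$vectors

# Right side: each eigenvector column scaled by its own eigenvalue.
rhs <- V$vectors %*% diag(V$values)

# Largest absolute difference; it should be zero up to rounding error.
max_err <- max(abs(lhs - rhs))
```

Because the trace of m1 is 0 and its determinant is -7, the eigenvalues are plus and minus the square root of 7, which agrees with the printed ±2.645751.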
Bulky code is kept here.
house_data$Pool.QC[is.na(house_data$Pool.QC)] <- 'NoPool'
house_data$Misc.Feature[is.na(house_data$Misc.Feature)] <- "NoMisc"
house_data$Alley[is.na(house_data$Alley)] <- "NoAlley"
house_data$Fence[is.na(house_data$Fence)] <- "NoFence"
sum(is.na(house_data$Fireplace.Qu)) == sum(house_data$Fireplaces==0)
## [1] FALSE
house_data$Fireplace.Qu[is.na(house_data$Fireplace.Qu)] <- "NoFireplace"
sum(table(house_data$Fireplace.Qu))
## [1] 2930
sum(table(house_data$Fireplaces))
## [1] 2930
house_data$Lot.Frontage[is.na(house_data$Lot.Frontage)] <- median(house_data$Lot.Frontage[!is.na(house_data$Lot.Frontage)])
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
sum(is.na(house_data$Lot.Shape))
## [1] 0
sum(is.na(house_data$Lot.Area))
## [1] 0
sum(is.na(house_data$Lot.Config))
## [1] 0
aggregate(Lot.Area ~ Lot.Shape, data = house_data, FUN=median)
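# Note: this result is printed, not assigned back to house_data. `mode` is assumed to be a
# user-defined most-frequent-value helper, since base R's mode() has no na.rm argument.
house_data %>% group_by(Lot.Shape) %>%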
mutate(Lot.Area = ifelse(is.na(Lot.Area), mode(Lot.Area, na.rm = TRUE), Lot.Area))
sum(!is.na(house_data$Lot.Frontage))
## [1] 2930
length(which(is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish) & is.na(house_data$Garage.Cond) & is.na(house_data$Garage.Qual)))
## [1] 0
house_data$Garage.Type[is.na(house_data$Garage.Type)] <- "None"
house_data$Garage.Qual[is.na(house_data$Garage.Qual)] <- "None"
house_data$Garage.Cond[is.na(house_data$Garage.Cond)] <- "None"
sum(is.na(house_data$Bsmt.Full.Bath))
## [1] 0
sum(is.na(house_data$Bsmt.Half.Bath))
## [1] 0
sum(is.na(house_data$Bsmt.Exposure))
## [1] 0
sum(is.na(house_data$Bsmt.Qual))
## [1] 0
sum(is.na(house_data$Bsmt.Cond))
## [1] 0
sum(is.na(house_data$BsmtFin.Type.1))
## [1] 0
sum(is.na(house_data$BsmtFin.Type.2))
## [1] 0
sum(is.na(house_data$BsmtFin.SF.1))
## [1] 0
sum(is.na(house_data$BsmtFin.SF.2))
## [1] 0
sum(is.na(house_data$Bsmt.Unf.SF))
## [1] 0
length(which(is.na(house_data$Bsmt.Qual) & is.na(house_data$Bsmt.Cond) & is.na(house_data$Bsmt.Exposure) & is.na(house_data$BsmtFin.Type.1) & is.na(house_data$BsmtFin.Type.2)))
## [1] 0
house_data[!is.na(house_data$BsmtFin.Type.1) & (is.na(house_data$Bsmt.Cond)|is.na(house_data$Bsmt.Qual)|is.na(house_data$Bsmt.Exposure)|is.na(house_data$BsmtFin.Type.2)), c('Bsmt.Qual', 'Bsmt.Cond', 'Bsmt.Exposure', 'BsmtFin.Type.1', 'BsmtFin.Type.2')] <- house_data[!is.na(house_data$BsmtFin.Type.1) & (is.na(house_data$Bsmt.Cond)|is.na(house_data$Bsmt.Qual)|is.na(house_data$Bsmt.Exposure)|is.na(house_data$BsmtFin.Type.2)), c('Bsmt.Qual', 'Bsmt.Cond', 'Bsmt.Exposure', 'BsmtFin.Type.1', 'BsmtFin.Type.2')] %>% mutate_if(is_character, function(x) replace(x, is.na(x), mode(x, na.rm = TRUE)))
## Warning in max(xtab): no non-missing arguments to max; returning -Inf
## Warning in max(xtab): no non-missing arguments to max; returning -Inf
house_data$Bsmt.Qual[is.na(house_data$Bsmt.Qual)] <- "None"
house_data$Bsmt.Cond[is.na(house_data$Bsmt.Cond)] <- "None"
house_data$Bsmt.Exposure[is.na(house_data$Bsmt.Exposure)] <- "None"
house_data$BsmtFin.Type.1[is.na(house_data$BsmtFin.Type.1)] <- "None"
house_data$BsmtFin.Type.2[is.na(house_data$BsmtFin.Type.2)] <- "None"
house_data$Bsmt.Half.Bath[is.na(house_data$Bsmt.Half.Bath)] <- 0
house_data$Full.Bath[is.na(house_data$Full.Bath)] <- 0
house_data$BsmtFin.SF.1[is.na(house_data$BsmtFin.SF.1)] <- 0
house_data$Bsmt.Unf.SF[is.na(house_data$Bsmt.Unf.SF)] <- 0
house_data$Total.Bsmt.SF[is.na(house_data$Total.Bsmt.SF)] <- 0
house_data$BsmtFin.SF.2[is.na(house_data$BsmtFin.SF.2)] <- 0
sum(is.na(house_data$BsmtFin.SF.1))
## [1] 0
house_data$MS.Zoning[is.na(house_data$MS.Zoning)] <- mode(house_data$MS.Zoning)
length(which(is.na(house_data$Mas.Vnr.Type) & is.na(house_data$Mas.Vnr.Area)))
## [1] 0
house_data$Mas.Vnr.Type[is.na(house_data$Mas.Vnr.Type)] <- "None"
house_data$Mas.Vnr.Area[is.na(house_data$Mas.Vnr.Area)] <- 0
house_data<- house_data%>% mutate_if(is.numeric, function(x) replace(x, is.na(x), mode(x, na.rm = TRUE)))
house_data <- house_data%>% mutate_if(is_character, function(x) replace(x, is.na(x), mode(x, na.rm = TRUE)))
data.frame(colnames(house_data))
cols <- c(74,6,73,10,58,2,1,8,23,24,42,79,39,21,22,15,16,12,13,14,77,76)
house_data[,cols] <- lapply(house_data[,cols] , factor)
Rank_1 <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
Rank_2 <- c('None'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)
Rank_3 <- c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)
Rank_4 <- c('None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4)
Rank_5 <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)
Rank_6 <- c('None'=0, 'BrkCmn'=0, 'BrkFace'=1, 'Stone'=2)
Rank_7 <- c('Sev'=0, 'Mod'=1, 'Gtl'=2)
Rank_8 <- c('N'=0, 'P'=1, 'Y'=2)
Rank_9 <- c('Grvl'=0, 'Pave'=1)
Rank_10 <- c('N'=0, 'Y'=1)
Rank_11 <- c('Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7)
Rank_12 <- c('NoPool' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
Rank_13 <- c('NoFireplace' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
house_data$Lot.Shape<-as.integer(revalue(house_data$Lot.Shape,Rank_3))
## The following `from` values were not present in `x`: IR3, IR2, IR1, Reg
house_data$Garage.Finish<-as.integer(revalue(house_data$Garage.Finish,Rank_2))
## The following `from` values were not present in `x`: None, Unf, RFn, Fin
house_data$Bsmt.Qual<-as.integer(revalue(house_data$Bsmt.Qual,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Garage.Cond<-as.integer(revalue(house_data$Garage.Cond,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Functional<-as.integer(revalue(house_data$Functional,Rank_11))
## The following `from` values were not present in `x`: Sal, Sev, Maj2, Maj1, Mod, Min2, Min1, Typ
house_data$Street<-as.integer(revalue(house_data$Street,Rank_9))
## The following `from` values were not present in `x`: Grvl, Pave
house_data$Fireplace.Qu<-as.integer(revalue(house_data$Fireplace.Qu,Rank_13))
## The following `from` values were not present in `x`: NoFireplace, Po, Fa, TA, Gd, Ex
house_data$BsmtFin.Type.1<-as.integer(revalue(house_data$BsmtFin.Type.1,Rank_5))
## The following `from` values were not present in `x`: None, Unf, LwQ, Rec, BLQ, ALQ, GLQ
house_data$Bsmt.Cond<-as.integer(revalue(house_data$Bsmt.Cond,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Exter.Qual<-as.integer(revalue(house_data$Exter.Qual,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Exter.Cond<-as.integer(revalue(house_data$Exter.Cond,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Garage.Qual<-as.integer(revalue(house_data$Garage.Qual,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Pool.QC<-as.integer(revalue(house_data$Pool.QC,Rank_12))
## The following `from` values were not present in `x`: NoPool, Po, Fa, TA, Gd, Ex
house_data$Kitchen.Qual<-as.integer(revalue(house_data$Kitchen.Qual,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Heating.QC<-as.integer(revalue(house_data$Heating.QC,Rank_1))
## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex
house_data$Land.Slope<-as.integer(revalue(house_data$Land.Slope,Rank_7))
## The following `from` values were not present in `x`: Sev, Mod, Gtl
house_data$Central.Air<-as.integer(revalue(house_data$Central.Air,Rank_10))
## The following `from` values were not present in `x`: N, Y
house_data$Paved.Drive<-as.integer(revalue(house_data$Paved.Drive,Rank_8))
## The following `from` values were not present in `x`: N, P, Y
# Convert the count, area, year and price columns to numeric in one pass.
num_cols <- c("BsmtFin.SF.1", "BsmtFin.SF.2", "Sale.Price", "Misc.Val", "Year.Built",
              "Year.Remod.Add", "Mas.Vnr.Area", "Bsmt.Unf.SF", "Total.Bsmt.SF", "Pool.Area",
              "Screen.Porch", "ThreeSsnPorch", "Enclosed.Porch", "Open.Porch.SF", "Wood.Deck.SF",
              "Garage.Area", "Garage.Cars", "Garage.Yr.Blt", "Fireplaces", "TotRms.AbvGrd",
              "Kitchen.AbvGr", "Bedroom.AbvGr", "Half.Bath", "Full.Bath", "Bsmt.Full.Bath",
              "Bsmt.Half.Bath", "X1st.Flr.SF", "X2nd.Flr.SF", "Low.Qual.Fin.SF", "Gr.Liv.Area")
house_data[num_cols] <- lapply(house_data[num_cols], as.numeric)
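# Note: the result of this pipeline is printed but not assigned back to house_data, so columns
# such as Lot.Area, Lot.Frontage and Overall.Qual still show as chr in the str() output below.
house_data %>%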
mutate_at(vars(Lot.Area, Lot.Frontage, Overall.Qual,Overall.Cond,Year.Built,Year.Remod.Add,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,Total.Bsmt.SF,X1st.Flr.SF,X2nd.Flr.SF,Low.Qual.Fin.SF,Gr.Liv.Area,Bsmt.Full.Bath,Full.Bath,Half.Bath,Bedroom.AbvGr,Kitchen.AbvGr,TotRms.AbvGrd,Fireplaces,Garage.Yr.Blt,Garage.Cars,Garage.Area,Wood.Deck.SF,Open.Porch.SF,Enclosed.Porch,ThreeSsnPorch,Screen.Porch,Pool.Area), as.numeric)
str(house_data)
## tibble [2,930 × 80] (S3: tbl_df/tbl/data.frame)
## $ MS.SubClass : Factor w/ 15 levels "120","160","180",..: 5 5 5 5 10 10 1 1 1 10 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
## $ Lot.Frontage : chr [1:2930] "143.8" "81.6" "79.4" "91.1" ...
## $ Lot.Area : chr [1:2930] "32405.4" "11622" "13981.7" "11383.2" ...
## $ Street : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
## $ Alley : Factor w/ 3 levels "Grvl","NoAlley",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Lot.Shape : int [1:2930] 2 3 2 3 2 2 3 2 2 3 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
## $ Utilities : chr [1:2930] "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
## $ Land.Slope : int [1:2930] 2 2 2 2 2 2 2 2 2 2 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 16 25 9 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
## $ Overall.Qual : chr [1:2930] "6" "5" "5" "7" ...
## $ Overall.Cond : chr [1:2930] "5" "6" "6" "5" ...
## $ Year.Built : num [1:2930] 1960 1961 1958 1968 1997 ...
## $ Year.Remod.Add : num [1:2930] 1960 1961 1958 1968 1998 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 7 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 15 15 15 6 7 15 15 ...
## $ Mas.Vnr.Type : chr [1:2930] "Stone" "None" "BrkFace" "None" ...
## $ Mas.Vnr.Area : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
## $ Exter.Qual : int [1:2930] 3 3 3 4 3 3 4 4 4 3 ...
## $ Exter.Cond : int [1:2930] 3 3 3 3 3 3 3 3 3 3 ...
## $ Foundation : chr [1:2930] "CBlock" "CBlock" "CBlock" "CBlock" ...
## $ Bsmt.Qual : int [1:2930] 3 3 3 3 4 3 4 4 4 3 ...
## $ Bsmt.Cond : int [1:2930] 4 3 3 3 3 3 3 3 3 3 ...
## $ Bsmt.Exposure : chr [1:2930] "Gd" "No" "No" "No" ...
## $ BsmtFin.Type.1 : int [1:2930] 4 0 5 5 6 6 6 5 6 1 ...
## $ BsmtFin.SF.1 : num [1:2930] 639 467 0 1068 0 ...
## $ BsmtFin.Type.2 : chr [1:2930] "Unf" "LwQ" "Unf" "Unf" ...
## $ BsmtFin.SF.2 : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : num [1:2930] 385 0 412 1103 112 ...
## $ Total.Bsmt.SF : num [1:2930] 1080 882 1329 2110 928 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : int [1:2930] 2 3 3 5 4 5 5 5 5 4 ...
## $ Central.Air : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
## $ Electrical : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ X1st.Flr.SF : num [1:2930] 1602 898 1315 2125 855 ...
## $ X2nd.Flr.SF : num [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
## $ Low.Qual.Fin.SF: num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Gr.Liv.Area : num [1:2930] 1788 888 1327 2061 1611 ...
## $ Bsmt.Full.Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
## $ Bsmt.Half.Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : num [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
## $ Half.Bath : num [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
## $ Bedroom.AbvGr : num [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
## $ Kitchen.AbvGr : num [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : int [1:2930] 3 3 4 5 3 4 4 4 4 4 ...
## $ TotRms.AbvGrd : num [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
## $ Functional : int [1:2930] 7 7 7 7 7 7 7 7 7 7 ...
## $ Fireplaces : num [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
## $ Fireplace.Qu : int [1:2930] 4 0 0 3 3 4 0 0 3 3 ...
## $ Garage.Type : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Garage.Yr.Blt : num [1:2930] 1960 1961 1958 1968 1997 ...
## $ Garage.Finish : int [1:2930] 3 1 1 3 3 3 3 2 2 3 ...
## $ Garage.Cars : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
## $ Garage.Area : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
## $ Garage.Qual : int [1:2930] 3 3 3 3 3 3 3 3 3 3 ...
## $ Garage.Cond : int [1:2930] 3 3 3 3 3 3 3 3 3 3 ...
## $ Paved.Drive : int [1:2930] 1 2 2 2 2 2 2 2 2 2 ...
## $ Wood.Deck.SF : num [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
## $ Open.Porch.SF : num [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
## $ Enclosed.Porch : num [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
## $ ThreeSsnPorch : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen.Porch : num [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
## $ Pool.Area : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
## $ Fence : Factor w/ 5 levels "GdPrv","GdWo",..: 5 3 5 5 3 5 5 5 5 5 ...
## $ Misc.Feature : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...
## $ Misc.Val : num [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
## $ Mo.Sold : Factor w/ 12 levels "1","10","11",..: 8 9 9 7 6 9 7 1 6 9 ...
## $ Yr.Sold : Factor w/ 5 levels "2006","2007",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Sale.Type : chr [1:2930] "WD" "WD" "WD" "WD" ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Sale.Price : num [1:2930] 209200 107700 163600 257900 184400 ...
int_num <- which(sapply(house_data, is.numeric))