1 Initial Data Exploration

## The number of columns in the data is : 84

## The number of rows in the data is : 2930

The diagram below shows that the Pool Quality variable has the most missing values. Pool Quality, Misc.Feature, Type of Alley and Fence Quality are missing together in a total of 715 cases. The same group, with the addition of the Fireplace variable, is missing together in a total of 639 cases.

The column names were checked and 3 variables had anomalous names: “…47”, “…66” and “…73”. Before proceeding, I checked the values in these columns and found that the “…66” and “…47” columns were filled with 0’s, so they were removed. The “…73” column was renamed to “ThreeSsnPorch” (3 Season Porch Area in Square Feet) and kept in the data, as it contained values other than 0.

The “Order” and “PID” columns were also removed as they are irrelevant to the analysis. “SalePrice” was renamed to “Sale.Price” for consistency with the other Sale variables.

## The number of columns with complete data is : 23

## The number of columns with missing data is : 57

I visualised this further to make the extent of the missing values easier to see. The diagram below shows that 8.6% of the data is missing.

One of the columns without NA’s was the “Sale Price” column. The diagram below shows positive skewness and a pronounced peak, so the histogram does not follow a normal distribution. The values are also very large. I will keep this in mind: before modelling, the sale price needs to be log-transformed, which should bring the distribution reasonably close to normal.
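The column counts and the overall percentage of missing values can be computed along these lines. This is only a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the housing data).

```r
# Toy data frame standing in for the housing data (hypothetical values)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c("x", "y", NA, NA),
  c = c(TRUE, FALSE, TRUE, TRUE)
)

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))

# Columns with complete data versus columns with at least one NA
n_complete <- sum(na_per_col == 0)
n_missing  <- sum(na_per_col > 0)

# Overall percentage of missing cells
pct_missing <- 100 * mean(is.na(toy))

cat("Complete columns:", n_complete, "\n")
cat("Columns with missing data:", n_missing, "\n")
cat("Percent of cells missing:", round(pct_missing, 1), "\n")
```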

names(house_data)[names(house_data) == "SalePrice"] <- "Sale.Price"

summary(house_data$Sale.Price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12400  129100  160500  180705  214275  785500

I realised that in some cases NA means “not available” (the feature is absent) rather than missing, so I recoded the affected categorical variables accordingly, starting with Pool QC from the missing-data diagram above.

Deductions can also be made from related variables, e.g. Fireplace. I checked whether the number of houses with NA in the fireplace quality equals the number of houses with zero fireplaces, since no quality can be determined for a house without a fireplace.

I will replace the NA’s in the fireplace quality with “NoFireplace”.
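A row-by-row version of this consistency check might look like the sketch below. The data frame is made up for illustration; only the column names mirror the housing data. Comparing rows rather than totals is a slightly stricter test, since two counts can agree even when the rows do not.

```r
# Hypothetical toy data: quality is NA exactly when there is no fireplace
toy <- data.frame(
  Fireplaces   = c(0, 1, 2, 0, 1),
  Fireplace.Qu = c(NA, "Gd", "TA", NA, "Ex"),
  stringsAsFactors = FALSE
)

# Do the NA qualities line up row by row with zero fireplaces?
same_rows <- all(is.na(toy$Fireplace.Qu) == (toy$Fireplaces == 0))

# Only recode when the NA really means "no fireplace"
if (same_rows) {
  toy$Fireplace.Qu[is.na(toy$Fireplace.Qu)] <- "NoFireplace"
}

toy
```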

The next group to handle is the Lot group, which has 4 variables, as displayed below.

head(house_data %>% select(starts_with("Lot")))

Lot Frontage is skewed left, so the mode of its non-NA values is used to replace the NA’s. I also used the Lot Shape to fill in the Lot Area: I observed that the smaller the Lot Area, the more likely the lot is to be classified as regular. I used the mode of each category to avoid bias and replaced the NA’s.

For the Lot Shape, I checked the minimum and maximum Lot Area of each category and worked from those. I used a nested if-else to create boundaries for assigning the Lot Shape, as it seemed the fairest way to distribute the shape of the lot.
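As an alternative to a nested if-else, the same kind of boundary assignment can be written with `cut()`. The sketch below is illustrative only: the areas and the break points are made up, not the minimum and maximum values found in the data.

```r
# Hypothetical lot areas (square feet) and made-up boundaries
lot_area <- c(5000, 9500, 14000, 30000, 60000)
breaks   <- c(0, 10000, 20000, 50000, Inf)   # illustrative only
labels   <- c("Reg", "IR1", "IR2", "IR3")

# cut() assigns each area to the interval it falls in:
# (0, 10000], (10000, 20000], (20000, 50000], (50000, Inf]
lot_shape <- as.character(cut(lot_area, breaks = breaks, labels = labels))
lot_shape
```

Keeping the boundaries in one vector makes them easier to review and change than conditions spread over several nested branches.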

aggregate(Lot.Area ~ Lot.Shape, data = house_data, FUN=min)

aggregate(Lot.Area ~ Lot.Shape, data = house_data, FUN=max)

head(house_data %>% select(starts_with("Garage")))

Missing Data for the Garage Group
Variable Name	Number of Missing Values
Garage Area	1
Garage Year Built	159
Garage Type	157
Garage Finish	159
Garage Cars	1
Garage Quality	159
Garage Condition	159

Four of the variables have the same number of missing values (159), so they will be checked together. As with the fireplace variables, some of the garage variables are related.

When the Garage Type is NA, there is no garage, so the finish, quality and condition are not available either. For the garage year built, I will simply use the year the house was built.

When checking the observations with NA’s in the garage type, the count came to only 157 rather than 159, leaving 2 rows unaccounted for; the 157 rows all share the same “No Garage” pattern.

length(which(is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish) &
               
               is.na(house_data$Garage.Cond) & is.na(house_data$Garage.Qual)))

## [1] 157

(house_data[!is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish), 
            c('Garage.Cars', 'Garage.Area', 'Garage.Type', 'Garage.Cond',
              'Garage.Qual', 'Garage.Finish')])

which(!is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish))

## [1] 1357 2237

The 2 houses that were not part of the 157 are in rows 1357 and 2237. It seems that row 1357 does have a garage and row 2237 does not. I will therefore begin by imputing the row that actually has a garage, replacing its NA’s with the mode.

For row 2237, the garage type is “Detchd” (detached), but the other garage variables for that row indicate that there is no garage, so the row needs to be adjusted. I will change the garage type to NA and set the garage cars and area to 0. All remaining garage NA’s will then be changed to “None”.

head(house_data %>% select(starts_with("Bsmt")))

Number of Missing Values in the Basement Group
Variable Name	Number of Missing Values
Bsmt.Full.Bath	2
Bsmt.Half.Bath	2
Bsmt.Exposure	212
Bsmt.Qual	200
Bsmt.Cond	214
BsmtFin.Type.1	231
BsmtFin.Type.2	211
BsmtFin.SF.1	156
BsmtFin.SF.2	133
Bsmt.Unf.SF	137

I checked the number of rows where all of the basement variables are NA and found 80 cases meeting that condition, with 451 cases being the opposite. The garage reasoning applies to the basement variables as well: some NA’s actually mean “No Basement” and shouldn’t be treated as missing values.

MS.Zoning & Mason

If there is no masonry veneer area, then there is no type. As with the previous related variables, I checked whether the missing values mostly fall on the same rows, and then used the relationship to decide whether to assign “None” or a category. There are 32 cases where the Mason variables have NA’s in the same row.

For the rest of the variables, I used mode imputation.
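Base R's `mode()` returns the storage mode of an object (for example "character"), not the most frequent value, so mode imputation needs a helper. A minimal sketch of such a helper is shown below; `stat_mode` and `impute_mode` are hypothetical names chosen here, not the functions used in the Appendix.

```r
# Hypothetical helper: most frequent non-missing value of a vector.
# Named stat_mode to avoid masking base::mode(), which returns the storage mode.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Replace the NA's of a vector with its most frequent value
impute_mode <- function(x) {
  x[is.na(x)] <- stat_mode(x)
  x
}

zoning <- c("RL", "RM", NA, "RL", NA)
impute_mode(zoning)
```

Ties are resolved in favour of the value that appears first, which is one of several reasonable conventions.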

Converting to Factors & Label Encoding using “revalue” as opposed to nested “ifelse” and “case_when”.

I encoded some columns; below is how one of the encoded variables now looks. I checked each variable for ordinality to decide whether to convert it to a factor or to encode it.
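The idea behind the ordinal encoding can be sketched in base R with a named lookup vector. The scale below matches the quality ranks listed in the Appendix; the sample labels are made up, and the base R lookup is a stand-in for illustration rather than the “revalue” call itself.

```r
# Ordered quality scale, matching the ranks used for the quality variables
quality_rank <- c(None = 0, Po = 1, Fa = 2, TA = 3, Gd = 4, Ex = 5)

# Hypothetical sample of quality labels
qual <- c("Gd", "None", "TA", "Ex", "Po")

# Index the named vector by label; unname() drops the labels from the result
qual_encoded <- unname(quality_rank[qual])
qual_encoded
```

A label that is not in the lookup vector comes back as NA, which makes spelling mismatches visible instead of silently leaving the original text in place.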

head(house_data$Fireplace.Qu)

## [1] 4 0 0 3 3 4

Since Sale Price is the response variable, I checked its correlation with the numerical variables.

The variable with the highest correlation with Sale Price was Overall Quality.
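Ranking numeric predictors by their correlation with the response can be sketched as follows. The data here are synthetic and the column names (`quality`, `area`, `noise`, `price`) are invented for the example.

```r
set.seed(1)
n <- 200
# Synthetic data: quality and area drive price, noise does not
quality <- sample(1:10, n, replace = TRUE)
area    <- 800 + 150 * quality + rnorm(n, sd = 300)
noise   <- rnorm(n)
price   <- 20000 * quality + 50 * area + rnorm(n, sd = 20000)
toy <- data.frame(quality, area, noise, price)

# Keep the numeric columns and correlate each one with the response
num_cols <- sapply(toy, is.numeric)
cors <- cor(toy[, num_cols], use = "pairwise.complete.obs")[, "price"]

# Sort by correlation, dropping the response itself
predictor_cors <- sort(cors[names(cors) != "price"], decreasing = TRUE)
predictor_cors
```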

After checking Sale Price, I noticed that the histogram was heavily skewed. I’ll use a Q-Q plot for further analysis and to decide how to transform it so there’s less skew.

The points in the Q-Q plot don’t follow the line, so the distribution is not normal. I’ll transform the Sale Price using a log and redo the Q-Q plot to check for changes.

The skewness and the departure from normality have now been largely corrected.
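The effect of the log transform on skewness can be illustrated numerically. The sketch below uses synthetic log-normal “prices”, not the housing data, and defines its own small `skewness` helper.

```r
# Sample skewness (third standardised moment); a small helper defined here
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

set.seed(7)
# Synthetic right-skewed "prices" (log-normal), not the housing data
price <- exp(rnorm(1000, mean = 12, sd = 0.4))

skew_raw <- skewness(price)
skew_log <- skewness(log10(price))

cat("Skewness before log:", round(skew_raw, 2), "\n")
cat("Skewness after log: ", round(skew_log, 2), "\n")

# Visual check (uncomment to draw): points should follow the reference line
# qqnorm(log10(price)); qqline(log10(price))
```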

cleanData <- house_data[,c("MS.SubClass","Neighborhood","Overall.Qual","Gr.Liv.Area",
                           "Exter.Qual","Kitchen.Qual","Garage.Cars","X1st.Flr.SF","Bsmt.Qual",
                           "Full.Bath","Total.Bsmt.SF","Year.Built","Fireplace.Qu","Sale.Price")]

I selected the most important variables from the correlation matrix, along with 2 categorical variables that I believed would be important because no other variable is related to them; they may prove useful. All necessary encoding and imputation has been performed.

set.seed(007)
split <- initial_split(cleanData, prop = 0.7,strata = "Sale.Price")
house_train <- training(split)
house_test <- testing(split)

Blueprint Object

blueprint_sale <- recipe(Sale.Price ~ ., data = house_train[,-1]) %>%
  step_log(Sale.Price, base = 10) %>%
  step_nzv(all_nominal()) %>%
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes()) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)



ctcontrol <- trainControl(
  method = "repeatedcv",
  number = 7,
  repeats = 3
)

exp_grid_train <- expand.grid(k = seq(5, 15, by = 1))

Other train control methods: boot, boot632, optimism_boot, boot_all, cv, repeatedcv, LOOCV, LGOCV, oob, adaptive_cv, adaptive_boot or adaptive_LGOCV.

Grid Search - The default in R (caret). It is better suited to optimising the parameters of a model, especially in cases where there are multiple parameters and we want to choose the most effective combination.

Random Search - This type of search is also regarded as “hit-or-miss”. It tries multiple random combinations of the parameters of a model and then selects the best one.

Metric Used - RMSE (Root Mean Squared Error)
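For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal sketch with made-up numbers (the `rmse` helper is defined here for illustration):

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical observed and predicted values
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 11)

rmse(actual, predicted)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.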

knn_model_a1 <- train(blueprint_sale,
data = house_train, method = "knn",
trControl = ctcontrol, tuneGrid = exp_grid_train,
metric = "RMSE")

prepare <- prep(blueprint_sale, training = house_train)
baked_train <- bake(prepare, new_data = house_train)
baked_test <- bake(prepare, new_data = house_test)

head(baked_train)

head(baked_test)

I also tested Rsquared as a metric and didn’t note any large difference between it and Root Mean Squared Error (K = 7).

ggplot(knn_model_a1)

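pdvalues <- predict(knn_model_a1,house_test)
house_test$predicted <- pdvalues
# predicted is still on the log10 scale here, while Sale.Price is in dollars
house_test <- house_test %>% mutate(residual = predicted-Sale.Price)

# Back-transform the predictions from the log10 scale to dollars
house_test$predicted <- 10^house_test$predicted

summary(house_test$residual)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -750994 -213595 -160395 -181534 -129095  -34695

Because the residual was computed before the predictions were back-transformed, it compares log10 predictions with sale prices in dollars. This is why every residual is negative and close in size to the sale price itself; residuals in dollars require taking the difference after the 10^ back-transformation.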
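Residuals are only meaningful when the predictions and the observed values are on the same scale. The sketch below uses made-up prices and made-up log10 predictions to show the difference between back-transforming first and mixing the two scales.

```r
# Hypothetical sale prices in dollars and model output on the log10 scale
sale_price <- c(100000, 150000, 250000)
pred_log10 <- c(5.02, 5.15, 5.38)

# Undo the log10 transform first, then compare like with like
pred_dollars <- 10^pred_log10
residual     <- pred_dollars - sale_price

# Mixing scales instead gives values close to -sale_price
mixed <- pred_log10 - sale_price

summary(residual)
summary(mixed)
```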

2 Graphical Exploration of Global Economic Data

global_data_clean <- global_economic_Data %>% mutate_all(imputeTS::na_interpolation)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
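Linear interpolation, which as far as I know is the default option of `na_interpolation`, fills each gap along a straight line between the nearest observed values. A base R sketch of the same idea with `approx()`, on a made-up series:

```r
# Hypothetical yearly series with gaps
year <- 2000:2006
gdp  <- c(10, NA, 14, NA, NA, 20, 22)

# Linear interpolation over the observed points, evaluated at every year
known  <- !is.na(gdp)
filled <- approx(x = year[known], y = gdp[known], xout = year)$y
filled
```

This sketch does not extrapolate: a gap at the very start or end of the series would remain NA with `approx()` unless a rule is specified.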

Getting the GDP (in dollars) per capita in order to control for relative magnitude: GDP per capita = GDP / Population.

global_data_clean <- global_data_clean %>% mutate(gdpercap = gdp/population)

After analysing the GDP per capita data from 1960 to 2016 for the continents of the world, I saw that Oceania had the highest GDP per capita overall, followed by the Americas, with Europe being the continent with the lowest GDP per capita.

The GDP per capita started high in 1960 and plummeted around 1985. It then stayed constant at the low level and rose again in 2000. After that year it became unstable, with rises and falls before another decrease around 2012. 2016, the final year, ended with an even steeper decline.

I grouped the regions with their continents: Africa, Americas, Asia, Europe and Oceania. The diagram below shows the infant mortality rate by region. The region with the highest infant mortality rate is Western Africa, with Eastern Africa following right behind.

Overall, it seems that Africa has the highest infant mortality rate, followed by Asia and then the Americas. Oceania is the continent with the lowest infant mortality rate across the 56 years examined (1960-2016).

The relationships between the variables are shown below. I will ignore the negative correlations and focus on the positive ones, because they are the primary concern for interpreting the relationships between the variables.

The 1st and strongest correlation is between Fertility and Infant Mortality: 0.981
The 2nd correlation is between Population and Life Expectancy: 0.894
The 3rd correlation is between GDP PerCap ($) and Infant Mortality: 0.827
The last correlation is between GDP PerCap ($) and Life Expectancy: 0.7876

In conclusion, all of the variables mentioned above show a high level of correlation. They tend to increase or decrease together.

Usually, mortality has a negative effect on GDP: as the mortality rate increases, the GDP per capita decreases, and vice versa. But if mortality increases, fertility increases as well, which can eventually lead to an economic consequence called recession. A global body could decide to implement a one-child or two-child policy in order to stabilise the economy and population. Most of the time, the country ends up losing a significant amount of GDP, and it is worse for third world countries. At the end of the day, the variables are dependent, but only weakly.

3 Linear Algebra

$2x + 3y = 4 \quad (1)$
$x - 2y = 3 \quad (2)$

Step 1: make $x$ the subject of equation (2), $x = 3 + 2y$, and substitute into equation (1): $2(3+2y) + 3y = 4$
Step 2: $6 + 4y + 3y = 4$
Step 3: $4y + 3y = 4 - 6$
Step 4: $7y = -2$
Step 5: $y = \frac{-2}{7}$

Solve for $x$: $x - 2 \times \frac{-2}{7} = 3$, so $x = \frac{17}{7}$
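As a quick check, substituting $x = \frac{17}{7}$ and $y = \frac{-2}{7}$ back into the two original equations reproduces both right-hand sides:

$2 \times \frac{17}{7} + 3 \times \frac{-2}{7} = \frac{34 - 6}{7} = \frac{28}{7} = 4$

$\frac{17}{7} - 2 \times \frac{-2}{7} = \frac{17 + 4}{7} = \frac{21}{7} = 3$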

3.2

Determinant: the determinant is non-zero, so the matrix is invertible; m1 has an inverse. Also, when the determinant of a matrix (especially a 2x2) is negative, the orientation of the column vectors is reversed relative to the standard one; in this case, they have a clockwise orientation.

m1 <-  rbind(c(2,3),c(1,-2))
m2 <- rbind(4,3)

det(m1)

## [1] -7

To solve, I dropped the x and y so that only the coefficients remain, and then built the matrices using rbind.

After Solving with R:

solve(m1,m2)

##            [,1]
## [1,]  2.4285714
## [2,] -0.2857143

Further interpretation: m1, the coefficient matrix for x and y, is inverted and multiplied by m2 (the right-hand side b). The answer is the same whichever way it is solved.

solve(m1)%*%m2

##            [,1]
## [1,]  2.4285714
## [2,] -0.2857143

3.3

Nullspace = 0 - The null space contains only the zero vector, so the columns of the matrix are linearly independent.

B <- rbind(c(1,3), c(2,-6))
nullspace(B)

## NULL
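The same conclusion can be cross-checked in base R without an extra package: a square matrix has only the zero vector in its null space exactly when its determinant is non-zero, which is the same as having full rank. A small sketch:

```r
B <- rbind(c(1, 3), c(2, -6))

# Non-zero determinant and full rank both imply a trivial null space
det_B  <- det(B)
rank_B <- qr(B)$rank

cat("det(B) =", det_B, " rank(B) =", rank_B, "\n")
```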

3.4

A · B

r_matrix <- m1%*%B
r_matrix

##      [,1] [,2]
## [1,]    8  -12
## [2,]   -3   15

Rank of Matrix

cat("The rank of the multiplied matrix is: ", Rank(r_matrix))

## The rank of the multiplied matrix is:  2

Rank(r_matrix) == Rank(m1)

## [1] TRUE

3.5

Eigenvalues and eigenvectors. After testing the eigenvectors, I got negative values.

V <- eigen(m1)
class(V)

## [1] "eigen"

V$values

## [1] -2.645751  2.645751

V$vectors

##            [,1]      [,2]
## [1,] -0.5424768 0.9776088
## [2,]  0.8400708 0.2104307

V$vectors%*%m1

##            [,1]      [,2]
## [1,] -0.1073448 -3.582648
## [2,]  1.8905723  2.099351

4 PCA Using the Social Data

4.1

Soc_Net_Data <- read.delim("SocNetData.txt",             
           header = TRUE,
           sep = ",",
           na.strings = c("NA"))

Soc_Net_Data<-Soc_Net_Data[!(is.na(Soc_Net_Data$gender) & is.na(Soc_Net_Data$age)),]
Soc_Net_Data <- Soc_Net_Data %>% mutate_if(is_numeric,(imputeTS::na_interpolation))
Soc_Net_Data$gender[is.na(Soc_Net_Data$gender)] <- mode(Soc_Net_Data$gender)
#Soc_Net_Data$gender <- as.factor(Soc_Net_Data$gender)

4.2

4.3

head(eig$d)

## [1] 3.313157 1.718695 1.639909 1.508948 1.464843 1.245703

head(eig$u,1)

##             [,1]       [,2]        [,3]       [,4]        [,5]       [,6]
## [1,] -0.02149062 0.05619527 -0.02567651 0.03728996 -0.01720021 0.04961882
##             [,7]      [,8]       [,9]     [,10]      [,11]   [,12]      [,13]
## [1,] -0.06225935 0.1220768 -0.1262111 0.0288275 0.09038433 0.90708 -0.2603075
##           [,14]      [,15]      [,16]     [,17]      [,18]      [,19]
## [1,] 0.09720056 0.04446303 0.03075158 0.1247814 0.09133751 -0.0226174
##            [,20]       [,21]       [,22]     [,23]       [,24]      [,25]
## [1,] -0.05266903 -0.02979021 -0.03072684 0.1053861 0.006488155 0.01694516
##           [,26]       [,27]       [,28]       [,29]      [,30]       [,31]
## [1,] 0.02804925 0.003724851 -0.04174003 -0.01200603 -0.0178841 -0.01282031
##           [,32]       [,33]      [,34]       [,35]       [,36]        [,37]
## [1,] 0.02847664 0.003780187 0.01623342 -0.01126471 0.005038067 -0.007300357
##            [,38]
## [1,] 0.002854623
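The `$d` and `$u` components suggest that `eig` holds the result of `svd()` on the scaled numeric data, although that step is not shown, so this is an assumption. Under that assumption, the sketch below shows on synthetic data how singular values relate to the component standard deviations reported by `prcomp()`.

```r
set.seed(42)
# Synthetic numeric data standing in for the scaled social network features
x <- scale(matrix(rnorm(200 * 4), ncol = 4))

s   <- svd(x)      # s$d: singular values, s$u and s$v: singular vectors
pca <- prcomp(x)   # centred PCA of the same matrix

# Singular values relate to component standard deviations by sqrt(n - 1)
sdev_from_svd <- s$d / sqrt(nrow(x) - 1)
all.equal(sdev_from_svd, pca$sdev)

# Share of variance captured by each component
var_share <- s$d^2 / sum(s$d^2)
round(var_share, 3)
```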

4.4

I would say it depends on the number of features that have high variability. If the number of features with high variability greatly exceeds the number of features with very low variance, then the high variability will make the whole data set highly variable.

But if there’s a situation where there are features with high variability and other features with a normal level of variability

4.5

I will select 2 features from PC1, age and friends, because PC1 captures a very large percentage of the variance in the data, with most of the variability coming from age and friends.
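How a single high-variance feature can dominate the first component can be illustrated on synthetic data. In the sketch below the column names and distributions are invented; it compares the share of variance captured by PC1 with and without scaling.

```r
set.seed(3)
n <- 300
# Synthetic features: one with a very large spread, two 0/1 indicators
toy <- data.frame(
  friends = rnorm(n, mean = 30, sd = 35),
  flag_a  = rbinom(n, 1, 0.2),
  flag_b  = rbinom(n, 1, 0.5)
)

# Proportion of variance explained by PC1
pc1_share <- function(p) p$sdev[1]^2 / sum(p$sdev^2)

share_raw    <- pc1_share(prcomp(toy, scale. = FALSE))
share_scaled <- pc1_share(prcomp(toy, scale. = TRUE))

cat("PC1 share, unscaled:", round(share_raw, 3), "\n")
cat("PC1 share, scaled:  ", round(share_scaled, 3), "\n")
```

Without scaling, PC1 is almost entirely the high-variance column; after scaling, each feature contributes on a comparable footing.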

4.6

autoplot(prcomp(scaled_numdat), data = color_gend, colour = "gender_encoded",
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3)

4.7

The variability in PC1 and PC2 is driven by the age and friends features (the bulk of the variability in the data was captured by these two features alone).
There was high variance in age as well as in friends.
PC1 and PC2 capture the variability of almost the whole data set.
The other columns were filled mostly with 0’s and 1’s, so there wasn’t a lot of variability in those columns.
In this case, PCA has helped me to understand where the majority of the variance came from, and to reduce the dimensionality of wide data with many columns by showing which columns adequately represent the data.
Using the R function autoplot helped to gain deeper insight by showing how spread out the variables are. I was able to note that it wasn’t just “age” and “friends” that had high variability.

Appendix - Source Code

Bulky code is kept here.

house_data$Pool.QC[is.na(house_data$Pool.QC)] <- 'NoPool'
house_data$Misc.Feature[is.na(house_data$Misc.Feature)] <- "NoMisc"
house_data$Alley[is.na(house_data$Alley)] <- "NoAlley"
house_data$Fence[is.na(house_data$Fence)] <- "NoFence"

sum(is.na(house_data$Fireplace.Qu)) == sum(house_data$Fireplaces==0)

## [1] FALSE

house_data$Fireplace.Qu[is.na(house_data$Fireplace.Qu)] <- "NoFireplace"
sum(table(house_data$Fireplace.Qu))

## [1] 2930

sum(table(house_data$Fireplaces))

## [1] 2930

house_data$Lot.Frontage[is.na(house_data$Lot.Frontage)] <- median(house_data$Lot.Frontage[!is.na(house_data$Lot.Frontage)])

## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA

sum(is.na(house_data$Lot.Shape))

## [1] 0

sum(is.na(house_data$Lot.Area))

## [1] 0

sum(is.na(house_data$Lot.Config))

## [1] 0

aggregate(Lot.Area ~ Lot.Shape, data = house_data, FUN=median)

house_data %>% group_by(Lot.Shape) %>% 
mutate(Lot.Area = ifelse(is.na(Lot.Area), mode(Lot.Area, na.rm = TRUE), Lot.Area))

sum(!is.na(house_data$Lot.Frontage))

## [1] 2930

length(which(is.na(house_data$Garage.Type) & is.na(house_data$Garage.Finish) & is.na(house_data$Garage.Cond) & is.na(house_data$Garage.Qual)))

## [1] 0

house_data$Garage.Type[is.na(house_data$Garage.Type)] <- "None"
house_data$Garage.Qual[is.na(house_data$Garage.Qual)] <- "None"
house_data$Garage.Cond[is.na(house_data$Garage.Cond)] <- "None"

sum(is.na(house_data$Bsmt.Full.Bath))

## [1] 0

sum(is.na(house_data$Bsmt.Half.Bath))

## [1] 0

sum(is.na(house_data$Bsmt.Exposure))

## [1] 0

sum(is.na(house_data$Bsmt.Qual))

## [1] 0

sum(is.na(house_data$Bsmt.Cond))

## [1] 0

sum(is.na(house_data$BsmtFin.Type.1))

## [1] 0

sum(is.na(house_data$BsmtFin.Type.2))

## [1] 0

sum(is.na(house_data$BsmtFin.SF.1))

## [1] 0

sum(is.na(house_data$BsmtFin.SF.2))

## [1] 0

sum(is.na(house_data$Bsmt.Unf.SF))

## [1] 0

length(which(is.na(house_data$Bsmt.Qual) & is.na(house_data$Bsmt.Cond) & is.na(house_data$Bsmt.Exposure) & is.na(house_data$BsmtFin.Type.1) & is.na(house_data$BsmtFin.Type.2)))

## [1] 0

bsmt_cols <- c('Bsmt.Qual', 'Bsmt.Cond', 'Bsmt.Exposure', 'BsmtFin.Type.1', 'BsmtFin.Type.2')
# Rows that have a basement (BsmtFin.Type.1 present) but are missing another basement field
bsmt_rows <- !is.na(house_data$BsmtFin.Type.1) &
  (is.na(house_data$Bsmt.Cond) | is.na(house_data$Bsmt.Qual) |
     is.na(house_data$Bsmt.Exposure) | is.na(house_data$BsmtFin.Type.2))
# Fill the remaining gaps in those rows with the mode of each column
house_data[bsmt_rows, bsmt_cols] <- house_data[bsmt_rows, bsmt_cols] %>%
  mutate_if(is_character, function(x) replace(x, is.na(x), mode(x, na.rm = TRUE)))

## Warning in max(xtab): no non-missing arguments to max; returning -Inf

## Warning in max(xtab): no non-missing arguments to max; returning -Inf

house_data$Bsmt.Qual[is.na(house_data$Bsmt.Qual)] <- "None"
house_data$Bsmt.Cond[is.na(house_data$Bsmt.Cond)] <- "None"
house_data$Bsmt.Exposure[is.na(house_data$Bsmt.Exposure)] <- "None"
house_data$BsmtFin.Type.1[is.na(house_data$BsmtFin.Type.1)] <- "None"
house_data$BsmtFin.Type.2[is.na(house_data$BsmtFin.Type.2)] <- "None"
house_data$Bsmt.Half.Bath[is.na(house_data$Bsmt.Half.Bath)] <- 0
house_data$Bsmt.Full.Bath[is.na(house_data$Bsmt.Full.Bath)] <- 0
house_data$BsmtFin.SF.1[is.na(house_data$BsmtFin.SF.1)] <- 0
house_data$Bsmt.Unf.SF[is.na(house_data$Bsmt.Unf.SF)] <- 0
house_data$Total.Bsmt.SF[is.na(house_data$Total.Bsmt.SF)] <- 0
house_data$BsmtFin.SF.2[is.na(house_data$BsmtFin.SF.2)] <- 0
sum(is.na(house_data$BsmtFin.SF.1))

## [1] 0

house_data$MS.Zoning[is.na(house_data$MS.Zoning)] <- mode(house_data$MS.Zoning)
length(which(is.na(house_data$Mas.Vnr.Type) & is.na(house_data$Mas.Vnr.Area)))

## [1] 0

house_data$Mas.Vnr.Type[is.na(house_data$Mas.Vnr.Type)] <- "None"
house_data$Mas.Vnr.Area[is.na(house_data$Mas.Vnr.Area)] <- 0
house_data<- house_data%>% mutate_if(is.numeric, function(x) replace(x, is.na(x), mode(x, na.rm = TRUE)))
house_data <- house_data%>% mutate_if(is_character, function(x) replace(x, is.na(x), mode(x, na.rm = TRUE)))

data.frame(colnames(house_data))

cols <- c(74,6,73,10,58,2,1,8,23,24,42,79,39,21,22,15,16,12,13,14,77,76)
house_data[,cols] <- lapply(house_data[,cols] , factor)

Rank_1 <-  c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
Rank_2 <-  c('None'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)
Rank_3 <-  c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)
Rank_4 <- c('None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4)
Rank_5 <-  c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)
Rank_6 <- c('None'=0, 'BrkCmn'=0, 'BrkFace'=1, 'Stone'=2)
Rank_7 <- c('Sev'=0, 'Mod'=1, 'Gtl'=2)
Rank_8 <- c('N'=0, 'P'=1, 'Y'=2)
Rank_9 <- c('Grvl'=0, 'Pave'=1)
Rank_10 <- c('N'=0, 'Y'=1)
Rank_11 <- c('Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7)
Rank_12 <-  c('NoPool' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
Rank_13 <-  c('NoFireplace' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)
house_data$Lot.Shape<-as.integer(revalue(house_data$Lot.Shape,Rank_3))

## The following `from` values were not present in `x`: IR3, IR2, IR1, Reg

house_data$Garage.Finish<-as.integer(revalue(house_data$Garage.Finish,Rank_2))

## The following `from` values were not present in `x`: None, Unf, RFn, Fin

house_data$Bsmt.Qual<-as.integer(revalue(house_data$Bsmt.Qual,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Garage.Cond<-as.integer(revalue(house_data$Garage.Cond,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Functional<-as.integer(revalue(house_data$Functional,Rank_11))

## The following `from` values were not present in `x`: Sal, Sev, Maj2, Maj1, Mod, Min2, Min1, Typ

house_data$Street<-as.integer(revalue(house_data$Street,Rank_9))

## The following `from` values were not present in `x`: Grvl, Pave

house_data$Fireplace.Qu<-as.integer(revalue(house_data$Fireplace.Qu,Rank_13))

## The following `from` values were not present in `x`: NoFireplace, Po, Fa, TA, Gd, Ex

house_data$BsmtFin.Type.1<-as.integer(revalue(house_data$BsmtFin.Type.1,Rank_5))

## The following `from` values were not present in `x`: None, Unf, LwQ, Rec, BLQ, ALQ, GLQ

house_data$Bsmt.Cond<-as.integer(revalue(house_data$Bsmt.Cond,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Exter.Qual<-as.integer(revalue(house_data$Exter.Qual,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Exter.Cond<-as.integer(revalue(house_data$Exter.Cond,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Garage.Qual<-as.integer(revalue(house_data$Garage.Qual,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Pool.QC<-as.integer(revalue(house_data$Pool.QC,Rank_12))

## The following `from` values were not present in `x`: NoPool, Po, Fa, TA, Gd, Ex

house_data$Kitchen.Qual<-as.integer(revalue(house_data$Kitchen.Qual,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Heating.QC<-as.integer(revalue(house_data$Heating.QC,Rank_1))

## The following `from` values were not present in `x`: None, Po, Fa, TA, Gd, Ex

house_data$Land.Slope<-as.integer(revalue(house_data$Land.Slope,Rank_7))

## The following `from` values were not present in `x`: Sev, Mod, Gtl

house_data$Central.Air<-as.integer(revalue(house_data$Central.Air,Rank_10))

## The following `from` values were not present in `x`: N, Y

house_data$Paved.Drive<-as.integer(revalue(house_data$Paved.Drive,Rank_8))

## The following `from` values were not present in `x`: N, P, Y

house_data$BsmtFin.SF.1<-as.numeric(house_data$BsmtFin.SF.1)
house_data$BsmtFin.SF.2<-as.numeric(house_data$BsmtFin.SF.2)
house_data$Sale.Price<-as.numeric(house_data$Sale.Price)
house_data$Misc.Val<-as.numeric(house_data$Misc.Val)
house_data$Year.Built<-as.numeric(house_data$Year.Built)
house_data$Year.Remod.Add<-as.numeric(house_data$Year.Remod.Add)
house_data$Mas.Vnr.Area<-as.numeric(house_data$Mas.Vnr.Area)
house_data$Bsmt.Unf.SF<-as.numeric(house_data$Bsmt.Unf.SF)
house_data$Total.Bsmt.SF<-as.numeric(house_data$Total.Bsmt.SF)
house_data$Pool.Area<-as.numeric(house_data$Pool.Area)
house_data$Screen.Porch<-as.numeric(house_data$Screen.Porch)
house_data$ThreeSsnPorch<-as.numeric(house_data$ThreeSsnPorch)
house_data$Enclosed.Porch<-as.numeric(house_data$Enclosed.Porch)
house_data$Open.Porch.SF<-as.numeric(house_data$Open.Porch.SF)
house_data$Wood.Deck.SF<-as.numeric(house_data$Wood.Deck.SF)
house_data$Garage.Area<-as.numeric(house_data$Garage.Area)
house_data$Garage.Cars<-as.numeric(house_data$Garage.Cars)
house_data$Garage.Yr.Blt<-as.numeric(house_data$Garage.Yr.Blt)
house_data$Fireplaces<-as.numeric(house_data$Fireplaces)
house_data$TotRms.AbvGrd<-as.numeric(house_data$TotRms.AbvGrd)
house_data$Kitchen.AbvGr<-as.numeric(house_data$Kitchen.AbvGr)
house_data$Bedroom.AbvGr<-as.numeric(house_data$Bedroom.AbvGr)
house_data$Half.Bath<-as.numeric(house_data$Half.Bath)
house_data$Full.Bath<-as.numeric(house_data$Full.Bath)
house_data$Bsmt.Full.Bath<-as.numeric(house_data$Bsmt.Full.Bath)
house_data$Bsmt.Half.Bath<-as.numeric(house_data$Bsmt.Half.Bath)
house_data$X1st.Flr.SF<-as.numeric(house_data$X1st.Flr.SF)
house_data$X2nd.Flr.SF<-as.numeric(house_data$X2nd.Flr.SF)
house_data$Low.Qual.Fin.SF<-as.numeric(house_data$Low.Qual.Fin.SF)
house_data$Gr.Liv.Area<-as.numeric(house_data$Gr.Liv.Area)

house_data %>% 
  mutate_at(vars(Lot.Area, Lot.Frontage, Overall.Qual,Overall.Cond,Year.Built,Year.Remod.Add,Mas.Vnr.Area,BsmtFin.SF.1,BsmtFin.SF.2,Total.Bsmt.SF,X1st.Flr.SF,X2nd.Flr.SF,Low.Qual.Fin.SF,Gr.Liv.Area,Bsmt.Full.Bath,Full.Bath,Half.Bath,Bedroom.AbvGr,Kitchen.AbvGr,TotRms.AbvGrd,Fireplaces,Garage.Yr.Blt,Garage.Cars,Garage.Area,Wood.Deck.SF,Open.Porch.SF,Enclosed.Porch,ThreeSsnPorch,Screen.Porch,Pool.Area), as.numeric)

str(house_data)

## tibble [2,930 × 80] (S3: tbl_df/tbl/data.frame)
##  $ MS.SubClass    : Factor w/ 15 levels "120","160","180",..: 5 5 5 5 10 10 1 1 1 10 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 5 6 6 6 6 6 6 6 6 ...
##  $ Lot.Frontage   : chr [1:2930] "143.8" "81.6" "79.4" "91.1" ...
##  $ Lot.Area       : chr [1:2930] "32405.4" "11622" "13981.7" "11383.2" ...
##  $ Street         : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Alley          : Factor w/ 3 levels "Grvl","NoAlley",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Lot.Shape      : int [1:2930] 2 3 2 3 2 2 3 2 2 3 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 2 4 4 ...
##  $ Utilities      : chr [1:2930] "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 1 1 5 5 5 5 5 5 ...
##  $ Land.Slope     : int [1:2930] 2 2 2 2 2 2 2 2 2 2 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 16 16 16 16 9 9 25 16 25 9 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 5 5 5 1 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 3 6 6 3 3 3 6 ...
##  $ Overall.Qual   : chr [1:2930] "6" "5" "5" "7" ...
##  $ Overall.Cond   : chr [1:2930] "5" "6" "6" "5" ...
##  $ Year.Built     : num [1:2930] 1960 1961 1958 1968 1997 ...
##  $ Year.Remod.Add : num [1:2930] 1960 1961 1958 1968 1998 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 4 2 4 4 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 7 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 4 14 15 4 14 14 6 7 6 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 11 15 16 15 15 15 6 7 15 15 ...
##  $ Mas.Vnr.Type   : chr [1:2930] "Stone" "None" "BrkFace" "None" ...
##  $ Mas.Vnr.Area   : num [1:2930] 112 0 108 0 0 20 0 0 0 0 ...
##  $ Exter.Qual     : int [1:2930] 3 3 3 4 3 3 4 4 4 3 ...
##  $ Exter.Cond     : int [1:2930] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Foundation     : chr [1:2930] "CBlock" "CBlock" "CBlock" "CBlock" ...
##  $ Bsmt.Qual      : int [1:2930] 3 3 3 3 4 3 4 4 4 3 ...
##  $ Bsmt.Cond      : int [1:2930] 4 3 3 3 3 3 3 3 3 3 ...
##  $ Bsmt.Exposure  : chr [1:2930] "Gd" "No" "No" "No" ...
##  $ BsmtFin.Type.1 : int [1:2930] 4 0 5 5 6 6 6 5 6 1 ...
##  $ BsmtFin.SF.1   : num [1:2930] 639 467 0 1068 0 ...
##  $ BsmtFin.Type.2 : chr [1:2930] "Unf" "LwQ" "Unf" "Unf" ...
##  $ BsmtFin.SF.2   : num [1:2930] 0 144 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : num [1:2930] 385 0 412 1103 112 ...
##  $ Total.Bsmt.SF  : num [1:2930] 1080 882 1329 2110 928 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : int [1:2930] 2 3 3 5 4 5 5 5 5 4 ...
##  $ Central.Air    : int [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Electrical     : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ X1st.Flr.SF    : num [1:2930] 1602 898 1315 2125 855 ...
##  $ X2nd.Flr.SF    : num [1:2930] 0 0 0 0 701 678 0 0 0 776 ...
##  $ Low.Qual.Fin.SF: num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Gr.Liv.Area    : num [1:2930] 1788 888 1327 2061 1611 ...
##  $ Bsmt.Full.Bath : num [1:2930] 1 0 0 1 0 0 1 0 1 0 ...
##  $ Bsmt.Half.Bath : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : num [1:2930] 1 1 1 2 2 2 2 2 2 2 ...
##  $ Half.Bath      : num [1:2930] 0 0 1 1 1 1 0 0 0 1 ...
##  $ Bedroom.AbvGr  : num [1:2930] 3 2 3 3 3 3 2 2 2 3 ...
##  $ Kitchen.AbvGr  : num [1:2930] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : int [1:2930] 3 3 4 5 3 4 4 4 4 4 ...
##  $ TotRms.AbvGrd  : num [1:2930] 7 5 6 8 6 7 6 5 5 7 ...
##  $ Functional     : int [1:2930] 7 7 7 7 7 7 7 7 7 7 ...
##  $ Fireplaces     : num [1:2930] 2 0 0 2 1 1 0 0 1 1 ...
##  $ Fireplace.Qu   : int [1:2930] 4 0 0 3 3 4 0 0 3 3 ...
##  $ Garage.Type    : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Garage.Yr.Blt  : num [1:2930] 1960 1961 1958 1968 1997 ...
##  $ Garage.Finish  : int [1:2930] 3 1 1 3 3 3 3 2 2 3 ...
##  $ Garage.Cars    : num [1:2930] 2 1 1 2 2 2 2 2 2 2 ...
##  $ Garage.Area    : num [1:2930] 528 730 312 522 482 470 582 506 608 442 ...
##  $ Garage.Qual    : int [1:2930] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Garage.Cond    : int [1:2930] 3 3 3 3 3 3 3 3 3 3 ...
##  $ Paved.Drive    : int [1:2930] 1 2 2 2 2 2 2 2 2 2 ...
##  $ Wood.Deck.SF   : num [1:2930] 210 140 393 0 212 360 0 0 237 140 ...
##  $ Open.Porch.SF  : num [1:2930] 62 0 36 0 34 36 0 82 152 60 ...
##  $ Enclosed.Porch : num [1:2930] 0 0 0 0 0 0 170 0 0 0 ...
##  $ ThreeSsnPorch  : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : num [1:2930] 0 120 0 0 0 0 0 144 0 0 ...
##  $ Pool.Area      : num [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : int [1:2930] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Fence          : Factor w/ 5 levels "GdPrv","GdWo",..: 5 3 5 5 3 5 5 5 5 5 ...
##  $ Misc.Feature   : Factor w/ 6 levels "Elev","Gar2",..: 3 3 2 3 3 3 3 3 3 3 ...
##  $ Misc.Val       : num [1:2930] 0 0 12500 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : Factor w/ 12 levels "1","10","11",..: 8 9 9 7 6 9 7 1 6 9 ...
##  $ Yr.Sold        : Factor w/ 5 levels "2006","2007",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Sale.Type      : chr [1:2930] "WD" "WD" "WD" "WD" ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Sale.Price     : num [1:2930] 209200 107700 163600 257900 184400 ...

int_num <- which(sapply(house_data, is.numeric))

Sales Price and Ames Housing Data with Other Assignments

Ifeoma
