Daniel Farías | A01236327
Kathya Ruiz | A01571094
Naila Salinas | A00832702
Sofia Badillo | A02384253
# import dataset
data <- read.csv("C:\\Users\\danyb\\OneDrive - Instituto Tecnologico y de Estudios Superiores de Monterrey\\Docs\\Documentos\\Business Intelligence\\Quinto Semestre\\Introduction to Econometrics\\real_estate_data.csv")
# we explore the data set to check for missing values
# sum of all NAs in the whole dataset
sum(is.na(data))
## [1] 0
# show the number of NAs per variable
colSums(is.na(data))
## medv cmedv crim zn indus chas nox rm age dis
## 0 0 0 0 0 0 0 0 0 0
## rad tax ptratio b lstat
## 0 0 0 0 0
# plot the NAs per variable (gg_miss_var() from the naniar package, left commented out)
#gg_miss_var(data)
We can conclude that there are no NAs or missing values in our dataset.
str(data)
## 'data.frame': 506 obs. of 15 variables:
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ cmedv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
We have a dataset composed of 15 variables and 506 observations.
Each variable stands for:
- medv (num): median_value
- crim (num): crime_rate
- zn (num): residential_land
- indus (num): non_retail_business
- chas (int): river_view
- nox (num): nitric_oxid
- rm (num): rooms
- age (num): age
- dis (num): distance
- rad (int): access_to_highways
- tax (int): property_tax_rate
- ptratio (num): school
- lstat (num): low_status_pop
As seen in the data description above, chas (river_view) is a dummy variable, and rad (access_to_highways) is an index that takes the values 1 to 8 and 24, so we have to transform these variables. We are also going to eliminate the variables “cmedv” and “b”.
data$chas <- as.factor(data$chas)
data <- data[, !(names(data) %in% c("cmedv", "b"))]
#data$rad <- as.factor(data$rad)
str(data)
## 'data.frame': 506 obs. of 13 variables:
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
# we have a summary of the data
summary(data)
## medv crim zn indus chas
## Min. : 5.00 Min. : 0.00632 Min. : 0.00 Min. : 0.46 0:471
## 1st Qu.:17.02 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1: 35
## Median :21.20 Median : 0.25651 Median : 0.00 Median : 9.69
## Mean :22.53 Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :50.00 Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
describe(data) # detailed descriptive statistics (describe() from the dlookr package)
## # A tibble: 12 × 26
## described_variables n na mean sd se_mean IQR skewness
## <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 medv 506 0 22.5 9.20 0.409 7.98 1.11
## 2 crim 506 0 3.61 8.60 0.382 3.60 5.22
## 3 zn 506 0 11.4 23.3 1.04 12.5 2.23
## 4 indus 506 0 11.1 6.86 0.305 12.9 0.295
## 5 nox 506 0 0.555 0.116 0.00515 0.175 0.729
## 6 rm 506 0 6.28 0.703 0.0312 0.738 0.404
## 7 age 506 0 68.6 28.1 1.25 49.0 -0.599
## 8 dis 506 0 3.80 2.11 0.0936 3.09 1.01
## 9 rad 506 0 9.55 8.71 0.387 20 1.00
## 10 tax 506 0 408. 169. 7.49 387 0.670
## 11 ptratio 506 0 18.5 2.16 0.0962 2.8 -0.802
## 12 lstat 506 0 12.7 7.14 0.317 10.0 0.906
## # ℹ 18 more variables: kurtosis <dbl>, p00 <dbl>, p01 <dbl>, p05 <dbl>,
## # p10 <dbl>, p20 <dbl>, p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>,
## # p60 <dbl>, p70 <dbl>, p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>,
## # p99 <dbl>, p100 <dbl>
For the data visualization, we used the variables that we considered most relevant to the dependent variable, based on the exploratory analysis carried out above. Among these, we identified four that exhibited the most pronounced curvature and the most interesting behavioral patterns.
ggplot(data, aes(x = rm, y = medv)) +
geom_point() +
labs(x = "number of rooms",
y = "Median Value of homes (USD)",
title = "Relation between value and number of rooms") +
theme_minimal()
In this first graph we can see the relationship between the independent
variable “rm” (rooms) and the dependent variable “medv” (median value).
With this graph we can identify a strong positive linear correlation
between the average number of rooms per dwelling and the median value of
owner-occupied homes.
ggplot(data, aes(x = crim, y = medv)) +
geom_point() +
labs(x = "Per Capita Crime Rate",
y = "Median value of homes (USD)",
title = "Relation between value and per capita crime rate") +
theme_minimal()
In this second graph we can see the relationship between the independent
variable “crim” and the dependent variable “medv”. A weak negative linear
correlation can be recognized, along with a pronounced concentration of
observations near a per capita crime rate of 0 by town, and its impact on
the median value of owner-occupied homes.
ggplot(data, aes(x = lstat, y = medv)) +
geom_point() +
labs(x = "percentage of lower status of the population",
y = "Median value of homes",
title = "Relation between value and percentage of lower status of the population") +
theme_minimal()
This third graph shows the relationship between the independent variable
“lstat” and the dependent variable “medv”. A strong negative linear
correlation can be observed between the percentage of lower status of
the population and the median value of owner-occupied homes.
data %>% mutate(age_intervals=cut(age,breaks=c(0,20,40,60,80,100))) %>%
ggplot(aes(x=reorder(age_intervals,medv),y=medv,fill=age_intervals)) +
geom_bar(stat="identity") + coord_flip()+
scale_fill_brewer(palette="Greens")+ # RColorBrewer palette names are capitalized
labs(x="Occupied-Homes Age", y="Median Value (USD)", fill="Occupied-Homes Age") +
ggtitle("Median Value by Occupied-Homes Age")
In this fourth graph, the relationship between the independent variable
“age” and the dependent variable “medv” is displayed. Age intervals have
been added to make the graph easier to understand. There is a trend that
the older the occupied homes, the higher the median value.
Despite seeing a very slight negative linear correlation in the graph of the relationship between “crim” and “medv”, we consider that it is not strong enough for this variable to be used in our hypotheses. Moreover, the pronounced concentration of values near 0 in the per capita crime rate made us discard it.
hist(data$medv,prob=TRUE,col='steelblue',main='Histogram with median value')
lines(density(data$medv),col=3,lwd=4)
plot_normality(data,medv)
We can see in these last graphs the original histogram and the one obtained by taking the log of the dependent variable. The original histogram displays a right-skewed distribution; when we normalize it with the log transformation, the distribution becomes much more symmetric, with only a weak left skew (see the short check below).
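To substantiate this, a minimal check (using a simple moment-based skewness helper defined here for illustration; it is not part of the original analysis) compares the skew of the raw and log-transformed dependent variable:
# simple moment-based skewness helper (illustrative only)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(data$medv)      # positive: right-skewed in levels
skew(log(data$medv)) # close to zero / slightly negative after the log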
data_alt1<- data %>% select(-chas,-rad) ### create sub-dataset including only the quantitative variables
summary(data_alt1)
## medv crim zn indus
## Min. : 5.00 Min. : 0.00632 Min. : 0.00 Min. : 0.46
## 1st Qu.:17.02 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19
## Median :21.20 Median : 0.25651 Median : 0.00 Median : 9.69
## Mean :22.53 Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.:25.00 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :50.00 Max. :88.97620 Max. :100.00 Max. :27.74
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## tax ptratio lstat
## Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median :330.0 Median :19.05 Median :11.36
## Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :711.0 Max. :22.00 Max. :37.97
corrplot(cor(data_alt1),type='upper',order='hclust',addCoef.col='black')
From this correlation plot, we can conclude that the variables most strongly related to the dependent variable are “rooms” (rm) and “percentage of lower status of the population” (lstat), the first with a positive relation and the second with a negative one.
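A quick way to make this ranking explicit is to sort the correlations of the quantitative variables with medv (a minimal sketch using the data_alt1 sub-dataset defined above):
# correlations of each quantitative variable with the dependent variable, sorted
sort(cor(data_alt1)[, "medv"], decreasing = TRUE)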
# Q-Q plot of the dependent variable to assess normality
qqnorm(data$medv)
#qqline(data$medv)
# Shapiro-Wilk normality test of the dependent variable
shapiro.test(data$medv)
##
## Shapiro-Wilk normality test
##
## data: data$medv
## W = 0.91718, p-value = 4.941e-16
Based on this analysis (and noting that the Shapiro-Wilk test above rejects normality of medv, p < 0.001), the 3 hypotheses that will be tested are:
A positive relationship might be expected between the number of rooms and the dependent variable, median value.
A positive relationship might be expected between age and the dependent variable, median value.
A negative relationship might be expected between the percentage of lower status of the population and the dependent variable, median value.
Multiple Linear Regression Model
model1<-lm(medv ~ crim+chas+nox+rm+age+rad+tax+lstat,data=data)
summary(model1)
##
## Call:
## lm(formula = medv ~ crim + chas + nox + rm + age + rad + tax +
## lstat, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.6771 -3.3446 -0.9797 2.0143 29.2534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.036012 3.505576 0.581 0.561644
## crim -0.078708 0.036433 -2.160 0.031221 *
## chas1 3.703134 0.960410 3.856 0.000131 ***
## nox -0.318563 3.549298 -0.090 0.928519
## rm 4.851746 0.445229 10.897 < 2e-16 ***
## age 0.016024 0.013327 1.202 0.229772
## rad 0.159515 0.070014 2.278 0.023130 *
## tax -0.012569 0.003716 -3.382 0.000775 ***
## lstat -0.575404 0.056107 -10.256 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.347 on 497 degrees of freedom
## Multiple R-squared: 0.6673, Adjusted R-squared: 0.662
## F-statistic: 124.6 on 8 and 497 DF, p-value: < 2.2e-16
vif(model1)
## crim chas nox rm age rad tax lstat
## 1.734473 1.050971 2.987547 1.728364 2.485337 6.563916 6.927621 2.835217
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 49.467, df = 8, p-value = 5.173e-08
cat("AIC:", AIC(model1),"\n")
## AIC: 3143.586
cat("RMSE:",RMSE(model1$fitted.values,data$medv))
## RMSE: 5.299483
In this first model, none of the variables were transformed; a multiple linear regression was fitted on the original data, giving an AIC of 3143.586 and an RMSE of 5.299483. The VIF values above 5 for rad and tax suggest some multicollinearity between those two variables, and the Breusch-Pagan test (p < 0.001) indicates heteroskedasticity in the residuals.
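Since the Breusch-Pagan test rejects homoskedasticity, one common remedy is to report heteroskedasticity-consistent standard errors. A minimal sketch, assuming the sandwich package is installed (lmtest is already used above for bptest()):
library(sandwich)  # assumption: installed; provides vcovHC()
# coefficient table of model 1 with HC1 robust standard errors
coeftest(model1, vcov. = vcovHC(model1, type = "HC1"))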
Linear Regression with log
model2<-lm(log(medv) ~ crim+chas+nox+rm+age+rad+tax+lstat,data=data)
summary(model2)
##
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + age + rad +
## tax + lstat, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71931 -0.12338 -0.02445 0.10533 0.90302
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8413906 0.1381198 20.572 < 2e-16 ***
## crim -0.0094408 0.0014355 -6.577 1.22e-10 ***
## chas1 0.1420779 0.0378402 3.755 0.000194 ***
## nox -0.1024244 0.1398424 -0.732 0.464253
## rm 0.1248146 0.0175420 7.115 3.92e-12 ***
## age 0.0008360 0.0005251 1.592 0.111985
## rad 0.0083819 0.0027586 3.039 0.002503 **
## tax -0.0006225 0.0001464 -4.252 2.53e-05 ***
## lstat -0.0310938 0.0022106 -14.066 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2107 on 497 degrees of freedom
## Multiple R-squared: 0.7385, Adjusted R-squared: 0.7343
## F-statistic: 175.5 on 8 and 497 DF, p-value: < 2.2e-16
vif(model2)
## crim chas nox rm age rad tax lstat
## 1.734473 1.050971 2.987547 1.728364 2.485337 6.563916 6.927621 2.835217
bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 51.872, df = 8, p-value = 1.782e-08
cat("AIC:", AIC(model2),"\n")
## AIC: -129.2103
cat("RMSE:",RMSE(model2$fitted.values,data$medv))
## RMSE: 21.4379
In the second model, the dependent variable was log-transformed because of the right skew of the original “medv” histogram, giving an AIC of -129.2103 and an RMSE of 21.4379. Note that this RMSE compares log-scale fitted values with medv in levels, so it is not directly comparable with the RMSE of model 1 (see the sketch below).
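A rough level-scale RMSE for the log model can be obtained by back-transforming the fitted values with exp() (a sketch that ignores the retransformation bias of the log model):
# back-transform the log-scale fitted values before comparing with medv in levels
cat("RMSE (back-transformed):", RMSE(exp(model2$fitted.values), data$medv))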
Polynomial Linear Regression
model3<-lm(log(medv) ~ crim+chas+nox+rm+I(rm^2)+age+rad+tax+lstat,data=data)
summary(model3)
##
## Call:
## lm(formula = log(medv) ~ crim + chas + nox + rm + I(rm^2) + age +
## rad + tax + lstat, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.99689 -0.10900 -0.00926 0.10129 0.92524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6503136 0.4061304 16.375 < 2e-16 ***
## crim -0.0105005 0.0013181 -7.967 1.13e-14 ***
## chas1 0.1249682 0.0346740 3.604 0.000345 ***
## nox -0.1889397 0.1282811 -1.473 0.141424
## rm -1.0552257 0.1206464 -8.746 < 2e-16 ***
## I(rm^2) 0.0913242 0.0092539 9.869 < 2e-16 ***
## age 0.0006277 0.0004810 1.305 0.192461
## rad 0.0073956 0.0025266 2.927 0.003578 **
## tax -0.0005392 0.0001343 -4.016 6.84e-05 ***
## lstat -0.0312611 0.0020232 -15.451 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1928 on 496 degrees of freedom
## Multiple R-squared: 0.7815, Adjusted R-squared: 0.7775
## F-statistic: 197.1 on 9 and 496 DF, p-value: < 2.2e-16
plot(effect("I(rm^2)",model3))
cat("AIC:", AIC(model3),"\n")
## AIC: -217.9258
cat("RMSE:",RMSE(model3$fitted.values,data$medv))
## RMSE: 21.42701
In this third model, as in the second one, the dependent variable was log-transformed, and in addition the independent variable “rm” was squared in order to generate a polynomial regression, giving an AIC of -217.9258 and an RMSE of 21.42701 (again computed against log-scale fitted values, so comparable with model 2 but not directly with model 1).
Although we computed both the AIC and RMSE indicators (for additional assurance), we based our decision on AIC, selecting the model with the lowest value among the 3 as the specification for the LASSO and Ridge methods, with the intention of gaining further confidence in the results.
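As a side check (an assumption on our part, not part of the original analysis), AIC values of level and log models are only strictly comparable after adding the Jacobian term 2*sum(log(y)) to the log models; with that adjustment, model 3 still has the lowest AIC:
adj <- 2 * sum(log(data$medv))  # change-of-variables adjustment for the log models
c(model1 = AIC(model1), model2 = AIC(model2) + adj, model3 = AIC(model3) + adj)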
Validate Hypotheses
Based on the results of our model, we can validate the hypotheses that we proposed.
For hypothesis 1, we can see that the overall relation is positive, but there is a negative relation between roughly 4 and 6 rooms; after about 6 rooms, the median value starts increasing (see the short check after these points).
For hypothesis 2, we can see that the relation is positive, but this variable does not appear to be statistically significant for the model or the prediction.
For hypothesis 3, we can confirm that the relation is negative, and this variable is statistically significant.
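The turning point claimed for hypothesis 1 can be checked directly from the model 3 coefficients (a minimal sketch; the vertex of the quadratic in rm is -b/(2c)):
# number of rooms at which the fitted relation switches from decreasing to increasing
-coef(model3)["rm"] / (2 * coef(model3)["I(rm^2)"])  # roughly 5.8 rooms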
Analyzing the results of the regression model, it could be interesting to study the relation between the per capita crime rate and the median value of homes, and the difference in median value between homes that bound the river and those that do not. These statistically significant variables could be worth considering and using for prediction (a small illustrative sketch follows).
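As a purely illustrative sketch of that follow-up (not one of the models compared above), a simple specification with only the crime rate and the river dummy could look like this:
# illustrative follow-up model with crim and chas only
summary(lm(log(medv) ~ crim + chas, data = data))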
set.seed(123) ### sets the random seed for reproducibility of results
training.samples<-data$medv %>%
createDataPartition(p=0.75,list=FALSE) ### Lets consider 75% of the data to build a predictive model
train.data<-data[training.samples, ] ### training data to fit the linear regression model
test.data<-data[-training.samples, ] ### testing data to test the linear regression model
# LASSO regression via the glmnet package only takes numerical inputs, so the data are transformed with model.matrix().
# Independent variables
x<-model.matrix(lm(log(medv) ~ crim+chas+nox+rm+I(rm^2)+age+rad+tax+lstat,data=train.data))[,-1] ### OLS model specification
y<-train.data$medv ### dependent variable
# In estimating the LASSO regression it is important to define the lambda that minimizes the prediction error rate.
# Cross-validation ensures that every observation in the training data has a chance of appearing in both the fitting and validation folds used to select lambda.
# Find the best lambda using cross-validation.
set.seed(123)
cv.lasso<-cv.glmnet(x,y,alpha=1) # alpha = 1 for LASSO
# Display the best lambda value
cv.lasso$lambda.min ### lambda: a numeric value defining the amount of shrinkage. Why min? The higher the value of lambda, the more penalization there is
## [1] 0.003463957
# Fit the final model on the training data
lassomodel<-glmnet(x,y,alpha=1,lambda=cv.lasso$lambda.min)
# Display regression coefficients
coef(lassomodel)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 107.159703920
## crim -0.094372253
## chas1 1.665556325
## nox -3.791411782
## rm -27.811138545
## I(rm^2) 2.541386933
## age 0.001948774
## rad 0.080581862
## tax -0.010997326
## lstat -0.472514556
# Make predictions on the test data
x.test<-model.matrix(lm(log(medv) ~ crim+chas+nox+rm+I(rm^2)+age+rad+tax+lstat,data=test.data))[,-1] ### OLS model specification
lassopredictions <- lassomodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(lassopredictions, test.data$medv),
Rsquare = R2(lassopredictions, test.data$medv))
## RMSE Rsquare
## 1 6.348638 0.6955881
### visualizing lasso regression results
lbs_fun <- function(fit, offset_x=1, ...) {
  L <- length(fit$lambda)              # index of the last (smallest) lambda in the path
  x <- log(fit$lambda[L]) + offset_x   # x position just past the end of the paths
  y <- fit$beta[, L]                   # coefficient values at that lambda
  labs <- names(y)                     # variable names
  text(x, y, labels=labs, ...)         # draw the labels next to each path
}
lasso<-glmnet(scale(x),y,alpha=1)
plot(lasso,xvar="lambda",label=T)
lbs_fun(lasso)
abline(v=cv.lasso$lambda.min,col="red",lty=2)
abline(v=cv.lasso$lambda.1se,col="blue",lty=2)
x<-model.matrix(lm(log(medv) ~ crim+chas+nox+rm+I(rm^2)+age+rad+tax+lstat,data=train.data))[,-1] ### OLS model specification
y<-train.data$medv ### dependent variable
# Find the best lambda using cross-validation
set.seed(123) # x: independent variables | y: dependent variable
cv.ridge <- cv.glmnet(x,y,alpha=0) # alpha = 0 for RIDGE
# Display the best lambda value
cv.ridge$lambda.min # lambda: a numeric value defining the amount of shrinkage. Why min? The higher the value of lambda, the more penalization there is
## [1] 0.006490823
# Fit the final model on the training data
ridgemodel<-glmnet(x,y,alpha=0,lambda=cv.ridge$lambda.min)
# Display regression coefficients
coef(ridgemodel)
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 98.326184204
## crim -0.091979663
## chas1 1.738663587
## nox -3.670704647
## rm -25.054783990
## I(rm^2) 2.327573391
## age 0.003100005
## rad 0.087788439
## tax -0.011395736
## lstat -0.472553210
# Make predictions on the test data
x.test<-model.matrix(lm(log(medv) ~ crim+chas+nox+rm+I(rm^2)+age+rad+tax+lstat,data=test.data))[,-1]
ridgepredictions<-ridgemodel %>% predict(x.test) %>% as.vector()
# Model Accuracy
data.frame(
RMSE = RMSE(ridgepredictions, test.data$medv),
Rsquare = R2(ridgepredictions, test.data$medv)
)
## RMSE Rsquare
## 1 6.391347 0.691403
### visualizing ridge regression results
ridge<-glmnet(scale(x),y,alpha=0)
plot(ridge, xvar = "lambda", label=T)
lbs_fun(ridge)
abline(v=cv.ridge$lambda.min, col = "red", lty=2)
abline(v=cv.ridge$lambda.1se, col="blue", lty=2)
# Summary table of model performance, using the values reported above
# (Linear Regression = model 1, in-sample; Lasso and Ridge are evaluated on the test set)
tab <- matrix(c(5.30,6.35,6.39,0.67,0.70,0.69), ncol=2, byrow=FALSE)
colnames(tab) <- c('RMSE','R2')
rownames(tab) <- c('Linear Regression','Lasso','Ridge')
tab <- as.table(tab)
tab
We could see as a team that linear regression analysis can contribute to improving predictive analytics in different ways. It helps us understand the relationship that exists between one or more independent variables and the dependent variable. By knowing how to interpret the results of a linear regression, it is possible to evaluate these relationships and the impact that each variable could have on the dependent variable.
It also helps to identify whether there is a positive or negative relationship and which variables have a significant impact on the dependent variable. Another contribution is the insights that can be generated from the analysis, which can be supported by the models created. It is a very useful tool for making informed and assertive predictions.
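For example, the variables with a significant impact can be read directly off a fitted model; a minimal sketch using model 3 from above:
# coefficients of model 3 with p-values below 0.05
coefs <- summary(model3)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]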
According to IBM, these are some key assumptions to be considered for success with linear-regression analysis:
For each variable: Consider the number of valid cases, mean and standard deviation.
For each model: Consider regression coefficients, correlation matrix, part and partial correlations, multiple R, R2, adjusted R2, change in R2, standard error of the estimate, analysis-of-variance table, predicted values and residuals. Also, consider 95-percent-confidence intervals for each regression coefficient, variance-covariance matrix, variance inflation factor, tolerance, Durbin-Watson test, distance measures (Mahalanobis, Cook and leverage values), DfBeta, DfFit, prediction intervals and case-wise diagnostic information.
Plots: Consider scatterplots, partial plots, histograms and normal probability plots.
Data: Dependent and independent variables should be quantitative. Categorical variables, such as religion, major field of study or region of residence, need to be recoded to binary (dummy) variables or other types of contrast variables.
Other assumptions: For each value of the independent variable, the distribution of the dependent variable must be normal. The variance of the distribution of the dependent variable should be constant for all values of the independent variable. The relationship between the dependent variable and each independent variable should be linear and all observations should be independent.
(About Linear Regression | IBM, 2023)
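A quick visual check of the linearity, normality and constant-variance assumptions listed above can be obtained with base R's diagnostic plots; a minimal sketch for the first model:
# residuals vs fitted, Q-Q plot, scale-location and leverage plots for model 1
par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))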
About Linear Regression | IBM. (2023). Ibm.com. https://www.ibm.com/topics/linear-regression