download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/HousePrices.csv", "HousePrices.csv",method="curl")
hp<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/HousePrices.csv")
attach(hp)
Based on these plots, it seems like the variables that will be most significant in a linear regression are square feet, offers, and brick. This is because these variables show linear correlations when plotted against the dependent variable (house price). In addition, the neighborhood variable needs to be made into dummy variables because it has three categories and it is possibly significant because the box-and-whisker plot of “West” only slightly overlaps with the other two box-and-whisker plots.
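For reference, plots of the kind described above can be produced with calls like the following (a sketch, using the same plot() formula style as the marketing plots later in this document; plotting Price against a factor such as Brick or Neighborhood gives box-and-whisker plots):
# Exploratory plots of Price against each candidate predictor
plot(Price~SqFt,data=hp)
plot(Price~Bedrooms,data=hp)
plot(Price~Bathrooms,data=hp)
plot(Price~Offers,data=hp)
plot(Price~Brick,data=hp)
plot(Price~Neighborhood,data=hp)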
# 0/1 dummy variables for the three-level Neighborhood factor
# (the remaining neighborhood is the baseline captured by the intercept)
hp$WestNeighborhood=ifelse(hp$Neighborhood=="West",1,0)
hp$NorthNeighborhood=ifelse(hp$Neighborhood=="North",1,0)
h1=lm(hp$Price~hp$SqFt+hp$HomeID+hp$Bedrooms+hp$Bathrooms+hp$Offers+hp$Brick+hp$WestNeighborhood+hp$NorthNeighborhood,data=hp)
summary(h1)
##
## Call:
## lm(formula = hp$Price ~ hp$SqFt + hp$HomeID + hp$Bedrooms + hp$Bathrooms +
## hp$Offers + hp$Brick + hp$WestNeighborhood + hp$NorthNeighborhood,
## data = hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27897.8 -6074.8 -48.7 5551.8 27536.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 308.114 9605.692 0.032 0.974465
## hp$SqFt 53.634 5.926 9.051 3.30e-15 ***
## hp$HomeID -11.456 25.387 -0.451 0.652616
## hp$Bedrooms 4136.461 1621.775 2.551 0.012023 *
## hp$Bathrooms 7975.157 2133.831 3.737 0.000287 ***
## hp$Offers -8350.128 1103.693 -7.566 8.96e-12 ***
## hp$BrickYes 17313.540 1988.548 8.707 2.12e-14 ***
## hp$WestNeighborhood 22264.319 2540.699 8.763 1.56e-14 ***
## hp$NorthNeighborhood 1729.613 2433.756 0.711 0.478675
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10050 on 119 degrees of freedom
## Multiple R-squared: 0.8688, Adjusted R-squared: 0.86
## F-statistic: 98.54 on 8 and 119 DF, p-value: < 2.2e-16
This model fits reasonably well, as evidenced by its adjusted R-squared of 0.86, and the overall F-statistic is highly significant. That only tells us the predictors jointly matter, however; individually, HomeID and NorthNeighborhood are nowhere near significant (p = 0.65 and p = 0.48) and should not be in the model. HomeID in particular is just an identifier and carries no real information about price.
Based on the raw coefficients from the regression, you get the equation: y = 308.11 + 53.63(SqFt) - 11.46(HomeID) + 4136.46(Bedrooms) + 7975.16(Bathrooms) - 8350.13(Offers) + 17313.54(BrickYes) + 22264.32(WestNeighborhood) + 1729.61(NorthNeighborhood)
h2=lm(hp$Price~hp$SqFt+hp$Bedrooms+hp$Bathrooms+hp$Offers+hp$Brick+hp$WestNeighborhood,data=hp)
summary(h2)
##
## Call:
## lm(formula = hp$Price ~ hp$SqFt + hp$Bedrooms + hp$Bathrooms +
## hp$Offers + hp$Brick + hp$WestNeighborhood, data = hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26810.5 -5953.6 -266.5 5662.9 26793.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3067.471 8746.712 0.351 0.726423
## hp$SqFt 52.149 5.572 9.359 5.44e-16 ***
## hp$Bedrooms 4070.005 1570.921 2.591 0.010751 *
## hp$Bathrooms 7810.698 2109.060 3.703 0.000322 ***
## hp$Offers -8019.003 1013.011 -7.916 1.32e-12 ***
## hp$BrickYes 17058.771 1942.805 8.780 1.28e-14 ***
## hp$WestNeighborhood 21937.572 2482.393 8.837 9.39e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9995 on 121 degrees of freedom
## Multiple R-squared: 0.8682, Adjusted R-squared: 0.8616
## F-statistic: 132.8 on 6 and 121 DF, p-value: < 2.2e-16
This model is better: the insignificant variables HomeID and NorthNeighborhood have been removed, the adjusted R-squared rose slightly (from 0.860 to 0.8616), and every remaining coefficient is statistically significant. The larger F-statistic reflects the fact that essentially the same explanatory power is now achieved with fewer predictors.
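A partial F-test on the two nested fits makes this comparison formal; a quick check:
# Nested-model comparison: a large p-value means HomeID and NorthNeighborhood
# add no significant explanatory power beyond the reduced model
anova(h2,h1)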
Based on the raw coefficients from this new regression, the equation becomes: y = 3067.47 + 52.15(SqFt) + 4070.01(Bedrooms) + 7810.70(Bathrooms) - 8019.00(Offers) + 17058.77(BrickYes) + 21937.57(WestNeighborhood)
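To illustrate the fitted equation, the model can be used to score a hypothetical house. Because h2 was specified with hp$ prefixes in the formula, predict() cannot make use of a newdata frame, so this sketch refits the same model with bare variable names; the house characteristics below are made up:
# Refit the reduced model so predict() can accept new data, then price a
# hypothetical 2000 sq ft brick house in the West neighborhood
h2b=lm(Price~SqFt+Bedrooms+Bathrooms+Offers+Brick+WestNeighborhood,data=hp)
newhouse=data.frame(SqFt=2000,Bedrooms=3,Bathrooms=2,Offers=2,Brick="Yes",WestNeighborhood=1)
predict(h2b,newdata=newhouse)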
# Drop the factor columns (Brick, then Neighborhood) so cor() runs on numeric data only
hp=hp[-7]
hp=hp[-7]
cor(hp)
## HomeID Price SqFt Bedrooms Bathrooms
## HomeID 1.00000000 0.1081900 0.1685533 -0.06856814 0.1276935
## Price 0.10818998 1.0000000 0.5529822 0.52592606 0.5232578
## SqFt 0.16855330 0.5529822 1.0000000 0.48380711 0.5227453
## Bedrooms -0.06856814 0.5259261 0.4838071 1.00000000 0.4145560
## Bathrooms 0.12769353 0.5232578 0.5227453 0.41455596 1.0000000
## Offers -0.05359711 -0.3136359 0.3369234 0.11427061 0.1437934
## WestNeighborhood 0.02687339 0.7140066 0.2507592 0.47147686 0.2859231
## NorthNeighborhood 0.04985927 -0.5482211 -0.2888860 -0.36466744 -0.2758297
## Offers WestNeighborhood NorthNeighborhood
## HomeID -0.05359711 0.02687339 0.04985927
## Price -0.31363588 0.71400660 -0.54822108
## SqFt 0.33692335 0.25075921 -0.28888599
## Bedrooms 0.11427061 0.47147686 -0.36466744
## Bathrooms 0.14379340 0.28592314 -0.27582970
## Offers 1.00000000 -0.32742521 0.33298661
## WestNeighborhood -0.32742521 1.00000000 -0.47909760
## NorthNeighborhood 0.33298661 -0.47909760 1.00000000
In this correlation matrix, a few variables are noticeably correlated with one another. First, SqFt is moderately correlated with Bathrooms, Bedrooms, and Offers. This is not surprising, because a larger house tends to have more bedrooms and bathrooms. Since SqFt is more significant in the regression than any of those three variables, I would keep SqFt in the model. WestNeighborhood is also correlated with several variables, including Bedrooms, Offers, and NorthNeighborhood; because WestNeighborhood is the most significant of these in the regression, I would keep it in the model as well.
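Variance inflation factors give a more direct read on whether this correlation among predictors is a problem for the regression; a quick check, assuming the car package is installed:
# Variance inflation factors for the reduced model; values well below 5
# suggest the multicollinearity is tolerable
library(car)
vif(h2)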
n=length(hp$Price)
diff=numeric(n)      # prediction error for each left-out observation
percdiff=numeric(n)  # absolute percentage error for each left-out observation
# Leave-one-out cross-validation: fit the SqFt-only model to all rows except k,
# then predict the held-out row k
for (k in 1:n) {
  train=(1:n)[-k]
  hp10=lm(Price~SqFt,data=hp[train,])
  pred=predict(hp10,newdata=hp[-train,])
  obs=hp$Price[-train]
  diff[k]=obs-pred
  percdiff[k]=abs(diff[k])/obs
}
me=mean(diff)              # mean error
rmse=sqrt(mean(diff^2))    # root mean squared error
mape=100*mean(percdiff)    # mean absolute percentage error
me
## [1] 14813.36
rmse
## [1] 30589.65
mape
## [1] 16.57394
This loop carries out leave-one-out cross-validation for a model that predicts Price from SqFt alone. The resulting out-of-sample error measures are high: when the SqFt-only model is used to predict the price of a house that was held out of the fit, its predictions miss by a wide margin. SqFt by itself is therefore not enough to reliably predict house prices.
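For comparison, the same leave-one-out loop can be run with the richer set of predictors still present in hp (Brick was dropped from the data frame above, so it is omitted here); a sketch:
# Leave-one-out cross-validation for a multi-predictor model, for comparison
diff2=numeric(n)
percdiff2=numeric(n)
for (k in 1:n) {
  train=(1:n)[-k]
  m=lm(Price~SqFt+Bedrooms+Bathrooms+Offers+WestNeighborhood,data=hp[train,])
  pred=predict(m,newdata=hp[-train,])
  diff2[k]=hp$Price[k]-pred
  percdiff2[k]=abs(diff2[k])/hp$Price[k]
}
sqrt(mean(diff2^2))   # cross-validated RMSE
100*mean(percdiff2)   # cross-validated MAPE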
download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/DirectMarketing.csv", "DirectMarketing.csv",method="curl")
DirectMk<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/DirectMarketing.csv")
attach(DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Age,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Gender,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$OwnHome,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Married,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Location,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Salary,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Children,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$History,data=DirectMk)
plot(DirectMk$AmountSpent~DirectMk$Catalogs,data=DirectMk)
Based on these plots, a number of variables need to be converted to dummy variables: History, Location, Married, OwnHome, Gender, and Age. The plots also suggest that Children, Salary, History, and Catalogs may be significant predictors. The numerical plots (Salary and Children) show a roughly linear relationship with the dependent variable (AmountSpent), and in the other promising plots (History and Catalogs) the categories of the independent variable differ noticeably in typical AmountSpent.
This code will make dummy variables for the following independent variables: Age, Gender, OwnHome, Married, History, and Location.
# 0/1 dummy variables for the categorical predictors; the omitted level of each
# factor serves as the baseline captured by the intercept
DirectMk$Old=ifelse(DirectMk$Age=="Old",1,0)
DirectMk$Middle=ifelse(DirectMk$Age=="Middle",1,0)
DirectMk$Male=ifelse(DirectMk$Gender=="Male",1,0)
DirectMk$Own=ifelse(DirectMk$OwnHome=="Own",1,0)
DirectMk$Single=ifelse(DirectMk$Married=="Single",1,0)
# History is missing for some customers, so these two dummies are NA for those rows
DirectMk$HighHistory=ifelse(DirectMk$History=="High",1,0)
DirectMk$LowHistory=ifelse(DirectMk$History=="Low",1,0)
DirectMk$Close=ifelse(DirectMk$Location=="Close",1,0)
# Drop the original History factor, then keep the numeric variables and the dummies
market=DirectMk[-8]
market=market[,6:17]
head(market)
## Salary Children Catalogs AmountSpent Old Middle Male Own Single
## 1 47500 0 6 755 1 0 0 1 1
## 2 63600 0 6 1318 0 1 1 0 1
## 3 13500 0 18 296 0 0 0 0 1
## 4 85600 1 18 2436 0 1 1 1 0
## 5 68400 0 12 1304 0 1 0 1 1
## 6 30400 0 6 495 0 0 1 1 0
## HighHistory LowHistory Close
## 1 1 0 0
## 2 1 0 1
## 3 0 1 1
## 4 1 0 1
## 5 1 0 1
## 6 0 1 1
d1=lm(market$AmountSpent~.,data=market)
summary(d1)
##
## Call:
## lm(formula = market$AmountSpent ~ ., data = market)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1648.11 -286.72 -12.63 218.21 2771.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.623e+01 9.344e+01 0.923 0.356
## Salary 1.883e-02 1.245e-03 15.124 < 2e-16 ***
## Children -2.683e+02 2.502e+01 -10.723 < 2e-16 ***
## Catalogs 4.052e+01 2.868e+00 14.128 < 2e-16 ***
## Old -4.827e+01 6.189e+01 -0.780 0.436
## Middle -8.965e+01 5.874e+01 -1.526 0.127
## Male -5.370e+01 3.802e+01 -1.413 0.158
## Own 1.829e+01 4.151e+01 0.441 0.660
## Single 1.950e+01 4.981e+01 0.392 0.696
## HighHistory 3.446e+02 5.996e+01 5.746 1.38e-08 ***
## LowHistory 7.704e+01 5.889e+01 1.308 0.191
## Close -6.090e+02 4.399e+01 -13.845 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
This is not a good regression: several of the variables are not significant, including the Age, Gender, OwnHome, and Married dummies. In addition, 303 observations were dropped because of missing data (History is not recorded for some customers, so its dummies are NA there), and the overall F-statistic is diluted by carrying so many predictors that contribute nothing.
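The source of those dropped observations can be confirmed by counting missing values per column; a quick check:
# Count NA values in each column of the modeling data frame
colSums(is.na(market))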
d2=lm(market$AmountSpent~market$Salary+market$Children+market$Catalogs+market$HighHistory+market$Close,data=market)
summary(d2)
##
## Call:
## lm(formula = market$AmountSpent ~ market$Salary + market$Children +
## market$Catalogs + market$HighHistory + market$Close, data = market)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1496.89 -292.01 -20.42 207.65 2854.05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.425e+02 6.290e+01 2.266 0.0238 *
## market$Salary 1.698e-02 8.446e-04 20.101 < 2e-16 ***
## market$Children -2.529e+02 1.977e+01 -12.791 < 2e-16 ***
## market$Catalogs 3.932e+01 2.833e+00 13.879 < 2e-16 ***
## market$HighHistory 3.600e+02 5.948e+01 6.052 2.35e-09 ***
## market$Close -5.952e+02 4.254e+01 -13.993 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 464.2 on 691 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7862, Adjusted R-squared: 0.7847
## F-statistic: 508.3 on 5 and 691 DF, p-value: < 2.2e-16
This is a better model. Although the adjusted R-squared decreased very slightly (0.7853 to 0.7847), the F-statistic increased sharply because essentially the same explanatory power is now achieved with five predictors instead of eleven. All of the remaining variables have very small p-values, so each one contributes significantly to predicting AmountSpent.
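As with the house-price models, a partial F-test can confirm that the six dropped dummies add no real explanatory power (both fits use the same complete cases, so the comparison is valid); a quick check:
# Nested-model comparison of the reduced (d2) and full (d1) marketing models
anova(d2,d1)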
library(lm.beta)
lm.beta(d2)
##
## Call:
## lm(formula = market$AmountSpent ~ market$Salary + market$Children +
## market$Catalogs + market$HighHistory + market$Close, data = market)
##
## Standardized Coefficients::
## (Intercept) market$Salary market$Children
## 0.0000000 0.5210862 -0.2642102
## market$Catalogs market$HighHistory market$Close
## 0.2597363 0.1734627 -0.2717141
Because the predictors are measured on very different scales (salary in dollars, catalogs in counts, dummies as 0/1), the raw coefficients are hard to compare directly, so we use the standardized coefficients to gauge relative importance. They give the following equation, in standard-deviation units:
y = 0.52(Salary) - 0.26(Children) + 0.26(Catalogs) + 0.17(HighHistory) - 0.27(Close)
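These values can be cross-checked by refitting the model on z-scored variables (complete cases only), which should closely reproduce the lm.beta output; a sketch:
# Refit on standardized (z-scored) variables to reproduce the standardized coefficients
cc=na.omit(market[,c("AmountSpent","Salary","Children","Catalogs","HighHistory","Close")])
coef(lm(AmountSpent~.,data=data.frame(scale(cc))))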
download.file("https://www.biz.uiowa.edu/faculty/jledolter/datamining/GenderDiscrimination.csv", "GenderDiscrimination.csv",method="curl")
GenDsc<-read.csv("https://www.biz.uiowa.edu/faculty/jledolter/datamining/GenderDiscrimination.csv")
attach(GenDsc)
plot(GenDsc$Salary~GenDsc$Experience,data=GenDsc)
plot(GenDsc$Salary~GenDsc$Gender,data=GenDsc)
These plots show Salary against Experience and against Gender. There may be a weak positive relationship between Salary and Experience, although a large cluster of points in one region of the plot makes it hard to judge. Gender does not look particularly important on its own, because the male and female box plots overlap substantially.
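To put rough numbers on that overlap, the group medians can be compared; a quick check:
# Median salary and experience by gender
tapply(GenDsc$Salary,GenDsc$Gender,median)
tapply(GenDsc$Experience,GenDsc$Gender,median)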
# 0/1 dummy variable for Gender (1 = Female, 0 = Male)
GenDsc$Female=ifelse(GenDsc$Gender=="Female",1,0)
# Keep Experience, Salary, and the Female dummy
gd=GenDsc[,2:4]
head(gd)
## Experience Salary Female
## 1 15 78200 1
## 2 12 66400 1
## 3 15 61200 1
## 4 3 61000 1
## 5 4 60000 1
## 6 4 68000 1
g1=lm(gd$Salary~.,data=gd)
summary(g1)
##
## Call:
## lm(formula = gd$Salary ~ ., data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52779 -9806 -121 8347 60913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70280.6 2801.7 25.085 < 2e-16 ***
## Experience 1744.6 160.7 10.858 < 2e-16 ***
## Female -17020.6 2499.6 -6.809 1.06e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
This is not a strong regression. Even though both independent variables have very small p-values, the adjusted R-squared is only about 0.44, so Experience and Gender together explain less than half of the variation in Salary; they are not enough to determine salary on their own. It is worth noting, though, that once Experience is held constant the Female coefficient is large, negative, and highly significant, even though the raw box plots overlapped.
Based on the raw coefficients from the regression, you get the equation: y = 70280.6 + 1744.6(Experience) - 17020.6(Female)
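To illustrate the fitted equation, the model can score hypothetical employees; the experience value below is made up:
# Predicted salary for a woman versus a man, each with 10 years of experience
predict(g1,newdata=data.frame(Experience=10,Female=c(1,0)))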