title: "Homework 2"
author: Abbey Ober
Data Set: House Prices

This data set contains prices and characteristics of 128 houses in a major US metropolitan area. Plotting price against each characteristic suggests which ones are related to the price of a home. The plots show a positive correlation between price and square footage, and since the number of bedrooms and bathrooms contributes to a home's square footage, we can expect price to correlate positively with those as well.
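The data is loaded the same way as the other sets in this write-up; the file name below is an assumption (adjust it to wherever the house-price CSV actually lives), and the plots referenced above could be reproduced along these lines:

# Assumed file name, mirroring the other read.csv calls in this homework
hp <- read.csv("c:/users/abbey/Desktop/Data Mining/HousePrices.csv")
plot(Price~SqFt, data=hp)
plot(Price~Bedrooms, data=hp)
plot(Price~Bathrooms, data=hp)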
m1=lm(Price~.,data=hp)
summary(m1)
##
## Call:
## lm(formula = Price ~ ., data = hp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27337.3 -6549.5 -41.7 5803.4 27359.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22840.536 10236.302 2.231 0.02752 *
## SqFt 52.994 5.734 9.242 1.10e-15 ***
## Bedrooms 4246.794 1597.911 2.658 0.00894 **
## Bathrooms 7883.278 2117.035 3.724 0.00030 ***
## Offers -8267.488 1084.777 -7.621 6.47e-12 ***
## Brick 17297.350 1981.616 8.729 1.78e-14 ***
## NeighborhoodNorth -20681.037 3148.954 -6.568 1.38e-09 ***
## NeighborhoodEast -22241.616 2531.758 -8.785 1.32e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10020 on 120 degrees of freedom
## Multiple R-squared: 0.8686, Adjusted R-squared: 0.861
## F-statistic: 113.3 on 7 and 120 DF, p-value: < 2.2e-16
We run a regression including all the characteristics of a home to find a good model. The adjusted R-squared shows this model explains 86.1% of the variation in price, which indicates a good fit. Every predictor also has a p-value below 0.05, which means these characteristics are all statistically significant in our model.
The equation for this model is: Price = 22840.54 + 52.99(SqFt) + 4246.79(Bedrooms) + 7883.28(Bathrooms) - 8267.49(Offers) + 17297.35(Brick) - 20681.04(NeighborhoodNorth) - 22241.62(NeighborhoodEast)
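To see the equation in action, we can predict the price of a hypothetical house (the characteristic values below are made up for illustration; hp stores Brick and the Neighborhood indicators as 0/1 dummy columns, as the correlation matrix below confirms):

# Hypothetical 2,000 sq ft brick house in the baseline (West) neighborhood
new_house=data.frame(SqFt=2000, Bedrooms=3, Bathrooms=2, Offers=2,
                     Brick=1, NeighborhoodNorth=0, NeighborhoodEast=0)
predict(m1, newdata=new_house)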
Next is a correlation matrix, which lets us check whether there is multicollinearity involving bedrooms and the other predictors.
cor(hp)
## Price SqFt Bedrooms Bathrooms
## Price 1.0000000 0.55298224 0.52592606 0.523257758
## SqFt 0.5529822 1.00000000 0.48380711 0.522745301
## Bedrooms 0.5259261 0.48380711 1.00000000 0.414555956
## Bathrooms 0.5232578 0.52274530 0.41455596 1.000000000
## Offers -0.3136359 0.33692335 0.11427061 0.143793404
## Brick 0.4528168 0.07979216 0.04638008 0.171976913
## NeighborhoodNorth -0.5482211 -0.28888599 -0.36466744 -0.275829702
## NeighborhoodEast -0.1429589 0.04563915 -0.09175034 -0.001247208
## Offers Brick NeighborhoodNorth
## Price -0.31363588 0.45281679 -0.5482211
## SqFt 0.33692335 0.07979216 -0.2888860
## Bedrooms 0.11427061 0.04638008 -0.3646674
## Bathrooms 0.14379340 0.17197691 -0.2758297
## Offers 1.00000000 -0.14498606 0.3329866
## Brick -0.14498606 1.00000000 -0.2605536
## NeighborhoodNorth 0.33298661 -0.26055361 1.0000000
## NeighborhoodEast -0.01560205 0.14756390 -0.5329100
## NeighborhoodEast
## Price -0.142958878
## SqFt 0.045639151
## Bedrooms -0.091750338
## Bathrooms -0.001247208
## Offers -0.015602052
## Brick 0.147563896
## NeighborhoodNorth -0.532910044
## NeighborhoodEast 1.000000000
This shows that some multicollinearity exists: bedrooms, bathrooms, and square footage are all moderately correlated with one another (roughly 0.41-0.52) as well as with price.
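A more formal check is the variance inflation factor; a sketch using the car package (assuming it is installed) is:

library(car)   # provides vif()
vif(m1)        # values near 1 suggest the multicollinearity here is mild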
Next we take a training sample of half the original data (64 houses) and try to predict the other half.
set.seed(1)
n=length(hp$Price)
n1=64
n2=n-n1
train=sample(1:n,n1)
m2=lm(Price~.,data=hp[train,])
summary(m2)
##
## Call:
## lm(formula = Price ~ ., data = hp[train, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -28150 -3716 -1214 5635 26188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21865.996 14357.716 1.523 0.13340
## SqFt 49.042 7.843 6.253 5.88e-08 ***
## Bedrooms 6108.791 2242.421 2.724 0.00858 **
## Bathrooms 7379.366 2937.911 2.512 0.01492 *
## Offers -7192.172 1474.338 -4.878 9.25e-06 ***
## Brick 17832.637 2931.653 6.083 1.11e-07 ***
## NeighborhoodNorth -19404.887 4339.514 -4.472 3.84e-05 ***
## NeighborhoodEast -21195.669 3596.327 -5.894 2.26e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9801 on 56 degrees of freedom
## Multiple R-squared: 0.8615, Adjusted R-squared: 0.8441
## F-statistic: 49.74 on 7 and 56 DF, p-value: < 2.2e-16
The regression output is similar to the first regression we ran, with slightly different coefficient values. The bathrooms p-value increased in the second regression, and the adjusted R-squared decreased by about 1.7 percentage points (from 0.861 to 0.844).
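To compare the two fits directly, we could line the coefficient vectors up side by side (a quick sketch; the column labels are ours):

# Full-sample vs. training-sample coefficients, rounded for readability
round(cbind(full=coef(m1), train=coef(m2)), 1)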
pred=predict(m2,newdata=hp[-train, ])
obs=hp$Price[-train]
diff=obs-pred
percdiff=abs(diff)/obs
Mean error:
me=mean(diff)
me
## [1] 1460.916
Root mean square error:
rmse=sqrt(sum(diff^2)/n2)
rmse
## [1] 10511.67
Mean absolute percent error:
mape=100*(mean(percdiff))
mape
## [1] 6.581649
The mean absolute percent error is 6.6%; with this being under 10%, the predictions are quite accurate.
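A single random split can be lucky or unlucky, so one way to firm up this estimate (a sketch, reusing n and n1 from above; the number of repeats is arbitrary) is to average the RMSE over several splits:

set.seed(1)
rmses=replicate(10, {
  tr=sample(1:n, n1)                      # new random training half
  m=lm(Price~., data=hp[tr, ])
  p=predict(m, newdata=hp[-tr, ])
  sqrt(mean((hp$Price[-tr]-p)^2))         # test-set RMSE for this split
})
mean(rmses)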
Data Set: Direct Marketing

This data set includes data from a direct marketer who sells products only through direct mail.
dm<- read.csv("c:/users/abbey/Desktop/Data Mining/DirectMarketing.csv")
library(lattice)
xyplot(AmountSpent~Salary, data=dm)
xyplot(AmountSpent~Children, data=dm)
xyplot(AmountSpent~Married, data=dm)
xyplot(AmountSpent~Catalogs,data=dm)
xyplot(AmountSpent~Location, data=dm)
plot(AmountSpent~History,data=dm)
These plots show that amount spent is related to salary, number of children, location, number of catalogs, and purchase history.
m3=lm(AmountSpent~.,data=dm)
summary(m3)
##
## Call:
## lm(formula = AmountSpent ~ ., data = dm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1648.11 -286.72 -12.63 218.21 2771.25
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.496e+02 1.340e+02 -1.862 0.06302 .
## AgeOld 4.139e+01 5.276e+01 0.784 0.43311
## AgeYoung 8.965e+01 5.874e+01 1.526 0.12740
## GenderMale -5.370e+01 3.802e+01 -1.413 0.15823
## OwnHomeRent -1.829e+01 4.151e+01 -0.441 0.65967
## MarriedSingle 1.950e+01 4.981e+01 0.392 0.69553
## LocationFar 6.090e+02 4.399e+01 13.845 < 2e-16 ***
## Salary 1.883e-02 1.245e-03 15.124 < 2e-16 ***
## Children -2.683e+02 2.502e+01 -10.723 < 2e-16 ***
## HistoryLow -2.675e+02 8.862e+01 -3.019 0.00263 **
## HistoryMedium -3.446e+02 5.996e+01 -5.746 1.38e-08 ***
## Catalogs 4.052e+01 2.868e+00 14.128 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 463.5 on 685 degrees of freedom
## (303 observations deleted due to missingness)
## Multiple R-squared: 0.7887, Adjusted R-squared: 0.7853
## F-statistic: 232.5 on 11 and 685 DF, p-value: < 2.2e-16
The equation for this model is: AmountSpent = -249.6 + 41.4(AgeOld) + 89.7(AgeYoung) - 53.7(GenderMale) - 18.3(OwnHomeRent) + 19.5(MarriedSingle) + 609.0(LocationFar) + 0.0188(Salary) - 268.3(Children) - 267.5(HistoryLow) - 344.6(HistoryMedium) + 40.5(Catalogs)
The regression on this data set tells us the model is good, explaining 78.5% of the variation, and the overall p-value is low. It also tells us that age, gender, home ownership, and marital status are not statistically significant in this model.
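A natural follow-up (a sketch, not part of the original assignment) is to refit with only the significant predictors and check that the fit barely changes:

# Drop the insignificant Age, Gender, OwnHome, and Married terms
m3b=lm(AmountSpent~Location+Salary+Children+History+Catalogs, data=dm)
summary(m3b)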
Data Set: Gender Discrimination

This data set includes 208 individuals, male and female, with their salary and work experience.
gd<- read.csv("c:/users/abbey/Desktop/Data Mining/GenderDiscrimination.csv")
plot(Salary~Gender,data=gd)
plot(Salary~Experience, data=gd)
From these graphs you can see a positive correlation between salary and years of experience, which suggests experience will be useful in a predictive model.
m4=lm(Salary~.,data=gd)
summary(m4)
##
## Call:
## lm(formula = Salary ~ ., data = gd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52779 -9806 -121 8347 60913
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53260.0 2416.6 22.039 < 2e-16 ***
## GenderMale 17020.6 2499.6 6.809 1.06e-10 ***
## Experience 1744.6 160.7 10.858 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared: 0.4413, Adjusted R-squared: 0.4359
## F-statistic: 80.98 on 2 and 205 DF, p-value: < 2.2e-16
Equation for this model: Salary = 53260.0 + 17020.6(GenderMale) + 1744.6(Experience)
Based on this model's adjusted R-squared, it explains only 43.6% of the variation, which is low, making this a poor model. However, the p-values show that salary does have a significant relationship with both gender and experience.
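One possible extension (a sketch, not part of the original assignment) is to add a gender-by-experience interaction, which tests whether experience is rewarded at a different rate for men and women:

# Interaction term: does the experience slope differ by gender?
m5=lm(Salary~Gender*Experience, data=gd)
summary(m5)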