title: "Homework 2"
author: Abbey Ober

Data Set: House Prices

This data set contains prices and characteristics of 128 houses in a major US metropolitan area. Plots of price against each characteristic suggest which characteristics are related to the price of a home. The plots show a positive correlation between price and square footage. Homes with more bedrooms and bathrooms also tend to have more square footage, so price can be expected to correlate positively with these as well.

m1=lm(Price~.,data=hp)
summary(m1)
## 
## Call:
## lm(formula = Price ~ ., data = hp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27337.3  -6549.5    -41.7   5803.4  27359.3 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        22840.536  10236.302   2.231  0.02752 *  
## SqFt                  52.994      5.734   9.242 1.10e-15 ***
## Bedrooms            4246.794   1597.911   2.658  0.00894 ** 
## Bathrooms           7883.278   2117.035   3.724  0.00030 ***
## Offers             -8267.488   1084.777  -7.621 6.47e-12 ***
## Brick              17297.350   1981.616   8.729 1.78e-14 ***
## NeighborhoodNorth -20681.037   3148.954  -6.568 1.38e-09 ***
## NeighborhoodEast  -22241.616   2531.758  -8.785 1.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10020 on 120 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.861 
## F-statistic: 113.3 on 7 and 120 DF,  p-value: < 2.2e-16

We run a regression on all the characteristics of a home to find a good model. The adjusted R-squared is 0.861, so the model explains about 86% of the variation in price after adjusting for the number of predictors, which indicates a good fit. Every coefficient has a p-value below 0.05, so each characteristic is statistically significant in this model.
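The two R-squared values in the output differ because the adjusted version applies the standard penalty for the number of predictors. As a quick check, the sketch below recomputes the adjusted R-squared from the reported (rounded) multiple R-squared, using n = 128 houses and p = 7 predictors. The helper name `adj_r2` is introduced here for illustration and is not part of the original analysis.

```r
# Adjusted R-squared from multiple R-squared, sample size n, and p predictors.
# adj_r2 is a hypothetical helper name used only for this illustration.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Inputs taken from summary(m1): R-squared 0.8686, 128 houses, 7 predictors.
m1_adj <- adj_r2(0.8686, n = 128, p = 7)
print(round(m1_adj, 3))
```

Because the input R-squared is itself rounded, the result should agree with the reported adjusted R-squared only up to rounding.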

The equation for this model is: Price = 22840.54 + 52.99 SqFt + 4246.79 Bedrooms + 7883.28 Bathrooms - 8267.49 Offers + 17297.35 Brick - 20681.04 NeighborhoodNorth - 22241.62 NeighborhoodEast
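To make the equation concrete, the sketch below plugs in the characteristics of one house. The house is hypothetical (it is not a row from the data), and the example assumes Brick is coded 1 for a brick house and that both neighborhood dummies equal 0 for the baseline neighborhood.

```r
# Coefficients copied from summary(m1), in the order they appear in the output.
b <- c(Intercept = 22840.536, SqFt = 52.994, Bedrooms = 4246.794,
       Bathrooms = 7883.278, Offers = -8267.488, Brick = 17297.350,
       NeighborhoodNorth = -20681.037, NeighborhoodEast = -22241.616)

# Hypothetical house: 2000 sq ft, 3 bedrooms, 2 bathrooms, 2 offers, brick,
# located in the baseline neighborhood (assumed coding: both dummies are 0).
x <- c(Intercept = 1, SqFt = 2000, Bedrooms = 3, Bathrooms = 2,
       Offers = 2, Brick = 1, NeighborhoodNorth = 0, NeighborhoodEast = 0)

price_hat <- sum(b * x)   # intercept plus the sum of coefficient * value
print(price_hat)
```

Setting NeighborhoodNorth to 1 in `x` would lower the predicted price by the North coefficient, which is how the dummy variables shift the prediction between neighborhoods.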

Next is a correlation matrix, which we can use to check for multicollinearity involving bedrooms.

cor(hp)
##                        Price        SqFt    Bedrooms    Bathrooms
## Price              1.0000000  0.55298224  0.52592606  0.523257758
## SqFt               0.5529822  1.00000000  0.48380711  0.522745301
## Bedrooms           0.5259261  0.48380711  1.00000000  0.414555956
## Bathrooms          0.5232578  0.52274530  0.41455596  1.000000000
## Offers            -0.3136359  0.33692335  0.11427061  0.143793404
## Brick              0.4528168  0.07979216  0.04638008  0.171976913
## NeighborhoodNorth -0.5482211 -0.28888599 -0.36466744 -0.275829702
## NeighborhoodEast  -0.1429589  0.04563915 -0.09175034 -0.001247208
##                        Offers       Brick NeighborhoodNorth
## Price             -0.31363588  0.45281679        -0.5482211
## SqFt               0.33692335  0.07979216        -0.2888860
## Bedrooms           0.11427061  0.04638008        -0.3646674
## Bathrooms          0.14379340  0.17197691        -0.2758297
## Offers             1.00000000 -0.14498606         0.3329866
## Brick             -0.14498606  1.00000000        -0.2605536
## NeighborhoodNorth  0.33298661 -0.26055361         1.0000000
## NeighborhoodEast  -0.01560205  0.14756390        -0.5329100
##                   NeighborhoodEast
## Price                 -0.142958878
## SqFt                   0.045639151
## Bedrooms              -0.091750338
## Bathrooms             -0.001247208
## Offers                -0.015602052
## Brick                  0.147563896
## NeighborhoodNorth     -0.532910044
## NeighborhoodEast       1.000000000

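The matrix shows moderate positive correlations among square footage, bedrooms, and bathrooms (0.41 to 0.52), so some multicollinearity exists among these predictors, although none of the pairwise correlations is very high. Each of the three is also positively correlated with price (0.52 to 0.55).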
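Pairwise correlations show only part of the picture. A common complementary diagnostic, not used in the original analysis, is the variance inflation factor (VIF), which equals the diagonal of the inverse of the predictor correlation matrix. The sketch below rebuilds the predictor block of the matrix from the printed values (Price is excluded) and computes the VIFs in base R. It assumes the printed, rounded correlations are accurate enough for this purpose.

```r
# Predictor names, in the same order as the cor(hp) output (Price excluded).
preds <- c("SqFt", "Bedrooms", "Bathrooms", "Offers", "Brick",
           "NeighborhoodNorth", "NeighborhoodEast")

# Lower-triangle correlations copied from the output, column by column.
lower <- c(
   0.48380711,  0.52274530,  0.33692335,  0.07979216, -0.28888599,  0.04563915,
   0.41455596,  0.11427061,  0.04638008, -0.36466744, -0.09175034,
   0.14379340,  0.17197691, -0.27582970, -0.00124721,
  -0.14498606,  0.33298661, -0.01560205,
  -0.26055361,  0.14756390,
  -0.53291004
)

R <- matrix(0, 7, 7, dimnames = list(preds, preds))
R[lower.tri(R)] <- lower   # fills the lower triangle in column-major order
R <- R + t(R)              # mirror into the upper triangle
diag(R) <- 1

vif <- diag(solve(R))      # VIF of each predictor given all the others
print(round(vif, 2))
```

Values near 1 indicate little inflation. A common rule of thumb treats values above 5 or 10 as a sign of problematic multicollinearity.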

Next we fit the model on a training sample of half the original data (64 houses) and use it to predict prices for the other half.

set.seed(1)
n=length(hp$Price)
n1=64
n2=n-n1
train=sample(1:n,n1)
m2=lm(Price~.,data=hp[train,]) 
summary(m2)
## 
## Call:
## lm(formula = Price ~ ., data = hp[train, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -28150  -3716  -1214   5635  26188 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        21865.996  14357.716   1.523  0.13340    
## SqFt                  49.042      7.843   6.253 5.88e-08 ***
## Bedrooms            6108.791   2242.421   2.724  0.00858 ** 
## Bathrooms           7379.366   2937.911   2.512  0.01492 *  
## Offers             -7192.172   1474.338  -4.878 9.25e-06 ***
## Brick              17832.637   2931.653   6.083 1.11e-07 ***
## NeighborhoodNorth -19404.887   4339.514  -4.472 3.84e-05 ***
## NeighborhoodEast  -21195.669   3596.327  -5.894 2.26e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9801 on 56 degrees of freedom
## Multiple R-squared:  0.8615, Adjusted R-squared:  0.8441 
## F-statistic: 49.74 on 7 and 56 DF,  p-value: < 2.2e-16

The regression output is similar to the first regression, with slightly different values. The p-value for bathrooms increased in the second regression (from 0.0003 to 0.0149). The adjusted R-squared also decreased by 1.7 percentage points (from 0.861 to 0.8441).

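pred=predict(m2,newdata=hp[-train, ])  # predict prices for the held-out half
obs=hp$Price[-train]                   # observed prices for the held-out half
diff=obs-pred                          # prediction errors
percdiff=abs(diff)/obs                 # absolute errors as a fraction of the observed price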

mean error:

me=mean(diff)
me
## [1] 1460.916

root mean square error:

rmse=sqrt(sum(diff**2)/n2)
rmse
## [1] 10511.67

mean absolute percent error:

mape=100*(mean(percdiff)) 
mape 
## [1] 6.581649

The mean absolute percent error is 6.6%. Since this is under 10%, the predictions for the held-out houses are quite accurate.
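The three error measures above can be collected into one small function so they are computed the same way for any model. This is a sketch only: `holdout_errors` is a hypothetical helper name, and the toy vectors are invented for illustration and are not taken from the house price data.

```r
# Hold-out error metrics; holdout_errors is a hypothetical helper name.
holdout_errors <- function(obs, pred) {
  err <- obs - pred
  c(me   = mean(err),                    # mean error (bias)
    rmse = sqrt(mean(err^2)),            # root mean square error
    mape = 100 * mean(abs(err) / obs))   # mean absolute percent error
}

# Toy values for illustration only.
toy_obs  <- c(100, 200, 400)
toy_pred <- c(110, 190, 380)
res <- holdout_errors(toy_obs, toy_pred)
print(res)
```

With the objects defined earlier, `holdout_errors(obs, pred)` would be expected to reproduce the three values reported above, since `sqrt(mean(err^2))` is the same quantity as `sqrt(sum(diff**2)/n2)`.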

Data Set: Direct Marketing

This data set includes data from a direct marketer who sells products only through direct mail.

library(lattice)  # provides xyplot()
dm<- read.csv("c:/users/abbey/Desktop/Data Mining/DirectMarketing.csv")
xyplot(AmountSpent~Salary, data=dm)

xyplot(AmountSpent~Children, data=dm)

xyplot(AmountSpent~Married, data=dm)

xyplot(AmountSpent~Catalogs,data=dm)

xyplot(AmountSpent~Location, data=dm)

plot(AmountSpent~History,data=dm)

These graphs show that amount spent is related to history, salary, location, number of catalogs, and number of children.

m3=lm(AmountSpent~.,data=dm)
summary(m3)
## 
## Call:
## lm(formula = AmountSpent ~ ., data = dm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1648.11  -286.72   -12.63   218.21  2771.25 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.496e+02  1.340e+02  -1.862  0.06302 .  
## AgeOld         4.139e+01  5.276e+01   0.784  0.43311    
## AgeYoung       8.965e+01  5.874e+01   1.526  0.12740    
## GenderMale    -5.370e+01  3.802e+01  -1.413  0.15823    
## OwnHomeRent   -1.829e+01  4.151e+01  -0.441  0.65967    
## MarriedSingle  1.950e+01  4.981e+01   0.392  0.69553    
## LocationFar    6.090e+02  4.399e+01  13.845  < 2e-16 ***
## Salary         1.883e-02  1.245e-03  15.124  < 2e-16 ***
## Children      -2.683e+02  2.502e+01 -10.723  < 2e-16 ***
## HistoryLow    -2.675e+02  8.862e+01  -3.019  0.00263 ** 
## HistoryMedium -3.446e+02  5.996e+01  -5.746 1.38e-08 ***
## Catalogs       4.052e+01  2.868e+00  14.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.5 on 685 degrees of freedom
##   (303 observations deleted due to missingness)
## Multiple R-squared:  0.7887, Adjusted R-squared:  0.7853 
## F-statistic: 232.5 on 11 and 685 DF,  p-value: < 2.2e-16

The equation for this model is: AmountSpent = -249.6 + 41.4 AgeOld + 89.7 AgeYoung - 53.7 GenderMale - 18.3 OwnHomeRent + 19.5 MarriedSingle + 609.0 LocationFar + 0.0188 Salary - 268.3 Children - 267.5 HistoryLow - 344.6 HistoryMedium + 40.5 Catalogs

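The regression tells us this model is good: it explains 78.5% of the variation (adjusted R-squared) and the overall p-value is low. Note that 303 observations were dropped because of missing values, so the model is fit on 697 of the 1,000 customers. It also tells us that Age, Gender, OwnHome, and Married are not statistically significant in this model.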
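The regression output reports that 303 observations were deleted due to missingness. Under R's default `na.action` setting, `lm()` drops every row that has a missing value in any model variable. The sketch below shows one way to see where the loss comes from. It uses a toy data frame with invented values, and which columns of `dm` actually contain the missing values is not shown in the output above.

```r
# Toy data frame (values invented for illustration) with NAs in one column.
toy <- data.frame(
  y = c(10, 12, 15, 11, 18, 20),
  x = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", NA, "b", "a", NA, "b"))
)

na_per_column <- colSums(is.na(toy))       # missing values in each column
n_complete    <- sum(complete.cases(toy))  # rows with no missing values

fit <- lm(y ~ ., data = toy)   # default na.action drops incomplete rows
print(na_per_column)
print(c(rows = nrow(toy), complete = n_complete, used = nobs(fit)))
```

Applied to the real data, `colSums(is.na(dm))` would show which variables account for the dropped rows.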

Data Set: Gender Discrimination

This data set includes 208 individuals, male and female, with their salary and work experience.

gd<- read.csv("c:/users/abbey/Desktop/Data Mining/GenderDiscrimination.csv")
plot(Salary~Gender,data=gd)

plot(Salary~Experience, data=gd)

These graphs show a positive correlation between salary and years of experience, which suggests experience will be useful in a predictive model.

m4=lm(Salary~.,data=gd)
summary(m4)
## 
## Call:
## lm(formula = Salary ~ ., data = gd)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52779  -9806   -121   8347  60913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53260.0     2416.6  22.039  < 2e-16 ***
## GenderMale   17020.6     2499.6   6.809 1.06e-10 ***
## Experience    1744.6      160.7  10.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4359 
## F-statistic: 80.98 on 2 and 205 DF,  p-value: < 2.2e-16

Equation for this model: Salary = 53260.0 + 17020.6 GenderMale + 1744.6 Experience

Based on this model's adjusted R-squared, it explains only 43.6% of the variation, which is low and makes this a poor predictive model. However, the p-values show that Salary has a significant relationship with both gender and experience.
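To read the GenderMale coefficient, it helps to compare two predictions that differ only in gender. The sketch below does this for 10 years of experience, a value chosen only for illustration. `salary_hat` is a hypothetical helper name, and the example assumes GenderMale is 1 for men and 0 for women, as the coefficient name suggests.

```r
# Coefficients copied from summary(m4).
b0     <- 53260.0   # intercept
b_male <- 17020.6   # GenderMale
b_exp  <- 1744.6    # Experience

# Hypothetical helper: predicted salary for a gender dummy and years of experience.
salary_hat <- function(male, experience) {
  b0 + b_male * male + b_exp * experience
}

female_10 <- salary_hat(male = 0, experience = 10)
male_10   <- salary_hat(male = 1, experience = 10)
print(c(female = female_10, male = male_10, gap = male_10 - female_10))
```

Because the model has no interaction term, the predicted gap equals the GenderMale coefficient at every level of experience.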