Data Set #1

HP <- read.csv("/Users/hannahpeterson/Documents/R stuff/HousePrices.csv")
plot(Price~SqFt,data=HP)

plot(Price~Bedrooms,data=HP)

plot(Price~Bathrooms,data=HP)

plot(Price~Offers,data=HP)

plot(Price~Brick,data=HP)

plot(Price~Neighborhood,data=HP)

HP=HP[-1] # drop the first column (an ID column not used in the models)
HP[1:3,]
##    Price SqFt Bedrooms Bathrooms Offers Brick Neighborhood
## 1 114300 1790        2         2      2    No         East
## 2 114200 2030        4         2      3    No         East
## 3 114800 1740        3         2      1    No         East
HP2=HP[,1:5] # keep only the numeric variables (Price through Offers) for the correlation matrix
head(HP2)
##    Price SqFt Bedrooms Bathrooms Offers
## 1 114300 1790        2         2      2
## 2 114200 2030        4         2      3
## 3 114800 1740        3         2      1
## 4  94700 1980        3         2      3
## 5 119800 2130        3         3      3
## 6 114600 1780        3         2      2
cor(HP2)
##                Price      SqFt  Bedrooms Bathrooms     Offers
## Price      1.0000000 0.5529822 0.5259261 0.5232578 -0.3136359
## SqFt       0.5529822 1.0000000 0.4838071 0.5227453  0.3369234
## Bedrooms   0.5259261 0.4838071 1.0000000 0.4145560  0.1142706
## Bathrooms  0.5232578 0.5227453 0.4145560 1.0000000  0.1437934
## Offers    -0.3136359 0.3369234 0.1142706 0.1437934  1.0000000
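
The same relationships can also be viewed as a scatterplot matrix; pairs() is base R, so no extra packages are needed.

pairs(HP2) # scatterplots of every pair of numeric variables, mirroring cor(HP2)
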
m1=lm(Price~.,data=HP) # full model: Price regressed on all remaining variables
summary(m1)
## 
## Call:
## lm(formula = Price ~ ., data = HP)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -27337.3  -6549.5    -41.7   5803.4  27359.3 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         598.919   9552.197   0.063  0.95011    
## SqFt                 52.994      5.734   9.242 1.10e-15 ***
## Bedrooms           4246.794   1597.911   2.658  0.00894 ** 
## Bathrooms          7883.278   2117.035   3.724  0.00030 ***
## Offers            -8267.488   1084.777  -7.621 6.47e-12 ***
## BrickYes          17297.350   1981.616   8.729 1.78e-14 ***
## NeighborhoodNorth  1560.579   2396.765   0.651  0.51621    
## NeighborhoodWest  22241.616   2531.758   8.785 1.32e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10020 on 120 degrees of freedom
## Multiple R-squared:  0.8686, Adjusted R-squared:  0.861 
## F-statistic: 113.3 on 7 and 120 DF,  p-value: < 2.2e-16

According to the adjusted R-squared, the variables in this model explain about 86% of the variation in home prices. All predictors are statistically significant except NeighborhoodNorth, which means that, once the other variables are accounted for, prices in the North neighborhood are not significantly different from the baseline (East) neighborhood. The fitted regression equation is Price = 598.92 + 52.99*SqFt + 4246.79*Bedrooms + 7883.28*Bathrooms - 8267.49*Offers + 17297.35*BrickYes + 1560.58*NeighborhoodNorth + 22241.62*NeighborhoodWest.
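
One way to read this equation is to plug values for a hypothetical house into predict(). The predictor values below are made up purely for illustration; predict() returns the fitted price along with a prediction interval.

newhouse=data.frame(SqFt=2000,Bedrooms=3,Bathrooms=2,Offers=2,Brick="Yes",Neighborhood="West") # hypothetical house, not from the data
predict(m1,newdata=newhouse,interval="prediction")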

Data Set #2

DM <- read.csv("/Users/hannahpeterson/Documents/R stuff/DirectMarketing.csv")
plot(AmountSpent~Age,data=DM)

plot(AmountSpent~Gender,data=DM)

plot(AmountSpent~OwnHome,data=DM)

plot(AmountSpent~Married,data=DM)

plot(AmountSpent~Location,data=DM)

plot(AmountSpent~Salary,data=DM)

plot(AmountSpent~Children,data=DM)

plot(AmountSpent~History,data=DM)

plot(AmountSpent~Catalogs,data=DM)

m1=lm(AmountSpent~.,data=DM) # full model: AmountSpent regressed on every predictor
summary(m1)
## 
## Call:
## lm(formula = AmountSpent ~ ., data = DM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1648.11  -286.72   -12.63   218.21  2771.25 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.496e+02  1.340e+02  -1.862  0.06302 .  
## AgeOld         4.139e+01  5.276e+01   0.784  0.43311    
## AgeYoung       8.965e+01  5.874e+01   1.526  0.12740    
## GenderMale    -5.370e+01  3.802e+01  -1.413  0.15823    
## OwnHomeRent   -1.829e+01  4.151e+01  -0.441  0.65967    
## MarriedSingle  1.950e+01  4.981e+01   0.392  0.69553    
## LocationFar    6.090e+02  4.399e+01  13.845  < 2e-16 ***
## Salary         1.883e-02  1.245e-03  15.124  < 2e-16 ***
## Children      -2.683e+02  2.502e+01 -10.723  < 2e-16 ***
## HistoryLow    -2.675e+02  8.862e+01  -3.019  0.00263 ** 
## HistoryMedium -3.446e+02  5.996e+01  -5.746 1.38e-08 ***
## Catalogs       4.052e+01  2.868e+00  14.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.5 on 685 degrees of freedom
##   (303 observations deleted due to missingness)
## Multiple R-squared:  0.7887, Adjusted R-squared:  0.7853 
## F-statistic: 232.5 on 11 and 685 DF,  p-value: < 2.2e-16
m2=lm(AmountSpent~Location+Salary+Children+History+Catalogs,data=DM) # reduced model: significant predictors only
summary(m2)
## 
## Call:
## lm(formula = AmountSpent ~ Location + Salary + Children + History + 
##     Catalogs, data = DM)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1582.30  -281.82   -19.75   216.38  2796.08 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.994e+02  1.075e+02  -1.854  0.06414 .  
## LocationFar    6.154e+02  4.378e+01  14.057  < 2e-16 ***
## Salary         1.809e-02  1.027e-03  17.603  < 2e-16 ***
## Children      -2.742e+02  2.274e+01 -12.059  < 2e-16 ***
## HistoryLow    -2.405e+02  8.676e+01  -2.772  0.00572 ** 
## HistoryMedium -3.468e+02  5.978e+01  -5.801    1e-08 ***
## Catalogs       4.013e+01  2.860e+00  14.031  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 463.3 on 690 degrees of freedom
##   (303 observations deleted due to missingness)
## Multiple R-squared:  0.7873, Adjusted R-squared:  0.7855 
## F-statistic: 425.8 on 6 and 690 DF,  p-value: < 2.2e-16

For this data set, I decided to focus on the variables that were statistically significant. The first regression model showed that Location, Salary, Children, History, and Catalogs were the significant predictors, while the Age, Gender, OwnHome, and Married dummies were not, so the second model keeps only the significant ones. Together they explain about 79% of the variation in the amount spent. The fitted regression equation is AmountSpent = -199.4 + 615.4*LocationFar + 0.0181*Salary - 274.2*Children - 240.5*HistoryLow - 346.8*HistoryMedium + 40.13*Catalogs.
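
Because m2 is m1 with the non-significant terms removed, a partial F-test can check that dropping Age, Gender, OwnHome, and Married does not hurt the fit. This is an optional sketch, assuming the 303 rows dropped for missingness are the same in both fits so the models are compared on the same complete cases.

anova(m2,m1) # partial F-test comparing the reduced model to the full model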

Data Set #3

GD <- read.csv("/Users/hannahpeterson/Documents/R stuff/GenderDiscrimination.csv")
plot(Salary~Gender,data=GD)

plot(Salary~Experience,data=GD)

m1=lm(Salary~.,data=GD)
summary(m1)
## 
## Call:
## lm(formula = Salary ~ ., data = GD)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -52779  -9806   -121   8347  60913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  53260.0     2416.6  22.039  < 2e-16 ***
## GenderMale   17020.6     2499.6   6.809 1.06e-10 ***
## Experience    1744.6      160.7  10.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16910 on 205 degrees of freedom
## Multiple R-squared:  0.4413, Adjusted R-squared:  0.4359 
## F-statistic: 80.98 on 2 and 205 DF,  p-value: < 2.2e-16

In this sample, gender and experience both have a statistically significant effect on a person's salary. These two variables alone explain about 44% of the variation in salary. The fitted regression equation is Salary = 53260 + 17020.6*GenderMale + 1744.6*Experience. Holding experience constant, the model estimates that men earn about $17,021 more than women on average, and each additional year of experience adds about $1,745, so a single year of experience matters far less than the estimated gender gap.
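
To put a margin of error around these estimates, confidence intervals for the coefficients can be computed directly from the fitted model.

confint(m1) # 95% confidence intervals for the intercept, gender gap, and experience effect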

set.seed(1)
n=length(GD$Salary) # n is the number of cases of the DV
n1=100              # size of the training sample
n1
## [1] 100
n2=n-n1
n2
## [1] 108
train=sample(1:n,n1) # randomly pick n1 row indices for the training set
m2=lm(Salary~.,data=GD[train,]) # fit the model on the training cases only
summary(m2)
## 
## Call:
## lm(formula = Salary ~ ., data = GD[train, ])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34665 -10717     34   8730  49337 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51642.4     3285.4  15.719  < 2e-16 ***
## GenderMale   20123.7     3551.1   5.667 1.49e-07 ***
## Experience    2024.9      242.1   8.365 4.48e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15930 on 97 degrees of freedom
## Multiple R-squared:  0.5239, Adjusted R-squared:  0.5141 
## F-statistic: 53.36 on 2 and 97 DF,  p-value: 2.343e-16
pred=predict(m2,newdata=GD[-train,]) # predicted salaries for the holdout cases
obs=GD$Salary[-train]                # observed salaries in the holdout set
diff=obs-pred
percdiff=abs(diff)/obs
me=mean(diff)
rmse=sqrt(mean(diff**2))
mape=100*(mean(percdiff))
me   # mean error
## [1] -5342.269
rmse # root mean square error
## [1] 18482.52
mape # mean absolute percent error
## [1] 18.79173

For the holdout evaluation, 100 randomly chosen observations were used as the training sample and the remaining 108 as the holdout sample. In the training data, gender and experience accounted for about 51% of the variation in Salary. On the holdout cases the mean absolute percent error was about 19%, which suggests the model gives reasonably accurate out-of-sample predictions, though they are still off by roughly 19% of salary on average.
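
A single 100/108 split can look optimistic or pessimistic depending on the random seed, so a minimal sketch of a sturdier check is to repeat the split several times and average the holdout RMSE; the 50 repeats below are an arbitrary choice.

set.seed(1)
rmses=replicate(50,{
  tr=sample(1:n,n1)                 # new random training set of n1 cases
  fit=lm(Salary~.,data=GD[tr,])     # refit on the training cases
  pr=predict(fit,newdata=GD[-tr,])  # predict the holdout cases
  sqrt(mean((GD$Salary[-tr]-pr)^2)) # holdout RMSE for this split
})
mean(rmses) # average holdout RMSE across the repeated splits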