1a. Your friend claims that the average house price in this area is above $150K. Do you agree? Briefly explain what the p-values in these cases mean?

Claim: Avg. house price is above $150K

Null Hypothesis: Avg. price is less than or equal to $150K (mu <= 150)

Alternate Hypothesis: Avg. price is above $150K (mu > 150)

houseprice = read.csv("C:\\Users\\Perumalsamy\\Downloads\\houseprices.csv")
xbar = (mean(houseprice$Price))/1000
xbar

## [1] 163.8621

s = (sd(houseprice$Price)/1000)
s

## [1] 67.65156

n = nrow(houseprice)  # 1047 houses in the sample

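Here the p-value is the probability of observing a sample mean of 163.9K or more when H0 is true (i.e. mu <= 150), evaluated at the boundary value mu = 150.

P(Xbar >= xbar ; mu <= 150)

First we find P(Xbar >= 163.9 ; mu <= 150). Because the population standard deviation is unknown and is estimated by s, the test statistic follows a t distribution with n - 1 degrees of freedom:

t = (xbar - mu0)/(s/sqrt(n))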

t = (xbar-150)/(s/sqrt(n))
t

## [1] 6.63018

t = 6.63 with df = 1046

pvalue = pt(6.63,1046, lower.tail = FALSE)
pvalue

## [1] 2.684535e-11

Since the p-value is far below the alpha value (0.05), we reject the null hypothesis. The data support the claim that the average house price in the area is above $150K.
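The manual steps above can be collected into one small function that works from the summary statistics alone. This is a minimal sketch: `upper_tail_t_test` is a hypothetical helper name (not a base R function), and the inputs are the rounded values printed above.

```r
# Hypothetical helper: one-sample, upper-tailed t-test from summary statistics.
upper_tail_t_test <- function(xbar, s, n, mu0) {
  se <- s / sqrt(n)                               # standard error of the sample mean
  t_stat <- (xbar - mu0) / se                     # test statistic
  df <- n - 1                                     # degrees of freedom
  p_value <- pt(t_stat, df, lower.tail = FALSE)   # P(T >= t_stat) at mu = mu0
  list(t = t_stat, df = df, p.value = p_value)
}

# Prices in thousands of dollars, values taken from the output above.
res_price <- upper_tail_t_test(xbar = 163.8621, s = 67.65156, n = 1047, mu0 = 150)
res_price$t
res_price$p.value
```

With the raw data loaded, `t.test(houseprice$Price/1000, mu = 150, alternative = "greater")` should run the same test in a single call.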

1b. He also claims that the average living area is more than 1800 Sq. Ft. Do you agree with this? (Use a 5% significance level for both.) Briefly explain what the p-values in these cases mean?

Claim: Average Living area is more than 1800 sq. Ft

Null Hypothesis: Average Living area is less than or equal to 1800 sq. Ft (mu <= 1800)

Alternate Hypothesis: Average Living area is greater than 1800 sq. ft (mu > 1800)

xbar = mean(houseprice$Living.Area)
s = sd(houseprice$Living.Area)
xbar

## [1] 1807.303

s

## [1] 641.4609

Here the p-value is the probability of observing a sample mean of 1807.3 or more when H0 is true (i.e. mu <= 1800), evaluated at the boundary value mu = 1800.

P(Xbar >= xbar ; mu <= 1800)

First we find P(Xbar >= 1807.3 ; mu <= 1800), again using the t statistic with n - 1 degrees of freedom:

t = (xbar - mu0)/(s/sqrt(n))

t = (xbar-1800)/(s/sqrt(n))
t

## [1] 0.3683755

t = 0.368 with df = 1046

pvalue = pt(0.368,1046,lower.tail = FALSE)
pvalue

## [1] 0.3564738

As the p-value (0.356) is greater than the alpha value of 0.05, we fail to reject the null hypothesis. The sample does not provide sufficient evidence that the average living area exceeds 1800 sq. ft., so we cannot agree with the claim.
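The same decision can be reached by comparing the test statistic with the upper-tail critical value instead of computing a p-value. A short sketch, using the rounded summary statistics printed above (the variable names `t_crit` and `reject_h0` are illustrative):

```r
# Rounded summary statistics from the output above.
xbar <- 1807.303
s <- 641.4609
n <- 1047
mu0 <- 1800
alpha <- 0.05

t_stat <- (xbar - mu0) / (s / sqrt(n))   # test statistic
t_crit <- qt(1 - alpha, df = n - 1)      # upper-tail critical value at 5%
reject_h0 <- t_stat > t_crit             # reject H0 only if t exceeds the critical value
reject_h0
```

Both approaches always agree: the p-value is below alpha exactly when the statistic exceeds the critical value.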

2. Are the home prices higher for houses with fireplaces as compared to those without?

a) Create side-by-side box plots of the house prices of the two groups and comment on them

houseprice_inK = houseprice
houseprice_inK$Price = houseprice_inK$Price/1000
boxplot(Price~Fireplace, data=houseprice_inK, xlab = "Fireplace Availability", ylab = "Houseprice in Thousands",names = c("Without Fireplace", "With Fireplace") ,main = "Fireplace availability vs Houseprices")

From the boxplot, it is visible that the median price of houses with a fireplace is higher than that of houses without one (a boxplot shows medians, not means). Also, there are fewer outliers among houses with fireplaces than among houses without.

b) Formulate an appropriate hypothesis and test it in order to check the above claim. Assume that the population standard deviations of house prices in the two groups are equal.

Claim: Average house price of the houses with a fireplace (WF) is higher than the average house price of the houses without a fireplace (WOF), i.e. Average houseprice WF - Average houseprice WOF > 0

Null Hypothesis: mu(WF) - mu(WOF) <= 0

Alternate Hypothesis: mu(WF) - mu(WOF) > 0

wfdata <- data.frame(subset(houseprice,Fireplace == 1))
wofdata <- data.frame(subset(houseprice, Fireplace == 0))
xbar_wf = mean(wfdata$Price)/1000
xbar_wof= mean(wofdata$Price)/1000
s_wf = sd(wfdata$Price)/1000
s_wof = sd(wofdata$Price)/1000
n_wf = nrow(wfdata)
n_wof = nrow(wofdata)

xbar_wf

## [1] 189.6378

xbar_wof

## [1] 126.2877

s_wf

## [1] 66.29643

s_wof

## [1] 49.66239

n_wf

## [1] 621

n_wof

## [1] 426

xbar = xbar_wf - xbar_wof
xbar

## [1] 63.35019

#P(xbar >= 63.35; (mu(wf) - mu(wof)) <= 0)

sp = sqrt((((n_wf-1)*s_wf*s_wf) + ((n_wof-1)*s_wof*s_wof))/(n_wf+n_wof-2))
sp

## [1] 60.08952

df = n_wf+n_wof-2
df

## [1] 1045

t = (xbar-0)/(sp*sqrt((1/n_wf)+(1/n_wof)))
t

## [1] 16.75816

t = 16.76 with df = 1045

pval = pt(16.76, 1045, lower.tail = FALSE)
pval

## [1] 2.532178e-56

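Since the p-value is far below the alpha value (0.05), we reject the null hypothesis. The data support the claim that the average price of houses with a fireplace is higher than the average price of houses without one. Note that the p-value is the probability of seeing a difference this large when H0 is true; it is not the probability of a Type 1 error, which is fixed in advance by alpha.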
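As a cross-check on the pooled-variance formula, the sketch below compares the manual calculation with R's built-in `t.test` using `var.equal = TRUE`. The data are simulated purely for illustration (the group sizes, means and standard deviation are assumptions, not values from houseprices.csv).

```r
set.seed(1)  # simulated data, for illustration only
with_fp    <- rnorm(60, mean = 190, sd = 60)   # hypothetical prices in $K
without_fp <- rnorm(40, mean = 130, sd = 60)   # hypothetical prices in $K

# Manual pooled-variance t statistic, same formula as in the text above.
n1 <- length(with_fp)
n2 <- length(without_fp)
sp <- sqrt(((n1 - 1) * var(with_fp) + (n2 - 1) * var(without_fp)) / (n1 + n2 - 2))
t_manual <- (mean(with_fp) - mean(without_fp)) / (sp * sqrt(1 / n1 + 1 / n2))

# Built-in equivalent: upper-tailed test assuming equal variances.
fit <- t.test(with_fp, without_fp, alternative = "greater", var.equal = TRUE)
all.equal(unname(fit$statistic), t_manual)
```

On the real data, the analogous call would be `t.test(wfdata$Price, wofdata$Price, alternative = "greater", var.equal = TRUE)`.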

3. Any house aged more than 30 years is considered an “old” house. Your friend claims that old houses have larger lot sizes than new houses. Do you agree? Explain. Use a significance level of 5% for your test. Historical data suggests that old houses include some very large and some very small lot sizes but new houses are more homogeneous in their lot sizes.

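Claim: Average lot size of old houses is larger than the average lot size of new houses.

Null Hypothesis: mu(oh) - mu(nh) <= 0

Alternate Hypothesis: mu(oh) - mu(nh) > 0

Because the historical data suggest that lot sizes vary more among old houses than among new ones, we do not assume equal population variances and use the unequal-variance (Welch) t-test.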

oldh_data = data.frame(subset(houseprice, Age>30))
newh_data = data.frame(subset(houseprice, Age<=30))

xbar_oh = mean(oldh_data$Lot.Size)
xbar_nh= mean(newh_data$Lot.Size)
s_oh = sd(oldh_data$Lot.Size)
s_nh = sd(newh_data$Lot.Size)
n_oh = nrow(oldh_data)
n_nh = nrow(newh_data)

xbar_oh

## [1] 0.5481788

xbar_nh

## [1] 0.578255

s_oh

## [1] 0.7249367

s_nh

## [1] 0.7986463

n_oh

## [1] 302

n_nh

## [1] 745

xbar = xbar_oh - xbar_nh
xbar

## [1] -0.03007623

#P(xbar >= -0.03; (mu(oh) - mu(nh)) <= 0)

f1 = (s_oh*s_oh)/n_oh
f2 =(s_nh*s_nh)/n_nh
f = f1+f2
t = (xbar-0)/(sqrt(f))
t

## [1] -0.5902598

df = (f*f)/(((f1*f1)/(n_oh-1))+((f2*f2)/(n_nh-1)))
df

## [1] 610.2757

t = -0.59 with df = 610

pval = pt(-0.59, 610, lower.tail = FALSE)
pval

## [1] 0.7222954

As the p-value (0.72) is greater than the alpha value of 0.05, we fail to reject the null hypothesis. The data do not support the claim that old houses have larger lot sizes than new houses; in this sample the mean lot size of old houses is in fact slightly smaller.
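The unequal-variance calculation above, including the Welch-Satterthwaite degrees of freedom, can be sketched as a single function of the summary statistics. `welch_upper_tail` is a hypothetical helper name, and the inputs are the rounded values printed above.

```r
# Hypothetical helper: upper-tailed Welch t-test from summary statistics.
welch_upper_tail <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                          # variance of the first sample mean
  v2 <- s2^2 / n2                          # variance of the second sample mean
  t_stat <- (m1 - m2) / sqrt(v1 + v2)      # test statistic
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  list(t = t_stat, df = df, p.value = pt(t_stat, df, lower.tail = FALSE))
}

# Old houses first, new houses second (lot sizes as printed above).
res_lot <- welch_upper_tail(0.5481788, 0.7249367, 302,
                            0.578255,  0.7986463, 745)
res_lot$t
res_lot$df
res_lot$p.value
```

With the raw data, `t.test(oldh_data$Lot.Size, newh_data$Lot.Size, alternative = "greater")` performs the Welch test by default, since `var.equal` defaults to `FALSE`.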

4. Based on the evidence available here, would you be willing to claim that fireplaces have become more fashionable? For simplicity, it is OK to compare only “new” houses and “old” houses. Use a significance level of 5% for your test.

Claim: A larger proportion of new houses than old houses have fireplaces.

Null Hypothesis: pi(nh) - pi(oh) <= 0

Alternate Hypothesis: pi(nh) - pi(oh) > 0

n_oh = nrow(oldh_data)
n_nh = nrow(newh_data)
a = nrow(subset(oldh_data, Fireplace == 1))
pi_oh = a/n_oh
b = nrow(subset(newh_data, Fireplace == 1))
pi_nh = b/n_nh

n_oh

## [1] 302

n_nh

## [1] 745

pi_oh

## [1] 0.4470199

pi_nh

## [1] 0.652349

P = pi_nh-pi_oh
P

## [1] 0.2053291

#P(P >= 0.205; (pi(nh) - pi(oh)) <= 0)

#Pooled sample proportion
pbar = ((pi_nh*n_nh)+(pi_oh*n_oh))/(n_nh+n_oh)
z = (pi_nh-pi_oh)/(sqrt(pbar*(1-pbar)*((1/n_nh)+(1/n_oh))))
#z evaluates to roughly 6.12, the value used below
pval = pnorm(6.12, lower.tail = FALSE)
pval

## [1] 4.678768e-10

Since the p-value is less than the alpha value (0.05), we reject the null hypothesis. The data support the claim that a larger proportion of new houses than old houses have fireplaces.
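The pooled z-test above can be cross-checked with base R's `prop.test`. This is a sketch under one assumption: the fireplace counts (486 of 745 new houses, 135 of 302 old houses) are back-computed from the proportions printed above rather than read from the data file.

```r
# Counts back-computed from the printed proportions (an assumption):
# 0.652349 * 745 = 486 new houses with a fireplace,
# 0.4470199 * 302 = 135 old houses with a fireplace.
fit_prop <- prop.test(x = c(486, 135), n = c(745, 302),
                      alternative = "greater", correct = FALSE)

# Without continuity correction, the reported X-squared equals z^2.
z_builtin <- sqrt(unname(fit_prop$statistic))
z_builtin
fit_prop$p.value
```

Setting `correct = FALSE` turns off the continuity correction so that the result matches the hand calculation; with the default `correct = TRUE` the statistic would be slightly smaller.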

5. Suppose that houses with 1-2 bedrooms are considered to be “Small Houses”, those with 3-4 are “Medium Houses” and 5-6 as “Big Houses”. Can we conclude that the prices of Small, Medium and Big houses are not the same, at 1% level of significance?

Claim: Mean prices of Small, Medium and Big houses are not all the same

Null Hypothesis: Mean prices of Small, Medium and Big houses are the same

Alternate Hypothesis: At least one of the mean prices of Small, Medium and Big houses differs from the others

Level of significance: 1% or 0.01

houseprice$Category[houseprice$Bedrooms==1 | houseprice$Bedrooms==2] <- 'Small'
houseprice$Category[houseprice$Bedrooms==3 | houseprice$Bedrooms==4] <- 'Medium'
houseprice$Category[houseprice$Bedrooms==5 | houseprice$Bedrooms==6] <- 'Big'

price_anova = aov(houseprice$Price~houseprice$Category)
summary(price_anova)

##                       Df    Sum Sq   Mean Sq F value Pr(>F)    
## houseprice$Category    2 4.840e+11 2.420e+11   58.71 <2e-16 ***
## Residuals           1044 4.303e+12 4.122e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

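Since the p-value (< 2e-16) is less than the alpha value (0.01), we reject the null hypothesis and conclude that the mean house prices of Small, Medium and Big houses are not all the same. The ANOVA F-test shows only that at least one group mean differs; it does not say which ones.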
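The three assignment statements that build the Category column could also be written as one call to `cut`. A minimal sketch on a hypothetical vector of bedroom counts (the vector `bedrooms` is made up for illustration):

```r
# Hypothetical bedroom counts, for illustration only.
bedrooms <- c(1, 2, 3, 4, 5, 6, 3, 2)

# Intervals are (0,2], (2,4], (4,6]: 1-2 Small, 3-4 Medium, 5-6 Big.
category <- cut(bedrooms, breaks = c(0, 2, 4, 6),
                labels = c("Small", "Medium", "Big"))
table(category)
```

As with the original assignments, any bedroom count outside 1 to 6 ends up as `NA` and is dropped by `aov`. The result is a factor, so it can be passed directly as the grouping variable.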

Statistics Assignment 2

Perumal

28 November 2016