Data_Dive7

Formulating Null Hypotheses:

1) “On average, house sales in Abilene during 2010 are equal to house sales in Abilene during 2009.”

- Alpha Level = 0.05 - The alpha level is the probability of making a Type 1 error, i.e. the probability of concluding that average house sales differ when there is actually no difference. A low alpha reduces the probability of such Type 1 errors, but it increases the probability of Type 2 errors, where we fail to reject the null hypothesis even though the means really do differ. Since we want to balance the two kinds of error, we chose 0.05, which is neither overly high nor overly low.
- Power Level = 0.8 - The power level is equal to 1 - Beta, i.e. 1 minus the probability of making a Type 2 error. A Type 2 error is the situation where we fail to reject the null hypothesis even though it is actually false. A power level of 0.8 means we want at least an 80% probability of correctly rejecting the null hypothesis when it is false.
- Minimum Effect Size = 1.5 - An effect size is a standardized measurement that compares the size of some statistical effect to a reference quantity, such as the variability of the statistic. The minimum effect size is the smallest difference or relationship between variables that we consider practically significant or meaningful. In our case, I have used 1.5 as the minimum effect size because our Cohen’s d value of -0.1123257 is negligibly small. I therefore chose a value that would correspond to at least a medium effect size.
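
The alpha and power definitions above can be checked by simulation. The sketch below is a minimal illustration and not part of the original analysis: the function name `rejection_rate`, the normal data, and the true difference of 1.5 standard deviations are all assumptions chosen for demonstration. Only the group size of 12 mirrors the number of monthly observations per year.

```r
# Simulate many two-sample t-tests and record how often H0 is rejected.
# All settings here are hypothetical and only for illustration.
rejection_rate <- function(true_diff, n = 12, sd = 1, alpha = 0.05, reps = 2000) {
  rejected <- replicate(reps, {
    x <- rnorm(n, mean = 0, sd = sd)
    y <- rnorm(n, mean = true_diff, sd = sd)
    t.test(x, y)$p.value < alpha
  })
  mean(rejected)
}

set.seed(1)
type1_rate <- rejection_rate(true_diff = 0)    # H0 is true: share of false rejections
power_est  <- rejection_rate(true_diff = 1.5)  # H0 is false: share of correct rejections

type1_rate
power_est
```

When the null hypothesis is true, the rejection rate should land close to alpha. When it is false, the rejection rate is an estimate of power, and 1 minus that value estimates the Type 2 error rate.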

## 
## Cohen's d
## 
## d estimate: -0.1123257 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## -0.9596482  0.7349968

## [1] 136.1667

## [1] 132.5
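
For reference, Cohen’s d is commonly defined as the difference between two group means divided by their pooled standard deviation. The sketch below implements that definition in base R. The function name `cohens_d` and the toy vectors are hypothetical, and the code that produced the output above may use a different function or a different pooling convention.

```r
# Cohen's d for two independent samples, using the pooled standard deviation.
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy vectors (not the Abilene data): means 2 and 3, pooled sd 1
d_toy <- cohens_d(c(1, 2, 3), c(2, 3, 4))
d_toy  # -1
```

The sign only reflects which group is passed first, so a negative d means the first group has the smaller mean.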

Sample Size Analysis:

As seen below, we have only 12 data points in each of the 2010 and 2009 subsets of our data, which I believe is too few.

test <- pwrss.t.2means(mu1 = 1.5, 
                       sd1 = sd(filter(abeline_df, year == 2009) |> pluck("sales")),
                       kappa = 1,
                       power = .8, alpha = 0.05, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 6023 
##   n2 = 6023 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 12044 
##  Non-centrality parameter = 2.802 
##  Type I error rate = 0.05 
##  Type II error rate = 0.2

As seen above, n1 = n2 = 6023 observations per group are required to detect our minimum effect size with 80% power at alpha = 0.05. This is likely because a difference of 1.5 (the value passed as mu1) is very small relative to the standard deviation of sales, so far more data points than our 12 per year are needed before such a difference can be detected reliably.
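
To see how strongly the required sample size depends on the size of the difference relative to the spread of the data, the sketch below uses base R’s `power.t.test()` instead of `pwrss`. The helper name `required_n` and the standard deviation of 30 are assumptions for illustration. The value 30 is not taken from the Abilene data.

```r
# Required sample size per group for a two-sided, two-sample t-test.
# `required_n` is a hypothetical helper wrapping stats::power.t.test().
required_n <- function(delta, sd, power = 0.8, alpha = 0.05) {
  res <- power.t.test(delta = delta, sd = sd, sig.level = alpha, power = power,
                      type = "two.sample", alternative = "two.sided")
  ceiling(res$n)
}

n_medium <- required_n(delta = 0.5, sd = 1)   # standardized effect of 0.5
n_tiny   <- required_n(delta = 1.5, sd = 30)  # standardized effect of 0.05 (assumed sd)

n_medium
n_tiny
```

The required n grows roughly with the inverse square of the standardized effect (delta / sd), so shrinking the effect by a factor of 10 raises the required sample size by a factor of about 100.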

Therefore, we cannot adequately conduct Fisher’s significance test with the data available.

Fisher’s Significance Test

test <- pwrss.t.2means(mu1 = 1.5, 
                       sd1 = sd(pluck(abeline_df, "sales")),
                       kappa = 1,
                       power = .8, alpha = 0.05, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.8 
##   n1 = 11171 
##   n2 = 11171 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 22340 
##  Non-centrality parameter = 2.802 
##  Type I error rate = 0.05 
##  Type II error rate = 0.2

plot(test)

## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'

## Hypothesis Test 2:

2) “The average number of listings in December is equal to the average number of listings in June”

Applying Fisher’s Significance Testing Framework

We can restate our null hypothesis as “Mean listings in December - mean listings in June = 0”. Because the null hypothesis concerns the difference between the means of two distributions, the appropriate test is a two-sample t-test.

Let us first visualise the difference in means:

# Keep only the 2010 rows for June (month 6) and December (month 12)
df_dec_jun <- txhousing[txhousing$year == 2010 & txhousing$month %in% c(6, 12), ]
df_dec_jun

## # A tibble: 92 × 9
##    city       year month sales    volume median listings inventory  date
##    <chr>     <int> <int> <dbl>     <dbl>  <dbl>    <dbl>     <dbl> <dbl>
##  1 Abilene    2010     6   169  23216943 127900      932       6.7 2010.
##  2 Abilene    2010    12   116  15289470 118300      830       6.3 2011.
##  3 Amarillo   2010     6   272  42959136 139400     1449       6.1 2010.
##  4 Amarillo   2010    12   185  24975000 118900     1381       6.5 2011.
##  5 Arlington  2010     6   367  58389192 134600     2221       5.9 2010.
##  6 Arlington  2010    12   285  41382701 130000     1821       5.6 2011.
##  7 Austin     2010     6  2190 584250558 200500    13353       7.2 2010.
##  8 Austin     2010    12  1561 384045548 191200     9284       5.6 2011.
##  9 Bay Area   2010     6   520 101230779 162100     4627      10.2 2010.
## 10 Bay Area   2010    12   396  76002315 158200     3938       9.6 2011.
## # ℹ 82 more rows

df_dec_jun |>
  ggplot() +
  geom_boxplot(mapping = 
                 aes(x = listings, 
                     y = factor(month, levels = c(6, 12),
                                labels = c("June", "December")))) +
  labs(title = "Listings in June vs. December (2010)",
       x = "Listings (# of Properties)",
       y = "Month") +
  theme_minimal()

## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

As seen above, the centres of the two boxplots (which mark the medians) appear to be very similar, as do the overall distributions of listings in the two months. However, we need to explore this further by calculating a p-value before we can draw any conclusions.
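
One way to obtain such a p-value is a Welch two-sample t-test with base R’s `t.test()`. The sketch below uses only the five city pairs visible in the tibble preview above, so its result is illustrative and is not the result for the full 92-row data set. The vector names are hypothetical.

```r
# Listings for the five cities shown in the preview above
# (Abilene, Amarillo, Arlington, Austin, Bay Area); a subset for illustration only.
june_listings     <- c(932, 1449, 2221, 13353, 4627)
december_listings <- c(830, 1381, 1821, 9284, 3938)

# Welch two-sample t-test (t.test defaults: var.equal = FALSE, two-sided)
result <- t.test(december_listings, june_listings)

result$estimate  # the two sample means
result$p.value   # compare against the chosen alpha level
```

If the p-value is below the chosen alpha level, the null hypothesis of equal means is rejected. Otherwise we fail to reject it.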

Shresht Venkatraman

2024-03-01

Loading our Dataset