Sampling: Estimating the Demand for Hybrid Cars

A. Sample Design:

The Excel file sampling - hybrid cars.xlsx allows you to draw samples according to different sample designs. The stratification variable is income (five categories, each representing a stratum). Note: You can do this in Excel, or you can save each draw and create an R script to process the file.

1) Proportionate sampling:

Draw random samples of different sizes (100, 500, 1000) using a proportionate sampling design (enter sample sizes that are proportional to the population sizes and vary the total sample size). For the purchase intentions, calculate and report the overall sample mean, sample variance and variance of the mean. Can you assume a simple random sample was drawn or do you need to reweight? Interpret the estimates and explain the pattern of variation across different sample sizes (what happens to the estimates when the sample size increases?).

A proportionate sample used in conjunction with stratification does not require us to reweight the sample in order for it to be representative of the population. Proportionate samples are defined as samples in which each stratum's share of the sample equals its share of the population, so the sample is self-weighting and can be analyzed like a simple random sample. The table shows estimates for the sample mean, sample variance, and variance of the mean for sample sizes totaling 100, 500, and 1000 respectively. As expected, when the sample size increases, the variance of the mean decreases (it shrinks roughly in proportion to 1/n) and the sample mean becomes a more precise estimate of the population mean. The sample variance also decreased across our draws; note, however, that it estimates the population variance, so with larger samples it should stabilize around that value rather than shrink systematically.
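The pattern across sample sizes can be illustrated with a small simulation. This is a minimal sketch and does not use the workbook's data: the stratum shares, stratum means, and within-stratum standard deviation are hypothetical, and `draw_proportionate` is an illustrative helper name.

```r
# Hypothetical population: five income strata (shares, means, and sd are assumptions).
set.seed(1)
shares <- c(0.30, 0.25, 0.20, 0.15, 0.10)   # stratum shares N_h / N
mu     <- c(2, 3, 4, 5, 6)                  # stratum mean purchase intention
sigma  <- 1.5                               # within-stratum standard deviation

# Draw a proportionate stratified sample of total size n and summarise it.
draw_proportionate <- function(n) {
  nh <- round(n * shares)                   # proportionate allocation
  y  <- unlist(lapply(seq_along(nh), function(h) rnorm(nh[h], mu[h], sigma)))
  # The sample is self-weighting, so plain sample statistics are used.
  # var(y) / n is the simple-random-sample formula for the variance of the mean.
  c(n = length(y), mean = mean(y), var = var(y), var_of_mean = var(y) / length(y))
}

results <- t(sapply(c(100, 500, 1000), draw_proportionate))
print(round(results, 4))
```

In this sketch the sample variance hovers around a fixed value at every sample size, while the variance of the mean falls as n grows.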

2) Constant stratified samples:

Draw a sample of size 1000 using constant sample sizes (i.e., 200 per stratum). This sample is not representative of the population. Calculate the sample mean and variance of the mean. Comment on how and why the results are different from those for proportionate sampling (n=1000).

In using a constant stratified sampling method, we are working with samples that are disproportionate (i.e., the sample sizes are not proportional to the stratum sizes). Thus, this approach requires that we reweight the sample in order to ensure that it is representative of the population. In doing so, we find that the parameter estimates (sample mean, sample variance, and variance of the mean) are slightly different. For example, the variance of the mean decreases, which indicates that our estimate of the true population mean is more reliable. This can be expected because we are computing the variance of the mean in each stratum, where each stratum exhibits more homogeneity when compared to the whole. Thus, computing variance of the mean on a per stratum basis and pooling results brings down total variability substantially. While the sample mean remains approximately the same (as expected), the sample variance increases slightly.
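The reweighting described above can be sketched as the standard stratified estimator: the mean is the share-weighted sum of stratum means, and its variance is the sum of W_h^2 * s_h^2 / n_h. This is a minimal sketch under stated assumptions: the toy data and the helper name `stratified_mean` are hypothetical, and the finite-population correction is ignored.

```r
# Stratified estimator of the mean and its variance (finite-population correction ignored).
# y: responses; stratum: stratum label per response; W: named population shares N_h / N.
stratified_mean <- function(y, stratum, W) {
  ybar_h <- tapply(y, stratum, mean)        # stratum means
  s2_h   <- tapply(y, stratum, var)         # stratum variances
  n_h    <- tapply(y, stratum, length)      # stratum sample sizes
  W      <- W[names(ybar_h)]                # align shares with the strata order
  c(mean = sum(W * ybar_h),
    var_of_mean = sum(W^2 * s2_h / n_h))
}

# Toy data: two strata sampled equally although stratum "a" is 80% of the population.
y       <- c(1, 2, 3, 4, 7, 8, 9, 10)
stratum <- rep(c("a", "b"), each = 4)
W       <- c(a = 0.8, b = 0.2)

est <- stratified_mean(y, stratum, W)
print(est)
print(mean(y))   # unweighted mean, pulled toward the over-sampled stratum "b"
```

The weighted mean sits closer to stratum "a" than the unweighted mean does, which is the correction a constant (equal-size) design needs.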

3) Optimal stratified samples:

In a pre-study, estimates of the stratum variance of purchase intentions for the HPV were obtained:

Compute optimal sample sizes for the five strata and draw a sample of size 1000 according to this design (see lecture notes on how to compute these values). Calculate the sample mean and variance of the mean and compare the results to those of proportionate and constant sampling. What is your conclusion?

The optimal sample sizes for each of the strata are provided above. When computing the variance of the mean using optimal stratified samples, we can expect the estimate to be lower than that of proportionate sampling, but similar to the result obtained with constant stratified sampling. Again, the primary reason is that we are reweighting the sample based on the proportion of variance contributed by each stratum. Effectively, we are overweighting strata that represent a smaller proportion of the overall population size and underweighting strata that represent a larger proportion of the overall population size. As expected, the variance of the mean is very similar to the result obtained using constant stratified sampling, although it is slightly larger. The sample mean is the lowest of the three sampling methodologies, and the sample variance increases slightly over the previous estimate (continuing its upward trend).
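The lecture-note formula is not reproduced here, so the following sketch assumes the usual Neyman allocation, in which each stratum's sample size is proportional to N_h * S_h (stratum size times stratum standard deviation). The stratum sizes and standard deviations are hypothetical, not the pre-study values, and `neyman_allocation` is an illustrative name.

```r
# Neyman ("optimal") allocation: n_h proportional to N_h * S_h.
# n: total sample size; N_h: stratum population sizes; S_h: stratum standard deviations.
neyman_allocation <- function(n, N_h, S_h) {
  n * (N_h * S_h) / sum(N_h * S_h)
}

# Hypothetical stratum sizes and standard deviations (not the assignment's values).
# If the pre-study reports variances, take their square roots first.
N_h <- c(4000, 3000, 1500, 1000, 500)
S_h <- c(1, 2, 4, 2, 4)

n_h <- neyman_allocation(1000, N_h, S_h)
print(n_h)   # 200 300 300 100 100
```

In practice the allocations would be rounded to whole numbers that still sum to the total sample size.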

B. Model Based Approach to Sampling

The file hybridcars.RData contains a sample of 5000 households, drawn using a random sample design. The columns of the data object consist of id, purchase intention, income, commute distance (in miles), household size, and number of earners per household. The purchase intentions are now binary, collected using the question: “Do you intend to purchase a Hybrid Prototype Vehicle in the next 12 months?”

The population distribution of income levels is as follows:

Compare the income distribution in the sample to that in the population. Is the sample representative of the population? If not, you need to take this into account in your analyses.

The purchase intentions in hybridcars.RData are discrete 0/1 variables indicating whether a subject intends to purchase the HPV or not. In R, using GLM, use the appropriate family to relate the binary purchase intention variable to (a) income, (b) commute distance (in miles), (c) household size, and (d) number of earners per household. (You can transform the commute variable so its effect is nonlinear.) What is your conclusion on the relationship between the purchase intentions for Hybrid Prototype Vehicles and these demographic variables?

Make sure to include appropriate weights into your analysis so your results reflect the population income distribution. Also estimate your model assuming a representative sample. Does it make a difference?

dir()

##  [1] "HW-Sampling-Hybrid-Cars .pdf" "hybridcars.RData"            
##  [3] "image.png"                    "img2.png"                    
##  [5] "part1-3.numbers"              "sampling - hybrid cars.xlsx" 
##  [7] "sampling_hw.Rproj"            "SAMPLINGHW.html"             
##  [9] "SAMPLINGHW.Rmd"               "table.png"                   
## [11] "table1.png"                   "table2.png"

load("hybridcars.RData")
head(data)

##   id y      X1 X2 X3 X4
## 1  1 0 12367.5 31  3  2
## 2  2 0 11222.5 39  5  1
## 3  3 0 13254.5 27  2  2
## 4  4 1 11994.5 25  1  0
## 5  5 0 13222.5 20  3  2
## 6  6 0 11426.5 29  2  2

library(dplyr)

#id, purchase intention, income, commute distance, household size, number of earners per household
hybrid = rename(data, purchase_int = y, income = X1, commute_dist = X2, house_size = X3, no_earners = X4)
#head(hybrid)
attach(hybrid)

#INCOME DIST in POP:  28.22%,  26.65%,  18.27%,   10.93%,   15.73%
#INCOME DIST in S:    30.32%,  20.22%,  11.00%,   14.00%,   24.46%

#difference in population proportions vs. sample, need to reweight
n = 5000
popsize=1000000
trueIncomeDistr = c(.2822,.2665,.1827,.1093,.1573)
sampleIncomeDistr = c(.3032, .2022, .11, .14, .2446)
Nh=popsize*trueIncomeDistr
nh=n*sampleIncomeDistr
Nh

## [1] 282200 266500 182700 109300 157300

nh

## [1] 1516 1011  550  700 1223

# Using cut
hybrid$income_dist <- cut(hybrid$income, 
                       breaks = c(0, 25000, 50000, 75000, 100000, 150000), 
                       labels = c("1", "2", "3", "4", "5"), 
                       right = TRUE)

# Inclusion probability n_h / N_h of each household's income stratum
# (vectorized lookup by the stratum's integer code, 1 to 5)
inclProb = (nh/Nh)[as.integer(hybrid$income_dist)]

#inclProb

#weights
w = n*(1/inclProb)/sum(1/inclProb)

#head(hybrid)

# reweighted (PML)
res=glm(purchase_int ~ income+log(commute_dist)+house_size+no_earners,data=hybrid,family=binomial(logit),weights=w)
summary(res)

## 
## Call:
## glm(formula = purchase_int ~ income + log(commute_dist) + house_size + 
##     no_earners, family = binomial(logit), data = hybrid, weights = w)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6724  -0.9785  -0.7041   0.9004   2.1598  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -3.885e+00  5.015e-01  -7.747 9.39e-15 ***
## income             1.531e-05  8.000e-07  19.140  < 2e-16 ***
## log(commute_dist)  6.013e-01  1.519e-01   3.958 7.56e-05 ***
## house_size        -6.997e-04  3.087e-02  -0.023    0.982    
## no_earners         4.153e-01  5.169e-02   8.034 9.42e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6660.4  on 4999  degrees of freedom
## Residual deviance: 6176.8  on 4995  degrees of freedom
## AIC: 6843.8
## 
## Number of Fisher Scoring iterations: 4

# no weighting
res_noweight=glm(purchase_int ~ income+log(commute_dist)+house_size+no_earners,data=hybrid,family=binomial(logit))
summary(res_noweight)

## 
## Call:
## glm(formula = purchase_int ~ income + log(commute_dist) + house_size + 
##     no_earners, family = binomial(logit), data = hybrid)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7542  -0.9330  -0.7177   1.0433   2.1506  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -3.646e+00  5.012e-01  -7.275 3.46e-13 ***
## income             1.530e-05  7.110e-07  21.516  < 2e-16 ***
## log(commute_dist)  4.353e-01  1.519e-01   2.865  0.00417 ** 
## house_size         1.299e-01  3.061e-02   4.245 2.18e-05 ***
## no_earners         3.767e-01  5.175e-02   7.279 3.36e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6753.7  on 4999  degrees of freedom
## Residual deviance: 6132.0  on 4995  degrees of freedom
## AIC: 6142
## 
## Number of Fisher Scoring iterations: 4

Interpretation of Results

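Comparing the population income distribution (28.22%, 26.65%, 18.27%, 10.93%, 15.73%) with the sample income distribution (30.32%, 20.22%, 11.00%, 14.00%, 24.46%), we can clearly see that the sample is not representative of the population. Thus, we need to reweight each observation by the inverse of its stratum's inclusion probability (n_h / N_h) so that the estimates reflect the population income distribution.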
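The visual comparison can be backed by a chi-square goodness-of-fit test. This is a minimal sketch using the stratum counts and population shares reported in the code above; the test is an addition for illustration and was not part of the original analysis.

```r
# Stratum counts observed in the sample and population shares, as reported above.
nh <- c(1516, 1011, 550, 700, 1223)
trueIncomeDistr <- c(.2822, .2665, .1827, .1093, .1573)

# Side-by-side shares: sample vs. population.
print(round(rbind(sample = nh / sum(nh), population = trueIncomeDistr), 4))

# Goodness-of-fit test of the sample counts against the population shares.
# The reported shares sum to 0.998 rather than 1, so rescale.p = TRUE is needed.
gof <- chisq.test(x = nh, p = trueIncomeDistr, rescale.p = TRUE)
print(gof)
```

A very small p-value here indicates that the sample's income distribution differs from the population's by more than sampling noise would explain.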

Reweighted Interpretation

Each exponentiated coefficient is an odds ratio: the multiplicative change in the odds (not the probability) of intending to purchase a hybrid vehicle for a one-unit increase in the corresponding predictor variable, holding all other predictor variables constant. Expressed as a percentage, the change in the odds is 100 × (exp(β) − 1). Thus, the following interpretations hold (please note that the predictor variable that corresponds to commuting distance has been log transformed, so its coefficient is read in terms of percentage changes in distance).

A one unit increase in income yields a 0.001531% increase in the odds of a consumer purchasing a hybrid vehicle.
A one percent increase in the commuting distance of a consumer increases the odds of that consumer purchasing a hybrid by about 0.60% (1.01^0.6013 ≈ 1.006).
The coefficient estimate corresponding to the “house_size” predictor variable is not statistically significant (p = 0.982).
A one unit increase in the number of earners per household yields a 51.48% increase in the odds of a consumer purchasing a hybrid vehicle.
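The arithmetic behind these percentages can be sketched as follows. The coefficients are copied from the weighted model summary above; the variable names `beta`, `pct_odds`, and `pct_commute_1pct` are illustrative.

```r
# Weighted-model coefficients copied from the summary output above.
beta <- c(income = 1.531e-05, log_commute = 6.013e-01,
          house_size = -6.997e-04, no_earners = 4.153e-01)

# Percent change in the odds for a one-unit increase in each predictor.
pct_odds <- 100 * (exp(beta) - 1)
print(round(pct_odds, 4))

# log(commute_dist): effect of a 1% longer commute on the odds.
pct_commute_1pct <- 100 * (1.01^beta[["log_commute"]] - 1)
print(round(pct_commute_1pct, 2))
```

These are changes in the odds. The corresponding change in probability depends on the household's baseline probability and is not constant across households.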

Assuming a Representative Sample

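If we estimated the model while incorrectly assuming that the sample is representative, income strata that were under- or oversampled (i.e., whose share of the sample differs from their share of the population) would have too little or too much influence on the estimates. Undersampled strata must be weighted up, and oversampled strata must be weighted down. Not taking these corrections into account yields the unweighted results above (see: “summary(res_noweight)”). It does make a difference: the income coefficient is essentially unchanged (1.530e-05 vs. 1.531e-05), but the log(commute_dist) coefficient is 0.435 instead of 0.601, and house_size appears positive and highly significant (0.130, p = 2.18e-05) even though it is indistinguishable from zero in the weighted model (p = 0.982).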
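The two fits can be compared term by term. This is a minimal sketch that only re-tabulates the estimates printed in the two summaries above; expressing the gap in units of the unweighted standard error is an illustrative choice, not a formal test.

```r
# Estimates and standard errors copied from the two summaries above.
terms_     <- c("(Intercept)", "income", "log(commute_dist)", "house_size", "no_earners")
weighted   <- c(-3.885, 1.531e-05, 0.6013, -6.997e-04, 0.4153)
unweighted <- c(-3.646, 1.530e-05, 0.4353,  0.1299,    0.3767)
se_unw     <- c(0.5012, 7.110e-07, 0.1519,  0.03061,   0.05175)

# Gap between the two estimates, in units of the unweighted standard error.
comparison <- data.frame(term = terms_, weighted, unweighted,
                         diff_in_se = (weighted - unweighted) / se_unw)
print(comparison, digits = 3)
```

The house_size row shows by far the largest gap, while the income row is nearly identical across the two fits.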

Conclusion

What is your conclusion on the relationship between the purchase intentions for the Hybrid Prototype Vehicles and these demographic variables?

Based on the interpretation of results included in the “Reweighted Interpretation” section, the number of earners per household has the largest per-unit effect on the odds that a consumer will purchase a Hybrid Prototype Vehicle. Likewise, the distance that a potential consumer commutes to work also has a noteworthy, albeit much smaller, effect on the odds that a consumer will purchase an HPV. Based on these results, auto manufacturers should target their marketing efforts toward households containing more than one earner, where earners commute a substantial distance to work each day (i.e., a commute from the suburbs to a larger metropolitan area or city). Income has the smallest per-unit coefficient, but this reflects its scale (a single unit of income) rather than a weak relationship: it is the most statistically significant predictor (z = 19.14), and a 10,000-unit increase in income multiplies the odds by about 1.17 (exp(0.1531)). Finally, house size does not have a statistically significant relationship with purchase intent.