Become familiar with and implement different sampling plans, and analyze categorical data to estimate the demand for hybrid cars.
You are hired as a consultant to estimate the overall demand (purchase intentions) for an enhanced hybrid car in the US, a hybrid prototype vehicle, which reaches a mileage of 55 MPG (as compared to Prius’ 46 MPG). You will design and execute sampling plans and assess the relationship between purchase intentions for the hybrid prototype vehicle and household characteristics.
The Excel file sampling - hybrid cars.xlsx allows you to draw samples according to different sample designs. The stratification variable is income (five categories, each representing a stratum). Note: You can do the calculations in Excel, or you can save each draw and create an R script to process the file.
Draw random samples of different sizes (100, 500, 1000) using a proportionate sampling design (enter sample sizes that are proportional to the population sizes and vary the total sample size). For the purchase intentions, calculate and report the overall sample mean, sample variance and variance of the mean. Can you assume a simple random sample was drawn or do you need to reweight? Interpret the estimates and explain the pattern of variation across different sample sizes (what happens to the estimates when the sample size increases?).
A proportionate sample used in conjunction with stratification does not require us to reweight the sample for it to be representative of the population. Proportionate samples are defined as samples that are proportionate to the stratum sizes (i.e., each stratum makes up the same percentage of the sample as it does of the population). The table shows estimates for the sample mean, sample variance, and variance of the mean for sample sizes totaling 100, 500, and 1000 respectively. As expected, when the sample size increases, the variance of the mean decreases (it shrinks roughly in proportion to 1/n), so the sample mean becomes a more precise estimate of the population mean. The sample variance also decreased across our draws, although in general it estimates the population variance and should stabilize, rather than keep shrinking, as n grows.
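The pattern across sample sizes can be reproduced with a short simulation. This is only a sketch: the population is simulated rather than read from the workbook, the stratum shares are borrowed from the income distribution listed later in this report and assumed to apply here, and the names `stratum`, `intent`, and `proportionate_draw` are hypothetical. The variance of the mean uses the simple-random-sample formula s²/n, in line with treating a proportionate sample as self-weighting.

```r
# Sketch only: simulated population, not the workbook's data.
set.seed(1)
shares <- c(0.2822, 0.2665, 0.1827, 0.1093, 0.1573)  # assumed stratum shares
shares <- shares / sum(shares)                        # listed shares sum to 0.998
N <- 100000
stratum <- sample(1:5, N, replace = TRUE, prob = shares)
# Hypothetical purchase intentions whose mean rises with the income stratum
intent <- rnorm(N, mean = 2 + 0.5 * stratum, sd = 1)

proportionate_draw <- function(n) {
  n_h <- round(n * shares)                  # proportionate allocation per stratum
  idx <- unlist(lapply(1:5, function(h) {
    sample(which(stratum == h), n_h[h])     # simple random sample within stratum h
  }))
  y <- intent[idx]
  c(n = length(y),
    mean = mean(y),
    var = var(y),
    var_of_mean = var(y) / length(y))       # SRS formula, no reweighting
}

results <- t(sapply(c(100, 500, 1000), proportionate_draw))
print(round(results, 4))
```

Because of rounding, the realized total can differ from the target by one or two observations, which is why the function reports the realized n.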
Draw a sample of size 1000 using constant sample sizes (i.e., 200 per stratum). This sample is not representative of the population. Calculate the sample mean and variance of the mean. Comment on how and why the results are different from those for proportionate sampling (n=1000).
With constant stratified sampling, the sample is disproportionate (i.e., the stratum sample sizes are not proportional to the stratum sizes in the population). This approach therefore requires that we reweight the sample to make it representative of the population. After reweighting, we find that the estimates (sample mean, sample variance, and variance of the mean) are slightly different from those under proportionate sampling. For example, the variance of the mean decreases, which indicates that our estimate of the true population mean is more reliable. This can be expected because we compute the variance of the mean within each stratum, and each stratum is more homogeneous than the population as a whole. Computing the variance of the mean per stratum and combining the results with the population stratum weights therefore brings down total variability substantially. While the sample mean remains approximately the same (as expected), the sample variance increases slightly.
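The reweighting described above can be sketched as follows. The stratum samples are simulated and purely hypothetical; the stratum weights `W` are the population income shares listed later in this report, assumed to apply here, and the finite population correction is ignored.

```r
# Sketch only: simulated stratum samples with 200 observations each.
set.seed(2)
W <- c(0.2822, 0.2665, 0.1827, 0.1093, 0.1573)  # assumed population shares
W <- W / sum(W)                                   # listed shares sum to 0.998
n_h <- rep(200, 5)                                # constant allocation

# Hypothetical purchase intentions per stratum
samples <- lapply(1:5, function(h) rnorm(n_h[h], mean = 2 + 0.5 * h, sd = 1))
ybar_h <- sapply(samples, mean)   # stratum means
s2_h   <- sapply(samples, var)    # stratum variances

mean_unweighted <- mean(unlist(samples))   # ignores the design
mean_st     <- sum(W * ybar_h)             # reweighted (stratified) mean
var_mean_st <- sum(W^2 * s2_h / n_h)       # variance of the stratified mean

print(c(unweighted = mean_unweighted,
        stratified = mean_st,
        var_of_mean = var_mean_st))
```

In this simulated setup the unweighted mean is pulled toward the over-sampled high-income strata, while the reweighted mean targets the population mean.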
In a pre-study, estimates of the stratum variance of purchase intentions for the HPV were obtained:
Compute optimal sample sizes for the five strata and draw a sample of size 1000 according to this design (see lecture notes on how to compute these values). Calculate the sample mean and variance of the mean and compare the results to those of proportionate and constant sampling. What is your conclusion?
The optimal sample sizes for each of the strata are provided above. When computing the variance of the mean using optimal stratified samples, we can expect the estimate to be lower than that of proportionate sampling, but similar to the result obtained with constant stratified sampling. Again, the primary reason is that we reweight the sample, and the allocation takes each stratum's share of the overall variance into account. Effectively, the allocation gives more sample to strata that represent a smaller proportion of the overall population and less sample to strata that represent a larger proportion of it. As expected, the variance of the mean is very similar to the result obtained using constant stratified sampling, although it is slightly larger. The sample mean is the lowest of the three sampling designs, and the sample variance increases slightly over the previous estimate (continuing its upward trend).
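For reference, a common way to compute optimal stratum sizes is Neyman allocation, n_h = n · W_h · S_h / Σ W_k · S_k, which may or may not match the lecture-note formula exactly. The sketch below uses hypothetical pre-study variances (`s2_pre`), since the pre-study table is not reproduced here, and again assumes the population income shares listed later in this report.

```r
# Sketch only: s2_pre holds made-up pre-study variances, not the assignment's values.
W <- c(0.2822, 0.2665, 0.1827, 0.1093, 0.1573)  # assumed population shares
W <- W / sum(W)                                   # listed shares sum to 0.998
s2_pre <- c(1.0, 1.5, 2.0, 2.5, 4.0)              # hypothetical stratum variances
n <- 1000

# Neyman allocation: sample more where W_h * S_h is large
neyman <- n * W * sqrt(s2_pre) / sum(W * sqrt(s2_pre))
n_h_opt <- round(neyman)

# Design variance of the stratified mean (no finite population correction)
var_mean <- function(n_h) sum(W^2 * s2_pre / n_h)

v_prop  <- var_mean(n * W)          # proportionate allocation
v_const <- var_mean(rep(n / 5, 5))  # constant allocation (200 per stratum)
v_opt   <- var_mean(neyman)         # optimal allocation

print(n_h_opt)
print(c(proportionate = v_prop, constant = v_const, optimal = v_opt))
```

Under the assumed variances the optimal design has, by construction, a variance of the mean no larger than the other two. A single realized sample can deviate from this ordering, because the pre-study variances are themselves estimates and each draw is random.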
The file hybridcars.RData contains a sample of 5000 households, drawn using a random sample design. The columns of the data object consist of id, purchase intention, income, commute distance (in miles), household size, and number of earners per household. The purchase intentions are now binary, collected using the question: “Do you intend to purchase a Hybrid Prototype Vehicle in the next 12 months?”
The population distribution of income levels is as follows:
Compare the income distribution in the sample to that in the population. Is the sample representative of the population? If not, you need to take this into account in your analyses.
The purchase intentions in hybridcars.RData are discrete 0/1 variables indicating whether a subject intends to purchase the HPV or not. In R, using GLM, use the appropriate family to relate the binary purchase intention variable to (a) income, (b) commute distance (in miles), (c) household size, and (d) number of earners per household. (You can transform the commute variable so its effect is nonlinear.) What is your conclusion on the relationship between the purchase intentions for Hybrid Prototype Vehicles and these demographic variables?
Make sure to include appropriate weights in your analysis so your results reflect the population income distribution. Also estimate your model assuming a representative sample. Does it make a difference?
dir()
## [1] "HW-Sampling-Hybrid-Cars .pdf" "hybridcars.RData"
## [3] "image.png" "img2.png"
## [5] "part1-3.numbers" "sampling - hybrid cars.xlsx"
## [7] "sampling_hw.Rproj" "SAMPLINGHW.html"
## [9] "SAMPLINGHW.Rmd" "table.png"
## [11] "table1.png" "table2.png"
load("hybridcars.RData")
head(data)
##   id y      X1 X2 X3 X4
## 1  1 0 12367.5 31  3  2
## 2  2 0 11222.5 39  5  1
## 3  3 0 13254.5 27  2  2
## 4  4 1 11994.5 25  1  0
## 5  5 0 13222.5 20  3  2
## 6  6 0 11426.5 29  2  2
library(dplyr)
#id, purchase intention, income, commute distance, household size, number of earners per household
hybrid = rename(data, purchase_int = y, income = X1, commute_dist = X2, house_size = X3, no_earners = X4)
#head(hybrid)
# attach(hybrid) is not needed: every model below passes data = hybrid explicitly
# Income distribution in the population: 28.22%, 26.65%, 18.27%, 10.93%, 15.73%
# Income distribution in the sample:     30.32%, 20.22%, 11.00%, 14.00%, 24.46%
# The sample proportions differ from the population proportions, so we need to reweight
n = 5000
popsize=1000000
trueIncomeDistr = c(.2822,.2665,.1827,.1093,.1573)
sampleIncomeDistr = c(.3032, .2022, .11, .14, .2446)
Nh=popsize*trueIncomeDistr
nh=n*sampleIncomeDistr
Nh
## [1] 282200 266500 182700 109300 157300
nh
## [1] 1516 1011 550 700 1223
# Using cut
hybrid$income_dist <- cut(hybrid$income,
                          breaks = c(0, 25000, 50000, 75000, 100000, 150000),
                          labels = c("1", "2", "3", "4", "5"),
                          right = TRUE)
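# Inclusion probability of each household: n_h / N_h for its income stratum
# (vectorized lookup; the factor is converted to its integer stratum code)
inclProb = (nh/Nh)[as.integer(hybrid$income_dist)]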
#inclProb
#weights
w = n*(1/inclProb)/sum(1/inclProb)
#head(hybrid)
# reweighted (PML)
res=glm(purchase_int ~ income+log(commute_dist)+house_size+no_earners,data=hybrid,family=binomial(logit),weights=w)
summary(res)
##
## Call:
## glm(formula = purchase_int ~ income + log(commute_dist) + house_size +
## no_earners, family = binomial(logit), data = hybrid, weights = w)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6724 -0.9785 -0.7041 0.9004 2.1598
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.885e+00 5.015e-01 -7.747 9.39e-15 ***
## income 1.531e-05 8.000e-07 19.140 < 2e-16 ***
## log(commute_dist) 6.013e-01 1.519e-01 3.958 7.56e-05 ***
## house_size -6.997e-04 3.087e-02 -0.023 0.982
## no_earners 4.153e-01 5.169e-02 8.034 9.42e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6660.4 on 4999 degrees of freedom
## Residual deviance: 6176.8 on 4995 degrees of freedom
## AIC: 6843.8
##
## Number of Fisher Scoring iterations: 4
# no weighting
res_noweight=glm(purchase_int ~ income+log(commute_dist)+house_size+no_earners,data=hybrid,family=binomial(logit))
summary(res_noweight)
##
## Call:
## glm(formula = purchase_int ~ income + log(commute_dist) + house_size +
## no_earners, family = binomial(logit), data = hybrid)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7542 -0.9330 -0.7177 1.0433 2.1506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.646e+00 5.012e-01 -7.275 3.46e-13 ***
## income 1.530e-05 7.110e-07 21.516 < 2e-16 ***
## log(commute_dist) 4.353e-01 1.519e-01 2.865 0.00417 **
## house_size 1.299e-01 3.061e-02 4.245 2.18e-05 ***
## no_earners 3.767e-01 5.175e-02 7.279 3.36e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6753.7 on 4999 degrees of freedom
## Residual deviance: 6132.0 on 4995 degrees of freedom
## AIC: 6142
##
## Number of Fisher Scoring iterations: 4
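Comparing the population income distribution to the sample income distribution, we can clearly see that the sample is not representative of the population. The second and third income categories are under-represented (20.22% and 11.00% in the sample versus 26.65% and 18.27% in the population), while the fifth category is heavily over-represented (24.46% versus 15.73%). Thus, we need to reweight: each household is weighted by the inverse of its inclusion probability (n_h/N_h for its income stratum), normalized so that the weights sum to the sample size.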
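The visual comparison can be backed by a chi-square goodness-of-fit test. This is a sketch added for illustration, not part of the original analysis; it uses the stratum counts and population shares reported above. The listed population shares sum to 0.998 rather than 1, so `rescale.p = TRUE` is needed for `chisq.test` to accept them.

```r
# Stratum counts in the sample (nh) and population shares, as reported above
n_h_sample <- c(1516, 1011, 550, 700, 1223)
pop_share  <- c(0.2822, 0.2665, 0.1827, 0.1093, 0.1573)

# Goodness-of-fit test of the sample counts against the population shares
fit <- chisq.test(x = n_h_sample, p = pop_share, rescale.p = TRUE)
print(fit)

# Side-by-side shares for a quick visual check
print(round(rbind(sample     = n_h_sample / sum(n_h_sample),
                  population = pop_share / sum(pop_share)), 4))
```

A very small p-value indicates that the sample income distribution differs from the population distribution by more than sampling noise would explain.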
In a logistic regression, each exponentiated coefficient is an odds ratio: the multiplicative change in the odds (not the probability) of intending to purchase a hybrid vehicle for a one-unit increase in the corresponding predictor, holding all other predictors constant. The interpretations of the reweighted model follow from this. Please note that the commute distance variable has been log transformed, so its coefficient refers to a one-unit change in log distance, i.e., to a proportional rather than an absolute change in distance.
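As an illustration, the odds ratios can be computed directly from the coefficients of the reweighted model reported above. The estimates are copied from that summary; the $10,000 income step and the doubling of the commute distance are illustrative choices, not part of the original analysis.

```r
# Estimates copied from summary(res), the reweighted model
beta <- c(income      = 1.531e-05,
          log_commute = 6.013e-01,
          house_size  = -6.997e-04,
          no_earners  = 4.153e-01)

# Odds ratio for a one-unit increase in each predictor
or_unit <- exp(beta)

# Income is measured in dollars, so a one-dollar step is tiny;
# a $10,000 step is easier to read
or_income_10k <- exp(1e4 * beta["income"])

# Commute enters as log(distance): doubling the distance multiplies
# the odds by 2^beta
or_commute_double <- 2^beta["log_commute"]

print(round(or_unit, 5))
print(round(c(income_10k = unname(or_income_10k),
              commute_doubled = unname(or_commute_double)), 3))
```

Putting predictors on comparable steps in this way avoids reading the small per-dollar income coefficient as a small effect.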
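If we estimated our model by incorrectly assuming that the sample is representative, the results would be biased toward the income strata that were oversampled and away from those that were undersampled (i.e., strata whose share of the sample is not the same as their share of the population). If a stratum is undersampled, the necessary correction is to overweight its observations. If it is oversampled, the necessary correction is to underweight them. Not taking these corrections into account yields the unweighted results above (see: “summary(res_noweight)”). It does make a difference: without weights, household size appears to have a significant positive effect (0.1299, p = 2.18e-05), whereas in the reweighted model its effect is essentially zero (-0.0007, p = 0.982). The coefficient on log commute distance is also smaller without weights (0.4353 versus 0.6013), and the coefficient on number of earners changes from 0.3767 to 0.4153, while the income coefficient is practically unchanged (1.530e-05 versus 1.531e-05).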
What is your conclusion on the relationship between the purchase intentions for the Hybrid Prototype Vehicles and these demographic variables?
Based on the reweighted model, the number of earners per household has a large and highly significant effect: each additional earner multiplies the odds that a household intends to purchase a Hybrid Prototype Vehicle by about exp(0.4153) ≈ 1.51. The distance that a potential consumer commutes to work also has a noteworthy positive effect on the odds of purchasing an HPV. Based on these results, auto manufacturers should target their marketing efforts toward households containing more than one earner, where earners commute a substantial distance to work each day (i.e., a commute from the suburbs to a larger metropolitan area or city). Income has the smallest coefficient (1.531e-05), but this reflects its unit of measurement (one dollar) rather than a weak effect: it is the most statistically significant predictor (z = 19.14), and a $10,000 increase in income multiplies the odds by about exp(0.1531) ≈ 1.17. Household size does not have a statistically significant relationship with purchase intent.