Seed: 12, resulting sample size: 2522

AI usage

We used AI to teach us how R studio works and how to write the code.

The data

Consexp is cross-sectional data set where one row is one household. Sample size is 3363 households before cut down to 75%. The data is based on the consumer expenditure survey by Statistics Norway 2012. Expenditure amounts are measured in Norwegian kroner.

Setup

library(tidyverse)
library(car)
library(stargazer)
library(dplyr)
library(sandwich) 
library(lmtest)

Load data and create 75% sample

consexp <- readr::read_csv("consexp_f2025.csv")

# The seed set to our group number: 12
GROUP_SEED <- 12
set.seed(GROUP_SEED)

# Taking 75% sample of data
n_total <- nrow(consexp)
size_75 <- floor(0.75 * n_total)
S       <- sort(sample.int(n_total, size = size_75, replace = FALSE))
data    <- consexp[S, , drop = FALSE]

cat("Seed used:", GROUP_SEED,
    "\nRows total:", n_total,
    "\nRows in 75% sample:", nrow(data), "\n")

## Seed used: 12 
## Rows total: 3363 
## Rows in 75% sample: 2522

Part A: Joint distribution of income and food expenditures

Important to note that since we are using logarithms in income we cannot use values that are negative or zero. They are therefor excluded for the log plot.

# Filter out non-positives in the sample
plot_data <- data |> dplyr::filter(income > 0, foodexp > 0)

# Creating scatter and linear regression fit
ggplot(plot_data, aes(x = log(income), y = log(foodexp))) +
  geom_point(color = "slategray3", alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "steelblue4") +
  labs(
    title = "Joint Distribution of log(Income) and log(Food Expenditure)",
    x = "log(Income)",
    y = "log(Food Expenditure)"
  ) +
   theme_bw() +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::comma)

The plot shows a positive relationship between log (food expenditure) and log (income) since the regression line slopes upward. This means that when income increases, so does food expenditure.

Both variables are expressed in logs, so the slope represents elasticity. We can see that the regression line has a gentle slope, meaning the elasticity is less than 1. So if income increases by 1%, food expenditure will increase by less than 1%. According to micro economic theory, this suggests that the income elasticity of demand for food is inelastic, and that food is therefore a necessity good.

We can also see that the observations are quite spread out, meaning that there are many different spending patterns among households with similar income. This reflects some variation in food expenditure not explained by income, like household size and location.

Most of the data points are concentrated in the upper right area of the graph. This can be descriptive of the distribution of households in the sample. Our plot shows us that there are few very low income households in our sample and that most of the households have moderate to relatively high income levels.

Finally, we can see that the variation in food spending grows with income. Low-income households look fairly similar in how much they spend, while high-income households differ much more. This could suggest heteroskedasticity, meaning that the variance of the error term is not constant across income levels - so robust standard errors would be appropriate when estimating this model.

Part B: Summary statistics

# the variables that go into the statistical table
selected_vars <- c("income", "totalexp", "foodexp",
                   "energyexp", "transportexp",
                   "bmi", "nkids", "hascar")

# Labels for nicer output
var_labels <- c(
  "income"       = "Household Income",
  "totalexp"     = "Total Expenditure",
  "foodexp"      = "Food Expenditure",
  "energyexp"    = "Energy Expenditure",
  "transportexp" = "Transport Expenditure",
  "bmi"          = "Body Mass Index",
  "nkids"        = "Number of Kids",
  "hascar"       = "Car Ownership (1=Yes)"
)

# Preparing the data
tbl <- data |>
  dplyr::mutate(hascar = as.numeric(hascar)) |>
  dplyr::select(dplyr::all_of(selected_vars)) |>
  as.data.frame()

# Print table with extra spacing
cat("<div style='margin-bottom:20px;'>")  # add some space before

stargazer::stargazer(
  tbl,
  type = "html",
  title = "Summary Statistics for Key Variables",
  digits = 2,
  digits.extra = 1,
  covariate.labels = var_labels,
  table.layout = "lcccc"  # labels + 4 columns
)

**Summary Statistics for Key Variables**

Statistic	N	Mean	St. Dev.	Min	Max

Household Income	2,517	592,375.80	263,561.10	-200,000	1,430,000
Total Expenditure	2,522	490,983.00	308,981.80	-627,783	2,847,143
Food Expenditure	2,522	61,256.82	36,720.82	0.00	278,704.00
Energy Expenditure	2,522	20,399.50	11,271.21	0.00	88,064.95
Transport Expenditure	2,522	89,553.92	149,035.90	-761,321.60	1,538,910.00
Body Mass Index	2,522	25.04	3.73	13.49	38.97
Number of Kids	2,522	0.80	0.99	0	6
Car Ownership (1=Yes)	2,517	0.83	0.37	0	1

cat("</div>")

# Add CSS to widen column spacing
cat("<style>
      table { width: 100%; border-collapse: separate; border-spacing: 20px; }
    </style>")

Part C: Regressions

# Filtering out non positives so logs are valid
data_food      <- data %>% filter(income > 0, foodexp > 0)
data_energy    <- data %>% filter(income > 0, energyexp > 0)
data_transport <- data %>% filter(income > 0, transportexp > 0)

# Creating the regression models
model_food      <- lm(log(foodexp)      ~ log(income), data = data_food)
model_energy    <- lm(log(energyexp)    ~ log(income), data = data_energy)
model_transport <- lm(log(transportexp) ~ log(income), data = data_transport)

# Creating standard errors that are heteroskedasticy robust (HC1)
vc_food      <- vcovHC(model_food,      type = "HC1")
vc_energy    <- vcovHC(model_energy,    type = "HC1")
vc_transport <- vcovHC(model_transport, type = "HC1")

se_food      <- sqrt(diag(vc_food))
se_energy    <- sqrt(diag(vc_energy))
se_transport <- sqrt(diag(vc_transport))

# Creating a pretty table with the results
stargazer(
  model_food, model_energy, model_transport,
  type = "html",
  title = "Income Elasticity of Expenditures (log–log OLS with HC1-robust SEs)",
  column.labels = c("Food", "Energy", "Transport"),
  dep.var.labels.include = FALSE,
  se = list(se_food, se_energy, se_transport),
  digits = 3,
  keep.stat = c("n","rsq","adj.rsq","ser","f"),  # <- bring back Residual Std. Error & F
  notes = "Heteroskedasticity-robust (HC1) standard errors in parentheses.",
  notes.append = FALSE
)

**Income Elasticity of Expenditures (log–log OLS with HC1-robust SEs)**

	Dependent variable:

	Food	Energy	Transport
	(1)	(2)	(3)

log(income)	0.694^***	0.475^***	0.991^***
	(0.026)	(0.026)	(0.057)

Constant	1.682^***	3.502^***	-2.450^***
	(0.346)	(0.351)	(0.756)


Observations	2,511	2,499	2,328
R²	0.273	0.148	0.122
Adjusted R²	0.272	0.148	0.121
Residual Std. Error	0.591 (df = 2509)	0.590 (df = 2497)	1.357 (df = 2326)
F Statistic	939.873^*** (df = 1; 2509)	433.650^*** (df = 1; 2497)	322.558^*** (df = 1; 2326)

Note:	Heteroskedasticity-robust (HC1) standard errors in parentheses.

1. Interpreting the relationship between income and expenditure

The regression results show that all estimated coefficients on log(income) are positive and highly significant. In log–log form, these coefficients measure income elasticities of expenditure, that is, the percentage change in expenditure associated with a 1% change in income.

Category	Elasticity	Interpretation
Food	0.694	Spending on food increases by about 0.69% when income increases by 1%.
Energy	0.475	Spending on energy increases by about 0.48% when income increases by 1%.
Transport	0.991	Spending on transport increases by about 0.99% when income increases by 1%.

The results indicate that food and energy are necessity goods: expenditure on these items rises less than proportionally with income. As household income increases, the share of income spent on necessities therefore declines. This pattern is consistent with Engel’s law, which states that the proportion of income allocated to food decreases as income grows.

By contrast, transport expenditure has an elasticity close to one, suggesting that transport is a unit elastic normal good. In these cases expenditure on transport increases proportionately with income.

Explain what the results mean for the impact of increased prices (inflation) for households with different income levels

What we measured was income elasticities and they do not directly measure price responses. They measure how spending changes with income, not price. But we can use the income elasticities to reason about inflation’s impact on expenditure across different income groups since higher inflation decreases the spending power of households.

We can therefore reason that inflation would have a bigger negative impact on lower-income households, because they spend a larger share of their income on necessities such as food and energy. Since these goods are essential and difficult to reduce or substitute, rising prices force lower-income households to devote an even greater portion of their limited income toward them, reducing their ability to afford other goods.

Higher-income households, on the other hand, spend a smaller share of their income on necessities and have more flexibility in their budgets. They are therefore less affected by price increases of necessities such as energy and food.

Inflation in transport may be felt more evenly across income groups, since expenditure on transport tends to rise proportionally with income.

In short, inflation in food and energy disproportionately burdens low-income households, while transport inflation tends to affect all households more equally in proportional terms.

2. Challenges in interpreting estimates as causal effects

There are a few reasons for why the coefficients cannot be straightforwardly interpreted as the casual effect of income on expenditure:

Omitted variable bias: Firstly, many other variables can affect household expenditure that are not income like family size, age of house, location of house and many more. If we are missing important variables the OLS estimators can become biased, and the estimated elasticity would capture both the true income effect and the influence of omitted factors.
Simultaneity / reverse causality: Secondly, income may affect expenditures, but expenditures (or choices related to work, location, household composition) may also influence income. For example, households with high transport expenses might live farther from their work, influencing job choice and their income level. So the direction of causation is not guaranteed
Measurement error: Lastly, both income and expenditure may be measured with error in survey data, which can bias coefficients toward zero.

Also correlation does not mean causal relationship. In these cases using common sense and theory is important. It makes intuitively sense that income affects different expenditures. One of the core economic theories is that we can always increase utility with more consumption of normal goods.

In summary, the estimated elasticities provide descriptive evidence about how expenditures covary with income across households, but not the causal effect of income on spending.

Part D: Hypothesis test on transport elasticity

Part D1: The test

Here, β₁ is the income elasticity of transport expenditure. If β₁ is equal to 1, then the elasticity of transportation expenditure with respect to income is proportional.

To test the hypothesis, that β₁ is statistically equal to 1, we perform an F test using the linearhypothesis() function from the car package.

Null hypothesis (H₀): β₁ = 1
Alternative hypothesis (H₁): β₁ ≠ 1

# Robust Standard Errors
vcov_transport_robust <- sandwich::vcovHC(model_transport, type = "HC1")

# Formal hypothesis test: H0: elasticity = 1 (robust)
car::linearHypothesis(
  model_transport,
  "log(income) = 1",
  vcov. = vcov_transport_robust
)

## 
## Linear hypothesis test:
## log(income) = 1
## 
## Model 1: restricted model
## Model 2: log(transportexp) ~ log(income)
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F Pr(>F)
## 1   2327                 
## 2   2326  1 0.0255 0.8732

Interpreting the result

Model 1: Restricted – Means that we are forcing β₁ to be equal to 1.

Model 2: Unrestricted – Here the regression can freely estimate β₁.

The function then compares how much worse the fit is in the restricted model.

The robust F-test yields a p-value of 0.87, which is well above the conventional 0.05 significance level.

Therefore, we fail to reject the null hypothesis that the income elasticity of transportation expenditure equals 1.

This indicates that, based on our sample, transportation expenditure behaves proportionally with income: a 1% increase in household income is associated with approximately a 1% increase in transport spending.

Part D2: 95% Confidence Interval

We now calculate a 95% confidence interval for the income elasticity of transport expenditure.

# 95% robust confidence interval for elasticity
lmtest::coefci(
  model_transport,
  vcov. = vcov_transport_robust,
  level = 0.95
)["log(income)", ]

##     2.5 %    97.5 % 
## 0.8787998 1.1029555

The confidence interval

The regression in part C of log(transport) on log(income) gave us en estimate of β₁ (0.991), using heteroskedasticity robust standard errors, but that estimate is subject to a some degree of sampling uncertainty.

The estimated 95% confidence interval for the income elasticity of transportation expenditure is approximately [0.88, 1.10].

Interpretation of CI: The 95% confidence interval of 0.88 to 1.10 means that, under repeated sampling, 95% of such intervals would contain the true elasticity

How it relates to the test

The 95% confidence interval for the elasticity is [0.88, 1.10]. Since the null hypothesis value of 1 lies inside this interval, it represents a parameter value that is consistent with the data at the 5% significance level. This aligns with the hypothesis test in part (d1), where we failed to reject H₀: β₁ = 1

The connection arises because a 95% confidence interval contains exactly those parameter values that would not be rejected in a two-sided hypothesis test at the 5% significance level.

In other words, the confidence interval reflects the full set of values for which the test statistic is not large enough to indicate statistical significance.

If the hypothesized value lies inside the interval, the data does not provide sufficient evidence against it, so the test does not reject the null.
If it lies outside, then the test statistic would be large enough to reject the null.

Therefore, checking whether 1 is inside or outside the confidence interval gives the same conclusion as performing the formal hypothesis test.

Part E: Adding number of kids to the food expenditure model

We now extend the food expenditure regression by including the number of kids in the household as an additional explanatory variable:

\[ \log(\text{foodexp}_i) = \beta_0 + \beta_1 \log(\text{income}_i) + \beta_2 \, nkids_i + \varepsilon_i \]

# Re-estimate the food expenditure model including nkids
model_food_kids <- lm(
  log(foodexp) ~ log(income) + nkids,
  data = data %>% dplyr::filter(income > 0, foodexp > 0)
)

# Report results in a table
stargazer::stargazer(
  model_food_kids,
  type = "html",
  title = "Food Expenditure Model with Number of Kids",
  covariate.labels = c("Log(Income)", "Number of Kids"),
  dep.var.labels = "Log(Food Expenditure)",
  digits = 3
)

**Food Expenditure Model with Number of Kids**

	Dependent variable:

	Log(Food Expenditure)

Log(Income)	0.606^***
	(0.023)

Number of Kids	0.145^***
	(0.012)

Constant	2.723^***
	(0.303)


Observations	2,511
R²	0.312
Adjusted R²	0.311
Residual Std. Error	0.575 (df = 2508)
F Statistic	567.403^*** (df = 2; 2508)

Note:	p<0.1; p<0.05; p<0.01

Estimated equation:

\[ \widehat{\log(\text{foodexp}_i)} = 2.723 + 0.606 \, \log(\text{income}_i) + 0.145 \, nkids_i \]

\[ \quad (0.303) \quad (0.023) \quad (0.012) \]

Part E1: The change in the coefficient

Families with more kids probably spend more on food, even if the income is the same. If we do not control for the number of kids then some of the variation in food spending might incorrectly get attributed to income. Adding it makes the regression more realistic and reduces omitted variable bias.

After adding the number of kids the coefficient on income went from 0,694 to 0,606. So, once we controlled for kids the effect of income on food spending alone is smaller suggesting that the simple model overstated the elasticity of food spending with regards to income because it was omitting kids.

β₁ = 0,606 now means: a 1% increase in income is associated with about 0,61% increase in food spending, holding all other variables fixed.

Part E2: Joint test

We test the joint hypothesis that both coefficients on log income and number of kids are both different from zero.

This is achieved with an F test.

Our null hypothesis (H₀): \(\beta_1 = 0\) and \(\beta_2 = 0\)
The alternative hypothesis (H₁): at least one of \(\beta_1, \beta_2\) is \(\ne 0\)

# Joint F-test: H0: beta_log(income) = 0 AND beta_nkids = 0
joint_test <- car::linearHypothesis(
  model_food_kids,
  c("log(income) = 0", "nkids = 0")
)

# Print the result
joint_test

## 
## Linear hypothesis test:
## log(income) = 0
## nkids = 0
## 
## Model 1: restricted model
## Model 2: log(foodexp) ~ log(income) + nkids
## 
##   Res.Df     RSS Df Sum of Sq     F    Pr(>F)    
## 1   2510 1205.15                                 
## 2   2508  829.73  2    375.43 567.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Manual critical F value for comparison (5% level, df1=2 restrictions, df2=residual df)
crit_val <- qf(0.95, df1 = 2, df2 = df.residual(model_food_kids))
cat("\nCritical F value at 5%:", round(crit_val, 3), "\n")

## 
## Critical F value at 5%: 2.999

Interpreting the results

Model 1 (Restricted): Forces both β₁ and β₂ to be equal to zero. Which means that food spending is only predicted by the constant.

Model 2 (Unrestricted): β₁ and β₂ can be estimated normally

If adding income and kids significantly reduces the RSS (badness-of-fit), then they matter. We can see that the RSS goes form 1205,15 to 829,73 indicating that these variables affect the model greatly.

The F test yields an F-statistic of 567,4 which is far above the critical value of 2.999 at the 5% significance level. When the F statistic is higher than the critical value, then we reject the null hypothesis that both coefficients are equal to zero.

This conclusion can also be reached by looking at the p value. At the \(\alpha = 0.05\) level, we have \(p < 2.2 \times 10^{-16}\), which is way below 0,05 so we reject \(H_0\).

Final result: Income and the number of kids jointly have a statistically significant impact on household food expenditure.

Part E3: The p-value

We test the hypothesis \(H_0: \beta_1 = \hat\beta_1\) in the same sample. The linearHypothesis conducts and F test, and the manual test is a t-test. Both resulting in the same answer.

# Grab the estimated slope on log(income)
b1_hat <- coef(model_food_kids)["log(income)"]

# Formal test H0: beta1 = b1_hat (same data)
res_equal_hat <- car::linearHypothesis(
  model_food_kids,
  hypothesis = paste0("log(income) = ", b1_hat)
)
res_equal_hat

## 
## Linear hypothesis test:
## log(income) = 0.605889940264803
## 
## Model 1: restricted model
## Model 2: log(foodexp) ~ log(income) + nkids
## 
##   Res.Df    RSS Df Sum of Sq  F Pr(>F)
## 1   2509 829.73                       
## 2   2508 829.73  1         0  0      1

# Manual t-test check: (b1_hat - b1_hat)/SE = 0  => p = 1
se_b1 <- coef(summary(model_food_kids))["log(income)", "Std. Error"]
t_stat <- (b1_hat - b1_hat) / se_b1
p_val  <- 2 * (1 - pt(abs(t_stat), df = df.residual(model_food_kids)))
cat("\nManual t-stat:", t_stat, "  p-value:", p_val, "\n")

## 
## Manual t-stat: 0   p-value: 1

Interpreting the results

The test of the null hypothesis \(H_0: \beta_1 = \hat{\beta}_1\) naturally produces a p-value of 1.

This happens because we are testing whether the coefficient is equal to the exact value that the model already estimated from the data.

Model 1 (Restricted): forces \(\beta_1\) to equal \(\hat{\beta}_1\).
Model 2 (Unrestricted): allows the data to freely estimate \(\beta_1\).

Since both models are identical, the residual variation is the same, and the t-test and f-test statistic become zero.

A p-value of 1 means there is no evidence at all against the null — but that’s simply because of how the test is set up. In fact, this kind of test is basically meaningless: of course the estimate equals itself.

So the takeaway is that this result doesn’t tell us anything new about income and food spending. It’s just a reminder that hypothesis testing is only useful when we compare \(\beta_1\) to some outside value (like \(\beta_1 = 1\) in Part D), not when we compare it to its own estimate.

Part F: Controlling for BMI in the food expenditure model

2 models, one with BMI and one without.

# Model without BMI (already known, but re-run for clarity)
model_food_no_bmi <- lm(
  log(foodexp) ~ log(income),
  data = data %>% dplyr::filter(income > 0, foodexp > 0)
)

# Model with BMI
model_food_bmi <- lm(
  log(foodexp) ~ log(income) + bmi,
  data = data %>% dplyr::filter(income > 0, foodexp > 0, !is.na(bmi))
)


# Compare in one table
stargazer(model_food_no_bmi, model_food_bmi,
          type = "html",
          title = "Effect of Including BMI in the Food Expenditure Model",
          covariate.labels = c("log(Income)", "BMI", "Constant"),
          dep.var.labels = "log(Food Expenditure)",
          digits = 3)

**Effect of Including BMI in the Food Expenditure Model**

	Dependent variable:

	log(Food Expenditure)
	(1)	(2)

log(Income)	0.694^***	0.749^***
	(0.023)	(0.021)

BMI		0.067^***
		(0.003)

Constant	1.682^***	-0.715^**
	(0.298)	(0.291)


Observations	2,511	2,511
R²	0.273	0.398
Adjusted R²	0.272	0.398
Residual Std. Error	0.591 (df = 2509)	0.538 (df = 2508)
F Statistic	939.873^*** (df = 1; 2509)	830.524^*** (df = 2; 2508)

Note:	p<0.1; p<0.05; p<0.01

# Filter: only keep rows where income > 0 and bmi is not missing
corr_data <- data %>% dplyr::filter(income > 0, !is.na(bmi))

# Correlation between BMI and log income
cor_logincome_bmi <- cor(log(corr_data$income), corr_data$bmi)

cat("Correlation between BMI and log(income):", round(cor_logincome_bmi, 3), "\n")

Correlation between BMI and log(income): -0.119

Interpreting the results

The most noticable feature when comparing the two models, we see that the estimated elasticity of food expenditure with respect to income increases from 0.694 in the simple model to 0.749 when BMI is included. At the same time, BMI itself enters with a positive and highly significant coefficient (0.067), suggesting that households with higher BMI tend to spend more on food, holding income constant.

The change in the income coefficient could indicate the presence of omitted variable bias in the simple model since also the standard deviations of coefficients decreases. To investigate the direction of this bias, we calculated the correlation between BMI and income, which turned out to be negative (≈ –0.12) (meaning households with higher income, usually have lower BMI). This means that BMI is positively related to food expenditure but negatively related to income. According to the omitted variable bias formula, this combination implies a negative bias in the income coefficient when BMI is omitted.

Therefore, the model without BMI underestimated the true effect of income on food spending. Once BMI is included, the estimated elasticity of food expenditure with respect to income increases, correcting for this negative bias.

Additional comparison of the models

Including BMI not only changes the coefficient on income, but also improves the overall fit of the model.
The intercept falls from 1.682 to –0.715, showing that BMI explains some of the baseline food spending.
The \(R^2\) rises from 0.273 to 0.398, and the residual standard error decreases from 0.591 to 0.538, meaning the model with BMI explains more variation and predicts more precisely.
Both models are highly significant, but the stronger \(F\)-statistic in the richer model confirms the additional explanatory power from including BMI.
But it is important to note that just because R squared goes up, it doesn´t mean that the model is better since all added variables increase r squared even if they are doing more harm than good for the model.

Could BMI be a bad control?

One could argue that higher BMI is a result of higher food expenditure or an outcome of food expenditure. In the chapter 6 in Mastering Metrics it is clearly stated that the moral of the bad control story is that timing matters. When the variables measure before the treatment variable they are usually good controls, but the ones that are measured later may have been determined in part by the treatment, which makes them outcomes not good controls.

But we can also argue that the BMI variable is a “confounding factor”. If BMI reflects lifestyle and health, which could theoritically easily affect income. Then BMI affects both y and x_1. If that is the case it would be considered important to control for that variable.

Given this, the concern is that the drawbacks of including BMI outweigh the potential benefits. Since BMI is likely shaped by food consumption itself, conditioning on it risks introducing reverse causality and blocking part of the very relationship we aim to estimate. Even though BMI may proxy for lifestyle factors that correlate with income, its status as a potential outcome of food expenditure makes it unsuitable as a control in this setting.

Therefore, BMI should be treated as a bad control, and the simpler model without BMI provides a more credible estimate of the income elasticity of food expenditure.

Part G: Trying to add log(income²) alongside log(income)

Including both log(income) and log(income²) in the regression leads to perfect multicollinearity because log(income²) = 2 * log(income).

Since the second regressor is an exact linear transformation of the first, it does not provide any additional independent variation for estimating the model. As a result, R correctly identifies the perfect collinearity and drops log(income²) from the regression.

This means that adding log(income²) tells us nothing new about the relationship between income and food expenditure — the model cannot separately identify the effects of two variables that convey identical information

library(tidyverse)
library(stargazer)

# Clean data FIRST, then create transformed vars
df <- data %>%
  filter(income > 0, foodexp > 0) %>%
  mutate(
    income2    = income^2,
    logincome2 = log(income2)
  )

# Regress with collinear RHS
model_food_income2 <- lm(
  log(foodexp) ~ log(income) + logincome2,
  data = df
)

stargazer(
  model_food_income2,
  type = "html",
  title = "Regression with Perfect Multicollinearity",
  dep.var.labels = "log(foodexp)",
  # don't use covariate.labels here (it can break when a term is dropped)
  omit.stat = "f",
  no.space = TRUE,
  notes = "One regressor is perfectly collinear and is omitted by OLS."
)

**Regression with Perfect Multicollinearity**

	Dependent variable:

	log(foodexp)

log(income)	0.694^***
	(0.023)
logincome2

Constant	1.682^***
	(0.298)

Observations	2,511
R²	0.273
Adjusted R²	0.272
Residual Std. Error	0.591 (df = 2509)

Note:	p<0.1; p<0.05; p<0.01
	One regressor is perfectly collinear and is omitted by OLS.

Part H: Adding squared log income to the food expenditure model

Note that we no longer have perfect multicollinearity as in part G. They presumably have still correlation, but not as significantly as before.

To capture possible curvature, we now include \((\log(\text{income}))^2\) as an additional regressor:

\[ \log(\text{foodexp}_i) = \beta_0 + \beta_1 \log(\text{income}_i) + \beta_2 \big(\log(\text{income}_i)\big)^2 + \varepsilon_i \]

# 1. Filter out zero or negative income/foodexp and construct regressors
hdata <- data %>% 
  filter(income > 0, foodexp > 0) %>%
  mutate(
    logincome  = log(income),
    logincome2 = logincome^2
  )

# 2. Log–log model
model_food_loglog <- lm(
  log(foodexp) ~ logincome,
  data = hdata
)

# 3. Log–log with squared log income
model_food_quadlog <- lm(
  log(foodexp) ~ logincome + logincome2,
  data = hdata
)

# 4. Robust variance–covariance matrices (HC1)
vcov_loglog_robust   <- sandwich::vcovHC(model_food_loglog, type = "HC1")
vcov_quadlog_robust  <- sandwich::vcovHC(model_food_quadlog, type = "HC1")

# 5. Robust standard errors (square roots of diagonal)
se_loglog_robust  <- sqrt(diag(vcov_loglog_robust))
se_quadlog_robust <- sqrt(diag(vcov_quadlog_robust))

# 6. Regression table with robust SEs
stargazer::stargazer(
  model_food_loglog, model_food_quadlog,
  type = "html",
  title = "Comparing log–log vs log–log-squared models",
  covariate.labels = c("log(Income)", "(log(Income))²", "Constant"),
  dep.var.labels = "log(Food Expenditure)",
  digits = 3,
  se = list(se_loglog_robust, se_quadlog_robust),  # <-- robust SEs here
  column.labels = c("Log–log", "Log–log + (log income)^2"),
  keep.stat = c("n", "rsq", "adj.rsq")
)

**Comparing log–log vs log–log-squared models**

	Dependent variable:

	log(Food Expenditure)
	Log–log	Log–log + (log income)
	(1)	(2)

log(Income)	0.694^***	-2.870^***
	(0.026)	(0.563)

(log(Income))²		0.138^***
		(0.022)

Constant	1.682^***	24.599^***
	(0.346)	(3.638)


Observations	2,511	2,511
R²	0.273	0.283
Adjusted R²	0.272	0.282

Note:	p<0.1; p<0.05; p<0.01

Part H1. Interpretation of results

When the model only has log(income) as a dependent variable, we assume a linear relationship between food spending and log(income).

By adding \((\log(\text{income}))^2\) we are changing the functional form of the model.

The intuition behind this could be that of course food is a necessity. All humans need a certain amount of calories to survive. But as income grows and other needs have been met, better quality food can become more of luxury, so at a certain point, food spending would increase with more income (more take-aways and avacados)

Changing the functional form could therefor be more realistic.

In the simple log–log model, the estimated elasticity of food expenditure with respect to income is constant at about 0.69. This means that across the sample, a 1% increase in income is associated with a 0.69% increase in food spending, regardless of the income level.

When we extend the model by adding \((\log(\text{income}))^2\), the results change:
- The coefficient on log(income) is negative (–2.87), while the squared term is positive (0.138).
- This implies the elasticity is not constant, but varies with income. At lower incomes, the negative linear term dominates, so increases in income have a weaker effect on food spending. At higher incomes, the positive squared term grows in importance, making food expenditure more responsive to income.

In other words, the Engel curve for food is convex in logs: richer households increase food spending proportionally more strongly when income rises, while poorer households allocate relatively less of additional income to food.

Part H2. Comparing predictions

Before we compare prediction we first plotted the new regression line with the curve to see visually how the squared log income affected the regression line.

# Create a grid of log income values
income_grid <- tibble(logincome = seq(min(hdata$logincome),
                                      max(hdata$logincome),
                                      length.out = 200))

# Predictions from both models
income_grid <- income_grid %>%
  mutate(
    pred_loglog = predict(model_food_loglog,
                          newdata = data.frame(logincome = logincome)),
    pred_quadlog = predict(model_food_quadlog,
                           newdata = data.frame(logincome = logincome,
                                                logincome2 = logincome^2))
  )

# Plot: data + both fitted lines
ggplot(hdata, aes(x = logincome, y = log(foodexp))) +
  geom_point(alpha = 0.2, color = "gray") +
  geom_line(data = income_grid, aes(x = logincome, y = pred_loglog),
            color = "darkseagreen4", size = 1, linetype = "dashed") +
  geom_line(data = income_grid, aes(x = logincome, y = pred_quadlog),
            color = "darkseagreen", size = 1) +
  labs(
    title = "Fitted Relationship: log(Food Expenditure) vs log(Income)",
    subtitle = "Blue = simple log-log, Red = log-log with squared term",
    x = "log(Income)",
    y = "log(Food Expenditure)"
  ) +
  theme_minimal()

In the plot we can see that the curve is mild. At food spending decreases just a tiny bit when income increases at a very low income level. It seems than, at a certain income level we will calculate below, food spending starts to increase with income, first slowly and than more rapidly.

Simple model:
\[ \widehat{\log(\text{foodexp})} = 1.682 + 0.694 \cdot \log(\text{income}) \]
Elasticity is constant at 0.69.
Squared model:
\[ \widehat{\log(\text{foodexp})} = 24.599 - 2.870 \cdot \log(\text{income}) + 0.138 \cdot (\log(\text{income}))^2 \]

In the simple log–log model, the elasticity was constant: a 1% increase in income was associated with about a 0.69% increase in food spending, regardless of income level.

In the extended model, the negative coefficient on log(Income) together with the positive coefficient on (log(Income))² means that the effect of income on food expenditure is not constant, but depends on the level of income. At lower income levels, the growth of food expenditure with respect to income is weaker, but as income rises, the positive squared term gradually dominates and food expenditure becomes more responsive to income.

This nonlinearity suggests that poorer households allocate a smaller share of additional income to food, while richer households increase their food spending more strongly when income rises.

In terms of model fit, the \(R^2\) improves slightly from 0.273 to 0.283, and the residual standard error falls from 0.591 to 0.587. These changes are modest but show that including the squared term helps the model capture additional variation in food expenditure.

The simple model predicts a straight-line relationship between log income and log food expenditure, with the same slope everywhere.

The squared model bends this line: it predicts higher food expenditure for low income households than the simple model did. Overall, the squared model fits the data slightly better (\(R^2\) rises from 0.273 to 0.283), but the difference is modest.

Overall, the quadratic model suggests that food spending increases more slowly at very low incomes and more strongly at high incomes, while the simple model assumes the same proportional increase everywhere.

Part H3:

# Compute the turning point in log income: -b1 / (2*b2)
b1 <- coef(model_food_quadlog)["logincome"]
b2 <- coef(model_food_quadlog)["logincome2"]

turning_point_loginc <- -b1 / (2*b2)
turning_point_income <- exp(turning_point_loginc)  # back-transform to income level

cat("Turning point (log income):", round(turning_point_loginc, 3), "\n")

Turning point (log income): 10.379

cat("Turning point (income, same units as data):", round(turning_point_income, 0), "\n")

Turning point (income, same units as data): 32190 Turning point

the turning point is where the slope changes sign. Our turning point in log income is: 10.379.

Income is in natural logarithm. So we need to back-transform that to a regular unit.

# Back-transform to raw income level (same units as data)
turning_point_income <- exp(turning_point_loginc)

cat("Turning point (log income):", round(turning_point_loginc, 3), "\n")

Turning point (log income): 10.379

cat("Turning point (income, same units as original data):", round(turning_point_income, 0), "\n")

Turning point (income, same units as original data): 32190

Food expenditure starts to increase with more income when income reaches 32.190 NOK.

Part I - Half-sample exercise

# -------------------------------
# Part I — Half-sample exercise
# -------------------------------
# 1) Half-sample indices (seed already set earlier)
n_full <- nrow(data)
n_half <- max(1L, floor(n_full / 2L))
idx_half <- sort(sample.int(n_full, size = n_half, replace = FALSE))
data_half <- data[idx_half, , drop = FALSE]

# 2) Full-sample models (on your personalized `data`)
model_food_full <- lm(log(foodexp) ~ log(income),
                      data = dplyr::filter(data, foodexp > 0, income > 0))
model_energy_full <- lm(log(energyexp) ~ log(income),
                        data = dplyr::filter(data, energyexp > 0, income > 0))
model_transport_full <- lm(log(transportexp) ~ log(income),
                           data = dplyr::filter(data, transportexp > 0, income > 0))

# 3) Half-sample models
model_food_half <- lm(log(foodexp) ~ log(income),
                      data = dplyr::filter(data_half, foodexp > 0, income > 0))
model_energy_half <- lm(log(energyexp) ~ log(income),
                        data = dplyr::filter(data_half, energyexp > 0, income > 0))
model_transport_half <- lm(log(transportexp) ~ log(income),
                           data = dplyr::filter(data_half, transportexp > 0, income > 0))

# 4) Build a tidy comparison table (estimate with SE in parentheses)
squeeze <- function(m, label) {
  b <- broom::tidy(m)
  g <- broom::glance(m)
  # keep intercept and log(income)
  b2 <- b[b$term %in% c("(Intercept)", "log(income)"), c("term","estimate","std.error")]
  b2$estimate <- sprintf("%.3f", b2$estimate)
  b2$std.error <- sprintf("(%.3f)", b2$std.error)
  b2$term[b2$term == "(Intercept)"] <- "Constant"
  b2$term[b2$term == "log(income)"] <- "log(Income)"
  b2$Model <- label
  stats_row <- data.frame(
    term = c("Observations","R^2","Residual SE"),
    estimate = c(
      format(g$nobs, big.mark = ","),
      sprintf("%.3f", g$r.squared),
      sprintf("%.3f (df = %d)", g$sigma, g$df.residual)
    ),
    std.error = c("", "", ""),
    Model = label,
    stringsAsFactors = FALSE
  )
  rbind(b2, stats_row)
}

tab <- dplyr::bind_rows(
  squeeze(model_food_full,      "Food (Full)"),
  squeeze(model_energy_full,    "Energy (Full)"),
  squeeze(model_transport_full, "Transport (Full)"),
  squeeze(model_food_half,      "Food (Half)"),
  squeeze(model_energy_half,    "Energy (Half)"),
  squeeze(model_transport_half, "Transport (Half)")
)

# Combine estimate and SE in one column
tab$`Estimate ± SE` <- ifelse(tab$std.error == "",
                              tab$estimate,
                              paste(tab$estimate, tab$std.error))
tab <- tab[, c("Model","term","Estimate ± SE")]

# Pivot wide: rows = terms, columns = models
wide <- tidyr::pivot_wider(tab, names_from = Model, values_from = `Estimate ± SE`)

# Print counts to verify split
cat("Rows in personalized sample:", n_full, "\n")

## Rows in personalized sample: 2522

cat("Rows in half-sample:", nrow(data_half), "\n")

## Rows in half-sample: 1261

# 5) Display HTML table
knitr::kable(
  wide,
  format = "html",
  caption = "Part I — Full vs Half-Sample Regressions (log y on log income)"
)

Part I — Full vs Half-Sample Regressions (log y on log income)
term	Food (Full)	Energy (Full)	Transport (Full)	Food (Half)	Energy (Half)	Transport (Half)
Constant	1.682 (0.298)	3.502 (0.301)	-2.450 (0.729)	1.657 (0.420)	4.220 (0.424)	-2.566 (1.036)
log(Income)	0.694 (0.023)	0.475 (0.023)	0.991 (0.055)	0.696 (0.032)	0.421 (0.032)	1.001 (0.079)
Observations	2,511	2,499	2,328	1,252	1,248	1,167
R^2	0.273	0.148	0.122	0.277	0.121	0.122
Residual SE	0.591 (df = 2509)	0.590 (df = 2497)	1.357 (df = 2326)	0.592 (df = 1250)	0.586 (df = 1246)	1.371 (df = 1165)

Interpretation (Part I).

When comparing the full sample to the half sample, the coefficients of log(income) remain very close but still with slight differences:

Food: 0.694 (full) vs 0.696 (half)
Energy: 0.475 vs 0.421
Transport: 0.991 vs 1.001

This shows that even with half the data, the overall relationship between income and expenditure is estimated similarly.

However, the standard errors are noticeably larger in the half sample:

Food: 0,023 (full) vs 0,032 (half)
Energy: 0,023 vs 0,032
Transport: 0,055 vs 0,079

. This is expected because standard errors scale roughly with \(1 / \sqrt{n}\). With fewer observations, estimates are less precise.

The R² values are also a bit lower in the half sample for energy (0.148 → 0.121) but nearly unchanged for food and transport. This reflects natural variability in how well the model explains variation when fewer data points are used.

Overall, the exercise illustrates two key points:
1. Coefficients stay stable across samples, suggesting the estimated elasticities are not overly sensitive to sample size.
2. Precision decreases in smaller samples, which is visible in the larger standard errors and slightly less stable fit statistics.

Print session Info to ensure reproducability

To ensure reproducability and verfiy that we have run the code properly

sessionInfo()

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=Icelandic_Iceland.utf8  LC_CTYPE=Icelandic_Iceland.utf8   
## [3] LC_MONETARY=Icelandic_Iceland.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=Icelandic_Iceland.utf8    
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lmtest_0.9-40   zoo_1.8-14      sandwich_3.1-1  stargazer_5.2.3
##  [5] car_3.1-3       carData_3.0-5   lubridate_1.9.4 forcats_1.0.0  
##  [9] stringr_1.5.1   dplyr_1.1.4     purrr_1.1.0     readr_2.1.5    
## [13] tidyr_1.3.1     tibble_3.3.0    ggplot2_3.5.2   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      lattice_0.22-7    
##  [5] hms_1.1.3          digest_0.6.37      magrittr_2.0.3     evaluate_1.0.5    
##  [9] grid_4.5.1         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] Matrix_1.7-3       jsonlite_2.0.0     backports_1.5.0    Formula_1.2-5     
## [17] mgcv_1.9-3         scales_1.4.0       jquerylib_0.1.4    abind_1.4-8       
## [21] cli_3.6.5          crayon_1.5.3       rlang_1.1.6        splines_4.5.1     
## [25] bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0       yaml_2.3.10       
## [29] parallel_4.5.1     tools_4.5.1        tzdb_0.5.0         broom_1.0.9       
## [33] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.4    bit_4.6.0         
## [37] vroom_1.6.5        pkgconfig_2.0.3    pillar_1.11.0      bslib_0.9.0       
## [41] gtable_0.3.6       glue_1.8.0         xfun_0.52          tidyselect_1.2.1  
## [45] rstudioapi_0.17.1  knitr_1.50         farver_2.1.2       nlme_3.1-168      
## [49] htmltools_0.5.8.1  labeling_0.4.3     rmarkdown_2.29     compiler_4.5.1

Assignment 1, Group 12

Audur Ran Palsdottir and Hilla Elviira Savolainen

Seed: 12, resulting sample size: 2522

AI usage

The data

Setup

Load data and create 75% sample

Part A: Joint distribution of income and food expenditures

Part B: Summary statistics

Part C: Regressions

1. Interpreting the relationship between income and expenditure

2. Challenges in interpreting estimates as causal effects

Part D: Hypothesis test on transport elasticity

Part D1: The test

Interpreting the result

Part D2: 95% Confidence Interval

Part E: Adding number of kids to the food expenditure model

Part E1: The change in the coefficient

Part E2: Joint test

Part E3: The p-value

Part F: Controlling for BMI in the food expenditure model

Part G: Trying to add log(income²) alongside log(income)

Part H: Adding squared log income to the food expenditure model

Part H1. Interpretation of results

Part H2. Comparing predictions

Part H3:

Part I - Half-sample exercise

Print session Info to ensure reproducability