Load a Few Packages:

Input the Food Dataset (from reading):

Parsed with column specification:
cols(
  FOODCONSUMPTION = col_number(),
  FAMILYINCOME = col_number(),
  NUMCHULDREN = col_double(),
  HAVEGARDEN = col_character()
)
Rows: 51
Columns: 4
$ FOODCONSUMPTION <dbl> 780, 1612, 1621, 1820, 2444, 3120, 3952, 4056, 4160, 4160, 4300, 4420, 5200, …
$ FAMILYINCOME    <dbl> 24000, 20000, 37436, 36600, 10164, 2500, 29000, 40000, 30154, 34000, 46868, 1…
$ NUMCHULDREN     <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 2, 1, 2, 4, 2, 3, 4, 2, 1,…
$ HAVEGARDEN      <chr> "NO", "NO", "NO", "YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO", "NO", "…

Data Cleaning and Coding:

We use three functions to help us here. The first two are found in the dplyr package and the third comes from base R.

  1. rename is a handy function that will change the name of any variables in your dataset. The formula is newname = oldname. So, in this case, we are renaming our variables to get rid of capital letters and put spaces in between the words with underscores.

  2. mutate is another SUPER handy function that I use all the time. It allows us to generate a new variable based on a transformation (or mutation, think X-Men!) of an existing one. In this case, I am taking the original (renamed) variable have_garden, and I am creating a new variable called have_garden_fac that is equal to the old variable, but converted to a factor (that is what the as_factor function does). So now, R recognizes the difference between “Yes” and “No” for have_garden_factor as actually being two different levels of a categorical variable that I can use in frequency tables, regression models, etc.

  3. na.omit is another handy function that should come with a warning label! This one should be used very carefully. What this function does is remove any observation (row) in a dataset with missing data. This is similar to listwise deletion if you have talked about that in a previous stats class. We use it here because our original food dataset had an extra row in it that was all missing. In this case, that was just an error - this wasn’t an actual response. But, if you use na.omit on a dataset where you do have missing data, then any observation with any missingness will be removed. The nice part is we can always look back at food now that we have food.clean and see the difference- how many observations were dropped, etc.

Rows: 50
Columns: 5
$ food_consumption <dbl> 780, 1612, 1621, 1820, 2444, 3120, 3952, 4056, 4160, 4160, 4300, 4420, 5200,…
$ family_income    <dbl> 24000, 20000, 37436, 36600, 10164, 2500, 29000, 40000, 30154, 34000, 46868, …
$ num_children     <dbl> 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 2, 1, 2, 4, 2, 3, 4, 2, 1…
$ have_garden      <chr> "NO", "NO", "NO", "YES", "YES", "NO", "YES", "NO", "NO", "YES", "NO", "NO", …
$ have_garden_fac  <fct> NO, NO, NO, YES, YES, NO, YES, NO, NO, YES, NO, NO, YES, YES, YES, YES, NO, …

Use skim from the skimr package to get a nice summary:

── Data Summary ────────────────────────
                           Values    
Name                       food.clean
Number of rows             50        
Number of columns          5         
_______________________              
Column type frequency:               
  character                1         
  factor                   1         
  numeric                  3         
________________________             
Group variables            None      

── Variable type: character ──────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate   min   max empty n_unique whitespace
1 have_garden           0             1     2     3     0        2          0

── Variable type: factor ─────────────────────────────────────────────────────────────────────────────
  skim_variable   n_missing complete_rate ordered n_unique top_counts     
1 have_garden_fac         0             1 FALSE          2 NO: 36, YES: 14

── Variable type: numeric ────────────────────────────────────────────────────────────────────────────
  skim_variable    n_missing complete_rate     mean       sd    p0   p25   p50     p75   p100 hist 
1 food_consumption         0             1  8795.    4599.     780  5200  8826  11895   20132 ▅▅▇▂▂
2 family_income            0             1 72322.   54170.    1200 31825 52710 105702. 192220 ▇▇▂▂▃
3 num_children             0             1     2.28     1.36     1     1     2      3       6 ▇▁▂▁▁

Start with a Simple Linear Regression: Does Income Predict Food Consumption?


Call:
lm(formula = food_consumption ~ family_income, data = food.clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-4936.0 -2282.4   -48.4  1707.7  7372.6 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.155e+03  7.224e+02   5.752 5.99e-07 ***
family_income 6.416e-02  8.024e-03   7.996 2.24e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3043 on 48 degrees of freedom
Multiple R-squared:  0.5712,    Adjusted R-squared:  0.5622 
F-statistic: 63.93 on 1 and 48 DF,  p-value: 2.236e-10

The short answer to our question - YES! Family income does significantly predict food expenditures. The two are positively related with one another. For each additional dollar in income a family earns, they would be expected to spend an additional .06 (or 6 cents) on food. Relationship is highly significant (p<.001). The Adjusted R Squared for this model is .56, which means that family income by itself explains about 56% of the variation in food expenditures.

What About Kids? Do Things Change When We Account For Them?


Call:
lm(formula = food_consumption ~ family_income + num_children, 
    data = food.clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-4793.2 -1678.5  -128.3  1376.2  7379.1 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.963e+03  8.503e+02   3.485  0.00108 ** 
family_income 5.461e-02  8.631e-03   6.327  8.6e-08 ***
num_children  8.258e+02  3.448e+02   2.395  0.02068 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2903 on 47 degrees of freedom
Multiple R-squared:  0.6178,    Adjusted R-squared:  0.6015 
F-statistic: 37.98 on 2 and 47 DF,  p-value: 1.528e-10

It looks like kids make a big difference, too! For each additional child (num_children), we would expect a family to spend about $826 more per year on food. Since family_income is in this model, too, we can say that this relationship holds when controlling for family income. Another way to think about this: if you take two families with the same income, the family with 1 more child would be predicted to spend about $827 more on food per year. Notice, too, that the estimate for family_income hasn’t really changed much- it is about .05, down from .06 in our first model. And our new Adjusted R Squared value is .60, up from .56. So, adding number of kids helped us explain about 4% more of the variation in food expenditures.

Now - What About Families Who Have a Garden? Does That Change Things?


Call:
lm(formula = food_consumption ~ family_income + num_children + 
    have_garden_fac, data = food.clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-4618.9 -1616.2  -256.2  1505.4  7246.8 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         3.179e+03  9.613e+02   3.307  0.00183 ** 
family_income       5.414e-02  8.753e-03   6.185 1.53e-07 ***
num_children        8.036e+02  3.505e+02   2.293  0.02648 *  
have_garden_facYES -4.703e+02  9.456e+02  -0.497  0.62135    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2926 on 46 degrees of freedom
Multiple R-squared:  0.6198,    Adjusted R-squared:  0.595 
F-statistic:    25 on 3 and 46 DF,  p-value: 9.576e-10

What about gardens? Might they change how much a family spends on food? Short answer here - no, not AFTER controlling for family income and number of kids. Sidenote: we could test if gardens are related to food expenditures by themselves just by running a simple linear regression with only have_garden_fac as the predictor. But here, it doesn’t look like owning a garden is significantly related to food expenditures.

Getting Advanced - Does the Number of Children Serve as a Moderator of the Income/Food Relationship?


Call:
lm(formula = food_consumption ~ family_income + num_children + 
    family_income:num_children, data = food.clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-4833.2 -1654.1  -216.8  1476.2  7185.1 

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 1.424e+03  1.349e+03   1.055   0.2968    
family_income               7.644e-02  1.723e-02   4.436 5.68e-05 ***
num_children                1.512e+03  5.812e+02   2.602   0.0124 *  
family_income:num_children -8.101e-03  5.556e-03  -1.458   0.1516    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2869 on 46 degrees of freedom
Multiple R-squared:  0.6347,    Adjusted R-squared:  0.6109 
F-statistic: 26.64 on 3 and 46 DF,  p-value: 3.875e-10

OK, now for the fancy stuff- buckle up! The first three models we ran assumed that our predictors were additive, and not interactive. What does this mean? The relationship between num_children and food expenditures is the same, regardless of how much income your family has. And likewise, the relationship between family_income and food is the same, regardless of how many kids you have.

This is fine for a first step, but another hypothesis might be that these two variables INTERACT (or moderate) each other. So, the slope estimate of one is dependent on the value of the other. This is a more complex relationship between predictors. We specify an interaction with the : symbol as shown above. We want to include both main effects (so both variables are still in the equation), as WELL as their interaction family_income:num_children. If the estimate for this interaction term is signficant, then some moderation or interaction is likely happening.

In this case, we sadly don’t have enough evidence to say that interaction exists (p = .15). Power may be an issue as well - we only have 50 observations in our food dataset, and interaction effects are notoriously underpowered (meaning you usually need a much larger sample to find a true significant effect with any regularity).

Using the stargazer Package to Organize Your Results:

Last thing - let’s take our results and make them pretty! The stargazer package is a super simple and cool way to synthesize the results of a series of regression analyses. We named our models along the way (foodreg1, foodreg2, etc.), so using stargazer to collect the results is super easy. One more fun trick - see below where I include the out = "test.html" option. This creates an html table called “test” wherever you specified your working directory. Why is this cool? Microsoft Word can read html tables! So you can take this fancy table, open it up right in Word, and then paste it into your paper. #Boom

length of NULL cannot be changedlength of NULL cannot be changedlength of NULL cannot be changedlength of NULL cannot be changedlength of NULL cannot be changednumber of rows of result is not a multiple of vector length (arg 2)number of rows of result is not a multiple of vector length (arg 2)number of rows of result is not a multiple of vector length (arg 2)

=======================================================================
                                       Dependent variable:             
                           --------------------------------------------
                                         food_consumption              
                               (1)        (2)        (3)        (4)    
-----------------------------------------------------------------------
family_income                0.06***    0.05***    0.05***    0.08***  
                             (0.01)      (0.01)     (0.01)     (0.02)  
                                                                       
num_children                            825.78*    803.61*   1,512.28* 
                                        (344.84)   (350.48)   (581.23) 
                                                                       
have_garden_facYES                                 -470.26             
                                                   (945.63)            
                                                                       
family_income:num_children                                     -0.01   
                                                               (0.01)  
                                                                       
Constant                   4,155.21*** 2,962.95** 3,179.40**  1,423.79 
                            (722.43)    (850.26)   (961.33)  (1,349.23)
                                                                       
-----------------------------------------------------------------------
Observations                   50          50         50         50    
R2                            0.57        0.62       0.62       0.63   
Adjusted R2                   0.56        0.60       0.60       0.61   
=======================================================================
Note:                                     *p<0.05; **p<0.01; ***p<0.001
---
title: "Module 2 Handout - Review of Linear Regression"
author: "Dr. B"
output: 
  html_notebook:
    highlight: tango
---

# Load a Few Packages:
```{r}
library(tidyverse)
library(ggplot2)
library(broom)
library(skimr)
```

# Input the Food Dataset (from reading):
```{r}
food <- read_csv("food.csv")

view(food)
glimpse(food)
```

# Data Cleaning and Coding:
We use three functions to help us here. The first two are found in the `dplyr` package and the third comes from `base R`.  
\newline

1. `rename` is a handy function that will change the name of any variables in your dataset. The formula is `newname = oldname`. So, in this case, we are renaming our variables to get rid of capital letters and put spaces in between the words with underscores. 
\newline

2. `mutate` is another SUPER handy function that I use all the time. It allows us to generate a new variable based on a transformation (or mutation, think X-Men!) of an existing one. In this case, I am taking the original (renamed) variable `have_garden`, and I am creating a new variable called `have_garden_fac` that is equal to the old variable, but converted to a factor (that is what the `as_factor` function does). So now, `R` recognizes the difference between "Yes" and "No" for `have_garden_factor` as actually being two different levels of a categorical variable that I can use in frequency tables, regression models, etc.  
\newline

3. `na.omit` is another handy function that should come with a warning label!  This one should be used very carefully. What this function does is remove any observation (row) in a dataset with missing data. This is similar to listwise deletion if you have talked about that in a previous stats class. We use it here because our original `food` dataset had an extra row in it that was all missing. In this case, that was just an error - this wasn't an actual response. But, if you use `na.omit` on a dataset where you do have missing data, then any observation with any missingness will be removed. The nice part is we can always look back at `food` now that we have `food.clean` and see the difference- how many observations were dropped, etc. 

```{r}

food.clean <- food %>%
  rename(.,
         food_consumption = FOODCONSUMPTION,
         family_income = FAMILYINCOME,
         num_children = NUMCHULDREN,
         have_garden= HAVEGARDEN) %>%
  mutate(.,
         have_garden_fac = as_factor(have_garden)) %>%
  na.omit()

glimpse(food.clean)
```

# Use `skim` from the `skimr` package to get a nice summary:
```{r}
skim(food.clean)
```

# Start with a Simple Linear Regression: Does Income Predict Food Consumption?
```{r}
foodreg1 <- lm(food_consumption ~ family_income, data = food.clean)
summary(foodreg1)
```

\newline
The short answer to our question - YES! Family income does significantly predict food expenditures. The two are positively related with one another. For each additional dollar in income a family earns, they would be expected to spend an additional .06 (or 6 cents) on food. Relationship is highly significant (p<.001). The Adjusted R Squared for this model is .56, which means that family income by itself explains about 56% of the variation in food expenditures.
\newline

# What About Kids? Do Things Change When We Account For Them?
```{r}
foodreg2 <- lm(food_consumption ~ family_income + num_children, data = food.clean)
summary(foodreg2)
```

\newline
It looks like kids make a big difference, too! For each additional child (`num_children`), we would expect a family to spend about \$826 more per year on food. Since `family_income` is in this model, too, we can say that this relationship holds when controlling for family income. Another way to think about this: if you take two families with the same income, the family with 1 more child would be predicted to spend about \$827 more on food per year. 
\newline
Notice, too, that the estimate for `family_income` hasn't really changed much- it is about .05, down from .06 in our first model. And our new Adjusted R Squared value is .60, up from .56. So, adding number of kids helped us explain about 4% more of the variation in food expenditures. 
\newline

# Now - What About Families Who Have a Garden? Does That Change Things?
```{r}
foodreg3 <- lm(food_consumption ~ family_income + num_children + have_garden_fac, data = food.clean)
summary(foodreg3)
```

\newline
What about gardens? Might they change how much a family spends on food? Short answer here - no, not AFTER controlling for family income and number of kids. Sidenote: we could test if gardens are related to food expenditures by themselves just by running a simple linear regression with only `have_garden_fac` as the predictor. But here, it doesn't look like owning a garden is significantly related to food expenditures. 
\newline

# Getting Advanced - Does the Number of Children Serve as a Moderator of the Income/Food Relationship?
```{r}
foodreg4 <- lm(food_consumption ~ family_income + num_children + family_income:num_children, data = food.clean)
summary(foodreg4)
```

\newline
OK, now for the fancy stuff- buckle up! The first three models we ran assumed that our predictors were additive, and not interactive. What does this mean? The relationship between `num_children` and food expenditures is the same, regardless of how much income your family has. And likewise, the relationship between `family_income` and food is the same, regardless of how many kids you have. 

\newline
This is fine for a first step, but another hypothesis might be that these two variables INTERACT (or moderate) each other. So, the slope estimate of one is dependent on the value of the other. This is a more complex relationship between predictors. We specify an interaction with the `:` symbol as shown above. We want to include both main effects (so both variables are still in the equation), as WELL as their interaction `family_income:num_children`. If the estimate for this interaction term is signficant, then some moderation or interaction is likely happening. 

\newline
In this case, we sadly don't have enough evidence to say that interaction exists (p = .15). Power may be an issue as well - we only have 50 observations in our food dataset, and interaction effects are notoriously underpowered (meaning you usually need a much larger sample to find a true significant effect with any regularity). 

\newline

# Using the `stargazer` Package to Organize Your Results:
\newline
Last thing - let's take our results and make them pretty! The `stargazer` package is a super simple and cool way to synthesize the results of a series of regression analyses. We named our models along the way (`foodreg1, foodreg2`, etc.), so using `stargazer` to collect the results is super easy. 
\newline
One more fun trick - see below where I include the `out = "test.html"` option. This creates an html table called "test" wherever you specified your working directory. Why is this cool? Microsoft Word can read html tables! So you can take this fancy table, open it up right in Word, and then paste it into your paper. #Boom
\newline

```{r}
library(stargazer)
summary.table <- stargazer(foodreg1,
                           foodreg2,
                           foodreg3,
                           foodreg4,
                           type = "text",
                           font.size = "normalsize",
                           digits = 2,
                           keep.stat = c("n", "rsq", "adj.rsq"),
                           out = "test.html",
                           star.cutoffs = c(.05, .01, .001))
```