RFinalTest

Author

Jake, Mickey, Frank

1 Introduction

In this project we analyze real-world data on individuals' monthly cafe expenditures. The provided data set contains demographic information (age, sex, education), a location indicator (urban vs. non-urban), and neighborhood median income by ZIP code. The goal of this project is to explore which factors are associated with higher spending and to evaluate how well demographics and regional income predict cafe expenditure.

2 Libraries Used

library(tidyverse)
library(rio)
library(janitor)
library(skimr)
library(kableExtra)
library(corrplot)

3 Explanation of Libraries

3.1 tidyverse

A collection of R packages (such as dplyr and ggplot2) designed for cleaning and wrangling data and for presenting it visually.

3.2 rio

A package designed for importing or exporting different formats of data.

3.3 janitor

Tools for quickly cleaning data, especially column names that are difficult to work with.

3.4 skimr

Easy and readable summaries of data showcasing distributions, missing values, and basic stats.

3.5 kableExtra

Enhances tables created with kable so that they look nicer in HTML/Word/PDF reports using borders, colors, alignments, and other features.

3.6 corrplot

Creates correlation plots to help visualize relationships between numeric variables.

4 Data

4.1 Loading Data

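# Individual-level cafe expenditure data
DataCafeOrg = import("ProjectDataCafeNew.csv") |>
  clean_names("upper_camel")

# ZIP-level median income; rename Income to make its level explicit
DataZIPIncomeOrg = import("DataZIPIncomeNew-1.csv") |>
  clean_names("upper_camel") |>
  rename(IncomeZIP = Income)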

Two datasets are used. The first contains individual-level information on cafe expenditures and demographic characteristics (age, sex, education level, urban residence, ZIP code), while the second contains median household income at the ZIP-code level. The ‘import()’ function (from the rio package) loads the CSV files into R. The ‘clean_names()’ function (from the janitor package) standardizes column names into a consistent, readable format (not needed for this data, but included for practice). Lastly, we rename the income variable to ‘IncomeZIP’ to make clear that it represents ZIP-code-level income.
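As a small illustration of what ‘clean_names()’ and ‘rename()’ do, the sketch below applies them to a made-up data frame. ToyRaw, its column names, and its values are hypothetical and are not taken from the project files.

```r
library(dplyr)
library(janitor)

# Hypothetical raw data with inconsistent column names (not the project data)
ToyRaw = data.frame(
  `zip code` = c(19104, 19105),
  income = c(80, 70),
  check.names = FALSE
)

# Standardize names to UpperCamel, then give the income column a clearer name
ToyClean = ToyRaw |>
  clean_names(case = "upper_camel") |>
  rename(IncomeZIP = Income)

names(ToyClean)
```

The cleaning step turns "zip code" into ZipCode and "income" into Income, which is why the later ‘rename()’ call refers to Income with a capital letter.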

4.2 Selecting Variables

DataCafe = DataCafeOrg |>
  select(CafeExp, Age, SexMale, UrbanYes, YearsAfterHighschool, Zip)

kable(head(DataCafe))
CafeExp Age SexMale UrbanYes YearsAfterHighschool Zip
75.51034 18.71554 1 1 0.7259538 19104
85.95790 49.96018 0 0 32.2933391 19105
91.17824 39.24468 0 0 21.4324698 19105
88.10348 26.32100 1 1 8.3337170 19104
128.30616 59.77155 1 1 41.9278606 19102
97.90179 28.56631 0 1 10.8204892 19104

After loading the data, we select a subset of variables to simplify the dataset (keeping only the variables we need) and make it easier to explore and model. The ‘select()’ function is used to pick the specified variables: CafeExp (cafe expenditures), Age, SexMale (1 = male, 0 = female), UrbanYes (1 = urban residence, 0 = non-urban), YearsAfterHighschool (years since high school), and Zip (ZIP code). We then use ‘kable()’ on the first rows returned by ‘head()’ to check the resulting dataset visually.

4.3 Joining Tables

DataCafeWithInc = left_join(
  DataCafe, DataZIPIncomeOrg, 
  join_by(Zip == Zip)
)

kable(head(DataCafeWithInc))
CafeExp Age SexMale UrbanYes YearsAfterHighschool Zip IncomeZIP
75.51034 18.71554 1 1 0.7259538 19104 80
85.95790 49.96018 0 0 32.2933391 19105 70
91.17824 39.24468 0 0 21.4324698 19105 70
88.10348 26.32100 1 1 8.3337170 19104 80
128.30616 59.77155 1 1 41.9278606 19102 90
97.90179 28.56631 0 1 10.8204892 19104 80

To include the neighborhood income information with our individual-level cafe data, we merge the two data sets with the ‘left_join()’ function, matching observations on the shared ZIP code variable. We choose ‘left_join()’ over ‘inner_join()’ because we want to retain every individual in the cafe data set, even if some ZIP codes have no income information attached. We use ‘kable()’ to verify that the combined dataset contains the columns of both tables and is ready for analysis.
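The difference between the two join types can be sketched on toy data. The tables below are hypothetical (the Person column and ZIP 19999 are invented for illustration) and only mimic the structure of the project data.

```r
library(dplyr)

# Hypothetical toy tables (not the project data)
ToyCafe = data.frame(Person = c("A", "B", "C"), Zip = c(19104, 19105, 19999))
ToyIncome = data.frame(Zip = c(19104, 19105), IncomeZIP = c(80, 70))

# left_join() keeps all three people; the unmatched ZIP gets NA income
ToyLeft = left_join(ToyCafe, ToyIncome, join_by(Zip))

# inner_join() drops the person whose ZIP has no income record
ToyInner = inner_join(ToyCafe, ToyIncome, join_by(Zip))

nrow(ToyLeft)                  # 3
nrow(ToyInner)                 # 2
sum(is.na(ToyLeft$IncomeZIP))  # 1
```

Counting the NA values after a left join, as in the last line, is a quick way to see how many individuals lack ZIP-level income.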

5 Descriptive Statistics

5.1 N, Min, Max, Mean, SD, etc.

DataCafeNum = DataCafeWithInc |>
  select(-Zip)

n_total = nrow(DataCafeNum)

DescrStatc = skim(DataCafeNum) |>
  mutate (
    N = n_total - n_missing
  ) |>
  select(
    Variable = skim_variable,
    Missing = n_missing,
    N,
    Mean = numeric.mean,
    SD = numeric.sd, 
    Min = numeric.p0,
    Max = numeric.p100
  )

kable(DescrStatc)
Variable Missing N Mean SD Min Max
CafeExp 0 1600 100.388262 26.5729174 1.7821390 183.74515
Age 0 1600 37.069247 10.5231428 18.0206520 59.91429
SexMale 0 1600 0.444375 0.4970516 0.0000000 1.00000
UrbanYes 0 1600 0.695625 0.4602861 0.0000000 1.00000
YearsAfterHighschool 0 1600 19.572534 10.5278207 0.0895441 42.69667
IncomeZIP 0 1600 79.943750 14.4893253 50.0000000 110.00000

This R code removes the Zip column and then calculates basic descriptive statistics for the remaining numeric variables. First, it counts the rows in the data. Then it uses the skim() function to summarize each variable, including missing values, mean, standard deviation, minimum, and maximum. The code also computes the number of non-missing values (N) for each variable as the row count minus the missing count. Finally, it selects the relevant columns and prints them as a clean table using kable().

On average, people in the sample are about 37 years old and roughly 20 years out of high school, and they spend about 100 dollars per month at cafes (SD of about 27 dollars). The variables SexMale and UrbanYes are coded as 0 or 1, so their means are proportions: about 44% of the sample is male, and about 70% live in urban areas. Income by ZIP code ranges from 50,000 dollars to 110,000 dollars, with an average of about 80,000 dollars (IncomeZIP is recorded in thousands of dollars).
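As a cross-check on the summary table, the same statistics can be computed with base R alone. This is a minimal sketch on hypothetical toy data (ToyData is invented), not a replacement for the ‘skim()’ workflow above. It also shows why the mean of a 0/1 variable can be read as a proportion.

```r
# Hypothetical toy data (not the project data)
ToyData = data.frame(
  CafeExp = c(80, 100, 120, 90),
  UrbanYes = c(0, 1, 1, 1)
)

# One row per variable: non-missing count, mean, SD, min, max
ToyStats = data.frame(
  Variable = names(ToyData),
  N = sapply(ToyData, function(x) sum(!is.na(x))),
  Mean = sapply(ToyData, mean, na.rm = TRUE),
  SD = sapply(ToyData, sd, na.rm = TRUE),
  Min = sapply(ToyData, min, na.rm = TRUE),
  Max = sapply(ToyData, max, na.rm = TRUE),
  row.names = NULL
)

ToyStats

# For a 0/1 variable the mean equals the share of ones: 3 of 4 rows are urban
mean(ToyData$UrbanYes)  # 0.75
```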

5.2 Group and Summarize (Pivot Table)

CafePivot = DataCafeWithInc |>
  group_by(UrbanYes, SexMale ) |>
  summarise (
    N= n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )

kable(CafePivot)
UrbanYes SexMale N MeanCafeExp SDCafeExp
0 0 487 81.49312 23.57402
1 0 402 103.50931 23.05510
1 1 711 111.56586 23.11128

This code groups people by whether they live in an urban area and whether they are male, then calculates the number of people in each group, their average cafe spending, and the standard deviation of that spending. The table has only three rows because the sample contains no non-urban males: all 711 males live in urban areas. Non-urban females spend the least on cafe purchases (about 81 dollars on average), urban females spend about 104 dollars, and urban males spend the most (about 112 dollars).
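The sketch below reproduces the grouping pattern on hypothetical toy data (ToyGroups is invented) in which one sex-by-location combination never occurs. It is meant only to show that ‘group_by()’ with ‘summarise()’ returns rows for the combinations that exist, so a missing row signals an empty group rather than an error.

```r
library(dplyr)

# Hypothetical toy data with no non-urban males (UrbanYes = 0, SexMale = 1)
ToyGroups = data.frame(
  UrbanYes = c(0, 0, 1, 1, 1, 1),
  SexMale  = c(0, 0, 0, 0, 1, 1),
  CafeExp  = c(80, 84, 100, 104, 110, 114)
)

ToyPivot = ToyGroups |>
  group_by(UrbanYes, SexMale) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    .groups = "drop"  # return an ungrouped table and silence the grouping message
  )

ToyPivot  # three rows; the empty combination is simply absent
```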

5.2.1 Average Cafe Expenditure by ZIP Code

TableAvgCafeZip = DataCafeWithInc |> 
  group_by(Zip) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )

kable(TableAvgCafeZip)
Zip N MeanCafeExp SDCafeExp
19101 157 79.95523 24.13239
19102 230 111.69190 20.53935
19103 92 69.05188 25.33880
19104 631 103.28528 22.98335
19105 238 87.31683 20.34787
19106 167 118.35366 23.44256
19107 85 121.25712 21.52090

The code groups the dataset by ZIP code and gives three measures of cafe spending within each ZIP: the number of observations, the average cafe expenditure, and the standard deviation of the cafe expenditures. Average spending varies considerably between ZIP codes, from about 69 dollars in 19103 to about 121 dollars in 19107, indicating that location can play an important role in consumer spending patterns.

5.2.2 Average Cafe Expenditure by ZIP Code & UrbanYes

TableAvgCafeZipUrban = DataCafeWithInc |>
  group_by(Zip, UrbanYes) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )

kable(TableAvgCafeZipUrban)
Zip UrbanYes N MeanCafeExp SDCafeExp
19101 0 157 79.95523 24.13239
19102 1 230 111.69190 20.53935
19103 0 92 69.05188 25.33880
19104 1 631 103.28528 22.98335
19105 0 238 87.31683 20.34787
19106 1 167 118.35366 23.44256
19107 1 85 121.25712 21.52090

The code groups the data by ZIP code combined with whether the person lives in an urban area. For each group, it calculates how many people are in that group, the average cafe spending, and the standard deviation of that spending. The table has the same seven rows as the ZIP-only table because every ZIP code is either entirely urban or entirely non-urban, so urban and non-urban residents cannot be compared within the same ZIP code. Comparing across ZIP codes, the four urban ZIP codes (19102, 19104, 19106, 19107) have average spending between about 103 and 121 dollars, while the three non-urban ZIP codes (19101, 19103, 19105) average between about 69 and 87 dollars.
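One way to confirm whether urban status varies inside a ZIP code is to count the distinct UrbanYes values per ZIP. The sketch below does this on hypothetical toy data; ToyZip and the column name NUrbanValues are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data (not the project data)
ToyZip = data.frame(
  Zip = c(19101, 19101, 19102, 19102, 19103),
  UrbanYes = c(0, 0, 1, 1, 0)
)

# Count how many different UrbanYes values appear within each ZIP code
UrbanPerZip = ToyZip |>
  group_by(Zip) |>
  summarise(NUrbanValues = n_distinct(UrbanYes))

# TRUE means every ZIP code is entirely urban or entirely non-urban
all(UrbanPerZip$NUrbanValues == 1)
```

When this check returns TRUE, grouping by Zip and UrbanYes together cannot produce more rows than grouping by Zip alone.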

6 Correlated Variables

DataCafeNum = DataCafeWithInc |>
  select(-Zip)

MatrixCor = cor(DataCafeNum)

corrplot(MatrixCor)

This correlation plot shows that Age and YearsAfterHighschool are almost perfectly correlated, which is expected because both increase together as a person gets older. CafeExp is positively associated with both Age and IncomeZIP, indicating that older individuals and those living in higher-income ZIP codes tend to spend more at cafes. UrbanYes is strongly correlated with IncomeZIP, meaning urban ZIP codes generally have higher average income levels. SexMale shows weaker correlations with most variables, although the group table in Section 5.2 implies a positive relationship with UrbanYes, since every male in the sample lives in an urban area. Overall, demographic and location-related factors help explain variation in CafeExp.
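Because the plot encodes correlations as colors and sizes, it can help to look at the underlying numbers as well. The sketch below builds a small correlation matrix on hypothetical toy data (ToyCor and its values are invented) in which years after high school track age closely, mimicking the near-perfect correlation described above.

```r
# Hypothetical toy data (not the project data)
ToyCor = data.frame(
  Age = c(20, 25, 30, 40, 50, 60),
  YearsAfterHighschool = c(2.1, 6.8, 12.2, 21.9, 32.0, 42.1),
  SexMale = c(1, 0, 1, 0, 0, 1)
)

# Pearson correlations between every pair of columns
ToyMatrix = cor(ToyCor)

# Rounded values are easier to read than the plot's color scale
round(ToyMatrix, 2)
```

A pair of predictors with a correlation this close to 1 carries nearly the same information, which is worth keeping in mind when both enter the same regression.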

7 Linear Regression

7.1 All Predictors

7.1.1 Unfitted Model

CafeExp = β₀ + β₁(Age) + β₂(SexMale) + β₃(UrbanYes) + β₄(YearsAfterHighschool) + β₅(IncomeZIP)

ModelLMAll = lm(CafeExp ~ ., data = DataCafeNum)
summary(ModelLMAll)

Call:
lm(formula = CafeExp ~ ., data = DataCafeNum)

Residuals:
    Min      1Q  Median      3Q     Max 
-89.530 -14.665   0.587  15.154  66.153 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -22.38384   34.91855  -0.641    0.522    
Age                    3.05243    1.96592   1.553    0.121    
SexMale               -0.40002    1.61633  -0.247    0.805    
UrbanYes              10.00672    1.93032   5.184 2.45e-07 ***
YearsAfterHighschool  -2.81462    1.96510  -1.432    0.152    
IncomeZIP              0.72460    0.07064  10.257  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.38 on 1594 degrees of freedom
Multiple R-squared:  0.2928,    Adjusted R-squared:  0.2906 
F-statistic:   132 on 5 and 1594 DF,  p-value: < 2.2e-16

7.1.2 Fitted Model

CafeExp = −22.38 + 3.05(Age) − 0.40(SexMale) + 10.01(UrbanYes) − 2.81(YearsAfterHighschool) + 0.72(IncomeZIP)

7.2 Selected Predictors

ModelLMSelect = lm(CafeExp ~ UrbanYes + IncomeZIP, data = DataCafeNum)
summary(ModelLMSelect)

Call:
lm(formula = CafeExp ~ UrbanYes + IncomeZIP, data = DataCafeNum)

Residuals:
    Min      1Q  Median      3Q     Max 
-87.396 -15.034   0.825  14.715  66.001 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.20036    3.99762   9.055  < 2e-16 ***
UrbanYes     9.64866    1.93139   4.996 6.51e-07 ***
IncomeZIP    0.71896    0.06136  11.718  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.51 on 1597 degrees of freedom
Multiple R-squared:  0.283, Adjusted R-squared:  0.2821 
F-statistic: 315.2 on 2 and 1597 DF,  p-value: < 2.2e-16

7.2.1 Unfitted Model

CafeExp = β₀ + β₁(UrbanYes) + β₂(IncomeZIP)

7.2.2 Fitted Model

CafeExp = 36.2 + 9.65(UrbanYes) + 0.72(IncomeZIP)
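To make the fitted equation concrete, the sketch below plugs values into it by hand. The coefficients are the rounded estimates from the equation above; the function name PredictCafeExp and the example inputs are hypothetical and chosen only for illustration.

```r
# Rounded coefficients from the fitted model with selected predictors
B0 = 36.2
BUrban = 9.65
BIncome = 0.72

# Predicted cafe expenditure for a given urban status and ZIP income (in thousands)
PredictCafeExp = function(UrbanYes, IncomeZIP) {
  B0 + BUrban * UrbanYes + BIncome * IncomeZIP
}

PredictCafeExp(UrbanYes = 1, IncomeZIP = 80)  # 103.45
PredictCafeExp(UrbanYes = 0, IncomeZIP = 80)  # 93.8
```

The two predictions differ by exactly the UrbanYes coefficient (9.65), which is the model's estimated urban premium at any fixed income level.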

8 Summary

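In the model with selected predictors, the intercept of 36.2 dollars is the predicted cafe expenditure for a non-urban resident in a ZIP code with a median income of zero; it is a baseline for the equation, not the sample average (which is about 100 dollars). Holding ZIP code income constant, urban residents spend about 9.65 dollars more on average than those in non-urban areas. Furthermore, holding urban status constant, every additional 1,000 dollars of median ZIP code income is associated with an average increase of about 72 cents in cafe spending. Both predictors are statistically significant (p < 0.001), and the model explains about 28% of the variation in cafe expenditure. In the model with all predictors, Age, SexMale, and YearsAfterHighschool were not statistically significant.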