RFinalTest

Author

Jake, Mickey, Frank

1 Introduction

In this project we analyze real-world data on individuals' monthly cafe expenditures. The provided data set contains demographic information (age, sex, education), a location indicator (urban vs. non-urban), and neighborhood median income by ZIP code. The goal of this project is to explore which factors are associated with higher spending and to evaluate how well demographic and regional income variables predict cafe expenditures.

2 Libraries Used

library(tidyverse)
library(rio)
library(janitor)
library(skimr)
library(kableExtra)
library(corrplot)

3 Explanation of Libraries

3.1 tidyverse

A collection of R packages (such as dplyr and ggplot2) designed for cleaning, wrangling, and transforming data and for visualizing the results.

3.2 rio

A package designed for importing or exporting different formats of data.

3.3 janitor

Tools for quickly cleaning data, especially column names that are difficult to work with.

3.4 skimr

Produces easy-to-read summaries of a data set, including distributions, missing values, and basic statistics.

3.5 kableExtra

Enhances tables created with kable so that they look nicer in HTML/Word/PDF reports using borders, colors, alignments, and other features.

3.6 corrplot

Creates correlation plots to help visualize relationships between numeric variables.

4 Data

4.1 Loading Data

DataCafeOrg = import("ProjectDataCafeNew.csv") |>
  clean_names("upper_camel")
DataZIPIncomeOrg = import("DataZIPIncomeNew-1.csv") |>
  clean_names("upper_camel") |>
  rename(IncomeZIP = Income)

Two datasets are used: the first contains individual-level information on cafe expenditures and demographic characteristics (age, sex, education level, urban residence, ZIP code), while the second contains median household income at the ZIP-code level. The ‘import()’ function (from the rio package) loads the CSV files into R. The ‘clean_names()’ function (from the janitor package) standardizes column names into a consistent format, here UpperCamel case (not needed for this data but included for practice). Lastly, we rename the income variable to ‘IncomeZIP’ to indicate that it represents ZIP-code-level income.
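As a small illustration of what ‘clean_names()’ does when column names are messy, the sketch below applies it to a hypothetical data frame. `ToyRaw` and its column names are invented for this example and are not taken from the project files.

```r
library(janitor)

# Hypothetical raw data with inconsistent column names (not the project data).
ToyRaw = data.frame(
  `cafe exp` = c(75.5, 86.0),
  `sex_male` = c(1, 0),
  `zip code` = c(19104, 19105),
  check.names = FALSE
)

# Convert every column name to UpperCamel case, as in the loading step.
ToyClean = clean_names(ToyRaw, case = "upper_camel")

names(ToyClean)
```

The cleaned names follow the same UpperCamel pattern as the variables used in the rest of the report, which is why later code can refer to columns such as CafeExp without quoting.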

4.2 Selecting Variables

DataCafe = DataCafeOrg |>
  select(CafeExp, Age, SexMale, UrbanYes, YearsAfterHighschool, Zip)

kable(head(DataCafe))

CafeExp	Age	SexMale	UrbanYes	YearsAfterHighschool	Zip
75.51034	18.71554	1	1	0.7259538	19104
85.95790	49.96018	0	0	32.2933391	19105
91.17824	39.24468	0	0	21.4324698	19105
88.10348	26.32100	1	1	8.3337170	19104
128.30616	59.77155	1	1	41.9278606	19102
97.90179	28.56631	0	1	10.8204892	19104

After loading the data, we keep only the variables needed for the analysis, which simplifies the dataset and makes it easier to explore and model. The ‘select()’ function is used to select the specified variables: CafeExp (cafe expenditures), Age, SexMale (1 = male, 0 = female), UrbanYes (1 = urban residence, 0 = non-urban), YearsAfterHighschool (years after high school), and Zip (ZIP code). We then use ‘kable()’ with ‘head()’ to display the first six rows of the resulting dataset.

4.3 Joining Tables

DataCafeWithInc = left_join(
  DataCafe, DataZIPIncomeOrg, 
  join_by(Zip == Zip)
)

kable(head(DataCafeWithInc))

CafeExp	Age	SexMale	UrbanYes	YearsAfterHighschool	Zip	IncomeZIP
75.51034	18.71554	1	1	0.7259538	19104	80
85.95790	49.96018	0	0	32.2933391	19105	70
91.17824	39.24468	0	0	21.4324698	19105	70
88.10348	26.32100	1	1	8.3337170	19104	80
128.30616	59.77155	1	1	41.9278606	19102	90
97.90179	28.56631	0	1	10.8204892	19104	80

To add the neighborhood income information to our individual-level cafe data, we merge the two datasets with ‘left_join()’, matching observations on the ZIP code variable. We choose ‘left_join()’ over ‘inner_join()’ because we want to retain every individual in the cafe dataset, even if some ZIP codes have no income information; such rows would simply get a missing IncomeZIP value. We use ‘kable()’ to verify that the combined dataset contains the columns from both datasets and is ready for analysis.
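The difference between the two joins can be sketched on toy data. The data frames below are hypothetical (in particular, ZIP code 19199 is invented so that one row has no income match); they only mimic the column names used in this report.

```r
library(dplyr)

# Hypothetical toy data: ZIP 19199 has no matching income record.
ToyCafe = data.frame(CafeExp = c(75.5, 86.0, 91.2), Zip = c(19104, 19105, 19199))
ToyIncome = data.frame(Zip = c(19104, 19105), IncomeZIP = c(80, 70))

# left_join keeps all three individuals; the unmatched one gets NA for IncomeZIP.
ToyLeft = left_join(ToyCafe, ToyIncome, join_by(Zip))

# inner_join drops the individual whose ZIP code has no income record.
ToyInner = inner_join(ToyCafe, ToyIncome, join_by(Zip))

# anti_join lists the rows that found no match, a quick check after joining.
ToyUnmatched = anti_join(ToyCafe, ToyIncome, join_by(Zip))

nrow(ToyLeft)                  # 3
nrow(ToyInner)                 # 2
sum(is.na(ToyLeft$IncomeZIP))  # 1
ToyUnmatched$Zip               # 19199
```

Running the same kind of ‘anti_join()’ check on the real data would show whether any ZIP code in the cafe data lacks an income record.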

5 Descriptive Statistics

5.1 N, Min, Max, Mean, SD, etc.

DataCafeNum = DataCafeWithInc |>
  select(-Zip)

n_total = nrow(DataCafeNum)

DescrStatc = skim(DataCafeNum) |>
  mutate(
    N = n_total - n_missing
  ) |>
  select(
    Variable = skim_variable,
    Missing = n_missing,
    N,
    Mean = numeric.mean,
    SD = numeric.sd, 
    Min = numeric.p0,
    Max = numeric.p100
  )

kable(DescrStatc)

Variable	N	Mean	SD	Min	Max
CafeExp	1600	100.388262	26.5729174	1.7821390	183.74515
Age	1600	37.069247	10.5231428	18.0206520	59.91429
SexMale	1600	0.444375	0.4970516	0.0000000	1.00000
UrbanYes	1600	0.695625	0.4602861	0.0000000	1.00000
YearsAfterHighschool	1600	19.572534	10.5278207	0.0895441	42.69667
IncomeZIP	1600	79.943750	14.4893253	50.0000000	110.00000

This R code takes a dataset, removes the Zip column, and then calculates basic descriptive statistics for all the remaining numeric variables. First, it counts how many rows are in the data. Then it uses the skim() function to create a summary of each numeric variable, including missing values, mean, standard deviation, minimum, and maximum. The code also calculates how many applicable (not missing) values each variable has. Finally, it selects the important pieces of information and prints it as a clean table using kable().

On average, people in the sample are 37 years old and are roughly 20 years out of high school. The variables SexMale and UrbanYes are coded as 0 or 1, so their means are proportions: about 44% of the sample is male, and about 70% live in urban areas. Interpreting IncomeZIP in thousands of dollars, income by ZIP code ranges from 50,000 dollars to 110,000 dollars, with an average of about 80,000 dollars.
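The claim that the mean of a 0/1 variable is a proportion can be checked by hand. The sketch below uses the group counts reported in the table of Section 5.2 (487 non-urban females, 402 urban females, 711 urban males); `SDIndicator` is a helper name introduced here for illustration.

```r
# Counts copied from the grouped table in Section 5.2.
NTotal = 487 + 402 + 711   # all individuals in the sample
NMale = 711                # the table lists urban males only
NUrban = 402 + 711         # urban females plus urban males

# The mean of a 0/1 indicator is the share of ones.
PropMale = NMale / NTotal
PropUrban = NUrban / NTotal

# For a 0/1 variable, the sample SD follows from the proportion alone.
SDIndicator = function(p, n) sqrt(p * (1 - p) * n / (n - 1))

PropMale                                   # 0.444375
PropUrban                                  # 0.695625
round(SDIndicator(PropMale, NTotal), 4)    # 0.4971
round(SDIndicator(PropUrban, NTotal), 4)   # 0.4603
```

These values agree with the Mean and SD columns for SexMale and UrbanYes in the descriptive statistics table.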

5.2 Group and Summarize (Pivot Table)

CafePivot = DataCafeWithInc |>
  group_by(UrbanYes, SexMale) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )

kable(CafePivot)

UrbanYes	SexMale	N	MeanCafeExp	SDCafeExp
0	0	487	81.49312	23.57402
1	0	402	103.50931	23.05510
1	1	711	111.56586	23.11128

This code groups people by whether they live in an urban area and whether they are male, then calculates how many people are in each group, their average café spending, and how much that spending varies. The table shows that non-urban females spend the least on café purchases (about 81 dollars on average), while urban males spend the most (about 112 dollars). The table has only three rows because the sample contains no non-urban males; the three group sizes add up to all 1,600 observations.
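As a consistency check, the group means in the table can be recombined into the overall mean. The sketch below copies N and MeanCafeExp from the table; the point is that groups must be weighted by their size, since a plain average of the three means gives a different number.

```r
# Group sizes and mean expenditures copied from the table above.
GroupN = c(487, 402, 711)
GroupMean = c(81.49312, 103.50931, 111.56586)

# Weighting each group mean by its size recovers the overall sample mean.
OverallMean = weighted.mean(GroupMean, GroupN)

# An unweighted average treats the three groups as equally large.
SimpleMean = mean(GroupMean)

round(OverallMean, 3)  # 100.388
round(SimpleMean, 3)   # 98.856
```

The weighted result matches the mean of CafeExp in Section 5.1 (100.388262), which suggests the grouped table covers the whole sample.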

5.2.1 Average Cafe Expenditure by ZIP Code

TableAvgCafeZip = DataCafeWithInc |> 
  group_by(Zip) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )

kable(TableAvgCafeZip)

Zip	N	MeanCafeExp	SDCafeExp
19101	157	79.95523	24.13239
19102	230	111.69190	20.53935
19103	92	69.05188	25.33880
19104	631	103.28528	22.98335
19105	238	87.31683	20.34787
19106	167	118.35366	23.44256
19107	85	121.25712	21.52090

The code groups the dataset by ZIP code and gives three measures of café spending within each ZIP code: the number of observations, the average café expenditure, and the standard deviation of the café expenditures. The results show that average café spending varies between ZIP codes, from about 69 dollars in 19103 to about 121 dollars in 19107, indicating that location can play an important role in consumer spending patterns.

5.2.2 Average Cafe Expenditure by ZIP Code & UrbanYes

TableAvgCafeZipUrban = DataCafeWithInc |>
  group_by(Zip, UrbanYes) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )

kable(TableAvgCafeZipUrban)

Zip	UrbanYes	N	MeanCafeExp	SDCafeExp
19101	0	157	79.95523	24.13239
19102	1	230	111.69190	20.53935
19103	0	92	69.05188	25.33880
19104	1	631	103.28528	22.98335
19105	0	238	87.31683	20.34787
19106	1	167	118.35366	23.44256
19107	1	85	121.25712	21.52090

The code groups the data by ZIP code combined with whether the person lives in an urban area. For each group, it calculates how many people are in that group, the average café spending, and how much that spending varies. The result has the same seven rows as the ZIP-only table, which shows that every ZIP code is either entirely urban or entirely non-urban, so urban and non-urban residents cannot be compared within the same ZIP code. Across ZIP codes, the four urban ZIP codes (19102, 19104, 19106, 19107) have average spending between about 103 and 121 dollars, while the three non-urban ZIP codes (19101, 19103, 19105) average between about 69 and 87 dollars.
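The table above can itself be summarized to compare urban and non-urban ZIP codes. The sketch below re-enters the table values as a small data frame (`ZipTable` and the output column names are introduced here for illustration) and pools the ZIP-level means, weighting by group size.

```r
library(dplyr)

# Values copied from the table above.
ZipTable = data.frame(
  Zip = 19101:19107,
  UrbanYes = c(0, 1, 0, 1, 0, 1, 1),
  N = c(157, 230, 92, 631, 238, 167, 85),
  MeanCafeExp = c(79.95523, 111.69190, 69.05188, 103.28528,
                  87.31683, 118.35366, 121.25712)
)

# Each ZIP code appears with exactly one UrbanYes value.
UrbanValuesPerZip = ZipTable |>
  group_by(Zip) |>
  summarise(NUrbanValues = n_distinct(UrbanYes))

# Pool the ZIP-level means into one mean per urban status, weighted by N.
UrbanComparison = ZipTable |>
  group_by(UrbanYes) |>
  summarise(
    NZip = n(),
    TotalN = sum(N),
    PooledMean = weighted.mean(MeanCafeExp, N)
  )

UrbanComparison
```

Under these table values the pooled mean is about 81.49 for non-urban ZIP codes and about 108.66 for urban ZIP codes, a gap that is larger than the UrbanYes coefficient in Section 7 because urban ZIP codes also differ in income.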

6 Correlated Variables

DataCafeNum = DataCafeWithInc |>
  select(-Zip)

MatrixCor = cor(DataCafeNum)

corrplot(MatrixCor)

This correlation plot shows that Age and YearsAfterHighschool are almost perfectly correlated, which is expected because both increase together as a person gets older. CafeExp is positively associated with both Age and IncomeZIP, indicating that older individuals and those living in higher-income ZIP codes tend to spend more at cafes. UrbanYes is strongly correlated with IncomeZIP, meaning urban ZIP codes generally have higher median income levels. SexMale appears only weakly correlated with most variables, but it is not independent of location: the grouped table in Section 5.2 contains no non-urban males, so SexMale and UrbanYes are positively related. Overall, demographic and location-related factors are associated with variation in CafeExp, although correlations alone cannot separate the effects of predictors that move together.
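One entry of the correlation matrix can be reproduced without the raw data, because SexMale and UrbanYes are both 0/1 variables whose joint counts appear in Section 5.2. The sketch below rebuilds the two indicators from those counts; only the ordering of the rows is arbitrary, which does not affect the correlation.

```r
# Group counts from Section 5.2:
# 487 non-urban females, 402 urban females, 711 urban males.
GroupN = c(487, 402, 711)

# Rebuild the two indicators, one value per individual.
SexMale = rep(c(0, 0, 1), times = GroupN)
UrbanYes = rep(c(0, 1, 1), times = GroupN)

# Cross-tabulation: the cell for non-urban males is empty.
table(UrbanYes, SexMale)

# Pearson correlation between the two indicators.
round(cor(SexMale, UrbanYes), 2)  # 0.59
```

A correlation of about 0.59 is worth keeping in mind when reading the regression output, since part of any apparent difference between males and females may reflect urban residence.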

7 Linear Regression

7.1 All Predictors

7.1.1 Unfitted Model

CafeExp = β₀ + β₁(Age) + β₂(SexMale) + β₃(UrbanYes) + β₄(YearsAfterHighschool) + β₅(IncomeZIP)

ModelLMAll = lm(CafeExp ~ ., data = DataCafeNum)
summary(ModelLMAll)


Call:
lm(formula = CafeExp ~ ., data = DataCafeNum)

Residuals:
    Min      1Q  Median      3Q     Max 
-89.530 -14.665   0.587  15.154  66.153 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          -22.38384   34.91855  -0.641    0.522    
Age                    3.05243    1.96592   1.553    0.121    
SexMale               -0.40002    1.61633  -0.247    0.805    
UrbanYes              10.00672    1.93032   5.184 2.45e-07 ***
YearsAfterHighschool  -2.81462    1.96510  -1.432    0.152    
IncomeZIP              0.72460    0.07064  10.257  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.38 on 1594 degrees of freedom
Multiple R-squared:  0.2928,    Adjusted R-squared:  0.2906 
F-statistic:   132 on 5 and 1594 DF,  p-value: < 2.2e-16

7.1.2 Fitted Model

CafeExp = −22.38 + 3.05(Age) − 0.40(SexMale) + 10.01(UrbanYes) − 2.81(YearsAfterHighschool) + 0.72(IncomeZIP)
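In the coefficient table above, Age and YearsAfterHighschool both have standard errors near 1.97 and neither is significant, even though the two variables are almost perfectly correlated with each other (Section 6). One common explanation for this pattern is collinearity. The sketch below uses simulated data only: the sample size, the noise levels, and the outcome `Spend` are all invented, and the numbers it prints are not results from the cafe data.

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Simulated (hypothetical) data: the second predictor is Age minus 18 plus small noise.
N = 500
Age = runif(N, min = 18, max = 60)
YearsAfterHighschool = Age - 18 + rnorm(N, sd = 0.3)
Spend = 100 + 0.2 * Age + rnorm(N, sd = 20)  # invented outcome

# Variance inflation factor for Age: 1 / (1 - R^2),
# where R^2 comes from regressing Age on the other predictor.
RSquared = summary(lm(Age ~ YearsAfterHighschool))$r.squared
VIFAge = 1 / (1 - RSquared)

# Standard error of the Age coefficient with and without the collinear predictor.
SEBoth = summary(lm(Spend ~ Age + YearsAfterHighschool))$coefficients["Age", "Std. Error"]
SEAlone = summary(lm(Spend ~ Age))$coefficients["Age", "Std. Error"]

VIFAge
SEBoth / SEAlone
```

When two predictors carry nearly the same information, the model cannot tell their effects apart, so each coefficient is estimated imprecisely. Keeping only one of the two variables is a common way to handle this.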

7.2 Selected Predictors

ModelLMSelect = lm(CafeExp ~ UrbanYes + IncomeZIP, data = DataCafeNum)
summary(ModelLMSelect)


Call:
lm(formula = CafeExp ~ UrbanYes + IncomeZIP, data = DataCafeNum)

Residuals:
    Min      1Q  Median      3Q     Max 
-87.396 -15.034   0.825  14.715  66.001 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.20036    3.99762   9.055  < 2e-16 ***
UrbanYes     9.64866    1.93139   4.996 6.51e-07 ***
IncomeZIP    0.71896    0.06136  11.718  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.51 on 1597 degrees of freedom
Multiple R-squared:  0.283, Adjusted R-squared:  0.2821 
F-statistic: 315.2 on 2 and 1597 DF,  p-value: < 2.2e-16

7.2.1 Unfitted Model

CafeExp = β₀ + β₁(UrbanYes) + β₂(IncomeZIP)

7.2.2 Fitted Model

CafeExp = 36.2 + 9.65(UrbanYes) + 0.72(IncomeZIP)
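To see what the fitted equation implies, it can be evaluated at a few inputs. The sketch below copies the coefficients from the ‘summary(ModelLMSelect)’ output; `PredictCafeExp` is a helper name introduced here, and the two example profiles use IncomeZIP values that appear in the joined data (70 and 80).

```r
# Coefficients copied from the summary(ModelLMSelect) output.
Coef = c(Intercept = 36.20036, UrbanYes = 9.64866, IncomeZIP = 0.71896)

# Hypothetical helper: predicted CafeExp for a given urban status and ZIP income.
PredictCafeExp = function(UrbanYes, IncomeZIP) {
  unname(Coef["Intercept"] + Coef["UrbanYes"] * UrbanYes + Coef["IncomeZIP"] * IncomeZIP)
}

# Non-urban resident in a ZIP code with IncomeZIP = 70.
round(PredictCafeExp(UrbanYes = 0, IncomeZIP = 70), 2)   # 86.53

# Urban resident in a ZIP code with IncomeZIP = 80.
round(PredictCafeExp(UrbanYes = 1, IncomeZIP = 80), 2)   # 103.37

# Evaluated at the sample means from Section 5.1.
round(PredictCafeExp(UrbanYes = 0.695625, IncomeZIP = 79.94375), 2)  # 100.39
```

The last line returns approximately the sample mean of CafeExp, as expected for a least-squares fit with an intercept, while the intercept alone (36.2) corresponds to a non-urban resident with IncomeZIP equal to zero. With the fitted object available, ‘predict(ModelLMSelect, newdata = data.frame(UrbanYes = 1, IncomeZIP = 80))’ would give the same prediction without rounding the coefficients.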

8 Summary

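In the selected model, the intercept of about 36.2 dollars is the predicted monthly cafe expenditure for a non-urban resident in a ZIP code with IncomeZIP equal to zero; it is not the average expenditure, which is about 100.4 dollars in the sample (Section 5.1). Holding ZIP-code income constant, urban residents are predicted to spend about 9.65 dollars more than non-urban residents. Holding urban status constant, each one-unit increase in IncomeZIP is associated with about 0.72 dollars more in café spending; with IncomeZIP measured in thousands of dollars, that is roughly 72 cents for every additional 1,000 dollars of median ZIP-code income. Both predictors are statistically significant, and the model explains about 28% of the variation in cafe expenditures (R-squared of 0.283), nearly as much as the model with all five predictors (0.293), in which Age, SexMale, and YearsAfterHighschool were not significant.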