library(tidyverse)
library(rio)
library(janitor)
library(skimr)
library(kableExtra)
library(corrplot)
RFinalTest
1 Introduction
In this project we analyze real-world data on individuals' monthly cafe expenditures. The provided data set contains demographic information (age, sex, education), a location indicator (urban vs. non-urban), and neighborhood median income by ZIP code. The goal of this project is to explore which factors are associated with higher spending and to evaluate how well demographics and regional income predict cafe expenditures.
2 Libraries Used
3 Explanation of Libraries
3.1 tidyverse
A collection of R packages (such as dplyr and ggplot2) designed for cleaning, wrangling, and visualizing data.
3.2 rio
A package designed for importing or exporting different formats of data.
3.3 janitor
Tools for quickly cleaning data, especially column names that are difficult to work with.
3.4 skimr
Easy and readable summaries of data showcasing distributions, missing values, and basic stats.
3.5 kableExtra
Enhances tables created with kable so that they look nicer in HTML/Word/PDF reports using borders, colors, alignments, and other features.
3.6 corrplot
Creates correlation plots to help visualize relationships between numeric variables.
4 Data
4.1 Loading Data
DataCafeOrg = import("ProjectDataCafeNew.csv") |>
  clean_names("upper_camel")
DataZIPIncomeOrg = import("DataZIPIncomeNew-1.csv") |>
  clean_names("upper_camel") |>
  rename(IncomeZIP = Income)
Two datasets are used. The first contains individual-level information on cafe expenditures and demographic characteristics (age, sex, education level, urban residence, ZIP code). The second contains median household income at the ZIP-code level. The ‘import()’ function (from the rio package) loads the CSV files into R. The ‘clean_names()’ function (from the janitor package) standardizes column names into a consistent format (not needed for this data, but included here for practice). Lastly, we rename the income variable to ‘IncomeZIP’ to make clear that it represents ZIP-code-level income.
4.2 Selecting Variables
DataCafe = DataCafeOrg |>
  select(CafeExp, Age, SexMale, UrbanYes, YearsAfterHighschool, Zip)
kable(head(DataCafe))
| CafeExp | Age | SexMale | UrbanYes | YearsAfterHighschool | Zip |
|---|---|---|---|---|---|
| 75.51034 | 18.71554 | 1 | 1 | 0.7259538 | 19104 |
| 85.95790 | 49.96018 | 0 | 0 | 32.2933391 | 19105 |
| 91.17824 | 39.24468 | 0 | 0 | 21.4324698 | 19105 |
| 88.10348 | 26.32100 | 1 | 1 | 8.3337170 | 19104 |
| 128.30616 | 59.77155 | 1 | 1 | 41.9278606 | 19102 |
| 97.90179 | 28.56631 | 0 | 1 | 10.8204892 | 19104 |
After loading the data, we keep only the variables needed for the analysis, which simplifies the dataset and makes it easier to explore and model. The ‘select()’ function is used to choose the specified variables: CafeExp (cafe expenditures), Age, SexMale (sex, coded 1 for male), UrbanYes (urban residence), YearsAfterHighschool (years after high school), and Zip (ZIP code). We then use ‘kable()’ to display the first rows of the resulting dataset as a quick visual check.
4.3 Joining Tables
DataCafeWithInc = left_join(
  DataCafe, DataZIPIncomeOrg,
  join_by(Zip == Zip)
)
kable(head(DataCafeWithInc))
| CafeExp | Age | SexMale | UrbanYes | YearsAfterHighschool | Zip | IncomeZIP |
|---|---|---|---|---|---|---|
| 75.51034 | 18.71554 | 1 | 1 | 0.7259538 | 19104 | 80 |
| 85.95790 | 49.96018 | 0 | 0 | 32.2933391 | 19105 | 70 |
| 91.17824 | 39.24468 | 0 | 0 | 21.4324698 | 19105 | 70 |
| 88.10348 | 26.32100 | 1 | 1 | 8.3337170 | 19104 | 80 |
| 128.30616 | 59.77155 | 1 | 1 | 41.9278606 | 19102 | 90 |
| 97.90179 | 28.56631 | 0 | 1 | 10.8204892 | 19104 | 80 |
To include the neighborhood income information with our individual-level cafe data, we merge the two datasets with the ‘left_join()’ function, matching observations on the shared ZIP code variable. We choose ‘left_join()’ over ‘inner_join()’ because we want to retain every individual in the cafe dataset, even if some ZIP codes have no income information attached. We use ‘kable()’ to verify that the combined dataset contains the columns from both sources and is ready for analysis.
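The difference between the two join types can be seen on a small toy example. This is only a sketch: the data frames `cafe_toy` and `income_toy` and all of their values are made up for illustration, and ZIP code 19999 is a hypothetical code with no income record.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the project files)
cafe_toy <- data.frame(
  CafeExp = c(80, 95, 110),
  Zip     = c(19101, 19102, 19999)  # 19999 has no matching income row
)
income_toy <- data.frame(
  Zip       = c(19101, 19102),
  IncomeZIP = c(60, 90)
)

# left_join keeps every row of the first table
left_toy <- left_join(cafe_toy, income_toy, join_by(Zip == Zip))

# inner_join keeps only rows with a match in both tables
inner_toy <- inner_join(cafe_toy, income_toy, join_by(Zip == Zip))

nrow(left_toy)                # 3 rows: the unmatched person is kept
nrow(inner_toy)               # 2 rows: the unmatched person is dropped
is.na(left_toy$IncomeZIP[3])  # TRUE: missing income shows up as NA
```

In the project data the descriptive statistics report no missing IncomeZIP values, so both joins would likely return the same rows here. The left join is the safer default when that is not known in advance.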
5 Descriptive Statistics
5.1 N, Min, Max, Mean, SD, etc.
DataCafeNum = DataCafeWithInc |>
  select(-Zip)
n_total = nrow(DataCafeNum)
DescrStatc = skim(DataCafeNum) |>
  mutate(
    N = n_total - n_missing
  ) |>
  select(
    Variable = skim_variable,
    Missing = n_missing,
    N,
    Mean = numeric.mean,
    SD = numeric.sd,
    Min = numeric.p0,
    Max = numeric.p100
  )
kable(DescrStatc)
| Variable | Missing | N | Mean | SD | Min | Max |
|---|---|---|---|---|---|---|
| CafeExp | 0 | 1600 | 100.388262 | 26.5729174 | 1.7821390 | 183.74515 |
| Age | 0 | 1600 | 37.069247 | 10.5231428 | 18.0206520 | 59.91429 |
| SexMale | 0 | 1600 | 0.444375 | 0.4970516 | 0.0000000 | 1.00000 |
| UrbanYes | 0 | 1600 | 0.695625 | 0.4602861 | 0.0000000 | 1.00000 |
| YearsAfterHighschool | 0 | 1600 | 19.572534 | 10.5278207 | 0.0895441 | 42.69667 |
| IncomeZIP | 0 | 1600 | 79.943750 | 14.4893253 | 50.0000000 | 110.00000 |
This code removes the Zip column and then calculates basic descriptive statistics for the remaining numeric variables. First, it counts the rows in the data. Then it uses the skim() function to summarize each numeric variable, including the number of missing values, mean, standard deviation, minimum, and maximum. The code also calculates how many non-missing values each variable has. Finally, it selects the relevant columns and prints them as a clean table using kable().
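The count of usable values is simple arithmetic: total rows minus missing rows. A minimal base R sketch on a made-up vector `x` (hypothetical values) shows the same calculation, assuming missing values are dropped before the mean is taken:

```r
# Toy column with two missing values (hypothetical numbers)
x <- c(12.5, NA, 30.0, 18.2, NA)

n_total   <- length(x)             # all rows, missing or not
n_missing <- sum(is.na(x))         # rows where the value is NA
n_valid   <- n_total - n_missing   # same arithmetic as N = n_total - n_missing

# Statistics computed on the non-missing values only
mean_x <- mean(x, na.rm = TRUE)
sd_x   <- sd(x, na.rm = TRUE)

c(N = n_valid, Missing = n_missing, Mean = mean_x, SD = sd_x)
```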
On average, people in the sample are about 37 years old and roughly 20 years out of high school. SexMale and UrbanYes are coded 0 or 1, so their means are proportions: about 44% of the sample is male and about 70% live in urban areas. Income by ZIP code ranges from 50,000 dollars to 110,000 dollars, with an average of about 80,000 dollars.
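The reason the mean of a 0/1 variable can be read as a proportion is shown in the short sketch below. The vector `sex_male` is a made-up example, not the project data.

```r
# Hypothetical 0/1 indicator for ten people (1 = male, 0 = female)
sex_male <- c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0)

# The mean adds up the ones and divides by the number of people
prop_male <- mean(sex_male)

# The same share obtained by counting directly
prop_by_count <- sum(sex_male == 1) / length(sex_male)

prop_male      # 0.4, i.e. 40% of this toy sample is male
prop_by_count  # 0.4
```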
5.2 Group and Summarize (Pivot Table)
CafePivot = DataCafeWithInc |>
  group_by(UrbanYes, SexMale) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )
kable(CafePivot)
| UrbanYes | SexMale | N | MeanCafeExp | SDCafeExp |
|---|---|---|---|---|
| 0 | 0 | 487 | 81.49312 | 23.57402 |
| 1 | 0 | 402 | 103.50931 | 23.05510 |
| 1 | 1 | 711 | 111.56586 | 23.11128 |
This code groups people by urban residence and sex, then reports the group size, the average café spending, and the standard deviation of café spending for each group. Only three groups appear because the sample contains no non-urban males. Among the groups present, non-urban females spend the least on average (about 81 dollars) and urban males spend the most (about 112 dollars).
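A grouped summary only returns combinations that actually occur in the data, so an empty group is easy to miss. A cross-tabulation with base R's table() makes empty cells visible. The sketch below uses a made-up data frame `toy` with hypothetical values arranged so that one combination is absent.

```r
# Hypothetical data: no rows with UrbanYes = 0 and SexMale = 1
toy <- data.frame(
  UrbanYes = c(0, 0, 1, 1, 1, 1),
  SexMale  = c(0, 0, 0, 0, 1, 1)
)

# Cross-tabulate the two indicators; absent combinations appear as 0
counts <- table(UrbanYes = toy$UrbanYes, SexMale = toy$SexMale)
counts

# Look up a single cell by its labels (row "0", column "1")
counts["0", "1"]  # 0: there are no non-urban males in the toy data
```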
5.2.1 Average Price by ZIP Code
TableAvgCafeZip = DataCafeWithInc |>
  group_by(Zip) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )
kable(TableAvgCafeZip)
| Zip | N | MeanCafeExp | SDCafeExp |
|---|---|---|---|
| 19101 | 157 | 79.95523 | 24.13239 |
| 19102 | 230 | 111.69190 | 20.53935 |
| 19103 | 92 | 69.05188 | 25.33880 |
| 19104 | 631 | 103.28528 | 22.98335 |
| 19105 | 238 | 87.31683 | 20.34787 |
| 19106 | 167 | 118.35366 | 23.44256 |
| 19107 | 85 | 121.25712 | 21.52090 |
The code groups the dataset by ZIP code and reports three measures of café spending within each ZIP: the number of observations, the average café expenditure, and the standard deviation of café expenditures. Average spending varies considerably between ZIP codes, from about 69 dollars in 19103 to about 121 dollars in 19107, which suggests that location plays an important role in consumer spending patterns.
5.2.2 Average Price by ZIP Code & UrbanYes
TableAvgCafeZipUrban = DataCafeWithInc |>
  group_by(Zip, UrbanYes) |>
  summarise(
    N = n(),
    MeanCafeExp = mean(CafeExp),
    SDCafeExp = sd(CafeExp)
  )
kable(TableAvgCafeZipUrban)
| Zip | UrbanYes | N | MeanCafeExp | SDCafeExp |
|---|---|---|---|---|
| 19101 | 0 | 157 | 79.95523 | 24.13239 |
| 19102 | 1 | 230 | 111.69190 | 20.53935 |
| 19103 | 0 | 92 | 69.05188 | 25.33880 |
| 19104 | 1 | 631 | 103.28528 | 22.98335 |
| 19105 | 0 | 238 | 87.31683 | 20.34787 |
| 19106 | 1 | 167 | 118.35366 | 23.44256 |
| 19107 | 1 | 85 | 121.25712 | 21.52090 |
This code groups the data by ZIP code combined with urban residence and, for each group, calculates the number of people, the average café spending, and the standard deviation of that spending. The group sizes and means are identical to those in the ZIP-only table, which shows that every ZIP code in the sample is either entirely urban or entirely non-urban. Urban and non-urban spending therefore cannot be compared within the same ZIP code. Across ZIP codes, the urban ones (19102, 19104, 19106, 19107) average roughly 103 to 121 dollars, while the non-urban ones (19101, 19103, 19105) average roughly 69 to 87 dollars.
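In the table above, each ZIP code appears with a single UrbanYes value. One way to check directly whether urban status ever varies within a ZIP code is to count the distinct UrbanYes values per ZIP. The sketch below does this in base R on a made-up data frame `toy` (hypothetical values).

```r
# Hypothetical data: each ZIP code has only one UrbanYes value
toy <- data.frame(
  Zip      = c(19101, 19101, 19102, 19102, 19102),
  UrbanYes = c(0, 0, 1, 1, 1)
)

# Number of distinct UrbanYes values observed inside each ZIP code
n_levels <- tapply(toy$UrbanYes, toy$Zip, function(x) length(unique(x)))
n_levels

# TRUE means urban status never varies within a ZIP code, so grouping
# by Zip and UrbanYes gives the same groups as grouping by Zip alone
all(n_levels == 1)
```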
7 Linear Regression
7.1 All Predictors
7.1.1 Unfitted Model
CafeExp = β₀ + β₁(Age) + β₂(SexMale) + β₃(UrbanYes) + β₄(YearsAfterHighschool) + β₅(IncomeZIP)
ModelLMAll = lm(CafeExp ~ ., data = DataCafeNum)
summary(ModelLMAll)
Call:
lm(formula = CafeExp ~ ., data = DataCafeNum)

Residuals:
    Min      1Q  Median      3Q     Max
-89.530 -14.665   0.587  15.154  66.153

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          -22.38384   34.91855  -0.641    0.522
Age                    3.05243    1.96592   1.553    0.121
SexMale               -0.40002    1.61633  -0.247    0.805
UrbanYes              10.00672    1.93032   5.184 2.45e-07 ***
YearsAfterHighschool  -2.81462    1.96510  -1.432    0.152
IncomeZIP              0.72460    0.07064  10.257  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.38 on 1594 degrees of freedom
Multiple R-squared:  0.2928, Adjusted R-squared:  0.2906
F-statistic:   132 on 5 and 1594 DF,  p-value: < 2.2e-16
7.1.2 Fitted Model
CafeExp = −22.38 + 3.05(Age) − 0.40(SexMale) + 10.01(UrbanYes) − 2.81(YearsAfterHighschool) + 0.72(IncomeZIP)
7.2 Selected Predictors
ModelLMSelect = lm(CafeExp ~ UrbanYes + IncomeZIP, data = DataCafeNum)
summary(ModelLMSelect)
Call:
lm(formula = CafeExp ~ UrbanYes + IncomeZIP, data = DataCafeNum)

Residuals:
    Min      1Q  Median      3Q     Max
-87.396 -15.034   0.825  14.715  66.001

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 36.20036    3.99762   9.055  < 2e-16 ***
UrbanYes     9.64866    1.93139   4.996 6.51e-07 ***
IncomeZIP    0.71896    0.06136  11.718  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.51 on 1597 degrees of freedom
Multiple R-squared:  0.283,  Adjusted R-squared:  0.2821
F-statistic: 315.2 on 2 and 1597 DF,  p-value: < 2.2e-16
7.2.1 Unfitted Model
CafeExp = β₀ + β₁(UrbanYes) + β₂(IncomeZIP)
7.2.2 Fitted Model
CafeExp = 36.2 + 9.65(UrbanYes) + 0.72(IncomeZIP)
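The fitted equation can be turned into a small prediction helper. This is only a sketch: the function name `predict_cafe_exp` is made up for illustration, it uses the rounded coefficients from the fitted equation, and it assumes IncomeZIP is entered in thousands of dollars, as the descriptive statistics suggest.

```r
# Hypothetical helper built from the rounded fitted coefficients
# urban_yes:  1 for urban residents, 0 otherwise
# income_zip: median ZIP-code income, assumed to be in thousands of dollars
predict_cafe_exp <- function(urban_yes, income_zip) {
  36.2 + 9.65 * urban_yes + 0.72 * income_zip
}

# Predicted monthly spending at a ZIP-code income of 80 (thousand dollars)
urban_80    <- predict_cafe_exp(urban_yes = 1, income_zip = 80)
nonurban_80 <- predict_cafe_exp(urban_yes = 0, income_zip = 80)

urban_80                # 36.2 + 9.65 + 57.6 = 103.45
nonurban_80             # 36.2 + 57.6 = 93.8
urban_80 - nonurban_80  # 9.65, the UrbanYes coefficient
```

With the fitted model object itself, predict(ModelLMSelect, newdata = ...) would give the same kind of prediction from the unrounded coefficients.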
8 Summary
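The intercept of about 36.2 dollars is the predicted monthly café expenditure for a non-urban resident in a ZIP code with a median income of zero, so it is a baseline rather than a typical value (the sample average is about 100 dollars). Holding ZIP-code income constant, urban residents are estimated to spend about 9.65 dollars more on average than those in non-urban areas. Furthermore, for every additional 1,000 dollars of median ZIP-code income, café spending is estimated to increase by about 72 cents on average, holding urban residence constant. Together, these two predictors explain about 28% of the variation in café spending (R-squared of 0.283).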