library(tidyverse)
library(rio)
library(janitor)
library(skimr)
library(kableExtra)
library(corrplot)IBM 5910 Project
1 Introduction
The goal of this project is to analyze monthly movie spending using 2 types of data. The study examines how variables such as age, gender, urban residency, years after high school, zip and income are associated with differences in movie expenditure.
The project begins with descriptive statistics and group summaries, followed by a correlation analysis. The final step is to estimate several linear regression models in order to identify meaningful predictors of movie spending.
2 Libraries Used
tidyverse is used for data cleaning and data manipulation.
rio is used to import datasets.
janitor is helpful for cleaning and standardizing variable names.
skimr provides descriptive statistics in a clear format.
kableExtra displays tables neatly in the HTML output.
corrplot is used to visualize correlation matrices.
3 Data
DataMovie = import("ProjectDataMovies22New.csv") |>
clean_names("upper_camel")
DataZip=import("DataZIPIncomeNew-4.csv") |>
clean_names("upper_camel") |>
select(Zip, IncomeZip=Income)The first dataset is imported from a CSV file that contains individual-level movie spending information.
It includes variables such as Age, SexMale, UrbanYes, MoviesExp, and YearsAfterHighschool. After loading the data, the variable names are cleaned and converted into UpperCamelCase format using the clean_names() function.
The second dataset contains ZIP-code-level income information, including the variables Zip and Income.
This dataset is also imported and cleaned in the same way. We select only the variables needed for the analysis, which are the ZIP code and the income for each ZIP code. The income variable is renamed to IncomeZip so that it can be merged more clearly with the movie dataset.
3.1 Selecting Variables
In this step, we select the variables that will be used in the analysis. From the original movie dataset, we keep the variables Age, SexMale, Zip, UrbanYes, MoviesExp, and YearsAfterHighschool. we also rename MoviesExp to MovieSpending to make the meaning of the variable clearer for the regression analysis.
The resulting dataset contains only the variables that are relevant for the descriptive statistics and the later regression models, which helps simplify the analysis and keep the data organized. The table below shows the first few observations after selecting and renaming the variables.
DataMovieNew = DataMovie |>
select(Age, SexMale, Zip, UrbanYes, MovieSpending=MoviesExp, YearsAfterHighschool )
kable(head(DataMovieNew))| Age | SexMale | Zip | UrbanYes | MovieSpending | YearsAfterHighschool |
|---|---|---|---|---|---|
| 43.74629 | 0 | 19103 | 0 | 55.62785 | 25.763442 |
| 26.78734 | 1 | 19106 | 1 | 110.55335 | 9.529410 |
| 19.96861 | 0 | 19104 | 1 | 130.29553 | 2.623253 |
| 33.70697 | 0 | 19104 | 1 | 134.22746 | 15.925996 |
| 49.19151 | 0 | 19105 | 0 | 90.68641 | 31.723165 |
| 50.74560 | 0 | 19104 | 1 | 91.30499 | 33.528681 |
3.2 Joining Tables (add ZIP code income to DataHousing)
In this step, we combine the movie dataset with the ZIP-code income dataset.
The two datasets share a common variable, Zip, which allows them to be merged.
We use a left_join() because we want to keep all observations from the movie dataset, even if some ZIP codes do not appear in the income dataset.
When a matching ZIP code is found, the corresponding income value (IncomeZip) is added to each row.
This merged dataset is used in the descriptive analysis and in the regression models later in the project.
The table below shows the first few rows after the join.
DataMovieWithInc = left_join(DataMovieNew, DataZip,join_by(Zip==Zip))
kable(head(DataMovieWithInc)) | Age | SexMale | Zip | UrbanYes | MovieSpending | YearsAfterHighschool | IncomeZip |
|---|---|---|---|---|---|---|
| 43.74629 | 0 | 19103 | 0 | 55.62785 | 25.763442 | 50 |
| 26.78734 | 1 | 19106 | 1 | 110.55335 | 9.529410 | 100 |
| 19.96861 | 0 | 19104 | 1 | 130.29553 | 2.623253 | 80 |
| 33.70697 | 0 | 19104 | 1 | 134.22746 | 15.925996 | 80 |
| 49.19151 | 0 | 19105 | 0 | 90.68641 | 31.723165 | 70 |
| 50.74560 | 0 | 19104 | 1 | 91.30499 | 33.528681 | 80 |
4 Descripive Statistic
4.1 N, Min, Max, Mean SD, etc.
DataMovieNum=DataMovieWithInc |>
select(-Zip)
DescrStatc=skim(DataMovieNum) |>
select(Variable = skim_variable,
Missing = n_missing,
Mean = numeric.mean,
SD = numeric.sd,
Min = numeric.p0,
Median = numeric.p50,
Max = numeric.p100)
kable(DescrStatc)| Variable | Missing | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|
| Age | 0 | 37.10824 | 10.5847747 | 18.0223297 | 36.83705 | 59.98208 |
| SexMale | 0 | 0.45125 | 0.4977733 | 0.0000000 | 0.00000 | 1.00000 |
| UrbanYes | 0 | 0.69625 | 0.4600198 | 0.0000000 | 1.00000 | 1.00000 |
| MovieSpending | 0 | 100.59220 | 27.4321351 | 6.9762110 | 100.39985 | 181.17121 |
| YearsAfterHighschool | 0 | 19.62224 | 10.6053937 | 0.0758948 | 19.37038 | 42.79993 |
| IncomeZip | 0 | 80.06250 | 14.4309130 | 50.0000000 | 80.00000 | 110.00000 |
The descriptive statistics provide an overview of the main variables used in the analysis.
The average age in our sample is about 37 years, with a range from approximately 18 to 60.
The variable SexMale has a mean close to 0.45, which indicates that the sample contains slightly fewer male respondents than female respondents.
About 70 percent of respondents live in urban areas, as shown by the mean of UrbanYes.
MovieSpending has an average of around 100 dollars per month, but the relatively large standard deviation shows that movie spending varies widely across individuals.
The average number of years after high school is about 19 years, which suggests that many respondents are in mid-adulthood.
IncomeZip has a mean of around 80. The relatively wide range of values also shows that the dataset includes respondents from both lower-income and higher-income neighborhoods.
None of the variables contain missing values.
4.2 Group and Summarize (Pivot Table)
4.2.1 Average Movie Spending by ZIP Code
TableAvgMovieZip = DataMovieWithInc |>
group_by(Zip) |>
summarise(N = n(), MeanMovieSpending = mean(MovieSpending),SDMovieSpending = sd(MovieSpending))
kable(TableAvgMovieZip)| Zip | N | MeanMovieSpending | SDMovieSpending |
|---|---|---|---|
| 19101 | 156 | 79.68692 | 22.91825 |
| 19102 | 242 | 111.88981 | 23.22866 |
| 19103 | 89 | 73.40020 | 21.46462 |
| 19104 | 619 | 102.25020 | 23.41043 |
| 19105 | 241 | 85.00916 | 22.27525 |
| 19106 | 171 | 119.77474 | 22.32742 |
| 19107 | 82 | 129.81512 | 25.02299 |
In this step, we group the data by ZIP code and calculate summary statistics for each ZIP.
For every ZIP code, we compute three values such as the number of respondents (N), the average monthly movie spending (MeanMovieSpending), and the standard deviation of movie spending (SDMovieSpending).
This allows us to compare spending patterns across different geographic areas and see how movie spending varies depending on where respondents live.
The results show clear differences in movie spending across ZIP codes.
For example, ZIP 19107 has the highest average spending at around 130 dollars per month, while ZIP 19103 has one of the lowest averages at about 73 dollars.
ZIP codes such as 19106 and 19107 show higher movie spending. On the other hand, ZIP codes like 19101 and 19103 have lower average spending values.
These patterns suggest that geographic location may relate to how much people spend on movies.
4.2.2 Average Movie Spending by ZIP Code & UrbanYes
TableAvgMovieZipUrban = DataMovieWithInc |>
group_by(Zip, UrbanYes) |>
summarise(N = n(), MeanMovieSpending = mean(MovieSpending), SDMovieSpending = sd(MovieSpending))
kable(TableAvgMovieZipUrban)| Zip | UrbanYes | N | MeanMovieSpending | SDMovieSpending |
|---|---|---|---|---|
| 19101 | 0 | 156 | 79.68692 | 22.91825 |
| 19102 | 1 | 242 | 111.88981 | 23.22866 |
| 19103 | 0 | 89 | 73.40020 | 21.46462 |
| 19104 | 1 | 619 | 102.25020 | 23.41043 |
| 19105 | 0 | 241 | 85.00916 | 22.27525 |
| 19106 | 1 | 171 | 119.77474 | 22.32742 |
| 19107 | 1 | 82 | 129.81512 | 25.02299 |
In this step, we group the data by both ZIP code and UrbanYes to examine how movie spending differs across geographic areas and between urban and non-urban respondents.
For each combination of ZIP and UrbanYes, we calculate the number of observations (N), the average monthly movie spending (MeanMovieSpending), and the standard deviation of spending (SDMovieSpending).
This allows us to compare spending patterns within ZIP codes while also taking into account whether respondents live in an urban area.
The results show consistent differences between urban and non-urban respondents.
In ZIP codes where UrbanYes equals 1, average movie spending is noticeably higher than in ZIP codes where UrbanYes equals 0.
For example, ZIP 19107 and 19106, which have only urban respondents, show some of the highest spending levels in the dataset.
In contrast, ZIP codes with UrbanYes equal to 0, such as 19101, 19103, and 19105, have lower average movie spending.
This pattern suggests that urban residents tend to spend more on movies, possibly due to greater access to theaters.
6 Linear Regression
6.1 All Predictors
In this section, we estimate a regression model that includes all available predictors in our dataset.
Unfitted Model
The unfitted model represents the theoretical regression equation before estimating any coefficients.
It specifies the relationship we want to test using the data.
MovieSpending = B0 + B1×Age + B2×SexMale + B3×UrbanYes + B4×YearsAfterHighschool + B5×IncomeZip
ModelLMAll1=lm(MovieSpending~., data=MovieNum)
summary(ModelLMAll1)
Call:
lm(formula = MovieSpending ~ ., data = MovieNum)
Residuals:
Min 1Q Median 3Q Max
-77.72 -14.72 0.21 15.43 69.18
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.31518 35.50439 0.093 0.925618
Age 1.31995 1.99240 0.662 0.507750
SexMale 0.05241 1.67306 0.031 0.975012
UrbanYes 7.52617 1.99273 3.777 0.000165 ***
YearsAfterHighschool -1.22356 1.98843 -0.615 0.538419
IncomeZip 0.83736 0.07326 11.430 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.04 on 1594 degrees of freedom
Multiple R-squared: 0.2969, Adjusted R-squared: 0.2947
F-statistic: 134.6 on 5 and 1594 DF, p-value: < 2.2e-16
Fitted Model
The fitted model includes the estimated coefficients from the regression output. After estimation, the model becomes:
MovieSpending = 3.315 + 1.320×Age + 0.052×SexMale + 7.526×UrbanYes - 1.224×YearsAfterHighschool + 0.837×IncomeZip
Result
The regression results show that only two predictors, UrbanYes and IncomeZip, are statistically significant at conventional levels.
UrbanYes has a positive coefficient of 7.526, indicating that urban residents spend approximately 7.5 dollars more on movies per month compared to non-urban residents, holding other variables constant.
IncomeZip also has a strong and positive effect on MovieSpending.
The coefficient of 0.837 means that for each one-unit increase in income, monthly movie spending increases by about 0.84 dollars on average.
This is consistent with the correlation results, which showed a positive association between income and spending.
In contrast, Age, SexMale, and YearsAfterHighschool are not statistically significant. Their high p-values indicate that these demographic variables do not contribute meaningfully to explaining variation in movie spending in our sample.
6.2 Selected Predictors
Based on the results of the full model, only IncomeZip and UrbanYes were statistically significant predictors of MovieSpending. In this section, we estimate a regression model that includes only the significant predictors.
Unfitted Model
The unfitted model specifies the relationship we want to estimate using only the significant predictors:
MovieSpending = B0 + B1×IncomeZip + B2×UrbanYes
ModelLMAll2=lm(MovieSpending ~ IncomeZip + UrbanYes, data=MovieNum)
summary(ModelLMAll2)
Call:
lm(formula = MovieSpending ~ IncomeZip + UrbanYes, data = MovieNum)
Residuals:
Min 1Q Median 3Q Max
-76.300 -14.584 0.073 15.626 68.647
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.56323 4.12593 6.923 6.39e-12 ***
IncomeZip 0.83342 0.06323 13.181 < 2e-16 ***
UrbanYes 7.61718 1.98344 3.840 0.000128 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.04 on 1597 degrees of freedom
Multiple R-squared: 0.2954, Adjusted R-squared: 0.2945
F-statistic: 334.7 on 2 and 1597 DF, p-value: < 2.2e-16
Fitted Model
After estimating the model, the fitted regression equation becomes:
MovieSpending = 28.563 + 0.833×IncomeZip + 7.617×UrbanYes
Result
The selected model shows that both predictors remain highly significant.
The coefficient for IncomeZip is 0.833, which means that individuals with higher income tend to spend more on movies.
Specifically, a one-unit increase in IncomeZip is associated with an average increase of about 0.83 dollars in monthly movie spending.
UrbanYes has a positive coefficient of 7.617, indicating that urban residents spend approximately 7.6 dollars more per month on movies than non-urban residents, holding income constant.
The R-squared value of 0.295 shows that the model explains about 30 percent of the variation in MovieSpending, which is nearly identical to the full model. This means that removing the insignificant predictors did not reduce the explanatory power of the model.
7 Summary
The correlation analysis shows that IncomeZip and UrbanYes have the strongest positive relationships with MovieSpending.
If income increases, then movie spending also tends to increase. Similarly, if a respondent lives in an urban area, then movie spending is typically higher than for non-urban respondents.
In contrast, Age, Sex, and YearsAfterHighschool show very weak correlations with movie spending, suggesting that these demographic variables are not major determinants of spending behavior in our dataset.
The regression results support these findings.
In using all predictors, IncomeZip and UrbanYes are the only predictors with statistically significant p-values. The null hypothesis for each coefficient states that the variable has no effect on MovieSpending. For IncomeZip and UrbanYes, we reject the null hypothesis because their p-values are below conventional significance levels. For Age, SexMale, and YearsAfterHighschool, the p-values are high, so we fail to reject the null hypothesis, meaning these variables do not significantly explain movie spending.
The selected predictor, which includes only IncomeZip and UrbanYes, performs nearly as well as the all predictor. This model shows that higher income and urban residency are the key factors associated with higher monthly movie spending.
Overall, our results suggest that economic conditions and urban living environments play important roles in shaping movie spending patterns, while individual demographic characteristics such as age and gender appear to have minimal impact.