IBM 5910 Project

Author

Sora Hosoda,

1 Introduction

The goal of this project is to analyze monthly movie spending using 2 types of data. The study examines how variables such as age, gender, urban residency, years after high school, zip and income are associated with differences in movie expenditure.

The project begins with descriptive statistics and group summaries, followed by a correlation analysis. The final step is to estimate several linear regression models in order to identify meaningful predictors of movie spending.

2 Libraries Used

library(tidyverse)
library(rio)
library(janitor)
library(skimr)
library(kableExtra)
library(corrplot)

tidyverse is used for data cleaning and data manipulation.

rio is used to import datasets.

janitor is helpful for cleaning and standardizing variable names.

skimr provides descriptive statistics in a clear format.

kableExtra displays tables neatly in the HTML output.

corrplot is used to visualize correlation matrices.

3 Data

DataMovie = import("ProjectDataMovies22New.csv") |>
  clean_names("upper_camel") 
DataZip=import("DataZIPIncomeNew-4.csv") |> 
  clean_names("upper_camel") |>
  select(Zip, IncomeZip=Income)

The first dataset is imported from a CSV file that contains individual-level movie spending information.
It includes variables such as Age, SexMale, UrbanYes, MoviesExp, and YearsAfterHighschool. After loading the data, the variable names are cleaned and converted into UpperCamelCase format using the clean_names() function.

The second dataset contains ZIP-code-level income information, including the variables Zip and Income.
This dataset is also imported and cleaned in the same way. We select only the variables needed for the analysis, which are the ZIP code and the income for each ZIP code. The income variable is renamed to IncomeZip so that it can be merged more clearly with the movie dataset.

3.1 Selecting Variables

In this step, we select the variables that will be used in the analysis. From the original movie dataset, we keep the variables Age, SexMale, Zip, UrbanYes, MoviesExp, and YearsAfterHighschool. we also rename MoviesExp to MovieSpending to make the meaning of the variable clearer for the regression analysis.

The resulting dataset contains only the variables that are relevant for the descriptive statistics and the later regression models, which helps simplify the analysis and keep the data organized. The table below shows the first few observations after selecting and renaming the variables.

DataMovieNew = DataMovie |> 
  select(Age, SexMale, Zip, UrbanYes, MovieSpending=MoviesExp, YearsAfterHighschool )
kable(head(DataMovieNew))
Age SexMale Zip UrbanYes MovieSpending YearsAfterHighschool
43.74629 0 19103 0 55.62785 25.763442
26.78734 1 19106 1 110.55335 9.529410
19.96861 0 19104 1 130.29553 2.623253
33.70697 0 19104 1 134.22746 15.925996
49.19151 0 19105 0 90.68641 31.723165
50.74560 0 19104 1 91.30499 33.528681

3.2 Joining Tables (add ZIP code income to DataHousing)

In this step, we combine the movie dataset with the ZIP-code income dataset.

The two datasets share a common variable, Zip, which allows them to be merged.

We use a left_join() because we want to keep all observations from the movie dataset, even if some ZIP codes do not appear in the income dataset.

When a matching ZIP code is found, the corresponding income value (IncomeZip) is added to each row.

This merged dataset is used in the descriptive analysis and in the regression models later in the project.

The table below shows the first few rows after the join.

DataMovieWithInc = left_join(DataMovieNew, DataZip,join_by(Zip==Zip))
kable(head(DataMovieWithInc))                 
Age SexMale Zip UrbanYes MovieSpending YearsAfterHighschool IncomeZip
43.74629 0 19103 0 55.62785 25.763442 50
26.78734 1 19106 1 110.55335 9.529410 100
19.96861 0 19104 1 130.29553 2.623253 80
33.70697 0 19104 1 134.22746 15.925996 80
49.19151 0 19105 0 90.68641 31.723165 70
50.74560 0 19104 1 91.30499 33.528681 80

4 Descripive Statistic

4.1 N, Min, Max, Mean SD, etc.

DataMovieNum=DataMovieWithInc |> 
  select(-Zip)
DescrStatc=skim(DataMovieNum) |> 
  select(Variable = skim_variable,
         Missing = n_missing,
         Mean = numeric.mean,
         SD = numeric.sd,
         Min = numeric.p0,
         Median = numeric.p50,
         Max = numeric.p100)
kable(DescrStatc)
Variable Missing Mean SD Min Median Max
Age 0 37.10824 10.5847747 18.0223297 36.83705 59.98208
SexMale 0 0.45125 0.4977733 0.0000000 0.00000 1.00000
UrbanYes 0 0.69625 0.4600198 0.0000000 1.00000 1.00000
MovieSpending 0 100.59220 27.4321351 6.9762110 100.39985 181.17121
YearsAfterHighschool 0 19.62224 10.6053937 0.0758948 19.37038 42.79993
IncomeZip 0 80.06250 14.4309130 50.0000000 80.00000 110.00000

The descriptive statistics provide an overview of the main variables used in the analysis.

The average age in our sample is about 37 years, with a range from approximately 18 to 60.

The variable SexMale has a mean close to 0.45, which indicates that the sample contains slightly fewer male respondents than female respondents.

About 70 percent of respondents live in urban areas, as shown by the mean of UrbanYes.

MovieSpending has an average of around 100 dollars per month, but the relatively large standard deviation shows that movie spending varies widely across individuals.

The average number of years after high school is about 19 years, which suggests that many respondents are in mid-adulthood.

IncomeZip has a mean of around 80. The relatively wide range of values also shows that the dataset includes respondents from both lower-income and higher-income neighborhoods.

None of the variables contain missing values.

4.2 Group and Summarize (Pivot Table)

4.2.1 Average Movie Spending by ZIP Code

TableAvgMovieZip = DataMovieWithInc |>
  group_by(Zip) |>
  summarise(N = n(), MeanMovieSpending = mean(MovieSpending),SDMovieSpending = sd(MovieSpending))

kable(TableAvgMovieZip)
Zip N MeanMovieSpending SDMovieSpending
19101 156 79.68692 22.91825
19102 242 111.88981 23.22866
19103 89 73.40020 21.46462
19104 619 102.25020 23.41043
19105 241 85.00916 22.27525
19106 171 119.77474 22.32742
19107 82 129.81512 25.02299

In this step, we group the data by ZIP code and calculate summary statistics for each ZIP.

For every ZIP code, we compute three values such as the number of respondents (N), the average monthly movie spending (MeanMovieSpending), and the standard deviation of movie spending (SDMovieSpending).

This allows us to compare spending patterns across different geographic areas and see how movie spending varies depending on where respondents live.

The results show clear differences in movie spending across ZIP codes.

For example, ZIP 19107 has the highest average spending at around 130 dollars per month, while ZIP 19103 has one of the lowest averages at about 73 dollars.

ZIP codes such as 19106 and 19107 show higher movie spending. On the other hand, ZIP codes like 19101 and 19103 have lower average spending values.

These patterns suggest that geographic location may relate to how much people spend on movies.

4.2.2 Average Movie Spending by ZIP Code & UrbanYes

TableAvgMovieZipUrban = DataMovieWithInc |>
  group_by(Zip, UrbanYes) |>
  summarise(N = n(), MeanMovieSpending = mean(MovieSpending), SDMovieSpending = sd(MovieSpending))

kable(TableAvgMovieZipUrban)
Zip UrbanYes N MeanMovieSpending SDMovieSpending
19101 0 156 79.68692 22.91825
19102 1 242 111.88981 23.22866
19103 0 89 73.40020 21.46462
19104 1 619 102.25020 23.41043
19105 0 241 85.00916 22.27525
19106 1 171 119.77474 22.32742
19107 1 82 129.81512 25.02299

In this step, we group the data by both ZIP code and UrbanYes to examine how movie spending differs across geographic areas and between urban and non-urban respondents.

For each combination of ZIP and UrbanYes, we calculate the number of observations (N), the average monthly movie spending (MeanMovieSpending), and the standard deviation of spending (SDMovieSpending).

This allows us to compare spending patterns within ZIP codes while also taking into account whether respondents live in an urban area.

The results show consistent differences between urban and non-urban respondents.

In ZIP codes where UrbanYes equals 1, average movie spending is noticeably higher than in ZIP codes where UrbanYes equals 0.

For example, ZIP 19107 and 19106, which have only urban respondents, show some of the highest spending levels in the dataset.

In contrast, ZIP codes with UrbanYes equal to 0, such as 19101, 19103, and 19105, have lower average movie spending.

This pattern suggests that urban residents tend to spend more on movies, possibly due to greater access to theaters.

5 Correlated Variables

MovieNum= DataMovieWithInc |> 
  select(-Zip)
MatrixCor=cor(MovieNum)
corrplot(MatrixCor)

The correlation matrix helps us examine which variables have meaningful relationships with MovieSpending.

IncomeZip shows a clear positive correlation with MovieSpending, indicating that individuals with higher income tend to spend more on movies.

UrbanYes also has a positive relationship with MovieSpending, suggesting that urban residents generally spend more on movies than non-urban residents.

In contrast, Age, SexMale, and YearsAfterHighschool show very weak correlations with MovieSpending.

These results imply that demographic variables such as age and gender do not play a major role in movie spending in our dataset.

The strong correlation between Age and YearsAfterHighschool is expected because these two variables measure related aspects of a person’s life stage.

However, this correlation does not relate to MovieSpending.

Overall, IncomeZip and UrbanYes appear to be the most important variables associated with MovieSpending.

6 Linear Regression

6.1 All Predictors

In this section, we estimate a regression model that includes all available predictors in our dataset.

Unfitted Model

The unfitted model represents the theoretical regression equation before estimating any coefficients.

It specifies the relationship we want to test using the data.

MovieSpending = B0 + B1×Age + B2×SexMale + B3×UrbanYes + B4×YearsAfterHighschool + B5×IncomeZip

ModelLMAll1=lm(MovieSpending~., data=MovieNum)
summary(ModelLMAll1)

Call:
lm(formula = MovieSpending ~ ., data = MovieNum)

Residuals:
   Min     1Q Median     3Q    Max 
-77.72 -14.72   0.21  15.43  69.18 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           3.31518   35.50439   0.093 0.925618    
Age                   1.31995    1.99240   0.662 0.507750    
SexMale               0.05241    1.67306   0.031 0.975012    
UrbanYes              7.52617    1.99273   3.777 0.000165 ***
YearsAfterHighschool -1.22356    1.98843  -0.615 0.538419    
IncomeZip             0.83736    0.07326  11.430  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.04 on 1594 degrees of freedom
Multiple R-squared:  0.2969,    Adjusted R-squared:  0.2947 
F-statistic: 134.6 on 5 and 1594 DF,  p-value: < 2.2e-16

Fitted Model

The fitted model includes the estimated coefficients from the regression output. After estimation, the model becomes:

MovieSpending = 3.315 + 1.320×Age + 0.052×SexMale + 7.526×UrbanYes - 1.224×YearsAfterHighschool + 0.837×IncomeZip

Result

The regression results show that only two predictors, UrbanYes and IncomeZip, are statistically significant at conventional levels.

UrbanYes has a positive coefficient of 7.526, indicating that urban residents spend approximately 7.5 dollars more on movies per month compared to non-urban residents, holding other variables constant.

IncomeZip also has a strong and positive effect on MovieSpending.

The coefficient of 0.837 means that for each one-unit increase in income, monthly movie spending increases by about 0.84 dollars on average.

This is consistent with the correlation results, which showed a positive association between income and spending.

In contrast, Age, SexMale, and YearsAfterHighschool are not statistically significant. Their high p-values indicate that these demographic variables do not contribute meaningfully to explaining variation in movie spending in our sample.

6.2 Selected Predictors

Based on the results of the full model, only IncomeZip and UrbanYes were statistically significant predictors of MovieSpending. In this section, we estimate a regression model that includes only the significant predictors.

Unfitted Model

The unfitted model specifies the relationship we want to estimate using only the significant predictors:

MovieSpending = B0 + B1×IncomeZip + B2×UrbanYes

ModelLMAll2=lm(MovieSpending ~ IncomeZip + UrbanYes, data=MovieNum) 
summary(ModelLMAll2)

Call:
lm(formula = MovieSpending ~ IncomeZip + UrbanYes, data = MovieNum)

Residuals:
    Min      1Q  Median      3Q     Max 
-76.300 -14.584   0.073  15.626  68.647 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 28.56323    4.12593   6.923 6.39e-12 ***
IncomeZip    0.83342    0.06323  13.181  < 2e-16 ***
UrbanYes     7.61718    1.98344   3.840 0.000128 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 23.04 on 1597 degrees of freedom
Multiple R-squared:  0.2954,    Adjusted R-squared:  0.2945 
F-statistic: 334.7 on 2 and 1597 DF,  p-value: < 2.2e-16

Fitted Model

After estimating the model, the fitted regression equation becomes:

MovieSpending = 28.563 + 0.833×IncomeZip + 7.617×UrbanYes

Result

The selected model shows that both predictors remain highly significant.

The coefficient for IncomeZip is 0.833, which means that individuals with higher income tend to spend more on movies.

Specifically, a one-unit increase in IncomeZip is associated with an average increase of about 0.83 dollars in monthly movie spending.

UrbanYes has a positive coefficient of 7.617, indicating that urban residents spend approximately 7.6 dollars more per month on movies than non-urban residents, holding income constant.

The R-squared value of 0.295 shows that the model explains about 30 percent of the variation in MovieSpending, which is nearly identical to the full model. This means that removing the insignificant predictors did not reduce the explanatory power of the model.

7 Summary

The correlation analysis shows that IncomeZip and UrbanYes have the strongest positive relationships with MovieSpending.

If income increases, then movie spending also tends to increase. Similarly, if a respondent lives in an urban area, then movie spending is typically higher than for non-urban respondents.

In contrast, Age, Sex, and YearsAfterHighschool show very weak correlations with movie spending, suggesting that these demographic variables are not major determinants of spending behavior in our dataset.

The regression results support these findings.

In using all predictors, IncomeZip and UrbanYes are the only predictors with statistically significant p-values. The null hypothesis for each coefficient states that the variable has no effect on MovieSpending. For IncomeZip and UrbanYes, we reject the null hypothesis because their p-values are below conventional significance levels. For Age, SexMale, and YearsAfterHighschool, the p-values are high, so we fail to reject the null hypothesis, meaning these variables do not significantly explain movie spending.

The selected predictor, which includes only IncomeZip and UrbanYes, performs nearly as well as the all predictor. This model shows that higher income and urban residency are the key factors associated with higher monthly movie spending.

Overall, our results suggest that economic conditions and urban living environments play important roles in shaping movie spending patterns, while individual demographic characteristics such as age and gender appear to have minimal impact.