Arnav Shah DATA 101 Project 2

Importing the Libraries and Dataset

# Citations/Disclaimer: This code follows what I learned from course/class notes
# In this chunk I am loading the dataset so I can clean it and analyze it for my research question
# I am using read_csv() to load my dataset into the data frame called camp_data
# The dataset I chose has variables such as Location, Ages, and Cost
library(readr)
library(dplyr)

## Warning: package 'dplyr' was built under R version 4.5.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.5.2

camp_data <- read_csv("Recreation_Summer_Camps_20260405.csv")

## Rows: 378 Columns: 20

## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (15): Season, Primary Category, Secondary Category, ActivityName, Descr...
## dbl   (3): Zip, Sessions, Cost
## time  (2): Start Time, End Time
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Citations/Disclaimer: This code follows what I learned from course/class notes
# I am using the head() function to preview the first six rows and see which variables are included in my dataset
head(camp_data)

## # A tibble: 6 × 20
##   Season `Primary Category` `Secondary Category` ActivityName        Description
##   <chr>  <chr>              <chr>                <chr>               <chr>      
## 1 Camp   Camps              Arts and Crafts      Abrakadoodle Beach… see below …
## 2 Camp   Camps              Youth Cooking        Future Chefs Cooki… <NA>       
## 3 Camp   Camps              Arts and Crafts      Abrakadoodle Artos… see below …
## 4 Camp   Camps              Youth Cooking        Future Chefs Cooki… <NA>       
## 5 Camp   Camps              Specialty Programs   Coach Doug Summer … The camp i…
## 6 Camp   Camps              Specialty Programs   Snapology Creature… <NA>       
## # ℹ 15 more variables: ActivityNumber <chr>, Ages <chr>, Location <chr>,
## #   Address <chr>, City <chr>, State <chr>, Zip <dbl>, `Start Date` <chr>,
## #   `End Date` <chr>, `Start Time` <time>, `End Time` <time>, Sessions <dbl>,
## #   `Days of the Week` <chr>, Cost <dbl>, `Address/Location` <chr>

# Using the dim() function, I am finding the number of rows and columns, which tells me the number of observations and variables in the dataset

dim(camp_data)

## [1] 378  20

# Using the colnames() function, I am listing the names of the variables, which helps me decide which variables to use

colnames(camp_data)

##  [1] "Season"             "Primary Category"   "Secondary Category"
##  [4] "ActivityName"       "Description"        "ActivityNumber"    
##  [7] "Ages"               "Location"           "Address"           
## [10] "City"               "State"              "Zip"               
## [13] "Start Date"         "End Date"           "Start Time"        
## [16] "End Time"           "Sessions"           "Days of the Week"  
## [19] "Cost"               "Address/Location"

Research Question

Is there a statistically significant difference in mean cost between summer camps for ages 8–12 and ages 13–16 in Montgomery County recreation programs?

Introduction

The dataset I am using comes from the Montgomery County Open Data Portal (https://data.montgomerycountymd.gov/Community-Recreation/Recreation-Summer-Camps/qx87-6tqs/about_data). It includes information about summer camp programs offered by the Montgomery County Department of Recreation. The dataset contains 378 observations and 20 variables, such as cost, ages, activity name, and location. For my analysis, I am mainly focusing on the Cost and Ages variables since they relate directly to my research question.

I chose this dataset because it represents real-world pricing for summer recreation programs in Montgomery County, Maryland. My goal is to understand whether different age groups are charged different amounts for summer camps and whether that difference is statistically significant. This can provide some insight into how public programs are priced.

Variables

In my project, I focused on a few important variables from the dataset that relate directly to my research question about a difference in mean cost between the two age groups.

Cost: This is a quantitative variable that represents the price of each summer camp program in dollars. I used this variable since my analysis compares the average cost between two age groups.

Ages: This is a categorical variable that shows the age range for each camp, for example “Ages 8-12” or “Ages 13-16”. I used this variable to figure out which age group each camp belongs to.

Age_Group: This is a categorical variable that I created from the Ages column. It groups the data into two categories: 8–12 and 13–16. I used this variable to run my statistical test and create my visualization.

Data Analysis

In this analysis, I first clean the dataset by removing rows with missing values in the Cost and Ages variables. Next, I calculate summary statistics such as the mean and maximum cost to better understand the distribution. After that, I create a new variable that groups camps into two categories: ages 8–12 and ages 13–16. Then I create a bar chart to compare the average cost between the two groups. Finally, I run a two-sample t-test to check whether the difference is statistically significant.

Data Cleaning

# Citations/Disclaimer: This code follows what I learned from class notes
# In this chunk, I am using colSums(is.na()) to check the number of missing values in each variable
# This helps me understand what data needs to be cleaned before analysis
colSums(is.na(camp_data))

##             Season   Primary Category Secondary Category       ActivityName 
##                  0                  0                 22                  0 
##        Description     ActivityNumber               Ages           Location 
##                 14                  0                  1                  0 
##            Address               City              State                Zip 
##                 21                  0                  0                  0 
##         Start Date           End Date         Start Time           End Time 
##                  0                  0                  0                  0 
##           Sessions   Days of the Week               Cost   Address/Location 
##                  0                  0                  0                  0

# Here I am using the filter() function to remove the rows where Ages or Cost is missing
# camp_clean is the dataset I will use for my analysis
camp_clean <- camp_data %>%
  filter(!is.na(Cost) & !is.na(Ages))
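
One way to confirm what the filter above does is to compare row counts before and after it runs. The sketch below uses a small made-up data frame (toy_camps is a hypothetical name, and its values are not from the real dataset) so that it runs on its own.

```r
library(dplyr)

# Hypothetical toy data standing in for camp_data (values are made up)
toy_camps <- data.frame(
  Ages = c("Ages 8-12", NA, "Ages 13-16", "Ages 8-12"),
  Cost = c(235, 120, NA, 90)
)

# Same filter as above: keep rows where both Cost and Ages are present
toy_clean <- toy_camps %>%
  filter(!is.na(Cost) & !is.na(Ages))

# Number of rows removed by the filter
rows_removed <- nrow(toy_camps) - nrow(toy_clean)
rows_removed
```

On the real data, the same comparison would be nrow(camp_data) - nrow(camp_clean). Since the colSums() output shows 1 missing value in Ages and 0 in Cost, that difference should be 1 row.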

Exploratory Data Analysis

# Citations/Disclaimer: This code follows what I learned from class notes
# In this chunk I am showing the summary statistics for the Cost variable
# I also calculated the mean and max cost across the summer programs
# I am using mean() to calculate the average cost for all camps
# I am using max() to find the highest cost in the dataset
# na.rm = TRUE means that missing values are ignored in the calculations

summary(camp_clean$Cost)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    40.0   122.5   235.0   220.2   284.0  1705.0

mean(camp_clean$Cost, na.rm = TRUE)

## [1] 220.1592

max(camp_clean$Cost, na.rm = TRUE)

## [1] 1705
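
To illustrate what na.rm = TRUE does, here is a minimal sketch with made-up numbers (toy_cost is a hypothetical vector, not part of the dataset). Without na.rm = TRUE, a single missing value makes the result NA.

```r
# Made-up cost values with one missing entry (not from the real dataset)
toy_cost <- c(100, NA, 300)

mean(toy_cost)                 # NA, because one value is missing
mean(toy_cost, na.rm = TRUE)   # 200, the mean of 100 and 300
max(toy_cost, na.rm = TRUE)    # 300
```

Because camp_clean has no missing values left in Cost, na.rm = TRUE does not change the results here, which is why mean() agrees with the Mean shown by summary().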

Creating Age Groups

# Citations/Disclaimer: This code follows what I learned from course/class notes
# In this chunk, I am creating a new variable called Age_Group from the Ages column.
# I am using nested ifelse() calls to separate my data into two age categories: 8-12 and 13-16.
# If the Ages value is "Ages 8-12", I assign the row to the 8-12 age group.
# If the Ages value is "Ages 13-16", I assign the row to the 13-16 age group.
# Any value that does not match these two categories is labeled NA and then removed with filter().

camp_groups <- camp_clean %>%
  mutate(Age_Group = ifelse(Ages == "Ages 8-12", "8-12",
                     ifelse(Ages == "Ages 13-16", "13-16", NA))) %>%
  filter(!is.na(Age_Group))

# I check that both groups exist
table(camp_groups$Age_Group)

## 
## 13-16  8-12 
##    52    26
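
The nested ifelse() calls above could also be written with dplyr's case_when(), which some people find easier to read when there are several conditions. This is only a sketch on made-up data: toy_ages is hypothetical, and the "Ages 5-7" label is an invented example of a value that falls outside both groups.

```r
library(dplyr)

# Hypothetical toy data; "Ages 5-7" is an invented non-matching label
toy_ages <- data.frame(
  Ages = c("Ages 8-12", "Ages 13-16", "Ages 5-7", "Ages 8-12")
)

toy_groups <- toy_ages %>%
  mutate(Age_Group = case_when(
    Ages == "Ages 8-12"  ~ "8-12",
    Ages == "Ages 13-16" ~ "13-16",
    TRUE                 ~ NA_character_   # everything else becomes NA
  )) %>%
  filter(!is.na(Age_Group))                # drop rows outside both groups

table(toy_groups$Age_Group)
```

Both versions should give the same grouping. The choice between them is mostly about readability.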

# Citations/Disclaimer: This code follows what I learned from course/class notes
# In this chunk I am using group_by(Age_Group) to organize my data into the two age groups (8-12) and (13-16)
# I am using the mutate() function to create a new variable called mean_cost, which holds the average cost for each group
# I am using the count() function to collapse the data to one row per group, with n giving the number of camps in that group
# camp_groups2 stores the mean cost and the count for both groups

camp_groups2 <- camp_groups %>%
  group_by(Age_Group) %>%
  mutate(mean_cost = mean(Cost)) %>%
  count(mean_cost)
camp_groups2

## # A tibble: 2 × 3
## # Groups:   Age_Group [2]
##   Age_Group mean_cost     n
##   <chr>         <dbl> <int>
## 1 13-16          69.1    52
## 2 8-12          238.     26
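
The same kind of table can also be built with summarise(), which calculates one value per group directly instead of adding mean_cost to every row first. The sketch below uses made-up numbers (toy_groups and toy_summary are hypothetical names) and is meant as an alternative, not a replacement for the code above.

```r
library(dplyr)

# Hypothetical toy data (costs are made up)
toy_groups <- data.frame(
  Age_Group = c("8-12", "8-12", "13-16", "13-16", "13-16"),
  Cost = c(200, 300, 50, 70, 90)
)

# One row per age group with its mean cost and number of camps
toy_summary <- toy_groups %>%
  group_by(Age_Group) %>%
  summarise(mean_cost = mean(Cost), n = n())

toy_summary
```

With the real data, replacing toy_groups with camp_groups should give the same mean_cost and n values as camp_groups2.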

Visualization

# Citations/Disclaimer: This code follows what I learned from course/class notes
# In this chunk I am creating a bar chart comparing average cost between the two age groups
# I am using ggplot() to create the bar chart from camp_groups2
# I am using Age_Group as the x-axis and the mean_cost for the ages (8-12) and (13-16) as the y-axis
# I also used fill = Age_Group to give each bar a different color
# I am using geom_col() so that the height of each bar is the average cost
# I used the labs() function to label the title and axes
# Finally, I used theme_minimal() to make the graph organized and visually appealing

ggplot(camp_groups2, aes(x = Age_Group, y = mean_cost, fill = Age_Group)) +
  geom_col() +
  labs(
    title = "Average Camp Cost by Age Group",
    x = "Age Group (8–12 vs 13–16)",
    y = "Average Cost ($)",
    caption = "Source: Montgomery County Open Data Portal"
  ) +
  theme_minimal()
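
A bar chart of means shows only one number per group. Since the summary statistics show a maximum cost (1705) far above the third quartile (284), a boxplot might be a useful companion plot for seeing the spread in each group. The following is a sketch on made-up data (toy_groups is hypothetical), not a plot of the real dataset.

```r
library(ggplot2)

# Hypothetical toy data (costs are made up, including one high value)
toy_groups <- data.frame(
  Age_Group = rep(c("8-12", "13-16"), each = 4),
  Cost = c(210, 240, 255, 600, 55, 65, 70, 85)
)

# Boxplot of individual camp costs within each age group
p <- ggplot(toy_groups, aes(x = Age_Group, y = Cost, fill = Age_Group)) +
  geom_boxplot() +
  labs(
    title = "Camp Cost Distribution by Age Group (toy data)",
    x = "Age Group",
    y = "Cost ($)"
  ) +
  theme_minimal()

p
```

For the real data, camp_groups (one row per camp) would be used in place of toy_groups, because a boxplot needs the individual costs and not just the group means.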

Hypothesis Test (Difference in Means Test)

# Citations/Disclaimer: This code follows what I learned from course/class notes
# In this chunk I am running a two-sample t-test to compare the mean cost between the two age groups (8-12) and (13-16)
# This chunk helps me figure out whether the difference in average cost between the two groups is statistically significant

t_test <- t.test(Cost ~ Age_Group, data = camp_groups)

t_test

## 
##  Welch Two Sample t-test
## 
## data:  Cost by Age_Group
## t = -13.96, df = 64.581, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group 13-16 and group 8-12 is not equal to 0
## 95 percent confidence interval:
##  -192.894 -144.606
## sample estimates:
## mean in group 13-16  mean in group 8-12 
##            69.09615           237.84615

Hypothesis Test Result

I performed a two-sample t-test to compare the mean cost of summer camps between the age groups (8-12) and (13-16).

Hypothesis:

H₀ (Null Hypothesis): The mean cost for ages 13-16 is equal to the mean cost for ages 8-12. μ₁ = μ₂

H₁ (Alternative Hypothesis): The mean costs are different between the age groups 13-16 and 8-12. μ₁ ≠ μ₂

The p-value from my test is p-value < 2.2e-16, which is very small.

My interpretation is:

The p-value is much smaller than α (alpha) = 0.05, so I reject the null hypothesis. This is strong evidence that the mean cost of summer camps differs between the age groups (8-12) and (13-16). The 95% confidence interval for the difference in means (13-16 minus 8-12) is -192.894 to -144.606, which does not include 0. This supports the same conclusion of statistical significance. Because the whole interval is negative, the mean cost for ages (13-16) is significantly lower than the mean cost for ages (8-12): about $69 compared to about $238.
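
As a quick consistency check, the confidence interval can be approximately rebuilt from the other numbers in the t-test output. This sketch uses only the rounded values printed above (the two group means, t = -13.96, and df = 64.581), so small rounding differences are expected. The variable names are my own.

```r
# Values copied from the t-test output above
mean_13_16 <- 69.09615
mean_8_12  <- 237.84615
t_stat     <- -13.96
df         <- 64.581

# Difference in means (13-16 minus 8-12)
diff_means <- mean_13_16 - mean_8_12

# Standard error implied by the t statistic: t = difference / SE
se <- diff_means / t_stat

# 95% margin of error from the t distribution with the Welch df
margin <- qt(0.975, df) * se

# Rebuilt confidence interval (lower bound, upper bound)
ci <- c(diff_means - margin, diff_means + margin)
ci
```

The two values should land very close to the reported interval of -192.894 to -144.606, which shows that the interval is the difference in means plus or minus about two standard errors.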

Conclusion

Looking back at my analysis of the Recreation Summer Camps dataset from Montgomery County, Maryland, and connecting it to my research question, I found a statistically significant difference in mean cost between the two age groups (8-12 and 13-16). My t-test produced a very small p-value, which is why I rejected my null hypothesis. The results also show that summer camps for ages 8-12 are more expensive on average than camps for ages 13-16. This suggests that age group plays an important role in the pricing of summer camps in Montgomery County.

I think these findings are relevant because they show that pricing is not consistent across age groups and may also depend on factors such as the type of activity. In future research, I could explore other variables to better understand the reasons behind these cost differences. I could also broaden my analysis by including more age groups and comparing data across several years rather than just one. I believe this would strengthen my analysis and findings.

References

Montgomery County, Maryland. Recreation Summer Camps Dataset.

https://data.montgomerycountymd.gov/Community-Recreation/Recreation-Summer-Camps/qx87-6tqs/about_data