library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
usbirth <- read.csv("C:/Users/saisr/Downloads/US_births_2000-2014_SSA.csv")
str(usbirth)
## 'data.frame': 5479 obs. of 5 variables:
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ date_of_month: int 1 2 3 4 5 6 7 8 9 10 ...
## $ day_of_week : int 6 7 1 2 3 4 5 6 7 1 ...
## $ births : int 9083 8006 11363 13032 12558 12466 12516 8934 7949 11668 ...
head(usbirth)
## year month date_of_month day_of_week births
## 1 2000 1 1 6 9083
## 2 2000 1 2 7 8006
## 3 2000 1 3 1 11363
## 4 2000 1 4 2 13032
## 5 2000 1 5 3 12558
## 6 2000 1 6 4 12466
summary(usbirth)
## year month date_of_month day_of_week births
## Min. :2000 Min. : 1.000 Min. : 1.00 Min. :1 Min. : 5728
## 1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2 1st Qu.: 8740
## Median :2007 Median : 7.000 Median :16.00 Median :4 Median :12343
## Mean :2007 Mean : 6.523 Mean :15.73 Mean :4 Mean :11350
## 3rd Qu.:2011 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:6 3rd Qu.:13082
## Max. :2014 Max. :12.000 Max. :31.00 Max. :7 Max. :16081
# Numeric summary
numeric_summary <- usbirth %>%
  summarise(
    min_births = min(births),
    max_births = max(births),
    mean_births = mean(births),
    median_births = median(births),
    sd_births = sd(births),
    q25_births = quantile(births, 0.25),
    q75_births = quantile(births, 0.75)
  )
numeric_summary
## min_births max_births mean_births median_births sd_births q25_births
## 1 5728 16081 11350.07 12343 2325.821 8740
## q75_births
## 1 13082
1. Range of Births:
Min: The minimum number of births in a day is 5,728. Max: The maximum number of births in a day is 16,081. Insight: The busiest day had about 2.8 times as many births as the quietest day, so daily counts vary widely.
2. Central Tendency:
Mean: The mean number of births per day is approximately 11,350. Median: The median number of births per day is 12,343. Insight: The mean is roughly 1,000 lower than the median, which indicates that the distribution is skewed towards days with lower birth counts (a left skew).
3. Spread and Variability:
Standard Deviation (SD): The standard deviation is about 2,325.8, which measures the spread of the birth counts around the mean. Insight: The standard deviation is roughly 20% of the mean, so the number of births varies considerably from day to day.
4. Quantiles:
25th Percentile (Q25): The 25th percentile is 8,740. 75th Percentile (Q75): The 75th percentile is 13,082. Insight: The interquartile range (IQR) between the 25th and 75th percentiles (13,082 - 8,740 = 4,342) covers the middle 50% of the data. The median (12,343) sits much closer to Q75 than to Q25, which is consistent with the left skew noted above.
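As a quick check on the quartiles, the sketch below applies Tukey's 1.5 × IQR rule to the values printed in the numeric_summary output. The variable names are illustrative, and the rule itself is one common convention for flagging outliers, not the only one.

```r
# Values copied from the numeric_summary output above
q25 <- 8740
q75 <- 13082
min_births <- 5728
max_births <- 16081

# Interquartile range and Tukey fences (1.5 * IQR beyond each quartile)
iqr <- q75 - q25
lower_fence <- q25 - 1.5 * iqr
upper_fence <- q75 + 1.5 * iqr

# TRUE when neither the minimum nor the maximum falls outside the fences
no_outliers <- min_births >= lower_fence && max_births <= upper_fence

print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))
print(no_outliers)
```

Because both the minimum and the maximum lie inside the fences, this rule flags no individual day as an outlier, even though the range is wide.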
# Summary for categorical columns
day_of_week_summary <- usbirth %>%
  count(day_of_week) %>%
  arrange(day_of_week)
month_summary <- usbirth %>%
  count(month) %>%
  arrange(month)
day_of_week_summary
## day_of_week n
## 1 1 783
## 2 2 783
## 3 3 783
## 4 4 782
## 5 5 782
## 6 6 783
## 7 7 783
month_summary
## month n
## 1 1 465
## 2 2 424
## 3 3 465
## 4 4 450
## 5 5 465
## 6 6 450
## 7 7 465
## 8 8 465
## 9 9 450
## 10 10 465
## 11 11 450
## 12 12 465
Day of the Week Distribution: Observation: Each day of the week has 782 or 783 records (days 4 and 5 have 782; the others have 783). Insight: Because the counts are nearly identical, any analysis by day_of_week is not skewed by the number of records, which allows straightforward comparisons between days of the week.
Month Variation in Record Counts: Observation: Months with 31 days (such as January and March) have 465 records each, months with 30 days (such as April) have 450, and February has 424. Insight: These counts match full calendar coverage over the 15 years from 2000 to 2014 (31 × 15 = 465, 30 × 15 = 450, and 28 × 15 plus 4 leap days = 424), so no month is incomplete. The differences reflect month length only, which means monthly totals are better compared per day than as raw sums.
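The record counts can be checked against the calendar without touching the CSV. The sketch below generates every date from 2000-01-01 to 2014-12-31 and tabulates months and weekdays. It assumes day_of_week is coded 1 = Monday through 7 = Sunday, which is inferred from the first row (2000-01-01, a Saturday, is coded 6) and is not documented in the source.

```r
# Every calendar day covered by the dataset's year range
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
lt <- as.POSIXlt(dates)

# Days per month across all 15 years (lt$mon is 0-based)
month_counts <- table(lt$mon + 1)

# lt$wday is 0 = Sunday .. 6 = Saturday; recode Sunday to 7
# to match the assumed 1 = Monday .. 7 = Sunday coding
dow <- ifelse(lt$wday == 0, 7, lt$wday)
dow_counts <- table(dow)

print(length(dates))
print(month_counts)
print(dow_counts)
```

If the tables match day_of_week_summary and month_summary, the dataset contains exactly one record per calendar day.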
How do birth counts vary by month?
Is there a pattern in births based on the day of the week?
Do the number of births show any significant trend over the months?
# Aggregating births by month
monthly_births <- usbirth %>%
  group_by(month) %>%
  summarise(total_births = sum(births))
# Visualization
# geom_col() is equivalent to geom_bar(stat = "identity")
ggplot(monthly_births, aes(x = factor(month), y = total_births)) +
  geom_col(fill = "skyblue") +
  labs(x = "Month", y = "Total Births", title = "Total Births by Month") +
  theme_minimal()
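Monthly totals depend partly on how many days each month contributes, so a per-day average can be easier to compare. The sketch below is one way to compute it in base R. The helper name births_per_day_by_month is hypothetical, and the demo uses only the six rows printed by head(usbirth), so it returns a single January row.

```r
# Hypothetical helper: totals, day counts, and per-day averages by month
births_per_day_by_month <- function(df) {
  total <- tapply(df$births, df$month, sum)
  days <- tapply(df$births, df$month, length)
  data.frame(
    month = as.integer(names(total)),
    total_births = as.vector(total),
    days = as.vector(days),
    births_per_day = as.vector(total / days)
  )
}

# First six rows of the dataset, as printed by head(usbirth)
first_rows <- data.frame(
  year = 2000, month = 1, date_of_month = 1:6,
  day_of_week = c(6, 7, 1, 2, 3, 4),
  births = c(9083, 8006, 11363, 13032, 12558, 12466)
)

result <- births_per_day_by_month(first_rows)
print(result)
```

Calling births_per_day_by_month(usbirth) on the full data frame would give twelve rows, and plotting births_per_day instead of total_births removes the month-length effect.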
# Aggregating births by day of the week
day_of_week_births <- usbirth %>%
  group_by(day_of_week) %>%
  summarise(total_births = sum(births))
# Visualization
# geom_col() is equivalent to geom_bar(stat = "identity")
ggplot(day_of_week_births, aes(x = factor(day_of_week), y = total_births)) +
  geom_col(fill = "salmon") +
  labs(x = "Day of the Week", y = "Total Births", title = "Total Births by Day of the Week") +
  theme_minimal()
# Create a new column for combined month-year
data <- usbirth %>%
  mutate(month_year = factor(paste(month, year, sep = "-"), levels = unique(paste(month, year, sep = "-"))))
# Visualization
# theme_minimal() comes before theme(): a complete theme added last would
# reset the axis-text rotation
ggplot(data, aes(x = month_year, y = births)) +
  geom_line(group = 1, color = "purple") +
  geom_point(color = "blue") +
  labs(x = "Month-Year", y = "Number of Births", title = "Trends in Number of Births Over Time") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
Trend Observation: Each point is a single day's birth count plotted at its month-year position, so the vertical spread within a month shows day-to-day variation, and the shift between months shows whether births rise, fall, or stay stable over time.
Seasonal Patterns: Look for recurring patterns or trends in the plot. For example, you might notice higher births during certain months or years, indicating seasonal effects.
Long-Term Trends: The plot helps identify any long-term trends in the data, such as overall increases or decreases in the number of births over the observed period.
Monthly Variability: Variations from month to month can reveal patterns in birth rates, such as higher or lower numbers in specific months, possibly related to seasonal or cultural factors.
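Long-term and seasonal patterns may be easier to read from one value per month on a true date axis. The sketch below is one possible way to build such a series in base R. The helper name monthly_mean_series is hypothetical, and the demo again uses only the six rows shown by head(usbirth).

```r
# Hypothetical helper: mean daily births per calendar month,
# keyed by the first day of that month as a Date
monthly_mean_series <- function(df) {
  key <- sprintf("%d-%02d-01", as.integer(df$year), as.integer(df$month))
  m <- tapply(df$births, key, mean)
  # ISO-formatted keys sort alphabetically in chronological order
  data.frame(month_start = as.Date(names(m)), mean_births = as.vector(m))
}

# First six rows of the dataset, as printed by head(usbirth)
first_rows <- data.frame(
  year = 2000, month = 1, date_of_month = 1:6,
  day_of_week = c(6, 7, 1, 2, 3, 4),
  births = c(9083, 8006, 11363, 13032, 12558, 12466)
)

series <- monthly_mean_series(first_rows)
print(series)
```

On the full data, monthly_mean_series(usbirth) should yield 180 rows (12 months × 15 years), which can be passed to ggplot with month_start on the x axis and geom_line() for the trend.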
Distribution of births
ggplot(usbirth, aes(x = births)) +
  geom_histogram(bins = 20, fill = "lightblue", color = "black", alpha = 0.7) +
  labs(x = "Number of Births", y = "Frequency", title = "Distribution of Births") +
  theme_minimal()
Distribution of Births: Histogram showing the distribution of the number of births.
Births vs day of week
ggplot(usbirth, aes(x = factor(day_of_week), y = births)) +
  geom_boxplot(aes(color = factor(day_of_week)), fill = "lightgreen") +
  labs(x = "Day of the Week", y = "Number of Births", title = "Number of Births by Day of the Week") +
  theme_minimal()
Births vs Day of Week: Boxplot showing the spread of births across different days of the week.
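A simple numeric companion to the boxplot is a weekday versus weekend comparison. The sketch below assumes days 6 and 7 are Saturday and Sunday (inferred from 2000-01-01, a Saturday, being coded 6) and uses only the six rows printed by head(usbirth), so it illustrates the method without establishing a pattern.

```r
# First six rows of the dataset, as printed by head(usbirth)
first_rows <- data.frame(
  day_of_week = c(6, 7, 1, 2, 3, 4),
  births = c(9083, 8006, 11363, 13032, 12558, 12466)
)

# Assumed coding: 1-5 = Monday to Friday, 6-7 = Saturday and Sunday
first_rows$day_type <- ifelse(first_rows$day_of_week >= 6, "weekend", "weekday")

# Mean births for each day type
means <- tapply(first_rows$births, first_rows$day_type, mean)
print(means)
```

In these six rows the two weekend days average 8,544.5 births against 12,354.75 for the four weekdays. Running the same two steps on usbirth would show whether that gap holds across all 5,479 days.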
Is there a relationship between the day of the week and the number of births?
Does the number of births change significantly between different years?
What is the impact of leap years on birth rates, if any?
How do holidays or specific events impact birth rates?
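For the leap-year question, a useful first step is to identify which years in the range are leap years and how many February 29 records to expect. The sketch below uses the standard Gregorian rule; the function name is_leap_year is illustrative.

```r
# Gregorian rule: divisible by 4, except centuries not divisible by 400
is_leap_year <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

years <- 2000:2014
leap_years <- years[is_leap_year(years)]
print(leap_years)

# On the real data, the leap-day rows could be pulled out with:
# subset(usbirth, month == 2 & date_of_month == 29)
```

With only four leap days in the period, any estimate of a leap-day effect rests on four observations, and each falls on a different weekday, so it would need to be compared against the same weekday in neighbouring weeks.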