loading and exploring the data set

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

usbirth <- read.csv("C:/Users/saisr/Downloads/US_births_2000-2014_SSA.csv")

str(usbirth)

## 'data.frame':    5479 obs. of  5 variables:
##  $ year         : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ month        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ date_of_month: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_week  : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ births       : int  9083 8006 11363 13032 12558 12466 12516 8934 7949 11668 ...

head(usbirth)

##   year month date_of_month day_of_week births
## 1 2000     1             1           6   9083
## 2 2000     1             2           7   8006
## 3 2000     1             3           1  11363
## 4 2000     1             4           2  13032
## 5 2000     1             5           3  12558
## 6 2000     1             6           4  12466

summary(usbirth)

##       year          month        date_of_month    day_of_week     births     
##  Min.   :2000   Min.   : 1.000   Min.   : 1.00   Min.   :1    Min.   : 5728  
##  1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2    1st Qu.: 8740  
##  Median :2007   Median : 7.000   Median :16.00   Median :4    Median :12343  
##  Mean   :2007   Mean   : 6.523   Mean   :15.73   Mean   :4    Mean   :11350  
##  3rd Qu.:2011   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:6    3rd Qu.:13082  
##  Max.   :2014   Max.   :12.000   Max.   :31.00   Max.   :7    Max.   :16081

numerical summary

# Numeric summary
numeric_summary <- usbirth %>%
  summarise(
    min_births = min(births),
    max_births = max(births),
    mean_births = mean(births),
    median_births = median(births),
    sd_births = sd(births),
    q25_births = quantile(births, 0.25),
    q75_births = quantile(births, 0.75)
  )
numeric_summary

##   min_births max_births mean_births median_births sd_births q25_births
## 1       5728      16081    11350.07         12343  2325.821       8740
##   q75_births
## 1      13082

Insights:

1. Range of Births:

Min: The minimum number of births in a day is 5,728. Max: The maximum number of births in a day is 16,081. Insight: The busiest day has nearly three times as many births as the quietest (16,081 / 5,728 ≈ 2.8), so daily counts vary substantially.

2. Central Tendency:

Mean: The mean number of births per day is approximately 11,350. Median: The median number of births per day is 12,343. Insight: The mean is about 990 below the median, which suggests the distribution is pulled toward days with lower birth counts (left-skewed).

3. Spread and Variability:

Standard Deviation (SD): The standard deviation is about 2,325.8, which measures the spread of the birth counts around the mean. Insight: The standard deviation is roughly 20% of the mean, so the number of births varies considerably from day to day.

4. Quantiles:

25th Percentile (Q25): The 25th percentile is 8,740. 75th Percentile (Q75): The 75th percentile is 13,082. Insight: The interquartile range (IQR) between the 25th and 75th percentiles (13,082 - 8,740 = 4,342) measures the spread of the middle 50% of the data. An IQR this wide relative to the median indicates a large spread in the middle range of birth counts.
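
The derived figures in these insights can be checked with a few lines of arithmetic. The sketch below hard-codes the values printed by numeric_summary above instead of reloading the CSV; the variable names for the derived quantities (iqr_births, max_to_min, mean_minus_median, cv_births) are illustrative and not part of the original analysis.

```r
# Values copied from the printed numeric_summary output above
min_births    <- 5728
max_births    <- 16081
mean_births   <- 11350.07
median_births <- 12343
sd_births     <- 2325.821
q25_births    <- 8740
q75_births    <- 13082

# Interquartile range: width of the middle 50% of daily counts
iqr_births <- q75_births - q25_births

# Ratio of the busiest day to the quietest day
max_to_min <- max_births / min_births

# A negative gap (mean below median) hints at a left-skewed distribution
mean_minus_median <- mean_births - median_births

# Coefficient of variation: spread relative to the mean
cv_births <- sd_births / mean_births

iqr_births                   # 4342
round(max_to_min, 2)         # 2.81
round(mean_minus_median, 2)  # -992.93
round(cv_births, 3)          # 0.205
```

The coefficient of variation is only one convenient way to judge whether a standard deviation is "high"; it is used here because it puts the spread on the same scale as the mean.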

categorical summary

# Summary for categorical columns
day_of_week_summary <- usbirth %>%
  count(day_of_week) %>%
  arrange(day_of_week)

month_summary <- usbirth %>%
  count(month) %>%
  arrange(month)

day_of_week_summary

##   day_of_week   n
## 1           1 783
## 2           2 783
## 3           3 783
## 4           4 782
## 5           5 782
## 6           6 783
## 7           7 783

month_summary

##    month   n
## 1      1 465
## 2      2 424
## 3      3 465
## 4      4 450
## 5      5 465
## 6      6 450
## 7      7 465
## 8      8 465
## 9      9 450
## 10    10 465
## 11    11 450
## 12    12 465

Insights

Day of the Week Near-Uniform Distribution: Observation: Each day of the week has 782 or 783 records (days 4 and 5 have 782; the others have 783). Insight: This near-uniform distribution means that any analysis by day_of_week is not skewed by the number of records, which allows straightforward comparisons between different days of the week.
Month Variation in Data Availability: Observation: The 31-day months (January, March, May, July, August, October, December) have 465 records each (31 × 15 years), the 30-day months (April, June, September, November) have 450 each (30 × 15), and February has 424 (28 × 15, plus one extra day in each of the leap years 2000, 2004, 2008, and 2012). Insight: Every month is fully covered across 2000-2014; the differences in record counts reflect only the length of each month. Monthly totals should therefore be compared with month length in mind, for example by looking at average births per day.
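
The record counts in both tables can be reproduced from the calendar alone. The following is a minimal sketch, assuming the file has exactly one row for every day from 2000-01-01 to 2014-12-31 (an assumption inferred from the file name and the 5479 rows reported by str(usbirth)) and that day_of_week is coded 1 = Monday through 7 = Sunday. It uses base R only and does not read the CSV.

```r
# All calendar days assumed to be covered by the data set
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")

# POSIXlt numbers weekdays 0 (Sunday) to 6 (Saturday); recode Sunday as 7
# so that 1 = Monday ... 7 = Sunday (assumed coding of day_of_week)
wday <- as.POSIXlt(dates)$wday
day_of_week <- ifelse(wday == 0, 7, wday)

# Month number 1-12 (POSIXlt stores months as 0-11)
month <- as.POSIXlt(dates)$mon + 1

length(dates)       # 5479, the number of rows reported by str(usbirth)
table(day_of_week)  # 783 for days 1, 2, 3, 6, 7; 782 for days 4 and 5
table(month)        # 465 or 450 per 31- or 30-day month, 424 for February
```

Because these counts match day_of_week_summary and month_summary, the small differences between categories are consistent with calendar effects rather than missing data.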

novel questions

How do birth counts vary by month?
Is there a pattern in births based on the day of the week?
Does the number of births show any significant trend over the months?

# Aggregating births by month
monthly_births <- usbirth %>%
  group_by(month) %>%
  summarise(total_births = sum(births))

# Visualization
ggplot(monthly_births, aes(x = factor(month), y = total_births)) +
  geom_col(fill = "skyblue") +
  labs(x = "Month", y = "Total Births", title = "Total Births by Month") +
  theme_minimal()

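Insights: Total Births by Month: The bar chart helps identify months with peak and low birth totals. Because months differ in length (February contributes 424 daily records versus 465 for a 31-day month), part of any difference in totals reflects the number of days rather than the birth rate. If certain months still show high birth counts after accounting for this, it might be related to seasonality or other factors influencing conception.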

# Aggregating births by day of the week
day_of_week_births <- usbirth %>%
  group_by(day_of_week) %>%
  summarise(total_births = sum(births))

# Visualization
ggplot(day_of_week_births, aes(x = factor(day_of_week), y = total_births)) +
  geom_col(fill = "salmon") +
  labs(x = "Day of the Week", y = "Total Births", title = "Total Births by Day of the Week") +
  theme_minimal()

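Insights: Overall Distribution: The bar plot shows which days of the week have the highest and lowest total births. In this data set day_of_week appears to run from 1 (Monday) to 7 (Sunday): January 1, 2000, a Saturday, is coded 6. The first rows already hint at a pattern, with the weekend counts (9,083 and 8,006) well below the weekday counts that follow (11,363 to 13,032). If weekdays consistently have more births than weekends, it might be due to elective scheduling practices.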
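
A weekday versus weekend comparison can be sketched as follows. To stay self-contained, the example uses only the first ten rows shown in the str(usbirth) output (January 1-10, 2000), so the numbers illustrate the method and are not a result for the full data set. The names sample_births, day_type, and day_type_means are hypothetical, and treating codes 6 and 7 as the weekend is an assumption based on January 1, 2000 being a Saturday.

```r
library(dplyr)

# First ten rows of usbirth, copied from the str() output above
sample_births <- data.frame(
  day_of_week = c(6, 7, 1, 2, 3, 4, 5, 6, 7, 1),
  births = c(9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 7949, 11668)
)

# Label each row as weekday or weekend (assumes 6 = Saturday, 7 = Sunday),
# then compare mean daily births between the two groups
day_type_means <- sample_births %>%
  mutate(day_type = if_else(day_of_week >= 6, "weekend", "weekday")) %>%
  group_by(day_type) %>%
  summarise(mean_births = mean(births), n_days = n())

day_type_means
```

On these ten rows the weekday mean is about 12,267 (6 days) and the weekend mean is 8,493 (4 days). Using means rather than totals keeps the comparison fair when the groups contain different numbers of days.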

# Create a new column for combined month-year
data <- usbirth %>%
  mutate(month_year = factor(paste(month, year, sep = "-"), levels = unique(paste(month, year, sep = "-"))))

# Visualization
ggplot(data, aes(x = month_year, y = births)) +
  geom_line(group = 1, color = "purple") +
  geom_point(color = "blue") +
  labs(x = "Month-Year", y = "Number of Births", title = "Trends in Number of Births Over Time") +
  theme_minimal() +
  # Rotate the axis labels after theme_minimal(), which would otherwise reset them
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Insights:

Trend Observation: The line and points show daily birth counts placed at their month-year label, so each label carries one point per day of that month. You can see whether the number of births increases, decreases, or remains stable over time.

Seasonal Patterns: Look for recurring patterns or trends in the plot. For example, you might notice higher births during certain months or years, indicating seasonal effects.

Long-Term Trends: The plot helps identify any long-term trends in the data, such as overall increases or decreases in the number of births over the observed period.

Monthly Variability: Variations from month to month can reveal patterns in birth rates, such as higher or lower numbers in specific months, possibly related to seasonal or cultural factors.
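
Since the plot above positions daily counts at month-year labels, one way to get a single point per month is to aggregate before plotting. The following is a minimal sketch of that idea, not the method used in the original analysis. It runs on the first ten rows shown in the str(usbirth) output so that it is self-contained; sample_births, monthly_trend, total_births, n_days, and month_start are hypothetical names.

```r
library(dplyr)

# First ten rows of usbirth (January 1-10, 2000), copied from the str() output
sample_births <- data.frame(
  year = rep(2000, 10),
  month = rep(1, 10),
  births = c(9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 7949, 11668)
)

# Collapse daily rows to one row per calendar month, then build a Date
# column so the x-axis is ordered and spaced chronologically
monthly_trend <- sample_births %>%
  group_by(year, month) %>%
  summarise(total_births = sum(births), n_days = n(), .groups = "drop") %>%
  mutate(month_start = as.Date(sprintf("%d-%02d-01", year, month)))

monthly_trend
```

Applied to the full usbirth data frame, the same pipeline would give 180 rows (15 years × 12 months), which could be plotted with aes(x = month_start, y = total_births). A Date x-axis also avoids printing 180 rotated factor labels.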

visual summaries

distribution of births

ggplot(usbirth, aes(x = births)) +
  geom_histogram(bins = 20, fill = "lightblue", color = "black", alpha = 0.7) +
  labs(x = "Number of Births", y = "Frequency", title = "Distribution of Births") +
  theme_minimal()

Distribution of Births: Histogram showing the distribution of the number of births.

births vs day of week

ggplot(usbirth, aes(x = factor(day_of_week), y = births)) +
  geom_boxplot(aes(color = factor(day_of_week)), fill = "lightgreen") +
  labs(x = "Day of the Week", y = "Number of Births", title = "Number of Births by Day of the Week") +
  theme_minimal()

Births vs Day of Week: Boxplot showing the spread of births across different days of the week.

further questions

Is there a relationship between the day of the week and the number of births?
Does the number of births change significantly between different years?
What is the impact of leap years on birth rates, if any?
How do holidays or specific events impact birth rates?

test

saisree

2024-09-11