Project 3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

baby_names <- read.csv("baby_names.csv")

Introduction:

Question: How does year predict the popularity rank of three top female names (Mary, Emma, and Elizabeth) from 1880 to 2020?

This dataset lists the top 20 male and female baby names in the United States for each year from 1880 to 2020, based on U.S. Social Security Administration records. It has 4 variables and 2,820 rows. I will use three of the variables: rank, female, and year. I will not use the male column in this project. Although I start from all 2,820 rows, only the rows for the three chosen names enter the regression. I will use a linear regression model to estimate how the rank of these female names changed between 1880 and 2020. The dataset comes from OpenIntro: https://www.openintro.org/data/index.php?data=baby_names

Data Analysis:

In this section, I use str() to check the structure of the dataset. Nothing looks unusual in the columns or rows: female is stored as character, and rank and year are both integers, which is what the analysis needs. Next, I use head() to look for confusing column names or values, and colSums(is.na(...)) to count missing values. There are no NAs in any column.

I then use three dplyr functions: filter, group_by, and summarize. filter keeps only the rows that meet the condition, in this case the rows where female is Mary, Emma, or Elizabeth. group_by groups those rows by year, and summarize computes the average rank and the number of rows for each year. Finally, I fit a multiple linear regression model with rank as the outcome variable and year and female name as the predictors.

### Check structure
str(baby_names)

## 'data.frame':    2820 obs. of  4 variables:
##  $ rank  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ male  : chr  "John" "William" "James" "Charles" ...
##  $ female: chr  "Mary" "Anna" "Emma" "Elizabeth" ...
##  $ year  : int  1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...

### Check head of dataset
head(baby_names)

##   rank    male    female year
## 1    1    John      Mary 1880
## 2    2 William      Anna 1880
## 3    3   James      Emma 1880
## 4    4 Charles Elizabeth 1880
## 5    5  George    Minnie 1880
## 6    6   Frank  Margaret 1880

### Check for NAs
colSums(is.na(baby_names))

##   rank   male female   year 
##      0      0      0      0

#### Using dplyr functions
baby_names |>
  filter(female %in% c("Mary", "Emma", "Elizabeth")) |>
  group_by(year) |>
  summarize(avg_rank = mean(rank),
            count = n())

## # A tibble: 141 × 3
##     year avg_rank count
##    <int>    <dbl> <int>
##  1  1880     2.67     3
##  2  1881     2.67     3
##  3  1882     2.67     3
##  4  1883     2.67     3
##  5  1884     2.67     3
##  6  1885     2.67     3
##  7  1886     2.67     3
##  8  1887     2.67     3
##  9  1888     2.67     3
## 10  1889     3        3
## # ℹ 131 more rows
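As a quick sanity check on the summary above, the 1880 value of avg_rank can be reproduced by hand from the rows shown in the head() output: Mary (1), Emma (3), and Elizabeth (4) average to 8/3, or about 2.67. The sketch below does this in base R with aggregate() rather than dplyr, using only the six 1880 rows printed earlier. The object names (rows_1880, top3, summary_1880) are illustrative and not part of the original analysis.

```r
# The six 1880 rows copied from the head(baby_names) output (male column omitted)
rows_1880 <- data.frame(
  rank   = c(1, 2, 3, 4, 5, 6),
  female = c("Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"),
  year   = 1880
)

# Same condition as the filter() call: keep only the three names of interest
top3 <- rows_1880[rows_1880$female %in% c("Mary", "Emma", "Elizabeth"), ]

# Per-year mean rank and row count, mirroring group_by(year) + summarize()
avg <- aggregate(rank ~ year, data = top3, FUN = mean)
cnt <- aggregate(rank ~ year, data = top3, FUN = length)

summary_1880 <- data.frame(
  year     = avg$year,
  avg_rank = avg$rank,
  count    = cnt$rank
)

print(summary_1880)
```

The result should agree with the first row of the tibble (avg_rank of about 2.67 from 3 rows).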

Regression Analysis:

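#### Keep only the rows for the three names of interest
top_names <- baby_names |> filter(female %in% c("Mary", "Emma", "Elizabeth"))

#### Fit rank on year and name (Elizabeth is the reference level)
model <- lm(rank ~ year + female, data = top_names)

summary(model)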

## 
## Call:
## lm(formula = rank ~ year + female, data = top_names)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3271 -1.9923 -0.8864  0.7367 15.8678 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -35.088748  11.624617  -3.018  0.00279 ** 
## year          0.023397   0.005958   3.927  0.00011 ***
## femaleEmma   -4.799756   0.709062  -6.769 8.57e-11 ***
## femaleMary   -7.988554   0.570276 -14.008  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.07 on 260 degrees of freedom
## Multiple R-squared:  0.4943, Adjusted R-squared:  0.4885 
## F-statistic: 84.73 on 3 and 260 DF,  p-value: < 2.2e-16

The intercept is -35.089. In context, this is the predicted rank when year = 0 and the name is Elizabeth, the reference level for female. It is only a mathematical baseline: year 0 is far outside the 1880 to 2020 range of the data, and a negative rank is impossible, so the intercept has no practical interpretation. The year coefficient is 0.023397, and it is statistically significant (p = 0.00011). Holding name constant, the predicted rank number rises by about 0.023 per year, so over time these names drift toward 20. Recall that 1 is the most popular rank and 20 is the least popular in this dataset. The femaleEmma coefficient is -4.800 and is significant (p = 8.57e-11): holding year constant, Emma ranks about 4.8 positions better than Elizabeth on average, so Emma is more popular than Elizabeth. The femaleMary coefficient is -7.989 and is also significant (p < 2e-16): Mary ranks about 8 positions better than Elizabeth on average. Mary is the most popular of the three names.
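To make the coefficients concrete, the fitted equation can be evaluated by hand. The sketch below copies the four estimates from the summary(model) output and plugs in a name and a year. predict_rank is a hypothetical helper written for this illustration; with the fitted model object available, predict(model, newdata = ...) would be the usual route.

```r
# Estimates copied from the summary(model) output
b0     <- -35.088748  # intercept: Elizabeth (reference level) at year = 0
b_year <-   0.023397  # change in rank number per year
b_emma <-  -4.799756  # Emma relative to Elizabeth
b_mary <-  -7.988554  # Mary relative to Elizabeth

# Hypothetical helper: predicted rank for one of the three names in a given year
predict_rank <- function(name, year) {
  offset <- switch(name,
                   Elizabeth = 0,
                   Emma      = b_emma,
                   Mary      = b_mary,
                   stop("name is not in the model"))
  b0 + b_year * year + offset
}

pred_elizabeth_1880 <- predict_rank("Elizabeth", 1880)
pred_elizabeth_2020 <- predict_rank("Elizabeth", 2020)
pred_mary_1880      <- predict_rank("Mary", 1880)

print(round(c(elizabeth_1880 = pred_elizabeth_1880,
              elizabeth_2020 = pred_elizabeth_2020,
              mary_1880      = pred_mary_1880), 2))
```

Under these estimates, Elizabeth's predicted rank moves from about 8.9 in 1880 to about 12.2 in 2020, while Mary's 1880 prediction is about 0.9. A value below 1 is not a possible rank, which is a reminder that a linear model does not constrain predictions to the 1 to 20 range.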

Model Assumptions and Diagnostics (Multiple Linear Regression)

From the component + residual plot, the linearity assumption appears to be met. The blue dashed line for year follows the pink line closely, which indicates a roughly linear relationship between year and rank. For the female names, the plot shows reasonable alignment for Elizabeth and Mary; because female is categorical, linearity matters mainly for year. The Residuals vs Order plot shows most residuals scattered around the 0 line. There are some outliers but no clear trend, so the independence assumption is only just met. The core diagnostic plots cover the remaining assumptions. In the residuals vs fitted plot, the points are scattered around zero with a roughly constant spread, so linearity and homoscedasticity both appear to be met. The Q-Q plot shows the points following the diagonal line with some deviation at the tails, so normality is approximately satisfied. The scale-location plot supports constant variance, and the residuals vs leverage plot shows all points within the Cook's distance contours, so no single observation is overly influential. The correlation between year and rank is 0.321. Note that this measures the association between a predictor and the outcome; multicollinearity concerns correlation among the predictors themselves. Since year is the only numeric predictor and female is categorical, multicollinearity does not raise any concerns here. Overall, the assumptions are reasonably satisfied, which supports relying on the model.

#### Check Independence
plot(resid(model), type = "b",
     main = "Residuals vs Order",
     ylab = "Residuals")
abline(h = 0, lty = 2)

#### Check Multicollinearity

cor(top_names[, c("year", "rank")], use = "complete.obs")

##          year     rank
## year 1.000000 0.321371
## rank 0.321371 1.000000
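Because multicollinearity is about how strongly the predictors relate to each other, one way to check it is to correlate the columns of the model's design matrix instead of correlating year with rank. The sketch below shows the idea on a small hypothetical data frame (toy); the rank values are made up for illustration and are NOT taken from baby_names. With the real data, the same two lines could be applied to top_names, and vif() from the car package is a common alternative that is not run here.

```r
# Hypothetical toy data (made-up ranks, NOT the real baby_names values):
# three names observed in each of five years
toy <- data.frame(
  year   = rep(1880:1884, each = 3),
  female = rep(c("Elizabeth", "Emma", "Mary"), times = 5),
  rank   = c(4, 3, 1,  5, 3, 1,  4, 2, 1,  5, 3, 1,  6, 2, 1)
)

# Design matrix for the predictors only; drop the intercept column.
# female is expanded into dummy columns femaleEmma and femaleMary.
X <- model.matrix(~ year + female, data = toy)[, -1]

# Pairwise correlations among predictors; values near +1 or -1 would be a warning sign
pred_cor <- cor(X)
print(round(pred_cor, 3))
```

In this balanced toy example, year is uncorrelated with both name dummies, and the two dummies correlate at -0.5 with each other simply because a row cannot be Emma and Mary at once.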

#### Core diagnostics (covers: linearity, homoscedasticity, normality, influence)

par(mfrow = c(2, 2))
plot(model)

par(mfrow = c(1, 1))

#### Linearity Check
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

crPlots(model)

Conclusion and Future Directions

In conclusion, the multiple regression model indicates that both predictors, year and name, are significantly associated with how a name ranks in the dataset. The year coefficient of 0.023 means the rank numbers of these names increased as time passed; that is, their popularity worsened by about 0.023 positions per year, or roughly 3.3 positions over the 140 years from 1880 to 2020. Of the three names, Mary was the most consistently popular: on average, Mary ranked about 8 positions better than Elizabeth, and Emma about 4.8 positions better. The R-squared was 0.4943, so the model explains about 49.4% of the variation in rank. This is a moderate fit, and it supports the finding that, from 1880 to 2020, the rankings of these names declined. One limitation is that we only looked at three names; future work could include more names to find broader patterns.
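Two of the numbers behind this conclusion can be checked with simple arithmetic on the summary(model) output. The sketch below computes the cumulative change implied by the year coefficient and recovers the number of rows the model was fit on from its degrees of freedom. The variable names are illustrative only.

```r
# Values copied from the summary(model) output
b_year   <- 0.023397  # year coefficient
resid_df <- 260       # residual degrees of freedom
n_coef   <- 4         # intercept, year, femaleEmma, femaleMary

# Cumulative change in predicted rank over the study period
span_years  <- 2020 - 1880          # 140 years
total_shift <- b_year * span_years  # about 3.3 positions

# Rows used in the fit = residual df + number of estimated coefficients
n_obs <- resid_df + n_coef

# Rows there would be if all three names appeared in every one of the 141 years
max_rows <- 3 * length(1880:2020)

cat("Total shift over", span_years, "years:", round(total_shift, 2), "positions\n")
cat("Rows in the fit:", n_obs, "of a possible", max_rows, "\n")
```

The model was fit on 264 rows out of a possible 423, so the three names are not all present in every year. Presumably a name has no row in years when it fell outside the top 20, which would be worth keeping in mind alongside the three-name limitation.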