Multivariate Analysis of Salary, Titanic Survival, and Spotify Popularity

required_packages <- c("ggplot2", "gridExtra", "car", "gvlma")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
lapply(required_packages, library, character.only = TRUE)

## Loading required package: carData

## [[1]]
## [1] "ggplot2"   "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
## [1] "gridExtra" "ggplot2"   "stats"     "graphics"  "grDevices" "utils"    
## [7] "datasets"  "methods"   "base"     
## 
## [[3]]
##  [1] "car"       "carData"   "gridExtra" "ggplot2"   "stats"     "graphics" 
##  [7] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "gvlma"     "car"       "carData"   "gridExtra" "ggplot2"   "stats"    
##  [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"

1. Salary Data Analysis

Data Import and Preprocessing

salary_data <- read.csv("C:/Users/Manuel/Desktop/datascience_salaries.csv", sep = ",", header = TRUE)

To ensure data quality, missing values are removed:

salary_data <- na.omit(salary_data)
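
Before dropping rows, it can help to see how much data `na.omit()` would discard. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not taken from the salary file) to count missing values per column and the number of rows that would be lost:

```r
# Toy data frame standing in for the salary data (values are made up).
toy <- data.frame(
  salary_in_usd    = c(50000, NA, 72000, 91000),
  experience_level = c("EN", "MI", NA, "SE")
)

# Count missing values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Count how many rows na.omit() would discard.
rows_dropped <- nrow(toy) - nrow(na.omit(toy))
print(rows_dropped)
```

The same two calls could be run on `salary_data` right after `read.csv()` to judge whether listwise deletion is acceptable.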

Converting key categorical variables to factors and keeping only full-time (FT) records:

salary_data$experience_level <- factor(salary_data$experience_level)
salary_data$job_title <- factor(salary_data$job_title)
salary_data$employment_type <- factor(salary_data$employment_type)
salary_data <- salary_data[salary_data$employment_type == "FT", ]
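
Subsetting a data frame does not remove factor levels that no longer occur, which can clutter later tables and plots. A minimal sketch of the behaviour, assuming a toy data frame in which the labels "PT" and "CT" are hypothetical stand-ins for other employment types:

```r
# Toy data: the employment-type labels other than "FT" are assumptions.
toy <- data.frame(
  employment_type = factor(c("FT", "PT", "FT", "CT")),
  salary_in_usd   = c(60000, 20000, 85000, 40000)
)

# Keep full-time rows only; the unused levels are still attached.
ft_only <- toy[toy$employment_type == "FT", ]
print(levels(ft_only$employment_type))

# droplevels() removes levels that have no remaining observations.
ft_only <- droplevels(ft_only)
print(levels(ft_only$employment_type))
```

Calling `droplevels(salary_data)` after the filter would have the same effect on the real data.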

Exploratory Data Analysis

library(ggplot2)
library(gridExtra)

Creating visualizations to analyze salary distributions:

p1 <- ggplot(data = salary_data, aes(x = experience_level, y = salary_in_usd)) +
      geom_boxplot() +
      labs(x = "Experience Level", y = "Salary in USD") +
      theme_bw()

p2 <- ggplot(data = salary_data, aes(x = job_title, y = salary_in_usd)) +
      geom_boxplot() +
      labs(x = "Job Title", y = "Salary in USD") +
      theme_bw()

grid.arrange(p1, p2, ncol = 2)

Calculating mean salary by experience level and job title:

mean_by_experience <- tapply(salary_data$salary_in_usd, salary_data$experience_level, mean)
mean_by_job <- tapply(salary_data$salary_in_usd, salary_data$job_title, mean)
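
The code above computes group means only. Standard deviations can be obtained with the same `tapply()` pattern, or both statistics at once with `aggregate()`. A sketch on made-up data (the `toy` values and the level labels "EN" and "SE" are assumptions):

```r
# Toy data standing in for salary_data (values are made up).
toy <- data.frame(
  salary_in_usd    = c(50000, 60000, 90000, 110000),
  experience_level = factor(c("EN", "EN", "SE", "SE"))
)

# Standard deviation of salary within each experience level.
sd_by_experience <- tapply(toy$salary_in_usd, toy$experience_level, sd)
print(sd_by_experience)

# Mean and standard deviation together, one row per group.
summary_by_experience <- aggregate(
  salary_in_usd ~ experience_level,
  data = toy,
  FUN  = function(x) c(mean = mean(x), sd = sd(x))
)
print(summary_by_experience)
```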

ANOVA Analysis

model_anova <- aov(salary_in_usd ~ experience_level * job_title, data = salary_data)
summary(model_anova)

##                             Df    Sum Sq   Mean Sq F value   Pr(>F)    
## experience_level             3 6.499e+11 2.166e+11  59.797  < 2e-16 ***
## job_title                    3 6.585e+10 2.195e+10   6.058 0.000459 ***
## experience_level:job_title   8 4.170e+10 5.212e+09   1.439 0.177254    
## Residuals                  573 2.076e+12 3.623e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
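
A significant F test says that at least one group differs, not which ones. One common follow-up is Tukey's HSD, sketched here on simulated data; the data frame, group labels, and group means are all assumptions made for illustration, and this is not necessarily the follow-up used in the original analysis:

```r
set.seed(1)

# Simulated salaries for three hypothetical experience levels.
toy <- data.frame(
  experience_level = factor(rep(c("EN", "MI", "SE"), each = 20)),
  salary_in_usd    = c(rnorm(20, 60000, 10000),
                       rnorm(20, 80000, 10000),
                       rnorm(20, 110000, 10000))
)

# One-way ANOVA followed by all pairwise comparisons.
toy_fit <- aov(salary_in_usd ~ experience_level, data = toy)
tukey   <- TukeyHSD(toy_fit, which = "experience_level")
print(tukey)
```

Each row of the result gives the estimated difference between two levels, a confidence interval, and an adjusted p-value. On the real model, `TukeyHSD(model_anova, which = "experience_level")` would be the analogous call.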

Conclusion: Both experience level (F = 59.80, p < 2e-16) and job title (F = 6.06, p < 0.001) have a significant effect on salary, with higher experience levels associated with higher earnings. The interaction between the two is not significant (p = 0.177), so there is no evidence that the effect of experience differs across job titles. The homogeneity and normality assumptions were tested and showed potential deviations, which must be taken into account in further analysis.
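
One way to check the normality and homogeneity assumptions of an ANOVA is sketched below on simulated data. The tests used for the salary model are not shown in this report, so the choice of `shapiro.test()` and `bartlett.test()` (both from base R) is an assumption; `leveneTest()` from the `car` package loaded earlier is a common alternative for homogeneity of variance.

```r
set.seed(42)

# Simulated response for three hypothetical groups with equal variance.
toy <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 30)),
  y     = c(rnorm(30, 10, 2), rnorm(30, 12, 2), rnorm(30, 15, 2))
)

toy_fit <- aov(y ~ group, data = toy)

# Normality of the model residuals (null hypothesis: residuals are normal).
normality <- shapiro.test(residuals(toy_fit))
print(normality$p.value)

# Homogeneity of variance across groups (null hypothesis: equal variances).
homogeneity <- bartlett.test(y ~ group, data = toy)
print(homogeneity$p.value)
```

Small p-values in either test would point to a violated assumption; `plot(toy_fit)` gives the corresponding graphical diagnostics.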

2. Titanic Survival Analysis

Data Cleaning and Exploration

titanic_data <- read.csv("C:/Users/Manuel/Desktop/titanic.csv", sep = ",", header = TRUE)

Removing missing values:

titanic_clean <- titanic_data[complete.cases(titanic_data), ]

Building the logistic regression model:

model_logistic <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare, data = titanic_clean, family = binomial)
summary(model_logistic)

## 
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch + 
##     Fare, family = binomial, data = titanic_clean)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.389003   0.603734   8.926  < 2e-16 ***
## Pclass      -1.242249   0.163191  -7.612 2.69e-14 ***
## Sexmale     -2.634845   0.219609 -11.998  < 2e-16 ***
## Age         -0.043953   0.008179  -5.374 7.70e-08 ***
## SibSp       -0.375755   0.127361  -2.950  0.00317 ** 
## Parch       -0.061937   0.122925  -0.504  0.61436    
## Fare         0.002160   0.002493   0.866  0.38627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 964.52  on 713  degrees of freedom
## Residual deviance: 635.81  on 707  degrees of freedom
## AIC: 649.81
## 
## Number of Fisher Scoring iterations: 5
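
The coefficients above are on the log-odds scale. Exponentiating them gives odds ratios, and the inverse logit turns a linear predictor into a probability. The sketch below copies the reported estimates into a named vector so that it runs without the data file; the example passenger is hypothetical.

```r
# Coefficient estimates copied from the summary output above.
coefs <- c(
  "(Intercept)" = 5.389003,
  Pclass        = -1.242249,
  Sexmale       = -2.634845,
  Age           = -0.043953,
  SibSp         = -0.375755,
  Parch         = -0.061937,
  Fare          = 0.002160
)

# Odds ratios: multiplicative change in the odds per one-unit increase.
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Hypothetical passenger: first class, female, age 30, travelling alone, fare 50.
passenger <- c("(Intercept)" = 1, Pclass = 1, Sexmale = 0, Age = 30,
               SibSp = 0, Parch = 0, Fare = 50)

# Linear predictor (log-odds), then inverse logit to get a probability.
linear_predictor     <- sum(coefs * passenger[names(coefs)])
survival_probability <- plogis(linear_predictor)
print(survival_probability)
```

The odds ratio for `Sexmale` is roughly 0.07, meaning that the odds of survival for men were about 93% lower than for women with the other predictors held fixed. With the fitted object available, `exp(coef(model_logistic))` and `predict(model_logistic, newdata, type = "response")` give the same quantities directly.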

Conclusion: Passenger class, sex, and age are all significant predictors of survival (p < 0.001), as is the number of siblings or spouses aboard (SibSp, p = 0.003). Parch and Fare are not significant once the other variables are in the model. Passengers in higher classes and women had greater odds of survival, and the odds of survival decreased with age. Normality and homoscedasticity were tested, but these are assumptions of linear models rather than of logistic regression, so they do not by themselves confirm the model’s reliability.

3. Spotify Popularity Analysis

Data Processing and Regression Modeling

spotify_data <- read.csv("C:/Users/Manuel/Desktop/Popular_Spotify_Songs.csv", sep = ",")

Handling missing values:

spotify_data <- na.omit(spotify_data)

Regression analysis to predict danceability:

model_spotify <- lm(danceability_. ~ bpm + valence_., data = spotify_data)
summary(model_spotify)

## 
## Call:
## lm(formula = danceability_. ~ bpm + valence_., data = spotify_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.892  -8.825   1.166   9.335  33.153 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 64.16545    2.16859  29.589  < 2e-16 ***
## bpm         -0.08299    0.01579  -5.257 1.83e-07 ***
## valence_.    0.25647    0.01873  13.696  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.21 on 899 degrees of freedom
## Multiple R-squared:  0.1892, Adjusted R-squared:  0.1874 
## F-statistic: 104.9 on 2 and 899 DF,  p-value: < 2.2e-16
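
To make the fitted equation concrete, the sketch below plugs a hypothetical song (120 bpm, valence of 50) into the reported coefficients. The estimates are copied from the output above so that the example runs without the data file; the song values are assumptions.

```r
# Coefficient estimates copied from the summary output above.
coefs <- c("(Intercept)" = 64.16545, bpm = -0.08299, "valence_." = 0.25647)

# Hypothetical song: 120 beats per minute, valence of 50.
song <- c("(Intercept)" = 1, bpm = 120, "valence_." = 50)

# Predicted danceability = intercept + b_bpm * bpm + b_valence * valence.
predicted_danceability <- sum(coefs * song[names(coefs)])
print(predicted_danceability)

# Expected change in danceability for a 10 bpm increase, valence held fixed.
bpm_effect_10 <- 10 * coefs[["bpm"]]
print(bpm_effect_10)
```

With the fitted object available, `predict(model_spotify, newdata = data.frame(bpm = 120, valence_. = 50))` would return the same prediction. The residual standard error of 13.21 indicates that individual songs can deviate substantially from such predictions.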

Conclusion: Both tempo (bpm) and valence are significantly associated with danceability: each additional beat per minute lowers predicted danceability by about 0.08 units, while each additional unit of valence raises it by about 0.26 units. The model explains only about 19% of the variability (R-squared = 0.1892, adjusted R-squared = 0.1874), so additional variables should be considered for better predictions. Further feature engineering, such as factorizing categorical variables, was conducted to improve model performance.