required_packages <- c("ggplot2", "gridExtra", "car", "gvlma")
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
lapply(required_packages, library, character.only = TRUE)
## Loading required package: carData
## [[1]]
## [1] "ggplot2" "stats" "graphics" "grDevices" "utils" "datasets"
## [7] "methods" "base"
##
## [[2]]
## [1] "gridExtra" "ggplot2" "stats" "graphics" "grDevices" "utils"
## [7] "datasets" "methods" "base"
##
## [[3]]
## [1] "car" "carData" "gridExtra" "ggplot2" "stats" "graphics"
## [7] "grDevices" "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "gvlma" "car" "carData" "gridExtra" "ggplot2" "stats"
## [7] "graphics" "grDevices" "utils" "datasets" "methods" "base"
salary_data <- read.csv("C:/Users/Manuel/Desktop/datascience_salaries.csv", sep = ",", header = TRUE)
To ensure data quality, missing values are removed:
salary_data <- na.omit(salary_data)
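Before dropping rows, it can help to see how much data na.omit() would discard. The following is a minimal sketch on a small made-up data frame; the values and the name demo are hypothetical and are not taken from datascience_salaries.csv.

```r
# Hypothetical toy data standing in for salary_data
demo <- data.frame(
  salary_in_usd = c(50000, NA, 72000, 91000),
  experience_level = c("EN", "MI", NA, "SE")
)

# Count missing values in each column
missing_per_column <- colSums(is.na(demo))

# Remove incomplete rows and record how many were dropped
rows_before <- nrow(demo)
demo_clean <- na.omit(demo)
rows_dropped <- rows_before - nrow(demo_clean)

print(missing_per_column)
print(rows_dropped)
```

If a large share of rows is dropped, removing them may bias the later comparisons, so the counts are worth reporting alongside the cleaned data.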
Converting key categorical variables to factors and keeping only full-time (FT) employees:
salary_data$experience_level <- factor(salary_data$experience_level)
salary_data$job_title <- factor(salary_data$job_title)
salary_data$employment_type <- factor(salary_data$employment_type)
salary_data <- salary_data[salary_data$employment_type == "FT", ]
library(ggplot2)
library(gridExtra)
Creating visualizations to analyze salary distributions:
p1 <- ggplot(data = salary_data, aes(x = experience_level, y = salary_in_usd)) +
geom_boxplot() +
labs(x = "Experience Level", y = "Salary in USD") +
theme_bw()
p2 <- ggplot(data = salary_data, aes(x = job_title, y = salary_in_usd)) +
geom_boxplot() +
labs(x = "Job Title", y = "Salary in USD") +
theme_bw()
grid.arrange(p1, p2, ncol = 2)
Calculating means and standard deviations by experience level and job title:
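mean_by_experience <- tapply(salary_data$salary_in_usd, salary_data$experience_level, mean)
sd_by_experience <- tapply(salary_data$salary_in_usd, salary_data$experience_level, sd)
mean_by_job <- tapply(salary_data$salary_in_usd, salary_data$job_title, mean)
sd_by_job <- tapply(salary_data$salary_in_usd, salary_data$job_title, sd)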
model_anova <- aov(salary_in_usd ~ experience_level * job_title, data = salary_data)
summary(model_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## experience_level 3 6.499e+11 2.166e+11 59.797 < 2e-16 ***
## job_title 3 6.585e+10 2.195e+10 6.058 0.000459 ***
## experience_level:job_title 8 4.170e+10 5.212e+09 1.439 0.177254
## Residuals 573 2.076e+12 3.623e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Conclusion: The two-way ANOVA shows that both experience level (F = 59.80, p < 2e-16) and job title (F = 6.06, p = 0.0005) have a significant effect on salary in USD, with more experienced groups earning more on average. The interaction between experience level and job title is not significant (F = 1.44, p = 0.177), so the effect of experience on salary does not appear to differ across job titles. The homogeneity-of-variance and normality assumptions were also tested and showed potential deviations, which must be taken into account in further analysis.
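The assumption checks mentioned in the conclusion can be sketched as follows. This is an illustration under stated assumptions, not necessarily the tests used in the original analysis: the data frame demo is synthetic, and the checks use base R only (shapiro.test for normality of residuals, bartlett.test for homogeneity of variance). leveneTest() from the car package is a common alternative for the variance check. TukeyHSD is included to show which experience levels differ.

```r
# Synthetic stand-in for salary_data (hypothetical values, not the real file)
set.seed(42)
demo <- expand.grid(
  experience_level = c("EN", "MI", "SE", "EX"),
  job_title = c("Data Analyst", "Data Engineer", "Data Scientist", "ML Engineer"),
  replicate = 1:10
)
demo$salary_in_usd <- 60000 +
  20000 * as.integer(demo$experience_level) +
  rnorm(nrow(demo), sd = 15000)

# Same model structure as model_anova
demo_model <- aov(salary_in_usd ~ experience_level * job_title, data = demo)

# Normality of the residuals (Shapiro-Wilk test)
normality <- shapiro.test(residuals(demo_model))

# Homogeneity of variance across the experience-by-title cells (Bartlett test)
demo$cell <- interaction(demo$experience_level, demo$job_title)
homogeneity <- bartlett.test(salary_in_usd ~ cell, data = demo)

# Pairwise comparisons between experience levels
posthoc <- TukeyHSD(demo_model, which = "experience_level")

print(normality)
print(homogeneity)
print(posthoc)
```

A small p-value in either test would point to a violated assumption, in which case a transformation of the response or a robust alternative may be worth considering.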
titanic_data <- read.csv("C:/Users/Manuel/Desktop/titanic.csv", sep = ",", header = TRUE)
Removing missing values:
titanic_clean <- titanic_data[complete.cases(titanic_data), ]
Building the logistic regression model:
model_logistic <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare, data = titanic_clean, family = binomial)
summary(model_logistic)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + SibSp + Parch +
## Fare, family = binomial, data = titanic_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.389003 0.603734 8.926 < 2e-16 ***
## Pclass -1.242249 0.163191 -7.612 2.69e-14 ***
## Sexmale -2.634845 0.219609 -11.998 < 2e-16 ***
## Age -0.043953 0.008179 -5.374 7.70e-08 ***
## SibSp -0.375755 0.127361 -2.950 0.00317 **
## Parch -0.061937 0.122925 -0.504 0.61436
## Fare 0.002160 0.002493 0.866 0.38627
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 635.81 on 707 degrees of freedom
## AIC: 649.81
##
## Number of Fisher Scoring iterations: 5
Conclusion: The logistic regression shows that passenger class, sex, and age are significantly associated with survival, as is the number of siblings or spouses aboard (SibSp, p = 0.003). Passengers in higher classes and women had greater odds of survival, while the odds of survival decreased with age. Parch and Fare are not significant once the other predictors are included. Normality and homoscedasticity were also tested and showed only minor deviations, although these are assumptions of linear models rather than of logistic regression.
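Because logistic regression coefficients are on the log-odds scale, exponentiating them makes the effects easier to read. The sketch below reuses the estimates printed by summary(model_logistic) above; the example passenger is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model_logistic) output
coefs <- c(
  `(Intercept)` = 5.389003,
  Pclass = -1.242249,
  Sexmale = -2.634845,
  Age = -0.043953,
  SibSp = -0.375755,
  Parch = -0.061937,
  Fare = 0.002160
)

# Odds ratios: multiplicative change in the odds of survival per unit increase
odds_ratios <- exp(coefs)

# Hypothetical passenger, in coefficient order:
# intercept term, 1st class, female, age 30, no siblings/spouses, no parents/children, fare 50
passenger <- c(1, 1, 0, 30, 0, 0, 50)

# Linear predictor (log-odds) and the corresponding survival probability
log_odds <- sum(coefs * passenger)
survival_prob <- plogis(log_odds)

print(round(odds_ratios, 3))
print(round(survival_prob, 3))
```

For example, the odds ratio for Sexmale is about 0.07, meaning that the estimated odds of survival for men are roughly 7% of those for women with the same class, age, family size, and fare.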
spotify_data <- read.csv("C:/Users/Manuel/Desktop/Popular_Spotify_Songs.csv", sep = ",")
Handling missing values:
spotify_data <- na.omit(spotify_data)
Regression analysis to predict danceability:
model_spotify <- lm(danceability_. ~ bpm + valence_., data = spotify_data)
summary(model_spotify)
##
## Call:
## lm(formula = danceability_. ~ bpm + valence_., data = spotify_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.892 -8.825 1.166 9.335 33.153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.16545 2.16859 29.589 < 2e-16 ***
## bpm -0.08299 0.01579 -5.257 1.83e-07 ***
## valence_. 0.25647 0.01873 13.696 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.21 on 899 degrees of freedom
## Multiple R-squared: 0.1892, Adjusted R-squared: 0.1874
## F-statistic: 104.9 on 2 and 899 DF, p-value: < 2.2e-16
Conclusion: Both tempo (bpm) and valence are significant predictors of danceability. Danceability decreases by about 0.08 units for each additional beat per minute and increases by about 0.26 units for each additional unit of valence. The model explains only about 19% of the variability (R-squared = 0.1892, adjusted R-squared = 0.1874), so additional variables should be considered for better predictions. Further feature engineering, such as converting categorical variables to factors, was conducted to improve model performance.
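To make the fitted equation concrete, the sketch below computes a prediction by hand from the coefficients reported in summary(model_spotify). The helper predict_danceability and the example song (120 bpm, valence 50) are hypothetical and used only for illustration.

```r
# Coefficients copied from the summary(model_spotify) output
coefs <- c(intercept = 64.16545, bpm = -0.08299, valence = 0.25647)

# Hypothetical helper: applies the fitted linear equation to new values
predict_danceability <- function(bpm, valence) {
  unname(coefs["intercept"] + coefs["bpm"] * bpm + coefs["valence"] * valence)
}

# Hypothetical song with 120 bpm and a valence of 50
example_prediction <- predict_danceability(bpm = 120, valence = 50)
print(example_prediction)
```

With the fitted model object available, predict(model_spotify, newdata = ...) gives the same kind of estimate. Given the residual standard error of 13.21, individual predictions remain quite uncertain.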