This report analyzes an Amazon dataset that contains broad and varied information on 2,500 users of the platform.
setwd("C:/Users/famva/Desktop/Ciencia de datos")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(colorspace)
df <- read.csv("data_set_amazon.csv", stringsAsFactors = TRUE) # cleaned data
tail(df)
## User.ID Name Email.Address Username
## 2495 2495 Jennifer Burch andersonchristina@example.net andersonchristina
## 2496 2496 Michael Lopez williamsroberto@example.org williamsroberto
## 2497 2497 Matthew Woodard lkaiser@example.com lkaiser
## 2498 2498 Morgan Barnes erikaholland@example.net erikaholland
## 2499 2499 Gina Castaneda reedcourtney@example.net reedcourtney
## 2500 2500 Mark Nicholson martinisaac@example.net martinisaac
## Date.of.Birth Gender Location Membership.Start.Date
## 2495 1950-04-18 Female New Ian 2024-03-06
## 2496 1967-08-19 Male Smithport 2024-01-25
## 2497 1980-10-23 Male Ethanport 2024-03-03
## 2498 1972-03-31 Female Alexandraborough 2024-02-09
## 2499 1965-08-02 Female Williammouth 2024-02-18
## 2500 1972-11-13 Female Estradaborough 2024-01-28
## Membership.End.Date Subscription.Plan Payment.Information Renewal.Status
## 2495 2025-03-06 Annual Mastercard Auto-renew
## 2496 2025-01-24 Annual Visa Auto-renew
## 2497 2025-03-03 Annual Amex Manual
## 2498 2025-02-08 Annual Visa Manual
## 2499 2025-02-17 Monthly Visa Manual
## 2500 2025-01-27 Annual Visa Auto-renew
## Usage.Frequency Purchase.History Favorite.Genres Devices.Used
## 2495 Occasional Clothing Drama Smartphone
## 2496 Frequent Electronics Comedy Smartphone
## 2497 Frequent Books Comedy Smart TV
## 2498 Frequent Electronics Documentary Tablet
## 2499 Regular Clothing Comedy Smartphone
## 2500 Regular Books Documentary Smart TV
## Engagement.Metrics Feedback.Ratings Customer.Support.Interactions
## 2495 Medium 4.1 8
## 2496 Medium 4.9 2
## 2497 Medium 4.0 0
## 2498 Low 4.9 8
## 2499 High 3.4 7
## 2500 High 3.3 9
Next, we identify the variables available in the dataset.
colnames(df)
## [1] "User.ID" "Name"
## [3] "Email.Address" "Username"
## [5] "Date.of.Birth" "Gender"
## [7] "Location" "Membership.Start.Date"
## [9] "Membership.End.Date" "Subscription.Plan"
## [11] "Payment.Information" "Renewal.Status"
## [13] "Usage.Frequency" "Purchase.History"
## [15] "Favorite.Genres" "Devices.Used"
## [17] "Engagement.Metrics" "Feedback.Ratings"
## [19] "Customer.Support.Interactions"
As shown, the dataset has a variety of variables, which we will use to identify market niches so that advertising or recommendations can be targeted to the platform's users.
Next, we produce a statistical summary of the available information.
summary(df)
## User.ID Name Email.Address
## Min. : 1.0 Michael Smith : 3 bryandiaz@example.com : 2
## 1st Qu.: 625.8 Andrew Johnson : 2 dbailey@example.net : 2
## Median :1250.5 Brian Key : 2 djohnson@example.org : 2
## Mean :1250.5 Casey Jones : 2 ewilson@example.com : 2
## 3rd Qu.:1875.2 Christopher Foster : 2 jacqueline84@example.com: 2
## Max. :2500.0 Christopher Johnson: 2 john96@example.net : 2
## (Other) :2487 (Other) :2488
## Username Date.of.Birth Gender Location
## djohnson : 3 1967-07-16: 3 Female:1240 East Robert : 5
## ewilson : 3 1980-12-18: 3 Male :1260 New Jennifer : 5
## scott11 : 3 1933-04-26: 2 Johnsonside : 4
## anthony39: 2 1933-06-07: 2 Michaelborough: 4
## bbell : 2 1933-09-14: 2 New Robert : 4
## brian80 : 2 1933-11-24: 2 East Aaron : 3
## (Other) :2485 (Other) :2486 (Other) :2475
## Membership.Start.Date Membership.End.Date Subscription.Plan
## 2024-01-07: 40 2025-01-06: 40 Annual :1271
## 2024-01-28: 38 2025-01-27: 38 Monthly:1229
## 2024-01-03: 37 2025-01-02: 37
## 2024-01-08: 35 2025-01-07: 35
## 2024-03-26: 35 2025-03-26: 35
## 2024-03-06: 34 2025-03-06: 34
## (Other) :2281 (Other) :2281
## Payment.Information Renewal.Status Usage.Frequency Purchase.History
## Amex :806 Auto-renew:1274 Frequent :851 Books :851
## Mastercard:856 Manual :1226 Occasional:822 Clothing :802
## Visa :838 Regular :827 Electronics:847
##
##
##
##
## Favorite.Genres Devices.Used Engagement.Metrics Feedback.Ratings
## Action :380 Smart TV :780 High :845 Min. :3.000
## Comedy :349 Smartphone:867 Low :821 1st Qu.:3.500
## Documentary:340 Tablet :853 Medium:834 Median :4.000
## Drama :361 Mean :4.005
## Horror :383 3rd Qu.:4.500
## Romance :368 Max. :5.000
## Sci-Fi :319
## Customer.Support.Interactions
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 5.000
## Mean : 4.952
## 3rd Qu.: 8.000
## Max. :10.000
##
As shown, the data include both quantitative and qualitative variables. As a first step, we can segment the information.
glimpse(df)
## Rows: 2,500
## Columns: 19
## $ User.ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ Name <fct> Ronald Murphy, Scott Allen, Jonathan Par…
## $ Email.Address <fct> williamholland@example.com, scott22@exam…
## $ Username <fct> williamholland, scott22, brooke16, eliza…
## $ Date.of.Birth <fct> 1953-06-03, 1978-07-08, 1994-12-06, 1964…
## $ Gender <fct> Male, Male, Female, Female, Male, Female…
## $ Location <fct> Rebeccachester, Mcphersonview, Youngfort…
## $ Membership.Start.Date <fct> 2024-01-15, 2024-01-07, 2024-04-13, 2024…
## $ Membership.End.Date <fct> 2025-01-14, 2025-01-06, 2025-04-13, 2025…
## $ Subscription.Plan <fct> Annual, Monthly, Monthly, Monthly, Annua…
## $ Payment.Information <fct> Mastercard, Visa, Mastercard, Amex, Visa…
## $ Renewal.Status <fct> Manual, Manual, Manual, Auto-renew, Auto…
## $ Usage.Frequency <fct> Regular, Regular, Regular, Regular, Freq…
## $ Purchase.History <fct> Electronics, Electronics, Books, Electro…
## $ Favorite.Genres <fct> Documentary, Horror, Comedy, Documentary…
## $ Devices.Used <fct> Smart TV, Smartphone, Smart TV, Smart TV…
## $ Engagement.Metrics <fct> Medium, Medium, Low, High, Low, Low, Med…
## $ Feedback.Ratings <dbl> 3.6, 3.8, 3.3, 3.3, 4.3, 3.8, 4.4, 3.6, …
## $ Customer.Support.Interactions <int> 3, 7, 8, 7, 1, 2, 10, 6, 8, 6, 1, 6, 0, …
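The `glimpse()` output shows `Date.of.Birth`, `Membership.Start.Date` and `Membership.End.Date` stored as factors (`<fct>`), which is a side effect of `stringsAsFactors = TRUE`. If date arithmetic is needed later, these columns could be converted first. The following is a minimal sketch in base R; `toy` and `membership_days` are hypothetical names, and the three rows simply copy the first values shown by `glimpse()` instead of reading the CSV.

```r
# Hypothetical stand-in for df: three rows with the values shown by glimpse().
toy <- data.frame(
  Date.of.Birth         = c("1953-06-03", "1978-07-08", "1994-12-06"),
  Membership.Start.Date = c("2024-01-15", "2024-01-07", "2024-04-13"),
  Membership.End.Date   = c("2025-01-14", "2025-01-06", "2025-04-13"),
  stringsAsFactors = TRUE
)

# Convert each factor column to Date through its character labels.
date_cols <- c("Date.of.Birth", "Membership.Start.Date", "Membership.End.Date")
toy[date_cols] <- lapply(toy[date_cols], function(x) as.Date(as.character(x)))

# Subtracting two Date columns gives a difference in days.
toy$membership_days <- as.numeric(toy$Membership.End.Date - toy$Membership.Start.Date)
str(toy)
```

Going through `as.character()` matters: calling `as.numeric()` directly on a factor returns the internal level codes, not the dates.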
Now that we know which data are available, we filter them by gender in order to learn the preferences of the users in each group.
df$Engagement.Metrics <- as.numeric(factor(df$Engagement.Metrics, levels = c("Low", "Medium", "High"), labels = c(1, 2, 3)))
#df$Renewal.Status <- as.numeric(factor(df$Renewal.Status, levels = c("Low", "Medium", "High"), labels = c(1, 2, 3)))
pref_M <- df %>% select(User.ID, Gender, Renewal.Status, Payment.Information,
Usage.Frequency, Purchase.History, Favorite.Genres, Engagement.Metrics) %>%
filter(Gender == "Female")
pref_H <- df %>% select(User.ID, Gender, Renewal.Status, Payment.Information,
Usage.Frequency, Purchase.History, Favorite.Genres, Engagement.Metrics) %>%
filter(Gender == "Male")
prom_M <- df %>%
filter(Gender == "Female") %>%
summarise(mean_engagement = mean(Engagement.Metrics, na.rm = TRUE))
print(prom_M)
## mean_engagement
## 1 2.023387
prom_H <- df %>%
filter(Gender == "Male") %>%
summarise(mean_engagement = mean(Engagement.Metrics, na.rm = TRUE))
print(prom_H)
## mean_engagement
## 1 1.996032
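The two filtered summaries above can also be computed in a single grouped call, which keeps both means side by side. The following is a small sketch in base R on a hypothetical data frame `toy`; its values are made up for illustration and are not drawn from the dataset.

```r
# Hypothetical toy data: engagement coded 1 (Low), 2 (Medium), 3 (High).
toy <- data.frame(
  Gender = factor(c("Female", "Male", "Female", "Male", "Female", "Male")),
  Engagement.Metrics = c(1, 2, 3, 3, 2, 2)
)

# Mean engagement per gender in one call.
mean_by_gender <- tapply(toy$Engagement.Metrics, toy$Gender, mean, na.rm = TRUE)
print(mean_by_gender)
```

With dplyr, the same result should come from `df %>% group_by(Gender) %>% summarise(mean_engagement = mean(Engagement.Metrics, na.rm = TRUE))`. Either way, averaging the 1 to 3 codes assumes the three engagement levels are equally spaced, which is a modelling choice rather than a property of the data.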
grup <- df %>%
group_by(Renewal.Status, Gender) %>%
summarise(count = n()) %>%
arrange(Renewal.Status, Gender)
## `summarise()` has grouped output by 'Renewal.Status'. You can override using
## the `.groups` argument.
ggplot(grup, aes(x = Renewal.Status, y = count, fill = Gender)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Number of Users by Renewal Status and Gender",
x = "Renewal Status",
y = "Number of Users") +
theme_minimal() +
scale_fill_manual(values = c("Female" = "pink", "Male" = "blue"))
We now convert Renewal.Status to numeric values and filter the data by gender.
# Convert Renewal.Status to numeric.
# Note: summary(df) lists the levels "Auto-renew" and "Manual"; "Active" is not one of them.
df$Renewal.Status <- ifelse(df$Renewal.Status == "Active", 1, 0)
# Filter data by gender
df_female <- df %>% filter(Gender == "Female")
df_male <- df %>% filter(Gender == "Male")
# Linear regression for women
lm_female <- lm(Renewal.Status ~ Engagement.Metrics, data = df_female)
summary(lm_female)
##
## Call:
## lm(formula = Renewal.Status ~ Engagement.Metrics, data = df_female)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0 0 NaN NaN
## Engagement.Metrics 0 0 NaN NaN
##
## Residual standard error: 0 on 1238 degrees of freedom
## Multiple R-squared: NaN, Adjusted R-squared: NaN
## F-statistic: NaN on 1 and 1238 DF, p-value: NA
# Linear regression for men
lm_male <- lm(Renewal.Status ~ Engagement.Metrics, data = df_male)
summary(lm_male)
##
## Call:
## lm(formula = Renewal.Status ~ Engagement.Metrics, data = df_male)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0 0 0 0 0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0 0 NaN NaN
## Engagement.Metrics 0 0 NaN NaN
##
## Residual standard error: 0 on 1258 degrees of freedom
## Multiple R-squared: NaN, Adjusted R-squared: NaN
## F-statistic: NaN on 1 and 1258 DF, p-value: NA
# Linear regression plot for women
ggplot(df_female, aes(x = Engagement.Metrics, y = Renewal.Status)) +
geom_point(color = "pink") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
labs(title = "Linear Regression: Renewal Status vs Engagement Metrics (Women)",
x = "Engagement Metrics",
y = "Renewal Status") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
# Linear regression plot for men
ggplot(df_male, aes(x = Engagement.Metrics, y = Renewal.Status)) +
geom_point(color = "blue") +
geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
labs(title = "Linear Regression: Renewal Status vs Engagement Metrics (Men)",
x = "Engagement Metrics",
y = "Renewal Status") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
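The aim of these regressions was to determine whether, when segmenting by gender, there is a relationship between renewal and the users' engagement with the brand. The fitted models show no relationship. However, every estimate and residual is exactly 0 and the test statistics are NaN, which indicates a constant response, so the result is better read as a degenerate fit than as evidence that no association exists.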
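One likely cause of the all-zero summaries is the encoding step: `summary(df)` lists only `Auto-renew` and `Manual` as levels of `Renewal.Status`, so a comparison with `"Active"` never matches and every row becomes 0. The sketch below illustrates this on simulated data. The data frame `toy`, the column `renewed`, and the choice of a logistic model via `glm()` in place of `lm()` are assumptions made for illustration, not the original analysis, and the simulated values say nothing about the real dataset.

```r
# Simulated stand-in for df; values are random, not taken from the dataset.
set.seed(1)
n <- 200
toy <- data.frame(
  Renewal.Status     = factor(sample(c("Auto-renew", "Manual"), n, replace = TRUE)),
  Engagement.Metrics = sample(1:3, n, replace = TRUE),
  Gender             = factor(sample(c("Female", "Male"), n, replace = TRUE))
)

# "Active" is not a level, so the comparison is FALSE for every row.
all_zero <- ifelse(toy$Renewal.Status == "Active", 1, 0)
table(all_zero)

# Comparing against an existing level gives a genuine 0/1 response.
toy$renewed <- ifelse(toy$Renewal.Status == "Auto-renew", 1, 0)
table(toy$renewed)

# Logistic regression for one gender (assumed alternative to lm for a binary outcome).
fit_female <- glm(renewed ~ Engagement.Metrics,
                  family = binomial,
                  data = subset(toy, Gender == "Female"))
summary(fit_female)$coefficients
```

Applied to the real data, the same pattern would mean re-running the encoding with `"Auto-renew"` before fitting and only then interpreting the coefficients.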