Clase MarkD

The purpose of this work is to analyze an Amazon data set that contains diverse and extensive information on 2,500 users of the platform.

setwd("C:/Users/famva/Desktop/Ciencia de datos")

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(colorspace)

df <- read.csv("data_set_amazon.csv", stringsAsFactors = TRUE) # clean data

tail(df)

##      User.ID            Name                 Email.Address          Username
## 2495    2495  Jennifer Burch andersonchristina@example.net andersonchristina
## 2496    2496   Michael Lopez   williamsroberto@example.org   williamsroberto
## 2497    2497 Matthew Woodard           lkaiser@example.com           lkaiser
## 2498    2498   Morgan Barnes      erikaholland@example.net      erikaholland
## 2499    2499  Gina Castaneda      reedcourtney@example.net      reedcourtney
## 2500    2500  Mark Nicholson       martinisaac@example.net       martinisaac
##      Date.of.Birth Gender         Location Membership.Start.Date
## 2495    1950-04-18 Female          New Ian            2024-03-06
## 2496    1967-08-19   Male        Smithport            2024-01-25
## 2497    1980-10-23   Male        Ethanport            2024-03-03
## 2498    1972-03-31 Female Alexandraborough            2024-02-09
## 2499    1965-08-02 Female     Williammouth            2024-02-18
## 2500    1972-11-13 Female   Estradaborough            2024-01-28
##      Membership.End.Date Subscription.Plan Payment.Information Renewal.Status
## 2495          2025-03-06            Annual          Mastercard     Auto-renew
## 2496          2025-01-24            Annual                Visa     Auto-renew
## 2497          2025-03-03            Annual                Amex         Manual
## 2498          2025-02-08            Annual                Visa         Manual
## 2499          2025-02-17           Monthly                Visa         Manual
## 2500          2025-01-27            Annual                Visa     Auto-renew
##      Usage.Frequency Purchase.History Favorite.Genres Devices.Used
## 2495      Occasional         Clothing           Drama   Smartphone
## 2496        Frequent      Electronics          Comedy   Smartphone
## 2497        Frequent            Books          Comedy     Smart TV
## 2498        Frequent      Electronics     Documentary       Tablet
## 2499         Regular         Clothing          Comedy   Smartphone
## 2500         Regular            Books     Documentary     Smart TV
##      Engagement.Metrics Feedback.Ratings Customer.Support.Interactions
## 2495             Medium              4.1                             8
## 2496             Medium              4.9                             2
## 2497             Medium              4.0                             0
## 2498                Low              4.9                             8
## 2499               High              3.4                             7
## 2500               High              3.3                             9

Next, we identify the variables available in the data set.

colnames(df)

##  [1] "User.ID"                       "Name"                         
##  [3] "Email.Address"                 "Username"                     
##  [5] "Date.of.Birth"                 "Gender"                       
##  [7] "Location"                      "Membership.Start.Date"        
##  [9] "Membership.End.Date"           "Subscription.Plan"            
## [11] "Payment.Information"           "Renewal.Status"               
## [13] "Usage.Frequency"               "Purchase.History"             
## [15] "Favorite.Genres"               "Devices.Used"                 
## [17] "Engagement.Metrics"            "Feedback.Ratings"             
## [19] "Customer.Support.Interactions"

As shown, the data set has a range of variables. We will use them to look for market niches, so that advertising or recommendations can be targeted at the platform's users.

Next, we produce a statistical summary of the available information.

summary(df)

##     User.ID                        Name                       Email.Address 
##  Min.   :   1.0   Michael Smith      :   3   bryandiaz@example.com   :   2  
##  1st Qu.: 625.8   Andrew Johnson     :   2   dbailey@example.net     :   2  
##  Median :1250.5   Brian Key          :   2   djohnson@example.org    :   2  
##  Mean   :1250.5   Casey Jones        :   2   ewilson@example.com     :   2  
##  3rd Qu.:1875.2   Christopher Foster :   2   jacqueline84@example.com:   2  
##  Max.   :2500.0   Christopher Johnson:   2   john96@example.net      :   2  
##                   (Other)            :2487   (Other)                 :2488  
##       Username       Date.of.Birth     Gender               Location   
##  djohnson :   3   1967-07-16:   3   Female:1240   East Robert   :   5  
##  ewilson  :   3   1980-12-18:   3   Male  :1260   New Jennifer  :   5  
##  scott11  :   3   1933-04-26:   2                 Johnsonside   :   4  
##  anthony39:   2   1933-06-07:   2                 Michaelborough:   4  
##  bbell    :   2   1933-09-14:   2                 New Robert    :   4  
##  brian80  :   2   1933-11-24:   2                 East Aaron    :   3  
##  (Other)  :2485   (Other)   :2486                 (Other)       :2475  
##  Membership.Start.Date Membership.End.Date Subscription.Plan
##  2024-01-07:  40       2025-01-06:  40     Annual :1271     
##  2024-01-28:  38       2025-01-27:  38     Monthly:1229     
##  2024-01-03:  37       2025-01-02:  37                      
##  2024-01-08:  35       2025-01-07:  35                      
##  2024-03-26:  35       2025-03-26:  35                      
##  2024-03-06:  34       2025-03-06:  34                      
##  (Other)   :2281       (Other)   :2281                      
##  Payment.Information    Renewal.Status   Usage.Frequency    Purchase.History
##  Amex      :806      Auto-renew:1274   Frequent  :851    Books      :851    
##  Mastercard:856      Manual    :1226   Occasional:822    Clothing   :802    
##  Visa      :838                        Regular   :827    Electronics:847    
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     Favorite.Genres     Devices.Used Engagement.Metrics Feedback.Ratings
##  Action     :380    Smart TV  :780   High  :845         Min.   :3.000   
##  Comedy     :349    Smartphone:867   Low   :821         1st Qu.:3.500   
##  Documentary:340    Tablet    :853   Medium:834         Median :4.000   
##  Drama      :361                                        Mean   :4.005   
##  Horror     :383                                        3rd Qu.:4.500   
##  Romance    :368                                        Max.   :5.000   
##  Sci-Fi     :319                                                        
##  Customer.Support.Interactions
##  Min.   : 0.000               
##  1st Qu.: 2.000               
##  Median : 5.000               
##  Mean   : 4.952               
##  3rd Qu.: 8.000               
##  Max.   :10.000               
##

As shown, the data set contains both quantitative and qualitative information. As a first step, we can segment it.
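A minimal sketch of one way to separate the quantitative from the qualitative columns with `dplyr`. The `toy` data frame is a hypothetical miniature that reuses a few of the original column names with made-up values, and `num_vars` and `cat_vars` are illustrative names that do not appear in the analysis.

```r
library(dplyr)

# Hypothetical miniature of the data set: column names follow the original,
# values are invented for illustration only.
toy <- data.frame(
  User.ID = 1:4,
  Gender = factor(c("Male", "Female", "Female", "Male")),
  Subscription.Plan = factor(c("Annual", "Monthly", "Monthly", "Annual")),
  Feedback.Ratings = c(3.6, 4.1, 4.9, 3.3),
  Customer.Support.Interactions = c(3L, 7L, 0L, 9L)
)

# Quantitative columns. User.ID is numeric but only an identifier, so drop it.
num_vars <- toy %>% select(where(is.numeric), -User.ID)

# Qualitative columns, which read.csv(stringsAsFactors = TRUE) stores as factors.
cat_vars <- toy %>% select(where(is.factor))

names(num_vars)
names(cat_vars)
```

Applied to `df`, the same two calls would leave only `Feedback.Ratings` and `Customer.Support.Interactions` as quantitative at this stage, since every other column except `User.ID` is read as a factor.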

glimpse(df)

## Rows: 2,500
## Columns: 19
## $ User.ID                       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1…
## $ Name                          <fct> Ronald Murphy, Scott Allen, Jonathan Par…
## $ Email.Address                 <fct> williamholland@example.com, scott22@exam…
## $ Username                      <fct> williamholland, scott22, brooke16, eliza…
## $ Date.of.Birth                 <fct> 1953-06-03, 1978-07-08, 1994-12-06, 1964…
## $ Gender                        <fct> Male, Male, Female, Female, Male, Female…
## $ Location                      <fct> Rebeccachester, Mcphersonview, Youngfort…
## $ Membership.Start.Date         <fct> 2024-01-15, 2024-01-07, 2024-04-13, 2024…
## $ Membership.End.Date           <fct> 2025-01-14, 2025-01-06, 2025-04-13, 2025…
## $ Subscription.Plan             <fct> Annual, Monthly, Monthly, Monthly, Annua…
## $ Payment.Information           <fct> Mastercard, Visa, Mastercard, Amex, Visa…
## $ Renewal.Status                <fct> Manual, Manual, Manual, Auto-renew, Auto…
## $ Usage.Frequency               <fct> Regular, Regular, Regular, Regular, Freq…
## $ Purchase.History              <fct> Electronics, Electronics, Books, Electro…
## $ Favorite.Genres               <fct> Documentary, Horror, Comedy, Documentary…
## $ Devices.Used                  <fct> Smart TV, Smartphone, Smart TV, Smart TV…
## $ Engagement.Metrics            <fct> Medium, Medium, Low, High, Low, Low, Med…
## $ Feedback.Ratings              <dbl> 3.6, 3.8, 3.3, 3.3, 4.3, 3.8, 4.4, 3.6, …
## $ Customer.Support.Interactions <int> 3, 7, 8, 7, 1, 2, 10, 6, 8, 6, 1, 6, 0, …
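The `glimpse()` output shows the three date columns stored as factors (`<fct>`), which prevents any date arithmetic. The sketch below shows one possible conversion to `Date`, using two hypothetical rows in the same format as the original columns. The derived columns `membership_days` and `age_at_start` are assumptions added for illustration and are not part of the original analysis.

```r
# Hypothetical rows in the same text format as the original date columns.
toy <- data.frame(
  Date.of.Birth = factor(c("1953-06-03", "1978-07-08")),
  Membership.Start.Date = factor(c("2024-01-15", "2024-01-07")),
  Membership.End.Date = factor(c("2025-01-14", "2025-01-06"))
)

date_cols <- c("Date.of.Birth", "Membership.Start.Date", "Membership.End.Date")

# Go through character so that the date text, not the factor's integer codes,
# is what gets parsed.
toy[date_cols] <- lapply(toy[date_cols], function(x) as.Date(as.character(x)))

# Membership length in days.
toy$membership_days <- as.numeric(toy$Membership.End.Date - toy$Membership.Start.Date)

# Approximate age in whole years when the membership started.
toy$age_at_start <- floor(
  as.numeric(toy$Membership.Start.Date - toy$Date.of.Birth) / 365.25
)

str(toy)
```

An age column of this kind could serve as an additional segmentation variable alongside gender.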

Now that we know what data is available, we filter it by gender in order to learn the preferences of each group of users. Engagement.Metrics is first recoded to a numeric scale (Low = 1, Medium = 2, High = 3) so that it can be averaged.

df$Engagement.Metrics <- as.numeric(factor(df$Engagement.Metrics, levels = c("Low", "Medium", "High"), labels = c(1, 2, 3)))

#df$Renewal.Status <- as.numeric(factor(df$Renewal.Status, levels = c("Low", "Medium", "High"), labels = c(1, 2, 3)))

pref_M <- df %>% select(User.ID, Gender, Renewal.Status, Payment.Information, 
              Usage.Frequency, Purchase.History, Favorite.Genres, Engagement.Metrics) %>%
  filter(Gender == "Female")

pref_H <- df %>% select(User.ID, Gender, Renewal.Status, Payment.Information, 
              Usage.Frequency, Purchase.History, Favorite.Genres, Engagement.Metrics) %>%
  filter(Gender == "Male")

prom_M <- df %>% 
  filter(Gender == "Female") %>%
  summarise(mean_engagement = mean(Engagement.Metrics, na.rm = TRUE))
print(prom_M)

##   mean_engagement
## 1        2.023387

prom_H <- df %>% 
  filter(Gender == "Male") %>%
  summarise(mean_engagement = mean(Engagement.Metrics, na.rm = TRUE))
print(prom_H)

##   mean_engagement
## 1        1.996032
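The two averages above can also be obtained in a single call by grouping instead of filtering twice. This is a minimal sketch on a hypothetical `toy` data frame with invented values, and `engagement_by_gender` is an illustrative name.

```r
library(dplyr)

# Hypothetical data: Engagement.Metrics already recoded to 1 (Low) .. 3 (High).
toy <- data.frame(
  Gender = c("Female", "Female", "Male", "Male", "Male"),
  Engagement.Metrics = c(1, 3, 2, 2, 3)
)

# One row per gender, with the group size next to the mean so that the
# averages can be read in context.
engagement_by_gender <- toy %>%
  group_by(Gender) %>%
  summarise(
    n = n(),
    mean_engagement = mean(Engagement.Metrics, na.rm = TRUE),
    .groups = "drop"
  )

print(engagement_by_gender)
```

Keeping both groups in one table makes the comparison direct and avoids repeating the same pipeline for each gender.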

grup <- df %>%
  group_by(Renewal.Status, Gender) %>%
  summarise(count = n()) %>%
  arrange(Renewal.Status, Gender)

## `summarise()` has grouped output by 'Renewal.Status'. You can override using
## the `.groups` argument.

ggplot(grup, aes(x = Renewal.Status, y = count, fill = Gender)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Number of users by renewal status and gender",
       x = "Renewal status",
       y = "Number of users") +
  theme_minimal() +
  scale_fill_manual(values = c("Female" = "pink", "Male" = "blue"))
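Because the two gender groups differ slightly in size (1240 women and 1260 men according to the summary), raw counts can be complemented with within-gender shares. The following is a sketch on hypothetical data. The `share` column and the `renewal_share` name are illustrative additions.

```r
library(dplyr)

# Hypothetical data with the same two renewal labels as the original column.
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male", "Male"),
  Renewal.Status = c("Auto-renew", "Manual", "Auto-renew",
                     "Manual", "Manual", "Auto-renew", "Manual")
)

# Count each Gender x Renewal.Status combination, then divide by the
# total of the corresponding gender to obtain a proportion.
renewal_share <- toy %>%
  count(Gender, Renewal.Status, name = "count") %>%
  group_by(Gender) %>%
  mutate(share = count / sum(count)) %>%
  ungroup()

print(renewal_share)
```

Plotting `share` instead of `count` on the y axis would make the bars comparable across groups of different size.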

Next, we convert Renewal.Status to numeric values and split the data by gender.

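# Convert Renewal.Status to numeric.
# The column only has the levels "Auto-renew" and "Manual" (see summary(df)),
# so the comparison with "Active" is FALSE for every row and the result is all zeros.
df$Renewal.Status <- ifelse(df$Renewal.Status == "Active", 1, 0)

# Split the data by gender
df_female <- df %>% filter(Gender == "Female")
df_male <- df %>% filter(Gender == "Male")

# Linear regression for women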
lm_female <- lm(Renewal.Status ~ Engagement.Metrics, data = df_female)
summary(lm_female)

## 
## Call:
## lm(formula = Renewal.Status ~ Engagement.Metrics, data = df_female)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##      0      0      0      0      0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)               0          0     NaN      NaN
## Engagement.Metrics        0          0     NaN      NaN
## 
## Residual standard error: 0 on 1238 degrees of freedom
## Multiple R-squared:    NaN,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 1 and 1238 DF,  p-value: NA

# Linear regression for men
lm_male <- lm(Renewal.Status ~ Engagement.Metrics, data = df_male)
summary(lm_male)

## 
## Call:
## lm(formula = Renewal.Status ~ Engagement.Metrics, data = df_male)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##      0      0      0      0      0 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)               0          0     NaN      NaN
## Engagement.Metrics        0          0     NaN      NaN
## 
## Residual standard error: 0 on 1258 degrees of freedom
## Multiple R-squared:    NaN,  Adjusted R-squared:    NaN 
## F-statistic:   NaN on 1 and 1258 DF,  p-value: NA

# Linear regression plot for women
ggplot(df_female, aes(x = Engagement.Metrics, y = Renewal.Status)) +
  geom_point(color = "pink") +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Linear regression: Renewal Status vs Engagement Metrics (women)",
       x = "Engagement Metrics",
       y = "Renewal Status") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

# Linear regression plot for men
ggplot(df_male, aes(x = Engagement.Metrics, y = Renewal.Status)) +
  geom_point(color = "blue") +
  geom_smooth(method = "lm", se = FALSE, color = "darkblue") +
  labs(title = "Linear regression: Renewal Status vs Engagement Metrics (men)",
       x = "Engagement Metrics",
       y = "Renewal Status") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

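The goal of these regressions was to find out whether, when segmenting by gender, there is a relationship between renewal and users' engagement with the brand. The fitted models show none, but this result is not informative: the recoded Renewal.Status is 0 for every user, because the column only contains "Auto-renew" and "Manual" and never "Active". With a constant response, the fit degenerates to zero coefficients, zero residuals and NaN statistics, so the question remains open.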
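Since `summary(df)` lists the levels of Renewal.Status as "Auto-renew" and "Manual", one way to test the relationship is to encode auto-renewal as 1 and, because the response is then binary, to fit a logistic regression rather than a linear one. The sketch below is an alternative under those assumptions, not the method used in this report. It runs on simulated data, so its output says nothing about the real users, and `renewal_auto` and `fit` are illustrative names.

```r
set.seed(1)

# Simulated stand-in for the real data (values are random, for illustration).
n <- 200
toy <- data.frame(
  Renewal.Status = factor(sample(c("Auto-renew", "Manual"), n, replace = TRUE)),
  Engagement.Metrics = sample(1:3, n, replace = TRUE)
)

# Inspect the actual labels before recoding.
levels(toy$Renewal.Status)

# 1 = auto-renew, 0 = manual renewal.
toy$renewal_auto <- as.integer(toy$Renewal.Status == "Auto-renew")

# Sanity check: both classes must be present, otherwise the response is
# constant and no model can estimate an effect.
table(toy$renewal_auto)

# Logistic regression of the binary renewal indicator on engagement.
fit <- glm(renewal_auto ~ Engagement.Metrics, data = toy, family = binomial)
summary(fit)
```

The same model could be fitted separately on `df_female` and `df_male`, or once on the full data with a `Gender` interaction term, to compare the two groups.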


Gerardo Vásquez

2024-06-20
