Primero, creamos la tabla con el archivo excel
TrainExer11 <- read_excel("C:/Users/34659/Downloads/TrainExer11.xls")
(a) Make two histograms, one of expenditures and the other of age. Make also a scatter diagram with expenditures on the vertical axis versus age on the horizontal axis.
hist(TrainExer11$Expenditures)
hist(TrainExer11$Age)
(b) In what respect do the data in this scatter diagram look
different from the case of the sales and price data discussed in the
lecture?
Here the young group (<20 and <40) and the older group (>40 and
<60) fall into two groups with a gap in the expenditure.
For 20 to 40 age group the expenditure is directly proportional to age For 40 to 50 age group too the expenditure is directly proportional to age.
Overall the expenditure reduces as the age increases but with a large variability within each age group.
(c) Propose a method to analyze these data in a way that assists the travel agent in making recommendations to future clients.
Break the dataset into two different groups of ages instead of them together. That way it is easy to make clear recommendation to future clients about the expenditure increase as age increases within that range.
(d) Compute the sample mean of expenditures of all 26 clients.
mean(TrainExer11$Expenditures)
## [1] 101.1154
(e) Compute two sample means of expenditures, one for clients of age forty or more and the other for clients of age below forty.
mas40 <- TrainExer11[ which(TrainExer11$Age >= 40),]
mean(mas40$Expenditures)
## [1] 95.84615
menos40 <- TrainExer11[ which(TrainExer11$Age < 40),]
mean(menos40$Expenditures)
## [1] 106.3846
(f) What daily expenditures would you predict for a new client of fifty years old? And for someone who is twenty-five years old?
#For 25 years old the predicted expenditure is 106.4 and for 50 years old it is 95.8. Using linear models (lm) we get
lm(menos40$Expenditures ~ menos40$Age)
##
## Call:
## lm(formula = menos40$Expenditures ~ menos40$Age)
##
## Coefficients:
## (Intercept) menos40$Age
## 100.232 0.198
expenmenos40 <- 100.232 + 0.198 * 25
expenmenos40
## [1] 105.182
lm(mas40$Expenditures ~ mas40$Age)
##
## Call:
## lm(formula = mas40$Expenditures ~ mas40$Age)
##
## Coefficients:
## (Intercept) mas40$Age
## 88.8719 0.1465
expenmas40 <- 88.8719 + 0.1465 * 50
expenmas40
## [1] 96.1969
Consideremos que la tienda A tiene un nivel de ventas de 150 y la tienda B de 250. Ambos managers inician una camapaña de publicidad. A vende entonces 225 y B 400. ¿Quien ha tenido el incremento relativo mayor en ventas?
A0 <- 150
A1 <- 225
B0 <- 250
B1 <- 400
DA <- (A1 - A0) / A0
DB <- (B1 - B0) / B0
cat("DA:", DA, "\nDB:", DB, "\n")
## DA: 0.5
## DB: 0.6
B tiene un mayor incremento en ventas (0.1 ppt de diferencia)
(a) Show that in the regression model log(yi) = α + β log(xi) + εi, the elasticity of y with respect to x is equal to β (that is, does not depend on the values of xi and yi).
log(y)=α+βlog(x)+εlog(y)=α+βlog(x)+ε
Donde:
y: Variable dependiente.
x: Variable independiente.
α: Intercepto.
β: Coeficiente de elasticidad.
ε: Término de error.
2. Aplicamos la Exponencial
Para eliminar el logaritmo y expresar y en términos de x, aplicamos la función exponencial a ambos lados de la ecuación:
y=eα+βlog(x)+εy=eα+βlog(x)+ε
Usando propiedades de los exponentes, esto se puede reescribir como:
y=eα e^βlog(x) ⋅ e^ε
Sabemos que e^βlog(x) = x^β, por lo que:
y=eα⋅xβ⋅e^ε
3. Derivamos y con respecto a x
Para encontrar la elasticidad, necesitamos la derivada de y con respecto a x:
dy/dx = d/dx * (e^α ⋅ x^β ⋅ e^ε)
Como e^α y e^ε son constantes con respecto a x (ya que alfa es un término constante y el error es, por definición independiente a x), la derivada es:
dy/dx = e^α ⋅ e^ε ⋅ d/dx*(xβ)
La derivada de x^β es βx^β−1, por lo que:
dy/dx = e^α ⋅ e^ε ⋅ β^xβ−1
4. Calculamos la Elasticidad (E)
La elasticidad se define como:
E = dy/dx ⋅ x/y
Sustituimos dy/dx y y:
E = (e^α ⋅ e^ε ⋅ β^xβ−1) ⋅ xe/(α⋅xβ⋅eε)
Simplificamos:
e^α y e^ε se cancelan en el numerador y denominador.
x^β−1⋅ x = x^β , que también se cancela con x^β en el denominador.
Por lo tanto, nos queda:
E=β
model1 <- lm(log(TrainExer11$Expenditures) ~ log(TrainExer11$Age))
summary(model1)
##
## Call:
## lm(formula = log(TrainExer11$Expenditures) ~ log(TrainExer11$Age))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.090504 -0.039359 -0.005547 0.049451 0.075017
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.97595 0.12275 40.539 < 2e-16 ***
## log(TrainExer11$Age) -0.09959 0.03370 -2.956 0.00689 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05347 on 24 degrees of freedom
## Multiple R-squared: 0.2669, Adjusted R-squared: 0.2363
## F-statistic: 8.736 on 1 and 24 DF, p-value: 0.006893
print(model1)
##
## Call:
## lm(formula = log(TrainExer11$Expenditures) ~ log(TrainExer11$Age))
##
## Coefficients:
## (Intercept) log(TrainExer11$Age)
## 4.97595 -0.09959
(b) Determine the elasticity of y with respect to x in the model yi = α + β log(xi) + εi
model2 <- lm(TrainExer11$Expenditures ~ log(TrainExer11$Age))
summary(model2)
##
## Call:
## lm(formula = TrainExer11$Expenditures ~ log(TrainExer11$Age))
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8126 -4.0454 -0.6618 5.0195 7.7005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.024 12.273 11.165 5.48e-11 ***
## log(TrainExer11$Age) -9.894 3.369 -2.937 0.00721 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.347 on 24 degrees of freedom
## Multiple R-squared: 0.2643, Adjusted R-squared: 0.2337
## F-statistic: 8.624 on 1 and 24 DF, p-value: 0.007211
(c) Determine the elasticity of y with respect to x in the model log(yi) = α + βxi + εi
model3 <- lm(log(TrainExer11$Expenditures) ~ TrainExer11$Age)
summary(model3)
##
## Call:
## lm(formula = log(TrainExer11$Expenditures) ~ TrainExer11$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.091379 -0.041240 -0.007897 0.043602 0.079386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7462746 0.0388479 122.18 <2e-16 ***
## TrainExer11$Age -0.0033497 0.0009544 -3.51 0.0018 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05077 on 24 degrees of freedom
## Multiple R-squared: 0.3392, Adjusted R-squared: 0.3117
## F-statistic: 12.32 on 1 and 24 DF, p-value: 0.001798
# Extraer coeficientes, errores estándar, p-valores, etc.
coef_info <- tidy(model3)
coef_info <- coef_info %>%
mutate(Significance = case_when(
p.value < 0.001 ~ "***",
p.value < 0.01 ~ "**",
p.value < 0.05 ~ "*",
p.value < 0.1 ~ ".",
TRUE ~ ""
))
# Extraer métricas generales como R², AIC, BIC
model_stats <- glance(model3)
# Extraer información extra que no aparece en `glance()`
summary_info <- summary(model3)
extra_info <- tibble(
Statistic = c("Residual Std. Error", "F-Statistic", "Degrees of Freedom"),
Value = c(summary_info$sigma, summary_info$fstatistic[1], summary_info$df[2])
)
# Unir toda la información en una sola tabla.
# En la parte de coeficientes, se incluye la columna Significance.
full_table <- bind_rows(
coef_info %>% mutate(Statistic = term) %>%
select(Statistic, estimate, std.error, statistic, p.value, Significance),
model_stats %>% pivot_longer(cols = everything(), names_to = "Statistic", values_to = "Value"),
extra_info
)
# Formatear la tabla con gt()
full_table %>%
gt() %>%
tab_header(title = "Resumen del Modelo de Regresión") %>%
fmt_number(columns = c("estimate", "std.error", "statistic", "p.value", "Value"), decimals = 4)
| Resumen del Modelo de Regresión | ||||||
| Statistic | estimate | std.error | statistic | p.value | Significance | Value |
|---|---|---|---|---|---|---|
| (Intercept) | 4.7463 | 0.0388 | 122.1757 | 0.0000 | *** | NA |
| TrainExer11$Age | −0.0033 | 0.0010 | −3.5099 | 0.0018 | ** | NA |
| r.squared | NA | NA | NA | NA | NA | 0.3392 |
| adj.r.squared | NA | NA | NA | NA | NA | 0.3117 |
| sigma | NA | NA | NA | NA | NA | 0.0508 |
| statistic | NA | NA | NA | NA | NA | 12.3192 |
| p.value | NA | NA | NA | NA | NA | 0.0018 |
| df | NA | NA | NA | NA | NA | 1.0000 |
| logLik | NA | NA | NA | NA | NA | 41.6405 |
| AIC | NA | NA | NA | NA | NA | −77.2810 |
| BIC | NA | NA | NA | NA | NA | −73.5067 |
| deviance | NA | NA | NA | NA | NA | 0.0619 |
| df.residual | NA | NA | NA | NA | NA | 24.0000 |
| nobs | NA | NA | NA | NA | NA | 26.0000 |
| Residual Std. Error | NA | NA | NA | NA | NA | 0.0508 |
| F-Statistic | NA | NA | NA | NA | NA | 12.3192 |
| Degrees of Freedom | NA | NA | NA | NA | NA | 24.0000 |
TrainExer13 <- read_excel("C:/Users/34659/Downloads/TrainExer13.xls")
Dataset TrainExer13 contains the winning times (W) of the Olympic 100-meter finals (for men) from 1948 to 2004. The calendar years 1948-2004 are transformed to games (G) 1-15 to simplify computations. A simple regression model for the trend in winning times is Wi = α + βGi + εi.
(a) Compute a and b, and determine the values of R2 and s.
modelWinTime <- lm(TrainExer13$`Winning time men` ~ TrainExer13$Game)
# Coeficiente de determinación (R^2)
r_squared <- summary(modelWinTime)$r.squared
# Error estándar de la regresión (s)
s <- summary(modelWinTime)$sigma
cat("Coeficiente de determinación (R^2):", r_squared, "\n")
## Coeficiente de determinación (R^2): 0.6733729
cat("Error estándar de la regresión (s):", s, "\n")
## Error estándar de la regresión (s): 0.1228257
summary(modelWinTime)
##
## Call:
## lm(formula = TrainExer13$`Winning time men` ~ TrainExer13$Game)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.208 -0.048 -0.016 0.032 0.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.38600 0.06674 155.623 < 2e-16 ***
## TrainExer13$Game -0.03800 0.00734 -5.177 0.000178 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1228 on 13 degrees of freedom
## Multiple R-squared: 0.6734, Adjusted R-squared: 0.6482
## F-statistic: 26.8 on 1 and 13 DF, p-value: 0.0001781
(b) Are you confident on the predictive ability of this model?. Motivate your answer
El modelo tiene capacidad explicativa, sin embargo, un R2 de 0.67 no es suficiente como para ser considerado suficientemente fliable, más teniendo en cuenta que la desviación típica es de 0.12 segundos. Sin embargo, tanto el intercepto como la variable independiente tienen un alto nivel de significación, por lo que deberíamos encontrar otro elemento que afecte al tiempo de victoria.
(c) What prediction do you get for 2008, 2012, and 2016? Compare your predictions with the actual winning times.
WinningEst2008 = 10.38600 + -0.03800*16
WinningEst2012 = 10.38600 + -0.03800*17
WinningEst2016 = 10.38600 + -0.03800*18
cat("Tiempo Victoria 2008", WinningEst2008, "\n")
## Tiempo Victoria 2008 9.778
cat("Tiempo Victoria 2012", WinningEst2012, "\n")
## Tiempo Victoria 2012 9.74
cat("Tiempo Victoria 2016", WinningEst2016, "\n")
## Tiempo Victoria 2016 9.702
In Lecture 1.5, we applied simple regression for data on winning times on the Olympic 100 meter (athletics). We computed the regression coefficients a and b for two trend models, one with a linear trend and one with a nonlinear trend. In a test question, you created forecasts of the winning times for both men and women in 2008 and 2012. Of course, you can also forecast further ahead in the future. In fact, it is even possible to predict when men and women would run equally fast, if the current trends persist
(a) Show that the linear trend model predicts equal winning times at around 2140.
TrainExer15 <- read_excel("C:/Users/34659/Downloads/TrainExer15.xls")
modelWinTimeH <- lm(TrainExer15$`Winning time men` ~ TrainExer13$Year)
print(modelWinTimeH)
##
## Call:
## lm(formula = TrainExer15$`Winning time men` ~ TrainExer13$Year)
##
## Coefficients:
## (Intercept) TrainExer13$Year
## 28.8540 -0.0095
modelWinTimeM <- lm(TrainExer15$`Winning time women` ~ TrainExer13$Year)
print(modelWinTimeM)
##
## Call:
## lm(formula = TrainExer15$`Winning time women` ~ TrainExer13$Year)
##
## Coefficients:
## (Intercept) TrainExer13$Year
## 42.18938 -0.01573
# Ajustar los modelos lineales con los mismos datos
TrainExer15$Year <- TrainExer15$Year # Asegurar que TrainExer15 tenga la columna Year
modelWinTimeH <- lm(`Winning time men` ~ Year, data = TrainExer15)
modelWinTimeM <- lm(`Winning time women` ~ Year, data = TrainExer15)
print(modelWinTimeH)
##
## Call:
## lm(formula = `Winning time men` ~ Year, data = TrainExer15)
##
## Coefficients:
## (Intercept) Year
## 28.8540 -0.0095
print(modelWinTimeM)
##
## Call:
## lm(formula = `Winning time women` ~ Year, data = TrainExer15)
##
## Coefficients:
## (Intercept) Year
## 42.18938 -0.01573
# Crear una secuencia de años futuros para la predicción
years_future <- seq(max(TrainExer15$Year) + 1, max(TrainExer15$Year) + 200, by = 5)
# Predecir los tiempos ganadores en los años futuros
future_data <- data.frame(Year = years_future)
predictionsH <- predict(modelWinTimeH, newdata = future_data)
predictionsM <- predict(modelWinTimeM, newdata = future_data)
# Encontrar el año donde se cruzan
intersect_year <- years_future[which.min(abs(predictionsH - predictionsM))]
# Mostrar el resultado
print(cat("El punto de intersección ocurre en el año:", intersect_year, "\n"))
## El punto de intersección ocurre en el año: 2140
## NULL
#Graficar los resultados
plot(years_future, predictionsH, type = "l", col = "blue", ylim = range(c(predictionsH, predictionsM)),
xlab = "Año", ylab = "Tiempo ganador", main = "Predicción de tiempos ganadores")
lines(years_future, predictionsM, col = "red")
(b) Show that the nonlinear trend model predicts equal winning times at around 2192.
# Ajustar los modelos lineales
modelWinTimeH <- lm(`Winning time men` ~ Year, data = TrainExer15)
modelWinTimeM <- lm(`Winning time women` ~ Year, data = TrainExer15)
# Crear una secuencia de años futuros para la predicción
years_future <- seq(max(TrainExer15$Year) + 1, max(TrainExer15$Year) + 200, by = 1)
# Crear un data frame con los años futuros
future_data <- data.frame(Year = years_future)
# Predecir los tiempos ganadores en los años futuros
predictionsH <- predict(modelWinTimeH, newdata = future_data)
predictionsM <- predict(modelWinTimeM, newdata = future_data)
# Encontrar el año donde se cruzan
intersect_index <- which.min(abs(predictionsH - predictionsM))
intersect_year <- years_future[intersect_index]
# Mostrar el año de intersección
print(paste("El año donde se cruzan los tiempos ganadores es:", intersect_year))
## [1] "El año donde se cruzan los tiempos ganadores es: 2140"
# Ajustar los modelos no lineales (cuadráticos)
modelWinTimeH_nonlinear <- lm(log(`Winning time men`) ~ Year, data = TrainExer15)
modelWinTimeM_nonlinear <- lm(log(`Winning time women`) ~ Year, data = TrainExer15)
print(modelWinTimeH_nonlinear)
##
## Call:
## lm(formula = log(`Winning time men`) ~ Year, data = TrainExer15)
##
## Coefficients:
## (Intercept) Year
## 4.165995 -0.000939
print(modelWinTimeM_nonlinear)
##
## Call:
## lm(formula = log(`Winning time women`) ~ Year, data = TrainExer15)
##
## Coefficients:
## (Intercept) Year
## 5.179507 -0.001403
# Crear una secuencia de años futuros para la predicción
years_future <- seq(max(TrainExer15$Year) + 1, 2300, by = 1)
future_data <- data.frame(Year = years_future)
# Predecir los tiempos ganadores en los años futuros (modelos no lineales)
predictionsH_nonlinear <- predict(modelWinTimeH_nonlinear, newdata = future_data)
predictionsM_nonlinear <- predict(modelWinTimeM_nonlinear, newdata = future_data)
# Encontrar el año donde se cruzan (modelos no lineales)
intersect_index_nonlinear <- which.min(abs(predictionsH_nonlinear - predictionsM_nonlinear))
intersect_year_nonlinear <- years_future[intersect_index_nonlinear]
intersect_time_nonlinear <- (predictionsH_nonlinear[intersect_index_nonlinear] + predictionsM_nonlinear[intersect_index_nonlinear]) / 2
cat("(b) El modelo no lineal predice tiempos ganadores iguales alrededor del año:", intersect_year_nonlinear, "\n")
## (b) El modelo no lineal predice tiempos ganadores iguales alrededor del año: 2183
(c) Show that the linear trend model predicts equal winning times of approximately 8.53 seconds.
# Ajustar los modelos lineales
modelWinTimeH <- lm(`Winning time men` ~ Year, data = TrainExer15)
modelWinTimeM <- lm(`Winning time women` ~ Year, data = TrainExer15)
# Crear una secuencia de años futuros para la predicción
years_future <- seq(max(TrainExer15$Year) + 1, 2300, by = 1)
future_data <- data.frame(Year = years_future)
# Predecir los tiempos ganadores en los años futuros (modelos lineales)
predictionsH <- predict(modelWinTimeH, newdata = future_data)
predictionsM <- predict(modelWinTimeM, newdata = future_data)
# Encontrar el año donde se cruzan (modelos lineales)
intersect_index_linear <- which.min(abs(predictionsH - predictionsM))
intersect_year_linear <- years_future[intersect_index_linear]
# Calcular el tiempo ganador en el año de intersección
intersect_time_linear <- (predictionsH[intersect_index_linear] + predictionsM[intersect_index_linear]) / 2
# Mostrar el resultado
cat("(c) El modelo lineal predice tiempos ganadores iguales de aproximadamente:", intersect_time_linear, "segundos\n")
## (c) El modelo lineal predice tiempos ganadores iguales de aproximadamente: 8.523298 segundos
Test Exercise 1
Questions This exercise considers an example of data that do not satisfy all the standard assumptions of simple regression. In the considered case, one particular observation lies far off from the others, that is, it is an outlier. This violates assumptions A3 and A4, which state that all error terms εi are drawn from one and the same distribution with mean zero and fixed variance σ2. The dataset contains twenty weekly observations on sales and advertising of a department store. The question of interest lies in estimating the effect of advertising on sales. One of the weeks was special, as the store was also open in the evenings during this week, but this aspect will first be ignored in the analysis.
Test_exc <- read_excel("C:/Users/34659/Downloads/_689829b5c78b9b71bd384ed8fb8714c1_TestExer-1-sales-round1.xls")
(a) Make the scatter diagram with sales on the vertical axis and advertising on the horizontal axis. What do you expect to find if you would fit a regression line to these data?
ggplot(Test_exc, aes(x = Advertising, y = Sales)) +
geom_point() + # Puntos de dispersión
labs(title = "Diagrama de Dispersión de Ventas vs Publicidad",
x = "Publicidad",
y = "Ventas") +
theme_minimal()
print("Es de esperar que la recta de regresión se vea distorsionada debido al valor atípico encontrado en publicidad = 6 ~ Ventas = 50")
## [1] "Es de esperar que la recta de regresión se vea distorsionada debido al valor atípico encontrado en publicidad = 6 ~ Ventas = 50"
(b) Estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
# Cargar las librerías necesarias
library(readxl)
# Leer los datos desde el archivo Excel
Test_exc <- read_excel("C:/Users/34659/Downloads/Test_exc.xls")
# Ajustar el modelo de regresión lineal
model <- lm(Sales ~ Advertising, data = Test_exc)
# Mostrar el resumen del modelo
summary(model)
##
## Call:
## lm(formula = Sales ~ Advertising, data = Test_exc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6794 -2.7869 -1.3811 0.6803 22.3206
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.6269 4.8815 6.069 9.78e-06 ***
## Advertising -0.3246 0.4589 -0.707 0.488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.836 on 18 degrees of freedom
## Multiple R-squared: 0.02704, Adjusted R-squared: -0.02701
## F-statistic: 0.5002 on 1 and 18 DF, p-value: 0.4885
# Extraer los coeficientes
coefficients <- coef(model)
a <- coefficients[1] # Intercepto (a)
b <- coefficients[2] # Pendiente (b)
# Extraer el error estándar y el valor t de b
summary_model <- summary(model)
std_error_b <- summary_model$coefficients[2, 2] # Error estándar de b
t_value_b <- summary_model$coefficients[2, 3] # Valor t de b
# Mostrar los resultados
cat("Intercepto (a):", a, "\n")
## Intercepto (a): 29.62689
cat("Pendiente (b):", b, "\n")
## Pendiente (b): -0.324575
cat("Error estándar de b:", std_error_b, "\n")
## Error estándar de b: 0.458911
cat("Valor t de b:", t_value_b, "\n")
## Valor t de b: -0.7072722
# Evaluar si b es significativamente diferente de 0
p_value_b <- summary_model$coefficients[2, 4] # Valor p de b
significance_level <- 0.05
if (p_value_b < significance_level) {
cat("b es significativamente diferente de 0.\n")
} else {
cat("b no es significativamente diferente de 0.\n")
}
## b no es significativamente diferente de 0.
(c) Compute the residuals and draw a histogram of these residuals. What conclusion do you draw from this histogram?
# Cargar las librerías necesarias
library(readxl)
library(ggplot2)
# Leer los datos desde el archivo Excel
Test_exc <- read_excel("C:/Users/34659/Downloads/Test_exc.xls")
# Ajustar el modelo de regresión lineal
model <- lm(Sales ~ Advertising, data = Test_exc)
# Calcular los residuos
residuals <- residuals(model)
# Dibujar un histograma de los residuos
ggplot(data.frame(Residuals = residuals), aes(x = Residuals)) +
geom_histogram(binwidth = 0.5, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Histograma de los Residuos",
x = "Residuos",
y = "Frecuencia") +
theme_minimal()
# Mostrar un resumen de los residuos
summary(residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.6794 -2.7869 -1.3811 0.0000 0.6803 22.3206
(d) Apparently, the regression result of part (b) is not satisfactory. Once you realize that the large residual corresponds to the week with opening hours during the evening, how would you proceed to get a more satisfactory regression model? -Supongamos que la tienda ha abierto el doble de horas-.
# Cargar las librerías necesarias
library(readxl)
library(ggplot2)
# Leer los datos desde el archivo Excel
Test_exc <- read_excel("C:/Users/34659/Downloads/Test_exc.xls")
# Identificar el día atípico (supongamos que ya sabes que es la observación con el residuo grande)
model <- lm(Sales ~ Advertising, data = Test_exc)
residuals <- residuals(model)
large_residual_index <- which.max(abs(residuals))
large_residual_observation <- Test_exc[large_residual_index, ]
# Verificar la observación atípica
print(large_residual_observation)
## # A tibble: 1 × 3
## Observation Advertising Sales
## <dbl> <dbl> <dbl>
## 1 12 6 50
# Ajustar los valores de ventas y horas de trabajo para normalizarlos
Test_exc$Sales[large_residual_index] <- Test_exc$Sales[large_residual_index] / 2
# Reajustar el modelo de regresión con los datos normalizados
new_model <- lm(Sales ~ Advertising, data = Test_exc)
# Mostrar el resumen del nuevo modelo
summary(new_model)
##
## Call:
## lm(formula = Sales ~ Advertising, data = Test_exc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.29923 -0.63539 0.03771 0.45386 1.70077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.64389 0.90530 23.908 4.33e-15 ***
## Advertising 0.33230 0.08511 3.905 0.00104 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.082 on 18 degrees of freedom
## Multiple R-squared: 0.4586, Adjusted R-squared: 0.4285
## F-statistic: 15.25 on 1 and 18 DF, p-value: 0.001039
# Calcular los nuevos residuos
new_residuals <- residuals(new_model)
# Dibujar un histograma de los nuevos residuos
ggplot(data.frame(Residuals = new_residuals), aes(x = Residuals)) +
geom_histogram(binwidth = 0.5, fill = "blue", color = "black", alpha = 0.7) +
labs(title = "Histograma de los Nuevos Residuos",
x = "Residuos",
y = "Frecuencia") +
theme_minimal()
# Mostrar un resumen de los nuevos residuos
summary(new_residuals)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.29923 -0.63539 0.03771 0.00000 0.45386 1.70077
ggplot(Test_exc, aes(x = Advertising, y = Sales)) +
geom_point() + # Puntos de dispersión
labs(title = "Diagrama de Dispersión de Ventas vs Publicidad",
x = "Publicidad",
y = "Ventas") +
theme_minimal()
# Extraer el error estándar y el valor t de b
summary_model <- summary(new_model)
std_error_b <- summary_model$coefficients[2, 2] # Error estándar de b
t_value_b <- summary_model$coefficients[2, 3] # Valor t de b
# Mostrar los resultados
cat("Intercepto (a):", a, "\n")
## Intercepto (a): 29.62689
cat("Pendiente (b):", b, "\n")
## Pendiente (b): -0.324575
cat("Error estándar de b:", std_error_b, "\n")
## Error estándar de b: 0.08510727
cat("Valor t de b:", t_value_b, "\n")
## Valor t de b: 3.904519
# Evaluar si b es significativamente diferente de 0
p_value_b <- summary_model$coefficients[2, 4] # Valor p de b
significance_level <- 0.05
if (p_value_b < significance_level) {
cat("b es significativamente diferente de 0.\n")
} else {
cat("b no es significativamente diferente de 0.\n")
}
## b es significativamente diferente de 0.
(e) Delete this special week from the sample and use the remaining 19 weeks to estimate the coefficients a and b in the simple regression model with sales as dependent variable and advertising as explanatory factor. Also compute the standard error and t-value of b. Is b significantly different from 0?
# Cargar las librerías necesarias
library(readxl)
# Leer los datos desde el archivo Excel
Test_exc <- read_excel("C:/Users/34659/Downloads/Test_exc.xls")
# Identificar la semana especial (supongamos que ya sabes que es la observación con el residuo grande)
model <- lm(Sales ~ Advertising, data = Test_exc)
residuals <- residuals(model)
large_residual_index <- which.max(abs(residuals))
large_residual_observation <- Test_exc[large_residual_index, ]
# Verificar la observación especial
print(large_residual_observation)
## # A tibble: 1 × 3
## Observation Advertising Sales
## <dbl> <dbl> <dbl>
## 1 12 6 50
# Eliminar la semana especial del conjunto de datos
Test_exc_clean <- Test_exc[-large_residual_index, ]
# Ajustar el modelo de regresión con los datos restantes
new_model <- lm(Sales ~ Advertising, data = Test_exc_clean)
# Mostrar el resumen del nuevo modelo
summary_new_model <- summary(new_model)
print(summary_new_model)
##
## Call:
## lm(formula = Sales ~ Advertising, data = Test_exc_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2500 -0.4375 0.0000 0.5000 1.7500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.1250 0.9548 22.124 5.72e-14 ***
## Advertising 0.3750 0.0882 4.252 0.000538 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.054 on 17 degrees of freedom
## Multiple R-squared: 0.5154, Adjusted R-squared: 0.4869
## F-statistic: 18.08 on 1 and 17 DF, p-value: 0.0005379
# Extraer los coeficientes
coefficients <- coef(new_model)
a <- coefficients[1] # Intercepto (a)
b <- coefficients[2] # Pendiente (b)
# Extraer el error estándar y el valor t de b
std_error_b <- summary_new_model$coefficients[2, 2] # Error estándar de b
t_value_b <- summary_new_model$coefficients[2, 3] # Valor t de b
# Mostrar los resultados
cat("Intercepto (a):", a, "\n")
## Intercepto (a): 21.125
cat("Pendiente (b):", b, "\n")
## Pendiente (b): 0.375
cat("Error estándar de b:", std_error_b, "\n")
## Error estándar de b: 0.08819642
cat("Valor t de b:", t_value_b, "\n")
## Valor t de b: 4.251873
# Evaluar si b es significativamente diferente de 0
p_value_b <- summary_new_model$coefficients[2, 4] # Valor p de b
significance_level <- 0.05
if (p_value_b < significance_level) {
cat("b es significativamente diferente de 0.\n")
} else {
cat("b no es significativamente diferente de 0.\n")
}
## b es significativamente diferente de 0.
(f) Discuss the differences between your findings in parts (b) and (e). Describe in words what you have learned from these results
print("Los resultados de b) no eran estadísticamente significativos para beta (p-valor > nivel de significación) y el modelo no conseguía predecir correctamente el resultado (R^2 muy baja); sin embargo, al eliminar el valor atípico en e) encontramos un nivel de significancia estadística correcto para beta y un R^2 que predice de manera aceptable el resultado de ventas, aun así, sería preferible incluir nuevas variables y mejorar el modelo")
## [1] "Los resultados de b) no eran estadísticamente significativos para beta (p-valor > nivel de significación) y el modelo no conseguía predecir correctamente el resultado (R^2 muy baja); sin embargo, al eliminar el valor atípico en e) encontramos un nivel de significancia estadística correcto para beta y un R^2 que predice de manera aceptable el resultado de ventas, aun así, sería preferible incluir nuevas variables y mejorar el modelo"