STATISTICS 2: ASSIGNMENT 1

library(readxl)
library(markdown)
library(ggplot2)
library(qqplotr)

## 
## Attaching package: 'qqplotr'

## The following objects are masked from 'package:ggplot2':
## 
##     stat_qq_line, StatQqLine

library(nortest)
library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

PROBLEM 1

On March 31, 2009, Ford Motor Company’s shares were trading at a 26-year low of $2.63. Ford’s board of directors gave the CEO a grant of options and restricted shares with an estimated value of $16 million. On April 26, 2011, the price of a share of Ford had increased to $15.58, and the CEO’s grant was worth $202.8 million, a gain in value of $186.8 million. The following table shows the share price in 2009 and 2011 for 10 companies, the stock-option and share grants to the CEOs in late 2008 and 2009, and the value of the options and grants in 2011. Also shown are the percentage increases in the stock price and the percentage gains in the options values (The Wall Street Journal, April 27, 2011).

Develop a scatter diagram for these data with the percentage increase in the stock price as the independent variable.

P1 <- read_excel("C:/Users/yovanni/Downloads/P1.xlsx",sheet="P1PROB")

ggplot(P1, aes(x = x3, y = x6)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "% Increase in Stock Price vs % Gain in Options Value",
       x = "% Increase in Stock Price",
       y = "% Gain in Options Value") +
  theme_minimal()

What does the scatter diagram developed in part (a) indicate about the relationship between the two variables?

Answer: The scatter diagram shows an approximately linear, positive relationship between the two variables. A simple linear regression model can confirm it, mainly through the test of overall significance. With a single regressor no separate individual test is needed: the F test and the t test on the slope are equivalent (F = t²) and give the same p-value.

Develop the least squares estimated regression equation.

MOD1<-lm(x6~x3, data = P1)
summary(MOD1)

## 
## Call:
## lm(formula = x6 ~ x3, data = P1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -194.55 -138.75  -14.66   51.96  441.34 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -167.8071   180.0907  -0.932  0.37871   
## x3             2.7149     0.5754   4.718  0.00151 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.8 on 8 degrees of freedom
## Multiple R-squared:  0.7356, Adjusted R-squared:  0.7026 
## F-statistic: 22.26 on 1 and 8 DF,  p-value: 0.001505

Answer: From the output above, the fitted equation is:

y = 2.7149x - 167.8071, where y is % Gain in Options Value and x is % Increase in Stock Price.

Provide an interpretation for the slope of the estimated regression equation.

Answer: The slope indicates a positive linear relationship: for each additional percentage point of increase in the stock price, the predicted % gain in options value rises by about 2.7149 percentage points.

Do the rewards for the CEO appear to be based on performance increases as measured by the stock price?

Answer: Yes. The positive linear relationship is statistically significant: the F test of overall significance gives p = 0.0015, well below the standard 0.05 level, and the coefficient of determination is acceptable (R² = 0.7356).
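The equivalence between the F test and the t test in simple regression can be checked from the numbers printed by summary(MOD1). The sketch below is only an illustration: it hard-codes the reported t value, F statistic, and degrees of freedom instead of refitting the model, and the variable names are chosen here for the example.

```r
# Values copied from the summary(MOD1) output above (not recomputed from data)
t_slope <- 4.718   # t value for the slope of x3
f_stat  <- 22.26   # F-statistic on 1 and 8 DF
df_res  <- 8       # residual degrees of freedom

# With one regressor, the F statistic is the square of the slope's t statistic
t_squared <- t_slope^2

# p-value of the F test: upper tail of the F(1, 8) distribution
p_value <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)

cat("t^2:", round(t_squared, 2), " F:", f_stat, "\n")
cat("p-value of the F test:", signif(p_value, 4), "\n")
```

Up to rounding of the printed values, t² matches F, and the p-value agrees with the one reported for both tests.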

PROBLEM 2.

To help consumers in purchasing a laptop computer, Consumer Reports calculates an overall test score for each computer tested based upon rating factors such as ergonomics, portability, performance, display, and battery life. Higher overall scores indicate better test results. The following data show the average retail price and the overall score for ten 13-inch models (Consumer Reports website, October 25, 2012).

Develop a scatter diagram with price as the independent variable.

P2 <- read_excel("C:/Users/yovanni/Downloads/P2.xlsx",sheet="Worksheet")

## New names:
## • `` -> `...4`
## • `` -> `...5`

P2

## # A tibble: 10 × 5
##    x0                                   x1    x2 ...4   ...5
##    <chr>                             <dbl> <dbl> <lgl> <dbl>
##  1 Samsung Ultrabook NP900X3C-A01US   1250    83 NA     NA  
##  2 Apple MacBook Air MC965LL/A        1300    83 NA     NA  
##  3 Apple MacBook Air MD231LL/A        1200    82 NA     NA  
##  4 HP ENVY 13-2050nr Spectre XT        950    79 NA     NA  
##  5 Sony VAIO SVS13112FXB               800    77 NA     68.6
##  6 Acer Aspire S5-391-9880 Ultrabook  1200    74 NA     NA  
##  7 Apple MacBook Pro MD101LL/A        1200    74 NA     NA  
##  8 Apple MacBook Pro MD313LL/A        1000    73 NA     NA  
##  9 Dell Inspiron I13Z-6591SLV          700    67 NA     NA  
## 10 Samsung NP535U3C-A01US              600    63 NA     NA

ggplot(P2, aes(x = x1, y = x2)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Overall Score vs Price",
       x = "Price",
       y = "Overall Score") +
  theme_minimal()

What does the scatter diagram developed in part (a) indicate about the relationship between the two variables?

Answer: The diagram shows a positive relationship, although not a very pronounced one. The slope looks close to zero, but it is unlikely to be exactly zero.

Use the least squares method to develop the estimated regression equation.

MOD2<-lm(x2~x1, data = P2)
summary(MOD2)

## 
## Call:
## lm(formula = x2 ~ x1, data = P2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3182 -3.2121 -0.0758  2.6667  6.1667 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 53.863636   6.175123   8.723 2.33e-05 ***
## x1           0.021212   0.005897   3.597  0.00701 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.417 on 8 degrees of freedom
## Multiple R-squared:  0.6179, Adjusted R-squared:  0.5702 
## F-statistic: 12.94 on 1 and 8 DF,  p-value: 0.007013

Answer: Using the least squares fit computed by R, the estimated equation is:

y = 0.021x + 53.864, where y is the overall score and x is the price.

Provide an interpretation of the slope of the estimated regression equation.

Answer: The F test in the output above shows that the slope is different from zero (p = 0.007). The estimated slope of 0.0212 means that each additional dollar of price is associated with an increase of about 0.021 points in the overall score, or roughly 2.1 points per $100. The coefficient looks small only because price is measured in dollars, so its size alone does not make the relationship negligible. Even so, the goodness of fit is moderate (R² = 0.6179), so the descriptive analysis of the regression should be completed before ruling out other scenarios. With only 10 laptops, it also remains possible that part of the observed relationship reflects sampling variability rather than a population-level effect.

Another laptop that Consumer Reports tested is the Acer Aspire S3-951-6646 Ultrabook; the price for this laptop was $700. Predict the overall score for this laptop using the estimated regression equation developed in part (c).

Answer: Evaluating the fitted model at a price of $700 gives 53.864 + 0.0212(700) ≈ 68.7, that is, an overall score of approximately 69 points.
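The prediction at $700 can also be obtained with predict() instead of evaluating the equation by hand. The sketch below rebuilds the data from the P2 tibble printed earlier so that it runs without the Excel file; the names P2_demo, MOD2_demo, and pred_700 are introduced here only for illustration.

```r
# Price (x1) and overall score (x2) as printed in the P2 tibble above
P2_demo <- data.frame(
  x1 = c(1250, 1300, 1200, 950, 800, 1200, 1200, 1000, 700, 600),
  x2 = c(83, 83, 82, 79, 77, 74, 74, 73, 67, 63)
)

# Same model as MOD2: overall score regressed on price
MOD2_demo <- lm(x2 ~ x1, data = P2_demo)

# Point prediction of the overall score for a laptop priced at $700
pred_700 <- predict(MOD2_demo, newdata = data.frame(x1 = 700))
cat("Predicted overall score at $700:", round(pred_700, 2), "\n")
```

Using predict() avoids rounding the coefficients before the calculation, which matters when the slope is as small as it is here.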

PROBLEM 3. 19. In exercise 7 a sales manager collected the following data on x = years of experience and y = annual sales. The estimated regression equation for these data is ŷ = 80 + 4x.

Compute SST, SSR, and SSE.

P4 <- read_excel("C:/Users/yovanni/Downloads/P4.xlsx")
P4

## # A tibble: 10 × 2
##       x1    x2
##    <dbl> <dbl>
##  1     1    80
##  2     3    97
##  3     4    92
##  4     4   102
##  5     6   103
##  6     8   111
##  7    10   119
##  8    10   123
##  9    11   117
## 10    13   136

MOD4<-lm(x2~x1, data = P4)
summary(MOD4)

## 
## Call:
## lm(formula = x2 ~ x1, data = P4)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7.00  -3.25  -1.00   3.75   6.00 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  80.0000     3.0753   26.01 5.12e-09 ***
## x1            4.0000     0.3868   10.34 6.61e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.61 on 8 degrees of freedom
## Multiple R-squared:  0.9304, Adjusted R-squared:  0.9217 
## F-statistic: 106.9 on 1 and 8 DF,  p-value: 6.609e-06

residuals <- MOD4$residuals
SSE <- sum(residuals^2)
cat("SSE (Sum of Squares due to Error):", SSE, "\n")

## SSE (Sum of Squares due to Error): 170

y <- P4$x2
y_mean <- mean(y)
SST <- sum((y - y_mean)^2)
cat("SST (Total Sum of Squares):", SST, "\n")

## SST (Total Sum of Squares): 2442

SSR <- SST - SSE
cat("SSR (Sum of Squares due to Regression):", SSR, "\n")

## SSR (Sum of Squares due to Regression): 2272

Answer: As the output shows, the total sum of squares is SST = 2442, the sum of squares due to error is SSE = 170, and the sum of squares due to regression is SSR = 2272.

Compute the coefficient of determination r². Comment on the goodness of fit.

Answer: The coefficient of determination is 0.9304 and the adjusted coefficient of determination is 0.9217, so the regression model fits the data very well. The F test also shows very strong overall significance (p = 6.609e-06).

What is the value of the sample correlation coefficient?

correlation_coefficient <- cor(P4$x1, P4$x2)
cat("Correlation coefficient between x1 and x2:", correlation_coefficient, "\n")

## Correlation coefficient between x1 and x2: 0.9645646

Answer: As the output shows, the sample correlation coefficient is 0.9646, which indicates a very strong positive linear correlation between the variables.
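The three sums of squares, the coefficient of determination, and the correlation coefficient are tied together by SST = SSR + SSE, r² = SSR/SST, and r = (sign of the slope) × sqrt(r²). The sketch below checks these identities on the P4 data printed above; the data frame is rebuilt by hand, and the `_demo` names are illustrative only.

```r
# Data as printed in the P4 tibble above
P4_demo <- data.frame(
  x1 = c(1, 3, 4, 4, 6, 8, 10, 10, 11, 13),
  x2 = c(80, 97, 92, 102, 103, 111, 119, 123, 117, 136)
)
fit_demo <- lm(x2 ~ x1, data = P4_demo)

# Decomposition of the total variability
SSE_demo <- sum(residuals(fit_demo)^2)
SST_demo <- sum((P4_demo$x2 - mean(P4_demo$x2))^2)
SSR_demo <- SST_demo - SSE_demo

# Coefficient of determination and the correlation recovered from it
r2_demo <- SSR_demo / SST_demo
r_demo  <- sign(coef(fit_demo)[["x1"]]) * sqrt(r2_demo)

cat("r^2:", round(r2_demo, 4), "\n")
cat("r from r^2:", round(r_demo, 7),
    " cor():", round(cor(P4_demo$x1, P4_demo$x2), 7), "\n")
```

The value of r recovered from r² should agree with the result of cor(), which is a quick consistency check on the hand-computed sums of squares.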

PROBLEM 4.

To identify high-paying jobs for people who do not like stress, the following data were collected showing the average annual salary ($1000s) and the stress tolerance for a variety of occupations (Business Insider, November 8, 2013).

The stress tolerance for each job is rated on a scale from 0 to 100, where a lower rating indicates less stress.

Develop a scatter diagram for these data with average annual salary as the independent variable. What does the scatter diagram indicate about the relationship between the two variables?

P3 <- read_excel("C:/Users/yovanni/Downloads/P3.xlsx",sheet="Worksheet")
P3

## # A tibble: 10 × 3
##    x0                             x1    x2
##    <chr>                       <dbl> <dbl>
##  1 Art directors                  81  69  
##  2 Astronomers                    96  62  
##  3 Audiologists                   70  67.5
##  4 Dental hygienists              70  71.3
##  5 Economists                     92  63.3
##  6 Engineers                      92  69.5
##  7 Law teachers                  100  62.8
##  8 Optometrists                   98  65.5
##  9 Political Scientists          102  60.1
## 10 Urban and regional planners    65  69

ggplot(P3, aes(x = x1, y = x2)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Salary vs Stress Tolerance",
       x = "Salary",
       y = "Stress Tolerance") +
  theme_minimal()

Answer: The scatter diagram suggests that, as in the previous example, a relationship may exist, in this case a negative one. Judging from the sample data, however, it appears weak, so it could reflect sampling variability rather than a population-level effect.

Use these data to develop an estimated regression equation that can be used to predict stress tolerance given the average annual salary.

MOD3<-lm(x2~x1, data = P3)
summary(MOD3)

## 
## Call:
## lm(formula = x2 ~ x1, data = P3)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.655 -1.889 -0.964  1.815  4.638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 84.25041    5.33986  15.778  2.6e-07 ***
## x1          -0.21074    0.06096  -3.457   0.0086 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.544 on 8 degrees of freedom
## Multiple R-squared:  0.599,  Adjusted R-squared:  0.5489 
## F-statistic: 11.95 on 1 and 8 DF,  p-value: 0.008603

Answer: The fitted regression model is:

y = -0.2107x + 84.2504

where y is the stress tolerance and x is the average annual salary.

At the .05 level of significance, does there appear to be a significant statistical relationship between the two variables?

Answer: Yes. The F test of overall significance gives p = 0.0086, below 0.05, so the relationship is statistically significant. The estimated slope is small in magnitude (about -0.21 points of stress tolerance per additional $1000 of salary), so the relationship, while significant, is not a strong one.

Would you feel comfortable in predicting the stress tolerance for a different occupation given the average annual salary for the occupation? Explain.

Answer: Not with this regression model. The adjusted coefficient of determination (0.5489) does not indicate a strong fit, the slope is small, and no analysis of the residuals has been carried out to check the model assumptions, so the equation cannot yet be used with full confidence.
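One way to make the uncertainty of such a prediction concrete is a prediction interval. The sketch below rebuilds the P3 data from the tibble printed above and asks for a 95% prediction interval at a salary of 85 (that is, $85,000); this salary is a hypothetical value chosen for illustration, as are the `_demo` names.

```r
# Salary (x1, $1000s) and stress tolerance (x2) as printed in the P3 tibble
P3_demo <- data.frame(
  x1 = c(81, 96, 70, 70, 92, 92, 100, 98, 102, 65),
  x2 = c(69, 62, 67.5, 71.3, 63.3, 69.5, 62.8, 65.5, 60.1, 69)
)
MOD3_demo <- lm(x2 ~ x1, data = P3_demo)

# Hypothetical new occupation with an average salary of $85,000
new_job <- data.frame(x1 = 85)

# 95% prediction interval for the stress tolerance of a single occupation
pi_85 <- predict(MOD3_demo, newdata = new_job,
                 interval = "prediction", level = 0.95)
pi_width <- pi_85[1, "upr"] - pi_85[1, "lwr"]

print(pi_85)
cat("Width of the 95% prediction interval:", round(pi_width, 1), "\n")
```

The observed stress tolerances range only from 60.1 to 71.3, so an interval more than 10 points wide covers most of that range, which supports the caution expressed in the answer.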

Does the relationship between average annual salary and stress tolerance for these data seem reasonable to you? Explain.

Answer: No. Building on the previous answers, the relationship does exist in the sample data but is weak. Given the limited statistical evidence and the modest goodness of fit, this model is not suitable for description or forecasting. Further research could lead to a different conclusion; with this sample, however, that is not the case.

PROBLEM 5. 44. Automobile racing, high-performance driving schools, and driver education programs run by automobile clubs continue to grow in popularity. All these activities require the participant to wear a helmet that is certified by the Snell Memorial Foundation, a not-for-profit organization dedicated to research, education, testing, and development of helmet safety standards. Snell “SA” (Sports Application)-rated professional helmets are designed for auto racing and provide extreme impact resistance and high fire protection. One of the key factors in selecting a helmet is weight, since lower weight helmets tend to place less stress on the neck. Consider the following data showing the weight and price for 18 SA helmets.

Develop a scatter diagram with weight as the independent variable.

P6 <- read_excel("C:/Users/yovanni/Downloads/P6.xlsx",sheet="Worksheet")
P6

## # A tibble: 18 × 3
##    x0                               x1    x2
##    <chr>                         <dbl> <dbl>
##  1 Pyrotect Pro Airflow             64   248
##  2 Pyrotect Pro Airflow Graphics    64   278
##  3 RCi Full Face                    64   200
##  4 RaceQuip RidgeLine               64   200
##  5 HJC AR-10                        58   300
##  6 HJC Si-12                        47   700
##  7 HJC HX-10                        49   900
##  8 Impact Racing Super Sport        59   340
##  9 Zamp FSA-1                       66   199
## 10 Zamp RZ-2                        58   299
## 11 Zamp RZ-2 Ferrari                58   299
## 12 Zamp RZ-3 Sport                  52   479
## 13 Zamp RZ-3 Sport Painted          52   479
## 14 Bell M2                          63   369
## 15 Bell M4                          62   369
## 16 Bell M4 Pro                      54   559
## 17 G Force Pro Force 1              63   250
## 18 G Force Pro Force 1 Grafx        63   280

ggplot(P6, aes(x = x1, y = x2)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "black") +
  labs(title = "Weight vs Price",
       x = "Weight",
       y = "Price") +
  theme_minimal()

Does there appear to be any relationship between these two variables?

Answer: Yes, there appears to be a negative linear relationship between weight and price; in other words, the heavier the helmet, the lower the price.

Develop the estimated regression equation that could be used to predict the price given the weight.

MOD6<-lm(x2~x1, data = P6)
summary(MOD6)

## 
## Call:
## lm(formula = x2 ~ x1, data = P6)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -101.09  -76.33  -10.14   40.56  244.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2044.381    226.354   9.032 1.11e-07 ***
## x1           -28.350      3.826  -7.410 1.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 91.81 on 16 degrees of freedom
## Multiple R-squared:  0.7743, Adjusted R-squared:  0.7602 
## F-statistic:  54.9 on 1 and 16 DF,  p-value: 1.478e-06

Answer: The simple linear regression model gives the following fitted equation:

y = -28.350x + 2044.381

where y is the price and x is the weight.

Test for the significance of the relationship at the .05 level of significance.

Answer: The summary output above includes the F test of overall significance, with a p-value of 1.478e-06, far below 0.05. We therefore reject the hypothesis that the slope is zero and conclude that there is a significant negative linear relationship between the variables.

Did the estimated regression equation provide a good fit? Explain.

Answer: Yes. The coefficient of determination is fairly good, 0.7743, and it is close to the adjusted coefficient of determination, 0.7602, which is the more reliable indicator. The linear model therefore provides a good approximation to these data.
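The significance test at the .05 level can also be written out step by step as a t test on the slope. The sketch below takes the slope estimate, its standard error, and the residual degrees of freedom from the summary(MOD6) output above; the variable names are illustrative and nothing is refitted.

```r
# Values copied from the summary(MOD6) output above
b1     <- -28.350   # estimated slope for weight (x1)
se_b1  <- 3.826     # standard error of the slope
df_res <- 16        # residual degrees of freedom (n - 2)
alpha  <- 0.05

# Test statistic for H0: slope = 0 against H1: slope != 0
t_stat <- b1 / se_b1

# Two-sided critical value and p-value from the t distribution
t_crit  <- qt(1 - alpha / 2, df = df_res)
p_value <- 2 * pt(-abs(t_stat), df = df_res)

reject_h0 <- abs(t_stat) > t_crit
cat("t =", round(t_stat, 3), " critical value =", round(t_crit, 3), "\n")
cat("p-value =", signif(p_value, 3), " reject H0:", reject_h0, "\n")
```

Because there is a single regressor, this t test leads to the same decision as the F test quoted in the answer.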

PROBLEM 6. 49. In 2011 home prices and mortgage rates dropped so low that in a number of cities the monthly cost of owning a home was less expensive than renting. The following data show the average asking rent for 10 markets and the monthly mortgage on the median priced home (including taxes and insurance) for 10 cities where the average monthly mortgage payment was less than the average asking rent (The Wall Street Journal, November 26–27, 2011).

Develop the estimated regression equation that can be used to predict the monthly mortgage given the average asking rent.

P5 <- read_excel("C:/Users/yovanni/Downloads/P5.xlsx",sheet="Worksheet")
P5

## # A tibble: 10 × 3
##    x0                    x1    x2
##    <chr>              <dbl> <dbl>
##  1 Atlanta              840   539
##  2 Chicago             1062  1002
##  3 Detroit              823   626
##  4 Jacksonville, Fla.   779   711
##  5 Las Vegas            796   655
##  6 Miami               1071   977
##  7 Minneapolis          953   776
##  8 Orlando, Fla.        851   695
##  9 Phoenix              762   651
## 10 St. Louis            723   654

MOD5<-lm(x2~x1, data = P5)
summary(MOD5)

## 
## Call:
## lm(formula = x2 ~ x1, data = P5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -161.78  -38.65   15.18   56.19   78.40 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -197.9583   187.6950  -1.055  0.32238   
## x1             1.0699     0.2148   4.981  0.00108 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 78.78 on 8 degrees of freedom
## Multiple R-squared:  0.7561, Adjusted R-squared:  0.7257 
## F-statistic: 24.81 on 1 and 8 DF,  p-value: 0.001079

Answer: The fitted simple regression equation is:

y = 1.0699x - 197.9583

where y is the monthly mortgage and x is the average asking rent.

Construct a residual plot against the independent variable.

residuals_MOD5 <- residuals(MOD5)

# Scatter plot of the residuals against the independent variable x1
ggplot(P5, aes(x = x1, y = residuals_MOD5)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Independent Variable x1",
       x = "Independent Variable x1",
       y = "Model Residuals") +
  theme_minimal()

Do the assumptions about the error term and model form seem reasonable in light of the residual plot?

Answer: No. Some residuals deviate from the zero mean in a non-uniform way, so although independence seems plausible (there is no apparent pattern), the constant-variance assumption (homoscedasticity) does not appear to hold.
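A numerical complement to the residual plot is to look at standardized residuals, which rescale each residual by its estimated standard deviation. The sketch below rebuilds the P5 data from the tibble printed above and flags markets whose standardized residual exceeds 2 in absolute value; that cutoff is a common rule of thumb assumed here, not something taken from the original analysis, and the `_demo` names are illustrative.

```r
# Rent (x1) and mortgage (x2) as printed in the P5 tibble above
P5_demo <- data.frame(
  x0 = c("Atlanta", "Chicago", "Detroit", "Jacksonville, Fla.", "Las Vegas",
         "Miami", "Minneapolis", "Orlando, Fla.", "Phoenix", "St. Louis"),
  x1 = c(840, 1062, 823, 779, 796, 1071, 953, 851, 762, 723),
  x2 = c(539, 1002, 626, 711, 655, 977, 776, 695, 651, 654)
)
MOD5_demo <- lm(x2 ~ x1, data = P5_demo)

# Standardized residuals account for each point's leverage
std_res <- rstandard(MOD5_demo)
names(std_res) <- P5_demo$x0

# Rule-of-thumb flag (assumption): |standardized residual| > 2
flagged <- P5_demo$x0[abs(std_res) > 2]

print(round(std_res, 2))
cat("Markets flagged as unusual:", flagged, "\n")
```

Atlanta, which has the largest negative residual in the summary output (-161.78), is the observation to examine first; a single unusual point in a sample of 10 can drive much of the apparent non-constant spread.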

PROBLEM 7 10. Major League Baseball (MLB) consists of teams that play in the American League and the National League. MLB collects a wide variety of team and player statistics. Some of the statistics often used to evaluate pitching performance are as follows:

ERA: The average number of earned runs given up by the pitcher per nine innings. An earned run is any run that the opponent scores off a particular pitcher except for runs scored as a result of errors.

SO/IP: The average number of strikeouts per inning pitched.

HR/IP: The average number of home runs per inning pitched.

R/IP: The number of runs given up per inning pitched.

The following data show values for these statistics for a random sample of 20 pitchers from the American League for a full season.

Develop an estimated regression equation that can be used to predict the average number of runs given up per inning given the average number of strikeouts per inning pitched.

P8 <- read_excel("C:/Users/yovanni/Downloads/P8.xlsx")
P8

## # A tibble: 20 × 8
##    x0           x1       x2    x3    x4    x5    x6    x7
##    <chr>        <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 Verlander, J DET      24     5  2.4   1     0.1   0.29
##  2 Beckett, J   BOS      13     7  2.89  0.91  0.11  0.34
##  3 Wilson, C    TEX      16     7  2.94  0.92  0.07  0.4 
##  4 Sabathia, C  NYY      19     8  3     0.97  0.07  0.37
##  5 Haren, D     LAA      16    10  3.17  0.81  0.08  0.38
##  6 McCarthy, B  OAK       9     9  3.32  0.72  0.06  0.43
##  7 Santana, E   LAA      11    12  3.38  0.78  0.11  0.42
##  8 Lester, J    BOS      15     9  3.47  0.95  0.1   0.4 
##  9 Hernandez, F SEA      14    14  3.47  0.95  0.08  0.42
## 10 Buehrle, M   CWS      13     9  3.59  0.53  0.1   0.45
## 11 Pineda, M    SEA       9    10  3.74  1.01  0.11  0.44
## 12 Colon, B     NYY       8    10  4     0.82  0.13  0.52
## 13 Tomlin, J    CLE      12     7  4.25  0.54  0.15  0.48
## 14 Pavano, C    MIN       9    13  4.3   0.46  0.1   0.55
## 15 Danks, J     CWS       8    12  4.33  0.79  0.11  0.52
## 16 Guthrie, J   BAL       9    17  4.33  0.63  0.13  0.54
## 17 Lewis, C     TEX      14    10  4.4   0.84  0.17  0.51
## 18 Scherzer, M  DET      15     9  4.43  0.89  0.15  0.52
## 19 Davis, W     TB       11    10  4.45  0.57  0.13  0.52
## 20 Porcello, R  DET      14     9  4.75  0.57  0.1   0.57

MOD8<-lm(x7~x5, data = P8)
summary(MOD8)

## 
## Call:
## lm(formula = x7 ~ x5, data = P8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10190 -0.04165 -0.00064  0.05221  0.09687 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.67575    0.06307  10.713 3.06e-09 ***
## x5          -0.28385    0.07869  -3.607  0.00202 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06027 on 18 degrees of freedom
## Multiple R-squared:  0.4195, Adjusted R-squared:  0.3873 
## F-statistic: 13.01 on 1 and 18 DF,  p-value: 0.002016

Answer:

The requested model is fitted as:

y = -0.28x + 0.68

The model does not show a good fit (R² = 0.4195), so it is not recommended for description or prediction.

Develop an estimated regression equation that can be used to predict the average number of runs given up per inning given the average number of home runs per inning pitched.

MOD9<-lm(x7~x6, data = P8)
summary(MOD9)

## 
## Call:
## lm(formula = x7 ~ x6, data = P8)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.152726 -0.033191  0.000942  0.037940  0.127274 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.30805    0.06036   5.104 7.42e-05 ***
## x6           1.34673    0.54071   2.491   0.0227 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06822 on 18 degrees of freedom
## Multiple R-squared:  0.2563, Adjusted R-squared:  0.215 
## F-statistic: 6.203 on 1 and 18 DF,  p-value: 0.02274

Discussion: The model is y = 1.34673x + 0.30805

This model does not show a particularly good fit (R² = 0.2563).

Develop an estimated regression equation that can be used to predict the average number of runs given up per inning given the average number of strikeouts per inning pitched and the average number of home runs per inning pitched.

MOD10<-lm(x7~x5+x6, data = P8)
summary(MOD10)

## 
## Call:
## lm(formula = x7 ~ x5 + x6, data = P8)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.101353 -0.036707  0.008531  0.037854  0.071857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.53651    0.08141   6.590 4.59e-06 ***
## x5          -0.24835    0.07181  -3.459    0.003 ** 
## x6           1.03193    0.43588   2.367    0.030 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05378 on 17 degrees of freedom
## Multiple R-squared:  0.5635, Adjusted R-squared:  0.5121 
## F-statistic: 10.97 on 2 and 17 DF,  p-value: 0.0008713

Discussion: The equation is y = -0.24835x5 + 1.03193x6 + 0.53651
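The three models fitted so far can be compared side by side through their adjusted R², which penalizes the extra regressor. The sketch below rebuilds only the columns x5 (SO/IP), x6 (HR/IP), and x7 (R/IP) from the P8 tibble printed above, assuming the printed values are the exact data; the `_demo` and list names are illustrative.

```r
# SO/IP (x5), HR/IP (x6) and R/IP (x7) as printed in the P8 tibble
P8_demo <- data.frame(
  x5 = c(1, 0.91, 0.92, 0.97, 0.81, 0.72, 0.78, 0.95, 0.95, 0.53,
         1.01, 0.82, 0.54, 0.46, 0.79, 0.63, 0.84, 0.89, 0.57, 0.57),
  x6 = c(0.1, 0.11, 0.07, 0.07, 0.08, 0.06, 0.11, 0.1, 0.08, 0.1,
         0.11, 0.13, 0.15, 0.1, 0.11, 0.13, 0.17, 0.15, 0.13, 0.1),
  x7 = c(0.29, 0.34, 0.4, 0.37, 0.38, 0.43, 0.42, 0.4, 0.42, 0.45,
         0.44, 0.52, 0.48, 0.55, 0.52, 0.54, 0.51, 0.52, 0.52, 0.57)
)

# The three candidate models from parts (a), (b) and (c)
fits_demo <- list(
  so_only = lm(x7 ~ x5, data = P8_demo),
  hr_only = lm(x7 ~ x6, data = P8_demo),
  both    = lm(x7 ~ x5 + x6, data = P8_demo)
)

# Adjusted R-squared of each model
adj_r2_demo <- sapply(fits_demo, function(m) summary(m)$adj.r.squared)
print(round(adj_r2_demo, 4))
```

The model with both regressors should come out ahead of either single-regressor model, in line with the three summary outputs above.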

A. J. Burnett, a pitcher for the New York Yankees, had an average number of strikeouts per inning pitched of .91 and an average number of home runs per inning of .16. Use the estimated regression equation developed in part (c) to predict the average number of runs given up per inning for A. J. Burnett. (Note: The actual value for R/IP was .6.)

Answer: y = -0.24835(0.91) + 1.03193(0.16) + 0.53651 = 0.4756
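The arithmetic of this prediction, and its distance from the actual value quoted in the problem, can be reproduced as follows. The coefficients are copied from the summary(MOD10) output above; the variable names are chosen here for illustration.

```r
# Coefficients copied from the summary(MOD10) output above
b0   <- 0.53651    # intercept
b_so <- -0.24835   # coefficient of SO/IP (x5)
b_hr <- 1.03193    # coefficient of HR/IP (x6)

# Values for A. J. Burnett given in the problem statement
so_ip       <- 0.91
hr_ip       <- 0.16
actual_r_ip <- 0.6

# Predicted runs given up per inning and the prediction error
pred_r_ip        <- b0 + b_so * so_ip + b_hr * hr_ip
residual_burnett <- actual_r_ip - pred_r_ip

cat("Predicted R/IP:", round(pred_r_ip, 4), "\n")
cat("Actual minus predicted:", round(residual_burnett, 4), "\n")
```

The model underpredicts the actual R/IP of .6 by roughly 0.12, more than twice the residual standard error of 0.05378 reported for this model.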

Suppose a suggestion was made to also use the earned run average as another independent variable in part (c). What do you think of this suggestion?

MOD11<-lm(x7~x5+x6+x4, data = P8)
summary(MOD11)

## 
## Call:
## lm(formula = x7 ~ x5 + x6 + x4, data = P8)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.026061 -0.010482 -0.003017  0.005600  0.044248 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.07753    0.05052   1.535   0.1444    
## x5          -0.03025    0.03207  -0.943   0.3596    
## x6          -0.38748    0.20015  -1.936   0.0707 .  
## x4           0.11835    0.01074  11.025 6.96e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01891 on 16 degrees of freedom
## Multiple R-squared:  0.9492, Adjusted R-squared:  0.9397 
## F-statistic: 99.69 on 3 and 16 DF,  p-value: 1.443e-10

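Answer: As the output shows, the overall significance is extremely strong (p = 1.443e-10) and the goodness of fit is very good (R² = 0.9492). However, the individual significance tests for x5 and x6 are unfavorable at the standard 0.05 level: their p-values (0.3596 and 0.0707) are too large to conclude that either coefficient differs from zero once ERA is in the model. This is the pattern one would expect if ERA carries nearly the same information as R/IP, since both measure runs allowed relative to innings pitched, which suggests the new regressor largely restates the dependent variable.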
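How closely ERA tracks R/IP can be checked with a simple correlation. The sketch below copies the x4 (ERA) and x7 (R/IP) columns from the P8 tibble printed above; the vector names are illustrative.

```r
# ERA (x4) and R/IP (x7) as printed in the P8 tibble
era <- c(2.4, 2.89, 2.94, 3, 3.17, 3.32, 3.38, 3.47, 3.47, 3.59,
         3.74, 4, 4.25, 4.3, 4.33, 4.33, 4.4, 4.43, 4.45, 4.75)
r_ip <- c(0.29, 0.34, 0.4, 0.37, 0.38, 0.43, 0.42, 0.4, 0.42, 0.45,
          0.44, 0.52, 0.48, 0.55, 0.52, 0.54, 0.51, 0.52, 0.52, 0.57)

# Sample correlation between the proposed regressor and the response
cor_era_rip <- cor(era, r_ip)
cat("Correlation between ERA and R/IP:", round(cor_era_rip, 3), "\n")
```

A correlation close to 1 would indicate that ERA is nearly a rescaled version of the dependent variable, which would explain why x5 and x6 lose their individual significance in MOD11.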

PROBLEM 8 18. Refer to exercise 10, where Major League Baseball (MLB) pitching statistics were reported for a random sample of 20 pitchers from the American League for one full season.

In part (c) of exercise 10, an estimated regression equation was developed relating the average number of runs given up per inning pitched given the average number of strikeouts per inning pitched and the average number of home runs per inning pitched. What are the values of R² and R²a?

Answer: From the summary of MOD10, R² = 0.5635 and the adjusted value is R²a = 0.5121. These are the coefficient of determination and its adjusted version; they measure the goodness of fit, that is, the share of the variability in R/IP that the linear model explains with the variables used.
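The adjusted coefficient of determination follows from R² through R²a = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the sample size and p the number of regressors. The sketch below applies this formula to the values reported for MOD10 (n = 20 pitchers, p = 2 regressors); the variable names are illustrative.

```r
# Values for MOD10 taken from its summary output above
r2 <- 0.5635   # Multiple R-squared
n  <- 20       # number of pitchers in the sample
p  <- 2        # number of regressors (x5 and x6)

# Adjusted R-squared penalizes R-squared for the number of regressors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

cat("Adjusted R-squared:", round(adj_r2, 4), "\n")
```

The result should match the adjusted value printed by summary(MOD10) up to rounding.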

Does the estimated regression equation provide a good fit to the data? Explain.

Answer: In general, no. Both the coefficient of determination (0.5635) and its adjusted version (0.5121) are low to moderate, so the model explains only a little more than half of the variability in R/IP and is not ideal for predicting or describing the phenomenon. The gap between the two values reflects the penalty for using two regressors with only 20 observations; on its own it is not evidence of multicollinearity, and in this model the F test and both individual t tests are significant at the 0.05 level.

Suppose the earned run average (ERA) is used as the dependent variable in part (c) instead of the average number of runs given up per inning pitched. Does the estimated regression equation that uses the ERA provide a good fit to the data? Explain.

MOD12<-lm(x4~x5+x6, data = P8)
summary(MOD12)

## 
## Call:
## lm(formula = x4 ~ x5 + x6, data = P8)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.83467 -0.22620  0.05842  0.20316  0.72294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.8781     0.6466   5.998 1.44e-05 ***
## x5           -1.8428     0.5703  -3.231  0.00491 ** 
## x6           11.9933     3.4621   3.464  0.00297 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4272 on 17 degrees of freedom
## Multiple R-squared:  0.6251, Adjusted R-squared:  0.581 
## F-statistic: 14.17 on 2 and 17 DF,  p-value: 0.0002387

Answer: No. With ERA as the dependent variable, the regressors show very good individual and overall significance, they do not appear to be collinear, and there is statistically significant evidence that both partial slopes differ from zero. However, the goodness of fit is only moderate: the coefficient of determination is 0.6251 and the adjusted value is 0.581.

PROBLEM 9 24. The National Football League (NFL) records a variety of performance data for individuals and teams. A portion of the data showing the average number of passing yards obtained per game on offense (OffPassYds/G), the average number of yards given up per game on defense (DefYds/G), and the percentage of games won (Win%), for one full season follows.

Develop an estimated regression equation that can be used to predict the percentage of games won given the average number of passing yards obtained per game on offense and the average number of yards given up per game on defense.

P9 <- read_excel("C:/Users/yovanni/Downloads/P9.xlsx")

## New names:
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`

P9

## # A tibble: 32 × 7
##    x0            x1    x2    x3 ...5  ...6  ...7 
##    <chr>      <dbl> <dbl> <dbl> <lgl> <chr> <chr>
##  1 Arizona     223.  355.  50   NA    x1    off  
##  2 Atlanta     262   334.  62.5 NA    x2    def  
##  3 Baltimore   214.  289.  75   NA    x3    win  
##  4 Buffalo     231.  371.  37.5 NA    <NA>  <NA> 
##  5 Carolina    239.  378.  37.5 NA    <NA>  <NA> 
##  6 Chicago     188.  350.  50   NA    <NA>  <NA> 
##  7 Cincinnati  209.  316.  56.3 NA    <NA>  <NA> 
##  8 Cleveland   193.  332.  25   NA    <NA>  <NA> 
##  9 Dallas      263.  343.  50   NA    <NA>  <NA> 
## 10 Denver      152.  358.  50   NA    <NA>  <NA> 
## # ℹ 22 more rows

MOD13<-lm(x3~x2+x1, data = P9)
summary(MOD13)

## 
## Call:
## lm(formula = x3 ~ x2 + x1, data = P9)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.363 -11.794  -0.783   4.120  36.802 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.54055   28.35621   2.135   0.0413 *  
## x2          -0.24134    0.08928  -2.703   0.0114 *  
## x1           0.31862    0.06256   5.093 1.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.31 on 29 degrees of freedom
## Multiple R-squared:  0.4762, Adjusted R-squared:  0.4401 
## F-statistic: 13.18 on 2 and 29 DF,  p-value: 8.474e-05

Answer: The estimated regression equation is ŷ = 60.5406 + 0.3186x1 - 0.2413x2, where x1 is OffPassYds/G, x2 is DefYds/G, and ŷ is the predicted Win%.
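To illustrate how the estimated equation is used for prediction, the sketch below evaluates it by hand with the coefficients printed by summary(MOD13). The helper predict_win and the input values are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(MOD13) output
b <- c(intercept = 60.54055, x1 = 0.31862, x2 = -0.24134)

# Predicted Win% given passing yards gained per game (x1)
# and total yards allowed per game (x2)
predict_win <- function(off_pass_yds, def_yds) {
  unname(b["intercept"] + b["x1"] * off_pass_yds + b["x2"] * def_yds)
}

# Hypothetical team: 225 passing yards gained and 300 yards allowed per game
predict_win(225, 300)
```

When the fitted object is still in memory, predict(MOD13, newdata = data.frame(x1 = 225, x2 = 300)) should give the same result up to coefficient rounding.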

Use the F test to determine the overall significance of the relationship. What is your conclusion at the .05 level of significance?

Answer: The F test gives F = 13.18 on 2 and 29 degrees of freedom with a p-value of 8.474e-05, far below .05. We reject H0: β1 = β2 = 0 and conclude that the overall relationship is significant.
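The overall F test can be reproduced from the statistic and degrees of freedom printed by summary(MOD13). This is a minimal sketch using base R's F distribution functions; the variable names are illustrative.

```r
# Values copied from the summary(MOD13) output
f_stat <- 13.18
df1    <- 2     # number of regressors
df2    <- 29    # residual degrees of freedom
alpha  <- 0.05

# Critical value and upper-tail p-value of the F distribution
f_crit  <- qf(1 - alpha, df1, df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

# Reject H0: beta1 = beta2 = 0 when the statistic exceeds the critical value
reject_h0 <- f_stat > f_crit
c(f_crit = f_crit, p_value = p_value, reject_h0 = reject_h0)
```

The critical-value rule and the p-value rule always lead to the same decision; the p-value is simply more informative about how strong the evidence is.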

Use the t test to determine the significance of each independent variable. What is your conclusion at the .05 level of significance?

Answer: Both t tests are significant at the .05 level: p = 0.0114 for x2 (DefYds/G) and p = 1.96e-05 for x1 (OffPassYds/G). There is therefore statistically significant evidence that both partial slopes differ from zero, even though the slopes are not steep. The low goodness of fit (multiple R-squared of 0.4762) is notable given that the t tests and the F test are all clearly significant.
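The individual t tests can be rebuilt from the estimates and standard errors in the coefficient table of summary(MOD13). This is a minimal sketch; the vector names are illustrative.

```r
# Estimates and standard errors copied from the summary(MOD13) output
est    <- c(x2 = -0.24134, x1 = 0.31862)
se     <- c(x2 = 0.08928,  x1 = 0.06256)
df_res <- 29
alpha  <- 0.05

# t statistic for H0: beta_j = 0 and its two-sided p-value
t_stat  <- est / se
p_value <- 2 * pt(abs(t_stat), df_res, lower.tail = FALSE)

# Two-sided critical value at the chosen significance level
t_crit <- qt(1 - alpha / 2, df_res)

data.frame(t_stat, p_value, reject_h0 = abs(t_stat) > t_crit)
```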

PROBLEM 10

51. A partial computer output from a regression analysis follows.

a. Compute the missing entries in this output.
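The missing entries of a regression output are linked by a few identities, so any blank can be filled from the others. The sketch below shows those identities in a hypothetical helper, fill_anova. The input numbers are invented for illustration and are NOT the values from this exercise.

```r
# Hypothetical helper: derive the remaining ANOVA entries from the
# regression and error sums of squares, sample size n, and p regressors
fill_anova <- function(ssr, sse, n, p) {
  df_reg <- p
  df_res <- n - p - 1
  msr    <- ssr / df_reg           # mean square due to regression
  mse    <- sse / df_res           # mean square error
  f_stat <- msr / mse
  list(
    df_reg  = df_reg,
    df_res  = df_res,
    msr     = msr,
    mse     = mse,
    f_stat  = f_stat,
    p_value = pf(f_stat, df_reg, df_res, lower.tail = FALSE),
    s       = sqrt(mse),           # standard error of the estimate
    r2      = ssr / (ssr + sse)    # coefficient of determination
  )
}

# Illustrative inputs only (not taken from the exercise)
res <- fill_anova(ssr = 800, sse = 200, n = 23, p = 2)
str(res)
```

A missing t value in the coefficient table follows the same pattern: it is the coefficient divided by its standard error.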

Use the F test and α = .05 to see whether a significant relationship is present.

Answer: As the Minitab output shows, the overall significance test gives a p-value of approximately zero, which is below α = .05, so a significant relationship is present.

Use the t test and α = .05 to test H0: β1 = 0 and H0: β2 = 0.

Answer: The t tests for both regressors give p-values close to zero, so both null hypotheses are rejected at the .05 level of significance. There is statistically significant evidence that each partial slope differs from zero, so a linear relationship exists, however small it may be.

PROBLEM 11

52. Recall that in exercise 49, the admissions officer for Clearwater College developed the following estimated regression equation relating final college GPA to the student's SAT mathematics score and high-school GPA.

ŷ = -1.41 + .0235x1 + .00486x2

where

x1 = high-school grade point average
x2 = SAT mathematics score
y = final college grade point average
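As a quick illustration of how the estimated equation is read, the sketch below evaluates it for one student. The helper predict_gpa and the input values (a high-school average of 84 and an SAT mathematics score of 540) are assumptions chosen for illustration.

```r
# Estimated regression equation from the exercise statement:
# y-hat = -1.41 + 0.0235 * x1 + 0.00486 * x2
predict_gpa <- function(hs_gpa, sat_math) {
  -1.41 + 0.0235 * hs_gpa + 0.00486 * sat_math
}

# Illustrative student: high-school average 84, SAT mathematics score 540
gpa_hat <- predict_gpa(84, 540)
gpa_hat
```

Each coefficient is the estimated change in final college GPA for a one-unit increase in that variable while the other is held fixed.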

a. Complete the missing entries in this output.

Use the F test and a .05 level of significance to see whether a significant relationship is present.

Answer: The F test gives a p-value of approximately zero, negligible at the number of decimals reported, which is below .05. There is therefore statistically significant evidence that a relationship exists, however large or small it may be.

Use the t test and α = .05 to test H0: β1 = 0 and H0: β2 = 0.

Answer: As with the overall significance test, both individual tests are significant at the chosen level. At the .05 level of significance there is sufficient evidence to reject both null hypotheses, so each regressor has a linear relationship with the dependent variable, however large or small it may be.

Did the estimated regression equation provide a good fit to the data? Explain.

Answer: Yes. The coefficient of determination is above 0.85, so the model fits the data well.

Remaining problems. Note to the esteemed teaching assistant: my sincere apologies for not finishing. Lack of sleep is keeping me from working through the remaining problems, but I will come back soon with renewed energy so that this unfortunate event is not repeated. Statistically significant regards ;).

José Rodrigo Türk Ikeda

2024-07-23
