Inference Guide
There is a useful Inference Cheat Sheet in your readings folder
The histogram below shows the sample's mileage in MPG. Note that these data are user estimates.
Do you think this is reasonable?
Answer: No, there is uncertainty about the accuracy of the source data. The data reported online are not verifiable, and the population of 2012 Prius drivers may differ from the 2012 Prius drivers who report on fueleconomy.gov.
Do these data provide strong evidence against this estimate?
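For reference, the calculation below uses the standard one-sample t statistic:

\[T = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{53.3 - 50}{5.2/\sqrt{14}}\]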
test_stat = (53.3 - 50)/(5.2/sqrt(14))
test_stat
## [1] 2.374513
reject = qt(0.975, df = 13)
reject
## [1] 2.160369
Answer: Assuming the sample was randomly selected and using alpha = 0.05, the test statistic is higher than the upper boundary of the rejection region. Therefore, the data provide convincing evidence that the average MPG for drivers who post on fueleconomy.gov differs from the estimate for 2012 Prius drivers overall.
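As a cross-check (not part of the original solution), the two-sided p-value can be computed directly from the t distribution using the test statistic above; it should come out below 0.05, consistent with the rejection above.

# Two-sided p-value for the one-sample t test above (cross-check)
2 * pt(-abs(test_stat), df = 13)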
z_value <- qnorm(.975)
stan_error <- 5.2/sqrt(14)
lower <- 53.3 - (z_value * stan_error)
upper <- 53.3 + (z_value * stan_error)
lower
## [1] 50.57612
upper
## [1] 56.02388
Answer: The 95% CI for the average gas mileage of 2012 Prius drivers who post on fueleconomy.gov is (50.57612, 56.02388).
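Note that the interval above uses a normal (z) critical value even though n = 14 is small and the standard deviation is estimated; a t critical value with 13 degrees of freedom, as in the hypothesis test, would give a slightly wider interval. A minimal sketch of that alternative, reusing stan_error from above:

# Alternative 95% CI using the t distribution (df = 13) instead of z
t_value <- qt(0.975, df = 13)
c(53.3 - t_value * stan_error, 53.3 + t_value * stan_error)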
Prices of diamonds are determined by what is known as the 4 Cs: cut, clarity, color, and carat weight. The prices of diamonds go up as carat weight increases, but the increase is not smooth. For example, the difference between the size of a 0.99 carat diamond and a 1 carat diamond is essentially undetectable to the naked eye, but the price difference can be substantial. In this question we use two random samples of diamonds, 0.99 carat and 1 carat diamonds (each of size 23), and compare the average prices of the diamonds.
In order to be able to compare equivalent units, we first divide the price of each diamond by 100 times its weight in carats.
That is, for a 0.99 carat diamond, we divide the price by 99.
For a 1 carat diamond, we divide the price by 100.
The distributions and some sample statistics are shown below.[@wickham_ggplot2:_2016]
(Figure: distributions of point price for the two samples; table of sample statistics.)
Make sure to state your hypotheses clearly and check relevant conditions.
Answer: Alternative hypothesis (H_A): the average standardized prices of 0.99 and 1 carat diamonds are different. Null hypothesis (H_0): the average standardized prices of 0.99 and 1 carat diamonds are not different. Alpha = 0.05. The sample standard deviations of the two groups differ, so the Welch approximation is used to calculate the degrees of freedom.
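For reference, the test statistic and the Welch approximation to the degrees of freedom used in the code below are:

\[T = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}, \qquad df \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}\]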
test_stat = (44.51 - 56.81)/sqrt((13.32^2)/23 + (16.13^2)/23)
degrees = (((13.32^2) / 23 + (16.13^2) / 23)^2) / ((((13.32^2 / 23)^2) / (22)) + (((16.13^2 / 23)^2) / (22)))
reject = -qt(0.975, df = degrees)
test_stat
## [1] -2.819881
reject
## [1] -2.017405
Answer: With alpha = 0.05, the test statistic is lower than the lower boundary of the rejection region, so we reject the null hypothesis; the data provide convincing evidence that the average standardized prices of 0.99 and 1 carat diamonds are different.
We discussed diamond prices (standardized by weight) for 0.99 and 1 carat diamonds above. See the table for summary statistics, and construct a 95% confidence interval for the average difference between the standardized prices of 0.99 and 1 carat diamonds.
You may assume the conditions for inference are met.
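The interval below follows the standard two-sample form, using the Welch degrees of freedom computed for the hypothesis test above:

\[(\bar{x}_1 - \bar{x}_2) \pm t^*_{df} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]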
t_value <- qt(0.975, df = degrees)
stan_error <- sqrt((13.32^2)/23 + (16.13^2)/23)
lower <- (44.51 - 56.81) - (t_value * stan_error)
upper <- (44.51 - 56.81) + (t_value * stan_error)
lower
## [1] -21.09969
upper
## [1] -3.500307
Answer: The 95% CI for the average price of .99 carat diamonds minus the average price of 1 carat diamonds is (-21.09969, -3.500307).
The movie Moneyball focuses on the “quest for the secret of success in baseball”.
In this lab we’ll be looking at data from all 30 Major League Baseball teams and examining the linear relationship between runs scored in a season and a number of other player statistics. Our aim will be to summarize these relationships both graphically and numerically in order to find which variable, if any, helps us best predict a team’s runs scored in a season.
Let’s load up the data for the 2011 season.
load("./data/mlb11.RData")
In addition to runs scored, there are seven traditionally used variables in the data set: at-bats, hits, home runs, batting average, strikeouts, stolen bases, and wins. There are also three newer variables: on-base percentage (new_onbase), slugging percentage (new_slug), and on-base plus slugging (new_obs).
For the first portion of the analysis we’ll consider the seven traditional variables. At the end of the lab, you’ll work with the newer variables.
What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear?
plot(mlb11$at_bats, mlb11$runs)
ANSWER: A scatterplot is used to display the relationship between at_bats and runs. The relationship does look linear.
If you knew a team’s at_bats, would you be comfortable using a linear model to predict the number of runs? Quantify the strength of the relationship with the correlation coefficient.
cor(mlb11$at_bats, mlb11$runs)
## [1] 0.610627
ANSWER: I would be comfortable using a linear model for these two variables. The correlation coefficient is 0.610627.
Think back to the way that we described the distribution of a single variable.
Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship, as well as any unusual observations.
ANSWER: There is a moderately strong, positive linear relationship between at-bats and runs for the 2011 season.
Just as we used the mean and standard deviation to summarize a single variable, we can summarize the relationship between two variables by finding the line that best follows their association. Use the following interactive function to select the line that you think does the best job of going through the cloud of points.
library(statsr)
##
## Attaching package: 'statsr'
## The following objects are masked _by_ '.GlobalEnv':
##
## mlb11, plot_ss
plot_ss(x = mlb11$at_bats, y = mlb11$runs)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
You’ll need to run this command in an R script (.R file). After running it, you’ll be prompted to click two points on the plot to define a line; once you’ve done that, the line you specified and its residuals will be displayed.
Note that there are 30 residuals, one for each of the 30 observations.
\[e_i = y_i- \hat{y}_i\]
The most common way to do linear regression is to select the line that minimizes the sum of the squared residuals. To visualize the squared residuals, you can rerun the plot command and add the argument showSquares = TRUE.
plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)
## Click two points to make a line.
## Call:
## lm(formula = y ~ x, data = pts)
##
## Coefficients:
## (Intercept) x
## -2789.2429 0.6305
##
## Sum of Squares: 123721.9
Note that the output from the plot_ss function provides you with the slope and intercept of your line as well as the sum of squares. Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?
Q 3.3a ANSWER: The smallest sum of squares I got was 145548.8 on my fifth attempt. This was slightly greater than what my classmate neighbors reported for their fifth attempts.
It is rather cumbersome to try to get the correct least squares line, i.e. the line that minimizes the sum of squared residuals, through trial and error. Instead we can use the lm function in R to fit the linear model (a.k.a. regression line).
m1 <- lm(runs ~ at_bats, data = mlb11)
The first argument in the function lm is a formula that takes the form y ~ x.
The output of lm is an object that contains all of the information we need about the linear model that was just fit. We can access this information using the summary function.
summary(m1)
##
## Call:
## lm(formula = runs ~ at_bats, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -125.58 -47.05 -16.59 54.40 176.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2789.2429 853.6957 -3.267 0.002871 **
## at_bats 0.6305 0.1545 4.080 0.000339 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 66.47 on 28 degrees of freedom
## Multiple R-squared: 0.3729, Adjusted R-squared: 0.3505
## F-statistic: 16.65 on 1 and 28 DF, p-value: 0.0003388
Let’s consider this output piece by piece.
\(\hat{y} = -2789.2429 + 0.6305 \times \text{at\_bats}\)
One last piece of information we will discuss from the summary output is the Multiple R-squared, or more simply, R-squared. For this model, what percentage of the variance in runs is explained by at-bats?
Q 3.3b ANSWER: 37.3% of the variability in runs is explained by at-bats.
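If desired, this value can be pulled directly from the model object rather than read off the printed summary (a small sketch using m1 from above):

# Extract R-squared directly from the fitted model;
# should match the Multiple R-squared reported above (about 0.373)
summary(m1)$r.squared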
Fit a new model that uses homeruns to predict runs.
m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)
##
## Call:
## lm(formula = runs ~ homeruns, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -91.615 -33.410 3.231 24.292 104.631
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 415.2389 41.6779 9.963 1.04e-10 ***
## homeruns 1.8345 0.2677 6.854 1.90e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.29 on 28 degrees of freedom
## Multiple R-squared: 0.6266, Adjusted R-squared: 0.6132
## F-statistic: 46.98 on 1 and 28 DF, p-value: 1.9e-07
Q 3.4. ANSWER: \(\widehat{Runs} = 415.2389 + 1.8345 \times \text{homeruns}\). The slope tells us that each additional home run is associated with about 1.8345 additional runs, on average.
Create a scatterplot with the least squares line laid on top.
plot(mlb11$runs ~ mlb11$at_bats)
abline(m1)
The function abline plots a line based on its slope and intercept.
library(ggplot2)
ggplot(mlb11, aes(x=at_bats, y=runs)) + geom_point() + geom_smooth(method = lm,
se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Here, we used a shortcut by providing the model m1, which contains both parameter estimates, so abline can extract the intercept and slope from it.
If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much?
Q 3.5a ANSWER: He or she would predict about 728 runs for the team, using the regression equation above. The Phillies have a data point very close to this number of at-bats, with 5579 at-bats and 713 runs. Therefore the prediction is an overestimate, with a residual of about -15.
-2789.2429 + 0.6305*5578
## [1] 727.6861
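Equivalently, the prediction can be obtained from the fitted model with predict(); this small sketch should reproduce the manual calculation above.

# Prediction for 5,578 at-bats using the fitted model m1
predict(m1, newdata = data.frame(at_bats = 5578))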
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability. Linearity: You already checked whether the relationship between runs and at-bats is linear using a scatterplot. We should also verify this condition using a plot of the residuals vs. at-bats.
Q 3.5b ANSWER: Plot of residuals vs. at-bats shown below.
Plot of the residuals vs. at-bats.
plot(m1$residuals ~ mlb11$at_bats)
Is there any apparent pattern in the residuals plot?
Q 3.6 Answer: There is no apparent pattern. Therefore, the linearity condition appears to be met for the relationship between runs and at-bats.
plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3)
Nearly normal residuals: To check this condition, we can look at a histogram of the residuals and a normal probability plot.
hist(m1$residuals)
Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?
qqnorm(m1$residuals)
qqline(m1$residuals)
Q 3.7 ANSWER: Although the histogram is slightly right-skewed, the residuals are generally symmetric, so the nearly normal residuals condition appears to be met.
Constant variability: Based on the residuals plot above, does the constant variability condition appear to be met?
Q 3.8 ANSWER: Yes. The constant variability condition appears to be met, since the variability around the line of best fit appears to be roughly constant.
Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model.
At a glance, does there seem to be a linear relationship?
Q 3.9a ANSWER: Batting average (bat_avg) was chosen; the scatterplot below suggests a fairly strong, positive linear relationship with runs.
m3 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg)
abline(m3)
summary(m3)
##
## Call:
## lm(formula = runs ~ bat_avg, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94.676 -26.303 -5.496 28.482 131.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -642.8 183.1 -3.511 0.00153 **
## bat_avg 5242.2 717.3 7.308 5.88e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 49.23 on 28 degrees of freedom
## Multiple R-squared: 0.6561, Adjusted R-squared: 0.6438
## F-statistic: 53.41 on 1 and 28 DF, p-value: 5.877e-08
How does this relationship compare to the relationship between runs and at_bats?
Q 3.9b ANSWER: The R-squared value for this model is 65.61%, which is substantially higher than the R-squared value between runs and at_bats (37.3%). This indicates that batting average is a better variable to predict runs.
Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we’ve discussed.
Q 3.10 ANSWER: Batting average (which happened to be the variable chosen in 3.9) is the best predictor of runs among the traditional variables, with an R-squared value of 65.61%. The other variables tested were hits, at_bats, homeruns, strikeouts, stolen_bases, and wins. The correlation coefficient between batting average and runs is 0.81, computed below; a sketch of how the other traditional variables could be compared follows that output.
cor(mlb11$runs, mlb11$bat_avg)
## [1] 0.8099859
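As a sketch of how the other traditional variables could be compared (the variable names are taken from the answer above and assumed to match the columns in mlb11), one could loop over the candidate predictors and collect R-squared values:

# Compare R-squared across the traditional predictors (illustrative sketch)
predictors <- c("at_bats", "hits", "homeruns", "bat_avg",
                "strikeouts", "stolen_bases", "wins")
r_squared <- sapply(predictors, function(v) {
  summary(lm(reformulate(v, response = "runs"), data = mlb11))$r.squared
})
sort(r_squared, decreasing = TRUE)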
Now examine the three newer variables. Are they more or less effective at predicting runs than the traditional variables, and which one seems to be the best predictor?
Q 3.11 ANSWER: All three new predictors have higher R-squared values (84.91% for new_onbase, 89.69% for new_slug, and 93.49% for new_obs) and are thus better predictors of runs. new_obs (on-base plus slugging) appears to be the best predictor of runs. The plots for that model show linearity and roughly constant variability in the residuals. With no baseball knowledge, I cannot personally verify whether this makes the most sense.
onbase_model <- lm(runs ~ new_onbase, data = mlb11)
slug_model <- lm(runs ~ new_slug, data = mlb11)
obs_model <- lm(runs ~ new_obs, data = mlb11)
summary(onbase_model)
##
## Call:
## lm(formula = runs ~ new_onbase, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.270 -18.335 3.249 19.520 69.002
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1118.4 144.5 -7.741 1.97e-08 ***
## new_onbase 5654.3 450.5 12.552 5.12e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.61 on 28 degrees of freedom
## Multiple R-squared: 0.8491, Adjusted R-squared: 0.8437
## F-statistic: 157.6 on 1 and 28 DF, p-value: 5.116e-13
summary(slug_model)
##
## Call:
## lm(formula = runs ~ new_slug, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.41 -18.66 -0.91 16.29 52.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -375.80 68.71 -5.47 7.70e-06 ***
## new_slug 2681.33 171.83 15.61 2.42e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.96 on 28 degrees of freedom
## Multiple R-squared: 0.8969, Adjusted R-squared: 0.8932
## F-statistic: 243.5 on 1 and 28 DF, p-value: 2.42e-15
summary(obs_model)
##
## Call:
## lm(formula = runs ~ new_obs, data = mlb11)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.456 -13.690 1.165 13.935 41.156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -686.61 68.93 -9.962 1.05e-10 ***
## new_obs 1919.36 95.70 20.057 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.41 on 28 degrees of freedom
## Multiple R-squared: 0.9349, Adjusted R-squared: 0.9326
## F-statistic: 402.3 on 1 and 28 DF, p-value: < 2.2e-16
plot(mlb11$runs ~ mlb11$new_obs)
abline(obs_model)
plot(obs_model$residuals ~ mlb11$new_obs)
abline(h = 0, lty = 3)
hist(obs_model$residuals)
qqnorm(obs_model$residuals)
qqline(obs_model$residuals)
What concepts from the textbook are covered in this lab?
Q 3.12 ANSWER: The concepts of linear regression, least squares, and checking the conditions for a linear model (such as linearity) are covered in the OpenIntro Statistics textbook. Applying them with R code was covered in lectures and previous labs.
Time series are a common type of data. In this project we will be using classical decomposition to separate a time series into trend, seasonal, and random components. First, as an introduction to decomposition, we will work through a quick example.
What is the decomposition of a time series? Briefly, decomposition splits a series into a long-run trend, a repeating seasonal component, and an irregular (random) remainder.
library(datasets)
library(tidyverse)
## -- Attaching packages --------------------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
air <- as.data.frame(AirPassengers)
Plot the total time series of air passengers.
ANSWER: The time series of passengers contains 144 data points, and the plot shows an increase in airline passengers over time with some random and seasonal fluctuations.
plot.ts(AirPassengers)
Use the ts() function in base R to convert the data to a time series object. If the data is taken monthly, how many data points make up one yearly season, and which months tend to have the most and fewest passengers?
ANSWER: In the time series plot, the number of passengers increases with each subsequent year. July and August tend to have the most passengers within each year, and November tends to have the fewest. There are 12 points per yearly season.
ts(AirPassengers, frequency = 12)
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 112 118 132 129 121 135 148 148 136 119 104 118
## 2 115 126 141 135 125 149 170 170 158 133 114 140
## 3 145 150 178 163 172 178 199 199 184 162 146 166
## 4 171 180 193 181 183 218 230 242 209 191 172 194
## 5 196 196 236 235 229 243 264 272 237 211 180 201
## 6 204 188 235 227 234 264 302 293 259 229 203 229
## 7 242 233 267 269 270 315 364 347 312 274 237 278
## 8 284 277 317 313 318 374 413 405 355 306 271 306
## 9 315 301 356 348 355 422 465 467 404 347 305 336
## 10 340 318 362 348 363 435 491 505 404 359 310 337
## 11 360 342 406 396 420 472 548 559 463 407 362 405
## 12 417 391 419 461 472 535 622 606 508 461 390 432
Use the decompose() function to decompose the time series and remove the seasonality. The type for this time series is multiplicative.
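In the multiplicative form, each observation is modeled as the product of a trend, a seasonal, and a random component:

\[Y_t = T_t \times S_t \times R_t\]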
ANSWER: The plot of the decomposed time series shows a positive trend over the years, a seasonal component with the same cyclic shape each year, and a random component capturing the remaining variation.
a <- ts(AirPassengers, frequency = 12)
b <- decompose(a, "multiplicative")
plot(b)
Isolate the trend and plot the trend on top of the raw data with the seasonality included.
ANSWER: The trend line represents the original air passenger data fairly well; the remaining variation around the trend is due to seasonality and randomness.
# Extract the trend from the decomposition
trend <- b$trend
# Plot the raw data annotated with the trend
plot(a)
lines(trend, col = "red")  # overlay the trend on the raw series
title(main = "AirPassengers Raw Data with Trend")
Now let’s try this with a real-world time series.
We’ll be using one month of power and weather data from a solar power plant.
The data set variables are as follows:
The power from solar panels is dependent on the irradiance hitting them, so we expect power and irradiance to follow the same pattern.
It is useful to have multiple sources of irradiance measurements.
Sensors on the ground are useful because they measure irradiance directly at the site, but like any instrument they can drift or degrade over time. To combat this, we also have irradiance data from SolarGIS, an independent satellite-derived source.
Plot the irradiance and power for the first week of data. How many variables and observations are in the data set, what time interval does it cover, and how do power and the two irradiance measurements relate?
ANSWER: There are 7 columns (variables) and 2880 rows (observations) in the dataset. The time interval for the entire dataset goes from “2012-06-01 UTC” to “2012-06-30 23:45:00 UTC”. The irradiance peaks midday and drops off at night, which is expected. The solar panel power and both sources of irradiance line up well. The way the three series line up on the same plot shows the direct relationship between incident irradiance and the power produced by the solar panels.
# Read in the data
library(ggplot2)
solar_data <- readr::read_csv("data/2001-353-353m-453-le2b-tsa-data-final.csv")
## Parsed with column specification:
## cols(
## time = col_datetime(format = ""),
## ghir = col_double(),
## iacp = col_double(),
## temp = col_double(),
## ghi_solargis = col_double(),
## clear = col_logical(),
## ratio = col_double()
## )
# Find total length of dataset
sprintf('The total length of the data is %0.4f days',
as.numeric(max(solar_data$time) - min(solar_data$time)))
## [1] "The total length of the data is 29.9896 days"
# Find the time interval
table(diff(solar_data$time))
##
## 15
## 2879
# Plot the irradiance and power for the first week
# Entire dataset is in 15-minute increments
first_week <- solar_data[1:672, ]
# Power plot over time
ggplot(first_week, aes(x = time, y = iacp)) +
geom_point(size = 2, shape = 19, alpha = 0.8) +
geom_line(aes(y = iacp), color = 'black', lwd = 1.2) +
xlab('') + ylab("Power (kW)") + ggtitle("First Week Solar Panel Power")
# Both sources of irradiance
ggplot(first_week, aes(x = time)) +
geom_line(aes(y = ghir, color = 'ghir'), lwd = 1.2) +
geom_point(aes(y = ghir, color = 'ghir'), alpha = 0.8,
shape = 19, size = 2) +
geom_line(aes(y = ghi_solargis, color = 's_ghir'), lwd = 1.2) +
geom_point(aes(y = ghi_solargis, color = 's_ghir'), alpha = 0.8,
shape = 19, size = 2) +
scale_color_manual(name = 'legend', values = c('red', 'blue')) +
xlab('') + ylab('Irradiance') +ggtitle('Two Sources of Irradiance')
# All 3 lines on the same plot with a secondary axis
ggplot(first_week, aes(x = time)) +
geom_line(aes(y = iacp, color = 'power')) +
geom_line(aes(y = 0.2 * ghir, color = 'ghir'), lwd = 1.1) +
geom_line(aes(y = 0.2 * ghi_solargis, color = 's_ghi'), lwd = 1.1) +
scale_y_continuous(sec.axis = sec_axis(~.*5, name = "Solar Irradiance")) +
ylab('Power (kW)') + xlab('') + ggtitle("Power vs GHI")
Use the ts() function and the stlplus() function from the stlplus package to decompose the power and both irradiance time series.
ANSWER: All of the decompositions show similar patterns. The seasonal (daily) component matches the raw data well. There is no strong long-term trend, as this dataset only covers one month.
# think carefully about the frequency you'll need to define for this data
# what is the seasonal component to this data and how many data points make up a season?
# use s.window = "periodic" for the stlplus function
library(stlplus)
time_series <- lapply(solar_data[c('iacp', 'ghir', 'ghi_solargis')], ts)
solar_decom <- lapply(time_series, stlplus, s.window = "periodic", n.p = 24*4)
#Plot the data with lapply as well
lapply(solar_decom, plot, main = "LOESS Decomposition")
## $iacp
##
## $ghir
##
## $ghi_solargis
Isolate the trends for the 3 time series you just decomposed and fit a linear model to each trend as a function of time. Compare the models to each other. What do the coefficients tell you about the trends?
ANSWER: All three fitted trend lines are essentially flat: the coefficients on time are tiny (on the order of 1e-7 to 1e-5 per second), so the trend component changes very little over the month for power and for both irradiance series.
# x[[c("data", "trend")]] pulls the trend column from each stlplus object
models <- lapply(solar_decom, function(x) {
  lm(x[[c("data", "trend")]] ~ solar_data$time)
})
lapply(models, function(x) {summary(x)$coefficients})
## $iacp
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.377731e+02 1.696082e+02 -2.581084 0.009898121
## solar_data$time 3.791630e-07 1.265918e-07 2.995163 0.002766366
##
## $ghir
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.191447e+04 9.665667e+02 -22.67249 7.573128e-105
## solar_data$time 1.659443e-05 7.214237e-07 23.00233 1.271864e-107
##
## $ghi_solargis
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.322766e+04 1.079400e+03 -21.51905 2.295667e-95
## solar_data$time 1.759174e-05 8.056396e-07 21.83575 6.201618e-98
Solar panel degradation leads to less power output over time for the same irradiance. Based on the linear models you found for the trends of power and irradiance, does it appear that the system is degrading?
ANSWER: Based on the linear models above, the system is not degrading over time, as solar power does not significantly decrease over the course of the month.
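As a rough back-of-the-envelope check (not part of the original analysis, and assuming the time column is POSIXct so that the regression slope is in units per second), the power-trend slope can be converted into a total change over the record:

# Total implied change in the power trend over the ~30-day record
slope_iacp <- coef(models$iacp)[2]  # slope from the iacp trend model above, per second
total_secs <- as.numeric(max(solar_data$time) - min(solar_data$time), units = "secs")
slope_iacp * total_secs  # positive, i.e. no decline in power over this month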
How do the sensor GHI and the SolarGIS GHI compare to power?
ANSWER: The sensor GHI and SolarGIS GHI trends are also relatively flat, and solar panel power increases slightly along with irradiance over the course of the month.