KITADA

Lesson #16

Transforming (or “re-expressing”) data

Motivation:

In order to use the results of a simple linear regression to make accurate statements about what is happening, certain conditions must exist. One such condition is that there are no influential outliers as they could have an effect on an analysis. Outliers and a strategy for dealing with outliers were addressed in Lesson 15. Another condition is that the relationship between the response and explanatory variable must be linear. In this lesson, we will address what to do if a non-linear relationship exists between the response and explanatory variables. We’ll address some other conditions in Lesson 17.

What you need to know from this lesson:

After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

The Lesson

The Tortilla Chip Example:

No tortilla chip lover likes soggy chips, so it is important to find characteristics of the production process that produce chips with an appealing texture. Researchers performed a study where they fried chips for a certain amount of time and recorded the average moisture content of the chips.

Below are the data:

### TORTILLA CHIP EXAMPLE ###
frying_time<-c(5, 10, 15, 20, 25, 30, 
               35, 40, 45, 50, 55, 60)
moisture<-c(16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 
            2.4, 2.3, 1.9, 1.7, 1.4, 1.3)

1. Which variable is the response variable?

Response = Moisture Content

2. The scatterplot of moisture content versus frying time and the residual plot are shown below. Describe the relationship between the two variables.

### SCATTERPLOT ###
plot(frying_time, moisture, 
     main="Scatterplot of Frying Time vs Moisture Content", 
     xlab="Frying Time (seconds)", 
     ylab="Moisture Content (%)", 
     pch=16)

mod<-lm(moisture~frying_time)

abline(coefficients(mod), 
       lwd=2, lty=2, 
       col="red")

[Figure: scatterplot of moisture content versus frying time, with the least-squares line]

### RESIDUAL PLOT ###
mod<-lm(moisture~frying_time)

plot(frying_time, resid(mod), 
     main="Residual Plot", 
     xlab="Frying Time (seconds)", 
     ylab="Residual", 
     pch=16)

abline(h=0, 
       lwd=2, lty=2, 
       col="blue")

[Figure: residual plot for the linear fit of moisture content on frying time]

3. One of many ways to “straighten” the data (make the relationship between the two variables linear) is to do a transformation of the data (or “re-express” the data). A transformation involves performing some operation on all of the data for one (or perhaps both) of the variables. There are many different types of transformations (square root, reciprocal, logarithmic, just to name a few), each one used in specific situations.

A common transformation is the “log” transformation.

In this example, we will “take the log” of moisture content. (Logarithmic functions are a class of functions all to themselves. Logarithms were introduced by Scottish mathematician John Napier (1550-1617) to simplify calculations; the natural logarithm was later connected to the antiderivative of f(x) = 1/x. By definition, the natural logarithmic function is: \( \ln(x)=\int_1^x \frac{1}{t}\,dt, \; x>0 \).)

Before doing that in this example, let’s practice with a simpler example:

Simpler Example:

Definition of exponential growth: A variable y grows exponentially if, for every unit increase in x, y is multiplied by the same constant factor. Here’s an example:

### LOG EXAMPLE ###
x<-c(1, 2, 3, 4, 5)
y<-c(1, 10, 100, 1000, 10000)

a. As x increases by 1 unit, y increases by a factor of what?

y increases by a factor of 10.

b. In the space below, make a rough sketch of the scatter plot:

### PLOT ###
plot(x, y, pch=16, 
     main="Original Scatterplot")

[Figure: scatterplot of y versus x]

c. A property of variables that grow exponentially: if a variable grows exponentially, its logarithm will grow linearly!
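
To see why, suppose y is multiplied by a constant factor b for every unit increase in x, so that \( y = a \cdot b^{x} \) for some starting constant a (the symbols a and b are introduced here only for illustration). Taking the log of both sides gives

\( \log_{10}(y) = \log_{10}(a) + x \cdot \log_{10}(b) \)

which is the equation of a straight line in x with slope \( \log_{10}(b) \) and intercept \( \log_{10}(a) \). The same argument works for a logarithm to any base.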

Logarithms have different bases. Two common bases are 10 and e (e is approximately 2.718 and is the natural exponential growth factor). If y increases by a factor of 10 for every unit increase in x, then the “log” of y to base 10 (written \( \log_{10}(y) \)) will increase by 1 for every unit increase in x. Likewise, if y increases by a factor of e (about 2.718) for every unit increase in x, then \( \log_e(y) \) (or ln(y)) will increase by 1 for every unit increase in x. Let’s start with base 10.

Find log10(y) for each value of y:

### LOG BASE 10 ###
log_y<-log10(y)
log_y
## [1] 0 1 2 3 4

d. Draw the scatter plot of log10(y) versus x. Describe this relationship.

### LOG TRANSFORMED PLOT ###
plot(x, log_y, pch=16, 
     main="Log Transformed Scatterplot")

[Figure: scatterplot of log10(y) versus x]

Now it looks positive and linear.

e. Connect the points in your scatter plot above with a straight line. What is the slope of this line? What is the y-intercept? Write the equation of this line. (Remember, it’s in terms of log10(y))

### FIT A LINE ###
log_mod<-lm(log_y~x)

plot(x, log_y, pch=16, 
     main="Log Transformed Scatterplot")
abline(coefficients(log_mod), 
       lwd=2, lty=2, 
       col="red")

[Figure: scatterplot of log10(y) versus x, with the fitted line]

summary(log_mod)
## Warning in summary.lm(log_mod): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = log_y ~ x)
## 
## Residuals:
##         1         2         3         4         5 
##  1.73e-16  1.99e-16 -5.27e-16 -2.35e-16  3.90e-16 
## 
## Coefficients:
##              Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -1.00e+00   4.51e-16 -2.217e+15   <2e-16 ***
## x            1.00e+00   1.36e-16  7.354e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.3e-16 on 3 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.409e+31 on 1 and 3 DF,  p-value: < 2.2e-16

Equation of line: \( log_{10}(y)=-1+x \)

f. A property of logarithms is that \( 10^{log_{10}(y)} = y \). This can be very useful in predicting values of y for certain values of x after a log transformation has been performed.

Using the equation found in part (e) and this property, predict y when x=2.4. (This is called a back transformation – transforming back to the original scale for easier interpretation.)

### PREDICT FOR X=2.4 ###
log_pred2.4<-as.numeric(coefficients(log_mod)[1]+
                          coefficients(log_mod)[2]*2.4)

### DONT FORGET TO BACKTRANSFORM
10^log_pred2.4
## [1] 25.11886
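
The same prediction can also be obtained with R's `predict()` function instead of combining the coefficients by hand. The following is a minimal sketch that repeats the data from this example so it runs on its own; the names `new_x`, `log_pred`, and `pred_y` are introduced here only for illustration.

```r
### SAME PREDICTION USING predict() ###
x<-c(1, 2, 3, 4, 5)
y<-c(1, 10, 100, 1000, 10000)
log_y<-log10(y)
log_mod<-lm(log_y~x)

### PREDICT ON THE LOG10 SCALE ###
### (THE COLUMN NAME MUST MATCH THE EXPLANATORY VARIABLE, x) ###
new_x<-data.frame(x=2.4)
log_pred<-as.numeric(predict(log_mod, newdata=new_x))

### BACKTRANSFORM TO THE ORIGINAL SCALE ###
pred_y<-10^log_pred
pred_y
```

This should agree with the hand calculation above, since `predict()` evaluates the same fitted line at x = 2.4.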

g. Even though it made sense to use base 10 in this example, taking the log of y to base e (\( log_e(y) \) or ln(y)) could have been done. To see this, take the ln(y) for each value of y: (Make sure you know how to use the LOG and LN keys correctly on your calculator!)

### NATURAL LOG (IN R, log() IS THE NATURAL LOG) ###
ln_y<-log(y)
ln_y
## [1] 0.000000 2.302585 4.605170 6.907755 9.210340

h. Draw the scatter plot and connect the points with a straight line.

### NATURAL LOG ###
### LOG TRANSFORMED PLOT ###
ln_mod<-lm(ln_y~x)

plot(x, ln_y, pch=16, 
     main="Ln Transformed Scatterplot")
abline(coefficients(ln_mod), 
       lwd=2, lty=2, 
       col="red")

[Figure: scatterplot of ln(y) versus x, with the fitted line]

i. Write the equation of the line drawn in part (h).

### MODEL SUMMARY ###
summary(ln_mod)
## Warning in summary.lm(ln_mod): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = ln_y ~ x)
## 
## Residuals:
##          1          2          3          4          5 
##  3.228e-16  2.504e-16 -5.927e-16 -8.572e-16  8.766e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -2.303e+00  8.609e-16 -2.675e+15   <2e-16 ***
## x            2.303e+00  2.596e-16  8.870e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.209e-16 on 3 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 7.868e+31 on 1 and 3 DF,  p-value: < 2.2e-16

Equation of line: \( \ln(y)=-2.303+2.303x \)

j. Predict y when x = 2.4. Is this the same answer as in part (f)?

### PREDICT FOR X=2.4 ###
ln_pred2.4<-as.numeric(coefficients(ln_mod)[1]+
                         coefficients(ln_mod)[2]*2.4)

### DONT FORGET TO BACKTRANSFORM ###
exp(ln_pred2.4)
## [1] 25.11886

Yes, they are the same.
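
The two answers agree because logarithms to different bases differ only by a constant multiple: ln(y) = ln(10) × log10(y), and ln(10) is about 2.303, which is the slope (and the size of the intercept) in the ln model. The short check below is a sketch that repeats the data so it runs on its own; the names `b10` and `be` are introduced here only for illustration.

```r
### CHANGE OF BASE CHECK ###
x<-c(1, 2, 3, 4, 5)
y<-c(1, 10, 100, 1000, 10000)

log_mod<-lm(log10(y)~x)
ln_mod<-lm(log(y)~x)

### LN COEFFICIENTS ARE THE LOG10 COEFFICIENTS TIMES ln(10) ###
log(10)
coefficients(log_mod)*log(10)
coefficients(ln_mod)

### SO BOTH BACKTRANSFORMED PREDICTIONS AT x=2.4 AGREE ###
b10<-as.numeric(coefficients(log_mod))
be<-as.numeric(coefficients(ln_mod))
all.equal(10^(b10[1]+b10[2]*2.4), 
          exp(be[1]+be[2]*2.4))
```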

4. Back to the tortilla chip example. In exponential growth or exponential decay situations, the log of the response variable should be taken. In this case, the natural log (\( \log_e \), or ln) of moisture content was taken and is given below.

### LN TORTILLA ###
ln_moisture<-log(moisture)

cbind(frying_time, 
      moisture, 
      ln_moisture)
##       frying_time moisture ln_moisture
##  [1,]           5     16.3   2.7911651
##  [2,]          10      9.7   2.2721259
##  [3,]          15      8.1   2.0918641
##  [4,]          20      4.2   1.4350845
##  [5,]          25      3.4   1.2237754
##  [6,]          30      2.9   1.0647107
##  [7,]          35      2.4   0.8754687
##  [8,]          40      2.3   0.8329091
##  [9,]          45      1.9   0.6418539
## [10,]          50      1.7   0.5306283
## [11,]          55      1.4   0.3364722
## [12,]          60      1.3   0.2623643
### TRANSFORMED SCATTERPLOT ###
plot(frying_time, ln_moisture, 
     main="Scatterplot of Frying Time vs Ln(Moisture Content)", 
     xlab="Frying Time (seconds)", 
     ylab="Ln(Moisture Content) (%)", 
     pch=16)

lnmod<-lm(ln_moisture~frying_time)

abline(coefficients(lnmod), 
       lwd=2, lty=2, 
       col="red")

[Figure: scatterplot of ln(moisture content) versus frying time, with the least-squares line]

### TRANSFORMED RESIDUAL PLOT ###
### RESIDUAL PLOT ###
plot(frying_time, resid(lnmod), 
     main="LN Residual Plot", 
     xlab="Frying Time (seconds)", 
     ylab="Residual", 
     pch=16)

abline(h=0, 
       lwd=2, lty=2, 
       col="blue")

[Figure: residual plot for the fit of ln(moisture content) on frying time]

5. A strategy:

The above is a good example of a common situation: not all non-linear relationships are exact exponential growth or decay relationships.

Notice that there is still some curvature in the residual plot. This indicates that our log transformation of moisture content did not completely straighten the relationship. Here is a strategy to use when deciding which variable to transform:

Note: if none of the above transformations looks clearly better than the others, you may just want to stick with analyzing the data on the original scale and comment on this in your analysis.

6. A transformation of frying time was attempted but did not produce a “better looking” scatterplot and residual plot. Therefore, those plots are not shown. Next, a transformation of both variables was attempted.

Below are the scatterplot and the residual plot after performing a natural log transformation on both moisture content and frying time.

Does it appear that a linear relationship exists between ln(moisture) and ln(frying time)? Explain.

### LN TRANSFORM BOTH ###
ln_frying_time<-log(frying_time)

### TRANSFORMED SCATTERPLOT ###
plot(ln_frying_time, ln_moisture, 
     main="Scatterplot of Ln(Frying Time) vs Ln(Moisture Content)", 
     xlab="Ln(Frying Time) (seconds)", 
     ylab="Ln(Moisture Content) (%)", 
     pch=16)

lnln_mod<-lm(ln_moisture~ln_frying_time)

abline(coefficients(lnln_mod), 
       lwd=2, lty=2, 
       col="red")

[Figure: scatterplot of ln(moisture content) versus ln(frying time), with the least-squares line]

### TRANSFORMED RESIDUAL PLOT ###
### RESIDUAL PLOT ###
plot(ln_frying_time, resid(lnln_mod), 
     main="Both LN Residual Plot", 
     xlab="Ln(Frying Time) (seconds)", 
     ylab="Residual", 
     pch=16)

abline(h=0, 
       lwd=2, lty=2, 
       col="blue")

[Figure: residual plot for the fit of ln(moisture content) on ln(frying time)]
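
To judge how much curvature remains, it can help to look at the residual plots of the fits side by side. The following is a minimal sketch, not part of the original lesson output: it repeats the data so it runs on its own, refits the three models used so far, and draws their residual plots in one row. The name `old_par` is introduced here only for illustration.

```r
### COMPARE RESIDUAL PLOTS OF THE THREE FITS ###
frying_time<-c(5, 10, 15, 20, 25, 30, 
               35, 40, 45, 50, 55, 60)
moisture<-c(16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 
            2.4, 2.3, 1.9, 1.7, 1.4, 1.3)
ln_moisture<-log(moisture)
ln_frying_time<-log(frying_time)

mod<-lm(moisture~frying_time)
lnmod<-lm(ln_moisture~frying_time)
lnln_mod<-lm(ln_moisture~ln_frying_time)

### ONE ROW, THREE PANELS (SAVE OLD SETTINGS TO RESTORE LATER) ###
old_par<-par(mfrow=c(1, 3))

plot(frying_time, resid(mod), pch=16, 
     main="Original Scale", 
     xlab="Frying Time (seconds)", ylab="Residual")
abline(h=0, lwd=2, lty=2, col="blue")

plot(frying_time, resid(lnmod), pch=16, 
     main="Ln(Moisture) Only", 
     xlab="Frying Time (seconds)", ylab="Residual")
abline(h=0, lwd=2, lty=2, col="blue")

plot(ln_frying_time, resid(lnln_mod), pch=16, 
     main="Ln of Both", 
     xlab="Ln(Frying Time)", ylab="Residual")
abline(h=0, lwd=2, lty=2, col="blue")

### RESTORE THE PREVIOUS PLOT LAYOUT ###
par(old_par)
```

Note that the vertical scales differ between the first panel and the other two, because the first model's residuals are in moisture units and the others are in ln(moisture) units. Compare the patterns, not the sizes.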

7. Below is the R output after performing a simple linear regression of ln(moisture content) versus ln(frying time). Answer the questions that follow:

### BOTH LN SUMMARY ###
summary(lnln_mod)
## 
## Call:
## lm(formula = ln_moisture ~ ln_frying_time)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16770 -0.05955 -0.01141  0.01772  0.29541 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.66177    0.15825   29.46 4.74e-11 ***
## ln_frying_time -1.05808    0.04718  -22.43 6.99e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1183 on 10 degrees of freedom
## Multiple R-squared:  0.9805, Adjusted R-squared:  0.9786 
## F-statistic: 502.9 on 1 and 10 DF,  p-value: 6.988e-10

a. Write the equation of the least-squares regression line.

\( \widehat{\ln(Moisture)} = 4.66177 - 1.05808 \cdot \ln(FryingTime) \)
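
Exponentiating both sides shows what this line means on the original scale. This is a standard algebraic step rather than part of the R output, and the leading constant is rounded:

\( \widehat{Moisture} = e^{4.66177 - 1.05808 \cdot \ln(FryingTime)} = e^{4.66177} \cdot FryingTime^{-1.05808} \approx 105.8 \cdot FryingTime^{-1.05808} \)

So a straight line on the ln-ln scale corresponds to a power relationship between moisture content and frying time on the original scale.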

b. What percent of the variability in ln(moisture content) is explained by the regression model (i.e., by ln(frying time))?

### R-SQUARED VALUE ###
summary(lnln_mod)$r.squared
## [1] 0.9805042

c. Predict the moisture content for a chip fried for 18 seconds.

### PREDICT FOR 18 SECONDS ###

## FIRST: LOG THE 18
log(18)
## [1] 2.890372
## SECOND: PREDICT 
as.numeric(coefficients(lnln_mod)[1]+
  coefficients(lnln_mod)[2]*log(18))
## [1] 1.603542
## THIRD: BACKTRANSFORM

exp(as.numeric(coefficients(lnln_mod)[1]+
      coefficients(lnln_mod)[2]*log(18)))
## [1] 4.970609
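
The three steps above can also be carried out with R's `predict()` function. This is a minimal sketch that repeats the data so it runs on its own; the names `new_chip` and `ln_pred18` are introduced here only for illustration.

```r
### PREDICT FOR 18 SECONDS USING predict() ###
frying_time<-c(5, 10, 15, 20, 25, 30, 
               35, 40, 45, 50, 55, 60)
moisture<-c(16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 
            2.4, 2.3, 1.9, 1.7, 1.4, 1.3)
ln_moisture<-log(moisture)
ln_frying_time<-log(frying_time)
lnln_mod<-lm(ln_moisture~ln_frying_time)

### FIRST: LOG THE 18 ###
### (THE COLUMN NAME MUST MATCH THE MODEL'S EXPLANATORY VARIABLE) ###
new_chip<-data.frame(ln_frying_time=log(18))

### SECOND: PREDICT ON THE LN SCALE ###
ln_pred18<-as.numeric(predict(lnln_mod, newdata=new_chip))
ln_pred18

### THIRD: BACKTRANSFORM ###
exp(ln_pred18)
```

The point of the design is that the model was fit to ln(frying time), so the new value has to be logged before it is passed to `predict()`, and the result has to be exponentiated to get back to moisture content.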

d. Calculate the residual for the chip fried for 20 seconds.

### OBSERVED ### 
obs<-4.2

### EXPECTED ###
ln_exp<-as.numeric(coefficients(lnln_mod)[1]+
                     coefficients(lnln_mod)[2]*log(20))
ln_exp
## [1] 1.492063
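### BACKTRANSFORM (NAMED expected SO IT DOES NOT MASK R'S exp() FUNCTION) ###
expected<-exp(ln_exp)
expected
## [1] 4.446258
### RESIDUAL (ORIGINAL SCALE) ###
obs-expected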
## [1] -0.2462585
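
The residual above is computed on the original moisture scale. The residuals that R stores for this model, and that appear in its residual plot, are on the ln scale. The following is a minimal sketch of that version, with the data repeated so the block runs on its own; the names `ln_obs` and `ln_resid20` are introduced here only for illustration.

```r
### RESIDUAL ON THE LN SCALE ###
frying_time<-c(5, 10, 15, 20, 25, 30, 
               35, 40, 45, 50, 55, 60)
moisture<-c(16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 
            2.4, 2.3, 1.9, 1.7, 1.4, 1.3)
ln_moisture<-log(moisture)
ln_frying_time<-log(frying_time)
lnln_mod<-lm(ln_moisture~ln_frying_time)

### OBSERVED AND EXPECTED, BOTH ON THE LN SCALE ###
ln_obs<-log(4.2)
ln_exp<-as.numeric(coefficients(lnln_mod)[1]+
                     coefficients(lnln_mod)[2]*log(20))

ln_resid20<-ln_obs-ln_exp
ln_resid20

### 20 SECONDS IS THE 4TH OBSERVATION, SO THIS SHOULD MATCH ###
as.numeric(resid(lnln_mod)[4])
```

Whichever scale is used, it is worth stating it alongside the residual, since the two numbers are not interchangeable.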

8. Note: another indication that a log transformation might be considered on a variable is if the ratio of the largest value to the smallest value is > 20 (or so).
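
As a quick illustration, the ratio can be computed directly in R. This sketch applies the check to the tortilla chip data, repeated here so the block runs on its own; the names `ratio_moisture` and `ratio_time` are introduced only for illustration.

```r
### RATIO OF LARGEST TO SMALLEST VALUE ###
frying_time<-c(5, 10, 15, 20, 25, 30, 
               35, 40, 45, 50, 55, 60)
moisture<-c(16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 
            2.4, 2.3, 1.9, 1.7, 1.4, 1.3)

ratio_moisture<-max(moisture)/min(moisture)
ratio_time<-max(frying_time)/min(frying_time)

ratio_moisture
ratio_time
```

For these data the ratios are about 12.5 and 12, both below the rough cutoff of 20, which is a reminder that the ratio is only one indication and the scatterplot and residual plot should still be checked.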