Linear Regression

Regression is a method that allows researchers to summarize the relationship between dependent and independent variables of a natural phenomenon. It is not limited to natural phenomena; it can be applied to any phenomenon that can be described with statistical methods.
In statistics, the independent variables are sometimes called explanatory variables and the dependent variables are called response variables. In linear regression, this relationship is established by fitting a linear equation to the observed data.

 Y = a + bX

In the above equation, X is the explanatory variable and Y is the response variable. The slope of the line is b, and a is the intercept. (Some textbooks write this equation as y = mx + c.)
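To make the roles of a and b concrete, here is a minimal sketch that computes both with the ordinary least-squares formulas and compares them with what lm() returns. The x and y values are made up for illustration and are not data from this tutorial.

```r
# Hypothetical data, chosen only for illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for Y = a + bX
b <- cov(x, y) / var(x)        # slope
a <- mean(y) - b * mean(x)     # intercept

# lm() fits the same line, so its coefficients should match a and b
fit <- lm(y ~ x)
coef(fit)
c(intercept = a, slope = b)
```

The slope is the covariance of X and Y divided by the variance of X, and the intercept is chosen so that the line passes through the point of means.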

Linear Regression Example Step by Step

Linear relationship
Reading data from Microsoft Excel file
Linear Regression Example

Linear regression assumes that there exists a linear relationship between the dependent and independent variables. This means that a straight line can be fitted to describe the relationship between these variables. We will see this relationship in the coming example.
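One quick, informal way to judge whether a straight line is a reasonable description is to draw a scatter plot and look at the Pearson correlation. The sketch below uses R's built-in cars dataset (speed and stopping distance) as a stand-in; it is not part of this tutorial's data.

```r
# Built-in example data: speed and stopping distance of cars
data(cars)

# Pearson correlation: values near +1 or -1 suggest a strong linear relationship
r <- cor(cars$speed, cars$dist)
r

# Scatter plot with the fitted straight line drawn on top
plot(cars$speed, cars$dist, xlab = "speed", ylab = "dist")
abline(lm(dist ~ speed, data = cars))
```

A correlation close to zero, or a scatter plot with a clear curve in it, would be a hint that a straight line is not a good summary of the data.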

Example 1: Reading data from a Microsoft Excel file

Here, you will see how data are imported into R from an Excel sheet. There are other ways to load data; for example, you can use the read.csv() function. This time, however, I show you how to load the data from an Excel sheet with read_excel() from the readxl package. The setwd() function sets the working directory to the specified path. In this particular case, the path is set to E:\data\projects\Red-panda\Tutorial\Regression. Because the backslash is the escape character in R strings, every backslash in a Windows path has to be escaped, whether or not the path contains spaces or special characters. In the example below, this is done by doubling each backslash (i.e. “\\”). Forward slashes (“/”) also work and need no escaping.

    library(readxl)
    setwd("E:\\data\\projects\\Red-panda\\Tutorial\\Regression")
    agevsheight <- read_excel("data\\agevsheight.xls", sheet = "data") # Load the data
    lmHeight <- lm(height ~ age, data = agevsheight) # Fit the linear regression
    summary(lmHeight) # Review the results

## 
## Call:
## lm(formula = height ~ age, data = agevsheight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.059   5.193   5.810   6.617   7.533 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  64.8445    43.3815   1.495    0.166
## age           0.3835     1.8264   0.210    0.838
## 
## Residual standard error: 21.84 on 10 degrees of freedom
## Multiple R-squared:  0.00439,    Adjusted R-squared:  -0.09517 
## F-statistic: 0.0441 on 1 and 10 DF,  p-value: 0.8379

The fourth line of the code creates the linear model in R. Here, age is the independent variable and height is the dependent variable.
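The way R treats backslashes in paths can be checked without touching any files. The following sketch uses the path from this example and shows that a doubled backslash in the code is a single character in the resulting string, and that forward slashes or file.path() avoid the escaping altogether.

```r
# "\\" in R source code is one backslash character in the resulting string
p_back <- "E:\\data\\projects\\Red-panda\\Tutorial\\Regression"
nchar("\\")
cat(p_back, "\n")    # prints the path with single backslashes

# Forward slashes need no escaping
p_fwd <- "E:/data/projects/Red-panda/Tutorial/Regression"

# Replacing every backslash with a forward slash gives the same path
identical(gsub("\\\\", "/", p_back), p_fwd)

# file.path() joins path components with "/"
file.path("data", "agevsheight.xls")
```

Building relative paths with file.path() keeps the code readable and avoids the doubled backslashes.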

Example 2: Reading data from a CSV file

In this example, I show you how to read a CSV file. Here, the data are sales and spend. First of all, you can use the setwd() function to set your working directory. In line 2 of the full example further below, the code reads the “.csv” file and puts the data into the dataframe variable. After reading the file, you can also keep only the columns you need. For example, you can drop an unnecessary column by using the following technique.

    dataframe <- read.csv("data\\sales.csv", fileEncoding = "UTF-8")
    dataframe <- dataframe[-c(3)]

The above code means that the third column of dataframe is dropped and the result is stored in dataframe again.
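Since the contents of sales.csv are not shown here, the sketch below uses a small made-up data frame to illustrate dropping a column. The column names Spend, Sales and Notes and all values are assumptions for illustration only.

```r
# Hypothetical stand-in for the data read from sales.csv
dataframe <- data.frame(
  Spend = c(1000, 4000, 5000),
  Sales = c(9914, 40487, 54324),
  Notes = c("a", "b", "c")
)

# Drop the third column by position
by_position <- dataframe[-c(3)]
names(by_position)

# Drop a column by name, which keeps working if the column order changes
by_name <- dataframe[setdiff(names(dataframe), "Notes")]
names(by_name)
```

Dropping by name is a little longer to write, but it does not silently remove the wrong column when the layout of the file changes.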

It is good practice to specify the encoding of the file, as shown in the example. In a simple regression model, R needs a formula of the form Y~X, where ‘Y’ is the response variable and ‘X’ is the independent (predictor) variable. The lm() function accepts a number of parameters; in this case, two are used. The first one is the formula that describes the model, which here is a linear model, and the second one is the data source. Finally, you can inspect the variable ‘simpleRegression’ to see the output. As you will see, the fit produces a number of results. Let’s look at the summary of the result.

    setwd("E:\\data\\projects\\Red-panda\\Tutorial\\Regression")
    dataframe <- read.csv("data\\sales.csv", fileEncoding = "UTF-8") # Load the data
    simpleRegression <- lm(Sales ~ Spend, data = dataframe) # Fit the linear regression
    summary(simpleRegression) # Review the results

## 
## Call:
## lm(formula = Sales ~ Spend, data = dataframe)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3385  -2097    258   1726   3034 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1383.4714  1255.2404   1.102    0.296    
## Spend         10.6222     0.1625  65.378 1.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2313 on 10 degrees of freedom
## Multiple R-squared:  0.9977, Adjusted R-squared:  0.9974 
## F-statistic:  4274 on 1 and 10 DF,  p-value: 1.707e-14

The summary() function reports a number of statistical results: t-tests for the coefficients, the F-test, R-squared, the degrees of freedom, and p-values.
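The values printed by summary() can also be read programmatically from the object it returns. This sketch uses the built-in cars dataset so that it runs without the tutorial's files; the same component names apply to any lm() fit.

```r
# Built-in example data, used only so that the sketch runs without a file
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)

s$coefficients    # matrix: Estimate, Std. Error, t value, Pr(>|t|)
s$r.squared       # Multiple R-squared
s$adj.r.squared   # Adjusted R-squared
s$fstatistic      # F value with numerator and denominator degrees of freedom
s$df[2]           # residual degrees of freedom
s$sigma           # residual standard error
```

Reading the numbers from the summary object is handy when they are needed later in a script, for example in a report or a comparison of several models.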

Analyzing Simple Regression Output

Residuals: These are the differences between the actual values and the values predicted by the model.
Coefficients: For each variable and for the intercept, a weight is produced. Each coefficient has the following attributes:
- Estimate: This is the weight given to the variable.
- Std. Error: This value tells you about the precision of the estimate.
- t-value: The t-value is the coefficient divided by its standard error. It is used to decide whether the coefficient really adds any effect to the model. If it does not, you can drop the corresponding variable from the model.
R-Squared: It is a statistical measure that explains how closely the data fit the regression line. If it is near ‘1’ (i.e. 100%), the model explains nearly all of the variability of the response data around its mean. However, if this value is near ‘0’ (i.e. 0%), the model explains almost none of it.
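These definitions can be checked by hand. In the Sales ~ Spend output above, 10.6222 divided by 0.1625 is roughly 65.4, which agrees with the reported t value up to the rounding of the displayed figures. The sketch below repeats the same kind of check in code, using the built-in cars dataset as a stand-in because the tutorial's files are not available here.

```r
fit <- lm(dist ~ speed, data = cars)   # built-in data, not the tutorial's files
s <- summary(fit)

# Residuals: observed response minus fitted value
res <- cars$dist - fitted(fit)
all.equal(unname(res), unname(residuals(fit)))

# t-value: estimate divided by its standard error
t_manual <- s$coefficients[, "Estimate"] / s$coefficients[, "Std. Error"]
all.equal(t_manual, s$coefficients[, "t value"])

# R-squared: 1 - (residual sum of squares / total sum of squares)
rss <- sum(res^2)
tss <- sum((cars$dist - mean(cars$dist))^2)
r2_manual <- 1 - rss / tss
all.equal(r2_manual, s$r.squared)
```

Each all.equal() call compares a hand-computed quantity with the value that R reports for the same fit.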
