Assoc. Prof. Dr. Bishnu Prasad Gautam
2021/1/30
Regression is a method that allows researchers to summarize the relationship (often interpreted as cause and effect) between dependent variables and independent variables for any kind of natural phenomenon. It is not limited to natural phenomena; it can be applied to any other phenomenon that can be described with statistical methods.
In statistics, the independent variables are sometimes called explanatory variables and the dependent variables are called response variables. In linear regression, this relationship is established by fitting a linear equation to observed data.
Y = a + bX
In the above equation, X is the explanatory variable and Y is the response variable. The slope of the line is b, and a is the intercept. (Some textbooks write this equation as Y = mX + c.)
Linear regression assumes that there exists a linear relationship between the dependent and independent variables. This means you can fit a straight line to show the relationship between these variables. We will see this relationship in the coming example.
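To make the roles of a and b concrete, the following minimal sketch computes the least-squares intercept and slope by hand and compares them with what lm() returns. The x and y values are invented for illustration only and are not the tutorial's data.

```r
# Hypothetical data, invented only for this illustration
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Least-squares estimates for Y = a + bX
b <- cov(x, y) / var(x)        # slope
a <- mean(y) - b * mean(x)     # intercept

# The same estimates obtained from lm()
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

Both print statements should show the same pair of numbers, because lm() uses the same least-squares criterion for a simple linear model.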
Here, you will see how data are imported into R from an Excel sheet. There are other ways to load the data; for example, you can also use the read.csv() function. This time, however, I show you how to load the data from an Excel sheet. The setwd() function sets the working directory to the specified path. In this particular case, the path is set to E:\data\projects\Red-panda\Tutorial\Regression. In an R string the backslash is the escape character, so every backslash in a Windows path must itself be escaped. In the example below, this is done by writing each backslash twice (i.e., "\\"). Forward slashes ("/") also work as path separators in R on Windows.
library(readxl)
setwd("E:\\data\\projects\\Red-panda\\Tutorial\\Regression")
agevsheight <- read_excel("data\\agevsheight.xls", sheet = "data") #Upload the data
lmHeight <- lm(height~age, data = agevsheight) #Create the linear regression
summary(lmHeight) #Review the results
##
## Call:
## lm(formula = height ~ age, data = agevsheight)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -66.059   5.193   5.810   6.617   7.533
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  64.8445    43.3815   1.495    0.166
## age           0.3835     1.8264   0.210    0.838
##
## Residual standard error: 21.84 on 10 degrees of freedom
## Multiple R-squared: 0.00439, Adjusted R-squared: -0.09517
## F-statistic: 0.0441 on 1 and 10 DF, p-value: 0.8379
The fourth line creates the linear model in R. Here, age is the independent variable and height is the dependent variable.
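Once a model has been fitted, its coefficients and predictions can be read directly from the model object. The sketch below assumes a small stand-in data frame (agevsheight_demo, with invented values) because the original Excel file is not reproduced here; coef() and predict() are standard R functions.

```r
# Hypothetical stand-in for the agevsheight data (values invented)
agevsheight_demo <- data.frame(
  age    = c(18, 19, 20, 21, 22, 23),
  height = c(76.1, 77.0, 78.1, 78.2, 78.8, 79.7)
)

fit <- lm(height ~ age, data = agevsheight_demo)

# Intercept (a) and slope (b) of the fitted line
print(coef(fit))

# Predicted height for new ages; newdata must use the
# same column name as the predictor in the formula
new_ages <- data.frame(age = c(24, 25))
pred <- predict(fit, newdata = new_ages)
print(pred)
```

Each prediction is simply a + b * age, evaluated with the estimated coefficients.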
In this example, I show you how to read a csv file. Here the data are sales and spend. First of all, you can use the setwd() function to set your working directory. In line 2, the code reads the ".csv" file and puts the data into the variable dataframe. After reading the file, you can also keep only the necessary columns. For example, you can drop an unnecessary column by using the following technique.
dataframe <- read.csv("data\\sales.csv", fileEncoding = "UTF-8")
dataframe <- dataframe[-c(3)]
The above code means that the third column of dataframe is dropped and the result is assigned back to dataframe.
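As a small illustration of the column-dropping step, the sketch below compares dropping by position with keeping columns by name. The data frame sales_demo and its third column, Month, are assumptions made up for this example; only Sales and Spend are named in the tutorial.

```r
# Hypothetical data frame; values and the Month column are invented
sales_demo <- data.frame(
  Spend = c(1000, 4000, 5000),
  Sales = c(9914, 40487, 54324),
  Month = c(1, 2, 3)
)

# Drop the third column by position, as in the text
by_position <- sales_demo[-c(3)]

# Keep columns by name, which does not depend on column order
by_name <- sales_demo[, c("Spend", "Sales")]

print(names(by_position))
print(names(by_name))
```

Selecting by name is often the safer choice, because the code keeps working if the column order of the file changes.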
It is good practice to specify the encoding of the file as well, as shown in the example. In a simple regression model, R needs a formula of the form Y~X, where 'Y' is the response variable and 'X' is the independent (predictor) variable. The lm() function accepts a number of parameters, but only two are used here. The first is a formula that describes the model, which in this example is a linear model. The second is the data source. Finally, you can pass the fitted model stored in 'simpleRegression' to summary(), which produces a number of results. Let's see the summary of the result.
setwd("E:\\data\\projects\\Red-panda\\Tutorial\\Regression")
dataframe <- read.csv("data\\sales.csv", fileEncoding = "UTF-8") #Upload the data
simpleRegression <- lm(Sales~Spend, data = dataframe) #Create the linear regression
summary(simpleRegression) #Review the results
##
## Call:
## lm(formula = Sales ~ Spend, data = dataframe)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -3385  -2097    258   1726   3034
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1383.4714  1255.2404   1.102    0.296
## Spend         10.6222     0.1625  65.378 1.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2313 on 10 degrees of freedom
## Multiple R-squared: 0.9977, Adjusted R-squared: 0.9974
## F-statistic: 4274 on 1 and 10 DF, p-value: 1.707e-14
The summary() function generates a number of statistical results. It reports the t-tests for the coefficients, the F-test, R-squared, the degrees of freedom, and the p-values.
##
## Call:
## lm(formula = height ~ age, data = agevsheight)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -66.059   5.193   5.810   6.617   7.533
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  64.8445    43.3815   1.495    0.166
## age           0.3835     1.8264   0.210    0.838
##
## Residual standard error: 21.84 on 10 degrees of freedom
## Multiple R-squared: 0.00439, Adjusted R-squared: -0.09517
## F-statistic: 0.0441 on 1 and 10 DF, p-value: 0.8379
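The quantities that summary() prints can also be read programmatically from the object it returns. The following is a minimal sketch using an invented data frame (demo); the component names such as r.squared, fstatistic and sigma belong to the object returned by summary() for an lm model.

```r
# Hypothetical data, invented for this sketch
demo <- data.frame(
  Spend = c(1000, 2000, 3000, 4000, 5000, 6000),
  Sales = c(11500, 22100, 33900, 43800, 55200, 64900)
)
fit <- lm(Sales ~ Spend, data = demo)
s <- summary(fit)

print(s$coefficients)   # estimates, standard errors, t values, p-values
print(s$r.squared)      # multiple R-squared
print(s$adj.r.squared)  # adjusted R-squared
print(s$fstatistic)     # F value with numerator and denominator df
print(s$sigma)          # residual standard error
print(fit$df.residual)  # residual degrees of freedom

# p-value of the overall F-test, which summary() prints but does not store
f <- s$fstatistic
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_value))
```

In a simple regression with one predictor, the p-value of the F-test matches the p-value of the t-test on the slope, which is why the two values agree in the outputs shown above.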