P Value

P values tell you whether your hypothesis test results are statistically significant, and they appear throughout statistics. A P value is the probability of observing a sample statistic at least as extreme as yours, assuming that the null hypothesis is true. The null hypothesis is usually a hypothesis of “no difference”, e.g. no difference between blood pressures in group A and group B. Define a clear null hypothesis for each study question before the start of your study.

Specifically, if the null hypothesis is correct, what is the probability of obtaining an effect at least as large as the one in your sample?

If your P value is small enough, you can conclude that your sample is so incompatible with the null hypothesis that you can reject the null for the entire population. P values are an integral part of inferential statistics because they help you use your sample to draw conclusions about a population.

The term significance level (alpha) refers to a pre-chosen probability, while the term “P value” refers to a probability that you calculate after a given study.

If your P value is less than the chosen significance level, you reject the null hypothesis, i.e. you accept that your sample gives reasonable evidence to support the alternative hypothesis. This does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.

The choice of significance level at which you reject H0 is arbitrary. Conventionally, the 5%, 1% and 0.1% levels (P < 0.05, 0.01 and 0.001) have been used; the 5% level, for example, means accepting less than a 1 in 20 chance of rejecting a null hypothesis that is actually true. These numbers can give a false sense of security.
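
To make this concrete, here is a minimal sketch in R using simulated blood-pressure data (the group means and sizes are made up purely for illustration):

# Simulated blood pressures for two groups (illustrative values only)
set.seed(42)
group_A <- rnorm(30, mean = 120, sd = 10)
group_B <- rnorm(30, mean = 125, sd = 10)

# H0: no difference in mean blood pressure between the groups
result <- t.test(group_A, group_B)
result$p.value          # probability of an effect at least this extreme if H0 is true

alpha <- 0.05           # pre-chosen significance level
result$p.value < alpha  # TRUE would mean rejecting H0 at the 5% level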

In an ideal world, we would be able to define a “perfectly” random sample, the most appropriate test and one definitive conclusion. We simply cannot. What we can do is try to optimise all stages of our research to minimise sources of uncertainty. When presenting P values, some groups find it helpful to use the asterisk rating system as well as quoting the P value: one asterisk for P < 0.05, two for P < 0.01 and three for P < 0.001.
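
This is the same convention R prints as the “Signif. codes” line in the regression summaries later in this post; base R’s symnum() reproduces it, as a quick sketch:

# Map P values to the conventional asterisk ratings (illustrative values)
p_values <- c(0.0004, 0.008, 0.03, 0.07, 0.4)
symnum(p_values, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
       symbols = c("***", "**", "*", ".", " "))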

R squared

R-squared is a goodness-of-fit measure for linear regression models. In statistics, the coefficient of determination, denoted R2 or r2 and pronounced “R squared”, is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale.

Definition Source: Wikipedia
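
To see the definition in action, R-squared can be computed by hand as one minus the ratio of residual variation to total variation; a minimal sketch using R’s built-in mtcars data:

# Fit a simple linear regression
fit <- lm(mpg ~ wt, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # variation left unexplained
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in mpg
1 - ss_res / ss_tot                                # identical to summary(fit)$r.squared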

Caveats

R2 does not indicate whether:

  • the independent variables are a cause of the changes in the dependent variable;
  • omitted-variable bias exists;
  • the correct regression was used;
  • the most appropriate set of independent variables has been chosen;
  • there is collinearity present in the data on the explanatory variables;
  • the model might be improved by using transformed versions of the existing set of independent variables;
  • there are enough data points to make a solid conclusion.
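
A classic illustration of these caveats is Anscombe’s quartet, four small datasets built into R that look completely different when plotted yet give almost identical R-squared values:

# R-squared for each of the four Anscombe regressions
sapply(1:4, function(i) {
  fit <- lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]])
  summary(fit)$r.squared
})
# all four are roughly 0.67, even though only the first is a genuinely linear relationship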

Adjusted R squared

The use of an adjusted R2 is an attempt to take account of the phenomenon of the R2 automatically and spuriously increasing when extra explanatory variables are added to the model. It is defined as

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$

where p is the total number of explanatory variables in the model (not including the constant term), and n is the sample size.
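
As a sanity check, the formula can be verified against R’s own output on any fitted model; a minimal sketch using the built-in mtcars data:

fit <- lm(mpg ~ wt + hp, data = mtcars)
r2 <- summary(fit)$r.squared
n  <- nrow(mtcars)   # sample size
p  <- 2              # explanatory variables: wt and hp
1 - (1 - r2) * (n - 1) / (n - p - 1)   # equals summary(fit)$adj.r.squared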

Now, let’s look at the practical application of both concepts. Here, we will fit a linear regression model to 50_Startups.csv, where R.D.Spend, Administration, Marketing.Spend and State are the independent variables and Profit is the dependent variable. The dataset can be downloaded from the following link:

Dataset: 50_Startups.csv

Importing and preprocessing the dataset

# Importing the dataset
dataset <- read.csv('50_Startups.csv')

library(hwriter)

cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
R.D.Spend Administration Marketing.Spend State Profit
165349.2 136897.80 471784.1 New York 192261.8
162597.7 151377.59 443898.5 California 191792.1
153441.5 101145.55 407934.5 Florida 191050.4
144372.4 118671.85 383199.6 New York 182902.0
142107.3 91391.77 366168.4 Florida 166187.9
131876.9 99814.71 362861.4 New York 156991.1

Since our State variable is a categorical variable, we need to encode its string values (‘California’, ‘Florida’, ‘New York’) as factors.

# Encoding categorical data
dataset$State <- factor(dataset$State,
                       levels = c('California', 'Florida', 'New York'),
                       labels = c(1, 2, 3))

cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
R.D.Spend Administration Marketing.Spend State Profit
165349.2 136897.80 471784.1 3 192261.8
162597.7 151377.59 443898.5 1 191792.1
153441.5 101145.55 407934.5 2 191050.4
144372.4 118671.85 383199.6 3 182902.0
142107.3 91391.77 366168.4 2 166187.9
131876.9 99814.71 362861.4 3 156991.1
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
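
With a 0.8 split ratio on the 50 rows, 40 observations go to the training set and 10 to the test set; this matches the 34 residual degrees of freedom reported below (40 observations minus 6 estimated coefficients). A quick check:

nrow(training_set)   # 40 rows used for fitting
nrow(test_set)       # 10 rows held out for evaluation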

Fitting Multiple Linear Regression to the Training set

regressor <- lm(formula = Profit ~ .,
               data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ ., data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32380  -4586   -190   4940  20038 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.359e+04  8.111e+03   5.374 5.61e-06 ***
## R.D.Spend        8.146e-01  5.828e-02  13.977 1.18e-15 ***
## Administration   1.858e-02  6.098e-02   0.305    0.762    
## Marketing.Spend  2.873e-02  2.083e-02   1.379    0.177    
## State2          -2.938e+02  3.871e+03  -0.076    0.940    
## State3          -1.878e+03  4.104e+03  -0.458    0.650    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9938 on 34 degrees of freedom
## Multiple R-squared:  0.9496, Adjusted R-squared:  0.9421 
## F-statistic:   128 on 5 and 34 DF,  p-value: < 2.2e-16

NOTE: We now have the P values for the different independent variables. If we take our alpha to be 0.05, the P values suggest that not all of the independent variables are significant in our regression model. (State appears as two dummy variables, State2 and State3, because it is a factor; California, level 1, is the baseline.) So we can eliminate the non-significant variables using backward elimination.
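
Below we prune the variables by hand; as an aside, base R’s step() automates backward elimination, although it ranks models by AIC rather than by individual P values:

# AIC-based backward elimination (an alternative to manual P-value pruning)
step(regressor, direction = "backward")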

Removing the less significant variables

First, we remove the State variable, as the P values for both of its dummy levels are well above our alpha value.

regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
##     data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31950  -4596   -166   5660  18688 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.296e+04  7.785e+03   5.518 3.07e-06 ***
## R.D.Spend       8.064e-01  5.442e-02  14.820  < 2e-16 ***
## Administration  1.946e-02  5.931e-02   0.328    0.745    
## Marketing.Spend 3.096e-02  1.972e-02   1.570    0.125    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9692 on 36 degrees of freedom
## Multiple R-squared:  0.9492, Adjusted R-squared:  0.945 
## F-statistic: 224.3 on 3 and 36 DF,  p-value: < 2.2e-16

We observe that the adjusted R squared value has improved (from 0.9421 to 0.945). Now we remove the Administration variable as well, since it has a P value of 0.745.

regressor <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31922  -4693   -223   5904  18785 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.527e+04  3.243e+03  13.961 2.54e-16 ***
## R.D.Spend       8.118e-01  5.124e-02  15.844  < 2e-16 ***
## Marketing.Spend 2.946e-02  1.895e-02   1.555    0.128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9574 on 37 degrees of freedom
## Multiple R-squared:  0.9491, Adjusted R-squared:  0.9463 
## F-statistic: 344.7 on 2 and 37 DF,  p-value: < 2.2e-16

Now we remove the Marketing.Spend variable too, as its P value (0.128) is still above our alpha.

regressor <- lm(formula = Profit ~ R.D.Spend,
               data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32393  -4874    134   5177  18628 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.707e+04  3.085e+03   15.26   <2e-16 ***
## R.D.Spend   8.724e-01  3.390e-02   25.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9751 on 38 degrees of freedom
## Multiple R-squared:  0.9457, Adjusted R-squared:  0.9443 
## F-statistic: 662.2 on 1 and 38 DF,  p-value: < 2.2e-16

NOTE: After removing the Marketing.Spend variable, we observe that the adjusted R squared value has decreased from its previous value (0.9463 to 0.9443). This suggests that even though Marketing.Spend is not a significant variable according to our alpha (significance level), we should not remove it from our regression model, because dropping it worsens our goodness-of-fit measure. So our final regression model is the one with the R.D.Spend and Marketing.Spend variables.
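
With the final model settled, we can refit it and check its predictions on the held-out test set; a minimal sketch:

# Refit the chosen model and predict Profit for the test set
regressor <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
                data = training_set)
y_pred <- predict(regressor, newdata = test_set)
cbind(actual = test_set$Profit, predicted = y_pred)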