P values tell you whether your hypothesis test results are statistically significant, and they appear throughout statistics. A P value is the probability of observing a sample statistic at least as extreme as the one in your sample, assuming that the null hypothesis is true. The null hypothesis is usually a hypothesis of “no difference”, e.g. no difference between blood pressures in group A and group B. Define a null hypothesis for each study question clearly before the start of your study.
Specifically, if the null hypothesis is correct, what is the probability of obtaining an effect at least as large as the one in your sample?
If your P value is small enough, you can conclude that your sample is so incompatible with the null hypothesis that you can reject the null for the entire population. P values are an integral part of inferential statistics because they help you use your sample to draw conclusions about a population.
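For example, here is a minimal sketch in R with simulated (entirely hypothetical) blood-pressure readings for the two groups mentioned above; t.test() returns the P value for the null hypothesis of no difference in means:

set.seed(42)                               # for reproducibility
group_A <- rnorm(30, mean = 120, sd = 10)  # simulated systolic readings, group A
group_B <- rnorm(30, mean = 126, sd = 10)  # simulated systolic readings, group B
result <- t.test(group_A, group_B)         # two-sample t-test of H0: no difference
result$p.value                             # the P value for this test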
The term significance level (alpha) refers to a pre-chosen probability, while the term “P value” refers to a probability that you calculate after a given study.
If your P value is less than the chosen significance level, you reject the null hypothesis, i.e. you accept that your sample gives reasonable evidence to support the alternative hypothesis. This does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.
The choice of significance level at which you reject H0 is arbitrary. Conventionally the 5%, 1% and 0.1% levels (P < 0.05, 0.01 and 0.001) have been used; at the 5% level, for example, you accept a 1 in 20 chance of rejecting a null hypothesis that is in fact true. These numbers can give a false sense of security.
In the ideal world, we would be able to define a “perfectly” random sample, the most appropriate test and one definitive conclusion. We simply cannot. What we can do is try to optimise all stages of our research to minimise sources of uncertainty. When presenting P values, some groups find it helpful to use the asterisk rating system alongside the quoted P value: conventionally * for P < 0.05, ** for P < 0.01 and *** for P < 0.001, the same significance codes R prints under its regression summaries.
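As an illustration of that rating system, base R’s symnum() function can map P values onto those symbols; the cutpoints below mirror the “Signif. codes” line that appears in the lm() summaries later in this post (a minimal sketch using made-up P values):

p_values <- c(0.0004, 0.023, 0.061, 0.43)   # hypothetical P values
symnum(p_values, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
       symbols = c("***", "**", "*", ".", " "))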
R-squared is a goodness-of-fit measure for linear regression models. In statistics, the coefficient of determination, denoted R2 or r2 and pronounced “R squared”, is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It expresses the strength of the relationship between your model and the dependent variable on a convenient 0–100% scale.
Source: Wikipedia
R2 does not indicate whether:
- the independent variables are a cause of the changes in the dependent variable;
- omitted-variable bias exists;
- the correct regression was used;
- the most appropriate set of independent variables has been chosen;
- there is collinearity present in the data on the explanatory variables.
The use of an adjusted R2 is an attempt to take account of the phenomenon of the R2 automatically and spuriously increasing when extra explanatory variables are added to the model. It is defined as

Adjusted R2 = 1 − (1 − R2) × (n − 1) / (n − p − 1)

where p is the total number of explanatory variables in the model (not including the constant term), and n is the sample size.
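To make the definitions concrete, here is a minimal sketch that computes both quantities by hand (using R’s built-in mtcars data, not the startup data below) and matches what summary() reports:

fit <- lm(mpg ~ wt + hp, data = mtcars)  # any fitted lm will do
r2 <- 1 - sum(residuals(fit)^2) / sum((mtcars$mpg - mean(mtcars$mpg))^2)
n <- nrow(mtcars)  # sample size
p <- 2             # explanatory variables, excluding the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(r2, adj_r2)      # compare summary(fit)$r.squared and summary(fit)$adj.r.squared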
Now, let’s look at the practical application of both concepts. Here, we will fit a linear regression model to 50_Startups.csv, where R.D.Spend, Administration, Marketing.Spend and State are the independent variables and Profit is the dependent variable. The dataset can be downloaded from the following link:
Dataset: 50_Startups.csv
# Importing the dataset
dataset <- read.csv('50_Startups.csv')
library(hwriter)
cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
R.D.Spend | Administration | Marketing.Spend | State      | Profit
----------|----------------|-----------------|------------|---------
165349.2  | 136897.80      | 471784.1        | New York   | 192261.8
162597.7  | 151377.59      | 443898.5        | California | 191792.1
153441.5  | 101145.55      | 407934.5        | Florida    | 191050.4
144372.4  | 118671.85      | 383199.6        | New York   | 182902.0
142107.3  | 91391.77       | 366168.4        | Florida    | 166187.9
131876.9  | 99814.71       | 362861.4        | New York   | 156991.1
Since our State variable is categorical, we need to encode its string values (‘California’, ‘Florida’, ‘New York’) as factors.
# Encoding categorical data
dataset$State <- factor(dataset$State,
                        levels = c('California', 'Florida', 'New York'),
                        labels = c(1, 2, 3))
cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
R.D.Spend | Administration | Marketing.Spend | State | Profit
----------|----------------|-----------------|-------|---------
165349.2  | 136897.80      | 471784.1        | 3     | 192261.8
162597.7  | 151377.59      | 443898.5        | 1     | 191792.1
153441.5  | 101145.55      | 407934.5        | 2     | 191050.4
144372.4  | 118671.85      | 383199.6        | 3     | 182902.0
142107.3  | 91391.77       | 366168.4        | 2     | 166187.9
131876.9  | 99814.71       | 362861.4        | 3     | 156991.1
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
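As a quick sanity check (not part of the original walkthrough), you can confirm that the 80/20 split came out as intended before fitting anything:

nrow(training_set)  # should be roughly 80% of the 50 rows
nrow(test_set)      # the remaining rows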
regressor <- lm(formula = Profit ~ .,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32380 -4586 -190 4940 20038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.359e+04 8.111e+03 5.374 5.61e-06 ***
## R.D.Spend 8.146e-01 5.828e-02 13.977 1.18e-15 ***
## Administration 1.858e-02 6.098e-02 0.305 0.762
## Marketing.Spend 2.873e-02 2.083e-02 1.379 0.177
## State2 -2.938e+02 3.871e+03 -0.076 0.940
## State3 -1.878e+03 4.104e+03 -0.458 0.650
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9938 on 34 degrees of freedom
## Multiple R-squared: 0.9496, Adjusted R-squared: 0.9421
## F-statistic: 128 on 5 and 34 DF, p-value: < 2.2e-16
NOTE: We now have P values for each independent variable. If we take our alpha to be 0.05, the P values suggest that not all of the independent variables are significant in our regression model, so we can eliminate the non-significant ones using backward elimination.
First, we remove the State variable, as its P values (0.940 and 0.650 for its two dummy levels) are far above our alpha.
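As a shorthand, you could also drop the term with update() instead of retyping the formula; this is equivalent to the explicit refit shown next:

regressor_no_state <- update(regressor, . ~ . - State)  # drop the State term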
regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
## data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31950 -4596 -166 5660 18688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.296e+04 7.785e+03 5.518 3.07e-06 ***
## R.D.Spend 8.064e-01 5.442e-02 14.820 < 2e-16 ***
## Administration 1.946e-02 5.931e-02 0.328 0.745
## Marketing.Spend 3.096e-02 1.972e-02 1.570 0.125
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9692 on 36 degrees of freedom
## Multiple R-squared: 0.9492, Adjusted R-squared: 0.945
## F-statistic: 224.3 on 3 and 36 DF, p-value: < 2.2e-16
We observe that the adjusted R-squared value has improved (from 0.9421 to 0.945). Next, we remove the Administration variable as well, since its P value is 0.745.
regressor <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31922 -4693 -223 5904 18785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.527e+04 3.243e+03 13.961 2.54e-16 ***
## R.D.Spend 8.118e-01 5.124e-02 15.844 < 2e-16 ***
## Marketing.Spend 2.946e-02 1.895e-02 1.555 0.128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9574 on 37 degrees of freedom
## Multiple R-squared: 0.9491, Adjusted R-squared: 0.9463
## F-statistic: 344.7 on 2 and 37 DF, p-value: < 2.2e-16
Now, we remove the Marketing.Spend variable too, because of its relatively large P value (0.128).
regressor <- lm(formula = Profit ~ R.D.Spend,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32393 -4874 134 5177 18628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.707e+04 3.085e+03 15.26 <2e-16 ***
## R.D.Spend 8.724e-01 3.390e-02 25.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9751 on 38 degrees of freedom
## Multiple R-squared: 0.9457, Adjusted R-squared: 0.9443
## F-statistic: 662.2 on 1 and 38 DF, p-value: < 2.2e-16
NOTE: After removing the Marketing.Spend variable, we observe that the adjusted R-squared value has decreased (from 0.9463 to 0.9443). This suggests that even though Marketing.Spend is not significant at our chosen alpha, we should not remove it from our regression model, because dropping it hurts our goodness-of-fit measure. So our final regression model is the one with the R.D.Spend and Marketing.Spend variables.
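To close the loop, here is a short sketch (not part of the original walkthrough) that refits the chosen model and checks how it does on the held-out test set via out-of-sample R-squared:

final_model <- lm(Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
predictions <- predict(final_model, newdata = test_set)
ss_res <- sum((test_set$Profit - predictions)^2)           # residual sum of squares
ss_tot <- sum((test_set$Profit - mean(test_set$Profit))^2) # total sum of squares
1 - ss_res / ss_tot                                        # out-of-sample R-squared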