Assignment #01: Statistical Significance [Final]

Statistical Significance (Simple Linear Regression) Project - Analysis of Hubble Dataset

Brendan Howell

Rensselaer Polytechnic Institute

02/26/15 - Version 1.0

1. Data

Data Selection and Description

Edwin Hubble’s Dataset of Extra-Galactic Nebulae

In 1929, Edwin Hubble investigated whether the distance of extra-galactic nebulae (celestial objects) from Earth is related to their radial velocity. To test for such a relationship, Hubble used data from 24 nebulae, comprising the distance from Earth (in Megaparsecs) and the recession velocity (in km/sec) of each nebula.

[Reference: Hubble, E. (1929) “A Relation between Distance and Radial Velocity among Extra-Galactic Nebulae,” Proceedings of the National Academy of Sciences, 15(3), 168-173. http://lib.stat.cmu.edu/DASL/Datafiles/Hubble.html]

Organization of Data

Below, the Hubble dataset is loaded into R, and its summary statistics and structure are displayed (along with the “head” and the “tail” of the dataset). A scatterplot of the data is also generated, which offers some visual insight into the data.

##Load in the Hubble Dataset
rm(list=ls())
#Get dataset from Project Documents File
hubble <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #1/Hubble.csv", header=TRUE)
#Display the head and tail of the data
head(hubble)
##   distance recession_velocity
## 1    0.032                170
## 2    0.034                290
## 3    0.214               -130
## 4    0.263                -70
## 5    0.275               -185
## 6    0.275               -220
tail(hubble)
##    distance recession_velocity
## 19      1.4                500
## 20      1.7                960
## 21      2.0                500
## 22      2.0                850
## 23      2.0                800
## 24      2.0               1090
#Display the summary statistics and the structure of the data
summary(hubble)
##     distance      recession_velocity
##  Min.   :0.0320   Min.   :-220.0    
##  1st Qu.:0.4062   1st Qu.: 165.0    
##  Median :0.9000   Median : 295.0    
##  Mean   :0.9114   Mean   : 373.1    
##  3rd Qu.:1.1750   3rd Qu.: 537.5    
##  Max.   :2.0000   Max.   :1090.0
str(hubble)
## 'data.frame':    24 obs. of  2 variables:
##  $ distance          : num  0.032 0.034 0.214 0.263 0.275 0.275 0.45 0.5 0.5 0.63 ...
##  $ recession_velocity: int  170 290 -130 -70 -185 -220 200 290 270 200 ...
#Generate a scatterplot of the data
plot(y = hubble$recession_velocity,x = hubble$distance, pch=21, bg="darkviolet", main="Nebulae Distance from Earth vs. Nebulae Recession Velocity", ylab = "Recession Velocity (in km/s)", xlab = "Distance from Earth (in Megaparsecs)")

2. The Linear Model (Simple Linear Regression)

Description of independent variable and dependent variable

In this experiment, the independent variable is the distance of the extra-galactic nebulae from Earth (in Megaparsecs) and the dependent variable is the recession velocity (in km/sec) of the extra-galactic nebulae.

Description of the null hypothesis (H_0)

In this experiment, we are trying to determine whether the variation observed in the response variable (‘recession_velocity’ in this analysis) can be explained by the variation in the single explanatory variable (‘distance’). The null hypothesis being tested therefore states that the distance of the extra-galactic nebulae from Earth has no linear effect on their recession velocity; that is, the slope of the regression line is zero.
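
For reference, the hypothesis described above can be written out as a short sketch. The symbols are assumptions introduced here rather than notation from the original analysis: \beta_0 and \beta_1 denote the unknown population intercept and slope (estimated later by b_0 and b_1), x_i is the distance of nebula i, y_i is its recession velocity, and \varepsilon_i is a random error term with mean zero.

```latex
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, 24, \\
H_0 &: \beta_1 = 0, \\
H_a &: \beta_1 \neq 0.
\end{aligned}
```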

Linear Model

To determine whether the variation in nebulae recession velocity can be explained by the variation in the distance of the nebulae from Earth, we fit a linear model using the “lm()” function. After the model is fitted, a regression line is drawn through a scatter plot of the data, which allows the model’s fit (with respect to the relationship between distance and recession velocity) to be assessed visually.

#Generate linear model
hubble_model <- lm(hubble$recession_velocity~hubble$distance)
#Display summary of linear model
summary(hubble_model)
## 
## Call:
## lm(formula = hubble$recession_velocity ~ hubble$distance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -397.96 -158.10  -13.16  148.09  506.63 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -40.78      83.44  -0.489     0.63    
## hubble$distance   454.16      75.24   6.036 4.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 232.9 on 22 degrees of freedom
## Multiple R-squared:  0.6235, Adjusted R-squared:  0.6064 
## F-statistic: 36.44 on 1 and 22 DF,  p-value: 4.477e-06
#Plot the regression line of the model
plot(y = hubble$recession_velocity,x = hubble$distance, pch=21, bg="darkviolet", main="Nebulae Distance from Earth vs. Nebulae Recession Velocity", ylab = "Recession Velocity (in km/s)", xlab = "Distance from Earth (in Megaparsecs)")
abline(hubble_model, col='black',lwd=2.5)
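
As a side note on the “lm()” call above, the following is a minimal sketch of an alternative way to write the same kind of model, using bare column names and a `data` argument. The data frame `toy` is synthetic stand-in data invented for this illustration (it is not Hubble's data), and the names `fit_dollar`, `fit_data`, and `new_points` are hypothetical.

```r
# Synthetic stand-in data (NOT Hubble's measurements);
# the slope 450 and noise sd 200 are arbitrary choices for illustration
set.seed(1)
toy <- data.frame(distance = seq(0.1, 2, length.out = 24))
toy$recession_velocity <- 450 * toy$distance + rnorm(24, sd = 200)

# Form used in this report: variables referenced through the data frame with `$`
fit_dollar <- lm(toy$recession_velocity ~ toy$distance)

# Alternative form: bare column names plus a `data` argument
fit_data <- lm(recession_velocity ~ distance, data = toy)

# Both calls fit the same model; only the coefficient labels differ
same_coefs <- all.equal(unname(coef(fit_dollar)), unname(coef(fit_data)))
print(same_coefs)
print(names(coef(fit_data)))

# With the `data` form, predict() can be given new distances by column name
new_points <- data.frame(distance = c(0.5, 1.5))
print(predict(fit_data, newdata = new_points))
```

The two forms give identical estimates. The `data` form tends to be more convenient when predictions at new distances are needed, because `predict()` matches the columns of `newdata` by name.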

Plot of the 95% confidence intervals of the regression line, b_0 and b_1

The scatter plot below shows the fitted regression line (in black) together with two red lines, which are drawn from the lower and the upper 95% confidence limits of b_0 and b_1, respectively. The numerical 95% confidence interval for the slope b_1 is also generated: we can be 95% confident that the true slope lies between 298.1262 and 610.1906 km/s per Megaparsec, in the sense that intervals constructed in this way contain the true slope for 95% of the possible samples of data. Note that the red lines pair the lower limits of both coefficients and the upper limits of both coefficients, so they provide a rough visual bracket around the fitted line rather than a pointwise confidence band for the mean recession velocity.

confint(hubble_model, 'hubble$distance', level=0.95)
##                    2.5 %   97.5 %
## hubble$distance 298.1262 610.1906
plot(hubble$distance,hubble$recession_velocity,pch=21, bg='darkviolet',main="Nebulae Distance from Earth vs. Nebulae Recession Velocity", ylab = "Recession Velocity (in km/s)", xlab = "Distance from Earth (in Megaparsecs)")
abline(hubble_model$coef,lwd=2.5,col="black")
abline(confint(hubble_model)[,1],col="red",lwd=2)
abline(confint(hubble_model)[,2],col="red",lwd=2)
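
The red lines above are built from the confidence limits of the coefficients. For comparison, here is a minimal sketch of a pointwise 95% confidence band for the mean response, computed with `predict(..., interval = "confidence")`. It uses synthetic stand-in data (`toy`, invented here and not Hubble's data), and `toy_model`, `grid`, and `band` are hypothetical names.

```r
# Synthetic stand-in data (NOT Hubble's measurements); values are arbitrary
set.seed(1)
toy <- data.frame(distance = seq(0.1, 2, length.out = 24))
toy$recession_velocity <- 450 * toy$distance + rnorm(24, sd = 200)
toy_model <- lm(recession_velocity ~ distance, data = toy)

# Evaluate the fitted line and its 95% confidence limits on a grid of distances
grid <- data.frame(distance = seq(min(toy$distance), max(toy$distance),
                                  length.out = 51))
band <- predict(toy_model, newdata = grid, interval = "confidence", level = 0.95)

# The band is curved: it is narrowest at the mean distance
# and widens toward both ends of the observed range
band_width <- band[, "upr"] - band[, "lwr"]
narrowest_at <- grid$distance[which.min(band_width)]
print(c(narrowest_at = narrowest_at, mean_distance = mean(toy$distance)))
```

After a call to `plot()`, the band could be drawn with `lines(grid$distance, band[, "lwr"], col = "red")` and the same call for `"upr"`. Unlike straight lines built from the coefficient limits, this band reflects how the uncertainty in the fitted mean changes with distance.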

Interpretation of the results of the statistical analysis (b_0, b_1, and r)

For the simple linear regression analysis in which ‘distance’ is analyzed against the response variable ‘recession_velocity’, a p-value of 4.477e-06 is returned. This means that, if the null hypothesis were true (that is, if distance had no linear effect on recession velocity), the probability of obtaining an F-value at least as large as the observed one (36.44) would be roughly 4.477e-06. Based on this result, we reject the null hypothesis and conclude that the variation observed in nebulae recession velocities can be explained in part by the variation in nebulae distances from Earth and, as such, is unlikely to be caused solely by randomization. (See above results for p-value and F-value under the heading “Linear Model.”)
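
The test statistics discussed above can be cross-checked by hand. The sketch below uses only the rounded values printed in the model summary (slope estimate 454.16, standard error 75.24, F-value 36.44, 22 residual degrees of freedom), so its results agree with the summary only up to rounding. The variable names are hypothetical.

```r
# Rounded values copied from the summary(hubble_model) output
b1       <- 454.16   # slope estimate
se_b1    <- 75.24    # standard error of the slope
f_value  <- 36.44    # reported F-statistic
df_resid <- 22       # residual degrees of freedom

# t statistic for H_0: slope = 0
t_slope <- b1 / se_b1

# With a single explanatory variable, the F statistic equals the squared t statistic
f_from_t <- t_slope^2

# Two-sided p-value from the t distribution, and the same p-value from the F distribution
p_from_t <- 2 * pt(abs(t_slope), df = df_resid, lower.tail = FALSE)
p_from_f <- pf(f_value, df1 = 1, df2 = df_resid, lower.tail = FALSE)

print(c(t = t_slope, F_from_t = f_from_t))
print(c(p_from_t = p_from_t, p_from_f = p_from_f))
```

Both p-values should land close to the reported 4.477e-06, which shows that the t-test on the slope and the overall F-test are the same test in a simple linear regression.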

In further analyzing the results of the simple linear regression analysis, it’s important to note that the value of b_0 (the linear model’s y-intercept) is -40.78 and the value of b_1 (which represents both the slope of the linear model and the coefficient associated with the variable ‘distance’) is 454.16. The intercept is not significantly different from zero (p-value = 0.63). The slope describes the relationship between nebulae distance from Earth and nebulae recession velocity: an increase of one unit in “nebulae distance from Earth” (in Megaparsecs) is associated with an average increase of 454.16 units in “nebulae recession velocity” (in km/s).

Additionally, it’s important to note the metrics that measure the strength of the relationship between nebulae distance from Earth and nebulae recession velocity, which are multiple R-squared (0.6235) and adjusted R-squared (0.6064). Since adjusted R-squared takes into account any bias associated with the number of explanatory variables included in the model, this analysis emphasizes the value of adjusted R-squared, although with a single explanatory variable the two values are close. The correlation coefficient r is the square root of multiple R-squared, approximately 0.79, and is positive because the slope is positive. Since the value of adjusted R-squared is 0.6064, it can be inferred that the variation in the independent variable “nebulae distance from Earth” explains approximately 60.64% of the variation in the dependent variable “nebulae recession velocity” after adjusting for the number of explanatory variables.
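
The quantities interpreted above can be reproduced from the rounded values printed earlier in this report, as a rough arithmetic check. The sketch below assumes n = 24 observations and one explanatory variable, both taken from the output above. The variable names are hypothetical, and the results match the reported values only up to rounding.

```r
# Rounded values copied from the summary(hubble_model) output
b1        <- 454.16   # slope estimate
se_b1     <- 75.24    # standard error of the slope
df_resid  <- 22       # residual degrees of freedom
r_squared <- 0.6235   # multiple R-squared
n         <- 24       # number of nebulae
p         <- 1        # number of explanatory variables

# 95% confidence interval for the slope: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df = df_resid)
ci_b1  <- b1 + c(-1, 1) * t_crit * se_b1

# Correlation coefficient: square root of R-squared, taking the sign of the slope
r <- sign(b1) * sqrt(r_squared)

# Adjusted R-squared penalizes R-squared for the number of explanatory variables
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(ci_b1)
print(c(r = r, adj_r_squared = adj_r_squared))
```

The interval should agree with the `confint()` output (298.1262 to 610.1906) to within rounding of the inputs, and the adjusted R-squared should agree with the reported 0.6064.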