Assignment #01: Statistical Significance [Outline]

Simple Linear Regression Project - Analysis of Hubble Dataset

Brendan Howell

Rensselaer Polytechnic Institute

02/26/15 - Version 1.0

1. Select a dataset from “100+ Interesting Data Sets for Statistics” [http://bit.ly/1uQIVLU]

Edwin Hubble’s Dataset of Extra-Galactic Nebulae

[Reference: Hubble, E. (1929) “A Relation Between Distance and Radial Velocity Among Extra-Galactic Nebulae,” Proceedings of the National Academy of Sciences, 15(3), 168–173. http://lib.stat.cmu.edu/DASL/Datafiles/Hubble.html]

2. Select one independent variable and select one dependent variable

In this analysis, the independent variable is the distance of each extra-galactic nebula from Earth (in megaparsecs), and the dependent variable is the recession velocity of that nebula (in km/s).

3. Read the dataset as a data.table or data.frame

# Load the Hubble dataset from the project documents folder
hubble <- read.csv("~/Academics (RPI)/10. Spring 2015/Applied Regression Analysis/Assignments/Assignment #1/Hubble.csv", header=TRUE)
head(hubble)
##   distance recession_velocity
## 1    0.032                170
## 2    0.034                290
## 3    0.214               -130
## 4    0.263                -70
## 5    0.275               -185
## 6    0.275               -220
tail(hubble)
##    distance recession_velocity
## 19      1.4                500
## 20      1.7                960
## 21      2.0                500
## 22      2.0                850
## 23      2.0                800
## 24      2.0               1090
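The absolute path above only resolves on the author's machine. As a minimal sketch of a more portable load-and-check step, the snippet below writes the twelve rows printed by head() and tail() to a temporary CSV and reads them back. The names `hubble_subset` and `csv_path` are hypothetical, and the twelve rows are only a subset of the 24 observations, so they illustrate the workflow and do not replace the full file.

```r
# Twelve of the 24 observations, copied from the head() and tail() output above.
# This is a SUBSET used for illustration only, not the full Hubble dataset.
printed_rows <- data.frame(
  distance = c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275,
               1.4, 1.7, 2.0, 2.0, 2.0, 2.0),
  recession_velocity = c(170, 290, -130, -70, -185, -220,
                         500, 960, 500, 850, 800, 1090)
)

# Build the path with file.path() so it does not depend on one machine's layout
# (csv_path is a hypothetical location inside the session's temporary directory).
csv_path <- file.path(tempdir(), "Hubble_subset.csv")
write.csv(printed_rows, csv_path, row.names = FALSE)

# Read the file back exactly as the report does, then check its structure.
hubble_subset <- read.csv(csv_path, header = TRUE)
stopifnot(
  all(c("distance", "recession_velocity") %in% names(hubble_subset)),
  is.numeric(hubble_subset$distance),
  is.numeric(hubble_subset$recession_velocity),
  !anyNA(hubble_subset)
)
str(hubble_subset)
```

Checking column names and types immediately after reading catches a mislabeled or partially read file before any model is fitted.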

4. Describe, in detail, H_0, your null hypothesis

This analysis asks whether the variation observed in the response variable (‘recession_velocity’) can be explained by variation in the single predictor (‘distance’). The null hypothesis, H_0, states that the distance of an extra-galactic nebula from Earth has no linear effect on its recession velocity; that is, the slope of the regression of recession velocity on distance is zero. The alternative hypothesis is that the slope is nonzero.
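Stated symbolically, the hypotheses and the usual test statistic can be sketched as follows. The symbol \beta_1 for the population slope of recession velocity on distance is an assumption of this sketch and is not used in the original text; b_1 is its estimate.

```latex
\begin{align*}
H_0 &: \beta_1 = 0 \\
H_1 &: \beta_1 \neq 0 \\
t &= \frac{b_1}{\operatorname{SE}(b_1)} \sim t_{n-2} \quad \text{under } H_0
\end{align*}
```

With n = 24 nebulae, the reference distribution has 22 degrees of freedom.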

5. Describe, in detail, your (linear) model

To test this hypothesis, we fit a simple linear regression with the “lm()” function, modeling ‘recession_velocity’ as an intercept, b_0, plus a slope, b_1, times ‘distance’, plus a random error term. The summary of the fitted model reports the estimates of b_0 and b_1, their standard errors, and a t-test of whether b_1 differs from zero, which indicates whether variation in the distance of the nebulae from Earth explains the variation in their recession velocity.
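Written out, the model and its least-squares estimates take the form below. This is a sketch under the standard assumptions, which the original text does not state explicitly: independent errors \varepsilon_i with mean zero and constant variance. Here x_i is the distance and y_i the recession velocity of nebula i.

```latex
\begin{align*}
y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n \\
b_1 &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
b_0 &= \bar{y} - b_1 \bar{x}
\end{align*}
```

As a quick consistency check against the output that follows, the sample means 0.9114 and 373.1 and the fitted slope 454.16 give b_0 of roughly 373.1 - 454.16 × 0.9114, or about -40.8, in line with the reported intercept.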

summary(hubble)
##     distance      recession_velocity
##  Min.   :0.0320   Min.   :-220.0    
##  1st Qu.:0.4062   1st Qu.: 165.0    
##  Median :0.9000   Median : 295.0    
##  Mean   :0.9114   Mean   : 373.1    
##  3rd Qu.:1.1750   3rd Qu.: 537.5    
##  Max.   :2.0000   Max.   :1090.0
str(hubble)
## 'data.frame':    24 obs. of  2 variables:
##  $ distance          : num  0.032 0.034 0.214 0.263 0.275 0.275 0.45 0.5 0.5 0.63 ...
##  $ recession_velocity: int  170 290 -130 -70 -185 -220 200 290 270 200 ...
hubble_model <- lm(hubble$recession_velocity~hubble$distance)
summary(hubble_model)
## 
## Call:
## lm(formula = hubble$recession_velocity ~ hubble$distance)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -397.96 -158.10  -13.16  148.09  506.63 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -40.78      83.44  -0.489     0.63    
## hubble$distance   454.16      75.24   6.036 4.48e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 232.9 on 22 degrees of freedom
## Multiple R-squared:  0.6235, Adjusted R-squared:  0.6064 
## F-statistic: 36.44 on 1 and 22 DF,  p-value: 4.477e-06
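The coefficient table above can be cross-checked with a few lines of arithmetic. The sketch below copies the rounded estimates and standard errors from the summary output (so results agree only up to rounding) and recomputes the slope's t value and p-value, the F statistic, 95% intervals for both coefficients, and the correlation r. All variable names are illustrative.

```r
# Rounded values copied from the summary(hubble_model) output above
b0 <- -40.78;  se_b0 <- 83.44   # intercept estimate and standard error
b1 <- 454.16;  se_b1 <- 75.24   # slope estimate and standard error
df_resid  <- 22                 # residual degrees of freedom (n - 2)
r_squared <- 0.6235             # multiple R-squared

# t statistic and two-sided p-value for H_0: slope = 0
t_b1 <- b1 / se_b1
p_b1 <- 2 * pt(-abs(t_b1), df = df_resid)

# With a single predictor, the overall F statistic equals the squared slope t value
f_stat <- t_b1^2

# 95% confidence intervals: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df = df_resid)
ci_b1 <- b1 + c(-1, 1) * t_crit * se_b1
ci_b0 <- b0 + c(-1, 1) * t_crit * se_b0

# Correlation coefficient: square root of R-squared, carrying the slope's sign
r <- sign(b1) * sqrt(r_squared)

print(c(t = t_b1, p = p_b1, F = f_stat, r = r))
print(rbind(b0 = ci_b0, b1 = ci_b1))
```

The slope interval computed this way should agree, up to rounding of the inputs, with the confint() output reported in the Plots section, and the intercept interval spans zero, consistent with its p-value of 0.63.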

6. Describe the dataset you selected

In 1929, Edwin Hubble investigated whether there was a significant relationship between the distance of extra-galactic nebulae (celestial objects) from Earth and their radial velocity. To do so, Hubble used data from 24 nebulae, comprising the distance from Earth (in megaparsecs) and the recession velocity (in km/s) of each nebula.

Plots

1. Plot the scattergram of your data

# Scatter plot of recession velocity against distance
plot(x = hubble$distance, y = hubble$recession_velocity, col = "red",
     main = "Nebulae Distance from Earth vs. Nebulae Recession Velocity",
     xlab = "Distance from Earth (in Megaparsecs)",
     ylab = "Recession Velocity (in km/s)")

2. Plot the regression line

# Redraw the scatter plot, then overlay the fitted regression line
plot(x = hubble$distance, y = hubble$recession_velocity, col = "red",
     main = "Nebulae Distance from Earth vs. Nebulae Recession Velocity",
     xlab = "Distance from Earth (in Megaparsecs)",
     ylab = "Recession Velocity (in km/s)")
abline(hubble_model)

3. Plot the 95% confidence intervals of the regression line, b_0 and b_1

confint(hubble_model, 'hubble$distance', level=0.95)
##                    2.5 %   97.5 %
## hubble$distance 298.1262 610.1906
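The call above reports the interval for the slope only and does not draw anything. One way to cover both coefficients and plot a 95% confidence band for the regression line is sketched below. Two assumptions apply. First, the data are the twelve rows printed by head() and tail() earlier, a subset of the 24 observations, so the numbers will differ from the full fit. Second, the model is refit with a formula and a `data` argument, because predict() cannot match new data to a term named `hubble$distance`. The names `hubble_subset`, `fit`, `grid`, and `band` are hypothetical.

```r
# Subset of the data (12 of 24 rows, copied from the head()/tail() output);
# for illustration only, results will differ from the full-data fit.
hubble_subset <- data.frame(
  distance = c(0.032, 0.034, 0.214, 0.263, 0.275, 0.275,
               1.4, 1.7, 2.0, 2.0, 2.0, 2.0),
  recession_velocity = c(170, 290, -130, -70, -185, -220,
                         500, 960, 500, 850, 800, 1090)
)

# Fit with formula + data so that predict() can be given new distances
fit <- lm(recession_velocity ~ distance, data = hubble_subset)

# 95% confidence intervals for both the intercept (b_0) and the slope (b_1)
coef_ci <- confint(fit, level = 0.95)
print(coef_ci)

# Evaluate the fitted line and its 95% confidence band on a grid of distances
grid <- data.frame(distance = seq(min(hubble_subset$distance),
                                  max(hubble_subset$distance),
                                  length.out = 50))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Scatter plot, fitted line (solid), and confidence band (dashed)
plot(x = hubble_subset$distance, y = hubble_subset$recession_velocity,
     col = "red",
     xlab = "Distance from Earth (in Megaparsecs)",
     ylab = "Recession Velocity (in km/s)")
abline(fit)
lines(grid$distance, band[, "lwr"], lty = 2)
lines(grid$distance, band[, "upr"], lty = 2)
```

With the full dataset loaded as `hubble`, the same steps would apply after replacing `hubble_subset` with `hubble`.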

5. Interpret the results of the statistical analysis b_0, b_1 and r

(To be completed in the final version of Assignment #1.)

Due Dates

Outline due February 19, 2015

Project due February 26, 2015