1

We are conducting research on the ways that people use data analysis and data science tools. Your participation in this non-graded and completely optional peer assessment will be part of that research. We will not collect any personally identifiable information about you for the purposes of this research and only aggregated totals of responses to questions will be reported. The potential risks to you are small. The potential benefits to the community of data scientists, developers, and professors are very high - we will be able to figure out which methods work and which methods do not. This exercise is 100% optional and will not have any influence whatsoever on your grade in the class. Thanks for considering helping us learn about data science!

https://d3c33hcgiwev3.cloudfront.net/_cf0fd3361e05f5be5304b07b771bad48_company_data.csv?Expires=1501027200&Signature=jJA3HZ5ZiPhOk224IS1ELRr7UKyeaqJiK95I7~bVg-MZ1jdeF35vStfqQ-TpqJkoNnZ3x-j1Wtxl~5tnd1rs3fwHQekxN4VO2S3qqmcd867dEJ5kMCVOz6xm-KO2X8TYd-cmRMiuJ~eGGBWXQTpxjP-KGGBssSYUS6QMg6PvI_&Key-Pair-Id=APKAJLTNE6QMUY6HBC5A

Use a linear regression model to investigate this relationship.

Would you be confident telling the company president that there is a meaningful relationship between x1 and y?

dt<-read.csv("company_data.csv")
summary(dt)
##        y                x1              x2              x3       
##  Min.   : 494.7   Min.   :109.8   Min.   :20.13   Min.   :115.7  
##  1st Qu.: 826.0   1st Qu.:181.4   1st Qu.:38.40   1st Qu.:164.8  
##  Median : 926.6   Median :198.9   Median :55.74   Median :179.7  
##  Mean   : 920.0   Mean   :200.7   Mean   :58.43   Mean   :179.5  
##  3rd Qu.:1011.5   3rd Qu.:220.4   3rd Qu.:79.24   3rd Qu.:194.5  
##  Max.   :1352.9   Max.   :314.3   Max.   :99.70   Max.   :241.1
fit<-lm(dt$y~dt$x1,data=dt)
summary(fit)
## 
## Call:
## lm(formula = dt$y ~ dt$x1, data = dt)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -164.572  -48.630   -1.861   49.181  170.498 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 90.77375   19.33019   4.696 3.44e-06 ***
## dt$x1        4.13185    0.09524  43.383  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64.59 on 498 degrees of freedom
## Multiple R-squared:  0.7908, Adjusted R-squared:  0.7903 
## F-statistic:  1882 on 1 and 498 DF,  p-value: < 2.2e-16
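# Yes. The p-value for the x1 slope is < 2e-16, far below 0.05, so we reject
# the null hypothesis that there is no linear relationship between x1 and y.
# The model also explains about 79% of the variance in y (R-squared = 0.7908).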
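The model above is written as `dt$y ~ dt$x1`. Writing the formula with bare column names and letting `data =` supply them gives the same estimates and lets `predict()` accept new data. A minimal sketch on simulated data follows; the data frame `sim` and the numbers used to generate it are made up for illustration and are not the company data.

```r
# Simulated stand-in for company_data.csv (hypothetical values).
set.seed(1)
sim <- data.frame(x1 = rnorm(100, mean = 200, sd = 30))
sim$y <- 90 + 4 * sim$x1 + rnorm(100, sd = 60)

fit_dollar  <- lm(sim$y ~ sim$x1)      # style used above
fit_formula <- lm(y ~ x1, data = sim)  # bare column names, looked up in sim

# Same estimates; only the coefficient labels differ.
same_fit <- isTRUE(all.equal(unname(coef(fit_dollar)),
                             unname(coef(fit_formula))))

# With the formula form, predict() uses the x1 values supplied in newdata.
pred <- predict(fit_formula, newdata = data.frame(x1 = c(150, 250)))
```

With the `sim$x1` form, `predict()` cannot match `newdata` to the model terms, which is the main practical reason to prefer the bare-name formula.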

2

Report the estimated coefficient for x1 from your model to 6 significant digits.

fit$coefficients
## (Intercept)       dt$x1 
##   90.773748    4.131854
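The question asks for 6 significant digits, while the printout above shows 7. One way to round, sketched here with the printed slope typed in by hand (the variable names `est`, `est6`, and `est_chr` are illustrative):

```r
est <- 4.131854                     # x1 slope as printed above

est6    <- signif(est, 6)           # numeric, rounded to 6 significant digits
est_chr <- format(est, digits = 6)  # character version for reporting

print(est_chr)  # "4.13185"
```

In a live session the same calls could be applied directly to `coef(fit)[2]` instead of a hand-typed value.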
sumCoefs <- summary(fit)$coefficients
n <- nrow(dt)
sdErr <- sumCoefs[2, 2]   # standard error of the x1 slope
estVal <- sumCoefs[2, 1]  # estimated x1 slope
pUpper <- 1 - 0.05/2      # 0.975 quantile: 95% interval, 2.5% in each tail
estVal + c(-1, 1) * qt(p = pUpper, df = n - 2) * sdErr
## [1] 3.944728 4.318981
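The interval above is built by hand from the estimate, its standard error, and a t quantile with n - 2 degrees of freedom. Base R's `confint()` performs the same calculation for an `lm` fit. The sketch below checks that the two agree on simulated data; `sim` and `fit_sim` are hypothetical stand-ins, not the company data.

```r
# Simulated stand-in for company_data.csv (hypothetical values).
set.seed(1)
sim <- data.frame(x1 = rnorm(100, mean = 200, sd = 30))
sim$y <- 90 + 4 * sim$x1 + rnorm(100, sd = 60)
fit_sim <- lm(y ~ x1, data = sim)

# Manual 95% interval: estimate +/- t quantile * standard error.
co <- summary(fit_sim)$coefficients
manual <- co["x1", "Estimate"] +
  c(-1, 1) * qt(0.975, df = df.residual(fit_sim)) * co["x1", "Std. Error"]

# Built-in equivalent; returns a 1 x 2 matrix with the lower and upper bounds.
builtin <- confint(fit_sim, "x1", level = 0.95)

agree <- isTRUE(all.equal(manual, as.numeric(builtin)))
```

`df.residual()` returns n minus the number of estimated coefficients, which equals `n - 2` for a model with an intercept and one predictor.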

4

Report the p-value associated with the coefficient for x1 from your model to 6 significant digits. Use scientific notation.

## [1] 2.781114e-171
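The code that produced this value is not shown. The printed model summary only displays `< 2e-16`, but the exact p-value is stored in the coefficient table of `summary()`. A sketch of one way to pull it out and format it, using simulated data (`sim` and `fit_sim` are hypothetical and are not the company data):

```r
# Simulated stand-in for company_data.csv (hypothetical values).
set.seed(1)
sim <- data.frame(x1 = rnorm(100, mean = 200, sd = 30))
sim$y <- 90 + 4 * sim$x1 + rnorm(100, sd = 60)
fit_sim <- lm(y ~ x1, data = sim)

# Column 4 of the coefficient table holds the unrounded p-values.
p_x1 <- summary(fit_sim)$coefficients["x1", "Pr(>|t|)"]

# Scientific notation with 6 significant digits, applied to the value above.
p_chr <- format(2.781114e-171, digits = 6, scientific = TRUE)
print(p_chr)  # "2.78111e-171"
```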
