Data Scientist’s Toolbox Quiz 3

This is Quiz 3 from the Data Scientist’s Toolbox course within the Data Science Specialization. This publication is intended as a learning resource; all answers are documented and explained.

Questions


1. We take a random sample of individuals in a population and identify whether they smoke and whether they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or not and whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain that we think the reason for this relationship is that cigarette smoke contains known carcinogens such as arsenic and benzene, which make cells in the lungs become cancerous. What type of data analysis is this?


  • This is an example of an inferential data analysis.


Explanation:

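Descriptive analysis identifies characteristics of a dataset and describes them. Inferential analysis tests hypotheses about a general population based on a sample. Predictive analysis tries to predict the outcome of an event. Causal analysis generally requires careful study design and randomized studies, although some techniques exist to infer causation from observational data. Here the claim generalizes a relationship seen in a sample to the larger population, which makes the analysis inferential; the carcinogen explanation is offered as a rationale and is not established by the study itself.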
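As a rough illustration of what generalizing from a sample to a population looks like in practice, the sketch below estimates the difference in cancer rates between smokers and non-smokers with an approximate 95% confidence interval. The counts are invented purely for illustration and do not come from the quiz or any real study; the function name is also hypothetical.

```python
import math

def diff_in_proportions_ci(x1, n1, x2, n2, z=1.96):
    """Difference of two sample proportions with a normal-approximation CI.

    x1/n1 and x2/n2 are the event counts and group sizes; z=1.96 gives
    roughly a 95% interval.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se

# Hypothetical sample: 30 of 200 smokers and 10 of 800 non-smokers have lung cancer.
diff, low, high = diff_in_proportions_ci(30, 200, 10, 800)
print(f"difference = {diff:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

If the interval excludes zero, the sample supports a relationship in the wider population. It says nothing about why the relationship exists, which is the gap between inferential and causal analysis.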


2. What is the most important thing in Data Science?


  • The question you are trying to answer.


Explanation:

Garbage in, garbage out. If you don’t start in the right place you’ll end up way off course.


3. The figure linked below shows heatmaps of Martha Stewart Living Subscribers and Our Site’s Users. What would be the problem with using it to establish a relationship between the two?


https://d396qusza40orc.cloudfront.net/datascitoolbox%2Fheatmap.png

  • There would be confounding because the number of people that live in an area is related to both Martha Stewart Living Subscribers and Our Site’s Users.


Explanation:

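Each map is basically just a heatmap of population density. Areas with more people have more subscribers and more site users, so the two maps look alike whether or not the two groups are actually related.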
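The effect of population as a confounder can be seen in a small simulation. This is a minimal sketch under made-up assumptions: 500 regions, and per-capita rates for the two groups drawn independently of each other, so any correlation in the raw counts comes from population size alone. All numbers and variable names are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

rng = random.Random(42)

# Hypothetical regions with very different population sizes.
population = [rng.uniform(1_000, 1_000_000) for _ in range(500)]

# Per-capita rates are drawn independently for each group, so the two
# groups are unrelated once population is accounted for.
subscribers = [p * rng.uniform(0.018, 0.022) for p in population]
site_users = [p * rng.uniform(0.045, 0.055) for p in population]

# Raw counts: both scale with population, so they move together.
raw_corr = pearson(subscribers, site_users)

# Per-capita rates: dividing out population removes the confounder.
per_capita_corr = pearson(
    [s / p for s, p in zip(subscribers, population)],
    [u / p for u, p in zip(site_users, population)],
)

print(f"raw counts correlation:  {raw_corr:.3f}")
print(f"per-capita correlation:  {per_capita_corr:.3f}")
```

The raw counts should be strongly correlated while the per-capita rates should show almost no correlation, which is the pattern the heatmap answer describes.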


4. What is an experimental design tool that can be used to address variables that may be confounders at the design phase of an experiment?


  • Stratifying variables.


Explanation:

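Stratifying means splitting subjects into groups (strata) by the potential confounder and then randomizing within each group, so that the confounder is balanced across the treatment and control arms.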
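Stratifying at the design phase can be sketched as follows. This is an illustrative example, not a method taken from the course: the function name, the subject records, and the choice of smoking status as the stratifying variable are all assumptions. Subjects are grouped by the potential confounder, then shuffled and alternately assigned within each group.

```python
import random
from collections import defaultdict

def stratified_randomize(subjects, stratum_of, seed=0):
    """Assign subjects to 'treatment' or 'control', balanced within each stratum.

    subjects   -- list of dicts, each with a unique "id" key (hypothetical schema)
    stratum_of -- function mapping a subject to its stratum label
    """
    rng = random.Random(seed)

    # Group subjects by the value of the potential confounder.
    strata = defaultdict(list)
    for subject in subjects:
        strata[stratum_of(subject)].append(subject)

    # Shuffle within each stratum, then alternate arms so the two arms
    # differ in size by at most one subject per stratum.
    assignment = {}
    for members in strata.values():
        rng.shuffle(members)
        for i, subject in enumerate(members):
            assignment[subject["id"]] = "treatment" if i % 2 == 0 else "control"
    return assignment

# Hypothetical subjects; smoking status is the variable we stratify on.
subjects = [{"id": i, "smoker": i % 3 == 0} for i in range(20)]
arms = stratified_randomize(subjects, lambda s: s["smoker"])

for smoker in (True, False):
    ids = [s["id"] for s in subjects if s["smoker"] == smoker]
    treated = sum(arms[i] == "treatment" for i in ids)
    print(f"smoker={smoker}: {treated} treatment, {len(ids) - treated} control")
```

Because each arm receives nearly the same number of smokers and non-smokers, smoking status cannot explain a difference in outcomes between the arms.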


5. What is the reason behind the explosion of interest in big data?


  • The price and difficulty of collecting and storing data have dramatically dropped.


Explanation:

Better networks, more servers, more sensors.


Website: http://www.ryantillis.com/