DSCI 100 Assignment 1

Complete the questions below. Compile your answers using Markdown and upload to your RPubs page. Submit a link to your RPubs page (through MOODLE) for grading.

Question 1

Write a few sentences about each of the 5 V’s. Explain why they are important in the context of Big Data. Split each of the V’s into their own “subsection” in Markdown. Ensure that you reference any works that you’ve used in this research.

Volume:

Volume refers to the vast amount of data generated every second. According to Bernard Marr, the amount of data generated per minute in 2015 was the same as the amount generated from the beginning of time up to 2000. Because of this enormous amount of data, we need big data technology to store and analyze it.

Velocity:

Velocity is the speed at which data are generated and moved around. Nowadays, that speed is very high: a post on Facebook takes only seconds to publish, and trading systems analyze social media in milliseconds to find triggers for buying or selling stocks. At such speeds, we need big data technology to keep up.

Variety:

Variety refers to the types of data: structured, unstructured, and semi-structured. In the past, most data were structured, i.e. they were in the form of tables or relational databases. That is no longer the case: most of the world’s data are now unstructured. Big data technology can deal with this variety of types by converting data such as text messages, photos, and videos into a more traditional form.

Veracity:

Veracity is the trustworthiness or messiness of the data. Data can be hard to control. For instance, on social media, people can write nearly whatever they want. Their posts might contain grammatical errors, typos, and arbitrary hashtags that they believe relate to the post.

Value:

Value is the benefit that data can bring. Without value, there is no point in collecting, storing, and analyzing data in the first place. Data science tries to turn data into something useful.

References: link, link

Question 2

Which of the 5 V’s do you believe is the most important? Which one will be the one we need to worry about/ deal with most in the future?

I think the most important V is Value. If data do not bring benefits or insight, there is no point in keeping and analyzing them. Data are generated very fast, in large amounts, and in many varieties, which makes them costly to handle. Therefore, it is a good idea to reserve our resources for the data that have value.

Value is also the V we will deal with most in the future. We now all have access to data and places to store them. However, not many of us know how to turn data into something useful.

Question 3

a. You have a file that is 4.5 petabytes. How many terabytes is that file?

\[4.5PB = 4.5PB * \frac{10^{15}B}{1PB} * \frac{1TB}{10^{12}B} = 4.5*10^3TB\]

b. You have a file that is 6.813 zettabytes. How many exabytes is that file?

\[6.813ZB = 6.813ZB * \frac{10^{21}B}{1ZB} * \frac{1EB}{10^{18}B} = 6.813 * 10^3EB\]

c. You have a file that is 50,000,000 petabytes. How many zettabytes is that file?

\[5*10^7PB = 5*10^7PB * \frac{10^{15}B}{1PB} * \frac{1ZB}{10^{21}B} = 50ZB\]
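The three conversions above follow the same pattern: go from the source unit to bytes, then from bytes to the target unit. A minimal Python sketch of that pattern, assuming decimal (SI) prefixes as in the answers above; the `convert` helper and the `UNIT_EXPONENT` table are illustrative names, not part of the assignment:

```python
from fractions import Fraction

# Power of ten (in bytes) for each unit. Assumes decimal SI prefixes,
# matching the 10^12, 10^15, 10^18 and 10^21 factors used in the answers.
UNIT_EXPONENT = {"B": 0, "KB": 3, "MB": 6, "GB": 9,
                 "TB": 12, "PB": 15, "EB": 18, "ZB": 21}

def convert(value, from_unit, to_unit):
    """Convert value from from_unit to to_unit by going through bytes."""
    shift = UNIT_EXPONENT[from_unit] - UNIT_EXPONENT[to_unit]
    # Exact rational arithmetic avoids floating-point rounding in the result.
    return Fraction(str(value)) * Fraction(10) ** shift

print(float(convert(4.5, "PB", "TB")))         # part a
print(float(convert(6.813, "ZB", "EB")))       # part b
print(float(convert(50_000_000, "PB", "ZB")))  # part c
```

The results agree with the hand calculations: 4500 TB, 6813 EB and 50 ZB.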

Question 4

Here are 3 data points: \((0,4), (1,8)\) and \((2,10)\). You can perfectly fit these points to the equation \[y = ax^2 + bx + c.\] If we are using this model for prediction, explain what we’ve done wrong and what a better model might be.

First, let us plot the 3 data points:

From the graph, we would predict that the next point, at \(x = 3\), will have a \(y\) value between 12 and 16.

However, we know that an equation of the form \(y = ax^2 + bx + c\) fits the 3 data points we have exactly. Substituting the points gives:

\[ c = 4 \] \[a + b + c = 8 \] \[4a + 2b + c = 10 \]

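Solving this system gives \(a = -1\), \(b = 5\), \(c = 4\), i.e. \(y = -x^2 + 5x + 4\). Putting \(x = 3\) into the equation gives \(y = 10\), and continuing with \(x = 4\) gives \(y = 8\). The predictions turn downward even though the data are increasing, so this is not a good model for prediction. What we did wrong was to ignore error: with three parameters and three points, the curve passes through every point exactly and treats any noise in the data as real signal. We should leave some room for error.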
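The quadratic and its predictions can be checked numerically. The sketch below is one possible way to do it, using Newton divided differences to find the parabola through the three points; the function names `fit_quadratic` and `predict_quadratic` are made up for this illustration:

```python
from fractions import Fraction

# The three data points from the question.
POINTS = [(0, 4), (1, 8), (2, 10)]

def fit_quadratic(pts):
    """Return (a, b, c) of the parabola y = a*x^2 + b*x + c through three points."""
    (x0, y0), (x1, y1), (x2, y2) = pts
    d01 = Fraction(y1 - y0, x1 - x0)   # slope between the first two points
    d12 = Fraction(y2 - y1, x2 - x1)   # slope between the last two points
    a = (d12 - d01) / (x2 - x0)        # second divided difference
    b = d01 - a * (x0 + x1)
    c = y0 - d01 * x0 + a * x0 * x1
    return a, b, c

def predict_quadratic(coeffs, x):
    """Evaluate the fitted parabola at x."""
    a, b, c = coeffs
    return a * x**2 + b * x + c

coeffs = fit_quadratic(POINTS)
print("a, b, c =", *coeffs)
for x in range(3, 7):
    print(x, predict_quadratic(coeffs, x))
```

For \(x = 3\) and \(x = 4\) this reproduces the values 10 and 8, and the predictions keep falling for larger \(x\), which illustrates why an exact fit extrapolates poorly here.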

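A better model is a linear equation, \(y = mx + b\). Fitting a line to the three points (the least-squares line) gives \(y = 3x + \frac{13}{3}\). At \(x = 3\) it predicts \(y = \frac{40}{3} \approx 13.3\), which falls within the range read off the graph.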
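The answer does not say how the line was calculated; its coefficients match an ordinary least-squares fit, so the sketch below assumes that method. The `fit_line` helper is an illustrative name, and exact fractions are used so the intercept is not rounded:

```python
from fractions import Fraction

# The three data points from the question.
POINTS = [(0, 4), (1, 8), (2, 10)]

def fit_line(pts):
    """Ordinary least-squares line y = slope*x + intercept (assumed method)."""
    n = len(pts)
    mean_x = Fraction(sum(x for x, _ in pts), n)
    mean_y = Fraction(sum(y for _, y in pts), n)
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line(POINTS)
print("slope =", slope, "intercept =", intercept)
print("prediction at x = 3:", slope * 3 + intercept)
```

Unlike the quadratic, the line does not pass through every point; it accepts a small error at each point in exchange for a trend that keeps increasing with \(x\).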