About

Clearly show and organise your work. Each answer should appear below the question marked in red. Be careful not to modify existing text and code. For multiple choice questions you can place an (x) mark next to the text to mark your selection. Where necessary execute the already included code chunk, and add yours to perform required calculations. Best to view first in Preview mode!

Part 1: Multiple Questions (15 points)

Answer the below by typing your answer following each question. Each question is graded as marked.

1. Which one(s) of the following statements is(are) true with regards to a star schema? (1pt)

  1. Each table has to have a primary key

  2. Fact table is the biggest transactional table

  3. Dimension tables are best suited for categorical data

  4. All dimension tables are connected to fact table

  5. All above

#E

2. A set of data has a median much larger than the mean. What can we say about the data distribution? (1pt)

  1. Skew left

  2. Symmetrical

  3. Skew Right

  4. Normal

#A

3. Entering Male into a column field (attribute) labeled Marital Status is what type of business logic integrity violation? (1pt)

  1. Field integrity

  2. Relational integrity

#A

4. Which of the below is not a suited format for textual data? (1pt)

  1. JSON

  2. HTML

  3. JPEG

  4. ASCII

  5. CSV

#C

5. The term high-fidelity data was used in class to mean what exactly? (1pt)

  1. Preparing the data

  2. Streaming the data

  3. Preserving Data Integrity

  4. Garbage-in/Garbage Out

#C

6. Mark the two statements where the combination of words is inconsistent (1pt)

  1. Explanatory Quantitative

  2. Exploratory Qualitative

  3. Explanatory Qualitative

  4. Exploratory Quantitative

#C and D

7. A target function describes the relationship of independent variable as function of dependent variables (1pt)

  1. True

  2. False

#A

8. You are given a 4 x 4 correlation matrix. What is the maximum number of distinct correlation values? (1pt)

#The maximum number of distinct correlation values is 7

9. Which of the below is a qualitative description of the data? (1pt)

  1. Scatter plot

  2. Mean and Range

  3. Linear regression line

  4. Total number of samples

#B

10. You are given the following linear regression model describing the cost C of heating bill in $ versus the temperature T in Fahrenheit: C = 2T + 50. What are the units of the intercept and the slope? (2pts)

#The unit of the intercept is dollar.The unit of the slope is dollar per Fahrenheit.

11. As part of a linear optimization marketing problem, you are given a constraint in reference to two decisions variables X and Y representing the amount of dollars spent on Radio ads (X) and TV ads (Y). The amount of dollars spent on Radio ads should be at most 30% of the total amount of dollars spent on ads. Write the mathematical formulation for the constraint. (2pts)

#X-0.3Y<=0

12. The closer data values are to the mean, the more potential for noise to occur (1pt)

  1. True

  2. False

#B

13. You are given a data set with 100 entries. Twenty entries have zero values. When do we use the formula sum(data)/100 to compute the average of the data set? (1pt)

  1. When zero values mean true values

  2. When zero values represent missing values

  3. When zero values mean true as well as missing values

  4. None of the above

#A

Problem 2: Linear Regression (15 points)

Residential customers in Austin consume an average of about 1,000 kWh of electricity per month, with usage higher during hot summer months and lower in the winter. Included data show monthly average usage in kWh by month for residential customers taken over a period from 2000-2016, and include monthly electric bill amounts.

mydata = read.csv("energy.csv", header=TRUE, sep=",")
head(mydata)
##                  date average_kwh average_bill
## 1 2016-03-01T00:00:00         602        55.46
## 2 2016-02-01T00:00:00         663        61.99
## 3 2016-01-01T00:00:00         788        75.38
## 4 2015-12-01T00:00:00         657        61.34
## 5 2015-11-01T00:00:00         673        63.07
## 6 2015-10-01T00:00:00         987        99.97

Answer the below by typing your answer following each question. Each question is graded as marked.

1. Does the data on energy Kwh consumption contain any outliers?. Show your calculations, and explain your answer (2pts)

# prepare data
kwh = mydata$average_kwh
bill = mydata$average_bill
# calculate IQR
lowerq = quantile(kwh)[2]
upperq = quantile(kwh)[4]
iqr = upperq - lowerq

# compute upper threshold and lower threshold
upperthreshold = (iqr * 1.5) + upperq 
upperthreshold
##     75% 
## 1897.25
lowerthreshold = lowerq - (iqr * 1.5)
lowerthreshold
##   25% 
## 31.25
sum(kwh > upperthreshold)
## [1] 0
sum(kwh < lowerthreshold)
## [1] 0
boxplot(kwh)

threshold and lower threshold for energy consumption are 1897.25 and 31.25, respectively. Since no value is greater than 1897.25, nor less than 31.25, indicating that no outliers for the data on energy Kwh consumption. Besides, we have a boxplot to convince our conclusion.

2. Draw a scatter plot. Clearly label the x- and y- axis (2pts)

average_kwh <- mydata$average_kwh
average_bill <- mydata$average_bill
plot(average_kwh,average_bill)

3. Calculate the correlation coefficient between the energy consumption and monthly bill. What can you tell about the relationship? (2pts)

cor(average_kwh,average_bill)
## [1] 0.93997

#The correlation coefficient of these two variables is relatively large and positive, indicating that the energy consumption and monthly bill are positively correlated, and the correlation is strong.

4. Derive a linear regression model and display the summary statistics . Separately, write down the linear equation representation. (3pts)

fit <- lm(average_bill~average_kwh)
summary(fit)
## 
## Call:
## lm(formula = average_bill ~ average_kwh)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.386  -5.846   0.649   5.985  38.407 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.676177   2.982475  -7.268 8.83e-12 ***
## average_kwh   0.114984   0.003005  38.266  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.25 on 193 degrees of freedom
## Multiple R-squared:  0.8835, Adjusted R-squared:  0.8829 
## F-statistic:  1464 on 1 and 193 DF,  p-value: < 2.2e-16

#The linear equation is average_bill=-21.6761772 + 0.1149845*average_kwh or y=-21.676+0.115x,while x represents the average kwh while y represents the average bill.

5. Interpret the results for the intercept and the slope in the context of the problem. Specify the units when doing so (2pts)

#The unit of intercept is dollar and the unit of slope is dollar per kWh. The intercept is -21.676, which means that if the residential customer in Austin don’t consume electricity will receive $21.676, obviously, it is impossible. The slope is 0.115, which means that the bill will increase 0.115 on average for every kWh increase consuming of electricity.

6. Calculate the error squared in the predicted energy bill if the average consumption is 903 Kwh?(2pts)

ind <- which(average_kwh==903)
resid <- residuals(fit)[ind]
error2 <- resid^2
error2
##       39 
## 2.059822

#The error squared in the predicted energy bill is 2.059822.

7. A simple linear regression model always passes through the mean of x and y. Show calculations to prove this is true in this case (2pts)

bill_bar <- mean(average_bill)
cat("The mean of y/bill is",bill_bar)
## The mean of y/bill is 88.20651
kWh_bar <- mean(average_kwh)
intercept <- fit$coefficients[1]
slope <- fit$coefficients[2]
pred <- intercept+slope*kWh_bar
cat("\nThe prediction of y/bill when x is",kWh_bar, "(the mean of x/kWh)",pred)
## 
## The prediction of y/bill when x is 955.6308 (the mean of x/kWh) 88.20651

#Because the mean of y/bill is equal to the prediction of y/bill when x is the mean of x/kWh, so the linear regression model passes through the mean of x and y.

Data Sources