Assignment 1 (10%)

[Insert your full name]

[Insert course section & student number]

1. Read the csv files in the folder. (4 points)

micro <- read.csv(file= "USDA_Micronutrients.csv", sep= ",")
macro <- read.csv(file="USDA_Macronutrients.csv", sep =",")

2. Merge the data frames using the variable “ID”. Name the Merged Data Frame “USDA”. (4 points)

USDA = merge(macro, micro, by = "ID")
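As a small illustration of what merging on a key does, the sketch below uses two made-up tables (`macro_demo` and `micro_demo` are hypothetical stand-ins, not the USDA files). By default `merge()` keeps only the IDs found in both tables; `all = TRUE` keeps every ID and fills the gaps with NA.

```r
# Toy stand-ins for the macro and micro tables (values are made up).
macro_demo <- data.frame(ID = c(1, 2, 3), Calories = c(717, 353, 371))
micro_demo <- data.frame(ID = c(2, 3, 4), Sodium = c(1395, 560, 629))

# Inner join on the shared key: only IDs present in both tables survive.
merged_demo <- merge(macro_demo, micro_demo, by = "ID")
print(merged_demo)

# Outer join: unmatched IDs are kept and their missing columns become NA.
outer_demo <- merge(macro_demo, micro_demo, by = "ID", all = TRUE)
print(nrow(outer_demo))
```

Comparing `nrow()` before and after the merge is a quick way to see whether any records were dropped.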

3. Check the datatypes of the attributes. Delete the commas in the Sodium and Potassium records. Assign Sodium and Potassium as numeric data types. (4 points)

sapply(USDA, class)

##           ID  Description     Calories      Protein     TotalFat Carbohydrate 
##    "integer"  "character"    "integer"    "numeric"    "numeric"    "numeric" 
##       Sodium  Cholesterol        Sugar      Calcium         Iron    Potassium 
##  "character"    "integer"    "numeric"    "integer"    "numeric"  "character" 
##     VitaminC     VitaminE     VitaminD 
##    "numeric"    "numeric"    "numeric"

USDA$Sodium = gsub(",", "", USDA$Sodium)
USDA$Potassium = gsub(",", "", USDA$Potassium)
USDA$Sodium = as.numeric(USDA$Sodium)
USDA$Potassium = as.numeric(USDA$Potassium)
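The same comma-stripping step can be tried on a small made-up vector. This is only a sketch; `raw` is a hypothetical example, not data from the files. It also checks that the conversion did not create any new NA values.

```r
# Made-up values with thousands separators and one genuine missing value.
raw <- c("1,395", "560", "2,000.5", NA)

# Remove the commas, then convert the cleaned strings to numbers.
clean <- as.numeric(gsub(",", "", raw, fixed = TRUE))
print(clean)

# Count values that were present as text but failed to parse as numbers.
new_nas <- sum(is.na(clean) & !is.na(raw))
print(new_nas)
```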

4. Remove records (rows) with missing values in more than 6 attributes (columns). How many records remain in the data frame? (4 points)

na_count = apply(is.na(USDA), 1, sum)
USDA = USDA[na_count < 7,]
cat("Number of remaining records:", nrow(USDA))

## Number of remaining records: 6965
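The row-wise count of missing values can also be written with `rowSums()`. The sketch below uses a made-up three-row data frame (`demo` is hypothetical) and a threshold of one missing value instead of six.

```r
# Made-up data frame with missing values scattered across rows.
demo <- data.frame(a = c(1, NA, NA), b = c(NA, NA, 3), c = c(1, NA, 3))

# is.na() gives a logical matrix; rowSums() counts the TRUEs in each row.
na_per_row <- rowSums(is.na(demo))
print(na_per_row)

# Keep only the rows with at most one missing value.
kept <- demo[na_per_row <= 1, ]
print(nrow(kept))
```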

5. For records with missing values for Sugar, Vitamin E and Vitamin D, replace missing values with mean value for the respective variable. (4 points)

USDA$Sugar[is.na(USDA$Sugar)] = mean(USDA$Sugar[!is.na(USDA$Sugar)])
USDA$VitaminE[is.na(USDA$VitaminE)] = mean(USDA$VitaminE[!is.na(USDA$VitaminE)])
USDA$VitaminD[is.na(USDA$VitaminD)] = mean(USDA$VitaminD[!is.na(USDA$VitaminD)])
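Since the same replacement is repeated for three columns, it could be wrapped in a small helper. The function name `impute_mean` and the data frame `demo` below are made up for illustration; this is a sketch, not part of the assignment solution.

```r
# Hypothetical helper: replace NA in the named columns with the column mean.
impute_mean <- function(df, cols) {
  for (col in cols) {
    missing <- is.na(df[[col]])
    df[[col]][missing] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}

# Made-up data: Sugar and VitaminE are imputed, Iron is left untouched.
demo <- data.frame(Sugar = c(1, NA, 3), VitaminE = c(NA, 2, 4), Iron = c(NA, 1, 1))
filled <- impute_mean(demo, c("Sugar", "VitaminE"))
print(filled)
```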

6. With a single line of code, remove all remaining records with missing values. Name the new Data Frame “USDAclean”. How many records remain in the data frame? (4 points)

USDAclean = USDA[complete.cases(USDA),]
cat("Number of remaining records:", nrow(USDAclean))

## Number of remaining records: 6310

7. Which food has the highest sodium level? (4 points)

as.character(USDAclean$Description[USDAclean$Sodium == max(USDAclean$Sodium)])

## [1] "SALT,TABLE"

8. Create a histogram of Vitamin C distribution in foods. (4 points)

hist(USDAclean$VitaminC, ylim = c(1, 100), xlab = "Vitamin C", main = "Vitamin C Distribution")

9. Create one boxplot to illustrate the distribution of values for TotalFat, Protein and Carbohydrate. (4 points)

with(USDAclean, boxplot(TotalFat, Protein, Carbohydrate, names = c("TotalFat", "Protein", "Carbohydrate")))

10. Create a scatterplot to illustrate the relationship between a food’s TotalFat content and its Calorie content. (4 points)

with(USDAclean, plot(TotalFat, Calories))

11. Add a variable to the data frame that takes value 1 if the food has higher sodium than average, 0 otherwise. Call this variable HighSodium. Do the same for High Calories, High Protein, High Sugar, and High Fat. How many foods have both high sodium and high fat? (4 points)

USDAclean$HighSodium = 0
USDAclean$HighSodium[USDAclean$Sodium > mean(USDAclean$Sodium)] = 1

USDAclean$HighCalories = 0
USDAclean$HighCalories[USDAclean$Calories > mean(USDAclean$Calories)] = 1

USDAclean$HighProtein = 0
USDAclean$HighProtein[USDAclean$Protein > mean(USDAclean$Protein)] = 1

USDAclean$HighSugar = 0
USDAclean$HighSugar[USDAclean$Sugar > mean(USDAclean$Sugar)] = 1

USDAclean$HighFat = 0
USDAclean$HighFat[USDAclean$TotalFat > mean(USDAclean$TotalFat)] = 1

cat(sum(USDAclean$HighSodium == 1 & USDAclean$HighFat == 1), "foods have both high sodium and high fat.")

## 644 foods have both high sodium and high fat.
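The above-average flags can also be built in a loop over column names. The sketch below uses a made-up data frame (`demo` is hypothetical). Note that the generated column is called `HighTotalFat` here, not `HighFat`, because the name is derived from the source column.

```r
# Made-up nutrient values for three foods.
demo <- data.frame(Sodium = c(10, 20, 600), TotalFat = c(1, 30, 40))

# For each column, flag values above that column's mean (1) or not (0).
for (col in c("Sodium", "TotalFat")) {
  demo[[paste0("High", col)]] <- as.integer(demo[[col]] > mean(demo[[col]]))
}
print(demo)

# Count foods that are flagged on both variables.
both_high <- sum(demo$HighSodium == 1 & demo$HighTotalFat == 1)
print(both_high)
```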

12. Calculate the average amount of iron, for high and low protein foods. (4 points)

MeanProteinIron <- aggregate(USDAclean$Iron,list(USDAclean$HighProtein),FUN = mean) 
colnames(MeanProteinIron) <- c("low/high protein","AVG")
head(MeanProteinIron)

##   low/high protein      AVG
## 1                0 2.696634
## 2                1 3.069541
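The group means can also be computed with the formula interface of `aggregate()` or with `tapply()`. The sketch below uses made-up values (`demo` is hypothetical) so the result is easy to check by hand.

```r
# Made-up data: two low-protein foods (0) and two high-protein foods (1).
demo <- data.frame(Iron = c(1, 3, 2, 6), HighProtein = c(0, 0, 1, 1))

# Formula interface: mean Iron for each level of HighProtein.
iron_by_group <- aggregate(Iron ~ HighProtein, data = demo, FUN = mean)
print(iron_by_group)

# tapply() returns the same means as a named vector.
iron_tapply <- tapply(demo$Iron, demo$HighProtein, mean)
print(iron_tapply)
```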

13. Create a function for a “HealthCheck” program to detect unhealthy foods. Use the algorithm flowchart below as a basis. (4 points)

require(jpeg)

## Loading required package: jpeg

img<-readJPEG("HealthCheck.jpg")
plot(1:4, ty = 'n', ann = F, xaxt = 'n', yaxt = 'n')
rasterImage(img,1,1,4,4)

healthcheck = function(x) {
  if (x$HighSodium == 0) return("Pass")
  else if (x$HighSugar == 0) return("Pass")
  else if (x$HighFat == 0) return("Pass")
  else return("Fail")
}

14. Add a new variable called HealthCheck to the data frame using the output of the function. (4 points)

for (i in 1:nrow(USDAclean)) {
  USDAclean$HealthCheck[i] = healthcheck(USDAclean[i,])
}
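The function above returns "Fail" only when all three flags equal 1, so the same rule can be applied to a whole column at once with `ifelse()`. The sketch below uses a made-up data frame (`demo` is hypothetical) and assumes the flowchart logic matches the function as written.

```r
# Made-up flags for four foods; only the first has all three flags set.
demo <- data.frame(
  HighSodium = c(1, 1, 0, 1),
  HighSugar  = c(1, 0, 1, 1),
  HighFat    = c(1, 1, 1, 0)
)

# Vectorized rule: Fail only if sodium, sugar and fat are all high.
all_high <- demo$HighSodium == 1 & demo$HighSugar == 1 & demo$HighFat == 1
demo$HealthCheck <- ifelse(all_high, "Fail", "Pass")
print(table(demo$HealthCheck))
```

On a data frame with thousands of rows this avoids calling the function once per row.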

15. How many foods in the USDAclean data frame fail the HealthCheck? (4 points)

sum(USDAclean$HealthCheck == 'Fail')

## [1] 237

16. Visualize the correlation among Calories, Protein, Total Fat, Carbohydrate, Sodium and Cholesterol. (4 points)

cor(USDAclean[3:8])

##                Calories      Protein     TotalFat Carbohydrate       Sodium
## Calories     1.00000000  0.122122537  0.804495022   0.42460618  0.032321026
## Protein      0.12212254  1.000000000  0.057035611  -0.30471117 -0.003489485
## TotalFat     0.80449502  0.057035611  1.000000000  -0.12434291  0.002916089
## Carbohydrate 0.42460618 -0.304711167 -0.124342914   1.00000000  0.046838692
## Sodium       0.03232103 -0.003489485  0.002916089   0.04683869  1.000000000
## Cholesterol  0.02391933  0.269854840  0.093289601  -0.21937986 -0.017774863
##              Cholesterol
## Calories      0.02391933
## Protein       0.26985484
## TotalFat      0.09328960
## Carbohydrate -0.21937986
## Sodium       -0.01777486
## Cholesterol   1.00000000
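The matrix above is printed as numbers only. One way to turn such a matrix into a picture with base R is sketched below. It uses synthetic data (the data frame `demo` and its values are made up), and the same calls should work on `USDAclean[3:8]`.

```r
# Synthetic data in which Calories depends on fat and carbohydrate.
set.seed(1)
n <- 200
fat <- runif(n, 0, 50)
carb <- runif(n, 0, 80)
demo <- data.frame(
  Calories = 9 * fat + 4 * carb + rnorm(n, sd = 10),
  TotalFat = fat,
  Carbohydrate = carb
)
cm <- cor(demo)

# Scatterplot matrix: one panel per pair of variables.
pairs(demo, main = "Pairwise relationships (synthetic data)")

# Heatmap of the correlation matrix: blue = -1, white = 0, red = +1.
k <- ncol(cm)
pal <- colorRampPalette(c("blue", "white", "red"))(21)
image(1:k, 1:k, t(cm)[, k:1], zlim = c(-1, 1), col = pal,
      axes = FALSE, xlab = "", ylab = "", main = "Correlation heatmap")
axis(1, at = 1:k, labels = colnames(cm))
axis(2, at = 1:k, labels = rev(rownames(cm)))
```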

17. Is the correlation between Calories and Total Fat statistically significant? Why? (4 points)

cor.test(USDAclean$Calories,USDAclean$TotalFat)

## 
##  Pearson's product-moment correlation
## 
## data:  USDAclean$Calories and USDAclean$TotalFat
## t = 107.58, df = 6308, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7956139 0.8130305
## sample estimates:
##      cor 
## 0.804495

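#The correlation between Calories and Total Fat is statistically significant because
#the p-value of the test (< 2.2e-16) is far below 0.05 and the 95 percent confidence interval excludes 0,
#so we reject the null hypothesis that the true correlation is 0.
#The coefficient of about 0.804 also indicates a strong positive relationship.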
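Significance is read from the p-value, not from the size of the coefficient. The sketch below uses synthetic data (all variable names and values are made up) to show that even a weak correlation can be statistically significant when the sample is large.

```r
# Synthetic data: y depends only weakly on x, but the sample is large.
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- 0.1 * x + rnorm(n)

res <- cor.test(x, y)

# The estimate is small, yet the p-value is compared against 0.05.
print(res$estimate)
print(res$p.value)
is_significant <- res$p.value < 0.05
print(is_significant)
```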

18. Create a Linear Regression Model, using Calories as the dependent variable Protein, Total Fat, Carbohydrate, Sodium and Cholesterol as the independent variables. (4 points)

MOD=summary(lm(Calories~Protein+TotalFat+Carbohydrate+Sodium+Cholesterol,data=USDAclean))
print(MOD)

## 
## Call:
## lm(formula = Calories ~ Protein + TotalFat + Carbohydrate + Sodium + 
##     Cholesterol, data = USDAclean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -191.087   -3.832    0.426    5.147  291.011 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.9882753  0.4832629   8.253  < 2e-16 ***
## Protein      3.9891994  0.0233550 170.807  < 2e-16 ***
## TotalFat     8.7716980  0.0143291 612.158  < 2e-16 ***
## Carbohydrate 3.7432001  0.0091404 409.522  < 2e-16 ***
## Sodium       0.0003383  0.0002189   1.545    0.122    
## Cholesterol  0.0110138  0.0019861   5.545 3.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.92 on 6304 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9877 
## F-statistic: 1.009e+05 on 5 and 6304 DF,  p-value: < 2.2e-16

19. Which independent variable is the least significant? Why? (4 points)

summary(aov(Calories~Protein+TotalFat+Carbohydrate+Sodium+Cholesterol,data = USDAclean))

##                Df    Sum Sq   Mean Sq   F value   Pr(>F)    
## Protein         1   2728899   2728899 7.620e+03  < 2e-16 ***
## TotalFat        1 116762840 116762840 3.260e+05  < 2e-16 ***
## Carbohydrate    1  61215495  61215495 1.709e+05  < 2e-16 ***
## Sodium          1       789       789 2.203e+00    0.138    
## Cholesterol     1     11014     11014 3.075e+01 3.05e-08 ***
## Residuals    6304   2257685       358                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#ANSWER: Sodium is the least significant variable. It has the lowest F value (2.203) and the highest p-value (0.138) of all the independent variables. Because its p-value is above 0.05, Sodium is not statistically significant.

20. Create a new model by using only the significant independent variables. (4 points)

MOD=summary(lm(Calories~Protein+TotalFat+Carbohydrate+Cholesterol,data=USDAclean))
print(MOD)

## 
## Call:
## lm(formula = Calories ~ Protein + TotalFat + Carbohydrate + Cholesterol, 
##     data = USDAclean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -191.220   -3.787    0.464    5.104  290.922 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.077907   0.479822   8.499  < 2e-16 ***
## Protein      3.989679   0.023355 170.824  < 2e-16 ***
## TotalFat     8.771904   0.014330 612.131  < 2e-16 ***
## Carbohydrate 3.743859   0.009131 409.996  < 2e-16 ***
## Cholesterol  0.010980   0.001986   5.528 3.36e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.93 on 6305 degrees of freedom
## Multiple R-squared:  0.9877, Adjusted R-squared:  0.9876 
## F-statistic: 1.261e+05 on 4 and 6305 DF,  p-value: < 2.2e-16

21. A new product is just produced with the following data: Protein=0.1, TotalFat=37, Carbohydrate=400, Cholesterol=75, Sugar=NA, Calcium=35, Iron=NA, Potassium=35, VitaminC=10, VitaminE=NA, VitaminD=NA. Based on the new model you created, what is the predicted value for Calories? (4 points)

# Coefficients are taken from the reduced model in question 20 (Sodium excluded).
pred_value=4.077907+(0.1)*3.989679+(37)*8.771904+(400)*3.743859+(75)*0.010980
print(pred_value)

## [1] 1827.404

#The predicted value for Calories is about 1827.404
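A prediction like this can also be obtained with `predict()` instead of typing the coefficients by hand. The sketch below fits a model on synthetic data (the data frame `train` and its values are made up, so the coefficients differ from the USDA model) and checks that `predict()` agrees with the manual sum.

```r
# Synthetic training data with a known linear relationship plus noise.
set.seed(7)
n <- 300
train <- data.frame(
  Protein = runif(n, 0, 30),
  TotalFat = runif(n, 0, 50),
  Carbohydrate = runif(n, 0, 80)
)
train$Calories <- 4 * train$Protein + 9 * train$TotalFat +
  4 * train$Carbohydrate + rnorm(n, sd = 5)

fit <- lm(Calories ~ Protein + TotalFat + Carbohydrate, data = train)

# The new product goes in a one-row data frame with matching column names.
new_food <- data.frame(Protein = 0.1, TotalFat = 37, Carbohydrate = 400)
pred <- predict(fit, newdata = new_food)

# Manual check: intercept plus coefficient times value for each term.
manual <- sum(coef(fit) * c(1, 0.1, 37, 400))
print(c(predict = unname(pred), manual = manual))
```

Using `predict()` avoids copying rounded coefficients, and columns of `newdata` that are not in the model formula are ignored.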

22. If the Carbohydrate amount increases from 400 to 40000 (10000% increase), how much change will occur on Calories in percent? Explain why? (4 points)

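# Using the reduced model from question 20; only Carbohydrate changes (400 -> 40000).
base_value=4.077907+(0.1)*3.989679+(37)*8.771904+(400)*3.743859+(75)*0.010980
pred_value=4.077907+(0.1)*3.989679+(37)*8.771904+(40000)*3.743859+(75)*0.010980
print(pred_value)

## [1] 150084.2

percentIncrease=(pred_value-base_value)/base_value*100
print(percentIncrease)

## [1] 8112.972

# Calories rises by about 8113 percent. The model is linear, so the change in Calories equals the Carbohydrate coefficient times the change in Carbohydrate (3.743859 * 39600 = 148256.8). The percent change in Calories is smaller than the percent change in Carbohydrate because the intercept and the other terms stay fixed. Carbohydrate is a highly significant predictor (p < 2e-16), so changing it has a large effect on the prediction.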

CMTH 642 Data Analytics: Advanced Methods

23. Prepare an exploratory data analysis question about the dataset. Write a code to answer your question. Visualise your answer. (Your question should be related to at least three attributes) (12 points)