Introduction to Analytics Edge

Prevalence of Data

  • 2.7 Zettabytes of electronic data exist in the world today - 2,700,000,000,000,000,000,000 bytes
    • This is equal to the storage required for more than 200 billion HD movies
  • New data is produced at an exponential rate
  • Decoding the human genome originally took 10 years to process; now it can be achieved in one week

Data and Analytics are Useful

  • A shortage of 140,000 - 190,000 people with deep analytical skills was projected for
    U.S. jobs by 2018
  • IBM has invested over $20 billion since 2005 to grow its analytics business
  • Companies will invest more than $120 billion by 2015 on analytics, hardware, software and services
  • Critical in almost every industry
    • Healthcare, media, sports, finance, government, etc.

Definition of Analytics

The science of using data to build models that lead to better decisions that add value to individuals, to companies, and to institutions.

Examples of Data Analytics Used

IBM Watson

  • Watson is a supercomputer with 3,000 processors and a database of 200 million pages of information
  • Watson combined many algorithms to increase accuracy and confidence
  • Approached the problem differently from how a human would
  • Deals with massive amounts of data, often in unstructured form
    • 90% of data in the world is unstructured

eHarmony

  • Online dating site focused on long term relationships
  • Relies much more on data than other dating sites
  • Suggests a limited number of high quality matches
    • Users don’t have to search and dig through profiles
  • eHarmony has leveraged analytics to build a thriving business
    • 14% of US online dating market

The Framingham Heart Study

  • Much of the now-common knowledge regarding heart disease came from this study
  • Provided necessary evidence for the development of drugs to lower blood pressure
  • Paved the way for other clinical prediction rules
    • Predict clinical outcomes using patient data
  • A model allows medical professionals to make predictions for patients worldwide

D2Hawkeye

  • Combined data with analytics to improve quality and cost management in healthcare
  • Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention
  • Use expert knowledge to identify new variables and refine existing variables
  • Can make predictions for millions of patients without manually reading patient files

An Introduction to R

What is R?

  • A software environment for data analysis, statistical computing, and graphics
  • A programming language

In the next section, the basic operations and functions used in R for data analysis are explored.

Basic Calculations

# Basic Calculations
8*6
## [1] 48
2^16
## [1] 65536
8*10
## [1] 80

Functions

# Mathematical functions
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65

Variables

# Setting variables
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
# Identifies the stored variables
ls()
## [1] "HoursYear"   "SquareRoot2"

Vectors

Two vectors, Country and LifeExpectancy, are created and then indexed to display specific elements. A third vector, Sequence, holds the values from 0 to 100 in increments of 2.

# Create Vectors
c(2,3,5,8,13)
## [1]  2  3  5  8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil"      "China"       "India"       "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80  82  84  86  88  90  92  94  96
## [50]  98 100
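
A few related indexing forms can be sketched as follows. This is a minimal illustration that reuses the values above; the variable names ending in Demo are hypothetical and not part of the original lecture code.

```r
# Illustrative vectors (same values as in the section above)
CountryDemo = c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancyDemo = c(74, 76, 65, 83, 79)

# R indexes from 1, so [1] is the first element
CountryDemo[1]

# A vector of positions selects several elements at once
CountryDemo[c(1, 3)]

# A logical condition selects the elements where it is TRUE
CountryDemo[LifeExpectancyDemo > 75]

# seq(from, to, by) includes both endpoints when they fit the step
SequenceDemo = seq(0, 100, 2)
length(SequenceDemo)
```

The length of 51 agrees with the printed Sequence output, whose second row starts at position 50 and holds two values.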

Data Frames

The data.frame CountryData is built from the two vectors Country and LifeExpectancy, and a third column, CountryData$Population, is then added to it. A second data.frame, NewCountryData, is created for two more countries, and rbind stacks the two into AllCountryData.

# Create data frames
CountryData = data.frame(Country, LifeExpectancy)
CountryData
##       Country LifeExpectancy
## 1      Brazil             74
## 2       China             76
## 3       India             65
## 4 Switzerland             83
## 5         USA             79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
##     Country LifeExpectancy Population
## 1 Australia             82      23050
## 2    Greece             81      11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
## 6   Australia             82      23050
## 7      Greece             81      11125
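
The rbind step works because both data.frames have the same column names. A minimal sketch of that requirement, using hypothetical data.frames (FirstDemo, SecondDemo, MismatchDemo) that are not part of the original code:

```r
# Two small data frames with the same column names (illustrative values)
FirstDemo = data.frame(Country = c("Brazil", "China"), LifeExpectancy = c(74, 76))
SecondDemo = data.frame(Country = "Greece", LifeExpectancy = 81)

# rbind() stacks rows; columns are matched by name
CombinedDemo = rbind(FirstDemo, SecondDemo)
nrow(CombinedDemo)

# A data frame whose column names differ cannot be stacked this way;
# tryCatch() turns the resulting error into a plain string for inspection
MismatchDemo = data.frame(Nation = "Australia", LifeExpectancy = 82)
ResultDemo = tryCatch(rbind(FirstDemo, MismatchDemo), error = function(e) "error")
ResultDemo
```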

Loading csv files

The dataset WHO.csv is loaded into the variable WHO. The str command shows the structure of the data.frame (number of observations, variable names, and types), and the summary command gives a statistical summary of each variable.

# Load dataset
WHO = read.csv("WHO.csv")
# Output the structure of the dataset
str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
# Output the summary of the dataset
summary(WHO)
##                 Country                      Region     Population         Under15          Over60      FertilityRate   LifeExpectancy  ChildMortality    CellularSubscribers  LiteracyRate  
##  Afghanistan        :  1   Africa               :46   Min.   :      1   Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00   Min.   :  2.200   Min.   :  2.57      Min.   :31.10  
##  Albania            :  1   Americas             :35   1st Qu.:   1696   1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00   1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790   Median :28.65   Median : 8.53   Median :2.400   Median :72.50   Median : 18.600   Median : 97.75      Median :91.80  
##  Andorra            :  1   Europe               :53   Mean   :  36360   Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01   Mean   : 36.149   Mean   : 93.64      Mean   :83.71  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535   3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00   3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000   Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00   Max.   :181.600   Max.   :196.41      Max.   :99.80  
##  (Other)            :188                                                                                NA's   :11                                        NA's   :10          NA's   :91     
##       GNI        PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   :  340   Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 2335   1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 7870   Median : 94.70              Median : 95.10               
##  Mean   :13321   Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.:17558   3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :86440   Max.   :100.00              Max.   :100.00               
##  NA's   :32      NA's   :93                  NA's   :93
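
The str output above lists Country and Region as Factor columns. Since R 4.0.0, read.csv no longer converts text columns to factors by default, so a newer R version may show chr instead. The sketch below illustrates the difference on a tiny temporary file; the file contents and the names DemoPath, DemoFactors, and DemoStrings are assumptions for illustration, not the real WHO.csv.

```r
# Write a tiny illustrative CSV to a temporary file (not the real WHO.csv)
DemoPath = tempfile(fileext = ".csv")
writeLines(c("Country,Region,Over60",
             "Albania,Europe,14.93",
             "Angola,Africa,3.84",
             "Austria,Europe,23.52"), DemoPath)

# stringsAsFactors = TRUE gives factor columns, as in the output above;
# stringsAsFactors = FALSE gives character columns instead
DemoFactors = read.csv(DemoPath, stringsAsFactors = TRUE)
DemoStrings = read.csv(DemoPath, stringsAsFactors = FALSE)

class(DemoFactors$Region)
class(DemoStrings$Region)
nlevels(DemoFactors$Region)
```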

Subsetting

The WHO dataset is filtered with subset, keeping only the rows whose Region is Europe, and the result is stored in WHO_Europe.

# Subset the dataset with the region in Europe
WHO_Europe = subset(WHO, Region == "Europe")
# Output the structure of the dataset
str(WHO_Europe)
## 'data.frame':    53 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Population                   : int  3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
##  $ Under15                      : num  21.3 15.2 20.3 14.5 22.2 ...
##  $ Over60                       : num  14.93 22.86 14.06 23.52 8.24 ...
##  $ FertilityRate                : num  1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
##  $ LifeExpectancy               : int  74 82 71 81 71 71 80 76 74 77 ...
##  $ ChildMortality               : num  16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
##  $ CellularSubscribers          : num  96.4 75.5 103.6 154.8 108.8 ...
##  $ LiteracyRate                 : num  NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
##  $ GNI                          : num  8820 NA 6100 42050 8960 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
##  $ PrimarySchoolEnrollmentFemale: num  NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
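
Note that Country still reports 194 levels after subsetting, because a factor keeps its unused levels. A minimal sketch of this behavior and of the equivalent bracket notation, using a hypothetical three-row data.frame (DemoData) rather than the WHO data:

```r
# Illustrative data frame with explicit factor columns
DemoData = data.frame(
  Country = factor(c("Albania", "Angola", "Austria")),
  Region = factor(c("Europe", "Africa", "Europe")),
  Over60 = c(14.93, 3.84, 23.52)
)

# subset() keeps the rows where the condition is TRUE
EuropeDemo = subset(DemoData, Region == "Europe")

# Equivalent bracket form: rows by logical condition, all columns
EuropeDemo2 = DemoData[DemoData$Region == "Europe", ]

# Subsetting keeps unused factor levels; droplevels() removes them
nlevels(EuropeDemo$Region)
nlevels(droplevels(EuropeDemo)$Region)
```

The two forms agree here because the condition contains no missing values; when it does, subset drops those rows while the bracket form returns rows of NA.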

Writing csv files

# Write the Europe subset to a new comma-separated values (CSV) file
write.csv(WHO_Europe, "WHO_Europe.csv")
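
By default, write.csv also writes the row names as an extra first column. The sketch below shows the effect of row.names = FALSE on a hypothetical two-row data.frame written to temporary files; none of these names come from the original code.

```r
DemoData = data.frame(Country = c("Albania", "Austria"), Over60 = c(14.93, 23.52))

# Default: row names are written as an extra, unnamed first column
WithRowNames = tempfile(fileext = ".csv")
write.csv(DemoData, WithRowNames)
names(read.csv(WithRowNames))

# row.names = FALSE writes only the data columns
WithoutRowNames = tempfile(fileext = ".csv")
write.csv(DemoData, WithoutRowNames, row.names = FALSE)
names(read.csv(WithoutRowNames))
```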

Removing variables

# Remove the WHO_Europe variable from the workspace
rm(WHO_Europe)

Basic Data Analysis

Basic data analysis commands provide statistical summaries of a variable and identify observations of interest, such as the countries with the smallest and largest values.

# Basic data analysis
mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.72   28.65   28.73   37.75   49.99
# Find which data point is the minimum and find its index
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin ... Zimbabwe
# Find which data point is the maximum and find its index
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin ... Zimbabwe
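
The lookup can also be written without a hard-coded row number by using the result of which.min or which.max directly as the index. A minimal sketch on a hypothetical three-row data.frame; the Japan and Niger values match the output above, while the Brazil value is made up for illustration.

```r
DemoData = data.frame(
  Country = c("Japan", "Niger", "Brazil"),
  Under15 = c(13.12, 49.99, 24.56)   # Brazil value is illustrative only
)

# which.min() returns the position of the smallest value
which.min(DemoData$Under15)

# The position can be used directly as an index
DemoData$Country[which.min(DemoData$Under15)]
DemoData$Country[which.max(DemoData$Under15)]
```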

Scatterplot

# Scatterplot
plot(WHO$GNI, WHO$FertilityRate)

Subsetting Outliers

# Subsetting outliers
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5) 
# Count the number of observations
nrow(Outliers)
## [1] 7
# Display only the columns of interest
Outliers[c("Country","GNI","FertilityRate")]
##               Country   GNI FertilityRate
## 23           Botswana 14550          2.71
## 56  Equatorial Guinea 25620          5.04
## 63              Gabon 13740          4.18
## 83             Israel 27110          2.92
## 88         Kazakhstan 11250          2.52
## 131            Panama 14510          2.52
## 150      Saudi Arabia 24700          2.76
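
Filtering rows and choosing columns can also be combined in a single subset call through its select argument. The sketch below is one possible alternative, not the lecture's method; DemoData is hypothetical, and the Over60 values for Botswana and Israel are made up.

```r
DemoData = data.frame(
  Country = c("Botswana", "Israel", "Albania"),
  GNI = c(14550, 27110, 8820),
  FertilityRate = c(2.71, 2.92, 1.75),
  Over60 = c(5.6, 15.2, 14.93)   # extra column, partly illustrative values
)

# & requires both conditions to hold; select keeps only the named columns
OutliersDemo = subset(DemoData, GNI > 10000 & FertilityRate > 2.5,
                      select = c(Country, GNI, FertilityRate))
OutliersDemo
```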

Histograms

# Histogram
hist(WHO$CellularSubscribers)

Boxplot

# Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)

boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")

Summary Tables

The table command counts the number of countries in each region of the WHO data.frame. The tapply command splits one variable into groups defined by a second variable and applies a summary function, such as mean or min, to each group.

# Load knitr, which provides kable() for formatted tables
library(knitr)
# Count the number of countries in each region
z = table(WHO$Region)
kable(z)
Var1                    Freq
Africa                    46
Americas                  35
Eastern Mediterranean     22
Europe                    53
South-East Asia           11
Western Pacific           27
# Compare groups using a statistical measure
z = tapply(WHO$Over60, WHO$Region, mean)
kable(z)
                                x
Africa                   5.220652
Americas                10.943714
Eastern Mediterranean    5.620000
Europe                  19.774906
South-East Asia          8.769091
Western Pacific         10.162963
z = tapply(WHO$LiteracyRate, WHO$Region, min)
kable(z)
                          x
Africa                   NA
Americas                 NA
Eastern Mediterranean    NA
Europe                   NA
South-East Asia          NA
Western Pacific          NA
z = tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
kable(z)
                            x
Africa                   31.1
Americas                 75.2
Eastern Mediterranean    63.9
Europe                   95.2
South-East Asia          56.8
Western Pacific          60.6
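
The NA results above arise because min returns NA when any value in a group is missing; na.rm = TRUE is passed through tapply to the summary function. A minimal sketch on two hypothetical vectors whose values are illustrative and not taken from WHO.csv:

```r
Region = c("Africa", "Africa", "Europe", "Europe", "Europe")
LiteracyRate = c(70.1, NA, 99.6, NA, 97.9)

# Without na.rm, any group containing an NA yields NA
tapply(LiteracyRate, Region, min)

# Extra arguments after the function are passed on to it
MinByRegion = tapply(LiteracyRate, Region, min, na.rm = TRUE)
MinByRegion

# table() counts how many observations fall in each group
table(Region)
```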

Understanding Food USDA Dataset

In the following section, the United States Department of Agriculture (USDA) dataset on the dietary contents of food is examined.

Loading in the Dataset

Read the csv file

# Load in the dataset
  USDA = read.csv("USDA.csv")

Structure of the dataset

# Output the structure of the dataset
  str(USDA)
## 'data.frame':    7058 obs. of  16 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...

Statistical summary

# Outputs the summary
z = summary(USDA)
kable(z)
       ID         Description
Min.   : 1001   BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA:    2
1st Qu.: 8387   CAMPBELL,CAMPBELL’S SEL MICROWAVEABLE BOWLS,HEA             :    2
Median :13294   OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT    :    2
Mean   :14260   POPCORN,OIL-POPPED,LOFAT                                    :    2
3rd Qu.:18337   ABALONE,MIXED SPECIES,RAW                                   :    1
Max.   :93600   ABALONE,MXD SP,CKD,FRIED                                    :    1
                (Other)                                                     : 7048
Calories        Protein         TotalFat         Carbohydrate     Sodium
Min.   :  0.0   Min.   : 0.00   Min.   :  0.00   Min.   :  0.00   Min.   :    0.0
1st Qu.: 85.0   1st Qu.: 2.29   1st Qu.:  0.72   1st Qu.:  0.00   1st Qu.:   37.0
Median :181.0   Median : 8.20   Median :  4.37   Median :  7.13   Median :   79.0
Mean   :219.7   Mean   :11.71   Mean   : 10.32   Mean   : 20.70   Mean   :  322.1
3rd Qu.:331.0   3rd Qu.:20.43   3rd Qu.: 12.70   3rd Qu.: 28.17   3rd Qu.:  386.0
Max.   :902.0   Max.   :88.32   Max.   :100.00   Max.   :100.00   Max.   :38758.0
NA's   :    1   NA's   :    1   NA's   :     1   NA's   :     1   NA's   :     84
SaturatedFat     Cholesterol       Sugar            Calcium           Iron
Min.   : 0.000   Min.   :   0.00   Min.   : 0.000   Min.   :   0.00   Min.   :  0.000
1st Qu.: 0.172   1st Qu.:   0.00   1st Qu.: 0.000   1st Qu.:   9.00   1st Qu.:  0.520
Median : 1.256   Median :   3.00   Median : 1.395   Median :  19.00   Median :  1.330
Mean   : 3.452   Mean   :  41.55   Mean   : 8.257   Mean   :  73.53   Mean   :  2.828
3rd Qu.: 4.028   3rd Qu.:  69.00   3rd Qu.: 7.875   3rd Qu.:  56.00   3rd Qu.:  2.620
Max.   :95.600   Max.   :3100.00   Max.   :99.800   Max.   :7364.00   Max.   :123.600
NA's   :   301   NA's   :    288   NA's   :  1910   NA's   :    136   NA's   :    123
Potassium         VitaminC           VitaminE          VitaminD
Min.   :    0.0   Min.   :   0.000   Min.   :  0.000   Min.   :  0.0000
1st Qu.:  135.0   1st Qu.:   0.000   1st Qu.:  0.120   1st Qu.:  0.0000
Median :  250.0   Median :   0.000   Median :  0.270   Median :  0.0000
Mean   :  301.4   Mean   :   9.436   Mean   :  1.488   Mean   :  0.5769
3rd Qu.:  348.0   3rd Qu.:   3.100   3rd Qu.:  0.710   3rd Qu.:  0.1000
Max.   :16500.0   Max.   :2400.000   Max.   :149.400   Max.   :250.0000
NA's   :    409   NA's   :     332   NA's   :   2720   NA's   :    2834

Basic Data Analysis

Vector notation

# Output the Sodium column as a vector
  USDA$Sodium

Finding the index of the food with highest sodium levels

# Finding the index of the food with the highest sodium levels
  which.max(USDA$Sodium)
## [1] 265

Get names of variables in the dataset

# Get names of the variables
  names(USDA)
##  [1] "ID"           "Description"  "Calories"     "Protein"      "TotalFat"     "Carbohydrate" "Sodium"       "SaturatedFat" "Cholesterol"  "Sugar"        "Calcium"      "Iron"         "Potassium"   
## [14] "VitaminC"     "VitaminE"     "VitaminD"

Get the name of the food with highest sodium levels

# Get the name of the food with the highest sodium levels
  USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ACORN FLOUR,FULL FAT ACORN STEW (APACHE) ACORNS,DRIED ... ZWIEBACK

Create a subset of the foods with sodium content above 10,000mg

# Subset foods with sodium content above 10,000 mg
  HighSodium = subset(USDA, Sodium>10000)

Count the number of rows, or observations

# Count the number of rows, or observations
nrow(HighSodium)
## [1] 10

Output names of the foods with high sodium content

# Output names of the foods with high sodium content
  HighSodium$Description
##  [1] SALT,TABLE                                              SOUP,BF BROTH OR BOUILLON,PDR,DRY                       SOUP,BEEF BROTH,CUBED,DRY                              
##  [4] SOUP,CHICK BROTH OR BOUILLON,DRY                        SOUP,CHICK BROTH CUBES,DRY                              GRAVY,AU JUS,DRY                                       
##  [7] ADOBO FRESCO                                            LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE LEAVENING AGENTS,BAKING SODA                           
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND                        
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ACORN FLOUR,FULL FAT ACORN STEW (APACHE) ACORNS,DRIED ... ZWIEBACK

Finding the index of CAVIAR in the dataset

# Finding the index of CAVIAR
  match("CAVIAR", USDA$Description)
## [1] 4154

Find amount of sodium in caviar

# Find amount of sodium in CAVIAR
  USDA$Sodium[4154]
## [1] 1500

Doing it in one command!

# Do the previous two commands in one step
  USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
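
match looks for an exact, whole-string match and returns the position of the first one, or NA if there is none. A minimal sketch on hypothetical vectors; the sodium values for caviar and table salt match the output above, and the first entry is illustrative.

```r
Description = c("BUTTER,WITH SALT", "CAVIAR", "SALT,TABLE")   # first entry illustrative
Sodium = c(714, 1500, 38758)

# match() returns the position of the first exact match
match("CAVIAR", Description)

# A differently cased or partial string does not match and gives NA
match("caviar", Description)

# Lookup and indexing combined in one expression
Sodium[match("CAVIAR", Description)]
```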

Summary function over Sodium vector

# Output a summary
  summary(USDA$Sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    37.0    79.0   322.1   386.0 38758.0      84

Standard deviation

# Calculates the standard deviation
  sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417

Plots of USDA Dataset

Scatter Plots

# Scatterplot
  plot(USDA$Protein, USDA$TotalFat)

Add xlabel, ylabel and title

# Add x label, y label, and title to the scatterplot
  plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")

Histograms

# Histogram
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")

Add limits to x-axis

# Add limits to x-axis
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))

Specify breaks of histogram

# Specify breaks
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
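
A single number given to breaks is treated by hist as a suggestion, so the number of bins actually drawn can differ from the value requested. The sketch below inspects the bins without drawing a plot; the simulated data (Values) is an assumption that stands in for a skewed nutrient column and is not the USDA data.

```r
set.seed(1)
# Illustrative right-skewed data standing in for a nutrient column
Values = rexp(500, rate = 0.1)

# plot = FALSE returns the histogram object instead of drawing it
Coarse = hist(Values, breaks = 10, plot = FALSE)
Fine = hist(Values, breaks = 100, plot = FALSE)

# The actual number of bins may differ from the number requested
length(Coarse$counts)
length(Fine$counts)

# Every observation is counted exactly once in either case
sum(Coarse$counts) == length(Values)
```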

Boxplots

# Boxplots
  boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

Adding a variable

Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise

# Create variable for high sodium
  HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
# Output the structure of the new variable
  str(HighSodium)
##  num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...

Adding the variable to the dataset

# Add variable
  USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))

Similarly for HighProtein, HighCarbs, HighFat

# Same approach as the previous command, applied to other nutrients
  USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
  USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
  USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
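
The comparison yields TRUE or FALSE, which as.numeric converts to 1 or 0, and a missing nutrient value stays missing in the new variable. A minimal sketch on a hypothetical four-element vector (HighSodiumDemo and AvgSodium are illustrative names):

```r
Sodium = c(714, 2, NA, 1395)

# The mean ignores the missing value only when na.rm = TRUE
AvgSodium = mean(Sodium, na.rm = TRUE)

# The comparison gives TRUE/FALSE, and NA where the sodium value is missing
Sodium > AvgSodium

# as.numeric() maps TRUE to 1 and FALSE to 0; NA stays NA
HighSodiumDemo = as.numeric(Sodium > AvgSodium)
HighSodiumDemo
```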

Summary Tables

How many foods have higher sodium level than average?

# Tabulate the number of foods that have higher sodium level than average
 z =  table(USDA$HighSodium)
kable(z)
Var1   Freq
0      4884
1      2090

How many foods have both high sodium and high fat?

# Tabulate the number of foods that have both high sodium and fat
 z =  table(USDA$HighSodium, USDA$HighFat)
kable(z)
        0      1
0    3529   1355
1    1378    712
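
The counts in these tables sum to 6974 rather than 7058 because table leaves out observations with a missing value by default, and Sodium has 84 of them. A minimal sketch on hypothetical indicator vectors, including the useNA argument that makes the missing values visible:

```r
# Illustrative 0/1 indicators; one sodium value is missing
HighSodium = c(0, 1, 1, 0, NA, 1)
HighFat = c(0, 1, 0, 0, 1, 1)

# Rows are the first argument, columns the second
Counts = table(HighSodium, HighFat)
Counts

# Cells are addressed by level name: foods high in both sodium and fat
Counts["1", "1"]

# The observation with a missing value is dropped unless useNA is set
sum(Counts)
sum(table(HighSodium, HighFat, useNA = "ifany"))
```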

Average amount of iron grouped by high and low protein?

# Compare two groups using a statistical measure
 z =  tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
kable(z)
          x
0  2.558945
1  3.197294

Maximum level of Vitamin C in foods with high and low carbs?

# Compare two groups using a statistical measure
z =   tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
kable(z)
        x
0  1677.6
1  2400.0

Using summary function with tapply

# Compare two groups using a statistical measure
z =   tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
kable(z)
x
0 c(Min. = 0, 1st Qu. = 0, Median = 0, Mean = 6.36403527640353, 3rd Qu. = 2.8, Max. = 1677.6, NA's = 248)
1 c(Min. = 0, 1st Qu. = 0, Median = 0.2, Mean = 16.3119884448724, 3rd Qu. = 4.5, Max. = 2400, NA's = 83)
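
Because summary returns several values per group, tapply returns a list here, which is why kable prints each entry as c(...). The individual summaries can be read more easily by extracting them by group name. A minimal sketch on hypothetical vectors; the values are illustrative and not from USDA.csv.

```r
HighCarbs = c(0, 0, 0, 1, 1, 1)
VitaminC = c(0, 2.8, NA, 0.2, 4.5, 10)

# With a function that returns several values, tapply() returns a list
# with one summary per group
ByGroup = tapply(VitaminC, HighCarbs, summary)

# Each group's summary is retrieved by its name, given as a string
ByGroup[["0"]]

# A single statistic is then extracted by its label
ByGroup[["1"]][["Max."]]
ByGroup[["0"]][["NA's"]]
```

summary counts the missing values itself, as the NA's entries in the output above show.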