Analytics Edge: Unit 1

Introduction to Analytics Edge

Prevalence of Data

2.7 Zettabytes of electronic data exist in the world today - 2,700,000,000,000,000,000,000 bytes
- This is equal to the storage required for more than 200 billion HD movies
New data is produced at an exponential rate
Decoding the human genome originally took 10 years to process; now it can be achieved in one week
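
As a rough check on the figures above, the per-movie size they imply can be computed directly. This is only a sketch: it assumes decimal units (1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes), and the variable names are illustrative.

```r
# Assumption: decimal (SI) units
total_bytes <- 2.7e21   # 2.7 zettabytes
hd_movies <- 200e9      # 200 billion HD movies

# Implied size of one HD movie, in gigabytes
gb_per_movie <- total_bytes / hd_movies / 1e9
gb_per_movie
```

Under these assumptions the two figures imply roughly 13.5 GB per movie.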

Data and Analytics are Useful

An estimated shortage of 140,000 to 190,000 people with deep analytical skills to fill the demand for jobs in the U.S. by 2018
IBM has invested over $20 billion since 2005 to grow its analytics business
Companies will invest more than $120 billion by 2015 on analytics, hardware, software and services
Critical in almost every industry
- Healthcare, media, sports, finance, government, etc.

Definition of Analytics

The science of using data to build models that lead to better decisions that add value to individuals, to companies, and to institutions.

Examples of Data Analytics Used

IBM Watson

Watson is a supercomputer with 3,000 processors and a database of 200 million pages of information
Watson combined many algorithms to increase accuracy and confidence
Approached the problem differently than a human would
Deals with massive amounts of data, often in unstructured form
- 90% of data in the world is unstructured

eHarmony

Online dating site focused on long term relationships
Relies much more on data than other dating sites
Suggests a limited number of high quality matches
- Users don’t have to search and dig through profiles
eHarmony has leveraged the power of analytics to create a successful and thriving business
- 14% of US online dating market

The Framingham Heart Study

Much of the now-common knowledge regarding heart disease came from this study
Provided necessary evidence for the development of drugs to lower blood pressure
Paved the way for other clinical prediction rules
- Predict clinical outcomes using patient data
A model allows medical professionals to make predictions for patients worldwide

D2Hawkeye

Combined data with analytics to improve quality and cost management in healthcare
Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention
Use expert knowledge to identify new variables and refine existing variables
Can make predictions for millions of patients without manually reading patient files

An Introduction to R

What is R?

A software environment for data analysis, statistical computing, and graphics
A programming language

In the next section, the basic operations and functions used in R for data analysis are explored.

Basic Calculations

# Basic Calculations
8*6
## [1] 48
2^16
## [1] 65536
8*10
## [1] 80

Functions

# Mathematical functions
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65

Variables

# Setting variables
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
# List the variables stored in the workspace
ls()
## [1] "HoursYear"   "SquareRoot2"

Vectors

Two vectors, Country and LifeExpectancy, are created and then indexed to display individual elements. Finally, a third vector, Sequence, holds the values from 0 to 100 in increments of 2.

# Create Vectors
c(2,3,5,8,13)
## [1]  2  3  5  8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil"      "China"       "India"       "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76  78  80  82  84  86  88  90  92  94  96
## [50]  98 100
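
The indexing shown above also works with several positions at once and with logical conditions. A short sketch reusing the same two vectors; `length.out` is a standard argument of `seq`, shown here as an alternative to a fixed step size.

```r
Country <- c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancy <- c(74, 76, 65, 83, 79)

# Several positions at once
first_and_third <- Country[c(1, 3)]
first_and_third

# Logical condition: countries with life expectancy above 75
above_75 <- Country[LifeExpectancy > 75]
above_75

# seq() can take a target length instead of a step size
five_points <- seq(0, 100, length.out = 5)
five_points
```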

Data Frames

The data.frame CountryData is built from the two vectors Country and LifeExpectancy. An additional column, CountryData$Population, is then added to the data.frame. Finally, rbind stacks the rows of CountryData and NewCountryData into a single data.frame, AllCountryData.

# Create data frames
CountryData = data.frame(Country, LifeExpectancy)
CountryData
##       Country LifeExpectancy
## 1      Brazil             74
## 2       China             76
## 3       India             65
## 4 Switzerland             83
## 5         USA             79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
##     Country LifeExpectancy Population
## 1 Australia             82      23050
## 2    Greece             81      11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
## 6   Australia             82      23050
## 7      Greece             81      11125
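
When rbind is used on data frames, both inputs need the same column names. The sketch below checks this before combining; it uses a shortened copy of the data above, and `same_columns` is an illustrative name.

```r
CountryData <- data.frame(
  Country = c("Brazil", "China"),
  LifeExpectancy = c(74, 76),
  Population = c(199000, 1390000)
)
NewCountryData <- data.frame(
  Country = c("Australia", "Greece"),
  LifeExpectancy = c(82, 81),
  Population = c(23050, 11125)
)

# Check that both data frames have identical column names
same_columns <- identical(names(CountryData), names(NewCountryData))
same_columns

# Stack the rows and inspect the dimensions (rows, columns)
AllCountryData <- rbind(CountryData, NewCountryData)
dim(AllCountryData)
```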

Loading csv files

The dataset WHO.csv is loaded into the variable WHO. The str command shows the structure of the data.frame (number of observations, column names, and types), and the summary command gives a statistical summary of each column.

# Load dataset
WHO = read.csv("WHO.csv")
# Display the structure of the dataset
str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
# Output the summary of the dataset
summary(WHO)
##                 Country                      Region     Population         Under15          Over60      FertilityRate   LifeExpectancy  ChildMortality    CellularSubscribers  LiteracyRate  
##  Afghanistan        :  1   Africa               :46   Min.   :      1   Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00   Min.   :  2.200   Min.   :  2.57      Min.   :31.10  
##  Albania            :  1   Americas             :35   1st Qu.:   1696   1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00   1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790   Median :28.65   Median : 8.53   Median :2.400   Median :72.50   Median : 18.600   Median : 97.75      Median :91.80  
##  Andorra            :  1   Europe               :53   Mean   :  36360   Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01   Mean   : 36.149   Mean   : 93.64      Mean   :83.71  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535   3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00   3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000   Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00   Max.   :181.600   Max.   :196.41      Max.   :99.80  
##  (Other)            :188                                                                                NA's   :11                                        NA's   :10          NA's   :91     
##       GNI        PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   :  340   Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 2335   1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 7870   Median : 94.70              Median : 95.10               
##  Mean   :13321   Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.:17558   3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :86440   Max.   :100.00              Max.   :100.00               
##  NA's   :32      NA's   :93                  NA's   :93
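
The Factor columns in the str output above reflect how read.csv treated text columns before R 4.0.0. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so the same call gives character columns unless the argument is set. A minimal sketch, using a temporary stand-in file with a few values copied from the output above rather than the real WHO.csv:

```r
# Write a tiny stand-in file (not the real WHO.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country,Region,Population",
             "Albania,Europe,3162",
             "Angola,Africa,20821"), tmp)

# Read the same file with and without factor conversion
as_chr <- read.csv(tmp, stringsAsFactors = FALSE)
as_fct <- read.csv(tmp, stringsAsFactors = TRUE)

class(as_chr$Region)
class(as_fct$Region)
levels(as_fct$Region)
```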

Subsetting

The subset function keeps only the rows of WHO whose Region is "Europe" and stores them in WHO_Europe.

# Keep only the countries in the Europe region
WHO_Europe = subset(WHO, Region == "Europe")
# Display the structure of the subset
str(WHO_Europe)
## 'data.frame':    53 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Population                   : int  3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
##  $ Under15                      : num  21.3 15.2 20.3 14.5 22.2 ...
##  $ Over60                       : num  14.93 22.86 14.06 23.52 8.24 ...
##  $ FertilityRate                : num  1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
##  $ LifeExpectancy               : int  74 82 71 81 71 71 80 76 74 77 ...
##  $ ChildMortality               : num  16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
##  $ CellularSubscribers          : num  96.4 75.5 103.6 154.8 108.8 ...
##  $ LiteracyRate                 : num  NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
##  $ GNI                          : num  8820 NA 6100 42050 8960 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
##  $ PrimarySchoolEnrollmentFemale: num  NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
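
The same filter can be written with bracket indexing. One difference worth knowing: subset drops rows where the condition is NA, while indexing with a logical vector returns a row of NAs for them. The toy data frame below is a stand-in for WHO, and its NA region is added purely for illustration.

```r
# Small stand-in for WHO; the "Unknown" row with an NA region is hypothetical
toy <- data.frame(
  Country = c("Albania", "Angola", "Andorra", "Unknown"),
  Region = c("Europe", "Africa", "Europe", NA)
)

by_subset <- subset(toy, Region == "Europe")
by_bracket <- toy[toy$Region == "Europe", ]
by_which <- toy[which(toy$Region == "Europe"), ]

nrow(by_subset)
nrow(by_bracket)   # also contains an all-NA row for the NA comparison
nrow(by_which)
```

Wrapping the condition in which() gives bracket indexing the same NA handling as subset.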

Writing csv files

# Write the Europe subset to a new comma-separated values (CSV) file
write.csv(WHO_Europe, "WHO_Europe.csv")
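
By default write.csv also writes the row names as an unnamed first column, which shows up as a column called X when the file is read back. A short sketch of the difference, using temporary files and a two-row stand-in data frame instead of WHO_Europe:

```r
toy <- data.frame(Country = c("Albania", "Andorra"),
                  Population = c(3162, 78))

with_names <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

write.csv(toy, with_names)                         # row names included
write.csv(toy, without_names, row.names = FALSE)   # row names omitted

cols_with <- names(read.csv(with_names))
cols_without <- names(read.csv(without_names))
cols_with
cols_without
```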

Removing variables

# Remove the WHO_Europe variable from the workspace
rm(WHO_Europe)

Basic Data Analysis

Basic commands compute summary statistics for a column, and which.min and which.max locate the observations with the smallest and largest values.

# Basic data analysis
mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.72   28.65   28.73   37.75   49.99
# Find which data point is the minimum and find its index
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin ... Zimbabwe
# Find which data point is the maximum and find its index
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin ... Zimbabwe
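
The lookup above takes two steps: find the index, then use it. The two can be combined by passing which.min or which.max directly as the index. The sketch uses a three-row stand-in for WHO with values taken from the output above (the Albania figure is the rounded value shown by str).

```r
# Three-row stand-in for the WHO data
toy <- data.frame(
  Country = c("Albania", "Japan", "Niger"),
  Under15 = c(21.3, 13.12, 49.99)
)

# Country with the smallest and largest share of people under 15
youngest_share_min <- toy$Country[which.min(toy$Under15)]
youngest_share_max <- toy$Country[which.max(toy$Under15)]
youngest_share_min
youngest_share_max
```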

Scatterplot

# Scatterplot
plot(WHO$GNI, WHO$FertilityRate)

Subsetting Outliers

# Subsetting outliers
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5) 
# Count the number of observations
nrow(Outliers)
## [1] 7
# Display only the selected columns
Outliers[c("Country","GNI","FertilityRate")]
##               Country   GNI FertilityRate
## 23           Botswana 14550          2.71
## 56  Equatorial Guinea 25620          5.04
## 63              Gabon 13740          4.18
## 83             Israel 27110          2.92
## 88         Kazakhstan 11250          2.52
## 131            Panama 14510          2.52
## 150      Saudi Arabia 24700          2.76

Histograms

# Histogram
hist(WHO$CellularSubscribers)

Boxplot

# Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)

boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")

Summary Tables

The table command counts the number of countries in each region of the WHO data.frame. The tapply command splits one variable by the groups of another and applies a function, such as mean or min, to each group.

# kable() comes from the knitr package
library(knitr)
# Count the countries in each region
z = table(WHO$Region)
kable(z)

Var1	Freq
Africa	46
Americas	35
Eastern Mediterranean	22
Europe	53
South-East Asia	11
Western Pacific	27

# Mean percentage of the population over 60, by region
z = tapply(WHO$Over60, WHO$Region, mean)
kable(z)

	x
Africa	5.220652
Americas	10.943714
Eastern Mediterranean	5.620000
Europe	19.774906
South-East Asia	8.769091
Western Pacific	10.162963

# min returns NA for every region with at least one missing LiteracyRate
z = tapply(WHO$LiteracyRate, WHO$Region, min)
kable(z)

	x
Africa	NA
Americas	NA
Eastern Mediterranean	NA
Europe	NA
South-East Asia	NA
Western Pacific	NA

# Drop missing values before taking the minimum
z = tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
kable(z)

	x
Africa	31.1
Americas	75.2
Eastern Mediterranean	63.9
Europe	95.2
South-East Asia	56.8
Western Pacific	60.6
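
The all-NA result above follows from how min treats missing values: a single NA in a group makes the result NA unless na.rm = TRUE is passed. tapply forwards extra arguments such as na.rm to the function it applies. A minimal sketch with made-up values and hypothetical group labels:

```r
# Made-up values for two hypothetical groups "A" and "B"
rate <- c(90, NA, 70, 80)
group <- c("A", "A", "B", "B")

# Group A contains an NA, so its minimum is NA
with_na <- tapply(rate, group, min)
with_na

# na.rm = TRUE is passed through to min()
without_na <- tapply(rate, group, min, na.rm = TRUE)
without_na
```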

Understanding Food USDA Dataset

In the following section, the United States Department of Agriculture (USDA) dataset on the dietary contents of food is examined.

Loading in the Dataset

Read the csv file

# Load in the dataset
  USDA = read.csv("USDA.csv")

Structure of the dataset

# Display the structure of the dataset
  str(USDA)
## 'data.frame':    7058 obs. of  16 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...

Statistical summary

# Outputs the summary
z = summary(USDA)
kable(z)

ID	Description	Calories	Protein	TotalFat	Carbohydrate	Sodium	SaturatedFat	Cholesterol	Sugar	Calcium	Iron	Potassium	VitaminC	VitaminE	VitaminD
Min. : 1001	BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA: 2	Min. : 0.0	Min. : 0.00	Min. : 0.00	Min. : 0.00	Min. : 0.0	Min. : 0.000	Min. : 0.00	Min. : 0.000	Min. : 0.00	Min. : 0.000	Min. : 0.0	Min. : 0.000	Min. : 0.000	Min. : 0.0000
1st Qu.: 8387	CAMPBELL,CAMPBELL’S SEL MICROWAVEABLE BOWLS,HEA : 2	1st Qu.: 85.0	1st Qu.: 2.29	1st Qu.: 0.72	1st Qu.: 0.00	1st Qu.: 37.0	1st Qu.: 0.172	1st Qu.: 0.00	1st Qu.: 0.000	1st Qu.: 9.00	1st Qu.: 0.520	1st Qu.: 135.0	1st Qu.: 0.000	1st Qu.: 0.120	1st Qu.: 0.0000
Median :13294	OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT : 2	Median :181.0	Median : 8.20	Median : 4.37	Median : 7.13	Median : 79.0	Median : 1.256	Median : 3.00	Median : 1.395	Median : 19.00	Median : 1.330	Median : 250.0	Median : 0.000	Median : 0.270	Median : 0.0000
Mean :14260	POPCORN,OIL-POPPED,LOFAT : 2	Mean :219.7	Mean :11.71	Mean : 10.32	Mean : 20.70	Mean : 322.1	Mean : 3.452	Mean : 41.55	Mean : 8.257	Mean : 73.53	Mean : 2.828	Mean : 301.4	Mean : 9.436	Mean : 1.488	Mean : 0.5769
3rd Qu.:18337	ABALONE,MIXED SPECIES,RAW : 1	3rd Qu.:331.0	3rd Qu.:20.43	3rd Qu.: 12.70	3rd Qu.: 28.17	3rd Qu.: 386.0	3rd Qu.: 4.028	3rd Qu.: 69.00	3rd Qu.: 7.875	3rd Qu.: 56.00	3rd Qu.: 2.620	3rd Qu.: 348.0	3rd Qu.: 3.100	3rd Qu.: 0.710	3rd Qu.: 0.1000
Max. :93600	ABALONE,MXD SP,CKD,FRIED : 1	Max. :902.0	Max. :88.32	Max. :100.00	Max. :100.00	Max. :38758.0	Max. :95.600	Max. :3100.00	Max. :99.800	Max. :7364.00	Max. :123.600	Max. :16500.0	Max. :2400.000	Max. :149.400	Max. :250.0000
NA	(Other) :7048	NA’s :1	NA’s :1	NA’s :1	NA’s :1	NA’s :84	NA’s :301	NA’s :288	NA’s :1910	NA’s :136	NA’s :123	NA’s :409	NA’s :332	NA’s :2720	NA’s :2834

Basic Data Analysis

Vector notation

# Access the Sodium column as a vector
  USDA$Sodium

Finding the index of the food with highest sodium levels

# Finding the index of the food with the highest sodium levels
  which.max(USDA$Sodium)
## [1] 265

Get names of variables in the dataset

# Get names of the variables
  names(USDA)
##  [1] "ID"           "Description"  "Calories"     "Protein"      "TotalFat"     "Carbohydrate" "Sodium"       "SaturatedFat" "Cholesterol"  "Sugar"        "Calcium"      "Iron"         "Potassium"   
## [14] "VitaminC"     "VitaminE"     "VitaminD"

Get the name of the food with highest sodium levels

# Get the name of the food with the highest sodium levels
  USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ACORN FLOUR,FULL FAT ACORN STEW (APACHE) ACORNS,DRIED ... ZWIEBACK

Create a subset of the foods with sodium content above 10,000 mg

# Subset foods with sodium content above 10,000 mg
  HighSodium = subset(USDA, Sodium>10000)

Count the number of rows, or observations

# Count the number of rows, or observations
nrow(HighSodium)
## [1] 10

Output names of the foods with high sodium content

# Output names of the foods with high sodium content
  HighSodium$Description
##  [1] SALT,TABLE                                              SOUP,BF BROTH OR BOUILLON,PDR,DRY                       SOUP,BEEF BROTH,CUBED,DRY                              
##  [4] SOUP,CHICK BROTH OR BOUILLON,DRY                        SOUP,CHICK BROTH CUBES,DRY                              GRAVY,AU JUS,DRY                                       
##  [7] ADOBO FRESCO                                            LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE LEAVENING AGENTS,BAKING SODA                           
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND                        
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ACORN FLOUR,FULL FAT ACORN STEW (APACHE) ACORNS,DRIED ... ZWIEBACK

Finding the index of CAVIAR in the dataset

# Finding the index of CAVIAR
  match("CAVIAR", USDA$Description)
## [1] 4154

Find amount of sodium in caviar

# Find amount of sodium in CAVIAR
  USDA$Sodium[4154]
## [1] 1500

Doing it in one command!

# Do the previous two commands in one step
  USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
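
Note that match looks for an exact match of the whole string and returns NA when there is none. For partial matches, grepl is one option. A short sketch on a two-item stand-in for the USDA columns, with both sodium values taken from the outputs above:

```r
# Two-item stand-in for USDA$Description and USDA$Sodium
Description <- c("SALT,TABLE", "CAVIAR")
Sodium <- c(38758, 1500)

exact_hit <- match("CAVIAR", Description)   # whole string matches
exact_miss <- match("SALT", Description)    # no whole-string match

# grepl() flags every description that contains "SALT"
contains_salt <- grepl("SALT", Description)
salty_sodium <- Sodium[contains_salt]

exact_hit
exact_miss
salty_sodium
```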

Summary function over Sodium vector

# Output a summary
  summary(USDA$Sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    37.0    79.0   322.1   386.0 38758.0      84

Standard deviation

# Calculates the standard deviation
  sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417

Plots of USDA Dataset

Scatter Plots

# Scatterplot
  plot(USDA$Protein, USDA$TotalFat)

Add xlabel, ylabel and title

# Add x label, y label, and title to the scatterplot
  plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")

Histograms

# Histogram
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")

Add limits to x-axis

# Add limits to x-axis
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))

Specify breaks of histogram

# Specify breaks
  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)

Boxplots

# Boxplots
  boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

Adding a variable

Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise

# Create variable for high sodium
  HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
# Display the structure of the new vector
  str(HighSodium)
##  num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...
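
The comparison produces a logical vector, and as.numeric turns TRUE into 1 and FALSE into 0. A food with a missing Sodium value gets NA rather than 0 or 1, and table leaves such entries out unless asked otherwise. A sketch with made-up values:

```r
# Made-up sodium values, including one missing value
sodium <- c(10, 50, NA, 900)

# TRUE/FALSE/NA depending on whether each value exceeds the mean
above_mean <- sodium > mean(sodium, na.rm = TRUE)
above_mean

# Convert to 1/0; the NA stays NA
flag <- as.numeric(above_mean)
flag

# table() skips NA by default; useNA = "ifany" counts it as well
counted <- table(flag)
counted_with_na <- table(flag, useNA = "ifany")
counted
counted_with_na
```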

Adding the variable to the dataset

# Add variable
  USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))

Similarly for HighProtein, HighCarbs, HighFat

# Create the same kind of indicator for carbohydrates, protein, and fat
  USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
  USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
  USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))

Summary Tables

How many foods have higher sodium level than average?

# Tabulate the number of foods with lower (0) and higher (1) sodium than average
z = table(USDA$HighSodium)
kable(z)

Var1	Freq
0	4884
1	2090

How many foods have both high sodium and high fat?

# Tabulate the number of foods that have both high sodium and fat
 z =  table(USDA$HighSodium, USDA$HighFat)
kable(z)

	0	1
0	3529	1355
1	1378	712

Average amount of iron in foods with high and low protein?

# Mean iron content for low-protein (0) and high-protein (1) foods
z = tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
kable(z)

	x
0	2.558945
1	3.197294

Maximum level of Vitamin C in foods with high and low carbs?

# Compare two groups using a statistical measure
z =   tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
kable(z)

	x
0	1677.6
1	2400.0

Using summary function with tapply

# Apply summary to each group; summary reports missing values itself as NA's
z = tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
kable(z)

	x
0	c(Min. = 0, `1st Qu.` = 0, Median = 0, Mean = 6.36403527640353, `3rd Qu.` = 2.8, Max. = 1677.6, `NA's` = 248)
1	c(Min. = 0, `1st Qu.` = 0, Median = 0.2, Mean = 16.3119884448724, `3rd Qu.` = 4.5, Max. = 2400, `NA's` = 83)
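
In the output above, tapply returns a list with one summary per group, which is why kable prints raw c(...) text. One possible alternative, not part of the original notes, is to compute per-group quantiles with split and sapply, which gives a plain matrix that should be easier to print as a table. The values below are made up for illustration.

```r
# Made-up values standing in for USDA$VitaminC and USDA$HighCarbs
vitamin_c <- c(0, 2, 4, NA, 10, 20, 30)
high_carbs <- c(0, 0, 0, 0, 1, 1, 1)

# One column per group, one row per quantile (0%, 25%, 50%, 75%, 100%)
by_group <- sapply(split(vitamin_c, high_carbs), quantile, na.rm = TRUE)
by_group
```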

Sulman Khan

October 23, 2018
