Analytics Edge: Unit 1

Introduction to Analytics Edge

Prevalence of Data

2.7 Zettabytes of electronic data exist in the world today - 2,700,000,000,000,000,000,000 bytes
- This is equal to the storage required for more than 200 billion HD movies
New data is produced at an exponential rate
Decoding the human genome originally took 10 years to process; now it can be achieved in one week

Data and Analytics are Useful

An estimated shortage of 140,000 - 190,000 people with deep analytical skills to meet U.S. job demand by 2018
IBM has invested over $20 billion since 2005 to grow its analytics business
Companies will invest more than $120 billion by 2015 on analytics, hardware, software and services
Critical in almost every industry
- Healthcare, media, sports, finance, government, etc.

Definition of Analytics

The science of using data to build models that lead to better decisions that add value to individuals, to companies, and to institutions.

Examples of Data Analytics Used

IBM Watson

Watson is a supercomputer with 3,000 processors and a database of 200 million pages of information
Watson combined many algorithms to increase accuracy and confidence
Approached the problem differently than a human would
Deals with massive amounts of data, often in unstructured form
- 90% of data in the world is unstructured

eHarmony

Online dating site focused on long term relationships
Relies much more on data than other dating sites
Suggests a limited number of high quality matches
- Users don’t have to search and dig through profiles
eHarmony has leveraged the power of analytics to build a thriving business
- 14% of US online dating market

The Framingham Heart Study

Much of the now-common knowledge regarding heart disease came from this study
Provided necessary evidence for the development of drugs to lower blood pressure
Paved the way for other clinical prediction rules
- Predict clinical outcomes using patient data
A model allows medical professionals to make predictions for patients worldwide

D2Hawkeye

Combined data with analytics to improve quality and cost management in healthcare
Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention
Use expert knowledge to identify new variables and refine existing variables
Can make predictions for millions of patients without manually reading patient files

An Introduction to R

What is R?

A software environment for data analysis, statistical computing, and graphics
A programming language

In the next section, the basic operations and functions used in R for data analysis are explored.

Basic Calculations

8*6
## [1] 48
2^16
## [1] 65536
8*10
## [1] 80

Functions

sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65

Variables

SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
ls()
## [1] "HoursYear"   "SquareRoot2"

Vectors

Two vectors, Country and LifeExpectancy, are created and then indexed to display specific elements. A third vector, Sequence, runs from 0 to 100 in increments of 2.

c(2,3,5,8,13)
## [1]  2  3  5  8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil"      "China"       "India"       "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [40]  78  80  82  84  86  88  90  92  94  96  98 100
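
Indexing is not limited to a single position. The sketch below reuses the same values as above and shows a few other common forms: a vector of positions, a negative index, and a logical condition that links the two vectors. It is an illustrative addition rather than part of the original lecture code.

```r
# Same vectors as in the example above
Country <- c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancy <- c(74, 76, 65, 83, 79)

# Several positions at once
Country[c(1, 3)]

# Everything except the second element
LifeExpectancy[-2]

# Logical indexing: countries whose life expectancy exceeds 75
HighLife <- Country[LifeExpectancy > 75]
HighLife

# seq() also accepts a named step size; this matches seq(0, 100, 2)
SequenceLength <- length(seq(0, 100, by = 2))
SequenceLength
```

Logical indexing works here because both vectors have the same length and the same ordering, so position i in one describes the same country as position i in the other.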

Data Frames

The data.frame CountryData is built from the two vectors Country and LifeExpectancy. A further column, CountryData$Population, is then added to it. Finally, rbind combines CountryData with a second data.frame, NewCountryData, to form AllCountryData.

CountryData = data.frame(Country, LifeExpectancy)
CountryData
##       Country LifeExpectancy
## 1      Brazil             74
## 2       China             76
## 3       India             65
## 4 Switzerland             83
## 5         USA             79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
##     Country LifeExpectancy Population
## 1 Australia             82      23050
## 2    Greece             81      11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
## 6   Australia             82      23050
## 7      Greece             81      11125
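
The rbind call above works because both data.frames have the same column names. The following sketch, using small made-up data.frames named a, b and c_bad (all hypothetical), illustrates that rbind matches data.frame columns by name rather than by position, and that mismatched names are rejected.

```r
# Two small data frames with the same columns in a different order
a <- data.frame(Country = c("Brazil", "China"), LifeExpectancy = c(74, 76))
b <- data.frame(LifeExpectancy = 82, Country = "Australia")

# rbind() on data frames matches columns by name, not by position
combined <- rbind(a, b)
combined

# Mismatched column names are rejected with an error,
# which is caught here so the script keeps running
c_bad <- data.frame(Nation = "Greece", LifeExpectancy = 81)
result <- tryCatch(rbind(a, c_bad), error = function(e) "error")
result
```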

Loading csv files

The dataset WHO.csv is loaded into the variable WHO. The str command describes its structure (number of observations, variables, and column types), and the summary command gives a statistical summary of each column.

WHO = read.csv("WHO.csv")
str(WHO)
## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
##                 Country                      Region     Population         Under15          Over60      FertilityRate   LifeExpectancy  ChildMortality   
##  Afghanistan        :  1   Africa               :46   Min.   :      1   Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00   Min.   :  2.200  
##  Albania            :  1   Americas             :35   1st Qu.:   1696   1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00   1st Qu.:  8.425  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790   Median :28.65   Median : 8.53   Median :2.400   Median :72.50   Median : 18.600  
##  Andorra            :  1   Europe               :53   Mean   :  36360   Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01   Mean   : 36.149  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535   3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00   3rd Qu.: 55.975  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000   Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00   Max.   :181.600  
##  (Other)            :188                                                                                NA's   :11                                       
##  CellularSubscribers  LiteracyRate        GNI        PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   :  2.57      Min.   :31.10   Min.   :  340   Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335   1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 97.75      Median :91.80   Median : 7870   Median : 94.70              Median : 95.10               
##  Mean   : 93.64      Mean   :83.71   Mean   :13321   Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558   3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :196.41      Max.   :99.80   Max.   :86440   Max.   :100.00              Max.   :100.00               
##  NA's   :10          NA's   :91      NA's   :32      NA's   :93                  NA's   :93
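
The Factor columns in the str output above reflect R versions before 4.0.0, where read.csv converted text columns to factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so Country and Region would load as character columns instead. A minimal sketch, assuming a tiny made-up CSV written to a temporary file (not the real WHO data):

```r
# Write a tiny illustrative CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Country,Region,Population",
             "Albania,Europe,3162",
             "Angola,Africa,20821"), tmp)

# Explicitly request factors, as in the str() output shown above
as_factor <- read.csv(tmp, stringsAsFactors = TRUE)
class(as_factor$Region)

# Explicitly request plain character columns
as_char <- read.csv(tmp, stringsAsFactors = FALSE)
class(as_char$Region)

# Clean up the temporary file
unlink(tmp)
```

Setting the argument explicitly makes a script behave the same way regardless of the R version it runs on.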

Subsetting

A subset of WHO, named WHO_Europe, is created by keeping only the rows whose Region is "Europe".

WHO_Europe = subset(WHO, Region == "Europe")
str(WHO_Europe)
## 'data.frame':    53 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Population                   : int  3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
##  $ Under15                      : num  21.3 15.2 20.3 14.5 22.2 ...
##  $ Over60                       : num  14.93 22.86 14.06 23.52 8.24 ...
##  $ FertilityRate                : num  1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
##  $ LifeExpectancy               : int  74 82 71 81 71 71 80 76 74 77 ...
##  $ ChildMortality               : num  16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
##  $ CellularSubscribers          : num  96.4 75.5 103.6 154.8 108.8 ...
##  $ LiteracyRate                 : num  NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
##  $ GNI                          : num  8820 NA 6100 42050 8960 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
##  $ PrimarySchoolEnrollmentFemale: num  NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
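
Two details of the output above are worth a closer look: subset handles missing values differently from bracket indexing, and the Country factor still reports 194 levels although only 53 rows remain. The sketch below illustrates both on a small made-up data.frame (the name df and its rows are hypothetical).

```r
# Small illustrative data frame with one missing Region
df <- data.frame(Country = c("A", "B", "C", "D"),
                 Region  = c("Europe", "Africa", NA, "Europe"),
                 stringsAsFactors = TRUE)

# subset() silently drops rows where the condition is NA
by_subset <- subset(df, Region == "Europe")
nrow(by_subset)

# Bracket indexing keeps an all-NA row for each NA in the condition
by_bracket <- df[df$Region == "Europe", ]
nrow(by_bracket)

# which() gives bracket indexing the same behaviour as subset()
by_which <- df[which(df$Region == "Europe"), ]
nrow(by_which)

# Unused factor levels survive subsetting until droplevels() is called
nlevels(by_subset$Country)
nlevels(droplevels(by_subset)$Country)
```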

Writing csv files

write.csv(WHO_Europe, "WHO_Europe.csv")
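
By default, write.csv also writes the row names as an unnamed first column, which comes back as a column named X when the file is read again. The sketch below, using a made-up two-row data.frame and temporary files, shows the effect of row.names = FALSE.

```r
# Illustrative data frame (values taken from the WHO output above)
df <- data.frame(Country = c("Albania", "Andorra"), Population = c(3162, 78))

with_rows    <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

# Default: row names are written as an extra, unnamed first column
write.csv(df, with_rows)
cols_default <- names(read.csv(with_rows))
cols_default

# row.names = FALSE writes only the data columns
write.csv(df, without_rows, row.names = FALSE)
cols_clean <- names(read.csv(without_rows))
cols_clean

# Clean up the temporary files
unlink(c(with_rows, without_rows))
```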

Removing variables

rm(WHO_Europe)

Basic Data Analysis

Basic commands are used to summarize a variable statistically and to locate observations of interest, such as the countries with the minimum and maximum values of Under15.

mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.72   28.65   28.73   37.75   49.99

which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain ... Zimbabwe

which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain ... Zimbabwe
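
The two-step lookup above (find the index, then type it into a second command) can be collapsed by nesting the calls. The sketch below uses a small made-up data.frame to show this, along with how which.min and mean treat missing values; the name df and its values are hypothetical.

```r
# Illustrative data with one missing value and a tied minimum
df <- data.frame(Country = c("A", "B", "C", "D"),
                 Under15 = c(30.5, NA, 13.1, 13.1))

# which.min() skips NA values and returns the position of the first minimum
min_index <- which.min(df$Under15)
min_index

# Nesting the call avoids hard-coding the row number
min_country <- as.character(df$Country[which.min(df$Under15)])
min_country

# mean() returns NA unless missing values are removed
mean_with_na <- mean(df$Under15)
mean_no_na   <- mean(df$Under15, na.rm = TRUE)
mean_no_na
```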

Scatterplot

plot(WHO$GNI, WHO$FertilityRate)

Subsetting Outliers

Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5) 
nrow(Outliers)
## [1] 7
Outliers[c("Country","GNI","FertilityRate")]
##               Country   GNI FertilityRate
## 23           Botswana 14550          2.71
## 56  Equatorial Guinea 25620          5.04
## 63              Gabon 13740          4.18
## 83             Israel 27110          2.92
## 88         Kazakhstan 11250          2.52
## 131            Panama 14510          2.52
## 150      Saudi Arabia 24700          2.76

Histograms

hist(WHO$CellularSubscribers)

Boxplot

boxplot(WHO$LifeExpectancy ~ WHO$Region)

boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")

Summary Tables

The table command counts the number of observations in each region of the WHO data.frame. The tapply command splits one variable into groups defined by another and applies a summary function, such as mean or min, to each group.

table(WHO$Region)
## 
##                Africa              Americas Eastern Mediterranean                Europe       South-East Asia       Western Pacific 
##                    46                    35                    22                    53                    11                    27

tapply(WHO$Over60, WHO$Region, mean)
##                Africa              Americas Eastern Mediterranean                Europe       South-East Asia       Western Pacific 
##              5.220652             10.943714              5.620000             19.774906              8.769091             10.162963
tapply(WHO$LiteracyRate, WHO$Region, min)
##                Africa              Americas Eastern Mediterranean                Europe       South-East Asia       Western Pacific 
##                    NA                    NA                    NA                    NA                    NA                    NA
tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
##                Africa              Americas Eastern Mediterranean                Europe       South-East Asia       Western Pacific 
##                  31.1                  75.2                  63.9                  95.2                  56.8                  60.6
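
The all-NA result of the first min call above arises because every region contains at least one missing LiteracyRate. The mechanism can be seen on a small scale in the sketch below, which uses made-up vectors with the same names as the WHO columns.

```r
# Illustrative values with one missing literacy rate
Region       <- c("Africa", "Africa", "Europe", "Europe", "Europe")
LiteracyRate <- c(70.1, NA, 99.6, 97.9, 98.8)

# Number of observations in each group
region_counts <- table(Region)
region_counts

# A single NA makes the whole group's result NA
min_with_na <- tapply(LiteracyRate, Region, min)
min_with_na

# Extra arguments such as na.rm are passed on to the summary function
min_no_na <- tapply(LiteracyRate, Region, min, na.rm = TRUE)
min_no_na
```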

Understanding Food USDA Dataset

In the following section, the United States Department of Agriculture (USDA) dataset on the dietary contents of food is examined.

Loading in the Dataset

Read the csv file

  USDA = read.csv("USDA.csv")

Structure of the dataset

  str(USDA)
## 'data.frame':    7058 obs. of  16 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...

Statistical summary

  summary(USDA)
##        ID                                                              Description      Calories        Protein         TotalFat       Carbohydrate   
##  Min.   : 1001   BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA:   2   Min.   :  0.0   Min.   : 0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 8387   CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA             :   2   1st Qu.: 85.0   1st Qu.: 2.29   1st Qu.:  0.72   1st Qu.:  0.00  
##  Median :13294   OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT    :   2   Median :181.0   Median : 8.20   Median :  4.37   Median :  7.13  
##  Mean   :14260   POPCORN,OIL-POPPED,LOFAT                                    :   2   Mean   :219.7   Mean   :11.71   Mean   : 10.32   Mean   : 20.70  
##  3rd Qu.:18337   ABALONE,MIXED SPECIES,RAW                                   :   1   3rd Qu.:331.0   3rd Qu.:20.43   3rd Qu.: 12.70   3rd Qu.: 28.17  
##  Max.   :93600   ABALONE,MXD SP,CKD,FRIED                                    :   1   Max.   :902.0   Max.   :88.32   Max.   :100.00   Max.   :100.00  
##                  (Other)                                                     :7048   NA's   :1       NA's   :1       NA's   :1        NA's   :1       
##      Sodium         SaturatedFat     Cholesterol          Sugar           Calcium             Iron           Potassium          VitaminC       
##  Min.   :    0.0   Min.   : 0.000   Min.   :   0.00   Min.   : 0.000   Min.   :   0.00   Min.   :  0.000   Min.   :    0.0   Min.   :   0.000  
##  1st Qu.:   37.0   1st Qu.: 0.172   1st Qu.:   0.00   1st Qu.: 0.000   1st Qu.:   9.00   1st Qu.:  0.520   1st Qu.:  135.0   1st Qu.:   0.000  
##  Median :   79.0   Median : 1.256   Median :   3.00   Median : 1.395   Median :  19.00   Median :  1.330   Median :  250.0   Median :   0.000  
##  Mean   :  322.1   Mean   : 3.452   Mean   :  41.55   Mean   : 8.257   Mean   :  73.53   Mean   :  2.828   Mean   :  301.4   Mean   :   9.436  
##  3rd Qu.:  386.0   3rd Qu.: 4.028   3rd Qu.:  69.00   3rd Qu.: 7.875   3rd Qu.:  56.00   3rd Qu.:  2.620   3rd Qu.:  348.0   3rd Qu.:   3.100  
##  Max.   :38758.0   Max.   :95.600   Max.   :3100.00   Max.   :99.800   Max.   :7364.00   Max.   :123.600   Max.   :16500.0   Max.   :2400.000  
##  NA's   :84        NA's   :301      NA's   :288       NA's   :1910     NA's   :136       NA's   :123       NA's   :409       NA's   :332       
##     VitaminE          VitaminD       
##  Min.   :  0.000   Min.   :  0.0000  
##  1st Qu.:  0.120   1st Qu.:  0.0000  
##  Median :  0.270   Median :  0.0000  
##  Mean   :  1.488   Mean   :  0.5769  
##  3rd Qu.:  0.710   3rd Qu.:  0.1000  
##  Max.   :149.400   Max.   :250.0000  
##  NA's   :2720      NA's   :2834

Basic Data Analysis

Vector notation

  USDA$Sodium

Finding the index of the food with highest sodium levels

  which.max(USDA$Sodium)
## [1] 265

Get names of variables in the dataset

  names(USDA)
##  [1] "ID"           "Description"  "Calories"     "Protein"      "TotalFat"     "Carbohydrate" "Sodium"       "SaturatedFat" "Cholesterol"  "Sugar"       
## [11] "Calcium"      "Iron"         "Potassium"    "VitaminC"     "VitaminE"     "VitaminD"

Get the name of the food with highest sodium levels

  USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ... ZWIEBACK

Create a subset of the foods with sodium content above 10,000mg

  HighSodium = subset(USDA, Sodium>10000)

Count the number of rows, or observations

nrow(HighSodium)
## [1] 10

Output names of the foods with high sodium content

  HighSodium$Description
##  [1] SALT,TABLE                                              SOUP,BF BROTH OR BOUILLON,PDR,DRY                      
##  [3] SOUP,BEEF BROTH,CUBED,DRY                               SOUP,CHICK BROTH OR BOUILLON,DRY                       
##  [5] SOUP,CHICK BROTH CUBES,DRY                              GRAVY,AU JUS,DRY                                       
##  [7] ADOBO FRESCO                                            LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE
##  [9] LEAVENING AGENTS,BAKING SODA                            DESSERTS,RENNIN,TABLETS,UNSWTND                        
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ... ZWIEBACK

Finding the index of CAVIAR in the dataset

  match("CAVIAR", USDA$Description)
## [1] 4154

Find amount of sodium in caviar

  USDA$Sodium[4154]
## [1] 1500

Doing it in one command!

  USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
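
Note that match looks for the full, exact string and returns only the first position found. The sketch below illustrates this on a short made-up vector in the style of the USDA descriptions; the butter entry and its sodium value are assumptions for illustration, and grep is shown as one option for partial matches.

```r
# Illustrative descriptions and sodium values
Description <- c("BUTTER,WITH SALT", "CAVIAR", "SALT,TABLE")
Sodium      <- c(714, 1500, 38758)

# match() needs the full, exact string
caviar_index <- match("CAVIAR", Description)
Sodium[caviar_index]

# A string that is not present (here, different case) gives NA, not an error
missing_index <- match("Caviar", Description)
missing_index

# grep() finds partial matches; fixed = TRUE treats the pattern literally
salt_indices <- grep("SALT", Description, fixed = TRUE)
salt_indices
```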

Summary function over Sodium vector

  summary(USDA$Sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    37.0    79.0   322.1   386.0 38758.0      84

Standard deviation

  sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417

Plots of USDA Dataset

Scatter Plots

  plot(USDA$Protein, USDA$TotalFat)

Add xlabel, ylabel and title

  plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")

Histograms

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")

Add limits to x-axis

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))

Specify breaks of histogram

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
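
When breaks is a single number, hist treats it as a suggestion and chooses rounded cut points across the full range of the data, not just the part shown by xlim. This is why a much larger value is needed once the plot is zoomed in. The sketch below inspects the chosen cut points with plot = FALSE; the data are synthetic stand-ins, not the real VitaminC values.

```r
# Synthetic, heavily skewed values with one large outlier
set.seed(1)
x <- c(rexp(500, rate = 1 / 10), 2400)

# plot = FALSE returns the histogram object without drawing it
h_coarse <- hist(x, breaks = 100, plot = FALSE)
h_fine   <- hist(x, breaks = 2000, plot = FALSE)

# Width of the first cell under each setting
width_coarse <- diff(h_coarse$breaks)[1]
width_fine   <- diff(h_fine$breaks)[1]
width_coarse
width_fine
```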

Boxplots

  boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

Adding a variable

Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise (this reuses the name HighSodium, replacing the subset created earlier)

  HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
  str(HighSodium)
##  num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...

Adding the variable to the dataset

  USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))

Similarly for HighProtein, HighCarbs, HighFat

  USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
  USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
  USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
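
The conversion above happens in two steps: the comparison produces TRUE/FALSE values, and as.numeric turns them into 1/0. Foods with a missing nutrient value stay NA throughout. A minimal sketch on a made-up vector (the values are illustrative, not from the USDA file):

```r
# Illustrative sodium values, including one missing entry
Sodium <- c(10, 50, NA, 900, 40)

# Mean of the non-missing values
avg <- mean(Sodium, na.rm = TRUE)

# The comparison gives TRUE/FALSE, and NA wherever Sodium itself is NA
above <- Sodium > avg
above

# as.numeric() maps TRUE to 1 and FALSE to 0; NA stays NA
HighSodium <- as.numeric(above)
HighSodium

# table() leaves out NA unless asked to report it
counted     <- table(HighSodium)
counted_all <- table(HighSodium, useNA = "ifany")
counted_all
```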

Summary Tables

How many foods have higher sodium level than average?

  table(USDA$HighSodium)
## 
##    0    1 
## 4884 2090

How many foods have both high sodium and high fat?

  table(USDA$HighSodium, USDA$HighFat)
##    
##        0    1
##   0 3529 1355
##   1 1378  712
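
In the two-way table above, rows correspond to the first argument (HighSodium) and columns to the second (HighFat), so the foods high in both are counted in row 1, column 1 (712). The sketch below shows one way to make this easier to read, by labelling the dimensions and extracting a single cell; the indicator vectors are made up for illustration.

```r
# Illustrative 0/1 indicators
HighSodium <- c(0, 1, 1, 0, 1, 0)
HighFat    <- c(0, 1, 0, 0, 1, 1)

# Naming the arguments labels the rows and columns of the result
counts <- table(HighSodium = HighSodium, HighFat = HighFat)
counts

# Observations that are high in both: row "1", column "1"
both_high <- counts["1", "1"]
both_high

# addmargins() appends row and column totals
with_totals <- addmargins(counts)
with_totals
```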

Average amount of iron, grouped by high and low protein?

  tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
##        0        1 
## 2.558945 3.197294

Maximum level of Vitamin C in foods with high and low carbs?

  tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
##      0      1 
## 1677.6 2400.0

Using summary function with tapply

  tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
## $`0`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.000    0.000    6.364    2.800 1677.600      248 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.20   16.31    4.50 2400.00      83

Sulman Khan

October 23, 2018
