Analytics Edge: Unit 1

Introduction to Analytics Edge

Prevalence of Data

2.7 Zettabytes of electronic data exist in the world today - 2,700,000,000,000,000,000,000 bytes
- This is equal to the storage required for more than 200 billion HD movies
New data is produced at an exponential rate
Decoding the human genome originally took 10 years to process; now it can be achieved in one week

Data and Analytics are Useful

An estimated shortage of 140,000 - 190,000 people with deep analytical skills to fill the demand for jobs in the U.S. by 2018
IBM has invested over $20 billion since 2005 to grow its analytics business
Companies will invest more than $120 billion by 2015 on analytics, hardware, software and services
Critical in almost every industry
- Healthcare, media, sports, finance, government, etc.

Definition of Analytics

The science of using data to build models that lead to better decisions that add value to individuals, to companies, and to institutions.

Examples of Data Analytics Used

IBM Watson

Watson is a supercomputer with 3,000 processors and a database of 200 million pages of information
Watson combined many algorithms to increase accuracy and confidence
Approached the problem differently from how a human does
Deals with massive amounts of data, often in unstructured form
- 90% of data in the world is unstructured

eHarmony

Online dating site focused on long term relationships
Relies much more on data than other dating sites
Suggests a limited number of high quality matches
- Users don’t have to search and dig through profiles
eHarmony has leveraged the power of analytics to build a thriving business
- 14% of US online dating market

The Framingham Heart Study

Much of the now-common knowledge regarding heart disease came from this study
Provided necessary evidence for the development of drugs to lower blood pressure
Paved the way for other clinical prediction rules
- Predict clinical outcomes using patient data
A model allows medical professionals to make predictions for patients worldwide

D2Hawkeye

Combined data with analytics to improve quality and cost management in healthcare
Substantial improvement in D2Hawkeye’s ability to identify patients who need more attention
Use expert knowledge to identify new variables and refine existing variables
Can make predictions for millions of patients without manually reading patient files

An Introduction to R

What is R?

A software environment for data analysis, statistical computing, and graphics
A programming language

In the next section, the basic operations and functions used in R for data analysis are explored.

Basic Calculations

8*6

## [1] 48

2^16

## [1] 65536

8*10

## [1] 80

Functions

sqrt(2)

## [1] 1.414214

abs(-65)

## [1] 65

Variables

SquareRoot2 = sqrt(2)
SquareRoot2

## [1] 1.414214

HoursYear <- 365*24
HoursYear

## [1] 8760

ls()

## [1] "HoursYear"   "SquareRoot2"

Vectors

Two vectors, Country and LifeExpectancy, are created and then indexed to display specific elements. A third vector, Sequence, runs from 0 to 100 in increments of 2.

c(2,3,5,8,13)

## [1]  2  3  5  8 13

Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country

## [1] "Brazil"      "China"       "India"       "Switzerland" "USA"

LifeExpectancy

## [1] 74 76 65 83 79

Country[1]

## [1] "Brazil"

LifeExpectancy[3]

## [1] 65

Sequence = seq(0,100,2)
Sequence

##  [1]   0   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32
## [18]  34  36  38  40  42  44  46  48  50  52  54  56  58  60  62  64  66
## [35]  68  70  72  74  76  78  80  82  84  86  88  90  92  94  96  98 100
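Indexing is not limited to a single position. The following is a minimal sketch, reusing the two vectors defined above, of selecting several elements at once and of selecting by a logical condition; the threshold of 75 is an arbitrary choice for illustration.

```r
# Same vectors as defined earlier in this section
Country <- c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancy <- c(74, 76, 65, 83, 79)

# Select several positions at once (R indices start at 1)
Country[c(1, 3)]

# Select by condition: countries whose life expectancy exceeds 75
# (75 is an arbitrary cut-off chosen for this illustration)
Country[LifeExpectancy > 75]

# seq(from, to, by): count the elements of the 0-to-100 sequence
length(seq(0, 100, 2))
```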

Data Frames

The data.frame CountryData is built from the two vectors Country and LifeExpectancy. A new column, CountryData$Population, is then added to it. Finally, rbind stacks CountryData and a second data.frame, NewCountryData, into AllCountryData.

CountryData = data.frame(Country, LifeExpectancy)
CountryData

##       Country LifeExpectancy
## 1      Brazil             74
## 2       China             76
## 3       India             65
## 4 Switzerland             83
## 5         USA             79

CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData

##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000

Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData

##     Country LifeExpectancy Population
## 1 Australia             82      23050
## 2    Greece             81      11125

AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData

##       Country LifeExpectancy Population
## 1      Brazil             74     199000
## 2       China             76    1390000
## 3       India             65    1240000
## 4 Switzerland             83       7997
## 5         USA             79     318000
## 6   Australia             82      23050
## 7      Greece             81      11125
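rbind can stack the two data.frames because their column names match. The sketch below, using a shortened copy of the data above, shows what happens when one name differs; the column name LifeExp is a hypothetical misspelling introduced only for this illustration.

```r
CountryData <- data.frame(
  Country = c("Brazil", "China"),
  LifeExpectancy = c(74, 76),
  Population = c(199000, 1390000)
)
NewCountryData <- data.frame(
  Country = c("Australia", "Greece"),
  LifeExpectancy = c(82, 81),
  Population = c(23050, 11125)
)

# Matching column names: the rows are stacked
AllCountryData <- rbind(CountryData, NewCountryData)
nrow(AllCountryData)

# Hypothetical mismatch: rename one column, then try to stack again
Mismatched <- NewCountryData
names(Mismatched)[2] <- "LifeExp"
result <- tryCatch(
  rbind(CountryData, Mismatched),
  error = function(e) "error"   # rbind refuses data frames whose names differ
)
result
```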

Loading csv files

The dataset WHO.csv is loaded into the variable WHO. The str command shows its structure (number of observations, variable names, and types), and the summary command gives a statistical summary of each variable.

WHO = read.csv("WHO.csv")
str(WHO)

## 'data.frame':    194 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
##  $ Population                   : int  29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
##  $ Under15                      : num  47.4 21.3 27.4 15.2 47.6 ...
##  $ Over60                       : num  3.82 14.93 7.17 22.86 3.84 ...
##  $ FertilityRate                : num  5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
##  $ LifeExpectancy               : int  60 74 73 82 51 75 76 71 82 81 ...
##  $ ChildMortality               : num  98.5 16.7 20 3.2 163.5 ...
##  $ CellularSubscribers          : num  54.3 96.4 99 75.5 48.4 ...
##  $ LiteracyRate                 : num  NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
##  $ GNI                          : num  1140 8820 8310 NA 5230 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
##  $ PrimarySchoolEnrollmentFemale: num  NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...

summary(WHO)

##                 Country                      Region     Population     
##  Afghanistan        :  1   Africa               :46   Min.   :      1  
##  Albania            :  1   Americas             :35   1st Qu.:   1696  
##  Algeria            :  1   Eastern Mediterranean:22   Median :   7790  
##  Andorra            :  1   Europe               :53   Mean   :  36360  
##  Angola             :  1   South-East Asia      :11   3rd Qu.:  24535  
##  Antigua and Barbuda:  1   Western Pacific      :27   Max.   :1390000  
##  (Other)            :188                                               
##     Under15          Over60      FertilityRate   LifeExpectancy 
##  Min.   :13.12   Min.   : 0.81   Min.   :1.260   Min.   :47.00  
##  1st Qu.:18.72   1st Qu.: 5.20   1st Qu.:1.835   1st Qu.:64.00  
##  Median :28.65   Median : 8.53   Median :2.400   Median :72.50  
##  Mean   :28.73   Mean   :11.16   Mean   :2.941   Mean   :70.01  
##  3rd Qu.:37.75   3rd Qu.:16.69   3rd Qu.:3.905   3rd Qu.:76.00  
##  Max.   :49.99   Max.   :31.92   Max.   :7.580   Max.   :83.00  
##                                  NA's   :11                     
##  ChildMortality    CellularSubscribers  LiteracyRate        GNI       
##  Min.   :  2.200   Min.   :  2.57      Min.   :31.10   Min.   :  340  
##  1st Qu.:  8.425   1st Qu.: 63.57      1st Qu.:71.60   1st Qu.: 2335  
##  Median : 18.600   Median : 97.75      Median :91.80   Median : 7870  
##  Mean   : 36.149   Mean   : 93.64      Mean   :83.71   Mean   :13321  
##  3rd Qu.: 55.975   3rd Qu.:120.81      3rd Qu.:97.85   3rd Qu.:17558  
##  Max.   :181.600   Max.   :196.41      Max.   :99.80   Max.   :86440  
##                    NA's   :10          NA's   :91      NA's   :32     
##  PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
##  Min.   : 37.20              Min.   : 32.50               
##  1st Qu.: 87.70              1st Qu.: 87.30               
##  Median : 94.70              Median : 95.10               
##  Mean   : 90.85              Mean   : 89.63               
##  3rd Qu.: 98.10              3rd Qu.: 97.90               
##  Max.   :100.00              Max.   :100.00               
##  NA's   :93                  NA's   :93
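The str output above lists Country and Region as factors. Since R 4.0.0, read.csv reads text columns as character by default, so a newer installation would show chr instead. The sketch below writes a two-row stand-in file (values taken from the first two rows of the output above; the file itself is an assumption made for this example, not WHO.csv) and reads it both ways.

```r
# Write a tiny stand-in CSV to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("Country,Region,LifeExpectancy",
             "Afghanistan,Eastern Mediterranean,60",
             "Albania,Europe,74"), path)

# Text columns kept as character (the default from R 4.0.0 onward)
as_chr <- read.csv(path, stringsAsFactors = FALSE)
class(as_chr$Region)

# Text columns converted to factors, as in the output shown above
as_fct <- read.csv(path, stringsAsFactors = TRUE)
class(as_fct$Region)
```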

Subsetting

The WHO data.frame is subset into WHO_Europe by keeping only the rows whose Region is "Europe".

WHO_Europe = subset(WHO, Region == "Europe")
str(WHO_Europe)

## 'data.frame':    53 obs. of  13 variables:
##  $ Country                      : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
##  $ Region                       : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Population                   : int  3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
##  $ Under15                      : num  21.3 15.2 20.3 14.5 22.2 ...
##  $ Over60                       : num  14.93 22.86 14.06 23.52 8.24 ...
##  $ FertilityRate                : num  1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
##  $ LifeExpectancy               : int  74 82 71 81 71 71 80 76 74 77 ...
##  $ ChildMortality               : num  16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
##  $ CellularSubscribers          : num  96.4 75.5 103.6 154.8 108.8 ...
##  $ LiteracyRate                 : num  NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
##  $ GNI                          : num  8820 NA 6100 42050 8960 ...
##  $ PrimarySchoolEnrollmentMale  : num  NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
##  $ PrimarySchoolEnrollmentFemale: num  NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
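Note that Country still reports 194 levels after subsetting: a factor keeps its full set of levels even when rows are removed. The sketch below uses a four-row stand-in for WHO (values from the first four rows shown earlier) to compare subset with bracket indexing and to show droplevels, which discards unused levels.

```r
# Four-row stand-in for the WHO data frame
WHO <- data.frame(
  Country = factor(c("Afghanistan", "Albania", "Algeria", "Andorra")),
  Region = factor(c("Eastern Mediterranean", "Europe", "Africa", "Europe")),
  Population = c(29825, 3162, 38482, 78)
)

# subset() and logical bracket indexing select the same rows here
WHO_Europe <- subset(WHO, Region == "Europe")
same_rows <- WHO[WHO$Region == "Europe", ]
all.equal(WHO_Europe, same_rows)

# The factor still carries every original level
nlevels(WHO_Europe$Country)

# droplevels() removes the levels that no longer occur
nlevels(droplevels(WHO_Europe)$Country)
```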

Writing csv files

write.csv(WHO_Europe, "WHO_Europe.csv")
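By default write.csv also writes the row names as an unnamed first column, which comes back as a column named X when the file is read again. The sketch below, with a two-row stand-in for WHO_Europe and temporary file paths (both assumptions for this example), shows the difference that row.names = FALSE makes.

```r
# Two-row stand-in for WHO_Europe
WHO_Europe <- data.frame(Country = c("Albania", "Andorra"),
                         Population = c(3162, 78))

with_rn <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")

# Default: row names are written as an extra first column
write.csv(WHO_Europe, with_rn)

# row.names = FALSE writes only the data columns
write.csv(WHO_Europe, without_rn, row.names = FALSE)

names(read.csv(with_rn))
names(read.csv(without_rn))
```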

Removing variables

rm(WHO_Europe)

Basic Data Analysis

Basic commands give a statistical summary of a variable and locate observations of interest, here the countries with the smallest and largest share of population under 15.

mean(WHO$Under15)

## [1] 28.73242

sd(WHO$Under15)

## [1] 10.53457

summary(WHO$Under15)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.12   18.72   28.65   28.73   37.75   49.99

which.min(WHO$Under15)

## [1] 86

WHO$Country[86]

## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe

which.max(WHO$Under15)

## [1] 124

WHO$Country[124]

## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
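The two-step lookup above (find the index, then type it back in) can be written as one expression, which avoids hard-coding 86 or 124. This is a small sketch on a three-row stand-in for WHO whose values are taken from the output above.

```r
# Three-row stand-in for the WHO data frame
WHO <- data.frame(
  Country = c("Japan", "Niger", "Albania"),
  Under15 = c(13.12, 49.99, 21.3)
)

# Country with the smallest share of population under 15
WHO$Country[which.min(WHO$Under15)]

# Country with the largest share
WHO$Country[which.max(WHO$Under15)]

# The same index can return the whole row instead of one column
WHO[which.min(WHO$Under15), ]
```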

Scatterplot

plot(WHO$GNI, WHO$FertilityRate)

Subsetting Outliers

Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5) 
nrow(Outliers)

## [1] 7

Outliers[c("Country","GNI","FertilityRate")]

##               Country   GNI FertilityRate
## 23           Botswana 14550          2.71
## 56  Equatorial Guinea 25620          5.04
## 63              Gabon 13740          4.18
## 83             Israel 27110          2.92
## 88         Kazakhstan 11250          2.52
## 131            Panama 14510          2.52
## 150      Saudi Arabia 24700          2.76
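GNI and FertilityRate both contain missing values, and subset silently drops rows where the condition evaluates to NA. Bracket indexing does not: it returns a row of NAs for each such case. The sketch below shows the difference on a four-row stand-in for WHO with values taken from the outputs above.

```r
# Four-row stand-in; Andorra has missing GNI and FertilityRate
WHO <- data.frame(
  Country = c("Botswana", "Israel", "Andorra", "Afghanistan"),
  GNI = c(14550, 27110, NA, 1140),
  FertilityRate = c(2.71, 2.92, NA, 5.4)
)

# subset() treats an NA condition as "do not keep"
Outliers <- subset(WHO, GNI > 10000 & FertilityRate > 2.5)
nrow(Outliers)

# Bracket indexing keeps an all-NA row for each NA condition
Bracket <- WHO[WHO$GNI > 10000 & WHO$FertilityRate > 2.5, ]
nrow(Bracket)

# Wrapping the condition in which() discards the NA cases
Fixed <- WHO[which(WHO$GNI > 10000 & WHO$FertilityRate > 2.5), ]
nrow(Fixed)
```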

Histograms

hist(WHO$CellularSubscribers)

Boxplot

boxplot(WHO$LifeExpectancy ~ WHO$Region)

boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")

Summary Tables

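The table command counts the number of observations (countries) in each region of the WHO data.frame. The tapply command splits one vector into groups defined by a second vector and applies a function, such as mean or min, to each group. Because LiteracyRate contains missing values, min returns NA for every region until na.rm=TRUE is passed.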

table(WHO$Region)

## 
##                Africa              Americas Eastern Mediterranean 
##                    46                    35                    22 
##                Europe       South-East Asia       Western Pacific 
##                    53                    11                    27

tapply(WHO$Over60, WHO$Region, mean)

##                Africa              Americas Eastern Mediterranean 
##              5.220652             10.943714              5.620000 
##                Europe       South-East Asia       Western Pacific 
##             19.774906              8.769091             10.162963

tapply(WHO$LiteracyRate, WHO$Region, min)

##                Africa              Americas Eastern Mediterranean 
##                    NA                    NA                    NA 
##                Europe       South-East Asia       Western Pacific 
##                    NA                    NA                    NA

tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)

##                Africa              Americas Eastern Mediterranean 
##                  31.1                  75.2                  63.9 
##                Europe       South-East Asia       Western Pacific 
##                  95.2                  56.8                  60.6
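Any extra argument given to tapply is passed on to the function it applies, which is how na.rm=TRUE reaches min. The sketch below repeats the two calls on a four-row stand-in (values taken from the str output earlier) and adds a count of non-missing values per region, which helps judge how much data each minimum rests on.

```r
# Four-row stand-in for the WHO data frame
WHO <- data.frame(
  Region = c("Europe", "Europe", "Africa", "Africa"),
  LiteracyRate = c(NA, 99.6, 70.1, NA)
)

# A single NA in a group makes min() return NA for that group
tapply(WHO$LiteracyRate, WHO$Region, min)

# na.rm = TRUE is forwarded to min(), which then skips the NAs
lowest <- tapply(WHO$LiteracyRate, WHO$Region, min, na.rm = TRUE)
lowest

# Number of non-missing literacy values per region
available <- tapply(!is.na(WHO$LiteracyRate), WHO$Region, sum)
available
```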

Understanding Food USDA Dataset

In the following section, the United States Department of Agriculture (USDA) dataset on the dietary contents of food is examined.

Loading in the Dataset

Read the csv file

  USDA = read.csv("USDA.csv")

Structure of the dataset

  str(USDA)

## 'data.frame':    7058 obs. of  16 variables:
##  $ ID          : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
##  $ Calories    : int  717 717 876 353 371 334 300 376 403 387 ...
##  $ Protein     : num  0.85 0.85 0.28 21.4 23.24 ...
##  $ TotalFat    : num  81.1 81.1 99.5 28.7 29.7 ...
##  $ Carbohydrate: num  0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
##  $ Sodium      : int  714 827 2 1395 560 629 842 690 621 700 ...
##  $ SaturatedFat: num  51.4 50.5 61.9 18.7 18.8 ...
##  $ Cholesterol : int  215 219 256 75 94 100 72 93 105 103 ...
##  $ Sugar       : num  0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
##  $ Calcium     : int  24 24 4 528 674 184 388 673 721 643 ...
##  $ Iron        : num  0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
##  $ Potassium   : int  24 26 5 256 136 152 187 93 98 95 ...
##  $ VitaminC    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VitaminE    : num  2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
##  $ VitaminD    : num  1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...

Statistical summary

  summary(USDA)

##        ID       
##  Min.   : 1001  
##  1st Qu.: 8387  
##  Median :13294  
##  Mean   :14260  
##  3rd Qu.:18337  
##  Max.   :93600  
##                 
##                                                        Description  
##  BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA:   2  
##  CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA             :   2  
##  OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT    :   2  
##  POPCORN,OIL-POPPED,LOFAT                                    :   2  
##  ABALONE,MIXED SPECIES,RAW                                   :   1  
##  ABALONE,MXD SP,CKD,FRIED                                    :   1  
##  (Other)                                                     :7048  
##     Calories        Protein         TotalFat       Carbohydrate   
##  Min.   :  0.0   Min.   : 0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 85.0   1st Qu.: 2.29   1st Qu.:  0.72   1st Qu.:  0.00  
##  Median :181.0   Median : 8.20   Median :  4.37   Median :  7.13  
##  Mean   :219.7   Mean   :11.71   Mean   : 10.32   Mean   : 20.70  
##  3rd Qu.:331.0   3rd Qu.:20.43   3rd Qu.: 12.70   3rd Qu.: 28.17  
##  Max.   :902.0   Max.   :88.32   Max.   :100.00   Max.   :100.00  
##  NA's   :1       NA's   :1       NA's   :1        NA's   :1       
##      Sodium         SaturatedFat     Cholesterol          Sugar       
##  Min.   :    0.0   Min.   : 0.000   Min.   :   0.00   Min.   : 0.000  
##  1st Qu.:   37.0   1st Qu.: 0.172   1st Qu.:   0.00   1st Qu.: 0.000  
##  Median :   79.0   Median : 1.256   Median :   3.00   Median : 1.395  
##  Mean   :  322.1   Mean   : 3.452   Mean   :  41.55   Mean   : 8.257  
##  3rd Qu.:  386.0   3rd Qu.: 4.028   3rd Qu.:  69.00   3rd Qu.: 7.875  
##  Max.   :38758.0   Max.   :95.600   Max.   :3100.00   Max.   :99.800  
##  NA's   :84        NA's   :301      NA's   :288       NA's   :1910    
##     Calcium             Iron           Potassium          VitaminC       
##  Min.   :   0.00   Min.   :  0.000   Min.   :    0.0   Min.   :   0.000  
##  1st Qu.:   9.00   1st Qu.:  0.520   1st Qu.:  135.0   1st Qu.:   0.000  
##  Median :  19.00   Median :  1.330   Median :  250.0   Median :   0.000  
##  Mean   :  73.53   Mean   :  2.828   Mean   :  301.4   Mean   :   9.436  
##  3rd Qu.:  56.00   3rd Qu.:  2.620   3rd Qu.:  348.0   3rd Qu.:   3.100  
##  Max.   :7364.00   Max.   :123.600   Max.   :16500.0   Max.   :2400.000  
##  NA's   :136       NA's   :123       NA's   :409       NA's   :332       
##     VitaminE          VitaminD       
##  Min.   :  0.000   Min.   :  0.0000  
##  1st Qu.:  0.120   1st Qu.:  0.0000  
##  Median :  0.270   Median :  0.0000  
##  Mean   :  1.488   Mean   :  0.5769  
##  3rd Qu.:  0.710   3rd Qu.:  0.1000  
##  Max.   :149.400   Max.   :250.0000  
##  NA's   :2720      NA's   :2834

Basic Data Analysis

Vector notation

  USDA$Sodium

Finding the index of the food with highest sodium levels

  which.max(USDA$Sodium)

## [1] 265

Get names of variables in the dataset

  names(USDA)

##  [1] "ID"           "Description"  "Calories"     "Protein"     
##  [5] "TotalFat"     "Carbohydrate" "Sodium"       "SaturatedFat"
##  [9] "Cholesterol"  "Sugar"        "Calcium"      "Iron"        
## [13] "Potassium"    "VitaminC"     "VitaminE"     "VitaminD"

Get the name of the food with highest sodium levels

  USDA$Description[265]

## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK

Create a subset of the foods with sodium content above 10,000mg

  HighSodium = subset(USDA, Sodium>10000)

Count the number of rows, or observations

  nrow(HighSodium)

## [1] 10

Output names of the foods with high sodium content

  HighSodium$Description

##  [1] SALT,TABLE                                             
##  [2] SOUP,BF BROTH OR BOUILLON,PDR,DRY                      
##  [3] SOUP,BEEF BROTH,CUBED,DRY                              
##  [4] SOUP,CHICK BROTH OR BOUILLON,DRY                       
##  [5] SOUP,CHICK BROTH CUBES,DRY                             
##  [6] GRAVY,AU JUS,DRY                                       
##  [7] ADOBO FRESCO                                           
##  [8] LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE
##  [9] LEAVENING AGENTS,BAKING SODA                           
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND                        
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK

Finding the index of CAVIAR in the dataset

  match("CAVIAR", USDA$Description)

## [1] 4154

Find amount of sodium in caviar

  USDA$Sodium[4154]

## [1] 1500

Doing it in one command!

  USDA$Sodium[match("CAVIAR", USDA$Description)]

## [1] 1500
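match looks for an exact, whole-string match and returns only the first position found. For a partial lookup, grep is one alternative. The sketch below uses four descriptions copied from the outputs above as a stand-in for USDA$Description.

```r
# Stand-in for USDA$Description
Description <- c("SALT,TABLE",
                 "CAVIAR",
                 "SOUP,BEEF BROTH,CUBED,DRY",
                 "SOUP,CHICK BROTH CUBES,DRY")

# Exact match: position of the entry that equals "CAVIAR"
match("CAVIAR", Description)

# No entry equals "SOUP" exactly, so match() returns NA
match("SOUP", Description)

# Partial match: positions of all entries that contain "SOUP"
grep("SOUP", Description, fixed = TRUE)
```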

Summary function over Sodium vector

  summary(USDA$Sodium)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     0.0    37.0    79.0   322.1   386.0 38758.0      84

Standard deviation

  sd(USDA$Sodium, na.rm = TRUE)

## [1] 1045.417

Plots of USDA Dataset

Scatter Plots

  plot(USDA$Protein, USDA$TotalFat)

Add xlabel, ylabel and title

  plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")

Histograms

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")

Add limits to x-axis

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))

Specify breaks of histogram

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)

  hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
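The breaks argument is a suggested number of cells over the full range of the data, not over the window set by xlim, and R rounds the break points to "pretty" values. With Vitamin C values reaching 2400, only a few of 100 cells fall between 0 and 100, which is why a much larger number is needed. The sketch below inspects the cells without drawing anything; the Vitamin C values are made up for illustration and are not USDA data.

```r
# Made-up values for illustration only (not taken from USDA.csv)
VitaminC <- c(0, 0, 0, 0.5, 2, 3.1, 12, 45, 90, 400, 2400)

# plot = FALSE returns the histogram object instead of drawing it
h <- hist(VitaminC, breaks = 100, plot = FALSE)

# Number of cells actually used (a suggestion of 100 is not binding)
length(h$breaks) - 1

# Width of the first cell, in the same units as the data
diff(h$breaks)[1]

# Every value lands in exactly one cell
sum(h$counts)
```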

Boxplots

  boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")

Adding a variable

Creating a variable that takes value 1 if the food has higher sodium than average, 0 otherwise

  HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
  str(HighSodium)

##  num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...

Adding the variable to the dataset

  USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))

Similarly for HighProtein, HighCarbs, HighFat

  USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
  USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
  USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))

Summary Tables

How many foods have higher sodium level than average?

  table(USDA$HighSodium)

## 
##    0    1 
## 4884 2090
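The two counts add up to 6974, not 7058. The 84 foods with a missing Sodium value get NA in HighSodium, and table leaves NA out unless asked to include it. A minimal sketch on four values (the first three Sodium values from the str output, plus one NA added for illustration):

```r
# First three Sodium values from str(USDA), plus one missing value
Sodium <- c(714, 827, 2, NA)

# Comparing NA with the mean yields NA, which as.numeric() keeps as NA
HighSodium <- as.numeric(Sodium > mean(Sodium, na.rm = TRUE))
HighSodium

# By default table() ignores the NA entry
table(HighSodium)

# useNA = "ifany" adds a separate count for missing values
table(HighSodium, useNA = "ifany")
```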

How many foods have both high sodium and high fat?

  table(USDA$HighSodium, USDA$HighFat)

##    
##        0    1
##   0 3529 1355
##   1 1378  712
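In the two-way table the rows correspond to the first argument (HighSodium) and the columns to the second (HighFat), so the 712 foods in row 1, column 1 are high in both. The sketch below shows how to pull out a single cell by name and how to add totals; the five indicator values are made up for illustration.

```r
# Made-up 0/1 indicators for five foods (illustration only)
HighSodium <- c(0, 0, 1, 1, 1)
HighFat <- c(0, 1, 0, 1, 1)

# Rows follow the first argument, columns the second
tab <- table(HighSodium, HighFat)
tab

# Cell for foods that are high in both, addressed by level name
tab["1", "1"]

# Same table with row and column totals appended
addmargins(tab)
```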

Average amount of iron, grouped by high and low protein?

  tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)

##        0        1 
## 2.558945 3.197294

Maximum level of Vitamin C in foods with high and low carbs?

  tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)

##      0      1 
## 1677.6 2400.0

Using summary function with tapply

  tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)

## $`0`
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##    0.000    0.000    0.000    6.364    2.800 1677.600      248 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00    0.00    0.20   16.31    4.50 2400.00      83

Sulman Khan

October 23, 2018