Analytics is the science of using data to build models that lead to better decisions, adding value for individuals, companies, and institutions.
This section explores the basic operations and functions used in R for data analysis, starting with arithmetic, built-in functions, and variable assignment.
8*6
## [1] 48
2^16
## [1] 65536
8*10
## [1] 80
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
ls()
## [1] "HoursYear" "SquareRoot2"
Two vectors, Country and LifeExpectancy, are created with the c function and then indexed with square brackets to display specific elements. Finally, a third vector, Sequence, holds the values from 0 to 100 in increments of 2.
c(2,3,5,8,13)
## [1] 2 3 5 8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil" "China" "India" "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
## [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32
## [18] 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66
## [35] 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100
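The indexing shown above extends beyond single positions. The following is a minimal sketch that reuses the same two vectors; the name LongLived is introduced here only for illustration.

```r
# Vectors copied from the section above.
Country <- c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancy <- c(74, 76, 65, 83, 79)

# R indexes from 1; a vector of positions selects several elements at once.
Country[c(1, 3)]

# A logical condition on one vector can index another of the same length.
LongLived <- Country[LifeExpectancy > 75]
LongLived

# seq(from, to, by) includes both endpoints when `by` divides the range evenly.
Sequence <- seq(0, 100, 2)
length(Sequence)
```

Logical indexing of this kind is the same mechanism that subset relies on later for data.frames.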
The data.frame CountryData is built from the two vectors Country and LifeExpectancy. A third column, CountryData$Population, is then added with the $ operator. Finally, a second data.frame, NewCountryData, is created with the same three columns, and rbind stacks the rows of CountryData and NewCountryData into AllCountryData.
CountryData = data.frame(Country, LifeExpectancy)
CountryData
## Country LifeExpectancy
## 1 Brazil 74
## 2 China 76
## 3 India 65
## 4 Switzerland 83
## 5 USA 79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
## Country LifeExpectancy Population
## 1 Australia 82 23050
## 2 Greece 81 11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
## 6 Australia 82 23050
## 7 Greece 81 11125
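Stacking rows with rbind works here because both data.frames have the same column names. The sketch below illustrates that requirement on a shortened copy of the data; the Mismatched data.frame, its LifeExp column, and its values are hypothetical and exist only to trigger the error.

```r
CountryData <- data.frame(
  Country = c("Brazil", "China"),
  LifeExpectancy = c(74, 76),
  Population = c(199000, 1390000)
)
NewCountryData <- data.frame(
  Country = c("Australia", "Greece"),
  LifeExpectancy = c(82, 81),
  Population = c(23050, 11125)
)

# rbind() matches data.frame columns by name, so both inputs need the same
# set of column names.
AllCountryData <- rbind(CountryData, NewCountryData)
nrow(AllCountryData)

# Hypothetical row with a misnamed column: rbind() stops with an error
# instead of guessing which columns belong together.
Mismatched <- data.frame(Country = "Chile", LifeExp = 79, Population = 17000)
Result <- tryCatch(rbind(CountryData, Mismatched), error = function(e) "error")
Result
```

Wrapping the call in tryCatch is only a way to show the failure without halting the script; in interactive use the error message itself names the problem.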
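The dataset WHO.csv is loaded into the data.frame WHO. The str command shows its structure (number of observations, column names, column types, and the first few values), and the summary command reports summary statistics for each column, including the number of missing values (NA's).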
WHO = read.csv("WHO.csv")
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
## Country Region Population
## Afghanistan : 1 Africa :46 Min. : 1
## Albania : 1 Americas :35 1st Qu.: 1696
## Algeria : 1 Eastern Mediterranean:22 Median : 7790
## Andorra : 1 Europe :53 Mean : 36360
## Angola : 1 South-East Asia :11 3rd Qu.: 24535
## Antigua and Barbuda: 1 Western Pacific :27 Max. :1390000
## (Other) :188
## Under15 Over60 FertilityRate LifeExpectancy
## Min. :13.12 Min. : 0.81 Min. :1.260 Min. :47.00
## 1st Qu.:18.72 1st Qu.: 5.20 1st Qu.:1.835 1st Qu.:64.00
## Median :28.65 Median : 8.53 Median :2.400 Median :72.50
## Mean :28.73 Mean :11.16 Mean :2.941 Mean :70.01
## 3rd Qu.:37.75 3rd Qu.:16.69 3rd Qu.:3.905 3rd Qu.:76.00
## Max. :49.99 Max. :31.92 Max. :7.580 Max. :83.00
## NA's :11
## ChildMortality CellularSubscribers LiteracyRate GNI
## Min. : 2.200 Min. : 2.57 Min. :31.10 Min. : 340
## 1st Qu.: 8.425 1st Qu.: 63.57 1st Qu.:71.60 1st Qu.: 2335
## Median : 18.600 Median : 97.75 Median :91.80 Median : 7870
## Mean : 36.149 Mean : 93.64 Mean :83.71 Mean :13321
## 3rd Qu.: 55.975 3rd Qu.:120.81 3rd Qu.:97.85 3rd Qu.:17558
## Max. :181.600 Max. :196.41 Max. :99.80 Max. :86440
## NA's :10 NA's :91 NA's :32
## PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## Min. : 37.20 Min. : 32.50
## 1st Qu.: 87.70 1st Qu.: 87.30
## Median : 94.70 Median : 95.10
## Mean : 90.85 Mean : 89.63
## 3rd Qu.: 98.10 3rd Qu.: 97.90
## Max. :100.00 Max. :100.00
## NA's :93 NA's :93
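The loading step can be tried without WHO.csv. The sketch below is a stand-in under stated assumptions: it writes a three-row temporary file whose values are copied from the str output above, and the names Path, Toy, and MissingPerColumn are hypothetical.

```r
# Three-row file standing in for WHO.csv; the blank GNI field is read as NA.
Path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Region,Population,GNI",
  "Afghanistan,Eastern Mediterranean,29825,1140",
  "Albania,Europe,3162,8820",
  "Andorra,Europe,78,"
), Path)

# stringsAsFactors = TRUE reproduces the Factor columns shown above; since
# R 4.0.0 the default is FALSE, which gives character columns instead.
Toy <- read.csv(Path, stringsAsFactors = TRUE)
str(Toy)

# Count missing values per column, the same figures summary() lists as NA's.
MissingPerColumn <- colSums(is.na(Toy))
MissingPerColumn
```

If the Factor columns in the output above cannot be reproduced on a recent R installation, the changed default of stringsAsFactors is the likely reason.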
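The subset function extracts the rows of WHO whose Region equals "Europe" into a new data.frame, WHO_Europe. The result is saved to WHO_Europe.csv with write.csv and then removed from the workspace with rm.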
WHO_Europe = subset(WHO, Region == "Europe")
str(WHO_Europe)
## 'data.frame': 53 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
## $ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
## $ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
## $ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
## $ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
## $ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
## $ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
## $ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
## $ GNI : num 8820 NA 6100 42050 8960 ...
## $ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
## $ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
write.csv(WHO_Europe, "WHO_Europe.csv")
rm(WHO_Europe)
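The str output above still reports 194 levels for Country although only 53 rows remain, because subset keeps unused factor levels. The sketch below shows this on a hypothetical four-row data.frame (Toy and ToyEurope are illustrative names) and also shows the row.names argument of write.csv.

```r
# Hypothetical four-row stand-in for the WHO data.frame.
Toy <- data.frame(
  Country = factor(c("Albania", "Algeria", "Andorra", "Angola")),
  Region = factor(c("Europe", "Africa", "Europe", "Africa"))
)

ToyEurope <- subset(Toy, Region == "Europe")
nlevels(ToyEurope$Country)   # still 4: unused levels are kept

# droplevels() removes factor levels that no longer occur in the subset.
ToyEurope <- droplevels(ToyEurope)
nlevels(ToyEurope$Country)

# row.names = FALSE keeps write.csv() from adding an unnamed index column.
Path <- tempfile(fileext = ".csv")
write.csv(ToyEurope, Path, row.names = FALSE)
names(read.csv(Path))
```

Whether to drop levels depends on the later analysis; keeping them is harmless for the commands used in this document.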
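Basic data analysis commands are used to summarize individual columns and to locate observations of interest. The mean, sd, and summary commands describe a column; which.min and which.max return the row index of its smallest and largest values, which can then be used to look up the country; and subset, plot, hist, and boxplot help to spot outliers and compare groups.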
mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.12 18.72 28.65 28.73 37.75 49.99
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra ... Zimbabwe
plot(WHO$GNI, WHO$FertilityRate)
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5)
nrow(Outliers)
## [1] 7
Outliers[c("Country","GNI","FertilityRate")]
## Country GNI FertilityRate
## 23 Botswana 14550 2.71
## 56 Equatorial Guinea 25620 5.04
## 63 Gabon 13740 4.18
## 83 Israel 27110 2.92
## 88 Kazakhstan 11250 2.52
## 131 Panama 14510 2.52
## 150 Saudi Arabia 24700 2.76
hist(WHO$CellularSubscribers)
### Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)
boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")
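Several WHO columns contain missing values, which changes how these commands behave. The following sketch uses a hypothetical four-row data.frame (the country labels and values are invented for illustration) to show three cases: mean, which.max, and subset.

```r
# Hypothetical values; NA marks a missing measurement.
Toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  GNI = c(1140, NA, 25620, 14550),
  FertilityRate = c(5.40, 1.75, 5.04, NA)
)

mean(Toy$GNI)                 # NA, because one value is missing
MeanGNI <- mean(Toy$GNI, na.rm = TRUE)

# which.max() skips NA and returns a row position usable as an index.
Richest <- Toy$Country[which.max(Toy$GNI)]

# subset() keeps only rows where the condition is TRUE, so rows where it
# evaluates to NA are dropped.
Outliers <- subset(Toy, GNI > 10000 & FertilityRate > 2.5)
Outliers$Country
```

Row D has a GNI above 10000 but no FertilityRate, so it is silently excluded from Outliers; the same applies to the Outliers subset computed on the WHO data.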
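The table command counts the number of countries in each region of the WHO data.frame. The tapply command splits one column into groups defined by another and applies a function to each group; here it computes the mean of Over60 and the minimum of LiteracyRate by Region. Because min returns NA whenever a group contains a missing value, na.rm=TRUE is passed to ignore missing values.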
table(WHO$Region)
##
## Africa Americas Eastern Mediterranean
## 46 35 22
## Europe South-East Asia Western Pacific
## 53 11 27
tapply(WHO$Over60, WHO$Region, mean)
## Africa Americas Eastern Mediterranean
## 5.220652 10.943714 5.620000
## Europe South-East Asia Western Pacific
## 19.774906 8.769091 10.162963
tapply(WHO$LiteracyRate, WHO$Region, min)
## Africa Americas Eastern Mediterranean
## NA NA NA
## Europe South-East Asia Western Pacific
## NA NA NA
tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
## Africa Americas Eastern Mediterranean
## 31.1 75.2 63.9
## Europe South-East Asia Western Pacific
## 95.2 56.8 60.6
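The grouping behaviour can be checked on a few values. This is a minimal sketch with hypothetical data for two regions; the names Counts, MinWithNA, and MinNoNA are illustrative.

```r
# Hypothetical values for two regions.
Region <- c("Africa", "Africa", "Europe", "Europe", "Europe")
LiteracyRate <- c(31.1, NA, 95.2, 99.6, NA)

# table() counts how many observations fall in each group.
Counts <- table(Region)
Counts

# tapply(values, groups, FUN, ...) applies FUN to each group separately;
# extra arguments such as na.rm = TRUE are passed on to FUN.
MinWithNA <- tapply(LiteracyRate, Region, min)
MinNoNA <- tapply(LiteracyRate, Region, min, na.rm = TRUE)
MinWithNA
MinNoNA
```

Both groups contain an NA, so the first call returns NA for each region, mirroring the all-NA result seen above for the WHO data.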
The following section examines the United States Department of Agriculture (USDA) dataset on the nutritional content of foods.
USDA = read.csv("USDA.csv")
str(USDA)
## 'data.frame': 7058 obs. of 16 variables:
## $ ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
## $ Calories : int 717 717 876 353 371 334 300 376 403 387 ...
## $ Protein : num 0.85 0.85 0.28 21.4 23.24 ...
## $ TotalFat : num 81.1 81.1 99.5 28.7 29.7 ...
## $ Carbohydrate: num 0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
## $ Sodium : int 714 827 2 1395 560 629 842 690 621 700 ...
## $ SaturatedFat: num 51.4 50.5 61.9 18.7 18.8 ...
## $ Cholesterol : int 215 219 256 75 94 100 72 93 105 103 ...
## $ Sugar : num 0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
## $ Calcium : int 24 24 4 528 674 184 388 673 721 643 ...
## $ Iron : num 0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
## $ Potassium : int 24 26 5 256 136 152 187 93 98 95 ...
## $ VitaminC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VitaminE : num 2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
## $ VitaminD : num 1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
summary(USDA)
## ID
## Min. : 1001
## 1st Qu.: 8387
## Median :13294
## Mean :14260
## 3rd Qu.:18337
## Max. :93600
##
## Description
## BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA: 2
## CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA : 2
## OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT : 2
## POPCORN,OIL-POPPED,LOFAT : 2
## ABALONE,MIXED SPECIES,RAW : 1
## ABALONE,MXD SP,CKD,FRIED : 1
## (Other) :7048
## Calories Protein TotalFat Carbohydrate
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 85.0 1st Qu.: 2.29 1st Qu.: 0.72 1st Qu.: 0.00
## Median :181.0 Median : 8.20 Median : 4.37 Median : 7.13
## Mean :219.7 Mean :11.71 Mean : 10.32 Mean : 20.70
## 3rd Qu.:331.0 3rd Qu.:20.43 3rd Qu.: 12.70 3rd Qu.: 28.17
## Max. :902.0 Max. :88.32 Max. :100.00 Max. :100.00
## NA's :1 NA's :1 NA's :1 NA's :1
## Sodium SaturatedFat Cholesterol Sugar
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 37.0 1st Qu.: 0.172 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 79.0 Median : 1.256 Median : 3.00 Median : 1.395
## Mean : 322.1 Mean : 3.452 Mean : 41.55 Mean : 8.257
## 3rd Qu.: 386.0 3rd Qu.: 4.028 3rd Qu.: 69.00 3rd Qu.: 7.875
## Max. :38758.0 Max. :95.600 Max. :3100.00 Max. :99.800
## NA's :84 NA's :301 NA's :288 NA's :1910
## Calcium Iron Potassium VitaminC
## Min. : 0.00 Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 9.00 1st Qu.: 0.520 1st Qu.: 135.0 1st Qu.: 0.000
## Median : 19.00 Median : 1.330 Median : 250.0 Median : 0.000
## Mean : 73.53 Mean : 2.828 Mean : 301.4 Mean : 9.436
## 3rd Qu.: 56.00 3rd Qu.: 2.620 3rd Qu.: 348.0 3rd Qu.: 3.100
## Max. :7364.00 Max. :123.600 Max. :16500.0 Max. :2400.000
## NA's :136 NA's :123 NA's :409 NA's :332
## VitaminE VitaminD
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.120 1st Qu.: 0.0000
## Median : 0.270 Median : 0.0000
## Mean : 1.488 Mean : 0.5769
## 3rd Qu.: 0.710 3rd Qu.: 0.1000
## Max. :149.400 Max. :250.0000
## NA's :2720 NA's :2834
USDA$Sodium
which.max(USDA$Sodium)
## [1] 265
names(USDA)
## [1] "ID" "Description" "Calories" "Protein"
## [5] "TotalFat" "Carbohydrate" "Sodium" "SaturatedFat"
## [9] "Cholesterol" "Sugar" "Calcium" "Iron"
## [13] "Potassium" "VitaminC" "VitaminE" "VitaminD"
USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK
HighSodium = subset(USDA, Sodium>10000)
nrow(HighSodium)
## [1] 10
HighSodium$Description
## [1] SALT,TABLE
## [2] SOUP,BF BROTH OR BOUILLON,PDR,DRY
## [3] SOUP,BEEF BROTH,CUBED,DRY
## [4] SOUP,CHICK BROTH OR BOUILLON,DRY
## [5] SOUP,CHICK BROTH CUBES,DRY
## [6] GRAVY,AU JUS,DRY
## [7] ADOBO FRESCO
## [8] LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE
## [9] LEAVENING AGENTS,BAKING SODA
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ... ZWIEBACK
match("CAVIAR", USDA$Description)
## [1] 4154
USDA$Sodium[4154]
## [1] 1500
USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
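The match command used above looks for an exact, case-sensitive match and returns the position of the first one. The sketch below contrasts it with grep for partial matches; the vectors are small stand-ins in which "EXAMPLE FOOD" and its sodium value are hypothetical, while the other two values are taken from the output above.

```r
# Small stand-in for the Description and Sodium columns.
Description <- c("EXAMPLE FOOD", "CAVIAR", "SALT,TABLE")
Sodium <- c(100, 1500, 38758)

# match() returns the position of the first exact match, or NA if none.
match("CAVIAR", Description)
match("Caviar", Description)        # NA: matching is case-sensitive

# grep() finds partial matches using a regular expression.
SaltRows <- grep("SALT", Description)
Sodium[SaltRows]
```

On the full dataset a pattern such as "SALT" would likely match many descriptions, so grep is better suited to exploration and match to looking up one known item.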
summary(USDA$Sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 37.0 79.0 322.1 386.0 38758.0 84
sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417
plot(USDA$Protein, USDA$TotalFat)
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
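In the histograms above, xlim only limits the visible range, while the bins are still computed over all values, and a single number given to breaks is treated by hist as a suggestion. A sketch of exact bin control follows, assuming hypothetical vitamin C values; Bins is an illustrative name.

```r
VitaminC <- c(0, 0, 0.5, 3, 12, 60, 2400)   # hypothetical values in mg

# A vector of cut points gives exact control over the bins.
# plot = FALSE returns the bin counts without drawing anything.
Bins <- hist(VitaminC, breaks = seq(0, 2400, by = 100), plot = FALSE)
Bins$counts[1]        # number of values between 0 and 100
length(Bins$counts)   # number of bins
```

Inspecting the counts this way is a quick check that a chosen bin width puts most of the data where the plot is expected to show it.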
HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
str(HighSodium)
## num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...
USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
table(USDA$HighSodium)
##
## 0 1
## 4884 2090
table(USDA$HighSodium, USDA$HighFat)
##
## 0 1
## 0 3529 1355
## 1 1378 712
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
## 0 1
## 2.558945 3.197294
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
## 0 1
## 1677.6 2400.0
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 6.364 2.800 1677.600 248
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.20 16.31 4.50 2400.00 83
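The indicator columns above inherit missing values from their source columns: the two counts in table(USDA$HighSodium) sum to 6974, which is 84 fewer than the 7058 rows and matches the 84 NA's in Sodium. The following sketch reproduces that behaviour on hypothetical values.

```r
Sodium <- c(10, 20, NA, 400, 50)   # hypothetical values in mg

# The comparison yields TRUE/FALSE (or NA); as.numeric() maps these to 1/0/NA.
HighSodium <- as.numeric(Sodium > mean(Sodium, na.rm = TRUE))
HighSodium

# table() drops NA by default; useNA = "ifany" adds a column for them.
table(HighSodium)
Counts <- table(HighSodium, useNA = "ifany")
Counts
```

Passing useNA = "ifany" is optional, but it makes the number of unclassified foods visible in the table itself.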