The science of using data to build models that lead to better decisions that add value to individuals, to companies, and to institutions.
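The following section explores the basic operations and functions that R provides for data analysis: arithmetic, built-in functions such as sqrt and abs, variable assignment, and listing the workspace with ls.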
8*6
## [1] 48
2^16
## [1] 65536
8*10
## [1] 80
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
ls()
## [1] "HoursYear" "SquareRoot2"
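The two assignment operators used above behave the same way at the top level. The following is a minimal sketch of assigning, reusing, and removing variables; the names DaysYear and MinutesYear are illustrative and do not appear in the original session.

```r
# Both = and <- bind a value to a name at the top level.
DaysYear = 365
HoursYear <- DaysYear * 24
MinutesYear <- HoursYear * 60
MinutesYear

# rm() removes a name from the workspace.
rm(DaysYear)

# exists() reports whether a name is still bound in the current environment.
exists("DaysYear", inherits = FALSE)
exists("HoursYear", inherits = FALSE)
```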
Two vectors, Country and LifeExpectancy, are created with the c function. Each vector is then indexed with square brackets to display a single element. Finally, a third vector, Sequence, is built with seq and holds the values from 0 to 100 in increments of 2.
c(2,3,5,8,13)
## [1] 2 3 5 8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil" "China" "India" "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
Sequence = seq(0,100,2)
Sequence
## [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
## [40] 78 80 82 84 86 88 90 92 94 96 98 100
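Indexing is not limited to a single position. As a small sketch using the same two vectors, a vector of positions or a logical condition can select several elements at once, and seq accepts named arguments that make the step size explicit.

```r
Country <- c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancy <- c(74, 76, 65, 83, 79)

# A vector of positions selects several elements at once.
Country[c(1, 3)]

# A logical condition keeps only the positions where it is TRUE.
Country[LifeExpectancy > 75]

# Named arguments to seq(): start, end, and step size.
Sequence <- seq(from = 0, to = 100, by = 2)
length(Sequence)
```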
The data.frame CountryData is built from the two vectors Country and LifeExpectancy. Next, an additional column, CountryData$Population, is added to the data.frame. Finally, a second data.frame, NewCountryData, is created with the same three columns, and rbind stacks its rows beneath those of CountryData to form AllCountryData.
CountryData = data.frame(Country, LifeExpectancy)
CountryData
## Country LifeExpectancy
## 1 Brazil 74
## 2 China 76
## 3 India 65
## 4 Switzerland 83
## 5 USA 79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
## Country LifeExpectancy Population
## 1 Australia 82 23050
## 2 Greece 81 11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
## 6 Australia 82 23050
## 7 Greece 81 11125
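The rbind call above works because both data frames have identical column names. The sketch below illustrates that requirement on a shortened copy of the data; the data frame Mismatched and its column name LifeExp are hypothetical and exist only to trigger the error.

```r
CountryData <- data.frame(
  Country = c("Brazil", "China"),
  LifeExpectancy = c(74, 76),
  Population = c(199000, 1390000)
)
NewCountryData <- data.frame(
  Country = c("Australia", "Greece"),
  LifeExpectancy = c(82, 81),
  Population = c(23050, 11125)
)

# rbind() matches columns by name and stacks the rows.
AllCountryData <- rbind(CountryData, NewCountryData)
dim(AllCountryData)

# Hypothetical data frame with one misspelled column name.
Mismatched <- data.frame(Country = "Greece", LifeExp = 81, Population = 11125)

# rbind() refuses mismatched names; try() captures the error instead of stopping.
Result <- try(rbind(CountryData, Mismatched), silent = TRUE)
inherits(Result, "try-error")
```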
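The dataset WHO.csv is loaded into the data.frame WHO with read.csv. The str command describes its structure (the number of observations and the type of each variable), and the summary command reports descriptive statistics for each variable, including the number of missing values (NA's).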
WHO = read.csv("WHO.csv")
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
summary(WHO)
## Country Region Population Under15 Over60 FertilityRate LifeExpectancy ChildMortality
## Afghanistan : 1 Africa :46 Min. : 1 Min. :13.12 Min. : 0.81 Min. :1.260 Min. :47.00 Min. : 2.200
## Albania : 1 Americas :35 1st Qu.: 1696 1st Qu.:18.72 1st Qu.: 5.20 1st Qu.:1.835 1st Qu.:64.00 1st Qu.: 8.425
## Algeria : 1 Eastern Mediterranean:22 Median : 7790 Median :28.65 Median : 8.53 Median :2.400 Median :72.50 Median : 18.600
## Andorra : 1 Europe :53 Mean : 36360 Mean :28.73 Mean :11.16 Mean :2.941 Mean :70.01 Mean : 36.149
## Angola : 1 South-East Asia :11 3rd Qu.: 24535 3rd Qu.:37.75 3rd Qu.:16.69 3rd Qu.:3.905 3rd Qu.:76.00 3rd Qu.: 55.975
## Antigua and Barbuda: 1 Western Pacific :27 Max. :1390000 Max. :49.99 Max. :31.92 Max. :7.580 Max. :83.00 Max. :181.600
## (Other) :188 NA's :11
## CellularSubscribers LiteracyRate GNI PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## Min. : 2.57 Min. :31.10 Min. : 340 Min. : 37.20 Min. : 32.50
## 1st Qu.: 63.57 1st Qu.:71.60 1st Qu.: 2335 1st Qu.: 87.70 1st Qu.: 87.30
## Median : 97.75 Median :91.80 Median : 7870 Median : 94.70 Median : 95.10
## Mean : 93.64 Mean :83.71 Mean :13321 Mean : 90.85 Mean : 89.63
## 3rd Qu.:120.81 3rd Qu.:97.85 3rd Qu.:17558 3rd Qu.: 98.10 3rd Qu.: 97.90
## Max. :196.41 Max. :99.80 Max. :86440 Max. :100.00 Max. :100.00
## NA's :10 NA's :91 NA's :32 NA's :93 NA's :93
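The str output above reports Country and Region as factors. In R 4.0.0 and later, read.csv defaults to stringsAsFactors = FALSE, so the same call would report them as chr unless the argument is set explicitly. The sketch below shows this on a four-row extract typed in by hand from the output above; CsvText and Sample are illustrative names, not part of the original session.

```r
# Four rows copied from the str(WHO) output, written as CSV text.
CsvText <- paste(
  "Country,Region,FertilityRate",
  "Afghanistan,Eastern Mediterranean,5.4",
  "Albania,Europe,1.75",
  "Algeria,Africa,2.83",
  "Andorra,Europe,NA",
  sep = "\n"
)

# stringsAsFactors = TRUE reproduces the factor columns on any R version.
Sample <- read.csv(text = CsvText, stringsAsFactors = TRUE)
str(Sample)

# summary() reports the NA count; is.na() gives the same number directly.
summary(Sample$FertilityRate)
sum(is.na(Sample$FertilityRate))
```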
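The WHO data.frame is subsetted into WHO_Europe by keeping only the rows whose Region equals "Europe". The subset is then written to the file WHO_Europe.csv with write.csv and removed from the workspace with rm.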
WHO_Europe = subset(WHO, Region == "Europe")
str(WHO_Europe)
## 'data.frame': 53 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
## $ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
## $ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
## $ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
## $ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
## $ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
## $ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
## $ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
## $ GNI : num 8820 NA 6100 42050 8960 ...
## $ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
## $ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
write.csv(WHO_Europe, "WHO_Europe.csv")
rm(WHO_Europe)
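Two details of the steps above are worth a short sketch: logical row indexing selects the same rows as subset, and write.csv includes the row names as an extra first column unless row.names = FALSE is passed. The data frame WHO_Sample is a hand-typed extract of the first four rows shown earlier, and the temporary file path is only for illustration.

```r
WHO_Sample <- data.frame(
  Country = c("Afghanistan", "Albania", "Algeria", "Andorra"),
  Region = c("Eastern Mediterranean", "Europe", "Africa", "Europe"),
  Population = c(29825, 3162, 38482, 78)
)

# subset() and logical row indexing pick the same rows here.
Europe1 <- subset(WHO_Sample, Region == "Europe")
Europe2 <- WHO_Sample[WHO_Sample$Region == "Europe", ]
Europe1$Country
Europe2$Country

# row.names = FALSE keeps the row numbers out of the file.
OutFile <- tempfile(fileext = ".csv")
write.csv(Europe1, OutFile, row.names = FALSE)
ReadBack <- read.csv(OutFile)
names(ReadBack)
```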
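Basic data analysis commands summarize a single variable and identify observations of interest. The mean, sd, and summary commands describe the Under15 variable, and which.min and which.max return the row numbers of the countries with the smallest and largest values. A scatter plot, a histogram, and box plots then visualize the data, and subset isolates the countries with both a high GNI and a high fertility rate.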
mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.12 18.72 28.65 28.73 37.75 49.99
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain ... Zimbabwe
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain ... Zimbabwe
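The two-step lookup above (find the row number, then index Country) can be written as a single expression. The sketch below uses four hand-typed values, rounded as they appear in the output above, rather than the full dataset.

```r
# Rounded values taken from the output above; not the full WHO dataset.
Country <- c("Afghanistan", "Albania", "Japan", "Niger")
Under15 <- c(47.4, 21.3, 13.12, 49.99)

# which.min() and which.max() return positions, which index Country directly.
Country[which.min(Under15)]
Country[which.max(Under15)]
```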
plot(WHO$GNI, WHO$FertilityRate)
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5)
nrow(Outliers)
## [1] 7
Outliers[c("Country","GNI","FertilityRate")]
## Country GNI FertilityRate
## 23 Botswana 14550 2.71
## 56 Equatorial Guinea 25620 5.04
## 63 Gabon 13740 4.18
## 83 Israel 27110 2.92
## 88 Kazakhstan 11250 2.52
## 131 Panama 14510 2.52
## 150 Saudi Arabia 24700 2.76
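Both GNI and FertilityRate contain missing values, and subset silently drops rows where the condition evaluates to NA. Plain logical indexing behaves differently and returns a row of NAs for each such case. The sketch below shows the difference on three rows typed in from the output above; Sample is an illustrative name.

```r
Sample <- data.frame(
  Country = c("Afghanistan", "Andorra", "Israel"),
  GNI = c(1140, NA, 27110),
  FertilityRate = c(5.4, NA, 2.92)
)

# subset() treats an NA condition as FALSE, so only Israel is kept.
Kept <- subset(Sample, GNI > 10000 & FertilityRate > 2.5)
nrow(Kept)

# Logical indexing keeps an all-NA row for Andorra as well.
Indexed <- Sample[Sample$GNI > 10000 & Sample$FertilityRate > 2.5, ]
nrow(Indexed)
```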
hist(WHO$CellularSubscribers)
### Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)
boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")
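The formula passed to boxplot can also name the columns directly when a data argument is supplied, and plot = FALSE returns the numbers behind each box instead of drawing it. The sketch below uses the first ten rows of LifeExpectancy and Region as printed by str(WHO); WHO_Sample and Stats are illustrative names.

```r
# First ten rows, copied from the str(WHO) output above.
LifeExpectancy <- c(60, 74, 73, 82, 51, 75, 76, 71, 82, 81)
Region <- factor(c(
  "Eastern Mediterranean", "Europe", "Africa", "Europe", "Africa",
  "Americas", "Americas", "Europe", "Western Pacific", "Europe"
))
WHO_Sample <- data.frame(LifeExpectancy, Region)

# plot = FALSE computes the box statistics without opening a graphics device.
Stats <- boxplot(LifeExpectancy ~ Region, data = WHO_Sample, plot = FALSE)

Stats$names       # one group per region, in alphabetical order
Stats$n           # number of observations in each group
Stats$stats[3, ]  # third row holds the median of each group
```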
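The table command counts the number of countries in each region of the WHO data.frame. The tapply command splits one vector into groups defined by a second vector and applies a summary function to each group. Because LiteracyRate contains missing values, min returns NA for every region unless na.rm=TRUE is passed through to it.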
table(WHO$Region)
##
## Africa Americas Eastern Mediterranean Europe South-East Asia Western Pacific
## 46 35 22 53 11 27
tapply(WHO$Over60, WHO$Region, mean)
## Africa Americas Eastern Mediterranean Europe South-East Asia Western Pacific
## 5.220652 10.943714 5.620000 19.774906 8.769091 10.162963
tapply(WHO$LiteracyRate, WHO$Region, min)
## Africa Americas Eastern Mediterranean Europe South-East Asia Western Pacific
## NA NA NA NA NA NA
tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
## Africa Americas Eastern Mediterranean Europe South-East Asia Western Pacific
## 31.1 75.2 63.9 95.2 56.8 60.6
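The effect of na.rm is easier to see on a handful of rows. The sketch below uses six rows typed in from the str(WHO) output above (rows 2, 3, 5, 6, 7, and 8); any group that contains an NA returns NA until na.rm = TRUE is passed through tapply to min.

```r
# Six rows copied from the str(WHO) output above.
Region <- c("Europe", "Africa", "Africa", "Americas", "Americas", "Europe")
LiteracyRate <- c(NA, NA, 70.1, 99, 97.8, 99.6)

# Without na.rm, a single NA in a group makes that group's minimum NA.
WithNA <- tapply(LiteracyRate, Region, min)
WithNA

# Extra arguments to tapply() are forwarded to the function being applied.
NoNA <- tapply(LiteracyRate, Region, min, na.rm = TRUE)
NoNA
```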
In the following section, the United States Department of Agriculture (USDA) dataset on the dietary contents of food is examined.
USDA = read.csv("USDA.csv")
str(USDA)
## 'data.frame': 7058 obs. of 16 variables:
## $ ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
## $ Calories : int 717 717 876 353 371 334 300 376 403 387 ...
## $ Protein : num 0.85 0.85 0.28 21.4 23.24 ...
## $ TotalFat : num 81.1 81.1 99.5 28.7 29.7 ...
## $ Carbohydrate: num 0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
## $ Sodium : int 714 827 2 1395 560 629 842 690 621 700 ...
## $ SaturatedFat: num 51.4 50.5 61.9 18.7 18.8 ...
## $ Cholesterol : int 215 219 256 75 94 100 72 93 105 103 ...
## $ Sugar : num 0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
## $ Calcium : int 24 24 4 528 674 184 388 673 721 643 ...
## $ Iron : num 0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
## $ Potassium : int 24 26 5 256 136 152 187 93 98 95 ...
## $ VitaminC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VitaminE : num 2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
## $ VitaminD : num 1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
summary(USDA)
## ID Description Calories Protein TotalFat Carbohydrate
## Min. : 1001 BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA: 2 Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 8387 CAMPBELL,CAMPBELL'S SEL MICROWAVEABLE BOWLS,HEA : 2 1st Qu.: 85.0 1st Qu.: 2.29 1st Qu.: 0.72 1st Qu.: 0.00
## Median :13294 OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT : 2 Median :181.0 Median : 8.20 Median : 4.37 Median : 7.13
## Mean :14260 POPCORN,OIL-POPPED,LOFAT : 2 Mean :219.7 Mean :11.71 Mean : 10.32 Mean : 20.70
## 3rd Qu.:18337 ABALONE,MIXED SPECIES,RAW : 1 3rd Qu.:331.0 3rd Qu.:20.43 3rd Qu.: 12.70 3rd Qu.: 28.17
## Max. :93600 ABALONE,MXD SP,CKD,FRIED : 1 Max. :902.0 Max. :88.32 Max. :100.00 Max. :100.00
## (Other) :7048 NA's :1 NA's :1 NA's :1 NA's :1
## Sodium SaturatedFat Cholesterol Sugar Calcium Iron Potassium VitaminC
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 37.0 1st Qu.: 0.172 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 9.00 1st Qu.: 0.520 1st Qu.: 135.0 1st Qu.: 0.000
## Median : 79.0 Median : 1.256 Median : 3.00 Median : 1.395 Median : 19.00 Median : 1.330 Median : 250.0 Median : 0.000
## Mean : 322.1 Mean : 3.452 Mean : 41.55 Mean : 8.257 Mean : 73.53 Mean : 2.828 Mean : 301.4 Mean : 9.436
## 3rd Qu.: 386.0 3rd Qu.: 4.028 3rd Qu.: 69.00 3rd Qu.: 7.875 3rd Qu.: 56.00 3rd Qu.: 2.620 3rd Qu.: 348.0 3rd Qu.: 3.100
## Max. :38758.0 Max. :95.600 Max. :3100.00 Max. :99.800 Max. :7364.00 Max. :123.600 Max. :16500.0 Max. :2400.000
## NA's :84 NA's :301 NA's :288 NA's :1910 NA's :136 NA's :123 NA's :409 NA's :332
## VitaminE VitaminD
## Min. : 0.000 Min. : 0.0000
## 1st Qu.: 0.120 1st Qu.: 0.0000
## Median : 0.270 Median : 0.0000
## Mean : 1.488 Mean : 0.5769
## 3rd Qu.: 0.710 3rd Qu.: 0.1000
## Max. :149.400 Max. :250.0000
## NA's :2720 NA's :2834
USDA$Sodium
which.max(USDA$Sodium)
## [1] 265
names(USDA)
## [1] "ID" "Description" "Calories" "Protein" "TotalFat" "Carbohydrate" "Sodium" "SaturatedFat" "Cholesterol" "Sugar"
## [11] "Calcium" "Iron" "Potassium" "VitaminC" "VitaminE" "VitaminD"
USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ... ZWIEBACK
HighSodium = subset(USDA, Sodium>10000)
nrow(HighSodium)
## [1] 10
HighSodium$Description
## [1] SALT,TABLE SOUP,BF BROTH OR BOUILLON,PDR,DRY
## [3] SOUP,BEEF BROTH,CUBED,DRY SOUP,CHICK BROTH OR BOUILLON,DRY
## [5] SOUP,CHICK BROTH CUBES,DRY GRAVY,AU JUS,DRY
## [7] ADOBO FRESCO LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE
## [9] LEAVENING AGENTS,BAKING SODA DESSERTS,RENNIN,TABLETS,UNSWTND
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ... ZWIEBACK
match("CAVIAR", USDA$Description)
## [1] 4154
USDA$Sodium[4154]
## [1] 1500
USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
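The match command only finds exact matches and returns the position of the first one. For partial matches, base R's grep is one option. The sketch below contrasts the two on a few descriptions copied from the output above; the short Description vector stands in for USDA$Description.

```r
Description <- c(
  "SALT,TABLE",
  "SOUP,BEEF BROTH,CUBED,DRY",
  "SOUP,CHICK BROTH CUBES,DRY",
  "CAVIAR"
)

# match() needs the whole string to be equal.
match("CAVIAR", Description)
match("SOUP", Description)  # no element equals "SOUP" exactly, so NA

# grep() returns every position whose text contains the pattern.
grep("SOUP", Description, fixed = TRUE)
```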
summary(USDA$Sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 37.0 79.0 322.1 386.0 38758.0 84
sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417
plot(USDA$Protein, USDA$TotalFat)
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
str(HighSodium)
## num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...
USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
table(USDA$HighSodium)
##
## 0 1
## 4884 2090
table(USDA$HighSodium, USDA$HighFat)
##
## 0 1
## 0 3529 1355
## 1 1378 712
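The counts in table(USDA$HighSodium) add up to 6974 rather than 7058 because the 84 foods with a missing Sodium value produce NA in the indicator, and table omits NA by default. The sketch below reproduces this on four Sodium values from the str(USDA) output plus one NA added for illustration.

```r
# First four Sodium values from str(USDA), plus one NA added for illustration.
Sodium <- c(714, 827, 2, 1395, NA)

# Comparing NA with the mean yields NA, which as.numeric() keeps as NA.
HighSodium <- as.numeric(Sodium > mean(Sodium, na.rm = TRUE))
HighSodium

# By default table() leaves NA out of the counts.
table(HighSodium)

# useNA = "ifany" adds a column for the missing values.
table(HighSodium, useNA = "ifany")
```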
tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
## 0 1
## 2.558945 3.197294
tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
## 0 1
## 1677.6 2400.0
tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
## $`0`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 6.364 2.800 1677.600 248
##
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 0.00 0.20 16.31 4.50 2400.00 83