The science of using data to build models that lead to better decisions that add value to individuals, to companies, and to institutions.
In the next section, the basic operations and functions used in R for data analysis are explored.
# Basic Calculations
8*6
## [1] 48
2^16
## [1] 65536
8*6
## [1] 48
8*10
## [1] 80
# Mathematical functions
sqrt(2)
## [1] 1.414214
abs(-65)
## [1] 65
# Setting variables
SquareRoot2 = sqrt(2)
SquareRoot2
## [1] 1.414214
HoursYear <- 365*24
HoursYear
## [1] 8760
# Identifies the stored variables
ls()
## [1] "HoursYear" "SquareRoot2"
Two vectors, Country and LifeExpectancy, are created and then indexed to display specific elements. Finally, a third vector, Sequence, is generated as the range from 0 to 100 in increments of 2.
# Create Vectors
c(2,3,5,8,13)
## [1] 2 3 5 8 13
Country = c("Brazil", "China", "India","Switzerland","USA")
LifeExpectancy = c(74,76,65,83,79)
Country
## [1] "Brazil" "China" "India" "Switzerland" "USA"
LifeExpectancy
## [1] 74 76 65 83 79
Country[1]
## [1] "Brazil"
LifeExpectancy[3]
## [1] 65
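Indexing is not limited to a single position. As a minimal sketch (the names first_two, selected, and long_lived are illustrative and not part of the original walkthrough), a vector can also be indexed with a range, a set of positions, or a logical condition on a parallel vector:

```r
# Recreate the two vectors used in the walkthrough
Country <- c("Brazil", "China", "India", "Switzerland", "USA")
LifeExpectancy <- c(74, 76, 65, 83, 79)

# A range of positions
first_two <- Country[1:2]

# An arbitrary set of positions
selected <- Country[c(1, 5)]

# A logical condition evaluated on the parallel vector
long_lived <- Country[LifeExpectancy > 75]

print(first_two)
print(selected)
print(long_lived)
```

The logical form only works as intended when both vectors have the same length and order, which is one reason to combine them into a data.frame.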
Sequence = seq(0,100,2)
Sequence
## [1] 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96
## [50] 98 100
The data.frame CountryData is built from the two vectors Country and LifeExpectancy. Next, an additional column, CountryData$Population, is added to the data.frame. Finally, rbind stacks the rows of CountryData and a second data.frame, NewCountryData, into AllCountryData.
# Create data frames
CountryData = data.frame(Country, LifeExpectancy)
CountryData
## Country LifeExpectancy
## 1 Brazil 74
## 2 China 76
## 3 India 65
## 4 Switzerland 83
## 5 USA 79
CountryData$Population = c(199000,1390000,1240000,7997,318000)
CountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
Country = c("Australia","Greece")
LifeExpectancy = c(82,81)
Population = c(23050,11125)
NewCountryData = data.frame(Country, LifeExpectancy, Population)
NewCountryData
## Country LifeExpectancy Population
## 1 Australia 82 23050
## 2 Greece 81 11125
AllCountryData = rbind(CountryData, NewCountryData)
AllCountryData
## Country LifeExpectancy Population
## 1 Brazil 74 199000
## 2 China 76 1390000
## 3 India 65 1240000
## 4 Switzerland 83 7997
## 5 USA 79 318000
## 6 Australia 82 23050
## 7 Greece 81 11125
The dataset WHO.csv is loaded into the variable WHO. The str command describes the structure of the data.frame (number of observations, column names, and column types), and the summary command gives a statistical description of each column.
# Load dataset
WHO = read.csv("WHO.csv")
# Output the structure of the dataset
str(WHO)
## 'data.frame': 194 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 3 4 1 4 1 2 2 4 6 4 ...
## $ Population : int 29825 3162 38482 78 20821 89 41087 2969 23050 8464 ...
## $ Under15 : num 47.4 21.3 27.4 15.2 47.6 ...
## $ Over60 : num 3.82 14.93 7.17 22.86 3.84 ...
## $ FertilityRate : num 5.4 1.75 2.83 NA 6.1 2.12 2.2 1.74 1.89 1.44 ...
## $ LifeExpectancy : int 60 74 73 82 51 75 76 71 82 81 ...
## $ ChildMortality : num 98.5 16.7 20 3.2 163.5 ...
## $ CellularSubscribers : num 54.3 96.4 99 75.5 48.4 ...
## $ LiteracyRate : num NA NA NA NA 70.1 99 97.8 99.6 NA NA ...
## $ GNI : num 1140 8820 8310 NA 5230 ...
## $ PrimarySchoolEnrollmentMale : num NA NA 98.2 78.4 93.1 91.1 NA NA 96.9 NA ...
## $ PrimarySchoolEnrollmentFemale: num NA NA 96.4 79.4 78.2 84.5 NA NA 97.5 NA ...
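The Factor columns in the str output reflect how R versions before 4.0.0 read text columns by default. Since R 4.0.0, read.csv and data.frame keep text as character unless stringsAsFactors = TRUE is passed. A small sketch with a made-up two-row data.frame (not the WHO data; the names as_text and as_factor are illustrative) shows the difference:

```r
countries <- c("Brazil", "China")

# Text kept as character (the default since R 4.0.0)
as_text <- data.frame(Country = countries, stringsAsFactors = FALSE)

# Text converted to a factor (the default before R 4.0.0)
as_factor <- data.frame(Country = countries, stringsAsFactors = TRUE)

print(class(as_text$Country))
print(class(as_factor$Country))
print(levels(as_factor$Country))
```

Passing the argument explicitly makes a script behave the same way on either side of that version change.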
# Output the summary of the dataset
summary(WHO)
## Country Region Population Under15 Over60 FertilityRate LifeExpectancy ChildMortality CellularSubscribers LiteracyRate
## Afghanistan : 1 Africa :46 Min. : 1 Min. :13.12 Min. : 0.81 Min. :1.260 Min. :47.00 Min. : 2.200 Min. : 2.57 Min. :31.10
## Albania : 1 Americas :35 1st Qu.: 1696 1st Qu.:18.72 1st Qu.: 5.20 1st Qu.:1.835 1st Qu.:64.00 1st Qu.: 8.425 1st Qu.: 63.57 1st Qu.:71.60
## Algeria : 1 Eastern Mediterranean:22 Median : 7790 Median :28.65 Median : 8.53 Median :2.400 Median :72.50 Median : 18.600 Median : 97.75 Median :91.80
## Andorra : 1 Europe :53 Mean : 36360 Mean :28.73 Mean :11.16 Mean :2.941 Mean :70.01 Mean : 36.149 Mean : 93.64 Mean :83.71
## Angola : 1 South-East Asia :11 3rd Qu.: 24535 3rd Qu.:37.75 3rd Qu.:16.69 3rd Qu.:3.905 3rd Qu.:76.00 3rd Qu.: 55.975 3rd Qu.:120.81 3rd Qu.:97.85
## Antigua and Barbuda: 1 Western Pacific :27 Max. :1390000 Max. :49.99 Max. :31.92 Max. :7.580 Max. :83.00 Max. :181.600 Max. :196.41 Max. :99.80
## (Other) :188 NA's :11 NA's :10 NA's :91
## GNI PrimarySchoolEnrollmentMale PrimarySchoolEnrollmentFemale
## Min. : 340 Min. : 37.20 Min. : 32.50
## 1st Qu.: 2335 1st Qu.: 87.70 1st Qu.: 87.30
## Median : 7870 Median : 94.70 Median : 95.10
## Mean :13321 Mean : 90.85 Mean : 89.63
## 3rd Qu.:17558 3rd Qu.: 98.10 3rd Qu.: 97.90
## Max. :86440 Max. :100.00 Max. :100.00
## NA's :32 NA's :93 NA's :93
The WHO dataset is subset into WHO_Europe by keeping only the rows whose Region is "Europe".
# Subset the dataset to the countries in the Europe region
WHO_Europe = subset(WHO, Region == "Europe")
# Output the structure of the subset
str(WHO_Europe)
## 'data.frame': 53 obs. of 13 variables:
## $ Country : Factor w/ 194 levels "Afghanistan",..: 2 4 8 10 11 16 17 22 26 42 ...
## $ Region : Factor w/ 6 levels "Africa","Americas",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Population : int 3162 78 2969 8464 9309 9405 11060 3834 7278 4307 ...
## $ Under15 : num 21.3 15.2 20.3 14.5 22.2 ...
## $ Over60 : num 14.93 22.86 14.06 23.52 8.24 ...
## $ FertilityRate : num 1.75 NA 1.74 1.44 1.96 1.47 1.85 1.26 1.51 1.48 ...
## $ LifeExpectancy : int 74 82 71 81 71 71 80 76 74 77 ...
## $ ChildMortality : num 16.7 3.2 16.4 4 35.2 5.2 4.2 6.7 12.1 4.7 ...
## $ CellularSubscribers : num 96.4 75.5 103.6 154.8 108.8 ...
## $ LiteracyRate : num NA NA 99.6 NA NA NA NA 97.9 NA 98.8 ...
## $ GNI : num 8820 NA 6100 42050 8960 ...
## $ PrimarySchoolEnrollmentMale : num NA 78.4 NA NA 85.3 NA 98.9 86.5 99.3 94.8 ...
## $ PrimarySchoolEnrollmentFemale: num NA 79.4 NA NA 84.1 NA 99.2 88.4 99.7 97 ...
# Write the Europe subset to a new comma-separated values (CSV) file
write.csv(WHO_Europe, "WHO_Europe.csv")
# Remove the WHO_Europe variable
rm(WHO_Europe)
Several basic data analysis commands follow. They compute summary statistics for a column and show how to identify the observations of interest, such as the countries with the minimum and maximum values.
# Basic data analysis
mean(WHO$Under15)
## [1] 28.73242
sd(WHO$Under15)
## [1] 10.53457
summary(WHO$Under15)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.12 18.72 28.65 28.73 37.75 49.99
# Find which data point is the minimum and find its index
which.min(WHO$Under15)
## [1] 86
WHO$Country[86]
## [1] Japan
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin ... Zimbabwe
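The lookup above takes two steps: find the index, then type it back in by hand. As a hedged sketch on a made-up data.frame (the object toy and its values are hypothetical, not WHO data), the index can be passed straight into the brackets, and which.min and which.max skip missing values when locating the extreme:

```r
# Hypothetical data with one missing value
toy <- data.frame(
  Country = c("A", "B", "C", "D"),
  Under15 = c(30.1, NA, 13.1, 45.0),
  stringsAsFactors = FALSE
)

# Index and lookup in one step; the NA in row 2 is ignored
youngest_share <- toy$Country[which.min(toy$Under15)]
oldest_share <- toy$Country[which.max(toy$Under15)]

print(youngest_share)
print(oldest_share)
```

Nesting the call avoids hard-coding a row number that would silently become wrong if the dataset were reordered.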
# Find which data point is the maximum and find its index
which.max(WHO$Under15)
## [1] 124
WHO$Country[124]
## [1] Niger
## 194 Levels: Afghanistan Albania Algeria Andorra Angola Antigua and Barbuda Argentina Armenia Australia Austria Azerbaijan Bahamas Bahrain Bangladesh Barbados Belarus Belgium Belize Benin ... Zimbabwe
# Scatterplot
plot(WHO$GNI, WHO$FertilityRate)
# Subset the outliers
Outliers = subset(WHO, GNI > 10000 & FertilityRate > 2.5)
# Calculate the number of observations
nrow(Outliers)
## [1] 7
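The same count can be obtained without creating a subset by summing the logical condition. The sketch below uses a made-up data.frame (toy and its values are hypothetical) to show one difference in how missing values are treated: subset drops rows where the condition is NA, whereas sum needs na.rm = TRUE to do the same.

```r
# Hypothetical data; row 3 has a missing GNI
toy <- data.frame(
  GNI = c(12000, 5000, NA, 30000),
  FertilityRate = c(3.0, 4.0, 2.8, 1.5)
)

# Approach 1: subset, then count rows (rows with an NA condition are dropped)
outliers <- subset(toy, GNI > 10000 & FertilityRate > 2.5)
n_subset <- nrow(outliers)

# Approach 2: sum the logical vector, removing NA explicitly
n_sum <- sum(toy$GNI > 10000 & toy$FertilityRate > 2.5, na.rm = TRUE)

print(n_subset)
print(n_sum)
```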
# Display selected columns of the outliers
Outliers[c("Country","GNI","FertilityRate")]
## Country GNI FertilityRate
## 23 Botswana 14550 2.71
## 56 Equatorial Guinea 25620 5.04
## 63 Gabon 13740 4.18
## 83 Israel 27110 2.92
## 88 Kazakhstan 11250 2.52
## 131 Panama 14510 2.52
## 150 Saudi Arabia 24700 2.76
# Histogram
hist(WHO$CellularSubscribers)
# Boxplot
boxplot(WHO$LifeExpectancy ~ WHO$Region)
# Boxplot with axis labels and a title
boxplot(WHO$LifeExpectancy ~ WHO$Region, xlab = "", ylab = "Life Expectancy", main = "Life Expectancy of Countries by Region")
The table command counts the countries in each region of the WHO data.frame. The tapply command splits one vector into groups defined by a second vector and applies a summary statistic to each group.
# kable() comes from the knitr package
library(knitr)
# Tabulate the Region column
z = table(WHO$Region)
kable(z)

| Var1 | Freq |
|---|---|
| Africa | 46 |
| Americas | 35 |
| Eastern Mediterranean | 22 |
| Europe | 53 |
| South-East Asia | 11 |
| Western Pacific | 27 |
# Compare regions using a statistical measure (mean of Over60)
z = tapply(WHO$Over60, WHO$Region, mean)
kable(z)

| | x |
|---|---|
| Africa | 5.220652 |
| Americas | 10.943714 |
| Eastern Mediterranean | 5.620000 |
| Europe | 19.774906 |
| South-East Asia | 8.769091 |
| Western Pacific | 10.162963 |
# Minimum literacy rate by region; the result is NA wherever a region has missing values
z = tapply(WHO$LiteracyRate, WHO$Region, min)
kable(z)

| | x |
|---|---|
| Africa | NA |
| Americas | NA |
| Eastern Mediterranean | NA |
| Europe | NA |
| South-East Asia | NA |
| Western Pacific | NA |
# Remove missing values before taking the minimum
z = tapply(WHO$LiteracyRate, WHO$Region, min, na.rm=TRUE)
kable(z)

| | x |
|---|---|
| Africa | 31.1 |
| Americas | 75.2 |
| Eastern Mediterranean | 63.9 |
| Europe | 95.2 |
| South-East Asia | 56.8 |
| Western Pacific | 60.6 |
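The all-NA table appears because min returns NA whenever a group contains at least one missing value; extra arguments given to tapply, such as na.rm = TRUE, are passed on to the function. A minimal sketch on made-up data (the regions "North" and "South" and their values are hypothetical) reproduces the behavior:

```r
# Hypothetical data: one missing literacy rate in the "North" group
toy <- data.frame(
  Region = c("North", "North", "South", "South"),
  LiteracyRate = c(90, NA, 70, 80)
)

# Without na.rm, any group containing an NA returns NA
with_na <- tapply(toy$LiteracyRate, toy$Region, min)

# na.rm = TRUE is forwarded to min, which then ignores the NA
without_na <- tapply(toy$LiteracyRate, toy$Region, min, na.rm = TRUE)

print(with_na)
print(without_na)
```

Only groups that actually contain a missing value are affected; in this sketch "South" returns 70 in both cases.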
In the following section, the United States Department of Agriculture (USDA) dataset on the dietary contents of food is examined.
# Load in the dataset
USDA = read.csv("USDA.csv")
# Output the structure of the dataset
str(USDA)
## 'data.frame': 7058 obs. of 16 variables:
## $ ID : int 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
## $ Description : Factor w/ 7054 levels "ABALONE,MIXED SPECIES,RAW",..: 1303 1302 1298 2303 2304 2305 2306 2307 2308 2309 ...
## $ Calories : int 717 717 876 353 371 334 300 376 403 387 ...
## $ Protein : num 0.85 0.85 0.28 21.4 23.24 ...
## $ TotalFat : num 81.1 81.1 99.5 28.7 29.7 ...
## $ Carbohydrate: num 0.06 0.06 0 2.34 2.79 0.45 0.46 3.06 1.28 4.78 ...
## $ Sodium : int 714 827 2 1395 560 629 842 690 621 700 ...
## $ SaturatedFat: num 51.4 50.5 61.9 18.7 18.8 ...
## $ Cholesterol : int 215 219 256 75 94 100 72 93 105 103 ...
## $ Sugar : num 0.06 0.06 0 0.5 0.51 0.45 0.46 NA 0.52 NA ...
## $ Calcium : int 24 24 4 528 674 184 388 673 721 643 ...
## $ Iron : num 0.02 0.16 0 0.31 0.43 0.5 0.33 0.64 0.68 0.21 ...
## $ Potassium : int 24 26 5 256 136 152 187 93 98 95 ...
## $ VitaminC : num 0 0 0 0 0 0 0 0 0 0 ...
## $ VitaminE : num 2.32 2.32 2.8 0.25 0.26 0.24 0.21 NA 0.29 NA ...
## $ VitaminD : num 1.5 1.5 1.8 0.5 0.5 0.5 0.4 NA 0.6 NA ...
# Output the summary
z = summary(USDA)
kable(z)

| ID | Description | Calories | Protein | TotalFat | Carbohydrate | Sodium | SaturatedFat | Cholesterol | Sugar | Calcium | Iron | Potassium | VitaminC | VitaminE | VitaminD | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1001 | BEEF,CHUCK,UNDER BLADE CNTR STEAK,BNLESS,DENVER CUT,LN,0" FA: 2 | Min. : 0.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.000 | Min. : 0.00 | Min. : 0.000 | Min. : 0.00 | Min. : 0.000 | Min. : 0.0 | Min. : 0.000 | Min. : 0.000 | Min. : 0.0000 | |
| 1st Qu.: 8387 | CAMPBELL,CAMPBELL’S SEL MICROWAVEABLE BOWLS,HEA : 2 | 1st Qu.: 85.0 | 1st Qu.: 2.29 | 1st Qu.: 0.72 | 1st Qu.: 0.00 | 1st Qu.: 37.0 | 1st Qu.: 0.172 | 1st Qu.: 0.00 | 1st Qu.: 0.000 | 1st Qu.: 9.00 | 1st Qu.: 0.520 | 1st Qu.: 135.0 | 1st Qu.: 0.000 | 1st Qu.: 0.120 | 1st Qu.: 0.0000 | |
| Median :13294 | OIL,INDUSTRIAL,PALM KERNEL (HYDROGENATED),CONFECTION FAT : 2 | Median :181.0 | Median : 8.20 | Median : 4.37 | Median : 7.13 | Median : 79.0 | Median : 1.256 | Median : 3.00 | Median : 1.395 | Median : 19.00 | Median : 1.330 | Median : 250.0 | Median : 0.000 | Median : 0.270 | Median : 0.0000 | |
| Mean :14260 | POPCORN,OIL-POPPED,LOFAT : 2 | Mean :219.7 | Mean :11.71 | Mean : 10.32 | Mean : 20.70 | Mean : 322.1 | Mean : 3.452 | Mean : 41.55 | Mean : 8.257 | Mean : 73.53 | Mean : 2.828 | Mean : 301.4 | Mean : 9.436 | Mean : 1.488 | Mean : 0.5769 | |
| 3rd Qu.:18337 | ABALONE,MIXED SPECIES,RAW : 1 | 3rd Qu.:331.0 | 3rd Qu.:20.43 | 3rd Qu.: 12.70 | 3rd Qu.: 28.17 | 3rd Qu.: 386.0 | 3rd Qu.: 4.028 | 3rd Qu.: 69.00 | 3rd Qu.: 7.875 | 3rd Qu.: 56.00 | 3rd Qu.: 2.620 | 3rd Qu.: 348.0 | 3rd Qu.: 3.100 | 3rd Qu.: 0.710 | 3rd Qu.: 0.1000 | |
| Max. :93600 | ABALONE,MXD SP,CKD,FRIED : 1 | Max. :902.0 | Max. :88.32 | Max. :100.00 | Max. :100.00 | Max. :38758.0 | Max. :95.600 | Max. :3100.00 | Max. :99.800 | Max. :7364.00 | Max. :123.600 | Max. :16500.0 | Max. :2400.000 | Max. :149.400 | Max. :250.0000 | |
| NA | (Other) :7048 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :1 | NA’s :84 | NA’s :301 | NA’s :288 | NA’s :1910 | NA’s :136 | NA’s :123 | NA’s :409 | NA’s :332 | NA’s :2720 | NA’s :2834 |
# Output the Sodium column
USDA$Sodium
# Find the index of the food with the highest sodium level
which.max(USDA$Sodium)
## [1] 265
# Get the names of the variables
names(USDA)
## [1] "ID" "Description" "Calories" "Protein" "TotalFat" "Carbohydrate" "Sodium" "SaturatedFat" "Cholesterol" "Sugar" "Calcium" "Iron" "Potassium"
## [14] "VitaminC" "VitaminE" "VitaminD"
# Get the name of the food with the highest sodium level
USDA$Description[265]
## [1] SALT,TABLE
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ACORN FLOUR,FULL FAT ACORN STEW (APACHE) ACORNS,DRIED ... ZWIEBACK
# Subset foods with sodium content above 10,000 mg
HighSodium = subset(USDA, Sodium>10000)
# Count the number of rows, or observations
nrow(HighSodium)
## [1] 10
# Output names of the foods with high sodium content
HighSodium$Description
## [1] SALT,TABLE SOUP,BF BROTH OR BOUILLON,PDR,DRY SOUP,BEEF BROTH,CUBED,DRY
## [4] SOUP,CHICK BROTH OR BOUILLON,DRY SOUP,CHICK BROTH CUBES,DRY GRAVY,AU JUS,DRY
## [7] ADOBO FRESCO LEAVENING AGENTS,BAKING PDR,DOUBLE-ACTING,NA AL SULFATE LEAVENING AGENTS,BAKING SODA
## [10] DESSERTS,RENNIN,TABLETS,UNSWTND
## 7054 Levels: ABALONE,MIXED SPECIES,RAW ABALONE,MXD SP,CKD,FRIED ABIYUCH,RAW ACEROLA JUICE,RAW ACEROLA,(WEST INDIAN CHERRY),RAW ACORN FLOUR,FULL FAT ACORN STEW (APACHE) ACORNS,DRIED ... ZWIEBACK
# Find the index of CAVIAR
match("CAVIAR", USDA$Description)
## [1] 4154
# Find the amount of sodium in CAVIAR
USDA$Sodium[4154]
## [1] 1500
# Do the previous two commands in one step
USDA$Sodium[match("CAVIAR", USDA$Description)]
## [1] 1500
# Output a summary
summary(USDA$Sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 37.0 79.0 322.1 386.0 38758.0 84
# Calculate the standard deviation
sd(USDA$Sodium, na.rm = TRUE)
## [1] 1045.417
# Scatterplot
plot(USDA$Protein, USDA$TotalFat)
# Add x label, y label, title, and color to the scatterplot
plot(USDA$Protein, USDA$TotalFat, xlab="Protein", ylab = "Fat", main = "Protein vs Fat", col = "red")
# Histogram
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C")
# Add limits to x-axis
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100))
# Specify breaks
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=100)
hist(USDA$VitaminC, xlab = "Vitamin C (mg)", main = "Histogram of Vitamin C", xlim = c(0,100), breaks=2000)
# Boxplots
boxplot(USDA$Sugar, ylab = "Sugar (g)", main = "Boxplot of Sugar")
# Create a 0/1 indicator for above-average sodium (this overwrites the HighSodium subset created earlier)
HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
# Output the structure of the new variable
str(HighSodium)
## num [1:7058] 1 1 0 1 1 1 1 1 1 1 ...
# Add the variable to the data.frame
USDA$HighSodium = as.numeric(USDA$Sodium > mean(USDA$Sodium, na.rm=TRUE))
# Create similar indicators for carbohydrates, protein, and fat
USDA$HighCarbs = as.numeric(USDA$Carbohydrate > mean(USDA$Carbohydrate, na.rm=TRUE))
USDA$HighProtein = as.numeric(USDA$Protein > mean(USDA$Protein, na.rm=TRUE))
USDA$HighFat = as.numeric(USDA$TotalFat > mean(USDA$TotalFat, na.rm=TRUE))
# Tabulate the number of foods that have a higher sodium level than average
z = table(USDA$HighSodium)
kable(z)

| Var1 | Freq |
|---|---|
| 0 | 4884 |
| 1 | 2090 |
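The two counts above sum to 6,974 rather than 7,058 because the 84 foods with a missing Sodium value produce NA in the comparison, and table leaves NA out by default. A minimal sketch on a made-up vector (sodium and its values are hypothetical) shows the effect and how useNA = "ifany" makes the missing values visible:

```r
# Hypothetical sodium values with one missing entry
sodium <- c(10, 500, NA, 40, 2000)

# The comparison with the mean yields NA for the missing entry
high <- as.numeric(sodium > mean(sodium, na.rm = TRUE))

# Default: NA is silently excluded from the counts
default_counts <- table(high)

# useNA = "ifany" adds a separate count for NA when any are present
counts_with_na <- table(high, useNA = "ifany")

print(default_counts)
print(counts_with_na)
```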
# Cross-tabulate high sodium (rows) against high fat (columns)
z = table(USDA$HighSodium, USDA$HighFat)
kable(z)

| | 0 | 1 |
|---|---|---|
| 0 | 3529 | 1355 |
| 1 | 1378 | 712 |
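Raw counts in a two-way table are easier to compare as row proportions. The sketch below rebuilds the table from the four counts reported above (the dimnames labels and the object names counts and row_share are illustrative) and uses prop.table to compute the share of high-fat foods within each sodium group:

```r
# Counts copied from the table above; matrix() fills column by column
counts <- matrix(
  c(3529, 1378, 1355, 712),
  nrow = 2,
  dimnames = list(HighSodium = c("0", "1"), HighFat = c("0", "1"))
)

# margin = 1 divides each row by its row total
row_share <- prop.table(counts, margin = 1)

print(round(row_share, 3))
```

With these counts, roughly 34% of the high-sodium foods are also high in fat, compared with roughly 28% of the other foods.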
# Compare two groups using a statistical measure (mean iron by protein group)
z = tapply(USDA$Iron, USDA$HighProtein, mean, na.rm=TRUE)
kable(z)

| | x |
|---|---|
| 0 | 2.558945 |
| 1 | 3.197294 |
# Compare two groups using a statistical measure
z = tapply(USDA$VitaminC, USDA$HighCarbs, max, na.rm=TRUE)
kable(z)

| | x |
|---|---|
| 0 | 1677.6 |
| 1 | 2400.0 |
# Compare two groups using a statistical measure
z = tapply(USDA$VitaminC, USDA$HighCarbs, summary, na.rm=TRUE)
kable(z)

| | x |
|---|---|
| 0 | c(Min. = 0, 1st Qu. = 0, Median = 0, Mean = 6.36403527640353, 3rd Qu. = 2.8, Max. = 1677.6, NA's = 248) |
| 1 | c(Min. = 0, 1st Qu. = 0, Median = 0.2, Mean = 16.3119884448724, 3rd Qu. = 4.5, Max. = 2400, NA's = 83) |
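Applying summary through tapply returns one summary object per group, which is why the table above shows raw c(...) strings instead of columns. One possible alternative, sketched here on made-up data (vitamin_c, high_carbs, and the helper stats are hypothetical), is to have each group return a plain named vector and bind the results into a matrix with one row per group:

```r
# Hypothetical values and a 0/1 grouping indicator
vitamin_c <- c(0, 2, NA, 10, 4, 6)
high_carbs <- c(0, 0, 0, 1, 1, 1)

# Helper returning a fixed-length named vector for one group
stats <- function(v) {
  c(
    Min = min(v, na.rm = TRUE),
    Median = median(v, na.rm = TRUE),
    Mean = mean(v, na.rm = TRUE),
    Max = max(v, na.rm = TRUE),
    NAs = sum(is.na(v))
  )
}

# split() groups the values; sapply() returns statistics in rows, so transpose
by_group <- t(sapply(split(vitamin_c, high_carbs), stats))

print(by_group)
```

Because every group returns the same five statistics, the result is a regular matrix that prints as a table with one column per statistic.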