Question: Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).
Answer: To list the data sets that ship with the currently attached packages, use the following command:
data()
I have chosen the “msleep” data frame, which contains sleep data for mammals. It ships with the ggplot2 package. If ggplot2 is not installed, use the following command to install it:
#install.packages("ggplot2")
#Once the ggplot2 package is installed, use the following command to include the ggplot2 package:
library("ggplot2")
To check the variable types present in the msleep data frame, use the str() function:
str(msleep)
## 'data.frame': 83 obs. of 11 variables:
## $ name : chr "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
## $ genus : chr "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
## $ vore : Factor w/ 4 levels "carni","herbi",..: 1 4 2 4 2 2 1 NA 1 2 ...
## $ order : chr "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
## $ conservation: Factor w/ 7 levels "","cd","domesticated",..: 5 NA 6 5 3 NA 7 NA 3 5 ...
## $ sleep_total : num 12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
## $ sleep_rem : num NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
## $ sleep_cycle : num NA NA NA 0.133 0.667 ...
## $ awake : num 11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
## $ brainwt : num NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
## $ bodywt : num 50 0.48 1.35 0.019 600 ...
In the msleep data frame, the following are categorical variables:
vore
conservation
The following are numerical variables:
sleep_total
sleep_rem
sleep_cycle
awake
brainwt
bodywt
The following are character variables (categorical in nature, but stored as text rather than as factors):
name
genus
order
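The classification above can also be done programmatically, which helps when column classes differ between package versions. The following is a minimal sketch on a small toy data frame (the values and the name toy are hypothetical, not taken from msleep); the same calls should work on msleep itself.

```r
# Toy data frame (hypothetical values) with one column of each type seen in msleep
toy <- data.frame(
  name        = c("A", "B", "C"),
  vore        = factor(c("carni", "herbi", "herbi")),
  sleep_total = c(12.1, 17.0, 14.4),
  stringsAsFactors = FALSE
)

# Class of every column, as a named character vector
col_classes <- sapply(toy, class)

# Column names grouped by type
numeric_cols   <- names(toy)[sapply(toy, is.numeric)]
factor_cols    <- names(toy)[sapply(toy, is.factor)]
character_cols <- names(toy)[sapply(toy, is.character)]

print(col_classes)
```

Replacing toy with msleep gives the three groups listed above, provided the installed version stores vore and conservation as factors.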
Question: Generate summary level descriptive statistics: Show the mean, median, 25th and 75th quartiles, min, and max for each of the applicable variables in your data set.
Answer: We can use the “summary()” function to get the summary information of a data set. For numerical variables, it displays the minimum, 25th percentile (1st Qu.), median, mean, 75th percentile (3rd Qu.), and maximum, along with the number of missing values (NA's) if there are any. For factor variables, it displays the count of each level. For character variables, it displays only the length (the number of elements, here 83), the class, and the mode.
summary(msleep)
## name genus vore order
## Length:83 Length:83 carni :19 Length:83
## Class :character Class :character herbi :32 Class :character
## Mode :character Mode :character insecti: 5 Mode :character
## omni :20
## NA's : 7
##
##
## conservation sleep_total sleep_rem sleep_cycle
## lc :27 Min. : 1.90 Min. :0.100 Min. :0.1167
## domesticated:10 1st Qu.: 7.85 1st Qu.:0.900 1st Qu.:0.1833
## vu : 7 Median :10.10 Median :1.500 Median :0.3333
## en : 4 Mean :10.43 Mean :1.875 Mean :0.4396
## nt : 4 3rd Qu.:13.75 3rd Qu.:2.400 3rd Qu.:0.5792
## (Other) : 2 Max. :19.90 Max. :6.600 Max. :1.5000
## NA's :29 NA's :22 NA's :51
## awake brainwt bodywt
## Min. : 4.10 Min. :0.00014 Min. : 0.005
## 1st Qu.:10.25 1st Qu.:0.00290 1st Qu.: 0.174
## Median :13.90 Median :0.01240 Median : 1.670
## Mean :13.57 Mean :0.28158 Mean : 166.136
## 3rd Qu.:16.15 3rd Qu.:0.12550 3rd Qu.: 41.750
## Max. :22.10 Max. :5.71200 Max. :6654.000
## NA's :27
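If only the six requested statistics are needed for the numeric columns, they can be computed directly. The following is a rough sketch on a toy data frame with hypothetical values; the helper name six_stats is an assumption made for illustration. Note the na.rm = TRUE arguments, without which any column containing NA would return NA.

```r
# Toy numeric data (hypothetical values), including missing entries
toy <- data.frame(
  sleep_total = c(4, 8, 10, 12, 16),
  sleep_rem   = c(1, NA, 2, 3, NA)
)

# Hypothetical helper: the six statistics for one numeric vector
six_stats <- function(x) {
  c(
    min    = min(x, na.rm = TRUE),
    q25    = unname(quantile(x, 0.25, na.rm = TRUE)),
    median = median(x, na.rm = TRUE),
    mean   = mean(x, na.rm = TRUE),
    q75    = unname(quantile(x, 0.75, na.rm = TRUE)),
    max    = max(x, na.rm = TRUE)
  )
}

# Apply the helper to every numeric column; the result is a matrix
# with one row per statistic and one column per variable
stats <- sapply(toy[sapply(toy, is.numeric)], six_stats)
print(stats)
```

The same pattern, sapply(msleep[sapply(msleep, is.numeric)], six_stats), should give the values that summary() reports for the numeric columns.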
Question: Determine the frequency for one of the categorical variables.
Answer: To determine the frequency of each level of a categorical variable, we can use the table function. The following code displays the frequencies of the vore variable in the msleep data frame. Missing values (NA) are not counted by default.
table(msleep$vore)
##
## carni herbi insecti omni
## 19 32 5 20
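The counts above add up to 76, not 83, because the 7 missing vore values reported by summary() are left out. A small sketch of how to include them, using a toy factor with hypothetical values:

```r
# Toy factor (hypothetical values) with two missing entries
vore <- factor(c("carni", "herbi", NA, "herbi", "omni", NA))

# Default behaviour: NA values are silently dropped
counts_default <- table(vore)

# useNA = "ifany" adds an extra NA category when missing values exist
counts_with_na <- table(vore, useNA = "ifany")

print(counts_default)
print(counts_with_na)
```

Calling table(msleep$vore, useNA = "ifany") should therefore show the missing values as a separate category.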
Question: Determine the frequency for one of the categorical variables, by a different categorical variable.
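Answer: We can use the table function again to get the combined frequencies of two categorical variables. The example below uses genus and vore; genus is stored as a character vector, which table() treats as categorical. Rows with a missing vore are not counted, which is why some genera show zero in every column.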
table(msleep$genus, msleep$vore)
##
## carni herbi insecti omni
## Acinonyx 1 0 0 0
## Aotus 0 0 0 1
## Aplodontia 0 1 0 0
## Blarina 0 0 0 1
## Bos 0 1 0 0
## Bradypus 0 1 0 0
## Callorhinus 1 0 0 0
## Calomys 0 0 0 0
## Canis 1 0 0 0
## Capreolus 0 1 0 0
## Capri 0 1 0 0
## Cavis 0 1 0 0
## Cercopithecus 0 0 0 1
## Chinchilla 0 1 0 0
## Condylura 0 0 0 1
## Cricetomys 0 0 0 1
## Cryptotis 0 0 0 1
## Dasypus 1 0 0 0
## Dendrohyrax 0 1 0 0
## Didelphis 0 0 0 1
## Elephas 0 1 0 0
## Eptesicus 0 0 1 0
## Equus 0 2 0 0
## Erinaceus 0 0 0 1
## Erythrocebus 0 0 0 1
## Eutamias 0 1 0 0
## Felis 1 0 0 0
## Galago 0 0 0 1
## Genetta 1 0 0 0
## Giraffa 0 1 0 0
## Globicephalus 1 0 0 0
## Haliochoerus 1 0 0 0
## Heterohyrax 0 1 0 0
## Homo 0 0 0 1
## Lemur 0 1 0 0
## Loxodonta 0 1 0 0
## Lutreolina 1 0 0 0
## Macaca 0 0 0 1
## Meriones 0 1 0 0
## Mesocricetus 0 1 0 0
## Microtus 0 1 0 0
## Mus 0 1 0 0
## Myotis 0 0 1 0
## Neofiber 0 1 0 0
## Nyctibeus 1 0 0 0
## Octodon 0 1 0 0
## Onychomys 1 0 0 0
## Oryctolagus 0 1 0 0
## Ovis 0 1 0 0
## Pan 0 0 0 1
## Panthera 3 0 0 0
## Papio 0 0 0 1
## Paraechinus 0 0 0 0
## Perodicticus 0 0 0 1
## Peromyscus 0 0 0 0
## Phalanger 0 0 0 0
## Phoca 1 0 0 0
## Phocoena 1 0 0 0
## Potorous 0 1 0 0
## Priodontes 0 0 1 0
## Procavia 0 0 0 0
## Rattus 0 1 0 0
## Rhabdomys 0 0 0 1
## Saimiri 0 0 0 1
## Scalopus 0 0 1 0
## Sigmodon 0 1 0 0
## Spalax 0 0 0 0
## Spermophilus 0 3 0 0
## Suncus 0 0 0 0
## Sus 0 0 0 1
## Tachyglossus 0 0 1 0
## Tamias 0 1 0 0
## Tapirus 0 1 0 0
## Tenrec 0 0 0 1
## Tupaia 0 0 0 1
## Tursiops 1 0 0 0
## Vulpes 2 0 0 0
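Row and column totals can be attached to a two-way table with addmargins(). The following is a minimal sketch on toy vectors (hypothetical values; order is used here only as an example of a second categorical variable with repeated values).

```r
# Toy categorical vectors (hypothetical values)
vore  <- c("carni", "herbi", "herbi", "omni", "carni", "herbi")
order <- c("Carnivora", "Rodentia", "Rodentia", "Primates",
           "Carnivora", "Artiodactyla")

# Two-way frequency table: rows are orders, columns are vore types
tab <- table(order, vore)

# Same table with an extra "Sum" row and "Sum" column
tab_with_totals <- addmargins(tab)

print(tab_with_totals)
```

A variable with fewer distinct values than genus, such as order, tends to give a more compact and readable cross-tabulation.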
Additional information: We can use the prop.table() function to convert counts into proportions. Applied to a two-way table, it divides each cell by the table total (here 76, the number of rows with a non-missing vore), so the proportions form the empirical joint distribution of the two variables. For instance, the following command converts the table generated above into the joint distribution of genus and vore:
prop.table(table(msleep$genus, msleep$vore))
##
## carni herbi insecti omni
## Acinonyx 0.01315789 0.00000000 0.00000000 0.00000000
## Aotus 0.00000000 0.00000000 0.00000000 0.01315789
## Aplodontia 0.00000000 0.01315789 0.00000000 0.00000000
## Blarina 0.00000000 0.00000000 0.00000000 0.01315789
## Bos 0.00000000 0.01315789 0.00000000 0.00000000
## Bradypus 0.00000000 0.01315789 0.00000000 0.00000000
## Callorhinus 0.01315789 0.00000000 0.00000000 0.00000000
## Calomys 0.00000000 0.00000000 0.00000000 0.00000000
## Canis 0.01315789 0.00000000 0.00000000 0.00000000
## Capreolus 0.00000000 0.01315789 0.00000000 0.00000000
## Capri 0.00000000 0.01315789 0.00000000 0.00000000
## Cavis 0.00000000 0.01315789 0.00000000 0.00000000
## Cercopithecus 0.00000000 0.00000000 0.00000000 0.01315789
## Chinchilla 0.00000000 0.01315789 0.00000000 0.00000000
## Condylura 0.00000000 0.00000000 0.00000000 0.01315789
## Cricetomys 0.00000000 0.00000000 0.00000000 0.01315789
## Cryptotis 0.00000000 0.00000000 0.00000000 0.01315789
## Dasypus 0.01315789 0.00000000 0.00000000 0.00000000
## Dendrohyrax 0.00000000 0.01315789 0.00000000 0.00000000
## Didelphis 0.00000000 0.00000000 0.00000000 0.01315789
## Elephas 0.00000000 0.01315789 0.00000000 0.00000000
## Eptesicus 0.00000000 0.00000000 0.01315789 0.00000000
## Equus 0.00000000 0.02631579 0.00000000 0.00000000
## Erinaceus 0.00000000 0.00000000 0.00000000 0.01315789
## Erythrocebus 0.00000000 0.00000000 0.00000000 0.01315789
## Eutamias 0.00000000 0.01315789 0.00000000 0.00000000
## Felis 0.01315789 0.00000000 0.00000000 0.00000000
## Galago 0.00000000 0.00000000 0.00000000 0.01315789
## Genetta 0.01315789 0.00000000 0.00000000 0.00000000
## Giraffa 0.00000000 0.01315789 0.00000000 0.00000000
## Globicephalus 0.01315789 0.00000000 0.00000000 0.00000000
## Haliochoerus 0.01315789 0.00000000 0.00000000 0.00000000
## Heterohyrax 0.00000000 0.01315789 0.00000000 0.00000000
## Homo 0.00000000 0.00000000 0.00000000 0.01315789
## Lemur 0.00000000 0.01315789 0.00000000 0.00000000
## Loxodonta 0.00000000 0.01315789 0.00000000 0.00000000
## Lutreolina 0.01315789 0.00000000 0.00000000 0.00000000
## Macaca 0.00000000 0.00000000 0.00000000 0.01315789
## Meriones 0.00000000 0.01315789 0.00000000 0.00000000
## Mesocricetus 0.00000000 0.01315789 0.00000000 0.00000000
## Microtus 0.00000000 0.01315789 0.00000000 0.00000000
## Mus 0.00000000 0.01315789 0.00000000 0.00000000
## Myotis 0.00000000 0.00000000 0.01315789 0.00000000
## Neofiber 0.00000000 0.01315789 0.00000000 0.00000000
## Nyctibeus 0.01315789 0.00000000 0.00000000 0.00000000
## Octodon 0.00000000 0.01315789 0.00000000 0.00000000
## Onychomys 0.01315789 0.00000000 0.00000000 0.00000000
## Oryctolagus 0.00000000 0.01315789 0.00000000 0.00000000
## Ovis 0.00000000 0.01315789 0.00000000 0.00000000
## Pan 0.00000000 0.00000000 0.00000000 0.01315789
## Panthera 0.03947368 0.00000000 0.00000000 0.00000000
## Papio 0.00000000 0.00000000 0.00000000 0.01315789
## Paraechinus 0.00000000 0.00000000 0.00000000 0.00000000
## Perodicticus 0.00000000 0.00000000 0.00000000 0.01315789
## Peromyscus 0.00000000 0.00000000 0.00000000 0.00000000
## Phalanger 0.00000000 0.00000000 0.00000000 0.00000000
## Phoca 0.01315789 0.00000000 0.00000000 0.00000000
## Phocoena 0.01315789 0.00000000 0.00000000 0.00000000
## Potorous 0.00000000 0.01315789 0.00000000 0.00000000
## Priodontes 0.00000000 0.00000000 0.01315789 0.00000000
## Procavia 0.00000000 0.00000000 0.00000000 0.00000000
## Rattus 0.00000000 0.01315789 0.00000000 0.00000000
## Rhabdomys 0.00000000 0.00000000 0.00000000 0.01315789
## Saimiri 0.00000000 0.00000000 0.00000000 0.01315789
## Scalopus 0.00000000 0.00000000 0.01315789 0.00000000
## Sigmodon 0.00000000 0.01315789 0.00000000 0.00000000
## Spalax 0.00000000 0.00000000 0.00000000 0.00000000
## Spermophilus 0.00000000 0.03947368 0.00000000 0.00000000
## Suncus 0.00000000 0.00000000 0.00000000 0.00000000
## Sus 0.00000000 0.00000000 0.00000000 0.01315789
## Tachyglossus 0.00000000 0.00000000 0.01315789 0.00000000
## Tamias 0.00000000 0.01315789 0.00000000 0.00000000
## Tapirus 0.00000000 0.01315789 0.00000000 0.00000000
## Tenrec 0.00000000 0.00000000 0.00000000 0.01315789
## Tupaia 0.00000000 0.00000000 0.00000000 0.01315789
## Tursiops 0.01315789 0.00000000 0.00000000 0.00000000
## Vulpes 0.02631579 0.00000000 0.00000000 0.00000000
Another useful function is margin.table(), which sums a table over one of its dimensions: with margin 1 it returns the row totals (summing across the columns), and with margin 2 it returns the column totals (summing across the rows). For example, the following commands take the joint proportions of genus and vore computed above and produce the row and column marginal proportions.
#Row totals: sum across the columns within each row
margin.table(prop.table(table(msleep$genus, msleep$vore)),1)
##
## Acinonyx Aotus Aplodontia Blarina Bos
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Bradypus Callorhinus Calomys Canis Capreolus
## 0.01315789 0.01315789 0.00000000 0.01315789 0.01315789
## Capri Cavis Cercopithecus Chinchilla Condylura
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Cricetomys Cryptotis Dasypus Dendrohyrax Didelphis
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Elephas Eptesicus Equus Erinaceus Erythrocebus
## 0.01315789 0.01315789 0.02631579 0.01315789 0.01315789
## Eutamias Felis Galago Genetta Giraffa
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Globicephalus Haliochoerus Heterohyrax Homo Lemur
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Loxodonta Lutreolina Macaca Meriones Mesocricetus
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Microtus Mus Myotis Neofiber Nyctibeus
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Octodon Onychomys Oryctolagus Ovis Pan
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Panthera Papio Paraechinus Perodicticus Peromyscus
## 0.03947368 0.01315789 0.00000000 0.01315789 0.00000000
## Phalanger Phoca Phocoena Potorous Priodontes
## 0.00000000 0.01315789 0.01315789 0.01315789 0.01315789
## Procavia Rattus Rhabdomys Saimiri Scalopus
## 0.00000000 0.01315789 0.01315789 0.01315789 0.01315789
## Sigmodon Spalax Spermophilus Suncus Sus
## 0.01315789 0.00000000 0.03947368 0.00000000 0.01315789
## Tachyglossus Tamias Tapirus Tenrec Tupaia
## 0.01315789 0.01315789 0.01315789 0.01315789 0.01315789
## Tursiops Vulpes
## 0.01315789 0.02631579
#For better display format, you can use the data frame function as follows:
data.frame(margin.table(prop.table(table(msleep$genus, msleep$vore)),1))
## Var1 Freq
## 1 Acinonyx 0.01315789
## 2 Aotus 0.01315789
## 3 Aplodontia 0.01315789
## 4 Blarina 0.01315789
## 5 Bos 0.01315789
## 6 Bradypus 0.01315789
## 7 Callorhinus 0.01315789
## 8 Calomys 0.00000000
## 9 Canis 0.01315789
## 10 Capreolus 0.01315789
## 11 Capri 0.01315789
## 12 Cavis 0.01315789
## 13 Cercopithecus 0.01315789
## 14 Chinchilla 0.01315789
## 15 Condylura 0.01315789
## 16 Cricetomys 0.01315789
## 17 Cryptotis 0.01315789
## 18 Dasypus 0.01315789
## 19 Dendrohyrax 0.01315789
## 20 Didelphis 0.01315789
## 21 Elephas 0.01315789
## 22 Eptesicus 0.01315789
## 23 Equus 0.02631579
## 24 Erinaceus 0.01315789
## 25 Erythrocebus 0.01315789
## 26 Eutamias 0.01315789
## 27 Felis 0.01315789
## 28 Galago 0.01315789
## 29 Genetta 0.01315789
## 30 Giraffa 0.01315789
## 31 Globicephalus 0.01315789
## 32 Haliochoerus 0.01315789
## 33 Heterohyrax 0.01315789
## 34 Homo 0.01315789
## 35 Lemur 0.01315789
## 36 Loxodonta 0.01315789
## 37 Lutreolina 0.01315789
## 38 Macaca 0.01315789
## 39 Meriones 0.01315789
## 40 Mesocricetus 0.01315789
## 41 Microtus 0.01315789
## 42 Mus 0.01315789
## 43 Myotis 0.01315789
## 44 Neofiber 0.01315789
## 45 Nyctibeus 0.01315789
## 46 Octodon 0.01315789
## 47 Onychomys 0.01315789
## 48 Oryctolagus 0.01315789
## 49 Ovis 0.01315789
## 50 Pan 0.01315789
## 51 Panthera 0.03947368
## 52 Papio 0.01315789
## 53 Paraechinus 0.00000000
## 54 Perodicticus 0.01315789
## 55 Peromyscus 0.00000000
## 56 Phalanger 0.00000000
## 57 Phoca 0.01315789
## 58 Phocoena 0.01315789
## 59 Potorous 0.01315789
## 60 Priodontes 0.01315789
## 61 Procavia 0.00000000
## 62 Rattus 0.01315789
## 63 Rhabdomys 0.01315789
## 64 Saimiri 0.01315789
## 65 Scalopus 0.01315789
## 66 Sigmodon 0.01315789
## 67 Spalax 0.00000000
## 68 Spermophilus 0.03947368
## 69 Suncus 0.00000000
## 70 Sus 0.01315789
## 71 Tachyglossus 0.01315789
## 72 Tamias 0.01315789
## 73 Tapirus 0.01315789
## 74 Tenrec 0.01315789
## 75 Tupaia 0.01315789
## 76 Tursiops 0.01315789
## 77 Vulpes 0.02631579
#Column totals: sum across the rows within each column
margin.table(prop.table(table(msleep$genus, msleep$vore)),2)
##
## carni herbi insecti omni
## 0.25000000 0.42105263 0.06578947 0.26315789
The last output shows the marginal distribution of the vore variable. The three functions table(), prop.table() and margin.table() are therefore very useful for building empirical probability distributions in tabular form. The resulting table object can easily be converted to a data frame if needed for further analysis.
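The relationship between the three functions can be checked on a small example. The sketch below uses a toy table with hypothetical values; it also shows the margin argument of prop.table(), which gives conditional rather than joint proportions.

```r
# Toy two-way table (hypothetical values)
tab <- table(
  group = c("a", "a", "b", "b", "b"),
  vore  = c("carni", "herbi", "herbi", "herbi", "carni")
)

# Joint proportions: every cell divided by the grand total
joint <- prop.table(tab)

# Marginal proportions: sum the joint table over one dimension
row_marginal <- margin.table(joint, 1)
col_marginal <- margin.table(joint, 2)

# Conditional proportions: each row divided by its own row total
row_conditional <- prop.table(tab, margin = 1)

print(joint)
print(row_conditional)
```

The joint table sums to 1 overall, while each row of row_conditional sums to 1 on its own.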
Question: Create a graph for a single numeric variable.
Answer: Let us draw a histogram of the awake variable of the msleep data frame:
ggplot(data=msleep,aes(x=awake)) +
geom_histogram(binwidth=1) +
labs(title="Awake Histogram", x="Awake Hours", y="Frequency")
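To see what a bin width of 1 means, the counts behind a histogram can be reproduced by hand. This is only a sketch on a toy vector with hypothetical values; ggplot2 may place its bin edges differently, so its bar heights need not match these counts exactly.

```r
# Toy vector of awake hours (hypothetical values)
awake <- c(4.1, 4.6, 5.2, 7.0, 7.9, 9.6, 9.6, 11.9)

# Bin edges one hour apart, covering the whole range of the data
breaks <- seq(floor(min(awake)), ceiling(max(awake)), by = 1)

# Assign each value to a left-closed bin such as [4,5)
bins <- cut(awake, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Number of observations per bin, i.e. the bar heights
bin_counts <- table(bins)
print(bin_counts)
```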
Question: Create a scatterplot of two numeric variables.
Answer: Let us consider the awake and bodywt variables, and draw a scatter plot, colored by vore, to observe the pattern
#position="jitter" adds a small random offset so that overlapping points stay visible
ggplot(data=msleep,aes(x=awake,y=bodywt,color=vore)) +
geom_point(position="jitter",size=3) +
labs(title="Awake Hours and Body Weight plot", x="Awake_Hours", y="Body_Weight_Kilograms")
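Because bodywt ranges from 0.005 to 6654 in the summary above, most points are squeezed against the bottom of a linear axis. One possible refinement, sketched here on a toy data frame (five illustrative rows; the names toy and p are assumptions), is a logarithmic y axis combined with a small, explicit amount of horizontal jitter.

```r
library(ggplot2)

# Toy data frame (illustrative rows only)
toy <- data.frame(
  awake  = c(11.9, 7.0, 9.6, 9.1, 20.0),
  bodywt = c(50, 0.48, 1.35, 0.019, 600),
  vore   = c("carni", "omni", "herbi", "omni", "herbi")
)

# Scatter plot with horizontal jitter only and a log10 y axis
p <- ggplot(toy, aes(x = awake, y = bodywt, color = vore)) +
  geom_point(size = 3,
             position = position_jitter(width = 0.1, height = 0)) +
  scale_y_log10() +
  labs(x = "Awake hours", y = "Body weight (kg, log scale)")

print(p)
```

Setting height = 0 keeps the jitter from distorting the body weights; whether a log scale suits the analysis is a judgment call.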