Question: Choose and load any R dataset (except for diamonds!) that has at least two numeric variables and at least two categorical variables. Identify which variables in your data set are numeric, and which are categorical (factors).

Answer: To list the data sets that ship with the currently attached packages, use the following command:

data()

I have chosen the “msleep” data frame, which contains sleep data for mammals. It ships with the ggplot2 package. If ggplot2 is not installed, uncomment and run the following command to install it:

#install.packages("ggplot2")
#Once the ggplot2 package is installed, use the following command to include the ggplot2 package:
library("ggplot2")
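A slightly more defensive way to load the package is sketched below. It installs ggplot2 only when it is missing and then confirms that msleep is among the data sets the package provides. The install step assumes a CRAN mirror is reachable, and the name pkg_datasets is only illustrative.

```r
# Install ggplot2 only if it is not already available (assumes CRAN access)
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)

# data(package = ...) returns a listing whose $results matrix has an "Item" column
pkg_datasets <- data(package = "ggplot2")$results[, "Item"]

# TRUE if msleep is one of the data sets shipped with ggplot2
"msleep" %in% pkg_datasets
```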

The ggplot2 package contains the msleep data frame. To check the variable types present in this data frame, use the following command:

str(msleep)
## 'data.frame':    83 obs. of  11 variables:
##  $ name        : chr  "Cheetah" "Owl monkey" "Mountain beaver" "Greater short-tailed shrew" ...
##  $ genus       : chr  "Acinonyx" "Aotus" "Aplodontia" "Blarina" ...
##  $ vore        : Factor w/ 4 levels "carni","herbi",..: 1 4 2 4 2 2 1 NA 1 2 ...
##  $ order       : chr  "Carnivora" "Primates" "Rodentia" "Soricomorpha" ...
##  $ conservation: Factor w/ 7 levels "","cd","domesticated",..: 5 NA 6 5 3 NA 7 NA 3 5 ...
##  $ sleep_total : num  12.1 17 14.4 14.9 4 14.4 8.7 7 10.1 3 ...
##  $ sleep_rem   : num  NA 1.8 2.4 2.3 0.7 2.2 1.4 NA 2.9 NA ...
##  $ sleep_cycle : num  NA NA NA 0.133 0.667 ...
##  $ awake       : num  11.9 7 9.6 9.1 20 9.6 15.3 17 13.9 21 ...
##  $ brainwt     : num  NA 0.0155 NA 0.00029 0.423 NA NA NA 0.07 0.0982 ...
##  $ bodywt      : num  50 0.48 1.35 0.019 600 ...

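In the msleep data frame, the following are categorical variables (factors): vore, conservation

The following are numerical variables: sleep_total, sleep_rem, sleep_cycle, awake, brainwt, bodywt

The following are character variables: name, genus, order. Of these, genus and order repeat across rows and can also be treated as categorical.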
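The same classification can be derived programmatically instead of being read off the str() output. The sketch below is one way to do it; the names numeric_vars, factor_vars and character_vars are illustrative. Newer ggplot2 releases may store vore and conservation as character rather than factor, so the split between the last two groups can differ from the output above.

```r
library(ggplot2)  # provides msleep

# Test every column for its storage type
is_num <- vapply(msleep, is.numeric, logical(1))
is_fct <- vapply(msleep, is.factor, logical(1))
is_chr <- vapply(msleep, is.character, logical(1))

numeric_vars   <- names(msleep)[is_num]
factor_vars    <- names(msleep)[is_fct]
character_vars <- names(msleep)[is_chr]

numeric_vars
factor_vars
character_vars
```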

Question: Generate summary level descriptive statistics: Show the mean, median, 25th and 75th quartiles, min, and max for each of the applicable variables in your data set.

Answer: We can use the “summary()” function to summarise a data set. For numeric variables, the minimum, 25th percentile (1st quartile), median, mean, 75th percentile (3rd quartile), and maximum are displayed, along with the number of missing values (NA's) if there are any. For factor variables, the count of each level is displayed. For character variables, only the length (number of elements), class, and mode are displayed.

summary(msleep)
##      name              genus                vore       order          
##  Length:83          Length:83          carni  :19   Length:83         
##  Class :character   Class :character   herbi  :32   Class :character  
##  Mode  :character   Mode  :character   insecti: 5   Mode  :character  
##                                        omni   :20                     
##                                        NA's   : 7                     
##                                                                       
##                                                                       
##        conservation  sleep_total      sleep_rem      sleep_cycle    
##  lc          :27    Min.   : 1.90   Min.   :0.100   Min.   :0.1167  
##  domesticated:10    1st Qu.: 7.85   1st Qu.:0.900   1st Qu.:0.1833  
##  vu          : 7    Median :10.10   Median :1.500   Median :0.3333  
##  en          : 4    Mean   :10.43   Mean   :1.875   Mean   :0.4396  
##  nt          : 4    3rd Qu.:13.75   3rd Qu.:2.400   3rd Qu.:0.5792  
##  (Other)     : 2    Max.   :19.90   Max.   :6.600   Max.   :1.5000  
##  NA's        :29                    NA's   :22      NA's   :51      
##      awake          brainwt            bodywt        
##  Min.   : 4.10   Min.   :0.00014   Min.   :   0.005  
##  1st Qu.:10.25   1st Qu.:0.00290   1st Qu.:   0.174  
##  Median :13.90   Median :0.01240   Median :   1.670  
##  Mean   :13.57   Mean   :0.28158   Mean   : 166.136  
##  3rd Qu.:16.15   3rd Qu.:0.12550   3rd Qu.:  41.750  
##  Max.   :22.10   Max.   :5.71200   Max.   :6654.000  
##                  NA's   :27
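When only the numeric variables are of interest, the same statistics can be collected into one matrix. The following is a minimal sketch, assuming missing values should simply be dropped (na.rm = TRUE); the names num_cols and stats and the row labels are illustrative.

```r
library(ggplot2)  # provides msleep

# Keep only the numeric columns
num_cols <- msleep[vapply(msleep, is.numeric, logical(1))]

# One column of statistics per variable, ignoring missing values
stats <- sapply(num_cols, function(x) c(
  min       = min(x, na.rm = TRUE),
  q25       = unname(quantile(x, 0.25, na.rm = TRUE)),
  median    = median(x, na.rm = TRUE),
  mean      = mean(x, na.rm = TRUE),
  q75       = unname(quantile(x, 0.75, na.rm = TRUE)),
  max       = max(x, na.rm = TRUE),
  n_missing = sum(is.na(x))
))

# Transpose so that each variable is a row
round(t(stats), 3)
```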

Question: Determine the frequency for one of the categorical variables.

Answer: To determine the frequency of each level of a categorical variable, we can use the table() function. The following code displays the frequencies of the vore variable in the msleep data frame. By default, table() omits missing values, so the 7 NA entries of vore are not counted:

table(msleep$vore)
## 
##   carni   herbi insecti    omni 
##      19      32       5      20
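To include the missing values in the count, table() accepts a useNA argument. A short sketch (the name vore_counts is illustrative):

```r
library(ggplot2)  # provides msleep

# useNA = "ifany" adds an NA category when missing values are present
vore_counts <- table(msleep$vore, useNA = "ifany")
vore_counts

# With NA included, the counts add up to the number of rows
sum(vore_counts) == nrow(msleep)
```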

Question: Determine the frequency for one of the categorical variables, by a different categorical variable.

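Answer: We can use the table() function again to get the joint frequencies of two categorical variables. The example below cross-tabulates genus and vore; genus is stored as a character vector, which table() treats as categorical. Genera whose vore value is missing appear as rows of zeros: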

table(msleep$genus, msleep$vore)
##                
##                 carni herbi insecti omni
##   Acinonyx          1     0       0    0
##   Aotus             0     0       0    1
##   Aplodontia        0     1       0    0
##   Blarina           0     0       0    1
##   Bos               0     1       0    0
##   Bradypus          0     1       0    0
##   Callorhinus       1     0       0    0
##   Calomys           0     0       0    0
##   Canis             1     0       0    0
##   Capreolus         0     1       0    0
##   Capri             0     1       0    0
##   Cavis             0     1       0    0
##   Cercopithecus     0     0       0    1
##   Chinchilla        0     1       0    0
##   Condylura         0     0       0    1
##   Cricetomys        0     0       0    1
##   Cryptotis         0     0       0    1
##   Dasypus           1     0       0    0
##   Dendrohyrax       0     1       0    0
##   Didelphis         0     0       0    1
##   Elephas           0     1       0    0
##   Eptesicus         0     0       1    0
##   Equus             0     2       0    0
##   Erinaceus         0     0       0    1
##   Erythrocebus      0     0       0    1
##   Eutamias          0     1       0    0
##   Felis             1     0       0    0
##   Galago            0     0       0    1
##   Genetta           1     0       0    0
##   Giraffa           0     1       0    0
##   Globicephalus     1     0       0    0
##   Haliochoerus      1     0       0    0
##   Heterohyrax       0     1       0    0
##   Homo              0     0       0    1
##   Lemur             0     1       0    0
##   Loxodonta         0     1       0    0
##   Lutreolina        1     0       0    0
##   Macaca            0     0       0    1
##   Meriones          0     1       0    0
##   Mesocricetus      0     1       0    0
##   Microtus          0     1       0    0
##   Mus               0     1       0    0
##   Myotis            0     0       1    0
##   Neofiber          0     1       0    0
##   Nyctibeus         1     0       0    0
##   Octodon           0     1       0    0
##   Onychomys         1     0       0    0
##   Oryctolagus       0     1       0    0
##   Ovis              0     1       0    0
##   Pan               0     0       0    1
##   Panthera          3     0       0    0
##   Papio             0     0       0    1
##   Paraechinus       0     0       0    0
##   Perodicticus      0     0       0    1
##   Peromyscus        0     0       0    0
##   Phalanger         0     0       0    0
##   Phoca             1     0       0    0
##   Phocoena          1     0       0    0
##   Potorous          0     1       0    0
##   Priodontes        0     0       1    0
##   Procavia          0     0       0    0
##   Rattus            0     1       0    0
##   Rhabdomys         0     0       0    1
##   Saimiri           0     0       0    1
##   Scalopus          0     0       1    0
##   Sigmodon          0     1       0    0
##   Spalax            0     0       0    0
##   Spermophilus      0     3       0    0
##   Suncus            0     0       0    0
##   Sus               0     0       0    1
##   Tachyglossus      0     0       1    0
##   Tamias            0     1       0    0
##   Tapirus           0     1       0    0
##   Tenrec            0     0       0    1
##   Tupaia            0     0       0    1
##   Tursiops          1     0       0    0
##   Vulpes            2     0       0    0
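With 77 genera the table above is long. As an illustration only, a pair of variables with fewer levels gives a more compact cross-tabulation. The sketch below uses conservation and vore, names the dimensions, and appends row and column totals with addmargins(). Rows of msleep with a missing value in either variable are dropped, as in the default table() behaviour.

```r
library(ggplot2)  # provides msleep

# Naming the arguments labels the dimensions of the resulting table
tab <- table(conservation = msleep$conservation, vore = msleep$vore)

# Append a "Sum" row and a "Sum" column
tab_with_totals <- addmargins(tab)
tab_with_totals
```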

Additional information: We can use the prop.table() function to convert counts to proportions. With no margin argument, each cell is divided by the grand total, so the result is the empirical joint distribution of the two variables. For instance, the following command converts the table generated above into the joint distribution of genus and vore:

prop.table(table(msleep$genus, msleep$vore))
##                
##                      carni      herbi    insecti       omni
##   Acinonyx      0.01315789 0.00000000 0.00000000 0.00000000
##   Aotus         0.00000000 0.00000000 0.00000000 0.01315789
##   Aplodontia    0.00000000 0.01315789 0.00000000 0.00000000
##   Blarina       0.00000000 0.00000000 0.00000000 0.01315789
##   Bos           0.00000000 0.01315789 0.00000000 0.00000000
##   Bradypus      0.00000000 0.01315789 0.00000000 0.00000000
##   Callorhinus   0.01315789 0.00000000 0.00000000 0.00000000
##   Calomys       0.00000000 0.00000000 0.00000000 0.00000000
##   Canis         0.01315789 0.00000000 0.00000000 0.00000000
##   Capreolus     0.00000000 0.01315789 0.00000000 0.00000000
##   Capri         0.00000000 0.01315789 0.00000000 0.00000000
##   Cavis         0.00000000 0.01315789 0.00000000 0.00000000
##   Cercopithecus 0.00000000 0.00000000 0.00000000 0.01315789
##   Chinchilla    0.00000000 0.01315789 0.00000000 0.00000000
##   Condylura     0.00000000 0.00000000 0.00000000 0.01315789
##   Cricetomys    0.00000000 0.00000000 0.00000000 0.01315789
##   Cryptotis     0.00000000 0.00000000 0.00000000 0.01315789
##   Dasypus       0.01315789 0.00000000 0.00000000 0.00000000
##   Dendrohyrax   0.00000000 0.01315789 0.00000000 0.00000000
##   Didelphis     0.00000000 0.00000000 0.00000000 0.01315789
##   Elephas       0.00000000 0.01315789 0.00000000 0.00000000
##   Eptesicus     0.00000000 0.00000000 0.01315789 0.00000000
##   Equus         0.00000000 0.02631579 0.00000000 0.00000000
##   Erinaceus     0.00000000 0.00000000 0.00000000 0.01315789
##   Erythrocebus  0.00000000 0.00000000 0.00000000 0.01315789
##   Eutamias      0.00000000 0.01315789 0.00000000 0.00000000
##   Felis         0.01315789 0.00000000 0.00000000 0.00000000
##   Galago        0.00000000 0.00000000 0.00000000 0.01315789
##   Genetta       0.01315789 0.00000000 0.00000000 0.00000000
##   Giraffa       0.00000000 0.01315789 0.00000000 0.00000000
##   Globicephalus 0.01315789 0.00000000 0.00000000 0.00000000
##   Haliochoerus  0.01315789 0.00000000 0.00000000 0.00000000
##   Heterohyrax   0.00000000 0.01315789 0.00000000 0.00000000
##   Homo          0.00000000 0.00000000 0.00000000 0.01315789
##   Lemur         0.00000000 0.01315789 0.00000000 0.00000000
##   Loxodonta     0.00000000 0.01315789 0.00000000 0.00000000
##   Lutreolina    0.01315789 0.00000000 0.00000000 0.00000000
##   Macaca        0.00000000 0.00000000 0.00000000 0.01315789
##   Meriones      0.00000000 0.01315789 0.00000000 0.00000000
##   Mesocricetus  0.00000000 0.01315789 0.00000000 0.00000000
##   Microtus      0.00000000 0.01315789 0.00000000 0.00000000
##   Mus           0.00000000 0.01315789 0.00000000 0.00000000
##   Myotis        0.00000000 0.00000000 0.01315789 0.00000000
##   Neofiber      0.00000000 0.01315789 0.00000000 0.00000000
##   Nyctibeus     0.01315789 0.00000000 0.00000000 0.00000000
##   Octodon       0.00000000 0.01315789 0.00000000 0.00000000
##   Onychomys     0.01315789 0.00000000 0.00000000 0.00000000
##   Oryctolagus   0.00000000 0.01315789 0.00000000 0.00000000
##   Ovis          0.00000000 0.01315789 0.00000000 0.00000000
##   Pan           0.00000000 0.00000000 0.00000000 0.01315789
##   Panthera      0.03947368 0.00000000 0.00000000 0.00000000
##   Papio         0.00000000 0.00000000 0.00000000 0.01315789
##   Paraechinus   0.00000000 0.00000000 0.00000000 0.00000000
##   Perodicticus  0.00000000 0.00000000 0.00000000 0.01315789
##   Peromyscus    0.00000000 0.00000000 0.00000000 0.00000000
##   Phalanger     0.00000000 0.00000000 0.00000000 0.00000000
##   Phoca         0.01315789 0.00000000 0.00000000 0.00000000
##   Phocoena      0.01315789 0.00000000 0.00000000 0.00000000
##   Potorous      0.00000000 0.01315789 0.00000000 0.00000000
##   Priodontes    0.00000000 0.00000000 0.01315789 0.00000000
##   Procavia      0.00000000 0.00000000 0.00000000 0.00000000
##   Rattus        0.00000000 0.01315789 0.00000000 0.00000000
##   Rhabdomys     0.00000000 0.00000000 0.00000000 0.01315789
##   Saimiri       0.00000000 0.00000000 0.00000000 0.01315789
##   Scalopus      0.00000000 0.00000000 0.01315789 0.00000000
##   Sigmodon      0.00000000 0.01315789 0.00000000 0.00000000
##   Spalax        0.00000000 0.00000000 0.00000000 0.00000000
##   Spermophilus  0.00000000 0.03947368 0.00000000 0.00000000
##   Suncus        0.00000000 0.00000000 0.00000000 0.00000000
##   Sus           0.00000000 0.00000000 0.00000000 0.01315789
##   Tachyglossus  0.00000000 0.00000000 0.01315789 0.00000000
##   Tamias        0.00000000 0.01315789 0.00000000 0.00000000
##   Tapirus       0.00000000 0.01315789 0.00000000 0.00000000
##   Tenrec        0.00000000 0.00000000 0.00000000 0.01315789
##   Tupaia        0.00000000 0.00000000 0.00000000 0.01315789
##   Tursiops      0.01315789 0.00000000 0.00000000 0.00000000
##   Vulpes        0.02631579 0.00000000 0.00000000 0.00000000
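prop.table() also takes a margin argument, which gives conditional rather than joint proportions. The sketch below first drops the genera whose vore is unknown (the all-zero rows), since dividing by a zero row total would produce NaN. The names joint, by_genus and by_vore are illustrative.

```r
library(ggplot2)  # provides msleep

tab <- table(genus = msleep$genus, vore = msleep$vore)

# Drop genera with no known vore value (rows that sum to zero)
tab <- tab[rowSums(tab) > 0, , drop = FALSE]

joint    <- prop.table(tab)              # all cells sum to 1
by_genus <- prop.table(tab, margin = 1)  # each row sums to 1
by_vore  <- prop.table(tab, margin = 2)  # each column sums to 1

head(by_genus)
```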

Another useful function is margin.table(), which sums a table over one of its dimensions: margin 1 gives the row totals and margin 2 gives the column totals. For example, the following commands take the joint distribution of genus and vore computed above and generate its row and column sums, that is, the marginal proportions.

#Row sums: add up the column values within each row (margin = 1)
margin.table(prop.table(table(msleep$genus, msleep$vore)),1)
## 
##      Acinonyx         Aotus    Aplodontia       Blarina           Bos 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##      Bradypus   Callorhinus       Calomys         Canis     Capreolus 
##    0.01315789    0.01315789    0.00000000    0.01315789    0.01315789 
##         Capri         Cavis Cercopithecus    Chinchilla     Condylura 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##    Cricetomys     Cryptotis       Dasypus   Dendrohyrax     Didelphis 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##       Elephas     Eptesicus         Equus     Erinaceus  Erythrocebus 
##    0.01315789    0.01315789    0.02631579    0.01315789    0.01315789 
##      Eutamias         Felis        Galago       Genetta       Giraffa 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
## Globicephalus  Haliochoerus   Heterohyrax          Homo         Lemur 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##     Loxodonta    Lutreolina        Macaca      Meriones  Mesocricetus 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##      Microtus           Mus        Myotis      Neofiber     Nyctibeus 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##       Octodon     Onychomys   Oryctolagus          Ovis           Pan 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##      Panthera         Papio   Paraechinus  Perodicticus    Peromyscus 
##    0.03947368    0.01315789    0.00000000    0.01315789    0.00000000 
##     Phalanger         Phoca      Phocoena      Potorous    Priodontes 
##    0.00000000    0.01315789    0.01315789    0.01315789    0.01315789 
##      Procavia        Rattus     Rhabdomys       Saimiri      Scalopus 
##    0.00000000    0.01315789    0.01315789    0.01315789    0.01315789 
##      Sigmodon        Spalax  Spermophilus        Suncus           Sus 
##    0.01315789    0.00000000    0.03947368    0.00000000    0.01315789 
##  Tachyglossus        Tamias       Tapirus        Tenrec        Tupaia 
##    0.01315789    0.01315789    0.01315789    0.01315789    0.01315789 
##      Tursiops        Vulpes 
##    0.01315789    0.02631579
#For better display format, you can use the data frame function as follows:
data.frame(margin.table(prop.table(table(msleep$genus, msleep$vore)),1))
##             Var1       Freq
## 1       Acinonyx 0.01315789
## 2          Aotus 0.01315789
## 3     Aplodontia 0.01315789
## 4        Blarina 0.01315789
## 5            Bos 0.01315789
## 6       Bradypus 0.01315789
## 7    Callorhinus 0.01315789
## 8        Calomys 0.00000000
## 9          Canis 0.01315789
## 10     Capreolus 0.01315789
## 11         Capri 0.01315789
## 12         Cavis 0.01315789
## 13 Cercopithecus 0.01315789
## 14    Chinchilla 0.01315789
## 15     Condylura 0.01315789
## 16    Cricetomys 0.01315789
## 17     Cryptotis 0.01315789
## 18       Dasypus 0.01315789
## 19   Dendrohyrax 0.01315789
## 20     Didelphis 0.01315789
## 21       Elephas 0.01315789
## 22     Eptesicus 0.01315789
## 23         Equus 0.02631579
## 24     Erinaceus 0.01315789
## 25  Erythrocebus 0.01315789
## 26      Eutamias 0.01315789
## 27         Felis 0.01315789
## 28        Galago 0.01315789
## 29       Genetta 0.01315789
## 30       Giraffa 0.01315789
## 31 Globicephalus 0.01315789
## 32  Haliochoerus 0.01315789
## 33   Heterohyrax 0.01315789
## 34          Homo 0.01315789
## 35         Lemur 0.01315789
## 36     Loxodonta 0.01315789
## 37    Lutreolina 0.01315789
## 38        Macaca 0.01315789
## 39      Meriones 0.01315789
## 40  Mesocricetus 0.01315789
## 41      Microtus 0.01315789
## 42           Mus 0.01315789
## 43        Myotis 0.01315789
## 44      Neofiber 0.01315789
## 45     Nyctibeus 0.01315789
## 46       Octodon 0.01315789
## 47     Onychomys 0.01315789
## 48   Oryctolagus 0.01315789
## 49          Ovis 0.01315789
## 50           Pan 0.01315789
## 51      Panthera 0.03947368
## 52         Papio 0.01315789
## 53   Paraechinus 0.00000000
## 54  Perodicticus 0.01315789
## 55    Peromyscus 0.00000000
## 56     Phalanger 0.00000000
## 57         Phoca 0.01315789
## 58      Phocoena 0.01315789
## 59      Potorous 0.01315789
## 60    Priodontes 0.01315789
## 61      Procavia 0.00000000
## 62        Rattus 0.01315789
## 63     Rhabdomys 0.01315789
## 64       Saimiri 0.01315789
## 65      Scalopus 0.01315789
## 66      Sigmodon 0.01315789
## 67        Spalax 0.00000000
## 68  Spermophilus 0.03947368
## 69        Suncus 0.00000000
## 70           Sus 0.01315789
## 71  Tachyglossus 0.01315789
## 72        Tamias 0.01315789
## 73       Tapirus 0.01315789
## 74        Tenrec 0.01315789
## 75        Tupaia 0.01315789
## 76      Tursiops 0.01315789
## 77        Vulpes 0.02631579
#Column sums: add up the row values within each column (margin = 2)
margin.table(prop.table(table(msleep$genus, msleep$vore)),2)
## 
##      carni      herbi    insecti       omni 
## 0.25000000 0.42105263 0.06578947 0.26315789

The last output shows the marginal distribution of the vore variable. The three functions table(), prop.table() and margin.table() are therefore useful for computing empirical probability distributions in tabular form. The resulting table object can easily be converted to a data frame if needed for further analysis.
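As a sketch of that conversion, as.data.frame() with a responseName gives a long-format data frame with readable column names instead of Var1 and Freq. The same block also checks that the marginal distribution of vore can be obtained directly from the counts. All object names here are illustrative.

```r
library(ggplot2)  # provides msleep

tab   <- table(genus = msleep$genus, vore = msleep$vore)
joint <- prop.table(tab)

# Marginal distribution of vore, computed two equivalent ways
vore_marginal     <- margin.table(joint, 2)
vore_marginal_alt <- colSums(tab) / sum(tab)

# Long format: one row per (genus, vore) combination
joint_df <- as.data.frame(joint, responseName = "proportion")
head(joint_df[joint_df$proportion > 0, ])
```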

Question: Create a graph for a single numeric variable.

Answer: Let us draw a histogram of the awake variable in the msleep data frame:

ggplot(data=msleep,aes(x=awake)) + 
  geom_histogram(binwidth=1) +
  labs(title="Awake Histogram", x="Awake Hours", y="Frequency")
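The counts behind a histogram like this can also be inspected numerically with base R's hist() and plot = FALSE. In this sketch the bin edges are placed on whole hours, which may differ from the edges ggplot2 picks by default, so treat it as a companion view rather than an exact replica of the plot. Adding boundary = 0 to geom_histogram() should line the two up. The names breaks and h are illustrative.

```r
library(ggplot2)  # provides msleep

# One-hour bins with edges on whole hours, covering the full range of awake
breaks <- seq(floor(min(msleep$awake)), ceiling(max(msleep$awake)), by = 1)

# plot = FALSE returns the bin edges and counts without drawing anything
h <- hist(msleep$awake, breaks = breaks, plot = FALSE)

data.frame(from  = head(h$breaks, -1),
           to    = tail(h$breaks, -1),
           count = h$counts)
```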

Question: Create a scatterplot of two numeric variables.

Answer: Let us take the awake and bodywt variables and draw a scatter plot, coloured by vore, to observe the pattern:

#geom_point() has no jitter argument; jittering is requested through position
ggplot(data=msleep, aes(x=awake, y=bodywt, color=vore)) + 
    geom_point(position="jitter", size=3) +
    labs(title="Body weight vs. awake hours", x="Awake hours", y="Body weight (kg)")
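The summary output earlier shows that bodywt is heavily skewed (median 1.67, maximum 6654), so on a linear axis most points are squeezed near zero. One possible refinement, sketched here as a suggestion rather than as part of the original answer, is a log-scaled y axis:

```r
library(ggplot2)  # provides msleep

# Same scatter plot, with body weight on a log10 axis to spread out small values
p <- ggplot(msleep, aes(x = awake, y = bodywt, color = vore)) +
  geom_point(size = 3) +
  scale_y_log10() +
  labs(title = "Body weight (log scale) vs. awake hours",
       x = "Awake hours", y = "Body weight (kg, log10 scale)")

print(p)
```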