Fictional Nepalese Research Study

Please be aware that this is a fictional study, conducted to practice my big data analysis skills.

This project is divided into two phases: A. Fictional Data Generation, and B. Exploratory Data Analyses/Data Visualization

A. Fictional Data Generation

I will start off by creating 100,000 fictional data points describing the characteristics of Nepalese people:

Variables and their Explanations

Male: Reported gender of the Nepalese people. This variable has two categories, Yes and No. The response ‘Yes’ means that the participant is male, while ‘No’ means female.
Province: The province where the research participants live. These provinces correspond to the present-day administrative divisions of Nepal. This variable has 7 categories: Province 1, Province 2, Province 3, Province 4, Province 5, Province 6, Province 7.
Age: Reported age of the respondents. It is treated as a continuous variable, recorded here in whole years. Let’s assume only adults were included in the study, so participants were between 25 and 80 years old.
Height: Height of the participants measured in centimeters. As our participants are adults, it is fair to assume that their heights measure between 125 and 200 centimeters.
Weight: Let’s assume that the participants weigh between 40 and 90 kilograms.
Area: Let’s assume that they live either in urban areas like Kathmandu or in rural Nepalese villages. Participants who respond ‘Urban’ live in urban areas and those who respond ‘Rural’ live in villages.
Income: Because these are fictional data and I am going to assign income to the participants at random, the income levels here may not represent the national picture. With that note, income is a categorical variable: people who earn Rs. 0 to 100,000 are Low Income (coded 1), Rs. 100,001 to 200,000 are Medium Income (coded 2), Rs. 200,001 to 400,000 are Medium-High Income (coded 3), and Rs. 400,001 and higher are High Income (coded 4).
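
The income coding described above can be sketched with cut(). This is only an illustration under stated assumptions: the rupee amounts in income_rs are hypothetical values chosen to sit on the bracket boundaries, and the study itself samples the codes 1 to 4 directly rather than deriving them from amounts.

```r
# Hypothetical incomes in rupees (illustrative values, not part of the study data)
income_rs <- c(0, 50000, 100000, 100001, 200000, 200001, 400000, 400001)

# Map each amount to its bracket; intervals are closed on the right,
# and include.lowest = TRUE lets Rs. 0 fall into the first bracket
income_level <- cut(
  income_rs,
  breaks = c(0, 100000, 200000, 400000, Inf),
  labels = c("Low", "Medium", "Medium-High", "High"),
  include.lowest = TRUE
)

# Integer codes 1 to 4, matching the coding of the Income variable
income_code <- as.integer(income_level)
data.frame(income_rs, income_level, income_code)
```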

In this project, I am going to conduct some preliminary analyses. Imagine that I want to study Nepal: I have downloaded census data from the Nepalese government’s online archive, and I want to familiarize myself with the data and decide which direction to take. This kind of analysis could also be part of my job as a research analyst, when my boss wants some insights into Nepalese data. For example, she may want to launch a product (let’s say a clothing line) in the Nepali market and check whether it fits the average Nepalese customer, or she may want to design custom products that target average people based on their height.

Male <- c("Yes", "No")
Province <- c("Province 1", "Province 2", "Province 3", "Province 4", "Province 5", "Province 6", "Province 7")
Age <- 25:80
Height <- 125:200
Weight <- 40:90
Area <- c("Urban","Rural") 
Income <- 1:4

Here’s the structure of all the variables included in the study.

# Variable "Male"- Yes refers to Males, and No to Females
Male

[1] "Yes" "No"

# Variable "Province" - Province 1 through Province 7
Province

[1] "Province 1" "Province 2" "Province 3" "Province 4" "Province 5"
[6] "Province 6" "Province 7"

# Variable "Age"- Age of the Participants
Age

 [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
[26] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[51] 75 76 77 78 79 80

# Variable "Height"- Height of the Participants in Centimeters
Height

 [1] 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
[20] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[39] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
[58] 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200

# Variable "Weight"- Weight of the Participants in Kilograms
Weight

 [1] 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[26] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
[51] 90

# Variable "Area"- Area of living of the Participants
Area

[1] "Urban" "Rural"

# Variable "Income"- Income level of the Participants
Income

[1] 1 2 3 4

Now, I am going to create an empty matrix to hold the values of these variables. As mentioned earlier, I am going to use 100,000 data points, so I need 100,000 rows and 7 columns, one for each variable I created above. I don’t have any data to put into this table yet. I will fill it in later.

Nepal_Data <- matrix (nrow = 100000, ncol = 7, data = NA)
colnames(Nepal_Data) <- c("Male", "Province", "Age", "Height", "Weight", "Area", "Income")

Now that I have created an empty matrix, let’s check that it worked.

head(Nepal_Data)

     Male Province Age Height Weight Area Income
[1,]   NA       NA  NA     NA     NA   NA     NA
[2,]   NA       NA  NA     NA     NA   NA     NA
[3,]   NA       NA  NA     NA     NA   NA     NA
[4,]   NA       NA  NA     NA     NA   NA     NA
[5,]   NA       NA  NA     NA     NA   NA     NA
[6,]   NA       NA  NA     NA     NA   NA     NA
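
Beyond head(), the dimensions and storage type of the matrix can be checked directly. The sketch below uses a small stand-in matrix named m (a hypothetical name, 10 rows instead of 100,000) and also shows a property worth keeping in mind: a matrix holds a single data type, so storing one string turns every cell, numbers included, into text.

```r
# Small stand-in with the same column layout as Nepal_Data
m <- matrix(data = NA, nrow = 10, ncol = 7)
colnames(m) <- c("Male", "Province", "Age", "Height", "Weight", "Area", "Income")

dim(m)                     # rows and columns
type_before <- typeof(m)   # an all-NA matrix starts out as "logical"

# Storing a string coerces the whole matrix to character
m[1, "Male"] <- "Yes"
m[1, "Age"]  <- 30
type_after <- typeof(m)

c(before = type_before, after = type_after)
m[1, "Age"]                # the age is now stored as the text "30"
```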

The outline of the matrix has been created but it doesn’t have any data. I am going to create a ‘for loop’ that randomly selects values from the specified range for each of the variables I created and stores them in the matrix named Nepal_Data 100000 times.

Here’s how we do it:

for ( i in 1:100000){
  Nepal_Data[i,1] <- sample(Male, size = 1)
  Nepal_Data[i,2] <- sample(Province, size = 1)
  Nepal_Data[i,3] <- sample(Age, size = 1)
  Nepal_Data[i,4] <- sample(Height, size = 1)
  Nepal_Data[i,5] <- sample(Weight, size = 1)
  Nepal_Data[i,6] <- sample(Area, size = 1)
  Nepal_Data[i,7] <- sample(Income, size = 1)
}
head(Nepal_Data)

     Male  Province     Age  Height Weight Area    Income
[1,] "Yes" "Province 5" "66" "180"  "48"   "Rural" "3"   
[2,] "No"  "Province 6" "35" "196"  "85"   "Urban" "3"   
[3,] "Yes" "Province 1" "69" "178"  "64"   "Urban" "3"   
[4,] "Yes" "Province 5" "75" "160"  "44"   "Urban" "3"   
[5,] "Yes" "Province 5" "58" "177"  "49"   "Urban" "2"   
[6,] "Yes" "Province 2" "75" "193"  "44"   "Rural" "4"
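
The quoted numbers in the output above appear because a matrix stores everything as text once it contains strings. As an alternative sketch (not the approach used in this study), the same kind of data can be drawn in one step per variable with sample(..., replace = TRUE) and stored in a data frame, which keeps numeric columns numeric. The name Nepal_Sketch and the seed value are assumptions made for illustration.

```r
set.seed(2024)  # arbitrary seed, used only so the sketch is reproducible
n <- 100000

# One vectorized draw per variable instead of 100,000 loop iterations
Nepal_Sketch <- data.frame(
  Male     = sample(c("Yes", "No"), n, replace = TRUE),
  Province = sample(paste("Province", 1:7), n, replace = TRUE),
  Age      = sample(25:80, n, replace = TRUE),
  Height   = sample(125:200, n, replace = TRUE),
  Weight   = sample(40:90, n, replace = TRUE),
  Area     = sample(c("Urban", "Rural"), n, replace = TRUE),
  Income   = sample(1:4, n, replace = TRUE)
)

str(Nepal_Sketch)  # Age, Height, Weight and Income stay integer
```

Because each column keeps its own type, a data frame built this way would not need a later conversion step before numeric analysis.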

It looks like the data have been generated and stored in the Nepal_Data matrix. As the data have one hundred thousand rows, it is pointless to display all of them here. However, I need to make sure that all the data were created, and I also want to see their structure, so I need a way to summarize the data.

For this, I am going to use the apply function together with table. Applied to each column, table lists the categories of that variable and the number of data points in each category.

There is no reason to summarize the data by rows, because each row mixes different types of data and there are 100,000 of them. So, we are going to summarize the data by columns (MARGIN = 2). Here’s how we do it:

apply(X = Nepal_Data, MARGIN = 2, FUN = table)

$Male

   No   Yes 
49891 50109 

$Province

Province 1 Province 2 Province 3 Province 4 Province 5 Province 6 Province 7 
     14423      14265      14287      14270      14283      14161      14311 

$Age

  25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40 
1833 1835 1834 1701 1730 1828 1717 1731 1883 1798 1795 1750 1740 1844 1801 1754 
  41   42   43   44   45   46   47   48   49   50   51   52   53   54   55   56 
1806 1793 1769 1772 1780 1807 1852 1737 1766 1842 1774 1808 1740 1808 1803 1803 
  57   58   59   60   61   62   63   64   65   66   67   68   69   70   71   72 
1829 1782 1770 1886 1715 1767 1782 1724 1757 1746 1895 1740 1740 1743 1785 1818 
  73   74   75   76   77   78   79   80 
1805 1828 1843 1725 1786 1755 1728 1817 

$Height

 125  126  127  128  129  130  131  132  133  134  135  136  137  138  139  140 
1305 1310 1279 1284 1308 1278 1349 1345 1382 1300 1338 1330 1318 1362 1347 1299 
 141  142  143  144  145  146  147  148  149  150  151  152  153  154  155  156 
1264 1306 1376 1302 1278 1334 1269 1318 1314 1321 1333 1281 1329 1261 1335 1264 
 157  158  159  160  161  162  163  164  165  166  167  168  169  170  171  172 
1364 1361 1310 1289 1260 1277 1305 1382 1293 1322 1305 1314 1303 1363 1375 1348 
 173  174  175  176  177  178  179  180  181  182  183  184  185  186  187  188 
1281 1322 1348 1299 1365 1307 1306 1312 1276 1348 1322 1281 1341 1308 1319 1351 
 189  190  191  192  193  194  195  196  197  198  199  200 
1286 1298 1305 1297 1324 1313 1327 1308 1265 1294 1360 1347 

$Weight

  40   41   42   43   44   45   46   47   48   49   50   51   52   53   54   55 
1917 1992 1967 1956 2004 1970 1980 1956 1938 2016 1909 1990 1935 1969 1951 2034 
  56   57   58   59   60   61   62   63   64   65   66   67   68   69   70   71 
1937 1977 1922 1880 2040 1994 1996 1944 1977 1896 1936 2002 1969 1959 1909 1983 
  72   73   74   75   76   77   78   79   80   81   82   83   84   85   86   87 
1967 1980 2007 1982 1893 2001 1927 1935 1938 2004 1951 2015 1987 1893 1982 1986 
  88   89   90 
1918 1872 1957 

$Area

Rural Urban 
49939 50061 

$Income

    1     2     3     4 
25003 24977 24771 25249
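
Since every value is drawn with equal probability, each category should hold roughly the same share of the 100,000 rows, for example about 100000 / 7, or roughly 14,286, per province, which is in line with the counts above. A minimal sketch of that check follows, using a freshly sampled province vector (the seed and variable names are assumptions, not the study's data):

```r
set.seed(1)  # arbitrary seed for reproducibility
province <- sample(paste("Province", 1:7), 100000, replace = TRUE)

counts   <- table(province)
expected <- length(province) / length(counts)  # about 14,286 per province

round(prop.table(counts), 3)             # observed share of each province
max(abs(counts - expected)) / expected   # largest relative deviation from uniform
```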

B. Exploratory Data Analyses/Data Visualization

Now, I am going to subset the data and dig into the sample. For demonstration purposes, I am going to compare the heights of rural males with those of urban males.

It is hard to manipulate a matrix the way I want. Thus, my first job is to convert the matrix to a data.frame. Here’s how we do it.

Nepal_Table <- as.data.frame(Nepal_Data, stringsAsFactors = TRUE)
head(Nepal_Table)

  Male   Province Age Height Weight  Area Income
1  Yes Province 5  66    180     48 Rural      3
2   No Province 6  35    196     85 Urban      3
3  Yes Province 1  69    178     64 Urban      3
4  Yes Province 5  75    160     44 Urban      3
5  Yes Province 5  58    177     49 Urban      2
6  Yes Province 2  75    193     44 Rural      4

Subsetting Data Using ‘filter’ command

Rural and Urban Nepalese Males

library(dplyr)  # provides filter()
Rural_Nepalese_Males <- filter(Nepal_Table, Area == "Rural", Male == "Yes")
Urban_Nepalese_Males <- filter(Nepal_Table, Area == "Urban", Male == "Yes")
summary(Rural_Nepalese_Males)

  Male             Province         Age            Height          Weight     
 No :    0   Province 1:3566   67     :  498   171    :  373   48     :  539  
 Yes:24988   Province 2:3570   60     :  496   183    :  361   83     :  534  
             Province 3:3492   27     :  490   189    :  360   81     :  523  
             Province 4:3580   33     :  486   174    :  358   84     :  519  
             Province 5:3528   49     :  480   148    :  353   55     :  517  
             Province 6:3562   50     :  477   185    :  351   49     :  514  
             Province 7:3690   (Other):22061   (Other):22832   (Other):21842  
    Area       Income  
 Rural:24988   1:6285  
 Urban:    0   2:6244  
               3:6126  
               4:6333
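
The summary above still lists ‘No’ and ‘Urban’ with a count of 0, because filtering a factor column keeps its unused levels. As a minimal base R sketch (the toy data frame and its values are made up for illustration), subset() does the same kind of filtering and droplevels() removes the empty levels:

```r
# Toy data with factor columns, mimicking the structure of Nepal_Table
toy <- data.frame(
  Male   = factor(c("Yes", "No", "Yes", "Yes", "No")),
  Area   = factor(c("Rural", "Rural", "Urban", "Rural", "Urban")),
  Height = c(180, 196, 178, 160, 177)
)

# Base R equivalent of filter(toy, Area == "Rural", Male == "Yes")
rural_males <- subset(toy, Area == "Rural" & Male == "Yes")
levels(rural_males$Male)   # the unused level "No" is still present

# Drop factor levels that no longer occur in the subset
rural_males <- droplevels(rural_males)
levels(rural_males$Male)
```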

sapply(Rural_Nepalese_Males, class)

    Male Province      Age   Height   Weight     Area   Income 
"factor" "factor" "factor" "factor" "factor" "factor" "factor"

summary(Urban_Nepalese_Males)

  Male             Province         Age            Height          Weight     
 No :    0   Province 1:3662   26     :  514   158    :  392   60     :  554  
 Yes:25121   Province 2:3552   60     :  497   170    :  374   64     :  548  
             Province 3:3572   38     :  487   151    :  372   73     :  528  
             Province 4:3596   50     :  485   131    :  370   49     :  525  
             Province 5:3625   35     :  482   143    :  369   46     :  522  
             Province 6:3573   39     :  482   172    :  357   67     :  522  
             Province 7:3541   (Other):22174   (Other):22887   (Other):21922  
    Area       Income  
 Rural:    0   1:6213  
 Urban:25121   2:6229  
               3:6293  
               4:6386

sapply(Urban_Nepalese_Males, class)

    Male Province      Age   Height   Weight     Area   Income 
"factor" "factor" "factor" "factor" "factor" "factor" "factor"

I have been able to subset the data into two smaller data sets, one of Rural Nepalese Males and the other of Urban Nepalese Males. However, looking at the classes of these data, every column has been recorded as a ‘factor’. Obviously, not all of them are categorical. If I leave them as they are, it will be hard to compare the Nepalese males by height, because heights are definitely numeric data. So, I want the character variables (Male, Area, and Province) to be stored as character and the numeric variables to be converted to numeric data types.
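
To see why factors need care here, consider the small sketch below. The toy values are made up for illustration. Calling as.numeric() directly on a factor returns the internal level codes rather than the recorded heights, so the usual base R idiom goes through as.character() first.

```r
# Toy data frame whose columns are all factors
toy <- data.frame(
  Male   = factor(c("Yes", "No", "Yes")),
  Height = factor(c("180", "196", "178"))
)

# Pitfall: this returns level codes, not heights
codes <- as.numeric(toy$Height)
codes

# Convert via character to recover the actual values
toy$Height <- as.numeric(as.character(toy$Height))
toy$Male   <- as.character(toy$Male)

sapply(toy, class)
mean(toy$Height)  # arithmetic on heights is now meaningful
```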

Transforming Data Types for Easy Analysis

I am going to use the varhandle package which makes it really easy to do so.

library(varhandle)
Rural_Nepalese_Males <- unfactor(Rural_Nepalese_Males)
Urban_Nepalese_Males <- unfactor(Urban_Nepalese_Males)
sapply(Rural_Nepalese_Males, class)

       Male    Province         Age      Height      Weight        Area 
"character" "character"   "numeric"   "numeric"   "numeric" "character" 
     Income 
  "numeric"

sapply(Urban_Nepalese_Males, class)

       Male    Province         Age      Height      Weight        Area 
"character" "character"   "numeric"   "numeric"   "numeric" "character" 
     Income 
  "numeric"

Checking the transformed data types, in both data sets I have Male, Province, and Area as character, and Age, Height, Weight, and Income as numeric.

These data types exactly match what I want them to be. Now let’s visualize our data.

Creating a Simple Histogram Using the Height of the Rural Nepalese Males

library(ggplot2)
ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram()  # defaults to 30 bins when no binwidth is given

This histogram is kind of boring. Let’s make it a little more interesting by adding some colors.

ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram(fill="maroon", color="coral")+
  xlab("Height in Centimeters, Rural Nepalese Males")

Now, let’s change the bin width and see how the plots look.

Bin Width = 5

Rural_Nepalese_Males1 <- ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram(fill="maroon", color="coral", binwidth=5)+
  xlab("Bin Width = 5")+
  ylab("")  # the "+" attaches the blank y-axis label to the plot

Bin Width = 10

Rural_Nepalese_Males2 <- ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram(fill="maroon", color="coral", binwidth = 10)+
  xlab("Bin Width = 10")+
  ylab("")  # the "+" attaches the blank y-axis label to the plot

Bin Width = 20

Rural_Nepalese_Males3 <- ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram(fill="maroon", color="coral", binwidth = 20)+
  xlab("Bin Width = 20")+
  ylab("")  # the "+" attaches the blank y-axis label to the plot

Bin Width = 50

Rural_Nepalese_Males4 <- ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram(fill="maroon", color="coral", binwidth = 50)+
  xlab("Bin Width = 50")+
  ylab("")  # the "+" attaches the blank y-axis label to the plot

Putting all these Histograms for Comparison

library(cowplot)  # provides plot_grid()
plot_grid(Rural_Nepalese_Males1, Rural_Nepalese_Males2, Rural_Nepalese_Males3, Rural_Nepalese_Males4, 
          labels = "Height of Rural Nepalese Males",
          hjust = -1, vjust = 0.2)
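
To get a feel for what each bin width means, here is a rough arithmetic sketch: heights span 125 to 200 cm, so the number of bars is approximately the span divided by the bin width. The variable names are illustrative, and the exact number of bars ggplot2 draws can differ by one depending on where it places the bin boundaries.

```r
height_range <- c(125, 200)       # range used when generating Height
binwidths    <- c(5, 10, 20, 50)  # bin widths compared in the grid

span        <- diff(height_range)         # 75 cm
approx_bins <- ceiling(span / binwidths)  # approximate number of bars

data.frame(binwidth = binwidths, approx_bins = approx_bins)
```

With only about two bars at a bin width of 50, almost all detail about the shape of the distribution is lost, while a width of 5 leaves around fifteen bars.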

Plotting the Height of the Urban Nepalese Males

ggplot(Urban_Nepalese_Males, aes(Height))+
  geom_histogram(fill="magenta", color="lavender")+
  xlab("Height in Centimeters, Urban Nepalese Males")

Kernel Density Estimation

# Data and x aesthetic are set in ggplot() so both layers can use them;
# after_stat(density) replaces the deprecated ..density.. notation
ggplot(Urban_Nepalese_Males, aes(Height))+
  geom_histogram(aes(y = after_stat(density)), fill="white", color="darkred")+
  geom_density(kernel="gaussian")

ggplot(Rural_Nepalese_Males, aes(Height))+
  geom_histogram(aes(y = after_stat(density)), fill="yellow", color="green")+
  geom_density(kernel="gaussian")
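
Outside of ggplot2, a kernel density estimate can be computed with base R's density() function. The sketch below uses a freshly sampled heights vector (an assumption made for illustration, not the study's subset) and checks that the estimated curve integrates to roughly 1, as a density should.

```r
set.seed(1)  # arbitrary seed for reproducibility
heights <- sample(125:200, 5000, replace = TRUE)

# Gaussian kernel density estimate with the default bandwidth rule
kde <- density(heights, kernel = "gaussian")
kde$bw  # bandwidth selected by the default rule

# Riemann-sum check: the area under the estimated curve should be close to 1
area <- sum(kde$y) * diff(kde$x[1:2])
area
```

Because the heights were sampled uniformly, the estimated curve should look roughly flat between 125 and 200 and fall off at the edges.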

Using Frequency Polygons to Represent Histograms

ggplot()+
  geom_freqpoly(data=Urban_Nepalese_Males, aes(Height, after_stat(density)), color="darkred")+
  geom_freqpoly(data=Rural_Nepalese_Males, aes(Height, after_stat(density)), color="green")+
  xlab("Height in Centimeters")

Using Kernel Density

ggplot()+
  geom_density(data=Urban_Nepalese_Males, aes(Height), color="darkred")+
  geom_density(data=Rural_Nepalese_Males, aes(Height), color="green")+
  xlab("Height in Centimeters")

Using the Empirical Cumulative Distribution Function

ggplot()+
  stat_ecdf(data=Urban_Nepalese_Males, aes(Height), color="darkred")+
  stat_ecdf(data=Rural_Nepalese_Males, aes(Height), color="green")+
  xlab("Height in Centimeters")
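
The curves drawn by stat_ecdf can also be evaluated numerically with base R's ecdf(). In this minimal sketch the urban and rural vectors are freshly sampled stand-ins (assumptions for illustration), and F_urban and F_rural are hypothetical names. Because heights are drawn uniformly from the 76 integers between 125 and 200, about half of each group should be at or below 162 cm.

```r
set.seed(1)  # arbitrary seed for reproducibility
urban <- sample(125:200, 5000, replace = TRUE)
rural <- sample(125:200, 5000, replace = TRUE)

# ecdf() returns a function giving the share of values <= x
F_urban <- ecdf(urban)
F_rural <- ecdf(rural)

F_urban(c(150, 162, 175))
F_rural(c(150, 162, 175))

# Largest vertical gap between the two curves over the height range
max(abs(F_urban(125:200) - F_rural(125:200)))
```

A small maximum gap is what one would expect here, since both groups were generated by the same random process.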
