Please be aware that this is a fictional study, conducted to practice my big data analysis skills.
This project is divided into two phases: A. Fictional Data Generation, and B. Exploratory Data Analysis/Data Visualization
I will start off by creating 100,000 fictional data points about Nepalese people’s characteristics:
In this project, I am going to conduct some preliminary analyses. Imagine that I am interested in studying Nepal: I have downloaded census data from the Nepalese government’s online archive, and I want to familiarize myself with the data and decide on the direction I want to take. This kind of analysis also comes up at work. Suppose I am a research analyst and my boss wants some insight into the Nepalese data, for example because she wants to launch a product (let’s say a clothing line) in the Nepali market and wants to check whether it fits the average Nepalese customer, or because she wants to custom-design products for average people based on their height, and so on.
Male <- c("Yes", "No")
Province <- c("Province 1", "Province 2", "Province 3", "Province 4", "Province 5", "Province 6", "Province 7")
Age <- 25:80
Height <- 125:200
Weight <- 40:90
Area <- c("Urban","Rural")
Income <- 1:4
Here’s the structure of all the variables included in the study.
# Variable "Male"- Yes refers to Males, and No to Females
Male
[1] "Yes" "No"
# Variable "Province" - Province 1 through Province 7
Province
[1] "Province 1" "Province 2" "Province 3" "Province 4" "Province 5"
[6] "Province 6" "Province 7"
# Variable "Age"- Age of the Participants
Age
[1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
[26] 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74
[51] 75 76 77 78 79 80
# Variable "Height"- Height of the Participants in Centimeters
Height
[1] 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
[20] 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162
[39] 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181
[58] 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
# Variable "Weight"- Weight of the Participants in Kilograms
Weight
[1] 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
[26] 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89
[51] 90
# Variable "Area"- Area of living of the Participants
Area
[1] "Urban" "Rural"
# Variable "Income"- Income level of the Participants
Income
[1] 1 2 3 4
Now, I am going to create an empty matrix to hold the values for these variables. As mentioned earlier, I am going to use 100,000 data points, so I need 100,000 rows and 7 columns, one for each of the variables I created above. I don’t have any data to put into this table yet; I will fill it in later.
Nepal_Data <- matrix (nrow = 100000, ncol = 7, data = NA)
colnames(Nepal_Data) <- c("Male", "Province", "Age", "Height", "Weight", "Area", "Income")
Now that I have created an empty matrix, let’s check if it worked.
head(Nepal_Data)
Male Province Age Height Weight Area Income
[1,] NA NA NA NA NA NA NA
[2,] NA NA NA NA NA NA NA
[3,] NA NA NA NA NA NA NA
[4,] NA NA NA NA NA NA NA
[5,] NA NA NA NA NA NA NA
[6,] NA NA NA NA NA NA NA
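One property of R matrices is worth keeping in mind before filling this one: a matrix holds a single data type, so as soon as one character value is stored, every entry is kept as character. The following is a minimal sketch on a hypothetical 2 x 2 matrix (Toy_Matrix is not part of the study):

```r
# Toy_Matrix is a hypothetical example, not part of the study data
Toy_Matrix <- matrix(nrow = 2, ncol = 2, data = NA)
typeof(Toy_Matrix)          # "logical": an all-NA matrix starts out logical

Toy_Matrix[1, 1] <- 180     # storing a number makes the matrix numeric
typeof(Toy_Matrix)          # "double"

Toy_Matrix[1, 2] <- "Yes"   # storing a string coerces every entry to character
typeof(Toy_Matrix)          # "character"
Toy_Matrix[1, 1]            # "180", now a string rather than a number
```

This would explain why numeric values stored alongside text in one matrix are printed in quotes.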
The outline of the matrix has been created but it doesn’t have any data. I am going to create a ‘for loop’ that randomly selects values from the specified range for each of the variables I created and stores them in the matrix named Nepal_Data 100000 times.
Here’s how we do it:
for ( i in 1:100000){
Nepal_Data[i,1] <- sample(Male, size = 1)
Nepal_Data[i,2] <- sample(Province, size = 1)
Nepal_Data[i,3] <- sample(Age, size = 1)
Nepal_Data[i,4] <- sample(Height, size = 1)
Nepal_Data[i,5] <- sample(Weight, size = 1)
Nepal_Data[i,6] <- sample(Area, size = 1)
Nepal_Data[i,7] <- sample(Income, size = 1)
}
head(Nepal_Data)
Male Province Age Height Weight Area Income
[1,] "Yes" "Province 5" "66" "180" "48" "Rural" "3"
[2,] "No" "Province 6" "35" "196" "85" "Urban" "3"
[3,] "Yes" "Province 1" "69" "178" "64" "Urban" "3"
[4,] "Yes" "Province 5" "75" "160" "44" "Urban" "3"
[5,] "Yes" "Province 5" "58" "177" "49" "Urban" "2"
[6,] "Yes" "Province 2" "75" "193" "44" "Rural" "4"
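The loop above draws one value at a time. As an alternative sketch (not the approach used in this study), the same kind of data can be drawn in one call per variable with sample(..., replace = TRUE) and stored directly in a data frame, which keeps the numeric columns numeric. The seed value 2024 and the name Nepal_Sketch are assumptions chosen only for illustration:

```r
# Value ranges copied from the variables defined earlier
Male <- c("Yes", "No")
Province <- paste("Province", 1:7)
Age <- 25:80
Height <- 125:200
Weight <- 40:90
Area <- c("Urban", "Rural")
Income <- 1:4

n <- 100000
set.seed(2024)  # arbitrary seed so the draws can be reproduced

# One vectorized draw per column; replace = TRUE allows repeated values
Nepal_Sketch <- data.frame(
  Male     = sample(Male, n, replace = TRUE),
  Province = sample(Province, n, replace = TRUE),
  Age      = sample(Age, n, replace = TRUE),
  Height   = sample(Height, n, replace = TRUE),
  Weight   = sample(Weight, n, replace = TRUE),
  Area     = sample(Area, n, replace = TRUE),
  Income   = sample(Income, n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Column types: character for the categories, integer for the numbers
str(Nepal_Sketch)
```

Because of the different seed and sampling order, the values drawn this way will not match the rows shown above.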
It looks like the data have been generated and put into the Nepal_Data matrix. As the data have one hundred thousand rows, it is pointless to display all of them here. However, I have to make sure that all the data have been created, and I also want to see their structure, so I need some way to summarize the data.
For this, I am going to use the apply function together with table. Applied to each column, table lists the categories of that variable and the number of data points in each category.
There is no reason to summarize the data by rows, because each row mixes different types of data and there are 100,000 of them. So, we are going to summarize the data by columns (MARGIN = 2). Here’s how we do it:
apply(X = Nepal_Data, MARGIN = 2, FUN = table)
$Male
No Yes
49891 50109
$Province
Province 1 Province 2 Province 3 Province 4 Province 5 Province 6 Province 7
14423 14265 14287 14270 14283 14161 14311
$Age
25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
1833 1835 1834 1701 1730 1828 1717 1731 1883 1798 1795 1750 1740 1844 1801 1754
41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56
1806 1793 1769 1772 1780 1807 1852 1737 1766 1842 1774 1808 1740 1808 1803 1803
57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
1829 1782 1770 1886 1715 1767 1782 1724 1757 1746 1895 1740 1740 1743 1785 1818
73 74 75 76 77 78 79 80
1805 1828 1843 1725 1786 1755 1728 1817
$Height
125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
1305 1310 1279 1284 1308 1278 1349 1345 1382 1300 1338 1330 1318 1362 1347 1299
141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156
1264 1306 1376 1302 1278 1334 1269 1318 1314 1321 1333 1281 1329 1261 1335 1264
157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172
1364 1361 1310 1289 1260 1277 1305 1382 1293 1322 1305 1314 1303 1363 1375 1348
173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
1281 1322 1348 1299 1365 1307 1306 1312 1276 1348 1322 1281 1341 1308 1319 1351
189 190 191 192 193 194 195 196 197 198 199 200
1286 1298 1305 1297 1324 1313 1327 1308 1265 1294 1360 1347
$Weight
40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55
1917 1992 1967 1956 2004 1970 1980 1956 1938 2016 1909 1990 1935 1969 1951 2034
56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
1937 1977 1922 1880 2040 1994 1996 1944 1977 1896 1936 2002 1969 1959 1909 1983
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87
1967 1980 2007 1982 1893 2001 1927 1935 1938 2004 1951 2015 1987 1893 1982 1986
88 89 90
1918 1872 1957
$Area
Rural Urban
49939 50061
$Income
1 2 3 4
25003 24977 24771 25249
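Besides the frequency tables, a couple of quick checks can confirm that every cell was filled. The sketch below uses a small hypothetical matrix, Check_Matrix, so that it runs on its own; the same calls could be applied to Nepal_Data:

```r
# Check_Matrix is a hypothetical stand-in for Nepal_Data
set.seed(1)  # arbitrary seed for reproducibility
Check_Matrix <- cbind(
  Male = sample(c("Yes", "No"), 20, replace = TRUE),
  Area = sample(c("Urban", "Rural"), 20, replace = TRUE)
)

dim(Check_Matrix)      # number of rows and columns
anyNA(Check_Matrix)    # FALSE when no cell is left empty

# lapply() always returns a list, whereas apply() may simplify the result
# to a matrix when every column has the same number of categories
Column_Counts <- lapply(as.data.frame(Check_Matrix), table)

# The counts in each column should add up to the number of rows
Totals <- sapply(Column_Counts, sum)
all(Totals == nrow(Check_Matrix))
```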
Now, I am going to subset the data and dig into the sample data. For demonstration purposes, I am going to compare the heights of rural males with those of urban males.
It is hard to manipulate a matrix the way I want. Thus, my first job is to convert the matrix to a data.frame. Here’s how we do it.
Nepal_Table <- as.data.frame(Nepal_Data, stringsAsFactors = TRUE)
head(Nepal_Table)
Male Province Age Height Weight Area Income
1 Yes Province 5 66 180 48 Rural 3
2 No Province 6 35 196 85 Urban 3
3 Yes Province 1 69 178 64 Urban 3
4 Yes Province 5 75 160 44 Urban 3
5 Yes Province 5 58 177 49 Urban 2
6 Yes Province 2 75 193 44 Rural 4
# filter() comes from the dplyr package
library(dplyr)
Rural_Nepalese_Males <- filter(Nepal_Table, Area == "Rural", Male == "Yes")
Urban_Nepalese_Males <- filter(Nepal_Table, Area == "Urban", Male == "Yes")
summary(Rural_Nepalese_Males)
Male Province Age Height Weight
No : 0 Province 1:3566 67 : 498 171 : 373 48 : 539
Yes:24988 Province 2:3570 60 : 496 183 : 361 83 : 534
Province 3:3492 27 : 490 189 : 360 81 : 523
Province 4:3580 33 : 486 174 : 358 84 : 519
Province 5:3528 49 : 480 148 : 353 55 : 517
Province 6:3562 50 : 477 185 : 351 49 : 514
Province 7:3690 (Other):22061 (Other):22832 (Other):21842
Area Income
Rural:24988 1:6285
Urban: 0 2:6244
3:6126
4:6333
sapply(Rural_Nepalese_Males, class)
Male Province Age Height Weight Area Income
"factor" "factor" "factor" "factor" "factor" "factor" "factor"
summary(Urban_Nepalese_Males)
Male Province Age Height Weight
No : 0 Province 1:3662 26 : 514 158 : 392 60 : 554
Yes:25121 Province 2:3552 60 : 497 170 : 374 64 : 548
Province 3:3572 38 : 487 151 : 372 73 : 528
Province 4:3596 50 : 485 131 : 370 49 : 525
Province 5:3625 35 : 482 143 : 369 46 : 522
Province 6:3573 39 : 482 172 : 357 67 : 522
Province 7:3541 (Other):22174 (Other):22887 (Other):21922
Area Income
Rural: 0 1:6213
Urban:25121 2:6229
3:6293
4:6386
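In the two filter() calls above, the conditions separated by commas are combined with a logical AND. For readers without dplyr, a rough base R equivalent can be sketched as follows; Toy_Table is a hypothetical five-row data frame used only to keep the example self-contained:

```r
# Toy_Table is a hypothetical miniature of Nepal_Table
Toy_Table <- data.frame(
  Male   = c("Yes", "No", "Yes", "Yes", "No"),
  Area   = c("Rural", "Rural", "Urban", "Rural", "Urban"),
  Height = c(180, 196, 178, 160, 177)
)

# Logical indexing: keep the rows where both conditions hold
Toy_Rural_Males <- Toy_Table[Toy_Table$Area == "Rural" & Toy_Table$Male == "Yes", ]

# subset() expresses the same idea with column names used directly
Toy_Urban_Males <- subset(Toy_Table, Area == "Urban" & Male == "Yes")

nrow(Toy_Rural_Males)  # 2
nrow(Toy_Urban_Males)  # 1
```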
sapply(Urban_Nepalese_Males, class)
Male Province Age Height Weight Area Income
"factor" "factor" "factor" "factor" "factor" "factor" "factor"
I have been able to subset the data into two smaller data sets, one for Rural Nepalese Males and the other for Urban Nepalese Males. However, looking at the classes of these data, all of our columns have been recorded as ‘factors’. Obviously, not all of them are categorical. If we don’t do anything about this, it will be hard for me to compare the Nepalese Males based on their heights, because heights are definitely numeric data. What I want is for the character variables like Male, Area, and Province to remain character, and for the numeric variables to be converted to numeric data types.
I am going to use the varhandle package which makes it really easy to do so.
library(varhandle)
Rural_Nepalese_Males <- unfactor(Rural_Nepalese_Males)
Urban_Nepalese_Males <- unfactor(Urban_Nepalese_Males)
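For reference, this kind of conversion can also be sketched in base R without varhandle. The key point is that calling as.numeric() directly on a factor returns the internal level codes, not the original numbers, so the factor has to go through as.character() first. Toy_Height and Toy_Males are hypothetical objects used only for this illustration:

```r
# Toy_Height is a hypothetical factor holding numbers as labels
Toy_Height <- factor(c("180", "196", "160"))
as.numeric(Toy_Height)                 # 2 3 1: the level codes, not the heights
as.numeric(as.character(Toy_Height))   # 180 196 160: the actual values

# Toy_Males is a hypothetical all-factor data frame
Toy_Males <- data.frame(
  Area   = factor(c("Rural", "Rural", "Rural")),
  Height = Toy_Height
)

# Convert the numeric-looking columns via character, the rest to character
Numeric_Columns <- c("Height")
Toy_Males[Numeric_Columns] <- lapply(
  Toy_Males[Numeric_Columns],
  function(x) as.numeric(as.character(x))
)
Toy_Males$Area <- as.character(Toy_Males$Area)

sapply(Toy_Males, class)
```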
sapply(Rural_Nepalese_Males, class)
Male Province Age Height Weight Area
"character" "character" "numeric" "numeric" "numeric" "character"
Income
"numeric"
sapply(Urban_Nepalese_Males, class)
Male Province Age Height Weight Area
"character" "character" "numeric" "numeric" "numeric" "character"
Income
"numeric"
As I check the transformed data types, in both data sets, I have: a. Male as Character b. Province as Character c. Age as Numeric d. Height as Numeric e. Weight as Numeric f. Area as Character, and g. Income as Numeric
These data types exactly match what I want them to be. Now let’s visualize our data.
#### Creating a Simple Histogram Using the Height of the Rural Nepalese Males
# ggplot() and the geom_*() functions come from the ggplot2 package
library(ggplot2)
ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram()
This histogram is kind of boring. Let’s make it a little more interesting by adding some colors.
ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram(fill="maroon", color="coral")+
xlab("Height in Centimeters, Rural Nepalese Males")
Now, let’s change the bin widths and see how the plots look.
#### Bin width = 5
# ylab("") blanks the y-axis label; the "+" after xlab() attaches it to the plot
Rural_Nepalese_Males1 <- ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram(fill="maroon", color="coral", binwidth = 5)+
xlab("Bin Width = 5")+
ylab("")
Rural_Nepalese_Males2 <- ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram(fill="maroon", color="coral", binwidth = 10)+
xlab("Bin Width = 10")+
ylab("")
Rural_Nepalese_Males3 <- ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram(fill="maroon", color="coral", binwidth = 20)+
xlab("Bin Width = 20")+
ylab("")
Rural_Nepalese_Males4 <- ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram(fill="maroon", color="coral", binwidth = 50)+
xlab("Bin Width = 50")+
ylab("")
# plot_grid() comes from the cowplot package
library(cowplot)
plot_grid(Rural_Nepalese_Males1, Rural_Nepalese_Males2, Rural_Nepalese_Males3, Rural_Nepalese_Males4,
labels = "Height of Rural Nepalese Males",
hjust = -1, vjust = 0.2)
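To see numerically what the bin width does, the counts per bin can be computed without drawing anything. This is a sketch using base R’s hist() with plot = FALSE on simulated heights; Toy_Heights, the seed, and the helper Count_Bins are assumptions made for this illustration, and the bin edges need not match the ones ggplot2 picks:

```r
set.seed(42)  # arbitrary seed for reproducibility
# Hypothetical stand-in for the Height column
Toy_Heights <- sample(125:200, size = 1000, replace = TRUE)

# Hypothetical helper: counts per bin for a given bin width
Count_Bins <- function(x, width) {
  # Bin edges start at 125 and extend past the maximum height of 200
  edges <- seq(from = 125, to = 200 + width, by = width)
  hist(x, breaks = edges, right = FALSE, plot = FALSE)$counts
}

for (width in c(5, 10, 20, 50)) {
  counts <- Count_Bins(Toy_Heights, width)
  cat("Bin width", width, ":", length(counts), "bins,",
      sum(counts), "values in total\n")
}
```

Wider bins pool more values into each bar, so the histogram looks smoother but shows less detail, while the total number of values stays the same.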
ggplot(Urban_Nepalese_Males, aes(Height))+
geom_histogram(fill="magenta", color="lavender")+
xlab("Height in Centimeters, Urban Nepalese Males")
# Data and aes(Height) are set in ggplot() so that both layers use the data frame
# rather than the global Height vector; after_stat(density) replaces ..density..
ggplot(Urban_Nepalese_Males, aes(Height))+
geom_histogram(aes(y = after_stat(density)), fill="white" , color="darkred")+
geom_density(kernel="gaussian")
ggplot(Rural_Nepalese_Males, aes(Height))+
geom_histogram(aes(y = after_stat(density)), fill="yellow" , color="green")+
geom_density(kernel="gaussian")
# after_stat(density) is the current ggplot2 replacement for ..density..
ggplot()+
geom_freqpoly(data=Urban_Nepalese_Males, aes(Height , after_stat(density)), color="darkred")+
geom_freqpoly(data=Rural_Nepalese_Males, aes(Height , after_stat(density)), color="green")+
xlab("Height in Centimeters")
ggplot()+
geom_density(data=Urban_Nepalese_Males, aes(Height), color="darkred")+
geom_density(data=Rural_Nepalese_Males, aes(Height), color="green")+
xlab("Height in Centimeters")
ggplot()+
stat_ecdf(data=Urban_Nepalese_Males, aes(Height), color="darkred")+
stat_ecdf(data=Rural_Nepalese_Males, aes(Height), color="green")+
xlab("Height in Centimeters")
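The ECDF plot has a numeric counterpart in base R’s ecdf(), which returns a function giving the share of observations at or below a value. The sketch below uses simulated stand-ins (Toy_Rural, Toy_Urban) drawn from the same height range as the study data, with an arbitrary seed and an arbitrary cut-off of 160 cm:

```r
set.seed(7)  # arbitrary seed for reproducibility
# Hypothetical stand-ins for the Height columns of the two subsets
Toy_Rural <- sample(125:200, 5000, replace = TRUE)
Toy_Urban <- sample(125:200, 5000, replace = TRUE)

# ecdf() returns a step function: F(x) = share of values <= x
Rural_ECDF <- ecdf(Toy_Rural)
Urban_ECDF <- ecdf(Toy_Urban)

# Share of each group that is 160 cm or shorter
Rural_ECDF(160)
Urban_ECDF(160)

# Mean heights as a simple numeric comparison
c(Rural = mean(Toy_Rural), Urban = mean(Toy_Urban))
```

Because both groups here are drawn at random from the same range, the two curves should nearly coincide; with real data, a visible gap between them would point to a difference in heights.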