Dataset Loading
Load the dataset, with the correct date format and the name of the columns, which is present in the CSV:
library(readr)
dataset <- read_csv("C:/Users/Leonel/Projects/dataset.csv", col_types = cols(fecha = col_date(format = "%Y%m%d")))
Exploratory Analysis
As usual, the first thing is to see what the data looks like and it is correctly handled:
head(dataset)
Now, I need to know the * dimensions * of the dataset:
dim(dataset)
[1] 1000001 4
We see that the dataset has a million rows and 4 columns.
I want to see a fast summary of the data:
summary(dataset)
X1 fecha id txn
Min. : 0 Min. :2016-01-04 Min. : 6 Min. :-1328767.1
1st Qu.: 250000 1st Qu.:2016-01-04 1st Qu.: 432193 1st Qu.: -100.0
Median : 500000 Median :2016-01-08 Median :65048457 Median : -24.0
Mean : 500000 Mean :2016-03-23 Mean :54800228 Mean : -97.6
3rd Qu.: 750000 3rd Qu.:2016-05-03 3rd Qu.:98663337 3rd Qu.: -2.9
Max. :1000000 Max. :2016-12-06 Max. :98953092 Max. : 450000.0
NA's :396151 NA's :396151
It can be seen:
- The first thing that can be observed is that the identifier column is indexed to 0. This can be an issue if we want to insert that value as a PK in some DB handlers.
- I see that the range of values for the ID column is quite high, to be a column of users. To study values, it may be convenient to leave out this value.
date and txn have 396151 NA’s or missing data. Inspecting quickly, I can see that when one is missing, the other too. Very convenient.
Now, I want a list of rows with missing values:
dataset[!complete.cases(dataset),]
I see that there are more than 390,000 rows without data. Approximately 40% of the data.
Cleaning and Imputation
At this point, it is normal to decide what to do with the missing data. On normal conditions, one would check with the source of the data about the missing rows, if it is possible and the implications. However, for this challenge that’s not a possibility.
With that in mind,the first impulse would be to discard the missing data, especially because the rows that lack information have all 2 missing observations. However, this matter would naturally awaken the curiosity of any scientist. Imputing quantitative observations (txn) would not represent a greater challenge. However, for the missing dates, it may not be trivial. My first impulse is to find out if the dates are ordered logically.
I’m going to extract the month:
library(dplyr)
library(lubridate)
dataset2 <- mutate(dataset, month = month(dataset$fecha, label = TRUE))
library(ggplot2)
ggplot(dataset2,aes(x=month,fill=month)) + geom_bar(width = 0.6) + theme(legend.position = "none") + labs(title="Data distribution by month")

ggplot(dataset2,aes(x=fecha,fill=fecha)) + geom_bar(width = 0.6) + theme(legend.position = "none") + labs(title="Data distribution by day")

I worry that the data does not seem to have an equal distribution. There seems to be much more data in January than in May and December together. In addition, the only 3 months that have data, seem to be randomly distributed. (From January to May there are 4 months and from May to December, there are 7) The distribution per day makes inequality in the distribution of data much more evident.
In conclusion, I do not feel that it has great value for the analysis of data, to make imputations to the date and to the amount of transactions for the users. What I consider valuable, is to take into account the records with missing information, since only the user who realizes the transaction and suna valuable information, in case you want to summarize the amount of missing transactions for a user, or the number of transactions totals in a given month. The way to insert the data in a database will depend on the objective of the analysis, of course. In addition, the management information system that will process the data. I recommend placing an invalid date and zero (0) as the value of the transaction. (This is because being transactions with income and expenses, zero is a value that will not affect the balance calculations).
Although the exercise does not specify the level of confidence that you have in this dataset, due to the missing date and txn data, it is logical to be a bit skeptical. I consider it imperative to see the distribution in the amount of transactions:
boxplot(dataset$txn)

The result makes sense for banking transactions. Although there is someone who made a retreat a little atypical !. -1328767.1 Maybe it’s worth reviewing … But it’s not the goal of this exercise.
Conclusions
The data is somewhat noisy and scattered. It is to be expected in a test dataset. However, performing this exploratory analysis has allowed us to obtain valuable information for the creation of a compatible data model in a Backend system, as required by the exercise. Things to take into account:
- The identifier of the rows is indexed to zero.
- The date may require parsing before being inserted into a handler.
- The UUID must be an integer.
- The TXN must be a signed decimal.
- Records with a missing date should be inserted, but considered in the results.
- The records with missing date must be shown to the end user in some way, without affecting aggregates or accounts.
LS0tDQp0aXRsZTogIkJhbmlzdG1vIENoYWxsZW5nZSBFeHBsb3JhdG9yeSBBbmFseXNpcyINCm91dHB1dDogaHRtbF9ub3RlYm9vaw0KLS0tDQoNCiMjIyBEYXRhc2V0IExvYWRpbmcNCkxvYWQgdGhlIGRhdGFzZXQsIHdpdGggdGhlIGNvcnJlY3QgZGF0ZSBmb3JtYXQgYW5kIHRoZSBuYW1lIG9mIHRoZSBjb2x1bW5zLCB3aGljaCBpcyBwcmVzZW50IGluIHRoZSBDU1Y6DQpgYGB7cn0NCmxpYnJhcnkocmVhZHIpDQpkYXRhc2V0IDwtIHJlYWRfY3N2KCJDOi9Vc2Vycy9MZW9uZWwvUHJvamVjdHMvZGF0YXNldC5jc3YiLCBjb2xfdHlwZXMgPSBjb2xzKGZlY2hhID0gY29sX2RhdGUoZm9ybWF0ID0gIiVZJW0lZCIpKSkNCmBgYA0KDQojIyMgRXhwbG9yYXRvcnkgQW5hbHlzaXMNCkFzIHVzdWFsLCB0aGUgZmlyc3QgdGhpbmcgaXMgdG8gc2VlIHdoYXQgdGhlIGRhdGEgbG9va3MgbGlrZSBhbmQgaXQgaXMgY29ycmVjdGx5IGhhbmRsZWQ6DQpgYGB7cn0NCmhlYWQoZGF0YXNldCkNCmBgYA0KDQpOb3csIEkgbmVlZCB0byBrbm93IHRoZSAqIGRpbWVuc2lvbnMgKiBvZiB0aGUgZGF0YXNldDoNCmBgYHtyfQ0KZGltKGRhdGFzZXQpDQpgYGANCldlIHNlZSB0aGF0IHRoZSBkYXRhc2V0IGhhcyBhIG1pbGxpb24gcm93cyBhbmQgNCBjb2x1bW5zLg0KDQoNCkkgd2FudCB0byBzZWUgYSBmYXN0ICpzdW1tYXJ5KiBvZiB0aGUgZGF0YToNCmBgYHtyfQ0Kc3VtbWFyeShkYXRhc2V0KQ0KYGBgDQpJdCBjYW4gYmUgc2VlbjoNCg0KMS4gVGhlIGZpcnN0IHRoaW5nIHRoYXQgY2FuIGJlIG9ic2VydmVkIGlzIHRoYXQgdGhlIGlkZW50aWZpZXIgY29sdW1uIGlzIGluZGV4ZWQgdG8gMC4gVGhpcyBjYW4gYmUgYW4gaXNzdWUgaWYgd2Ugd2FudCB0byBpbnNlcnQgdGhhdCB2YWx1ZSBhcyBhIFBLIGluIHNvbWUgREIgaGFuZGxlcnMuDQoyLiBJIHNlZSB0aGF0IHRoZSByYW5nZSBvZiB2YWx1ZXMgZm9yIHRoZSBJRCBjb2x1bW4gaXMgcXVpdGUgaGlnaCwgdG8gYmUgYSBjb2x1bW4gb2YgdXNlcnMuIFRvIHN0dWR5IHZhbHVlcywgaXQgbWF5IGJlIGNvbnZlbmllbnQgdG8gbGVhdmUgb3V0IHRoaXMgdmFsdWUuDQozLiBgZGF0ZWAgYW5kIGB0eG5gIGhhdmUgYDM5NjE1MWAgTkEncyBvciBtaXNzaW5nIGRhdGEuIEluc3BlY3RpbmcgcXVpY2tseSwgSSBjYW4gc2VlIHRoYXQgd2hlbiBvbmUgaXMgbWlzc2luZywgdGhlIG90aGVyIHRvby4gVmVyeSBjb252ZW5pZW50Lg0KDQpOb3csIEkgd2FudCBhIGxpc3Qgb2Ygcm93cyB3aXRoICptaXNzaW5nIHZhbHVlcyo6DQpgYGB7cn0NCmRhdGFzZXRbIWNvbXBsZXRlLmNhc2VzKGRhdGFzZXQpLF0NCmBgYA0KSSBzZWUgdGhhdCB0aGVyZSBhcmUgbW9yZSB0aGFuIGAzOTAsMDAwYCByb3dzIHdpdGhvdXQgZGF0YS4gQXBwcm94aW1hdGVseSA0MCUgb2YgdGhlIGRhdGEuDQoNCiMjIyBDbGVhbmluZyBhbmQgSW1wdXRhdGlvbg0KQXQgdGhpcyBwb2ludCwgaXQgaXMgbm9ybWFsIHRvIGRlY2lkZSB3aGF0IHRvIGRvIHdpdGggdGhlIG1pc3NpbmcgZGF0YS4gT24gbm9ybWFsIGNvbmRpdGlvbnMsIG9uZSB3b3VsZCBjaGVjayB3aXRoIHRoZSBzb3VyY2Ugb2YgdGhlIGRhdGEgYWJvdXQgdGhlIG1pc3Npbmcgcm93cywgaWYgaXQgaXMgcG9zc2libGUgYW5kIHRoZSBpbXBsaWNhdGlvbnMuIEhvd2V2ZXIsIGZvciB0aGlzIGNoYWxsZW5nZSB0aGF0J3Mgbm90IGEgcG9zc2liaWxpdHkuDQoNCldpdGggdGhhdCBpbiBtaW5kLHRoZSBmaXJzdCBpbXB1bHNlIHdvdWxkIGJlIHRvIGRpc2NhcmQgdGhlIG1pc3NpbmcgZGF0YSwgZXNwZWNpYWxseSBiZWNhdXNlIHRoZSByb3dzIHRoYXQgbGFjayBpbmZvcm1hdGlvbiBoYXZlIGFsbCAyIG1pc3Npbmcgb2JzZXJ2YXRpb25zLg0KSG93ZXZlciwgdGhpcyBtYXR0ZXIgd291bGQgbmF0dXJhbGx5IGF3YWtlbiB0aGUgY3VyaW9zaXR5IG9mIGFueSBzY2llbnRpc3QuIEltcHV0aW5nIHF1YW50aXRhdGl2ZSBvYnNlcnZhdGlvbnMgKHR4bikgd291bGQgbm90IHJlcHJlc2VudCBhIGdyZWF0ZXIgY2hhbGxlbmdlLiBIb3dldmVyLCBmb3IgdGhlIG1pc3NpbmcgZGF0ZXMsIGl0IG1heSBub3QgYmUgdHJpdmlhbC4NCk15IGZpcnN0IGltcHVsc2UgaXMgdG8gZmluZCBvdXQgaWYgdGhlIGRhdGVzIGFyZSBvcmRlcmVkIGxvZ2ljYWxseS4NCg0KSSdtIGdvaW5nIHRvIGV4dHJhY3QgdGhlIG1vbnRoOg0KYGBge3J9DQpsaWJyYXJ5KGRwbHlyKQ0KbGlicmFyeShsdWJyaWRhdGUpDQpkYXRhc2V0MiA8LSBtdXRhdGUoZGF0YXNldCwgbW9udGggPSBtb250aChkYXRhc2V0JGZlY2hhLCBsYWJlbCA9IFRSVUUpKQ0KbGlicmFyeShnZ3Bsb3QyKQ0KZ2dwbG90KGRhdGFzZXQyLGFlcyh4PW1vbnRoLGZpbGw9bW9udGgpKSArIGdlb21fYmFyKHdpZHRoID0gMC42KSArIHRoZW1lKGxlZ2VuZC5wb3NpdGlvbiA9ICJub25lIikgKyBsYWJzKHRpdGxlPSJEYXRhIGRpc3RyaWJ1dGlvbiBieSBtb250aCIpDQpnZ3Bsb3QoZGF0YXNldDIsYWVzKHg9ZmVjaGEsZmlsbD1mZWNoYSkpICsgZ2VvbV9iYXIod2lkdGggPSAwLjYpICsgdGhlbWUobGVnZW5kLnBvc2l0aW9uID0gIm5vbmUiKSArIGxhYnModGl0bGU9IkRhdGEgZGlzdHJpYnV0aW9uIGJ5IGRheSIpDQpgYGANCkkgd29ycnkgdGhhdCB0aGUgZGF0YSBkb2VzIG5vdCBzZWVtIHRvIGhhdmUgYW4gZXF1YWwgZGlzdHJpYnV0aW9uLiBUaGVyZSBzZWVtcyB0byBiZSBtdWNoIG1vcmUgZGF0YSBpbiBKYW51YXJ5IHRoYW4gaW4gTWF5IGFuZCBEZWNlbWJlciB0b2dldGhlci4NCkluIGFkZGl0aW9uLCB0aGUgb25seSAzIG1vbnRocyB0aGF0IGhhdmUgZGF0YSwgc2VlbSB0byBiZSByYW5kb21seSBkaXN0cmlidXRlZC4gKEZyb20gSmFudWFyeSB0byBNYXkgdGhlcmUgYXJlIDQgbW9udGhzIGFuZCBmcm9tIE1heSB0byBEZWNlbWJlciwgdGhlcmUgYXJlIDcpDQpUaGUgZGlzdHJpYnV0aW9uIHBlciBkYXkgbWFrZXMgaW5lcXVhbGl0eSBpbiB0aGUgZGlzdHJpYnV0aW9uIG9mIGRhdGEgbXVjaCBtb3JlIGV2aWRlbnQuDQoNCkluIGNvbmNsdXNpb24sIEkgZG8gbm90IGZlZWwgdGhhdCBpdCBoYXMgZ3JlYXQgdmFsdWUgZm9yIHRoZSBhbmFseXNpcyBvZiBkYXRhLCB0byBtYWtlIGltcHV0YXRpb25zIHRvIHRoZSBkYXRlIGFuZCB0byB0aGUgYW1vdW50IG9mIHRyYW5zYWN0aW9ucyBmb3IgdGhlIHVzZXJzLg0KV2hhdCBJIGNvbnNpZGVyIHZhbHVhYmxlLCBpcyB0byB0YWtlIGludG8gYWNjb3VudCB0aGUgcmVjb3JkcyB3aXRoIG1pc3NpbmcgaW5mb3JtYXRpb24sIHNpbmNlIG9ubHkgdGhlIHVzZXIgd2hvIHJlYWxpemVzIHRoZSB0cmFuc2FjdGlvbiBhbmQgc3VuYSB2YWx1YWJsZSBpbmZvcm1hdGlvbiwgaW4gY2FzZSB5b3Ugd2FudCB0byBzdW1tYXJpemUgdGhlIGFtb3VudCBvZiBtaXNzaW5nIHRyYW5zYWN0aW9ucyBmb3IgYSB1c2VyLCBvciB0aGUgbnVtYmVyIG9mIHRyYW5zYWN0aW9ucyB0b3RhbHMgaW4gYSBnaXZlbiBtb250aC4NClRoZSB3YXkgdG8gaW5zZXJ0IHRoZSBkYXRhIGluIGEgZGF0YWJhc2Ugd2lsbCBkZXBlbmQgb24gdGhlIG9iamVjdGl2ZSBvZiB0aGUgYW5hbHlzaXMsIG9mIGNvdXJzZS4gSW4gYWRkaXRpb24sIHRoZSBtYW5hZ2VtZW50IGluZm9ybWF0aW9uIHN5c3RlbSB0aGF0IHdpbGwgcHJvY2VzcyB0aGUgZGF0YS4gSSByZWNvbW1lbmQgcGxhY2luZyBhbiBpbnZhbGlkIGRhdGUgYW5kIHplcm8gKDApIGFzIHRoZSB2YWx1ZSBvZiB0aGUgdHJhbnNhY3Rpb24uIChUaGlzIGlzIGJlY2F1c2UgYmVpbmcgdHJhbnNhY3Rpb25zIHdpdGggaW5jb21lIGFuZCBleHBlbnNlcywgemVybyBpcyBhIHZhbHVlIHRoYXQgd2lsbCBub3QgYWZmZWN0IHRoZSBiYWxhbmNlIGNhbGN1bGF0aW9ucykuDQoNCkFsdGhvdWdoIHRoZSBleGVyY2lzZSBkb2VzIG5vdCBzcGVjaWZ5IHRoZSBsZXZlbCBvZiBjb25maWRlbmNlIHRoYXQgeW91IGhhdmUgaW4gdGhpcyBkYXRhc2V0LCBkdWUgdG8gdGhlIG1pc3NpbmcgYGRhdGVgICBhbmQgYHR4bmAgZGF0YSwgaXQgaXMgbG9naWNhbCB0byBiZSBhIGJpdCBza2VwdGljYWwuDQpJIGNvbnNpZGVyIGl0IGltcGVyYXRpdmUgdG8gc2VlIHRoZSAqZGlzdHJpYnV0aW9uICogaW4gdGhlIGFtb3VudCBvZiB0cmFuc2FjdGlvbnM6DQpgYGB7cn0NCmJveHBsb3QoZGF0YXNldCR0eG4pDQpgYGANClRoZSByZXN1bHQgbWFrZXMgc2Vuc2UgZm9yIGJhbmtpbmcgdHJhbnNhY3Rpb25zLiBBbHRob3VnaCB0aGVyZSBpcyBzb21lb25lIHdobyBtYWRlIGEgcmV0cmVhdCBhIGxpdHRsZSBhdHlwaWNhbCAhLiBgLTEzMjg3NjcuMWAgTWF5YmUgaXQncyB3b3J0aCByZXZpZXdpbmcgLi4uIEJ1dCBpdCdzIG5vdCB0aGUgZ29hbCBvZiB0aGlzIGV4ZXJjaXNlLg0KDQojIyMgQ29uY2x1c2lvbnMNClRoZSBkYXRhIGlzIHNvbWV3aGF0IG5vaXN5IGFuZCBzY2F0dGVyZWQuIEl0IGlzIHRvIGJlIGV4cGVjdGVkIGluIGEgdGVzdCBkYXRhc2V0LiBIb3dldmVyLCBwZXJmb3JtaW5nIHRoaXMgZXhwbG9yYXRvcnkgYW5hbHlzaXMgaGFzIGFsbG93ZWQgdXMgdG8gb2J0YWluIHZhbHVhYmxlIGluZm9ybWF0aW9uIGZvciB0aGUgY3JlYXRpb24gb2YgYSBjb21wYXRpYmxlIGRhdGEgbW9kZWwgaW4gYSBCYWNrZW5kIHN5c3RlbSwgYXMgcmVxdWlyZWQgYnkgdGhlIGV4ZXJjaXNlLiBUaGluZ3MgdG8gdGFrZSBpbnRvIGFjY291bnQ6DQoNCiogVGhlIGlkZW50aWZpZXIgb2YgdGhlIHJvd3MgaXMgaW5kZXhlZCB0byB6ZXJvLg0KKiBUaGUgZGF0ZSBtYXkgcmVxdWlyZSBwYXJzaW5nIGJlZm9yZSBiZWluZyBpbnNlcnRlZCBpbnRvIGEgaGFuZGxlci4NCiogVGhlIFVVSUQgbXVzdCBiZSBhbiBpbnRlZ2VyLg0KKiBUaGUgVFhOIG11c3QgYmUgYSBzaWduZWQgZGVjaW1hbC4NCiogUmVjb3JkcyB3aXRoIGEgbWlzc2luZyBkYXRlIHNob3VsZCBiZSBpbnNlcnRlZCwgYnV0IGNvbnNpZGVyZWQgaW4gdGhlIHJlc3VsdHMuDQoqIFRoZSByZWNvcmRzIHdpdGggbWlzc2luZyBkYXRlIG11c3QgYmUgc2hvd24gdG8gdGhlIGVuZCB1c2VyIGluIHNvbWUgd2F5LCB3aXRob3V0IGFmZmVjdGluZyBhZ2dyZWdhdGVzIG9yIGFjY291bnRzLg0K