Typically, a scientist or anyone else who works with data ends up with several files holding different copies of the data and several more holding intermediate steps. If you work with data, there was probably a time when you had a single file named data, but now you probably have:
In addition, you will almost certainly have some files from intermediate data analysis steps, such as:
This is not good practice because it increases the probability of making mistakes and getting inconsistent results. At some point you will probably lose track of which analyses were done with which files, and hence you can get (and publish) inconsistent results because they were produced from different data. This problem becomes even more serious if you work with several colleagues and each of you has your own data files.
Try to have only one data file, preferably one that holds the raw data without any kind of processing. If your data come from different sources, you can have one data file for each source or, maybe better, a database file with a separate table for each data source. This could be the case, for instance, when you have one table for field data and one for laboratory data, or when you repeat an experiment under different conditions. If you like to use workbooks, you can have one file with a different sheet for each data source, but in that case do not try to put all the data together in a single sheet by copy and paste; it is very dangerous.
For a single source of data you must have one and only one data table. The table is the standard format for data analysis, and it is the kind of data structure that any statistical package likes. In a table you have:
If you have several tables for data coming from different sources, then each table must have an additional column that allows them to be linked.
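As a minimal sketch of this idea, the Python example below uses the standard `sqlite3` module to keep one table per data source and to link them through a shared column. The table names (`field`, `lab`), the column names (`plot_id`, `genotype`, `root_yield`, `dry_matter`), and all values are invented for illustration; they are assumptions, not part of any real data set.

```python
import sqlite3

# Hypothetical example: table names, column names, and values are
# invented for illustration only.
conn = sqlite3.connect(":memory:")  # pass a file path instead to keep a database file

# One table per data source; plot_id is the extra column that links them.
conn.execute(
    "CREATE TABLE field (plot_id INTEGER PRIMARY KEY, genotype TEXT, root_yield REAL)"
)
conn.execute(
    "CREATE TABLE lab (plot_id INTEGER PRIMARY KEY, dry_matter REAL)"
)
conn.executemany(
    "INSERT INTO field VALUES (?, ?, ?)",
    [(1, "G1", 12.5), (2, "G2", 9.8), (3, "G1", 11.1)],
)
conn.executemany(
    "INSERT INTO lab VALUES (?, ?)",
    [(1, 31.2), (2, 28.4)],
)

# Link the two sources through plot_id. A LEFT JOIN keeps field plots
# that have no laboratory measurement yet (dry_matter comes back as None).
rows = conn.execute(
    "SELECT f.plot_id, f.genotype, f.root_yield, l.dry_matter "
    "FROM field AS f LEFT JOIN lab AS l ON f.plot_id = l.plot_id "
    "ORDER BY f.plot_id"
).fetchall()

for row in rows:
    print(row)
```

Because the tables are joined by a query rather than by copy and paste, the original tables stay untouched and the link can be repeated at any time.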
For your data table it is good to use standard short labels because they make it easier to share data with colleagues. If you have not defined standard labels within your group, maybe this is the time to do it. It is good practice to keep a text file or a document with the standard labels and their full descriptions. For instance, in the sweetpotato breeders team we have:
Some recommendations for your labels:
For a very simple analysis you can use any menu-driven program (the kind of program where you click a lot with the mouse and do not need to write much), but for a more complicated analysis it is better to use a command-driven program such as R. What is a more complicated analysis? I think an analysis is complicated enough to call for a command-driven program if, while trying to do it with a menu-driven program, you need:
Why is it better to use a command-driven program for these situations?
If you work with a command-driven program and follow these recommendations, you should end up with only two files: one with the raw data and one with the code for processing and analysing the data. The key idea here is that anyone with these two files should be able to reproduce your analysis and therefore get exactly the same results you got, no matter how complicated the analysis is. This is the concept of reproducibility. Reproducibility is important because:
If you are still not convinced about the importance of reproducibility, then maybe you should see this video.
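The two-file workflow described above can be sketched in a few lines. The example below is written in Python, although the same structure applies to R or any other command-driven program. The file name `raw_data.csv`, the column names, and the values are hypothetical; the temporary file only stands in for a real raw data file so that the sketch runs on its own.

```python
import csv
import statistics
import tempfile
from pathlib import Path


def analyse(raw_path):
    """Read the raw data and return the mean root yield per genotype.

    The raw file is only read, never modified: every processing step
    lives in the code, not in edited copies of the data.
    """
    yields = {}
    with open(raw_path, newline="") as handle:
        for record in csv.DictReader(handle):
            yields.setdefault(record["genotype"], []).append(
                float(record["root_yield"])
            )
    return {genotype: statistics.mean(values)
            for genotype, values in sorted(yields.items())}


# Stand-in for the raw data file (hypothetical name and contents).
with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw_data.csv"
    raw_path.write_text("genotype,root_yield\nG1,12.0\nG2,9.0\nG1,10.0\n")

    # Running the same code on the same raw data twice gives the same result.
    first = analyse(raw_path)
    second = analyse(raw_path)

print(first)
print(first == second)
```

In real use the raw data file would already exist and the script would simply point to it, so sharing those two files is enough for a colleague to rerun the whole analysis.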