The Most Boring/Valuable Data Science Advice

Here's the thing

I'm going to make this quick. You do a carefully thought-through analysis. You present it to all the movers and shakers at your company. Everyone loves it. Six months later someone asks you a question you didn't cover, so you need to reproduce your analysis...

But you can't remember where the hell you saved the damn thing on your computer.

If you're a data scientist (especially the decision-sciences/analysis-focused kind) this has happened to you. A bunch. You might laugh it off, shrug, whatever, but it's a real problem because now you have to spend hours if not days recreating work you've already done. It's a waste of time and money.

Fix it.

I used to be this person too, so I get it. I decided to experiment with a new method that sounds so simplistic and stupid you'll think it won't work.

Try it.

Just. Try. It. It will change your life.

The Fix

I now keep all of my analyses in a single folder. I call mine "Research". Call yours whatever you want; it doesn't fucking matter. Next, any time I start ANY analysis (SQL query, Python notebook, Excel workbook, etc.) I create a folder in my research folder. The folder is named by the date and then a brief description of what it is about.
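
If you'd rather not even type the folder name, a few lines of Python can do it for you. Treat this as a minimal sketch, not a prescription: the `YYYY-MM-DD_description` format, the function name `new_analysis_folder`, and the `~/Research` location are assumptions made for illustration. The only actual rule is "date, then a brief description". Putting the year first just means an alphabetical listing is also a chronological one.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def new_analysis_folder(description: str, root: Path, on: Optional[date] = None) -> Path:
    """Create <root>/<YYYY-MM-DD>_<description> and return its path."""
    day = on or date.today()
    # Lowercase the description and collapse every run of
    # non-alphanumeric characters into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    folder = root / f"{day.isoformat()}_{slug}"
    # exist_ok=True: starting the "same" analysis twice in one day
    # simply returns the folder that is already there.
    folder.mkdir(parents=True, exist_ok=True)
    return folder
```

Something like `new_analysis_folder("churn by signup cohort", Path.home() / "Research")` would then give you a fresh folder to work out of.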


That's it. I have not lost a single analysis since I started doing this. I've been asked for analyses that are 6 months old and I can find them in <10 minutes. 

Once you have this folder structure you have to work directly out of the folder for all of your work. There can't be extra effort to get your work into this folder or you'll start losing research again.

But I won't remember what date I did an analysis!

You don't have to! But you will remember that you did it a few months ago. Or a week or two ago. If you remember nothing you can probably ask the person with the question when you first presented that and they'll give you a time range. Really the time range is there to eliminate 95% of the noise.
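
To make the "time range eliminates the noise" point concrete, here is a rough sketch of narrowing the research folder down to a window of dates. It assumes folder names start with an ISO date such as `2016-12-08_`, which is my illustrative convention rather than a requirement of the method, and `analyses_between` is a made-up helper name.

```python
from datetime import date
from pathlib import Path
from typing import List


def analyses_between(root: Path, start: date, end: date) -> List[Path]:
    """Return analysis folders under root whose date prefix is in [start, end]."""
    matches = []
    for folder in sorted(root.iterdir()):
        if not folder.is_dir():
            continue
        try:
            # Assumes names begin with an ISO date, e.g. "2016-12-08_blogpost".
            day = date.fromisoformat(folder.name[:10])
        except ValueError:
            continue  # not a dated folder, skip it
        if start <= day <= end:
            matches.append(folder)
    return matches
```

In practice, scrolling a date-sorted file browser does the same job. The point is only that the date prefix turns "a few months ago" into a short list.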

But it's hard to make sure the descriptions are clear/useful!

If you work on two analyses that you would describe basically the same way (for some weird reason) that's where the dates come in handy! It's hard to mix up what you did last week with what you did last month in this system. Worst case you find both and have to look at both. You still haven't lost anything but maybe it takes you the full ten minutes to figure out which is which.

But... but...!

If this method is even an option for you it's because you lose your research. Before you come up with more excuses just try this out for a couple weeks and see if it doesn't affect the quality of your life.

Tooling that has helped me

What's important is that this method is so light and simple, it works with almost any tool. If you use SQL as much as I do, I recommend getting a tool like PyCharm or DataGrip that allows you to organize your SQL files in custom folders like you would code in a standard IDE. I can't imagine this being easy if you have to keep copying and pasting SQL code from your research folder. 

Again, you have to set up your workflow so you can work entirely out of this one folder.

Keep that tenet in mind as you look at your tools and evaluate new tools. It's key.

Writing a Spark job? Do the same. Just live in this folder for your analysis.

You can combine this with software that does full text searches over folders as well to make finding things even easier.
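
If you don't have a search tool handy, a crude version is easy to write yourself. The sketch below is only an illustration: `search_research` is a hypothetical name, and the list of file extensions is an assumption you would adjust to whatever you actually keep in the folder.

```python
from pathlib import Path
from typing import Iterator, Tuple


def search_research(
    root: Path,
    needle: str,
    suffixes: Tuple[str, ...] = (".sql", ".py", ".md", ".txt"),
) -> Iterator[Tuple[Path, int, str]]:
    """Yield (file, line number, line) for each line containing needle, ignoring case."""
    needle = needle.lower()
    for path in sorted(root.rglob("*")):
        # Only look inside text-like files; the suffix list is an assumption.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                yield path, number, line.strip()
```

Because every hit sits inside a dated folder, the path alone usually tells you which analysis you've found.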


That's it. I know it's nothing flashy or exciting. No one who interviews you for a data science role will ever ask you about this either. The people who work with data scientists though will notice you have your shit together and while those around you can never keep track of their work, you're always able to pick up where you left off and get down to business immediately.

Just remember this format and try it.


If you've solved this problem another way, I'd love to hear your solution. For people who struggle with this I'd love to offer more than one approach. Share it with me via comments or Twitter or something. Thanks!
27 responses
Check it in to github or bitbucket or gitlab and you're really cooking with gas.
Thanks for that, Bill. Maybe you can help me with it. The one source of friction for me in the past has been password management. Especially when using new tools, it requires me to get my world in order before I can commit, and then I just never end up committing. What's a good pattern here?
Huh? Sure... But why not just remote repositories each with an excellent README, one so straightforward and thorough that your future self could never fail to understand?
@Justin not sure what passwords you're referring to, but for me:
* I use SSH access to github so don't have to remember a password
* Ditto for SSH tunnels that I need to access certain databases
* AWS credentials are saved in environment variables in my home directory
* DB passwords usually I just save in my git repo although I probably should manage these with environment variables too
Some good thoughts! Thanks for sharing. At the end of the day, I'm all about whatever is easiest. If you can find your research then your method gets an "A".
I use a similar approach. I call my base-level folder PROJECTS. I just create a second folder inside this projects folder and keep working in it for some tasks. I haven't used the idea of naming the folder with the date, though; I like it and will start using it. But the main idea of working from one folder is useful on so many levels. Thanks
Respectfully to all, I'm unsettled by this discussion. What's the defunct alternative to keeping your projects "organized?" Having folders and .csv files scattered on one's desktop? Also, database credentials in a remote repository? Really? Locally, I maintain environment variables in pre/post activate hooks of Python virtual environments. Typically, everything goes into its own repository and is persisted remotely. MOST importantly, a README is provided that makes it dead-simple to reacquaint myself with a given project 6 months down the line. e.g. 1. The purpose of this study is X. 2. To recreate your local database, run `Y`. To rebuild your models, do `Z`. Pro-level is Docker-izing everything to ensure the environment is completely immutable and persisted for as long as Docker Hub is still standing. Finally, there is the issue of those who do all projects in the same repository - perhaps to make easy use of reusable objects (i.e. a BaseModel, a BasePlotter, etc.) - without the external dependency. Here, environment management becomes trickier: what if one project uses pandas==1.15, and the other pandas==1.18? I haven't thought through a solution to this one as it's never been that pressing of an issue.
Hey William, thanks for joining the discussion! To your last point, something my team has started playing with is scripting the creation of a new environment for specific projects and then version controlling that script. Not everything in Python neatly stays within an environment but it's better than not doing it.
This sounds to me like a Dockerfile almost by definition. Would you agree?
I'd highly recommend following the data science cookie cutter model found here: I work with a team of six data scientists and engineers and it's been a real boon for our team's productivity. It has several features that address William's well-founded concerns, particularly about adding passwords and credentials to your repo. Instead, you add those things to a .env file which does not go into version control. The file stays local, and anything specified in it is added to your environment variables. As for dependencies, I recommend starting a clean virtual environment using Anaconda or PyCharm. Then you install the exact versions of your required packages and run `pip freeze > requirements.txt`, which records those versions in your requirements.txt file, ensuring future runs will use the correct dependencies. The link above is Python specific and follows a set structure, but you can easily fork the repo and tweak it to your specific needs.
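
For readers who haven't seen the .env pattern, here is a minimal standard-library sketch of the idea. It is an illustration only: `load_env_file` is a hypothetical helper, it handles nothing beyond plain `KEY=VALUE` lines, and a dedicated library such as python-dotenv is the more robust choice in real projects.

```python
import os
from pathlib import Path
from typing import Dict


def load_env_file(path: Path, override: bool = False) -> Dict[str, str]:
    """Read KEY=VALUE lines from path into os.environ and return what was parsed."""
    parsed = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop optional surrounding quotes
        parsed[key] = value
        # By default, variables already set in the shell win over the file.
        if override or key not in os.environ:
            os.environ[key] = value
    return parsed
```

The file itself stays out of version control (for git, by listing `.env` in `.gitignore`), so credentials never reach the remote repository.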
Ry- Thanks! That looks awesome. Will definitely try it out.
William, I definitely don't think it's a good idea to use a single repo; it's not really how git is supposed to work. I have addressed the issue of sharing components across projects by creating a separate private GitHub repo which is structured as a pip package, then installing it in the virtual environment from requirements.txt using the git+ssh protocol.
Max, This is ideal. Thanks for sharing.
How do you deal with the data itself? Do you version the data? If so, how? Do you symlink it into the directory? Do you copy it in? This is the problem that I always run into. At some point you are either replicating data everywhere (and with current SSD hard drive sizes being what they are on modern laptops you quickly run out of space), or you run into the problem that an analysis might change because the underlying data changes. If someone has figured out how to deal with this, I would be most grateful.
@Andrew, I like to use `git-lfs` to store a cached/versioned copy of the data as part of the repo; it fits in really easily with the rest of my git workflow.
I would add to everyone mentioning Github as well. When I first got started, I had one private repo and all of my analysis went into that one folder. That got really hard to manage with pull requests, etc. So I eventually moved every single piece of analysis I do into its own private repo, and I make good use of issues, pull requests and writing READMEs for each folder. I do a lot of work on my own, so much of it is very tedious, but it does help keep my sanity. Depending on the size of the data, I might do something a bit different. I use BigQuery a lot, so that negates having to store the data, but sometimes I will include a saved version of a CSV if I think the data might change at some point in the future. That's more of a gut call.
Like Max said above I have a Python package on its own git repo for tooling and utility functions that is installable via pip. With virtualenvs you can install an old version off of git, which is great because it means you're free to make changes.
For the data, I set up a permanent data repository (on S3, but it could be just a remote server) and wrote get/put functions in the above package. So you just have to know the path, e.g. 'proj1/this_dataset.csv'. Then you have something like this:
    put(dataframe, 'proj1/this_dataset.csv')  # last line in generator script
    dataframe = get('proj1/this_dataset.csv')  # first line in analysis script
I would suggest making 'put' unable to overwrite existing files - that way you guarantee data doesn't change, and if you want to re-generate you have to append a version number into the filename. You could automate this in the functions, but that's a lot of trouble, and I think it's better to be explicit in the code - that way you can grep through code to see which script generated the version you're after.
To deal with disk space, I make the get function write to one spot, like ~/.data/, and I know I can just blow that away to free up space because the data will be fetched if it's not present - so I treat it like a cache. You could also write a cron job to expire out old files if you want.
Configuring of server credentials could be handled via a tool like Ansible or Chef, which is handy when you have a largeish team and work across a lot of servers and have a lot of credentials to manage. Also, you can set these up to support key rotations and updates.
In terms of the code, we have a 'projects' repository for quick one-offs, and separate repositories for larger projects. The last piece of the puzzle for us is a knowledge base to publish notebooks and markdown files, so that your analysis is centrally viewable and accessible. With tags and search it is easier to find things. This is especially important for teams to make your work findable by others, or to make your work persist in the company after you've left. I've built something on top of Nikola for this that I hope to open source soon.
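
The put/get idea in the comment above can be sketched in a few lines. This is not the commenter's package, just an illustration under stated assumptions: a plain directory stands in for S3 or a remote server, files are copied rather than dataframes serialized, and the `DataStore` class name is hypothetical.

```python
import shutil
from pathlib import Path


class DataStore:
    """Write-once data store with a disposable local cache."""

    def __init__(self, remote: Path, cache: Path) -> None:
        self.remote = remote  # stand-in for S3 or a remote server (assumption)
        self.cache = cache    # e.g. Path.home() / ".data"; safe to delete

    def put(self, local_file: Path, key: str) -> None:
        """Publish local_file under key; refuse to overwrite an existing key."""
        target = self.remote / key
        if target.exists():
            # Data never changes in place; regenerate under a new name instead.
            raise FileExistsError(f"{key} already exists; add a version to the filename")
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(local_file, target)

    def get(self, key: str) -> Path:
        """Return a local path for key, fetching into the cache only if missing."""
        cached = self.cache / key
        if not cached.exists():
            cached.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(self.remote / key, cached)
        return cached
```

An analysis script would then open whatever `store.get('proj1/this_dataset.csv')` returns, for example with `pandas.read_csv`, and deleting the cache directory costs nothing but a re-fetch.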
Justin, if managing files worries you, check out the cloud offering by DataScience Inc.
Thanks for the post. This seems like a reasonable way to solve the problem. My current solution is to keep 1 notebook in emacs org-mode and write entries for each project while including links to the analysis folder and even resulting charts, tables alongside my commentary. I organize my notebook by date so I can filter to the approximate time and then search my free text for what I remember about an old analysis. This works for me most of the time.
Simple and efficient. For more complex applications of this naming scheme, have a look at this article in Data Science Journal:
For organizing project files and photos, I use a similar naming scheme, with a multi-level directory structure: /work/2016/1208/blogpost/. Normally I keep the folder name short and handy, and put in it an empty text file named with keywords for quick search with file-locating tools, such as: "/work/date/2016/1208/blogpost/project file quick location losing research.txt". Detailed descriptions can also be put inside this label text file.