Jason's Dissertation System
The notes for possibly my last lab meeting...
- Science should be more reproducible than it is
- Software can carry much of the load
- Document profusely as you go
- Paper is over
Science should be more reproducible than it is
If we took all of the research papers that have ever been written and attempted to reproduce the work therein, my guess is that we could easily reproduce only a very small percentage. One reason is that the details are often left out. The years of toil that built up to the results are often lost to the researcher's notebook, never to see the light of day. Even though the big picture is present in papers, it would take us a long time to get the same results...if we could at all. The second is that every piece of work has mistakes, not to mention bad assumptions and gaps in knowledge. Even so, we forge forward in science and are able to "stand on the shoulders of giants". Enough detail and information is generally released for progress to continue. But what if we had access to more detail than is shared now? Would we be lost in the sea of information? Or would we forge tools to sift through it all and make greater sense of it? Would we progress more rapidly, and with better quality?
There's also the fact that people have to make mistakes to learn. If we were able to structure the world such that no experiment needed to be reproduced by another, and each new experiment was a forward gain in science, we might end up with too many people not understanding the foundations. This would lead to foundations with little support that are susceptible to failing. But there is also the "redeeming feature of life that we are able to use many things without understanding every detail of them" [Ljung1999]. This is only true if the foundations are strong. We ignore the detail in previous methods because we believe they are strong. We need a balance between people who worry about the strength of the foundations and the folks who build new foundations on the old. The more the foundations and layers can be reinforced, the greater our confidence in their truth becomes. And thus every structure built on those foundations can follow suit.
I really love finding thick, long, old dissertations and tech reports with all the gory details of projects. I just get a kick out of seeing how things were done and what the failures were. We need to publish the scientific failures just as much as the scientific successes. I believe the gain from both can be just as powerful. If I want to use someone else's technique, the more information they provide the better. But I guess there is a contrary argument too, about creativity. If all of the details of someone's work are shared, it may affect the direction you go in. It may cause you to believe in only a narrow path to your goal. You may be less willing to try a radical idea after learning about the failures of others, not wanting to follow in their footsteps. I think the benefits from great sharing will outweigh those occasions, but I may be wrong.
Software can carry much of the load
Computers and software are unbelievably powerful and they will continue to grow in power rapidly. Computers can take care of the drudge work and the tedious tasks that are inherently prone to mistakes. They can free us and let us move on to the more important issues. Computers can also help us see things that our current mediums cannot.
What if you could download all of the data and accompanying software for every research paper, thesis or dissertation? And, with one tap on the screen, reproduce all of the material in the paper? What if you could then interact with the data and try out different ideas with the tools? There are some philosophies about a "complete" research paper. Such a paper is way more than the printed medium of text: it contains everything needed to work with the results and to expand on them. The Wolfram Mathematica Computable Document Format has some interesting elements that add interactive parts to a traditional paper. Apple's new iBook is also going to be a game changer. Sweave is another interesting model, where you can embed all of the code that generates your figures into your document.
In the past couple of years I've learned a great deal about software. The main thing I've learned is that engineers don't have a clue about it. We get the short end of the stick when it comes to learning how to program. The computer scientists are light years ahead of us, not to mention the self-taught hackers that make our computer world come to life. I'd highly recommend learning to program if you are a scientist. It will provide you with an unbelievable amount of power that you never knew existed. And be sure to learn how to program from people that get it (most engineers don't). A good place to start is somewhere like Software Carpentry.
I started late in the game, but have attempted to write my dissertation in such a way that the science is more reproducible than usual. I'm working by these ideas:
- The content should be written in a presentation-neutral way.
- The primary presentation view is through a web browser, but a static PDF version is also available to suit UCD's archaic submission rules.
- The source code for all the figures, animations, and interactive bits should be included with the dissertation.
- The experimentally collected data should all be available for download and use by others.
- Software tools should be developed if at all possible, instead of disconnected scripts.
My software stack
I've developed several pieces of prototype software that provide me with the tools to quickly analyze models and experimental data and to create graphs. I wish I had realized the utility of building software libraries early on. If I had, I would now have more polished software that would be less of a prototype. Secondly, these software tools are almost inevitably useful to other researchers (especially folks in your lab!). Be sure to collaborate on these things with your peers. Each time you decide to write some script, think of Dijkstra's "two or more, use a for". This is a rule of thumb for loops, but I think it embodies a bigger picture. Any time you have to repeat something, even if it is only once, think about turning it into a function or collection of functions. Think about reuse, not just by yourself but by others. Now that I have decently documented tool libraries that I wrote for others to understand, when I go back to them, I can also understand them! Here are descriptions of some of the tools. All can be found on my GitHub account.
- DynamicistToolKit (Python)
- A clearing house for all really generic functions and classes that I write that may be useful across all the work I do.
- AutolevToolKit (Python)
- A collection of tools that parse Autolev output to extract the equations of motion and to convert equations to LaTeX. It has a prototype of a numerical dynamic system class with an accompanying linear dynamic system class to make basic analysis quick and painless.
- Yeadon (Python)
- A program that computes the inertia of a human.
- BicycleParameters (Python)
- A program to manipulate the physical parameters of a bicycle and to do basic analysis with some widely used models.
- BicycleDAQ (Matlab)
- A GUI tool that collects time series and metadata from the instrumented bicycle. It also collects calibration data for the various sensors.
- BicycleDataProcessor (Python with PyTables, SciPy, and matplotlib)
- A tool that stores all of the data collected from the instrumented bicycle in a database for easy retrieval and manipulation. It also processes the raw data into the variables of interest, so you can directly compare it with models.
- BicycleSystemID (Matlab & Python)
- A set of tools for interacting with the Matlab System ID toolbox. It has functions built around the grey and black box identification of several bicycle, rider and control models.
- DelftBicycleDataViewer (Matlab)
- A prototype video and data viewer for the first instrumented bicycle I worked on.
- HumanControl (Matlab)
- An implementation of our bicycle human control model. It autocomputes the controller parameters for any bicycle and any speed, simulates the model during lane changes, and computes a handling quality metric. It includes a script that draws almost every figure in the accompanying paper.
- MotionCapture (Python & Matlab)
- A Matlab GUI tool for interactively exploring the data from the bicycle motion capture experiments and Python tools for basic statistics.
- dissertation (Sphinx [Python])
- A Sphinx document that contains the content of my dissertation and scripts to generate the majority of the figures. My notes on why I chose Sphinx.
- A popular mode-based text editor.
- Manages my references, reviews and citations.
- Everything is versioned.
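To make the "two or more, use a for" rule of thumb from above a bit more concrete, here is a minimal sketch in Python. It is not taken from any of the tools listed; the function name `summarize_run`, the run names, and the numbers are all made up for illustration. The idea is just that the repeated step lives in one function and the repetition lives in one loop.

```python
from statistics import mean


def summarize_run(samples):
    """Return (minimum, mean, maximum) for one run's samples."""
    return min(samples), mean(samples), max(samples)


# Hypothetical speed measurements (m/s) from three runs; made-up numbers.
runs = {
    "run_a": [2.0, 4.0, 6.0],
    "run_b": [1.0, 3.0],
    "run_c": [5.0],
}

# "Two or more, use a for": one function and one loop instead of three
# copy-pasted blocks that differ only in the variable name.
summaries = {}
for name, samples in runs.items():
    summaries[name] = summarize_run(samples)

for name, (low, avg, high) in summaries.items():
    print(f"{name}: min={low} mean={avg} max={high}")
```

Once the step is a function, it can move out of the script and into a shared library where labmates can import it, which is roughly how a pile of scripts turns into a tool.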
Document profusely as you go
It's easy to get caught up in doing, doing, doing before you ever write anything down. But writing at least forces you to explain your ideas to yourself, not to mention others. It's so easy to start a blog that there is no reason you shouldn't. Write for at least 30 minutes every single day explaining what you did. It doesn't need to be understandable to anyone but yourself. Don't worry about fancy formatting or getting it perfect. Just put something on paper (i.e. a text editor). Open notebooks are a great new idea to facilitate this. My friend, Carl, has a great example. These little tidbits will not only help you figure things out, but will also attract others to your work, and they may even help you figure things out. Put yourself out there and expose your weaknesses; other people will respect that and potentially become collaborators and helpers. I'm confident that it will pay off.
Paper is over
We've got to move beyond paper. Because we still prepare our work for an 8.5" x 11" sheet of paper, we've pigeon-holed our ability to explain ideas and communicate. We have to move beyond it and leave it behind. I'm not sure if the medium will be all that useful in the future. Bret Victor has some great ideas on how this can change: