Nothing, my friends, is free.
There is always a price to pay: in cycle time, hardware, disk space, and frustration.
The Price of Knowledge is Disillusion
Exhibit A: Hey, I know how we can track pandas usage while running production code! We’ll wrap every pandas call and note down the module and line number it came from.
Pros: comprehensive coverage in a single run of your code.
Cons: that single run of your code, which usually takes (say) 2.5 minutes, takes many hours and never finishes because the database connection gets reset every night at 1AM. Meanwhile it uses all available RAM on my workstation and I’m stuck reading Twitter on my phone.
Why? Because we are wrapping all calls, we cannot use multi-threading even if your code is optimized for it. Because we want to reject calls that originate inside pandas itself, we run a stack trace on every call to see where it came from. A single DataFrame constructor can call other pandas functions thousands of times, and each time we run a stack trace.
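For the curious, here's a bare-bones sketch of what that kind of wrapper looks like (decorator, stack trace and all); the names are mine for illustration, not the production code:

```python
import functools
import traceback
import pandas as pd

# Illustrative sketch of the Exhibit A idea: wrap a pandas callable, note the
# module and line it was called from, and reject calls that originate inside
# pandas itself.
usage_log = []

def track_usage(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # A stack trace on EVERY call -- this is where the hours go.
        caller = traceback.extract_stack()[-2]       # the frame that made this call
        if "pandas" not in caller.filename:          # reject intra-pandas calls
            usage_log.append((func.__name__, caller.filename, caller.lineno))
        return func(*args, **kwargs)
    return wrapper

# The real thing wrapped every pandas call it could find; one example:
pd.concat = track_usage(pd.concat)
```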
Mitigation: Only wrap the calls that are actually getting made instead of wrapping everything that we detected could possibly be made anywhere in the code. Which leads us to…
Exhibit B: Hey, what if we set the wrapper so it doesn’t do a stack trace every time, just notes that the call was made and increments a counter? That way we can reduce our wrapper list to only the calls that actually get made, saving memory.
Pros: Quicker than running with full test-creation ability enabled.
Cons: That single run of your code, usually 2.5 minutes, takes 2 hours, and now you don’t know where the call originates. Still wrapping every single instance of every call.
Mitigation: We created a “single-counter” wrapper mode that shuts off the wrapper after it traps a function for the first time and increments the counter for that function. The result is a simple list of all wrapped calls that got intercepted during that run. The penalty is now negligible. The solution was costly though - it took a few days to recognize the issue, come up with a solution, and code it.
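A stripped-down sketch of the single-counter idea, assuming the same sort of decorator plumbing as above (again, illustrative names rather than our actual code):

```python
import functools
from collections import Counter
import pandas as pd

# The wrapper counts its first hit, then restores the original function so
# every subsequent call pays no penalty at all.
hit_counter = Counter()

def single_counter_wrap(owner, name):
    original = getattr(owner, name)

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        hit_counter[name] += 1
        setattr(owner, name, original)   # shut the wrapper off after the first trap
        return original(*args, **kwargs)

    setattr(owner, name, wrapper)

single_counter_wrap(pd, "concat")
single_counter_wrap(pd, "merge")
# After the run, hit_counter.keys() is the list of wrapped calls that actually fired.
```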
Now at last we can run with a stripped-down version of the full wrapper list, tailored for the business code at hand. Let’s kick off this puppy and watch it … still crawl. Dang, that didn’t help at all.
Exhibit C: Hey, let’s only capture some of the calls in the reduced list. I bet if we wrap 10% of the calls we’ll get back 90% of the performance.
Pros: code runs take only a few hundred percent longer than unwrapped runs. We get feedback on some calls that we can start acting on right away.
Cons: We need to run the code 10 times, each in a separate directory (we use directory names as identifiers for certain items). Our reporting software has to know how to traverse multiple layers of directories. The infrastructure becomes less legible, and the elision / construction of the reduced call sets must be done carefully. More software, more possible bugs. Plus, any scaffolding of the software must make sure to kill, destroy, overwrite, reload, and otherwise nuke the existing TestGenerator objects before starting a new run; otherwise the directories for each run can come from any existing object, which sends data flying all over the place. You might end up with a pickle file in one directory and the test file that calls for it in another.
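Roughly, the partial-run bookkeeping looks something like this; the call list and directory scheme below are made up for illustration:

```python
import os

# Split the reduced call list into N chunks and give each partial run its own
# directory, since directory names identify the run.
def partition(calls, n_chunks=10):
    return [calls[i::n_chunks] for i in range(n_chunks)]

reduced_call_list = ["concat", "merge", "read_csv", "groupby"]   # illustration only
for run_id, chunk in enumerate(partition(reduced_call_list)):
    run_dir = "run_{:02d}".format(run_id)
    os.makedirs(run_dir, exist_ok=True)
    # ...wrap only the calls in `chunk`, point the TestGenerator at run_dir,
    # and kick off the business code for this partial run.
```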
Exhibit D: Hey, let’s write out unit tests to disk that mimic every pandas call we intercept. We will only write out tests for calls that originate in business code, since what pandas do to other pandas is their bidness if you know what I mean.
Pros: comprehensive set of tests that sit apart from business code to let us pre-determine all the issues you will run into when switching over to the new pandas, form a multi-pronged mitigation plan, and create an environment that passes all these tests before flipping the switch on your code.
Cons: DataFrames pickled in pandas 0.7 cannot be reconstituted in 0.15; the code has changed entirely underneath. Also, seemingly simpler data types like Series are indexed using a different elemental type, so the same operations produce results in a different sort order. Each of these causes either processing errors or failing tests. To get around this, every value of every variable of every data type must be serialized and sent to disk separately, then re-assembled using the new version of pandas. The disk space requirements are enormous.
Mitigation: Only write out a specified number of tests, such as 10, for each function called from each module and code line.
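Something like this, conceptually; the names and the write_test hook are placeholders rather than our real interface:

```python
from collections import defaultdict

# Cap how many tests get written per call site, keyed by (function, module, line).
MAX_TESTS_PER_SITE = 10
tests_written = defaultdict(int)

def maybe_write_test(func_name, module, line, write_test):
    site = (func_name, module, line)
    if tests_written[site] < MAX_TESTS_PER_SITE:
        write_test()                 # serialize the inputs/outputs and emit the test file
        tests_written[site] += 1
```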
But… what if those 10 scripts all get written at the beginning of the run, and the most interesting calls happen at the end of the run? I’ve got an idea…
Exhibit E: Hey, what if we write a test randomizer that stores all the data from all the calls in memory and only writes them out at the end, picking the first, the last, and 8 others at random? (There’s a sketch of this below.)
Pros: Boom! Not too much disk space, and tests from all over the code.
Cons: too many to list. Let’s just start with memory overuse, move on to “who is going to code this monstrosity?”, and finish with “who is going to test that we coded this correctly?”
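For what it’s worth, the core of the randomizer is small; the monstrous part is everything around it. A hypothetical sketch, with illustrative names:

```python
import random

# Hold every intercepted call in memory, then at the end of the run keep the
# first, the last, and 8 random picks from the middle.
def choose_calls_to_write(all_calls, n_random=8):
    if len(all_calls) <= n_random + 2:
        return list(all_calls)
    middle = random.sample(all_calls[1:-1], n_random)
    return [all_calls[0]] + middle + [all_calls[-1]]
```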
Besides disk space and memory, these conveniences cost time and effort. What really happens when we come across this sort of issue? I have to recognize the problem, admit the problem, discuss the problem, agree that there is a problem, propose solutions, let go of the proposed solution when it is not deemed feasible, defend the solution when it needs defending, and finally approach the client with the problem and solution – and re-approach when the true cost of the solution makes itself known in other ways later.
Did I mention frustration? It’s a cool product in a cool place with cool people, and it’s no picnic.
Meanwhile, on the AWESOME side of things, I coded my first classes and class methods, complete with inheritance and everything!
Why, I hear you ask, did you do that? Your procedures were working fine to generate those pretty graphs. Well, yes and no. Those pretty graphs came with a cost (are you detecting a THEME???), which was illegibility. Excel is good for a lot of things but in this context it was not the best tool for handing someone a slab of data and saying, here, slice and dice this to find out what you want to know. A lot of questions were going unanswered: what did the newest run show us that we haven’t seen before? Which modules are more important than others? What are the actual errors and failures turned up by the tests run against the new pandas?
I wanted more. I needed more. (Well, and Gil and Mike insisted on more.)
Mike’s proposed solution was to open and handle each json file independently by creating a class “json_file” and subclassing it for each type of file. The json_file class would know the name of the file, how to open it, and how to pull the data out of it into the object. The subclass would take care of translating that data into the various pieces we care about: lists of functions hit, lists of modules and lines hit, lists of errors and failures correlated to functions, etc.
Once I had decided on the data sets each file should have, I thought about the methods each one should have. Each would need to be able to compare its list of functions with another list and put out the differences. Same thing with a functions map – the nested dictionary with function on top, module in the middle, and line at the bottom. Any object with a functions map would need to be able to compare it to another map and put out either a concatenated map or a set of maps showing the differences between them.
I then created a helper class that traversed the reporting filesystem, found all the files of each type, and instantiated them, packaging the objects into a list. Then I created a suite of applications that used these different lists of objects to put out the reports we care about.
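To give a feel for the shape: here’s a rough sketch of how such a hierarchy might look. Apart from the json_file name, the classes, methods, and fields are my guesses at the design, and the real helper was a class rather than the bare function shown here:

```python
import json
import os

class json_file(object):
    """Knows which file it owns, how to open it, and how to pull the raw data out."""
    def __init__(self, path):
        self.path = path
        with open(path) as fh:
            self.data = json.load(fh)

class functions_report(json_file):
    """Subclass that translates the raw data into the pieces we care about."""
    def functions_hit(self):
        return set(self.data.get("functions", []))

    def diff_functions(self, other):
        # Compare this report's function list with another's and put out the differences.
        return self.functions_hit() - other.functions_hit()

def collect_reports(root, cls=functions_report, suffix=".json"):
    """Traverse the reporting filesystem and instantiate every matching file."""
    reports = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                reports.append(cls(os.path.join(dirpath, name)))
    return reports
```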
That’s all I have time for. If you’d like to see the code let me know and I will post excerpts next time.