Phase IV: Demonstration
At the end of last week I went on vacation so there was no post. Sorry y’all! But that doesn’t mean nothing happened. It was a really cool couple weeks for my pandas upgrade project.
We got to the point where we were creating unit test scripts with input and output values captured in files using Python’s pickle library. We were also working out how to describe the full scope of the work and how we were doing against it – in other words, describing our coverage model.
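To make that concrete, here is a minimal sketch of the shape one of those generated scripts might take, assuming the wrapper pickles a call’s inputs and its observed output. The file paths, test name, and the wrapped call shown are hypothetical, not the actual files our tooling writes:

```python
import pickle

import pandas as pd


def test_dataframe_merge_case_0001():
    # Inputs and the expected result were captured by the wrapper during
    # a real run of the business application and pickled to disk.
    with open("DataFrame.merge/0001_args.pkl", "rb") as f:
        args = pickle.load(f)
    with open("DataFrame.merge/0001_kwargs.pkl", "rb") as f:
        kwargs = pickle.load(f)
    with open("DataFrame.merge/0001_expected.pkl", "rb") as f:
        expected = pickle.load(f)

    # Replay the call against whichever pandas is installed and compare
    # the result to what the original library produced.
    actual = pd.DataFrame.merge(*args, **kwargs)
    assert actual.equals(expected)
```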
An overabundance of pickles
Once we got the pickling working correctly, we kind of overdid it. Some of the calls we wrapped were called millions of times during a run of the business application, and our script dutifully made a test, complete with pickle files, for each and every one. Eventually the network drive we were storing them on filled up to the point that the IT department warned all users to stop using that share. They found our directories and told us to delete them immediately.
Mike made two changes to tame the beast. First, he captured the traceback for each intercepted call and parsed it for evidence that the call originated in business code, rather than from somewhere else inside the library. This reduced the number of calls dramatically, but we still wanted more assurance that we wouldn’t bring the system to its knees again, so he capped the number of scripts created for any one call at 50. Each script spawns four pickle files, so there are at most 200 files per call. They are not huge files, so this is easily manageable on the file system. And as we shall see, it is sufficient for our needs, at least for now.
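Here is a rough sketch of those two safeguards as I understand them – the business-code directory, the bookkeeping dict, and the function name are all made up for illustration, not Mike’s actual implementation:

```python
import traceback

MAX_SCRIPTS_PER_CALL = 50                    # cap on generated tests per wrapped call
BUSINESS_CODE_ROOT = "/opt/business_apps"    # hypothetical root of the production code

scripts_written = {}                         # wrapped call name -> scripts created so far


def should_capture(call_name):
    """Decide whether an intercepted call should produce another test script."""
    # 1. Only keep calls that originate in business code, not calls that
    #    pandas/NumPy make internally to themselves.
    stack_text = "".join(traceback.format_stack())
    if BUSINESS_CODE_ROOT not in stack_text:
        return False

    # 2. Cap the scripts per wrapped call so one hot code path cannot
    #    flood the file share again.
    count = scripts_written.get(call_name, 0)
    if count >= MAX_SCRIPTS_PER_CALL:
        return False
    scripts_written[call_name] = count + 1
    return True
```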
One of the checkpoints we set up during the initial proof of concept phase involved showing our work to the guys who were paying our rate. These guys seemed genuinely curious about what we had done and even a little uncertain about what we were going to present.
Mike and I gave them a rundown on the steps we had accomplished so far: the wrapping of the calls into pandas and NumPy, the information that gives us about inputs and outputs, and the unit test scripts we can write with that information. “Oh – so you’re the guys who brought down the network drive yesterday!” exclaimed one of them. “Yes. Yes we did,” we replied. One thing that is now on their to-do list is to get us some infrastructure upgrades. More on that later.
I also showed them the coverage models we had worked up using the static and dynamic analysis tools. It was still preliminary, though – I wasn’t yet able to show how the pieces fit together, because I didn’t understand it myself and we hadn’t found the key to the puzzle yet. But we told them we were working on it, and they saw that we were making good progress, so they expressed confidence that we would get there, and off we went.
To express coverage is to isolate the dimensions you care about, measure their extent, and then figure out how far along each dimension you have actually discovered the things that matter. This poses a set of interlocking problems: how to describe the problem space, how to measure the dimensions, how to show what you’ve accomplished, how to prioritize work, how to define “done” in a meaningful way, and how to orient people who need to provide input to your thought process without alienating them.
Our dimensions include:
- the set of pandas (& NumPy) modules that are called from production code
- the set of production applications and supporting code that make the calls, down to the line number
- the frequency with which the calls are made
- the variety of data types with which the calls are made
- the variety of data and settings with which the business applications are run, since those are what force the various calls to get made and drive the full range of data types through the library
- the extent to which the unit test scripts for those library calls fail when run against the new library
- the change in these values over time as our understanding improves and our work progresses
Our challenge is to express these in an easily-graspable way which allows for prioritization and stopping.
We achieved a very important breakthrough this week on how to accomplish this. Mike updated the static analysis graphing code to track the line numbers where each type of call was made. He put these into the JSON file he builds when he runs the static analysis, so that it reports every call into pandas from every business application and supporting library, including the line number. Then he updated the AOP wrapper code to parse the traceback to find the line number of the business application code that originated the pandas call, and to note that in a JSON summary of the dynamic run, as captured by AOP wrapper activity. Finally, he wrote a glue program that writes out a third JSON file combining the two data sets: for each pandas call made anywhere in business code, it lists any application that actually caused that call to be made and how many times. This corresponds to the number of unit test scripts that were created during that run of the AOP wrapper.
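A sketch of what that glue step might look like – the file names and JSON keys here are invented for illustration, since the real schema is whatever Mike’s tools emit:

```python
import json
from collections import defaultdict

# Static analysis output: every pandas/NumPy call site found in business
# code, e.g. {"app": "pricing.py", "line": 212, "call": "DataFrame.merge"}.
with open("static_call_sites.json") as f:
    call_sites = json.load(f)

# Dynamic output: every call the AOP wrapper actually intercepted, e.g.
# {"app": "pricing.py", "line": 212, "call": "DataFrame.merge", "count": 37}.
with open("dynamic_calls.json") as f:
    observed = json.load(f)

# Index the dynamic observations by (app, line, call) so they line up
# with the statically-discovered call sites.
hits = defaultdict(int)
for rec in observed:
    hits[(rec["app"], rec["line"], rec["call"])] += rec["count"]

# For every call site found statically, report how many times (if ever)
# it fired during the instrumented runs.
coverage = [
    {
        "call": site["call"],
        "app": site["app"],
        "line": site["line"],
        "times_called": hits[(site["app"], site["line"], site["call"])],
    }
    for site in call_sites
]

with open("coverage_summary.json", "w") as f:
    json.dump(coverage, f, indent=2)
```

A times_called of zero is exactly the flag we want: a call site that exists in the business code but that no instrumented run has exercised yet.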
Now we can put all of that into a pivot table in Excel, easily see which pandas calls have not yet been exercised, and make a prioritized list of them to work down.
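For what it’s worth, the same question can be answered with a quick pivot in pandas itself before the data ever reaches Excel – again assuming the hypothetical coverage_summary.json from the sketch above:

```python
import pandas as pd

cov = pd.read_json("coverage_summary.json")

# Pivot: one row per pandas call, one column per business app, values are
# how many times that app actually drove the call during instrumented runs.
pivot = cov.pivot_table(index="call", columns="app",
                        values="times_called", aggfunc="sum", fill_value=0)

# Calls whose entire row is zero have never been exercised – the top of
# the prioritized work-down list.
unexercised = pivot[(pivot == 0).all(axis=1)]
print(unexercised.index.tolist())
```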
This also provides a framework for hanging other dimensionality data:
- Each pandas call has a list of eligible data types: how many does the business code invoke? How many have we invoked so far?
- How many business applications are there that use pandas? How many have we tested?
- How many pandas calls does a particular app make? What are the settings that cause it to make more, or different, calls? When we make a change to the test settings, do we see any change in our coverage?
New versions of the libraries
Once we had a few calls wrapped, and a few hundred unit test scripts that worked great against the original pandas library, Mike felt awfully curious about how they would fare against the new library. He found an Anaconda distro of pandas, installed it on his local drive, and ran the nose tests against it.
He picked a very simple function: “Series.notnull”. It seemed fairly straightforward, and a good candidate for a function whose tests would not fail – that is, one that would not have changed so much between the version we were working with originally and the latest version that the tests would break.
Out of 50 tests of this simple, straightforward pandas function, 17 failed.
They all failed the same way: the data type that we had pickled and were now deserializing and sending to notnull was not an ndarray, and notnull now only worked with some kind of ndarray.
This is when Mike pulled out his “I told you so” card – which to his credit he uses very sparingly – and reminded me that when we were discussing coverage and stopping heuristics, he had said the data itself would not matter nearly so much as the data type. So that’s why my discussion of coverage above focused less on business data per se, and more on business data as a means for driving various data types through the pandas library.
So the question now is, how do we want to mitigate this? Do we want to create a wrapper for the Series.notnull call which transforms all non-ndarray types into that type? Or do we want to trace the call to the business code itself and make the update there? Strictly from a perspective of process repeatability, creating a wrapper would be best, but in any specific case the client may want to go another route.
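If we go the wrapper route, it could look something like the shim below. This is only a sketch built on the assumption that the failures really do come down to the argument’s type; I have targeted the top-level pd.notnull for simplicity, and the coercion rule is illustrative rather than the fix we would actually ship:

```python
import numpy as np
import pandas as pd

_original_notnull = pd.notnull


def notnull_shim(obj):
    """Accept the input types the old library tolerated, then delegate."""
    if not isinstance(obj, (np.ndarray, pd.Series, pd.DataFrame)):
        # Coerce plain lists, tuples, etc. into the array type the new
        # library expects before making the real call.
        obj = np.asarray(obj)
    return _original_notnull(obj)


# Business code (or a compatibility layer) would then route calls through
# the shim, for example by monkeypatching: pd.notnull = notnull_shim
```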
A final word on infrastructure: to run the business applications through a sufficient variety of paces to exercise all the pandas and NumPy calls in the code, and to save off enough nose tests and pickle files to provide a valid target set of failing functions to write wrappers for, we are going to need more resources. The nice thing about hedge funds is that they tend to be able to afford things, so we have been led to believe we can use AWS resources as needed. We are going to start spinning those up next week. Then we can put out as many pickle files as we want without bringing the network down. Here’s to network hygiene!
I would love to field your comments and questions on this project. Stay in touch and check in next week.