The Universe of Discourse


Fri, 17 May 2019

How a good organization handles stuff

We try to push code from development to production most days, and responsibility for overseeing this is given to a “pushmeister”. This job rotates and yesterday it fell to me. There are actually two pushmeisters, one in the morning to get the code from development to the staging environment and into the hands of the QA team, and then an afternoon pushmeister who takes over resolving the problems found by QA and getting the code from staging to production.

ZipRecruiter is in a time zone three hours behind mine, so I always take the morning duty. This works out well, because I get into work around 05:30 Pacific time and can get an early start. A large part of the morning pushmeister's job is to look at what test failures have been introduced overnight, and either fix them or track down the guilty parties and find out what needs to be done to get them fixed.

The policy is that the staging release is locked down as soon as possible after 09:00 Pacific time. No regular commits are accepted into the day's release once the lockdown has occurred. At that point the release is packaged and sent to the staging environment, which is kept as close to production as possible. It is automatically deployed, and the test suite is run on it. This takes about 90 minutes total. If the staging deployment starts late, it throws off the whole schedule, and the QA team might have to stay late. The QA people are brave martyrs, and I hate to make them stay late if I can help it.

Since I get in at 05:30, I have a great opportunity: I can package a staging release and send it for deployment without locking staging, and find out about problems early on. Often there is some test that fails in staging that wasn't failing in dev.

This time there was a more interesting problem: the deployment itself failed! Great to find this out three hours ahead of time. The problem in this case was some process I didn't know about failing to parse some piece of Javascript I didn't know about. But the Git history told me that that Javascript had been published the previous day and who had published it, so I tracked down the author, Anthony Aardvark, to ask about it.

What we eventually determined was that Anthony had taken some large, duplicated code out of three web pages, turned it into a library in a separate .js file, and written a little wrapper around it. He didn't know why this would break the deployment, but he suggested I talk to the front-end programmers. They did understand: they had a deployment-time process that neither Anthony nor I knew about, which would preprocess all the .js files for faster loading. But it only worked on ES5 (JavaScript version 5) code, not on ES6. Anthony's code partook of ES6 features, but neither he nor I knew enough about the differences to know what to do about it.
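
For flavor, here is a made-up example of the sort of thing that goes wrong. The first fragment uses ES6-only syntax (const, an arrow function, a template literal) that an ES5-only minifier will refuse to parse; the second is the ES5 equivalent such a tool would have accepted. This isn't Anthony's actual code, just a sketch of the kind of difference involved.

    // ES6 version: const, an arrow function, and a template literal.
    // An ES5-only preprocessor chokes on all three.
    const formatName = (user) => `${user.last}, ${user.first}`;

    // The same thing written as ES5, which such a preprocessor
    // handles without complaint.
    var formatNameES5 = function (user) {
      return user.last + ", " + user.first;
    };

    console.log(formatName({ first: "Anthony", last: "Aardvark" }));    // "Aardvark, Anthony"
    console.log(formatNameES5({ first: "Anthony", last: "Aardvark" })); // "Aardvark, Anthony"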

After a couple of attempts to fix Anthony's code myself, I reverted it. 09:00 came and I locked down the staging release on schedule, with Anthony's branch backed out. The deployment went out on time and I handed over to the afternoon pushmeister with everything in good order.

The following day I checked back in the front-end Slack channel and saw that they had discussed how this problem could have been detected sooner. They were reviewing a set of changes which, once deployed, would prevent the problem from recurring.

What went right here? Pretty much everything, I think. I had a list of details, but I can sum them all up:

A problem came up, everyone did their job, and it was solved in a timely way.

Also, nobody yelled or lost their temper.

Problems come up and we solve them. I do my job and I can count on my co-workers to help me. Everyone does their job, and people don't lose their tempers.

We have a great company.
