Scale Challenge 1 - Tackling engineering blind spots when logs don't mirror reality

As systems scale, one encounters a variety of problems that challenge the status quo of the existing systems. It is unbelievable but true that different people perceive the same problem differently and consequently react to it in ways the others do not expect.

I have always enjoyed my work, and so I end up thinking about intriguing challenges even after I believe they are solved, wondering if there are better ways to solve them. And when I find one, I have my aha moment, which gives me a deep sense of satisfaction.

When I work with my teams, and in fact even during hiring, I give people a taste of this experience, so that they cherish the time spent working with me and have a takeaway or two (as much as I yearn to have one myself) when we conclude our session.

Here goes one such challenge to kindle your mind.

Case 1: Tackling engineering blind spots when logs don't mirror reality

You are an engineering leader who has joined a growth-stage startup that is witnessing exponential growth. As sweet as it sounds, it comes with a tremendous number of challenges on many fronts that torment you in reality. But to help you level up your game in engineering, let me throw light on the tech side alone.

CFO: Boy-o-boy, the cloud costs are exploding. You've got to do something about it.

CPO: Man, the average number of issues per day has increased by 50% over the last quarter. This hurts my product quality metrics. What can we do to restore the product's service quality? And how do we meet our targets for on-time delivery of new features?

Sales Head: Hey, with so many support tickets not getting addressed on time, social media is raging about our service quality. This hurts my sales target.

Customer Service Head: Most of the tickets are beyond the realm of L1 support, and it is for the engineering team to address them.

Engineering Team: Our Microservices Architecture on Kubernetes seems to be sturdy and handling the scale well, going by the uptime metrics from New Relic, our APM tool, and we don't see any context-specific errors in Airbrake, our error monitoring tool, either. The reason we are not able to close most of the reported issues on time is that we don't see them in our centralized logging service, AWS CloudWatch Logs.

Consider that the customer is king and their word is the truth. They did take some actions that apparently are not traceable by your team of engineers.

How would you, as an engineering leader, approach this situation? How would you guide your engineers in identifying and fixing this issue? Walk me through your thought process.
What do you think could have gone wrong here? How would you go about finding the cause? What corrective actions do you intend to take in this situation?
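
Before answering, it can help to make the blind spot concrete. The snippet below is only a minimal sketch of one possible first diagnostic step, not a solution to the case: it cross-checks customer-reported actions against log records and lists the ones that left no trace. The record types, field names, and the five-minute tolerance are all hypothetical, and the data is assumed to be already exported into memory; nothing here calls New Relic, Airbrake, or CloudWatch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ReportedAction:
    """An action a customer says they took (hypothetical shape, e.g. from a ticket)."""
    ticket_id: str
    user_id: str
    occurred_at: datetime


@dataclass(frozen=True)
class LogEntry:
    """A record exported from the centralized logs (hypothetical shape)."""
    user_id: str
    logged_at: datetime


def untraceable_actions(actions, log_entries, tolerance=timedelta(minutes=5)):
    """Return reported actions with no log entry for the same user within `tolerance`."""
    # Index log timestamps by user so each action only scans its own user's entries.
    times_by_user = {}
    for entry in log_entries:
        times_by_user.setdefault(entry.user_id, []).append(entry.logged_at)

    missing = []
    for action in actions:
        candidate_times = times_by_user.get(action.user_id, [])
        traced = any(abs(t - action.occurred_at) <= tolerance for t in candidate_times)
        if not traced:
            missing.append(action)
    return missing


if __name__ == "__main__":
    noon = datetime(2024, 1, 1, 12, 0)
    actions = [
        ReportedAction("T-1", "alice", noon),
        ReportedAction("T-2", "bob", noon),
    ]
    logs = [LogEntry("alice", noon + timedelta(minutes=2))]
    for action in untraceable_actions(actions, logs):
        print(f"{action.ticket_id}: no log trace for user {action.user_id}")
```

The share of tickets that come back untraceable gives you a number to put in front of the team, and the pattern in which users or time windows are missing is a starting point for your own hypotheses.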

Take a deep breath and take some time to re-read and assimilate the problem.

Blog @ Codonomics
