How to Build Your Own Analytics Platform for Engineering Data
Today, I’ll show you how to build an analytics platform to analyze your engineering data.
Now you might wonder why you would even want to build an engineering data platform, or what that even is. Let me start with an example: automated tests.
Reducing testing costs
Well, automated tests are essential for reliable, high-quality software. Nevertheless, writing, maintaining, and running large automated test suites takes a lot of effort and is expensive.
That’s why, at Microsoft, we started to look more closely at how we could improve testing activities to make the engineers’ lives easier and, at the same time, reduce the cost of testing. We did so by taking a close look at the test data we had available. One of our goals was to decrease the time and effort it takes to run a test suite, without compromising quality.
For this work, we started with a bunch of investigations. First, we looked at the obvious:
- How long are the execution times for the tests?
- What kinds of tests do we have?
- How often do the tests fail?
- Why do they fail?
Then, we started to see whether we could classify tests into different categories based on criteria such as execution time, the parts of the system they test, or their failure rate.
But we also looked at how often the same test cases were run across all team members, and even across teams. With all the data we gathered, we were able to distill performance profiles for test cases. That means we could estimate how valuable and effective each test case is.
These performance profiles helped us to substantially reduce testing costs without impacting software quality.
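To make this more concrete, below is a minimal sketch of what such a performance profile could look like in code. The fields and the scoring heuristic are illustrative assumptions, not the actual model we used at Microsoft:

```python
from dataclasses import dataclass

@dataclass
class TestProfile:
    """Illustrative performance profile for a single test case."""
    name: str
    avg_runtime_sec: float  # average execution time
    runs: int               # how often the test was executed
    failures: int           # how many of those runs failed
    true_failures: int      # failures that pointed to a real defect

    @property
    def failure_rate(self) -> float:
        return self.failures / self.runs if self.runs else 0.0

    @property
    def signal_rate(self) -> float:
        # Fraction of failures that actually uncovered a defect
        return self.true_failures / self.failures if self.failures else 0.0

    def value_score(self) -> float:
        """Toy heuristic: defect-finding signal per second of runtime."""
        return (self.failure_rate * self.signal_rate) / max(self.avg_runtime_sec, 0.001)

# A slow test that rarely finds real defects scores low and becomes a
# candidate for running less often; a fast, high-signal test scores high.
slow_flaky = TestProfile("ui_end_to_end", 300.0, runs=500, failures=60, true_failures=1)
fast_useful = TestProfile("parser_unit", 0.2, runs=500, failures=10, true_failures=8)
print(slow_flaky.value_score(), fast_useful.value_score())
```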
All thanks to engineering data
This was just one example to demonstrate what you can do with the engineering data at your fingertips. In my team at Microsoft, we used such data not only to substantially decrease testing times, but also build times. We improved code review practices as well as code review tools. And the foundation of all those improvements has been engineering data.
What kind of engineering data do you have?
Well, engineering teams do a lot of things: they communicate via email or chat, write source code, check the code into the source code repository, test the system, deploy it, and so on. By engaging in each of these activities, engineers leave data traces behind.
And just like in any other industry, we can use those data traces to better understand what works and what does not work when developing software.
Most of the time, we want to use this data to:
- increase the productivity of the engineering team,
- increase the quality of the software,
- reduce friction and bottlenecks,
- improve engineering tools,
- decrease the overall costs to build software.
The picture below shows some of the data sources you will find in a typical organization.
So, how can you use this data yourself? Well, before you can concentrate on all the different investigations, you have to first understand what data you have available.
Each data source lives in its own universe
When you look at your engineering data, you will see that most of these data sources have their own databases, access mechanisms, and authentication mechanisms. They will also all have their own data schemata. These systems are designed in a way that makes their primary tasks as effective and efficient as possible.
For example, in the software system the HR team uses, you will have the data stored in a way that allows HR staff to easily create new records when a person is hired, update records when a person moves offices, or delete records when an employee leaves the company. Employees outside of HR should also not have access to all the data – just think about data privacy.
Nevertheless, knowing an employee’s role and hire date, and especially the team structures, is a real treasure when it comes to analyzing your engineering data. Apart from the problems with authentication and access rights, the data in those systems isn’t designed in a way that allows for insightful investigations. There will be a lot of noise in the data, and you will have to deal with missing data points.
So, in order to use this data, you need to think about which data to extract, and which transformations to apply, before you can store it in your own data analytics platform.
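As a small sketch of such an extract-and-transform step, the snippet below takes a raw record the way a hypothetical HR system might return it, keeps only the fields that matter for engineering analytics, and deliberately drops everything else, including privacy-sensitive fields. All field names here are made up for illustration:

```python
from datetime import date

def transform_hr_record(raw: dict) -> dict:
    """Keep only the analytics-relevant fields from a raw HR record.

    `raw` mimics a hypothetical HR export; everything else (salary,
    home address, ...) is deliberately dropped for privacy reasons.
    """
    return {
        "employee_id": raw["id"],
        "role": raw.get("job_title", "unknown"),
        "hire_date": date.fromisoformat(raw["hired_on"]),
        "team": raw.get("org_unit", "unknown"),
    }

raw = {"id": 42, "job_title": "SWE II", "hired_on": "2019-03-01",
       "org_unit": "Developer Tools", "salary": 0, "home_address": "..."}
print(transform_hr_record(raw))
# {'employee_id': 42, 'role': 'SWE II', 'hire_date': datetime.date(2019, 3, 1), 'team': 'Developer Tools'}
```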
Multiple data sources for similar data
Another reason why you need to extract and transform the data is to consolidate the systems. Even at small companies, you might find different systems offering similar functionality.
For example, different engineering teams might use different code review systems, different version control systems, or different issue trackers. Even though they hold similar information, the schemata of those systems might look different. So, one of your first tasks is to understand how the data differs and which commonalities and patterns you can find.
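Here is a sketch of what such a consolidation step could look like: two hypothetical code review systems name the same information differently, and a small mapping layer brings both into one common schema. All field names are invented for illustration:

```python
def from_system_a(rec: dict) -> dict:
    # Hypothetical system A calls the author "owner" and uses change numbers
    return {"review_id": str(rec["change_number"]), "author": rec["owner"],
            "created_at": rec["created"], "source": "system_a"}

def from_system_b(rec: dict) -> dict:
    # Hypothetical system B nests the author and uses pull request ids
    return {"review_id": str(rec["pull_request_id"]), "author": rec["user"]["login"],
            "created_at": rec["opened_at"], "source": "system_b"}

reviews = [
    from_system_a({"change_number": 1001, "owner": "alice", "created": "2021-05-01"}),
    from_system_b({"pull_request_id": 17, "user": {"login": "bob"}, "opened_at": "2021-05-02"}),
]
# Both records now share one schema and can be queried together.
```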
The real power lies in the relationships between data sources
Each of these data sources, be it the issue tracking system, the code review tool, or the code repository system, will have a set of shared data entities that look similar in each of the systems.
Those data entities can be users, change numbers, or bug ids. For example, an employee might first create a bug report in the bug repository, then change some code, commit it to the code repository, and submit a code review. In each of those systems, you will have a user entity that identifies this employee. Still, the username might differ in each of the systems.
In an ideal world, and to make useful correlations and investigations possible, you want to identify the same user across the different systems, know which commit solves a specific bug report, and know which code change is associated with which code review.
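A minimal sketch of such identity resolution is an alias table that maps each system-specific username to one canonical employee id. In practice, you would build this table with heuristics such as email matching, plus manual curation; all names below are invented:

```python
# Maps (system, username) -> canonical employee id. In reality this table
# is built via heuristics (email matching, display-name similarity) and
# manual curation, and it needs ongoing maintenance.
ALIASES = {
    ("bug_tracker", "jdoe"): "emp_042",
    ("code_repo", "jane.doe"): "emp_042",
    ("code_review", "janed@example.com"): "emp_042",
}

def resolve(system: str, username: str) -> str | None:
    """Return the canonical employee id, or None if the alias is unknown."""
    return ALIASES.get((system, username))

# The same person, recognized across three different systems:
assert resolve("bug_tracker", "jdoe") == resolve("code_repo", "jane.doe") == "emp_042"
```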
Building bridges between data sources
So, to really leverage this data, it not only has to be collected, cleaned, and stored in a specific schema, it also needs to go through transformations and matching. Your goal is to be able to cross-reference as much as possible between data sources.
But how can you make those data sources play well together?
Well, the best case is when the systems create the cross-references automatically. Sometimes you can use a semi-automatic approach or heuristics implemented on top of the existing systems. Often, you need to rely on the engineers to provide this data. And sometimes, linking data sources is done completely manually.
For example, an engineer can refer to a bug id with a specific keyword in their commit message, even though your bug repository and your code repository have nothing to do with each other. This link is created manually, so you must expect that errors will be introduced, and account for this during analysis. Still, such “data bridges” are of tremendous value once you want to query and analyze your data.
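As a sketch, the snippet below pulls bug ids out of commit messages with a regular expression. The fixes/closes keyword convention and the BUG-123 id format are assumptions you would adapt to your own systems, and since the links are typed by hand, treat the result as noisy data:

```python
import re

# Matches e.g. "fixes BUG-1234" or "Closes bug-42"; adjust the keywords
# and id format to your team's actual convention.
BUG_REF = re.compile(r"\b(?:fixes|closes|resolves)\s+(BUG-\d+)", re.IGNORECASE)

def extract_bug_ids(commit_message: str) -> list[str]:
    """Return all bug ids referenced in a commit message, normalized."""
    return [ref.upper() for ref in BUG_REF.findall(commit_message)]

print(extract_bug_ids("Refactor parser, fixes BUG-1234 and closes bug-99"))
# ['BUG-1234', 'BUG-99']
```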
How to get started building your own engineering analytics platform
The first step towards building your own engineering analytics platform is to analyze and document each of the data sources you have at hand.
Which data sources do you have at your company?
- Bug repository
- Code repository
- Code Review data
- Builds and tests
- Organizational data
For each of those data sources, have a look at:
- What does the data from this data source look like?
- What information is stored in this data source?
- What’s the quality of this data source?
- How is this data created?
- Which transformation would you need to make this data useful?
- How can you link the data to other data sources?
- If you have another data source holding similar data, what are the differences and similarities?
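One lightweight way to capture the answers is a small, machine-readable record per data source, which can later feed your platform’s metadata catalog. A possible shape, with purely illustrative content:

```python
DATA_SOURCE_DOC = {
    "name": "bug_repository",
    "contents": "bug reports: id, title, state, assignee, timestamps",
    "quality_notes": "older records may lack a resolution timestamp",
    "created_by": "engineers file bugs manually; some bots auto-file",
    "transformations": ["normalize states", "map usernames to employee ids"],
    "links_to": {"code_repo": "bug ids referenced in commit messages"},
    "similar_sources": ["legacy_issue_tracker"],
}
```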
Data analytics platform architecture
Well, I deliberately omitted some details to keep this post digestible. But let me give you an idea of the architecture of an engineering data analytics platform. The illustration below closely follows the design of Microsoft’s engineering data platform CodeMine, a platform that we and many other product teams at Microsoft use on a daily basis.
As you can see in the illustration, the data from the original data sources are loaded, cleaned and transformed, and then stored in a unified data schema in the consolidated engineering data platform.
On top of this platform sits an API, designed to conveniently query the engineering data. Part of this platform are also services that handle access permissions, as well as data archiving. And as you might be dealing with a large amount of data, you also have to design a way to load new data incrementally.
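For the incremental loading, a common pattern is a high-watermark: remember the newest timestamp you have already ingested, and only fetch records beyond it. A minimal sketch, where `fetch_since`, `transform`, and `store` are hypothetical hooks into a source system, your cleaning step, and your unified store:

```python
from datetime import datetime, timezone

# The watermark is persisted between runs, e.g. in the platform's metadata store.
watermark = datetime(1970, 1, 1, tzinfo=timezone.utc)

def incremental_load(fetch_since, transform, store):
    """Pull only records newer than the watermark, then advance it.

    `fetch_since(ts)` yields source records updated after `ts`,
    `transform` maps them to the unified schema, and `store` writes
    them to the consolidated platform. All three are assumed hooks.
    """
    global watermark
    for record in fetch_since(watermark):
        store(transform(record))
        watermark = max(watermark, record["updated_at"])
```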
Coming up next…
If you want to learn more about building analytics platforms, I suggest this article on how CodeMine, Microsoft’s internal data platform, was built, or this article on how we created and deployed the Code Review analytics platform.
This article is also part of a larger blog post series on analyzing engineering data that you should follow. The next posts will concentrate on specific investigations and analyses you can do with this data, as well as on the do’s and don’ts of measuring certain aspects of your engineering work.
If you need help with any of this, feel free to reach out to me via Twitter or book a free consultation. You can also join the engineering data analytics community.
Don’t forget to subscribe to my mailing list, so I can ping you once the next article is live.