Every software engineer also acts as a software architect. I want to cover why that is and look at the principles, dimensions, and aspects to consider in this discipline.
What is architecture anyway?
In software we often talk about architecture. The term is understood rather vaguely, as people use it to refer to different things in their minds. The differentiation into software architecture, solution architecture, systems architecture, cloud architecture, or enterprise architecture is not particularly helpful either, at least not for explaining what architecture is.
Presenting architecture definitions is a slippery slope. Please think of this article as that guy’s perspective on architecture.
ISO 42010 actually contains a definition that is, in my opinion, not so bad. According to the standard, "architecture" comprises the
fundamental concepts or properties of a system in its environment embodied in its elements, relationships, and in the principles of its design and evolution
Formalities aside, I think of it as
A concept of what goes where and talks how to whom.
And as a corollary to that
What are the implications of that?
Since software engineers are deciding what goes where and who talks how to whom basically all the time, every software engineer is also an architect.
There are many levels on which architecture happens. In every software system we can decide “what goes where and talks how to whom”. Where classes reside, which packages may call classes located in other packages, which boundaries may or may not be crossed. I call architecture within one software artifact “software architecture”.
In my work I often came across larger systems consisting of several "microservices" (yet another ill-defined term). I usually drop the "micro" and go with "service". Deciding what goes into which service and how services talk to each other is part of what I consider "solution architecture". That's the field I am most interested in, and that's what this article will focus on.
In a software world, deciding "what goes where and talks how to whom" has far-reaching consequences for cross-functional (also called non-functional) requirements. Including but not limited to: "How independently can teams work?", "How independently can engineers within a team work?", "How easily can responsibility be transferred to another team?", "What about availability?", "What is the blast radius if anything goes down?", "How can the system be scaled?", "What are the implications for performance?", "How quickly can we change something?", "What is the impact on cost and effort?", and so forth.
The first part of “what goes where and talks how to whom” entails the second. Depending on what goes where we need to think about who talks how to whom. What goes where is driven by cohesion, while who talks how to whom deals with coupling.
In my eyes, coupling and cohesion are the central anchor point, the essence, if you will, of architectural thinking. Obviously, there is no right or wrong here; there is always a tradeoff, there is always a decision, which by the way gives rise to the infamous always-correct answer: "It depends".
Cohesion describes how much things belong together. We usually want high cohesion. Coupling describes interdependence between things. We usually want low and/or loose coupling (between cohesive units).
Deciding what goes where. Drawing some boxes. How hard can it be? Allow me to spoil the fun: it can be extremely tedious and unbelievably hard! To achieve high cohesion, I think less about what needs to go in than about what can go elsewhere. A voice in my head reads Antoine de Saint-Exupéry's quote about perfection to me over and over again:
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
Admittedly a good driver for architectural thinking, perfection is more often than not neither required nor desirable. The cost is simply too high. If we move things from one place to somewhere else, we create lots of places which all need attention one way or another. This may mean new git repositories, new pipelines, new deployment descriptions, new dashboards, etc., which I like to refer to as "moving parts". Moving parts drive complexity in any system.
Cohesion can be considered on different levels of abstraction. The things to look out for are boundaries.
A library could be such a boundary. Say we have code in two libraries that (almost) always go together; why would we want two rather than one? Even if some users do not use all the stuff in your library, it may be worthwhile to ship it as one to reduce the hassle of maintaining and integrating multiple libraries. However, if you have good reasons to create multiple libraries, there may be other ways to make them cohesive. Spring Boot, for example, provides so-called starters, which use Maven's transitive dependency mechanism to form a cohesive unit.
Deployment artifacts or services are another candidate for boundaries. It makes sense to put different things into different services for a variety of technical reasons such as test time, deployment time, scalability, availability, error proneness, etc. At the same time, the moving-parts argument holds true here as well. Creating tons of small services adds complexity. And we generally try to avoid or decrease complexity because it is so incredibly hard to catch the point where it is still manageable before it punches you in the face. Several services can also be grouped into a cohesive unit. You can apply whatever means your runtime environment has to offer (e.g. Kubernetes namespaces, Cloud Foundry spaces, AWS VPCs), simply use tags or labels, or do nothing at all and keep that grouping in your head. When considering which services belong together, I find Domain-Driven Design (DDD) amazingly helpful.
Another incredibly beneficial question is "if we moved this service to a different team, what should go to that team as well?". This question yields yet another boundary: the team boundary. Thanks to Conway's Law, team boundaries are not to be underestimated. Assigning things to different teams can be incredibly painful or a pillar of stability in your overall architecture.
Once we have a proposal, a decision, or constraints (e.g. off-the-shelf products) for what goes where, we need to figure out who talks how to whom. The insights generated by thinking about coupling are crucial in validating or invalidating our thoughts on cohesion. What we learn here may lead us back to the question of what goes where, as we may not like what we find out looking at coupling.
In terms of boxes and arrows, we will now focus on the arrows. When looking at a diagram, my first question is usually whether the arrows describe dependencies or message flow (I personally prefer dependencies). Dependencies describe who initiates the communication, while message flow is a result of information being pushed or pulled. Another dimension, often expressed with solid or dashed arrows, is the distinction between synchronous and asynchronous processing.
As architects we need to decide who initiates the communication and who waits to be contacted. The box initiating the communication has a dependency on the box on the receiving end. If that box is not available, the initiator cannot complete its piece of work, which does not have to be as bad as it sounds, since we can often complete work later. The box on the receiving end needs to be aware that load might propagate to it.
We want to carefully consider which box needs to know about the other boxes. In my experience this depends less on technicalities and more on the use cases. If you need data from another service to execute your use case, you might want to call that other service, especially when the data is dynamic. For less urgent and probably more static data, you can replicate it and keep it locally; then you do not have to reach out to the other service while executing your use case, which improves performance and reduces the chance of failure. If this approach sounds interesting, I recommend checking out self-contained systems.
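To make the replication idea concrete, here is a minimal sketch of keeping a local copy of fairly static data so the use-case path never calls the other service. All names (`LocalReplica`, `fetch_remote`) are hypothetical; the fetch function stands in for whatever HTTP client or message consumer you would actually use.

```python
class LocalReplica:
    """Keeps a local copy of fairly static data from another service,
    so use-case code never has to call that service on the hot path."""

    def __init__(self, fetch_remote):
        # fetch_remote is a hypothetical callable wrapping a remote read,
        # e.g. GET /countries on the other service.
        self._fetch_remote = fetch_remote
        self._data = fetch_remote()  # initial sync on startup

    def refresh(self):
        # Called periodically from a background scheduler; a failed
        # refresh keeps the (possibly stale) local copy intact instead
        # of failing the use case.
        try:
            self._data = self._fetch_remote()
        except ConnectionError:
            pass

    def get(self, key):
        return self._data.get(key)  # pure local lookup, no remote call


# Demo with a fake remote source standing in for the other service.
replica = LocalReplica(lambda: {"DE": "Germany", "FR": "France"})
print(replica.get("DE"))  # → Germany
```

The tradeoff is staleness: between refreshes the replica may lag behind the source, which is exactly why this fits static rather than dynamic data.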
Pulling and Pushing Information
Data replication as well as event propagation can be done using pull or push. While dependencies center around the question "who knows about the other service?", message flow is concerned with "who is responsible for making sure that data is where it needs to be?". Another way to look at it is to consider active and passive roles. Is a service pulling or is it being pulled? Is a service pushing or is it being pushed to?
Pulling information often goes hand in hand with polling: pulling information at short, regular intervals. Service A may poll service B to see if there is new work to be done. If service B is not available, service A has nothing to do. Once service B comes back online, service A will poll again, and if there is new work, service A will get to it. Setting up polling is nice because resiliency is already baked in. Depending on your poll rate and the traffic/load in the system, you may perform unnecessary calls between the services. This is not super efficient and may produce noise in your monitoring. On the other hand, every pull can act as a dead man's switch, which can be used to notice failures in your system. Pulling information may also introduce a certain delay: if new information comes in right after you pulled, it takes until the next pull for that information to be transmitted. This may or may not be a problem. Polling can also be used to safeguard a system from overload; simply poll less frequently when your system is under heavy load.
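A single poll cycle can be sketched like this; the point is that a failed poll is not an error, it just means "try again next interval", which is where the baked-in resiliency comes from. `fetch_work` and `fake_fetch` are hypothetical stand-ins for a real call to service B.

```python
def poll_once(fetch_work, handle):
    """One poll cycle: ask service B for work, process what we get.
    A failed poll is not an error; the next cycle simply retries."""
    try:
        work_items = fetch_work()  # e.g. GET /work on service B
    except ConnectionError:
        return 0  # B is down; nothing to do until it comes back
    for item in work_items:
        handle(item)
    return len(work_items)


# Demo with a fake service B: the first poll fails, the second one
# returns work, just like B going down and coming back online.
responses = [ConnectionError("B unavailable"), ["job-1", "job-2"]]

def fake_fetch():
    response = responses.pop(0)
    if isinstance(response, Exception):
        raise response
    return response

done = []
poll_once(fake_fetch, done.append)  # B unavailable, nothing happens
poll_once(fake_fetch, done.append)  # B back online, work is processed
print(done)  # → ['job-1', 'job-2']
```

A real service would wrap `poll_once` in a loop with a sleep between cycles, and could lengthen that sleep under heavy load to get the overload safeguard mentioned above.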
To reduce the noise and improve reaction time, two systems can use long polling. With this mechanism the client uses a long timeout, and the server holds the connection for a certain amount of time if there is no new work to be transmitted. If new work comes in, the server can transmit it immediately. As an alternative you can simply poll really often.
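The server side of long polling boils down to a blocking wait with a deadline. This sketch uses an in-process `queue.Queue` as a stand-in for the server's pending-work store (the name `long_poll` and the hold time are illustrative, not a real API):

```python
import queue

def long_poll(work_queue, hold_seconds):
    """Server side of long polling: block for up to hold_seconds waiting
    for new work, but return immediately as soon as something arrives."""
    try:
        return work_queue.get(timeout=hold_seconds)
    except queue.Empty:
        return None  # nothing came in; the client times out and polls again


pending = queue.Queue()
pending.put("job-42")
print(long_poll(pending, hold_seconds=1.0))  # → job-42 (returned immediately)
print(long_poll(pending, hold_seconds=0.1))  # → None (hold time expired)
```

The client sees either fresh work with minimal delay or an empty response after the hold expires, upon which it simply reconnects.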
Pulling information can be annoying if you have to poll various targets and additional sources pop up all the time. Your service would need to change because another service has been added to the system, a violation of the Open-Closed Principle. In that case it would be nice if those new sources pushed the information, or at least let your service know how to poll them.
Pushing is pretty much the opposite of pulling. It transmits information only when it has to (reducing noise), and it does so immediately. However, too many pushes may overload the target system. If the target system becomes unavailable, you may have to retry at a later point in time. Failures of the form "pushes do not work anymore" may be hard to detect, as monitoring the absence of something easily leads to false positives. Pushing information can be very effective in scenarios where services do not operate on the same network. When integrating with external services it is often simpler to reach out to them than to provide an internet-facing endpoint that can be pulled, with all its security aspects.
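The "retry at a later point in time" part is typically an exponential backoff around the push. A minimal sketch, assuming a hypothetical `send` callable that raises `ConnectionError` while the target is unavailable:

```python
import time

def push_with_retry(send, payload, attempts=5, base_delay=0.5):
    """Push payload to the target; on failure, retry with exponential
    backoff (base_delay, 2*base_delay, 4*base_delay, ...)."""
    for attempt in range(attempts):
        try:
            send(payload)
            return True
        except ConnectionError:
            time.sleep(base_delay * (2 ** attempt))
    return False  # give up, e.g. park the payload in a dead-letter store


# Demo: a fake target that is down for the first two pushes.
calls = []

def flaky_send(payload):
    calls.append(payload)
    if len(calls) < 3:
        raise ConnectionError("target unavailable")

ok = push_with_retry(flaky_send, {"event": "order-placed"}, base_delay=0.01)
print(ok, len(calls))  # → True 3
```

Note that a retry loop only papers over short outages; for longer ones you still need somewhere durable to keep unsent payloads, which is part of why the "pushes do not work anymore" failure mode deserves monitoring.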
Pushing information can also be annoying if you have to push to various targets and their number keeps growing. In this case it would be more convenient if the additional services pulled information from you. Alternatively, new services may register a push endpoint with your service in a self-service fashion. Think of webhooks with any third-party service.
Thinking about push and pull usually goes hand in hand with dependencies (both are expressed by the direction of arrows). Notice that both push and pull may become annoying when various targets are involved. So take a good look at your 1:n and n:1 relationships to decide who should initiate communication.
Another insightful question to decide the direction of arrows is "if we had to change something, which parts of the system would need to be touched?". If you had to change several pieces of your system simultaneously to implement a use case, try turning things around and repeat that thought experiment. A concrete example are service mails in e-commerce. There are tons of events which may involve a mail: user registered, password lost, order placed, parcel shipped, etc. Should the use cases call the mail service to send the mail, or should the mail service poll the use cases to see whether a new mail needs to be sent?
As a rule of thumb, I like to focus on the use cases and keep the business rules together. Whoever has to make sure that certain things happen should initiate the communication. In my mail example above, I definitely favor the use case reaching out to the mail service and pushing the information. I come to a different conclusion when looking at order placement and order fulfillment. To me they are two different use cases, and they are even temporally decoupled: a user does not expect the order to be fulfilled synchronously as part of placing it. Here, it does make sense for the fulfillment service to poll the placement service asynchronously and reap the benefits of polling. Such use cases make good boundaries for services, even for teams.
Synchronous and Asynchronous Processing
While every call is technically synchronous, we can consciously decide to make the processing and message flow synchronous or asynchronous. An asynchronous call has the notion of accepting the request, returning immediately, and then working in the background. When we learn programming, we usually start out with synchronous calls (e.g. calling a function, sending an HTTP request).
At first glance, synchronous calls look awesome! They tend to be easy to follow along, easy to debug, work on current data, and perform the thing they are supposed to do at the earliest possible point in time. Why would anyone ever need anything else? On closer inspection we notice that the error modes need some work to be dealt with properly. What should happen if a synchronous call fails? Report back to the user? Retry? If so, how often or until when? A chain of synchronous calls may increase the wait time until all involved parties have completed their work. Every timeout needs to account for all the calls and their timeouts down the road. If multiple services are involved and all work on data, we may be confronted with transactional requirements: a failure in a chain may result in inconsistent data. Another interesting aspect is scaling; if one tier of services receives increased load, synchronous calls cascade the pressure throughout the system. I am surely just scratching the surface, but it should be clear that synchronous calls are a double-edged sword.
Asynchronous processing, on the other hand, usually comes with a retry mechanism baked in, allows buffering which safeguards against overload, enables concurrent designs (I highly recommend Rob Pike's talk "Concurrency is not Parallelism"), and can be just as fast as synchronous calls. However, asynchronous processing feels awkward, especially when you are not used to it. It may lead to opaqueness which makes it hard to follow what is happening where and when, which makes it extremely taxing to think and reason about. In addition, there is no guarantee as to when the work will be finished. Errors can only be reported back asynchronously, which may or may not fit the use case. Also, asynchronous processing promotes eventual consistency, which can be mind-boggling on its own. Eureka! Asynchronous processing is a double-edged sword as well. Welcome to engineering.
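The "accept immediately, work in the background, buffer against overload" semantics can be sketched with a bounded queue and a worker thread. Everything here (`accept`, the buffer size, the sentinel shutdown) is an illustrative sketch, not a production design:

```python
import queue
import threading

# Bounded buffer: accept requests immediately, process in the background,
# and safeguard against overload by rejecting when the buffer is full.
buffer = queue.Queue(maxsize=100)
results = []

def accept(request):
    """Asynchronous semantics: we only acknowledge that the request was
    accepted, not that it has been processed."""
    try:
        buffer.put_nowait(request)
        return "accepted"
    except queue.Full:
        return "rejected"  # backpressure instead of cascading overload

def worker():
    while True:
        request = buffer.get()
        if request is None:
            break  # sentinel value shuts the worker down
        results.append(f"processed {request}")  # retries would go here
        buffer.task_done()


t = threading.Thread(target=worker)
t.start()
print(accept("order-1"))  # → accepted (returns before the work is done)
buffer.put(None)  # queued after order-1, so work finishes first
t.join()
print(results)  # → ['processed order-1']
```

The caller gets an answer immediately, while the actual outcome only exists later in `results`, which is exactly the opaqueness and eventual-consistency flavor described above.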
Depending on your environment, the choice between synchronous and asynchronous processing can have a huge impact. I work a lot with event-driven architecture, where asynchronous processing fits in quite nicely. As asynchronous processing reduces the dependency on other parts of the system being available and properly scaled, I have come to default to asynchronous over synchronous processing, unless there is good reason not to.
IMHO, coupling and cohesion are at the heart of architectural thinking. Deciding what goes where and who talks how to whom, plus dealing with the implications of those decisions, is what architects do in a software world. While cohesion is a discussion about boundaries (library, service, team), coupling is a discussion about dependencies, message flow (push/pull), and processing modes (sync/async). I focus on use cases for orientation in deciding what goes where as well as in dependency management and message flow. And as I usually default to asynchronous over synchronous processing (unless there is good reason not to), I also lean towards pull over push.