About the Author:
Yang Hongzhi leads the architecture of the Zhihu homepage and is mainly responsible for its engineering construction, architecture optimization, and performance improvement.
1. Background
Our old homepage system has a centralized design: an all-purpose feed base plus a main function. Basically all of the functionality is coupled together, and every line has to be maintained with extreme care so as not to affect other logic. Here is some general background first:
The system has 660,000 lines of code, and learning the homepage logic takes at least 3 months;
Adding a new feed type takes 7 working days;
Adding a new tab takes about 3 weeks of work;
Without complete link tracing, there is no way to even begin troubleshooting online bugs;
The server-side response time is currently as high as 870 ms;
Part of the logic is unmaintained.
2. What should we do?
Actually, the idea is very clear: decouple the modules and optimize performance.
But how do we decouple, what do we decouple, and how do we optimize performance? Rewrite everything in Java/Go/C?
In February 2018, in order to support the business of a sister team, almost all of the homepage engineers were transferred out. In fact, only three engineers and one intern remained on the homepage. The most senior had been hired only 6 months earlier, and the rest were basically new employees, with an average of 26 months, and no experience in maintaining a large-scale project. In other words, everyone was a newcomer.
With this state of affairs, how do we refactor?
Everybody knows that refactoring is a big deal, everybody is enthusiastic, and everybody wants to do something good. But everyone had been doing ordinary business work before, heavily colored by PM-driven development. Completing features quickly and launching quickly seemed more important than anything else, as if that were the only way to prove you were reliable.
In such a situation, the consequences of rashly starting a refactoring or a rewrite are unimaginable. It is like blindfolding a group of 100-meter sprinters and asking them to run a marathon. They are passionate and highly skilled, but they do not know where to run, and more importantly, their skills are not necessarily suited to long-distance running.
We should:
Point out the direction: make everyone aware of what we want to do and why we want to do it;
Determine the criteria: refactoring is not a show of skill, and the goal is not a flashy system. We need the design that best fits our engineering, our collaboration, and our business;
Determine the method: let everyone apply their abilities to a clear target;
Keep pushing: encourage active design and trial and error, and keep the review gate tight, so that everyone gets familiar with the system and grows.
3. What we did
3.1 Point out the direction and guide thinking.
Why refactor? Everyone has only a superficial understanding: the old system is lousy, so change it! But why is it lousy? Where exactly does the lousiness show? Won't the new architecture end up just the same?
In fact, the current system is lousy not because our predecessors lacked ability, but because the earlier design no longer fits the current business scenario. There is no such thing as a good or bad architecture, only one that fits. We are not trying to build a perfect system, so we must not show off our skills; we should do our best to fit the current business scenario.
Our goal is to make the homepage easier to work with and to make our new architecture degrade more slowly.
We also guide everyone to think about why a certain part deserves polishing, why a certain detail deserves attention, and why a certain feature has to be given up. This is not a waste of time: quick output is only a fleeting bonus, and a system that is friendly only to its author is absolute rubbish.
In addition, we use visible results to establish the criteria and reach consensus. For example, the refactored ActionCard module has clear configuration management, standardized automatic event tracking, and an extremely low extension cost. With a good result in hand, we guide everyone to think about why it was done this way, what they would do themselves, and how to reach the same indicators in the modules they are responsible for.
3.2 Determine the method: module-oriented division of labor.
Why not interface-oriented?
Interface-oriented development is the most mature development mode at present, in which both parties define the interface, implement it in parallel, and then jointly debug it. Efficient rollout, great decoupling, more convenient bug locating...
Who defines the interface?
Newcomers simply do not understand the business logic, and any interface defined apart from the actual business is a flashy mistake. At the same time, we cannot wait until everyone is familiar with the business before starting the refactoring. Even I, who am familiar with the business, am not confident that I could define a perfect interface for every module. Take the filtering logic: it has nearly 100 basic computing units, with all kinds of performance optimization hacks mixed in. How do we make it easy to learn, easy to maintain, and easy to troubleshoot? Everyone has only a rough outline in mind. As for how to stay compatible with the performance optimization strategies and what the implementation will look like, we can only take one step at a time.
Who implements the interface? If an interface is only declared but not implemented, the caller stays blocked, and the callee's logic is never exercised by real traffic. Without verification, everything goes online rashly and all interfaces launch together; if any one interface has a problem, everyone rolls back together: people deadlock each other.
How does module-oriented development work?
For example, the person who refactors the filter module thinks only about how to build the filter module and does not care about other logic.
If interaction with other modules is involved:
a. Other modules call this module: refactor this module and update the callers' call logic along the way.
b. This module calls other modules: keep the original API call style and change as little as possible. (Once the callee is ready, update the interface layer, following the same rule as in a.)
Note: If neither the caller nor the callee is ready, you can add a temporary adaptation layer that bridges to the original logic.
For example, the new version of FeedItem has a different serialization format from the old version, but the session module does not yet support the protocol upgrade. An adaptation layer converts the new protocol to the old one so that the session keeps working normally, and at the same time online traffic can be used to verify the new module. Once the session supports the protocol upgrade, the session maintainer takes the adaptation layer offline.
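The temporary adaptation layer described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Zhihu code: the `FeedItem` fields, both serialization formats, and the names `serialize_v2`, `adapt_v2_to_v1`, `LegacySession`, and `record_feed_item` are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class FeedItem:
    """New-style feed item (fields are hypothetical)."""
    item_id: int
    item_type: str
    score: float


def serialize_v2(item: FeedItem) -> str:
    """Assumed new format: a JSON object with named fields."""
    return json.dumps(
        {"id": item.item_id, "type": item.item_type, "score": item.score}
    )


def adapt_v2_to_v1(payload: str) -> str:
    """Temporary adaptation layer: convert the new format to the old one.

    The old format is assumed to be a plain "type:id" string, which is all
    the legacy session module understands.
    """
    data = json.loads(payload)
    return f"{data['type']}:{data['id']}"


class LegacySession:
    """Stand-in for the session module, which only accepts the old format."""

    def __init__(self):
        self.seen = []

    def record(self, old_payload: str) -> None:
        item_type, item_id = old_payload.split(":")
        int(item_id)  # the legacy parser expects a numeric id
        self.seen.append(old_payload)


def record_feed_item(session: LegacySession, item: FeedItem) -> None:
    """The new module emits v2; the adapter keeps the old session working."""
    session.record(adapt_v2_to_v1(serialize_v2(item)))
```

When the session module learns to read the new format, only `adapt_v2_to_v1` and its single call site need to be deleted; neither module has to change in lockstep with the other.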
3.3 Run in small steps.
With module-oriented development, each person is responsible for one independent function and can focus on that function alone. The maintainers of other modules will not touch any part of it, so while we are still learning we can ship as many changes as possible, each as small as possible: even if a given launch only deals with one hack, even if a given launch only goes live on a temporary adaptation layer.
The first step for the filter module was simple and blunt: remove hacks such as the if/else branches in the calculation chain, extract the basic operators, and then combine them. The risk is basically zero and the cost is quite low (it does not require much filter knowledge), but the benefit is large (we come to understand the details of the filter and untangle the old logic).
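As a rough illustration of "extract the basic operators, then combine them", the sketch below replaces a chain of if/else checks with small predicate functions that are composed into one filter. The operator names (`not_read`, `not_blocked_author`, `min_score`), the item fields, and the sample data are assumptions made for this example; the real filter module has far more computing units.

```python
from typing import Any, Callable, Dict, Iterable, List

Item = Dict[str, Any]
Operator = Callable[[Item], bool]  # returns True when the item should be kept


def not_read(read_ids: set) -> Operator:
    """Keep items the user has not read yet."""
    return lambda item: item["id"] not in read_ids


def not_blocked_author(blocked: set) -> Operator:
    """Keep items whose author is not blocked."""
    return lambda item: item["author"] not in blocked


def min_score(threshold: float) -> Operator:
    """Keep items whose score reaches the threshold."""
    return lambda item: item["score"] >= threshold


def all_of(*ops: Operator) -> Operator:
    """Combine operators: an item is kept only if every operator keeps it."""
    return lambda item: all(op(item) for op in ops)


def apply_filter(items: Iterable[Item], op: Operator) -> List[Item]:
    return [item for item in items if op(item)]


items = [
    {"id": 1, "author": "a", "score": 0.9},
    {"id": 2, "author": "b", "score": 0.8},
    {"id": 3, "author": "a", "score": 0.2},
    {"id": 4, "author": "c", "score": 0.7},
]
pipeline = all_of(not_read({1}), not_blocked_author({"b"}), min_score(0.5))
kept = apply_filter(items, pipeline)
print([item["id"] for item in kept])  # prints [4]
```

Each operator can be read, tested, and replaced on its own, which is what keeps the risk of every small launch low.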
Clarify the value and unify the will
Why do we refactor, and how do we make the system degrade more slowly?
A good system should be maintainer-oriented and user-friendly, with a low learning cost and a low debugging cost.
If the cost has already come down, think harder about whether it can go lower. Within our manpower and abilities, we deliver the simplest thing, not whatever the author finds comfortable to write and not whatever goes live the soonest. For example, to make a certain feature convenient to use, someone introduces several layers of inheritance. The author enjoys writing it, but the cost of reading it is unpredictable. Then someone gets fed up and rewrites it... Think about it: wasn't that architecture also designed in the name of flexibility?
That is why we repeatedly guide everyone to think about why we are refactoring and which indicators can prove that the refactoring succeeded. These indicators are the core competitiveness of the homepage, and also the core competitiveness of each individual. Honing them is not a waste of time; it is how we move toward excellence together with the homepage.
Constantly motivate and point out mistakes
Because we were unfamiliar with the business and lacked refactoring experience, we often ran into all kinds of problems and raised many questions. For example: "I think this design is good", "I think writing it this way is fine", "surely this is not a waste of time", and, more often, "I don't know where to start". There are many such situations, and if something looks wrong it should be stopped as early as possible.
The person in charge should accept trial and error but keep hold of its direction. At the same time, we must respect the fruits of other people's labor: we cannot shoot an idea down with a single blow or insist that everything be done our way.
The basis for avoiding detours is unity of will. When you see a good design, take the opportunity to reinforce that awareness and analyze why it is good and which indicators it improves; for a poorly thought-out design, point out the possible risks.
4. What we have achieved
Due to limited manpower (we also had to keep up with some product iterations), the refactoring (roughly two months of work) has completed only its first phase, but the results are already substantial:
Adding a new feed type dropped from 7 days to 2 days (expected to drop to 1 day after the second phase is complete);
Adding a new tab dropped from 2 weeks to 3 days;
Supporting a new promotion slot dropped from 1 day to 1 hour;
The cost of learning the pull architecture dropped from 3 weeks to 3-5 days;
Real-time feedback to recall is supported;
Real-time, full-link tracing of pull logs is supported, so bugs can be located quickly;
The generalized card has launched, so clients can initially support a new feed type without shipping a new version;
The new version of the A/B platform (including its runtime) has been integrated, enabling dynamic analysis of online experiments.
Whatever our backgrounds, we want to forge the homepage team into an iron army that takes responsibility for the homepage, so that the engineering architecture keeps improving and the engineers keep getting stronger!