GST System Architecture Principles
GST system is a Government program built as critical national IT infrastructure. It needs to sustain openness in the long run, and a program of this scale has never been attempted before. GST system shall be built on the following core principles:
2.1 Platform Approach
GST system will be built as a platform. This means that GST system will be built entirely with open APIs from day one, and the system features can be accessed via any user interface (internal or 3rd party applications) that works on top of these APIs. Hence the GST system is envisaged as a faceless system with 100% API driven architecture at the core of it. GST portal will be one such application on top of these APIs, rather than being fused into the platform as a monolithic system.
It is critical that a platform-based approach is taken for any large-scale application development, to ensure adequate focus and resources on issues related to scalability, security and data management. Building an application platform with reusable components or frameworks across the application suite provides a mechanism to abstract all necessary common features into a single layer. As described in the earlier section, open APIs designed for both internal and external use form the core design mechanism to ensure openness, a multi-user ecosystem, independence from any specific vendor or system, and most importantly a choice for tax payers and other ecosystem players of innovative applications on various devices (mobile, tablet, etc.) that are built on top of these APIs.
Adoption of open APIs, open standards and, wherever prudent, open source products is of paramount importance for the system. This will ensure that the system is lightweight, scalable and secure. Openness comes from the use of open standards and the creation of vendor neutral APIs and interfaces for all components. All the APIs will be stateless. Data access must always be through APIs; no application will access data directly from the storage layer or data access layer. Internal data access (access between various modules) will also go through APIs, with no direct access.
Whenever options are available, open source frameworks/components shall be used instead of proprietary frameworks/components. Use of proprietary products/components/frameworks must be via open APIs (if publicly available APIs do not exist, the MSP shall be responsible for creating vendor neutral APIs before any proprietary system can be used). Use of open source is critical to ensure that national critical infrastructure like GSTN is secure and independent of the impact of changes in political relations with other countries.
Use of open APIs addresses two primary goals: loose coupling of components, allowing independent evolution of each component without affecting the others, and a vendor/provider neutral layer, allowing use of one or more providers and replacement of a system component with another without affecting other parts of the system. In addition to the above goals, an API-driven approach allows test automation for automated regression testing, continuous re-factoring and tuning within an implementation, and better component level versioning and lifecycle management.
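The two goals above can be illustrated with a minimal Python sketch. It is not part of the GST specification: the `InvoiceStore` interface, the two stand-in providers and the `contract_check` function are all hypothetical names chosen for illustration. Callers depend only on the vendor-neutral interface, and a single regression check written against that interface is reused unchanged for every provider.

```python
from typing import Protocol


class InvoiceStore(Protocol):
    """Hypothetical vendor-neutral API; callers depend only on this."""

    def put(self, invoice_id: str, payload: bytes) -> None: ...

    def get(self, invoice_id: str) -> bytes: ...


class DictBackedStore:
    """Stand-in for provider A (keeps items in a dictionary)."""

    def __init__(self) -> None:
        self._items: dict[str, bytes] = {}

    def put(self, invoice_id: str, payload: bytes) -> None:
        self._items[invoice_id] = payload

    def get(self, invoice_id: str) -> bytes:
        return self._items[invoice_id]


class ListBackedStore:
    """Stand-in for provider B with a different internal layout."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, bytes]] = []

    def put(self, invoice_id: str, payload: bytes) -> None:
        # Drop any earlier row for this id so the newest payload wins.
        self._rows = [(k, v) for k, v in self._rows if k != invoice_id]
        self._rows.append((invoice_id, payload))

    def get(self, invoice_id: str) -> bytes:
        for key, value in self._rows:
            if key == invoice_id:
                return value
        raise KeyError(invoice_id)


def contract_check(store: InvoiceStore) -> bool:
    """Regression check written once against the API, reused per provider."""
    store.put("INV-1", b"first")
    store.put("INV-1", b"second")  # an overwrite must replace the old value
    return store.get("INV-1") == b"second"
```

Swapping `DictBackedStore` for `ListBackedStore` requires no change in calling code, which is the replace-ability property the paragraph describes.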
2.3 No Vendor lock-in and Replace-ability
a) Software vendor neutrality
As per GOI policy on adoption of open source software, GSTN shall prefer open source software (OSS) over closed source software (CSS). Specific OEM products may only be used when necessary to achieve scale, performance and reliability. Every such OEM component/service/product/framework/MSP pre-existing product or work must be wrapped in a vendor neutral API so that the OEM product can be replaced at any time without affecting the rest of the system. In addition, to ensure the system is not locked in to a single vendor implementation, an OEM product may be used only if at least 2 independent OEM products are available that use the same standard/API.
b) Use of commodity hardware
Commodity technology refers to hardware technologies that are completely commoditized and available from a variety of providers. Applications that are built on such technologies are built on commodity computing architecture. With GST system needing to handle hundreds of billions of invoices and millions of tax payers, this is one of the most important architecture decisions.
Applications that are architected to use only commodity hardware benefit fully from using the best technologies at cost-effective rates and are not tied to a proprietary, vendor specific technology. Large e-Governance applications should always choose this instead of custom alternatives. Such applications also benefit most when technology evolves at a rapid pace.
GST System will be built completely on open commodity hardware and scaled using several blade/rack servers on the x86 platform. Such an open scale-out architecture allows GSTN to procure the latest blade/rack servers from any vendor at the best price whenever required. Similarly, the storage layer will not depend on any specialized hardware and will take advantage of heterogeneous storage arrays from multiple vendors. Network backbone and other hardware deployed in GSTN data centres will be based on open standards, with multiple vendors capable of providing them at competitive rates.
2.4 Security and Privacy
The system will ensure privacy and data integrity and must disseminate data to authenticated and authorized users only (both internal and external users). Security and privacy of data within GST system is foundational and is clearly reflected in GSTN’s strategy, design and processes throughout the system. The system must implement various measures to secure tax data, including strong end-to-end encryption of sensitive data, encryption based on strong national PKI standards, use of HSM (Hardware Security Module) appliances, physical security, access control, network security, a stringent audit mechanism, 24×7 monitoring, and measures such as data partitioning and data encryption.
It is very important that all personal and tax data collected for the purpose of GST is given significant protection across GST system and its ecosystem. It has to be ensured that tax data is handled with the utmost care within GSTN's own and partner domains and that its handling follows the major principles of data privacy/protection recognized by countries that have already enacted such laws. Internally, data must be protected from all threats and must be kept confidential.
Measures such as anti-spoofing (no one should be able to masquerade for inappropriate access), anti-sniffing (no one should be able to get data and interpret it) and anti-tampering (no one should be able to put/change data which was not meant to be put/changed) should be in place for data in transit, as well as data at rest, against internal and external threats.
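As a minimal sketch of the anti-tampering idea only, the Python example below attaches an HMAC-SHA256 tag to a message so that the receiver can detect any change made in transit or at rest. The function names and the hard-coded key are assumptions for illustration; the source calls for PKI and HSM appliances, which this sketch does not model, and anti-sniffing additionally requires encryption, which is not shown here.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Return an HMAC-SHA256 tag for the message under a shared secret key."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Check the tag using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), tag)


# Illustrative key only; a real deployment would keep keys in an HSM.
DEMO_KEY = b"demo-key-not-for-production"

payload = b'{"invoice_id": "INV-1", "amount": 1000}'
tag = sign(DEMO_KEY, payload)

untouched_ok = verify(DEMO_KEY, payload, tag)
tampered_ok = verify(DEMO_KEY, payload.replace(b"1000", b"9000"), tag)
```

Here `untouched_ok` is true and `tampered_ok` is false, because any change to the payload produces a different tag.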
For achieving massive scale it is critical that technology choices are kept simple, open, multi-vendor, and standards based. The following are key considerations that need to be followed at the architecture level from the beginning to ensure that the technology scales:
a. Loose coupling through open stateless API and messaging
The system design shall be modular with clear separation of concerns at data storage, service and API layer. Adoption of open standards shall work towards the singular goal of interoperability.
Because GST system is conceived as a ‘common platform’ on which many applications will be built/interfaced, it is critical that all 3rd party interfaces be fully interoperable without any affinity to platforms, programming languages, or network technologies. Such open interoperability is an absolute requirement for GST system to be widely adopted as a national tax platform.
In addition, even within the GST solution, all components must be loosely coupled using open interfaces (APIs) ensuring interoperability across components and subsystems. Also, given the fact that there are tax systems managed by 3rd party agencies, it is critical that any solution that is certified is able to interoperate with GST solution seamlessly.
Whenever the logic is long running (potentially taking hours or days), it is critical that it is broken into small components that are wired together through an asynchronous workflow. Such a design allows each component to do its job fast, release resources, and handle failures at the micro level. It also allows each of these components to run across a cluster of machines and scale horizontally. However, such an asynchronous design requires each component to be exposed through a published open API and the components to be loosely coupled through a messaging layer. Such an API wrapped, black-box style approach also allows component level tuning and re-factoring to achieve the required performance and scale. The system shall be scalable by adding computer hardware; it shall not have any scalability constraint at the application level.
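The asynchronous workflow described above can be sketched in Python using in-process queues as a stand-in for a messaging layer. Everything named here (`run_stage`, `validate`, `classify`, `run_pipeline`, the invoice fields) is hypothetical, and a real deployment would use a distributed message broker and handle failures per message, which this sketch omits.

```python
import queue
import threading


def run_stage(handler, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Consume messages until the None sentinel arrives, then pass it on."""
    while True:
        message = inbox.get()
        if message is None:
            outbox.put(None)
            return
        outbox.put(handler(message))


def validate(invoice: dict) -> dict:
    """First small component: mark whether the invoice amount is positive."""
    return {**invoice, "valid": invoice["amount"] > 0}


def classify(invoice: dict) -> dict:
    """Second small component: derive a status from the validation result."""
    status = "accepted" if invoice["valid"] else "rejected"
    return {**invoice, "status": status}


def run_pipeline(invoices: list[dict]) -> list[dict]:
    """Wire the two components together through queues and collect results."""
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=run_stage, args=(validate, q_in, q_mid)),
        threading.Thread(target=run_stage, args=(classify, q_mid, q_out)),
    ]
    for worker in workers:
        worker.start()
    for invoice in invoices:
        q_in.put(invoice)
    q_in.put(None)  # sentinel: no more work
    for worker in workers:
        worker.join()
    results = []
    while (item := q_out.get()) is not None:
        results.append(item)
    return results
```

Each stage knows only its inbox and outbox, so a stage can be tuned, replaced or run on more machines without the other stage changing.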
b. Data partitioning and parallel processing
GST system functionality naturally lends itself to a massively parallel and distributed system. For linear scaling, it is essential that the entire system is architected to work in parallel within and across machines with appropriate data and system partitioning. Considering that GST system will need to handle hundreds of billions of invoices and returns over the next few years, data partitioning (or sharding) is integral to ensuring that, as data and volume grow, the system can continue to scale without bottlenecks at the data access level. The choice of appropriate data sources such as RDBMS, Hadoop, NoSQL data stores, distributed file systems, etc. must be made to ensure there is absolutely no “single point of bottleneck” anywhere in the system, including at the database and system level, so that it can scale linearly using commodity hardware.
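One common way to partition data is to hash a partition key and map the hash to a shard. The Python sketch below assumes a hypothetical `taxpayer_id` field as the partition key; the source does not prescribe a key or a partitioning scheme. A cryptographic hash is used rather than Python's built-in `hash()` so that the mapping is stable across processes and machines.

```python
import hashlib
from collections import defaultdict


def shard_for(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard number in [0, shard_count)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def partition(invoices: list[dict], shard_count: int) -> dict[int, list[dict]]:
    """Group invoices by shard so each group can be processed in parallel."""
    shards: dict[int, list[dict]] = defaultdict(list)
    for invoice in invoices:
        shards[shard_for(invoice["taxpayer_id"], shard_count)].append(invoice)
    return dict(shards)
```

With simple modulo hashing, changing `shard_count` remaps most keys; schemes such as consistent hashing reduce that movement and may be worth considering when shards are added over time.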
c. Horizontal scale for compute, network and storage
GST system architecture must ensure that all components, including compute, network and storage, scale horizontally so that additional resources (compute, storage, network etc.) can be added as and when needed to achieve the required scale. This also ensures that capital investments can be made only when required. Given the significance of the GST system, it is important that the scalability of the system be measurable and demonstrable before Go-Live.
2.6 Manageability and Lights-out Operation
GST system is expected to handle millions of registrations, leading up to billions of ‘returns submission’ and invoice management transactions. It is inevitable that in such a large-scale compute environment something or other fails regularly, be it a hardware failure, a network outage, or a software crash. Assuming otherwise (that nothing fails) is naive, and it is essential that the application architecture handles these failures properly, is resilient to failures, has the ability to restart, and keeps human intervention minimal.
For complete lights-out operation, all layers of the system, such as application and infrastructure, must be managed through automation and proactive alerting rather than by hundreds of people managing them manually.
The entire application must be architected in such a way that every component of the system is monitored in a non-intrusive fashion (without affecting the performance or functionality of that component) and business metrics are published in a near real-time fashion. This allows data centre operators to be alerted proactively in the event of system issues and highlight these issues on a Network Operations Centre (NoC) at a granular level. The solution should be envisaged to utilize various tools and technologies for management and monitoring services. There should be management and monitoring tools to maintain the SLAs.
Application architecture shall also allow specific components to be watched very closely through a component level debugging scheme. Such debug logging shall be limited to specific components and to a very short time, so as to enable the engineering team to analyze and troubleshoot any specific issue arising in production. Every service should have the capability to enable such features at runtime and shut them off after collecting detailed log data. These logs/events must be analyzed via automated tools for appropriate action by the operators.
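A component level debugging scheme of this kind can be sketched with Python's standard `logging` module, whose named loggers form a hierarchy with independent levels. The helper `temporary_debug` and the component names `gst.returns` and `gst.invoices` are hypothetical; a production service would more likely flip the level through an administrative API with an expiry time rather than a code block.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_debug(component: str):
    """Enable DEBUG logging for one component, then restore its old level."""
    logger = logging.getLogger(component)
    previous_level = logger.level
    logger.setLevel(logging.DEBUG)
    try:
        yield logger
    finally:
        # Always switch detailed logging off again, even on errors.
        logger.setLevel(previous_level)


# Default for the whole (hypothetical) application: warnings and above only.
logging.getLogger("gst").setLevel(logging.WARNING)

with temporary_debug("gst.returns") as returns_log:
    # Only this component emits debug records during the block.
    returns_log.debug("detailed trace for troubleshooting")
```

Other components, such as `gst.invoices` in this sketch, keep inheriting the quieter default level throughout, so the extra logging cost is confined to the component under investigation.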
There should be a dashboard within the NoC that gives the administrator a visualization of the overall platform and IT infrastructure and can produce reports in an automated fashion showing various performance metrics. In addition, the skills and training necessary for administrators and the technical support team should be planned well in advance of the launch of the system.
The system must have appropriate measures to ensure processing reliability for the data received or accessed through the solution. As this is a very crucial system and the data are highly sensitive, data transfer and data management should be reliable in order to keep the confidence of the stakeholders. As the system will be API driven, the APIs built by both internal and external authorities should be subjected to performance and security checks to increase reliability.
It will be necessary that the following issues be taken care of properly:
a. Prevent processing of duplicate incoming files/data
b. Zero loss of data (data already saved/data at rest should also not be lost)
c. Unauthorized access and alteration to the data uploaded in the GST system shall be prevented.
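Item (a) above, preventing duplicate processing, is often handled by fingerprinting each incoming payload and skipping any fingerprint already seen. The Python sketch below is one such approach under stated assumptions: the class name `DedupingReceiver` is hypothetical, and the in-memory set stands in for what would need to be a durable, shared store in a multi-instance deployment.

```python
import hashlib


class DedupingReceiver:
    """Accept each distinct payload once, based on its SHA-256 fingerprint."""

    def __init__(self) -> None:
        self._seen: set[str] = set()
        self.accepted: list[bytes] = []

    def receive(self, payload: bytes) -> bool:
        """Return True if the payload was new and accepted, else False."""
        fingerprint = hashlib.sha256(payload).hexdigest()
        if fingerprint in self._seen:
            return False  # duplicate upload: do not process again
        self._seen.add(fingerprint)
        self.accepted.append(payload)
        return True
```

Because the check depends only on content, a client that retries an upload after a timeout does not cause the same file to be processed twice.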
The solution design and deployment architecture will ensure that the application can be deployed in a multi-DC active-active environment offering system High Availability and failover.
The solution should meet the following availability requirements:
a) Load balanced across two or more servers within one data center and across multiple data centers in an active-active fashion, avoiding any single point of failure
b) Deployment of any number of service instances should be possible within and across data centers to meet the scale
c) Ability to deploy application instances in heterogeneous multi-vendor hardware within the same cluster to ensure newer hardware can be added within same cluster to meet the scale without having to change all machines to uniform configuration
d) Distributed and load balanced implementation of application to ensure that availability of services is not compromised at any failure instance
e) GST system should provide a minimum of 99.9% uptime.
f) RPO of zero (no loss of source of truth data that cannot be reconstructed) and availability of data across the active-active multi-DC environment to ensure services can run from anywhere
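To make the 99.9% uptime figure in item (e) concrete, the short Python calculation below converts an availability percentage into a downtime budget. It assumes uptime is measured over the whole calendar period with no exclusions for planned maintenance; the source does not state the measurement window, and the function name is hypothetical.

```python
def downtime_budget_minutes(availability_percent: float, period_hours: float) -> float:
    """Maximum downtime, in minutes, allowed by an availability target."""
    return (1 - availability_percent / 100) * period_hours * 60


per_year = downtime_budget_minutes(99.9, 365 * 24)   # about 525.6 minutes
per_month = downtime_budget_minutes(99.9, 30 * 24)   # about 43.2 minutes
```

Under these assumptions, 99.9% uptime allows roughly 8.76 hours of downtime in a 365-day year, or about 43 minutes in a 30-day month.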
An important aspect of meeting the above availability criteria would be the creation of Standard Operating Procedures (SOPs) for system upgrades, maintenance and other procedural needs.
2.9 Data Driven Decision Making
All decision making in the system shall be driven by data and not by assumptions. Much more metadata needs to be attached to data such as invoices so that appropriate decisions can be taken. The system shall carry enough meta tags that the time taken by various functions, such as capturing or entering data, is recorded and the behaviour of the system can be verified. This would help to ensure that quality is measured systematically and that feedback is given to address any specific issues that are identified.
A large multi-provider ecosystem created by GST system can only be managed efficiently by measuring process data at a high degree of granularity, creating well defined metrics from this process data, and creating a feedback loop through which these insights and learnings are shared back with the ecosystem for continuous improvement. When working with 3rd party organizations that are part of the ecosystem, it is essential that the entire system is measured using data and that decisions are made completely on the basis of data.
Thus the objectives for GST system are as below, keeping in mind the large ecosystem:
a. Drive decision making based on data analytics: The analytics module within GST system shall be such that stakeholders can easily include data and insights in their operations on a regular basis. Processes must be in place to drive a feedback loop to the overall organization, including the partner ecosystem, to drive continuous improvement. Every transaction and event, including those of the system administrators, should be fed into BI so that decisions can be taken accordingly.
b. Empower self-improvement: The analytics function shall also help stakeholders to improve by themselves. Tools, data and platform shall be created to be able to help stakeholders analyze their own performance and operational metrics themselves.
2.10 Reconstruction of truth
System should NOT allow database/system administrators to make any changes to data. It should ensure that the data and files (data at rest) kept in the system are tamper resistant and that the source of truth (original data of invoices and final returns) can be used to reconstruct derived data such as ledgers and system generated returns. System should be able to detect any data tampering by matching hash values and should be able to reconstruct the truth.
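The two ideas in this section, detecting tampering by matching hash values and rebuilding derived data from the source of truth, can be sketched in Python as follows. The record fields (`taxpayer_id`, `tax_paise`), the function names and the ledger shape are hypothetical, and the sketch assumes the reference hashes are stored somewhere that administrators of the data store cannot alter.

```python
import hashlib
import json


def fingerprint(record: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def find_tampered(records: list[dict], stored_hashes: list[str]) -> list[int]:
    """Return positions of records whose current hash differs from the stored one."""
    return [
        index
        for index, (record, expected) in enumerate(zip(records, stored_hashes))
        if fingerprint(record) != expected
    ]


def rebuild_ledger(invoices: list[dict]) -> dict[str, int]:
    """Recompute a derived per-taxpayer total purely from source invoices."""
    ledger: dict[str, int] = {}
    for invoice in invoices:
        taxpayer = invoice["taxpayer_id"]
        ledger[taxpayer] = ledger.get(taxpayer, 0) + invoice["tax_paise"]
    return ledger
```

If a derived ledger is ever in doubt, it can be discarded and recomputed with `rebuild_ledger` from invoices that pass the `find_tampered` check.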
Education Guide on Goods & Service Tax (GST)