The Reportable Conditions Knowledge Management System (RCKMS) is a decision support service that evaluates the reportability of case reports submitted by healthcare providers. Deployed on the Association of Public Health Laboratories (APHL) AIMS platform in the AWS cloud, RCKMS integrates directly with EHR providers, processing case reports in real time in support of the broader nationwide electronic case reporting (eCR) effort, and is currently used by over 11,000 providers. This work requires a close relationship not only with APHL but also with other organizations engaged in eCR, such as the Council of State and Territorial Epidemiologists (CSTE) and the CDC.
In early 2018, the intent to move from what was then a pilot program to nationwide adoption made rapid scaling of the system an imminent necessity. With this on the horizon, HLN, the project's primary code owner, was tasked with redesigning the architecture to take advantage of newly available cloud-native tooling on AWS and to prepare for vastly increased volume and new data management considerations for thousands of onboarding healthcare facilities. To tackle this sizable undertaking, HLN engaged Hivemetric as domain experts to help design and implement the system.
At the time the RCKMS modernization was undertaken, the majority of the application's responsibilities were handled by a single monolithic application. To support future scalability and performance optimization, and to navigate the unique data and security requirements of a split architecture, the primary effort revolved around redesigning the supporting application logic into discrete, well-organized microservices. Because the system processes sensitive protected health information (PHI) within its decision support services, while relying on saturation of data authored and tested in public-facing systems, the architecture had to be deployed in a split model with discrete service organization and a DMZ between service clusters.
The following microservices were isolated to distribute the supporting functional workflows across independently scalable and tunable resources:
CAT (Clinical Authoring Tool) - This serves as the primary utility by which epidemiologists and state and local public health authorities (PHA) author and publish reporting specifications to be run within the decision support architecture. Previously served by the monolithic backend, this application was optimized to be bundled and deployed to an S3 bucket and served as a web application via AWS CloudFront.
MTS (Middle Tier Service) - This application was the core of the previous monolith and handled all related activities, from authoring data to generating rules artifacts to brokering requests to the decision support service. It was stripped down to serve purely as the CRUD service supporting authoring activities, deployed as a Payara microservice, optimized for containerized runtime, supported by AWS RDS for PostgreSQL, and deployed into the authoring cluster on Kubernetes.
DSUS (Data Support Update Service) - This microservice was added to the system architecture to function as a data broker to dependent services. It decouples both data/artifact saturation and publishing activities from MTS, and implements a lightweight data cache on AWS DocumentDB for deterministic data generated from the much heavier data constructs stored in RDS for PostgreSQL. This service handles queue-based publishing activities, invocation of RGS for artifact generation, and brokering of data saturation of artifacts and update references for replication to the decision support architecture by way of S3 and DocumentDB (see the sketch below). It is built in Node, optimized for container runtime on a node-alpine base image, and deployed in the authoring architecture on Kubernetes.
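To make the publish flow concrete, below is a minimal TypeScript sketch of how a queue-driven publish handler of this kind might replicate a compiled artifact to S3 and record a lightweight reference in the DocumentDB cache. The message shape, bucket and collection names, and environment variables are illustrative assumptions rather than the production implementation.

```typescript
// Hypothetical sketch of a DSUS-style publish handler; names and shapes are
// illustrative assumptions, not the production implementation.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { MongoClient } from "mongodb";

const s3 = new S3Client({ region: process.env.AWS_REGION });
const mongo = new MongoClient(process.env.DOCDB_URI ?? "mongodb://localhost:27017");

interface PublishMessage {
  jurisdictionId: string;   // jurisdiction whose rule set is being published
  artifactKey: string;      // S3 key under which the compiled artifact is replicated
  artifactVersion: string;  // deterministic version stamp used for cache lookups
}

// Process one queued publish request: replicate the artifact produced by RGS
// to S3, then upsert a lightweight reference into the DocumentDB cache so that
// downstream services never need to touch the heavier PostgreSQL constructs.
export async function handlePublish(msg: PublishMessage, artifact: Buffer): Promise<void> {
  await s3.send(new PutObjectCommand({
    Bucket: process.env.ARTIFACT_BUCKET!,
    Key: msg.artifactKey,
    Body: artifact,
  }));

  await mongo.connect(); // idempotent; reuses the existing connection pool
  await mongo.db("rckms_cache").collection("artifact_refs").updateOne(
    { jurisdictionId: msg.jurisdictionId },
    { $set: { artifactKey: msg.artifactKey, version: msg.artifactVersion, updatedAt: new Date() } },
    { upsert: true },
  );
}
```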
RGS (Rules Generation Service) - Also previously part of the monolithic application, RGS was isolated to the sole responsibility of processing a slim record invoked by DSUS and compiling a binary payload containing the Drools-based logic for a given jurisdiction's rule set. Built as a Spring Boot service, optimized for container runtimes, and deployed to the authoring cluster on Kubernetes.
RRS (RCKMS Report Service) - An additional microservice added to the system with the sole responsibility of data aggregation and report generation for the various report formats required by RCKMS and jurisdictional administrators. Built in Node, optimized for container runtime on a node-alpine base image, and deployed in the authoring architecture on Kubernetes.
(OUS, DSS, VCS, SS) - The services discussed below are the operating components of the decision support architecture. Given their nature as modular microservices, a single cluster of these services is also deployed within the authoring architecture with a dedicated configuration map, serving as test resources for non-PHI authoring activities.
OUS (OpenCDS Update Service) - This purpose-built microservice singularly handles the saturation and brokering of artifact updates necessary for the Decision Support Service runtime. Running on a cron-based polling routine and isolated as the sole point of runtime communication with AWS S3, OUS serves as the artifact synchronization mechanism for the decision support architecture (a sketch of this polling loop follows below). Built in Node, optimized for container runtime, and deployed in the decision support cluster on Kubernetes.
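As an illustration of that polling pattern, the sketch below lists the replicated artifacts in S3 on a cron cadence and downloads anything modified since the last successful sync. The scheduling package, bucket layout, and local destination are assumptions made for the example only.

```typescript
// Illustrative OUS-style polling loop; schedule, prefix, and paths are assumed.
import cron from "node-cron";
import { S3Client, ListObjectsV2Command, GetObjectCommand } from "@aws-sdk/client-s3";
import { writeFile } from "node:fs/promises";
import path from "node:path";

const s3 = new S3Client({ region: process.env.AWS_REGION });
const BUCKET = process.env.ARTIFACT_BUCKET!;
const LOCAL_DIR = "/opt/artifacts"; // local destination for synced artifacts (hand-off to DSS not shown)
let lastSync = new Date(0);

// Every five minutes, pull down any artifact that changed since the last pass.
cron.schedule("*/5 * * * *", async () => {
  const listed = await s3.send(new ListObjectsV2Command({ Bucket: BUCKET, Prefix: "artifacts/" }));
  const changed = (listed.Contents ?? []).filter(o => o.LastModified && o.LastModified > lastSync);

  for (const obj of changed) {
    const res = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: obj.Key! }));
    const body = Buffer.from(await res.Body!.transformToByteArray());
    await writeFile(path.join(LOCAL_DIR, path.basename(obj.Key!)), body);
  }
  lastSync = new Date();
});
```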
SS (Shared Service) - This service operates as the primary broker of inbound case reports to the RCKMS system. It handles the brokering of necessary data transforms for supported data types, the identification of target jurisdictions, the invocation of the Decision Support Service, and the navigation of reportability results and composition of the RCKMS response document (see the request-flow sketch below). Also previously part of the monolithic application, the Shared Service was rebuilt to run with minimal data requirements, operating off a minimized data cache supported by AWS DocumentDB and decoupled from many of the heavier runtime processes it was previously bound to.
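The rebuilt service's internals are not documented here, but the request flow itself is straightforward; the following is a hedged TypeScript sketch of that flow, with jurisdiction identification elided and every endpoint and field name assumed for illustration.

```typescript
// Hedged sketch of the Shared Service request flow; endpoints, payload shapes,
// and the response format are illustrative assumptions.
interface ReportabilityResult { jurisdictionId: string; reportable: boolean; }

// Minimal HTTP helper around the global fetch available in Node 18+.
async function post(url: string, body: string, contentType: string): Promise<string> {
  const res = await fetch(url, { method: "POST", headers: { "Content-Type": contentType }, body });
  if (!res.ok) throw new Error(`${url} responded ${res.status}`);
  return res.text();
}

export async function processCaseReport(rawCda: string, jurisdictions: string[]): Promise<string> {
  // 1. Broker the transform of the inbound CDA payload to vMR via VCS.
  const vmr = await post("http://vcs/transform", rawCda, "application/xml");

  // 2. Invoke the Decision Support Service for each identified target jurisdiction.
  const results: ReportabilityResult[] = [];
  for (const jurisdictionId of jurisdictions) {
    const raw = await post(`http://dss/evaluate?jurisdiction=${jurisdictionId}`, vmr, "application/xml");
    results.push({ jurisdictionId, reportable: JSON.parse(raw).reportable === true });
  }

  // 3. Navigate the reportability results and compose the RCKMS response document.
  return JSON.stringify({ reportableIn: results.filter(r => r.reportable).map(r => r.jurisdictionId) });
}
```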
VCS (VMR Converter Service) - Also previously part of the monolithic application, VCS was isolated explicitly as a transform service to process inbound payloads from supported clinical data formats into the vMR format for Shared Service and Decision Support processing. Built as a Spring Boot service, optimized for container runtime, and deployed to the decision support cluster on Kubernetes.
DSS (Decision Support Service) - The Decision Support Service implements the open source CDS architecture, OpenCDS, and serves as the runtime execution of the Drools logic produced from authored artifacts against the inbound case report data provided via the Shared Service. Also originally part of the monolith, DSS was isolated to the runtime execution of OpenCDS, supported by artifacts and data generated in the supporting microservices. DSS is saturated continuously with updated artifacts from two sources: the primary is the polling process engaging OUS, and the other is an init container deployed within the DSS pod, which performs a startup synchronization directly from the AWS S3 buckets to which the necessary artifacts are replicated. Built as a Tomcat service, optimized for container runtime, and deployed in the decision support cluster on Kubernetes.
IaC - All clusters, namespaces, StatefulSets, and deployment resources, as well as build triggers and tag management for auto-deployments across our three candidate and development environments, are managed via HashiCorp Terraform repositories. Terraform state for each repository is managed via Terraform Cloud, configured with version control integration to our source repositories on GitHub.
GitOps - All source repositories are configured with branch protection on every branch enabled for automated builds or automated deployments. Each of these repositories requires a number of configured checks to pass and a pull request to be opened before any commits can be merged to a protected branch. These checks include automated unit tests where relevant, code quality analysis via CodeClimate, and mandated code review by one or more parties.
Release Methodology - All production environments are managed through a container mirroring process to AWS Elastic Container Registry, whereby qualified release candidates are tagged with a release-ready flag and mirrored to the production container registry, where they are staged for zero-downtime release via a rolling update strategy.
Deployed Resources - All deployed services are independently scalable and subject to configured HPAs. Additionally, readiness and liveness probes are configured on all resources, enabling automated recovery and pod restarts. Furthermore, for enhanced visibility, wherever a service has multiple log outputs not bound for standard out, sidecar containers are deployed with the pod to consolidate and organize continuous log streaming.
Load Testing - The RCKMS infrastructure handles roughly 250,000 eCRs daily, with a high degree of fluctuation in volume. For this reason, it is imperative that performance testing be conducted at high volume throughout continuous development. To that end, we implemented an ephemeral testing environment that mirrors our production decision support environment. This environment is configured with corresponding machine types for deployed pods and has a JMeter-based testing harness deployed within the cluster to closely replicate the conditions of production runtime. All candidate releases are run through a load testing cycle before being qualified for release.
Monitoring - All logs, pod metrics, and Kubernetes control plane metrics are streamed to a Datadog installation, from which custom dashboards, metrics, and alerting policies are configured for continuous 24/7 monitoring of the RCKMS production systems.
Given the broad dependence on the RCKMS system and the unacceptability of downtime, a unique rollout strategy was devised to validate the new architecture and redesigned system under production load, serving real-world data, while maintaining ongoing availability of the existing production system.
To satisfy this need, Hivemetric, in collaboration with HLN and APHL's hosting provider for the AIMS platform, designed an approach whereby the system could be effectively split-tested with a purpose-built testing harness without any degradation of the operating production environment. Both versions of the architecture were stood up in parallel, and two bespoke microservices were designed to handle this unique workflow. The first, SSWS (Shared Service Wrapper Service), was designed as a data broker to proxy traffic and responses to and from the legacy and modern architectures; the second, SSCS (Shared Service Comparison Service), was designed to aggregate and compare response data and performance metrics from both architectures (a sketch of the proxy pattern follows below). With this in place, the RCKMS team, in collaboration with APHL and the broader eCR community, was able to validate the new architecture for service continuity and response consistency, and to precisely measure the resulting impact of the new architecture on platform performance.
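A minimal sketch of the SSWS proxy pattern is shown below, assuming a Node runtime: each inbound case report is fanned out to both architectures, the legacy (production) response is returned to the caller, and both payloads plus timings are handed to SSCS for comparison. All URLs and field names are illustrative.

```typescript
// Hedged sketch of an SSWS-style split-test proxy; URLs and record shapes are
// illustrative assumptions, not the production contract.
async function timedPost(url: string, body: string): Promise<{ body: string; ms: number }> {
  const start = Date.now();
  const res = await fetch(url, { method: "POST", headers: { "Content-Type": "application/xml" }, body });
  return { body: await res.text(), ms: Date.now() - start };
}

export async function proxyCaseReport(payload: string): Promise<string> {
  // Fan the request out to both stacks in parallel so they see identical load.
  const [legacy, modern] = await Promise.all([
    timedPost("http://legacy-shared-service/evaluate", payload),
    timedPost("http://modern-shared-service/evaluate", payload),
  ]);

  // Fire-and-forget the comparison record; SSCS aggregates response consistency
  // and performance metrics across both architectures.
  void fetch("http://sscs/compare", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ legacy, modern, receivedAt: new Date().toISOString() }),
  }).catch(() => { /* comparison failures must never affect production traffic */ });

  // The caller always receives the legacy (production) response.
  return legacy.body;
}
```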
Here are some statistics reflecting the positive impact of RCKMS modernization activities:
In partnership with the Centers for Disease Control and Prevention (CDC), the new nationwide Electronic Case Reporting (eCR) system was expanded from one pilot site to over 11,000 healthcare facilities between 2019 and 2022, with upwards of 1 million HL7 CDA messages now being processed daily.
The microservices re-architecture of the decision support application resulted in a 77% reduction in average response time compared with the legacy architecture.
In conjunction with CDC and CSTE, the application user base was expanded from seven to 70+ public health departments, a 1,300% increase in the number of application users, alongside an expansion from 3 to 35 Health IT vendor partners.
Reduced runtime-dependent data transaction volume by 98% compared with the previous operating architecture.
Supported the adoption of evolving healthcare standards, such as the new HL7 eICR CDA standard and HL7 FHIR standards.