High availability is critical to any successful business operation. To ensure that requests are still processed when a failure occurs, FME Server supports configuring high availability at multiple levels of an integrated system. FME Server supports high availability in three ways: component recovery, translation recovery, and fail-over architectures.
FME Server comes out of the box with component recovery. This means that, even on a single system, FME Server monitors components and can restart those that become unresponsive, including the FME Engines and the FME Server Core. This is achieved through the FME Server Process Monitor and its configuration. Because FME Server monitors its own components, a single unresponsive component does not have to take the whole system down.
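The monitoring idea described above can be sketched in a few lines. This is only an illustration of the general watchdog pattern, not the FME Server Process Monitor itself: the function `monitor_once`, the health-check callables, and the `restart` callback are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List


def monitor_once(
    components: Dict[str, Callable[[], bool]],
    restart: Callable[[str], None],
) -> List[str]:
    """Check each component once and restart those that do not respond.

    Hypothetical sketch: `components` maps a component name to a health check.
    """
    restarted = []
    for name, is_responsive in components.items():
        if not is_responsive():
            restart(name)
            restarted.append(name)
    return restarted


# Simulated component state (assumption: True means "responsive").
state = {"FME Server Core": True, "FME Engine 1": False}


def restart_component(name: str) -> None:
    # A real monitor would relaunch a process; here we only flip a flag.
    state[name] = True


checks = {name: (lambda n=name: state[n]) for name in state}
first_pass = monitor_once(checks, restart_component)
second_pass = monitor_once(checks, restart_component)
```

In this simulation the first pass restarts only the unresponsive engine, and the second pass finds nothing left to restart.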
In addition to component recovery, FME Server can restart a translation (job) when a crash occurs. FME Server continues to resubmit a translation up to the configured number of attempts. This ensures that jobs that hit a temporary issue, such as a network hiccup, are resubmitted and run again. Translation Recovery is configurable and can be turned off entirely. Learn more about it in FME Server's Administration Guide: Job Recovery Configuration.
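The resubmission behaviour can be illustrated with a minimal sketch. It assumes a simple "retry until a limit is reached" loop; `run_with_recovery`, `TransientError`, and `make_flaky_job` are hypothetical names for this example and do not correspond to FME Server settings or APIs.

```python
from typing import Callable, Optional, Tuple


class TransientError(Exception):
    """Hypothetical stand-in for a temporary failure such as a network hiccup."""


def run_with_recovery(
    job: Callable[[], str], max_attempts: int
) -> Tuple[Optional[str], int]:
    """Run `job`, resubmitting after a failure until `max_attempts` is reached.

    Returns (result, attempts_used); result is None if every attempt failed.
    A limit of 1 means the job is never resubmitted (recovery turned off).
    """
    limit = max(1, max_attempts)
    attempts = 0
    while attempts < limit:
        attempts += 1
        try:
            return job(), attempts
        except TransientError:
            continue  # resubmit the same job
    return None, attempts


def make_flaky_job(failures_before_success: int) -> Callable[[], str]:
    """Build a job that fails a fixed number of times, then succeeds."""
    calls = {"count": 0}

    def job() -> str:
        calls["count"] += 1
        if calls["count"] <= failures_before_success:
            raise TransientError("simulated network hiccup")
        return "translation complete"

    return job


recovered = run_with_recovery(make_flaky_job(2), max_attempts=3)
gave_up = run_with_recovery(make_flaky_job(2), max_attempts=2)
```

With three attempts allowed, the job that fails twice succeeds on its third run; with only two attempts allowed, the same job is abandoned.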
The goal of a fail-over environment is to remove single points of failure so that a component can fail without taking the system offline. FME Server supports two approaches to fail-over: Active/Passive and Active/Active.
We typically recommend the Active/Passive architecture, which meets the needs of most of our customers. Both approaches have pros and cons.
With the Active/Passive fail-over approach, when the Active system fails, the Passive system takes over its capabilities and assumes the Active role. The failed system is placed into Passive mode and can be investigated while the new Active system keeps FME Server running. Once the failed system is recovered, it remains in the Passive role until the Active system fails again. A heartbeat between the Active and Passive systems ensures that fail-over occurs when the Active system stops responding. The failures that typically cause this type of fail-over are hardware or OS crashes, in which the primary system goes down completely.
In the Active/Passive architecture, the Web Applications, FME Server Engines, and FME Server Database Repository are physically separated. Web Application failover covers the FME Server Web Applications / Web Services and is configured within the third-party software, not within FME Server's configuration. Similarly, the FME Server Database Repository sits on a separate system, and its fault tolerance is configured outside of FME Server's Fault Tolerance configuration. The FME Server Engines are discussed below.
Any translations running on the Active FME Server Core at the time of failover are lost; however, once failover to the Passive system completes, those translations are resubmitted.
More information on Active/Passive Fault Tolerance can be found here.
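The Active/Passive role swap can be modelled with a small simulation. This is a sketch under stated assumptions, not FME Server's fail-over mechanism: the classes `Node` and `FailoverPair`, the `missed_limit` threshold, and the server and job names are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    role: str  # "active" or "passive"
    healthy: bool = True


@dataclass
class FailoverPair:
    active: Node
    passive: Node
    missed_limit: int = 3  # assumption: missed heartbeats before failing over
    missed: int = 0
    running_jobs: List[str] = field(default_factory=list)
    resubmitted: List[str] = field(default_factory=list)

    def heartbeat(self) -> None:
        """One heartbeat tick: count misses and fail over at the limit."""
        if self.active.healthy:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.missed_limit:
            self._fail_over()

    def _fail_over(self) -> None:
        # Swap roles: the passive node takes over, the failed node becomes passive.
        self.active, self.passive = self.passive, self.active
        self.active.role, self.passive.role = "active", "passive"
        # Jobs that were running on the failed core are resubmitted.
        self.resubmitted.extend(self.running_jobs)
        self.running_jobs.clear()
        self.missed = 0


pair = FailoverPair(
    Node("server-a", "active"),
    Node("server-b", "passive"),
    running_jobs=["job-1", "job-2"],
)
pair.active.healthy = False       # simulate a hardware or OS crash
for _ in range(3):
    pair.heartbeat()              # third missed heartbeat triggers fail-over

pair.passive.healthy = True       # the failed node is repaired...
pair.heartbeat()                  # ...but it stays in the passive role
```

After the simulated crash, `server-b` holds the active role, both jobs have been resubmitted, and the repaired `server-a` stays passive until the next failure.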
In the Active/Active failover architecture, there are two or more duplicate systems, all capable of the same functionality, and a load balancer directs incoming traffic to one of the available systems. The FME Server Core, Java Web Application Server, FME Engines, and FME Server Database Repository all reside on the same system. Additional systems are configured identically, and each request is handled independently by exactly one system. A hardware failure causes the loss of any translation running on that system, but new requests are directed to another system. This approach works well in an Amazon Web Services environment, in which machines can be cloned easily to expand capacity.
It is important to understand that any translations running on a system that fails are lost until the failed system is brought back online. This is because each FME Server Core has a separate queue for translation requests, so the other systems in the environment do not know about failed or pending translations on another system.
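The consequence of separate queues can be shown with a toy model. The sketch below assumes a round-robin load balancer, which is only one possible policy; `System`, `LoadBalancer`, and the system and job names are hypothetical and do not describe FME Server internals.

```python
from collections import deque
from typing import Deque, List


class System:
    """One complete, independent installation with its own job queue."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.online = True
        self.queue: Deque[str] = deque()  # not shared with other systems


class LoadBalancer:
    def __init__(self, systems: List[System]) -> None:
        self.systems = list(systems)
        self._next = 0

    def submit(self, job: str) -> str:
        """Send the job to the next online system (round-robin assumption)."""
        for _ in range(len(self.systems)):
            system = self.systems[self._next]
            self._next = (self._next + 1) % len(self.systems)
            if system.online:
                system.queue.append(job)
                return system.name
        raise RuntimeError("no systems available")

    def reachable_jobs(self) -> List[str]:
        """Jobs held by systems that are currently online."""
        return [job for s in self.systems if s.online for job in s.queue]


fme_1, fme_2 = System("fme-1"), System("fme-2")
balancer = LoadBalancer([fme_1, fme_2])

first = balancer.submit("job-a")     # goes to fme-1
second = balancer.submit("job-b")    # goes to fme-2
fme_1.online = False                 # fme-1 fails with job-a in its queue
third = balancer.submit("job-c")     # redirected to fme-2
during_outage = balancer.reachable_jobs()

fme_1.online = True                  # job-a is visible again after recovery
after_recovery = balancer.reachable_jobs()
```

While `fme-1` is down, `job-a` is invisible to the rest of the environment because no other system shares its queue; new work still flows to `fme-2`.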
In a Fault Tolerant environment, the FME Server Engines are installed on separate physical (or virtual) systems to create redundancy and protect against hardware or OS failures. Because FME Server can be configured to run multiple engines concurrently, the engines can be split across several systems. In addition, FME Server Engine tasks can be controlled by Job Routing, a feature for reserving engines to run certain jobs. For example, long-running jobs can be assigned to specific engines, freeing up other engines to run shorter jobs. This configuration avoids a situation in which all your engines are tied up running long jobs while small jobs queue up.
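The effect of reserving engines can be sketched as a simple lookup. This is an illustrative model only: the `route_job` function, the "long" and "short" tags, and the engine names are assumptions made for the example, not FME Server's Job Routing configuration.

```python
from typing import Dict, Optional, Set


def route_job(
    job_tag: str, engines: Dict[str, Set[str]], busy: Set[str]
) -> Optional[str]:
    """Return the first idle engine reserved for `job_tag`, or None to keep the job queued."""
    for engine in sorted(engines):
        if job_tag in engines[engine] and engine not in busy:
            return engine
    return None


# Hypothetical reservation: one engine for long jobs, two for short jobs.
engines = {
    "engine-1": {"long"},
    "engine-2": {"short"},
    "engine-3": {"short"},
}
busy: Set[str] = set()

long_1 = route_job("long", engines, busy)    # takes the only long-job engine
busy.add(long_1)
long_2 = route_job("long", engines, busy)    # no idle long-job engine: stays queued
short_1 = route_job("short", engines, busy)  # short jobs are unaffected
```

The second long job waits in the queue instead of occupying an engine reserved for short jobs, so short jobs keep running.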
As mentioned briefly in the Active/Passive section above, depending on the architecture of the environment, the Web Interface that users log into can fail over seamlessly to the Passive system. However, the configuration of the Web Application is not part of FME Server's failover configuration. FME Server supports both Apache Tomcat and Oracle WebLogic, so failover must be configured within those applications.