Westpac has spent the past three months running a proof-of-concept to set up Splunk as a single enterprise monitoring platform, and has already used it to troubleshoot a “high priority incident”.
Senior service engineer Jon Leaver told the Sydney Splunk user group that Splunk was already used for security information and event management (SIEM) and by a handful of “specialised Linux teams” on the infrastructure side.
However, Leaver saw an opportunity to expand the use of Splunk to streamline issue management between the bank’s various infrastructure teams.
Presently, each infrastructure team runs its own alerting and management tools, but there is no central view across these tools and alerts.
“Westpac is already a pretty big Splunk customer except only in the SIEM space. We didn’t have any infrastructure monitoring at all – no Windows event logs, no Red Hat logs, nothing at all – that was getting ingested to Splunk, and no one had ever done it before at Westpac, so that’s why our team got some exposure to it,” Leaver said.
“We have a number of infrastructure teams and all the teams use different alerting tools for platforms so SCOM [System Center Operations Manager] for Windows, Nagios for Linux, SCCM [System Center Configuration Manager] – there’s many different platforms that we use, and they’re all very specific and all closed off.
“If you needed access to one, you’d have to go through three different approval processes.”
Support teams saw “a lot” of noise when system issues arose, with multiple tickets filed in quick succession for the same issue.
“Just say for example, you get a CPU that’s going 99 percent [utilisation], sometimes you’ll have five tickets in five seconds, and it’s not like there’s any smarts about that ticket,” Leaver said.
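The ticket noise Leaver describes is commonly tackled by suppressing repeat alerts for the same host and check inside a time window. The following is a minimal sketch of that idea, not Westpac's or Splunk's implementation; the `Alert` fields, the `suppress_duplicates` helper and the 60-second window are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str         # hypothetical field: machine that raised the alert
    check: str        # hypothetical field: name of the failing check
    timestamp: float  # seconds since an arbitrary epoch


def suppress_duplicates(alerts, window_seconds=60.0):
    """Return only the alerts that should open a ticket.

    A repeat of the same (host, check) pair is dropped if it arrives
    less than window_seconds after the last alert that was ticketed.
    """
    last_ticketed = {}
    tickets = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.host, alert.check)
        previous = last_ticketed.get(key)
        if previous is None or alert.timestamp - previous >= window_seconds:
            tickets.append(alert)
            last_ticketed[key] = alert.timestamp
    return tickets


# Five CPU alerts in five seconds for one host, plus one for another host.
burst = [Alert("web01", "cpu_high", float(t)) for t in range(5)]
burst.append(Alert("web02", "cpu_high", 2.0))
tickets = suppress_duplicates(burst)
print([(t.host, t.timestamp) for t in tickets])
```

With this kind of rule, the burst from one host opens a single ticket, while an alert from a different host still gets its own.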
“It took a really long time to troubleshoot a production issue. And especially in banking, time can be of the essence.”
In addition, Leaver noted it was a “lost opportunity” having alert data locked up in individual systems, instead of being made available in a single dashboard where alerts from different systems could be correlated to build a better understanding of an incident or issue.
This led the bank to set up a proof-of-concept to broaden its use of Splunk into a single enterprise-wide monitoring platform for infrastructure operations and run teams, as well as level one to three support.
Leaver said the purpose was not only a “one-to-one migration” from existing alert management tools to Splunk, but also to enable additional features and add value to the troubleshooting process.
The proof-of-concept was run using the bank’s existing Splunk infrastructure – Leaver noted the team tipped off its account managers to the projected additional use.
The proof “captures data from a variety of sources i.e. security logs, application logs, allowing improved analysis and troubleshooting opportunities as well as business insights i.e. [to] correlate application performance logs with underlying infrastructure performance logs”, leading to “reduce[d] outage times”, the bank noted in a slide deck.
Leaver said that the proof-of-concept offered “intuitive troubleshooting”. At a simple level, he was able to use event codes in Splunk to determine when a server rebooted, and then dig into what triggered the reboot.
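The article does not say which event codes were used. As an illustration only, the sketch below assumes Windows System log events, where IDs 1074, 6005, 6006 and 6008 are commonly associated with shutdowns and restarts. The `reboot_timeline` helper and the event dictionary layout are hypothetical.

```python
# Windows System log event IDs often used to trace reboots (an assumption;
# the article does not name the codes Westpac searched on).
REBOOT_EVENT_CODES = {
    1074: "shutdown or restart initiated by a user or process",
    6005: "Event Log service started (system boot)",
    6006: "Event Log service stopped (clean shutdown)",
    6008: "previous shutdown was unexpected",
}


def reboot_timeline(events):
    """Group reboot-related events by host, ordered by time.

    events is an iterable of dicts with hypothetical keys
    'time' (number), 'host' (str) and 'event_code' (int).
    """
    timeline = {}
    for event in sorted(events, key=lambda e: e["time"]):
        code = event["event_code"]
        if code in REBOOT_EVENT_CODES:
            entry = (event["time"], code, REBOOT_EVENT_CODES[code])
            timeline.setdefault(event["host"], []).append(entry)
    return timeline


sample_events = [
    {"time": 120, "host": "app01", "event_code": 6005},
    {"time": 100, "host": "app01", "event_code": 1074},
    {"time": 110, "host": "app01", "event_code": 7036},  # unrelated service event
    {"time": 105, "host": "app01", "event_code": 6006},
]
for when, code, meaning in reboot_timeline(sample_events)["app01"]:
    print(when, code, meaning)
```

Reading the timeline from the boot event backwards shows whether the restart was requested (1074) or unexpected (6008), which is the starting point for finding the trigger.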
The team was also able to use its learnings from the proof-of-concept to put its first infrastructure dashboard into production to help Westpac troubleshoot a “high priority incident, [where] a large number of hosts went down”.
“It was a very high priority outage that we were looking at troubleshooting a while, and actually our team came up with this dashboard,” Leaver said.
“Straightaway you can see how many of our hosts are online, how many disconnected in the last five minutes, how many are no longer connected, split of an operating system.
“This dashboard went to that kind of level where it’s gone to the ‘eyes on glass’ team, as well as our partners and vendors who manage the infrastructure.”
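The counts Leaver lists (hosts online, hosts that disconnected in the last five minutes, hosts no longer connected, and a split by operating system) can be derived from the time each host was last heard from. The sketch below is one way to compute them under stated assumptions; the `summarise_hosts` helper, the 60-second "online" threshold and the sample data are hypothetical and not taken from the bank's dashboard.

```python
from collections import Counter


def summarise_hosts(last_seen, now, online_within=60.0, recent_window=300.0):
    """Classify hosts by how long ago they last reported in.

    last_seen maps host name -> (os_name, last_heartbeat_time).
    Returns (counts per status, counts per (os_name, status)).
    """
    status_counts = Counter()
    by_os = Counter()
    for os_name, heartbeat in last_seen.values():
        age = now - heartbeat
        if age <= online_within:
            status = "online"
        elif age <= recent_window:
            status = "disconnected_last_5_min"
        else:
            status = "no_longer_connected"
        status_counts[status] += 1
        by_os[(os_name, status)] += 1
    return status_counts, by_os


# Hypothetical heartbeat data, with times in seconds.
hosts = {
    "win01": ("windows", 990.0),
    "win02": ("windows", 800.0),
    "lnx01": ("linux", 100.0),
    "lnx02": ("linux", 995.0),
}
status_counts, by_os = summarise_hosts(hosts, now=1000.0)
print(dict(status_counts))
print(dict(by_os))
```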
The bank appears to be moving ahead with a full migration of alerting into Splunk, based on the success of the proof-of-concept, but Leaver noted it would take some time.
“We have to go through the account management and the costing processes,” he said, noting the large increase in alert ingestion that will be required.
He said that moving from SCOM to Splunk for monitoring of the bank’s Wintel environment “is still in its infancy”.
“I’d say it will take a few months,” he said.