3 Challenges of Using Log Analysis for Threat Hunting

The entire need for “threat hunting” arises from the fact that prevention products at the network and endpoint level will never be able to detect 100% of cyberattacks. Since whatever bypasses prevention can lead to a breach, there will always be a need for tools, techniques, and products that can detect these missed compromises before they inflict damage.

While many techniques can be used to perform threat hunting, many practitioners turn to log analysis. Security analysts have spent countless hours trying to find the proverbial “needle in the haystack” by analyzing logs. Even when armed with solutions to amass logs at scale, big databases to crunch the logs, and data scientists to mine the big data for new correlations and insights, most have still not succeeded in stopping the headline breaches with this technique. For starters, logs can be erased to cover an attacker’s tracks.

Having worked with various flavors of log data from numerous sources in large and small networks, I have seen the following challenges with using log analysis for threat hunting:

1. Lateral Movement Detection

Lateral movement is one of the well-recognized stages of the cyber attack kill chain. A number of common exploit tools are publicly available for Windows to propagate an attack further. The most common techniques continue to be Scheduled Tasks, PsExec, or direct use of SMB/CIFS. So the next logical question is: are there logs that can indicate lateral movement behavior in a Windows environment? The answer is yes.

If you are still running Windows 2003/XP, security events such as 528 (Successful Logon), 529 (Logon Failure), 602 (Creation of Scheduled Task), and 601 (New Service Installed) are some of the log events to watch out for. If you are running Windows Vista or higher, you deal with 4-digit Windows Event IDs, and the corresponding IDs to watch out for are 4624 (Successful Logon), 4625 (Logon Failure), 4698 (Creation of Scheduled Task), and 4697 (New Service Installed). This list is not exhaustive; it is a subset of the events that can be used to detect lateral movement.
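As a minimal sketch of how such a watchlist might be applied, the snippet below filters already-parsed events by the Event IDs named above. The record layout (a dictionary with an `event_id` key) and the function name `filter_watchlist` are assumptions made for illustration; real Windows event exports will look different.

```python
# Event IDs and labels come from the paragraph above.
# The event record layout (dicts with an "event_id" key) is hypothetical.
LEGACY_IDS = {
    528: "Successful Logon",
    529: "Logon Failure",
    602: "Creation of Scheduled Task",
    601: "New Service Installed",
}
MODERN_IDS = {
    4624: "Successful Logon",
    4625: "Logon Failure",
    4698: "Creation of Scheduled Task",
    4697: "New Service Installed",
}
WATCHLIST = {**LEGACY_IDS, **MODERN_IDS}


def filter_watchlist(events):
    """Return (event, label) pairs for events whose ID is on the watchlist."""
    return [
        (event, WATCHLIST[event["event_id"]])
        for event in events
        if event.get("event_id") in WATCHLIST
    ]
```

Filtering like this only narrows the haystack. On its own, a successful logon or a new scheduled task is usually routine activity.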

So, where is the roadblock? Many organizations are diligent about recording the Windows Domain Controller logs, but they do not store the logs coming from the desktops and laptops connecting to the domain! Detecting lateral movement requires stitching the Domain Controller logs together with the endpoint logs. Add to that the fact that Windows logs are quite chatty.

For instance, the act of creating a shared folder on a Windows domain can result in 20+ events being recorded in the domain controller logs. Automating the collection of these logs continues to be a problem at both the technical and budgetary levels in organizations. Additionally, as new vectors of lateral movement are discovered, the detection analytics must be continually updated to keep up.
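The stitching described in this section can be sketched roughly as follows, assuming both log sources have already been normalized into simple records. The field names (`source_host`, `target_host`, `host`, `time`), the five-minute window, and the function `stitch_lateral_movement` are all hypothetical choices for illustration, not a prescribed detection rule.

```python
from datetime import timedelta

# Endpoint events from the list above that often follow a remote logon
# during lateral movement: scheduled task creation and service install.
FOLLOW_UP_IDS = {4698, 4697}


def stitch_lateral_movement(dc_logons, endpoint_events,
                            window=timedelta(minutes=5)):
    """Pair each follow-up endpoint event with remote logons that the
    domain controller recorded for the same host shortly before it.

    dc_logons: dicts with "time", "account", "source_host", "target_host"
    endpoint_events: dicts with "time", "host", "event_id"
    (both layouts are assumptions for this sketch)
    """
    findings = []
    for event in endpoint_events:
        if event["event_id"] not in FOLLOW_UP_IDS:
            continue
        for logon in dc_logons:
            elapsed = event["time"] - logon["time"]
            if (logon["target_host"] == event["host"]
                    and logon["source_host"] != event["host"]
                    and timedelta(0) <= elapsed <= window):
                findings.append({
                    "account": logon["account"],
                    "from": logon["source_host"],
                    "to": event["host"],
                    "event_id": event["event_id"],
                })
    return findings
```

If the endpoint logs were never collected, `endpoint_events` is empty and nothing is found, which is exactly the roadblock described above.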

2. Correlating Alerts from the Network Periphery to the Internal Network

Even though today’s network perimeter is porous, common controls such as firewalls, intrusion detection/prevention systems, and breach detection systems are still deployed there. These systems are a great source of information. If their logs are carefully analyzed, it is possible to discover the “reconnaissance” phase of the kill chain.

The challenge, however, is the inability to correlate alerts from the network periphery to the internal network, which makes it difficult to locate the compromise. To illustrate this, let’s look at a specific example. Let’s say a user visits a watering hole site, and the site exploits a vulnerability in the user’s browser plug-in to download malware. A detection system at the network perimeter that runs attachments (from HTTP flows, emails, etc.) in a virtual sandbox may be equipped to detect the attack. However, such systems do not stall the HTTP/TCP session, for performance reasons: sandboxing or further analysis cannot be performed in packet-time. So the system may come back after the analysis finishes and alert that a malicious payload has entered the environment. The alert will have the IP address of the malicious site and the IP address the payload was destined for.

Now, the internal system may be behind one or more firewalls. To figure out where the sample landed in the network, it is necessary to stitch together the firewall address translation logs, DHCP logs, etc. from the time of this malicious HTTP flow. Firewall logs are chatty, and some environments suppress a range of logs for performance reasons, so one may not even have the logs necessary for stitching. In many environments, the result is that analysts stitch the logs manually to figure out whether the downloaded sample is active in the environment.
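A rough sketch of that stitching is shown below, assuming the alert carries the translated (public) destination address and port, and that NAT and DHCP logs have been parsed into simple records. Every field name and the function `locate_host` are hypothetical; actual firewall and DHCP log formats vary by vendor.

```python
def find_nat_entry(alert, nat_logs):
    """Find the address translation that was active for the alerted flow."""
    for entry in nat_logs:
        if (entry["public_ip"] == alert["dst_ip"]
                and entry["public_port"] == alert["dst_port"]
                and entry["start"] <= alert["time"] <= entry["end"]):
            return entry
    return None


def find_lease(internal_ip, when, dhcp_leases):
    """Find which machine held the internal IP at the time of the flow."""
    for lease in dhcp_leases:
        if (lease["ip"] == internal_ip
                and lease["start"] <= when <= lease["end"]):
            return lease
    return None


def locate_host(alert, nat_logs, dhcp_leases):
    """Return the hostname that received the flow, or None when a
    required log record is missing (for example, suppressed)."""
    nat = find_nat_entry(alert, nat_logs)
    if nat is None:
        return None
    lease = find_lease(nat["internal_ip"], alert["time"], dhcp_leases)
    if lease is None:
        return None
    return lease["hostname"]
```

The two `None` branches correspond to the suppressed-log situation described above: when either record is missing, the chain from alert to machine breaks and an analyst has to take over.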

3. Clock Skews

Some analytics based on log analysis need to establish a temporal sequence or set of certain log messages. More often than not, I have noticed clock skew, even in large, sophisticated environments. One might think that NTP would have addressed the issue a long time ago, but in reality it’s not uncommon to see systems on a network whose clocks are off. In addition, certain log forwarders gather logs from individual machines/systems and send them in batches, with a delay, into log stores like Splunk. These clock skews and delays can cause issues for analytics that rely on a log sequence or set for detection.
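One way to make sequence-based analytics less brittle is to treat the ordering of two events as unknown when their timestamps are closer than the skew one is willing to tolerate. The sketch below illustrates the idea; the function name `order_with_tolerance` and the choice of tolerance are assumptions, and it compares the timestamps recorded in the events themselves rather than their arrival time at the log store.

```python
from datetime import timedelta


def order_with_tolerance(first_time, second_time, max_skew):
    """Classify the order of two event timestamps from different machines.

    Returns "before" if the first event clearly precedes the second,
    "after" if it clearly follows it, and "ambiguous" when the gap is
    within max_skew and the clocks cannot be trusted to decide.
    """
    gap = second_time - first_time
    if gap > max_skew:
        return "before"
    if gap < -max_skew:
        return "after"
    return "ambiguous"
```

A rule that needs “logon, then scheduled task” could then fire only on a “before” result and route “ambiguous” pairs to an analyst instead of silently dropping or mis-ordering them.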

Learn more about the pitfalls of Log Analysis for Threat Hunting

These are just a handful of the potential pitfalls in what is currently considered the chief method for “threat hunting”. Join us next Thursday, Jan 25th, for a webinar as we take an in-depth look at the challenges of hunting using log analysis. We will also introduce a new approach called Forensic State Analysis (FSA) and show how it arms security practitioners with an effective and efficient methodology to hunt without relying solely on sophisticated security infrastructure, sensors, big data, or experts.