When bigger isn’t better

This won’t take long. I still read Slashdot; there you have my confession. This article discusses the political “fallout” from the Tsar Bomba tests, and one comment in particular caught my interest (I’d read most of the rest before).

 “There is always this temptation for big bombs. I found a memo by somebody at Sandia, talking about meeting with the military. He said that the military didn’t really know what they wanted these big bombs for, but they figured that if the Soviets thought they were a good idea, then the US should have one, too”

Why did this interest me? I notice the desire for “bigger” isn’t just an issue for bombs but often an issue in systems design. The Tsar Bomba was certainly the biggest, and it certainly “worked,” but could it ever be used? What we know about the bomb indicates it was so big it couldn’t fit in the plane; they strapped it to the bottom and hoped not to crash on takeoff. The aerodynamics of the plane changed so much it would have been defenseless in actual combat. It’s just too big to be used.

How does this relate to system design? When we design a system so large we can only make one, we can’t replicate it and we can’t test it; we end up with a system that “can work” but just “can’t work.”

This doesn’t mean all definitions of “big” are poor choices, but it’s a consideration, something to be mindful of.

  • Is my system so big it can’t fit in the network I have (plane)?
  • Is the system so big its blast radius can take out my entire company/department if it fails?
  • Is the system so big I can’t test it because I can only make one?
  • Is the system so big I can’t test it outside of production because I can’t recreate the remainder of the environment?

What can I do about a system too big?

  • Can I distribute across zones or instances?
  • Can I delegate functionality to external systems?
  • Can I make the environment larger to avoid “big fish in little pond” and “noisy neighbor” concerns?

Another reason you want agents

Microsoft has released a cool new tool for Linux, ported from Windows. I was asked today why I don’t think “syslog” is an acceptable way to bring large events into the SIEM (Splunk, of course). It took about 60 seconds to wrap up the conversation; the original asker of the question was able to validate my concerns pretty quickly. Sadly, many times expectations and requirements for open or agentless collection get set before the “problem we are trying to solve” or the use case is defined.

When I first said no, the immediate response was “yes, it works fine.” This is the “it works on my machine” problem.

In the pcap above we see a full syslog event over the wire (TCP), and it looks A-OK. But there is always more; keep watching the packets and you start to see data truncation.

Hey man, bad XML, what gives? The system loggers on Red Hat and Ubuntu are forked builds of the upstream syslog-ng and rsyslog products. The distribution vendors keep a close eye on the features and functions they need to write to /var/log/* but don’t actually test or validate most other functionality. Look at the source (not even that closely) and you will find a Franken-build of patches from the upstream that isn’t the upstream. To ship large events from a Linux host over TCP you need the proper upstream builds for your OS, which means swapping out the package of a core feature of the OS. Ask yourself who is going to support that.

After you do this swap (and figure out how to test, validate, and support it), now you need to configure it. Using the 1980s BSD syslog format is a bad look. Why? Well, there is an IETF standard (RFC 5424) that addresses issues like breakage with \n and other special characters BSD didn’t have to think about, but it’s never the default, for the same reason IBM still sells mainframes: the industry is scared to break existing implementations. If you want or need to avoid an agent, you now have to load third-party builds not supported by your OS vendor, make diverging config changes on top of them, and figure out how to support all of that. Searching for that with all the correct words will land you here. So what to do? Using host agents is the best choice; for Splunk that’s the Universal Forwarder. If you really want to make agentless work, you need to find the right combination of settings and be prepared to be up all night when something goes wrong, because no one else can find the docs on the internet.
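The truncation above is a framing problem as much as a feature gap. Here is a quick sketch (the event text is hypothetical) of what happens to a multi-line event under naive newline-delimited TCP framing versus RFC 6587 octet-counting:

```python
# Sketch: why a multi-line event (e.g. XML) breaks over TCP syslog.
event = '<34>1 2021-01-01T00:00:00Z host app - - - <xml>\nfield\n</xml>'

# Newline framing: the receiver splits on \n, shredding one event into
# three fragments -- the "truncation" seen in the pcap.
fragments = event.split('\n')
print(len(fragments))  # 3

# RFC 6587 octet-counting: the sender prefixes the byte length, so the
# receiver reads exactly that many bytes and embedded newlines survive.
frame = f'{len(event.encode())} {event}'
length, _, body = frame.partition(' ')
print(body == event)  # True
```

Octet-counting is exactly the kind of feature the upstream builds support but the forked distro builds often don’t exercise or validate.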

You don’t have enough fingers

As you may have guessed, I have been spending a substantial amount of time working with infrastructure log sources. I’ve recently had time to start addressing the practice and theory of application-level log sources, and the substantial risks developers are taking without their organization’s awareness of those risks. The conversation starts like this.

I have sensitive data in my logs and I need to filter that out

Security teams worldwide

Filtering out sensitive data sounds like a good idea, right? No, it’s not right, it’s wrong, and this is why.

  • Application developers are keeping unnecessary sensitive data in high-risk, in-memory code. Why is this high risk? Logging APIs are highly configurable: while you may know the “WARN” level of logging doesn’t contain sensitive data, the “TRACE” level may, and the logging component could easily be reconfigured in production to send sensitive data to a malicious target.
  • Application data written to the front-end application tier is more easily exfiltrated.
  • Application data written to disk is NOT encrypted; any user or tool (e.g. Ansible) could be used to access this data.
  • This is a well-known vulnerability that must be addressed.

But defense in depth I want to filter out just in case

Security teams worldwide

This “just in case” approach is the deployment of untested code to production, which is both an operational and a security risk. Why is it untested? Glad you asked: because you don’t know your input, you can’t test the behavior, and unintended consequences to performance and integrity are possible.

  • It is ineffective (think DLP); this approach is simply guessing.
  • It is expensive; string parsing on a CPU is SLOW, and moving strings between processes is even slower.
  • It introduces new risks: man-in-the-middle interception and manipulation of audit data in the stream.
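To make the “guessing” point concrete, here is a minimal sketch (the log lines and the redaction pattern are both hypothetical): the regex only catches the formats its author anticipated, and every unanticipated format sails through.

```python
import re

# Hypothetical "just in case" filter: redact US SSNs with a regex.
# The pattern only matches the dashed form the author guessed at.
SSN = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def redact(line):
    return SSN.sub('[REDACTED]', line)

print(redact('user=jo ssn=078-05-1120'))  # caught
print(redact('user=jo ssn=078051120'))    # missed: no dashes
print(redact('user=jo ssn=078 05 1120'))  # missed: spaces
```

And that is the easy case: real sensitive data (names, tokens, free-text notes) has no fixed shape to match at all.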

What can we do instead?

Glad you asked. Logging, tracing, and metrics are requirements issued to the developers as part of the software development life cycle. Requirements must be tested (automated and manual) as part of delivery to production. Yes, that is hard; everything we do is hard.

Joe Crobak writes “Seven Best Practices for Keeping Sensitive Data Out of Logs.”

I want to focus on #6, automated QA. Joe mentions the concept but doesn’t elaborate on what it might mean, so I will provide an example using BDD testing.

    Scenario Outline: <action> fruits
        Given there are <start> <fruits>
        When <user> eat <action> <fruits>
        Then <user> should have <left> <fruits>
        Then <user> and <action> or <fruits> combination should not be logged

An individual’s fruit preference and the actions of the individual can be presumed to be private matters that should not be available in logs.

When developing our step implementations, we will provide an additional test:

@then("<user> and <action> or <fruits> combination should not be logged")
def should_not_log_private_combination(user, action, fruits):
    # The user itself should be logged
    assert splunk.search(f'| search index=something TERM("{user}")') > 0
    # The user's action and preference should not be
    assert splunk.search(f'| search index=something TERM("{user}") AND TERM("{action}")') == 0
    assert splunk.search(f'| search index=something TERM("{user}") AND TERM("{fruits}")') == 0

While this is a simplified example, as you can see, using BDD we can link the logging behavior to business requirements for logging and audit, making clear to the developers what they can and cannot log.

Do your business analysts capture requirements for logging and audit in clear, testable ways? If not, ask why.

Oh sh**t, we didn’t think you would check our work

Do you have a workflow to check your work, or are you trusting the system because you think it works? One of the most frequent conversations I have goes something like this. Ryan: “The best way to accomplish this task is … Some common alternatives you might think work are A, B, and C, but they often fail in silent ways, and this is how you know: by checking D, E, and F.” Frequently I am challenged on my experience with a reply to the effect of “We’ve been doing B for years and never had a problem.” I say, “Great, I’m always eager to learn. How do you validate that it works?” If you are a betting person, what do you think the odds of an answer are here? Pretty low. Computers and humans are both very reliable: one does exactly what it is told, the other does exactly what it knows it has to.

Enter Stanford: in order to be fair in vaccine distribution, they created a data-driven algorithm to “do fair,” or really to delegate the determination of fair to someone else who didn’t know how to check their work. Please, when writing software that lives, careers, and economies depend on, test your code. Don’t be these guys.


Commitment to diversity in tech

I’m very pleased with the progress tech has made this year, and I say progress, not arrival, because change is hard for humans. As a segment of society, I think tech is willfully changing. Every now and then I have something to say on this topic; if it is not personal to me I honestly don’t say much, because so much is already said and virtue signaling is a bad look. I commented recently on laziness in ML leading to reinforced bias. Gatekeeping in tech has been an issue that has personally impacted me. My path to tech was a unique one: I wasn’t a barista, I was a bored C student in an underperforming rural Alabama high school who learned how to “tech” the business way. I liked solving problems, and people would pay me well to solve computer problems, so that’s what I learned. I started really small: networking Mac computers over AppleTalk using PhoneNet connectors to solve a problem. The poor rural kids I went to school with in Skipperville, Alabama needed more opportunity to learn to read than their parents could provide them (thank you, Pizza Hut). Also, thank you, one-stoplight little town, for giving me a chance to get started. My path to IT started because no one in “tech” was willing to solve the problem. I’ve built a worldwide reputation not on my formal education but on my customer focus. I don’t like the gatekeeping in tech, but I also highly value education and the well-educated, knowledgeable teams I work with. I often find that west coast tech community members especially are focused on tech for tech’s sake, and that’s great for R&D, but bringing solutions to real problems is where “non-traditional” people like me come in. While I might not be the guy who invents a new way of storing machine data, I am the guy you call to build the largest application of that software in the world. We can’t do this without diversity.

Let’s talk about that phrase, “non-traditional.” Think about the 10 names in tech you can quickly recall, and go find their resumes online. I promise you most of them don’t have a formal tech education. Like me, they came to tech not because tech attracted them but because a problem attracted them. Yes, more people are in tech today because of STEM education, but I would argue I am “classically trained” :). Let’s work together to solve the world’s problems. I say for 2021, let’s turn off the Zoom camera and change how we evaluate resumes to focus on passion and outcomes and less on certifications and degrees.

MaxMind Databases and Splunk Enterprise

I’ve finally been able to take a couple of days to update and refresh my MaxMind Add-on for Splunk Enterprise and Enterprise Cloud. The latest version of the add-on updates the GeoIP2 library, allowing for additional fields from the licensed anonymous IP database. It is also built and tested using the new Addonfactory CI/CD infrastructure at Splunk (see my .conf talk). This is a major version, as it introduces a requirement for Python 3, and thus Splunk Enterprise 8.0+, because GeoIP2 is now Python 3 only. Older versions should still work for now if you cannot upgrade. Head over to Splunkbase to get it now: https://splunkbase.splunk.com/app/3022/

Your cloud vendor wants to send syslog cloud to cloud

I get asked about this from time to time: what’s wrong with sending syslog over the internet? It’s a standard, right?

IETF syslog, meaning RFC 5424 over TLS (RFC 5425), seems like a good idea until you think about the consequences. Just what might those consequences be?

How do you plan to authenticate that?

Certificates? Well, maybe, but this opens your SIEM up to a nasty low-cost denial-of-service problem. Client cert auth is trivial to abuse as a DoS with any invalid cert and expensive validation options. If this was happening, how would you know? Neither syslog-ng nor rsyslog will log it in an obvious way.

Secret SDATA? Now we allow any client to connect, and to find out whether a sender is allowed we must accept and parse its data first. Sure, that can’t be abused.

IP restrictions? I have some beachfront property for you.

All of the above?

How will you scale that? Please see prior posts on load balancing syslog.

Next time you hear the suggestion of RFC 5424 syslog cloud to cloud, just laugh at the joke and ask what options are really being proposed.

When I say syslog what I really mean is

Syslog is an ambiguous term, so I thought I would clarify what I am talking about.

syslog is a daemon where Linux/UNIX systems sent logs back in the day. In most cases this results in an entry in a file in /var/log that may or may not have any particular structure. This is normally not what I am talking about.

Syslog was not a standard in the beginning. RFC 3164 is not a standards document; it’s a memorialization of some common practices. Do you want a 1988 Honda Civic? If your vendor’s syslog looks like this, you should look at it like a used car:

<111> July 01 12:13:11 My old car's logs

Syslog is not just text over TCP/UDP. A syslog message must have the PRI, such as <111>, and it must have a structure something like this:

<34>1 2003-10-11T22:14:15.003Z mymachine myapplication 1234 ID47 [example@0 class="high"] BOMmyapplication is started

Syslog is now a set of standards:

  • RFC 5424 is the transport-neutral message format https://tools.ietf.org/html/rfc5424
  • RFC 5425 describes how to use TLS as the transport: best practice if network security matters, worst practice when performance matters https://tools.ietf.org/html/rfc5425
  • RFC 5426 describes how to use UDP as the transport: best practice for performance https://tools.ietf.org/html/rfc5426
  • RFC 6587 describes how to use TCP as the transport: worst practice for performance, best practice for large messages over unreliable networks https://tools.ietf.org/html/rfc6587

A message should not be considered “standard syslog” if it is not in the RFC 5424 format using RFC 5425, 5426, or 6587 as the transport. Standards compliance matters; let’s start making vendors feel bad. They have had 12 years to get it right.
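As a concrete illustration, here is a minimal sketch of splitting the RFC 5424 header fields from the example message above (simplified: it ignores structured-data escaping and treats the BOM marker as literal text, as printed here):

```python
import re

# Minimal RFC 5424 header split: PRI, VERSION, TIMESTAMP, HOSTNAME,
# APP-NAME, PROCID, MSGID, then the structured data and message.
HEADER = re.compile(
    r"^<(?P<pri>\d{1,3})>(?P<version>\d) (?P<timestamp>\S+) "
    r"(?P<hostname>\S+) (?P<appname>\S+) (?P<procid>\S+) "
    r"(?P<msgid>\S+) (?P<rest>.*)$")

msg = ('<34>1 2003-10-11T22:14:15.003Z mymachine myapplication '
       '1234 ID47 [example@0 class="high"] BOMmyapplication is started')

m = HEADER.match(msg)
facility, severity = divmod(int(m.group("pri")), 8)  # PRI = facility*8 + severity
print(m.group("hostname"), facility, severity)  # mymachine 4 2
```

Every field has a defined position and a defined NILVALUE (“-”), which is exactly what the 1980s free-text format lacks.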

Devices that think you know their name

“What exactly is that talker’s name?” is one of the most frustrating problems in syslog eventing, and the most frustrating in analytics. For far too long the choices have been to use the device’s name OR use reverse DNS, but never both. Today SC4S 1.20.0 solves this problem by doing what you would do:

  1. If the device has a host name in the event, use that.
  2. Else, if our management/CMDB solution knows the right name, use that instead.
  3. Else, maybe someone updated DNS; try that instead.
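The chain above can be sketched in a few lines. This is a hypothetical illustration only; SC4S’s actual implementation lives in its syslog-ng configuration, not Python.

```python
import socket

def resolve_name(event_hostname, src_ip, cmdb):
    """Hypothetical sketch of the name preference chain."""
    # 1. Trust the hostname the device put in the event, if present.
    if event_hostname:
        return event_hostname
    # 2. Else consult the CMDB (modeled here as a dict keyed by IP).
    if src_ip in cmdb:
        return cmdb[src_ip]
    # 3. Else try reverse DNS, keeping the raw IP on failure.
    try:
        return socket.gethostbyaddr(src_ip)[0]
    except OSError:
        return src_ip

print(resolve_name("fw01", "203.0.113.9", {}))                      # fw01
print(resolve_name(None, "203.0.113.9", {"203.0.113.9": "core1"}))  # core1
```

The key design choice is the ordering: the device’s own claim wins, authoritative inventory beats DNS, and DNS is a last resort rather than the only option.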

Simple, logical, easy to understand, and available now in Splunk Connect for Syslog. No more of this:

Event with IP as a host

Plenty more like this

IP translated to host using CMDB sourced lookup

Performant AND reliable syslog: UDP is best

The faces I’ve seen made at this statement say a lot. I hope you read past the statement for my reasons, and for when other requirements may prompt another choice.

Wait, you say, TCP uses ACKs so data won’t be lost. Yes, that’s true, but there are buts:

  • But when the TCP session is closed, events published while the system is creating a new session will be lost (closed window case).
  • But when the remote side is busy and cannot ACK fast enough, events are lost as the local buffer fills.
  • But when a single ACK is lost by the network and the client closes the connection (local and remote buffers lost).
  • But when the remote server restarts for any reason (local buffer lost).
  • But when the remote server restarts without closing the connection (local buffer plus the timeout window lost).
  • But when the client side restarts without closing the connection.

That’s a lot of buts, and it’s why TCP is not my first choice when my requirement is mostly-available syslog (there is no such thing as HA) with minimized data loss.
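Python’s standard library makes the fire-and-forget nature of UDP syslog easy to demonstrate. The port, logger name, and message below are all made up, with a throwaway local socket standing in for a real collector:

```python
import logging
import logging.handlers
import socket

# A throwaway UDP "collector" on an ephemeral port stands in for a
# real syslog server.
recv = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
recv.bind(("127.0.0.1", 0))
recv.settimeout(5)
port = recv.getsockname()[1]

# Fire-and-forget: no connection state exists, so there is no session
# or send buffer to lose when the collector restarts.
handler = logging.handlers.SysLogHandler(
    address=("127.0.0.1", port), socktype=socket.SOCK_DGRAM)
log = logging.getLogger("udp-demo")
log.propagate = False
log.addHandler(handler)
log.warning("link flap on ge-0/0/1")

datagram = recv.recv(2048).decode()
print(datagram)  # a PRI header like <12> followed by the message
recv.close()
handler.close()
```

Note there is no session setup, no ACK, and nothing to tear down: each event either arrives or it doesn’t, which is exactly the loss model the “buts” above trade away.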

Wait, you say, when should I use TCP syslog? To be honest, there is only one case: when the syslog event is larger than the maximum size of a UDP packet on your network, typically limited to web proxy, DLP, and IDS type sources. That is, messages that are very large but not very fast compared to, say, firewalls. So we jump to TCP when the network can’t handle the length of our events.

There is a third option: TLS. A subset of devices can forward logs using TLS over TCP, which provides some advantages with proper implementation:

  • TLS can continue a session over a broken TCP connection, reducing buffer-loss conditions.
  • TLS will fill packets for more efficient use of the wire.
  • TLS will compress in most cases.

While I am here, I want to say a word about load balancers as a means of high availability. This is snake oil.

  • TCP over an NLB doubles the opportunity for network error to cause data loss and almost always increases the size of the buffer lost. I have seen over 25% loss on multiple occasions.
  • TCP over an NLB can lead to imbalanced resource use due to long-lived sessions. The NLB is not designed to balance by connection throughput; it is designed to balance connections, and in TCP all connections are not equal, leading to out-of-disk-space conditions.
  • UDP cannot be probed, so UDP over an NLB can lead to sending logs to long-dead servers.
  • Load balancers break message reassembly. Common examples of “1 of 3” type messages, like Cisco ACS, Cisco ISE, and Symantec Mail Gateway, cannot be properly processed when sprayed across multiple servers.

Wait, you ask, how do I mitigate downtime for syslog?

  • Use VMware or Hyper-V with a cluster of hosts, which will reduce your outage to only host reboots, which in this day and time are rare.
  • Use a clustered IP solution (e.g. Keepalived) so you can drain the server to a partner before restart.

A few other ideas you may have for bringing “HA” to syslog that will be counterproductive:

  • DNS –
    • Most known syslog sources will use only one IP, typically the first or a random one from a list of A records, for a very long period of time, ignoring the TTL. Using DNS to change the target is unlikely to take effect quickly enough, in some cases taking hours.
    • DNS global load balancers: similar to the above, clients often hold cached results far longer than the TTL. In addition, the actual device configuration often does not use the correct DNS servers for the GLB to properly detect distance, so it will route incorrectly.
  • AnyCast
    • UDP anycast can work in the exceptional condition where the scale of a single clustered pair of syslog servers cannot provide capacity (greater than 10 TB per day). However, because of the probing issues described with NLBs above, my experience with anycast has been high data loss and project failure: over a dozen projects with well-known logos over the last 10 years, names you would know. While anycast can simplify administration, it does not mitigate loss, and if the routers in use are not up to the task it can increase loss. Most anycast use cases have some method of recovery, such as DNS; syslog does not. While anycast on paper seems to be an easy answer, the engineering required to succeed is not trivial. Ask yourself: is it worth it, and can we monitor it effectively?
  • Sending the message multiple times to multiple servers so it can be “de-duplicated” by “someone’s software”: deduplication requires globally unique keys, which don’t exist here, so this isn’t possible. More than once is worse than sometimes never, because if we are counting errors or attacks we will see more than is real, producing false positives and eroding operational trust in the data, making your project effectively useless. A missed event will more likely than not occur again and be captured in short order.
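A toy illustration of why content-based deduplication can’t work here (the firewall events are hypothetical): two legitimate occurrences of the same event are byte-identical, so any dedup-by-content scheme will silently drop real events.

```python
import hashlib

# Hypothetical firewall events: the same deny seen twice, legitimately.
e1 = "<134>1 2021-01-01T00:00:01Z fw01 fw - - - deny tcp 10.0.0.5 > 10.0.0.9"
e2 = "<134>1 2021-01-01T00:00:01Z fw01 fw - - - deny tcp 10.0.0.5 > 10.0.0.9"

# Without a globally unique event ID, content hashes are all we have,
# and they cannot distinguish a duplicate from a real repeat.
h1 = hashlib.sha256(e1.encode()).hexdigest()
h2 = hashlib.sha256(e2.encode()).hexdigest()
print(h1 == h2)  # True: dedup-by-content would drop one real event
```

Since syslog carries no such unique key, you are left choosing between overcounting (keep both) and undercounting (drop one), and neither is trustworthy.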