Friday, November 19, 2010

More splunk fun...

I've started setting up summary indexes: I take search results and put them in a second index that's used just for reporting. First you have to create the new index; mine's called "dashboard_summarize". It will require a restart of Splunk, just so you know.
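
If you define the index by hand in indexes.conf (rather than through the manager UI), the stanza is only a few lines. A minimal sketch, using the stock default paths rather than anything site-specific:

[dashboard_summarize]
homePath   = $SPLUNK_DB/dashboard_summarize/db
coldPath   = $SPLUNK_DB/dashboard_summarize/colddb
thawedPath = $SPLUNK_DB/dashboard_summarize/thaweddb

Next up, the ugly query: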


search = host="srchqenmana*" (source="/usr/local/tvs/apache-tomcat/logs/qlogger/*" NOT source="*.gz") "<A9_Request" AND NOT ("FFFFFFFFFFFF" OR "000013ED3AEB" OR "Agent.007") | lookup Market_by_Controller_ID Controller_ID as Controller_ID OUTPUT Market as Market | eval QueryFirstTwo=substr(TextQuery,1,2) | transaction MAC, QueryFirstTwo maxspan=5m maxpause=1m delim="," mvlist=TextQuery | eval LastQuery=mvindex(TextQuery, -1) | fillnull value=0 forward | eval MAC="costtimequalityscope".MAC | eval MAC=md5(MAC)|stats count(LastQuery) as QueryCount by LastQuery, Market, Controller_ID, StreamingServerID, forward | fields QueryCount LastQuery Controller_ID StreamingServerID Market forward |collect addtime=true index=dashboard_summarize


Yikes! Let's break that down a bit. First up we have the sifting portion of the query, basically the search terms that rule data in or out:

host="srchqenmana*" (source="/usr/local/tvs/apache-tomcat/logs/qlogger/*" NOT source="*.gz") "<A9_Request" AND NOT ("FFFFFFFFFFFF" OR "000013ED3AEB" OR "Agent.007")

Next up we have a data lookup. We take the numerical Controller_ID and map it to a human-readable market name like 'Salt Lake' or 'Bucks County'.

lookup Market_by_Controller_ID Controller_ID as Controller_ID OUTPUT Market as Market
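
For context, the lookup itself is just a CSV file mapping the IDs to market names, something like this (the IDs below are made up for illustration):

Controller_ID,Market
101,Salt Lake
102,Bucks County
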
Next we start doing calculations, conversions, and transformations of the data. We'll go through this part stanza by stanza:

eval QueryFirstTwo=substr(TextQuery,1,2)

Eval a field called 'QueryFirstTwo' to the first two letters of the TextQuery string using the substr function, so a TextQuery of 'HOUSE' gives a QueryFirstTwo of 'HO'.

transaction MAC, QueryFirstTwo maxspan=5m maxpause=1m delim="," mvlist=TextQuery

This little gem is a beauty. I wish I could take credit for what the Splunk consultant did there. Basically we define what a single user search is here by defining what a transaction is. We do not count just the simple submission of a request, because we do live updating of search results after two letters. So if you were searching for the show HOUSE, with live updating you would make a request for HO, HOU, HOUS, HOUSE at every key press. That's great if you're just measuring raw throughput, but it's not a valuable business data point: if everyone is searching for really long terms like SUPERNATURAL, your usage stats would be skewed. So we roll those requests up into a single transaction by setting some parameters. First, we time-box it at 5 minutes. Second, we only allow for a 1 minute pause. Sure, there are edge cases where you may exceed either of these time boundaries, but it should be a wash overall. Further, the MAC address and the first two letters of the search must also be the same. This lets us tolerate typos later on: if you typed HOUU and then HOUS, because HO still matches, it's still in the same transaction. And the last little bit, mvlist=TextQuery, says to make a multi-value list (or array) of the TextQuery values used in this transaction. In my example the list would be ("HO", "HOU", "HOUS", "HOUSE"). This comes up in our next stanza.
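
To make that concrete, here's a toy picture of what the transaction command is doing (the timestamps are made up):

Raw events, one per key press:
10:00:01  MAC=000013ED3AEB  TextQuery="HO"
10:00:02  MAC=000013ED3AEB  TextQuery="HOU"
10:00:04  MAC=000013ED3AEB  TextQuery="HOUS"
10:00:05  MAC=000013ED3AEB  TextQuery="HOUSE"

After | transaction MAC, QueryFirstTwo maxspan=5m maxpause=1m mvlist=TextQuery, those collapse into one event whose TextQuery is the multi-value list ("HO", "HOU", "HOUS", "HOUSE").
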
eval LastQuery=mvindex(TextQuery, -1)

If you look up mvindex and its syntax, you see that we're setting the field LastQuery to the last entry in the list. In my example, LastQuery=HOUSE. Side note: the page linked for mvindex is titled 'Common Eval Functions' according to the URL. I'd hate to see the uncommon ones.
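
In isolation, mvindex pulls an entry out of a multi-value field by position: 0 is the first entry and negative numbers count back from the end. So for my example list:

mvindex(TextQuery, 0) would return "HO"
mvindex(TextQuery, -1) would return "HOUSE"
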
fillnull value=0 forward | eval MAC="salted".MAC | eval MAC=md5(MAC)

I'm grouping the next three stanzas together because they're doing similar things. If the field named "forward" is null, set it to zero. Next we add a salt to the MAC address to obscure/anonymize it. The MAC (much like an IP address), while not directly identifying an individual, is sensitive just the same and needs to be hidden, so first we prepend the string "salted" to the current value of MAC. Think of this like a password or key. Next we convert the salt+MAC value to the MD5 hash of that string. So 000013ED3AEB becomes salted000013ED3AEB, which becomes ce431f1c1a634337ca1cdcde78a1d15f. Now if someone knows a user's MAC address and does echo -n "000013ED3AEB" | md5sum to try and figure out their new obscured value, they can't, because they don't know the salt. And because the salt can be of arbitrary length, brute force isn't effective. So it's reasonably protected if for some reason the data needs to be shared with non-trusted parties.
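
Put another way, here's the difference between what an outsider can compute and what actually got hashed (hash outputs left out):

echo -n "000013ED3AEB" | md5sum          # outsider's guess, no salt
echo -n "salted000013ED3AEB" | md5sum    # what the eval actually hashed

The two outputs don't match, and without the salt there's no practical way to get from one to the other.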

stats count(LastQuery) as QueryCount by LastQuery, Market, Controller_ID, StreamingServerID, forward, MAC
This one is fairly straightforward: get the number of times each search term was searched, organized by Market (which we looked up in a table before), Controller_ID, StreamingServerID, and the value of forward (app-specific fields that only have meaning to us). The why of this is coming up.
fields QueryCount LastQuery Controller_ID StreamingServerID Market forward, MAC

Next we keep only the fields listed above in the search results (the why is next).
collect addtime=true index=dashboard_summarize

Lastly, we collect this data and store it in the 'dashboard_summarize' index. What we're doing is making a roll-up of searches and weeding out all the cruft that isn't needed to build the reports or dashboards. Further, because we've scrubbed the sensitive data, we can now let a larger audience view it by giving them permissions only to this new index. Because the index is lean and mean, dashboards and reports are several orders of magnitude faster than going against the raw data. Further, we've pre-paid a lot of calculation expense with the evals and transaction logic.
Now I have an index to do my reporting out of that's much faster than the raw queries against all the data.
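
For example, a dashboard panel can now run entirely against the summary index. These are sketches rather than my actual panels, but this is the shape of them:

index=dashboard_summarize | stats sum(QueryCount) as Searches by Market

index=dashboard_summarize Market="Salt Lake" | stats sum(QueryCount) as Searches by LastQuery | sort -Searches | head 10

No transaction, no lookup, no md5 at report time; all of that expense was paid once when the summary rows were written.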

Thursday, November 18, 2010

Working with Splunk

I've been doing a lot of work with Splunk lately. Splunk is a powerful and flexible indexing tool. It slurps up log files and data and makes them searchable. I think the real power of Splunk over a lot of other log management and searching tools is its ability to search across multiple servers for the same time period. Another powerful feature is its ability to do field extraction. So when a log file says "IP_Address=10.11.12.13" you can do field-related searches like "AND IP_Address=10.11.12.13" or, more powerfully, "NOT IP_Address=10.11.*"


Fields are where I'm spending a lot of my time lately. In our current search and discovery platform we have lots of fields with interesting values from people making search requests. We have values such as channelmap, controllerID, MAC, TextQuery and a few other interesting values. Because we have these interesting field values and Splunk extracts them for us, we can generate very interesting usage reports, such as number of unique users, users per market, etc. And because we have a relatively closed set of users, we can produce interesting numbers like the percentage of users actually using our platform. Powerful stuff.
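
For example, once Splunk has extracted those fields, a unique-users-per-controller count is just a stats call tacked onto the end of a search. A sketch using the field names above (dc is Splunk's distinct-count function):

... | stats dc(MAC) as UniqueUsers by controllerID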


Anyway, I hope to write up some of my more interesting uses of Splunk in the future.