Splunk 7.0: the good and the bad

I will preface this post by saying I love Splunk; it is the best log-aggregation application out there.

So on with the post, and it must be a good one, right? Anyway, Splunk released version 7.0 of their Splunk Enterprise product last week during their .conf 2017 conference, which I attended as an FTR. There were a few new features in it that were amazing, such as the new metrics index type, which was blazingly fast. So like all “fanboys” of anything, I decided to update my home server on Thursday night after the conference was over. This is where the fun began.
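
For the curious, here is a minimal sketch of what the metrics index looks like in practice. The index name my_metrics is just a placeholder, and the stanza and search are from memory of the 7.0 docs, so double-check them against your version. In indexes.conf (the usual homePath/coldPath/thawedPath settings omitted), datatype = metric is what marks it as a metrics index:

[my_metrics]
datatype = metric

And a search against it uses the new mstats command instead of a regular event search:

| mstats avg(_value) WHERE index=my_metrics AND metric_name="cpu.user" span=1m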

First, when I started using Splunk years ago, it supported a myriad of operating systems for the servers. If you wanted Solaris, FreeBSD, AIX, HP-UX, macOS, Windows, or Linux, you were golden. However, over the years that list has been pared back to just Linux and Windows. (macOS is supported, but only for the free and trial editions; basically for development and home use, not for enterprise use.)

So now that Solaris is no longer supported, I needed to switch my home system from OpenIndiana (aka OpenSolaris) over to Linux. I spun up a new CentOS 7 VM on my home server and copied all my Splunk data from the Solaris host to the Linux one. I then removed the bin and lib directories (I use the tar installs, and those are the only places machine-specific binaries live). With that done, I untarred the Linux Splunk 7.0 release over top of my current directory and started it up. Everything was fine until I tried to run a search. A search over something like the last 15 minutes worked, but anything longer died because one of the hot buckets was corrupted; I am not sure if that happened in transit or what. So off to the fsck command to try to fix them. An hour or so later it still couldn’t fix some of them, and since it was getting late I just went to bed.
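
For anyone attempting the same OS hop, the rough sequence looked like the sketch below. The paths match my layout ($SPLUNK_HOME is /splunk/splunk), the hostname and tarball name are placeholders, and the fsck flags have changed between releases, so run splunk help fsck before trusting them:

# On the old Solaris host: stop Splunk, then copy $SPLUNK_HOME over to the new VM
/splunk/splunk/bin/splunk stop
rsync -a /splunk/splunk/ newhost:/splunk/splunk/

# On the new CentOS 7 host: drop the platform-specific pieces, then untar the Linux build on top
cd /splunk/splunk
rm -rf bin lib
tar -C /splunk -xzf /tmp/splunk-7.0.0-<build>-Linux-x86_64.tgz

# Start it back up; if searches complain about corrupt buckets, stop splunkd again and let fsck attempt a repair
bin/splunk start
bin/splunk stop
bin/splunk fsck repair --all-buckets-all-indexes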

The next day, when I returned home, I tried to log in to my Splunk instance to see how it was doing; to my surprise, I couldn’t even log in. It turned out the Linux host had crashed. I was dumbfounded, as I hadn’t seen an actual kernel panic like that in a while. So I restarted the machine, started Splunk back up, and everything was working again.

A few days passed, I went to check on it again, and once again it was dead. Now I was really curious. I ended up installing the crash utilities on the host and started going through the vmcore files. Yup, each time it crashed, it was splunkd that caused it. Unfortunately I don’t know much more than that about what is actually triggering it; it appears to happen at random times.
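
For reference, getting to that point on CentOS 7 was roughly the following. It assumes kdump is already enabled and writing vmcores under /var/crash, and the dump directory name is just the one from my host:

# Install the crash utility plus debug symbols matching the running kernel
yum install -y crash yum-utils
debuginfo-install -y kernel

# Open the vmcore against the debug vmlinux (these are the KERNEL/DUMPFILE paths shown below)
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/127.0.0.1-2017-10-01-12:26:50/vmcore

# Inside crash, "sys" summarizes the panic, "bt" gives the backtrace, and "vm" dumps the task's memory map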

The output of crash shows this:

KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.2.2.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-10-01-12:26:50/vmcore [PARTIAL DUMP]
CPUS: 8
DATE: Sun Oct 1 12:26:43 2017
UPTIME: 1 days, 17:12:55
LOAD AVERAGE: 0.00, 0.01, 0.05
TASKS: 330
NODENAME: splunk
RELEASE: 3.10.0-693.2.2.el7.x86_64
VERSION: #1 SMP Tue Sep 12 22:26:13 UTC 2017
MACHINE: x86_64 (3399 Mhz)
MEMORY: 3 GB
PANIC: "double fault: 0000 [#1] SMP "
PID: 1420
COMMAND: "splunkd"
TASK: ffff8800bae91fa0 [THREAD_INFO: ffff880000120000]
CPU: 3
STATE: TASK_RUNNING (PANIC)

First few lines of the “bt” output from crash:
PID: 1420 TASK: ffff8800bae91fa0 CPU: 3 COMMAND: "splunkd"
#0 [ffff8800bfac4d88] machine_kexec at ffffffff8105c4cb
#1 [ffff8800bfac4de8] __crash_kexec at ffffffff81104a32
#2 [ffff8800bfac4eb8] crash_kexec at ffffffff81104b20
#3 [ffff8800bfac4ed0] oops_end at ffffffff816ad2b8
#4 [ffff8800bfac4ef8] die at ffffffff8102e97b
#5 [ffff8800bfac4f28] do_double_fault at ffffffff8102b6e2
#6 [ffff8800bfac4f50] double_fault at ffffffff816b6908
[exception RIP: page_fault+13]
RIP: ffffffff816ac52d RSP: ffff880000122fc8 RFLAGS: 00010092
RAX: 0000000000000ff8 RBX: 0000000000000000 RCX: ffffffff816ac2ac
RDX: 00001fffffffffff RSI: ffffffff81a73118 RDI: 0000000000000000
RBP: ffff880000123098 R8: ffffffff81911167 R9: ffffea00002e7b80
R10: ffffea00002e7b80 R11: 0000000000000000 R12: ffffffff81a73118
R13: ffffffff81a73118 R14: ffff880000120000 R15: ffff88008665a580
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
--- &lt;DOUBLEFAULT exception stack&gt; ---
#7 [ffff880000122fc8] page_fault at ffffffff816ac52d
#8 [ffff880000123048] spurious_fault at ffffffff816afd8e
#9 [ffff8800001230a0] __do_page_fault at ffffffff816b01ae
#10 [ffff880000123100] do_page_fault at ffffffff816b0325
#11 [ffff880000123130] page_fault at ffffffff816ac548
[exception RIP: spurious_fault+48]
RIP: ffffffff816afd8e RSP: ffff8800001231e8 RFLAGS: 00010002
RAX: 0000000000000ff8 RBX: 0000000000000000 RCX: ffffffff816ac2ac
RDX: 00001fffffffffff RSI: ffffffff81a73118 RDI: 0000000000000000
RBP: ffff880000123208 R8: ffffffff81911167 R9: ffffea00002e7b80
R10: ffffea00002e7b80 R11: 0000000000000000 R12: ffffffff81a73118
R13: ffffffff81a73118 R14: ffff880000120000 R15: ffff88008665a580
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#12 [ffff880000123210] __do_page_fault at ffffffff816b01ae
#13 [ffff880000123270] do_page_fault at ffffffff816b0325
#14 [ffff8800001232a0] page_fault at ffffffff816ac548
[exception RIP: spurious_fault+48]
RIP: ffffffff816afd8e RSP: ffff880000123358 RFLAGS: 00010002
RAX: 0000000000000ff8 RBX: 0000000000000000 RCX: ffffffff816ac2ac
RDX: 00001fffffffffff RSI: ffffffff81a73118 RDI: 0000000000000000
RBP: ffff880000123378 R8: ffffffff81911167 R9: ffffea00002e7b80
R10: ffffea00002e7b80 R11: 0000000000000000 R12: ffffffff81a73118
R13: ffffffff81a73118 R14: ffff880000120000 R15: ffff88008665a580
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000

Output from the “vm” command:
PID: 1420 TASK: ffff8800bae91fa0 CPU: 3 COMMAND: "splunkd"
MM PGD RSS TOTAL_VM
ffff88008665a580 ffff8800b87f4000 324076k 864244k
VMA START END FLAGS FILE
ffff8800a5efe5e8 563d22303000 563d248b7000 8000875 /splunk/splunk/bin/splunkd
ffff8800a5efe438 563d248b7000 563d24964000 8100873 /splunk/splunk/bin/splunkd
ffff8800a5efe288 563d24964000 563d249db000 8100073
ffff880097e90438 7f4f8d800000 7f4f8ea00000 8200073
ffff8800a1f6cd80 7f4f90400000 7f4f91c00000 8200073
ffff8800a1f6ca20 7f4f923f7000 7f4f923f8000 8100070
ffff8800a1f6c948 7f4f923f8000 7f4f925f8000 8100073
ffff8800a1f6c6c0 7f4f925f8000 7f4f925f9000 8100070
ffff8800a1f6c798 7f4f925f9000 7f4f927f9000 8100073
ffff8800a1f6cca8 7f4f927f9000 7f4f927fa000 8100070
ffff8800a1f6c870 7f4f927fa000 7f4f929fa000 8100073
ffff88009ce66948 7f4f935ff000 7f4f93600000 8100070
ffff88009ce66a20 7f4f93600000 7f4f93a00000 8100073
ffff88009466aca8 7f4f93a00000 7f4f94800000 8200073
ffff88009ca98e58 7f4f949cd000 7f4f949e3000 8000075 /usr/lib64/libresolv-2.17.so
ffff88009ca98d80 7f4f949e3000 7f4f94be3000 8000070 /usr/lib64/libresolv-2.17.so
ffff88009ca99008 7f4f94be3000 7f4f94be4000 8100071 /usr/lib64/libresolv-2.17.so
ffff88009ca990e0 7f4f94be4000 7f4f94be5000 8100073 /usr/lib64/libresolv-2.17.so
ffff88009ca98f30 7f4f94be5000 7f4f94be7000 8100073
ffff88009ca98bd0 7f4f94be7000 7f4f94bec000 8000075 /usr/lib64/libnss_dns-2.17.so
ffff88009ca98af8 7f4f94bec000 7f4f94deb000 8000070 /usr/lib64/libnss_dns-2.17.so
ffff88009ca98ca8 7f4f94deb000 7f4f94dec000 8100071 /usr/lib64/libnss_dns-2.17.so
ffff88009ca991b8 7f4f94dec000 7f4f94ded000 8100073 /usr/lib64/libnss_dns-2.17.so
ffff88009ca98798 7f4f94ded000 7f4f94df9000 8000075 /usr/lib64/libnss_files-2.17.so
ffff88009ca986c0 7f4f94df9000 7f4f94ff8000 8000070 /usr/lib64/libnss_files-2.17.so
ffff88009ca98948 7f4f94ff8000 7f4f94ff9000 8100071 /usr/lib64/libnss_files-2.17.so
ffff88009ca98a20 7f4f94ff9000 7f4f94ffa000 8100073 /usr/lib64/libnss_files-2.17.so
ffff88009ca98870 7f4f94ffa000 7f4f95000000 8100073
ffff8800a1f6ce58 7f4f95000000 7f4f95600000 8200073
ffff880036735cb0 7f4f957ef000 7f4f957f0000 8100070
ffff880036735878 7f4f957f0000 7f4f959f0000 8100073
ffff880036735a28 7f4f959f0000 7f4f959f1000 8100070
ffff880036734af8 7f4f959f1000 7f4f95bf1000 8100073
ffff880036735d88 7f4f95bf1000 7f4f95bf2000 8100070

So that means I will definitely hold off on upgrading my production servers; if this is happening on my personal one, I can only imagine what would happen to larger instances. It could also be a result of me being a fanboy and installing the .0 release of the software, which any good admin will tell you to avoid: “just say no to .0.”