Understanding Splunk Fishbucket
Introduction to Splunk Fishbucket
In this post we will learn what is fishbucket in Splunk but before that lets us understand what Splunk is and its purpose.
Splunk is used for monitoring, searching, analyzing the machine data in real time. The datasource can range from application data to user created data to sensors or devices.
Purpose of Splunk fishbucket
Before analyzing the data, Splunk index the data. The index is necessary to analyze the data. But here is the issue, what if the same data is indexed multiple times or in other words, how to avoid duplicate indexing the same chunk of data?
Splunk fishbucket keeps seek pointers and CRCs for the indexed files inside the directory. This directory is called fishbucket. Since through fishbucket we can know which data has already been indexed, so splunkd can tell if it has been read already and avoid duplicate indexing.
How fishbucket works?
File monitor processor searches the fishbucket to see if the CRC from the beginning of the file is already there or not. This is the first step of file monitor processor whenever it starts looking at a file. There can be three possible scenarios:
Scenario 1: If CRC is not present is fishbucket, the file is indexed as new. This is simple, file has never been indexed. After indexing, it stores CRC and seekpointer inside fishbucket.
Scenario 2: If CRC is present is fishbucket and seek pointer is same as current end of file , this means the file has already been indexed and has not been changed since last indexed. Seek pointer is used to check if there is change in file or not.
Scenario 3: If CRC is present is fishbucket and seek pointer is beyond the current end of file, this means something in the part of file which we have already read has been changed. Since we cannot know what has been changed, lets re-index the whole data again.
Location of fishbucket directory
All these CRCs and seek pointer is stored in location by default:
Retention policy of fishbucket index
Via indexes.conf, we can change the retention policy of fishbucket index. This may be needed if we are indexing a lots of number of file. But we need to be careful when changing retention policy because if the file which has already been indexed but the CRCs and seek pointer got deleted due to change of retention policy, there is risk of same file getting indexed again.
Ways to track down a particular file when needed
If we need to know which file has been indexed and reindexed at which particular time, we can search all the events in the fishbucket associated with it by the file or source name. We can check seek pointer and mod time to know the required details.
We can also search fishbucket through GUI by searching for “index=_thefishbucket”.
The Fishbucket and btprobe Command
To reset the individual input checkpoint, use the btprobe command :
splunk cmd btprobe –d SPLUNK_HOME/var/lib/splunk/fishbucket/splunk_private_db –file <source>–reset
- splunk clean eventdata _thefishbucket command to Force re-indexing of all file monitors in the indexer.
- rm -r ~/splunkforwarder/var/lib/splunk/fishbucket command to Manually delete the fishbucket on forwarders.
For more information about splunk fishbucket : Splunk documentation