The dataManager.pl script (v5.xx)
The
dataManager.pl script runs like a deamon on the machine STOL01 at the port 9000 on the storage subnet (
stol01.storage:9000).
- the executable is located at
/root/bin/dataManager.pl- the working directory with data and configuration files is:
/var/dataman
-
Short description about it's functionalities:
The
dataManager.pl script:
- scans periodically the data mount point (
/storage ) searching for .gwf (rawdata) files newer then the file specified into the
/var/dataman/last_processed.file . If that files do not exist (usually at the first running cycle) the file is created empty and a standard time window will be used for the first scan. Then the new files will be added to the services processing queues (data replica, etc…)
- receives information from all the
dataStorage.pl modules running on each storage farm disk host machine, keeping track of the available space for each volume involved. A trigger is reset each time that a status message is received for a volume. If no infos were received upon a time counter an alarm is send by email to the operators and a message is sent to the
fflGen.pl module: the volume is no longer available and it will be removed from the ffl list. A soon as a status message for that volume is received, an email will be sent to the operators and a message to the
fflGen.pl module in order to insert again that volume into the ffl list. If the unavailable volume is the current destination replica volume ( pointed by the file
/var/dataman/current_dest_label ) , an email is sent to the operators and the rsync subprocess is stopped by making an rsync lock file (see
procedure P1 in the appendix to fix the problem).
- runs an rsync subprocess (
dataManager.pl:rsync) in order to replicate the rawdata files from the local storage area (the double equallogic array disk space) to the offline storage farm buffer volumes. For each rsync session, the space status of the destination volume will be checked: if the volume is 100% full, the current destination label is inserted at the bottom of the full volumes list ( located at
/var/dataman/full_volumes.list ). If no other empty volumes are available (see
procedure P2 in the appendix for know how to check the volumes status), the same list will be open and the first label will be cutted from the list and selected like the next destination volume. A "delete" command will be sent to the
dataStorage.pl module that owns the volume and to the
fflGen.pl module in order to delete the oldest files from the offline buffer, making free space for the new, incoming data. The rsync flows will automatically restart after few (usually 5) minutes.
NOTE: Due to fix an xfs issue on large file system, the data deletion is disabled by default and a deletion
procedure P3 is implemented.
The rsync uses the rsyncd server configured on the target node. The username used to write files on the volumes of a node and the target volume labels are defined in the /etc/rsyncd.conf file on the node.
-
Configuration parameters and files
The following files must be present into the working directory,
this files can be manually edited:
-
/var/dataman/local_host_name : usually
stol01.storage, is the local host name (mapped on the storage subnet);
-
/var/dataman/datastorage_host_list : contains a full list of all the machines that host a
dataStorage.pl module;
-
/var/dataman/fflgen_host_name : usually
stol02.storage, is the name of the machine that hosts the module
fflGen.pl;
-
/var/dataman/datafinder_host_name : usually
stol02.storage, is the name of the machine that hosts the module
dataFinder.pl;
this files can be edited due to manage an exception:
-
/var/dataman/full_volume.list : contains a list of couples "host::volume". Can be edited in order to remove from the list some volumes that will be managed by the system but will be never deleted/scratched again. A typical example is removing volumes filled by files related to a run;
-
/var/dataman/next_dest_volume : usually is not present into the directory. This file, if present, must contain a list of labels (a couple "host::volume") that point to an empty volume. This file is used by the script when a volume is full and we want to force to select this volume like the next data destination;
- other files like .db .list and .lock files must be leaved untouched, changing or deleting them can cause system incosistency.
All files are automatically replicated each hour by a crond script into the directory
/var/dataSoft.bkp/dataman.stol1 located on the
datasw machine.
The following variables are located into the head of the script: to change their value the script must be stopped and restarted:
- set this variable to 1 or 2 to increment the log level detail:
# verbose level (1/2)
our $verbose = 1;- set this services variable to 0 (disable) or 1 (enable):
# active services (0/1)
our $srv_rsync = 1;- set this variables to authorize the script to delete oldest volume if there are no more space free:
# delete oldest volume (0/1)
our $can_delete = 1;... other variable must not be changed!
In order to start the script like a deamon, follow this rules:
- check if no instance are running:
# ps -edaf | grep dataM
if the list is not empty and there are processes like
dataManager.pl:main or
dataManager:rsync that are still running, please wait until they will finish before restart the script like a deamon. If not, a port conflict will occour. If you are restarting the process after a crash is possible that there is no
dataManager:rsync process actives but there is a lock file (
/var/dataman/rsync.lock ): if so, please remove it before starting the script. To start the script on the STOL1 machine (from the root account), you must:
# cd /var/dataman
# nohup /root/bin/dataManager.pl &
then check the log or/and the
nohup.out file:
# tail -f /var/log/dataManager.pl.log
If you are starting the script after a total shutdown, remember that the
correct script activation sequence is the following:
- the
fflGen.pl on stol02;
- both the
dataStorage.pl on st01 and st02;
- the
dataFinder.pl (data recover module) on stol02;
- the
dataManager.pl on stol01;
From any machine in the storage farm network simply type:
# dataCommand.pl stol01.storage:9000 "stop:<reasons>"
please note that the ":" is mandatory even no <reasons> are specified, than take a look to the log file or to the file
nohup.out:
# tail -f /var/log/dataManager.pl.log
a
rsync session may could be still active: wait for it end before doing a shutdown or restart the machine or the script: if that session will not finish the ffl list for that file will be not updated.
If the volume pointed by the
current_dest_label is no more available, a
/var/dataman/rsync.lock file is present and contains the string "Unavailable dest volume!" written inside. This means that
rsync process is stopped by this link, so do not remove it until the problem is fixed.
The files already sent to that volume are logged into the file
/var/dataman/rsync_done.host::volume. for this files, an automatic recover request is sent to the
dataFinder.pl module running on STOL2 so, if the file replicas are present in the backup stream, they are inserted into the
ffl list overriding replicas from the unavailable volume.
Please repair the volume or delete the oldest volume from the storage farm and do the following steps:
1 - if the volume is repaired, skip to step 7, if not please continue with step 2 and delete the oldest data from the offline farm;
2 - open the file
/var/dataman/full_volumes.list and select the first raw: copy it somewhere (pen & paper) and then delete it, also save the modified file;
3 - send the "delete volume" command to the
fflGen.pl module, this will remove files related to that volume from the ffl list (see
fflGen.pl howto about this command);
4 - do a login as root user into the selected host, go to the
/storage/selected_volume/data/DAQ/rawdata and type
rm -f *.gwf or send a "detele volume" to the dataStorage.pl module on that host (see
dataStorage.pl howto about this command);
5 - manual edit the
/var/dataman/current_dest_label and set it to the new couple
host::volume;
6 - send by
rsync files from the
/var/dataman/rsync_done.host::volume list to the new location: you can use the
/root/bin/dataListCopy.pl script, this script will also update the ffl list so, if you will use it, skip to step 8:
# dataListCopy.pl /var/dataman/rsync_done.host::volume newHost::newVolume
7 - if you have repaired the volume (step 2) or you used
rsync manually at step 6, please login as root into the new selected host (or to the machine that is hosting the repaired volume) and run the
fflAddDir.pl script:
# fflAddDir.pl sto02.storage /storage/volumeName/data/DAQ/rawdata hostName::volumeName
8 - wait some minutes, then simply check the
raw.ffl file, looking for the requested files.
In order to check the farm status (in terms of disk occupancy/availability), is possible to send a command direct to the
dataManager.pl module:
# dataCommand.pl stol01.storage:9000 "dumpstatus:/tmp/farm_status.txt"
this command will produce a file on
stol1 machine (in our example the file
/tmp/farm_status.txt) with a raw for each volume managed by the
dataStorage.pl modules and some information like (this is only a part of the file):
st8rear::v003 online(1) with 9% used
st8rear::v005 online(1) with 100% used
st8rear::v007 OFFLINE(33983) about 135932 minutes
st8rear::v201 online(1) with 100% used
we can see that there are full volumes (100% used), a partially filled volume (maybe is the current destination volume) and an offline volume. The number closed in round brackets is the polling interval request time: when this number grows over a fixed number (usually 3) the volume will be set as "unavailable" and moved out from the
raw.ffl file: in normal conditions the value is 0 or 1.
In order to fix a delte bug issue typical for big xfs fileystems, the volume deletion must be done manually:
- check with Procedure P1 how many volumes are available free (reported to be at "1%";