StorageFarmdataManager < InformationSystem < Wiki_Virgo

You are here: Wiki_Virgo_LSC>InformationSystem Web>WebSearch>StorageFarmdataManager (06 Jun 2019, Salconi)Edit Attach

The dataManager.pl script (v5.xx)

The dataManager.pl script runs like a deamon on the machine STOL01 at the port 9000 on the storage subnet (stol01.storage:9000).

- the executable is located at /root/bin/dataManager.pl
- the working directory with data and configuration files is: /var/dataman

Short description about it's functionalities:

The dataManager.pl script:

- scans periodically the data mount point ( /storage ) searching for .gwf (rawdata) files newer then the file specified into the /var/dataman/last_processed.file . If that files do not exist (usually at the first running cycle) the file is created empty and a standard time window will be used for the first scan. Then the new files will be added to the services processing queues (data replica, etc...)

- receives information from all the dataStorage.pl modules running on each storage farm disk host machine, keeping track of the available space for each volume involved. A trigger is reset each time that a status message is received for a volume. If no infos were received upon a time counter an alarm is send by email to the operators and a message is sent to the fflGen.pl module: the volume is no longer available and it will be removed from the ffl list. A soon as a status message for that volume is received, an email will be sent to the operators and a message to the fflGen.pl module in order to insert again that volume into the ffl list. If the unavailable volume is the current destination replica volume ( pointed by the file /var/dataman/current_dest_label ) , an email is sent to the operators and the rsync subprocess is stopped by making an rsync lock file (see procedure P1 in the appendix to fix the problem).

- runs an rsync subprocess (dataManager.pl:rsync) in order to replicate the rawdata files from the local storage area (the double equallogic array disk space) to the offline storage farm buffer volumes. For each rsync session, the space status of the destination volume will be checked: if the volume is 100% full, the current destination label is inserted at the bottom of the full volumes list ( located at /var/dataman/full_volumes.list ). If no other empty volumes are available (see procedure P2 in the appendix for know how to check the volumes status), the same list will be open and the first label will be cutted from the list and selected like the next destination volume. A "delete" command will be sent to the dataStorage.pl module that owns the volume and to the fflGen.pl module in order to delete the oldest files from the offline buffer, making free space for the new, incoming data. The rsync flows will automatically restart after few (usually 5) minutes.

NOTE: Due to fix an xfs issue on large file system, the data deletion is disabled by default and a deletion procedure P3 is implemented.

The rsync uses the rsyncd server configured on the target node. The username used to write files on the volumes of a node and the target volume labels are defined in the /etc/rsyncd.conf file on the node.

Configuration parameters and files

The following files must be present into the working directory,

this files can be manually edited:

- /var/dataman/local_host_name : usually stol01.storage, is the local host name (mapped on the storage subnet);

- /var/dataman/datastorage_host_list : contains a full list of all the machines that host a dataStorage.pl module;

- /var/dataman/fflgen_host_name : usually stol02.storage, is the name of the machine that hosts the module fflGen.pl;

- /var/dataman/datafinder_host_name : usually stol02.storage, is the name of the machine that hosts the module dataFinder.pl;

this files can be edited due to manage an exception:

- /var/dataman/full_volume.list : contains a list of couples "host::volume". Can be edited in order to remove from the list some volumes that will be managed by the system but will be never deleted/scratched again. A typical example is removing volumes filled by files related to a run;

- /var/dataman/next_dest_volume : usually is not present into the directory. This file, if present, must contain a list of labels (a couple "host::volume") that point to an empty volume. This file is used by the script when a volume is full and we want to force to select this volume like the next data destination;

- other files like .db .list and .lock files must be leaved untouched, changing or deleting them can cause system incosistency.

All files are automatically replicated each hour by a crond script into the directory /var/dataSoft.bkp/dataman.stol1 located on the datasw machine.

The following variables are located into the head of the script: to change their value the script must be stopped and restarted:

- set this variable to 1 or 2 to increment the log level detail:
# verbose level (1/2)
our $verbose = 1;

- set this services variable to 0 (disable) or 1 (enable):
# active services (0/1)
our $srv_rsync = 1;

- set this variables to authorize the script to delete oldest volume if there are no more space free:
# delete oldest volume (0/1)
our $can_delete = 1;

... other variable must not be changed!

Start the script

In order to start the script like a deamon, follow this rules:

- check if no instance are running:

# ps -edaf | grep dataM

if the list is not empty and there are processes like dataManager.pl:main or dataManager:rsync that are still running, please wait until they will finish before restart the script like a deamon. If not, a port conflict will occour. If you are restarting the process after a crash is possible that there is no dataManager:rsync process actives but there is a lock file ( /var/dataman/rsync.lock ): if so, please remove it before starting the script. To start the script on the STOL1 machine (from the root account), you must:

# cd /var/dataman

# nohup /root/bin/dataManager.pl &

then check the log or/and the nohup.out file:

# tail -f /var/log/dataManager.pl.log

If you are starting the script after a total shutdown, remember that the correct script activation sequence is the following:

- the fflGen.pl on stol02;

- both the dataStorage.pl on st01 and st02;

- the dataFinder.pl (data recover module) on stol02;

- the dataManager.pl on stol01;

Stop the script

From any machine in the storage farm network simply type:

# dataCommand.pl stol01.storage:9000 "stop:<reasons>"

please note that the ":" is mandatory even no <reasons> are specified, than take a look to the log file or to the file nohup.out:

# tail -f /var/log/dataManager.pl.log

a rsync session may could be still active: wait for it end before doing a shutdown or restart the machine or the script: if that session will not finish the ffl list for that file will be not updated.

Appendix - Procedure P1

If the volume pointed by the current_dest_label is no more available, a /var/dataman/rsync.lock file is present and contains the string "Unavailable dest volume!" written inside. This means that rsync process is stopped by this link, so do not remove it until the problem is fixed.

The files already sent to that volume are logged into the file /var/dataman/rsync_done.host::volume. for this files, an automatic recover request is sent to the dataFinder.pl module running on STOL2 so, if the file replicas are present in the backup stream, they are inserted into the ffl list overriding replicas from the unavailable volume.

Please repair the volume or delete the oldest volume from the storage farm and do the following steps:

1 - if the volume is repaired, skip to step 7, if not please continue with step 2 and delete the oldest data from the offline farm;

2 - open the file /var/dataman/full_volumes.list and select the first raw: copy it somewhere (pen & paper) and then delete it, also save the modified file;

3 - send the "delete volume" command to the fflGen.pl module, this will remove files related to that volume from the ffl list (see fflGen.pl howto about this command);

4 - do a login as root user into the selected host, go to the /storage/selected_volume/data/DAQ/rawdata and type rm -f *.gwf or send a "detele volume" to the dataStorage.pl module on that host (see dataStorage.pl howto about this command);

5 - manual edit the /var/dataman/current_dest_label and set it to the new couple host::volume;

6 - send by rsync files from the /var/dataman/rsync_done.host::volume list to the new location: you can use the /root/bin/dataListCopy.pl script, this script will also update the ffl list so, if you will use it, skip to step 8:

# dataListCopy.pl /var/dataman/rsync_done.host::volume newHost::newVolume

7 - if you have repaired the volume (step 2) or you used rsync manually at step 6, please login as root into the new selected host (or to the machine that is hosting the repaired volume) and run the fflAddDir.pl script:

# fflAddDir.pl sto02.storage /storage/volumeName/data/DAQ/rawdata hostName::volumeName

8 - wait some minutes, then simply check the raw.ffl file, looking for the requested files.

Appendix - Procedure P2

In order to check the farm status (in terms of disk occupancy/availability), is possible to send a command direct to the dataManager.pl module:

# dataCommand.pl stol01.storage:9000 "dumpstatus:/tmp/farm_status.txt"

this command will produce a file on stol1 machine (in our example the file /tmp/farm_status.txt) with a raw for each volume managed by the dataStorage.pl modules and some information like (this is only a part of the file):

st8rear::v003 online(1) with 9% used
st8rear::v005 online(1) with 100% used
st8rear::v007 OFFLINE(33983) about 135932 minutes
st8rear::v201 online(1) with 100% used

we can see that there are full volumes (100% used), a partially filled volume (maybe is the current destination volume) and an offline volume. The number closed in round brackets is the polling interval request time: when this number grows over a fixed number (usually 3) the volume will be set as "unavailable" and moved out from the raw.ffl file: in normal conditions the value is 0 or 1.

Appendix - Procedure P3

In order to fix a delte bug issue typical for big xfs fileystems, the volume deletion must be done manually:

- check with Procedure P1 how many volumes are available free (reported to be at "1%";

Topic revision: r1 - 06 Jun 2019, Salconi

InformationSystem

Copyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding Wiki_Virgo_LSC? Send feedback

The dataManager.pl script (v5.xx)

Short description about it's functionalities:

Configuration parameters and files

Start the script

Stop the script

Appendix - Procedure P1

Appendix - Procedure P2

Appendix - Procedure P3