This wiki is obsolete, see the NorduGrid web pages for up to date information.

NOX/Tests/A-REX test cases

The A-REX test should be carried out by testing various internal features, which are described further below. First, a description of the #Setup and test environment is given. Then the individual test cases are defined, described and evaluated.

Note that the list of test cases might not be complete, so if you have an additional test please add it before 2/11-09.
After discussing the test cases, a #Conclusion is drawn from the achieved test results. Suggested test cases which could not be carried out, for one reason or another, are listed at the bottom of the page.

The A-REX Description and Administrator's manual is located here: http://www.nordugrid.org/documents/arex_tech_doc.pdf.

This test is primarily carried out by Martin Skou Andersen. The #Batch system support test was carried out by several testers, which is documented in that section.

Setup and test environment

The test is carried out on an AMD Athlon(tm) MP Processor 1800+ (2 cores) running CentOS release 5.3 (Final) and ARC-1.RC3. The machine has 2.5 GB of memory.

The backend system used by A-REX is Torque (PBS), which has been configured to run on the same machine.

General A-REX configuration

The general A-REX configuration used in testing is given below. From test to test, individual elements in the configuration are changed to fit the given test case.

<?xml version="1.0"?>
<ArcConfig
  xmlns="http://www.nordugrid.org/schemas/arcconfig/2009/08"
  xmlns:tcp="http://www.nordugrid.org/schemas/tcp/2009/08"
  xmlns:arex="http://www.nordugrid.org/schemas/a-rex/LRMS/2009/08"
  xmlns:ip="http://www.nordugrid.org/schemas/a-rex/InfoProvider/2009/08"
  xmlns:lrms="http://www.nordugrid.org/schemas/a-rex/LRMS/2009/08"
>
  <Server>
    <PidFile>/var/run/arched.pid</PidFile>
    <Logger>
      <File>/root/arex-test/arex-test.log</File>
      <Level>INFO</Level>
    </Logger>
  </Server>
  <ModuleManager>
    <Path>/opt/nordugrid-arc-1.rc3/lib/arc/</Path>
  </ModuleManager>
  <Plugins>
    <Name>mcctcp</Name>
    <Name>mcctls</Name>
    <Name>mcchttp</Name>
    <Name>mccsoap</Name>
    <Name>arcshc</Name>
    <Name>identitymap</Name>
    <Name>arex</Name>
  </Plugins>
  <Chain>
    <Component name="tcp.service" id="tcp">
      <next id="tls"/>
      <tcp:Listen><tcp:Port>60000</tcp:Port></tcp:Listen>
    </Component>
    <Component name="tls.service" id="tls">
      <next id="http"/>
      <KeyPath>/etc/grid-security/hostkey.pem</KeyPath>
      <CertificatePath>/etc/grid-security/hostcert.pem</CertificatePath>
      <CACertificatesDir>/etc/grid-security/certificates</CACertificatesDir>
      <SecHandler name="arc.authz" id="pdps" event="incoming">
        <PDP name="simplelist.pdp" location="/etc/grid-security/grid-mapfile"/>
      </SecHandler>
      <SecHandler name="identity.map" id="map" event="incoming">
        <PDP name="allow.pdp"><LocalName>grid</LocalName></PDP>
      </SecHandler>
    </Component>
    <Component name="http.service" id="http">
      <next id="soap">POST</next>
      <next id="plexer">GET</next>
      <next id="plexer">PUT</next>
    </Component>
    <Component name="soap.service" id="soap">
      <next id="plexer"/>
    </Component>
    <Plexer name="plexer.service" id="plexer">
      <next id="a-rex">^/arex</next>
    </Plexer>
    <Service name="a-rex" id="a-rex">
      <arex:endpoint>https://heppc23.hepexp.nbi.dk:60000/arex</arex:endpoint>
      <arex:usermap><arex:defaultLocalName>grid</arex:defaultLocalName></arex:usermap>

      <arex:commonName>A-REX</arex:commonName>
      <arex:longDescription>ARC execution service</arex:longDescription>
      <arex:OperatingSystem>LINUX</arex:OperatingSystem>
      <arex:serviceMail>skou@nbi.dk</arex:serviceMail>
      <arex:InfoproviderWakeupPeriod>5</arex:InfoproviderWakeupPeriod>

      <arex:loadLimits>
          <arex:maxJobsTracked>1000</arex:maxJobsTracked>
          <arex:maxJobsRun>100</arex:maxJobsRun>
          <arex:maxJobsTransfered>20</arex:maxJobsTransfered>
          <arex:maxJobsTransferedAdditional>2</arex:maxJobsTransferedAdditional>
          <arex:maxFilesTransfered>4</arex:maxFilesTransfered>
          <arex:wakeupPeriod>5</arex:wakeupPeriod>
      </arex:loadLimits>
      <arex:dataTransfer>
          <arex:Globus>
              <arex:gridmapfile>/etc/grid-security/grid-mapfile</arex:gridmapfile>
              <arex:cadir>/etc/grid-security/certificates</arex:cadir>
              <arex:certpath>/etc/grid-security/hostcert.pem</arex:certpath>
              <arex:keypath>/etc/grid-security/hostkey.pem</arex:keypath>
          </arex:Globus>
      </arex:dataTransfer>
      <arex:jobLogPath>/var/log/arex-jobs.log</arex:jobLogPath>
      <arex:control>
          <arex:username>.</arex:username>
          <arex:controlDir>/tmp/alpha-queue/jobstatus</arex:controlDir>
          <arex:sessionRootDir>/tmp/alpha-queue/grid</arex:sessionRootDir>
          <arex:cache>
              <arex:location>
                  <arex:path>/tmp/alpha-queue/cache</arex:path>
              </arex:location>
          </arex:cache>
      </arex:control>

      <arex:LRMS>
          <arex:type>pbs</arex:type>
          <arex:defaultShare>alpha</arex:defaultShare>
          <arex:runtimeDir>/tmp/SOFTWARE</arex:runtimeDir>
          <lrms:pbs_bin_path>/home/grid/torque/bin</lrms:pbs_bin_path>
          <lrms:pbs_log_path>/var/spool/torque/server_logs</lrms:pbs_log_path>
      </arex:LRMS>

      <ip:ComputingService>
          <ip:Name>HEPPC23:Alpha</ip:Name>
          <ip:OtherInfo>Test cluster</ip:OtherInfo>
          <ip:AdminDomain>DK/NBI</ip:AdminDomain>

          <ip:ExecutionEnvironment name="localhost">
              <ip:CPUVendor>adotf</ip:CPUVendor>
              <ip:CPUModel>adotf</ip:CPUModel>
              <ip:CPUClockSpeed>adotf</ip:CPUClockSpeed>
              <ip:OSName>centos</ip:OSName>
              <ip:OSVersion>5.1</ip:OSVersion>
              <ip:ConnectivityIn>false</ip:ConnectivityIn>
              <ip:ConnectivityOut>true</ip:ConnectivityOut>
          </ip:ExecutionEnvironment>

          <ip:ComputingShare name="alpha">
              <ip:Description>HEPPC23 computing element</ip:Description>
              <ip:MappingQueue>alpha</ip:MappingQueue>
              <ip:ExecEnvName>localhost</ip:ExecEnvName>
              <ip:AuthorizedVO>nordugrid.org</ip:AuthorizedVO>
              <ip:MaxVirtualMemory>2000</ip:MaxVirtualMemory>
              <ip:SchedulingPolicy>fifo</ip:SchedulingPolicy>
          </ip:ComputingShare>
      </ip:ComputingService>
    </Service>
  </Chain>
</ArcConfig>

Run Time Environment (RTE)

The location of RTE scripts can be specified in the A-REX configuration using the LRMS/runtimeDir XML element in the service part. On the server side an RTE consists of a bash script. The functionality of RTEs on A-REX should be tested by specifying RTE requirements in the job description and checking that the scripts/executables corresponding to the requested RTEs were called three times. With each call a single argument should be passed, with the values "0", "1" and "2" in that order. "0" should be passed on the frontend, and "1" and "2" on the computing node.

A-REX configuration

The LRMS/runtimeDir element was set to the directory /tmp/SOFTWARE as seen in #General A-REX configuration. In that directory the following script was created:

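#!/bin/bash

# Record how the script was invoked (${0}), the stage argument (${1}) and the time.
output="Executing: ${0} ${1} at $(date)"

# Print to stdout and append to the test log.
echo "$output"
echo "$output" >> /root/arex-test/arex-software-output-test.log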

The script prints its arguments and the date to stdout and to the specified file. It was replicated several times under different names, in order to create multiple runtime environments.

# ls -R /tmp/SOFTWARE/
/tmp/SOFTWARE:
APP

/tmp/SOFTWARE/APP:
bar-2.3  bar-2.5  foo-1.3  foo-1.5          foo-bar-11.4.34  generic
bar-2.4  foo-1.2  foo-1.4  foo-bar-11.3.34  foo-bar-11.4.54

Test result

  • The job description submitted to A-REX contained the following RunTimeEnvironment element:
    <Resources>
      <RunTimeEnvironment>
        <Software>
          <Name>APP/bar</Name>
          <Version>2.5</Version>
        </Software>
      </RunTimeEnvironment>
    </Resources>

Which resulted in the following output:

[root@heppc23 ~]# cat /var/log/arex-software-output-test.log
Executing: /opt/nordugrid-arc-1.rc3/libexec/arc/submit-pbs-job 0 at Sat Oct 31 16:18:21 CET 2009
Executing: /var/spool/torque/mom_priv/jobs/1143.heppc23.hepexp.nbi.dk.SC 1 at Sat Oct 31 16:18:22 CET 2009
Executing: /var/spool/torque/mom_priv/jobs/1143.heppc23.hepexp.nbi.dk.SC 2 at Sat Oct 31 16:18:22 CET 2009

which was the expected output.
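The check on the log above can also be done mechanically. The following is a minimal sketch, assuming the log lines have exactly the "Executing: <caller> <argument> at <date>" form produced by the test script; the function names `rte_stages` and `rte_called_as_expected` are made up for illustration and are not part of ARC.

```python
import re

# Matches the lines written by the test RTE script:
#   Executing: <caller> <argument> at <date>
LINE_RE = re.compile(r"^Executing: (?P<caller>\S+) (?P<stage>\S+) at (?P<date>.+)$")


def rte_stages(log_text):
    """Return the (caller, argument) pairs found in the RTE test log, in order."""
    calls = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            calls.append((match.group("caller"), match.group("stage")))
    return calls


def rte_called_as_expected(log_text):
    """True if the script was called exactly three times with 0, 1, 2 in order."""
    return [stage for _, stage in rte_stages(log_text)] == ["0", "1", "2"]


# The log output quoted in the test result above.
SAMPLE = """\
Executing: /opt/nordugrid-arc-1.rc3/libexec/arc/submit-pbs-job 0 at Sat Oct 31 16:18:21 CET 2009
Executing: /var/spool/torque/mom_priv/jobs/1143.heppc23.hepexp.nbi.dk.SC 1 at Sat Oct 31 16:18:22 CET 2009
Executing: /var/spool/torque/mom_priv/jobs/1143.heppc23.hepexp.nbi.dk.SC 2 at Sat Oct 31 16:18:22 CET 2009
"""

print(rte_called_as_expected(SAMPLE))
```

The caller column also shows where each call happened: the first call comes from the submit-pbs-job script on the frontend, the other two from the PBS job script on the computing node.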

Data cache

The functionality of the data cache should be tested by checking that new files are created in the cache when a job with input data is processed. The testing should further check that the cached file is reused upon successive submissions and runs of the same job. The job.#.errors file and/or the content of the cache directory can be examined to check the functionality. Cache functionality should also be tested using the cache URL options of input files - see the "cache" option description in http://www.nordugrid.org/documents/URLs.pdf for more info.

A-REX configuration

Cache can be configured in the GM-config configuration file by using the attributes cachedir and cachesize. The equivalents in the XML configuration style are control/cache/location/path, control/cache/lowWatermark and control/cache/highWatermark.

Notes on documentation

  • The documentation says that cache can be turned off per-file by using the URL option cache=no. In this case it should also be specified that cache can be turned off by setting DataStaging/DownloadToCache to false in the job description.
  • From the documentation it is unclear whether caching local input files (local to the user submitting) is supported or not. It seems from the source code that caching local input files is not supported. A clarification on this is needed.

Test result

  • In association with David an error in the code was found, which made the XML cache configuration non-functional. This was fixed in revision 15443.
  • Unexpected behaviour was observed in this test, see notes below.
  • A-REX was configured with:
          <arex:cache>
              <arex:location>
                  <arex:path>/scr/arex-cache</arex:path>
              </arex:location>
          </arex:cache>

The log output shows:

[2009-11-02 14:21:26] [Arc] [INFO] [5320/142013984] Service side MCCs are loaded
[2009-11-02 14:21:26] [Arc.AREX:GM] [INFO] [5320/142193792] Used configuration file /tmp/arexATMQ2U
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] Added user : (empty)
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	Session root dir : /tmp/alpha-queue/grid
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	Control dir      : /tmp/alpha-queue/jobstatus
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	default LRMS     : pbs
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	default queue    : (empty)
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	default ttl      : 604800
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	Cache            : /scr/arex-cache
[2009-11-02 14:21:26] [Arc] [INFO] [5320/142193792] 	Cache cleaning disabled
[2009-11-02 14:21:26] [Arc.AREX:GM] [INFO] [5320/142193792] Starting jobs' monitoring

Then when trying to submit the following job description:

<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition
 xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl"
 xmlns:posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <JobDescription>
    <JobIdentification>
      <JobName>Cache test</JobName>
    </JobIdentification>
    <Application>
      <posix:POSIXApplication>
        <posix:Executable>/bin/echo</posix:Executable>
        <posix:Argument>Cache test</posix:Argument>
        <posix:Output>out.txt</posix:Output>
        <posix:Error>err.txt</posix:Error>
      </posix:POSIXApplication>
    </Application>
    <DataStaging>
      <FileName>random.dat</FileName>
      <DownloadToCache>true</DownloadToCache>
      <Source>
        <URI>http://hep.nbi.dk/~skou/arex-test/random-50Mb.dat</URI>
      </Source>
    </DataStaging>
  </JobDescription>
</JobDefinition>

the downloader shows the following:

[2009-11-09 10:35:29] [Arc.DataMover] [INFO] [9464/149996512] Transfer from http://hep.nbi.dk:80/~skou/arex-test/random-50Mb.dat to file:/tmp/alpha-queue/grid/3139612577593181635244908/random.dat
[2009-11-09 10:35:29] [Arc.DataMover] [INFO] [9464/149996512] Real transfer from http://hep.nbi.dk:80/~skou/arex-test/random-50Mb.dat to file:/tmp/alpha-queue/grid/3139612577593181635244908/random.dat
[2009-11-09 10:35:29] [Arc.DataMover] [INFO] [9464/149996512] cache file: /scr/arex-cache/data/aa/192de9a1408879a2189bac3f38c8fcd3385967

And the directory listing of the cache shows:

[root@heppc23 arex-cache]# ls -lhR
.:
total 8.0K
drwxr-xr-x 3 root root 4.0K Nov  9 10:35 data
drwxr-xr-x 2 root root 4.0K Nov  9 10:37 joblinks

./data:
total 4.0K
drwx------ 2 root root 4.0K Nov  9 10:35 aa

./data/aa:
total 48M
-rw-r--r-- 1 root root 48M Nov  9 10:35 192de9a1408879a2189bac3f38c8fcd3385967
-rw------- 1 root root 123 Nov  9 10:35 192de9a1408879a2189bac3f38c8fcd3385967.meta

./joblinks:
total 0

Which shows that cache works.

  • When enabling cache cleaning in A-REX the computed size of disk space corresponding to the high- and low-watermark should be reported as an INFO message.
  • With cache cleaning enabled in A-REX:
          <arex:cache>
              <arex:location>
                  <arex:path>/tmp/alpha-queue/cache/</arex:path>
              </arex:location>
              <arex:highWatermark>2</arex:highWatermark>
              <arex:lowWatermark>1</arex:lowWatermark>
          </arex:cache>

With the cache on a 41 GB disk, cleaning is expected to start when the cache size exceeds 820 MB (2% of the disk) and to reduce the cache size to below 410 MB (1%). This was tested by submitting 20 jobs, each requesting a 50 MB input file from a different location.

<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition
 xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl"
 xmlns:posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <JobDescription>
    <JobIdentification>
      <JobName>Cache test</JobName>
    </JobIdentification>
    <Application>
      <posix:POSIXApplication>
        <posix:Executable>/bin/sleep</posix:Executable>
        <posix:Argument>1m</posix:Argument>
        <posix:Output>out.txt</posix:Output>
        <posix:Error>err.txt</posix:Error>
      </posix:POSIXApplication>
    </Application>
    <DataStaging>
      <FileName>random.dat</FileName>
      <DownloadToCache>true</DownloadToCache>
      <Source>
        <URI>http://hep.nbi.dk/~skou/arex-test/random-50Mb.$i.dat</URI>
      </Source>
    </DataStaging>
  </JobDescription>
</JobDefinition>
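The `$i` in the URI above is a placeholder for the file number. A minimal sketch of how the 20 job descriptions could be generated from that template is given here; the numbering 1 to 20 and the function name `job_descriptions` are assumptions, since the page does not state how the placeholder was filled in.

```python
from string import Template

# The job description from above, with $i left as a placeholder.
JOB_TEMPLATE = Template("""\
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition
 xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl"
 xmlns:posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <JobDescription>
    <JobIdentification>
      <JobName>Cache test</JobName>
    </JobIdentification>
    <Application>
      <posix:POSIXApplication>
        <posix:Executable>/bin/sleep</posix:Executable>
        <posix:Argument>1m</posix:Argument>
        <posix:Output>out.txt</posix:Output>
        <posix:Error>err.txt</posix:Error>
      </posix:POSIXApplication>
    </Application>
    <DataStaging>
      <FileName>random.dat</FileName>
      <DownloadToCache>true</DownloadToCache>
      <Source>
        <URI>http://hep.nbi.dk/~skou/arex-test/random-50Mb.$i.dat</URI>
      </Source>
    </DataStaging>
  </JobDescription>
</JobDefinition>
""")


def job_descriptions(count):
    """Return `count` job descriptions, each pointing at a different input file.

    Assumption: the input files are numbered 1..count.
    """
    return [JOB_TEMPLATE.substitute(i=index) for index in range(1, count + 1)]


descriptions = job_descriptions(20)
print(len(descriptions))
```

Each generated string can be written to its own file and submitted with the usual client tools.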

By listing the directory structure of the cache location (/scr/arex-cache), the cleaning was observed. Some unexpected behaviour was also observed, and is commented on below:

  • The cleaning starts even if only a single file (50 MB) is downloaded. This is probably because the disk where the cache resides is already filled with other data, so the used space already exceeds the high watermark. However, this is unclear, since the documentation specifically mentions that the watermarks specify the size of the cache as a percentage of the disk size. Either a clarification in the documentation is needed, or the cache size should instead be calculated from the actual space taken up by the cache.
  • After changing the watermarks to be 2/1 percent above the existing disk space usage, another issue was observed. The cleaning starts and cleans cache files down to the low watermark, as expected. Since jobs are still running, their input files still reside in the joblinks directory. When a new cleaning starts, additional cache files are cleaned, amounting to the same space as was cleaned in the previous run. Whether this is the intended behaviour is unclear, because the space taken by the "cleaned" files is only freed when the jobs that used them finish. Either a clarification in the documentation is needed, or the behaviour of the cache cleaner should be modified.
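The two readings of the watermarks discussed above can be made concrete with a small calculation. This is only a sketch of the arithmetic, not of the actual cache cleaner: the function names are made up, the disk size is taken as 41000 MB, and the 10000 MB of other data on the disk is a hypothetical figure.

```python
def watermarks_mb(disk_size_mb, high_percent, low_percent):
    """Convert watermark percentages into absolute sizes in MB."""
    return (disk_size_mb * high_percent / 100, disk_size_mb * low_percent / 100)


def cleaning_starts(cache_mb, other_used_mb, disk_size_mb, high_percent, whole_disk):
    """Decide whether cleaning would start under one of the two interpretations.

    whole_disk=False: only the space taken by the cache counts (as documented).
    whole_disk=True:  all used space on the disk counts (as the observation suggests).
    """
    high_mb = disk_size_mb * high_percent / 100
    used_mb = cache_mb + other_used_mb if whole_disk else cache_mb
    return used_mb > high_mb


# 41 GB disk with watermarks 2 and 1: cleaning from 820 MB down to 410 MB.
print(watermarks_mb(41000, 2, 1))

# A single 50 MB cached file on a disk that already holds other data
# (10000 MB is a hypothetical figure for that other data).
print(cleaning_starts(50, 10000, 41000, 2, whole_disk=False))
print(cleaning_starts(50, 10000, 41000, 2, whole_disk=True))
```

Under the first reading a single 50 MB file is far below the high watermark; under the second, cleaning starts immediately, which matches what was observed.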

Multiple caches

This test extends the #Data cache test. Multiple cache directories should be set using the INI attribute cachedir or the XML element control/cache/location/path. The test should be carried out by submitting multiple jobs which specify multiple input files. The test is successful if the cached files are distributed evenly between the defined cache directories, with a given file residing in only a single cache directory.
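The success criterion could be checked along the following lines. This is a sketch only: it assumes each cache directory uses the data/<subdir>/<hash> layout with accompanying .meta files seen in the #Data cache listing, and the function names and the cache paths in the example are hypothetical.

```python
import os
from collections import Counter


def list_cached_files(cache_dir):
    """Return the names of cached data files under cache_dir/data, skipping .meta files."""
    names = []
    for _root, _dirs, files in os.walk(os.path.join(cache_dir, "data")):
        names.extend(name for name in files if not name.endswith(".meta"))
    return names


def distribution_report(files_per_cache):
    """Summarise how cached files are spread over the cache directories.

    files_per_cache maps a cache directory to the list of file names found in it.
    Returns (number of files per cache, sorted names present in more than one cache).
    """
    counts = {cache: len(names) for cache, names in files_per_cache.items()}
    seen = Counter(name for names in files_per_cache.values() for name in set(names))
    duplicates = sorted(name for name, hits in seen.items() if hits > 1)
    return counts, duplicates


# Hypothetical example: file "b" wrongly ended up in both caches.
example = {"/tmp/cache1": ["a", "b"], "/tmp/cache2": ["b", "c"]}
print(distribution_report(example))
```

In a real run the dictionary would be built with `list_cached_files` for each configured cache directory; roughly equal counts and an empty duplicate list would indicate success.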

Test result

Running under non-root account

A-REX should be tested by starting and running ARC-HED as non-root user. The session and control directories must be owned by that user and all jobs internally should belong to that user.

Test result

A-REX was configured by setting logfile, pidfile, keypath, joblogpath, controldir, sessiondir and cachedir to be owned by a non-root user. This non-root user ran A-REX with the mentioned configuration and no problems were observed; all A-REX processes belonged to that user.

Computing nodes without shared file system

A-REX assumes by default that the filesystem is shared with the computing node(s). If the shared filesystem is turned off, the backend scripts in A-REX should direct the LRMS to move files to and from the computing node. The LRMS must be configured to support this. The result of this action should be verified by watching the content of the session directory on the computing node. In INI style the usage of a shared filesystem can be switched off with the shared_filesystem attribute, which takes a 'yes'/'no' value. In XML style this attribute is named LRMS/sharedFilesystem.

Notes on documentation

  • The documentation of the shared_filesystem attribute does not clearly specify what the default value of this attribute is. Nor is it stated that the LRMS might need additional configuration for this functionality to work.
  • The documentation does not offer much help on how to set up computing nodes without a shared filesystem.

Test result

  • The test setup was limited to a single machine, and thus A-REX and the computing node actually shared the same filesystem, so it was not possible to carry out the test in a real situation.
  • Several solutions were tried; however, none of them was successful.
  • First the LRMS/sharedFilesystem element was set to false, which resulted in jobs failing with the message "Job submission to LRMS failed". The job.<ID>.errors file reports the following:
Need to know at which directory to run job: RUNTIME_LOCAL_SCRATCH_DIR must be set if RUNTIME_NODE_SEES_FRONTEND is empty
/opt/nordugrid-arc-1.rc3/libexec/arc/submit-pbs-job: line 38: /tmp/alpha-queue/jobstatus/job.293301257857816840648002.failed: Permission denied

However the description of the shared_filesystem attribute does not mention any other attributes which should be configured. If other attributes should be configured this should be noted in the description of the shared_filesystem attribute.

  • Then the LRMS/sharedScratch element was also used, setting it to an existing directory. Now jobs simply got stuck in the LRMS. "State: INLRMS from SUBMIT" is the last message reported by A-REX. The created PBS script contains stagein and stageout arguments.
  • After changing the ownership of the directory pointed to by LRMS/sharedScratch to the runtime user, an error message was reported in the job.ID.errors file: /var/spool/torque/mom_priv/jobs/1177.heppc23.hepexp.nbi.dk.SC: line 112: /tmp/alpha-queue/node-storage/291121257235468982029490/out.txt: Too many levels of symbolic links.

Job processing load limit

The maximum number of jobs processed in the different internal states can be configured in A-REX. This test should check that the configured limit is respected by A-REX. The test should be carried out by submitting many jobs and then checking the distribution over internal states with the gm-jobs utility. There should be no more jobs in a particular state than specified in the single configuration parameter maxjobs. The XML equivalents are loadLimits/maxJobsTracked and loadLimits/maxJobsRun.

Test result

  • The following loadLimits configuration was used in the test:
      <arex:loadLimits>
          <arex:maxJobsTracked>20</arex:maxJobsTracked>
          <arex:maxJobsRun>10</arex:maxJobsRun>
          <arex:wakeupPeriod>10</arex:wakeupPeriod>
      </arex:loadLimits>

On the client side 100 jobs were submitted (20-minute sleep jobs). The gm-jobs utility was run with the -c argument pointing to the configuration file in use. The output was as follows:

Jobs total: 100
 ACCEPTED: 0 (0)
 PREPARING: 90 (90)
 SUBMIT: 0 (0)
 INLRMS: 10 (0)
 FINISHING: 0 (0)
 FINISHED: 0 (0)
 DELETED: 0 (0)
 CANCELING: 0 (0)
 Accepted: 100/20
 Running: 10/10
 Processing: 0+0/-1077179128+4745596

which does not seem to respect the configured values (Accepted: 100/20).
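The comparison can be automated by parsing the summary lines of the output. The sketch below assumes that the "Accepted" and "Running" lines have the form "<current>/<limit>", as the output above suggests; the function name `exceeded_limits` is made up for illustration.

```python
import re

# Summary lines of the form "Accepted: 100/20" or "Running: 10/10".
LIMIT_RE = re.compile(r"^\s*(Accepted|Running):\s*(\d+)/(\d+)\s*$")


def exceeded_limits(gm_jobs_output):
    """Return {name: (current, limit)} for every summary line where current > limit."""
    exceeded = {}
    for line in gm_jobs_output.splitlines():
        match = LIMIT_RE.match(line)
        if not match:
            continue
        name, current, limit = match.group(1), int(match.group(2)), int(match.group(3))
        if current > limit:
            exceeded[name] = (current, limit)
    return exceeded


# The gm-jobs output quoted above.
SAMPLE = """\
Jobs total: 100
 ACCEPTED: 0 (0)
 PREPARING: 90 (90)
 SUBMIT: 0 (0)
 INLRMS: 10 (0)
 FINISHING: 0 (0)
 FINISHED: 0 (0)
 DELETED: 0 (0)
 CANCELING: 0 (0)
 Accepted: 100/20
 Running: 10/10
 Processing: 0+0/-1077179128+4745596
"""

print(exceeded_limits(SAMPLE))
```

The "Processing" line is deliberately ignored, since its limit values in the output above do not look meaningful.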

Data processing load limit

This test should check that the maximum number of simultaneously running uploaders and downloaders corresponds to the configured value. It should be carried out by submitting many jobs containing input/output files to A-REX. The test is successful if the configured value is respected.

Configuration

In INI style the maxload attribute is used to configure the maximum number of simultaneously running uploaders and downloaders, and in XML the corresponding elements are loadLimits/maxJobsTransfered, loadLimits/maxJobsTransferedAdditional and loadLimits/maxFilesTransfered.

Test result

Data shortcut

In this test A-REX should mirror data from another location. It should then be checked that when a job requests data from that location, the local mirrored data is used instead. The copyurl INI attribute specifies the location being mirrored and the local path where the mirrored data is stored. In XML the equivalents are dataTransfer/mapURL/from and dataTransfer/mapURL/to.

Notes on documentation

  • The description of linkurl refers to local_path, however it is not clear from the context what local_path is. And it does not seem it is mentioned elsewhere in the documentation.
  • It is not clear from the description of linkurl what replacement is. It says: "replacement specifies the way to access the file from the frontend, and is used to check permissions". It should be stated more clearly that replacement is a path which will be used to create the symbolic link. Same with node_path.

Test result

  • The test revealed an error in the configuration code. Revision 15400 fixes this issue.
  • The test was run with the dataTransfer configuration shown below.
<arex:dataTransfer>
  <arex:mapURL link="false">
    <arex:from>http://hep.nbi.dk/~skou/arex-test/</arex:from>
    <arex:to>/tmp/alpha-queue/data-mirror/</arex:to>
  </arex:mapURL>
  ...
</arex:dataTransfer>

and submitting the following job description:

<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition
 xmlns="http://schemas.ggf.org/jsdl/2005/11/jsdl"
 xmlns:posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
  <JobDescription>
    <JobIdentification>
      <JobName>Data shortcut test</JobName>
    </JobIdentification>
    <Application>
      <posix:POSIXApplication>
        <posix:Executable>/bin/echo</posix:Executable>
        <posix:Argument>Data shortcut test</posix:Argument>
        <posix:Output>out.txt</posix:Output>
        <posix:Error>err.txt</posix:Error>
      </posix:POSIXApplication>
    </Application>
    <DataStaging>
      <FileName>random.dat</FileName>
      <Source>
        <URI>http://hep.nbi.dk/~skou/arex-test/random.dat</URI>
      </Source>
    </DataStaging>
  </JobDescription>
</JobDefinition>

which gave the following log output:

[2009-11-02 15:05:44] [Arc.DataMover] [INFO] [16275/145113912] Transfer from http://hep.nbi.dk:80/~skou/arex-test/random.dat to file:/tmp/alpha-queue/grid/1608512571707431353917448/random.dat
[2009-11-02 15:05:44] [Arc.DataMover] [INFO] [16275/145113912] Real transfer from http://hep.nbi.dk:80/~skou/arex-test/random.dat to file:/tmp/alpha-queue/grid/1608512571707431353917448/random.dat
[2009-11-02 15:05:44] [Arc.URLMap] [INFO] [16275/145113912] Mapping http://hep.nbi.dk:80/~skou/arex-test/random.dat to file:/tmp/alpha-queue/data-mirror/random.dat
[2009-11-02 15:05:44] [Arc.DataMover] [INFO] [16275/145113912] URL is mapped to local access - checking permissions on original URL

which was as expected.
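The mapping seen in the log amounts to a prefix replacement. The sketch below illustrates that idea only and is not the Arc.URLMap implementation; the function names are made up, and adding the default port before comparison is an assumption based on the ":80" appearing in the logged URLs.

```python
from urllib.parse import urlsplit

# Default ports added before comparison (assumption based on the log output).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalise(url):
    """Return the URL with an explicit port, e.g. http://host:80/path."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = parts.hostname if port is None else f"{parts.hostname}:{port}"
    return f"{parts.scheme}://{netloc}{parts.path}"


def map_url(url, url_prefix, local_dir):
    """Map url to a local file: URL if it starts with url_prefix, else return None."""
    full = normalise(url)
    prefix = normalise(url_prefix)
    if not full.startswith(prefix):
        return None
    return "file:" + local_dir + full[len(prefix):]


# Values taken from the mapURL configuration and job description above.
print(map_url(
    "http://hep.nbi.dk/~skou/arex-test/random.dat",
    "http://hep.nbi.dk/~skou/arex-test/",
    "/tmp/alpha-queue/data-mirror/",
))
```

A URL outside the configured prefix yields None here, meaning it would be fetched from its original location.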

Data transfer directly from/to computing node

Here it should be tested that when localtransfer is set to yes in the configuration, the uploader and downloader are run as part of job execution on the computing node, and not through A-REX. The equivalent XML element is dataTransfer/localTransfer, which takes a boolean.

Responsivity control

The response time of the A-REX/grid-manager can be configured with the wakeupperiod attribute, and it should be checked that the control directory is scanned for changes at the configured interval. The equivalent XML element is loadLimits/wakeupPeriod.

Test result

  • No difference in responsiveness is registered when using the following two configurations:
      <arex:loadLimits>
          ...
          <arex:wakeupPeriod>10</arex:wakeupPeriod>
      </arex:loadLimits>

and

      <arex:loadLimits>
          ...
          <arex:wakeupPeriod>600</arex:wakeupPeriod>
      </arex:loadLimits>

And this holds both when submitting a Hello World job, and a 10 seconds sleep job.

Forcing secure transfer

Command: securetransfer

Specifies whether gridftp data transfer should be done using data channel encryption. I do not know exactly how to test it, but it should be visible from the gridftp server logs.

Forcing passive transfer for frontend behind NAT

Command: passivetransfer

Makes ftp-like transfers happen in passive mode (the client, A-REX in this case, initiates the data connection). Can be tested by configuring a firewall to block incoming connections (except the one needed for job submission) and submitting a job with input data residing on a gridftp server.

Authorization plugins

Executing authorization plugins when jobs reach a specified state should be tested. This can be configured with the authplugin INI attribute. The XML equivalents are authPlugin/state and authPlugin/command, plus the additional XML attributes authPlugin.timeout, authPlugin.onSuccess, authPlugin.onFailure and authPlugin.onTimeout. It can be tested by writing simple scripts that log job IDs and state names to a log file.
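A logging script of the kind suggested above could look like the following sketch. It assumes that the plugin command is configured so that the job ID and the state name are passed as the first and second command-line arguments, and that exit code 0 is treated as success; both are assumptions to be checked against the A-REX documentation. The log path and function names are hypothetical.

```python
import sys
from datetime import datetime

# Hypothetical log file location; pick any path writable by the A-REX user.
LOG_PATH = "/tmp/authplugin-test.log"


def format_entry(job_id, state, when):
    """Build one log line: timestamp, job ID and state name."""
    return f"{when:%Y-%m-%d %H:%M:%S} {job_id} {state}"


def log_state(log_path, job_id, state):
    """Append a line for this job ID and state to the log file."""
    with open(log_path, "a") as log_file:
        log_file.write(format_entry(job_id, state, datetime.now()) + "\n")


# Assumption: invoked as "<script> <job id> <state>" by the plugin configuration.
if __name__ == "__main__" and len(sys.argv) >= 3:
    log_state(LOG_PATH, sys.argv[1], sys.argv[2])
    sys.exit(0)  # assumed to signal success to A-REX
```

After a job has run through its states, the log file should contain one line for each state the plugin was configured for.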

Notes on documentation

  • The description of the authplugin attribute should more clearly specify that plugin is the command which is run when reaching the specified state. It should also be noted that everything after options is considered part of plugin.
  • The description of the timeout option should mention what the default timeout is (a fixed value or running forever).

Test result

Session directories on NFS

Command: norootpower

Allows session directories to reside on an NFS server configured with root squashing. Testing it probably requires an NFS server.

Computing element profiles

Two profiles currently exist which configure an A-REX service: SecureComputingElementWithFork and NonSecureComputingElementWithFork. The profiles from revision 15547 have been tested with the RC3 release.

Notes on the profile

  • The profiles should set the default loglevel of ARCHED and all services to the same level. The default should be WARNING.
  • It is not possible to set IP-version in the INI-configuration.
  • Since this is only a computing element profile, registration to an ISIS should not be done. This should be available in a separate profile.
  • The fork_job_limit should have another default value than cpunumber. 1 is suggested.

Test result

  • A-REX was successfully configured and started with the SecureComputingElementWithFork profile. Submitting jobs to A-REX was also successful. From the log it is evident that the job is submitted and reaches the INLRMS state. However, no further progress is registered for the given job, even when the job has finished. This has been reported in the #Batch system support section as well.

Batch system support

Support needs to be checked for each of the 7 supported LRMSes. Some basic things to check:

  • Submit job: check A-rex can submit a job to the LRMS. A-rex should move the job to INLRMS state. Check that job is indeed submitted (by using LRMS's commands).
  • Query state: check that the state reported by arcstat corresponds to the state of the job in the LRMS. Only check the most common states (INLRMS:Q and INLRMS:R).
  • Query exit code: check that exit code returned by the user's executable is reported correctly. Look for exitcode= in job.*.diag file. Also check with arcstat. Test with one job that returns 0 and one job that returns some other value.
  • Cancel job: check that A-rex can cancel a job in LRMS. A-rex should move the job to state CANCELING, then FINISHED (look in job.*.status file). Check that job is no longer in the LRMS.
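For the exit code check, the exitcode= line can be extracted from the diag file with a few lines of code. This is a sketch only: the function name is made up, and the sample diag content is hypothetical apart from the exitcode= key mentioned above.

```python
import re

# The exit code is reported on a line of its own as "exitcode=<number>".
EXITCODE_RE = re.compile(r"^exitcode=(-?\d+)\s*$", re.MULTILINE)


def parse_exitcode(diag_text):
    """Return the exit code found in the content of a job.*.diag file, or None."""
    match = EXITCODE_RE.search(diag_text)
    return int(match.group(1)) if match else None


# Hypothetical diag content; only the exitcode= line is taken from the text above.
sample_diag = "nodename=localhost\nexitcode=3\n"
print(parse_exitcode(sample_diag))
```

Running it against the diag files of one job that returns 0 and one that returns another value covers the two cases listed above.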

Test results

LRMS    Version  Tester         Submit job  Query state  Query exit code  Cancel job
Fork    -        Martin Skou    OK          OK           OK - see notes   OK
Torque  2.1.7    Marek          OK          OK           OK               OK
Torque  2.3.7    Marek          OK          OK           OK               OK
SGE     6.1u6
SGE     6.2      Adrian
LL      3.5.1.1  Adrian&Anders  OK          OK           OK               OK
LSF
SLURM
Condor  7.3.1    Adrian         OK          OK           OK               OK

Notes

  • The issue below was observed when using fork as LRMS. This issue was fixed in revision 15646 and the fix has been verified as well.
  • When using fork as LRMS the job gets stuck in state INLRMS (INLRMS:EXECUTED), even when the job has finished. When trying to kill such a job, the following is reported:
[2009-11-09 15:40:36] [Arc] [INFO] [10627/159344184] 106271257774686121481425: Canceling job ((empty)) because of user request
[2009-11-09 15:40:36] [Arc] [INFO] [10627/159344184] 106271257774686121481425: State: CANCELING from INLRMS
[2009-11-09 15:40:36] [Arc] [INFO] [10627/159344184] 106271257774686121481425: state CANCELING: starting child: /opt/nordugrid-arc-1.rc3/libexec/arc/cancel-fork-job
[2009-11-09 15:40:37] [Arc] [INFO] [10627/159344184] 106271257774686121481425: state CANCELING: child exited with code 1
[2009-11-09 15:40:37] [Arc] [ERROR] [10627/159344184] 106271257774686121481425: Failed to cancel running job
[2009-11-09 15:40:38] [Arc] [ERROR] [10627/159344184] 106271257774686121481425: Job failure detected

and afterwards the job is moved to state FINISHED:

[2009-11-09 15:40:38] [Arc] [INFO] [10627/159344184] 106271257774686121481425: State: FINISHING from CANCELING
[2009-11-09 15:40:38] [Arc] [INFO] [10627/159344184] 106271257774686121481425: state: PREPARING/FINISHING: starting new child
[2009-11-09 15:40:38] [Arc] [INFO] [10627/159344184] 106271257774686121481425: State FINISHING: starting child: /opt/nordugrid-arc-1.rc3/libexec/arc/uploader
[2009-11-09 15:40:39] [Arc] [INFO] [10627/159344184] 106271257774686121481425: State: FINISHING: child exited with code: 0
[2009-11-09 15:40:39] [Arc] [INFO] [10627/159344184] 106271257774686121481425: State: FINISHED from FINISHING
[2009-11-09 15:40:39] [Arc] [INFO] [10627/159344184] 106271257774686121481425: Job is requested to clean - deleting

LIDI support

Additional notes

  • When starting ARCHED without specifying tcp:Listen/tcp:Version an ERROR will be reported:
[2009-10-31 13:15:19] [Arc.MCC.TCP] [INFO] [6921/143762976] Listening on TCP port 60000(IPv6)
[2009-10-31 13:15:19] [Arc.MCC.TCP] [ERROR] [6921/143762976] Failed to bind socket for TCP port 60000(IPv4): Address already in use

No ERROR message should be given since ARCHED succeeds in using version 6.

  • A-REX does not respect the loglevel specified in the service section Service/debugLevel.
  • Every time a user submits a job a WARNING about credentials expiration is given:
[2009-11-09 10:06:18] [Arc] [WARNING] [25574/136769816] Certificate /O=Grid/O=NorduGrid/OU=nbi.dk/CN=Martin Skou Andersen/CN=1418772567 will expire in 11 hours 52 minutes 7 seconds

This message should be demoted to DEBUG level. Also, policy INFO messages are printed with every submission:

[2009-11-09 10:06:18] [Arc.SimpleListPDP] [INFO] [25574/136769816] subject: /O=Grid/O=NorduGrid/OU=nbi.dk/CN=Martin Skou Andersen
[2009-11-09 10:06:18] [Arc.SimpleListPDP] [INFO] [25574/136769816] policy line: "/O=Grid/O=NorduGrid/OU=nbi.dk/CN=Martin Skou Andersen" grid
[2009-11-09 10:06:18] [Arc.SimpleListPDP] [INFO] [25574/136769816] subject: /O=Grid/O=NorduGrid/OU=nbi.dk/CN=Martin Skou Andersen

These should also be demoted to DEBUG level.

Conclusion

Note: The test is ongoing.

Some issues were encountered when carrying out the test.

The #Data cache test was unsuccessful, since the input files were not stored in the specified cache directory. Also, when specifying control/cache/highWatermark the cache was not enabled. In the #Data shortcut test an error was found and fixed, after which the test was successful.

During the testing, the A-REX documentation has been used as a reference, and several notes on it can be found in dedicated sections in the respective tests, see the table of contents. The documentation does not cover XML configuration.

Also please see the #Additional notes section.

Untested cases

The following test cases have not been carried out. For the exact reason see the specific test case.

Dynamic RTE (janitor)

Documentation on janitor can be found in trunk/doc/tech_doc/janitor/Janitor.pdf. There also exists a configuration example, which is located at trunk/src/services/arex/arex_janitor.xml.example.in.

Note: A janitor specific test will be carried out by Gábor Szigeti, so a janitor test will not be done here.

Cache sharing

Test not carried out, since this feature is not implemented in the KnowARC Final Release.

Command: remotecachedir

There may be multiple A-REX/grid-manager instances running, each with its own cache. The cache of one instance may be used by another one. This could be tested by running 2 instances, processing a job in instance 1, then processing a similar job in instance 2. The second job should be able to use the file cached by instance 1.

Data processing load limit by VO/user

Test not carried out, since this feature is not implemented in the KnowARC Final Release.

Command: maxloadshare

Similar to maxload but applies per share. A share is defined per distinct value of the specified share type. Can be tested by adding maxloadshare="# dn" and submitting jobs using 2 different credentials. If compiled with VOMS support it would be nice to test it for VOMS share types.

Data transfer retries

Test not carried out, since this feature is not implemented in the KnowARC Final Release.

Command: maxtransfertries

Allows retries of data transfers. Please contact David for details.

BES interface

The test does not cover the features of the BES interface implemented by A-REX, since it was not possible to construct a test scenario.

Data transfer timeout and speed control/timeout

This test should set the timeout, minimum speed, minimum average speed and maximal inactivity for data transfers, and it should then be checked that transfers follow these settings. The attribute speedcontrol can be used to configure these settings. The XML equivalents are dataTransfers/timeouts/minSpeed, dataTransfers/timeouts/minSpeedTime, dataTransfers/timeouts/minAverageSpeed and dataTransfers/timeouts/maxInactivityTime.

Test result

No test scenario could be created.