anafile

anafile (1)     analyze a file and print statistics.     (V7.2 (January 2007))


Syntax
anafile [–v|–q] [–l] [–a] [–u] [–cc[g]] [–head[=n]|–hash] [–d[c]] [–f[x] format_file] [–w width] [[–t table_structure] input_file...]

Description
The anafile program is meant mainly to check the concordance of a description file (generally named ReadMe) with the actual data files, and to report the inconsistencies.

Without a description file, anafile analyses the beginning or the complete file(s), assigns a file class to each file, and lists on request the column (byte-per-byte) statistics. The possible file class values are detailed in the section File Classes below.

Options
Without any option, anafile examines the first 2880 bytes, and assigns the file class.

–a    lists the position (line and column numbers) of each character which leads to a classification as ascii binary or binary file.

–d    considers the input file as tab-separated values (option –d alone), or as values separated by the character c (the column separator). This option, combined with –cc, generates statistics by column (width of each column, and characters present there).
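
As an illustration of the kind of per-field statistics described here, the following Python sketch computes the width of each field and the characters found in it. It is not the anafile implementation; the function name field_statistics and its return layout are hypothetical.

```python
def field_statistics(lines, sep="\t"):
    """Return (widths, chars): the largest width and the character set of each field."""
    widths, chars = [], []
    for line in lines:
        for index, field in enumerate(line.rstrip("\n").split(sep)):
            # Grow the result lists when a new field position is met.
            if index == len(widths):
                widths.append(0)
                chars.append(set())
            widths[index] = max(widths[index], len(field))
            chars[index].update(field)
    return widths, chars


widths, chars = field_statistics(["HBC 1\t12.5", "HBC 22\t3.25"])
for number, (width, present) in enumerate(zip(widths, chars), start=1):
    print(number, width, "".join(sorted(present)))
```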

–cc    generates a detailed list of the frequencies of every character in each column: the characters of each column are listed in order of their decreasing frequency.
Combined with the –d option, the contents of every field (defined as the contiguous bytes between column separators) are examined, and statistics on these contents are listed. With the addition of g (i.e. with option –ccg), the statistics include the min/max values computed in each field; a description to include in a ReadMe file is proposed, and possible alignments (using the acut program) are suggested.
See also the –head option if some top lines have to be skipped (or used as titles / units).
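
The byte-per-byte frequency listing can be pictured with the short Python sketch below, which counts the characters found at each column position and lists them by decreasing frequency. It only illustrates the idea; column_frequencies is a hypothetical name and the output layout is not that of anafile.

```python
from collections import Counter


def column_frequencies(lines):
    """Count, for each byte column, how often each character occurs."""
    counters = []
    for line in lines:
        line = line.rstrip("\n")
        # Add counters when a line longer than all previous ones is met.
        while len(counters) < len(line):
            counters.append(Counter())
        for position, char in enumerate(line):
            counters[position][char] += 1
    # most_common() orders the characters by decreasing frequency.
    return [counter.most_common() for counter in counters]


sample = ["  1 n", "  2 n", " 10  "]
for number, frequencies in enumerate(column_frequencies(sample), start=1):
    print(number, frequencies)
```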

–hash    is a shortcut for –headlines=#

–headlines    specifies that the input file may contain heading line(s), i.e. lines that explain the contents. These heading lines (the header) may be defined by a number of lines or by a leading character. For instance, the existence of 2 heading lines is specified with -head=2; the existence of heading lines starting with the hash sign is specified with -head='#'. The default is a single line (i.e. equivalent to -head=1).
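
The two ways of defining the header can be sketched in Python as follows. This is a minimal illustration, assuming that heading lines designated by a leading character are the consecutive lines at the top of the file; split_header is a hypothetical helper, not part of anafile.

```python
def split_header(lines, head=1):
    """Split lines into (header, data); head is a line count or a leading character."""
    if isinstance(head, int):
        return lines[:head], lines[head:]
    # Assumption: only the leading run of lines starting with the character is the header.
    count = 0
    while count < len(lines) and lines[count].startswith(head):
        count += 1
    return lines[:count], lines[count:]


header, data = split_header(["# HBC  Name", "# ---  ----", "  1  V1057"], head="#")
print(len(header), "heading line(s),", len(data), "data line(s)")
```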

–fx format_file    uses the contents of format_file to check the compliance of the file to the specified format. The x may be used for further options concerning this format file like computation of ranges, verifications against the CDS Standards, etc... (see the section Format File below).

–l    asks to examine the complete files to assign the file class, and lists the number of lines as well as the number of bytes of the longest line.
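
The two figures listed by this option can be sketched in Python as below. Whether anafile counts the line terminator in the length of a line is not stated here; this sketch assumes it does not, and line_statistics is a hypothetical name.

```python
def line_statistics(data: bytes):
    """Return (number of lines, length in bytes of the longest line)."""
    lines = data.split(b"\n")
    # A final line-feed terminates the last line; it does not start a new one.
    if lines and lines[-1] == b"":
        lines.pop()
    longest = max((len(line) for line in lines), default=0)
    return len(lines), longest


print(line_statistics(b"ab\ncdef\n"))
```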

–q    quiet mode: the messages are minimized.

–t table_structure    indicates that the next file argument designates a data file which contains data structured like table_structure (for instance an excerpt of a table). table_structure is therefore a name which must exist in one of the Byte-by-byte Description ... sections of the ReadMe file.
A value of - for table_structure asks to stop this behaviour.
Note that this option can only work when a –f specification precedes –t.

–u    lists the columns which are constant (i.e. have exactly the same contents) over all lines. This option may be used to check e.g. that the decimal points are correctly aligned, or to find the blank columns which could be removed with trcol(1).
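
A minimal Python sketch of the constant-column search follows. It assumes that short lines are padded with blanks on the right, which is a choice made for this illustration; constant_columns is a hypothetical name.

```python
def constant_columns(lines):
    """Return (column number, character) for each column identical over all lines."""
    lines = [line.rstrip("\n") for line in lines]
    if not lines:
        return []
    width = max(len(line) for line in lines)
    # Assumption: lines shorter than the longest one are padded with blanks.
    padded = [line.ljust(width) for line in lines]
    first = padded[0]
    return [
        (index + 1, first[index])
        for index in range(width)
        if all(line[index] == first[index] for line in padded)
    ]


# The decimal point (column 2) and the blank (column 5) are constant.
print(constant_columns(["1.50 a", "2.25 b"]))
```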

–v    is a verbose option.

–w width    specifies the assumed column width for ascii bulk or ebcdic files; such files have no linefeed embedded, and the length of each line must be assumed.
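
Cutting a bulk file into lines of an assumed width can be sketched as follows in Python; split_bulk is a hypothetical name, and a final shorter record is simply kept as it is.

```python
def split_bulk(data: bytes, width: int):
    """Cut data, which contains no line-feed, into records of the assumed width."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [data[start:start + width] for start in range(0, len(data), width)]


for record in split_bulk(b"  1 n  2 n 10  ", 5):
    print(record)
```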

Format File
A typical format file (specified via the –f option) contains the following:

Byte-per-byte Description of file: hbc
-------------------------------------------------------------------
   Bytes Format  Units   Label    Explanations
-------------------------------------------------------------------
   2-  4  I3     ---     HBC      [1,423]+ HBC number.
       5  A1     ---     NEBUL    [n] Nebulosity association flag.
       6  A1     ---     REMARK   [*] Remark flag.
   8- 18  A11    ---     NAME     [A-Z0-9@.+-]! Star name.
  20- 56  A37    ---     OTHER    Other designation.
  59- 60  I2     h       RAh      Hours of right ascension (1950.0).
  62- 63  I2     min     RAm      Minutes of right ascension.
  65- 69  F5.2   s       RAs      Seconds of right ascension.
      71  A1     -       DE-      Sign of declination (1950.0).
  72- 73  I2     deg     DEd      Degrees of declination.
  75- 76  I2     arcmin  DEm      Minutes of declination.
  78- 81  F4.1   arcsec  DEs      Seconds of declination.
  83- 90  A8     ---     REF      References to the position.
-------------------------------------------------------------------

The format file is made of five columns: the byte position, the format, the units, the label and an explanation text. Such a file is interpreted by anafile, is reedited in a standard form (on the screen if the –v option is present, or the format file is rewritten with the –fw option), and is used for data checks. Note also that special labels are understood by anafile (see Special Labels below).
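
To make the five-column layout concrete, the Python sketch below splits one line of the example above into its components. It is an illustration only, assuming that the format, units and label contain no blank; the Column class and parse_format_line are hypothetical names, not the parser used by anafile.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Column:
    start: int
    end: int
    fmt: str
    units: str
    label: str
    explanation: str


LINE_RE = re.compile(
    r"^\s*(\d+)\s*(?:-\s*(\d+))?"  # byte position: start, optional "- end"
    r"\s+(\S+)"                    # format
    r"\s+(\S+)"                    # units
    r"\s+(\S+)"                    # label
    r"\s+(.*)$"                    # explanation text
)


def parse_format_line(line) -> Optional[Column]:
    """Return the Column described by line, or None for rulers and titles."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    start, end, fmt, units, label, text = match.groups()
    start = int(start)
    # A single byte position means that the field is one byte wide.
    return Column(start, int(end) if end else start, fmt, units, label, text.strip())


print(parse_format_line("   2-  4  I3     ---     HBC      [1,423]+ HBC number."))
```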

The explanation text may contain further restrictions concerning the range for numeric fields or the character set allowed for alphabetical fields; refer to the Validity Checks section below.

Note that the byte position may be specified as relative from the end of the previous field when followed by the X letter, as it is in Fortran. The number preceding the X therefore represents the number of blanks between the two columns.

With the –f. (a dot following the f) option, the input format file does not include the units column; the reedition fills this column with dashes.

With the –f1 (a one following the f) option, the first column of the input format file contains only the starting byte of the column; the ending byte is derived from the format column. The reedition (with option –f1w) completes this column with the ending byte.
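
The derivation of the ending byte from the format can be sketched as below in Python. The sketch assumes formats made of one letter followed by a width and an optional number of decimals (I3, A11, F5.2); field_width and ending_byte are hypothetical names.

```python
import re


def field_width(fmt):
    """Return the width in bytes implied by a format such as I3, A11 or F5.2."""
    match = re.fullmatch(r"[A-Za-z](\d+)(?:\.\d+)?", fmt)
    if match is None:
        raise ValueError(f"unsupported format: {fmt!r}")
    return int(match.group(1))


def ending_byte(start, fmt):
    """Return the last byte of a field starting at byte start with format fmt."""
    return start + field_width(fmt) - 1


for label, start, fmt in [("HBC", 2, "I3"), ("NAME", 8, "A11"), ("RAs", 65, "F5.2")]:
    print(f"{start:4d}-{ending_byte(start, fmt):3d}  {fmt:<6} {label}")
```

Applied to the HBC, NAME and RAs columns of the example above, this gives the ending bytes 4, 18 and 69 found in the Bytes column.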

With the –f1X (a one and X following the f) option, it is assumed that a blank always separates two adjacent columns; the contents of the starting byte, if existing, are ignored. The reedition (with option –f1Xw) computes the starting-ending byte column.

With the –fr option, the actual ranges (minimal / maximal values) of each column are computed.

With the –fs option, the format file is assumed to conform to the Standards for Astronomical Catalogues; further compliance checks (presence of titles, correctness of units, etc...) of this format file are then performed.

With the –fw option, the format file is rewritten according to standards.

The format options may be combined, e.g. –f1.w asks to rewrite a format file where no units and no ending bytes were supplied.

Special Labels
Some labels are recognized and further checks are performed. The complete list, with the implied defaults, is given in the Standards for Astronomical Catalogues. A few frequent labels are:
  • Right ascension fields RAh, RAm and RAs
  • Declination sign DE- (only a sign or a blank is accepted)
  • Declination values DEd, DEm and DEs
  • Positions in decimal degrees RAdeg (range [0,360[) and DEdeg (range [–90,+90])
  • Galactic positions GLON (range [0,360[) and GLAT (range [–90,+90])
  • Position angle PA

Validity Checks
The first word (i.e. set of characters followed by a blank) of the explanation of a column may specify validity checks to perform about:
  1. the available range of the value in the column
  2. the possibility of blank or NULL values
  3. the order of the value within the table (increasing / decreasing value)

For a numerical field, a range can be specified in the format file if the explanation text starts with a square bracket, as in the HBC column in the above example. The opening bracket is [ if the lower value is included, and ] if the lower value is excluded; symmetrically, the closing bracket is ] if the upper value is included, and [ if it is excluded. Neither the lower nor the upper value is required; for instance, any value lower than 100 (100 excluded) is specified by [,100[. Writing [] is acceptable when no range checking applies, e.g. to override the default range implied by the label (see the Special Labels section above).
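
The bracket conventions can be illustrated with the Python sketch below, which reads a range specification and tests a value against it. It is not the anafile code; parse_range and in_range are hypothetical names, and a non-empty specification without a comma is rejected because its meaning is not described here.

```python
import re


def parse_range(spec):
    """Return (low, low_included, high, high_included) for a text like '[1,423]'."""
    match = re.match(r"([\[\]])([^\[\]]*)([\[\]])", spec)
    if match is None:
        raise ValueError(f"not a range specification: {spec!r}")
    opening, body, closing = match.groups()
    if body.strip() and "," not in body:
        raise ValueError(f"range without a comma: {spec!r}")
    low_text, _, high_text = body.partition(",")
    low = float(low_text) if low_text.strip() else None
    high = float(high_text) if high_text.strip() else None
    # '[' includes the lower value, ']' includes the upper value.
    return low, opening == "[", high, closing == "]"


def in_range(value, spec):
    """Tell whether value satisfies the range specification."""
    low, low_included, high, high_included = parse_range(spec)
    if low is not None and (value < low or (value == low and not low_included)):
        return False
    if high is not None and (value > high or (value == high and not high_included)):
        return False
    return True


print(in_range(423, "[1,423]+ HBC number."), in_range(100, "[,100["))
```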

For an alphabetical (i.e. A-format) field, the set of the allowed characters may be specified in the format file if the explanation text starts with a square bracket [: permitted characters are surrounded by square brackets [...], and the dash indicates a range (in the ascii sequence). The closing bracket is accepted as a character permitted in the set if it is specified first (i.e. []] means that only the closing bracket is acceptable); the dash is accepted when it is the first or last character. In the above example, only n (or blank) is accepted in the NEBUL column, and uppercase letters, digits, and the symbols @ . + - in the NAME column.
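
The character-set rules can be sketched in Python as follows. The sketch assumes that blanks are always tolerated inside a field (as padding), which is an assumption of this illustration; parse_charset and field_is_valid are hypothetical names.

```python
def parse_charset(spec):
    """Return the set of characters allowed by a text starting with '[...]'."""
    if not spec.startswith("["):
        raise ValueError(f"not a character-set specification: {spec!r}")
    position = 1
    body = []
    # A closing bracket in first position is a member of the set.
    if spec[position:position + 1] == "]":
        body.append("]")
        position += 1
    end = spec.find("]", position)
    if end < 0:
        raise ValueError(f"unterminated character set: {spec!r}")
    body.extend(spec[position:end])
    allowed = set()
    index = 0
    while index < len(body):
        # A dash between two characters denotes a range in the ascii sequence.
        if index + 2 < len(body) and body[index + 1] == "-":
            first, last = ord(body[index]), ord(body[index + 2])
            allowed.update(chr(code) for code in range(first, last + 1))
            index += 3
        else:
            allowed.add(body[index])
            index += 1
    return allowed


def field_is_valid(text, spec):
    """Tell whether every character of text is allowed (blanks assumed accepted)."""
    allowed = parse_charset(spec) | {" "}
    return all(char in allowed for char in text)


print(field_is_valid("AB+12.3@X", "[A-Z0-9@.+-]! Star name."))
```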

An exclamation mark ! immediately following the range or character-set specification indicates that the field cannot be blank, i.e. cannot be filled with only blanks. In the above example, the NAME field can never be blank.

A question mark ? immediately following the range or character-set specification indicates that the field can be blank, i.e. can be filled with only blanks. The default concerning blank fields is:

  • numeric field: the column cannot be filled with blanks, unless the ? symbol exists;
  • alphabetic (A-format) field: the column can be filled with blanks; the ! means that a completely blank field can't exist.
The question mark may be followed immediately by a numeric value to designate non-blank NULL values, e.g. ?=99.99 tells that values of 99.99 indicate unknown values.
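
The defaults and the ! and ? markers can be summarized by the Python sketch below. It only looks at a marker placed immediately after a range or character-set specification, as described above; blank_policy is a hypothetical name, and the value after ?= is returned as text without interpretation.

```python
import re


def blank_policy(fmt, explanation):
    """Return (blank_allowed, null_value) for a column of the format file."""
    numeric = not fmt.upper().startswith("A")
    # Defaults: a numeric field cannot be blank, an alphabetic field can.
    blank_allowed = not numeric
    null_value = None
    words = explanation.split()
    first_word = words[0] if words else ""
    # Skip the leading range or character-set specification, if any.
    spec = re.match(r"[\[\]].*[\[\]]", first_word)
    rest = first_word[spec.end():] if spec else ""
    if rest.startswith("!"):
        blank_allowed = False
    elif rest.startswith("?"):
        blank_allowed = True
        if rest.startswith("?="):
            null_value = rest[2:]
    return blank_allowed, null_value


print(blank_policy("A11", "[A-Z0-9@.+-]! Star name."))
```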

The order within a column can be specified with the signs:

  • [+] indicates strictly increasing values in the table;
  • [-] indicates strictly decreasing values in the table;
  • [=], associated with + or -, indicates that the variation is not strict, i.e. row n+1 may have a value larger than or equal to that of row n if += is specified.
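
The order checks listed above can be sketched in Python as follows. The markers are passed as the strings "+", "-", "+=" and "-="; check_order is a hypothetical name, and the function returns the numbers (starting at 1) of the rows which break the requested order.

```python
import operator

# Comparison which must hold between row n and row n+1 for each marker.
COMPARISONS = {
    "+": operator.lt,   # strictly increasing
    "+=": operator.le,  # increasing, equal values accepted
    "-": operator.gt,   # strictly decreasing
    "-=": operator.ge,  # decreasing, equal values accepted
}


def check_order(values, marker):
    """Return the row numbers whose value breaks the order given by marker."""
    compare = COMPARISONS[marker]
    return [
        index + 2
        for index, (previous, current) in enumerate(zip(values, values[1:]))
        if not compare(previous, current)
    ]


print(check_order([1, 2, 2, 3], "+"), check_order([1, 2, 2, 3], "+="))
```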

File Classes
Outside the context of a description file, an input file is classified into one of the following categories according to its contents:
  • ascii file
    standard ascii file, where no other control characters than tabs, carriage-return and line-feed can be found.
  • ascii mailable
    may include other control characters like bell or escape which can be sent over the network without problem.
  • ascii binary
    includes control characters which cannot be sent over the network, like Control-Q (see also the –a option).
  • ascii bulk
    does not include any line-feed, which means that the file cannot be seen as a set of lines (see the –w option).
  • ebcdic
    EBCDIC (IBM-specific) file
  • fits
    FITS file (bulk file with all headers)
  • hdf
    FITS header (starting with SIMPLE or XTENSION) without the data part, and stored as a standard ascii file.
  • directory
    (not an actual file)
  • binary
    file which cannot be classified in one of the above categories.
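
A reduced Python sketch of such a classification is given below, limited to the ascii file, ascii mailable, ascii binary, ascii bulk and binary classes. Several points are assumptions of this illustration and not documented behaviour: the mailable control characters are restricted to bell and escape (the two examples named above), any byte above 127 leads to binary, and the order of the tests is arbitrary. The default limit of 2880 bytes is the one examined when no option is given.

```python
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}  # tab, line-feed, carriage-return
MAILABLE_CONTROLS = {0x07, 0x1B}       # bell, escape (assumed set)


def classify(data: bytes, limit: int = 2880) -> str:
    """Assign a simplified file class to the first limit bytes of data."""
    sample = data[:limit]
    # Assumption: bytes outside the 7-bit ascii range mean a binary file.
    if any(byte >= 0x80 for byte in sample):
        return "binary"
    controls = {
        byte
        for byte in sample
        if (byte < 0x20 or byte == 0x7F) and byte not in ALLOWED_CONTROLS
    }
    if controls - MAILABLE_CONTROLS:
        return "ascii binary"
    if b"\n" not in sample:
        return "ascii bulk"
    if controls:
        return "ascii mailable"
    return "ascii file"


print(classify(b"  1 n\n  2 n\n"))
```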

Returned Status
The anafile command returns 0 in case of success. A non-zero status may indicate bad options or unreadable files; it is also returned when the format_file indicated by a –fs option does not conform to the Standards for Astronomical Catalogues.

See also
awk(1) gawk(1) trcol(1) "Standards for Astronomical Catalogues"