Datasets - TeraScan common data format (TDF)

SYNOPSIS

lib/libcdf.a

INTRODUCTION

Each TeraScan dataset is a separate UNIX file organized in the TeraScan common data format (TDF). The TDF is an extremely versatile file format capable of assimilating a wide variety of data types, shapes and sizes. For example, a single dataset could contain satellite image data, random in-situ data, and 3-D model data. The TDF also allows applications to access data without any knowledge of the physical layout of that data.

The TDF was developed during the same period that NASA developed the Common Data Format (CDF) [Treinish and Gough, 1987] and served as a basis for the UNIDATA Network Common Data Format (netCDF) [Rew, 1988]. The TDF has been substantially upgraded since then.

Dimensions, variables, relations, and attributes are the basic dataset components. Variables are simply arrays of data; dimensions define the sizes of these arrays. Relations are ordered lists of variables. Attributes hold information about the dataset as a whole, or about individual variables, dimensions, or relations. Only datasets, variables, and relations can currently have application-defined attributes.

It is important to note that dimensions do not necessarily refer to the physical space in which measurements are taken, but rather to the shapes and sizes of the arrays used to store the measurement data. For example, data taken at random locations in space is stored one-dimensionally. The maximum number of dimensions for a given variable is currently 5. This accommodates the usual X, Y, Z, and time coordinates, plus one left over to handle complex coordinates.

The following datatypes may be used to define variables and attributes: byte, short, long, float, double, and string. Codes and ranges for these datatypes are defined in include/gp.h. String is a variable-width datatype, i.e., the number of bytes required to store one element is application-defined. Applications can implement a complex-valued variable by adding an extra dimension of length 2 to the variable.
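
The complex-valued convention above can be sketched as follows. This is only an illustration: it assumes that the extra length-2 dimension is the last (fastest-moving) one and that index 0 holds the real part and index 1 the imaginary part. Neither assumption is stated by the TDF itself, and all demo_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: n complex samples stored as a flat [n][2] array. */
enum { DEMO_RE = 0, DEMO_IM = 1 };

/* Flat index of component `part` of complex element i. */
static size_t demo_complex_index(size_t i, int part)
{
    assert(part == DEMO_RE || part == DEMO_IM);
    return i * 2 + (size_t)part;
}

/* Squared magnitude of complex element i. */
static double demo_complex_norm2(const double *data, size_t i)
{
    double re = data[demo_complex_index(i, DEMO_RE)];
    double im = data[demo_complex_index(i, DEMO_IM)];
    return re * re + im * im;
}
```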

Normally all dataset definitions and data are stored in a single UNIX file. However, a dataset can reference variables from several files using links. Links allow rapid import of non-TDF data and support lightweight dataset subsets and assemblies.

EXAMPLES

Raw AVHRR Imagery

AVHRR data from NOAA satellites consists of up to five channels of image data, plus per-scan line sensor calibration data.

                             +------------------------+
       AVHRR Data            |      Last Channel      |
                         +------------------------+   |
                         |     Second Channel     |   |
      +--------+     +------------------------+   |   |
      | Header |     |     First Channel      |   |   |
      |        |     |                        |   |   |
      |        |     |                        |   |   |
      |        |     |                        |   |---+
      |        |     |                        |   |
      |        |     |                        |---+
      |        |     |                        |
      +--------+     +------------------------+

The resulting TDF dataset has three dimensions:

      # image lines (rows)
      # image samples (columns)
      width of the header

It also contains a header variable, plus one variable for each of the channels.

TOVS Package Output

Output for the Wisconsin TOVS package consists of a number of meteorological readings at constant elevations, but at random (latitude,longitude) points. The following schema shows one approach to fitting this output into the TDF:

Dimensions:
   1         472  points
   2          19  hirs_channels
   3           4  msu_channels
   4          15  gp_levels
   5          14  temp_levels
   6           5  dew_pt_levels
   7           9  wind_levels
   8          10  temp_inp_levels
   9           5  dp_inp_levels

Global Attributes:
  pass_date        00/09/16
  start_time       01:13:49
  satellite        noaa-11

Variables:
  gp_levels        gp_levels
  temp_levels      temp_levels
  dew_pt_levels    dew_pt_levels
  wind_levels      wind_levels
  temp_inp_levels  temp_inp_levels
  dp_inp_levels    dp_inp_levels
  latitude         points
  longitude        points
  elevation        points
  solar_zenith     points
  ground_temp      points
  ground_dew_pt    points
  ground_pressure  points
  skin_temp        points
  total_ozone      points
  stability_index  points
  precip_water     points
  longwave_flux    points
  cloud_pressure   points
  cloud_temp       points
  hirs_temp        points X hirs_channels
  msu_temp         points X msu_channels
  gp_height        points X gp_levels
  temperature      points X temp_levels
  dew_pt           points X dew_pt_levels
  wind_speed       points X wind_levels
  wind_dir         points X wind_levels
  temp_guess       points X temp_inp_levels
  dew_pt_guess     points X dp_inp_levels

For brevity, units and datatypes have been omitted from the above listing. Note that for each of the level dimensions there is a corresponding variable with the same name. These dimension variables supply the values corresponding to each dimension index. For example, the 9 wind_levels are 850, 700, 500, 400, 300, 250, 200, 150, and 100 millibars.

PROGRAMMING INTERFACE

TDF access routines depend on no TeraScan software component other than lib/libutils.a (see dirfile, misc, and terrno). TDF applications can therefore be written without using the TeraScan user interface or earth transform facilities, and TDF calls can be embedded in existing non-TeraScan applications as desired.

TDF datatypes, constants, and error status codes are defined in include/gp.h.

Object Pointers

The basic TDF objects are sets, dimensions, variables, and relations. Application-defined attributes are not considered objects, even though they can be treated as such. Files are secondary objects and are of only passing concern to applications.

Pointers to objects (actually object data structures) are returned by search or definition functions. These pointers are used as arguments to other functions. All data structures have magic numbers and alignment criteria that help identify bogus pointers. A pointer to an object's data structure is pinned (i.e., can never change) until the object is no longer available (i.e., when the containing dataset is closed).

All application-accessible data structures exist in memory that is allocated using UNIX malloc. malloc is used sparingly and in an unfragmented manner so as not to impact applications that also use (and possibly abuse) malloc.

Applications cannot be prevented from modifying data structures, even for datasets opened as read-only. Given this, it was decided to let applications perform all operations except variable I/O for read-only datasets, including defining new variables, relations, and attributes.

One obvious disadvantage of having application-accessible data structures is that applications will undoubtedly trash them more easily than if they were hidden. All data structure components should be considered read-only, unless otherwise specified.

Applications can loop through a list of similar objects (e.g., all dimensions belonging to a dataset) using

	while (pointer != NULL) pointer = pointer->next;
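
A slightly fuller sketch of the same traversal is shown below. The structure is a hypothetical stand-in with only a name, a size, and a next pointer. The real definitions are in include/gp.h and will differ, and the demo_ functions are not part of the TDF interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dimension structure; see include/gp.h. */
struct demo_dim {
    const char *name;
    long size;
    struct demo_dim *next;   /* next dimension in the dataset, or NULL */
};

/* Count the dimensions reachable from `first` by following next. */
static int demo_count_dims(const struct demo_dim *first)
{
    const struct demo_dim *d;
    int n = 0;
    for (d = first; d != NULL; d = d->next)
        n++;
    return n;
}

/* Return the first dimension called `name`, or NULL if there is none. */
static const struct demo_dim *demo_find_dim(const struct demo_dim *first,
                                            const char *name)
{
    const struct demo_dim *d;
    assert(name != NULL);
    for (d = first; d != NULL; d = d->next)
        if (strcmp(d->name, name) == 0)
            return d;
    return NULL;
}
```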

Attributes

Attributes refine the definitions of datasets and their components. There are two kinds of attributes:

Built-in attributes, i.e., fields in application-accessible data structures (see include/gp.h).

Application-defined attributes, created using the define or copy attribute functions.

Dimensions and files do not have application-defined attributes. The only file attribute of any interest to applications is file->path, which is built-in. Application-defined dimension attributes may be added in the future.

Note that applications are free to change names directly and potentially generate name conflicts within a dataset. This is the least harmful of all the ways applications can damage datasets.

Different objects can have attributes with the same name, but with different datatypes or lengths. This new flexibility should be used cautiously; two attributes with different meanings should never have the same name.

The following built-in attributes are intended for use by applications; only those marked (*) can be set directly by applications.

* dim->name      - dimension name
  dim->unlimited - non-zero if dimension can grow
  dim->size      - current size
* dim->coord     - dimension coordinate
* dim->scale     - orig index = index * scale + offset
* dim->offset

* var->name      - variable name
* var->units     - units
  var->type      - datatype
* var->badval    - missing value as stored on disk
* var->usemin    - minimum valid stored value
* var->usemax    - maximum valid stored value
* var->scale     - true value = scale * stored value + offset
* var->offset

* rel->name      - relation name
* rel->kind      - relation kind (analogous to variable units)

* att->name      - attribute name
* att->units     - attribute units
  att->type      - datatype
  att->size      - number of elements in attribute

  file->path     - file path name

Application-defined attributes are normally not accessed like objects. Their values are set and retrieved by name, rather than by pointer. Pointers to attribute definitions are available for getting attribute datatype, lengths, and units, as well as looping through lists of attributes.

Application-Defined Relationships

The new abstraction relation has been added to datasets. A relation consists of an ordered list of variables all belonging to the same dataset. Relations have built-in attributes name and kind, where relation kind is analogous to variable units. Relations also can have application-defined attributes. The number and order of the variables associated by a relation, as well as its application-defined attributes, are determined by its kind.

The following is an example of how relations can be used:

Given a variable date that contains an ordered list of dates, a variable year that contains an ordered list of years, and a variable year_index that is defined as follows:

year_index[i] = j if k > j => date[k] >= year[i]

define the relation year_index of kind sparse_index, consisting of the ordered tuple (date, year, year_index). (Obviously, date and year must have the same units for this to work.)
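
One possible reading of the year_index variable described above is sketched here: for each year[i], store the first position j in date such that date[j] >= year[i]. This interpretation, the use of double for dates, the requirement that both arrays be ascending, and the demo_ name are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * For each year[i], store the first position j in the ascending array
 * date[] with date[j] >= year[i], or ndates if no such position exists.
 */
static void demo_build_year_index(const double *date, size_t ndates,
                                  const double *year, size_t nyears,
                                  size_t *year_index)
{
    size_t i, j = 0;
    for (i = 0; i < nyears; i++) {
        /* year[] must be ascending so that j never moves backwards. */
        assert(i == 0 || year[i - 1] <= year[i]);
        while (j < ndates && date[j] < year[i])
            j++;
        year_index[i] = j;
    }
}
```

With such an index, all dates falling in year[i] occupy positions year_index[i] up to, but not including, year_index[i+1].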

Built-in Relationships

The following relationships are built-in to application-accessible data structures; only those marked (*) can be changed directly by applications:

  var->dim[], var->ndims           - variable has dimensions
  rel->var[], rel->nvars           - relation relates variables

* dim->var    - a dimension can get its values from a variable
                i.e., value corresponding to dim=i is var[i]
  var->file   - a variable's data is stored in a file

  set->natts, set->att, att->next  - dataset has attributes
  var->natts, var->att, att->next  - variable has attributes
  rel->natts, rel->att, att->next  - relation has attributes

  set->ndims, set->dim, dim->next  - a dataset has dimensions
  set->nvars, set->var, var->next  - a dataset has variables
  set->nrels, set->rel, rel->next  - a dataset has relations

  firstset, set->next   - a program has a list of datasets

  dim->owner    - a dimension belongs to a dataset
  var->owner    - a variable belongs to a dataset
  rel->owner    - a relation belongs to a dataset
  att->owner    - an attribute belongs to a dataset, variable,
                  or relation

Pointers are used to represent all built-in relationships. Linked lists are used for all "has" relationships except two: var->dim[] and rel->var[]. In both cases the associations are many-to-many, which makes linked lists impractical because each object would have to be threaded onto several lists. Instead, variable dimensions and relation variables are stored in arrays. The number of variable dimensions is limited (e.g., GP_VAR_DIMS = 5). There is no limit on the number of relation variables.

Some built-in relationships are circular; e.g. var->dim[] and dim->var, or set->var and var->owner. Due to the hierarchical nature of declarations in C, some of these pointers have to be declared of type char, which is unfortunate.

Scaled Variable Data

In original TeraScan datasets, information for converting 8-bit or 16-bit data to real values was stored in application-defined scaling attributes. Now, scaling attributes are built into all variables, regardless of datatype. var->scale and var->offset are used to convert stored data to its true form:

true value = var->scale * stored value + var->offset

Note that built-in attributes var->badval, var->usemin, and var->usemax all refer to stored values. When presenting these attributes to users, applications may want to apply scaling to at least var->usemin and var->usemax.
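
The scaling rule can be sketched as follows. The structure below is a hypothetical stand-in holding only the scaling-related fields named above, stored as doubles for simplicity. Rounding to the nearest integer when unscaling is an assumption suited to 8-bit or 16-bit storage, and the demo_ functions are not the TDF read and write routines.

```c
#include <assert.h>

/* Hypothetical subset of the built-in variable attributes. */
struct demo_var {
    double scale, offset;    /* true value = scale * stored value + offset */
    double badval;           /* missing value, as stored */
    double usemin, usemax;   /* valid range, as stored */
};

/* Convert a stored value to its true value. */
static double demo_to_true(const struct demo_var *v, double stored)
{
    return v->scale * stored + v->offset;
}

/* Convert a true value back to the nearest integer stored value. */
static long demo_to_stored(const struct demo_var *v, double true_value)
{
    double x;
    assert(v->scale != 0.0);
    x = (true_value - v->offset) / v->scale;
    return (long)(x >= 0.0 ? x + 0.5 : x - 0.5);
}

/* Validity is tested on stored values, before any scaling is applied. */
static int demo_is_valid(const struct demo_var *v, double stored)
{
    return stored != v->badval && stored >= v->usemin && stored <= v->usemax;
}
```

To present the valid range to a user in true units, an application would pass var->usemin and var->usemax through the same conversion as the data, e.g. demo_to_true(v, v->usemin).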

The most common use of scaling is to store real-valued data with a minimum yet appropriate number of significant bits. However, scaling can be used to help change variable units without changing actual data. For example, to change from degrees Celsius to degrees Fahrenheit:

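	gpputname(var->units, C_FAHRENHEIT);
	var->scale *= 1.8;
	var->offset = var->offset * 1.8 + 32.;

The existing offset is expressed in degrees Celsius, so it must be multiplied by 1.8 before 32 is added.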
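
More generally, any linear unit change new = a * old + b can be folded into existing scaling by composing the two linear maps. The sketch below shows the arithmetic. The structure and function names are hypothetical, and only the scale and offset fields are modelled.

```c
#include <assert.h>

/* true value = scale * stored value + offset */
struct demo_scaling { double scale, offset; };

/*
 * Fold the unit change new = a * old + b into existing scaling:
 *   new = a * (scale * stored + offset) + b
 *       = (a * scale) * stored + (a * offset + b)
 */
static void demo_change_units(struct demo_scaling *s, double a, double b)
{
    assert(a != 0.0);
    s->scale = a * s->scale;
    s->offset = a * s->offset + b;
}

/* Apply the current scaling to a stored value. */
static double demo_true_value(const struct demo_scaling *s, double stored)
{
    return s->scale * stored + s->offset;
}
```

For Celsius to Fahrenheit, a is 1.8 and b is 32. Note that a non-zero existing offset is multiplied by a as well.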

Another benefit of built-in scaling is that it allows applications to pretend they are working with a single type of data: double precision. Variable read and write routines that respectively scale and unscale data are provided as part of the standard interface. This does not preclude the writing of applications that treat each type of variable differently.

Dimension Coordinates

Applications may use the coord, scale, and offset built-in dimension attributes to relate different dimensions. For example, if two dimensions have the same coord attribute, applications may decide that the two dimensions are parallel. The scale and offset attributes can then be used to determine the exact correspondence between the two dimensions, assuming that correspondence is linear.

Coordinate types GP_X_COORD, GP_Y_COORD, GP_Z_COORD, GP_TIME_COORD, GP_COMPLEX_COORD, and GP_NO_COORD are defined in include/gp.h for this purpose. Applications are not restricted to these coordinate types.
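
As a sketch of how two parallel dimensions might be related, the following maps an index in one dimension to the corresponding index in another by way of the shared original index (orig index = index * scale + offset). The structure, the coordinate codes, and the function name are hypothetical. The real coordinate codes are those in include/gp.h.

```c
#include <assert.h>

enum { DEMO_X_COORD = 1, DEMO_Y_COORD = 2 };   /* placeholder codes */

/* orig index = index * scale + offset */
struct demo_dim_map { int coord; double scale, offset; };

/*
 * Map index_a in dimension a to the matching index in dimension b.
 * Returns 1 and sets *index_b if the dimensions share a coord type,
 * 0 otherwise. The result may be fractional or out of range for b.
 */
static int demo_map_index(const struct demo_dim_map *a,
                          const struct demo_dim_map *b,
                          double index_a, double *index_b)
{
    assert(b->scale != 0.0);
    if (a->coord != b->coord)
        return 0;
    *index_b = (index_a * a->scale + a->offset - b->offset) / b->scale;
    return 1;
}
```

For example, if a subsampled dimension has scale 2 and offset 100 relative to a full-resolution dimension with scale 1 and offset 0, index 10 of the former corresponds to index 120 of the latter.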

Unlimited (Growing) Dimensions

Unlimited dimensions can be defined using a size of GP_UNLIMITED, found in include/gp.h. The following guidelines apply when working with datasets with unlimited dimensions:

Only one dimension in a dataset can be growing; defining a second unlimited dimension will fix the size of the former growing dimension.

If a variable is defined with a growing dimension, that dimension must be the variable's leading dimension.

All variables to be defined with an unlimited leading dimension must be defined prior to writing any data corresponding to that dimension. The size of the unlimited dimension will be fixed at the point where the new variable is defined.
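
The leading-dimension guideline lends itself to a simple check before a variable is defined. The sketch below is illustrative only: DEMO_UNLIMITED is a placeholder whose value is not that of GP_UNLIMITED, DEMO_MAX_DIMS mirrors the 5-dimension limit, and the function is not part of the TDF interface.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_UNLIMITED (-1L)   /* placeholder, not the real GP_UNLIMITED */
#define DEMO_MAX_DIMS  5       /* mirrors the 5-dimension limit */

/*
 * Return 1 if a variable may be defined with these dimension sizes:
 * at most DEMO_MAX_DIMS dimensions, and only the leading one growing.
 */
static int demo_dims_ok(const long *sizes, int ndims)
{
    int i;
    assert(sizes != NULL);
    if (ndims < 0 || ndims > DEMO_MAX_DIMS)
        return 0;
    for (i = 1; i < ndims; i++)
        if (sizes[i] == DEMO_UNLIMITED)
            return 0;
    return 1;
}
```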

Cloning Objects

Cloning an object refers to the process of creating a like object with the same attributes, optionally with a new name. When a variable is cloned, the new variable is created with the same named dimensions. These dimensions must exist in the output dataset, but do not have to have the same sizes as the corresponding dimensions of the original variable. Similarly, when a relation is cloned, the new relation is created, associating the same named variables.

When a dimension is cloned, its corresponding variable (if one is defined) is not carried over to the new dimension. This would present a chicken and egg problem, because the dimension could not be created without the variable, and the variable could not be created without the dimension.

Definitions vs. Variable Data

Everything about a dataset with the exception of variable data is maintained in virtual memory until the dataset is closed or synched. If a dataset is opened for read access and then is closed, nothing is written to disk regardless of whether the application changed attribute values or defined new objects.

If a dataset is opened with write access and then is closed, all object definitions and attributes are saved to disk. Saving definition and attribute changes can be suppressed by aborting the dataset rather than closing it.

However, changes to variable data occur at the whim of the underlying file system. Variable data is not maintained in virtual memory, but is written directly to the file system. Aborting a dataset in the midst of writing variable data will leave the dataset in an undefined, probably unreadable state.

TeraScan datasets support random hypercube access to variable data. A hypercube is defined by a starting 0-relative coordinate, (i1,i2,...) and a cube size (n1,n2,...). Variable indexing is similar to array indexing under C; i.e., the index of the last dimension is the fastest moving.
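
The indexing convention can be sketched on an in-memory array as follows. This is not the TDF I/O interface: the demo_ functions are hypothetical and simply show how a start coordinate and cube size select elements when the last index moves fastest.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DIMS 5

/* Linear element offset of idx[] in an array of shape dims[],
 * with the last index moving fastest, as in C. */
static size_t demo_linear_offset(const size_t *dims, const size_t *idx,
                                 int ndims)
{
    size_t off = 0;
    int d;
    for (d = 0; d < ndims; d++) {
        assert(idx[d] < dims[d]);
        off = off * dims[d] + idx[d];
    }
    return off;
}

/* Copy the hypercube (start[], count[]) of src into dst, packed.
 * Returns the number of elements copied. */
static size_t demo_read_hypercube(const double *src, const size_t *dims,
                                  const size_t *start, const size_t *count,
                                  int ndims, double *dst)
{
    size_t idx[DEMO_MAX_DIMS], n = 0;
    int d;
    assert(ndims >= 1 && ndims <= DEMO_MAX_DIMS);
    for (d = 0; d < ndims; d++) {
        if (count[d] == 0)
            return 0;
        assert(start[d] + count[d] <= dims[d]);
        idx[d] = start[d];
    }
    for (;;) {
        dst[n++] = src[demo_linear_offset(dims, idx, ndims)];
        /* Advance like an odometer, last dimension first. */
        for (d = ndims - 1; d >= 0; d--) {
            if (++idx[d] < start[d] + count[d])
                break;
            idx[d] = start[d];
        }
        if (d < 0)
            return n;
    }
}
```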

Link Subsets and Assemblies

Any array data that can support random hypercube access can be linked to a TDF variable. For example, data for a variable or variable hypercube in one TeraScan dataset can be linked to a variable in another (or the same) TeraScan dataset. This link mechanism allows data from one or more datasets to be linked to a single dataset without instantiation, i.e., without moving any data around.

The following TeraScan applications take advantage of this link mechanism:

subset - Creates a variable and/or dimension subset of input datasets.

assemble - Gathers selected variables from input datasets into a single output dataset.

burst - Slices variables along any dimension, creating link variables for each of the slices.

impbin - Imports structured array data from non-TDF files.

This link mechanism is similar to the UNIX facility for creating symbolic file links. One drawback of using links is that links can be orphaned. If data in file X is linked to a variable V in dataset A, and then X is removed, the link variable V becomes orphaned.

As a special case, a NULL file can be linked to a TDF variable. In this case, all stored values for the variable are assumed to be 0.

Automatic Uncompression

Datasets that have been compressed using the UNIX compress function can be uncompressed automatically by TeraScan. TeraScan uses the UNIX zcat function to uncompress datasets, redirecting the output to the scratch directory defined by the environment variable UNCOMPRESSDIR. If UNCOMPRESSDIR is undefined, uncompression is not attempted.

A list of automatically uncompressed files is kept in the Registry file in the UNCOMPRESSDIR. This file is ASCII but is not intended to be edited. For each automatically uncompressed file, the following information is shown: true path name of original, full path name of uncompressed copy, last modification time of original in seconds, and the max idle time in seconds.

Idle time is defined to be the difference between the current time and the last access time of the original. The environment variable UNCOMPRESSIDLE specifies the maximum idle time in minutes for automatically uncompressed files. If UNCOMPRESSIDLE is not set, the maximum idle time is assumed to be 60 minutes. Different files can have different maximum idle times.
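
The idle-time test can be sketched with standard C library calls. The demo_ functions are hypothetical, and falling back to the 60-minute default when the variable's text cannot be parsed is an assumption. The text above only specifies the default for the unset case.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define DEMO_DEFAULT_IDLE_MINUTES 60L

/* Maximum idle time in seconds, given the text of UNCOMPRESSIDLE
 * (in minutes). NULL or unparsable text yields the default. */
static long demo_max_idle_seconds(const char *env_text)
{
    long minutes = DEMO_DEFAULT_IDLE_MINUTES;
    if (env_text != NULL) {
        char *end;
        long v = strtol(env_text, &end, 10);
        if (end != env_text && v >= 0)
            minutes = v;
    }
    return minutes * 60L;
}

/* Non-zero if a file last accessed at `atime` has been idle for
 * longer than max_idle_seconds as of `now`. */
static int demo_is_expired(time_t now, time_t atime, long max_idle_seconds)
{
    assert(max_idle_seconds >= 0);
    return difftime(now, atime) > (double)max_idle_seconds;
}
```

An application would typically call demo_max_idle_seconds(getenv("UNCOMPRESSIDLE")) and take the access time from the st_atime field filled in by stat().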

The environment variable UNCOMPRESSMAX specifies the maximum space in megabytes to be allocated in the UNCOMPRESSDIR for automatically uncompressed files. If UNCOMPRESSMAX is not set, the maximum is assumed to be 10 megabytes. This maximum is only a rough limit; see the algorithm outlined below:

Given input compressed file F,

If UNCOMPRESSDIR is not defined, can't uncompress F.

If F is in Registry, F's last modification time matches what is in the Registry, and F's uncompressed copy still exists, use it.

Delete all entries in Registry if original no longer exists, original's last modification time does not match Registry, uncompressed copy does not exist, or idle time (i.e., current time minus last access time of original) exceeds the max idle time.

While the total space occupied by uncompressed copies plus the size of F (not its uncompressed copy!) exceeds UNCOMPRESSMAX, delete the entry in Registry closest to exceeding its max idle time.

Uncompress F and put it in the Registry, setting its max idle time to UNCOMPRESSIDLE.

Hard Limits

There are currently only two hard limits for TeraScan datasets: length of names and number of variable dimensions. The name length limit applies not only to names, but also to such built-in attributes as var->units and rel->kind. Arbitrary name lengths were not implemented for the following reasons:

Applications are invariably written assuming a maximum name length, which may as well be constant across customer sites.

If names have unlimited length, built-in attributes var->units and rel->kind also would have unlimited length.

Unlimited length names mean more extensive use of malloc, which has been avoided.

Error Handling

Pipeline processing applications, interactive display applications, and application subsystems (e.g., TeraScan earth transform) have very different error-handling requirements:

Pipeline processing applications typically take a very brutal approach to errors; i.e., abort!

Interactive display applications must always return control to the user, even on such show-stopping errors as running out of disk space or memory.

Application subsystems must always return control to the application after converting lower-level error codes into higher-level ones (e.g., no such attribute => dataset does not have earth location).

In order to support these different cases, a switchable error-handler is used by all the dataset interface routines. An application subsystem can switch its own error-handler in and out several times while an application is running.

The default error-handler simply sets the TeraScan global variable terrno to the appropriate error code. In addition to UNIX file open and memory allocation errors, the following errors may be encountered:

EGP_ERROR               Unknown error.
EGP_BAD_MAGIC           Invalid or corrupt dataset.
EGP_BAD_DATATYPE        Invalid datatype code.
EGP_DUP_NAME            Duplicate object name.
EGP_NOT_GP_SET          Not a valid dataset.
EGP_BAD_LENGTH          Invalid length.
EGP_BAD_DIM_SIZE        Invalid dimension size.
EGP_DIM_OUTSIDE_SET     Dimension in wrong dataset.
EGP_DIM_STILL_USED      Dimension still in use.
EGP_LINK_DIM_MISMATCH   Dimension mismatch for link.
EGP_LINK_OPEN           Cannot open linked file.
EGP_UNCOMPRESS          Cannot uncompress file.
EGP_NO_SUCH_ATT         No such attribute.
EGP_NO_SUCH_DIM         No such dimension.
EGP_NO_SUCH_VAR         No such variable.
EGP_NO_SUCH_REL         No such relation.
EGP_SHORT_READ          Incomplete read.
EGP_SHORT_WRITE         Incomplete write.
EGP_BAD_REL_NVARS       Invalid number of vars for relation.
EGP_VAR_OUTSIDE_SET     Variable in wrong dataset.
EGP_NUM_VAR_DIMS        Invalid number of variable dimensions.
EGP_VAR_DIM_BOUNDS      Outside variable bounds.
EGP_VAR_DIM_MISMATCH    Dimension mismatch.
EGP_VAR_DIM_SIZE        Non-lead dim can't be growing.
EGP_VAR_READONLY        Variable is readonly.
EGP_VAR_STILL_USED      Variable is still in use.
EGP_BAD_SET_PTR         Bad set pointer found/passed.
EGP_BAD_FILE_PTR        Bad file pointer found/passed.
EGP_BAD_ATT_PTR         Bad attr pointer found/passed.
EGP_BAD_DIM_PTR         Bad dim pointer found/passed.
EGP_BAD_VAR_PTR         Bad var pointer found/passed.
EGP_BAD_REL_PTR         Bad rel pointer found/passed.

These error codes are defined in include/gp.h.
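
The switchable error-handler described in this section can be sketched with a function pointer. Everything below is a hypothetical illustration and not the gperr interface: demo_terrno stands in for terrno, and the two numeric codes are placeholders for values that are actually defined in include/gp.h.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*demo_err_handler)(int code);

static int demo_terrno = 0;            /* stand-in for terrno */

/* Default behaviour: just record the error code. */
static void demo_default_handler(int code)
{
    demo_terrno = code;
}

static demo_err_handler demo_current = demo_default_handler;

/* Install a handler and return the previous one so that the caller
 * can switch it back in later. */
static demo_err_handler demo_set_handler(demo_err_handler h)
{
    demo_err_handler old = demo_current;
    assert(h != NULL);
    demo_current = h;
    return old;
}

/* What a dataset routine would call on failure. */
static void demo_raise(int code)
{
    demo_current(code);
}

/* Placeholder codes for the subsystem example. */
enum { DEMO_NO_SUCH_ATT = 1, DEMO_NO_EARTH_LOCATION = 100 };

/* Subsystem handler: convert a low-level code into a higher-level one. */
static void demo_subsystem_handler(int code)
{
    demo_terrno = (code == DEMO_NO_SUCH_ATT) ? DEMO_NO_EARTH_LOCATION : code;
}
```

A subsystem would install its handler on entry and restore the saved one on exit, so that the application's own policy (abort, or return to the user) applies everywhere else.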

FILES

include/gp.h, lib/libcdf.a, lib/libutils.a, /usr/include/errno.h

SEE ALSO

gpatt, gpdim, gperr, gpio, gplink, gpname, gprel, gpset, gptype, gpvar, dirfile, misc, terrno.

NOTES

One of the strong points of the TDF and its programming interface is that applications do not depend on the physical layout of data on disk. The physical layout of a typical dataset is as follows:

- dataset header of 644 bytes (historical)
- data for non-link variables
- file descriptions for link variables
- dataset attributes
- dimension descriptions
- variable descriptions and attributes
- relation descriptions and attributes

The start of data for a given variable is defined by var->datastart. Data for non-link variables is guaranteed either to be completely contiguous or row-wise contiguous. The ith row of array A is defined to be all elements of A with leading index i. The distance between rows is var->dimdist[0].


Last Update: $Date: 2001/12/14 19:39:19 $