lib/libcdf.a
Each TeraScan dataset is a separate UNIX file organized in the TeraScan common data format (TDF). The TDF is an extremely versatile file format capable of assimilating a wide variety of data types, shapes and sizes. For example, a single dataset could contain satellite image data, random in-situ data, and 3-D model data. The TDF also allows applications to access data without any knowledge of the physical layout of that data.
The TDF was developed during the same period that NASA developed the Common Data Format (CDF) [Treinish and Gough, 1987] and served as a basis for the UNIDATA Network Common Data Format (netCDF) [Rew, 1988]. The TDF has been substantially upgraded since then.
Dimensions, variables, relations, and attributes are the basic dataset components. Variables are simply arrays of data; dimensions define the sizes of these arrays. Relations are ordered lists of variables. Attributes hold information about the dataset as a whole, or about individual variables, dimensions, or relations. Only datasets, variables, and relations can currently have application-defined attributes.
It is important to note that dimensions do not necessarily refer to the physical space in which measurements are taken, but rather to the shapes and sizes of the arrays used to store the measurement data. For example, data taken at random locations in space is stored one-dimensionally. The maximum number of dimensions for a given variable is currently 5. This accommodates the usual X, Y, Z, and time coordinates, plus one left over to handle complex coordinates.
The following datatypes may be used to define variables and attributes: byte, short, long, float, double, and string. Codes and ranges for these datatypes are defined in include/gp.h. String is a variable-width datatype, i.e., the number of bytes required to store one element is application-defined. Applications can implement a complex-valued variable by adding an extra dimension of length 2 to the variable.
Normally all dataset definitions and data are stored in a single UNIX file. However, a dataset can reference variables from several files using links. Links allow rapid import of non-TDF data and support lightweight dataset subsets and assemblies.
Raw AVHRR Imagery
AVHRR data from NOAA satellites consists of up to five channels of image data, plus per-scan line sensor calibration data.
                          +------------------------+
   AVHRR Data             |      Last Channel      |
                      +------------------------+   |
                      |     Second Channel     |   |
+--------+        +------------------------+   |   |
| Header |        |     First Channel      |   |   |
|        |        |                        |   |   |
|        |        |                        |   |   |
|        |        |                        |   |---+
|        |        |                        |   |
|        |        |                        |---+
|        |        |                        |
+--------+        +------------------------+
The resulting TDF dataset has three dimensions:
# image lines (rows)
# image samples (columns)
width of the header
It also contains a header variable, plus one variable for each of the channels.
TOVS Package Output
Output for the Wisconsin TOVS package consists of a number of meteorological readings at constant elevations, but at random (latitude,longitude) points. The following schema shows one approach to fitting this output into the TDF:
Dimensions:
    1   472  points
    2    19  hirs_channels
    3     4  msu_channels
    4    15  gp_levels
    5    14  temp_levels
    6     5  dew_pt_levels
    7     9  wind_levels
    8    10  temp_inp_levels
    9     5  dp_inp_levels
Global Attributes:
    pass_date         00/09/16
    start_time        01:13:49
    satellite         noaa-11
Variables:
    gp_levels         gp_levels
    temp_levels       temp_levels
    dew_pt_levels     dew_pt_levels
    wind_levels       wind_levels
    temp_inp_levels   temp_inp_levels
    dp_inp_levels     dp_inp_levels
    latitude          points
    longitude         points
    elevation         points
    solar_zenith      points
    ground_temp       points
    ground_dew_pt     points
    ground_pressure   points
    skin_temp         points
    total_ozone       points
    stability_index   points
    precip_water      points
    longwave_flux     points
    cloud_pressure    points
    cloud_temp        points
    hirs_temp         points X hirs_channels
    msu_temp          points X msu_channels
    gp_height         points X gp_levels
    temperature       points X temp_levels
    dew_pt            points X dew_pt_levels
    wind_speed        points X wind_levels
    wind_dir          points X wind_levels
    temp_guess        points X temp_inp_levels
    dew_pt_guess      points X dp_inp_levels
For brevity, units and datatypes have been omitted from the above listing. Note that for each of the level dimensions there is a corresponding variable with the same name. These dimension variables supply the values corresponding to each dimension index. For example, the 9 wind_levels are 850, 700, 500, 400, 300, 250, 200, 150, and 100 millibars.
TDF access routines are independent of any other TeraScan software component except lib/utils.a. See dirfile, misc, and terrno. Therefore, TDF applications can be written without using TeraScan user interface or earth transform facilities. TDF calls can be embedded in existing non-TeraScan applications as desired.
TDF datatypes, constants, and error status codes are defined in include/gp.h.
Object Pointers
The basic TDF objects are sets, dimensions, variables, and relations. Application-defined attributes are not considered objects, even though they can be treated as such. Files are secondary objects and are of only passing concern to applications.
Pointers to objects (actually object data structures) are returned by search or definition functions. These pointers are used as arguments to other functions. All data structures have magic numbers and alignment criteria that help identify bogus pointers. A pointer to an object's data structure is pinned (i.e., can never change) until the object is no longer available (i.e., when the containing dataset is closed).
All application-accessible data structures exist in memory that is allocated using UNIX malloc. malloc is used sparingly and in an unfragmented manner so as not to impact applications that also use (and possibly abuse) malloc.
Applications cannot be prevented from modifying data structures, even for datasets opened as read-only. Given this, it was decided to let applications perform all operations except variable I/O for read-only datasets, including defining new variables, relations, and attributes.
One obvious disadvantage of having application-accessible data structures is that applications will undoubtedly trash them more easily than if they were hidden. All data structure components should be considered read-only, unless otherwise specified.
Applications can loop through a list of similar objects (e.g., all dimensions belonging to a dataset) using
while (pointer != NULL) pointer = pointer->next;
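The traversal idiom above can be sketched as follows. This is only an illustration: demo_dim is a hypothetical stand-in for the real dimension structure in include/gp.h, and demo_find_dim and demo_count_dims are made-up helper names, not part of the TDF interface.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a TDF dimension structure; the real
 * definition lives in include/gp.h and has more fields. */
typedef struct demo_dim {
    const char *name;        /* dimension name */
    long size;               /* current size */
    struct demo_dim *next;   /* next dimension in the dataset, or NULL */
} demo_dim;

/* Walk the list and return the dimension with the given name, or NULL. */
demo_dim *demo_find_dim(demo_dim *first, const char *name)
{
    demo_dim *dim;

    for (dim = first; dim != NULL; dim = dim->next) {
        if (strcmp(dim->name, name) == 0)
            return dim;
    }
    return NULL;
}

/* Count the entries in the list. */
int demo_count_dims(const demo_dim *first)
{
    int n = 0;

    for (; first != NULL; first = first->next)
        n++;
    return n;
}
```

With a real dataset, the starting pointer would presumably be set->dim, and the same pattern would apply to set->var, set->rel, and the attribute lists.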
Attributes
Attributes refine the definitions of datasets and their components. There are two kinds of attributes:
Built-in attributes, i.e., fields in application-accessible data structures (see include/gp.h).
Application-defined attributes, created using the define or copy attribute functions.
Dimensions and files do not have application-defined attributes. The only file attribute of any interest to applications is file->path, which is built-in. Application-defined dimension attributes may be added in the future.
Note that applications are free to change names directly and potentially generate name conflicts within a dataset. This is the least harmful of all the ways applications can damage datasets.
Different objects can have attributes with the same name, but with different datatypes or lengths. This new flexibility should be used cautiously; two attributes with different meanings should never have the same name.
The following built-in attributes are intended for use by applications; only those marked (*) can be set directly by applications.
* dim->name       - dimension name
  dim->unlimited  - non-zero if dimension can grow
  dim->size       - current size
* dim->coord      - dimension coordinate
* dim->scale      - orig index = index * scale + offset
* dim->offset
* var->name       - variable name
* var->units      - units
  var->type       - datatype
* var->badval     - missing value as stored on disk
* var->usemin     - minimum valid stored value
* var->usemax     - maximum valid stored value
* var->scale      - true value = scale * stored value + offset
* var->offset
* rel->name       - relation name
* rel->kind       - relation kind (analogous to variable units)
* att->name       - attribute name
* att->units      - attribute units
  att->type       - datatype
  att->size       - number of elements in attribute
  file->path      - file path name
Application-defined attributes are normally not accessed like objects. Their values are set and retrieved by name, rather than by pointer. Pointers to attribute definitions are available for getting attribute datatype, lengths, and units, as well as looping through lists of attributes.
Application-Defined Relationships
The new abstraction relation has been added to datasets. A relation consists of an ordered list of variables all belonging to the same dataset. Relations have built-in attributes name and kind, where relation kind is analogous to variable units. Relations also can have application-defined attributes. The number and order of the variables associated by a relation, as well as its application-defined attributes, are determined by its kind.
The following is an example of how relations can be used:
Given a variable date that contains an ordered list of dates, a variable year that contains an ordered list of years, and a variable year_index that is defined as follows:
index[i] = j if k > j => date[k] >= year[i]
define the relation year_index of kind sparse_index, consisting of the ordered tuple (date, year, year_index). (Obviously, date and year must have the same units for this to work.)
Built-in Relationships
The following relationships are built-in to application-accessible data structures; only those marked (*) can be changed directly by applications:
var->dim[], var->ndims - variable has dimensions
rel->var[], rel->nvars - relation relates variables
* dim->var - a dimension can get its values from a variable
i.e., value corresponding to dim=i is var[i]
var->file - a variable's data is stored in a file
set->natts, set->att, att->next - dataset has attributes
var->natts, var->att, att->next - variable has attributes
rel->natts, rel->att, att->next - relation has attributes
set->ndims, set->dim, dim->next - a dataset has dimensions
set->nvars, set->var, var->next - a dataset has variables
set->nrels, set->rel, rel->next - a dataset has relations
firstset, set->next - a program has a list of datasets
dim->owner - a dimension belongs to a dataset
var->owner - a variable belongs to a dataset
rel->owner - a relation belongs to a dataset
att->owner - an attribute belongs to a dataset, variable,
or relation
Pointers are used to represent all built-in relationships. Linked lists are used for all has relationships except two: var->dim[] and rel->var[]. In both cases, these associations are many-to-many. Linked lists are impractical due to multi-threading. Instead, variable dimensions and relation variables are stored in arrays. The number of variable dimensions is limited (e.g., GP_VAR_DIMS = 5). There is no limit on the number of relation variables.
Some built-in relationships are circular; e.g. var->dim[] and dim->var, or set->var and var->owner. Due to the hierarchical nature of declarations in C, some of these pointers have to be declared of type char, which is unfortunate.
Scaled Variable Data
In original TeraScan datasets, information for converting 8-bit or 16-bit data to real values was stored in application-defined scaling attributes. Now, scaling attributes are built into all variables, regardless of datatype. var->scale and var->offset are used to convert stored data to its true form:
true value = var->scale * stored value + var->offset
Note that built-in attributes var->badval, var->usemin, and var->usemax all refer to stored values. When presenting these attributes to users, applications may want to apply scaling to at least var->usemin and var->usemax.
The most common use of scaling is to store real-valued data with a minimum yet appropriate number of significant bits. However, scaling can be used to help change variable units without changing actual data. For example, to change from degrees Celsius to degrees Fahrenheit:
gpputname(var->units, C_FAHRENHEIT);
/* F = 1.8 * C + 32, so the existing offset must be rescaled as well */
var->scale *= 1.8;
var->offset = var->offset * 1.8 + 32.;
Another benefit of built-in scaling is that it allows applications to pretend they are working with a single type of data: double precision. Variable read and write routines that respectively scale and unscale data are provided as part of the standard interface. This does not preclude the writing of applications that treat each type of variable differently.
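The scaling rule and the unit change can be sketched as below. demo_var is a hypothetical stand-in holding only the three fields involved; the real structure is in include/gp.h, and the units string "fahrenheit" is an assumed value. Substituting C = scale * stored + offset into F = 1.8 * C + 32 gives F = (1.8 * scale) * stored + (1.8 * offset + 32), so both scale and offset pick up the factor of 1.8.

```c
#include <string.h>

/* Hypothetical stand-in for the scaling-related fields of a variable. */
typedef struct demo_var {
    char units[32];   /* units name */
    double scale;     /* true value = scale * stored value + offset */
    double offset;
} demo_var;

/* Convert a stored value to its true value. */
double demo_true_value(const demo_var *var, double stored)
{
    return var->scale * stored + var->offset;
}

/* Re-express a Celsius variable in Fahrenheit without touching the data. */
void demo_celsius_to_fahrenheit(demo_var *var)
{
    /* F = 1.8 * C + 32, with C = scale * stored + offset */
    var->scale = var->scale * 1.8;
    var->offset = var->offset * 1.8 + 32.0;
    strcpy(var->units, "fahrenheit");   /* assumed units string */
}
```

For a variable with scale 0.5 and offset -10, a stored value of 40 is 10 degrees Celsius before the change and 50 degrees Fahrenheit after it, with the stored data untouched.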
Dimension Coordinates
Applications may use the coord, scale, and offset built-in dimension attributes to relate different dimensions. For example, if two dimensions have the same coord attribute, applications may decide that the two dimensions are parallel. The scale and offset attributes can then be used to determine the exact correspondence between the two dimensions, assuming that correspondence is linear.
Coordinate types GP_X_COORD, GP_Y_COORD, GP_Z_COORD, GP_TIME_COORD, GP_COMPLEX_COORD, and GP_NO_COORD are defined in include/gp.h for this purpose. Applications are not restricted to these coordinate types.
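One way an application might relate two parallel dimensions is sketched below, using the built-in rule orig index = index * scale + offset. The demo_axis structure and demo_map_index function are hypothetical; treating dimensions with different coord values as unrelated is an application-level choice, not something the library enforces.

```c
/* Hypothetical stand-in for the coordinate-related dimension fields. */
typedef struct demo_axis {
    int coord;       /* coordinate type code */
    double scale;    /* orig index = index * scale + offset */
    double offset;
} demo_axis;

/* Map an index along 'from' to the matching index along 'to'.
 * Returns 1 and stores the result on success; returns 0 if the two
 * dimensions have different coordinate types or 'to' has zero scale. */
int demo_map_index(const demo_axis *from, const demo_axis *to,
                   double index, double *result)
{
    double orig;

    if (from->coord != to->coord || to->scale == 0.0)
        return 0;
    orig = index * from->scale + from->offset;    /* back to original index */
    *result = (orig - to->offset) / to->scale;    /* forward into 'to' */
    return 1;
}
```

For example, if a subsampled dimension has scale 2 and offset 100 relative to the original, its index 5 corresponds to original index 110.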
Unlimited (Growing) Dimensions
Unlimited dimensions can be defined using a size of GP_UNLIMITED, found in include/gp.h. The following guidelines apply when working with datasets with unlimited dimensions:
Only one dimension in a dataset can be growing; defining a second unlimited dimension will fix the size of the former growing dimension.
If a variable is defined with a growing dimension, that dimension must be the variable's leading dimension.
All variables to be defined with an unlimited leading dimension must be defined prior to writing any data corresponding to that dimension. The size of the unlimited dimension will be fixed at the point where the new variable is defined.
Cloning Objects
Cloning an object refers to the process of creating a like object with the same attributes, optionally with a new name. When a variable is cloned, the new variable is created with the same named dimensions. These dimensions must exist in the output dataset, but do not have to have the same sizes as the corresponding dimensions of the original variable. Similarly, when a relation is cloned, the new relation is created, associating the same named variables.
When a dimension is cloned, its corresponding variable (if one is defined) is not carried over to the new dimension. This would present a chicken and egg problem, because the dimension could not be created without the variable, and the variable could not be created without the dimension.
Definitions vs. Variable Data
Everything about a dataset with the exception of variable data is maintained in virtual memory until the dataset is closed or synched. If a dataset is opened for read access and then is closed, nothing is written to disk regardless of whether the application changed attribute values or defined new objects.
If a dataset is opened with write access and then is closed, all object definitions and attributes are saved to disk. Saving definition and attribute changes can be suppressed by aborting the dataset rather than closing it.
However, changes to variable data occur at the whim of the underlying file system. Variable data is not maintained in virtual memory, but is written directly to the file system. Aborting a dataset in the midst of writing variable data will leave the dataset in an undefined, probably unreadable state.
TeraScan datasets support random hypercube access to variable data. A hypercube is defined by a starting 0-relative coordinate, (i1,i2,...) and a cube size (n1,n2,...). Variable indexing is similar to array indexing under C; i.e., the index of the last dimension is the fastest moving.
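The indexing rule can be illustrated with a small in-memory sketch. This is not the TDF read routine: demo_read_hypercube is a hypothetical function that copies a hypercube out of an ordinary C array, using the same 0-relative start, cube size, and last-index-fastest ordering described above.

```c
#include <stddef.h>

#define DEMO_MAX_DIMS 5   /* mirrors the 5-dimension limit in the text */

/* Copy the hypercube with corner start[] and size count[] out of a
 * row-major array of shape dims[] into out[]. Returns the number of
 * elements copied, or 0 if the request is invalid or out of bounds. */
size_t demo_read_hypercube(const double *data, int ndims,
                           const size_t *dims, const size_t *start,
                           const size_t *count, double *out)
{
    size_t idx[DEMO_MAX_DIMS] = {0};   /* position inside the cube */
    size_t total = 1, n;
    int d;

    if (ndims < 1 || ndims > DEMO_MAX_DIMS)
        return 0;
    for (d = 0; d < ndims; d++) {
        if (count[d] == 0 || start[d] + count[d] > dims[d])
            return 0;
        total *= count[d];
    }
    for (n = 0; n < total; n++) {
        /* Linear offset of the current element, last index fastest. */
        size_t off = 0;
        for (d = 0; d < ndims; d++)
            off = off * dims[d] + (start[d] + idx[d]);
        out[n] = data[off];
        /* Advance like an odometer, last dimension first. */
        for (d = ndims - 1; d >= 0; d--) {
            if (++idx[d] < count[d])
                break;
            idx[d] = 0;
        }
    }
    return total;
}
```

For a 3 x 4 array holding 0 through 11, a start of (1,1) and a size of (2,2) selects the elements 5, 6, 9, and 10, in that order.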
Link Subsets and Assemblies
Any array data that can support random hypercube access can be linked to a TDF variable. For example, data for a variable or variable hypercube in one TeraScan dataset can be linked to a variable in another (or the same) TeraScan dataset. This link mechanism allows data from one or more datasets to be linked to a single dataset without instantiation, i.e., without moving any data around.
The following TeraScan applications take advantage of this link mechanism:
subset - Creates a variable and/or dimension subset of input datasets.
assemble - Gathers selected variables from input datasets into a single output dataset.
burst - Slices variables along any dimension, creating link variables for each of the slices.
impbin - Imports structured array data from non-TDF files.
This link mechanism is similar to the UNIX facility for creating symbolic file links. One drawback of using links is that links can be orphaned. If data in file X is linked to a variable V in dataset A, and then X is removed, the link variable V becomes orphaned.
As a special case, a NULL file can be linked to a TDF variable. In this case, all stored values for the variable are assumed to be 0.
Automatic Uncompression
Datasets that have been compressed using the UNIX compress function can be uncompressed automatically by TeraScan. TeraScan uses the UNIX zcat function to uncompress datasets, redirecting the output to the scratch directory defined by the environment variable UNCOMPRESSDIR. If UNCOMPRESSDIR is undefined, uncompression is not attempted.
A list of automatically uncompressed files is kept in the Registry file in the UNCOMPRESSDIR. This file is ASCII but is not intended to be edited. For each automatically uncompressed file, the following information is shown: true path name of original, full path name of uncompressed copy, last modification time of original in seconds, and the max idle time in seconds.
Idle time is defined to be the difference between the current time and the last access time of the original. The environment variable UNCOMPRESSIDLE specifies the maximum idle time in minutes for automatically uncompressed files. If UNCOMPRESSIDLE is not set, the maximum idle time is assumed to be 60 minutes. Different files can have different maximum idle times.
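The idle-time rule might be sketched as follows. The function names are hypothetical, and falling back to the 60-minute default for empty, non-numeric, or negative values is an assumption; the text only specifies the default for an unset variable.

```c
#include <stdlib.h>
#include <time.h>

#define DEMO_DEFAULT_IDLE_MINUTES 60L   /* default stated in the text */

/* Parse an UNCOMPRESSIDLE value given in minutes. NULL (unset) yields
 * the default; so do malformed values, which is an assumption here. */
long demo_parse_idle_minutes(const char *text)
{
    char *end;
    long minutes;

    if (text == NULL || *text == '\0')
        return DEMO_DEFAULT_IDLE_MINUTES;
    minutes = strtol(text, &end, 10);
    if (*end != '\0' || minutes < 0)
        return DEMO_DEFAULT_IDLE_MINUTES;
    return minutes;
}

/* Maximum idle time in seconds, taken from the environment. */
long demo_max_idle_seconds(void)
{
    return demo_parse_idle_minutes(getenv("UNCOMPRESSIDLE")) * 60L;
}

/* Non-zero if the original has been idle longer than allowed, where
 * idle time is current time minus the original's last access time. */
int demo_idle_expired(time_t now, time_t last_access, long max_idle_seconds)
{
    return difftime(now, last_access) > (double)max_idle_seconds;
}
```

The last access time of the original would typically come from the st_atime field filled in by stat().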
The environment variable UNCOMPRESSMAX specifies the maximum space in megabytes to be allocated in the UNCOMPRESSDIR for automatically uncompressed files. If UNCOMPRESSMAX is not set, the maximum is assumed to be 10 megabytes. This maximum is only a rough limit; see the algorithm outlined below:
Given input compressed file F,
If UNCOMPRESSDIR is not defined, can't uncompress F.
If F is in Registry, F's last modification time matches what is in the Registry, and F's uncompressed copy still exists, use it.
Delete every entry in Registry for which the original no longer exists, the original's last modification time does not match Registry, the uncompressed copy does not exist, or the idle time (i.e., current time minus last access time of original) exceeds the max idle time.
While the total space occupied by uncompressed copies plus the size of F (not its uncompressed copy!) exceeds UNCOMPRESSMAX, delete the entry in Registry closest to exceeding its max idle time.
Uncompress F and put it in the Registry, setting its max idle time to UNCOMPRESSIDLE.
Hard Limits
There are currently only two hard limits for TeraScan datasets: length of names and number of variable dimensions. The name length limit applies not only to names, but also to such built-in attributes as var->units and rel->kind. Arbitrary name lengths were not implemented for the following reasons:
Applications are invariably written assuming a maximum name length, which may as well be constant across customer sites.
If names have unlimited length, built-in attributes var->units and rel->kind also would have unlimited length.
Unlimited length names mean more extensive use of malloc, which has been avoided.
Error Handling
Pipeline processing applications, interactive display applications, and application subsystems (e.g., TeraScan earth transform) have very different error-handling requirements:
Pipeline processing applications typically take a very brutal approach to errors; i.e., abort!
Interactive display applications must always return control to the user, even on such show-stopping errors as running out of disk space or memory.
Application subsystems must always return control to the application after converting lower-level error codes into higher-level ones (e.g., no such attribute => dataset does not have earth location).
In order to support these different cases, a switchable error-handler is used by all the dataset interface routines. An application subsystem can switch its own error-handler in and out several times while an application is running.
The default error-handler simply sets the TeraScan global variable terrno to the appropriate error code. In addition to UNIX file open and memory allocation errors, the following errors may be encountered:
EGP_ERROR              Unknown error.
EGP_BAD_MAGIC          Invalid or corrupt dataset.
EGP_BAD_DATATYPE       Invalid datatype code.
EGP_DUP_NAME           Duplicate object name.
EGP_NOT_GP_SET         Not a valid dataset.
EGP_BAD_LENGTH         Invalid length.
EGP_BAD_DIM_SIZE       Invalid dimension size.
EGP_DIM_OUTSIDE_SET    Dimension in wrong dataset.
EGP_DIM_STILL_USED     Dimension still in use.
EGP_LINK_DIM_MISMATCH  Dimension mismatch for link.
EGP_LINK_OPEN          Cannot open linked file.
EGP_UNCOMPRESS         Cannot uncompress file.
EGP_NO_SUCH_ATT        No such attribute.
EGP_NO_SUCH_DIM        No such dimension.
EGP_NO_SUCH_VAR        No such variable.
EGP_NO_SUCH_REL        No such relation.
EGP_SHORT_READ         Incomplete read.
EGP_SHORT_WRITE        Incomplete write.
EGP_BAD_REL_NVARS      Invalid number of vars for relation.
EGP_VAR_OUTSIDE_SET    Variable in wrong dataset.
EGP_NUM_VAR_DIMS       Invalid number of variable dimensions.
EGP_VAR_DIM_BOUNDS     Outside variable bounds.
EGP_VAR_DIM_MISMATCH   Dimension mismatch.
EGP_VAR_DIM_SIZE       Non-lead dim can't be growing.
EGP_VAR_READONLY       Variable is readonly.
EGP_VAR_STILL_USED     Variable is still in use.
EGP_BAD_SET_PTR        Bad set pointer found/passed.
EGP_BAD_FILE_PTR       Bad file pointer found/passed.
EGP_BAD_ATT_PTR        Bad attr pointer found/passed.
EGP_BAD_DIM_PTR        Bad dim pointer found/passed.
EGP_BAD_VAR_PTR        Bad var pointer found/passed.
EGP_BAD_REL_PTR        Bad rel pointer found/passed.
These error codes are defined in include/gp.h.
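The idea of a switchable error handler can be sketched generically as below. None of these names come from the TDF interface (see gperr for the real calls): demo_terrno stands in for terrno, the numeric codes are invented, and the subsystem handler only illustrates translating a lower-level code into a higher-level one.

```c
#include <stddef.h>

/* Hypothetical handler type: receives the error code being raised. */
typedef void (*demo_err_handler)(int code);

int demo_terrno = 0;   /* stand-in for the TeraScan global terrno */

/* Invented codes for illustration only. */
enum { DEMO_NO_SUCH_ATT = 1, DEMO_NO_EARTH_LOCATION = 100 };

/* Default behavior: just record the code. */
static void demo_default_handler(int code)
{
    demo_terrno = code;
}

static demo_err_handler demo_current = demo_default_handler;

/* Install a handler (NULL restores the default) and return the
 * previous one so the caller can switch it back in later. */
demo_err_handler demo_set_handler(demo_err_handler handler)
{
    demo_err_handler previous = demo_current;

    demo_current = (handler != NULL) ? handler : demo_default_handler;
    return previous;
}

/* Called by library routines when something goes wrong. */
void demo_raise(int code)
{
    demo_current(code);
}

/* A subsystem handler that converts a low-level code into a
 * higher-level one before recording it. */
void demo_subsystem_handler(int code)
{
    demo_terrno = (code == DEMO_NO_SUCH_ATT) ? DEMO_NO_EARTH_LOCATION : code;
}
```

A subsystem would save the value returned by demo_set_handler on entry and pass it back to demo_set_handler on exit, which is what lets it switch its handler in and out repeatedly.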
include/gp.h, lib/libcdf.a, lib/libutils.a, /usr/include/errno.h
gpatt, gpdim, gperr, gpio, gplink, gpname, gprel, gpset, gptype, gpvar, dirfile, misc, terrno.
One of the strong points of the TDF and its programming interface is that applications do not depend on the physical layout of data on disk. The physical layout of a typical dataset is as follows:
- dataset header of 644 bytes (historical)
- data for non-link variables
- file descriptions for link variables
- dataset attributes
- dimension descriptions
- variable descriptions and attributes
- relation descriptions and attributes
The start of data for a given variable is defined by var->datastart. Data for non-link variables is guaranteed either to be completely contiguous or row-wise contiguous. The ith row of array A is defined to be all elements of A with leading index i. The distance between rows is var->dimdist[0].
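The row layout can be sketched as simple offset arithmetic. demo_layout is a hypothetical stand-in for the two fields involved, and measuring both values in bytes is an assumption; the text does not state the units of var->datastart or var->dimdist[0].

```c
/* Hypothetical stand-in for the layout fields of a variable.
 * Both values are assumed to be in bytes. */
typedef struct demo_layout {
    long datastart;   /* start of the variable's data (var->datastart) */
    long rowdist;     /* distance between rows (var->dimdist[0]) */
} demo_layout;

/* Position of the first element of row i. */
long demo_row_offset(const demo_layout *layout, long row)
{
    return layout->datastart + row * layout->rowdist;
}

/* Non-zero if rows are packed back to back, i.e., the data is
 * completely contiguous rather than only row-wise contiguous. */
int demo_is_contiguous(const demo_layout *layout, long row_bytes)
{
    return layout->rowdist == row_bytes;
}
```

When the row distance exceeds the size of one row, the data is only row-wise contiguous, and each row has to be read with a separate seek.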
Last Update: $Date: 2001/12/14 19:39:19 $