McMurdo LTER Proposal: Section 5

Section 5. Data Management

The MCM-LTER data and information management system is housed at INSTAAR, University of Colorado. We have adopted the general features of the NWT-LTER data system (Ingersoll et al. 1997), also located at INSTAAR, including a centralized data system with different levels of access that is managed by a full-time data manager. The manager is supervised by Dr. McKnight and works with PIs and collaborators to meet database needs and to merge data sets relating to common sampling sites and times. Our system is designed to minimize the time between data collection, data submission, and acquisition of metadata, and to provide timely public access to data files and other site-related information (e.g. maps, bibliographic resources, site news).

A distinguishing feature of the MCM-LTER is the high degree of coordination among investigators in planning each field season, which carries over to the data management program. The data manager can anticipate which data sets will be submitted following the analysis of the field data at the home institution. Data sets are routinely submitted to the data manager in an electronic format after quality assurance by the investigator. Our core data sets are 1) continuous year-round (such as meteorological data), 2) continuous during the austral summer (streamflow), and 3) collected at discrete time points at specific locations (e.g. lake water chemistry). These core data need to be promptly accessible to all investigators so that they can interpret their results. Following the approach presented by Ingersoll et al. (1997), we recognize the following types of data:

Type 1- Electronic data: e.g. continuous meteorological data
Type 2- (Electronic) Hard copy data: e.g. field measurements and analysis of discrete samples
Type 3- Electronic manipulated data: e.g. continuous and discrete streamflow data which have been interpreted by using rating curves determined for each season.

The data manager is not directly involved in primary data entry. The data manager merges the Type 2 data sets to verify that a record is complete and to identify inconsistencies that can be resolved by querying the field team. The Type 1 and Type 3 data sets are from established meteorological and stream gauging networks and, thus, require only slight modifications between years.
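
As an illustration only, the completeness check described above could be sketched in Python as a comparison between the planned sampling events and the records actually submitted. The function name, the record layout (site, date, depth), and the sample values below are hypothetical and do not represent the MCM-LTER file structure.

```python
# Hypothetical sketch: record layout and values are assumptions for illustration.

def check_completeness(schedule, submitted):
    """Compare planned sampling events with submitted records.

    schedule  -- iterable of (site, date, depth_m) tuples from the sampling plan
    submitted -- iterable of (site, date, depth_m) tuples found in the data files
    Returns (missing, unexpected) as sorted lists of tuples.
    """
    planned = set(schedule)
    received = set(submitted)
    missing = sorted(planned - received)      # planned but never submitted
    unexpected = sorted(received - planned)   # submitted but not in the plan
    return missing, unexpected


# Made-up example: one planned depth is missing and one date does not match.
schedule = [
    ("lake_A", "1995-11-20", 5.0),
    ("lake_A", "1995-11-20", 10.0),
    ("lake_B", "1995-11-25", 5.0),
]
submitted = [
    ("lake_A", "1995-11-20", 5.0),
    ("lake_B", "1995-11-25", 5.0),
    ("lake_B", "1995-11-26", 5.0),
]
missing, unexpected = check_completeness(schedule, submitted)
for item in missing:
    print("query field team, missing:", item)
for item in unexpected:
    print("query field team, not in plan:", item)
```

Both lists would be returned to the field team as queries; the sketch flags inconsistencies but does not try to resolve them.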

The configuration of the hardware and software used to archive and manage data is shown in Figure 5.1. Data are entered on computers at the co-principal investigators' home institutions. Files are then submitted in PC format using FTP. Metadata (documentation to go with data files) are submitted electronically and/or by specifying a publication containing the necessary information. INSTAAR's UNIX system (a Sun/Ultra-Enterprise 150 Server) is used to retrieve files submitted electronically, manually enter data and text from hard copies, store raw data, revise formats for representation in a relational mode, and generate files to use on the web page. Available software includes Microsoft Word, Excel, and Access, as well as general text editors and HTML coding.

In addition to providing comma-delimited ASCII files for all data sets, we have developed special data management tools for meteorological and hydrological data. The meteorological tool was created by a former data manager (Ken McGwire). This tool is a web-based front end to a CGI program written in the C programming language. The met tool allows the user to subset data covering specified time periods and to aggregate these data according to standard time intervals. Users may extract values simultaneously for multiple parameters and multiple meteorological stations. Data are taken from ASCII flat files for each station. This tool currently resides on a UNIX machine at DRI and is linked from our main web site; it will be transferred to Colorado in year 1 of MCM-II. The hydrological tool was created at the USGS office in Wisconsin. It is Oracle-based and, like the meteorological tool, allows the user to extract data by specifying multiple criteria. This tool currently resides at the USGS and is linked from the web site at INSTAAR. A similar tool covering multiple types of data (meteorological, limnological, soil, etc.) will be developed at INSTAAR in the near future using Oracle-based software.
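
The subset-and-aggregate operation performed by the meteorological tool can be illustrated with a short Python sketch. This is not the existing C program; the function name, the record layout, the parameter names, and the choice of daily means as the aggregation interval are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, datetime


def subset_and_aggregate(records, start, end, params):
    """Average selected parameters per station and calendar day.

    records    -- iterable of dicts with 'station', 'time' (datetime) and parameter keys
    start, end -- datetimes bounding the period (start inclusive, end exclusive)
    params     -- names of the parameters to extract
    Returns {(station, day, parameter): mean value}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        if not (start <= rec["time"] < end):
            continue  # outside the requested period
        for name in params:
            value = rec.get(name)
            if value is None:
                continue  # missing reading for this parameter
            key = (rec["station"], rec["time"].date(), name)
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Made-up readings for a single hypothetical station.
records = [
    {"station": "station_1", "time": datetime(1995, 12, 1, 0), "air_temp": -4.0, "wind": 3.0},
    {"station": "station_1", "time": datetime(1995, 12, 1, 12), "air_temp": -2.0, "wind": 5.0},
    {"station": "station_1", "time": datetime(1995, 12, 2, 6), "air_temp": -6.0},
    {"station": "station_1", "time": datetime(1995, 12, 5, 6), "air_temp": -1.0},
]
daily = subset_and_aggregate(
    records, datetime(1995, 12, 1), datetime(1995, 12, 3), ["air_temp", "wind"]
)
for key in sorted(daily):
    print(key, daily[key])
```

Keying the result on station, day, and parameter lets one call serve multiple stations and multiple parameters at once, as the tool does.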

The data manager communicates with all science teams before the field season about the sampling plan, permitting preparation of templates for data entry and cross-relational file structure in anticipation of data submittal. At the end of the field season, the data manager receives the actual sampling schedule (e.g. sampling dates and depths for each lake; sampling dates and locations for glacier, soil, and stream field measurements and samples) and prepares a master data template for use by the investigators in submitting their data.

The MCM-LTER has two categories of Type 2 data: fast acquisition data, obtained within a few months of the field season (e.g. water chemistry data), and slow acquisition data, for which the analysis is more time-consuming or costly and results are unavailable for 6-12 months or more (e.g. bacterioplankton, phytoplankton, soil biota species abundance). The data manager tracks the progress of both categories of Type 2 data. This information about progress is used to devise long-term plans for allocating resources for subsequent analysis. The data manager performs the final quality assurance and quality control before a data set is made available to the community and organizes the metadata for the submitted data sets. The data manager also posts updates regarding availability of delayed Type 2 data on the web site.

Upon request and time permitting, the data manager generates composite data sets from algorithms provided by the investigators. For example, one composite data set is the annual solute flux to a dry valley lake from a given stream or for all streams flowing into that lake. This employs continuous Type 3 data from the hydrologic network and discrete chemical data obtained for samples collected at specific times. The data manager works with the investigator to document the quality of the composite data set for a particular year. This approach is illustrated in Figure 1.
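
A composite calculation of this kind might look like the following Python sketch. The actual algorithms are provided by the investigators; here we simply assume that concentration is linearly interpolated between discrete samples (and held constant outside the sampled range) and that flux is the sum of discharge times concentration times the time step. All names, units, and values are hypothetical.

```python
def interpolate(sample_times, sample_values, t):
    """Linearly interpolate discrete samples; hold end values outside the sampled range."""
    if t <= sample_times[0]:
        return sample_values[0]
    if t >= sample_times[-1]:
        return sample_values[-1]
    for i in range(1, len(sample_times)):
        if t <= sample_times[i]:
            t0, t1 = sample_times[i - 1], sample_times[i]
            c0, c1 = sample_values[i - 1], sample_values[i]
            return c0 + (c1 - c0) * (t - t0) / (t1 - t0)


def solute_flux(q_times, discharge, c_times, conc, step_s):
    """Sum discharge (L/s) x concentration (mg/L) x time step (s); result in mg."""
    total = 0.0
    for t, q in zip(q_times, discharge):
        total += q * interpolate(c_times, conc, t) * step_s
    return total


# Made-up record: four 15-minute discharge values and two chemistry samples.
q_times = [0, 900, 1800, 2700]        # seconds since start of record
discharge = [10.0, 20.0, 20.0, 10.0]  # L/s (continuous Type 3 data)
c_times = [0, 1800]                   # seconds at which samples were collected
conc = [2.0, 4.0]                     # mg/L (discrete chemical data)
flux_mg = solute_flux(q_times, discharge, c_times, conc, step_s=900)
print(flux_mg)  # 180000.0 mg for this made-up record
```

Documenting the quality of the composite for a given year would then include recording how many chemistry samples supported the interpolation and how much of the flow record fell outside the sampled range.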

A GIS of the Taylor Valley was developed at DRI. This provides a means to organize the spatial data and can also be used to develop maps of system characteristics. The GIS and map-maker function can now be accessed through the web site.

The MCM-LTER has a record of timely submission of data to the data manager. We have a sequential procedure to ensure that timely data submission is maintained. The data manager keeps the supervising investigator (Dr. McKnight) informed of status on a weekly basis. In the case of a substantial delay or lack of response to the data manager's inquiry, Dr. McKnight contacts the investigator to discuss plans which meet the needs of other investigators. Dr. Lyons, as the lead PI, conducts further discussion as needed. Persistent, unsatisfactory conditions may be considered in planning for future field seasons and allocation of resources. Because each field season is planned by the MCM-LTER as a team, with decisions made about distribution of "slots" on the ice, equipment purchases, etc., investigators have a strong incentive to be current with their submission of data.

Data Accessibility and Schedule: Data accessibility is very much driven by the timing of the annual field season and completion of data analyses. The continuous data sets (Type 1 and 3) are submitted to the data manager within 2-4 months after completion of the field season in mid-February. When these data are posted in the MCM-LTER database they are immediately available to other MCM-LTER researchers and the broad scientific community. This corresponds to an annual updating of the Type 1 & 3 data sets.

Type 2 data sets of core monitoring data and experimental data from either short term or long term manipulative experiments are made available to the MCM-LTER investigators once the merger of the data submitted has been completed, shortly after the field season. The core monitoring data are made accessible to the scientific community two years after the end of the field season. For example, the Type 2 core data sets from the 1993-94, 1994-95, and 1995-96 seasons currently are available through the web. We also plan to continue to publish data reports which organize and synthesize discrete data sets for sites which are being monitored in the long term (e.g. Alger et al. 1997).

Information Management Services: Information distribution is largely handled through the World Wide Web. The MCM data manager is responsible for a bibliographic database of MCM-LTER publications. The present bibliographic tool lets PIs maintain the bibliography through a simple interface. The tool provides a web-based front end to a BibTeX-format bibliographic reference file. It consists of a set of HTML templates for data entry and update, backed by a set of CGI programs written in the C programming language, including one that performs a text search on all fields in the bibliographic file. Viewing, saving, printing, and/or searching part or all of the bibliography is available to the public, while editing the bibliography is password protected. The data manager is responsible for regular review of the site and will request that PIs update the site when necessary. Through the main MCM web site, the public is also able to view details of the overall project (information on PIs, project descriptions, etc.) and connect to related web sites at DRI, Colorado State University, INSTAAR, and the USGS.
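
The all-fields text search can be illustrated with a brief Python sketch. It is not the existing CGI program: it assumes the BibTeX file has already been parsed into one dictionary per entry, it assumes a case-insensitive substring match, and the entries shown are invented placeholders rather than MCM-LTER publications.

```python
def search_bibliography(entries, query):
    """Return entries in which any field contains the query (case-insensitive)."""
    needle = query.lower()
    return [
        entry for entry in entries
        if any(needle in str(value).lower() for value in entry.values())
    ]


# Placeholder entries for illustration only.
entries = [
    {"key": "example_a", "author": "Author, A.", "title": "Streamflow in a polar desert", "year": "1997"},
    {"key": "example_b", "author": "Writer, B.", "title": "Lake ice thickness", "year": "1996"},
    {"key": "example_c", "author": "Stream, C.", "title": "Soil biota", "year": "1997"},
]
hits = search_bibliography(entries, "stream")
for entry in hits:
    print(entry["key"], "-", entry["title"])
```

Because every field is searched, the query matches a title in one entry and an author name in another, which is the behavior the text search is described as providing.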