Geopsy: Database

From GeopsyWiki

Internal database structure

The Geopsy database is not built on an existing database engine such as MySQL. It consists only of a list of signals. A signal is a vector of numbers (64-bit double-precision floating-point real numbers) documented by a collection of information fields, typically extracted from file headers.

Information fields of a signal
Field name Type Description
AmplitudeUnit string The type of unit used to display amplitudes (linked to the type of sensor: accelerometer, velocimeter, ...). The physical unit display is active only for signals with a CountPerUnit different from 1.0. See also VoltPerUnit, CountPerUnit
AverageAmplitude double The average amplitude of the signal. The units depend upon the value of Count2Volt. Unlike the other fields, calculating this value requires the signal samples to be loaded into memory, so using this field may slow down Geopsy. We advise using it only if necessary. This value is read-only
Comments string Free space for user comments
Component string Name of the component, it must be one of the following keywords: Vertical, North or East
CountPerUnit double The conversion factor between 'count' and 'unit' defined by AmplitudeUnit. It is the product of CountPerVolt and VoltPerUnit. Read-only, alter its value by changing the intermediate factors
CountPerVolt double The conversion factor between 'count' and 'volt'. 'counts' are divided by this factor to obtain 'volts'. This value is read from the majority of signal files. By default this value is 1.0. When the factor is equal to unity, the amplitude scale of the signal is automatically considered as counts, else as volts. See also VoltPerUnit, VoltPerCount, CountPerUnit
SamplingPeriod double The sampling period expressed in seconds. This property, as well as SamplingFrequency, can be modified. However, make sure that it corresponds exactly to the true sampling rate of the recording
DupAverage[] double This parameter is useful for travel time tomography analysis. For phase picking (see TimePick), a travel time from source to receiver should be equivalent to a travel time from receiver to source observed on another signal. Couples of time picks are identified by a unique DupID. The average and the standard deviation for each couple are calculated in DupAverage and DupStdErr, respectively. This parameter is read-only
DupID integer This parameter is useful for travel-time tomography analysis. Identifies uniquely a couple of signals with the same ray path, where sources and receivers are just swapped. This parameter is read-only
DupStdErr[] double This parameter is useful for travel time tomography analysis. See DupAverage for details
DuplicateRays string 'true' if there are duplicate ray paths attached to this source-receiver path. This parameter is read-only
Duration double Time elapsed between the first and the last sample of the signal (in seconds). This parameter is not saved in the signal structure but calculated from SamplingPeriod and NSamples. See T0 for acceptable formats
EndTime double Time of the last sample of the signal (in seconds), counted from the time reference like T0. This parameter is not saved in the signal structure but calculated from Duration and T0. See T0 for acceptable formats
FileFormat string The name of the file format. This parameter is read-only
FileName string The complete name of the signal file to which the signal belongs, including its path. This parameter is read-only
FileNumber integer The number assigned to the signal file to which the signal belongs. This number depends upon the order in which files are loaded into the database. This parameter is read-only
HeaderModified string 'true' if the header was modified. When closing the current database, if at least one signal has this field as true, the user is warned to save the database. This value is read-only
Id integer Unique number to reference the signal, used by groups. This value is read-only
IsOriginalFile string Contains "Original" if the signal samples come from an imported file. Otherwise, the field contains "processed". See saving a database for details. This value is read-only
MatlabVariableName string For signals loaded from a Matlab file, this field contains the name of the Matlab variable.
MaximumAmplitude double The maximum amplitude reached by the signal over the whole Duration. The units depend upon the value of Count2Volt. Unlike the other fields, calculating this value requires the signal samples to be loaded into memory, so using this field may slow down Geopsy. We advise using it only if necessary. This value is read-only
Name string Arbitrary name to identify the signal, usually it is set to the name of the recording station
NSamples integer The number of samples in the signal. This value is read-only.
NumberInFile integer A signal file may contain several signals. This parameter is the index of this signal in its file. This parameter is read-only.
OriginalFileName string The original file name as recorded by the acquisition device.
Pointer integer Hexadecimal address of internal signal structure (debugging only). This parameter is read-only.
ReceiverX, ReceiverY, ReceiverZ double The coordinates of the receiver where the signal was recorded (Cartesian system expressed in metres)
SamplingFrequency double The sampling frequency expressed in Hz. This parameter is not saved in the structure but calculated from SamplingPeriod. You can modify it; SamplingPeriod is changed accordingly
SampleSize integer Memory size occupied by sample vector. Read-only
ShortFileName string The original file name without its complete path. Read-only
SourceAzimuth double The azimuth from the source. The coordinates of the source for which the signal was recorded have to be given. Read-only
SourceDistance double The distance to the source. The coordinates of the source for which the signal was recorded have to be given. Read-only
SourceRoughAzimuth double Returns exactly the same value as SourceAzimuth. The coordinates of the source for which the signal was recorded have to be given. Read-only. The main interest of this field is for sorting signals along a line and identifying source-receiver directions.
SourceX, SourceY, SourceZ double The coordinates of the source for which the signal was recorded (Cartesian system expressed in metres). These fields are relevant to records where the source is clearly identified. They are generally useful for refraction and travel time tomography analysis
StackCount integer The number of recorded stacks.
T0 double The delay (in seconds) between the time reference and the first sample of the signal. It can be either positive or negative. Accepted format: "XwXdXhXmXs", for weeks, days, hours, minutes and seconds. Any of these parts can be omitted. Without unit specification, seconds are assumed. A '-' sign can be added as a prefix
TimePick[] double It is a time value that can be modified by the user, either by editing the field or by picking phases (with the mouse) on a graphic representation of the signal. This parameter is useful for travel time tomography analysis, to define time limits of a taper or of a signal cut, or for any processing that requires phase picking
TimeReference string It is the time reference with the format "DD/MM/YYYY hh:mm:ss". All signals recorded synchronously must have the same time reference. The T0 takes the distinct start-up times into account with an arbitrary precision in the time scale (time reference is limited to seconds). A good practice is to set the time reference to the day of acquisition and at midnight (19/05/2005 00:00:00). All T0 are then the number of seconds since the beginning of the day
Type char It is a single character that records the current type of signal: 'w' for waveforms, 's' for frequency spectra, and 't' for arrival time without signal. The type is read-only, you cannot modify it directly. Conversion from 'w' to 's' and vice-versa is done after a Fourier transform
UnitPerCount double Inverse of CountPerUnit. Read-only
UnitPerVolt double Inverse of VoltPerUnit
VoltPerCount double Inverse of CountPerVolt
VoltPerUnit double Conversion factor from 'volt' to any arbitrary unit, usually m/s (velocimeters) or m/s^2 (accelerometers). The unit type and the factor value are not found in most signal file headers. Hence, if you want to plot signals with a specific unit and a corrected amplitude, you must edit this value manually or through a script. See also CountPerVolt, UnitPerVolt, CountPerUnit, AmplitudeUnit

"double" means double floating point real numbers coded on 64 bits, "integer" means a positive or negative integer, "string" means any string of characters, and "char" is a single character. Fields followed by '[]' are vectors and they accept any index as argument. Fields marked in bold represent the most important parameters that must be correctly defined to allow visualization of signals.

When a new signal file is loaded into the database, a new memory structure is allocated for the signal and the fields listed above are filled in from the information contained in the file header. The information extracted from the file header depends upon the file format.
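Two of the field conventions listed in the table can be illustrated with a short Python sketch: the "XwXdXhXmXs" notation accepted for T0, and the conversion of raw counts into physical units through CountPerVolt and VoltPerUnit. This is not Geopsy's own code. The function names and the numeric factors are hypothetical, and accepting fractional values in the T0 notation is an assumption.

```python
import re

# Seconds per unit letter of the "XwXdXhXmXs" notation described for T0.
_UNIT_SECONDS = {"w": 604800.0, "d": 86400.0, "h": 3600.0, "m": 60.0, "s": 1.0}

_NUMBER = r"(\d+(?:\.\d+)?)"  # assumption: fractional values are accepted
_T0_PATTERN = re.compile(
    r"^(-?)"
    rf"(?:{_NUMBER}w)?(?:{_NUMBER}d)?(?:{_NUMBER}h)?(?:{_NUMBER}m)?"
    rf"(?:{_NUMBER}s?)?$"  # a trailing number without unit means seconds
)


def parse_t0(text):
    """Convert a T0 string such as '8h19m55s' or '-90' into seconds."""
    match = _T0_PATTERN.match(text.strip())
    if match is None or not any(match.groups()[1:]):
        raise ValueError(f"invalid T0 value: {text!r}")
    sign = -1.0 if match.group(1) else 1.0
    total = 0.0
    for value, unit in zip(match.groups()[1:], "wdhms"):
        if value is not None:
            total += float(value) * _UNIT_SECONDS[unit]
    return sign * total


def counts_to_units(samples, count_per_volt=1.0, volt_per_unit=1.0):
    """Divide raw counts by CountPerUnit = CountPerVolt * VoltPerUnit."""
    count_per_unit = count_per_volt * volt_per_unit
    return [sample / count_per_unit for sample in samples]


print(parse_t0("8h19m55s"))  # 29995.0
print(parse_t0("-1m30s"))    # -90.0

# Hypothetical factors: 1000 counts per volt, 2 volts per (m/s).
print(counts_to_units([0.0, 1000.0, -500.0], 1000.0, 2.0))  # [0.0, 0.5, -0.25]
```

With both factors left at their default of 1.0, the samples are returned unchanged, which matches the rule that a unit factor means the amplitude scale is read as counts.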

The signal samples are never read directly when a file is opened, which greatly speeds up signal handling. Depending on the user's actions (e.g. visualization of traces), it may become necessary to load the samples into memory. The Geopsy core engine (library geopsycore) includes a cache mechanism to handle long signal vectors efficiently: signals are kept in memory as long as possible until no space is left, then purged according to the space needed. From the user's point of view, the first visualization of a signal may therefore be slower than any later access.
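The lazy loading and purging behaviour described above can be sketched in Python as a cache with a byte budget. This only illustrates the general idea. The actual purge policy of geopsycore is not documented here, so the least-recently-used eviction, the class name SampleCache and the loader callbacks are all assumptions.

```python
from collections import OrderedDict


class SampleCache:
    """Keep sample vectors in memory until a byte budget is exceeded."""

    BYTES_PER_SAMPLE = 8  # samples are 64-bit doubles

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # signal id -> samples, oldest first

    def get(self, signal_id, loader):
        """Return the samples of a signal, calling loader() only on a miss."""
        if signal_id in self._entries:
            self._entries.move_to_end(signal_id)  # mark as recently used
            return self._entries[signal_id]
        samples = loader()  # slow path: read the samples from the file
        needed = self.BYTES_PER_SAMPLE * len(samples)
        # Purge the least recently used signals until the new one fits.
        while self._entries and self.used_bytes + needed > self.capacity_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= self.BYTES_PER_SAMPLE * len(evicted)
        self._entries[signal_id] = samples
        self.used_bytes += needed
        return samples

    def __contains__(self, signal_id):
        return signal_id in self._entries


loads = []  # records which signals were actually read from "disk"


def make_loader(signal_id, n_samples):
    def load():
        loads.append(signal_id)
        return [0.0] * n_samples
    return load


cache = SampleCache(capacity_bytes=8 * 250)  # room for 250 samples
cache.get(1, make_loader(1, 100))  # first access: loaded
cache.get(1, make_loader(1, 100))  # second access: served from memory
cache.get(2, make_loader(2, 100))
cache.get(3, make_loader(3, 100))  # no room left: signal 1 is purged
print(loads)       # [1, 2, 3]
print(1 in cache)  # False
```

The second access to signal 1 does not call its loader, which corresponds to the faster later accesses noticed by the user.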

Any subset of the total ensemble of signals can be created. The information is never duplicated because subsets are defined by pointers to the original signal structures. The subsets are visualized through signal viewers of various types: tables, graphics, maps or chronograms.
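As a rough analogy for subsets defined by pointers, the Python sketch below builds a subset that holds references to the same signal objects as the full list. The Signal class, its fields and the station names are hypothetical and much simpler than the real internal structure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Signal:
    """Hypothetical, heavily simplified signal structure."""
    id: int
    name: str
    component: str
    samples: List[float] = field(default_factory=list)


database = [
    Signal(1, "WAU01", "East"),
    Signal(2, "WAU01", "North"),
    Signal(3, "WAU01", "Vertical"),
]

# The subset stores references to the existing objects, not copies.
vertical = [s for s in database if s.component == "Vertical"]

# Editing a header field through the subset changes the one shared structure.
vertical[0].name = "WAU01_renamed"
print(database[2].name)            # WAU01_renamed
print(vertical[0] is database[2])  # True
```

Because nothing is copied, any number of subsets can be created over a large database at little memory cost.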

Why create a database on disk?

The various signal file formats available in seismology and geophysical prospecting generally include a header which contains heterogeneous information. There was a need to store, in a uniform format, the basic information useful for the data processing implemented in Geopsy (e.g. picks of events, source and receiver coordinates, ...).

Some signal file formats can store several signals in a single file, others cannot. Signal processing operations, such as array computations, may be applied to only a part of a file or to signals located in several files. There was a need for grouping signals independently of the original file organization, which is usually driven by the acquisition method. Exporting signals of interest to a temporary file before processing is not satisfactory because it duplicates the data on disk and the file conversion may alter information. Furthermore, confusion is likely to occur between true original signals and pre-processed signals (e.g. filtering, DC removal, ...).

Geopsy proposes an alternative with the concept of groups. A group is a list of signal IDs (identification numbers). Each group is given a name which explains its contents. The IDs are automatically assigned to each signal when importing files into Geopsy. Hence, the assignment depends upon the order in which files are loaded. Proper database storage ensures that each ID corresponds to a unique and well-defined signal.

External signal processing tools (e.g. command-line tools) can access a Geopsy database to retrieve signals of interest. The ensemble of signals is generally referenced by the name of a group previously created in Geopsy's main frame. The command-line tools have access to the signal samples without having to care about the original file format. The Geopsy core engine handles all file access and memory allocations to ease the development of processing tools based on signals.

Each time a command-line tool is started with access to a Geopsy database, all the header information is loaded into memory. The original signal files are not accessed, which ensures a very quick start-up of any database even if it contains many signals (e.g. a start-up time of 7 seconds was measured for a database of 33360 signals).

Database file structure

Former versions of Geopsy were based on a directory structure. This is no longer the case. All the information is stored in a single file with the extension gpy. It is a compressed XML file, like many other file types shared by other Geopsy.org software.

Example of database structure (contents.xml):

<SignalDB>
  <version>3</version>
  <File>
    <name>/home/mwathele/projets/Documentation/DOCWORKSHOP/RING01_SHORT/WA.WAU01..HHE.D.2010.056.000</name>
    <format>MiniSeed</format>
    <original>true</original>
    <Signal>
      <ID>1</ID>
      <Name>WA_WAU01</Name>
      <Component>East</Component>
      <T0>29995</T0>
      <TimeReference>25/02/2010 00:00:00</TimeReference>
      <DeltaT>0.010000000000000000208</DeltaT>
      <Type>Waveform</Type>
      <NSamples>93974</NSamples>
      <CountPerVolt>1</CountPerVolt>
      <VoltPerUnit>1</VoltPerUnit>
      <AmplitudeUnit>Velocity</AmplitudeUnit>
      <NumberInFile>0</NumberInFile>
      <Receiver>0.940537635798569 -0.0688222621658598 43.643929</Receiver>
      <OffsetInFile>0</OffsetInFile>
      <ByteIncrement>0</ByteIncrement>
      <MiniSeedRecords>0,512,512,512,1024,512,1536,... </MiniSeedRecords>
    </Signal>
    ...
  </File>
  ...
  <Group>
    <folder>true</folder>
    <name>/</name>
    <comments>Root folder of all groups</comments>
    <Group>
      <folder>false</folder>
      <name>vertical</name>
      <comments></comments>
      <ids>9 6 18 15 12 21 3 24 </ids>
    </Group>
    <Group>
      <folder>false</folder>
      <name>All 3C</name>
      <comments></comments>
      <ids>9 6 18 15 12 21 3 24 14 17 2 20 5 8 11 23 19 22 13 4 16 7 1 10 </ids>
    </Group>
  </Group>
</SignalDB>
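The structure shown above can be read with any XML parser. The following Python sketch uses the standard xml.etree.ElementTree module on an abridged copy of the example. It assumes that contents.xml has already been extracted from the gpy file (the extraction step is not covered here). The function names and the "east" group are hypothetical, and only a few header fields are read.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example above; the "east" group is hypothetical.
SAMPLE = """<SignalDB>
  <version>3</version>
  <File>
    <name>WA.WAU01..HHE.D.2010.056.000</name>
    <format>MiniSeed</format>
    <Signal>
      <ID>1</ID>
      <Name>WA_WAU01</Name>
      <Component>East</Component>
      <T0>29995</T0>
      <DeltaT>0.01</DeltaT>
      <NSamples>93974</NSamples>
    </Signal>
  </File>
  <Group>
    <folder>true</folder>
    <name>/</name>
    <Group>
      <folder>false</folder>
      <name>east</name>
      <ids>1 </ids>
    </Group>
  </Group>
</SignalDB>"""


def read_signals(root):
    """Map each signal ID to a few of its header fields."""
    signals = {}
    for file_elem in root.iter("File"):
        for sig in file_elem.iter("Signal"):
            signals[int(sig.findtext("ID"))] = {
                "name": sig.findtext("Name"),
                "component": sig.findtext("Component"),
                "file": file_elem.findtext("name"),
                "sampling_period": float(sig.findtext("DeltaT")),
                "n_samples": int(sig.findtext("NSamples")),
            }
    return signals


def read_groups(folder_elem, path=""):
    """Map the full path of each non-folder group to its list of IDs."""
    groups = {}
    for child in folder_elem.findall("Group"):
        full_path = f"{path}/{child.findtext('name')}"
        if child.findtext("folder") == "true":
            groups.update(read_groups(child, full_path))  # nested folder
        else:
            ids = (child.findtext("ids") or "").split()
            groups[full_path] = [int(i) for i in ids]
    return groups


root = ET.fromstring(SAMPLE)
signals = read_signals(root)
groups = read_groups(root.find("Group"))  # start from the root folder "/"

for path, ids in groups.items():
    print(path, [signals[i]["name"] for i in ids])  # /east ['WA_WAU01']
```

Resolving a group therefore only requires the ID list and the signal headers. The original signal files are not opened at this stage.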