edges-io¶
Module for reading EDGES data and working with EDGES databases.
This package implements all necessary functionality for reading EDGES data. Its two main concerns are:
Reading the various file formats required for EDGES data:
- VNA readings
- fastspec output
- thermistor readings
- field weather recordings
- field thermlog recordings
Verifying and exploring databases of measurements in a robust and reliable way.
Features¶
Some features currently implemented:
Verify a “calibration observation” quickly without reading any actual data, with a nice command-line tool: edges-io check.
Optionally apply various automatic _fixes_ to a calibration observation to bring it into line with the standard database layout.
Read acq, h5, mat and npz spectrum files seamlessly.
Read S1P files.
Verification of read data.
Intuitive class hierarchy so that any subset of an observation can be handled.
Read field-based weather and thermlog information.
Installation¶
Installation should be as simple as either one of the following:
$ pip install git+https://github.com/edges-collab/edges-io
or, if you would like to develop edges-io and use it too:
$ git clone https://github.com/edges-collab/edges-io
$ cd edges-io
$ pip install -e .[dev]
There are a few dependencies, which should be installed automatically when running either of the above commands. If you are using conda (which is recommended) then you can obtain a cleaner/faster install by doing the following:
$ conda create -n edges python=3
$ conda activate edges
$ conda install numpy scipy h5py
Then follow either of the installation instructions above.
Usage¶
You can use edges-io either as a library or a command-line tool. The library is self-documented, so you can look at the docstring of any of the available functions. We describe some basics of each approach here.
CLI¶
To run the checking tool, simply do:
$ edges-io check PATH
PATH should be the top-level directory of a calibration observation (i.e. a folder that has a sub-folder 25C/, which has subfolders Spectra/, Resistance/ and S11 etc.).
There are a few options you can use, for example changing the temperature of the observation, and enabling automatic fixes. The latter can be achieved simply with the --fix flag.
If you find that a particular kind of error happens regularly, make an issue so we can add the fix.
Library¶
The library is useful for gathering an entire observation and performing operations on its data. The library exposes a hierarchy of calibration objects, including base objects like a Spectrum, Resistance or S1P file, and container objects like Spectra or S11. An entire observation can be loaded as a CalibrationObservation, and it contains references to all children.
For example:
>>> from edges_io import io
>>> obs = io.CalibrationObservation("path_to_observation")
>>> print(obs.s11.path)
path_to_observation/25C/S11
>>> print(obs.spectra.ambient.path)
path_to_observation/25C/Spectra/Ambient_XXX.acq
>>> ambient_spectrum = obs.spectra.ambient.read()
See how edges-io is used in edges-cal for a more involved example.
Defining Observations¶
One of the main goals of edges-io is to make the definition of a “Calibration Observation” as clear, robust and error-free as possible. Many files go into any particular observation – spectra, resistance measurements and S11 measurements – all of which are required to form a calibration solution (which can then be applied independently to field data). This code provides a clear structure for how these files must be laid out in order for them to be read and used automatically. This is done in a formal sense in this document, but is also implemented within the code itself.
In the above document, the specification is laid out as formally as possible, and that document has the final word on what is allowable. However, this can mean it’s a bit hard to interpret, and so we here present a “simpler” guide to what constitutes a “Calibration Observation”.
A single natural “Observation” (see below for how to combine multiple observations into a single “virtual” observation) is a single directory with multiple files/subdirectories in it. That directory must be named under a certain convention that time-stamps it and gives some useful metadata of the observation (like which receiver number was measured). It is possible that in the future, a metadata file within the observation will specify most of this information, but it is also useful to have a unique label for the observation.
One question that is important in all of this definition is what to do when either 1) a file exists that shouldn’t be there, or 2) a file doesn’t exist that should be there. It may be tempting to overlook extra files that shouldn’t be there. However, they can be a source of error. For example, spectra can be split across multiple files, and we use a file pattern to find the files that should be read in. If an extra file that “shouldn’t be there” exists and the file pattern matches it, then errors can occur (even worse if the contents of that file are able to be read by the spectrum reader, but correspond to a different load or something of that sort, where the results will be wrong, but no error raised). Thus, when checking the integrity of an observation we flag extra files as errors, and require the user to fix them up. To make this a bit easier, and let those files stay in the directory (so we don’t lose potentially valuable information), one of a few extensions can be added to the extra file:
.old: for files that contain valid data that is superseded by newer measurements and should be ignored,
.invalid: for data that has something wrong with it (equipment broken, wrong input parameters, etc.),
.ignore: files to ignore for any other reason.
If the file does not have one of these extensions, and is not in the list of accepted files for the Observation, an error will be raised by the checker.
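The rule above can be sketched in a few lines. This is a minimal illustration, not the edges-io checker: the function name `classify_entries`, its arguments and the example file names are all hypothetical, and only the three extensions come from the text.

```python
from pathlib import Path

# Extensions named in the text above that mark a file as deliberately ignored.
IGNORE_EXTENSIONS = {".old", ".invalid", ".ignore"}


def classify_entries(names, accepted):
    """Split directory entries into accepted, ignored and unexpected names.

    `names` are the entries found on disk and `accepted` is the set of names
    the specification allows. Both arguments are hypothetical inputs for this
    sketch, not part of the edges-io API.
    """
    result = {"accepted": [], "ignored": [], "unexpected": []}
    for name in sorted(names):
        if Path(name).suffix in IGNORE_EXTENSIONS:
            result["ignored"].append(name)
        elif name in accepted:
            result["accepted"].append(name)
        else:
            # This is the case in which a checker would report an error.
            result["unexpected"].append(name)
    return result


# Made-up directory listing for illustration only.
found = ["Spectra", "Resistance", "S11", "Notes.txt", "scratch.dat", "S11_first_try.old"]
allowed = {"Spectra", "Resistance", "S11", "Notes.txt", "definition.yaml"}
report = classify_entries(found, allowed)
print(report["unexpected"])
```

Here `scratch.dat` would be flagged, while `S11_first_try.old` is kept on disk but skipped.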
On the other hand, if a file is missing that must be there, different things can happen in different situations. The default case is to treat this as a warning, which may be counter-intuitive (surely missing a required file should be an error?!). The reason for this is that the file may be supplemented by a different Observation. Perhaps this Observation is incomplete – maybe all the data that was taken was a single set of Spectra, which is supposed to complement a previous observation which had a full set of measurements. In this case, while the “natural” Observation is incomplete, it is not necessary to give an error, as long as a warning is given that it must be combined with another observation. Nevertheless, some combinations of files are required to have been taken in the same physical observation to ensure consistency (namely, S11 measurements for each standard in a given load). If particular standards are missing, an error will be raised.
These caveats should be kept in mind as we talk about “required” directories/files below. “Required” will mean that after combining all the observations that we want/need (see next section), we require this particular file.
Within the top-level observation directory are a number of directories denoting the ambient temperature at which the observation was taken. These will usually be 15C, 25C or 35C. Most newer observations are at 25C. One should never mix files between different ambient temperatures. Thus, in reality, an observation is contained within one of these folders, and in practice, the CalibrationObservation has its path attribute set to the temperature directory.
Inside this directory there can be up to two files, and exactly three folders. One of the files is a Notes.txt file which summarises human-readable notes about the observation (“we ran the ambient spectra first, but had a delay because of xxx…”). The other file is named definition.yaml and includes metadata about the observation in a specific format (this file also allows you to supplement the observation with other observations, but we’ll get to that later). Measurements/data like the male/female resistance should be put in here (until now they have been found somewhere and input manually by the analyst when doing the calibration, which is very risky and prone to error – they are properly part of the measurement itself, not a choice of the analyst).
The three folders are Spectra, Resistance and S11. Note that an observation must have all three of these (and nothing else, after combining observations).
Within Spectra exist a bunch of spectra taken over about 12-24 hours for each of four “calibration sources” in the lab: they are “Ambient”, “HotLoad”, “LongCableOpen” and “LongCableShorted” (often referred to by their simple aliases “ambient”, “hot_load”, “open” and “short” in the code). These spectra will be in either .acq or .h5 format, depending on the version of fastspec that took the measurements. Due to the way fastspec takes its data, each source may have multiple files for a single measurement (each integration is saved to a new line in the file, but a new file is created at particular local times each day). Thus, typically one would like to read in and concatenate _all_ the files for that load, to use all the data.
Beyond this, it is _possible_ that two fully separate “runs” for a given source/load will be made. In this case, an identifier for the “run number” is put into the filename. Only one run number is actually used to do any particular calibration. In practice, it is very rare to have more than one “useable” run number for any particular load. Typically, a second run is only taken if it is deemed necessary due to the first being invalid in some way. If this is the case, this should be noted in the Notes.txt and/or the definition.yaml.
The Resistance folder is almost exactly the same as the Spectra folder. Each of the sources is represented here again (with the same names), and the filename format is the same, except that the files themselves are all .csv. These measure the resistance readings of the sources, which are used to derive the physical temperatures of the loads (against which the spectra are calibrated). Again, each source is allowed to have multiple “runs” specified by their “run number”. However, again in practice it is very uncommon to have more than one usable run.
The S11 folder contains measurements of the reflection coefficients of the sources, along with the LNA itself and the internal switch. These are all made with a VNA, and each reading takes of order a minute. Thus, multiple readings of these measurements can be taken – and typically are taken. Inside the S11 folder exists a folder for each of the main loads (or sources), in which are measurements of the four standards (open, short, match and external). Each of these standards can be measured multiple times, and so each file has the format <standard-name><rep-num>.s1p, where rep-num goes from 1 to 99. However, each of the standards for a load is measured one after the other on the same connection (i.e. there is no disconnection between them, to avoid issues with different connection characteristics between the standards). Thus, one can’t choose to use repeat number 01 for open and repeat number 02 for short for the Ambient source. For a given source, all standards used must be of the same repeat number (but multiple runs can exist for the source).
Besides the S11’s of the sources, we also need measurements of the LNA reflections, and the internal switch. These exist in the folders ReceiverReadingXX and SwitchingStateXX respectively. Here the XX corresponds to what we call a “run” number, which corresponds to a complete re-measuring of the standards at different points in the observation process. An arbitrary number of these can be performed (up to 99), but only one is required.
In all cases, the default behaviour of edges-io is to use the last run number and repeat number available for any given measurement.
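To make the repeat-number constraint concrete, here is a small sketch that picks the highest repeat number for which all four standards are present in a load's S11 folder. The helper `latest_complete_repeat` is hypothetical and is not how edges-io is known to select files; it assumes only the `<standard-name><rep-num>.s1p` naming described above, and it treats "last available" as "last complete", which is an interpretation rather than documented behaviour.

```python
import re
from collections import defaultdict

# The four standards named in the text; every S1P file is "<standard-name><rep-num>.s1p".
STANDARDS = ("open", "short", "match", "external")
S1P_PATTERN = re.compile(r"^(?P<standard>[A-Za-z]+)(?P<rep>\d{1,2})\.s1p$")


def latest_complete_repeat(filenames):
    """Return the highest repeat number for which all four standards exist.

    This is a hypothetical helper for illustration, not an edges-io function.
    It returns None when no repeat number has a full set of standards.
    """
    standards_by_repeat = defaultdict(set)
    for filename in filenames:
        match = S1P_PATTERN.match(filename)
        if match is None:
            continue  # not an S1P measurement file
        standard = match.group("standard").lower()
        if standard in STANDARDS:
            standards_by_repeat[int(match.group("rep"))].add(standard)

    complete = [rep for rep, found in standards_by_repeat.items() if found == set(STANDARDS)]
    return max(complete) if complete else None


# Made-up listing of one load's S11 folder.
ambient_files = [
    "Open01.s1p", "Short01.s1p", "Match01.s1p", "External01.s1p",
    "Open02.s1p", "Short02.s1p",  # repeat 02 lacks Match and External
]
print(latest_complete_repeat(ambient_files))
```

Because repeat 02 is incomplete in this listing, the sketch falls back to repeat 01 rather than mixing repeat numbers across standards.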
Combining Multiple Observations¶
As of v0.4.0, CalibrationObservation objects no longer need to be defined fully by one directory containing all measurements. While that is still an option (and the easiest way to define a calibration observation), they can also be defined in a more sophisticated way internally or externally.
Internally, a definition.yaml file is allowed (and encouraged) which defines properties of the observation, and also has include and prefer keywords which are used to supplement or override any particular parts of the observation. For example include could point to the top-level of any other observation, which could then be used whenever the main observation lacks data. If this file exists, by default it is used to construct the full observation virtually. An incomplete example of such a definition file can be found here.
Externally, a different file format is used to explicitly define every single measurement file in an observation. This is supposed to be exhaustive and complete to make it unambiguous. An example can be found in the test-suite. One can use such a file to create a CalibrationObservation by using the CalibrationObservation.from_observation_yaml() function.
The way the code actually handles these “virtual” observations is essentially to create a temporary directory and make symlinks to all the files that are required. This virtual observation then looks and feels like a normal single observation, but is in fact patched together from various observations.
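The symlink approach can be sketched with the standard library alone. This is an assumption-laden illustration rather than the edges-io implementation: `build_virtual_observation` is a hypothetical name, it models only include-style filling of missing files (not prefer-style overriding), and it ignores the per-file rules of the specification.

```python
import tempfile
from pathlib import Path


def build_virtual_observation(primary, includes):
    """Symlink files from several observation directories into one temporary tree.

    Files in `primary` take precedence; each directory in `includes` only fills
    in relative paths that are still missing. Hypothetical sketch, not the
    edges-io implementation.
    """
    virtual_root = Path(tempfile.mkdtemp(prefix="virtual-obs-"))
    for source_root in [Path(primary), *map(Path, includes)]:
        for source in sorted(source_root.rglob("*")):
            if not source.is_file():
                continue
            target = virtual_root / source.relative_to(source_root)
            if target.exists():
                continue  # an earlier (preferred) observation already provides it
            target.parent.mkdir(parents=True, exist_ok=True)
            target.symlink_to(source.resolve())
    return virtual_root
```

The returned directory can then be treated like a single ordinary observation. In this sketch the caller is responsible for removing the temporary directory afterwards, for example with `shutil.rmtree`.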
Using the HDF5Object¶
edges-io contains a convenient HDF5Object class whose purpose is to make working with HDF5 files a bit more formal (and arguably more simple). By subclassing it, you can specify an exact file layout that can be verified on read, to ensure a file is in the correct format (not just HDF5, but that it has the correct data structures and groups and metadata).
Using such a class is meant to provide a very thin wrapper over the file. So, for instance if you have a file my_hdf5_format_file.h5, whose structure is defined by the class CustomH5Format, you can create an object like this:
>>> fl = CustomH5Format("my_hdf5_format_file.h5")
Directly on creation, the file is checked for compatibility, and an error is raised if it contains extraneous keys or lacks keys that it requires.
Once created, the fl variable now has operations which can “look into” the file and load its data. It supports lazy-loading, so doing:
>>> print(fl['dataset'].max())
will load the ‘dataset’ data, and get the maximum, but it will not keep the data in memory, and will not load any other datasets. If you have data in groups, you can easily do:
>>> print(fl['group']['dataset'].min())
To load the data into the object permanently use the .load method:
>>> fl.load('group')
In fact, doing this will load all data under ‘group’. If you just wanted to load “dataset” out of “group”:
>>> fl['group'].load('dataset')
An example of how to define a subclass of HDF5Object can be seen in the HDF5RawSpectrum class, which is used to define fastspec output files.
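The kind of layout verification described in this section (reporting extraneous keys and missing required keys) can be illustrated on plain nested mappings. The function `find_structure_errors` and the layout format below are assumptions made for this sketch; they do not reflect how HDF5Object declares or checks its structure. Since h5py groups expose a mapping-like interface, a similar function could in principle be applied to an open file, though that is not tested here.

```python
from collections.abc import Mapping


def find_structure_errors(content, required, prefix=""):
    """Compare a nested mapping against a required layout.

    `required` maps each key to None (a dataset) or to a nested dict (a group).
    Returns a list of messages for missing and extraneous keys. The function
    name and the layout format are assumptions made for this sketch.
    """
    errors = []
    for key in sorted(required.keys() - content.keys()):
        errors.append(f"missing key: {prefix}{key}")
    for key in sorted(content.keys() - required.keys()):
        errors.append(f"extraneous key: {prefix}{key}")
    for key in sorted(required.keys() & content.keys()):
        spec = required[key]
        if isinstance(spec, Mapping):
            if isinstance(content[key], Mapping):
                # Recurse into the group, extending the path prefix.
                errors.extend(find_structure_errors(content[key], spec, f"{prefix}{key}/"))
            else:
                errors.append(f"expected a group: {prefix}{key}")
    return errors


# A made-up layout with one top-level dataset and one group holding a dataset.
layout = {"dataset": None, "group": {"dataset": None}}
file_like = {"dataset": [1, 2, 3], "group": {"other": [4]}}
print(find_structure_errors(file_like, layout))
```

Collecting every problem into a list, rather than raising on the first one, makes it easy either to print a full report or to raise a single error that names all offending keys.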
How the code works in a bit more detail¶
For the sake of developers (let’s face it, most users of this particular repo should also be developers), we will try to explain in a little more detail how the code works here. This will focus on how the code treats the organization of a calibration observation, and how it performs checks and makes fixes.
The basic idea is that each directory, and each kind of file, is represented by a distinct class, describing that kind of thing. For example, the top-level directory (actually, the top-level plus the ambient temperature directory) is represented by the CalibrationObservation class, while the Spectra directory is represented by the Spectra class, and S1P files are represented by the S1P class.
All of these classes are subclasses either of _DataContainer (if it’s a folder) or _DataFile (if it’s a file). Each has a path attribute which points to its own path on disk. _DataFile classes are much simpler, and typically only know how to check their own filename for consistency with the specification, and how to read the data in that particular filetype (they know nothing about their parents).
_DataContainer classes know about their own path, but can also determine a list of files/subfolders they contain (they know nothing about their parents), and know how to map these files/folders onto their relevant defining classes. They are able to check their own path for consistency, ensure that all relevant sub-files exist, ensure that none extra exist, and recursively check the consistency of their sub-files and folders by calling their checking methods.
Each file and folder in the observation becomes a specific instance of one of these classes (there will be multiple S1P instances for all of the S11 measurements, and each may have a different name attribute to identify the standard it represents).
This top-down hierarchical structure is useful, and similar to the way Unix filesystems operate. However, it does mean that a particular instance is not necessarily unique: the “match” standard S11 will exist within all sources, and since each class doesn’t know its parent, the Ambient/Match01.s1p cannot be distinguished from the HotLoad/Match01.s1p. However, a method exists on the top-level CalibrationObservation which can match a particular input path to a unique sequence of instances which do uniquely define it (i.e. the first would be a sequence containing a LoadS11 class with name=Ambient and the second would contain a LoadS11 class with name=HotLoad).
Another thing to note about the setup is the difference between the classes and instances of those classes. Much of the functionality of the system is implemented just through the classes themselves – one does not need to make instances of the classes to perform the filesystem checks, for instance. In this case, the path is given to the check() method of the class, e.g. CalibrationObservation.check(path), which itself will call the check method of any of its children etc. This will never read any data, it will just check filename formats and contents of directories. However, one can make an instance of the CalibrationObservation, which will itself go and make instances of all its children, storing them in the top-level class in a nice hierarchical way, in which each of the children can be used independently. By default, when you create such an instance, it will first perform the full check that would have been performed (but in this case it should exit at the first error raised, and raise it as an error, rather than continuing and printing all errors). Notably, these instances can be used to read the data in the files themselves. The instance will also decide which files to use in the observation (i.e. which run numbers and repeat numbers).
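The class-level checking pattern described above can be illustrated with a stripped-down sketch. The classes `FileSpec`, `ContainerSpec` and their subclasses are hypothetical stand-ins, not the real _DataFile and _DataContainer classes, and the file names are made up. The point is only to show how a check classmethod can recurse through children and collect problems without creating instances or reading any data.

```python
import tempfile
from pathlib import Path


class FileSpec:
    """Stand-in for a file-type class: it only validates its own filename."""

    suffix = ""

    @classmethod
    def check(cls, path):
        path = Path(path)
        return [] if path.suffix == cls.suffix else [f"{path.name}: expected '{cls.suffix}'"]


class SpectrumFileSpec(FileSpec):
    suffix = ".acq"


class ContainerSpec:
    """Stand-in for a folder class: maps child names onto their classes."""

    children = {}      # required sub-folder name -> ContainerSpec subclass
    file_class = None  # class used to check any plain files in this folder

    @classmethod
    def check(cls, path):
        """Collect every problem under `path` without opening any file."""
        path = Path(path)
        errors = [f"missing: {path / name}" for name in cls.children if not (path / name).is_dir()]
        for entry in sorted(path.iterdir()):
            if entry.is_dir() and entry.name in cls.children:
                errors += cls.children[entry.name].check(entry)  # recurse into the child
            elif entry.is_file() and cls.file_class is not None:
                errors += cls.file_class.check(entry)
            else:
                errors.append(f"unexpected: {entry}")
        return errors


class SpectraSpec(ContainerSpec):
    file_class = SpectrumFileSpec


class ObservationSpec(ContainerSpec):
    children = {"Spectra": SpectraSpec}


# Build a tiny made-up observation tree and check it at class level.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Spectra").mkdir()
    (root / "Spectra" / "Ambient_01.acq").touch()
    (root / "Spectra" / "stray.txt").touch()
    problems = ObservationSpec.check(root)  # no instance is ever created

print(problems)
```

Note that each class here knows only about its own children and never about its parent, mirroring the top-down structure described earlier in this section.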
Note¶
This project has been set up using PyScaffold 3.2.3. For details and usage information on PyScaffold see https://pyscaffold.org/.