DataSplit Reference

DataSplit

DataSplits are collections of multiple DataSets, with each DataSet assigned to a specific role, e.g. training data, validation data, or testing data.

class dacapo.experiments.datasplits.DataSplit
class dacapo.experiments.datasplits.TrainValidateDataSplit(datasplit_config)

Configured with dacapo.experiments.datasplits.TrainValidateDataSplitConfig
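
As a rough illustration of the idea of assigning datasets to roles, here is a minimal sketch. ToyDataset and ToyTrainValidateSplit are hypothetical stand-ins written for this example and are not part of the dacapo API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyDataset:
    # Hypothetical stand-in for a DataSet: only a name is kept here.
    name: str


@dataclass
class ToyTrainValidateSplit:
    # Datasets grouped by the role they play in an experiment.
    train: List[ToyDataset] = field(default_factory=list)
    validate: List[ToyDataset] = field(default_factory=list)


split = ToyTrainValidateSplit(
    train=[ToyDataset("crop_a"), ToyDataset("crop_b")],
    validate=[ToyDataset("crop_c")],
)

train_names = [d.name for d in split.train]
validate_names = [d.name for d in split.validate]
```

Keeping one list per role makes it explicit which data a training loop may sample from and which data is reserved for validation.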

DataSet

DataSets define a spatial region containing the data needed for training, provided as multiple Arrays. This can include raw data, ground truth, and a mask, or just raw data in the case of self-supervised models.

ABC:

class dacapo.experiments.datasplits.datasets.Dataset

Implementations:

class dacapo.experiments.datasplits.datasets.RawGTDataset(dataset_config)

Configured with dacapo.experiments.datasplits.datasets.RawGTDatasetConfig
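
The following sketch illustrates a dataset in which only raw data is required and ground truth and mask are optional. ToyRawGTDataset and its available() helper are hypothetical names used for illustration; the real RawGTDataset is built from its config and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

import numpy as np


@dataclass
class ToyRawGTDataset:
    # Hypothetical stand-in: raw is required, gt and mask are optional.
    raw: np.ndarray
    gt: Optional[np.ndarray] = None
    mask: Optional[np.ndarray] = None

    def available(self) -> List[str]:
        """List the names of the arrays this dataset provides."""
        return [n for n in ("raw", "gt", "mask") if getattr(self, n) is not None]


supervised = ToyRawGTDataset(
    raw=np.zeros((4, 4), dtype=np.uint8),
    gt=np.zeros((4, 4), dtype=np.uint8),
    mask=np.ones((4, 4), dtype=np.uint8),
)
self_supervised = ToyRawGTDataset(raw=np.zeros((4, 4), dtype=np.uint8))
```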

Arrays

Arrays define the interface for a contiguous spatial region of data. This data can be raw, ground truth, a mask, or any other spatial data. Arrays can be a direct interface to some storage, e.g. a zarr/n5 container, tiff stack, or other data storage, or can be a wrapper modifying another array. This might include operations such as normalizing intensities for raw data, binarizing labels to generate a mask, or upsampling and downsampling. Providing these operations as wrappers around other arrays allows us to lazily fetch and transform the data we need consistently in different contexts such as training or validation.
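
The wrapper idea can be sketched with two small classes. StoredArray and ScaledArray are hypothetical and only mimic the pattern with plain numpy; they do not implement the Array interface documented below.

```python
import numpy as np


class StoredArray:
    """Stands in for a direct interface to storage (hypothetical)."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self.reads = 0  # counts how often data is fetched

    def __getitem__(self, key):
        self.reads += 1
        return self._values[key]


class ScaledArray:
    """Wrapper that transforms only the region that is requested."""

    def __init__(self, source, factor):
        self.source = source
        self.factor = factor

    def __getitem__(self, key):
        # Fetch from the source on demand, then apply the transformation.
        return self.source[key].astype(np.float32) * self.factor


stored = StoredArray(np.arange(10, dtype=np.uint8))
wrapped = ScaledArray(stored, factor=0.5)

reads_before = stored.reads  # creating the wrapper fetched nothing
window = wrapped[2:5]        # fetches and transforms three values only
```

Because the wrapper holds a reference to its source instead of a transformed copy, the same chain of wrappers can be reused for training and validation without precomputing anything.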

ABC:

class dacapo.experiments.datasplits.datasets.arrays.Array
abstract property attrs: Dict[str, Any]

Return a dictionary of metadata attributes stored on this array.

abstract property axes: List[str]

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples
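
To show how the permitted axis characters relate to dims and num_channels, here is a small sketch. describe_axes is a hypothetical helper written for this example, not a dacapo function.

```python
SPATIAL = set("zyx")
PERMITTED = SPATIAL | {"c", "s"}


def describe_axes(axes):
    """Return (number of spatial dims, whether a channel axis exists)."""
    unknown = [a for a in axes if a not in PERMITTED]
    if unknown:
        raise ValueError(f"unsupported axes: {unknown}")
    dims = sum(1 for a in axes if a in SPATIAL)
    return dims, "c" in axes


dims_3d, has_c_3d = describe_axes(["c", "z", "y", "x"])
dims_2d, has_c_2d = describe_axes(["y", "x"])
```

An array without a "c" axis is the case in which num_channels would be expected to return None.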

abstract property data: ndarray

Get a numpy like readable and writable view into this array.

abstract property dims: int

Returns the number of spatial dimensions.

abstract property dtype: Any

The dtype of this array, in numpy dtypes

abstract property num_channels: int | None

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

abstract property roi: Roi

The total ROI of this array, in world units.

abstract property voxel_size: Coordinate

The size of a voxel in physical units.
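
Since roi is given in world units and voxel_size is the physical size of one voxel, the shape in voxels follows by division. The sketch below uses plain tuples instead of the Roi and Coordinate types; shape_in_voxels and the example values are hypothetical.

```python
def shape_in_voxels(roi_shape, voxel_size):
    """Convert an ROI shape in world units to a shape in voxels."""
    if any(s % v != 0 for s, v in zip(roi_shape, voxel_size)):
        raise ValueError("ROI shape is not a multiple of the voxel size")
    return tuple(s // v for s, v in zip(roi_shape, voxel_size))


# Hypothetical values: a (400, 800, 800) region with (4, 8, 8) voxels.
voxels = shape_in_voxels((400, 800, 800), (4, 8, 8))
```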

abstract property writable: bool

Can we write to this Array?

Implementations:

class dacapo.experiments.datasplits.datasets.arrays.ZarrArray(array_config)

This is a zarr array

Configured with dacapo.experiments.datasplits.datasets.arrays.ZarrArrayConfig

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

classmethod create_from_array_identifier(array_identifier, axes, roi, num_channels, voxel_size, dtype, write_size=None, name=None)

Create a new ZarrArray given an array identifier. It is assumed that this array_identifier points to a dataset that does not yet exist

property data: Any

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype: Any

The dtype of this array, in numpy dtypes

property num_channels: int | None

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi

The total ROI of this array, in world units.

property voxel_size

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.BinarizeArray(array_config)

This is a wrapper around a ZarrArray containing uint annotations. Because we often want to predict classes that are a combination of a set of labels, we wrap a ZarrArray with the BinarizeArray and provide something like groupings=[("mito", [3,4,5])], where 4 corresponds to mito_membrane, 5 is mito_ribos, and 3 is everything else that is part of a mitochondrion. The BinarizeArray will simply combine labels 3, 4, and 5 into a single binary channel for the class "mito". We use a single channel per class because some classes may overlap. For example, you might have groupings=[("mito", [3,4,5]), ("membrane", [4, 8, 1])], where 4 is mito_membrane, 8 is er_membrane, and 1 is plasma_membrane. This gives a binary membrane channel that in some places overlaps with the mitochondria channel, because both include the mito membrane.

Configured with dacapo.experiments.datasplits.datasets.arrays.BinarizeArrayConfig
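
The grouping logic described above can be reproduced with plain numpy. This is a minimal sketch of the idea, not the BinarizeArray implementation; the binarize function and the small label array are made up for illustration.

```python
import numpy as np


def binarize(labels, groupings):
    """Return one binary channel per (name, ids) grouping."""
    channels = [np.isin(labels, ids) for _, ids in groupings]
    return np.stack(channels, axis=0).astype(np.uint8)


labels = np.array([[0, 3, 4],
                   [5, 8, 1]], dtype=np.uint8)
groupings = [("mito", [3, 4, 5]), ("membrane", [4, 8, 1])]

binary = binarize(labels, groupings)
# binary[0] marks mito voxels and binary[1] marks membrane voxels.
# Label 4 (mito_membrane) is set in both channels.
```

Using one channel per class, instead of a single relabeled volume, is what allows the voxel with label 4 to count as both "mito" and "membrane".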

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels: int

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi: Roi

The total ROI of this array, in world units.

property voxel_size: Coordinate

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.ResampledArray(array_config)

This is a wrapper around another array that resamples it, i.e. upsamples or downsamples the data.

Configured with dacapo.experiments.datasplits.datasets.arrays.ResampledArrayConfig
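
The description of Arrays mentions upsampling and downsampling as wrapper operations. The sketch below shows one simple way to do both with numpy, using nearest-neighbour repetition and strided slicing. Both functions are hypothetical, and the interpolation actually used by ResampledArray is not specified here.

```python
import numpy as np


def upsample_nearest(data, factor):
    """Repeat every voxel `factor` times along each axis."""
    for axis in range(data.ndim):
        data = np.repeat(data, factor, axis=axis)
    return data


def downsample_stride(data, factor):
    """Keep every `factor`-th voxel along each axis."""
    return data[tuple(slice(None, None, factor) for _ in range(data.ndim))]


coarse = np.array([[1, 2],
                   [3, 4]])
fine = upsample_nearest(coarse, 2)     # shape (4, 4)
restored = downsample_stride(fine, 2)  # shape (2, 2)
```

Upsampling by a factor of 2 halves the effective voxel size while the ROI in world units stays the same.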

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels: int

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi: Roi

The total ROI of this array, in world units.

property voxel_size: Coordinate

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.IntensitiesArray(array_config)

This is a wrapper around another array that normalizes intensities to the range (0, 1) and converts them to float32. Use this if your intensities are stored as uint8 or similar and you want your model to take floats as input.

Configured with dacapo.experiments.datasplits.datasets.arrays.IntensitiesArrayConfig
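
A minimal sketch of this kind of normalization with numpy follows. The function name and the min_value and max_value parameters are assumptions made for this example; they are not taken from IntensitiesArrayConfig.

```python
import numpy as np


def normalize_intensities(data, min_value, max_value):
    """Map [min_value, max_value] linearly onto [0, 1] as float32."""
    scaled = (data.astype(np.float32) - min_value) / (max_value - min_value)
    return scaled.astype(np.float32)


raw = np.array([0, 51, 255], dtype=np.uint8)
normalized = normalize_intensities(raw, min_value=0, max_value=255)
```

Converting to float32 before subtracting avoids the wrap-around that unsigned integer arithmetic would otherwise produce.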

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels: int

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi: Roi

The total ROI of this array, in world units.

property voxel_size: Coordinate

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.MissingAnnotationsMask(array_config)

This is a wrapper around a ZarrArray containing uint annotations. Complementary to the BinarizeArray class, where we convert labels into individual channels for training, we may find crops where a specific label is present but not annotated. In that case you might want to avoid training specific channels for specific training volumes. See the fibsem_tools package for the appropriate metadata format for indicating the presence of labels in your ground truth: https://github.com/janelia-cosem/fibsem-tools

Configured with dacapo.experiments.datasplits.datasets.arrays.MissingAnnotationsMaskConfig
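
One possible reading of this behaviour is sketched below: a channel is masked out for a volume when any label in its grouping is known to be present but not annotated. The function, the present_not_annotated argument, the label ids 16 and 17, and the masking rule itself are all assumptions for illustration; the actual metadata format is defined by fibsem_tools.

```python
import numpy as np


def missing_annotations_mask(shape, groupings, present_not_annotated):
    """Return one mask channel per grouping: 1 = train, 0 = skip."""
    mask = np.ones((len(groupings), *shape), dtype=np.uint8)
    for channel, (_, ids) in enumerate(groupings):
        # Assumed rule: skip the channel if any of its labels is unannotated.
        if any(label in present_not_annotated for label in ids):
            mask[channel] = 0
    return mask


groupings = [("mito", [3, 4, 5]), ("er", [16, 17])]
mask = missing_annotations_mask((2, 2), groupings, present_not_annotated={17})
```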

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels: int

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi: Roi

The total ROI of this array, in world units.

property voxel_size: Coordinate

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.OnesArray(array_config)

This is a wrapper around another source_array that simply provides ones with the same metadata as the source_array.

Configured with dacapo.experiments.datasplits.datasets.arrays.OnesArrayConfig
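
In numpy terms, the idea amounts to producing an array of ones with the shape of the source. The sketch below is a hypothetical illustration; in particular, the bool dtype is an assumption and not taken from OnesArray.

```python
import numpy as np


def ones_like_source(source):
    """Return ones with the same shape as `source` (dtype assumed bool)."""
    return np.ones(source.shape, dtype=bool)


source = np.array([[3, 0], [7, 9]], dtype=np.uint8)
ones = ones_like_source(source)
```

Such an array can serve, for example, as a mask that includes every voxel of the source.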

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi

The total ROI of this array, in world units.

property voxel_size

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.ConcatArray(array_config)

This is a wrapper around other source_arrays that concatenates them along the channel dimension.

Configured with dacapo.experiments.datasplits.datasets.arrays.ConcatArrayConfig
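
Concatenation along the channel dimension can be illustrated with numpy. The two source arrays below are hypothetical, and the channel axis is assumed to come first, as in axes (c, y, x).

```python
import numpy as np

# Two hypothetical single-channel sources with axes (c, y, x).
a = np.zeros((1, 2, 3), dtype=np.float32)
b = np.ones((1, 2, 3), dtype=np.float32)

# Axis 0 is the channel axis in this layout.
combined = np.concatenate([a, b], axis=0)
```

The spatial shape is unchanged and the channel counts of the sources add up.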

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi

The total ROI of this array, in world units.

property voxel_size

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.LogicalOrArray(array_config)

Configured with dacapo.experiments.datasplits.datasets.arrays.LogicalOrArrayConfig

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi: Roi

The total ROI of this array, in world units.

property voxel_size: Coordinate

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?

class dacapo.experiments.datasplits.datasets.arrays.CropArray(array_config)

Used to crop a larger array to a smaller array.

Configured with dacapo.experiments.datasplits.datasets.arrays.CropArrayConfig
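
Cropping by a region given in world units can be sketched as follows. crop_to_roi and its parameters are hypothetical and use plain tuples instead of the Roi and Coordinate types; the ROI is assumed to lie inside the source and to be aligned with the voxel grid.

```python
import numpy as np


def crop_to_roi(data, data_offset, voxel_size, roi_offset, roi_shape):
    """Slice `data` to the ROI given in world units."""
    slices = tuple(
        slice((o - d) // v, (o - d + s) // v)
        for d, v, o, s in zip(data_offset, voxel_size, roi_offset, roi_shape)
    )
    return data[slices]


# A 6 x 6 array covering world region [0, 24) x [0, 24) at 4 units per voxel.
data = np.arange(36).reshape(6, 6)
cropped = crop_to_roi(
    data,
    data_offset=(0, 0),
    voxel_size=(4, 4),
    roi_offset=(8, 4),
    roi_shape=(8, 12),
)
```

Here the ROI maps to rows 2 to 3 and columns 1 to 3 of the source array.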

property attrs

Return a dictionary of metadata attributes stored on this array.

property axes

Returns the axes of this dataset as a string of characters, as they are indexed. Permitted characters are:

  • zyx for spatial dimensions

  • c for channels

  • s for samples

property data

Get a numpy like readable and writable view into this array.

property dims: int

Returns the number of spatial dimensions.

property dtype

The dtype of this array, in numpy dtypes

property num_channels: int

The number of channels provided by this dataset. Should return None if the channel dimension doesn’t exist.

property roi: Roi

The total ROI of this array, in world units.

property voxel_size: Coordinate

The size of a voxel in physical units.

property writable: bool

Can we write to this Array?