mapreader.classify.datasets

Module Contents

Classes

PatchDataset

An abstract class representing a Dataset.

PatchContextDataset

An abstract class representing a Dataset.

Attributes

parhugin_installed

mapreader.classify.datasets.parhugin_installed = True
class mapreader.classify.datasets.PatchDataset(patch_df, transform, delimiter=',', patch_paths_col='image_path', label_col=None, label_index_col=None, image_mode='RGB')

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
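
The map-style contract described above can be sketched without PyTorch. The class below is a hypothetical stand-in, not MapReader's implementation: `ToyPatchDataset`, its `records` list, and the dictionary keys are assumptions made for illustration. A real dataset would subclass torch.utils.data.Dataset and return image tensors rather than strings.

```python
class ToyPatchDataset:
    """Hypothetical map-style dataset: integer keys map to (path, label) samples."""

    def __init__(self, records, transform=None):
        # records: list of dicts with "image_path" and "label" keys (assumed layout).
        self.records = list(records)
        self.transform = transform

    def __len__(self):
        # Samplers and the default DataLoader options rely on this for the dataset size.
        return len(self.records)

    def __getitem__(self, idx):
        # Fetch one sample for an integer key and apply the transform if one was given.
        record = self.records[idx]
        path = record["image_path"]
        if self.transform is not None:
            path = self.transform(path)
        return path, record["label"]


toy = ToyPatchDataset(
    [
        {"image_path": "patch-0.png", "label": "rail"},
        {"image_path": "patch-1.png", "label": "no_rail"},
    ],
    transform=str.upper,  # stand-in for an image transform
)
```

Because the keys here are consecutive integers starting at 0, an index sampler that yields integral indices can address every sample; a dataset keyed by strings would need the custom sampler mentioned in the note.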

Parameters:
  • patch_df (pandas.DataFrame | str) –

  • transform (str | torchvision.transforms.Compose | Callable) –

  • delimiter (str) –

  • patch_paths_col (str | None) –

  • label_col (str | None) –

  • label_index_col (str | None) –

  • image_mode (str | None) –

return_orig_image(idx)

Return the original image associated with the given index.

Parameters:

idx (int or Tensor) – The index of the desired image, or a Tensor containing the index.

Returns:

The original image associated with the given index.

Return type:

PIL.Image.Image

Notes

This method returns the original image associated with the given index by loading the image file using the file path stored in the patch_paths_col column of the patch_df DataFrame at the given index. The loaded image is then converted to the format specified by the image_mode attribute of the object. The resulting PIL.Image.Image object is returned.

create_dataloaders(set_name='infer', batch_size=16, shuffle=False, num_workers=0, **kwargs)

Creates a dictionary containing a PyTorch dataloader.

Parameters:
  • set_name (str, optional) – The name to use for the dataloader. By default 'infer'.

  • batch_size (int, optional) – The batch size to use for the dataloader. By default 16.

  • shuffle (bool, optional) – Whether to shuffle the PatchDataset. By default False.

  • num_workers (int, optional) – The number of worker subprocesses to use for loading data. By default 0, which loads data in the main process.

  • **kwargs – Additional keyword arguments to pass to PyTorch’s DataLoader constructor.

Returns:

Dictionary containing dataloaders.

Return type:

Dict
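
As a rough illustration of the return shape, the sketch below builds a dictionary with a single entry named by set_name. It is a plain-Python stand-in, not the library's code: `make_loaders` is a hypothetical helper, a list of batches takes the place of a PyTorch DataLoader, and keying the dictionary by set_name is an assumption based on the parameter description above.

```python
import random


def make_loaders(samples, set_name="infer", batch_size=16, shuffle=False, seed=0):
    """Hypothetical stand-in: return {set_name: list_of_batches}."""
    order = list(range(len(samples)))
    if shuffle:
        # Shuffle sample order once, reproducibly, before batching.
        random.Random(seed).shuffle(order)
    # Slice the index order into consecutive batches; the last one may be smaller.
    batches = [
        [samples[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]
    return {set_name: batches}


loaders = make_loaders(list(range(40)), set_name="infer", batch_size=16)
batch_sizes = [len(batch) for batch in loaders["infer"]]
```

With 40 samples and a batch size of 16, the final batch holds the 8 leftover samples, which matches how DataLoader behaves when drop_last is left at its default of False.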

class mapreader.classify.datasets.PatchContextDataset(patch_df, total_df, transform, delimiter=',', patch_paths_col='image_path', label_col=None, label_index_col=None, image_mode='RGB', context_dir='./maps/maps_context', create_context=False, parent_path='./maps')

Bases: PatchDataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

Parameters:
  • patch_df (pandas.DataFrame | str) –

  • total_df (pandas.DataFrame | str) –

  • transform (str) –

  • delimiter (str) –

  • patch_paths_col (str | None) –

  • label_col (str | None) –

  • label_index_col (str | None) –

  • image_mode (str | None) –

  • context_dir (str | None) –

  • create_context (bool) –

  • parent_path (str | None) –

save_context(processors=10, sleep_time=0.001, use_parhugin=True, overwrite=False)

Save context images for all patches in the patch_df.

Parameters:
  • processors (int, optional) – The number of required processors for the job, by default 10.

  • sleep_time (float, optional) – The time to wait between jobs, by default 0.001.

  • use_parhugin (bool, optional) – Whether to use Parhugin to parallelize the job, by default True.

  • overwrite (bool, optional) – Whether to overwrite existing parent files, by default False.

Return type:

None

Notes

Parhugin is a Python package for parallelizing computations across multiple CPU cores. The method uses Parhugin to parallelize the computation of saving parent patches to disk. When Parhugin is installed and use_parhugin is set to True, the method parallelizes the calling of the get_context_id method and its corresponding arguments. If Parhugin is not installed or use_parhugin is set to False, the method executes the loop over patch indices sequentially instead.
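
The fallback between parallel and sequential execution described above can be sketched as follows. This is an illustration under stated assumptions, not the method's actual code: the standard library's ThreadPoolExecutor stands in for Parhugin, and `save_one`, `save_all`, and the `parallel_available` flag are hypothetical names (the flag plays the role of parhugin_installed).

```python
from concurrent.futures import ThreadPoolExecutor


def save_one(patch_id):
    # Placeholder for the per-patch work (hypothetical); returns a fake output name.
    return f"context-{patch_id}.png"


def save_all(patch_ids, use_parallel=True, parallel_available=True, workers=4):
    """Run save_one for every id, in parallel when allowed, otherwise in a plain loop."""
    if use_parallel and parallel_available:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(save_one, patch_ids))
    # Sequential fallback: same work, one patch at a time.
    return [save_one(patch_id) for patch_id in patch_ids]


parallel_result = save_all(range(3))
sequential_result = save_all(range(3), use_parallel=False)
```

Both paths call the same per-patch function, so switching use_parhugin off should change only how the work is scheduled, not what is written.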

get_context_id(id, overwrite=False, save_context=False, return_image=True)

Save the parents of a specific patch to the specified location.

Parameters:
  • id – Index of the patch in the dataset.

  • overwrite (bool, optional) – Whether to overwrite the existing parent files. Default is False.

  • save_context (bool, optional) – Whether to save the context image. Default is False.

  • return_image (bool, optional) – Whether to return the context image. Default is True.

Raises:

ValueError – If the patch is not found in the dataset.

Return type:

None

plot_sample(idx)

Plot a sample patch and its corresponding context from the dataset.

Parameters:

idx (int) – The index of the sample to plot.

Returns:

Displays the plot of the sample patch and its corresponding context.

Return type:

None

Notes

This method plots a sample patch and its corresponding context side-by-side in a single figure with two subplots. The figure size is set to 10 in × 5 in, and the titles of the subplots are set to “Patch” and “Context”, respectively. The resulting figure is displayed using the matplotlib library (required).