Mammogram Deidentification

`kskit.dicom.deid_mammogram`

This module is a mammogram deidentification toolbox.

This module contains functions related to deidentification of mammograms. It fulfills the following purposes:

deidentifying mammogram images
deidentifying mammogram metadata

Deidentification Functionalities
Image Deidentification based on OCR
Attributes/Metadata Deidentification based on a Recipe

Image Deidentification

`deidentify_image_png(infile, outdir, filename)`

Deidentify a mammogram image and write it to outdir as filename.png.

This function invokes the OCR reader to find all potential words on a mammogram image. It then hides every word found by covering it in black.

Parameters:

Name	Type	Description	Default
`infile`	`str`	The path of the DICOM file to deidentify.	required
`outdir`	`str`	The path of the directory that will store the output.	required
`filename`	`str`	The name of the resulting PNG file, without the file extension.	required

`get_PIL_image(dataset)`

Get an Image object from the Python Imaging Library (PIL).

Get the image from the pydicom dataset and convert it from a numpy.ndarray to a PIL image object. If available, the function will use metadata information contained inside the pydicom dataset for the conversion.

Parameters:

Name	Type	Description	Default
`dataset`	`Dataset`	A pydicom dataset which can be obtained from a DICOM file.	required

Returns:

Name	Type	Description
`Image`	`Image`	A PIL image object.

Example

get_PIL_image.py
from kskit.dicom.deid_mammogram import get_PIL_image
import pydicom

ds = pydicom.dcmread("my-mammogram.dcm")
img = get_PIL_image(ds)
img.show()

`get_text_areas(pixels, languages=['fr'])`

Read and return words of an image.

This function takes a pixel array as input and submits it to the easyOCR Reader, which returns a list of the words it found. The function implicitly removes authorized words from that list.

Parameters:

Name	Type	Description	Default
`pixels`	`ndarray`	An array representing an image.	required
`languages`	`list`	A list of languages for the OCR Reader to recognize. This makes it possible to submit images with text written in different languages.	`['fr']`

Returns:

Name	Type	Description
`list`	`list`	A list of words detected on the submitted image.

Info

The list of available languages can be found in the easyOCR documentation.

`remove_authorized_words_from(ocr_data)`

Remove authorized words from the ocr_data list.

This function removes authorized words from the easyOCR output. It is useful if you want to keep some text on your image, such as image laterality information (RMLO, LCC, OBLIQUE G...).

Parameters:

Name	Type	Description	Default
`ocr_data`	`list`	A list of words and coordinates obtained after submitting an image to easyOCR Reader.	required

Returns:

Type	Description
`list`	The same list of words and coordinates, without the entries for authorized words.
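
As a rough illustration of the filtering described above, the sketch below assumes that the easyOCR output is a list of `(bounding_box, text, confidence)` tuples, which is what `Reader.readtext` returns by default. The `AUTHORIZED_WORDS` set and the `filter_authorized` helper are hypothetical names introduced for this example; the allow-list that kskit actually uses is not shown on this page.

```python
# Hypothetical allow-list, built from the examples given above.
# The list used by kskit itself may differ.
AUTHORIZED_WORDS = {"RMLO", "LCC", "OBLIQUE G"}


def filter_authorized(ocr_data, authorized=AUTHORIZED_WORDS):
    """Return the OCR entries whose text is not in the allow-list."""
    kept = []
    for bbox, text, confidence in ocr_data:
        # Compare case-insensitively and ignore surrounding whitespace.
        if text.strip().upper() not in authorized:
            kept.append((bbox, text, confidence))
    return kept


ocr_data = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "DUPONT MARIE", 0.93),
    ([[400, 10], [460, 10], [460, 30], [400, 30]], "RMLO", 0.98),
    ([[10, 40], [120, 40], [120, 60], [10, 60]], "01/02/1960", 0.88),
]
remaining = filter_authorized(ocr_data)
print([text for _, text, _ in remaining])
```

Only the entries that remain in the list are later censored, so the laterality marker stays visible on the image.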

`hide_text(pixels, ocr_data, color_value='black', mode='rectangle')`

Censor the text present in the pixel array representing an image.

Takes the input image and draws shapes with the PIL package to censor OCR-detected words.

Parameters:

Name	Type	Description	Default
`pixels`	`ndarray`	A pixel array representing an image.	required
`ocr_data`	`list`	A list of words and coordinates obtained by easyOCR Reader after submitting an image.	required
`color_value`	`str`	A string indicating the color of the rectangle used for censoring information (`white` or `black`)	`'black'`
`mode`	`str`	A string indicating the method for censoring information. (`blur` or `rectangle`)	`'rectangle'`

Returns:

Type	Description
`ndarray`	The deidentified pixels array.
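
The rectangle mode can be pictured with the minimal sketch below. It is not the kskit implementation: plain nested lists stand in for the numpy array, the hypothetical `black_out` helper writes pixel values directly instead of drawing with PIL, and each OCR entry is assumed to carry four `[x, y]` corners as easyOCR returns them.

```python
def black_out(pixels, ocr_data, fill=0):
    """Return a copy of a grayscale image with each OCR box filled with `fill`.

    `pixels` is a list of rows (a stand-in for a 2-D numpy array) and each
    OCR entry is (four [x, y] corners, text, confidence).
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    censored = [row[:] for row in pixels]  # leave the input untouched
    for bbox, _text, _confidence in ocr_data:
        xs = [int(x) for x, _ in bbox]
        ys = [int(y) for _, y in bbox]
        # Axis-aligned rectangle enclosing the (possibly skewed) box,
        # clipped to the image bounds.
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                censored[y][x] = fill
    return censored


image = [[255] * 6 for _ in range(4)]  # 6x4 all-white image
ocr_data = [([[1, 1], [3, 1], [3, 2], [1, 2]], "DUPONT", 0.9)]
result = black_out(image, ocr_data)
for row in result:
    print(row)
```

Using the enclosing axis-aligned rectangle errs on the side of hiding slightly more than the detected box, which is usually the safer choice for deidentification.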

Attributes Deidentification

`deidentify_attributes(indir, outdir, org_root, erase_outdir=True)`

Produce a Pandas dataframe with deidentified information from a folder of DICOM files.

This function creates a Pandas dataframe from all files present in the indir folder. Then, it loads the deidentification recipe and iterates through the dataframe to deidentify its content. Finally, it returns the deidentified dataframe object.

It also takes outdir and erase_outdir arguments for handling output directory auto-cleaning in the context of a data pipeline. If you're not interested in auto-cleaning your output directory, simply specify outdir and set erase_outdir to False.
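
To make the effect of erase_outdir concrete, here is a small sketch of what emptying an output directory before a run could look like. The `prepare_outdir` helper is hypothetical and only illustrates the idea; how kskit performs the cleaning internally is not documented here.

```python
import os
import shutil
import tempfile


def prepare_outdir(outdir, erase_outdir=True):
    """Create `outdir` if needed and optionally remove everything inside it."""
    os.makedirs(outdir, exist_ok=True)
    if not erase_outdir:
        return
    for name in os.listdir(outdir):
        path = os.path.join(outdir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)
        else:
            os.remove(path)


# Demonstration in a throwaway directory.
outdir = tempfile.mkdtemp()
open(os.path.join(outdir, "stale.png"), "w").close()
prepare_outdir(outdir, erase_outdir=False)
kept = os.listdir(outdir)       # the stale file is still there
prepare_outdir(outdir, erase_outdir=True)
emptied = os.listdir(outdir)    # the directory is now empty
print(kept, emptied)
```

Because the default is True, pointing outdir at a directory that holds files you care about would delete them under this interpretation, so it is worth double-checking the path.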

Parameters:

Name	Type	Description	Default
`indir`	`str`	The input directory (DICOM files to deidentify)	required
`outdir`	`str`	The output directory (deidentified/resulting files)	required
`org_root`	`str`	An organization root identifier for deidentifying DICOM UIDs.	required
`erase_outdir`	`bool`	Empty the output directory if True	`True`

Returns:

Type	Description
`DataFrame`	A Pandas dataframe containing all metadata/attributes information.
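
The iteration over recipe rules can be sketched as follows. This is an illustration under stated assumptions rather than the kskit code: a plain `{tag: value}` dict stands in for one dataframe row, the recipe has the `{tag: [name, VR, action]}` shape shown in the `load_recipe()` example on this page, and the way each action transforms a value (`apply_action`, `deidentify_record`) is hypothetical. In particular, the hex digest used for PSEUDONYMISER is a generic stand-in and would not be a valid replacement for a DICOM UID.

```python
import hashlib


def apply_action(value, action):
    """Apply one recipe action to a single attribute value.

    Returns a (keep, new_value) pair. The handling of each action is an
    assumption made for this sketch, not kskit's implementation.
    """
    if action == "CONSERVER":       # keep the value unchanged
        return True, value
    if action == "EFFACER":         # keep the attribute, blank the value
        return True, ""
    if action == "PSEUDONYMISER":   # replace the value with a stable digest
        digest = hashlib.sha256(str(value).encode("utf-8")).hexdigest()
        return True, digest[:16]
    return False, None              # RETIRER or anything unknown: drop it


def deidentify_record(record, recipe):
    """Deidentify a {tag: value} dict using a {tag: [name, VR, action]} recipe."""
    cleaned = {}
    for tag, value in record.items():
        # Tags missing from the recipe are removed.
        action = recipe[tag][2] if tag in recipe else "RETIRER"
        keep, new_value = apply_action(value, action)
        if keep:
            cleaned[tag] = new_value
    return cleaned


recipe = {
    "0x00020000": ["FileMetaInformationGroupLength", "UL", "CONSERVER"],
    "0x00209161": ["ConcatenationUID", "UI", "PSEUDONYMISER"],
    "0x00100010": ["PatientName", "PN", "EFFACER"],  # hypothetical rule
}
record = {
    "0x00020000": 204,
    "0x00209161": "1.123.123.1234.123456.12345678",
    "0x00100010": "DUPONT^MARIE",
    "0x00100020": "12345",  # PatientID, absent from the recipe
}
cleaned = deidentify_record(record, recipe)
print(sorted(cleaned))
```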

Info

org_root refers to a prefix used for deidentifying DICOM UIDs. This prefix has to be unique to your organization.

For more information, see the NEMA DICOM Standards documentation.
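
One common way to build a replacement UID under an organization root is to hash the original UID and append the result, as decimal digits, to the prefix. The sketch below shows that idea only; the hypothetical `pseudonymize_uid` helper and the choice of SHA-256 are assumptions, and the algorithm kskit actually uses may differ. It relies on two DICOM rules: a UID is at most 64 characters long, and a numeric component must not start with a zero unless it is exactly "0".

```python
import hashlib

MAX_UID_LENGTH = 64  # maximum length of a DICOM UID, in characters


def pseudonymize_uid(uid, org_root):
    """Derive a new UID under `org_root` from a hash of the original UID."""
    digest = hashlib.sha256(uid.encode("ascii")).hexdigest()
    suffix = str(int(digest, 16))  # decimal digits only, no leading zero
    # Truncate the suffix so that "<org_root>.<suffix>" fits in 64 characters.
    room = MAX_UID_LENGTH - len(org_root) - 1
    return f"{org_root}.{suffix[:room]}"


new_uid = pseudonymize_uid("1.123.123.1234.123456.12345678", "9.9.9.9.9")
print(new_uid)
```

The mapping is deterministic, so the same original UID always yields the same pseudonym, which keeps references between files of the same study consistent.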

Example

Let's test our recipe by adding one of its attributes to a pydicom dataset. The attribute in our recipe looks like this:

"0x00209161": [
    "ConcatenationUID",
    "UI",
    "PSEUDONYMISER"
],

Step n°1: We add the new DICOM UID to our pydicom dataset

import pydicom

ds = pydicom.dcmread("my-mammogram.dcm")
ds.add_new("0x00209161", "UI", "1.123.123.1234.123456.12345678")
ds.save_as("my-modified-mammogram.dcm")

It will then appear inside your pydicom dataset:

(0020, 9161) Concatenation UID                   UI: 1.123.123.1234.123456.12345678

Step n°2: We deidentify the folder containing our test mammogram

from kskit.dicom.deid_mammogram import deidentify_attributes

df = deidentify_attributes("/path/to/mammogram/folder", "/path/to/outdir", org_root="9.9.9.9.9", erase_outdir=False)
print(df.ConcatenationUID_0x00209161_UI_1____)

9.9.9.9.9.474079559915109435636573090782

`load_recipe()`

Get the recipe from recipe.json and load it into a Python dict.

This function reads recipe.json. If a user-defined version of the file is detected inside $DP_HOME/data/input, it will be used. Otherwise, the built-in version of the file will be used.

Be aware that the built-in version of the file is not suited to generic use. It was created for the Deep.piste study. It is highly recommended to create your own version of recipe.json.

Returns:

Type	Description
`dict`	A Python dictionary with recipe elements.
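
The lookup order described above (user-defined file first, built-in file otherwise) can be sketched like this. The `find_recipe` and `load_recipe_from` helpers are hypothetical, and the built-in location is passed in as a parameter because its real path inside the package is not given on this page.

```python
import json
import os
import tempfile


def find_recipe(builtin_path):
    """Return the user-defined recipe path if it exists, else `builtin_path`."""
    dp_home = os.environ.get("DP_HOME")
    if dp_home:
        custom = os.path.join(dp_home, "data", "input", "recipe.json")
        if os.path.isfile(custom):
            return custom
    return builtin_path


def load_recipe_from(builtin_path):
    """Load the selected recipe.json into a dict."""
    with open(find_recipe(builtin_path), encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration with temporary files standing in for the real locations.
workdir = tempfile.mkdtemp()
builtin = os.path.join(workdir, "builtin_recipe.json")
with open(builtin, "w", encoding="utf-8") as handle:
    json.dump({"0x00020000": ["FileMetaInformationGroupLength", "UL", "CONSERVER"]}, handle)

os.environ["DP_HOME"] = os.path.join(workdir, "dp_home")  # no custom file yet
default_recipe = load_recipe_from(builtin)

custom_dir = os.path.join(workdir, "dp_home", "data", "input")
os.makedirs(custom_dir)
with open(os.path.join(custom_dir, "recipe.json"), "w", encoding="utf-8") as handle:
    json.dump({"0x00020000": ["FileMetaInformationGroupLength", "UL", "RETIRER"]}, handle)
custom_recipe = load_recipe_from(builtin)

print(default_recipe["0x00020000"][2], custom_recipe["0x00020000"][2])
```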

Note

You don't have to call this function, as it is already called implicitly by deidentify_attributes.

Tip

This function can be called to check if your customized recipe is correctly detected by kskit.

Example

load_recipe.py
from kskit.dicom.deid_mammogram import load_recipe

recipe = load_recipe()
print(recipe)

{'0x00020000': ['FileMetaInformationGroupLength', 'UL', 'CONSERVER'], '0x00020001': ['FileMetaInformationVersion', 'OB', 'CONSERVER']}

`get_general_rule(tag, recipe)`

Get the rule associated with the given tag in recipe.json

Parameters:

Name	Type	Description	Default
`tag`	`str`	A DICOM tag	required
`recipe`	`dict`	A Python dictionary containing recipe elements. See `load_recipe()`	required

Returns:

Type	Description
`str`	The action associated with this DICOM tag in the provided recipe. It is one of the deidentification actions (CONSERVER, RETIRER, EFFACER, PSEUDONYMISER).

Note

This function is implicitly called by deidentify_attributes each time it needs to take a deidentification action.

Warning

This function takes a zero trust approach when encountering unknown tags and will always return RETIRER (= REMOVE) for all tags not found inside the recipe.

Example

Example n°1: Retrieve a rule for a tag inside the recipe

get_general_rule_for_known_tag.py
from kskit.dicom.deid_mammogram import load_recipe, get_general_rule

recipe = load_recipe()
rule = get_general_rule("0x00020000", recipe)
print(rule)

CONSERVER

Example n°2: Retrieve a rule for a tag that is not declared inside the recipe

get_general_rule_for_unknown_tag.py
from kskit.dicom.deid_mammogram import load_recipe, get_general_rule

recipe = load_recipe()
rule = get_general_rule("0x00026666", recipe)
print(rule)

RETIRER
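
The zero trust behavior amounts to a dictionary lookup with a restrictive default. The following sketch assumes the `{tag: [name, VR, action]}` recipe shape shown in the `load_recipe()` example; the `general_rule` function is a hypothetical stand-in, not the kskit source.

```python
DEFAULT_ACTION = "RETIRER"  # unknown tags are removed


def general_rule(tag, recipe):
    """Return the action for `tag`, or RETIRER when the tag is not listed."""
    entry = recipe.get(tag)
    if entry is None:
        return DEFAULT_ACTION
    _name, _vr, action = entry
    return action


recipe = {
    "0x00020000": ["FileMetaInformationGroupLength", "UL", "CONSERVER"],
    "0x00209161": ["ConcatenationUID", "UI", "PSEUDONYMISER"],
}
print(general_rule("0x00020000", recipe))  # CONSERVER
print(general_rule("0x00026666", recipe))  # RETIRER
```

Defaulting to removal means that a tag forgotten in the recipe is dropped from the output instead of being exported by accident.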

`get_specific_rule(tags, recipe)`

Extract the specific rule from a list of tags in recipe.json if there is one.

Parameters:

Name	Type	Description	Default
`tags`	`List[str]`	A list of DICOM tags. The parent attribute is always before the child attribute. For instance, if we take ['AAA', 'BBB', 'CCC'], 'AAA' is a sequence containing 'BBB' and 'BBB' is a sequence containing the attribute 'CCC'.	required
`recipe`	`dict`	A Python dictionary containing recipe elements. See `load_recipe()`	required

Returns:

Type	Description
`str`	The action associated with this DICOM tag in the provided recipe. Same values as `get_general_rule`. It can also return `None` if no specific rule is defined for the tags in the list.
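
How a path-specific rule might take precedence over a general one is sketched below. The storage format is purely an assumption for illustration: this page does not say how recipe.json encodes specific rules, so the sketch supposes that a nested rule is stored under the tags of the path joined with "_". The `specific_rule` and `resolve_rule` helpers, the fallback to the general rule, and the example rules are all hypothetical.

```python
def specific_rule(tags, recipe):
    """Return the action for the full tag path, or None if none is defined.

    Assumes (hypothetically) that a nested rule is stored under the path's
    tags joined with "_". A single tag is not a path, so it has no
    specific rule.
    """
    if len(tags) < 2:
        return None
    entry = recipe.get("_".join(tags))
    return entry[2] if entry is not None else None


def resolve_rule(tags, recipe):
    """Prefer a path-specific rule, then fall back to the rule for the last tag."""
    action = specific_rule(tags, recipe)
    if action is not None:
        return action
    entry = recipe.get(tags[-1])
    return entry[2] if entry is not None else "RETIRER"


recipe = {
    # General rule for ReferencedSOPInstanceUID wherever it appears.
    "0x00081155": ["ReferencedSOPInstanceUID", "UI", "PSEUDONYMISER"],
    # Hypothetical specific rule: inside ReferencedImageSequence, remove it.
    "0x00081140_0x00081155": ["ReferencedSOPInstanceUID", "UI", "RETIRER"],
}
in_referenced = resolve_rule(["0x00081140", "0x00081155"], recipe)
in_source = resolve_rule(["0x00082112", "0x00081155"], recipe)
print(in_referenced, in_source)
```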