Mammogram Deidentification

kskit.dicom.deid_mammogram

This module is a toolbox for deidentifying mammograms.

It contains functions that serve two purposes:

  • deidentifying mammogram images
  • deidentifying mammogram metadata

Deidentification Functionalities

  • Image Deidentification based on OCR
  • Attributes/Metadata Deidentification based on a Recipe

Image Deidentification

deidentify_image_png(infile, outdir, filename)

Deidentify a mammogram's image and write it to outdir as filename.png.

This function runs the OCR reader to find all potential words on the mammogram's image. It then hides every word found by covering it in black.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| infile | str | The path of the DICOM file to deidentify. | required |
| outdir | str | The path of the directory that will store the output. | required |
| filename | str | The name of the resulting PNG file, without the file extension. | required |

get_PIL_image(dataset)

Get an Image object from the Python Imaging Library (PIL).

Get the image from the pydicom dataset and convert it from a numpy.ndarray to a PIL image object. If available, the function will use metadata information contained inside the pydicom dataset for the conversion.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | A pydicom dataset which can be obtained from a DICOM file. | required |

Returns:

| Name | Type | Description |
|---|---|---|
| Image | Image | A PIL image object. |

Example
get_PIL_image.py
from kskit.dicom.deid_mammogram import get_PIL_image
import pydicom

# Read the DICOM file into a pydicom dataset
ds = pydicom.dcmread("my-mammogram.dcm")
# Convert its pixel data to a PIL image and display it
img = get_PIL_image(ds)
img.show()

get_text_areas(pixels, languages=['fr'])

Read and return the words found in an image.

This function takes a pixel array as input and submits it to the easyOCR Reader, which returns a list of the words it finds. Authorized words are implicitly removed from that list before it is returned.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pixels | ndarray | An array representing an image. | required |
| languages | list | A list of supported languages for the OCR Reader. This makes it possible to submit images with text written in different languages. | ['fr'] |

Returns:

| Name | Type | Description |
|---|---|---|
| list | list | A list of words detected on the submitted image. |

Info

The list of available languages can be found in the easyOCR documentation.

remove_authorized_words_from(ocr_data)

Remove authorized words from the ocr_data list.

This function removes authorized words from the easyOCR output. It is useful if you want to keep some text on your image, such as laterality information (RMLO, LCC, OBLIQUE G...).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| ocr_data | list | A list of words and coordinates obtained after submitting an image to easyOCR Reader. | required |

Returns:

| Type | Description |
|---|---|
| list | The same list of words and coordinates minus the authorized words elements. |
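
The filtering step can be sketched as follows. This is only an illustration, not kskit's implementation: the allow-list contents, the case-insensitive comparison and the helper name drop_authorized are assumptions. The entry layout (bounding box, text, confidence) is the one returned by easyOCR's readtext().

```python
# Hypothetical allow-list; the list actually used by kskit is not shown here.
AUTHORIZED_WORDS = {"RMLO", "LCC", "OBLIQUE G"}


def drop_authorized(ocr_data, authorized=AUTHORIZED_WORDS):
    """Keep only the OCR entries whose text is NOT in the allow-list."""
    allowed = {word.upper() for word in authorized}
    return [entry for entry in ocr_data if entry[1].strip().upper() not in allowed]


# Each entry mimics easyOCR output: (bounding_box, text, confidence).
ocr_data = [
    ([[10, 10], [60, 10], [60, 30], [10, 30]], "RMLO", 0.98),
    ([[200, 10], [320, 10], [320, 30], [200, 30]], "DUPONT MARIE", 0.91),
]

remaining = drop_authorized(ocr_data)
print([text for _, text, _ in remaining])  # ['DUPONT MARIE']
```

Only the entries left in the list are later censored, so the laterality marker stays visible on the image.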

hide_text(pixels, ocr_data, color_value='black', mode='rectangle')

Censor the text present in a pixel array representing an image.

Take the input image and draw new shapes over it with the PIL package in order to censor the OCR-detected words.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| pixels | ndarray | A pixel array representing an image. | required |
| ocr_data | list | A list of words and coordinates obtained by easyOCR Reader after submitting an image. | required |
| color_value | str | A string indicating the color of the rectangle used for censoring information (white or black). | 'black' |
| mode | str | A string indicating the method for censoring information (blur or rectangle). | 'rectangle' |

Returns:

| Type | Description |
|---|---|
| ndarray | The deidentified pixel array. |
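
The idea behind the rectangle mode can be illustrated with a dependency-free sketch. It is not the kskit code: hide_text works on an ndarray and draws with PIL, whereas this example uses nested lists, and the helper name mask_boxes is hypothetical. Each OCR box is reduced to its axis-aligned bounding rectangle, which is then filled with a single value.

```python
def mask_boxes(pixels, ocr_data, fill=0):
    """Return a copy of a 2D pixel grid with every OCR box filled with `fill`."""
    masked = [row[:] for row in pixels]
    height = len(masked)
    width = len(masked[0]) if height else 0
    for box, _text, _confidence in ocr_data:
        # A box is a list of four [x, y] corners; take its bounding rectangle.
        xs = [int(x) for x, _ in box]
        ys = [int(y) for _, y in box]
        x0, x1 = max(min(xs), 0), min(max(xs), width - 1)
        y0, y1 = max(min(ys), 0), min(max(ys), height - 1)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                masked[y][x] = fill
    return masked


# A 6x4 all-white image with one detected word covering x=1..3, y=1..2.
image = [[255] * 6 for _ in range(4)]
ocr_data = [([[1, 1], [3, 1], [3, 2], [1, 2]], "DUPONT", 0.9)]

masked = mask_boxes(image, ocr_data)
for row in masked:
    print(row)
```

The input grid is left untouched, and coordinates are clamped to the image so that a box overlapping the border does not raise an error.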

Attributes Deidentification

deidentify_attributes(indir, outdir, org_root, erase_outdir=True)

Produce a Pandas dataframe with deidentified information from a folder of DICOM files.

This function creates a Pandas dataframe from all files present in the indir folder. Then, it loads the deidentification recipe and iterates through the dataframe to deidentify its content. Finally, it returns the deidentified dataframe object.

It also takes the outdir and erase_outdir arguments, which handle automatic cleaning of the output directory in the context of a data pipeline. If you are not interested in auto-cleaning your output directory, simply specify outdir and set erase_outdir to False.
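
What auto-cleaning can look like is sketched below with the standard library. The helper prepare_outdir is hypothetical, and whether kskit deletes the directory itself or only its contents is an assumption.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_outdir(outdir, erase_outdir=True):
    """Make sure `outdir` exists, emptying it first when `erase_outdir` is True."""
    out = Path(outdir)
    if erase_outdir and out.exists():
        shutil.rmtree(out)  # assumption: drop the whole directory, then recreate it
    out.mkdir(parents=True, exist_ok=True)
    return out


with tempfile.TemporaryDirectory() as tmp:
    outdir = Path(tmp) / "out"
    outdir.mkdir()
    (outdir / "stale.csv").write_text("left over from a previous run")

    prepare_outdir(outdir, erase_outdir=False)
    kept = sorted(p.name for p in outdir.iterdir())

    prepare_outdir(outdir, erase_outdir=True)
    after_erase = sorted(p.name for p in outdir.iterdir())

print(kept, after_erase)  # ['stale.csv'] []
```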

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| indir | str | The input directory (DICOM files to deidentify). | required |
| outdir | str | The output directory (deidentified/resulting files). | required |
| org_root | str | An organization root identifier for deidentifying DICOM UIDs. | required |
| erase_outdir | bool | Empty the output directory if True. | True |

Returns:

| Type | Description |
|---|---|
| DataFrame | A Pandas dataframe containing all metadata/attributes information. |

Info

org_root refers to a prefix used for deidentifying DICOM UIDs. This prefix has to be unique for your organization.

For more information, see NEMA DICOM Standards Documentation.
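
To make the role of org_root concrete, here is a minimal sketch of one way to derive a replacement UID. The SHA-256 hashing scheme and the function name pseudonymize_uid are assumptions chosen for illustration, and the algorithm kskit actually uses may differ. The two constraints it respects come from the DICOM standard: a UID holds at most 64 characters, and a numeric component has no leading zero.

```python
import hashlib

MAX_UID_LENGTH = 64  # DICOM limits a UID to 64 characters


def pseudonymize_uid(uid, org_root):
    """Derive a stable replacement UID: `<org_root>.<digits computed from uid>`."""
    digest = hashlib.sha256(uid.encode("ascii")).hexdigest()
    suffix = str(int(digest, 16))  # decimal digits, never starts with 0
    room = MAX_UID_LENGTH - len(org_root) - 1  # 1 for the separating dot
    if room < 1:
        raise ValueError("org_root is too long to build a valid UID")
    return f"{org_root}.{suffix[:room]}"


original = "1.123.123.1234.123456.12345678"
new_uid = pseudonymize_uid(original, "9.9.9.9.9")

print(new_uid.startswith("9.9.9.9.9."))                    # True
print(len(new_uid) <= MAX_UID_LENGTH)                      # True
print(new_uid == pseudonymize_uid(original, "9.9.9.9.9"))  # True
```

Because the result is deterministic, the same original UID maps to the same pseudonym in every file, which keeps references between related files consistent.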

Example

Let's test our recipe by adding one of its attributes to a pydicom dataset. The attribute in our recipe looks like this:

"0x00209161": [
    "ConcatenationUID",
    "UI",
    "PSEUDONYMISER"
],

Step n°1: We add the new DICOM UID to our pydicom dataset

import pydicom

# Load the mammogram, add the Concatenation UID attribute and save a copy
ds = pydicom.dcmread("my-mammogram.dcm")
ds.add_new("0x00209161", "UI", "1.123.123.1234.123456.12345678")
ds.save_as("my-modified-mammogram.dcm")

It will then appear inside your pydicom dataset:

(0020, 9161) Concatenation UID                   UI: 1.123.123.1234.123456.12345678

Step n°2: We deidentify the folder containing our test mammogram

from kskit.dicom.deid_mammogram import deidentify_attributes

df = deidentify_attributes("/path/to/mammogram/folder", "/path/to/outdir", org_root="9.9.9.9.9", erase_outdir=False)
print(df.ConcatenationUID_0x00209161_UI_1____)
# Output: 9.9.9.9.9.474079559915109435636573090782

load_recipe()

Get the recipe from recipe.json and load it into a python dict.

This function reads recipe.json. If a user-defined version of the file is detected inside $DP_HOME/data/input, it will be used. Otherwise, the built-in version of the file will be used.

Be aware that the built-in version of the file is not suited to general use. It was created for the Deep.piste study. It is highly recommended to create your own version of recipe.json.

Returns:

| Type | Description |
|---|---|
| dict | A Python dictionary with recipe elements. |

Note

You don't have to call this function yourself, as it is already called implicitly by deidentify_attributes.

Tip

This function can be called to check if your customized recipe is correctly detected by kskit.

Example
load_recipe.py
from kskit.dicom.deid_mammogram import load_recipe

recipe = load_recipe()
print(recipe)
# Output: {'0x00020000': ['FileMetaInformationGroupLength', 'UL', 'CONSERVER'], '0x00020001': ['FileMetaInformationVersion', 'OB', 'CONSERVER']}

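
The lookup order described for load_recipe (user-defined file in $DP_HOME/data/input first, built-in file otherwise) can be sketched as below. The helpers find_recipe and read_recipe and the built-in path are hypothetical; only the $DP_HOME/data/input location and the recipe.json name come from this page.

```python
import json
import os
import tempfile
from pathlib import Path


def find_recipe(builtin_path, env=os.environ):
    """Prefer $DP_HOME/data/input/recipe.json, else fall back to the built-in file."""
    home = env.get("DP_HOME")
    if home:
        candidate = Path(home) / "data" / "input" / "recipe.json"
        if candidate.is_file():
            return candidate
    return Path(builtin_path)


def read_recipe(path):
    """Load a recipe file into a Python dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Simulate a DP_HOME that contains a customized recipe.
with tempfile.TemporaryDirectory() as home:
    input_dir = Path(home) / "data" / "input"
    input_dir.mkdir(parents=True)
    custom = {"0x00209161": ["ConcatenationUID", "UI", "PSEUDONYMISER"]}
    (input_dir / "recipe.json").write_text(json.dumps(custom), encoding="utf-8")

    chosen = find_recipe("builtin/recipe.json", env={"DP_HOME": home})
    recipe = read_recipe(chosen)

print(recipe["0x00209161"][2])  # PSEUDONYMISER
```

Passing the environment as an argument keeps the sketch easy to test without touching the real DP_HOME variable.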
get_general_rule(tag, recipe)

Get the rule associated with the given tag in recipe.json

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tag | str | A DICOM tag. | required |
| recipe | dict | A Python dictionary containing recipe elements. See load_recipe(). | required |

Returns:

| Type | Description |
|---|---|
| str | The action associated with this DICOM tag in the provided recipe. It can be any of the deidentification actions (CONSERVER, RETIRER, EFFACER, PSEUDONYMISER). |

Note

This function is implicitly called by deidentify_attributes each time it needs to take a deidentification action.

Warning

This function takes a zero trust approach when encountering unknown tags and will always return RETIRER (= REMOVE) for all tags not found inside the recipe.

Example

Example n°1: Retrieve a rule for a tag inside the recipe

get_general_rule_for_known_tag.py
from kskit.dicom.deid_mammogram import load_recipe, get_general_rule

recipe = load_recipe()
rule = get_general_rule("0x00020000", recipe)
print(rule)
# Output: CONSERVER

Example n°2: Retrieve a rule for a tag that is not declared inside the recipe

get_general_rule_for_unknown_tag.py
from kskit.dicom.deid_mammogram import load_recipe, get_general_rule

recipe = load_recipe()
rule = get_general_rule("0x00026666", recipe)
print(rule)
# Output: RETIRER
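
The zero-trust behaviour amounts to a dictionary lookup with a strict default. The sketch below is a stand-alone illustration that does not import kskit. The function name general_rule is hypothetical, while the entry layout [name, VR, action] is the one shown in the recipe excerpts on this page.

```python
DEFAULT_ACTION = "RETIRER"  # unknown tags are removed


def general_rule(tag, recipe):
    """Return the action stored for `tag`, or the strict default if it is unknown."""
    entry = recipe.get(tag)
    return entry[2] if entry else DEFAULT_ACTION


recipe = {
    "0x00020000": ["FileMetaInformationGroupLength", "UL", "CONSERVER"],
    "0x00209161": ["ConcatenationUID", "UI", "PSEUDONYMISER"],
}

print(general_rule("0x00020000", recipe))  # CONSERVER
print(general_rule("0x00026666", recipe))  # RETIRER
```

Defaulting to removal means that a tag missing from the recipe can never leak into the output by accident.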

get_specific_rule(tags, recipe)

Extract the specific rule from a list of tags in recipe.json if there is one.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| tags | List[str] | A list of DICOM tags. The parent attribute always comes before the child attribute. For instance, in ['AAA', 'BBB', 'CCC'], 'AAA' is a sequence containing 'BBB', and 'BBB' is a sequence containing the attribute 'CCC'. | required |
| recipe | dict | A Python dictionary containing recipe elements. See load_recipe(). | required |

Returns:

| Type | Description |
|---|---|
| str | The action associated with this DICOM tag in the provided recipe. Same values as get_general_rule. It can also return None if no specific rule is defined for the tags in the list. |
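
How a specific rule might take precedence over a general one is sketched below. This page does not show how recipe.json encodes specific rules, so the separate specific_rules dictionary keyed by the full tag path is purely hypothetical, as are the helper names and the fallback to the general rule.

```python
def specific_rule(tags, specific_rules):
    """Return the action defined for this exact parent-to-child path, or None."""
    return specific_rules.get(tuple(tags))


def resolve_action(tags, specific_rules, recipe):
    """Assumed precedence: specific rule first, then general rule, then RETIRER."""
    action = specific_rule(tags, specific_rules)
    if action is not None:
        return action
    entry = recipe.get(tags[-1])  # general rule of the innermost attribute
    return entry[2] if entry else "RETIRER"


# Hypothetical data: keep ReferencedSOPInstanceUID only inside SourceImageSequence.
specific_rules = {("0x00082112", "0x00081155"): "CONSERVER"}
recipe = {"0x00081155": ["ReferencedSOPInstanceUID", "UI", "PSEUDONYMISER"]}

print(specific_rule(["0x00082112", "0x00081155"], specific_rules))  # CONSERVER
print(specific_rule(["0x00081155"], specific_rules))                # None
print(resolve_action(["0x00081155"], specific_rules, recipe))       # PSEUDONYMISER
```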