Mammogram Deidentification
kskit.dicom.deid_mammogram
This module is a toolbox for deidentifying mammograms. It serves two purposes:
- deidentifying mammogram images
- deidentifying mammogram metadata
Deidentification Functionalities |
---|
Image Deidentification based on OCR |
Attributes/Metadata Deidentification based on a Recipe |
Image Deidentification
deidentify_image_png(infile, outdir, filename)
Deidentify a given mammogram's image and write it to outdir as filename.png.
This function invokes the OCR reader to collect all potential words on a mammogram's image. Then, it hides every detected word by covering it in black.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
infile | str | The path of the DICOM file to deidentify. | required |
outdir | str | The path of the directory that will store the output. | required |
filename | str | The name of the resulting PNG file, without the file extension. | required |
get_PIL_image(dataset)
Get an Image object from the Python Imaging Library (PIL).
Get the image from the pydicom dataset and convert it from a numpy.ndarray to a PIL image object. If available, the function will use metadata information contained inside the pydicom dataset for the conversion.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | Dataset | A pydicom dataset which can be obtained from a DICOM file. | required |
Returns:
Name | Type | Description |
---|---|---|
Image | Image | A PIL image object. |
Example
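The conversion described above can be pictured roughly as follows. This is a minimal sketch, not kskit's implementation: `to_pil_sketch` is a hypothetical helper, and both the min-max rescaling to 8 bits and the inversion for MONOCHROME1 images are assumptions about how the metadata might be used. With pydicom, the inputs would typically come from `dataset.pixel_array` and `dataset.PhotometricInterpretation`.

```python
import numpy as np
from PIL import Image


def to_pil_sketch(pixel_array, photometric="MONOCHROME2"):
    """Hypothetical helper: convert a grayscale pixel array to an 8-bit PIL image."""
    arr = pixel_array.astype(np.float64)
    lo, hi = arr.min(), arr.max()
    if hi > lo:
        # Rescale the full value range to 0..255 (assumed normalization).
        arr = (arr - lo) / (hi - lo) * 255.0
    else:
        # Constant image: avoid dividing by zero.
        arr = np.zeros_like(arr)
    if photometric == "MONOCHROME1":
        # In MONOCHROME1 the lowest value is displayed as white, so invert.
        arr = 255.0 - arr
    return Image.fromarray(arr.astype(np.uint8))


# Stand-in for dataset.pixel_array: a tiny 12-bit image of 2 rows and 3 columns.
fake_pixels = np.array([[0, 2048, 4095], [4095, 2048, 0]], dtype=np.uint16)
image = to_pil_sketch(fake_pixels)
inverted = to_pil_sketch(fake_pixels, photometric="MONOCHROME1")
```

Note that PIL reports sizes as (width, height), so the 2x3 array above yields an image of size (3, 2).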
get_text_areas(pixels, languages=['fr'])
Read and return the words found on an image.
This function takes a pixel array as input and submits it to the easyOCR Reader, which returns a list of the words it found. The function implicitly removes authorized words from that list before returning it.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pixels | ndarray | An array representing an image. | required |
languages | list | A list of supported languages for the OCR Reader. This makes it possible to submit images with text written in different languages. | ['fr'] |
Returns:
Name | Type | Description |
---|---|---|
list | list | A list of words detected on the submitted image. |
Info
The list of available languages can be found in the EasyOCR documentation.
remove_authorized_words_from(ocr_data)
Remove authorized words from the ocr_data list.
This function removes authorized words from the easyOCR output. It is useful if you want to keep some text information on your image, such as image laterality information (RMLO, LCC, OBLIQUE G...).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ocr_data | list | A list of words and coordinates obtained after submitting an image to easyOCR Reader. | required |
Returns:
Type | Description |
---|---|
list | The same list of words and coordinates minus the authorized words elements. |
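The filtering step can be illustrated with a small sketch. It is not kskit's code: `remove_authorized_words_sketch` and the `AUTHORIZED_WORDS` allow-list are hypothetical (only RMLO, LCC and OBLIQUE G are mentioned above), and the case-insensitive exact match is an assumption. The input shape follows easyOCR's `readtext()` output, a list of `(bounding_box, text, confidence)` tuples.

```python
# Hypothetical allow-list; the words actually authorized by kskit may differ.
AUTHORIZED_WORDS = {"RMLO", "LMLO", "RCC", "LCC", "OBLIQUE G", "OBLIQUE D"}


def remove_authorized_words_sketch(ocr_data, authorized=AUTHORIZED_WORDS):
    """Drop OCR detections whose text is in the allow-list (case-insensitive)."""
    allowed = {word.upper() for word in authorized}
    return [
        detection
        for detection in ocr_data
        if detection[1].strip().upper() not in allowed
    ]


# Each detection is (bounding_box, text, confidence), as returned by easyOCR.
ocr_data = [
    ([[10, 10], [60, 10], [60, 30], [10, 30]], "RMLO", 0.99),
    ([[10, 50], [120, 50], [120, 70], [10, 70]], "DUPONT MARIE", 0.93),
]
remaining = remove_authorized_words_sketch(ocr_data)
remaining_texts = [text for _, text, _ in remaining]
```

Here the laterality marker is kept on the image (it is removed from the list of words to censor), while the name remains in the list and will be hidden.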
hide_text(pixels, ocr_data, color_value='black', mode='rectangle')
Censor text present in the pixel array representing an image.
Take the input image and draw new shapes with the PIL package in order to censor OCR-detected words.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pixels | ndarray | A pixel array representing an image. | required |
ocr_data | list | A list of words and coordinates obtained by easyOCR Reader after submitting an image. | required |
color_value | str | A string indicating the color of the rectangle used for censoring information ( | 'black' |
mode | str | A string indicating the method for censoring information. ( | 'rectangle' |
Returns:
Type | Description |
---|---|
ndarray | The deidentified pixel array. |
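As an illustration of the rectangle mode, the sketch below draws a filled rectangle over each detected bounding box with PIL. It is an approximation under stated assumptions, not the library's implementation: `hide_text_sketch` is a hypothetical name, the image is assumed to be 8-bit grayscale (so black is the value 0), and each detection is assumed to be an easyOCR-style `(bounding_box, text, confidence)` tuple.

```python
import numpy as np
from PIL import Image, ImageDraw


def hide_text_sketch(pixels, ocr_data, fill=0):
    """Hypothetical helper: cover each OCR bounding box with a filled rectangle."""
    image = Image.fromarray(pixels)
    draw = ImageDraw.Draw(image)
    for bbox, _text, _confidence in ocr_data:
        # bbox holds four [x, y] corners; reduce it to an axis-aligned box.
        xs = [int(point[0]) for point in bbox]
        ys = [int(point[1]) for point in bbox]
        draw.rectangle([min(xs), min(ys), max(xs), max(ys)], fill=fill)
    return np.array(image)


# A white 20x40 image with one detected word to censor.
pixels = np.full((20, 40), 255, dtype=np.uint8)
ocr_data = [([[5, 4], [15, 4], [15, 9], [5, 9]], "DUPONT", 0.98)]
censored = hide_text_sketch(pixels, ocr_data)
```

Pixels inside the box become 0 (black) while the rest of the image is unchanged.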
Attributes Deidentification
deidentify_attributes(indir, outdir, org_root, erase_outdir=True)
Produce a Pandas dataframe with deidentified information from a folder of DICOM files.
This function creates a Pandas dataframe from all files present in the indir folder. Then, it loads the deidentification recipe and iterates through the dataframe to deidentify its content. Finally, it returns the deidentified dataframe object.
It also takes the outdir and erase_outdir arguments to handle automatic cleaning of the output directory in the context of a data pipeline. If you're not interested in auto-cleaning your output directory, simply specify outdir and set erase_outdir to False.
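The effect of erase_outdir can be pictured with the following sketch. It only illustrates the idea of emptying an output directory before a pipeline run; `prepare_outdir_sketch` is a hypothetical helper and how kskit actually performs the cleaning is not shown in this page.

```python
import os
import shutil
import tempfile


def prepare_outdir_sketch(outdir, erase_outdir=True):
    """Hypothetical helper: create outdir if needed and optionally empty it."""
    os.makedirs(outdir, exist_ok=True)
    if not erase_outdir:
        return
    for name in os.listdir(outdir):
        path = os.path.join(outdir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # remove sub-directories recursively
        else:
            os.remove(path)  # remove regular files and symlinks


# Demonstration in a temporary directory holding one stale result.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "old_result.png"), "w").close()

prepare_outdir_sketch(demo_dir, erase_outdir=False)
kept = os.listdir(demo_dir)  # the stale file is still there

prepare_outdir_sketch(demo_dir, erase_outdir=True)
left = os.listdir(demo_dir)  # the directory is now empty
```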
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indir | str | The input directory (DICOM files to deidentify) | required |
outdir | str | The output directory (deidentified/resulting files) | required |
org_root | str | An organization root identifier for deidentifying DICOM UIDs. | required |
erase_outdir | bool | Empty the output directory if True | True |
Returns:
Type | Description |
---|---|
DataFrame | A Pandas dataframe containing all metadata/attributes information. |
Info
org_root refers to a prefix used for deidentifying DICOM UIDs. This prefix has to be unique for your organization. For more information, see the NEMA DICOM Standards Documentation.
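To make the role of org_root more concrete, here is one possible way to derive a replacement UID from an original one. This is an assumption for illustration only: the hashing scheme, the `deidentify_uid_sketch` helper and the placeholder root `1.2.3.4.5` are hypothetical and are not taken from kskit. The sketch only relies on two DICOM constraints: a UID contains digits and dots, and is at most 64 characters long.

```python
import hashlib


def deidentify_uid_sketch(uid, org_root):
    """Hypothetical helper: map a UID to a stable pseudonymous UID under org_root."""
    root = org_root.rstrip(".")
    # Hash the original UID and express the digest as a decimal number,
    # so that the suffix only contains digits.
    digest = hashlib.sha256(uid.encode("ascii")).hexdigest()
    suffix = str(int(digest, 16))
    # Truncate so that root + "." + suffix never exceeds 64 characters.
    max_suffix_len = 64 - len(root) - 1
    return f"{root}.{suffix[:max_suffix_len]}"


ORG_ROOT = "1.2.3.4.5"  # placeholder, not a registered organization root
original_uid = "1.2.840.113619.2.55.3.604688119.971.1234567890.123"
new_uid = deidentify_uid_sketch(original_uid, ORG_ROOT)
same_again = deidentify_uid_sketch(original_uid, ORG_ROOT)
```

Because the mapping is deterministic, the same original UID always yields the same deidentified UID, which keeps references between files of one study consistent.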
Example
Let's test our recipe by adding one of its attributes to a pydicom dataset. The attribute in our recipe looks like this:
Step 1: We add the new DICOM UID to our pydicom dataset
It will then appear inside your pydicom dataset:
Step 2: We deidentify the folder containing our test mammogram
load_recipe()
Get the recipe from recipe.json and load it into a Python dict.
This function reads recipe.json. If a user-defined version of the file is detected inside $DP_HOME/data/input, it will be used. Otherwise, the inbuilt version of the file will be used.
Be aware that the inbuilt version of the file does not suit a generic usage. It was created for the Deep.piste study. It is highly recommended to create your own version of recipe.json.
Returns:
Type | Description |
---|---|
dict | A Python dictionary with recipe elements. |
Note
You don't have to call this function, as it is already implicitly called by deidentify_attributes.
Tip
This function can be called to check if your customized recipe is correctly detected by kskit.
Example
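The lookup order described above (user-defined file first, inbuilt file as a fallback) can be sketched as follows. `load_recipe_sketch` and its `builtin_path` argument are hypothetical; the real load_recipe() takes no argument and locates the inbuilt file itself. The recipe contents used here are dummy values.

```python
import json
import os
import tempfile


def load_recipe_sketch(builtin_path):
    """Hypothetical helper: prefer $DP_HOME/data/input/recipe.json if it exists."""
    dp_home = os.environ.get("DP_HOME")
    if dp_home:
        user_path = os.path.join(dp_home, "data", "input", "recipe.json")
        if os.path.isfile(user_path):
            with open(user_path, encoding="utf-8") as f:
                return json.load(f)
    # Fall back to the inbuilt recipe.
    with open(builtin_path, encoding="utf-8") as f:
        return json.load(f)


# Demonstration with temporary files standing in for the real locations.
workdir = tempfile.mkdtemp()
builtin_path = os.path.join(workdir, "builtin_recipe.json")
with open(builtin_path, "w", encoding="utf-8") as f:
    json.dump({"source": "builtin"}, f)

os.environ["DP_HOME"] = os.path.join(workdir, "dp_home")
recipe_before = load_recipe_sketch(builtin_path)  # no user file yet

user_dir = os.path.join(os.environ["DP_HOME"], "data", "input")
os.makedirs(user_dir)
with open(os.path.join(user_dir, "recipe.json"), "w", encoding="utf-8") as f:
    json.dump({"source": "user"}, f)
recipe_after = load_recipe_sketch(builtin_path)  # user file now wins
```

Comparing the result before and after creating the user file is a quick way to check that a customized recipe is being picked up.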
get_general_rule(tag, recipe)
Get the rule associated with the given tag in recipe.json
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tag | str | A DICOM tag | required |
recipe | dict | A Python dictionary containing recipe elements. See | required |
Returns:
Type | Description |
---|---|
str | The action associated with this DICOM tag in the provided recipe. It can be any of the deidentification actions (CONSERVER, RETIRER, EFFACER, PSEUDONYMISER). |
Note
This function is implicitly called by deidentify_attributes each time it needs to take a deidentification action.
Warning
This function takes a zero-trust approach to unknown tags: it always returns RETIRER (= REMOVE) for any tag not found in the recipe.
Example
Example 1: Retrieve a rule for a tag inside the recipe
Example 2: Retrieve a rule for a tag that is not declared inside the recipe
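Both situations (a tag declared in the recipe and an unknown tag) can be illustrated with a small sketch. The recipe layout used here, a dict keyed by tag with an `action` field, is an assumption made for the example, and `get_general_rule_sketch` is a hypothetical stand-in for the real function; only the action names and the RETIRER default come from the description above.

```python
DEFAULT_RULE = "RETIRER"  # unknown tags are removed (zero-trust default)


def get_general_rule_sketch(tag, recipe):
    """Hypothetical helper: return the action for a tag, or RETIRER if unknown."""
    entry = recipe.get(tag)
    if entry is None:
        return DEFAULT_RULE
    return entry["action"]


# Assumed recipe layout, for illustration only.
recipe = {
    "0x00100010": {"name": "PatientName", "action": "EFFACER"},
    "0x00080060": {"name": "Modality", "action": "CONSERVER"},
}

known = get_general_rule_sketch("0x00080060", recipe)    # declared in the recipe
unknown = get_general_rule_sketch("0x00190010", recipe)  # not declared
```

Defaulting to removal means that forgetting to declare a tag can only lose information, never leak it.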
get_specific_rule(tags, recipe)
Extract the specific rule for a list of tags from recipe.json, if there is one.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tags | List[str] | A list of DICOM tags. The parent attribute is always before the child attribute. For instance, if we take ['AAA', 'BBB', 'CCC'], 'AAA' is a sequence containing 'BBB' and 'BBB' is a sequence containing the attribute 'CCC'. | required |
recipe | dict | A Python dictionary containing recipe elements. See | required |
Returns:
Type | Description |
---|---|
str | The action associated with this DICOM tag in the provided recipe. Same values as |
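The parent-to-child ordering of tags can be illustrated with the sketch below. How specific rules are actually stored in recipe.json is not described here, so the `specific_rules` section keyed by a slash-joined tag path is purely an assumption, as are the `get_specific_rule_sketch` name and the choice to return None when no specific rule exists.

```python
def get_specific_rule_sketch(tags, recipe):
    """Hypothetical helper: look up a rule for a nested tag path, if any."""
    # ['AAA', 'BBB', 'CCC'] becomes 'AAA/BBB/CCC': parent first, child last.
    key = "/".join(tags)
    entry = recipe.get("specific_rules", {}).get(key)
    return entry["action"] if entry else None


# Assumed recipe layout, for illustration only.
recipe = {
    "specific_rules": {
        "AAA/BBB/CCC": {"action": "CONSERVER"},
    }
}

nested_rule = get_specific_rule_sketch(["AAA", "BBB", "CCC"], recipe)
no_rule = get_specific_rule_sketch(["AAA", "CCC"], recipe)
```

In this sketch the full path must match, so the same attribute 'CCC' nested under a different parent does not inherit the specific rule.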