De-duplicate an archive


To work more efficiently, remove the exact duplicates from an archive.

De-duplication is the process of finding duplicate items in a data source. Items are considered to be duplicates if their text content, excluding whitespace, is the same. (Their attributes are not compared.)
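The comparison rule above can be illustrated with a short Python sketch. This is not the product's actual implementation. The item fields (`id`, `text`, `author`) and the function names are hypothetical, and the sketch assumes that "excluding whitespace" means every whitespace character is dropped before comparing and that the comparison is case-sensitive.

```python
import hashlib
from collections import defaultdict


def content_key(text: str) -> str:
    """Return a hash of the text with all whitespace removed (assumed rule)."""
    stripped = "".join(text.split())  # split() with no arguments splits on any whitespace
    return hashlib.sha256(stripped.encode("utf-8")).hexdigest()


def group_exact_duplicates(items):
    """Group item ids by whitespace-insensitive text content.

    Only the "text" field is compared; other attributes are ignored.
    """
    groups = defaultdict(list)
    for item in items:
        groups[content_key(item["text"])].append(item["id"])
    return list(groups.values())


# Hypothetical items: "a" and "b" differ only in whitespace and attributes.
items = [
    {"id": "a", "text": "Hello  world\n", "author": "x"},
    {"id": "b", "text": "Helloworld", "author": "y"},
    {"id": "c", "text": "Hello, world", "author": "x"},
]

groups = group_exact_duplicates(items)
print(groups)
```

In this sketch, items "a" and "b" land in the same group even though their `author` attributes differ, while "c" stays on its own because its text contains an extra comma.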

After the process is done, you can decide what to do with the exact duplicates.

  1. Open the archive and view the Archive Details page.
  2. In the Exact Duplicates section, click Generate exact duplicates.
  3. If a de-duplication has never been performed before, click Start Deduplication.

    Note: The processing time depends on the number of items in the archive.

  4. Optional: After the files have been processed, you can do the following from the Archive Details page:
    • To view the unique items, click View Deduplicated Items.
    • To create a new bucket or new dataset from the unique items, or to add them to an existing bucket, click the SAMPLE button, then click the desired operation. Note: It may take a couple of minutes for the left navigation to refresh and display the new bucket or dataset.
    • To permanently delete the duplicate clusters (groups) so that you can de-duplicate the archive again, click the SAMPLE button, then click Delete Groups.

