Why are there duplicates of the same tweet in the dataset? What causes that? How can I filter the duplicates and see the groups of duplicates?

We have observed three sources of duplicates:

  1. Retweets are the primary source of duplicates.
  2. Web site share options are a secondary source.
  3. Different people may happen to post the same text in separate tweets, with or without coordinating with each other.

A duplicate here means that the body text of the tweet is identical to that of another tweet. However, each tweet in a group of duplicates still has its own unique metadata.
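
Because duplicates share identical body text, one way to filter them and to see the groups is to key each tweet on its text. The following is a minimal sketch, not the dataset's official tooling. It assumes each tweet has been loaded as a Python dict, and the field names `id`, `user`, and `text` are hypothetical; substitute whatever names your copy of the dataset uses.

```python
from collections import defaultdict

def group_duplicates(tweets, text_key="text"):
    """Return {body text: [tweets]} for every text that occurs more than once."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tweet[text_key]].append(tweet)
    # Keep only the texts shared by two or more tweets.
    return {text: members for text, members in groups.items() if len(members) > 1}

def drop_duplicates(tweets, text_key="text"):
    """Keep the first tweet seen for each distinct body text."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet[text_key] not in seen:
            seen.add(tweet[text_key])
            unique.append(tweet)
    return unique

# Hypothetical sample records; the field names are assumptions.
sample = [
    {"id": 1, "user": "a", "text": "RT @x: hello world"},
    {"id": 2, "user": "b", "text": "RT @x: hello world"},
    {"id": 3, "user": "c", "text": "something else"},
]

duplicate_groups = group_duplicates(sample)
unique_tweets = drop_duplicates(sample)

for text, members in duplicate_groups.items():
    print(text, [t["id"] for t in members])
```

Grouping keeps every tweet's metadata, so you can still inspect who posted each copy and when. The comparison is an exact string match, in line with the definition above; tweets that differ by even one character fall into different groups.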

