Understanding why there are duplicates of the same Twitter tweet in a data set


Why are there duplicates of the same tweet in the data set? What causes that? How can I filter the duplicates and see the groups of duplicates?

There are three sources of duplicates that we have observed: 

  1. Retweets are the primary source of duplicates.
  2. Web site share options are a secondary source.
  3. Different people may post the same text on their own, with or without coordinating with one another.

Tweets are duplicates when their body text is identical. However, each tweet in a group of duplicates has its own unique metadata.
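
To illustrate the idea of duplicate groups, here is a minimal Python sketch that groups tweet records by identical body text. It is only an illustration, not the tool's own de-duplication feature. The field names ("id", "user", "text") and the sample records are hypothetical, so adjust them to match your own export.

```python
from collections import defaultdict


def group_duplicates(tweets):
    """Group tweet records whose body text is exactly identical.

    Assumes each record is a dict with a "text" field (hypothetical name).
    Returns only the groups that contain more than one tweet.
    """
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tweet["text"]].append(tweet)
    # Keep only body texts that occur more than once.
    return {text: items for text, items in groups.items() if len(items) > 1}


# Hypothetical sample records: same body text, different metadata.
tweets = [
    {"id": 1, "user": "alice", "text": "RT @news: Big story today"},
    {"id": 2, "user": "bob", "text": "RT @news: Big story today"},
    {"id": 3, "user": "carol", "text": "Something else entirely"},
]

duplicates = group_duplicates(tweets)
for text, items in duplicates.items():
    print(len(items), "copies:", text, [t["id"] for t in items])
```

Each group keeps every tweet's own metadata, so you can still see who posted each copy. Note that this sketch matches on exact text only; tweets that differ by even one character end up in separate groups.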

Please see the following help articles for more information:

https://texifter.zendesk.com/hc/en-us/sections/200217320-De-duplicating-and-Clustering

https://texifter.zendesk.com/hc/en-us/articles/216548217-My-Twitter-data-has-lots-of-duplicates-How-can-I-eliminate-identical-tweets-

https://texifter.zendesk.com/hc/en-us/articles/201042264-De-duplicate-an-archive
