Why are there duplicates of the same tweet in the dataset? What causes that? How can I filter the duplicates and see the groups of duplicates?
There are three sources of duplicates that we have observed:
- Retweets are the primary source of duplicates.
- Website share options, such as share buttons that pre-fill the tweet text, are a secondary source.
- Different people may post the same text on their own, with or without coordination.
A duplicate is a tweet whose body text is identical to that of another tweet. However, each tweet in a group of duplicates keeps its own unique metadata.
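If you have exported your tweets and want to explore the idea outside the application, the grouping can be sketched in a few lines of Python. This is only an illustration of exact-text matching, not a description of how the built-in de-duplication tool works. The record layout and the field names `id`, `user`, and `text` are assumptions made for this example; adjust them to match your own export.

```python
from collections import defaultdict

def group_by_text(tweets):
    """Group tweet records by exact body text (hypothetical 'text' field)."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[tweet["text"]].append(tweet)
    return dict(groups)

def duplicate_groups(tweets):
    """Keep only the groups that contain more than one tweet."""
    return {
        text: items
        for text, items in group_by_text(tweets).items()
        if len(items) > 1
    }

def deduplicate(tweets):
    """Keep the first tweet seen for each distinct body text."""
    seen = set()
    unique = []
    for tweet in tweets:
        if tweet["text"] not in seen:
            seen.add(tweet["text"])
            unique.append(tweet)
    return unique

# Made-up sample records: same body text, different metadata.
sample = [
    {"id": 1, "user": "alice", "text": "Big news today!"},
    {"id": 2, "user": "bob", "text": "Big news today!"},
    {"id": 3, "user": "carol", "text": "Something else entirely."},
]

groups = duplicate_groups(sample)
unique = deduplicate(sample)

for text, items in groups.items():
    print(text, [t["id"] for t in items])
```

Because the match is on the exact body text, two tweets that differ by even one character (an extra space or a different shortened link, for instance) fall into separate groups in this sketch.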
Please see the following help articles for more information:
https://texifter.zendesk.com/hc/en-us/sections/200217320-De-duplicating-and-Clustering
https://texifter.zendesk.com/hc/en-us/articles/201042264-De-duplicate-an-archive