My Twitter data has lots of duplicates. How can I eliminate identical tweets?

The same tweet keeps coming into my data many times, repeating again and again. I need only a single occurrence of any given tweet.

The practice of "retweeting" the same content is very common on Twitter, and that is what you are seeing. If you examine the metadata, you will find that these are not actual duplicates: each retweet carries the metadata signature of a different user.

Identical retweets can be one of the most interesting features of a Twitter data collection. You can exclude or manage them in three ways: by filtering, by using DiscoverText's automated duplicate detection algorithm, or by including a "-is:retweet" rule in your Gnip PowerTrack query against the Twitter firehose.
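If you have already exported tweets and want to filter retweets yourself, the idea can be sketched in a few lines of Python. This is only an illustration, not DiscoverText's or Gnip's implementation. It assumes each tweet is a dict, that retweets carry a "retweeted_status" key (as in Twitter's v1.1 JSON payloads), and that text-only exports mark retweets with a leading "RT @"; the function names and sample data are hypothetical, and your export format may differ.

```python
def is_retweet(tweet):
    """Return True if a tweet dict looks like a retweet."""
    # Assumption: v1.1-style payloads include "retweeted_status" on retweets.
    if "retweeted_status" in tweet:
        return True
    # Fallback heuristic for exports that keep only the text.
    return tweet.get("text", "").startswith("RT @")


def drop_retweets(tweets):
    """Keep only tweets that do not look like retweets."""
    return [t for t in tweets if not is_retweet(t)]


# Hypothetical sample data for illustration only.
sample = [
    {"id": 1, "user": "alice", "text": "Budget vote is tonight"},
    {"id": 2, "user": "bob", "text": "RT @alice: Budget vote is tonight"},
    {"id": 3, "user": "carol", "text": "RT @alice: Budget vote is tonight",
     "retweeted_status": {"id": 1}},
]

originals = drop_retweets(sample)
print([t["id"] for t in originals])
```

Filtering at collection time with the "-is:retweet" rule avoids storing the retweets at all, while filtering afterwards keeps them available in case the retweet counts turn out to matter.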

Retweets function as either noise or signal depending on the task. If you are looking for the complete diversity of views, or for particular expressions of a viewpoint, the retweets may cloud the picture, making the analysis harder and more time-consuming to complete. However, if you are assessing the complete landscape of Twitter, the counts of duplicates and near-duplicates (as well as the groups and individuals responsible for the more viral content) represent critical pieces of the communicative geography.

Please see the deduplication section of this support site for more information. Below is a fragment of an example based on an archive of 6,232 tweets. DiscoverText enables the user to review snippets of the largest groups in a rank-ordered list. You can also sample one item out of each group, as well as the singles, to solve the noise problem and create a purposively diverse sample.
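The grouping and sampling steps described above can be sketched roughly as follows. This is a simplified illustration rather than DiscoverText's algorithm: it only matches tweets whose text is identical after a light normalization (stripping a leading "RT @user:" prefix, lowercasing, collapsing whitespace), so it will not catch true near-duplicates. All function names and the sample data are hypothetical.

```python
import random
import re
from collections import defaultdict


def normalize(text):
    """Reduce a tweet to a comparison key (an assumed, simple normalization)."""
    text = re.sub(r"^RT @\w+:\s*", "", text)  # drop a leading retweet prefix
    return " ".join(text.lower().split())      # lowercase, collapse whitespace


def group_duplicates(tweets):
    """Group tweets by normalized text, largest groups first."""
    groups = defaultdict(list)
    for tweet in tweets:
        groups[normalize(tweet["text"])].append(tweet)
    return sorted(groups.values(), key=len, reverse=True)


def one_per_group(groups, seed=0):
    """Pick one tweet from every group, including groups of size one."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return [rng.choice(group) for group in groups]


# Hypothetical sample data for illustration only.
tweets = [
    {"user": "alice", "text": "Budget vote is tonight"},
    {"user": "bob", "text": "RT @alice: Budget vote is tonight"},
    {"user": "carol", "text": "RT @alice: Budget vote is  tonight"},
    {"user": "dave", "text": "Traffic on Main St again"},
]

groups = group_duplicates(tweets)
for group in groups:
    print(len(group), normalize(group[0]["text"]))

diverse_sample = one_per_group(groups)
print(len(diverse_sample))
```

The ranked group sizes show which content spread most widely, and the one-per-group sample gives a reduced set in which each distinct message appears once.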