I am exporting a deduplicated dataset of historical tweets to a CSV file, but the export contains fewer items than were originally collected, and fewer than the total DiscoverText reports for the dataset. I expected about 44,000 tweets, but the exported file contains only about 41,500. Why does this happen?
Some of the tweets were deleted, or the accounts that posted them were suspended, after the data was initially collected. Twitter's Terms of Service require us to remove from general distribution any tweets that have been deleted, that come from a suspended account, or that are otherwise no longer accessible to the public without permissions. One of the ways we comply is to have DiscoverText check each item whenever it is displayed or exported.
In one example that we examined, 308 tweets were marked as deleted and a further 1,992 came from suspended accounts, so the export contained 2,300 fewer items than were collected. Unfortunately, we have to play by Twitter’s rules and cannot display or export these items.
DiscoverText can check for deletions and suspensions only when an item is viewed or exported: we have tens of millions of tweets stored in our systems, and continuously monitoring all of them would be too resource intensive. As a result, the number of items that can legally be accessed may change from day to day and may not equal the number at the time of collection.
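To make the export-time check concrete, here is a minimal Python sketch of the general idea. It is not DiscoverText's actual code: the status labels, the `export_items` function, and the `current_status` lookup are all hypothetical names chosen for illustration, and the five-tweet dataset is toy data.

```python
from collections import Counter

# Hypothetical status labels; the real system's values are not documented here.
AVAILABLE, DELETED, SUSPENDED = "available", "deleted", "suspended"


def export_items(items, current_status):
    """Return the exportable items plus a count of withheld items by reason.

    `current_status` is a stand-in for an availability lookup made at
    export time (assumed for this sketch, not a real DiscoverText API).
    """
    exported = []
    withheld = Counter()
    for item in items:
        status = current_status(item["id"])
        if status == AVAILABLE:
            exported.append(item)
        else:
            # Deleted or suspended items are counted but never exported.
            withheld[status] += 1
    return exported, withheld


# Toy data: five tweets collected, two of which became unavailable later.
collected = [{"id": i, "text": f"tweet {i}"} for i in range(1, 6)]
status_now = {1: AVAILABLE, 2: DELETED, 3: AVAILABLE, 4: SUSPENDED, 5: AVAILABLE}

exported, withheld = export_items(collected, status_now.get)
print(len(collected), len(exported), dict(withheld))
# prints: 5 3 {'deleted': 1, 'suspended': 1}
```

Because the status is looked up at export time rather than stored at collection time, running the same export on a different day can return a different count. The gap between collected and exported items always equals the number withheld, just as 308 + 1,992 = 2,300 in the example above.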