UPDATE: Perhaps the most important thing to consider is how Twitter has changed over time. The metadata attached to a Tweet has shifted as the platform has evolved, so you cannot expect data from 2006 to contain the same fields as data from 2015.
The metadata associated with each Tweet also varies depending on whether certain attributes are present in a specific Tweet or in the collection as a whole (e.g., location coordinates or country codes). Here is an example of the potential metadata fields for each Tweet in a small (~2,000 item), recent (2015) project.
- Title - this is the text of the tweet
- Text - this is also the text of the tweet
- country_code - often no data here
- favorites_count - number of favorites at a point in time
- followers_count - number of followers the author has
- friends_count - number of users the author follows
- hashtag - # or #s present in the Tweet
- id - Tweet ID
- is_retweet - TRUE or blank
- link - Tweet URL
- location_coord_type - often blank
- location_coords - often blank
- location_displayname - often blank
- location_type - often blank
- media_display_url - from the Tweet text
- media_type - e.g., pic.twitter
- media_url - often a duplicate of the media_display_url
- posted_time - date & time the Tweet was posted
- real_name - purported user's real name
- rule_match - the match to a Gnip rule that pulled this Tweet
- source - Hootsuite, TweetDeck, Facebook, Paper.li, Twitter for iPhone, etc.
- statuses_count - number of Tweets by that user at that point in time
- tweet_url - the URL of the Tweet
- user_bio_summary - the user bio
- user_location - rate varies, but a lot more of this than geo data
- user_mention - other Twitter user mentioned in Tweet text
- user_mention_username - see above
- user_twitter_page - the twitter home page of the user
- username - the Twitter username
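Because so many of these fields are often blank, it can help to normalize each record before analysis. The following is a minimal sketch, assuming the items have been exported to a CSV file whose column names match the field names above. The CSV layout, the sample values, and the `normalize_row` helper are all assumptions made for illustration, not a documented export format.

```python
import csv
import io

# Hypothetical sample export: the column names come from the field list
# above, but the CSV layout and the values are invented for illustration.
SAMPLE_CSV = """id,username,is_retweet,country_code,location_coords,favorites_count
1001,example_user,TRUE,,,3
1002,another_user,,US,,0
"""

def normalize_row(row):
    """Turn blank strings into None and parse the fields with a known shape."""
    cleaned = {key: ((value or "").strip() or None) for key, value in row.items()}
    # is_retweet is either TRUE or blank, so map it to a real boolean.
    cleaned["is_retweet"] = cleaned.get("is_retweet") == "TRUE"
    # favorites_count is a point-in-time count; keep None if it is absent.
    count = cleaned.get("favorites_count")
    cleaned["favorites_count"] = int(count) if count is not None else None
    return cleaned

rows = [normalize_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]

for row in rows:
    print(row["id"], row["is_retweet"], row["country_code"])
```

Mapping blanks to `None` keeps "no data" distinct from a real value such as a favorites count of 0, which matters for sparse fields like country_code and location_coords.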
UPDATE: As of 10/27/2015
Gnip items loaded in (either live or via Sifter) can now supply the following, in addition to all the existing metadata:
1) Twitter's User ID for a retweeted user (plus maybe an extra field for the Username)
2) Twitter's User ID for a user that received a reply (plus maybe an extra field for the Username)
3) Twitter's Status ID for the original tweet that is retweeted
4) Twitter's Status ID for the tweet that received a reply
5) a field containing the unabridged original text of a retweeted tweet
A caveat: although this information is now available inside DiscoverText, the naming convention for the metadata keys had to remain generic to stay in line with items from Gnip (which has no direct notion of these items). To get the data listed above, use the following metadata keys:
(1) "tweet link userid" and "tweet link username"
(2) "in reply to userid" and "in reply to username"
(3) "tweet link statusid" (also available via the "tweet twitter link")
(4) "tweet link tweetid" (also available via the "in reply to link")
(5) "text body"
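The mapping above can be captured in a small lookup table. This is only a sketch: it assumes an item's metadata is available as a plain dictionary keyed by the metadata key names listed above. The descriptive names on the left, the `extract_links` helper, and the sample item are hypothetical and not part of DiscoverText or Gnip.

```python
# Metadata keys exactly as listed above. The descriptive names on the left
# are hypothetical labels chosen for this sketch.
KEY_MAP = {
    "retweeted_user_id": "tweet link userid",
    "retweeted_username": "tweet link username",
    "replied_to_user_id": "in reply to userid",
    "replied_to_username": "in reply to username",
    "retweeted_status_id": "tweet link statusid",
    "replied_to_status_id": "tweet link tweetid",
    "original_retweet_text": "text body",
}

def extract_links(metadata):
    """Return the retweet/reply fields, using None for missing or blank keys."""
    return {name: (metadata.get(key) or None) for name, key in KEY_MAP.items()}

# Invented example: a retweet, so the reply-related keys are absent.
sample_item = {
    "tweet link userid": "12345",
    "tweet link username": "example_user",
    "tweet link statusid": "987654321",
    "text body": "Full text of the original tweet",
}

links = extract_links(sample_item)
print(links["retweeted_user_id"], links["replied_to_user_id"])
```

Keeping the generic key names in one table means that any future change to the naming convention only has to be made in one place.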