NOTE: Updated with 2018 prices
Question: "To publish our research paper we need a large amount of data from Twitter but the Twitter's public API is not responding properly so to finish our research we have to buy data from your company. We have no commercial purpose to use these data. We just want to finish this research with good amount of data. I would really appreciate if your company can help us out because we badly need these Tweets to complete our research, since we are students I hope you will give us the best pricing so that we can afford to buy it."
Question: "Is there not a way to get data based on a Twitter account? For example, get all historical tweets from a specific Twitter account for a year?"
Question: "My research involves analysing Twitter historical data. I was quoted over $4000.00 for access to data. Unfortunately, I'm not able to pay any money for data as it's used for academic research purposes, not commercial. I'd be thankful if you could tell me how I can access historical data."
You could complete this task using Sifter:
The problem is price. There is a per-day cost of $25 (i.e., $25 for each day of Twitter data, so a year's worth is $9,125), in addition to a volume component.
The reason is that every day of data you pull triggers a massive and costly transaction in a commercial cloud. The business model (number of days plus volume of Tweets) comes directly from Gnip and Twitter. Twitter offers no academic discount. Whether the data is pulled by a corporation, an individual, or an academic, we pay the same price.
When you pull all the Tweets from a single user, or all the Tweets matching a complex PowerTrack 2.0 rule, for one day, the query has to run against all 500,000,000 Tweets for that day to find the ones that match. Information retrieval in this environment is a computationally intense process. All of the systems operate in a commercial cloud. The price reflects programmers at multiple organizations, database managers, lawyers, accountants, multiple cloud hosting providers, and computation, electricity, and bandwidth costs, to name only a few of the factors. So, you can get a year's worth of a user's Tweets, but that is 365 x 0.5 billion rows of data that must be queried. If you are matching on features of metadata as well as the body of the Tweet, we are talking about trillions of cells to query and pull information from.
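To make the scale concrete, here is a small back-of-the-envelope sketch in Python. It only multiplies the figures quoted above (500,000,000 Tweets per day, 365 days); the function name `rows_scanned` is made up for illustration and is not part of Sifter or Gnip.

```python
# Back-of-the-envelope arithmetic using the figures quoted in this answer.
TWEETS_PER_DAY = 500_000_000  # daily Tweet volume cited above
DAYS_IN_YEAR = 365


def rows_scanned(days: int, tweets_per_day: int = TWEETS_PER_DAY) -> int:
    """Rows a query must be run against for the given number of days."""
    return days * tweets_per_day


if __name__ == "__main__":
    one_year = rows_scanned(DAYS_IN_YEAR)
    print(f"{one_year:,} rows for one year")  # 182,500,000,000 rows for one year
```

Even a query for a single account has to be evaluated against every one of those rows, which is why the number of days drives the price.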
We suggest rethinking the project to collect data from a smaller number of days. Reducing the number of days is the best way to deal with the real costs of using the Gnip Historical PowerTrack via Sifter.
The best ways to control costs are:
- Limit the number of days
- Use the random "sample" rule to limit volume
To complete your project affordably, we advise that you drastically reduce the total number of data collection days and use a method of systematic, representative, or purposive sampling. For example, collect the first and last day of every month, or the 7th and 21st day of every other month. It also makes sense to consider an event-driven collection strategy. If there is a point (or points) in time when key events transpire (e.g., disaster, election, controversy), try collecting data for a few days before and a few days after the critical event.
By lowering the number of days you will save a lot of money.
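As a rough illustration of the savings, the sketch below builds the sampling schedules mentioned above and prices only the per-day component at $25 per day. The volume component is not modeled, all function names are hypothetical, and the event date is an arbitrary example.

```python
import calendar
from datetime import date, timedelta

PER_DAY_COST_USD = 25  # per-day figure quoted above; volume charges are NOT included


def first_and_last_days(year):
    """First and last calendar day of every month in the year."""
    days = []
    for month in range(1, 13):
        last_day = calendar.monthrange(year, month)[1]
        days.append(date(year, month, 1))
        days.append(date(year, month, last_day))
    return days


def fixed_days_every_other_month(year, days_of_month=(7, 21)):
    """The given days of the month for January, March, May, and so on."""
    return [date(year, m, d) for m in range(1, 13, 2) for d in days_of_month]


def event_window(event_day, before=3, after=3):
    """A few days before and after a key event, inclusive of the event day."""
    return [event_day + timedelta(days=k) for k in range(-before, after + 1)]


def day_cost(days):
    """Per-day cost only; duplicate dates are charged once."""
    return len(set(days)) * PER_DAY_COST_USD


if __name__ == "__main__":
    print(day_cost(first_and_last_days(2018)))           # 24 days -> 600
    print(day_cost(fixed_days_every_other_month(2018)))  # 12 days -> 300
    print(day_cost(event_window(date(2018, 11, 6))))     # 7 days -> 175
```

Compared with the $9,125 per-day charge for a full year, each of these schedules cuts the day-based cost by more than 90%, before any volume charges are added.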
Another way to control costs is to obtain a representative sample (e.g., 10%) of the data by using PowerTrack's sample: rule.
The sample: rule returns a random sample of Tweets that match a rule rather than the entire set of Tweets. Sample percent must be represented by an integer value between 1 and 100. For example:
(hurricane OR flood) sample:10
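If you generate rules programmatically, a small helper can enforce the integer constraint described above before the rule is submitted. This is only a string-building sketch: `with_sample` is a hypothetical name, and it does not call Sifter, Gnip, or any Twitter API.

```python
def with_sample(rule: str, percent: int) -> str:
    """Append a sample: clause to a rule string.

    Enforces the constraint stated above: the percentage must be an
    integer between 1 and 100.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(percent, bool) or not isinstance(percent, int):
        raise ValueError("sample percent must be an integer")
    if not 1 <= percent <= 100:
        raise ValueError("sample percent must be between 1 and 100")
    return f"{rule.strip()} sample:{percent}"


if __name__ == "__main__":
    print(with_sample("(hurricane OR flood)", 10))  # (hurricane OR flood) sample:10
```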
We offer student discounts of 75% off the list price for DiscoverText, which is where the data ends up.
If you have not already done so, please create a free 3-day trial and explore the results with Gnip's PowerTrack, which is a day-forward version of the data stream offered through Sifter.