Illocution Inc. Principled, Practical, Innovative, Free

Upzilla API:

Note: As usual, this works great in Chrome and Firefox. The Internet Explorer view is getting better but it still has some issues.

The Upzilla API gives everybody access to raw trend data (JSON) from Illocution's Twitter Stratified Random Sample (SRS). Of course, you will have to know what to do with the data once you have it, but this is a great way to add linguistically-principled data to your application or web site.

Aside from explaining the meaning of the variables you send in and get back, the main feature on this page is the API Demo down at the bottom. Use it to perfect your query and determine exactly what data are returned, then cut and paste the URL provided into your application. It's really pretty simple: send a GET and get some JSON. To make it easier, we have also provided some examples that will help you pinpoint the data set you want. So give it a whirl and see what comes up.

Index:
  • API Documentation
  • API Call Examples
  • API Demo
API Documentation:

The Data Set:
DATA The Upzilla trend data are extracted from the Twitter Stratified Random Sample (TSRS), an ongoing project at Illocution Inc. For each of 3 spans (periods) of N days -- the 30-day span, the 7-day span, and the 1-day (24-hour) span -- tweet data were collected from the TSRS for the N*24 most-recent hours. For the spans with available data, the span data were compared with the norm derived from the data of the next-larger span. For this particular analysis, only those tweets determined to be English were used, and http links and @username strings were removed. Hash tags were kept. No other manipulation occurred (i.e. there are no hand-picked data -- you get what you get).
Updates The Upzilla data set is updated every 6 hours. The approximate update times are 00:00, 06:00, 12:00, and 18:00 hours Eastern U.S. Time (GMT-05:00). Note that the time supplied in the return data is GMT.
Usage Well, we wouldn't make it available if we didn't want you to use it. So, have at it. However, we reserve the right to modify and/or cancel the API and data collection methods at any time. We will try to give ample notice.

The Request:
HTTP You gain access to the Upzilla API by making a simple HTTP request to the following: http://www.illocutioninc.com/cgi-bin/upzilla_api.py. The demo below uses GET (you can cut and paste from there), but you can use POST as well. Defaults are set in the CGI script, so you don't actually have to include any variables (just click the link above and see what happens). But you will probably want to include variables, so the options are explained below.
span The span options are either 1 or 7 days. If span is set to 1 day, you will receive data derived from a comparison of data from the last 24 hours (1 day) to data from the last 168 hours (7 days) from the time of analysis. If span is set to 7 days, you will receive data derived from a comparison of data from the last 168 hours (7 days) to data from the last 720 hours (30 days) from the time of analysis. The "span=1" data will change rapidly and often dramatically between updates (every 6 hours), while the "span=7" data changes more slowly. The default is 1.
gram The gram option specifies what n-gram you are interested in, either 1 for single grams (tokens), or 2 for bi-grams. The default is 1.
column Each n-gram in the returned data set has 7 values. The column variable allows you to specify which value (i.e. column) to use for sorting. This is used with the order variable to get the n-gram data sorted the way you want it. The default is "column=s". The columns are defined as follows:
  • p2 - the percent of tweets in which the given n-gram is found in the larger of the two spans being compared.

  • p1 - the percent of tweets in which the given n-gram is found in the smaller of the two spans being compared.

  • r2 - a measure from 0 to 1 of how infrequent the given n-gram is in the larger of the two spans being compared. The distribution is hyperbolic, so most n-grams have a value near 1. This measure can be used to filter out noise.

  • r1 - a measure from 0 to 1 of how infrequent the given n-gram is in the smaller of the two spans being compared. The distribution is hyperbolic, so most n-grams have a value near 1. This measure can be used to filter out noise.

  • z - a z-score representing how disproportionate the p1 value is compared to the norm established by the p2 value. A positive value indicates the n-gram occurs more often than expected (emerging), and a negative value indicates a rate of occurrence less than expected (declining). Sort "order=desc" to get most positive, and "order=asc" to get most negative.

  • v - a measure of validity for the given z-score which ranges from 0 to 1 where 1 is most valid (reliable).

  • s - a measure of probable interest which is currently r2*z*v but is likely to change in order to incorporate p1 and p2. As it stands, this represents z filtered for noise (i.e. commonality).
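
As a rough illustration of how these columns relate, the sketch below recomputes the current interest score (r2*z*v) from a data row and keeps only emerging n-grams whose z-score looks reliable. The row layout [n-gram,p2,p1,r2,r1,z,v,s] is the one documented on this page; the function names, the sample numbers, and the 0.5 validity threshold are invented for illustration and are not part of the API.

```javascript
// Column positions in a data row, as documented: [n-gram, p2, p1, r2, r1, z, v, s].
const COL = { ngram: 0, p2: 1, p1: 2, r2: 3, r1: 4, z: 5, v: 6, s: 7 };

// Recompute the interest score from its parts (currently r2 * z * v).
function interestScore(row) {
  return row[COL.r2] * row[COL.z] * row[COL.v];
}

// Keep emerging n-grams (z > 0) whose validity meets a minimum.
// The minimum is a hypothetical choice, not something the API prescribes.
function emergingRows(rows, minValidity) {
  return rows.filter((row) => row[COL.z] > 0 && row[COL.v] >= minValidity);
}

// Invented sample rows, for illustration only.
const sampleRows = [
  ["hello", 1.0, 2.0, 0.5, 0.4, 4.0, 0.5, 1.0],
  ["world", 2.0, 1.0, 0.9, 0.95, -3.0, 0.8, -2.16],
  ["maybe", 0.1, 0.2, 0.99, 0.98, 2.0, 0.1, 0.198],
];

// Only "hello" is both emerging and at or above the 0.5 validity minimum.
const reliable = emergingRows(sampleRows, 0.5);
console.log(reliable.map((row) => row[COL.ngram]));
```

Filtering on v before trusting z is one reasonable reading of "a measure of validity"; you may prefer to sort on s and let the API do the weighting for you.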

order The order specifies the direction of the sort for the column specified. It can be either asc (ascending) or desc (descending). Remember that "order=desc" returns the highest value first, which is what most folks want. Use "column=z&order=asc" to find the declining n-grams, which is a nice feature. The default is desc.
offset Used in conjunction with the limit variable, the offset variable specifies how many n-gram data rows to skip before beginning to return data. This would allow you, for example, to get results in sets of 100 by requesting "limit=100" and "offset=N" where N is 100 times the number of previous requests. Or, "offset=9&limit=1" would return data for only the 10th n-gram in the given sort. The default is 0 (don't skip anything).
limit The limit allows you to specify how many rows of n-gram data you want returned. This is the maximum you want; the actual number returned may be lower depending on the data. The maximum is 1000. The default is 100. Undefined returns the default. Zero (0) will return statistics, but no n-gram data rows.
pass The pass variable just allows you to pass a single text variable (no whitespace, up to 64 characters) through the API. So, if you send "pass=myvariable", it will be returned in the JSON object (data) as '{"pass":"myvariable"}'. The default is '{"pass":null}'.
callback If you are using the returned data in a JavaScript callback function and you want the JSON object wrapped like "mycallback(JSON)" then specify the function name using "callback=mycallback" (no whitespace, up to 64 characters). Otherwise, if you are making a call in another language, don't include the callback variable, and the data will come as a plain JSON object. You can parse it however you want from there. The default is plain JSON object (no wrapper). Note that the demo will only allow "callback=demo_display" because this is the actual callback being used in the demo JavaScript.
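
The request variables above can be assembled into a GET URL with a small helper. This is only a sketch: the endpoint, defaults, and allowed values are taken from this page, while the names buildUpzillaUrl, pageUrl, DEFAULTS, and ALLOWED are hypothetical, and the checks simply mirror the documented limits rather than anything the server is known to enforce.

```javascript
// Endpoint as given on this page.
const UPZILLA_URL = "http://www.illocutioninc.com/cgi-bin/upzilla_api.py";

// Documented defaults for the request variables.
const DEFAULTS = { span: 1, gram: 1, column: "s", order: "desc", offset: 0, limit: 100 };

// Allowed values, taken from the option descriptions above.
const ALLOWED = {
  span: [1, 7],
  gram: [1, 2],
  column: ["p2", "p1", "r2", "r1", "z", "v", "s"],
  order: ["asc", "desc"],
};

// Text variables (pass, callback): no whitespace, up to 64 characters.
function isValidText(value) {
  return typeof value === "string" && value.length <= 64 && !/\s/.test(value);
}

// Build a GET URL; throws on values the documentation does not allow.
function buildUpzillaUrl(options = {}) {
  const opts = { ...DEFAULTS, ...options };
  for (const key of Object.keys(ALLOWED)) {
    if (!ALLOWED[key].includes(opts[key])) {
      throw new RangeError(`Unsupported ${key}: ${opts[key]}`);
    }
  }
  if (!Number.isInteger(opts.offset) || opts.offset < 0) {
    throw new RangeError("offset must be a non-negative integer");
  }
  if (!Number.isInteger(opts.limit) || opts.limit < 0 || opts.limit > 1000) {
    throw new RangeError("limit must be an integer from 0 to 1000");
  }
  const params = new URLSearchParams();
  for (const key of ["span", "gram", "column", "order", "offset", "limit"]) {
    params.set(key, String(opts[key]));
  }
  for (const key of ["pass", "callback"]) {
    if (opts[key] !== undefined) {
      if (!isValidText(opts[key])) {
        throw new RangeError(`${key} must be text without whitespace, up to 64 characters`);
      }
      params.set(key, opts[key]);
    }
  }
  return `${UPZILLA_URL}?${params.toString()}`;
}

// Page through results in sets of 100: page 0 skips 0 rows, page 1 skips 100, and so on.
function pageUrl(pageNumber, options = {}) {
  return buildUpzillaUrl({ ...options, limit: 100, offset: 100 * pageNumber });
}

console.log(buildUpzillaUrl({ gram: 2, column: "z", order: "asc" }));
```

Leaving pass and callback out of the URL when they are not supplied lets the script fall back to its own defaults (a null pass and a plain JSON object).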

The Return:
JSON The returned data are JSON (JavaScript Object Notation). No other options are offered because JSON will handle the data, and almost all languages will handle JSON. Basically you get back a JavaScript object literal in the form '{"varstring1":value1, "varstring2":value2,...}'. The only variation is that the "data" variable (or member) has an array for a value, and that array is made up of arrays with 8 constituents that represent a row of data (i.e. an n-gram and 7 measures). If the callback variable has been used in the query, the returned object literal is wrapped in function-type notation in the form 'mycallback({"varstring1":value1, "varstring2":value2,...})'. You know, it's all a string anyway, so it's pretty simple. Use the API Demo below to help you visualize the JSON structure of the return data.
givens All of the variables given in the HTTP request (or the default values) are returned in the JSON object literal similar to '{"column":"s", "order":"DESC"}'. See the above descriptions. Other returned variables are explained below.
error This should be returned with a null value. Otherwise, there has been an error and the JSON object literal will have no other members. You can use code similar to 'if (data["error"]) {handle_error()} else {process_data(data)};' to determine if data exists to be processed.
cindex The cindex is the index of the sort column in each array of n-gram data. You can use this to know how to index and pull out the correct value. For example, if you pass in "column=p2", then cindex will be 1. Using "column=s" will set cindex to 7, and so on.
data_time The date and time of the analysis that produced the data being returned in GMT.
data_length The number of n-gram data arrays (i.e. rows) returned. This may be less than the limit variable passed in.
data_tweets2 The count of tweets in the larger of the two spans.
data_tweets1 The count of tweets in the smaller of the two spans.
data_tokens2 The count of n-gram tokens for the given "gram=n" in the larger of the two spans (i.e. for a query where "gram=2" is passed in, the count is for the number of bi-grams in the given tweets).
data_tokens1 The count of n-gram tokens for the given "gram=n" in the smaller of the two spans (i.e. for a query where "gram=2" is passed in, the count is for the number of bi-grams in the given tweets).
data_types2 The count of n-gram types (unique strings) for the given "gram=n" in the larger of the two spans (i.e. for a query where "gram=2" is passed in, the count is for the number of bi-gram types in the given tweets).
data_types1 The count of n-gram types (unique strings) for the given "gram=n" in the smaller of the two spans (i.e. for a query where "gram=2" is passed in, the count is for the number of bi-gram types in the given tweets).
data_max2 The count of tweets containing the most frequent 1-gram in the larger of the two spans.
data_max1 The count of tweets containing the most frequent 1-gram in the smaller of the two spans.
data The "data" variable (or member) has an array for a value, and that array itself is made up of arrays, each with 8 constituents that represent a row of data (i.e. an n-gram and 7 measures). Generally, you will want to iterate over the "data" array and process each sub-array (i.e. row) of data. The constituents are as follows (as described above): [n-gram,p2,p1,r2,r1,z,v,s]. The n-gram is a string, and the rest are numbers. You can use the cindex variable to know what the sort column index is if you have to pull that value out.
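
Putting the return variables together, a response might be handled roughly as follows. This is a sketch under assumptions: the member names (error, cindex, data) and the row layout come from this page, but parseUpzillaBody, readRows, and the sample body are hypothetical, and the sample numbers are invented rather than real API output.

```javascript
// Row layout documented above: [n-gram, p2, p1, r2, r1, z, v, s].
const ROW_FIELDS = ["ngram", "p2", "p1", "r2", "r1", "z", "v", "s"];

// Strip an optional callback wrapper such as mycallback({...}), then parse the JSON.
function parseUpzillaBody(body) {
  const text = body.trim();
  const wrapped = text.match(/^[A-Za-z_$][\w$.]*\(([\s\S]*)\);?$/);
  return JSON.parse(wrapped ? wrapped[1] : text);
}

// Turn a response object into labelled rows, or throw if the API reported an error.
function readRows(response) {
  if (response.error) {
    throw new Error(`Upzilla error: ${response.error}`);
  }
  return response.data.map((row) => {
    const labelled = {};
    ROW_FIELDS.forEach((name, i) => {
      labelled[name] = row[i];
    });
    // cindex points at the value the rows were sorted by.
    labelled.sortValue = row[response.cindex];
    return labelled;
  });
}

// Invented sample body in the wrapped form, for illustration only.
const sampleBody =
  'mycallback({"error":null,"column":"z","cindex":5,"data_length":1,' +
  '"data":[["good morning",0.5,1.5,0.9,0.8,3.2,0.7,2.016]]})';

const rows = readRows(parseUpzillaBody(sampleBody));
console.log(rows[0].ngram, rows[0].sortValue);
```

Labelling each row by name keeps later code readable, while sortValue shows the one place where cindex is actually needed.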
API Call Examples:
Twit Trend Source Code Read the source HTML from the Twit Trend application. It includes the JavaScript used to process the Upzilla JSON data, both n-gram data and stats. It makes 4 calls to the Upzilla API to fill out the various charts and statistics at the bottom of the page. The calls use variations of the span, gram, order, and pass variables; direct the return data to a callback function; and select data using cindex.
API Demo Source Code Read the source HTML from this page and have a look at the block containing the API Demo. This section of the HTML is self-contained (set off by some comments) and includes all the CSS and JavaScript. It really takes very little to make the demo run.
Compared to the past week, what 2-grams are up today? That is, what are the emerging 2-grams today? span=1, gram=2, column=s OR column=z, order=desc
Compared to last week, what 2-grams are low today (i.e. declining)? span=1, gram=2, column=s OR column=z, order=asc
What are the most frequent words in the past month? span=7, gram=1, column=p2, order=desc

Use the n-gram p2 value and the data_tweets2 value to estimate the actual number of tweets with a given n-gram. You can get up to 1000, so this is really better than using the Twitter Lexicon data because the return is current.
What are the most frequent words in the past 24 hours? span=1, gram=1, column=p1, order=desc

Use the n-gram p1 value and the data_tweets1 value to estimate the actual number of tweets with a given n-gram. You can get up to 1000, so this is really better than using the Twitter Lexicon data because the return is current.
What are the most frequent words in the past week? span=7, gram=1, column=p1, order=desc OR span=1, gram=1, column=p2, order=desc

Use the n-gram p value and the data_tweets value that match your query (p1 with data_tweets1, or p2 with data_tweets2) to estimate the actual number of tweets with a given n-gram. You can get up to 1000, so this is really better than using the Twitter Lexicon data because the return is current.
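
The estimate described in these examples is simple arithmetic, sketched below. It assumes the p values are percentages on a 0 to 100 scale, which is how "percent of tweets" reads but is not confirmed by sample output on this page; the function names and the sample numbers are invented for illustration.

```javascript
// Estimate how many tweets in a span contain an n-gram.
// Assumes percent is on a 0-100 scale (an assumption, see the note above).
function estimateTweetCount(percent, tweetTotal) {
  return Math.round((percent / 100) * tweetTotal);
}

// Pick the matching pair from a row and its response:
// which = 1 uses p1 with data_tweets1, which = 2 uses p2 with data_tweets2.
function estimateFromRow(row, response, which) {
  const percent = which === 1 ? row[2] : row[1]; // row is [n-gram, p2, p1, ...]
  const total = which === 1 ? response.data_tweets1 : response.data_tweets2;
  return estimateTweetCount(percent, total);
}

// Invented numbers, for illustration only.
const exampleRow = ["the", 4, 5, 0.1, 0.1, 1.2, 0.9, 0.108];
const exampleResponse = { data_tweets1: 1000, data_tweets2: 8000 };

console.log(estimateFromRow(exampleRow, exampleResponse, 1)); // 5% of 1000 tweets
console.log(estimateFromRow(exampleRow, exampleResponse, 2)); // 4% of 8000 tweets
```

If the p values turn out to be proportions from 0 to 1 instead, drop the division by 100.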
API Demo:
Options: span = gram = column = order = offset = limit = pass = callback =
URL: http://www.illocutioninc.com/cgi-bin/upzilla_api.py?span=1&gram=1&column=s&order=desc&offset=0&limit=100&pass=&callback=demo_callback
Return: No data.


Copyright © 2013 Illocution Inc.
All Rights Reserved