Glossary


Adjudication: A process to validate the work of “coders” as well as to refine the coding criteria and develop a “gold standard” training set for machine learning.

 

ActiveLearning: A stepwise refinement methodology where humans and computers work together to code and classify raw data to gain valuable insights. It incorporates the coding choices of one or more coders over several iterations.

 

Annotation: Observations and bookmarks created by coders to highlight items of special interest.

 

API: An Application Programming Interface, used to collect data from third-party applications such as Twitter, SurveyMonkey, or Facebook.

 

Archive: Raw data gathered from social media, surveys, email, document collections, and spreadsheets.

 

Bucket: A subset of an archive or archives broken out for further analysis. Buckets can contain results from search, filtering, coding, duplicate detection, clustering, and other techniques.

 

Classifier: Criteria for coding data items, which can be binary such as “Yes” or “No”, “Is” or “Is Not”, or more complex such as “Large”, “Medium”, “Small”, “Not sure”.

 

Cluster: A group of data items that are near duplicates. You can see which text has been added, edited, or deleted between nearly identical texts.
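
As a rough illustration of how near duplicates can be compared, the sketch below uses Python's standard difflib module to score two texts and list the words added or deleted between them. It does not reflect how DiscoverText implements clustering; the function name word_changes and the sample texts are hypothetical.

```python
import difflib

def word_changes(original: str, variant: str) -> tuple[list[str], list[str]]:
    """Return (added, deleted) words when going from original to variant."""
    diff = list(difflib.ndiff(original.split(), variant.split()))
    added = [line[2:] for line in diff if line.startswith("+ ")]
    deleted = [line[2:] for line in diff if line.startswith("- ")]
    return added, deleted

# Hypothetical sample texts that differ by one word.
a = "Please protect our national parks today"
b = "Please protect our local parks today"

# Word-level similarity between 0.0 (nothing shared) and 1.0 (identical).
similarity = difflib.SequenceMatcher(None, a.split(), b.split()).ratio()
added, deleted = word_changes(a, b)
```

Texts whose similarity exceeds a chosen threshold could then be grouped into the same cluster; the threshold itself is a judgment call.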

 

Coder: A person to whom you can assign the task of coding. DiscoverText can provide several experienced coders, or you can assign your own personnel to be your peer coders. Work can be spread across several coders simultaneously.

 

Coding: The process of classifying datasets according to defined criteria, also known as tagging or labeling. DiscoverText has excellent tools to speed up this process. Human coders can code 150–200 items per hour in a simple case.

 

Comparison Tools: Allow you to compare the accuracy of individual coders.
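
One simple way to think about coder accuracy is the share of items on which a coder matches an adjudicated “gold standard”. The sketch below is only an assumption about how such a score could be computed, not the measure DiscoverText uses; the function name accuracy, the coder names, and the item IDs are hypothetical.

```python
def accuracy(coder_codes: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of items coded by both the coder and the gold standard that match."""
    shared = [item for item in gold if item in coder_codes]
    if not shared:
        return 0.0
    correct = sum(coder_codes[item] == gold[item] for item in shared)
    return correct / len(shared)

# Hypothetical gold-standard codes and two coders' choices for four items.
gold = {"t1": "Yes", "t2": "No", "t3": "Yes", "t4": "No"}
coders = {
    "alice": {"t1": "Yes", "t2": "No", "t3": "Yes", "t4": "Yes"},
    "bob": {"t1": "Yes", "t2": "Yes", "t3": "No", "t4": "No"},
}
scores = {name: accuracy(codes, gold) for name, codes in coders.items()}
```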

 

Data Feeds: Large streams of data from third parties such as Twitter, Facebook, RSS, and other online sources.

 

Data Unit: An individual line item of textual data, corresponding to a single row in a spreadsheet. Data archives contain many units.

 

Dataset: A collection of data derived from archives or buckets. Datasets are coded by humans and computers.

 

Deduplication: Detects duplicate data items. Very useful for survey responses, email collections, or social media collections that you suspect were spammed or mass-retweeted. Duplicate responses only need to be coded once. A valuable eDiscovery tool.
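
The idea that duplicates only need to be coded once can be sketched as follows. This is a minimal illustration that assumes duplicates are texts that match after lowercasing and collapsing whitespace; DiscoverText's actual matching rules are not described here, and the names normalize and deduplicate are hypothetical.

```python
from collections import defaultdict

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def deduplicate(items: list[str]) -> dict[str, list[int]]:
    """Group item indexes by normalized text; each group needs coding only once."""
    groups: dict[str, list[int]] = defaultdict(list)
    for index, text in enumerate(items):
        groups[normalize(text)].append(index)
    return dict(groups)

# Hypothetical survey responses; the first two differ only in case and spacing.
responses = [
    "Great service!",
    "great   service!",
    "Too slow.",
]
groups = deduplicate(responses)
```

A code applied to one member of a group can then be copied to every index in that group.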

 

Exports: Datasets can be exported in CSV, XML, or Zip file format so you can archive the data or analyze it in another application.

 

Filter: Create subsets of the data by selection based on key attributes such as location, date, data source, hashtag, and machine classification scores.

 

Memos: The messages, bookmarks, and any other reference information that is available.

 

Peers: Trusted and verified colleagues with whom you collaborate and share coding and data.

 

Project: Contains all the archives, buckets, and datasets for one line of inquiry.

 

Redaction: Makes some data “invisible” in reports for confidentiality purposes.

 

Search: Looking for particular terms or keywords.

 

TopMeta: The most important fields (attributes) used for advanced filtering of a data collection.

 

 

 

 
