Custom classifiers (or what we call "sifters") are not difficult to build. A small team, or a spirited individual, can build one before lunch. The trick is to understand some basic principles.
- Fewer categories are easier than many.
So, what is the optimal number? Two or three. We have built many effective classifiers with four, five, or six codes, but the trade-off is that you need to do more coding before you can be confident in the classification. Our advice is to start with two or three codes and then use the "split" feature to drill into the finer-grained categories.
- Balance your training sets.
If you create a dataset that is coded 95% category A, 3% category B, and 2% category C, the results may be very disappointing. To get reliable classification results, ensure that your dataset has a good mix of items from all categories. Using the search and bucket features is a good way to prepare a balanced dataset.
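To make the balance problem concrete, here is a minimal Python sketch that measures each category's share of a coded dataset and then trims every category to the size of the smallest one. It is not part of DiscoverText: the function names, the (text, code) pair layout, and the sample data are all assumptions made for illustration. Random downsampling is also a swapped-in technique; finding more examples of the rare categories through search and buckets, as described above, keeps more of your coded data.

```python
from collections import Counter
import random

def class_shares(labels):
    """Return each category's share of the coded items."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def downsample_to_balance(items, seed=0):
    """Randomly trim every category to the size of the smallest one.

    `items` is a list of (text, code) pairs (a hypothetical layout).
    """
    rng = random.Random(seed)
    by_code = {}
    for text, code in items:
        by_code.setdefault(code, []).append((text, code))
    smallest = min(len(group) for group in by_code.values())
    balanced = []
    for group in by_code.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Hypothetical skewed dataset: 95% A, 3% B, 2% C.
skewed = [(f"item {i}", "A") for i in range(95)]
skewed += [(f"item {i}", "B") for i in range(95, 98)]
skewed += [(f"item {i}", "C") for i in range(98, 100)]

shares = class_shares([code for _, code in skewed])
balanced = downsample_to_balance(skewed)
balanced_shares = class_shares([code for _, code in balanced])
```

With this sample data the balanced set keeps only two items per category, which shows how costly a skewed dataset is: most of the coding effort spent on category A goes unused.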
- Find good coders.
There is tremendous variation in the quality of the annotations that coders produce. Some are fast but inaccurate. Others are slow but highly accurate. The best coders are both quick and accurate. Contact Texifter if you need help finding coders; we know some good ones.
Note: Not all good coders are good for all tasks. In some cases, you need domain-knowledgeable coders.
- Use the validate dataset feature early in the process.
Our adjudication procedures are novel. We started building software (the Coding Analysis Toolkit) specifically to support the process of adjudication of coder disagreement. Our patent filing on "Coder Rank for Enhanced Machine Learning" builds on years of experience seeing widespread variation in coder ability.
When you first set out to create a classifier, try to get four of your DiscoverText peers to all code the same 100 items. Then go through the adjudication process. You will learn a lot about your data, the codes, boundary cases, and the coders.
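The following Python sketch illustrates what such a shared coding round can tell you. It computes simple percent agreement for each pair of coders and lists the items on which the coders did not all agree, which are the ones to bring to adjudication. It does not reproduce DiscoverText's adjudication or the Coder Rank method; the coder names, the data layout, and the five-item sample are hypothetical.

```python
from itertools import combinations

def pairwise_agreement(codings):
    """Fraction of items on which each pair of coders chose the same code.

    `codings` maps a coder name to a list of codes, one per item,
    with every coder's list in the same item order.
    """
    result = {}
    for a, b in combinations(sorted(codings), 2):
        matches = sum(x == y for x, y in zip(codings[a], codings[b]))
        result[(a, b)] = matches / len(codings[a])
    return result

def items_to_adjudicate(codings):
    """Indices of items where the coders did not all agree."""
    columns = zip(*codings.values())
    return [i for i, codes in enumerate(columns) if len(set(codes)) > 1]

# Hypothetical round: four coders, five items (a real round would use 100).
codings = {
    "coder_1": ["A", "B", "A", "C", "A"],
    "coder_2": ["A", "B", "B", "C", "A"],
    "coder_3": ["A", "B", "A", "C", "B"],
    "coder_4": ["A", "A", "A", "C", "A"],
}
agreement = pairwise_agreement(codings)
disputed = items_to_adjudicate(codings)
```

Percent agreement is the simplest possible measure and does not correct for chance agreement, so treat it as a first look rather than a verdict on any coder.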
- Iterate, iterate, iterate.
Repeat this process as many times as needed. After each round, retrain the classifier and be sure to exclude the invalid items when you "rebuild" the classifier via the ActiveLearning Advanced Options. Gradually you will weed out the false positives that result from the classification.
- Use the classifier scores to pull new samples of high-value items.
Once you have completed one or two rounds of coding for a new classifier, set up a filter with the following two criteria:
- Items that have not been coded
- Items scored above 95% likely to be in a category, for one code or for all codes
Put the results in a bucket and create a new dataset using the random sampling tool. These are your high-value items. Coding this new dataset helps cut down the false positives in your classification results.
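The two filter criteria and the random sampling step can be sketched in Python as follows. This is only an illustration of the logic, not the DiscoverText interface: the item dictionaries, the "code" and "scores" field names, and the high_value_sample function are assumptions.

```python
import random

def high_value_sample(items, threshold=0.95, sample_size=100, seed=0):
    """Randomly sample uncoded items that the classifier scores above
    `threshold` for at least one code.

    Each item is a dict with a "code" (None if not yet coded) and a
    "scores" dict mapping each code to a likelihood between 0 and 1.
    """
    candidates = [
        item for item in items
        if item["code"] is None and max(item["scores"].values()) > threshold
    ]
    rng = random.Random(seed)
    k = min(sample_size, len(candidates))
    return rng.sample(candidates, k)

# Hypothetical items after one or two rounds of coding.
items = [
    {"id": 1, "code": None, "scores": {"A": 0.98, "B": 0.02}},
    {"id": 2, "code": "A", "scores": {"A": 0.99, "B": 0.01}},   # already coded
    {"id": 3, "code": None, "scores": {"A": 0.60, "B": 0.40}},  # not confident
    {"id": 4, "code": None, "scores": {"A": 0.03, "B": 0.97}},
]
sample = high_value_sample(items, sample_size=2)
```

Only items 1 and 4 pass both criteria here. Coding items that the classifier is very sure about is a quick way to find the cases where it is confidently wrong.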