Textualize: an interactive text visualization and writing development tool
Source code: Textualize.zip
Data Domain & Visualization Considerations
Our data domain is text. The goal of our software is to develop a (school-aged) student's writing skills by offering interactive feedback in the form of data visualization. We initially considered which aspects of text would be most critical for our audience, as well as the visualization techniques that might best provide opportunities for learning:
Dimensions of Text
Data Visualization Options
We decided on diction, or word choice, as our textual feature. We made this decision based on the fact that young writers often overuse the same words in their written work. A tool that highlights occurrence and proximity of word use would then be useful in helping them develop their writing. They could also upload samples of writing by more experienced authors (e.g. Hemingway, which is our default text) to investigate the patterns of diction that they employ. To this end, the program will be set up as an interactive feedback loop, with dynamic data and a number of opportunities for the user to interact with the data. The three main panels on the interface consist of a text box for the document, a visualization to show word-frequency data within the document, and a third visualization to show location(s) of particular words in the document:
Initial Story Board
Revised Story Board
The text box, where the document can be written and/or revised in real time, is a simple scrollable window (200 x 580 pixels). We chose that size in order to display a couple paragraphs' worth of text at a time within the limits of our 800 x 600 visual display area, and we want it to be scrollable in order to maintain the default readable font size. We also deemed that showing this volume of text at a time (~1800 characters) is appropriate to our audience: school-aged students. We positioned the document text box on the left of the screen, since we usually read left-to-right in our culture, and this is where the loop of interactive feedback begins. First the data (the text) must be entered and/or changed, in order for the user to process it, moving his/her attention to the right of the screen.
For the visualization where we're showing word frequency within the document, we chose a simple histogram bar-chart, which is appropriate for showing frequency of occurrences. We decided on vertical bars because it is a simple, conventional way to draw a histogram. We want to keep the data as uncomplicated as possible, so it can be used by a young student and/or busy teacher. We have distinct words from the document along the x-axis (nominal data), and frequency (count) of each of those distinct words along the y-axis (quantitative data). Both axes scale themselves according to the data in the document, so the user can maintain the full overview of his/her word-use frequency in the document. There is an interactive filter box if the user wants to "zoom in" for greater detail.
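The self-scaling axes described above can be sketched as follows. This is only an illustration in Python, not the Flare code the application uses; the function name `bar_layout` and the panel dimensions passed to it are hypothetical, and it assumes every count is a positive word frequency.

```python
def bar_layout(counts, panel_width, panel_height):
    """Return one (x, y, width, height) rectangle per count, filling the panel.

    Hypothetical helper: the x-axis is divided evenly among the bars and the
    y-axis is scaled so that the largest count spans the full panel height.
    """
    if not counts:
        return []
    bar_width = panel_width / len(counts)
    max_count = max(counts)  # assumed > 0, since these are word frequencies
    bars = []
    for i, count in enumerate(counts):
        height = panel_height * count / max_count
        # Screen coordinates grow downward, so anchor each bar to the bottom edge.
        bars.append((i * bar_width, panel_height - height, bar_width, height))
    return bars
```

Because both scales are derived from the data, adding words makes the bars narrower and a new most-frequent word rescales every bar's height, so the whole distribution always stays in view.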
The bars do not have labels along the x-axis because we ultimately thought it would create too much noise, with all those words packed tightly together. Instead, you can get the x & y values for a bar (the word and precise frequency) by hovering over the bar to activate a tool tip box containing that information. We're also including a mouse-hover feature that highlights the bar the mouse is over.
By default, the data in the histogram is sorted greatest-to-least, left-to-right, for easy, concise identification of the most frequently used words. Words that occur with equal frequency are ordered alphabetically along the x-axis. The interactive filter field is in close proximity to the histogram, since they're closely related, and above it so it's noticeable. It contains the following default-filtered words: and, but, or, the, a, an, & of. We chose these ("stop-")words because they are non-semantic, high-frequency words (common articles, prepositions, and conjunctions). The user can add to or revise the list of filtered words.
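The counting, filtering, and ordering rules above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the ActionScript parser the application uses: the name `word_frequencies` and the regular expression that defines a "word" are hypothetical. Lowercasing reflects the case-insensitivity of the final product.

```python
import re
from collections import Counter

# Default filter list quoted from the text above.
DEFAULT_STOP_WORDS = frozenset({"and", "but", "or", "the", "a", "an", "of"})

def word_frequencies(text, stop_words=DEFAULT_STOP_WORDS):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    # Assumption: a word is a run of letters, optionally with internal apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    # Sort by descending count, then alphabetically, matching the histogram's x-axis order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

For example, `word_frequencies("The cat and the dog saw a cat. Dog, cat, bird!")` returns `[("cat", 3), ("dog", 2), ("bird", 1), ("saw", 1)]`: the stop-words are dropped, and the two single-occurrence words fall back to alphabetical order.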
If the user clicks on a bar in the histogram, the software draws a univariate dot plot beneath the histogram, on the "third panel" of the application, to show the relative location of every occurrence of that word in the document, ordered by position. This dot plot serves as an overview, and the document box itself becomes the detail, as that particular word turns orange (and gets bigger) everywhere it appears in the document. The position and proximity of the dots show patterns in word frequency, as well as the exact location of the first character of each occurrence relative to all the characters in the document. We chose to visualize word location by ordinal character instead of by ordinal word because, although the two would likely produce fairly similar visualizations, this method more closely (spatially) correlates the dot plot with the text box. The dot plot updates as the user revises the text in the document, which provides visual feedback as he/she interacts with the data. Taken together, these dot-plot features can be categorized as linking and brushing.
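Positioning by ordinal character can be sketched as below. This is a Python illustration rather than the application's Flare code; the names `word_positions` and `dot_x`, and the use of whole-word, case-insensitive matching, are assumptions.

```python
import re

def word_positions(text, word):
    """Return (char_offset, relative_position) for each whole-word occurrence.

    relative_position is the offset of the word's first character divided by
    the total number of characters in the document, so it lies in [0, 1).
    """
    if not text:
        return []
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    total = len(text)
    return [(m.start(), m.start() / total) for m in pattern.finditer(text)]

def dot_x(relative_position, plot_width):
    """Map a relative position to a horizontal pixel coordinate on the dot plot."""
    return relative_position * plot_width
```

Since each dot depends only on a character offset and the document length, recomputing the list after every edit is enough to keep the dot plot in step with the text box.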
Similarly, the user can double-click on a word from the document text box, or mouse-highlight it, to activate that word in all three panels (the text box, the histogram, and the dot-plot).
With respect to color, we chose black (for the characters) and grays & blues (for the backgrounds & visualizations) that would be friendly and soft on the user's eyes. When a user hovers the mouse over a piece of data on the histogram, it highlights that bar in the negative and displays the tool tip box showing the specific data. If the user clicks on that bar, it turns orange to complement the color of the other bars (which are blue/periwinkle). The background color of the tool tip box (a faint yellow) is based on a general color convention for tool tips. None of the colors in the visualization is saturated. The orange of a clicked-on piece of data is consistent throughout the visualization: the highlighted bar on the histogram, the dots in the univariate dot plot, and the color of that word everywhere it appears in the document. This shows the user that the three visualizations (histogram, dot-plot, text box) are referencing the same piece of data. The data visualization and the ability to interact with the data, however, are not dependent on the color change: the words in the text that change color also increase in font size relative to the rest of the document, to facilitate visualization. This enhances the overview-and-detail effect between the text box and the dot-plot, and creates a focus-and-context effect within the text box. It also keeps the highlighting perceptible to color-blind users.
Finally, the user can click on a word that has been highlighted in the text box, after it has become activated. In this case, that word has become a link to an online thesaurus page that offers synonyms to that particular word. This could come in handy for possible revisions the user may wish to consider.
Our interaction techniques, then, include: filtering by dynamic query, overview-plus-detail, focus-and-context, linking & brushing, details on demand via tool tips, and dynamic data (the text itself).
We chose Flash/Flare as the toolkit for implementing our visualization, for two reasons. First, and most critically, it supports the features we wished to implement. Second, we wished to gain experience in Flash programming, which was new to both of us.
Our final interactive visualization application looks very much like our description above, though a bit different from the original storyboards. The main changes we made between the description above and our final product were:
* adding automatic data transformation by eliminating case-sensitivity in the text (the data), and
* allowing filtered words to be separated by just a space (commas are still accepted but not required).
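Both changes amount to normalizing input before it is compared. A minimal Python sketch of how the filter field might be parsed, assuming a hypothetical helper named `parse_filter` (the application itself does this in ActionScript):

```python
import re

def parse_filter(raw):
    """Split a filter string on commas and/or whitespace into a lowercase set."""
    # Any run of commas or whitespace counts as a single separator,
    # so "and, but" and "and but" yield the same set of words.
    return {token.lower() for token in re.split(r"[,\s]+", raw) if token}
```

Lowercasing the filter words here, and the document's words when they are counted, means "The" and "the" are treated as the same word in both places.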
Between the storyboards and the final product, we iterated our design using various sketches, revisions to the text ideas, and coding capabilities of Flare. For example, the storyboards included horizontal scrollbars for the histogram and the dot-plot (with insets showing the full view); the final product did not. One storyboard shows labels along the x-axis; the final product does not. Also, we revised some of the colors we used, and the placement of items on the page, like the filter query box.
Working as a pair, we collaborated on all the decisions we made about the product, at every iteration in the process. Both Coram and Christopher drew storyboards. Coram wrote the code, as he has experience in Computer Science, and Christopher does not. Coram learned Flare by use of the tutorial provided on the course wiki, as well as ad hoc Internet searches. Coram also taught Christopher some basic concepts of coding as he worked on building this application in Flare. Christopher composed the written portions of the assignment, with Coram's input and feedback throughout the process. In sum, Coram spent approximately 20 hours on the assignment independently, Christopher spent approximately 5 hours on the project independently, and the two worked together on the project for approximately 8 hours. The aspects that took the most time were learning to use Flare and ActionScript, and developing the text parsing implementation.
Limitations / Areas for Improvement
The text parsing capabilities of our program are limited, although they are sufficient to support our visualization objectives. Future enhancements might include advanced text parsing capabilities, text-based search, more user-friendly filtering tools, additional dimensions by which to analyze the data (e.g. sentence length across the document), and improved rendering performance.