
Lecture on Oct 5, 2010. (Slides)

• Required

• Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases. Stolte, Tang, and Hanrahan. IEEE Transactions on Visualization and Computer Graphics, 8(1), Jan 2002. (pdf)

• Multidimensional detective. A. Inselberg. Proc. IEEE InfoVis 1997. (pdf)

• Optional

• Dynamic queries, starfield displays, and the path to Spotfire. Shneiderman. (html)

acravens wrote:

The Inselberg article's discussion of parallel coordinates reminded me of the work of geographer André Skupin (http://geography.sdsu.edu/People/Pages/skupin/index.htm). His poster "In Terms of Geography" (http://www.scimaps.org/maps/map/in_terms_of_geograph_92/) in the Places and Spaces (http://www.scimaps.org/) exhibit that toured through Stanford last year is based on a similar idea of representing each point as an n-dimensional vector. Instead of graphing those vectors in parallel coordinates, however, he uses cartography analogues to represent them as a map of abstract knowledge space. While this approach probably doesn't have the same interactivity for certain modeling tasks as Inselberg describes, once the map is created, it can be analyzed interactively in standard GIS software. For certain data sets, this fairly unconventional approach could have advantages in helping understand structural relationships. For instance, he's used it to map the "landscape" of geography academia and scientific research.

skairam wrote:

It might take a few more examples for me to understand, but I didn't find Inselberg's example of VLSI chip manufacturing yield/quality detective work extremely compelling. The process of identifying important lines seemed difficult if it's supposed to be done completely by looking at the original graphic (Figure 1).

It seemed like the kind of thing that you could solve much more quickly and effectively with some sort of statistical method (NB: I'm not a statistician by any measure). Except for the fact that you are trying to maximize 2 variables (quality and yield), it seemed as if something like a linear regression would be key (though I don't know if there's any statistical test that explores the relationship between a set of independent variables and 2 dependent variables).

msavva wrote:

The concluding remark in the Polaris paper recognizes that the lack of animation in the system prevents the user from sequencing data on the time axis, which is useful in practice for recognizing patterns varying over time. This remark made me also realize that the lack of transition animations between different views of a data-set very often breaks the sense of continuity when switching through different visualizations (and it seems to me that this would come up very frequently in a data exploration tool such as Polaris). At least going by personal experience, when trying to make the connection between a new view and one that I was looking at a second ago, I need a significant amount of time to tease out correspondences by visually searching and matching patterns. This seems to become a problem especially when reshuffling data through a transform that changes the ordering along a dimension. Animating the marks from one view to the next helps in this respect and I'm always impressed by the intuitive feel of continuity given by animated transitions in interactive visualizations. Admittedly, an obvious mapping between two views does not always exist but even the simple idea of "old mark to new mark morph trajectories" makes it much easier to keep track of what is where.

rparikh wrote:

In the paper about Polaris, one of the coolest features that stuck out to me was the set of visual data transformation tools, especially "Brushing" (section 5.3). I was thinking of Polaris as essentially Excel for more complicated n-dimensional databases, but visual manipulation features like that could prove very useful. To me this seems like a nifty direct manipulation feature, akin to many cool graphical features that show up in movies (e.g. the protagonist notices a very subtle trend in the numbers and then highlights it to show the audience a very obvious smoking gun...).

Polaris also reminded me of my first summer internship the summer before college. I worked for a startup ChaCha as a "development intern." One of the tasks they had me do was use Omniture, their analytics software, and write scripts to download CSV files from them on a daily basis and then visualize those using "open source business intelligence software" called Pentaho. I had no idea what data cubes or slicing or dicing meant but I dealt with those all summer...

ankitak wrote:

One of the interaction techniques implemented in Polaris is unlimited "undo and redo" in an analysis session. However, to use this, users have to use the back and forward buttons, which move them to the previous or next visual (similar to a browser). Since the visual specification at each stage is stored in the system anyway, the interface could have a camera roll at the bottom showing thumbnails of all the visuals in chronological order of creation, which when selected would show that visual on the main panel. This would make the analysis history more accessible and usable to analysts. Importantly, being able to sift through the visuals quickly (say by using the left and right keys on the keyboard) would simulate animation, which can help analysts recognize various patterns.

Moreover, Stolte et al. discuss that Polaris focuses on providing an effective exploratory interface rather than attaining interactive query times. There seems to be some new work in this direction presented in a poster at InfoVis 2008. (This is also related to the database exploration tools mentioned in the related work section.) Though I am unable to understand their work from the slides they have uploaded, it seems that it focuses on optimizing interactive visualization for relational database exploration.

However, I have to admit that I am impressed by the ideas presented in the paper and their elegant implementation techniques. One of the features that I felt might be crucial in many applications was the ability to visually join multiple data sets having common x and y axes into a single display. Though such a feature already exists in some image editing tools, when used for data visualization it can help in the analysis of relationships between various trends (for example, between pollution and global temperatures, literacy and poverty, or the rise and fall of people involved in various occupations).

rakasaka wrote:

Chris Ahlberg's FilmFinder reminded me of Hans Rosling and his GapMinder project - it would be nice to have a talk from him.

I was more impressed with Polaris than with the multi-dimensional detective approach proposed by Inselberg. Nonetheless, the suggestions for decomposing "messy" data in order to reveal more informative displays are worth noting.

Instead of a similarity to Excel like rparikh mentioned above, I felt Polaris was more of a mashup between database query tools (like Toad) and GIS tools (like ArcGIS) which can provide layers of geographic information to display information both at the macro and micro level. I was particularly impressed with the normalized set form of an expression - a logical subgrouping of data no doubt proves to be essential in understanding the whole data set.

With such power and efficiency in Polaris one can only ask when it is that data visualization tools will become context-aware and be able to generate new and innovative visualizations! Clearly with the reduction in effort in generating new views it's only a matter of time, I suppose.

strazz wrote:

I consider the insights presented by Inselberg to be very valuable and clearly presented. When thinking about exploratory analysis, he distills its biggest issues into phrases like "do not let the picture intimidate you... carefully scrutinize the picture... test obvious assumptions..." and so on, which makes it easier to grasp the benefits and issues related to it. In addition, I also liked that he states that the efficacy of a visual data mining tool can only be judged when applied to real databases, and that it is always relative to the objectives or stories we're trying to prove. On the other hand, I think the work on Polaris was very comprehensive; their use of algebra and Bertin's encodings is very compelling (well, I have some doubts about their shape encoding for nominal data, as it wasn't very intuitive to analyze). I believe they accomplished their goal of creating a visual tool to rapidly change visualizations of data and use those capabilities to find patterns. For example, even a simple pivot table with graphical marks encoded by size proved useful for finding patterns in the data that would have been very hard to detect in the raw numbers.

gneokleo wrote:

I found Inselberg's approach of "investigating" data through visualization very interesting. It's obvious that we can identify patterns or mistakes in datasets more easily when looking at a graph rather than at the data tables, and the author stresses the importance of this through VLSI examples. The author often questions the data, and rightly so, since as we saw in class, the data we are working with cannot always be assumed to be true. It's also very interesting to see the evolution of these tools from Inselberg's time to Protovis and other tools like Tableau. However, there is something that struck me while I was reading the papers. The authors mention that it's useful to look at data from different perspectives, which requires a comprehensive tool that covers many different charting and visualization automation techniques. Fortunately, tools like Protovis and Tableau have become very powerful and fast, allowing quick transformations of data. Protovis also reaches a wider range of users, since the direct manipulation interface techniques it uses are much more intuitive compared to other graphing tools. This intuitiveness and ease of use also impressed me in Tableau, even though I still think there are a few things missing, especially types of graphs, when compared to R for example.

lekanw wrote:

I think it's interesting that the most compelling data visualization tools often depend on the user being able to ask smart and probing questions of the data, and on being able to respond to those questions, rather than merely on the ability to organize the data visually. I think this gets to the core of why we even do data visualization: people are still vastly superior to computers for certain pattern recognition tasks, so much of the time a smart analyst paired with a good, flexible, human-centric data visualization tool can outperform a computer-centric one. There are specialized exceptions, especially in bio and other sciences.

jorgeh wrote:

"... One idea, a la Pad++ [4], is to change the visual representation as we change the level of detail; ..." (from the Conclusions and Future Work section of the Polaris paper). I've been thinking about this idea for some time now. Does anybody know if Tableau (I understand that it derived from Polaris, right?) or any other data-vis software has implemented anything like this?

hyatt4 wrote:

I appreciated Inselberg's various admonitions, beginning with "do not let the picture intimidate you." In a previous post, I wondered about the trade-off between the utility of a visualization and its attractiveness. Here, the data visualizations are unapologetically grotesque in an artistic fashion (my apologies to any artists out there who feel otherwise), but that is not his concern. His concern is to be able to visually identify cues and information, and as such one must be prepared to work hard and analytically. The information being looked at here is not light entertainment for a casual reader, but rather insight for politicians or businesses interested in reaching or furthering their goals.

Parallel coordinates are really interesting, not only for their use in analysis, but also because they are able to use a position encoding for all of the dimensions of a data set. According to Mackinlay's rankings of the effectiveness of visual encodings, the parallel coordinates approach should give extremely effective results.

@rparikh – I also thought that brushing was a really cool interaction/manipulation technique. I'd like to see an example of that kind of interactivity applied to the parallel coordinates visualization (maybe particular lines or ranges could be selected directly). More generally, the elegant integration of controls for input (e.g. selection) directly onto a visualization seems much more powerful than a config sidebar when it comes to exploratory analysis.

trcarden wrote:

I really liked the Chernoff faces presented today in lecture. On the Los Angeles chart I could get a feeling of "happiness" immediately without having to consult the legend. Where I would want to live would happen to be where there was low unemployment, an affluent neighborhood, and relatively low urban stress (happy round smilies). The only issue I have with the encoding is that there was a seemingly positive and negative connotation with regard to race. A black or gray smiley doesn't encode happiness quite like a yellow/tan one does. There is an ordinal value to colors with regard to their connotations in graphics (i.e. colors have more meaning than simply nominal values when placed on faces). Black and gray seem to connote sickness or death, whereas tan/yellow is health or life. To be fair, races should be presented in another way, such that color connotations don't, as Tufte would say, "lie" to us.

jasonch wrote:

While I appreciate Inselberg's "lessons" like carefully scrutinizing pictures and not letting the picture intimidate you, I don't feel that the parallel coordinates graphs are a good demonstration of visualization. For one, putting different types of data, each with its own range and scale, side by side seems problematic (it may introduce irrelevant correlations?). Also, using only one type of encoding seems overloaded and overwhelming, as I believe many people felt when they first saw "figure 1." All in all, parallel coordinates may be great as an exploratory tool, but they almost require being used exploratorily, which makes them less expressive than some other visualizations that use various encodings.

gdavo wrote:

@skairam:

I was not completely convinced by Inselberg's demonstration either. I am impressed that he is able to find patterns in the data by looking at the parallel coordinates, but it looks quite tedious, even with the help of his maxims. The Polaris paper is only 5 years younger (2002 vs. 1997) and yet the tool looks so much more powerful than parallel coordinates. Just one example: parallel coordinates are limited to a "small" number of rows. The complete VLSI example with 473 batches is terrible; you don't even know if a line stands for one row or hundreds of superimposed rows. On the other hand, Polaris can deal with huge databases, as we saw during today's lecture. Thanks to the brushing tool, it seems it would be easier and more user-friendly to derive the same conclusions about the high yield/quality batches.

yanzhudu wrote:

I think what we can learn from Inselberg's paper is the emphasis on incremental development of a visualization, discovering patterns along the way. Visualization design is a creative process. It is unlikely that we can get everything right or discover everything on the first attempt; hence Inselberg's "do not let the picture intimidate you" and "you can't be unlucky all the time" suggestions.

The Polaris/Tableau visualization software makes this incremental refinement process much easier, allowing us to discover data patterns by trial and error.

jbastien wrote:

I might be old-school or yet-to-be-converted, but I think that exploratory analysis of non-trivial datasets is still much easier with a command line interface than with fancy visualization tools.

What I mean by this is that data is usually spread across different tables; it's very noisy and non-uniform, and there are a lot of useless columns. I find that the visualization frames the end result, whereas having tables and data lends itself to deeper exploration.

It's also much easier to transform the data with a command line interface than with a tool.

I'm far from saying visualization is useless; indeed, once the data is nice and manageable, I think a visualization tool allows the user to frame questions. The answers to these questions then allow the user to find more questions to ask, and he often has to go back to the command line interface and dig some more.

abhatta1 wrote:

While appreciating the many unique features that Polaris offers, including brushing, sorting, undo/redo, etc., I wonder about one important performance question discussed briefly in the paper: Polaris might take a long time (~seconds) to generate graphs from complex multidimensional datasets. Is there a need for, or does there already exist, software that does real-time multidimensional data analysis?

@jbastien

I think I half agree with you. I don't want to open the doors to the "command line vs graphical interface" debate, but I do agree that when you have a complex dataset with multiple tables and complex relations, sometimes it is nice to be able to "walk" the data to build some intuition about your dataset (and if walking the data is easier for you via a command line tool, so be it).

What I mean by "walking" the data is starting with a list of records and then being able to (a) easily look at the records' attributes and (b) easily traverse relations and investigate associated records in another table (for example, JOINs for SQL tables). While this may not immediately give you interesting results, I do think it is a great way to quickly understand how the schema is constructed, which then allows you to potentially see interesting ways of generating visualizations. None of the tools I have used so far do a good job of this. Most SQL clients do not allow you to naturally navigate foreign keys and force you to write long, complicated queries. The closest thing I have found is using an object-relational framework (for example Core Data in Mac OS X or ActiveRecord in Ruby on Rails), but these obviously require you to write custom code, which is hardly convenient when you want to quickly explore a new dataset. I would love a tool that makes navigating a SQL database or OLAP cubes as easy as exploring the folder structure on your hard disk.
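A rough sketch of what I mean by "walking": the schema itself already knows the relations, so a tool could follow them automatically. Here is a minimal Python illustration using the built-in sqlite3 module and a made-up two-table schema (the tables, names, and data are all hypothetical):

```python
import sqlite3

# Hypothetical two-table schema, just to illustrate walking a relation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE artists (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE albums  (id INTEGER PRIMARY KEY, title TEXT,
                          artist_id INTEGER REFERENCES artists(id));
    INSERT INTO artists VALUES (1, 'Miles Davis');
    INSERT INTO albums  VALUES (1, 'Kind of Blue', 1);
""")

def foreign_keys(conn, table):
    """Discover a table's outgoing relations from the schema itself."""
    # PRAGMA foreign_key_list rows are (id, seq, ref_table, from_col, to_col, ...)
    return [(row[3], row[2], row[4])
            for row in conn.execute(f"PRAGMA foreign_key_list({table})")]

def walk(conn, table, rowid):
    """Follow every foreign key of one record to its referenced row."""
    related = {}
    for from_col, ref_table, to_col in foreign_keys(conn, table):
        (fk_value,) = conn.execute(
            f"SELECT {from_col} FROM {table} WHERE id = ?", (rowid,)).fetchone()
        related[ref_table] = conn.execute(
            f"SELECT * FROM {ref_table} WHERE {to_col} = ?",
            (fk_value,)).fetchone()
    return related

print(walk(conn, "albums", 1))   # {'artists': (1, 'Miles Davis')}
```

A real tool would of course also walk relations in the reverse direction and paginate, but even this much beats writing the JOIN by hand every time.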

I am obviously not criticizing Polaris or Tableau - they do an amazing job for what they are. I guess my main point is that when you have a complex dataset it would be awesome to have a tool that allows you to quickly glance over the raw data. Spending some time and looking at this data is really important to posing interesting question that can be answered through compelling visualizations.

ankitak wrote:

In class we looked at the scatterplot matrix (SPLOM). Since the top and bottom triangles in this matrix are mirror images of each other, I wonder how displaying only one of the triangles would affect the effectiveness of the visualization. On one hand, it would increase effectiveness by removing redundancy from the visual, helping the user focus on the data. On the other hand, showing both triangles gives a good visual impact due to symmetry and lets you look at each pane in two different orientations.

@abhatta1: I was thinking about the same problem, but haven't come across any other such tool. However (as mentioned in my earlier comment), a poster at InfoVis 2008 seems to focus on optimizing interactive visualization for relational database exploration (though their techniques and methods of implementation are not mentioned in the attached slides).

mariasan wrote:

@lekanw I agree. When Prof. Heer was demoing Tableau in class I initially thought it would work just as well to have the program generate a suite of the most common visualizations from a group of user-highlighted variables. It wasn't until I saw a few of the "this isn't very interesting" examples that it hit home (again) how much better humans are at interpreting data.

Maybe Tableau already has this, but if not, I wish there were a way to capture the path of my analysis that I could later "replay". If you looked at a bunch of users, I bet you could find some interesting analysis patterns. And make a visualization of them! :P

amirg wrote:

We talked briefly in class about how visualizations can help you identify problems with your data. Inselberg speaks to this with his maxim of "test the assumptions, and especially the 'I am really sure of ...'s". I think this is important because actually testing your assumptions can lead you to surprises in your data or point out problems with it. One of the things I found most interesting about the bacteria/antibiotic data set was that one of the bacteria labeled as a member of the Streptococcus family was not similar to the other two at all in terms of its drug resistance. With the right visualization, and understanding that our notion of the bacterial families was not complete, maybe there would have been more questioning of which family the bacteria correctly belong to. Of course, this is much easier said than done, because it is extremely difficult to formulate all of our assumptions about a data set, in large part because it is hard for us to know what our assumptions are.

These ideas lead me to the following question: Is it possible to formulate and test our assumptions about a data set? How would we go about doing so?

The notion of testing assumptions is also reminiscent of the comment by Martin Wattenberg in the ACM Queue interview, where he said that one of the signs that a visualization is actually good is that you start to notice that your data was different than what you expected it to be so you've already learned something new about it.

estrat wrote:

I've always had problems with a p-value of 0.05. It always seemed way too high. A 5% chance of rejecting a null hypothesis that is in fact true seems unacceptable. I guess you have to compare that probability to the probability of other errors in the experiment (experimental methodology, randomness of subjects, etc.), so maybe it's not as unreasonable as it seems, but I still feel like 0.01 ought to be the standard (if not lower). I wonder how many experiments out there are rejecting a null hypothesis that's true.

esegel wrote:

For some reason, parallel coordinates typically generate more scoffs than insights when presented. What are the prejudices against this design? Parallel coordinates present multi-dimensional data via strongly-perceived positional encodings. When dealing with lots of series, this visualization highlights correlations in the data and "clumps" of similarly behaving series—particularly when the dimensions are ordered appropriately. This is achieved without aggregating (i.e. summing, averaging, taking the max of) any of the dimensions. What other visualization type allows such clear presentation of high-dimensional data?

Of course, parallel coordinates can be abused and can sometimes be replaced by other types of visualizations. For example, a dataset with 3 dimensions (e.g. time, position, size) should probably not use parallel coordinates; instead, do a standard 2-axis chart (position vs. time) and encode size with the size of the data points. This strategy of encoding variables in the size, shape, color, etc. of the data points can indeed add dimensions to standard graph formats. But what do you do for high-dimensional datasets (e.g. 5+ dimensions)? (The main example used in the paper had 15 dimensions.) There are only so many features you can stick on data points before they become unreadable, and none of these encode as expressively as the position encoding used by parallel coordinates.
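For what it's worth, pandas ships a basic parallel coordinates plot, so trying the technique takes only a few lines. A minimal sketch with made-up batch data (the column names and numbers are invented; also note that pandas does not rescale each axis, so columns with very different ranges should be normalized first):

```python
import matplotlib
matplotlib.use("Agg")          # render off-screen, no display needed
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Toy batch data, made up for illustration: each row is one batch,
# each column one process parameter, plus a class label for coloring.
df = pd.DataFrame({
    "yield":   [0.92, 0.95, 0.70, 0.68],
    "quality": [0.88, 0.90, 0.60, 0.55],
    "temp":    [0.40, 0.35, 0.85, 0.90],
    "dose":    [0.40, 0.50, 0.90, 1.00],
    "grade":   ["high", "high", "low", "low"],
})

# One vertical axis per column; each batch becomes one polyline.
ax = parallel_coordinates(df, class_column="grade", colormap="coolwarm")
ax.figure.savefig("parcoords.png")
```

Even on this toy example the "high" and "low" batches separate into two visible bundles of lines, which is exactly the clumping behavior described above.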

sholbert wrote:

I was dumbstruck by the demonstration in class. Tableau is so cool! The software does an amazing job of contextualizing and customizing the data, tailoring it to the story that you are trying to tell without sacrificing usability.

I just saw this visualization recently and thought it was pretty interesting: http://mbvc.tumblr.com/post/1255452520/6-surprisingly-effective-treatments-for-depression

However, the "effective" y-axis is likely an aggregation of different test statistics, and I think the raw form would be interesting to play with in Tableau.

jtamayo wrote:

Tufte talks about small multiples and how they're effective because after the reader understands the first one, it's straightforward to interpret the data in the other ones. This is true because in his examples each small multiple is identical to the others, except for the data contained in it.

In the examples presented in lecture, however, each small multiple was a different design for the same data. This approach, coupled with interaction, allows the user to get a better sense of the data by comparing different views of it.

It seems, then, that we can either vary the data and preserve the design, or vary the design and preserve the data. Both can be effective ways of visualizing multidimensional data.

msewak wrote:

The Tableau demo was so cool, it blew my mind.

In one of Tufte's chapters, he says that half Chernoff faces represent the same amount of data as full faces. I think it would be incredibly strange to redo the LA map with half faces.

I liked the way we can project multidimensional data into two dimensions by doing a principal component analysis on it, especially in a way that preserves patterns and clustering in the data. It would be harder to find any semantic meaning in data composed of many dimensions.
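A minimal sketch of that projection in plain NumPy (the clustered data is made up for illustration): center the data, take the top two right singular vectors, and project onto them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 points in 5 dimensions with two hidden clusters,
# invented here just to show that PCA preserves the clustering.
a = rng.normal(loc=0.0, scale=1.0, size=(100, 5))
b = rng.normal(loc=4.0, scale=1.0, size=(100, 5))
X = np.vstack([a, b])

# PCA via SVD: center, then keep the top-2 right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T        # 200 x 2: coordinates in the first two PCs

# Fraction of total variance captured by the two kept components.
explained = (s[:2] ** 2).sum() / (s ** 2).sum()
print(projected.shape, float(explained))
```

Plotting `projected` as a scatterplot would show the two clusters clearly separated along the first component, which is what makes PCA useful as a front end to a 2D visualization.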

felixror wrote:

Regarding Inselberg's paper on parallel coordinates, I don't find the technique that compelling, since all we have to do is stare at the master visualization (Fig 1 in the paper) and manually spot the patterns ourselves. Though the systematic and incremental approach introduced by Inselberg is good to learn, there should be better visualization techniques out there to reduce the manual effort of pattern detection. One thing I notice is that Inselberg's paper was written in 1997; I do not think there were many tools for exploring multidimensional data back then. More advanced database visualization tools like Tableau were rolled out in 2003, 6 years after Inselberg's paper was written. I think, from what I learnt from the demo in class, that Tableau is a more convenient interface for playing around with multidimensional data. We can proceed by using Tableau as a tool for carrying out Inselberg's critical approach to identify the underlying patterns in the data.

dlburke wrote:

@estrat I was thinking the same thing. I guess once the number of test cases becomes great, it is a reasonable assumption. But take the example on Wikipedia concerning coin flips: 14/20 heads gives a p-value of .058. Fourteen of twenty flips does not strike me as that remarkable. But how many cases are necessary to start considering a .05 p-value significant? Obviously, when considering the validity of a set of data, one needs to take other factors into account. Yet it still seems like a 5% threshold is not being cautious enough, considering it is the standard rather than merely a value to use when the results are not terribly important.
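That Wikipedia number is easy to check by hand: a one-sided binomial test sums the probability of 14 or more heads in 20 fair flips.

```python
from math import comb

# One-sided binomial test: P(X >= 14) for X ~ Binomial(n=20, p=0.5).
n, k = 20, 14
p_value = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
print(round(p_value, 3))   # 0.058 -- just misses the 0.05 cutoff
```

So 14/20 heads sits right at the edge of "significance", which is a nice illustration of how arbitrary the 0.05 line is.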

iofir wrote:

In regard to the article "Multidimensional Detective", I found the multi-dimensional plots confusing at first, but by following the example I realized that they are a great tool for investigating data, especially when you can separate out the variables you're trying to optimize for (i.e. putting the most important measures at the top left). However, the second part of the article was not very clear. I was not sure how the curved boundaries were added to the plot or why they are valuable. Wouldn't the raw or aggregated samples make a more informative visualization than a range? The curves themselves made no sense to me (I'm not even sure how they got them, maybe by inspecting where the lines did not cross); they added no meaningful information and were less clear than simple lines would have been. I think a better visualization of this part (the ranges after one or more variables are constrained) would use some color, possibly by filling the area such that the inner regions have a different hue and more intensity (increasing alpha to make the color more opaque).

jdudley wrote:

@estrat this is why I always like to compute q-values (i.e., false discovery rates) in addition to p-values. The best thing about q-values is that you can almost always compute them directly from the data via randomization, and you get, in my opinion, a much more empirical sense of what is significant, rather than relying on theoretical assumptions. You can also estimate q-values from a distribution of p-values: essentially, you pick your significance level and estimate the proportion of p-values falling under a random p-value distribution at that cutoff. There are some nice R packages for computing q-values from John Storey and others.
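Storey's estimator needs the empirical p-value distribution, but the simpler Benjamini-Hochberg adjustment gives the same flavor in a few lines. A small sketch (the p-values are made up):

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, a simple FDR analogue of q-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices by ascending p
    adjusted = [0.0] * m
    prev = 1.0
    # Walk from the largest p-value down, scaling by m/rank and
    # enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adjusted[i] = prev
    return adjusted

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print([round(q, 3) for q in bh_adjust(pvals)])
```

With these numbers, several p-values that squeak under 0.05 individually end up with adjusted values around 0.07, i.e. they no longer survive a 5% false discovery rate, which is exactly the kind of correction @estrat was asking for.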

ericruth wrote:

Personally, I really disagree with @jbastien and @saahmad about visualizations being less useful for exploring a data set. I feel like the whole power of visualizations is that they leverage our most effective method of parsing large amounts of information (visual parsing and pattern recognition beat reading a string of numbers or other values). Because of this, it seems silly to limit visualizations to a tool for conveying something we already understand about the data. Everything visualizations leverage is just as effective for the individual analyzing the data, so why not take advantage of that and use them as a means to the end as well? A quote Prof. Heer mentioned demonstrates this point beautifully: "Every successful visualization that I've been involved with has had this stage where you realize, 'Oh my God, this data is not what I thought it would be!' So already, you've discovered something." -Martin Wattenberg

Of course, to effectively explore data using visualization we need tools built for this purpose. Data visualization is a younger field, so visual exploratory tools haven't had as much time to develop as command line tools, but I would argue that they're already as effective or more so than their text-based counterparts. I probably wouldn't have said this before seeing the Tableau demonstration in this lecture, but I was really impressed with how much we learned about the political data set in such a short time. In addition, I think the graphical interface is a much faster way to modify views and queries. I'm excited to see how tools like this develop in the coming years.

andreaz wrote:

Parallel coordinates seem like an interesting technique for displaying high-dimensional data, but it seems like the projective transformations that have to be applied to the data (Inselberg mentions these in point 4 of the introduction: rotation, translation, scaling, perspective) factor hugely into how we perceive the relationships in the data from looking at the visualization. Their characterization algorithm orders the variables in terms of their predictive power, but I'm interested in knowing more about how one can generate an appropriate algorithm for applying these transformations.

One question I had about the Polaris reading was why the developers impose an ordering on nominal fields to treat them as ordinal. I may have missed something from the reading that explains this, but this transformation doesn't seem to reflect the underlying qualities of the data.

jsnation wrote:

I really enjoyed the Tableau demo in class, and I found it really cool that it automatically selects a common graph type for the types of data you select. I also really liked how it automatically creates graphs with small multiples when you add in enough parameters, rather than making a really confusing single graph.

Like many other people have said, when I was first reading the multidimensional detective article, I found the parallel coordinates plots to be really confusing. Then I read the first bold point to not let the pictures intimidate me and laughed at that, because I was falling into that trap. He really made his case for parallel coordinates in the middle of the article - where the parallel coordinate plots provided a means to find previously undiscovered relationships between the process variables. The parallel coordinate graphs are definitely not "easy" visualizations for someone to view - they seem like they must take training and heavy knowledge of the dataset to be able to make sense of them. But despite this, the visualization does seem to make clear relationships that otherwise would have been missed. I think an effective way to use parallel coordinates visualizations would be in conjunction with more conventional 2-3 variable visualizations. First the parallel coordinates visual could alert you to an interesting relationship between 2 or 3 variables, and then you could separate those from the rest of the dataset to create a more readable graph with only those important variables.

clfong wrote:

Reading the article "Multidimensional Detective", I find that parallel coordinates are not a very clear or informative tool for discovering relationships across dimensions. Relationships between adjacent dimensions may be somewhat illustrated; however, it's really tough to trace lines across multiple dimensions when all the lines look just the same. If there are only a small number of data points, using colors to distinguish the lines might be a viable choice, but that doesn't scale with the number of data points either. I think a more modern visualization system that lets us easily switch across different subsets of dimensions would allow more effective analysis than line tracing alone.

emrosenf wrote:

Parallel coordinates can be really neat. Alfred Inselberg has a great set of tutorial lectures at his home page: http://www.math.tau.ac.il/~aiisreal/ It's funny that the first two slides are the Broad St pump map, and the Chernoff faces.

Also, I believe that Paypal uses parallel coordinates for their fraud detection. I know that the early founders quickly discovered that humans were more adept at spotting suspicious patterns than computers. They devised a way to visualize the transaction flow to make certain patterns more clear. One such pattern was called the "christmas tree": a bunch (sometimes hundreds) of transactions criss-cross toward the top of the chart until, all of a sudden, they converge at a single node.

It would be cool to try building something like this. Anyone interested?

asindhu wrote:

One major point that came across to me through both the lecture and the readings is the importance of an interactive visualization when doing exploratory data analysis. So far in the comments there have been mixed feelings about the parallel line plots, but I think we should keep in mind that they are mainly an exploratory tool, not necessarily a great option for a final visualization. I think the best use of parallel coordinates is in an interactive format when you're just starting out with a dataset and trying to discover interesting trends. Once trends have been identified between a couple of variables, you can move to a more traditional two- or three-variable visualization.

On that note, I didn't see a reference in the Inselberg paper to an interactive tool for generating parallel line visualizations; does anyone know of one?
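To get a feel for the technique in the meantime, here's a minimal static sketch of a parallel-coordinates plot in matplotlib, with a crude filter standing in for interactive brushing (the function name, the random data, and the `highlight` mask are all made up for illustration):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt

def parallel_coordinates(data, labels, highlight=None, ax=None):
    """Draw one polyline per row, each dimension normalized to [0, 1].

    `highlight` is an optional boolean mask; matching rows are drawn
    opaque on top of the faded context lines (a crude form of brushing).
    """
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid divide-by-zero on flat axes
    norm = (data - lo) / span                # per-axis min-max scaling
    xs = np.arange(data.shape[1])
    if ax is None:
        ax = plt.gca()
    for i, row in enumerate(norm):
        if highlight is not None and highlight[i]:
            ax.plot(xs, row, color="crimson", alpha=0.9, zorder=2)
        else:
            ax.plot(xs, row, color="gray", alpha=0.15, zorder=1)
    ax.set_xticks(xs)
    ax.set_xticklabels(labels)
    return norm

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5))
mask = data[:, 0] > 1.0                      # "brush": select rows high on the first axis
norm = parallel_coordinates(data, ["X1", "X2", "X3", "X4", "X5"], highlight=mask)
plt.savefig("parcoords.png")
```

Even this static version shows why interactivity matters: without the highlight mask, the 200 gray polylines are nearly unreadable.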

nikil wrote:

riddhi wrote:

I really liked the Tableau demo in class as well. Particularly because the examples helped drive home the point that it is important to play with the data and spend some time on exploring the enormous space of visualizations to see what might tell the best story about the data. Certain visualizations might seem like they are telling a great story, until you realize that they are hiding more interesting sub-structures in the data. Exploring what level of detail makes sense in the visualization is important, because while you usually don't want to display every single data point in your visualization, you do want to let your visualization have depth and interesting detail.

Chernoff faces were interesting. I think they can be useful for visualizing data that a) has somewhat correlated dimensions and b) has data that is somehow related to humans, human problems, feelings etc, that would evoke those facial features from humans.

I found the quantile-quantile (Q-Q) chart to be useful and informative, especially when it allowed us to compare more easily between different models that were fitted to the data. It got me thinking about the use of data transforms, and how important they can be in machine learning and for extracting patterns. Instead of building classifiers on raw data with many features (dimensions) and hoping your classifier will perform well on test data, it might be useful to transform the raw data (in this case, to a Q-Q plot) and then determine which model (or group of models, e.g. 3 Gaussians with different centers, like in class) might fit the data best, either through quick visualization or by calculating the distance of each point on the Q-Q plot from the 45-degree line.
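That last idea, scoring a fit by how far the Q-Q points stray from the 45-degree line, is easy to sketch numerically (the function name, sample sizes, and thresholds here are my own assumptions, just for illustration):

```python
import numpy as np

def qq_deviation(sample, reference, n_quantiles=99):
    """Mean perpendicular distance of Q-Q points from the 45-degree line.

    Smaller values mean the two samples have more similar distributions.
    """
    qs = np.linspace(1, 99, n_quantiles)
    q_sample = np.percentile(sample, qs)
    q_ref = np.percentile(reference, qs)
    # Perpendicular distance of (x, y) from the line y = x is |y - x| / sqrt(2)
    return np.mean(np.abs(q_sample - q_ref)) / np.sqrt(2)

rng = np.random.default_rng(42)
normal_a = rng.normal(0, 1, 10_000)
normal_b = rng.normal(0, 1, 10_000)
shifted = rng.normal(2, 1, 10_000)

same = qq_deviation(normal_a, normal_b)  # near zero: same distribution
diff = qq_deviation(normal_a, shifted)   # much larger: distributions disagree
```

A score like this could rank candidate models automatically, with the Q-Q plot itself kept for visual confirmation.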

Also, for PCA, in class we saw a slide that reduced the multi-dimensional data to 2 dimensions using PCA and then plotted them against each other as a scatter plot. One person from the class commented on how hard it was to extract any meaning from that plot, given the absence of any sense of what the 2 new dimensions were actually measuring and the absence of any patterns in the plot. My sense of PCA is that it tries to find uncorrelated directions that capture as much of the data's variance as possible. When you plot two dimensions of your data and see patterns, it is because those two dimensions are correlated in some way. My guess is that it might be more instructive, when trying to determine how similar two different datasets are to each other, to plot their first PCA dimensions against each other, their second PCA dimensions against each other, and so on. My hunch is that the patterns you find in those plots might be much more interesting and meaningful.
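To sketch what I mean by PCA finding uncorrelated, variance-maximizing directions, here's a minimal SVD-based projection to 2D (the synthetic two-factor data is an assumption for illustration, not anything from the lecture slides):

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the top two principal components.

    Returns the 2-D scores and the fraction of total variance
    each of the two components explains.
    """
    Xc = X - X.mean(axis=0)                 # center each column
    # SVD of the centered data: rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                  # coordinates in the new basis
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:2]

rng = np.random.default_rng(1)
# 8-dimensional data whose variance is dominated by two hidden factors
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 8))

scores, explained = pca_2d(X)
```

On data like this, the two PC scores come out essentially uncorrelated and together account for nearly all of the variance, which is exactly why patterns visible in a raw two-column scatter plot (correlated axes) tend to vanish in a PC1-vs-PC2 plot.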

nchen11 wrote:

I admit that I found Inselberg's paper a little hard to follow, but it seems as though the focus is mainly on manual visual analysis. This idea is reinforced by his bolded directives, such as "carefully scrutinize the picture" and "you can't be unlucky all the time!" The latter brings to mind brute-force algorithms, but I digress . . .

He concurs that future tools are needed to automate the exploration process, and I think that Polaris is a great example of such a tool. The pre-determined defaults let the user focus on the data patterns rather than on the generation of the actual visualization, while the ability to change the various settings still allows for manual exploration.

I am particularly interested in Polaris/Tableau's competitors, and how they might choose similar or different approaches to the same kinds of data.

arievans wrote:

I'd actually like to make a comment tying together a concept I am learning in another class. In Decision Analysis, MS&E 252, we are learning about causality and the false conclusions that are drawn, leading to misdiagnoses, incorrect or unfair sentences, etc.

We spoke this week about the concept of visualizations illuminating incorrect or incomplete data. This is indirectly related: in the case of Decision Analysis, we are shown a visualization of a probability tree. Though displaying the information that way shows the data clearly enough for the observer to draw some conclusions, oftentimes the conclusions drawn are wrong--sometimes the graph is misinterpreted. Though I can't immediately pinpoint other situations where this might arise (that is, other specific visualizations), I assume this issue could exist elsewhere in this domain. This got me thinking along a whole new line of development alongside visualizations...

In Tableau we saw that there was a quick view for visualizations that might be useful and provide meaningful summary statistics and trend information. A good visualization is one that is clear and does not mislead its observer. But what is the safeguard against accidental false interpretations, such as in the case of the probability tree? What if, in addition to the visualization, the software could provide textual information that states very clearly what the reader should be seeing? That seems to be the exercise we conduct in class all the time--"what do we learn from this graph?" It seems that we can push technology further in this direction, and more importantly, provide protection against misinterpretation. For example, if we knew that the probability tree often leads people to falsely reverse the direction of causality, we could provide textual feedback such as "Note: Given the information here, smoking does not necessarily lead to lung cancer. This graph shows that if you have lung cancer, you are very likely to have been a smoker."

Tableau is great software and I am excited to use it, but given that the spirit of the course is always about improving our tools, perspectives, and visualizations, I thought this would be a nice feature to consider for the future.

anomikos wrote:

@saahmad. It is not a matter of a command line vs. GUI debate. I believe that every one of us develops different preferences when it comes to the way we see and evaluate data. For me it is also easier to try many different queries to get a sense of the data before diving into the visualization. Again, this is part of the exploratory data analysis process. On multidimensional analysis, I have to admit that I was really excited to read the Inselberg paper. I participate in MS&E 270, where we compete in a strategy simulation of a market, and the reports on the market give us a ton of data on products, usability, and customer preferences. It is only there that you start to see the limitations of programs like Excel as helpful tools to drive decisions. I am really excited to use Tableau and multidimensional analysis to observe the market data from many different sides!

avogel wrote:

@jbastien, @saahmad, @anomikos. I've been thinking about what can be done with straight SQL vs. Tableau. It seems the main example of what SQL might be better at is the JOIN operation. Naturally, the folks writing Tableau would be interested in making their tool as powerful as possible, so I did some searching to see how they handled that (i.e., google "Tableau SQL JOIN"). It looks like it has some ability to do joins as discussed in this thread. Not having a complex data set in multiple tables/sheets to work on, does anyone have an idea how effective this looks? If I have some time, I might try some examples this weekend, but I don't think I'll have or be able to generate very good datasets quickly. Does this address any potential weaknesses in Tableau or GUI visualization tools in general?

jbastien wrote:

Tableau can indeed do joins and it works ok.

The problem isn't merely replicating SQL but being more expressive than SQL. Think about having a lot of data in separate tables where not everything is nice and uniform.

Having a database closely integrated with a SQL-like language allows you to do pretty powerful things very rapidly. It's not just about putting data together; it's also about modifying the data, splitting it, and massaging it into a form that lends itself better to what you're trying to do.
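To make that concrete, here's a toy example of the kind of join-plus-massaging I mean, using Python's built-in sqlite3 (the schema and data are invented for illustration):

```python
import sqlite3

# Hypothetical schema: related data spread across two tables,
# with a non-uniform field (NULL region) that needs cleaning.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'west'), (2, 'east'), (3, NULL);
    INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (2, 7.5), (3, 4.0);
""")

# Join, clean up the non-uniform region field, and aggregate in one
# query -- the kind of reshaping that is awkward through a GUI alone.
rows = con.execute("""
    SELECT COALESCE(c.region, 'unknown') AS region,
           SUM(o.amount)                 AS total
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    GROUP BY region
    ORDER BY region
""").fetchall()
# rows -> [('east', 7.5), ('unknown', 4.0), ('west', 35.0)]
```

None of this is impossible in a visual tool, but expressing the cleanup step (the COALESCE) and the aggregation together in one statement is where the language shines.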

selassid wrote:

I can't get behind parallel coordinates. Yes, any visualization technique is a tool to be learned, so the initial warning about complexity was not completely off-putting, but I feel the technique has too many shortcomings to be effective. It is too visually imprecise: the hordes of black lines often can't be distinguished or even followed. It also seems inherently incomplete; only when specific data subsets are shown alone can patterns be teased out, and no single image can tell the whole story. This means there would be a huge benefit in incorporating interactivity, selecting specific subsets of data to pull out correlations, but that's the issue: the original plots are so busy that you can only see correlations once you've already selected the subsets that contain them! In other visualizations, correlations can be seen even when you look at the whole dataset.

jeffwear wrote:

I was at first peeved by Inselberg's tongue-in-cheek presentation of his dataset, or rather his lack thereof. I felt as though he spilled either too little or too much ink telling us how he could not tell us more about the particulars of the variables "to protect the innocent" and especially because "they were not important." Further explanation would only have helped me understand a problem in a domain (VLSI production) with which I am not at all familiar.

But in retrospect I realize that Inselberg's obfuscation of the dataset's context actually aids in its analysis. Insofar as the particulars are known, they encourage a faulty hypothesis (that by minimizing defects we will optimize yield)! From this standpoint it is almost better that we know nothing about the data - free of preconceptions, we can focus on optimizing for a variable or variables without prejudice toward the methods or solutions.

I wouldn't go so far as to say that we could get by entirely without knowledge of the domain. It's necessary to know something about VLSI production in order to interpret the results and to apply the conclusions thereby derived. Nevertheless, it seems that domain expertise should not be privileged while exploring the data.

rroesler wrote:

Gizmodo printed an interesting set of graphs (originally from http://www.asymco.com/) that use a similar set of data to the one we used in assignment 1: http://gizmodo.com/5657699/who-is-really-winning-the-smartphone-race. Here, instead of market data based on cell phone operating system, we see market data based on cell phone manufacturer.

The first graph is a vector plot showing market-share growth versus profit-share growth. I really like the vector representation because it gives a more intuitive sense of which direction a company is headed. We look at Apple, for instance, and get the sense that it will be the main player in the immediate future. One problem I can imagine with this type of graph is qualitatively comparing vectors that are close to the same length. For example, how would two companies compare if their respective vectors were the same length, but one is oriented more in the y-direction and the other more in the x-direction?

That is why I also like the second graph they show further down the page. It's a simple scatter plot showing market-share versus profit-share; however, the plot area has been divided into four quadrants. Each quadrant is given a qualitative title, such as "Dominant" for a company with high market and profit shares and "Fading" for a company with high market-share but low profit-share. While these titles may not truly reflect the exact circumstances of each company, they provide an easy way to sum up the data.

heddle wrote:

Ok, so I didn't get the memo about commenting on lectures, and I am going to make up for it by commenting now on this last week's lectures. Really, the part I loved about Tuesday's lecture was the section dedicated to using faces to present data. The idea that data can be both descriptive and emotionally powerful is something that I think can be a huge asset or, in the hands of someone unethical, a weapon. Yes, oftentimes the data we present is just clarified information, but the idea of using heuristic devices against the user to elicit a desired emotional response can change not only the concept of data presentation, but sometimes even the meaning of the number-driven data itself.