


Group Member

* Ashton Anderson


In my project, I plan to visualize the quality of Wikipedia pages. I'll make use of the recent feature by which visitors can vote on the quality of Wikipedia pages along different dimensions (e.g., "complete", "well-written", "neutral"). Article quality is a subject of increasing importance as Wikipedia becomes used more widely in classrooms and elsewhere. I want to allow users to see at a glance how quality varies across Wikipedia, whether some "areas" of Wikipedia are of generally higher quality than others, which of these areas could use work, and how the different dimensions of quality relate to each other (does one lead to another? Are they simply equivalent?).

Related Work

To my knowledge, no one has attempted to visualize the quality of Wikipedia articles.

However, there has been a lot of work on (1) Wikipedia visualization and (2) Wikipedia article quality.


  • 11/29 -- Presentation
  • 12/3 -- Finish downloading, cleaning, sorting, organizing, aggregating, pre-processing the data. This includes comparing all pairwise article similarities.
  • 12/6 -- Finish projection of articles to 2-D space.
  • 12/12 -- Finish all other aspects of visualization (outlined in slides).
  • 12/13 -- Poster
  • 12/15 -- Paper
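
The two data steps in the schedule above (pairwise article similarities, then projection to 2-D) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the method the project commits to: the feature vectors are made-up toy data, cosine similarity is an assumed choice of similarity metric, and classical multidimensional scaling (MDS) is an assumed choice of projection. The function names are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(features):
    """Return the n x n matrix of cosine similarities between rows."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty articles
    unit = features / norms
    return unit @ unit.T


def classical_mds(distances, k=2):
    """Embed n points in k dimensions from an n x n distance matrix."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered squared distances give a Gram (inner product) matrix.
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


# Toy data (an assumption): 4 articles described by 3 term counts each.
features = np.array(
    [[3, 0, 1],
     [2, 0, 1],
     [0, 4, 0],
     [0, 3, 1]],
    dtype=float,
)

similarity = cosine_similarity_matrix(features)
# Chord distance between unit vectors: d^2 = 2 - 2 * cos(theta).
distances = np.sqrt(np.clip(2.0 - 2.0 * similarity, 0.0, None))
coords = classical_mds(distances, k=2)  # one (x, y) row per article
```

Comparing all pairs takes time and memory quadratic in the number of articles, which is one reason to finish this step as offline pre-processing rather than in the browser.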

Class Presentation

Final Deliverables


jneid wrote:

Consider: Who would use your visualization? For what use case? What similarity metrics are useful?

blakec wrote:

There is so much data in Wikipedia. I was wondering how you would incorporate all that data at once and, in addition, store it on the client's computer. Will you be doing a lot of preprocessing?

jsadler wrote:

I really like the problem space. This is a very meaningful problem to determine "quality" aspects of information spaces .... reducing the information overload.

I like the hand sketch with the 2D array and grouped articles. What dimension is quality mapped to, and in what direction? I.e., are higher quality articles on the outskirts or closer to the center?

It seems you have hypothesized a few different quality-derived variables. It would be nice to hear what specific dimensions you are thinking about...

rc8138 wrote:

One important feature is the ability to allow users to search for a particular topic. Is that included in your design? How would you scale your software to accommodate that need?

jnriggs wrote:

Hey, cool idea! The biggest thing I was trying to wrap my head around was how you plan to cope with the immense size of the data set. You might consider the following. This has to do with your question: "How to embed articles into 2-d space?" Maybe you could consider a "layering" approach where you would have layers be for different subsets of the data and only show specific layers at a time. (a "zoom" slider might be a starting idea).

dbrody wrote:

How do the quality measures determine relatedness of articles? My understanding is that the 2D plot is based on relatedness of the articles. A major part seems to be how to visually lay out the dots in some 2D space. Overall, I think visualizing the quality is awesome.

junjie87 wrote:

One feature you could consider, other than a time slider, is a "play" button that allows you to animate through time automatically. I think it would give a better visualization of how the quality of Wikipedia entries changes over time.

grnstrnd wrote:

The similarity of articles may have changed over time. Will you simply plot the similarity based on the current state of the articles, or will you allow the user to change the similarity plot over time? It might be interesting to allow this to change although the complexity would of course increase a ton. Awesome idea!

mkanne wrote:

Really like the well thought out use cases/user questions you mentioned. Your wireframes are very clean and I think the result will be useful.

tpurtell wrote:

What kind of trends might someone look for about topical areas after they identify them as poor quality or high quality?

angelx wrote:

Wikipedia is a very useful resource, and it would be helpful to understand which areas of Wikipedia are "better" than others. Wikipedia is very large; are all Wikipedia articles shown, and how are the semantic areas determined?

pcish wrote:

It seems strange to plot pages on a 2D space without a clear definition of the axes. Is it possible perhaps to find another way to allow users to browse the available pages? E.g., an overlay over the actual Wikipedia pages perhaps?

bsee wrote:

I feel that instead of taking the top 1000 articles, the user should be able to select the kinds of articles they want. This is mostly because I think that the top 1000 articles are usually heavily seen and edited, and so will usually be of highest quality.

However, if you can show a range of articles, both good and bad, popular and not, then it makes your application a lot more useful in exploring Wikipedia articles.

emjaykim wrote:

Thinking about the end user is good. The drawing is beautiful. From a visual standpoint, I would like to see the dot intensities in the real implementation as differentiated as they are in the drawing. I would also like to see how the visualization takes care of overlapping dots.

zhenghao wrote:

Very cool project!

Other than just a similarity project, it would be nice to be able to see the link structure between articles to see if "quality" propagates over time. Some time varying animation to see change of quality over time would be nice too : )

mlchu wrote:

I think the use of a time slider would be really useful to show the change over time. Would you also consider showing some statistics of the data and the changes over time as the user slides through time?

bgeorges wrote:

I'm not sure how detailed the data set you have is, but if it is broken down by rater, you might want to normalize the ratings, since there are people that are "harsher" than others when giving ratings and this might skew your results.

kahye wrote:

Can you define an alternate or supporting quality metric other than user ratings, such as completeness of the article or how well it is referenced? It would also be an interesting metric about wiki articles.

mbarrien wrote:

I wonder if the grouping by quality will mean something about the articles relative to one another... the articles seem like they'd be unrelated to one another, and the visualization won't help. Perhaps filtering by a subject area?

arvind30 wrote:

Another interesting dimension to consider could be how quality of article depends on the # of authors contributing to an article...

ifc wrote:

I'm not sure how well quality can be predicted using user ratings. Word count, image count, number of contributing authors, number of updates, and page views might also be features to look into. Also, if you can define quality scores for a set of the articles, you might consider some type of regression to predict quality for others.

netj wrote:

Assuming most people aren't interested in the same kinds of articles in Wikipedia, I'm not convinced why showing quality of a globally chosen set of articles would be useful for me. Making clear what your target users want to do with your visualization will help improve the design. For example, Wikipedia contributors would be interested in the quality of their own articles vs. that of articles from a global sample. Or, Wikipedia people tuning the rating system will want to see how the dimensions relate to each other.

chanind wrote:

It might be interesting to see how user ratings compare to attributes of the article like word count and number of references. Also, I'm not sure what the end goal of the visualization is. Is the goal just to be interesting? Is this a tool to help users find and improve poor quality articles?

jessyue wrote:

Could you give the user the ability to choose topics or areas of topics to view, either by filtering or searching? This would be more useful from an end-user standpoint.

stojanik wrote:

A very ambitious project and if done well could prove to be useful for deeper exploration of community driven authorship. How are you prioritizing the ranking categories? Does trustworthiness assume more or less "weight" than a complete, well written, or objective ranking? Maybe allowing the user to order or weight the ranking categories would provide another dimension to search.

jofo wrote:

I think it would be useful to have a clearer usage scenario to understand the application better, but perhaps that is difficult to come up with before it is actually implemented. I would also like to see perhaps categories or other clusters emerge rather than individual articles. A not too wild guess is that humanities and art are less represented among high-quality articles than technology (fewer authors), and (if it is so) it warrants interesting discussions...
