- David Chanin
- Ian Christopher
Newspapers chronicle current events in great detail, so newspaper archives give us a window into our past. However, these archives are often massive and filled with detailed textual information. Because this information is low level, effective visualization techniques are required to help researchers pick out macroscopic trends in the journalism of our past.
We plan on visualizing data from the Metro Newspapers dataset. We want to make it easy for journalism researchers to quickly identify trends in topics over time, as well as find keywords and phrases related to their topic. We accomplish this by visualizing the frequency of words and topics over time using a line chart. Above this chart we will display keywords related to the topic being shown. Clicking on a keyword will add that word to the list of topics being visualized using an intersection operator. The user can also manually add topics to the visualization by typing them in, to compare trends in terms over time.
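A minimal sketch of the counting behind such a line chart, assuming articles are available as (date, text) pairs. The sample articles and the `yearly_counts` helper are hypothetical, and the real dataset would need proper tokenization; the point is only to show how an intersection of terms narrows the per-year counts.

```python
from collections import Counter
from datetime import date

# Hypothetical toy articles: (publication date, full text).
ARTICLES = [
    (date(2001, 3, 1), "The election results surprised the governor"),
    (date(2001, 7, 9), "Housing prices rose again"),
    (date(2008, 10, 2), "The election and the housing crisis dominate debate"),
    (date(2008, 11, 5), "Election night coverage"),
]

def yearly_counts(articles, terms):
    """Count articles per year that contain every term (intersection)."""
    wanted = {t.lower() for t in terms}
    counts = Counter()
    for published, text in articles:
        words = set(text.lower().split())
        if wanted <= words:  # all requested terms appear in the article
            counts[published.year] += 1
    return dict(counts)

# One series per query; each series would become one line on the chart.
election = yearly_counts(ARTICLES, ["election"])
both = yearly_counts(ARTICLES, ["election", "housing"])
```

Clicking a related keyword would correspond to appending it to `terms`, which can only keep or shrink each year's count.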
This project idea grew out of talking with Geoff McGhee, of the Stanford Bill Lane Center for the American West, about what sorts of questions journalism researchers are likely to ask of this dataset. We will continue to work with Geoff to iterate on our visualization as development continues.
The data set
The raw data set of newspaper articles was given to us as a 16 GB tarball by Mr. McGhee. The specifics of the unpacked data are as follows:
- Three major newspapers - Los Angeles Times, Chicago Tribune, The Baltimore Sun.
- Roughly ten years of coverage (2000-2009).
- About 2.5 million XML files.
- In addition to full text, the data includes a number of meta fields in the XML, including author, an abstract, newspaper section, page number, and date.
- There are a number of regional stories.
- Lots of interesting historical events covered; we expect there to be a number of interesting trends.
Meeting Notes (with other people outside the group)
Geoff McGhee 1
First meeting with Geoff, which also let us get the data set from him.
- Warned not to let this data completely get out there, but not a life or death type of thing.
- seems to be very interested in the decline in American journalism.
- As such would be interested in visualizations that were able to show that. We only have the last ten years -- is this enough data?
- Fewer reporters now and more pressure to produce lots of writing.
- Maybe science articles and investigative reporting versus pop culture stuff?
- Also interested in publication bias. People seem to think that the media has become more liberal over time.
- Would be neat to have semantic analysis on text to test out the bias hypothesis.
- Another topic to look into is the relationship between authors and stories. Are some more drawn to certain themes/biases?
- A number of other possibilities. Looks like we have a bunch to think about.
These are actually Jason's notes. They are much more comprehensive than ours :).
- Stanford CoreNLP
- part-of-speech tagging, named entity extraction, lemmas
- probably not parsing due to the size of corpus
- General Inquirer
- list of sentiment words in English
- I have already scraped the General Inquirer website. If you want the list of sentiment words (original and stemmed variations), just let me know!
- Technical Terms
- how do news articles differ from general text?
- more emphasis on entities
- lack context as each article is updated daily
- well-written, typically follow specific structure/writing styles
- how do journalists differ from general analysts?
- what questions?
- how do they conduct searches?
- avoid high-level overview of 'patterns that we already know'
- good example: Meme Tracker
- dynamics of news cycle, comparing mainstream news vs blogs based on quotations
- evening news vs morning news?
- comparison of trends or patterns by authors, etc.
How might visualization help?
- provide information scent
- Google Stock: match prices with events along a timeline
input selection & query specification
- enable formulation of sufficiently interesting queries: "Give me morning news vs. evening news."
- UI design?
- expand query term to include contextual information?
- intelligent queries, e.g. learn from navigation history?
- what language model to use, to enable comparison?
- how to perform document retrieval?
- how to compute document similarity?
- provide context
- cross reference with census data
- tf.idf score works for retrieval
- G2 might be better for surfacing higher level trends
- works for intelligence analysis (proximity of entities)
- proximity of 'opinion' and 'entities'?
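As a rough illustration of the G2 note above, here is a sketch that assumes G2 refers to the log-likelihood ratio statistic commonly used to compare a word's frequency between two corpora. It uses a simplified two-term form of that statistic; the function name and the example counts are hypothetical, not taken from the dataset.

```python
import math

def g2(count_a, total_a, count_b, total_b):
    """Simplified log-likelihood ratio (G2) for one word across two corpora.

    count_*: occurrences of the word; total_*: total words in each corpus.
    """
    combined = count_a + count_b
    # Expected counts if the word were equally frequent in both corpora.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # treat 0 * log(0) as 0
            score += observed * math.log(observed / expected)
    return 2.0 * score

# Hypothetical counts: a word far more common in corpus A than in corpus B.
distinctive = g2(50, 1000, 5, 1000)
ordinary = g2(12, 1000, 10, 1000)
```

Ranking words by this score for, say, one year's articles against the rest of the archive would surface the terms most characteristic of that year.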
Document similarity measure
- tf.idf on specific terms
- similarity based on LDA topics
- modifiable model
- LDA topical similarity might be better as an initial measure than tf.idf
- analysis is an iterative process
- when experts can directly insert domain knowledge, they can iterate and improve the similarity measure to better suit analysis needs
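A minimal sketch of the tf.idf baseline mentioned above, using cosine similarity between tf.idf-weighted word vectors. The sample headlines, the whitespace tokenizer, and the function names are assumptions for illustration; an LDA-based measure would replace these word vectors with per-document topic distributions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical headlines standing in for full articles.
docs = [
    "mayor announces city budget cuts",
    "city council debates budget cuts",
    "lakers win playoff game",
]
vecs = tfidf_vectors(docs)
```

Here the two budget stories score higher against each other than either does against the sports story, which shares no terms with them.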
Combine Text + Visual Analysis
- what functionalities should be based on text analysis?
- what functionalities should be based on visual analysis or interaction?
- faceted search
- distorted views to provide context
- shared infrastructure
- meet with Geoff
- prototype UIs
- example usage cases of journalist research
- what info scent, input, query, analysis, context to include?
- build a useable system
- basic system: faceted search on existing metadata
- elicit feedback from journalists
- components that each person can take away and experiment with
- extract textual features
- word proximity
- document similarity measure
- topic models
Interesting things to Visualize
Death of the American Newspaper <- emphasized by Peter
- Story lengths over time
- Story topics over time
- Story quality over time
- Story sources over time (e.g. think tanks)
- Number of stories over time
- Media bias
- Is the media actually liberally biased?
Sentiment Analysis trends <- emphasized by Peter
- Track sources or “memes” throughout articles
- Track how misinformation in 1 article propagates to later articles
- Track ads over time (don’t think we have data for this though...)
- Article search is already fairly painless through LexisNexis, Bloomberg, and Google
- Op Ed section can throw off analysis since it’s random people saying random things
- Visualizing author of articles is not interesting since journalists are assigned to cover a single area for long periods of time
- The dataset we’re using is the poster child of the death of the American newspaper as all 3 papers are essentially dead today
- Misinformation in past newspaper articles tends to propagate through future articles as journalists use that article as a research base
- As newspapers die, they cut stories about politics and science and start printing a lot of nonsense about celebrities
- Peter is really excited about sentiment analysis and isn't overly concerned about accuracy
- There are no good tools to easily view aggregate trends in newspapers
Why newspapers archive data
- Newspapers view themselves as the “institutional memory” for their community
- Journalists researching people will go back and see when and why that person was mentioned in the past
- Journalists want to compare the present to past events
- Newspapers are, for better or worse, viewed as trustworthy data sources
How journalists do research currently
- If at a Tier 1 newspaper, then look back through their digital archives
- Search on LexisNexis or Bloomberg
- Search Google
- Director of the Graduate Program in Journalism at Stanford
- Shorter meeting
- Overall seems excited to talk to us
- Seems Journalism/CS connection has big potential in this area
- Looking through the 'books' seems like it would be very helpful to her. Says she and others would spend huge amounts of time scanning these financial documents looking for stories. Beyond the scope of this project, but interesting idea.
- Seems to like our mockups, though I think it's tough to get good feedback without a prototype
- Introduced us to Phil Reese of the Sacramento Bee after showing some of his work
- Phil's work is a little more story-focused (ours is aimed more at helping journalists find interesting patterns in big newspaper sources).
- Nonetheless some of his work is really neat.
- glad to help out
- feature writer, freelancer, newspaper/magazine experience
- more business/organization writer than investigative stories
How she generally researches a potential story
- unusual for her to look at newspaper data, usually data within the last year
- normally won't do 'data looking for a reason', though has done it with census data and older workers
- investigative reporters are more likely to do this type of research
- start-up bubble in Silicon Valley for example
Things she might be interested in with respect to our tool
- if we could get lots of newspapers, it would be interesting to look at the number of international stories/investigative/etc.
- curious how layoffs over the years have affected newspapers (same as Geoff)
fewer reporters today -> fewer stories. But more bloggers repeating the same stories
- How often do stories repeat? What does that network look like?
- More used to researching more major trends that have been highlighted
- Census data seemed to come up a number of times. Might be good to add.
- Excel is a very common tool for her once she gets raw numbers
Feedback on prototypes (David's, Ian's, and TJ's/Bobby's)
- Feels like this would provide little stories. Jumping off point?
- Needs to be able to get stories out of it. Not sure how.
- Likes the similar terms
- Time select (we are planning it) seems like it would be very helpful.
- Maybe look for names instead of nouns/adj?
- Interface for Bobby/TJ is confusing. Not sure how to approach it.
- Likes the power of their tool though
Geoff McGhee 2
- Chit chat about data / catch up with a bunch of us (TJ/David/Ian/Jason/Bobby)
- Unfortunately we have page number issues. Oh well.
- Start to demo our prototypes to Geoff
- Other group first, Bobby and TJ explain their demo a bit
- TJ controlling the demo, "Power user" comes up a bit. I think Geoff might be a bit of one.
- Geoff really seems to like the power that TJ/Bobby's prototype gives them.
- Prototype could be intuitive, but the design right now isn't very.
- Can you weight combinations? Get rid of things?
- Is each box on the side supposed to refine one single interest?
- Gray what you meant below class building would be nice.
- Color Picker?
- Change look and feel. Button shape doesn't help.
- What does filter do? Not sure at first? People might be confused by it.
- Blue is the new black? New neutral?
- List sources and maybe how you can get full text of the articles.
- Thoughts of input box for text as primary source -- lots of options vs usability issues come up.
- Little bit of a turn off when there are too many labels -- What to look at?
- David's Prototype
- Topics? Is there a better way to filter the results?
- A little bit of instructions might be nice.
- Full text would be very nice, but we would have to be careful of scrapers (we signed something saying we aren't supposed to let people get all the full text).
- More drill down features would be good. Likes the related topics sparklines.
- Layout a bit confusing. Wasn't sure what was linked to what.
- Title and author probably ok for popups of dates.
- Publications on top of the related terms div.
- Wasn't sure about the time slider on our main sparkline.
- Editorial vs other sources would be very neat to see. Warns us to watch out for opinion /editorial sections -- maybe there should be some type of filtering.
- Have to explain things to people. Blank page always bad (maybe just have instructions).
Shorter phone interview. Kind of nice as we won't be able to explain our visualizations as well.
- How he goes about looking for / reporting on stories
- Look for the outliers in the data. Are there flaws in the data or do they actually mean something?
- Outliers can often make good topics if they are real.
- Also checks for variance in data. Another thing that might mark a good thing to look into (i.e. wealth in areas, etc.).
- Uses Google Fusion Tables for many of his visualizations. Seems to make things faster and easier.
- Feedback about prototypes
- In general, wants more text explaining the visualizations.
- As such likes the more straightforward ones. Seems to be a common theme we are getting about explaining the visualization to people.
- Had trouble figuring out TJ/Bobby's prototype. Doing this over the phone adds some interesting effects.
- Overall pretty positive feedback beyond these though. Might be holding back a little.
- Gives the names of a number of News Application Team contacts.
- Not sure we have enough time to utilize the list though.
- Lines are unclear.
- We should have our main graph at the top. Big bang at top.
- Better labeling is a must.
- Sometimes hard to understand why? We need our sample article popup.
- We sketched out a new layout for the visualization.
- Can someone use this without instruction? If not, it's too complicated to use.
- Talked about his experience a bit at newspapers.
- Seems most of these newspaper visualizations come from teams and a highly iterative design process.
- Also directed us to projects.washingtonpost.com for more visualization examples.
Collaboration with T.J. Purtell and Bobby Georgescu
After the first few mockups and getting the raw newspaper data, we learned TJ Purtell and Bobby Georgescu are doing something very similar to us. As such we have decided to work together. One of the major factors in this decision was the belief that there would be a lot of duplication on the backend and in interviewing people. We will likely have different final visualizations but will collaborate often.
Nov 29 Presentation
Our page from the November 29th check up presentation is located here. The page includes a few mockups and peer comments on our progress. In case you are just looking for the slides, they are here:
Using these two mockups we both prototyped our views. The next photo is from the first mockup.
The next two are from the second mockup.
Final Visualization screenshots
Because we worked with another group, we have these deliverables on a joint page: Bobby, TJ, David, and Ian's Project page