Exploratory Data Analysis

A variety of digital tools have been designed to help users visually explore data sets and confirm or disconfirm hypotheses about the data. The task in this assignment is to use an existing visualization tool to formulate and answer a series of specific questions about a data set of your choice. After answering the questions you should create a final visualization that is designed to present the answer to your question to others. You should maintain a web notebook that documents all the questions you asked and the steps you performed from start to finish. The goal of this assignment is not to develop a new visualization tool, but to understand better the process of using visualizations to perform exploratory data analysis.

Here is one way to start.

  • Step 1. Pick a domain and data set that you are interested in.

    • Peruse the provided data sets below. Choose the one of greatest interest to you. We encourage you to use one of the provided data sets. However, if you would like to explore a different data set, please contact the teaching staff and include a description of the data.
  • Step 2. Pose an initial question that you would like to answer.

    • For example: Is there a relationship between melting point and atomic number? Are the brightness and color of stars correlated? Are there different patterns of nucleotides in different regions in human DNA?
  • Step 3. Assess the fitness of the data for answering your question.

    • Inspect the data; it is invariably helpful to first look at the raw values. Does the data seem appropriate for answering your question? If not, you may need to start the process over. If so, does the data need to be reformatted or cleaned prior to analysis? Perform any steps necessary to get the data into shape prior to visual analysis.
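As one possible way to approach Step 3 outside of a visualization tool, the sketch below counts missing and non-numeric values per column before any charts are built. The sample rows, the column names (`title`, `budget`, `imdb_rating`) and the list of missing-value markers are assumptions made for illustration; they are not taken from the provided files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for the first rows of a downloaded CSV file.
SAMPLE = """title,budget,imdb_rating
Movie A,1000000,7.1
Movie B,,6.4
Movie C,N/A,
"""

# Assumed markers for "no value"; adjust after looking at the real data.
MISSING = {"", "N/A", "NA", "null"}

def profile_columns(rows):
    """Count missing values and non-numeric values for each column."""
    missing = Counter()
    non_numeric = Counter()
    for row in rows:
        for column, value in row.items():
            text = (value or "").strip()
            if text in MISSING:
                missing[column] += 1
                continue
            try:
                float(text)
            except ValueError:
                non_numeric[column] += 1
    return missing, non_numeric

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
missing, non_numeric = profile_columns(rows)
for column in rows[0]:
    print(column, "missing:", missing[column], "non-numeric:", non_numeric[column])
```

A column that is expected to be numeric but shows a high non-numeric count, or a column that is mostly missing, is a sign that the data needs cleaning or that the question needs to change.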

Exploratory Analysis Process

After you have an initial question and a dataset, construct a visualization that provides an answer to your question. As you construct the visualization you will find that your question evolves; often it will become more specific. Keep track of this evolution and the other questions that occur to you along the way. Once you have answered all the questions to your satisfaction, think of a way to present the data and the answers as clearly as possible. In this assignment, you should use existing visualization software tools. You may find it beneficial to use more than one tool.

Before starting, write down the initial question clearly. As you go, maintain a wiki notebook of what you had to do to construct the visualizations and how the questions evolved. Include in the notebook which data set you chose, and describe any transformations or rearrangements of the dataset that you needed to perform. In particular, describe how you got the data into the format needed by the visualization system. Keep copies of any intermediate visualizations that helped you refine your question. After you have constructed the final visualization for presenting your answer, write a caption and a paragraph describing the visualization and how it answers the question you posed. Think of the figure, the caption and the text as material you might include in a research paper.
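If you script your transformations rather than editing the file by hand, the script itself can serve as a record for the notebook. The following is a minimal sketch of that idea, assuming a hypothetical file in which budgets are written like `$1,000,000` and dates like `12/25/2009`; the column names and formats are illustrative and may not match the provided data sets.

```python
import csv
import io

# Hypothetical raw rows; the real files may use different columns and formats.
RAW = """title,budget,release_date
Movie A,"$1,000,000",12/25/2009
Movie B,N/A,1/5/2010
"""

def clean_money(text):
    """Turn '$1,000,000' into 1000000; return '' when no amount is present."""
    digits = text.replace("$", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else ""

def clean_date(text):
    """Turn month/day/year into ISO year-month-day, which sorts correctly."""
    month, day, year = text.split("/")
    return f"{int(year):04d}-{int(month):02d}-{int(day):02d}"

cleaned = io.StringIO()
reader = csv.DictReader(io.StringIO(RAW))
writer = csv.DictWriter(cleaned, fieldnames=reader.fieldnames, lineterminator="\n")
writer.writeheader()
for row in reader:
    row["budget"] = clean_money(row["budget"])
    row["release_date"] = clean_date(row["release_date"])
    writer.writerow(row)

print(cleaned.getvalue())
```

Writing the result to a new file, rather than overwriting the download, keeps the original available if a later question needs a column you changed.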

Data Sets

We have provided the following two data sets and encourage you to use one of them in order to get started quickly and therefore have more time to explore the data and develop your analysis questions. That said, you are welcome to use a different data set if you prefer; just be sure to first confirm with the course staff.

Movie Data

This dataset contains some important statistics from a large sample of movies. The data includes the movie budget and revenue from different sources as well as ratings from RottenTomatoes and IMDB.

Download: csv file.

Sources: The Numbers, RottenTomatoes, IMDB

Flight Data

FAA data describing every commercial flight during the month of December 2009. For detailed descriptions of each data column in the attached file please see You are also welcome to download your own version of the file (which might include columns or time spans that were left out from this dataset) directly from

Note that Vadim has compiled data for the entire year of 2009; this dataset is extremely large and requires a relatively powerful computer to process interactively. It can be made available by request.

Download: zipped csv file.
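If a file is too large to explore interactively on your machine, one option is to work with a uniform random sample of its rows. The sketch below uses reservoir sampling, which reads the file once and keeps only `k` rows in memory. The generated rows and the column names `flight_id` and `delay_minutes` are stand-ins invented for the example, not the actual columns of the flight data.

```python
import csv
import io
import random

def reservoir_sample(rows, k, seed=0):
    """Return k rows chosen uniformly at random from an iterable of rows."""
    rng = random.Random(seed)
    sample = []
    for index, row in enumerate(rows):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                sample[j] = row
    return sample

# Stand-in for a large CSV file; in practice, pass csv.DictReader(open(path)).
lines = ["flight_id,delay_minutes"] + [f"{i},{i % 60}" for i in range(1000)]
reader = csv.DictReader(io.StringIO("\n".join(lines)))

sample = reservoir_sample(reader, k=50)
print(len(sample), "rows sampled")
```

Keep in mind that rare events, such as very long delays, may be underrepresented in a small sample, so findings are worth confirming against the full data when possible.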


Other Sources

Some other data sets can be obtained at InfoChimps.

Visualization Software

To create the visualizations, we will be using Tableau, a commercial database visualization tool that supports many different ways to interact with the data. Tableau has given us licenses so that you can install the software on your own computer. One goal of this assignment is for you to learn to use and evaluate the effectiveness of Tableau. Please talk to me if you think it won't be possible for you to use the tool. In addition to (or in lieu of) Tableau, you are free to also use other visualization tools as you see fit.

Submission Details

This is an individual assignment. You may not work in groups. Your completed assignment is due on Tue Oct 12, by end of day (11:59pm).

To submit your assignment, create a new wiki page with a title of the form:


You should also create a link to your submission in the list below.

