Thesis proposal: How should static analysis tools explain anomalies to developers?

Eclipse Explanations

On April 26, 2016, I presented my thesis proposal to a committee of five members: Dr. Emerson Murphy-Hill (Chair), Dr. Jing Feng (Graduate School Representative), Dr. Shriram Krishnamurthi (External Member), Dr. James Lester, and Dr. Christopher Parnin.

I received a conditional pass: a formal re-examination is not required, but the committee expects additional revisions before approving the proposal.

I suspect some students do not even realize that they have received a conditional pass, since the outcome does not appear to be recorded anywhere that students can access.

In the weeks that followed, I made several revisions to the thesis proposal document, incorporating feedback from the presentation:

  1. The committee reduced the scope of required experiments from five to three.
  2. The committee added a new requirement that I conduct a systematic literature review on static analysis notification techniques.
  3. I added a thesis contract to explicitly state the dissertation deliverables.

On May 11, 2016, I submitted the revised proposal to the committee.

On May 20, 2016, I was notified that the committee had approved the revisions.

Although some students prefer to keep their thesis proposal private until graduation, I have made my proposal and presentation materials available in the hope that they help other students structure their own proposals:

Abstract

Despite the advanced static analysis tools available within modern integrated development environments (IDEs) for detecting anomalies, the error messages these tools produce to describe these anomalies remain perplexing for developers to comprehend. This thesis postulates that tools can computationally expose their internal reasoning processes to generate assistive error explanations in a way that approximates how developers explain errors to other developers and to themselves. Compared with baseline error messages, these error explanations significantly enhance developers’ comprehension of the underlying static analysis anomaly. The contributions of this dissertation are: 1) a theoretical framework that formalizes explanation theory in the context of static analysis anomalies, 2) a set of experiments that evaluate the extent to which evidence supports the theoretical framework, and 3) a proof-of-concept IDE extension, called Radiance, that applies my identified explanation-based design principles and operationalizes these principles into a usable artifact. My work demonstrates that tools stand to significantly benefit if they incorporate explanation principles in their design.

The Bones of the System: A Case Study of Logging and Telemetry at Microsoft

Our full paper, The Bones of the System: A Case Study of Logging and Telemetry at Microsoft, has been accepted to the International Conference on Software Engineering, Software Engineering in Practice Track (ICSE SEIP 2016). ICSE is hosted this year in Austin, Texas.

The abstract of the paper follows:

Large software organizations are transitioning to event data platforms as they culturally shift to better support data-driven decision making. This paper offers a case study at Microsoft during such a transition. Through qualitative interviews of 28 participants, and a quantitative survey of 1,823 respondents, we catalog a diverse set of activities that leverage event data sources, identify challenges in conducting these activities, and describe tensions that emerge in data-driven cultures as event data flow through these activities within the organization. We find that the use of event data spans every job role in our interviews and survey, that different perspectives on event data create tensions between roles or teams, and that professionals report social and technical challenges across activities.

I am delighted to have been able to collaborate with Microsoft Research for this study. Thanks to Robert DeLine, Steven Drucker, and Danyel Fisher, the co-authors of the paper.

Challenges in Using Event Data


Migrating from PHP Markdown to Jetpack Markdown

I’ve had this blog since 2004, less than a year after the first release of WordPress. Since then, I’ve migrated the blog to each new WordPress release.

Unfortunately, each migration brings additional technical debt. For example, beginning with WordPress 2.2, the default character set for databases changed from latin1 to utf8. Performing this conversion is a tedious, manual process, and over the years I've converted columns as needed to support modern character sets (such as when I needed the Unicode ♥ symbol).

Until now, a blocking problem has been that the PHP Markdown plugin has bugs that cause it to incorrectly render certain advanced HTML content, such as the markup found in shortcodes. Unfortunately, the plugin entered maintenance mode in February 2013 and is no longer actively developed.

Problem

  • PHP Markdown stores its post_content in Markdown form in the wp_posts table. The PHP Markdown plugin, just before displaying a post, translates this Markdown text into HTML.
  • A consequence is that deactivating this plugin means that post content no longer appears as HTML. That’s bad.
  • Jetpack Markdown, the candidate replacement plugin, stores its post_content in HTML, and keeps the Markdown content in a separate post_content_filtered column. The advantage of this approach is that posts render correctly even if the plugin is deactivated. The design trade-off is that the database must store both the HTML and Markdown forms of the content.

There’s an impedance mismatch in that the two plugins translate from Markdown to HTML at different points in the content process.

Migration

The migration involves iterating through every WordPress post and copying the Markdown form of post_content into post_content_filtered. At this point, post_content and post_content_filtered both contain the Markdown form of the content.

Next, for each post, re-run the Markdown function (from markdown.php) and replace the post_content column with the HTML version of the content. That is:

$to_html = Markdown($post_content);

Finally, the new Jetpack Markdown plugin stores metadata for each post by adding a _wpcom_is_markdown key to posts that use Markdown. Thus, insert rows into wp_postmeta to reflect this:

INSERT INTO wp_postmeta
(post_id, meta_key, meta_value)
VALUES
(post_id, '_wpcom_is_markdown', '1');

Done

The database is now migrated to a form that can be used by Jetpack Markdown.
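The three steps above can be sketched end to end. The sketch below is a minimal illustration in Python, using an in-memory SQLite database as a stand-in for the MySQL wp_posts and wp_postmeta tables, and a trivial markdown_to_html() stub in place of the real Markdown() function from markdown.php; only the columns the migration actually touches are modeled.

```python
import re
import sqlite3

def markdown_to_html(text):
    # Stand-in for Markdown() from markdown.php; as a minimal
    # illustration, it only converts *emphasis* and wraps in <p>.
    return "<p>" + re.sub(r"\*(.+?)\*", r"<em>\1</em>", text) + "</p>"

# In-memory stand-ins for wp_posts and wp_postmeta.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE wp_posts (ID INTEGER PRIMARY KEY,
                           post_content TEXT,
                           post_content_filtered TEXT);
    CREATE TABLE wp_postmeta (post_id INTEGER,
                              meta_key TEXT,
                              meta_value TEXT);
    INSERT INTO wp_posts (ID, post_content, post_content_filtered)
    VALUES (1, 'I *heart* Hacker News', '');
""")

# Step 1: preserve the Markdown source in post_content_filtered.
db.execute("UPDATE wp_posts SET post_content_filtered = post_content")

# Step 2: replace post_content with the rendered HTML form.
for post_id, content in db.execute(
        "SELECT ID, post_content FROM wp_posts").fetchall():
    db.execute("UPDATE wp_posts SET post_content = ? WHERE ID = ?",
               (markdown_to_html(content), post_id))

# Step 3: flag each migrated post so Jetpack Markdown recognizes it.
db.executemany(
    "INSERT INTO wp_postmeta (post_id, meta_key, meta_value) "
    "VALUES (?, '_wpcom_is_markdown', '1')",
    [(pid,) for (pid,) in db.execute("SELECT ID FROM wp_posts")])

html, md = db.execute(
    "SELECT post_content, post_content_filtered FROM wp_posts").fetchone()
print(html)  # <p>I <em>heart</em> Hacker News</p>
print(md)    # I *heart* Hacker News
```

After the run, post_content holds HTML (so posts render even with the plugin deactivated) while post_content_filtered retains the original Markdown for editing.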

Timeful: My E-mail Policy

Due to increased demands on my time and the need to minimize distractions, I am posting an official policy on my use of e-mail.

This policy is effective 11/27/2015, and is available at:

http://go.barik.net/timeful

E-mails sent to my personal account or University account are now batched: your e-mail will be intentionally held until the next of the following time boundaries (all listed times are Eastern):

  • Monday through Friday: 9 AM, 1 PM, 3 PM, 7 PM. If you send an e-mail after 7 PM, I will receive your e-mail at 9 AM the following day.

  • Saturday and Sunday: 9 PM only. Weekends are reserved for time with my family.

If you need to contact me urgently, you may send me a text message at 251-454-1579.


Social Media Addendum

  • I check Facebook once a week, usually on Friday night.

I ♥ Hacker News


Our short paper, I ♥ Hacker News: Expanding Qualitative Research Findings by Analyzing Social News Websites, has been accepted to the joint meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE 2015, New Ideas). ESEC/FSE is hosted this year in Bergamo, Italy.

The abstract of the paper follows:

Grounded theory is an important research method in empirical software engineering, but it is also time consuming, tedious, and complex. This makes it difficult for researchers to assess if threats, such as missing themes or sample bias, have inadvertently materialized. To better assess such threats, our new idea is that we can automatically extract knowledge from social news websites, such as Hacker News, to easily replicate existing grounded theory research — and then compare the results. We conduct a replication study on static analysis tool adoption using Hacker News. We confirm that even a basic replication and analysis using social news websites can offer additional insights to existing themes in studies, while also identifying new themes. For example, we identified that security was not a theme discovered in the original study on tool adoption. As a long-term vision, we consider techniques from the discipline of knowledge discovery to make this replication process more automatic.

Improving Error Notification Comprehension in IDEs by Supporting Developer Self-Explanations

Explanatory interface mockup, conceptualized in the Eclipse IDE.

My second graduate consortium submission, Improving Error Notification Comprehension in IDEs by Supporting Developer Self-Explanations, has been accepted to the IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC) in Atlanta, Georgia. The abstract of the paper follows:

Despite the advanced static analysis techniques available to compilers, error notifications as presented by modern IDEs remain perplexing for developers to resolve. My thesis postulates that tools fail to adequately support self-explanation, a core metacognitive process necessary to comprehend notifications. The contribution of my work will bridge the gap between the presentation of tools and interpretation by developers by enabling IDEs to present the information they compute in a way that supports developer self-explanation.

You can compare this submission with my prior VL/HCC Graduate Consortium.

Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets

Fuse Distribution

Our paper, Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets, has been accepted to the 12th Working Conference on Mining Software Repositories.

Fuse is obtained by filtering through 1.9 petabytes of raw data from Common Crawl, using Amazon Web Services. See our Fuse Spreadsheet Corpus project page for details on obtaining the spreadsheets and using the spreadsheet metadata.

The abstract of the paper follows:

Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called Fuse, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of Fuse with other corpora.