Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets

Fuse Distribution

Our paper, Fuse: A Reproducible, Extendable, Internet-scale Corpus of Spreadsheets, has been accepted to the 12th Working Conference on Mining Software Repositories.

Fuse is obtained by filtering through 1.9 petabytes of raw data from Common Crawl, using Amazon Web Services. See our Fuse Spreadsheet Corpus project page for details on obtaining the spreadsheets and using the spreadsheet metadata.

The abstract of the paper follows:

Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called Fuse, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs of Fuse with other corpora.

Microsoft Research Internship

Microsoft Research

I’m happy to announce that I’ve accepted an offer to intern at Microsoft Research this summer, from June 1 through August 21, in Redmond, Washington. I’ll be working under the direction of Rob DeLine, Principal Researcher and Group Manager of Human Interactions in Programming (HIP).

Human Interactions in Programming

HIP works at the intersection of HCI, CSCW, and Software Engineering. The group uses a human-centered approach to develop tools to support software engineers and teams.

Can Social Screencasting Help Developers Learn New Tools?

Screencasting Tool Usages

Our short paper, Can Social Screencasting Help Developers Learn New Tools?, has been accepted to the 8th International Workshop on Cooperative and Human Aspects of Software Engineering (CHASE 2015). The workshop is colocated with ICSE and will be hosted in Florence, Italy.

The lead author of the paper is Kevin Lubick. The abstract of the paper follows:

An effective way to learn about software development tools is by directly observing peers’ workflows. However, these tool knowledge transfer events happen infrequently because developers must be both colocated and available. We explore an online social screencasting system that removes the dependencies of colocation and availability while maintaining the beneficial tool knowledge transfer of peer observation. Our results from a formative study indicate these online observations happen more frequently than in-person observations, but their effects are only temporary. We conclude that while peer observation facilitates online knowledge transfer, it is not the only component — other social factors may be involved.

Commit Bubbles

Our paper, Commit Bubbles, has been accepted to ICSE 2015: New Ideas and Emerging Results.

Commit Bubbles Interface

The abstract of the paper follows:

Developers who use version control are expected to produce systematic commit histories that show well-defined steps with logical forward progress. Existing version control tools assume that developers also write code systematically. Unfortunately, the process by which developers write source code is often evolutionary, or as-needed, rather than systematic. Our contribution is a fragment-oriented concept called Commit Bubbles that will allow developers to construct systematic commit histories that adhere to version control best practices with less cognitive effort, and in a way that integrates with their as-needed coding workflows.

In other words, Commit Bubbles aims to alleviate the “tangled commit” and “non-descriptive commit message” dilemmas that developers routinely encounter when constructing version control commit histories:

Git Commit (xkcd)

Hadoop 2.6.0 Windows 64-bit Binaries

Hadoop

The official release of Apache Hadoop 2.6.0 does not include the binaries (e.g., winutils.exe) required to run Hadoop on Windows. To use Hadoop on Windows, you must compile it from source. This takes a bit of effort, so I’ve provided a pre-compiled, unofficial distribution below:

I compiled the source using:

Then, using the Windows SDK 7.1 Command Prompt or Visual Studio Command Prompt (2010):

set JAVA_HOME=C:\PROGRA~1\Java\jdk1.7.0_71
set Platform=x64

The build system requires that you use the 8.3 short filename for JAVA_HOME (no spaces!). Environment variable names (here, Platform) are also case sensitive. Finally:

mvn package -Pdist -DskipTests -Dtar

The binaries will be available in hadoop-dist/target.
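
Since a JAVA_HOME containing spaces is an easy mistake to make, it may help to check the value before starting the long Maven build. The following is a minimal Python sketch of such a check, not part of the Hadoop build itself; the function name java_home_problems is hypothetical, and it only tests the "no spaces" rule described above rather than validating a full 8.3 short path.

```python
import os


def java_home_problems(java_home):
    """Return a list of problems with a JAVA_HOME value (hypothetical helper).

    Only checks the rule stated in the post: the path must be set and must
    not contain spaces (for example, use C:\\PROGRA~1 rather than
    C:\\Program Files).
    """
    problems = []
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif " " in java_home:
        problems.append("JAVA_HOME contains spaces; use the 8.3 short path")
    return problems


if __name__ == "__main__":
    # Read the value from the current environment and report any problems.
    for problem in java_home_problems(os.environ.get("JAVA_HOME", "")):
        print(problem)
```

Running the script from the same command prompt used for the build prints nothing when JAVA_HOME looks usable, and one line per problem otherwise.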

Unknown Device in Windows 8 on Dell Vostro 3500

On a clean install of Windows 8 or 8.1 on a Dell Vostro 3500, you may notice an Unknown device listed under Other devices in Device Manager:

Unknown device

Under Driver Details, you should also see (ACPI\SMO8800\1):

http://static.barik.net/drivers/FFS_ST_W78_A00_Setup-KT7XG_ZPE.exe

This unknown device is actually the ST Microelectronics Free Fall Sensor. While the last supported operating system for this machine is Windows 7, you can install a generic version of the driver from the Dell KB article: Sudden Motion Sensor drivers are not installed in Windows 8 and Windows 8.1.

How Developers Visualize Compiler Messages

Explanatory Visualization for Ambiguous Reference

Our paper, How Developers Visualize Compiler Messages: A Foundational Approach to Notification Construction, has been accepted to the 2nd IEEE Working Conference on Software Visualization (VISSOFT 2014).

The abstract of the paper follows:

Self-explanation is one cognitive strategy through which developers comprehend error notifications. Self-explanation, when left solely to developers, can result in a significant loss of productivity because humans are imperfect and bounded in their cognitive abilities. We argue that modern IDEs offer limited visual affordances for aiding developers with self-explanation, because compilers do not reveal their reasoning about the causes of errors to the developer.

The contribution of our paper is a foundational set of visual annotations that aid developers in better comprehending error messages when compilers expose their internal reasoning. We demonstrate through a user study of 28 undergraduate Software Engineering students that our annotations align with the way in which developers self-explain error notifications. We show that these annotations allow developers to give significantly better self-explanations when compared against today’s dominant visualization paradigm, and that better self-explanations yield better mental models of notifications.

The results of our work suggest that the diagrammatic techniques developers use to explain problems can serve as an effective foundation for how IDEs should visually communicate to developers.