Information Technology Research Update
by Diomidis Spinellis
Department of Management Science and Technology
Athens University of Economics and Business (AUEB)
http://www.dmst.aueb.gr/dds/

Volume 3, Issue 1, July 27th, 2003

A free periodic newsletter providing summaries, analyses, insights, and commentaries on information technology research, brought to you by the Information Systems Technology laboratory:
http://istlab.dmst.aueb.gr

In this issue:
- Automatic Program Verification
- Information Technology and Economic Performance
- Peer-to-Peer Networks Used for Tracking Zebras
- ACM Transactions on Embedded Computing Systems
- Code Reading

Automatic Program Verification
------------------------------
(report from Microsoft's Faculty Summit 2003)

What is the state of the art in automatic program verification? Two presentations given at the 2003 Microsoft Faculty Summit, held in Cambridge, England, in the middle of July, provide some interesting insights.

Sir Tony Hoare, in an inspiring keynote presentation, proposed the goal of a verifying compiler as a Grand Challenge for computing research. According to Hoare's definition, a verifying compiler will use automated mathematical and logical reasoning to check the correctness of the programs it compiles. The criterion of correctness will be specified by types, assertions, and other redundant annotations associated with the program's code. The capabilities of the verifying compiler can be evaluated by applying it to legacy open-source code. The challenge was abandoned in the 1970s, but current hardware speed and capacity, our experience with safety-critical software, the open-source movement, advances in global program analysis and theorem-proving technology, and a unifying theory of concurrency and object orientation now make it feasible. The challenge is of a magnitude similar to that of other Grand Challenges, such as putting a man on the moon, curing cancer, or mapping the human genome, and easily satisfies the attributes a Grand Challenge should have: fundamental, astonishing, testable, inspiring, understandable, useful, historical, international, revolutionary, research-directed, challenging, incremental, co-operative, competitive, effective, and risky.

In a different talk, James Larus described tools developed and used internally at Microsoft to improve its software development. Two widely employed heuristic tools are PREfix and PREfast. PREfix performs a path-by-path interprocedural analysis and is computationally expensive, taking four weeks to run on the Windows kernel. PREfast works by running user-supplied plugins while traversing the compiler's abstract syntax tree; it runs on desktop machines and is easily customized. The two tools are widely deployed within Microsoft: 1/6 of the bugs fixed in Windows Server 2003 were apparently found by them. (Note that this statement is quite different from "the two tools uncovered 1/6 of the bugs in Windows Server 2003".)

PREfix locates errors by traversing the call graph through a limited set of paths, taking great care to be conservative and to avoid false positives. Some of the errors it can detect include:
- NULL pointer use
- Memory allocation errors
- Use of uninitialized values
- Library usage problems
- Invalid resource states
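To make these error classes concrete, here is a small contrived C program of my own devising (not an example taken from the tools themselves) that contains three of them; a PREfix-style path-by-path analysis can flag each defect together with the execution path that triggers it:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char *dup_name(const char *name)
    {
        /* Memory allocation error: malloc's result is never checked,
           so strcpy can write through a NULL pointer when memory
           is exhausted. */
        char *copy = malloc(strlen(name) + 1);
        strcpy(copy, name);
        return copy;
    }

    int main(void)
    {
        int count;                  /* never initialized */
        char *name = NULL;

        if (count > 0)              /* use of an uninitialized value */
            name = dup_name("zebra");
        printf("%c\n", name[0]);    /* NULL pointer use on the path where
                                       the branch above is not taken */
        free(name);
        return 0;
    }

Note that the NULL dereference is a defect of one specific path (the one on which dup_name() is never called); reporting it per path, rather than as a blanket warning about the function, is what keeps the false-positive rate down.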
PREfast works at a local level, examining the compiler's abstract syntax tree for error idioms. It is based on plugins written by individual developers; if I remember correctly, hundreds of plugins are in use within Microsoft. PREfast also uses an annotated version of the Windows APIs to detect buffer overflow errors (most WIN32 APIs receive a memory buffer and its size as separate arguments; I assume the annotations bind the two).

PREfix and PREfast were termed first-generation tools, in contrast to the second-generation tools, which are sound and based on declarative specifications and models.

SLAM performs software model checking. It receives as input the C source code and the API specification rules mentioned above and, through a systematic exploration of the program's state space, can reason about whether a feasible path can lead to a violation of the API specification. It works by converting the C source into a boolean program (removing all non-boolean elements) that can then be checked for specific paths. Typically the model checker is assisted by annotating the program with additional predicates. When applied to a set of 100 device drivers, SLAM uncovered more than 60 bugs.
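The following minimal sketch, written in plain C with hypothetical instrumented wrappers (SLAM's rules are actually written in a separate specification language, and the instrumentation is automatic), shows the kind of API usage rule involved and the single boolean of state needed to check it:

    #include <assert.h>

    /* The abstract rule state: the one boolean that survives in the
       boolean program after all non-boolean data is removed. */
    static int locked = 0;

    /* Hypothetical wrappers encoding the rule "acquire and release
       must strictly alternate". */
    static void acquire_lock(void)
    {
        assert(!locked);            /* violation: double acquire */
        locked = 1;
    }

    static void release_lock(void)
    {
        assert(locked);             /* violation: release without acquire */
        locked = 0;
    }

    int main(void)
    {
        int contended = 1;

        acquire_lock();
        if (contended)
            acquire_lock();         /* a feasible path violating the rule;
                                       a model checker reports its trace */
        release_lock();
        return 0;
    }

Everything except the predicates needed to decide the property (here, locked and the branch condition) is discarded by the abstraction, which is what makes exhaustive exploration of the state space tractable.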
ESP performs scalable error detection through a path-sensitive analysis. The goal is to locate program states that signify errors. The problem, of course, is taming the explosion of the state space: at each branch the tracked program state can potentially double. The combinatorial explosion is managed through three observations:
a) knowing the previous state can often determine a branch's direction,
b) states can be merged at control-flow join points, and
c) little of the code affects the tracked state.
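Again, a contrived C fragment of my own (not an example from the ESP work) shows how these observations play out when the tracked property is "every file opened is eventually closed":

    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        FILE *log = NULL;
        int debug = argc > 1;
        long sum = 0;
        int i;

        if (debug) {                /* the tracked state splits here:
                                       file open vs. no file */
            log = fopen("run.log", "w");
            if (log == NULL)
                return 1;
        }

        for (i = 0; i < 1000; i++)  /* observation c): this code never
                                       touches the tracked file state,
                                       so it adds no new states */
            sum += i;

        if (debug) {                /* observation a): this branch is
                                       correlated with the first one, so
                                       the tracked state determines its
                                       direction */
            fprintf(log, "sum = %ld\n", sum);
            fclose(log);            /* the file is closed on exactly the
                                       paths on which it was opened */
        }
        return 0;
    }

A path-insensitive analysis would merge the two states after the first branch and might then report a missing fclose() or a NULL log passed to fprintf(); ESP keeps the states apart only where the distinction matters and merges them everywhere else.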
Finally, Fugue checks the API-related behavior of programs by performing a static analysis on annotated versions of the program. It works on Microsoft's managed languages that support the specification of attributes (C#, VB, Managed C++, VJ#). Attributes are arbitrary programmer-specified declarations that can be added to the language's objects and are tracked throughout the compilation cycle. Custom attributes are added to the source to specify, for example, the state of a resource or the "nullness" of a value. Declarative API usage rules are then applied to verify the implementation against the contract.

More details are available at:
http://research.microsoft.com/spt
http://research.microsoft.com/pprc

Information Technology and Economic Performance
-----------------------------------------------

Since the early 1980s, researchers of information systems and economics have been debating whether the information technology revolution is paying off in higher productivity. The first studies found no correlation between IT investment and productivity at the level of firms, industries, or the economy. This finding was aptly termed the productivity paradox, and it has been an object of intense study ever since. In a recent issue of ACM Computing Surveys (35(1):1-28, March 2003), Dedrick, Gurbaxani, and Kraemer critically review the empirical evidence of more than 50 articles on computers and productivity, based on a solid framework for evaluating the research. The review concludes that the productivity paradox has been effectively refuted at both the firm and the country level.

Peer-to-Peer Networks Used for Tracking Zebras
----------------------------------------------

If you think that peer-to-peer networks are used only for swapping CD tracks, you are wrong. In a paper presented at the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), Philo Juang and his colleagues discuss how the movement of zebras on the open lands of Kenya can be tracked by fitting each animal with a collar carrying a GPS receiver, two transmitters, and memory for storing the collected data. The zebras exchange their positional data with one another when they meet, for example near a watering hole, using the low-power transmitter. Researchers can thus obtain the data of multiple zebras by locating a single one and querying it through the high-powered transmitter. The system's design goal is to allow the zebras to move without any human intervention for periods of at least one year. (Energy-Efficient Computing for Wildlife Tracking: Design Tradeoffs and Early Experiences with ZebraNet, pp. 96-107.)

ACM Transactions on Embedded Computing Systems
----------------------------------------------

Embedded computers, which now drive everything from toys and cars to household appliances and telephones, are the subject of ACM's recently launched quarterly journal, Transactions on Embedded Computing Systems:
http://www.acm.org/tecs/

Code Reading
------------

[With apologies for the blatant self-promotion.]

My book "Code Reading: The Open Source Perspective" (Addison-Wesley, 2003) is now available at online and bricks-and-mortar bookstores. The book details how to read and understand software source code; all 600 examples used in it are drawn from widely used open-source projects. The publication inaugurates Addison-Wesley's Effective Software Development Series, edited by Scott Meyers. I have set up a web site for the book, linked from my home page.

We read computer code when maintaining or extending a software system. We also read code when scavenging another project for code to reuse. Sometimes we read code when inspecting the work of others, when learning how our colleagues solved a particular problem, or even as literature, to expand our minds. Reading code is likely to be one of the most common activities of a computing professional, yet it is seldom taught as a subject or formally used as a method for learning how to design and program.

One reason for this sad situation may originally have been the lack of real-world, high-quality code to read. Companies often protect source code as a trade secret and rarely allow others to read, comment on, experiment with, and learn from it. In the few cases where important proprietary code was allowed out of a company's closet, it spurred enormous interest and creative advancements. As an example, a generation of programmers benefited from John Lions's Commentary on the Unix Operating System, which listed and annotated the complete source code of the sixth-edition Unix kernel. Although Lions's book was originally written under a grant from AT&T for use in an operating systems course and was not available to the general public, copies of it circulated for years as bootleg nth-generation photocopies.

In the last few years, however, the popularity of open-source software has provided us with a large body of code we can all freely read. Some of the most popular software systems in use today, such as the Apache Web server, the Perl language, the GNU/Linux operating system, the BIND domain name server, and the sendmail mail-transfer agent, are in fact available in open-source form.

I was thus fortunate to be able to use open-source software such as the above to write this book as a primer and reader for software code. My goal was to provide the background knowledge and techniques needed for reading code written by others. By using real-life examples taken from working open-source projects, I tried to cover most concepts related to code that are likely to appear before a software developer's eyes, including programming constructs, data types, data structures, control flow, project organization, coding standards, documentation, and architectures. This book is, as far as I know, the first to deal exclusively with code reading as a distinct activity, one worthy of study in its own right.