Carnegie Mellon University

Research

Carnegie Mellon’s Institute for Software Research (ISR) hosts an active research group with a highly interdisciplinary approach to software engineering. Indeed, we believe that interdisciplinary work is inherent to software engineering. The field of software engineering (SE) is built on computer science fundamentals, drawing from areas such as algorithms, programming languages, compilers, and machine learning. At the same time, SE is an engineering discipline: both the practice of SE and SE research problems revolve around technical solutions that successfully resolve conflicting constraints. As such, trade-offs between costs and benefits are an integral part of evaluating the effectiveness of methods and tools.

Emerging problems in the area of privacy, security, and mobility motivate many challenges faced by today’s software engineers, motivating new solutions in SE research. Because software is built by people, SE is also a human discipline, and so research in the field also draws on psychology and other social sciences. Carnegie Mellon faculty bring expertise from all of these disciplines to bear on their research, and we emphasize this interdisciplinary approach in our REU Site.

Below, we list broad areas of software engineering research that we anticipate summer students could work on, but note that we anticipate that this list may evolve as the summer approaches:

Mentor: Heather Miller

Description and significance
Nowadays, most apps you encounter are in some sense distributed. Everything from mobile games to the day-to-day apps you depend on for ridesharing or getting around town. These apps often make requests to dozens or even hundreds of remote services for help in performing some task like locating an available driver, or collaborating with another player in real-time in a game. While these sorts of applications are commonplace, they are nonetheless notoriously difficult to program. Such applications are typically broken up into individual *services* which handle requests from clients, with each service typically developed by entirely different teams of developers. Services are then often intended to depend on each other–the output of one service might be the input of another. Often, due to this style of development, and this architecture, individual services make incorrect assumptions about what other services may expect as input or output, leading to a buggy user experiences. This research will explore taking a programming languages approach to _composing_ these services, and ensuring in their composition that assumptions or invariants between services are not invalidated, leading to easier to build and easier to evolve and develop microservice architectures.

Student involvement
Students who work on this project will be expected to have a background in Computer Science (CS) or Software Engineering (SE), and to be proficient in one or more programming languages. Students will learn how contemporary apps are architected in the interest of scalability, and will obtain hands-on experience in programming language design for distributed computing. The students will start by surveying several microservice architectures in order to build an set of inspirational examples that must be supported by our microservice configuration language. Once a sufficient test corpus of applications is had, the students will move on to exploring ways to implement the required static and dynamic checks to enable sound composition between services. Finally, the students will perform an empirical analysis in an effort to understand the ways in which the proposed microservice configuration language eases the development burden on developers.

 

Mentors: David Garlan and Eunsuk Kang

Description and significance
When implementing new features or changing existing software, developers have to make trade-offs between quality attributes, such as security, performance, power consumption, reliability, and availability. To make optimal design decisions, developers have to be experts in all of the involved domains. However, nowadays, software is often developed by non-domain-experts. Additionally, design decisions that involve quality attribute trade-offs are typically implicit and not directly observable in the code. So it is often difficult to trace non-functional requirements and quality attribute trade-offs to implementations or adjust the software if those requirements change. The ABLE research group (http://www.cs.cmu.edu/Groups/able/) has been creating techniques that allow software architects to reason about quality attributes and their trade-offs to navigate the design space and develop appropriate designs for decades.

This research will explore how to simplify the development of complex software systems by making design decisions and quality attribute trade-offs first class.

Student involvement
Students working on this project will design, develop, and evaluate a library of common design decisions, such as caching, communication mechanisms, securing software, or performance optimization. Students are expected to have good programming and software engineering skills, as well as knowledge of at least one application domain (e.g., security, resource management, databases, or others). Students will learn how to apply their development skills in a research setting, how to conduct an academic literature review of existing solutions, and how to evaluate implementations scientifically. Further, students will train presentation skills and academic writing skills to document their results.

 

Mentor: Eunsuk Kang

Description and significance
Software is increasingly being used to control systems that closely interact with physical entities in the world, and whose failures may result in irreversible damage to users and our environment. These so called cyber-physical systems (CPSs), ranging from autonomous vehicles and drones to medical devices, pose new challenges in software engineering. Unlike traditional software applications, these systems are deployed in a highly dynamic, uncertain environment, and often rely on fragile assumptions about the behavior of users and other environmental agents (some of which may be malicious). Due to this dynamic, adversarial nature, software in CPSs must be designed to handle unexpected interactions with the environment, and guarantee an acceptable level of safety and security even in presence of component failures. Dr. Kang and his collaborators are working on (1) new principles and techniques for designing robust software [1,2] and (2) algorithms and tools for reasoning about critical properties of CPSs [3,4].

Student involvement
In this project, students will develop new techniques for designing safe and secure CPSs. Potential research tasks include developing (1) a language for specifying the behavior of a complex CPS at a high-level of abstraction, (2) a technique for automatically analyzing a system for potential vulnerabilities, and (3) an architectural mechanism for guaranteeing safety and security even in presence of unexpected inputs. Students will also get an hands-on experience applying these techniques to real-world case studies, such as autonomous drones, intelligent vehicles, and medical devices. Students are expected to have a basic background in logic and discrete mathematics, and will learn about the-state-of-the-art techniques in formal methods, CPS design, software analysis, and programming languages.

[1] Property-Driven Runtime Resolution of Feature Interactions. Santhana Gopalan Raghavan, Kosuke Watanabe, Eunsuk Kang, Chung-Wei Lin, Zhihao Jiang, and Shinichi Shiraishi. International Conference on Runtime Verification (RV), 2018.

[2] An Architectural Mechanism for Resilient IoT Services. Hokeun Kim, Eunsuk Kang, Edward A. Lee, and David Broman. ACM Workshop on the Internet of Safe Things (SafeThings), 2017.

[3] Model-Based Security Analysis of a Water Treatment System. Eunsuk Kang, Sridhar Adepu, Daniel Jackson, and Aditya P. Mathur. ICSE Workshop on Smart Cyber-Physical Systems, 2016.

[4] Multi-Representational Security Analysis. Eunsuk Kang, Aleksandar Milicevic, and Daniel Jackson. Symposium on the Foundations of Software Engineering (FSE), 2016.

Description and significance 
Intelligent agents on smartphones like Siri, Alexa and Google Assistant have become increasingly popular in recent years. They can perform various kinds of tasks on the user's behalf, but their capabilities are limited to the apps and services they support, without a way for users to teach them new tasks and new concepts. Prof. Brad Myers and PhD student Toby Li in the HCII (along with collaborators across SCS) are currently working a research project on designing and implementing a multi-modal intelligent smartphone agent that enables the users to teach it new procedures, concepts and user preferences by demonstration and natural language instructions. This new agent can allow non-developers to extend the capability of the agent to support their new personalized tasks, and to perform known tasks to their preferences. We have finished the development of a first version of the system with support in-task procedure generalization [1], error handling [3], natural language disambiguation [2] and smart home device control [4]. More details about this project is available at http://toby.li/research/ and http://www.cs.cmu.edu/~NatProg/sugilite.html.

Student involvement
Students will use human-centered methods [5] to analyze underlying problems and user requirements, design new features, implement these new features, and conduct evaluations and testing on the software artifacts.  Specifically, we are tackling challenges such as allowing users to teach the agent more complex task structures, supporting users in editing and repurposing existing scripts and previously taught concepts, enabling the agent to better understand users’ flexible natural language instructions, and designing secure and privacy-aware approaches to support cross-user scripts and knowledge sharing between agents. Students may create interviews or surveys, conduct lab usability studies, crowdsourcing studies or field deployment studies, and design prototypes for user interfaces. Students will also implement the system in Java in the Android environment, and may develop system components for semantic parsing, machine learning, and dialog systems.

[1] Toby Jia-Jun Li, Amos Azaria, and Brad A. Myers. 2017. SUGILITE: Creating Multimodal Smartphone Automation by Demonstration. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17).

[2] Toby Jia-Jun Li, Igor Labutov, Xiaohan Nancy Li, Xiaoyi Zhang, Wenze Shi, Tom M. Mitchell, and Brad A. Myers. 2018. APPINITE: A Multi-Modal Interface for Specifying Data Descriptions in Programming by Demonstration Using Verbal Instructions. In Proceedings of the 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC 2018).

[3] Toby Jia-Jun Li, Igor Labutov, Brad A. Myers, Amos Azaria, Alexander I. Rudnicky, and Tom M. Mitchell. 2018. An End User Development Approach for Failure Handling in Goal-oriented Conversational Agents. In Studies in conversational UX Design. Springer.

[4] Toby Jia-Jun Li, Yuanchun Li, Fanglin Chen, and Brad A. Myers. 2017. Programming IoT Devices by Demonstration Using Mobile Apps. In End-User Development, 3–17. 

[5] B. A. Myers, A. J. Ko, T. D. LaToza, and Y. Yoon. 2016. Programmers Are Users Too: Human-Centered Methods for Improving Programming Tools. Computer 49, 7: 44–52.

Mentor: Lorrie Cranor

Description and significance
Users gain security and privacy knowledge formally via perfunctory tutorial videos and mandatory workplace trainings, and informally via channels like family, TV crime show folklore, and pop-up warnings. Even when sound advice is noticed, understood and retained, users often rationally ignore it due to high effort costs and low apparent benefits at scale. Successful security and privacy education must overcome educational barriers by focusing on at-risk users and creating content that is personally meaningful and emotionally compelling. One promising method for public communication that has shown results in public health and health ethics is public communication via drama. We're working on a highly immersive and narrative play on data ethics and privacy that is suitable and scalable for the broader public.

Student involvement
Student working on the project will be required to have strong communication skills, a propensity toward critical thinking, and an enthusiasm for ethics, technology, and education. The students will assist in designing and managing a pilot test for the show. The students will gain an understanding in arts-based methods, qualitative research, and survey design and analysis. No specific technical skills are required, but a familiarity with programming and statistics will be helpful. However, students with a background outside of traditional tech fields who are enthusiastic about technology are also encouraged to apply.

 

Mentor: Limin Jia

Description and significance
One of Professor Jia’s research area is information flow security, which includes a wide range of techniques to protect users' private information and to identify vulnerabilities in platforms such as web browsers, mobile operating systems, and IoT applications. By tagging data and processes based on the type of secrets they contain and are allowed to receive (e.g., contact secret or no secret), enforcement mechanisms can be put into place to prevent secrets from being leaked to attackers. Similarly, by tagging data with labels representing their provenance (or trust level) and API arguments with required trust level, one can leverage these labels to identify code injection attacks (e.g., Cross-Site Scripting vulnerabilities). We are developing both theories and tools to improve information flow security in software.

Student involvement
There are three types of projects that students can be involved in. The first is web-measurement studies that leverage our (modified) Chromium to crawl websites to study specific kinds of web security issues, for example, XSS. Students interested in measurement studies will use and modify existing software infrastructures to collect data and conduct data analysis. The second is to develop bug detection tools to identify code injection attacks. Students will implement program analysis algorithms to find bugs in real applications. The third is constructing abstract models for mechanisms that enforce information flow security in real systems such as web applications and IoT applications. Students will develop formal models including labelled transition systems and type systems and operational semantics and construct proofs (noninterference proofs) for the model.

Mentors: Josh Sunshine, Jonathan Aldrich, Brad Myers

Description and Significance
Security vulnerabilities in blockchain programs (sometimes called “smart contracts”) have been repeatedly used to steal millions of dollars of assets. We are designing and implementing Obsidian, a new programming language for writing software for blockchain platforms. Obsidian allows the compiler to detect common kinds of bugs, facilitating safer software development. However, it does not suffice to provide safety properties if programmers are not able to use the language effectively. In order to improve programmers’ effectiveness using Obsidian, we are using a user-centered design process, which incorporates data from programmers into the design process. We also plan to compare Obsidian to existing languages in a user study to show that programmers are more effective in Obsidian than in existing languages.

Student Involvement
We are looking for students with an interest in programming language design and implementation to work on the Obsidian project. Ideally, students would have experience with the theory of programming languages, with compiler implementation, or with the methods of human-computer interaction There are several different possible projects for students, depending on student interest and project needs. Possible projects include improving how Obsidian works on real blockchain platforms; implementing software in Obsidian to help evaluate the language; working on the compiler so that it correctly implements the current language design; and designing and running user studies of the language.

Past REU students have won second place in the SPLASH student research competition [1] and published a paper at PLATEAU [2].

[1] Paulette Koronkevich. "Obsidian in the Rough: A Case Study Evaluation of a New Blockchain Programming Language". SPLASH 2018 Student Research Competition. Second place winner, undergraduate divison.

[2] Celeste Barnaby, Michael Coblenz, Tyler Etzel, Eliezer Kanal, Joshua Sunshine, Brad Myers, Jonathan Aldrich. "A User Study to Inform the Design of the Obsidian Blockchain DSL." 8th Workshop on Evaluation and Usability of Programming Languages and Tools (PLATEAU'2017) at SPLASH 2017, October 23, 2017.

Mentors:  Joshua Sunshine, Keenan Crane, and Jonathan Aldrich

Description and significance
Every day, millions of professionals, teachers, and students in science, technology, engineering, and mathematics fields (STEM) work in the abstract world of mathematical structures, logical relationships, and computational processes. These abstractions are often presented in a textual and notation-heavy way, especially in formal literature, depriving readers of the powerful visual and spatial intuition that is fundamental to building deep understanding. People in STEM communicate informally by drawing diagrams, but they often omit these sketches in the literature because it takes a tremendous time of time and expertise to create mathematical diagrams. Existing tools such as Mathematica, TikZ, and Adobe Illustrator lack the domain knowledge needed for creating mathematical diagrams, forcing the user to work at a low level.
What if anyone could simply type some mathematical notation and instantly see it visualized in several different ways? To make this vision a reality, our team is building a platform called Penrose. The Penrose platform comprises two domain-specific languages, Substance and Style, that are designed to enable users to write natural mathematical notation and separately specify the visual representation of notation in a domain. Given a Substance and Style program pair, the Penrose compiler transforms the problem of producing a diagram into a problem of numerical optimization, which is solved by an optimization runtime. The Penrose platform renders the solutions to the optimization problem as beautiful, automatic, custom mathematical diagrams. We have implemented a prototype platform for visualizing set theory [1,2] and are expanding it to the domains of linear algebra, real analysis, raytracing, meshes, and neural networks.

Student involvement
Students who work with us will join a highly interdisciplinary team spanning the fields of programming language design, software engineering, graphics, and visualization. Past students have had the opportunity to design and build fundamental parts of the Penrose platform, including the first Style compiler, a graphics API akin to “OpenGL for diagrams,” and an extensible syntactic sugar mechanism for mathematical notation. Future students will have many cross-cutting opportunities, including designing methods for automatic diagram optimization and layout, designing extensibility mechanisms for Substance and Style, working with mathematical domain experts to build Penrose's standard library, and conducting user studies on the usability of Substance and Style. Students will gain experience in performing human-centered domain-specific language design, extending fundamental graphics methods such as optimization and sampling, and conducting interdisciplinary research.

[1] K. Ye, J. Sunshine, J. Aldrich, and K. Crane. Designing extensible, domain-specific languages for mathematical diagrams. In Proc. Off the Beaten Track Workshop (OBT), 2017.

[2] W. Ni, K. Ye, J. Sunshine, J. Aldrich, and K. Crane. Substance and style: domain-specific languages for mathematical diagrams. In Proc. Workshop on Domain-Specific Language Design and Implementation (DSLDI), 2017.

Mentor: Travis Breaux

Description and significance
Mobile and Internet of Things devices use context to deliver rich, personalized services to end users. However, these services often collect sensitive personal information about end users, which can expose end users to privacy threats and vulnerabilities. To help developers plan new and exciting applications that leverage our personal tastes, preferences and behaviors, the Requirements Engineering Lab at CMU led by Dr. Breaux has begun developing tools to formally express privacy requirements across distributed, service-oriented systems. This includes an initial language syntax and semantics, implemented in Description Logic (DL) and Java, with tool support for detecting conflicts among different privacy policies (Breaux, Hibshi & Rao, 2014), and for verifying critical privacy properties across multi-party data flows(Breaux, Smullen & Hibshi, 2015).
Recently, the lab has been developing tools to extract meaningful requirements from user stories and privacy policies by combining crowdsourcing and natural language processing (NLP) techniques. Early versions of these tools have been written in Java and Python and have been applied to extract and construct data type lexicons (Bhatia & Breaux, 2015), build ontologies (Evans, Bhatia & Breaux, 2016; Hosseini, Breaux & Niu, 2018) and build frame-based models of requirements (Bhatia & Breaux, 2018). Presently, we are interested in designing interfaces for developers to use this knowledge in making better design decisions to improve user privacy.

Student involvement
Students will participate in the design and implementation of crowdsourcing and NLP-based tools to advance requirements extraction from text. Design tasks begin with a literature review to understand state-of-the-art NLP techniques, followed by discussions with the faculty advisor and graduate students about how to adapt these techniques to existing datasets. In addition, students can participate in designing and executing crowd worker tasks to collect annotations, and evaluating annotation data based on worker consensus, performance metrics, and task reliability. Example NLP techniques include phrase structure grammars and typed dependencies, and feature engineering and deep learning to perform named entity recognition and semantic role labeling.

Mentors: Jonathan Aldrich

Description and significance
The Plaid research group, led by Dr. Aldrich, applies novel programming language ideas to software engineering challenges, validating those ideas both theoretically and empirically. With his students, he is currently designing the Wyvern programming language, focused on exploring new language features for security and adaptability. Wyvern provides a pure object model that explores foundational issues in object-oriented type tests [1] and a capability-based module system that can help to assure a broad range of architectural security properties [2]. It also provides an extensible syntax and type system, enabling convenient and safe command libraries that forestall command injection attacks [3, 4].

Student involvement
The projects described above include extensive involvement of undergraduates; in fact, every paper cited above includes an undergraduate co-author. Going forward, there are a number of promising areas where an undergraduate can effectively contribute to research in the span of a summer. In summer 2018, a REU student developed a prototype language for describing run-time, component and connector architecture in Wyvern; a future student could develop a similar language for describing module architecture constraints. A related project is enhancing Wyvern's module system, e.g. to support test cases as part of a module signature so that if there are multiple implementations of the module, they can be automatically compared for behavioral compliance with the test cases—thus avoiding problems with incompatible replacements or upgrades. We are also developing a foundational theory of gradual verification using separation logic [5], and an excellent undergraduate project would be integrating the theory into Wyvern (in a first summer) and an optimized implementation for run-time checking (in a second summer).


[1] Joseph Lee, Jonathan Aldrich, Troy Shaw, and Alex Potanin. A Theory of Tagged Objects. Proc. European Conference on Object-Oriented Programming (ECOOP), 2015.

[2] Darya Melicher, Yangqingwei Shi, Alex Potanin, and Jonathan Aldrich. A Capability-Based Module System for Authority Control. In Proc. European Conference on Object-Oriented Programming (ECOOP), 2017.

[3] Cyrus Omar, Darya Kurilova, Ligia Nistor, Benjamin Chung, Alex Potanin, and Jonathan Aldrich. Safely Composable Type-Specific Languages. Proc. European Conference on Object-Oriented Programming, 2014. Distinguished Paper Award.

[4] Nathan Fulton, Cyrus Omar, and Jonathan Aldrich. Statically Typed String Sanitation Inside a Python. Proc. Privacy and Security in Programming (PSP), 2014. Best Paper Award.

[5] Johannes Bader, Jonathan Aldrich, and Eric Tanter. Gradual Program Verification. In Proc. International Conference on Verification, Model Checking, and Abstract Interpretation (VMCAI), 2018.

Mentors: Christian Kästner and Bogdan Vasilescu

Description and significance
Reuse of open source artifacts in software ecosystems has enabled significant advances in development efficiencies as developers can now build on significant infrastructure and develop apps or server applications in days rather than months or years. However, despite its importance, maintenance of this open source infrastructure is often left to few volunteers with little funding or recognition, threatening the sustainability of individual artifacts, such as OpenSSL, or entire software ecosystems. Reports of stress and burnout among open source developers are increasing. The teams of Dr. Kaestner and Dr. Vasilecu have explored dynamics in software ecosystems to expose differences, understand practices, and plan interventions [1,2,3,4]. Results indicate that different ecosystems have very different practices and interventions should be planned accordingly [1], but also that signaling based on underlying analyses can be a strong means to guide developer attention and affect change [2]. This research will further explore sustainability challenges in open source with particular attention to the interaction between paid and volunteer contributors and stress and resulting turnover.

Student involvement
Students will empirical study sustainability problems and interventions, using interviews, surveys, and statistical analysis of archival data (e.g., regression modeling, time series analysis for causal inference). What are the main reasons for volunteer contributors to drop out of open source projects? In what situations do volunteer contributors experience stress? In which projects will other contributors step up and continue maintenance when the main contributors leave? Which past interventions, such as contribution guidelines and code of conducts, have been successful in retaining contributors and easing transitions? How to identify subcommunities within software ecosystems that share common practices and how do communities and subcommunities learn from each other? Students will investigate these questions by exploring archival data of open source development traces (ghtorrent.org), will design interviews or surveys, will apply statistical modeling techniques, will build and test theories, and conduct literature surveys. Students will learn state of the art research methods in empirical software engineering and apply them to specific sustainability challenges of great importance. Students will actively engage with the open source communities and will learn to communicate their results to both academic and nonacademic audiences.

[1] Christopher Bogart and Christian Kästner and James Herbsleb and Ferdian Thung. How to Break an API: Cost Negotiation and Community Values in Three Software Ecosystems. In Proc. Symposium on the Foundations of Software Engineering (FSE), 2016.

[2] Asher Trockman, Shurui Zhou, Christian Kästner, and Bogdan Vasilescu. Adding sparkle to social coding: an empirical study of repository badges in the npm ecosystem. In Proc. International Conference on Software Engineering (ICSE), 2018.

[3] Bogdan Vasilescu, Kelly Blincoe, Qi Xuan, Casey Casalnuovo, Daniela Damian, Premkumar Devanbu, and Vladimir Filkov. The sky is not the limit: multitasking across github projects. In Proc. International Conference on Software Engineering (ICSE), 2016.

[4] Bogdan Vasilescu, Daryl Posnett, Baishakhi Ray, Mark GJ van den Brand, Alexander Serebrenik, Premkumar Devanbu, and Vladimir Filkov. Gender and tenure diversity in GitHub teams. In Proc. ACM Conference on Human Factors in Computing Systems (CHI), 2015.