Research Blog

DSL Usability Research

In my previous post, I asserted:

...learning a new formal language can itself contribute to the difficulty of encoding an experiment.

This statement was based on assumptions, intuitions, and folk wisdom. I started digging into the DSL usability research to see if I could find explicit support for this statement. This blog post is about what I found.

Suppose I have a DSL for a task that was previously manual. I want to conduct a user study. I decide to use some previously validated instrument to measure differences in perceived difficulty of encoding/performing a task (\(D\)), and vary the method used to code the task (\(M=\text{DSL}\) vs. \(M=\text{manual}\)). Suppose there is no variability in task difficulty for now: the specific task is fixed for the duration of the study, i.e., is controlled.

Ideally, I'd like to just measure the effect of \(M\) on \(D\); we are going to abuse plate notation¹ a bit and say that the following graph denotes "method has an effect on perceived difficulty of performing a specific task for the population of experts in the domain of that task:"

flowchart LR
  M("Method ($$M$$)")
  subgraph ppl [domain experts]
      D("Percieved Difficulty of Task ($$D$$)")
  end
  M --> D

The first obvious problem is that \(D\) is a mixture of some "inherent" difference due to \(M\) and the novelty of the method/context/environment/situation (\(N\)). We have not included \(N\) in our model; let's do so now:

flowchart LR
  M("Method ($$M$$)")
  subgraph ppl [domain experts]
    direction TB
    N("Novelty ($$N$$)")
    D("Perceived Difficulty of Task ($$D$$)")
  end
  M --> D
  N --> D

Conducting a naïve study results in \((D \vert M=\text{manual}, N = 0)\) vs. \((D \vert M=\text{DSL}, N \gg 0)\). This is why we have the study participants perform a training task first: it's an attempt to lower \(N\) as much as possible, i.e., to control for novelty.
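
To make the confound concrete, here is a toy simulation in Python; every number in it (the baseline, the method effect, how strongly novelty inflates difficulty, how much a training task reduces it) is made up purely for illustration. The point is only that a naive comparison attributes the novelty penalty to the method, while reducing \(N\) first lets the method effect show through.

import random

def perceived_difficulty(method, novelty, rng):
    """Toy model: D = baseline + method effect + novelty effect + noise."""
    baseline = 5.0
    method_effect = -1.0 if method == "DSL" else 0.0   # assume the DSL genuinely helps
    novelty_effect = 2.0 * novelty                      # novelty inflates perceived difficulty
    return baseline + method_effect + novelty_effect + rng.gauss(0, 0.5)

rng = random.Random(0)

# Naive study: manual is familiar (N = 0), the DSL is brand new (N = 1).
naive_manual = [perceived_difficulty("manual", 0.0, rng) for _ in range(100)]
naive_dsl    = [perceived_difficulty("DSL",    1.0, rng) for _ in range(100)]

# With a training task that (optimistically) drives novelty down to 0.1.
trained_dsl  = [perceived_difficulty("DSL",    0.1, rng) for _ in range(100)]

mean = lambda xs: sum(xs) / len(xs)
print(f"manual:        {mean(naive_manual):.2f}")
print(f"DSL (naive):   {mean(naive_dsl):.2f}")    # looks *harder* than manual
print(f"DSL (trained): {mean(trained_dsl):.2f}")  # the method effect becomes visible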

Training tasks are obviously not unique to DSL research; however, there are other tactics for reducing novelty that are unique to programming systems. For example, it seems obvious that IDE features like syntax highlighting and autocomplete that are part of a "normal" programming environment would reduce the value of \(N\); so would integrating the DSL into the target users' existing toolchain/workflow.

If we allow the task to vary, then our model needs to include another potential cause for \(D\):

flowchart LR
  M("Method ($$M$$)")
  C("Task Complexity ($$C$$)")
  subgraph ppl [domain experts]
    direction TB
    N("Novelty ($$N$$)")
    D("Perceived Difficulty of Task ($$D$$)")
  end
  M --> D
  N --> D
  C --> D

The details of how we represent \(C\) matter: whatever scale we use, it contains a baked-in assumption that for any two tasks \(t_1\) and \(t_2\) where \(t_1\not=t_2\) but \(C(t_1)=C(t_2)\), we can treat \(t_1\equiv t_2\). This is a big assumption! What if there are qualitative differences between tasks, not captured by the complexity metric, that influence \(D\)? In that case, we may want to use a different variable to capture \(C\), perhaps a binary feature vector, or we may want to split \(C\) into a collection of distinct variables. Maybe task complexity isn't objective but subjective, in which case we would want to include it in the domain experts plate. Maybe we want to forgo \(C\) altogether and instead treat tasks as a population we need to sample over, e.g.,

flowchart LR
  M("Method ($$M$$)")
  subgraph ppl [domain experts]
    subgraph tasks [tasks]
      direction TB
      N("Novelty ($$N$$)")
      D("Perceived Difficulty of Task ($$D$$)")
    end
  end
  M --> D
  N --> D
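
To make the earlier equivalence assumption about \(C\) concrete, here is a toy sketch; the tasks, features, and scalar complexity metric are all invented for illustration. Two tasks can agree on the scalar score while differing on qualitative features that plausibly influence \(D\), which is exactly the gap a feature-vector (or multi-variable) representation of \(C\) is meant to close.

from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    name: str
    n_steps: int              # contributes to the scalar complexity score
    requires_recursion: bool  # qualitative features *not* captured by the score
    requires_io: bool

def complexity(t: Task) -> int:
    """A made-up scalar complexity metric."""
    return t.n_steps

t1 = Task("aggregate survey responses", n_steps=7, requires_recursion=False, requires_io=True)
t2 = Task("walk a nested config tree",  n_steps=7, requires_recursion=True,  requires_io=False)

# The scalar metric says the tasks are interchangeable...
assert complexity(t1) == complexity(t2)
# ...but a feature-vector view says otherwise.
print((t1.requires_recursion, t1.requires_io) == (t2.requires_recursion, t2.requires_io))  # False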

I have plenty more to say and would love to iterate on the design of this hypothetical user study, but I am going to stop here because the above diagram feels like something that should already be established in the literature. Like a lot of folk wisdom, it's suggested, implied, assumed, and (I think!) generally accepted, but so far I have not found any explicit validation of the above schema. That doesn't mean it isn't out there; it means that (a) there isn't a single canonical paper accepted by the community as evidence and (b) where the evidence does exist, it's embedded in work that primarily addresses some other research question.

So, for now, I am putting together a DSL usability study reading list of works that I think touch on this fundamental problem in meaningful ways. I consider Profiling Programming Language Learning and PLIERS: A Process that Integrates User-Centered Methods into Programming Language Design to be seed papers and have gotten recommendations from Andrew McNutt, Shriram Krishnamurthi, and Lindsey Kuper. Please feel free to add to this (or use it yourself!). I look forward to writing a follow-up post on what I find. :)


  1. While the plate notation here looks similar to the output that Helical produces for HyPL code, the specific graphs are more precise than those that Helical can currently produce. For example, only \(D\) is embedded in the domain experts plate. Helical's current implementation would place both \(M\) and \(D\) in this plate. 

Jupyter DSLs

One of the broader goals of the Helical project is to make writing, maintaining, and debugging experiments easier and safer for the end-user through a novel domain-specific language. However, learning a new formal language can itself contribute to the difficulty of encoding an experiment. Therefore, we are interested in mitigating the effects of language learning/novelty. To this end, a Northeastern co-op student (Kevin G. Yang) spent last year investigating the suitability of using Jupyter notebooks as an execution environment for experiments.

Jupyter notebooks are commonly used by empiricists. If we want empiricists to use Helical, then it would make sense to integrate it into empiricists' computational workflow. Kevin began investigating the feasibility of adding support for features such as syntax highlighting and code completion to Jupyter. This actually turned out to be a surprisingly difficult task!
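
To give a flavor of what is involved, here is a minimal sketch of a custom lexer for a hypothetical DSL using Pygments; the token rules, keywords, and the "hypl" name are invented for illustration, and this covers only one layer of the stack (e.g., rendering exported notebooks). Live highlighting inside the notebook editor goes through a separate front-end component, which is part of what makes the end-to-end task harder than it first appears.

from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexer import RegexLexer
from pygments.token import Comment, Keyword, Name, Number, String, Text

class HypotheticalDSLLexer(RegexLexer):
    """Lexer for a made-up experiment DSL (not Helical's actual grammar)."""
    name = "hypl"
    aliases = ["hypl"]
    filenames = ["*.hypl"]

    tokens = {
        "root": [
            (r"#.*?$", Comment.Single),                            # line comments
            (r"\b(experiment|condition|measure|assign)\b", Keyword),
            (r'"[^"]*"', String),                                  # string literals
            (r"\d+(\.\d+)?", Number),
            (r"[A-Za-z_][A-Za-z0-9_]*", Name),                     # identifiers
            (r"\s+", Text),
        ],
    }

src = 'experiment "pilot"\nassign method\nmeasure difficulty  # 7-point scale'
print(highlight(src, HypotheticalDSLLexer(), HtmlFormatter()))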

Kevin ended up doing a deep dive into the Jupyter code base and issue database, resulting in an experience report and tutorial that he presented internally at the Northeastern Programming Research Lab's seminar series and externally at PLATEAU 2025. While his co-op focused on a specific implementation task, the work led us to ask new research questions. For example, we were somewhat surprised by the breadth of tooling empirical scientists were using, and by the organic demand for custom syntax highlighting within the Jupyter user base; conventional wisdom in the PL community is that DSLs are a bit niche! Thus, rather than focusing on Helical specifically, we broadened the task to DSL support in Jupyter more generally.

At the start of his co-op, I had envisioned Kevin integrating Helical into Jupyter and then pivoting to a reproduction study. However, as he worked on the project, he became increasingly interested in visualization and usability. We had hoped to perform a user study in Summer 2025 to further investigate some of the research questions that arose, and perhaps send a conference paper submission to CHI or UIST; that thread is on hold as Kevin continues his career exploration journey.

Introduction to Digital Twins

Recently, I've been reading about this new technology called digital twins. I started with this paper. I think it's a great introduction, and it's also the one that my research supervisor has recommended to me.

I had no idea what a digital twin was. I had not heard the phrase at all, and my first impression was that it sounded similar to an NFT: a digital representation of a real-world physical object.

As I continued reading the paper, I found out that, no, it's not like that. Digital twins are online clones of something in the physical world, but there are huge differences. A digital twin is alive: it gathers real-time data from the object or system it represents in the real world, so it keeps updating itself. An NFT, by contrast, is an online object that does not change; immutability is one of its key properties.

Something that is talked about quite a bit online is what digital twins offer over regular simulations, which are already heavily used across many areas. They have similar features: both gather data, both try to model what something in the real (or even digital) world might do, and both try to predict the future. The real difference is this: a simulation is fed its data up front and then tries to predict everything afterwards based on that data, whereas a digital twin is more of a progression. The digital twin evolves as its physical twin progresses in the real world, like twins that grow and change basically simultaneously. That's the point of a digital twin: you get a real-time digital clone of something in the real world, whether that is an object, a system, or anything that produces data you can replicate digitally.
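
Here is a toy sketch of that distinction (names and numbers entirely made up): the simulation is seeded once and then runs forward on its own model, while the digital twin keeps folding fresh measurements from the physical object into its state before making each prediction.

def simulate(initial_temp, steps):
    """Classic simulation: seeded with data up front, then runs on its own model."""
    temp = initial_temp
    history = []
    for _ in range(steps):
        temp += 0.5           # model-only prediction; never corrected
        history.append(temp)
    return history

class DigitalTwin:
    """Toy twin: its state is continually updated from real-world measurements."""
    def __init__(self, initial_temp):
        self.temp = initial_temp

    def ingest(self, measured_temp):
        self.temp = measured_temp  # stay synchronized with the physical object

    def predict_next(self):
        return self.temp + 0.5     # predictions always start from the latest state

print(simulate(20.0, 3))                # [20.5, 21.0, 21.5]

twin = DigitalTwin(20.0)
for reading in [20.4, 21.7, 21.1]:      # streaming sensor readings
    twin.ingest(reading)
    print(twin.predict_next())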

On a side note, there seem to be a lot of buzzwords that can be fitted into presenting digital twins in this paper. I've read a lot about AI, LLMs, machine learning, and different models; I've even seen blockchain and security... It seems like it could potentially grab investors' attention one day.

How digital twins relate to social media --- YSocial

When I was reading the paper on digital twins, I was wondering how this technology could fit within the scope of the research I am doing in this co-op, since the paper's examples of digital twins were all manufacturing-related. After reading this paper, I realized why my research supervisor wanted me to learn about these systems. The paper presents "YSocial", a digital twin platform that replicates an online social media environment.

In short, YSocial allows users to simulate a social media environment using LLMs. One possible use case would be to simulate political discussions on platforms like Twitter. It does this by using LLM agents to mimic how real-world humans discuss topics (in this case, political ones) on social media. Thanks to the development of AI, this can be done far more easily than it could have been a decade ago.
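
The core loop is conceptually simple. The sketch below is not YSocial's actual API or prompt design, just a stand-in to show the shape of the idea, with the LLM call stubbed out (in practice it would go to a hosted model or a local one via something like Ollama).

import random

def llm_reply(persona, thread):
    """Stand-in for a real LLM call; a deployment would query an actual model here."""
    return f"[{persona} replying to: {thread[-1][:40]}...]"

personas = ["cautious moderate", "fiery partisan", "policy wonk"]
thread = ["Seed post: what should the city do about transit funding?"]

rng = random.Random(42)
for _ in range(5):
    persona = rng.choice(personas)  # pick which simulated user speaks next
    thread.append(llm_reply(persona, thread))

print("\n".join(thread))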

It is also a huge playground for researchers to gain insights into how well LLMs actually perform at mimicking humans. YSocial lets researchers gather large amounts of data and try out different settings, both to see whether today's LLMs can actually reproduce what a human-led online environment would look like and to see which settings can be adjusted to achieve different results on different social media platforms.

For example, Instagram is not the same concept as Twitter: Instagram is more photo- and image-based, while Twitter gives you more opportunity to express your feelings. Instagram is more of a record, a place where you keep track of what you have done; you can put all of your photos, and even your stories (your recent activities), on it. By adjusting different sliders and personalities, researchers can change how the agents on the YSocial platform behave. It could potentially be a very powerful tool for researchers to dive into.

Trying out YSocial

I've tried to play around with YSocial and to set it up locally. I ran across a few problems that I was unable to solve on my own, so I reached out to the team behind YSocial and got some very helpful feedback!

Initially, while I was able to access the main dashboard of YSocial, I ran into problems afterwards. After following the instructions on that page, creating different experiments and agent populations, and actually activating the simulations, I was unable to get any of the results or any of the posts that I assumed the agents would have created for this simulation of Twitter. Instead, when I entered the simulation, I only got a posting page where I could act as a user of the social media platform and post something myself. It was initially very strange to me that a tool meant to simulate social media didn't actually give me any posts or any of the data that comes along with such a simulation. I later learned that this process can take a long time.

On the website, I also saw that there is a "hard way" to do it, where I would have to set up the server and the client separately. I presume that this allows for more customization of the setup. I suspected, and confirmed with the authors, that the documentation here is a bit outdated. I will have to play around and find out a lot more about the tool itself, see where some things have fallen out of date, and update them in the code. I have not successfully run my YSocial setup this way yet, so hopefully in the coming few days I'll be able to find out what is wrong in the documentation or in the code for the "hard way."

One thing I discovered while trying YSocial is that some of the features on the web dashboard do not seem to work offline, even when I set up a local LLM through Ollama. When trying to create "experiments" and "populations" without a network connection, the dashboard does not display the already-created ones, instead showing only empty tables. A big todo for me is to fully go through the codebase (YSocial is open source) to see where this problem lies.

Looking ahead, something that my research supervisor and I would like to do is have YSocial communicate with Mastodon, which is the social media platform we are focusing on. The goal is to set up a locally run Mastodon instance where YSocial's LLM agents act as the users on that instance. We would then try to figure out the various kinds of things we can play around with. Starting off, we would see what happens when we change some of the factors or configurations on Mastodon, especially ones related to privacy. These changes might affect how Mastodon is used, but on the actual Mastodon we cannot test the effects of these changes on real users. With LLM agents, we can try to simulate the effects of these changes in a less harmful and less dangerous way.
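
As a first sketch of what the Mastodon side of that bridge could look like, the snippet below uses the Mastodon.py client library to post agent-generated text to a locally hosted test instance; the instance URL, token, persona, and the stubbed generation function are all placeholders, and this covers only the posting half (YSocial, or a local model, would supply the actual text).

from mastodon import Mastodon  # Mastodon.py client library

def generate_agent_post(persona):
    """Placeholder for the LLM-agent side; YSocial or a local model would go here."""
    return f"({persona}) testing a simulated post on a throwaway local instance"

# Placeholder credentials for a locally hosted, throwaway Mastodon instance.
client = Mastodon(
    access_token="AGENT_BOT_TOKEN",
    api_base_url="https://mastodon.local.test",
)

status = client.status_post(
    generate_agent_post("policy wonk"),
    visibility="unlisted",  # keep simulated traffic off public timelines
)
print(status["url"])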

Looking at this tool, I imagine it is wildly powerful and useful if used correctly. As I figure out how some of the features of this platform actually work, I believe I'll be able to gather a lot of simulated data about social media and AI, and especially the combination of the two. I hope to utilize this tool in various areas, including but not limited to Mastodon.

New student, new project!

I want to extend a belated welcome to Zixuan (Jason) Yu, a Northeastern University undergraduate student who is working with me on a research co-op through December 2025. Jason's project focuses on identifying elements of the Mastodon code base where we might want to intervene (in order to answer a research question) or where there might be associated privacy considerations.

Jason's project combines goals from the Privacy Narratives project and the Helical project. He will be posting here regularly, but before then, let's discuss the connection between privacy and experimentation.

As Donald Campbell wrote in Methods for the Experimenting Society,

[Social p]rograms are ... continued or discontinued on inadequate ... grounds[,] ... due in part to the inertia of social organizations, in part to political predicaments which oppose evaluation, and in part to the fact that the methodology of evaluation is still inadequate.

Mastodon, as both a software platform and a collection of communities, has less of this inertia. We can think of each Mastodon instance as being a little society. This multiplicity and diversity could present incredible opportunities for empowering citizen social scientists. Insofar as computing systems can be made to have less friction with respect to experimentation, Mastodon's role as a FOSS platform seems ripe with opportunity. Unfortunately, in this context, Campbell's vision for an experimenting society may sound a bit like moving fast and breaking things: an ethos that many Fediverse communities reject.

This is NOT what we want! At first blush, it would seem that a notion of trust is missing from Campbell's essay. On closer inspection, however, we find trust's cousin: participant/citizen privacy. Campbell's mentions of privacy focus on scenarios where participants might be unwilling to disclose information to the researcher or research organization; today there are many more parties that may violate trust. We would argue that a violation of trust is the primary harm (or at least the primary perceived harm) that experimentation in social networks can cause, and that privacy therefore cannot be treated as a separate concern.

It is with this context in mind that Jason will be identifying possible intervention points and scenarios that could cause privacy vulnerabilities in the Mastodon code base.

Welcome!

This is the first in what I hope will be many blog posts on the relationship between experimentation and programming languages.

First, the meta: there are two ways to categorize the work that this blog will cover. One is by the methods used: this work exists at the intersection of formal and empirical research methods and so some posts will be quite technical, focusing on formal language design, logic, causal inference, and general "mathiness." In a lot of ways, this isn't particularly useful information: we might as well say we are "doing science," since empirical and formal methods describe a lot of the tools we use to generate scientific knowledge. On the other hand, that framing distinguishes the goals of this work from ordinary academic scientific activity, where we typically use a small set of methods for a specialized set of domains, which leads us to the observation that...

The other way to categorize this work is by application area. We are broadly interested in "systems," which is a general word to describe objects in the world that interact with each other. We are specifically interested in computer-mediated systems, which range from sociotechnical systems like Mastodon and social games, to typical computer systems like databases, operating systems, or programming frameworks. These application areas provide contexts in which we can evaluate the methods we study, while also providing inspiration for the methodological problems we address.