Volodymyr Dvernytskyi
Personal blog about Navision & Dynamics 365 Business Central

Can AI Code AL? Testing LLMs on a Real Task for Data Editor

In this post, I examine how well LLMs handle AL coding on a real task: a change to Data Editor that I had been putting off for quite a long time. For comparison, I selected the most powerful and popular models: Claude 4 Sonnet, GPT-5 High, Grok 4, and Gemini 2.5 Pro.

Agenda

  1. Introduction
  2. The Problem
  3. Prompt and models
  4. Grok 4
  5. Gemini 2.5 Pro
  6. GPT-5 High
  7. Claude 4 Sonnet
  8. Human variant
  9. Validation and tests
  10. Result overview
  11. Conclusions

Introduction

During development, codebases tend to accumulate questionable decisions, outdated patterns, and short-lived fixes. This is technical debt, and I am convinced it is inevitable when a project evolves actively over time. There are many ways to reduce technical debt, but that is not the focus today.

The open-source Data Editor project is no exception. Despite my efforts to prevent it, technical debt can still appear. The project began when my girlfriend, a Business Central tester, pointed out that in Business Central version 15 and later you cannot edit data. She suggested creating an application to solve this limitation. Coming from RTC/Classic, that restriction felt significant, so the idea emerged and I designed the simplest possible implementation that would resemble the earlier experience. Originally, it was a small application for internal testing needs. The very first version was written quickly “just to make it work”, which inevitably created a fair amount of technical debt from the outset.


Of course, after I saw strong interest from many individuals and companies, I refactored and improved Data Editor repeatedly. Even so, until recently there was a part of the code that I disliked and kept postponing, simply because the issue did not affect functionality.

The problem

Looking at the previous version of Data Editor, you can see that three separate objects duplicated the same permissions for specific tables.

That created a clear maintenance problem: you had to keep permissions in sync across three places, with a real risk of missing one. It was inconvenient and difficult to maintain.
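To make the issue concrete, here is a simplified sketch of the pattern (hypothetical object names and table entries, not the actual Data Editor source):

    // Hypothetical sketch of the duplication; the real objects are
    // Pag81000.DETDataEditorBuffer.al, Cod81001.DETDataEditorMgt.al and
    // Pag81004.DETInsertNewRecord.al, each with a much longer table list.
    codeunit 50100 "Demo Data Editor Mgt"
    {
        // The full permission list is declared here...
        Permissions = tabledata "Sales Header" = RIMD,
                      tabledata "Sales Line" = RIMD;
    }

    page 50101 "Demo Insert New Record"
    {
        PageType = Card;
        // ...and repeated verbatim here and in the buffer page.
        // Miss one entry in one copy and that operation starts failing.
        Permissions = tabledata "Sales Header" = RIMD,
                      tabledata "Sales Line" = RIMD;

        layout
        {
            area(Content) { }
        }
    }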

Later, one of the project’s contributors proposed addressing this technical debt with an LLM. What made the attempt interesting is that the LLM ultimately failed to complete the task. That experience led me to design my own LLM benchmark and inspired this comparison post.

Prompt and models

Any interaction with an LLM begins with writing a prompt. Ideally, the prompt is short and clearly describes what needs to be done. In practice, that is not always feasible and can be time-consuming. I therefore chose a single-prompt strategy in agent-thinking mode, making the prompt specific but not overly detailed. If you make the prompt even more granular and keep iterating, you quickly reach the point where doing it yourself is faster and simpler.

I still consider this prompt detailed enough, because I used my own knowledge and context to highlight what to focus on and where. Sometimes that is simply not possible, since a complete understanding of the task may develop as the work progresses. In addition, the agent has access to the AL linter and code analyzers, so any obvious AL errors were surfaced to the models. Here is the prompt I used:

I have a tool to edit data in Business Central called Data Editor. This is the codebase for this tool.

I have one concern. I have 3 files with the same permissions: @Pag81000.DETDataEditorBuffer.al, @Cod81001.DETDataEditorMgt.al, @Pag81004.DETInsertNewRecord.al.

I would like to isolate these permissions in one object, as it is hard to manage and keep them in sync every time.

At the same time, these direct permissions are required for any READ/MODIFY/INSERT/DELETE operations,

So changes should take into account cases where Modify/Insert/Delete may be present in OnValidate of any field (RecordRef/FieldRef).

In addition, any Get/Find/Next is essentially a record read, which requires permissions to do so.
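That is the entire prompt. For context on the last two requirements: Data Editor works with arbitrary tables through RecordRef and FieldRef, so a single edit touches several permission checks. A hypothetical call site (made-up object and procedure names, not the actual Data Editor code) looks roughly like this:

    // Hypothetical illustration of the call pattern the prompt refers to.
    codeunit 50102 "Demo Edit Value"
    {
        procedure EditValue(TableNo: Integer; FieldNo: Integer; NewValue: Variant)
        var
            RecRef: RecordRef;
            FldRef: FieldRef;
        begin
            RecRef.Open(TableNo);
            if RecRef.FindFirst() then begin // Get/Find/Next is a read and needs Read permission
                FldRef := RecRef.Field(FieldNo);
                FldRef.Validate(NewValue);   // OnValidate code may Insert/Modify/Delete other records
                RecRef.Modify(true);         // the write itself needs Modify permission
            end;
        end;
    }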

For the test, I selected the most popular and powerful LLM models available today:

  • Claude 4 Sonnet
  • GPT-5 High
  • Grok 4
  • Gemini 2.5 Pro
  • Human variant

In this lineup, “Human variant” represents my own solution to the task.

Why these models? Because they are the most powerful available today, and I want the best possible outcome.

For the tests, I used the Cursor IDE (a fork of VS Code). I believe VS Code + Copilot would produce similar results.

Grok 4


It is a fairly unconventional model with its own strengths. However, writing code is clearly not one of them. In my tests, it consistently impressed me with the freshness of its information and relatively minimal content filtering. As a conversational partner or a tool for finding the latest information, it can be a solid choice. As a coding model, however, it is questionable, and its opinions on any topic should be treated with caution, as a strong personal bias is possible.

The model failed to complete the task and, worse, left the project in a non-compiling state. It likely handles large files poorly despite the advertised 256,000-token context window.

The project does not compile because the model inserted a variable declaration in the wrong location. The model did start with the right idea, introducing a dedicated codeunit, Data Operations, but it ignored key details from the prompt, missed many places where data operations occur, and added a completely irrelevant procedure, TextValueAsVariant, to Data Operations. For some reason, the model decided not to fix the non-compiling code, even though it saw all the errors.

Here you can review the full list of changes; in my opinion, the outcome is a complete failure.

Gemini 2.5 Pro


Gemini is a very creative and fast model, and its one-million-token context window is striking. I know it is a favorite among many developers, especially given its low cost and the option to use it for free in Google AI Studio.

In my experience, it is not the best for coding, though it is solid. I often saw it deliver an overall middle-of-the-road solution while producing one specific component that was genuinely impressive. The model excels at generating new ideas and creative approaches.

Although this model also left the project in a non-compiling state, it performed clearly better. It applied an interesting and theoretically valid idea: creating a new Permission Set to aggregate all of those table permissions. However, it ignored the compiler error stating that a permission set name must not exceed 20 characters.
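In AL terms, the idea looks roughly like this (a hypothetical sketch with placeholder tables, not Gemini's actual output). Note that it is the permission set name itself that is limited to 20 characters, which is the error the model ran into:

    // Hypothetical sketch of the Permission Set approach. The object name
    // "DET Table Data" fits the 20-character limit; the name Gemini generated did not.
    permissionset 50103 "DET Table Data"
    {
        Assignable = true;
        Caption = 'Data Editor table access';
        Permissions = tabledata "Sales Header" = RIMD, // uppercase = direct permissions
                      tabledata "Sales Line" = RIMD;
    }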

Creating such a Permission Set would technically work, but it is not advisable. From a security standpoint, I would rather not introduce that kind of permission set with direct access to tables. The approach is creative, but not safe. I also prefer not to add an additional Permission Set that could affect backward compatibility with the previous version, since those permissions would also need to be assigned to users.

Here you can review the full list of changes; in my opinion, the result is unsuccessful, but better than the previous one.

GPT-5 High


OpenAI’s models are widely regarded as among the best, if not the best, overall. They are highly intelligent and powerful, but they are not the fastest, and they are expensive. I use them frequently for a wide range of tasks, from producing summaries to researching physics questions.

GPT-5 High is excellent at coding, but I do not consider it the best in agent mode. The model appears to struggle specifically in that mode. In addition, it is very expensive and slow. According to the latest information, the GPT-5 model offers a 400,000-token context window. The key is not to confuse it with GPT-5 Chat or GPT-5 Mini, as these are different models. Unfortunately, OpenAI’s naming remains confusing.

It is worth noting that the project compiles successfully after the model’s changes, which is a clear advantage. The model does monitor errors even when that is not stated explicitly, which I consider the correct approach. If a model modifies code that previously compiled, it must ensure that the code still compiles when it is finished.

The model also took a reasonable path by extracting the required procedures into a single codeunit, choosing to reuse the existing Cod81001.DETDataEditorMgt.al for that purpose. This is workable, but clearly not ideal from an architectural standpoint. In addition, instead of isolating only the specific data-interaction points, the model moved entire procedures, which is also a debatable choice.

Unfortunately, the model did not catch every place where data operations occur, even though I explicitly highlighted the focus areas in the prompt. As a result, the solution is incomplete and only partially functional. For a product actively used by many people, a half-working solution is clearly unacceptable.

Here you can review the full list of changes; in my opinion, the result is partially successful.

Claude 4 Sonnet


I consider Anthropic’s models the strongest for coding, especially in agent mode. I would say they are less impressive on other types of tasks, but for coding they are likely the best. Therefore, I placed high expectations on this model for this task. In terms of price, this model is fairly expensive, though not the most expensive compared with competitors; in my view, the quality justifies the cost. Its 1-million-token context window is enormous.

Fortunately, the model did not disappoint. Among all the models in this evaluation, it came closest to the best solution. Its approach was to extract data operations into a separate object, Cod81010.DETPermissionsProvider.al, which centralizes all permissions. Importantly, it moved only the necessary operations rather than entire procedures that merely contained them.
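The pattern is easy to picture (a hypothetical sketch with made-up object and procedure names and placeholder tables, not the content of Cod81010.DETPermissionsProvider.al): one object carries the tabledata permissions and exposes thin wrappers, and every call site delegates only the data operation to it.

    // Hypothetical sketch of a central permissions provider.
    codeunit 50104 "Demo Permissions Provider"
    {
        // The only place in the app that declares tabledata permissions.
        Permissions = tabledata "Sales Header" = RIMD,
                      tabledata "Sales Line" = RIMD;

        procedure FindFirstRecRef(var RecRef: RecordRef): Boolean
        begin
            exit(RecRef.FindFirst());
        end;

        procedure ModifyRecRef(var RecRef: RecordRef; RunTrigger: Boolean)
        begin
            RecRef.Modify(RunTrigger);
        end;
    }

    // A call site keeps its own logic and swaps only the data operation:
    //   RecRef.Modify(true);                                // before
    //   DemoPermissionsProvider.ModifyRecRef(RecRef, true); // after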

I would also note that the model found nearly every place in the code where this needed to be done and even recognized that CalcFields can be a potential risk, although I did not state that explicitly. Still, it was not perfect: a few scenarios were missed. Therefore, despite being the strongest result, I cannot call it a complete solution, although it is very close.

Here you can review the full list of changes; in my opinion, the result is partially successful.

Human variant


Unfortunately, none of the LLM-generated solutions was complete or production-ready. Therefore, I decided to implement my own solution that fully satisfies my reliability requirements and does not introduce additional problems for users. In essence, my approach is very similar to Claude 4 Sonnet’s: extract a dedicated object for data operations and assign all required permissions there; nothing too complicated.

I also observed that there is no point in using direct permissions on the object, since they are no different in effect from indirect permissions. This applies only to object-level permissions. Therefore, I switched from direct to indirect permissions; see Cod81003.DETDataOperations.al.
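In the object-level Permissions property, the notation mirrors permission set objects: uppercase letters request direct permissions and lowercase letters request indirect ones, and at this level the effect is the same. A minimal sketch of the indirect form (hypothetical name and tables; the real list lives in Cod81003.DETDataOperations.al):

    // Hypothetical sketch; lowercase letters = indirect permissions.
    codeunit 50105 "Demo Data Operations"
    {
        Permissions = tabledata "Sales Header" = rimd,
                      tabledata "Sales Line" = rimd;
    }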

Here you can review the full list of changes; in my opinion, the result is fully successful.

Validation and tests

Unfortunately, none of the models produced functional tests for permissions, because they lacked the knowledge and context needed to write them. That is why using MCP would likely not have changed anything: the tests the models generated effectively tested nothing.

To evaluate the output of each model and the human, the results must be validated, and there are many ways to do so. First, the simplest approach is to apply the gray matter in one’s skull: in plain terms, a developer’s knowledge and experience. An experienced developer can judge the quality of the answers on that basis.

Manual testing is another useful path. Like the first method, it does not provide absolute guarantees, but together these approaches offer a practical way to compare outcomes and rank the options relative to one another.

I used my own knowledge and experience to assess the resulting solutions, and I manually tested each variant. In addition, I introduced a third validation method: new automated tests designed to verify how the application behaves with minimal permission sets. It is important to note that these tests do not yet cover every scenario; I plan to expand the automated test coverage for Data Editor in the future.
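A rough sketch of what such a test can look like follows; it is hypothetical, assumes the standard "Library - Lower Permissions" test library is available, and reuses the demo provider from the earlier sketch. The actual Data Editor tests are organized differently.

    // Hypothetical sketch; assumes the "Library - Lower Permissions" codeunit
    // from the Business Central test framework and the demo provider above.
    codeunit 50106 "Demo Minimal Permission Test"
    {
        Subtype = Test;

        [Test]
        procedure ReadWorksWithMinimalPermissions()
        var
            LibraryLowerPermissions: Codeunit "Library - Lower Permissions";
            DemoPermissionsProvider: Codeunit "Demo Permissions Provider";
            RecRef: RecordRef;
        begin
            // Drop the test session from SUPER to a minimal permission set, then read
            // through the central provider; a missing tabledata permission in the
            // provider would raise a permission error and fail the test.
            LibraryLowerPermissions.SetO365Basic();
            RecRef.Open(Database::"Sales Header");
            if DemoPermissionsProvider.FindFirstRecRef(RecRef) then; // success = no permission error
        end;
    }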

In summary, I assessed solution quality using three validation methods:

  • Developer knowledge and experience
  • Manual testing
  • Automated tests with minimal permission sets

Result overview

Based on my validation methods, the result is:

Rank  Solution         Result
1     Human variant    Success
2     Claude 4 Sonnet  Partial success
3     GPT-5 High       Partial success
4     Gemini 2.5 Pro   Fail
5     Grok 4           Fail

Conclusions

It is worth remembering that these were single-attempt tasks. I could have continued asking the LLM to fix the problematic areas. However, as I already noted, if a fairly simple task requires more time interacting with an LLM than solving it by hand, the benefit drops sharply. Moreover, asking it to correct something means specifying the issue explicitly, which in turn requires that you already have a complete understanding of the task and can do it alone.

Therefore, I do not believe in “vibe coding.” It is a non-working concept today and may never become viable.

That said, I am confident in the usefulness of LLMs, just not for every task. Working effectively with LLMs is not only about writing prompts and instructions, but also about understanding which problems suit them best. The fewer viable solution paths there are, and the more context you provide, the better the results.

You also should not expect an LLM to design a robust project architecture. It is better to create the architecture yourself and then instruct the model to operate within those boundaries.

This is why I appreciate how Microsoft Copilot positions itself: as an assistant, not a replacement. Based on my experience and many tests, that positioning is accurate. Without skilled oversight and personal expertise, you cannot validate LLM output or catch its hallucinations.

So, here are my takeaways from previous work and from this experiment. They are true for me as of today and may change:

  • Vibe coding does not work.
  • LLMs generate non-production-ready code; validation is required.
  • The best models for coding are from Anthropic, for example Claude 4 Sonnet.
  • The best models for most other tasks are from OpenAI, for example GPT-5 High.
  • LLMs still hallucinate and make mistakes.
  • LLMs remain very useful across many tasks, including coding.
  • The technology is excellent, but the hype is far greater than necessary.
  • Effective work with LLMs requires expertise, not only domain knowledge, but also proficiency in using the models themselves.
  • AI can do AL 🙂