My Experience Trying to Automate Game Benchmarks on Linux
Over the past few weeks, I’ve been spending quite a lot of time experimenting with something that started as a small personal tool and slowly turned into a much larger project than I originally expected: automating performance benchmarks for modern games on Linux.
The original idea was actually quite simple. Anyone who has ever tried to compare graphics settings in a game knows how repetitive the process can become. You launch the game, go into the settings menu, change one option, run the benchmark, write down the numbers, exit the game, and then repeat the whole thing again. It doesn’t take long before it starts to feel less like gaming and more like doing laboratory work.
Because of that, I decided to write a small script that would automate the whole process. Instead of running each test manually, the script would apply a predefined configuration, launch the game, run the built-in benchmark, collect the results, and then move on to the next test. The idea was not just to save time, but also to make the results more consistent and easier to compare later.
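The core loop can be sketched in a few lines. Everything here is illustrative: `apply_config`, `launch_game`, and `collect_results` are hypothetical placeholders standing in for whatever each game actually requires, not the project's real code.

```python
# Sketch of the automated benchmark loop. All names here are
# illustrative placeholders, not the actual project code.
import json

def run_suite(profiles, apply_config, launch_game, collect_results):
    """Apply each test profile, run the game's benchmark, gather results."""
    results = {}
    for name, profile in profiles.items():
        apply_config(profile)              # write settings before launch
        launch_game()                      # start game + built-in benchmark
        results[name] = collect_results()  # parse whatever the game wrote
    return results

if __name__ == "__main__":
    profiles = {
        "1080p_ultra": {"resolution": "1920x1080", "preset": "ultra"},
        "1440p_ultra": {"resolution": "2560x1440", "preset": "ultra"},
    }
    out = run_suite(
        profiles,
        apply_config=lambda p: None,       # stubs for illustration
        launch_game=lambda: None,
        collect_results=lambda: {"avg_fps": 0.0},
    )
    print(json.dumps(out))
```

The three callbacks are the part that changes per game; the loop itself stays the same, which is what makes batch runs (and overnight runs) possible.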
To my surprise, the first attempts with Cyberpunk 2077 worked very well, and you can find the current version of the script here.
Within a relatively short time, I managed to get the system running for another game, Shadow of the Tomb Raider. Once the framework was in place, it became surprisingly satisfying to watch it work. The script would simply go through a list of test profiles one by one — different resolutions, different graphics presets, and sometimes different upscaling technologies like DLSS or FSR — and automatically collect the benchmark results from the files each run produced. All the scripts are available here.
Instead of sitting in front of the computer repeating the same steps again and again, I could simply start the script and let it run. Sometimes I would leave it running overnight and check the results in the morning.
Naturally, after seeing it work so well, I became a bit optimistic about how easy it would be to extend this idea to other games. At that moment, it genuinely felt like I had already solved the hard part. The rest, I thought, would simply be a matter of adding support for more titles.
That assumption turned out to be very wrong.
Encouraged by the success of the first scripts, I decided to try adding another game to the framework. The next candidate was Deus Ex: Mankind Divided, which seemed like a very reasonable choice. The game includes a built-in benchmark mode and has been used for performance testing by reviewers for many years, so it looked like an ideal target for automation.
What I expected to take a couple of days ended up taking almost an entire week.
During that week, I spent a surprising amount of time trying different approaches to launch the benchmark automatically. Getting the game itself to start from the command line was not particularly difficult — Steam and Proton handle that part quite reliably. The real challenge was convincing the game to run the benchmark automatically without any manual interaction.
In theory, this should have been simple. Many games support command-line parameters such as -benchmark or --benchmark, and in some cases I needed to create a benchmark.ini file with specific content instead. In practice, however, things are rarely that straightforward. Some games ignore command-line arguments completely. Others accept them but behave differently depending on the launcher or the environment they are running in. And when Proton enters the picture, there is always a possibility that certain parameters are not passed exactly the way the original Windows version expects them.
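To make the launch step concrete, here is a small sketch of how such a command can be assembled. `steam -applaunch <appid>` is a real Steam command-line option that forwards extra arguments to the game; whether the game actually honours a `-benchmark` flag varies per title, and under Proton the argument is simply passed through to the Windows executable.

```python
# Sketch: building a Steam launch command that forwards a benchmark
# flag to the game. Whether the flag does anything depends entirely
# on the individual game.
import shlex

def steam_launch_cmd(app_id: int, extra_args=()):
    """Return the argv list for launching a Steam app with extra arguments."""
    return ["steam", "-applaunch", str(app_id), *extra_args]

# 1091500 is Cyberpunk 2077's Steam app id.
cmd = steam_launch_cmd(1091500, ["-benchmark"])
print(shlex.join(cmd))
```

In the real scripts this list would be handed to something like `subprocess.Popen`; printing it here just shows the shape of the command.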
After several days of experiments, I eventually reached a somewhat frustrating conclusion: I could reliably launch the game automatically, but triggering the benchmark itself was far more complicated than I had anticipated.
That week was the moment when I realized that automating game benchmarks is not just a matter of writing a single clever script. Every game behaves slightly differently, and each one introduces its own set of small obstacles.
Still, I wasn’t ready to give up on the idea yet.
Over the following weeks, I tried the same approach with a few other games. Each new attempt taught me something new about how different developers implement their benchmarking tools, settings, and results. Some games did not react to command-line parameters at all. Others allowed the benchmark to run automatically but saved the results in unusual formats. And in some cases the benchmark results were not written to a file at all — they were simply displayed on the screen at the end of the test.
For a human player that makes perfect sense. You run the benchmark, the game shows you a nice summary screen with the average FPS, minimum FPS, and a few other statistics, and that’s usually all you need.
For an automation script, however, that approach creates a problem: if the results only exist on the screen, there is nothing for the script to parse or store.
At that point I realized that my original assumption — that all games would behave roughly the same — was simply unrealistic.
Around that time I decided to change my strategy slightly. Instead of randomly picking another game and hoping that it would work better, I started looking for titles that were known to be useful for benchmarking. While browsing various forums and discussions about performance testing, one game kept appearing in recommendations: Assassin’s Creed Valhalla.
Interestingly enough, it was also a game I had been curious about playing anyway.
Valhalla seemed like a perfect candidate. It is a relatively modern title with demanding graphics and a built-in benchmark mode, which is why it is often used in GPU performance comparisons. From a testing perspective, that made it very attractive. If the automation worked, it could become a great tool for evaluating different hardware configurations.
After installing the game and launching the benchmark manually for the first time, I immediately understood why people use it so often for testing. The game looks absolutely stunning. The environments are huge, the lighting is complex, and the overall level of graphical detail is impressive even by modern standards.
On my system, which currently runs an NVIDIA RTX 5060 with 8 GB of VRAM, the game had no problem pushing the GPU quite hard, especially at higher resolutions and maximum quality settings. That alone made it useful for benchmarking, because a good benchmark should actually stress the hardware enough to reveal performance differences.
But while the visuals were impressive, the automation challenge was still there.
By that point, however, I had started noticing something interesting. Even though each game behaves differently, the overall structure of the benchmarking process is surprisingly similar across many titles. Almost every automated benchmark workflow consists of two main parts.
The first part is launching the benchmark automatically. Ideally, the game supports a command-line option that starts the benchmark immediately. When that works, automation becomes relatively easy. Unfortunately, many games either ignore such arguments or require additional steps before the benchmark can begin.
The second part is collecting the results. If the benchmark writes its results into a convenient format such as JSON, XML, or CSV, extracting the data is straightforward. A simple parser can read the file and extract the average frame rate, minimum frame rate, and other statistics.
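For the friendly case, the parser really is only a few lines. The field names below are invented for illustration — every game uses its own schema, which is exactly why each one needs a small amount of bespoke code.

```python
# Sketch: extracting key statistics from a JSON results file.
# The field names are made up; each game uses its own schema.
import json

def parse_results(text: str) -> dict:
    """Pull average and minimum FPS out of a benchmark result blob."""
    data = json.loads(text)
    return {
        "avg_fps": float(data["averageFps"]),
        "min_fps": float(data["minFps"]),
    }

sample = '{"averageFps": 87.4, "minFps": 61.2, "maxFps": 112.9}'
print(parse_results(sample))
```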
However, when the results exist only on the screen, things become more complicated.
This is where I eventually arrived at a slightly unconventional solution.
Instead of trying to force every game to produce a result file, I realized that it might be easier to simply capture the results visually. If the benchmark shows the final statistics on the screen, the script can take a screenshot at that moment. Once the screenshot exists, it can be processed automatically.
The idea is surprisingly simple: take the screenshot, crop the region that contains the benchmark numbers, and then run optical character recognition on that image.
To improve the accuracy of the text recognition, the image can be preprocessed before running OCR. For example, converting the image to grayscale removes unnecessary color information and highlights the contrast between text and background. In some cases, creating a negative version of the image also helps, especially when the original interface uses bright text on a dark background.
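Both transformations are simple enough to show directly. The sketch below operates on raw RGB tuples to keep it self-contained; a real pipeline would do the same thing with an imaging library such as Pillow before handing the cropped region to an OCR engine like Tesseract.

```python
# Sketch of the two preprocessing steps: grayscale conversion and
# negation, shown on raw RGB tuples. A real pipeline would use an
# imaging library, but the math is the same.

def to_grayscale(pixels):
    """Luminance-weighted grayscale (ITU-R BT.601 coefficients)."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]

def negate(gray):
    """Invert a grayscale image: bright text on a dark UI becomes
    dark text on a bright background, which OCR tends to prefer."""
    return [255 - v for v in gray]

pixels = [(255, 255, 255), (0, 0, 0), (200, 40, 40)]
print(negate(to_grayscale(pixels)))  # → [0, 255, 167]
```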
These techniques are actually very similar to the preprocessing steps used in many computer vision systems. By simplifying the image and enhancing contrast, the recognition algorithm has a much better chance of reading the numbers correctly.
Of course, there is no universal preprocessing pipeline that works perfectly for every game. Different user interfaces use different colors, fonts, and layouts. Because of that, I eventually decided that the benchmarking framework should not rely on a single rigid extraction method.
Instead, the system provides a default example function that demonstrates how result extraction can work. Each game configuration can then override that function if necessary and provide its own custom logic for parsing benchmark results.
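One way to sketch that override pattern — with made-up names and formats, not the framework's actual API — is a config object that falls back to a shared default parser unless a game supplies its own:

```python
# Sketch of per-game overrides: a default extraction function plus an
# optional per-game replacement. Names and formats are illustrative.

def default_extract(raw: str) -> dict:
    """Fallback parser: expects 'avg=NN min=NN' somewhere in the text."""
    fields = dict(part.split("=") for part in raw.split())
    return {"avg_fps": float(fields["avg"]), "min_fps": float(fields["min"])}

class GameConfig:
    def __init__(self, name, extract=None):
        self.name = name
        # Each game may supply its own parser; otherwise use the default.
        self.extract = extract or default_extract

# A hypothetical game whose results need custom handling:
def valhalla_extract(raw: str) -> dict:
    avg, low = raw.split(",")
    return {"avg_fps": float(avg), "min_fps": float(low)}

generic = GameConfig("Shadow of the Tomb Raider")
valhalla = GameConfig("Assassin's Creed Valhalla", extract=valhalla_extract)
print(generic.extract("avg=72.5 min=48.0"))
print(valhalla.extract("95.1,63.4"))
```

The rest of the framework only ever calls `config.extract(...)`, so it never needs to know which parser is behind it.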
In practice, this means that even if two games display their benchmark results in completely different ways, they can still be supported within the same framework.
Looking back at the past few weeks, the experience has been both more difficult and more interesting than I originally expected. What started as a small automation script gradually turned into a deeper exploration of how games implement benchmarking tools and how those tools interact with Linux, Steam, and Proton.
The biggest lesson for me was that automation rarely works exactly the way we imagine it at the beginning. Real systems are messy, and every new game introduces small differences that need to be understood and handled.
At the same time, solving those small technical puzzles is exactly what makes the process enjoyable. Each time a new game finally runs its benchmark automatically and produces usable results, it feels like a small victory.
And who knows — if this project continues to grow, it might eventually become a useful tool not only for me but for other Linux gamers who enjoy experimenting with performance settings just as much as they enjoy playing the games themselves.