Here is a quick overview of how I run local Ai LLM testing and reviews, and how I approach the evaluation of those tests. It is worth pointing out that Model Review/Evaluation is not Performance Benchmarking the way I approach it; those are usually separate videos, but not always. I do not expect folks to have watched every video or to recall what I have talked about in videos and comments, so this is a good way for me to create a static reference and direct folks to this article if they seem to have missed the explanation in prior videos/comments. Unfortunately YouTube does not allow creators to edit or revise a video once it is published, which impacts the quality and longevity of the work and creates additional burdens that need solutions (like this article) when you forget to mention something in a video.
My testing and evaluation methods are by no means scientific, as I state often. My “methodology” is very much based on what I have learned from my own hands-on experience managing the hardware and software in my home Ai lab. Broadly speaking, I make Ai videos that fall into the following categories:
Local Ai Server Builds
Local Ai Software Guides
LLM Testing and Reviews
LLM Performance Benchmarking
Ai Hardware Reviews and Analysis
Ai News and Opinions
In this article I will talk only about LLM Testing and Review Setup Steps and then My (Current) Questions and Reasoning. Performance Benchmarking is another topic I may write an article about in the future, but I need to refine my approach more before I do that.
LLM Testing and Reviews Setup Steps
I have followed a fairly logical, in my opinion, method of testing models for model reviews. I am sure there is room for improvement and I welcome those suggestions; just drop a comment on a YouTube video with your own experiences running local models. This content is presented as “this is how I do it” and not as “this is how you should do it,” which is an important distinction.
Here I will use ZAI GLM 4.5 Air 106B A12B as an example. I do not pick a quant or parameter size for speed; rather, I bias in the following order to attempt to optimize for the highest runnable precision:
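As a rough, purely hypothetical sketch of what optimizing for the highest runnable precision means in practice (the quant names and bytes-per-weight figures below are illustrative rules of thumb, not my literal selection procedure), the idea is to take the largest memory footprint that still fits on the hardware:

```python
# Hypothetical sketch: pick the highest-precision quant whose estimated footprint fits in memory.
# Bytes-per-weight values are rough rules of thumb; real GGUF sizes vary with architecture and overhead.
BYTES_PER_WEIGHT = {"FP16": 2.0, "Q8_0": 1.07, "Q6_K": 0.82, "Q5_K_M": 0.71, "Q4_K_M": 0.60}

def pick_quant(total_params_billion: float, usable_mem_gb: float) -> str | None:
    """Return the highest-precision quant whose estimated size fits, or None if nothing fits."""
    for quant, bpw in sorted(BYTES_PER_WEIGHT.items(), key=lambda kv: -kv[1]):
        est_gb = total_params_billion * bpw  # ~1 GB per billion params at 1 byte/weight
        if est_gb <= usable_mem_gb:
            return quant
    return None

# GLM 4.5 Air is roughly 106B total parameters; around 128 GB of usable memory lands near Q8.
print(pick_quant(106, 128))
```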
My (Current) Questions and Reasoning
My questions are centered on exposing core reasoning, capabilities, and principles in LLMs, with the aim of surfacing what I think would be a very commonly agreed upon set of issues on the road toward AGI. The current set of questions (or prompts, however you want to view them) is 10 in length, and what I would call grade 1. There will be a progression from this set to the next set of questions after we have an LLM successfully complete all 10 in a one-shot test. They are fairly basic in my opinion, even if a human might have trouble with some of them, like 100 decimals of pi. Not all of them have static right-or-wrong answer criteria, and we have seen a few really creative divergences from what is clearly a normalized response cluster. Those answers have been pretty cool to see.
These questions are not centered on the current (2025) state of what I perceive to be the core YouTube Ai audience. That is odd, right? No. The YouTube Ai audience I often see, and which I see on the channels of other creators who are likely human, is highly technical in nature and not reflective of the actual populace of the United States. I cannot evaluate that for the rest of the world, but I suspect it is even more pronounced in some regions and less in others. Yeah, there is also the weird slop Ai stuff that would seem to have an audience, but I suspect it is heavily bot-created, bot-commented, and bot-consumed. I don’t know WTF to make of that mess.
Not every question has been methodically applied every time; as stated above, I am not a scientist and this is not scientific testing. If you think the current state of testing is sub-optimal, then maybe these questions resonate with you. The test set also has no formal evaluation name aside from “the questions DSP asks LLMs”. Let’s take a look at these questions and what I evaluate them for.
Armageddon with a Twist
The end of the world in a concise question is somewhat hard to conceive and set up in a few minutes in a fun and playful manner, but I do have to say I enjoy what I have come up with here. It is hands down my favorite question and I legitimately love the responses and thoughts it provokes in me, the LLMs, and the audience. How an LLM would approach a solution to an event as shocking as an asteroid EOTWAWKI (end of the world as we know it) impact has also led to some incredible divergences between LLMs. It also has a highly probable historical tie-in to what happened to most of the dinosaurs. We have seen responses range from outright refusal to answer the question at all, to unlimited gusto for airlock blasting of dissenting crew, with principled ethics ranging from Utilitarian to Deontological and a spectrum that extends even beyond considerations for humanity alone. If you are thrilled, enraged, troubled, elated or upset by it, I would suggest my mission has been accomplished 🙂 The only “wrong” answer to this is refusing to answer and to provide reasoning around that refusal, which has happened with a number of LLMs. Fun question.
Coding Flippyblock Extreme
The world’s favorite code prompt, presented for that reason. Surprisingly, however, it is a very good question for unveiling the peak of “trained in” polish that an LLM-producing lab can achieve. This question has without doubt been a core part of training corpora for some time now, and the quality does appear to reflect that it is continuing to be refined in the training set. This raises good questions about the level of fitting that is done before shipping, and it offers a good way to evaluate that. Not to say it has to be refined and trained in, but I think it is fair to assume it has been. One of the more shocking recent level-ups that we see from reasoning models is “same window” reconsideration of the whole code. This also provides key insights into how an LLM ideates a solution. It is important to realize that one-shot success, or reasoned self-correction, leads us down a pathway of drastic acceleration without a human in the loop, a must for reliable agentic expansion. This possibly also provides a good historical peg to gauge model evolution against.
Parsing Peppermints
This is a question geared to unapologetically examine the ability to parse and count. Why? It is a fundamental skill that is needed for us to hit AGI. In my Model Review videos I approach my questions from the perspective of a normal user, their expectations, and their use cases.
Specifically, this means my testing questions are not about “best fitting” humans to the model’s preferred methods of interaction, which is what much of Ai YouTube currently does around Ai/LLM topics. That is not to say there is no value in that; there is great value in it! A great illustrative example is how Karpathy explains this topic extremely well and showcases a solution. This does not excuse the fundamental token resolution issue, however.
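To make the token resolution issue concrete, here is a minimal sketch using the tiktoken library (the word “peppermints” is just an illustrative stand-in, not the exact wording of my question). The model never sees individual letters, only multi-character token chunks, while an ordinary program counts characters exactly:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer family used by several OpenAI models
word = "peppermints"

token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

print(pieces)            # what the model actually "sees": multi-character chunks, not letters
print(word.count("p"))   # what a plain program computes exactly: 3
```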
Not to throw shade, but the oversight of many seemingly very intelligent people is that they are completely missing the real-world, non-negotiable expectations of normal users and attempting to either fit those users to the LLM or, in some cases, explain issues away as non-issues. This is the gap that is so frequently talked about in Ai. Fitting somewhat more-advanced-than-normal users to the LLM’s preferred methods of interaction for increased positive outcomes absolutely makes sense. Trying to fit a normal to less-than-normal user to the idiosyncrasies of LLMs, however, is a serious mistake. It risks irreparable damage to the field of Ai gaining mass adoption and utilization. Trying to explain it away is always the wrong approach, and I would suggest folks look into the motivations of anyone attempting to hand wave and say this is not a problem. It is likely one of the following three things:
1) blind acceptance and parroting.
2) self interest in maintaining a skills gap.
3) vested financial interests.
I see at least two approaches to arriving at a solution to this problem. The easier route is likely to craft tedious system prompts that guide the LLM into pathways that arrive at favorable solutions to counting and parsing fundamentals. The second is for some cracked ass math nerd to actually solve this at an architectural level, which I fully believe will eventually happen. Very few of the world’s most elite mathematicians are likely to have the chops to solve this at that level; it is a very non-trivial problem. Still, this does not negate the need for solutions, even if they come at the somewhat more superficial system prompt level. Those are also just the two methods I can think of, and there are likely other solutions I have not thought of. Possibly LLMs are not the way that leads to AGI. Either way, “don’t use these for counting or parsing” is simply an unacceptable state and I will not accept it. Nor will the masses.
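As a purely hypothetical illustration of the first, easier route (this is a sketch of the shape such guidance takes, not a prompt I ship or have validated), a counting/parsing guardrail in a system prompt might look something like this:

```python
# Hypothetical system-prompt snippet illustrating the "tedious system prompt" route
# to more reliable counting and parsing. A sketch only, not a tested production prompt.
COUNTING_GUARDRAIL = """\
When asked to count letters, words, or items:
1. First rewrite the input as an explicit numbered list, one character or item per line.
2. Count by walking that list and keeping a running total.
3. State the final total only after showing the full enumeration.
Never answer a counting or parsing question from memory or intuition alone.
"""

messages = [
    {"role": "system", "content": COUNTING_GUARDRAIL},
    {"role": "user", "content": "How many letter p's are in the word peppermints?"},
]
```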
Nothing any X or YouTube influencer, including the biggest names, can say excuses the fundamental failures in modern SOTA LLMs. Parsing and counting are just some examples of those continued failures. If you are being told otherwise, question that person’s motivations for telling you it is not an issue you should be concerned with. AGI is not possible without the ability to count and parse accurately and reliably. It is highly disturbing to me that people have been told “this is not an issue,” accepted it as such, and moved past it. I question either the intellect or the motivations of those who accept such explanations, especially when they attempt to pass them off to others. I think you might do well to consider that as well.
Arbitrary Arrays
This is a simple cypher: if you present it to most people, they can rapidly abstract the answer, which is simply to shift down by 1. Indeed, that is the answer most often given by the LLMs in both reasoning and non-reasoning variants. The reasoning variants have come up with a few interesting alternative answers that are also valid. Aside from basic array shifting, one of the primary motives behind this question is looking for LLMs that have been trained to assume the user is trying to trick them. I am on the fence about that being an alignment that should be trained into an LLM. A lot of issues can arise with trust and manufactured ingenuity vs true ingenuity as a result. In particular, outputs whose presentation format masks what actually happened under the hood are a key issue.
Often the arbitrary array question has produced either incomprehensible answers or flat-out incorrect ones. Reasoning models “panicking” has been witnessed a few times. This one is answered incorrectly in a surprisingly large number of samples.
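The actual array from the question is not reproduced here, but as a hypothetical sketch, the shift-down-by-1 abstraction that most people (and most LLMs) land on is just this:

```python
# Hypothetical example of the shift-down-by-1 abstraction (the real question's array differs).
def shift_letters(text: str, offset: int = -1) -> str:
    """Shift each letter by `offset` places in the alphabet, wrapping around."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            out.append(chr(base + (ord(ch) - base + offset) % 26))
        else:
            out.append(ch)
    return "".join(out)

print(shift_letters("IBM"))   # -> "HAL", the classic shift-down-by-one example
```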
Random Cat Sentence Parse and Count
*Same as Parsing Peppermints.
Numeric Comparison
This has been achievable by every LLM in 2025, to my recollection, and indeed we have moved past this being a meaningful baseline question. It persists simply for consistency in question set 1.
Hundred Decimals of Pi
Typically this involves the LLM creating a theoretical solution pathway involving something like the Chudnovsky Pi calculations, which you can read more about here. Eventually the LLM realizes that it does not have the compute capability to calculate this directly, and it reverts to the standard reference values, which record Pi to high decimal precision and are without a doubt trained into every LLM, and have been for some time. Often they recall and verify the solution in step-wise fashion, and this presents reasoning models with an opportunity to diverge and miss. To be clear, none of the LLMs that presented the information in a format that made it look like they used a program to reach the solution actually arrived at it that way. Rather, they have written code and then recalled the answer, not used the code to solve for the answer. Recently we did witness code produced that did arrive at an answer, and the LLM's presentation of it was very impressive. Of note, nothing in the question says they cannot just recall the decimals and output them. Few do.
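For reference, the kind of program the LLMs typically sketch (and then usually do not actually run) looks roughly like this minimal Chudnovsky implementation using Python's decimal module:

```python
from decimal import Decimal, getcontext

def chudnovsky_pi(digits: int) -> Decimal:
    """Approximate pi to about `digits` decimal places with the Chudnovsky series."""
    getcontext().prec = digits + 10               # extra guard digits for rounding error
    C = 426880 * Decimal(10005).sqrt()
    M, L, X, K = 1, 13591409, 1, 6
    S = Decimal(13591409)
    for i in range(1, digits // 14 + 2):          # each term adds roughly 14 digits
        M = M * (K ** 3 - 16 * K) // i ** 3
        L += 545140134
        X *= -262537412640768000
        S += Decimal(M * L) / X
        K += 12
    return C / S

print(chudnovsky_pi(100))   # pi to ~100 decimal places
```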
Create an SVG of {{thing}}
This has been one of the more entertaining questions to run and has had some really interesting outputs. Cat on a fence has been entertaining, and the expansion of this in question set 2 is even more engaging in creativity. The ability of the LLM to structure the elements of the cat and the fence and position them close to what would look visually correct is a real challenge; if evaluated by an artist, it is likely every one of these would fail. The layering of the composition is critically important to the outcome as well and has been a common sticking point for the LLMs. We have had it create a few other variants over time, but they have all been challenging for the same underlying fundamental reasons. A human would likely need to adjust the elements’ positions in the inspector to arrive at correct attachment points. The question also offers some possible insight into how textual models can create visual elements. It is also highly subjective to rate an output as pass or fail, as some of them may fall into the modern abstract art category very well. I hope someone in the future recalls how I was so puzzled by a cat SVG that I had to get hundreds of people to vote on whether it was a cat or not.
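To show why the layering matters so much, here is a purely hypothetical, heavily simplified sketch of an SVG being assembled in Python. Paint order is document order, so if the cat elements are not emitted after the fence, the fence covers the cat no matter how well each individual shape is drawn:

```python
# Hypothetical, heavily simplified sketch: SVG paint order is document order,
# so layering (fence first, cat on top) decides whether the composition reads correctly.
fence    = '<rect x="0" y="120" width="300" height="12" fill="#8b5a2b"/>'          # fence rail
cat_body = '<ellipse cx="150" cy="110" rx="35" ry="22" fill="#444"/>'               # sits on the rail
cat_head = '<circle cx="150" cy="80" r="16" fill="#444"/>'
cat_tail = '<path d="M185 110 q30 -10 25 -40" stroke="#444" stroke-width="6" fill="none"/>'

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="200">'
    + fence + cat_body + cat_head + cat_tail     # cat drawn after (on top of) the fence
    + '</svg>'
)

with open("cat_on_fence.svg", "w") as f:
    f.write(svg)
```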
Pico de Gato
This has been achievable by every LLM in 2025, to my recollection, and indeed we have moved past this being a meaningful baseline question. It persists simply for consistency in question set 1.
Two Driver Problem
This question has a flaw in the manner of failure that we did not anticipate when we created it. The intention is first to validate that the LLM has a correct understanding of the distances and can then apply them to a straightforward formula to arrive at an answer for which driver arrives first. Instead, we have witnessed the LLMs often produce very inaccurate distance estimations that would fundamentally change the travel plans for a person; however, the output result is technically accurate because they “fail long” in the estimation. In hindsight, the distance estimation recall and/or extrapolation needs to be tightly coupled with the pass or fail of a model. It is a shame that we see most LLMs fail the estimation yet still be correct, and this is corrected in test set 2. The problem is dramatically different in that set, and the likelihood of failing the estimation in the correct direction to achieve a pass is removed by additional constraints.
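As a worked, purely hypothetical example of that flaw (the distances and speeds below are invented, not the ones from my actual question): even if a model badly overestimates both distances, the ranking of the two drivers can still come out right, which is exactly the “fail long but technically correct” outcome described above.

```python
# Purely hypothetical numbers illustrating how "failing long" on distance estimation
# can still produce the correct final ranking of the two drivers.
def arrival_hours(distance_miles: float, speed_mph: float) -> float:
    return distance_miles / speed_mph

# Ground truth: driver A arrives first.
true_a = arrival_hours(300, 60)   # 5.0 h
true_b = arrival_hours(280, 50)   # 5.6 h

# A model that overestimates both distances by a trip-ruining margin still ranks them correctly.
est_a = arrival_hours(380, 60)    # ~6.3 h
est_b = arrival_hours(360, 50)    # 7.2 h

print(true_a < true_b, est_a < est_b)   # True True -> "correct" answer despite bad distances
```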