Word Hammer, v2.0

30-June-2025 By Jeffrey Cooper

Welcome to my long-awaited follow-up to the POC version of Word Hammer that I released last December. After some positive feedback and serious use of my own, I decided to create a much more capable version! This became an obsession for me for a few months, as I am working diligently on becoming fluent in Spanish (see my Spanish text immediately to the right), and Word Hammer has become one more tool to help me do that.

It serves multiple purposes. My projects all create tools, first and foremost, for me. This is one of my better ones. It works with ANY language, and is a drill sergeant, meant to help you fine-tune your language skills. It is not a replacement for a teacher or Duolingo, but a supplement. In free-form text, you simply tell it what you wish to practice next, and it prepares, initially, 20 questions, but will continue indefinitely if you like. Ostensibly, the goal is to get a score of 20, which seems sufficient to at least temporarily master a problem area.

As part of this app, I also tackled many different complex subparts, not because they were required, but because I wanted a more complete app and built other experiments into it as well. Both the app and the process of developing it were multi-purpose.

Nota para los lectores españoles: Estoy escribiendo mis articulos en dos idiomas mientras lo aprendo. Para mas información, lea este artículo.

Bienvenido a mi lanzamiento siguiente a la versión de prueba de concepto de Word Hammer que lancé el pasado mes de diciembre. Después de algunos comentarios positivos y uso intensivo por mi parte, ¡decidí crear una versión mucha más capaz! Esta app se convirtió en una obsesión para mi durante varios meses como estoy tratando para hablar con fluidez en español (por ejemplo, ese texto en español que escribí con solo un poca ayuda, por su mayor parte). Word Hammer se ha convertido una más herramienta para mi.

La app sirve varios propósitos. Todos mis proyectos crean herramienta, en primer lugar, para mi. Esta es una de mis mejores apps. Funciona in cualquier idioma y es una herramienta muy enfocada para ayudarte con precisión tus habilidades de las lenguas. No es un sustituto para Duolingo, pero suplemental. En texto libre, simplemente decirlo que quieres practicar, y preparará, inicialmente, 20 preguntas, pero continuará indefinidamente si quieres. Un buen objetivo es para obtener una puntuación de 20, lo cual parece suficiente para dominar temporalmente un área problemática.

Como un parte de esta app, también hice subpartes complejos, no porque fuera obligatorio, pero porque deseé una app más completa y construye otros experiencias dentro también. Fue multi-purpose, ambos la app y el proceso para desarrollarlo.

Projects & Sub-projects within Word Hammer

Word Hammer itself- the Quiz Generator
Universal LLM Interface (reusable)
Clusterflux SVG UI Component Library (reusable)
Localization Support (L10N)
- JSON configuration system to contain UI strings, with a separate UI JSON file for each language, covering the entire app’s UI + Prompts (yes, the prompts are also translated into your native language if it is other than English)
- DeepL Integration for UI translations
- Translation Management Dashboard

This list shows why Project 52 became difficult to maintain- one big project that took a lot of time was composed of a number of sub-projects.

There is a LOT baked into this app, and its purpose is, first and foremost, to evaluate using LLMs to help learn a foreign language. So you have a LOT of options! Note that this is not a commercial app, and I do pay the bills for the LLM API calls that are happening behind the scenes. But this app has a lot of moving parts to it, as I will explain. So, let’s dig into it!

Esta lista demuestra por que Project52 qué se volvió difícil para mantener- un gran proyecto que tarda mucho tiempo era construido de muchos sub-proyectos.

Hay muchas cosas dentro esta aplicación, y en primar lugar, es para evaluar LLMs para ayudar a aprender un nuevo idioma. ¡Así tienes muchas opciones! Ten en cuenta no es una aplicación comercial y pago los gastos para las llamadas a los APIs de los LLMs que ocurrir atrás de las escenas. Pero esta app tiene muchas piezas móviles, como explicaré. ¡Así, investiguemos!

Quiz Type & LLM Selection

When the page loads, you see a much more complex screen than in v1.0, which simply asked for your quiz subject and used SambaNova’s hosted Llama model for the quiz.

Now I let you choose from 19 different LLM models from 5 different companies! And I let you tweak the temperature and top-K settings on some of the models (more below). And it has 5 different quiz modes!

Beyond that, it is integrated with DeepL, a language translation service that provides UI string translations for the standardized l10n (localization) framework that I implemented! So if your native language is, say, Greek, the UI itself will be in Greek! The quizzes themselves can go between any two languages in the world. The UI localization supports 30 different languages (that is what DeepL supports- not every language on Earth).

And, I also developed a small UI library for me to use- which I call Clusterflux 😂. The switch, slider, and LED indicators among others, are used in the app to give it a custom feel, and because I wanted to do it.

Cuando se carga la página, ves una pantalla más compleja de en v1.0, lo cual simplemente te pidió que el sujeto para tu prueba. y utilizó un modelo Llama se alojó de SambaNova para la prueba.

Ahora, ¡te permito que escoger de 19 modelos de LLMs de 5 empresas diferentes! Y te permito ajustar las configuraciones de la temperatura y Top_k en varios de los models (más a continua). ¡Y tiene 5 modos de pruebas diferentes!

Más allá eso, es integrado con DeepL, un servicio para traducir los idiomas, que proporciona traducciones de las cadenas del UI para el framework de l10n (localización) que he implementado. Si tu idioma nativo es griego, ¡el UI lo mismo será en griego! Con la app, es posible para aprender entre cualquier dos lenguas en del mundo. La localización la misma apoya 30 idiomas diferentes (DeepL apoya 30 idiomas- no todos los idiomas del mundo).

Y también desarrollé una pequeña biblioteca UI para mi para usar que me llamo Clusterflux 😂. El interruptor, el deslizador y la indicación de LED con otros, se utilizan en la app para darle un toque personalizado, y porque quise hacerlo.

The first box contains the pair of languages you are going between, as well as the Quiz types. Everything is free-text input, but the Base Language defaults to your browser’s base language. You can override this, though! And you can specify any target language, including regional variations, by simply typing it. I use Latin American Spanish for myself.

I have had friends test the following languages, with decent results: Spanish (me from English and a friend from Ukrainian), Ukrainian (from Spanish by my Spanish teacher), Japanese, and Phonetic Bengali. Additionally, I have done some testing in German.

In the animated GIF above, you can see the 5 different quiz modes, which arose as I tested the app and had ideas about how to improve the quizzes for the problem areas I wanted to work on. They are pretty self-explanatory from the screenshots.

Monte Carlo is inspired by Monte Carlo Testing, in that it is a completely random quiz and is a bit like a boss level in Doom. It can be incredibly hard- more on that below, though.

La primera caja contiene ambos idiomas que usarás, así como los tipos de la pruebas. Todo es entrada de texto libre, pero el idioma baso por defecto al idioma de tu navegador. Sin embargo, puedes cambiarlo. Y puedes escoger cualquier un idioma de destino, incluso variaciones regionales simplemente por entrarla. Uso españo latinoamericano para mi.

He tenido amigos que lo prueban con los idiomas siguientes, con buenos resultados: Español (mi, de ingles, y un amigo de ucraniano), ucraniano (de español por mi maestro de español), japonesa, y bengalí fonético. Y lo he probado con algo alemán.

En el GIF animado anteriormente, puedes ver los 5 modos diferentes de las pruebas, que creé y tenía ideas alrededor como mejorar las pruebas en considera de las áreas de problemas lo que quería mejorar.

Monte Carlo fue inspirado por Monte Carlo Testing, en que es una prueba aleatorio completamente y es similar, en una manera, a un nivel del jefe en Doom. Puedes estar muy difícil, más en eso a continua.

Set Your Language Proficiency

Besides the quiz types, the most important thing to do is set your Language Proficiency! It defaults to Present Tense only. This hidden panel (click on the Language Proficiency button) is super helpful. The main thing it does is instruct the LLMs to “flex” the verb tenses to more broadly exercise you at the level you are currently at (but there’s a fun variation to that- more in a moment). The actual effect of this depends on the LLM, and some are better than others. Some of the lower-end models (which are also the cheaper ones) are not very creative when they generate your quiz, and may give you mainly present and past tense, with just a smattering of others. The more expensive models are much better at presenting verbs in most of the tenses you check, giving you a very diverse quiz.

Vocabulary size also matters, though it is just a relative slider- there is no set way to measure it, so you just self-identify your level. Idiom knowledge is the other area- idioms are the hardest thing to learn, as the meaning of a phrase may be unrelated to the translations of the individual words (examples in English- “once in a while,” “piece of cake,” “bite the bullet,” “by the skin of your teeth,” “hit the sack,” etc…). Given their difficulty, I have not yet tested how well the app quizzes idioms.

Now, for the fun variation: You can reverse the Proficiencies and guide it to test you specifically in a new mode. Instead of checking everything you know- uncheck them all and then check only what you want to practice, such as Imperfect Subjunctive. It should generate mostly questions that use just that mode! So it works both ways- inclusive or exclusive.

Además los tipos de cuestionario, ¡la cosa más importante es para configurar tus competencias lingüísticas! Por defecto es solo tenso Presente. Este panel oculto (haces clic en el botón de Competencias Lingüísticas) es super útil. La primera cosa lo que hace es instruir el LLM para maximizar los tiempos verbales ya lo que sabes en las cuestionas, en tu nivel actualmente (pero, hay un variación divertida a eso- más en un momento). El efecto real de esto depende en cuál LLM escogiste, y unos son más mejor que otros. Unos de los LLMs más pequeño (cuáles también más baratos) no son tan variables en cuanto a su creatividad cuando generan tu cuestionario. Los modelos pequeños te das principalmente los tiempose presente y preterit, y solo un poquito de los otros. Los models mas caro son más mejor en presentar verbos en la mayoría de los tiempos tu escogiste, y te dan una cuestionario muy diversa.

El tamaño de vocabulario es importante también, aunque es solo un control deslizante relativo- no hay una manera para medirlo, así adivinas tu nivel. Tu conocimiento de modismos es la otra área- hay que la cosa más difícil para aprender, ya que el significada de la frase puede tener que no relación de las traducciones de las palabras individuales (ejemplos en espańol- “de vez en cuando,” “jugar con nuestro pulgares,” “Donde fueres haz lo que vieres,” “Encontrar a tu media naranja,” etc…). Dado su dificultad, no he probado qué tan bien funciona los modismos.

Ahora, para la variación divertida. Puedes revertir las competencias y guiarla para probarte específicamente en un nuevo modo. En lugar de marcar todos tiempos lo que conoces, desmarcas todos, y entonces marcas solo lo cual los tiempos verbales que deseas practicar, como subjuntivo imperfecto. Debería generar principalmente las cuestionas que usa solo eso modo. Así se funciona en ambos direcciones- inclusivo o exclusivo.

The Quizzes

The quizzes are quite simple in structure- they are only fill-in-the-blank, and they do enforce using the correct letters, including accents. Right answers are +1 and wrong answers are -1. The app will give you the right answer when you answer incorrectly, but you must then retype it- just type in the given correct answer, and you get +0.5 points back. The point of that is that typing it helps reinforce the correct answer a bit more. And if you miss an accent or other diacritical mark, but the answer is otherwise correct, you will get +0.5. Accents and diacriticals are important, which is why the scoring reinforces them- the aim, ultimately, is fluency.
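The scoring rules above can be sketched in a few lines. This is only an illustration, not the app's actual code: the function names are hypothetical, and it assumes answers are compared case-sensitively after trimming whitespace, and that "missing a diacritical mark" means the two strings match once combining marks are stripped.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove accents and other combining marks (e.g. 'está' -> 'esta')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def score_answer(given: str, expected: str) -> float:
    """First attempt: +1 if exact, +0.5 if only diacritics differ, else -1."""
    # Normalize to NFC so composed and decomposed accents compare equal.
    given_n = unicodedata.normalize("NFC", given.strip())
    expected_n = unicodedata.normalize("NFC", expected.strip())
    if given_n == expected_n:
        return 1.0
    if strip_diacritics(given_n) == strip_diacritics(expected_n):
        return 0.5
    return -1.0


def score_retype(retyped: str, expected: str) -> float:
    """After a miss, retyping the shown answer exactly earns +0.5 back."""
    same = unicodedata.normalize("NFC", retyped.strip()) == unicodedata.normalize(
        "NFC", expected.strip()
    )
    return 0.5 if same else 0.0


# A wrong answer followed by a correct retype nets -0.5 overall.
total = score_answer("estoy", "está") + score_retype("está", "está")
```

Stripping combining marks also treats "ñ" as "n", which fits the "other diacritical mark" rule but is an assumption about how strict the real app is.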

LLMs do occasionally make mistakes, so if you get a wrong answer, and you really do feel you got it correct, you can click on the flag icon 🏳️ to waive the -1 score debit.

Though the scoring is for tracking your progress, there is no threshold at which the app stops quizzing you. It will literally go on forever (remember, this is an evaluation- not an end-user consumer app!). But a good rule of thumb is that a score of 20 is a good stopping point. To that end, the LED turns greener as your score rises and maxes out at full green at 20, and when you reach that score, you are “rewarded” with an explosion of the word “Awesome” (which is, of course, localized to your native language via the L10N system 🙂) emanating from the LED. This is one of the UI components (a non-SVG one) that I created in the Clusterflux project.

And that’s it! It’s a pretty easy app to use- the biggest challenge is realizing that you can word your requests in creative ways to really go deep and get surgical with your problem areas. It also serves as an LLM playground, albeit one restricted to language quizzes.

Los cuestionarios son muy sencillos en la estructura- son solo preguntas de rellenar los espacios en blanco. Word Hammer obliga a utilizar las letras y los acentos correctas. Respuestas correcta son +1 y respuestas incorrecta son -1. Dará la respuesta correcta si respondiste incorrectamente, pero debes volver a entrar otra vez- justo entras la respuesta dado, y recuperas +0.5 punto. El punto de eso es que entrarla puede reforzar la respuesta correcta un poca más. Y olvidó un acento o otra marca diacrítica pero la respuesta por lo demás es correcta, te dará +0.5 punto. Acentos y marcas diacríticas son importantes, y que es por que la app refuerza eso- el objetivo final es la fluidez.

Ocasionalmente, los LLMs hacen errores, así obtienes una respuesta incorrecta, y si realmente crees que tenías razón, puedes hacer clic en el icono de la bandera 🏳️ para despedir el cargo de -1 punto.

Aunque la puntuación es para observar tu progreso, no hay un umbral especifico en que la app deje de probarte. Literalmente, la app funcionará para siempre (recuerda, esta en para evaluación, ¡no es una app de consuma!). Pero, una buena regla general es que una puntuación de 20 está un buen punto de parada. Con ese fin, cómo el LED se ilumina más verde, alcanza su máximo nivel a 20, y cuando alcanzas esa puntuación, se te recompense con una explosión de la palabra “Impresionante” (por supuesto cual es traduce a tu idioma nativo por el sistema de L10N 🙂) que emanar del LED. Este es uno de los componentes de UI (no es SVG en ese caso) que creé en el proyecto de Clusterflux.

¡Y es todo! Es una app facil para usar- el desafío más grande es para realizar que sea creativo y escribir tus áreas de problema en maneras creativas para profundar y probar con precisión con tu áreas de problema. Y sirve como un patio de recreo de LLMs, aunque está restringido a las pruebas de lenguas.

Behind the Screen

I also want to talk a bit about what is going on behind the scenes, the reasons for the diversity of LLMs, and the information you see in the app. Since it is an LLM evaluation as much as a language learning tool, it is not a consumer app at this point.

También quiero hablar un poco sobre que ocurre detrás de las escenas, las razones para la diversidad de los LLMs, y la información que se muestra en la app, que, al ser tanto una herramienta de evaluación de los LLMs, como una herramienta para aprender los idiomas y no está una app para consumidores en este momento.

The list of LLMs you see today in the app is not the same as it was 1-2 or 4 months ago, when I was developing it. An “AI Year,” compared to an Internet Year or a Dog Year, is very short- it seems every week a new model is released and an old one is deprecated. All the models I have tested, including ones that I removed, are shown in the chart above, which I compiled about a month ago. Stripped of units, it shows the relative time the same query takes on different LLM models.

In the case of Cerebras and SambaNova, these are smaller, open source models that they host. Of all of them, Llama models on Cerebras are by far the fastest. However, being Llama models, they are smaller and the quality of the quizzes is lower if you are advanced in your studies.

You can see in OpenAI’s case that the “standard” LLMs are faster than the thinking ones- the o-models. Some models are not shown here: Claude Opus because of its cost, and Gemini 2.5 Pro because I have not yet anchored my Gemini account with a credit card, so it is restricted to Flash, which is generous with free queries. OpenAI, Anthropic Claude, and xAI do charge me, so when you use the app, I do incur a small cost. With limited exposure, that is acceptable, as I have not yet figured out a good business model to turn this into a full-blown language-learning assistant.

Of all the models above, if you are advanced, the higher level models will flex your verb tense knowledge and vocabulary the most. If you are a beginner, you will get great results with cheaper models.

La lista de los LLMs se que ves hoy en la app no está la misma se que viste 1-2 o 4 meses pasados, quando la estaba desarrollando. «Años de IA» en comparación con años de Internet or años de perro, está muy cortos- se parece que cada semana o 2 un nuevo modelo se lanza y un modelo viejo se depreca. De los todos modelos que he probado, incluso modelos que se deprecados, se muestran en el gráfico anterior que creé un mes pasado. Sin unidades, esto demuestra el tiempo relativo para la misma consulta a los modelos de los LLMs diferentes.

En el caso de Cerebras y SambaNova, estos son modelos más pequeńos de Open Source que se alojan. De todos, los modelos Llama en Cerebras son, con diferencia, los más rápidos. Sin embargo, porque los modelos son más pequeños, la calidad de los cuestionarios es más baja si estás más avanzados en tus estudios.

Puedes ver in el caso de OpenAI- los modelos estándar son más rápido que los models de pensamiento, los modelos como o3. Y varios modelos no muestran porque del precio- Claude Opus por ejemplo. Para Google Gemini Pro, simplemente no he conectado mi tarjeta de crédito a una cuarta empresa (¡tres es bastante!) y Gemini Flash es gratis. Para OpenAI, Anthropic Claude, y xAI, me lo cargan un poco cada vez les usan. Con distribución bajo, el costo es OK, porque no he determinado todavìa un modelo de negocio para convertirlo a la app comercial.

De todos los models anteriores, si están avanzado, los models más altos desafiarán tu pensamiento y vocabulario al máximo, Si están un principiante, obtendrás buenos resultados con los models más baratas.

The Google Gemini costs are shown for reference only, since I’m not paying for Google services; the others show the real costs per million tokens (a token is roughly 4 characters), added together. Please keep that in mind and skew to the cheaper models if you are going to use Word Hammer a lot and are not advanced. I do find it interesting that Anthropic has not dropped the Sonnet 3.5 or 3.7 prices even as the more capable 4 has come out.

Gemini Flash models are different- they are free if you agree to let them use your submissions for training. The cost shown above is if you choose to use them privately. I use the free models.
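To get a feel for what "cost per million tokens" means for a single quiz, here is a rough back-of-the-envelope sketch. It uses the approximation of about 4 characters per token from the text; the prices and character counts in the example are made-up placeholders, not the real figures from the chart, and the split into separate input and output prices is an assumption.

```python
CHARS_PER_TOKEN = 4  # rough rule of thumb: one token is about 4 characters


def estimate_tokens(num_chars: int) -> float:
    """Approximate token count from a character count."""
    return num_chars / CHARS_PER_TOKEN


def estimate_cost_usd(
    prompt_chars: int,
    completion_chars: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one request from per-million-token prices."""
    input_cost = estimate_tokens(prompt_chars) * input_price_per_million / 1_000_000
    output_cost = estimate_tokens(completion_chars) * output_price_per_million / 1_000_000
    return input_cost + output_cost


# Hypothetical example: a 4,000-character prompt and an 8,000-character quiz,
# priced at a placeholder $2 (input) and $8 (output) per million tokens.
example_cost = estimate_cost_usd(4_000, 8_000, 2.0, 8.0)
```

Under those placeholder numbers a single quiz costs a couple of cents at most, which is why light usage is affordable while heavy use of the expensive models adds up.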

Los costos de Gemini de Google se muestran para referencia como yo no pago por estos servicios. Los otros se muestran los costas reales por 1 millón de tokens (un token = ~4 caracteres) agregan juntos. Por favor recuerdas eso y usas los modelos más baratos si no en un nivel avanzado y planearás usas Word Hammer mucho. Creo que es interesante que Anthropic no he bajado los precios para Sonnet 3.5 o 3.7, aunque el model más capaz de 4 ha lanzado.

Los modelos de Gemini Flash están diferentes- son gratis si aceptas que permitirlos utilicen tus envíos para entrenar. Los costos se muestra son si escoges utilizarlos en privado. Utilizo los modelos gratis.

Composite Diversity Score

Since I’ve been talking about whether you are a beginner or an advanced learner, this chart shows some measures of how well the different LLMs perform if you are advanced in your language skills.

To do this measure, I created a Composite Diversity Score, which is an average of a number of different methods by which to measure word diversity in sentences. For this test, the Language Proficiencies are set to all verb tenses being mastered and the LLMs are instructed to maximize the verb tense variability to flex the user in all these different tenses. This measures how well any given LLM achieved that, and consists of 5 different measures for a 20-question quiz:

Desde he hablando sobre tu estás una principiante o estudiante avanzado, esto gráfico demuestra unas medidas para saber qué tan bien realizan los LLMs si eres avanzado bastante en tus habilidades de idiomas.

Para hace medirlo, creé una Puntuación Combinada de Diversidad, cuál es un promedio de un numero de métodos diferentes para medir la diversidad de las palabras en las frases. Por esta prueba, todas las competencias de los verbos están marcados, y los LLMs están pedido para maximizar los tiempos verbales para probar el usuario en todos estos tiempos. Este medido mede cómo bien cualquier dado LLM realizaba, y es una agregación de 5 medidas diferente por un cuestionario de 20 preguntas.

Word Variability Statistical Measures

Unique Answers- a simple count of the number of distinct tenses in the 20 questions
Top 3 Frequency- what % of the 20 questions were taken up by the 3 most frequent tenses
Gini Coefficient- measures how much one tense dominates
Shannon Entropy- measures how evenly the tenses are distributed
Hapax Legomena- measures how many tenses appeared exactly one time

CDS = 0.225×Unique Answers + 0.2×(1−Top 3 Freq) + 0.25×Entropy + 0.15×Hapax Legomena − 0.15×Gini Coefficient

These were combined to come up with the diversity score in the chart above. This is the reason I allowed some limited capability to adjust the Variance settings on LLM models that allowed it. In some cases, higher Variance settings (a combination of Temperature and Top k) made LLMs unstable, while lower settings made them more repeatable (and less creative). In one odd case, the lowest temperature setting blew up the Qwen model (which I have since removed from the app).
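As a rough illustration of how the five measures and the weighted sum could be computed, here is a minimal sketch. The weights come from the CDS formula above; everything else is an assumption, since the text does not say how each measure is scaled. Here every measure is normalized to the 0-1 range: counts are divided by the number of questions, and entropy is divided by the maximum entropy possible for that many questions.

```python
import math
from collections import Counter


def gini(counts) -> float:
    """Gini coefficient of the tense counts (0 = all tenses equally frequent)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def composite_diversity_score(tenses: list[str]) -> float:
    """Weighted blend of five diversity measures over one quiz's tense labels."""
    n = len(tenses)
    if n == 0:
        raise ValueError("need at least one question")
    counts = Counter(tenses)
    unique = len(counts) / n                                   # Unique Answers
    top3 = sum(c for _, c in counts.most_common(3)) / n        # Top 3 Frequency
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    entropy_norm = entropy / math.log2(n) if n > 1 else 0.0    # Shannon Entropy
    hapax = sum(1 for c in counts.values() if c == 1) / n      # Hapax Legomena
    return (
        0.225 * unique
        + 0.2 * (1 - top3)
        + 0.25 * entropy_norm
        + 0.15 * hapax
        - 0.15 * gini(counts.values())
    )


# A quiz stuck in the present tense scores far lower than a varied one.
monotone = composite_diversity_score(["present"] * 20)
varied = composite_diversity_score([f"tense_{i}" for i in range(20)])
```

With this normalization, a quiz that never leaves the present tense lands near zero, while a quiz with a different tense in every question reaches the top of the scale, which matches the intent described above even if the real implementation scales things differently.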

The winner here, by far, was Google’s Gemini Pro. It did have a free usage scenario for a while. But as good as it is, for the moment, I have decided to limit usage of Google. In the case of Pro, it would be too tempting to just use it all the time. o3-mini gets close at half the cost. And again, this chart is only useful if you are advanced- otherwise, stick to the cheaper models, as they do just fine.

When a model scores low on the composite diversity score, it means that it mainly sticks to Present Tense and maybe some Past Tenses- Preterit and Imperfect. Future, Conditional, Imperative and Subjunctive tenses are rare.

Estas fueron combinado para crear la puntuación de diversidad en la gráfico anterior. Esto es la razón que añadí una capaz para ajustar la configuración de las variables de Temperatura y Top_k, que llamo Variance, por los modelos que apoyarlo. Pero en unos casos, configuraciones con números más altos hacía los modelos inestable, mientras con números mas bajos los hacía más repetible (y menos creativo). En un caso, el modelo de Qwen (que no exista en la app nada más), inusualmente, la temperatura más baja lo hizo explotar.

El ganador, sin lugar a dudas, fue Google Gemini Pro 2.5. Si tenía un escenario para usarlo gratis por un tiempo, pero no más. Por el moment, decidí limitar Google a los modelos gratis de Gemini Flash. En el caso de Pro, sería demasiado tentador para utilizarlo todo el tiempo. Y otra vez, este gráfico es solo útil si estás más avanzado, porque de lo contrario, los modelos más baratos son muy buenos.

Cuando un modelo obtiene una puntuación baja en la puntuación de diversidad, que significa que presenta los verbos principalmente el tiempo de presente y unas veces preterit y imperfecto. Los tiempos futuros, condicionales, imperativos, y sujbuntivos son raros.

Universal LLM Handler

I mentioned also having a Universal LLM Handler. I wrote it because I wanted to mix and match different LLMs, and as we have all seen, new LLM models are being released every few weeks. To keep up and keep work to a minimum, I wanted a single interface in the code and abstracted the actual interfaces to a JSON configuration file.

También, mencioné que hay un gestor universal de LLMs. Lo escribí porque quise mezclar y combinar LLMs diferentes, y cómo hemos visto, nuevos models se están lanzado cada algas semanas. Para mantener el trabajo al mínimo, quise una sola interfaz en el código para abstraer las interfaces reales a un archivo de configuración en JSON.

    "UniversalRequest": {
        "description": "Standard request format that will be mapped to each LLM's specific requirements",
        "format": {
            "model": "string",
            "messages": [{
                "role": "enum(system, assistant, user)",
                "content": "string"
            }],
            "max_tokens": "integer",
            "temperature": "number",
            "top_p": "number",
            "top_k": "number",
            "frequency_penalty": "number",
            "presence_penalty": "number",
            "stream": "boolean"
        }
    },
    "UniversalResponse": {
        "description": "Standard response format that all LLM responses will be mapped to",
        "format": {
            "id": "string",
            "content": "string",
            "usage": {
                "prompt_tokens": "integer",
                "completion_tokens": "integer"
            }
        }
    },
    "LLMs": {
        "OpenAI": {
            "metadata": {
                "enabled": true,
                "markdown_responses": true,
                "supports_streaming": true,
                "max_context_length": 4096
            },
            "models": {
                "gpt-4.1-2025-04-14": {
                    "max_tokens": 16384,
                    "display_name": "GPT-4.1",
                    "enabled": true,
                    "cost": "$$",
                    "request_format": {
                        "template": {
                            "model": "$model",
                            "messages": "$messages",
                            "max_tokens": "$max_tokens",
                            "temperature": { "min": 0.0, "max": 2.0 },
                            "top_p": { "min": 0.01, "max": 1.0 },
                            "frequency_penalty": { "min": -2.0, "max": 2.0 },
                            "presence_penalty": { "min": -2.0, "max": 2.0 },
                            "stream": "$stream"
                        }
                    },
                    "response_format": {
                        "success_path": "choices[0].message.content",
                        "error_path": "error.message",
                        "token_usage_path": "usage"
                    }
                },
.
.
.
        "Claude": {
            "metadata": {
                "enabled": true,
                "markdown_responses": true,
                "supports_streaming": true,
                "max_context_length": 200000
            },
            "models": {
                "claude-3-5-sonnet-20241022": {
                    "max_tokens": 8192,
                    "display_name": "Claude Sonnet 3.5",
                    "enabled": true,
                    "cost": "$$$",
                    "request_format": {
                        "template": {
                            "model": "$model",
                            "system": "$messages:filter(role=system):first:content",
                            "messages": "$messages:ignore(system)",
                            "max_tokens": "$max_tokens",
                            "temperature": { "min": 0.0, "max": 1.0 },
                            "top_p": { "min": 0.01, "max": 1.0 },
                            "top_k": { "min": 1, "max": 40 },
                            "frequency_penalty": { "min": -2.0, "max": 2.0 },
                            "presence_penalty": { "min": -2.0, "max": 2.0 },
                            "stream": "$stream"
                        }
                    },
                    "response_format": {
                        "success_path": "content[0].text",
                        "error_path": "error.message",
                        "token_usage_path": "usage"
                    }
                },
.
.
.
etc...

You can see some snippets of that JSON file above. While they are all similar and loosely follow the OpenAI “standard” RESTful interface, there are differences, even between models from the same provider. For example, OpenAI’s chat models accept the temperature setting I expose (the GPT-4.1 template above has no top_k entry, while the Claude one does), whereas their thinking models, like o3, do not. And, importantly, Google and Anthropic deviate from the OpenAI format in either the Role fields or the message content itself, so this has to be accounted for. To handle that, there are “$markers” in the various JSON models that trigger some logic to slice and dice the relevant parameters into the right locations. Currently, this only supports RESTful interfaces, not gRPC. Also, I’m not using streaming.
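To make the "$marker" idea concrete, here is a minimal sketch of how such a template could be turned into a provider-specific request body, and how a path like "content[0].text" could be used to pull the answer out of a response. This is my guess at the logic, not the actual handler: the function names are hypothetical, the marker grammar is inferred only from the three forms visible in the snippet, and the choice to clamp values to the min/max range and to omit settings the caller did not supply is an assumption.

```python
import re
from typing import Any


def apply_marker(marker: str, params: dict) -> Any:
    """Resolve a '$name:step:step' marker against the universal request."""
    name, *steps = marker[1:].split(":")
    value = params.get(name)
    for step in steps:
        match = re.fullmatch(r"(\w+)(?:\((.*)\))?", step)
        if match is None:
            raise ValueError(f"Unrecognized marker step: {step!r}")
        verb, arg = match.groups()
        if verb == "filter":      # filter(role=system): keep matching items
            key, wanted = arg.split("=", 1)
            value = [m for m in value if m.get(key) == wanted]
        elif verb == "ignore":    # ignore(system): drop messages with that role
            value = [m for m in value if m.get("role") != arg]
        elif verb == "first":     # first: first list item, or None if empty
            value = value[0] if value else None
        else:                     # anything else is treated as a field name
            value = value.get(verb) if isinstance(value, dict) else None
    return value


def render_request(template: dict, params: dict) -> dict:
    """Build a provider-specific request body from a model's template."""
    body = {}
    for key, spec in template.items():
        if isinstance(spec, str) and spec.startswith("$"):
            value = apply_marker(spec, params)
        elif isinstance(spec, dict) and {"min", "max"} <= spec.keys():
            if key not in params:
                continue          # tunable the caller did not set: leave it out
            value = min(max(params[key], spec["min"]), spec["max"])
        else:
            value = spec          # literal value, copied through unchanged
        if value is not None:
            body[key] = value
    return body


def extract_path(data: Any, path: str) -> Any:
    """Follow a path such as 'choices[0].message.content' into a response."""
    for name, index in re.findall(r"(\w+)|\[(\d+)\]", path):
        data = data[name] if name else data[int(index)]
    return data
```

Given the Claude template from the snippet, a universal request with one system and one user message would come out with the system text lifted into a top-level "system" field and only the user message left in "messages". The appeal of this design is that supporting a new model becomes a JSON edit rather than a code change.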

Puedes ver algunas porciones del archivo de JSON anterior. Mientras la mayoría de los models siguiente un formato similar de la interfaz RESTful «norma» de OpenAI, hay diferencias, incluso entre modelos del mismo proveedor. Desde escogí exponer la temperatura y top_k, los modelos de chat que usa OpenAI utilizarlos, mientras los modelos de pensamiento no utilizan. Y importantemente, Google y Anthropic se desvían de OpenAI en los mensajes de rol o del contenido en sí. Esto debe tenerse en cuenta. Así, hay «$markers» en los varios modelos en el archivo de JSON que indican cuál lógico para deslizar los parámetros relevantes a los lugares correctas. Actualmente, solo apoya las interfaces RESTful, no gRPCs. También, no utilizo el modo de streaming.

Localization & Translation Management

One other element I needed was the Translation Management System. When you follow L10N standards, your UI text is stored in strings in a master JSON file. From that master copy, in which ALL UI text elements are stored and timestamped (in case you change them), you run Translation Management to scan for new entries or updates to old entries as you work through development of your app, or update it at a later date. It does make laying out a webpage more abstract, because you can’t just put your text in situ on the webpage. It has to be extracted from a local store, which I do with the Alpine UI framework.
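The scan for new or updated entries can be sketched roughly like this. The JSON shapes are assumptions for illustration only (a master entry with "text" and "updated", a translated entry with "text" and "source_updated"), and the translation step is passed in as a plain function so that no particular DeepL client call is implied.

```python
from datetime import datetime
from typing import Callable


def keys_needing_translation(master: dict, translated: dict) -> list[str]:
    """Return keys that are missing or older than the master copy."""
    stale = []
    for key, entry in master.items():
        existing = translated.get(key)
        if existing is None:
            stale.append(key)  # new string, never translated
            continue
        source_time = datetime.fromisoformat(entry["updated"])
        translated_from = datetime.fromisoformat(existing["source_updated"])
        if translated_from < source_time:
            stale.append(key)  # master text changed after it was translated
    return stale


def sync_language(master: dict, translated: dict, translate: Callable[[str], str]) -> dict:
    """Return an updated copy of one language file, translating only stale keys."""
    updated = dict(translated)
    for key in keys_needing_translation(master, translated):
        entry = master[key]
        updated[key] = {
            "text": translate(entry["text"]),
            "source_updated": entry["updated"],
        }
    return updated
```

Translating only the stale keys matters when the translation service has a monthly character quota, since unchanged strings are never sent again.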

You can see all the languages that are natively supported below- this is the limit of DeepL, which is how I manage the translations. They have a pretty generous 500k free characters per month.

In addition to all the UI strings, I go one step further and also store all the behind-the-scenes prompts that go to the various LLMs. This is important for better results.

For example, if you are from Mexico (UI = Spanish) and you are learning Ukrainian, your prompts to the AI should NOT be in English. Otherwise you would be telling the AI, in English, to generate questions in Ukrainian along with translations of those questions in Spanish, and would therefore be using 3 languages.

So, in this example, the prompts are translated into Spanish. LLMs are natural polyglots and speak nearly all languages, and you get more consistent results from them this way.

If your native language is, let’s say, Hebrew, which DeepL does not support yet, then your UI and prompts will be in English. The same goes for any other language not (yet) supported by DeepL.
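Here is a small Python sketch of that selection logic. Everything in it is an assumption for illustration: the set of supported codes is a tiny made-up subset rather than DeepL’s real list, and the prompt wording and the "generate_quiz" key are invented. It only shows the shape of the idea, which is to use the learner’s native language for the prompt when a translation exists and fall back to English otherwise.

```python
# Illustrative subset of languages with translated prompts (not DeepL's real list).
SUPPORTED_UI_LANGUAGES = {"en", "es", "de", "fr"}

# Hypothetical prompt templates, stored alongside the UI strings.
PROMPTS = {
    "en": {
        "generate_quiz": (
            "Generate 20 fill-in-the-blank questions in {target} about: {topic}. "
            "Include an English translation of each question."
        ),
    },
    "es": {
        "generate_quiz": (
            "Genera 20 preguntas de rellenar el espacio en blanco en {target} "
            "sobre: {topic}. Incluye la traducción de cada pregunta al español."
        ),
    },
}


def resolve_locale(native_language, fallback="en"):
    """Use the native language if it has translated prompts, else the fallback."""
    has_prompts = native_language in SUPPORTED_UI_LANGUAGES and native_language in PROMPTS
    return native_language if has_prompts else fallback


def build_prompt(native_language, target_language_name, topic):
    """Fill in the quiz prompt in the learner's own language where possible."""
    locale = resolve_locale(native_language)
    template = PROMPTS[locale]["generate_quiz"]
    return template.format(target=target_language_name, topic=topic)


if __name__ == "__main__":
    # Spanish speaker learning Ukrainian: the prompt itself is in Spanish.
    print(build_prompt("es", "ucraniano", "verbos en pasado"))
    # Hebrew speaker: no translated prompts yet, so the prompt falls back to English.
    print(build_prompt("he", "Ukrainian", "past-tense verbs"))
```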

Another element I needed was a system to manage the translations. When you follow L10N standards, the UI text is kept in a master JSON file. From the master copy, in which all UI text elements are stored with a timestamp (in case they change), you run Translation Management to scan for new entries or updates to old entries as you develop the app, or update it at a later date. It makes the design of the web page more abstract, because you can’t put the text in situ on the page. It needs to be extracted from an external file, for which I use the Alpine UI framework.

You can see all the natively supported languages below. This is a limit of DeepL, which is how I manage the translations. DeepL has a generous offer of 500k free characters per month.

In addition to all the UI strings, I go one step further and keep all the prompts for the LLMs in the JSON file, too. This is important for the best results.

For example, if you are from Mexico (UI = Spanish) and you are learning Ukrainian, your prompts to the AI should not be in English, because you would be telling the AI, in English, to generate questions in Ukrainian and the translations of those questions in Spanish, and so you would be using 3 languages.

So, in this example, all the prompts are translated into Spanish, as LLMs are naturally polyglots and speak almost all languages. You get better results this way.

If your native language is, for example, Hebrew, which DeepL does not support yet, then your UI and prompts will be in English, as for any of the other languages not yet supported by DeepL.

Some languages that have international variants are currently treated as one, typically the most popular variant. Spanish is mapped to the Spanish group rather than to subvariants like Spain’s Spanish or Latin American Spanish (though you can still specify variants in the quizzes themselves!). And sorry to my UK/Aussie/New Zealand friends, but English defaults to US English.
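As a rough Python sketch of that grouping, assume the browser hands the app a locale tag such as "es-MX" or "en-GB". The mapping table and the function name are hypothetical, not taken from the app. Each tag is reduced to its base language, and the table then decides which single variant the UI uses.

```python
# Hypothetical table: base language -> the single variant the UI is translated into.
UI_VARIANT_FOR_BASE = {
    "es": "es",     # one Spanish group, no es-ES / es-MX split
    "en": "en-US",  # English defaults to US English
}


def normalize_ui_locale(tag):
    """Collapse a regional locale tag (e.g. "es-MX") to the UI's chosen variant."""
    base = tag.replace("_", "-").split("-")[0].lower()
    return UI_VARIANT_FOR_BASE.get(base, base)


if __name__ == "__main__":
    for tag in ("es-MX", "es_ES", "en-GB", "en-AU", "uk-UA"):
        print(tag, "->", normalize_ui_locale(tag))
```

This only affects the UI and prompt language. The variant you want to practice is still something you state in the quiz request itself.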

Some languages that have international variants are currently grouped as one, typically under the most popular variant. Spanish is mapped to the Spanish group and is not separated into Spanish from Spain, from Latin America, etc. (however, you can indicate specific variants in the quizzes!). And sorry to my friends in the UK, Australia, and New Zealand, but English defaults to US English.

Launching Word Hammer

Currently, Word Hammer is optimized for desktop browsers and uses fill-in-the-blank quizzes. It will not render or perform well on mobile devices, where typing is always a pain.

Currently, Word Hammer is optimized for desktop/laptop browsers and uses fill-in-the-blank questions. It does not render or perform well on a phone.

Just click on the button to launch the app! Have fun with it! You may encounter the occasional LLM error. The models sometimes hallucinate random characters into their responses, and I do not have super-robust error handling. If that happens, simply relaunch the quiz. This is for general evaluation and experimentation, and for learning while you’re at it. And for now, the LLM costs are on me. I do reserve the right to limit LLM costs, though, should this spread in popularity beyond a reasonably affordable rate. While that would be a good thing and would probably prompt me to productize this in the future, I don’t need massive hits to my credit card in the meantime.

And if you do find problems, there is a link in the footer of the main page of the app as well as a Send Feedback button right under the quiz question. If you do run into a problem, please send a screenshot if possible.

Just click the button to launch the app and enjoy it! You may occasionally encounter an LLM error. The models sometimes hallucinate random characters and words into their responses, and I do not have a robust system for error handling. Simply relaunch the quiz. This is more for general evaluation and experimentation, and for learning a language at the same time. And for now, I pay for the LLM costs. I reserve the right to limit the costs in case this app becomes too popular, beyond an affordable rate. Although that would be a good thing to happen and would probably inspire me to commercialize it in the future, I don’t need massive charges on my credit card in the meantime.

If you find problems, there is a link at the bottom of the first page of the app, as well as a button to send feedback below the question. Please send a screenshot, if possible, if you run into any problems.

The content of these articles is somewhat advanced, so I need help from DeepL for the Spanish text, but I try to use it as little as possible. I am still using it for around 15-20%, because I also need more vocabulary and colloquialisms. But with each post, I am using DeepL less and less. For this post, I used it less than ever.
