29-September-2024 By Jeffrey Cooper
First Take- Using AI to Learn Python (Week 1)
As part of my Project52 effort, I want to try out new development tools and learn other skills at the same time, ideally where they intersect with my areas of interest (Spanish, running, music, etc.). For this first project, I decided to learn some Python by having AI teach me through code generation while I tackled a specific task I had defined.
You might have noticed that the last few blog entries have been written in Spanish as well. I have been taking Spanish for 2+ years and am at an early B2 level. I’m good enough with the basics that in public groups, I try to help beginners get over early stumbling blocks. For English speakers, word gender is truly a foreign concept. While there are “general rules” a person can use to guess the gender of a new word, there are exceptions and early learners struggle with it.
So, for a first project, I decided to find a list of the most used nouns with their gender listed, and do a simple statistical analysis on it. What better way to do this than Python? Well, I don’t know Python. So this was a good first project to do!
Nota para los lectores españoles: Estoy escribiendo mis articulos en dos idiomas mientras lo aprendo. Para mas información, lea este artículo.
Como parte de mi esfuerzo Project52, quiero probar herramientas nuevas de desarrollo y aprendo otras habilidades al mismo tiempo, y donde posible se cruzan con áreas de interés (español, corriendo, música, etc.). Para el primer proyecto, decidí aprender un poco de Python, pero usando IA para enseñarme entre generación del código y resolver una tarea he definido.
Habrá notado que los ultimas entradas de blog ha estado escrito en inglés y español. He estado aprendiendo español mas de 2 años y estoy en nivel B2 temprano. Soy suficientemente bueno con los básicos que in grupos públicos en los redes sociales, trato de ayudarles a los principiantes superar obstáculos. Para angloparlantes, el idioma no tiene géneros de palabras. En español, mientras hay “normas generales” para adivinar el género de una nueva palabra, hay excepciones y primeros aprendices luchan con ellas.
Así, para mi primer proyecto, decidí encontrar una lista de los sustantivos más utilizados con géneros, y hace un análisis simple. ¿Qué mejor para hacerlo que Python? Pues, no conozco Python. Así, ¡esto fue un buen, primer proyecto para hacer!
First, I found a blog article with the top 2000 nouns. To do this analysis, I needed to create two Python modules:
1. Web Scraper
2. Basic Statistics
And, having only done a very basic intro example with Jupyter Lab, I needed help learning Python, so I used AI to accelerate the process.
Primero, encontré un articulo en un blog con los 2000 sustantivos más importantes. Ahora, para hacer este análisis, necesito crear dos módulos de Python.
1. El Rascador Web
2. Estadisticas Básicas
Desde solo he hecho un ejemplo muy básico con Juptyer Lab, necesité ayuda para aprender Python- así usé IA para acelerar el proceso.
The Web Scraper
El Rascador Web
I went to Claude, using their Sonnet model for this module. This was the first time I had used AI in any meaningful way to write Python. I got off to a rough start, but it got better (before it got worse). To avoid bloating this article, I will give you my first prompts without the replies. I ran the code after each prompt and then refined my requirements.
Utilcé Claude, con el modelo Sonnet para este módulo. Esto fue la primera vez para utilicé IA en una manera significativa para escribir Python. Empecé con alguna dificultad, pero mejoré (antes de empeoró). Para no hinchar este artículo, voy a dar mis primeros prompts sin respuestas. He ejecutado repetidamente el código después de cada prompt y refiné los requisitos.
I'd like to write a simple python app to scrape a webpage and save it to a CSV file.
What the heck is BeautifulSoup?
The code you gave me above to extract text from a webpage, unfortunately, the text is simply line by line and not delineated by tags.
Now due to the messy nature of the source material, many lines start with a "xx. " where xx is a sequential number, followed by a period and a single space. I would like to trim all that off the front of it.
It took a few initial iterations to get a feel for this, and as you can see, I did not know what BeautifulSoup was…😀
At this point, I had a raw list of lines, but I needed to pull out only the parts I wanted: the Spanish word and the gender.
Inspecting the source page, the author was inconsistent in their format. For words that apply to people, you frequently change the last letter “o” to an “a” for the feminine. The author used 3 different formats for this. Some words are both genders (professions), and most words are simply one gender. I decided to code the first few formats.
I struggled through this a bit- I gradually discovered the formatting variations as I iterated. The author was more consistent at the top, but presumably was pretty bored after 2000 words- deeper down the inconsistencies revealed themselves.
Tomó varias iteraciones para obtener un sentimiento para esto, y puede ver, no supe lo que fue BeautifulSoup… 😀
En ese punto, tuve una lista bruta de palabras, pero necesitó quitar las partes que necesito- justo la palabra español y el género.
Inspeccionando la página de origen, el autor he usado un formatos inconsistentes. Para palabras sobre gente, cambia la ultima letra “o” a “a” para femenino. El autor usó 3 formatos diferentes para eso. Algunas palabras son ambos géneros (profesiones), y las más palabras son simplemente un género. Decidí codificar los primeros formatos.
Luché con esto un poco- gradualmente descubrí el formatos variaciones mientras iteraba. El autor fue más consistente en la arriba, pero supongo fue un poco aburrido después de 2000 palabras y más abajo los inconsistencias aparecido.
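The numbered-prefix cleanup requested in the prompts above can be sketched with the standard library alone. This is a minimal illustration, not the code Claude generated: it assumes the page text has already been extracted into a list of lines (for example with BeautifulSoup's get_text()), and the function names and sample lines are hypothetical.

```python
import csv
import io
import re

# Matches a leading counter such as "12. " (digits, a period, one space).
NUMBER_PREFIX = re.compile(r"^\d+\. ")


def strip_number_prefix(line: str) -> str:
    """Remove a leading "xx. " counter if present; otherwise return the line as-is."""
    return NUMBER_PREFIX.sub("", line.strip())


def write_lines_to_csv(lines, handle) -> None:
    """Write one cleaned line per CSV row, skipping blank lines."""
    writer = csv.writer(handle)
    for line in lines:
        cleaned = strip_number_prefix(line)
        if cleaned:
            writer.writerow([cleaned])


# Hypothetical sample of scraped text, not the real page content.
raw_lines = [
    "1. time - tiempo - masculine",
    "2. house - casa - feminine",
    "",
    "friend - amigo - masculine / amiga - feminine",
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_lines_to_csv(raw_lines, buffer)
print(buffer.getvalue())
```

Anchoring the pattern with `^` means only a counter at the very start of a line is removed; digits elsewhere in the line are left alone.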
I want to parse each line that resulted from the previous code. Each line consists is in this format:
word - palabra - gender
A word in English, the Spanish equivalent, and its gender. I want to discard the English word, and separate the Spanish word and the Gender into two columns of data. Also, occasionally, you will see this:
word - palabra - gender - palabra - gender
This is for words that have both genders, such as son and daughter. In this situation, discard the English word, and then break the next two words into two columns like above, but make the remainder a new line in the file, which of course will also be two columns.
If you see the word "masculine/feminine" replace it with "both"
There is a small error- when there are 5 words, words 4 and 5 need to be on an entirely new line below- currently you are appending the 4th word on the same line and dropping the 5th line.
I made a mistake- the format is this:
word - palabra - gender / palabra - gender
OK- great! Turns out there is one more exception to the formatting rule. So for clarity sakes, the formats in the input data are actually this:
word - palabra - gender
or
word - palabra - gender / palabra - gender
or
word - palabra/palabra - masculine/feminine
For this last case, which I just saw in the results and had not noticed before, it is a bit tricker. First, strip the English word off like before. Then the pair palabra/palabra needs to be split into two entries, one after the other. The first one will always be labeled "masculine" and the second one will always be labeled "feminine." At least the source data is consistent with the order of the genders. :)
OK- great! Turns out there is another curveball in the formatting rule. So for clarity sakes, the formats in the input data are actually this:
word - palabra - gender
or
word - palabra - gender / palabra - gender
or
word - palabra/palabra - masculine/feminine
or
word - palabro/a - masculine/feminine
For this last one- it's trickier. You need to ignore the English word like before, and need to extract the first word, which will always be masculine. Now, discard the "/a" completely, duplicate word1 to word2, remove the "o" at the end of the word, and replace it with the letter "a"- that makes the second word the feminine version. These are usually used for professions or descriptions of people.
And one last request- there are a few lines with parentheses in them. Rather than process those (even more complicated)- since it amounts to less than 1% of the total word list, let's just discard ALL lines that have either an open or close parentheses in them.
Everything was fine up to this point and I thought I was ready to move on to the statistics module, but I spotted a problem with one of the formatting options not splitting a masculine/feminine line into two lines. From the following dialogs you can see that this relationship was no longer working out 🙂. (I explain my thoughts about this at the end of this article.)
Todo fue buenos hasta este punto y creé que fue listo para escribir el próximo modulo, pero vi un problema con unas de las opciones de los formatos no fue dividió en dos líneas. De los diálogos siguientes, puede ver que esta relación ya no funcionaba 🙂. (Explico mis pensamientos en el fin de este artículo.)
Oops- I found a bug. Here's an example: In the original website was a word that is both masculine and feminine, like this:
sergeant - sargento - masculine/feminine
But in the output, it is
sargento,masculine
It should be this:
sargento,both
This needs to be fixed.
Now it is not handling Case 2 correctly. You are giving me:
palabra,gender / palabra
instead of
palabra,gender
palabra,gender
You removed the gender from the second word and didn't split the line where the / was.
Case 2 needs to find the 2nd Spanish word after the first gender and make that a new line with another gender that will be following it. This is for the case of:
word - palabra - gender / palabra - gender
I need it to split this in two (of course remembering to just remove the English word up front). The result should be this:
palabra,gender
palabra,gender
You were doing this correctly about 3-4 iterations ago.
It is still doing the exact same thing. I did some debugging- when you discard the English word and assign spanish_part and gender_part, you are discarding the last gender in the special case of
friend - amigo - masculine / amiga - feminine
In this case, the "- feminine" is getting chopped off before you do any more processing. You need to rethink this. Here are the valid formats again:
Possible format 1:
word - palabra - gender
Results will be: palabra,gender
or Possible format 2:
word - palabra - gender / palabra - gender
Results will be:
palabra,gender
palabra,gender
or Possible format 3:
word - palabra/palabra - masculine/feminine
Results will be:
palabra,masculine
palabra,feminine
or Possible format 4:
word - palabro/a - masculine/feminine
Results will be:
palabro,masculine
palabra,feminine
Keep in mind the process you did before to trim off the letters on that last one and replace the o with an a.
Discard any lines with a parenthesis character of the word "plural"
At this point, I gave up. I manually coded it. But, the previous code it had generated provided all the help I needed to write the code myself. Once I got that working, I moved on to the next module.
En este momento, me rendí. Yo codifiqué a mano. Pero, el anterior código Claude he generado proporcionado toda la ayuda que necesito para escribir el código yo mismo. Una vez que conseguí que funcionara, moví al próximo modulo.
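For reference, here is one way the four formats listed in the prompts could be handled by hand. It is a sketch under assumptions, not the code from the notebook: it assumes format 2 always uses a spaced " / " while formats 3 and 4 use an unspaced "/", as in the examples above, and the function name parse_line is made up for illustration. The English word is split off once, so a trailing "- feminine" cannot be lost.

```python
def parse_line(line: str) -> list[tuple[str, str]]:
    """Return (spanish_word, gender) pairs for one "word - palabra - gender" line."""
    # Skip the hard cases: lines with parentheses or the word "plural".
    if "(" in line or ")" in line or "plural" in line:
        return []

    # Split off the English word only; everything after it stays intact.
    parts = line.split(" - ", 1)
    if len(parts) != 2:
        return []
    rest = parts[1].strip()

    # Format 2: "palabra - gender / palabra - gender" (spaced slash, assumed).
    if " / " in rest:
        pairs = []
        for chunk in rest.split(" / "):
            word, sep, gender = chunk.partition(" - ")
            if not sep:
                return []
            pairs.append((word.strip(), gender.strip()))
        return pairs

    # Formats 1, 3 and 4: the gender is whatever follows the last " - ".
    word, sep, gender = rest.rpartition(" - ")
    if not sep:
        return []
    word, gender = word.strip(), gender.strip()

    if gender == "masculine/feminine":
        if word.endswith("o/a"):
            # Format 4: "palabro/a" -> palabro (masculine), palabra (feminine).
            masculine = word[:-2]
            return [(masculine, "masculine"), (masculine[:-1] + "a", "feminine")]
        if "/" in word:
            # Format 3: "palabra/palabra", masculine first, feminine second.
            first, second = word.split("/", 1)
            return [(first.strip(), "masculine"), (second.strip(), "feminine")]
        # A single word that is used for both genders.
        return [(word, "both")]

    return [(word, gender)]


for sample in [
    "house - casa - feminine",
    "friend - amigo - masculine / amiga - feminine",
    "son - hijo/hija - masculine/feminine",
    "teacher - maestro/a - masculine/feminine",
    "sergeant - sargento - masculine/feminine",
]:
    print(parse_line(sample))
```

Checking the most specific pattern first (spaced slash, then "o/a", then any slash) keeps each branch short and makes the "both" case the natural fallback.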
The Statistics Module
El Módulo Estadístico
For this module, I used OpenAI’s o1-preview. This was the smoothest experience I have had to date.
Para este modulo, utilicé el “o1-preview” de OpenAI. Este fue el más fácil he visto la fecha para ejecutar.
I have a file full of Spanish nouns. They are listed as being either masculine or feminine, or both in some cases. I would like to create a Python program to open this file, scan it, count the occurrences of each, and display a simple pie chart.
With a single prompt, it guessed CSV (one of the most common file formats), and created the pie chart. It generated 52 lines of code and worked immediately. So then I described the next stats I wanted to see:
Con un solo prompt, adivinó el formato fue CSV (unos de los más común). y creó el gráfico circular. Generó 52 líneas de código y funcionó inmediatamente.
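The counting step behind that first chart can be sketched as follows. This is not the 52 lines that o1-preview generated; it is a minimal stand-in that assumes a two-column CSV of word,gender rows, and the sample data and function name are hypothetical.

```python
import csv
import io
from collections import Counter


def count_genders(handle) -> Counter:
    """Count how often each gender label appears in a word,gender CSV."""
    counts = Counter()
    for row in csv.reader(handle):
        if len(row) >= 2:  # ignore blank or malformed rows
            counts[row[1].strip().lower()] += 1
    return counts


# Hypothetical sample standing in for the scraped file.
sample = io.StringIO("casa,feminine\ntiempo,masculine\nsargento,both\nmesa,feminine\n")
counts = count_genders(sample)
total = sum(counts.values())

for gender, n in counts.most_common():
    print(f"{gender}: {n} ({n / total:.0%})")
```

If matplotlib is installed, the same counts can be charted with `plt.pie(list(counts.values()), labels=list(counts.keys()))`.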
A general rule is words that end in o are masculine and words that end in a are feminine. There are exceptions to each. I would like to show a stacked bar graph for both o and a words and show the ratio of exceptions.
Again, it got it right in one try. For that last part, since there are very few exceptions, I wanted a list of exceptions for each gender, as well as the words that are both genders. This took a couple of iterations, first because it produced a bug, and also I changed my mind.
Otra vez, lo consiguió en un intento. Para la ultima parte, desde hay muy pocas excepciones, quería una lista de excepciones para cada género, así como las palabras que son ambos géneros. Esto tomó un par de iteraciones, porque produjo un error, y también cambié mi mente.
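The rule check described in that prompt (words ending in o are expected to be masculine, words ending in a feminine) might look something like this. It is a sketch with made-up names and a tiny sample list, not the generated code; words labeled "both" are left out of the ratio here, which is an assumption.

```python
def ending_exceptions(pairs):
    """Split o/a words into rule-followers and exceptions, keyed by final letter."""
    expected = {"o": "masculine", "a": "feminine"}
    summary = {ending: {"regular": [], "exceptions": []} for ending in expected}
    for word, gender in pairs:
        ending = word[-1:].lower()
        if ending not in expected or gender == "both":
            continue  # other endings and dual-gender words are not part of the rule
        bucket = "regular" if gender == expected[ending] else "exceptions"
        summary[ending][bucket].append(word)
    return summary


# Small hand-picked sample, not the scraped list.
words = [
    ("libro", "masculine"),
    ("mano", "feminine"),
    ("casa", "feminine"),
    ("día", "masculine"),
    ("problema", "masculine"),
    ("sargento", "both"),
    ("ciudad", "feminine"),
]

summary = ending_exceptions(words)
for ending, groups in summary.items():
    total = len(groups["regular"]) + len(groups["exceptions"])
    ratio = len(groups["exceptions"]) / total if total else 0.0
    print(f"-{ending}: {total} words, {ratio:.0%} exceptions: {groups['exceptions']}")
```

For a stacked bar, the "regular" counts can be drawn first and the "exceptions" counts placed on top using the `bottom=` argument of matplotlib's `plt.bar`.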
After the bar graph, I would like to list in a 2-column table the exceptions for each.
I get this error when I run the code:
TypeError: can only concatenate list (not "tuple") to list
Thank you. Let's redo the table part completely. Actually, I would like a third set of columns. Also I would like to structure the table so that the first row- a header row, is labeled in bold Masculine, Feminine, and Both. Then in each column show the a words that are masculine, the o words that are feminine, and finally the words that are both. This makes more sense for a student of Spanish to understand.
I explained the error to ChatGPT and it immediately fixed it. Then I changed my mind and wanted three columns instead of the initial two. At this point, with essentially 3 prompts plus 1 minor bug-fix prompt, I had 112 lines of working Python giving me everything I had asked for.
Expliqué el error a ChatGPT y inmediatamente lo arregló. Entonces, cambié mi mente y quería tres columnas en lugar de dos columnas inicialmente. En este momento, con justo 3 prompts y 1 corrección de un error, tuve 112 líneas de código de Python funcionando y me dando todo pidé.
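A three-column table with columns of unequal length can be assembled with itertools.zip_longest. The sketch below is an illustration with hypothetical names and sample words, not the generated code. As a side note, one common way to trigger the TypeError quoted above is adding a tuple to a list with +; whether that was the actual cause here is a guess, but using extend() sidesteps it.

```python
from itertools import zip_longest


def build_table(masculine_a, feminine_o, both):
    """Return table rows: a header, then one row per line of the three columns."""
    rows = [("Masculine", "Feminine", "Both")]
    # zip_longest pads the shorter columns so every row has three cells.
    rows.extend(zip_longest(masculine_a, feminine_o, both, fillvalue=""))
    return rows


def format_table(rows):
    """Left-align each column to the width of its longest cell."""
    widths = [max(len(row[i]) for row in rows) for i in range(3)]
    lines = []
    for row in rows:
        cells = (cell.ljust(width) for cell, width in zip(row, widths))
        lines.append("  ".join(cells).rstrip())
    return "\n".join(lines)


# Hypothetical exception lists, not the real results.
table = build_table(["día", "problema"], ["mano"], ["sargento"])
print(format_table(table))
```

Plain text cannot show a bold header; in a notebook the same rows could be handed to whatever table renderer is in use.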
Takeaways
Conclusiones
The point of the project was to learn a little Python and accelerate that using AI. That definitely paid off here. I did make a point to understand the code I was pasting in, and add comments to the Jupyter Lab notebook, which you can see in GitHub.
Overall, it took a couple of days, but I was largely not focused on this for the first module and was multitasking quite a bit. On the second day, I focused on it fully and did the manual coding and the second module. Actual time spent was between half a day and a full day.
If I were to do this again, it would be much faster. For one, I now know all the formats the author of the list used- that took time to sort through and figure out the logic. And I have a better feel for telling Claude how to code. While it appears that o1-preview blew away Claude, the gap is probably somewhat smaller than it looks, although I still think o1-preview is better.
One reason I think Claude “got lost” is that it took too many iterations to sort out the formatting and I was doing the formats piecemeal. Something to understand about how these LLMs work- each time you reply, the entire conversation is resubmitted. As it grows longer, you use more and more tokens, additively each time.
El objetivo del proyecto era aprender un poco de Python y acelerarlo utilizando IA. Eso fue una victoria para mí. Hice un esfuerzo para entender el código era copiando de IA, y añade comentarios al cuaderno de Jupyter Lab, cuál verse en GitHub.
Tokens Submitted = Total Tokens So Far + New Request
Thus, since I dragged out the conversation, I might have overflowed the amount of conversation Claude could take into account, cutting off the initial requests. I came to this conclusion because, suddenly, the code it was giving me was getting worse with each new request.
But, despite that slowing me down, the manual coding was a great exercise, with the code already there serving as an example. o1-preview just kicked ass.
To have done this the “traditional” way, I would have had to take a course, find a tutorial, or find lots of different examples and Google my way through this. It would have taken far longer. AI convincingly accelerated this process. And I now know what BeautifulSoup is 😂.
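To make the growth concrete, here is a small back-of-the-envelope sketch. The token counts are invented for illustration, and real services count tokens and handle long conversations in their own ways, so treat this only as a picture of the "everything is resubmitted each turn" idea.

```python
def tokens_submitted_per_turn(turns):
    """Given (prompt_tokens, reply_tokens) per turn, return the size of each request.

    Each request carries the whole conversation so far plus the new prompt.
    """
    history = 0
    submitted = []
    for prompt_tokens, reply_tokens in turns:
        request_size = history + prompt_tokens
        submitted.append(request_size)
        # The reply joins the history that the next request must carry.
        history = request_size + reply_tokens
    return submitted


# Hypothetical conversation: five short prompts, each answered with ~400 tokens of code.
turns = [(50, 400)] * 5
per_turn = tokens_submitted_per_turn(turns)

print(per_turn)       # [50, 500, 950, 1400, 1850]
print(sum(per_turn))  # 4750
```

Even with identical prompts, each request is larger than the last, so a long piecemeal conversation costs far more than stating all the requirements in one prompt.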
Así, desde inadvertidamente alargué la conversación, podría haber abrumado el búfer de respuesta de Claude, cortando mis peticiones originales. Llegué a esta conclusión porque, de repente, el código en las respuestas iterativamente convirtieron cada vez peor.
Pero no importa, codificar a mano era un buen ejercicio con el código que ya existía sirvió un buen ejemplo. O1-preview pateó traseros.
Haber hecho esta a la forma “tradicional,” yo necesitara tomado un curso, encontrado un tutorial, o encontrado muchos ejemplos y googlear mucho para aprender bastante de los ejemplos. Hubiera tardado mucho más tiempo. IA aceleró de forma convincente este proceso. Y ahora conozco que es BeautifulSoup 😂.
El contenido de estos artículos son un poco avanzado. Necesito utilizar ayuda de DeepL, per trato utilizar lo menos posible. Todavía lo estoy utilizando alrededor 30% o un poco más, porque necesito un más vocabulario y coloquialismos también.