First Take- Using AI to Learn Python (Week 1)

29-September-2024 By Jeffrey Cooper

As part of my Project52 effort, I want to try out new development tools and learn other skills at the same time, and where possible, intersect with areas of interest (Spanish, running, music, etc.). For this first project, I decided to learn some Python, but to do it by having AI teach me through code generation while I tackled a specific task I had defined.

You might have noticed that the last few blog entries have been written in Spanish as well. I have been taking Spanish for 2+ years and am at an early B2 level. I’m good enough with the basics that, in public groups, I try to help beginners get over early stumbling blocks. For English speakers, word gender is truly a foreign concept. While there are “general rules” a person can use to guess the gender of a new word, there are exceptions, and early learners struggle with them.

So, for a first project, I decided to find a list of the most used nouns with their gender listed, and do a simple statistical analysis on it. What better way to do this than Python? Well, I don’t know Python. So this was a good first project to do!

Nota para los lectores españoles: Estoy escribiendo mis articulos en dos idiomas mientras lo aprendo. Para mas información, lea este artículo.

Como parte de mi esfuerzo Project52, quiero probar herramientas nuevas de desarrollo y aprendo otras habilidades al mismo tiempo, y donde posible se cruzan con áreas de interés (español, corriendo, música, etc.). Para el primer proyecto, decidí aprender un poco de Python, pero usando IA para enseñarme entre generación del código y resolver una tarea he definido.

Habrá notado que los ultimas entradas de blog ha estado escrito en inglés y español. He estado aprendiendo español mas de 2 años y estoy en nivel B2 temprano. Soy suficientemente bueno con los básicos que in grupos públicos en los redes sociales, trato de ayudarles a los principiantes superar obstáculos. Para angloparlantes, el idioma no tiene géneros de palabras. En español, mientras hay “normas generales” para adivinar el género de una nueva palabra, hay excepciones y primeros aprendices luchan con ellas.

Así, para mi primer proyecto, decidí encontrar una lista de los sustantivos más utilizados con géneros, y hace un análisis simple. ¿Qué mejor para hacerlo que Python? Pues, no conozco Python. Así, ¡esto fue un buen, primer proyecto para hacer!

Top 2000 Spanish Nouns

First, I found a blog article with the top 2000 nouns. To do this analysis, I needed to create two Python modules:

1.  Web Scraper
2.  Basic Statistics

And, having only done a very basic intro example with Jupyter Lab, I needed help learning Python, so I used AI to accelerate the process.

Primero, encontré un articulo en un blog con los 2000 sustantivos más importantes. Ahora, para hacer este análisis, necesito crear dos módulos de Python.

1. El Rascador Web
2. Estadisticas Básicas

Desde solo he hecho un ejemplo muy básico con Juptyer Lab, necesité ayuda para aprender Python- así usé IA para acelerar el proceso. 

The Web Scraper

El Rascador Web

I went to Claude, using their Sonnet model for this module. This was the first time I had used AI in any meaningful way to write Python. I got off to a rough start, but it got better (before it got worse). To not bloat this article, I will give you my first prompts without the replies. I ran the code after each prompt and refined my requirements based on the results.

Utilcé Claude, con el modelo Sonnet para este módulo. Esto fue la primera vez para utilicé IA en una manera significativa para escribir Python. Empecé con alguna dificultad, pero mejoré (antes de empeoró). Para no hinchar este artículo, voy a dar mis primeros prompts sin respuestas. He ejecutado repetidamente el código después de cada prompt y refiné los requisitos. 

				
					I'd like to write a simple python app to scrape a webpage and save it to a CSV file.
				
			
				
					
What the heck is BeautifulSoup?
				
			
				
					The code you gave me above to extract text from a webpage, unfortunately, the text is simply line by line and not delineated by <p> tags.
				
			
				
					Now due to the messy nature of the source material, many lines start with a "xx. " where xx is a sequential number, followed by a period and a single space.  I would like to trim all that off the front of it.
				
			

It took a few initial iterations to get a feel for this, and as you can see, I did not know what BeautifulSoup was…😀
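To make that last prompt concrete, here is a minimal sketch of stripping the leading "xx. " numbering and writing the cleaned lines out as CSV. It is not the code Claude generated; the function names `strip_number_prefix` and `lines_to_csv` and the sample lines are assumptions made up for illustration, and it uses only the standard library rather than a real scrape.

```python
import csv
import io
import re

# A sequential number, a period, and whitespace at the start of a line, e.g. "12. "
NUMBER_PREFIX = re.compile(r"^\s*\d+\.\s+")


def strip_number_prefix(line):
    """Remove a leading 'xx. ' counter if present; otherwise return the line unchanged."""
    return NUMBER_PREFIX.sub("", line)


def lines_to_csv(lines):
    """Clean each line and return CSV text with one cleaned line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for line in lines:
        cleaned = strip_number_prefix(line).strip()
        if cleaned:  # skip blank lines
            writer.writerow([cleaned])
    return buffer.getvalue()


# Hypothetical sample lines in the style described in the prompt above
sample_lines = ["1. house - casa - feminine", "", "book - libro - masculine"]
csv_text = lines_to_csv(sample_lines)
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch easy to run; swapping `io.StringIO()` for an open file handle would write a real CSV file instead.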

At this point, I had a raw list of words, but needed to strip out the parts that I needed- just the Spanish word and the gender.

Inspecting the source page, the author was inconsistent in their format. For words that apply to people, you frequently change the last letter “o” to an “a” for the feminine. The author used 3 different formats for this. Some words are both genders (professions), and most words are simply one gender.  I decided to code the first few formats.

I struggled through this a bit- I gradually discovered the formatting variations as I iterated.  The author was more consistent at the top, but presumably was pretty bored after 2000 words- deeper down the inconsistencies revealed themselves.

Tomó varias iteraciones para obtener un sentimiento para esto, y puede ver, no supe lo que fue BeautifulSoup… 😀

En ese punto, tuve una lista bruta de palabras, pero necesitó quitar las partes que necesito- justo la palabra español y el género.

Inspeccionando la página de origen, el autor he usado un formatos inconsistentes. Para palabras sobre gente, cambia la ultima letra “o” a “a” para femenino. El autor usó 3 formatos diferentes para eso. Algunas palabras son ambos géneros (profesiones), y las más palabras son simplemente un género. Decidí codificar los primeros formatos.

Luché con esto un poco- gradualmente descubrí el formatos variaciones mientras iteraba. El autor fue más consistente en la arriba, pero supongo fue un poco aburrido después de 2000 palabras y más abajo los inconsistencias aparecido.

				
					I want to parse each line that resulted from the previous code.  Each line consists is in this format:

word - palabra - gender

A word in English, the Spanish equivalent, and its gender.  I want to discard the English word, and separate the Spanish word and the Gender into two columns of data.  Also, occasionally, you will see this:

word - palabra - gender - palabra - gender

This is for words that have both genders, such as son and daughter.  In this situation, discard the English word, and then break the next two words into two columns like above, but make the remainder a new line in the file, which of course will also be two columns.
				
			
				
					
If you see the word "masculine/feminine" replace it with "both"
				
			
				
					There is a small error- when there are 5 words, words 4 and 5 need to be on an entirely new line below- currently you are appending the 4th word on the same line and dropping the 5th line.
				
			
				
					
I made a mistake- the format is this:

word - palabra - gender / palabra - gender
				
			
				
					OK- great!  Turns out there is one more exception to the formatting rule.  So for clarity sakes, the formats in the input data are actually this:

word - palabra - gender
or
word - palabra - gender / palabra - gender
or
word - palabra/palabra - masculine/feminine

For this last case, which I just saw in the results and had not noticed before, it is a bit tricker.  First, strip the English word off like before.  Then the pair palabra/palabra needs to be split into two entries, one after the other.  The first one will always be labeled "masculine" and the second one will always be labeled "feminine." At least the source data is consistent with the order of the genders. :)
				
			
				
					OK- great! Turns out there is another curveball in the formatting rule. So for clarity sakes, the formats in the input data are actually this:

word - palabra - gender

or

word - palabra - gender / palabra - gender

or

word - palabra/palabra - masculine/feminine

or

word - palabro/a - masculine/feminine

For this last one- it's trickier.  You need to ignore the English word like before, and need to extract the first word, which will always be masculine.  Now, discard the "/a" completely, duplicate word1 to word2, remove the "o" at the end of the word, and replace it with the letter "a"- that makes the second word the feminine version.  These are usually used for professions or descriptions of people.

And one last request- there are a few lines with parentheses in them. Rather than process those (even more complicated)- since it amounts to less than 1% of the total word list, let's just discard ALL lines that have either an open or close parentheses in them.
				
			

Everything was fine up to this point and I thought I was ready to move on to the statistics module, but I spotted a problem with one of the formatting options not splitting a masculine/feminine line into two lines.  From the following dialogs you can see that this relationship was no longer working out 🙂.  (I explain my thoughts about this at the end of this article.)

Todo fue buenos hasta este punto y creé que fue listo para escribir el próximo modulo, pero vi un problema con unas de las opciones de los formatos no fue dividió en dos líneas. De los diálogos siguientes, puede ver que esta relación ya no funcionaba 🙂. (Explico mis pensamientos en el fin de este artículo.)

				
					Oops- I found a bug.  Here's an example:  In the original website was a word that is both masculine and feminine, like this:

sergeant - sargento - masculine/feminine

But in the output, it is 

sargento,masculine

It should be this:

sargento,both

This needs to be fixed.
				
			
				
					
Now it is not handling Case 2 correctly.  You are giving me:

palabra,gender / palabra

instead of

palabra,gender
palabra,gender

You removed the gender from the second word and didn't split the line where the / was.
				
			
				
					
Case 2 needs to find the 2nd Spanish word after the first gender and make that a new line with another gender that will be following it.  This is for the case of:

word - palabra - gender / palabra - gender

I need it to split this in two (of course remembering to just remove the English word up front). The result should be this:

palabra,gender
palabra,gender

You were doing this correctly about 3-4 iterations ago.
				
			
				
					
It is still doing the exact same thing. I did some debugging- when you discard the English word and assign spanish_part and gender_part, you are discarding the last gender in the special case of 

friend - amigo - masculine / amiga - feminine

In this case, the "- feminine" is getting chopped off before you do any more processing.  You need to rethink this. Here are the valid formats again:

Possible format 1:
word - palabra - gender
Results will be:  palabra,gender

or Possible format 2:
word - palabra - gender / palabra - gender
Results  will be:
palabra,gender
palabra,gender

or Possible format 3:
word - palabra/palabra - masculine/feminine
Results  will be:
palabra,masculine
palabra,feminine

or Possible format 4:
word - palabro/a - masculine/feminine
Results will be:
palabro,masculine
palabra,feminine

Keep in mind the process you did before to trim off the letters on that last one and replace the o with an a.

Discard any lines with a parenthesis character of the word "plural"
				
			

At this point, I gave up. I manually coded it. But, the previous code it had generated provided all the help I needed to write the code myself.  Once I got that working, I moved on to the next module.

En este momento, me rendí. Yo codifiqué a mano. Pero, el anterior código Claude he generado proporcionado toda la ayuda que necesito para escribir el código yo mismo. Una vez que conseguí que funcionara, moví al próximo modulo.
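For reference, the four formats listed in the last prompt can be handled with a fairly small function. The following is a sketch under assumptions, not the code from my notebook: the name `parse_line` is invented here, the sample lines are illustrative, and the fallback for a "/a" word that does not end in "o" is my own guess rather than something the source list specifies.

```python
def parse_line(line):
    """Return a list of (spanish_word, gender) tuples for one source line."""
    # Discard lines with parentheses or the word "plural", as in the last prompt
    if "(" in line or ")" in line or "plural" in line:
        return []

    # Drop the English word: everything before the first " - "
    parts = line.strip().split(" - ", 1)
    if len(parts) != 2:
        return []
    rest = parts[1]

    # Format 2: "palabra - gender / palabra - gender" (spaces around the slash)
    if " / " in rest:
        rows = []
        for chunk in rest.split(" / "):
            word, _, gender = chunk.partition(" - ")
            rows.append((word.strip(), gender.strip()))
        return rows

    word, _, gender = rest.partition(" - ")
    word, gender = word.strip(), gender.strip()

    # Formats 3 and 4: the Spanish part itself contains a slash
    if "/" in word:
        masculine, feminine = word.split("/", 1)
        if feminine == "a":
            # Format 4 ("palabro/a"): swap the final "o" for "a".
            # Appending "a" when there is no final "o" is an assumption.
            feminine = masculine[:-1] + "a" if masculine.endswith("o") else masculine + "a"
        return [(masculine, "masculine"), (feminine, "feminine")]

    # Format 1, including a single word that is used for both genders
    if gender == "masculine/feminine":
        gender = "both"
    return [(word, gender)]


# Illustrative lines, one per format
for sample in [
    "book - libro - masculine",
    "friend - amigo - masculine / amiga - feminine",
    "child - niño/niña - masculine/feminine",
    "lawyer - abogado/a - masculine/feminine",
    "sergeant - sargento - masculine/feminine",
]:
    print(parse_line(sample))
```

Checking for " / " (with spaces) before looking at the word itself is what keeps format 2 from being confused with the "masculine/feminine" label, which is the case that kept breaking in the dialog above.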

The Statistics Module

El Modelo Estadístico

For this module, I used OpenAI’s o1-preview. This was the easiest experience I have had to date.

Para este modulo, utilicé el “o1-preview” de OpenAI. Este fue el más fácil he visto la fecha para ejecutar.

				
					I have a file full of Spanish nouns.  They are listed as being either masculine or feminine, or both in some cases.  I would like to create a Python program to open this file, scan it, count the occurrences of each, and display a simple pie chart.
				
			
Figures: ChatGPT Assumptions, Python Code, gender

With a single prompt, it guessed CSV (one of the most common file formats), and created the pie chart.  It generated 52 lines of code and worked immediately. So then I described the next stats I wanted to see:

Con un solo prompt, adivinó el formato fue CSV (unos de los más común). y creó el gráfico circular. Generó 52 líneas de código y funcionó inmediatamente.
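The counting step behind a chart like this is short. Here is a minimal sketch, assuming a two-column CSV of `word,gender` with no header row (the shape shown in the prompts earlier); the function names and sample data are hypothetical, and this is not the 52 lines o1-preview produced. The plotting helper needs matplotlib, which is imported only when that function is called.

```python
import csv
import io
from collections import Counter


def count_genders(csv_file):
    """Count how often each gender label appears in the second CSV column."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if len(row) >= 2:  # ignore blank or malformed rows
            counts[row[1].strip().lower()] += 1
    return counts


def show_pie(counts):
    """Draw a simple pie chart of the counts (requires matplotlib)."""
    import matplotlib.pyplot as plt  # third-party, imported lazily

    labels = list(counts)
    plt.pie([counts[label] for label in labels], labels=labels, autopct="%1.1f%%")
    plt.title("Spanish nouns by gender")
    plt.show()


# Hypothetical sample data standing in for the real CSV file
sample = io.StringIO("amigo,masculine\namiga,feminine\nsargento,both\nlibro,masculine\n")
gender_counts = count_genders(sample)
print(gender_counts)
# show_pie(gender_counts)  # uncomment if matplotlib is installed
```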

				
					A general rule is words that end in o are masculine and words that end in a are feminine.  There are exceptions to each.  I would like to show a stacked bar graph for both o and a words and show the ratio of exceptions.
				
			
Figures: Masculine Exceptions, Feminine Exceptions

Again, it got it right in one try. For the last part, since there are very few exceptions, I wanted a list of exceptions for each gender, as well as the words that are both genders. This took a couple of iterations, first because the generated code had a bug, and then because I changed my mind.

Otra vez, lo consiguió en un intento. Para la ultima parte, desde hay muy pocas excepciones, quería una lista de excepciones para cada género, así como las palabras que son ambos géneros. Esto tomó un par de iteraciones, porque produjo un error, y también cambié mi mente. 
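The "o is masculine, a is feminine" check itself is simple to express. Below is a small sketch of finding the exceptions and their ratio, under assumptions: the names `find_exceptions` and `exception_ratio` are invented, the sample words are illustrative rather than taken from the real list, and words ending in anything other than a plain "o" or "a" are simply ignored.

```python
# The general rule: words ending in "o" are masculine, words ending in "a" are feminine
EXPECTED = {"o": "masculine", "a": "feminine"}


def find_exceptions(rows):
    """Group (word, gender) pairs into rule-followers, exceptions, and 'both' words."""
    result = {
        "o": {"regular": 0, "exceptions": []},
        "a": {"regular": 0, "exceptions": []},
        "both": [],
    }
    for word, gender in rows:
        if gender == "both":
            result["both"].append(word)
            continue
        ending = word[-1:].lower()
        if ending not in EXPECTED:
            continue  # the rule says nothing about other endings
        if gender == EXPECTED[ending]:
            result[ending]["regular"] += 1
        else:
            result[ending]["exceptions"].append(word)
    return result


def exception_ratio(bucket):
    """Share of words with this ending that break the rule."""
    total = bucket["regular"] + len(bucket["exceptions"])
    return len(bucket["exceptions"]) / total if total else 0.0


# Illustrative sample, not the real data
rows = [
    ("libro", "masculine"),
    ("mano", "feminine"),
    ("casa", "feminine"),
    ("día", "masculine"),
    ("problema", "masculine"),
    ("sargento", "both"),
]
summary = find_exceptions(rows)
print(summary["o"]["exceptions"], summary["a"]["exceptions"], summary["both"])
```

The regular and exception counts per ending are what a stacked bar graph would plot, and the three lists map onto the Masculine, Feminine, and Both columns of the table.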

				
					After the bar graph, I would like to list in a 2-column table the exceptions for each.
				
			
				
					I get this error when I run the code:
TypeError: can only concatenate list (not "tuple") to list
				
			
				
					Thank you.  Let's redo the table part completely.  Actually, I would like a third set of columns. Also I would like to structure the table so that the first row- a header row, is labeled in bold Masculine, Feminine, and Both.  Then in each column show the a words that are masculine, the o words that are feminine, and finally the words that are both.  This makes more sense for a student of Spanish to understand.
				
			
Exceptions Word List

I explained the error to ChatGPT and it immediately fixed it. Then I changed my mind and wanted three columns, instead of the initial two.  At this point, with essentially 3 prompts + 1 minor bug fix prompt, I had 112 lines of working Python giving me everything I had asked for.

Expliqué el error a ChatGPT y inmediatamente lo arregló. Entonces, cambié mi mente y quería tres columnas en lugar de dos columnas inicialmente. En este momento, con justo 3 prompts y 1 corrección de un error, tuve 112 líneas de código de Python funcionando y me dando todo pidé.
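I am not reproducing the generated code here, but that TypeError generally shows up when a list and a tuple are joined with `+`. The snippet below is a minimal reproduction with made-up variable names and sample words, not the actual line that failed; converting the tuple with `list()` is one common fix.

```python
# Hypothetical data: one collection is a list, the other a tuple
masculine_exceptions = ["día", "mapa"]
feminine_exceptions = ("mano", "foto")

try:
    combined = masculine_exceptions + feminine_exceptions  # list + tuple fails
except TypeError as err:
    message = str(err)
    print("TypeError:", message)

# One fix: convert the tuple to a list before concatenating
combined = masculine_exceptions + list(feminine_exceptions)
print(combined)
```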

Takeaways

Conclusiones

The point of the project was to learn a little Python and accelerate that using AI. That definitely paid off here. I did make a point of understanding the code I was pasting in, and of adding comments to the Jupyter Lab notebook, which you can see on GitHub.

Overall, it took a couple of days, but I was largely not focused on this for the first module and was multitasking quite a bit. On the second day, I focused on it fully and did the manual coding and the second module. Actual time spent was between half a day and a full day.

If I were to do this again, it would be much faster. For one, I now know all the formats the author of the list used, which took time to sort through and figure out the logic for. And I have a better feel for telling Claude how to code. While it appears that o1-preview blew away Claude, the gap is probably smaller than it looks, albeit I still think o1-preview is better.

One reason I think Claude “got lost” is that it took too many iterations to sort out the formatting, and I was supplying the formats piecemeal. Something to understand about how these LLMs work: each time you reply, the entire conversation is resubmitted. As it grows longer, every new request uses more tokens than the one before.

El objetivo del proyecto era aprender un poco de Python y acelerarlo utilizando IA. Eso fue una victoria para mí. Hice un esfuerzo para entender el código era copiando de IA, y añade comentarios al cuaderno de Jupyter Lab, cuál verse en GitHub.

En total, tardó un par de días, pero no estaba totalmente concentrado por esta tarea. En el segundo día, estaba concentrado, he escribí el código por mano para el formatos de sustantivos, y terminé el segundo modulo. El tiempo real empleado fue de entre 1/2 a 1 día.
 
Si tuviera que hacer otra vez, sería mas fácil. Ahora conozco todos los formatos usado por el autor de la lista. Eso tardó un poco de tiempo para clasificar. Y comprendo mejor lo de decirle Claude como codificar. Mientras parece que o1-preview era mejor de Claude, es probablemente más cerca, aunque creo que o1-preview is mejor.
 
Una razón creo que Claude “se perdió” es que tomé demasiadas iteraciones clasificar los formatos y fue pidiendo Claude escribir los formatos de uno en uno. Para entender sobre como funcionan los LLMs, cada vez que envié un prompt, la toda de la conversación es reenvió. A medida que se alarga, usa más y más tokens, añadiendo más cada vez.
				
					Response Tokens = Total Tokens + New Request
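To put rough numbers on that, here is a small sketch of how the tokens sent per request grow when the whole conversation is resubmitted each turn. The function name and the per-turn token counts are hypothetical, chosen only to show the shape of the growth; real counts depend on the model and its tokenizer.

```python
def tokens_sent_per_turn(turns):
    """Given (prompt_tokens, reply_tokens) per turn, return the tokens submitted each turn.

    Each request carries the full prior conversation plus the new prompt.
    """
    sent = []
    history = 0
    for prompt_tokens, reply_tokens in turns:
        request = history + prompt_tokens  # everything so far + the new prompt
        sent.append(request)
        history = request + reply_tokens  # the reply joins the history too
    return sent


# Hypothetical conversation: four turns of a 100-token prompt and a 400-token reply
turns = [(100, 400)] * 4
per_turn = tokens_sent_per_turn(turns)
print(per_turn)       # tokens submitted with each request
print(sum(per_turn))  # total tokens submitted across the conversation
```

Even with identical prompts, each request is larger than the last, so a long back-and-forth adds up quickly.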
				
			

Thus, since I dragged out the conversation, I might have exceeded Claude’s context window, which would push the initial requests out of view. I came to this conclusion because, suddenly, the code it was giving me got worse with each new request.

But, despite that slowing me down, the manual coding was a great exercise with the code already there serving as an example.  O1-preview just kicked ass.

To have done this the “traditional” way, I would have had to take a course, find a tutorial, or find lots of different examples and Google my way through this. It would have taken far longer. AI convincingly accelerated this process. And I now know what Beautiful Soup is 😂.

Así, desde inadvertidamente alargué la conversación, podría haber abrumado el búfer de respuesta de Claude, cortando mis peticiones originales. Llegué a esta conclusión porque, de repente, el código en las respuestas iterativamente convirtieron cada vez peor.

Pero no importa, codificar a mano era un buen ejercicio con  el código que ya existía sirvió un buen ejemplo. O1-preview pateó traseros.

Haber hecho esta a la forma “tradicional,” yo necesitara tomado un curso, encontrado un tutorial, o encontrado muchos ejemplos y googlear mucho para aprender bastante de los ejemplos. Hubiera tardado mucho más tiempo. IA aceleró de forma convincente este proceso. Y ahora conozco que es BeautifulSoup 😂.

El contenido de estos artículos son un poco avanzado. Necesito utilizar ayuda de DeepL, per trato utilizar lo menos posible. Todavía lo estoy utilizando alrededor 30% o un poco más, porque necesito un más vocabulario y coloquialismos también. 
