By Jose R. Zapata
Last updated: 15/Oct/2025
Hugging Face, Transformers, and Pipelines
What is Hugging Face?
Hugging Face is a company and community that has become a cornerstone of natural language processing (NLP). It provides open-source tools and libraries for working with state-of-the-art language models based on the Transformer architecture.
The Hugging Face platform also hosts a repository of pretrained models and datasets that the community can share and reuse, which makes it easy to experiment and build AI solutions.
The transformers library
Hugging Face's transformers library is an open-source framework that lets developers apply models based on the Transformer architecture to a wide range of NLP tasks.
Some key aspects of the transformers library:
- Pretrained models: it includes a wide variety of pretrained language models, such as BERT, GPT, RoBERTa, and DistilBERT, among others.
- Transfer learning: it supports fine-tuning pretrained models to adapt them to specific tasks or custom datasets.
- Multi-task support: the models can be used for many NLP tasks, such as text classification, text generation, translation, summarization, and entity extraction.
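As a minimal sketch of this workflow (assuming transformers and PyTorch are installed and the checkpoint can be downloaded; distilbert-base-uncased and num_labels=2 are illustrative choices, not part of this post's review pipeline):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Download a pretrained checkpoint and its matching tokenizer.
# distilbert-base-uncased is only a small, well-known example.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a sentence and run a forward pass; the (still untrained)
# classification head returns one logit per label.
inputs = tokenizer("A great starting point for fine-tuning.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```

Fine-tuning then consists of training this classification head (and optionally the encoder) on a labeled dataset, for example with the Trainer API.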
The datasets library
Hugging Face's datasets library provides easy access to a large number of datasets for training and evaluating NLP models.
Some key aspects:
- Access to popular datasets: it offers a large collection of datasets for many NLP tasks, including text classification, NER, sentiment analysis, and more.
- Simple data loading: datasets can be loaded and processed in just a few steps, cutting down data-preparation time.
- Integration with transformers: the datasets plug directly into transformers models, making it easy to experiment with different architectures and tasks.
FilmAffinity Review Analysis
1) Loading and Exploring the Dataset 🤓
Objective: explore the dataset.
import pandas as pd
Load the dataset
data_reviews = pd.read_parquet("filmaffinity_reviews_cleaned.parquet")
data_reviews.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 film_id 50000 non-null object
1 author_review_desc 50000 non-null string
2 author_rating 50000 non-null int64
3 film_title 50000 non-null object
4 film_original_title 50000 non-null object
5 film_country 50000 non-null object
6 film_average_rating 50000 non-null float64
7 film_number_of_ratings 50000 non-null int64
8 clean_review 50000 non-null object
9 clean_review_stemming 50000 non-null object
10 clean_review_lemmatization 50000 non-null object
dtypes: float64(1), int64(2), object(7), string(1)
memory usage: 4.2+ MB
Display a few rows of the dataset
data_reviews.sample(5)
| | film_id | author_review_desc | author_rating | film_title | film_original_title | film_country | film_average_rating | film_number_of_ratings | clean_review | clean_review_stemming | clean_review_lemmatization |
|---|---|---|---|---|---|---|---|---|---|---|---|
36290 | film762377 | \nUna oportunidad perdida. \n\nUna buena idea,... | 3 | La visita | The Visitaka | Estados Unidos | 6.0 | 29602 | una oportunidad perdida una buena idea en apar... | oportun perd buen ide aparient vari motiv acab... | oportunidad perdido buen idea apariencia vario... |
27257 | film402565 | \nUna de las mejores películas de Disney en mu... | 9 | Lilo & Stitch | Lilo & Stitchaka | Estados Unidos | 6.4 | 26324 | una de las mejores peliculas de disney en much... | mejor pelicul disney muchisim tiemp personaj n... | mejor pelicula disney muchisimo tiempo persona... |
39287 | film748557 | \nPenoso y megañoño alegato en pro del amor et... | 2 | Crazy Stupid Love | Crazy Stupid Love | Estados Unidos | 6.4 | 41061 | penoso y megañoño alegato en pro del amor eter... | penos megañoñ alegat pro amor etern matrimoni ... | penoso megañoño alegatir pro amor eterno matri... |
41891 | film582716 | \nBasada en una novela del escritor japonés Ak... | 8 | La tumba de las luciérnagas | Hotaru no Hakaaka | Japón | 8.0 | 42646 | basada en una novela del escritor japones akiy... | bas novel escritor japon akiyuki nosak relat t... | basado novela escritor japón akiyuki nosaka re... |
17923 | film458406 | \nLa mayoría de las veces, los primeros compas... | 8 | Cisne negro | Black Swan | Estados Unidos | 7.6 | 118664 | la mayoria de las veces los primeros compases ... | mayori vec primer comp pelicul prefij ide punt... | mayoria vez primero compás pelicula prefijar i... |
2) Advanced Tokenization with Transformers and Hugging Face 🤖
Objective: use modern tokenizers.
- Modern tokenization techniques:
BPE (Byte-Pair Encoding) and WordPiece: these techniques split words into subword units to handle open vocabularies and rare words.
They capture morphology and produce tokens that can be recombined into full words.
- Hugging Face tokenizers:
Hugging Face offers a wide range of tokenizers optimized for Transformer models. These tokenizers can:
- Handle special characters and emojis.
- Segment text into subwords, improving vocabulary coverage.
- Work with pretrained models such as BERT, DistilBERT, etc.
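Before using a real tokenizer, a toy greedy longest-match segmenter illustrates the idea behind WordPiece (the mini-vocabulary here is hypothetical; real vocabularies contain tens of thousands of entries):

```python
def wordpiece_segment(word, vocab):
    """Greedy longest-match-first segmentation in the spirit of WordPiece.

    Subwords after the first are prefixed with "##", mirroring BERT's
    convention. Returns None if the word cannot be segmented.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return None  # a real tokenizer would emit [UNK] here
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical mini-vocabulary
vocab = {"detective", "obsesion", "##ado", "##a"}
print(wordpiece_segment("obsesionado", vocab))  # ['obsesion', '##ado']
print(wordpiece_segment("detective", vocab))    # ['detective']
```

A rare word like "obsesionado" splits into pieces that stay in the vocabulary, while a frequent word survives intact; this is exactly the behavior the real tokenizers below exhibit at scale.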
from transformers import AutoTokenizer
Load the pretrained Spanish tokenizer
tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
# Example: tokenize a review
sample_text = data_reviews["clean_review"].iloc[0]
tokens = tokenizer.tokenize(sample_text)
print("Sample text:", sample_text)
print("Hugging Face tokens:", tokens)
Sample text: un detective obsesionado con la caza de un asesino en serie desde los inicios de su carrera
Hugging Face tokens: ['un', 'detective', 'obsesionado', 'con', 'la', 'caza', 'de', 'un', 'asesino', 'en', 'serie', 'desde', 'los', 'inicios', 'de', 'su', 'carrera']
3) Using Pretrained Transformer Models for Classification
Objective: implement a classification pipeline.
Pretrained models:
Models such as BERT and DistilBERT have been trained on large volumes of data and can capture deep contextual relationships in language.
Advantages:
Using pretrained models lets you leverage general linguistic knowledge and specialize it for review-analysis tasks.
What is a pipeline?
In Hugging Face, a pipeline is a simplified way to use pretrained NLP models for specific tasks. A pipeline encapsulates the entire workflow, from preprocessing the input text to running model inference, making it easy to apply models to common tasks without complex configuration.
- Main functionality: pipelines handle tasks such as sentiment analysis, text classification, NER, and text generation, among others, in just a few lines of code.
- Ease of use: pipelines are ideal for users who want to apply NLP models quickly, without worrying about the underlying architecture or data preprocessing.
- Customization: although pipelines provide a simple interface, they can also be customized to tune specific parameters or perform advanced tasks.
from transformers import pipeline
Create a sentiment-analysis pipeline using a model pretrained for Spanish
In this example we use Hugging Face pipelines with BERT-style models, which are widely known for their ability to capture the context and meaning of words in a sentence.
sentiment_pipeline = pipeline(
    "sentiment-analysis", model="nlptown/bert-base-multilingual-uncased-sentiment"
)
Device set to use cuda:0
Apply the pipeline to a sample review
new_review = "Esta película es excelente y superó mis expectativas."
result = sentiment_pipeline(new_review)
print("Sample text:", new_review)
print("Sentiment analysis result:", result)
Sample text: Esta película es excelente y superó mis expectativas.
Sentiment analysis result: [{'label': '5 stars', 'score': 0.7681065797805786}]
new_review = "No me gustó para nada, fue aburrida, los actores eran pésimos y fue muy larga."
result = sentiment_pipeline(new_review)
print("Sample text:", new_review)
print("Sentiment analysis result:", result)
Sample text: No me gustó para nada, fue aburrida, los actores eran pésimos y fue muy larga.
Sentiment analysis result: [{'label': '1 star', 'score': 0.7560130953788757}]
new_review = "me gustó? ni siquiera la terminé de ver."
result = sentiment_pipeline(new_review)
print("Sample text:", new_review)
print("Sentiment analysis result:", result)
Sample text: me gustó? ni siquiera la terminé de ver.
Sentiment analysis result: [{'label': '2 stars', 'score': 0.35730651021003723}]
new_review = "me gusto!"
result = sentiment_pipeline(new_review)
print("Sample text:", new_review)
print("Sentiment analysis result:", result)
Sample text: me gusto!
Sentiment analysis result: [{'label': '5 stars', 'score': 0.4392417073249817}]
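The nlptown model returns labels from '1 star' to '5 stars'. A small helper can turn such results into a numeric rating and a coarse polarity (a sketch; the 1-2 / 3 / 4-5 thresholds are our own choice):

```python
def stars_to_sentiment(result):
    """Convert a result like [{'label': '5 stars', 'score': 0.77}]
    from the nlptown model into (rating, polarity)."""
    label = result[0]["label"]      # e.g. "5 stars" or "1 star"
    rating = int(label.split()[0])  # leading digit, 1..5
    if rating <= 2:
        polarity = "negative"
    elif rating == 3:
        polarity = "neutral"
    else:
        polarity = "positive"
    return rating, polarity

# Using the pipeline outputs shown above:
print(stars_to_sentiment([{"label": "5 stars", "score": 0.768}]))  # (5, 'positive')
print(stars_to_sentiment([{"label": "1 star", "score": 0.756}]))   # (1, 'negative')
```

This makes it easy to compare the model's predictions against the dataset's numeric author_rating column.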
4) Text Classification
The AG News dataset contains news articles categorized into four classes: World, Sports, Business, and Sci/Tech. We can assign text to one of these categories using a zero-shot classification approach.
from datasets import load_dataset
# load the AG News dataset
ag_news_dataset = load_dataset("ag_news", split="test[:100]")  # test subset
# initialize a zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", device_map="auto")
# candidate labels
candidate_labels = ["World", "Sports", "Business", "Sci/Tech"]
# classify the first 10 articles
classification_results = [
    classifier(article, candidate_labels=candidate_labels)
    for article in ag_news_dataset["text"][:10]
]
# display the results
for article, result in zip(ag_news_dataset["text"][:10], classification_results, strict=False):
    print(f"Article: {article}")
    print(f"Classification: {result['labels'][0]} (Score: {result['scores'][0]:.4f})\n")
No model was supplied, defaulted to facebook/bart-large-mnli and revision d7645e1 (https://huggingface.co/facebook/bart-large-mnli).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some parameters are on the meta device because they were offloaded to the cpu.
Device set to use cuda:0
Article: Fears for T N pension after talks Unions representing workers at Turner Newall say they are 'disappointed' after talks with stricken parent firm Federal Mogul.
Classification: Business (Score: 0.5836)
Article: The Race is On: Second Private Team Sets Launch Date for Human Spaceflight (SPACE.com) SPACE.com - TORONTO, Canada -- A second\team of rocketeers competing for the #36;10 million Ansari X Prize, a contest for\privately funded suborbital space flight, has officially announced the first\launch date for its manned rocket.
Classification: Sci/Tech (Score: 0.6826)
Article: Ky. Company Wins Grant to Study Peptides (AP) AP - A company founded by a chemistry researcher at the University of Louisville won a grant to develop a method of producing better peptides, which are short chains of amino acids, the building blocks of proteins.
Classification: Sci/Tech (Score: 0.4325)
Article: Prediction Unit Helps Forecast Wildfires (AP) AP - It's barely dawn when Mike Fitzpatrick starts his shift with a blur of colorful maps, figures and endless charts, but already he knows what the day will bring. Lightning will strike in places he expects. Winds will pick up, moist places will dry and flames will roar.
Classification: Sci/Tech (Score: 0.4284)
Article: Calif. Aims to Limit Farm-Related Smog (AP) AP - Southern California's smog-fighting agency went after emissions of the bovine variety Friday, adopting the nation's first rules to reduce air pollution from dairy cow manure.
Classification: World (Score: 0.3155)
Article: Open Letter Against British Copyright Indoctrination in Schools The British Department for Education and Skills (DfES) recently launched a "Music Manifesto" campaign, with the ostensible intention of educating the next generation of British musicians. Unfortunately, they also teamed up with the music industry (EMI, and various artists) to make this popular. EMI has apparently negotiated their end well, so that children in our schools will now be indoctrinated about the illegality of downloading music.The ignorance and audacity of this got to me a little, so I wrote an open letter to the DfES about it. Unfortunately, it's pedantic, as I suppose you have to be when writing to goverment representatives. But I hope you find it useful, and perhaps feel inspired to do something similar, if or when the same thing has happened in your area.
Classification: Business (Score: 0.3696)
Article: Loosing the War on Terrorism \\"Sven Jaschan, self-confessed author of the Netsky and Sasser viruses, is\responsible for 70 percent of virus infections in 2004, according to a six-month\virus roundup published Wednesday by antivirus company Sophos."\\"The 18-year-old Jaschan was taken into custody in Germany in May by police who\said he had admitted programming both the Netsky and Sasser worms, something\experts at Microsoft confirmed. (A Microsoft antivirus reward program led to the\teenager's arrest.) During the five months preceding Jaschan's capture, there\were at least 25 variants of Netsky and one of the port-scanning network worm\Sasser."\\"Graham Cluley, senior technology consultant at Sophos, said it was staggeri ...\\
Classification: Sci/Tech (Score: 0.3658)
Article: FOAFKey: FOAF, PGP, Key Distribution, and Bloom Filters \\FOAF/LOAF and bloom filters have a lot of interesting properties for social\network and whitelist distribution.\\I think we can go one level higher though and include GPG/OpenPGP key\fingerpring distribution in the FOAF file for simple web-of-trust based key\distribution.\\What if we used FOAF and included the PGP key fingerprint(s) for identities?\This could mean a lot. You include the PGP key fingerprints within the FOAF\file of your direct friends and then include a bloom filter of the PGP key\fingerprints of your entire whitelist (the source FOAF file would of course need\to be encrypted ).\\Your whitelist would be populated from the social network as your client\discovered new identit ...\\
Classification: Sci/Tech (Score: 0.3378)
Article: E-mail scam targets police chief Wiltshire Police warns about "phishing" after its fraud squad chief was targeted.
Classification: Sci/Tech (Score: 0.4260)
Article: Card fraud unit nets 36,000 cards In its first two years, the UK's dedicated card fraud unit, has recovered 36,000 stolen cards and 171 arrests - and estimates it saved 65m.
Classification: Sci/Tech (Score: 0.4004)
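AG News also ships gold labels (a ClassLabel whose order is World, Sports, Business, Sci/Tech), so the zero-shot predictions can be scored against them. A sketch of the comparison; in the notebook above you would call it as zero_shot_accuracy(classification_results, ag_news_dataset["label"][:10]):

```python
label_names = ["World", "Sports", "Business", "Sci/Tech"]  # AG News class order

def zero_shot_accuracy(results, gold_label_ids):
    """Fraction of examples whose top-ranked zero-shot label
    matches the dataset's gold label."""
    hits = sum(
        res["labels"][0] == label_names[gold]
        for res, gold in zip(results, gold_label_ids)
    )
    return hits / len(gold_label_ids)

# Tiny hypothetical example with hand-made results:
fake_results = [{"labels": ["Sports"]}, {"labels": ["World"]}, {"labels": ["Business"]}]
print(zero_shot_accuracy(fake_results, [1, 1, 2]))  # 2 of 3 correct
```

Scoring against the gold labels is a quick sanity check before trusting zero-shot predictions on unlabeled text.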
5) Named Entity Recognition (NER) with Transformers 🧐
NER is the task of identifying and classifying entities (such as company names, dates, quantities, and locations) in text.
In review analysis, extracting entities makes it possible to:
- Monitor mentions of brands and products.
- Identify relevant dates or figures to track trends or incidents.
- Automate the extraction of key information for downstream analysis.
Create a NER pipeline using a model pretrained for Spanish
ner_pipeline = pipeline(
    "ner",
    model="mrm8488/bert-spanish-cased-finetuned-ner",
    tokenizer="mrm8488/bert-spanish-cased-finetuned-ner",
    device_map="auto",
)
Some weights of the model checkpoint at mrm8488/bert-spanish-cased-finetuned-ner were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Falling back to torch.float32 because loading with the original dtype failed on the target device.
Device set to use cpu
Common entity types:
- ORG: organizations, such as companies, institutions, or groups.
- LOC: places or geographic locations, such as cities, countries, or regions.
- MISC: miscellaneous entities that do not fit the previous categories, such as events, works of art, or abstract concepts.
review = "La pelicula de Warner entertainment sobre Londres que salio el 12 de marzo y superó mis expectativas."
ner_result = ner_pipeline(review)
for result in ner_result:
    print(result)
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
{'entity': 'B-ORG', 'score': np.float32(0.9995447), 'index': 5, 'word': 'W', 'start': 15, 'end': 16}
{'entity': 'B-ORG', 'score': np.float32(0.9635744), 'index': 6, 'word': '##arner', 'start': 16, 'end': 21}
{'entity': 'B-LOC', 'score': np.float32(0.99988973), 'index': 10, 'word': 'Londres', 'start': 42, 'end': 49}
Reconstruct the entity
def reconstruct_entity(ner_tokens):
    """
    Reconstructs an entity from a list of NER tokens.
    If a token starts with "##", it is joined to the previous token without a space.
    """
    entity = ""
    for token in ner_tokens:
        word = token["word"]
        # If the token starts with "##", append it without a space (dropping the "##")
        if word.startswith("##"):
            entity += word[2:]
        # If there is already content, add a space before the new token
        elif entity:
            entity += " " + word
        else:
            entity += word
    return entity
# Reconstruct the entity
entity_name = reconstruct_entity(ner_result)
print("Reconstructed entity:", entity_name)
Reconstructed entity: Warner Londres
review = "La experiencia con Apple TV fue innovadora, aunque el precio es bastante elevado."
ner_result = ner_pipeline(review)
for result in ner_result:
    print(result)
entity_name = reconstruct_entity(ner_result)
print("Reconstructed entity:", entity_name)
{'entity': 'B-ORG', 'score': np.float32(0.99820244), 'index': 4, 'word': 'Apple', 'start': 19, 'end': 24}
{'entity': 'I-ORG', 'score': np.float32(0.9981364), 'index': 5, 'word': 'TV', 'start': 25, 'end': 27}
Reconstructed entity: Apple TV
review = "Me cuesta entender cómo crearon Avatar."
ner_result = ner_pipeline(review)
for result in ner_result:
    print(result)
entity_name = reconstruct_entity(ner_result)
print("Reconstructed entity:", entity_name)
{'entity': 'B-MISC', 'score': np.float32(0.9957443), 'index': 6, 'word': 'Ava', 'start': 32, 'end': 35}
{'entity': 'I-MISC', 'score': np.float32(0.5583445), 'index': 7, 'word': '##tar', 'start': 35, 'end': 38}
Reconstructed entity: Avatar
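Note that reconstruct_entity concatenates every token into a single string, which is why "Warner" (ORG) and "Londres" (LOC) came out merged as "Warner Londres". A BIO-aware variant keeps each entity separate (a sketch; in practice the transformers NER pipeline can do this grouping natively via its aggregation_strategy="simple" option):

```python
def group_entities(ner_tokens):
    """Group token-level NER predictions into separate entities.

    A new entity starts at every "B-" tag; "##" subwords are merged
    without a space; "I-" tokens continue the current entity.
    Returns (text, entity_type) pairs.
    """
    entities = []
    for token in ner_tokens:
        word, tag = token["word"], token["entity"]
        if word.startswith("##") and entities:
            entities[-1][0] += word[2:]          # glue the subword on
        elif tag.startswith("B-") or not entities:
            entities.append([word, tag[2:]])     # start a new entity
        else:
            entities[-1][0] += " " + word        # "I-" continuation
    return [tuple(e) for e in entities]

# Token-level output from the Warner/Londres example above:
tokens = [
    {"entity": "B-ORG", "word": "W"},
    {"entity": "B-ORG", "word": "##arner"},
    {"entity": "B-LOC", "word": "Londres"},
]
print(group_entities(tokens))  # [('Warner', 'ORG'), ('Londres', 'LOC')]
```

With this grouping, each brand or location mention can be counted separately when monitoring reviews.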
6) Article Summarization
The cnn_dailymail dataset contains news articles and their summaries, which makes it well suited for testing text-summarization models.
# Load the cnn_dailymail dataset
dataset_cnn = load_dataset(
    "cnn_dailymail", "3.0.0", split="test[:5]"
)  # use a 5-sample subset for testing
# Initialize the summarization pipeline with a model such as DistilBART
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", device_map="auto")
Falling back to torch.float32 because loading with the original dtype failed on the target device.
Device set to use cpu
# Summarize the first 5 articles
resumenes = []
for articulo in dataset_cnn:
    texto_entrada = articulo["article"][
        :1000
    ]  # use the first 1000 characters to avoid sequence-length issues
    resumen = summarizer(texto_entrada, max_length=150, min_length=30, do_sample=False)[0][
        "summary_text"
    ]
    resumenes.append(resumen)
    # Show the summary
    print(f"Original Article:\n{texto_entrada}\n")
    print(f"Generated Summary:\n{resumen}\n")
Original Article:
(CNN)The Palestinian Authority officially became the 123rd member of the International Criminal Court on Wednesday, a step that gives the court jurisdiction over alleged crimes in Palestinian territories. The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based. The Palestinians signed the ICC's founding Rome Statute in January, when they also accepted its jurisdiction over alleged crimes committed "in the occupied Palestinian territory, including East Jerusalem, since June 13, 2014." Later that month, the ICC opened a preliminary examination into the situation in Palestinian territories, paving the way for possible war crimes investigations against Israelis. As members of the court, Palestinians may be subject to counter-charges as well. Israel and the United States, neither of which is an ICC member, opposed the Palestinians' efforts to join the body. But Palestinian Foreign Minister Riad al-Malki, speaking at Wednesday's ceremony, sa
Generated Summary:
The Palestinian Authority becomes the 123rd member of the International Criminal Court . The formal accession was marked with a ceremony at The Hague, in the Netherlands, where the court is based .
Original Article:
(CNN)Never mind cats having nine lives. A stray pooch in Washington State has used up at least three of her own after being hit by a car, apparently whacked on the head with a hammer in a misguided mercy killing and then buried in a field -- only to survive. That's according to Washington State University, where the dog -- a friendly white-and-black bully breed mix now named Theia -- has been receiving care at the Veterinary Teaching Hospital. Four days after her apparent death, the dog managed to stagger to a nearby farm, dirt-covered and emaciated, where she was found by a worker who took her to a vet for help. She was taken in by Moses Lake, Washington, resident Sara Mellado. "Considering everything that she's been through, she's incredibly gentle and loving," Mellado said, according to WSU News. "She's a true miracle dog and she deserves a good life." Theia is only one year old but the dog's brush with death did not leave her unscathed. She suffered a dislocated jaw, leg injuries a
Generated Summary:
The dog was hit by a car and buried in a field, but managed to stagger to a nearby farm . She was found by a worker who took her to a vet for help . She suffered a dislocated jaw and leg injuries .
Original Article:
(CNN)If you've been following the news lately, there are certain things you doubtless know about Mohammad Javad Zarif. He is, of course, the Iranian foreign minister. He has been U.S. Secretary of State John Kerry's opposite number in securing a breakthrough in nuclear discussions that could lead to an end to sanctions against Iran -- if the details can be worked out in the coming weeks. And he received a hero's welcome as he arrived in Iran on a sunny Friday morning. "Long live Zarif," crowds chanted as his car rolled slowly down the packed street. You may well have read that he is "polished" and, unusually for one burdened with such weighty issues, "jovial." An Internet search for "Mohammad Javad Zarif" and "jovial" yields thousands of results. He certainly has gone a long way to bring Iran in from the cold and allow it to rejoin the international community. But there are some facts about Zarif that are less well-known. Here are six: . In September 2013, Zarif tweeted "Happy Rosh Has
Generated Summary:
Mohammad Javad Zarif received a hero's welcome as he arrived in Iran on a sunny Friday morning . "Long live Zarif," crowds chanted as his car rolled slowly down the packed street .
Original Article:
(CNN)Five Americans who were monitored for three weeks at an Omaha, Nebraska, hospital after being exposed to Ebola in West Africa have been released, a Nebraska Medicine spokesman said in an email Wednesday. One of the five had a heart-related issue on Saturday and has been discharged but hasn't left the area, Taylor Wilson wrote. The others have already gone home. They were exposed to Ebola in Sierra Leone in March, but none developed the deadly virus. They are clinicians for Partners in Health, a Boston-based aid group. They all had contact with a colleague who was diagnosed with the disease and is being treated at the National Institutes of Health in Bethesda, Maryland. As of Monday, that health care worker is in fair condition. The Centers for Disease Control and Prevention in Atlanta has said the last of 17 patients who were being monitored are expected to be released by Thursday. More than 10,000 people have died in a West African epidemic of Ebola that dates to December 2013, a
Generated Summary:
The five were exposed to Ebola in Sierra Leone in March, but none developed the deadly virus . They are clinicians for Partners in Health, a Boston-based aid group . One of the five had a heart-related issue on Saturday and has been discharged .
Original Article:
(CNN)A Duke student has admitted to hanging a noose made of rope from a tree near a student union, university officials said Thursday. The prestigious private school didn't identify the student, citing federal privacy laws. In a news release, it said the student was no longer on campus and will face student conduct review. The student was identified during an investigation by campus police and the office of student affairs and admitted to placing the noose on the tree early Wednesday, the university said. Officials are still trying to determine if other people were involved. Criminal investigations into the incident are ongoing as well. Students and faculty members marched Wednesday afternoon chanting "We are not afraid. We stand together," after pictures of the noose were passed around on social media. At a forum held on the steps of Duke Chapel, close to where the noose was discovered at 2 a.m., hundreds of people gathered. "You came here for the reason that you want to say with me,
Generated Summary:
A Duke student has admitted to hanging a noose from a tree near a student union, university officials say . The prestigious private school didn't identify the student, citing federal privacy laws . Students and faculty members marched Wednesday afternoon chanting "We are not afraid"
7) Machine Translation 🌐
We build a translation pipeline with a model pretrained for English-to-Spanish translation; Helsinki-NLP publishes equivalent Opus-MT models for many other language pairs.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es", device_map="auto")
translator("Next week we will visit the museum with my family.")
[{'translation_text': 'La próxima semana visitaremos el museo con mi familia.'}]
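The same family of Opus-MT models covers the reverse direction. A sketch using Helsinki-NLP/opus-mt-es-en (the Spanish-to-English model, which also appears in the cache listing further below), assuming the checkpoint can be downloaded:

```python
from transformers import pipeline

# Spanish -> English with the reverse Opus-MT model
translator_es_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

result = translator_es_en("La próxima semana visitaremos el museo con mi familia.")
print(result[0]["translation_text"])
```

Chaining the two directions is a simple way to eyeball translation quality on your own review data.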
Managing Disk Space
Hugging Face models and datasets can take up a significant amount of disk space, especially if you download several large models or extensive datasets. Here are some strategies for managing disk usage:
# list the models and datasets downloaded locally
!hf cache scan
REPO ID REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
------------------------------------------------ --------- ------------ -------- ----------------- ----------------- ---------------- --------------------------------------------------------------------------------------------
ag_news dataset 19.8M 3 9 hours ago 9 hours ago main /home/joser/.cache/huggingface/hub/datasets--ag_news
cnn_dailymail dataset 837.1M 6 9 hours ago 9 hours ago main /home/joser/.cache/huggingface/hub/datasets--cnn_dailymail
fka/awesome-chatgpt-prompts dataset 104.5K 2 11 hours ago 11 hours ago main /home/joser/.cache/huggingface/hub/datasets--fka--awesome-chatgpt-prompts
Helsinki-NLP/opus-mt-en-es model 627.4M 8 a few seconds ago a few seconds ago main, refs/pr/4 /home/joser/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-en-es
Helsinki-NLP/opus-mt-es-en model 627.4M 8 54 seconds ago 54 seconds ago main, refs/pr/6 /home/joser/.cache/huggingface/hub/models--Helsinki-NLP--opus-mt-es-en
dccuchile/bert-base-spanish-wwm-cased model 879.9M 7 1 day ago 1 day ago main, refs/pr/3 /home/joser/.cache/huggingface/hub/models--dccuchile--bert-base-spanish-wwm-cased
facebook/bart-large-mnli model 1.6G 6 9 hours ago 9 hours ago d7645e1 /home/joser/.cache/huggingface/hub/models--facebook--bart-large-mnli
flax-community/gpt-2-spanish model 511.8M 3 10 hours ago 10 hours ago main /home/joser/.cache/huggingface/hub/models--flax-community--gpt-2-spanish
meta-llama/Llama-3.2-1B-Instruct model 2.5G 6 9 hours ago 9 hours ago main /home/joser/.cache/huggingface/hub/models--meta-llama--Llama-3.2-1B-Instruct
mrm8488/bert-spanish-cased-finetuned-ner model 879.2M 6 1 day ago 1 day ago main, refs/pr/1 /home/joser/.cache/huggingface/hub/models--mrm8488--bert-spanish-cased-finetuned-ner
nlptown/bert-base-multilingual-uncased-sentiment model 670.3M 5 1 day ago 1 day ago main /home/joser/.cache/huggingface/hub/models--nlptown--bert-base-multilingual-uncased-sentiment
sshleifer/distilbart-cnn-12-6 model 2.4G 6 9 hours ago 9 hours ago main, refs/pr/29 /home/joser/.cache/huggingface/hub/models--sshleifer--distilbart-cnn-12-6
Done in 0.0s. Scanned 12 repo(s) for a total of 11.6G.
# check disk usage on Linux (Codespaces, Google Colab, etc.)
!df -h
S.ficheros Tamaño Usados Disp Uso% Montado en
tmpfs 1,6G 2,2M 1,6G 1% /run
efivarfs 184K 181K 0 100% /sys/firmware/efi/efivars
/dev/nvme0n1p5 103G 38G 61G 39% /
tmpfs 7,8G 26M 7,8G 1% /dev/shm
tmpfs 5,0M 0 5,0M 0% /run/lock
/dev/nvme0n1p6 192G 131G 51G 72% /home
/dev/nvme0n1p8 1022M 242M 781M 24% /boot/efi
tmpfs 1,6G 113M 1,5G 8% /run/user/1000
/dev/nvme0n1p9 539G 311G 228G 58% /media/joser/Datos
If you are using uv as your dependency manager, you can clear caches of unused packages with the command:
uv cache clean
#!uv cache clean
This frees up 6.2 GB in a Linux environment (Google Colab, Codespaces, etc.). If you still don't have enough disk space, you can delete unused models and datasets from the terminal with the hf cache delete command.
Libraries Used
from watermark import watermark
print(watermark(python=True, iversions=True, globals_=globals()))
Python implementation: CPython
Python version : 3.12.11
IPython version : 9.5.0
watermark : 2.5.0
datasets : 4.2.0
pandas : 2.3.2
transformers: 4.56.2