How I Trained a Language Model Using Hugging Face on My MacBook Pro

Introduction

In this blog post, I’ll walk you through my journey of training a language model using Hugging Face's transformers library and the datasets library. I'll cover the initial setup, data preparation, training process, and how to save and use the trained model.

Step 1: Setting Up the Environment

First, I set up my development environment using VSCode and a Python virtual environment to keep my dependencies isolated.

1. Create a Virtual Environment:

     bash

     python3 -m venv myenv

2. Activate the Virtual Environment (Mac/Linux):

     bash

     source myenv/bin/activate

3. Install Necessary Libraries:

     bash

     pip install transformers datasets torch accelerate
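
At this point, an optional sanity check (my addition, not one of the original steps) can confirm that PyTorch imported correctly and whether the Apple Silicon MPS backend is available; this assumes a fairly recent PyTorch build (1.12 or later).

     python

     # Optional sanity check: confirm PyTorch imported correctly and whether
     # the Apple Silicon MPS backend is available (requires PyTorch 1.12+).
     import torch

     print("PyTorch version:", torch.__version__)
     print("MPS available:", torch.backends.mps.is_available())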

Step 2: Preparing the Data

Next, I loaded and tokenized the WikiText-2 dataset using the Hugging Face datasets library.

1. Load the Dataset:

     python

     from datasets import load_dataset

     dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')

2. Load the Tokenizer:

     python

     from transformers import AutoTokenizer

     tokenizer = AutoTokenizer.from_pretrained('gpt2')

3. Add a Padding Token if Necessary:

     python

     if tokenizer.pad_token is None:
         tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})

4. Tokenize the Dataset:

     python

     def tokenize_function(examples):
         inputs = tokenizer(examples['text'], truncation=True, padding='max_length', max_length=512)
         inputs["labels"] = inputs["input_ids"].copy()  # Use input_ids as labels for causal language modeling
         return inputs

     tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
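
Before training, it can also help to peek at one tokenized example and confirm the fields look right. This is just an optional check, not one of the original steps:

     python

     # Optional: inspect one tokenized training example to confirm the fields
     # (input_ids, attention_mask, labels) exist and have the expected length.
     sample = tokenized_datasets['train'][0]
     print(sample.keys())
     print(len(sample['input_ids']))  # should be 512, matching max_length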

Step 3: Training the Model

With the data prepared, I moved on to training the model using the Trainer class from the transformers library.

1. Load the Model:

     python

     from transformers import AutoModelForCausalLM

     model = AutoModelForCausalLM.from_pretrained('gpt2')
     model.resize_token_embeddings(len(tokenizer))

2. Set Up Training Arguments:

     python

     from transformers import TrainingArguments

     training_args = TrainingArguments(
         output_dir='./results',
         eval_strategy="epoch",
         per_device_train_batch_size=2,
         per_device_eval_batch_size=2,
         num_train_epochs=1,
         save_steps=10_000,
         save_total_limit=2,
     )

3. Initialize the Trainer:

     python

     from transformers import Trainer

     trainer = Trainer(
         model=model,
         args=training_args,
         train_dataset=tokenized_datasets['train'],
         eval_dataset=tokenized_datasets['validation'],
     )

4. Train the Model:

     python

     trainer.train()
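
Once training finishes, you can also run an evaluation pass on the validation split and convert the reported loss to perplexity. This is a common sanity check for causal language models rather than something from my original run:

     python

     import math

     # Optional: evaluate on the validation split and report perplexity.
     eval_results = trainer.evaluate()
     print("Eval loss:", eval_results["eval_loss"])
     print("Perplexity:", math.exp(eval_results["eval_loss"]))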

Step 4: Saving the Model

After training for several hours, I saved the model and tokenizer to disk.

1. Save the Model and Tokenizer:

     python

     model.save_pretrained('/Users/bhafner/Learning/AI/LLM_with_hugging_face/model')
     tokenizer.save_pretrained('/Users/bhafner/Learning/AI/LLM_with_hugging_face/tokenizer')

Step 5: Loading and Using the Model

Finally, I loaded the saved model and tokenizer to generate text and further evaluate the model.

1. Load the Model and Tokenizer:

     python

     from transformers import AutoModelForCausalLM, AutoTokenizer

     model = AutoModelForCausalLM.from_pretrained('/Users/bhafner/Learning/AI/LLM_with_hugging_face/model')

     tokenizer = AutoTokenizer.from_pretrained('/Users/bhafner/Learning/AI/LLM_with_hugging_face/tokenizer')

2. Generate Text:

     python

     input_text = "Once upon a time"

     inputs = tokenizer(input_text, return_tensors="pt")
     outputs = model.generate(inputs["input_ids"], max_length=50)
     generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
     print("Generated Text:")
     print(generated_text)
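
Greedy decoding like the above tends to repeat itself. As an optional variation (an assumption on my part, not part of the original run), you can pass the attention mask and enable sampling for more varied output:

     python

     # Optional variation: pass the attention mask and sample instead of using
     # greedy decoding, which usually produces less repetitive text.
     inputs = tokenizer("Once upon a time", return_tensors="pt")
     outputs = model.generate(
         inputs["input_ids"],
         attention_mask=inputs["attention_mask"],
         max_length=50,
         do_sample=True,
         top_k=50,
         top_p=0.95,
         pad_token_id=tokenizer.eos_token_id,
     )
     print(tokenizer.decode(outputs[0], skip_special_tokens=True))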

Conclusion

By following these steps, I successfully trained a language model with Hugging Face's transformers library on my MacBook Pro with an M1 Pro chip. The process covered setting up the environment, preparing the data, training the model, saving it, and finally generating text. It was a great learning experience and demonstrated the power and flexibility of Hugging Face's tools for NLP tasks.
