Sunday, 13 August 2023

How to Prepare Data for a Neural Network: A Step-by-Step Guide

Introduction

In this guide, I'll walk you through the steps I took to prepare airline sentiment data for a neural network. The aim is to create a model that predicts whether new comments are positive or negative using word embeddings and a transformer neural network architecture.

 

Step 0: Data Collection

I began by obtaining the 'Airline Twitter Sentiment' dataset from data.world (https://data.world/datasets/sentiment). It contains customer comments and their associated sentiment labels (positive, neutral, or negative).

  

Step 1: Data Cleaning and Text Extraction

First, I extracted customer comments from the 'text' field of the dataset and cleaned them by removing punctuation, numbers, and other irrelevant elements. The cleaned comments were then written to a text file called "saPreprocessSentences.txt." This process was implemented using the following Python code:

 

import pandas as pd

from saPreprocessClean import clean_string

# Load the dataset
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')

# Clean and preprocess the comments
cleaned_comments = ""
for value in df['text']:
    cleaned_text = clean_string(value) if isinstance(value, str) else ""
    # One cleaned comment per line, so later steps can read the file line by line
    # and keep each row aligned with its sentiment label
    cleaned_comments += cleaned_text + "\n"

# Write cleaned comments to file
with open("saPreprocessSentences.txt", "w") as file:
    file.write(cleaned_comments)
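
The clean_string helper imported above lives in saPreprocessClean.py and isn't shown in the post. As an assumption on my part, a minimal sketch of such a cleaner, which lowercases the text and replaces punctuation and symbols with spaces, might look like this:

# saPreprocessClean.py (illustrative sketch of the helper; the real implementation may differ)
import re

def clean_string(text):
    """Lowercase the text and replace punctuation and symbols with spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9'\s]", " ", text)  # keep letters, digits, apostrophes
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace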

 

Step 2: Word Extraction and Cleaning

Next, I extracted individual words from the cleaned comments, further cleaned them, and saved them to a file called "saPreprocessWords.txt." The code to achieve this is as follows:

 

# saPreprocessWords.py
import pandas as pd

from saPreprocessClean import clean_string

# Load the dataset
df = pd.read_excel('/Users/ARS/ARStensorflow/sentimentAnalysis/Airline-SentimentARS1.xlsx')

# Extract and preprocess words from comments
with open("saPreprocessWords.txt", "w") as file:
    for value in df['text']:
        if isinstance(value, str):
            words = value.split()
            for word in words:
                clean_word = clean_string(word)
                file.write(f"word={word} cleaned={clean_word}\n")

 

Step 3: Removing Duplicate Entries

To ensure data integrity, I created a batch program called "removeDUP" to remove any duplicate entries from the "saPreprocessWords.txt" file. The cleaned output was saved to "saRemoveDUPOutput.txt."
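
The deduplication itself was done with a small batch program. Purely as an illustration, a Python sketch with the same behavior (assuming the word=... cleaned=... line format from Step 2 and deduplicating on the cleaned word) could look like this:

# removeDUP sketch (illustrative only, not the original batch program)
seen = set()

with open("saPreprocessWords.txt", "r") as infile, \
        open("saRemoveDUPOutput.txt", "w") as outfile:
    for line in infile:
        parts = line.strip().split()
        if not parts:
            continue
        key = parts[-1]          # the "cleaned=..." token
        if key not in seen:      # keep only the first occurrence of each cleaned word
            seen.add(key)
            outfile.write(line)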

 

Step 4: Creating Word Embeddings

I converted each unique word into a word2vec embedding (a single float value in this setup) and built a dictionary model that maps words to their corresponding vectors. The model was saved as "word2vec_dict_model.model," and the vectors were stored in "saW2Vdict_vectors.txt."

 

# saCreateW2VDictModel.py
from gensim.models import Word2Vec

# Load the cleaned, de-duplicated word list
# (assumes the cleaned word is the second whitespace-separated token on each line)
with open("saRemoveDUPOutput2.txt", "r", encoding="utf-16-le") as file:
    words = [line.strip().split()[1] for line in file if len(line.split()) > 1]

# Load the existing word2vec model and add each new word to its vocabulary
model = Word2Vec.load("word2vec_dict_model.model")
for new_word in words:
    if any(char.isdigit() for char in new_word):
        print(f"includes number --> {new_word}")
    else:
        # gensim expects a list of token lists, so wrap the single word twice
        model.build_vocab([[new_word]], update=True)
        model.train([[new_word]], total_examples=1, epochs=1)

model.save("word2vec_dict_model.model")
model.wv.save_word2vec_format("saW2Vdict_vectors.txt", binary=False)
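
As a quick sanity check (and assuming the dictionary model was created with vector_size=1, which matches the single-float vectors in the sample output at the end of this post), you can look up a word's embedding directly from the saved model:

from gensim.models import Word2Vec

# Reload the dictionary model and inspect one word's vector
model = Word2Vec.load("word2vec_dict_model.model")
if "flight" in model.wv:                      # "flight" occurs in the airline comments
    print("flight ->", model.wv["flight"])    # e.g. a one-element vector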

 

Step 5: Data Transformation and Labeling

I then transformed the sentences in "saPreprocessSentences.txt" into word2vec vectors using the dictionary model, labeled each sentence with its sentiment (1 for positive, 0 for neutral or negative), and wrote each example to the "saW2VXtrainYtrainData.txt" file.

 

# Transform sentences to word2vec vectors and label them
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

def get_w2v_sentence(sentence):
    # Look up the vector for every word that is in the model's vocabulary
    word_vectors = [model.wv[word] for word in sentence.split() if word in model.wv]
    return word_vectors

# Load the word2vec model
model = Word2Vec.load("word2vec_model_updated.model")

# Load the sentiment labels: positive -> 1, neutral and negative -> 0
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')
sentiments = []
for sentiment_value in df['airline_sentiment']:
    label = 1 if sentiment_value == "positive" else 0
    print(f"sentiment = {sentiment_value} -->{label}")
    sentiments.append(label)

# Write one "i=... w2v=... sentiment=..." record per sentence
with open("saPreprocessSentences.txt", "r") as file:
    with open("saW2VXtrainYtrainData.txt", "w") as file2:
        i = 0
        for line in file:
            sentence = line.strip()
            print(f"i={i} = {sentence}")
            w2v_sentence_vectors = get_w2v_sentence(sentence)
            w2v_sentence_lists = [vector.tolist() for vector in w2v_sentence_vectors]
            print(f"i={i} w2v={w2v_sentence_lists} sentiment={sentiments[i]}", file=file2)
            i += 1

 

# Read the records back and separate the vectors (x) from the labels (y)
import ast
import re

x_values = []
y_values = []

with open("saW2VXtrainYtrainData.txt", "r") as file:
    for line in file:
        match = re.match(r"i=\d+ w2v=(\[.*\]) sentiment=(\d+)", line.strip())
        if match:
            x_values.append(ast.literal_eval(match.group(1)))  # list of per-word vectors
            y_values.append(int(match.group(2)))

# Sentences have different lengths, so the x array holds Python lists (dtype=object)
x_values_array = np.array(x_values, dtype=object)
y_values_array = np.array(y_values)

Step 6: Splitting Data for Training and Validation

I split the data into training and validation sets using a train-validation ratio of 80-20. The resulting arrays were saved as "saXtrainYtrainData.npz."

 


import numpy as np
from sklearn.model_selection import train_test_split

# Split data into training and validation sets (80-20)
val_ratio = 0.2
x_train, x_val, y_train, y_val = train_test_split(
    x_values_array, y_values_array, test_size=val_ratio, random_state=42
)

print("Shape of x_train:", x_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of x_val:", x_val.shape)
print("Shape of y_val:", y_val.shape)

# Save the arrays to a file
np.savez("saXtrainYtrainData.npz", x_train=x_train, x_val=x_val, y_train=y_train, y_val=y_val)
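
To confirm the split was saved correctly, the arrays can be loaded back and their shapes checked. This is a quick sketch; note that allow_pickle=True is needed when loading, because the ragged sentence vectors are stored as object arrays:

import numpy as np

# Reload the saved arrays and verify their shapes
data = np.load("saXtrainYtrainData.npz", allow_pickle=True)
x_train_loaded = data["x_train"]
x_val_loaded = data["x_val"]
y_train_loaded = data["y_train"]
y_val_loaded = data["y_val"]

print("Shape of x_train_loaded:", x_train_loaded.shape)
print("Shape of x_val_loaded:", x_val_loaded.shape)
print("Shape of y_train_loaded:", y_train_loaded.shape)
print("Shape of y_val_loaded:", y_val_loaded.shape)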

 

Conclusion

By following these steps, I prepared the airline sentiment data for training a transformer neural network. The data, consisting of word2vec-encoded sentences and their corresponding sentiment labels, is now ready for building and training the model. The process also shows how ChatGPT can help accelerate programming work.

#neuralNetworks #dataPrepare 


Sample Output

Running the Step 5 and Step 6 scripts on a small sample of eleven comments produces console output like the following:

runfile('C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7/saProduceXtrainYtrainDataNEW.py', wdir='C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7')

sentiment = neutral -->0

sentiment = positive -->1

sentiment = neutral -->0

sentiment = negative -->0

sentiment = negative -->0

sentiment = negative -->0

sentiment = positive -->1

sentiment = neutral -->0

sentiment = positive -->1

sentiment = positive -->1

sentiment = neutral -->0

i=0 = 

i=0 w2v=[] sentiment=0

i=1 = virginamerica plus you've added commercials to the experience tacky

i=1 w2v=[[0.0012580156326293945], [-0.6498260498046875], [-0.5153262615203857], [-0.020553112030029297], [-0.6892403364181519], [0.3554692268371582], [-0.8850120306015015], [-0.2642437219619751], [0.10036587715148926]] sentiment=1

i=2 = virginamerica i didn't today must mean i need to take another trip

i=2 w2v=[[0.0012580156326293945], [-0.932395339012146], [0.2726396322250366], [-0.6015644073486328], [-0.01267862319946289], [-0.4461408853530884], [-0.932395339012146], [-0.946296215057373], [0.3554692268371582], [0.02413642406463623], [0.9904971122741699], [0.6381527185440063]] sentiment=0

i=3 = virginamerica it's really aggressive to blast obnoxious entertainment in your guests' faces amp they have little recourse

i=3 w2v=[[0.0012580156326293945], [-0.7756330966949463], [-0.5128778219223022], [-0.5935169458389282], [0.3554692268371582], [-0.942463755607605], [0.03211188316345215], [-0.1338428258895874], [0.7456883192062378], [0.9946664571762085], [-0.31516337394714355], [-0.22687816619873047], [0.2923257350921631], [0.6583085060119629], [0.6221826076507568], [-0.5303181409835815], [0.7077651023864746]] sentiment=0

i=4 = virginamerica and it's a really big bad thing about it

i=4 w2v=[[0.0012580156326293945], [-0.6498693227767944], [-0.7756330966949463], [-0.5128778219223022], [0.98853600025177], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [0.3834136724472046]] sentiment=0

i=5 = virginamerica seriously would pay 30 a flight for seats that didn't have this playing it's really the only bad thing about flying va

i=5 w2v=[[0.0012580156326293945], [0.48972034454345703], [0.81987464427948], [-0.03148186206817627], [-0.7918610572814941], [-0.5839154720306396], [-0.767822265625], [0.7113058567047119], [0.2726396322250366], [0.6221826076507568], [-0.5667037963867188], [0.7418577671051025], [-0.7756330966949463], [-0.5128778219223022], [-0.8850120306015015], [0.8083392381668091], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [-0.3030076026916504], [-0.29737353324890137]] sentiment=0

i=6 = virginamerica really missed a prime opportunity for men without hats parody there

i=6 w2v=[[0.0012580156326293945], [-0.5128778219223022], [-0.9224623441696167], [-0.2657853364944458], [-0.10833430290222168], [-0.5839154720306396], [0.42132532596588135], [0.7914493083953857], [0.4627121686935425], [-0.3580136299133301], [-0.7653474807739258]] sentiment=1

i=7 = virginamerica well i didn'tûbut now i do d

i=7 w2v=[[0.0012580156326293945], [0.28537607192993164], [-0.932395339012146], [-0.4359729290008545], [0.019797325134277344], [-0.932395339012146], [-0.1655644178390503], [0.6342606544494629]] sentiment=0

i=8 = virginamerica it was amazing and arrived an hour early you're too good to me

i=8 w2v=[[0.0012580156326293945], [0.3834136724472046], [-0.9266262054443359], [-0.0772627592086792], [-0.6498693227767944], [0.427449107170105], [0.07871925830841064], [0.5342621803283691], [-0.6033754348754883], [-0.6512038707733154], [-0.5308046340942383], [-0.4651916027069092], [0.3554692268371582], [0.500235915184021]] sentiment=1

i=9 = virginamerica did you know that suicide is the second leading cause of death among teens 1024

i=9 w2v=[[0.0012580156326293945], [-0.12914776802062988], [0.36882483959198], [0.20193088054656982], [0.7113058567047119], [-0.4647252559661865], [0.43231725692749023], [-0.8850120306015015], [0.8175731897354126], [0.1814650297164917], [0.9735549688339233], [0.8894059658050537], [-0.048635125160217285], [0.7589428424835205], [-0.8305487632751465]] sentiment=1

i=10 = virginamerica i lt3 pretty graphics so much better than minimal iconography d

i=10 w2v=[[0.0012580156326293945], [-0.932395339012146], [-0.6855369806289673], [0.13977575302124023], [-0.11634397506713867], [0.8503241539001465], [0.973355770111084], [-0.6604783535003662], [-0.5532848834991455], [0.3296027183532715], [0.6342606544494629]] sentiment=0









Shape of x_train: (8,)

Shape of y_train: (8,)

Shape of x_val: (3,)

Shape of y_val: (3,)

Shape of x_train_loaded: (8,)

Shape of x_val_loaded: (3,)

Shape of y_train_loaded: (8,)

Shape of y_val_loaded: (3,)