How to Prepare Data for a Neural Network: A Step-by-Step Guide
Introduction
In this guide, I'll walk you through the steps I took to prepare airline sentiment data for a neural network. The aim is to create a model that predicts whether new comments are positive or negative, using word embeddings and a transformer neural network architecture.
Step 0: Data Collection
I began by obtaining the 'airline twitter sentiment' dataset from Data World (https://data.world/datasets/sentiment). This dataset includes customer comments and their associated sentiments.
Step 1: Data Cleaning and Text Extraction
First, I extracted the customer comments from the 'text' field of the dataset and cleaned them by removing punctuation, numbers, and other irrelevant elements. The cleaned comments were then written to a text file called "saPreprocessSentences.txt". This process was implemented using the following Python code:
# Build saPreprocessSentences.txt (uses the clean_string helper from saPreprocessClean.py)
import pandas as pd
from saPreprocessClean import clean_string

# Load the dataset
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')

# Clean and preprocess the comments, one cleaned comment per line
# so that later steps can read the file line by line
cleaned_comments = ""
for value in df['text']:
    if isinstance(value, str):
        cleaned = clean_string(value)   # renamed so the imported function isn't shadowed
        cleaned_comments += '\n' + cleaned

# Write cleaned comments to file
with open("saPreprocessSentences.txt", "w") as file:
    file.write(cleaned_comments)
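The clean_string helper lives in saPreprocessClean.py, which I haven't listed here. Below is a minimal sketch of what such a helper could look like, assuming it lowercases the text and strips punctuation and special characters; the actual implementation may differ, so treat it as illustrative only.

# saPreprocessClean.py -- hypothetical sketch of the clean_string helper
import re

def clean_string(text):
    """Lowercase the text and keep only letters, digits, apostrophes, and spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)   # drop punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace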
Step 2: Word Extraction and Cleaning
Next, I extracted the individual words from the cleaned comments, cleaned them further, and saved them to a file called "saPreprocessWords.txt". The code to achieve this is as follows:
# saPreprocessWords.py
import pandas as pd
from saPreprocessClean import clean_string

# Load the dataset
df = pd.read_excel('/Users/ARS/ARStensorflow/sentimentAnalysis/Airline-SentimentARS1.xlsx')

# Extract and preprocess words from comments
with open("saPreprocessWords.txt", "w") as file:
    for value in df['text']:
        if isinstance(value, str):
            words = value.split()
            for word in words:
                clean_word = clean_string(word)
                file.write(f"word={word} cleaned={clean_word}\n")
Step 3: Removing Duplicate Entries
To ensure data integrity, I created a batch program called "removeDUP" to remove any duplicate entries from the "saPreprocessWords.txt" file. The deduplicated output was saved to "saRemoveDUPOutput.txt".
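The removeDUP program itself isn't shown in this guide. Here is a minimal Python sketch of the same deduplication step, assuming the "word=... cleaned=..." line format from Step 2 and writing the cleaned word as the second token on each line (which is what the Step 4 script expects when it reads split()[1]); the real batch program and its output format may differ.

# removeDUP -- hypothetical Python sketch of the deduplication step
seen = set()
with open("saPreprocessWords.txt", "r") as infile, \
     open("saRemoveDUPOutput.txt", "w") as outfile:
    for i, line in enumerate(infile):
        # Input lines look like "word=<raw> cleaned=<clean>"; dedupe on the cleaned form
        clean_word = line.strip().split()[-1].split("=", 1)[1]
        if clean_word and clean_word not in seen:
            seen.add(clean_word)
            outfile.write(f"{i} {clean_word}\n")   # second token = cleaned word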
Step 4: Creating Word Embeddings
I converted each unique word into a float value using word2vec embeddings and built a dictionary model that maps words to their corresponding vectors. This model was saved as "word2vec_dict_model.model", and the vectors were stored in "saW2Vdict_vectors.txt".
# saCreateW2VDictModel.py
from gensim.models import Word2Vec

# Load the cleaned word list (second column of each line)
with open("saRemoveDUPOutput2.txt", "r", encoding="utf-16-le") as file:
    words = [line.strip().split()[1] for line in file]

# Load the existing dictionary model and extend it with the new words
model = Word2Vec.load("word2vec_dict_model.model")
for new_word in words:
    if any(char.isdigit() for char in new_word):
        print(f"includes number --> {new_word}")
    else:
        # gensim expects a list of tokenized sentences, hence the nested list
        model.build_vocab([[new_word]], update=True)
        model.train([[new_word]], total_examples=1, epochs=1)

# Save the updated model and export the vectors in text format
model.save("word2vec_dict_model.model")
model.wv.save_word2vec_format("saW2Vdict_vectors.txt", binary=False)
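Note that saCreateW2VDictModel.py loads an existing "word2vec_dict_model.model", so the model has to be created once before the script can run. That bootstrap step isn't part of the original scripts; the sketch below shows one way to do it, assuming a vector size of 1 (which matches the single-float vectors in the sample output at the end of this post).

# One-time bootstrap of the dictionary model (hypothetical; not part of the original scripts)
from gensim.models import Word2Vec

# Seed the model with a trivial sentence; later runs of saCreateW2VDictModel.py
# extend the vocabulary with build_vocab(..., update=True)
seed_sentences = [["virginamerica"]]
model = Word2Vec(sentences=seed_sentences, vector_size=1, min_count=1, epochs=1)
model.save("word2vec_dict_model.model")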
Step 5: Data Transformation and Labeling
I transformed the sentences in "saPreprocessSentences.txt" into word2vec vectors using the dictionary model. I also labeled each sentence based on its sentiment and appended both to the "saW2VXtrainYtrainData.txt" file.
# Transform sentences to word2vec vectors and label
import numpy as np
import pandas as pd
from gensim.models import Word2Vec

def get_w2v_sentence(sentence):
    # Vector for every word in the sentence that exists in the model's vocabulary
    word_vectors = [model.wv[word] for word in sentence.split() if word in model.wv]
    return word_vectors

# Load the word2vec model
model = Word2Vec.load("word2vec_model_updated.model")

# Load sentiment labels: positive -> 1, neutral and negative -> 0
df = pd.read_excel('/Users/ARS/ARStensorflow/Airline-SentimentARS1.xlsx')
sentiments = []
for sentiment_value in df['airline_sentiment']:
    label = 1 if sentiment_value == "positive" else 0
    print(f"sentiment = {sentiment_value} -->{label}")
    sentiments.append(label)

# Write one line per sentence: index, vector list, and sentiment label
with open("saPreprocessSentences.txt", "r") as file:
    with open("saW2VXtrainYtrainData.txt", "w") as file2:
        i = 0
        for line in file:
            sentence = line.strip()
            w2v_sentence_vectors = get_w2v_sentence(sentence)
            w2v_sentence_lists = [vector.tolist() for vector in w2v_sentence_vectors]
            print(f"i={i} = {sentence}")
            out_line = f"i={i} w2v={w2v_sentence_lists} sentiment={sentiments[i]}"
            print(out_line)              # echoed to the console (see the sample output below)
            print(out_line, file=file2)
            i += 1

# Reading and formatting the data
x_values = []
y_values = []
with open("saW2VXtrainYtrainData.txt", "r") as file:
    for line in file:
        # Each line has the form: i=<n> w2v=<list of vectors> sentiment=<0 or 1>
        w2v_part, sentiment_part = line.strip().split(" sentiment=")
        x_value = w2v_part.split("w2v=", 1)[1]   # the vector list is kept as a string
        y_value = int(sentiment_part)
        x_values.append(x_value)
        y_values.append(y_value)

x_values_array = np.array(x_values)
y_values_array = np.array(y_values)
Step 6: Splitting Data for Training and Validation
I split the data into training and validation sets using an 80-20 train-validation ratio. The resulting arrays were saved to "saXtrainYtrainData.npz".
from sklearn.model_selection import train_test_split

# Split data into training and validation sets (80-20)
val_ratio = 0.2
x_train, x_val, y_train, y_val = train_test_split(
    x_values_array, y_values_array, test_size=val_ratio, random_state=42
)

# Save the arrays to a file
np.savez("saXtrainYtrainData.npz", x_train=x_train, x_val=x_val,
         y_train=y_train, y_val=y_val)
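The shape printout at the end of this post suggests the saved arrays were loaded back as a sanity check. That loading code isn't shown; here is a minimal sketch, assuming the file and key names used above.

import numpy as np

# Reload the saved split and confirm the array shapes
data = np.load("saXtrainYtrainData.npz", allow_pickle=True)  # allow_pickle in case the arrays hold Python objects
print(f"Shape of x_train_loaded: {data['x_train'].shape}")
print(f"Shape of x_val_loaded: {data['x_val'].shape}")
print(f"Shape of y_train_loaded: {data['y_train'].shape}")
print(f"Shape of y_val_loaded: {data['y_val'].shape}")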
Conclusion
By following these steps, I successfully prepared the airline sentiment data for training a transformer neural network. The data, which consists of word2vec-transformed sentences and their corresponding sentiment labels, is ready to be used for building and training the neural network model. This process also showcases the power of ChatGPT in aiding and accelerating the programming process.
#neuralNetworks #dataPrepare
Sample Output
Running the data-preparation script on a small subset of the data produced the console output below.
runfile('C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7/saProduceXtrainYtrainDataNEW.py', wdir='C:/Users/ars/ARStensorflow/sentimentAnalysis/saSTEP7')
sentiment = neutral -->0
sentiment = positive -->1
sentiment = neutral -->0
sentiment = negative -->0
sentiment = negative -->0
sentiment = negative -->0
sentiment = positive -->1
sentiment = neutral -->0
sentiment = positive -->1
sentiment = positive -->1
sentiment = neutral -->0
i=0 =
i=0 w2v=[] sentiment=0
i=1 = virginamerica plus you've added commercials to the experience tacky
i=1 w2v=[[0.0012580156326293945], [-0.6498260498046875], [-0.5153262615203857], [-0.020553112030029297], [-0.6892403364181519], [0.3554692268371582], [-0.8850120306015015], [-0.2642437219619751], [0.10036587715148926]] sentiment=1
i=2 = virginamerica i didn't today must mean i need to take another trip
i=2 w2v=[[0.0012580156326293945], [-0.932395339012146], [0.2726396322250366], [-0.6015644073486328], [-0.01267862319946289], [-0.4461408853530884], [-0.932395339012146], [-0.946296215057373], [0.3554692268371582], [0.02413642406463623], [0.9904971122741699], [0.6381527185440063]] sentiment=0
i=3 = virginamerica it's really aggressive to blast obnoxious entertainment in your guests' faces amp they have little recourse
i=3 w2v=[[0.0012580156326293945], [-0.7756330966949463], [-0.5128778219223022], [-0.5935169458389282], [0.3554692268371582], [-0.942463755607605], [0.03211188316345215], [-0.1338428258895874], [0.7456883192062378], [0.9946664571762085], [-0.31516337394714355], [-0.22687816619873047], [0.2923257350921631], [0.6583085060119629], [0.6221826076507568], [-0.5303181409835815], [0.7077651023864746]] sentiment=0
i=4 = virginamerica and it's a really big bad thing about it
i=4 w2v=[[0.0012580156326293945], [-0.6498693227767944], [-0.7756330966949463], [-0.5128778219223022], [0.98853600025177], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [0.3834136724472046]] sentiment=0
i=5 = virginamerica seriously would pay 30 a flight for seats that didn't have this playing it's really the only bad thing about flying va
i=5 w2v=[[0.0012580156326293945], [0.48972034454345703], [0.81987464427948], [-0.03148186206817627], [-0.7918610572814941], [-0.5839154720306396], [-0.767822265625], [0.7113058567047119], [0.2726396322250366], [0.6221826076507568], [-0.5667037963867188], [0.7418577671051025], [-0.7756330966949463], [-0.5128778219223022], [-0.8850120306015015], [0.8083392381668091], [0.7764592170715332], [0.9213329553604126], [0.9233143329620361], [-0.3030076026916504], [-0.29737353324890137]] sentiment=0
i=6 = virginamerica really missed a prime opportunity for men without hats parody there
i=6 w2v=[[0.0012580156326293945], [-0.5128778219223022], [-0.9224623441696167], [-0.2657853364944458], [-0.10833430290222168], [-0.5839154720306396], [0.42132532596588135], [0.7914493083953857], [0.4627121686935425], [-0.3580136299133301], [-0.7653474807739258]] sentiment=1
i=7 = virginamerica well i didn'tûbut now i do d
i=7 w2v=[[0.0012580156326293945], [0.28537607192993164], [-0.932395339012146], [-0.4359729290008545], [0.019797325134277344], [-0.932395339012146], [-0.1655644178390503], [0.6342606544494629]] sentiment=0
i=8 = virginamerica it was amazing and arrived an hour early you're too good to me
i=8 w2v=[[0.0012580156326293945], [0.3834136724472046], [-0.9266262054443359], [-0.0772627592086792], [-0.6498693227767944], [0.427449107170105], [0.07871925830841064], [0.5342621803283691], [-0.6033754348754883], [-0.6512038707733154], [-0.5308046340942383], [-0.4651916027069092], [0.3554692268371582], [0.500235915184021]] sentiment=1
i=9 = virginamerica did you know that suicide is the second leading cause of death among teens 1024
i=9 w2v=[[0.0012580156326293945], [-0.12914776802062988], [0.36882483959198], [0.20193088054656982], [0.7113058567047119], [-0.4647252559661865], [0.43231725692749023], [-0.8850120306015015], [0.8175731897354126], [0.1814650297164917], [0.9735549688339233], [0.8894059658050537], [-0.048635125160217285], [0.7589428424835205], [-0.8305487632751465]] sentiment=1
i=10 = virginamerica i lt3 pretty graphics so much better than minimal iconography d
i=10 w2v=[[0.0012580156326293945], [-0.932395339012146], [-0.6855369806289673], [0.13977575302124023], [-0.11634397506713867], [0.8503241539001465], [0.973355770111084], [-0.6604783535003662], [-0.5532848834991455], [0.3296027183532715], [0.6342606544494629]] sentiment=0
['0' '1' '0' '0' '0' '0' '1' '0' '1' '1' '0']
------------
Shape of x_train: (8,)
Shape of y_train: (8,)
Shape of x_val: (3,)
Shape of y_val: (3,)
Shape of x_train_loaded: (8,)
Shape of x_val_loaded: (3,)
Shape of y_train_loaded: (8,)
Shape of y_val_loaded: (3,)