Day 5 - Machine Learning with Microsoft Azure

"Tell me and I forget, teach me and I may remember, involve me and I learn." – Benjamin Franklin

Yesterday, on Day 4, I discussed the common types of data in ML, which include numerical, time-series, categorical, image, and text data. Today I'll be discussing how to encode categorical data. As we've mentioned a few times now, machine learning algorithms need data in numerical form. Thus, when we have categorical data, we need to encode it in some way so that it is represented numerically.

There are two common approaches for encoding categorical data: ordinal encoding and one-hot encoding.

Ordinal Encoding

Ordinal encoding simply means converting the categorical data into integers ranging from 0 to (number of categories - 1). Looking at the tabular data shared earlier:

| SKU    | Make   | Color | Quantity | Price |
|--------|--------|-------|----------|-------|
| 908721 | Guess  | Blue  | 789      | 45.33 |
| 456552 | Tillys | Red   | 244      | 22.91 |
| 789921 | A&F    | Green | 387      | 25.92 |
| 872266 | Guess  | Blue  | 154      | 17.56 |

If we apply ordinal encoding to the Make property, we get the following:

| Make   | Encoding |
|--------|----------|
| A&F    | 0        |
| Guess  | 1        |
| Tillys | 2        |

And if we apply it to the Color property, we get:

| Color | Encoding |
|-------|----------|
| Red   | 0        |
| Green | 1        |
| Blue  | 2        |

Using the above encoding, the transformed table is shown below:

| SKU    | Make | Color | Quantity | Price |
|--------|------|-------|----------|-------|
| 908721 | 1    | 2     | 789      | 45.33 |
| 456552 | 2    | 0     | 244      | 22.91 |
| 789921 | 0    | 1     | 387      | 25.92 |
| 872266 | 1    | 2     | 154      | 17.56 |

This approach might seem to solve our problem, but it has a significant drawback: it implicitly assumes an order across the categories. Looking at the table above, Blue gets encoded with the value 2, Green with the value 1, and Red with the value 0, which attaches a sense of ranking or priority to the values, and this isn't something we want our algorithm to infer from the data.
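To make this concrete, here is a minimal sketch of ordinal encoding in Python, assuming pandas and scikit-learn are available. The explicit category lists are passed in only so the integers match the tables above; by default, scikit-learn's OrdinalEncoder assigns integers in alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The hypothetical clothing data from the tables above
df = pd.DataFrame({
    "SKU": [908721, 456552, 789921, 872266],
    "Make": ["Guess", "Tillys", "A&F", "Guess"],
    "Color": ["Blue", "Red", "Green", "Blue"],
    "Quantity": [789, 244, 387, 154],
    "Price": [45.33, 22.91, 25.92, 17.56],
})

# Pass explicit category lists so the integers match the tables above;
# by default, OrdinalEncoder assigns integers alphabetically.
encoder = OrdinalEncoder(categories=[["A&F", "Guess", "Tillys"],
                                     ["Red", "Green", "Blue"]])
df[["Make", "Color"]] = encoder.fit_transform(df[["Make", "Color"]])
print(df)  # Make and Color are now integers, e.g. Guess -> 1, Blue -> 2
```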

One-Hot Encoding

One-hot encoding is a very different approach. In one-hot encoding, we transform each categorical value into a column. If there are n categorical values, n new columns are added. For example, the Color property has three categorical values: Red, Green, and Blue, so three new columns Red, Green, and Blue are added.

If an item belongs to a category, the column representing that category gets the value 1, and all other columns get the value 0. For example, item 908721 (first row in the table) has the color Blue, so we put 1 into the Blue column for 908721 and 0 into the Red and Green columns. Item 456552 (second row in the table) has the color Red, so we put 1 into the Red column for 456552 and 0 into the Green and Blue columns.

If we do the same thing for the Make property, our table can be transformed as follows:

| SKU    | A&F | Guess | Tillys | Red | Green | Blue | Quantity | Price |
|--------|-----|-------|--------|-----|-------|------|----------|-------|
| 908721 | 0   | 1     | 0      | 0   | 0     | 1    | 789      | 45.33 |
| 456552 | 0   | 0     | 1      | 1   | 0     | 0    | 244      | 22.91 |
| 789921 | 1   | 0     | 0      | 0   | 1     | 0    | 387      | 25.92 |
| 872266 | 0   | 1     | 0      | 0   | 0     | 1    | 154      | 17.56 |

The major challenge with this type of encoding is that it generates many more columns of data: one new column per category, which can make the dataset very wide when a property has many distinct values.
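For illustration, here is a small sketch of one-hot encoding the same hypothetical DataFrame with pandas. Note that get_dummies prefixes the new columns with the original column name (e.g. Make_Guess, Color_Blue), rather than using the bare category names shown in the table.

```python
import pandas as pd

# Same hypothetical DataFrame as in the ordinal encoding sketch
df = pd.DataFrame({
    "SKU": [908721, 456552, 789921, 872266],
    "Make": ["Guess", "Tillys", "A&F", "Guess"],
    "Color": ["Blue", "Red", "Green", "Blue"],
    "Quantity": [789, 244, 387, 154],
    "Price": [45.33, 22.91, 25.92, 17.56],
})

# Each categorical value becomes its own 0/1 column; pandas prefixes
# the new columns, e.g. Make_Guess and Color_Blue.
one_hot = pd.get_dummies(df, columns=["Make", "Color"], dtype=int)
print(one_hot)
```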

Image data

As we all know, images are one of the data types commonly used in ML models that are not initially in numerical form. So how do we convert or represent images numerically before feeding them into an ML model? If you take a closer look at any image by zooming in, you will notice that it is made up of small tiles called pixels. The higher the number of pixels, the clearer the image looks.

  • In black and white images, each pixel is represented by a single value ranging from 0 to 255. The value states how bright the pixel appears (e.g., 0 indicates the pixel is black, while 255 indicates bright white).
  • In colored images, each pixel can be represented by a vector of three numbers, each ranging from 0 to 255, for the three primary color channels (red, green, and blue). These three red, green, and blue (RGB) values are used together to decide the color of that pixel. For example, purple might be represented as 128, 0, 128 (a mix of moderately intense red and blue, with no green).

The number of channels required to represent the color is known as the color depth or simply depth. With an RGB image, depth = 3, because there are three channels (Red, Green, and Blue). In contrast, a black and white image has depth = 1, because there is only one channel.


Encoding an Image

Before we can encode an image, we need to know three things about it:

  • The color of each pixel
  • The vertical position of each pixel
  • The horizontal position of each pixel

Thus, we can fully encode an image numerically by using a vector with three dimensions. The size of the vector required for any given image is height × width × depth.
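As a quick illustration, here is a sketch of loading an image into such a numeric array, assuming NumPy and Pillow are installed; "photo.jpg" is a hypothetical placeholder path.

```python
import numpy as np
from PIL import Image

# Load an image and convert it to an RGB numeric array
# ("photo.jpg" is a hypothetical placeholder path).
img = Image.open("photo.jpg").convert("RGB")
pixels = np.asarray(img)

print(pixels.shape)  # (height, width, 3) -- depth = 3 for RGB
print(pixels[0, 0])  # RGB values of the top-left pixel, e.g. [128   0 128]
```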

Text data

Text is another example of a data type that is initially non-numerical and that must be processed before it can be fed into a machine learning algorithm. Let's have a look at some of the common tasks we might do as part of this processing.

Normalization

One of the challenges that can come up in text analysis is that there are often multiple forms that mean the same thing. For example, the verb 'to be' may show up as is, am, are, and so on. Or a document may contain alternative spellings of a word, such as behavior vs. behaviour. So one step that you will sometimes conduct in processing text is normalization.

Text normalization is the process of transforming a piece of text into a canonical (official) form.

Lemmatization is an example of text normalization. A lemma is the dictionary form of a word and lemmatization is the process of reducing multiple inflections to that single dictionary form. For example, we can apply this to the is, am, are example we mentioned above:

| Original word | Lemmatized word |
|---------------|-----------------|
| is            | be              |
| are           | be              |
| am            | be              |

In many cases, you may also want to remove stop words. Stop words are high-frequency words that are unnecessary (or unwanted) during the analysis. For example, when you enter a query like which cookbook has the best pancake recipe into a search engine, the words which and the are far less relevant than cookbook, pancake, and recipe. In this context, we might want to consider which and the to be stop words and remove them prior to analysis. Here is an example:

| Original text   | Normalized text |
|-----------------|-----------------|
| The quick fox.  | [quick, fox]    |
| The lazzy dog.  | [lazy, dog]     |
| The rabid hare. | [rabid, hare]   |

From the example above, we have tokenized the text (i.e., split each string of text into a list of smaller parts, or tokens), removed stop words (the), and standardized spelling (changing lazzy to lazy).
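Here is a rough sketch of tokenization, stop-word removal, and lemmatization using NLTK, assuming the library and its corpora are available. (Spelling standardization, such as lazzy to lazy, would need an additional tool like a spell checker and is not shown.)

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required corpora
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def normalize(text):
    # Tokenize naively on whitespace and strip surrounding punctuation
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    # Drop stop words and reduce each remaining token to its lemma
    return [lemmatizer.lemmatize(t) for t in tokens if t and t not in stop_words]

print(normalize("The quick fox."))           # ['quick', 'fox']
print(lemmatizer.lemmatize("are", pos="v"))  # 'be' -- verbs need pos='v'
```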

Vectorization

After normalization of the text, we can take the next step of encoding it in a numerical form. The goal here is to identify the particular features of the text that will be relevant to us for the particular task we want to perform, and then get those features extracted in a numerical form that is accessible to the machine learning algorithm. Typically this is done by text vectorization: that is, by turning a piece of text into a vector. Remember, a vector is simply an array of numbers, so there are many different ways that we can vectorize a word or a sentence, depending on how we want to use it. Common approaches include:

  • Term Frequency-Inverse Document Frequency (TF-IDF) vectorization
  • Word embeddings, as done with Word2vec or Global Vectors (GloVe)
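As a small illustration of the first approach, here is a sketch of TF-IDF vectorization with scikit-learn, assuming a recent version of the library:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the quick fox", "the lazy dog", "the rabid hare"]

# Each document becomes a vector of TF-IDF weights,
# one entry per term in the learned vocabulary.
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # e.g. ['dog' 'fox' 'hare' ...]
print(vectors.toarray())                   # one row of weights per document
```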

In summary, a typical pipeline for text data begins by pre-processing or normalizing the text. This step typically includes tasks such as breaking the text into sentence and word tokens, standardizing the spelling of words, and removing overly common words (called stop words).

The next step is feature extraction and vectorization, which creates a numeric representation of the documents. Common approaches include TF-IDF vectorization, Word2vec, and Global Vectors (GloVe).

Lastly, we feed the vectorized documents and their labels into a model and start training.


Typical pipeline for a classification model using text data

Normalize text >> Vectorize text >> Train model >> Deploy model
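Putting the whole pipeline together, here is a hedged end-to-end sketch with scikit-learn, using a tiny hypothetical labeled dataset; a real model would of course need far more data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hypothetical labeled dataset: 1 = positive, 0 = negative
texts = ["great product, love it", "terrible quality", "works great",
         "worst purchase ever"]
labels = [1, 0, 1, 0]

# Vectorize the text and train a classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["great quality product"]))  # predicted label for new text
```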