My Machine Learning Notebook - Part 1 (Data Preprocessing)
This is the first part of my Machine Learning notebook. As I am totally new to ML, the content follows the Udemy course “Machine Learning A-Z™: Hands-On Python & R”. As in the course, I’ll be using Spyder and RStudio.
For this article, let’s consider the following dataset:
The dataset has:
- First columns - the independent variables Country, Age and Salary (loaded into the variable X in the Python code below)
- Last column - the dependent variable (Purchased)
- Missing values - marked as NaN
Our machine learning algorithm has the following form: f(X) = y, where X is the set of values for independent variables and y is the dependent variable
Importing and cleaning up data
Steps:
- Importing data. Splitting data into dependent and independent variables
- Taking care of missing data - in this case, by replacing the missing data with the average of the rest of the column.
- Encoding categorical data - vocabulary: categorical data == data that represents categories; in our example Country and Purchased (Yes / No).
- Feature scaling - this is important because many machine learning algorithms use Euclidean distance, and if the features are not on the same scale, one feature can dominate the others (especially since Euclidean distance is based on squared differences) - see the sketch right after this list.
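To make the last point concrete, here is a minimal sketch of my own (with hypothetical Age/Salary values and assumed value ranges, not data from the course dataset) showing how an unscaled feature dominates Euclidean distance:
import numpy as np

# two hypothetical people as (Age, Salary); values made up for illustration
a = np.array([30.0, 50000.0])
b = np.array([55.0, 52000.0])

# unscaled: the salary difference (2000) swamps the age difference (25)
print(np.linalg.norm(a - b))   # ~2000.2

# after min-max scaling each feature (assumed ranges: Age 20-60, Salary 40k-80k)
a_scaled = np.array([(30 - 20) / 40.0, (50000 - 40000) / 40000.0])
b_scaled = np.array([(55 - 20) / 40.0, (52000 - 40000) / 40000.0])
print(np.linalg.norm(a_scaled - b_scaled))   # ~0.63 - the age gap matters again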
Python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv("Data.csv")

# extract data from the dataset - first index: rows, second index: columns
# X is our matrix of independent variables (all columns except the last one)
X = dataset.iloc[:, :-1].values

# replace missing data with the mean of the column
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN',
                  strategy='mean',
                  axis=0)  # press CTRL + i in Spyder to inspect the signature
# fit on columns with index 1 and 2 (Age, Salary), then transform them
X[:, 1:3] = imputer.fit(X[:, 1:3]).transform(X[:, 1:3])
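Side note, not from the course: Imputer comes from the scikit-learn version used in the course; in releases from 0.20 onward it was replaced by SimpleImputer in the sklearn.impute module. A rough equivalent of the snippet above would be:
# equivalent with a newer scikit-learn, where Imputer was replaced by SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])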
R
dataset = read.csv('Data.csv')

# replace missing Age and Salary values with the mean of the column
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
Back to Python
Encoding categorical data to numbers for further processing
Two steps:
- Transform from labels (strings) to numbers
# encode categorical data to numbers
from sklearn.preprocessing import LabelEncoder
# transforms categorical data from strings to numbers
X[:, 0] = LabelEncoder().fit_transform(X[:, 0])
- Then create dummy features with one column per category, so that we don't introduce an arbitrary order between categories into the ML algorithm.
# since we don't want in our model to have order between categories,
# we need to create dummy variables, one column per each category
from sklearn.preprocessing import OneHotEncoder
X = OneHotEncoder(categorical_features=[0]).fit_transform(X).toarray()
When we look at the data in the X variable now, the single Country column has been replaced by one dummy column per country, followed by the remaining columns.
For the dependent variable we don’t need to make dummy features, thus we simply run:
# dependent variable (Purchased) - LabelEncoder maps No / Yes to 0 / 1
y = dataset.iloc[:, 3].values
y = LabelEncoder().fit_transform(y)
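Another aside, not part of the course code: in recent scikit-learn versions the categorical_features argument of OneHotEncoder was removed, and column selection is done with ColumnTransformer instead. A rough equivalent of the encoding above would be:
# newer scikit-learn equivalent of OneHotEncoder(categorical_features=[0]):
# ColumnTransformer applies the encoder to column 0 (Country) and passes
# the remaining columns through unchanged
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough')
X = ct.fit_transform(X)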
Encoding categorical data in R
It seems in R we don't need to create the dummy features, so it is straightforward:
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('Yes', 'No'),
                           labels = c(1, 0))
Splitting the dataset into training set and test set (Python)
test_size == the fraction of the whole dataset used for test data - good values usually range between 0.2 and 0.3
random_state == the seed used for the random split; I set it to 0 so that I get the same result every time.
# splitting our data into training data and test data
# (note: in newer scikit-learn versions this import moved to sklearn.model_selection)
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)
In R
install.packages('caTools')
library(caTools)
set.seed(0)
split = sample.split(dataset$Purchased, SplitRatio=0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
We set the seed for the random split to 0 so that it is deterministic (the value can be any number).
sample.split
receives the dependent variable column and the split ratio (here we aim for 80% training and 20% testing) and outputs an array of TRUE / FALSE values - the actual split. According to the documentation, if there are only a few labels (as expected here) the relative ratio of the labels in both subsets will be the same - this is why sample.split requires the dependent variable column.
Then we use this array to obtain the corresponding subsets from our initial dataset.
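For comparison (my own note, not shown in the course), the closest scikit-learn counterpart to this label-aware splitting is the stratify argument of train_test_split:
# stratify=y keeps the class ratio of y equal in the training and test subsets,
# similar to what sample.split does in R
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)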
Feature Scaling
Purpose: make sure that no variable dominates the others.
Two types of feature scaling:
- Normalization: scaled(x) = (x - min(x)) / (max(x) - min(x))
- Standardization (also known as z-scoring): scaled(x) = (x - mean(x)) / stddev(x), where stddev(x) is the standard deviation of x
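As a side note (not part of the course snippets), scikit-learn provides both: StandardScaler, used below, implements standardization, while MinMaxScaler implements normalization:
# normalization with scikit-learn: MinMaxScaler rescales every column to [0, 1]
from sklearn.preprocessing import MinMaxScaler
X_train_normalized = MinMaxScaler().fit_transform(X_train)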
We are going to use the fit_transform function for the training data and transform for the test data, because the StandardScaler (see below) is already fitted by the first call and we want to reuse the same scaling for the test data. According to the docs:
Definition: fit_transform(X, y=None, **fit_params) - Fit to data, then transform it. Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
# scaling features
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train)
# reuse the scaling fitted on the training data for the test data
X_test_scaled = std_scaler.transform(X_test)
Attention: in the code above, we also scaled the dummy variables. This can be useful or not, depending on the task at hand. If we don’t want to scale the dummy features, we can simply:
X_train_scaled_2 = np.empty_like(X_train)
# scale only the Age and Salary columns (indices 3 and 4)
X_train_scaled_2[:, 3:] = std_scaler.fit_transform(X_train[:, 3:])
# copy the three dummy columns (indices 0, 1, 2) unchanged
X_train_scaled_2[:, :3] = X_train[:, :3]
In R, we only select the features we want to scale. Indices start from 1 and we don't want to scale the Country column, so it goes like this:
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
Some implementations:
In the snippets above, we use algorithms from various Python libraries. Here are some implementations of these algorithms.
A basic function to generate data with random NaNs
import numpy as np
import random

def generate_data(length, min = 0, max = 1, gaps_percent = 0):
    # uniform random values in [min, max)
    ret = np.random.rand(length) * (max - min) + min
    cnt_nan = int(gaps_percent * length)
    ilen = int(length)
    for i in range(0, cnt_nan):
        idx = random.randint(0, ilen - 1)
        # if the slot is already NaN, jump ahead so we minimize clusters
        while np.isnan(ret[idx]):
            idx = int((idx + random.randint(1, 13)) % ilen)
        ret[idx] = np.nan
    return ret

arr = generate_data(100, 10, 100, 0.2)
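A quick sanity check of my own: since the loop always picks a slot that is not yet NaN, exactly gaps_percent * length values end up missing.
# with length = 100 and gaps_percent = 0.2 this should print 20
print(np.isnan(arr).sum())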
Cleaning up the data - the algorithm which replaces NaN with a value (either median or mean)
def fill_nan_with_value(arr, func):
    ret = np.array(arr)
    mask = np.isnan(arr)
    # compute func (e.g. np.mean) over the non-NaN values and write it into the gaps
    ret[mask] = func(arr[~mask])
    return ret
arr_filled = fill_nan_with_value(arr, np.mean)
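As a check (my own addition), the value written into the gaps should match np.nanmean, which computes the mean while ignoring NaNs:
# every filled position holds the mean of the non-NaN values
print(np.nanmean(arr))
print(arr_filled[np.isnan(arr)][0])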
Data scaling - Z-scoring and normalization:
"""Standardize features by removing the mean and scaling to unit variance
For instance many elements used in the objective function of a learning algorithm
(such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers
of linear models) assume that all features are centered around 0 and have
variance in the same order. If a feature has a variance that is orders of
magnitude larger than others, it might dominate the objective function and
make the estimator unable to learn from other features correctly as expected.
- Also known as Z-Scoring
"""
def scale_std_dev(arr_filled):
    # z-score: subtract the mean, divide by the (population) standard deviation
    mean_arr = np.mean(arr_filled)
    stddev_arr = np.sqrt(np.sum((arr_filled - mean_arr) ** 2) / arr_filled.size)
    return (arr_filled - mean_arr) / stddev_arr

def scale_normalize(arr_filled):
    # min-max normalization: rescale to the [0, 1] interval
    return (arr_filled - np.min(arr_filled)) / (np.max(arr_filled) - np.min(arr_filled))

array_scaled_std_dev = scale_std_dev(arr_filled)
array_scaled_normal = scale_normalize(arr_filled)
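A quick check of my own for the expected properties: the z-scored array should have mean ≈ 0 and standard deviation ≈ 1, and the normalized array should span [0, 1]:
# z-scored data: mean ~ 0, standard deviation ~ 1
print(np.mean(array_scaled_std_dev), np.std(array_scaled_std_dev))
# normalized data: minimum 0, maximum 1
print(np.min(array_scaled_normal), np.max(array_scaled_normal))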
Generate linear data with noise and plot it
"""
Format ax + b + error
"""
def generate_noisy_linear_data(start, end, size, a_coef, b_coef, error):
return np.linspace(start, end, size) * a_coef + generate_data(size, -error * 0.5, error * 0.5) + b_coef
line_with_noise = generate_noisy_linear_data(0, 10, 100, 0.2, 20, 0.5)
import matplotlib.pyplot as plt;
plt.plot(np.linspace(0, 10, 100), line_with_noise, 'ro')