CareerCon 2019 — Help Navigate Robots : Help robots become smarter
Classification of surface type to help robot navigation
Introduction
Navigation is an important capability for autonomous machines such as mobile robots, self-driving cars, and other intelligent machines. According to Francesco Lomio et al., detecting the type of surface a robot is driving on can serve as a proxy measurement of the machine's location, i.e. an indirect way of inferring where it is from quantities it can actually measure. Given a map of all the surface types, real-time detection of surface transitions using an inertial measurement unit (IMU) makes the navigation data more reliable. Thus, helping our machine become smarter.
The task posed by this problem is to classify the surface the robot is moving on, given the IMU data, into one of 9 possible classes — a multiclass classification problem. So let's get started!
The dataset can be found here.
About the data:
Train data:
1. The input data consists of 487680 rows and 13 columns. Grouping with respect to the 'series_id', or the ID number, we get 3810 data points, each of which is a time series of 128 rows.
2. 'series_id', 'measurement_number' and 'row_id' are the identification columns provided: 'series_id' identifies each data point, 'measurement_number' identifies the i-th measurement within a series, and 'row_id' is the ID for each row.
3. Coming to the data columns, we have the linear acceleration (the rate of change of velocity) measured along the three coordinate axes, the angular velocity (the speed and direction of rotation) about the same axes, and the orientation (the inclination to the surface) expressed in 'quaternions'.
The data is measured by an IMU sensor positioned on an industrial trolley with silent wheels, from nine indoor surface types.
Test data:
- The test data has three columns. First, the ‘series_id’ which acts as the foreign key.
- The ‘surface’ column which is the output prediction for each point.
- The 'group_id' column. This column gives additional information about the dataset: each time series in the input data is part of a longer recording collected by the IMU sensor, and 'group_id' indicates which 'group' the series belonged to before being split into the existing time series.
Metrics to judge the performance:
1. Micro averaged F1 Score:
F1 score is the harmonic mean of the precision and the recall.
Precision is the ratio of the true positives to the sum of the true positives and false positives; it tells us, out of all the data points classified as positive by the model, how many are actually positive.
Recall is the ratio of the true positives to the sum of the true positives and false negatives; it tells us, out of all the data points that are actually positive, how many are classified as positive.
We use the micro averaged F1 score as a metric to evaluate the performance of the model. Micro averaging aggregates the counts of true positives, false positives and false negatives across all classes before computing precision and recall, so every data point contributes equally and class imbalance is reflected in the score; macro averaging, in contrast, computes the score for each class separately and averages them, treating every class equally regardless of its size. (A short snippet computing these metrics follows this list.)
2. ROC AUC Score:
We use area under the ROC curve as a second metric.
The ROC curve is a plot of the behaviour of True Positive Rate (TPR), or the ratio of the true positives to the sum of the true positives and false negatives (or, in other words, recall), as the False Positive Rate (FPR), or the ratio of the false positives to the sum of false positives and true negatives, changes. It tells us about the classifier’s ability to differentiate signal from ‘noise’.
A disadvantage of the ROC AUC score in this scenario arises with highly imbalanced data. If the actual positives for a class are far fewer than the negatives, the denominator of the FPR (false positives plus true negatives) is dominated by the large number of negatives, so even substantial changes in the number of false positives barely move the FPR, making the ROC AUC score less sensitive to changes in false positives.
3. Confusion Matrix:
The confusion matrix can also be used to judge the performance of the classifier. It gives us an idea about the exact count of true positives, true negatives, false positives and false negatives for each class.
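As a quick illustration, these metrics can be computed with scikit-learn (y_true and y_pred here stand for arrays of the true and predicted surface labels):
from sklearn.metrics import f1_score, confusion_matrix

micro_f1 = f1_score(y_true, y_pred, average='micro')  # aggregates TP/FP/FN over all classes
macro_f1 = f1_score(y_true, y_pred, average='macro')  # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred)                 # rows: true class, columns: predicted class
print('micro F1:', micro_f1, 'macro F1:', macro_f1)
print(cm)
# roc_auc_score would additionally need the predicted class probabilities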
EDA:
Now, let’s do some basic EDA.
1. We print the output data counts of each class to find any imbalance in data.
Looks like there is a lot of imbalance. The 'hard_tiles' surface type has the smallest share of the data, while the 'concrete' and 'soft_pvc' types have the largest. We can pass 'stratify' to the train test split so that the class proportions are preserved in each split (see the snippet below).
Thus the ROC AUC score might not serve as a good measure but let’s analyse the results anyway.
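A minimal sketch of both steps, assuming the training labels live in a DataFrame called target with a 'surface' column and the engineered features in X (names assumed):
from sklearn.model_selection import train_test_split

# Class distribution as percentages
print(target['surface'].value_counts(normalize=True) * 100)

# Stratified split keeps the class proportions the same in the train and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, target['surface'], test_size=0.2, stratify=target['surface'], random_state=42)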
2. We plot the time series.
We plot the time series of each feature for the first 4 data points, with series_id 0, 1, 2 and 3. Apart from the orientation data in 'quaternions' and the Euler angle features, the time series of the remaining features look continuous.
We convert the 'quaternions', which are just another way of expressing an angle of inclination, to Euler angles. Quaternions overcome a problem that Euler angles suffer from, called 'gimbal lock'.
Gimbal lock occurs when a machine rotating about all three coordinate axes simultaneously loses a degree of freedom because two of the rotation axes become aligned, so rotation about the third axis can no longer be distinguished. This image illustrates how gimbal lock occurs.
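A small sketch of the conversion using SciPy; the column names follow the competition data, and assuming orientation_W is the scalar component, the quaternion is passed in (X, Y, Z, W) order:
import numpy as np
from scipy.spatial.transform import Rotation

# data_: the training DataFrame containing the orientation columns (name assumed)
quat = data_[['orientation_X', 'orientation_Y', 'orientation_Z', 'orientation_W']].values
euler = Rotation.from_quat(quat).as_euler('xyz')  # roll, pitch, yaw in radians

data_['euler_angle_X'] = euler[:, 0]
data_['euler_angle_Y'] = euler[:, 1]
data_['euler_angle_Z'] = euler[:, 2]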
Printing some of the basic stats we get:
- We can see that the 'linear_acceleration_Z' mean is close to -9.8, reflecting the acceleration due to gravity.
- Angular velocity values have a much smaller spread compared to linear acceleration.
3. We plot density plots of all the features to notice the distributions or recognize if any distribution is easily separable.
- In the density plots, we can clearly observe some distinct peaks in the orientation channels for the output class 'hard_tiles', but most of the other distributions overlap over much of their range.
- The 'euler_angle_X' distribution resembles the orientation channel distributions.
- Angular velocity distributions for all the classes show some deviations for angular_velocity_Z, but the distributions are mostly overlapping for all the classes for angular_velocity_X and angular_velocity_Y.
- Linear acceleration channels present almost overlapping distributions for all the classes.
4. Correlations between features:
- We would expect minimal correlation between the features, and as can be observed above, the correlation is close to 0 for most feature pairs.
- The correlation between 'angular_velocity_Y' and 'angular_velocity_Z' is about -0.6, indicating a negative relationship.
- Features 'orientation_X' and 'orientation_W' also seem positively correlated, as do 'orientation_Y' and 'orientation_Z', with correlation coefficients close to 1.
Since the orientation features seem correlated, we try to plot correlation heatmap using the Euler angles instead.
Also, as shown by this kernel, it is possible to achieve an accuracy of 0.992 using only the orientation data, which points to possible data leakage in the dataset.
Thus we use the Euler angles instead of the raw orientation values, since the correlation between them is much lower.
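As a small sketch (assuming the Euler angle columns have already been added to the training DataFrame data_), the heatmap can be plotted as follows:
import matplotlib.pyplot as plt
import seaborn as sns

# Feature columns after replacing the quaternions with Euler angles (column names assumed)
cols = ['euler_angle_X', 'euler_angle_Y', 'euler_angle_Z',
        'angular_velocity_X', 'angular_velocity_Y', 'angular_velocity_Z',
        'linear_acceleration_X', 'linear_acceleration_Y', 'linear_acceleration_Z']
sns.heatmap(data_[cols].corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation between features')
plt.show()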
5. We plot the FFT spectrum of each time series and the peaks in the spectrum.
import numpy as np
import matplotlib.pyplot as plt
from scipy.fft import fft, fftfreq, fftshift

SAMPLE_RATE = 128  # number of samples in, say, one second
N = 128  # length of the signal
xf = fftshift(fftfreq(N, 1 / SAMPLE_RATE))
yf = fftshift(fft(np.asarray(signal)))  # signal: one 128-sample feature column
plt.plot(xf, np.abs(yf), label='fft')  # plot the magnitude spectrum
FFT or the Fast Fourier Transform algorithm allows us to efficiently compute the DFT (Discrete Fourier Transform) of the signal which helps us gain information about the signal in the frequency domain, like its power, maximum frequency, etc. FFT is instrumental in signal processing applications and used widely for analysing time domain signals in frequency domain. More about how to interpret FFT is explained in this blog.
For this purpose, for each data point, we compute the 128-point DFT of each feature, which gives us 128 amplitude values at 128 frequency bins. The zero-frequency bin corresponds to the mean, or DC value, of the signal; the remaining bins split between the positive and negative frequency components, which for a real-valued signal are complex conjugates of each other. The above plot shows the peaks in the amplitude spectrum of each feature.
The aim of this experiment is to test if these features serve as important features which can classify the different surface types.
If we analyse the FFT spectra of multiple series_ids from one class, say 'fine_concrete' (the output class for series_id 0), and plot the spectra for series_id 32 and 40, which belong to the same class, we find some similarities in the 'angular_velocity_X', 'angular_velocity_Y' and 'linear_acceleration_X' spectra.
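To locate these peaks programmatically, a minimal sketch continuing from the FFT snippet above (the height threshold is an illustrative choice) could look like:
from scipy.signal import find_peaks

magnitude = np.abs(yf)  # magnitude spectrum computed above
peaks, _ = find_peaks(magnitude, height=magnitude.mean())  # indices of peaks above the mean level
plt.plot(xf, magnitude, label='fft')
plt.plot(xf[peaks], magnitude[peaks], 'x', label='peaks')
plt.legend()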
6. Plot PSD of the signal:
Power Spectral Density, or PSD, is estimated here as the FFT of the signal multiplied by its complex conjugate (i.e. its squared magnitude), normalized by the signal length. It tells us how the power of the signal is distributed across the frequency components.
SAMPLE_RATE = 128  # number of samples in, say, one second
N = 128  # length of the signal
xf = fftshift(fftfreq(N, 1 / SAMPLE_RATE))
yf = fftshift(fft(np.asarray(signal)))
PSD = np.real(yf * np.conj(yf)) / N  # squared magnitude, normalized by the signal length
plt.plot(xf, PSD, label='PSD')
We see the largest peak of the PSD spectrum at the zero frequency component, so most of the power is concentrated there. The amplitudes of the PSD spectrum can serve as important features.
7. Plot the autocorrelation function:
Autocorrelation of a signal is defined as the correlation of the signal with a delayed version of itself. The lag or the time delay is plotted on the x axis and the plot shows the autocorrelation function of the signal.
import statsmodels.api as sm

acf = sm.tsa.acf(signal)  # autocorrelation function of one feature column
xf = range(len(acf))
plt.plot(xf, acf, label='acf')
We notice a negative trend in almost all the features, indicating a correlation coefficient that decreases as the lag increases. The correlation coefficient on the y axis ranges from 1 to -1, with 1 indicating a strong positive correlation and -1 a strong negative correlation of the signal with its delayed version.
Feature Engineering:
Now, on to some basic feature engineering:
Statistical features:
Since the distributions of most of the features seem to be normal, we calculate some basic stats for each data point using the 128 time series data for each point.
for f in feature:
    feature_list[f + '_mean'] = data_.groupby(['series_id'])[f].mean()
    feature_list[f + '_min'] = data_.groupby(['series_id'])[f].min()
    feature_list[f + '_max'] = data_.groupby(['series_id'])[f].max()
    feature_list[f + '_std'] = data_.groupby(['series_id'])[f].std()
    feature_list[f + '_max_to_min'] = feature_list[f + '_max'] / feature_list[f + '_min']
    feature_list[f + '_abs_change_mean'] = abs_change(feature_list[f + '_mean'])
    feature_list[f + '_abs_change_in_change_mean'] = abs_change(feature_list[f + '_abs_change_mean'])
    feature_list[f + '_abs_min'] = np.abs(data_.groupby(['series_id'])[f].min())
    feature_list[f + '_abs_max'] = np.abs(data_.groupby(['series_id'])[f].max())
    feature_list[f + '_kurtosis'] = data_.groupby(['series_id'])[f].apply(pd.DataFrame.kurt)
Thus, for each column of an input data point (a 128-row time series), we get its mean, minimum, maximum, standard deviation, ratio of maximum to minimum, absolute change in mean, absolute second order change in mean, absolute value of the minimum, absolute value of the maximum, and the kurtosis (a measure of how heavy-tailed the distribution is). This gives us 10 features per column, i.e. 9*10 = 90 statistical features.
Rolling window features:
We calculate the same stats as above for a span of 8 data points using a rolling window and take its mean, giving us 90 rolling window features.
for f in feature_rolling:
    feature_list_rolling[f + '_rolling_mean'] = data_.groupby(['series_id'])[f].rolling(8).mean().fillna(0).groupby(['series_id']).mean()
    feature_list_rolling[f + '_rolling_min'] = data_.groupby(['series_id'])[f].rolling(8).min().fillna(0).groupby(['series_id']).mean()
    feature_list_rolling[f + '_rolling_max'] = data_.groupby(['series_id'])[f].rolling(8).max().fillna(0).groupby(['series_id']).mean()
    feature_list_rolling[f + '_rolling_std'] = data_.groupby(['series_id'])[f].rolling(8).std().fillna(0).groupby(['series_id']).mean()
    feature_list_rolling[f + '_rolling_max_to_min'] = feature_list_rolling[f + '_rolling_max'] / feature_list_rolling[f + '_rolling_min']
    feature_list_rolling[f + '_rolling_abs_change_mean'] = abs_change(feature_list_rolling[f + '_rolling_mean'])
    feature_list_rolling[f + '_rolling_abs_change_in_change_mean'] = abs_change(feature_list_rolling[f + '_rolling_abs_change_mean'])
    feature_list_rolling[f + '_rolling_abs_min'] = np.abs(data_.groupby(['series_id'])[f].rolling(8).min().fillna(0).groupby(['series_id']).mean())
    feature_list_rolling[f + '_rolling_abs_max'] = np.abs(data_.groupby(['series_id'])[f].rolling(8).max().fillna(0).groupby(['series_id']).mean())
    feature_list_rolling[f + '_rolling_kurtosis'] = data_.groupby(['series_id'])[f].rolling(8).apply(pd.DataFrame.kurt).fillna(0).groupby(['series_id']).mean()
Rolling average plots for each of the features:
Frequency domain features:
We add the FFT and PSD amplitudes and frequencies at the peaks as features. Since we have 128 values for each of the features, we get a total of 128*9 DFT coefficients i.e. 1152 DFT coefficients which are too many to add as features.
To reduce the number of features, we add only the indices and amplitudes of the most commonly occurring DFT peaks.
The above plot shows a histogram of the FFT peak counts, with the number of data points on the y axis and the number of FFT spectrum peaks on the x axis. We can see that the number of peaks per series is always less than 14, so we add the amplitudes and frequencies of the 14 most commonly occurring peak positions of each column as new features.
We repeat the same process for the PSD spectrum and add the amplitudes and frequencies at the 10 most commonly occurring peak indices as features. This process gives us 85 FFT spectrum features and 62 PSD spectrum features.
Autocorrelation features:
Through the same process, we add autocorrelation features, giving us 60 autocorrelation features.
Expanding window features:
An expanding window is similar to a rolling window. To calculate a statistic at time t, instead of taking the values from time t-(x-1) up to time t-1 with a rolling window of length x, the expanding window keeps growing and takes all the values from time t=0 up to time t-1. This gives us 90 expanding window features.
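A minimal sketch of one such feature with pandas, mirroring the rolling window code above (the dictionary name feature_list_expanding is assumed):
import pandas as pd

# feature: list of data columns, data_: training DataFrame (as in the snippets above)
feature_list_expanding = {}
for f in feature:
    expanding_mean = data_.groupby(['series_id'])[f].expanding().mean()
    feature_list_expanding[f + '_expanding_mean'] = (
        expanding_mean.fillna(0).groupby(['series_id']).mean())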
Feature importances:
We train an ExtraTreesClassifier to find out the most important features.
from sklearn.ensemble import ExtraTreesClassifier

extra_tree_classifier = ExtraTreesClassifier(n_estimators=100)
extra_tree_classifier.fit(X, y)
feature_importances = extra_tree_classifier.feature_importances_
We can see that the rolling window statistics and expanding window statistics seem to be really important features.
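For example, assuming X is a DataFrame so that X.columns holds the feature names, the top features can be listed as:
import pandas as pd

importances = pd.Series(feature_importances, index=X.columns)
print(importances.sort_values(ascending=False).head(20))  # 20 most important features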
Train:
We train a bunch of ML classification models with the features and observe their performances.
Logistic Regression:
Logistic regression is one of the most popular supervised learning classifiers. Intuitively, for a binary classification task we try to find the weights of the hyperplane that best separates the two classes, assuming they are linearly separable, by taking each point into consideration, calculating its distance from the hyperplane, and updating the weights with an update rule such as gradient descent on the logistic loss.
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import RandomizedSearchCV

model = LogisticRegression()
clf = OneVsRestClassifier(model)
clf = RandomizedSearchCV(clf, distributions, random_state=0, scoring='f1_micro')  # distributions: search space defined elsewhere, e.g. {'estimator__C': ...}
search = clf.fit(X_train, y_train)
I have used OneVsRestClassifier() to perform the multiclass classification, followed by a quick RandomizedSearchCV to find the best parameters; we then perform stratified k-fold cross validation, observe the performance of the model on the out-of-fold points, and calculate the average accuracy.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = 0
for idx, (train_index, test_index) in enumerate(skf.split(X_train, y_train)):
    X_train_fold = X_train.iloc[train_index]
    X_test_fold = X_train.iloc[test_index]
    y_train_fold = y_train[train_index]
    y_test_fold = y_train[test_index]
    model = LogisticRegression(C=2.84, penalty='l2')
    clf = OneVsRestClassifier(model)
    clf.fit(X_train_fold, y_train_fold)
    print('For fold', idx, 'score is :', clf.score(X_test_fold, y_test_fold))
    scores += clf.score(X_test_fold, y_test_fold)

print('Average accuracy:', scores / skf.n_splits)
We print the classification report to see the precision, recall and f1 score for each class, together with the micro averaged f1 score, which equals the accuracy of the model. The classes, in order, are:
[‘carpet’ ‘concrete’ ‘fine_concrete’ ‘hard_tiles’ ‘hard_tiles_large_space’
‘soft_pvc’ ‘soft_tiles’ ‘tiled’ ‘wood’]
- A 62 percent accuracy in total, mostly due to correctly classifying Class 6 or ‘soft_tiles’.
- We can observe Class 3 or ‘hard_tiles’ has 0 true positives and the model has been unable to classify data points of this class at all.
- Class 5 or ‘soft_pvc’ has the highest number of data points, and an f1 score of 0.7.
SVM with polynomial kernel:
Support Vector Machines are a class of supervised learning algorithms that find the best classifying hyperplane by maximising the margin of the nearest points to the hyperplane.
For linearly non-separable data, which cannot be separated in the original low dimensional space, we use kernel SVMs. The features are transformed into a higher dimensional space using a kernel function, similar to a feature transformation, which makes it possible to separate the data in that higher dimensional space. In the implementation, the kernel trick makes this possible without computing the transformation explicitly.
Any valid kernel can be used. Shown above are the decision boundaries formed by an RBF (radial basis function) kernel and a polynomial kernel.
We perform the same steps as above, i.e. parameter tuning using RandomizedSearchCV and cross validation using stratified k-fold cross validation.
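A minimal sketch of the polynomial-kernel SVM inside the same cross validation loop as before, reusing the fold variables (the hyperparameter values are illustrative, not the tuned ones):
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

model = SVC(kernel='poly', degree=3, C=1.0, gamma='scale')  # illustrative hyperparameters
clf = OneVsRestClassifier(model)
clf.fit(X_train_fold, y_train_fold)
print('Fold accuracy:', clf.score(X_test_fold, y_test_fold))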
- Class 6 or ‘soft_tiles’ has the highest f1 score, highest recall and the highest precision. This indicates a similar pattern as earlier when ‘soft_tiles’ class points were classified with maximum accuracy.
- Class 3 or ‘hard_tiles’ has the lowest f1 score of 0.22.
Decision Trees:
Decision trees are among the simplest supervised learning algorithms, based on simple 'if-else' rules. The algorithm repeatedly splits the data set on individual features until it obtains a pure node (i.e. only a single class's data points remain in the node) or reaches the maximum number of splits. To pick the best split, it uses a purity metric such as information gain (the reduction in entropy) or Gini impurity, choosing the split that produces the largest improvement in the metric.
We perform the same steps as above, i.e. parameter tuning using RandomizedSearchCV and cross validation using stratified k-fold cross validation.
- Class 6 or ‘soft_tiles’ has the highest f1 score
- Class 3 or ‘hard_tiles’ has the lowest f1 score
Ensemble Models:
Ensemble models combine multiple 'base learners' to form a robust model with low variance and low bias. Here, a 'base learner' typically refers either to a high bias, low variance classifier such as a decision stump (used in the AdaBoost algorithm), or to a low bias, high variance classifier such as a deep decision tree trained on bootstrap samples.
Ensembling addresses the weaknesses of these individual classifiers: combining the base learners fetches us a robust model with both low variance and low bias.
Ensemble models can be classified broadly into:
- Bagging
- Boosting
- Stacking
Bagging:
Bagging stands for bootstrapped aggregation. We use multiple base models with high variance and train each of them with a subset of train data obtained by row sampling or in other words, use bootstrap samples. The predictions of these high variance base learners are aggregated to create a low variance and more stable classifier or regressor, depending upon the task.
Random Forest:
A random forest uses bagging to create powerful classifiers. It uses multiple Decision Trees trained with a subset of the data sampled with replacement from the input data as the base learners. These base learners tend to have low bias and high variance. Bagging is done to reduce the variance and obtain a stable and powerful classifier.
What is the difference between bagging and random forest?
The random forest algorithm considers only a random subset of the features when making each split, unlike plain bagging, which considers all the features. This leads to decorrelated trees, each grown on its own subset of data and features.
In the case of bagging, on the other hand, where all features are considered, the most important features tend to be used near the root nodes to make the splits, leading to similar, correlated trees.
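A small sketch showing the feature-subsampling knob that distinguishes a random forest from plain bagging, reusing the fold variables from the cross validation loop above (the values are illustrative):
from sklearn.ensemble import RandomForestClassifier

# max_features='sqrt': only a random subset of features is considered at each split
rf = RandomForestClassifier(n_estimators=300, max_features='sqrt', random_state=42)
rf.fit(X_train_fold, y_train_fold)
print('Fold accuracy:', rf.score(X_test_fold, y_test_fold))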
Boosting:
Boosting is an iterative technique that combines multiple weak, high bias base learners into a powerful classifier. Boosting algorithms are extremely powerful and usually outperform most other classification algorithms. Decision trees are used as the base learners. A traditional boosting algorithm such as AdaBoost works roughly as follows.
- In the first iteration, a base learner is fit to the input data and the target variable and makes a prediction on the input data.
- The errors of the current model are computed and the next estimator is fit so as to concentrate on them: in AdaBoost the misclassified points are given higher weights, while in gradient boosting the next estimator is fit to the pseudo residuals, i.e. the gradients of the loss (for squared loss, simply the difference between the ground truth and the current prediction).
- The new estimator is added to the ensemble with its own weight, and the updated ensemble makes the predictions for the next iteration. The aim of every iteration is to learn from the erroneous predictions made in the previous one and build a better classifier step by step.
XGBoost:
- Stands for Extreme Gradient Boosting, a gradient boosting framework developed in 2014 by Tianqi Chen, known for its speed and performance.
How are Gradient Boosted Decision Trees (GBDTs) different from traditional boosting?
The weights of the existing trees are not changed when new trees, fit to the pseudo residuals, are added; instead, trees are added so that the loss function is minimized, following its gradient.
We use different techniques like subsampling, shrinkage and early stopping to prevent overfitting of the model. Subsampling induces randomization in the boosting process by using a subset of data during each iteration and not the whole dataset.
Shrinkage (the learning rate) reduces the impact of each added tree, so that we take many small steps towards the minimum of the loss function rather than a few large ones, which generally leads to better accuracy.
- Requires a lot of memory and computation power to calculate the best splits while growing the trees.
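A small sketch of an XGBoost classifier using the regularization techniques mentioned above, reusing the fold variables from the cross validation loop (hyperparameter values are illustrative, and a reasonably recent xgboost version is assumed):
from xgboost import XGBClassifier

# Assumes the surface labels have been label-encoded to integers 0-8
xgb = XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,        # shrinkage: damps the contribution of each new tree
    subsample=0.8,             # row subsampling per boosting round
    colsample_bytree=0.8,      # feature subsampling per tree
    early_stopping_rounds=50,  # stop if the validation metric stops improving
    eval_metric='mlogloss')
xgb.fit(X_train_fold, y_train_fold, eval_set=[(X_test_fold, y_test_fold)], verbose=False)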
LightGBM:
- A newer and ‘lighter’ implementation of GBDT, developed in 2016, by Microsoft
- XGBoost requires considerable memory and computation power due to the histogram based split finding algorithm.
What is histogram based split finding algorithm ?
As explained in this blog, the data is first distributed into bins on the basis of quantiles, and the information required for the split-gain computation, i.e. the sums of gradients of the left, right and current nodes, is stored per bin. Bucketing each data point into a bin reduces the computational complexity of split finding from O(#data points × #features) to O(#bins × #features), but the process still takes considerable computing power for large datasets.
- LightGBM uses GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) to downsample the data during histogram construction, reducing the complexity of split finding from O(#data points × #features) to O(#sampled data points × #bundles), where #sampled data points < #data points and #bundles << #features, which makes LightGBM much faster.
What is the use of GOSS ?
The main aim of this algorithm is to reduce the number of data points that have to be binned. It does so by keeping the data points with large gradients (indicating a large training error) and sampling only a fraction of those with small gradients (indicating a small training error).
What is the use of EFB ?
This algorithm, on the other hand, reduces the number of features by bundling mutually exclusive features (features that are rarely non-zero at the same time) together into 'bundles'.
More about LGBM is explained in this blog.
CatBoost:
- Stands for Categorical Boosting, developed by Yandex in 2017, specifically suited for categorical features
- CatBoost uses a technique called Minimal Variance Sampling (MVS) for finding the best splits.
What is MVS ?
Stochastic Gradient Boosting incorporates randomization into boosting by training each tree on a random subset of the input data rather than the whole data set.
MVS is a way of optimizing the randomization incorporated due to Stochastic Gradient Boosting by optimizing the sampling probabilities to increase the accuracy of the split scores.
What else is different in CatBoost?
1. LightGBM follows leaf wise tree growth: at each step it splits the leaf that yields the largest reduction in loss, instead of growing the tree equally at all leaves. This can lead to an unbalanced tree structure but works well for large datasets.
Originally, XGBoost followed a level wise tree growth approach, but it now has a version that implements the leaf wise approach too. More about leaf wise and level wise tree growth is answered by this question.
CatBoost follows a level wise tree growth approach and constructs symmetric trees, such that the left and right subtrees are split on the same feature. Symmetric trees make prediction up to 10x faster than non-symmetric trees and give better results in many cases.
2. CatBoost handles categorical features using ordered target statistics: each categorical value is replaced by a 'target statistic' computed only from the examples that precede it in a random permutation of the data, so the current example's target is never used in its own encoding.
The prediction shift problem, a bias that arises when the same data is used both to build the encodings/residuals and to fit the model, so that the predictions drift towards the training data rather than the true target, is recognized and overcome by ordered boosting (which applies the same ordering principle as the ordered target statistics).
More about CatBoost is explained in this blog.
An accuracy of 90 percent, best score till now!
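As a minimal sketch, a CatBoost classifier on the engineered features could be trained as follows, reusing the fold variables from the cross validation loop (the hyperparameters are illustrative rather than the tuned values):
from catboost import CatBoostClassifier

cat = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6,
                         loss_function='MultiClass', verbose=False)
cat.fit(X_train_fold, y_train_fold, eval_set=(X_test_fold, y_test_fold))
print('Fold accuracy:', cat.score(X_test_fold, y_test_fold))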
Stacking:
Stacking is an ensemble technique in which several base learners make predictions on subsets of the input data (typically out-of-fold), and those predictions are used to train a meta classifier that produces the final output. The base learners and the meta classifier can be any classification algorithms.
Intuitively, the base learners are trained on bootstrap samples and have high variance. Since we use different algorithms for the base learners, different decision surfaces are learnt, and the meta classifier learns how to combine the predictions of all the base learners into a more stable classifier.
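A minimal sketch with scikit-learn's StackingClassifier, reusing the fold variables from above (the choice of base learners and meta classifier here is illustrative):
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
                ('svc', SVC(kernel='poly', probability=True))],
    final_estimator=LogisticRegression(),
    cv=5)  # out-of-fold predictions of the base learners train the meta classifier
stack.fit(X_train_fold, y_train_fold)
print('Fold accuracy:', stack.score(X_test_fold, y_test_fold))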
Conclusion:
Gradient boosting takes the stage once again!
Further work:
In this blog, I have only experimented with ML algorithms. Further, we can experiment with deep learning models using Conv1D for extracting time series features, autoencoders to extract features, RNNs to learn sequential information and much more.
All the code is available here and you can reach me here.
Thanks to Kaggle for hosting the competition so we can keep learning!
References:
- https://www.kaggle.com/c/career-con-2019
- https://www.appliedaicourse.com/
- https://arxiv.org/pdf/1905.00252.pdf
- https://www.hackerearth.com/practice/machine-learning/prerequisites-of-machine-learning/basic-probability-models-and-rules/tutorial/
- https://towardsdatascience.com/what-makes-lightgbm-lightning-fast-a27cf0d9785e
- https://dataaspirant.com/catboost-algorithm/#:~:text=Leaf%20Growth%20A%20significant%20change%20in%20the%20implementation,feature-split%20pair%20is%20performed%20to%20choose%20a%20leaf.
- https://ataspinar.com/2018/04/04/machine-learning-with-signal-processing-techniques/
- https://catboost.ai/docs/concepts/parameter-tuning.html