DATA SCIENCE


WHAT IS DATA SCIENCE?

We’ve heard it said that “Data science is analytics carried out by folks in Silicon Valley”. That’s probably as good a description as any, and who are we to argue with Teradata’s very own Stephen Brobst?

No matter what the precise definition, data science concerns the extraction of knowledge and insight from all types of data using computer hardware, software and mathematical algorithms. Terms such as ‘data mining’ and ‘big data’ are often used in conjunction with data science.

Maybe ‘doing funky stuff with data’ is as good a definition of data science as any?

PREDICTIVE ANALYTICS

Predictive analytics has been around for many decades. Don’t let those hipster ‘data scientists’ tell you otherwise! Business-to-consumer (B2C) industry actors such as retail, finance and telecoms have long been using historic data to predict customer churn, credit risk, fraud and lifetime value.

Predictive analytics has historically been the preserve of PhD types with expensive software and even more expensive servers. With the rise of cloud computing and open source software, predictive analytics - especially Machine Learning (ML) - has become far more widely adopted.

MACHINE LEARNING (ML)

WHAT IS MACHINE LEARNING?

Machine Learning is the practice of building algorithms derived from a dataset that can be used to solve a practical problem - machines don’t ‘learn’ as such.

MACHINE LEARNING TYPES

Machine Learning comes in four broad types: supervised, semi-supervised, unsupervised and reinforcement learning.

MACHINE LEARNING ALGORITHMS

The classic Machine Learning algorithms include:

  • linear regression
  • logistic regression
  • decision trees (acyclic graphs)
  • support vector machines (SVM)
  • k-nearest neighbours

There are plenty of other ML algorithms, but most requirements can be addressed using the algorithms above.
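
All of the above are implemented in the Python scikit-learn library. A minimal sketch, fitting the classification algorithms among them to one synthetic dataset (the data is generated, purely for illustration):

    # Fit several of the classic algorithms on one synthetic dataset
    # and compare their accuracy on held-back data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=42)

    models = {
        'logistic regression':  LogisticRegression(max_iter=1000),
        'decision tree':        DecisionTreeClassifier(max_depth=5),
        'SVM':                  SVC(),
        'k-nearest neighbours': KNeighborsClassifier(n_neighbors=5),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        print(f'{name}: {model.score(X_test, y_test):.3f}')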

FEATURE ENGINEERING

Identifying, extracting, integrating and cleansing data in preparation for data science takes up the bulk of a data scientist’s time. It’s not just about picking an algorithm, hitting ‘go’ and waiting for insights to magically reveal themselves. Far from it.

Raw data must be turned into a usable dataset before data science techniques can be applied. This process is known as ‘feature engineering’. Techniques, several of which are sketched in code after this list, include:

  • one-hot encoding
  • binning
  • normalisation
  • standardisation
  • missing feature replacement
  • imputation
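
A minimal sketch combining several of these techniques into a scikit-learn pipeline - the ‘age’ and ‘region’ columns are made-up examples:

    # Feature engineering sketch: impute missing values, standardise
    # numerics and one-hot encode categoricals in one pipeline.
    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    raw = pd.DataFrame({
        'age':    [34, np.nan, 51, 29],                # numeric, one value missing
        'region': ['north', 'south', np.nan, 'east'],  # categorical, one value missing
    })

    numeric = Pipeline([
        ('impute', SimpleImputer(strategy='median')),  # imputation
        ('scale',  StandardScaler()),                  # standardisation
    ])
    categorical = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),  # missing feature replacement
        ('onehot', OneHotEncoder(handle_unknown='ignore')),   # one-hot encoding
    ])

    prep = ColumnTransformer([
        ('num', numeric, ['age']),
        ('cat', categorical, ['region']),
    ])
    print(prep.fit_transform(raw))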

ALGORITHM SELECTION

Applying the most appropriate ML algorithm to well-prepared data is not always simple. Factors such as explainability, available memory (RAM), the number of features and examples in the dataset, the mix of categorical and numeric features, data non-linearity, training speed and prediction speed can all be assessed to arrive at an initial approach.

See the scikit-learn algorithm (estimator) selection cheat sheet.

TRAINING, VALIDATION & TEST DATA

Input data must be split into three datasets: training, validation and test.

Training data typically accounts for 70-95% of the total dataset. The remainder is split equally into validation and test datasets - the ‘holdout’ sets. No training examples must appear in either holdout set.

The model is built against the training data. The validation set is used to choose the best algorithm and tune the hyperparameters. Finally, the test data is used to assess the model prior to implementation.
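
A minimal sketch of such a three-way split using scikit-learn’s train_test_split; the 70/15/15 ratio is one choice within the range above:

    # Three-way split: 70% training, 15% validation, 15% test.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=42)

    # Carve off a 30% holdout first, then split it equally
    # into the validation and test sets.
    X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.30,
                                                        random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold,
                                                    test_size=0.50,
                                                    random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # 700 150 150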

MODEL UNDERFITTING & OVERFITTING

A model is regarded as ‘underfitting’ if it makes many mistakes on the training data (high bias). A model that predicts the training data very well but performs poorly against the validation and test data is regarded as ‘overfitting’ (low bias, high variance).

Underfitting can be addressed by trying more complex models or adding more features and examples to the dataset.

Model overfitting can be addressed by trying a simpler model or by reducing the dimensionality of the dataset.
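
One way to see both failure modes is to vary model complexity and compare training and validation accuracy. A brief sketch on synthetic data, using decision tree depth as the complexity dial:

    # Underfitting vs overfitting: sweep decision tree depth and watch
    # the gap between training and validation accuracy.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,
                               random_state=42)  # flip_y adds label noise
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                      random_state=42)

    for depth in (1, 5, None):  # shallow, moderate, unlimited
        tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
        tree.fit(X_train, y_train)
        print(f'depth={depth}: train={tree.score(X_train, y_train):.2f}, '
              f'validation={tree.score(X_val, y_val):.2f}')

    # A depth-1 stump typically scores poorly on both sets (underfitting,
    # high bias); an unlimited tree scores ~1.00 on training but noticeably
    # worse on validation (overfitting, high variance).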

MODEL REGULARISATION

‘Regularisation’ aims to build a less complex model with somewhat higher bias in exchange for lower variance, balancing the bias-variance trade-off. Common regularisation techniques include ‘least absolute shrinkage and selection operator’ (lasso) and ‘ridge regularisation’ (Tikhonov regularisation).
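
A brief sketch contrasting the two in scikit-learn: lasso penalises the L1 norm of the coefficients and can drive some to exactly zero, while ridge penalises the L2 norm and only shrinks them (alpha sets the penalty strength):

    # Lasso (L1) vs ridge (L2) regularisation on a noisy regression problem
    # where only 5 of the 30 features are informative.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso, Ridge

    X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                           noise=10.0, random_state=42)

    lasso = Lasso(alpha=1.0).fit(X, y)
    ridge = Ridge(alpha=1.0).fit(X, y)

    # Lasso zeroes out uninformative coefficients (built-in feature
    # selection); ridge merely shrinks them towards zero.
    print('lasso zero coefficients:', sum(c == 0 for c in lasso.coef_))
    print('ridge zero coefficients:', sum(c == 0 for c in ridge.coef_))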

MODEL ASSESSMENT

Once a model has been built against the training data and tuned against the validation data, it is assessed against the test dataset. A model that predicts the test dataset accurately is said to ‘generalise’ well.

HYPERPARAMETER OPTIMISATION

Selecting good hyperparameters for use in ML algorithms is the job of the data scientist. Hyperparameter tuning techniques include grid search, random search, Bayesian optimisation, gradient-based optimisation and evolutionary optimisation.
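
A minimal grid search sketch using scikit-learn’s GridSearchCV; the parameter grid is illustrative, not a recommendation:

    # Exhaustive grid search over SVM hyperparameters with
    # 5-fold cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=42)

    param_grid = {
        'C':     [0.1, 1, 10],          # penalty parameter
        'gamma': ['scale', 0.01, 0.1],  # RBF kernel coefficient
    }
    search = GridSearchCV(SVC(), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_, round(search.best_score_, 3))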
