A
Accuracy
Accuracy measures how often a classification model classifies an observation correctly. It is calculated as the sum of true positives and true negatives divided by the total number of observations.
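As a minimal sketch of the calculation described above, the function below takes the four confusion matrix counts and returns the accuracy. The function name and the example counts are assumptions made for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of observations that were classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for zero observations")
    return (tp + tn) / total


# Hypothetical counts: 40 true positives, 50 true negatives,
# 4 false positives and 6 false negatives.
print(accuracy(tp=40, tn=50, fp=4, fn=6))  # 0.9
```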
Anomaly
An anomaly is an abnormal observation that deviates from what would be considered normal. An anomaly is also called an outlier or a novelty.
Artificial Intelligence
Artificial Intelligence (AI) is the theory and development of computer systems that can perform tasks that would normally be considered to require human-like intelligence, such as visual perception, speech recognition, decision making, and translation between languages. AI is mostly developed using machine learning.
B
Bias
A biased dataset is a dataset that is not representative of the real world. It often underrepresents certain values. A biased dataset will result in a biased model, which is not generalizable.
Binary
A binary problem is an either-or problem, i.e. there are only two options.
Business Continuity
A Business Continuity Plan is a guideline for how the company handles different disaster scenarios, such as a hacker attack, data lost in a natural disaster, or the end of a partnership with a key supplier.
C
Causation
Causation is the relationship between cause and effect. It describes the situation where an observed correlation is due to a causal relationship.
Classification
Classification is a category of problems in which you wish to determine which class a given object belongs to.
Cloud
The Cloud consists of computer services such as servers, storage, and databases, which are available on demand. This means you only pay for the resources used, and therefore do not have to pay for the cost and administration of owning the hardware. The service is provided by a cloud provider such as Google, Amazon, or Microsoft.
Clustering
Computer Vision
Computer Vision is an area within artificial intelligence which focuses on extracting information from images. Examples are identification of objects in images or classification of quality. Multiple mathematical methods exist; however, machine learning has proven to be especially useful. In the manufacturing industry, Computer Vision is often referred to as Machine Vision.
Condition-based maintenance
Condition-based maintenance describes a maintenance approach where you try to prevent further damage and downtime on the equipment by setting fixed limits on certain values and triggering alarms when those limits are crossed.
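The fixed-limit idea can be sketched as a simple threshold check. The measurement names and limit values below are hypothetical and chosen only for illustration; real limits depend on the equipment.

```python
# Hypothetical fixed limits per measurement (assumed values).
LIMITS = {"temperature_c": 80.0, "vibration_mm_s": 7.1}


def triggered_alarms(reading: dict, limits: dict) -> list:
    """Return the names of all measurements that exceed their fixed limit."""
    return sorted(
        name
        for name, limit in limits.items()
        if name in reading and reading[name] > limit
    )


reading = {"temperature_c": 85.2, "vibration_mm_s": 3.4}
print(triggered_alarms(reading, LIMITS))  # ['temperature_c']
```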
Confidence Level
The confidence level is a value between 0 and 1 that indicates how confident the model is in a prediction.
Confusion Matrix
A confusion matrix is a table showing the number of true positives, true negatives, false positives, and false negatives. It provides the information needed to evaluate a model’s performance on a classification problem.
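A minimal sketch of how the four cells can be counted for a binary problem, assuming the true and predicted labels are available as two equally long lists. The function name and the example labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 = condition present, 0 = condition absent.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))
# {'tp': 3, 'tn': 3, 'fp': 1, 'fn': 1}
```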
Correlation
A correlation describes a mutual relationship between two or more variables. The correlation can be either positive or negative and is often expressed as a value between -1 and +1.
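One common way to obtain a value between -1 and +1 is the Pearson correlation coefficient. The sketch below assumes that this is the measure meant and uses made-up example values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


hours = [1, 2, 3, 4]
print(pearson(hours, [2, 4, 6, 8]))  # positive correlation, close to +1
print(pearson(hours, [8, 6, 4, 2]))  # negative correlation, close to -1
```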
D
Data Governance
Data Governance is the process of managing data policies, data availability, data ownership, data security, and data quality. It is important that these elements together with Business Continuity are reflected upon in a data strategy.
Data Lake
A Data Lake is a data storage repository that can store unbounded amounts of data. The data is often stored in a raw format, meaning the data has not been processed between the source and storage. A Data Lake is often used in analytical contexts. However, the analysis often requires data scientists or personnel of similar expertise, since unprocessed data is more complex to access and use.
Data Lakehouse
A Data Lakehouse is a data platform approach which combines the best of both Data Warehouses and Data Lakes. This means that you will be able to cover both Business Intelligence and Machine Learning use cases, since the platform is able to manage both structured data and unstructured data.
Data Leakage
Data leakage is the use of a value during development of a model that will not be available at the time of prediction. The leaked value often contains the information you are trying to predict.
Data Mesh
A Data Mesh is a data management architectural paradigm that enables analytical data at scale by following a domain-driven design approach and utilising distributed systems. This means that the data and its management are divided into business domains. The concept of data mesh therefore consists of both technical implementation details and organisational management principles.
Data Swamp
A Data Swamp is a Data Lake which has large amounts of data stored that nobody uses. The reason for this occurring could be that the consumers have lost trust in the data quality or have difficulty finding the right version of a dataset.
Data Warehouse
A Data Warehouse is a type of data management system that collects data from multiple sources following a structured schema. The system also makes data accessible through a single access point, often using SQL. Data Warehouses are mostly used in analytical contexts such as business intelligence.
Deep Learning
Deep learning describes when neural networks contain many hidden layers, increasing the complexity and the possibility for learning from data.
E
ELT
Extract, Load, Transform (ELT) describes processes that transfer data from one or more systems to another where the data gets stored in its raw format. ELT is often mentioned in relation to data lakes, where the data is only transformed when it is used.
ETL
Extract, Transform, and Load (ETL) describes processes that transfer data from one or more systems to another where the data gets transformed before it’s stored. ETL is often mentioned in relation to data warehouses where the database enforces a certain structure the data has to fit.
F
False Negative
False negatives are when a test wrongly predicts that a condition is not there. In other words, it is wrong that the test is negative.
False Positive
False positives are when a test wrongly predicts that a condition is there. In other words, it is wrong that the test is positive.
Feature
A feature is an individual measurable characteristic that is used to predict the output value. Vibration, temperature, and sound are examples of features.
Feature Importance
Feature importance is a way to analyse which features have the highest impact on the model’s predictions.
Frequency Aliasing
Frequency Aliasing occurs when signal data (sensor data such as vibration) is sampled at too low a frequency. This causes the signal to be incorrectly translated from analog to digital, resulting in signal distortion. The Nyquist-Shannon sampling theorem can be used to determine a sufficiently high sampling frequency.
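As an illustration, the sketch below computes the frequency at which a pure tone appears after sampling. The function name and the example frequencies (a 70 Hz vibration sampled at 100 Hz) are assumptions. Because 100 Hz is less than twice 70 Hz, the samples cannot be told apart from those of a 30 Hz tone.

```python
import math


def alias_frequency(signal_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling at sample_rate_hz."""
    nearest_multiple = round(signal_hz / sample_rate_hz) * sample_rate_hz
    return abs(signal_hz - nearest_multiple)


print(alias_frequency(70, 100))  # 30

# The sampled 70 Hz tone matches a 30 Hz tone with inverted sign.
sample_rate = 100
for n in range(10):
    t = n / sample_rate
    s70 = math.sin(2 * math.pi * 70 * t)
    s30 = math.sin(2 * math.pi * 30 * t)
    assert abs(s70 + s30) < 1e-9
```

A tone below half the sampling rate is returned unchanged, e.g. `alias_frequency(30, 100)` gives 30.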
Frozen Data
Frozen data refers to the situation where data from sensors is not transmitted to the desired destination, such as a data platform. Often, frozen data is expressed as the same measurement over an extended period of time.
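A simple way to spot candidates for frozen data is to look for long runs of identical measurements. The sketch below is one possible check; the function name, the minimum run length, and the example readings are assumptions, and a constant value can of course also be legitimate.

```python
def frozen_runs(values, min_length=5):
    """Return (start, end) index pairs of runs where the value never changes."""
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        # A run ends at the end of the data or when the value changes.
        if i == len(values) or values[i] != values[start]:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = i
    return runs


# Hypothetical temperature readings; indices 2 to 6 repeat the same value.
readings = [20.1, 20.3, 21.0, 21.0, 21.0, 21.0, 21.0, 20.8]
print(frozen_runs(readings, min_length=4))  # [(2, 6)]
```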
G
Generalizable
Generalizability describes whether the predictions from a machine learning model are applicable in production settings. Thus it describes whether the results from the machine learning model can be transferred to new data and give the same accuracy in its predictions.
H
Hyperparameter Tuning
Hyperparameter tuning describes the process of finding the model settings (hyperparameters) that give the most generalizable model.
I
Imbalanced Dataset
An imbalanced dataset is a dataset in which the majority of the data belongs to one class, leaving one or more other classes underrepresented.
Input Data
Input data is all the features used to predict the output value.
Internet of Things
Internet of Things (IoT) describes physical objects, such as machines, that are connected to the internet through sensors and/or have their functionality extended through embedded systems.
L
Labeled Data
Labeled Data is necessary for Supervised Learning, where the output values are known during Training of the machine learning model.
Logistic Regression
Logistic Regression is a statistical model that can be used for binary classification problems.
M
Machine Learning
Machine Learning (ML) describes the category of algorithms that use data to achieve the wanted goals, rather than relying on hand-written logic and mathematical equations. ML algorithms learn from examples of data, and the result of such algorithms is a model.
Machine Learning Engineer
A Machine Learning Engineer (ML Engineer) works with the entire machine learning lifecycle: designing, implementing, productionizing, monitoring, and maintaining machine learning systems.
Mean
Mean is the same as an average.
Mean Absolute Error
Mean Absolute Error (MAE) is used to evaluate regression problems. MAE is the mean of the absolute differences between the true and predicted values. Large and small errors are treated as equally important in MAE since the errors are not squared as in Mean Squared Error.
Mean Squared Error
Mean Squared Error (MSE) is used to evaluate regression problems. MSE is the mean of the squared differences between the true and predicted values. MSE is useful when we wish to punish predictions that have a large error, in contrast to Mean Absolute Error.
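The difference between the two error measures can be sketched with two sets of made-up predictions that have the same Mean Absolute Error but very different Mean Squared Errors. The function names and values are assumptions for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Mean of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 10, 10, 10]
many_small_errors = [8, 12, 8, 12]   # every prediction is off by 2
one_large_error = [10, 10, 10, 18]   # a single prediction is off by 8

print(mean_absolute_error(y_true, many_small_errors))  # 2.0
print(mean_absolute_error(y_true, one_large_error))    # 2.0
print(mean_squared_error(y_true, many_small_errors))   # 4.0
print(mean_squared_error(y_true, one_large_error))     # 16.0
```

The single large error is punished four times harder by MSE, while MAE rates both sets of predictions the same.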
Model
A model is a generalised representation of some specific data, often created by a machine learning algorithm. The model is unique for the given problem and the data it has been trained on. A trained model can be used to reason about new data points.
Multicollinearity
Multicollinearity describes when two or more features are highly correlated, in the extreme case perfectly. It is therefore possible to predict one feature by knowing the other. If two features are perfectly correlated with one another, one should be removed, as no information is lost and multicollinear variables can reduce model performance.
N
NaN-value
NaN stands for “Not a Number” and is the result of calculations that are undefined, e.g. dividing 0 by 0. NaN values are also sometimes used incorrectly for values that are missing; here it would be best to use null-values.
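The distinction can be sketched in Python, where `None` plays the role of a null-value and `float("nan")` is a NaN. The helper `describe` is a hypothetical name used for illustration. Note that Python itself raises an error on division by zero, so the undefined calculation used here is infinity minus infinity.

```python
import math

undefined = float("inf") - float("inf")  # an undefined calculation gives NaN
missing = None                           # null: no value has been set


def describe(value):
    """Tell a missing value apart from an undefined and a valid one."""
    if value is None:
        return "missing"
    if isinstance(value, float) and math.isnan(value):
        return "undefined"
    return "valid"


print(describe(undefined))     # undefined
print(describe(missing))       # missing
print(describe(21.5))          # valid
print(undefined == undefined)  # False, NaN is not equal to itself
```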
Neural Network
A Neural Network (NN), sometimes called Artificial Neural Network (ANN), is a computer system within machine learning which is inspired by the neurons and synapses of the human brain. When there is a lot of information in the data, the complexity increases, which is often reflected in the neural network having several hidden layers, called deep learning.
Null-value
A null-value is an observation where there is data missing. Null is used to represent that no value has been set. A dataset with many null-values implies that we are missing important information.
Nyquist-Shannon Sampling Theorem
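The Nyquist-Shannon Sampling Theorem states that a signal must be sampled at more than twice its highest frequency component to be captured without Frequency Aliasing. It can therefore be used to determine the sampling frequency required for a signal.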
O
Outlier Detection
Outlier/Anomaly detection is the task of identifying data points which are significantly different from the majority of the dataset. Solutions exist within both statistics, such as the interquartile range (IQR), and machine learning, such as clustering.
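A minimal sketch of the statistical approach, assuming the common rule of thumb that flags values more than 1.5 times the IQR outside the lower and upper quartiles. The factor 1.5, the quartile method, and the example data are assumptions rather than part of the definition above.

```python
import statistics


def iqr_outliers(values, factor=1.5):
    """Return the values lying more than factor * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical measurements with one value far from the rest.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(data))  # [102]
```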
Overall Equipment Effectiveness
Overall Equipment Effectiveness (OEE) is a measure of the productivity of production. It accounts for losses such as unplanned downtime, stops between shifts, and bad quality products.
Overfitting
Overfitting is a condition of a machine learning model where the model has learned the patterns from the training data too well, so much so that the model cannot generalise to new data.
P
P-F interval
The P-F interval is the interval between the moment a sign of potential failure is registered and the moment of malfunction (breakdown).
P-value
A P-value is a statistical measure of how likely it is that a result at least as extreme as the observed one could have occurred by random chance. The lower the P-value, the lower that chance. The value is used to assess whether a hypothesis should be rejected. A result is often said to be statistically significant if the P-value is less than 0.05, meaning there is less than a 5% chance that such a result would have occurred by random chance alone.
Precision and Recall
Precision and Recall are used to evaluate classification problems. Precision expresses how many of the predicted positives are true positives, i.e. it compares true positives to false positives. Recall, also called sensitivity, expresses how many of the actual positives the model finds, i.e. it compares true positives to false negatives.
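Using the standard definitions, precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below computes both from confusion matrix counts; the function names and the example counts are made up for illustration.

```python
def precision(tp: int, fp: int) -> float:
    """Share of predicted positives that are actually positive."""
    if tp + fp == 0:
        raise ValueError("precision is undefined without predicted positives")
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found."""
    if tp + fn == 0:
        raise ValueError("recall is undefined without actual positives")
    return tp / (tp + fn)


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
print(precision(tp=30, fp=10))  # 0.75
print(recall(tp=30, fn=20))     # 0.6
```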
Predictive Maintenance
Predictive Maintenance describes a maintenance approach in which you predict future breakdowns to improve uptime and reduce maintenance costs.
Preventive Maintenance
Preventive Maintenance describes a maintenance approach in which you prevent breakdowns through planned maintenance, for instance based on time.
Probability
A probability is a value between 0 and 1 that indicates how likely an event is to occur. The closer the value is to 1 the more likely it is to occur.
Q
Qualitative Variable
A qualitative variable is a variable whose values can be categorised into specific groups, such as sex or age group.
Quantitative variables
Quantitative variables are measurable. The values are numeric, so you can calculate a mean and standard deviation of them.
Quartiles
Quartiles divide data, sorted in ascending order, into fractions. Typically the data is divided at the lower quartile (25%), the median (50%), and the upper quartile (75%). Quartiles are a good aggregation method as they reveal anomalies as well as the distribution of the data.
R
Re-training
Re-training describes the process of training a machine learning model again with new data, so that the model can learn the newest patterns and changes. Re-training is important to maintain good predictions and a high accuracy.
Reactive Maintenance
Reactive maintenance describes a maintenance approach in which you perform maintenance after a breakdown has occurred.
Recovery Point Objective
Recovery Point Objective (RPO) is the maximum duration for which we can accept losing data, for instance because data cannot be collected. RPO is often described in conjunction with Recovery Time Objective, and is defined in the Business Continuity Plan.
Recovery Time Objective
Recovery Time Objective (RTO) defines how long we can accept not having access to data. RTO is often described in conjunction with Recovery Point Objective, and is defined in the Business Continuity Plan.
Regression
Regression is a set of problems where we estimate the relationship between an output variable and one or more input variables in order to predict new data points. In contrast to classification problems, regression is used to predict a number such as the Remaining Useful Life or the sales price of a house.
Reinforcement Learning
Reinforcement Learning (RL) is a specific learning approach in machine learning, where the model learns by trial and error. The objective is to maximize reward in a particular situation, for instance by maximizing points in a game.
Remaining Useful Life
Remaining Useful Life (RUL) is a branch within predictive maintenance, which predicts when equipment will fail due to wear and tear. By having RUL predictions, it is possible to schedule maintenance well in advance, as well as buy spare parts before a breakdown occurs.
Representative Dataset
A Representative Dataset means that all possible outcomes that may occur are represented in the dataset. If the dataset is not representative it is biased. Analyses created on a non-representative dataset cannot be used for decision-making.
Reproducibility
Reproducibility means that it is possible to recreate the results of an analysis if the same data, code, and tools are used.
Right Data
Right Data is the concept of being strategic about what data is collected in the company. Data is not collected until you know for what purpose it is needed, and how it can help the company reach its strategic goals.
S
Spurious Correlation
A Spurious Correlation is a correlation which is random by nature. The linear relationship we can spot in the data is coincidental and there is no natural relationship between the two variables.
Standard Deviation
The standard deviation is used to quantify the amount of variation in the dataset.
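As a small illustration, Python's standard `statistics` module can compute the standard deviation directly. The example values are made up; `pstdev` treats the data as the whole population, while `stdev` treats it as a sample and therefore gives a slightly larger value.

```python
import statistics

# Hypothetical measurements with a mean of 5.
values = [2, 4, 4, 4, 5, 5, 7, 9]

population_sd = statistics.pstdev(values)  # divides by n
sample_sd = statistics.stdev(values)       # divides by n - 1

print(population_sd)  # 2.0
print(sample_sd > population_sd)  # True
```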
Structured Data
Structured Data follows a fixed schema, such as anything that can be organised in rows and columns as in Excel. Structured data can always be handled in the same way. Structured data is often stored in a relational database such as a data warehouse.
Sudden Failure
Sudden failures are errors that occur on machines due to either faulty mounting or random causes. They often occur shortly after maintenance has been performed and can, through predictive maintenance, be detected in time to turn off the machine, reducing the likelihood of hazardous situations.
Supervised Learning
Supervised Learning is one learning approach in machine learning, where the true output is known while training the model. In other words, we use the labeled output values to guide our machine learning model towards which patterns to look for.
T
Train, Validation, Test - Split
To train and verify a machine learning model, a dataset is split into train, validation, and test datasets. The majority of the data will be used for training the model. Validation data is used to validate the results of the training session, and the test data is used to evaluate whether the model is generalizable.
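A minimal sketch of such a split, assuming a random shuffle and a 70/15/15 division. The shares, the seed, and the function name are assumptions; for time-dependent data a chronological split is often more appropriate than shuffling.

```python
import random


def train_validation_test_split(rows, validation_share=0.15,
                                test_share=0.15, seed=42):
    """Shuffle the rows and split them into train, validation and test lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_share)
    n_validation = round(len(shuffled) * validation_share)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


train, validation, test = train_validation_test_split(range(100))
print(len(train), len(validation), len(test))  # 70 15 15
```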
Training
A machine learning model is often trained on historical data. In the training phase, the machine learning model is introduced to a larger dataset that it uses to learn patterns in the dataset. Based on patterns in historical data, the trained model will be able to make predictions on unseen data.
Trimmed mean
A trimmed mean is a mean where the most extreme values are removed before calculating the mean. Typically, the lowest and highest 5% of the values are removed, which often results in a mean that is not distorted by anomalies.
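The calculation can be sketched as follows, assuming that the same share of values is removed from each end of the sorted data. The function name and the example values are made up for illustration.

```python
def trimmed_mean(values, trim_share=0.05):
    """Mean after dropping the lowest and highest trim_share of the values."""
    if not 0 <= trim_share < 0.5:
        raise ValueError("trim_share must be in [0, 0.5)")
    if not values:
        raise ValueError("cannot take the mean of no values")
    ordered = sorted(values)
    k = int(len(ordered) * trim_share)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data: 18 normal readings plus one very low and one very high.
values = [10] * 18 + [1, 1000]
print(sum(values) / len(values))  # plain mean, pulled up by the value 1000
print(trimmed_mean(values))       # 10.0
```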
True Negative
True negatives are when a test correctly predicts a condition is not there.
True Positive
True positives are when a test correctly predicts a condition is there.
U
Underfitting
Underfitting is a condition of a model where the model performs badly during training as well as on the test dataset. If the model is underfitted, the model is not generalizable.
Unlabeled Data
Unlike Labeled Data, observed outputs are not known with unlabeled data. You train a model to detect patterns, without having a labeled output value.
Unstructured Data
Unstructured Data is any data that is not structured in a predefined way such as images, audio, text, etc. Unstructured data can have structured metadata which describes the content of the data.
Unsupervised Learning
Unsupervised Learning is a learning approach in machine learning, where the true output is unknown during training. Unsupervised Learning is used when you do not know the true output, and therefore instead seek patterns to be able to e.g. group customers based on their buying behavior.
V
Variety
Variety, in the context of Big Data, refers to having data from different sources. These could include both structured data and unstructured data.