50+ Data Science Terms
Terms defined as part of an introductory assignment for a Big Data Science class at NYU. The professor gave us these definitions because he was sick of students not knowing key words during lecture. I was at first annoyed that this was even given as an assignment, but I actually did learn a few things just from looking up all of these topics. And yes, I know, deep learning is defined twice.
- Prescriptive Analytics - Unlike predictive analytics (predicting the future) and descriptive analytics (insights into what has happened), prescriptive analytics uses known parameters to find the best choice in a business analytics environment. It is often used to mitigate future risk.
- Kafka - Kafka is a distributed messaging system. It is fast, highly scalable, and redundant. Kafka is often used for ‘real-time’ processing. It was first built at LinkedIn to move data reliably from source systems to wherever it needed to go, and its use has since exploded. Kafka can be thought of as a big data logging system.
- Apache Spark - Spark is a data processing engine tuned for handling massive amounts of data. Spark can handle terabytes of data because it distributes the work across a cluster of hundreds to thousands of physical or virtual servers. Spark is optimized to run in memory, which helps it process data much faster than older disk-based MapReduce approaches like Hadoop's. Spark supports a range of workloads, such as streaming data, running machine learning algorithms that would fail on a single machine, and ETL. Spark is open source.
- Neural Networks - Inspired by the human brain, a neural network is a collection of interconnected nodes that take in input and produce output. Neural networks are a branch of machine learning and use training input to adjust the weights on the connections between nodes. With every round of input that is fed in, the weights are adjusted based on the accuracy of the output. Neural networks are responsible for most of the modern advances in AI because they are extremely powerful. They tend to keep getting better with more and more quality training data. They can also be run on distributed machines, so advances in hardware like sophisticated GPUs have led to their widespread use (a code sketch follows the example below).
- Example: Yann LeCun developed a neural network that could read in 28x28 pixel images of handwritten digits and classify them as the numbers 0-9. This neural network was then used by the Postal Service to automatically scan and decipher zip codes written on mail. It was one of the first modern neural networks.
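- Code sketch: a rough illustration of training a small feed-forward network, assuming scikit-learn is available; it uses the bundled 8x8 digits dataset rather than the 28x28 images mentioned above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 8x8 images of handwritten digits 0-9, flattened into 64 features each.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# A small feed-forward network with one hidden layer of 50 nodes; the weights
# are adjusted over many passes through the labelled training data.
net = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```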
- Big Data - Data that is large in 3 different aspects: volume, velocity, and variety. Volume refers to the size of the data; many businesses need to handle terabytes of data to stay competitive. Velocity refers to the speed at which data is coming in; data from RFID tags, sensors, and social media is created every second and is considered high velocity data. Finally, variety refers to the massive range of types of data that is created. Some data is structured, other data is unstructured. Rarely does a large variety of data share a common schema, so it needs to be wrangled together to make any use of it.
- Trust-Based Recommender Systems - Recommender systems based on the trust between users. Many collaborative recommender systems use the similarity between different users to help make recommendations. Trust-based systems also incorporate the trust that users have in one another to make better recommendations, since many users value recommendations generated from other users they trust.
- Linear Regression - Linear regression is a common algorithm in predictive analysis. Unlike discrete classification problems, linear regression is used to predict continuous values from past data points. It uses the features of past data to fit a line to the data that can then be used to predict the value of new data, and a cost function to find the best-fitting line. The normal equation from linear algebra can also be used; however, inverting a matrix is expensive, so for high dimensional data gradient descent is often used instead. A code sketch follows the example below.
- Example - Suppose you want to predict the value of a house. You have data on current houses: their prices and all of their features (rooms, bathrooms, square feet, location, age, etc.). You can use this data to fit a linear regression model and then use that model to estimate the price of a new house.
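- Code sketch: a minimal illustration of fitting a line with the normal equation, assuming numpy is available; the house features and prices are made up.

```python
import numpy as np

# Toy training data: each row is [square_feet, bedrooms], the target is price.
X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]], dtype=float)
y = np.array([245000, 312000, 279000, 308000, 405000], dtype=float)

# Add a column of ones so the model can learn an intercept term.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation: theta = (X^T X)^-1 X^T y (lstsq is the numerically stable way).
theta, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Predict the price of a new, unseen house.
new_house = np.array([1, 2000, 4], dtype=float)  # [intercept, sq_ft, bedrooms]
print("estimated price:", new_house @ theta)
```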
- Gradient Boosting - A machine learning technique that combines an ensemble of models to produce a better prediction model; the ensemble members are often decision trees. Gradient boosting is often seen as more accurate than linear regression. The idea is to fit a weak model that performs about average, then train additional weak models that handle the values the earlier learners missed. Eventually you end up with an ensemble of models that together predict values better (a code sketch follows the source link below).
- Example: AdaBoost, or Adaptive Boosting, is one of the most popular boosting models. The weak learners are simple decision trees. As more difficult predictions show up, AdaBoost adds more trees. To predict a value, it weights the predictions from all the trees and combines them into the overall prediction.
- Source: https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
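- Code sketch: a rough illustration of the residual-fitting idea, assuming scikit-learn and numpy are available; the data is synthetic and the learning rate and tree depth are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

# Start from a constant prediction, then repeatedly fit a small tree to the residuals.
prediction = np.full_like(y, y.mean())
learning_rate = 0.1
trees = []
for _ in range(100):
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

# The ensemble prediction for new data is the base value plus every weak learner's contribution.
X_new = np.array([[1.0]])
y_new = y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
print("boosted prediction at x=1:", y_new)
```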
- Knowledge Discovery - The process of discovering useful knowledge from a collection of data. Knowledge discovery is often used to extract information for a variety of purposes, from marketing to fraud detection to predictive analytics. In the past it was done manually, but the era of big data has required automated solutions. Knowledge discovery also includes more than just parsing data for information: it covers data storage and access, cleansing, and machine learning.
- Class Label (in Data Classification) - A discrete attribute whose value you aim to predict using the values of other features. Class labels are used in supervised machine learning classification problems. Supervised classification algorithms are often fed a large amount of labelled data and then try to predict the labels of unseen data.
- KNN (K-Nearest Neighbors) - One of the simplest machine learning algorithms; it can be used for regression or classification. The idea is to find the k closest data points to a new data point. For regression, you average the values of the nearest neighbors to estimate the value of the new data point; for classification, you assign the new data point the most common label among its nearest neighbors. K-nearest neighbors hinges on a good selection of the parameter k and is also very sensitive to noise and outliers. Different distance measures can be used to find the nearest neighbors (a code sketch follows the example below).
- Example - Going back to the housing price example: once you pick a k, you can use k-nearest neighbors to find the k houses closest to your new house and estimate its value by averaging the values of those neighbors.
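- Code sketch: a bare-bones k-nearest-neighbors regression for the housing example, assuming numpy and Euclidean distance; the numbers are made up (in practice the features should also be scaled).

```python
import numpy as np

# Toy feature vectors for known houses: [square_feet, bedrooms], plus their prices.
houses = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4], [2350, 5]], dtype=float)
prices = np.array([245000, 312000, 279000, 308000, 405000], dtype=float)

def knn_estimate(new_house, k=3):
    # Euclidean distance from the new house to every known house.
    distances = np.linalg.norm(houses - new_house, axis=1)
    # Indices of the k closest houses.
    nearest = np.argsort(distances)[:k]
    # Regression: average the prices of the nearest neighbors.
    return prices[nearest].mean()

print("estimated price:", knn_estimate(np.array([1800.0, 4.0])))
```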
- Analytics - Multidimensional field that combines mathematics, statistics, modeling, and machine learning to extract relevant patterns or knowledge from data. Analytics is often synonymous with fact-based decision making performed by software that uses large amounts of data and often sophisticated algorithms.
- Hadoop 2.0 - Hadoop 2.0 is a set of open source programs used to handle big data. Hadoop combines a distributed file system (HDFS) with the MapReduce paradigm. MapReduce splits work into two steps: mapping data to certain keys (groups) and then reducing the mapped data (aggregating by key). These steps can be done in parallel, which makes Hadoop useful for large data problems. Hadoop 2.0 is popular because it is relatively cheap, scalable, and fault tolerant.
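- Code sketch: the MapReduce idea itself, shown in plain Python as a word count; a real Hadoop job distributes the map and reduce steps across a cluster.

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Map step: emit (key, value) pairs -- here, (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle step: group all values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce step: aggregate each key's values -- here, sum the counts.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```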
- Deep Belief Networks - A type of deep neural network that uses multiple layers of a graphical model with directed and undirected edges. While deep neural networks are feed-forward, DBNs have undirected layers called Restricted Boltzmann Machines. These undirected layers can be trained using unsupervised learning algorithms, typically Contrastive Divergence. Deep belief networks are considered a type of deep generative model.
- Deep Learning - Deep learning is essentially a neural network with many hidden layers. In a neural network, there will always be an input layer and an output layer: this is where the inputs are fed in and where the values get weighted and summed into the output. However, many neural networks have more layers in between these two. Values from the inputs are propagated through the entire network before reaching the output layer. Any neural network with a lot of hidden layers is considered ‘deep’, and training and using these deep networks is considered deep learning.
- Convolutional Neural Networks - A subset of deep, feed-forward neural networks that specialize in image tasks. Just as neural nets were inspired by brains, CNNs were inspired by the visual cortex of animals. CNNs use two techniques called convolutions and pooling. Convolutions apply a filter (function) to the input and pass the result on to the next layers, similar to how visual stimuli are captured and processed before being passed on to the brain. Pooling combines outputs from one layer into pooled values that are then passed further into the network (a code sketch follows the source link below).
- Example: CNNs have been used to read handwritten numbers and letters and are often used in Optical Character Recognition. Today they can recognize far more advanced images; for example, they are used in self-driving cars to determine whether an image taken near the car contains another car, a pedestrian, a sign, etc.
- Source: https://en.wikipedia.org/wiki/Convolutional_neural_network
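- Code sketch: a hand-rolled illustration of the two building blocks named above, one convolution filter followed by max pooling, on a toy 6x6 image using numpy; this is not a full CNN.

```python
import numpy as np

# A tiny 6x6 grayscale "image": the left half is dark, the right half is bright.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# A 3x3 filter that responds strongly to vertical dark-to-bright edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

# Convolution: slide the filter across the image, taking a dot product at each position.
feature_map = np.zeros((4, 4))
for i in range(4):
    for j in range(4):
        feature_map[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

# Max pooling: collapse each 2x2 patch of the feature map into its maximum value.
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(feature_map)  # large values where the edge between the two halves sits
print(pooled)
```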
- Feature Selection - Picking the right features of the data to use in an analytics model. The quality of the data and the selection of certain features can fundamentally change the performance of a model. Feature selection usually refers to selecting the most relevant aspects of the data you are working with. It aims not only to produce the best results by picking the most valuable attributes of the data, but also to use less data, which reduces the complexity of a model and helps with performance. Automated feature selection approaches exist, but domain experts are often the best resource for figuring out which parts of the data are most valuable for a given problem.
- Business Intelligence - The use of software to transform data into intelligence that a business can use to make better decisions. Business intelligence uses past and present data to describe the current state of the business; unlike business analytics or predictive analytics, it does not try to predict what will happen in the future. Business intelligence often relies on graphs, dashboards, and other GUIs that allow key decision makers to understand the data quickly.
- Cross-validation - Partitioning the sample data into train and test sets in order to evaluate the performance of a model. Cross-validation cycles through combinations of the data so that each part is used for both testing and training. K-fold cross-validation is the most popular type: it splits the data into k parts, and each part takes a turn as the test set while the rest are used for training. The model is evaluated on each of these k splits, and the results are averaged to give a better estimate of performance (see the sketch after the source link below).
- Example: When doing supervised learning, you want to train a model that can predict test data. Assume we are dealing with sentiment analysis. Our dataset might contain thousands of labelled texts that are either positive, neutral, or negative. To train a model that can predict sentiment, we use some of this data as training data and save some for testing. However, how we split the data into test and train sets can affect the measured performance. Cross-validation lets us split the data into groups, try different combinations of training/test data, and measure accuracy across all of the combinations.
- Source: https://www.openml.org/a/estimation-procedures/1
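- Code sketch: a minimal k-fold loop in plain Python; the data and the stand-in "model" (majority-class prediction) are toys chosen just to show the mechanics.

```python
import random

# Toy dataset: 20 labelled examples of the form (feature, label).
data = [(x, x % 2) for x in range(20)]
random.seed(0)
random.shuffle(data)

k = 5
fold_size = len(data) // k
scores = []

for i in range(k):
    # Fold i is the test set; everything else is the training set.
    test = data[i * fold_size:(i + 1) * fold_size]
    train = data[:i * fold_size] + data[(i + 1) * fold_size:]

    # Stand-in "model": always predict the most common label in the training set.
    train_labels = [label for _, label in train]
    majority = max(set(train_labels), key=train_labels.count)

    accuracy = sum(majority == label for _, label in test) / len(test)
    scores.append(accuracy)

# The cross-validated estimate is the average score over all k folds.
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```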
- Graph Database - Many sets of data are highly connected and can be represented as a graph. Traditional databases use keys to express relationships between tables, but this is often not the best way to model highly connected data. A graph database stores the data as nodes connected by edges, which better models the relationships between entities. For example, social applications and company relationships are a great fit for graph databases, since the relationships between friends and companies are often the most vital insight in that data.
- Confusion Matrix - A technique for evaluating the performance of a classification algorithm. The confusion matrix makes it easy to observe the number of true positives, true negatives, false positives, and false negatives by setting up a table (matrix) of the predicted outputs against the expected outputs for each class. From this matrix, more advanced evaluations can be derived, such as accuracy, misclassification rate, true positive rate, false positive rate, specificity, and precision.
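- Code sketch: building a 2x2 confusion matrix by hand and deriving a few of the metrics above; the labels and predictions are made up.

```python
# Ground-truth labels and a classifier's predictions (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # false negatives

print([[tp, fn],   # row: actually positive
       [fp, tn]])  # row: actually negative

accuracy  = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)          # true positive rate
print(accuracy, precision, recall)
```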
- Split Validation - A simpler version of cross-validation. Split validation simply randomly splits the sample data into a train and test set and then uses that to train a model and evaluate the model. Unlike cross-validation, split validation is not iterative: it does not generate every combination of test and train subsets. Split validation is often a good first step to see how a model is performing before moving on to more advanced evaluation techniques.
- Sentiment Analysis - The process of determining whether a piece of text (or spoken word) is positive, negative, or neutral. Sentiment analysis is also called opinion mining: deriving the opinion of a speaker or writer. It is a powerful technique that can measure how people feel about a topic. Researchers have used it to find correlations between sentiment and seemingly unrelated things like stock prices. Sentiment analysis is also widely used in marketing and by businesses that need to monitor the sentiment of their consumers.
- Source: MonkeyLearn: Sentiment Analysis
- Feature (in data analytics) - A piece of measurable information about something. A collection of features (or even a single feature) can be used to describe something or someone. For example, your age, address, height, weight, SSN, and phone number are all features that describe you. Feature is also synonymous with attribute, property, or field. In machine learning, features are fed into models that use them to predict or classify unknown sets of features into predefined groups (or undefined clusters based on similarity to seen sets of features).
- Semi-Structured Data - Structured data is data that lives in a form that adheres to a strict schema. For example, a SQL table where all names are in one column and are strings, and all ages are in another column of ints. Unstructured data is the opposite of structured data and is data that lives in a way where no pieces of information adhere to a strict schema. For example, the two sentences ‘Bob is 30 years old’ and ‘We are throwing a surprise party for Bob’s thirtieth birthday’ both contain the same information about Bob but in two completely different schemas. Semi-Structured Data lives in between these two types of data.
- Examples: Emails are an example of this. An email has structured fields like header, subject, and from, but the content of the body can be completely unstructured.
- Source: http://whatis.techtarget.com/definition/semi-structured-data
- Structured Data - Structured data is data that lives in a form that adheres to a strict schema. Structured data can be thought of as a class in a programming language: certain values must adhere to certain types and the data must adhere to the defined structure.
- Example: a SQL table where all names are in one column and are strings, and all ages are in another column of ints
- Unstructured Data - Unstructured data is the opposite of structured data and is data that lives in a way where no pieces of information adhere to a strict schema. Most data in the world is unstructured. Typically data only becomes structured after humans transform it into a schema.
- Example: The two sentences ‘Bob is 30 years old’ and ‘We are throwing a surprise party for Bob’s thirtieth birthday’ both contain the same information about Bob but in two completely different schemas
- Data Clustering - Dividing a population of data into a number of groups such that the data points in the same group are more similar to each other than to data points in other groups. Clustering aims to segregate data that shares similar traits (features) into groups (clusters). Hard clustering assigns each data point to exactly one cluster, while soft clustering assigns each data point a probability of belonging to each cluster. There are many different ways to cluster data; K-Means clustering is the most popular and simplest method. On top of the many types of clustering, there are also many different ways to measure similarity between data points.
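- Code sketch: a compact K-Means loop in numpy on two synthetic blobs of 2-D points; real use would typically rely on a library implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two synthetic blobs of 2-D points.
points = np.vstack([rng.normal([0, 0], 0.5, size=(50, 2)),
                    rng.normal([5, 5], 0.5, size=(50, 2))])

k = 2
# Initialize the centroids at k randomly chosen points.
centroids = points[rng.choice(len(points), k, replace=False)]

for _ in range(10):
    # Hard clustering: assign each point to its nearest centroid.
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Move each centroid to the mean of the points assigned to it.
    centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])

print("cluster centers:\n", centroids)
```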
- Granger Causality - A statistical notion of causality based on prediction. If a signal X1 Granger-causes a signal X2, then past values of X1 can be used to predict future values of X2. X1 refers to values from one time series and X2 from another, separate series. Granger causality is often used to explore whether one set of values can help predict the future values of another (often more valuable) set of values. For example, it has been used to test whether past sentiment values of tweets can predict future stock prices.
- Example: Granger causality was used to see whether past tweets had any use in predicting future stock values. It has been used in economics for many years but has made a resurgence with the rising interest in machine learning, data science, and big data.
- Source: http://www.scholarpedia.org/article/Granger_causality
- Data Classification - Organizing data into categories to make it more useful and effective. Classifying data into certain groups can make retrieving data easier and also allows for better data security. For example, data can be classified into high risk, internal, and public data. With that classification scheme, a company can organize sensitive data into the highest security category and assign access rights to only certain employees. Less sensitive data can be accessed by any employee, and public data can be accessed by anyone. Classifying data allows a business to allocate resources more efficiently (rights only need to be assigned for high security data rather than all data). Data is usually classified by business domain experts or according to government regulations (personal health information under HIPAA, for example).
- Supervised Learning - The machine learning task of gaining insight from labelled training data. A training example is often a tuple of a value (a data point of features) and a class label. A supervised learning algorithm trains a model using this labelled training data, and the model can then be used to predict the label of a new value. Supervised learning usually performs better the more labelled data is fed into it for training. A large part of the machine learning pipeline is spent labelling training data so it can be useful for future unlabelled data.
- Example: You can use supervised learning to predict whether a patient will live or die given some condition. With breast cancer, you can amass a dataset of patients who either lived or died along with the features of their tumors, health, etc. You can label these data points with live/die and train a model. Then, new patients can put their features into the trained model and it will predict whether they will live or die.
- Triplestore - An RDF (Resource Description Framework) triplestore is a graph database that stores semantic facts. A triplestore stores data as a network of objects with edges connecting (linking) them to other objects. Triplestores are unique in that you can optionally add schema models called ontologies: formal descriptions of the data that allow for unique ways of querying it. It is called a triplestore because data is stored as subject -> predicate (relationship) -> object. For example, you can ask for ‘all the friends of Bob’ and the triplestore will take the predicate ‘friend of’ and return all objects linked to the subject ‘Bob’ by that predicate edge.
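- Code sketch: a toy illustration (not a real triplestore API) of how the subject -> predicate -> object pattern answers the ‘friends of Bob’ query.

```python
# Each fact is a (subject, predicate, object) triple.
triples = [
    ("Alice", "friend_of", "Bob"),
    ("Carol", "friend_of", "Bob"),
    ("Bob",   "works_at",  "Acme"),
    ("Dave",  "friend_of", "Alice"),
]

# "All the friends of Bob": match on the predicate and the object.
friends_of_bob = [s for s, p, o in triples if p == "friend_of" and o == "Bob"]
print(friends_of_bob)  # ['Alice', 'Carol']
```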
- Unsupervised Learning - The opposite of supervised learning. Unsupervised learning is fed data that is not pre-labelled. It is a type of machine learning that tries to extract hidden patterns from unlabelled data. Unlike supervised learning, unsupervised learning is harder to evaluate for accuracy because there are no correct labels assigned to the data. Clustering algorithms are the most common type of unsupervised learning algorithms; they try to unearth insights from data by clustering data points into similar groups.
- Example: You can use unsupervised learning on a group of news articles to uncover clusters of topics. Given a massive dataset of text articles, you can tune a clustering algorithm to uncover groups of articles.
- Source: https://www.mathworks.com/discovery/unsupervised-learning.html
- Training Data vs Test Data - Machine learning algorithms (and other data models) need quality data to work well. Training data is the data fed into a learning model in preparation for it to be used on new, unseen data. In a supervised learning algorithm, training data is the pre-labelled data used as the training set, and gathering quality training data is often the most important step in setting up a learning model. Test data is a subset of the sample data that is held back from training and used to measure the accuracy of the model once it has been trained. Test data can be generated using split validation or cross-validation methods. A validation set is yet another held-out subset that is not used in training; it is used to evaluate the model once again and check that it performs similarly on it and on the test set.
- Deep Learning - Deep learning is essentially a neural network with many hidden layers. In a neural network, there will always be an input layer and an output layer: this is where the inputs are fed in and where the values get weighted and summed into the output. However, many neural networks have more layers in between these two. Values from the inputs are propagated through the entire network before reaching the output layer. Any neural network with a lot of hidden layers is considered ‘deep’, and training and using these deep networks is considered deep learning.
- Ensemble Methods - Using two or more related analytics models and combining their results into a single score, classification, etc. Often a single model cannot reliably analyze a data set; combining different models can help data scientists get better results. The most famous example of an ensemble method is the random forest, which is simply a collection of different decision trees. The rise of distributed platforms like Hadoop and Spark has aided new ensemble methods by allowing different models to be run in parallel.
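- Code sketch: a short random forest example, assuming scikit-learn is available; the random forest is itself an ensemble of decision trees whose votes are combined.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 decision trees are trained on random subsets of the data and features;
# their individual predictions are combined into one ensemble prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("ensemble accuracy:", forest.score(X_test, y_test))
```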
- ETL Jobs - Extract, Transform, Load jobs. ETL is crucial for aggregating different sets of data and making them useful. Extract means ingesting homogeneous or heterogeneous data from a variety of data sources. Transform is where you take the extracted data and convert it into the correct format or schema so it can later be used for querying and analytics; preprocessing and cleaning data are often also part of the transform step. Load is where you place the transformed data into a database, data store, data mart, or data warehouse, wherever it can most easily be used by any future solution.
- SQL - Structured Query Language: a programming language used to manage data stored in a relational database. SQL allows users to access many different records with few commands (thanks to the relational aspect of the data) and allows data to be accessed without specific indexes. For example, you can run a query to get all objects created before the year 2000 without knowing the index of every record you will get back.
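- Code sketch: the kind of query described above, run through Python's built-in sqlite3 module; the table and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (name TEXT, created_year INTEGER)")
conn.executemany("INSERT INTO objects VALUES (?, ?)",
                 [("alpha", 1997), ("beta", 2003), ("gamma", 1999)])

# Retrieve every object created before the year 2000 without knowing any record's index.
rows = conn.execute(
    "SELECT name, created_year FROM objects WHERE created_year < 2000").fetchall()
print(rows)  # [('alpha', 1997), ('gamma', 1999)]
conn.close()
```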
- Alternative Data (in the financial investment context) - Non-traditional data that is collected and used to gain insight for the investment process. Alternative data refers to any data used for this purpose that is not published by the company itself. It can come from many different places and is considered big data due to its volume and variety. Alternative data is often unstructured and harder to access; however, it can provide insights that ‘traditional data’ (data from the company or press releases) cannot. For example, satellite images of the number of cars at a particular store over the course of a year, phone location data collected near stores, and social media data about a certain company are all considered alternative data.
- CRISP-DM - Cross-Industry Standard Process for Data Mining. It outlines the common, well-tested steps that experts use to solve data mining problems. CRISP-DM is a cyclical process. First, you must understand the business domain and the problem that needs to be solved. Next, you gather data and work to understand it and its insights. Then you try out and train models that attempt to solve the original problem; at this step you can go back to previous steps if the evaluation results are not promising. Once the results are good, you deploy the model into the real use case. Of course, the world changes, so the model will not work forever: pieces of this process often need to be updated and redone to keep the solution working effectively.
- Data Mining - The practice of automatically searching big data to extract patterns, trends, and insights that are more complex than standard analysis can find. Data mining uses advanced statistics and mathematical models to perform knowledge discovery on the data. It is often used to try to predict outcomes and frequently uses machine learning.
- Biologically Inspired Data Mining - Many data mining techniques use clustering algorithms to gain insights from data. Biologically inspired data mining uses clustering algorithms drawn from nature, including bird flocking and swarms. The technique models data as a collection of entities that flock together based on certain features. Often, leaders are also identified to surface novel insights (such as high influence users in social media networks).
- Predictive Analytics - Together with data, predictive analytics uses statistics and machine learning to predict the likelihood of future events. Predictive analytics uses historical data to predict a future outcome when given new data. Many businesses use predictive analytics to identify risks (insurance or loans) or explore new opportunities (predict mergers or stock activity).
- Model (in the analytics context) - Defining data with mathematics. More formally, an analytical model is a mathematical relationship among features in a data set. Good models can estimate values or classify data. Models are not only defined by the mathematical equations that go into them, but also the quality of the data that they try to describe.
- Yarn - Yet Another Resource Negotiator. Hadoop needs sophisticated software to manage its distributed nature; Yarn helps the cluster manage resources and complete jobs. Yarn can be thought of as a large-scale distributed operating system for big data applications like Hadoop. For Hadoop to work correctly, all of its nodes need to communicate effectively, and Yarn fills this role.
- Machine Learning - A branch of artificial intelligence that allows machines to learn from data without being explicitly programmed. Classical artificial intelligence used pre-built knowledge representations that were essentially predefined rules. Machine learning instead feeds a corpus of data into algorithms that grow and update themselves as they see new data. Machine learning borrows many aspects from statistics; however, in the era of big data, machine learning algorithms are being used in many new and novel ways to gain insights from large corpora of data.
- Data Warehouse - A key part of business intelligence. A data warehouse stores a large amount of data that is relevant to a business. Unlike a database, a warehouse stores all historical data and is often immutable. A database is designed for rapid querying, updating, and transformation of relevant current data; a warehouse instead is a storage place for a longer range of data that can be retrieved for business intelligence.
- Feature Extraction - A type of dimensionality reduction that represents something with a more useful, less redundant, and more descriptive set of features than the total set of features originally used to describe it. Unlike feature selection, which uses a variety of techniques to select which features are best for a given problem, feature extraction aims to reduce the size of the feature vector describing something without losing much information about it. For example, two features about me could be my weight in pounds and my weight in kilograms. These features are redundant and we only need one; feature extraction would recognize that they describe the same thing, so one can be removed without a loss of information. Feature selection, on the other hand, would involve reasoning about whether my weight is useful at all for a given problem.
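- Code sketch: PCA (a common feature extraction technique) applied to a made-up table where weight appears redundantly in both pounds and kilograms, assuming scikit-learn is available.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up people: [weight_lbs, weight_kg, height_cm]; the first two columns are redundant.
people = np.array([
    [150, 68.0, 170],
    [180, 81.6, 182],
    [200, 90.7, 190],
    [130, 59.0, 160],
])

# Extract 2 components that capture nearly all the information in the 3 original features.
pca = PCA(n_components=2)
reduced = pca.fit_transform(people)
print(reduced.shape)                  # (4, 2)
print(pca.explained_variance_ratio_)  # almost all of the variance survives in 2 components
```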
- Stream Mining - Very similar to data mining, but extracting information and insights from a continuous stream of data rather than from a static dataset. Streams of data are often impossible to store in a traditional way, so stream mining tries to achieve the same results as data mining while working with a continuous, high-velocity, and potentially infinite stream of new data. Because of the velocity and volume of many streams, stream mining techniques often only sample small pieces of a stream at intervals. Stream mining is used to extract insights from sensor readings, internet traffic, social media posts, stock pricing updates, and audio and video feeds.
- Cluster Analysis - Using the technique of data clustering to explore relationships within a set of data. It is often the first step in data mining, where clustering the data into similar groups can reveal crucial relationships. Cluster analysis is not one algorithm but a task that can be solved in many ways (with many different types of data clustering algorithms, for example). Domain experts must often be consulted to determine the best clustering technique for a specific problem.
- NoSQL - Stands for Non SQL, i.e. a non-relational database. NoSQL databases can be graph, key-value, or document databases (anything that is not relational). NoSQL databases are easier to develop, can scale easily, and can be faster for accessing records in some instances. SQL databases have to define a schema in order to be constructed effectively; NoSQL databases enforce no schema and instead often use keys to retrieve values, columns, or entire objects (JSON or XML, for example). MongoDB is an example of a very popular NoSQL database.
- Source: https://aws.amazon.com/nosql/
- RDBMS - A relational database management system. This system allows a user to create, update, and administer a relational database. Most RDBMSs use SQL to manage the database. RDBMSs are often used by enterprises that need sophisticated software to manage massive amounts of data. They also allow companies to add security to their databases (assigning roles to certain users that allow them to access certain data or use certain queries). The most popular RDBMSs are made by Oracle and Microsoft.
- Hash-joins - A method of joining different relations in a relational database. A hash join picks the smaller relation and loads it into an in-memory hash table keyed on the join column. Then the larger relation is read and each row is joined to any matches in the in-memory relation. Hash-joins can be extremely fast if the smaller relation fits entirely into memory. Hash-joins can only be used for equi-joins.
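- Code sketch: the build-and-probe idea in plain Python with toy rows; a real database engine manages memory and disk pages on top of this.

```python
# Smaller relation: (customer_id, customer name).
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
# Larger relation: (order_id, customer_id, amount).
orders = [(101, 1, 50.0), (102, 3, 20.0), (103, 1, 75.0), (104, 2, 10.0)]

# Build phase: hash the smaller relation into memory, keyed on the join column.
hash_table = {cust_id: name for cust_id, name in customers}

# Probe phase: scan the larger relation and join rows on matching keys (an equi-join).
joined = [(order_id, hash_table[cust_id], amount)
          for order_id, cust_id, amount in orders
          if cust_id in hash_table]
print(joined)
```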
- Predictive Maintenance - The use of predictive analytics to predict when equipment needs maintenance. Predictive maintenance uses the current status of equipment to predict when maintenance is needed, versus preventive maintenance, which uses scheduled preventive measures to stave off failure. For example, replacing smoke detectors every few years is preventive maintenance. If a detector were equipped with sensors or other data-gathering sources, predictive maintenance could be used to instead replace or update the detector only when the data called for it. Predictive maintenance has the potential to save companies a massive amount of money by allowing them to replace and fix equipment only when they need to, versus on a fixed, blind schedule.