What Are the Most Commonly Used Data Mining Techniques?
Data scientists employ different ways to store and query data, as well as a variety of models to analyze it. The techniques and terminology are plentiful, and aspiring data analysts must be familiar with them.
Data mining and machine learning share some characteristics in that both fall under the data science umbrella; however, there are important differences.
While data mining is the process of extracting information from data, machine learning is the practice of teaching computers to analyze data on their own. Specifically, data scientists develop algorithms to teach computers to perform many of the data mining processes that companies require, increasing both efficiency and the volume of analysis that can be completed.
Machine learning often is used as a component of data mining. Many companies use machine learning to perform multi-attribute segmentation analysis on their customer base. Streaming services, for example, can use machine learning to sift through users’ viewing habits and recommend new genres or programs they might like. The better the algorithm, the more accurate and detailed those recommendations can be.
Machine learning is one of the advanced topics of Columbia Engineering Data Analytics Boot Camp, which covers the technical and practical skills needed to pursue a career in data analytics.
The best data mining projects can produce the sharpest and most useful insights. But if they remain static numbers on a page, they’re worthless to decision-makers.
Data visualization allows analysts to share their discoveries through charts, graphs, scatterplots, heat maps, spiral graphics, flow charts, and more. These visualizations can be static or interactive and, most importantly, they can effectively convey critical insights needed to make key business decisions.
Data visualization is a major part of Columbia Engineering Data Analytics Boot Camp — in fact, the fourth module is devoted to it. Learners go in-depth in visualization, which is key to making insights actionable.
Data mining applies various statistical methods to analyze large data sets, and data mining platforms (such as those discussed above) can make data mining easier. However, learning data mining statistical techniques provides analysts with greater understanding of the work they do and how to do it more effectively.
Some statistical techniques include regression, classification, resampling (utilizing multiple samples from the same data set), and support-vector machines (an algorithmic subset of classification).
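To make one of these techniques concrete, here is a minimal sketch of resampling in Python: a bootstrap that draws repeated samples (with replacement) from a single dataset to estimate how much the mean might vary. The sales figures and sample sizes are hypothetical.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=42):
    """Draw resamples with replacement and record each resample's mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(data, k=len(data))  # sample with replacement
        means.append(statistics.mean(sample))
    return sorted(means)

# Hypothetical daily sales counts
sales = [12, 15, 14, 10, 18, 20, 11, 16, 13, 17]
means = bootstrap_means(sales)
low, high = means[25], means[974]  # rough 95% interval for the mean
```

The spread between `low` and `high` gives the analyst a sense of how much confidence to place in the observed average.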
Statistical modeling and forecasting are key elements of the introductory module to Columbia Engineering Data Analytics Boot Camp.
Data analysts apply the association rule to find relationships in non-intuitive data patterns and understand what business value is associated with those patterns, if any.
Transaction analysis is a common form of association. Retailers scan an aggregation of many customers' shopping trips, looking across many transactions to find patterns. While the analysis will highlight patterns you might expect to find (e.g., peanut butter and jelly, mayonnaise and bread), association also uncovers patterns that indicate non-intuitive relationships, such as coffee creamer and air freshener. A deeper dive is then conducted on these identified associative patterns, and they are either validated and passed on as insights (e.g., the coffee creamer/air freshener pattern occurs due to seasonal items such as gingerbread creamer and balsam pine air freshener) or discarded as anomalies (e.g., overlapping promotional schedules that happen to put two items on sale at the same time).
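A minimal sketch of this idea in Python: count how often pairs of items appear together across transactions and flag pairs that co-occur frequently. The baskets and the support threshold of two are illustrative; full association rule mining (e.g., the Apriori algorithm) adds measures such as confidence and lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per transaction
baskets = [
    {"peanut butter", "jelly", "bread"},
    {"coffee creamer", "air freshener"},
    {"peanut butter", "jelly"},
    {"coffee creamer", "air freshener", "bread"},
    {"mayonnaise", "bread"},
]

pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):  # sort for stable pair keys
        pair_counts[pair] += 1

# Pairs seen in at least two transactions become candidate associations
candidates = {pair for pair, n in pair_counts.items() if n >= 2}
```

The surviving candidates would then get the "deeper dive" described above before being treated as insights.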
The classification technique looks at the attributes associated with a dataset where a certain outcome was common (e.g., customers who received and redeemed a certain discount). It then looks for those common attributes across a broader dataset to determine which data points are likely to mirror that outcome (e.g., which customers will be likely to redeem a certain discount if it is given to them). Classification models can help businesses budget more effectively, make better business decisions, and more accurately estimate return on investment (ROI).
Decision trees, a subset of machine learning, are algorithms used when running classification or regression models in data mining. The algorithm can ask simple yes or no questions of data points to classify them into groups and lead to helpful insights. For example, a decision tree may be used by financial institutions to assess loan eligibility based on relevant data such as income, account tenure, percentage of credit utilized, and credit score.
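The loan example above can be sketched as a hand-built tree of yes/no questions in Python. All of the thresholds here are hypothetical, not real lending criteria; in practice, a decision tree algorithm would learn these splits from historical data.

```python
def loan_decision(income, tenure_years, credit_utilization, credit_score):
    """Walk a hand-built tree of yes/no questions.
    All thresholds are hypothetical, not real lending criteria."""
    if credit_score < 640:
        return "decline"
    if income < 30000:
        return "decline"
    if credit_utilization > 0.5:
        return "review"
    if tenure_years >= 2:
        return "approve"
    return "review"

decision = loan_decision(income=55000, tenure_years=3,
                         credit_utilization=0.2, credit_score=720)
```

Each `if` statement plays the role of one yes/no question; a data point's answers route it to a leaf of the tree.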
With clustering, data miners identify and create groups in a dataset based on similar characteristics. The process divides the data into subsets, or clusters, for analysis. Doing so provides for more informed decision-making based on targeted collections of data.
Analysts use several types of clustering techniques. They employ the partitioning method, for instance, to divide data into clusters to be analyzed separately. The K-Means algorithm is a popular method of partitional clustering. This algorithm works by first having the user choose K, the number of clusters, along with an initial set of K centroids (or central points) and, optionally, a maximum number of iterations through which the algorithm will run. Objects closest to these points are grouped to form K clusters, and with each iteration, the centroid of each cluster is recalculated from its members and updated accordingly. This process is repeated until the centroids no longer change, or until the maximum number of iterations is reached. A fun way to use the K-Means algorithm in partition clustering is to look for underutilized or undiscovered players when choosing a fantasy football team. The algorithm can use a superstar player's stats as the centroids, and then run through iterations identifying clusters of similar players by attribute.
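Here is a minimal one-dimensional K-Means sketch in Python that follows the steps described above: assign each point to its nearest centroid, recompute each centroid as its cluster's mean, and repeat until the centroids stop moving. The fantasy-points values and starting centroids are invented for illustration.

```python
import statistics

def k_means_1d(points, centroids, max_iters=100):
    """Minimal 1-D K-Means: assign each point to its nearest centroid,
    recompute centroids as cluster means, and repeat until stable."""
    for _ in range(max_iters):
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        new_centroids = [statistics.mean(members) if members else c
                         for c, members in clusters.items()]
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, clusters

# Hypothetical fantasy-points-per-game averages for a pool of players
points = [2, 3, 4, 18, 19, 21, 9, 10]
centroids, clusters = k_means_1d(points, centroids=[0.0, 10.0, 20.0])
```

Real data would have many attributes per player (multi-dimensional points), but the assign-and-update loop is the same.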
Conversely, in the hierarchical method, individual data points are viewed as a single cluster, which then can be grouped based on their similarities. A dendrogram is one practical example of the hierarchical method; it is a tree-like network structure consisting of interconnected data points, or nodes, used to show taxonomic relationships. Dendrograms are a common visualization technique for displaying hierarchical clusters. In our fantasy football example, a dendrogram might be used to visualize the process by which we selected or passed on player choices, based on our evaluation and desired attributes.
Data Cleaning and Preparation
According to Forbes, one of the major problems in data analytics is bad data. That’s why data cleaning and preparation are so important.
This process focuses on acquiring the right data and making sure it’s accurate and consistent. Errors, formatting differences, and unexpected null sets can inhibit the mining process.
Stages of data cleaning include verifying the data is properly formatted, deleting unnecessary or irrelevant data, removing duplicate sets, and correcting simple issues such as input errors. Even the best algorithm won’t work with incomplete or corrupted data.
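The cleaning stages above might look like this in Python: normalize formatting, drop rows with missing required fields, and remove duplicates. The field names and records are hypothetical.

```python
def clean_records(records):
    """Apply common cleaning stages: normalize formatting, drop rows
    with missing required fields, and remove duplicates.
    The field names are hypothetical."""
    seen = set()
    cleaned = []
    for row in records:
        name = (row.get("name") or "").strip().title()
        email = (row.get("email") or "").strip().lower()
        if not name or not email:  # drop incomplete rows
            continue
        key = (name, email)
        if key in seen:  # drop duplicate rows
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned

rows = [
    {"name": "  ada lovelace ", "email": "ADA@Example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},  # duplicate
    {"name": "", "email": "ghost@example.com"},            # incomplete
]
cleaned = clean_records(rows)  # a single normalized record survives
```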
Businesses that produce products need accessible, secure, organized locations in which to store them for distribution. The same applies to their data.
Businesses that create a significant amount of data must collect and store it properly before they can analyze it. Data warehousing is a three-stage process commonly known as ETL, which stands for extract, transform, and load. Data is extracted from its source to a staging area, where it is transformed (or cleaned) and validated. Then it is loaded into the data warehouse.
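A toy end-to-end sketch of the three ETL stages in Python, using an in-memory CSV as the source and an in-memory SQLite database standing in for the warehouse. The data is invented; a production pipeline would target a dedicated warehouse and apply far more validation.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: pull raw rows out of the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: clean and validate rows in the staging step."""
    staged = []
    for row in rows:
        if not row["amount"].strip():
            continue  # discard rows that fail validation
        staged.append((row["customer"].strip(), float(row["amount"])))
    return staged

def load(rows, conn):
    """Load: write the validated rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

raw = "customer,amount\n alice ,19.99\nbob,\ncarol,5.00\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Note that the invalid row (no amount) is dropped at the transform stage, so only clean data reaches the warehouse.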
Proper warehousing is vital for businesses that generate a large volume of data, particularly regarding customers. By properly storing all this data, businesses can mine it for patterns and trends more easily.
Most data mining techniques look for patterns in data. Outlier detection seeks to find instances that stand out as unique.
This process looks for data that conflicts with the rest of a set. This can include errors (perhaps some data was input incorrectly) or data that provides unique business insights. Analysts can test for numeric outliers, apply DBSCAN (a density-based clustering algorithm that labels points in low-density regions as noise), or use an isolation forest to isolate anomalies in a large dataset.
Outlier detection can help businesses understand unique purchases (a run on bathing suits in winter, for instance), detect fraudulent transactions, and improve the logistical flow of the production process.
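A simple numeric outlier test can be sketched in Python with z-scores: flag any value more than a chosen number of standard deviations from the mean. The transaction amounts and the 2.5-standard-deviation threshold are illustrative.

```python
import statistics

def z_score_outliers(values, threshold=2.5):
    """Flag values more than `threshold` standard deviations from the mean.
    The threshold is a judgment call; values around 2.5-3 are common."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily transaction amounts with one suspicious spike
amounts = [52, 48, 50, 49, 51, 47, 53, 50, 500]
outliers = z_score_outliers(amounts)  # flags the 500 spike
```

Flagged values would then be investigated, as either input errors, fraud candidates, or genuine insights.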
Prediction is a fundamental pursuit of data mining. Businesses use predictive modeling to answer the question, “What’s going to happen next?”
Predictive models find patterns in data, then use those patterns to create forecasts. The forecasts can include consumer spending habits, inventory needs for a supplier, sites people might visit based on their internet usage, or a baseball team’s projected strikeout rate against an upcoming pitcher.
Several types of predictive models are available. Forecast modeling seeks to answer a specific question. For example, how many SUVs should a car dealer have on the lot next month? Time-series modeling analyzes data based on its input date — such as product sales over a particular year that may assist in year-over-year sales forecasting.
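A baseline forecast for the SUV question above can be as simple as a moving average in Python: predict next month from the mean of recent months. The sales history is hypothetical, and real forecasting would also account for trend and seasonality.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next period as the mean of the last `window` periods."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly SUV sales at a dealership
suv_sales = [40, 42, 45, 43, 47, 50]
forecast = moving_average_forecast(suv_sales)  # mean of the last three months
```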
Regression is used in data mining to analyze relationships between variables as part of the predictive modeling process. It can be used to project sales, profits, required product volume, weather data, and even patient recovery rates for medical providers. Analysts primarily employ two regression models. Linear regression estimates the relationship between two variables. For instance, a social researcher might study the relationship between a person’s home location and overall happiness, employing regression analysis to determine if there is a linear relationship between those two variables. Linear regression could also be used to predict housing prices in a real estate market where homes are generally increasing in size and structure. In this case, one variable (changes in home size and structure) is analyzed in relation to another (subsequent shifts in price).
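The home-size example can be sketched with ordinary least squares in Python, computed directly from the textbook formulas. The sizes and prices are invented for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for a single predictor: (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical home sizes (square feet) and sale prices (in $1,000s)
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [200, 260, 310, 365, 420]
slope, intercept = linear_fit(sizes, prices)
predicted = slope * 1800 + intercept  # estimate for an 1,800 sq ft home
```

The fitted slope answers the analyst's core question: roughly how much price changes per additional square foot.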
Multiple regression, on the other hand, explains the relationship among multiple variables or data points. For example, when analyzing medical data points like blood pressure or cholesterol levels, analysts may use multiple regression models to explore related variables like height, age, and time spent on aerobic exercise in a given week.
In a regression model, decision trees can also be used to diagram results, determining the probability of a specific outcome at each branch. Consider this example: A company has a set of data that identifies customers by gender and age. With a decision tree algorithm, it can ask a series of questions ("Is the customer female?" and "Is the customer younger than 35?") and group the results accordingly. This is a common tool in marketing strategy for targeting potential customers based on demographics.
Sequential pattern mining looks for events that frequently occur in data. The process is similar to the association rule in that it seeks to find relationships, but these form an ordered pattern.
One example is shopping patterns. Retailers often place products near each other because customers often shop in sequences (think of breakfast foods such as cereal, oatmeal, and granola bars in the same aisle). Another example is targeting internet advertising based on a user's click sequence. By using sequential pattern mining, businesses can make forecasts based on the results.
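A minimal sketch of sequential pattern counting in Python: tally ordered pairs of events across browsing sessions and keep those that appear in every session. The click sequences and the frequency threshold are hypothetical; real sequential pattern mining algorithms (such as GSP or PrefixSpan) handle longer patterns and gaps.

```python
from collections import Counter
from itertools import combinations

# Hypothetical page-visit sequences from individual browsing sessions
sessions = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["search", "product", "cart", "checkout"],
]

pair_counts = Counter()
for session in sessions:
    # combinations preserves order, so (a, b) means a occurred before b
    for ordered_pair in combinations(session, 2):
        pair_counts[ordered_pair] += 1

# Keep the ordered pairs that appear in every session
frequent = {pair for pair, n in pair_counts.items() if n >= len(sessions)}
```

Unlike the unordered pairs used in association analysis, each tuple here encodes direction: "product" is viewed before "cart", not merely alongside it.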
The process of pattern tracking is fundamental to data mining. Essentially, analysts monitor trends and patterns in data associated with the progression of time, allowing them to forecast potential time-sensitive outcomes.
This is important for businesses to understand how, when, and how often their products are being purchased. For example, a sports equipment manufacturer tracks the seasonal sales of baseball gear, soccer balls, or snowboards and can choose times for restocking or marketing programs. In addition, a local retailer in a vacation destination might track buying patterns before a holiday weekend to determine how much sunscreen and bottled water to stock.