By repeatedly showing the model labeled examples such as "this is a dog" and "this is a cat," it learns to distinguish these animals by characteristics such as shape, size, and color. Supervised algorithms are fed large datasets of labeled examples and use them to train a model that can predict labels for new examples. Unsupervised learning, by contrast, works without predefined correct answers.
How Unsupervised Learning Works
Unsupervised learning algorithms identify similarities, differences, and patterns in data. They find relationships by grouping similar items together, simplifying complex data while keeping important information, or spotting unusual data points.
Unsupervised learning can be categorized into:
- Clustering: Grouping similar data points.
- Dimensionality Reduction: Reducing the number of variables in data while keeping essential information.
- Anomaly Detection: Identifying unusual data points deviating from the norm.
Unlabeled data do not have predefined outcomes, unlike supervised learning where data points have corresponding labels/target variables.
The algorithms rely on quantifying similarities between data points. We call these similarity measures.
- Euclidean Distance: Calculates the direct, shortest distance between two data points in space.
- Cosine Similarity: Measures how similar two data points are by looking at the angle between them - smaller angles mean more similarity.
- Manhattan Distance: Finds distance by adding up the coordinate differences between two points, like walking city blocks instead of cutting diagonally.
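The three similarity measures above can be sketched in a few lines of plain Python. The function names and example points here are illustrative, not from the original text:

```python
import math

def euclidean(a, b):
    # Straight-line distance: square root of summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    # Cosine of the angle between the two vectors; 1.0 means they
    # point in exactly the same direction (smallest possible angle).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

p, q = (0, 0), (3, 4)
print(euclidean(p, q))              # 5.0
print(manhattan(p, q))              # 7
print(cosine_similarity((1, 2), (2, 4)))  # 1.0 (same direction)
```

Note how the two distance metrics disagree on the same pair of points: the diagonal shortcut (5.0) is shorter than walking the blocks (7).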
Clustering
| Concept | Description | Notes / Examples |
|---|---|---|
| Clustering Tendency | Measures if data naturally forms groups. | Uniform data may not yield meaningful clusters. |
| Cluster Validity | Evaluates quality of clusters. | Cohesion: similarity within a cluster. Separation: difference between clusters. Metrics: silhouette score, Davies-Bouldin index. |
| Dimensionality | Number of features in data. | High dimensionality increases complexity and sparsity. |
| Intrinsic Dimensionality | Underlying essential dimensions of the data. | Dimensionality reduction preserves this intrinsic information. |
| Anomaly | Data point deviating significantly from the norm. | Used in fraud detection, network security, monitoring. |
| Outlier | Data point far from most other points. | Can indicate errors or unusual observations; similar to anomaly. |
| Feature Scaling | Ensures features contribute equally in calculations. | Techniques: Min-Max scaling, Standardization (Z-score). |
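The two feature-scaling techniques named in the table can be sketched as follows; the `incomes` data is a made-up example, not from the original text:

```python
def min_max_scale(values):
    # Min-Max scaling: rescales values to the [0, 1] range.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Standardization (Z-score): zero mean, unit standard deviation.
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

# Without scaling, a feature like income (tens of thousands) would
# dominate distance calculations over a feature like age (tens).
incomes = [20_000, 35_000, 50_000, 80_000]
print(min_max_scale(incomes))  # [0.0, 0.25, 0.5, 1.0]
```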
K-Means Clustering
K-means clustering is a popular unsupervised learning algorithm that divides a dataset into K separate, distinct groups. It groups similar data points together, where similarity is determined by measuring distances between points in multi-dimensional space.
With customer data including purchase history, demographics, and browsing behavior, K-means clustering can segment customers into distinct groups based on their similarities. This helps with targeted marketing, personalized recommendations, and customer relationship management. It's an iterative algorithm that minimizes variance within each cluster. It aims to group data points so that:
- Points within the same cluster are as close to each other as possible
- Points in different clusters are as far apart as possible
The K-means process:
- Initialization: Randomly pick K points as starting cluster centers (centroids)
- Assignment: Assign each data point to its closest cluster center using distance measurement
- Update: Recalculate cluster centers by finding the average of all points in each cluster
- Iteration: Repeat steps 2-3 until cluster centers stop moving or maximum iterations reached
This iterative process continues refining the clusters until they become stable and no longer change significantly.
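The four steps above can be sketched as a minimal K-means in plain Python. This is a simplified illustration, not a production implementation; the 2D toy dataset is invented for the example:

```python
import random

def kmeans(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # 1. Initialization: randomly pick K points as starting centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # 2. Assignment: each point goes to its nearest centroid
        #    (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # 3. Update: move each centroid to the mean of its cluster
        #    (keep the old centroid if a cluster ends up empty).
        new_centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # 4. Iteration: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

data = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(data, k=2)
```

On this data the algorithm recovers the two obvious groups of three points each, with one centroid near (1.2, 1.5) and the other near (8.5, 8.5).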
Euclidean distance
Euclidean distance is the most commonly used distance metric in K-means clustering for measuring how similar data points are. It calculates the straight-line distance between two points in multi-dimensional space, helping determine which cluster center each data point should belong to.
Optimal K
Choosing the right number of clusters (K) is critical for successful K-means clustering. The K value directly affects results and how useful they are. Pick too few clusters and your groups become too broad and general. Pick too many clusters and you get overly specific, tiny groups that don't provide meaningful insights.
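One common way to compare candidate K values (often called the elbow method, though the text does not name it) is to run K-means for each K and plot the total within-cluster variance (inertia): it drops sharply until the natural number of groups is reached, then flattens. A self-contained sketch, with an invented two-blob dataset:

```python
import random

def kmeans_inertia(points, k, iters=50, seed=0):
    # Tiny K-means that returns the final inertia: the sum of squared
    # distances from each point to its nearest centroid.
    rng = random.Random(seed)
    cents = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, cents[j])))
            groups[i].append(p)
        cents = [tuple(sum(col) / len(g) for col in zip(*g)) if g else cents[i]
                 for i, g in enumerate(groups)]
    return sum(min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents)
               for p in points)

# Two obvious blobs: inertia drops sharply from K=1 to K=2, then
# flattens - suggesting K=2 is the right choice here.
data = [(1, 1), (1.2, 0.8), (0.9, 1.1), (7, 7), (7.1, 6.8), (6.9, 7.2)]
for k in (1, 2, 3):
    print(k, round(kmeans_inertia(data, k), 3))
```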