“You can have data without information, but you cannot have information without data.”Daniel Keys Moran
Bits, bytes, numbers, text, facts - We are inundated with information. However, data tends to be typically unstructured and is represented in a very high dimensional space making it very difficult to understand, visualize and classify.
This post demystifies dimensionality reduction, which is all about representing data to draw the right inferences.
When you run a search query, it is always serviced with respect to a set of features. This set represents our perception towards high dimensional data. For example, when analyzing data we transform it into a much lower dimensional space, since we believe the new dimensional representation has better perceivable information (less noise) than the original representation.
So what it means in the data space is that we are making a choice of observing data clusters in a particular way. To draw an analogy, it is like seeing an object from one side. Typically an object is lying in a three dimensional space but when we take a picture, we observe it in a two dimensional space. Thus, clearly there can be other views which reveal additional details about that object which are not observed in this picture.
To understand this better, consider taking the picture of a car from the front. A quick scan of the photograph reveals details like the color, its license number, the condition of its headlights but you don’t get to see details like the rear of the car, the exhaust or the trunk.
Similarly, with a given feature set, you just get to see a particular view about a given data while running a search query.
Feature (subset) Selection
There are various approaches to adapt search algorithms towards the most likely set and observe data in a better way. Most effort in designing a search algorithm is dedicated to discovering a meaningful feature-set and navigating our perceptions. We can think of this as detecting combinations of feature (sub) sets that improve the result accuracy. There can be multiple such feature sets, utilized in different combinations yielding a result to our query.
Consider a search query for “Movies in 2014” – a typical search procedure would observe features as described in the illustration below. There are sub-sets (models) of features that contribute a significant amount of information. The domain information is observed from incoming links, the domain age and the domain name. The social presence for the result of the URL is evaluated on the basis of number of shares / tweets, comments and likes on Facebook.
Similarly, there is also some content level information available from the text in the anchor tags, the meta tags and the document text itself. Here the “Domain Information”, its “Social Presence” and the “Text Data” are higher-level views of data, based on some selected features; often termed as sub models. A full-fledged model would have various hierarchical combinations of these features and sub models.
The features described above are just for illustrative purposes. A similar or more detailed set of features can be used to build such sub-sets (models) by a full-fledged search engine.
An observed feature-set used for Google search engine queries can be found here.
A simple way to evaluate how well a given URL fits the Model (or Sub-Model) is by using similarity measures. A similarity measure is used to rank and return results for a given query; it determines for a particular result candidate, how good a fit it is to the query. The results are sorted by ranks and the one with best rank is the most likely candidate to be the solution for the supplied query.
One of the popular Similarity Measures used is Cosine Similarity.
Data Transformation Techniques1 can be used to represent each feature as a dimension in the vector space. Cosine similarity is used to compute the angle between two vectors. Similarity is inversely proportional to the angle which means smaller angles = similar vectors.
There are many more sophisticated ways of Rank & Return but we are limiting ourselves to cosine similarity as our purpose is to understand how we tend to miss information present in the data space.
The performance of search algorithms varies with respect to feature-sets that are learned. An algorithm is more effective in a search space where a feature-set observes more information and less noise. There is however no generic/global feature-set that is guaranteed to perform optimally. A set observed to perform at high accuracy within a particular data space can perform very poorly for another. This is like saying we choose to observe data in a particular way, it might not be a complete view but is one which is robust for solving our search problems.
There are other possible views of data that may perform better. We can only observe features till the data space is explored. It is also possible to infer latent features about a particular dataset, better feature sub-set selection (where the feature is yet unobserved) in a discrete data-set, which could greatly improve our classification accuracy.
1. details about using data transformation techniques is out of scope for this article.
At Compile.com, we deal with a variety of datasets both big and small. Often, there is a need to run analysis on top of 3rd party datasets that we haven’t ingested to see if it’s worth the effort. This particular …