Retrieval of large data sets from databases

Hashing is a common operation in online databases, such as those behind e-commerce websites or library catalogues, and is used to retrieve data quickly. A hash function generates codes that point directly to the location where data is stored, so an item can be found without searching the whole collection.
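As a rough illustration of how a hash code can point to a storage location, the sketch below maps each key to one of a fixed number of slots. The table size, the helper names (`slot_for`, `put`, `get`) and the use of CRC32 as the hash function are assumptions made for this example and do not come from the research described here.

```python
import zlib

NUM_SLOTS = 8  # hypothetical table size chosen for the example

def slot_for(key: str, num_slots: int = NUM_SLOTS) -> int:
    """Turn a key into a slot index; CRC32 stands in for any hash function."""
    return zlib.crc32(key.encode("utf-8")) % num_slots

# Each slot holds a list so that keys sharing a slot can coexist.
table = [[] for _ in range(NUM_SLOTS)]

def put(key: str, value: str) -> None:
    """Store the pair in the slot that the hash code points to."""
    table[slot_for(key)].append((key, value))

def get(key: str):
    """Look only in the one slot the hash code points to."""
    for stored_key, value in table[slot_for(key)]:
        if stored_key == key:
            return value
    return None

put("isbn-0001", "Dune")
put("isbn-0002", "Emma")
print(get("isbn-0001"))
```

A lookup inspects a single slot rather than every stored item, which is why hashing speeds up retrieval.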

However, traditional hash functions can generate the same code for different pieces of data. These collisions mean that several items end up at the same location and must be sorted through one by one, which slows searches and reduces performance. Perfect hash functions avoid collisions, but they are time-consuming to construct and to compute.
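The trade-off can be sketched in a few lines. The code below counts colliding keys for a seeded hash function, then searches by brute force for a seed that happens to give no collisions on a small, fixed key set. This is a toy stand-in for perfect hashing, not a real construction algorithm, and the function names, key format and table sizes are all hypothetical.

```python
import hashlib
from collections import Counter

def seeded_slot(key: str, seed: int, num_slots: int) -> int:
    """Hash a key together with a seed and reduce it to a slot index."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slots

def colliding_keys(keys, seed: int, num_slots: int) -> int:
    """Count the keys that share a slot with at least one other key."""
    counts = Counter(seeded_slot(k, seed, num_slots) for k in keys)
    return sum(c for c in counts.values() if c > 1)

def find_perfect_seed(keys, num_slots: int, max_seeds: int = 10_000):
    """Try seeds one by one until none of the given keys collide."""
    for seed in range(max_seeds):
        if colliding_keys(keys, seed, num_slots) == 0:
            return seed
    return None  # no collision-free seed found within the budget

keys = [f"user-{i}" for i in range(20)]

# 20 keys cannot fit into 10 slots without some of them sharing a slot.
crowded = colliding_keys(keys, seed=0, num_slots=10)

# For 6 keys and 8 slots, a collision-free seed exists but must be searched for.
small_keys = keys[:6]
perfect_seed = find_perfect_seed(small_keys, num_slots=8)
print(crowded, perfect_seed)
```

The search loop hints at why collision-free functions cost more to build: the work grows quickly with the number of keys, whereas an ordinary hash function needs no construction step at all.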

Researchers from MIT and other institutions explored the use of machine learning to build better hash functions. They discovered that in certain situations, using learned models could result in fewer collisions, and these models were often more computationally efficient than perfect hash functions.

By applying machine learning to a small sample of the data to approximate how the data is distributed, the researchers obtained learned models that were easier to build and faster to run than perfect hash functions, while leading to fewer collisions than traditional hash functions.
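The idea of approximating the distribution can be sketched as follows. Here the "model" is a piecewise-linear estimate of the cumulative distribution, fitted to a small sample and scaled to the number of slots. This is a simplified stand-in chosen for illustration, not the model the researchers used. The evenly spaced integer keys, the regular-interval sample and all function names are assumptions, and they represent an idealized case; on less regular data the advantage may shrink or disappear.

```python
import bisect
import hashlib
from collections import Counter

def build_learned_hash(sample, num_slots: int):
    """Fit a piecewise-linear CDF to sampled integer keys; return key -> slot."""
    points = sorted(sample)
    n = len(points)
    if n < 2:
        raise ValueError("need at least two sample points")

    def learned_slot(key: int) -> int:
        i = bisect.bisect_right(points, key)  # sample points <= key
        if i == 0:
            return 0                  # below the sampled range
        if i == n:
            return num_slots - 1      # at or above the largest sample point
        lo, hi = points[i - 1], points[i]
        # Interpolate between neighbouring sample points. Integer arithmetic
        # keeps the result exact for integer keys.
        numerator = ((i - 1) * (hi - lo) + (key - lo)) * num_slots
        return min(numerator // ((hi - lo) * (n - 1)), num_slots - 1)

    return learned_slot

def traditional_slot(key: int, num_slots: int) -> int:
    """A conventional hash that scatters keys pseudo-randomly."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slots

def colliding_ratio(slots) -> float:
    """Fraction of keys that share a slot with another key."""
    counts = Counter(slots)
    return sum(c for c in counts.values() if c > 1) / len(slots)

keys = list(range(0, 100_000, 10))      # evenly spaced keys (idealized data)
num_slots = len(keys)                   # one slot per key
sample = keys[::100] + [keys[-1]]       # about 1% of the keys, plus the maximum
learned_slot = build_learned_hash(sample, num_slots)

learned_ratio = colliding_ratio([learned_slot(k) for k in keys])
hashed_ratio = colliding_ratio([traditional_slot(k, num_slots) for k in keys])
print(f"learned: {learned_ratio:.3f}, traditional: {hashed_ratio:.3f}")
```

On this evenly spaced data the learned mapping gives every key its own slot, while the conventional hash leaves a substantial share of keys colliding. Unlike the perfect-hash search, building the model requires only sorting a small sample.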

The team’s experiments showed that learned models could reduce the ratio of colliding keys in a dataset and achieve better throughput than perfect hash functions. The researchers plan to explore learned hashing for databases in which data can be inserted or deleted.