Hello there! In the previous post “What Okta Bcrypt incident can teach us about designing better APIs”, we discussed the 72-character input limit of the Bcrypt hashing algorithm that caused quite a big security incident in the industry. That reminded me of another example of Bcrypt misuse that I personally came across a few years ago while investigating a rather nasty performance issue in one of our services. Let's jump right into it!
A product manager from one of the neighboring teams approached me and asked if I could help with the performance degradation their newest feature had been experiencing lately: users were prompted to enter their SSN (social security number) and would then get a dashboard with personalized data about them. The simplified architecture looked like this:
As you can see, the flow consists of 3 steps:
- The user enters their SSN in the UI, and the UI sends it to the Internal service.
- The Internal service checks whether the data for the user with that SSN is already present in the DB (Postgres): if yes, that data is returned to the UI
- Otherwise, a request is made to the Third-party API to fetch the desired data, which is then processed, saved into the DB, and returned to the user
The nature of the user data didn't require real-time updates, so the same data was stored and used for a month, after which it was considered expired and was deleted by a cron job. A pretty straightforward system.
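In code, the lookup-or-fetch part of this flow might look roughly like this (a sketch of my own, not the service's actual code; the ThirdPartyClient and its FetchUserInfo method are hypothetical names, while SelectUserBySsn and InsertUser mirror the repo functions we'll see later in the post):

```go
func getUserInfo(repo *db.Repo, client *ThirdPartyClient, ssn string) (*common.User, error) {
	// Step 2: check whether data for this SSN is already cached in Postgres.
	user, err := repo.SelectUserBySsn(ssn)
	if err != nil {
		return nil, err
	}
	if user != nil {
		return user, nil // cache hit: serve the stored data
	}
	// Step 3: cache miss - fetch from the Third-party API (hypothetical client),
	// store the result, and return it.
	info, err := client.FetchUserInfo(ssn)
	if err != nil {
		return nil, err
	}
	newUser := common.NewUserEntity{Id: uuid.New().String(), Ssn: ssn, UserInfo: info}
	if err := repo.InsertUser(newUser); err != nil {
		return nil, err
	}
	return &common.User{Id: newUser.Id, UserInfo: info}, nil
}
```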
The issue the team was experiencing was that it took 10+ seconds to complete the 3-step flow we discussed above, which led to a terrible UX and a high bounce rate. That made sense, as the average attention span of users nowadays is way less than 10 seconds. But why was it so slow?
The issue was reproducible in the production setup, but the logs/metrics offered no useful clues about the cause. So, I cloned the project code to my laptop and launched a Postgres instance via Docker Compose. Additionally, I started mitmproxy to be able to intercept and inspect HTTP requests on my machine, and created a template of the request to the Internal service API with my own SSN in Postman. My debugging setup was ready, so it was time to run and test the app.
Running it locally
To my great surprise, after sending a request from Postman, I got a response in a matter of tens of milliseconds. mitmproxy showed that there was indeed a request to the Third-party API, but it had minimal latency.
Hm… interesting. I tried repeating the same request and again got a pretty quick answer, but this time without any calls to the Third-party API. That made perfect sense, as my user info was now present in Postgres. I sent the same request a few more times, but had no luck reproducing the issue.
Despite the failure, that gave me a good clue: it seemed like I could exclude the Third-party API from the list of suspects, at least for now. Of course, it's important to mention that the local setup differs from the production one, so there is always a chance that something might be slowing down the egress traffic. However, for simplicity of the investigation, I decided to first make sure that nothing in the app itself caused the issue before checking possible external factors, like cloud proxies, security filters, etc.
I realized it was time to take a look at the code. If you'd like to follow along, here are the code samples that reproduce the setup we are discussing; check the README.md file for instructions on how to run them.
Browsing the code
You might know the saying “it's always DNS”, which applies when a service is down for no obvious reason. From my experience, when there is a performance issue with a relatively simple app, “it's often the database”. So, even though I navigated the source code from top to bottom (API -> DB), I didn't expect to see anything suspicious in the layers above the one that runs DB queries. And indeed, the interesting bits of code were 2 functions of the Repo struct, responsible for inserting a new user into the DB and selecting a particular user from the DB by SSN. Here's how they looked:
```go
func (r *Repo) SelectUserBySsn(ssn string) (*common.User, error) {
	now := time.Now()
	var user common.User
	// Find the row whose stored hash matches the hash of the input SSN.
	err := r.dbPool.QueryRow(
		context.Background(),
		"SELECT id, user_info FROM users WHERE ssn_hash = crypt($1, ssn_hash)",
		ssn,
	).Scan(&user.Id, &user.UserInfo)
	if err != nil {
		if errors.Is(err, pgx.ErrNoRows) {
			// No user with this SSN yet: report a miss, not an error.
			return nil, nil
		}
		fmt.Println(err)
		return nil, err
	}
	fmt.Println("Time taken to get user:", time.Since(now))
	return &user, nil
}
```
```go
func (r *Repo) InsertUser(user common.NewUserEntity) error {
	// Hash the SSN with a freshly generated salt before storing it.
	_, err := r.dbPool.Exec(
		context.Background(),
		"INSERT INTO users (id, ssn_hash, user_info) VALUES ($1, crypt($2, gen_salt('bf')), $3)",
		user.Id, user.Ssn, user.UserInfo,
	)
	if err != nil {
		fmt.Println(err)
		return err
	}
	return nil
}
```
The crypt($1, ssn_hash) and crypt($2, gen_salt('bf')) parts immediately caught my eye: “aha, so we are not storing plain SSNs in the DB, but hashed versions”. Interesting discovery! This didn't give me any hints about the exact algorithm used for hashing, but a quick Google search for the gen_salt('bf') part helped me discover that it's Bcrypt. And indeed, after I ran SELECT * FROM users in my local setup, I noticed that the ssn_hash column had this value: $2a$06$r4UXr0F80tz57DGzCQBf6uIpQealvM3Rb6E/TvsyFkJJaiLXt9J8W, and the $2a$ prefix is indeed a Bcrypt one. All right, we are onto something! Chances are that some of you who are experts in cryptography and/or databases (or Postgres, at least) can already see the reason for the poor performance of the app (kudos to you!), but that was not the case for me, so I had to keep digging.
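Before moving on, it's worth annotating what that stored value actually consists of; the segment layout below is standard for Bcrypt output, with my annotations as Go comments:

```go
// "$2a$06$r4UXr0F80tz57DGzCQBf6uIpQealvM3Rb6E/TvsyFkJJaiLXt9J8W" - 60 chars total:
//   "$2a$"                            -> Bcrypt version prefix
//   "06$"                             -> cost factor (2^6 = 64 rounds of key setup)
//   "r4UXr0F80tz57DGzCQBf6u"          -> 22-char Base64-encoded salt
//   "IpQealvM3Rb6E/TvsyFkJJaiLXt9J8W" -> 31-char Base64-encoded hash
```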
I realized that the DB was my best lead at the moment, so I jumped from the source code to the Postgres console in my IntelliJ IDEA and ran this query: SELECT * FROM users WHERE ssn_hash = crypt('0852285111', ssn_hash) (0852285111 is my fake SSN in this example). I got the result in 9 milliseconds, which was quite fast and didn't hint at any performance issues.
But then I connected to the production DB and ran the same query. It took 15 seconds to get a similar response. Wow! What's more interesting was the fact that fetching all the existing users via SELECT * FROM users took only 6 milliseconds. Puzzling, isn't it?
Despite being a bit puzzled, that definitely felt like good progress, as I had a strong clue that the DB was the bottleneck. It also became clear that my failure to reproduce the issue locally was due to the fact that my local table had only 1 row, while the production one had 5000 users at that moment and kept growing hour by hour. I could have kept debugging the issue within the production Postgres instance, as it was reproducible there, but that is not a good practice for many reasons: it would degrade performance even further, and an unsafe query might harm the data. So at first I wanted to take a snapshot of the production DB and use it locally, but realized that this wasn't a much better idea either: taking a snapshot might be a costly operation for Postgres, plus there were privacy implications of copying sensitive customer data to my local machine. That's why I went with the third option: inserting a lot of fake data into my local DB.
Fake it until you make it
The fact that our table consisted of only 3 columns made the task very simple. I ended up writing this fake data generation code:
```go
func GenRandomUsers(num int) []common.NewUserEntity {
	users := make([]common.NewUserEntity, num)
	for i := 0; i < num; i++ {
		users[i] = GenRandomUser()
	}
	return users
}

func GenRandomUser() common.NewUserEntity {
	ssn := genRandomSsn()
	fmt.Println("Generated SSN:", ssn)
	return common.NewUserEntity{
		Id:       uuid.New().String(),
		Ssn:      ssn,
		UserInfo: GenRandomUserInfo(),
	}
}

func GenRandomUserInfo() string {
	return fmt.Sprintf("Some info %s %s", strconv.FormatInt(time.Now().UnixMilli(), 10), uuid.New().String())
}

// genRandomSsn builds a random 10-digit numeric string.
func genRandomSsn() string {
	ssn := ""
	for i := 0; i < 10; i++ {
		ssn += strconv.Itoa(rand.IntN(10))
	}
	return ssn
}
```
And made use of it within the fillDb function that I triggered on app startup:
```go
func fillDb(repo *db.Repo) {
	users, err := repo.SelectUsers()
	if err != nil {
		panic(err)
	}
	if len(users) > 0 {
		fmt.Println("users already exist, skipping db fill")
		return
	}
	fmt.Println("generating random users...")
	newUsers := utils.GenRandomUsers(5000)
	fmt.Println("inserting random users...")
	err = repo.InsertUsers(newUsers)
	if err != nil {
		panic(err)
	}
	fmt.Println("inserted random users")
}
```
Once I ran the app, it took more than 15 seconds to finish inserting the users, which I found weird, as the InsertUsers code made only 1 call to the DB:
```go
func (r *Repo) InsertUsers(users []common.NewUserEntity) error {
	now := time.Now()
	query := "INSERT INTO users (id, ssn_hash, user_info) VALUES "
	var args []interface{}
	for i, user := range users {
		// Build a single multi-row VALUES list: ($1, crypt($2, ...), $3), ($4, ...), ...
		query += fmt.Sprintf("($%d, crypt($%d, gen_salt('bf')), $%d),", i*3+1, i*3+2, i*3+3)
		args = append(args, user.Id, user.Ssn, user.UserInfo)
	}
	query = query[:len(query)-1] // remove the trailing comma
	_, err := r.dbPool.Exec(
		context.Background(),
		query,
		args...,
	)
	if err != nil {
		fmt.Println(err)
		return err
	}
	fmt.Println("Time taken to insert users:", time.Since(now))
	return nil
}
```
It became clearer and clearer to me that the crypt($%d, gen_salt('bf')) part was to blame, but I didn't yet understand why. Anyway, the intermediate goal was achieved: at that point, I could reproduce the performance issue in my local setup as well. The SELECT * FROM users WHERE ssn_hash = crypt('0852285111', ssn_hash) query became as slow as in the production DB, even though my SSN wasn't among the existing DB records.
I decided to take a pause for a moment and check the facts:
- the reason for the slow performance was the queries to the DB
- the performance degraded as the number of records in the DB table grew
- something was fishy with the crypt() function for both INSERT and SELECT queries
The second fact was a good hint, so I decided to check what the Postgres query planner could tell me about the plan for executing the query:
```sql
EXPLAIN
SELECT *
FROM users
WHERE ssn_hash = crypt('0852285111', ssn_hash);
```
The EXPLAIN keyword in this case makes Postgres build the query plan without actually running the query. The results looked like this:
```
Seq Scan on users  (cost=0.00..182.00 rows=25 width=138)
  Filter: (ssn_hash = crypt('0852285111'::text, ssn_hash))
```
Aha, Seq Scan on users gives us a hint that Postgres is about to perform a sequential scan, which means it will go through every single row of the table to find the matching one - O(n), if we'd like to use algorithmic complexity here. This could very well explain the correlation between the growing number of rows and the performance degradation.
Let's add the ANALYSE keyword to actually execute the query and get more metrics:
```sql
EXPLAIN ANALYSE
SELECT *
FROM users
WHERE ssn_hash = crypt('0852285111', ssn_hash);
```
The output:
```
Seq Scan on users  (cost=0.00..182.00 rows=25 width=138) (actual time=15037.772..15037.774 rows=1 loops=1)
  Filter: (ssn_hash = crypt('0852285111'::text, ssn_hash))
  Rows Removed by Filter: 5000
Planning Time: 0.048 ms
Execution Time: 15037.788 ms
```
This confirms what we had just seen above.
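Looking back, these numbers already contained the answer, though I only connected the dots after the Bcrypt research below: the filter ran 5001 times (5000 rows removed plus 1 match), and 15037 ms / 5001 ≈ 3 ms per crypt() call. A single Bcrypt computation at cost 6 is indeed in that ballpark, which you can sanity-check with a quick Go snippet (my own sketch; the exact timing is hardware-dependent):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	start := time.Now()
	// Cost 6 matches the $2a$06$ prefix we saw in the ssn_hash column.
	if _, err := bcrypt.GenerateFromPassword([]byte("0852285111"), 6); err != nil {
		panic(err)
	}
	// A few milliseconds per call - multiplied by 5000 rows, the seconds add up.
	fmt.Println("one Bcrypt computation took:", time.Since(start))
}
```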
So, I thought I had solved the riddle: the ssn_hash column lacked an index. I was almost proud of myself, but only for a very brief moment, as after browsing both the DB table structure and the schema definition queries, I realized that there actually was an index there:
CREATE INDEX users_ssn_hashed_idx ON users (ssn_hash);
But why did Postgres keep ignoring it and performing sequential scans nevertheless? And despite that fact, why did it take Postgres 15 seconds to scan 5000 rows? And why was selecting all the rows so quick (6 ms)? The only difference between the two queries was the use of the crypt() function within the WHERE clause. That narrowed down the problem. So, I realized it was time to get back to basics and do some research on the nature of the Bcrypt algorithm.
Bcrypt
Here is what Wikipedia says about Bcrypt:
The input to the bcrypt function is the password string (up to 72 bytes), a numeric cost, and a 16-byte (128-bit) salt value. The salt is typically a random value. The bcrypt function uses these inputs to compute a 24-byte (192-bit) hash. The final output of the bcrypt function is a string of the form:
$2<a/b/x/y>$[cost]$[22 character salt][31 character hash]
For example, with input password abc123xyz, cost 12, and a random salt, the output of bcrypt is a string in exactly that form.
If we distill it a bit, we can see that the input of the Bcrypt function is:
- password
- numeric cost
- salt (typically random)
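To see these three inputs in action outside of Postgres, here is a minimal Go sketch of my own using the golang.org/x/crypto/bcrypt package (it takes the password and cost explicitly and generates the random salt internally):

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	// Inputs: the password and the numeric cost; the random salt is
	// generated inside GenerateFromPassword.
	hash, err := bcrypt.GenerateFromPassword([]byte("abc123xyz"), 12)
	if err != nil {
		panic(err)
	}
	// Output: a 60-char string embedding version, cost, salt, and hash.
	fmt.Println(string(hash))
}
```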
If we take a look at the crypt function within the insert user statement, we'll see crypt($2, gen_salt('bf')), where the SSN is passed as $2. So, we do have a password (the SSN) and a salt (gen_salt('bf')).
The numeric cost seems to be missing, but let's take a look at the output of the gen_salt('bf') function. You might need to install the pgcrypto extension via CREATE EXTENSION IF NOT EXISTS pgcrypto; to be able to follow along.
SELECT gen_salt('bf');
The output is $2a$06$Qkp8j0vDoCuqNCgyOuIRxO. If we compare it with the example from Wikipedia above, we can see that this salt includes the cost within it (06).
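The cost can also be read back out of any full Bcrypt string programmatically; for example, Go's bcrypt package exposes it (again a sketch of mine, using the hash we saw in the ssn_hash column earlier):

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	stored := "$2a$06$r4UXr0F80tz57DGzCQBf6uIpQealvM3Rb6E/TvsyFkJJaiLXt9J8W"
	cost, err := bcrypt.Cost([]byte(stored))
	if err != nil {
		panic(err)
	}
	fmt.Println("cost:", cost) // prints: cost: 6
}
```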
Let's run SELECT gen_salt('bf'); again. This time we get $2a$06$enmGwTijs9cvlsIv9dcQQe, which is different from the previous result. As Wikipedia warned us:
The salt is typically a random value.
And what happens if we pass a random salt to the crypt function? Right, the result will be random as well. Let's double-check:
SELECT crypt('0852285111', gen_salt('bf'));
The first time I ran it, I got this outcome: $2a$06$DQqtgCtvY9GkT3uHpShehuxb3eHD50H6XNxzEyoxZkuDjEt87/6Ce. The second run gave a different result (as we anticipated): $2a$06$nbwOka/0ve5RvM71f35jEOMGA5qHaa0H1I3RtkYq/sNIXW.Z90KAm. If we keep going, we'll keep getting different results. And this is cool, as Bcrypt was created for password hashing, which means that even if different users have the same password, the hashed versions will be different - pretty neat!
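The same behavior is easy to demonstrate from Go (a sketch with my fake SSN): two hashes of the same input come out different, and yet both still verify against it:

```go
package main

import (
	"fmt"

	"golang.org/x/crypto/bcrypt"
)

func main() {
	h1, _ := bcrypt.GenerateFromPassword([]byte("0852285111"), 6)
	h2, _ := bcrypt.GenerateFromPassword([]byte("0852285111"), 6)
	fmt.Println(string(h1) == string(h2)) // false: each call used a fresh random salt
	// Both hashes still verify against the same input:
	fmt.Println(bcrypt.CompareHashAndPassword(h1, []byte("0852285111"))) // <nil>
	fmt.Println(bcrypt.CompareHashAndPassword(h2, []byte("0852285111"))) // <nil>
}
```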
But if the results are random, how can we check that the hash in the DB corresponds to the plain-text value we get as the user input? Well, as you can guess, we need to pass the same salt as the input. But where do we get the salt? Wikipedia has already given us the answer: the salt is part of the hashed output. Let's try it out by copying the first 29 chars of the output we got above ($2a$06$DQqtgCtvY9GkT3uHpShehuxb3eHD50H6XNxzEyoxZkuDjEt87/6Ce -> $2a$06$DQqtgCtvY9GkT3uHpShehu) and check if that helps:
```sql
SELECT '$2a$06$DQqtgCtvY9GkT3uHpShehuxb3eHD50H6XNxzEyoxZkuDjEt87/6Ce' = crypt('0852285111', '$2a$06$DQqtgCtvY9GkT3uHpShehu');
```