Introduction to state vector simulation

In this lesson, we're going to implement a simple state vector simulator and steadily make it less simple (and more performant) as a way to examine some performance guidelines for Julia and numerical physics.

First, let's remind ourselves of some basic facts about state vectors.

State vectors, gates, and observables

State vector basics

A state vector is a length-$D^N$ complex vector representing probability amplitudes for a quantum system of $N$ particles, each with $D$ possible states. So, for qubits, which have 2 possible states ($|0\rangle$ and $|1\rangle$), the state vector has $2^N$ elements, and for qutrits, which have 3 possible states, the state vector would have $3^N$ elements.

In this lesson we'll focus entirely on the qubit case, as it's the most common.

For just a single qubit, we have a 2-element vector:

\[|\psi_1\rangle = \begin{bmatrix} \psi_0 \\ \psi_1 \end{bmatrix}\]

Where this corresponds to

\[|\psi_1\rangle = \psi_0 | 0 \rangle + \psi_1 | 1 \rangle\]

For two qubits, we have a 4-element vector:

\[|\psi_2\rangle = \begin{bmatrix} \psi_{00} \\ \psi_{01} \\ \psi_{10} \\ \psi_{11} \end{bmatrix}\]

Where this corresponds to

\[|\psi_2\rangle = \psi_{00} | 00 \rangle + \psi_{01} | 01 \rangle + \psi_{10} | 10 \rangle + \psi_{11} | 11 \rangle\]

For $N$ qubits, the state vector has $2^N$ elements. We can think of the zero-based index within the vector as the base-2 representation of the corresponding computational basis state: the element at index $i$ is the amplitude of $|i\rangle = |i_{N-1} ... i_0\rangle$, where $i_k$ is the $k$-th bit of $i$ and qubit 0 sits in the least significant position.
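
As a quick sanity check of this convention (a small sketch using the Base functions digits and bitstring; it isn't part of the lesson's simulator), the bits of the zero-based index tell us the state of every qubit, with qubit 0 as the least significant bit:

# Zero-based index 5 in a 3-qubit register is binary 101, i.e. the state |101⟩
idx = 5
digits(idx, base = 2, pad = 3)    # returns [1, 0, 1]; entry k+1 is the state of qubit k
bitstring(UInt8(idx))[end-2:end]  # "101", read as |i_2 i_1 i_0⟩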

Gate application

Gates are unitary matrices that transform state vectors. For a single qubit system, gates are $2 \times 2$ matrices. For an $N$-qubit system, gates are $2^N \times 2^N$ matrices.

In Julia, we'll represent our single-qubit gates as follows:

using LinearAlgebra
using Chairmarks
using Profile
using AliasTables
const X = float.(complex.([0 1; 1 0]))
const Y = float.([0 -im; im 0])
const Z = float.(complex.([1 0; 0 -1]))
const I = float.(complex.([1 0; 0 1]))
const H = float.(complex.(1/√2 * [1 1; 1 -1]))
2×2 Matrix{ComplexF64}:
 0.707107+0.0im   0.707107+0.0im
 0.707107+0.0im  -0.707107+0.0im

Most quantum gates act on a single qubit, but the state vector involves many qubits. We can extend a single-qubit gate to act on the entire state vector by taking a tensor product with the identity matrix for all other qubits.

For example, if we have a 3-qubit state vector and we want to apply gate $G$ to the second qubit (qubit 1), we need to apply $I \otimes G \otimes I$ to the state vector. In Julia, the tensor product is implemented as kron.

function apply_gate_naive(ψ::Vector, gate::Matrix, gate_qubit::Int)
    n_qubits  = Int( log2(length(ψ)) )
    all_gates = [I for qubit in 1:n_qubits]
    all_gates[n_qubits - gate_qubit] = gate # qubit 0 is the last kron factor, and Julia is one-indexed!
    full_gate = reduce(kron, all_gates)
    return full_gate * ψ
end
apply_gate_naive (generic function with 1 method)

Let's test this with a simple example. We'll create a 3-qubit state vector and apply an X gate to the second qubit:

# Create a 3-qubit state vector in the |000⟩ state
ψ = zeros(ComplexF64, 8)
ψ[1] = 1.0  # |000⟩ corresponds to index 1
1.0
# Apply X gate to qubit 1 (second qubit)
ψ_new = apply_gate_naive(ψ, X, 1)
8-element Vector{ComplexF64}:
 0.0 + 0.0im
 0.0 + 0.0im
 1.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
# This should give us |010⟩, which corresponds to index 3
findall(x -> abs(x) > 1e-10, ψ_new)
1-element Vector{Int64}:
 3

Observables

Observables are Hermitian matrices that correspond to measurable quantities. For a single qubit, common observables are the Pauli matrices $X$, $Y$, and $Z$.

The expectation value of an observable $O$ on a state vector $|\psi\rangle$ is:

\[\langle O \rangle = \langle \psi | O | \psi \rangle\]

For a multi-qubit system, we need to extend the observable to act on the entire state vector using tensor products, just like we did for gates.

function expectation_value_naive(ψ::Vector, observable::Matrix, observable_qubit::Int)
    n_qubits  = Int( log2(length(ψ)) )
    all_observables = [I for qubit in 1:n_qubits]
    all_observables[n_qubits - observable_qubit] = observable
    full_observable = reduce(kron, all_observables)
    return real(ψ' * full_observable * ψ)
end
expectation_value_naive (generic function with 1 method)

Let's test this with our X gate example:

# Expectation value of Z on qubit 1 for |000⟩
expectation_value_naive(ψ, Z, 1)
1.0
# Expectation value of Z on qubit 1 for |010⟩
expectation_value_naive(ψ_new, Z, 1)
-1.0

Performance considerations

The naive approach works, but it's not very efficient. The main problem is that we're creating a large matrix ($2^N \times 2^N$) for each gate application, which becomes prohibitively expensive for large systems.

Let's create a larger system to see the performance issues:

n_qubits = 8
ψ_large = zeros(ComplexF64, 2^n_qubits)
ψ_large[1] = 1.0
1.0
# This will be slow and memory-intensive
@b apply_gate_naive(ψ_large, X, 4)
174.587 μs (24 allocs: 1.338 MiB)

For a 16-qubit system, we would need to create a $65536 \times 65536$ matrix of ComplexF64 entries, which uses about 64 GiB of memory! This is clearly not sustainable.
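
To make the scaling concrete, here is a one-line helper (hypothetical, not part of the simulator) that computes the memory footprint of a dense $2^N \times 2^N$ ComplexF64 matrix:

# Memory (in GiB) for a dense 2^N × 2^N ComplexF64 matrix: (2^N)^2 entries at 16 bytes each
full_gate_gib(n_qubits) = (2.0^n_qubits)^2 * sizeof(ComplexF64) / 2^30

full_gate_gib(8)    # ≈ 0.001 GiB (1 MiB), consistent with the ~1.3 MiB the naive benchmark allocates
full_gate_gib(16)   # 64.0 GiB
full_gate_gib(25)   # ≈ 1.7e7 GiB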

Optimized approaches

The key insight is that we don't need to create the full matrix. Instead, we can directly manipulate the state vector using more efficient algorithms.

Approach 1: Tensor reshaping

One approach is to reshape the length-$2^N$ state vector into an $N$-dimensional tensor with one size-2 axis per qubit, then contract the gate with the axis belonging to the target qubit:

function apply_gate_reshaped1(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    n_qubits = Int(log2(length(ψ)))

    # Reshape the state vector into an n_qubits-dimensional tensor with one
    # size-2 axis per qubit. The first axis varies fastest, so it corresponds
    # to qubit 0 (the least significant bit of the index).
    ψ_tensor = reshape(ψ, ntuple(i -> 2, n_qubits))

    # Bring the target qubit's axis to the front, contract it with the gate,
    # then restore the original axis order
    perm       = [gate_qubit + 1; setdiff(1:n_qubits, gate_qubit + 1)]
    permuted   = permutedims(ψ_tensor, perm)
    contracted = gate * reshape(permuted, 2, :)
    result     = permutedims(reshape(contracted, size(permuted)), invperm(perm))

    return vec(result)
end
apply_gate_reshaped1 (generic function with 1 method)

A more direct approach uses the block structure of the tensor product: within each block of $2^{q+1}$ consecutive amplitudes, the first half has qubit $q$ equal to 0 and the second half has it equal to 1, so we can apply the $2 \times 2$ gate to those two halves directly:

function apply_gate_reshaped2(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    ψ_new = similar(ψ)

    # Within each block of 2^(gate_qubit + 1) consecutive amplitudes, the first
    # half has the target qubit equal to 0 and the second half has it equal to 1
    chunk_size = 2^gate_qubit
    stride     = 2^(gate_qubit + 1)

    # Step through the state vector one block at a time
    for i in 1:stride:length(ψ)
        # Amplitudes with the target qubit = 0 and = 1, respectively
        α = ψ[i:i+chunk_size-1]
        β = ψ[i+chunk_size:i+stride-1]

        # Apply the 2×2 gate to each pair of sub-blocks
        ψ_new[i:i+chunk_size-1]        = gate[1,1] * α + gate[1,2] * β
        ψ_new[i+chunk_size:i+stride-1] = gate[2,1] * α + gate[2,2] * β
    end

    return ψ_new
end
apply_gate_reshaped2 (generic function with 1 method)
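
Neither reshaped version gets exercised later in the lesson, so here is a quick spot-check (a sketch, assuming the definitions above) that both agree with the naive kron-based implementation on a random 3-qubit state:

# Random normalized 3-qubit state for a correctness spot-check
ψ_rand = randn(ComplexF64, 8)
ψ_rand ./= norm(ψ_rand)

# Both differences should be ≈ 0
maximum(abs.(apply_gate_reshaped1(ψ_rand, H, 1) - apply_gate_naive(ψ_rand, H, 1)))
maximum(abs.(apply_gate_reshaped2(ψ_rand, H, 1) - apply_gate_naive(ψ_rand, H, 1)))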

Approach 2: Bit manipulation

A more efficient approach uses bit manipulation to directly compute which elements of the state vector need to be modified:

function apply_gate_shifting1(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    n = length(ψ)
    ψ_new = copy(ψ)

    for i in 0:(n-1)
        # Check if the gate_qubit is 0 or 1 in the binary representation of i
        if (i >> gate_qubit) & 1 == 0
            # The qubit is 0, so we need to find the corresponding 1 state
            j = i | (1 << gate_qubit)  # Set the gate_qubit to 1

            # Apply the gate
            α = ψ[i+1]  # +1 because Julia is 1-indexed
            β = ψ[j+1]

            ψ_new[i+1] = gate[1,1] * α + gate[1,2] * β
            ψ_new[j+1] = gate[2,1] * α + gate[2,2] * β
        end
    end

    return ψ_new
end
apply_gate_shifting1 (generic function with 1 method)

Let's test this implementation:

ψ_test = zeros(ComplexF64, 8)
ψ_test[1] = 1.0
1.0
# Apply X gate to qubit 1
ψ_result = apply_gate_shifting1(ψ_test, X, 1)
8-element Vector{ComplexF64}:
 0.0 + 0.0im
 0.0 + 0.0im
 1.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
# Compare with naive implementation
ψ_naive = apply_gate_naive(ψ_test, X, 1)
8-element Vector{ComplexF64}:
 0.0 + 0.0im
 0.0 + 0.0im
 1.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
# Check if they're the same
maximum(abs.(ψ_result - ψ_naive))
0.0

Great! Now let's benchmark the performance:

@b apply_gate_shifting1(ψ_large, X, 4)
820.636 ns (3 allocs: 4.070 KiB)

This is much faster than the naive approach! But we can do even better.

Optimized bit manipulation

We can streamline the bit manipulation by computing the bit mask once, outside the loop, and testing the target bit with a single AND:

function apply_gate_shifting2(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    n = length(ψ)
    ψ_new = copy(ψ)

    mask = 1 << gate_qubit

    for i in 0:(n-1)
        if (i & mask) == 0  # Only process when the target qubit is 0
            j = i | mask    # Corresponding index with target qubit = 1

            α = ψ[i+1]
            β = ψ[j+1]

            ψ_new[i+1] = gate[1,1] * α + gate[1,2] * β
            ψ_new[j+1] = gate[2,1] * α + gate[2,2] * β
        end
    end

    return ψ_new
end
apply_gate_shifting2 (generic function with 1 method)
@b apply_gate_shifting2(ψ_large, X, 4)
782.138 ns (3 allocs: 4.070 KiB)

We can also restructure the loop so that it only visits the indices it needs, in linear memory order: the per-element bit test disappears and the contiguous inner loop is friendlier to SIMD vectorization:

function apply_gate_shifting_linear(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    n = length(ψ)
    ψ_new = copy(ψ)

    # Process in chunks to utilize vectorization
    chunk_size = 2^(gate_qubit + 1)
    lower_chunk = 2^gate_qubit

    for start in 0:chunk_size:(n-1)
        # Process the chunk
        for i in start:(start + lower_chunk - 1)
            if i + lower_chunk < n
                j = i + lower_chunk

                α = ψ[i+1]
                β = ψ[j+1]

                ψ_new[i+1] = gate[1,1] * α + gate[1,2] * β
                ψ_new[j+1] = gate[2,1] * α + gate[2,2] * β
            end
        end
    end

    return ψ_new
end
apply_gate_shifting_linear (generic function with 1 method)
@b apply_gate_shifting_linear(ψ_large, X, 4)
802.971 ns (3 allocs: 4.070 KiB)

Threading for additional performance

For very large systems, we can use threading to parallelize the computation:

function apply_gate_threaded(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    n = length(ψ)
    ψ_new = copy(ψ)

    chunk_size = 2^(gate_qubit + 1)
    lower_chunk = 2^gate_qubit

    Threads.@threads for start in 0:chunk_size:(n-1)
        for i in start:(start + lower_chunk - 1)
            if i + lower_chunk < n
                j = i + lower_chunk

                α = ψ[i+1]
                β = ψ[j+1]

                ψ_new[i+1] = gate[1,1] * α + gate[1,2] * β
                ψ_new[j+1] = gate[2,1] * α + gate[2,2] * β
            end
        end
    end

    return ψ_new
end
apply_gate_threaded (generic function with 1 method)
@b apply_gate_threaded(ψ_large, X, 4)
2.157 μs (10 allocs: 4.680 KiB)

For an 8-qubit state the work per thread is tiny, so the overhead of spawning and scheduling tasks dominates and the threaded version ends up slower than the serial one; threading should only start to pay off for much larger state vectors.

Performance comparison

Let's compare all our implementations:

println("Naive approach:")
@b apply_gate_naive(ψ_large, X, 4)
173.625 μs (24 allocs: 1.338 MiB)
println("Bit shifting approach:")
@b apply_gate_shifting_linear(ψ_large, X, 4)
792.556 ns (3 allocs: 4.070 KiB)
println("Threaded approach:")
@b apply_gate_threaded(ψ_large, X, 4)
2.151 μs (10 allocs: 4.680 KiB)

Expectation values

We can apply similar optimizations to expectation value calculations:

function expectation_value_optimized(ψ::Vector{ComplexF64}, observable::Matrix{ComplexF64}, observable_qubit::Int)
    n = length(ψ)
    result = 0.0

    chunk_size = 2^(observable_qubit + 1)
    lower_chunk = 2^observable_qubit

    for start in 0:chunk_size:(n-1)
        for i in start:(start + lower_chunk - 1)
            if i + lower_chunk < n
                j = i + lower_chunk

                α = ψ[i+1]
                β = ψ[j+1]

                # Compute the expectation value contribution
                result += real(conj(α) * observable[1,1] * α +
                              conj(α) * observable[1,2] * β +
                              conj(β) * observable[2,1] * α +
                              conj(β) * observable[2,2] * β)
            end
        end
    end

    return result
end
expectation_value_optimized (generic function with 1 method)
# Test the optimized expectation value
expectation_value_optimized(ψ_large, Z, 4)
1.0
# Compare with naive implementation
expectation_value_naive(ψ_large, Z, 4)
1.0
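
So far the expectation values have only been checked on a computational basis state. One more check (a sketch reusing the gate functions above): putting the target qubit into an equal superposition should give $\langle Z \rangle \approx 0$ and $\langle X \rangle \approx 1$, and the optimized and naive versions should agree:

# Put qubit 4 into the |+⟩ state
ψ_plus = apply_gate_shifting_linear(ψ_large, H, 4)

expectation_value_optimized(ψ_plus, Z, 4)   # ≈ 0.0
expectation_value_optimized(ψ_plus, X, 4)   # ≈ 1.0
expectation_value_naive(ψ_plus, X, 4)       # should match the optimized value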

Multi-qubit gates

For multi-qubit gates (like CNOT), we need to extend our approach:

function apply_cnot(ψ::Vector{ComplexF64}, control_qubit::Int, target_qubit::Int)
    n = length(ψ)
    ψ_new = copy(ψ)

    control_mask = 1 << control_qubit
    target_mask = 1 << target_qubit

    for i in 0:(n-1)
        # Only apply if control qubit is 1
        if (i & control_mask) != 0
            # Find the index with target qubit flipped
            j = i ⊻ target_mask  # XOR flips the target bit

            # Swap the amplitudes
            ψ_new[i+1] = ψ[j+1]
            ψ_new[j+1] = ψ[i+1]
        end
    end

    return ψ_new
end
apply_cnot (generic function with 1 method)
# Test CNOT gate
ψ_cnot_test = zeros(ComplexF64, 4)
ψ_cnot_test[3] = 1.0  # |10⟩ state
1.0
# Apply CNOT with control=1, target=0
ψ_cnot_result = apply_cnot(ψ_cnot_test, 1, 0)
4-element Vector{ComplexF64}:
 0.0 + 0.0im
 0.0 + 0.0im
 0.0 + 0.0im
 1.0 + 0.0im
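
The same bit-mask pattern generalizes beyond CNOT. Here is a hedged sketch (not part of the original lesson) of a controlled version of an arbitrary single-qubit gate; unlike apply_cnot above, which harmlessly visits each swap pair twice because it always reads from the untouched ψ, this version processes each pair exactly once, and it reduces to CNOT when the gate is X:

# Sketch: apply a single-qubit gate to target_qubit only when control_qubit is 1
function apply_controlled_gate(ψ::Vector{ComplexF64}, gate::Matrix{ComplexF64},
                               control_qubit::Int, target_qubit::Int)
    n = length(ψ)
    ψ_new = copy(ψ)

    control_mask = 1 << control_qubit
    target_mask  = 1 << target_qubit

    for i in 0:(n-1)
        # Act only when the control bit is 1 and the target bit is 0,
        # so each (target = 0, target = 1) pair is handled exactly once
        if (i & control_mask) != 0 && (i & target_mask) == 0
            j = i | target_mask

            α = ψ[i+1]
            β = ψ[j+1]

            ψ_new[i+1] = gate[1,1] * α + gate[1,2] * β
            ψ_new[j+1] = gate[2,1] * α + gate[2,2] * β
        end
    end

    return ψ_new
end

# Reproduces the CNOT test above: |10⟩ becomes |11⟩
apply_controlled_gate(ψ_cnot_test, X, 1, 0)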

Sampling from quantum states

Often in quantum simulation, we need to sample from the probability distribution defined by the state vector:

function naive_sample(ψ::Vector{ComplexF64}, n_shots::Int)
    probabilities = abs2.(ψ)
    samples = Int[]

    for _ in 1:n_shots
        r = rand()
        cumulative = 0.0

        for (i, p) in enumerate(probabilities)
            cumulative += p
            if r <= cumulative
                push!(samples, i-1)  # Convert to 0-indexed
                break
            end
        end
    end

    return samples
end
naive_sample (generic function with 1 method)
# Test sampling
samples = naive_sample(ψ_large, 1000)
1000-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
# Count the frequency of each outcome
using StatsBase
countmap(samples)
Dict{Int64, Int64} with 1 entry:
  0 => 1000

For better performance with many samples, we can use the alias method:

function alias_sample(ψ::Vector{ComplexF64}, n_shots::Int)
    probabilities = abs2.(ψ)

    # Create alias table
    alias_table = AliasTables.AliasTable(probabilities)

    # Sample from the alias table
    samples = [rand(alias_table) - 1 for _ in 1:n_shots]  # Convert to 0-indexed

    return samples
end
alias_sample (generic function with 1 method)
# Test alias sampling
alias_samples = alias_sample(ψ_large, 1000)
1000-element Vector{Int64}:
 0
 0
 0
 0
 0
 0
 0
 0
 0
 0
 ⋮
 0
 0
 0
 0
 0
 0
 0
 0
 0
# Compare performance
@b naive_sample(ψ_large, 1000)
4.138 μs (10 allocs: 23.867 KiB)
@b alias_sample(ψ_large, 1000)
2.553 μs (8 allocs: 13.992 KiB)
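
Both samplers have so far only seen a state that is entirely $|0...0\rangle$, so every shot returns 0. As a sanity check on a less trivial distribution (a sketch reusing the functions defined above), we can put one qubit into superposition and confirm the two samplers agree statistically:

# Put qubit 4 of the 8-qubit register into an equal superposition; half the
# probability mass moves from basis state 0 to basis state 2^4 = 16
ψ_super = apply_gate_shifting_linear(ψ_large, H, 4)

# Both should report outcomes 0 and 16 with roughly equal counts (exact numbers vary per run)
countmap(naive_sample(ψ_super, 1000))
countmap(alias_sample(ψ_super, 1000))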

Mixed states and density matrices

For mixed states, we need to work with density matrices instead of state vectors:

function apply_gate_density_matrix(ρ::Matrix{ComplexF64}, gate::Matrix{ComplexF64}, gate_qubit::Int)
    n_qubits = Int(log2(size(ρ, 1)))

    # Create the full gate matrix
    all_gates = [I for qubit in 1:n_qubits]
    all_gates[n_qubits - gate_qubit] = gate
    full_gate = reduce(kron, all_gates)

    # Apply the gate: ρ_new = U * ρ * U†
    return full_gate * ρ * full_gate'
end
apply_gate_density_matrix (generic function with 1 method)
# Create a mixed state (50% |0⟩ and 50% |1⟩)
ρ_mixed = zeros(ComplexF64, 2, 2)
ρ_mixed[1,1] = 0.5  # |0⟩⟨0|
ρ_mixed[2,2] = 0.5  # |1⟩⟨1|
0.5
# Apply Hadamard gate
ρ_mixed_h = apply_gate_density_matrix(ρ_mixed, H, 0)
2×2 Matrix{ComplexF64}:
 0.5+0.0im  0.0+0.0im
 0.0+0.0im  0.5+0.0im
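
The maximally mixed state is invariant under any unitary, which is why the Hadamard leaves $\rho$ unchanged. As a short sketch (the function name is ours, not part of the lesson), expectation values carry over to density matrices via $\langle O \rangle = \mathrm{tr}(\rho O)$, using the same kron construction as above:

# Sketch: single-qubit expectation value on a density matrix, ⟨O⟩ = tr(ρ O)
function expectation_value_density_matrix(ρ::Matrix{ComplexF64}, observable::Matrix{ComplexF64}, observable_qubit::Int)
    n_qubits = Int(log2(size(ρ, 1)))
    all_observables = [I for qubit in 1:n_qubits]
    all_observables[n_qubits - observable_qubit] = observable
    full_observable = reduce(kron, all_observables)
    return real(tr(ρ * full_observable))
end

# ⟨Z⟩ of the maximally mixed single-qubit state is 0, before and after the Hadamard
expectation_value_density_matrix(ρ_mixed, Z, 0)
expectation_value_density_matrix(ρ_mixed_h, Z, 0)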

Putting it all together: Circuit simulation

Let's create a simple function to simulate a quantum circuit:

function simulate_circuit(initial_state::Vector{ComplexF64}, gates::Vector{Tuple{Matrix{ComplexF64}, Int}})
    ψ = copy(initial_state)

    for (gate, qubit) in gates
        ψ = apply_gate_shifting_linear(ψ, gate, qubit)
    end

    return ψ
end
simulate_circuit (generic function with 1 method)
# Create a Bell state circuit: H on qubit 1, then CNOT(1,0)
initial_state = zeros(ComplexF64, 4)
initial_state[1] = 1.0  # |00⟩

# Define the circuit
circuit = [(H, 1)]  # Apply H to qubit 1
1-element Vector{Tuple{Matrix{ComplexF64}, Int64}}:
 ([0.7071067811865475 + 0.0im 0.7071067811865475 + 0.0im; 0.7071067811865475 + 0.0im -0.7071067811865475 + 0.0im], 1)
# Run the first part of the circuit
ψ_after_h = simulate_circuit(initial_state, circuit)
4-element Vector{ComplexF64}:
 0.7071067811865475 + 0.0im
                0.0 + 0.0im
 0.7071067811865475 + 0.0im
                0.0 + 0.0im
# Now apply CNOT manually (since we need a specialized function)
bell_state = apply_cnot(ψ_after_h, 1, 0)
4-element Vector{ComplexF64}:
 0.7071067811865475 + 0.0im
                0.0 + 0.0im
                0.0 + 0.0im
 0.7071067811865475 + 0.0im
# Verify this is a Bell state
println("Bell state amplitudes:")
for (i, amp) in enumerate(bell_state)
    if abs(amp) > 1e-10
        println("State |$(bitstring(i-1)[end-1:end])⟩: ", amp)
    end
end
Bell state amplitudes:
State |00⟩: 0.7071067811865475 + 0.0im
State |11⟩: 0.7071067811865475 + 0.0im
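
To close the loop between circuit simulation and sampling, we can sample from the Bell state with the alias sampler from earlier; roughly half the shots should come out as outcome 0 ($|00\rangle$) and half as outcome 3 ($|11\rangle$). A small sketch, reusing alias_sample and countmap:

# Sample 1000 shots from the Bell state; outcomes 0 (|00⟩) and 3 (|11⟩) should
# each appear roughly 500 times (exact counts vary from run to run)
bell_samples = alias_sample(bell_state, 1000)
countmap(bell_samples)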

Performance profiling

Let's profile our optimized gate application to see where time is spent:

# Profile the optimized gate application
Profile.clear()
@profile for i in 1:1000
    apply_gate_shifting_linear(ψ_large, X, 4)
end
# Display profiling results
Profile.print()
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
┌ Warning: There were no samples collected.
Run your program longer (perhaps by running it multiple times),
or adjust the delay between samples with `Profile.init()`.
@ Profile /opt/hostedtoolcache/julia/1.11.6/x64/share/julia/stdlib/v1.11/Profile/src/Profile.jl:1246
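
The 1000 calls at roughly 800 ns each finish in well under a millisecond, which is below the profiler's default sampling interval (typically 1 ms on Unix), so no samples are collected. Two hedged ways around this, sketched below: give the profiler much more work, or shrink the sampling delay with Profile.init:

# Option 1: run the workload long enough for the sampler to catch it
Profile.clear()
@profile for i in 1:1_000_000
    apply_gate_shifting_linear(ψ_large, X, 4)
end
Profile.print()

# Option 2: sample more frequently (delay is in seconds)
Profile.init(delay = 1e-5)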

Final thoughts

This tutorial has shown how to implement efficient quantum state vector simulation in Julia. Key takeaways:

  1. Avoid creating large matrices: Use bit manipulation instead of tensor products
  2. Vectorize operations: Process data in chunks when possible
  3. Use threading: For large systems, parallel processing can provide significant speedups
  4. Profile your code: Use Julia's profiling tools to identify bottlenecks
  5. Consider specialized algorithms: For sampling, use techniques like the alias method

The optimized implementations we've developed can handle reasonably large quantum systems (up to ~20-25 qubits) efficiently, making them suitable for many quantum algorithm simulations and educational purposes.

For production quantum simulation, you would typically use a specialized library such as Yao.jl (or a Python framework like PennyLane), but understanding these fundamental algorithms helps you use such libraries more effectively and debug performance issues.