From a Forgotten Process to a Custom CoreDNS Plugin: A DNS Deep Dive

by bilxio, written with Claude (Anthropic)
DNS is the internet's phone book — invisible until something goes wrong.
This post traces a journey from a Mac running too hot to writing a custom CoreDNS plugin, auditing a website redirect with a headless browser, and uncovering how GeoDNS, proxy clients, and DNS resolution interact in ways that none of their documentation fully explains. If you just want working DNS split routing on macOS, skip to the end. If you're curious about the debugging path, the Go plugin internals, or the investigation methodology, read on.
Prologue
CoreDNS had been running on my machine for a long time. Long enough that I'd forgotten why it was there.
The original motivation was simple: DNS privatization. Route every domain query through something I control, log it, keep the data for auditing and analysis — know what my devices are talking to. This was before I even had a proxy client set up.
Over time the setup degraded. Configuration drifted, the DNS got quietly removed from the main network path, but the process kept running. I'd set up both supervisor and launchd for redundancy at the time, so there were actually two CoreDNS instances idling in the background. Nobody cleaned them up because they weren't causing any obvious problems.
Then the Mac started running hot. Temperature creeping up, CPU throttling, fans spinning. I opened htop and found CoreDNS consistently near the top of the process list. Checked the binary: it was an old amd64 build. On an ARM Mac. Rosetta had been translating every instruction, silently, for months.
You take care of the tools you rely on.
I was about to delete the whole thing — but before I did, I remembered why I'd installed it. So instead I pulled the latest CoreDNS source, compiled it for arm64, and started over. The old zone-block approach to domain split routing was causing high memory and CPU usage with 100,000+ entries. That problem needed a proper solution.
Then I wrote a splitdns plugin. Then I wired it into the proxy. Then a website that had been misbehaving for a long time suddenly started working correctly — and that unexpected fix was more unsettling than the original problem.
After that, you'll see.
The Starting Point: Same Domain, Two Browsers, Two Outcomes
Chrome started failing on certain sites with ERR_NAME_NOT_RESOLVED. The same URLs loaded fine in Safari.
My first instinct was DNS poisoning. A packet capture told a different story: Chrome wasn't using the system DNS at all. It had its own DoH (DNS over HTTPS) configuration, sending queries directly to my local CoreDNS instance on port 5353. CoreDNS returned a real AAAA record (IPv6 address). Chrome tried to connect over IPv6. My machine has no IPv6 egress. Connection failed.
Safari used the system DNS, which Loon (a proxy client) had intercepted. Loon's FakeIP mechanism synthesized two fake records regardless of what the real domain has: a fake A record in the 198.0.8.x pool and a fake AAAA record in the fd27:712::/32 pool. Both fake addresses point into Loon's TUN. Safari connected through one of those, Loon caught the connection, looked up the real target from its internal mapping table, and routed it through the proxy tunnel. Success.
Same domain. Two browsers. Completely different paths.
This turned out not to be Chrome-specific. Edge with a custom DoH endpoint behaved identically. The pattern is general: any application that manages its own DNS resolution bypasses Loon's interception, which breaks the precondition FakeIP depends on.
Mapping the Network

DNS Architecture — macOS Loon + CoreDNS + Tailscale
Loon creates a TUN interface that captures IPv4 traffic and intercepts system DNS queries, redirecting them to its own resolver. CoreDNS runs alongside it, exposing three endpoints: standard DNS on port 53, DoT on 853, and DoH on 5353.
The two resolution paths:
Safari / system apps
→ system DNS (intercepted by Loon → 198.19.0.3)
→ Loon FakeIP: synthesizes fake A (198.0.8.x) + fake AAAA (fd27:712::/32)
→ IPv4 and IPv6 traffic both enter TUN → routed through proxy tunnel ✓
Chrome (custom DoH)
→ https://127.0.0.1:5353/dns-query → CoreDNS
→ Cloudflare → real AAAA record returned
→ Chrome connects directly over IPv6 → no IPv6 egress → fails ✗
A detail worth clarifying: Loon's FakeIP is not simply "suppress AAAA, return a fake IPv4." Testing with a domain that has only a real AAAA record — ipv6.ipchu.com (2a01:4f8:1c1e:6a32::1) — shows Loon returning both a fake A 198.0.8.91 and a fake AAAA fd27:712::c600:85b. Loon maintains two fake address pools: 198.0.8.0/22 for IPv4 and fd27:712::/32 for IPv6. Both pools route into TUN. The result: neither address family ever reaches the physical interface with a real destination IP. Every connection is intercepted, dual-stack.
Quick Fix: Block AAAA Records at the DoH Endpoint
The fastest fix: rewrite AAAA queries into A queries at the DoH endpoint, so Chrome never receives an AAAA record and falls back to IPv4:
https://.:5353 {
tls /path/to/cert.pem /path/to/key.pem
rewrite type AAAA A
forward . 127.0.0.1:53
}
A bandage, not a cure — but it works. Following this path to its conclusion reveals the real problem: AAAA records are a symptom. The underlying issue is that DNS split routing was never properly wired up.
The Real Problem: DNS Split Routing at Scale
The underlying need: domestic domains should resolve via domestic DNS; everything else via an overseas resolver.
CoreDNS's forward plugin with zone blocks can handle this. The problem is scale — a typical China domain list contains over 100,000 entries. Loading 100k zone blocks is technically possible but painful: slow startup, high memory use, high CPU, and a cumbersome update process.
What I wanted: a plugin that reads a domain list file and routes each query in O(1).
CoreDNS's plugin system is clean. Each plugin is a Go struct implementing plugin.Handler:
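import (
	"context"

	"github.com/coredns/coredns/plugin"
	"github.com/miekg/dns"
)

type SplitDNS struct {
	Next    plugin.Handler
	domains map[string]struct{} // hash set, O(1) lookup
	match   []string            // upstreams for matched domains
	def     []string            // upstreams for everything else
}

// ServeDNS routes each query to the matched or the default upstreams.
func (s SplitDNS) ServeDNS(ctx context.Context, w dns.ResponseWriter, r *dns.Msg) (int, error) {
	// A message without a question has nothing to route; pass it along.
	if len(r.Question) == 0 {
		return plugin.NextOrFailure("splitdns", s.Next, ctx, w, r)
	}
	name := r.Question[0].Name
	if s.inList(name) {
		return s.forward(ctx, w, r, s.match)
	}
	return s.forward(ctx, w, r, s.def)
}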
The domain list loads into a hash set at startup. Each query triggers a single map lookup. Register in plugin.cfg, recompile CoreDNS:
.:53 {
    splitdns {
        list /path/to/china_domains.txt
        match tls://223.5.5.5 tls://223.6.6.6 {
            tls_servername dns.alidns.com
        }
        default tls://1.1.1.1 tls://1.0.0.1 {
            tls_servername one.one.one.one
        }
    }
    cache {
        success 9984 300 30
    }
    log
}
Domestic domains route to Alibaba DNS; everything else to Cloudflare. Both paths use DoT.
The Loop Closes: Pointing Loon at CoreDNS
With splitdns running, the routing logic inside CoreDNS was correct — but Loon's own DNS handling was interfering. Loon does some internal DNS resolution and applies its own routing rules, but the results were unreliable.
The concrete symptom: aliyun.com (Alibaba Cloud's Chinese portal) had been redirecting to alibabacloud.com — the international site — for a long time. Login sessions, billing, the control panel: all on the wrong side. This kind of geographic drift is insidious because it looks like a misbehaving website, not a DNS problem.
The fix was simpler than expected: configure Loon's upstream DNS to point at CoreDNS (127.0.0.1).
macOS system / browsers (no individual DoH config needed)
→ DNS query
→ Loon intercepts
→ delegates to CoreDNS at 127.0.0.1:53
→ splitdns: domestic → Alibaba DNS, overseas → Cloudflare
→ FakeIP operates correctly
→ traffic exits through proxy tunnel ✓
No need to configure DoH in individual browsers. All DNS traffic gets intercepted by Loon, delegated to CoreDNS. But why the aliyun.com redirect happened in the first place — that took a proper investigation to answer.
CoreDNS and Loon: A Symbiotic Relationship
After connecting them, overseas DNS resolution was taking nearly a second per query. Understanding why requires understanding the dependency between the two systems.
Well-known public DNS resolvers — 1.1.1.1, 8.8.8.8, 9.9.9.9 — including their DoH and DoT variants, are not directly reachable from mainland China. CoreDNS's outbound queries to these servers have to travel through Loon's proxy tunnel. That tunnel hop is the source of the ~1s latency. This isn't a configuration mistake; it's the network reality.
Which creates a circular dependency:
- CoreDNS depends on Loon: overseas DNS queries need Loon's tunnel to reach 1.1.1.1
- Loon depends on CoreDNS: Loon delegates DNS upstream to CoreDNS to get correct routing decisions
Each relies on the other being functional. Removing either one breaks both.
There's also a subtler problem at startup. CoreDNS must use IP-address-based upstream endpoints, not hostnames:
# ✓ Correct: IP-direct, no DNS resolution needed to connect
default tls://1.1.1.1 tls://1.0.0.1 {
tls_servername one.one.one.one
}
# ✗ Risky: resolving a hostname requires DNS — which isn't running yet
# default tls://one.one.one.one
Several domestic DoH providers (notably Tencent's doh.pub) no longer accept IP-direct connections — only hostname-based access. This makes them unusable as CoreDNS upstreams, because CoreDNS has no DNS available to resolve those hostnames at startup. Alibaba's 223.5.5.5 still supports IP-direct DoT, which is why it remains irreplaceable in this architecture.
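One way to guard against the hostname trap is to check upstream addresses before they go into the config. The helper below is a hypothetical sketch, not part of CoreDNS or the splitdns plugin; the name isIPDirect and the accepted address shapes are assumptions.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
	"strings"
)

// isIPDirect reports whether an upstream such as "tls://1.1.1.1" or
// "223.5.5.5:53" can be dialed without a prior DNS lookup.
func isIPDirect(upstream string) bool {
	host := upstream
	// Drop a scheme such as "tls://" or "https://".
	if i := strings.Index(host, "://"); i >= 0 {
		host = host[i+3:]
	}
	// Drop a URL path such as "/dns-query".
	if i := strings.IndexByte(host, '/'); i >= 0 {
		host = host[:i]
	}
	// Strip an optional port; SplitHostPort also unwraps "[::1]:853".
	if h, _, err := net.SplitHostPort(host); err == nil {
		host = h
	}
	host = strings.Trim(host, "[]")
	_, err := netip.ParseAddr(host)
	return err == nil
}

func main() {
	for _, u := range []string{"tls://1.1.1.1", "tls://one.one.one.one", "223.5.5.5:53"} {
		fmt.Printf("%-24s ip-direct=%v\n", u, isIPDirect(u))
	}
}
```

A check like this only tells you the address needs no resolution. Whether the provider still accepts connections made that way is a separate question, as the doh.pub case shows.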
The actual startup sequence is implicit and undocumented: 223.5.5.5 is reachable directly (domestic, no tunnel needed), so CoreDNS can answer domestic queries immediately. Loon uses those answers to establish the proxy tunnel. Once the tunnel is up, CoreDNS can reach 1.1.1.1 for overseas queries. The serve_stale cache covers the transition window.
Cache is what makes this tolerable. The tunnel latency is a fixed cost per cold query — but most domains are queried repeatedly:
cache {
success 9984 300 30 # successful responses cached up to 5 minutes
prefetch 5 10m 20% # high-frequency records refreshed in the background
serve_stale 1h # stale cache served if upstream is unreachable
}
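prefetch is the most valuable setting here. With prefetch 5 10m 20%, a record counts as popular once it has been queried at least 5 times with no gap of 10 minutes or more between queries; when a popular record's remaining TTL drops below 20% of its original value, CoreDNS refreshes it in the background. The next query hits cache instantly with no tunnel latency visible to the user.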
To make all of this observable — cache hit rates, upstream latency distributions, whether split routing is actually sending queries to the right upstream — a companion dnslog visualization tool parses CoreDNS access logs and turns them into dashboards. "Is the DNS split routing working?" stopped being a guess.
Investigating the Redirect: GeoDNS and the Proxy's Hidden DNS Path
When the aliyun.com redirect problem disappeared after connecting Loon to CoreDNS, I couldn't just accept that it was fixed. It had been broken for a long time. Something specific had changed. What was it?
Manual DNS queries showed that aliyun.com resolved to the same A records from both 8.8.8.8 and 223.5.5.5. Loon's request log showed www.aliyun.com hitting a DIRECT rule, meaning the connection bypassed the proxy and came from the local machine's real IP. If the destination IP and the client IP were the same in both configurations, why would the server respond differently?
A Playwright script audited the full page load under both DNS configurations, capturing every request, redirect, and response header:
- splitdns (working): 0 HTTP redirects, final URL www.aliyun.com
- overseas DNS (broken): first request triggered a document-level 302 to alibabacloud.com/en?_p_lc=1
type: document is the key detail — this wasn't a tracking pixel or a JS redirect. The server returned a 302 on the main document request itself. The response also set a cookie: alicloud_deploy_r_s=sg. Singapore. The CDN node that handled the request was in Singapore. Under the working configuration, the response included via: cache3.cn6483 — a Chinese CDN node.
Same machine. Same client IP. Different CDN nodes.
The answer was in a DNS query:
$ dig www.aliyun.com @1.1.1.1
www.aliyun.com. CNAME www-jp-de-intl-adns.aliyun.com. ← "intl" — international path
www-jp-de-intl-adns.aliyun.com. CNAME ...gds.alibabadns.com.
...
xjp-adns.aliyun.com.vipgds.alibabadns.com. A 47.88.198.68
xjp-adns.aliyun.com.vipgds.alibabadns.com. A 47.88.251.189
;; Query time: 1118 msec ← tunnel latency; this query went through the proxy
Compare to @223.5.5.5: direct A records at 106.11.x.x Chinese IPs, no CNAME chain at all.
Alibaba uses GeoDNS: the authoritative DNS returns a different CNAME chain depending on where the query originates. A query to 1.1.1.1 that leaves through a Singapore proxy exit looks, from the authoritative server's side, like a query from the Singapore region, so it gets the intl (international) chain, which resolves to Singapore CDN IPs. Loon then makes a DIRECT connection to those Singapore IPs. DIRECT is accurate, but the destination is already in Singapore. The CDN node sets alicloud_deploy_r_s=sg regardless of the client's IP.
Loon has two independent layers:
Layer 1: Routing decision (visible in logs)
www.aliyun.com → DIRECT rule matched ✓
Layer 2: Internal DNS resolution (silent)
FakeIP requires Loon to resolve the real IP internally
→ this query travels through the proxy tunnel
→ 1.1.1.1 receives it from a Singapore exit
→ returns the international CNAME chain → Singapore IPs
→ DIRECT connection lands on Singapore CDN → 302
The logs only show layer 1. Layer 2 is invisible, which is why the behavior looked inexplicable.
This maps to a design problem that v2ray addresses explicitly with domainStrategy: should the routing engine resolve domain names to IPs before making routing decisions? AsIs uses only the domain name; IPIfNonMatch falls back to IP resolution when domain rules don't match; IPOnDemand resolves immediately when IP rules are encountered. But that governs the routing decision phase.
The connection-establishment phase is separate. sing-box handles it at the DNS configuration level — binding different DNS servers to different outbounds:
"dns": {
"rules": [
{ "outbound": "direct", "server": "domestic" },
{ "outbound": "proxy", "server": "overseas" }
]
}
Pointing Loon at CoreDNS achieves the same alignment from the outside: aliyun.com always resolves through 223.5.5.5 (domestic, direct connection, Chinese geo), Loon gets a Chinese IP, the DIRECT connection lands on a Chinese CDN node. Loon's internal DNS path no longer has an opportunity to drift.
This Story Doesn't Have a Clean Ending
One dependency is quietly tightening. Alibaba's 223.5.5.5 DoT/DoH service is reducing anonymous usage quotas and moving toward a real-name registration model — after registering an account, access is via a personalized endpoint like tls://<identifier>.alidns.com. The IP-direct anonymous channel may eventually close.
This is a direct hit to the current architecture. 223.5.5.5 is irreplaceable for two reasons: it supports IP-direct connections (solving the startup dependency problem) and it returns China-optimized GeoDNS results. If IP-direct access closes, two paths remain:
Option A: Register, accept the real-name requirement, use the authenticated endpoint.
match tls://<uid>.alidns.com {
tls_servername <uid>.alidns.com
}
Keeps DoT encryption and accurate GeoDNS routing. The cost is identity linkage.
Option B: Fall back to plaintext UDP for domestic DNS.
match 223.5.5.5:53
No TLS handshake, no IP-direct restriction, the startup dependency problem disappears. The trade-off is that queries are unencrypted. For domestic domains over a domestic network, this is often acceptable in practice — but it's a step backward.
Both paths work. Which one to take depends on your tolerance for identity linkage versus your concern about unencrypted DNS. What this situation illustrates is that infrastructure dependencies tighten gradually, not all at once. When designing systems that depend on external services, replaceability matters.
Distribution: GoReleaser + Homebrew Tap
Working for personal use is one thing. Sharing it requires a distribution story. CoreDNS is a single Go binary — ideal for pre-compiled distribution.
GoReleaser handles the build and publish pipeline:
builds:
  - goos: [darwin]
    goarch: [amd64, arm64]
universal_binaries: # merge into a single macOS universal binary
  - replace: true
release:
  github:
    owner: bilxio
    name: coredns
brews:
  - name: coredns-bilxio
    repository:
      owner: bilxio
      name: homebrew-tap
Tag a commit, run goreleaser release --clean. A few minutes later the binary is in GitHub Releases and the Homebrew formula is updated automatically.
brew tap bilxio/tap
brew install coredns-bilxio
No Go environment, no build toolchain required.
Details Worth Noting
GoReleaser infers the release target from the origin remote. If your project is a fork, origin may still point to upstream; GoReleaser will then try to publish there and get a 403. Always specify the target explicitly:
release:
  github:
    owner: your-username
    name: your-repo
Homebrew Tap repos must be named homebrew-<something>. brew tap bilxio/tap maps to github.com/bilxio/homebrew-tap. Homebrew strips the prefix. One tap repo can host multiple formulae.
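The naming rule is mechanical enough to write down. The function below is an illustrative sketch, not Homebrew's code; tapRepo is a hypothetical name, and it assumes the tap lives on GitHub.

```go
package main

import (
	"fmt"
	"strings"
)

// tapRepo maps a tap name such as "bilxio/tap" to the GitHub repository
// that backs it. The "homebrew-" prefix is added when it is missing.
func tapRepo(tap string) (string, error) {
	user, name, ok := strings.Cut(tap, "/")
	if !ok || user == "" || name == "" {
		return "", fmt.Errorf("tap %q is not in user/name form", tap)
	}
	name = strings.TrimPrefix(name, "homebrew-")
	return "github.com/" + user + "/homebrew-" + name, nil
}

func main() {
	repo, err := tapRepo("bilxio/tap")
	if err != nil {
		panic(err)
	}
	fmt.Println(repo)
}
```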
go generate ./... breaks in subpackages when the generator script lives at the project root. CoreDNS has //go:generate go run owners_generate.go in plugin/chaos/setup.go, but the script only exists at the root. Remove the go generate ./... before-hook from .goreleaser.yml.
Postscript: Adding Tailscale
Tailscale joined the stack later. On macOS it creates its own utun interface for the 100.64.0.0/10 CGNAT range and provides 100.100.100.100 (MagicDNS) as a local resolver for internal hostnames.
Two conflicts with Loon needed fixing. A bypass-tun entry for 100.100.100.100/32 was injecting a static route via the physical gateway, outranking Tailscale's own routing table entry — MagicDNS received no queries at all. Separately, FakeIP was returning fake addresses for *.ts.net hostnames, making Tailscale peer names unusable. Both fixes are straightforward: remove 100.100.100.100/32 from bypass-tun (Tailscale's routes take over naturally), and add real-ip = *.ts.net to Loon's [General] config. A CoreDNS zone block delegates Tailscale DNS to MagicDNS:
tail38ecd.ts.net {
forward . 100.100.100.100
cache 30
}
CoreDNS now handles three routing paths — domestic domains via Alibaba DNS, overseas via Cloudflare, Tailscale internal hostnames via MagicDNS — all through the same 127.0.0.1:53 interface.
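The precedence between the three paths can be sketched as a single function. In the real setup the Tailscale zone is a separate CoreDNS server block and the other two paths live inside splitdns, so this is a conceptual model only; pickUpstream and hasSuffixDomain are hypothetical names, and the upstream strings are taken from the configs in this post.

```go
package main

import (
	"fmt"
	"strings"
)

// Upstreams for the three paths, as used in the configs in this post.
const (
	magicDNS = "100.100.100.100"
	domestic = "tls://223.5.5.5"
	overseas = "tls://1.1.1.1"
)

// hasSuffixDomain reports whether name equals zone or is a subdomain of it.
func hasSuffixDomain(name, zone string) bool {
	return name == zone || strings.HasSuffix(name, "."+zone)
}

// pickUpstream applies the most specific rule first: the Tailscale zone,
// then the domestic list, then the default.
func pickUpstream(name, tailnet string, domesticSet map[string]struct{}) string {
	name = strings.TrimSuffix(strings.ToLower(name), ".")
	if hasSuffixDomain(name, tailnet) {
		return magicDNS
	}
	for n := name; n != ""; {
		if _, ok := domesticSet[n]; ok {
			return domestic
		}
		_, rest, found := strings.Cut(n, ".")
		if !found {
			break
		}
		n = rest
	}
	return overseas
}

func main() {
	set := map[string]struct{}{"aliyun.com": {}}
	for _, q := range []string{"host.tail38ecd.ts.net.", "www.aliyun.com.", "github.com."} {
		fmt.Printf("%-26s -> %s\n", q, pickUpstream(q, "tail38ecd.ts.net", set))
	}
}
```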
Closing Thoughts
This started with a Mac running too hot. A forgotten amd64 binary, translated by Rosetta for months. Before deleting it, a moment of recollection: why was it installed in the first place? That question led to a recompile, a splitdns plugin, a Playwright audit, a GeoDNS discovery, and a clearer picture of how proxy clients and DNS resolvers interact in ways neither documents properly.
The clearest lesson is also the most obvious one: use the right tool for the right job. Loon is a proxy client. Its core competency is traffic routing and tunnel management. DNS is something it handles well enough — but not deeply. It doesn't understand GeoDNS CNAME chains, doesn't align its internal DNS resolution path with its routing decisions, and provides no observability into what's happening.
CoreDNS does DNS. Split routing, caching, prefetch, IP-direct upstream connections, structured logging — every design decision has a clear DNS rationale. Connecting them through a single clean interface (127.0.0.1:53) gives each system the room to do what it does well.
That's the architecture. Loon handles traffic. CoreDNS handles names. Neither tries to do the other's job.
After all that work, it's hard to feel entirely satisfied. The system is clean. The understanding is thorough. But what the system does — at its core — is find a workable posture within a space that keeps narrowing.
If you're working with similar constraints:
brew tap bilxio/tap
brew install coredns-bilxio
Configuration reference in the project README. Issues welcome.