---
title: "Are there forbidden radio callsigns?"
author: "Andrew Zimolzak"
date: '2022-05-11'
output: pdf_document
---
Which, if any, 3-letter amateur radio (ham) callsign suffixes do not
exist anywhere in FCC's database (presumably because they are banned
or actively shunned)? Short answer:
> FUC, PIS, and SOS
These are the only three interesting suffixes that are never assigned.
There are multiple other, less interesting ones related to
"Q codes." *All* other suffixes are seemingly allowed, although some
potentially offensive ones are rarer than expected.
# Introduction
Ham radio operators can choose a "vanity" callsign. A usual callsign
format is "K5ABC". The "suffix" is the part of the callsign after the
number. Often the suffix is 3 letters long. This is enough letters to
make some mildly to very offensive letter combinations. So, which
3-letter combos are problematic enough that they never appear? Data
can be downloaded from FCC in a file called `l_amat.zip`.
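
As a small illustration of the callsign anatomy described above, here is a base R sketch that splits a few callsigns at the first digit. The callsigns are chosen only for their format, and the helper name `split_callsign` is hypothetical; the real analysis below uses `tidyr::separate()` on the FCC data instead.

```r
# Illustrative callsigns, chosen only to show the format
calls <- c("K5ABC", "W1AW", "KD9XYZ", "N0QRP")

# Split each callsign at its first digit: prefix before, suffix after.
split_callsign <- function(x) {
  data.frame(
    callsign = x,
    prefix = sub("[0-9].*$", "", x),       # drop the first digit and everything after
    suffix = sub("^[^0-9]*[0-9]", "", x),  # drop everything up to the first digit
    stringsAsFactors = FALSE
  )
}

parts <- split_callsign(calls)
parts$len_suf <- nchar(parts$suffix)  # suffix length, as used for filtering later
parts
```

Only rows with a suffix of at least 3 characters would survive the `len_suf >= 3` filter used later in this document.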
```{r libraries, message=FALSE, warning=FALSE}
library(here)
library(dplyr)
library(data.table)
library(tidyr)
library(knitr) # kable
library(ggplot2)
```
# Input the data and show its size
```{r reading-10sec}
# message("Reading dat (~8 sec...)")
d_1528k = fread(here('EN.dat'), header = FALSE, sep="|", quote="")
d_100k = fread(here('ENshort.dat'), header = FALSE, sep="|", quote="")
```
```{r manual-set-data-size}
# Analyze the full data set; swap in d_100k for a faster test run
X = d_1528k
dim(X)
```
```{r rename}
X %>%
rename(
tablename = V1, num = V2, callsign = V5, code = V6, lnum = V7, fullname = V8,
firstname = V9, middlename = V10, lastname = V11, suffix = V12, address = V16,
city = V17, state = V18, zip = V19, weird_addr_text = V20, misctext = V21,
zeronum = V22, othernum = V23, othercode = V24) -> renamed
```
# String processing of suffixes
```{r split-suffix-40sec}
renamed %>%
select(callsign, firstname, lastname, city, state) %>%
separate(
callsign, sep='[0-9]', into=c('prefix', 'suffix'),
remove=FALSE, extra='merge'
) %>%
mutate(len_suf = nchar(suffix)) %>%
filter(len_suf >= 3) %>%
mutate(
s1 = substr(suffix, 1, 1),
s2 = substr(suffix, 2, 2),
s3 = substr(suffix, 3, 3)
) -> presuf
dim(presuf)
dim(presuf)[1] / dim(renamed)[1]
hams_per_suff <- dim(presuf)[1] / (26 ^ 3)
hams_per_suff
```
The output above shows how many hams have a 3-character suffix, what
proportion of all hams that is, and the mean number of hams per suffix.
# Count unique suffixes
```{r distribution-8sec}
presuf %>%
arrange(suffix) %>%
group_by(suffix, s1, s2, s3) %>%
summarise(count = n()) %>%
ungroup() %>%
filter(s3 != '0') -> counts
26 ^ 3 - dim(counts)[1]
```
The above shows how many suffixes are *missing,* or not attested, out
of the expected $26^3 = 17576$.
## Side analysis: most common suffixes
```{r print-kable-top}
counts %>%
arrange(desc(count), suffix) %>%
select(suffix, count) %>%
head(n=35) %>%
kable()
```
# Never-seen suffixes
```{r expand}
counts %>%
expand(s1, s2, s3) -> expanded
left_join(expanded, counts, by=c('s1', 's2', 's3')) -> counts_all
dim(expanded)
dim(counts_all)
26^3
```
Compare the dimensions of `expanded` and `counts_all` with the expected
$26^3$; 17,576 should show up in all three. **Funny thing:**
I initially saw 18,252 rows in the `expanded` data frame. Why? That equals
$26 \cdot 26 \cdot 27$, so there must be an extra, unexpected character
in `s1`, `s2`, or `s3`. I added a `filter()` step to the pipeline, so
this is now probably **fixed.**
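
The arithmetic behind that 18,252 can be reproduced with a minimal base R sketch. It assumes the stray character is a single extra value (here the digit "0") in the third position, and it uses `expand.grid()` on toy inputs rather than `tidyr::expand()` on the real data.

```r
# Grid built from the constant LETTERS: always 26^3 rows
clean <- expand.grid(s1 = LETTERS, s2 = LETTERS, s3 = LETTERS,
                     stringsAsFactors = FALSE)
nrow(clean)

# Grid built from values where s3 has one stray 27th level (assumed to be "0")
dirty <- expand.grid(s1 = LETTERS, s2 = LETTERS, s3 = c(LETTERS, "0"),
                     stringsAsFactors = FALSE)
nrow(dirty)

# The stray level adds one extra row per (s1, s2) pair
nrow(dirty) - nrow(clean)
```

Building the grid from `LETTERS` rather than from the observed values is one possible way to make the expansion immune to stray characters in the data.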
```{r filter-na}
counts_all %>%
filter(is.na(count)) -> forbidden
dim(forbidden)[1]
26 ^ 3 - dim(counts)[1]
```
The two numbers above are the counts of never-seen suffixes, and they
should be equal. But again, weird: I first saw 765 versus 89. It so
happens that $765 - 89 = 676 = 26^2$, which is exactly the number of
extra combinations you get if `s3` is sometimes equal to the string "0".

Final weird thing: after that, I was seeing 90 in `forbidden` versus 89
missing from `counts`, even after filtering out the zero-versus-letter-O
rows. I moved the filter earlier in the workflow, and now the two
numbers match.
## Diagnostic
```{r diagnostic}
# s3 is a character column, so compare against the string '0'
presuf %>% filter(s3 == '0') %>% kable()
```
I even wonder if this is a typo in the database of "zero" replacing
"capital letter O, as in Oscar."
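
A more general version of this diagnostic is to flag any suffix that is not exactly three capital letters. The sketch below does that in base R on made-up suffixes (the vector `suffixes` is hypothetical, not FCC data).

```r
# Hypothetical suffixes; the third one ends in the digit zero, not the letter O
suffixes <- c("ABC", "BOO", "BO0", "XYZ")

# TRUE where the suffix is not made of exactly three capital letters
bad <- !grepl("^[A-Z]{3}$", suffixes)
suffixes[bad]

# For each flagged suffix, which character positions are not in A-Z?
chars <- strsplit(suffixes[bad], "")
bad_positions <- lapply(chars, function(ch) which(!ch %in% LETTERS))
bad_positions
```

Applied to the real `presuf$suffix` column, a check like this would also catch stray characters other than "0".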
# Final tables
Lots of suffixes starting with "Q" are banned. But *only three others*
are banned. Here they are.
```{r non-q}
forbidden %>% filter(s1 != 'Q') %>% kable()
```
So, somewhat surprisingly to me, all of the following are *not* banned
and exist at least once: ASS, POO, FUK. Here's a table of sketchy
suffixes, with some very common ones thrown in for comparison.
```{r sketchy-suffixes}
counts %>%
filter(suffix %in% c(
'ASS', 'POO', 'KKK', 'FUK', 'TIT', 'GOD', 'FCC', 'XXX', 'SEX', 'PEE',
'KOK', 'NSA', 'DIK', 'DIC', 'CIA', 'ZZZ', 'ZYX', 'USA', 'WTF'
)) %>%
arrange(desc(count)) %>%
mutate(obs_over_exp = count / hams_per_suff) %>%
kable()
```
"POO" is one of the few mildly rude ones that is actually *more*
popular than expected. Yikes at the substring with 68 callsigns
containing it.
## Boring Q suffixes
```{r q-with-rstu}
forbidden %>%
filter(s1 == 'Q') %>%
group_by(s1, s2) %>%
summarise(sum_q = n()) %>%
kable()
```
Those numbers mean you can't have *any* suffix starting with "QR,"
"QS," or "QT" (all 26 possible third letters are missing for each),
plus a few starting with "QU."
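
For reference, the fully blocked "QR," "QS," and "QT" suffixes can be enumerated directly. This is a minimal base R sketch; the names `q_stems` and `q_blocked` are made up for illustration.

```r
# Every three-letter suffix that starts with QR, QS, or QT
q_stems <- c("QR", "QS", "QT")
q_blocked <- as.vector(outer(q_stems, LETTERS, paste0))

length(q_blocked)        # 3 stems times 26 third letters
head(sort(q_blocked))
```

Familiar Q codes such as QRM and QTH fall inside this set, which fits the idea that these suffixes are withheld to avoid confusion with operating signals.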
```{r qu}
forbidden %>%
filter(s1 == 'Q' & s2 == 'U') %>%
kable()
```
## Diagnostic
```{r math}
# 3 non-Q suffixes + 26 each for QR, QS, QT + 9 starting with QU
forb_n <- 3 + 26*3 + 9
forb_n
26^3 - forb_n
dim(counts)
```
# Explore distribution of suffixes in general
```{r histo-broad}
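# qplot() is deprecated as of ggplot2 3.4.0; this is the equivalent histogram
ggplot(counts, aes(x = count)) + geom_histogram() +
geom_vline(xintercept = hams_per_suff)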
```
Interesting that it's bimodal. There's the "nearly random" component
in the middle, centered around `r hams_per_suff`. But then there are
some that have nonzero but very low counts. Suffixes that used to be
banned but are not any more, perhaps?
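
To get a feel for what the "nearly random" component alone would look like, here is a rough simulation sketch. It assumes suffixes are assigned uniformly at random, so that per-suffix counts are approximately Poisson, and it uses a placeholder mean of 80 (an assumption for illustration, not the `hams_per_suff` value computed above).

```r
set.seed(1)
lambda <- 80          # assumed placeholder for the mean hams per suffix
n_suffixes <- 26^3

# Simulated per-suffix counts under purely random assignment
sim_counts <- rpois(n_suffixes, lambda)
mean(sim_counts)
sd(sim_counts)        # should be close to sqrt(lambda)

# Chance that a purely random suffix ends up with a count of 56 or fewer
p_low <- ppois(56, lambda)
p_low
```

Under these assumptions the simulated histogram has a single mode, so a second cluster of very low counts in the real data suggests something other than chance.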
## Low counts but not total ban
By inspection, a lot start with "X" for some reason. And also "QU."
```{r list-below-56}
counts %>%
filter(count <= 56 & s1 != 'X') %>%
filter(! (s1 == 'Q' & s2 == 'U')) %>%
arrange(desc(count)) %>%
select(suffix, count) -> below_57
below_57 %>%
kable()
```
This definitely reveals some likely banned or discouraged suffixes
that I didn't think of.
## Zoom in on histogram
```{r zoom-in}
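# qplot() is deprecated as of ggplot2 3.4.0; this is the equivalent histogram
ggplot(counts, aes(x = count)) + geom_histogram() +
geom_vline(xintercept = hams_per_suff) +
xlim(50, hams_per_suff)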
```
I think I should change the cutoff to around 62. *Update:* tried that,
and the only interesting one in that range was "GOD," weighing in at a
count of 60. So I changed the threshold once again to $\le 56$. That
gives the 37 rarest suffixes.
Definitely includes some slurs. But also double-meaning words (nut,
pig, gas, box, sob). Most amusing one is "LID."
# Analysis of sketchy ones
```{r heatmap-data}
left_join(below_57, presuf, by='suffix') -> sketchy_people
sketchy_people %>%
arrange(suffix, state) %>%
group_by(suffix, state) %>%
summarise(n_calls = n(), tot_suf = mean(count)) -> suffix_state_t1
suffix_state_t1 %>%
group_by(state) %>%
summarise(tot_state = sum(n_calls)) -> t2
n_sketchy <- dim(sketchy_people)[1]
left_join(suffix_state_t1, t2, by = 'state') %>%
ungroup() %>%
mutate(
n_sketchy = n_sketchy,
expected = tot_suf * tot_state / n_sketchy,
obs_over_exp = n_calls / expected,
r_state = dense_rank(state),
r_suffix = dense_rank(suffix)
) -> suffix_state
n_sketchy
sum(t2$tot_state)
```
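
The `expected` column in the chunk above has the form "row total times column total over grand total," which is the expected count if suffix and state were independent. Here is a toy base R sketch of that formula; the matrix `obs` contains made-up numbers, not FCC data.

```r
# Toy counts of callsigns by suffix (rows) and state (columns); made-up numbers
obs <- matrix(c(4, 1, 5,
                2, 3, 5),
              nrow = 2, byrow = TRUE,
              dimnames = list(suffix = c("AAA", "BBB"),
                              state = c("TX", "CA", "NY")))

tot_suf   <- rowSums(obs)   # total per suffix
tot_state <- colSums(obs)   # total per state
n_total   <- sum(obs)       # grand total

# Expected count in each cell if suffix and state were independent
expected <- outer(tot_suf, tot_state) / n_total

# Ratio above 1: suffix is over-represented in that state; below 1: under-represented
obs_over_exp <- obs / expected
expected
obs_over_exp
```

The expected counts always sum to the grand total, which is a handy sanity check on the real table as well.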
Many of the sketchy-seeming ones turn out to be people's initials.
Also, a few seem to be "reclaiming" the slur (inference based on
assumed nationality of last names).
```{r heatmap}
ggplot(suffix_state, aes(state, suffix)) +
geom_raster(aes(fill = obs_over_exp)) +
theme(legend.position="none")
```