Introduction to segtest Version 2.0

library(segtest)

The segtest package offers a suite of tools for testing segregation distortion in F1 polyploid populations across diverse meiotic models. These methods support autopolyploids (full polysomic inheritance), allopolyploids (full disomic inheritance), and segmental allopolyploids (partial preferential pairing). Double reduction is optionally modeled fully in tetraploids and partially (at simplex loci only) in higher ploidies. A user-specified maximum proportion of outliers allows the method to accommodate moderate double reduction at non-simplex loci. Offspring genotypes may be known or modeled using genotype likelihoods to account for genotype uncertainty. Parental genotypes are optional, and the two parents may have different (even) ploidies. Details of the methods may be found in Gerard et al. (2025a) and Gerard et al. (2025b).

Additional functions include those that generate gamete and genotype frequencies under different models of meiosis, functions that simulate genotype (log) likelihoods, and “competing” tests for segregation distortion.

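The main functions covered below are gamfreq() and gf_freq() (gamete and genotype frequencies), simgl() (simulating genotype log-likelihoods), and seg_lrt() (testing for segregation distortion).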

Gamete genotype frequencies

gamfreq() will generate gamete frequencies under different models of meiosis. gf_freq() will generate genotype frequencies under the same models by convolving the output from gamfreq(). We focus on gamfreq(), as gf_freq() uses the same models but applied separately to each parent.

For autopolyploids, specify type = "polysomic" and, optionally, the amount of double reduction via alpha. alpha is a vector of length floor(ploidy / 4), where element i is the probability that a gamete has i pairs of alleles that are identical by double reduction. The upper bounds for alpha can be found via drbounds(). E.g., for a parental octoploid with genotype 4, first with no double reduction and then with a moderate level of double reduction:

drbounds(ploidy = 8) ## DR bounds
#> [1] 0.38571429 0.02142857
gamfreq(g = 4, ploidy = 8, type = "polysomic") ## no DR
#> [1] 0.01428571 0.22857143 0.51428571 0.22857143 0.01428571
gamfreq(g = 4, ploidy = 8, alpha = c(0.1, 0.01), type = "polysomic") ## Some DR
#> [1] 0.022 0.232 0.492 0.232 0.022
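
As a cross-check on the no-double-reduction case, the frequencies printed above match a hypergeometric distribution: the gamete draws ploidy / 2 of the parent's chromosomes without replacement. The sketch below uses only dhyper() from the stats package. The helper name gam_hyper is hypothetical and is not part of segtest.

```r
# Hypothetical helper (not part of segtest): gamete dosage frequencies under
# polysomic inheritance with no double reduction, via the hypergeometric pmf.
gam_hyper <- function(g, ploidy) {
  half <- ploidy / 2
  # g alternative alleles, ploidy - g reference alleles, draw `half` of them
  stats::dhyper(x = 0:half, m = g, n = ploidy - g, k = half)
}

p_hyper <- gam_hyper(g = 4, ploidy = 8)
round(p_hyper, 8)
```

The result should agree with the no-DR vector above (1/70, 16/70, 36/70, 16/70, 1/70). It does not cover the case with alpha > 0.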

For allopolyploids, the possible gamete frequencies can be found in seg. E.g., for a parental octoploid with genotype 4, the possible gamete frequencies are

seg[seg$ploidy == 8 & seg$g == 4 & seg$mode %in% c("disomic", "both"), "p"]
#> [[1]]
#> [1] 0.0625 0.2500 0.3750 0.2500 0.0625
#> 
#> [[2]]
#> [1] 0.00 0.25 0.50 0.25 0.00
#> 
#> [[3]]
#> [1] 0 0 1 0 0

Note that you also need to filter for the mode to be either "disomic" or "both" (both disomic and polysomic). The total number of possible allopolyploid distributions is n_pp_mix().

n_pp_mix(g = 4, ploidy = 8)
#> [1] 3
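
One way to see where these three distributions come from: under strict disomic inheritance an octoploid has four homologous pairs, and each pair passes exactly one chromosome to the gamete. A pair carrying 0, 1, or 2 copies of the allele therefore contributes 0 copies, a fair coin flip, or 1 copy. The sketch below enumerates the ways to spread g copies over the pairs. The function names are hypothetical, and the order of the results is not guaranteed to match the order in seg.

```r
# Hypothetical helper: gamete frequencies for one pairing configuration, where
# n1 pairs carry one copy and n2 pairs carry two copies of the allele.
disomic_freq <- function(n1, n2, ploidy) {
  out <- numeric(ploidy / 2 + 1)
  # gamete dosage = n2 (always transmitted) + Binomial(n1, 1/2)
  out[n2 + 0:n1 + 1] <- stats::dbinom(0:n1, size = n1, prob = 0.5)
  out
}

# All (n1, n2) with n1 + 2 * n2 = g that fit into ploidy / 2 pairs
disomic_all <- function(g, ploidy) {
  npair <- ploidy / 2
  n2 <- 0:floor(g / 2)
  n1 <- g - 2 * n2
  ok <- n1 + n2 <= npair
  Map(disomic_freq, n1[ok], n2[ok], MoreArgs = list(ploidy = ploidy))
}

configs <- disomic_all(g = 4, ploidy = 8)
length(configs)  # number of pairing configurations
configs
```

For g = 4 and ploidy = 8 this gives three configurations, in line with the n_pp_mix() result above.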

You can select one of these distributions by setting gamma to a vector with a 1 in the corresponding position and 0 elsewhere. E.g.

gamfreq(g = 4, ploidy = 8, gamma = c(1, 0, 0), type = "mix")
#> [1] 0.0625 0.2500 0.3750 0.2500 0.0625
gamfreq(g = 4, ploidy = 8, gamma = c(0, 1, 0), type = "mix")
#> [1] 0.00 0.25 0.50 0.25 0.00
gamfreq(g = 4, ploidy = 8, gamma = c(0, 0, 1), type = "mix")
#> [1] 0 0 1 0 0

Segmental allopolyploids are mixtures of the possible allopolyploid segregation distributions. E.g., an equal mixture of the three for an octoploid with genotype 4 is

gamfreq(g = 4, ploidy = 8, gamma = c(1, 1, 1)/3, type = "mix") 
#> [1] 0.02083333 0.16666667 0.62500000 0.16666667 0.02083333
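
The mixture above is just a weighted average of the three component distributions, which can be checked by hand in base R. In this sketch the component vectors are copied from the output printed earlier, and the variable names are illustrative only.

```r
# The three disomic distributions printed above for g = 4, ploidy = 8
comps <- rbind(
  c(0.0625, 0.25, 0.375, 0.25, 0.0625),
  c(0, 0.25, 0.5, 0.25, 0),
  c(0, 0, 1, 0, 0)
)
w <- c(1, 1, 1) / 3       # mixing weights (the gamma argument); must sum to 1
mix <- drop(w %*% comps)  # weighted average of the rows
mix
```

Changing w to, say, c(0.7, 0.2, 0.1) shifts the mixture toward the first component in the same way.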

At simplex loci, there is only one possible allopolyploid segregation distribution:

n_pp_mix(g = 1, ploidy = 8)
#> [1] 1
gamfreq(g = 1, ploidy = 8, gamma = 1, type = "mix")
#> [1] 0.5 0.5 0.0 0.0 0.0
n_pp_mix(g = 7, ploidy = 8)
#> [1] 1
gamfreq(g = 7, ploidy = 8, gamma = 1, type = "mix")
#> [1] 0.0 0.0 0.0 0.5 0.5

You can account for double reduction at these loci by including beta, the upper bound of which can be found via beta_bounds().

beta_bounds(ploidy = 8)
#> [1] 0.05357143
gamfreq(g = 1, ploidy = 8, gamma = 1, beta = 0.03, type = "mix")
#> [1] 0.53 0.44 0.03 0.00 0.00
gamfreq(g = 7, ploidy = 8, gamma = 1, beta = 0.03, type = "mix")
#> [1] 0.00 0.00 0.03 0.44 0.53
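
The printed values are consistent with a simple pattern: relative to the 0.5 / 0.5 split, dosage 0 gains beta, dosage 1 loses 2 * beta, and dosage 2 gains beta. The sketch below reproduces the numbers under that assumption. This parameterization is inferred from the output alone, not taken from the package source, and simplex_dr is a hypothetical name.

```r
# Assumed parameterization, inferred from the printed output only:
# dosage 0 gains beta, dosage 1 loses 2 * beta, dosage 2 gains beta.
simplex_dr <- function(beta, ploidy) {
  stopifnot(ploidy >= 4, beta >= 0, beta <= 0.25)
  out <- numeric(ploidy / 2 + 1)
  out[1:3] <- c(0.5 + beta, 0.5 - 2 * beta, beta)
  out
}

simplex_dr(beta = 0.03, ploidy = 8)       # g = 1
rev(simplex_dr(beta = 0.03, ploidy = 8))  # mirror image for g = ploidy - 1
```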

Simulating data

Let’s suppose we have some genotype frequencies we want to simulate individual data from:

gf <- gf_freq(
  p1_g = 2, 
  p1_ploidy = 6,
  p1_gamma = c(0.7, 0.3), 
  p1_type = "mix",
  p2_g = 4,
  p2_ploidy = 6, 
  p2_gamma = c(0.5, 0.5), 
  p2_type = "mix")
plot(gf, type = "h", xlab = "Genotype", ylab = "Frequency")

Bar plot of genotype frequencies.
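
Since gf_freq() is described as convolving the gamete frequencies of the two parents, a rough base R sketch of that step may help. For simplicity it assumes polysomic inheritance without double reduction (hypergeometric gamete frequencies) rather than the "mix" model used above, so its output will differ from gf. The function convolve_gametes is hypothetical.

```r
# Gamete frequencies for two hexaploid parents under polysomic inheritance
# without double reduction (hypergeometric); NOT the "mix" model used above.
p1 <- stats::dhyper(0:3, m = 2, n = 4, k = 3)  # parent 1, genotype 2
p2 <- stats::dhyper(0:3, m = 4, n = 2, k = 3)  # parent 2, genotype 4

# Discrete convolution: offspring dosage = gamete 1 dosage + gamete 2 dosage
convolve_gametes <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i + seq_along(b) - 1
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

gf_poly <- convolve_gametes(p1, p2)
round(gf_poly, 2)
#> [1] 0.00 0.04 0.24 0.44 0.24 0.04 0.00
```

An explicit loop is used instead of stats::convolve() to avoid the small floating-point errors that FFT-based convolution can introduce into exact zeros.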

To simulate genotype counts, just use rmultinom() from the stats package. Let’s simulate data from 10 individuals.

x <- c(rmultinom(n = 1, size = 10, prob = gf))
x
#> [1] 0 0 2 4 4 0 0
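
If you want the draw to be reproducible, or need one genotype per individual rather than counts, something like the following should work. The frequency vector gf_ex is an illustrative stand-in (hexaploid offspring, dosages 0 to 6), not the gf computed above, and the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed; fixes the random draw
# Illustrative genotype frequencies for dosages 0..6 (not the gf from above)
gf_ex <- c(0, 0.04, 0.24, 0.44, 0.24, 0.04, 0)
counts <- c(stats::rmultinom(n = 1, size = 10, prob = gf_ex))

# Expand counts into one dosage per individual
geno <- rep(seq_along(counts) - 1, times = counts)
geno
table(factor(geno, levels = 0:6))  # recovers `counts`
```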

To simulate genotype (log) likelihoods, insert these genotype counts into simgl().

gl <- simgl(nvec = x)
gl
#>             [,1]      [,2]      [,3]      [,4]      [,5]       [,6]       [,7]
#>  [1,]  -2.476575 -1.178405 -2.395080 -4.380673 -7.348080 -12.283402 -26.730748
#>  [2,] -10.226929 -2.738521 -1.514758 -1.614191 -2.750519  -5.504672 -16.089299
#>  [3,] -13.071998 -3.968154 -1.970660 -1.450388 -1.970660  -3.968154 -13.071998
#>  [4,] -16.089299 -5.504672 -2.750519 -1.614191 -1.514758  -2.738521 -10.226929
#>  [5,] -19.319654 -7.362628 -3.867459 -2.118237 -1.394945  -1.826320  -7.525849
#>  [6,] -16.089299 -5.504672 -2.750519 -1.614191 -1.514758  -2.738521 -10.226929
#>  [7,] -19.319654 -7.362628 -3.867459 -2.118237 -1.394945  -1.826320  -7.525849
#>  [8,] -16.089299 -5.504672 -2.750519 -1.614191 -1.514758  -2.738521 -10.226929
#>  [9,] -22.827562 -9.587669 -5.365836 -3.006407 -1.654578  -1.273137  -4.947896
#> [10,] -13.071998 -3.968154 -1.970660 -1.450388 -1.970660  -3.968154 -13.071998
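
To get a feel for how much uncertainty these log-likelihoods encode, you can turn them into most-likely genotype calls and into likelihoods rescaled to sum to 1 per individual. The sketch below hard-codes the first three rows printed above and assumes that column k corresponds to dosage k - 1.

```r
# First three rows of the simulated log-likelihoods printed above
gl_ex <- rbind(
  c(-2.476575, -1.178405, -2.395080, -4.380673, -7.348080, -12.283402, -26.730748),
  c(-10.226929, -2.738521, -1.514758, -1.614191, -2.750519, -5.504672, -16.089299),
  c(-13.071998, -3.968154, -1.970660, -1.450388, -1.970660, -3.968154, -13.071998)
)

# Most likely dosage per individual (assumes column k holds dosage k - 1)
calls <- apply(gl_ex, 1, which.max) - 1
calls
#> [1] 1 2 3

# Likelihoods rescaled to sum to 1 per row; the row maximum is subtracted
# first for numerical stability
rel <- exp(gl_ex - apply(gl_ex, 1, max))
rel <- rel / rowSums(rel)
round(rel, 3)
```

In the second row, dosages 2 and 3 have nearly equal log-likelihoods, so a hard call there would discard real uncertainty.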

Testing for segregation distortion

You can test for segregation distortion using seg_lrt(). E.g., let’s test for it using the data (both known genotypes and genotype likelihoods) we simulated in the previous section:

## With known genotypes
sout1 <- seg_lrt(x = x, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)
sout1$p_value
#> [1] 0.5820762
## With genotype likelihoods
sout2 <- seg_lrt(x = gl, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)
sout2$p_value
#> [1] 0.5860578
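
For intuition only, here is a much simpler test of the same flavor: a likelihood ratio (G) test of known genotype counts against one fully specified set of genotype frequencies. This is not what seg_lrt() does; it ignores preferential pairing, double reduction, outliers, and genotype uncertainty, and its chi-squared approximation is rough with only 10 individuals. The function g_test is hypothetical, and q holds hexaploid genotype frequencies for a 2 x 4 cross under polysomic inheritance with no double reduction.

```r
# Hypothetical simple-null G-test: counts x vs. fixed frequencies q
g_test <- function(x, q) {
  stopifnot(length(x) == length(q))
  df <- sum(q > 0) - 1
  # Any individual with a genotype that is impossible under q rejects outright
  if (any(x[q == 0] > 0)) {
    return(c(statistic = Inf, df = df, p_value = 0))
  }
  pos <- x > 0  # cells with zero counts contribute 0 to the statistic
  stat <- 2 * sum(x[pos] * log(x[pos] / (sum(x) * q[pos])))
  c(statistic = stat, df = df,
    p_value = stats::pchisq(stat, df = df, lower.tail = FALSE))
}

x_obs <- c(0, 0, 2, 4, 4, 0, 0)               # counts simulated above
q <- c(0, 0.04, 0.24, 0.44, 0.24, 0.04, 0)    # assumed null frequencies
res <- g_test(x_obs, q)
res
```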

My recommendation is to always use the genotype log-likelihoods. But seg_lrt() allows for known genotypes, if that situation works best for you.

The default (model = "seg") is to assume your organism is a segmental allopolyploid, and to account for possible double reduction at simplex loci. But you should absolutely use other models if you have more information on your organism.

We allow for some invalid genotypes via the ob argument, which is the upper bound on the proportion of outliers. By default, this is set to 0.03. You can set this to 0 (or set outlier = FALSE) if you want any outliers to indicate segregation distortion.

Make sure that the log-likelihoods are base \(e\). If they are base 10, you’ll get the wrong \(p\)-value:

gl10 <- gl / log(10)
seg_lrt(x = gl10, p1_ploidy = 6, p2_ploidy = 6, p1 = 2, p2 = 4)$p_value
#> [1] 0.9141411
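
If your genotyping software reports base-10 log-likelihoods, multiplying by log(10) converts them to base e before you pass them on. A minimal check of that identity, using a few values from the matrix above:

```r
ln_ll <- c(-2.476575, -1.178405, -2.395080)  # natural-log likelihoods
log10_ll <- ln_ll / log(10)                  # the same values in base 10
back <- log10_ll * log(10)                   # convert base 10 back to base e
all.equal(back, ln_ll)
#> [1] TRUE
```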

Don’t mess with the technical arguments (ntry, opt, optg, df_tol). These control the optimization and how the degrees of freedom of the test are approximated. The one possible exception is ntry: you could increase it if you are seeing weird results. But then let me know, because I haven’t seen any bad behavior with ntry = 3 (the default).

References

Gerard D, Thakkar M, & Ferrão LFV (2025a). “Tests for segregation distortion in tetraploid F1 populations.” Theoretical and Applied Genetics, 138(30), p. 1–13. doi:10.1007/s00122-025-04816-z.

Gerard D, Ambrosano GB, Pereira GdS, & Garcia AAF (2025b). “Tests for segregation distortion in higher ploidy F1 populations.” bioRxiv, p. 1–20. bioRxiv:2025.06.23.661114.