Optimality Theory (OT) (Prince & Smolensky 1993) is one of the leading theories in phonology, and is a very successful model of categorical language data. In OT, ranked constraints pick a single winning surface representation (SR) for each underlying representation (UR). Nevertheless, a large proportion of language data is probabilistic. Moreover, psycholinguistic experiments (e.g. artificial language learning) often produce probabilistic results too. Thus, multiple models (e.g. Stochastic OT (Boersma 1997), Partial Ordering OT (Antilla 1997), Harmonic Grammar (Legendre, Miyata & Smolensky 1990), among others) have been put forward to augment the basic structure of OT, in order to model probabilistic data. MaxEnt (Goldwater & Johnson 2003) is a one such model. In MaxEnt, constraint weights are adjusted in order to find the weights that maximize the probability of the observed data. One benefit of MaxEnt over other probabilistic models is that the mathematics behind it are well understood and convergence proofs for learning are available (Berger et al. 1996).
The maxent.ot
package implements the MaxEnt model. There
are three functions in this package:
optimize_weights()
: Optimizes weights given a data set
& optional biases.predict_probabilities()
: Predict probabilities of
candidates based on their violation profiles & constraint
weights.compare_models()
: Compares two or more model fits to
the same data set to determine which provides the best fit, using
variety of methods.We now present a quick recap of OT in order to illustrate how OT and MaxEnt differ in choosing among outputs (eval) given the same data set (same con & gen).
In OT, ranked constraints choose one winning SR candidate per UR. Consider the OT tableaux below (Figure 1).
For the UR /ur-1/, Con1 prefers [sr1-1] while Con2 prefers [sr1-2]. Since Con1 outranks Con2, [sr1-1] is the sole winner, and is predicted to be the only SR for this particular UR (i.e. a categorical outcome). By the very same logic, the SR [sr2-1] is the sole realization of the UR /ur-2/ (i.e. Con1 prefers [sr2-1] to [sr2-2]). Notice that Con2 plays no role in deciding the outcome (hence the shading).
In contrast, MaxEnt produces probabilistic outcomes. In MaxEnt, constraints have weights (rather than being ranked with respect to each other). The predicted probability of an SR candidate given its UR depends on the weights of the constraints, and is calculated according to the following formula:
\[P(SR_i|UR_i; w) = \frac{1}{Z_w(UR_i)} exp(- \Sigma_{k=1}^mw_k f_k(SR_i, UR_i))\]
where
\[Z(UR_i) = \Sigma_{SR \in \mathcal{S}(UR_i)} exp(- \Sigma_{k=1}^m w_k f_k(SR, UR_i))\]
where \(\mathcal{S}(UR_i)\) is the set of surface realizations of input \(UR_i\).
We’ll walk through the MaxEnt tableaux in Figure 2 to illustrate how the equations in §1.3 are applied to calculate the probability of a particular UR-SR pair.
Notice that these two MaxEnt tableaux have the same URs, SRs, constraints, and constraint violation profiles as the two OT tableaux above. However, instead of being ranked, these constraints now have weights (e.g. Con1 weight = 1.6; Con2 weight = 0.8). There are also four additional columns tagged on to the right of the tableaux, which we use to illustrate how the predicted probabilities are computed.
An important variable that shows up in the equations above is: \(f_k(SR_i, UR_i)\). This refers to the number of violations of constraint \(k\) incurred by the UR-SR pair, <\(UR_i, SR_i\)>. For example, the UR-SR pair \(<\)/ur-2/, [sr2-2]\(>\) picks up two violations of Con2, so \(f_{Con2}(\text{sr2-2}, \text{ur-2}) = 2\) (yellow box).
Next, we move on to the penalty. Penalty has also been called “harmony” in other descriptions of the MaxEnt model. To calculate the penalty for the UR-SR pair <\(UR_i, SR_i\)>, we take the weighted sum of its constraint violations: \[\text{penalty for <}UR_i, SR_i\text{> } = \text{ } \Sigma_{k=1}^m w_k f_k(SR_i, UR_i)\] (Recall that \(w_k\) is the weight of constraint \(k\), and \(m\) is the number of constraints.)
Let’s apply this equation to calculate the penalty for the UR-SR pair \(<\)/ur-2/, [sr2-2]\(>\). This pair picks up one violation of Con1 and two violations of Con2. The weighted sum (green box) is thus:
\[\begin{align} \text{penalty for </ur-2/, [sr2-2]>} &= \Sigma_{k=1}^2 w_k f_k(\text{sr2-2, ur-2}) \\ &= weight_{Con1} \times 1 + weight_{Con2} \times 2 \\ &= 1.6 \times 1 + 0.8 \times 2 \\ &= 3.2 \end{align}\]
The penalties for other UR-SR pairs are similarly calculated. For example, the penalty for the UR-SR pair \(<\)/ur-2/, [sr2-1]\(>\) is:
\[\begin{align} \text{penalty for </ur-2/, [sr2-1]>} &= \Sigma_{k=1}^2 w_k f_k(\text{sr2-1, ur-2}) \\ &= weight_{Con1} \times 0 + weight_{Con2} \times 1 \\ &= 1.6 \times 0 + 0.8 \times 1 \\ &= 0.8 \end{align}\]
Notice that \(m=2\) because there are two constraints.
The next column gives the MaxEnt score for each UR-SR pair. \[\text{MaxEnt score} = exp(-\text{penalty})\] The Reader may have also encountered this term being called the “potential” in other descriptions of MaxEnt. Look at the MaxEnt score column of Figure 2 to see this equation applied.
We then need to normalize the MaxEnt score in order to get probabilities. The next column provides the \(Z\) value for each UR. \(Z\) is short for Zustandssumme, “sum over states”, and has also been known as the normalization constant, or the canonical partition function. \(Z\) is the sum of the MaxEnt scores of all UR-SR pairs arising from a particular UR.
Here’s a simplified way to write the equation for the normalization constant of \(UR_i\): \[Z(UR_i) = \Sigma_{j=1}^l m_j\] where \(l\) is the number of UR-SR pairs whose UR is \(UR_i\), and \(m_j\) is the MaxEnt score of the \(j^{th}\) UR-SR pair whose UR is \(UR_i\).
Let’s apply this equation to calculate the normalization constant for /ur-2/ (blue box):
\[\begin{align} Z(\text{ur-2}) &= \Sigma_{i=1}^2m_i \\ &= 0.499 + 0.041 \\ &= 0.490 \end{align}\]
Here, \(l = 2\) because there are only two UR-SR pairs whose UR is /ur-2/. We basically add the MaxEnt scores of these two pairs to get \(Z(\text{ur-2})\).
The predicted probability is shown in the final column. The probability of a particular SR for a given UR, \(UR_i\), is this pair’s MaxEnt score divided by the normalization constant associated with \(UR_i\):
\[P(SR_i|UR_i) = \frac{\text{MaxEnt score of <}UR_i, SR_i\text{>}}{Z(UR_i)}\]
Let’s apply this equation to calculate the probability of [sr2-2] given /ur-2/ (magenta box):
\[\begin{align} P(\text{sr2-2}|\text{ur-2}) &= \frac{\text{MaxEnt score of <ur-2, sr2-2>}}{Z(\text{ur-2})} \\ &= \frac{0.449}{0.490} \\ &= 0.083 \end{align}\]
Congratulations! You now know how to calculate the probabilities of an SR given its UR when provided with constraint weights. Next, we’ll discuss some characteristics of the probability distribution and of the penalty score.
The probability distribution defined by the first equation above is also known as the Boltzmann distribution or the Gibbs distribution. Like the Boltzmann distribution, the probability of a particular SR given the UR it arises from is a function of two variables: (1) its penalty and (2) temperature.
As a particular SR’s penalty increases, its probability decreases (assuming everything else remains unchanged). The graph below visualizes how a MaxEnt score varies with penalty (Figure 3). Essentially, a higher penalty results in a lower MaxEnt score. (Note: It is not possible to graph how probability varies with penalty, since the actual probability depends on the penalties of the other SRs arising from the same UR, so we’ll use the MaxEnt score as an intermediary.)
Recall that the probability of an SR given its UR is directly proportional to its MaxEnt score:
\[P(SR_i|UR_i) \propto M_{UR_i, SR_i}\]
where \(M_{UR_i, SR_i}\) is the MaxEnt score of the UR-SR pair <\(UR_i, SR_i\)>. We know this because
\[P(SR_i|UR_i) = M_{UR_i, SR_i} \times \frac{1}{Z(UR_i)}\]
Hence, the relationship between penalty and probability is clear – a higher penalty results in lower probability (all else remaining unchanged).
The graph also indicates that as penalty increases, the MaxEnt score approaches but never quite reaches 0. Thus, SR candidates with infinitely high penalties still receive a MaxEnt score greater than zero. This means that, theoretically, such SRs will receive some probability. Nevertheless, the probabilities that such SRs receive will be so tiny that they can be treated as not occurring. This is useful when using MaxEnt to model a data set that has a mix of categorical and probabilistic phenomena. For categorical phenomena, the predicted probability that is returned will often be something like: 99.998% SR1, 0.002% SR2. Such outcomes can be treated as essentially categorical, so MaxEnt is also capable of modeling categorical data despite being a probabilistic model.
Temperature is rarely used when modeling linguistic data, so the
current simplified formula we’ve introduced above will do for now. We
will introduce the final formulation of the probability distribution in
§5 when we discuss the temperature
argument of the
predict_probabilities()
function. It turns out that
temperature can be a useful tool in adjusting model fit to
particular psycholinguistic tasks.
An additional feature of the MaxEnt model arises from how the penalty score is calculated. Because each violation contributes to the penalty score, multiple violations of a lower-weighted constraint can work together to make an SR less probable than another SR that has only one violation of a higher-weighted constraint. This phenomenon is called the “gang-up” effect. This effect has been observed in phonological phenomena, and has also been reported experimentally (Breiss 2020).
optimize_weights()
Let’s imagine we have a data set with multiple tableaux, and would
like to find constraint weights that maximize the probability of our
data. In other words, we would like to find constraint weights that
produce a distribution over UR-SR pairs that matches the observed
frequencies of our data as closely as possible. In order to do that, we
first need to know how to structure our data set. Take a look at the
data set below, loaded from the sample_data_frame
data set
included in this package.
# Get paths to sample data file
data_file <- system.file(
"extdata", "sample_data_frame.csv", package = "maxent.ot"
)
# Take a look at the structure of the sample data
data_table <- read.table(data_file, header=FALSE)
data_table
#> V1
#> 1 Input,Output,Frequency,Constraint1,Constraint2
#> 2 Input1,Output1-1,1,1,0
#> 3 Input1,Output1-2,1,0,1
#> 4 Input2,Output2-1,1,0,0
#> 5 Input2,Output2-2,0,0,1
This data set has two tableaux: one for Input1 and another for Input2. There are two constraints: Constraint1 & Constraint2. The columns are organized as follows:
optimize_weights
.compare_models
down the line to compare multiple trained
models. In particular, relative frequency won’t work with the
aic_c
& bic
methods of
compare_models
.The first & second rows are only used for Column 4 onward (i.e. constraints). Full constraint names are in the first row, while their shortened names are in the second. Tableaux are inserted from the third row inwards.
The data set is a tab-delimited .txt
file. The input
file is in OTSoft-style format (Hayes, Tesar & Zuraw 2003).
Rather than using OTSoft format, you can also pass a data frame/data
table/tibble directly into the optimize_weights
function.
The format in this case is a bit different.
data_frame_file <- system.file(
"extdata", "sample_data_frame.csv", package = "maxent.ot"
)
data_frame <- read.csv(data_frame_file)
data_frame
#> Input Output Frequency Constraint1 Constraint2
#> 1 Input1 Output1-1 1 1 0
#> 2 Input1 Output1-2 1 0 1
#> 3 Input2 Output2-1 1 0 0
#> 4 Input2 Output2-2 0 0 1
Note that in this data frame the inputs column is completely specified and there is only a single row of constraint names.
Now that we’ve familiarized ourselves with our data set, let’s find a
set of constraint weights that maximizes the probability of the data. By
default, constraint weights are limited to finite non-negative values
with an upper bound of \(1000\). The
User may modify the upper bound by specifying another value for the
optional argument upper_bound
.
Using the function optimize_weights()
, we’ll train a
model for the data set above, and store the result (our trained model)
in the object simple_model
:
You can also fit a model using a data frame as input.
Let’s take a closer look at our object simple_model
:
# View the model we've fit (no biases used)
simple_model
#> $name
#> [1] "data_frame"
#>
#> $weights
#> Constraint1 Constraint2
#> 13.52132 13.52130
#>
#> $loglik
#> [1] -1.386296
#>
#> $k
#> [1] 2
#>
#> $n
#> [1] 3
#>
#> $bias_params
#> [1] NA
The object simple_model
has 4 relevant named
attributes:
weights
: the trained weights for each constraint. This
is a named list.loglik
: the log likelihood of the weights given the
data
k
: the number of constraintsn
: the number of data points in the training setWe can retrieve any specific model attribute that we’d like. For
example, if we’d like to see only the learned constraint weights for
simple_model
, we can pull them up as follows:
# Get learned weights of model we've fit (no biases used)
simple_model$weights
#> Constraint1 Constraint2
#> 13.52132 13.52130
Here we see that the trained model happened to learn the same weight (14.2) for both Constraint1 and Constraint2.
Likewise, we can retrieve any other model attribute. For example, here’s the log likelihood:
This log likelihood is, in fact, at ceiling (at the maximum possible likelihood), which is indicative of overfitting. The issue of overfitting and the introduction of priors to reduce overfitting will be taken up in §3.3. But first, let’s gain a better understanding of how log likelihood guides training.
The optimize_weights()
function implements a learner
that adjusts constraint weights in order to fit the model. The learner’s
objective is to maximize the value of the objective function, and weight
adjustments are made with this objective in mind. The objective
function, \(J(w)\), is composed of two
terms:
\[J(w) = LL(w|D) - B(w)\]
where \(LL(w|D)\) is the log likelihood of the weights \(w\) given the data \(D\), and \(B(w)\) is the price to pay for choosing the particular vector of weights \(w\).
The first term, \(LL(w|D)\), is a measure of how well a model’s weights \(w\) fit the observed data \(D\). Models that more closely match the observed frequencies have a higher \(LL(w|D)\) than those with poorer matches. \(LL(w|D)\) ranges between negative infinity and zero.
The second term, \(B(w)\), is used when we want to nudge the model towards particular preferred weights (e.g. by using biases or priors). Models with weights that are closer to the preferred ones pay a smaller price (smaller \(B(w)\)) than those with more aberrant weights. \(B(w)\) ranges between zero and positive infinity.
Higher values of \(LL(w|D)\) and lower values of \(B(w)\) both contribute to increasing the value of the objective function \(J(w)\). Recall that the learner’s objective is to arrive at as high a value of \(J(w)\) as possible (it seeks to maximize the value of \(J(w)\)). In other words, the learner adjusts weights towards weight values that:
In the rest of this section, we discuss the log likelihood of the weights, \(LL(w|D)\), in greater depth. We leave the discussion of \(B(w)\) to §3, where priors are introduced.
When no priors are used, maximizing the objective function \(J(w)\) is equivalent to maximizing the log likelihood of the weights \(LL(w|D)\) alone. Since we’ve not used any priors so far, in all the preceding cases, the sole aim of the learner was to seek weights that maximized the log likelihood alone.
The log likelihood of the weights is:
\[LL(w|D) = \Sigma_{i=1}^n ln P(y_i|x_i; w)\]
Let’s unpack the terms of this equation. On the left-hand side, we have \(LL(w|D)\), where
On the right-hand side:
Notice how the weights of each individual constraint (which are found in \(w\)) are the only parameters available for the model to adjust.
Recall that the learner takes in a distribution over training examples, \(D\). For the data set we’ve been using, \(D\) is: \(<\)Input1, Output1-1\(>\): 1, \(<\)Input1, Output1-2\(>\): 1, \(<\)Input2, Output2-1\(>\): 1. \(<\)Input2, Output2-2\(>\) is not part of \(D\) because it isn’t observed.
In the absence of priors, learner’s sole goal is to maximize the log likelihood alone. In other words, the learner seeks the weights that produce a distribution over these three training examples that matches the observed frequencies as closely as possible. The learner adjusts the weights with this goal in mind.
Let’s imagine we have two models, M1 & M2, whose weights (whatever they may be) produce the predicted distributions P(output|input) shown in the final two columns of Table 1. For ease of comparison, the observed relative frequencies are shown in the third-to-last column. Right away, we can see that M1 matches the observed frequencies much better than M2 does. In other words, the observed data are more probable under the weights of M1 than they are under the weights of M2. Though we haven’t yet applied the equation above to calculate the log likelihood of these two model’s weights, we can expect that M1’s weights will produce a higher log likelihood than M2’s.
Now we’ll apply the log likelihood equation to calculate the log likelihood of the weights of M1 & M2. Recall that log likelihoods may range from negative infinity to 0, with higher log likelihoods indicating that a model’s weights produce a better fit to the observed data. Here’s the log likelihood equation again:
\[LL(w|D) = \Sigma_{i=1}^n ln P(y_i|x_i; w)\]
Applying the equation to M1 (\(w_{M1}\) refers to M1’s weights): \[\begin{align} LL(w_{M1}|D) &= \Sigma_{i=1}^3 ln P(y_i|x_i; w) \\ &= ln(.5) + ln(.5) + ln(1) \\ &= -1.386 \end{align}\]
Applying the equation to M2 (\(w_{M2}\) refers to M2’s weights): \[\begin{align} LL(w_{M2}|D) &= \Sigma_{i=1}^3 ln P(y_i|x_i; w) \\ &= ln(.25) + ln(.75) + ln(1) \\ &= -2.367 \end{align}\]
M1’s weights produce a higher log likelihood than M2’s, which indicates that M1 is a better model than M2. This is because M1’s weights produce a predicted distribution over observed UR-SR pairs that is closer to the observed frequencies than M2’s weights do. (Notice that only observed data are included in the calculations above. Despite Output2-2 getting 50% of the predicted probability in M2, it does not contribute to the log likelihood.)
Hence, aiming to maximize the probability of the training data results in finding weights that produce a predicted distribution that is as close to the observed frequencies as possible. For example, we can imagine the learner starting with the weights that produce a distribution that is very distant from the frequencies of the observed training data. With its goal of finding weights that maximize the log likelihood, the learner moves to the weights of M2, and eventually arrives at the weights of M1, where training is halted and M1’s weights are the weights of the final trained model.
optimize_weights()
Here’s the general formula for the objective function, \(J(w)\), repeated here for ease of reference:
\[J(w) = LL(w|D) - B(w)\]
In the previous section (§2), we ignored the second term, so the objective function, \(J(w)\), depended solely upon the log likelihood of the current weights given the training data. In this section, we’ll make use of the second term, \(B(w)\). This term is also known by several names such as the “bias term”, the “prior”, and in special cases, the “regularization term”. The bias term allows us to nudge the learner towards particular weights by introducing an added price to pay for weights that differ from the ones that the User specifies.
Some use cases for the bias term include using the bias term as simple regularization to avoid overfitting (Goldwater & Johnson 2003), and the modeling of proposed phonological learning biases (Wilson 2006, White 2013, Mayer 2021, among others). Worked examples using the bias term for regularization and for modeling learning biases are provided in §3.3 and §3.4 respectively.
Here’s the formula for the bias term, which determines the price to be paid for choosing particular constraint weights:
\[B(w) = \Sigma_{k=1}^m \frac{(w_k - \mu_k)^2}{2\sigma_k^2}\]
Every constraint, \(k\), whose weight \(w_k\) differs from \(\mu_k\) will thus incur an additional price that needs to be paid. This price is subtracted from the log likelihood (\(LL(w|D)\)), and thus makes the objective function, \(J(w)\) smaller (notice the negative sign preceding the second term). Recall that the learner’s goal is to maximize the objective function, \(J(w)\) (i.e. get it as high as possible). The price to be paid depends upon two factors: (1) Distance from the ideal weight; (2) Tolerance for weight discrepancy.
Price to be paid increases as the distance from the ideal weight increases. The distance from the ideal weight is \(w_k - \mu_k\). For example, let’s set the ideal weight for constraint \(k\) at 0 (\(\mu_k\) = 0). A model that gives constraint \(k\) a weight of 1.5 will accrue a lower price to pay than one that gives it a weight of 3.5, as the former has a smaller weight discrepancy. More specifically, the value that is subtracted (i.e. price to pay) for constraint \(k\) is directly proportional to the square of the weight discrepancy, all else being equal. \[ \text{value to be subtracted} \propto (w_k - \mu_k)^2\]
Price to be paid increases as the tolerance for weight variation decreases. \(\sigma_k\) determines how costly the weight discrepancy is for constraint \(k\). The smaller \(\sigma_k\) is, the lower the tolerance for a particular weight discrepancy. For example, let’s imagine that we set the ideal weight for constraint \(k\), at 0 (\(\mu_k\) = 0), and that the constraint’s actual weight, \(w_k\) is 1.5; the weight discrepancy is thus 1.5. This same weight discrepancy results in a higher price to pay in a model that has \(\sigma_k\) = 0.5 (lower tolerance) than in a model that has \(\sigma_k\) = 1 (higher tolerance). More specifically, the value that is subtracted (i.e. price to pay) for constraint \(k\) is inversely proportional to the square of the tolerance, all else being equal. \[ \text{value to be subtracted} \propto (\frac{1}{\sigma_k})^2\]
Here’s the equation for the objective function again: \[J(w) = LL(w|D) - B(w)\] When maximizing \(J(w)\), there is generally a trade-off between model fit (\(LL(w|D)\)) and the bias term (\(B(w)\)), as the bias term pushes weights towards the ones specified by the User while the log likelihood pushes weights towards the ones that produce the best model fit. The strength of the bias term is determined by \(\sigma\). Specifically, if \(\sigma\) is too small, the trained weights may barely move from the specified ideal ones. If the User notices that their models don’t appear to have learned anything (i.e. the trained weights appear to be the same as the specified ideal weights), it is advisable to increase \(\sigma\) (i.e. increase the tolerance for weight discrepancy), so that weights may differ from the ideal ones without incurring too great a price.
There are three arguments that may be used to pass in values for the bias parameters \(\mu\) and \(\sigma\). The arguments are:
bias_file
mu
sigma
To use the argument bias_file
, pass in a bias file with
the structure below.
# Get paths to toy data and bias files.
data_frame_file <- system.file(
"extdata", "sample_data_frame.csv", package = "maxent.ot"
)
bias_file <- system.file(
"extdata", "sample_bias_data_frame.csv", package = "maxent.ot"
)
# Take a look at the structure of the bias_file
data_frame <- read.csv(data_frame_file)
bias_data_frame <- read.csv(bias_file)
bias_data_frame
#> Constraint Mu Sigma
#> 1 Constraint1 0 10
#> 2 Constraint2 0 10
The bias file is a .csv
. Constraints appear in the rows.
The first column specifies the ideal weight, \(\mu\), and the second column specifies the
tolerance for weight discrepancy, \(\sigma\). In the example above, both
constraints happen to have the same values for \(\mu\). However, if the User so wishes, they
may specify different values of \(\mu\)
for different constraints. The same applies to \(\sigma\). Below, we fit weights with the
biases specified in a file.
# Fit weights with biases specified in file
optimize_weights(data_frame, bias_data_frame)
#> $name
#> [1] "data_frame"
#>
#> $weights
#> Constraint1 Constraint2
#> 2.770180 2.825584
#>
#> $loglik
#> [1] -1.444645
#>
#> $k
#> [1] 2
#>
#> $n
#> [1] 3
#>
#> $bias_params
#> Constraint Mu Sigma
#> 1 Constraint1 0 10
#> 2 Constraint2 0 10
The arguments mu
and sigma
allow us to
specify biases in either scalar or vector form. This enables us to
specify different values of \(\mu\) for
each constraint, if we so desire. The same applies to \(\sigma\).
If you specify bias terms as scalars, all constraints will share the same values. Below, we fit weights with the bias parameters specified in scalar form.
# Fit weights with biases specified as scalars
optimize_weights(
data_frame, mu = 0, sigma = 1000
)
#> $name
#> [1] "data_frame"
#>
#> $weights
#> Constraint1 Constraint2
#> 10.74756 10.74758
#>
#> $loglik
#> [1] -1.386316
#>
#> $k
#> [1] 2
#>
#> $n
#> [1] 3
#>
#> $bias_params
#> Constraint Mu Sigma
#> <char> <num> <num>
#> 1: Constraint1 0 1000
#> 2: Constraint2 0 1000
If you specify bias terms as vectors, each constraint can have
different values. We combine the individual constraint parameter values
into a vector using the c()
function (e.g.
c(100, 200)
).
# Fit weights with biases specified in vector form
optimize_weights(
data_frame, mu = c(1, 2), sigma = c(100, 200)
)
#> $name
#> [1] "data_frame"
#>
#> $weights
#> Constraint1 Constraint2
#> 7.194575 7.195788
#>
#> $loglik
#> [1] -1.387044
#>
#> $k
#> [1] 2
#>
#> $n
#> [1] 3
#>
#> $bias_params
#> Constraint Mu Sigma
#> <char> <num> <num>
#> 1: Constraint1 1 100
#> 2: Constraint2 2 200
A mix of scalar and vector biases is also possible:
# Fit weights with a mix of scalar and vector biases
optimize_weights(
data_frame, mu = c(1, 2), sigma = 1000
)
#> $name
#> [1] "data_frame"
#>
#> $weights
#> Constraint1 Constraint2
#> 10.88319 10.88321
#>
#> $loglik
#> [1] -1.386313
#>
#> $k
#> [1] 2
#>
#> $n
#> [1] 3
#>
#> $bias_params
#> Constraint Mu Sigma
#> <char> <num> <num>
#> 1: Constraint1 1 1000
#> 2: Constraint2 2 1000
If only one of the two bias parameters is present, an error will be raised. (e.g. When values are provided for \(\mu\), but there are none for \(\sigma\).)
In §2.2 we noted that the log likelihood attained by
simple_model
was in fact at ceiling.
Each data set has a theoretical maximum log likelihood, which can be calculated. Recall that the data set had 3 data points (i.e. observed outputs). (Frequency shows up in the third column.)
# View the data that was used to train `simple_model`
read.table(data_file, header=FALSE, sep='\t')
#> V1
#> 1 Input,Output,Frequency,Constraint1,Constraint2
#> 2 Input1,Output1-1,1,1,0
#> 3 Input1,Output1-2,1,0,1
#> 4 Input2,Output2-1,1,0,0
#> 5 Input2,Output2-2,0,0,1
Let’s calculate the theoretical ceiling log likelihood. Here’s the equation for the log likelihood, repeated for easy reference:
\[LL(w|D) = \Sigma_{i=1}^n ln P(y_i|x_i; w)\]
A trained model that matches the observed frequencies perfectly would have attained the ceiling log likelihood. We’ll imagine just such a perfect trained model that matches the observed frequencies perfectly, and calculate its log likelihood: \[\begin{align} LL(w|D) &= \Sigma_{i=1}^3 ln P(y_i|x_i; w)\\ &= ln(.5) + ln(.5) + ln(1) \\ &= -1.386 \end{align}\]
Since the log likelihood attained by our perfect trained model
matches that of simple_model
, we conclude that
simple_model
also attained the ceiling log likelihood. In
other words, the distribution produced by simple_model
’s
weights matches the observed frequencies of the training data
perfectly.
Trained models that fit the training data too well often cannot generalize to unseen data. Such models are said to overfit the training data. One strategy to reduce overfitting is to include a regularization term. The intended effect of regularization is to keep the weights as low as possible.
To put this into practice, we’ll make use of the optional arguments
mu
& sigma
of the function
optimize_weights()
. To keep weights low, we’ll set
mu = 0
& sigma = 1
. This means that we
believe that every constraint should:
Since every constraint has the same ideal weight & the same
standard deviation (tolerance for weight discrepancy), we can make use
of the optional arguments mu
& sigma
to
input \(\mu\) and \(\sigma\) just once for all constraints.
# Train regularized model with mu=0 & sigma=1
regularized_model <- optimize_weights(
data_frame, mu=0, sigma=1
)
Inspecting the trained regularized model, we now see a value for the
attribute bias_params
:
# Take a took at the regularized model trained with scalar biases
regularized_model
#> $name
#> [1] "data_frame"
#>
#> $weights
#> Constraint1 Constraint2
#> 0.1051951 0.3163665
#>
#> $loglik
#> [1] -1.944845
#>
#> $k
#> [1] 2
#>
#> $n
#> [1] 3
#>
#> $bias_params
#> Constraint Mu Sigma
#> <char> <num> <num>
#> 1: Constraint1 0 1
#> 2: Constraint2 0 1
We can retrieve this attribute alone:
# View the bias parameters we used during model training
regularized_model$bias_params
#> Constraint Mu Sigma
#> <char> <num> <num>
#> 1: Constraint1 0 1
#> 2: Constraint2 0 1
Constraints are ordered in rows. The first column gives the \(\mu\) and the second column the \(\sigma\) for the constraint in question.
Let’s take a look at the trained weights for the regularized model:
# Get learned weights for the regularized model trained with scalar biases
regularized_model$weights
#> Constraint1 Constraint2
#> 0.1051951 0.3163665
The trained regularized model has learned that Constraint 1 has weight = 0.105, and Constraint 2 has weight = 0.316. Notice that the weights for both constraints are now closer to 0 (cf. the unbiased model where trained weights for both constraints were 14.2). We attained this by including a prior term that tagged on a price to pay for weights whose value did not exactly match the ideal weight. In particular the price to be paid increased the further a constraint’s weight was from the ideal weight \(0\).
Notice also that the log likelihood of the regularized model’s weights is below the ceiling of \(-1.386\) (\(-1.945 < -1.386\)):
# Get value of objective function for trained regularized model
regularized_model$loglik
#> [1] -1.944845
The regularized model provides a worse fit to the observed data than the unregularized model does; but the regularized model is probably able to better generalize to unseen data.
Retrieving the weight of 1 constraint
Let’s imagine that we’re only interested in a tracking the weight of
a subset of the constraints. For example, we’re interested only in
Constraint 1, and want to track the weight learned for this constraint
in the unbiased and the simple bias (regularized) model. We use index
1
to pick out the first constraint from the named list
your_model_name$weights
. Similarly, if we were interested
in the second constraint, we’d use index 2
, and so on.
# Get weight of Constraint 1 (simple model)
cons1_noBias <- simple_model$weights[1]
# Get weight of Constraint 1 (regularized model)
cons1_simpleBias <- regularized_model$weights[1]
# Compare learned weights of Constraint 1 in the unbiased model...
# ...and in the regularized one
print(paste("In the unbiased model, Constraint 1's weight is:",
sprintf(cons1_noBias, fmt = '%#.3f')))
#> [1] "In the unbiased model, Constraint 1's weight is: 13.521"
print(paste("In the regularized model, Constraint 1's weight is:",
sprintf(cons1_simpleBias, fmt = '%#.3f')))
#> [1] "In the regularized model, Constraint 1's weight is: 0.105"
Another reason to make use of the prior term is to model substantive effects. We can set different preferred weights (\(\mu\)) for different constraints, by setting different tolerances for deviation from the preferred weight (\(\sigma\)), or by using a combination of the two.
We’ll walk through a simplified version of White (2017), in which substantive effects were used to model psycholinguistic results.
Consider the following artificial grammar learning experiment (White 2017). Participants were trained on either a p→v alternation (experimental condition) or a b→v alternation (control condition). During the response phase, participants were tested on two alternations. In the first, they were tested on producing the alternation they were trained on (“test”). In the second, they were tested on extending the alternation to the other unseen pair (“extension”). That is, for those trained on the p→v alternation, the b→v alternation was considered the “untrained alternation/extension”, and vice versa.
Experimental result to be modeled:
Extension to untrained sounds occurred more frequently in the experimental condition (trained on p→v; extended to b→v) at 73% than in the control condition (trained on b→v; extended to p→v) at 20%. The percentage alternation by participants in the response phase is shown below.
#> Expt cdn Control cdn
#> trained 95 90
#> untrained 73 20
Given that the perceptual distance between [b] & [v] is smaller than that between [p] & [v], substantive effects arising from perceptual distance appear to play an important role.
So how can we introduce these substantive factors into the model? Intuitively, we want to make it more difficult to map [p] to [v] than it is to map [b] to [v]. We will use *Map(x,y) constraints (Zuraw 2007, 2013), which prohibit the mapping between sound x and sound y. We give *Map(p,v) a higher preferred weight than *Map(b,v), which means that the mapping from [p] to [v] is more strongly prohibited than the mapping from [b] to [v]. Following White, we’ll set \(\mu\)’s of these constraints to the following preferred weights:
Markedness constraints *V[-cont]V and *V[-voi]V drive the alternation from stops to [v] intervocally. These have a preferred weight of 0 to reflect that intervocalic stops and voiceless sounds were allowed in the participants’ native languages.
The training data for the experimental condition includes only the p→v alternation, reflecting the training phase encountered by the subjects in the experimental condition:
# Get path to data file (experimental condition)
white_salt_data_file <- system.file(
"extdata", "sample_whiteSalt_data_file.txt", package = "maxent.ot"
)
# File is in OTSoft format, so first convert it to data frame
white_salt_data_frame <- otsoft_tableaux_to_df(white_salt_data_file)
# View training data (experimental condition)
white_salt_data_frame
#> Input Output Frequency X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 1 VpV VpV 0 1 1 0 0
#> 2 VpV VbV 0 1 0 0 0
#> 3 VpV VvV 18 0 0 0 1
We now train the bias model for the experimental condition, making
use of the argument mu
. We want the value of \(\mu\) to be 0, 0, 1.3 and 3.65 for the four
ordered constraints [*V[-cont]V, *V[-voi]V, *Map(b,v), *Map(p,v)]
respectively. We have to combine these four values into a vector using
the c()
function (i.e.
c(0, 0, 1.3, 3.65)
). Here, we choose to have the same
tolerance for deviation from the preferred weight for all constraints,
so we’ll use the argument sigma
.
# Fit model with preferred weights for experimental condition
white_salt_bias_model = optimize_weights(
white_salt_data_frame, mu=c(0, 0, 1.3, 3.65), sigma=sqrt(0.6)
)
# View trained weights (expt cdn with bias)
white_salt_bias_model$weights
#> X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 2.5877273 0.8014478 1.3000000 1.0622343
Now let’s train an unbiased model for the same experimental condition. We’ll average out the preferred weights for the two *Map constraints, which results in each of these constraint having preferred weight = 2.27.
# Fit unbiased model for experimental condition
white_salt_noBias_model = optimize_weights(
white_salt_data_frame,
mu=c(0, 0, 2.27, 2.27), sigma=sqrt(0.6)
)
# View trained weights (expt cdn, no bias)
white_salt_noBias_model$weights
#> X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 2.0607951 0.6888778 2.2700000 0.2091823
At this point, we’ve essentially trained the model on only the p→v alternation. Next, we’ll see how well our trained models perform on both the test task (p→v) as well as on the extension task (b→v). Notice how both these phases (training and response) reflect the experimental setup with human subjects.
We’ll need a new test file that includes the observed frequencies for
both the test (p→v) and extension (b→v) alternations. We’ll pass both
the models trained above as well as the new test file to the function
predict_probabilities()
. We’ll take a closer look at both
the function predict_probabilities()
, and the structure of
the test file when we introduce predict_probabilities()
in
§4 below. For now, we’ll just focus our attention on the predicted
probability calculated by this function.
# Get path to test file (experimental condition)
white_salt_test_file <- system.file(
"extdata", "sample_whiteSalt_test_file.txt", package = "maxent.ot"
)
# File is in OTSoft format, so convert first
white_salt_test_df <- otsoft_tableaux_to_df(white_salt_data_file)
# View test data (experimental condition)
white_salt_test_df
#> Input Output Frequency X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 1 VpV VpV 0 1 1 0 0
#> 2 VpV VbV 0 1 0 0 0
#> 3 VpV VvV 18 0 0 0 1
# Predict probabilities with weights trained with bias (expt cdn)
predict_probabilities(
white_salt_test_df, white_salt_bias_model$weights
)
#> $loglik
#> [1] -4.930533
#>
#> $predictions
#> Input Output Freq X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v. Predicted
#> 1 VpV VpV 0 1 1 0 0 0.07420978
#> 2 VpV VbV 0 1 0 0 0 0.16539619
#> 3 VpV VvV 18 0 0 0 1 0.76039403
#> Observed Error
#> 1 0 0.07420978
#> 2 0 0.16539619
#> 3 1 -0.23960597
# Predict probabilities with weights trained without bias (expt cdn)
predict_probabilities(
white_salt_test_df, white_salt_noBias_model$weights
)
#> $loglik
#> [1] -3.811101
#>
#> $predictions
#> Input Output Freq X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v. Predicted
#> 1 VpV VpV 0 1 1 0 0 0.0637862
#> 2 VpV VbV 0 1 0 0 0 0.1270289
#> 3 VpV VvV 18 0 0 0 1 0.8091849
#> Observed Error
#> 1 0 0.0637862
#> 2 0 0.1270289
#> 3 1 -0.1908151
For ease of reference, the predicted percentage of alternation for the trained biased and unbiased models are summarized in the table below:
#> [1] "Predicted % of alternation during response phase (experimental condition)"
#> Biased model Unbiased model
#> Trained (95%) 91 93
#> Untrained (73%) 78 45
Recall that during the response phase, participants performed the following two alternations at the following rates:
Overall, the biased model provided a better fit to the observed data. Predicted probability differed by no more than 5.5%, with the greatest discrepancy showing up for the untrained [b] → [v] alternation.
The unbiased model, in contrast, performed poorly. In particular, the untrained [b] → [v] alternation appeared to be close to chance (45%).
Let’s now train a biased and an unbiased model for the control condition. The training data includes only the b→v alternation, reflecting the training phase of the control condition:
# Get path to data file (control condition)
white_control_data_file <- system.file(
"extdata", "sample_whiteCtrl_data_file.txt", package = "maxent.ot"
)
white_control_df <- otsoft_tableaux_to_df(white_control_data_file)
# View training data (control condition)
white_control_df
#> Input Output Frequency X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 1 VbV VpV 0 1 1 0 0
#> 2 VbV VbV 0 1 0 0 0
#> 3 VbV VvV 18 0 0 1 0
# Fit model with preferred weights for control condition
white_control_bias_model = optimize_weights(
white_control_df,
mu=c(0, 0, 1.3, 3.65), sigma=sqrt(0.6)
)
# View trained weights (control cdn with bias)
white_control_bias_model$weights
#> X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 1.9372214 0.6600614 0.0000000 3.6500000
# Fit unbiased model for control condition
white_control_noBias_model = optimize_weights(
white_control_df,
mu=c(0, 0, 2.27, 2.27), sigma=sqrt(0.6)
)
# View trained weights (control cdn, no bias)
white_control_noBias_model$weights
#> X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 2.0607951 0.6888778 0.2091823 2.2700000
As with the experimental condition above, we’ll now subject the two trained models to the response phase, which includes the test (b→v) and extension (p→v) tasks.
# Get path to test file (control condition)
white_control_test_file <- system.file(
"extdata", "sample_whiteCtrl_test_file.txt", package = "maxent.ot"
)
# File is in OTSoft format, so convert first
white_control_test_df <- otsoft_tableaux_to_df(white_control_test_file)
# View test data (experimental condition)
white_control_test_df
#> Input Output Frequency X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v.
#> 1 VpV VpV 0.8 1 1 0 0
#> 2 VpV VvV 0.2 0 0 0 1
#> 3 VbV VbV 0.1 1 0 0 0
#> 4 VbV VvV 0.9 0 0 1 0
# View test data (control condition)
read.table(white_control_test_file, header=FALSE, sep='\t')
#> V1 V2 V3 V4 V5 V6 V7
#> 1 NA *V[-cont]V *V[-voi]V *Map(b-v) *Map(p-v)
#> 2 NA Mk1 Mk2 Mp1 Mp2
#> 3 VpV VpV 0.8 1 1 0 0
#> 4 VvV 0.2 0 0 0 1
#> 5 VbV VbV 0.1 1 0 0 0
#> 6 VvV 0.9 0 0 1 0
# Predict probabilities with weights trained with bias (control cdn)
predict_probabilities(
white_control_test_df, white_control_bias_model$weights
)
#> $loglik
#> [1] -0.838242
#>
#> $predictions
#> Input Output Freq X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v. Predicted
#> 1 VpV VpV 0.8 1 1 0 0 0.7412963
#> 2 VpV VvV 0.2 0 0 0 1 0.2587037
#> 3 VbV VbV 0.1 1 0 0 0 0.1259534
#> 4 VbV VvV 0.9 0 0 1 0 0.8740466
#> Observed Error
#> 1 0.8 -0.05870366
#> 2 0.2 0.05870366
#> 3 0.1 0.02595343
#> 4 0.9 -0.02595343
# Predict probabilities with weights trained with bias (control cdn)
predict_probabilities(
white_control_test_df, white_control_noBias_model$weights
)
#> $loglik
#> [1] -1.196516
#>
#> $predictions
#> Input Output Freq X.V..cont.V X.V..voi.V X.Map.b.v. X.Map.p.v. Predicted
#> 1 VpV VpV 0.8 1 1 0 0 0.3823294
#> 2 VpV VvV 0.2 0 0 0 1 0.6176706
#> 3 VbV VbV 0.1 1 0 0 0 0.1356836
#> 4 VbV VvV 0.9 0 0 1 0 0.8643164
#> Observed Error
#> 1 0.8 -0.41767061
#> 2 0.2 0.41767061
#> 3 0.1 0.03568365
#> 4 0.9 -0.03568365
For ease of reference, the predicted percentage of alternation for the trained biased and unbiased models are presented in the table below:
#> [1] "Predicted % of alternation during response phase (control condition)"
#> Biased model Unbiased model
#> Trained (90%) 87 87
#> Untrained (20%) 26 61
During the testing phase, participants performed the following two alternations at the following rates:
The biased model matches the observed data relatively well. The unbiased model, in contrast, greatly overgeneralizes to the untrained p→v alternation.
Overall, the biased model, which incorporated substantive biases, performed much better than the unbiased one (no substantive influences) at modeling human subject responses during the response phase. The unbiased model both greatly underpredicted extension to the b→v alternation in the experimental condition and greatly overpredicted the extension to the p→v alternation in the control condition. In contrast, the biased model neither vastly under- nor overpredicted on any of the response tasks.
For the biased model, we introduced different preferred weights for
separate constraints by utilizing the mu
argument of the
optimize_weights()
function. These preferred weights were
chosen to represent the substantive effects that arose from perceptual
distances. The resulting trained biased model was able to match the
observed frequencies of alternations produced by human subjects during
the response phase much more closely than its unbiased counterpart in
all experimental conditions, thus demonstrating the utility of
incorporating substantive biases for linguistic modeling.
predict_probabilities()
The function predict_probabilities()
predicts the
probabilities of SR candidates based on their violation profiles and the
constraint weights.
The function predict_probabilities()
requires at the
very least, the following two arguments in this order:
test_file
: An input file
optimize_weights()
constraint_weights
: Constraint weights which will be
used to make predictionsLet’s imagine that we want to make a prediction using the weights of
a model that we’ve trained using optimize_weights()
. If
we’ve trained a model using optimize_weights()
and stored
the trained model in the object fit_model_a
, we can specify
the constraint weights by passing in the weights
attribute
of the model: fit_model_a$weights
.
# Get paths to toy data file
data_file_a <- system.file(
"extdata", "sample_data_frame.csv", package="maxent.ot"
)
df_a <- read.csv(data_file_a)
# Fit weights to data (no biases)
fit_model_a <- optimize_weights(df_a)
# Predict probabilities for the same input
predict_probabilities(df_a, fit_model_a$weights)
#> $loglik
#> [1] -1.386296
#>
#> $predictions
#> Input Output Freq Constraint1 Constraint2 Predicted Observed
#> 1 Input1 Output1-1 1 1 0 4.999951e-01 0.5
#> 2 Input1 Output1-2 1 0 1 5.000049e-01 0.5
#> 3 Input2 Output2-1 1 0 0 9.999987e-01 1.0
#> 4 Input2 Output2-2 0 0 1 1.342067e-06 0.0
#> Error
#> 1 -4.884441e-06
#> 2 4.884441e-06
#> 3 -1.342067e-06
#> 4 1.342067e-06
The returned object has two attributes:
loglik
: The log likelihoodpredictions
: A prediction data frameThe structure of the prediction data frame is as follows:
A point of caution: Duplicate UR entries
When there are multiple instances of the same UR, both
Predicted Probability
and Observed Probability
are normalized over all instances of the UR. However, Error
is still computed row-wise. This leads to issues with the interpretation
of Error
.
Duplicate UR entries may arise when modeling experimental data. For example, we’ll imagine that our experiment had two unique URs: Input1 and Input2. The first participant picked Output1-1 for Input1 and Output2-1 for Input2. The second participant picked Output1-2 for Input1 and failed to respond for Input2. We thus get the frequencies below (data_file_b).
# Data has repeated URs (absolute frequency)
# Get paths to toy data file
data_file_b <- system.file(
"extdata", "sample_data_file_double_aFreq.txt", package="maxent.ot"
)
df_b <- otsoft_tableaux_to_df(data_file_b)
# Take a look at the structure of the sample data with duplicate URs
df_b
#> Input Output Frequency Constraint1 Constraint2
#> 1 Input1 Output1-1 1 1 0
#> 2 Input1 Output1-2 0 0 1
#> 3 Input2 Output2-1 1 0 0
#> 4 Input2 Output2-2 0 0 1
#> 5 Input1 Output1-1 0 1 0
#> 6 Input1 Output1-2 1 0 1
# Here's the structure of the same data without duplicate URs
df_a
#> Input Output Frequency Constraint1 Constraint2
#> 1 Input1 Output1-1 1 1 0
#> 2 Input1 Output1-2 1 0 1
#> 3 Input2 Output2-1 1 0 0
#> 4 Input2 Output2-2 0 0 1
Notice that both df_a
and df_b
describe the
same data, albeit in a different manner. If we’d collapsed all instances
of repeated URs from df_b
, we’d the get df_a
.
However, the resulting Error
from both data sets
differ.
# Fit weights to data (no biases)
fit_model_b <- optimize_weights(df_b)
# Predict probabilities for the same input (duplicate URs)
predict_probabilities(df_b, fit_model_b$weights)
#> $loglik
#> [1] -2.77259
#>
#> $predictions
#> Input Output Freq Constraint1 Constraint2 Predicted Observed
#> 1 Input1 Output1-1 1 1 0 2.499976e-01 0.5
#> 2 Input1 Output1-2 0 0 1 2.500024e-01 0.0
#> 3 Input2 Output2-1 1 0 0 9.999987e-01 1.0
#> 4 Input2 Output2-2 0 0 1 1.342067e-06 0.0
#> 5 Input1 Output1-1 0 1 0 2.499976e-01 0.0
#> 6 Input1 Output1-2 1 0 1 2.500024e-01 0.5
#> Error
#> 1 -2.500024e-01
#> 2 2.500024e-01
#> 3 -1.342067e-06
#> 4 1.342067e-06
#> 5 2.499976e-01
#> 6 -2.499976e-01
Interpreting the Error
for df_a
is
straightforward, while the Error
for df_b
is
uninterpretable. (e.g. Regarding df_b
, what does
it mean when the the first instance of Output1-2 has a 25%
overprediction error?)
For this reason, it is strongly recommended to check one’s data set
for duplicate URs, in order to collapse duplicate URs into a single UR
if the User desires to make use of the Error
values.
Alternatively, one could engage in post-hoc data wrangling to achieve
interpretable error values if the uncollapsed data set is otherwise
desired.
Let’s imagine that we’re curious about how particular weights would
perform with our data set. Perhaps we’ve seen these weights reported in
a paper and we’ll like to try them out. Below, we demonstrate how to
prepare these weights so that they’ll be compatible with
predict_probabilities()
.
Let’s say we want to make predictions with Con1’s weight being 1.5 and Con2’s weight being 2.5. We have to put these weights into a list, and convert it to the appropriate type.
# Get predictions with User-chosen constraint weights
# Make a list of weights
# Be sure to order the weights in exactly the same order...
# ...in which their respective constraints appear in the input file!
my_wt_ls <- list(1.5, 2.5)
# Convert to double object
my_wts <- as.double(my_wt_ls)
Now we’re ready to make predictions with our hand-selected weights.
# Get predictions
predict_probabilities(df_a, my_wts)
#> $loglik
#> [1] -1.705413
#>
#> $predictions
#> Input Output Freq Constraint1 Constraint2 Predicted Observed Error
#> 1 Input1 Output1-1 1 1 0 0.73105858 0.5 0.23105858
#> 2 Input1 Output1-2 1 0 1 0.26894142 0.5 -0.23105858
#> 3 Input2 Output2-1 1 0 0 0.92414182 1.0 -0.07585818
#> 4 Input2 Output2-2 0 0 1 0.07585818 0.0 0.07585818
The predict_probabilities()
function provides the option
of saving the prediction data frame to a file.
In the example below, we use the following optional arguments to save the prediction data frame:
output_path
: path and file name
"C:\\path\\to\\dir\\your_file_name.txt"
,
the result is saved to the file your_file_name.txt
, which
will be in the directory C:\path\to\dir
.out_sep
: specifies delimiter
"\t"
As promised, we now present the generalized version of the model with temperature included:
\[P(SR_i|UR_i; w) = \frac{1}{Z_w(UR_i)} exp(- \Sigma_{k=1}^m \frac{w_k f_k(SR_i, UR_i)}{T})\]
where \(T\) is the new variable, temperature. As before, \(m\) is the number of constraints, \(w_k\) is the weight of constraint \(k\), \(f_k(SR_i, UR_i)\) is the number of violations of constraint \(k\) incurred by the UR-SR pair <\(UR_i, SR_i\)>, and \(Z_w(UR_i)\) is a normalization term defined as
\[Z(UR_i) = \Sigma_{SR \in \mathcal{S}(UR_i)} exp(- \Sigma_{k=1}^m \frac{w_k f_k(SR, UR_i)}{T})\] where \(\mathcal{S}(UR_i)\) is the set of SRs arising from the UR, \(UR_i\).
Temperature is a variable whose value the User may adjust only in the
predict_probabilities()
function. When using
predict_probabilities()
, if the User doesn’t specify a
temperature, the value for temperature defaults to 1. Similarly for
optimize_weights()
, where there is no argument available
for the User to specify temperature, the temperature can be assumed to
be 1. Thus, all the cases that we’ve encountered in the preceding
sections when we applied predict_probabilities()
or
optimize_weights()
can be assumed to have \(T=1\), since we haven’t yet used the
argument temperature
. Notice that when \(T=1\), the generalized model introduced
here is equivalent to the simplified one defined in §1.3.
As temperature increases, the probability distribution becomes less polarized. That is, as temperature increases, the probabilities of the SRs of a given UR move towards equal probability with each other. For example, if a particular UR has two SRs, higher values of \(T\) will move the probability of each SR towards \(0.5\).
We’ll use the hand-picked weights my_wts
to illustrate
this effect. First, we get predictions when \(T=1\), \(T=3\), and \(T=5\), and store the results in the objects
t1_pred
, t3_pred
, and t5_pred
respectively.
# Get paths to toy data file
data_file_c <- system.file(
"extdata", "sample_data_file_2.txt", package="maxent.ot"
)
# Get predictions T=1
# We'll let temperature default to 1
t1_pred <- predict_probabilities(data_file_c, my_wts)
# Get predictions T=3
t3_pred <- predict_probabilities(data_file_c, my_wts, temperature=3)
# Get predictions T=5
t5_pred <- predict_probabilities(data_file_c, my_wts, temperature=5)
Next, we’ll construct a data frame to easily compare the prediction probabilities across the different temperatures. Notice that we can extract the predicted probabilities like so:
# View predicted probability of t1_pred
t1_pred$predictions[, 'Predicted']
#> [1] 0.73105858 0.26894142 0.76615721 0.06289001 0.17095278
Now we’ll create the data frame:
# Select columns we want, and store them in lettered variables
a <- t1_pred$predictions[, 'Input']
b <- t1_pred$predictions[, 'Output']
c <- t1_pred$predictions[, 'Predicted']
d <- t3_pred$predictions[, 'Predicted']
e <- t5_pred$predictions[, 'Predicted']
# Join variables to create data frame
temperature_df <- data.frame(a, b, c, d, e)
# Rename columns
names(temperature_df) <- c('Input', 'Output', 'T=1', 'T=3', 'T=5')
# View the data frame
temperature_df
#> Input Output T=1 T=3 T=5
#> 1 Input1 Output1-1 0.73105858 0.5825702 0.5498340
#> 2 Input1 Output1-2 0.26894142 0.4174298 0.4501660
#> 3 Input2 Output2-1 0.76615721 0.4899250 0.4260125
#> 4 Input2 Output2-2 0.06289001 0.2129205 0.2583897
#> 5 Input2 Output2-3 0.17095278 0.2971545 0.3155978
We’ll track the predicted probabilities of the outputs of Input1:
#> Output1-1 (%) Output1-2 (%)
#> T=1 73 27
#> T=3 58 42
#> T-5 55 45
Thus, as temperature increases, we see that the predicted distributions over the outputs of a given input become less polarized. That is, the probabilities of the various outputs for one given input move closer towards equiprobability with each other. Since there are two outputs here, the value that these predicted probabilities move towards is \(1/2 = 0.5\). If there were three outputs (as with Input2), the value that these predicted probabilities move towards would be \(1/3 = 0.333\). The graphs below visualize the depolarization of the probability distribution over outputs as temperature increases.
The temperature variable can be used to generate less categorical predictions in a way that is independent of the constraint weights. This may come in useful when modeling psycholinguistic tasks. Consider the 2 alternative forced choice (2AFC) response task, in which participant responses are often less categorical (i.e. less polarized) than expected. In a 2AFC task, participants are presented with two options, and must pick one. Why this effect occurs is still unclear. Possible explanations have been put forth such as: selecting a choice that one would not have otherwise volunteered being easier than volunteering said choice without being prompted.
Regardless of why this occurs, being able to model depolarization is crucial when one wishes to ask questions that encompass both the lexicon and psycholinguistic experimental results. In one such work, Hayes, Zuraw, Siptar & Londe (2009) (henceforth HZSL) asked whether certain types of constraints were under- or over-learned based on participant responses to a wug test, when compared to the lexicon. In what follows, we walk the Reader through a replication of HZSL (2009) to demonstrate how temperature can be utilized to make lexicon-based models and models trained on human subject responses comparable.
In Hungarian, the dative suffix has two allomorphs: [-nɔk] and [-nɛk]. A front rounded vowel anywhere in the stem triggers the front allomorph. When the closest vowel is a back vowel, the back allomorph is triggered. However, if a back vowel is separated from the suffix by one or more neutral vowels, variation is observed. Likewise, if a stem consists of only neutral vowels, variation is observed.
In addition, several unnatural data patterns were observed. For example, stems with a final bilabial stop favored the front allomorph. Such patterns were deemed unnatural because (non-dorsal) consonants are generally thought not to have any effect on vowel harmony.
Constraints associated with vowel-triggered harmony were deemed “natural” while those associated with consonant-triggered ones were deemed “unnatural”. In addition, monosyllabic Ci:C stems take a back allomorph (historically, this was a back vowel), so the constraint USE_BACK/Ci:C__ was also considered unnatural.
HZSL were interested in assessing whether the natural and unnatural constraints governing the zones of variation were under- or over-learned. They designed a wug test where Hungarian speakers were presented with a wug stem as well as the allomorphs [-nɔk] and [-nɛk]. The speakers had to pick one of the two suffixes to go with the stem, so this was a 2AFC task.
When one wishes to assess whether humans under- or over-learn certain constraints, it is natural to want to compare the lexicon-trained weights (Grammar L) directly to the wug-test-trained weights (Grammar W). However, such direct comparisons can be problematic for several reasons:
Instead of directly comparing a constraint’s weight in Grammar L to its weight in Grammar W, here’s what HZSL did:
Comparing Grammar S to Grammar W:
A huge collection of Grammar S:
Let’s take stock of where we are. In §5.3.2, we saw that there were
three challenges to directly comparing the lexicon and wug grammars.
Simulating idealized speakers performing the same wug task as human
speakers solved the first two issues. The third issue, simulating
depolarized 2AFC wug responses, will be taken up in the next section
(§5.3.4), where the hyperparameter temperature
will be used
to effect depolarization.
How would a simulated speaker go about picking a suffix for a wug stem? For each stem, the simulated speaker’s grammar (i.e. Grammar L) generates a probability distribution over the two suffix options. This probability distribution guides the simulated speaker’s suffix choice. For example, if the probability distribution is:
then 9 out of 10 times, the simulated speaker will pick the back suffix.
We will utilize the predict_probabilities()
function to
generate the probability distribution over the two suffix options of a
given stem:
test_file
: Hungarian wug test data
constraint_weights
: Grammar L
temparature
: T=1.5
# Get paths to sample Hungarian wug data
hu_file <- system.file(
"extdata", "sample_hu_wug.txt", package = "maxent.ot"
)
hu_data <- otsoft_tableaux_to_df(hu_file)
# We'll use the weights learned by training on the lexicon as reported by HZSL on (p.848).
lex_wts_ls <- list(5.39, 1.69, 1.46, 3.00, 4.04, 2.45, 1.08, .91, 1.73, 2.42)
# Make a named list (optional)
names(lex_wts_ls) <- list(
'AGREE(back,nonlocal)', 'AGREE(front,local)',
'AGREE(nonhigh_front,local)', 'AGREE(low_front,local)',
'AGREE(double_front,local)', 'USE_FRONT/bilabial__',
'USE_FRONT/[+cor,+son]__', 'USE_FRONT/sibilant__',
'USE_FRONT/CC__', 'USE_BACK/[C0i:C0]__'
)
# Convert to double object (required type as argument of predict_probabilities())
lex_wts <- as.double(lex_wts_ls)
# Predict probabilities with temperature set to 1.5
# Notice also that "UTF-8" is an option for encoding, which comes in useful here
hu_pred <- predict_probabilities(
hu_data, lex_wts, temperature = 1.5, encoding = "UTF-8"
)
Let’s take a look at some of the predicted wug probabilities produced by Grammar L:
# Let's view some of the predicted probabilities at T=1.5
# We've dropped the constraint violations for easier viewing
# Both predicted and observed probability are conditioned over URs
# e.g. Observed probability in rows 3 & 4 don't sum to 1
# because the stem [étt] appears twice in this data set.
head(hu_pred$predictions[, -4:-(ncol(hu_pred$predictions)-3)])
#> Input Output Freq Predicted Observed Error
#> 1 Szólingy Szólingynak 1 0.78807041 1.0 -0.21192959
#> 2 Szólingy Szólingynek 0 0.21192959 0.0 0.21192959
#> 3 étt éttnak 0 0.01860365 0.0 0.01860365
#> 4 étt éttnek 1 0.48139635 0.5 -0.01860365
#> 5 Csazenip Csazenipnak 0 0.13470305 0.0 0.13470305
#> 6 Csazenip Csazenipnek 1 0.86529695 1.0 -0.13470305
# For curiosity's sake, let's compare:
# With default temperature=1...
# ... predicted probability is more polarized
hu_pred_t1 <- predict_probabilities(
hu_data, lex_wts, encoding = "UTF-8"
)
head(hu_pred_t1$predictions[, -4:-(ncol(hu_pred_t1$predictions)-3)])
#> Input Output Freq Predicted Observed Error
#> 1 Szólingy Szólingynak 1 0.877611113 1.0 -0.122388887
#> 2 Szólingy Szólingynek 0 0.122388887 0.0 0.122388887
#> 3 étt éttnak 0 0.003769867 0.0 0.003769867
#> 4 étt éttnek 1 0.496230133 0.5 -0.003769867
#> 5 Csazenip Csazenipnak 0 0.057866955 0.0 0.057866955
#> 6 Csazenip Csazenipnek 1 0.942133045 1.0 -0.057866955
We’ve used Grammar L to generate predicted probabilities for
wug suffixes at \(T=1.5\), and stored
the probability distributions in the object hu_pred
. Next,
we use these probability distributions to guide our simulated subjects
as they pick suffixes for wug stems.
We use the function monte_carlo_weights()
to simulate
500 groups of idealized Hungarian speakers picking suffixes for the wug
stems. This function only requires two arguments: the prediction data
frame (we use hu_pred
), and the number of simulations.
This function produces a complete wug test response by randomly
choosing one suffix option for each trial in hu_pred
based
on the the predicted probabilities in hu_pred
. (This
function automatically converts predicted P(SR|UR) in
hu_pred
to predicted P(SR|trial).) It then calls
optimize_weights()
to fit this simulated wug response
(i.e. to produce Grammar S), and records the weights
learned. It repeats this process for the specified number of
simulations. It returns a data frame with \(m\) rows and \(n\) columns, where \(m\) is the number of simulations and \(n\) is the number of constraints. That is,
each row represents an instance of Grammar S.
The following code was used to produce the 500 simulated complete wug test responses and train their corresponding grammars (Grammar S’s):
Each complete wug test consisted of 1703 trials, so the process took a very long time to run. To save time, we saved the 500 sets of constraint weights (500 Grammar S’s) and present them below:
# Get path to the saved constraint weights trained on the 500 simulated wug tests
hu_500simul_wts_path <- system.file(
"extdata", "hu_500simuls_wts.txt", package = "maxent.ot"
)
# Store constraint weights in object `hu_500simul_wts`
hu_500simul_wts <- read.table(hu_500simul_wts_path, header=TRUE, sep='\t')
hu_500simul_wts <- data.frame(hu_500simul_wts)
# Rename columns
names(hu_500simul_wts) <- c(
'AGREE(back,nonlocal)', 'AGREE(front,local)',
'AGREE(nonhigh_front,local)', 'AGREE(low_front,local)',
'AGREE(double_front,local)', 'USE_FRONT/bilabial__',
'USE_FRONT/[+cor,+son]__', 'USE_FRONT/sibilant__',
'USE_FRONT/CC__', 'USE_BACK/[C0i:C0]__'
)
# View the first 6 rows of hu_500simul_wts
# i.e. the first 6 Grammar S's
head(hu_500simul_wts)
#> AGREE(back,nonlocal) AGREE(front,local) AGREE(nonhigh_front,local)
#> 1 3.383119 1.0876934 0.9360368
#> 2 3.444765 0.9890333 0.8731836
#> 3 3.588559 1.1637950 0.8753390
#> 4 3.323187 1.1766615 0.8087822
#> 5 3.509050 0.9975359 0.9905781
#> 6 3.394041 0.5826657 1.2360608
#> AGREE(low_front,local) AGREE(double_front,local) USE_FRONT/bilabial__
#> 1 1.838278 2.624190 1.593205
#> 2 2.107615 2.506750 1.703648
#> 3 2.180713 3.195825 1.874271
#> 4 2.085677 2.435280 1.824118
#> 5 1.919777 2.912883 1.605194
#> 6 1.818039 2.976685 1.635315
#> USE_FRONT/[+cor,+son]__ USE_FRONT/sibilant__ USE_FRONT/CC__
#> 1 0.6393167 0.7599629 0.966830
#> 2 0.6472897 0.8210698 1.216772
#> 3 0.2137989 0.4346787 1.244254
#> 4 0.3060865 0.6178177 1.099641
#> 5 0.7514726 0.7526335 1.074058
#> 6 1.1377657 0.8029462 1.302554
#> USE_BACK/[C0i:C0]__
#> 1 1.4764315
#> 2 2.1874098
#> 3 0.8430519
#> 4 2.2069378
#> 5 1.8524475
#> 6 1.6132052
Some housekeeping
Notice that duplicate UR entries are present in the prediction data
frame hu_pred
. This is because the set-up of the
monte_carlo_weights()
function crucially depends upon
predicted P(SR|trial) in order to randomly give one SR in each trial a
frequency of 1. Consecutive rows are considered to belong to the same
trial if they have the same UR, but not a repeated SR. When creating
your prediction data frame, it is important to ensure that trials are
not collapsed into a single entry. The caution against duplicate UR
entries highlighted in §4.1 doesn’t apply here because the
Error
column of the hu_pred
file isn’t used by
monte_carlo_weights()
.
The process took a long time to run so we performed only 500 (rather
than 10,000) simulations. The Reader is invited to uncomment the code
below if they would like to try running the
monte_carlo_weights()
function.
Now that we have a distribution of idealized constraint weights for each of the 10 constraints (from 500 Grammar S’s), let’s find out whether the weights trained on human wug responses (Grammar W) were under-, over-, or perfectly learned.
First, we’ll have to fit a model to the human wug responses (i.e. produce Grammar W):
Let’s view the constraint weights of Grammar W in relation to the average idealized weights from the 500 Grammar S’s:
# Weights trained on human wug response
human_wt <- human_wug_model$weights
# Average idealized wug weights
ideal_wt <- colMeans(hu_500simul_wts)
# Create data frame
wt_df <- data.frame(human_wt, ideal_wt)
# View weights trained on human wug responses & average idealized wug weights
print(wt_df, digits = 3)
#> human_wt ideal_wt
#> AGREE.back.nonlocal. 3.581 3.635
#> AGREE.front.local. 2.223 1.155
#> AGREE.nonhigh_front.local. 0.950 0.985
#> AGREE.low_front.local. 1.052 2.007
#> AGREE.double_front.local. 1.939 2.724
#> USE_FRONT.bilabial__ 1.037 1.631
#> USE_FRONT...cor..son.__ 0.433 0.725
#> USE_FRONT.sibilant__ 0.368 0.605
#> USE_FRONT.CC__ 0.688 1.163
#> USE_BACK..C0i.C0.__ 0.805 1.616
All unnatural constraints (those beginning with USE…) appear to be under-learned by humans. For example, USE_FRONT/bilabial__ has a lower weight when trained on human wug responses (1.04) than the average of the weights obtained by training on simulated idealized wug responses (1.63).
The plots below visualize where the weights trained on human wug tests (blue vertical dashed line) fall in relation to the distribution of the weights trained on the simulated idealized wug tests (histogram).
compare_models()
Let’s imagine that we’ve trained multiple models, perhaps one with 3 constraints & another with 2 constraints. The larger model has more constraints to play with, and so usually fits the data better. If our only criterion to determine the best model was to assess only model fit to training data, larger models would generally do better.
Nevertheless, we can still ask whether including additional constraints is worth the better fit. For example, if including one additional constraint, improves model fit drastically, the inclusion of this constraint is probably justified. However, if the inclusion of this constraint only improves model fit by a minuscule amount, we’d be better off removing this constraint from our model.
There are several methods that can be used to quantify the trade-off
between the number of constraints and the model fit. Here’s where the
function compare_models
becomes useful. This function
includes four comparison methods:
lrt
: Likelihood ratio test.
aic
: Akaike Information Criterion (AIC).
aic_c
: AIC corrected for small sample size.
bic
: Bayesian Information Criterion.
The formulas for the AIC-C and the BIC rely on sample sizes. The sample size is the calculated by summing over the column of surface form frequencies. If you wish to use the AIC-C or the BIC comparison methods, make sure that the values in this column are token counts, not relative frequencies. Using relative frequencies will produce invalid results.
Two requirements must be met in order to perform the likelihood ratio test. First, exactly two models must be compared. Second, these models must be nested.
Here, we have two data sets small_data
&
large_data
(thus meeting the first criterion).
large_data
has the very same constraints (C1 & C3) that
small_data
has, plus an additional constraint, C2. Thus,
the nested criterion is met.
# Get paths to toy data files
# This file has two constraints
small_file <- system.file(
"extdata", "sample_data_file_small.txt", package = "maxent.ot"
)
small_df <- otsoft_tableaux_to_df(small_file)
# This file has three constraints
large_file <- system.file(
"extdata", "sample_data_file_large_otsoft.txt", package = "maxent.ot"
)
large_df <- otsoft_tableaux_to_df(large_file)
# View small_data
small_df
#> Input Output Frequency Constraint1 Constraint3
#> 1 Input1 Output1-1 1 1 1
#> 2 Input1 Output1-2 1 0 0
#> 3 Input2 Output2-1 1 0 1
#> 4 Input2 Output2-2 0 0 0
# View large_data
large_df
#> Input Output Frequency Constraint1 Constraint2 Constraint3
#> 1 Input1 Output1-1 1 1 0 1
#> 2 Input1 Output1-2 1 0 1 0
#> 3 Input2 Output2-1 1 0 0 1
#> 4 Input2 Output2-2 0 0 1 0
We’ll train both models and store the trained models in the objects
small_model
& large_model
:
# Fit weights to both data sets
small_model <- optimize_weights(small_df)
large_model <- optimize_weights(large_df)
Now we’ll perform the likelihood ratio test on the two trained
models. To use the lrt
method of
compare_models()
, we’ll need three arguments:
method = 'lrt'
The models can appear in either order, but they must both appear
before the method
argument.
# Compare models using the likelihood ratio test
compare_models(small_model, large_model, method='lrt')
#> description chi_sq k_delta p_value
#> 1 large_df~small_df 1.386292 1 0.2390322
This method returns a data frame with a single row and the following columns:
description
: The names of the two models being
compared, with the larger model appearing first.chi_sq
: chi-squared value.k_delta
: degrees of freedom (difference in the number
of constraints between the two model).p-value
: p-value.The p-value was \(0.239\). This indicates that the difference in model fit to data was not significant at the \(.05\) significance level. Hence, we are not justified in including the additional constraint, C2. The model with fewer constraints is the better one.
The likelihood ratio, \(LR\), is calculated as follows:
\[LR = 2(LL_2 - LL_1)\]
where \(LL_2\) is the log likelihood of the larger model (the one with more constraints) and \(LL_1\) is the log likelihood of the smaller model.
The likelihood ratio \(LR\) is set as the \(X^2\)-value, which is used to perform a \(X^2\)-test. The \(X^2\)-value should range between 0 & positive infinity, with a larger value indicating a greater improvement in model fit. The difference in the number of constraints is set as the degrees of freedom. The p-value tells us whether the difference in likelihood between the two models is significant. If it turns out to be significant, then we are justified in including the additional constraints in order to achieve the better model fit.
This test runs on the assumption that the larger model indeed produces a better fit to the data. If this criterion isn’t met, there is no way in which the larger model is better than the smaller one since
Mathematically, this translates to having a negative \(X^2\)-value, which is uninterpretable. An error message will be produced and no \(X^2\)-test will be performed if such a scenario is encountered.
We’ll attempt to perform a likelihood ratio test on two non-nested models. We re-use small_data, which has constraints C1 & C3. We introduce a new data set, large_data_b, which has constraints C1, C2 & C4.
# Get paths to toy data files
# This file has three constraints
large_file_b <- system.file(
"extdata", "sample_data_file_large_b.txt", package = "maxent.ot"
)
large_df_b <- otsoft_tableaux_to_df(large_file_b)
# View small_data
small_df
#> Input Output Frequency Constraint1 Constraint3
#> 1 Input1 Output1-1 1 1 1
#> 2 Input1 Output1-2 1 0 0
#> 3 Input2 Output2-1 1 0 1
#> 4 Input2 Output2-2 0 0 0
# View large_data_b
large_df_b
#> Input Output Frequency Constraint1 Constraint2 Constraint4
#> 1 Input1 Output1-1 1 1 0 0
#> 2 Input1 Output1-2 1 0 1 0
#> 3 Input2 Output2-1 1 0 0 1
#> 4 Input2 Output2-2 2 0 1 0
# Fit weights to data set large_data_b
large_model_b <- optimize_weights(large_df_b)
The program is unable to perform the likelihood ratio test, and will produce an error message.
# Compare models using the likelihood ratio test
compare_models(small_model, large_model_b, method='lrt')
#> Error in compare_models(small_model, large_model_b, method = "lrt"): Constraint names in the smaller model are not a subset of those in the larger model. Models must be nested to apply the likelihood ratio test.
The AIC value for a particular model is calculated as follows:
\[AIC = 2k - 2LL\] where \(k\) is the number of constraints, and \(LL\) is the log likelihood of the current model.
A smaller AIC score indicates a better model. Both the number of constraints and the log likelihood contribute to the AIC score. All else being equal, having a smaller number of constraints or having a higher log likelihood is indicative of a better model. Recall that log likelihood (\(-\infty < LL \leq 0\)) quantifies model fit.
Two or more models may be passed to compare_models()
when using the aic
method. To use the aic
method of compare_models()
, we’ll need \(m+1\) arguments, where \(m\) is the number of models that we want
compared.
method = 'aic'
The models can appear in any order, but they must all appear before
the method
argument.
The returned data frame contains the same number of rows as models passed in. The models are sorted in increasing order of AIC (i.e. best first). The data frame has the following columns:
model
: The name of the model.k
: The number of constraints.aic
: The model’s AIC score.aic.delta
: The difference between the current model’s
AIC score and the AIC score of the “best” model.aic.wt
: The model’s AIC weight: This reflects the
relative likelihood (or conditional probability) that this model is the
“best” model in the set.cum.wt
: The cumulative sum of AIC weights up to and
including this model.ll
: The log likelihood of this model.# Compare models using AIC
compare_models(small_model, large_model, method='aic')
#> model k aic aic.delta aic.wt cum.wt ll
#> 1 small_df 2 8.158883 0.0000000 0.5761171 0.5761171 -2.079442
#> 2 large_df 3 8.772591 0.6137077 0.4238829 1.0000000 -1.386295
In the data frame above, the small model has an AIC score of 8.16 while the large model has an AIC score of 8.77. This indicates that the small model is better.
How much better is the small model compared to the large model? The
column aic.wt
gives the probability that the current model
is the best one of the entire set of models passed in. So the small
model has a 57.6% chance of being the best model, and the large model
has a 43.7% chance of being the best model.
aic.wt
) explainedThe AIC weight of a model is calculated by applying the softmax function to the negative halved-distance between current model’s AIC score and the AIC score of the best model:
\[W_i = \frac{exp(-0.5\delta_i)}{\Sigma_{j=1}^m exp(-0.5\delta_j)}\] where \(\delta_i\) is the difference in AIC score between model \(i\) and the best model, and \(m\) is the number of models being compared.
We’ll break this equation down further, and walk through its application step-by-step. The values of the intermediate terms we’ll use (e.g. \(\delta\); the raw AIC weight, \(r\)) are shown in Figure 5.
aic.delta
)First, we’ll calculate the difference in AIC score between model
\(i\) and the best model, \(\delta_i\). This value shows up in the
aic.delta
column of the returned data frame.
\[\delta_i = aic_i - aic_{best}\] where \(aic_i\) is the AIC score of the \(i^{th}\) model, and \(aic_{best}\) is the AIC score of the best model.
Let’s calculate the AIC score distance for the large model, \(\delta_{large\_model}\):
\[\begin{align} \delta_{large\_model} &= aic_{large\_model} - aic_{best\_model} \\ &= aic_{large\_model} - aic_{small\_model} \\ &= 8.773 - 8.159 \\ &= 0.614 \end{align}\]
and for the small model, \(\delta_{small\_model}\):
\[\begin{align} \delta_{small\_model} &= aic_{small\_model} - aic_{best\_model} \\ &= aic_{small\_model} - aic_{small\_model} \\ &= 8.773 - 8.773 \\ &= 0 \end{align}\]
Next, we’ll define the raw AIC weight:
\[r_i = e^{(-\delta_i/2)}\] where \(r_i\) is the raw AIC weight of model \(i\).
Let’s calculate the raw AIC weight for the large model:
\[\begin{align} r_{large\_model} &= e^{-\delta_{large\_model}/2} \\ &= e^{-0.614/2} \\ &= 0.736 \end{align}\]
Notice that \(\delta = 0\) for the best model, and \(\delta \gt 0\) for the rest of the models. This results in the best model’s raw AIC weight \(r\) being \(e^0 = 1\), and the worse models receiving a raw AIC weight less than \(1\).
aic.wt
Next, we’ll perform normalization to get the actual AIC weights shown
in the aic.wt
column. The true AIC weight for the \(i^{th}\) model, \(W_i\), is just its raw AIC weight, \(r_i\), divided by the sum of raw AIC
weights of all models that were passed in to be compared:
\[W_i= \frac{r_i}{\sum_j^m r_j}\]
where \(m\) is the number of models
passed in to compare_models()
, and \(r_j\) is the raw AIC weight of the \(j^{th}\) such model.
Applying this equation, the AIC weight of the large model is: \[\begin{align} W_{large\_model} &= \frac{r_{large\_model}}{\sum_j^2 r_j} \\ &= \frac{0.736}{1 + 0.736} \\ &= 0.424 \end{align}\]
The BIC is an alternative measure of assessing the trade-off between the number of constraints and model fit. The BIC replaces the AIC’s first term (\(2k\)) with \(k \ln(n)\).
\[BIC = k \ln(n) - 2 LL\]
where \(k\) is the number of constraints, \(n\) is the sample size, and \(LL\) is the log likelihood of the current model.
As with the AIC, a smaller BIC score indicates a better model.
For both the AIC and the BIC, the first term penalizes models that have more constraints, \(k\). However, the number of constraints, \(k\), is scaled differently in these two measures. For the BIC, \(k\) is scaled by \(\ln(n)\), where \(n\) is the number of samples. As the number of observed tokens increases, the penalty accrued by models that have more constraints increases (albeit at a decreasing rate since \(k\) is scaled by \(\ln(n)\) rather than \(n\)). In contrast, the AIC scales \(k\) with the constant \(2\), so sample size has no effect on the penalty accrued by having more constraints, \(k\). At \(7\) observed tokens and below, a model’s BIC score is smaller than its AIC score. At \(8\) observed tokens and above, a model’s BIC score is larger than its AIC score (since \(\ln(7) < 2 < \ln(8)\)).
A caveat: Though unlikely to occur, it is advisable to check that the sample size is greater than \(1\). At a sample size of \(1\), the effect of the number of constraints, \(k\), on the BIC score is canceled out. The BIC score becomes directly proportional to the log likelihood alone (\(BIC = k \ln(1) -2LL = -2LL\)).
As with the AIC, two or more models may be passed to
compare_models()
when using the bic
method.
The arguments passed in to compare_models
for the
bic
method are identical to that for the aic
method, except that the method argument should now be:
method='bic'
.
Hence, to use the bic
method of
compare_models()
, we’ll need \(m+1\) arguments, where \(m\) is the number of models that we want
compared.
method = 'bic'
The models can appear in any order, but they must all appear before
the method
argument.
# Compare models using BIC
compare_models(small_model, large_model, method='bic')
#> model k n bic bic.delta bic.wt cum.wt ll
#> 1 large_df 3 3 6.068428 0.00000 0.5358981 0.5358981 -1.386295
#> 2 small_df 2 3 6.356108 0.28768 0.4641019 1.0000000 -2.079442
The returned data frame from the BIC method has the same structure as that of the AIC, with one additional column:
n
: The number of samples in the data that the model is
fit to.The bic
and bic.wt
columns are the BIC
counterparts to the aic
and aic.wt
columns in
the AIC data frame.
Looking at the data frame above, we see that the small model (smaller
BIC score) again performs better than the large model (large BIC score).
The conditional probability that the small model is the best of all
models passed to compare_models()
is 53.6%.
The Akaike Information Criterion for small sample sizes (AIC-C) yet another alternative method of comparing the trade-off between the number of constraints and the model fit. Like the BIC, AIC-C also takes the sample size into account.
The AIC-C modifies the AIC by introducing a third term.
\[AIC_c = 2k - 2 LL + \frac{2k^2 + 2k}{n-k-1}\]
As before, \(k\) is the number of constraints, \(n\) is the sample size, and \(LL\) is the log likelihood of the current model. This newly introduced third term further increases the cost of having additional constraints. As \(n\) approaches infinity, the third term’s denominator approaches 0, so the AIC-C becomes equivalent to the AIC for large sample sizes.
As with the AIC and the BIC, two or more models may be passed to
compare_models()
when using the aic_c
method.
The arguments passed in to compare_models
for the
aic_c
method are identical to that for the aic
and bic
methods, except that the method argument should now
be: method='aic_c'
.
Hence, to use the bic
method of
compare_models()
, we’ll need \(m+1\) arguments, where \(m\) is the number of models that we want
compared.
method = 'aic_c'
The models can appear in any order, but they must all appear before
the method
argument.
# Get paths to toy data files
# This file has 2 constraints
small_file_c <- system.file(
"extdata", "sample_data_file_small_c.txt", package = "maxent.ot"
)
small_df_c <- otsoft_tableaux_to_df(small_file_c)
# This file has 3 constraints
large_file_c <- system.file(
"extdata", "sample_data_file_large_c.txt", package = "maxent.ot"
)
large_df_c <- otsoft_tableaux_to_df(large_file_c)
# Fit weights to both data sets
small_model_c <- optimize_weights(small_df_c)
large_model_c <- optimize_weights(large_df_c)
# Compare models using AIC-C
compare_models(small_model_c, large_model_c, method='aic_c')
#> model k n aicc aicc.delta aicc.wt cum.wt ll
#> 1 small_df_c 2 6 16.31777 0.000000 0.97375561 0.9737556 -4.158883
#> 2 large_df_c 3 6 23.54518 7.227416 0.02624439 1.0000000 -2.772591
The returned data frame from the AIC-C method has the same structure
as that of the BIC. The aicc
and aicc.wt
columns are the AIC-C counterparts to the aic
and
aic.wt
columns in the AIC data frame respectively.
In the data frame above, the small model (smaller AIC-C score)
performs better than the large model (larger AIC-C score). The
conditional probability that the small model is the best of all the
models that were passed to compare_models()
is \(97.4%\).
Here’s the formula for AIC-C again:
\[AIC_c = 2k - 2 LL + \frac{2k^2 + 2k}{n-k-1}\]
When \(n-k \leq 1\) for a particular model, the third term’s denominator is 0. This results in an infinitely large AIC-C value. In the example below, this issue crops up for both the large and the small models.
# Problematic data set: all models have n-k <= 1
# Compare models using AIC-C
compare_models(small_model, large_model, method='aic_c')
#> model k n aicc aicc.delta aicc.wt cum.wt ll
#> 1 small_df 2 3 Inf NaN NaN NaN -2.079442
#> 2 large_df 3 3 Inf NaN NaN NaN -1.386295
In the data frame above, none of the models produce a valid AIC-C
score. Hence the probability that one particular model is the best
cannot be calculated (aicc.wt
is NaN
“Not a
Number”).
One way to side-step this issue is to scale the sample size up. For example, the problematic data sets had their sample sizes increased by a factor of 2 to create the non-problematic data sets:
# View problematic data set
# sample size: n = 3
# maximum number of constraints for valid aic_c score: 3-2 = 1
# actual maximum number of constraints used: 3
large_df
#> Input Output Frequency Constraint1 Constraint2 Constraint3
#> 1 Input1 Output1-1 1 1 0 1
#> 2 Input1 Output1-2 1 0 1 0
#> 3 Input2 Output2-1 1 0 0 1
#> 4 Input2 Output2-2 0 0 1 0
# View non-problematic data set
# sample size: n = 6
# maximum number of constraints for valid aic_c score: 6-2 = 4
# actual maximum number of constraints used: 3
large_df_c
#> Input Output Frequency Constraint1 Constraint2 Constraint3
#> 1 Input1 Output1-1 2 1 0 1
#> 2 Input1 Output1-2 2 0 1 0
#> 3 Input2 Output2-1 2 0 0 1
#> 4 Input2 Output2-2 0 0 1 0
In the case where only one (i.e. large_model_d
)
of multiple models produces an invalid AIC-C score, the model with the
invalid AIC-C score will have an aicc.wt
of 0. The
aicc.wt
for the other models with valid AIC-C scores will
be calculated as per normal (see data frame below).
# 1 Problematic data set: large_data_d has n-k <= 1
# 2 Non-problematic data sets
# sample size: n = 5
# maximum number of constraints for valid aic_c score: 5-2 = 3
# Get paths to toy data files
# This file has 2 constraints: Good
small_file_d <- system.file(
"extdata", "sample_data_file_small_d.txt", package = "maxent.ot"
)
small_df_d <- otsoft_tableaux_to_df(small_file_d)
# This file has 3 constraints: Good
med_file_d <- system.file(
"extdata", "sample_data_file_med_d.txt", package = "maxent.ot"
)
med_df_d <- otsoft_tableaux_to_df(med_file_d)
# This file has 4 constraints: Problematic
large_file_d <- system.file(
"extdata", "sample_data_file_large_d.txt", package = "maxent.ot"
)
large_df_d <- otsoft_tableaux_to_df(large_file_d)
# Fit weights for all data sets
small_model_d <- optimize_weights(small_df_d)
med_model_d <- optimize_weights(med_df_d)
large_model_d <- optimize_weights(large_df_d)
# Compare models using AIC-C
compare_models(small_model_d, med_model_d, large_model_d, method='aic_c')
#> model k n aicc aicc.delta aicc.wt cum.wt ll
#> 1 small_df_d 2 5 16.73012 0.00000 9.999513e-01 0.9999513 -3.365058
#> 2 med_df_d 3 5 36.59167 19.86156 4.865154e-05 1.0000000 -3.295837
#> 3 large_df_d 4 5 Inf Inf 0.000000e+00 1.0000000 -3.295837