The discrete effect for parity, the linear term for day, and the interaction effect all now have a biologically and statistically significant impact on expected lying time. In visualizing this dynamic in Figure 3, the consultant can now clearly see that during the first half of the observation interval, the mature cows are hogging the free stall spaces, forcing the heifers to stand; however, once the grazing season starts, the cows leave their free stalls to go graze, while the now foot-weary heifers remain in the free stalls. A linear model is ultimately just a way to produce estimates of mean and variance that are conditional on additional variables. Further probabilistic assumptions can be made in order to draw statistical inferences about these estimates, but they typically are not employed in the estimation step itself. In other words, linear models are just a means of generating summary statistics, and all summary statistics are a form of information compression. Thus, we can once again see in this example that, when the biological system that generates a dataset is well understood, this prior knowledge can be infused into an appropriate model to efficiently compress the information contained in such records. If, however, we have an incomplete understanding of the system that gives rise to the behavioral responses, a linear model can become an inefficient means of data compression that hemorrhages information, potentially causing important patterns to be overlooked.
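To make this concrete, a minimal sketch of such a model fit is given below; the column names (lying_time, parity, day) and the placeholder records are hypothetical stand-ins for the simulated lying-time data described above, not actual farm records.

```python
# Minimal sketch of the linear model described above, using hypothetical
# column names: lying_time (daily proportion of time spent lying),
# parity ("heifer" vs. "cow"), and day (day of the observation window).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
day = np.tile(np.arange(60), 2)
parity = np.repeat(["cow", "heifer"], 60)
# Placeholder records with an inversion in lying time halfway through the window.
lying_time = np.where((parity == "cow") == (day < 30), 0.6, 0.4) + rng.normal(0, 0.03, 120)
records = pd.DataFrame({"lying_time": lying_time, "parity": parity, "day": day})

# Discrete effect for parity, linear term for day, and their interaction,
# yielding conditional estimates of mean lying time.
fit = smf.ols("lying_time ~ C(parity) * day", data=records).fit()
print(fit.summary())
```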
In the preceding example, the simulated behavioral response was intended to be a bit tricky to capture, but it was still fully describable by explanatory variables (date of observation and parity) available to the consultants. But what if this same mechanism driving inversions in lying patterns had been caused by more complex factors? What if, instead of a single persistent shift in lying patterns after pasture access was granted, the desirability of the free stalls had been influenced by more transient environmental factors such as weather or bedding cleanliness? Resource holding potential is known to be influenced by a range of factors (size, seniority, energy balance, health status, innate aggressiveness, etc.), many of which might not be directly measured in standard production systems. Clever model parameterization may still succeed in capturing any one of these patterns, but as such complexities begin to compound within a system, it can become overwhelming to account for all such contingencies within a single model. Consequently, the farther we move from controlled experimental contexts towards the chaos and complexity of commercial farm environments, the more fundamentally challenging it becomes to use model-based approaches to information compression to extract ethological insights from PLF data streams.
Looking beyond linear models, machine learning approaches may provide a means to overcome such gaps in background knowledge by more fully leveraging the power of modern computing. Such algorithms are divisible into two general classes. In supervised machine learning, the exact structure of the model is, to varying degrees, gleaned algorithmically from the data itself; however, the user is still required to distinguish between explanatory and response variables. Consequently, this framework lends itself to optimizing the predictive power of a model, but it may still overlook important patterns in a dataset if the factors driving heterogeneity in the response variables are not measured and included amongst the candidate predictor variables. Unsupervised ML algorithms, on the other hand, employ a more open-ended approach to information compression. Such algorithms do not distinguish between explanatory and response variables, but seek only to progressively extract and visualize the most striking nonrandom features of a dataset until only noise remains. Several algorithmic approaches exist to explore the latent high-dimensional geometry of large datasets via UML. While unsupervised neural networks are not as data-hungry as their supervised cousins, their ground-up approach to learning the key features of a dataset still requires many hundreds to many thousands of observational units.
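As a toy illustration of this division of labor, the sketch below contrasts the two classes on the same synthetic data matrix; the matrix X, the response y, and the particular model choices are hypothetical and purely illustrative.

```python
# Minimal sketch: supervised vs. unsupervised learning on the same data matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 12))                   # 80 animals x 12 sensor-derived features
y = X[:, 0] + rng.normal(scale=0.5, size=80)    # a user-designated response variable

# Supervised ML: the user designates predictors (X) and a response (y), and the
# algorithm is tuned to predict y as accurately as possible.
supervised = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Unsupervised ML: no response is designated; the algorithm simply compresses the
# most prominent nonrandom structure in the full data matrix.
unsupervised_scores = PCA(n_components=2).fit_transform(X)
```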
Future advancements in transfer learning techniques and in our understanding of neural architectures may eventually lower the sample sizes required to train such networks, but for many behavioral applications where data may only be available from a single farm or even a single group of animals, less data-hungry algorithms are typically required. Principal Component Analysis is the simplest example of this class of algorithms, but more modern approaches are better suited to the complex nonlinear geometric features that can often be found in sensor data. Alternatively, several algorithmic strategies may be employed to compress the complex geometric features of high-dimensional datasets into discrete clusters. K-means clustering is arguably the simplest to implement, and has consequently seen some adoption in analyses of livestock data. While no overt model is stated in this approach, the user must still specify a priori the appropriate number of clusters needed to represent the latent structures in the data, a metaparameter choice that may not be immediately obvious in all applications. For such datasets, hierarchical clustering algorithms may provide a more open-ended approach to developing an optimal discrete encoding of such high-dimensional patterns.

As its name implies, hierarchical clustering is an intuitive analytical framework that progressively groups data through a series of sequential agglomerative steps. To illustrate how this algorithm works, suppose you go to the feed store and get a pack of little plastic toy cows. You line them up in front of a toddler and ask which cows look most alike. They point to the Guernsey and the Holstein, so you pull them forward and stand them together. You ask this question again, they point to the Jersey and the Brown Swiss, so you pull these forward and stand them together. You ask again, and the Angus and Hereford cows are paired together. You ask again, and the two groups of dairy cows are pushed together. If you kept track of all these pairings, you might produce the visual summary of this decision-making process presented in Figure 4. This schematic, properly called a dendrogram, provides a succinct 2D representation of how a fairly large number of phenotypic features are distributed within your plastic herd. Hierarchical clustering algorithms seek to mimic this agglomerative process, which we can implement fairly intuitively using objective mathematical constructs.

To see how, let us return to our simulated herd. As before, our larger mature cows will monopolize the free stalls when they are not on pasture. In this example, however, we will increase the complexity of the analytical problem by now supposing that the animals always had access to pasture, and that rain events were the environmental factor driving inversions in lying times, inversions that will now be randomly scattered throughout the observation window. The first step in hierarchically clustering this dataset is to compute a dissimilarity matrix: a square symmetric matrix containing quantitative estimates of the dissimilarity between each pair of observational units in the corresponding row and column indices. In order to cluster cows together with similar lying patterns, we will first calculate a dissimilarity matrix using the Euclidean distance or L2 norm, which here is just the square root of the sum of squared differences in observed lying times between a given pair of cows, taken over all observation days.
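A minimal sketch of these dissimilarity calculations is given below, carried through the Ward's-linkage agglomeration and two-way heatmap reordering described in the next paragraph; the cows-by-days matrix lying_mat and its dimensions are hypothetical placeholders for the simulated lying-time records.

```python
# Minimal sketch: two-way hierarchical clustering of a hypothetical
# cows-by-days matrix lying_mat (rows = cows, columns = observation days,
# cells = proportion of the day spent lying).
import numpy as np
import seaborn as sns
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
lying_mat = rng.uniform(0.3, 0.7, size=(20, 90))  # placeholder: 20 cows x 90 days

# Euclidean (L2) distances between every pair of cows, taken over all days...
cow_dissim = squareform(pdist(lying_mat, metric="euclidean"))
# ...and between every pair of observation days, taken over all cows.
day_dissim = squareform(pdist(lying_mat.T, metric="euclidean"))

# Ward's linkage: at each step, merge the two clusters whose union adds the
# least within-cluster variance.
cow_tree = linkage(pdist(lying_mat), method="ward")
day_tree = linkage(pdist(lying_mat.T), method="ward")

# seaborn's clustermap runs both clusterings and reorders the heatmap's rows
# and columns by the resulting dendrograms (analogous to Figure 5B).
sns.clustermap(lying_mat, method="ward", metric="euclidean", cmap="viridis")
```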
Similarly, in order to cluster days together wherein the herd demonstrated similar lying patterns in response to environmental factors, we will calculate the Euclidean distance between each pair of observation days as the square root of the sum of squared differences in lying time over all animals in the herd. Using these pairwise dissimilarity values, a ground-up agglomeration algorithm can then be applied. Here we will use Ward's linkage method, wherein the pair of clusters whose merger produces the smallest increase in within-cluster variance is joined at each step. In order to visualize the results of both clustering routines simultaneously, a heatmap can be used to directly visualize this simulated data matrix. Here each row will correspond to a cow, each column will correspond to an observation day, and each cell will be colored to represent the proportion of time that a given cow spends lying down on a given day. With rain events scattered randomly throughout the observation window, and cow records provided in no particular order, we see in visualizing the raw data matrix in Figure 5A that the inversions in lying patterns are completely obscured. The dendrograms produced by clustering cows across days and days across cows can then be used to reorder the row and column indices of the raw data matrix, so that cows with similar temporal patterns in their lying times are grouped together on the row axis and days on which the herd showed similar lying patterns are grouped together on the column axis. Visualizing these results again using a heatmap in Figure 5B, we can now see that this clustering algorithm has captured two distinct groups of cows and two distinct groups of observation days, and so the inversions in lying patterns within the data are now quite visually striking. Thus, even though this hierarchical clustering algorithm was never provided information on the factors driving this behavioral pattern (cow parity and weather records), it has still succeeded in recovering the social dynamics hiding within this lying time data. Consequently, even if no other farm records were made available to us, we would still be able to identify that overstocking is compromising the welfare of this herd using this model-free approach to information compression and knowledge discovery.

The preceding example has hopefully illustrated that, by leveraging intrinsic codependencies in the behavioral responses of socially housed animals, model-free machine learning approaches can recover complex behavioral patterns from sensor data absent any assumptions about the causative factors. In practice, however, once such a pattern is detected, we typically want to try to pin down the variables eliciting such reactions. In the preceding simulation, the mechanisms linking lying time, age, and weather were pretty simple, and so, armed with the insights from the UML analyses, a number of fairly straightforward linear methods might be employed to probe for the causative variables amongst farm records. A number of supervised machine learning approaches are specifically designed to probe for significant associations amongst large sets of candidate predictor variables. With LASSO regression, any nonlinear dynamics between candidate predictors and the response must be explicitly coded, whereas neural networks and regression trees can infer a range of nonlinear dynamics when properly parameterized. For all these methods, however, two potential drawbacks should be considered for behavioral applications.
First, because the bias-variance tradeoffs of such models must be tuned through cross-validation to avoid overfitting and spurious associations, the sample sizes required by these techniques to screen large numbers of candidate predictors are often quite large as well. Even when sufficient sample sizes are available, such techniques inherently prioritize prediction over intuition. While variable importance estimates may provide some insights into the key drivers of a behavioral response, it can be difficult with these approaches to characterize the dynamics between predictor and response variables.

Suppose a dataset were collected wherein each observational unit is coded with one of two mutually exclusive behaviors, for example lying or not lying. If both behaviors are observed at equivalent frequencies, and we are provided no additional information beyond the distribution of this discrete variable, could we make any type of informed guess about what behavior we might see if we selected an observation at random from this data? Since neither behavior is more likely to be observed than the other, we would be completely uncertain, and would subsequently compute a maximum theoretical entropy of log2(2) = 1 bit. Alternatively, if some behavioral mechanism caused this probability to shift towards or away from one of these two behaviors, then we might be able to venture a guess at what behavior we would expect to see if we picked an observation at random; our uncertainty would decrease, and so too would the corresponding entropy estimate. Taken to the extreme, if only one behavior were ever observed, then we could guess the encoded value of any observation in our dataset with complete certainty, and so the corresponding entropy estimate would drop to zero. Figure 6 demonstrates how the entropy estimate of a binary variable changes with the symmetry of the distribution of the observed encoding. In calculating such entropy values, the resulting estimates are contingent only upon the assumptions used to develop the discretization scheme. No conditional mean or any other model need be assumed, unlike the standard variance estimators utilized in conventional statistical inference frameworks for linear models.
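A minimal sketch of this entropy calculation is given below; the probabilities are illustrative, and the function name is simply a hypothetical convenience.

```python
# Minimal sketch: Shannon entropy (in bits) of a two-outcome behavioral encoding.
import numpy as np

def binary_entropy(p_lying: float) -> float:
    """Entropy of a behavior observed with probability p_lying (vs. 1 - p_lying)."""
    probs = np.array([p_lying, 1.0 - p_lying])
    probs = probs[probs > 0]                # treat 0 * log2(0) as 0
    return float(-(probs * np.log2(probs)).sum())

print(binary_entropy(0.5))   # both behaviors equally likely: 1.0 bit (maximum)
print(binary_entropy(0.9))   # skewed towards one behavior: ~0.47 bits
print(binary_entropy(1.0))   # only one behavior ever observed: 0.0 bits
```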