Cluster Analysis of 137 Soybean Lines Based on Root System Architecture Traits Measured in Rhizoboxes

Journal of Biometrics & Biostatistics

ISSN: 2155-6180

Open Access

Research Article - (2023) Volume 14, Issue 4

Prabhjot Sanghera1, François Belzile2, Waldiodio Seck2 and Pierre Dutilleul1*
*Correspondence: Pierre Dutilleul, Department of Plant Science, McGill University, Macdonald Campus, Sainte-Anne-de-Bellevue, QC, Canada
1Department of Plant Science, McGill University, Macdonald Campus, Sainte-Anne-de-Bellevue, QC, Canada
2Département de phytologie et Institut de biologie intégrative et des systèmes, Université Laval, Québec, QC, Canada

Received: 01-Aug-2023, Manuscript No. jbmbs-23-110237; Editor assigned: 03-Aug-2023, Pre QC No. P-110237; Reviewed: 17-Aug-2023, QC No. Q-110237; Revised: 22-Aug-2023, Manuscript No. R-110237; Published: 29-Aug-2023, DOI: 10.37421/2155-6180.2023.14.179
Citation: Sanghera, Prabhjot, François Belzile, Waldiodio Seck and Pierre Dutilleul. “Cluster Analysis of 137 Soybean Lines Based on Root System Architecture Traits Measured in Rhizoboxes.” J Biom Biosta 14 (2023): 179.
Copyright: © 2023 Sanghera P, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

This study was motivated by the need to select 30 soybean lines from a total of 137 for a sophisticated 3-D phenotyping analysis of the Root System Architecture (RSA), in which not all the lines could be included and replicated. A representative subset of size 30 was found after performing four cluster analyses and comparing the results of two of them in particular. One of these two cluster analyses is based on the data for 12 RSA-related traits previously collected in 2D on three replicates of the 137 soybean lines; the other is based on the first six principal components, which represent 95% of the total dispersion after data standardization in a preliminary Principal Component Analysis (PCA). The two cluster analysis procedures provided 16 soybean lines that were the closest to the centroid of their respective cluster in both cases. Fourteen more were found to be common and at a distance from the centroid below a pre-set threshold value without being the closest. The final selection of 30 excludes two soybean lines that were the second member selected from their cluster, and includes instead two soybean lines that are the closest and second closest to their respective centroid in the cluster analysis after PCA on standardized data, but are not well represented in the other cluster analysis. In conclusion, the 93.3% overlap between the two sets of results shows a robust clustering structure in RSA 2-D phenotyping in soybean. Our statistical approaches and procedures can be followed and applied in biological frameworks other than plant phenotyping.

Keywords

Cluster analysis • Data standardization • Distance to the centroid • Plant phenotyping • Principal component analysis • Root system architecture

Introduction

One of the main difficulties in experimental research on biological systems is the bidirectional relationship between genotype and phenotype. Researchers in the omics sciences [1-7], including phenomics [8], are continuously developing new technologies that produce enormous amounts of data, which help improve our understanding of the complexity of living organisms, provided they are analysed appropriately. To enable biologically relevant conclusions to be drawn, statistical methods, among others [9-12], must be optimized in parallel. Raw data from omics experiments are typically shared as figures and other meaningful visual representations. The primary goal of agricultural phenomics, or field omics [13], is to measure and compare phenotypes of crop plants. With the interpretation of dendrograms and proximity to centroids, cluster analysis is a potentially very effective means to meet that objective. Different clustering algorithms exist that can, for given criteria, group individuals and identify them as cluster members [14].

Phenotypic variation in a germplasm pool is necessary for plant breeders to progress through selection. In this study, we have analysed phenotypic data for the Root System Architecture (RSA) of 137 soybean lines; source of data: [15]. The primary or tap root is the first organ formed by hypophysis in germinating seeds [16]. The thick soybean primary root produces primordia from the pericyclic cells, which grow into lateral roots [17]. Numerical variables, such as the quantity of secondary lateral roots, average root diameter, and root length, typically describe the size and abundance of the root system components. In other measured variables, the focus is on the topology or structure of the root system, like the type and angle of root connections [18]. Here, 12 RSA-related traits had previously been measured from 2-D images of the content of rhizoboxes in which soybean seedlings were grown: Total Length of Roots (TLR), Length of Primary Root (LPR), Length of Secondary Roots (LSR), Distribution of Total Root Length (DTLR), Total Number of Roots (TNR), Median Number of Roots (Med), Maximum Number of Roots (Max), Depth of Root System (DRS), Width of Root System (WRS), Surface of Root System (SRS), Diameter of Primary Roots (DR), and Surface Area of Primary Root (SAR) [15].

We first performed cluster analysis on the dataset introduced above in four ways: without vs. with data standardization, combined or not with the application of a Principal Component Analysis (PCA) to reduce data dimensionality. We then focused on two of these ways, called “Approach 1” and “Approach 2”. In doing so, our aim was to best answer two questions: How should RSA multivariate data be analysed to objectively define a given number (e.g., 30) of clusters? How can a relevant member (i.e., a soybean line) be identified for each of the 30 clusters? These questions are addressed while keeping in mind that the resulting 30 soybean lines would later be used for a sophisticated, time-consuming RSA phenotyping in 3D. We used the SAS software, Version 9.4 for Windows (SAS Institute Inc., Cary, NC, USA), to design and perform our cluster analyses.

Materials and Methods

Source of experimental data

The dataset used in the multivariate analyses described below consists of the mean values of phenotypic data collected for three seedlings per line (N=3) from 137 lines of soybean grown in Canada. The seeds were first germinated in Petri dishes filled with fine vermiculite and then transplanted into custom-designed rhizoboxes filled with vermiculite. After 10 days of growth, images of the roots were taken using a camera. The Automatic Root Image Analysis (ARIA) software was used to extract the RSA-related traits from each 2-D image: TLR, LPR, LSR, DTLR, TNR, Med, Max, DRS, WRS, SRS, DR, and SAR [15].

Cluster analysis

This multivariate statistical method is aimed at identifying “clusters”, or groups of individuals, and their “members” for given criteria of proximity in the multidimensional space of a quantitative dataset. In the plethora of existing cluster analysis procedures, clustering depends on the definition of proximity and the type of distance or similarity involved; see, e.g., [14]. In all cases, the basic principles of the method are the same: grouping individuals that are more similar in the same cluster around a “centroid”, in a way that maximizes the separation among clusters while minimizing the distances between members within clusters. We applied cluster analysis to obtain 30 clusters from 137 soybean lines (1 individual=1 soybean line). As a starting point in a given approach, we identified the soybean lines with greatest proximity to the centroid as representatives of the clusters. Our motivation is to select objectively 30 soybean lines for future research work that is practically impossible to undertake with all the 137 soybean lines (i.e., RSA phenotyping based on computed tomography scanning).

In this study, we performed disjoint cluster analyses with the SAS procedure FASTCLUS, in which a nearest centroid sorting algorithm is implemented. We used it without the option of cluster seeds as first guess for centroids, so that the algorithm initially considered each individual as a separate cluster. Distances between two individuals, between one individual and the centroid of one cluster with more than one member, and between two centroids of clusters with several members were computed based on the values of the input variables (using means when centroids of non-singleton clusters are involved); see the VAR statement in SAS scripts A1 and A3 in the appendix. By default, the Euclidean distance is used to assess the proximity among individuals and clusters. The algorithm merges the two closest clusters at each step until the desired number of clusters (MAXC) is reached. Unlike the SAS procedure CLUSTER, PROC FASTCLUS assigns each individual to a single cluster without organization in a hierarchical tree structure.
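To make the idea of nearest centroid sorting with Euclidean distances concrete, here is a minimal Python sketch. It is not PROC FASTCLUS: the starting centroids (`seeds`) are picked by hand, the data are hypothetical 2-D points, and the function names are illustrative only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def nearest_centroid_sort(points, seeds, n_iter=10):
    """Assign each point to its nearest centroid, then update the centroids.

    `seeds` are hand-picked starting centroids (an assumption of this sketch).
    Returns (labels, centroids).
    """
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            for p in points
        ]
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # An empty cluster keeps its previous centroid.
            new_centroids.append(centroid(members) if members else centroids[k])
        if new_centroids == centroids:
            break  # assignments and centroids are stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (50.0, 50.0)]
seeds = [points[0], points[2], points[4]]
labels, centroids = nearest_centroid_sort(points, seeds)
distances = [euclidean(p, centroids[lab]) for p, lab in zip(points, labels)]
```

In this toy example the isolated point forms a cluster of its own, so it coincides with its centroid and its distance to the centroid is 0. This is one way in which a distance of 0 can arise in a table of distances to centroids.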

We developed and followed two approaches for clustering.

Approach 1: Cluster analysis with the 12 RSA-related traits. In SAS script A1, "MAXC=30" specifies the requested number of clusters, and the final cluster assignments are saved as output in "work.fastclus_scores".

Approach 2: Cluster analysis with 6 principal components (Prin1-Prin6). In this approach, results of a preliminary PCA are used; see the text below and SAS scripts A2 and A3. The input variables VAR in A3 are "Prin1-Prin6". These were chosen for cluster analysis after PCA (see below) showed that they accounted for 95% of the variability in the data table after column standardization. Prior to standardization, the data table (with 137 rows and 12 columns) contained the mean values (N=3) per soybean line for each of the 12 RSA-related traits. The other options in A3 (i.e., MAXC, OUT) are the same as in A1.

Principal component analysis

PCA can be performed on the same dataset as cluster analysis, but it has a different aim. PCA is used to examine the relationships among quantitative variables observed on a number of individuals in order to reduce the dimensionality of the data space [14]. Matrix algebra tools applied to the sample correlation matrix (with ones as diagonal entries and standardized covariances off the diagonal) provide “principal components” based on eigenvalues and associated orthogonal eigenvectors. By performing PCA, we aimed to identify structural patterns in the association of the 12 RSA-related traits over the 137 soybean lines and to assess differences in cluster analysis results obtained with well-defined principal components (Approach 2) vs. with no data standardization and no dimensionality reduction (Approach 1).

In SAS script A2 in the appendix, the procedure PRINCOMP is called with "DATA=PCA_Seck_et_al_2020" to specify the input dataset and the option STANDARD to perform PCA on the 12 × 12 sample correlation matrix (i.e., after transforming the data for each variable to a sample variance of 1.0). The latter option facilitates the interpretation of results by focusing on associations among variables via correlations, while avoiding the scale effects related to data dispersion and measurement units that would arise if the 12 × 12 sample variance-covariance matrix were used.
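The effect of column standardization can be illustrated with a short Python sketch. The data table below is hypothetical (5 individuals, 3 traits on very different scales) and the function names are illustrative. The sketch only shows that the variance-covariance matrix of standardized columns is the correlation matrix of the raw columns; it does not reproduce PROC PRINCOMP.

```python
import math


def standardize_columns(rows):
    """Centre each column and scale it to a sample variance of 1 (n - 1 divisor)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (n - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def covariance_matrix(rows):
    """Sample variance-covariance matrix (n - 1 divisor) of a data table."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    return [
        [
            sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
            for cb, mb in zip(cols, means)
        ]
        for ca, ma in zip(cols, means)
    ]


# Hypothetical table: 5 individuals (rows), 3 traits (columns).
table = [
    [120.0, 1.2, 30.0],
    [150.0, 1.5, 28.0],
    [90.0, 1.1, 35.0],
    [200.0, 2.0, 22.0],
    [170.0, 1.6, 27.0],
]

raw_cov = covariance_matrix(table)                     # dominated by the first trait
corr = covariance_matrix(standardize_columns(table))   # ones on the diagonal
```

In `raw_cov`, the first trait has by far the largest variance simply because of its measurement scale, whereas every diagonal entry of `corr` equals 1 and every trait carries the same weight.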

Results and Discussion

The first 6 principal components (out of a maximum of 12; there are 12 variables provided by the 12 soybean root traits) explain about 95% of the variability in the data table (Figure 1, top left panel). Several of the RSA-related traits are redundant; see SAR, DRS, DTLR, LSR, TLR and WRS, RS in the PCA biplots (Figure 1, other panels). The latter result confirms the correlation analysis results reported in Seck W, et al. [15].

Figure 1. Principal Component Analysis (PCA) results. Top left panel: Percentage of the variability in the data table explained by the 12 principal components, cumulative or not. Other panels: Biplots of Prin2 against Prin1, Prin3 against Prin1, and Prin3 against Prin2; Prin1, Prin2, Prin3 denote the first three principal components in descending order of the associated eigenvalues.

In a PCA with standardization of the data table, which is equivalent to performing the PCA on the sample correlation matrix [14], “variance”, “dispersion”, “variation”, and “variability” tend to mean the same thing.
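The choice of how many principal components to retain for a target share of the variability can be sketched as follows in Python. The eigenvalues below are made up for illustration (five variables, so they sum to 5) and are not the eigenvalues of this study. The function name `components_needed` is hypothetical.

```python
def components_needed(eigenvalues, threshold=0.95):
    """Smallest number of leading components whose cumulative share of the
    total variance reaches `threshold`."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for k, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(eigenvalues)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix (illustration only).
eigenvalues = [2.5, 1.5, 0.6, 0.3, 0.1]
k = components_needed(eigenvalues, threshold=0.95)
```

With these made-up values, the cumulative shares are about 0.50, 0.80, 0.92 and 0.98, so four components are needed to reach 95%.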

Using the criterion of greatest proximity or smallest distance to the centroid, 16 soybean lines are found to be common to the lists of 30 names obtained in the cluster analyses along Approach 1 and Approach 2; see the yellow highlights in Table 1. Loosening the required proximity to a maximum difference of 0.15 with the smallest distance to the centroid on both sides, 14 more lines were found to be common and at a distance from the centroid below 0.15 without being the closest. The final selection of 30 (Table 2) excludes two soybean lines (Madoc, McCall) that were the second member selected from their cluster, and includes instead two soybean lines (Mandarin, Maple Arrow) that are the closest and second closest to their respective centroid in Approach 2, but are not well represented in Approach 1.
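The loosened proximity criterion can be sketched in Python under one possible reading of the rule: a line is retained in a given approach if its distance to the centroid exceeds the smallest distance in its cluster by at most 0.15, and it is a candidate if this holds in both approaches. The line names, cluster numbers and distances below are hypothetical, and the function name `near_centroid` is illustrative.

```python
def near_centroid(members, tolerance=0.15):
    """Return the names whose distance to the centroid exceeds the smallest
    distance in their cluster by at most `tolerance`.

    `members` maps a line name to a (cluster, distance) pair.
    """
    smallest = {}
    for cluster, dist in members.values():
        smallest[cluster] = min(dist, smallest.get(cluster, dist))
    return {
        name
        for name, (cluster, dist) in members.items()
        if dist - smallest[cluster] <= tolerance
    }


# Hypothetical results for five lines under each approach.
approach_1 = {
    "Line A": (1, 1.00), "Line B": (1, 1.10), "Line C": (1, 1.60),
    "Line D": (2, 0.80), "Line E": (2, 0.90),
}
approach_2 = {
    "Line A": (1, 0.95), "Line B": (1, 0.90),
    "Line C": (2, 0.70), "Line D": (2, 1.20),
    "Line E": (3, 0.00),
}

# Lines close enough to their centroid in both approaches.
candidates = near_centroid(approach_1) & near_centroid(approach_2)
```

Here Lines A, B and E are retained. Line C is too far from its centroid in the first set of results, and Line D in the second.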

Table 1: A summary of the initial cluster analysis results obtained in Approach 1 (Analysis with the 12 RSA-related traits) and Approach 2 (Analysis with Prin1-Prin6). Only the soybean lines that are the closest to the centroid of the cluster to which they belong are listed. Those that are highlighted in yellow appear in both lists. Complete results are given in Tables B1 and B2 in the appendix.

| Cluster (12 RSA-related traits) | Soybean line (12 RSA-related traits) | Distance to the centroid (12 RSA-related traits) | Cluster (Prin1-Prin6) | Soybean line (Prin1-Prin6) | Distance to the centroid (Prin1-Prin6) |
|---|---|---|---|---|---|
| 1 | 4004P4J | 1.232596 | 1 | 4004P4J | 1.330548 |
| 2 | 4005_24j | 0 | 2 | 4005_24j | 0 |
| 3 | PS44 | 0.969116 | 3 | PS44 | 0.903904 |
| 4 | Jari | 1.385713 | 4 | OAC 7-26C | 1.124655 |
| 5 | Tundra | 0 | 5 | Gretna | 1.020093 |
| 6 | Delta | 0 | 6 | Madoc | 0.844417 |
| 7 | OAC 7-26C | 1.222566 | 7 | OAC Prudence | 1.121011 |
| 8 | Casino | 1.379225 | 8 | OAC Wallace | 0.944889 |
| 9 | 5055_43G | 0 | 9 | 5055_43G | 1.239143 |
| 10 | Costaud | 1.251672 | 10 | Costaud | 1.10524 |
| 11 | Madoc | 1.357312 | 11 | Mandarin | 0.929683 |
| 12 | Maple Ambr | 1.10394 | 12 | Venus | 0 |
| 13 | OAC 8-21C | 0.898254 | 13 | OAC 7-6C | 0 |
| 14 | Woodstock | 0 | 14 | Maple Glen | 1.025438 |
| 15 | S05-T6 | 1.191081 | 15 | Bravor | 1.111696 |
| 16 | Albinos | 1.447583 | 16 | Tundra | 0 |
| 17 | OAC 9-35C | 1.088592 | 17 | SECAN8-1 | 1.026367 |
| 18 | Clinton | 1.169068 | 18 | Woodstock | 0 |
| 19 | Maple Isle | 1.057824 | 19 | Jutra | 0.775305 |
| 20 | OAC Oxford | 1.118181 | 20 | OT94-47 | 0.728379 |
| 21 | S14-P6 | 1.081495 | 21 | Alta | 0.651441 |
| 22 | McCall | 1.27531 | 22 | McCall | 1.090354 |
| 23 | Gentleman | 1.505267 | 23 | 4067P17j | 1.107881 |
| 24 | Flambeau | 1.091571 | 24 | S03-W4 | 0 |
| 25 | OAC 7-6C | 0 | 25 | Roland | 0.975369 |
| 26 | OAC Wallace | 0.954529 | 26 | Maple Belle | 1.015101 |
| 27 | S03-W4 | 0 | 27 | OAC 7-4C | 0.812049 |
| 28 | Venus | 0 | 28 | S14-P6 | 1.002491 |
| 29 | Gaillard | 0.844357 | 29 | Mario | 0.924578 |
| 30 | OAC 7-4C | 0.998249 | 30 | OT05-20 | 1.209572 |
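The comparison of the two lists in Table 1 can be checked with a few lines of Python. The names are copied from Table 1; the variable names are illustrative. The sketch only finds the soybean lines that are closest to a centroid in both approaches and does not cover the later steps of the selection.

```python
# Soybean lines closest to the centroid of each cluster, copied from Table 1.
approach_1 = [
    "4004P4J", "4005_24j", "PS44", "Jari", "Tundra", "Delta", "OAC 7-26C",
    "Casino", "5055_43G", "Costaud", "Madoc", "Maple Ambr", "OAC 8-21C",
    "Woodstock", "S05-T6", "Albinos", "OAC 9-35C", "Clinton", "Maple Isle",
    "OAC Oxford", "S14-P6", "McCall", "Gentleman", "Flambeau", "OAC 7-6C",
    "OAC Wallace", "S03-W4", "Venus", "Gaillard", "OAC 7-4C",
]
approach_2 = [
    "4004P4J", "4005_24j", "PS44", "OAC 7-26C", "Gretna", "Madoc",
    "OAC Prudence", "OAC Wallace", "5055_43G", "Costaud", "Mandarin", "Venus",
    "OAC 7-6C", "Maple Glen", "Bravor", "Tundra", "SECAN8-1", "Woodstock",
    "Jutra", "OT94-47", "Alta", "McCall", "4067P17j", "S03-W4", "Roland",
    "Maple Belle", "OAC 7-4C", "S14-P6", "Mario", "OT05-20",
]

# Lines that are the closest to a centroid in both approaches.
common = sorted(set(approach_1) & set(approach_2))
print(len(common))
print(common)
```

The intersection contains 16 names, in agreement with the count given in the text. It includes Madoc and McCall, and it does not include Mandarin, which is closest to a centroid in Approach 2 only.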

Table 2: Final selection of 30 soybean lines based on their membership of one of the 30 clusters identified in Approach 1 (Analysis with 12 root traits) and Approach 2 (Analysis with Prin1-Prin6) and their distance from the centroid. The 14 soybean lines highlighted in yellow here were also highlighted in yellow in Table 1; see text and Tables B1 and B2 for the selection of the other 16 soybean lines. In particular, Madoc and McCall, which are highlighted in yellow in Table 1, were eventually discarded to keep no more than one member per cluster after merging the two sets of cluster analysis results.

No. Soybean line
1 4004P4J
2 4005_24J
3 5055_43G
4 AC2001
5 Albinos
6 Casino
7 Clinton
8 Costaud
9 Delta
10 Elora
11 Gaillard
12 Gentleman
13 Mandarin
14 Maple Arrow
15 OAC 7-26C
16 OAC 7-4C
17 OAC 7-6C
18 OAC 8-21C
19 OAC 9-22C
20 OAC 9-35C
21 OAC Oxford
22 OAC Wallace
23 PS44
24 Proteus
25 S03-W4
26 S14-P6
27 SECAN7-27
28 Tundra
29 Venus
30 Woodstock

The reported overlap of 93.3% [i.e., (16+14–2)/30=0.933] shows a robust clustering structure in RSA 2-D phenotyping in soybean. Thus, we compiled, in a rational way, a list of 30 representative soybean lines with distinct RSA patterns that provide a good basis for 3-D investigation. Of course, germination tests with available seed banks, as well as preliminary tests with growing media other than vermiculite, may justify later adjustments to that list. It is worth mentioning that OAC Bayfield readily provides a substitute for OAC 7-26C if required, as these soybean lines belong to the same cluster with two members in both approaches (Tables B1 and B2); they are therefore at equal distance from the centroid and either can be randomly picked. A comparison with genomic clustering results falls beyond the scope of a Brief Report, but could be the topic of another, broader study.
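Two points of the paragraph above lend themselves to a quick numerical check in Python: the overlap percentage, and the fact that the two members of a two-member cluster are equidistant from its centroid, because the centroid of two points is their midpoint. The coordinates below are hypothetical and stand for two soybean lines in a 3-D trait space.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Overlap between the two sets of results, as computed in the text.
overlap = (16 + 14 - 2) / 30

# Hypothetical coordinates of the two members of a two-member cluster.
line_1 = (1.0, 4.0, -2.0)
line_2 = (3.0, 0.0, 5.0)

# The centroid of two points is their midpoint.
centre = tuple((a + b) / 2 for a, b in zip(line_1, line_2))
d1 = euclidean(line_1, centre)
d2 = euclidean(line_2, centre)
```

Both distances equal half the distance between the two members, whatever their coordinates, which is why either member can serve as the representative of such a cluster.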

Conclusion

The selected 30 soybean lines will be used in RSA phenotyping with state-of-the-art equipment, followed by sophisticated 3-D data and image analyses. Selecting representative lines that showcase the diversity in root system architecture and possess biological relevance is crucial. The soybean lines in Table 2 are objective starting points for further investigation into the functionality of specific RSA-related traits on plant performance and adaptation. Our cluster analysis results provide insight into phenotypic variation within the germplasm pool. Understanding root system diversity is crucial for breeders aiming to progress through selection. Advanced 3-D phenotypic analyses, e.g., based on computed tomography scanning, are expected to deepen our understanding of the RSA and its impact on plant productivity and stress tolerance.

References

  1. Mochida, Keiichi and Kazuo Shinozaki. "Genomics and bioinformatics resources for crop improvement." Plant Cell Physiol 51 (2010): 497-523.
  2. Masclaux-Daubresse, Céline, Gilles Clément, Pauline Anne and Jean-Marc Routaboul, et al. "Stitching together the multiple dimensions of autophagy using metabolomics and transcriptomics reveals impacts on metabolism, development, and plant responses to the environment in Arabidopsis." Plant Cell 26 (2014): 1857-1877.
  3. Hirai, Masami Yokota, Mitsuru Yano, Dayan B. Goodenowe and Shigehiko Kanaya, et al. "Integration of transcriptomics and metabolomics for understanding of global responses to nutritional stresses in Arabidopsis thaliana." Proc Natl Acad Sci 101 (2004): 10205-10210.
  4. Usadel, Björn, Rainer Schwacke, Axel Nagel and Birgit Kersten. "GabiPD–the GABI primary database integrates plant proteomic data with gene-centric information." Front Plant Sci 3 (2012): 154.
  5. Tohge, Takayuki, Leonardo Perez de Souza and Alisdair R. Fernie. "Genome-enabled plant metabolomics." J Chromatogr B 966 (2014): 7-20.
  6. Palmer, Lachlan James, Daniel Anthony Dias, Berin Boughton and Ute Roessner, et al. "Metabolite profiling of wheat (T. aestivum L.) phloem exudate." Plant Methods 10 (2014): 1-9.
  7. Lisec, Jan, Nicolas Schauer, Joachim Kopka and Lothar Willmitzer, et al. "Gas chromatography mass spectrometry–based metabolite profiling in plants." Nat Protoc 1 (2006): 387-396.
  8. Furbank, Robert T. and Mark Tester. "Phenomics-technologies to relieve the phenotyping bottleneck." Trends Plant Sci 16 (2011): 635-644.
  9. Sriyudthsak, Kansuporn, Michio Iwata, Masami Yokota Hirai and Fumihide Shiraishi. "PENDISC: A simple method for constructing a mathematical model from time-series data of metabolite concentrations." Bull Math Biol 76 (2014): 1333-1351.
  10. Bylesjö, Max, Daniel Eriksson, Miyako Kusano and Thomas Moritz, et al. "Data integration in plant biology: The O2PLS method for combined modeling of transcript and metabolite data." Plant J 52 (2007): 1181-1191.
  11. Yu, Yong-Jie, Qiao-Ling Xia, Sheng Wang and Bing Wang, et al. "Chemometric strategy for automatic chromatographic peak detection and background drift correction in chromatographic data." J Chromatogr A 1359 (2014): 262-270.
  12. Geigenberger, Peter, Axel Tiessen and Jörg Meurer. "Use of non-aqueous fractionation and metabolomics to study chloroplast function in Arabidopsis." Chloroplast Research in Arabidopsis: Methods and Protocols, Volume II (2011): 135-160.
  13. Alexandersson, Erik, Dan Jacobson, Melané A. Vivier and Wolfram Weckwerth, et al. "Field-omics-understanding large-scale molecular data from field crops." Front Plant Sci 5 (2014): 286.
  14. Morrison, D. F. "Matrix algebra." Multivariate statistical methods, 3rd edition. McGraw-Hill, New York, vii (1990): 36-78.
  15. Seck, Waldiodio, Davoud Torkamaneh and François Belzile. "Comprehensive genome-wide association analysis reveals the genetic basis of root system architecture in soybean." Front Plant Sci 11 (2020): 590740.
  16. De Smet, Ive, Steffen Lau, Ulrike Mayer and Gerd Jürgens. "Embryogenesis-the humble beginnings of plant life." Plant J 61 (2010): 959-970.
  17. Lucas, Mikaël, Kim Kenobi, Daniel Von Wangenheim and Ute Voß, et al. "Lateral root morphogenesis is dependent on the mechanical properties of the overlaying tissues." Proc Natl Acad Sci 110 (2013): 5229-5234.
  18. Hodge, Angela, Graziella Berta, Claude Doussan and Francisco Merchan, et al. "Plant root growth, architecture and function." (2009): 153-187.
