All published articles of this journal are available on ScienceDirect.
Genotypic Contrasting of Protein and Flavonoid Contributes to Differential Responses of Targeted Metabolites in Soybean Seeds
Abstract
Introduction/Objective
Soybean is a major source of various nutrients. Increasing demand for soybeans has created considerable impetus for exploring the nutritional quality of soybeans. We aimed to collect soybean varieties rich in nutrients.
Materials and Methods
Metabolite analysis was carried out for seed compositions, including protein, phenolics, and flavonoids, along with gene expression of protein and phenolic metabolism-related enzymes in 10 soybean accessions collected from different geographical regions.
Results
Total protein content ranged from 29.7% to 35.7%, depending on the soybean germplasm accession. Among the accessions, Vang Ha Giang (VHG) exhibited relatively high protein content, while Cuc Vo Nhai (CVN) had comparatively low protein content. Further analysis of seed compounds indicated that phenolic compounds were higher in cultivars Dau Tuong Den (DTD) and CVN, with a total phenolic content of 37.7 µg g-1 and total flavonoid of 2.1 mg g-1. These results were reinforced by expression analysis of the candidate genes β-conglycinin (7S) and glycinin (11S), which are involved in protein storage, and phenylalanine ammonia-lyase 1 (GmPAL1) and chalcone synthase 8 (GmCHS8), which are related to phenolic and flavonoid synthesis; expression levels showed a similar pattern. Protein content was correlated with seed weight but not with seed color, even though significant variations were found among soybean genotypes, while flavonoid content was affected by seed coat color. Furthermore, the negative correlation of protein with flavonoids demonstrated intricate relationships among seed components.
Conclusion
Variation in seed protein and flavonoid content is largely determined by genotype in landrace and breeding cultivar selection, and genotype variants are associated with geographical regions. Our study describes the intricate relationships among seed nutritional components and offers insight into the alteration of soybean quality.
1. INTRODUCTION
Soybean (Glycine max L.) is a primary crop in agriculture and provides a major nutrient for humans [1]. It has been well known as a crop with high amounts of proteins, amino acids, and oil, as well as various bioactive compounds [1-3]. The different bioactive compounds in soybean seeds are associated with human health, including antioxidant activities, cholesterol-lowering effects, and anti-obesity activities [1, 4]. Black soybean contains phenolics, isoflavones, and anthocyanins. These compositions have been known to contribute to various health benefits [4-6]. Therefore, the composition of seeds has a significant impact on the quality of soybean products.
Soybeans have been widely used in food because of their richness in nutrients (protein, oil, fatty acid, soluble sugar, and isoflavone). Nutritional variation is primarily influenced by factors such as the genetic background of a cultivar, location, climate, and major group [6-9]. These factors affect the distribution and shifts of targeted nutrients within soybean seeds by modifying the phenylpropanoid pathway [6, 10], flavonoid synthesis [6, 11], and amino acid metabolism [12]. Evidence showed that genetic differences among cultivars contributed to 83% of the variation in protein content [12]. Given this diversity, cultivars differ in morphological characteristics, nutritional composition, and quality. Recent research showed that the content of antioxidants depends on the characteristics of soybean varieties and ecological growing regions [13-15]. Soybeans have been classified into two distinct species, Glycine max and Glycine soja, based on their genetics and habitats [16]. Glycine max is the main cultivated soybean worldwide, providing higher quality proteins and isoflavones than Glycine soja (a wild species), and it has been widely used in the food industry [2].
Screening a substantial collection of germplasms is essential for identifying genotypes with elevated levels of protein and bioactive compounds for genetic improvement programs. Nevertheless, comprehensive information regarding the profiles and content of protein and bioactive compounds in the seeds of a diverse set of Vietnamese soybean germplasms originating from various ecoregions remains limited. To date, we have identified only two studies that outlined and assessed variations of protein content, one in 321 soybean (Glycine max L.) varieties [12] and one in 218 Chinese soybean strains and 115 wild soybean samples from the USA [7]. As previously mentioned, other research has similarly emphasized the influence of genotype and environmental conditions on flavonoid content in wild black soybeans and cultivated black soybean genotypes [6].
Several studies have documented fluctuations in biochemical or metabolic constituents and how the connections between them contribute to the yield and quality of soybean seeds. Various qualitative traits of soybean seeds indicate that seed coat color could be one of the key factors in soybean seed quality [17]. The level of primary and secondary metabolites is influenced by seed size, with larger seeds containing higher amounts of starch, sugar, protein, and lipids than smaller seeds [18-20]. For example, soybean seed size showed a positive association with oleic acid levels and a negative correlation with linoleic acid levels [18, 21]. On the other hand, seed coat coloration influenced isoflavone, fatty acid, and flavonoid content in soybean [6, 22, 23]. However, seed coat pigmentation did not influence the inverse correlation of protein and lipid concentrations [24]. The levels of metabolites in soybean seeds have been reported to be highly interdependent. Among these, there is a complex inverse relationship between protein, oil, and sugar in soybean seeds. Zhang et al. (2018) [12] found a strong inverse relationship between oil and protein, indicating that as one increases, the other decreases, which contrasts with sucrose levels targeted for breeding high-yield soybean seeds. Lee and Son (1993) [25] found a positive correlation between oil and protein content among 1,087 colored soybean accessions. Interestingly, an inverse correlation of protein and oil content was observed in both whole soybean seeds and seed coats [26]. Additionally, the levels of protein were negatively correlated with stachyose or raffinose in 43 soybean progenies [27].
Nevertheless, little is known about how target metabolites, including protein, phenolics, and flavonoids, respond to seed coat color and other physiological properties, such as relative seed diameter, across diverse soybean germplasm accessions. In this study, soybean germplasm accessions with a range of seed coat colors were collected locally to assess how seed coat color and seed size affect the inter-relationships and variations of prioritized metabolites. Hence, the current study aimed to test the hypothesis that the targeted metabolites (protein, phenolics, and flavonoids) are affected in different ways by seed coat color and seed dry weight, and to explore how variation in these compounds influences the correlations among protein, phenolics, and flavonoids in diverse soybean germplasm accessions. Additionally, geographical maps were used to identify hotspot areas where cultivars showed elevated levels of seed nutritional components.
2. MATERIALS AND METHODS
2.1. Plant Material and Field Trials
A population of 10 soybean germplasm accessions of plant introductions (PIs), including Vietnamese landraces and cultivars, was used in this study (Table S1). The PIs were randomly collected from the Plant Resource Center (http://prc.org.vn) and the mountainous provinces in northern Vietnam.
The field experiments were performed at the experimental grounds of the Thai Nguyen University of Agriculture and Forestry (21°33′51′′ N, 105°52′46′′ E), located in Thai Nguyen City, Vietnam. The sandy, loamy soil had a pH of 6.5–7, 0.6% organic matter (OM), 0.06% total N, and an electrical conductivity (EC) of 2.5 dS m−1. Field trials were conducted from February to June 2022. Ridges with a height of 30 cm and a base width of 60 cm were created using an FJ601 tiller (Fuji, Japan) coupled to an RA3 rotary. The topsoil horizon, characterized by organic matter accumulation, exceeded 30 cm in thickness in the plots. The experimental design included two areas featuring flat and ridged plots arranged in a completely randomized block design. In each area, three plots were designated as replicates. Seeds were planted at 20 cm intervals, with 30 plants cultivated per plot. For three days following sowing, water was sprayed daily to encourage seed germination. Commercial nitrogen-phosphorus-potassium (NPK, 10 N: 5 P2O5: 5 K2O) fertilizer was applied to the test field twice during the experiment, at the 3-leaf and 7-leaf stages of plant development.
2.2. Phenotypic Evaluation
Phenotypic data were collected for eight morphological and physiological traits, including Time of Cultivation (TC), plant height at physiological maturity (PH), Leaf Area Index (LAI), number of pods per plant (NPP), Seed Yield (SY), and 100-Seed Weight (SW). In each plot, the traits of five randomly selected plants were recorded. The data were summarized on a plot basis.
The flower and seed phenotypic experiments were performed using a Light-Stemi 508 microscope (Carl Zeiss, Germany). Images of the flower and seed were captured using a CCD camera at 4 × magnification. The number and length of the lateral roots were measured 1 cm from the primary root using ImageJ software [28].
2.3. Total Protein in Seeds Analysis
Protein content in seeds was estimated with the Bradford reagent kit B6916 (https://www.sigmaaldrich.com/), using bovine serum albumin (BSA) as the standard protein. Proteins were extracted with 50 mM potassium phosphate buffer (pH 7.5). The absorbance of the mixture was measured using the Synergy H1 Hybrid Reader (Biotek) at 725 nm. The total protein content was calculated from the BSA standard curve and expressed as BSA equivalents per gram of fresh-weight extract. The protein composition was normalized, allowing for the quantification of protein proportion.
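As an illustration of how a standard curve converts an absorbance reading into a protein concentration, the following minimal Python sketch fits a straight line to standard readings by least squares and inverts it for a sample. The function names and any numbers passed to them are hypothetical and serve only as an example; they are not data from this study, and a linear response over the range of the standards is assumed.

```python
def fit_standard_curve(concentrations, absorbances):
    """Least-squares line: absorbance = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(absorbances) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, absorbances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def concentration_from_absorbance(absorbance, slope, intercept):
    """Invert the fitted line to estimate a sample in standard equivalents."""
    return (absorbance - intercept) / slope
```

The same two steps would apply to the gallic acid, chlorogenic acid, and quercetin curves described in the following section, with only the standard compound changing.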
2.4. Total Phenolics, Hydroxycinnamic Acid, and Flavonoids Analysis
Total phenolic content was assessed by using the Folin-Ciocalteu method [29]. In brief, 200 mg of fresh leaves were extracted by using 80% methanol. A colorimetric reaction of the mixture was conducted using 2N Folin-Ciocalteu reagent, followed by the addition of 20% Na2CO3 and allowed to react for 60 minutes. Absorbance at 725 nm was measured with a Synergy H1 Hybrid Reader (Biotek, South Korea). The total phenolics were measured using the calibration curve based on gallic acid equivalents per gram of fresh weight extract.
Total hydroxycinnamic acids (THA) content in seeds was determined according to the method outlined by Štefan et al. (2014) [30]. After reaction with Arnow's reagent (NaNO2–Na2MoO4, 1:1), HCl, and NaOH, the absorbance at 505 nm was recorded, and THA content was determined using a chlorogenic acid standard curve.
The total flavonoid content in seeds was determined by the aluminum chloride colorimetric method [31], with minor modifications. In brief, 200 mg of dry seeds were extracted using 80% methanol. The crude extract was diluted with 0.5 mL of distilled water and incubated with the addition of 75 μL of 5% NaNO2 and 0.3 mL of 10% AlCl3. After incubation for 5 min, 1 M NaOH (0.5 mL) was added to the mixture, and absorbance was recorded at 510 nm. The total flavonoid content was estimated by using a calibration curve based on quercetin equivalents per gram of dry seed weight extract.
2.5. Phenolics and Flavonoids Quantification by RP-HPLC
The content of individual phenolics and flavonoids in the seeds was quantified following the procedure detailed by Das et al. (2018) [32]. Briefly, 100 mg of soybean powder was extracted in 80% methanol. The collected supernatant was filtered using a 0.2 µm PVDF syringe filter. Soluble phenolics and flavonoids were analyzed by a system coupled with a UV-VIS detector (SPD-20A, Shimadzu, Kyoto, Japan). Chromatographic separation was achieved by a Spherisorb® ODS2 column with dimensions of 4.60 × 250 mm (Waters, Milford, MA, USA) at 30°C. Orthophosphoric acid (0.1%) in water (v/v) (eluent A) and methanol (v/v) (eluent B) were employed as the mobile phases. The retention time and regression equation of the standards were used for the determination of the soluble phenolic and flavonoid content.
2.6. RNA Extraction and Quantitative Real-time PCR Analysis
Total RNA was isolated from 100 mg of seeds using a cell lysis buffer, following a previously established DNA-free RNA isolation method [33]. The DNA-free RNA was used as the template for cDNA synthesis with the GoScript Reverse Transcription System in accordance with the manufacturer's instructions (Takara, DALIAN, Japan). Quantitative reverse transcription-PCR (qRT-PCR) was conducted on a LightCycler 96 real-time PCR system using 2X SYBR Green qPCR Master Mix (Takara, DALIAN, Japan). Each 25 µL reaction contained 12.5 µL of 2X SYBR Green Master Mix, 1 µL of forward primer (10 pmol), 1 µL of reverse primer (10 pmol), 1 µL of cDNA, and 9.5 µL of DNase-free water. The cycling program consisted of an initial denaturation at 95 °C for 5 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at the primer Tm for 30 s, and extension at 72 °C for 30 s, and a final extension at 72 °C for 5 min. qRT-PCR was conducted in duplicate for each of three separate samples; the gene-specific primers and their Tm values are listed in Supplementary Table S4. Relative gene expression levels were inferred from the threshold cycle (Ct), with the actin gene as the internal control. Relative transcript levels were quantified using the 2^−ΔΔCT method [34].
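The 2^−ΔΔCT calculation can be sketched in a few lines of Python. The sketch assumes roughly 100% amplification efficiency for both the target and the reference gene, which is the standard assumption of the method; the function and argument names are hypothetical, and the "calibrator" stands for whichever sample is chosen as the baseline (for example, a reference cultivar).

```python
def relative_expression(ct_target_sample, ct_reference_sample,
                        ct_target_calibrator, ct_reference_calibrator):
    """Fold change of a target gene by the 2^-ddCt method.

    The reference gene (actin in this study) normalizes each sample;
    the calibrator sample defines a fold change of 1.
    """
    delta_ct_sample = ct_target_sample - ct_reference_sample
    delta_ct_calibrator = ct_target_calibrator - ct_reference_calibrator
    delta_delta_ct = delta_ct_sample - delta_ct_calibrator
    return 2.0 ** (-delta_delta_ct)
```

For instance, a sample whose normalized Ct is two cycles lower than that of the calibrator yields a fold change of 4, because each cycle corresponds to a doubling of template.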
2.7. Metabolic Analysis
To further investigate the functional interpretations and associations of the identified metabolites in soybean seeds, Principal Component Analysis (PCA), one-way ANOVA visualization, Random Forest classification, and correlation analyses among biochemical substances such as protein, phenolics, and flavonoids were performed using MetaboAnalyst 5.0 (http://www.metaboanalyst.ca).
2.8. Geographical Distribution Mapping
Geographical distribution maps of soybean seed nutrition were created in QGIS 10.0 (https://www.qgis.org/en/site/) with ordinary interpolation [35]. QGIS is commonly used in geographic information system (GIS) applications because it allows easy map creation, geographic data compilation, and spatial data management. To create the maps in this study, we applied interpolation to the mean seed nutrition data across locations and cultivars, incorporating geographical factors (longitude and latitude). This method assigns weights to known sample points to predict the values at unsampled points.
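The idea of weighting known sample points to predict an unsampled location can be illustrated with a short Python sketch. The text does not state which weighting scheme was applied, so inverse distance weighting (IDW) is used here purely as a simple stand-in and should not be read as the method of this study. The function name, the `power` parameter, and the use of planar distance on longitude and latitude are all assumptions made for the example.

```python
import math


def idw_predict(known_points, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` = (lon, lat).

    `known_points` is a list of (lon, lat, value) tuples. Distances are
    planar distances in degrees, a simplification for small regions.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for lon, lat, value in known_points:
        distance = math.hypot(lon - query[0], lat - query[1])
        if distance == 0.0:
            return value  # query coincides with a sampled point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total
```

Closer sample points receive larger weights, so a location midway between two sampled sites is assigned the mean of their values.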
2.9. Statistical Analysis
The current study employed a fully randomized design with three replicates for every treatment and collection date. Analysis of variance (ANOVA) was conducted on the full data set, and Duncan’s multiple range test was applied to compare means for every sampling time. All statistical analyses were conducted using SAS 9.1 (SAS Institute, Inc., 2002-2003), with differences regarded as significant at p < 0.05.
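For readers unfamiliar with the test, the F statistic behind a one-way ANOVA can be computed directly from group means, as in the minimal Python sketch below. This is an illustration only; it does not reproduce the SAS procedure or Duncan's multiple range test, and the function name is hypothetical.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of replicate groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of replicates around their own group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

A large F indicates that differences among cultivar means are large relative to the scatter among replicates; the p-value then follows from the F distribution with the returned degrees of freedom.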
3. RESULTS
3.1. The Response of Protein to Seed Weight and Seed Coat Color in Various Soybean Cultivars
The morphology of the soybean varieties is shown in Fig. (1); the phenotype of plants at the maturity and harvest stages clearly differed in plant height, plant architecture, degree of branching, and flower and seed colors (Table 1). The protein content of soybean seeds is crucial for assessing their nutritional value and quality. Among the soybean cultivars, the protein content varied from 29.7% to 35.2% (Fig. 2). In this study, protein, one of the targeted metabolites, was found to be affected by seed weight. This response was confirmed by the correlation scoring plot (Fig. S1), which showed that protein content was higher in larger seeds than in smaller ones, a finding also reported in a previous study [23]. Seed size (measured as the dry weight of 100 seeds) significantly affected protein content, making it an important plant growth characteristic.
Cultivar | Cultivation Time (days) | Plant Height (cm) | Leaf Area (cm²) | Fruit per Plant | Seed Weight (g) | Color |
---|---|---|---|---|---|---|
DT84 | 95.0±0.94b | 51.0 ± 1.66cd | 7.2 x 4.5 | 83.5 ± 1.66b | 0.223 ± 0.011a | Yellow |
DTD | 92.0±0.47bc | 74.0 ± 0.95b | 6.8 x 4.3 | 93.0 ± 1.61b | 0.100 ± 0.007cd | Black |
VMK | 93.5±0.23b | 48.5 ± 1.06d | 6.8 x 4.8 | 69.0 ± 7.59c | 0.209 ± 0.008ab | Yellowish |
VCB | 103.0±0.47ab | 89.0 ± 6.01a | 6.4 x 4.3 | 42.0 ± 6.17d | 0.178 ± 0.002b | Yellowish |
VHG | 105.0±0.48a | 84.0 ± 1.90ab | 6.2 x 5.1 | 65.0 ± 5.22c | 0.118 ± 0.004c | Yellow |
VQN | 94.5±0.24b | 73.5 ± 4.98b | 7.3 x 6.6 | 96.0 ± 0.24b | 0.109 ± 0.006c | Yellow |
CHBD1 | 101.0±1.02ab | 81.3 ± 3.54ab | 9.3 x 7.6 | 152.0 ± 7.12a | 0.087 ± 0.005d | Yellowish |
CHLLS | 92.0±0.47bc | 66.0 ± 4.24c | 4.8 x 3.8 | 88.0 ± 1.90b | 0.118 ± 0.008c | Yellow |
CVN | 91.0±0.48c | 77.5 ± 4.98b | 4.8 x 3.3 | 39.0 ± 2.37d | 0.097 ± 0.002cd | Yellow |
DTSM | 92.5±0.71bc | 66.5 ± 4.51c | 7.2 x 5.9 | 68.0 ± 2.37c | 0.101 ± 0.004cd | Yellow |

Fig. (1). Phenotypic variation of soybean cultivars. (A) Phenotype of plants at the maturity and harvest stages. (B) Leaves and flowers collected at the maturity stage. Leaves were taken at leaf position 5 (No. 5), counted from the base of the plant to the top. Flowers were collected 2 days after opening. Flowers are predominantly violet in all landrace cultivars and in the DTD cultivar, whereas the DT84 cultivar has white flowers. The photo was taken under a Stemi 508 microscope (Carl Zeiss, Germany) at 0.5X; scale bar, 2.0 cm. (C) Seed size and color of the different cultivars. Seed sizes are small, medium, and large. Almost all cultivars have light-yellow to dark-yellow seeds, except for the black-seeded DTD accession. The photograph was taken under a Stemi 508 microscope (Carl Zeiss, Germany) at 1X; scale bar, 1 cm.
Cultivars were categorized into three groups based on total protein content: (i) Group 1 had a total protein < 30% and included Cuc Vo Nhai (CVN); (ii) Group 2 had a total protein ranging from ≥ 30% to < 35% and included seven soybean cultivars, DT84, Dau Tuong Den (DTD), Vang Muong Khuong (VMK), Vang Cao Bang (VCB), Vang Quang Ninh (VQN), Cuc Huu Lung Lang Son (CHLLS), and Dau Tuong Song Ma (DTSM); and (iii) Group 3 had a total protein ≥ 35% and included two cultivars, Vang Ha Giang (VHG) and Cuc Ha Bac Dang 1 (CHBD1). Two cultivars, CHBD1 and VHG, exhibited the highest protein storage. Protein storage in soybean seed (Glycine max (L.) Merr.) is attributed to two major storage proteins, a glycosylated 7S protein (conglycinin) and a non-glycosylated 11S protein (glycinin) [36, 37]. Depending on the genotype, the corresponding genes were highly expressed during soybean seed maturation. β-conglycinin (7S) and glycinin (11S), which are related to protein storage in soybean seeds (Meinke et al. 1981), exhibited 4.8-fold and 3.2-fold higher expression, respectively, in VHG compared to DT84 and the other soybean cultivars (Fig. 2B and C). The soybean cultivars featured in this study presented a wide range of protein content, and these landraces are rich sources of protein. Notably, some of the landrace cultivars, which are mainly consumed as food, have the potential for high seed protein content.
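The three protein groups defined by the 30% and 35% thresholds can be expressed as a small classification rule. The Python sketch below is a hypothetical helper written only to make the boundary handling explicit (a value of exactly 30% falls in Group 2 and a value of exactly 35% in Group 3); it is not part of the analysis pipeline of this study.

```python
def protein_group(total_protein_percent):
    """Assign a cultivar to Group 1, 2, or 3 from its total protein (%)."""
    if total_protein_percent < 30.0:
        return 1  # Group 1: total protein < 30%
    if total_protein_percent < 35.0:
        return 2  # Group 2: 30% <= total protein < 35%
    return 3      # Group 3: total protein >= 35%
```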

Fig. (2). Variation in protein (P) content and seed weight (W) in seeds of soybean landrace and cultivar accessions. Soybean seed weight was estimated from 100 seeds.
(A) The different seed colors, derived from yellow, in all soybean landraces and cultivars. Relative expression of genes regulating protein, phenolic, and flavonoid storage in soybean seeds.
(B) Gmconglycinin (glycosylated 7S protein) and (C) Gmglycinin (non-glycosylated 11S protein) are the two major storage proteins in soybean seeds. Scale bars, 1 cm. The data represent mean ± SE (n = 3). Asterisks indicate significant differences between the different soil moisture percentages as determined by Duncan’s t-test; ∗∗∗p < 0.001.
3.2. Total Phenolic and Flavonoid Concentration Affected Seed Weight and Seed Colors
The bioactive compounds of soybean seed mainly consist of total hydroxycinnamic acid (THA), phenolic acid (PA), and flavonoid (FA). Among these, THA is a precursor in the biosynthesis of phenolics and flavonoids (Fig. 3A). Large variations in the content of THA, PA, and FA were observed, ranging from 2.0–13.0 µg g-1, 11.6–22.0 µg g-1, and 0.3–2.0 µg g-1, respectively (Fig. 3B-D). The relationship among phenolic and flavonoid distribution, 100-seed dry weight, and seed pigmentation is displayed in Fig. (3). The majority of yellow soybean accessions had a higher 100-seed dry mass than black soybean accessions. However, the dry weight of 100 seeds from soybean germplasm accessions significantly influenced seed coat color. Two cultivars, DTD and CVN, were among those with the lowest seed weight, but they showed higher total flavonoid content than the other cultivars (Fig. 3B-D). Intriguingly, seed coat colors tended to be highly responsive to targeted metabolites when comparing black and yellow soybean seeds. However, the comparison of soybean seeds with similar seed coat colors indicated lower variation in the content of targeted metabolites, although the significance varied slightly among different shades of seed coat color (Fig. 2 and 3B-D). Furthermore, seed coat color appeared to have a greater impact on the variation of bioactive compounds in yellow soybean seeds than in black soybean seeds. The accumulation of metabolite compounds was found to be related to their marker genes, which are associated with most seed storage compounds. Two candidate genes, phenylalanine ammonia-lyase 1 (GmPAL1) and chalcone synthase 8 (GmCHS8), putatively encoding enzymes for phenolic and flavonoid synthesis, have also been identified in soybean seeds [10, 38].
Indeed, GmPAL1 and GmCHS8 were upregulated in cultivars DTD (2.3-fold and 14.2-fold, respectively) and CVN (1.5-fold and 5.1-fold, respectively) (Fig. 3E-F), and these genes are important for phenolic and flavonoid accumulation in soybean seeds.

Fig. (3). Seed quality characteristics in seeds of soybean landrace and cultivar accessions.
(A) The metabolic pathway for the synthesis of phenolics and flavonoids, derived from the phenylpropanoid pathway. The targeted metabolites are indicated as cinnamate (hydroxycinnamic acid, THA; blue), phenolic acid (PA; green), and flavonoid (FA; light orange).
(B) Hydroxycinnamic acid (THA).
(C) Total phenolic acid.
(D) Total flavonoid. Relative expression of genes regulating phenolic and flavonoid storage in soybean seeds.
(E) Phenylalanine ammonia-lyase 1 (GmPAL1) and (F) chalcone synthase 8 (GmCHS8), putatively encoding enzymes for phenolic and flavonoid synthesis identified in soybean seeds. The data represent mean ± SE (n = 3).
Data with different letters are significantly different at p < 0.05 according to Duncan’s multiple range test.
3.3. Cluster Targeted Metabolites Response to Seed Weight and Seed Coat Colors
To understand the contributions of individual and overall targeted metabolites (protein, THA, phenolic, and flavonoid) to seed weight and seed coat color, we conducted a cluster analysis revealing the co-variation of targeted metabolites, as shown in Fig. (4). The heatmap indicated that seed coat color did not affect the co-fluctuation of targeted metabolites, as the seed coat color categories were largely uniformly distributed across the arrangement. Nevertheless, black soybean seeds clustered primarily around total phenolic content in both analyses (seed weight and seed coat color), while the remaining black soybean seeds were distributed evenly across various cultivars. Meanwhile, the metabolic clustering was divided into two distinct clusters for each factor analyzed (seed weight or seed coat color). In the first cluster, total protein was positively correlated with seed weight, while phenolic compounds were negatively correlated with seed weight. In the second cluster, total flavonoid, THA, and total phenolic were positively correlated with seed coat color, whereas protein showed the opposite trend. Interestingly, total phenolic was closely linked in both clusters (Fig. 4A-B). Furthermore, the ANOVA results indicated that protein and flavonoid responded inversely to seed weight and seed color (Fig. 4C-D).

Fig. (4). Clustering of seed weight and seed coat color effects on targeted metabolites in seeds of soybean accessions. (A) Seed weight effect on targeted metabolites. (B) Seed coat color effect on targeted metabolites. The targeted metabolites are characterized in Table 2. (C) and (D) ANOVA of the significant or non-significant effects of seed weight and seed coat color on total protein, THA, total phenolic acid, and total flavonoid. One-way ANOVA and post-hoc tests; ANOVA p-value (FDR) cutoff: 0.05. Red and yellow spots indicate a positive correlation, while grey spots indicate a negative correlation.
3.4. Targeted Metabolite Association with Seed Coat Colors
A more detailed analysis of the distribution of the targeted metabolites (protein, THA, phenolic, and flavonoid) among seed nutritional compounds was based almost entirely on the biochemical profile. Targeted metabolite responses from yellow to black soybean seeds were dominated by phenolic compounds. ANOVA indicated substantial differences (p < 0.001) in the contribution to yellow-to-black seed coat color, in the order total flavonoid, protein, THA, and PA (Fig. 5). Among these, total flavonoid content was strongly associated with yellow to black seed colors, whereas total phenolic content showed a weaker association with seed coat color. Moreover, seed coat color was relatively unaffected by protein content, and total flavonoid content varied between yellow and black seed coat colors. In the one-way ANOVA, FA, a secondary metabolite, showed a significant difference, whereas protein, THA, and PA showed no significant differences between accessions from different ecoregions and cultivars (Fig. 5A). This result indicated that FA and its related metabolites varied widely among seed coat colors in different cultivars. The coefficient score value (CV) indicated that the greatest contribution came from FA (> 0.07), followed by THA (> 0.05) and protein (0.02), whereas PA (< 0.01) provided the least contribution (Fig. 5B). PCA showed that the first two components (PC1 and PC2) explained 83.6% of the total variation (Fig. 5C). The soybean cultivars were separated into two groups based on variation in seed nutritional content. Random Forest classification was performed to examine more closely the metabolites that drive the dissimilarity among soybean cultivars. The highest error ranking was found for the cluster of VMK, VQN, CHBD1, VHG, VCB, and DTSM, followed by DT84 and CHLLS, and the lowest for DTD and CVN (Fig. 5D and Table S2).

Fig. (5). Cultivar selection based on the alteration of seed nutrients in soybean landraces and cultivars. (A) ANOVA ranking of bioactive contents on the basis of the metabolic profiles of soybean landraces and cultivars. Significant compositions are coded in red and non-significant compositions in green. (B) Coefficient score visualization representing the relative nutrient compositions in various soybean cultivars. Correlation levels are shown on a color scale from blue (low) to red (high). (C) Principal component analysis (PCA) of seed quality in soybean landraces and cultivars. Each point in the biplot represents a single cultivar; cultivars are color-coded. (D) Random Forest classification model constructed using 10 cultivars to predict the rank of cultivars based on seed nutrient concentration. Post-hoc analysis: Fisher’s LSD; ANOVA p-value (FDR) cutoff: 0.05.
Pearson correlation was used to improve our understanding of the relationships among nutritional components in soybean seeds [8]. Positive correlations were detected between THA and PA or FA, while PA and FA, in most cases, correlated with the THA content. When THA was correlated with total PA and FA, distinct relationships were observed relative to those between PA and FA (Fig. S1). Excluding non-significant ones, most of the correlation coefficients involving the adjusted THA were positive. Even more intriguingly, most of the PA- and FA-corrected proteins showed a negative relationship.
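The Pearson coefficient referred to above is the covariance of two traits divided by the product of their standard deviations. A minimal Python sketch follows; the function name is hypothetical, and the inputs are assumed to be paired measurements of equal length with non-zero variance.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired measurements."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 indicates that two components rise together across accessions, a value near −1 indicates that one falls as the other rises, and a value near 0 indicates no linear association.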
3.5. Metabolites Correlation Response to Seed Weights and Seed Coat Colors
To understand the unique contribution of each metabolite to seed weight or seed coat color, we conducted a correlation coefficient test based on the target metabolites. One of our objectives was to determine the interrelationship between seed quality and the phenotypes that enhance production. In this research, every seed composition feature showed a different distribution. The seed nutrient components were found to be related to phenotypic performance. This study revealed a negative correlation between protein and FA concentration, with a correlation coefficient of r = -0.36* (p < 0.05) (Table S3). Nonetheless, there is limited knowledge regarding the magnitude of this relationship. Correlations may exist between seed phenotypic characteristics (seed size, seed weight, and seed color) and soybean seed nutritional characteristics.
Correlation analysis is a straightforward method for assessing the strength of the relationships among recorded metabolite levels [2, 39]. Interpreting correlations is considered an initial step in metabolite data analysis, as the correlations arising from internal fluctuations within metabolic systems provide further insights into the physiological condition of the seed [18]. Seed weight appeared to have a lesser impact on the sensitivity of each targeted metabolite, as there was a consistent trend of correlation responses between the metabolites, irrespective of the 100-seed dry weight, even though the significance fluctuated somewhat with weight (Table S3). Seed weight (100-seed dry weight) was positively correlated with protein content (r = 0.25*) but negatively correlated with flavonoid content in soybean seed (r = -0.26*). Protein was negatively correlated with total phenolic and total flavonoid (r = -0.30* and r = -0.36*, p < 0.05) but positively correlated with THA, irrespective of seed weight. However, the present results indicated that soybeans with a long growing period have considerably higher protein content (r = 0.96***, p < 0.05), but not higher flavonoid accumulation (r = -0.23*, p < 0.05) (Table S3 and Fig. S1). It is essential to highlight the correlations between seed nutritional composition and the genotypic (landrace and cultivar) or seed phenotypic traits. This suggests that soybean breeders could concentrate on developing modern cultivars for specific traits of interest, a concept supported by previous studies [6, 12].
Factors | Total Protein | THA | Total Phenolic | Total Flavonoid |
---|---|---|---|---|
Seed color (C) | NS | NS | * | *** |
Seed weight (W) | *** | * | NS | NS |
C X W | * | NS | NS | * |
The fluctuation in seed metabolites and the variation in soybean seed nutritional composition are strongly interrelated. Numerous studies have shown that seed phenotype, and seed coat color in particular, could be among the most significant factors to consider regarding seed quality [18, 40-42]. Seed color varied significantly among the soybean germplasm accessions, and total flavonoid was positively correlated with seed color (p < 0.05) (Fig. S2). Seed color also differed significantly between landraces and cultivars. Both DTD and CVN had small seeds, but DTD was black while CVN was yellowish, which was closely related to their high levels of phenolic and flavonoid, respectively. These results are in agreement with the findings of Choi et al. (2021) [42], who observed higher phenolic and flavonoid levels in black soybean seeds.
3.6. Geographical Distribution with the Seed Nutritional Characteristic
The correlation responses of targeted metabolites in soybean seeds may result from the influence of seed weight as well as cultivation duration. We performed a geographic analysis to determine the distribution of individual and total targeted metabolites across cultivar resources. The analysis was conducted to gather comprehensive information on the relationship between the allocation and movement of genotypes and seed nutrients during the selection of soybean landraces and cultivars. The nutrition of soybean seeds was found to be affected by the regional distribution of cultivars cultivated in the North of Vietnam [7, 8, 43]. Various correlations may exist between geographical factors (latitude, longitude, and altitude) and the nutritional properties of soybean seeds [44]. The average nutritional compositions of seeds from all cultivars across locations were associated with geographical factors of their corresponding regions (Fig. 6). Protein content showed no significant differences among the three regions, namely the Hong River Delta Region (HRDR), Northwestern Region (NWR), and Northeastern Region (NTR) (Fig. 6A), which was in line with the comparison of THA content across the same three ecoregions (Fig. 6B). Moreover, PA (p < 0.05) and FA (p < 0.001) contents of soybean seeds in HRDR were higher than in NWR or NTR (Fig. 6C-D).
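The regional comparison reports each ecoregion as a mean ± SE over replicates. A minimal sketch of that summary step is given below, assuming three replicates per region; the function names and flavonoid values are hypothetical and do not reproduce the data behind Fig. 6.

```python
import math
from collections import defaultdict

def mean_se(values):
    """Return (mean, standard error) using the sample standard deviation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two replicates")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

def summarize_by_region(records):
    """Group (region, value) pairs and summarize each region as (mean, SE)."""
    groups = defaultdict(list)
    for region, value in records:
        groups[region].append(value)
    return {region: mean_se(vals) for region, vals in groups.items()}

# Hypothetical total flavonoid values (mg g-1), three replicates per region.
records = [
    ("HRDR", 2.0), ("HRDR", 2.1), ("HRDR", 2.2),
    ("NWR", 1.1), ("NWR", 1.2), ("NWR", 1.3),
    ("NTR", 1.0), ("NTR", 1.2), ("NTR", 1.4),
]

summary = summarize_by_region(records)
for region, (mean, se) in summary.items():
    print(f"{region}: {mean:.2f} ± {se:.2f}")
```

Differences between regional means would then be assessed with a post hoc test after ANOVA; that step is omitted here.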
Geographical distribution maps help to identify regions with a desirable profile of seed components and to visualize the relationship between trends in quality characteristics and the cultivation areas. The maps also illustrate the correlation between the geographic factors and the contrast of protein and FA content in soybean seeds. In particular, provinces belonging to the Northwestern and Hong River Delta regions were found to be the hotspots illustrating the contrast between the levels of protein and flavonoids in soybean seeds. The Hong River Delta area, especially Vinh Phuc (VP), is characterized by lowland regions classified as the third tier of elevation (altitude < 500 m), with comparatively high latitudinal and longitudinal coordinates. The accessions in this area showed a predominance of FA over protein in soybean seed composition. Conversely, the accessions in the highland area, including the Ha Giang (HG) plateaus belonging to the second altitude tier (1000 m < altitude < 2000 m), which are also at relatively decreased latitudinal and longitudinal coordinates, exhibited a contrasting profile of seed nutrition composition, with higher protein content compared to FA level (Fig. 7). Correlation values among the seed nutrition components were also substantial and positive for the three cultivation regions. These results provide new insight into the correlation of the variability of flavonoids in soybean seeds with soybean germplasm and various geographical factors. Overall, our findings indicate that HRDR accessions may be more suitable for producing high levels of flavonoids in black soybeans, such as the DTD cultivar.
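Distribution maps of this kind are built by interpolating values measured at sampled sites onto the surrounding area. As a generic illustration only, the sketch below uses inverse-distance weighting (IDW), a simple interpolation technique that is not necessarily the one applied to the maps in this study. The site coordinates are in arbitrary planar units and the protein values are hypothetical.

```python
import math

def idw(points, query, power=2.0):
    """Inverse-distance-weighted estimate at `query` from (x, y, value) points."""
    if not points:
        raise ValueError("at least one sampled point is required")
    qx, qy = query
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in points:
        dist = math.hypot(x - qx, y - qy)
        if dist == 0.0:
            return value  # query coincides with a sampled site
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical sites: (x, y, protein %) in arbitrary planar units.
sites = [(0.0, 0.0, 30.0), (2.0, 0.0, 36.0), (0.0, 2.0, 33.0)]

estimate = idw(sites, (1.0, 1.0))
print(f"Interpolated protein at (1, 1): {estimate:.1f}%")
```

Closer sites receive larger weights, so the estimate always lies between the lowest and highest sampled values; the `power` parameter controls how quickly the influence of a site decays with distance.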

Fig. 6. Distribution of seed quality in soybean cultivars collected from three ecoregions. (A) Protein, (B) Total hydroxycinnamic acid (THA), (C) Total phenolic acid, (D) Total flavonoid in soybean seed collected from the three regions: Hong River Delta Region (HRDR), Northwestern Region (NWR), and Northeastern Region (NTR). The data represent mean ± SE (n = 3). Asterisks indicate significant differences between the different ecoregions as determined by Duncan’s t-test; * p < 0.05, *** p < 0.001.
4. DISCUSSION
Developing soybean cultivars with high seed nutrient content is one of the most challenging goals in soybean improvement. It is widely recognized that the biochemical response of designated metabolites is influenced by both genetic factors and environmental fluctuations. Although there are many studies on metabolic variations in soybean seeds, data regarding the responses of protein and flavonoids to different seed coat colors and seed dry weights remain limited. The current data demonstrated that the fluctuations in metabolism were affected by both seed coat color and the dry weight of 100 seeds (Fig. 1 and Table 1). The phenotypic responses to seed weight and seed coat color were found to play a role in mediating the varying responses linked to metabolic categorization (Fig. 4) and associations (Fig. S2). Furthermore, these individual metabolites were highly interconnected, regardless of differences in seed weight or seed coat color (Table S3 and Fig. 5). To our knowledge, this study is the first to present the inverse relationship between protein and flavonoid content in relation to seed coat color and the dry weight of 100 seeds across a broad spectrum of primary targeted metabolites in soybean germplasm accessions. The dynamic variability of protein and flavonoid contents observed in this and earlier research is probably due to the various groups of accessions and the extensive cultivar collections examined, which aid in selecting genotypes with improved levels of bioactive compounds for breeding. It can be concluded that soybean cultivars with relatively high protein or flavonoid content, such as VHG or DTD, have the potential to be used for food products.
Soybean quality is influenced not only by the total protein concentration but also by the profiles of phenolic and flavonoid compounds. Hydroxycinnamic acid (THA), a cinnamate compound, is known to be involved in the synthesis of phenolic compounds and flavonoids (Fig. 3A) [45]. Thus, the concentrations of seed hydroxycinnamic acid and of phenolic compounds or flavonoids are inherently strongly linked, because THA feeds into the synthesis of phenolic compounds and flavonoids. The measurement methods commonly used in previous studies were based on seed weight; thus, they could not illustrate the impact of THA on the content of phenolics and flavonoids in specific genotype variants [18, 46]. In the present study, the relationship between the content of phenolics and flavonoids and THA was investigated simultaneously with dry weight-based measurement. We aimed to determine whether phenolic and flavonoid contents in seeds correlated with THA in all cultivars. The results showed that THA content was correlated with both phenolic and flavonoid content in various soybean cultivars, especially in DTD and CVN. A major impact of THA was found on the composition of phenolics and flavonoids, indicating a dependent relationship between THA and phenolic compounds. However, THA accumulated mainly in VMK, VHG, CVN, and DTSM, whereas phenolics and flavonoids did not follow the same pattern (Fig. 3B-D). Therefore, the impact of THA on phenolic or flavonoid contents could be eliminated by discovering genotypic variants specific to the THA profile. These data suggested that a rise in THA may promote the amount of PA and FA in terms of definite content but not necessarily affect the content of THA. Because THA is not strongly associated with the modified content of PA and FA, the relationship appears to be more complex.
The variation in cultivars leads to a certain proportion of THA, and the modification of soybean phenolic content might create a wide-ranging impact on the seed nutrient profile. In some cases, the positive genetic correlation of THA and phenolic or flavonoid became negative after modification of THA content, while in other cases, the correlation remained positive (Fig. S1). This implied different cultivars for THA with and without phenolic-based adjustment in various soybeans. It additionally implied that the THA profile might be optimized without modifying the overall amount of phenolic and total flavonoid composition, which could paradoxically influence soybean color or seed yield.

Fig. 7. Geographical distribution of seed nutrition concentration in soybean seeds, mapped according to the region of accession origin. (A) Total protein, (B) Total hydroxycinnamic acid (THA), (C) Total phenolic acid, (D) Total flavonoid. The ten provinces are grouped into three ecoregions: Hong River Delta (HRDR), Northeastern region (NTR), and Northwestern region (NWR). Soybean genotypes were collected from Ha Noi (HN), Vinh Phuc (VP), Lao Cai (LC), Cao Bang (CB), Ha Giang (HG), Quang Ninh (QN), Hoa Binh (HB), Lang Son (LS), Thai Nguyen (TN), and Son La (SL). Geographical distribution maps of soybean seed nutrition were generated with QGIS 10.0 (https://www.qgis.org/en/site/) using ordinary interpolation.
Seed weight is influenced by biochemical responses, which can subsequently impact the metabolite contents, yield, and quality of soybean seeds. Many studies have reported variations in protein, phenolic, and flavonoid, along with their relationship with genetic or environmental factors, while investigating the impacts of both elements on the yield and quality of soybean seeds [2, 14, 18, 47]. Recently, Zhang et al. (2018) [12] illustrated that the amino acid concentration associated with seed weight and total protein exhibited a distinct genetic basis. Furthermore, 100-seed weight was influenced by the levels of sugars, protein, oil, phenolic, and fatty acids [18, 41]. The dry mass of 100 seeds was likely greater in yellow soybean seed germplasm lines compared to those of other seed coat colors (Fig. 2). The difference in dry weight for 100 seeds between black and yellow soybean accessions ranged from 8 to 15 grams, even though both colored soybean accessions were widely distributed in terms of their 100-seed weight. The dry weight of 100 seeds affected the selected metabolites differently based on seed coat color. In addition, protein content showed a positive correlation with seed weight, while flavonoid content was negatively associated with the dry weight of 100 seeds. The THA, phenolic, and flavonoid levels from yellow and black soybean accessions were inconsistently associated with seed weight, while the protein levels in yellow soybean accessions demonstrated a consistently significant relationship with seed weight. This result suggests that protein levels would remain consistently steady within a soybean lineage, regardless of variations in the dry weight of 100 seeds. In other studies, protein levels were found to increase with the dry weight of 100 seeds in black soybean seeds, but this was not observed in yellow soybean seeds [6, 19, 48]. In yellow seeds, the levels of total protein were correlated with 100-seed weight.
In another study, Xu et al. (2022) [19] reported that protein compounds are affected by seed weight in Glycine max (L.). Furthermore, large soybean seeds exhibited significantly higher levels of starch, sucrose, protein, fatty acid, and phenolic than small seeds [18-20]. It has also been indicated that seed size is positively associated with protein and sugar content, while it is negatively correlated with isoflavones. In the case of yellow soybean accessions, only protein progressively increased with seed weight, whereas the amount of flavonoid decreased. In soybeans, a wider range of protein contents was found in large seeds than in small seeds [17, 19]. Although the metabolic variation was notably distinct in the ANOVA clustering pattern (Fig. 4A, C), the correlation coefficient exhibited a high total protein contribution to the dry weight of 100 seeds (Tables 2 and S3). Moreover, a previous study suggested significant promise for enhancing soybean seed nutrients without compromising yield because protein has a positive correlation with total sugar and sucrose [18, 25, 27]. Furthermore, considering that the soybean germplasm accessions in this study were sourced from various maturity groups, it was of interest to investigate whether the protein, phenolic, and flavonoid in seeds were correlated with days to maturity (time of cultivation). The results indicated a significant and strong connection between protein concentration and days to maturity, with a correlation coefficient of 0.96 (P = 0.85). At the same time, the association between phenolic or flavonoid levels and days to maturity was observed to be quite weak, with correlation coefficients of -0.29 and -0.22 (P = 0.90; P = 0.82) (Fig. S1). This suggested that protein content in seeds might be genotypically influenced by days to maturity.
The rise in concentrations of these targeted metabolites was closely associated with the elevation in seed dry weight as the seeds progressed through the growth and development stages [49].
Besides the effect of seed weight on the biochemical response of focused metabolites, the color of the seed coats also affected the level of protein, THA, phenolic, and flavonoid. In the present study, we found that fluctuations in protein and flavonoid levels were related to seed coat color to different degrees. The impact of seed coat color did not appear to be significant on protein levels in locally collected soybean accessions, whereas notable variation was observed among the collection locations of these accessions (Fig. 4). This result is in agreement with the findings of Kim et al. (2007) [24], who reported that the color of the seed coat did not influence the negative correlation of protein and lipid contents. In contrast, higher phenolic and flavonoid contents were found in seeds with a black seed coat, as detected in DTD, which hinted that the phenolic and flavonoid levels were highly affected by seed coat color. Similarly, earlier reports found that seed coat color did not significantly affect protein levels in locally sourced soybean accessions, although notable differences were observed across collection locations [23, 47]. Nonetheless, in the present study, the total phenolic and flavonoid content was higher in the soybean seeds with black coat color compared to those with yellow seed coat color. The coloring of black soybean seeds was linked to the function of a UDP-glycose flavonoid-3-O-glycosyltransferase involved in anthocyanin synthesis compared to the activity observed in yellow soybean seeds. Interestingly, the color of black soybean seeds was strongly correlated with isoflavone, a flavonoid compound, but was associated with comparatively lower anthocyanin levels throughout the seed maturation phase [23]. The dynamics of the targeted metabolites in this study were significantly interrelated with the collected soybean accession and were influenced by seed coat color.
Further, an increase in the amount of either protein or bioactive compounds (phenolic and flavonoid) beyond these allowable ranges would dramatically inhibit the other component. Owing to the low protein content in DTD and CVN cultivars, it is expected that their flavonoid levels will increase as a result of a negative correlation (Fig. 4B). In other words, it is impossible to simultaneously increase protein, phenolic, and flavonoid levels within a certain range. To the best of our understanding, this is the first research to uncover the influence of seed coat color on a wide array of dynamics related to protein and flavonoids and their negative association across soybean seed germplasm accessions. The correlation between protein and phenolic or flavonoid is regarded as an assessment of metabolite interaction caused by the correlations induced by internal fluctuation of the metabolic system [6, 8]. Compared to other traits, flavonoids exhibited a broader distribution across seed coat colors than between landraces and cultivars. Based on the coefficient of variation, flavonoid was identified as the comparatively most stable component among the measured traits. This implied that flavonoids have strong genotypic control, and the high flavonoid genotypes are likely to maintain performance in the DTD soybean cultivar. The variation in protein, phenolic, and flavonoid levels in soybean seeds can be linked to the influence of genotype traits [6, 9, 44], along with environmental factors [10, 11].
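Trait stability is compared above through coefficient of variation values. A minimal sketch of that calculation follows, assuming the conventional definition (sample standard deviation divided by the mean, expressed as a percentage). The trait names and values are hypothetical placeholders, not measurements from this study.

```python
import math

def coefficient_of_variation(values):
    """Coefficient of variation in percent (sample SD divided by the mean)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 100.0 * sd / mean

# Hypothetical trait values across accessions (illustration only).
traits = {
    "trait_a": [29.0, 31.0, 33.0, 35.0],
    "trait_b": [1.0, 1.5, 2.5, 3.0],
}

for name, values in traits.items():
    print(f"{name}: CV = {coefficient_of_variation(values):.1f}%")
```

A lower value indicates that a trait varies less relative to its mean, which is the sense in which one component can be described as more stable than another.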
Seed nutrients exhibited significant variations across different ecoregions of origin, suggesting that geography may play an important role in the variety of soybeans in Vietnam. These results indicated that HRDR accessions are likely to have the highest concentrations of flavonoid components, linked to the increased genetic diversity observed in NTR and NWR accessions (Fig. 5). The geographic distribution map of protein and THA levels indicated a declining trend as one moves southward (toward lower latitudes). However, total phenolic and flavonoid contents were found to be highly accumulated in Vinh Phuc (VP), a flat field area of the HRDR (low latitude). Geographical differentiation could contribute substantially to the genetic differentiation of soybeans, resulting in variations in seed quality traits [9, 50, 51]. Crucially, this finding may support breeders in broadening the range of germplasm and encourage the application of these valuable genetic resources in breeding initiatives centered on protein or flavonoid content. In line with our findings, quality traits of soybean seeds, such as amino acid and protein, fatty acid, flavonoid, isoflavone, anthocyanins, and tocopherols, are influenced by the ecoregion of origin [6-8, 12, 52, 53]. Overall, the significant differences in protein, phenolic, and flavonoid compositions across ecoregions are associated with environmental changes, geographical positions, and growing seasons [8, 10, 12, 54]. Thus, it is essential to highlight that nutritional composition varies significantly with the geographical origin of soybean accessions, implying that soybean breeders ought to prioritize the origin of these accessions when developing contemporary cultivars for a specific trait of interest, a notion supported by earlier studies [12].
This variation enables breeders to select highly adapted candidate accessions with increased protein or flavonoid content from different locations, thereby contributing to improvements in soybean quality breeding.
CONCLUSION
This study is significant for identifying high-quality soybean cultivars based on the intricate relationships among seed constituents, especially the negative correlation of protein with phenolic compounds and flavonoid levels. It reveals the genotypic basis of quality alterations in soybean landraces and cultivars. Additionally, this study provides insight into how genetic and geographical factors govern the seed compounds. Such variations in nutritional content among soybean cultivars can be leveraged for significant applications in the food and pharmaceutical industries. Further validation of the causal relationship between nutrition variation and the phenotypic effect linked to soybean quality, as well as understanding their relationship with yield, will be the focus of the next study.
AUTHORS’ CONTRIBUTION
The authors confirm their contribution to the paper as follows: data collection: V.H.H., V.T.N., T.T.L., T.P.A.D., D.V.D., X.B.N., T.D.N., T.D.T.; validation: M.N.; draft manuscript: V.H.L., A.T.T., D.H.T. All authors reviewed the results and approved the final version of the manuscript.
LIST OF ABBREVIATIONS
TUAF | = Thai Nguyen University of Agriculture and Forestry |
PRC | = Plant Resource Center |
PIs | = Plant Introductions |
RP-HPLC | = Reverse Phase High-Performance Liquid Chromatography |
ANOVA | = Analysis of Variance |
PCA | = Principal Component Analysis |
AVAILABILITY OF DATA AND MATERIALS
The data are available with the links provided in the manuscript. Supplementary data to this article are included in this published article and can be found online.
FUNDING
This work was supported by national funding from the Ministry of Education and Training, Vietnam, which supported the funding acquisition (grant code number B2022-TNA-42).
ACKNOWLEDGEMENTS
The authors would like to thank the Plant Resource Center (PRC) for providing soybean germplasm accessions. The PRC also supported the experiment of Anh Phuong Thi Dang, a student funded by the Master, PhD Scholarship Programme of Vingroup Innovation Foundation (VINIF), code [VINIF.2022.TS007].