Differentially expressed gene patterns
Gene expression can vary over time in response to different stimuli, and identifying differentially expressed genes (DEGs) can provide insight into the underlying biological processes. In this study, gene expression was analyzed across a drought time series, and the results are visualized in Fig. 2.
Fig. 2. Heatmap of differentially expressed genes under drought stress across the time series. logFC was calculated relative to the respective controls. Red denotes upregulation and blue denotes downregulation. The Y-axis lists the differentially expressed genes.
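As a rough illustration of the logFC values shown in the heatmap, the sketch below computes a log fold change of a drought-stressed sample against its time-matched control. The base-2 logarithm, the pseudocount, and all expression values are assumptions made for illustration; the DEG pipeline actually used is not described in this section and most likely relies on a dedicated statistical package.

```python
import math

def log2_fold_change(treated, control, pseudocount=1.0):
    """Return log2((treated + pseudocount) / (control + pseudocount)).

    The pseudocount (an assumption of this sketch) avoids division by zero
    for genes with no detectable expression in the control.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))

# Hypothetical normalized expression values, NOT data from the study:
# time point -> (drought-stressed, time-matched control)
expression = {"3h": (15.0, 7.0), "48h": (3.0, 7.0), "120h": (63.0, 7.0)}

log_fc = {t: log2_fold_change(d, c) for t, (d, c) in expression.items()}
for time_point, value in log_fc.items():
    direction = "up" if value > 0 else "down"
    print(f"{time_point}: logFC = {value:+.2f} ({direction}regulated)")
```

A positive value corresponds to the red (upregulated) cells of the heatmap and a negative value to the blue (downregulated) cells.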
The DEGs at each time point were identified relative to their individual controls, and the comparison revealed distinct temporal patterns (Figs. 3 and 4). We classified the 3 h and 6 h time points as the early stage of drought, the 48 h and 72 h time points as the medium stage, and the 96 h and 120 h time points as the late stage. The largest number of upregulated genes was found in the late stage, specifically at 120 h (Fig. 3), suggesting that the cellular stress response becomes more pronounced as the duration of stress increases. Some genes upregulated in the early stage overlapped with those upregulated in the late stage, which implies that the stress response involves both immediate and delayed mechanisms. In the medium stage of water deficit, only a few genes were significantly upregulated. Although the number of DEGs was lower at this stage, these genes may still play important roles in the stress response.
Together, these results show that the duration of stress must be considered when studying cellular responses to drought, and the DEGs shared across stages point to an interplay between immediate and delayed response mechanisms.
Fig. 3. UpSet plot of the numbers of upregulated differentially expressed genes under drought stress across the time series. Vertical bars show the number of upregulated genes in each intersection, that is, genes unique to a single time point or shared only among the connected time points. Horizontal bars show the total number of upregulated genes per time point. Dots mark the time points that make up each intersection.
Fig. 4. UpSet plot of the numbers of downregulated differentially expressed genes under drought stress across the time series. Vertical bars show the number of downregulated genes in each intersection, that is, genes unique to a single time point or shared only among the connected time points. Horizontal bars show the total number of downregulated genes per time point. Dots mark the time points that make up each intersection.
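The counts displayed in an UpSet plot can be derived by grouping each gene by the exact combination of time points in which it is differentially expressed. The following is a minimal sketch of that bookkeeping; the gene identifiers and set contents are placeholders, not results from this study, and the function name is hypothetical.

```python
from collections import Counter

# Placeholder upregulated gene sets per time point (hypothetical gene IDs).
upregulated = {
    "3h": {"g1", "g2", "g3"},
    "6h": {"g2", "g4"},
    "48h": {"g5"},
    "120h": {"g1", "g2", "g6", "g7", "g8"},
}

def exclusive_intersections(sets_by_label):
    """Count genes by the exact combination of time points in which they occur."""
    membership = {}
    for label, genes in sets_by_label.items():
        for gene in genes:
            membership.setdefault(gene, set()).add(label)
    return Counter(frozenset(labels) for labels in membership.values())

# Vertical bars of the plot: size of each exclusive intersection.
counts = exclusive_intersections(upregulated)
# Horizontal bars of the plot: total number of genes per time point.
totals = {label: len(genes) for label, genes in upregulated.items()}

for combo, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(" & ".join(sorted(combo)), n)
```

In this toy input, genes found at both an early (3h) and a late (120h) time point form their own intersection, which is the kind of early and late overlap described in the text.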
Gene ontology enrichment analysis
Gene ontology (GO) enrichment analysis is widely used to interpret the biological functions of differentially expressed genes (DEGs) in a high-throughput manner. In this study, the functions of the DEGs were predicted from GO annotations, and the results are visualized in Fig. 5. A large proportion of DEGs (approximately 600 genes) were involved in organic substance biosynthetic processes and organonitrogen compound-related processes, suggesting that the stress response involves the production and modification of organic compounds that are important for cellular metabolism and signaling. Approximately 500 genes were related to the response to stress, indicating that multiple stress response pathways are activated; this agrees with previous studies showing that stress response genes are upregulated by various environmental stimuli. More than 400 genes were annotated as responding to chemicals. These genes may be involved in the synthesis, transport, and degradation of chemical signals, or in downstream signaling pathways activated by chemical stimuli. Approximately 200 genes were related to translation and to the response to osmotic stress, suggesting that osmotic stress may affect the translation of proteins involved in cellular stress responses.
In summary, the GO enrichment analysis shows that the DEGs span metabolic, signaling, and stress response pathways, which highlights the complexity of the drought response and the need to consider multiple biological processes when studying it.
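The statistical test behind the enrichment analysis is not stated in this section. A common choice for over-representation analysis is the one-sided hypergeometric test, sketched below under that assumption. The function name and all counts are hypothetical and deliberately small so the arithmetic can be checked by hand.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_term, n_degs, n_overlap):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_background: annotated genes in the background set
    n_term:       background genes carrying the GO term
    n_degs:       DEGs drawn from the background
    n_overlap:    DEGs carrying the GO term
    """
    upper = min(n_term, n_degs)
    total = comb(n_background, n_degs)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_degs - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy counts (assumptions): 20 background genes, 5 with the term,
# 6 DEGs, 3 of which carry the term.
p_value = hypergeom_enrichment_p(20, 5, 6, 3)
print(f"enrichment p-value = {p_value:.4f}")
```

In practice the p-values of all tested GO terms would also be corrected for multiple testing, which this sketch omits.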
Machine learning performance
We used gene expression values as features to train classification models with XGBoost in R. XGBoost is a popular implementation of gradient boosting, which combines multiple weak learners, such as shallow trees, into a strong one. To evaluate the models, we split the dataset into training and test sets. Performance was measured using the area under the receiver operating characteristic curve (the “auc” metric), which is commonly used to evaluate classification models. The results, shown in Fig. 6, demonstrate that the AUC values for the training and test sets are close to each other, indicating that the models are not substantially overfitted. The AUC also increased over iterations, suggesting that the models improved as additional boosting rounds were fitted. XGBoost is well suited to this task because it can handle high-dimensional datasets such as those generated in gene expression studies. By combining multiple weak learners, the algorithm can capture complex relationships between gene expression values and biological outcomes, such as disease states or cellular responses to stress.
In summary, our results demonstrate that XGBoost can classify biological samples based on gene expression values. The close agreement between training and test performance suggests that the models generalize to held-out samples. The use of XGBoost has enabled us to extract meaningful insights from high-dimensional gene expression datasets, and its flexibility makes it a valuable tool for future studies in this area.
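To make the evaluation metric concrete, the sketch below computes the area under the ROC curve from its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The labels and scores are invented for illustration (0 = watered, 1 = non-watered) and this is a plain Python sketch, not the R workflow used in the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted scores for a training and a test split.
train_labels = [0, 0, 1, 1, 0, 1]
train_scores = [0.10, 0.30, 0.80, 0.70, 0.20, 0.90]
test_labels = [0, 1, 0, 1]
test_scores = [0.20, 0.60, 0.70, 0.90]

train_auc = roc_auc(train_labels, train_scores)
test_auc = roc_auc(test_labels, test_scores)
print(f"train AUC = {train_auc:.2f}, test AUC = {test_auc:.2f}")
```

A large gap between the two values would point to overfitting, whereas similar values, as reported in Fig. 6, suggest that the model behaves comparably on unseen samples.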
Feature importance
We next determined which genes contributed most to the classification of samples based on their expression values. Classification here means distinguishing between two classes: watered samples (labeled 0) and non-watered samples (labeled 1). These binary labels were used as the response variable in the model. After training and validating the XGBoost classification model, we extracted the features together with their importance scores. The importance score of a feature reflects its contribution to the classification of samples and can therefore point to the underlying biological mechanisms. Among the top six genes identified as candidates, FLA2 had the highest importance score (Fig. 7), suggesting that FLA2 expression is a critical factor in separating the two classes. FLA2 encodes fasciclin-like arabinogalactan protein 2, a cell surface protein involved in cellular processes such as cell adhesion, growth, and differentiation. Its high importance score may indicate involvement in the cellular stress response, with potential implications for the development of stress-tolerant crops. The other candidate genes identified in our analysis also play important roles in various cellular processes, and further investigation of these genes could clarify the biological mechanisms involved in stress responses. In conclusion, the identification of FLA2 as the most important gene in our study highlights its potential significance in stress responses.
The use of the XGBoost algorithm has enabled us to extract meaningful insights from high-dimensional gene expression datasets, and its flexibility makes it a valuable tool for future studies in this area.
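The idea of a gain-based importance score can be illustrated with a deliberately simplified gradient boosting loop. The sketch below fits one-split trees (stumps) to the residuals of a logistic model and credits each split's error reduction to the feature it uses. It is not the XGBoost algorithm itself, which uses second-order gradients and regularization, and the expression matrix is invented so that the first column separates the two classes. The column names are placeholders.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def best_stump(X, residuals):
    """Return (gain, feature, threshold, left_mean, right_mean) of the best split."""
    mean = sum(residuals) / len(residuals)
    base_sse = sum((r - mean) ** 2 for r in residuals)
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            gain = base_sse - sse  # reduction in squared error from this split
            if best is None or gain > best[0]:
                best = (gain, j, thr, lm, rm)
    return best

def boost(X, y, rounds=20, learning_rate=0.3):
    """Boost stumps on logistic-loss residuals; return (importance, raw scores)."""
    scores = [0.0] * len(X)
    gain_by_feature = [0.0] * len(X[0])
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        gain, j, thr, lm, rm = best_stump(X, residuals)
        gain_by_feature[j] += gain
        scores = [s + learning_rate * (lm if row[j] <= thr else rm)
                  for row, s in zip(X, scores)]
    total = sum(gain_by_feature)
    return [g / total for g in gain_by_feature], scores

# Invented expression matrix: rows are samples, columns are three placeholder genes.
feature_names = ["gene_A", "gene_B", "gene_C"]
X = [[2.0, 5.1, 1.0], [2.5, 4.8, 3.0], [3.0, 5.3, 2.0],
     [7.0, 5.0, 2.5], [7.5, 4.9, 1.5], [8.0, 5.2, 3.5]]
y = [0, 0, 0, 1, 1, 1]  # 0 = watered, 1 = non-watered

importance, final_scores = boost(X, y)
ranked = sorted(zip(feature_names, importance), key=lambda kv: -kv[1])
print(ranked)
```

Because the importance values are normalized to sum to one, they are only meaningful relative to each other within a single model, which is also how a ranking such as the one in Fig. 7 should be read.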
Gene association networks
To better understand the biological functions and interactions of the candidate genes, we constructed a gene association network for each gene using publicly available databases. The resulting networks are shown in Fig. 8, with nodes representing genes and edges representing interactions between genes. The candidate genes identified in our study are highlighted in red within each network. The interactors of these candidate genes showed drought stress-related functions, suggesting their involvement in the abiotic stress response. The networks therefore provide insight into the potential pathways and mechanisms of the cellular stress response, and this information could be used to develop targeted interventions for improving drought stress tolerance in crops.
Fig. 8. Interaction networks for the candidate genes. Red nodes correspond to the proteins encoded by the respective candidate genes. Purple edges denote experimentally determined interactions, and turquoise edges denote interactions from curated databases. Green edges represent gene neighborhood, red edges gene fusion, dark blue edges gene co-occurrence, light green edges text mining, black edges co-expression, and light blue edges protein homology.
Overall, the gene association networks of the candidate genes identified in our study highlight their potential significance in drought stress responses and provide a foundation for further investigation into their roles in drought-stress tolerance (Supplementary 2).
Candidate genes overlapping with drought QTLs in tomato
We explored the potential role of three candidate genes, FLA2, ASCT, and NPF7.3, in the response of tomato plants to drought stress. To do so, we compared the locations of these genes with previously identified quantitative trait loci (QTLs) associated with drought-related traits in tomato [17].
Our analysis revealed that all three candidate genes overlapped with QTLs that were previously associated with drought stress, suggesting their potential involvement in the plant’s response to water deficit conditions (Table 1). Specifically, the QTLs RIP, SSC, NFr, and FW were identified under drought stress for the traits of time to ripe, soluble solid content, number of fruits, and fruit weight, respectively.
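Checking whether a gene falls within a QTL reduces to an interval overlap test on the same chromosome. The sketch below shows that test with generic names and placeholder coordinates; none of the positions correspond to the real locations of FLA2, ASCT, NPF7.3, or the QTLs in Table 1, and the `Interval` type is hypothetical.

```python
from typing import NamedTuple

class Interval(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int

def overlaps(a, b):
    """True if two closed intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end

# Placeholder coordinates for illustration only (NOT real tomato positions).
genes = [
    Interval("gene_A", "chr1", 1_200, 1_900),
    Interval("gene_B", "chr2", 5_000, 5_800),
    Interval("gene_C", "chr3", 40_000, 42_000),
]
qtls = [
    Interval("qtl_1", "chr1", 1_000, 3_000),
    Interval("qtl_2", "chr2", 5_500, 9_000),
    Interval("qtl_3", "chr3", 10_000, 20_000),
    Interval("qtl_4", "chr3", 41_000, 60_000),
]

# Map each gene to the QTLs whose interval it overlaps.
hits = {g.name: [q.name for q in qtls if overlaps(g, q)] for g in genes}
for gene, matched in hits.items():
    print(gene, "->", ", ".join(matched) if matched else "no overlap")
```

With real data, both sets of coordinates must refer to the same genome assembly for the comparison to be valid.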
The importance of our findings lies in the potential for identifying specific genes that contribute to the drought stress response of tomato plants. A better understanding of the genetic mechanisms involved in this response is crucial for developing crop varieties that can withstand environmental stressors such as water scarcity.
Our study suggests that FLA2, ASCT, and NPF7.3 could be promising candidate genes for further investigation in the context of drought stress in tomato plants. Future studies could focus on elucidating the specific molecular pathways through which these genes affect the plant’s response to water deficit conditions. This could lead to the development of more targeted breeding programs that utilize these genes to develop drought-tolerant tomato varieties.