Estimation of Missing River Flow Data for Hydrologic Analysis: The Case of Great Ruaha River Catchment
Received Date: Mar 24, 2018 / Accepted Date: Apr 05, 2018 / Published Date: Apr 11, 2018
Availability of data on hydrologic variables such as river flow is necessary for planning and management of water resources. Many river basins in developing countries have no complete dataset on river flow due to degradation of gauging stations coupled with unsatisfactory data compilation and storage procedures. Different methods are available to fill missing data; however, these methods differ in performance depending on the characteristics of the initial data points. The purpose of this study was to fill the missing data in the Great Ruaha River by selecting the best method. In this study, simple and multiple regression analysis and the recession method were employed to fill the gaps of missing river flow data at ten gauging stations of the Great Ruaha River catchment. The performance of these methods was assessed using Nash-Sutcliffe efficiency, Root Mean Square Error and Mean Absolute Error. The results showed that multiple regression is more suitable than linear regression for filling missing data during the period of high flow; however, the selection of either method depends on the availability of data on the independent variables. The recession method was found to be suitable for filling missing data during the period of low flow. Though these methods were useful in filling data, the challenge was that more than one method was required to estimate all the missing data at a gauging station. This is because missing data at a given gauging station occurred during both dry and rainy seasons.
Keywords: Great Ruaha river; Missing river flow data; Regression analysis; Recession method; Nash-Sutcliffe model; Root mean square error; Mean absolute error (MAE)
The Great Ruaha River is one of the major driving forces of the economy of Tanzania through various sectors such as hydropower production, domestic water supply, irrigated agriculture and the fishing industry. It originates in the Kipengere mountains in south-central Tanzania and flows through the Usangu wetland and Ruaha National Park eastwards into the Rufiji River, which drains into the Indian Ocean. Agriculture, being the main economic activity for the majority of people in the basin, abstracts large amounts of water from the river, which results in many conflicts with other sectors, especially during the dry period. In addition to the aforementioned problem, climate change also accelerates the drying of many tributaries of the Great Ruaha River, except for perennial rivers such as Mbarali, Kimani and Chimala, which retain very minimal flow in the dry season.
Sustainability of water resources requires proper planning and management. However, without a complete dataset on river flow these purposes cannot be achieved. Many gauging stations along the Great Ruaha River have poor-quality data records characterized by short and long durations of missing data. Elshorbagy has outlined some of the factors that may contribute to the existence of data gaps; such factors include equipment failure, effects of natural phenomena such as flooding, mishandling of observed records by field personnel, or accidental loss of data files. Various methods applicable to the infilling of data are given in the literature; these include regression methods and the recession method [4,5].
Despite the availability of numerous methods for filling missing hydrological data, it is generally believed that no single method can be considered universally best. Each method has its own advantages and disadvantages depending on the characteristics of the dataset. A method that fits some data points can be unsuited to a different set of data points, or to data measured at different locations of the same surface.
However, other factors reported by DeSilva et al., such as distances among gauges, areal coverage of each gauge, length of gap, the season, the climatic region, or the availability and characteristics of the records, have a significant influence on hydrological data estimates.
The aim of this study is to estimate missing river flow data for hydrologic analysis using regression analysis and recession methods, and to compare the estimation performance of the selected methods.
Description of Study Area and Data Availability
The study was conducted in the Great Ruaha catchment, which is located in the southwest of Tanzania between latitudes 5°30ˈˈ and 9°25ˈˈ South and longitudes 33°55ˈˈ and 37°80ˈˈ East (Figure 1). The Great Ruaha River is one of the most important rivers, with a great impact on the economy of Tanzania. It is a main tributary (catchment area ≈ 68,000 sq. km) of the Rufiji River, which forms the largest drainage basin among the nine basins, covering an area of 177,000 sq. km of the Tanzania mainland.
Within the catchment, other small rivers from the highlands, such as Mbarali, Kimani and Chimala, join the Great Ruaha River, which flows through the Usangu plain and Ruaha National Park, serving as a major source of water for irrigated agriculture and wildlife respectively. Thereafter, together with the Little Ruaha River, it supplies water to the Mtera hydropower plant and forms the Rufiji River downstream of the power plant.
The mean annual air temperature in the area varies from about 18°C at higher altitudes to about 28°C in the lower and drier part of the basin. The rainfall pattern is unimodal (November to April), highly localized, spatially varied and strongly correlated with altitude. The Usangu plain, at the lower altitude, receives rainfall ranging from 500 to 700 mm per annum, while the higher altitudes receive up to about 1600 mm of rain.
For the purposes of this study the daily river flow data for eleven (11) gauging stations within the catchment were obtained from Rufiji Basin Water Board (RBWB). Detailed information on the data is given in Table 1.
|Gauging station||Main river||Period of record||No. of years||Data availability (%)|
|Chitekero (IKA7A)||Chimala||01/1990 - 11/2014||24||88|
|Salimwani (IKA8A)||Great Ruaha||01/1998 - 06/2014||16||92|
|Msembe (1KA59)||Great Ruaha||01/1994 - 11/2014||20||90|
|GNR (1KA9A)||Kimani||03/1991 - 11/2014||23||97|
|Igawa (1KA11A)||Mbarali||01/1998 - 09/2014||16||98|
|Ilongo (1KA15A)||Ndembera||01/1990 - 09/2014||24||91|
|Ihimbu (1KA21A)||Little Ruaha||06/1991 - 12/2014||23||90|
|Makalala (1KA32A)||Little Ruaha||01/1990 - 10/2014||24||92|
|Ndiuka (1KA2A)||Little Ruaha||01/1990 - 11/2014||24||67|
|Mtandika (1KA37A)||Lukosi||01/1990 - 12/2014||24||86|
|Mtitu (1KA22)||Mtitu||01/1990 - 12/2014||24||100|
Table 1: General Information of the gauging stations in Great Ruaha Catchment.
In some of these gauging stations, data for one day, months or even a year were missing. Among these gauging stations, Mtitu (1KA22) is the only one with a complete dataset, while Ndiuka (1KA2A), which is located along the Little Ruaha River, has the lowest percentage of data availability (67%).
In the process of filling missing data, all eleven (11) gauging stations were involved. For the regression methods, the selection of independent and dependent variables was based on the following factors: the correlation coefficient between gauging stations, data availability for the donor stations (independent variables) and the location of the gauging stations within the catchment. The correlation coefficient is an indicator of the strength of the relationship between two variables. A higher positive coefficient between variables indicates that estimates will be high when the actual values are high and low when they are low, giving evidence about the suitability of the method. Hence the correlation coefficients were determined for 100 combinations of gauging stations. The gauging stations with strong correlation, considered together with the other criteria, were chosen to predict the missing data of the other stations.
The next step involved a calibration process to develop an equation that describes the dataset. Elshorbagy et al. noted that in hydrological data (e.g., stream flows), annual or seasonal data might be independent, while monthly or weekly data of the same river have significant levels of autocorrelation. Hence calibration was done by selecting gauging stations with a complete (gap-free) dataset of five (5) years for two (linear regression analysis) or three (multiple regression analysis) stations with strong correlation between them. The developed equation was used to fill missing data of the dependent variable during the period of high flow.
For the recession method, a base flow recession factor, which characterizes the behavior of low flow, was derived from a simple exponential equation. This was done by averaging five (5) recession factors (α1, α2, α3, α4 and α5) from five consecutive years of low flow data recorded at the gauging station. The value obtained was substituted into the recession equation to fill the missing data.
In order to test the accuracy of the methods used to estimate missing data, a gauging station (X) and one or more gauging stations for which data are available were selected, and the observations from station X were assumed to be missing. Then, using each method, observations for station X were estimated and compared with the actual observations.
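The hold-out procedure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `fit_line` and `holdout_estimates`, the one-donor least-squares line used as the estimator, and the sample values are all hypothetical and are not taken from the study.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def holdout_estimates(donor, target, hidden):
    """Treat target values at the `hidden` indices as missing, calibrate on
    the remaining values, and return (observed, estimated) pairs for the
    hidden indices so they can be compared."""
    hidden = set(hidden)
    keep = [i for i in range(len(target)) if i not in hidden]
    slope, intercept = fit_line([donor[i] for i in keep],
                                [target[i] for i in keep])
    return [(target[i], slope * donor[i] + intercept) for i in sorted(hidden)]


# Hypothetical flows at a donor station and at station X (illustrative only).
donor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
target = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# Pretend the third and fourth observations at station X are missing.
pairs = holdout_estimates(donor, target, hidden=[2, 3])
for observed, estimated in pairs:
    print(f"observed={observed:.2f} estimated={estimated:.2f}")
```

The same masking step can be reused with any of the estimation methods; only the calibration and prediction inside `holdout_estimates` would change.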
The estimates obtained from each method were compared with observed records. The suitability of a method is decided by how close the estimates and observed values are in a given time series. For this study, Nash-Sutcliffe efficiency, Root Mean Square Error and Mean Absolute Error were used as criteria to assess the closeness of estimated and observed values.
Pearson correlation coefficient (r)
Pearson’s coefficient of correlation was discovered by Bravais in 1846, but Karl Pearson was the first to describe, in 1896, the standard method of its calculation and showed that it was the best one possible. An important assumption in Pearson’s 1896 contribution is the normality of the variables analyzed, which could be true only for quantitative variables. Pearson’s correlation coefficient is a measure of the strength of the linear relationship between two such variables. The value of the correlation coefficient ranges between -1 and 1. If the two variables are in a perfect linear relationship, the correlation coefficient will be either 1 or -1. The sign depends on whether the variables are positively or negatively related. The correlation coefficient is 0 if there is no linear relationship between the variables. Given paired measurements (x1, y1), (x2, y2), ..., (xn, yn), the Pearson correlation coefficient is given by
$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$ Eq(1)
where $\bar{x}$ and $\bar{y}$ are the sample means of x1, x2, ..., xn and y1, y2, ..., yn, respectively.
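As an illustration of how the Pearson correlation coefficient might be computed when screening donor stations, the following Python sketch evaluates r for two short flow series. The function name `pearson_r`, the use of `None` to mark a gap, and the sample values are assumptions made for this example and do not come from the study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient for paired samples.

    Pairs in which either value is missing (None) are skipped, so that
    two gauging-station records with gaps can still be compared.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical flows at two stations; None marks a missing observation.
station_a = [2.0, 4.0, None, 8.0, 10.0]
station_b = [1.0, 2.0, 3.0, 4.0, 5.0]

r = pearson_r(station_a, station_b)
print(round(r, 3))
```

Repeating the call for every pair of stations would give a correlation matrix from which strongly correlated donor stations can be shortlisted.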
In the absence of recharge, river flow is supported by groundwater contribution, often termed baseflow. In some parts of the world river flow is seasonal and baseflow in rivers is an important water resource, the characteristics of which must be fully understood to ensure optimum management of water supply during prolonged droughts.
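During periods of recession, the flow exhibits a pattern of exponential decay, giving a curved trace on a simple plot of discharge, Q, versus time, t. The equations may take different forms, but the most commonly used are:
The slope of flow recession, $-\alpha = \dfrac{\ln Q_t - \ln Q_0}{t}$ Eq(2)
The base flow recession constant, $k = e^{-\alpha}$ Eq(3)
Hence at time t within the gap, $Q_t = Q_0 k^t = Q_0 e^{-\alpha t}$ Eq(4)
where Q0 is the discharge at the start of the recession (t = 0) and Qt is the discharge at time t.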
To use equation (4) as a predictive model, the base flow recession constant, k, has to be determined for the catchment. This is done by plotting the exponential graph of the dry weather discharges against time and then calculating k from the slope of the flow recession. The accuracy of Qt thus depends, among other things, on the accuracy of k. The size of the time units, t (i.e., whether hourly, daily, weekly, monthly, etc.), also affects the accuracy of the predicted flow [11,12].
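A minimal Python sketch of gap filling with the recession method is given below. It assumes the common exponential recession form Qt = Q0·e^(−αt), equivalently Qt = Q0·k^t with k = e^(−α), and estimates α for each dry season as minus the least-squares slope of ln Q against t before averaging across seasons. The function names, the two synthetic calibration seasons and the starting flow of 5.0 are hypothetical and serve only to illustrate the calculation.

```python
import math


def recession_alpha(flows):
    """Estimate alpha from one dry-season series as minus the least-squares
    slope of ln(Q) against t, with t = 0, 1, 2, ... in the series' time step."""
    n = len(flows)
    t = list(range(n))
    ln_q = [math.log(q) for q in flows]
    t_mean = sum(t) / n
    y_mean = sum(ln_q) / n
    sty = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, ln_q))
    stt = sum((ti - t_mean) ** 2 for ti in t)
    return -sty / stt


def fill_recession_gap(q0, alpha, n_steps):
    """Estimated flows for t = 1..n_steps after the last observed flow q0."""
    return [q0 * math.exp(-alpha * t) for t in range(1, n_steps + 1)]


# Two synthetic dry seasons standing in for the calibration years.
seasons = [
    [10.0 * math.exp(-0.02 * t) for t in range(30)],
    [8.0 * math.exp(-0.04 * t) for t in range(30)],
]

# Average the per-season factors, then derive the recession constant k.
alpha_mean = sum(recession_alpha(s) for s in seasons) / len(seasons)
k = math.exp(-alpha_mean)

# Fill a three-step gap that follows an observed flow of 5.0.
filled = fill_recession_gap(5.0, alpha_mean, 3)
print(round(alpha_mean, 4), [round(q, 3) for q in filled])
```

Fitting the slope on ln Q rather than on Q keeps the calculation linear; seasons with very different recession factors would pull the average away from any individual year, which is worth checking before the averaged value is used.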
For many rivers, downstream or upstream data are missing. In this case the flow data from nearby rivers can be used to estimate the missing flows. Regression analysis is the method frequently used to solve this problem. The independent variables are the flows of the nearby stations having drainage basins with similar hydrological characteristics.
Linear regression analysis: If X and Y are two related variables, then linear regression analysis helps us to predict the value of Y for a given value of X, or vice versa, expressed by
$$y = \beta x + \alpha$$ Eq(5)
where β is the slope and α is the y-intercept.
Multiple regression analysis: This helps to predict the value of y for given values of x1, x2, x3, ..., xk.
$$y = \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_k x_k + \alpha$$ Eq(6)
where β1, β2, β3, ..., βk are parameters to be determined and α is the y-intercept.
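The regression approach can be sketched in Python as follows, assuming ordinary least squares is used to determine the parameters β1, ..., βk and α of Eq(6). The function names `fit_linear_model` and `predict`, the use of `None` to mark a gap, and the donor and target values are hypothetical; the synthetic target was generated from y = 2·x1 + 0.5·x2 + 1 so that the fit can be checked by eye.

```python
def fit_linear_model(X, y):
    """Ordinary least squares for y = b1*x1 + ... + bk*xk + a.

    X is a list of rows (x1, ..., xk). Returns ([b1, ..., bk], a).
    The normal equations are solved by Gauss-Jordan elimination with
    partial pivoting, so no third-party library is needed.
    """
    rows = [list(r) + [1.0] for r in X]  # extra column of ones for the intercept
    m = len(rows[0])
    # Augmented normal-equation matrix [A^T A | A^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(m)
    ]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("donor series are collinear or constant")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * p for v, p in zip(aug[r], aug[col])]
    coef = [aug[i][m] / aug[i][i] for i in range(m)]
    return coef[:-1], coef[-1]


def predict(betas, alpha, x_row):
    """Evaluate the fitted equation for one row of donor flows."""
    return sum(b * x for b, x in zip(betas, x_row)) + alpha


# Hypothetical donor flows and a target series with one gap (None).
donor_1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
donor_2 = [2.0, 1.0, 5.0, 3.0, 8.0, 4.0]
target = [4.0, 5.5, 9.5, 10.5, None, 15.0]

rows = list(zip(donor_1, donor_2))
calib = [(r, t) for r, t in zip(rows, target) if t is not None]
betas, alpha = fit_linear_model([r for r, _ in calib], [t for _, t in calib])

# Keep observed values and fill only the gaps.
filled = [t if t is not None else predict(betas, alpha, r)
          for r, t in zip(rows, target)]
print([round(v, 2) for v in filled])
```

Simple linear regression is the special case with a single donor, for example `fit_linear_model([[x] for x in donor_1], ...)`, which is the fallback when only one well-correlated station is available.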
The performance of the selected methods was assessed using the Nash-Sutcliffe model efficiency coefficient (NSE). It is commonly used to assess the predictive power of hydrological discharge models. It is defined as:
$$E = 1 - \frac{\sum_{i=1}^{n}(x_{obs,i}-x_{estm,i})^2}{\sum_{i=1}^{n}(x_{obs,i}-\bar{x}_{obs})^2}$$ Eq(7)
The value of E ranges from -∞ to 1, where $x_{obs,i}$ is the observed value and $x_{estm,i}$ is the estimated value at time/place i, and $\bar{x}_{obs}$ is the average value of the observations. Essentially, the closer the model efficiency is to 1, the more accurate the model.
Root mean square error (RMSE) and mean absolute error (MAE)
The root mean square error (RMSE) has been used as a standard statistical metric to measure model performance in meteorology, air quality, and climate research studies. The mean absolute error (MAE) is another useful measure widely used in model evaluations. RMSE presents information on short-term efficiency, which is a benchmark of the difference between predicted and observed values. The lower the RMSE, the more accurate the evaluation. MAE is an indication of the average deviation of the predicted values from the corresponding observed values and can present information on the long-term performance of the models; the lower the MAE, the better the long-term model prediction.
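While the MAE gives the same weight to all errors, the RMSE penalizes variance, as it gives errors with larger absolute values more weight than errors with smaller absolute values. When both metrics are calculated, the RMSE is by definition never smaller than the MAE. The RMSE and MAE are given by the following formulas:
$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{obs,i}-x_{estm,i})^2}$$ Eq(8)
$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|x_{obs,i}-x_{estm,i}\right|$$ Eq(9)
where $x_{obs,i}$ is the observed value and $x_{estm,i}$ is the estimated value at time/place i, and n is the number of observations.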
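The three performance criteria can be computed with a few lines of Python, as in the sketch below. It uses the standard definitions of the Nash-Sutcliffe efficiency, RMSE and MAE; the function names and the four sample values are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def nse(obs, est):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared estimation
    errors to the squared deviations of the observations from their mean."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - e) ** 2 for o, e in zip(obs, est))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def rmse(obs, est):
    """Root mean square error."""
    return math.sqrt(sum((o - e) ** 2 for o, e in zip(obs, est)) / len(obs))


def mae(obs, est):
    """Mean absolute error."""
    return sum(abs(o - e) for o, e in zip(obs, est)) / len(obs)


# Hypothetical observed and estimated flows (illustrative only).
observed = [2.0, 4.0, 6.0, 8.0]
estimated = [3.0, 4.0, 5.0, 10.0]

# Errors are -1, 0, 1, -2: squared errors sum to 6, absolute errors to 4,
# and the observations deviate from their mean (5) by a squared total of 20.
print(round(nse(observed, estimated), 3))   # 1 - 6/20 = 0.7
print(round(rmse(observed, estimated), 3))  # sqrt(6/4) = 1.225
print(round(mae(observed, estimated), 3))   # 4/4 = 1.0
```

In this example the single error of 2 raises the RMSE above the MAE, which illustrates why the RMSE is never smaller than the MAE and is more sensitive to occasional large errors.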
Results and Discussion
Selection of dependent and independent variables
For the studied gauging stations, the correlation coefficients were determined by equation (1). These ranged from 0.4 to 0.9, indicating a fairly good relationship between gauging stations, as shown in Table 2. With regard to the gauging stations in Figure 1, nearby gauging stations or stations on the same river generally experience the same river flow pattern (Figure 2a). These showed strong correlation compared to stations which are far apart (Figure 2b). Gauging stations with strong correlation estimate the missing river flow data for each other more efficiently than those with poor correlation. This is supported by DeSilva et al. in a comparison of methods used in estimating missing rainfall data.
Table 2: Correlation coefficients between gauging stations.
Comparison of estimates
The estimates obtained from each method are compared with observed records. The suitability of a method is decided by how close the estimates and observed values are in a given time series. Several descriptive statistics of error, as indicated in Tables 3-5, are used as criteria to evaluate the efficiency of each method in estimating missing data.
|Independent variable||Dependent variable||NSE||RMSE||MAE|
|Ilongo (1KA15A)||Chitekero (IKA7A)||0.64||1.69||0.95|
|Ilongo (1KA15A)||Msembe (1KA59)||0.42||53.97||28.55|
|Ihimbu (1KA21A)||Makalala (1KA32A)||0.81||1.21||0.77|
|Makalala (1KA32A)||Ihimbu (1KA21A)||0.81||5.14||2.76|
|Makalala (1KA32A)||Ilongo (1KA15A)||0.76||2.72||1.78|
|Ilongo (1KA15A)||Igawa (1KA11A)||0.64||8.14||4.20|
|Makalala (1KA32A)||Ndiuka (1KA2A)||0.56||8.87||7.03|
|Igawa (1KA11A)||Kimani (1KA9A)||0.47||9.75||4.00|
Table 3: Results of NSE, RMSE and MAE for Linear Regression Method.
|Independent variable||Dependent variable||NSE||RMSE||MAE|
|Ilongo (1KA15A)||Chitekero (IKA7A)||0.65||1.68||0.91|
|Ihimbu (1KA21A)||Msembe (1KA59)||0.46||56.90||31.31|
|Makalala (1KA32A)||Ihimbu (1KA21A)||0.88||4.00||2.02|
|Chitekero (IKA7A)||Mtandika (1KA37A)||0.63||6.48||3.91|
|Ihimbu (1KA21A)||Ilongo (1KA15A)||0.78||2.59||1.85|
|Ilongo (1KA15A)||Igawa (1KA11A)||0.71||7.28||3.34|
|Makalala (1KA32A)||Ndiuka (1KA2A)||0.66||7.74||6.24|
|Ilongo (1KA15A)||Kimani (1KA9A)||0.47||9.75||3.97|
Table 4: Results of NSE, RMSE and MAE for Multiple Regression Method.
|Station||Recession factor, a||NSE||RMSE||MAE|
Table 5: Results of Recession factor (a), NSE, RMSE and MAE for Recession Method.
Linear and multiple regression methods: Different techniques offer different performance according to the characteristics of the initial data points. Linear and multiple regression have been used widely to estimate missing data during periods of river flow fluctuation. The values of NSE, RMSE and MAE were determined using equations (5) through (9). The results are presented in Tables 3 and 4 for the linear and multiple regression methods respectively.
From Figure 3, 1KA32A and 1KA22 were used to estimate missing data for station 1KA21A. The results show that multiple regression (Figure 3a) gives the better estimate, with NSE (0.88), RMSE (4.003) and MAE (2.023), compared to linear regression (Figure 3b) using 1KA32A, which gave NSE (0.805), RMSE (5.137) and MAE (2.758). On the other hand, there is no significant difference between the NSE values for linear and multiple regression analysis; this gives evidence that, in the absence of more than one independent variable, linear regression can be employed to estimate missing river flow data.
Also, a good estimator for one station is not necessarily a good estimator for another station. From Table 3, when 1KA32A was used to estimate missing data for 1KA21A, the efficiency of estimation (NSE) was 0.805, whereas when the same independent variable (1KA32A) was used to estimate the missing data for 1KA2A, the efficiency of estimation (NSE) was 0.558. The results are presented graphically in Figures 4a and 4b.
In addition, it was found that stations on the same river estimate each other better than stations chosen from different rivers. In Figure 5a, 1KA32A and 1KA22 on the same river (Little Ruaha River) were used to estimate missing data for 1KA21A and an NSE of 0.88 was obtained. In contrast, in Figure 5b, the gauging stations 1KA15A (River Ndembera) and 1KA11A (River Mbarali), from different rivers, were used to estimate the missing data for 1KA9A (River Kimani) and the estimate was poor, with an NSE of 0.47.
Recession method: As the literature suggests, the recession method has proven to be the best for filling missing data during periods of low flow. The values of the recession factors are shown in Table 5. The best estimate for the recession method was found at the Makalala (1KA32A) gauging station (Figure 6a), while the poorest estimate was found at the Kimani (1KA9A) gauging station (Figure 6b).
The poor estimate at the Kimani (1KA9A) gauging station is due to variability in the river flow pattern over the five-year calibration period, which leads to great variation in the recession factors during the dry period of each year.
The choice of the method used to fill missing river flow data depends on the data availability between the variables. In this study, the quality of the estimates was determined by the correlation coefficient between the stations, as well as the distances between the stations, where nearer stations were able to estimate each other better than distant ones. Also, stations located on the same stream were able to estimate each other better than stations on different streams.
In filling missing river flow data, a single gauging station may require more than one method. Multiple regression is a better method than linear regression for filling missing data during periods of high flow; however, in areas where correlated stations are scarce, linear regression can be employed.
1. Mwaruvanda W (2009) The Great Ruaha River profile. A Paper presented at the Clivet Project Inception Workshop, Blue Pearl Hotel Ubungo Plaza, Dar es Salaam, 27th November.
2. Kashaigili K, Japhet J, Kadigi RMJ, Lankford BA, Mahoo HF, et al. (2005) Environmental Flows Allocation in River Basins: Exploring Allocation Challenges and Options in the Great Ruaha River Catchment in Tanzania. Physics and Chemistry of the Earth 30: 689-697.
3. Elshorbagy AA, Panu US, Simonovic SP (2000) Group-based estimation of missing hydrological data: I. Approach and general methodology. Hydrological Sciences Journal 45: 849-866.
4. Presti RL, Barca E, Passarella G (2010) A methodology for treating missing data applied to daily rainfall data in the Candelaro River Basin (Italy). Environmental Monitoring and Assessment 160: 1-22.
5. Hasana M, Crokea B (2013) Filling gaps in daily rainfall data: a statistical approach. Mssanz. Org. Au, pp: 1-6.
6. DeSilva RP, Dayawansa NDK, Ratnasiri MD (2007) A comparison of methods used in estimating missing rainfall data. Journal of Agricultural Science 3: 101-108.
7. Mtahiko MGG, Gereta E, Kajuni AR, Chiombola EAT (2006) Towards an Ecohydrology-Based Restoration of the Usangu Wetlands and the Great Ruaha River, Tanzania. Wetlands Ecology and Management.
8. Willmott CJ, Matsuura K (2005) Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Research 30: 79-82.
9. Elshorbagy AA, Panu US, Simonovic SP (2000) Group-based estimation of missing hydrological data: II. Application to streamflows. Hydrological Sciences Journal 45: 867-880.
10. Hauke J, Kossowski T (2011) Comparison of Values of Pearson’s and Spearman’s Correlation Coefficients on the Same Sets of Data. Quaestiones Geographicae 30: 87-93.
11. Oyegoke ES, Ige ED, Oyebande L (1983) River flow forecasting through a regression model: a case study of the basement complex of western Nigeria. Proc. Hamburg Symposium, IAHS Publication 140: 419-429.
12. Bako MD, Hunt DN (1988) Derivation of Baseflow Recession Constant Using Computer and Numerical Analysis. Hydrological Sciences Journal 33: 357-367.
13. Cigizoglu HK (2003) Estimation, forecasting and extrapolation of river flows by artificial neural networks. Hydrological Sciences Journal 48: 349-362.
14. Beale EML, Little RJA (1975) Missing values in multivariate analysis. J Roy Statist Soc B 37: 129-145.
15. Johnson RA, Wichern DW (1988) Applied Multivariate Statistical Analysis. Prentice Hall, New Jersey, USA.
16. Chai T, Draxler RR (2014) Root mean square error (RMSE) or mean absolute error (MAE)? -Arguments against avoiding RMSE in the literature. Geoscientific Model Development 7: 1247-1250.
17. Doreswamy D, Vastrad CM (2013) Performance Analysis of Neural Network Models for Oxazolines and oxazoles derivatives descriptor dataset. International Journal of Information and Techniques (IJIST) 3: 0-6.
Citation: Mfwango LH, Salim CJ, Kazumba S (2018) Estimation of Missing River Flow Data for Hydrologic Analysis: The Case of Great Ruaha River Catchment. Hydrol Current Res 9:299. DOI: 10.4172/2157-7587.1000299
Copyright: © 2018 Mfwango LH, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.