Fluid Phase Equilibria 163 (1999) 21–42
www.elsevier.nl/locate/fluid

Estimation of normal boiling points of hydrocarbons from descriptors of molecular structure

Georgi St. Cholakov a, William A. Wakeham b,*, Roumiana P. Stateva c

a Department of Petroleum and Solid Fuels Processing Technology, University of Chemical Technology and Metallurgy, Sofia 1156, Bulgaria
b Department of Chemical Engineering, Imperial College of Science, Technology and Medicine, London SW7 2BY, UK
c Institute of Chemical Engineering, Bulgarian Academy of Sciences, Sofia 1113, Bulgaria

Received 12 January 1999; accepted 19 April 1999

Abstract

Correlations for estimation of thermophysical properties are needed for the design of processes and equipment related to phase equilibria. The normal boiling point (NBP) is a fundamental characteristic of chemical compounds, involved in many correlations used to estimate important properties. Modern simulation packages usually require the NBP and a standard liquid density from which they can estimate all other necessary properties and begin the design of particular processes, installations and flowsheets. The present work contributes a correlation between the molecular structure and the normal boiling point of hydrocarbons. Its main features are the relative simplicity, sound predictions, and applicability to diversified industrially important structures, whose boiling points and numbers of carbon atoms span a wide range. An achievement of particular interest is the opportunity revealed, for reducing the number of the compounds required for the derivation (the learning set), through multivariate analysis and molecular design. The high accuracy achieved by the correlation opens up a possibility for systematic studies of chemical engineering applications in which the effects of small changes are important.
This also defines a path towards the more general problem of the influence of uncertainties in calculated thermophysical parameters on the final outcome of computer aided simulation and design. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Molecular simulation; Model; Normal boiling point; Hydrocarbons

* Corresponding author. Tel.: +44-171-594-5005; fax: +44-171-594-8802; e-mail: w.wakeham@ic.ac.uk
0378-3812/99/$ - see front matter. PII: S0378-3812(99)00207-1

1. Introduction

Correlations for estimation of thermophysical properties are an important tool for design of processes and equipment, environmental impact assessment, HAZOP studies, and other important chemical engineering problems related to phase equilibria. Consequently, large commercial databases of miscellaneous properties are compiled, but have to be populated with new compounds within the limits of interpolation of the available experimental data [1]. On the other hand, methods for extrapolation of existing data are needed for assessment of compounds not yet synthesized and/or high molecular compounds for which the experimental determination is unreliable or impossible because of degradation [2].

The physical properties of chemical compounds are described by a large group of structure related characteristics, such as normal boiling point (NBP) and critical parameters. Most of these have been targeted by different correlations and approaches [3–6]. However, thermophysical properties are interrelated and an efficient strategy is to identify a suitable number of independently determined primary target parameters, which are connected to the largest possible number of properties and can be used for their computational estimation [2].

The normal boiling point (NBP) is a fundamental characteristic of chemical compounds.
It is involved in many correlations used to estimate thermophysical properties. Modern computer simulation packages usually require the NBP and a standard liquid density from which they can estimate all necessary properties and begin the design of particular processes, installations and flowsheets for their realization.

The analysis of prior work, recently reviewed by Katritzky et al. [7], shows that historically two types of empirical correlations have been developed: correlations aimed at molecules with the widest possible variation of functional groups and heteroatoms, and correlations concentrating on molecules within homologous series.

The former follow the success of the first group contribution methods [4], and the most recent ones apply electronic and graph topological descriptors [1,2]. A common feature of these correlations is that the dependent variable is a function of estimated contributions of diversified structural features, even when only one complex descriptor is incorporated in the final model [7]. They will be further referred to here as "contribution" models.

Correlations developed for homologous series usually employ the total number of C atoms or the molecular mass of the compounds with adjustable constants [5,8]. Gasem et al. [9] recently suggested the abbreviation ABC (Asymptotic Behavior Correlations) for such models. Marano and Holder [10] proposed a generalization for all ABCs and developed such correlations for a large number of thermophysical properties of several homologous series [11]. It has also been shown that ABCs can be developed with graph topological indices [12] and molecular energy descriptors [6]. Theoretical explanations have been suggested to relate quantum chemical descriptors to the thermodynamic properties of polar molecules [2,13]. The lattice fluid model [14–16] and the cell model [17] have been used to explain ABCs [11].
A common feature of ABCs is that the dependent variable is a non-linear function with several adjustable constants describing the relations between repeated segments of the molecules (mers) and empty "holes" (lattice-fluid models) or mers and free volume (cell models). They will be further referred to as "mers" models.

The advantages and disadvantages of the two approaches have been well documented by the respective authors. From a practical point of view, there is clearly a need for a compromise between the high accuracy but limited functionality of the "mers" models, and the low accuracy and widely varied functionality of the "contribution" models. The present work is an attempt to find such a compromise. Furthermore, it is devoted to the investigation of the correlation power of molecular descriptors estimated with conventional programmes for computer simulation of molecular mechanics. These are considered as a potential tool for enhancing the capabilities of the simulation packages widely used nowadays for computer aided chemical engineering design. A third objective of the present work is to explore opportunities to reduce the size of the data set upon which the derivation of the correlation is based (the learning set), since the databases employed in contribution models are becoming increasingly large. Finally, an object of the study was the evaluation of the extrapolation predictive power of such correlations for outlying molecules of industrial importance.

2. Methodology

The development of any correlation relies on a database including the objects of interest (molecular structures in the present context), and relevant known properties of these objects (descriptors of the molecular structures). Independent variables defined from the database have to be correlated to a set of dependent characteristics of functional interest (NBPs) with the help of a suitable modelling technique.
The predictive power of a correlation is usually confined to the space defined by the constraints of its derivation, although in the specific case of molecular modelling some extrapolation to structurally related outlying molecules might be possible at the cost of higher error.

Experimental values for low and moderate NBPs of industrially important compounds are usually available from many sources. Higher boiling points are determined in vacuum, and may be recalculated for normal conditions if a pressure–temperature relation suitable for the particular group of compounds is available. For many compounds, however, the latter relations have not been studied, and the amount of experimental data even at reduced pressure is limited.

2.1. Database

The design of the database of relevant compounds is perhaps the most important step in the derivation of statistical correlations. The weighting of different groups presented in the database directly influences the subsequent modelling [18]. The database should contain all relevant structural features of the modelled groups of compounds, but it should be emphasized again that the relative representation of those groups influences the uniformity of the prediction for the different groups of objects.

Several features were sought from the database used in the present study, in order to achieve representation of the main structures, and the possibility for extrapolation of the predictions towards three industrially important high molecular hydrocarbons with unknown NBPs (lycopene, β-carotene and 1,2-benzo[a]pyrene), chosen as an example.
These are: - systematic change of properties within several homologous series, since any compound may be viewed as a member of some appropriate series; - presence of series of branched hydrocarbons with increasing numbers of double bonds, cycloalkanes and terpenoids with known NBPs, which might be extrapolated towards high molecular terpenoids; - presence of series of hydrocarbons differing by one aromatic ring, which might be extrapolated towards benzopyrenes; - presence of a control set of compounds with complex molecular structure, estimated by other authors, to be used for comparison with the present study. G. St. CholakoÕ et al.r Fluid Phase Equilibria 163 (1999) 21–42 24 Table 1 Hydrocarbons included in the database No. b 1 2 3 4 5 6 7 8 9 10 11 12 13a 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35b 36 37 38b 39 40a 41 42 43 44 45 46 47 Name No. Name No. Name ethane propane n-butane n-pentane n-hexane n-heptane n-octane n-nonane n-decane n-undecane n-dodecane n-tridecane n-tetradecane n-pentadecane n-hexadecane n-heptadecane n-octadecane n-nonadecane n-eicosane n-heneicosane n-docosane n-tricosane n-tetracosane n-pentacosane n-hexacosane n-heptacosane n-octacosane n-nonacosane n-triacontane n-dotriacontane n-pentatriacontane n-hexatriacontane n-tetracontane n-tetratetracontane n-hexacontane i-butane 2-methylbutane 2,2-dimethylpropane 2-methylpentane 3-methylpentane 2,2-dimethylbutane 2,3-dimethylbutane 2-methylhexane 3-methylhexane 2,2-dimethylpentane 2,3-dimethylpentane 2,4-dimethylpentane 48 50 51 52 53 54 55 56 57 58 59 60 61 62 63a 64 65 66 67 68 69 70 b 71 72 73 a 74 b 75 76 77 78 79 80 81 82 83a 84a 85a 86b 87b 88a 89 90 91 92 93 94 95 3,3-dimethylpentane 2,2,3-trimethylbutane 2-methylheptane 3-methylheptane 4-methylheptane 2,2-dimethylhexane 2,3-dimethylhexane 2,4-dimethylhexane 2,5-dimethylhexane 3,3-dimethylhexane 3,4-dimethylhexane 3-ethylhexane 2,2,3-trimethylpentane 2,2,4-trimethylpentane 2,3,3-trimethylpentane 
2,3,4-trimethylpentane 2-methyl-3-ethylpentane 3-methyl-3-ethylpentane 2,2,3-trimethylhexane 2,2,4-trimethylhexane 2,2,5-trimethylhexane 3,3-diethylpentane 2,2,3,3-tetramethylpentane 2,2,3,4-tetramethylpentane 2,2,4,4-tetramethylpentane 2,3,3,4-tetramethylpentane 2-methyloctane 2-methylnonane 3,3,5-trimethylheptane 2,2,3,3-tetramethylhexane 2,5-dimethyldecane 2,5,-dimethyldodecane 2,6,10-trimethyldodecane 2,6,10-trimethyltetradecane pristane phytane squalane lycopane propylene 1-butene 1-pentene 1-hexene 1-heptene 1-octene 1-nonene 1-decene 1-undecene 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 b 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 b 140 141 142 143 b 1-dodecene 1-tridecene 1-tetradecene 1-pentadecene 1-hexadecene 1-heptadecene 1-octadecene 1-nonadecene 1-eicosene 1-heneicosene 1-docosene 1-tricosene 1-tetracosene 1-pentacosene 1-hexacosene 1-heptacosene 1-octacosene 1-nonacosene 1-triacontene 1,3-butadiene c-2-butene t-2-butene i-butene isoprene 2,3-dimethyl-1-butene 2,3-dimethyl-2-butene 2-ethyl-1-butene c-2-hexene t-2-hexene 2-methyl-1-pentene 4-methyl-1-pentene 2,4,4-trimethyl-1-pentene 2,4,4-trimethyl-2-pentene 2-methyl-1-butene 2-methyl-2-butene 3-methyl-1-butene 2,3,-dimethyl-butadiene 3,3-dimethyl-1-butene 2-methyl-2-pentene 3-methyl-1-pentene 1,5-hexadiene limonene a-pinene lycopene b-carotene cyclopropane cyclobutane G. St. CholakoÕ et al.r Fluid Phase Equilibria 163 (1999) 21–42 25 Table 1 Žcontinued. No. Name No. Name No. 
Name 144 a 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168b 169 170 171 172 173 174 175 176b 177 178 179 180 181 182 cyclopentane cyclohexane cycloheptane cyclooctane methylcyclohexane ethylcyclohexane propylcyclohexane butylcyclohexane methylcyclopentane ethylcyclopentane propylcyclopentane butylcyclopentane pentylcyclopentane hexylcyclopentane heptylcyclopentane octylcyclopentane nonylcyclopentane decylcyclopentane undecylcyclopentane dodecylcyclopentane tridecylcyclopentane tetradecylcyclopentane pentadecylcyclopentane hexadecylcyclopentane heptadecylcyclopentane octadecylcyclopentane nonadecylcyclopentane eicosylcyclopentane heneicosylcyclopentane docosylcyclopentane tricosylcyclopentane tetracosylcyclopentane pentacosylcyclopentane c-1,2-dimethylcyclohexane t-1,2-dimethylcyclohexane c-1,3-dimethylcyclohexane t-1,3-dimethylcyclohexane c-1,4-dimethylcyclohexane t-1,4-dimethylcyclohexane 183 184 185 186 187 188b 189 190 191 192 193 194 195 196 197 198 199 200a 201 202 203 204 205b 206 207 208 209 210 211 212 213 214 215 216b 217 218 219 220 221 cyclopentene cyclohexene 1,3-cyclohexadiene 5-methyl-1,3-cyclopentadiene 1,3-cyclopentadiene benzene toluene ethylbenzene propylbenzene butylbenzene o-xylene m-xylene p-xylene 1-methyl-3-ethylbenzene pentylbenzene hexylbenzene heptylbenzene octylbenzene nonylbenzene decylbenzene undecylbenzene dodecylbenzene tridecylbenzene tetradecylbenzene pentadecylbenzene hexadecylbenzene heptadecylbenzene octadecylbenzene nonadecylbenzene eicosylbenzene heneicosylbenzene docosylbenzene tricosylbenzene tetracosylbenzene styrene a-methylbenzene cumene o-ethyltoluene p-ethyltoluene 222 b 223 224 225 226 227 228 230 231 232 233 234 235 236 237 238 239 b 240 b 241 242 243 b 244 245 246 247 248 a 249 a 250 a 251 252 a 253 a 254 a 255a 256 a 257 a 258 a 259 a 260 a 261a mesitylene 1,2,3-trimethylbenzene 1,2,4-trimethylbenzene 1,2,3,4-tetrahydronaphtalene t-butylbenzene p-cymene m-diethylbenzene 
i-butylbenzene m-diisopropylbenzene diphenylmethane m-ethyltoluene s-butylbenzene p-diethylbenzene p-diisopropylbenzene diphenyl 1,1-diphenylethane 1,2-diphenylethane naphthalene anthracene phenanthrene m-terphenyl p-terphenyl 1,2-benzo[a]pyrene pyrene chrysene o-terphenyl triphenylmethane acenaphthylene acenaphthene 1,1,2,2-tetraphenylethane 4-methyloctane 2,2,3,3-tetramethylbutane 2-ethylhexene adamantane 1,5-cyclooctadiene 2,5-methyl-1,5-hexadiene c-1-propenylbenzene 1-phenylnaphthalene indane

a Members of the control set.
b Members of the designed learning set of 20 hydrocarbons.

The names of the compounds selected for the database used in this study are presented in Table 1, with their published NBPs listed in Table 6. The objects have been limited to hydrocarbons in order to achieve a reasonable representation of the functional groups of these fundamental compounds. The homologous series included also allow a "mers" influence to be expressed in the modelling. This approach follows from one of the objectives of the present work, namely to find a compromise between the breadth of the functionality of molecular structures and the precision achieved in their estimation. It has also been justified by recent prior work [7,19].

The present database of 261 hydrocarbons was compiled from several sources [4,19–24]. The data for the normal alkanes with more than 30 carbon atoms were calculated by a "mers" correlation [25]. Three hydrocarbons with unknown NBPs were included in the database as an illustration of the case when objects of industrial importance have to be evaluated as outliers. Such molecules are often referred to as "hypotheticals". Lycopene and β-carotene are industrially important constituents of natural products; 1,2-benzo[a]pyrene is a carcinogenic hydrocarbon often used as a reference in ecological studies.
Most of the hydrocarbons are identical with those used in the most recent correlation for description of NBPs of hydrocarbons [19]. The values for some of the hydrocarbons, mainly in the control set, were recalculated for normal conditions from vacuum data, which were considered more reasonable.

The limits for the main hydrocarbon series and structures included in the database, which also determine the boundaries for the predictive ability of the derived models, may be assessed from Tables 1 and 6, but are more clearly outlined by the dependence of the predicted points on the total number of carbon atoms (Fig. 2), and the scatterplot of the first two principal components (Fig. 3). NBPs are varied in the widest practical range, from 184.5 to 877.5 K. The total number of carbon atoms spans from 2 to 60 for the n-alkanes, from 3 to 40 for the series finishing with β-carotene, and up to 30 for the rest of the homologous series.

2.2. Descriptors

Two types of descriptors were employed in the present investigation.

Molecular energy descriptors were evaluated with a conventional computer programme for molecular mechanics simulation, based on the MMX modification of the MM2 method [26]. In such programmes a structure is considered a collection of atoms held together by elastic (harmonic) forces (bonds), which constitute the force field. The calculations start with a structure with relevant default values of parameters, and its optimized geometry is found by iterative minimization of its total steric energy. Further refinement of the energy contributions may be achieved by assigning more accurate values for the starting force constants and/or applying several programmes with different sophistication for gradual assessment of the more intimate structural elements, or specific programmes designed to target particular structural features [1,2,6]. Such refinement of the molecular energy descriptors used in the present study has been deliberately avoided.
For the practical purposes of the present study, the minimised molecular energy models of all 261 molecules were obtained with a conventional programme for molecular mechanics simulation, and the contributions of different energies in the minimized models were tested as descriptors. An illustration of the molecular energy descriptors for adamantane is presented in Fig. 1. The names and codes of the descriptors are given in Table 2.

Carbon atom descriptors of various levels of sophistication can be used. The highest level of sophistication presently available comprises the graph topological indices, derived from the adjacency and distance matrices of a chemical structure [12]. More than 120 such indices have been suggested. The latest versions can evaluate 3D structural information [27], and many of them have been involved in correlations with thermophysical properties and characteristics [12].

Fig. 1. The minimized energy model and the 15 molecular descriptors of adamantane. (Dimensions are given as estimated by the molecular mechanics simulation programme: in kcal mol⁻¹, Å³ mol⁻¹, etc.) Total energy (Etot): 34.616; Stretch energy (Estr): 1.188; Bond energy (Ebnd): 10.273; Stretch–bend energy (Es–b): −0.292; Torsion energy (Etor): 16.524; Van der Waals energy (Evdw): 6.924; Dipole-charge interaction energy (Edch): 0.000; Electric dipole moment (DM): 0.000; Standard enthalpy (Hf): −14.00; Strain energy (Este): 27.30; Van der Waals volume (Vvdw): 254 Å³; Molar volume (VM): 152 cm³; Total van der Waals surface (Stot): 174.82 Å²; Saturated van der Waals surface (Ssat): 174.82 Å²; Unsaturated van der Waals surface (Sunsat): 0.00 Å².

The lowest level of sophistication of carbon atom descriptors is to use the numbers of atoms engaged in specific groups (atom counts).
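As a small illustration of how a graph topological index follows from the distance matrix of a structure, the sketch below computes the Wiener index W, taken here in its standard definition as the sum of shortest-path distances (counted in bonds) over all pairs of carbon atoms of the hydrogen-suppressed skeleton. The function name and the adjacency-list encoding are illustrative assumptions and do not reproduce the software used in this study.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path bond counts over all unordered pairs of atoms.

    `adjacency` maps each carbon atom label to the list of its bonded
    carbon neighbours (hydrogen-suppressed molecular graph).
    """
    total = 0
    for source in adjacency:
        # Breadth-first search gives topological distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in dist:
                    dist[neighbour] = dist[atom] + 1
                    queue.append(neighbour)
        total += sum(dist.values())
    # Every pair was visited from both ends, so halve the sum.
    return total // 2


# Carbon skeletons of two C4 isomers (illustrative encodings).
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(wiener_index(n_butane), wiener_index(isobutane))
```

Branching shortens the average path between atoms, so the branched isomer receives the smaller index; this is the kind of structural information that atom counts alone cannot distinguish.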
Atom counts are generically related to the group contributions, which multiply the particular number of atoms by empirically assigned constants. In the present investigation we have chosen the carbon atom descriptors presented in Table 3. For the most part they coincide with the descriptors in the recent Joback group contribution model [4].

Table 2
Descriptors from simulated molecular mechanics

No.  Description                              Code
1    Total energy                             Etot
2    Stretch energy                           Estr
3    Bond energy                              Ebnd
4    Stretch–bend energy                      Es–b
5    Torsion energy                           Etor
6    Van der Waals energy                     Evdw
7    Energy of "dipole-charge" interaction    Ed-ch
8    Electrostatic dipole moment              DM
9    Standard heat of formation               Hf
10   Strain energy                            Est
11   Van der Waals volume                     Vvdw
12   Molar volume                             VM
13   Total Van der Waals surface              St
14   Saturated Van der Waals surface          Ssat
15   Unsaturated Van der Waals surface        Sunsat

Table 3
Carbon atom descriptors and molecular mass

No.  Name                                                Code
16   Total number of C atoms                             Ctot
17   Number of C atoms in CH3 groups                     N_CH3
18   Number of C atoms in aliphatic CH2 groups           N^a_CH2
19   Number of C atoms in aliphatic CH groups            N^a_CH
20   Number of C atoms in aliphatic C groups             N^a_C
21   Number of C atoms in aliphatic CH2=CH2 groups       N^a_DCH2
22   Number of C atoms in aliphatic CH=CH groups         N^a_DCH
23   Number of C atoms in aliphatic C= groups            N^a_DC
24   Total number of C atoms in aliphatic double bonds   DBA
25   Number of C atoms in cyclic CH2 groups              N^c_CH2
26   Number of C atoms in cyclic CH groups               N^c_CH
27   Number of C atoms in cyclic C groups                N^c_C
28   Number of C atoms in cyclic CH= groups              N^c_DCH
29   Number of C atoms of cyclic C= groups               N^c_DC
30   Total number of C atoms in cyclic double bonds      DBC
31   Molecular mass                                      M

Because of the success of prior work with topological indices, a limited number of them is also tested in the present work.
This number includes: the Wiener index (W); the Balaban index; the Bonchev and Trinajstic information content and mean information content of the unit distances, and information content and mean information content of the distances' distribution indices (known respectively as IWD, IWDM, IED and IEDM); the cyclomatic number (μ); and the Randic path connectivity indices (CHI) up to third order terms. The meaning and methods of calculation of these indices have been extensively reviewed elsewhere [12,28]. An additional descriptor, the gravitation index (GI), successfully employed lately for the evaluation of NBPs [7,29], was also tested. Where applicable, the descriptors were calculated not only with unit distances, but also with distances between bonded atoms, obtained from the structures minimized by molecular mechanics simulation.

The total number of descriptors, including the topological ones, as well as molecular mass, amounted to 59. This is a relatively low number as compared to the most recent description of NBPs of databases including heterocompounds [7], in which more than 800 descriptors are tested, or the study of Wessel and Jurs [19], confined only to hydrocarbons, which selected among 81 descriptors. It should be emphasized, though, that one of the objectives of the present investigation was to keep the methods for calculation and the meaning of descriptors as simple as possible. That is why electronic descriptors and complex functions of one or more descriptors were deliberately avoided.

2.3. Modelling

A conventional "stepwise" multiple regression procedure [30] was employed to select the most influential variables from the 59 descriptors and determine their optimal number. This procedure is a subjective molecular feature selection [19], in which the dependent variable is used to develop models in the form:

\[ \mathrm{NBP}_j = b_0 + \sum_{i=1}^{n} b_i X_{ij} \tag{1} \]

where NBP_j is the normal boiling point of the compound j, b_0 is the intercept term, b_i is the coefficient for descriptor X_{ij}, and n is the number of descriptors in the model.

A linear contribution of the structural descriptors was adopted for all variables, except for a nonlinear "mers"-type independent variable (total number of carbon atoms, Ctot). The latter is successfully used in ABC correlations for homologous series [11]. We assume that each molecule may be considered a member of some homologous series. The boiling points of the molecules would then lie on a family of curves, different for each series, but asymptotically dependent on the "mers" variable, Ctot. The distances between the curves in the family would then be accounted for by the linearly contributing independent variables, which would reflect the specific features of the particular mers of molecules belonging to different series. These assumptions are illustrated in Fig. 2, which presents the boiling points predicted by one of our models as a function of the total number of C atoms. The experimental temperatures obey the same dependence. In a later section of the paper we shall show that this concept for the structure of the model is successful.

Fig. 2. Asymptotic dependence of the normal boiling points predicted by the M-20 model (Tpred M-20, K) on the total number of C atoms (Ctot).

The algorithm, used also in this paper, for obtaining the best models with an increasing number of independent variables has been described in detail elsewhere [29]. The targeted representation of the published NBPs was set as a mean standard deviation of relative errors of 2.1%. This target follows from the observation that the experimental uncertainties in the DIPPR database for relevant molecules are around 2.1% [1]. Thus, we use the DIPPR estimated uncertainty as a reasonable figure to aim for in the representation. Ref.
[1] suggests also that one descriptor of a pair with a pairwise correlation ≥ ±0.95 should be discarded. Later work has gone much below that limit, but for hydrocarbons especially this does not seem to be practical. That is why, for the present work, the limit of pairwise correlations was set at ±0.85.

As in other similar studies, the compounds in the database were divided into a learning set and a control set (Table 1). The compounds in the control set were not used in the derivation of the model. They were chosen mainly from the latest and most successful work on NBPs of hydrocarbons [19]. An attempt was made to predict the boiling points of nearly all compounds from [19] which were reported to be difficult for prediction. Compounds with triple bonds, which are not present in our database, were omitted. Our control set also includes three terpenoids, the boiling points for which were among the few obtained only from the work of Bogomolov et al. [23], and could not be compared with other sources.

3. Results and discussion

The model derived from the learning set with 235 hydrocarbons (M-235) is presented in Table 4. It can be seen from this table, and the predictions of M-235, given in Table 6, that the discrepancies from the published values fall even below the average relative deviation targeted in the present study.

Table 4
Model (M-235), derived from the learning set with 235 molecules. N = 235; Standard residual error = 4.95 K. Coefficient of multiple correlation: 0.999. Calculated Fisher's criterion: 24279.2

Independent variables   Coefficients   Standard deviation (±)   F criterion for removal from model
Ctot^0.678              142.51096      1.22788                  9999.99
N^a_CH2                 4.96750        0.19820                  628.15
N^a_C                   −8.24519       1.06216                  60.26
N^c_DCH                 −3.55218       0.34924                  103.45
Etot                    1.45273        0.10757                  182.38
Ebnd                    −2.05695       0.19640                  109.70
Vvdw                    −1.04224       0.01463                  5075.83
Sunsat                  −0.37320       0.02613                  203.99
Intercept               46.97246       2.20274                  –
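To make the structure of the model concrete, the following sketch evaluates the M-235 correlation with the coefficients of Table 4: one nonlinear term in the total number of carbon atoms plus linear descriptor contributions. The dictionary keys and the function name are illustrative. The adamantane inputs for Etot, Ebnd, Vvdw and Sunsat are the values quoted in Fig. 1, while its atom counts (ten carbons, all in rings, hence zero aliphatic CH2, aliphatic C and cyclic CH= carbons) are our reading of the structure rather than values taken from the tables. Whether exactly these numbers were used as model inputs is an assumption, so the result only illustrates the arithmetic.

```python
# Coefficients of the M-235 model as listed in Table 4.
M235 = {
    "Ctot_pow": 142.51096,  # multiplies Ctot ** 0.678
    "N_a_CH2": 4.96750,     # C atoms in aliphatic CH2 groups
    "N_a_C": -8.24519,      # C atoms in aliphatic C groups
    "N_c_DCH": -3.55218,    # C atoms in cyclic CH= groups
    "Etot": 1.45273,        # total energy
    "Ebnd": -2.05695,       # bond energy
    "Vvdw": -1.04224,       # van der Waals volume
    "Sunsat": -0.37320,     # unsaturated van der Waals surface
    "intercept": 46.97246,
}
CTOT_EXPONENT = 0.678
LINEAR_TERMS = ("N_a_CH2", "N_a_C", "N_c_DCH", "Etot", "Ebnd", "Vvdw", "Sunsat")


def predict_nbp(descriptors, coef=M235, exponent=CTOT_EXPONENT):
    """Return the boiling point (K) from Eq. (1) with a power-law Ctot term."""
    nbp = coef["intercept"] + coef["Ctot_pow"] * descriptors["Ctot"] ** exponent
    for name in LINEAR_TERMS:
        nbp += coef[name] * descriptors[name]
    return nbp


# Adamantane: energies, volume and surface from Fig. 1; atom counts assumed.
adamantane = {
    "Ctot": 10, "N_a_CH2": 0, "N_a_C": 0, "N_c_DCH": 0,
    "Etot": 34.616, "Ebnd": 10.273, "Vvdw": 254.0, "Sunsat": 0.0,
}

print(round(predict_nbp(adamantane), 1))
```

Passing the coefficients as an argument keeps the same function usable for a re-fitted variant of the model with the same set of variables.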
In the context of the uncertainties of the DIPPR database [19], and from the point of view of practical application of this correlation, such precision has no particular merit. However, this success of the description allows us to address the third objective of the study, which is to reduce the number of the compounds in the learning set, because a reasonable loss of precision can be tolerated.

In the first attempt to reduce the number of the compounds in the learning set, an approach widely used in similar studies (for instance in [1]) was adopted. The compounds left in the learning set were selected to give a fair representation of the main groups and structures in the database. This approach, however, cannot be based on anything more than a general perception of homologous series. As such it might be inadequate for complex hybrid structures, which cannot be assigned to a particular group.

A model was derived from half of the learning set (116 compounds), chosen following the above approach. It showed a 1.02 K higher standard residual error, but the same set of variables was selected. The analysis of this model indicated that the key to successful description is the uniformity of the distribution of the compounds of the learning set over the space occupied by the database, rather than their number.

Our next task then was to find a tool for the statistically based selection of a learning set, and to determine the minimum number of its members. This task has been addressed by multivariate analysis of the database and statistical molecular design of the learning set.

3.1. Multivariate analysis and molecular design of the learning set

The input data of the database consist of columns (descriptors) and rows (objects, i.e. molecules). With the help of multivariate analysis the information contained in the database may be connected to the dependent variable (NBP) in two ways: either with the columns as independent variables, or with the principal components (factors)
as independent latent variables w31x. The principal components of the data provide several additional opportunities for data manipulation. When used as latent variables in the modelling, problems with collinearity between the original variables can be solved. Partial least square ŽPLS. regression may be applied with a smaller number of latent variables. The latter are usually factors with eigenvalues higher than 1, if cross validation shows that they can account for a sufficient portion of the variance in the data. The objects Ž molecules in our case. may be projected onto the plane formed by the first two most influential principal components, and the points on the resulting scatterplot will represent the molecules of the database. This suits our above defined task, since the placement of the molecules of the learning set can be chosen directly from this scatter plot. Information about the principles of multivariate analysis and its application in characterization of quantitative structure-properties relationships can be found elsewhere w32,33x. Its application for statistical molecular design is presented in Ref. w34x. Fig. 3 shows the scatterplot of the molecules of the whole database, and the selection of the molecules in the new learning set. It depicts also the molecules in the control set. It may be seen from the figure that the database is not ideally balanced. The points are situated approximately in a triangle. The homologous series are placed on linear loci. The three hypotheticals can be classified as close outliers, situated close to the right hand side of the triangle, which is formed, however, by only two molecules with known temperatures. The learning set, selected from the scatterplot, is distributed to cover the sides and the center of the triangle. 32 G. St. CholakoÕ et al.r Fluid Phase Equilibria 163 (1999) 21–42 Fig. 3. Selection of the learning set with 20 hydrocarbons Ž20P LEARN SET. 
from the scatter plot of the first two principal components of all molecules (ALL). The molecules in the control set (CONT. SET) are also shown. The numbers and the names of the molecules selected for the learning and the control sets are given in Table 1.

The learning set presented in Fig. 3 consists of 20 molecules. This may be considered the minimum number of members of the set as selected by statistical molecular design. In the derivation of the M-235 model the descriptors were used as independent variables. The factor analysis of the matrix consisting of the values of the independent variables selected for this model showed that three factors have eigenvalues higher than 1. Together with a fourth factor with an eigenvalue close to one they account for 95.8% of the variation in the database. If a model with 4 latent variables is constructed through PLS regression, the minimal number of experimental data follows the $2^k = 16$ points requirement of statistical designs (where $k$ is the number of factors), plus at least four additional points [34]. When applied to our present task these rules determine the minimum number of points in the learning set at twenty. As already explained, the points should be evenly distributed to cover the factor space. In molecular design exact distribution according to a statistical protocol can rarely be achieved, because data for particular structures may not be available or may not be accessible experimentally, as also illustrated by Fig. 3.

A model was further derived with the new learning set of 20 molecules. This model (M-20) is presented in Table 5, and its predictions for the original learning set of 235 compounds are given in Table 6. It is not really a new model, but a variation of the M-235 model. Its independent variables are the same as in M-235, and only its coefficients are derived from the set of 20 molecules. These coefficients will be less sensitive to the distribution of data in the database than the coefficients of M-235.
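The rule that fixes the learning set at twenty molecules can be illustrated with a short sketch. The Python below is an illustration under stated assumptions, not the authors' procedure: the eigenvalue list is hypothetical, chosen only so that three eigenvalues exceed 1, a fourth is close to 1, and the four together explain 95.8% of the variance of eight standardized descriptors. The function names and the tolerance `near_one_tol` are likewise assumptions.

```python
from typing import List, Sequence

# Hypothetical eigenvalues of the correlation matrix of eight descriptors.
# They are NOT taken from the paper; they only mimic its description
# (three values > 1, a fourth close to 1, 95.8% of the variance explained).
EIGENVALUES = [4.10, 1.60, 1.05, 0.914, 0.20, 0.08, 0.04, 0.016]


def select_factors(eigenvalues: Sequence[float], near_one_tol: float = 0.1) -> List[int]:
    """Indices of factors whose eigenvalue exceeds 1 - near_one_tol."""
    return [i for i, ev in enumerate(eigenvalues) if ev > 1.0 - near_one_tol]


def explained_fraction(eigenvalues: Sequence[float], selected: Sequence[int]) -> float:
    """Fraction of the total variance carried by the selected factors."""
    return sum(eigenvalues[i] for i in selected) / sum(eigenvalues)


def min_learning_set_size(k: int, extra_points: int = 4) -> int:
    """Two-level design over k factors (2**k points) plus extra points."""
    return 2 ** k + extra_points


if __name__ == "__main__":
    factors = select_factors(EIGENVALUES)
    share = 100 * explained_fraction(EIGENVALUES, factors)
    print(len(factors), round(share, 1), min_learning_set_size(len(factors)))
```

With four retained factors the size rule gives 2^4 + 4 = 20 points, matching the minimum stated in the text. How those points are spread over the factor space is a separate design step that this sketch does not cover.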
The standard residual error of M-20 is 5.46 K, which compares favourably with that of the model derived from 116 molecules (standard residual error of 6.07 K).

Table 5
Model derived from the designed learning set with 20 molecules (M-20). N = 20; Standard residual error: 5.46 K. Coefficient of multiple correlation: 0.999. Calculated Fisher's criterion: 3221.25

Independent variable   Coefficient   Standard deviation   F criterion for removal from model
N_C^0.675              141.66807     2.83713              2493.36
N^a_CH2                  3.78832     0.44837                71.39
N^a_C                   -9.94330     4.29798                 5.35
N^c_DCH                 -3.89729     0.99589                15.31
E_tot                    1.46310     0.24684                35.13
E_bnd                   -2.36292     0.32912                51.54
V_vdw                   -0.92884     0.03353               767.43
S_unsat                 -0.33671     0.08398                16.08
Intercept               40.41170     4.93330                 –

It seems appropriate to repeat here that M-20 is only an improvement of the original M-235 model. The success of this attempt, however, illustrates an approach which may be explored in the future for establishing correlations from a limited amount of experimental data. This opportunity is very important, since the databases of the latest similar investigations seem to become larger and larger. Katritzky et al. [7], for instance, employ 612 compounds to represent heterostructures, while Wessel and Jurs [19] used 356 compounds in their hydrocarbons-only study.

3.2. Prediction of the control set and hypotheticals

The last objective of the present study was to evaluate the extrapolative predictive power of the above described models. The values of the NBPs predicted by M-235 and M-20 for the control set and the "hypothetical" molecules are presented in Table 7. Values for the boiling points of the hypotheticals, estimated by asymptotic behaviour correlations (ABCs) [10], are also included. The table shows the published NBPs of all control compounds, and NBPs calculated from published boiling points of the same compounds under reduced pressure. The latter were considered more reliable for two reasons.
First, some of the experimental results were obtained more than 50 years ago with the technique and expertise then available. Secondly, for components whose NBPs are higher than the decomposition temperature the measurement must have been performed at reduced pressure, and recalculated. As seen from Table 7, most of the NBPs of the "difficult" compounds are predicted by M-235 and M-20 with a relative error of less than 5% from the published value, if the calculated boiling points for several particular compounds are chosen. There are two compounds which are predicted with a distinctly higher error. One of them, adamantane, has a very complex structure (Fig. 1), which obviously is not well represented in the database. The other one, 1,1,2,2-tetraphenylethane, even with a recalculated NBP falls closer to the aliphatic homologous series (Fig. 2) than to the non-alkylated aromatics, as might be expected. Its published NBP is drastically different from the predicted values.

Table 6
Predictions of the M-235 and M-20 models for the original learning set of 235 hydrocarbons
Statistics for the predictions of the M-235 model: Mean standard deviation of absolute errors = 4.86 ± 0.35 K; Min absolute error = -13.4 K; Max absolute error = +15.1 K (7 points out of a ±11 K error range); Mean standard deviation of relative errors = 1.15 ± 0.07%; Min relative error = -4.35%; Max relative error = +3.69% (9 points out of a ±2.4% error range).
Statistics for the predictions of the M-20 model: Mean standard deviation of absolute errors = 5.51 ± 0.36 K; Min absolute error = -15.1 K; Max absolute error = +26.3 K (10 points out of a ±11 K error range); Mean standard deviation of relative errors = 1.27 ± 0.08%; Min relative error = -3.84%; Max relative error = +3.92% (13 points out of a ±2.4% error range).
Columns: No. of compound; Tpubl (K); Tpred M-235 (K); Tpred M-20 (K); Absolute Error M-235 (K); Absolute Error M-20 (K);
Relative Error M-235 (%); Relative Error M-20 (%).
No. of compound: 1 2 3 4 5 6 7 8 9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
Tpubl (K): 184.6 231.1 272.7 309.2 341.9 371.6 398.8 424.0 447.3 469.0 489.4 508.6 543.8 559.9 575.3 589.9 603.8 616.9 629.7 641.8 653.4 664.4 675.0 685.4 695.3 704.8 713.9 722.9 736.7 758.8 765.6 790.8 812.7 877.5 261.4 301.0 282.6 333.4
Tpred M-235 (K): 184.6 230.7 270.4 307.1 339.0 369.0 395.6 421.2 444.1 462.3 483.4 499.1 534.8 551.9 568.0 583.5 598.3 613.4 624.9 635.8 650.4 664.6 670.6 680.6 691.1 702.4 709.1 718.6 736.5 759.7 764.2 795.7 814.2 872.3 261.2 298.8 287.4 331.8
Tpred M-20 (K): 184.9 230.9 270.4 306.7 338.4 368.0 394.3 419.7 442.3 460.7 481.5 497.5 532.8 549.7 565.7 581.0 595.7 610.5 622.1 633.2 647.5 661.4 667.8 677.9 688.4 699.5 706.6 716.0 733.9 757.3 762.1 793.4 812.8 873.9 263.5 300.6 288.4 333.2
Absolute Error M-235 (K): -0.1 0.3 2.2 2.1 2.9 2.6 3.2 2.7 3.2 6.7 6.1 9.5 9.0 8.1 7.3 6.4 5.4 3.5 4.8 6.0 2.9 -0.2 4.5 4.8 4.1 2.3 4.8 4.3 0.2 -0.9 1.4 -4.9 -1.5 5.2 0.2 2.2 -4.8 1.6
Absolute Error M-20 (K): -0.3 0.1 2.2 2.5 3.5 3.6 4.5 4.3 5.0 8.3 7.9 11.1 10.9 10.2 9.6 8.9 8.0 6.4 7.5 8.6 5.9 3.0 7.2 7.5 6.9 5.2 7.4 6.9 2.8 1.5 3.5 -2.6 -0.1 3.6 -2.1 0.4 -5.8 0.2
Relative Error M-235 (%): 0.0 0.1 0.8 0.7 0.8 0.7 0.8 0.6 0.7 1.4 1.2 1.9 1.7 1.4 1.3 1.1 0.9 0.6 0.8 0.9 0.4 0.0 0.7 0.7 0.6 0.3 0.7 0.6 0.0 -0.1 0.2 -0.6 -0.2 0.6 0.1 0.7 -1.7 0.5
Relative Error M-20 (%): -0.2 0.1 0.8 0.8 1.0 1.0 1.1 1.0 1.1 1.8 1.6 2.2 2.0 1.8 1.7 1.5 1.3 1.0 1.2 1.3 0.9 0.5 1.1 1.1 1.0 0.7 1.0 0.9 0.4 0.2 0.5 -0.3 0 0.4 -0.8 0.1 -2.1 0.1

Table 6 (continued)
No. of compound: 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 64 65 66 67 68 69 70 71 72 74 75 76 77 78 79 80 81 82 86 87 89 90 91 92
Tpubl (K): 322.9 331.1 363.2 365.0 352.3 362.9 353.6 359.2 366.6 354.0 390.8 392.1 390.9 380.0 388.8 382.6 382.3 385.1 390.9 391.7 383.0 372.4 386.6 388.8 391.4 406.8 399.7 397.2 419.3 413.4 406.2 414.7 416.4 440.2 428.9 433.3 471.3 506.8 526.2 558.2 769.0 225.4 303.1 336.6 366.8 394.4
Tpred M-235 (K): 324.1 328.7 363.9 366.1 353.2 360.7 355.2 358.6 368.4 351.0 388.4 392.7 390.7 379.7 387.3 389.0 384.2 384.8 390.7 398.9 383.7 377.9 385.9 390.2 393.6 406.3 406.8 400.2 423.2 405.9 403.7 405.6 413.0 435.8 432.9 427.8 472.4 511.9 525.8 559.1 758.6 231.6 304.9 339.9 370.0 396.4
Tpred M-20 (K): 324.3 331.8 364.7 366.7 353.1 363.2 358.1 357.9 368.7 352.5 389.2 393.0 391.3 379.5 389.6 391.4 386.8 383.9 392.7 398.8 384.6 379.3 389.6 392.1 395.1 407.2 407.6 401.5 420.5 405.2 406.2 407.8 413.6 436.2 433.1 427.0 474.4 513.3 528.7 561.6 770.5 232.1 305.2 339.6 369.3 395.5
Absolute Error M-235 (K): -1.3 2.5 -0.7 -1.1 -0.9 2.3 -1.6 0.6 -1.8 3.1 2.4 -0.6 0.2 0.3 1.5 -6.4 -1.9 0.3 0.2 -7.2 -0.7 -5.5 0.7 -1.4 -2.2 0.5 -7.1 -3.0 -3.9 7.5 2.5 9.1 3.4 4.4 -4.1 5.5 -1.1 -5.1 0.4 -0.9 10.4 -6.2 -1.8 -3.3 -3.2 -1.9
Absolute Error M-20 (K): -1.5 -0.6 -1.5 -1.7 -0.8 -0.2 -4.5 1.3 -2.1 1.5 1.6 -0.9 -0.4 0.5 -0.8 -8.8 -4.5 1.2 -1.8 -7.1 -1.6 -6.9 -3.0 -3.3 -3.7 -0.4 -7.9 -4.3 -1.1 8.2 0.0 6.9 2.8 4.0 -4.3 6.3 -3.1 -6.5 -2.6 -3.5 -1.5 -6.7 -2.1 -3.0 -2.6 -1.1
Relative Error M-235 (%): -0.4 0.7 -0.2 -0.3 -0.2 0.6 -0.4 0.2 -0.5 0.9 0.6 -0.1 0.1 0.1 0.4 -1.7 -0.5 0.1 0.1 -1.8 -0.2 -1.5 0.2 -0.4 -0.6 0.1 -1.8 -0.8 -0.9 1.8 0.6 2.2 0.8 1.0 -1.0 1.3 -0.2 -1.0 0.1 -0.2 1.4 -2.7 -0.6 -1.0 -0.9 -0.5
Relative Error M-20 (%): -0.5 -0.2 -0.4 -0.5 -0.2 -0.1 -1.3 0.3 -0.6 0.4 0.4 -0.2 -0.1 0.1 -0.2 -2.3 -1.2 0.3 -0.5 -1.8 -0.4 -1.9 -0.8 -0.8 -1.0 -0.1 -2.0 -1.1 -0.3 2.0 0.0 1.7 0.7 0.9 -1.0 1.5 -0.7 -1.3 -0.5 -0.6 -0.2 -3 -0.7 -0.9 -0.7 -0.3

Table 6 (continued)
No. of compound: 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138
Tpubl (K): 420.0 443.8 465.8 486.5 505.9 524.3 541.5 558.0 573.5 588.0 601.7 615.5 628.2 640.4 652.0 663.2 674.3 684.3 694.3 702.2 713.2 720.9 268.7 276.9 274.0 266.3 307.2 328.8 346.4 337.8 342.0 341.0 335.3 327.0 374.6 378.1 304.3 311.7 293.2 341.9 314.4 340.5 327.3 332.6 449.2 429.3
Tpred M-235 (K): 422.8 445.4 466.9 485.4 509.5 525.0 542.1 559.5 573.3 586.8 599.5 612.4 627.9 642.3 652.6 658.8 679.0 689.8 695.4 704.7 717.9 726.4 275.1 267.6 267.6 267.2 310.0 332.2 333.6 339.2 337.0 336.8 337.1 330.1 379.8 373.9 305.1 301.6 297.9 337.4 321.1 335.3 332.8 340.0 447.2 437.4
Tpred M-20 (K): 421.5 444.0 465.3 483.8 507.2 522.8 539.8 557.0 570.9 584.4 597.2 610.1 625.4 639.5 649.9 656.7 675.9 686.6 692.7 702.1 714.9 723.5 275.8 268.8 268.8 268.4 311.1 334.7 335.4 339.4 337.4 337.3 337.7 332.2 380.6 375.2 305.9 302.8 300.4 339.4 322.2 336.4 334.6 340.2 449.0 431.0
Absolute Error M-235 (K): -2.8 -1.6 -1.1 1.1 -3.6 -0.7 -0.6 -1.5 0.1 1.2 2.2 3.1 0.2 -1.9 -0.6 4.3 -4.8 -5.5 -1.2 -2.6 -4.7 -5.5 -6.4 9.2 6.4 -1.0 -2.8 -3.4 12.8 -1.4 5.0 4.2 -1.9 -3.1 -5.2 4.2 -0.8 10.1 -4.7 4.6 -6.7 5.2 -5.4 -7.4 2.0 -8.1
Absolute Error M-20 (K): -1.5 -0.2 0.5 2.7 -1.3 1.4 1.7 1.0 2.5 3.6 4.5 5.4 2.8 0.9 2.1 6.5 -1.6 -2.3 1.5 0.1 -1.7 -2.5 -7.1 8.1 5.2 -2.2 -3.8 -5.9 11 -1.6 4.6 3.7 -2.5 -5.2 -6 2.8 -1.6 8.9 -7.1 2.5 -7.8 4.1 -7.3 -7.6 0.1 -1.7
Relative Error M-235 (%): -0.7 -0.4 -0.2 0.2 -0.7 -0.1 -0.1 -0.3 0.0 0.2 0.4 0.5 0.0 -0.3 -0.1 0.6 -0.7 -0.8 -0.2 -0.4 -0.7 -0.8 -2.4 3.3 2.3 -0.4 -0.9 -1.0 3.7 -0.4 1.5 1.2 -0.6 -1.0 -1.4 1.1 -0.3 3.3 -1.6 1.3 -2.1 1.5 -1.7 -2.2 0.4 -1.9
Relative Error M-20 (%): -0.4 0.0 0.1 0.5 -0.3 0.3 0.3 0.2 0.4 0.6 0.7 0.9 0.4 0.1 0.3 1.0 -0.2 -0.3 0.2 0.0 -0.2 -0.4 -2.6 2.9 1.9 -0.8 -1.3 -1.8 3.2 -0.5 1.4 1.1 -0.7 -1.6 -1.6 0.7 -0.5 2.9 -2.4 0.7 -2.5 1.2 -2.2 -2.3 0.0 -0.4

Table 6 (continued)
No. of compound: 139 142 143 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176f 177 178 179 180 181 182 183 184 185 186 187
Tpubl (K): 708.5 240.4 285.7 353.9 391.9 424.3 374.1 405.0 429.9 454.1 345.0 376.6 404.1 429.8 453.7 476.0 497.0 516.7 535.2 552.5 568.9 584.4 599.0 613.2 625.9 639.3 650.3 661.9 673.1 683.1 693.1 703.1 712.0 720.4 729.3 402.9 396.6 393.2 397.6 397.5 392.5 317.4 356.1 353.5 345.9 314.7
Tpred M-235 (K): 699.1 250.8 287.8 349.9 382.5 410.6 377.6 402.9 429.6 452.4 351.9 381.3 409.9 435.5 454.2 476.5 495.5 517.5 536.3 554.2 570.2 585.4 600.8 613.5 626.6 640.1 654.0 665.5 673.5 688 695.9 706.5 716.8 725.4 732.8 405.3 402 406.6 404.5 402.3 396.6 315.4 346.1 350.6 336.3 307.8
Tpred M-20 (K): 708.0 249.6 283.3 351.2 383.7 411.5 379.3 404.6 430.9 453.5 352.7 381.8 410.0 435.3 454.2 476.3 495.3 516.7 535.4 553.0 568.9 584.1 599.4 612.1 625.2 638.6 652.5 663.8 672.2 686.3 694.4 704.9 715.2 723.9 731.6 407.4 404.4 408.3 406.6 404.6 399.5 312.9 345.8 348.8 332.4 303.1
Absolute Error M-235 (K): 9.4 -10.4 -2.1 4.0 9.4 13.7 -3.5 2.0 0.3 1.7 -6.9 -4.7 -5.8 -5.7 -0.5 -0.5 1.5 -0.8 -1.1 -1.7 -1.3 -1.0 -1.8 -0.3 -0.7 -0.8 -3.7 -3.6 -0.4 -4.9 -2.8 -3.4 -4.8 -5.0 -3.5 -2.4 -5.5 -13.4 -6.9 -4.8 -4.1 2.0 10.1 2.9 9.7 6.8
Absolute Error M-20 (K): 0.4 -9.2 2.4 2.7 8.2 12.8 -5.3 0.3 -1.0 0.6 -7.8 -5.2 -5.9 -5.5 -0.5 -0.3 1.7 0.0 -0.2 -0.5 0.0 0.3 -0.4 1.1 0.7 0.7 -2.2 -1.9 0.9 -3.2 -1.3 -1.8 -3.2 -3.5 -2.3 -4.4 -7.8 -15.1 -9.0 -7.2 -7.0 4.5 10.3 4.6 13.6 11.6
Relative Error M-235 (%): 1.3 -4.3 -0.7 1.1 2.4 3.2 -0.9 0.5 0.1 0.4 -2.0 -1.2 -1.4 -1.3 -0.1 -0.1 0.3 -0.2 -0.2 -0.3 -0.2 -0.2 -0.3 0.0 -0.1 -0.1 -0.6 -0.5 -0.1 -0.7 -0.4 -0.5 -0.7 -0.7 -0.5 -0.6 -1.4 -3.4 -1.7 -1.2 -1.0 0.6 2.8 0.8 2.8 2.2
Relative Error M-20 (%): 0.1 -3.8 0.8 0.8 2.1 3.0 -1.4 0.1 -0.2 0.1 -2.2 -1.4 -1.5 -1.3 -0.1 -0.1 0.3 0.0 0.0 -0.1 0.0
0.1 -0.1 0.2 0.1 0.1 -0.3 -0.3 0.1 -0.5 -0.2 -0.3 -0.5 -0.5 -0.3 -1.1 -2 -3.8 -2.3 -1.8 -1.8 1.4 2.9 1.3 3.9 3.7

Table 6 (continued)
No. of compound: 188 189 190 191 192 193 194 195 196 197 198 199 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234
Tpubl (K): 353.3 383.8 409.3 432.4 456.5 417.6 412.3 411.5 434.5 478.6 499.3 519.2 555.2 571.0 586.4 600.8 614.4 627.0 639.3 650.9 662.0 673.2 683.2 693.2 702.0 710.9 719.3 727.0 418.3 438.7 425.6 438.3 435.2 437.9 449.3 442.5 480.8 442.3 450.3 454.3 456.6 445.9 476.3 537.4 434.6 446.5
Tpred M-235 (K): 355.9 385.5 413.8 436.4 461.8 411.7 406.9 408.6 430.4 480.5 500.4 517.9 554.9 570.3 586.4 600.8 616.2 628.8 638.4 646.1 664.0 674.8 685.5 695.5 707.7 718.5 726.8 734.1 415.6 443.1 430 440.1 436.8 431.7 440.4 436.7 478.4 445.9 452.0 457.2 461.6 441.5 487.5 539.3 436.2 452.6
Tpred M-20 (K): 352.4 383 410.9 433.5 458.4 409.6 405.0 406.9 428.4 477.2 497.1 514.6 551.1 566.5 582.4 596.8 612 624.6 634.5 642.6 659.9 670.7 681.4 691.4 703.4 714.1 722.6 729.5 413.2 440.8 429.1 438.1 435.3 431.1 439.6 436.3 476.1 443.7 452.4 455.5 459.8 441.2 489.2 534.2 434.4 451.5
Absolute Error M-235 (K): 2.6 -1.8 -4.5 -4.0 -5.3 5.9 5.4 2.9 4.1 -1.9 -1.2 1.2 0.3 0.7 -0.1 0.0 -1.8 -1.8 0.8 4.9 -1.9 -1.7 -2.4 -2.3 -5.6 -7.5 -7.6 -7.1 2.8 -4.4 -4.4 -1.8 -1.6 6.2 8.9 5.8 2.4 -3.6 -1.7 -2.9 -5.0 4.4 -11.1 -1.9 -1.7 -6.1
Absolute Error M-20 (K): 0.9 0.8 -1.6 -1.2 -2.0 8.0 7.3 4.6 6.1 1.3 2.2 4.6 4.1 4.6 3.9 4.0 2.5 2.4 4.7 8.3 2.2 2.4 1.7 1.7 -1.4 -3.1 -3.3 -2.5 5.1 -2.2 -3.5 0.2 -0.1 6.8 9.7 6.2 4.7 -1.4 -2.2 -1.2 -3.1 4.7 -12.9 3.3 0.2 -5.1
Relative Error M-235 (%): -0.7 -0.5 -1.1 -0.9 -1.2 1.4 1.3 0.7 1.0 -0.4 -0.2 0.2 0.1 0.1 0.0 0.0 -0.3 -0.3 0.1 0.7 -0.3 -0.3 -0.3 -0.3 -0.8 -1.1 -1.1 -1.0 0.7 -1.0 -1.0 -0.4 -0.4 1.4 2.0 1.3 0.5 -0.8 -0.4 -0.6 -1.1 1.0 -2.3 -0.3 -0.4 -1.4
Relative Error M-20 (%): 0.2 0.2 -0.4 -0.3 -0.4 1.9 1.8 1.1 1.4 0.3 0.4 0.9 0.7 0.8 0.7 0.7 0.4 0.4 0.7 1.3 0.3 0.4 0.3 0.2 -0.2 -0.4 -0.5 -0.3 1.2 -0.5 -0.8 0.0 0.0 1.5 2.2 1.4 1.0 -0.3 -0.5 -0.3 -0.7 1.1 -2.7 0.6 0.0 -1.1

Table 6 (continued)
No. of compound: 235 236 237 238 239 240 241 242 243 244 246 247 251
Tpubl (K): 456.9 483.7 528.4 545.8 553.7 491.1 612.9 612.6 638.0 649.0 668.0 714.1 550.5
Tpred M-235 (K): 460.5 491.1 531.7 550.0 557.0 489.2 597.8 600.4 651.4 643.9 666.4 701.3 548.1
Tpred M-20 (K): 458.8 492.9 526.3 546.8 551.6 483.7 590.6 593.1 644.5 637.6 657.6 691.9 540.5
Absolute Error M-235 (K): -3.6 -7.5 -3.3 -4.2 -3.3 1.9 15.1 12.2 -13.4 5.1 1.6 12.8 2.4
Absolute Error M-20 (K): -1.8 -9.2 2.1 -1.0 2.0 7.4 22.3 19.5 -6.5 11.4 10.4 22.2 10.0
Relative Error M-235 (%): -0.8 -1.5 -0.6 -0.8 -0.6 0.4 2.5 2.0 -2.1 0.8 0.2 1.8 0.4
Relative Error M-20 (%): -0.4 -1.9 0.4 -0.2 0.4 1.5 3.6 3.2 -1 1.8 1.6 3.1 1.8

In order to illustrate further the problem with the reliability of published NBPs we have tried to trace the original source of the NBP of 1,1,2,2-tetraphenylethane. In our database this value (358–362 °C) was cross-referenced from two papers citing as reference the DIPPR database [19], and the Beilstein database [22]. Consultation of the original Beilstein handbook revealed that this compound has been included in the main work (Hauptwerk) of the series [35]. Two NBPs are recommended: 358–362 °C (uncorrected) and 379–383 °C (corrected). The original experimental determination of the boiling point of 1,1,2,2-tetraphenylethane was done by Biltz [36], who synthesized and purified the compound himself in 1897. The M-20 model provides somewhat better predictions for most of the published NBPs. The values predicted for the hypotheticals are close to values estimated by ABCs, which are developed for rather different homologous series, but are the only other alternative.
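The error columns of Tables 6 and 7 appear to follow the convention absolute error = Tpubl - Tpred and relative error = 100 (Tpubl - Tpred)/Tpubl. This reading is inferred from the tabulated values rather than stated in the text. A minimal Python sketch, with hypothetical function names, checks it against two Table 6 entries (compounds 241 and 243):

```python
from typing import Tuple

# (T_publ, T_pred M-235, T_pred M-20) in K, copied from Table 6.
TABLE6_ROWS = {
    241: (612.9, 597.8, 590.6),
    243: (638.0, 651.4, 644.5),
}


def absolute_error(t_publ: float, t_pred: float) -> float:
    """Assumed convention: published minus predicted temperature, in K."""
    return t_publ - t_pred


def relative_error(t_publ: float, t_pred: float) -> float:
    """Assumed convention: absolute error as a percentage of the published value."""
    return 100.0 * (t_publ - t_pred) / t_publ


def errors_for(compound: int) -> Tuple[float, float, float, float]:
    """Rounded (abs M-235, abs M-20, rel M-235, rel M-20) for one compound."""
    t_publ, t_m235, t_m20 = TABLE6_ROWS[compound]
    return (
        round(absolute_error(t_publ, t_m235), 1),
        round(absolute_error(t_publ, t_m20), 1),
        round(relative_error(t_publ, t_m235), 1),
        round(relative_error(t_publ, t_m20), 1),
    )
```

For both compounds the rounded results agree with the tabulated errors. Differences of 0.1 in some other rows would be expected if the published errors were computed from unrounded temperatures.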
It has to be underlined that the three hypotheticals are outliers on the principal components plot (Fig. 3), so the error in the estimation of their boiling points is expected to be higher than that within the boundaries of the models. It should also be noted that separate models for the different groups of hydrocarbons may be developed following the principles suggested in the present work. Figs. 2 and 3 indicate that "pseudohomologous" series, including a greater number of available data, may be created. For instance, alkylaromatics behave as aliphatic hydrocarbons above a particular aliphatic chain length.

Care should be taken when ascribing physical significance to statistically derived correlations for the contribution of particular descriptors. The latter relies heavily on the particular design of the database. For instance, the gravitational index, which managed to describe NBPs of hydrocarbons in the work of Katritzky et al. [7] as a sole independent variable, did not prove to be useful with the database of the present investigation. Neither were the variations of any of the tested topological descriptors, although they carry at least part of the molecular information of some of the descriptors successfully employed by Wessel and Jurs [19]. Our observations suggest that more work is necessary to establish which descriptors might be the real determinants of the NBPs, and which are only surrogates for more fundamental features of the molecules.

Table 7
Predictions of the M-235 and M-20 models for the control set and "hypothetical" molecules
Statistics for the predictions of the M-235 model: Mean standard deviation of absolute errors = 18.64 ± 3.89 K; Min absolute error = -53.6 K; Max absolute error = +24.8 K (7 points out of a ±11 K error range); Mean standard deviation of relative errors = 3.23 ± 0.67%; Min relative error = -9.4%; Max relative error = +3.8% (2 points out of a ±5.0% error range).
Statistics for the predictions of the M-20 model: Mean standard deviation of absolute errors = 16.34 ± 3.41 K; Min absolute error = -48.5 K; Max absolute error = +20.3 K (10 points out of a ±11 K error range); Mean standard deviation of relative errors = 2.89 ± 0.60%; Min relative error = -8.7%; Max relative error = +3.2% (2 points out of a ±5.0% error range).

No.   Name                          Tcalc (K)   Tpubl (K)   Tpred M-235 (K)   Tpred M-20 (K)
13    n-tetradecane                 –           526.7       516.9             515.2
40    3-methylpentane               –           336.4       333.9             335.1
63    2,3,3-trimethylpentane        –           387.9       382.9             383.7
73    2,2,4,4-tetramethylpentane    –           395.4       393.7             393
83    pristane                      –           604.3       582.9             587.1
84    phytane                       –           625.6       601.9             605.3
85    squalane                      –           720.0       695.2             702.9
88    1-butene                      –           266.9       270.3             270.8
140   lycopene                      795.8 a     –           766.0             788.6
141   β-carotene                    831.4 b     –           818.4             839.2
144   cyclopentane                  –           322.4       324.9             325.0
200   n-octylbenzene                –           537.5       536.5             533.0
245   1,2-benzo[a]pyrene            778.28 c    –           773.5             763.6
248   o-terphenyl                   637.5 d     610.6       654.6             648.0
249   triphenylmethane              –           632.1       651.8             646.4
250   acenaphtalene                 552.0 d     543.1       548.1             537.0
252   1,1,2,2-tetraphenylethane     697.7 d     633.1       751.5             746.4
253   4-methyloctane                –           415.6       412.1             412.8
254   2,2,3,3-tetramethylbutane     –           379.4       374.3             374.4
255   2-ethyl-1-hexene              –           393.1       397.2             397.0
256   adamantane                    –           461.0       504.6             501.2
257   1,5-cyclooctadiene            409.9 d     423.3       415.7             414.6
258   2,5-dimethyl-1,5-hexadiene    –           387.4       380.8             383.2
259   cis-1-propenylbenzene         441.9 d     452         436.9             434.5
260   1-phenylnaphthalene           621.0 d     607.1       643.6             634
261   indane                        –           451.1       451.4             447.1

Errors, in the column order Absolute Error M-235 (K), Absolute Error M-20 (K), Relative Error M-235 (%), Relative Error M-20 (%):
No.   Absolute Error M-235 (K)   Absolute Error M-20 (K)   Relative Error M-235 (%)   Relative Error M-20 (%)
13    9.8                        11.5                      1.9                        2.2
40    2.5                        1.3                       0.7                        0.4
63    5.0                        4.2                       1.3                        1.1
73    1.7                        2.4                       0.4                        0.6
83    21.6                       17.2                      3.6                        2.8
84    23.7                       20.3                      3.8                        3.2
85    24.8                       17.1                      3.4                        2.4
88    -3.4                       -3.9                      -1.3                       -1.5
144   -2.5                       -2.6                      -0.8                       -0.8
200   1.0                        4.5                       0.2                        0.8
248   -9.2                       -2.6                      -1.4                       -0.4
249   -19.7                      -14.3                     -3.1                       -2.3
250   -0.1                       11.0                      0.0                        2.0
252   -53.6                      -48.5                     -7.7                       -7.0
253   3.5                        2.8                       0.8                        0.7
254   5.1                        5.0                       1.3                        1.3
255   -4.1                       -3.9                      -1.0                       -1.0
256   -43.6                      -40.2                     -9.5                       -8.7
257   -5.8                       -4.7                      -1.4                       1.1
258   6.6                        4.2                       1.7                        1.1
259   5.0                        7.4                       1.1                        1.7
260   -22.6                      -13.0                     -3.6                       -2.1
261   -0.3                       4.0                       -0.1                       0.9

a Calculated with an Asymptotic Behaviour Correlation [10]. The original correlation is for n-1-alkenes.
b Calculated with an Asymptotic Behaviour Correlation [10]. The original correlation is for alkylcyclohexanes.
c Calculated from a fit of the NBPs of the rings-only aromatic hydrocarbons as a function of $C_{tot}^{0.70}$.
d Calculated from published boiling temperatures at reduced pressure. Considered more reliable and used for the determination of errors.

4. Conclusions

The present work contributes a correlation to the very challenging and important investigations of the quantitative relation between the molecular structure and the functional properties of chemical compounds, which has been a fundamental task of chemistry and chemical engineering for many years. Its main features, as perceived by the present authors, are its relative simplicity, its reliable predictions of the NBPs, and its applicability to diversified industrially important hydrocarbon structures within a wide range of NBPs and numbers of carbon atoms. An achievement of particular interest in the present work is the revealed opportunity for limiting the learning set through multivariate analysis and molecular design. The molecular mechanics simulation employed in this study is viewed by the present authors as a potential tool for incorporation in future chemical engineering simulators.
It will significantly enhance the capabilities of the latter for designing processes with chemical reactions and is especially suitable for optimisation of the composition of the additive products of the chemical industry [37]. Furthermore, it allows straightforward but correct input of complex chemical structures by drawing them directly on the monitor. However, from the point of view of the chemical engineer as the user of such programmes, the benefits of the sophistication cannot be easily appreciated. On the one hand, the sophistication requires an in-depth knowledge of the quantum chemistry of the particular structures, which is not so readily available for the structures targeted by chemical engineers. On the other hand, in many cases of engineering importance the sophistication and high accuracy may not be justified, since simple group contribution correlations still work successfully for particular problems. The appropriate level of sophistication for many of the common chemical engineering applications will be different and can only be determined by systematic studies of the influence of uncertainties on key parameters. The high accuracy achieved by the correlation opens up a possibility for systematic studies of chemical engineering applications in which the effects of small changes are important. This also outlines a path towards the more general problem of the influence of uncertainties in calculated thermophysical parameters on the final solution of computer aided simulation and design.

Acknowledgements

The present authors acknowledge with gratitude the financial support of The Royal Society.

References

[1] L.M. Egolf, M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comput. Sci. 34 (1994) 947–956.
[2] S.J. Grigoras, Comput. Chem. 11 (1990) 593–610.
[3] W.J. Lyman, W.F. Reehl, D.H. Rosenblatt, Handbook of Chemical Property Estimation Methods, McGraw-Hill, New York, 1982.
[4] R.C. Reid, J.M. Prausnitz, B.E.
Poling, Properties of Gases and Liquids, 4th edn., McGraw-Hill, New York, 1987.
[5] C.H. Fisher, J. Am. Oil Chem. Soc. 67 (1990) 101–102.
[6] R.C. Mebane, C.D. Williams, T.R. Rybolt, Fluid Phase Equilibria 124 (1996) 111–122.
[7] A.R. Katritzky, V.S. Lobanov, M. Karelson, J. Chem. Inf. Comput. Sci. 38 (1998) 28–41.
[8] A. Kreglewski, B.J. Zwolinski, J. Phys. Chem. 65 (1961) 1050–1052.
[9] K.A. Gasem, C.H. Ross, R.L. Robinson Jr., Can. J. Chem. Eng. 77 (1993) 805–816.
[10] J.J. Marano, G.D. Holder, Ind. Eng. Chem. Res. 36 (1997) 1887–1894.
[11] J.J. Marano, G.D. Holder, Ind. Eng. Chem. Res. 36 (1997) 1895–1907.
[12] A.L. Horvath, Molecular Design, Elsevier, Amsterdam, 1992.
[13] M. Karelson, Adv. Quant. Chem. 28 (1997) 141–157.
[14] M. Kurata, S. Ishida, J. Chem. Phys. 23 (1955) 1126–1131.
[15] I.C. Sanchez, R.H. Lacombe, J. Phys. Chem. 80 (1976) 2352–2362.
[16] I.C. Sanchez, R.H. Lacombe, Macromolecules 11 (1978) 1145–1156.
[17] P.J. Flory, R.A. Orwoll, A. Vrij, J. Am. Chem. Soc. 86 (1964) 3507–3514.
[18] A. Vetere, Fluid Phase Equilibria 124 (1996) 15–29.
[19] M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comput. Sci. 35 (1995) 68–76.
[20] J. Buckingham, S.M. Donaghy (Eds.), Dictionary of Organic Compounds, 5th ed., Chapman and Hall, New York, 1982.
[21] API Technical Data Book - Petroleum Refining, 4th edn., American Petroleum Institute, Washington DC, 1983.
[22] Iu.V. Pokonova, A.A. Gaile, V.G. Spirkin, Chemistry of Petroleum, Himia, Leningrad, 1984 (in Russian).
[23] A.I. Bogomolov, A.A. Gaile, V.V. Gromova, Chemistry of Petroleum and Gas, Himia, Leningrad, 1989 (in Russian).
[24] TRC (Thermodynamic Research Center), TRC Thermodynamic Tables–Hydrocarbons, The Texas A&M University, College Station, TX, USA, 1997 revision.
[25] A.S. Teja, R.J. Lee, D. Rosenthal, M. Anselme, Fluid Phase Equilibria 56 (1990) 153–169.
[26] PCMODEL, 5th edn., Serena Software, Bloomington, IN, USA, 1992.
[27] M. Randic, B. Jerman-Blazic, N. Trinajstic, Comput. Chem. 14 (1990) 237–246.
[28] J.K. Labanowski, I. Motoc, R.A. Damkoehler, Comput. Chem. 15 (1) (1991) 47–53.
[29] A.R. Katritzky, L. Mu, V.S. Lobanov, M. Karelson, J. Phys. Chem. 100 (1996) 10400–10407.
[30] STATGRAPHICS for DOS, 7th edn., STSC, Inc. and Manugistics, Inc., Rockville, MD, USA.
[31] P. Geladi, M.-L. Tosato, in: W. Karcher, J. Devillers (Eds.), Practical Applications of Quantitative Structure–Activity Relationships (QSAR) in Environmental Chemistry and Toxicology, Kluwer Acad. Publ., Dordrecht, 1990, pp. 170–179.
[32] S. Wold, K. Esbensen, P. Geladi, Chemometrics and Intelligent Laboratory Systems 2 (1987) 37–52.
[33] P. Geladi, B. Kowalski, Anal. Chim. Acta 185 (1986) 1–17.
[34] M.-L. Tosato, P. Geladi, in: W. Karcher, J. Devillers (Eds.), Practical Applications of Quantitative Structure–Activity Relationships (QSAR) in Environmental Chemistry and Toxicology, Kluwer Acad. Publ., Dordrecht, 1990, pp. 317–341.
[35] Beilsteins Handbuch der Organischen Chemie, 4th edn., H, bd. V, Springer Verlag, 1933, p. 739.
[36] H. Biltz, Liebigs Annalen der Chemie 296 (1897) 221.
[37] G.S. Cholakov, K.G. Stanulov, P.A. Devenski, H.A. Iontchev, Wear 216 (2) (1998) 194–201.