Fluid Phase Equilibria 163 (1999) 21–42
www.elsevier.nl/locate/fluid
Estimation of normal boiling points of hydrocarbons from descriptors
of molecular structure
Georgi St. Cholakov a, William A. Wakeham b,*, Roumiana P. Stateva c
a Department of Petroleum and Solid Fuels Processing Technology, University of Chemical Technology and Metallurgy, Sofia 1156, Bulgaria
b Department of Chemical Engineering, Imperial College of Science, Technology and Medicine, London SW7 2BY, UK
c Institute of Chemical Engineering, Bulgarian Academy of Sciences, Sofia 1113, Bulgaria
Received 12 January 1999; accepted 19 April 1999
Abstract
Correlations for estimation of thermophysical properties are needed for the design of processes and
equipment related to phase equilibria. The normal boiling point (NBP) is a fundamental characteristic of
chemical compounds, involved in many correlations used to estimate important properties. Modern simulation
packages usually require the NBP and a standard liquid density from which they can estimate all other necessary
properties and begin the design of particular processes, installations and flowsheets. The present work
contributes a correlation between the molecular structure and the normal boiling point of hydrocarbons. Its main
features are the relative simplicity, sound predictions, and applicability to diversified industrially important
structures, whose boiling points and numbers of carbon atoms span a wide range. Of particular
interest is the opportunity it reveals to reduce the number of compounds required for the derivation (the
learning set) through multivariate analysis and molecular design. The high accuracy achieved by the correlation
opens up a possibility for systematic studies of chemical engineering applications in which the effects of small
changes are important. This also defines a path towards the more general problem of the influence of
uncertainties in calculated thermophysical parameters on the final outcome of computer aided simulation and
design. © 1999 Elsevier Science B.V. All rights reserved.
Keywords: Molecular simulation; Model; Normal boiling point; Hydrocarbons
1. Introduction
Correlations for estimation of thermophysical properties are an important tool for design of
processes and equipment, environmental impact assessment, HAZOP studies, and other important
* Corresponding author. Tel.: +44-171-594-5005; fax: +44-171-594-8802; e-mail: w.wakeham@ic.ac.uk
0378-3812/99/$ - see front matter © 1999 Elsevier Science B.V. All rights reserved.
PII: S0378-3812(99)00207-1
G. St. Cholakov et al. / Fluid Phase Equilibria 163 (1999) 21–42
chemical engineering problems related to phase equilibria. Consequently, large commercial databases
of miscellaneous properties are compiled, but have to be populated with new compounds within the
limits of interpolation of the available experimental data [1]. On the other hand, methods for
extrapolation of existing data are needed for assessment of compounds not yet synthesized and/or
high-molecular-mass compounds for which experimental determination is unreliable or impossible
because of degradation [2].
The physical properties of chemical compounds are described by a large group of structure related
characteristics, such as normal boiling point (NBP) and critical parameters. Most of these have been
targeted by different correlations and approaches [3–6]. However, thermophysical properties are
interrelated and an efficient strategy is to identify a suitable number of independently determined
primary target parameters, which are connected to the largest possible number of properties and can
be used for their computational estimation [2].
The normal boiling point (NBP) is a fundamental characteristic of chemical compounds. It is
involved in many correlations used to estimate thermophysical properties. Modern computer simulation packages usually require the NBP and a standard liquid density from which they can estimate all
necessary properties and begin the design of particular processes, installations and flowsheets for their
realization.
The analysis of prior work, recently reviewed by Katritzky et al. [7], shows that historically two
types of empirical correlations have been developed: correlations aimed at molecules with the
widest possible variation of functional groups and heteroatoms, and correlations concentrating on
molecules within homologous series. The former follow the success of the first group contribution
methods [4], and the most recent ones apply electronic and graph topological descriptors [1,2]. A
common feature of these correlations is that the dependent variable is a function of estimated
contributions of diversified structural features, even when only one complex descriptor is incorporated
in the final model [7]. They are referred to below as "contribution" models.
Correlations developed for homologous series usually employ the total number of C atoms or the
molecular mass of the compounds with adjustable constants [5,8]. Gasem et al. [9] recently suggested
the abbreviation ABC (Asymptotic Behavior Correlations) for such models. Marano and Holder [10]
proposed a generalization for all ABCs and developed such correlations for a wide range of
thermophysical properties of several homologous series [11]. It has also been shown that ABCs can be
developed with graph topological indices [12] and molecular energy descriptors [6]. Theoretical
explanations have been suggested to relate quantum chemical descriptors to the thermodynamic
properties of polar molecules [2,13]. The lattice fluid model [14–16] and the cell model [17] have
been used to explain ABCs [11]. A common feature of ABCs is that the dependent variable is a
non-linear function with several adjustable constants describing the relations between repeated
segments of the molecules (mers) and empty "holes" (lattice-fluid models) or mers and free volume
(cell models). They are referred to below as "mers" models.
The advantages and disadvantages of the two approaches have been well documented by the
respective authors. From a practical point of view, there is clearly a need for a compromise between
the high accuracy but limited functionality of the "mers" models, and the low accuracy and widely
varied functionality of the "contribution" models. The present work is an attempt to find such a
compromise. Furthermore, it investigates the correlation power of molecular
descriptors estimated with conventional programmes for computer simulation of molecular mechanics.
These are considered as a potential tool for enhancing the capabilities of the simulation packages
widely used nowadays for computer aided chemical engineering design. A third objective of the
present work is to explore opportunities to reduce the size of the data set upon which the derivation of
the correlation is based (the learning set), since databases employed in contribution models are
becoming increasingly large. Finally, the study evaluates the extrapolation
predictive power of such correlations for outlying molecules of industrial importance.
2. Methodology
The development of any correlation relies on a database including the objects of interest
(molecular structures in the present context), and relevant known properties of these objects
(descriptors of the molecular structures). Independent variables defined from the database have to be
correlated to a set of dependent characteristics of functional interest (NBPs) with the help of a suitable
modelling technique. The predictive power of a correlation is usually confined to the space defined
by the constraints of its derivation, although in the specific case of molecular modelling some
extrapolation to structurally related outlying molecules might be possible at the cost of higher error.
Experimental values for low and moderate NBPs of industrially important compounds are usually
available from many sources. Higher boiling points are determined in vacuum, and may be
recalculated for normal conditions if a pressure–temperature relation suitable for the particular group
of compounds is available. For many compounds, however, the latter relations have not been studied,
and the amount of experimental data even at reduced pressure is limited.
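The recalculation from reduced to normal pressure can be illustrated with a deliberately simple pressure–temperature relation. The sketch below assumes the integrated Clausius–Clapeyron equation with a constant enthalpy of vaporization; this is an illustrative choice rather than the relation used for the present database, and the function name `boiling_point_at` and all numerical inputs are hypothetical.

```python
import math

R = 8.314462618       # molar gas constant, J mol^-1 K^-1
P_NORMAL = 101325.0   # normal pressure, Pa


def boiling_point_at(p_target, t_known, p_known, dh_vap):
    """Boiling temperature (K) at p_target from one known (t_known, p_known) point.

    Assumes ln(p2/p1) = -(dh_vap / R) * (1/T2 - 1/T1) with constant dh_vap (J/mol).
    """
    inv_t = 1.0 / t_known - R * math.log(p_target / p_known) / dh_vap
    return 1.0 / inv_t


# Hypothetical input: a compound boiling at 450 K under 1333 Pa (about 10 mmHg),
# with an assumed enthalpy of vaporization of 60 kJ/mol.
t_normal = boiling_point_at(P_NORMAL, t_known=450.0, p_known=1333.0, dh_vap=60e3)
print(round(t_normal, 1))
```

For high-boiling compounds the constant-enthalpy assumption becomes crude over such a wide pressure span, which is one reason a relation fitted to the particular group of compounds is preferable when it exists.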
2.1. Database
The design of the database of relevant compounds is perhaps the most important step in the
derivation of statistical correlations. The weighting of different groups presented in the database
directly influences the subsequent modelling [18]. The database should contain all relevant structural
features of the modelled groups of compounds, but it should be emphasized again that the relative
representation of those groups influences the uniformity of the prediction for the different groups of
objects.
Several features were sought from the database used in the present study, in order to achieve
representation of the main structures, and the possibility for extrapolation of the predictions towards
three industrially important high-molecular-mass hydrocarbons with unknown NBPs (lycopene,
β-carotene and 1,2-benzo[a]pyrene), chosen as an example. These are:
- systematic change of properties within several homologous series, since any compound may be
viewed as a member of some appropriate series;
- presence of series of branched hydrocarbons with increasing numbers of double bonds, cycloalkanes and terpenoids with known NBPs, which might be extrapolated towards high-molecular-mass
terpenoids;
- presence of series of hydrocarbons differing by one aromatic ring, which might be extrapolated
towards benzopyrenes;
- presence of a control set of compounds with complex molecular structure, estimated by other
authors, to be used for comparison with the present study.
Table 1
Hydrocarbons included in the database
No.   Name
1b    ethane
2     propane
3     n-butane
4     n-pentane
5     n-hexane
6     n-heptane
7     n-octane
8     n-nonane
9     n-decane
10    n-undecane
11    n-dodecane
12    n-tridecane
13a   n-tetradecane
14    n-pentadecane
15    n-hexadecane
16    n-heptadecane
17    n-octadecane
18    n-nonadecane
19    n-eicosane
20    n-heneicosane
21    n-docosane
22    n-tricosane
23    n-tetracosane
24    n-pentacosane
25    n-hexacosane
26    n-heptacosane
27    n-octacosane
28    n-nonacosane
29    n-triacontane
30    n-dotriacontane
31    n-pentatriacontane
32    n-hexatriacontane
33    n-tetracontane
34    n-tetratetracontane
35b   n-hexacontane
36    i-butane
37    2-methylbutane
38b   2,2-dimethylpropane
39    2-methylpentane
40a   3-methylpentane
41    2,2-dimethylbutane
42    2,3-dimethylbutane
43    2-methylhexane
44    3-methylhexane
45    2,2-dimethylpentane
46    2,3-dimethylpentane
47    2,4-dimethylpentane
48    3,3-dimethylpentane
50    2,2,3-trimethylbutane
51    2-methylheptane
52    3-methylheptane
53    4-methylheptane
54    2,2-dimethylhexane
55    2,3-dimethylhexane
56    2,4-dimethylhexane
57    2,5-dimethylhexane
58    3,3-dimethylhexane
59    3,4-dimethylhexane
60    3-ethylhexane
61    2,2,3-trimethylpentane
62    2,2,4-trimethylpentane
63a   2,3,3-trimethylpentane
64    2,3,4-trimethylpentane
65    2-methyl-3-ethylpentane
66    3-methyl-3-ethylpentane
67    2,2,3-trimethylhexane
68    2,2,4-trimethylhexane
69    2,2,5-trimethylhexane
70b   3,3-diethylpentane
71    2,2,3,3-tetramethylpentane
72    2,2,3,4-tetramethylpentane
73a   2,2,4,4-tetramethylpentane
74b   2,3,3,4-tetramethylpentane
75    2-methyloctane
76    2-methylnonane
77    3,3,5-trimethylheptane
78    2,2,3,3-tetramethylhexane
79    2,5-dimethyldecane
80    2,5-dimethyldodecane
81    2,6,10-trimethyldodecane
82    2,6,10-trimethyltetradecane
83a   pristane
84a   phytane
85a   squalane
86b   lycopane
87b   propylene
88a   1-butene
89    1-pentene
90    1-hexene
91    1-heptene
92    1-octene
93    1-nonene
94    1-decene
95    1-undecene
96    1-dodecene
97    1-tridecene
98    1-tetradecene
99    1-pentadecene
100   1-hexadecene
101   1-heptadecene
102   1-octadecene
103   1-nonadecene
104   1-eicosene
105   1-heneicosene
106   1-docosene
107   1-tricosene
108   1-tetracosene
109   1-pentacosene
110   1-hexacosene
111   1-heptacosene
112   1-octacosene
113   1-nonacosene
114b  1-triacontene
115   1,3-butadiene
116   c-2-butene
117   t-2-butene
118   i-butene
119   isoprene
120   2,3-dimethyl-1-butene
121   2,3-dimethyl-2-butene
122   2-ethyl-1-butene
123   c-2-hexene
124   t-2-hexene
125   2-methyl-1-pentene
126   4-methyl-1-pentene
127   2,4,4-trimethyl-1-pentene
128   2,4,4-trimethyl-2-pentene
129   2-methyl-1-butene
130   2-methyl-2-butene
131   3-methyl-1-butene
132   2,3-dimethyl-butadiene
133   3,3-dimethyl-1-butene
134   2-methyl-2-pentene
135   3-methyl-1-pentene
136   1,5-hexadiene
137   limonene
138b  α-pinene
140   lycopene
141   β-carotene
142   cyclopropane
143b  cyclobutane
Table 1 (continued)
No.   Name
144a  cyclopentane
145   cyclohexane
146   cycloheptane
147   cyclooctane
148   methylcyclohexane
149   ethylcyclohexane
150   propylcyclohexane
151   butylcyclohexane
152   methylcyclopentane
153   ethylcyclopentane
154   propylcyclopentane
155   butylcyclopentane
156   pentylcyclopentane
157   hexylcyclopentane
158   heptylcyclopentane
159   octylcyclopentane
160   nonylcyclopentane
161   decylcyclopentane
162   undecylcyclopentane
163   dodecylcyclopentane
164   tridecylcyclopentane
165   tetradecylcyclopentane
166   pentadecylcyclopentane
167   hexadecylcyclopentane
168b  heptadecylcyclopentane
169   octadecylcyclopentane
170   nonadecylcyclopentane
171   eicosylcyclopentane
172   heneicosylcyclopentane
173   docosylcyclopentane
174   tricosylcyclopentane
175   tetracosylcyclopentane
176b  pentacosylcyclopentane
177   c-1,2-dimethylcyclohexane
178   t-1,2-dimethylcyclohexane
179   c-1,3-dimethylcyclohexane
180   t-1,3-dimethylcyclohexane
181   c-1,4-dimethylcyclohexane
182   t-1,4-dimethylcyclohexane
183   cyclopentene
184   cyclohexene
185   1,3-cyclohexadiene
186   5-methyl-1,3-cyclopentadiene
187   1,3-cyclopentadiene
188b  benzene
189   toluene
190   ethylbenzene
191   propylbenzene
192   butylbenzene
193   o-xylene
194   m-xylene
195   p-xylene
196   1-methyl-3-ethylbenzene
197   pentylbenzene
198   hexylbenzene
199   heptylbenzene
200a  octylbenzene
201   nonylbenzene
202   decylbenzene
203   undecylbenzene
204   dodecylbenzene
205b  tridecylbenzene
206   tetradecylbenzene
207   pentadecylbenzene
208   hexadecylbenzene
209   heptadecylbenzene
210   octadecylbenzene
211   nonadecylbenzene
212   eicosylbenzene
213   heneicosylbenzene
214   docosylbenzene
215   tricosylbenzene
216b  tetracosylbenzene
217   styrene
218   α-methylbenzene
219   cumene
220   o-ethyltoluene
221   p-ethyltoluene
222b  mesitylene
223   1,2,3-trimethylbenzene
224   1,2,4-trimethylbenzene
225   1,2,3,4-tetrahydronaphthalene
226   t-butylbenzene
227   p-cymene
228   m-diethylbenzene
230   i-butylbenzene
231   m-diisopropylbenzene
232   diphenylmethane
233   m-ethyltoluene
234   s-butylbenzene
235   p-diethylbenzene
236   p-diisopropylbenzene
237   diphenyl
238   1,1-diphenylethane
239b  1,2-diphenylethane
240b  naphthalene
241   anthracene
242   phenanthrene
243b  m-terphenyl
244   p-terphenyl
245   1,2-benzo[a]pyrene
246   pyrene
247   chrysene
248a  o-terphenyl
249a  triphenylmethane
250a  acenaphthylene
251   acenaphthene
252a  1,1,2,2-tetraphenylethane
253a  4-methyloctane
254a  2,2,3,3-tetramethylbutane
255a  2-ethylhexene
256a  adamantane
257a  1,5-cyclooctadiene
258a  2,5-methyl-1,5-hexadiene
259a  c-1-propenylbenzene
260a  1-phenylnaphthalene
261a  indane
a Members of the control set.
b Members of the designed learning set of 20 hydrocarbons.
The names of the compounds selected for the database used in this study are presented in Table 1,
with their published NBPs listed in Table 6. The objects have been limited to hydrocarbons in
order to achieve a reasonable representation of the functional groups of these fundamental compounds.
The homologous series included allow a "mers" influence also to be expressed in the modelling.
This approach follows from one of the objectives of the present work: to find a compromise
between the breadth of the functionality of molecular structures and the precision achieved in their
estimation. It has also been justified by recent prior work [7,19].
The present database of 261 hydrocarbons was compiled from several sources [4,19–24]. The data
for the normal alkanes with more than 30 carbon atoms were calculated by a "mers" correlation [25].
Three hydrocarbons with unknown NBPs were included in the database as an illustration of the case
when objects of industrial importance have to be evaluated as outliers. Such molecules are often
referred to as "hypotheticals". Lycopene and β-carotene are industrially important constituents of
natural products; 1,2-benzo[a]pyrene is a carcinogenic hydrocarbon often used as a reference in
ecological studies. Most of the hydrocarbons are identical with those used in the most recent
correlation for description of NBPs of hydrocarbons [19]. The values for some of the hydrocarbons,
mainly in the control set, were recalculated for normal conditions from vacuum data, which were
considered more reasonable.
The limits for the main hydrocarbon series and structures included in the database, which
also determine the boundaries for the predictive ability of the derived models, may be assessed from
Tables 1 and 6, but are more clearly outlined by the dependence of the predicted
points on the total number of carbon atoms (Fig. 2), and by the scatterplot of the first two principal components (Fig. 3). NBPs vary over
the widest practical range, from 184.5 to 877.5 K. The total number of carbon atoms ranges from 2 to
60 for the n-alkanes, from 3 to 40 for the series finishing with β-carotene, and up to 30 for the
rest of the homologous series.
2.2. Descriptors
Two types of descriptors were employed in the present investigation.
Molecular energy descriptors were evaluated with a conventional computer programme for
molecular mechanics simulation, based on the MMX modification of the MM2 method [26]. In such
programmes a structure is considered a collection of atoms held together by elastic (harmonic)
forces (bonds), which constitute the force field. The calculations start with a structure with relevant
default values of parameters, and its optimized geometry is found by iterative minimization of its
total steric energy. Further refinement of the energy contributions may be achieved by assigning more
accurate values for the starting force constants and/or applying several programmes of different
sophistication for gradual assessment of the more intimate structural elements, or specific programmes
designed to target particular structural features [1,2,6]. Such refinement of the molecular energy
descriptors used in the present study has been deliberately avoided.
For the practical purposes of the present study, the minimised molecular energy models of all 261
molecules were obtained with a conventional programme for molecular mechanics simulation, and the
contributions of different energies in the minimized models were tested as descriptors. An illustration
of the molecular energy descriptors for adamantane is presented in Fig. 1. The names and codes of the
descriptors are given in Table 2.
Carbon atom descriptors of various levels of sophistication can be used. The highest level of
sophistication presently available comprises the graph topological indices, derived from the adjacency
and distance matrices of a chemical structure [12]. More than 120 such indices have been suggested.
The latest versions can evaluate 3D structural information [27], and many of them have been involved
in correlations with thermophysical properties and characteristics [12].
Fig. 1. The minimized energy model and the 15 molecular descriptors of adamantane. (Dimensions are given as estimated by
the molecular mechanics simulation programme: in kcal mol⁻¹, Å³ mol⁻¹, etc.) Total energy (Etot): 34.616; Stretch
energy (Estr): 1.188; Bond energy (Ebnd): 10.273; Stretch–bend energy (Es–b): −0.292; Torsion energy (Etor): 16.524; Van
der Waals energy (Evdw): 6.924; Dipole-charge interaction energy (Edch): 0.000; Electric dipole moment (DM): 0.000;
Standard enthalpy (Hf): −14.00; Strain energy (Este): 27.30; Van der Waals volume (Vvdw): 254 Å³; Molar volume (VM):
152 cm³; Total van der Waals surface (Stot): 174.82 Å²; Saturated van der Waals surface (Ssat): 174.82 Å²; Unsaturated van
der Waals surface (Sunsat): 0.00 Å².
The lowest level of sophistication of carbon atom descriptors is to use the numbers of atoms
engaged in specific groups (atom counts). These are generically related to the group contributions,
which multiply the particular number of atoms by empirically assigned constants. In the present
investigation we have chosen the carbon atom descriptors presented in Table 3. For the most part
they coincide with the descriptors in the recent Joback group contribution model [4].

Table 2
Descriptors from simulated molecular mechanics
No.  Description                                Code
1    Total energy                               Etot
2    Stretch energy                             Estr
3    Bond energy                                Ebnd
4    Stretch–bend energy                        Es–b
5    Torsion energy                             Etor
6    Van der Waals energy                       Evdw
7    Energy of "dipole-charge" interaction      Ed-ch
8    Electrostatic dipole moment                DM
9    Standard heat of formation                 Hf
10   Strain energy                              Est
11   Van der Waals volume                       Vvdw
12   Molar volume                               VM
13   Total Van der Waals surface                St
14   Saturated Van der Waals surface            Ssat
15   Unsaturated Van der Waals surface          Sunsat

Table 3
Carbon atom descriptors and molecular mass
(superscript a: aliphatic; superscript c: cyclic)
No.  Name                                                    Code
16   Total number of C atoms                                 Ctot
17   Number of C atoms in CH3 groups                         N_CH3
18   Number of C atoms in aliphatic CH2 groups               N^a_CH2
19   Number of C atoms in aliphatic CH groups                N^a_CH
20   Number of C atoms in aliphatic C groups                 N^a_C
21   Number of C atoms in aliphatic CH2=CH2 groups           N^a_=CH2
22   Number of C atoms in aliphatic CH=CH groups             N^a_=CH
23   Number of C atoms in aliphatic C= groups                N^a_=C
24   Total number of C atoms in aliphatic double bonds       DBA
25   Number of C atoms in cyclic CH2 groups                  N^c_CH2
26   Number of C atoms in cyclic CH groups                   N^c_CH
27   Number of C atoms in cyclic C groups                    N^c_C
28   Number of C atoms in cyclic CH= groups                  N^c_=CH
29   Number of C atoms in cyclic C= groups                   N^c_=C
30   Total number of C atoms in cyclic double bonds          DBC
31   Molecular mass                                          M
Because of the success of prior work with topological indices, a limited number of them are also
tested in the present work. These include: the Wiener index (W); the Balaban index; the
Bonchev and Trinajstic information content and mean information content of the unit distances, and the
information content and mean information content of the distances' distribution indices (known
respectively as IWD, IWDM, IED and IEDM); the cyclomatic number (μ); and the Randic path
connectivity indices (CHI) up to third-order terms. The meaning and methods of calculation of these
indices have been extensively reviewed elsewhere [12,28]. An additional descriptor, the gravitation
index (G_I), successfully employed lately for the evaluation of NBPs [7,29], was also tested. Where
applicable, the descriptors were calculated not only with unit distances, but also with distances
between bonded atoms, obtained from the structures minimized by molecular mechanics simulation.
The total number of descriptors, including the topological ones, as well as molecular mass,
amounted to 59. This is a relatively low number compared with the most recent description of NBPs
of databases including heterocompounds [7], in which more than 800 descriptors are tested, or the
study of Wessel and Jurs [19], confined to hydrocarbons and selecting among 81 descriptors. It
should be emphasized, though, that one of the objectives of the present investigation was to keep the
methods for calculation and the meaning of descriptors as simple as possible. That is why electronic
descriptors and complex functions of one or more descriptors were deliberately avoided.
2.3. Modelling
A conventional "stepwise" multiple regression procedure [30] was employed to select the most
influential variables from the 59 descriptors and determine their optimal number. This procedure is a
subjective molecular feature selection [19], in which the dependent variable is used to develop models
in the form:
\[
\mathrm{NBP}_j = b_0 + \sum_{i=1}^{n} b_i X_{ij} \tag{1}
\]
where NBP_j is the normal boiling point of the compound j, b_0 is the intercept term, and b_i is the
coefficient for descriptor X_ij.
A linear contribution of the structural descriptors was adopted for all variables, except for a
nonlinear "mers"-type independent variable (total number of carbon atoms, Ctot). The latter is
successfully used in ABC correlations for homologous series [11]. We assume that each molecule may
be considered a member of some homologous series. The boiling points of the molecules would then
lie on a family of curves, different for each series, but asymptotically dependent on the "mers"
variable, Ctot. The distances between the curves in the family would then be accounted for by the
linearly contributing independent variables, which would reflect the specific features of the particular
mers of molecules belonging to different series. These assumptions are illustrated in Fig. 2, which
presents the boiling points, predicted by one of our models, as a function of the total number of C
atoms. The experimental temperatures obey the same dependence. In a later section of the paper we
shall show that this concept for the structure of the model is successful.
Fig. 2. Asymptotic dependence of the normal boiling points predicted by the M-20 model (Tpred MT 20, K) on the total
number of C atoms (Ctot).
The algorithm, used also in this paper, for obtaining the best models with an increasing number of
independent variables has been described in detail elsewhere [29].
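The idea of building models with an increasing number of independent variables can be sketched with a generic forward selection loop. This is not the algorithm of Ref. [29] or the procedure of Ref. [30]: it implements only the forward (entry) half of a stepwise scheme, and the function names, the F-to-enter threshold and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def fit_rss(X, y):
    """Ordinary least squares with an intercept; returns the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def forward_select(X, y, max_vars, f_to_enter=4.0):
    """Add, one at a time, the descriptor with the largest partial F statistic."""
    n, p = X.shape
    chosen = []
    rss_current = float(((y - y.mean()) ** 2).sum())  # intercept-only model
    while len(chosen) < max_vars:
        best = None
        for j in range(p):
            if j in chosen:
                continue
            rss = fit_rss(X[:, chosen + [j]], y)
            dof = n - (len(chosen) + 2)  # candidate model: variables + intercept
            f_stat = (rss_current - rss) / (rss / dof)
            if best is None or f_stat > best[1]:
                best = (j, f_stat, rss)
        if best is None or best[1] < f_to_enter:
            break
        chosen.append(best[0])
        rss_current = best[2]
    return chosen


# Synthetic illustration: only columns 0 and 3 carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 5.0 * X_demo[:, 3] + 0.1 * rng.normal(size=200)
selected = forward_select(X_demo, y_demo, max_vars=2)
print(sorted(selected))
```

A full stepwise procedure would also re-test the variables already in the model and remove those whose F statistic falls below a removal threshold, which is what the "F criterion for removal" column of a regression table refers to.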
The targeted representation of the published NBPs was set as a mean standard deviation of relative
errors of 2.1%. This target follows from the observation that the experimental uncertainties in the
DIPPR database for relevant molecules are around 2.1% [1]. Thus, we use the DIPPR estimated
uncertainty as a reasonable figure to aim for in the representation. Ref. [1] suggests also that one
descriptor of a pair with a pairwise correlation ≥ ±0.95 should be discarded. Later work has gone
much below that limit, but for hydrocarbons especially this does not seem to be practical. That is why,
for the present work, the limit of pairwise correlations was set at ±0.85.
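The pairwise correlation limit can be applied mechanically before any regression is attempted. The following is a minimal sketch, assuming that descriptors are screened in a fixed priority order and that the earlier descriptor of a correlated pair is the one retained; the text does not say which member of a pair is discarded, so that rule and the name `drop_collinear` are assumptions.

```python
import numpy as np


def drop_collinear(X, names, limit=0.85):
    """Keep a descriptor only if |r| with every already-kept descriptor is below limit."""
    r = np.corrcoef(X, rowvar=False)  # columns are descriptors
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, k]) < limit for k in keep):
            keep.append(j)
    return [names[j] for j in keep]


# Synthetic illustration: B is almost a multiple of A, C is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
X_demo = np.column_stack([a, 2.0 * a + 0.01 * rng.normal(size=100), rng.normal(size=100)])
kept = drop_collinear(X_demo, ["A", "B", "C"])
print(kept)
```

Constant descriptor columns would give undefined correlations and would need to be removed before this screening.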
As in other similar studies, the compounds in the database were divided into a learning set and a
control set (Table 1). The compounds in the control set were not used in the derivation of the model.
They were chosen mainly from the latest and most successful work on NBPs of hydrocarbons [19].
An attempt was made to predict the boiling points of nearly all compounds from [19] that were
reported to be difficult to predict. Compounds with triple bonds, which are not present in our
database, were omitted. Our control set also includes three terpenoids, the boiling points for which
were among the few obtained only from the work of Bogomolov et al. [23], and could not be
compared with other sources.
3. Results and discussion
The model derived from the learning set with 235 hydrocarbons (M-235) is presented in Table 4. It
can be seen from this table, and the predictions of M-235, given in Table 6, that the discrepancies
from the published values fall even below the average relative deviation targeted in the present study.
In the context of the uncertainties of the DIPPR database [19], and from the point of view of practical
application of this correlation, such precision has no particular merit. However, this success of the
description allows us to address the third objective of the study, which is to reduce the number of
compounds in the learning set, because a reasonable loss of precision can be tolerated.

Table 4
Model (M-235), derived from the learning set with 235 molecules. N = 235; standard residual error = 4.95 K; coefficient of
multiple correlation: 0.999; calculated Fisher's criterion: 24279.2
Independent variable   Coefficient   Standard deviation (±)   F criterion for removal from model
Ctot^0.678             142.51096     1.22788                  9999.99
N^a_CH2                4.96750       0.19820                  628.15
N^a_C                  −8.24519      1.06216                  60.26
N^c_=CH                −3.55218      0.34924                  103.45
Etot                   1.45273       0.10757                  182.38
Ebnd                   −2.05695      0.19640                  109.70
Vvdw                   −1.04224      0.01463                  5075.83
Sunsat                 −0.37320      0.02613                  203.99
Intercept              46.97246      2.20274                  –
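As a worked illustration of how a model of the form of Eq. (1) is evaluated, the sketch below combines the M-235 coefficients of Table 4 with the adamantane descriptor values listed in the caption of Fig. 1. Several things here are assumptions rather than statements from the text: that the 0.678 printed with Ctot in Table 4 is an exponent on Ctot, that all ten carbon atoms of adamantane count as cyclic (so the two aliphatic counts and the cyclic CH= count are zero), and that the descriptor units match those used in the regression. The dictionary keys are hypothetical names.

```python
# Coefficients as read from Table 4 (M-235). Treating 0.678 as an exponent on
# Ctot is an interpretation of the table layout, not a statement from the text.
M235 = {
    "intercept": 46.97246,
    "Ctot_power": 142.51096,
    "N_a_CH2": 4.96750,
    "N_a_C": -8.24519,
    "N_c_dCH": -3.55218,
    "Etot": 1.45273,
    "Ebnd": -2.05695,
    "Vvdw": -1.04224,
    "Sunsat": -0.37320,
}
CTOT_EXPONENT = 0.678
LINEAR_TERMS = ("N_a_CH2", "N_a_C", "N_c_dCH", "Etot", "Ebnd", "Vvdw", "Sunsat")


def predict_nbp(descriptors):
    """Intercept + nonlinear 'mers' term in Ctot + linear descriptor contributions (K)."""
    nbp = M235["intercept"] + M235["Ctot_power"] * descriptors["Ctot"] ** CTOT_EXPONENT
    for key in LINEAR_TERMS:
        nbp += M235[key] * descriptors[key]
    return nbp


# Energy, volume and surface values from the Fig. 1 caption; atom counts assumed.
adamantane = {
    "Ctot": 10,
    "N_a_CH2": 0,
    "N_a_C": 0,
    "N_c_dCH": 0,
    "Etot": 34.616,
    "Ebnd": 10.273,
    "Vvdw": 254.0,
    "Sunsat": 0.0,
}
print(round(predict_nbp(adamantane), 1))
```

Under these assumptions the expression evaluates to roughly 490 K. That figure follows only from the values typed in above and is not quoted from Table 6.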
In the first attempt to reduce the number of the compounds in the learning set an approach widely
used in similar studies (for instance in [1]) was adopted. The compounds left in the learning set were
selected to give a fair representation of the main groups and structures in the database. This approach,
however, cannot be based on anything more than a general perception of homologous series. As such
it might be inadequate for complex hybrid structures, which cannot be assigned to a particular group.
A model was derived from half of the learning set (116 compounds), chosen following the above
approach. It showed a 1.02 K higher standard residual error, but the same set of variables was
selected. The analysis of this model indicated that the key to successful description is the uniformity
of the distribution of the compounds of the learning set over the space occupied by the database,
rather than their number. Our next task then was to find a tool for the statistically based selection of a
learning set, and to determine the minimum number of its members. This task has been addressed by
multivariate analysis of the database and statistical molecular design of the learning set.
3.1. Multivariate analysis and molecular design of the learning set
The input data of the database consist of columns (descriptors) and rows (objects, i.e. molecules). With
the help of multivariate analysis the information contained in the database may be connected to the
dependent variable (NBP) in two ways: either with the columns as independent variables, or
with the principal components (factors) as independent latent variables [31]. The principal components
of the data provide several additional opportunities for data manipulation. When used as latent
variables in the modelling, problems with collinearity between the original variables can be solved.
Partial least squares (PLS) regression may be applied with a smaller number of latent variables. The
latter are usually factors with eigenvalues higher than 1, if cross validation shows that they can
account for a sufficient portion of the variance in the data. The objects (molecules in our case) may be
projected onto the plane formed by the first two most influential principal components, and the points
on the resulting scatterplot will represent the molecules of the database. This suits our above defined
task, since the placement of the molecules of the learning set can be chosen directly from this
scatterplot.
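The projection onto the first two principal components and the "eigenvalue higher than 1" rule can be sketched with a plain singular value decomposition. This is a generic illustration on random numbers, assuming standardized (autoscaled) descriptors so that the eigenvalues are those of the correlation matrix; the function name `pca_scores` is hypothetical and no particular statistics package is implied.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Return correlation-matrix eigenvalues and scores on the leading components."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscale each descriptor
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigvals = s ** 2 / (len(Z) - 1)       # eigenvalues, in descending order
    scores = Z @ Vt[:n_components].T      # coordinates of each molecule
    return eigvals, scores


# Random stand-in for a (molecules x descriptors) matrix.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 5))
eigvals, scores = pca_scores(X_demo)
n_latent = int((eigvals > 1.0).sum())  # factors with eigenvalue higher than 1
print(scores.shape, n_latent)
```

The two columns of `scores` are what would be drawn on the scatterplot; each row corresponds to one molecule.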
Information about the principles of multivariate analysis and its application in characterization of
quantitative structure-property relationships can be found elsewhere [32,33]. Its application for
statistical molecular design is presented in Ref. [34].
Fig. 3 shows the scatterplot of the molecules of the whole database, and the selection of the
molecules in the new learning set. It depicts also the molecules in the control set. It may be seen from
the figure that the database is not ideally balanced. The points are situated approximately in a triangle.
The homologous series are placed on linear loci. The three hypotheticals can be classified as close
outliers, situated close to the right hand side of the triangle, which is formed, however, by only two
molecules with known temperatures. The learning set, selected from the scatterplot, is distributed to
cover the sides and the center of the triangle.
Fig. 3. Selection of the learning set with 20 hydrocarbons (20P LEARN SET) from the scatter plot of the first two principal
components of all molecules (ALL). The molecules in the control set (CONT. SET) are also shown. The numbers and the
names of the molecules selected for the learning and the control sets are given in Table 1.
The learning set presented in Fig. 3 consists of 20 molecules. This may be considered the minimum
number of members of the set as selected by statistical molecular design.
In the derivation of the M-235 model the descriptors were used as independent variables. Factor
analysis of the matrix of values of the independent variables selected for this model showed that
three factors have eigenvalues higher than 1. Together with a fourth factor, with an eigenvalue
close to 1, they account for 95.8% of the variation in the database. If a model with four latent
variables is constructed through PLS regression, the minimum number of experimental data points
follows the 2^k = 16 points requirement of statistical designs (where k is the number of factors),
plus at least four additional points [34]. Applied to our present task, these rules set the minimum
number of points in the learning set at twenty. As already explained, the points should be evenly
distributed to cover the factor space. In molecular design, exact distribution according to a
statistical protocol can rarely be achieved, because data for particular structures may not be
available or may not be accessible experimentally, as also illustrated by Fig. 3.
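The sizing rule quoted above is simple arithmetic and can be written down as a one-line helper. The sketch below only restates the rule as given in the text (2^k design points plus at least four additional points); the function name and the default of four extra points are illustrative choices, not part of Ref. [34].

```python
def minimum_learning_set_size(n_factors, extra_points=4):
    """Minimum number of learning-set molecules: 2**k design points plus extra points."""
    if n_factors < 1:
        raise ValueError("at least one factor is required")
    return 2 ** n_factors + extra_points


if __name__ == "__main__":
    # Four latent variables, as in the model discussed in the text.
    print(minimum_learning_set_size(4))  # 20
```

Because the requirement grows as 2^k, each additional latent variable roughly doubles the minimum size of a designed learning set.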
A model was then derived with the new learning set of 20 molecules. This model (M-20) is
presented in Table 5, and its predictions for the original learning set of 235 molecules are given
in Table 6. It is not really a new model, but a variation of the M-235 model. Its independent
variables are the same as in M-235, and only its coefficients are derived from the set of 20
molecules. These coefficients will be less sensitive to the distribution of data in the database
than the coefficients of M-235. Its standard residual error is 5.46 K, which compares favourably
with the model derived from 116 molecules (standard residual error of 6.07 K).
Table 5
Model derived from the designed learning set with 20 molecules (M-20). N = 20; Standard residual error: 5.46 K.
Coefficient of multiple correlation: 0.999. Calculated Fisher's criterion: 3221.25

Independent variable | Coefficient | Standard deviation | F criterion for removal from model
N_C^0.675 | 141.66807 | 2.83713 | 2493.36
N^a_CH2 | 3.78832 | 0.44837 | 71.39
N^a_C | -9.94330 | 4.29798 | 5.35
N^c_DCH | -3.89729 | 0.99589 | 15.31
E_tot | 1.46310 | 0.24684 | 35.13
E_bnd | -2.36292 | 0.32912 | 51.54
V_vdw | -0.92884 | 0.03353 | 767.43
S_unsat | -0.33671 | 0.08398 | 16.08
Intercept | 40.41170 | 4.93330 | –
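The last column of Table 5 can be cross-checked against the two columns before it. In ordinary least squares the partial F statistic for removing a single variable equals the square of its t-ratio, i.e. the coefficient divided by its standard deviation. Whether the statistical package used here computes the criterion in exactly this way is an assumption of the sketch below; the variable labels are plain-text renderings of the descriptor symbols, and the helper name `f_to_remove` is hypothetical.

```python
# (label, coefficient, standard deviation, F criterion for removal) as listed in Table 5.
TABLE_5 = [
    ("N_C^0.675", 141.66807, 2.83713, 2493.36),
    ("N^a_CH2", 3.78832, 0.44837, 71.39),
    ("N^a_C", -9.94330, 4.29798, 5.35),
    ("N^c_DCH", -3.89729, 0.99589, 15.31),
    ("E_tot", 1.46310, 0.24684, 35.13),
    ("E_bnd", -2.36292, 0.32912, 51.54),
    ("V_vdw", -0.92884, 0.03353, 767.43),
    ("S_unsat", -0.33671, 0.08398, 16.08),
]


def f_to_remove(coefficient, std_dev):
    """Squared t-ratio: partial F statistic for dropping one variable (assumed definition)."""
    return (coefficient / std_dev) ** 2


if __name__ == "__main__":
    for label, coef, sd, f_table in TABLE_5:
        f_calc = f_to_remove(coef, sd)
        print(f"{label:10s} F(calc) = {f_calc:8.2f}   F(Table 5) = {f_table:8.2f}")
```

Under this assumption the recomputed values agree with the tabulated ones to within the rounding of the printed coefficients and standard deviations, which also confirms the row-by-row alignment of the three numeric columns.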
It seems appropriate to repeat here that M-20 is only an improvement of the original M-235
model. The success of this attempt, however, illustrates an approach which may be explored in the
future for establishing correlations from a limited amount of experimental data. This opportunity is
very important, since the databases of the latest similar investigations seem to become larger and
larger. Katritzky et al. [7], for instance, employed 612 compounds to represent heterostructures,
and Wessel and Jurs [19] used 356 compounds in their hydrocarbons-only study.
3.2. Prediction of the control set and hypotheticals
The final objective of the present study was to evaluate the extrapolative predictive power of the
models described above.
The NBPs predicted by M-235 and M-20 for the control set and the "hypothetical" molecules are
presented in Table 7. Values for the boiling points of the hypotheticals, estimated by asymptotic
behaviour correlations [10], are also included. The table shows the published NBPs of all control
compounds, and NBPs calculated from published boiling points of the same compounds under
reduced pressure. The latter were considered more reliable for two reasons. First, some of the
experimental results were obtained more than 50 years ago with the techniques and expertise then
available. Secondly, for compounds whose NBPs are higher than the decomposition temperature, the
measurement must have been performed at reduced pressure and recalculated.
As seen from Table 7, most of the NBPs of the "difficult" compounds are predicted by M-235 and
M-20 with a relative error of less than 5% from the published value, provided that, for several
particular compounds, the calculated boiling points are taken as the reference. Two compounds are
predicted with a distinctly higher error. One of them, adamantane, has a very complex structure
(Fig. 1), which is evidently not well represented in the database. The other, 1,1,2,2-tetraphenylethane,
even with a recalculated NBP falls closer to the aliphatic homologous series (Fig. 2) than to the
non-alkylated aromatics, as might be expected. Its published NBP is drastically different from the
predicted values.
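The error measures used in Tables 6 and 7 can be reproduced with the short sketch below. The sign convention (reference value minus predicted value) and the normalisation of the relative error by the reference temperature are inferred from the tabulated numbers rather than stated explicitly in the text, and the function name `prediction_errors` is illustrative. The two examples use the M-235 entries for n-tetradecane and adamantane from Table 7.

```python
def prediction_errors(t_ref, t_pred):
    """Return (absolute error in K, relative error in %) of a predicted NBP.

    Assumed convention: error = reference value minus predicted value.
    """
    abs_err = t_ref - t_pred
    rel_err = 100.0 * abs_err / t_ref
    return abs_err, rel_err


if __name__ == "__main__":
    examples = {
        "n-tetradecane": (526.7, 516.9),   # published NBP, M-235 prediction (K)
        "adamantane": (461.0, 504.6),
    }
    for name, (t_ref, t_pred) in examples.items():
        abs_err, rel_err = prediction_errors(t_ref, t_pred)
        print(f"{name}: {abs_err:+.1f} K, {rel_err:+.1f} %")
```

With this convention a negative error means that the model overestimates the boiling point, as it does for adamantane.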
Table 6
Predictions of the M-235 and M-20 models for the original learning set of 235 hydrocarbons
Statistics for the predictions of the M-235 model: Mean standard deviation of absolute errors = 4.86 ± 0.35 K; Min absolute
error = -13.4 K; Max absolute error = +15.1 K (7 points out of a ±11 K error range); Mean standard deviation of relative
errors = 1.15 ± 0.07%; Min relative error = -4.35%; Max relative error = +3.69% (9 points out of a ±2.4% error range).
Statistics for the predictions of the M-20 model: Mean standard deviation of absolute errors = 5.51 ± 0.36 K; Min absolute
error = -15.1 K; Max absolute error = +26.3 K (10 points out of a ±11 K error range); Mean standard deviation of
relative errors = 1.27 ± 0.08%; Min relative error = -3.84%; Max relative error = +3.92% (13 points out of a ±2.4% error
range).

No. of compound | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
1 | 184.6 | 184.6 | 184.9 | -0.1 | -0.3 | 0.0 | -0.2
2 | 231.1 | 230.7 | 230.9 | 0.3 | 0.1 | 0.1 | 0.1
3 | 272.7 | 270.4 | 270.4 | 2.2 | 2.2 | 0.8 | 0.8
4 | 309.2 | 307.1 | 306.7 | 2.1 | 2.5 | 0.7 | 0.8
5 | 341.9 | 339.0 | 338.4 | 2.9 | 3.5 | 0.8 | 1.0
6 | 371.6 | 369.0 | 368.0 | 2.6 | 3.6 | 0.7 | 1.0
7 | 398.8 | 395.6 | 394.3 | 3.2 | 4.5 | 0.8 | 1.1
8 | 424.0 | 421.2 | 419.7 | 2.7 | 4.3 | 0.6 | 1.0
9 | 447.3 | 444.1 | 442.3 | 3.2 | 5.0 | 0.7 | 1.1
10 | 469.0 | 462.3 | 460.7 | 6.7 | 8.3 | 1.4 | 1.8
11 | 489.4 | 483.4 | 481.5 | 6.1 | 7.9 | 1.2 | 1.6
12 | 508.6 | 499.1 | 497.5 | 9.5 | 11.1 | 1.9 | 2.2
14 | 543.8 | 534.8 | 532.8 | 9.0 | 10.9 | 1.7 | 2.0
15 | 559.9 | 551.9 | 549.7 | 8.1 | 10.2 | 1.4 | 1.8
16 | 575.3 | 568.0 | 565.7 | 7.3 | 9.6 | 1.3 | 1.7
17 | 589.9 | 583.5 | 581.0 | 6.4 | 8.9 | 1.1 | 1.5
18 | 603.8 | 598.3 | 595.7 | 5.4 | 8.0 | 0.9 | 1.3
19 | 616.9 | 613.4 | 610.5 | 3.5 | 6.4 | 0.6 | 1.0
20 | 629.7 | 624.9 | 622.1 | 4.8 | 7.5 | 0.8 | 1.2
21 | 641.8 | 635.8 | 633.2 | 6.0 | 8.6 | 0.9 | 1.3
22 | 653.4 | 650.4 | 647.5 | 2.9 | 5.9 | 0.4 | 0.9
23 | 664.4 | 664.6 | 661.4 | -0.2 | 3.0 | 0.0 | 0.5
24 | 675.0 | 670.6 | 667.8 | 4.5 | 7.2 | 0.7 | 1.1
25 | 685.4 | 680.6 | 677.9 | 4.8 | 7.5 | 0.7 | 1.1
26 | 695.3 | 691.1 | 688.4 | 4.1 | 6.9 | 0.6 | 1.0
27 | 704.8 | 702.4 | 699.5 | 2.3 | 5.2 | 0.3 | 0.7
28 | 713.9 | 709.1 | 706.6 | 4.8 | 7.4 | 0.7 | 1.0
29 | 722.9 | 718.6 | 716.0 | 4.3 | 6.9 | 0.6 | 0.9
30 | 736.7 | 736.5 | 733.9 | 0.2 | 2.8 | 0.0 | 0.4
31 | 758.8 | 759.7 | 757.3 | -0.9 | 1.5 | -0.1 | 0.2
32 | 765.6 | 764.2 | 762.1 | 1.4 | 3.5 | 0.2 | 0.5
33 | 790.8 | 795.7 | 793.4 | -4.9 | -2.6 | -0.6 | -0.3
34 | 812.7 | 814.2 | 812.8 | -1.5 | -0.1 | -0.2 | 0
35 | 877.5 | 872.3 | 873.9 | 5.2 | 3.6 | 0.6 | 0.4
36 | 261.4 | 261.2 | 263.5 | 0.2 | -2.1 | 0.1 | -0.8
37 | 301.0 | 298.8 | 300.6 | 2.2 | 0.4 | 0.7 | 0.1
38 | 282.6 | 287.4 | 288.4 | -4.8 | -5.8 | -1.7 | -2.1
39 | 333.4 | 331.8 | 333.2 | 1.6 | 0.2 | 0.5 | 0.1
Table 6 (continued)

No. of compound | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
41 | 322.9 | 324.1 | 324.3 | -1.3 | -1.5 | -0.4 | -0.5
42 | 331.1 | 328.7 | 331.8 | 2.5 | -0.6 | 0.7 | -0.2
43 | 363.2 | 363.9 | 364.7 | -0.7 | -1.5 | -0.2 | -0.4
44 | 365.0 | 366.1 | 366.7 | -1.1 | -1.7 | -0.3 | -0.5
45 | 352.3 | 353.2 | 353.1 | -0.9 | -0.8 | -0.2 | -0.2
46 | 362.9 | 360.7 | 363.2 | 2.3 | -0.2 | 0.6 | -0.1
47 | 353.6 | 355.2 | 358.1 | -1.6 | -4.5 | -0.4 | -1.3
48 | 359.2 | 358.6 | 357.9 | 0.6 | 1.3 | 0.2 | 0.3
49 | 366.6 | 368.4 | 368.7 | -1.8 | -2.1 | -0.5 | -0.6
50 | 354.0 | 351.0 | 352.5 | 3.1 | 1.5 | 0.9 | 0.4
51 | 390.8 | 388.4 | 389.2 | 2.4 | 1.6 | 0.6 | 0.4
52 | 392.1 | 392.7 | 393.0 | -0.6 | -0.9 | -0.1 | -0.2
53 | 390.9 | 390.7 | 391.3 | 0.2 | -0.4 | 0.1 | -0.1
54 | 380.0 | 379.7 | 379.5 | 0.3 | 0.5 | 0.1 | 0.1
55 | 388.8 | 387.3 | 389.6 | 1.5 | -0.8 | 0.4 | -0.2
56 | 382.6 | 389.0 | 391.4 | -6.4 | -8.8 | -1.7 | -2.3
57 | 382.3 | 384.2 | 386.8 | -1.9 | -4.5 | -0.5 | -1.2
58 | 385.1 | 384.8 | 383.9 | 0.3 | 1.2 | 0.1 | 0.3
59 | 390.9 | 390.7 | 392.7 | 0.2 | -1.8 | 0.1 | -0.5
60 | 391.7 | 398.9 | 398.8 | -7.2 | -7.1 | -1.8 | -1.8
61 | 383.0 | 383.7 | 384.6 | -0.7 | -1.6 | -0.2 | -0.4
62 | 372.4 | 377.9 | 379.3 | -5.5 | -6.9 | -1.5 | -1.9
64 | 386.6 | 385.9 | 389.6 | 0.7 | -3.0 | 0.2 | -0.8
65 | 388.8 | 390.2 | 392.1 | -1.4 | -3.3 | -0.4 | -0.8
66 | 391.4 | 393.6 | 395.1 | -2.2 | -3.7 | -0.6 | -1.0
67 | 406.8 | 406.3 | 407.2 | 0.5 | -0.4 | 0.1 | -0.1
68 | 399.7 | 406.8 | 407.6 | -7.1 | -7.9 | -1.8 | -2.0
69 | 397.2 | 400.2 | 401.5 | -3.0 | -4.3 | -0.8 | -1.1
70 | 419.3 | 423.2 | 420.5 | -3.9 | -1.1 | -0.9 | -0.3
71 | 413.4 | 405.9 | 405.2 | 7.5 | 8.2 | 1.8 | 2.0
72 | 406.2 | 403.7 | 406.2 | 2.5 | 0.0 | 0.6 | 0.0
74 | 414.7 | 405.6 | 407.8 | 9.1 | 6.9 | 2.2 | 1.7
75 | 416.4 | 413.0 | 413.6 | 3.4 | 2.8 | 0.8 | 0.7
76 | 440.2 | 435.8 | 436.2 | 4.4 | 4.0 | 1.0 | 0.9
77 | 428.9 | 432.9 | 433.1 | -4.1 | -4.3 | -1.0 | -1.0
78 | 433.3 | 427.8 | 427.0 | 5.5 | 6.3 | 1.3 | 1.5
79 | 471.3 | 472.4 | 474.4 | -1.1 | -3.1 | -0.2 | -0.7
80 | 506.8 | 511.9 | 513.3 | -5.1 | -6.5 | -1.0 | -1.3
81 | 526.2 | 525.8 | 528.7 | 0.4 | -2.6 | 0.1 | -0.5
82 | 558.2 | 559.1 | 561.6 | -0.9 | -3.5 | -0.2 | -0.6
86 | 769.0 | 758.6 | 770.5 | 10.4 | -1.5 | 1.4 | -0.2
87 | 225.4 | 231.6 | 232.1 | -6.2 | -6.7 | -2.7 | -3
89 | 303.1 | 304.9 | 305.2 | -1.8 | -2.1 | -0.6 | -0.7
90 | 336.6 | 339.9 | 339.6 | -3.3 | -3.0 | -1.0 | -0.9
91 | 366.8 | 370.0 | 369.3 | -3.2 | -2.6 | -0.9 | -0.7
92 | 394.4 | 396.4 | 395.5 | -1.9 | -1.1 | -0.5 | -0.3
Table 6 (continued)

No. of compound | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
93 | 420.0 | 422.8 | 421.5 | -2.8 | -1.5 | -0.7 | -0.4
94 | 443.8 | 445.4 | 444.0 | -1.6 | -0.2 | -0.4 | 0.0
95 | 465.8 | 466.9 | 465.3 | -1.1 | 0.5 | -0.2 | 0.1
96 | 486.5 | 485.4 | 483.8 | 1.1 | 2.7 | 0.2 | 0.5
97 | 505.9 | 509.5 | 507.2 | -3.6 | -1.3 | -0.7 | -0.3
98 | 524.3 | 525.0 | 522.8 | -0.7 | 1.4 | -0.1 | 0.3
99 | 541.5 | 542.1 | 539.8 | -0.6 | 1.7 | -0.1 | 0.3
100 | 558.0 | 559.5 | 557.0 | -1.5 | 1.0 | -0.3 | 0.2
101 | 573.5 | 573.3 | 570.9 | 0.1 | 2.5 | 0.0 | 0.4
102 | 588.0 | 586.8 | 584.4 | 1.2 | 3.6 | 0.2 | 0.6
103 | 601.7 | 599.5 | 597.2 | 2.2 | 4.5 | 0.4 | 0.7
104 | 615.5 | 612.4 | 610.1 | 3.1 | 5.4 | 0.5 | 0.9
105 | 628.2 | 627.9 | 625.4 | 0.2 | 2.8 | 0.0 | 0.4
106 | 640.4 | 642.3 | 639.5 | -1.9 | 0.9 | -0.3 | 0.1
107 | 652.0 | 652.6 | 649.9 | -0.6 | 2.1 | -0.1 | 0.3
108 | 663.2 | 658.8 | 656.7 | 4.3 | 6.5 | 0.6 | 1.0
109 | 674.3 | 679.0 | 675.9 | -4.8 | -1.6 | -0.7 | -0.2
110 | 684.3 | 689.8 | 686.6 | -5.5 | -2.3 | -0.8 | -0.3
111 | 694.3 | 695.4 | 692.7 | -1.2 | 1.5 | -0.2 | 0.2
112 | 702.2 | 704.7 | 702.1 | -2.6 | 0.1 | -0.4 | 0.0
113 | 713.2 | 717.9 | 714.9 | -4.7 | -1.7 | -0.7 | -0.2
114 | 720.9 | 726.4 | 723.5 | -5.5 | -2.5 | -0.8 | -0.4
115 | 268.7 | 275.1 | 275.8 | -6.4 | -7.1 | -2.4 | -2.6
116 | 276.9 | 267.6 | 268.8 | 9.2 | 8.1 | 3.3 | 2.9
117 | 274.0 | 267.6 | 268.8 | 6.4 | 5.2 | 2.3 | 1.9
118 | 266.3 | 267.2 | 268.4 | -1.0 | -2.2 | -0.4 | -0.8
119 | 307.2 | 310.0 | 311.1 | -2.8 | -3.8 | -0.9 | -1.3
120 | 328.8 | 332.2 | 334.7 | -3.4 | -5.9 | -1.0 | -1.8
121 | 346.4 | 333.6 | 335.4 | 12.8 | 11 | 3.7 | 3.2
122 | 337.8 | 339.2 | 339.4 | -1.4 | -1.6 | -0.4 | -0.5
123 | 342.0 | 337.0 | 337.4 | 5.0 | 4.6 | 1.5 | 1.4
124 | 341.0 | 336.8 | 337.3 | 4.2 | 3.7 | 1.2 | 1.1
125 | 335.3 | 337.1 | 337.7 | -1.9 | -2.5 | -0.6 | -0.7
126 | 327.0 | 330.1 | 332.2 | -3.1 | -5.2 | -1.0 | -1.6
127 | 374.6 | 379.8 | 380.6 | -5.2 | -6 | -1.4 | -1.6
128 | 378.1 | 373.9 | 375.2 | 4.2 | 2.8 | 1.1 | 0.7
129 | 304.3 | 305.1 | 305.9 | -0.8 | -1.6 | -0.3 | -0.5
130 | 311.7 | 301.6 | 302.8 | 10.1 | 8.9 | 3.3 | 2.9
131 | 293.2 | 297.9 | 300.4 | -4.7 | -7.1 | -1.6 | -2.4
132 | 341.9 | 337.4 | 339.4 | 4.6 | 2.5 | 1.3 | 0.7
133 | 314.4 | 321.1 | 322.2 | -6.7 | -7.8 | -2.1 | -2.5
134 | 340.5 | 335.3 | 336.4 | 5.2 | 4.1 | 1.5 | 1.2
135 | 327.3 | 332.8 | 334.6 | -5.4 | -7.3 | -1.7 | -2.2
136 | 332.6 | 340.0 | 340.2 | -7.4 | -7.6 | -2.2 | -2.3
137 | 449.2 | 447.2 | 449.0 | 2.0 | 0.1 | 0.4 | 0.0
138 | 429.3 | 437.4 | 431.0 | -8.1 | -1.7 | -1.9 | -0.4
Table 6 (continued)

No. of compound | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
139 | 708.5 | 699.1 | 708.0 | 9.4 | 0.4 | 1.3 | 0.1
142 | 240.4 | 250.8 | 249.6 | -10.4 | -9.2 | -4.3 | -3.8
143 | 285.7 | 287.8 | 283.3 | -2.1 | 2.4 | -0.7 | 0.8
145 | 353.9 | 349.9 | 351.2 | 4.0 | 2.7 | 1.1 | 0.8
146 | 391.9 | 382.5 | 383.7 | 9.4 | 8.2 | 2.4 | 2.1
147 | 424.3 | 410.6 | 411.5 | 13.7 | 12.8 | 3.2 | 3.0
148 | 374.1 | 377.6 | 379.3 | -3.5 | -5.3 | -0.9 | -1.4
149 | 405.0 | 402.9 | 404.6 | 2.0 | 0.3 | 0.5 | 0.1
150 | 429.9 | 429.6 | 430.9 | 0.3 | -1.0 | 0.1 | -0.2
151 | 454.1 | 452.4 | 453.5 | 1.7 | 0.6 | 0.4 | 0.1
152 | 345.0 | 351.9 | 352.7 | -6.9 | -7.8 | -2.0 | -2.2
153 | 376.6 | 381.3 | 381.8 | -4.7 | -5.2 | -1.2 | -1.4
154 | 404.1 | 409.9 | 410.0 | -5.8 | -5.9 | -1.4 | -1.5
155 | 429.8 | 435.5 | 435.3 | -5.7 | -5.5 | -1.3 | -1.3
156 | 453.7 | 454.2 | 454.2 | -0.5 | -0.5 | -0.1 | -0.1
157 | 476.0 | 476.5 | 476.3 | -0.5 | -0.3 | -0.1 | -0.1
158 | 497.0 | 495.5 | 495.3 | 1.5 | 1.7 | 0.3 | 0.3
159 | 516.7 | 517.5 | 516.7 | -0.8 | 0.0 | -0.2 | 0.0
160 | 535.2 | 536.3 | 535.4 | -1.1 | -0.2 | -0.2 | 0.0
161 | 552.5 | 554.2 | 553.0 | -1.7 | -0.5 | -0.3 | -0.1
162 | 568.9 | 570.2 | 568.9 | -1.3 | 0.0 | -0.2 | 0.0
163 | 584.4 | 585.4 | 584.1 | -1.0 | 0.3 | -0.2 | 0.1
164 | 599.0 | 600.8 | 599.4 | -1.8 | -0.4 | -0.3 | -0.1
165 | 613.2 | 613.5 | 612.1 | -0.3 | 1.1 | 0.0 | 0.2
166 | 625.9 | 626.6 | 625.2 | -0.7 | 0.7 | -0.1 | 0.1
167 | 639.3 | 640.1 | 638.6 | -0.8 | 0.7 | -0.1 | 0.1
168 | 650.3 | 654.0 | 652.5 | -3.7 | -2.2 | -0.6 | -0.3
169 | 661.9 | 665.5 | 663.8 | -3.6 | -1.9 | -0.5 | -0.3
170 | 673.1 | 673.5 | 672.2 | -0.4 | 0.9 | -0.1 | 0.1
171 | 683.1 | 688 | 686.3 | -4.9 | -3.2 | -0.7 | -0.5
172 | 693.1 | 695.9 | 694.4 | -2.8 | -1.3 | -0.4 | -0.2
173 | 703.1 | 706.5 | 704.9 | -3.4 | -1.8 | -0.5 | -0.3
174 | 712.0 | 716.8 | 715.2 | -4.8 | -3.2 | -0.7 | -0.5
175 | 720.4 | 725.4 | 723.9 | -5.0 | -3.5 | -0.7 | -0.5
176f | 729.3 | 732.8 | 731.6 | -3.5 | -2.3 | -0.5 | -0.3
177 | 402.9 | 405.3 | 407.4 | -2.4 | -4.4 | -0.6 | -1.1
178 | 396.6 | 402 | 404.4 | -5.5 | -7.8 | -1.4 | -2
179 | 393.2 | 406.6 | 408.3 | -13.4 | -15.1 | -3.4 | -3.8
180 | 397.6 | 404.5 | 406.6 | -6.9 | -9.0 | -1.7 | -2.3
181 | 397.5 | 402.3 | 404.6 | -4.8 | -7.2 | -1.2 | -1.8
182 | 392.5 | 396.6 | 399.5 | -4.1 | -7.0 | -1.0 | -1.8
183 | 317.4 | 315.4 | 312.9 | 2.0 | 4.5 | 0.6 | 1.4
184 | 356.1 | 346.1 | 345.8 | 10.1 | 10.3 | 2.8 | 2.9
185 | 353.5 | 350.6 | 348.8 | 2.9 | 4.6 | 0.8 | 1.3
186 | 345.9 | 336.3 | 332.4 | 9.7 | 13.6 | 2.8 | 3.9
187 | 314.7 | 307.8 | 303.1 | 6.8 | 11.6 | 2.2 | 3.7
Table 6 (continued)

No. of compound | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
188 | 353.3 | 355.9 | 352.4 | 2.6 | 0.9 | -0.7 | 0.2
189 | 383.8 | 385.5 | 383 | -1.8 | 0.8 | -0.5 | 0.2
190 | 409.3 | 413.8 | 410.9 | -4.5 | -1.6 | -1.1 | -0.4
191 | 432.4 | 436.4 | 433.5 | -4.0 | -1.2 | -0.9 | -0.3
192 | 456.5 | 461.8 | 458.4 | -5.3 | -2.0 | -1.2 | -0.4
193 | 417.6 | 411.7 | 409.6 | 5.9 | 8.0 | 1.4 | 1.9
194 | 412.3 | 406.9 | 405.0 | 5.4 | 7.3 | 1.3 | 1.8
195 | 411.5 | 408.6 | 406.9 | 2.9 | 4.6 | 0.7 | 1.1
196 | 434.5 | 430.4 | 428.4 | 4.1 | 6.1 | 1.0 | 1.4
197 | 478.6 | 480.5 | 477.2 | -1.9 | 1.3 | -0.4 | 0.3
198 | 499.3 | 500.4 | 497.1 | -1.2 | 2.2 | -0.2 | 0.4
199 | 519.2 | 517.9 | 514.6 | 1.2 | 4.6 | 0.2 | 0.9
201 | 555.2 | 554.9 | 551.1 | 0.3 | 4.1 | 0.1 | 0.7
202 | 571.0 | 570.3 | 566.5 | 0.7 | 4.6 | 0.1 | 0.8
203 | 586.4 | 586.4 | 582.4 | -0.1 | 3.9 | 0.0 | 0.7
204 | 600.8 | 600.8 | 596.8 | 0.0 | 4.0 | 0.0 | 0.7
205 | 614.4 | 616.2 | 612 | -1.8 | 2.5 | -0.3 | 0.4
206 | 627.0 | 628.8 | 624.6 | -1.8 | 2.4 | -0.3 | 0.4
207 | 639.3 | 638.4 | 634.5 | 0.8 | 4.7 | 0.1 | 0.7
208 | 650.9 | 646.1 | 642.6 | 4.9 | 8.3 | 0.7 | 1.3
209 | 662.0 | 664.0 | 659.9 | -1.9 | 2.2 | -0.3 | 0.3
210 | 673.2 | 674.8 | 670.7 | -1.7 | 2.4 | -0.3 | 0.4
211 | 683.2 | 685.5 | 681.4 | -2.4 | 1.7 | -0.3 | 0.3
212 | 693.2 | 695.5 | 691.4 | -2.3 | 1.7 | -0.3 | 0.2
213 | 702.0 | 707.7 | 703.4 | -5.6 | -1.4 | -0.8 | -0.2
214 | 710.9 | 718.5 | 714.1 | -7.5 | -3.1 | -1.1 | -0.4
215 | 719.3 | 726.8 | 722.6 | -7.6 | -3.3 | -1.1 | -0.5
216 | 727.0 | 734.1 | 729.5 | -7.1 | -2.5 | -1.0 | -0.3
217 | 418.3 | 415.6 | 413.2 | 2.8 | 5.1 | 0.7 | 1.2
218 | 438.7 | 443.1 | 440.8 | -4.4 | -2.2 | -1.0 | -0.5
219 | 425.6 | 430 | 429.1 | -4.4 | -3.5 | -1.0 | -0.8
220 | 438.3 | 440.1 | 438.1 | -1.8 | 0.2 | -0.4 | 0.0
221 | 435.2 | 436.8 | 435.3 | -1.6 | -0.1 | -0.4 | 0.0
222 | 437.9 | 431.7 | 431.1 | 6.2 | 6.8 | 1.4 | 1.5
223 | 449.3 | 440.4 | 439.6 | 8.9 | 9.7 | 2.0 | 2.2
224 | 442.5 | 436.7 | 436.3 | 5.8 | 6.2 | 1.3 | 1.4
225 | 480.8 | 478.4 | 476.1 | 2.4 | 4.7 | 0.5 | 1.0
226 | 442.3 | 445.9 | 443.7 | -3.6 | -1.4 | -0.8 | -0.3
227 | 450.3 | 452.0 | 452.4 | -1.7 | -2.2 | -0.4 | -0.5
228 | 454.3 | 457.2 | 455.5 | -2.9 | -1.2 | -0.6 | -0.3
229 | 456.6 | 461.6 | 459.8 | -5.0 | -3.1 | -1.1 | -0.7
230 | 445.9 | 441.5 | 441.2 | 4.4 | 4.7 | 1.0 | 1.1
231 | 476.3 | 487.5 | 489.2 | -11.1 | -12.9 | -2.3 | -2.7
232 | 537.4 | 539.3 | 534.2 | -1.9 | 3.3 | -0.3 | 0.6
233 | 434.6 | 436.2 | 434.4 | -1.7 | 0.2 | -0.4 | 0.0
234 | 446.5 | 452.6 | 451.5 | -6.1 | -5.1 | -1.4 | -1.1
Table 6 (continued)

No. of compound | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
235 | 456.9 | 460.5 | 458.8 | -3.6 | -1.8 | -0.8 | -0.4
236 | 483.7 | 491.1 | 492.9 | -7.5 | -9.2 | -1.5 | -1.9
237 | 528.4 | 531.7 | 526.3 | -3.3 | 2.1 | -0.6 | 0.4
238 | 545.8 | 550.0 | 546.8 | -4.2 | -1.0 | -0.8 | -0.2
239 | 553.7 | 557.0 | 551.6 | -3.3 | 2.0 | -0.6 | 0.4
240 | 491.1 | 489.2 | 483.7 | 1.9 | 7.4 | 0.4 | 1.5
241 | 612.9 | 597.8 | 590.6 | 15.1 | 22.3 | 2.5 | 3.6
242 | 612.6 | 600.4 | 593.1 | 12.2 | 19.5 | 2.0 | 3.2
243 | 638.0 | 651.4 | 644.5 | -13.4 | -6.5 | -2.1 | -1
244 | 649.0 | 643.9 | 637.6 | 5.1 | 11.4 | 0.8 | 1.8
246 | 668.0 | 666.4 | 657.6 | 1.6 | 10.4 | 0.2 | 1.6
247 | 714.1 | 701.3 | 691.9 | 12.8 | 22.2 | 1.8 | 3.1
251 | 550.5 | 548.1 | 540.5 | 2.4 | 10.0 | 0.4 | 1.8
In order to illustrate further the problem with the reliability of published NBPs, we have tried to
trace the original source of the NBP of 1,1,2,2-tetraphenylethane. In our database this value
(358–362 °C) was cross-referenced from two papers citing as reference the DIPPR database [19] and
the Beilstein database [22]. Consulting the original Beilstein handbook revealed that this compound
is included in the main work (Hauptwerk) of the series [35]. Two NBPs are recommended:
358–362 °C (uncorrected) and 379–383 °C (corrected). The original experimental determination of
the boiling point of 1,1,2,2-tetraphenylethane was made by Biltz [36], who synthesized and purified
the compound himself in 1897.
The M-20 model provides somewhat better predictions for most of the published NBPs. The values
predicted for the hypotheticals are close to values estimated by asymptotic behaviour correlations
(ABCs), which were developed for rather different homologous series, but are the only other
alternative. It has to be underlined that the three hypotheticals are outliers on the principal
components plot (Fig. 3), so the error in the estimation of their boiling points is expected to be
higher than that within the boundaries of the models.
It should also be noted that separate models for the different groups of hydrocarbons may be
developed following the principles suggested in the present work. Figs. 2 and 3 indicate that
"pseudohomologous" series, including a greater number of available data, may be created. For
instance, alkylaromatics behave as aliphatic hydrocarbons above a particular aliphatic chain length.
Care should be taken when ascribing physical significance to statistically derived correlations for
the contribution of particular descriptors. The latter relies heavily on the particular design of the
database. For instance, the gravitational index, which managed to describe NBPs of hydrocarbons in
the work of Katritzky et al. [7] as a sole independent variable, did not prove to be useful with the
database of the present investigation. Neither did the variations of any of the tested topological
descriptors, although they carry at least part of the molecular information of some of the descriptors
successfully employed by Wessel and Jurs [19].
Our observations suggest that more work is necessary to establish which descriptors might be the
real determinants of the NBPs, and which are only surrogates for more fundamental features of the
molecules.
Table 7
Predictions of the M-235 and M-20 models for the control set and "hypothetical" molecules
Statistics for the predictions of the M-235 model: Mean standard deviation of absolute errors = 18.64 ± 3.89 K; Min absolute error = -53.6 K; Max absolute error = +24.8 K
(7 points out of a ±11 K error range); Mean standard deviation of relative errors = 3.23 ± 0.67%; Min relative error = -9.4%; Max relative error = +3.8% (2 points out of a
±5.0% error range).
Statistics for the predictions of the M-20 model: Mean standard deviation of absolute errors = 16.34 ± 3.41 K; Min absolute error = -48.5 K; Max absolute error = +20.3 K (10
points out of a ±11 K error range); Mean standard deviation of relative errors = 2.89 ± 0.60%; Min relative error = -8.7%; Max relative error = +3.2% (2 points out of a
±5.0% error range).

No. | Name | Tcalc (K) | Tpubl (K) | Tpred M-235 (K) | Tpred M-20 (K) | Absolute error M-235 (K) | Absolute error M-20 (K) | Relative error M-235 (%) | Relative error M-20 (%)
13 | n-tetradecane | | 526.7 | 516.9 | 515.2 | 9.8 | 11.5 | 1.9 | 2.2
40 | 3-methylpentane | | 336.4 | 333.9 | 335.1 | 2.5 | 1.3 | 0.7 | 0.4
63 | 2,3,3-trimethylpentane | | 387.9 | 382.9 | 383.7 | 5.0 | 4.2 | 1.3 | 1.1
73 | 2,2,4,4-tetramethylpentane | | 395.4 | 393.7 | 393 | 1.7 | 2.4 | 0.4 | 0.6
83 | pristane | | 604.3 | 582.9 | 587.1 | 21.6 | 17.2 | 3.6 | 2.8
84 | phytane | | 625.6 | 601.9 | 605.3 | 23.7 | 20.3 | 3.8 | 3.2
85 | squalane | | 720.0 | 695.2 | 702.9 | 24.8 | 17.1 | 3.4 | 2.4
88 | 1-butene | | 266.9 | 270.3 | 270.8 | -3.4 | -3.9 | -1.3 | -1.5
140 | lycopene | 795.8 (a) | – | 766.0 | 788.6 | | | |
141 | β-carotene | 831.4 (b) | – | 818.4 | 839.2 | | | |
144 | cyclopentane | | 322.4 | 324.9 | 325.0 | -2.5 | -2.6 | -0.8 | -0.8
200 | n-octylbenzene | | 537.5 | 536.5 | 533.0 | 1.0 | 4.5 | 0.2 | 0.8
245 | 1,2-benzo[a]pyrene | 778.28 (c) | – | 773.5 | 763.6 | | | |
248 | o-terphenyl | 637.5 (d) | 610.6 | 654.6 | 648.0 | -9.2 | -2.6 | -1.4 | -0.4
249 | triphenylmethane | | 632.1 | 651.8 | 646.4 | -19.7 | -14.3 | -3.1 | -2.3
250 | acenaphtalene | 552.0 (d) | 543.1 | 548.1 | 537.0 | -0.1 | 11.0 | 0.0 | 2.0
252 | 1,1,2,2-tetraphenylethane | 697.7 (d) | 633.1 | 751.5 | 746.4 | -53.6 | -48.5 | -7.7 | -7.0
253 | 4-methyloctane | | 415.6 | 412.1 | 412.8 | 3.5 | 2.8 | 0.8 | 0.7
254 | 2,2,3,3-tetramethylbutane | | 379.4 | 374.3 | 374.4 | 5.1 | 5.0 | 1.3 | 1.3
255 | 2-ethyl-1-hexene | | 393.1 | 397.2 | 397.0 | -4.1 | -3.9 | -1.0 | -1.0
256 | adamantane | | 461.0 | 504.6 | 501.2 | -43.6 | -40.2 | -9.5 | -8.7
257 | 1,5-cyclooctadiene | 409.9 (d) | 423.3 | 415.7 | 414.6 | -5.8 | -4.7 | -1.4 | 1.1
258 | 2,5-dimethyl-1,5-hexadiene | | 387.4 | 380.8 | 383.2 | 6.6 | 4.2 | 1.7 | 1.1
259 | cis-1-propenylbenzene | 441.9 (d) | 452 | 436.9 | 434.5 | 5.0 | 7.4 | 1.1 | 1.7
260 | 1-phenylnaphthalene | 621.0 (d) | 607.1 | 643.6 | 634 | -22.6 | -13.0 | -3.6 | -2.1
261 | indane | | 451.1 | 451.4 | 447.1 | -0.3 | 4.0 | -0.1 | 0.9

(a) Calculated with an Asymptotic Behaviour Correlation [10]. The original correlation is for n-1-alkenes.
(b) Calculated with an Asymptotic Behaviour Correlation [10]. The original correlation is for alkylcyclohexanes.
(c) Calculated from a fit of the NBPs of the rings-only aromatic hydrocarbons as a function of Ctot^0.70.
(d) Calculated from published boiling temperatures at reduced pressure. Considered more reliable and used for the determination of errors.
4. Conclusions
The present work contributes a correlation to the very challenging and important investigations of
the quantitative relation between the molecular structure and the functional properties of chemical
compounds, which has been a fundamental task of chemistry and chemical engineering for many
years.
Its main features, as perceived by the present authors, are its relative simplicity, its reliable
predictions of the NBPs, and its applicability to diversified industrially important hydrocarbon
structures within a widely spanned range of NBPs and number of carbon atoms.
An achievement of particular interest in the present work is the revealed opportunity for the
limitation of the learning set through multivariate analysis and molecular design.
The molecular mechanics simulation employed in this study is viewed by the present authors as a
potential tool for incorporation in future chemical engineering simulators. It will significantly enhance
the capabilities of the latter for designing processes with chemical reactions and is especially suitable
for optimisation of the composition of the additive products of the chemical industry [37].
Furthermore, it allows straightforward, but correct, input of complex chemical structures by drawing
them directly on the monitor. However, from the point of view of the chemical engineer as the user
of such programmes, the benefits of the sophistication cannot be easily appreciated. On the one hand,
the sophistication requires an in-depth knowledge of the quantum chemistry of the particular
structures, which is not so readily available for the structures targeted by chemical engineers. On the
other hand, in many cases of engineering importance the sophistication and high accuracy may not be
justified, since simple group contribution correlations still work successfully for particular problems.
The appropriate level of sophistication will differ between the common chemical engineering
applications and can only be determined by systematic studies of the influence of uncertainties on key
parameters.
The high accuracy achieved by the correlation opens up a possibility for systematic studies of
chemical engineering applications in which the effects of small changes are important. This also
outlines a path towards the more general problem of the influence of uncertainties in calculated
thermophysical parameters on the final solution of computer aided simulation and design.
Acknowledgements
The present authors acknowledge with gratitude the financial support of The Royal Society.
References
[1] L.M. Egolf, M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comput. Sci. 34 (1994) 947–956.
[2] S.J. Grigoras, Comput. Chem. 11 (1990) 593–610.
[3] W.J. Lyman, W.F. Reehl, D.H. Rosenblatt, Handbook of Chemical Property Estimation Methods, McGraw-Hill, New York, 1982.
[4] R.C. Reid, J.M. Prausnitz, B.E. Poling, Properties of Gases and Liquids, 4th edn., McGraw-Hill, New York, 1987.
[5] C.H. Fisher, J. Am. Oil Chem. Soc. 67 (1990) 101–102.
[6] R.C. Mebane, C.D. Williams, T.R. Rybolt, Fluid Phase Equilibria 124 (1996) 111–122.
[7] A.R. Katritzky, V.S. Lobanov, M. Karelson, J. Chem. Inf. Comput. Sci. 38 (1998) 28–41.
[8] A. Kreglewski, B.J. Zwolinski, J. Phys. Chem. 65 (1961) 1050–1052.
[9] K.A. Gasem, C.H. Ross, R.L. Robinson Jr., Can. J. Chem. Eng. 77 (1993) 805–816.
[10] J.J. Marano, G.D. Holder, Ind. Eng. Chem. Res. 36 (1997) 1887–1894.
[11] J.J. Marano, G.D. Holder, Ind. Eng. Chem. Res. 36 (1997) 1895–1907.
[12] A.L. Horvath, Molecular Design, Elsevier, Amsterdam, 1992.
[13] M. Karelson, Adv. Quant. Chem. 28 (1997) 141–157.
[14] M. Kurata, S. Ishida, J. Chem. Phys. 23 (1955) 1126–1131.
[15] I.C. Sanchez, R.H. Lacombe, J. Phys. Chem. 80 (1976) 2352–2362.
[16] I.C. Sanchez, R.H. Lacombe, Macromolecules 11 (1978) 1145–1156.
[17] P.J. Flory, R.A. Orwoll, A. Vrij, J. Am. Chem. Soc. 86 (1964) 3507–3514.
[18] A. Vetere, Fluid Phase Equilibria 124 (1996) 15–29.
[19] M.D. Wessel, P.C. Jurs, J. Chem. Inf. Comp. Sci. 35 (1995) 68–76.
[20] J. Buckingham, S.M. Donaghy (Eds.), Dictionary of Organic Compounds, 5th ed., Chapman and Hall, New York, 1982.
[21] API Technical Data Book: Petroleum Refining, 4th edn., American Petroleum Institute, Washington DC, 1983.
[22] Iu.V. Pokonova, A.A. Gaile, V.G. Spirkin, Chemistry of Petroleum, Himia, Leningrad, 1984 (in Russian).
[23] A.I. Bogomolov, A.A. Gaile, V.V. Gromova, Chemistry of Petroleum and Gas, Himia, Leningrad, 1989 (in Russian).
[24] TRC (Thermodynamic Research Center), TRC Thermodynamic Tables–Hydrocarbons, The Texas A&M University, College Station, TX, USA, 1997 revision.
[25] A.S. Teja, R.J. Lee, D. Rosenthal, M. Anselme, Fluid Phase Equilibria 56 (1990) 153–169.
[26] PCMODEL, 5th edn., Serena Software, Bloomington, IN, USA, 1992.
[27] M. Randic, B. Jerman-Blazic, N. Trinajstic, Comput. Chem. 14 (1990) 237–246.
[28] J.K. Labanowski, I. Motoc, R.A. Damkoehler, Comp. Chem. 15 (1) (1991) 47–53.
[29] A.R. Katritzky, L. Mu, V.S. Lobanov, M. Karelson, J. Phys. Chem. 100 (1996) 10400–10407.
[30] STATGRAPHICS for DOS, 7th edn., STSC, Inc. and Manugistics, Inc., Rockville, MD, USA.
[31] P. Geladi, M.-L. Tosato, in: W. Karcher, J. Devillers (Eds.), Practical Applications of Quantitative Structure–Activity Relationships (QSAR) in Environmental Chemistry and Toxicology, Kluwer Acad. Publ., Dordrecht, 1990, pp. 170–179.
[32] S. Wold, K. Esbensen, P. Geladi, Chemometrics and Intelligent Laboratory Systems 2 (1987) 37–52.
[33] P. Geladi, B. Kowalski, Anal. Chim. Acta 185 (1986) 1–17.
[34] M.-L. Tosato, P. Geladi, in: W. Karcher, J. Devillers (Eds.), Practical Applications of Quantitative Structure–Activity Relationships (QSAR) in Environmental Chemistry and Toxicology, Kluwer Acad. Publ., Dordrecht, 1990, pp. 317–341.
[35] Beilsteins Handbuch der Organischen Chemie, 4th edn., H, bd. V, Springer Verlag, 1933, p. 739.
[36] H. Biltz, Liebigs Annalen der Chemie 296 (1897) 221.
[37] G.S. Cholakov, K.G. Stanulov, P.A. Devenski, H.A. Iontchev, Wear 216 (2) (1998) 194–201.