You are generally free to use these datasets in any way you like. Please click on the dataset name to find out more information about it.

All of these data sets are used in the book "Process Improvement using Data".

| Name | Description | Rows | Columns | Tags |
|---|---|---|---|---|
| Aeration rate | The total airflow added to an aeration tank, in litres, during a 1 minute period. | 573 | 1 | univariate, monitoring |
| Ammonia concentration | The ammonia concentration in a liquid stream, measured every 6 hours, from a waste water treatment unit. | 1440 | 1 | univariate |
| Batch yield and purity | The two columns in the data set are: the percentage yield from a batch reactor, and the purity of the feedstock. The feedstock is what we add to the reactor, and the yield is measured after the reaction is completed. The cause-and-effect direction is that the purity of the feedstock has (potential) impact on the yield. | 241 | 2 | multivariate, regression, least-squares |
| Batch yields | The historical percentage yields from a batch reactor for 300 sequential batches. | 300 | 1 | univariate |
| Bioreactor yields | The percentage yield from a bioreactor given the temperature, impeller speed, duration, and whether or not the reactor has baffles. | 14 | 5 | multivariate, categorical, regression |
| Blender efficiency | The effect of 4 factors on blending efficiency. | 18 | 5 | multivariate, doe |
| Brittleness index | A plastic product is produced in three parallel reactors (TK104, TK105, or TK107). For each row in the dataset, we have the same batch of raw material that was split, and fed to the 3 reactors. These values are the brittleness index for the product produced in the reactor. | 15 | 3 | multivariate, missing-data, paired |
| Certificates of analysis | Four properties of an important powder raw material were transcribed from the supplier's certificates of analysis. | 122 | 5 | multivariate, monitoring |
| Cheddar cheese | Concentrations of acetic acid, H2S, and lactic acid in 30 samples of mature cheddar cheese. A subjective taste value is also provided. | 30 | 4 | multivariate, regression |
| Class grades | Grades from a Chemical Engineering course at McMaster University. | 99 | 6 | multivariate, missing-data, regression |
| Distillate flowrate | The flow rate of distillate from the top of a distillation column. | 44640 | 1 | univariate |
| Distillation tower | Snapshot measurements on 27 variables from a distillation column; measured over 2.5 years. | 253 | 27 | multivariate, outliers, regression |
| Electricity usage | Number of kilowatt-hours used in a residential home over a 3.5 month period, 25 November 2011 to 17 March 2012. | 2712 | 5 | univariate, time-series |
| Film thickness | The thickness of a plastic film is measured in 4 positions after being cut. The positions of the measurements are top right, top left, bottom right and bottom left. | 160 | 4 | multivariate |
| Flotation cell | Data from a zinc-lead flotation cell measured on 5 variables; recorded from the PLCs. | 2922 | 5 | multivariate, time-series |
| Food consumption | The relative consumption of certain food items in European and Scandinavian countries. The numbers represent the percentage of the population consuming that food type. | 16 | 20 | multivariate, missing-data |
| Food texture | Texture measurements of a pastry-type food. | 50 | 5 | multivariate |
| Gas furnace | The gas furnace data set from Box and Jenkins' book on Time Series Analysis (series J). Contains the gas rate and the percentage CO2 in the gas. | 296 | 2 | time-series |
| Kamyr digester | Pulp quality is measured by the lignin content remaining in the pulp: the Kappa number. This data set is used to understand which variables in the process influence the Kappa number, and if it can be predicted accurately enough for an inferential sensor application. | 301 | 22 | multivariate, missing-data, time-series |
| Kappa number | The Kappa number from a pulp mill. | 4462 | 1 | univariate, monitoring |
| LDPE | Data from a low-density polyethylene production process. There are 14 process variables and 5 quality variables (last 5 columns). | 54 | 19 | multivariate, outliers, monitoring |
| Oil company DOE | Experimental data; testing amount of 4 materials added (A, B, C, D) in order to achieve a certain volumetric heat capacity, y. | 19 | 5 | multivariate, categorical, doe, regression |
| Paper basis weight | The dry basis weight of paper is a measure of its density. These measurements are from an online scanning gauge, taken 30 seconds apart at a large paper manufacturer. | 231 | 1 | univariate, monitoring, time-series |
| Peas | The taste of 27 pea varieties as measured by judges. After blanching the peas, the peas were quick-frozen, packed into bags and stored for 3 months. | 60 | 17 | multivariate |
| Raw material outcome | Six characterizing measurements for batches of plastic pellets; the outcome when using this material, either Poor or Adequate, is also provided. | 24 | 7 | multivariate, categorical, regression |
| Raw material height in a container | The height of plastic pellets in a tall narrow container, measured over a period of 3 months. | 84 | 2 | univariate, time-series |
| Raw material properties | Measured characteristics of several batches (lots) of plastic pellets. | 36 | 6 | multivariate, missing-data |
| Room temperatures | Temperature measurements, in Kelvin, taken from 4 corners of a room. | 144 | 4 | multivariate |
| Rubber colour | The colour of a rubber product; this example is to demonstrate how to build a monitoring chart. | 100 | 1 | univariate, monitoring |
| Sawdust | Sawdust from birch, pine and spruce were blended in specific ratios. The corresponding NIR spectra are recorded. | 54 | 1204 | multivariate |
| Silicon wafer thickness | Thickness of a single wafer, measured at 9 locations for 184 consecutive batches. A single wafer is removed from a tray of wafers (always at the same position for each batch of wafers) after the chemical vapour deposition process is complete. | 184 | 9 | multivariate, outliers |
| Six-point board thickness | Thickness of 2x6 SPF boards from a saw mill. | 5000 | 6 | multivariate |
| Solvents | Physical properties of various chemical solvents, such as melting point, boiling point, dipole moment, refractive index, density, solubility. | 103 | 9 | multivariate |
| Systematic method | Data are from an open-ended question on a final exam. The values are the grades achieved for the answer to that question, broken down by whether the student used a systematic method, or not. No grades were given for using a systematic method; grades were awarded only on answering the question. | 44 | 2 | multivariate |
| Tablet NIR spectral data | Spectra, measured in the transmittance mode, of 460 pharmaceutical tablets; readings are from 600 to 1898 nm in 2 nm increments. | 460 | 650 | multivariate |
| Travel times | A driver uses an app to track GPS coordinates as he drives to work and back each day. The app collects the location and elevation data. Data for about 200 trips are summarized in this data set. | 205 | 13 | multivariate, missing-data |
| Unlimited time test | The grades from a midterm exam, as well as the time taken by the student to write the exam. It was an "infinite" time midterm, so there was no time pressure to finish within the allocated period. | 80 | 2 | univariate, regression, least-squares |
| Unlimited time test 2 | The grades from a midterm exam, as well as the time taken by the student to write the exam. It was an "infinite" time midterm, so there was no time pressure to finish within the allocated period. The test results were from 2013. | 61 | 2 | univariate, regression, least-squares |
| Unlimited time test 3 | The grades from a midterm exam, as well as the time taken by the student to write the exam. It was an "infinite" time midterm, so there was no time pressure to finish within the allocated period. The test results were from 2013. | 89 | 2 | univariate, regression, least-squares |
| Website traffic | The number of visits to a small website on each day; if a user accesses the site after 30 minutes of inactivity, that will be logged as a new visit. | 214 | 4 | monitoring |
| Wine DOE | Data from a fractional factorial for profiling a new wine. The last 5 columns are the taste values from a panel of judges. Higher values are a better overall taste. | 16 | 13 | multivariate, doe, regression |
| Wood fibres | A sample of aspen tree fibre as characterized by a fibre quality analyzer (FQA). | 25165 | 6 | multivariate |
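The Tablet NIR spectral data entry describes readings from 600 to 1898 nm in 2 nm increments. As a quick arithmetic check, the sketch below counts the wavelengths in that range. It assumes both endpoints are included and that each wavelength would occupy one column; the listing itself does not describe the column layout, and the variable names are illustrative only.

```python
# Wavelength grid taken from the description: 600 to 1898 nm, 2 nm steps.
# Assumption: both endpoints (600 and 1898) are included.
start_nm, stop_nm, step_nm = 600, 1898, 2

# range() excludes its stop value, so extend it by one step.
wavelengths = list(range(start_nm, stop_nm + step_nm, step_nm))
n_wavelengths = len(wavelengths)

print(n_wavelengths)  # 650
```

Under those assumptions the count is 650, the same as the number of columns listed for that data set.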
(c) Kevin Dunn,