Mean Opinion Score (MOS) has become a very popular indicator of perceived media quality. While there is a clear benefit to such a “reference quality indicator” and its widespread acceptance, MOS is often applied without sufficient consideration of its scope or limitations.
This page gives a simple introduction to how MOS is measured.
The International Telecommunication Union (ITU) has defined the opinion score as the “value on a predefined scale that a subject assigns to his opinion of the performance of a system”.*
The Mean Opinion Score (MOS) is the average of these scores across subjects.
*ITU-T Rec. P.10 (2006), Vocabulary for performance and quality of service.
In telecommunications engineering, MOS is used as a measure of voice quality that captures users’ opinion of call quality. It is defined as the arithmetic mean of all the individual values that subjects assign, on a predefined scale, to their opinion of a system’s performance.
Such ratings are usually gathered in a subjective quality evaluation test, but they can also be algorithmically estimated.
The test is widely used in VoIP networks to:
- ensure the quality of voice transmission
- test for quality issues
- provide a metric for measuring voice degradation and performance
Rating scales and mathematical definition
Comparing the reference signal with the received signal (e.g. by using the PESQ or POLQA standard) is the most accurate way to calculate a value for MOS, but this approach is most useful in cases where accuracy matters more than measurement speed and scalability.
The MOS is expressed as a single rational number, typically in the range 1–5, where 1 is lowest perceived quality, and 5 is the highest perceived quality. Other MOS ranges are also possible, depending on the rating scale that has been used in the underlying test.
The Absolute Category Rating (ACR) scale is very commonly used. It maps ratings between Bad and Excellent to numbers between 1 and 5: 5 = Excellent, 4 = Good, 3 = Fair, 2 = Poor, 1 = Bad.
Other standardized quality rating scales exist in ITU-T recommendations (such as P.800 or P.910).
For example, one could use a continuous scale ranging between 1–100. Which scale is used depends on the purpose of the test. In certain contexts, there are no statistically significant differences between ratings for the same stimuli when they are obtained using different scales.
The MOS is calculated as the arithmetic mean of the individual ratings given by human subjects to a given stimulus in a subjective quality evaluation test. Thus:
\[ \mathrm{MOS} = \frac{1}{N} \sum_{n=1}^{N} R_n \]
where R_n is the individual rating given to the stimulus by subject n, and N is the number of subjects.
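As a minimal sketch of the arithmetic mean described above, the following Python snippet computes a MOS from a list of ratings. The function name, the range check, and the sample ratings are illustrative assumptions, not part of any ITU-T recommendation.

```python
def mean_opinion_score(ratings, scale_min=1, scale_max=5):
    """Return the arithmetic mean of the individual ratings R_n.

    scale_min and scale_max describe the rating scale used in the test
    (assumed here to be the 1-5 ACR scale).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not scale_min <= r <= scale_max:
            raise ValueError(f"rating {r} is outside the scale")
    return sum(ratings) / len(ratings)


# Hypothetical ratings from N = 8 subjects for one stimulus.
ratings = [4, 5, 3, 4, 4, 2, 5, 4]
mos = mean_opinion_score(ratings)
print(f"MOS = {mos}")  # prints "MOS = 3.875"
```

Because the result is an average of integers, it is generally not itself an integer, which is why the MOS is reported as a single number rather than as a category.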
How is the MOS measured?
Because call quality is highly subjective, there are several ways in which the score can be assessed.
Human raters are by far the most effective way to score MOS, but they are not always the most practical one for a network of any significant size.
In many cases, modern tests rely heavily on algorithms that perform an objective comparison between the reference signal and the received signal in order to calculate a value for MOS-LQO (Listening Quality Objective). This approach is most useful in cases where accuracy matters more than measurement speed.
The most recent standard used to measure MOS is Perceptual Objective Listening Quality Assessment (POLQA), standardized as ITU-T Recommendation P.863.
This algorithm simulates subjects that rate the quality of a speech signal in a listening test using a five-point opinion scale like ACR. It is suited for distortions such as linear frequency response distortions, time stretching/compression as found in Voice-over-IP, certain types of codec distortions, reverberations, and the impact of playback volume.
The basic approach of POLQA is the one mentioned earlier: a reference input signal (talker side) and a degraded output signal (listener side) are each mapped onto an internal representation using a model of human perception. The difference between the two representations is then used by a model to predict the perceived speech quality of the degraded signal.
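The structure of this full-reference approach can be illustrated with a toy sketch. This is not POLQA: the "internal representation" here is simply the per-frame log energy, and the mapping from distance to a 1–5 score is a made-up exponential. All function names, the frame length, and the constants are assumptions chosen for illustration only.

```python
import math


def internal_representation(samples, frame_len=160):
    """Toy stand-in for a perceptual model: per-frame log energy in dB."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [10 * math.log10(sum(x * x for x in f) / frame_len + 1e-12)
            for f in frames]


def predict_quality(reference, degraded):
    """Map the distance between the two representations to a 1-5 score."""
    ref_rep = internal_representation(reference)
    deg_rep = internal_representation(degraded)
    n = min(len(ref_rep), len(deg_rep))
    if n == 0:
        raise ValueError("signals are shorter than one frame")
    # Mean absolute difference between the two representations.
    distance = sum(abs(r - d) for r, d in zip(ref_rep, deg_rep)) / n
    # Hypothetical mapping: zero distance gives 5, large distance tends to 1.
    return 1.0 + 4.0 * math.exp(-distance / 10.0)


# One second of a 440 Hz tone at 8 kHz, and an attenuated copy of it.
reference = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
degraded = [0.5 * x for x in reference]

print(predict_quality(reference, reference))  # identical signals score 5.0
print(predict_quality(reference, degraded))
```

The point of the sketch is only the shape of the pipeline: both signals go through the same model, and the score is derived from the difference between the two outputs rather than from the degraded signal alone.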
In more detail, the POLQA algorithm starts with a temporal alignment block and a sample rate estimator, which is used to compensate for differences in the sample rate of the signals: if the sample rates differ by more than approximately 1%, the signal with the higher sample rate is downsampled.
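The sample rate rule can be sketched as a small decision function. This is a simplified illustration, not the P.863 implementation: it takes already-estimated sample rates as input, measures the difference relative to the lower rate, and treats the 1% threshold as exact. The function name, return convention, and both of those choices are assumptions.

```python
def plan_resampling(reference_rate_hz, degraded_rate_hz, tolerance=0.01):
    """Decide which signal, if any, should be downsampled.

    Returns (signal_name, target_rate_hz), or (None, None) when the two
    rates are within the tolerance and no resampling is needed.
    """
    if reference_rate_hz <= 0 or degraded_rate_hz <= 0:
        raise ValueError("sample rates must be positive")
    low = min(reference_rate_hz, degraded_rate_hz)
    high = max(reference_rate_hz, degraded_rate_hz)
    # Relative difference, measured against the lower rate (an assumption).
    if (high - low) / low <= tolerance:
        return None, None
    # The signal with the higher rate is brought down to the lower rate.
    if reference_rate_hz > degraded_rate_hz:
        return "reference", degraded_rate_hz
    return "degraded", reference_rate_hz


print(plan_resampling(48000, 47900))  # within about 1%: (None, None)
print(plan_resampling(48000, 44100))  # ('reference', 44100)
```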
After this, the core model performs a first calculation using four different “perceptual models” for several types of distortions. The output is the “Disturbance Density”, which is a measure of the perceptibility of distortions in the signals; cognitive effects are not yet considered at this stage.
Cognitive aspects are important, however, when human beings are asked to score the quality of what they perceive. Essentially, these aspects convert the perceptibility measure into an annoyance measure.
This conversion is performed by correcting the “Disturbance Density” for several situations, e.g. with significant level variations, strong timbre, many delay variations, etc.