The need & advent of A.I. scored exams like PTE

Here is a detailed guide on PTE's Scoring methodology!

Automated scoring is no longer a distant dream. The idea of computers doing work once done by humans has been around for a very long time. So the next question is exactly how a computer can measure a person’s command of the English language, and how it can identify features such as communicative skill, oral fluency and written discourse. I will explain all of that below, along with Pearson's automated scoring systems based on Artificial Intelligence and how they score a PTE exam.

PTE exams including PTE Academic come under the Versant line of tests offered by Pearson. Pearson VUE is the company which launched these tests back in 2009. They invented this automated scoring through which language can be measured in a way that doesn't require a human interlocutor.

When assessing an English exam, human interlocutors can react to responses more naturally and dig more deeply into the student's response. They can also take into account both verbal and nonverbal skills, which are an important element of communication. And they can self-correct if they feel that they've asked something that was too difficult for the examinee or was not appropriate. Despite these weighty advantages, human raters also have drawbacks that can overshadow them!

In high-stakes tests such as PTE General, PTE Academic and IELTS, raters are always concerned about two major points:

Validity & quality of the measurements
Maintenance of consistency in scoring

That is why the assessors at Pearson thought of developing automated scoring. Test responses scored by highly trained raters were fed into the software. Once the software was launched, its accuracy kept on increasing over time because of its integration with Artificial Intelligence. Today this technology has made Pearson's Versant Tests some of the most trusted tests in the whole world!

The main reason A.I. scoring became successful is that it is based entirely on human rubric scoring. Human brains are very attuned to language; we respond immediately to accents, to the gender of the person speaking, or to someone's way of speaking, etc. All these ingrained linguistic factors can be harnessed very well when using a human scoring rubric. Furthermore, it was realised that training students to prepare for such computer-based, A.I. scored tests was easier and far more practical.

The results of candidates who took the PTE Academic test were assessed both by artificial intelligence and by human examiners and raters. Machine scores were then compared to the corresponding human scores for the same candidates. Human scores were thought to be the gold standard, probably because there are many areas where human scoring is desirable and seemingly flexible, doing a potentially good job in almost every case! But a major downside was realised: it took a lot of time and resources both to train individuals on how to score and then to score the tests manually.

It was also considered that the test takers may feel more comfortable speaking into a microphone or not having someone watching over them, especially in case of the shy or introverted students. In such cases, a human interlocutor can induce nervousness and anxiety in the test taker, hindering their performance unintentionally. Also, human raters can often have trouble discerning particular dimensions and keeping them separated, resulting in the clumping of ratings together in ways that are not desirable for a fair PTE score.

Then again, there are certain human factors, like an examiner feeling bored, tired or fed up while dealing with examinees. Their concentration can lag and vary from moment to moment. At times they may feel irritable, leaving them inattentive and therefore unable to give reliable, consistent ratings for the PTE test. In short, a human being’s mental and physical limits can affect the scores of a test taker.

So basically, there are multiple tradeoffs between automated scoring, which keeps the scoring dimensions separate, and scoring by human raters. But I would point out that human ratings, despite often being seen as the gold standard, can have undesirable characteristics that are detrimental to a test's reliability. Another factor in favour of automated scoring is that test takers don’t readily accept a broad, impressionistic assessment of their tests, or a marking system based on the intuitions of an examiner driven solely by experience.

To understand this point we can take the example of the IELTS scoring rubric:
The IELTS rubric is a scoring system that goes from zero to nine ‘bands’ based on four different dimensions, although in reality it has a lot of different elements. These bands sometimes make the marking of skills look as if they're all on the same scale. In fact, the descriptors differ from dimension to dimension and interact with many internal aspects of the scale, which makes it quite a complex measuring system.

When we observe human raters’ data, it is noticed that they do not agree on the scales, especially if those scales start to get longer. And even when a relatively simple scoring rubric, like the above discussed IELTS rubric is used, you'll find that even well-trained raters reading the exact same performances can have a certain number of disagreements. So, these are some of the reasons to use automated scoring, through which appropriate and impartial IELTS/PTE scores can be given.
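To make the idea of rater disagreement concrete, here is a minimal sketch of how agreement between two raters could be counted. The band scores are invented for illustration, and the function name agreement_rates is my own; this is not a description of how IELTS or Pearson actually audit their raters.

```python
def agreement_rates(rater_a, rater_b):
    """Return (exact, adjacent) agreement rates for two raters.

    exact:    share of responses where both raters gave the same band
    adjacent: share of responses where the bands differ by at most one
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty score lists of equal length")
    pairs = list(zip(rater_a, rater_b))
    exact = sum(1 for a, b in pairs if a == b) / len(pairs)
    adjacent = sum(1 for a, b in pairs if abs(a - b) <= 1) / len(pairs)
    return exact, adjacent


# Hypothetical band scores given by two trained raters to the same
# eight performances (made-up numbers, for illustration only).
rater_a = [6, 7, 5, 8, 6, 7, 4, 6]
rater_b = [6, 6, 5, 8, 7, 7, 6, 6]

exact, adjacent = agreement_rates(rater_a, rater_b)
print(f"exact agreement: {exact:.3f}, within one band: {adjacent:.3f}")
```

In this made-up sample the two raters give the same band on only five of the eight performances, even though they are within one band of each other on seven of them.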

Now, let’s talk about the basis on which the PTE Academic's Result is given. We will learn about and try to understand the underlying mechanisms, the A.I. powered software engines, and the speech recognition systems that are used to provide scoring in the test.

When the A.I. scores for passages spoken in English were compared with the human ratings given to the same passages, a clear and strong correlation was found between the two types of scoring. In other words, they were found to be virtually interchangeable. So as the machine scores go up for a particular candidate, that candidate's scores on the human scale go up as well.
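The comparison described above can be sketched with a Pearson correlation coefficient, a standard way of checking whether two sets of scores rise and fall together. The scores below are invented for illustration, and I am assuming a simple Pearson correlation here; the source does not say which statistic was actually used.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances (unnormalised; the factors cancel in the ratio).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for the same six candidates (made-up numbers).
human_scores = [42, 55, 61, 70, 78, 85]
machine_scores = [45, 53, 63, 69, 80, 84]

r = pearson_r(human_scores, machine_scores)
print(f"correlation between human and machine scores: {r:.3f}")
```

A value close to 1 means that candidates who score higher with the machine also score higher with the human raters, which is what "virtually interchangeable" refers to.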

Therefore, as the results of automated scoring became more and more trustworthy for English, it was developed for other languages as well. Having started out in low-stakes settings, it is now used in high-stakes certifications, job interviews and skilled migration applications.

"This technology was developed around 15 years ago and has been through rigorous testing ever since, which has helped humans in maturing it into being highly accurate and in consistency with actual human ratings."

An automated scoring system is typically trained on speech data similar to what it is going to be testing and scoring during live use. This is slightly different from how your phone's voice assistant (e.g. Siri) does speech recognition. Often, you'll find that Siri does very well with native speakers, with English-speaking monolingual voices and with a voice that it's very used to. But once you add a foreign accent or a variety of foreign accents, its speech recognition goes way down. In total contrast to this, the Pearson test systems were designed and tuned to include both native and non-native speakers (moderately / highly proficient), making them capable of understanding many different voices and accents.

Different mechanisms cover the testing of a language based on different aspects like grammar, punctuation, coherence, etc. For example, progression of ideas tests how the ideas in a written text develop as it is read. On the spoken side, the content is tested on the basis of other factors like sentence mastery (grammatical and structural accuracy of a sentence), choice of words (vocabulary), accuracy of the short and long phrases used, pronunciation of words, intonation and fluency of the speaker.

In paired or group testing, or in face-to-face interviews, the administrators are humans and the scoring is done by humans as well. In semi-direct testing, certain parts or sections are computer or tape-mediated. And in fully automated testing, every section of a test is computer-mediated and uses software-based speech recognition. Only that can be considered a completely technology-supported format.

In-person interview tests have both pros and cons. People as interlocutors can probe topics (altering topics and negotiating mid-interview) to verify whether the person is using appropriate turn-taking techniques or not. This is the biggest advantage interview tests give us, because it is considered quite appropriate and useful when measuring someone's level of proficiency in a language.
But I believe it ends up hindering the purpose of a test more than it helps. When the interlocutors are being more flexible by probing the students based on their own experience and instincts, it tends to become more of a psycho-linguistically based approach. Therefore, as a countermeasure, the constructs defined in many technical reports are things like measuring someone’s fluency, lexical range, etc.

Measuring someone’s communicative skills in a language and their psycho-linguistic integration of dialogue in that language are two completely different things. However, it has been found that these two measures share an underlying pool for calculating someone's proficiency in a language. If we look at the correlations between them, the two approaches correspond closely with each other. Advances in technology made it possible to develop an automated scoring system that does not depend on such interdependent factors and that controls for them very specifically.

The tasks performed by the algorithm of the PTE Academic test’s A.I.-powered scoring system are:

data collection, followed by
validation of that data, and then
providing a set of values in the form of a result based on that data

This means that there would not be any kind of compensation for not knowing much vocabulary or not being able to speak it correctly, in the given time period.

But we must not forget that this automated system is also continuously improving itself by comparing its own scoring with that of experienced human raters. To put it in simpler terms, the A.I. is imitating the work of a group of expert human raters. But it’s considered to be better because the risks caused by emotional interference in humans are avoided. Every individual has an independent moral compass, which can lead to partiality or inaccuracy in scoring; with automated scoring, this is avoidable. The expertise of humans is still being used, but there is zero discrimination in grading a candidate’s performance.

Let's talk about different question types within each section of the PTE Academic test. This test has been designed to accomplish many of the same things intended in any communicative skill test, but with a far more direct interviewing method, keeping the ideas of probing and topic shifting in consideration as well. And the candidates have to speak/write on somewhat pre-prepared topics, along with performing tasks involving use of multiple skills at once under the listening/speaking sections.

So basically, the test uses an extempore-like approach in many question types, which means that students are not given much time to plan what they're going to say. Therefore, having a prepared response will not always be possible! Some tasks are more akin to an interview test than others, such as looking at and describing something. And the test questions are never limited to a few topics or certain topic types. Instead, it's got quite a variety there. It's a relatively deep test because it has many different aspects to it. It’s not only fun and logical, but it's quite sophisticated as well I think! That is because it requires mastery not only over the English lexicon and its semantics, but an updated knowledge database as well!

A test like TOEFL iBT has six lexical items besides the reading and writing items. And each of those is scored in a variety of ways. And in the IELTS Academic speaking tests, you've got three performances that are scored in four different ways for a total of about 12 kinds of scores. Such a multi-pronged scoring pattern is seen in these tests because a larger number of items greatly contribute to better reliability and consistency of the scores.

In applied linguistics research, there is a task type called elicited imitation, in which one has to repeat a sentence verbatim. It was observed that, when people repeated the same sentences, their pattern of answering tended to get them the same score each time they took the test.

In some cases, people might imitate short prompts just using the acoustic properties of what they heard. This is known as a phonological imitation. And I myself would encourage you to try to do that in the English language, because it is virtually impossible to remember the many minute details all the time if you don't understand a language that well. So, on the face of it, you might feel a bit critical of this kind of approach towards an EIT task (Elicited Imitation task), but you can trust me on the fact that it has a lot of psycho-linguistic validity. Despite what I said earlier about the PTE test, that its target is to test communicative skills, EIT tasks need a little different approach as they were designed with a completely different structure!

Also, it was observed that many people use another language (their native language) to process the information that they are asked to repeat. This also showed that EIT is not a communicative task type per se; it's actually a psycho-linguistic task. It requires one to understand, store and then reproduce sentences. It may seem to be a simple task, but for non-native English speakers it can prove to be quite challenging. That is exactly why it was incorporated within the PTE test format: it gives a pretty good sense of someone's proficiency!

When the candidate speaks into the microphone connected to the software, the spoken words can be represented as a waveform, which is the data to be analysed. The recording is also converted into a spectrum that captures some of the other acoustic phenomena that were picked up. The speech recognition system’s A.I. powered algorithm is able to break that down into recognised words, silences and phonemes when it has to score a sentence on parameters like pronunciation, fluency, accuracy, etc. The way I have explained this as a step-by-step process, it may seem tedious and lengthy. But in reality, all these steps occur at blazing fast speeds and with high accuracy, which keeps improving because the system has been designed to keep on analysing the data for its own growth as well. And if we were to sit and make a list of the parameters to be considered simultaneously, it would become so long that keeping them all in mind at once wouldn't be humanly possible. That is why an automated scoring system is much better, and it is able to discern between the proficiency of two speakers while being completely unbiased!

With due updates and advancements the system has become more and more flexible to understand and recognise that not everyone says everything perfectly, every time. And its given score somewhat reflects the kind of human ratings that would have been given to the person whose performance has been assessed. Even if that person uses an accent not associated with English monolingual speakers, the system can still recognize what they were saying! But if someone ends up speaking the word umbrella as umb-rey-lla, the system would recognize that as a wrong attempt at speaking the word umbrella, and it will be considered a pronunciation error.

To further understand the working of the system, we can consider what cognitive scientists have tried to do when modelling human semantics. The aim is to understand how the spoken content is scored, instead of focusing only on how the speaking of the content is scored. This is known as Latent Semantic Analysis, in which each word and paragraph in the system is represented as a vector, or a list of numerical values, in about 300 dimensions (called 'dots' here, for the sake of simplicity).

Each of those dots represents a 'place' in that 300-dimensional space. Now imagine that one of the dots holds a response which scored a two from human raters. This response had a certain meaning and contained certain kinds of words that were used in the essay or the spoken performance. And those words were not very highly valued by the raters, because the response was given a low score of two.

When a new essay is fed to the system, it will work out the essay's various parameters according to the algorithm, and then formulate the meaning of the content. Then it will search for similar words, sentences and similarly structured content in its semantic space. Then, using Latent Semantic Analysis, it will compare the new data to previously scored data, which lets it finalise its score with a high degree of accuracy. For example, if after calculating all the parameters it finds that the content matches other data that was previously given a score of 6, it will finalise the score on the basis of that comparison!
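The comparison step described above can be sketched as a nearest-neighbour lookup in the semantic space. This is a toy illustration under loud assumptions: the vectors have 3 dimensions instead of about 300, they are typed in by hand rather than derived from real text by Latent Semantic Analysis, and the function names and the choice of averaging the two closest references are mine, not Pearson's.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_score(new_vector, scored_references, k=2):
    """Average the human scores of the k most similar reference responses."""
    ranked = sorted(
        scored_references,
        key=lambda ref: cosine_similarity(new_vector, ref[0]),
        reverse=True,
    )
    nearest = ranked[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Hypothetical 'dots': (vector, human score) pairs already in the system.
reference_responses = [
    ([0.9, 0.1, 0.0], 2),
    ([0.8, 0.3, 0.1], 2),
    ([0.2, 0.9, 0.3], 6),
    ([0.1, 0.8, 0.5], 6),
]

# A new response whose vector lies close to the responses that scored 6.
new_response = [0.15, 0.85, 0.4]
predicted = predict_score(new_response, reference_responses)
print(f"predicted score: {predicted}")
```

Because the new dot lands next to the two responses that human raters scored 6, the sketch assigns it a 6 as well.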

To sum up the whole story in a few points:

Content can be input into the software using either speech recognition or in the form of digital entry (as typed essays).
The software then proceeds to scan all that data in order to formulate its various parameters and meaning.
And finally, it compares it with the other stored content for which it has already assigned scores in the past, to assign scores to the latest data.

‘One of the interesting things about the system is that it doesn't require people to use the exact same words as in the previous example.’ This is not to be confused with EITs, where I talked about repeating exact sentences and how the person has to speak the same words; that's part of the scoring. In that activity, if you use different words, you will get a lower score because those words would be considered errors by the system. On the other hand, the meaning network understands that different words can have the same meaning (synonyms), and how far words differ from one another in meaning can be measured quantitatively. In other words, two different responses on the same topic would get similar scores from a machine scoring system, if we were considering only the literal meaning of the words used.
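The repeat-the-sentence side of this contrast can be sketched by counting word-level errors with a standard edit distance. This is only an assumption about how such a count could be made; the sentences and the function name word_errors are invented, and the sketch ignores pronunciation, timing and punctuation entirely.

```python
def word_errors(prompt, response):
    """Minimum number of word substitutions, insertions and deletions
    needed to turn the response into the prompt (word-level edit distance)."""
    ref = prompt.lower().split()
    hyp = response.lower().split()
    # previous[j] holds the distance between the ref words seen so far
    # (before the current one) and the first j words of the response.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # a prompt word was left out
                current[j - 1] + 1,      # an extra word was added
                previous[j - 1] + cost,  # same word, or a substitution
            ))
        previous = current
    return previous[-1]


# Hypothetical repeat-sentence item (made-up sentences).
prompt = "the library closes at nine on weekdays"
exact_repeat = "the library closes at nine on weekdays"
paraphrase = "the library shuts at nine on weekdays"

print(word_errors(prompt, exact_repeat))  # no differences from the prompt
print(word_errors(prompt, paraphrase))    # 'shuts' counts as one wrong word
```

Here the synonym "shuts" is counted as an error even though the meaning is unchanged, whereas a meaning-based comparison would treat the two sentences as close to each other.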

After a decade of research, various features have been implemented in Pearson's automated scoring systems to make sure that the extent of variability in English is never overlooked. For example, five main scores can be given to any essay style based on its organisation and structural development. In addition, there are word features related to lexical sophistication, based on how appropriately the words are positioned in a sentence and how easily they could be confused with one another.

The system used to score the written performances in Pearson also has another feature where it can count the number of words, but it doesn't necessarily need to include those counts as a feature of scoring. If configured in such a manner, it rather detects the seeding or the planting of big words to get a higher lexical score of sophistication. It would know whether those words were appropriate to be used in that topic or not.

We have discussed just a few examples from the horde of features available for implementation. The system doesn't necessarily look at all of them for every single piece of data that it receives; that depends on the kind of data being processed and the parameters it is to be scored upon. Some of these features are selected for checking and scoring in PTE Academic, and some are not, depending on the type of scoring required.

The LSA feature of the automated scoring system helps look for essays with a similar meaning in its reference database, based on the lexical definitions of the words used in the essay, and this comparison then helps with the scoring. It also flags unusual or off-topic essays, which keeps the process efficient and saves a lot of time and effort. It doesn't require retraining or monitoring; it doesn't drift away from its configuration. It does the same thing over and over again, without any regard to gender or accent. That’s why it is completely impartial compared to human raters, who might be sensitive to many of these factors.
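One way the flagging of off-topic essays could work is to check whether a new essay is close to any on-topic essay in the reference database at all. The sketch below is my own illustration: the 3-dimensional vectors are made up, and the 0.5 threshold is an arbitrary assumption, not a value used by Pearson.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_off_topic(essay_vector, on_topic_vectors, threshold=0.5):
    """Flag an essay whose best match among the on-topic reference
    essays is still below the (assumed) similarity threshold."""
    best = max(cosine_similarity(essay_vector, ref) for ref in on_topic_vectors)
    return best < threshold


# Hypothetical meaning vectors of essays known to answer the prompt.
on_topic_vectors = [
    [0.2, 0.9, 0.3],
    [0.1, 0.8, 0.5],
    [0.3, 0.7, 0.4],
]

typical_essay = [0.2, 0.8, 0.4]  # lies among the reference essays
unusual_essay = [0.9, 0.0, 0.1]  # points somewhere else entirely

print(is_off_topic(typical_essay, on_topic_vectors))
print(is_off_topic(unusual_essay, on_topic_vectors))
```

An essay flagged this way could be set aside for a closer look instead of being scored against references it does not resemble.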

Automatic systems have been demonstrated to be as good as or better than human raters in many cases simply because of their consistency and lack of drift. However, there are some disadvantages to using them as well. In some cases, the task design becomes constrained; you can't assign the prompt of a very open-ended essay and ask the system to score the content by itself. Instead, it needs to know what the content will be and how human raters would assign scores.

In general, we firmly believe that the advantages of using automated systems far outweigh their disadvantages. An automated system always grants scores objectively and with a noticeable level of consistency, rather than going adrift by considering the nuances of factors like humour or sarcasm. That is because these factors haven't been part of its training regimen, and it has never considered, by itself, the aspects that a human tends to pick up on while rating a candidate’s performance.

Marvel Education is an English proficiency Coaching institute that trains students for various high-stakes exams like PTE Academic and IELTS. We provide the same amount of automated scoring accuracy that Pearson provides using our own software that has been developed in-house. The tests are assessed and scored automatically with the assistance of an Artificial Intelligence software. We aim to provide the test taker the same environment as that on the actual test day.

For more such information, visit MarvelPTE.com

Keep practicing. All the best!