Data Mining for Generating Hints in a Python Tutor (Dominguez, Yacef, & Curran, 2010)

Dominguez, A. K., Yacef, K., & Curran, J. R. (2010). Data Mining for Generating Hints in a Python Tutor. In Baker, R.S.J.d., Merceron, A., & Pavlik, P.I. Jr. (Eds.), Proceedings of the 3rd International Conference on Educational Data Mining, 91-100.

The authors present a way to use both past and current student data to generate live hints for students completing programming exercises in a national online programming tutorial and competition (the NCSS Challenge). These hints highlight notes sections or practice questions relevant to the user’s mistake (‘post-failure’) or offer pre-emptive tips to prevent future mistakes (‘pre-emptive’). Clustering, association rules, and numerical analysis were used to find common patterns affecting the learners’ performance, which served as the basis for the hints. During live operation in 2009, student data was mined each week to update the system as it was being used. The hinting system was evaluated through a large-scale experiment with participants of the 2009 NCSS Challenge: users who were provided with hints achieved higher average marks than those who were not, and stayed engaged with the site for longer.

“While the use of data mining to aid the diagnosis of students’ behaviour and ability is common, relatively little work has been done in using data mining to support student problem solving. One system that aims to do this is the Logic-ITA [6], where the system takes into account past associations of student mistakes to provide on-the-fly, proactive feedback to the students.”

Pre-emptive hints are available to weak students before they submit any code for a question. Post-failure hints are provided after a student’s submission has failed. [also possible in Numbaland]

Description of the competition for data analysis purposes: 16,814 submissions were gathered from 712 separate users. There were 25 questions in total (5 per week) available to the students, usually presented in increasing order of difficulty. Students could submit several times for the same question until successful. All attempts were recorded, along with the mark each student eventually obtained for each question.

Creating student clusters:

  • Used the K-Means clustering algorithm from Tada-Ed [9].
  • For each student ID, collated the following attributes for each of the 25 questions: whether the student attempted the question (nominal), whether the student eventually passed the question (nominal), and the marks gained for the question (numeric [0 or 5-10]). Also computed the average numbers of passed and failed questions, and the average number of submissions before the student passed a question.
  • Clustering with these attributes produced three distinct groups: “strong”, “medium” and “weak” students.
  • The effectiveness of clustering with these pre-processed attributes indicated that clustering was a viable technique for discriminating between students.
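The clustering step can be sketched in a few lines. This is a minimal pure-Python K-Means, not the Tada-Ed implementation; the feature vectors (fraction of questions attempted, fraction passed, average submissions before passing) are illustrative stand-ins for the paper's per-question attributes:

```python
import random
from math import dist

def kmeans(points, k, iters=50, seed=0):
    """Minimal K-Means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return centroids, labels

# Hypothetical per-student features: (fraction attempted, fraction passed,
# avg submissions before passing).
students = [(1.0, 0.9, 1.2), (0.95, 0.85, 1.5),
            (0.7, 0.5, 3.0), (0.65, 0.45, 3.5),
            (0.3, 0.1, 5.0), (0.25, 0.05, 6.0)]
centroids, labels = kmeans(students, k=3)
```

With k=3, the resulting clusters correspond to the paper's "strong", "medium" and "weak" groups.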

Creating question clusters:

  • Goals were to find questions that were similar to each other and to group questions by difficulty.
  • Used the K-means algorithm. Similarity-based clusters were extracted using the question metadata (topic tags).
  • Found 5 clusters, as each of the 5 weeks of the Challenge introduced new topics.
  • Difficulty-based clusters of questions were extracted based on the number of students who passed each question and the percentage of students who attempted it that passed it.
  • Found three clusters: “easy”, “medium” and “hard”.
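The two difficulty features are straightforward to derive from submission logs. A sketch with made-up per-question tallies (the question IDs and counts are hypothetical):

```python
# Hypothetical per-question tallies: (students who attempted, students who passed).
question_stats = {"q1": (700, 650), "q2": (600, 300), "q3": (400, 40)}

# The two features used for difficulty clustering: absolute pass count,
# and pass rate among the students who attempted the question.
features = {q: (passed, passed / attempted)
            for q, (attempted, passed) in question_stats.items()}
```

Feeding these two-dimensional vectors to K-Means with k=3 would yield the "easy"/"medium"/"hard" grouping.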

Mining associations in topics:

  • Goal was to find association rules that indicated which topics should be mastered before another question was attempted, so that the hints could suggest topics that students should review before moving onto a more complex one.
  • Mined sequences of tags that students failed on; ordered the students’ results chronologically, and kept an ordered sequence of the tags for each question they made an incorrect submission to. Used these sequences to generate association rules.
  • Lowered the support and confidence thresholds to 20%, and used cosine, possibly a more appropriate interestingness measure for educational data [10].
  • Postprocessed the rules generated by the aPriori algorithm [11] to discard rules with a cosine of less than 0.65 [10] and rules with topics out of the order in which they appeared in the notes. Only retained rules that had two topics in the antecedent and one in the consequent. Finally, manually extracted the rules in which the three topics involved were related to one another to remove trivial rules.
  • Ended up with 83 rules. Ex: students who struggle with basic arithmetic in Python and comparison operators also struggle with how to loop over a set of values; those who struggle with converting to integers and while loops also struggle with stopping after a number of iterations.
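The filtering metrics are easy to compute directly. This sketch evaluates one candidate rule over hypothetical failed-topic sets (the topic tags and data are illustrative, not the paper's); cosine is support(A∪C) / sqrt(support(A) · support(C)):

```python
def rule_stats(transactions, antecedent, consequent):
    """Support, confidence and cosine for the rule antecedent -> consequent."""
    n = len(transactions)
    a = sum(antecedent <= t for t in transactions)           # count containing A
    c = sum(consequent <= t for t in transactions)           # count containing C
    both = sum((antecedent | consequent) <= t for t in transactions)
    support = both / n
    confidence = both / a if a else 0.0
    cosine = both / (a * c) ** 0.5 if a and c else 0.0
    return support, confidence, cosine

# Hypothetical sets of topic tags each student made failed submissions on.
fails = [{"arithmetic", "comparison", "loops"},
         {"arithmetic", "comparison", "loops"},
         {"arithmetic", "strings"},
         {"comparison", "loops"},
         {"strings", "input"}]

s, conf, cos = rule_stats(fails, {"arithmetic", "comparison"}, {"loops"})
keep = cos >= 0.65   # the paper's post-processing cutoff
```

A full run would enumerate candidate rules with aPriori and then apply this cosine cutoff, plus the ordering and relatedness filters described above.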

Numerical Analysis

  • Used to find frequencies and averages for certain aspects of the data. An important measure was the “give-up point,” i.e., the number of wrong submissions a student made to a question before he or she stopped attempting it.
  • For each question, computed the average number of submissions per non-passing student: the total number of submissions made by students who never passed the question, divided by the number of students who attempted it but did not pass. The mean of these per-question averages was 3.7. This was used in the final system as the point at which students were presented with post-failure hints; a student would only receive such hints after making their fourth incorrect submission to a question.
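The mean-of-averages computation above can be sketched directly (the question IDs and submission counts are made up for illustration):

```python
from statistics import mean

# Hypothetical logs: for each question, the submission counts of the
# students who attempted it but never passed.
submissions_by_non_passers = {
    "q1": [3, 5, 2],
    "q2": [4, 4],
    "q3": [6, 2, 4, 4],
}

# Per-question average, then the mean of those averages: the "give-up point".
per_question_avg = {q: mean(counts)
                    for q, counts in submissions_by_non_passers.items()}
give_up_point = mean(per_question_avg.values())
```

Rounding the give-up point up gives the submission number at which post-failure hints kick in (the fourth incorrect submission, given the paper's value of 3.7).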


Evaluation:

  • Learning. The hinted group’s mean score was 4.02 (sd = 2.78), while the control group’s mean score was 3.18 (sd = 2.71). This is a difference of 0.84, i.e. an increase of 26.4%, significant at p < 0.0006 using an Approximate Randomisation test [12]. This test was used because the students’ marks were not normally distributed, making a t-test inappropriate.
  • Engagement. There were consistently more users in the hinted group who made submissions, meaning the hinted group of users had an overall higher level of participation over the five weeks of the Challenge.
  • User experience. 67% of students found the topics “relevant” or “somewhat relevant” and 90% of them found the questions “relevant” or “somewhat relevant.” Therefore, it is clear that as far as the users were concerned, the methods for choosing topics to present were effective. In addition, 71% of students stated they would like more hints.

“These results show that the use of data mining to provide hints as part of the system loop is extremely effective, and can be used to build intelligent systems with much less of the time and cost expenses associated with traditional ITSs.”


[6] Merceron, A. and K. Yacef, Educational Data Mining: a Case Study, in proceedings of Artificial Intelligence in Education (AIED2005), C.-K. Looi, G. McCalla, B. Bredeweg, and J. Breuker, Editors. 2005, IOS Press: Amsterdam, The Netherlands. p. 467-474.

[9] Merceron, A. and K. Yacef, TADA-Ed for Educational Data Mining. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 2005. Volume 7, Number 1.

[10] Merceron, A. and K. Yacef, Interestingness Measures for Association Rules in Educational Data, in proceedings of International Conference on Educational Data Mining. 2008: Montreal, Canada.

[11] Agrawal, R. and R. Srikant, Fast Algorithms for Mining Association Rules, in proceedings of VLDB. 1994: Santiago, Chile.

[12] Chinchor, N., Statistical significance of MUC-6 results, in proceedings of Fourth Message Understanding Conference (MUC-4). 1992. p. 390-395.

