2 Replies Latest reply on May 23, 2018 3:42 AM by Jack Brooker

    Removing a large quantity of duplicates from a database

    Jack Brooker

      We're currently in the process of moving our data from a basic windows explorere based system to PDM, and i'm going through the process of sorting our data. Unfortunately, our data is pretty crap to be quite blunt, and around 60-70% of it is duplicate parts.




      The image shows some of our standard part files. If we didn't have that many, there wouldn't be much of an issue with me manually going through it all and removing each duplicate part, but unfortunately this trend continues throughout the entire lot. The main problem isn't that there are vast duplicates of a single part, but rather low quantities of duplicates for seemingly every single part in the folder.

      I'm looking for a way to remove these, I don't have much of a problem writing a script to do the job but i've had problems with running scripts inside PDM before, and i'm not hugely experienced in it. Is there an easier way of doing this which I dont know of within PDM, or possibly a program designed to do this job for me? Keep in mind these parts are spread throughout multiple folders in multiple levels, with little to no structure to it whatsoever


      I've had a look at the Duplicate files thread here as well. It covers re-referencing files which i'm already happy with, I just need to remove the duplicates so I can check in my standard parts and begin re-referencing

        • Re: Removing a large quantity of duplicates from a database
          Ulf Stockburger

          Hi Jack


          If there is a list of files that have to be deleted, this post may help you.

          Batch delete list of files in Excel spreadsheet


          BiI SAP-PDM

            • Re: Removing a large quantity of duplicates from a database
              Jack Brooker

              Not exactly what I had to do, but i'll mark this as the answer anyway because it lead me down the right track to getting it sorted. What I had to do was similar, and i'll write out the exact steps I took in case anybody wants to see it in the future:


              1. I exported the data from PDM to an excel sheet, then copied the data in the sheet so I had two rows of it
              2. At this point I sorted my data so the duplicates would be removed properly - for me that was sorting by the date category in ascending order (so that it removes the oldest date duplicates, and only retains the newest one)
              3. With one of the rows I removed duplicate copies using the Remove Duplicates function within Excel
              4. I compared the two using a simple function from this website and highlight those that weren't in the duplicate removed row
              5. All highlighted files had their file locations run through a simple VBA script I had written to delete them using FileSystemObject


              The vba script I used was very simple, but did the job properly for what I needed:


              Private Sub CommandButton1_Click()
              Dim FileLocation As String
              Dim fso As New Scripting.FileSystemObject
              For i = 2 To 9445
              FileLocation = Range("V" & i)
                  If Range("W" & i) = 0 Then
                  ElseIf Len(Dir(FileLocation)) = 0 Then
                      fso.DeleteFile FileLocation
                  End If
              End Sub


              In this instance I had a total of 9445 parts after removing duplicates, so with row 1 being the header, the script starts at 2 and works its way through. In the row next to the file locations, there was either a 0 or 1 which denoted whether it should be deleted. It checks for a 0, with 0 meaning the file should not be deleted, then checks if the file doesn't exist using Len(Dir(xx)). If both of these return false, it will delete the file with fso.DeleteFile. However if either of them return true, because there is no function following them it will simply jump to "End If".


              In order for FileSystemObject to work, you need ensure that the Microsoft Scripting Runtime reference is enabled inside VBA, found under Tools > References.