Sort Text File By First Column And Count Repeats Python
I have a text file that needs to be sorted by the first column, with repeated lines merged and their count written to the left of the data, and then the sorted/counted result written into an already existing file.
Solution 1:
D = {}
for k in open('data.txt'):  # use a dictionary to count and filter duplicate lines
    if k in D:
        D[k] += 1  # increment the count if the line was already seen
    else:
        D[k] = 1  # initialize the count the first time a line is seen
with open('test.csv', 'a') as out:  # open the output once, not on every print
    for sk in sorted(D):  # sort the keys
        print(',', D[sk], sk.rstrip(), file=out)  # write a comma, the count, then the line
#Output
, 3, 00.000.00.000, word, 00
, 1, 00.000.00.001, word, 00
, 2, 00.000.00.002, word, 00
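The manual if/else bookkeeping above can also be written with the standard library's collections.Counter, which does the same duplicate counting in one step. A minimal sketch, using an inline list of hypothetical sample lines in place of reading data.txt:

```python
from collections import Counter

# Hypothetical stand-in for the stripped lines of data.txt.
lines = [
    ', 00.000.00.000, word, 00',
    ', 00.000.00.001, word, 00',
    ', 00.000.00.000, word, 00',
]

# Counter replaces the manual if/else dictionary bookkeeping.
counts = Counter(lines)

for line in sorted(counts):  # keys come out in sorted order
    print(',', counts[line], line)
```

In the real script you would feed it `(line.rstrip() for line in open('data.txt'))` instead of the inline list.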
Solution 2:
How about this:
import itertools

lines = ''', 00.000.00.000, word, 00
, 00.000.00.001, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00
, 00.000.00.002, word, 00
, 00.000.00.000, word, 00'''.split('\n')
lines.sort(key=lambda line: line.split(',')[1])
for key, values in itertools.groupby(lines, lambda line: line.split(',')[1]):
    values = list(values)
    print(', %d%s' % (len(values), values[0]))
This lacks any error checking (malformed lines, etc.), but you can add that yourself according to your needs. Also, the split is performed twice, once for the sorting and once for the grouping; that could probably be improved.
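One way to avoid the double split mentioned above is to pair each line with its key once up front, then sort and group on that precomputed key. A sketch under the same assumptions, with a small hypothetical sample in place of the full input:

```python
import itertools

# Hypothetical sample lines in the same shape as the answer above.
lines = [
    ', 00.000.00.002, word, 00',
    ', 00.000.00.000, word, 00',
    ', 00.000.00.000, word, 00',
]

# Pair each line with its sort key so split() runs only once per line.
keyed = [(line.split(',')[1], line) for line in lines]
keyed.sort(key=lambda pair: pair[0])

results = []
for key, group in itertools.groupby(keyed, key=lambda pair: pair[0]):
    group = list(group)
    # group[0][1] is the first original line carrying this key.
    results.append(', %d%s' % (len(group), group[0][1]))

print('\n'.join(results))
```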
Solution 3:
I would consider using the pandas data-analysis library:
import pandas as pd

my_data = pd.read_csv(r"C:\Where My Data Lives\Data.txt", header=None)
sorted_data = my_data.sort_values(by=[1], ascending=True)  # sort the data
sorted_data = sorted_data.drop_duplicates([1])  # keep only unique values, in sorted order
counted_data = list(my_data.groupby(1).size())  # count the unique values and convert to a list
sorted_data[0] = counted_data  # insert the counts into the data frame
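Since the question also asks to write the result out, the same count-and-sort can be done in one groupby and saved with `to_csv`. A minimal sketch on a small inline frame (the column values are hypothetical), assuming the key is the first CSV field:

```python
import pandas as pd

# Hypothetical stand-in for the rows of Data.txt.
my_data = pd.DataFrame({
    0: ['00.000.00.000', '00.000.00.001', '00.000.00.000'],
    1: ['word', 'word', 'word'],
})

# groupby(...).size() counts repeats of the key column; the result is
# already sorted by key, so no separate sort/drop_duplicates is needed.
counted = my_data.groupby(0).size().reset_index(name='count')

# Write the counted rows out to a CSV file.
counted.to_csv('test.csv', index=False, header=False)
```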