BorisTheBrave
« Reply #240 on: September 13, 2015, 11:49:20 PM »
Jimmy - you are (now) reading the files just fine. It's failing to *print* them. Windows console support isn't so great for unicode. https://wiki.python.org/moin/PrintFails should sort you out. Or you could just write to files, rather than print, seeing as you've already got file reading/writing working.
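A minimal sketch of the write-to-files suggestion (file name and sample lines made up): printing non-ASCII text can die on the console's code page, but writing to a UTF-8 encoded file never touches the console.

```python
# Write results to a UTF-8 file instead of printing them, so the
# Windows console's limited code page never gets involved.
lines = ["plain ascii", "na\u00efve", "\u0144"]  # last two are non-ASCII

with open("output.txt", "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line + "\n")
```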
Layl
« Reply #241 on: September 13, 2015, 11:51:10 PM »
Nope. The problem has nothing to do with unicode; you just forgot to escape the backslash again. Python treats "\U" inside a string literal as the start of an escape sequence for putting unicode characters in a string. What you need is an actual backslash followed by a U, so: "\\U".
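To illustrate the point (the path is made up): in a normal string literal, "\U" starts an 8-hex-digit unicode escape, so a plain "C:\Users\..." literal is a SyntaxError. Either double the backslashes or use a raw string; both produce the same string.

```python
# "C:\Users\..." in a plain literal breaks, because \U starts a unicode
# escape. Escape the backslashes by hand, or use a raw string.
escaped = "C:\\Users\\someone\\file.txt"  # doubled backslashes
raw = r"C:\Users\someone\file.txt"        # raw string: backslash is literal
print(escaped == raw)  # True
```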
Dacke
« Reply #242 on: September 13, 2015, 11:51:42 PM »
By default, the console in Microsoft Windows only displays 256 characters (cp437, or "Code page 437", the original 1981 IBM PC extended ASCII character set). Holy shit
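The limitation is easy to demonstrate (a hedged illustration, not from the thread): encode a character that the old code page simply has no slot for.

```python
# "ń" (U+0144) survives UTF-8 fine but does not exist in cp437.
print("\u0144".encode("utf-8"))  # two bytes: b'\xc5\x84'
try:
    "\u0144".encode("cp437")
except UnicodeEncodeError as err:
    print("cp437 cannot encode it:", err)
```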
programming • free software animal liberation • veganism anarcho-communism • intersectionality • feminism
gimymblert
« Reply #243 on: September 14, 2015, 12:04:42 AM »
@layl I use r"text"; it automatically makes the string "raw text" and avoids this problem.
@dacke Surely, but I had no such problem with Blitz3D, which seems to work more like a pass-through instead of bothering me with errors.
I'm learning the horrors of unicode, and this project has led me to the horror of working with someone else's data. I'll surely figure it out in the long run, but it jeopardizes a lot of things; I'll try to look for clues in the original file or elsewhere.
I'll also need to see if I can safely isolate and discard the faulty data; it might not be needed for my main original goal (although I had hoped to go beyond it, given the richness of the data).
Ambitions that were secondary:
- using it as a limited translator, mixing a system like PSO1 with Valve's L4D/TF2 (point at and name tagged objects) --> jeopardized
- an analogy system that makes generalizations --> still possible on a limited dataset
- instant quizzes --> limited
- scribblenaut-like creation with an aided prompt (the system uses the data structure to preselect the parts to assemble)
That's not as ambitious as it sounds; I was aiming for toy-level complexity, not solid AI-like functionality.
That said, they have a python interface using different databases, and I have yet to mine the others (I left the json behind). I wanted to arrange the data into something I understand rather than deal with their inconsistency. FAIL XD
edit: @Boris
Thanks, I'm looking into it as soon as possible!
Dacke
« Reply #244 on: September 14, 2015, 12:16:02 AM »
Quote from: gimymblert
"Surely, but I had no problem with blitz3D, who seems to work more like a pass through instead of bothering me with error."

I'm not exactly sure what you're referring to, but not getting error messages is horrible. If you want to ignore an exception in python, you have to say so explicitly (just like in almost all other modern languages). Having it fail silently leads to horrible bugs that can be almost impossible to find.

edit: but please don't do this just to hide a bug you don't fully understand, that will mess things up royally.

edit 2: also, catching any and all exceptions will lead to problems. The accepted answer here explains why and shows what to do: http://stackoverflow.com/questions/730764/try-except-in-python-how-do-you-properly-ignore-exceptions

Quote from: gimymblert
"I learn the horror of unicode"

Uuuuhhh... I do hope you mean "blessing". Everything before unicode, and especially before utf8, was a friggin' mess. As demonstrated by the fact that the Windows console can't display the character ń (unicode character \u0144). And by the fact that your program went to hell because you had a latin1 file, instead of relying on a universal standard (i.e. utf8).

I think you've gotten confused about what unicode is and how it works in Python. But I can't really tell what you believe, so I can't really correct you either. If you give me a link to the file in question and tell me what data you want, I can probably write a program for you in no time, if you feel too stuck.
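A minimal sketch of the advice in that linked answer (the function name is made up): catch only the exception you actually expect, never a bare except, so every other failure still surfaces loudly.

```python
def parse_int(text):
    """Return the int in text, or None if it isn't one."""
    try:
        return int(text)
    except ValueError:  # only the failure we anticipate; anything else propagates
        return None

print(parse_int("42"))    # 42
print(parse_int("oops"))  # None
```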
« Last Edit: September 14, 2015, 05:45:55 AM by Dacke »
Dacke
« Reply #245 on: September 14, 2015, 12:25:33 AM »
Here is a nice high-level history and explanation from Tom Scott at Computerphile: "UTF-8 and the unicode miracle".
gimymblert
« Reply #246 on: September 14, 2015, 11:27:11 PM »
Quote from: BorisTheBrave
"Jimmy - you are (now) reading the files just fine. It's failing to *print* them. Windows console support isn't so great for unicode. https://wiki.python.org/moin/PrintFails should sort you out. Or you could just write to files, rather than print, seeing as you've already got file reading/writing working."

That didn't work either:

file = open(r"C:\Users\user\Documents\#1 ConceptNet Relations\dictionary.txt", "r", encoding="utf-8")
unique_lines = set()
for line in file:
    unique_lines.add(line.strip())
unique_lines = list(unique_lines)
test_file = open("pyTest.txt", "wb")
for item in unique_lines:
    test_file.write("%s\n" % bytes(item, 'UTF-8'))
test_file.close()

I tried this and many variations of it, and always got:

C:\Python34\python.exe C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py", line 28, in <module>
    test_file.write("%s\n" % bytes(item, 'UTF-8'))
TypeError: 'str' does not support the buffer interface
Process finished with exit code 1

It's not always str; the error depends on the line of code. The internet is telling me that I must cast to bytes, which I did...
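For what it's worth, that TypeError comes from mixing str and bytes: "%s\n" % bytes(...) builds a str again, which a file opened in "wb" mode won't accept. A sketch of two ways around it (sample items made up):

```python
items = ["fox", "mammal", "\u015b"]  # includes the troublesome ś

# Option 1: text mode with an explicit encoding; write str, Python encodes.
with open("pyTest.txt", "w", encoding="utf-8") as f:
    for item in items:
        f.write(item + "\n")

# Option 2: binary mode; encode the whole line to bytes yourself.
with open("pyTest2.txt", "wb") as f:
    for item in items:
        f.write((item + "\n").encode("utf-8"))
```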
« Last Edit: September 14, 2015, 11:44:44 PM by Jimym GIMBERT »
Dacke
« Reply #247 on: September 14, 2015, 11:42:20 PM »
What are you trying to achieve with all that raw/byte stuff? I'm not sure if you're running into more Windows quirks, but just using UTF-8 in every step without specifying anything works perfectly for me:

file = open('utf8lines.txt')
unique_lines = set()
for line in file:
    unique_lines.add(line.strip())
outfile = open('unique-lines.txt', 'w')
for line in unique_lines:
    outfile.write('%s\n' % line)

Quote from: gimymblert
"Internet is telling me that I must cast to byte which I did ..."

In certain cases you may need to. In this case... no. At least not on *nix; who knows with Windows.
« Last Edit: September 14, 2015, 11:51:25 PM by Dacke »
gimymblert
« Reply #248 on: September 14, 2015, 11:51:20 PM »
C:\Python34\python.exe C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py", line 28, in <module>
    test_file.write("%s\n" % item)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015b' in position 7: character maps to <undefined>
Process finished with exit code 1

That's your code (almost) verbatim. I had tried it before, but with "wb" instead of "w".
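The cp1252 in that traceback is the giveaway: on Windows, open() without an encoding argument falls back to the locale code page, which has no ś (U+015B). A sketch of the likely fix, passing encoding="utf-8" to the *output* open() as well:

```python
# Open the output file with an explicit UTF-8 encoding, so writing
# "ś" never goes through the locale's cp1252 codec.
with open("unique-lines.txt", "w", encoding="utf-8") as out:
    out.write("\u015b\n")

# cp1252 itself genuinely cannot represent the character:
try:
    "\u015b".encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 has no slot for U+015B")
```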
Dacke
« Reply #249 on: September 14, 2015, 11:53:29 PM »
Well, remove the "b", then!
You're dealing with utf8, not raw bytes. A character can be anything from 1 to 4 bytes, so you can't treat a utf8 string as single bytes representing single characters.
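A quick demonstration of that variable width (sample characters chosen for illustration):

```python
# One character can take several bytes in UTF-8:
for ch in ["a", "\u015b", "\u20ac", "\U0001d11e"]:  # a, ś, €, 𝄞
    print(repr(ch), len(ch.encode("utf-8")), "bytes")
```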
gimymblert
« Reply #250 on: September 15, 2015, 12:09:25 AM »
I did, but I explained myself incorrectly: I had already tried this format earlier, with the "b". Looking at your code, I removed the cast (again) and the "b" to see if there was a difference; there was none, it's either a buffer error or a codec error. I'm already looking for other solutions, but now I'm kind of aimless.
gimymblert
« Reply #251 on: September 15, 2015, 12:14:17 AM »
I also noticed that when re-running the program and looking at the target txt... it doesn't reset the position (probably because it exits early without closing the file). DAMN, it's a minefield.
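A with-statement sidesteps that whole class of problem: the file is flushed and closed even when the program bails out early with an exception (a generic sketch, file name made up):

```python
# "with" guarantees the file is closed when the block exits,
# even if an exception terminates the program early.
with open("target.txt", "w", encoding="utf-8") as f:
    f.write("line one\n")
print(f.closed)  # True
```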
Dacke
« Reply #252 on: September 15, 2015, 12:22:43 AM »
If the code I gave you doesn't work on your computer, then I can't really help you. I have no issues with the character ś (\u015b) which you get a UnicodeEncodeError for.
I assume it's more Windows BS. Maybe the default encoding is set to something weird, so perhaps you can explicitly set it to UTF-8 in every step?
I do remember Windows being generally finicky when it comes to this kind of programming: bad file handling, weird encoding problems, etc. Maybe someone who uses Windows can help you out.
gimymblert
« Reply #253 on: September 15, 2015, 08:31:42 AM »
I already tried the cast to UTF8.
Currently I have a set of files encoding semantic relations: each text file is one relation, and each line is a tab-separated pair of arguments; an argument can be 1 to n elements, comma-separated (","). So the format is <name of file><argument 1><tab><argument 2><end of line>. For example, say you have the relation "fox > is a > mammal", i.e. isA(fox, mammal): the file would be isA and the line would be fox [tab] mammal.
The thing is, if I could use python and sets, I would be able to put arg1 into one set and arg2 into another, and extract all roots (arguments that only exist in arg2) and leaves (arguments that only exist in arg1) in a single operation.
If anyone is interested, I made the files available here: https://dl.dropboxusercontent.com/u/24530447/divers/%231%20ConceptNet%20Relations.rar
I'll try a different take later.
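The set idea described above is two set differences in python; a sketch on made-up isA lines (not the real ConceptNet data):

```python
# Made-up isA lines in the "arg1<tab>arg2" format described above.
lines = [
    "fox\tmammal",
    "mammal\tanimal",
    "poodle\tdog",
    "dog\tmammal",
]

arg1, arg2 = set(), set()
for line in lines:
    a, b = line.split("\t")
    arg1.add(a)
    arg2.add(b)

roots = arg2 - arg1    # appear only as the second argument
leaves = arg1 - arg2   # appear only as the first argument
print(sorted(roots))   # ['animal']
print(sorted(leaves))  # ['fox', 'poodle']
```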
Dacke
« Reply #254 on: September 15, 2015, 08:52:21 AM »
I didn't quite follow. I understand the format of the files. They're pretty darn silly, though:

Desires.txt:
america [tab] bunch_of_war
ant [tab] find_picnic_basket
beer [tab] respect
child [tab] bounce_tennis_and_golf_ball
child_might [tab] eat_banana
But what is it you're trying to do with this data?
gimymblert
« Reply #255 on: September 15, 2015, 09:09:08 AM »
I'm toying with it; the format is just because I was parsing a bigger set of data (see the previous meltdown with python) to play with and see what I could do. It's used for semantic analysis and various stuff. I'm sure there is some goofiness in the data; that's expected, as it's compiled from different sources, one of which is a human computation game (think captcha for semantic relations). http://conceptnet5.media.mit.edu/

Quote:
"Sources and how to contribute
Previous versions of ConceptNet were a home-grown crowd-sourced project, where we ran a Web site collecting facts from people who came to the site. The Web of Data is much bigger than that now. Our data comes from many different sources, many of which you can contribute to and improve not just the state of computational knowledge, but of human knowledge. To begin with, ConceptNet 5 contains almost all the data from ConceptNet 4, created by contributors to the Open Mind Common Sense project. We connect to a subset of DBPedia, which extracts knowledge from the infoboxes on Wikipedia articles. Much of our knowledge comes from Wiktionary, the free multilingual dictionary, a sister project to Wikipedia. This gives us information about synonyms, antonyms, translations of concepts into hundreds of languages, and multiple labeled word senses for many words. More dictionary-style knowledge comes from WordNet. UMBEL connects ConceptNet to the OpenCyc ontology via a Semantic Web representation. Some knowledge about people's intuitive word associations comes from "games with a purpose". We learn things in English from the GWAP project's word game Verbosity, and in Japanese from nadya.jp."

Silly input generator spotted.

Quote:
"ConceptNet supports linked data: you can download a list of links to the greater Semantic Web, via DBPedia, UMBEL, and RDF/OWL WordNet. For example, our concept cat is linked to the DBPedia node at http://dbpedia.org/resource/Cat."

The idea is to see if I can feed it to various procedural text generators, very simple semantic analysis, and a simple translation system.
EDIT: think "scribblenaut"
Dacke
« Reply #256 on: September 15, 2015, 09:11:05 AM »
I'm asking what you're trying to do at the moment, with the piece of code that you can't get working. You want to find all unique values while discarding the connections? Or what? This is the part I'm asking about:

Quote from: gimymblert
"The bad thing is that If I could use python and set, i would be able to put arg1 into a set, arg2 in another, and extract all roots (argument only exist in arg2) and leaves (argument only exist in arg1) in a single operations"
« Last Edit: September 15, 2015, 09:28:34 AM by Dacke »
gimymblert
« Reply #257 on: September 15, 2015, 04:56:39 PM »
Mmm, sorry! I lost the update when I edited my post... I didn't share the file I was working on, I shared the relation files. This is what I'm working on now: https://dl.dropboxusercontent.com/u/24530447/divers/dictionary.rar
It's all the arguments of all the files, one per line.
- I want to strip duplicates (using a set, then saving the set back into the file).
- Then, using that knowledge, manipulate some relations (the quote is about the "IsA" file) to find roots and leaves and store them in another file (when the structure is hierarchical).
I'm going step by step. Later I will need to replace all of an argument's elements by their indexes in the dictionary file. If possible, by then I would have code to "recompile" the data from scratch (using the database), to extend to all languages and future versions of conceptnet.
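A sketch of that indexing step on made-up lines (names are hypothetical, not the real dictionary file): deduplicate the arguments into a dictionary list, then replace each argument by its position in that list.

```python
# Made-up relation lines in the "arg1<tab>arg2" format.
lines = ["fox\tmammal", "dog\tmammal"]

# Deduplicated, sorted "dictionary" of all arguments.
dictionary = sorted({arg for line in lines for arg in line.split("\t")})
index = {arg: i for i, arg in enumerate(dictionary)}

# Each line becomes a tuple of indexes into the dictionary.
indexed = [tuple(index[a] for a in line.split("\t")) for line in lines]
print(dictionary)  # ['dog', 'fox', 'mammal']
print(indexed)     # [(1, 2), (0, 2)]
```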
valrus
« Reply #258 on: September 15, 2015, 09:27:20 PM »
In Python 3.x, try using the optional encoding argument of open; that usually gets me around all the trouble I used to have in 2.x with unicode encoding errors:

with open('utf8lines.txt', 'r', encoding="utf-8") as infile:
    with open('unique-lines.txt', 'w', encoding="utf-8") as outfile:
        unique_lines = set(line.strip() for line in infile)
        outfile.write("\n".join(unique_lines))

If you're in 2.x, you can get this version of open with "from io import open".
gimymblert
« Reply #259 on: September 15, 2015, 09:58:00 PM »
If you're talking about open('utf8lines.txt', 'r', encoding="utf-8"): yeah, I have done that. I'll try copy-pasting your code as verification, though.