BorisTheBrave
« Reply #240 on: September 13, 2015, 11:49:20 PM »
Jimmy - you are (now) reading the files just fine. It's failing to *print* them. Windows console support isn't so great for unicode. https://wiki.python.org/moin/PrintFails should sort you out. Or you could just write to files, rather than print, seeing as you've already got file reading/writing working.
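A minimal sketch of the write-to-files suggestion (file name and sample lines made up): printing non-ASCII text can die on the console's code page, but writing to a UTF-8 encoded file never touches the console.

```python
# Write results to a UTF-8 file instead of printing them, so the
# Windows console's limited code page never gets involved.
lines = ["plain ascii", "na\u00efve", "\u0144"]  # last two are non-ASCII

with open("output.txt", "w", encoding="utf-8") as out:
    for line in lines:
        out.write(line + "\n")
```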
Layl
« Reply #241 on: September 13, 2015, 11:51:10 PM »
Nope. The problem has nothing to do with unicode; you just forgot to escape the backslash again. Python treats "\U" inside a string literal as the start of an escape sequence for putting unicode characters in a string. What you need is an actual backslash followed by a U, so: "\\U".
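To illustrate the point (the path is made up): in a normal string literal, "\U" starts an 8-hex-digit unicode escape, so a plain "C:\Users\..." literal is a SyntaxError. Either double the backslashes or use a raw string; both produce the same string.

```python
# "C:\Users\..." in a plain literal breaks, because \U starts a unicode
# escape. Escape the backslashes by hand, or use a raw string.
escaped = "C:\\Users\\someone\\file.txt"  # doubled backslashes
raw = r"C:\Users\someone\file.txt"        # raw string: backslash is literal
print(escaped == raw)  # True
```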
Dacke
« Reply #242 on: September 13, 2015, 11:51:42 PM »
By default, the console in Microsoft Windows only displays 256 characters (cp437, or "Code page 437", the original 1981 IBM PC extended ASCII character set). Holy shit
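The limitation is easy to demonstrate (a hedged illustration, not from the thread): encode a character that the old code page simply has no slot for.

```python
# "ń" (U+0144) survives UTF-8 fine but does not exist in cp437.
print("\u0144".encode("utf-8"))  # two bytes: b'\xc5\x84'
try:
    "\u0144".encode("cp437")
except UnicodeEncodeError as err:
    print("cp437 cannot encode it:", err)
```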
programming • free software animal liberation • veganism anarcho-communism • intersectionality • feminism
gimymblert
« Reply #243 on: September 14, 2015, 12:04:42 AM »
@layl I use r"text"; it automatically makes the string "raw text" and avoids this problem.
@dacke Surely, but I had no such problem with Blitz3D, which seems to work more like a pass-through instead of bothering me with errors.
I'm learning the horrors of unicode, and this project has led me to the horror of working with someone else's data. I'll surely figure it out in the long run, but it jeopardizes a lot of things; I'll try to look for clues in the original file or elsewhere.
I'll also need to see if I can safely isolate and discard the faulty data; it might not be needed for my main original goal (although I had hoped to go beyond it, given the richness of the data).
Ambitions that were secondary:
- using it as a limited translator, mixing a system like PSO1 with Valve's L4D/TF2 (point at and name tagged objects) --> jeopardized
- an analogy system that makes generalizations --> still possible on a limited dataset
- instant quizzes --> limited
- scribblenaut-like creation with an aided prompt (the system uses the data structure to preselect the parts to assemble)
That's not as ambitious as it sounds; I was aiming for toy-level complexity, not solid AI-like functionality.
That said, they have a python interface using different databases, and I have yet to mine the others (I left the json behind). I wanted to arrange the data into something I understand rather than deal with their inconsistency. FAIL XD
edit: @Boris
Thanks, I'm looking into it as soon as possible!
Dacke
« Reply #244 on: September 14, 2015, 12:16:02 AM »
Quote from: gimymblert
"Surely, but I had no problem with blitz3D, who seems to work more like a pass through instead of bothering me with error."

I'm not exactly sure what you're referring to, but not getting error messages is horrible. If you want to ignore an exception in python, you have to say so explicitly (just like in almost all other modern languages). Having it fail silently leads to horrible bugs that can be almost impossible to find.

edit: but please don't do this just to hide a bug you don't fully understand, that will mess things up royally.

edit 2: also, catching any and all exceptions will lead to problems. The accepted answer here explains why and shows what to do: http://stackoverflow.com/questions/730764/try-except-in-python-how-do-you-properly-ignore-exceptions

Quote from: gimymblert
"I learn the horror of unicode"

Uuuuhhh... I do hope you mean "blessing". Everything before unicode, and especially before utf8, was a friggin' mess. As demonstrated by the fact that the Windows console can't display the character ń (unicode character \u0144). And by the fact that your program went to hell because you had a latin1 file, instead of relying on a universal standard (i.e. utf8).

I think you've gotten confused about what unicode is and how it works in Python. But I can't really tell what you believe, so I can't really correct you either. If you give me a link to the file in question and tell me what data you want, I can probably write a program for you in no time, if you feel too stuck.
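A minimal sketch of the advice in that linked answer (the function name is made up): catch only the exception you actually expect, never a bare except, so every other failure still surfaces loudly.

```python
def parse_int(text):
    """Return the int in text, or None if it isn't one."""
    try:
        return int(text)
    except ValueError:  # only the failure we anticipate; anything else propagates
        return None

print(parse_int("42"))    # 42
print(parse_int("oops"))  # None
```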
« Last Edit: September 14, 2015, 05:45:55 AM by Dacke »
Dacke
« Reply #245 on: September 14, 2015, 12:25:33 AM »
Here is a nice high-level history and explanation from Tom Scott at Computerphile: "UTF-8 and the unicode miracle".
gimymblert
« Reply #246 on: September 14, 2015, 11:27:11 PM »
Quote from: BorisTheBrave
"Jimmy - you are (now) reading the files just fine. It's failing to *print* them. Windows console support isn't so great for unicode. https://wiki.python.org/moin/PrintFails should sort you out. Or you could just write to files, rather than print, seeing as you've already got file reading/writing working."

That didn't work either:

file = open(r"C:\Users\user\Documents\#1 ConceptNet Relations\dictionary.txt", "r", encoding="utf-8")
unique_lines = set()
for line in file:
    unique_lines.add(line.strip())
unique_lines = list(unique_lines)
test_file = open("pyTest.txt", "wb")
for item in unique_lines:
    test_file.write("%s\n" % bytes(item, 'UTF-8'))
test_file.close()

I tried this and many variations of it, and always got:

C:\Python34\python.exe C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py", line 28, in <module>
    test_file.write("%s\n" % bytes(item, 'UTF-8'))
TypeError: 'str' does not support the buffer interface
Process finished with exit code 1

It's not always str; the error depends on the line of code. The internet is telling me that I must cast to bytes, which I did...
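For what it's worth, that TypeError comes from mixing str and bytes: "%s\n" % bytes(...) builds a str again, which a file opened in "wb" mode won't accept. A sketch of two ways around it (sample items made up):

```python
items = ["fox", "mammal", "\u015b"]  # includes the troublesome ś

# Option 1: text mode with an explicit encoding; write str, Python encodes.
with open("pyTest.txt", "w", encoding="utf-8") as f:
    for item in items:
        f.write(item + "\n")

# Option 2: binary mode; encode the whole line to bytes yourself.
with open("pyTest2.txt", "wb") as f:
    for item in items:
        f.write((item + "\n").encode("utf-8"))
```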
« Last Edit: September 14, 2015, 11:44:44 PM by Jimym GIMBERT »
Dacke
« Reply #247 on: September 14, 2015, 11:42:20 PM »
What are you trying to achieve with all that raw/byte stuff? I'm not sure if you're running into more Windows quirks, but just using UTF-8 in every step without specifying anything works perfectly for me:

file = open('utf8lines.txt')
unique_lines = set()
for line in file:
    unique_lines.add(line.strip())
outfile = open('unique-lines.txt', 'w')
for line in unique_lines:
    outfile.write('%s\n' % line)

Quote from: gimymblert
"Internet is telling me that I must cast to byte which I did ..."

In certain cases you may need to. In this case... no. At least not on *nix; who knows with Windows.
« Last Edit: September 14, 2015, 11:51:25 PM by Dacke »
gimymblert
« Reply #248 on: September 14, 2015, 11:51:20 PM »
C:\Python34\python.exe C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py
Traceback (most recent call last):
  File "C:/Users/user/PycharmProjects/hellopython/ParseDictionaryToUnique.py", line 28, in <module>
    test_file.write("%s\n" % item)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u015b' in position 7: character maps to <undefined>
Process finished with exit code 1

That's your code (almost) verbatim. I had tried it before, but with "wb" instead of "w".
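The cp1252 in that traceback is the giveaway: on Windows, open() without an encoding argument falls back to the locale code page, which has no ś (U+015B). A sketch of the likely fix, passing encoding="utf-8" to the *output* open() as well:

```python
# Open the output file with an explicit UTF-8 encoding, so writing
# "ś" never goes through the locale's cp1252 codec.
with open("unique-lines.txt", "w", encoding="utf-8") as out:
    out.write("\u015b\n")

# cp1252 itself genuinely cannot represent the character:
try:
    "\u015b".encode("cp1252")
except UnicodeEncodeError:
    print("cp1252 has no slot for U+015B")
```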
Dacke
« Reply #249 on: September 14, 2015, 11:53:29 PM »
Well, remove the "b", then!
You're dealing with utf8, not raw bytes. A character can be anything from 1 to 4 bytes, so you can't treat a utf8 string as single bytes representing single characters.
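A quick demonstration of that variable width (sample characters chosen for illustration):

```python
# One character can take several bytes in UTF-8:
for ch in ["a", "\u015b", "\u20ac", "\U0001d11e"]:  # a, ś, €, 𝄞
    print(repr(ch), len(ch.encode("utf-8")), "bytes")
```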
gimymblert
« Reply #250 on: September 15, 2015, 12:09:25 AM »
I did, but I explained myself incorrectly: I had already tried this format earlier, with the "b". Looking at your code, I removed the cast (again) and the "b" to see if there was a difference; there was none, it's either a buffer error or a codec error. I'm already looking for other solutions, but now I'm kind of aimless.
gimymblert
« Reply #251 on: September 15, 2015, 12:14:17 AM »
I also noticed that when re-running the program and looking at the target txt... it doesn't reset the position (probably because it exits early without closing the file). DAMN, it's a minefield.
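A with-statement sidesteps that whole class of problem: the file is flushed and closed even when the program bails out early with an exception (a generic sketch, file name made up):

```python
# "with" guarantees the file is closed when the block exits,
# even if an exception terminates the program early.
with open("target.txt", "w", encoding="utf-8") as f:
    f.write("line one\n")
print(f.closed)  # True
```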
Dacke
« Reply #252 on: September 15, 2015, 12:22:43 AM »
If the code I gave you doesn't work on your computer, then I can't really help you. I have no issues with the character ś (\u015b) which you get a UnicodeEncodeError for.
I assume it's more Windows BS. Maybe the default encoding is set to something weird, so perhaps you can explicitly set it to UTF-8 in every step?
I do remember Windows being generally finicky when it comes to this kind of programming: bad file handling, weird encoding problems, etc. Maybe someone who uses Windows can help you out.
gimymblert
« Reply #253 on: September 15, 2015, 08:31:42 AM »
I already tried the cast to UTF8.
Currently I have a set of files encoding semantic relations: each text file is one relation, and each line is a tab-separated pair of arguments; an argument can be 1 to n elements, comma-separated (","). So the format is <name of file><argument 1><tab><argument 2><end of line>. For example, say you have the relation "fox > is a > mammal", i.e. isA(fox, mammal): the file would be isA and the line would be fox [tab] mammal.
The thing is, if I could use python and sets, I would be able to put arg1 into one set and arg2 into another, and extract all roots (arguments that only exist in arg2) and leaves (arguments that only exist in arg1) in a single operation.
If anyone is interested, I made the files available here: https://dl.dropboxusercontent.com/u/24530447/divers/%231%20ConceptNet%20Relations.rar
I'll try a different take later.
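The set idea described above is two set differences in python; a sketch on made-up isA lines (not the real ConceptNet data):

```python
# Made-up isA lines in the "arg1<tab>arg2" format described above.
lines = [
    "fox\tmammal",
    "mammal\tanimal",
    "poodle\tdog",
    "dog\tmammal",
]

arg1, arg2 = set(), set()
for line in lines:
    a, b = line.split("\t")
    arg1.add(a)
    arg2.add(b)

roots = arg2 - arg1    # appear only as the second argument
leaves = arg1 - arg2   # appear only as the first argument
print(sorted(roots))   # ['animal']
print(sorted(leaves))  # ['fox', 'poodle']
```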
Dacke
« Reply #254 on: September 15, 2015, 08:52:21 AM »
I didn't quite follow. I understand the format of the files. They're pretty darn silly, though:

Desires.txt:
america [tab] bunch_of_war
ant [tab] find_picnic_basket
beer [tab] respect
child [tab] bounce_tennis_and_golf_ball
child_might [tab] eat_banana
But what is it you're trying to do with this data?
gimymblert
« Reply #255 on: September 15, 2015, 09:09:08 AM »
I'm toying with it; the format is just because I was parsing a bigger set of data (see the previous meltdown with python) to play with and see what I could do. It's used for semantic analysis and various stuff. I'm sure there is some goofiness in the data; that's expected, as it's compiled from different sources, one of which is a human computation game (think captcha for semantic relations). http://conceptnet5.media.mit.edu/

Quote:
"Sources and how to contribute
Previous versions of ConceptNet were a home-grown crowd-sourced project, where we ran a Web site collecting facts from people who came to the site. The Web of Data is much bigger than that now. Our data comes from many different sources, many of which you can contribute to and improve not just the state of computational knowledge, but of human knowledge. To begin with, ConceptNet 5 contains almost all the data from ConceptNet 4, created by contributors to the Open Mind Common Sense project. We connect to a subset of DBPedia, which extracts knowledge from the infoboxes on Wikipedia articles. Much of our knowledge comes from Wiktionary, the free multilingual dictionary, a sister project to Wikipedia. This gives us information about synonyms, antonyms, translations of concepts into hundreds of languages, and multiple labeled word senses for many words. More dictionary-style knowledge comes from WordNet. UMBEL connects ConceptNet to the OpenCyc ontology via a Semantic Web representation. Some knowledge about people's intuitive word associations comes from "games with a purpose". We learn things in English from the GWAP project's word game Verbosity, and in Japanese from nadya.jp."

Silly input generator spotted.

Quote:
"ConceptNet supports linked data: you can download a list of links to the greater Semantic Web, via DBPedia, UMBEL, and RDF/OWL WordNet. For example, our concept cat is linked to the DBPedia node at http://dbpedia.org/resource/Cat."

The idea is to see if I can feed it to various procedural text generators, very simple semantic analysis, and a simple translation system.
EDIT: think "scribblenaut"
Dacke
« Reply #256 on: September 15, 2015, 09:11:05 AM »
I'm asking what you're trying to do at the moment, with the piece of code that you can't get working. You want to find all unique values while discarding the connections? Or what? This is the part I'm asking about:

Quote from: gimymblert
"The bad thing is that If I could use python and set, i would be able to put arg1 into a set, arg2 in another, and extract all roots (argument only exist in arg2) and leaves (argument only exist in arg1) in a single operations"
« Last Edit: September 15, 2015, 09:28:34 AM by Dacke »
gimymblert
« Reply #257 on: September 15, 2015, 04:56:39 PM »
Mmm, sorry! I lost the update when I edited my post... I didn't share the file I was working on, I shared the relation files. This is what I'm working on now: https://dl.dropboxusercontent.com/u/24530447/divers/dictionary.rar
It's all the arguments of all the files, one per line.
- I want to strip duplicates (using a set, then saving the set back into the file).
- Then, using that knowledge, manipulate some relations (the quote is about the "IsA" file) to find roots and leaves and store them in another file (when the structure is hierarchical).
I'm going step by step. Later I will need to replace all of an argument's elements by their indexes in the dictionary file. If possible, by then I would have code to "recompile" the data from scratch (using the database), to extend to all languages and future versions of conceptnet.
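A sketch of that indexing step on made-up lines (names are hypothetical, not the real dictionary file): deduplicate the arguments into a dictionary list, then replace each argument by its position in that list.

```python
# Made-up relation lines in the "arg1<tab>arg2" format.
lines = ["fox\tmammal", "dog\tmammal"]

# Deduplicated, sorted "dictionary" of all arguments.
dictionary = sorted({arg for line in lines for arg in line.split("\t")})
index = {arg: i for i, arg in enumerate(dictionary)}

# Each line becomes a tuple of indexes into the dictionary.
indexed = [tuple(index[a] for a in line.split("\t")) for line in lines]
print(dictionary)  # ['dog', 'fox', 'mammal']
print(indexed)     # [(1, 2), (0, 2)]
```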
valrus
« Reply #258 on: September 15, 2015, 09:27:20 PM »
In Python 3.x, try using the optional encoding argument of open; that usually gets me around all the trouble I used to have in 2.x with unicode encoding errors:

with open('utf8lines.txt', 'r', encoding="utf-8") as infile:
    with open('unique-lines.txt', 'w', encoding="utf-8") as outfile:
        unique_lines = set(line.strip() for line in infile)
        outfile.write("\n".join(unique_lines))

If you're in 2.x, you can get this version of open with "from io import open".
gimymblert
« Reply #259 on: September 15, 2015, 09:58:00 PM »
If you're talking about open('utf8lines.txt', 'r', encoding="utf-8"): yeah, I have done that. I'll try copy-pasting your code as verification, though.